Validation of Haritica pathway enrichment against R/Bioconductor clusterProfiler

Abstract. A 992-gene differential-expression signature from human airway smooth muscle treated with dexamethasone (GSE52778; Himes et al. 2014) — 475 up- and 517 down-regulated genes against a 17,497-gene tested universe — was tested for over-representation in Haritica and independently re-analysed with R/Bioconductor clusterProfiler enricher() on the identical bundled gene sets. Across all six collections (GO BP/MF/CC, WikiPathways, Reactome, MSigDB Hallmark) the enriched term sets are identical (Jaccard = 1.0000) and per-term p-values are collinear (Pearson r = Spearman ρ = 1.0000). A three-way engine cross-check — Haritica's one-sided Fisher's exact test, the clusterProfiler hypergeometric test, and an offline scipy.stats.hypergeom regression — agrees to numerical precision (max |Δp| = 9.97×10⁻¹⁷). The recovered terms reproduce the published glucocorticoid program (hormone response, extracellular-matrix remodeling, vasculature development), and the same biology is rebuilt from raw FASTQ end to end on cloud infrastructure.

1Data and methods

The dataset (GSE52778 / SRP033351) comprises airway smooth muscle from four donors, each profiled untreated and after 18 h exposure to 1 µM dexamethasone: eight paired-end Illumina HiSeq 2000 runs (SRR1039508–521). The gene list under test is the 992 significant differentially-expressed genes (475 up / 517 down) called at adjusted p < 0.05 and |log₂ fold-change| > 1 from the canonical DESeq2 result, against a 17,497-gene tested universe. Within Haritica the list was tested for over-representation against the bundled gene-set collections using a one-sided Fisher's exact test with Benjamini–Hochberg correction; the reference was run with matched thresholds. The list contains textbook glucocorticoid responders (ZBTB16, KLF15, FKBP5, DUSP1, GPX3, PER1, CRISPLD2, SPARCL1).

Table 1. Analysis parameters for the Haritica run and the matched reference.
Parameter	Haritica	Reference (clusterProfiler)
Input	CSV gene-list mode; 992 airway symbols	same 992 symbols
Test	Fisher's exact, one-sided ("greater")	hypergeometric (≡ Fisher greater)
Gene sets	bundled GO BP/MF/CC, WikiPathways, Reactome, Hallmark	`TERM2GENE` from the same bundle
Background	per-collection gene-set union	`universe = NULL` (set union)
Multiple testing	Benjamini–Hochberg (`fdr_bh`)	`pAdjustMethod = "BH"`
Size / overlap filters	min overlap 3; max set 500; GO-simplify off	`minGSSize 3; maxGSSize 500`
Display	top 20 terms; six enrichplot-style plots	`showCategory = 20`
Tested universe	17,497	17,497

1.1 The reference and engine cross-check

The reference per-term numbers and figures are produced by reference_enrichment.R, which calls R/Bioconductor clusterProfiler 4.20.0 enricher() and enrichplot 1.32.0 and adds no statistics of its own. The reference is fed the identical bundled gene sets and the identical 992-gene query, so the comparison isolates Haritica's enrichment engine and renderers from gene-universe differences.

The engine itself is cross-checked three independent ways: Haritica's one-sided Fisher's exact test equals the clusterProfiler hypergeometric test equals an offline scipy.stats.hypergeom regression, to numerical precision (max |Δp| = 9.97×10⁻¹⁷ over all 1,381 significant GO:BP terms). The one-sided Fisher's exact test is the hypergeometric upper tail, so on identical gene sets these are the same computation expressed three ways. This concordance validates the enrichment engine, not the gene-set content: both sides draw terms from Haritica's own bundled collections, so identical term sets are expected by construction. The content is anchored separately, by the biological recovery of the known glucocorticoid program (§2.1, §2.3).

2Results

2.1 Concordance

For each bundled collection the set of enriched terms, their overlap counts, and their p-values were compared term by term against clusterProfiler on the identical gene sets (Table 2). Across all six collections the enriched term sets are identical (Jaccard = 1.0000) and per-term p-values are collinear (Pearson = Spearman = 1.0000), with 100% exact agreement of overlap counts. The GO:BP scatter (Figure 1) places every one of 2,142 terms on the identity line. The recovered terms span the published glucocorticoid program: circulatory and vasculature development, extracellular-matrix organization, and the WikiPathways Glucocorticoid receptor pathway (WP2880).

Table 2. Per-collection concordance between Haritica and clusterProfiler on identical gene sets.
Collection	Terms	Jaccard	p Pearson	p Spearman	count
GO:BP	2,142	1.0000	1.0000	1.0000	100%
GO:MF	375	1.0000	1.0000	1.0000	100%
GO:CC	262	1.0000	1.0000	1.0000	100%
WikiPathways	349	1.0000	1.0000	1.0000	100%
Reactome	523	1.0000	1.0000	1.0000	100%
Hallmark	45	1.0000	1.0000	1.0000	100%

Per-term GO:BP p-value concordance scatter; points fall on the identity line — Figure 1. Per-term p-value concordance for GO:BP. Each point is one term: Haritica (horizontal) versus clusterProfiler (vertical). Every term lands on the identity line (Pearson r = 1.0000, n = 2,142). The app result equals its enrichment engine to floating point (max |Δp| = 9.97×10⁻¹⁷), pinned offline against an independent `scipy.stats.hypergeom` reference.

2.2 Plot parity

Each enrichment renderer was compared against its enrichplot reference on the same airway GO:BP gene set, with the airway DESeq2 log₂ fold-changes overlaid for the heatmap and gene-network views. Both columns enrich the same 992-gene signature over the same bundled GO:BP sets, so every panel's term set, gene Count, and adjusted p-value match the reference term for term — for instance the leading term neuron projection morphogenesis (Count 44, adjusted p = 3.2×10⁻¹⁴) is identical on both sides. A plot is a drawing of already-validated numbers, so matching what enrichplot draws — given both draw the same validated numbers — is the appropriate non-circular check (Figures 2–8).

Haritica bar plot — Figure 2. Bar plot. Bar length is gene Count (clusterProfiler's `barplot` default); fill encodes adjusted p-value on a linear scale, so the most-significant terms saturate and only the least-significant drift, matching the reference. Full term names; no value labels.

clusterProfiler barplot — Figure 2. Bar plot. Bar length is gene Count (clusterProfiler's `barplot` default); fill encodes adjusted p-value on a linear scale, so the most-significant terms saturate and only the least-significant drift, matching the reference. Full term names; no value labels.

Haritica dot plot — Figure 3. Dot plot. GeneRatio on the x-axis (scaled tight to the data), dot size encodes gene Count, colour encodes adjusted p-value; the y-axis is ordered by GeneRatio (enrichplot's `dotplot` default), so the term order matches term for term.

clusterProfiler dotplot — Figure 3. Dot plot. GeneRatio on the x-axis (scaled tight to the data), dot size encodes gene Count, colour encodes adjusted p-value; the y-axis is ordered by GeneRatio (enrichplot's `dotplot` default), so the term order matches term for term.

Haritica fold-change heatmap — Figure 4. Gene-term heatmap. Genes on the x-axis, terms on the y-axis; tiles coloured by per-gene log₂ fold-change on a diverging scale whose white midpoint is centred on zero over the asymmetric fold-change range, as in ggplot's `scale_fill_gradient2`. Gene labels are rendered vertically to keep the dense gene axis separable.

clusterProfiler heatplot — Figure 4. Gene-term heatmap. Genes on the x-axis, terms on the y-axis; tiles coloured by per-gene log₂ fold-change on a diverging scale whose white midpoint is centred on zero over the asymmetric fold-change range, as in ggplot's `scale_fill_gradient2`. Gene labels are rendered vertically to keep the dense gene axis separable.

Haritica tree plot — Figure 5. Tree plot. A spanning dendrogram on Ward.D linkage with pale cluster bands and a per-leaf dot panel (size = Count, colour = adjusted p-value). The top GO:BP terms share so few genes that their raw merge heights are all near 1.0; rank-spreading the internal branch heights lets the tree span its region with visible clades rather than collapsing into a left-margin spike.

clusterProfiler treeplot — Figure 5. Tree plot. A spanning dendrogram on Ward.D linkage with pale cluster bands and a per-leaf dot panel (size = Count, colour = adjusted p-value). The top GO:BP terms share so few genes that their raw merge heights are all near 1.0; rank-spreading the internal branch heights lets the tree span its region with visible clades rather than collapsing into a left-margin spike.

Haritica UpSet plot — Figure 6. UpSet plot. Exclusive intersections (each gene counted once, by its exact term membership) form a descending power-law of bar heights with multi-degree connectors; the dot matrix is x-pinned under its bars, with alternating row shading and term-ID row labels. On the matched gene list the term set and intersections align with the reference.

clusterProfiler upsetplot — Figure 6. UpSet plot. Exclusive intersections (each gene counted once, by its exact term membership) form a descending power-law of bar heights with multi-degree connectors; the dot matrix is x-pinned under its bars, with alternating row shading and term-ID row labels. On the matched gene list the term set and intersections align with the reference.

Haritica gene-concept network — Figure 7. Gene-concept network (cnet). A force-directed layout: term hubs are placed by their shared-gene similarity, so related terms cluster as in enrichplot's igraph layout, and each term's genes fan out around its hub. Gene nodes are coloured by log₂ fold-change; hubs are sized by gene Count.

clusterProfiler cnetplot — Figure 7. Gene-concept network (cnet). A force-directed layout: term hubs are placed by their shared-gene similarity, so related terms cluster as in enrichplot's igraph layout, and each term's genes fan out around its hub. Gene nodes are coloured by log₂ fold-change; hubs are sized by gene Count.

Haritica enrichment map — Figure 8. Enrichment map (emap). Nodes are terms, edges are Jaccard gene-set similarity, colour is −log₁₀(q), size is gene Count. The near-complete GO:BP similarity clique is first reduced to its kNN similarity graph (each term keeps its strongest neighbours, as in enrichplot's `pairwise_termsim`), then laid out by Fruchterman–Reingold, separating the neuron-morphogenesis and vasculature clusters.

clusterProfiler emapplot — Figure 8. Enrichment map (emap). Nodes are terms, edges are Jaccard gene-set similarity, colour is −log₁₀(q), size is gene Count. The near-complete GO:BP similarity clique is first reduced to its kNN similarity graph (each term keeps its strongest neighbours, as in enrichplot's `pairwise_termsim`), then laid out by Fruchterman–Reingold, separating the neuron-morphogenesis and vasculature clusters.

2.3 Reproducibility — cloud and end-to-end from FASTQ

The cloud worker runs the identical enrichment engine over the same bundled gene sets as the desktop application. Run on managed cloud batch infrastructure and compared collection by collection against the local in-app result, the term counts are identical (Table 3), including the WikiPathways hits Glucocorticoid receptor pathway (WP2880), Adipogenesis (WP236), and White fat cell differentiation (WP4149).

The whole pipeline was also run end to end from raw reads: the eight airway runs (SRR1039508–521, approximately 23 GB of FASTQ) were aligned with HISAT2 to GRCh38 (97–98% per sample), quantified with featureCounts, tested with pyDESeq2 (adjusted p < 0.05, |log₂FC| > 1; 837 genes), and enriched with the bundled collections (651 terms). HISAT2 is used here rather than minimap2 because minimap2's spliced-alignment preset targets long reads; on short paired-end RNA-seq the splice-aware short-read aligner is appropriate. The cloud FASTQ run (837 genes) and the published gene list (992 genes) are two independent differential-expression analyses of the same experiment; on the 448 shared GO:BP terms their per-term significance correlates at Pearson r = 0.71 (Figure 9), the more conservative correlation expected from a smaller independent DE call rather than the engine-vs-engine r = 1.0 of §2.1. Ten of twelve canonical airway markers (CRISPLD2, KLF15, FKBP5, GPX3, PER1, ZBTB16, SPARCL1, TSC22D3, STEAP2, MAOA) are significant in both runs, and all four published functional axes (hormone/steroid response, vasculature, extracellular matrix, muscle/contraction) are recovered in the cloud GO:BP set.

Table 3. Enriched term counts, local desktop run versus cloud batch run, on the same gene list and gene sets.
Collection	Local (in-app)	Cloud (batch)
GO:BP	1,312	1,312
Reactome	120	120
WikiPathways	76	76
Hallmark	9	9

Cloud-FASTQ versus published gene-list GO:BP p-value concordance scatter — Figure 9. Cross-pipeline GO:BP concordance. Per-term significance from the cloud raw-FASTQ run (837 genes) versus the published gene list (992 genes), over the 448 shared terms (Pearson r = 0.71). The same pathways rank together; the cloud run sits slightly below the diagonal, consistent with its smaller, more conservative independent DE call.

Cloud bar plot — Figure 10. Representative plots rendered by the application from the completed cloud job, using the same renderers as the gene-list run. The bar plot leads with *TNFA signaling via NFKB*, *vasculature development*, *angiogenesis*, *blood-vessel morphogenesis*, the *Nuclear receptors meta pathway* (WP2882), and *regulation of hormone levels* — the glucocorticoid/vascular airway program, derived from raw reads.