Abstract. A 992-gene differential-expression signature from human airway smooth muscle treated with
dexamethasone (GSE52778; Himes et al. 2014) — 475 up- and 517 down-regulated genes against a
17,497-gene tested universe — was tested for over-representation in Haritica and independently re-analysed
with R/Bioconductor clusterProfiler enricher() on the identical bundled gene sets. Across all six
collections (GO BP/MF/CC, WikiPathways, Reactome, MSigDB Hallmark) the enriched term sets are identical
(Jaccard = 1.0000) and per-term p-values are collinear (Pearson r = Spearman
ρ = 1.0000). A three-way engine cross-check — Haritica's one-sided Fisher's exact test,
the clusterProfiler hypergeometric test, and an offline scipy.stats.hypergeom regression — agrees
to numerical precision (max |Δp| = 9.97×10⁻¹⁷). The recovered
terms reproduce the published glucocorticoid program (hormone response, extracellular-matrix remodeling,
vasculature development), and the same biology is rebuilt from raw FASTQ end to end on cloud infrastructure.
The dataset (GSE52778 / SRP033351) comprises airway smooth muscle from four donors, each profiled untreated and after 18 h exposure to 1 µM dexamethasone, giving eight paired-end Illumina HiSeq 2000 runs within the accession range SRR1039508–521. The gene list under test is the 992 significantly differentially expressed genes (475 up / 517 down) called at adjusted p < 0.05 and |log2 fold-change| > 1 from the canonical DESeq2 result, against a 17,497-gene tested universe. Within Haritica the list was tested for over-representation against the bundled gene-set collections using a one-sided Fisher's exact test with Benjamini–Hochberg correction; the clusterProfiler reference was run with matched thresholds (Table 1). The list contains textbook glucocorticoid responders (ZBTB16, KLF15, FKBP5, DUSP1, GPX3, PER1, CRISPLD2, SPARCL1).
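The two thresholds that define the signature (adjusted p < 0.05 and |log2 fold-change| > 1) can be illustrated with a minimal sketch. The `DEResult` record, the `select_signature` helper, and the toy rows are hypothetical and exist only for illustration; they are not Haritica's or DESeq2's API, and the values are made up.

```python
from typing import Iterable, List, NamedTuple, Tuple


class DEResult(NamedTuple):
    """One row of a differential-expression table (hypothetical layout)."""
    gene: str
    log2fc: float
    padj: float  # adjusted p-value; NaN for genes that were not tested


def select_signature(
    results: Iterable[DEResult],
    padj_cut: float = 0.05,
    lfc_cut: float = 1.0,
) -> Tuple[List[str], List[str]]:
    """Split significant genes into up- and down-regulated lists."""
    up, down = [], []
    for r in results:
        # Comparisons with NaN are False, so untested genes are skipped.
        if not (r.padj < padj_cut and abs(r.log2fc) > lfc_cut):
            continue
        (up if r.log2fc > 0 else down).append(r.gene)
    return up, down


# Toy rows with invented values, not taken from the airway dataset.
toy = [
    DEResult("GENE_A", 3.2, 1e-10),        # kept, up
    DEResult("GENE_B", -1.8, 2e-4),        # kept, down
    DEResult("GENE_C", 0.4, 1e-6),         # significant but small effect
    DEResult("GENE_D", 2.5, 0.30),         # large effect but not significant
    DEResult("GENE_E", 1.5, float("nan")), # no adjusted p-value
]
up, down = select_signature(toy)
```

Applied to the canonical DESeq2 table, a filter of this shape is what would yield the 475 up / 517 down split described above.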
| Parameter | Haritica | Reference (clusterProfiler) |
|---|---|---|
| Input | CSV gene-list mode; 992 airway symbols | same 992 symbols |
| Test | Fisher's exact, one-sided ("greater") | hypergeometric (≡ Fisher greater) |
| Gene sets | bundled GO BP/MF/CC, WikiPathways, Reactome, Hallmark | TERM2GENE from the same bundle |
| Background | per-collection gene-set union | universe = NULL (set union) |
| Multiple testing | Benjamini–Hochberg (fdr_bh) | pAdjustMethod = "BH" |
| Size / overlap filters | min overlap 3; max set 500; GO-simplify off | minGSSize 3; maxGSSize 500 |
| Display | top 20 terms; six enrichplot-style plots | showCategory = 20 |
| Tested universe | 17,497 | 17,497 |
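The Haritica column of the table can be read as a small over-representation procedure. The sketch below is an illustration under stated assumptions, not Haritica's implementation: the function names `bh_adjust` and `enrich` are hypothetical, the background is taken as the union of all genes in the collection, and Benjamini–Hochberg is applied to the terms that survive the overlap and size filters (the source does not say at which point the correction is applied).

```python
import numpy as np
from scipy.stats import hypergeom


def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working down from the largest p-value.
    adjusted = np.minimum.accumulate(scaled[::-1])[::-1]
    out = np.empty(m)
    out[order] = np.clip(adjusted, 0.0, 1.0)
    return out


def enrich(query, gene_sets, min_overlap=3, max_set_size=500):
    """One-sided over-representation test against a dict of term -> genes."""
    # Assumption: background = union of every gene in the collection.
    background = set().union(*gene_sets.values())
    q = set(query) & background
    rows = []
    for term, genes in gene_sets.items():
        genes = set(genes)
        if len(genes) > max_set_size:
            continue
        k = len(q & genes)
        if k < min_overlap:
            continue
        # Upper tail P(X >= k) of the hypergeometric distribution.
        p = hypergeom.sf(k - 1, len(background), len(genes), len(q))
        rows.append((term, k, float(p)))
    if not rows:
        return []
    padj = bh_adjust([r[2] for r in rows])
    out = [(t, k, p, float(a)) for (t, k, p), a in zip(rows, padj)]
    return sorted(out, key=lambda r: r[2])


# Toy collection with placeholder gene names.
gene_sets = {
    "T1": {"a", "b", "c", "d"},
    "T2": {"c", "d", "e", "f", "g"},
    "T3": {"h", "i", "j", "k", "l", "m"},
}
result = enrich(["a", "b", "c", "x"], gene_sets)
```

In the toy example only `T1` reaches the minimum overlap of 3, and the query gene `x` is dropped because it lies outside the background.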
The reference per-term numbers and figures are produced by
reference_enrichment.R, which calls R/Bioconductor clusterProfiler
4.20.0 enricher() and enrichplot 1.32.0 and adds no statistics of its own. The reference is fed the
identical bundled gene sets and the identical 992-gene query, so the comparison isolates Haritica's enrichment
engine and renderers from gene-universe differences.
The engine itself is cross-checked in three independent ways: Haritica's one-sided Fisher's exact test, the
clusterProfiler hypergeometric test, and an offline scipy.stats.hypergeom regression all agree to numerical
precision (max |Δp| = 9.97×10⁻¹⁷ over all 1,381 significant GO:BP
terms). The one-sided Fisher's exact test is the hypergeometric upper tail, so on identical gene sets these are the
same computation expressed three ways. This concordance validates the enrichment engine, not the gene-set
content: both sides draw terms from Haritica's own bundled collections, so identical term sets are
expected by construction. The content is anchored separately, by the biological recovery of the known glucocorticoid
program (§2.1, §2.3).
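The equivalence between the one-sided Fisher's exact test and the hypergeometric upper tail can be checked directly in SciPy. This is a minimal sketch, not the project's regression script: the helper name `ora_pvalues` is hypothetical, the query size (992) and universe (17,497) are taken from the text, and the set size and overlap are illustrative values chosen for the example.

```python
from scipy.stats import fisher_exact, hypergeom


def ora_pvalues(overlap, set_size, query_size, universe):
    """Return (Fisher one-sided p, hypergeometric upper-tail p) for one term."""
    # 2x2 contingency table: rows = in query / not in query,
    # columns = in gene set / not in gene set.
    table = [
        [overlap, query_size - overlap],
        [set_size - overlap, universe - query_size - set_size + overlap],
    ]
    _, p_fisher = fisher_exact(table, alternative="greater")
    # P(X >= overlap) when drawing query_size genes from the universe.
    p_hyper = hypergeom.sf(overlap - 1, universe, set_size, query_size)
    return float(p_fisher), float(p_hyper)


# set_size and overlap are illustrative, not values from the analysis.
p_fisher, p_hyper = ora_pvalues(overlap=30, set_size=200,
                                query_size=992, universe=17497)
delta = abs(p_fisher - p_hyper)
```

Any remaining `delta` is floating-point rounding, which is the kind of difference the max |Δp| figure above quantifies.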
For each bundled collection the set of enriched terms, their overlap counts, and their p-values were compared term by term against clusterProfiler on the identical gene sets (Table 2). Across all six collections the enriched term sets are identical (Jaccard = 1.0000) and per-term p-values are collinear (Pearson r = Spearman ρ = 1.0000), with 100% exact agreement of overlap counts. The GO:BP scatter (Figure 1) places every one of 2,142 terms on the identity line. The recovered terms span the published glucocorticoid program: circulatory and vasculature development, extracellular-matrix organization, and the WikiPathways Glucocorticoid receptor pathway (WP2880).
| Collection | Terms | Jaccard | Pearson r (p-values) | Spearman ρ (p-values) | Overlap-count agreement |
|---|---|---|---|---|---|
| GO:BP | 2,142 | 1.0000 | 1.0000 | 1.0000 | 100% |
| GO:MF | 375 | 1.0000 | 1.0000 | 1.0000 | 100% |
| GO:CC | 262 | 1.0000 | 1.0000 | 1.0000 | 100% |
| WikiPathways | 349 | 1.0000 | 1.0000 | 1.0000 | 100% |
| Reactome | 523 | 1.0000 | 1.0000 | 1.0000 | 100% |
| Hallmark | 45 | 1.0000 | 1.0000 | 1.0000 | 100% |
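The columns of the table above correspond to a handful of per-collection metrics. The sketch below shows one way they could be computed; it is an illustration, not the contents of concordance.py. The function name `concordance`, the `term -> (overlap_count, p_value)` input layout, and the toy values are assumptions, and the correlations are taken on raw p-values (a log transform would be an equally plausible choice).

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr


def concordance(ours, ref):
    """Compare two results given as dict: term -> (overlap_count, p_value)."""
    a, b = set(ours), set(ref)
    union = a | b
    jaccard = len(a & b) / len(union) if union else 1.0
    shared = sorted(a & b)
    p_a = np.array([ours[t][1] for t in shared], dtype=float)
    p_b = np.array([ref[t][1] for t in shared], dtype=float)
    counts_equal = [ours[t][0] == ref[t][0] for t in shared]
    return {
        "jaccard": jaccard,
        # Correlations need at least two shared terms with varying p-values.
        "pearson": float(pearsonr(p_a, p_b)[0]),
        "spearman": float(spearmanr(p_a, p_b)[0]),
        "count_agreement": float(np.mean(counts_equal)),
        "max_abs_dp": float(np.max(np.abs(p_a - p_b))),
    }


# Toy results with invented values.
ours = {"T1": (5, 1e-6), "T2": (3, 0.01), "T3": (4, 0.2)}
same = concordance(ours, dict(ours))
# A reference with one extra term lowers the Jaccard index to 3/4.
extra = concordance(ours, {**ours, "T4": (3, 0.03)})
```

A Jaccard index of 1.0 together with `max_abs_dp` near machine precision is the pattern reported for all six collections.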
Each enrichment renderer was compared against its enrichplot reference on the same airway GO:BP gene set, with the airway DESeq2 log2 fold-changes overlaid for the heatmap and gene-network views. Both columns enrich the same 992-gene signature over the same bundled GO:BP sets, so every panel's term set, gene Count, and adjusted p-value match the reference term for term; for instance, the leading term neuron projection morphogenesis (Count 44, adjusted p = 3.2×10⁻¹⁴) is identical on both sides. A plot is a drawing of already-validated numbers, so once both sides draw the same validated numbers, matching what enrichplot draws is the appropriate non-circular check (Figures 2–8).
[Figure panels, Haritica vs. clusterProfiler reference: bar plot.] …barplot default); fill encodes adjusted p-value on a linear scale, so the most-significant terms saturate and only the least-significant drift, matching the reference. Full term names; no value labels.

[Figure panels, Haritica vs. clusterProfiler reference: dot plot.] …dotplot default), so the term order matches term for term.

[Figure panels, Haritica vs. clusterProfiler reference.] …scale_fill_gradient2. Gene labels are rendered vertically to keep the dense gene axis separable.

[Three further Haritica vs. clusterProfiler reference panel pairs; captions not recovered.]

[Figure panels, Haritica vs. clusterProfiler reference.] …pairwise_termsim), then laid out by Fruchterman–Reingold, separating the neuron-morphogenesis and vasculature clusters.

The cloud worker runs the identical enrichment engine over the same bundled gene sets as the desktop application. When it is run on managed cloud batch infrastructure and compared collection by collection against the local in-app result, the term counts are identical (Table 3), and the WikiPathways hits include Glucocorticoid receptor pathway (WP2880), Adipogenesis (WP236), and White fat cell differentiation (WP4149).
The whole pipeline was also run end to end from raw reads: the eight airway runs (within SRR1039508–521, approximately 23 GB of FASTQ) were aligned with HISAT2 to GRCh38 (97–98% alignment rate per sample), quantified with featureCounts, tested with pyDESeq2 (adjusted p < 0.05, |log2FC| > 1; 837 genes), and enriched with the bundled collections (651 terms). HISAT2 is used here rather than minimap2 because minimap2's spliced-alignment preset targets long reads; on short paired-end RNA-seq the splice-aware short-read aligner is appropriate. The cloud FASTQ run (837 genes) and the published gene list (992 genes) are two independent differential-expression analyses of the same experiment; on the 448 shared GO:BP terms their per-term significance correlates at Pearson r = 0.71 (Figure 9), the more conservative correlation expected from a smaller independent DE call rather than the engine-vs-engine r = 1.0 of §2.1. Ten of twelve canonical airway markers (CRISPLD2, KLF15, FKBP5, GPX3, PER1, ZBTB16, SPARCL1, TSC22D3, STEAP2, MAOA) are significant in both runs, and all four published functional axes (hormone/steroid response, vasculature, extracellular matrix, muscle/contraction) are recovered in the cloud GO:BP set.
| Collection | Local (in-app) | Cloud (batch) |
|---|---|---|
| GO:BP | 1,312 | 1,312 |
| Reactome | 120 | 120 |
| WikiPathways | 76 | 76 |
| Hallmark | 9 | 9 |
[Figure panels: Cloud FASTQ bar plot; Cloud FASTQ enrichment map.]

All inputs are public. Raw reads: GEO series GSE52778 / SRA study SRP033351, runs SRR1039508–SRR1039521 (ENA; eight paired-end Illumina HiSeq 2000 runs, approximately 23 GB). Processed counts are also available via the Bioconductor airway package. Reference genome: Ensembl human GRCh38. The independent reference result is generated by reference_enrichment.R; concordance metrics by concordance.py and cloud_fastq_concordance.py; the shared TERM2GENE bundle by build_term2gene_from_bundle.py.