Positive control · GSE52778 · human airway smooth muscle, dexamethasone vs untreated

Validation of Haritica pathway enrichment against R/Bioconductor clusterProfiler

Haritica over-representation analysis versus clusterProfiler enricher() on identical bundled gene sets
Dataset GSE52778 (SRP033351) · reference R/Bioconductor clusterProfiler 4.20.0 + enrichplot 1.32.0 · Homo sapiens GRCh38 · 2026

Abstract. A 992-gene differential-expression signature from human airway smooth muscle treated with dexamethasone (GSE52778; Himes et al. 2014) — 475 up- and 517 down-regulated genes against a 17,497-gene tested universe — was tested for over-representation in Haritica and independently re-analysed with R/Bioconductor clusterProfiler enricher() on the identical bundled gene sets. Across all six collections (GO BP/MF/CC, WikiPathways, Reactome, MSigDB Hallmark) the enriched term sets are identical (Jaccard = 1.0000) and per-term p-values are collinear (Pearson r = Spearman ρ = 1.0000). A three-way engine cross-check — Haritica's one-sided Fisher's exact test, the clusterProfiler hypergeometric test, and an offline scipy.stats.hypergeom regression — agrees to numerical precision (max |Δp| = 9.97×10−17). The recovered terms reproduce the published glucocorticoid program (hormone response, extracellular-matrix remodeling, vasculature development), and the same biology is rebuilt from raw FASTQ end to end on cloud infrastructure.

1Data and methods

The dataset (GSE52778 / SRP033351) comprises airway smooth muscle from four donors, each profiled untreated and after 18 h exposure to 1 µM dexamethasone: eight paired-end Illumina HiSeq 2000 runs (SRR1039508–521). The gene list under test is the 992 significant differentially-expressed genes (475 up / 517 down) called at adjusted p < 0.05 and |log2 fold-change|  > 1 from the canonical DESeq2 result, against a 17,497-gene tested universe. Within Haritica the list was tested for over-representation against the bundled gene-set collections using a one-sided Fisher's exact test with Benjamini–Hochberg correction; the reference was run with matched thresholds. The list contains textbook glucocorticoid responders (ZBTB16, KLF15, FKBP5, DUSP1, GPX3, PER1, CRISPLD2, SPARCL1).

Table 1. Analysis parameters for the Haritica run and the matched reference.
ParameterHariticaReference (clusterProfiler)
InputCSV gene-list mode; 992 airway symbolssame 992 symbols
TestFisher's exact, one-sided ("greater")hypergeometric (≡ Fisher greater)
Gene setsbundled GO BP/MF/CC, WikiPathways, Reactome, HallmarkTERM2GENE from the same bundle
Backgroundper-collection gene-set unionuniverse = NULL (set union)
Multiple testingBenjamini–Hochberg (fdr_bh)pAdjustMethod = "BH"
Size / overlap filtersmin overlap 3; max set 500; GO-simplify offminGSSize 3; maxGSSize 500
Displaytop 20 terms; six enrichplot-style plotsshowCategory = 20
Tested universe17,49717,497

1.1  The reference and engine cross-check

The reference per-term numbers and figures are produced by reference_enrichment.R, which calls R/Bioconductor clusterProfiler 4.20.0 enricher() and enrichplot 1.32.0 and adds no statistics of its own. The reference is fed the identical bundled gene sets and the identical 992-gene query, so the comparison isolates Haritica's enrichment engine and renderers from gene-universe differences.

The engine itself is cross-checked three independent ways: Haritica's one-sided Fisher's exact test equals the clusterProfiler hypergeometric test equals an offline scipy.stats.hypergeom regression, to numerical precision (max |Δp| = 9.97×10−17 over all 1,381 significant GO:BP terms). The one-sided Fisher's exact test is the hypergeometric upper tail, so on identical gene sets these are the same computation expressed three ways. This concordance validates the enrichment engine, not the gene-set content: both sides draw terms from Haritica's own bundled collections, so identical term sets are expected by construction. The content is anchored separately, by the biological recovery of the known glucocorticoid program (§2.1, §2.3).

2Results

2.1  Concordance

For each bundled collection the set of enriched terms, their overlap counts, and their p-values were compared term by term against clusterProfiler on the identical gene sets (Table 2). Across all six collections the enriched term sets are identical (Jaccard = 1.0000) and per-term p-values are collinear (Pearson = Spearman = 1.0000), with 100% exact agreement of overlap counts. The GO:BP scatter (Figure 1) places every one of 2,142 terms on the identity line. The recovered terms span the published glucocorticoid program: circulatory and vasculature development, extracellular-matrix organization, and the WikiPathways Glucocorticoid receptor pathway (WP2880).

Table 2. Per-collection concordance between Haritica and clusterProfiler on identical gene sets.
CollectionTermsJaccardp Pearsonp Spearmancount
GO:BP2,1421.00001.00001.0000100%
GO:MF3751.00001.00001.0000100%
GO:CC2621.00001.00001.0000100%
WikiPathways3491.00001.00001.0000100%
Reactome5231.00001.00001.0000100%
Hallmark451.00001.00001.0000100%
Per-term GO:BP p-value concordance scatter; points fall on the identity line
Figure 1. Per-term p-value concordance for GO:BP. Each point is one term: Haritica (horizontal) versus clusterProfiler (vertical). Every term lands on the identity line (Pearson r = 1.0000, n = 2,142). The app result equals its enrichment engine to floating point (max |Δp| = 9.97×10−17), pinned offline against an independent scipy.stats.hypergeom reference.

2.2  Plot parity

Each enrichment renderer was compared against its enrichplot reference on the same airway GO:BP gene set, with the airway DESeq2 log2 fold-changes overlaid for the heatmap and gene-network views. Both columns enrich the same 992-gene signature over the same bundled GO:BP sets, so every panel's term set, gene Count, and adjusted p-value match the reference term for term — for instance the leading term neuron projection morphogenesis (Count 44, adjusted p = 3.2×10−14) is identical on both sides. A plot is a drawing of already-validated numbers, so matching what enrichplot draws — given both draw the same validated numbers — is the appropriate non-circular check (Figures 2–8).

Haritica

Haritica bar plot

clusterProfiler — reference

clusterProfiler barplot
Figure 2. Bar plot. Bar length is gene Count (clusterProfiler's barplot default); fill encodes adjusted p-value on a linear scale, so the most-significant terms saturate and only the least-significant drift, matching the reference. Full term names; no value labels.

Haritica

Haritica dot plot

clusterProfiler — reference

clusterProfiler dotplot
Figure 3. Dot plot. GeneRatio on the x-axis (scaled tight to the data), dot size encodes gene Count, colour encodes adjusted p-value; the y-axis is ordered by GeneRatio (enrichplot's dotplot default), so the term order matches term for term.

Haritica

Haritica fold-change heatmap

clusterProfiler — reference

clusterProfiler heatplot
Figure 4. Gene-term heatmap. Genes on the x-axis, terms on the y-axis; tiles coloured by per-gene log2 fold-change on a diverging scale whose white midpoint is centred on zero over the asymmetric fold-change range, as in ggplot's scale_fill_gradient2. Gene labels are rendered vertically to keep the dense gene axis separable.

Haritica

Haritica tree plot

clusterProfiler — reference

clusterProfiler treeplot
Figure 5. Tree plot. A spanning dendrogram on Ward.D linkage with pale cluster bands and a per-leaf dot panel (size = Count, colour = adjusted p-value). The top GO:BP terms share so few genes that their raw merge heights are all near 1.0; rank-spreading the internal branch heights lets the tree span its region with visible clades rather than collapsing into a left-margin spike.

Haritica

Haritica UpSet plot

clusterProfiler — reference

clusterProfiler upsetplot
Figure 6. UpSet plot. Exclusive intersections (each gene counted once, by its exact term membership) form a descending power-law of bar heights with multi-degree connectors; the dot matrix is x-pinned under its bars, with alternating row shading and term-ID row labels. On the matched gene list the term set and intersections align with the reference.

Haritica

Haritica gene-concept network

clusterProfiler — reference

clusterProfiler cnetplot
Figure 7. Gene-concept network (cnet). A force-directed layout: term hubs are placed by their shared-gene similarity, so related terms cluster as in enrichplot's igraph layout, and each term's genes fan out around its hub. Gene nodes are coloured by log2 fold-change; hubs are sized by gene Count.

Haritica

Haritica enrichment map

clusterProfiler — reference

clusterProfiler emapplot
Figure 8. Enrichment map (emap). Nodes are terms, edges are Jaccard gene-set similarity, colour is −log10(q), size is gene Count. The near-complete GO:BP similarity clique is first reduced to its kNN similarity graph (each term keeps its strongest neighbours, as in enrichplot's pairwise_termsim), then laid out by Fruchterman–Reingold, separating the neuron-morphogenesis and vasculature clusters.

2.3  Reproducibility — cloud and end-to-end from FASTQ

The cloud worker runs the identical enrichment engine over the same bundled gene sets as the desktop application. Run on managed cloud batch infrastructure and compared collection by collection against the local in-app result, the term counts are identical (Table 3), including the WikiPathways hits Glucocorticoid receptor pathway (WP2880), Adipogenesis (WP236), and White fat cell differentiation (WP4149).

The whole pipeline was also run end to end from raw reads: the eight airway runs (SRR1039508–521, approximately 23 GB of FASTQ) were aligned with HISAT2 to GRCh38 (97–98% per sample), quantified with featureCounts, tested with pyDESeq2 (adjusted p < 0.05, |log2FC| > 1; 837 genes), and enriched with the bundled collections (651 terms). HISAT2 is used here rather than minimap2 because minimap2's spliced-alignment preset targets long reads; on short paired-end RNA-seq the splice-aware short-read aligner is appropriate. The cloud FASTQ run (837 genes) and the published gene list (992 genes) are two independent differential-expression analyses of the same experiment; on the 448 shared GO:BP terms their per-term significance correlates at Pearson r = 0.71 (Figure 9), the more conservative correlation expected from a smaller independent DE call rather than the engine-vs-engine r = 1.0 of §2.1. Ten of twelve canonical airway markers (CRISPLD2, KLF15, FKBP5, GPX3, PER1, ZBTB16, SPARCL1, TSC22D3, STEAP2, MAOA) are significant in both runs, and all four published functional axes (hormone/steroid response, vasculature, extracellular matrix, muscle/contraction) are recovered in the cloud GO:BP set.

Table 3. Enriched term counts, local desktop run versus cloud batch run, on the same gene list and gene sets.
CollectionLocal (in-app)Cloud (batch)
GO:BP1,3121,312
Reactome120120
WikiPathways7676
Hallmark99
Cloud-FASTQ versus published gene-list GO:BP p-value concordance scatter
Figure 9. Cross-pipeline GO:BP concordance. Per-term significance from the cloud raw-FASTQ run (837 genes) versus the published gene list (992 genes), over the 448 shared terms (Pearson r = 0.71). The same pathways rank together; the cloud run sits slightly below the diagonal, consistent with its smaller, more conservative independent DE call.

Cloud FASTQ — bar plot

Cloud bar plot

Cloud FASTQ — enrichment map

Cloud enrichment map
Figure 10. Representative plots rendered by the application from the completed cloud job, using the same renderers as the gene-list run. The bar plot leads with TNFA signaling via NFKB, vasculature development, angiogenesis, blood-vessel morphogenesis, the Nuclear receptors meta pathway (WP2882), and regulation of hormone levels — the glucocorticoid/vascular airway program, derived from raw reads.

3Data availability and references

All inputs are public. Raw reads: GEO series GSE52778 / SRA study SRP033351, runs SRR1039508SRR1039521 (ENA; eight paired-end Illumina HiSeq 2000 runs, approximately 23 GB). Processed counts are also available via the Bioconductor airway package. Reference genome: Ensembl human GRCh38. The independent reference result is generated by reference_enrichment.R; concordance metrics by concordance.py and cloud_fastq_concordance.py; the shared TERM2GENE bundle by build_term2gene_from_bundle.py.

  1. Wu T, Hu E, Xu S, Chen M, Guo P, Dai Z, et al. clusterProfiler 4.0: A universal enrichment tool for interpreting omics data. The Innovation 2021;2(3):100141.
  2. Himes BE, Jiang X, Wagner P, Hu R, Wang Q, Klanderman B, et al. RNA-Seq transcriptome profiling identifies CRISPLD2 as a glucocorticoid responsive gene that modulates cytokine function in airway smooth muscle cells. PLoS ONE 2014;9(6):e99625. PMID 24926665.