Abstract. A published human RNA-seq run (SRR14253270, a GM12878 sample — the same individual,
HG001/NA12878, behind the NIST Genome-in-a-Bottle reference) was processed end-to-end in Haritica
— HISAT2 alignment, bcftools calling, and SnpEff annotation on chromosomes 20 and 21
— yielding 610 quality-passed variants (540 SNVs, 70 indels). Two independent anchors validate the
result. Against the externally-published NIST GIAB HG001 GRCh38 v4.2.1 small-variant truth set, restricted
to its high-confidence regions, the calls reach precision 0.53 (265 true positives, 235 false positives),
rising to 0.62 once the ADAR A>G / T>C RNA-editing signature — real RNA biology absent from a DNA
truth set — is set aside. Independently, Haritica computes the Ti/Tv ratio, the 12-class substitution
spectrum, and the depth and quality histograms in its own Python (cyvcf2), and bcftools computes
the same quantities in C: the two agree bit-for-bit (Ti/Tv 3.06 = 3.06; all twelve substitution
classes identical). SnpEff's consequence calls agree with bcftools csq on 66.8% of variants,
rising to 97.1% on the consequence classes both tools can emit.
The input is a single published Illumina paired-end RNA-seq run, SRR14253270, from GM12878 — the
lymphoblastoid cell line of individual NA12878, which NIST designates HG001 and characterises in its
Genome-in-a-Bottle reference materials. Within Haritica, reads were aligned to Ensembl GRCh38 (release 110)
with HISAT2, restricted to chromosomes 20 and 21, called with bcftools mpileup
(-q20 -Q20 -B) followed by bcftools call -mv, and annotated with SnpEff
(GRCh38.99). Of 2,882 raw variants, 610 passed quality control (540 SNVs, 70 indels; Ti/Tv 3.06).
HISAT2 is used because the data are short-read RNA-seq; the minimap2 default -ax splice preset
targets long reads.
| Parameter | Value |
|---|---|
| Input | GM12878 RNA-seq SRR14253270, Illumina paired-end, 8 M read pairs |
| Reference genome | Ensembl GRCh38 (release 110), prebuilt HISAT2 index |
| Chromosomes | chr20, chr21 |
| Aligner | HISAT2 (short-read spliced) |
| Variant calling | bcftools mpileup -q20 -Q20 -B → call -mv |
| Annotation | SnpEff GRCh38.99 |
| Truth set | GIAB HG001_GRCh38_1_22_v4.2.1_benchmark VCF + high-confidence BED |
| Result | 610 passed (540 SNV / 70 indel); Ti/Tv 3.06 |
The variant calls are scored against the NIST Genome-in-a-Bottle HG001 / NA12878 GRCh38 v4.2.1
small-variant benchmark [1], the canonical truth set for human variant calling, which Haritica did not
produce. Concordance is measured with standard tooling — bcftools norm followed by
bcftools isec, restricted to the benchmark's high-confidence regions — by the script
reference_concordance.py. Because the input is a real,
published RNA-seq run rather than simulated data, the comparison reflects a true end-to-end pipeline run.
The truth set is whole-genome DNA, so region-level recall is low by construction (a single RNA-seq run
samples only expressed exons); the informative quantity here is precision — what fraction of the
pipeline's calls are real.
Haritica computes the Ti/Tv ratio, the 12-class substitution spectrum, and the depth and quality
histograms in its own Python, parsing the VCF with cyvcf2. bcftools [2] computes the same
quantities independently in C. Bit-identical agreement therefore means two separate codebases, in two
languages, reach exactly the same answer on the same file; an exact integer match across all twelve
substitution classes is not attributable to a shared coincidental error. SnpEff's [3] consequence calls
are corroborated in the same spirit: bcftools csq annotates the identical VCF and the two
assignments are compared per variant by annotation_concordance.py.
One caveat, stated plainly: cyvcf2 uses HTSlib for low-level VCF parsing, as bcftools does,
so the independence is at the statistics and algorithm layer — which is the layer these figures
validate.
The substitution, depth, indel, and quality views read the pipeline's output VCF and are compared
against the same quantities computed independently by bcftools stats / plot-vcfstats
(Figures 1–4). These pairs validate the statistics and rendering rather than the calling itself;
the proof that the pipeline recovers real biology is the GIAB concordance in §3. The substitution
spectrum is bit-identical (A>G 109, C>T 104, G>A 100, T>C 94, …), with
the four transition bars dominating (Ti/Tv 3.06, elevated by coding SNVs and the ADAR A>G / T>C RNA-editing
signature). The impact and consequence distributions (Figures 5–6) render SnpEff's tiers and effect
table; SnpEff's native charts depend on the legacy Google Charts API, so there is no permissive plot to
overlay and the annotation engine is instead validated against bcftools csq (§4). The
oncoplot and rainfall views (Figure 7) are maftools-style visualizations shown for completeness:
maftools (GPL R) is the only canonical source and is excluded for licensing, and these are single-sample.
Haritica

bcftools plot-vcfstats — reference

plot-vcfstats (MIT): A>G 109,
C>T 104, G>A 100, T>C 94, …. The four transition bars dominate
(Ti/Tv 3.06).Haritica

bcftools stats — reference

Haritica

bcftools stats (IDD) — reference

Haritica

bcftools stats (QUAL) — reference

The impact donut (Figure 5) renders SnpEff's impact tiers on the 610 quality-passed variants, and the
consequence bar (Figure 6) renders SnpEff's effect table. SnpEff's native charts depend on the legacy
Google Charts API, so there is no permissive plot to overlay; the annotation engine is validated instead
against bcftools csq at 66.8% coarse-class agreement (§4). Tables 2 and 3 give the
underlying counts.
| Impact | Count | % |
|---|---|---|
| MODIFIER | 398 | 65.2 |
| LOW | 130 | 21.3 |
| MODERATE | 76 | 12.5 |
| HIGH | 6 | 1.0 |
| Total | 610 | 100 |
| Consequence (SnpEff) | Count |
|---|---|
| 3′ UTR variant | 296 |
| synonymous variant | 124 |
| missense variant | 74 |
| upstream gene variant | 43 |
| 5′ UTR variant | 23 |
| downstream gene variant | 15 |
| intron variant | 14 |
| splice region variant | 11 |
| frameshift variant | 2 |
Haritica

Haritica

Haritica's variant statistics reproduce bcftools stats bit-for-bit: the Ti/Tv ratio, the
transition and transversion totals, and the SNV count are identical, as is the full 12-class spectrum
(Figure 1). The calls themselves are scored against the NIST GIAB DNA truth set on chr20/21 within the
benchmark's high-confidence regions. Of 500 evaluable calls, 265 are true positives and 235 are false
positives, giving precision 0.53; excluding the ADAR A>G / T>C RNA-editing signature — real RNA
variants absent from a DNA truth set — raises precision to 0.62. A>G / T>C substitutions account
for 31.1% of the false positives, consistent with that signature. Region-level recall is low by construction
(RNA-seq samples only expressed exons of a whole-genome DNA truth), so precision is the informative metric;
moderate precision is expected for bcftools RNA-seq calling without RNA-aware filtering.
| Quantity | bcftools | Haritica |
|---|---|---|
| Ti/Tv ratio | 3.06 | 3.06 |
| Transitions | 407 | 407 |
| Transversions | 133 | 133 |
| Total SNVs | 540 | 540 |
| Indels | 70 | 70 |
| Metric | Value |
|---|---|
| True positives (in GIAB truth) | 265 |
| False positives (not in DNA truth) | 235 |
| Precision | 0.530 |
| Precision (excl. RNA-editing FPs) | 0.621 |
| A>G / T>C among FPs (ADAR signature) | 31.1% |
SnpEff's consequence calls are compared per variant against bcftools csq on the identical
VCF, both vocabularies mapped to coarse Sequence Ontology classes over the 2,393 variants both tools
annotate (Table 6). Coarse-class agreement is 1,599 / 2,393 = 66.8%, rising to 97.1%
once the two gene-proximal categories bcftools csq structurally cannot emit
(upstream_gene_variant, downstream_gene_variant) are set aside. On the consequences
that matter for interpreting RNA-seq variants — coding and UTR — the two tools agree almost
exactly: UTR 801/811, synonymous 196/197, missense 160/168, splice 40/45, frameshift 5/5, stop/start 4/4.
The residual disagreement is concentrated in one row. bcftools csq annotates only variants
inside a transcript (or at a splice site) and has no near-a-gene vocabulary, so a variant SnpEff labels as
gene-proximal that happens to fall inside some transcript is labelled intron by csq; this
accounts for most of csq's higher intron count (1,031 versus 362). The difference is one-directional and is
a labelling difference, not a calling error: of SnpEff's 362 introns, csq agrees on 355. SnpEff and
bcftools csq also use Ensembl annotations eleven releases apart (GRCh38.99 versus release 110),
so a position just outside a gene in the older model can sit inside a lengthened intron in the newer one.
SnpEff assigns impact tiers that bcftools csq does not (MODIFIER 2,437, LOW 241,
MODERATE 168, HIGH 28), and reported no chromosome-naming warnings.
| Coarse consequence class | SnpEff | bcftools csq | Agree |
|---|---|---|---|
| UTR (3′ / 5′) | 811 | 829 | 801 |
| intron | 362 | 1,031 | 355 |
| upstream gene | 495 | 0 | 0 |
| downstream gene | 252 | 0 | 0 |
| synonymous | 197 | 202 | 196 |
| missense / inframe | 168 | 169 | 160 |
| splice | 56 | 45 | 40 |
| non-coding transcript | 43 | 108 | 38 |
| frameshift | 5 | 5 | 5 |
| stop / start | 4 | 4 | 4 |
| Total (shared) | 2,393 | 2,393 | 1,599 |
All inputs and external references are public. Raw reads: ENA run SRR14253270 (GM12878 / HG001). Truth set: NIST Genome-in-a-Bottle HG001 GRCh38 v4.2.1 small-variant benchmark (NISTv4.2.1 GRCh38 release; program page). Reference genome: Ensembl GRCh38 (release 110), chromosomes 20 and 21. Reference tools, called but not implemented: bcftools and SnpEff. The independent concordance reference is generated by reference_concordance.py and the annotation comparison by annotation_concordance.py.