This directory compares gene structure annotations before and after manual
curation for the configured species in species.json.
The GitHub repository is intended to contain the analysis code, tests,
documentation, lightweight summary tables, and publication figures. Large raw
FASTA/GFF/GTF inputs and large per-gene logs are intentionally excluded by
.gitignore.
The single-species ABCD figure is the main visual summary for checking one species before and after manual curation:
- A. Quantity changes: global annotation counts, including genes, transcripts, exons, and CDS features.
- B. Locus fate: before/after gene accounting after reciprocal-overlap locus matching, including confirmed 1:1 pairs, split/merged loci, unresolved weak overlaps, and strict unmatched before/after genes.
- C. 1:1 structural attributes: structural changes among genes with confirmed before/after one-to-one correspondence.
- D. Paired change magnitude: per-pair delta distributions for gene span, transcript count, exon count, and CDS length.
Additional species figures: Artemisia annua, Cucumber, Fragaria ananassa, and Fragaria vesca.
Each species is expected to have one before and one after annotation file in
the analysis directory, each named with either a .gff or a .gff3 extension:
<species_id>.before.gff
<species_id>.before.gff3
<species_id>.after.gff
<species_id>.after.gff3
Compressed .gff.gz and .gff3.gz files are also supported by the Python
parsers. Derived files such as .tmap and .refmap are ignored when resolving
primary annotations.
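The naming rules above can be sketched as a small resolver. This is an illustrative sketch only, not the project's actual parser code: the function name `resolve_annotation`, the suffix order, and the choice to raise an error when more than one candidate exists are all assumptions.

```python
from pathlib import Path

# Suffixes named in the text above; the order here is an assumption.
PRIMARY_SUFFIXES = (".gff", ".gff3", ".gff.gz", ".gff3.gz")


def resolve_annotation(analysis_dir, species_id, stage):
    """Return the primary annotation path for one species and stage.

    stage is "before" or "after". Derived files such as .tmap and .refmap
    never match because only the exact suffixes above are considered.
    (Hypothetical helper, not part of the repository.)
    """
    base = Path(analysis_dir)
    candidates = [
        base / f"{species_id}.{stage}{suffix}" for suffix in PRIMARY_SUFFIXES
    ]
    found = [path for path in candidates if path.is_file()]
    if not found:
        raise FileNotFoundError(
            f"no {stage} annotation for {species_id} in {base}"
        )
    if len(found) > 1:
        # Assumption: more than one primary file is treated as an error.
        names = ", ".join(path.name for path in found)
        raise ValueError(
            f"ambiguous {stage} annotation for {species_id}: {names}"
        )
    return found[0]
```

Matching on exact file names rather than globbing keeps derived gffcompare outputs out of the candidate list without a separate exclusion rule.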
Raw input files are not committed to GitHub. To reproduce the analysis, either
place the input annotations in the project directory using the names above, or
set ANALYSIS_DIR=/path/to/data before running the commands below.
Install the locked environment with:
pixi install
The main dependencies are Python, pandas, numpy, matplotlib, seaborn, gffcompare, and bedtools. The current workflow does not require AGAT.
Run the optional external gffcompare step:
pixi run analyze
Build summary tables:
pixi run summarize
Run coordinate-based locus comparisons:
pixi run locus
The default locus scope is mrna, which excludes gene features without an
mRNA or transcript child. Transcripts must also contain an explicit exon
or CDS feature in the input GFF; mRNA-only or UTR-only records are filtered
before overlap matching because their raw spans cannot define reliable loci.
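The transcript filter described above could look roughly like the following. It is a minimal sketch that assumes GFF3-style `ID` and `Parent` attributes; the function name `transcripts_with_structure` is hypothetical and the project's real parser may differ.

```python
def transcripts_with_structure(gff_lines):
    """Return IDs of mRNA/transcript records that have an exon or CDS child.

    Hypothetical helper: mRNA-only and UTR-only transcripts are left out
    of the returned set, mirroring the filter described in the text.
    """
    transcripts = set()
    structured = set()
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue
        attrs = dict(
            field.split("=", 1) for field in cols[8].split(";") if "=" in field
        )
        feature = cols[2]
        if feature in ("mRNA", "transcript") and "ID" in attrs:
            transcripts.add(attrs["ID"])
        elif feature in ("exon", "CDS"):
            # A GFF3 Parent attribute may list several comma-separated parents.
            structured.update(attrs.get("Parent", "").split(","))
    return transcripts & structured
```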
The default overlap mode is hybrid: candidate gene pairs are scored by the
best before/after transcript pair, using exon-footprint overlap defined as
overlap / min(tx_length_before, tx_length_after), where tx_length is the
summed length of merged exons in one transcript. Complete or high-confidence
containment is therefore treated as strong locus evidence without letting all
isoforms of a gene inflate the denominator. The overlap graph also prunes weak
bridge edges when two independent strong one-to-one anchors already explain the
locus. Use --overlap-mode reciprocal or --overlap-mode containment only for
sensitivity checks or legacy comparisons.
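As a worked illustration of the hybrid score, the sketch below computes overlap / min(tx_length_before, tx_length_after) on merged exons and takes the best transcript pair per gene pair. The function names are hypothetical, coordinates are assumed to be 1-based inclusive as in GFF, and bridge-edge pruning is not covered.

```python
def merge_intervals(intervals):
    """Merge overlapping or adjacent 1-based inclusive (start, end) pairs."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]


def footprint_length(exons):
    """tx_length: summed length of the merged exons of one transcript."""
    return sum(end - start + 1 for start, end in merge_intervals(exons))


def shared_length(exons_a, exons_b):
    """Number of bases covered by both merged exon sets."""
    a, b = merge_intervals(exons_a), merge_intervals(exons_b)
    i = j = total = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo <= hi:
            total += hi - lo + 1
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total


def exon_footprint_overlap(exons_before, exons_after):
    """overlap / min(tx_length_before, tx_length_after)."""
    denom = min(footprint_length(exons_before), footprint_length(exons_after))
    return shared_length(exons_before, exons_after) / denom if denom else 0.0


def gene_pair_score(before_transcripts, after_transcripts):
    """Score a gene pair by its best before/after transcript pair."""
    return max(
        (
            exon_footprint_overlap(b, a)
            for b in before_transcripts
            for a in after_transcripts
        ),
        default=0.0,
    )
```

Because the denominator is the shorter transcript, a 50 bp after-transcript that sits entirely inside a 200 bp before-transcript scores 1.0, whereas the same pair would score only 0.25 if the longer transcript were the denominator.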
Generate final locus summary tables and figures:
pixi run tables
pixi run figures
pixi run target-figures
Generate A/B/C/D single-species summary tables and a four-panel figure:
python plot_single_species_abcd.py --species Pineapple
Export IGV-friendly event tracks from the locus change logs:
pixi run tracks
The track exporter reads results/locus/<species_id>_change_log.csv and writes
event-level BED, GFF3, and TSV review files under results/tracks/. Load the
BED track together with the before/after GFF3 annotations in IGV to inspect
whether each predicted locus or structural-change event is reasonable.
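For orientation, the coordinate conversion that any change-log-to-BED export has to perform can be sketched as follows. The column names `seqid`, `start`, `end`, `event`, and `strand` are assumptions made for this example and are not taken from the real change log schema; the function name is likewise hypothetical.

```python
import csv
import io


def change_log_to_bed(rows):
    """Yield BED6 lines from dict rows holding 1-based inclusive coordinates.

    Hypothetical sketch: the column names are assumed, not the project's.
    """
    for row in rows:
        start0 = int(row["start"]) - 1  # BED starts are 0-based
        end = int(row["end"])           # BED ends are exclusive, so unchanged
        strand = row.get("strand") or "."
        yield "\t".join(
            [row["seqid"], str(start0), str(end), row["event"], "0", strand]
        )


# Usage with an in-memory CSV that stands in for a change log.
example = io.StringIO("seqid,start,end,event,strand\nchr1,100,200,split,+\n")
for bed_line in change_log_to_bed(csv.DictReader(example)):
    print(bed_line)
```

GFF coordinates are 1-based and inclusive while BED is 0-based and half-open, so only the start shifts by one; an off-by-one here shows up in IGV as events misaligned with the annotation tracks.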
Single-species IGV track example:
python export_change_tracks.py --species Pineapple
Concrete single-species example using Pineapple:
# Rebuild the Pineapple ABCD tables and figure from existing summary/locus outputs.
python plot_single_species_abcd.py --species Pineapple
# Main figure:
ls figures/Pineapple_ABCD_single_species.png
# Supporting tables:
ls results/single_species/Pineapple_ABCD_tables.md
ls results/single_species/Pineapple_figureA_quantity_table.csv
ls results/single_species/Pineapple_figureB_locus_fate_table.csv
ls results/single_species/Pineapple_figureC_syntenic_structure_table.csv
ls results/single_species/Pineapple_figureD_pair_delta_summary.csv
Regenerate all tracked single-species ABCD figures:
for sp in Artemisia_annua Cucumber Fragaria_ananassa Fragaria_vesca Peach Pineapple Rice; do
    python plot_single_species_abcd.py --species "$sp"
done
For table and figure regeneration from existing gffcompare/locus outputs:
pixi run report
Run the complete workflow, including external tools and locus comparisons:
pixi run full
Validate cross-table consistency:
pixi run validate
- tcompare/: gffcompare outputs, generated locally and not tracked.
- results/: aggregate CSV/TSV tables and locus comparison logs.
- figures/: generated publication figures.
- logs/: full command logs from run_analyses.sh.
Important result tables:
- summary_stats.csv: GFF-derived coding gene-model statistics used by plots; exon/CDS/UTR and transcript-length metrics are computed from one representative transcript per gene.
- summary_stats_by_section.csv: long-form GFF-derived statistics using the same representative-transcript structural summary.
- comparison_matrix.csv: locus-derived compatibility summary with added/removed, split/merge, complex, and 1:1 counts.
- comparison_by_feature_path.csv: locus-derived compatibility table in the legacy feature-path shape.
- comparison_matrix_all_gene_types.csv: locus-derived compatibility summary matching comparison_matrix.csv.
- locus_comparison_summary.csv: mutually exclusive locus subtype summary; one syntenic gene contributes to one broad category.
- locus_comparison_multilabel.csv: non-exclusive locus subtype attributes; one syntenic gene can count in multiple columns.
- curation_core_metrics.csv: compact per-species table for the publication figure. The main metrics are no-overlap new/deleted loci, split/merge events, and the fraction of before/after genes whose strict 1:1 representative transcript has an exon change; representative-transcript union and CDS subcounts are retained as audit columns.
- locus_diagnostics.csv: overlap-mode diagnostics, including candidate pairs and weak bridge edges pruned between strong one-to-one anchors.
- validation_report.csv/.md: consistency checks across summary, compare, and locus outputs.
Large per-gene files such as results/locus/*_change_log.csv and
results/single_species/*_syntenic_pair_deltas.csv are generated locally but
excluded from GitHub. The tracked *_change_summary.csv, A/B/C/D summary
tables, and figures are sufficient for quick review.
Primary figures:
- figure1_quantity_changes.png: before/after quantity changes, including gene counts.
- figure2_syntenic_structure_changes.png: non-exclusive structural attributes for confirmed one-to-one gene pairs.
- curation_core_metrics_publication.png: three-panel cross-species summary of locus gain/loss, split/merge events, and representative-transcript exon changes as fractions of before/after annotations.
- {Species}_ABCD_single_species.png: per-species four-panel summary covering global quantities, locus fate, 1:1 structural attributes, and paired change magnitude.
Documentation:
- docs/locus_compare_calculation_and_results.html: compact calculation summary, latest results, and IGV validation notes.
- docs/locus_compare_algorithm.html: full locus-comparison algorithm notes.
- docs/curation_core_metrics_figure_algorithm.html: focused explanation of the current A/B/C core figure.
A JDK 11 / NetBeans Ant reimplementation lives in java-locus-compare/.
Generated Java outputs are excluded from GitHub; regenerate and validate them
locally with:
cd java-locus-compare
ant test
ant run-all
ant validate-python-parity
- Edit species.json to add, remove, or reorder species.
- Set ANALYSIS_DIR=/path/to/analysis or pass --analysis-dir DIR to run against another directory.
- Pass one or more species IDs to process a subset:
bash run_analyses.sh Rice Peach
python run_locus_comparisons.py Rice Peach
Run the lightweight parser tests:
pixi run test
The key consistency expectation is that all configured species appear in
summary_stats.csv, comparison_matrix.csv, accuracy_metrics.csv, locus
summary tables, and the generated figures.
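A check of this kind might be sketched as below. It is not the project's validation code: the helper name `missing_species`, the `species` column name, and the example path are assumptions, and the list of expected IDs is passed in directly because the layout of species.json is not described here.

```python
import csv


def missing_species(expected_ids, table_path, column="species"):
    """Return the expected species IDs that do not appear in a CSV table.

    Hypothetical helper; `column` is an assumed header name.
    """
    with open(table_path, newline="") as handle:
        present = {row[column] for row in csv.DictReader(handle)}
    return [species for species in expected_ids if species not in present]
```

For example, `missing_species(["Rice", "Peach"], "results/summary_stats.csv")` would return an empty list when both species have at least one row, assuming the table lives at that path and uses that column name.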


