From bb93f1f9a824eef4ea5563f95028d5eb335b74a2 Mon Sep 17 00:00:00 2001
From: apollo994
Date: Fri, 15 May 2026 17:41:41 +0200
Subject: [PATCH] clear up readme

---
 CONTRIBUTING.md | 44 +++++++++++++++++++++-----------------------
 README.md       | 34 ++++------------------------------
 2 files changed, 25 insertions(+), 53 deletions(-)

diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index 68a9c5f..31cb880 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -19,32 +19,10 @@ Rules that matter for everyone:
 - **One row per URL** — the same `access_url` must not appear twice in the same `annotations.tsv`.
 - **One row per file content** — the same annotation file (same MD5 of the downloaded bytes) must not appear twice in the same TSV, and must not duplicate a file already listed in [`checksums/annotation_checksums.tsv`](checksums/annotation_checksums.tsv) under another project or assembly.
 - Each URL must be a real **`https://`** link to a **GFF3** file that our checks can open.
-
-### Annotrieve import (read before you submit)
-
-> **Disclaimer**
-> After your entry is merged here and flows through the Genome Annotation Tracker, Annotrieve imports community annotations into the live database. **If your GFF3 file content is identical to an annotation already present from NCBI or Ensembl (same MD5 checksum after Annotrieve’s processing), that community row is skipped during import** and will not show up as a separate annotation in the app.
-> The registry cannot accept duplicate files; Annotrieve will not publish them twice. Only submit assemblies and files that add **new** annotation content.
+- **Must add new content** — if your GFF3 matches an existing NCBI/Ensembl annotation (same MD5 after Annotrieve’s processing), it will be skipped on import. Only submit files that add **new** annotation content. See [section below](#md5-checksum-index) for more details on md5_checksum.
 
 ---
 
-## MD5 checksum index
-
-The repository keeps a **repo-wide** TSV of file fingerprints:
-
-| Column | Meaning |
-|--------|---------|
-| `md5_checksum` | MD5 of the **raw downloaded** GFF3 (plain or `.gz` bytes as fetched) |
-| `assembly_accession` | NCBI assembly accession for that row |
-| `repo_path` | Project folder (e.g. `my_lab_build`) |
-| `access_url` | HTTPS link stored in `annotations.tsv` |
-
-- **On pull requests:** new rows are downloaded and hashed during validation. Their MD5 is compared to other new rows in the PR and to the index on the **target branch**, so you get a clear error if the file was already merged elsewhere (including project path and URL in the message).
-- **On merge to `master` / `main`:** [`.github/workflows/update-checksums.yml`](.github/workflows/update-checksums.yml) syncs the index for changed projects: removes entries for deleted rows (or deleted `annotations.tsv` files) and appends checksums for newly merged rows only.
-
-You do not edit `checksums/annotation_checksums.tsv` by hand; it is maintained by automation.
-
----
 
 ## Contribute with a fork (works in the browser)
 
@@ -163,3 +141,23 @@ These environment variables only affect the validator when set (defaults are fin
 | `NCBI_API_KEY` | — | Optional; higher NCBI rate limit when set |
 
 Assembly checks use the **datasets** subprocess, not ad-hoc NCBI HTTP from Python. URL checks use a **single streaming GET** per row (no separate HEAD request).
+
+---
+
+## MD5 checksum index
+
+The repository keeps a **repo-wide** TSV of file fingerprints:
+
+| Column | Meaning |
+|--------|---------|
+| `md5_checksum` | MD5 of the **raw downloaded** GFF3 (plain or `.gz` bytes as fetched) |
+| `assembly_accession` | NCBI assembly accession for that row |
+| `repo_path` | Project folder (e.g. `my_lab_build`) |
+| `access_url` | HTTPS link stored in `annotations.tsv` |
+
+- **On pull requests:** new rows are downloaded and hashed during validation. Their MD5 is compared to other new rows in the PR and to the index on the **target branch**, so you get a clear error if the file was already merged elsewhere (including project path and URL in the message).
+- **On merge to `master` / `main`:** [`.github/workflows/update-checksums.yml`](.github/workflows/update-checksums.yml) syncs the index for changed projects: removes entries for deleted rows (or deleted `annotations.tsv` files) and appends checksums for newly merged rows only.
+
+You do not edit `checksums/annotation_checksums.tsv` by hand; it is maintained by automation.
+
+---
diff --git a/README.md b/README.md
index f3deb84..e8f1705 100644
--- a/README.md
+++ b/README.md
@@ -25,39 +25,13 @@ Together, these files describe “this assembly, this annotation file,” in a f
 
 See **[`CONTRIBUTING.md`](CONTRIBUTING.md)** for a step-by-step flow (fork → edit → pull request).
 
-### Repo-wide checksum index
-
-After entries are merged to the default branch, automation maintains a shared index:
-
-```text
-checksums/annotation_checksums.tsv
-```
-
-Each row records the **MD5 of the downloaded annotation file** (raw bytes as fetched from `access_url`), plus the assembly accession, project path, and URL. Pull-request validation uses this index to reject new rows whose file content is already registered under another project or assembly.
-
-- Index header: [`schema/annotation_checksums.header`](schema/annotation_checksums.header)
-- Kept in sync on push to `master` / `main` by [`.github/workflows/update-checksums.yml`](.github/workflows/update-checksums.yml) (adds new rows, removes entries for deleted TSV lines or projects)
-
 ## How it fits in the larger system
 
 After your changes are **merged here**, the **[Genome Annotation Tracker](https://github.com/guigolab/genome-annotation-tracker)** reads this registry, turns each project’s manifest + TSV into formatted rows, and adds them to the shared **community annotation table**. Those rows are published on **[Annotrieve](https://genome.crg.eu/annotrieve)** in periodic imports.
 
 ```text
-You (this repo)            Downstream                         App
-─────────────────          ───────────────────────────────    ───────────
-manifest.yaml      ──┐
-annotations.tsv    ──┼──►  genome-annotation-tracker   ──►    community TSV
-(project folders)    │     (merges + normalizes rows)  ──►    Annotrieve
-checksums/         ──┘     github.com/guigolab/
-annotation_                genome-annotation-tracker
-checksums.tsv
+You (this repo)         Downstream                         App
+─────────────────       ──────────────────────────────     ───────────
+manifest.yaml     ──►   genome-annotation-tracker    ──►   Annotrieve
+annotations.tsv         (community TSV)
 ```
-
-## Import into Annotrieve
-
-> **Disclaimer — duplicate file content**
-> Annotrieve identifies each annotation by an **MD5 checksum of the sorted, uncompressed GFF3** (the same content identity used for NCBI and Ensembl entries in the database).
-> **Community submissions whose file content matches an annotation already imported from NCBI or Ensembl (same MD5) are skipped during import** and will not appear as a separate community record, even if your registry PR passed validation.
-> Submit **distinct** annotation files (different assemblies and genuinely different GFF3 content). Re-hosting the same file under another URL or project folder does not create a second Annotrieve entry.
-
-Registry CI checks **downloaded file bytes**; Annotrieve’s import deduplication uses the **processed** checksum after sort/bgzip. In practice, identical biological content that is already in NCBI/Ensembl will be treated as a duplicate at import time.
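
Reviewer note: the patch condenses the distinction between the registry's raw-byte MD5 and Annotrieve's processed checksum into a single bullet, so a small illustration may help readers of this change. The Python sketch below is only an approximation under stated assumptions: `raw_md5` and `processed_md5` are hypothetical helper names, and a plain line-wise sort stands in for whatever sort order Annotrieve actually applies, which the patch does not specify.

```python
import gzip
import hashlib


def raw_md5(data: bytes) -> str:
    """MD5 of the bytes exactly as fetched (plain or .gz), as the registry index describes."""
    return hashlib.md5(data).hexdigest()


def processed_md5(data: bytes) -> str:
    """Hypothetical stand-in for a 'sorted, uncompressed' checksum.

    Assumption: a plain line-wise sort; the real Annotrieve processing may differ.
    """
    if data[:2] == b"\x1f\x8b":  # gzip magic number
        data = gzip.decompress(data)
    lines = sorted(data.splitlines())
    return hashlib.md5(b"\n".join(lines) + b"\n").hexdigest()


# Made-up two-line GFF3 payload, plain and gzip-compressed.
gff = b"##gff-version 3\nchr1\tsrc\tgene\t1\t100\t.\t+\t.\tID=g1\n"
gz = gzip.compress(gff)

print(raw_md5(gff) == raw_md5(gz))              # False: the fetched bytes differ
print(processed_md5(gff) == processed_md5(gz))  # True: same content after processing
```

Under these assumptions, a file can pass the registry's raw-byte duplicate check and still be treated as a duplicate at import time, which is the case the new "Must add new content" bullet warns about.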