From bb93f1f9a824eef4ea5563f95028d5eb335b74a2 Mon Sep 17 00:00:00 2001
From: apollo994 <fabio.zanarello.94@gmail.com>
Date: Fri, 15 May 2026 17:41:41 +0200
Subject: [PATCH] clear up readme

---
 CONTRIBUTING.md | 44 +++++++++++++++++++++-----------------------
 README.md       | 34 ++++------------------------------
 2 files changed, 25 insertions(+), 53 deletions(-)
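
Note for reviewers: the new CONTRIBUTING bullet mentions the MD5 "after Annotrieve’s processing", while the MD5 checksum index section describes the MD5 of the raw downloaded bytes (plain or `.gz` as fetched). The sketch below is only an illustration of why those two fingerprints can differ; it is not the validator's or Annotrieve's actual code, and the sample GFF3 payload and the helper name `md5_of_bytes` are hypothetical.

```python
import gzip
import hashlib


def md5_of_bytes(data: bytes) -> str:
    """Return the hex MD5 digest of a byte string exactly as given."""
    return hashlib.md5(data).hexdigest()


# Hypothetical one-feature GFF3 payload, used only for illustration.
gff3_plain = b"##gff-version 3\nchr1\tdemo\tgene\t1\t100\t.\t+\t.\tID=gene1\n"
# The same content as it might be served from a .gz URL.
gff3_gzipped = gzip.compress(gff3_plain, mtime=0)

raw_plain = md5_of_bytes(gff3_plain)
raw_gz = md5_of_bytes(gff3_gzipped)

# Same annotation, different bytes as fetched, so the raw fingerprints differ.
print(raw_plain != raw_gz)
# Hashing the decompressed content recovers the plain-file fingerprint.
print(md5_of_bytes(gzip.decompress(gff3_gzipped)) == raw_plain)
```

Under these assumptions, a file can be new to a raw-bytes index and still count as a duplicate once its content is normalized before hashing, which is the case the new bullet warns about.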

diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index 68a9c5f..31cb880 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -19,32 +19,10 @@ Rules that matter for everyone:
 - **One row per URL** — the same `access_url` must not appear twice in the same `annotations.tsv`.
 - **One row per file content** — the same annotation file (same MD5 of the downloaded bytes) must not appear twice in the same TSV, and must not duplicate a file already listed in [`checksums/annotation_checksums.tsv`](checksums/annotation_checksums.tsv) under another project or assembly.
 - Each URL must be a real **`https://`** link to a **GFF3** file that our checks can open.
-
-### Annotrieve import (read before you submit)
-
-> **Disclaimer**  
-> After your entry is merged here and flows through the Genome Annotation Tracker, Annotrieve imports community annotations into the live database. **If your GFF3 file content is identical to an annotation already present from NCBI or Ensembl (same MD5 checksum after Annotrieve’s processing), that community row is skipped during import** and will not show up as a separate annotation in the app.  
-> The registry cannot accept duplicate files; Annotrieve will not publish them twice. Only submit assemblies and files that add **new** annotation content.
+- **Must add new content** — if your GFF3 matches an existing NCBI/Ensembl annotation (same MD5 after Annotrieve’s processing), it is skipped on import and will not appear as a separate annotation. Only submit files that add **new** annotation content. The registry’s own check hashes the raw downloaded bytes instead; see [MD5 checksum index](#md5-checksum-index) for details on `md5_checksum`.
 
 ---
 
-## MD5 checksum index
-
-The repository keeps a **repo-wide** TSV of file fingerprints:
-
-| Column | Meaning |
-|--------|---------|
-| `md5_checksum` | MD5 of the **raw downloaded** GFF3 (plain or `.gz` bytes as fetched) |
-| `assembly_accession` | NCBI assembly accession for that row |
-| `repo_path` | Project folder (e.g. `my_lab_build`) |
-| `access_url` | HTTPS link stored in `annotations.tsv` |
-
-- **On pull requests:** new rows are downloaded and hashed during validation. Their MD5 is compared to other new rows in the PR and to the index on the **target branch**, so you get a clear error if the file was already merged elsewhere (including project path and URL in the message).
-- **On merge to `master` / `main`:** [`.github/workflows/update-checksums.yml`](.github/workflows/update-checksums.yml) syncs the index for changed projects: removes entries for deleted rows (or deleted `annotations.tsv` files) and appends checksums for newly merged rows only.
-
-You do not edit `checksums/annotation_checksums.tsv` by hand; it is maintained by automation.
-
----
 
 ## Contribute with a fork (works in the browser)
 
@@ -163,3 +141,23 @@ These environment variables only affect the validator when set (defaults are fin
 | `NCBI_API_KEY` | — | Optional; higher NCBI rate limit when set |
 
 Assembly checks use the **datasets** subprocess, not ad-hoc NCBI HTTP from Python. URL checks use a **single streaming GET** per row (no separate HEAD request).
+
+---
+
+## MD5 checksum index
+
+The repository keeps a **repo-wide** TSV of file fingerprints:
+
+| Column | Meaning |
+|--------|---------|
+| `md5_checksum` | MD5 of the **raw downloaded** GFF3 (plain or `.gz` bytes as fetched) |
+| `assembly_accession` | NCBI assembly accession for that row |
+| `repo_path` | Project folder (e.g. `my_lab_build`) |
+| `access_url` | HTTPS link stored in `annotations.tsv` |
+
+- **On pull requests:** new rows are downloaded and hashed during validation. Each MD5 is compared to the other new rows in the PR and to the index on the **target branch**. If the file was already merged elsewhere, validation fails with an error that names the existing project path and URL.
+- **On merge to `master` / `main`:** [`.github/workflows/update-checksums.yml`](.github/workflows/update-checksums.yml) syncs the index for changed projects: removes entries for deleted rows (or deleted `annotations.tsv` files) and appends checksums for newly merged rows only.
+
+Do not edit `checksums/annotation_checksums.tsv` by hand; it is maintained by automation.
+
+---
diff --git a/README.md b/README.md
index f3deb84..e8f1705 100644
--- a/README.md
+++ b/README.md
@@ -25,39 +25,13 @@ Together, these files describe “this assembly, this annotation file,” in a f
 
 See **[`CONTRIBUTING.md`](CONTRIBUTING.md)** for a step-by-step flow (fork → edit → pull request).
 
-### Repo-wide checksum index
-
-After entries are merged to the default branch, automation maintains a shared index:
-
-```text
-checksums/annotation_checksums.tsv
-```
-
-Each row records the **MD5 of the downloaded annotation file** (raw bytes as fetched from `access_url`), plus the assembly accession, project path, and URL. Pull-request validation uses this index to reject new rows whose file content is already registered under another project or assembly.
-
-- Index header: [`schema/annotation_checksums.header`](schema/annotation_checksums.header)
-- Kept in sync on push to `master` / `main` by [`.github/workflows/update-checksums.yml`](.github/workflows/update-checksums.yml) (adds new rows, removes entries for deleted TSV lines or projects)
-
 ## How it fits in the larger system
 
 After your changes are **merged here**, the **[Genome Annotation Tracker](https://github.com/guigolab/genome-annotation-tracker)** reads this registry, turns each project’s manifest + TSV into formatted rows, and adds them to the shared **community annotation table**. Those rows are published on **[Annotrieve](https://genome.crg.eu/annotrieve)** in periodic imports.
 
 ```text
-You (this repo)          Downstream                         App
-─────────────────        ───────────────────────────────   ───────────
-manifest.yaml    ──┐
-annotations.tsv  ──┼──►  genome-annotation-tracker   ──►   community TSV
-(project folders)  │     (merges + normalizes rows)        ──►   Annotrieve
-checksums/       ──┘     github.com/guigolab/
-annotation_              genome-annotation-tracker
-checksums.tsv
+You (this repo)             Downstream                          App
+─────────────────           ──────────────────────────────      ───────────
+manifest.yaml        ──►    genome-annotation-tracker    ──►    Annotrieve
+annotations.tsv              (community TSV)
 ```
-
-## Import into Annotrieve
-
-> **Disclaimer — duplicate file content**  
-> Annotrieve identifies each annotation by an **MD5 checksum of the sorted, uncompressed GFF3** (the same content identity used for NCBI and Ensembl entries in the database).  
-> **Community submissions whose file content matches an annotation already imported from NCBI or Ensembl (same MD5) are skipped during import** and will not appear as a separate community record, even if your registry PR passed validation.  
-> Submit **distinct** annotation files (different assemblies and genuinely different GFF3 content). Re-hosting the same file under another URL or project folder does not create a second Annotrieve entry.
-
-Registry CI checks **downloaded file bytes**; Annotrieve’s import deduplication uses the **processed** checksum after sort/bgzip. In practice, identical biological content that is already in NCBI/Ensembl will be treated as a duplicate at import time.