44 changes: 21 additions & 23 deletions CONTRIBUTING.md
@@ -19,32 +19,10 @@ Rules that matter for everyone:
- **One row per URL** — the same `access_url` must not appear twice in the same `annotations.tsv`.
- **One row per file content** — the same annotation file (same MD5 of the downloaded bytes) must not appear twice in the same TSV, and must not duplicate a file already listed in [`checksums/annotation_checksums.tsv`](checksums/annotation_checksums.tsv) under another project or assembly.
- Each URL must be a real **`https://`** link to a **GFF3** file that our checks can open.

### Annotrieve import (read before you submit)

> **Disclaimer**
> After your entry is merged here and flows through the Genome Annotation Tracker, Annotrieve imports community annotations into the live database. **If your GFF3 file content is identical to an annotation already present from NCBI or Ensembl (same MD5 checksum after Annotrieve’s processing), that community row is skipped during import** and will not show up as a separate annotation in the app.
> The registry cannot accept duplicate files; Annotrieve will not publish them twice. Only submit assemblies and files that add **new** annotation content.
- **Must add new content** — if your GFF3 matches an existing NCBI/Ensembl annotation (same MD5 after Annotrieve’s processing), it will be skipped on import. Only submit files that add **new** annotation content. See the [MD5 checksum index](#md5-checksum-index) section for details on `md5_checksum`.

---

## MD5 checksum index

The repository keeps a **repo-wide** TSV of file fingerprints:

| Column | Meaning |
|--------|---------|
| `md5_checksum` | MD5 of the **raw downloaded** GFF3 (plain or `.gz` bytes as fetched) |
| `assembly_accession` | NCBI assembly accession for that row |
| `repo_path` | Project folder (e.g. `my_lab_build`) |
| `access_url` | HTTPS link stored in `annotations.tsv` |

- **On pull requests:** new rows are downloaded and hashed during validation. Their MD5 is compared to other new rows in the PR and to the index on the **target branch**, so you get a clear error if the file was already merged elsewhere (including project path and URL in the message).
- **On merge to `master` / `main`:** [`.github/workflows/update-checksums.yml`](.github/workflows/update-checksums.yml) syncs the index for changed projects: removes entries for deleted rows (or deleted `annotations.tsv` files) and appends checksums for newly merged rows only.

You do not edit `checksums/annotation_checksums.tsv` by hand; it is maintained by automation.

---

## Contribute with a fork (works in the browser)

@@ -163,3 +141,23 @@ These environment variables only affect the validator when set (defaults are fin
| `NCBI_API_KEY` | — | Optional; higher NCBI rate limit when set |

Assembly checks use the **datasets** subprocess, not ad-hoc NCBI HTTP from Python. URL checks use a **single streaming GET** per row (no separate HEAD request).
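
As a rough illustration of why one streaming GET is enough, the response body can be hashed chunk by chunk while it is being read, so no second request is needed to fingerprint the file. The sketch below is not the validator's code: `md5_of_stream` is a hypothetical helper, and an in-memory `io.BytesIO` stands in for the HTTP response body so the example runs offline.

```python
import hashlib
import io


def md5_of_stream(stream, chunk_size=65536):
    """Hash a file-like object incrementally, without loading it whole.

    `stream` is anything with a .read(n) method (hypothetical stand-in for
    a streamed HTTP response body).
    """
    digest = hashlib.md5()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # empty bytes means end of stream
            break
        digest.update(chunk)
    return digest.hexdigest()


# Tiny made-up GFF3 payload, read in 8-byte chunks to exercise the loop.
body = b"##gff-version 3\nchr1\tsrc\tgene\t1\t100\t.\t+\t.\tID=g1\n"
streamed = md5_of_stream(io.BytesIO(body), chunk_size=8)

# The chunked digest equals the digest of the full byte string.
print(streamed == hashlib.md5(body).hexdigest())
```

Because the digest is updated per chunk, memory use stays bounded by `chunk_size` regardless of file size.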

---

## MD5 checksum index

The repository keeps a **repo-wide** TSV of file fingerprints:

| Column | Meaning |
|--------|---------|
| `md5_checksum` | MD5 of the **raw downloaded** GFF3 (plain or `.gz` bytes as fetched) |
| `assembly_accession` | NCBI assembly accession for that row |
| `repo_path` | Project folder (e.g. `my_lab_build`) |
| `access_url` | HTTPS link stored in `annotations.tsv` |

- **On pull requests:** new rows are downloaded and hashed during validation. Their MD5 is compared to other new rows in the PR and to the index on the **target branch**, so you get a clear error if the file was already merged elsewhere (including project path and URL in the message).
- **On merge to `master` / `main`:** [`.github/workflows/update-checksums.yml`](.github/workflows/update-checksums.yml) syncs the index for changed projects: removes entries for deleted rows (or deleted `annotations.tsv` files) and appends checksums for newly merged rows only.

You do not edit `checksums/annotation_checksums.tsv` by hand; it is maintained by automation.
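
The pull-request check described above can be pictured as a lookup of each new row's MD5 in the index, plus a comparison against the other new rows in the same PR. The following is a minimal sketch under stated assumptions, not the repository's validator: `find_duplicate_checksums` is a hypothetical name, the index is assumed to carry a header row with the four columns from the table, and the checksums, accession, and URLs are made-up placeholders.

```python
import csv
import io


def find_duplicate_checksums(index_tsv, new_rows):
    """Return one message per new row whose MD5 was already seen.

    index_tsv: text of the index, assumed to start with a header row.
    new_rows:  dicts with the same four keys as the index columns.
    """
    reader = csv.DictReader(io.StringIO(index_tsv), delimiter="\t")
    seen = {row["md5_checksum"]: row for row in reader}
    errors = []
    for row in new_rows:
        prior = seen.get(row["md5_checksum"])
        if prior is None:
            # First time this content appears; later rows are checked against it.
            seen[row["md5_checksum"]] = row
            continue
        errors.append(
            f"{row['access_url']} duplicates {prior['access_url']} "
            f"(project {prior['repo_path']})"
        )
    return errors


# Placeholder data: all values below are invented for the example.
index_tsv = (
    "md5_checksum\tassembly_accession\trepo_path\taccess_url\n"
    "0123456789abcdef0123456789abcdef\tGCA_000000000.1\tmy_lab_build\t"
    "https://example.org/a.gff3.gz\n"
)
new_rows = [
    {"md5_checksum": "0123456789abcdef0123456789abcdef",
     "assembly_accession": "GCA_000000000.1",
     "repo_path": "other_project",
     "access_url": "https://example.org/copy.gff3.gz"},
    {"md5_checksum": "f" * 32,
     "assembly_accession": "GCA_000000000.1",
     "repo_path": "other_project",
     "access_url": "https://example.org/new.gff3"},
    {"md5_checksum": "f" * 32,
     "assembly_accession": "GCA_000000000.1",
     "repo_path": "other_project",
     "access_url": "https://example.org/new-mirror.gff3"},
]

errors = find_duplicate_checksums(index_tsv, new_rows)
for message in errors:
    print(message)
```

The first message comes from a match against the index and the second from a match between two new rows, which mirrors the two comparisons the validation step is described as making.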

---
34 changes: 4 additions & 30 deletions README.md
@@ -25,39 +25,13 @@ Together, these files describe “this assembly, this annotation file,” in a f

See **[`CONTRIBUTING.md`](CONTRIBUTING.md)** for a step-by-step flow (fork → edit → pull request).

### Repo-wide checksum index

After entries are merged to the default branch, automation maintains a shared index:

```text
checksums/annotation_checksums.tsv
```

Each row records the **MD5 of the downloaded annotation file** (raw bytes as fetched from `access_url`), plus the assembly accession, project path, and URL. Pull-request validation uses this index to reject new rows whose file content is already registered under another project or assembly.

- Index header: [`schema/annotation_checksums.header`](schema/annotation_checksums.header)
- Kept in sync on push to `master` / `main` by [`.github/workflows/update-checksums.yml`](.github/workflows/update-checksums.yml) (adds new rows, removes entries for deleted TSV lines or projects)

## How it fits in the larger system

After your changes are **merged here**, the **[Genome Annotation Tracker](https://github.com/guigolab/genome-annotation-tracker)** reads this registry, turns each project’s manifest + TSV into formatted rows, and adds them to the shared **community annotation table**. Those rows are published on **[Annotrieve](https://genome.crg.eu/annotrieve)** in periodic imports.

```text
You (this repo) Downstream App
───────────────── ─────────────────────────────── ───────────
manifest.yaml ──┐
annotations.tsv ──┼──► genome-annotation-tracker ──► community TSV
(project folders) │ (merges + normalizes rows) ──► Annotrieve
checksums/ ──┘ github.com/guigolab/
annotation_ genome-annotation-tracker
checksums.tsv
You (this repo) Downstream App
───────────────── ────────────────────────────── ───────────
manifest.yaml ──► genome-annotation-tracker ──► Annotrieve
annotations.tsv (community TSV)
```

## Import into Annotrieve

> **Disclaimer — duplicate file content**
> Annotrieve identifies each annotation by an **MD5 checksum of the sorted, uncompressed GFF3** (the same content identity used for NCBI and Ensembl entries in the database).
> **Community submissions whose file content matches an annotation already imported from NCBI or Ensembl (same MD5) are skipped during import** and will not appear as a separate community record, even if your registry PR passed validation.
> Submit **distinct** annotation files (different assemblies and genuinely different GFF3 content). Re-hosting the same file under another URL or project folder does not create a second Annotrieve entry.

Registry CI checks **downloaded file bytes**; Annotrieve’s import deduplication uses the **processed** checksum after sort/bgzip. In practice, identical biological content that is already in NCBI/Ensembl will be treated as a duplicate at import time.