
Gaia publication figures by gene (GAIA endpoint for OCR bounding boxes and figures)#323

Merged
asherpasha merged 2 commits into BioAnalyticResource:dev from VinLau:gaia-publication-figures-by-gene
Jun 23, 2026

@VinLau VinLau commented Jun 23, 2026

Summary

Adds GET /gaia/publication_figures_by_gene/<identifier>, which reconstructs the original OCR-by-gene
figure selection (figures where the gene was OCR-detected on the image) that was lost in the move away from MongoDB, and replaces the esearch-relevance feed (PubMed). Many tests are included. The hardest part of this endpoint was the SQL query: the schema is not a direct relational conversion of the MongoDB data, so the query cannot always get by with minimal joins. We will use this endpoint going forward.

Sample:

image

Additive: new route + AuthorList/FigureModels models + gaia test wiring. 6 files, no overlap with #322.

Validation done

  • Rebased cleanly onto dev after sql fixes #322 (no conflicts; api/init.py changed only by dev).
  • gaia tests run on real MySQL under pytest (not skipped) and pass. The skip guard no longer skips them now that
    sql fixes #322 removed the SQLite mirror. The two test_fastpheno failures are also gone (same MySQL-only cause).
  • Checked locally on Docker Desktop with multiple inputs.
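
A minimal sketch of what a skip-unless-MySQL guard can look like. This is not the code in test_gaia.py: it uses the standard library's `unittest.skipUnless` (which pytest also honours) so the example has no third-party imports, and the `GAIA_DATABASE_URI` variable, the `is_mysql_uri` helper and the test class name are all hypothetical.

```python
import os
import unittest


def is_mysql_uri(uri: str) -> bool:
    """True when a SQLAlchemy-style URI targets MySQL (mysql:// or mysql+driver://)."""
    scheme = uri.split("://", 1)[0].lower()
    return scheme == "mysql" or scheme.startswith("mysql+")


# Hypothetical environment variable; the real project reads its binds from config.
GAIA_DB_URI = os.environ.get("GAIA_DATABASE_URI", "sqlite://")


@unittest.skipUnless(is_mysql_uri(GAIA_DB_URI), "gaia endpoint is MySQL-only")
class TestGaiaFiguresByGene(unittest.TestCase):
    def test_runs_only_on_mysql(self):
        # A real test would call the endpoint through the app's test client.
        self.assertTrue(is_mysql_uri(GAIA_DB_URI))
```

With a guard of this shape the tests are reported as skipped rather than errored on a non-MySQL backend, and run normally once the bind points at MySQL.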

To validate after deploy to api_dev (needs the live gaia DB)

  • ABI3 -> ~24+ real figures across multiple PMCs, including 01-0441f4.jpg, the one we use in our examples.
  • allImageWords populated with related genes (fus3, lec2, abi5…), not just "abi3".
  • AT1G01115 -> HTTP 200 with an empty payload (locus-resolved gene, no aliases).
  • ABI3 call latency acceptable for interactive load (unindexed JSON scan; functional index on
    data->>'$.word' is the follow-up fix if slow).
  • Speed of endpoint

VinLau added 2 commits June 23, 2026 08:47
GET /gaia/publication_figures_by_gene/<identifier> resolves a gene to its
alias set, matches OCR-detected figure words (word-boundary regex for
aliases >=4 chars, exact match for <=3), and returns figures grouped by PMC
with img_url/caption/bbox plus allImageWords. Includes the bare-name
collision guard, null-url skip, and numeric-pubmed ordering. Implemented in
pure SQLAlchemy Core (gaia bind inferred from models; no raw text()), in a similar fashion to our typical endpoints.
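
The matching rule described above can be illustrated in plain Python. This is only a sketch of the rule, not the endpoint's SQLAlchemy query: the `alias_matches` helper is hypothetical, and case-insensitive comparison is an assumption inferred from gene 34G matching the OCR word 34g.

```python
import re

SHORT_ALIAS_MAX = 3  # aliases of this length or shorter are matched exactly


def alias_matches(alias: str, ocr_word: str) -> bool:
    """Return True if an OCR word counts as a hit for a gene alias."""
    alias_l, word_l = alias.lower(), ocr_word.lower()  # assumed case-insensitive
    if len(alias_l) <= SHORT_ALIAS_MAX:
        # Short aliases: whole-word equality only, no regex and no substring match.
        return word_l == alias_l
    # Longer aliases: the alias must appear delimited by word boundaries.
    pattern = r"\b" + re.escape(alias_l) + r"\b"
    return re.search(pattern, word_l) is not None


print(alias_matches("34G", "34g"))        # exact short-alias hit
print(alias_matches("34G", "34g/x"))      # boundary decoy, rejected
print(alias_matches("ABI3", "abi3/fus3")) # long alias at a word boundary
```

The exact-match branch is what keeps a three-character alias from matching inside longer OCR strings, which a word-boundary regex alone would still allow for tokens such as 34g/x.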

- models: add AuthorList + FigureModels (reuse existing gaia models)
- wire gaia for local/CI: SQLALCHEMY_BINDS entry, init.sh load line, and a
  curated config/databases/gaia.sql test fixture
- tests: test_gaia.py with a skip-unless-MySQL guard so the MySQL-only
  endpoint skips (not errors) under the SQLite test harness
- gaia.sql: add gene 34G (3-char alias) test group with the exact OCR word
  34g, a boundary decoy 34g/x and a substring decoy x34gy on their own
  figures, plus a malformed OCR entry (imageName with no bbox). Pins that
  short aliases (<=3 chars) match exact-only, never via the word-boundary
  regex or a LIKE. This is because the prod DB only has gene aliases at least 3 characters long.
- resources/gaia.py: unpack records the image name even when an OCR entry has
  no bbox (so the figure still displays) but skips the None bbox, keeping each
  figure's bbox list clean for the frontend's box-drawing.
- test_gaia.py: add the 34G short-alias test (gated MySQL-only via the skip
  guard) asserting exact match, decoy exclusion, and a None-free bbox list.
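
The None-free bbox behaviour described for unpack can be sketched as follows. The function name `unpack_ocr_entries`, the entry shape (a dict with `imageName` and an optional `bbox`) and the sample coordinates are assumptions for illustration, not the code in resources/gaia.py.

```python
from typing import Any, Dict, List


def unpack_ocr_entries(entries: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Group bounding boxes by image name, keeping images that have no bbox."""
    figures: Dict[str, List[Any]] = {}
    for entry in entries:
        image_name = entry.get("imageName")
        if image_name is None:
            continue  # nothing to display without an image name
        # Register the image even if this entry carries no bbox.
        boxes = figures.setdefault(image_name, [])
        bbox = entry.get("bbox")
        if bbox is not None:
            boxes.append(bbox)  # only real boxes reach the frontend
    return figures


sample = [
    {"imageName": "fig1.jpg", "bbox": [10, 20, 30, 40]},
    {"imageName": "fig2.jpg"},  # malformed OCR entry: no bbox
]
print(unpack_ocr_entries(sample))
```

Here fig2.jpg still appears in the result with an empty list, so a figure with a malformed OCR entry can be shown without the box-drawing code having to handle None.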
@asherpasha asherpasha merged commit fae6fc0 into BioAnalyticResource:dev Jun 23, 2026
6 of 7 checks passed