Gaia publication figures by gene (GAIA endpoint for OCR bounding boxes and figures)#323
Merged
asherpasha merged 2 commits on Jun 23, 2026
GET /gaia/publication_figures_by_gene/<identifier> resolves a gene to its alias set, matches OCR-detected figure words (word-boundary regex for aliases >=4 chars, exact match for <=3), and returns figures grouped by PMC with img_url/caption/bbox plus allImageWords. Includes the bare-name collision guard, null-url skip, and numeric-pubmed ordering. Implemented in pure SQLAlchemy Core (gaia bind inferred from models; no raw text()), in a similar fashion to our typical endpoints.
- models: add AuthorList + FigureModels (reuse existing gaia models)
- wire gaia for local/CI: SQLALCHEMY_BINDS entry, init.sh load line, and a curated config/databases/gaia.sql test fixture
- tests: test_gaia.py with a skip-unless-MySQL guard so the MySQL-only endpoint skips (not errors) under the SQLite test harness
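The alias-matching rule described above can be sketched in plain Python. This is only an illustration of the rule, not the SQLAlchemy Core query in resources/gaia.py; the function name `alias_matches`, the constant `SHORT_ALIAS_MAX`, and the case-insensitive comparison are assumptions made for the example.

```python
import re

# Assumption: aliases of this length or shorter are matched exactly only.
SHORT_ALIAS_MAX = 3


def alias_matches(alias: str, ocr_word: str) -> bool:
    """Return True if an OCR-detected word counts as a hit for a gene alias.

    Short aliases (<= 3 chars) must equal the OCR word exactly.
    Longer aliases (>= 4 chars) match on word boundaries inside the OCR word.
    Comparison is case-insensitive here (an assumption for this sketch).
    """
    a = alias.lower()
    w = ocr_word.lower()
    if len(a) <= SHORT_ALIAS_MAX:
        return w == a
    # re.escape keeps any regex metacharacters in the alias literal.
    pattern = r"\b" + re.escape(a) + r"\b"
    return re.search(pattern, w) is not None
```

With this rule a short alias such as 34G matches the OCR word 34g but neither 34g/x nor x34gy, while a longer alias still matches when it is delimited by non-word characters.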
- gaia.sql: add gene 34G (3-char alias) test group with the exact OCR word 34g, a boundary decoy 34g/x and a substring decoy x34gy on their own figures, plus a malformed OCR entry (imageName with no bbox). Pins that short aliases (<=3 chars) match exact-only, never via the word-boundary regex or a LIKE. This is because the prod DB only has gene aliases at least 3 characters long.
- resources/gaia.py: unpack records the image name even when an OCR entry has no bbox (so the figure still displays) but skips the None, keeping each figure's bbox list clean for the frontend's box-drawing.
- test_gaia.py: add the 34G short-alias test (gated MySQL-only via the skip guard) asserting exact match, decoy exclusion, and a None-free bbox list.
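A minimal sketch of the unpack behaviour for OCR entries without a bbox, assuming each entry is a dict with an `imageName` key and an optional `bbox` key. The function name `group_bboxes` and the entry shape are hypothetical; the real unpack in resources/gaia.py may look different.

```python
def group_bboxes(ocr_entries):
    """Group bounding boxes by image name.

    Every image name is recorded, even when its entry has no bbox,
    so the figure can still be displayed. Missing (None) bboxes are
    skipped so each list stays clean for box-drawing.
    """
    figures = {}
    for entry in ocr_entries:
        # Register the figure first, regardless of whether a bbox exists.
        boxes = figures.setdefault(entry["imageName"], [])
        bbox = entry.get("bbox")
        if bbox is not None:
            boxes.append(bbox)
    return figures
```

A figure whose only OCR entry is malformed therefore ends up with an empty bbox list rather than a list containing None.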
Summary
Adds GET /gaia/publication_figures_by_gene/<identifier>, which reconstructs the original OCR-by-gene figure selection (figures where the gene was OCR-detected on the image) that was lost with MongoDB, replacing the esearch-relevance feed (pubmed). Also adds many tests. The hardest part of this endpoint was the SQL query, since it is not a direct relational conversion from MongoDB, i.e. it could not always be written with minimal joins. Now we will use this!
Sample:
Additive: new route + AuthorList/FigureModels models + gaia test wiring. 6 files, no overlap with #322.
Validation done
sql fixes #322 removed the SQLite mirror. The two test_fastpheno failures are also gone (same MySQL-only cause).
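For reference, a skip-unless-MySQL guard of the kind mentioned in the commit notes could look roughly like the sketch below. It uses the standard library's unittest rather than whatever test_gaia.py actually uses, and `uses_mysql`, `TEST_DB_URI`, and `GaiaFiguresCase` are hypothetical names chosen for the example.

```python
import unittest


def uses_mysql(database_uri: str) -> bool:
    """Return True when the SQLAlchemy-style URI points at a MySQL backend."""
    return database_uri.lower().startswith("mysql")


# Hypothetical URI standing in for the test harness configuration.
TEST_DB_URI = "sqlite:///:memory:"


class GaiaFiguresCase(unittest.TestCase):
    @unittest.skipUnless(uses_mysql(TEST_DB_URI), "gaia endpoint needs MySQL")
    def test_short_alias_exact_match(self):
        # The real test would call the endpoint and check the 34G group here.
        pass
```

Under a SQLite URI the test is reported as skipped instead of failing or erroring.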
To validate after deploy to api_dev (needs the live gaia DB)
data->>'$.word' is the follow-up fix if slow).