25 commits
e83416c
Phase 2 end-to-end test on first 10 muscular-system terms
dosumis Apr 27, 2026
3ad9844
Leaf flow: look up genus + part_of via obo-grep instead of single-col…
dosumis Apr 28, 2026
7385dbf
Re-test leaf flow with leaf_template_rows: both is_a AND part_of popu…
dosumis Apr 28, 2026
8229a6e
Enrichment experiment: 6 muscle terms across difficulty gradient
dosumis Apr 28, 2026
4ce9a09
Ovary enrichment experiment — hypothesis disproven
dosumis Apr 28, 2026
c544244
Phase 6 + Phase 7 (skeletal-muscle): system overlays + develops_from
dosumis Apr 28, 2026
42738e6
Validate Phase 6 + Phase 7 muscle overlay end-to-end on muscular-system
dosumis Apr 28, 2026
db5d64b
Full muscular-system run: 75 input terms processed end-to-end
dosumis May 11, 2026
0f2984b
Delete hra-muscular.template.tsv
dosumis May 11, 2026
0fdd84d
Add consolidated unresolvable.tsv report from the full muscular-syste…
dosumis May 11, 2026
77fd128
Add consolidated review.tsv: input rows joined with all findings per row
dosumis May 11, 2026
bb73ff6
review.tsv: add mapped_label, parent_correction_label, mapping_evidence
dosumis May 11, 2026
3f73ed6
Register hra_muscular component and surface template diffs in PRs
dosumis May 15, 2026
c21556f
Move 3 back-muscle groupings from EC template to manual curation in e…
dosumis May 15, 2026
e640036
Review fixes: part_of for 9900025, term_tracker_item column, move rep…
dosumis May 15, 2026
18346d5
Merge branch 'master' into add-hra-muscular-ntr
dosumis May 18, 2026
6436580
Wire subclasses to back-muscle grouping terms (9900020/9900055/9900063)
dosumis May 18, 2026
8117acb
Add posterior abdominal wall (UBERON:9900100); wire 4 muscles + group…
dosumis May 18, 2026
31eb4c9
Merge branch 'add-hra-muscular-ntr' of https://github.com/obophenotyp…
dosumis May 18, 2026
692925c
Reassign template-row UBERON IDs from 99xxxxx (temp) to 11xxxxx (OS r…
dosumis May 18, 2026
58c9677
Fix illegal-annotation-property QC: use 'depiction' obo shortcut, not…
dosumis May 18, 2026
de0719b
ASCTB-TEMP URLs -> ccf: CURIEs; fix articularis genu is_a to skeletal…
dosumis May 18, 2026
d5f52d5
Fix 8 unsat muscles: split spine location into attaches_to_part_of
dosumis May 18, 2026
2007e30
Declare RO:0002177 as ObjectProperty in muscular prefixes stub
dosumis May 18, 2026
f1995e6
Merge branch 'master' into add-hra-muscular-ntr
dosumis May 20, 2026
4 changes: 2 additions & 2 deletions .github/workflows/diff.yml
@@ -110,7 +110,7 @@ jobs:
ref: ${{ steps.comment-branch.outputs.head_ref }}
- name: Classify ontology PR branch
if: steps.check.outputs.triggered == 'true'
run: export ROBOT_JAVA_ARGS='-Xmx9G'; cd src/ontology; make BRI=false MIR=false PAT=false IMP=false COMP=false uberon-base.owl > TESTLOG.log
run: export ROBOT_JAVA_ARGS='-Xmx9G'; cd src/ontology; make BRI=false MIR=false PAT=false IMP=false COMP=true uberon-base.owl > TESTLOG.log
- name: Upload classified ontology in PR branch
if: steps.check.outputs.triggered == 'true'
uses: actions/upload-artifact@v4
@@ -136,7 +136,7 @@ jobs:
ref: master
- name: Classify ontology main branch
if: steps.check.outputs.triggered == 'true'
run: export ROBOT_JAVA_ARGS='-Xmx9G'; cd src/ontology; make BRI=false MIR=false PAT=false IMP=false COMP=false uberon-base.owl > TESTLOG.log
run: export ROBOT_JAVA_ARGS='-Xmx9G'; cd src/ontology; make BRI=false MIR=false PAT=false IMP=false COMP=true uberon-base.owl > TESTLOG.log
- name: Upload classified ontology main branch
if: steps.check.outputs.triggered == 'true'
uses: actions/upload-artifact@v4
12 changes: 6 additions & 6 deletions bulk_ntr_workflow/CLAUDE.md
@@ -194,12 +194,12 @@ Docker-based) Makefile regeneration step.
| `bulk_ntr_workflow/outputs/template_groups_initial.tsv` | Groups working copy (EC directives) |
| `src/templates/<name>.template.tsv` | Final leaf template; updated in-place by Stage 4 |
| `src/templates/<name>-groups.template.tsv` | Final groups template (equivalent class definitions) |
| `src/templates/<name>-reports/input.tsv` | Filtered input rows + `term_type` classification |
| `src/templates/<name>-reports/errors.tsv` | Input errors (bad/FMA/ASCTB-TEMP parents) |
| `src/templates/<name>-reports/candidates.tsv` | Pre-mapped + OLS4-confirmed existing terms |
| `src/templates/<name>-reports/out_of_scope.tsv` | Pathological/dysfunctional terms |
| `src/templates/<name>-reports/name_corrections.tsv` | Source-label → corrected-label rewrites |
| `src/templates/<name>-reports/manual_curation.tsv` | Group terms not fitting simple `part_of` pattern |
| `bulk_ntr_workflow/outputs/<name>-reports/input.tsv` | Filtered input rows + `term_type` classification |
| `bulk_ntr_workflow/outputs/<name>-reports/errors.tsv` | Input errors (bad/FMA/ASCTB-TEMP parents) |
| `bulk_ntr_workflow/outputs/<name>-reports/candidates.tsv` | Pre-mapped + OLS4-confirmed existing terms |
| `bulk_ntr_workflow/outputs/<name>-reports/out_of_scope.tsv` | Pathological/dysfunctional terms |
| `bulk_ntr_workflow/outputs/<name>-reports/name_corrections.tsv` | Source-label → corrected-label rewrites |
| `bulk_ntr_workflow/outputs/<name>-reports/manual_curation.tsv` | Group terms not fitting simple `part_of` pattern |
| `bulk_ntr_workflow/outputs/definitions/input/*.json` | Per-group input for subagents |
| `bulk_ntr_workflow/outputs/definitions/*.json` | Per-group subagent output |

82 changes: 82 additions & 0 deletions bulk_ntr_workflow/experiments/SUMMARY.md
@@ -0,0 +1,82 @@
# Enrichment experiment: muscle origin/insertion/innervation/action

**Run date:** 2026-04-28
**Branch:** `add-hra-muscular-ntr`
**Scope:** standalone — no workflow scripts modified.

## Goals

1. Test whether an agent can extract origin/insertion/innervation/action from Wikipedia + similar UBERON terms, with UBERON ID resolution and a verbatim evidence quote per field.
2. Find out how coverage varies across the muscle-term-type gradient (well-known whole muscle → obscure muscle sub-part) — testing the hypothesis that muscle parts are poorly axiomatised.
3. Verify that the "supporting quote" design is feasible without overhauling the workflow.

## Method

Six muscle terms were picked across a difficulty gradient. Each was routed to a separate general-purpose agent with instructions to:
- Populate 6 fields (`is_a`, `part_of`, `has_muscle_origin`, `has_muscle_insertion`, `innervated_by`, `has_muscle_action`).
- Each field has `value` (UBERON ID), `label`, `evidence` (verbatim quote), `source` (URL/PMID).
- All fields optional — emit empty `value` with evidence quote if no UBERON term exists for the entity.
- Output JSON to `bulk_ntr_workflow/experiments/enriched_<slug>.json`.
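
The field contract above can be checked mechanically. The following is a minimal sketch, assuming each field is a JSON object with the keys listed in the Method; the function name `check_record` and the exact rules (every emitted field needs `evidence` and `source`; an unresolved `value` needs a `label`) are illustrative assumptions, not part of the workflow.

```python
# Field names come from the Method list; the validation rules are assumptions.
FIELDS = [
    "is_a", "part_of", "has_muscle_origin",
    "has_muscle_insertion", "innervated_by", "has_muscle_action",
]

def check_record(record):
    """Return a list of problems found in one enriched record (empty if none)."""
    problems = []
    for name in FIELDS:
        field = record.get(name)
        if field is None:
            continue  # all fields are optional
        # Every emitted field should carry a quote and where it came from.
        for key in ("evidence", "source"):
            if not field.get(key):
                problems.append(f"{name}: missing {key}")
        # An unresolved field (empty or null value) should still name the entity.
        if not field.get("value") and not field.get("label"):
            problems.append(f"{name}: unresolved value needs a label")
    return problems
```

A record that passes returns an empty list, so the check could run over every `enriched_<slug>.json` before curator review.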

## Results

| Term | Type | Existing UBERON? | Fields populated | UBERON IDs | Free-text |
|---|---|---|---:|---:|---:|
| internal abdominal oblique muscle | well-known whole | yes (UBERON:0005454) | 6/6 | 5 | 1 |
| tensor fascia latae muscle | well-known whole | yes (UBERON:0001376) | 6/6 | 4 | 2 |
| iliocostalis cervicalis muscle | segmental whole | yes (UBERON:0008546) | 5/6 | 5 | 0 |
| articularis genu muscle | less-famous whole | NEW | 6/6 | 5 | 1 |
| clavicular head of pectoralis major muscle | muscle head | NEW | 6/6 | 5 | 1 |
| dorsal part of intertransversarii laterales lumborum | obscure sub-part | NEW | 6/6 | 3 | 3 |

(*"Free-text"* = field had a value/quote but couldn't resolve to a UBERON ID.)
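
The three count columns could be derived from an enriched JSON record along these lines. This is a sketch under one reading of the footnote: a field counts as populated if it has a value or an evidence quote, and as free-text if its value is not a UBERON ID. The function name `tally` and the ID pattern are assumptions, and the counts in the table above may have been produced by a different rule.

```python
import re

# Assumed ID shape: "UBERON:" followed by seven digits.
UBERON_ID = re.compile(r"^UBERON:\d{7}$")

FIELDS = [
    "is_a", "part_of", "has_muscle_origin",
    "has_muscle_insertion", "innervated_by", "has_muscle_action",
]

def tally(record, fields=FIELDS):
    """Return (populated, uberon_ids, free_text) for one enriched record."""
    populated = resolved = free_text = 0
    for name in fields:
        field = record.get(name)
        if not isinstance(field, dict):
            continue  # field omitted entirely
        value = field.get("value") or ""  # treat null and "" alike
        if not (value or field.get("evidence")):
            continue  # nothing recorded for this field
        populated += 1
        if UBERON_ID.match(value):
            resolved += 1
        else:
            free_text += 1
    return populated, resolved, free_text
```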

## Coverage findings

**Hypothesis confirmed (partly).** Muscle parts ARE more poorly served — but not in the way expected:
- The **target term** lacked a UBERON ID in 3 of 6 cases (all the new ones), as expected.
- The **anatomical entities they relate to** (origin bone, insertion attachment, innervating nerve) failed to resolve to UBERON IDs in unexpected places, even for famous muscles:
  - **superior gluteal nerve** (tensor fascia latae innervation): not in UBERON
  - **lateral pectoral nerve** (clavicular head innervation): not in UBERON
  - **iliotibial tract** (tensor fascia latae insertion): not in UBERON
  - **ilioinguinal nerve, iliohypogastric nerve** (internal abdominal oblique innervation): not in UBERON
  - **linea alba** (internal abdominal oblique insertion): not in UBERON
  - **suprapatellar bursa** (articularis genu insertion): not in UBERON
  - **accessory process of lumbar vertebra** (dorsal lumborum origin): not in UBERON

The agent fell back to **a more general parent** in each case (e.g. `humerus` instead of `lateral lip of intertubercular groove of humerus`; `thoracic nerve` instead of `lateral pectoral nerve`). These generalisations are correct but lose specificity.

For the obscure sub-part (`dorsal part of intertransversarii laterales lumborum`):
- Direct genus class missing (parent muscle `intertransversarii laterales lumborum muscle` not in UBERON)
- Origin attachment missing (`accessory process` not in UBERON)
- 3 of 6 fields had evidence but no UBERON ID — agent emitted `value: ""` with a clear `notes` field

**Coverage is not strongly correlated with term obscurity.** A famous muscle like tensor fascia latae has 2 unresolvable entities; the obscure dorsal sub-part has 3. The bottleneck is **UBERON's coverage of fine-grained anatomical attachments and named nerves**, not Wikipedia's coverage of the muscle itself.

## Quote-as-evidence findings

The verbatim quote design works well in practice:
- Quotes range 1–3 sentences, easy to scan
- Where a quote spans multiple fields (e.g. "originates from X and inserts onto Y"), the same passage is reused — no problem
- For obscure terms, the agent often had to rely on Kenhub or anatomy textbooks rather than Wikipedia — the `source` URL captures this naturally
- When evidence is absent (no source describes the field for this specific term), the agent leaves the field out cleanly

A curator reviewing the JSON could process each enrichment in seconds: read the quote, check the source matches, accept the UBERON ID resolution. **This makes the enrichment auditable in a way the current free-text definitions are not.**

## Surprises

1. **3 of 6 picks were already in UBERON.** Even moderately obscure terms (iliocostalis cervicalis) turned out to exist. Step 2 (existing-term check) is doing real work — we saw this with the group flow too. For HRA-ASCTB inputs, the agent should always run Step 2 first; enrichment is most valuable when the term is genuinely new.

2. **Existing UBERON terms have surprisingly LIGHT axiomatisation.** Tensor fascia latae's existing UBERON stanza had just `is_a` + 1 origin axiom. The enrichment added insertion + innervation + action that were missing. So the enrichment workflow could **also** improve existing terms, not just new ones.

3. **The hard problem is the relata, not the relations.** Identifying that a muscle is `innervated_by some nerve` is easy. Resolving "lateral pectoral nerve" to a UBERON ID is hard because UBERON doesn't have that nerve. A future enrichment workflow might want to flag missing UBERON terms it encounters as candidates for new term requests of their own (a kind of cascade — adding a muscle reveals the nerve it's innervated by also needs to be added).

## Implications for future work (NOT acted on)

If a richer NTR workflow is built:
- Make all enrichment fields **optional** (validated here — gracefully degrades).
- Capture **evidence quote + source URL** as a standard pattern for every populated field. Curator review would benefit substantially.
- Pre-extract **system-specific patterns** (skeletal muscle, bone, vasculature, etc.) so the agent knows which fields to look for rather than guessing per term.
- Detect and report **missing related entities** (e.g. lateral pectoral nerve) as a side-output, feeding into the next NTR batch.
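
The last point, reporting missing related entities as a side-output, might look roughly like the sketch below. It assumes the enriched JSON shape used in this experiment (a field with no `value` but a `label` marks an entity that could not be resolved); the function names and the TSV column names are hypothetical.

```python
import csv
import io

COLUMNS = ["term", "relation", "missing_entity", "source"]

def missing_relata(records):
    """Collect fields that name an entity but carry no resolved ID."""
    rows = []
    for record in records:
        for name, field in record.items():
            # Top-level strings such as "label" and "notes" are skipped.
            if isinstance(field, dict) and not field.get("value") and field.get("label"):
                rows.append({
                    "term": record.get("label", ""),
                    "relation": name,
                    "missing_entity": field["label"],
                    "source": field.get("source", ""),
                })
    return rows

def to_tsv(rows):
    """Serialise the rows as TSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

Each output row is a candidate for a follow-up new term request, which is the cascade described under Surprises.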

For now the existing workflow is unchanged; this experiment documents the shape of the result.
96 changes: 96 additions & 0 deletions bulk_ntr_workflow/experiments/SUMMARY_OVARY.md
@@ -0,0 +1,96 @@
# Enrichment experiment: ovary terms

**Run date:** 2026-04-28
**Branch:** `add-hra-muscular-ntr` (ovary terms read from `add-hra-ovary-ntr` via git show)
**Scope:** standalone — no workflow scripts modified.

## Goals

Test the hypothesis: **for ovary terms, simple `is_a + part_of` should be sufficient** (i.e. the rich relations needed for muscles — origin/insertion/innervation — won't apply, and ovary structures should be cleanly modelled with just genus + container).

## Method

Six ovary terms were picked across types:
- 3 layers/parts (corona radiata, corpus luteum granulosa lutein layer, corpus luteum granulosa theca layer)
- 1 compositional complex (cumulus oophorus oocyte complex)
- 2 follicle stages (early antral follicle, transitional primary ovarian follicle)

Each routed to a separate agent with the same evidence-quote enrichment design used for muscles. Fields tested: `is_a`, `part_of`, `composed_primarily_of`, `has_part`, `bounding_layer_of`, `develops_from`, `has_component` (with cardinality), `has_potential_to_develop_into`, `has_function`, `has_quality`.

## Results — hypothesis NOT confirmed

| Term | Type | Fields populated with UBERON IDs | Simple is_a+part_of sufficient? |
|---|---|---|---|
| corona radiata | layer | is_a, part_of, composed_primarily_of, **bounding_layer_of**, develops_from | NO — bounding_layer_of distinguishes it from generic granulosa cell layer |
| CL granulosa lutein layer | layer | is_a, part_of, **composed_primarily_of**, has_function | NO — composed_primarily_of CL:0000592 is the load-bearing differentiator vs sibling theca layer |
| CL granulosa theca layer | layer | is_a, part_of, **has_part** (CL:0000592 + CL:0000590), composed_primarily_of | NO — without has_part axioms, can't distinguish from generic CL layer |
| cumulus oophorus oocyte complex | complex | is_a, part_of, **has_part** (oocyte + cumulus + zona pellucida) | NO — without has_part, logically indistinguishable from cumulus oophorus alone |
| early antral follicle | stage | is_a, **has_part** (antrum), **has_component** w/ cardinality, **develops_from**, has_potential_to_develop_into | NO — UBERON's existing follicle-stage pattern requires all four mechanisms |
| transitional primary ovarian follicle | stage | is_a, has_part, **develops_from**, has_potential_to_develop_into | PARTIALLY — develops_from is essential; cardinality inherits from primary parent |

**Result: 5 of 6 ovary terms genuinely require relations beyond `is_a + part_of`.** Only the transitional primary follicle is borderline (develops_from is needed but cardinality can be inherited from the parent class).

## Why this is different from muscles

| Aspect | Muscle leaf terms | Ovary leaf terms |
|---|---|---|
| Defining relation | Spatial (where the muscle is, what it attaches to) | **Compositional** (what cells/parts it contains) and **temporal** (developmental sequence) |
| Common relations needed | has_muscle_origin, has_muscle_insertion, innervated_by | composed_primarily_of, has_part (CL:cell types), develops_from, has_component cardinality |
| Sibling distinguishability with is_a + part_of | Workable (different muscles → different containers/origins) | **Often fails** (sibling layers share container; sibling stages share genus) |
| External entities relied on | Bones, nerves (mostly UBERON) | Cell types (CL ontology — generally well-covered) |

The ovary case is in some ways **harder** for is_a + part_of than the muscle case:
- Multiple sibling structures within the same parent (e.g. lutein vs theca layer of the same corpus luteum) — they share `part_of UBERON:0002512`, so `part_of` alone doesn't differentiate them.
- Follicle stages share `is_a ovarian follicle` AND `part_of ovary` — neither relation distinguishes stages.
- The defining property is in the cellular composition or the developmental position, neither of which is captured by spatial part_of.
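
The distinguishability problem can be made concrete by grouping terms on their `(is_a, part_of)` pair: any group with more than one member cannot be told apart by those two relations alone. This is an illustrative sketch; the function name and the input shape (a dict of label to axioms) are assumptions, and labels stand in for IDs.

```python
from collections import defaultdict

def indistinguishable(terms):
    """Group term labels that share the same (is_a, part_of) pair."""
    groups = defaultdict(list)
    for label, axioms in terms.items():
        groups[(axioms["is_a"], axioms["part_of"])].append(label)
    # Only groups with two or more members are a problem.
    return [sorted(labels) for labels in groups.values() if len(labels) > 1]

# The two follicle stages share genus and container, as noted above.
# The third entry is a hypothetical contrast case.
terms = {
    "early antral follicle": {"is_a": "ovarian follicle", "part_of": "ovary"},
    "transitional primary ovarian follicle": {"is_a": "ovarian follicle", "part_of": "ovary"},
    "hypothetical other term": {"is_a": "cell layer", "part_of": "ovarian follicle"},
}
collisions = indistinguishable(terms)
```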

## UBERON precedent confirms the pattern

Existing UBERON follicle stage terms use sophisticated logical definitions:

```
UBERON:0000036 secondary ovarian follicle
intersection_of: UBERON:0001305 ! ovarian follicle
intersection_of: has_component UBERON:0005170 {minCardinality="2"} ! granulosa cell layer
intersection_of: has_potential_to_develop_into UBERON:0000037 ! tertiary ovarian follicle
relationship: develops_from UBERON:0000035 ! primary ovarian follicle
```

Existing cell-layer terms use:
```
UBERON:0000155 theca cell layer
intersection_of: UBERON:0000119 ! cell layer
intersection_of: composed_primarily_of CL:... ! theca cell
intersection_of: part_of UBERON:0001305 ! ovarian follicle
```
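
To compare a proposed template against precedent like the stanzas above, the `intersection_of` lines can be pulled apart into genus and differentia. The parser below is a minimal sketch that handles only the line shapes shown here (optional relation, filler ID, optional `{...}` trailing modifier, optional `!` comment); it is not a full OBO parser, and `parse_intersections` is a hypothetical helper.

```python
def parse_intersections(stanza):
    """Return (relation, filler, modifiers) tuples; relation is 'genus' for bare IDs."""
    parsed = []
    for line in stanza.splitlines():
        line = line.strip()
        if not line.startswith("intersection_of:"):
            continue
        # Drop the tag and the trailing "! label" comment.
        body = line[len("intersection_of:"):].split(" ! ", 1)[0].strip()
        modifiers = ""
        if "{" in body:
            body, _, rest = body.partition("{")
            modifiers = rest.rstrip().rstrip("}")
        tokens = body.split()
        if len(tokens) == 1:
            parsed.append(("genus", tokens[0], modifiers))
        else:
            parsed.append((tokens[0], tokens[1], modifiers))
    return parsed

stanza = '''
intersection_of: UBERON:0001305 ! ovarian follicle
intersection_of: has_component UBERON:0005170 {minCardinality="2"} ! granulosa cell layer
intersection_of: has_potential_to_develop_into UBERON:0000037 ! tertiary ovarian follicle
relationship: develops_from UBERON:0000035 ! primary ovarian follicle
'''
differentia = parse_intersections(stanza)
```

The `relationship:` line is ignored on purpose, since it is an asserted relation rather than part of the logical definition.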

UBERON convention itself **rejects the simple is_a + part_of pattern** for these structure types — the workflow's leaf template is missing exactly what UBERON considers necessary.

## Cross-experiment comparison

| Domain | Sufficient: is_a + part_of? | Most-needed extra relations | Pattern complexity |
|---|---|---|---|
| Muscle individual | partially | has_muscle_origin, has_muscle_insertion, innervated_by | Asserted relationships |
| Muscle group | yes (≥74% per Phase 2) | (none — simple is_a + part_of EC works) | EquivalentClass with single differentia |
| Muscle head/sub-part | yes (sparse precedent — only 2 terms in UBERON) | (parent muscle as part_of) | Asserted relationships |
| Ovary layer | NO | composed_primarily_of (CL:cell type), has_part (CL:cell types), bounding_layer_of | EquivalentClass with multi-differentia |
| Ovary compositional complex | NO | has_part (multiple CL+UBERON entities) | EquivalentClass with multiple has_part |
| Ovary stage | NO | develops_from, has_component with cardinality, has_potential_to_develop_into | Multi-axiom intersection_of with cardinality constraints |

**System-specific templates would help substantially.** The muscle and ovary domains need different fields, and within ovary the layers vs stages need different patterns. A single one-size-fits-all template either over-fits one domain or under-serves both.

## Implications

1. The user's intuition that ovary would need less than muscles was **wrong**, but the underlying point — that anatomical-system templates should be tailored — is more strongly supported, not less.

2. **Per-system templates** become important:
   - Muscle template: + `has_muscle_origin`, `has_muscle_insertion`, `innervated_by`
   - Ovary layer template: + `composed_primarily_of`, optionally `bounding_layer_of`
   - Ovary compositional complex template: + `has_part` (multi-valued)
   - Follicle stage template: + `develops_from`, `has_component` with cardinality, `has_potential_to_develop_into`

3. **The cardinality-constrained `has_component` is interesting.** ROBOT templates support this via the `>EC` directive (sub-axiom annotation), or via more elaborate column structures. Worth investigating if a stage-specific template is built.

4. **The agent's tool use was efficient.** All 6 ovary agents finished in 50–80s each, mostly using awk over `uberon-edit.obo` to find precedent stanzas and OLS4 only when needed for cell-type lookups in CL. obo-grep would not have been more efficient here.

5. **The evidence-quote design transferred cleanly.** Same JSON shape, same per-field quote+source — no schema changes needed across domains. Confirms it as a generalisable pattern.
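
The per-system templates in item 2 could be captured as a simple lookup from system to extra columns. The sketch below only restates that list as data; the system keys, the `template_columns` helper, and the choice of `is_a` + `part_of` as a shared base are assumptions for illustration, not an existing part of the workflow.

```python
# Shared base pattern (assumed): genus plus container.
BASE_COLUMNS = ["is_a", "part_of"]

# Extra relations per system, taken from item 2 above. Keys are hypothetical names.
EXTRA_COLUMNS = {
    "muscle": ["has_muscle_origin", "has_muscle_insertion", "innervated_by"],
    "ovary_layer": ["composed_primarily_of", "bounding_layer_of"],
    "ovary_complex": ["has_part"],
    "follicle_stage": ["develops_from", "has_component", "has_potential_to_develop_into"],
}

def template_columns(system):
    """Return the relation columns for a system; unknown systems get the base only."""
    return BASE_COLUMNS + EXTRA_COLUMNS.get(system, [])
```

Keeping the mapping as data means a new anatomical system needs one new entry rather than a new code path.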
40 changes: 40 additions & 0 deletions bulk_ntr_workflow/experiments/enriched_articularis_genu.json
@@ -0,0 +1,40 @@
{
  "label": "articularis genu muscle",
  "is_a": {
    "value": "UBERON:0001630",
    "label": "muscle organ",
    "evidence": "The articularis genus (also known as the subcrureus muscle) is a small skeletal muscle located anteriorly on the thigh just above the knee.",
    "source": "https://en.wikipedia.org/wiki/Articularis_genus_muscle"
  },
  "part_of": {
    "value": "UBERON:0000376",
    "label": "hindlimb stylopod",
    "evidence": "The articularis genus (also known as the subcrureus muscle) is a small skeletal muscle located anteriorly on the thigh just above the knee.",
    "source": "https://en.wikipedia.org/wiki/Articularis_genus_muscle"
  },
  "has_muscle_origin": {
    "value": "UBERON:0000981",
    "label": "femur",
    "evidence": "It arises from the anterior surface of the lower part of the body of the femur, deep to the vastus intermedius, close to the knee and from the deep fibers of the vastus intermedius.",
    "source": "https://en.wikipedia.org/wiki/Articularis_genus_muscle"
  },
  "has_muscle_insertion": {
    "value": null,
    "label": "suprapatellar bursa / synovial membrane of knee joint",
    "evidence": "Its insertion is on the synovial membrane of the knee-joint. [Infobox: Insertion: Suprapatellar bursa]",
    "source": "https://en.wikipedia.org/wiki/Articularis_genus_muscle",
    "notes": "No clear UBERON term resolved for suprapatellar bursa; left unresolved."
  },
  "innervated_by": {
    "value": "UBERON:0001267",
    "label": "femoral nerve",
    "evidence": "It is innervated by branches of the femoral nerve (L2-L4).",
    "source": "https://en.wikipedia.org/wiki/Articularis_genus_muscle"
  },
  "has_muscle_action": {
    "value": "pulls the suprapatellar bursa superiorly during extension of the knee, preventing impingement of the synovial membrane between the patella and the femur",
    "evidence": "Articularis genus pulls the suprapatellar bursa superiorly during extension of the knee, and prevents impingement of the synovial membrane between the patella and the femur.",
    "source": "https://en.wikipedia.org/wiki/Articularis_genus_muscle"
  },
  "notes": "Articularis genus is a distinct small skeletal muscle of the anterior thigh, often described as separated from but closely associated with vastus intermedius (UBERON:0014847). Insertion field has no resolved UBERON ID because suprapatellar bursa / knee synovial membrane was not found as a discrete UBERON class in OLS lookup; left as a free-text label with evidence quote. Wikipedia evidence is sourced from Grant's Atlas of Anatomy (Agur & Dalley 2009) and Gray's Anatomy (1918) per the article's references."
}