
Terraphim Clinical Pipeline: Graph-Based Safety Gates for MedGemma - From Class Suggestions to Specific Drug-Dose Evidence

Team

Dr Alexander Mikhalev: AI/ML expert, systems thinker. Zestic AI partner and builder of Terraphim. Background in distributed systems (Rust, async), complex systems, and knowledge graph engineering. Built the full pipeline: entity extraction, knowledge graph, multi-agent orchestration, MedGemma integration, evaluation harness, and edge deployment config. Solo submission.

Why Terraphim Embeddings and Knowledge Graph

In my previous medical pipeline, The Pattern (platinum winner of the $100K "Build on Redis" hackathon at RedisConf 2021), I built an engineering marvel: CPU inference under 2 ms for a BERT-large (uncased) QA model, before the NVIDIA Ampere architecture was released. Yet while testing it I kept thinking: "If my doctor is going to be using my system, I don't want to be the patient". For the past 5 years I have been building graph embeddings, which have enabled me to achieve 100% accuracy and grounding on medical industry knowledge. With the industry awakening to the limitations of the Transformer architecture, Terraphim-style graph embeddings can serve as a new type of tokeniser driving the next generation of large and small language models.

Problem statement

The Problem: LLMs Hallucinate Dangerous Drug Recommendations

Large language models generate plausible-sounding but often vague or incorrect medical recommendations. In precision oncology and pharmacogenomics, vague advice can be dangerous.

Anchor cases -- measured A/B comparison (ab_comparison example, reproduced 2026-02-24):

Running the same clinical cases through MedGemma with and without KG context:

| Case | Raw MedGemma (no KG) | With Terraphim KG Grounding |
|------|----------------------|-----------------------------|
| BRAF Melanoma | "BRAF inhibitor (e.g., Dabrafenib + Trametinib)" -- vague class suggestion | Vemurafenib 450mg orally once daily -- specific drug + dose |
| CYP2D6 Codeine | Oxycodone 5 mg/mL -- wrong drug entirely | Codeine 60mg every 6h -- correct drug from KG context |
| EGFR NSCLC | Osimertinib 80mg (correct on this run; prior run hallucinated 800mg -- a 10x overdose) | Osimertinib 80mg -- consistently correct per FLAURA trial |

The BRAF case is the most reliably reproducible: raw MedGemma consistently hedges with drug class names ("consider BRAF inhibitor") instead of actionable prescriptions. The knowledge graph narrows this to a specific drug from the evidence-validated treatment subgraph. The CYP2D6 case shows the raw model substituting a different drug entirely, while KG grounding keeps the recommendation within the context-appropriate drug set. The EGFR case has shown stochastic dosing errors (800mg in one run, correct 80mg in another) -- exactly the kind of non-determinism that makes raw LLM output unsuitable for clinical decision support.

Why This Matters

  • 250,000+ adverse drug events per year in the US alone (Journal of Patient Safety)
  • 7% of ADEs are preventable through pharmacogenomic-guided prescribing (CPIC consortium)
  • Pharmacogenomic testing adoption is below 5% in most health systems -- not because tests are unavailable, but because interpreting gene-drug interactions at point-of-care is cognitively overwhelming
  • Pattern matching across 1.4 million SNOMED CT terms, CPIC guidelines, and clinical trial data is beyond human capacity at point-of-care speed

Impact Estimate

A health system processing 10,000 prescriptions/month with a conservatively estimated 2% pharmacogenomic interaction rate (vs. 30-60% drug-gene interaction prevalence in genotyped populations [Pasternak et al. 2023, Clin Transl Sci]) = 200 drug-gene interactions requiring clinical review per month. A 50% interception rate through high-specificity KG-grounded alerts (compared to <10% acceptance for standard CDS alerts [Felisberto et al. 2024, Health Informatics Journal]) = 100 prevented adverse events per month per health system. At an average preventable ADE cost of $5,000-$10,000 per incident (Bates et al. 1997 JAMA, inflation-adjusted from $4,685) this represents $500K-$1M/month in avoided costs -- before counting prevented hospitalisations and improved patient outcomes.
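The arithmetic behind this estimate can be made explicit. A minimal sketch, using only the assumptions stated above (the rates and costs are the estimate's inputs, not measured outcomes):

```rust
fn main() {
    // All figures are the stated assumptions from the estimate above, not measured data.
    let prescriptions_per_month: f64 = 10_000.0;
    let pgx_interaction_rate = 0.02; // conservative 2% (vs. 30-60% prevalence in genotyped populations)
    let interception_rate = 0.50;    // assumed for high-specificity KG-grounded alerts
    let (cost_low, cost_high) = (5_000.0_f64, 10_000.0_f64); // per preventable ADE, USD

    let interactions = prescriptions_per_month * pgx_interaction_rate; // 200 per month
    let prevented = interactions * interception_rate;                  // 100 per month
    let (savings_low, savings_high) = (prevented * cost_low, prevented * cost_high);

    assert_eq!(interactions, 200.0);
    assert_eq!(prevented, 100.0);
    println!("avoided costs: ${savings_low}-${savings_high} per month"); // $500000-$1000000
}
```

Halving any single assumption (interaction rate, interception rate, or cost per ADE) halves the estimate, so the figure is best read as an order-of-magnitude bound.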

Overall solution

How MedGemma is Used

Terraphim uses MedGemma as the clinical reasoning engine within a knowledge-grounded multi-agent pipeline. MedGemma's medical training on clinical literature and healthcare data makes it uniquely suited for this role -- its parametric medical knowledge complements the knowledge graph's structured evidence.

The pipeline works in 6 steps:

Patient Input --> Entity Extraction (Aho-Corasick LeftmostLongest, <1ms)
             --> Knowledge Graph Query (SNOMED CT + PrimeKG, <1ms)
             --> PGx Validation (CPIC guidelines, <1ms)
             --> MedGemma Inference (KG-augmented prompt, 2-5s cloud / 23.5s GPU / 165s CPU)
             --> Safety Validation (KG grounding gate)
             --> Grounded Clinical Recommendation

Why MedGemma specifically:

  1. Medical training complements structured knowledge. MedGemma's parametric knowledge handles clinical reasoning and natural language generation, while the KG provides hard constraints (drug-gene interactions, contraindications, trial evidence). Neither alone is sufficient.

  2. KG catches MedGemma hallucinations. In the T790M case, raw MedGemma suggested a generic EGFR inhibitor. The knowledge graph narrowed this to Osimertinib based on the AURA3 trial -- the correct second-line therapy per NCCN guidelines.

  3. KG context injection improves MedGemma output. Extracted entities (with SNOMED IDs), KG treatment options, and PGx validation results are injected directly into the MedGemma prompt. This structured context steers the model toward evidence-based specificity rather than generic advice.

  4. Edge deployment. MedGemma 4B quantized to Q4_K_M (2.3GB) fits within a 4GB memory budget alongside the full KG and entity extraction pipeline -- enabling offline clinical decision support in resource-constrained settings.
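The context-injection idea in point 3 can be sketched as follows. This is an illustrative template, not the pipeline's actual Context Builder: the struct, field names, and prompt wording are placeholders, and only the SNOMED concept ID for NSCLC (254637007) comes from this document.

```rust
// Illustrative sketch of KG-augmented prompt construction. The real
// Context Builder in the pipeline uses its own types and template.
struct KgContext<'a> {
    entities: Vec<(&'a str, &'a str)>, // (term, SNOMED concept ID)
    treatments: Vec<&'a str>,          // drugs from the validated subgraph
    pgx_flags: Vec<&'a str>,           // CPIC validation results
}

fn build_prompt(case: &str, ctx: &KgContext) -> String {
    let mut p = String::from("Patient case:\n");
    p.push_str(case);
    p.push_str("\n\nGrounded entities (SNOMED CT):\n");
    for (term, id) in &ctx.entities {
        p.push_str(&format!("- {term} ({id})\n"));
    }
    p.push_str("\nEvidence-validated treatment options:\n");
    for t in &ctx.treatments {
        p.push_str(&format!("- {t}\n"));
    }
    p.push_str("\nPharmacogenomic constraints (CPIC):\n");
    for f in &ctx.pgx_flags {
        p.push_str(&format!("- {f}\n"));
    }
    p.push_str("\nRecommend a specific drug and dose from the options above.");
    p
}

fn main() {
    let ctx = KgContext {
        entities: vec![("non-small cell lung carcinoma", "254637007")],
        treatments: vec!["Osimertinib 80mg once daily (FLAURA)"],
        pgx_flags: vec!["No relevant CYP interaction detected"],
    };
    let prompt = build_prompt("58-year-old male, EGFR L858R+ NSCLC, progressed on gefitinib", &ctx);
    assert!(prompt.contains("Osimertinib"));
    println!("{prompt}");
}
```

The key design point is that the structured facts (concept IDs, validated treatments, PGx flags) travel inside the prompt, so the model's generation is anchored to the same evidence the downstream safety gate will check against.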

System 1 + System 2: Why This Architecture Works

This pipeline embodies the dual-process model from cognitive science. MedGemma operates as System 1 -- a fast, intuitive, pattern-matching system. It has absorbed millions of clinical documents during training and produces fluent, plausible-sounding recommendations in seconds. Like a clinician's trained intuition, it is usually directionally correct, but prone to confident errors: hallucinating a 10x dose (800mg vs 80mg), suggesting a drug outside the patient's mutation profile, or hedging with vague class-level recommendations ("consider EGFR inhibitor") instead of actionable specifics.

The Terraphim knowledge graph operates as System 2 -- slow, deliberate, evidence-based reasoning. Entity extraction with LeftmostLongest matching grounds the clinical text to precise SNOMED CT concepts. The typed graph (27 node types, 65 edge types) traces explicit evidence paths: Drug->Treats->Disease->HasVariant->Gene->CitedIn->Trial. PGx validation checks drug-gene interactions against CPIC guidelines. The safety gate verifies every recommendation against the validated treatment subgraph before sending it to the clinician.

Neither system alone is sufficient. System 1 (MedGemma) without System 2 produces hallucinations that look authoritative. System 2 (KG) without System 1 can only retrieve explicitly encoded information -- it cannot reason about novel combinations or generate natural-language explanations. The combination is greater than either part: MedGemma generates candidate recommendations from its vast parametric knowledge, and the knowledge graph validates, constrains, and grounds them in traceable clinical evidence. The result is a system that thinks fast and checks carefully -- the same cognitive architecture that makes expert clinicians effective, implemented as a reproducible, auditable pipeline.

LeftmostLongest Matching: Grounding at the Right Specificity Level

Entity extraction is the first step in the pipeline, and its precision determines whether the entire downstream chain -- KG lookup, PGx validation, MedGemma context injection -- operates on the correct clinical concept. Terraphim's EntityExtractor uses Aho-Corasick automata configured with LeftmostLongest match semantics. This is not an implementation detail -- it is a clinical safety requirement.

Why LeftmostLongest matters for grounding:

When clinical text contains "non-small cell lung carcinoma", the SNOMED CT ontology has both a specific concept (SNOMED 254637007: Non-small cell lung carcinoma) and a generic parent (SNOMED 363358000: Lung carcinoma). Without LeftmostLongest, the automaton could return "lung carcinoma" -- a shorter match that starts at the same position. This causes a cascade of errors:

LeftmostFirst (wrong):   "lung carcinoma" -> SNOMED 363358000 -> generic chemotherapy options
LeftmostLongest (correct): "non-small cell lung carcinoma" -> SNOMED 254637007 -> targeted therapies
                           (Osimertinib, Gefitinib, Erlotinib -- EGFR-specific)

The treatment graph for NSCLC includes EGFR-targeted therapies that do not exist under the generic "lung carcinoma" node. Grounding at the wrong specificity level means the KG context injected into MedGemma's prompt omits the most relevant treatment options, and the safety gate cannot validate against the correct drug-gene interaction set.

Concrete example from the evaluation suite: In CASE-001-EGFR-NSCLC, the extractor processes "58-year-old male with EGFR L858R-positive non-small cell lung cancer progressed on first-line gefitinib." LeftmostLongest ensures "non-small cell lung cancer" matches as a single entity (concept 254637007), not as "cell" + "lung cancer" fragments. The KG then correctly returns Osimertinib as the evidence-based second-line therapy per the FLAURA trial.

This is validated by dedicated tests (test_leftmost_longest_prefers_full_concept_over_fragment, test_leftmost_longest_from_terms) that assert the extractor always returns the most specific SNOMED concept when overlapping terms exist.
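The production extractor uses an Aho-Corasick automaton; as a self-contained illustration of the match policy only, the LeftmostLongest preference can be sketched with a naive scan (this brute-force loop is not the O(n+m+z) automaton, and the term list is the two-concept example from above):

```rust
// Naive illustration of LeftmostLongest semantics over a term dictionary:
// prefer the match starting leftmost; among ties, prefer the longest term.
// The real pipeline uses terraphim_automata's Aho-Corasick automaton.
fn leftmost_longest<'a>(text: &'a str, terms: &[&str]) -> Option<(usize, &'a str)> {
    let lower = text.to_lowercase();
    let mut best: Option<(usize, usize)> = None; // (start, len)
    for term in terms {
        if let Some(start) = lower.find(&term.to_lowercase()) {
            best = match best {
                // Keep the current best if it starts earlier, or starts at the
                // same position and is at least as long as the new candidate.
                Some((s, l)) if start > s || (start == s && term.len() <= l) => Some((s, l)),
                _ => Some((start, term.len())),
            };
        }
    }
    best.map(|(s, l)| (s, &text[s..s + l]))
}

fn main() {
    // Overlapping SNOMED terms from the example above.
    let terms = ["lung carcinoma", "non-small cell lung carcinoma"];
    let text = "Patient with non-small cell lung carcinoma, stage IV";
    let (start, matched) = leftmost_longest(text, &terms).unwrap();
    assert_eq!((start, matched), (13, "non-small cell lung carcinoma"));

    // Prefix-sharing terms start at the same position: the longest wins.
    let (_, m2) = leftmost_longest(text, &["non-small", "non-small cell lung carcinoma"]).unwrap();
    assert_eq!(m2, "non-small cell lung carcinoma");
}
```

Either way the extractor grounds to the most specific concept available, which is the property the dedicated tests assert.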

Graph-Based Embeddings: Why They Matter for Clinical Safety

Terraphim uses symbolic graph embeddings rather than vector embeddings for clinical knowledge representation. The MedicalRoleGraph (27 node types, 65 edge types) computes similarity using a weighted combination of Jaccard coefficient (0.7) and graph path distance (0.3). This produces deterministic, auditable similarity scores -- critical properties that dense vector embeddings lack.

Why graph-based over vector-based for clinical safety:

| Property | Vector Embeddings | Graph-Based (Terraphim) |
|----------|-------------------|-------------------------|
| Determinism | Non-deterministic (model-dependent) | Deterministic (graph structure) |
| Auditability | Opaque 768-dim vectors | Traceable path: Drug->Treats->Disease->HasVariant->Gene |
| Drift | Embedding models change with updates | Stable until graph is explicitly updated |
| Explainability | "These vectors are close" | "Osimertinib treats NSCLC via EGFR L858R (FLAURA trial, 80% ORR)" |
| Regulatory fit | Difficult to validate | Auditable for clinical certification |

Demonstrated safety gate behaviour: During real inference, MedGemma recommended "Pembrolizumab 200mg IV every 3 weeks" for an EGFR L858R+ NSCLC patient. The knowledge graph safety gate correctly blocked this recommendation -- Pembrolizumab (a PD-1 checkpoint inhibitor) is not in the validated EGFR-NSCLC treatment subgraph. The KG instead grounds the recommendation to Osimertinib per the FLAURA trial. This is not a failure mode -- it is the system working as designed. Across 36 real inference calls (18 CPU + 18 GPU), the safety gate maintained 100% detection of ungrounded recommendations while passing all clinically appropriate ones.
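At its core, the gate's grounding check is set membership against the validated treatment subgraph. A minimal sketch, assuming a flat drug set per disease concept (the real gate also applies PGx rules and dose checks; drug names here are the examples from this section):

```rust
use std::collections::HashSet;

// Illustrative safety-gate check: a recommendation passes only if the drug
// is in the KG's validated treatment subgraph for the grounded disease.
fn gate_check(recommended_drug: &str, validated_subgraph: &HashSet<&str>) -> Result<(), String> {
    if validated_subgraph.contains(recommended_drug) {
        Ok(())
    } else {
        Err(format!("BLOCKED: {recommended_drug} is not in the validated treatment subgraph"))
    }
}

fn main() {
    // Validated EGFR-NSCLC treatments (illustrative subset).
    let egfr_nsclc: HashSet<&str> = ["Osimertinib", "Gefitinib", "Erlotinib"].into_iter().collect();

    // MedGemma hallucination: a PD-1 inhibitor outside the subgraph is blocked.
    assert!(gate_check("Pembrolizumab", &egfr_nsclc).is_err());
    // A grounded recommendation passes.
    assert!(gate_check("Osimertinib", &egfr_nsclc).is_ok());
}
```

Because the check is pure set membership over an explicitly curated graph, a blocked recommendation always comes with a traceable reason rather than an opaque score.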

The find_similar() function enables clinical decision support beyond direct lookups: querying "what is similar to NSCLC?" returns SCLC (high similarity -- sibling lung cancers sharing treatment pathways) over breast cancer (lower similarity -- different organ, different treatment graph). This graph-structural similarity cannot be reliably replicated with dense embeddings trained on general medical text.
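The weighted similarity described above (0.7 Jaccard + 0.3 path distance) can be sketched with std collections. The neighbour sets and hop counts below are illustrative stand-ins; the real SymbolicEmbeddingIndex computes these over the typed graph, and the exact path-distance normalisation is an assumption here:

```rust
use std::collections::HashSet;

// Illustrative symbolic similarity: 0.7 * Jaccard(neighbour sets)
// + 0.3 * path proximity, with proximity taken as 1 / (1 + hops).
fn jaccard(a: &HashSet<&str>, b: &HashSet<&str>) -> f64 {
    let inter = a.intersection(b).count() as f64;
    let union = a.union(b).count() as f64;
    if union == 0.0 { 0.0 } else { inter / union }
}

fn symbolic_similarity(a: &HashSet<&str>, b: &HashSet<&str>, hops: u32) -> f64 {
    0.7 * jaccard(a, b) + 0.3 * (1.0 / (1.0 + hops as f64))
}

fn main() {
    // Toy neighbour sets: sibling lung cancers share more graph context
    // (and sit closer in the IS-A hierarchy) than breast cancer does.
    let nsclc: HashSet<&str> = ["lung", "carcinoma", "platinum-chemo"].into_iter().collect();
    let sclc: HashSet<&str> = ["lung", "carcinoma", "platinum-chemo", "etoposide"].into_iter().collect();
    let breast: HashSet<&str> = ["breast", "carcinoma", "tamoxifen"].into_iter().collect();

    let sim_sclc = symbolic_similarity(&nsclc, &sclc, 2);   // 0.7*0.75 + 0.3/3 = 0.625
    let sim_breast = symbolic_similarity(&nsclc, &breast, 4); // 0.7*0.20 + 0.3/5 = 0.200
    assert!(sim_sclc > sim_breast); // siblings rank higher, as find_similar() does
}
```

Both terms are deterministic functions of the graph, which is what makes each score reproducible and explainable as a concrete overlap plus a concrete path.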

Built on the terraphim-ai crate ecosystem. The graph embeddings are not a competition-specific prototype -- they are part of terraphim-ai, a production Rust library for knowledge graph-powered semantic search. Two crates are load-bearing in every end-to-end demo:

  • terraphim_rolegraph (v1.4.10) -- the MedicalRoleGraph behind #[cfg(feature = "medical")] provides typed nodes (27 MedicalNodeType variants: Disease, Drug, Gene, Variant, Procedure, etc.), typed edges (65 MedicalEdgeType variants: Treats, Contraindicates, IsA, HasVariant, etc.), IS-A hierarchy traversal, SNOMED ID resolution, adjacency index for O(degree) edge lookups, and the SymbolicEmbeddingIndex that computes build_embeddings(), symbolic_similarity(), and find_similar(). In e2e_pipeline.rs, the full API is exercised: ancestor/descendant traversal, treatment lookups, contraindication checks, symbolic similarity between disease pairs, and k-nearest-neighbour search. In e2e_real_model.rs, a 36-node KG with Treats/Contraindicates/IsA edges validates treatment retrieval for all 18 clinical scenarios. In consultation.rs, MedicalRoleGraph is a struct field -- get_treatments() and get_node_term() are called per patient entity during Step 3 (Knowledge Graph Query) of the workflow.

  • terraphim_automata (v1.4.10) -- EntityExtractor builds an Aho-Corasick automaton configured with LeftmostLongest match semantics over SNOMED CT terms and synonyms for O(n+m+z) clinical entity extraction. The LeftmostLongest policy is critical for grounding precision: when the ontology contains both "lung carcinoma" (SNOMED 363358000) and "non-small cell lung carcinoma" (SNOMED 254637007), the matcher returns the longest, most specific concept. This matters because the downstream KG query for treatment options differs significantly -- NSCLC has targeted therapies (Osimertinib, Gefitinib) while the generic "lung carcinoma" node may only have broad chemotherapy. Without LeftmostLongest, the extractor could ground to the wrong specificity level, causing the safety gate to miss drug-gene interactions or suggest inappropriate treatments. In e2e_pipeline.rs, both EntityExtractor::from_terms() (curated SNOMED terms) and ShardedUmlsExtractor::load_from_artifact() (209MB pre-built UMLS automaton with 1.4M patterns) are demonstrated. In e2e_real_model.rs, a 33-term extractor processes all 18 evaluation scenarios. In consultation.rs, EntityExtractor is a struct field that drives Step 2 (Entity Extraction) of every consultation -- the SnomedMatch return type carries concept IDs, spans, and term metadata through the rest of the pipeline.

Evaluation Results

| Metric | CPU run (b6321317) | GPU run (79d26e2e) | GPU run (f4af1ed9) |
|--------|--------------------|--------------------|--------------------|
| Evaluation cases | 18 | 18 | 18 |
| Pass rate | 18/18 (100%) | 18/18 (100%) | 18/18 (100%) |
| Safety failures | 0 | 0 | 0 |
| Average grounding score | 0.92 | 0.89 | 0.92 |
| Fully grounded cases | 15/18 | 14/18 | 15/18 |
| Gate: Safety | 100% | 100% | 100% |
| Gate: KG Grounding | 83.3% | 77.8% | 83.3% |
| Gate: Hygiene | 94.4% | 88.9% | 94.4% |
| Avg inference latency | 165.3s/case | 23.5s/case | 24.8s/case |

Three runs using real MedGemma GGUF inference with no mock fallback. GPU delivers 7x speedup (RTX 2070, 99 layers offloaded, sm_75, CUDA 13.0). Gate score variation between runs is expected LLM output variance, not instability. Report f4af1ed9 (2026-02-24) confirms stable results after API server wiring changes.

Demo Video

demo-video.mp4 (85 seconds, 1920x1080, 6.1MB) -- recorded with Playwright against the live API server with real MedGemma GPU inference. The video walks through:

  1. UI overview -- header, patient card (EGFR NSCLC), 7-step pipeline, 6 agent cards
  2. Demo mode -- simulated pipeline run for Patient 0 (EGFR NSCLC) and Patient 2 (Epilepsy, HLA-A*31:01)
  3. Patient switching -- dropdown selection across 6 clinical scenarios
  4. Live mode toggle -- connects to Rust API backend via WebSocket
  5. Real GPU inference -- MedGemma 4B on RTX 2070, ~25s per case, with step-by-step pipeline animation
  6. Results display -- treatment recommendation, confidence score, safety validation, reasoning text
  7. Evaluation report -- 18/18 cases passing the 3-gate harness

The recording script (tests/e2e/record-demo.spec.js) is fully reproducible: start the API server with GPU, then npx playwright test tests/e2e/record-demo.spec.js.

Interactive Demo (demo.html)

The system ships with a self-contained web UI (static/demo.html) served by the API server:

  • Demo mode -- pre-computed pipeline walkthrough showing all 6 agent steps with simulated data
  • Live mode -- real-time WebSocket streaming to the Rust API backend. Connects to /ws/recommend, streams 12 events (connection -> 4 phase pairs -> dual workflow_completed), displays MedGemma recommendation with treatment, confidence, reasoning, alternatives, and safety status

Live mode verified with 8 Playwright e2e tests covering:

  • API endpoints (health, entity extraction, LLM-enhanced recommendation)
  • WebSocket phase streaming and dual workflow_completed protocol
  • UI rendering (header, patient card, pipeline step nodes, metrics)
  • Full live pipeline with GPU inference (~25s end-to-end)

Technical details

Architecture

+----------------------------------------------------------------------+
|                    Terraphim Multi-Agent Pipeline                     |
|                                                                      |
|  [Entity Extractor]  -->  [Knowledge Graph]  -->  [PGx Validator]    |
|   Aho-Corasick            SNOMED CT + PrimeKG      CPIC Guidelines   |
|   LeftmostLongest         27 node / 65 edge types   Drug-gene rules  |
|   1.4M SNOMED patterns    Symbolic embeddings       <1ms             |
|   <1ms                    <1ms                                       |
|                                                                      |
|  [MedGemma Client]  <--  [Context Builder]  <-- [Orchestrator]       |
|   Vertex AI / GGUF        KG-augmented prompt     OTP supervision    |
|   2-5s cloud / 23s GPU    SNOMED+PGx injection    Fault-tolerant     |
+----------------------------------------------------------------------+

11 Rust crates, 543+ tests, 0 failures, 9 Playwright e2e tests (8 functional + 1 video recording). The system is implemented in Rust for performance-critical components (entity extraction, KG queries, PGx validation, all <1ms) with MedGemma inference via Vertex AI (cloud) or GGUF quantised model (edge). The API server (terraphim-api) uses Axum with shared state to serve REST endpoints, WebSocket streaming, and the interactive demo UI.

Performance

| Component | Latency | Details |
|-----------|---------|---------|
| Entity extraction | <1ms | Aho-Corasick LeftmostLongest, 1.4M SNOMED/UMLS patterns |
| Knowledge graph query | <1ms | In-memory, symbolic embeddings (Jaccard + path distance) |
| PGx validation | <1ms | CPIC rule lookup, drug-gene interaction check |
| MedGemma inference | 2-5s (Vertex AI) | MedGemma-4b-it via generateContent API |
| MedGemma inference | 112-627s (edge, CPU) | Q4_K_M GGUF, 2.3GB, avg 165s/case |
| MedGemma inference | 21-25s (edge, GPU) | Q4_K_M GGUF, RTX 2070 8GB, avg 23.5s/case |
| End-to-end pipeline | <10s (cloud) | Including all validation steps |

Deployment

  • Cloud: Vertex AI endpoint with OAuth2/ADC authentication via genai crate (Gemini adapter + Bearer auth). Note: Vertex AI GPU instances were unavailable during the evaluation period due to capacity shortages in us-central1; all real inference results reported here use local GGUF (GPU and CPU). The Vertex AI backend has been implemented, tested against the API, and is ready when capacity becomes available.
  • Edge: MedGemma 4B Q4_K_M (2.3GB) + SNOMED automata (50MB) + KG (100MB) = <4GB total. Runs offline on commodity hardware. No mock fallback -- panics if model unavailable.
  • API server with LLM: cargo run -p terraphim-api --features "local-gguf,cuda" starts an Axum REST/WebSocket server with GPU-accelerated MedGemma inference. The GGUF model is loaded once at startup via MEDGEMMA_GGUF_PATH env var and shared across all request handlers via Arc<ClinicalService>. See .env.template for configuration.
  • Interactive demo: http://localhost:3001/ serves demo.html with Demo mode (simulated) and Live mode (real GPU inference via WebSocket streaming).
  • Reproducibility:
    cp .env.template .env  # Edit with your HF_TOKEN and MEDGEMMA_GGUF_PATH
    cargo test --workspace  # 543 tests
    cargo run -p terraphim-api --features "local-gguf,cuda"  # API server + demo UI
    cargo run -p terraphim-evaluation --features "local-gguf,cuda" --bin evaluation-runner  # 18/18 eval
    cargo run -p terraphim-demo --features "local-gguf,cuda" --example e2e_pipeline  # Full pipeline
    npx playwright test tests/e2e/demo-live.spec.js  # 8 e2e tests (requires API server running)
    npx playwright test tests/e2e/record-demo.spec.js  # Record demo video (85s, GPU inference)
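The load-once, share-everywhere pattern used for Arc<ClinicalService> can be sketched with std only. The struct and method names below are illustrative stand-ins (the real server shares the service across Axum request handlers, not raw threads, and actually loads the GGUF model):

```rust
use std::sync::Arc;
use std::thread;

// Illustrative sketch: an expensive-to-load service is created once at
// startup and shared across concurrent "handlers" via Arc.
struct ClinicalService {
    model_path: String,
}

impl ClinicalService {
    fn load() -> Self {
        // The real server reads MEDGEMMA_GGUF_PATH and loads the GGUF model once here.
        let model_path =
            std::env::var("MEDGEMMA_GGUF_PATH").unwrap_or_else(|_| "medgemma-q4_k_m.gguf".into());
        ClinicalService { model_path }
    }

    fn recommend(&self, case: &str) -> String {
        // Placeholder for real MedGemma inference.
        format!("[{}] recommendation for {case}", self.model_path)
    }
}

fn main() {
    let svc = Arc::new(ClinicalService::load()); // expensive load happens exactly once
    let handles: Vec<_> = (0..4)
        .map(|i| {
            let svc = Arc::clone(&svc); // cheap pointer clone per concurrent handler
            thread::spawn(move || svc.recommend(&format!("case-{i}")))
        })
        .collect();
    let results: Vec<String> = handles.into_iter().map(|h| h.join().unwrap()).collect();
    assert_eq!(results.len(), 4);
}
```

The design point is that the 2.3GB model never needs reloading per request: handlers hold reference-counted pointers to one immutable service instance.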


Limitations

  • No fine-tuning -- MedGemma used as-is with prompt engineering and KG grounding
  • Multimodal capabilities (MedGemma vision) not yet integrated -- architecture supports it via ImagingAnalysisAgent role
  • 18-case evaluation suite covers pharmacogenomics and oncology; broader clinical domains need additional test cases
  • Edge inference on CPU (165s/case) is viable for batch evaluation but not interactive use -- GPU (23.5s/case) or cloud inference (2-5s) recommended for production