# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## What This Is

health-memex is a *-memex ecosystem archive for personal health data. It consolidates medical records from multiple EHR (Electronic Health Record) systems into a single queryable SQLite database, then exposes that data via an MCP server, a CLI, arkiv export, and a self-contained HTML SPA.

Part of the *-memex family (see ~/github/memex/CLAUDE.md for the ecosystem contract). Each archive covers one domain and satisfies a common contract: SQLite + FTS5 backend, MCP server, thin admin CLI, import pipelines, arkiv export, durable record IDs, and marginalia.

Cross-archive URI prefix: health-memex://

## Project Structure

Package code lives in src/health_memex/. Tests live in tests/.

```bash
# Development setup
pip install -e ".[dev,mcp]"
```

## Commands

```bash
# Run all tests (pytest)
python -m pytest tests/

# Run a single test file
python -m pytest tests/test_adapters.py

# Run a single test class or method
python -m pytest tests/test_adapters.py::TestEpicAdapter::test_lab_panel_explosion

# Run tests with coverage
python -m pytest tests/ --cov=health_memex --cov-report=term-missing

# Lint (ruff configured in pyproject.toml)
ruff check src/ tests/
ruff format --check src/ tests/

# Type check (mypy, also runs in CI)
mypy src/health_memex/

# Load data from EHR exports
health-memex load epic <dir>
health-memex load meditech <dir>
health-memex load athena <dir>
health-memex load auto <dir-or-file>          # Auto-detect source type
health-memex load mychart-visit <file.mhtml>   # MyChart visit page MHTML
health-memex load mychart-test-result <file.mhtml>  # MyChart test result MHTML
health-memex load all --epic-dir <> --meditech-dir <> --athena-dir <>
health-memex load analyses <dir>               # Load analysis markdown files

# Query and inspect
health-memex query "SELECT test_name, value, result_date FROM lab_results ORDER BY result_date DESC"
health-memex summary

# What's new since a given date (visit diff)
health-memex diff 2025-01-01

# Export formats: arkiv, html
health-memex export arkiv --output ./arkiv/
health-memex export arkiv --output ./arkiv/ --embed          # inline base64 assets
health-memex export html --output summary.html
health-memex export html --output summary.html --embed-images --config health_memex.toml
health-memex export html --output summary.html --ai-chat --proxy-url https://proxy.example.com/v1/messages

# Import from arkiv archive (round-trip capable)
health-memex import ./arkiv/ --db new.db
health-memex import ./arkiv/ --validate-only

# Generate personalized config from your data
health-memex init-config

# Personal notes
health-memex notes list --limit 20
health-memex notes search --tag oncology --query "CEA"

# Start MCP server (matches ecosystem pattern: memex mcp, btk mcp)
health-memex mcp --db health_memex.db
```

## Architecture

### Three-Stage Data Pipeline

Every EHR source goes through the same pipeline, and each stage is independently testable:

```
Raw EHR files (XML/FHIR/MHTML)
    |
[Source Parser]  -> source-specific dict  (sources/*.py)
    |
[Adapter]        -> UnifiedRecords        (adapters/*_adapter.py)
    |
[DB Loader]      -> SQLite tables         (db.py using schema.sql)
```

- Shared parsing infrastructure lives in core/cda.py (CDA R2 XML: namespace handling, section extraction, date formatting) and core/fhir.py (FHIR R4 Bundle: resource extraction by type, base64 decoding of presented forms). Source parsers build on these.
- Source parsers handle format-specific XML/FHIR/HTML parsing and return dicts with keys like lab_results, medications, problems, clinical_notes, etc.
- Adapters normalize dates to ISO 8601, parse numeric values, deduplicate records, and map everything into dataclass instances (models.py).
- The DB loader uses UPSERT (INSERT...ON CONFLICT...DO UPDATE) for stable autoincrement IDs across re-imports. replace=True mode also cleans up stale records.

After loading, the CLI prints a stage comparison table (parser count -> adapter count -> DB count) to verify there is no silent data loss.
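The pipeline and its stage-count check can be sketched as follows; the function names, dict keys, and record shapes here are illustrative stand-ins, not the project's real API:

```python
# Toy sketch of the three-stage pipeline and the stage comparison the CLI
# prints after loading. All names are illustrative stand-ins.

def parse_source(raw_files):
    # Stage 1: format-specific parsing into a plain dict.
    return {"lab_results": [{"test": "CEA", "value": "1.8", "date": "2025-01-02"}]}

def adapt(parsed):
    # Stage 2: normalize into unified records (ISO dates, dedup, typing).
    return [dict(record, source="epic") for record in parsed["lab_results"]]

def load(records):
    # Stage 3: UPSERT into SQLite; here we just report rows "written".
    return len(records)

parsed = parse_source([])
records = adapt(parsed)
counts = {"parser": sum(len(v) for v in parsed.values()),
          "adapter": len(records),
          "db": load(records)}
# Equal counts at every stage means no silent data loss.
assert counts["parser"] == counts["adapter"] == counts["db"] == 1
```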

### Source Types

| Source | Format | Parser | Adapter |
| --- | --- | --- | --- |
| Epic MyChart | CDA R2 XML (IHE XDM) | sources/epic.py | adapters/epic_adapter.py |
| MEDITECH Expanse | CCDA XML + FHIR JSON (dual-format merge) | sources/meditech.py | adapters/meditech_adapter.py |
| athenahealth | FHIR R4 Bundle XML | sources/athena.py | adapters/athena_adapter.py |
| MyChart Visit MHTML | MIME HTML (visit notes, images) | sources/mhtml_visit.py | adapters/mhtml_visit_adapter.py |
| MyChart Test Result MHTML | MIME HTML (genomic panels) | sources/mhtml_test_result.py | adapters/mhtml_test_result_adapter.py |

### MEDITECH Dual-Format Merge

Unlike Epic (CDA-only) and athena (FHIR-only), the MEDITECH adapter merges two parallel data streams:

- FHIR JSON (US Core FHIR Resources.json): structured coded data (LOINC, ICD-10, RxNorm) for encounters, conditions, medications, observations, immunizations
- CCDA XML (CCDA/*.xml, UUID-named files): HTML-table-based extraction for labs, meds, notes, vitals, allergies, social/family/mental history

The adapter deduplicates across formats using composite keys (e.g., (test.lower(), date_iso, value) for labs, name.lower() for conditions). FHIR conditions override CCDA problems when names match. This dual-format merge is the most complex adapter path and is tested with dedicated fixtures.
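That composite-key merge can be sketched as follows (record dicts are simplified stand-ins; only the lab key shape comes from the adapter as described above):

```python
# Merge FHIR and CCDA lab streams, dropping duplicates on a composite key.
# FHIR records come first so they win when both streams carry the same row.
def merge_labs(fhir_labs, ccda_labs):
    seen, merged = set(), []
    for record in list(fhir_labs) + list(ccda_labs):
        key = (record["test"].lower(), record["date_iso"], record["value"])
        if key not in seen:
            seen.add(key)
            merged.append(record)
    return merged

merged = merge_labs(
    [{"test": "CEA", "date_iso": "2025-01-02", "value": "1.8"}],
    [{"test": "cea", "date_iso": "2025-01-02", "value": "1.8"},   # duplicate
     {"test": "CEA", "date_iso": "2025-03-10", "value": "2.1"}],  # unique
)
assert len(merged) == 2
```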

### Unified Data Model (models.py)

18 dataclass types map 1:1 to SQLite tables. 17 are in _TABLE_MAP (bulk-loaded via the adapter pipeline); PatientRecord is loaded separately. All dates are ISO YYYY-MM-DD strings. Every record carries a source field for provenance tracking. The UnifiedRecords container holds all records from a single source load.

Lab results carry both value (text, which accommodates results like <0.5 or positive) and value_numeric (float, NULL when the text is not parseable).

Note: LabResult is the only dataclass that doesn't follow the *Record/*Report/*Variant naming convention.
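The value/value_numeric split can be sketched with a hypothetical parse_numeric helper (not the project's actual function):

```python
# Keep the raw text verbatim; derive a float only when the text parses.
def parse_numeric(value: str):
    try:
        return float(value)
    except ValueError:
        return None  # stored as NULL for values like "<0.5" or "positive"

assert parse_numeric("4.2") == 4.2
assert parse_numeric("<0.5") is None
assert parse_numeric("positive") is None
```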

### Database (db.py, schema.sql)

SQLite with WAL mode and foreign keys enabled. 17 clinical tables + load_log audit trail + notes/note_tags + analyses/analysis_tags + source_assets. Key indexes on lab dates/test names/LOINC codes, vital types/dates, procedure/imaging dates. Pathology reports FK to procedures with ON DELETE SET NULL.

UPSERT loading (_UNIQUE_KEYS in db.py): Each table has a natural key used for conflict detection. load_source(records, replace=True) does UPSERT + stale cleanup (bulk import). load_source(records, replace=False) does UPSERT only (additive import, e.g., MHTML).

db.query() returns list[dict] (via sqlite3.Row factory).

The main class is HealthMemexDB (in db.py).
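The UPSERT pattern behind stable IDs can be demonstrated with a toy table (columns and key are illustrative; the real natural keys live in _UNIQUE_KEYS):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE lab_results (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    test_name TEXT, result_date TEXT, value TEXT,
    UNIQUE (test_name, result_date))""")

upsert = """INSERT INTO lab_results (test_name, result_date, value)
            VALUES (?, ?, ?)
            ON CONFLICT (test_name, result_date)
            DO UPDATE SET value = excluded.value"""

con.execute(upsert, ("CEA", "2025-01-02", "1.8"))
con.execute(upsert, ("CEA", "2025-01-02", "2.0"))  # re-import: update, not insert
rows = con.execute("SELECT id, value FROM lab_results").fetchall()
assert rows == [(1, "2.0")]  # autoincrement id stayed stable across re-imports
```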

### MCP Server (mcp/server.py)

FastMCP server with a HEALTH_MEMEX_DB env var for the database path. 31 tools across four categories:

- Read-only SQL: run_sql, get_schema, get_database_summary
- Clinical queries: query_labs, get_lab_series_tool, get_available_tests_tool, get_abnormal_labs_tool, get_medications, reconcile_medications_tool, search_notes, get_pathology_report, get_source_files, get_asset_summary
- Compound analysis: get_visit_diff, get_visit_prep, get_visit_prep_bundle, get_surgical_timeline, match_cross_source_encounters, get_data_quality_report, get_clinical_summary_tool, get_timeline
- Write operations: save_note, get_note, search_notes_personal, delete_note, save_analysis, get_analysis, search_analyses, list_analyses, delete_analysis, write_record

Design principle: the LLM writes its own SQL for all reads via run_sql + get_schema. Write operations (notes, analyses) go through dedicated tools with controlled parameters. write_record allows inserting clinical records into any clinical table but requires a source and validates column names against the schema.

Read-only SQL uses SQLite URI mode=ro for engine-level enforcement and blocks ATTACH/DETACH to prevent a writable-database bypass. Results are capped at 5000 rows.
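The engine-level enforcement can be shown with plain sqlite3 (the file path here is illustrative):

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "health_memex.db")
writer = sqlite3.connect(path)
writer.execute("CREATE TABLE t (x)")
writer.execute("INSERT INTO t VALUES (1)")
writer.commit()
writer.close()

# mode=ro makes the engine itself reject writes, not just the tool layer.
ro = sqlite3.connect(f"file:{path}?mode=ro", uri=True)
assert ro.execute("SELECT x FROM t").fetchone() == (1,)
blocked = False
try:
    ro.execute("INSERT INTO t VALUES (2)")
except sqlite3.OperationalError:
    blocked = True
assert blocked
```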

### Data Access Modules (analysis/)

Parameterized query helpers that surface structured views of the data for LLMs (via MCP) and the CLI:

- lab_trends.py: lab values by test/date/LOINC, flagged abnormals, cross-source series
- medications.py: active meds, history, cross-source grouping that surfaces status conflicts
- surgical_timeline.py: procedures with linked pathology/imaging/meds by date proximity
- visit_prep.py: generate_visit_prep (quick summary) and generate_visit_prep_bundle (a comprehensive 10-section context for LLM-driven visit prep, resolving key test trends from config with a frequency-based fallback)
- visit_diff.py: everything new since date X across all clinical tables
- clinical_summary.py: full clinical picture in one call (conditions, meds, labs, vitals, imaging, encounters, procedures, pathology)
- timeline.py: unified chronological timeline merging all event types, with labs grouped by (date, source) to avoid flooding
- data_quality.py: cross-source duplicate detection, source coverage matrix
- cross_source.py: cross-source encounter matching by date

### Extractors (extractors/)

Domain-specific extraction helpers used by adapters and analysis modules:

- labs.py: CEA extraction from FHIR observations and parsed lab results
- pathology.py: structured section parsing from pathology report text (diagnosis, gross, microscopic, staging, margins) and procedure linkage by date proximity + specimen similarity

### Export Modules

- export_arkiv.py: Arkiv universal record format (JSONL + README.md + schema.yaml). Primary backup/restore format with full round-trip support. Source assets are exported to media/ or inlined as base64 via --embed. Record URIs use the health-memex:{table}/{id} prefix.
- import_arkiv.py: Arkiv import with validation, FK remapping, tag unfolding, and source asset restoration. Accepts both health-memex: and legacy chartfold: URI prefixes.

### SPA (spa/)

Self-contained HTML SPA with an embedded SQLite database via sql.js (WebAssembly). All data stays client-side. Supports --embed-images, --config, --ai-chat, and --proxy-url.

- spa/export.py: HTML generation. Assembles JS/CSS/SQL data into a single-file SPA.
- spa/chat_prompt.py: system prompt generation for AI chat (schema + stats + current analyses). _CLINICAL_TABLES is derived from db.py's _UNIQUE_KEYS (not hardcoded).
- spa/js/: 8 modules: app.js (init/lifecycle), db.js (sql.js wrapper), router.js (hash-based navigation), sections.js (all content sections, including visit_prep and print_summary), chart.js (line charts via ChartRenderer), chat.js (AI agent loop + chat UI, conditionally included), ui.js (shared UI helpers), markdown.js (markdown rendering).
- spa/css/: styles.css (main) + chat.css (chat panel, conditionally included).

AI chat (--ai-chat): client-side agent loop with two tools: run_sql (SQL queries) and render_chart (inline line charts). Conversation persists across SPA navigation via DOM detach/reattach. Sliding window (MAX_MESSAGES=40) with pair-aware trimming.
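The pair-aware trimming can be sketched in Python (the real logic lives in spa/js/chat.js and may differ in detail):

```python
MAX_MESSAGES = 40  # documented window size

def trim(messages):
    # Drop the oldest messages until under the cap, removing user/assistant
    # pairs together so the window never opens on a dangling assistant turn.
    while len(messages) > MAX_MESSAGES:
        drop = 2 if (len(messages) >= 2
                     and messages[0]["role"] == "user"
                     and messages[1]["role"] == "assistant") else 1
        messages = messages[drop:]
    return messages

history = [{"role": "user" if i % 2 == 0 else "assistant", "content": str(i)}
           for i in range(50)]
trimmed = trim(history)
assert len(trimmed) == MAX_MESSAGES
assert trimmed[0]["role"] == "user"
```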

### Analysis Parser (analysis_parser.py)

Parses analysis markdown files with optional YAML frontmatter (title, category, tags, summary). Files without frontmatter use the filename as title. Used by health-memex load analyses <dir> and the save_analysis MCP tool.
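A sketch of the frontmatter handling, with a simplified stdlib-only key: value parse standing in for the project's PyYAML parsing:

```python
import tempfile
from pathlib import Path

def parse_analysis(path: Path):
    text = path.read_text()
    meta, body = {}, text
    if text.startswith("---\n"):
        header, sep, rest = text[4:].partition("\n---\n")
        if sep:  # closing fence found: treat the header as frontmatter
            body = rest
            for line in header.splitlines():
                key, _, value = line.partition(":")
                meta[key.strip()] = value.strip()
    meta.setdefault("title", path.stem)  # filename fallback
    return meta, body

root = Path(tempfile.mkdtemp())
(root / "cea-trend.md").write_text(
    "---\ntitle: CEA trend\ncategory: oncology\n---\nBody text.\n")
(root / "notes.md").write_text("Just body.")
meta, body = parse_analysis(root / "cea-trend.md")
assert meta == {"title": "CEA trend", "category": "oncology"}
assert body == "Body text.\n"
assert parse_analysis(root / "notes.md")[0]["title"] == "notes"
```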

### Configuration (config.py)

TOML config (health_memex.toml) for personalized settings, such as which key tests to chart and dashboard options. Auto-generated from DB contents via health-memex init-config.

## Ecosystem Alignment

This archive satisfies the *-memex contract (see ~/github/memex/CLAUDE.md):

| Contract item | Implementation |
| --- | --- |
| SQLite + FTS5 backend | db.py + schema.sql (WAL mode, foreign keys) |
| MCP server | mcp/server.py (FastMCP, 31 tools, health-memex mcp) |
| Thin admin CLI | cli.py (load, export, import, query, summary, notes) |
| Import pipelines | 5 source types (Epic, MEDITECH, athena, MyChart MHTML x2) |
| Export: arkiv | export_arkiv.py (JSONL + schema.yaml + README.md) |
| Export: HTML SPA | spa/export.py (sql.js, optional AI chat) |
| Durable record IDs | UPSERT with natural keys (_UNIQUE_KEYS in db.py) |
| Marginalia | notes/note_tags tables, linked to any clinical record |

Cross-archive URI scheme: Records are addressable as health-memex://{kind}/{id}. Arkiv exports use health-memex:{table}/{id} URIs.

## Key Conventions

- All dates are stored as ISO YYYY-MM-DD strings. Date normalization lives in core/utils.py (normalize_date_to_iso).
- Source parsers use lxml, with optional recover=True for XML with encoding issues (MEDITECH). MHTML parsers use the Python stdlib email module + lxml.html XPath (NOT cssselect, which requires an extra package).
- Deduplication happens at the adapter stage using deduplicate_by_key from core/utils.py.
- Tests use pytest fixtures from tests/conftest.py: tmp_db, sample_unified_records, sample_epic_data, sample_meditech_data, sample_athena_data, and surgical_db.
- Roundtrip tests (test_roundtrip.py) verify that record counts are preserved through all pipeline stages.
- Requires Python 3.11+ (tomllib from stdlib). Dependencies: lxml, pyyaml. Optional: mcp (FastMCP) for the MCP server.
- CI tests on Python 3.11 and 3.12. Lint and typecheck run on 3.11 only.
- Ruff for linting (configured in pyproject.toml), line length 100, target Python 3.11.
- Coverage minimum: 68% (configured in pyproject.toml).
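Normalization in the spirit of normalize_date_to_iso can be sketched as follows (the real function's signature and supported formats may differ):

```python
from datetime import datetime

def to_iso(raw: str):
    # Try CDA-style timestamps first, then common date layouts.
    for fmt in ("%Y%m%d%H%M%S", "%Y%m%d", "%Y-%m-%d", "%m/%d/%Y"):
        try:
            return datetime.strptime(raw, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None  # unparseable dates stay NULL rather than guessed

assert to_iso("20250102093000") == "2025-01-02"
assert to_iso("20250102") == "2025-01-02"
assert to_iso("01/02/2025") == "2025-01-02"
assert to_iso("unknown") is None
```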

## Adding a New EHR Source

1. Create sources/newsource.py with a process_*_export(input_dir) function returning a dict
2. Create adapters/newsource_adapter.py with a *_to_unified(data) -> UnifiedRecords function and a _parser_counts(data) helper
3. Add a SourceConfig in sources/base.py (if applicable)
4. Wire into cli.py (add a subcommand and a _load_newsource function)
5. Add fixtures in tests/conftest.py and tests in test_newsource.py, test_adapters.py, and test_roundtrip.py
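Steps 1-2 can be sketched as a skeleton (dict keys and return values are simplified; the real adapter returns UnifiedRecords):

```python
# sources/newsource.py (skeleton)
def process_newsource_export(input_dir):
    # Parse format-specific files into the standard dict shape.
    return {"lab_results": [], "medications": [], "problems": []}

# adapters/newsource_adapter.py (skeleton)
def _parser_counts(data):
    # Per-section counts feed the CLI's stage comparison table.
    return {key: len(records) for key, records in data.items()}

def newsource_to_unified(data):
    # Normalize dates, dedupe, and build UnifiedRecords here; this sketch
    # just returns the counts so it stays runnable on its own.
    return _parser_counts(data)

counts = newsource_to_unified(process_newsource_export("exports/"))
assert counts == {"lab_results": 0, "medications": 0, "problems": 0}
```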

## Gotchas

- mhtml_test_result.py: the function test_result_to_unified starts with test_, so pytest tries to collect it as a test. Import it with from ... import test_result_to_unified as adapt_test_result in tests.
- source_assets are inserted via raw SQL in tests (not through the adapter pipeline).
- _UNIQUE_KEYS in db.py must match the UNIQUE constraints declared in schema.sql.
- When adding a new table: update _TABLE_MAP, _UNIQUE_KEYS, schema.sql, models.py, export_arkiv.py (_TIMESTAMP_FIELDS, _COLLECTION_DESCRIPTIONS, _FK_FIELDS if applicable), and analysis/visit_diff.py; if it's a non-clinical table, also add it to _NON_CLINICAL_TABLES in spa/chat_prompt.py.
- Legacy arkiv archives exported under the old chartfold name use chartfold: URI prefixes. The importer (import_arkiv.py) accepts both health-memex: and chartfold: URIs for backwards compatibility.