TranslatorSRI · gaurav · Dec 2, 2025 · Dec 3, 2025
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
@@ -0,0 +1,36 @@
+name: CI
+
+on:
+  pull_request:
+  push:
+    branches: [main]
+  schedule:
+    - cron: "0 17 * * 2"  # Tuesdays at 12pm EST (17:00 UTC); 1pm during EDT
+  workflow_dispatch:
+
+jobs:
+  lint:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - uses: astral-sh/setup-uv@v5
+      - run: uv sync --group dev
+      - run: uv run ruff check src/ tests/
+      - run: uv run ruff format --check src/ tests/
+
+  test:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - uses: astral-sh/setup-uv@v5
+      - run: uv sync --group dev
+      - run: uv run pytest -v -m "not integration"
+
+  integration-test:
+    runs-on: ubuntu-latest
+    if: github.event_name != 'pull_request'
+    steps:
+      - uses: actions/checkout@v4
+      - uses: astral-sh/setup-uv@v5
+      - run: uv sync --group dev
+      - run: uv run pytest -v -m "integration and not slow"
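Note on the `schedule` trigger above: the cron comment's time conversion can be checked with a few lines of Python. This is only a sketch of the arithmetic, using fixed UTC offsets (UTC-5 for EST, UTC-4 for EDT) rather than a time zone database; the helper name `local_hour` is made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC offsets for US Eastern time (no tz database needed).
EST = timezone(timedelta(hours=-5), "EST")
EDT = timezone(timedelta(hours=-4), "EDT")


def local_hour(utc_hour: int, tz: timezone) -> int:
    """Return the local hour corresponding to the given UTC hour."""
    # The date is arbitrary; only the hour conversion matters here.
    moment = datetime(2025, 12, 2, utc_hour, tzinfo=timezone.utc)
    return moment.astimezone(tz).hour


print(local_hour(17, EST))  # 12
print(local_hour(17, EDT))  # 13
```

GitHub Actions evaluates cron schedules in UTC, so the job's local start time shifts by an hour when daylight saving time begins or ends.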
diff --git a/.gitignore b/.gitignore
@@ -1,3 +1,6 @@
+# Ignore data files.
+/data
+
 # Byte-compiled / optimized / DLL files
 __pycache__/
 *.py[codz]
@@ -14,8 +17,9 @@ dist/
 downloads/
 eggs/
 .eggs/
-lib/
-lib64/
+# Python distribution lib directories (not web/src/lib/)
+/lib/
+/lib64/
 parts/
 sdist/
 var/

diff --git a/.python-version b/.python-version
@@ -0,0 +1 @@
+3.11
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -0,0 +1,151 @@
+# CLAUDE.md
+
+This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
+
+## Project Overview
+
+babel-explorer is a tool for querying and exploring Babel intermediate files. It allows users to discover why two biological/chemical identifiers are considered identical by the Babel system, which handles cross-references between different ontology and database identifiers (e.g., MONDO, HP, UMLS, HGNC).
+
+## Development Setup
+
+This project uses **uv** for package management:
+
+```bash
+# Install dependencies
+uv sync
+
+# Install with dev dependencies
+uv sync --group dev
+
+# Run the CLI
+uv run babel-explorer --help
+```
+
+## Commands
+
+### Running the Application
+
+```bash
+# Get cross-references for one or more CURIEs
+uv run babel-explorer xrefs MONDO:0004979
+
+# Get cross-references with expansion (recursive lookup)
+uv run babel-explorer xrefs MONDO:0004979 --recurse
+
+# Get cross-references with labels from NodeNorm
+uv run babel-explorer xrefs MONDO:0004979 --labels
+
+# Get ID records for CURIEs
+uv run babel-explorer ids MONDO:0004979
+
+# Test concordance changes with NodeNorm
+uv run babel-explorer test-concord MONDO:0004979 HP:0000001
+
+# Use custom Babel server or local directory
+uv run babel-explorer xrefs MONDO:0004979 --local-dir data/2025nov19 --babel-url https://stars.renci.org:443/var/babel_outputs/2025nov19/
+```
+
+### Development Commands
+
+```bash
+# Run all tests (includes large file downloads)
+uv run pytest -v
+
+# Run unit tests only (fast, no network)
+uv run pytest -v -m "not integration"
+
+# Run integration tests without 2GB+ downloads
+uv run pytest -v -m "integration and not slow"
+
+# Run a single test file
+uv run pytest -v tests/test_nodenorm.py
+
+# Run linter
+uv run ruff check
+
+# Format code
+uv run ruff format
+```
+
+## Architecture
+
+### Core Components
+
+1. **BabelDownloader** (`src/babel_explorer/core/downloader.py`):
+   - Downloads Babel intermediate files from a remote HTTP(S) server using Python's `requests` library (streaming downloads)
+   - Caches files locally in a configurable directory (default: `data/2025nov19/`)
+   - Uses `@functools.lru_cache` to avoid re-downloading
+   - **Important**: Requires network access but no external tools like `wget`
+
+2. **BabelXRefs** (`src/babel_explorer/core/babel_xrefs.py`):
+   - Main query engine for cross-references
+   - Uses DuckDB to query Parquet files (`Concord.parquet`, `Identifiers.parquet`)
+   - Supports recursive expansion of cross-references via a single `WITH RECURSIVE` query
+   - Uses ephemeral in-memory DuckDB connections (nothing written to disk)
+
+3. **NodeNorm** (`src/babel_explorer/core/nodenorm.py`):
+   - Integration with NodeNormalization API (https://nodenormalization-sri.renci.org/)
+   - Fetches labels, biolink types, and equivalent identifiers for CURIEs
+   - Uses `@functools.lru_cache` for performance
+   - Optional component for label enrichment
+
+4. **CLI** (`src/babel_explorer/cli.py`):
+   - Click-based command-line interface
+   - Three main commands: `xrefs`, `ids`, `test-concord`
+
+### Data Flow
+
+1. User provides CURIEs via CLI
+2. BabelDownloader ensures required Parquet files are downloaded
+3. BabelXRefs queries files using DuckDB
+4. If the `--labels` or `--recurse` flag is set, NodeNorm is queried for additional metadata
+5. Results are printed to stdout
+
+### Key Design Patterns
+
+- **Lazy downloading**: Files are only downloaded when first accessed
+- **LRU caching**: Heavy use of `@functools.lru_cache` to avoid redundant downloads and API calls
+- **Recursive expansion**: The `--recurse` flag recursively follows all cross-references to build complete graphs
+- **DuckDB for querying**: In-memory SQL queries against Parquet files for fast lookups
+
+## Testing
+
+### Test Structure
+
+Tests live in `tests/` and are split into fast **unit tests** (mocked, no network) and slower **integration tests** (real downloads and API calls). Pytest markers control which tests run:
+
+- **`@pytest.mark.integration`** — requires network access (downloads Parquet files or calls NodeNorm API)
+- **`@pytest.mark.slow`** — downloads very large files (2 GB+)
+
+| File | Unit | Integration | Slow | Total |
+|------|------|-------------|------|-------|
+| `tests/test_downloader.py` | 41 | 4 | 1 | 46 |
+| `tests/test_babel_xrefs.py` | 23 | 20 | 3 | 46 |
+| `tests/test_nodenorm.py` | 20 | 13 | 0 | 33 |
+| `tests/test_cli.py` | 24 | 0 | 0 | 24 |
+
+### Test Infrastructure
+
+- **`tests/conftest.py`** — Session-scoped fixtures that download Parquet files once and share them across all integration tests. Teardown removes the `data/test/` directory so the next run starts fresh.
+- **`tests/constants.py`** — Shared constants (URLs, file paths) and `load_curies()` helper.
+- **`tests/data/valid_curies.txt`** — One CURIE per line (`#` comments allowed). Integration tests are parametrized over this list — adding a new line automatically expands test coverage.
+
+### Key Dataclasses
+
+- **`Identifier`** — Frozen dataclass for a normalized NodeNorm entry (curie, label, biolink_type, taxa, description). Returned by `NodeNorm.get_identifier()` and `get_clique_identifiers()`.
+- **`CrossReference`** — Frozen dataclass for Concord.parquet rows (filename, subj, pred, obj)
+- **`LabeledCrossReference`** — Extends CrossReference with labels and biolink types from NodeNorm
+- **`IdentifierRecord`** — Frozen dataclass for Identifiers.parquet rows (curie + dynamic extra fields). Returned by `BabelXRefs.get_curie_ids()`.
+
+## Important Notes
+
+- **Data directory**: The `data/` directory is gitignored and contains downloaded Parquet files (DuckDB connections are in-memory, so no database files are written there)
+- **Babel versions**: The default Babel version is `2025nov19`, but this can be customized via `--local-dir` and `--babel-url`
+
+## File Locations
+
+- Source code: `src/babel_explorer/`
+- Tests: `tests/`
+- Test CURIEs: `tests/data/valid_curies.txt`
+- Downloaded Babel files: `data/<version>/duckdb/*.parquet`
+- Entry point: `src/babel_explorer/cli.py`
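Note on the BabelXRefs description in CLAUDE.md: the recursive expansion via a single `WITH RECURSIVE` query can be sketched as below. The actual query is not part of this diff, so this is an illustration under assumptions: it uses the standard-library `sqlite3` module as a stand-in for DuckDB, an in-memory table instead of `Concord.parquet`, and made-up `EX:` rows. Only the `CrossReference` field names (filename, subj, pred, obj) come from the document; `expand_xrefs` and the table layout are hypothetical.

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class CrossReference:
    """One concord row; field names follow the description in CLAUDE.md."""

    filename: str
    subj: str
    pred: str
    obj: str


# Made-up rows for illustration only; these are not real Babel data.
ROWS = [
    ("concord_a", "EX:1", "skos:exactMatch", "EX:2"),
    ("concord_b", "EX:2", "skos:exactMatch", "EX:3"),
    ("concord_c", "EX:9", "skos:exactMatch", "EX:8"),
]

# UNION (not UNION ALL) drops already-seen CURIEs, so cycles terminate.
RECURSIVE_QUERY = """
WITH RECURSIVE reachable(curie) AS (
    SELECT ?
    UNION
    SELECT CASE WHEN c.subj = r.curie THEN c.obj ELSE c.subj END
    FROM concord AS c
    JOIN reachable AS r ON c.subj = r.curie OR c.obj = r.curie
)
SELECT c.filename, c.subj, c.pred, c.obj
FROM concord AS c
WHERE c.subj IN (SELECT curie FROM reachable)
ORDER BY c.subj, c.obj
"""


def expand_xrefs(start_curie: str) -> list[CrossReference]:
    """Return every cross-reference connected to start_curie, in either direction."""
    conn = sqlite3.connect(":memory:")  # ephemeral: nothing is written to disk
    try:
        conn.execute(
            "CREATE TABLE concord (filename TEXT, subj TEXT, pred TEXT, obj TEXT)"
        )
        conn.executemany("INSERT INTO concord VALUES (?, ?, ?, ?)", ROWS)
        rows = conn.execute(RECURSIVE_QUERY, (start_curie,)).fetchall()
    finally:
        conn.close()
    return [CrossReference(*row) for row in rows]


for xref in expand_xrefs("EX:3"):
    print(xref.subj, "->", xref.obj)
# EX:1 -> EX:2
# EX:2 -> EX:3
```

Starting from `EX:3` still returns the `EX:1` row because the recursive step follows each cross-reference from either end. DuckDB also supports `WITH RECURSIVE`, but the exact SQL used by `BabelXRefs` may differ from this sketch.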
diff --git a/FUTURE.md b/FUTURE.md
@@ -0,0 +1,7 @@
+# Future Work
+
+## Deduplicate CLI option blocks
+
+`--local-dir`, `--babel-url`, and `--check-download` are copy-pasted between the
+`xrefs` and `ids` commands in `cli.py`. Extract a `@common_babel_options` Click
+decorator so defaults are defined in one place and can't drift.
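Note on FUTURE.md: one way the proposed `@common_babel_options` decorator could be built is by stacking the individual option decorators inside a single helper. The sketch below is library-agnostic so that it runs without Click installed: `fake_option` is a hypothetical stand-in for `click.option`, `stack_decorators` is an invented helper name, and the `None` defaults are placeholders because the real defaults for `--babel-url` and `--check-download` are not shown in this diff.

```python
from typing import Any, Callable

Decorator = Callable[[Callable[..., Any]], Callable[..., Any]]


def stack_decorators(*decorators: Decorator) -> Decorator:
    """Combine decorators into one, applied as if stacked top to bottom."""

    def apply(func: Callable[..., Any]) -> Callable[..., Any]:
        # Apply the last decorator first, exactly as stacked @-lines would.
        for decorator in reversed(decorators):
            func = decorator(func)
        return func

    return apply


def fake_option(name: str, default: Any) -> Decorator:
    """Hypothetical stand-in for click.option: records (name, default)."""

    def decorator(func: Callable[..., Any]) -> Callable[..., Any]:
        func.options = [(name, default)] + getattr(func, "options", [])
        return func

    return decorator


# Defaults are defined once here, so the two commands cannot drift apart.
common_babel_options = stack_decorators(
    fake_option("--local-dir", "data/2025nov19"),
    fake_option("--babel-url", None),  # placeholder default
    fake_option("--check-download", None),  # placeholder default
)


@common_babel_options
def xrefs() -> None:
    """Stand-in for the xrefs command body."""


@common_babel_options
def ids() -> None:
    """Stand-in for the ids command body."""


print(xrefs.options == ids.options)  # True
print([name for name, _ in xrefs.options])
# ['--local-dir', '--babel-url', '--check-download']
```

With Click, the same helper could presumably receive `click.option(...)` calls in place of `fake_option(...)`, since `click.option` also returns a decorator; the combined decorator would then sit below the command decorator on each command.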
diff --git a/README.md b/README.md
@@ -1,2 +1,56 @@
 # Babel Explorer
-Software for querying and exporting Babel intermediate files
+Software for querying and exploring Babel intermediate files.
+
+babel-explorer allows you to discover why two biological/chemical identifiers are considered identical by the [Babel](https://github.com/TranslatorSRI/Babel) system, which handles cross-references between different ontology and database identifiers (e.g., MONDO, HP, UMLS, HGNC).
+
+## Setup
+
+This project uses [uv](https://docs.astral.sh/uv/) for package management:
+
+```bash
+uv sync --group dev
+```
+
+## Usage
+
+```bash
+# Get cross-references for one or more CURIEs
+uv run babel-explorer xrefs MONDO:0004979
+
+# Get cross-references with expansion (recursive lookup)
+uv run babel-explorer xrefs MONDO:0004979 --recurse
+
+# Get cross-references with labels from NodeNorm
+uv run babel-explorer xrefs MONDO:0004979 --labels
+
+# Get ID records for CURIEs
+uv run babel-explorer ids MONDO:0004979
+
+# Test concordance changes with NodeNorm
+uv run babel-explorer test-concord MONDO:0004979 HP:0000001
+```
+
+## Testing
+
+Tests are split into fast **unit tests** (mocked, no network) and slower **integration tests** (real file downloads and API calls), controlled by pytest markers.
+
+```bash
+# Unit tests only — fast, no network required
+uv run pytest -v -m "not integration"
+
+# Integration tests without 2GB+ downloads
+uv run pytest -v -m "integration and not slow"
+
+# Full suite including large file downloads
+uv run pytest -v
+```
+
+### Adding Test CURIEs
+
+Integration tests are parametrized over the CURIEs listed in `tests/data/valid_curies.txt`. Add a new CURIE on its own line to automatically expand test coverage:
+
+```
+# tests/data/valid_curies.txt
+MONDO:0004979
+HP:0000001
+```
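Note on the "Adding Test CURIEs" section of the README: the file format described there (one CURIE per line, `#` comment lines allowed) can be parsed with a short helper. The real `load_curies()` in `tests/constants.py` is not shown in this diff, so the signature and behavior below are assumptions; in particular, this sketch treats only whole-line `#` comments as comments and skips blank lines.

```python
import tempfile
from pathlib import Path


def load_curies(path: Path) -> list[str]:
    """Read one CURIE per line, skipping blank lines and '#' comment lines."""
    curies = []
    for line in path.read_text(encoding="utf-8").splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        curies.append(stripped)
    return curies


# Demonstrate with a temporary file that mirrors the README example.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "valid_curies.txt"
    sample.write_text(
        "# tests/data/valid_curies.txt\nMONDO:0004979\n\nHP:0000001\n",
        encoding="utf-8",
    )
    print(load_curies(sample))  # ['MONDO:0004979', 'HP:0000001']
```

A list like this can be passed to `@pytest.mark.parametrize("curie", ...)`, which is how adding a line to the file would produce an extra test case per parametrized test.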
diff --git a/pyproject.toml b/pyproject.toml
@@ -0,0 +1,35 @@
+[project]
+name = "babel-explorer"
+version = "0.1.0"
+description = "Tool for querying and exploring Babel APIs and intermediate files"
+readme = "README.md"
+requires-python = ">=3.11"
+dependencies = [
+    "click>=8.3.1",
+    "duckdb>=1.4.2",
+    "requests>=2.32.5",
+    "rich>=13",
+    "tqdm>=4.67.0",
+]
+
+[build-system]
+requires = ["hatchling"]
+build-backend = "hatchling.build"
+
+[dependency-groups]
+dev = [
+    "filelock>=3.16",
+    "pytest>=8.3.5",
+    "pytest-xdist[psutil]>=3.6",
+    "ruff>=0.11.0",
+]
+
+[project.scripts]
+babel-explorer = "babel_explorer.cli:cli"
+
+[tool.pytest.ini_options]
+addopts = "-n auto"
+markers = [
+    "integration: tests requiring network access (deselect with '-m \"not integration\"')",
+    "slow: tests downloading very large files 2GB+ (deselect with '-m \"not slow\"')",
+]
diff --git a/src/__init__.py b/src/__init__.py
diff --git a/src/babel_explorer/__init__.py b/src/babel_explorer/__init__.py