Open
68 commits
fcc27c2
This initializes a uv package in this repository.
gaurav Dec 2, 2025
876353d
Added basic CLI.
gaurav Dec 2, 2025
ec1d1f0
Add /data to the .gitignore.
gaurav Dec 2, 2025
eff8f26
Initial implementation of a basic xref query-er.
gaurav Dec 3, 2025
4d04e2a
Added a method to look up a particular identifier.
gaurav Dec 4, 2025
8531cb7
Added CURIE expansion/recursive lookup.
gaurav Dec 4, 2025
a1aeec6
Added a basic ConcordTester.
gaurav Dec 4, 2025
bb1eb99
Added labels via NodeNorm.
gaurav Dec 4, 2025
40c3338
Midnight commit: attempting to improve expansion.
gaurav Jan 8, 2026
8c41112
Added some improvements.
gaurav Jan 8, 2026
239c89f
Added a CLAUDE.md by Claude.ai.
gaurav Feb 14, 2026
8132fe1
Reorganized file slightly.
gaurav Feb 14, 2026
bd00972
Claude wrote some tests.
gaurav Feb 14, 2026
9cc06bc
Improved downloader using Claude.
gaurav Feb 14, 2026
da8bb0c
Added MD5 download functionality.
gaurav Feb 14, 2026
8f36b74
Removed empty model file.
gaurav Feb 15, 2026
0534fd8
Attempted to rename this package to babel-explorer.
gaurav Feb 15, 2026
0b3a9f5
Add comprehensive pytest suite for all core modules
gaurav Feb 15, 2026
8535202
Merge branch 'main' into basic-implementation-in-uv
gaurav Feb 17, 2026
ff0dacc
Added uv.lock (not sure why it wasn't added previously).
gaurav Mar 2, 2026
bacc72d
Update CLAUDE.md
gaurav Mar 2, 2026
96d9609
Update pyproject.toml
gaurav Mar 2, 2026
0c33e7e
Update src/babel_explorer/core/babel_xrefs.py
gaurav Mar 2, 2026
1aff013
Update src/babel_explorer/core/nodenorm.py
gaurav Mar 2, 2026
af76c15
Replace MD5 checksumming with HTTP header caching and freshness window
gaurav Mar 2, 2026
fb41da0
Added some CURIEs to test.
gaurav Mar 3, 2026
5c544a2
Partially changed --expand to --recurse.
gaurav Mar 3, 2026
280212a
More fully changed --expand to --recurse.
gaurav Mar 3, 2026
b522e6e
Add pytest-xdist for parallel test execution
gaurav Mar 3, 2026
e137c31
Replace Python recursion in get_curie_xrefs with DuckDB WITH RECURSIVE
gaurav Mar 3, 2026
b115d02
Fix xdist race condition: skip test-data cleanup in parallel runs
gaurav Mar 3, 2026
5a0f758
Made output a bit prettier.
gaurav Mar 3, 2026
be2fa36
Update src/babel_explorer/core/nodenorm.py
gaurav Mar 16, 2026
2b2aa7f
Simplify babel_xrefs: extract helper, remove dead fetches, fix defaul…
gaurav Mar 16, 2026
3cdd19c
Fix LabeledCrossReference: make it a frozen dataclass subclass
gaurav Mar 16, 2026
c7a3f16
Fix BabelDownloader: use tempfile.gettempdir() when local_path is None
gaurav Mar 16, 2026
c6635bc
Fix test-concord: guard against None from get_clique_identifiers
gaurav Mar 16, 2026
d74110e
Potential fix for pull request finding
gaurav Mar 16, 2026
4338bdc
Potential fix for pull request finding
gaurav Mar 16, 2026
be3e427
Potential fix for pull request finding
gaurav Mar 16, 2026
f8b718b
Potential fix for pull request finding
gaurav Mar 16, 2026
48c8e96
Potential fix for pull request finding
gaurav Mar 16, 2026
8fb37d6
Potential fix for pull request finding
gaurav Mar 16, 2026
49f5c3b
Potential fix for pull request finding
gaurav Mar 16, 2026
c952c12
Fix DuckDB connection leaks by using context managers
gaurav Mar 16, 2026
b0539bb
Fix and simplify test mocks for context manager protocol
gaurav Mar 16, 2026
6319212
Add configurable HTTP timeout to NodeNorm and BabelDownloader
gaurav Mar 16, 2026
b634d11
Fix _etag_matches docstring to match actual behavior
gaurav Mar 16, 2026
a7eb8c1
Got rid of ignore_curies_in_expansion, which is no longer used.
gaurav Mar 16, 2026
7163a64
Add ruff CI and fix all lint errors
gaurav Mar 16, 2026
f9e549e
Rename lint workflow to CI and add unit test job
gaurav Mar 16, 2026
7d1d5ca
Improved documentation.
gaurav Mar 16, 2026
0d9e32c
Add tests for parse_duration() in cli.py
gaurav Mar 16, 2026
0ca35eb
Add CliRunner tests for xrefs, ids, and test-concord commands
gaurav Mar 16, 2026
17782a2
Reformatted code with ruff.
gaurav Mar 26, 2026
d34c5c3
Update src/babel_explorer/cli.py
gaurav Mar 30, 2026
ec0a71c
Cache get_identifier() locals in _to_labeled_xref; root-anchor lib/ i…
gaurav Mar 30, 2026
67be81e
Fix bugs and gaps identified in PR #1 code review
gaurav Mar 30, 2026
4199ae2
Run integration tests on push to master and weekly on Tuesdays
gaurav Mar 30, 2026
17f6b09
Add module, class, and method docstrings to new files in PR #1
gaurav Mar 30, 2026
8216b0b
Fix LabeledCrossReference biolink_type fields to list[str]; simplify …
gaurav Apr 1, 2026
ac418ff
Address PR #1 review: frozen Identifier, atomic rename, fail-open HEA…
gaurav Apr 1, 2026
06cd300
Sync CLAUDE.md with current code
gaurav Apr 1, 2026
46f0863
Merge branch 'basic-implementation-in-uv' of github.com:TranslatorSRI…
gaurav Apr 1, 2026
bf1c48c
Address PR #1 review: fix six correctness and quality issues
gaurav Apr 1, 2026
49e27c0
Add --format [text|json|tsv|csv] option to all CLI commands
gaurav Apr 9, 2026
aecae50
Add console format with rich color highlighting; replace text default
gaurav Apr 9, 2026
f7cde3a
Fix Identifier.from_dict splitting string fields into characters
gaurav Apr 9, 2026
36 changes: 36 additions & 0 deletions .github/workflows/ci.yml
@@ -0,0 +1,36 @@
name: CI

on:
pull_request:
push:
branches: [main]
schedule:
- cron: "0 17 * * 2" # Tuesdays at 12pm EST (17:00 UTC); 1pm during EDT
workflow_dispatch:

jobs:
lint:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: astral-sh/setup-uv@v5
- run: uv sync --group dev
- run: uv run ruff check src/ tests/
- run: uv run ruff format --check src/ tests/

test:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: astral-sh/setup-uv@v5
- run: uv sync --group dev
- run: uv run pytest -v -m "not integration"

integration-test:
runs-on: ubuntu-latest
if: github.event_name != 'pull_request'
steps:
- uses: actions/checkout@v4
- uses: astral-sh/setup-uv@v5
- run: uv sync --group dev
- run: uv run pytest -v -m "integration and not slow"
8 changes: 6 additions & 2 deletions .gitignore
@@ -1,3 +1,6 @@
# Ignore data files.
/data

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[codz]
@@ -14,8 +17,9 @@ dist/
downloads/
eggs/
.eggs/
lib/
lib64/
# Python distribution lib directories (not web/src/lib/)
/lib/
/lib64/
parts/
sdist/
var/
1 change: 1 addition & 0 deletions .python-version
@@ -0,0 +1 @@
3.11
151 changes: 151 additions & 0 deletions CLAUDE.md
@@ -0,0 +1,151 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

babel-explorer is a tool for querying and exploring Babel intermediate files. It allows users to discover why two biological/chemical identifiers are considered identical by the Babel system, which handles cross-references between different ontology and database identifiers (e.g., MONDO, HP, UMLS, HGNC).

## Development Setup

This project uses **uv** for package management:

```bash
# Install dependencies
uv sync

# Install with dev dependencies
uv sync --group dev

# Run the CLI
uv run babel-explorer --help
```

## Commands

### Running the Application

```bash
# Get cross-references for one or more CURIEs
uv run babel-explorer xrefs MONDO:0004979

# Get cross-references with expansion (recursive lookup)
uv run babel-explorer xrefs MONDO:0004979 --recurse

# Get cross-references with labels from NodeNorm
uv run babel-explorer xrefs MONDO:0004979 --labels

# Get ID records for CURIEs
uv run babel-explorer ids MONDO:0004979

# Test concordance changes with NodeNorm
uv run babel-explorer test-concord MONDO:0004979 HP:0000001

# Use custom Babel server or local directory
uv run babel-explorer xrefs MONDO:0004979 --local-dir data/2025nov19 --babel-url https://stars.renci.org:443/var/babel_outputs/2025nov19/
```

### Development Commands

```bash
# Run all tests (includes large file downloads)
uv run pytest -v

# Run unit tests only (fast, no network)
uv run pytest -v -m "not integration"

# Run integration tests without 2GB+ downloads
uv run pytest -v -m "integration and not slow"

# Run a single test file
uv run pytest -v tests/test_nodenorm.py

# Run linter
uv run ruff check

# Format code
uv run ruff format
```

## Architecture

### Core Components

1. **BabelDownloader** (`src/babel_explorer/core/downloader.py`):
- Downloads Babel intermediate files from a remote HTTP(S) server using Python's `requests` library (streaming downloads)
- Caches files locally in a configurable directory (default: `data/2025nov19/`)
- Uses `@functools.lru_cache` to avoid re-downloading
- **Important**: Requires network access but no external tools like `wget`

2. **BabelXRefs** (`src/babel_explorer/core/babel_xrefs.py`):
- Main query engine for cross-references
- Uses DuckDB to query Parquet files (`Concord.parquet`, `Identifiers.parquet`)
- Supports recursive expansion of cross-references via a single `WITH RECURSIVE` query
- Uses ephemeral in-memory DuckDB connections (nothing written to disk)

3. **NodeNorm** (`src/babel_explorer/core/nodenorm.py`):
- Integration with NodeNormalization API (https://nodenormalization-sri.renci.org/)
- Fetches labels, biolink types, and equivalent identifiers for CURIEs
- Uses `@functools.lru_cache` for performance
- Optional component for label enrichment

4. **CLI** (`src/babel_explorer/cli.py`):
- Click-based command-line interface
- Three main commands: `xrefs`, `ids`, `test-concord`
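
The `@functools.lru_cache` pattern listed for BabelDownloader and NodeNorm above can be sketched as follows. This is a minimal illustration, not the project's code: `fetch_label` and the `CALLS` counter are hypothetical stand-ins for a real download or API call.

```python
import functools

# Hypothetical counter that records how often the "expensive" body really runs.
CALLS = {"count": 0}


@functools.lru_cache(maxsize=None)
def fetch_label(curie: str) -> str:
    """Pretend to look up a label for a CURIE (illustrative stand-in only)."""
    CALLS["count"] += 1
    return f"label for {curie}"


fetch_label("MONDO:0004979")
fetch_label("MONDO:0004979")  # served from the cache; the body does not run again
```

The cache is keyed on the call arguments, so they must be hashable, and cached results live until the process exits or `fetch_label.cache_clear()` is called.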

### Data Flow

1. User provides CURIEs via CLI
2. BabelDownloader ensures required Parquet files are downloaded
3. BabelXRefs queries files using DuckDB
4. If `--labels` or `--recurse` flags are set, NodeNorm is queried for additional metadata
5. Results are printed to stdout

### Key Design Patterns

- **Lazy downloading**: Files are only downloaded when first accessed
- **LRU caching**: Heavy use of `@functools.lru_cache` to avoid redundant downloads and API calls
- **Recursive expansion**: The `--recurse` flag recursively follows all cross-references to build complete graphs
- **DuckDB for querying**: In-memory SQL queries against Parquet files for fast lookups
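
The recursive expansion described above can be illustrated with a single `WITH RECURSIVE` query. The sketch below uses the standard-library `sqlite3` module as a stand-in for DuckDB so that it runs without extra dependencies; DuckDB also accepts `WITH RECURSIVE`, but the project's actual query is not shown in this document and may differ. The table name `concord`, the function `expand`, and the `EX:` rows are made up for illustration.

```python
import sqlite3
from contextlib import closing

# Made-up rows shaped like Concord.parquet: (filename, subj, pred, obj).
ROWS = [
    ("a.txt", "EX:1", "skos:exactMatch", "EX:2"),
    ("b.txt", "EX:3", "skos:exactMatch", "EX:2"),
    ("c.txt", "EX:8", "skos:exactMatch", "EX:9"),  # not connected to EX:1
]

# UNION (rather than UNION ALL) drops already-seen CURIEs, so the recursion
# stops once no new identifiers are reachable.
QUERY = """
WITH RECURSIVE reachable(curie) AS (
    SELECT ?
    UNION
    SELECT CASE WHEN c.subj = r.curie THEN c.obj ELSE c.subj END
    FROM concord AS c
    JOIN reachable AS r ON c.subj = r.curie OR c.obj = r.curie
)
SELECT curie FROM reachable ORDER BY curie
"""


def expand(conn: sqlite3.Connection, start: str) -> list[str]:
    """Return every CURIE reachable from `start`, following xrefs in both directions."""
    return [row[0] for row in conn.execute(QUERY, (start,))]


with closing(sqlite3.connect(":memory:")) as conn:
    conn.execute("CREATE TABLE concord (filename TEXT, subj TEXT, pred TEXT, obj TEXT)")
    conn.executemany("INSERT INTO concord VALUES (?, ?, ?, ?)", ROWS)
    result = expand(conn, "EX:1")
```

Here `EX:3` is found only on the second step (through `EX:2`), which is the case a non-recursive lookup would miss.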

## Testing

### Test Structure

Tests live in `tests/` and are split into fast **unit tests** (mocked, no network) and slower **integration tests** (real downloads and API calls). Pytest markers control which tests run:

- **`@pytest.mark.integration`** — requires network access (downloads Parquet files or calls NodeNorm API)
- **`@pytest.mark.slow`** — downloads very large files (2 GB+)

| File | Unit | Integration | Slow | Total |
|------|------|-------------|------|-------|
| `tests/test_downloader.py` | 41 | 4 | 1 | 46 |
| `tests/test_babel_xrefs.py` | 23 | 20 | 3 | 46 |
| `tests/test_nodenorm.py` | 20 | 13 | 0 | 33 |
| `tests/test_cli.py` | 24 | 0 | 0 | 24 |

### Test Infrastructure

- **`tests/conftest.py`** — Session-scoped fixtures that download Parquet files once and share them across all integration tests. Teardown removes the `data/test/` directory so the next run starts fresh.
- **`tests/constants.py`** — Shared constants (URLs, file paths) and `load_curies()` helper.
- **`tests/data/valid_curies.txt`** — One CURIE per line (`#` comments allowed). Integration tests are parametrized over this list — adding a new line automatically expands test coverage.
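
A loader for a file in this format might look like the sketch below. It is a hypothetical reimplementation, not the `load_curies()` helper from `tests/constants.py`, whose signature is not shown here; it assumes that comments occupy a whole line starting with `#` and that blank lines are ignored.

```python
import tempfile
from pathlib import Path


def load_curies_sketch(path) -> list[str]:
    """Read one CURIE per line, skipping blank lines and whole-line '#' comments."""
    curies = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        entry = line.strip()
        if entry and not entry.startswith("#"):
            curies.append(entry)
    return curies


# Demonstrate on a throwaway file so the example does not touch the repository.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "valid_curies.txt"
    sample.write_text(
        "# tests/data/valid_curies.txt\nMONDO:0004979\n\nHP:0000001\n",
        encoding="utf-8",
    )
    curies = load_curies_sketch(sample)
```

A list like `curies` is what one would typically pass to `pytest.mark.parametrize`, which is how adding a line to the file adds test cases.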

### Key Dataclasses

- **`Identifier`** — Frozen dataclass for a normalized NodeNorm entry (curie, label, biolink_type, taxa, description). Returned by `NodeNorm.get_identifier()` and `get_clique_identifiers()`.
- **`CrossReference`** — Frozen dataclass for Concord.parquet rows (filename, subj, pred, obj)
- **`LabeledCrossReference`** — Extends CrossReference with labels and biolink types from NodeNorm
- **`IdentifierRecord`** — Frozen dataclass for Identifiers.parquet rows (curie + dynamic extra fields). Returned by `BabelXRefs.get_curie_ids()`.
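
As a rough sketch of the frozen-dataclass pattern described above: the `CrossReference` fields follow the list (filename, subj, pred, obj), while the extra field names on `LabeledCrossReference` are assumptions made for illustration and may not match the real class.

```python
from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)
class CrossReference:
    """One Concord.parquet row."""

    filename: str
    subj: str
    pred: str
    obj: str


@dataclass(frozen=True)
class LabeledCrossReference(CrossReference):
    """CrossReference plus NodeNorm metadata (hypothetical field names)."""

    subj_label: str
    obj_label: str
    subj_biolink_types: list[str]
    obj_biolink_types: list[str]


xref = LabeledCrossReference(
    "a.txt", "EX:1", "skos:exactMatch", "EX:2",
    "first example", "second example",
    ["biolink:Disease"], ["biolink:Disease"],
)

# Frozen dataclasses reject attribute assignment after construction.
try:
    xref.subj = "EX:3"
    mutated = True
except FrozenInstanceError:
    mutated = False
```

A frozen subclass must extend a frozen parent; mixing frozen and non-frozen dataclasses in one hierarchy raises a `TypeError` at class-definition time.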

## Important Notes

- **Data directory**: The `data/` directory is gitignored and contains downloaded Parquet files. BabelXRefs queries them through in-memory DuckDB connections, so it does not write database files there.
- **Babel versions**: The default Babel version is `2025nov19`, but this can be customized via `--local-dir` and `--babel-url`

## File Locations

- Source code: `src/babel_explorer/`
- Tests: `tests/`
- Test CURIEs: `tests/data/valid_curies.txt`
- Downloaded Babel files: `data/<version>/duckdb/*.parquet`
- Entry point: `src/babel_explorer/cli.py`
7 changes: 7 additions & 0 deletions FUTURE.md
@@ -0,0 +1,7 @@
# Future Work

## Deduplicate CLI option blocks

`--local-dir`, `--babel-url`, and `--check-download` are copy-pasted between the
`xrefs` and `ids` commands in `cli.py`. Extract a `@common_babel_options` Click
decorator so defaults are defined in one place and can't drift.
56 changes: 55 additions & 1 deletion README.md
@@ -1,2 +1,56 @@
# Babel Explorer
Software for querying and exporting Babel intermediate files
Software for querying and exploring Babel intermediate files.

babel-explorer allows you to discover why two biological/chemical identifiers are considered identical by the [Babel](https://github.com/TranslatorSRI/Babel) system, which handles cross-references between different ontology and database identifiers (e.g., MONDO, HP, UMLS, HGNC).

## Setup

This project uses [uv](https://docs.astral.sh/uv/) for package management:

```bash
uv sync --group dev
```

## Usage

```bash
# Get cross-references for one or more CURIEs
uv run babel-explorer xrefs MONDO:0004979

# Get cross-references with expansion (recursive lookup)
uv run babel-explorer xrefs MONDO:0004979 --recurse

# Get cross-references with labels from NodeNorm
uv run babel-explorer xrefs MONDO:0004979 --labels

# Get ID records for CURIEs
uv run babel-explorer ids MONDO:0004979

# Test concordance changes with NodeNorm
uv run babel-explorer test-concord MONDO:0004979 HP:0000001
```

## Testing

Tests are split into fast **unit tests** (mocked, no network) and slower **integration tests** (real file downloads and API calls), controlled by pytest markers.

```bash
# Unit tests only — fast, no network required
uv run pytest -v -m "not integration"

# Integration tests without 2GB+ downloads
uv run pytest -v -m "integration and not slow"

# Full suite including large file downloads
uv run pytest -v
```

### Adding Test CURIEs

Integration tests are parametrized over the CURIEs listed in `tests/data/valid_curies.txt`. Add a new CURIE on its own line to automatically expand test coverage:

```
# tests/data/valid_curies.txt
MONDO:0004979
HP:0000001
```
35 changes: 35 additions & 0 deletions pyproject.toml
@@ -0,0 +1,35 @@
[project]
name = "babel-explorer"
version = "0.1.0"
description = "Tool for querying and exploring Babel APIs and intermediate files"
readme = "README.md"
requires-python = ">=3.11"
dependencies = [
"click>=8.3.1",
"duckdb>=1.4.2",
"requests>=2.32.5",
"rich>=13",
"tqdm>=4.67.0",
]

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[dependency-groups]
dev = [
"filelock>=3.16",
"pytest>=8.3.5",
"pytest-xdist[psutil]>=3.6",
"ruff>=0.11.0",
]

[project.scripts]
babel-explorer = "babel_explorer.cli:cli"

[tool.pytest.ini_options]
addopts = "-n auto"
markers = [
"integration: tests requiring network access (deselect with '-m \"not integration\"')",
"slow: tests downloading very large files 2GB+ (deselect with '-m \"not slow\"')",
]
Empty file added src/__init__.py
Empty file.
Empty file added src/babel_explorer/__init__.py
Empty file.