Commit 04aa4d3

Copilot and sfluegel05 committed
Add chebi_utils library with downloader, extractors, splitter, tests, and CI workflow
Co-authored-by: sfluegel05 <43573433+sfluegel05@users.noreply.github.com>
1 parent 84ebd3f commit 04aa4d3

File tree

15 files changed

+905
-1
lines changed

.github/workflows/ci.yml

Lines changed: 48 additions & 0 deletions
@@ -0,0 +1,48 @@
+name: CI
+
+on:
+  push:
+    branches: ["**"]
+  pull_request:
+    branches: ["**"]
+
+jobs:
+  lint:
+    name: Lint (ruff)
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+
+      - name: Set up Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: "3.12"
+
+      - name: Install ruff
+        run: pip install ruff
+
+      - name: Check formatting
+        run: ruff format --check .
+
+      - name: Check linting
+        run: ruff check .
+
+  test:
+    name: Unit Tests
+    runs-on: ubuntu-latest
+    strategy:
+      matrix:
+        python-version: ["3.10", "3.11", "3.12"]
+    steps:
+      - uses: actions/checkout@v4
+
+      - name: Set up Python ${{ matrix.python-version }}
+        uses: actions/setup-python@v5
+        with:
+          python-version: ${{ matrix.python-version }}
+
+      - name: Install package and test dependencies
+        run: pip install -e ".[dev]"
+
+      - name: Run tests
+        run: pytest tests/ -v

README.md

Lines changed: 84 additions & 1 deletion
@@ -1,2 +1,85 @@
 # python-chebi-utils
-Common processing functionality for the ChEBI ontology (e.g. extraction of molecules, classes and relations).
+
+Common processing functionality for the ChEBI ontology: download data files, extract classes and relations, extract molecules, and generate stratified train/val/test splits.
+
+## Installation
+
+```bash
+pip install chebi-utils
+```
+
+For development (includes `pytest` and `ruff`):
+
+```bash
+pip install -e ".[dev]"
+```
+
+## Features
+
+### Download ChEBI data files
+
+```python
+from chebi_utils import download_chebi_obo, download_chebi_sdf
+
+obo_path = download_chebi_obo(dest_dir="data/")  # downloads chebi.obo
+sdf_path = download_chebi_sdf(dest_dir="data/")  # downloads chebi.sdf.gz
+```
+
+Files are fetched from the [EBI FTP server](https://ftp.ebi.ac.uk/pub/databases/chebi/).
+
+### Extract ontology classes and relations
+
+```python
+from chebi_utils import extract_classes, extract_relations
+
+classes = extract_classes("chebi.obo")
+# DataFrame: id, name, definition, is_obsolete
+
+relations = extract_relations("chebi.obo")
+# DataFrame: source_id, target_id, relation_type (is_a, has_role, …)
+```
+
+### Extract molecules
+
+```python
+from chebi_utils import extract_molecules
+
+molecules = extract_molecules("chebi.sdf.gz")
+# DataFrame: chebi_id, name, smiles, inchi, inchikey, formula, charge, mass, …
+```
+
+Both plain `.sdf` and gzip-compressed `.sdf.gz` files are supported.
+
+### Generate train/val/test splits
+
+```python
+from chebi_utils import create_splits
+
+splits = create_splits(molecules, train_ratio=0.8, val_ratio=0.1, test_ratio=0.1)
+train_df = splits["train"]
+val_df = splits["val"]
+test_df = splits["test"]
+```
+
+Pass `stratify_col` to preserve class proportions across splits:
+
+```python
+splits = create_splits(classes, stratify_col="is_obsolete", seed=42)
+```
+
+## Running Tests
+
+```bash
+pytest tests/ -v
+```
+
+## Linting
+
+```bash
+ruff check .
+ruff format --check .
+```
+
+## CI/CD
+
+A GitHub Actions workflow (`.github/workflows/ci.yml`) automatically runs ruff linting and the full test suite on every push and pull request across Python 3.10, 3.11, and 3.12.
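
The README describes `create_splits` as preserving class proportions when `stratify_col` is given, but the splitter source is not part of this excerpt. The following is only a minimal sketch of one way stratified splitting can work, using the standard library alone. The function name `stratified_split_indices` and its return shape (lists of row indices) are assumptions for illustration, not the library's API.

```python
import random
from collections import defaultdict


def stratified_split_indices(labels, train_ratio=0.8, val_ratio=0.1, test_ratio=0.1, seed=42):
    """Hypothetical helper: split row indices so each label keeps its proportions."""
    if abs(train_ratio + val_ratio + test_ratio - 1.0) > 1e-9:
        raise ValueError("train_ratio + val_ratio + test_ratio must equal 1")

    rng = random.Random(seed)  # local RNG so the global random state is untouched

    # Group row indices by their label value.
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    splits = {"train": [], "val": [], "test": []}
    # Iterate labels in a fixed order so a given seed always yields the same split.
    for label in sorted(by_label, key=repr):
        indices = by_label[label]
        rng.shuffle(indices)
        n_train = int(len(indices) * train_ratio)
        n_val = int(len(indices) * val_ratio)
        splits["train"].extend(indices[:n_train])
        splits["val"].extend(indices[n_train : n_train + n_val])
        # Whatever remains after rounding down goes to the test split.
        splits["test"].extend(indices[n_train + n_val :])
    return splits
```

Because the per-label counts are rounded down, very small classes end up mostly in the test split under this sketch; a real implementation may handle that differently.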

chebi_utils/__init__.py

Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
+from chebi_utils.downloader import download_chebi_obo, download_chebi_sdf
+from chebi_utils.obo_extractor import extract_classes, extract_relations
+from chebi_utils.sdf_extractor import extract_molecules
+from chebi_utils.splitter import create_splits
+
+__all__ = [
+    "download_chebi_obo",
+    "download_chebi_sdf",
+    "extract_classes",
+    "extract_relations",
+    "extract_molecules",
+    "create_splits",
+]

chebi_utils/downloader.py

Lines changed: 53 additions & 0 deletions
@@ -0,0 +1,53 @@
+"""Download ChEBI data files from the EBI FTP server."""
+
+from __future__ import annotations
+
+import urllib.request
+from pathlib import Path
+
+CHEBI_OBO_URL = "https://ftp.ebi.ac.uk/pub/databases/chebi/ontology/chebi.obo"
+CHEBI_SDF_URL = "https://ftp.ebi.ac.uk/pub/databases/chebi/SDF/ChEBI_complete.sdf.gz"
+
+
+def download_chebi_obo(dest_dir: str | Path = ".", filename: str = "chebi.obo") -> Path:
+    """Download the ChEBI OBO ontology file from the EBI FTP server.
+
+    Parameters
+    ----------
+    dest_dir : str or Path
+        Directory where the file will be saved (created if it doesn't exist).
+    filename : str
+        Name for the downloaded file.
+
+    Returns
+    -------
+    Path
+        Path to the downloaded file.
+    """
+    dest_dir = Path(dest_dir)
+    dest_dir.mkdir(parents=True, exist_ok=True)
+    dest_path = dest_dir / filename
+    urllib.request.urlretrieve(CHEBI_OBO_URL, dest_path)
+    return dest_path
+
+
+def download_chebi_sdf(dest_dir: str | Path = ".", filename: str = "chebi.sdf.gz") -> Path:
+    """Download the ChEBI SDF file from the EBI FTP server.
+
+    Parameters
+    ----------
+    dest_dir : str or Path
+        Directory where the file will be saved (created if it doesn't exist).
+    filename : str
+        Name for the downloaded file.
+
+    Returns
+    -------
+    Path
+        Path to the downloaded file.
+    """
+    dest_dir = Path(dest_dir)
+    dest_dir.mkdir(parents=True, exist_ok=True)
+    dest_path = dest_dir / filename
+    urllib.request.urlretrieve(CHEBI_SDF_URL, dest_path)
+    return dest_path
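
Both download functions call `urllib.request.urlretrieve`, which writes straight to the destination path and takes no timeout argument, so an interrupted transfer can leave a truncated file behind. The sketch below shows one possible alternative: stream into a temporary file with a timeout and rename it into place only on success. The name `download_file` and its `timeout` parameter are hypothetical and do not appear in the commit.

```python
import os
import shutil
import tempfile
import urllib.request
from pathlib import Path
from typing import Union


def download_file(url: str, dest_path: Union[str, Path], timeout: float = 60.0) -> Path:
    """Hypothetical helper: download `url` to `dest_path` without leaving partial files."""
    dest_path = Path(dest_path)
    dest_path.parent.mkdir(parents=True, exist_ok=True)

    # Write to a temporary file in the same directory so the final rename
    # stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=dest_path.parent, suffix=".part")
    try:
        with os.fdopen(fd, "wb") as tmp, urllib.request.urlopen(url, timeout=timeout) as resp:
            shutil.copyfileobj(resp, tmp)
        os.replace(tmp_name, dest_path)  # move the complete file into place
    except BaseException:
        # Remove the partial file on any failure, then re-raise.
        Path(tmp_name).unlink(missing_ok=True)
        raise
    return dest_path
```

Under this sketch, a caller would see either the complete file at `dest_path` or an exception, never a half-written `chebi.obo`.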

chebi_utils/obo_extractor.py

Lines changed: 103 additions & 0 deletions
@@ -0,0 +1,103 @@
+"""Extract classes and relations from ChEBI OBO ontology files."""
+
+from __future__ import annotations
+
+from pathlib import Path
+
+import pandas as pd
+
+
+def _parse_obo_stanzas(filepath: str | Path) -> list[dict[str, list[str]]]:
+    """Parse an OBO file and return a list of stanza dicts."""
+    stanzas: list[dict[str, list[str]]] = []
+    current_stanza: dict[str, list[str]] | None = None
+
+    with open(filepath, encoding="utf-8") as f:
+        for line in f:
+            line = line.strip()
+            if not line or line.startswith("!"):
+                continue
+            if line.startswith("["):
+                if current_stanza is not None:
+                    stanzas.append(current_stanza)
+                stanza_type = line.strip("[]")
+                current_stanza = {"_type": [stanza_type]}
+            elif current_stanza is not None and ":" in line:
+                key, _, value = line.partition(":")
+                current_stanza.setdefault(key.strip(), []).append(value.strip())
+
+    if current_stanza is not None:
+        stanzas.append(current_stanza)
+
+    return stanzas
+
+
+def extract_classes(filepath: str | Path) -> pd.DataFrame:
+    """Extract ontology classes (terms) from a ChEBI OBO file.
+
+    Parameters
+    ----------
+    filepath : str or Path
+        Path to the ChEBI OBO file.
+
+    Returns
+    -------
+    pd.DataFrame
+        DataFrame with columns: id, name, definition, is_obsolete.
+    """
+    stanzas = _parse_obo_stanzas(filepath)
+    rows = []
+    for stanza in stanzas:
+        if stanza.get("_type", [None])[0] != "Term":
+            continue
+        row = {
+            "id": stanza.get("id", [None])[0],
+            "name": stanza.get("name", [None])[0],
+            "definition": stanza.get("def", [None])[0],
+            "is_obsolete": stanza.get("is_obsolete", ["false"])[0] == "true",
+        }
+        rows.append(row)
+    return pd.DataFrame(rows, columns=["id", "name", "definition", "is_obsolete"])
+
+
+def extract_relations(filepath: str | Path) -> pd.DataFrame:
+    """Extract class relations from a ChEBI OBO file.
+
+    Parameters
+    ----------
+    filepath : str or Path
+        Path to the ChEBI OBO file.
+
+    Returns
+    -------
+    pd.DataFrame
+        DataFrame with columns: source_id, target_id, relation_type.
+    """
+    stanzas = _parse_obo_stanzas(filepath)
+    rows = []
+
+    for stanza in stanzas:
+        if stanza.get("_type", [None])[0] != "Term":
+            continue
+        source_id = stanza.get("id", [None])[0]
+        if source_id is None:
+            continue
+
+        for is_a_val in stanza.get("is_a", []):
+            target_id = is_a_val.split("!")[0].strip()
+            rows.append({"source_id": source_id, "target_id": target_id, "relation_type": "is_a"})
+
+        for rel_val in stanza.get("relationship", []):
+            parts = rel_val.split()
+            if len(parts) >= 2:
+                rel_type = parts[0]
+                target_id = parts[1].split("!")[0].strip()
+                rows.append(
+                    {
+                        "source_id": source_id,
+                        "target_id": target_id,
+                        "relation_type": rel_type,
+                    }
+                )
+
+    return pd.DataFrame(rows, columns=["source_id", "target_id", "relation_type"])
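
`extract_relations` returns only direct edges (`source_id`, `target_id`, `relation_type`). A common follow-up is to collect every superclass of a term by following `is_a` edges transitively. The sketch below illustrates that on plain `(source_id, target_id, relation_type)` tuples; the helper name `ancestors` is hypothetical and not part of this commit, and the example IDs in the usage note are made up. Rows in that shape could presumably be obtained from the DataFrame with `relations.itertuples(index=False, name=None)`.

```python
from collections import defaultdict


def ancestors(relations, term_id):
    """Hypothetical helper: return all terms reachable from `term_id` via is_a edges."""
    # Build a child -> direct parents lookup from the is_a rows only.
    parents = defaultdict(set)
    for source_id, target_id, relation_type in relations:
        if relation_type == "is_a":
            parents[source_id].add(target_id)

    # Iterative depth-first walk; `seen` also guards against cycles.
    seen = set()
    stack = [term_id]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen
```

For example, with rows `("X:3", "X:2", "is_a")` and `("X:2", "X:1", "is_a")`, `ancestors(rows, "X:3")` yields both `X:2` and `X:1`, while a `has_role` edge is ignored.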

chebi_utils/sdf_extractor.py

Lines changed: 85 additions & 0 deletions
@@ -0,0 +1,85 @@
+"""Extract molecule data from ChEBI SDF files."""
+
+from __future__ import annotations
+
+import gzip
+from pathlib import Path
+
+import pandas as pd
+
+
+def _iter_sdf_records(filepath: str | Path):
+    """Yield individual SDF records as strings."""
+    opener = gzip.open if str(filepath).endswith(".gz") else open
+    current_record: list[str] = []
+
+    with opener(filepath, "rt", encoding="utf-8") as f:
+        for line in f:
+            current_record.append(line)
+            if line.strip() == "$$$$":
+                yield "".join(current_record)
+                current_record = []
+
+
+def _parse_sdf_record(record: str) -> dict[str, str]:
+    """Parse a single SDF record into a dict of data-item properties."""
+    props: dict[str, str] = {}
+    lines = record.splitlines()
+
+    if lines:
+        props["mol_name"] = lines[0].strip()
+
+    i = 0
+    while i < len(lines):
+        line = lines[i]
+        if line.startswith("> <") and line.rstrip().endswith(">"):
+            key = line.strip()[3:-1]
+            value_lines: list[str] = []
+            i += 1
+            while i < len(lines) and lines[i].strip() not in ("", "$$$$"):
+                value_lines.append(lines[i].strip())
+                i += 1
+            props[key] = "\n".join(value_lines)
+        else:
+            i += 1
+
+    return props
+
+
+def extract_molecules(filepath: str | Path) -> pd.DataFrame:
+    """Extract molecule data from a ChEBI SDF file.
+
+    Supports both plain (``.sdf``) and gzip-compressed (``.sdf.gz``) files.
+
+    Parameters
+    ----------
+    filepath : str or Path
+        Path to the ChEBI SDF (or SDF.gz) file.
+
+    Returns
+    -------
+    pd.DataFrame
+        DataFrame with one row per molecule. Columns depend on the properties
+        present in the file. Common columns (renamed for convenience):
+        chebi_id, name, inchi, inchikey, smiles, formula, charge, mass.
+    """
+    records = [_parse_sdf_record(r) for r in _iter_sdf_records(filepath)]
+
+    if not records:
+        return pd.DataFrame()
+
+    df = pd.DataFrame(records)
+
+    rename_map = {
+        "ChEBI ID": "chebi_id",
+        "ChEBI Name": "name",
+        "InChI": "inchi",
+        "InChIKey": "inchikey",
+        "SMILES": "smiles",
+        "Formulae": "formula",
+        "Charge": "charge",
+        "Mass": "mass",
+    }
+    df = df.rename(columns={k: v for k, v in rename_map.items() if k in df.columns})
+
+    return df
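
`_parse_sdf_record` stores every data item as a string, so columns such as `charge` and `mass` reach the DataFrame as text. The sketch below shows one way a caller might convert those fields per record before numeric use. The helper name `coerce_numeric` and its defaults are assumptions for illustration and are not part of this commit.

```python
def coerce_numeric(record, int_fields=("charge",), float_fields=("mass",)):
    """Hypothetical helper: return a copy of `record` with numeric fields converted."""
    out = dict(record)  # leave the caller's dict untouched
    casters = [(name, int) for name in int_fields] + [(name, float) for name in float_fields]
    for name, caster in casters:
        raw = out.get(name)
        if raw is None:
            continue  # field absent for this molecule
        try:
            out[name] = caster(str(raw).strip())
        except ValueError:
            # Unparseable or empty values become None rather than raising.
            out[name] = None
    return out
```

Fields that are missing from a record stay missing, which matches the way the extractor only creates columns for properties actually present in the file.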
