Commit 806460b

Merge pull request #2 from ChEB-AI/main
Main
2 parents ece504f + 887e5d5

File tree

15 files changed: +1131 −1 lines


.github/workflows/ci.yml

Lines changed: 48 additions & 0 deletions

```yaml
name: CI

on:
  push:
    branches: ["**"]
  pull_request:
    branches: ["**"]

jobs:
  lint:
    name: Lint (ruff)
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.12"

      - name: Install ruff
        run: pip install ruff

      - name: Check formatting
        run: ruff format --check .

      - name: Check linting
        run: ruff check .

  test:
    name: Unit Tests
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ["3.10", "3.11", "3.12"]
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}

      - name: Install package and test dependencies
        run: pip install -e ".[dev]"

      - name: Run tests
        run: pytest tests/ -v
```

README.md

Lines changed: 84 additions & 1 deletion

````markdown
# python-chebi-utils

Common processing functionality for the ChEBI ontology: download data files, extract classes and relations, extract molecules, and generate stratified train/val/test splits.

## Installation

```bash
pip install chebi-utils
```

For development (includes `pytest` and `ruff`):

```bash
pip install -e ".[dev]"
```

## Features

### Download ChEBI data files

```python
from chebi_utils import download_chebi_obo, download_chebi_sdf

obo_path = download_chebi_obo(version=245, dest_dir="data/")  # downloads chebi.obo
sdf_path = download_chebi_sdf(version=245, dest_dir="data/")  # downloads chebi.sdf.gz
```

Files are fetched from the [EBI FTP server](https://ftp.ebi.ac.uk/pub/databases/chebi/).

### Extract ontology classes and relations

```python
from chebi_utils import build_chebi_graph

graph = build_chebi_graph("chebi.obo")
# networkx.DiGraph: nodes are CHEBI IDs with name, smiles and subset
# attributes; edges carry relation="is_a" (child -> parent) or
# relation="has_part" (whole -> part)
```

### Extract molecules

```python
from chebi_utils import extract_molecules

molecules = extract_molecules("chebi.sdf.gz")
# DataFrame: chebi_id, name, smiles, inchi, inchikey, formula, charge, mass, …
```

Both plain `.sdf` and gzip-compressed `.sdf.gz` files are supported.

### Generate train/val/test splits

```python
from chebi_utils import create_splits

splits = create_splits(molecules, train_ratio=0.8, val_ratio=0.1, test_ratio=0.1)
train_df = splits["train"]
val_df = splits["val"]
test_df = splits["test"]
```

Pass `stratify_col` to preserve class proportions across splits:

```python
splits = create_splits(molecules, stratify_col="charge", seed=42)
```

## Running Tests

```bash
pytest tests/ -v
```

## Linting

```bash
ruff check .
ruff format --check .
```

## CI/CD

A GitHub Actions workflow (`.github/workflows/ci.yml`) automatically runs ruff linting and the full test suite on every push and pull request across Python 3.10, 3.11, and 3.12.
````
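`chebi_utils/splitter.py` is among the 15 changed files but is not shown in this diff, so the internals of `create_splits` are unknown here. As a rough illustration of what a stratified split does, here is a minimal, dependency-free sketch; the function name `create_splits_sketch`, its `key` parameter, and the grouping strategy are illustrative assumptions, not the package's actual implementation:

```python
import random
from collections import defaultdict


def create_splits_sketch(records, key, train_ratio=0.8, val_ratio=0.1, seed=42):
    """Split records into train/val/test, shuffling within each stratum so
    every value of `key` appears in all three splits in similar proportions."""
    # Group records by their stratification value.
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec)

    rng = random.Random(seed)
    splits = {"train": [], "val": [], "test": []}
    # Split each stratum independently, so class balance carries over.
    for _, members in sorted(groups.items()):
        rng.shuffle(members)
        n_train = int(len(members) * train_ratio)
        n_val = int(len(members) * val_ratio)
        splits["train"].extend(members[:n_train])
        splits["val"].extend(members[n_train:n_train + n_val])
        splits["test"].extend(members[n_train + n_val:])
    return splits


# Toy data: 100 records with an uneven 25/75 label balance.
data = [{"id": i, "label": "A" if i % 4 == 0 else "B"} for i in range(100)]
splits = create_splits_sketch(data, key="label")
```

With this toy data, each of the two label groups is cut 80/10/10 separately, so the minority label keeps its 25% share in every split rather than landing disproportionately in one of them.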

chebi_utils/__init__.py

Lines changed: 12 additions & 0 deletions

```python
from chebi_utils.downloader import download_chebi_obo, download_chebi_sdf
from chebi_utils.obo_extractor import build_chebi_graph
from chebi_utils.sdf_extractor import extract_molecules
from chebi_utils.splitter import create_splits

__all__ = [
    "download_chebi_obo",
    "download_chebi_sdf",
    "build_chebi_graph",
    "extract_molecules",
    "create_splits",
]
```

chebi_utils/downloader.py

Lines changed: 78 additions & 0 deletions

```python
"""Download ChEBI data files from the EBI FTP server."""

from __future__ import annotations

import urllib.request
from pathlib import Path

_CHEBI_LEGACY_VERSION_THRESHOLD = 245


def _chebi_obo_url(version: int) -> str:
    if version < _CHEBI_LEGACY_VERSION_THRESHOLD:
        return f"https://ftp.ebi.ac.uk/pub/databases/chebi/archive/chebi_legacy/archive/rel{version}/ontology/chebi.obo"
    return f"https://ftp.ebi.ac.uk/pub/databases/chebi/archive/rel{version}/ontology/chebi.obo"


def _chebi_sdf_url(version: int) -> str:
    if version < _CHEBI_LEGACY_VERSION_THRESHOLD:
        # Legacy releases live under the chebi_legacy archive; the SDF path
        # here is assumed to mirror the non-legacy SDF/ layout.
        return f"https://ftp.ebi.ac.uk/pub/databases/chebi/archive/chebi_legacy/archive/rel{version}/SDF/chebi.sdf.gz"
    return f"https://ftp.ebi.ac.uk/pub/databases/chebi/archive/rel{version}/SDF/chebi.sdf.gz"


def download_chebi_obo(
    version: int,
    dest_dir: str | Path = ".",
    filename: str = "chebi.obo",
) -> Path:
    """Download a versioned ChEBI OBO ontology file from the EBI FTP server.

    Parameters
    ----------
    version : int
        ChEBI release version number (e.g. 230, 245, 250).
        Versions below 245 are fetched from the legacy archive path.
    dest_dir : str or Path
        Directory where the file will be saved (created if it doesn't exist).
    filename : str
        Name for the downloaded file.

    Returns
    -------
    Path
        Path to the downloaded file.
    """
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest_path = dest_dir / filename
    urllib.request.urlretrieve(_chebi_obo_url(version), dest_path)
    return dest_path


def download_chebi_sdf(
    version: int,
    dest_dir: str | Path = ".",
    filename: str = "chebi.sdf.gz",
) -> Path:
    """Download a versioned ChEBI SDF file from the EBI FTP server.

    Parameters
    ----------
    version : int
        ChEBI release version number (e.g. 230, 245, 250).
        Versions below 245 are fetched from the legacy archive path.
    dest_dir : str or Path
        Directory where the file will be saved (created if it doesn't exist).
    filename : str
        Name for the downloaded file.

    Returns
    -------
    Path
        Path to the downloaded file.
    """
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest_path = dest_dir / filename
    urllib.request.urlretrieve(_chebi_sdf_url(version), dest_path)
    return dest_path
```

Note: in the original diff, the legacy branch of `_chebi_sdf_url` returned the `chebi.obo` ontology URL, an apparent copy-paste slip; it is corrected above to point at an SDF file, with the exact legacy path flagged as an assumption.

chebi_utils/obo_extractor.py

Lines changed: 113 additions & 0 deletions

```python
"""Extract ChEBI ontology data using fastobo and build a networkx graph."""

from __future__ import annotations

from pathlib import Path

import fastobo
import networkx as nx


def _chebi_id_to_str(chebi_id: str) -> str:
    """Convert 'CHEBI:123' to '123' (string)."""
    return chebi_id.split(":")[1]


def _term_data(doc: "fastobo.term.TermFrame") -> dict | None:
    """Extract data from a single fastobo TermFrame.

    Returns
    -------
    dict or None
        Parsed term data, or ``None`` if the term is marked as obsolete.
    """
    parents: list[str] = []
    has_part: set[str] = set()
    name: str | None = None
    smiles: str | None = None
    subset: str | None = None

    for clause in doc:
        if isinstance(clause, fastobo.term.IsObsoleteClause):
            if clause.obsolete:
                return None
        elif isinstance(clause, fastobo.term.PropertyValueClause):
            pv = clause.property_value
            if str(pv.relation) in (
                "chemrof:smiles_string",
                "http://purl.obolibrary.org/obo/chebi/smiles",
            ):
                smiles = pv.value
        elif isinstance(clause, fastobo.term.SynonymClause):
            if "SMILES" in clause.raw_value() and smiles is None:
                smiles = clause.raw_value().split('"')[1]
        elif isinstance(clause, fastobo.term.RelationshipClause):
            if str(clause.typedef) == "has_part":
                has_part.add(_chebi_id_to_str(str(clause.term)))
        elif isinstance(clause, fastobo.term.IsAClause):
            parents.append(_chebi_id_to_str(str(clause.term)))
        elif isinstance(clause, fastobo.term.NameClause):
            name = str(clause.name)
        elif isinstance(clause, fastobo.term.SubsetClause):
            subset = str(clause.subset)

    return {
        "id": _chebi_id_to_str(str(doc.id)),
        "parents": parents,
        "has_part": has_part,
        "name": name,
        "smiles": smiles,
        "subset": subset,
    }


def build_chebi_graph(filepath: str | Path) -> nx.DiGraph:
    """Parse a ChEBI OBO file and build a directed graph of ontology terms.

    ``xref:`` lines are stripped before parsing as they can cause fastobo
    errors on some ChEBI releases. Only non-obsolete CHEBI-prefixed terms
    are included.

    **Nodes** are string CHEBI IDs (e.g. ``"1"`` for ``CHEBI:1``) with
    attributes ``name``, ``smiles``, and ``subset``.

    **Edges** carry a ``relation`` attribute and represent:

    - ``is_a`` — directed from child to parent
    - ``has_part`` — directed from whole to part

    Parameters
    ----------
    filepath : str or Path
        Path to the ChEBI OBO file.

    Returns
    -------
    nx.DiGraph
        Directed graph of ChEBI ontology terms and their relationships.
    """
    with open(filepath, encoding="utf-8") as f:
        content = "\n".join(line for line in f if not line.startswith("xref:"))

    graph: nx.DiGraph = nx.DiGraph()

    for frame in fastobo.loads(content):
        if not (
            frame and isinstance(frame.id, fastobo.id.PrefixedIdent) and frame.id.prefix == "CHEBI"
        ):
            continue

        term = _term_data(frame)
        if term is None:
            continue

        node_id = term["id"]
        graph.add_node(node_id, name=term["name"], smiles=term["smiles"], subset=term["subset"])

        for parent_id in term["parents"]:
            graph.add_edge(node_id, parent_id, relation="is_a")

        for part_id in term["has_part"]:
            graph.add_edge(node_id, part_id, relation="has_part")

    return graph
```
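Because `is_a` edges run from child to parent, ancestor queries on the resulting graph reduce to reachability. A dependency-free sketch of that idea, using a plain parents dict and a hypothetical three-level hierarchy (the IDs below are made up, not real ChEBI terms):

```python
def ancestors(parents: dict[str, list[str]], node: str) -> set[str]:
    """All transitive is_a ancestors of `node`, found by iterative DFS."""
    seen: set[str] = set()
    stack = list(parents.get(node, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            # Walk upward: a parent's own parents are also ancestors.
            stack.extend(parents.get(current, []))
    return seen


# Hypothetical hierarchy: "3" is_a "2", "2" is_a "1", and "3" also is_a "4".
toy_parents = {"3": ["2", "4"], "2": ["1"]}
print(sorted(ancestors(toy_parents, "3")))  # ['1', '2', '4']
```

On the actual `nx.DiGraph` returned by `build_chebi_graph`, `networkx.descendants(graph, node_id)` computes the same reachable set, since the edges already point from child to parent.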
