Commit 10f8e5f

Initial commit
0 parents  commit 10f8e5f

25 files changed

Lines changed: 3781 additions & 0 deletions

.github/workflows/ci.yml

Lines changed: 43 additions & 0 deletions
name: CI

on:
  push:
  pull_request:

jobs:
  quality:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        python-version: ["3.11", "3.12", "3.13", "3.14"]

    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -e ".[dev]"

      - name: Ruff
        run: ruff check src tests

      - name: Mypy
        run: mypy src

      - name: Pytest
        run: pytest -q

      - name: Build and check distributions
        if: matrix.python-version == '3.11'
        run: |
          rm -rf build dist
          python -m build
          python -m twine check dist/*

.gitignore

Lines changed: 28 additions & 0 deletions
__pycache__/
*.py[cod]
*$py.class

build/
dist/
*.egg-info/
pip-wheel-metadata/

.venv/
.venv*/
venv/
env/

.pytest_cache/
.mypy_cache/
.ruff_cache/

.coverage
.coverage.*
htmlcov/

.audit*/
artifacts/
.artifacts/

.DS_Store
poetry.toml

CHANGELOG.md

Lines changed: 27 additions & 0 deletions
# Changelog

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog, and this project follows Semantic Versioning.

## [Unreleased]

### Added
- CI workflow for linting, type checking, tests, and distribution validation.
- Security and contributing documentation.
- Pluggable cache backend contract (`CacheBackend`) with `DataCache` as the default implementation.
- Dataset usage-note metadata for every dataset, via explicit values or category-derived defaults.
- Five ZINC tranche-based drug-like dataset targets across purchasability tiers (in-stock, agent, wait-ok, boutique, annotated), each configured as a multi-tranche fetch target.

### Changed
- Source validation now treats mirrored URLs as healthy when at least one source is reachable.
- Added dataset URL mode support for concatenating multiple source URLs into one cached raw file (`url_mode="concat"`).
- Refresh fetch now fails instead of silently falling back to stale cached data.
- Development status metadata upgraded from Alpha to Beta.
- Fetch and materialize metadata now include normalized dataset snapshots (description, usage notes, source, licensing).
- CLI JSON output for `list` and `fetch` now includes dataset metadata for easier automation.

### Fixed
- mypy issues in catalog and IO typing.
- Known-failing fallback dataset URLs in the default catalog.
- Local artifact hygiene via `.gitignore`.

CONTRIBUTING.md

Lines changed: 25 additions & 0 deletions
# Contributing

## Development setup

```bash
python -m venv .venv
. .venv/bin/activate
pip install -e ".[dev]"
```

## Local quality checks

```bash
ruff check src tests
mypy src
pytest -q
python -m build
python -m twine check dist/*
```

## Pull requests

- Keep changes focused and include tests for behavior changes.
- Update `README.md` and `CHANGELOG.md` when user-visible behavior changes.
- Ensure CI passes before requesting review.

LICENSE

Lines changed: 21 additions & 0 deletions
MIT License

Copyright (c) 2026 JJ Ben-Joseph

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

README.md

Lines changed: 154 additions & 0 deletions
# refua-data

`refua-data` is the Refua data layer for drug discovery. It provides a curated dataset catalog, intelligent local caching, and parquet materialization optimized for downstream modeling and campaign workflows.

## What it provides

- A built-in catalog of useful drug-discovery datasets.
- Dataset-aware download pipeline with cache reuse and metadata tracking.
- Pluggable cache backend architecture (filesystem cache by default).
- API dataset ingestion for paginated JSON endpoints (for example ChEMBL and UniProt).
- HTTP conditional refresh support (`ETag` / `Last-Modified`) when enabled.
- Incremental parquet materialization (chunked processing + partitioned parquet parts).
- CLI for listing, fetching, and materializing datasets.
- Source health checks via `validate-sources` for CI and environment diagnostics.
- Rich dataset metadata snapshots (description + usage notes) persisted in cache metadata.

## Included datasets

The default catalog includes local-file/HTTP datasets plus API presets useful in drug discovery, including **ZINC**, **ChEMBL**, and **UniProt**.

1. `zinc15_250k` (ZINC)
2. `zinc15_tranche_druglike_instock` (ZINC tranche)
3. `zinc15_tranche_druglike_agent` (ZINC tranche)
4. `zinc15_tranche_druglike_wait_ok` (ZINC tranche)
5. `zinc15_tranche_druglike_boutique` (ZINC tranche)
6. `zinc15_tranche_druglike_annotated` (ZINC tranche)
7. `tox21`
8. `bbbp`
9. `bace`
10. `clintox`
11. `sider`
12. `hiv`
13. `muv`
14. `esol`
15. `freesolv`
16. `lipophilicity`
17. `pcba`
18. `chembl_activity_ki_human`
19. `chembl_activity_ic50_human`
20. `chembl_assays_binding_human`
21. `chembl_targets_human_single_protein`
22. `chembl_molecules_phase3plus`
23. `uniprot_human_reviewed`
24. `uniprot_human_kinases`
25. `uniprot_human_gpcr`
26. `uniprot_human_ion_channels`
27. `uniprot_human_transporters`

Most of these are distributed through MoleculeNet/DeepChem mirrors and retain upstream licensing terms. ChEMBL and UniProt presets are fetched through their public REST APIs and cached locally as JSONL. ZINC tranche presets aggregate multiple tranche files per dataset (drug-like MW B-K and logP A-K bins, reactivity A/B/C/E) into one cached tabular source during fetch.

## Install

```bash
cd refua-data
pip install -e .
```

## CLI quickstart

List datasets:

```bash
refua-data list
```

Validate all dataset sources:

```bash
refua-data validate-sources
```

Validate a subset and fail CI on probe failures:

```bash
refua-data validate-sources chembl_activity_ki_human uniprot_human_kinases --fail-on-error
```

JSON output for automation:

```bash
refua-data validate-sources --json --fail-on-error
```

For datasets with multiple mirrors, source validation succeeds when at least one configured source is reachable. Failed fallback attempts are included in the result details.

Fetch raw data with cache:

```bash
refua-data fetch zinc15_250k
```

Fetch API-based presets:

```bash
refua-data fetch chembl_activity_ki_human
refua-data fetch uniprot_human_kinases
```

Materialize parquet:

```bash
refua-data materialize zinc15_250k
```

Refresh against remote metadata:

```bash
refua-data fetch zinc15_250k --refresh
```

For API datasets, `--refresh` re-runs the API query (with conditional headers on the first page when available).

## Cache layout

By default, the cache root is:

- `~/.cache/refua-data`

Override with:

- `REFUA_DATA_HOME=/custom/path`

Layout:

- `raw/<dataset>/<version>/...` downloaded source files
- `_meta/raw/<dataset>/<version>/...json` raw metadata (`etag`, `sha256`, API request signature, rows/pages, dataset description/usage metadata)
- `parquet/<dataset>/<version>/part-*.parquet` materialized parquet parts
- `_meta/parquet/<dataset>/<version>/manifest.json` parquet manifest metadata with dataset snapshot
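
Resolving the cache root amounts to a single environment-variable lookup, which can be sketched as follows (the helper name `cache_root` is illustrative, not part of the package API):

```python
import os
from pathlib import Path


def cache_root() -> Path:
    """Resolve the cache root: REFUA_DATA_HOME wins, else ~/.cache/refua-data."""
    override = os.environ.get("REFUA_DATA_HOME")
    if override:
        return Path(override)
    return Path.home() / ".cache" / "refua-data"
```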

## Python API

```python
from refua_data import DatasetManager

manager = DatasetManager()
manager.fetch("zinc15_250k")
manager.fetch("chembl_activity_ki_human")
result = manager.materialize("zinc15_250k")
print(result.parquet_dir)
```

`DataCache` is the default cache backend. You can pass a custom backend object that implements the same interface (`ensure`, `raw_file`, `raw_meta`, `parquet_dir`, `parquet_manifest`, `read_json`, `write_json`) to make storage pluggable.
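
A custom backend might look like the following sketch. The method signatures here are assumptions inferred from the listed names and the cache layout above, not the package's documented types:

```python
import json
from pathlib import Path
from typing import Any


class FileCacheBackend:
    """Illustrative cache backend laying files out like the documented cache tree."""

    def __init__(self, root: Path) -> None:
        self.root = Path(root)

    def ensure(self, path: Path) -> Path:
        # Make sure a directory exists before files are written under it.
        path.mkdir(parents=True, exist_ok=True)
        return path

    def raw_file(self, dataset: str, version: str, name: str) -> Path:
        return self.root / "raw" / dataset / version / name

    def raw_meta(self, dataset: str, version: str) -> Path:
        return self.root / "_meta" / "raw" / dataset / version / "meta.json"

    def parquet_dir(self, dataset: str, version: str) -> Path:
        return self.root / "parquet" / dataset / version

    def parquet_manifest(self, dataset: str, version: str) -> Path:
        return self.root / "_meta" / "parquet" / dataset / version / "manifest.json"

    def read_json(self, path: Path) -> Any:
        return json.loads(Path(path).read_text())

    def write_json(self, path: Path, payload: Any) -> None:
        path = Path(path)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(payload, indent=2))
```

Passing such an object where `DataCache` is accepted would redirect all cache reads and writes (for example, to a different disk or mounted object store) without touching fetch logic.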

## Licensing notes

- `refua-data` package code is MIT licensed.
- Dataset content licenses are dataset-specific and controlled by upstream providers.
- Always verify dataset licensing and allowed use before redistribution or commercial deployment.

SECURITY.md

Lines changed: 12 additions & 0 deletions
# Security Policy

## Reporting a vulnerability

Please report suspected vulnerabilities privately by emailing `jj@tensorspace.ai`.

Include:
- A clear description of the issue
- Affected versions and environment details
- Steps to reproduce or a proof of concept

We will acknowledge receipt and work on triage and remediation.
