refua-data

refua-data is the Refua data layer for drug discovery. It provides a curated dataset catalog, intelligent local caching, and parquet materialization optimized for downstream modeling and campaign workflows.

What it provides

A built-in catalog of useful drug-discovery datasets.
Dataset-aware download pipeline with cache reuse and metadata tracking.
Pluggable cache backend architecture (filesystem cache by default).
API dataset ingestion for paginated JSON endpoints (for example ChEMBL and UniProt).
HTTP conditional refresh support (ETag / Last-Modified) when enabled.
Support for partitioned parquet bundle downloads (for example Open Targets releases).
Native Excel (.xlsx) ingestion for datasets such as GDSC fitted dose-response releases.
Incremental parquet materialization (chunked processing + partitioned parquet parts).
CLI for listing, fetching, and materializing datasets.
Query interface for filtered row access from materialized parquet datasets.
Source health checks via validate-sources for CI and environment diagnostics.
Rich dataset metadata snapshots (description + usage notes) persisted in cache metadata.
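
The incremental materialization item above (chunked processing + partitioned parquet parts) can be pictured with a small planning sketch. It only groups rows into chunks and pairs each chunk with a part file name; it does not write parquet. The helper names `iter_chunks` and `plan_parts`, and the zero-padded `part-00000.parquet` naming, are assumptions for illustration and may differ from what refua-data does internally.

```python
from itertools import islice
from typing import Iterable, Iterator


def iter_chunks(rows: Iterable[dict], chunk_size: int) -> Iterator[list[dict]]:
    """Yield lists of at most chunk_size rows without loading every row at once."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def plan_parts(rows: Iterable[dict], chunk_size: int) -> list[tuple[str, int]]:
    """Pair each chunk with a hypothetical part file name and its row count."""
    return [
        (f"part-{index:05d}.parquet", len(chunk))
        for index, chunk in enumerate(iter_chunks(rows, chunk_size))
    ]


# Seven made-up rows, three rows per part.
demo_rows = ({"smiles": "C" * n, "logP": float(n)} for n in range(1, 8))
print(plan_parts(demo_rows, chunk_size=3))
```

Because each chunk is handled independently, peak memory depends on the chunk size rather than on the size of the source file.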

Included datasets

The default catalog includes local-file/HTTP datasets plus API presets useful in drug discovery, including ZINC, BindingDB, Open Targets, CancerRxGene/GDSC, ChEMBL, UniProt, openFDA, and the Human Protein Atlas.

zinc15_250k (ZINC)
zinc15_tranche_druglike_instock (ZINC tranche)
zinc15_tranche_druglike_agent (ZINC tranche)
zinc15_tranche_druglike_wait_ok (ZINC tranche)
zinc15_tranche_druglike_boutique (ZINC tranche)
zinc15_tranche_druglike_annotated (ZINC tranche)
tox21
bbbp
bace
clintox
sider
hiv
muv
esol
freesolv
lipophilicity
pcba
bindingdb_articles_affinity
openfda_drug_event_serious
proteinatlas_human_proteome
opentargets_target_prioritisation
gdsc2_fitted_dose_response
chembl_activity_ki_human
chembl_activity_ic50_human
chembl_activity_kd_human
chembl_activity_ec50_human
chembl_activity_ac50_human
chembl_assays_binding_human
chembl_assays_functional_human
chembl_assays_adme_human
chembl_targets_human_single_protein
chembl_targets_human_protein_complex
chembl_molecules_phase3plus
chembl_molecules_phase4
chembl_molecules_black_box_warning
chembl_mechanism_phase2plus
chembl_drug_indications_phase2plus
chembl_drug_indications_phase3plus
uniprot_human_reviewed
uniprot_human_receptors
uniprot_human_membrane
uniprot_human_nucleus
uniprot_human_kinases
uniprot_human_gpcr
uniprot_human_ion_channels
uniprot_human_transporters
uniprot_human_secreted
uniprot_human_transcription_factors
uniprot_human_enzymes

Most of these are distributed through MoleculeNet/DeepChem mirrors and retain upstream licensing terms.
BindingDB is included as a versioned ZIP-backed TSV snapshot for literature-derived affinity modeling.
Open Targets is included as a versioned parquet-part bundle for target prioritisation workflows.
CancerRxGene GDSC is included as a versioned Excel-backed dose-response release for cell-line pharmacology modeling.
ChEMBL, UniProt, and openFDA presets are fetched through their public REST APIs and cached locally as JSONL.
ZINC tranche presets aggregate multiple tranche files per dataset (drug-like MW B-K and logP A-K bins, reactivity A/B/C/E) into one cached tabular source during fetch.
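
As a rough picture of how a paginated JSON API can end up as a local JSONL file, here is a minimal sketch. It is not refua-data's implementation: the cursor-based `PageFetcher` shape, the helper names, and `fake_fetch_page` (which stands in for a real HTTP call) are all assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Iterable, Iterator, Optional

# Assumed shape: a fetcher takes a cursor (None for the first page) and returns
# (records, next_cursor); next_cursor is None once the last page is reached.
PageFetcher = Callable[[Optional[str]], tuple[list[dict], Optional[str]]]


def iter_records(fetch_page: PageFetcher) -> Iterator[dict]:
    """Walk pages in order and yield every record."""
    cursor: Optional[str] = None
    while True:
        records, cursor = fetch_page(cursor)
        yield from records
        if cursor is None:
            return


def write_jsonl(records: Iterable[dict], path: Path) -> int:
    """Write one JSON object per line and return the number of rows written."""
    path.parent.mkdir(parents=True, exist_ok=True)
    count = 0
    with path.open("w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record, sort_keys=True) + "\n")
            count += 1
    return count


def fake_fetch_page(cursor: Optional[str]) -> tuple[list[dict], Optional[str]]:
    """Hypothetical two-page API used instead of a network request."""
    pages = {
        None: ([{"id": 1}, {"id": 2}], "page-2"),
        "page-2": ([{"id": 3}], None),
    }
    return pages[cursor]


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "raw" / "demo_dataset" / "records.jsonl"
    written = write_jsonl(iter_records(fake_fetch_page), target)
    print(written, "rows written to", target.name)
```

Streaming records straight to disk keeps memory flat regardless of how many pages the endpoint returns, and the row count is the kind of value that could be stored in cache metadata.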

Install

cd refua-data
pip install -e .

CLI quickstart

List datasets:

refua-data list

Validate all dataset sources:

refua-data validate-sources

Validate a subset and fail CI on probe failures:

refua-data validate-sources chembl_activity_ki_human uniprot_human_kinases --fail-on-error

JSON output for automation:

refua-data validate-sources --json --fail-on-error

For datasets with multiple mirrors, source validation succeeds when at least one configured source is reachable. Failed fallback attempts are included in the result details.
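
The mirror rule described above can be sketched as follows. This is an illustration only: `validate_sources`, the result dictionary keys, and `fake_probe` are hypothetical names, and the real validate-sources output format may differ.

```python
from typing import Callable


def validate_sources(urls: list[str], probe: Callable[[str], None]) -> dict:
    """Probe mirrors in order; succeed as soon as one is reachable.

    Failed attempts made before the success are kept in the result details.
    """
    attempts = []
    for url in urls:
        try:
            probe(url)  # expected to raise if the source is unreachable
        except Exception as exc:
            attempts.append({"url": url, "ok": False, "error": str(exc)})
            continue
        attempts.append({"url": url, "ok": True, "error": None})
        return {"ok": True, "source": url, "attempts": attempts}
    return {"ok": False, "source": None, "attempts": attempts}


def fake_probe(url: str) -> None:
    """Stand-in for an HTTP check: the primary mirror is treated as down."""
    if "primary" in url:
        raise ConnectionError("primary mirror unreachable")


result = validate_sources(
    ["https://primary.example/data.csv", "https://mirror.example/data.csv"],
    fake_probe,
)
print(result["ok"], result["source"], len(result["attempts"]))
```

With --fail-on-error, a CI job would presumably exit non-zero only when the overall result is a failure, not when an individual fallback attempt fails.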

Fetch raw data with cache:

refua-data fetch zinc15_250k

Fetch API-based presets:

refua-data fetch chembl_activity_ki_human
refua-data fetch uniprot_human_kinases

Materialize parquet:

refua-data materialize zinc15_250k

Query materialized parquet rows:

refua-data query zinc15_250k --columns smiles,logP --filters '{"logP":{"lt":2.5}}' --limit 50
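
To make the --filters JSON concrete, here is a minimal in-memory sketch of how such a filter could be evaluated. Only the `lt` operator appears in the example above; the other operator names, the helper functions, and the sample rows are assumptions for illustration, and the real query path reads parquet rather than Python dicts.

```python
import json
import operator

# "lt" comes from the CLI example; the remaining operator names are assumed.
OPERATORS = {
    "lt": operator.lt,
    "le": operator.le,
    "gt": operator.gt,
    "ge": operator.ge,
    "eq": operator.eq,
}


def row_matches(row: dict, filters: dict) -> bool:
    """Return True when the row satisfies every condition on every column."""
    for column, conditions in filters.items():
        value = row.get(column)
        if value is None:
            return False
        for name, operand in conditions.items():
            if not OPERATORS[name](value, operand):
                return False
    return True


def query(rows: list[dict], columns: list[str], filters: dict, limit: int) -> list[dict]:
    """Filter rows, project the requested columns, and stop at limit rows."""
    selected: list[dict] = []
    for row in rows:
        if len(selected) >= limit:
            break
        if row_matches(row, filters):
            selected.append({column: row.get(column) for column in columns})
    return selected


# Made-up rows for illustration; the values are not taken from any dataset.
demo_rows = [
    {"smiles": "CCO", "logP": -0.3, "extra": 1},
    {"smiles": "c1ccccc1", "logP": 2.1, "extra": 2},
    {"smiles": "CCCCCCCC", "logP": 4.5, "extra": 3},
]
filters = json.loads('{"logP":{"lt":2.5}}')
print(query(demo_rows, ["smiles", "logP"], filters, limit=50))
```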

Refresh against remote metadata:

refua-data fetch zinc15_250k --refresh

For API datasets, --refresh re-runs the API query (with conditional headers on the first page when available).
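
Conditional refresh relies on the standard HTTP validators. The sketch below shows how cached metadata could be turned into If-None-Match / If-Modified-Since headers. The `etag` key is mentioned in the cache metadata description; the `last_modified` key and the function names are assumptions, and no network request is made here.

```python
import urllib.request


def conditional_headers(meta: dict) -> dict:
    """Build conditional request headers from cached metadata, if present."""
    headers = {}
    if meta.get("etag"):
        headers["If-None-Match"] = meta["etag"]
    if meta.get("last_modified"):  # assumed metadata key
        headers["If-Modified-Since"] = meta["last_modified"]
    return headers


def build_refresh_request(url: str, meta: dict) -> urllib.request.Request:
    """Create (but do not send) a GET request carrying the validators."""
    return urllib.request.Request(url, headers=conditional_headers(meta))


def can_reuse_cache(status_code: int) -> bool:
    """HTTP 304 Not Modified means the cached copy is still current."""
    return status_code == 304


cached_meta = {"etag": '"abc123"', "last_modified": "Wed, 21 Oct 2015 07:28:00 GMT"}
request = build_refresh_request("https://example.org/data.csv", cached_meta)
print(conditional_headers(cached_meta))
```

Note that `urllib.request.urlopen` reports a 304 response by raising `urllib.error.HTTPError`, so a caller using the standard library would check the error's `code` attribute rather than a normal response object.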

Cache layout

By default, the cache root is:

~/.cache/refua-data

Override with:

REFUA_DATA_HOME=/custom/path

Layout:

raw/<dataset>/<version>/...                       downloaded source files
_meta/raw/<dataset>/<version>/...json             raw metadata (etag, sha256, API request signature, rows/pages, dataset description/usage metadata)
parquet/<dataset>/<version>/part-*.parquet        materialized parquet parts
_meta/parquet/<dataset>/<version>/manifest.json   parquet manifest metadata with dataset snapshot
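
For scripts that need to locate cached files, the layout above can be reproduced with a few path helpers. This is a sketch under assumptions: `cache_root` and `layout_paths` are hypothetical names, not part of the refua_data API, and the version string used in the demo is a placeholder.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def cache_root(env: Optional[Mapping[str, str]] = None) -> Path:
    """Return REFUA_DATA_HOME when set, else the default ~/.cache/refua-data."""
    env = os.environ if env is None else env
    override = env.get("REFUA_DATA_HOME")
    if override:
        return Path(override).expanduser()
    return Path.home() / ".cache" / "refua-data"


def layout_paths(root: Path, dataset: str, version: str) -> dict[str, Path]:
    """Map the four documented layout entries to concrete paths."""
    return {
        "raw_dir": root / "raw" / dataset / version,
        "raw_meta_dir": root / "_meta" / "raw" / dataset / version,
        "parquet_dir": root / "parquet" / dataset / version,
        "parquet_manifest": root / "_meta" / "parquet" / dataset / version / "manifest.json",
    }


# "v1" is a placeholder; real version strings come from the catalog.
paths = layout_paths(cache_root({"REFUA_DATA_HOME": "/custom/path"}), "zinc15_250k", "v1")
for name, path in paths.items():
    print(f"{name}: {path}")
```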

Python API

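from refua_data import DatasetManager

# Uses the default cache backend (DataCache).
manager = DatasetManager()

# Download raw sources into the cache, or reuse what is already cached.
manager.fetch("zinc15_250k")
manager.fetch("chembl_activity_ki_human")

# Materialize the cached raw data as parquet parts and print their directory.
result = manager.materialize("zinc15_250k")
print(result.parquet_dir)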

DataCache is the default cache backend. You can pass a custom backend object that implements the same interface (ensure, raw_file, raw_meta, parquet_dir, parquet_manifest, read_json, write_json) to make storage pluggable.
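
One way to reason about a custom backend is to write the interface down as a typing Protocol. The method names below are the ones listed above; every signature, the `CacheBackend` and `DirectoryBackend` names, and the `<filename>.json` metadata naming are assumptions made for this sketch, so check the DataCache source before relying on them.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class CacheBackend(Protocol):
    """Method names from the README; the signatures are assumed."""

    def ensure(self) -> None: ...
    def raw_file(self, dataset: str, version: str, filename: str) -> Path: ...
    def raw_meta(self, dataset: str, version: str, filename: str) -> Path: ...
    def parquet_dir(self, dataset: str, version: str) -> Path: ...
    def parquet_manifest(self, dataset: str, version: str) -> Path: ...
    def read_json(self, path: Path) -> Any: ...
    def write_json(self, path: Path, payload: Any) -> None: ...


class DirectoryBackend:
    """Hypothetical backend that stores everything under one directory."""

    def __init__(self, root: Path) -> None:
        self.root = Path(root)

    def ensure(self) -> None:
        for sub in ("raw", "parquet", "_meta/raw", "_meta/parquet"):
            (self.root / sub).mkdir(parents=True, exist_ok=True)

    def raw_file(self, dataset: str, version: str, filename: str) -> Path:
        return self.root / "raw" / dataset / version / filename

    def raw_meta(self, dataset: str, version: str, filename: str) -> Path:
        # The "<filename>.json" naming is an assumption.
        return self.root / "_meta" / "raw" / dataset / version / f"{filename}.json"

    def parquet_dir(self, dataset: str, version: str) -> Path:
        return self.root / "parquet" / dataset / version

    def parquet_manifest(self, dataset: str, version: str) -> Path:
        return self.root / "_meta" / "parquet" / dataset / version / "manifest.json"

    def read_json(self, path: Path) -> Any:
        return json.loads(Path(path).read_text(encoding="utf-8"))

    def write_json(self, path: Path, payload: Any) -> None:
        path = Path(path)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(payload, indent=2, sort_keys=True), encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    backend = DirectoryBackend(Path(tmp))
    backend.ensure()
    manifest = backend.parquet_manifest("zinc15_250k", "v1")  # "v1" is a placeholder
    backend.write_json(manifest, {"parts": 1})
    print(isinstance(backend, CacheBackend), backend.read_json(manifest))
```

The runtime check only confirms that the methods exist, not that their signatures match. How the backend object is handed to DatasetManager is not shown here because the constructor parameter is not documented in this README.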

Licensing notes

The refua-data package code is MIT licensed.
Dataset content licenses are dataset-specific and controlled by upstream providers.
Always verify dataset licensing and allowed use before redistribution or commercial deployment.
