refua-data is the Refua data layer for drug discovery. It provides a curated dataset catalog, intelligent local caching, and parquet materialization optimized for downstream modeling and campaign workflows.
- A built-in catalog of useful drug-discovery datasets.
- Dataset-aware download pipeline with cache reuse and metadata tracking.
- Pluggable cache backend architecture (filesystem cache by default).
- API dataset ingestion for paginated JSON endpoints (for example ChEMBL and UniProt).
- HTTP conditional refresh support (ETag/Last-Modified) when enabled.
- Support for partitioned parquet bundle downloads (for example Open Targets releases).
- Native Excel (.xlsx) ingestion for datasets such as GDSC fitted dose-response releases.
- Incremental parquet materialization (chunked processing + partitioned parquet parts).
- CLI for listing, fetching, and materializing datasets.
- Query interface for filtered row access from materialized parquet datasets.
- Source health checks via validate-sources for CI and environment diagnostics.
- Rich dataset metadata snapshots (description + usage notes) persisted in cache metadata.
The default catalog includes local-file/HTTP datasets plus API presets useful in drug discovery, including ZINC, BindingDB, Open Targets, CancerRxGene/GDSC, ChEMBL, UniProt, openFDA, and the Human Protein Atlas.
- zinc15_250k (ZINC)
- zinc15_tranche_druglike_instock (ZINC tranche)
- zinc15_tranche_druglike_agent (ZINC tranche)
- zinc15_tranche_druglike_wait_ok (ZINC tranche)
- zinc15_tranche_druglike_boutique (ZINC tranche)
- zinc15_tranche_druglike_annotated (ZINC tranche)
- tox21
- bbbp
- bace
- clintox
- sider
- hiv
- muv
- esol
- freesolv
- lipophilicity
- pcba
- bindingdb_articles_affinity
- openfda_drug_event_serious
- proteinatlas_human_proteome
- opentargets_target_prioritisation
- gdsc2_fitted_dose_response
- chembl_activity_ki_human
- chembl_activity_ic50_human
- chembl_activity_kd_human
- chembl_activity_ec50_human
- chembl_activity_ac50_human
- chembl_assays_binding_human
- chembl_assays_functional_human
- chembl_assays_adme_human
- chembl_targets_human_single_protein
- chembl_targets_human_protein_complex
- chembl_molecules_phase3plus
- chembl_molecules_phase4
- chembl_molecules_black_box_warning
- chembl_mechanism_phase2plus
- chembl_drug_indications_phase2plus
- chembl_drug_indications_phase3plus
- uniprot_human_reviewed
- uniprot_human_receptors
- uniprot_human_membrane
- uniprot_human_nucleus
- uniprot_human_kinases
- uniprot_human_gpcr
- uniprot_human_ion_channels
- uniprot_human_transporters
- uniprot_human_secreted
- uniprot_human_transcription_factors
- uniprot_human_enzymes
Most of these are distributed through MoleculeNet/DeepChem mirrors and retain upstream licensing terms. BindingDB is included as a versioned ZIP-backed TSV snapshot for literature-derived affinity modeling. Open Targets is included as a versioned parquet-part bundle for target prioritisation workflows. CancerRxGene GDSC is included as a versioned Excel-backed dose-response release for cell-line pharmacology modeling. ChEMBL, UniProt, and openFDA presets are fetched through their public REST APIs and cached locally as JSONL. ZINC tranche presets aggregate multiple tranche files per dataset (drug-like MW B-K and logP A-K bins, reactivity A/B/C/E) into one cached tabular source during fetch.
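The API presets are described as paginated fetches cached locally as JSONL. A minimal sketch of that pattern follows, assuming a hypothetical fetch_page callable that returns a list of records per page and an empty list once the endpoint is exhausted. None of these names come from refua-data; its actual pagination logic is not shown in this document.

```python
import json
from pathlib import Path
from typing import Callable, Iterable, Iterator


def iter_records(fetch_page: Callable[[int], list]) -> Iterator[dict]:
    """Yield records page by page until a page comes back empty."""
    page = 0
    while True:
        records = fetch_page(page)
        if not records:
            return
        yield from records
        page += 1


def write_jsonl(records: Iterable[dict], path: Path) -> int:
    """Write one JSON object per line and return the number of rows written."""
    path.parent.mkdir(parents=True, exist_ok=True)
    rows = 0
    with path.open("w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record, sort_keys=True) + "\n")
            rows += 1
    return rows


# Hypothetical in-memory stand-in for a paginated REST endpoint.
_PAGES = [[{"id": 1}, {"id": 2}], [{"id": 3}]]


def fake_fetch_page(page: int) -> list:
    """Return the requested page, or an empty list past the last page."""
    return _PAGES[page] if page < len(_PAGES) else []
```

Streaming records through a generator keeps memory use flat regardless of how many pages the endpoint returns, which is the usual reason to prefer JSONL over a single JSON array for API dumps.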
cd refua-data
pip install -e .
List datasets:
refua-data list
Validate all dataset sources:
refua-data validate-sources
Validate a subset and fail CI on probe failures:
refua-data validate-sources chembl_activity_ki_human uniprot_human_kinases --fail-on-error
JSON output for automation:
refua-data validate-sources --json --fail-on-error
For datasets with multiple mirrors, source validation succeeds when at least one configured source is reachable. Failed fallback attempts are included in the result details.
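The mirror rule (a dataset passes if any configured source is reachable, and failed attempts are still reported) can be sketched as follows. This is an illustration only: probe, validate_dataset, and the result fields are hypothetical names, not refua-data's actual API or JSON schema.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Sequence


@dataclass
class ValidationResult:
    ok: bool
    reachable_source: Optional[str] = None
    failed_attempts: List[str] = field(default_factory=list)


def validate_dataset(
    sources: Sequence[str], probe: Callable[[str], bool]
) -> ValidationResult:
    """Try each source in order; succeed on the first reachable one.

    `probe` is an assumed callable that returns True when a URL responds.
    Earlier failures are kept so they can be reported alongside a success.
    """
    failed: List[str] = []
    for url in sources:
        try:
            reachable = probe(url)
        except OSError as exc:  # network errors are recorded, not raised
            failed.append(f"{url}: {exc}")
            continue
        if reachable:
            return ValidationResult(True, url, failed)
        failed.append(f"{url}: unreachable")
    return ValidationResult(False, None, failed)
```

Passing the probe in as a callable keeps the fallback logic testable without network access; a CI job could then exit non-zero whenever ok is False.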
Fetch raw data with cache:
refua-data fetch zinc15_250k
Fetch API-based presets:
refua-data fetch chembl_activity_ki_human
refua-data fetch uniprot_human_kinases
Materialize parquet:
refua-data materialize zinc15_250k
Query materialized parquet rows:
refua-data query zinc15_250k --columns smiles,logP --filters '{"logP":{"lt":2.5}}' --limit 50
Refresh against remote metadata:
refua-data fetch zinc15_250k --refresh
For API datasets, --refresh re-runs the API query (with conditional headers on first page when available).
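Conditional refresh relies on standard HTTP validators: a cached ETag is sent back as If-None-Match, a cached Last-Modified value as If-Modified-Since, and a 304 Not Modified response means the cached copy can be reused. A minimal sketch, assuming the cached metadata is a dict with etag and last_modified keys. The last_modified key and both function names are assumptions for illustration, not the package's code.

```python
from typing import Dict, Mapping


def conditional_headers(meta: Mapping[str, str]) -> Dict[str, str]:
    """Build conditional request headers from previously cached validators."""
    headers: Dict[str, str] = {}
    if meta.get("etag"):
        headers["If-None-Match"] = meta["etag"]
    if meta.get("last_modified"):  # assumed key name
        headers["If-Modified-Since"] = meta["last_modified"]
    return headers


def should_reuse_cache(status_code: int) -> bool:
    """A 304 response means the server confirmed the cached copy is current."""
    return status_code == 304
```

When no validators are cached, the function returns an empty dict and the request degrades to a plain unconditional download.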
By default, cache root is:
~/.cache/refua-data
Override with:
REFUA_DATA_HOME=/custom/path
Layout:
- raw/<dataset>/<version>/...: downloaded source files
- _meta/raw/<dataset>/<version>/...json: raw metadata (etag, sha256, API request signature, rows/pages, dataset description/usage metadata)
- parquet/<dataset>/<version>/part-*.parquet: materialized parquet parts
- _meta/parquet/<dataset>/<version>/manifest.json: parquet manifest metadata with dataset snapshot
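The cache root rule and the layout can be expressed as a few path helpers. This is a sketch rather than the package's code: the function names and dict keys are hypothetical, and it builds only the directory and manifest paths that the layout spells out in full.

```python
import os
from pathlib import Path
from typing import Dict, Mapping, Optional


def cache_root(env: Optional[Mapping[str, str]] = None) -> Path:
    """Return REFUA_DATA_HOME if set, else the default ~/.cache/refua-data."""
    env = os.environ if env is None else env
    override = env.get("REFUA_DATA_HOME")
    if override:
        return Path(override).expanduser()
    return Path.home() / ".cache" / "refua-data"


def layout(root: Path, dataset: str, version: str) -> Dict[str, Path]:
    """Map a dataset and version onto the documented cache directories."""
    meta = root / "_meta"
    return {
        "raw_dir": root / "raw" / dataset / version,
        "raw_meta_dir": meta / "raw" / dataset / version,
        "parquet_dir": root / "parquet" / dataset / version,
        "parquet_manifest": meta / "parquet" / dataset / version / "manifest.json",
    }
```

For example, layout(cache_root(), "zinc15_250k", "v1") returns the four locations for a version labelled v1, where v1 is a placeholder and not a version string taken from the catalog.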
from refua_data import DatasetManager
manager = DatasetManager()
manager.fetch("zinc15_250k")
manager.fetch("chembl_activity_ki_human")
result = manager.materialize("zinc15_250k")
print(result.parquet_dir)
DataCache is the default cache backend. You can pass a custom backend object that implements
the same interface (ensure, raw_file, raw_meta, parquet_dir, parquet_manifest,
read_json, write_json) to make storage pluggable.
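Before wiring in a custom backend, it can help to check that the object exposes every member of the interface listed above. The sketch below only tests for the presence of those names. Their signatures, and the way a backend is handed to DatasetManager, are not documented here, so neither is assumed; missing_backend_members and the sample class are hypothetical.

```python
from typing import List

# Member names taken from the interface list above.
REQUIRED_MEMBERS = (
    "ensure",
    "raw_file",
    "raw_meta",
    "parquet_dir",
    "parquet_manifest",
    "read_json",
    "write_json",
)


def missing_backend_members(backend: object) -> List[str]:
    """Return the interface names that the candidate backend does not define."""
    return [name for name in REQUIRED_MEMBERS if not hasattr(backend, name)]


class PartialBackend:
    """Hypothetical backend that implements only part of the interface."""

    def ensure(self):
        pass

    def read_json(self, path):
        pass

    def write_json(self, path, payload):
        pass
```

A non-empty result from missing_backend_members(PartialBackend()) signals that the backend is incomplete before any fetch is attempted.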
- refua-data package code is MIT licensed.
- Dataset content licenses are dataset-specific and controlled by upstream providers.
- Always verify dataset licensing and allowed use before redistribution or commercial deployment.