 - Pluggable cache backend architecture (filesystem cache by default).
 - API dataset ingestion for paginated JSON endpoints (for example ChEMBL and UniProt).
 - HTTP conditional refresh support (`ETag` / `Last-Modified`) when enabled.
+- Support for partitioned parquet bundle downloads (for example Open Targets releases).
 - Incremental parquet materialization (chunked processing + partitioned parquet parts).
 - CLI for listing, fetching, and materializing datasets.
 - Query interface for filtered row access from materialized parquet datasets.
 
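As a rough illustration of the conditional refresh feature listed above, the sketch below shows how cached `ETag` / `Last-Modified` validators are commonly turned into `If-None-Match` / `If-Modified-Since` request headers, with a `304 Not Modified` response meaning the cached copy is reused. The function names, the layout of the `cached` dictionary, the example URL, and the injected `send` callable are all hypothetical; none of them are taken from this project's API.

```python
from typing import Callable, Dict, Optional, Tuple

# Hypothetical transport signature: send(url, headers) -> (status, response_headers, body)
Send = Callable[[str, Dict[str, str]], Tuple[int, Dict[str, str], bytes]]


def conditional_headers(etag: Optional[str], last_modified: Optional[str]) -> Dict[str, str]:
    """Build conditional request headers from previously cached validators."""
    headers: Dict[str, str] = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def refresh(url: str, cached: dict, send: Send) -> dict:
    """Return the cached entry on 304, otherwise a new entry built from the response."""
    request_headers = conditional_headers(cached.get("etag"), cached.get("last_modified"))
    status, response_headers, body = send(url, request_headers)
    if status == 304:
        # Server confirmed the cached copy is still current: nothing to download.
        return cached
    return {
        "etag": response_headers.get("ETag"),
        "last_modified": response_headers.get("Last-Modified"),
        "body": body,
    }


def fake_send(url: str, headers: Dict[str, str]) -> Tuple[int, Dict[str, str], bytes]:
    """Stand-in for a real HTTP client so the sketch runs without network access."""
    if headers.get("If-None-Match") == '"v1"':
        return 304, {}, b""
    return 200, {"ETag": '"v1"', "Last-Modified": "Mon, 01 Jan 2024 00:00:00 GMT"}, b"data"


# The first call has no validators and downloads; the second revalidates and reuses the cache.
first = refresh("https://example.org/dataset.tsv", {}, fake_send)
second = refresh("https://example.org/dataset.tsv", first, fake_send)
```

Injecting the transport keeps the revalidation logic independent of any particular HTTP library, which also makes it easy to exercise offline.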
 ## Included datasets
 
-The default catalog includes local-file/HTTP datasets plus API presets useful in drug discovery, including **ZINC**, **BindingDB**, **ChEMBL**, **UniProt**, **openFDA**, and the **Human Protein Atlas**.
+The default catalog includes local-file/HTTP datasets plus API presets useful in drug discovery, including **ZINC**, **BindingDB**, **Open Targets**, **ChEMBL**, **UniProt**, **openFDA**, and the **Human Protein Atlas**.
 
 1. `zinc15_250k` (ZINC)
 2. `zinc15_tranche_druglike_instock` (ZINC tranche)
@@ -39,36 +40,38 @@ The default catalog includes local-file/HTTP datasets plus API presets useful in
 18. `bindingdb_articles_affinity`
 19. `openfda_drug_event_serious`
 20. `proteinatlas_human_proteome`
-21. `chembl_activity_ki_human`
-22. `chembl_activity_ic50_human`
-23. `chembl_activity_kd_human`
-24. `chembl_activity_ec50_human`
-25. `chembl_activity_ac50_human`
-26. `chembl_assays_binding_human`
-27. `chembl_assays_functional_human`
-28. `chembl_assays_adme_human`
-29. `chembl_targets_human_single_protein`
-30. `chembl_targets_human_protein_complex`
-31. `chembl_molecules_phase3plus`
-32. `chembl_molecules_phase4`
-33. `chembl_molecules_black_box_warning`
-34. `chembl_mechanism_phase2plus`
-35. `chembl_drug_indications_phase2plus`
-36. `chembl_drug_indications_phase3plus`
-37. `uniprot_human_reviewed`
-38. `uniprot_human_receptors`
-39. `uniprot_human_membrane`
-40. `uniprot_human_nucleus`
-41. `uniprot_human_kinases`
-42. `uniprot_human_gpcr`
-43. `uniprot_human_ion_channels`
-44. `uniprot_human_transporters`
-45. `uniprot_human_secreted`
-46. `uniprot_human_transcription_factors`
-47. `uniprot_human_enzymes`
+21. `opentargets_target_prioritisation`
+22. `chembl_activity_ki_human`
+23. `chembl_activity_ic50_human`
+24. `chembl_activity_kd_human`
+25. `chembl_activity_ec50_human`
+26. `chembl_activity_ac50_human`
+27. `chembl_assays_binding_human`
+28. `chembl_assays_functional_human`
+29. `chembl_assays_adme_human`
+30. `chembl_targets_human_single_protein`
+31. `chembl_targets_human_protein_complex`
+32. `chembl_molecules_phase3plus`
+33. `chembl_molecules_phase4`
+34. `chembl_molecules_black_box_warning`
+35. `chembl_mechanism_phase2plus`
+36. `chembl_drug_indications_phase2plus`
+37. `chembl_drug_indications_phase3plus`
+38. `uniprot_human_reviewed`
+39. `uniprot_human_receptors`
+40. `uniprot_human_membrane`
+41. `uniprot_human_nucleus`
+42. `uniprot_human_kinases`
+43. `uniprot_human_gpcr`
+44. `uniprot_human_ion_channels`
+45. `uniprot_human_transporters`
+46. `uniprot_human_secreted`
+47. `uniprot_human_transcription_factors`
+48. `uniprot_human_enzymes`
 
 Most of these are distributed through MoleculeNet/DeepChem mirrors and retain upstream licensing terms.
 BindingDB is included as a versioned ZIP-backed TSV snapshot for literature-derived affinity modeling.
+Open Targets is included as a versioned parquet-part bundle for target prioritisation workflows.
 ChEMBL, UniProt, and openFDA presets are fetched through their public REST APIs and cached locally as JSONL.
 ZINC tranche presets aggregate multiple tranche files per dataset (drug-like MW B-K and logP A-K bins,
 reactivity A/B/C/E) into one cached tabular source during fetch.
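The JSONL caching described above for the REST-backed presets can be pictured with a minimal sketch: walk a paginated endpoint page by page and append one JSON object per line to a local file. Everything named here is an assumption made for illustration, including the `records` / `next` page layout, the `fetch_page` callable, and the helper names; the real ChEMBL, UniProt, and openFDA APIs each paginate differently, and this is not the project's implementation.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Callable, Dict, Iterable, Iterator, Optional

Page = Dict[str, Any]


def iter_records(fetch_page: Callable[[Optional[str]], Page]) -> Iterator[Dict[str, Any]]:
    """Yield records from every page, following the (hypothetical) 'next' cursor."""
    cursor: Optional[str] = None
    while True:
        page = fetch_page(cursor)
        yield from page["records"]
        cursor = page.get("next")
        if cursor is None:
            return


def write_jsonl(records: Iterable[Dict[str, Any]], path: Path) -> int:
    """Write one JSON object per line and return the number of rows written."""
    count = 0
    with path.open("w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record, sort_keys=True) + "\n")
            count += 1
    return count


# Stand-in for real HTTP calls so the sketch runs offline.
FAKE_PAGES: Dict[Optional[str], Page] = {
    None: {"records": [{"id": "A1"}, {"id": "A2"}], "next": "p2"},
    "p2": {"records": [{"id": "A3"}], "next": None},
}


def fake_fetch(cursor: Optional[str]) -> Page:
    return FAKE_PAGES[cursor]


with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "example.jsonl"
    written = write_jsonl(iter_records(fake_fetch), cache_file)
    lines = cache_file.read_text(encoding="utf-8").splitlines()
    cached_ids = [json.loads(line)["id"] for line in lines]
```

Because records are streamed through a generator, only one page needs to be held in memory at a time, which suits large API result sets.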