agentcures
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
Lines changed: 43 additions & 0 deletions
diff --git a/.gitignore b/.gitignore
Lines changed: 28 additions & 0 deletions
diff --git a/CHANGELOG.md b/CHANGELOG.md
Lines changed: 27 additions & 0 deletions
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
Lines changed: 25 additions & 0 deletions
diff --git a/LICENSE b/LICENSE
Lines changed: 21 additions & 0 deletions
diff --git a/README.md b/README.md
Lines changed: 154 additions & 0 deletions
diff --git a/SECURITY.md b/SECURITY.md
Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,43 @@
+name: CI
+
+on:
+  push:
+  pull_request:
+
+jobs:
+  quality:
+    runs-on: ubuntu-latest
+    strategy:
+      fail-fast: false
+      matrix:
+        python-version: ["3.11", "3.12", "3.13", "3.14"]
+
+    steps:
+      - name: Checkout
+        uses: actions/checkout@v4
+
+      - name: Set up Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: ${{ matrix.python-version }}
+
+      - name: Install dependencies
+        run: |
+          python -m pip install --upgrade pip
+          pip install -e ".[dev]"
+
+      - name: Ruff
+        run: ruff check src tests
+
+      - name: Mypy
+        run: mypy src
+
+      - name: Pytest
+        run: pytest -q
+
+      - name: Build and check distributions
+        if: matrix.python-version == '3.11'
+        run: |
+          rm -rf build dist
+          python -m build
+          python -m twine check dist/*
@@ -0,0 +1,28 @@
+__pycache__/
+*.py[cod]
+*$py.class
+
+build/
+dist/
+*.egg-info/
+pip-wheel-metadata/
+
+.venv/
+.venv*/
+venv/
+env/
+
+.pytest_cache/
+.mypy_cache/
+.ruff_cache/
+
+.coverage
+.coverage.*
+htmlcov/
+
+.audit*/
+artifacts/
+.artifacts/
+
+.DS_Store
+poetry.toml
@@ -0,0 +1,27 @@
+# Changelog
+
+All notable changes to this project will be documented in this file.
+
+The format is based on Keep a Changelog, and this project follows Semantic Versioning.
+
+## [Unreleased]
+
+### Added
+- CI workflow for linting, type checking, tests, and distribution validation.
+- Security and contributing documentation.
+- Pluggable cache backend contract (`CacheBackend`) with `DataCache` as default implementation.
+- Dataset usage-note metadata for every dataset via explicit or category-derived defaults.
+- Five ZINC tranche-based drug-like dataset targets across purchasability tiers (in-stock, agent, wait-ok, boutique, annotated), each configured as a multi-tranche fetch target.
+
+### Changed
+- Source validation now treats mirrored URLs as healthy when at least one source is reachable.
+- Dataset URL modes now support concatenating multiple source URLs into one cached raw file (`url_mode="concat"`).
+- Refresh fetch now fails instead of silently falling back to stale cached data.
+- Development metadata upgraded from Alpha to Beta.
+- Fetch and materialize metadata now include normalized dataset snapshots (description, usage notes, source, licensing).
+- CLI JSON output for `list` and `fetch` now includes dataset metadata for easier automation.
+
+### Fixed
+- mypy issues in catalog and IO typing.
+- Known failing fallback dataset URLs in the default catalog.
+- Local artifact hygiene via `.gitignore`.
@@ -0,0 +1,25 @@
+# Contributing
+
+## Development setup
+
+```bash
+python -m venv .venv
+. .venv/bin/activate
+pip install -e ".[dev]"
+```
+
+## Local quality checks
+
+```bash
+ruff check src tests
+mypy src
+pytest -q
+python -m build
+python -m twine check dist/*
+```
+
+## Pull requests
+
+- Keep changes focused and include tests for behavior changes.
+- Update `README.md` and `CHANGELOG.md` when user-visible behavior changes.
+- Ensure CI passes before requesting review.
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2026 JJ Ben-Joseph
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
@@ -0,0 +1,154 @@
+# refua-data
+
+`refua-data` is the Refua data layer for drug discovery. It provides a curated dataset catalog, intelligent local caching, and parquet materialization optimized for downstream modeling and campaign workflows.
+
+## What it provides
+
+- A built-in catalog of useful drug-discovery datasets.
+- Dataset-aware download pipeline with cache reuse and metadata tracking.
+- Pluggable cache backend architecture (filesystem cache by default).
+- API dataset ingestion for paginated JSON endpoints (for example ChEMBL and UniProt).
+- HTTP conditional refresh support (`ETag` / `Last-Modified`) when enabled.
+- Incremental parquet materialization (chunked processing + partitioned parquet parts).
+- CLI for listing, fetching, and materializing datasets.
+- Source health checks via `validate-sources` for CI and environment diagnostics.
+- Rich dataset metadata snapshots (description + usage notes) persisted in cache metadata.
+
+## Included datasets
+
+The default catalog includes local-file/HTTP datasets plus API presets useful in drug discovery, including **ZINC**, **ChEMBL**, and **UniProt**.
+
+1. `zinc15_250k` (ZINC)
+2. `zinc15_tranche_druglike_instock` (ZINC tranche)
+3. `zinc15_tranche_druglike_agent` (ZINC tranche)
+4. `zinc15_tranche_druglike_wait_ok` (ZINC tranche)
+5. `zinc15_tranche_druglike_boutique` (ZINC tranche)
+6. `zinc15_tranche_druglike_annotated` (ZINC tranche)
+7. `tox21`
+8. `bbbp`
+9. `bace`
+10. `clintox`
+11. `sider`
+12. `hiv`
+13. `muv`
+14. `esol`
+15. `freesolv`
+16. `lipophilicity`
+17. `pcba`
+18. `chembl_activity_ki_human`
+19. `chembl_activity_ic50_human`
+20. `chembl_assays_binding_human`
+21. `chembl_targets_human_single_protein`
+22. `chembl_molecules_phase3plus`
+23. `uniprot_human_reviewed`
+24. `uniprot_human_kinases`
+25. `uniprot_human_gpcr`
+26. `uniprot_human_ion_channels`
+27. `uniprot_human_transporters`
+
+Most of these are distributed through MoleculeNet/DeepChem mirrors and retain upstream licensing terms.
+ChEMBL and UniProt presets are fetched through their public REST APIs and cached locally as JSONL.
+ZINC tranche presets aggregate multiple tranche files per dataset (drug-like MW B-K and logP A-K bins,
+reactivity A/B/C/E) into one cached tabular source during fetch.
+
+## Install
+
+```bash
+cd refua-data
+pip install -e .
+```
+
+## CLI quickstart
+
+List datasets:
+
+```bash
+refua-data list
+```
+
+Validate all dataset sources:
+
+```bash
+refua-data validate-sources
+```
+
+Validate a subset and fail CI on probe failures:
+
+```bash
+refua-data validate-sources chembl_activity_ki_human uniprot_human_kinases --fail-on-error
+```
+
+JSON output for automation:
+
+```bash
+refua-data validate-sources --json --fail-on-error
+```
+
+For datasets with multiple mirrors, source validation succeeds when at least one configured source
+is reachable. Failed fallback attempts are included in the result details.
+
+Fetch raw data with cache:
+
+```bash
+refua-data fetch zinc15_250k
+```
+
+Fetch API-based presets:
+
+```bash
+refua-data fetch chembl_activity_ki_human
+refua-data fetch uniprot_human_kinases
+```
+
+Materialize parquet:
+
+```bash
+refua-data materialize zinc15_250k
+```
+
+Refresh against remote metadata:
+
+```bash
+refua-data fetch zinc15_250k --refresh
+```
+
+For API datasets, `--refresh` re-runs the API query (with conditional headers on the first page when available).
+
+## Cache layout
+
+By default, the cache root is:
+
+- `~/.cache/refua-data`
+
+Override with:
+
+- `REFUA_DATA_HOME=/custom/path`
+
+Layout:
+
+- `raw/<dataset>/<version>/...` downloaded source files
+- `_meta/raw/<dataset>/<version>/...json` raw metadata (`etag`, `sha256`, API request signature, rows/pages, dataset description/usage metadata)
+- `parquet/<dataset>/<version>/part-*.parquet` materialized parquet parts
+- `_meta/parquet/<dataset>/<version>/manifest.json` parquet manifest metadata with dataset snapshot
+
+## Python API
+
+```python
+from refua_data import DatasetManager
+
+manager = DatasetManager()
+manager.fetch("zinc15_250k")
+manager.fetch("chembl_activity_ki_human")
+result = manager.materialize("zinc15_250k")
+print(result.parquet_dir)
+```
+
+`DataCache` is the default cache backend. You can pass a custom backend object that implements
+the same interface (`ensure`, `raw_file`, `raw_meta`, `parquet_dir`, `parquet_manifest`,
+`read_json`, `write_json`) to make storage pluggable.
+
+## Licensing notes
+
+- `refua-data` package code is MIT licensed.
+- Dataset content licenses are dataset-specific and controlled by upstream providers.
+- Always verify dataset licensing and allowed use before redistribution or commercial deployment.
@@ -0,0 +1,12 @@
+# Security Policy
+
+## Reporting a vulnerability
+
+Please report suspected vulnerabilities privately by emailing `jj@tensorspace.ai`.
+
+Include:
+- A clear description of the issue
+- Affected versions and environment details
+- Steps to reproduce or a proof of concept
+
+We will acknowledge receipt and work on triage and remediation.