A lightweight Python package and CLI tool for listing and loading AIRCHECK datasets, with built-in support for column selection, progress tracking, and automatic local caching. It provides programmatic access to the same datasets that are available for download via the AIRCHECK website. Before using any dataset, please ensure you have read and agreed to the HitGen End User License Agreement (EULA).
- Use virtual environments to avoid dependency conflicts:

      python -m venv .venv
      source .venv/bin/activate  # On Windows use .venv\Scripts\activate

- Always validate that your code respects data privacy and licensing terms.
- Avoid storing large datasets in version control. Let `aircheckdata` handle caching.

You can install the package from PyPI:

    pip install aircheckdata

For development and testing (optional):

    pip install -e ".[dev]"

Installation verification (optional)

Verify that the installation was successful by running the unit tests:

    pytest tests/

`aircheckdata` can be used directly from your Python environment to:
- List pre-configured datasets
- View available columns and metadata
- Load datasets with optional filtering and progress indicators

    from aircheckdata import list_datasets

    # List the pre-configured datasets with their descriptions
    datasets = list_datasets()
    for name, desc in datasets.items():
        print(f"{name}: {desc}")

    from aircheckdata import get_columns

    # View the available columns of a dataset
    columns = get_columns('HitGen', 'WDR91')
    names = [item["name"] for item in columns]
    print("Column Names: \n", names)

    from aircheckdata import load_dataset

    # Download the specified columns with a progress bar
    df = load_dataset('HitGen', 'WDR91', columns=['ECFP6', 'ECFP4', 'LABEL'])
    # Or download them without a progress bar, which is faster and more memory efficient
    df = load_dataset('HitGen', 'WDR91', columns=['ECFP6', 'ECFP4', 'LABEL'], show_progress=False)
    # Download once, then cache locally (by default this loads the HitGen WDR91 target)
    df = load_dataset()
    print(df.head())

    # Load only selected columns
    df = load_dataset('HitGen', 'WDR91', columns=['ECFP6', 'ECFP4', 'LABEL'])
    # Show progress while loading
    df = load_dataset('HitGen', 'WDR91', show_progress=True)

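
Because `get_columns` returns a list of entries with a `name` key (as in the example above), requested column names can be checked before a download starts. The following is a minimal sketch under that assumption; `find_missing_columns` and the sample metadata are hypothetical and not part of `aircheckdata`:

```python
from typing import Dict, List, Sequence


def find_missing_columns(requested: Sequence[str], metadata: List[Dict[str, str]]) -> List[str]:
    """Return the requested column names that are absent from the metadata, in request order."""
    available = {item["name"] for item in metadata}
    return [name for name in requested if name not in available]


# Hypothetical metadata, shaped like the get_columns() result used earlier.
sample_metadata = [{"name": "ECFP4"}, {"name": "ECFP6"}, {"name": "LABEL"}]

missing = find_missing_columns(["ECFP6", "LABEL", "ecfp4"], sample_metadata)
print(missing)  # ['ecfp4'] (the comparison is case-sensitive)
```

In practice, `sample_metadata` would be replaced by the result of `get_columns('HitGen', 'WDR91')`, and `load_dataset` would be called only when the returned list is empty.
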
The `aircheckdata` CLI enables quick access to datasets from the command line:

    aircheckdata --help

| Option | Description |
|---|---|
| `list` | List all available datasets |
| `columns <Provider Name> "<Target Name>"` | Select columns to load or list columns of a dataset |

    # List datasets
    aircheckdata list

    # View available columns for a specific target (defaults to HitGen WDR91 if no provider and target name are given)
    # aircheckdata columns
    aircheckdata columns <Provider Name> <Target Name>
    aircheckdata columns HitGen "WDR12"

This package is distributed under the MIT License. However, the datasets it provides access to are subject to the HitGen End User License Agreement (EULA).

⚠️ By using any dataset accessed via `aircheckdata`, you agree to abide by the HitGen EULA.

Please refer to the full license terms and conditions here: 👉 https://www.aircheck.ai/docs/HitGen.pdf

Currently available datasets include:
- WDR91: A curated Parquet dataset provided by HitGen
- WDR12: A curated Parquet dataset provided by HitGen
- SETDB1: A curated Parquet dataset provided by HitGen
- LRRK2: A curated Parquet dataset provided by HitGen
- DCAF7: A curated Parquet dataset provided by HitGen
- Chicken PLCZ1: A curated Parquet dataset provided by HitGen
- Chicken PLCZ1 known inhibitor: A curated Parquet dataset provided by HitGen
- Human PLCZ1 (D202R OR H170A&H215A): A curated Parquet dataset provided by HitGen
- Human PLCZ1 (D202R OR H170A&H215A) known inhibitor: A curated Parquet dataset provided by HitGen
- PLCZ1 (Chicken or Human mutants): A curated Parquet dataset provided by HitGen
- PLCZ1 off target His-PLCD1;2:756: A curated Parquet dataset provided by HitGen

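
Several of the target names above contain spaces, parentheses, `&`, or `;`, all of which a shell would otherwise interpret, so they need quoting when passed to the CLI. The sketch below uses the standard library's `shlex.quote` to build a safely quoted command line. It assumes the `columns` subcommand accepts the target name exactly as listed, and `build_columns_command` is a hypothetical helper, not part of `aircheckdata`:

```python
import shlex


def build_columns_command(provider: str, target: str) -> str:
    """Build a shell-safe `aircheckdata columns` command line (hypothetical helper)."""
    args = ["aircheckdata", "columns", provider, target]
    # shlex.quote leaves simple words untouched and single-quotes everything else.
    return " ".join(shlex.quote(arg) for arg in args)


print(build_columns_command("HitGen", "WDR12"))
# aircheckdata columns HitGen WDR12
print(build_columns_command("HitGen", "Human PLCZ1 (D202R OR H170A&H215A)"))
# aircheckdata columns HitGen 'Human PLCZ1 (D202R OR H170A&H215A)'
```

When the command is launched from Python instead of being pasted into a shell, passing the argument list directly to `subprocess.run` avoids the quoting step altogether.
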
- Python 3.7+