 ____________________
< Antibody Data Repo >
 --------------------
        \   ^__^
         \  (oo)\_______
            (__)\       )\/\
                ||----w |
                ||     ||
The packages are managed using conda.
However, the ANARCI package needs to be installed manually.
Hence, to create the environment you need to run:
conda env create -f environment.yml
conda activate ab-data
pip install -e . # install ab_data as editable local package
cd ANARCI
python setup.py install # install ANARCI in venv
The PDB dataset should only be downloaded once, hence you will probably never need to
download it from scratch (unless you want a version with a later date cutoff).
Still, if needed, you can run ab_data/scripts/pdb/download_pdb.sh.
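After copying a large archive such as raw.gz, it is worth verifying its integrity before unpacking. Reading a gzip stream to the end checks its checksum, which is the same test `gzip -t` performs on the command line. A self-contained toy sketch (for the real dataset you would point this at the copied raw.gz):

```python
# Reading a gzip stream to the end verifies its checksum, equivalent to `gzip -t`.
import gzip

# Build a toy .gz file so the sketch is self-contained.
with gzip.open("sample.gz", "wb") as f:
    f.write(b"toy data")

# Decompress and discard; a truncated or corrupted archive raises an error here.
with gzip.open("sample.gz", "rb") as f:
    while f.read(1 << 20):
        pass
print("archive OK")
```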
At the moment, the produced raw.gz is stored in our Azure File Share; it contains all the raw PDB files in the dataset:
https://daki.file.core.windows.net/pdb-dataset/2024-06-06
When dealing with PDB (and other big datasets), see Managing big datasets below.
You can produce the loop-mediated interaction dataset starting from PDB by running
ab_data/scripts/pdb/create_loop_mediated_dataset.py.
Afterwards, you can load an instance of ab_data.datasets.loop_mediated.LoopMediatedPDB
from that root. The dataset will automatically load the PDB files and cache them
to .pt files in the same root. The produced inclusion_mask must be saved manually to
the dataset root directory as well.
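Conceptually, the inclusion mask is just a per-entry boolean filter over the dataset. A toy sketch with hypothetical names (the real mask is produced by the dataset class, and its on-disk format follows the repo's conventions):

```python
# Toy illustration of an inclusion mask: one boolean per entry, marking
# whether that entry passed the dataset's filters. All names are hypothetical.
entries = ["1abc", "2xyz", "3def"]       # hypothetical PDB IDs
inclusion_mask = [True, False, True]     # e.g. 2xyz failed a filter
included = [e for e, keep in zip(entries, inclusion_mask) if keep]
print(included)  # ['1abc', '3def']
```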
The compressed result loop_mediated.gz is also stored in the Azure File Share above.
The SabDab dataset can be produced by running
ab_data/scripts/process_sabdab.py.
Similarly to the loop-mediated case, loading a SabDabDF instance will automatically
compute an inclusion mask, which needs to be saved manually to the root directory of
the dataset.
The compressed processed dataset is saved in our Azure File Share at
https://daki.file.core.windows.net/datasets/SabDab
TODO
The clustered versions of the datasets require performing clustering in different ways; we follow the ideas in
Atomically accurate de novo design of single-domain antibodies, supplementary section 1: "Training Datasets".
The currently utilised versions are obtained by running the following commands.
For SabDab clustering based on CDR loops:
python scripts/cluster_datasets.py --sabdab-dir /datadisk/SabDab/SabDab_10_07_24_DJ_11_09_24
Similarly for the loop-mediated interaction dataset, clustering based on the binder sequences:
python scripts/cluster_datasets.py --loop-mediated-dir /datadisk/pdb_datasets_06_06_2024/loop_mediated/
Negative datasets are based on the ideas in the reference above, supplementary sections 5.9-11.
For the negative version of SabDab where the binders are swapped at random (supplementary 5.10), we cluster the targets:
python scripts/cluster_datasets.py --sabdab-dir /datadisk/SabDab/SabDab_10_07_24_DJ_11_09_24/ --output-cluster-dir /datadisk/SabDab/SabDab_10_07_24_DJ_11_09_24/target_clustering/ --cluster-mode targets --min-seq-id 0.8
For the version where H3 CDR loops are randomly swapped (supplementary 5.11), we cluster the H3 CDR loops:
python scripts/cluster_datasets.py --sabdab-dir /datadisk/SabDab/SabDab_10_07_24_DJ_11_09_24/ --output-cluster-dir /datadisk/SabDab/SabDab_10_07_24_DJ_11_09_24/H3_CDR_clustering --cluster-mode H3_CDR --min-seq-id 0.6
Then, we obtain the mappings of compatible anti/nanobodies, i.e. H3 CDR sequences with low similarity but the same length:
python scripts/create_sabdab_with_cdr3_clustering.py --sabdab-dir /datadisk/SabDab/SabDab_10_07_24_DJ_11_09_24 --cluster-file /datadisk/SabDab/SabDab_10_07_24_DJ_11_09_24/clusters.tsv --cdr3-cluster-file /datadisk/SabDab/SabDab_10_07_24_DJ_11_09_24/negative/H3_CDR_clustering/clusters.tsv --out-dir /datadisk/SabDab/SabDab_10_07_24_DJ_11_09_24/negative/H3_CDR_clustering/
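The compatibility logic above can be sketched in a few lines: two H3 CDR loops are valid swap partners when they have the same length but fall into different sequence clusters. A self-contained toy sketch, assuming the cluster files use the standard MMseqs2 two-column TSV format (representative, member); all sequences and names below are made up:

```python
# Toy sketch: pair H3 CDR loops of the same length from different clusters.
import csv
from collections import defaultdict
from io import StringIO

# Assumed MMseqs2-style cluster file: representative<TAB>member.
clusters_tsv = "repA\trepA\nrepA\tseq1\nrepB\trepB\nrepB\tseq2\n"

h3_seqs = {
    "repA": "ARDYYGMDV",  # length 9
    "seq1": "ARDYWGMDV",  # length 9, same cluster as repA
    "repB": "ARGGSYFDY",  # length 9, different cluster
    "seq2": "ARDFDI",     # length 6
}

# Map each member to its cluster representative.
member_to_rep = {}
for rep, member in csv.reader(StringIO(clusters_tsv), delimiter="\t"):
    member_to_rep[member] = rep

# Group sequences by H3 length.
by_length = defaultdict(list)
for name, seq in h3_seqs.items():
    by_length[len(seq)].append(name)

# Compatible partners: same length, different cluster.
compatible = {
    name: [
        other
        for other in by_length[len(h3_seqs[name])]
        if other != name and member_to_rep[other] != member_to_rep[name]
    ]
    for name in h3_seqs
}

print(compatible["repA"])  # ['repB'] — same length, different cluster
```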
A couple of observations about managing (big) datasets.
While we store most of our data in Azure File Shares, it is not advised to do any
processing directly on them: they are slow for IO operations, and their bandwidth is
shared across the whole company.
To use a dataset, first mount the file shares to your local machine
bash scripts/mount_file_share.sh datasets
bash scripts/mount_file_share.sh rf2
bash scripts/mount_file_share.sh pdb-dataset
bash scripts/mount_file_share.sh weights
... # other file shares that might appear in the future
Then, to use the data, copy it to a local disk using rsync
rsync -a --info=progress2 SRC DST
and decompress whatever you need.
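For example, unpacking a gzipped tarball after the copy. A self-contained toy sketch using Python's stdlib (`tar -xzf` from the shell is equivalent; the real archives are the raw.gz / loop_mediated.gz files mentioned above):

```python
# Toy sketch: create a gzipped tarball, then unpack it the same way you
# would a real dataset archive copied to the local disk.
import os
import tarfile
import tempfile

work = tempfile.mkdtemp()

# Build a toy archive so the sketch is self-contained.
src = os.path.join(work, "entry.txt")
with open(src, "w") as f:
    f.write("toy entry")
archive = os.path.join(work, "dataset.tar.gz")
with tarfile.open(archive, "w:gz") as tar:
    tar.add(src, arcname="entry.txt")

# Decompress into the target directory.
out = os.path.join(work, "out")
with tarfile.open(archive, "r:gz") as tar:
    tar.extractall(out)
print(open(os.path.join(out, "entry.txt")).read())  # toy entry
```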
Notice that some datasets might be too big for your local machine. In such a case, you need to Mount a new data disk to your VM.
In Azure, you can mount a new data disk to your VM by going to your VM's page in the
Azure portal: Settings > Disks > Create and attach a new disk.
However, that is not all. After you have created the disk, log into your VM and look
for your disk; it is most probably /dev/sdc. You can list all of the available disks
with the lsblk command.
Assuming our new disk is under /dev/sdc, then do the following:
# Format the disk. Most probably, just create a single partition in there.
# Assuming here that we will create the partition `/dev/sdc1`
sudo fdisk /dev/sdc
# Create a mount point directory where you want to mount the data disk
sudo mkdir /datadisk
# Mount the partition to the path
sudo mount /dev/sdc1 /datadisk
# Check that the new disk is successfully mounted
ls /datadisk
# Make the mount point writable (you may also want: sudo chown -R $USER /datadisk)
sudo chmod -R 775 /datadisk
# To make the mount persist across reboots, read your partition's UUID from
sudo blkid
# then persist the mount using the discovered UUID (e.g. below `b732...`)
echo 'UUID=b732ec2e-bbc8-47f6-af29-61ccbd384c82 /datadisk ext4 defaults,nofail 0 2' | sudo tee -a /etc/fstab
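The six fields of that fstab line are: the device (addressed by UUID so the entry survives device renaming), the mount point, the filesystem type, the mount options (nofail keeps the VM booting even if the disk is detached), the dump flag, and the fsck pass order (2 for non-root filesystems). Annotated:

```
# <device (by UUID)>                       <mount point> <type> <options>       <dump> <fsck pass>
UUID=b732ec2e-bbc8-47f6-af29-61ccbd384c82  /datadisk     ext4   defaults,nofail 0      2
```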
Nice! Now you can use /datadisk on your machine to deal with big datasets
locally ✨🍰