
AbData

 _________________________ 
< Antibody Data Repo >
 ------------------------- 
        \   ^__^
         \  (oo)\_______
            (__)\       )\/\
                ||----w |
                ||     ||

Installation and venv

Packages are managed with conda; however, the ANARCI package needs to be installed manually.
To create the venv:

conda env create -f environment.yml
conda activate ab-data
pip install -e .    # install ab_data as editable local package
cd ANARCI
python setup.py install    # install ANARCI in venv

Dataset processing ⚙️

PDB dataset

The PDB dataset should only be downloaded once, hence you will probably not need to ever download it from scratch (unless you want a version with a later date cutoff).
Still, if needed, you can run ab_data/scripts/pdb/download_pdb.sh.

At the moment, the produced raw.gz, containing all the raw PDB files in the database, is stored in our Azure File Share at
https://daki.file.core.windows.net/pdb-dataset/2024-06-06

When dealing with PDB (and other big datasets) see Managing big datasets.

Loop-mediated interactions

You can produce the loop-mediated interaction dataset starting from PDB by running
ab_data/scripts/pdb/create_loop_mediated_dataset.py.
Afterwards, you can load an instance of ab_data.datasets.loop_mediated.LoopMediatedPDB from the dataset root. On first load, the dataset will automatically read the PDB files and cache them as .pt files in the same root. The produced inclusion_mask must be saved manually to the dataset root directory as well. The compressed result loop_mediated.gz is also stored in the Azure File Share above.
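Conceptually, the load-then-cache behaviour works like the sketch below. This is a toy stand-in, not the actual ab_data API: it uses pickle in a temporary directory in place of the real .pt files, and the parse function merely simulates PDB parsing.

```python
import os
import pickle
import tempfile

def load_with_cache(root, name, parse_fn):
    """Load `name` from a cached file under `root` if present,
    otherwise parse the raw input and write the cache."""
    cache_path = os.path.join(root, name + ".cache.pkl")
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    data = parse_fn(name)          # expensive step (real code: parse a PDB file)
    with open(cache_path, "wb") as f:
        pickle.dump(data, f)       # real code would torch.save(...) a .pt file
    return data

root = tempfile.mkdtemp()          # stand-in for the dataset root
calls = []
def parse(name):
    calls.append(name)             # track how often we actually parse
    return {"id": name, "atoms": 42}

first = load_with_cache(root, "1abc", parse)   # parses and writes the cache
second = load_with_cache(root, "1abc", parse)  # served from the cache
print(first == second, len(calls))             # identical data, parsed only once
```

The second load never touches the raw files, which is what makes repeated experiments on the same root cheap.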

SabDab dataset

The SabDab dataset can be produced by running ab_data/scripts/process_sabdab.py. Similarly to the loop-mediated case, loading a SabDabDF instance will automatically compute an inclusion mask, which needs to be saved manually to the root directory of the dataset. The compressed processed dataset is saved in our Azure File Share at
https://daki.file.core.windows.net/datasets/SabDab
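The inclusion mask itself is just a per-entry boolean filter saved alongside the dataset. A minimal illustration (the filenames, the resolution criterion, and the JSON format here are illustrative; the real mask is computed by SabDabDF):

```python
import json
import os
import tempfile

entries = [
    {"pdb": "1abc", "resolution": 2.1},
    {"pdb": "2xyz", "resolution": 4.5},   # excluded by the toy criterion below
    {"pdb": "3def", "resolution": 1.8},
]

# One boolean per entry; as a toy criterion, keep structures at 3.0 A or better.
inclusion_mask = [e["resolution"] <= 3.0 for e in entries]

root = tempfile.mkdtemp()                 # stand-in for the dataset root
with open(os.path.join(root, "inclusion_mask.json"), "w") as f:
    json.dump(inclusion_mask, f)          # "save it manually to the root"

kept = [e for e, keep in zip(entries, inclusion_mask) if keep]
print([e["pdb"] for e in kept])           # ['1abc', '3def']
```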

TCR dataset

TODO

Clustered Dataset

The clustered versions of the datasets require clustering the data in different ways; we follow the ideas in
Atomically accurate de novo design of single-domain antibodies, supplementary section 1: "Training Datasets".

The currently utilised versions are obtained by running the following commands.

For SabDab clustering based on CDR loops:
python scripts/cluster_datasets.py --sabdab-dir /datadisk/SabDab/SabDab_10_07_24_DJ_11_09_24

Similarly for the loop-mediated interaction dataset, clustering based on the binder sequences:
python scripts/cluster_datasets.py --loop-mediated-dir /datadisk/pdb_datasets_06_06_2024/loop_mediated/
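The clustering commands above all follow the usual greedy sequence-clustering scheme: a sequence joins the first existing cluster whose representative it matches at or above the identity threshold, otherwise it founds a new cluster. A toy version over equal-length sequences (the real scripts cluster CDR loops or binder sequences with proper alignment):

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    assert len(a) == len(b)
    return sum(x == y for x, y in zip(a, b)) / len(a)

def greedy_cluster(seqs, min_seq_id=0.8):
    """Assign each sequence to the first representative above the threshold."""
    clusters = {}  # representative -> members
    for s in seqs:
        for rep in clusters:
            if identity(rep, s) >= min_seq_id:
                clusters[rep].append(s)
                break
        else:
            clusters[s] = [s]  # no match: this sequence founds a new cluster
    return clusters

seqs = ["ACDEFGHIKL", "ACDEFGHIKV", "WWWWWWWWWW"]
clusters = greedy_cluster(seqs, min_seq_id=0.8)
print(len(clusters))  # 2: the first two sequences share 90% identity
```

Lowering --min-seq-id merges more sequences into fewer clusters, which is why the H3 CDR clustering below uses a looser 0.6 threshold than the 0.8 used for targets.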

Negative datasets

Negative datasets are based on the ideas in the reference above, supplementary sections 5.9-11.

For the negative version of SabDab where the binders are swapped at random (supplementary 5.10), we cluster the targets:
python scripts/cluster_datasets.py --sabdab-dir /datadisk/SabDab/SabDab_10_07_24_DJ_11_09_24/ --output-cluster-dir /datadisk/SabDab/SabDab_10_07_24_DJ_11_09_24/target_clustering/ --cluster-mode targets --min-seq-id 0.8

For the version where H3 CDR loops are randomly swapped (supplementary 5.11), we cluster the H3 CDR loops:
python scripts/cluster_datasets.py --sabdab-dir /datadisk/SabDab/SabDab_10_07_24_DJ_11_09_24/ --output-cluster-dir /datadisk/SabDab/SabDab_10_07_24_DJ_11_09_24/H3_CDR_clustering --cluster-mode H3_CDR --min-seq-id 0.6
Then, we obtain the mappings of compatible anti/nanobodies, i.e. H3 CDR sequences with low similarity but the same length:
python scripts/create_sabdab_with_cdr3_clustering.py --sabdab-dir /datadisk/SabDab/SabDab_10_07_24_DJ_11_09_24 --cluster-file /datadisk/SabDab/SabDab_10_07_24_DJ_11_09_24/clusters.tsv --cdr3-cluster-file /datadisk/SabDab/SabDab_10_07_24_DJ_11_09_24/negative/H3_CDR_clustering/clusters.tsv --out-dir /datadisk/SabDab/SabDab_10_07_24_DJ_11_09_24/negative/H3_CDR_clustering/
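The compatibility criterion can be sketched as follows: two H3 CDR loops are swap-compatible when they have the same length but fall below a sequence-identity threshold. The threshold value and the identity function here are illustrative, not the script's exact implementation:

```python
def identity(a, b):
    """Fraction of matching positions (sequences assumed equal length)."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def compatible(cdr_a, cdr_b, max_id=0.6):
    """Same length (so the swap is geometrically plausible),
    but dissimilar enough to serve as a negative example."""
    return len(cdr_a) == len(cdr_b) and identity(cdr_a, cdr_b) < max_id

print(compatible("ARDYYGSSYF", "ARDYYGSSYW"))  # False: 90% identical
print(compatible("ARDYYGSSYF", "GKWPLNTTHA"))  # True: same length, dissimilar
print(compatible("ARDYYGSSYF", "ARD"))         # False: different lengths
```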


Managing big datasets 🗿

A couple of observations about managing (big) datasets.
While we store most of our data in Azure File Shares, it is not advised to do any processing on them: they are slow for IO operations, and their bandwidth is shared across the whole company.

Use a dataset

To use a dataset, first mount the file shares to your local machine

bash scripts/mount_file_share.sh datasets
bash scripts/mount_file_share.sh rf2
bash scripts/mount_file_share.sh pdb-dataset
bash scripts/mount_file_share.sh weights
...  # other file shares that might appear in the future

Then, to work on a dataset, copy it to a local disk using rsync

rsync -a --info=progress2 SRC DST

and decompress whatever you need.

Notice that some datasets might be too big for your local machine. In such a case, you need to Mount a new data disk to your VM.

Mount a new data disk to your VM

In Azure, you can mount a new data disk to your VM by going to your VM's page in the Azure portal: Settings > Disks > Create and attach a new disk.

However, that is not all. After you have created the disk, log into your VM and find the new disk; it will most probably appear as /dev/sdc. You can list all of the available disks with the lsblk command.

Assuming our new disk is under /dev/sdc, then do the following:

# Format the disk. Most probably, just create a single partition in there. 
# Assuming here that we will create the partition `/dev/sdc1`
sudo fdisk /dev/sdc

# Create a mount point directory where you want to mount the data disk
sudo mkdir /datadisk

# Mount the partition to the path
sudo mount /dev/sdc1 /datadisk

# Check if the shared disk is successfully mounted
ls /datadisk

# Change permissions
sudo chmod -R 775 /datadisk

# To make the mount persist across reboots, read the UUID of your new
# partition (here sdc1) from
sudo blkid
# then persist the mount using the discovered UUID (e.g. below `b732...`)
echo 'UUID=b732ec2e-bbc8-47f6-af29-61ccbd384c82 /datadisk ext4 defaults,nofail 0 2' | sudo tee -a /etc/fstab

Nice! Now you can use /datadisk in your machine to deal with big datasets locally ✨🍰
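For reference, the six whitespace-separated fields in the fstab entry above are: device (here identified by UUID), mount point, filesystem type, mount options (nofail lets the VM boot even if the disk is absent), dump flag, and fsck pass number. A small illustrative parser:

```python
def parse_fstab_line(line):
    """Split an fstab entry into its six named fields."""
    keys = ["device", "mountpoint", "fstype", "options", "dump", "fsck_pass"]
    return dict(zip(keys, line.split()))

entry = parse_fstab_line(
    "UUID=b732ec2e-bbc8-47f6-af29-61ccbd384c82 /datadisk ext4 defaults,nofail 0 2"
)
print(entry["mountpoint"], entry["options"])  # /datadisk defaults,nofail
```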

About

Prepare training data for a single-domain antibody diffusion model.