One of the inclusion criteria for a drug in Open Targets is that the same drug exists in DrugBank. This check was done using a ChEMBL ID/DrugBank ID table from 5 years ago whose origin I didn't know. Because of issue opentargets/issues#4167, I was looking at longer-term solutions that would allow us to keep the mappings up to date. ChEMBL shows cross-references to DrugBank on its pages because it uses the UniChem API to get them. I've just run into an old issue of ours explaining that the obsolete lookup tables we have come from the UniChem FTP. So the source for the PIS rule is now the FTP directory itself rather than the latest file on the bucket (the directory is not versioned as far as I can see and always points to the latest release). The format of the latest file is **the same** and we have over a **33% increase** in the number of cross-references (8986 vs 6523).
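To sanity-check a refreshed mapping before switching sources, the old and new tables can be compared directly. The sketch below is only an illustration: it assumes each table is a two-column, tab-separated file with one header row (ChEMBL ID, then DrugBank ID), and the function names and sample IDs are hypothetical rather than part of PIS.

```python
import csv
import io


def load_pairs(handle):
    """Read (chembl_id, drugbank_id) pairs from a tab-separated file object.

    Assumption: one header row, then two-column rows. Blank or short rows are skipped.
    """
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # skip the header row
    return {(row[0].strip(), row[1].strip()) for row in reader if len(row) >= 2}


def compare_mappings(old_pairs, new_pairs):
    """Summarise how the cross-reference set changed between two releases."""
    if old_pairs:
        growth = (len(new_pairs) - len(old_pairs)) / len(old_pairs) * 100
    else:
        growth = float("nan")  # growth is undefined without a baseline
    return {
        "old": len(old_pairs),
        "new": len(new_pairs),
        "added": len(new_pairs - old_pairs),
        "removed": len(old_pairs - new_pairs),
        "growth_pct": growth,
    }


if __name__ == "__main__":
    # Placeholder IDs, for illustration only.
    old = io.StringIO("chembl\tdrugbank\nCHEMBL_A\tDB_A\n")
    new = io.StringIO("chembl\tdrugbank\nCHEMBL_A\tDB_A\nCHEMBL_B\tDB_B\n")
    print(compare_mappings(load_pairs(old), load_pairs(new)))
```

Reporting removed pairs alongside added ones matters here because a larger total can still hide mappings that disappeared between releases.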
Pull request overview
Adds a new “baseline_expression” ingestion step to the pipeline configuration so that baseline expression inputs (GTEx, DICE, Tabula Sapiens, PRIDE) are downloaded alongside the other PTS prerequisites.
Changes:
- Introduces a new `baseline_expression` step that downloads baseline expression datasets into `input/baseline_expression/...`.
- Adds `gtex_version` to the scratchpad and uses it in GTEx download URLs.
- Adds `foreach`/`do` expansions for downloading multiple DICE and PRIDE datasets.
      source: https://storage.googleapis.com/adult-gtex/bulk-gex/v${gtex_version}/rna-seq/GTEx_Analysis_v10_RNASeQCv2.4.2_gene_tpm.gct.gz
      destination: input/baseline_expression/bulkRNAseq/GTEx/GTEx_Analysis_v10_RNASeQCv2.4.2_gene_tpm.gct.gz

    - name: copy GTEx sample attributes
      source: https://storage.googleapis.com/adult-gtex/annotations/v${gtex_version}/metadata-files/GTEx_Analysis_v10_Annotations_SampleAttributesDS.txt
      destination: input/baseline_expression/bulkRNAseq/GTEx/GTEx_Analysis_v10_Annotations_SampleAttributesDS.txt

    - name: copy GTEx subject phenotypes
      source: https://storage.googleapis.com/adult-gtex/annotations/v${gtex_version}/metadata-files/GTEx_Analysis_v10_Annotations_SubjectPhenotypesDS.txt
      destination: input/baseline_expression/bulkRNAseq/GTEx/GTEx_Analysis_v10_Annotations_SubjectPhenotypesDS.txt
gtex_version is parameterized in the GTEx URLs, but the filenames and destinations are hard-coded to v10 (e.g., GTEx_Analysis_v10_...). If gtex_version is changed, these URLs will likely 404 or create misleading v10-named outputs. Consider either removing gtex_version and hard-coding v10 in the URL path, or fully parameterizing the v10 parts of the filenames/destinations as well.
Suggested change:

      source: https://storage.googleapis.com/adult-gtex/bulk-gex/v${gtex_version}/rna-seq/GTEx_Analysis_v${gtex_version}_RNASeQCv2.4.2_gene_tpm.gct.gz
      destination: input/baseline_expression/bulkRNAseq/GTEx/GTEx_Analysis_v${gtex_version}_RNASeQCv2.4.2_gene_tpm.gct.gz
    - name: copy GTEx sample attributes
      source: https://storage.googleapis.com/adult-gtex/annotations/v${gtex_version}/metadata-files/GTEx_Analysis_v${gtex_version}_Annotations_SampleAttributesDS.txt
      destination: input/baseline_expression/bulkRNAseq/GTEx/GTEx_Analysis_v${gtex_version}_Annotations_SampleAttributesDS.txt
    - name: copy GTEx subject phenotypes
      source: https://storage.googleapis.com/adult-gtex/annotations/v${gtex_version}/metadata-files/GTEx_Analysis_v${gtex_version}_Annotations_SubjectPhenotypesDS.txt
      destination: input/baseline_expression/bulkRNAseq/GTEx/GTEx_Analysis_v${gtex_version}_Annotations_SubjectPhenotypesDS.txt
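The point of the review comment, that the version should appear in exactly one place, can be illustrated outside the pipeline config. The following is a minimal Python sketch, not how PIS resolves `${gtex_version}`. The helper `gtex_downloads` and the constants are hypothetical, and the path templates are copied from the entries in this diff. Whether files exist under these names for versions other than 10 is not verified here.

```python
GTEX_BUCKET = "https://storage.googleapis.com/adult-gtex"
DEST_DIR = "input/baseline_expression/bulkRNAseq/GTEx"

# (subdirectory template, filename template) pairs taken from the diff.
# "{v}" is the only placeholder; the RNASeQC version stays hard-coded.
GTEX_FILES = [
    ("bulk-gex/v{v}/rna-seq", "GTEx_Analysis_v{v}_RNASeQCv2.4.2_gene_tpm.gct.gz"),
    ("annotations/v{v}/metadata-files", "GTEx_Analysis_v{v}_Annotations_SampleAttributesDS.txt"),
    ("annotations/v{v}/metadata-files", "GTEx_Analysis_v{v}_Annotations_SubjectPhenotypesDS.txt"),
]


def gtex_downloads(version):
    """Return (source_url, destination_path) pairs for one GTEx version."""
    pairs = []
    for subdir, filename in GTEX_FILES:
        name = filename.format(v=version)
        source = f"{GTEX_BUCKET}/{subdir.format(v=version)}/{name}"
        pairs.append((source, f"{DEST_DIR}/{name}"))
    return pairs


if __name__ == "__main__":
    for source, destination in gtex_downloads("10"):
        print(source, "->", destination)
```

Note that the RNASeQC version embedded in the TPM filename may also differ between GTEx releases, so parameterizing `gtex_version` alone does not guarantee that a version bump resolves to an existing file.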
    - name: copy GTEx subject phenotypes
      source: https://storage.googleapis.com/adult-gtex/annotations/v${gtex_version}/metadata-files/GTEx_Analysis_v10_Annotations_SubjectPhenotypesDS.txt
      destination: input/baseline_expression/bulkRNAseq/GTEx/GTEx_Analysis_v10_Annotations_SubjectPhenotypesDS.txt
There appears to be trailing whitespace on the blank line after the GTEx subject phenotypes destination. Please remove it to avoid lint/pre-commit failures and keep diffs clean.
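For reference, lines like this can be located before committing with a few lines of Python. This is a generic sketch and not the project's actual lint or pre-commit hook; the function name is hypothetical.

```python
def trailing_whitespace_lines(text):
    """Return 1-based numbers of lines that end in spaces or tabs.

    Whitespace-only lines count too, since they are the usual culprit in YAML.
    """
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if line != line.rstrip(" \t")
    ]


if __name__ == "__main__":
    sample = "destination: a.txt\n      \n- name: next step\n"
    print(trailing_whitespace_lines(sample))
```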
Should pull all files required for running PTS