
Add baseline expression ingestion step #183

Open
Tobi1kenobi wants to merge 6 commits into main from ta-baseline_expression

@Tobi1kenobi
Contributor

Should pull all files required for running PTS

Tobi1kenobi and others added 5 commits February 18, 2026 12:08
One of the inclusion criteria for a drug in Open Targets is that the same drug exists in DrugBank.
This check used a ChEMBL ID/DrugBank ID table from 5 years ago whose origin I didn't know.

Because of the issue opentargets/issues#4167, I was looking at longer-term solutions that would give us up-to-date mappings.

ChEMBL shows cross-references to DrugBank on its pages because it uses the UniChem API to get them. I've just run into an old issue of ours explaining that the obsolete lookup tables we have come from the UniChem FTP. So the source for the PIS rule is now the FTP directory itself rather than the latest file on the bucket (the directory is not versioned as far as I can see, and always points to the latest).

The format of the latest file is **the same** and we have a roughly **38% increase** in the number of cross-references (8986 vs 6523).
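
The size of the increase can be checked directly from the two counts quoted in the commit message. This is only a quick arithmetic sketch; the variable names are illustrative and not part of the pipeline.

```python
# Cross-reference counts quoted in the commit message.
old_count = 6523  # rows in the older lookup table
new_count = 8986  # rows in the latest UniChem file

added = new_count - old_count
relative_increase = added / old_count

print(f"{added} additional cross-references ({relative_increase:.1%} increase)")
```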

Copilot AI left a comment


Pull request overview

Adds a new “baseline_expression” ingestion step to the pipeline configuration so that baseline expression inputs (GTEx, DICE, Tabula Sapiens, PRIDE) are downloaded alongside the other PTS prerequisites.

Changes:

  • Introduces a new baseline_expression step that downloads baseline expression datasets into input/baseline_expression/....
  • Adds gtex_version to the scratchpad and uses it in GTEx download URLs.
  • Adds foreach/do expansions for downloading multiple DICE and PRIDE datasets.
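
As a rough illustration of what a foreach/do expansion does, the sketch below turns one task template into one concrete task per item. It is not the pipeline's implementation: the `expand` helper, the `${each}` placeholder name, the dataset names, and the example.org URL are all assumptions made for the example.

```python
from string import Template


def expand(foreach, do):
    """Return one concrete task per item by substituting ${each} into every field."""
    tasks = []
    for item in foreach:
        tasks.append(
            {key: Template(value).substitute(each=item) for key, value in do.items()}
        )
    return tasks


# Hypothetical task template and dataset names, for illustration only.
do = {
    "name": "download ${each}",
    "source": "https://example.org/data/${each}.csv",
    "destination": "input/baseline_expression/example/${each}.csv",
}
tasks = expand(["dataset_a", "dataset_b"], do)

for task in tasks:
    print(task["source"], "->", task["destination"])
```

Keeping the list of items separate from the template means a new dataset is added by extending the list rather than by copying a whole task.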


Comment thread config.yaml Outdated
Comment on lines +24 to +33
  source: https://storage.googleapis.com/adult-gtex/bulk-gex/v${gtex_version}/rna-seq/GTEx_Analysis_v10_RNASeQCv2.4.2_gene_tpm.gct.gz
  destination: input/baseline_expression/bulkRNAseq/GTEx/GTEx_Analysis_v10_RNASeQCv2.4.2_gene_tpm.gct.gz

- name: copy GTEx sample attributes
  source: https://storage.googleapis.com/adult-gtex/annotations/v${gtex_version}/metadata-files/GTEx_Analysis_v10_Annotations_SampleAttributesDS.txt
  destination: input/baseline_expression/bulkRNAseq/GTEx/GTEx_Analysis_v10_Annotations_SampleAttributesDS.txt

- name: copy GTEx subject phenotypes
  source: https://storage.googleapis.com/adult-gtex/annotations/v${gtex_version}/metadata-files/GTEx_Analysis_v10_Annotations_SubjectPhenotypesDS.txt
  destination: input/baseline_expression/bulkRNAseq/GTEx/GTEx_Analysis_v10_Annotations_SubjectPhenotypesDS.txt
Copilot AI Mar 3, 2026

gtex_version is parameterized in the GTEx URLs, but the filenames and destinations are hard-coded to v10 (e.g., GTEx_Analysis_v10_...). If gtex_version is changed, these URLs will likely 404 or create misleading v10-named outputs. Consider either removing gtex_version and hard-coding v10 in the URL path, or fully parameterizing the v10 parts of the filenames/destinations as well.

Suggested change

Current:
  source: https://storage.googleapis.com/adult-gtex/bulk-gex/v${gtex_version}/rna-seq/GTEx_Analysis_v10_RNASeQCv2.4.2_gene_tpm.gct.gz
  destination: input/baseline_expression/bulkRNAseq/GTEx/GTEx_Analysis_v10_RNASeQCv2.4.2_gene_tpm.gct.gz
- name: copy GTEx sample attributes
  source: https://storage.googleapis.com/adult-gtex/annotations/v${gtex_version}/metadata-files/GTEx_Analysis_v10_Annotations_SampleAttributesDS.txt
  destination: input/baseline_expression/bulkRNAseq/GTEx/GTEx_Analysis_v10_Annotations_SampleAttributesDS.txt
- name: copy GTEx subject phenotypes
  source: https://storage.googleapis.com/adult-gtex/annotations/v${gtex_version}/metadata-files/GTEx_Analysis_v10_Annotations_SubjectPhenotypesDS.txt
  destination: input/baseline_expression/bulkRNAseq/GTEx/GTEx_Analysis_v10_Annotations_SubjectPhenotypesDS.txt

Suggested:
  source: https://storage.googleapis.com/adult-gtex/bulk-gex/v${gtex_version}/rna-seq/GTEx_Analysis_v${gtex_version}_RNASeQCv2.4.2_gene_tpm.gct.gz
  destination: input/baseline_expression/bulkRNAseq/GTEx/GTEx_Analysis_v${gtex_version}_RNASeQCv2.4.2_gene_tpm.gct.gz
- name: copy GTEx sample attributes
  source: https://storage.googleapis.com/adult-gtex/annotations/v${gtex_version}/metadata-files/GTEx_Analysis_v${gtex_version}_Annotations_SampleAttributesDS.txt
  destination: input/baseline_expression/bulkRNAseq/GTEx/GTEx_Analysis_v${gtex_version}_Annotations_SampleAttributesDS.txt
- name: copy GTEx subject phenotypes
  source: https://storage.googleapis.com/adult-gtex/annotations/v${gtex_version}/metadata-files/GTEx_Analysis_v${gtex_version}_Annotations_SubjectPhenotypesDS.txt
  destination: input/baseline_expression/bulkRNAseq/GTEx/GTEx_Analysis_v${gtex_version}_Annotations_SubjectPhenotypesDS.txt
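
The mismatch described in this comment can be reproduced with a small sketch. It uses Python's `string.Template` as a stand-in for the pipeline's own variable substitution, which is an assumption, and it picks `11` as a purely hypothetical future value of `gtex_version`; whether such a release exists or follows the same naming is not known here.

```python
from string import Template

base = "https://storage.googleapis.com/adult-gtex/annotations/v${gtex_version}/metadata-files/"

# Filename with the version hard-coded, as in the reviewed lines.
hard_coded = Template(base + "GTEx_Analysis_v10_Annotations_SampleAttributesDS.txt")
# Filename with the version parameterized, as in the suggested change.
parameterized = Template(
    base + "GTEx_Analysis_v${gtex_version}_Annotations_SampleAttributesDS.txt"
)

# "11" is a hypothetical value used only to show what happens on a version bump.
hard_url = hard_coded.substitute(gtex_version="11")
param_url = parameterized.substitute(gtex_version="11")

print(hard_url)   # directory says v11, filename still says v10
print(param_url)  # directory and filename agree
```

With the hard-coded filename, changing the variable alone yields a URL whose directory and filename disagree, which is the failure mode the comment warns about.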

Comment thread config.yaml Outdated
- name: copy GTEx subject phenotypes
  source: https://storage.googleapis.com/adult-gtex/annotations/v${gtex_version}/metadata-files/GTEx_Analysis_v10_Annotations_SubjectPhenotypesDS.txt
  destination: input/baseline_expression/bulkRNAseq/GTEx/GTEx_Analysis_v10_Annotations_SubjectPhenotypesDS.txt

Copilot AI Mar 3, 2026

There appears to be trailing whitespace on the blank line after the GTEx subject phenotypes destination. Please remove it to avoid lint/pre-commit failures and keep diffs clean.

Suggested change

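
Trailing whitespace of the kind flagged in this thread can be found before a lint or pre-commit hook does. The helper below is a minimal sketch, not part of the repository; the function name and the sample text are made up for illustration.

```python
def trailing_whitespace_lines(text):
    """Return 1-based line numbers whose content ends in spaces or tabs."""
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if line != line.rstrip(" \t")
    ]


# Hypothetical config excerpt: the second line contains only two spaces.
sample = "destination: input/example.txt\n  \n- name: next task\n"
print(trailing_whitespace_lines(sample))
```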
4 participants