Add prepare_genomes pipeline to lyrebird_metapackage_creation #270

Merged: wwood merged 7 commits into wwood:main from rzhao-2:master on Mar 13, 2026

rzhao-2 (Contributor) commented Dec 4, 2025

Automated pipeline for most of the front-end preparation for making a lyrebird metapackage (proteins, transcripts, taxonomy)

@rzhao-2 rzhao-2 marked this pull request as draft December 4, 2025 06:19

wwood (Owner) commented Mar 11, 2026

Hi Rossen, I'm tempted to merge this without looking too hard - it is self-contained and won't harm other code - does that seem like an OK idea? Is there anything you know is amiss?

@wwood wwood marked this pull request as ready for review March 13, 2026 05:16
@wwood wwood merged commit 232a0d7 into wwood:main Mar 13, 2026
2 checks passed

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9544940f9c

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

genome_to_length[uvig] = int(row['length'])

new_taxa = {}
df = pl.DataFrame.read_csv(vcontact_assignments, infer_schema_length=10000)

P1: Replace unsupported Polars CSV constructor

process_vcontact3_taxonomy.py reads the vContact output with pl.DataFrame.read_csv(...), but the project pins Polars 1.x (prepare_genomes/envs/ictv-download.yml), where CSV loading is done via pl.read_csv(...); this causes an AttributeError before any taxonomy rows are processed, so the process_taxonomy rule cannot produce final_reconstructed_metadata.tsv.

else:
last_known = 'unclassified'
taxbuilder.append(tax_ranks[tax_levels.index(level)] + last_known)
taxbuilder.append('s__' + (draft[6] if len(draft) > 6 and draft[6] != '' else 'unclassified'))

P1: Avoid double-prefixing species taxonomy rank

This appends s__ to draft[6], but draft is populated from taxonomy strings that are already rank-prefixed (e.g. ICTV taxonomies are built as d__/.../s__... in ictv-download.py), so genomes with a vContact assignment will be written with malformed species labels like s__s__MySpecies, corrupting downstream taxonomy parsing.
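
One possible way to guard against the double prefix is sketched below. The helper `with_rank_prefix` is hypothetical (it is not part of the PR), and the example `draft` list is made up for illustration; the idea is only to strip an existing `s__` before adding it back, so already-prefixed and bare names both come out with exactly one prefix.

```python
def with_rank_prefix(name: str, prefix: str = "s__") -> str:
    """Return `name` carrying `prefix` once; empty names become 'unclassified'.

    Hypothetical helper, not taken from the PR.
    """
    # Drop an existing prefix so it is not applied twice.
    if name.startswith(prefix):
        name = name[len(prefix):]
    return prefix + (name if name else "unclassified")


# Made-up taxonomy entries that are already rank-prefixed.
draft = ["d__Viruses", "p__P1", "c__C1", "o__O1", "f__F1", "g__G1", "s__MySpecies"]

species = with_rank_prefix(draft[6] if len(draft) > 6 else "")
print(species)  # s__MySpecies
```

The same helper could be called with other rank prefixes if the higher ranks turn out to have the same issue, though that is not something this review establishes.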

done = touch("vcontact3_prepared.done"),
concat_genomes = "vcontact3_concat_genomes.fna",
shell: # there are many files, so use cat with find
"find {input.galah_dir} -name '*.fna' -exec cat {{}} + > {output.concat_genomes}"

P2: Use consistent genome extension for vContact prep

The same galah_clusters directory is treated as *.fasta in prodigal_gv but as *.fna here; when representatives are emitted as FASTA files, this find matches nothing and produces an empty vcontact3_concat_genomes.fna, causing vContact3 to run on zero genomes (or fail) and invalidating taxonomy reconstruction.
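
A rough sketch of one way to make the concatenation step robust to either extension, and to fail loudly instead of writing an empty file, is shown below. It is plain Python rather than the rule's shell command; the function `concat_genomes`, the `GENOME_SUFFIXES` set, and the temporary files are all assumptions for illustration, not code from the PR.

```python
import shutil
import tempfile
from pathlib import Path

# Assumption: representatives may be written as either .fna or .fasta.
GENOME_SUFFIXES = {".fna", ".fasta"}


def concat_genomes(galah_dir, out_path):
    """Concatenate all genome files under galah_dir into out_path.

    Hypothetical helper. Raises if nothing matches, so an empty output
    is never passed downstream silently.
    """
    files = sorted(
        p for p in Path(galah_dir).rglob("*")
        if p.is_file() and p.suffix in GENOME_SUFFIXES
    )
    if not files:
        raise FileNotFoundError(f"no genome files found under {galah_dir}")
    with open(out_path, "wb") as out:
        for path in files:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)
    return len(files)


# Small demonstration with made-up genomes in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    reps = Path(tmp) / "galah_clusters"
    reps.mkdir()
    (reps / "a.fasta").write_bytes(b">a\nACGT\n")
    (reps / "b.fna").write_bytes(b">b\nGGCC\n")
    out_file = Path(tmp) / "vcontact3_concat_genomes.fna"
    n_files = concat_genomes(reps, out_file)
    combined = out_file.read_bytes()

print(n_files)
```

The simpler fix is to settle on one extension across both rules; the sketch only illustrates the empty-input check and the two-extension match.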
