Add prepare_genomes pipeline to lyrebird_metapackage_creation by rzhao-2 · Pull Request #270 · wwood/singlem

rzhao-2 · 2025-12-04T06:19:41Z

Automated pipeline for most of the front-end preparation for making a lyrebird metapackage (proteins, transcripts, taxonomy)

wwood · 2026-03-11T05:32:46Z

Hi Rossen, I'm tempted to merge this without looking too hard, since it is self-contained and won't harm other code. Does that seem like an OK idea? Are there any things you know are amiss?

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9544940f9c

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you open a pull request for review, mark a draft as ready, or comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-03-13T05:20:57Z

+        genome_to_length[uvig] = int(row['length'])
+
+new_taxa = {}
+df = pl.DataFrame.read_csv(vcontact_assignments, infer_schema_length=10000)


Replace unsupported Polars CSV constructor

process_vcontact3_taxonomy.py reads the vContact output with pl.DataFrame.read_csv(...), but the project pins Polars 1.x (prepare_genomes/envs/ictv-download.yml), where CSV loading is done via pl.read_csv(...). The call raises an AttributeError before any taxonomy rows are processed, so the process_taxonomy rule cannot produce final_reconstructed_metadata.tsv.
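A minimal sketch of the suggested change, assuming Polars 1.x is installed. The helper name read_assignments is hypothetical and not part of the PR; only pl.read_csv and its infer_schema_length argument come from the discussion above.

```python
def read_assignments(source):
    """Load a vContact-style CSV with the module-level Polars reader.

    `source` may be a file path or a binary file-like object.
    """
    # Imported inside the function so this sketch can be loaded even where
    # Polars is not installed; a real script would import at module level.
    import polars as pl

    # Use the module-level pl.read_csv rather than a method on pl.DataFrame.
    # infer_schema_length=10000 mirrors the value used in the reviewed diff.
    return pl.read_csv(source, infer_schema_length=10000)
```

In the script, the reviewed line would then become df = read_assignments(vcontact_assignments), or simply a direct pl.read_csv(...) call.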

chatgpt-codex-connector · 2026-03-13T05:20:57Z

+            else:
+                last_known = 'unclassified'
+            taxbuilder.append(tax_ranks[tax_levels.index(level)] + last_known)
+    taxbuilder.append('s__' + (draft[6] if len(draft) > 6 and draft[6] != '' else 'unclassified'))


Avoid double-prefixing species taxonomy rank

This appends s__ to draft[6], but draft is populated from taxonomy strings that are already rank-prefixed (e.g. ICTV taxonomies are built as d__/.../s__... in ictv-download.py), so genomes with a vContact assignment will be written with malformed species labels like s__s__MySpecies, corrupting downstream taxonomy parsing.
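One way to guard against the double prefix is to add the rank prefix only when it is missing. The sketch below is illustrative: ensure_rank_prefix and species_label are hypothetical names, and treating a bare "s__" as unclassified is an assumption rather than behaviour confirmed by the PR.

```python
def ensure_rank_prefix(name: str, prefix: str = "s__") -> str:
    """Return `name` carrying `prefix` exactly once."""
    # Assumption: an empty name or a bare prefix means "no species assigned".
    if not name or name == prefix:
        return prefix + "unclassified"
    # Already prefixed (e.g. taken from an ICTV-style string): leave it alone.
    if name.startswith(prefix):
        return name
    return prefix + name


def species_label(draft: list) -> str:
    """Build the species field from a 7-rank taxonomy list, if present."""
    species = draft[6] if len(draft) > 6 else ""
    return ensure_rank_prefix(species)
```

With this helper, the reviewed line could append species_label(draft) instead of concatenating 's__' unconditionally.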

chatgpt-codex-connector · 2026-03-13T05:20:57Z

+        done = touch("vcontact3_prepared.done"),
+        concat_genomes = "vcontact3_concat_genomes.fna",
+    shell: # there are many files, so use cat with find
+        "find {input.galah_dir} -name '*.fna' -exec cat {{}} + > {output.concat_genomes}"


Use consistent genome extension for vContact prep

The same galah_clusters directory is treated as *.fasta in prodigal_gv but as *.fna here. When representatives are emitted with a .fasta extension, this find matches nothing and produces an empty vcontact3_concat_genomes.fna, causing vContact3 to run on zero genomes (or fail) and invalidating taxonomy reconstruction.
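As a rough illustration of keeping the extension in one place and failing loudly on an empty match, the concatenation could be sketched in Python as below. The function concat_genomes and the GENOME_EXTENSIONS tuple are hypothetical; which extension the representatives actually carry is not stated in this thread, so both are accepted here as an assumption.

```python
import shutil
from pathlib import Path

# Assumption: representatives may end in either extension. Defining the set
# once lets every rule agree on what counts as a genome file.
GENOME_EXTENSIONS = (".fna", ".fasta")


def concat_genomes(galah_dir, out_path, extensions=GENOME_EXTENSIONS):
    """Concatenate all genome files under galah_dir into out_path.

    Returns the number of files written and raises if none matched, so an
    empty input is reported instead of silently producing an empty file.
    """
    galah_dir = Path(galah_dir)
    out_path = Path(out_path)
    genomes = sorted(
        p for p in galah_dir.rglob("*")
        if p.is_file()
        and p.suffix in extensions
        and p.resolve() != out_path.resolve()  # never read the output itself
    )
    if not genomes:
        raise FileNotFoundError(
            f"no files ending in {extensions} found under {galah_dir}"
        )
    with out_path.open("wb") as out:
        for genome in genomes:
            with genome.open("rb") as handle:
                shutil.copyfileobj(handle, out)
    return len(genomes)
```

The same idea carries over to the shell rule: match the extension that prodigal_gv uses, and check that the concatenated file is non-empty before vContact3 runs.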

rzhao-2 added 3 commits December 3, 2025 22:43

lyrebird_metapackage_creation: Add workflow for viral genome collection.

fbe318d

lyrebird_metapackage_creation: Add vcontact3 step.

676fa01

lyrebird_metapackage_creation: Add vcontact3.yml.

ef9c389

rzhao-2 marked this pull request as draft December 4, 2025 06:19

rzhao-2 added 4 commits December 5, 2025 14:55

lyrebird_metapackage_creation: Add dependency for cluster job submission.

de1c6d3

lyrebird_metapackage_creation: Add genome lengths to output metadata file.

9bd8dc9

lyrebird_metapackage_creation: Add taxonomy processing script.

3b5c44a

lyrebird_metapackage_creation: refactor

9544940

wwood marked this pull request as ready for review March 13, 2026 05:16

wwood merged commit 232a0d7 into wwood:main Mar 13, 2026
2 checks passed

chatgpt-codex-connector Bot reviewed Mar 13, 2026

wwood merged 7 commits into wwood:main from rzhao-2:master
