
Conversation

@Shu-Feather
Contributor

Add a cell growth rate method for multivariant analysis and modify the cell mass method.
Both are compatible with multigeneration analysis, and the variant and generation params can be set to specify the variant and generation IDs you want to analyze.

Shu-Feather and others added 30 commits June 25, 2025 03:08
GCSFS schedules futures, which does not work in atexit callbacks, so we revert to PyArrow. Additionally, we now use os._exit to explicitly set an exit code of 1 when an exception is raised during finalization, because Python ignores exceptions in atexit callbacks by default.
Exceptions in atexit hooks are ignored and not reflected in the final exit code. Additionally, new futures cannot be scheduled in atexit hooks (after interpreter shutdown) so asyncio does not work.
@thalassemia thalassemia left a comment

I haven't looked very deeply at most of these scripts yet, but I figured I might as well post the feedback I currently have, as it'll completely change how this PR looks. The biggest issue right now is the amount of duplicated code. You could tackle this by creating the helper function I outlined in one of my comments and importing it into the relevant modules.
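For reference, one possible shape for such a helper, assuming the history DataFrame carries variant and generation columns as in the snippets below (the module path and function name here are illustrative, not existing project code):

```python
# e.g. a new shared module such as ecoli/analysis/filters.py (name illustrative)
import polars as pl


def filter_by_params(df: pl.DataFrame, params: dict) -> pl.DataFrame:
    """Restrict a history DataFrame to the variants/generations requested in
    the analysis params, leaving it untouched when a key is absent."""
    target_variants = params.get("variant", None)
    target_generations = params.get("generation", None)
    if target_variants is not None:
        df = df.filter(pl.col("variant").is_in(target_variants))
    if target_generations is not None:
        df = df.filter(pl.col("generation").is_in(target_generations))
    return df
```

Each analysis script could then drop its own copy of this logic and just import the one helper.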

target_variants = params.get("variant", None)
target_generation = params.get("generation", None)

# Filter by specified variants and generations
Contributor

You could add this filter to the SQL query to speed that up (potentially by a lot if only reading a small subset of a large workflow output) and reduce RAM usage.
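A rough sketch of what that could look like, with placeholder names (`conn` for a DuckDB connection, `history_sql` for whatever the script already reads the workflow output from; the selected columns are illustrative):

```python
import duckdb

conn = duckdb.connect()

# Build the filter into the query so unwanted variants/generations are
# never read into Python at all.
clauses = []
if target_variants is not None:
    clauses.append(
        "variant IN (" + ", ".join(str(v) for v in target_variants) + ")"
    )
if target_generation is not None:
    clauses.append(f"generation = {target_generation}")
where = ("WHERE " + " AND ".join(clauses)) if clauses else ""

df = conn.sql(
    f"SELECT time, variant, generation, listeners__mass__cell_mass AS cell_mass "
    f"FROM ({history_sql}) {where}"
).pl()  # only the requested variants/generations are materialized
```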

Contributor

This comment still applies. I think keeping RAM usage down is particularly important for multivariant/multiseed analyses where the amount of data being read can potentially be insanely high.

@thalassemia
Contributor

thalassemia commented Aug 6, 2025

Also, to solve the import issue with Escher in pytest, you'll need to add it to the package list with uv; I think `uv add Escher`. Then commit the changes to `pyproject.toml` and `uv.lock`.

How big are the Escher package and its dependencies, by the way? You can get this by comparing `du -sh .venv` for a fresh environment (`uv sync --extra dev --frozen`) against the same environment after `uv add Escher`. I want to avoid adding anything too big as a mandatory installation requirement, if possible.

Can you check the size of plotly and ipykernel as well?

@Shu-Feather
Contributor Author

Here are some plots produced by these methods:

By cell_growth_rate.py in multivariant: [2 images]

By cell_mass.py in multivariant: [image]

By fba_flux.py in multivariant (stacked or grid mode): [2 images]

By fba_flux_heatmap.py in multigeneration: [image]

By fba_flux_pca.py in single: [image]

@Shu-Feather
Contributor Author

Add several visualization methods for protein counts:

By catalyst_count.py in multivariant (grid or stacked mode): [image]

By protein_count.py in multigeneration: [image]

@thalassemia thalassemia left a comment

I only skimmed most of this because there's just too much to review in depth. The biggest area for improvement right now is redundancy across a lot of scripts. You have, for the most part, already packaged a lot of duplicated code into helper functions, so refactoring should hopefully not be too hard. Fixing the Mypy type errors might require explicit type hints, and my previous comment about adding Escher with uv still applies (to fix the failing pytests).

# Filter by specified generations if provided
if target_generations is not None:
print(f"[INFO] Target generations: {target_generations}")
df = df.filter(pl.col("generation").is_in(target_generations))
Contributor

You could build this generation filter into your SQL query with a WHERE clause to be more efficient.
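Something along these lines, as a sketch (assuming `target_generations` has already been parsed from `params`, and reusing the query this script builds below):

```python
# Hypothetical: append the generation filter to the SQL instead of
# filtering the Polars DataFrame afterwards.
gen_filter = ""
if target_generations is not None:
    gen_list = ", ".join(str(g) for g in target_generations)
    gen_filter = f"AND generation IN ({gen_list})"

query = f"""
    SELECT time, generation, counts AS protein_count
    FROM unnested_counts
    WHERE idx = {protein_idx + 1} {gen_filter}
"""
```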

SELECT time, generation,
counts as protein_count
FROM unnested_counts
WHERE idx = {protein_idx + 1} -- SQL uses 1-based indexing
Contributor

This is overkill. I'd recommend `SELECT listeners__monomer_counts[{protein_idx + 1}] AS cnt` or using `ecoli.library.parquet_emitter.named_idx`.
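Roughly, for the first option (the FROM target is a placeholder for wherever the script reads the emitter output):

```python
# Hypothetical simplification: index the list column directly instead of
# unnesting it. DuckDB list indexing is 1-based, hence the +1.
query = f"""
    SELECT time, generation,
           listeners__monomer_counts[{protein_idx + 1}] AS cnt
    FROM history  -- placeholder for the actual emitter-output source
"""
```

(Or build the same selection with `named_idx`, as mentioned above.)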

)
gen_summary_pd["lower"] = (
gen_summary_pd["mean_count"] - gen_summary_pd["std_count"]
)
Contributor

Do you use the upper and lower columns?


# Filter by specified variants and generations if provided
target_variants = params.get("variant", None)
target_generations = params.get("generation", None)
Contributor

Highly recommend building these filters into the SQL query. For a multivariant analysis, the time saved by not reading some of the data can become significant.

avg_catalyst_counts[biocyc_id] = variant_avgs

# Create visualization based on layout mode
if layout_mode == "grid":
Contributor

Consider moving the code in this if...else into the above one for continuity.

@@ -0,0 +1,834 @@
"""
This script preprocesses FBA flux data by mapping extended reactions to base reactions,
Contributor

There's a lot of redundant code across the multigeneration fba_flux_ analyses. You've already organized things into helper functions, so it may not be too hard to fix this.

@@ -0,0 +1,1004 @@
"""
This script preprocesses FBA flux data by mapping extended reactions to base reactions,
Contributor

Also a lot of redundant code.

@@ -0,0 +1,380 @@
"""
Visualize FBA reaction flux dynamics using PCA trajectory analysis.
Contributor

Interesting idea. Do you have any examples of reaction IDs that yielded interesting plots?
