Save Parquet File for Single-cells during Colony Simulation without MongoDB? #389

@katha815

Problem Description

I am using ecoli_engine_process.py for colony simulation under baseline conditions. While I can successfully save snapshots of the colony state, I continuously encounter errors when trying to emit Parquet files for single-cell data. I wonder if anyone has successfully done this without an online emitter?

The errors and attempted debugging steps are shown below. I noticed the implemented antibiotic simulation uses an online database as the emitter, so I included observations about that in the last section.

There may be a different underlying reason why my simulation fails, so I would appreciate any insights or suggestions!

Environment

  • vEcoli commit: (current main branch)
  • Python 3.12
  • Config: spatial.json inheritance with "emitter": "parquet"

Goal

Resume a colony simulation from a saved JSON state file (baseline_2gen_seed_0_colony_t6000.json) and output to Parquet files.

My Configuration File

configs/colony_baseline_test2.json:

{
    "inherit_from": ["spatial.json"],
    "description": "Test2: read from trial1, run for another generation, save parquet per cell, save final state",
    "initial_colony_file": "baseline_2gen_seed_0_colony_t6000",

    "seed": 0,
    "sim_data_path": "out/all_media_conditions1/parca/kb/simData.cPickle",
    
    "emitter": "parquet",
    "emitter_arg": {
        "out_dir": "out/colony_runs/baseline_3rd_gen_seed_0"
    },
    "emit_config": false,

    "max_duration": 9000,
    
    "save": true,
    "save_times": [9000],
    "colony_save_prefix": "baseline_3rd_gen",
    
    "parallel": false,
    
    "engine_process_reports": [
        ["boundary"],
        ["bulk"],
        ["listeners"],
        ["environment", "exchange"]
    ]
}

Issue 1: NumpyRandomStateSerializer Serialization/Deserialization Mismatch

Command:

python ecoli/experiments/ecoli_engine_process.py --config configs/colony_baseline_test2.json

Error:

Traceback (most recent call last):
  File "ecoli/experiments/ecoli_engine_process.py", line 526, in <module>
    run_simulation(config)
  File "ecoli/experiments/ecoli_engine_process.py", line 389, in run_simulation
    initial_state = get_state_from_file(...)
  File "ecoli/library/json_state.py", line 168, in get_state_from_file
    return json.loads(f.read(), object_hook=custom_decoder)
  File "ecoli/library/serialize.py", line 83, in deserialize
    data = orjson.loads(data)
orjson.JSONDecodeError: unexpected character: line 1 column 1 (char 0)

Possible Root Cause:

  • serialize() at lines 70-72 appears to output a Python tuple repr: ('MT19937', [...])
  • deserialize() at line 82 uses orjson.loads() which expects JSON array format: ["MT19937", [...]]
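
A minimal sketch of the suspected mismatch, using only the standard library. The stdlib json module stands in for orjson here, and the state tuple is a shortened, made-up stand-in for a real RandomState state (whose key array has 624 entries):

```python
import ast
import json

# Illustrative stand-in for a RandomState state with the key array
# already converted to a list (the real key has 624 entries).
state = ("MT19937", [1, 2, 3], 624, 0, 0.0)

as_repr = str(state)         # Python tuple repr: starts with "(" and uses single quotes
as_json = json.dumps(state)  # JSON array: starts with "[" and uses double quotes


def json_can_parse(text):
    """Return True if text is valid JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


repr_is_json = json_can_parse(as_repr)  # the tuple repr is rejected by a JSON parser
json_is_json = json_can_parse(as_json)  # the JSON array is accepted
recovered = ast.literal_eval(as_repr)   # literal_eval can read the tuple repr back
```

If the saved state file really contains the tuple repr, a JSON parser would fail on the very first character, which matches the "line 1 column 1 (char 0)" position in the error above.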

File: ecoli/library/serialize.py

Original code (lines 78-85):

def deserialize(self, data):
    matched_regex = self.regex_for_serialized.fullmatch(data)
    if matched_regex:
        data = matched_regex.group(1)
    data = orjson.loads(data)
    rng = np.random.RandomState()
    rng.set_state(data)
    return rng

Attempted Fix: Fall back to ast.literal_eval() when the payload is a Python tuple repr, and keep orjson.loads() for JSON input:

import ast

def deserialize(self, data):
    matched_regex = self.regex_for_serialized.fullmatch(data)
    if matched_regex:
        data = matched_regex.group(1)
    if data.startswith("("):
        # Python tuple repr, e.g. ('MT19937', [...]): not valid JSON
        data = ast.literal_eval(data)
    else:
        # JSON array, e.g. ["MT19937", [...]]
        data = orjson.loads(data)
    rng = np.random.RandomState()
    # Normalize to a tuple before restoring the state
    rng.set_state(tuple(data))
    return rng

Issue 2: Parquet Emitter Cannot Handle pint.Quantity in Config Metadata

Error (after attempting to fix Issue 1):

Traceback (most recent call last):
  File "ecoli/library/parquet_emitter.py", line 963, in emit
    v = np.asarray(v, dtype=np_dtype(v, k))
  File "ecoli/library/parquet_emitter.py", line 706, in np_dtype
    raise ValueError(f"{field_name} has unsupported type {type(val)}.")
ValueError: spatial_environment_config__multibody__bounds has unsupported type <class 'pint.Quantity'>.

During handling of the above exception, another exception occurred:
  File "ecoli/library/parquet_emitter.py", line 967, in emit
    v = pl.Series([v])
TypeError: not yet implemented: Nested object types

Possible Root Cause:

  • spatial.json contains unit-tagged values (e.g., "!units[50 micrometer]") that appear to be loaded as pint.Quantity objects
  • np_dtype() doesn't appear to handle pint.Quantity, so emit() falls back to building a Polars Series
  • pl.Series([v]) also seems unable to handle pint.Quantity objects, which raises the second error

File: ecoli/library/parquet_emitter.py

Original code (line 967):

v = pl.Series([v])

Attempted Fix: Convert unsupported types to string:

v = pl.Series([str(v)])
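
A slightly more general version of the same idea, sketched under assumptions: convert every value the writer might reject into a string before it reaches the emitter, including values nested inside lists and dicts. FakeQuantity and to_emittable are hypothetical names used only for illustration; FakeQuantity imitates just the magnitude and units attributes of pint.Quantity so that the example runs without pint or Polars installed.

```python
from dataclasses import dataclass


@dataclass
class FakeQuantity:
    """Stand-in that mimics only the .magnitude and .units of pint.Quantity."""
    magnitude: object
    units: str

    def __str__(self):
        return f"{self.magnitude} {self.units}"


# Types assumed (for this sketch) to be safe to hand to a columnar writer.
PRIMITIVES = (bool, int, float, str, type(None))


def to_emittable(value):
    """Recursively replace non-primitive leaf values with their string form."""
    if isinstance(value, PRIMITIVES):
        return value
    if isinstance(value, dict):
        return {key: to_emittable(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_emittable(item) for item in value]
    return str(value)


config = {
    "multibody": {
        "bounds": [FakeQuantity(50, "micrometer"), FakeQuantity(50, "micrometer")]
    }
}
clean = to_emittable(config)
```

Stringifying loses the numeric type, so anything that reads the Parquet output would need to parse the units back out; that trade-off applies to the one-line str(v) fix as well.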

Issue 3: emit_config Setting Not Passed to Engine

Note: After applying the str(v) fix, I also tried to disable config emission entirely by setting "emit_config": false in the JSON config, but the setting appeared to have no effect.

Possible Root Cause:
ecoli_engine_process.py does not seem to pass the emit_config parameter to the Engine constructor.

File: ecoli/experiments/ecoli_engine_process.py

Original code (around line 465):

engine = Engine(
    processes=composite.processes,
    topology=composite.topology,
    initial_state=initial_state,
    experiment_id=experiment_id,
    emitter=emitter_config,
    progress_bar=config["progress_bar"],
    metadata=metadata,
    profile=config["profile"],
    initial_global_time=config.get("start_time", 0.0),
)

Attempted Fix: Add emit_config parameter:

engine = Engine(
    ...
    initial_global_time=config.get("start_time", 0.0),
    emit_config=config.get("emit_config", False),
)
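
The pattern behind this fix can be sketched in isolation: read optional keys from the loaded config with explicit defaults and forward them as keyword arguments. FakeEngine and engine_kwargs are hypothetical stand-ins for illustration, not the real Engine API.

```python
class FakeEngine:
    """Stand-in for the real Engine; it only records what it was given."""

    def __init__(self, emit_config=False, initial_global_time=0.0):
        self.emit_config = emit_config
        self.initial_global_time = initial_global_time


def engine_kwargs(config):
    """Collect optional settings from the config dict, with explicit defaults."""
    return {
        "initial_global_time": config.get("start_time", 0.0),
        "emit_config": config.get("emit_config", False),
    }


# A config that sets both keys, and an empty config that relies on the defaults.
engine = FakeEngine(**engine_kwargs({"emit_config": True, "start_time": 6000.0}))
default_engine = FakeEngine(**engine_kwargs({}))
```

With this shape, a key that is present in the JSON config but never forwarded is easier to spot, because every forwarded setting is listed in one place.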

Issue 4: Parquet Emitter Assumes agents Key in Data Structure

Error (after attempting to fix Issues 1-3):

Traceback (most recent call last):
  File "ecoli/experiments/ecoli_engine_process.py", line 485, in run_simulation
    colony_save_states(engine, config)
  File "ecoli/experiments/ecoli_engine_process.py", line 255, in colony_save_states
    engine.update(time_to_next_save)
  ...
  File "ecoli/processes/engine_process.py", line 505, in next_update
    self.emitter.emit(emit_config)
  File "ecoli/library/parquet_emitter.py", line 1007, in emit
    if len(data["data"]["agents"]) > 1:
KeyError: 'agents'

Possible Root Cause:

  • ParquetEmitter.emit() appears to expect a data["data"]["agents"] structure (outer simulation)
  • EngineProcess inner emitter seems to send cell data directly without the agents wrapper
  • The inner emitter is configured via inner_emitter in the EngineProcess config

File: ecoli/library/parquet_emitter.py (line 1007)
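
One possible direction, sketched under assumptions: normalize both payload shapes to an agent-id mapping before the length check. agents_in_payload and fallback_id are hypothetical names, the payloads are made-up examples, and whether the emitter should treat an unwrapped payload as a single cell is itself an assumption.

```python
def agents_in_payload(data, fallback_id="0"):
    """Return a mapping of agent id to state for either payload shape."""
    body = data["data"]
    if "agents" in body:
        # Outer-simulation shape: cells are already keyed by agent id.
        return body["agents"]
    # Inner-emitter shape (assumed): the body is a single cell's state.
    return {fallback_id: body}


outer = {"data": {"agents": {"0": {"bulk": 1}, "01": {"bulk": 2}}}}
inner = {"data": {"bulk": 3, "listeners": {}}}

outer_agents = agents_in_payload(outer)
inner_agents = agents_in_payload(inner)
```

A check such as len(agents_in_payload(data)) > 1 would then no longer raise KeyError for the inner payload, although the rest of emit() may make further assumptions about the agents structure.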


Observations on Implemented Antibiotic Simulation

tet_amp_sim.py uses a different configuration that appears to avoid these issues. From configs/cloud.json (inherited by antibiotics.json), which selects the MongoDB emitter rather than Parquet:

{
    "emitter": "database",
    "emitter_arg": {"host": "10.138.0.75:27017", "emit_limit": 4100000}
}

With this emitter, it appears that:

  • pint.Quantity values in the emitted data do not cause errors
  • There are no JSON serialization issues with NumpyRandomState
  • Nothing assumes an agents key in the data structure

Summary Table

| Issue | File | Line | Status |
|---|---|---|---|
| 1. RandomState serialize/deserialize mismatch | serialize.py | 78-85 | Attempted fix with ast.literal_eval() |
| 2. pint.Quantity not supported | parquet_emitter.py | 967 | Attempted fix with str(v) |
| 3. emit_config not passed to Engine | ecoli_engine_process.py | ~470 | Attempted fix by adding parameter |
| 4. Missing agents key handling | parquet_emitter.py | 1007 | UNRESOLVED |
