# feat(runner): batch simulation job execution to prevent OOM kills (#26)

daniel-klein wants to merge 3 commits into main.
Replace all-at-once task submission with configurable batched execution.

Previously, all sim futures were held for the entire job duration to support a second gather pass for `model_outputs` Parquet writing. This prevented Dask from freeing any SimReturn from distributed memory, causing workers to be OOM-killed and lose completed results (progress going backwards in the dashboard).

Now tasks are submitted in batches (default 200, configurable via `MODELOPS_JOB_BATCH_SIZE`). Each batch's results are gathered before the next batch starts, allowing Dask to free distributed memory between batches. This bounds peak distributed memory to `batch_size` SimReturns instead of all params, and limits the worker-death blast radius to one batch instead of the entire job.

Also adds OOM-kill detection: when task failures contain OOM signals, a warning is logged suggesting a smaller batch size.

Closes #25
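The batch-then-gather loop described above can be sketched as follows. This is a minimal illustration, not the actual runner code: it uses the stdlib `concurrent.futures` as a stand-in for the Dask client (same submit/result shape), and `run_one_sim` plus the numeric params are placeholders.

```python
import os
from concurrent.futures import ThreadPoolExecutor

# Batch size is read from the env var the PR introduces; 200 is its default.
BATCH_SIZE = int(os.environ.get("MODELOPS_JOB_BATCH_SIZE", "200"))

def run_one_sim(p):
    # Placeholder for the real simulation task.
    return p * 2

def run_batched(params, batch_size=BATCH_SIZE):
    results = []
    with ThreadPoolExecutor(max_workers=4) as pool:
        for start in range(0, len(params), batch_size):
            batch = params[start:start + batch_size]
            # Submit only one batch's worth of tasks at a time.
            futures = [pool.submit(run_one_sim, p) for p in batch]
            # Gather this batch fully before submitting the next one, so the
            # scheduler can free each batch's results between batches.
            for f in futures:
                try:
                    results.append(f.result())
                except Exception as exc:
                    # Keep the failure in-line, mirroring a gather that
                    # returns SimReturn | Exception per task.
                    results.append(exc)
    return results
```

With Dask, the same shape would use `client.map(...)` per batch followed by a gather, releasing the batch's futures before the next submission.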
Allow submitters to configure runner behavior by setting environment variables locally. Any env var matching `MODELOPS_JOB_*` is forwarded to the job pod spec.

Usage:

```
export MODELOPS_JOB_BATCH_SIZE=100
mops jobs submit study.json
```
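The forwarding rule can be sketched as a simple filter over the submitter's environment. `forward_job_env` and the `{"name": ..., "value": ...}` pod-spec env shape are illustrative assumptions, not the actual submitter code:

```python
import os

def forward_job_env(environ=None):
    """Collect local MODELOPS_JOB_* vars as pod-spec env entries."""
    environ = os.environ if environ is None else environ
    return [
        {"name": name, "value": value}
        for name, value in sorted(environ.items())
        if name.startswith("MODELOPS_JOB_")
    ]
```

For example, with `MODELOPS_JOB_BATCH_SIZE=100` exported locally, the job pod would receive that single entry while unrelated variables are left behind.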
`gather()` returns `list[SimReturn | Exception]`. When finding output names from the first SimReturn, an Exception has no `.outputs` attribute, crashing `_write_model_outputs` silently. This caused `model_outputs` to be missing whenever any replicate failed. Fix: scan for the first successful SimReturn, and skip Exception entries during iteration.
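A sketch of the fix: take output names from the first *successful* SimReturn and skip Exception entries when iterating. `SimReturn` here is a stand-in dataclass and `write_model_outputs` a simplified stand-in for `_write_model_outputs`; the real types live in the runner.

```python
from dataclasses import dataclass

@dataclass
class SimReturn:
    outputs: dict

def first_successful(results):
    # gather() yields SimReturn | Exception; Exceptions have no .outputs,
    # so scan past them instead of blindly taking results[0].
    return next((r for r in results if not isinstance(r, Exception)), None)

def write_model_outputs(results):
    first = first_successful(results)
    if first is None:
        return None  # every replicate failed; nothing to write
    output_names = list(first.outputs)  # column names from a real result
    # Skip Exception entries during iteration as well.
    rows = [r.outputs for r in results if not isinstance(r, Exception)]
    return output_names, rows
```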
…verload (#29)

Set `worker-saturation=1.0` so the scheduler only assigns tasks equal to each worker's thread count, instead of piling all tasks onto workers immediately. This prevents the distributed memory accumulation that causes OOM kills on large jobs (e.g., 1000+ params).

Without this, submitting 7,000+ tasks causes the scheduler to push them all onto workers, each holding data for hundreds of queued tasks. With saturation=1.0, workers only hold data for tasks they're actively running (~4 per worker with the current config).

This is the root-cause fix for the OOM issue described in #25 and the incident doc. PR #26 proposes batching as a workaround, but scheduler-level backpressure is the proper Dask-native solution.

Refs #25, #26
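For reference, the scheduler-side knob is the `distributed.scheduler.worker-saturation` config key in recent Dask versions; a minimal sketch of setting it before the cluster starts (the exact deployment hook used here is not shown in the PR):

```python
import dask

# Cap task assignment at each worker's thread count (saturation factor 1.0),
# so queued tasks stay on the scheduler instead of piling up on workers.
dask.config.set({"distributed.scheduler.worker-saturation": 1.0})
```

The same setting can typically be supplied via Dask's config env-var mapping (`DASK_DISTRIBUTED__SCHEDULER__WORKER_SATURATION=1.0`) or a `distributed.yaml` file, whichever fits the deployment.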
**Why this PR is still needed (despite the #29/#30 worker-saturation fix)**

Root cause: two separate memory problems.

**Problem 1 — task over-saturation (fixed by #29/#30):** Without …

**Problem 2 — cumulative memory growth in long-lived worker processes (NOT fixed by #29/#30):** Each simulation allocates millions of small Python objects (numpy arrays, starsim states, etc.). When a task completes, Python frees the objects, but CPython's pymalloc allocator cannot return fragmented memory pages to the OS. Worker RSS grows monotonically across sequential tasks — even with only 1 task at a time.

**Why worker-saturation alone doesn't prevent OOM:** Current worker config: 6 nworkers per pod × … Observed on …

**Why the blast radius is so large:** The current … With a pod OOMKill: 6 workers × N completed results each = massive result loss and recomputation.

Why …
Summary
- `run_simulation_job()` now submits simulation tasks in batches instead of all at once
- Batch size is configurable via the `MODELOPS_JOB_BATCH_SIZE` env var (default: 200)

Problem
The runner held references to ALL simulation futures for the entire job duration (to support a second `gather()` pass for `model_outputs` Parquet writing). This prevented Dask from freeing any SimReturn from distributed memory, causing:

- workers to be OOM-killed
- completed results to be lost (progress going backwards in the dashboard)

Solution
Submit tasks in batches. Per batch:

1. Submit the batch's tasks
2. Gather the batch's results before the next batch starts
3. Let Dask free the batch's distributed memory
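The OOM-kill detection mentioned in the description could look roughly like the following. This is a hedged sketch: the helper names and the list of OOM signal strings are illustrative guesses, not the actual implementation (though `KilledWorker` is a real Dask exception type and exit code 137 is the conventional SIGKILL/OOM exit code).

```python
import logging

log = logging.getLogger("runner")

# Illustrative signal substrings that suggest a worker was OOM-killed.
OOM_SIGNALS = ("killedworker", "sigkill", "exit code 137", "oom")

def looks_like_oom(failures):
    """Heuristic: do any batch failures look like OOM kills?"""
    text = " ".join(f"{type(e).__name__}: {e}" for e in failures).lower()
    return any(sig in text for sig in OOM_SIGNALS)

def warn_if_oom(failures, batch_size):
    if failures and looks_like_oom(failures):
        log.warning(
            "Batch failures look like OOM kills; consider a smaller "
            "MODELOPS_JOB_BATCH_SIZE (current: %d)", batch_size,
        )
```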
Test plan
- `make submit-tiny` with a small batch (`MODELOPS_JOB_BATCH_SIZE=2`) to verify batching works
- `make submit` with 1,000 params and verify no OOM kills

Closes #25