async solver execution with job-based run management by Adityakushwaha2006 · Pull Request #186 · EAPD-DRB/MUIOGO

Adityakushwaha2006 · 2026-03-03T15:06:30Z

Summary

What changed: Added a JobManager to track solver runs as background jobs. POST /run now returns a job ID immediately instead of waiting for the solver to finish. Two new endpoints handle status polling (GET /runStatus/<job_id>) and cancellation (POST /cancelRun/<job_id>). The solver subprocess is now started with Popen and polled every 0.5s so it can be cancelled or timed out cleanly. Also fixed batchRun, which ignored the solver field from the request and always used CBC.
Why: The solver was called with subprocess.run() inside the request handler, which blocked the thread for the entire duration of the run. If the browser timed out and disconnected, the solver kept running but the result was thrown away. There was also no way to cancel a run once started. This change addresses all three problems.
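For readers who want a concrete picture of the Popen-plus-polling approach described above, here is a minimal sketch. It is not the code from this PR: the function name `run_cancellable`, the use of a `threading.Event` as the cancel signal, the `"timeout"` status string, and the 5 second grace period before `kill()` are all assumptions made for illustration.

```python
import subprocess
import sys
import threading
import time

POLL_INTERVAL = 0.5  # seconds, matching the interval mentioned in the summary


def run_cancellable(cmd, cancel_event, timeout=None, poll_interval=POLL_INTERVAL):
    """Run cmd as a child process and return (status, returncode).

    status is "completed", "error", "cancelled", or "timeout" (the last
    name is an assumption; the PR text does not name a timeout state).
    """
    proc = subprocess.Popen(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    deadline = None if timeout is None else time.monotonic() + timeout
    while True:
        returncode = proc.poll()  # None while the child is still running
        if returncode is not None:
            return ("completed" if returncode == 0 else "error"), returncode
        if cancel_event.is_set():
            status = "cancelled"
            break
        if deadline is not None and time.monotonic() >= deadline:
            status = "timeout"
            break
        time.sleep(poll_interval)
    # Ask the child to stop, then force it if it ignores the request.
    proc.terminate()  # SIGTERM on POSIX, TerminateProcess on Windows
    try:
        proc.wait(timeout=5)
    except subprocess.TimeoutExpired:
        proc.kill()
        proc.wait()
    return status, proc.returncode
```

In a setup like the one described, a background thread would call this function with the solver command line, and the cancel endpoint would only need to set the event for the matching job.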

Related issues

Issue exists and is linked
Closes [Task] Solver execution blocks all API requests for the duration of a run #174

Validation

Tests added/updated (or not applicable) -- not applicable, no test suite exists in this repo yet
Validation steps documented
Evidence attached (logs/screenshots/output as relevant)

Manual verification steps:

Start the server and open the app
Trigger a run via POST /run with a valid casename, caserunname, and solver
Confirm the response comes back immediately with a job_id and HTTP 202 -- the browser should not hang
Call GET /runStatus/<job_id> -- confirm status moves from running to completed once the solve finishes and the full result is present in the response
Start another run, then call POST /cancelRun/<job_id> while it is running -- confirm the response is Cancellation signal sent and a follow-up status poll shows cancelled
Call POST /batchRun with solver: glpk -- confirm it uses GLPK instead of CBC
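The polling part of these steps could be scripted roughly as follows. This is a sketch under assumptions: `wait_for_job` and `TERMINAL_STATES` are hypothetical names, the set of terminal states is inferred from the statuses mentioned in this PR, and the HTTP call is left to a caller-supplied `fetch_status` function so the loop itself stays independent of any client library.

```python
import time

# Assumed set of final states, inferred from the statuses named in this PR.
TERMINAL_STATES = {"completed", "error", "cancelled"}


def wait_for_job(fetch_status, job_id, interval=0.5, max_wait=60.0, sleep=time.sleep):
    """Poll fetch_status(job_id) until the job reaches a terminal state.

    fetch_status is any callable that returns the decoded /runStatus
    JSON body as a dict. Raises TimeoutError after roughly max_wait seconds.
    """
    waited = 0.0
    while True:
        payload = fetch_status(job_id)
        if payload.get("status") in TERMINAL_STATES:
            return payload
        if waited >= max_wait:
            raise TimeoutError(
                f"job {job_id} still {payload.get('status')!r} after {max_wait}s"
            )
        sleep(interval)
        waited += interval
```

A caller would pass a small function that performs GET /runStatus/<job_id> against the running server and decodes the JSON body; the returned payload then carries either the result or the error field.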

Documentation

Docs updated in this PR (or not applicable)
Any setup/workflow changes reflected in repo docs

docs/ARCHITECTURE.md updated: system overview line updated, new Solver execution section added describing the async flow, job states, and in-memory storage behaviour.

Scope check

No unrelated refactors
Implemented from a feature branch
Change is deliverable without upstream OSeMOSYS/MUIO dependency
Base repo/branch is EAPD-DRB/MUIOGO:main (not upstream)

Manual Testing Evidence

Server: waitress, http://127.0.0.1:5002, demo case CLEWs Demo / REF

Test 1 - POST /run returns 202 immediately without blocking

HTTP 202
{
  "job_id": "41f76b43-181c-4028-b902-f74c3572e556",
  "message": "Solver job started. Poll /runStatus/<job_id> for updates.",
  "status": "running",
  "status_code": "accepted"
}

Test 2 - GET /runStatus/<job_id> returns job state

HTTP 200
{
  "casename": "CLEWs Demo",
  "caserunname": "REF",
  "job_id": "41f76b43-181c-4028-b902-f74c3572e556",
  "solver": "glpk",
  "status": "error",
  "error": "Solver binary 'glpsol' could not be found.",
  "result": null,
  "created_at": 1772552722.995,
  "started_at": 1772552722.996,
  "finished_at": 1772552723.035
}

Test 3 - POST /cancelRun/<job_id> returns 409 for a non-running job

HTTP 409
{
  "message": "Job cannot be cancelled — current status: error",
  "status_code": "error"
}
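The 409 in Test 3 suggests the cancel endpoint checks the job state before signalling anything. A minimal sketch of that decision, separated from any web framework, might look like this. The function name `cancel_response`, the `cancel_requested` flag, the 404 for unknown IDs, the `"success"` status code, and the exact message wording are assumptions, not taken from the PR's code.

```python
def cancel_response(jobs, job_id):
    """Return (http_status, body) for a cancel request.

    jobs is a plain dict mapping job IDs to job records with a "status" key.
    """
    job = jobs.get(job_id)
    if job is None:
        # Assumption: the PR text does not say what an unknown job ID returns.
        return 404, {"message": "Job not found", "status_code": "error"}
    if job["status"] != "running":
        # Only running jobs can be cancelled; anything else is a conflict.
        return 409, {
            "message": f"Job cannot be cancelled; current status: {job['status']}",
            "status_code": "error",
        }
    # Hypothetical flag that the worker's polling loop would pick up.
    job["cancel_requested"] = True
    return 200, {"message": "Cancellation signal sent", "status_code": "success"}
```

Keeping the check as a pure function makes the three outcomes easy to unit test without starting the server or a solver.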

NamanmeetSingh · 2026-03-03T18:29:51Z

@Adityakushwaha2006 This is a fantastic implementation. The addition of the /cancelRun endpoint and using Popen for graceful timeouts is a massive architectural upgrade.

Just flagging for the maintainers that this addresses the same core blocking issue as the TaskManager in PR #146, but takes a slightly different approach (Popen job tracking vs. ThreadPoolExecutor). Both are great, but the cancellation feature here is particularly interesting for the long-running Track 1 tasks.

@Adityakushwaha2006 From the Track 1 perspective, the only architectural requirement for the final async queue is that the job state dictionary supports a mutable metadata: {} field. The ConvergingOrchestrator (PR #24) runs for 15+ minutes and needs to inject live mathematical deltas ($\epsilon$) into the /runStatus payload on the fly so the frontend can graph the convergence. If you could add a quick .update_metadata(job_id, key, value) method to your JobManager, this would perfectly support the macroeconomic loops as well.
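To make the request concrete, the kind of method being asked for could look roughly like the sketch below. `JobRegistry` is a hypothetical stand-in, not the JobManager from this PR, and the record layout is assumed; only the `update_metadata(job_id, key, value)` signature comes from the comment above.

```python
import threading


class JobRegistry:
    """Hypothetical stand-in for a job manager; only the metadata path is shown."""

    def __init__(self):
        self._lock = threading.Lock()
        self._jobs = {}

    def create(self, job_id):
        with self._lock:
            self._jobs[job_id] = {"job_id": job_id, "status": "running", "metadata": {}}

    def update_metadata(self, job_id, key, value):
        # The lock keeps writers and status readers from interleaving.
        with self._lock:
            self._jobs[job_id]["metadata"][key] = value

    def snapshot(self, job_id):
        # Return a copy so a caller serializing the status never sees
        # (or mutates) the live record.
        with self._lock:
            job = self._jobs[job_id]
            return {**job, "metadata": dict(job["metadata"])}
```

A long-running loop would call `update_metadata(job_id, "epsilon", value)` once per iteration, and a status endpoint would return `snapshot(job_id)`.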

brightyorcerf · 2026-03-03T18:37:28Z

@NamanmeetSingh architecturally, there is a fundamental difference here that will block the Track 1 convergence loops if we go with a Popen approach.

PR #186 is built around Subprocesses, which is great for external solver binaries but creates a "Memory Wall." The ConvergingOrchestrator (PR #24) is native Python logic; if we run it via Popen, it cannot share memory or live objects with the main API. We would have to serialize the entire model state to disk just to update a single ϵ value.

In contrast, the TaskManager in #146 uses Threads, allowing the Orchestrator to update the metadata dictionary directly and instantly in shared memory. I could update #146 with a thread-safe cancellation flag (Soft Cancellation) and automated exception handling, giving us the same control as #186 but with the deep Python integration required for the macroeconomic loops if you require that.
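For context, "soft cancellation" in a thread-based design usually means the worker checks a shared flag between iterations and exits on its own. The sketch below illustrates that pattern with a toy loop; `converge`, the halving update, and the `progress` dict are illustrative assumptions and do not come from #146 or the ConvergingOrchestrator.

```python
import threading


def converge(cancel_flag, progress, max_iterations=1000, tolerance=1e-6):
    """Toy fixed-point loop that checks a cancellation flag once per iteration.

    progress is a dict shared with the caller, standing in for live job metadata.
    """
    x = 1.0
    for i in range(max_iterations):
        if cancel_flag.is_set():
            progress["status"] = "cancelled"
            return
        new_x = 0.5 * x  # stand-in for one real orchestrator iteration
        progress["iteration"] = i
        progress["epsilon"] = abs(new_x - x)
        x = new_x
        if progress["epsilon"] < tolerance:
            progress["status"] = "completed"
            return
    progress["status"] = "error"  # did not converge within max_iterations
```

The worker would be started with `threading.Thread(target=converge, args=(flag, progress))`, and a cancel request would call `flag.set()`. The loop only notices the flag at the next iteration boundary, so cancellation latency depends on how long one iteration takes.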

Adityakushwaha2006 · 2026-03-03T19:45:39Z

Hey @brightyorcerf,
The memory wall point is valid, but it only applies if someone runs the ConvergingOrchestrator via Popen, which this PR never proposes and the issue never asked for.

This PR is scoped entirely to external solver binaries: GLPK and CBC are compiled C/C++ executables, so they run in a separate OS process by definition. Popen is not a choice here; it is the only correct mechanism. And crucially, the cancellation feature you're describing as interesting is only possible because of Popen: threading.Thread cannot send SIGTERM to an external binary.

I went through PR #146 as stated in your message.
PR #146 and PR #186 solve different layers. Neither supersedes the other. The solver subprocess layer and the in-process orchestration layer have different requirements, and the right answer is both, not one replacing the other.

Adityakushwaha2006 · 2026-03-03T19:52:03Z

Hey @NamanmeetSingh, thanks for the appreciation :D
On .update_metadata(): the Orchestrator already owns its iteration state in process, so routing live ε values through an external registry would mean serializing data you already have direct access to. That would be unnecessary and, in my view, not the architectural direction we should take.

What I think we should do instead is add a dedicated /convergeStatus/<task_id> endpoint on the Track 1 side. That would be the cleaner interface for frontend progress reporting, without introducing mutable side-channel writes into a registry built for a different layer.

Let me know if you think we should still go ahead with that change. Happy to discuss further if there is an actual need for it or if I'm missing something here :)

SeaCelo · 2026-03-11T15:43:48Z

My current view is that this PR is implementing a future async/job-based execution architecture rather than solving a confirmed current local single-user stability problem.

Under the current product model, I am not convinced this should remain in the active queue. My default direction right now is to close it unless there is a strong present-day local-use reason to keep this work moving.

If you want to make that case, please do it in #301.

I’m using that discussion for the first-pass review and will come back here after the review window.

Adityakushwaha2006 · 2026-03-11T22:15:00Z

Closing in response to the scope review in #301. The implementation is future async architecture rather than a present day local use fix.
Work preserved in the branch for reference if the async execution layer becomes relevant later.

Issue 174-TASK fix

dc347db

Adityakushwaha2006 changed the title ~~Issue 174-TASK fix~~ async solver execution with job-based run management Mar 3, 2026

Adityakushwaha2006 mentioned this pull request Mar 3, 2026

[Task] Solver execution blocks all API requests for the duration of a run #174

Open

10 tasks

This was referenced Mar 3, 2026

[Task] Expose CLEWS solver output schema and results as structured JSON for OG-Core integration #201

Open

Feature: ogcore output schema endpoint #202

Open

This was referenced Mar 6, 2026

[Feature] Standardizing Atomic I/O and Global State Locking for Concurrent Stability #258

Open

[Bug] Prevent solver subprocess hangs by introducing configurable execution timeout #259

Open

SeaCelo added Track: Stability Run safety, async execution, shared state integrity, and runtime robustness blocked labels Mar 11, 2026

SeaCelo added this to Project Management Mar 11, 2026

SeaCelo moved this to On Hold in Project Management Mar 11, 2026

SeaCelo added needs-decision Waiting on maintainer clarification or decision and removed blocked labels Mar 11, 2026

Adityakushwaha2006 closed this Mar 11, 2026

github-project-automation Bot moved this from On Hold to Done in Project Management Mar 11, 2026

This was referenced Mar 13, 2026

[Feature] Detect and recover interrupted solver runs #325

Open

Detect and recover interrupted solver runs using run_state.json #326

Open

krishivsaini mentioned this pull request Mar 14, 2026

[Infrastructure] Establish pytest foundation and GitHub Actions test CI #327

Closed

Adityakushwaha2006 wants to merge 1 commit into EAPD-DRB:main from Adityakushwaha2006:feature/174-async-solver-execution