
async solver execution with job-based run management #186

Closed
Adityakushwaha2006 wants to merge 1 commit into EAPD-DRB:main from Adityakushwaha2006:feature/174-async-solver-execution

Conversation

@Adityakushwaha2006
Contributor

@Adityakushwaha2006 Adityakushwaha2006 commented Mar 3, 2026

Summary

  • What changed: Added JobManager to track solver runs as background jobs. POST /run now returns a job ID immediately instead of waiting for the solver to finish. Two new endpoints handle status polling (GET /runStatus/<job_id>) and cancellation (POST /cancelRun/<job_id>). The solver subprocess is now started with Popen and checked every 0.5s so it can be cancelled or timed out cleanly. Fixed batchRun which was ignoring the solver field from the request and always using CBC.
  • Why: The solver was called with subprocess.run() inside the request handler, which blocked the thread for the entire duration of the run. If the browser timed out and disconnected, the solver kept running but the result was thrown away. There was also no way to cancel a run once started. This change fixes both.
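The flow described above can be sketched roughly as follows. This is a minimal illustration and not the code from this PR: the class layout, the method names (`submit`, `status`, `cancel`), the job fields, and the watcher thread are all assumptions. Only the ideas of starting the solver with `Popen`, checking it every 0.5s, and supporting cancellation and timeouts come from the summary.

```python
import subprocess
import threading
import time
import uuid


class JobManager:
    """In-memory registry of solver runs. Hypothetical sketch, not the PR's code."""

    def __init__(self, poll_interval=0.5):
        self._jobs = {}
        self._lock = threading.Lock()
        self._poll_interval = poll_interval

    def submit(self, cmd, timeout=None):
        """Start cmd in the background and return a job ID immediately."""
        job_id = str(uuid.uuid4())
        job = {"status": "running", "error": None, "cancel": threading.Event()}
        with self._lock:
            self._jobs[job_id] = job
        watcher = threading.Thread(
            target=self._watch, args=(job, cmd, timeout), daemon=True
        )
        watcher.start()
        return job_id

    def status(self, job_id):
        """Return a copy of the public job fields, or None for an unknown ID."""
        with self._lock:
            job = self._jobs.get(job_id)
            if job is None:
                return None
            return {"job_id": job_id, "status": job["status"], "error": job["error"]}

    def cancel(self, job_id):
        """Request cancellation; True only if the job was still running."""
        with self._lock:
            job = self._jobs.get(job_id)
            if job is None or job["status"] != "running":
                return False
            job["cancel"].set()
            return True

    def _finish(self, job, status, error=None):
        with self._lock:
            job["status"] = status
            job["error"] = error

    def _watch(self, job, cmd, timeout):
        try:
            proc = subprocess.Popen(
                cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
            )
        except OSError as exc:  # for example, the solver binary is missing
            self._finish(job, "error", str(exc))
            return
        deadline = None if timeout is None else time.monotonic() + timeout
        while True:
            try:
                # Wakes up as soon as the process exits, or after one interval.
                proc.wait(timeout=self._poll_interval)
                break
            except subprocess.TimeoutExpired:
                pass
            if job["cancel"].is_set():
                proc.terminate()  # SIGTERM on POSIX, TerminateProcess on Windows
                proc.wait()
                self._finish(job, "cancelled")
                return
            if deadline is not None and time.monotonic() >= deadline:
                proc.terminate()
                proc.wait()
                self._finish(job, "error", "solver timed out")
                return
        if proc.returncode == 0:
            self._finish(job, "completed")
        else:
            self._finish(job, "error", f"solver exited with code {proc.returncode}")
```

In this sketch the request handler would only call `submit()` and return the ID, so the HTTP thread is never held for the duration of the solve. Jobs live in a plain dictionary, so they are lost on server restart.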

Related issues

Validation

  • Tests added/updated (or not applicable) -- not applicable, no test suite exists in this repo yet
  • Validation steps documented
  • Evidence attached (logs/screenshots/output as relevant)

Manual verification steps:

  1. Start the server and open the app
  2. Trigger a run via POST /run with a valid casename, caserunname, and solver
  3. Confirm the response comes back immediately with a job_id and HTTP 202 -- the browser should not hang
  4. Call GET /runStatus/<job_id> -- confirm status moves from running to completed once the solve finishes and the full result is present in the response
  5. Start another run, then call POST /cancelRun/<job_id> while it is running -- confirm the response is Cancellation signal sent and a follow-up status poll shows cancelled
  6. Call POST /batchRun with solver: glpk -- confirm it uses GLPK instead of CBC
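Steps 3 and 4 can also be scripted. The sketch below is one possible polling client, not part of this PR: `wait_for_job`, `http_fetch`, the terminal-state set, and the default interval and wait limit are assumptions chosen for illustration. The endpoint path and the status values are the ones listed above.

```python
import json
import time
import urllib.request

# Statuses after which polling can stop (assumed from the states named in this PR).
TERMINAL_STATES = {"completed", "cancelled", "error"}


def http_fetch(base_url):
    """Build a fetcher that GETs <base_url>/runStatus/<job_id> and parses the JSON."""
    def fetch(job_id):
        with urllib.request.urlopen(f"{base_url}/runStatus/{job_id}") as resp:
            return json.load(resp)
    return fetch


def wait_for_job(fetch_status, job_id, interval=1.0, max_wait=600.0, sleep=time.sleep):
    """Poll until the job leaves the running state, then return the last payload."""
    waited = 0.0
    while True:
        payload = fetch_status(job_id)
        if payload.get("status") in TERMINAL_STATES:
            return payload
        if waited >= max_wait:
            raise TimeoutError(f"job {job_id} still running after {max_wait}s")
        sleep(interval)
        waited += interval
```

Against the server used in the manual tests, a call might look like `wait_for_job(http_fetch("http://127.0.0.1:5002"), job_id)`. Passing the fetcher in as an argument keeps the loop testable without a running server.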

Documentation

  • Docs updated in this PR (or not applicable)
  • Any setup/workflow changes reflected in repo docs

docs/ARCHITECTURE.md updated: system overview line updated, new Solver execution section added describing the async flow, job states, and in-memory storage behaviour.

Scope check

  • No unrelated refactors
  • Implemented from a feature branch
  • Change is deliverable without upstream OSeMOSYS/MUIO dependency
  • Base repo/branch is EAPD-DRB/MUIOGO:main (not upstream)

Manual Testing Evidence

Server: waitress, http://127.0.0.1:5002, demo case CLEWs Demo / REF

Test 1 - POST /run returns 202 immediately without blocking

HTTP 202
{
  "job_id": "41f76b43-181c-4028-b902-f74c3572e556",
  "message": "Solver job started. Poll /runStatus/<job_id> for updates.",
  "status": "running",
  "status_code": "accepted"
}

Test 2 - GET /runStatus/<job_id> returns job state

HTTP 200
{
  "casename": "CLEWs Demo",
  "caserunname": "REF",
  "job_id": "41f76b43-181c-4028-b902-f74c3572e556",
  "solver": "glpk",
  "status": "error",
  "error": "Solver binary 'glpsol' could not be found.",
  "result": null,
  "created_at": 1772552722.995,
  "started_at": 1772552722.996,
  "finished_at": 1772552723.035
}

Test 3 - POST /cancelRun/<job_id> returns 409 for a non-running job

HTTP 409
{
  "message": "Job cannot be cancelled — current status: error",
  "status_code": "error"
}
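The behaviour in Test 3 amounts to a small state check before any signal is sent. A hedged sketch of that decision follows. The function name `cancel_response` is hypothetical, and the 404 branch, the 200 status, and the `"success"` status code are assumptions. Only the 409 response and the two message texts appear in this PR.

```python
def cancel_response(job):
    """Map a job record (or None) to (http_status, body) for POST /cancelRun/<job_id>."""
    if job is None:
        # Assumption: unknown job IDs are reported as 404.
        return 404, {"message": "Job not found", "status_code": "error"}
    if job["status"] != "running":
        # Matches Test 3; \u2014 is the dash character used in the server message.
        message = f"Job cannot be cancelled \u2014 current status: {job['status']}"
        return 409, {"message": message, "status_code": "error"}
    # Assumption: a successful cancellation request is reported as 200.
    return 200, {"message": "Cancellation signal sent", "status_code": "success"}
```

Returning 409 rather than 200 for finished jobs lets a client tell "nothing to cancel" apart from "signal delivered" without a follow-up status poll.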

@Adityakushwaha2006 Adityakushwaha2006 changed the title from "Issue 174-TASK fix" to "async solver execution with job-based run management" Mar 3, 2026
@NamanmeetSingh

@Adityakushwaha2006 This is a fantastic implementation. The addition of the /cancelRun endpoint and using Popen for graceful timeouts is a massive architectural upgrade.

Just flagging for the maintainers that this addresses the same core blocking issue as the TaskManager in PR #146, but takes a slightly different approach (Popen job tracking vs. ThreadPoolExecutor). Both are great, but the cancellation feature here is particularly interesting for the long-running Track 1 tasks.

@Adityakushwaha2006 From the Track 1 perspective, the only architectural requirement for the final async queue is that the job state dictionary supports a mutable metadata: {} field. The ConvergingOrchestrator (PR #24) runs for 15+ minutes and needs to inject live mathematical deltas ($\epsilon$) into the /runStatus payload on the fly so the frontend can graph the convergence. If you could add a quick .update_metadata(job_id, key, value) method to your JobManager, this would perfectly support the macroeconomic loops as well.

@brightyorcerf
Contributor

brightyorcerf commented Mar 3, 2026

@NamanmeetSingh architecturally, there is a fundamental difference here that will block the Track 1 convergence loops if we go with a Popen approach.

PR #186 is built around Subprocesses, which is great for external solver binaries but creates a "Memory Wall." The ConvergingOrchestrator (PR #24) is native Python logic; if we run it via Popen, it cannot share memory or live objects with the main API. We would have to serialize the entire model state to disk just to update a single ϵ value.

In contrast, the TaskManager in #146 uses Threads, allowing the Orchestrator to update the metadata dictionary directly and instantly in shared memory. I could update #146 with a thread-safe cancellation flag (Soft Cancellation) and automated exception handling, giving us the same control as #186 but with the deep Python integration required for the macroeconomic loops if you require that.
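For comparison, the thread-based approach described here (shared-memory metadata plus a soft cancellation flag) might look roughly like the sketch below. It is not the TaskManager from #146: `ThreadTask`, its methods, and the toy `converge` loop that halves ε are all invented for illustration.

```python
import threading


class ThreadTask:
    """Runs a Python callable in a thread with shared metadata and a cancel flag."""

    def __init__(self, target):
        self.status = "pending"
        self.error = None
        self._metadata = {}
        self._lock = threading.Lock()
        self._cancel = threading.Event()
        self._thread = threading.Thread(target=self._run, args=(target,), daemon=True)

    def start(self):
        self.status = "running"
        self._thread.start()

    def join(self, timeout=None):
        self._thread.join(timeout)

    def cancel(self):
        # Soft cancellation: the task must check cancelled() and stop on its own.
        self._cancel.set()

    def cancelled(self):
        return self._cancel.is_set()

    def update_metadata(self, key, value):
        with self._lock:
            self._metadata[key] = value

    def snapshot(self):
        """Copy of the metadata, safe to serialize into a status payload."""
        with self._lock:
            return dict(self._metadata)

    def _run(self, target):
        try:
            target(self)
        except Exception as exc:  # record the failure instead of losing it
            self.error = str(exc)
            self.status = "error"
            return
        self.status = "cancelled" if self._cancel.is_set() else "completed"


def converge(task, tolerance=1e-3):
    """Toy stand-in for an iterative loop that reports its current epsilon."""
    epsilon, iteration = 1.0, 0
    while epsilon > tolerance and not task.cancelled():
        epsilon /= 2
        iteration += 1
        task.update_metadata("epsilon", epsilon)
        task.update_metadata("iteration", iteration)
```

The trade-off is visible in `cancel()`: the flag only takes effect when the running code checks it, so this pattern suits in-process Python loops and cannot interrupt an external solver binary.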

@Adityakushwaha2006
Contributor Author

Hey @brightyorcerf,
The memory wall point is valid, but it only applies if someone runs the ConvergingOrchestrator via Popen, which this PR never proposes and the issue never asked for.

This PR is scoped entirely to external solver binaries: GLPK and CBC are compiled native executables. They run in a separate OS process by definition. Popen is not a choice here; it is the only correct mechanism. And crucially, the cancellation feature described above as interesting is only possible because of Popen: a threading.Thread on its own cannot send SIGTERM to an external binary.

I went through PR #146 as you suggested.
PR #146 and PR #186 solve different layers, and neither supersedes the other. The solver subprocess layer and the in-process orchestration layer have different requirements, and the right answer is both, not one replacing the other.

@Adityakushwaha2006
Contributor Author

Hey @NamanmeetSingh, thanks for the appreciation :D
On .update_metadata(): the Orchestrator already owns its iteration state in process, so routing live ε values through an external registry would mean serializing data you already have direct access to. That would be unnecessary and, in my view, not the architectural direction we should go.

What I think we should do is have a dedicated /convergeStatus/<task_id> endpoint on the Track 1 side. That would be the cleaner interface for frontend progress reporting, without introducing mutable side-channel writes into a registry built for a different layer.

Let me know if you think we should still go through with that change. We can discuss further if there is an actual need for it or if I'm missing something here : )

@SeaCelo
Collaborator

SeaCelo commented Mar 11, 2026

My current view is that this PR is implementing a future async/job-based execution architecture rather than solving a confirmed current local single-user stability problem.

Under the current product model, I am not convinced this should remain in the active queue. My default direction right now is to close it unless there is a strong present-day local-use reason to keep this work moving.

If you want to make that case, please do it in #301.

I’m using that discussion for the first-pass review and will come back here after the review window.

@SeaCelo SeaCelo added the Track: Stability (Run safety, async execution, shared state integrity, and runtime robustness) and blocked labels Mar 11, 2026
@SeaCelo SeaCelo moved this to On Hold in Project Management Mar 11, 2026
@SeaCelo SeaCelo added the needs-decision (Waiting on maintainer clarification or decision) label and removed the blocked label Mar 11, 2026
@github-project-automation github-project-automation Bot moved this from On Hold to Done in Project Management Mar 11, 2026
@Adityakushwaha2006
Contributor Author

Closing in response to the scope review in #301. The implementation is future async architecture rather than a present-day local-use fix.
The work is preserved in the branch for reference in case the async execution layer becomes relevant later.


Labels

needs-decision: Waiting on maintainer clarification or decision
Track: Stability: Run safety, async execution, shared state integrity, and runtime robustness

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

[Task] Solver execution blocks all API requests for the duration of a run

4 participants