async solver execution with job-based run management #186
async solver execution with job-based run management #186Adityakushwaha2006 wants to merge 1 commit intoEAPD-DRB:mainfrom
Conversation
|
@Adityakushwaha2006 This is a fantastic implementation. The addition of the Just flagging for the maintainers that this addresses the same core blocking issue as the @Adityakushwaha2006 From the Track 1 perspective, the only architectural requirement for the final async queue is that the job state dictionary supports a mutable |
|
@NamanmeetSingh architecturally, there is a fundamental difference here that will block the Track 1 convergence loops if we go with a Popen approach. PR #186 is built around Subprocesses, which is great for external solver binaries but creates a "Memory Wall." The ConvergingOrchestrator (PR #24) is native Python logic; if we run it via Popen, it cannot share memory or live objects with the main API. We would have to serialize the entire model state to disk just to update a single ϵ value. In contrast, the TaskManager in #146 uses Threads, allowing the Orchestrator to update the metadata dictionary directly and instantly in shared memory. I could update #146 with a thread-safe cancellation flag (Soft Cancellation) and automated exception handling, giving us the same control as #186 but with the deep Python integration required for the macroeconomic loops if you require that. |
|
Hey @brightyorcerf , This PR is scoped entirely to external solver binaries: GLPK and CBC are compiled C executables. They run in a separate OS process by definition.Popen is not a choice here , it kinda is the only correct mechanism. And crucially, the cancellation feature you're describing as interesting is only possible because of popen.....threading.Thread cannot send SIGTERM to an external binary.... I went through PR #146 as stated in your message. |
|
Hey @NamanmeetSingh , thanks for the appreciation :D What i think we should do , is have a dedicated Lemme know if you think we should still go through with that change , we can discuss further , if we can actually see a need for the same or if im missing something here : ) |
|
My current view is that this PR is implementing a future async/job-based execution architecture rather than solving a confirmed current local single-user stability problem. Under the current product model, I am not convinced this should remain in the active queue. My default direction right now is to close it unless there is a strong present-day local-use reason to keep this work moving. If you want to make that case, please do it in #301. I’m using that discussion for the first-pass review and will come back here after the review window. |
|
Closing in response to the scope review in #301. The implementation is future async architecture rather than a present day local use fix. |
Summary
JobManagerto track solver runs as background jobs.POST /runnow returns a job ID immediately instead of waiting for the solver to finish. Two new endpoints handle status polling (GET /runStatus/<job_id>) and cancellation (POST /cancelRun/<job_id>). The solver subprocess is now started withPopenand checked every 0.5s so it can be cancelled or timed out cleanly. FixedbatchRunwhich was ignoring the solver field from the request and always using CBC.subprocess.run()inside the request handler, which blocked the thread for the entire duration of the run. If the browser timed out and disconnected, the solver kept running but the result was thrown away. There was also no way to cancel a run once started. This change fixes both.Related issues
Validation
Manual verification steps:
POST /runwith a valid casename, caserunname, and solverjob_idand HTTP 202 -- the browser should not hangGET /runStatus/<job_id>-- confirm status moves fromrunningtocompletedonce the solve finishes and the full result is present in the responsePOST /cancelRun/<job_id>while it is running -- confirm the response isCancellation signal sentand a follow-up status poll showscancelledPOST /batchRunwithsolver: glpk-- confirm it uses GLPK instead of CBCDocumentation
docs/ARCHITECTURE.mdupdated: system overview line updated, newSolver executionsection added describing the async flow, job states, and in-memory storage behaviour.Scope check
OSeMOSYS/MUIOdependencyEAPD-DRB/MUIOGO:main(not upstream)Manual Testing Evidence
Server:
waitress,http://127.0.0.1:5002, demo caseCLEWs Demo / REFTest 1 -
POST /runreturns 202 immediately without blockingTest 2 -
GET /runStatus/<job_id>returns job stateTest 3 -
POST /cancelRun/<job_id>returns 409 for a non-running job