#1986 concurrent planning (#2072)
Greptile Summary
Confidence Score: 4/5

Safe to merge, with awareness that the module-wide FAULT policy (any single arm's planning failure blocks all arms from new plans) remains in place; this is documented as intentional, but it affects bimanual workflows in the failure path.

The five previously flagged bugs are substantively addressed: `_wait_plan` now resolves the robot name before calling `get_planning_status`, so single-robot skills no longer always return failure; `execute()` wraps `task_invoke` in try/except, so coordinator crashes properly land in FAULT; and `pick()` calls `_clear_failed_plan_for_retry` between grasp candidates.

The refactor introduces a large amount of new concurrent state (`PlanningJob` maps, per-robot `_executing`/`_last_op_success` dicts, a `ThreadPoolExecutor`) in the critical manipulation path, and concurrent Drake context access relies on the planner using scratch contexts internally. Both are reasonable designs, but difficult to verify exhaustively from code review alone. `dimos/manipulation/manipulation_module.py` deserves the closest look: it carries most of the new concurrency logic, and the interactions between `_planning_jobs`, `_last_op_success`, `_executing`, and `_state` span several methods that must stay mutually consistent.
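The summary's concern about keeping `_planning_jobs`, `_executing`, and `_last_op_success` mutually consistent comes down to the priority-ordered aggregation the PR describes (FAULT > PLANNING > EXECUTING > COMPLETED > IDLE). A minimal sketch of such a derivation, with assumed map shapes — the real `get_state()` and its types are not shown on this page:

```python
# Sketch: derive one aggregate module state from per-robot maps,
# in the priority order the PR describes. Map shapes are assumptions:
#   planning_jobs: robot -> job object or None
#   executing: robot -> bool
#   last_op_success: robot -> bool (True/False for the last finished op)
from enum import IntEnum

class State(IntEnum):
    IDLE = 0
    COMPLETED = 1
    EXECUTING = 2
    PLANNING = 3
    FAULT = 4

def derive_state(planning_jobs, executing, last_op_success):
    """Aggregate per-robot maps into one module state by priority."""
    if any(ok is False for ok in last_op_success.values()):
        return State.FAULT                       # any failure wins
    if any(job is not None for job in planning_jobs.values()):
        return State.PLANNING                    # any in-flight plan
    if any(executing.values()):
        return State.EXECUTING
    if any(ok is True for ok in last_op_success.values()):
        return State.COMPLETED
    return State.IDLE
```

The FAULT check running first is what makes one arm's failure shadow another arm's in-flight plan, matching the module-wide FAULT policy noted above.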
Sequence Diagram

```mermaid
sequenceDiagram
    participant Caller
    participant ManipModule
    participant Pool as ThreadPoolExecutor
    participant Worker as _planning_worker
    participant WorldMonitor
    participant Planner as Drake RRT/IK
    participant Coordinator
    Caller->>ManipModule: plan_to_pose(pose, "left_arm")
    ManipModule->>ManipModule: "_begin_planning() → PlanningJob(future=None)"
    ManipModule->>Pool: submit(_planning_worker, "left_arm", ...)
    Pool-->>ManipModule: future
    ManipModule->>ManipModule: "job.future = future (under lock)"
    ManipModule-->>Caller: True (accepted immediately)
    Caller->>ManipModule: plan_to_pose(pose, "right_arm")
    ManipModule->>ManipModule: "_begin_planning() → PlanningJob(future=None)"
    ManipModule->>Pool: submit(_planning_worker, "right_arm", ...)
    Pool-->>ManipModule: future
    ManipModule-->>Caller: True (both in flight concurrently)
    par left_arm planning
        Worker->>WorldMonitor: dismiss_preview(left_id) [locked]
        Worker->>WorldMonitor: get_current_joint_state(left_id)
        Worker->>Planner: plan_joint_path(world, left_id, ...)
        Planner-->>Worker: path
        Worker->>ManipModule: _complete_job("left_arm", req_id, True, path, traj)
    and right_arm planning
        Worker->>WorldMonitor: dismiss_preview(right_id) [locked]
        Worker->>WorldMonitor: get_current_joint_state(right_id)
        Worker->>Planner: plan_joint_path(world, right_id, ...)
        Planner-->>Worker: path
        Worker->>ManipModule: _complete_job("right_arm", req_id, True, path, traj)
    end
    Caller->>ManipModule: wait_for_planning_completion(None, timeout)
    ManipModule-->>Caller: True (both done)
    Caller->>ManipModule: execute("left_arm")
    ManipModule->>Coordinator: task_invoke("left_arm_task", "execute", traj)
    Coordinator-->>ManipModule: True
    ManipModule-->>Caller: True
    Caller->>ManipModule: execute("right_arm")
    ManipModule->>Coordinator: task_invoke("right_arm_task", "execute", traj)
    Coordinator-->>ManipModule: True
    ManipModule-->>Caller: True
```
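The caller-side pattern in the diagram — both plan calls accepted immediately, planning overlapped in a pool, one wait covering all in-flight jobs, then per-robot execute — can be mocked without any dimos code. `FakeManipModule` and its 0.1 s fake planner below are purely illustrative stand-ins, not the real API:

```python
# Toy stand-in for the accept/wait/execute flow shown in the diagram.
import time
from concurrent.futures import ThreadPoolExecutor, wait

class FakeManipModule:
    def __init__(self, robots):
        self._pool = ThreadPoolExecutor(max_workers=len(robots))
        self._futures = {}   # robot_name -> in-flight planning future
        self._paths = {}     # robot_name -> finished plan

    def plan_to_pose(self, pose, robot_name):
        if robot_name in self._futures:
            return False                     # one in-flight plan per robot
        self._futures[robot_name] = self._pool.submit(self._plan, pose, robot_name)
        return True                          # accepted immediately

    def _plan(self, pose, robot_name):
        time.sleep(0.1)                      # stand-in for Drake IK + RRT
        self._paths[robot_name] = [f"{robot_name}-q0", f"{robot_name}-q1"]

    def wait_for_planning_completion(self, robot_name=None, timeout=None):
        names = list(self._futures) if robot_name is None else [robot_name]
        done, pending = wait([self._futures[n] for n in names], timeout=timeout)
        for n in names:
            if self._futures[n] in done:
                del self._futures[n]
        return not pending

    def execute(self, robot_name):
        if robot_name in self._futures:
            return False                     # planning in flight: refuse
        return robot_name in self._paths     # "executes" the stored path

m = FakeManipModule(["left_arm", "right_arm"])
assert m.plan_to_pose("pose_L", "left_arm")
assert m.plan_to_pose("pose_R", "right_arm")   # both in flight concurrently
assert m.wait_for_planning_completion(None, timeout=5.0)
assert m.execute("left_arm") and m.execute("right_arm")
```

The duplicate-rejection and executing-while-planning refusals mirror the per-robot gating described in the PR's Solution section, but the real module's locking and fault handling are far richer than this mock.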
Reviews (5). Last reviewed commit: "Merge branch 'main' into feature/1986-co..."
Problem
Details here - #1986
With multiple arms (e.g. OpenArm bimanual), every plan call blocks the whole module for 5–15s, so a bimanual sequence pays planning latency twice before either arm moves.
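The cost in miniature: two simulated 0.2 s "plans" paid back-to-back versus overlapped in a two-worker pool. The sleep stands in for the real 5–15 s Drake call; no dimos API is used here:

```python
# Sequential vs. overlapped planning latency, with fake 0.2 s plans.
import time
from concurrent.futures import ThreadPoolExecutor

def fake_plan(arm):
    time.sleep(0.2)                  # stand-in for a 5-15 s Drake IK + RRT call
    return f"{arm}-path"

t0 = time.perf_counter()
for arm in ("left_arm", "right_arm"):        # latency paid twice
    fake_plan(arm)
sequential = time.perf_counter() - t0

t0 = time.perf_counter()
with ThreadPoolExecutor(max_workers=2) as pool:   # one worker per arm
    list(pool.map(fake_plan, ("left_arm", "right_arm")))
concurrent = time.perf_counter() - t0
# concurrent wall time is roughly one plan, not two
```

Threads suffice here because the planner call blocks outside the GIL (sleeping in this sketch, native Drake code in the real module), so the two "plans" genuinely overlap.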
`ManipulationModule` already keys `_planned_paths`/`_planned_trajectories` by `robot_name`, but `plan_to_*` calls `self._planner.plan_joint_path(...)` inline, so a second plan can't start until the first returns.

This PR allows `ManipulationModule` to plan concurrently for an arbitrary number of arms, followed by overlapped `preview` and `execute`.

Closes DIM-854
Solution
**Async planning API.** `plan_to_pose`/`plan_to_joints` now submit work to a per-module `ThreadPoolExecutor` (one worker per configured robot) and return immediately on accept. The Drake IK + RRT call moves into a new `_planning_worker` that runs lock-free against `WorldMonitor` (Drake scratch contexts handle isolation) and publishes its result via compare-and-store on `request_id`.

**Per-robot job tracking.** New `PlanningJob` dataclass. State is no longer a single module-wide enum: `get_state()` derives an aggregate (FAULT > PLANNING > EXECUTING > COMPLETED > IDLE) from `_planning_jobs`, `_executing`, and `_last_op_success` maps keyed by robot.

**New RPCs on `ManipulationModule`:**

- `wait_for_planning_completion(robot_name=None, timeout=None)`: blocks on one robot's future or on all active ones.
- `get_planning_status(robot_name=None)`: returns a per-robot dict (`active`, `done`, `success`, `error`, `duration_s`, `invalidated`, `request_id`) or a map of all robots.
- `has_planned_path`, `clear_planned_path`, `cancel`: now take `robot_name` and refuse to act on a robot with an in-flight job.

**`reset()`, `cancel()`, `execute()` rewrites.**

- `reset()`: invalidates active jobs (Drake RRT isn't preemptable; late results are dropped at compare-and-store), clears stored paths/trajectories and `_last_op_success`, and refuses if any robot is executing.
- `cancel(robot_name=None)`: clears the execute accept window for one or all robots; no longer touches planning.
- `execute()`: per-robot gating; refuses if planning is in flight or the module is faulted, and sets `_executing[robot_name]` only during the coordinator `task_invoke` accept window.

**Updates to downstream consumers.**

**Concurrency-safe `WorldMonitor.dismiss_preview(robot_id)`.** Serialized under the monitor lock, since the live world isn't thread-safe; the planner worker calls this before each new plan.
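A sketch of the compare-and-store publication and reset-time invalidation described above. `PlanningJob`'s fields and the `JobTable` helper are guesses for illustration, not the PR's actual types:

```python
# Illustrative compare-and-store job table: a worker result is only
# published if its request_id still matches the live job, so results
# that arrive after reset() are silently dropped.
import threading
from dataclasses import dataclass
from typing import Optional

@dataclass
class PlanningJob:                       # fields assumed, not from the PR
    request_id: int
    done: bool = False
    success: bool = False
    path: Optional[list] = None

class JobTable:
    def __init__(self):
        self._lock = threading.Lock()
        self._jobs = {}                  # robot_name -> PlanningJob

    def begin(self, robot, request_id):
        with self._lock:
            job = self._jobs.get(robot)
            if job is not None and not job.done:
                return None              # one in-flight job per robot
            self._jobs[robot] = PlanningJob(request_id)
            return self._jobs[robot]

    def complete(self, robot, request_id, success, path):
        """Compare-and-store: only the matching request_id may publish."""
        with self._lock:
            job = self._jobs.get(robot)
            if job is None or job.request_id != request_id:
                return False             # stale result (e.g. after reset)
            job.done, job.success, job.path = True, success, path
            return True

    def invalidate(self, robot, new_request_id):
        """reset(): bump the id so late worker results no longer match."""
        with self._lock:
            self._jobs[robot] = PlanningJob(new_request_id, done=True)
```

This is why Drake RRT's lack of preemption is tolerable: `reset()` never has to interrupt the worker thread, it only ensures the worker's eventual result fails the id comparison.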
**Downstream consumers updated.** `pick_and_place_module.py` now waits for the async planner between grasp candidates via the new private `_wait_plan` helper; `_preview_execute_wait` waits internally, so existing callers (`move_to_pose`, `move_to_joints`, home/init, lift) keep their pre-async semantics. The interactive client (`manipulation_client.py`) adds `wait_plan()` and `plan_status()`. README + OpenArm integration doc updated.

**Tests.** `TestAsyncPlanningUnit` (8 deterministic unit tests, gated fake planner): accept latency, concurrent distinct-robot plans, duplicate rejection, fault gating, reset-invalidates-late-results, preview/execute per-robot gating, `clear_planned_path` gating. `test_openarm_bimanual_planning.py` (e2e): sequential-vs-concurrent timing assertion, plus overlapped preview and execute on the `openarm-mock-planner-coordinator` blueprint.

How to Test