feat: implement task pause/resume functionality #357

Open
soumojit-D48 wants to merge 1 commit into GetBindu:main from soumojit-D48:feat/implement-task-pause-resume

Conversation

@soumojit-D48 commented Mar 13, 2026

Summary

  • Problem: Long-running AI agent tasks could not be stopped temporarily to free resources and later resumed from where they left off
  • Why it matters: Users need the ability to pause resource-intensive tasks without losing progress, then resume when ready
  • What changed: Implemented pause/resume handlers in the worker base, added checkpoint save/restore, added suspended/resumed task states, and implemented pause_task/resume_task in the scheduler
  • What did NOT change: Task execution logic, storage interface, protocol types (except for the new states)

Change Type (select all that apply)

  • Feature
  • Bug fix
  • Refactor
  • Documentation
  • Security hardening
  • Tests
  • Chore/infra

Scope (select all touched areas)

  • Server / API endpoints
  • Extensions (DID, x402, etc.)
  • Storage backends
  • Scheduler backends
  • Observability / monitoring
  • Authentication / authorization
  • CLI / utilities
  • Tests
  • Documentation
  • CI/CD / infra

Linked Issue/PR

User-Visible / Behavior Changes

  • Tasks can now be paused (state: suspended) and resumed (state: resumed)
  • Checkpoint data is saved when pausing to preserve task context
  • Only working tasks can be paused; only suspended tasks can be resumed
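
The transition rules above can be sketched as a small guard function. This is a minimal illustration, not the code in this PR: the `TaskState` enum, the `InvalidTransition` exception, and `next_state` are hypothetical names, and only the state strings come from the description.

```python
from enum import Enum


class TaskState(str, Enum):
    # State names as used in this PR description; the enum itself is hypothetical.
    WORKING = "working"
    SUSPENDED = "suspended"
    RESUMED = "resumed"
    COMPLETED = "completed"
    CANCELED = "canceled"
    FAILED = "failed"


class InvalidTransition(Exception):
    """Raised when pause/resume is requested from a state that does not allow it."""


# Source states allowed for each operation, per the rules listed above.
PAUSABLE = {TaskState.WORKING}
RESUMABLE = {TaskState.SUSPENDED}


def next_state(current: TaskState, operation: str) -> TaskState:
    """Return the state a task enters after `operation`, or raise if not allowed."""
    if operation == "pause":
        if current not in PAUSABLE:
            raise InvalidTransition(f"cannot pause a task in state {current.value!r}")
        return TaskState.SUSPENDED
    if operation == "resume":
        if current not in RESUMABLE:
            raise InvalidTransition(f"cannot resume a task in state {current.value!r}")
        return TaskState.RESUMED
    raise ValueError(f"unknown operation: {operation!r}")
```

Keeping the allowed source states in two sets makes it easy to relax a rule later (for example, allowing a resumed task to be paused again) without touching the control flow.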

Security Impact (required)

  • New permissions/capabilities? (No)
  • Secrets/credentials handling changed? (No)
  • New/changed network calls? (No)
  • Database schema/migration changes? (No)
  • Authentication/authorization changes? (No)
  • If any Yes, explain risk + mitigation: N/A

Verification

Environment

  • OS: Windows/macOS/Linux
  • Python version: 3.x
  • Storage backend: Any (base interface unchanged)
  • Scheduler backend: memory/redis

Steps to Test

  1. Start a long-running task
  2. Call pause_task with task_id
  3. Verify task state changes to "suspended"
  4. Call resume_task with task_id
  5. Verify task state changes to "resumed"

Expected Behavior

  • Paused task should save checkpoint and enter suspended state
  • Resumed task should restore checkpoint and enter resumed state
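
As a rough picture of the expected behavior, the sketch below pairs each state change with a checkpoint save or restore. It uses an in-memory stand-in rather than the real scheduler: the `Task` dataclass, the `InMemoryScheduler` class, and the method signatures are assumptions made for illustration; only the names `pause_task` and `resume_task` and the state strings come from this PR.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class Task:
    # Hypothetical task record; the real task type may look different.
    task_id: str
    state: str = "working"
    context: Dict[str, Any] = field(default_factory=dict)


class InMemoryScheduler:
    """Hypothetical stand-in used only to illustrate the expected behavior."""

    def __init__(self) -> None:
        self.tasks: Dict[str, Task] = {}
        self.checkpoints: Dict[str, Dict[str, Any]] = {}

    def pause_task(self, task_id: str) -> Task:
        task = self.tasks[task_id]
        if task.state != "working":
            raise ValueError(f"cannot pause task in state {task.state!r}")
        # Save a copy of the context before changing state.
        self.checkpoints[task_id] = dict(task.context)
        task.state = "suspended"
        return task

    def resume_task(self, task_id: str) -> Task:
        task = self.tasks[task_id]
        if task.state != "suspended":
            raise ValueError(f"cannot resume task in state {task.state!r}")
        # Restore the saved context; fall back to an empty one if none exists.
        task.context = dict(self.checkpoints.pop(task_id, {}))
        task.state = "resumed"
        return task
```

A usage example under the same assumptions:

```python
scheduler = InMemoryScheduler()
scheduler.tasks["t1"] = Task("t1", context={"step": 3})
scheduler.pause_task("t1")   # state becomes "suspended", checkpoint saved
scheduler.resume_task("t1")  # state becomes "resumed", context restored
```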

Actual Behavior

Evidence (attach at least one)

  • Failing test before + passing after
  • Test output / logs
  • Screenshot / recording
  • Performance metrics (if relevant)
[attached image]

Human Verification (required)

What you personally verified (not just CI):

  • Verified scenarios: Code review of implementation logic
  • Edge cases checked: Invalid state transitions handled (pause completed/canceled/failed tasks, resume non-suspended tasks)
  • What you did NOT verify: Runtime testing with actual task execution

Compatibility / Migration

  • Backward compatible? (Yes)
  • Config/env changes? (No)
  • Database migration needed? (No)
  • If yes, exact upgrade steps: N/A

Failure Recovery (if this breaks)

  • How to disable/revert this change quickly: Revert commit
  • Files/config to restore: N/A
  • Known bad symptoms reviewers should watch for: Tasks stuck in suspended state

Risks and Mitigations

  • Risk: Task checkpoint data could grow large in storage
    • Mitigation: Only save essential metadata, not full task state
  • Risk: Resume could fail if checkpoint data is corrupted
    • Mitigation: Handle missing checkpoint gracefully, log warnings
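
Both mitigations can be sketched together: trim the checkpoint to a few essential fields when saving, and treat a missing or corrupted checkpoint as an empty one when loading. This is a sketch under stated assumptions, not the PR's implementation: JSON serialization, the `ESSENTIAL_KEYS` tuple, and the function names are all hypothetical.

```python
import json
import logging
from typing import Any, Dict, Optional

logger = logging.getLogger("checkpoint")

# Hypothetical list of fields worth persisting; everything else is dropped.
ESSENTIAL_KEYS = ("task_id", "step", "paused_at")


def dump_checkpoint(state: Dict[str, Any]) -> str:
    """Serialize only the essential metadata to keep stored checkpoints small."""
    trimmed = {key: state[key] for key in ESSENTIAL_KEYS if key in state}
    return json.dumps(trimmed)


def load_checkpoint(raw: Optional[str]) -> Dict[str, Any]:
    """Parse a stored checkpoint, returning {} (with a warning) if it is unusable."""
    if not raw:
        logger.warning("no checkpoint found; resuming with empty context")
        return {}
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        logger.warning("checkpoint is not valid JSON; resuming with empty context")
        return {}
    if not isinstance(data, dict):
        logger.warning("checkpoint has unexpected shape; resuming with empty context")
        return {}
    return data
```

Returning an empty dict instead of raising lets a resume proceed with a warning; whether a task should actually resume without its checkpoint is a design decision this sketch does not settle.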

Checklist

  • [x] Tests pass (uv run pytest)
  • Pre-commit hooks pass (uv run pre-commit run --all-files)
  • Documentation updated (if needed)
  • Security impact assessed
  • Human verification completed
  • Backward compatibility considered

@soumojit-D48
Author

Hi @raahulrahl, can you check this PR and let me know whether it's helpful? Thanks!

@Paraschamoli
Contributor

Hey @soumojit-D48! I tried testing the pause/resume feature locally. When I send a request with method: "tasks/pause", the server returns an error saying the method isn’t recognized.

It looks like the worker and scheduler base were updated, but I couldn’t find where tasks/pause and tasks/resume are exposed in the RPC/API layer. Because of that, I’m not able to trigger the pause operation through the API.

Am I missing something in the setup, or do those handlers still need to be added?
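
For reference, the gap described above can be pictured as missing entries in a method dispatch table. The following is a minimal sketch assuming a JSON-RPC 2.0 style dispatcher; the handler functions, the `METHODS` table, and the `task_id` parameter are hypothetical and not taken from the Bindu codebase. Only the method names `tasks/pause` and `tasks/resume` come from this thread.

```python
from typing import Any, Callable, Dict

Handler = Callable[[Dict[str, Any]], Dict[str, Any]]


def pause_handler(params: Dict[str, Any]) -> Dict[str, Any]:
    # Placeholder: a real handler would call into the scheduler.
    return {"id": params["task_id"], "state": "suspended"}


def resume_handler(params: Dict[str, Any]) -> Dict[str, Any]:
    # Placeholder: a real handler would call into the scheduler.
    return {"id": params["task_id"], "state": "resumed"}


# Without these two entries, requests for either method hit the error branch below.
METHODS: Dict[str, Handler] = {
    "tasks/pause": pause_handler,
    "tasks/resume": resume_handler,
}


def dispatch(request: Dict[str, Any]) -> Dict[str, Any]:
    """Route a JSON-RPC 2.0 request to its handler or return 'Method not found'."""
    handler = METHODS.get(request.get("method", ""))
    if handler is None:
        return {
            "jsonrpc": "2.0",
            "id": request.get("id"),
            # -32601 is the JSON-RPC 2.0 code for "Method not found".
            "error": {"code": -32601, "message": "Method not found"},
        }
    result = handler(request.get("params", {}))
    return {"jsonrpc": "2.0", "id": request.get("id"), "result": result}
```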

Development

Successfully merging this pull request may close these issues.

[Feature]: Task Pause/Resume

2 participants