Realtime StoryRun stop path waits on transport drainTimeoutSeconds instead of topology termination #87

@lanycrost

Summary

When a realtime StoryRun receives spec.cancelRequested=true, Bobrapet currently routes it into handleGracefulCancel and keeps it alive until the resolved timeout expires. In the current implementation that timeout is derived from transport lifecycle.drainTimeoutSeconds, so room-end stop is delayed by the transport cutover drain window.

In the live cluster this produced a fixed ~2 minute wait after the call was already over.

Expected behavior

Realtime room-end stop should not inherit transport cutover drain timeout.

Once the realtime topology is gone, Bobrapet should finish the StoryRun quickly instead of waiting for drainTimeoutSeconds and then force-deleting StepRuns.

Observed behavior

Live run:

  • livekit-voice/livekit-voice-assistant-rm-6jwy7uutjydd-241a4c029032ae1d
  • spec.cancelRequested: true
  • storyrun.bubustack.io/graceful-cancel-observed-at: 2026-04-22T18:13:32.023539756Z
  • status.finishedAt: 2026-04-22T18:15:32Z
  • status.duration: 2m13s

Controller log for the same run:

  • Graceful cancel timeout expired; deleting remaining StepRuns
  • timeout:"2m0s"
  • startedAt:"2026-04-22T18:13:32.023539756Z"

The live example is currently configured with:

  • examples/realtime/livekit-voice/story.yaml:56-61
  • drainTimeoutSeconds: 120

Why this is wrong

The current StoryRun stop path couples room-end cancellation to transport lifecycle drain:

  • internal/controller/runs/storyrun_controller.go:278-280
    • cancelRequested short-circuits normal DAG reconciliation and goes straight into handleGracefulCancel
  • internal/controller/runs/storyrun_controller.go:1672-1795
    • handleGracefulCancel resolves its timeout from story transport drainTimeoutSeconds
  • api/transport/v1alpha1/transport_settings_types.go:433-440
    • DrainTimeoutSeconds is documented as drain-before-cutover transport lifecycle behavior, not room-end StoryRun stop behavior

So the controller currently uses a transport upgrade/cutover knob as the end-call stop delay.

Impact

  • Realtime StoryRuns stay alive long after the call is already over
  • Users see “stop” take minutes even when there is nothing left to process
  • The controller cleans up only after the graceful-cancel timeout expires
  • This blocks the fast realtime termination path from being useful in production

Acceptance criteria

  • cancelRequested=true for realtime runs must not automatically wait on transport drainTimeoutSeconds
  • Room-end stop should use a dedicated shutdown contract, or finish as soon as realtime topology termination is observed
  • Bobrapet should not keep a canceled realtime StoryRun alive for the full transport cutover window when the room is already gone
  • Add regression coverage for a realtime StoryRun where stop is requested after room termination and verify the StoryRun reaches terminal state without waiting for the transport cutover drain timeout

Metadata

Labels

  • area/operator: Bobrapet controller or CRD-level change.
  • kind/bug: Unexpected behaviour or regression that needs fixing.
  • priority/critical: Production-impacting issue that needs immediate attention.
