Skip to content

docs(agent): agent-workflows design wiki, ground truth, and archived POCs#4779

Open
mmabrouk wants to merge 2 commits into
mainfrom
docs/agent-workflows
Open

docs(agent): agent-workflows design wiki, ground truth, and archived POCs#4779
mmabrouk wants to merge 2 commits into
mainfrom
docs/agent-workflows

Conversation

@mmabrouk

@mmabrouk mmabrouk commented Jun 19, 2026

Copy link
Copy Markdown
Member

Part of the agent-workflows PR set (sliced by functional area, final code only)

Most of these are independent off main. Two short stacks exist (an indented item stacks on the line above it). Each PR shows only its own area.

This PR's base is main. Review it on its own.

Context

This is the design wiki for the agent-workflows project. It is an independent docs PR off main and carries no runnable code. Read it before the code PRs in the stack. The pages map what the current code does, what is still missing, and how the remaining work should be sliced. Start here so the behavior PRs are easier to follow.

What this adds

Current-state design pages under docs/design/agent-workflows/. ground-truth.md maps every code surface to its current role and splits the work into implemented, not implemented, and planned. meeting-alignment.md checks the work against the June 18 design discussion. implementation-review.md lists cleanup risks. pr-stack.md proposes the reviewable PR breakpoints. agent-template.md, protocol.md, triggers.md, status.md, ports-and-adapters.md, and sessions.md each cover one axis of the design. adapters/ holds the Pi, Claude Code, and Agenta harness notes.

Two scoped design folders. sdk-local-tools/ covers standalone SDK tool resolution, which is partly built and blocked on LocalBackend. docs/design/vault-named-secrets/ covers user-named project secrets in the vault, storage and management UI only for this iteration.

Archived POC material under trash/. This holds the original work-package notes, research spikes, proof-of-concept code, and superseded RFCs. It is kept for provenance. It is not design truth.

How to review this PR

Read ground-truth.md first. It is the source of truth that the other pages defer to, so it frames everything else. Then read pr-stack.md to see how the remaining work splits into reviewable PRs. Treat trash/ as an archive, not files to read line by line. Open a trash/ note only when a current page points you there for history or rationale.

@vercel

vercel Bot commented Jun 19, 2026

Copy link
Copy Markdown

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
agenta-documentation Ready Ready Preview, Comment Jun 19, 2026 8:32pm

Request Review

@coderabbitai

coderabbitai Bot commented Jun 19, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 86b63082-f0ba-4040-bf46-3b2c84576934

📥 Commits

Reviewing files that changed from the base of the PR and between 1087fa2 and 8131d20.

📒 Files selected for processing (6)
  • docs/design/agent-workflows/README.md
  • docs/design/agent-workflows/architecture.md
  • docs/design/agent-workflows/ground-truth.md
  • docs/design/agent-workflows/implementation-review.md
  • docs/design/agent-workflows/pr-stack.md
  • docs/design/agent-workflows/status.md
✅ Files skipped from review due to trivial changes (5)
  • docs/design/agent-workflows/README.md
  • docs/design/agent-workflows/status.md
  • docs/design/agent-workflows/architecture.md
  • docs/design/agent-workflows/pr-stack.md
  • docs/design/agent-workflows/ground-truth.md
🚧 Files skipped from review as they are similar to previous changes (1)
  • docs/design/agent-workflows/implementation-review.md

📝 Walkthrough

Summary by CodeRabbit

  • Documentation
    • Added comprehensive agent-workflows documentation covering overall architecture, the public protocol (invoke/messages/load-session), session behavior (cold replay), and upcoming trigger concepts.
    • Published detailed adapter design write-ups for Pi, Claude Code, and the Agenta harness, plus a “ground truth” hub and status/known-gaps guidance.
    • Documented SDK local tools work, including review findings and conventions, along with vault named secrets planning and related design decisions.

Walkthrough

This PR adds agent-workflow design documentation, archived historical research and POC notes, sdk-local-tools review artifacts, and a separate planning set for vault named secrets.

Changes

Agent workflows documentation

Layer / File(s) Summary
Current workflow architecture and contracts
docs/design/agent-workflows/{README.md,architecture.md,ground-truth.md,ports-and-adapters.md,protocol.md,sessions.md,agent-template.md,triggers.md,status.md,meeting-alignment.md,open-issues.md,pr-stack.md,implementation-review.md}
Documents the current agent workflow runtime, routing, protocol contracts, session model, template shape, trigger concept, open gaps, status, and proposed PR slicing.
Harness and adapter behavior
docs/design/agent-workflows/adapters/{agenta,claude-code,pi}.md
Describes Pi, Claude, and Agenta harness behavior, including execution paths, prompt layering, tool delivery, tracing, permission handling, and backend support limits.
SDK local tools design set
docs/design/agent-workflows/sdk-local-tools/{README,context,research,plan,codebase-conventions,conventions-review,organization-proposal,status}.md
Defines the standalone local-tools problem, proposed package organization, conventions, phased rollout, current limitations, and status for local tool resolution work.
SDK local tools review records
docs/design/agent-workflows/sdk-local-tools/review/*
Adds the review scope, evidence, findings, risk tracking, scorecard, progress, metadata, and summary for the sdk-local-tools review.
Historical architecture and RFC archive
docs/design/agent-workflows/trash/README.md, docs/design/agent-workflows/trash/harness-port-redesign/*, docs/design/agent-workflows/trash/old-rfcs/*, docs/design/agent-workflows/trash/sdk-local-backend/status.md
Archives superseded architecture proposals, build plans, RFCs, and older local-backend status material under the trash workspace.
Historical research archive
docs/design/agent-workflows/trash/research/*
Adds archived research notes covering auth and secrets, Daytona sandboxes, in-memory Pi execution, telemetry, Pi interaction details, and shared-sandbox behavior.
Historical work packages and POCs
docs/design/agent-workflows/trash/wp-*/*
Adds archived work-package docs and POC scripts for Pi tracing, agent service experiments, Daytona sandbox runs, tool bridging, and Rivet ACP runtime experiments.

Vault named secrets planning docs

Layer / File(s) Summary
Vault named secrets design and rollout
docs/design/vault-named-secrets/{README,context,plan,research,status}.md
Adds planning documentation for a new custom_secret vault kind, including backend, migration, generated client, frontend, and verification work.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 25.81% which is insufficient. The required threshold is 60.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the main change: adding comprehensive agent-workflows design documentation including ground truth, design pages, and archived POCs.
Description check ✅ Passed The description is well-organized and directly related to the changeset, explaining what design pages are added, the purpose of ground-truth.md, the scoped design folders, archived material, and review guidance.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch docs/agent-workflows

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@dosubot dosubot Bot added size:XXL This PR changes 1000+ lines, ignoring generated files. documentation Improvements or additions to documentation labels Jun 19, 2026
@mmabrouk

Copy link
Copy Markdown
Member Author

Reviewer guide: where to look

  • docs/design/agent-workflows/ground-truth.md (read first). The current implementation map. It ties each code surface to its role and splits the work into implemented, not implemented, and planned. Every other page defers to this one.
  • docs/design/agent-workflows/pr-stack.md. The proposed breakpoints for turning this work into reviewable stacked PRs, with scope, out-of-scope, and validation per PR. Read it second to see the review order.
  • docs/design/agent-workflows/agent-template.md. The intended split between generic agent identity, harness-specific config, and runtime infrastructure. This is the contract the template PR should land.
  • docs/design/agent-workflows/README.md. The "read in this order" index if you want the full tour past the two headline pages.
  • docs/design/agent-workflows/trash/ (archive, do not read line by line). Historical work-package notes, research spikes, POC code, and superseded RFCs, kept for provenance only. Open a note here only when a current page links to it for history or rationale.

@mmabrouk

Copy link
Copy Markdown
Member Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Jun 19, 2026

Copy link
Copy Markdown
✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@mmabrouk mmabrouk left a comment

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Codex subagent review for #4779

I found one correctness issue to resolve before treating this as mergeable design truth:

  • docs/design/agent-workflows/ground-truth.md:3 and docs/design/agent-workflows/ground-truth.md:10: this page says it is the current implementation map and lists concrete code surfaces, but this PR is independent off main and the referenced code is not present in this PR head. I checked 1087fa2: services/agent/src/server.ts, sdks/python/agenta/sdk/agents/interfaces.py, and services/oss/src/agent/app.py all 404 at this SHA. README.md:3 and status.md:5 reinforce the same current/checked-in framing. If #4779 lands before the code PRs, main gets a ground-truth page that points to absent files and claims implemented behavior that is only present in sibling PRs. Please either make the docs explicitly scoped to the open agent-workflows PR set (including prerequisite/merge-order expectations), or stack/retarget this docs PR onto the code slices it documents so the file references are true at merge time.

Secondary coordination risk:

  • docs/design/agent-workflows/pr-stack.md:3 and docs/design/agent-workflows/pr-stack.md:14: the reviewer guide points people here for the review order, but the page lists proposed future slices rather than the concrete active PR map (#4771, #4772, #4773, #4775, #4776, #4778, #4779). Since some sibling PR bodies still point at older #4777/#4774 review targets, please add a short active PR set map or rename the page/intro so reviewers can distinguish proposal from live review stack.

I did not see high-level implemented/not-implemented drift against the sibling PR descriptions, assuming these docs are meant to describe the stack state rather than main at this SHA. I did not run a docs build; this was a GitHub content/consistency review.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 15

🧹 Nitpick comments (10)
docs/design/agent-workflows/sdk-local-tools/review/progress.md (1)

26-37: ⚡ Quick win

Clarify timeline for remediation-phase test counts.

The "Remediation" section reports different test categories and counts than the initial review phase (e.g., 146 SDK vs. 118 earlier). While this aligns with the narrative in summary.md about subsequent fixes, progress.md doesn't explicitly state that these counts reflect validation after findings were addressed. Consider adding a subheading like "Remediation and post-fix validation" or a brief note explaining the temporal shift.

docs/design/agent-workflows/sdk-local-tools/review/scorecard.md (1)

1-31: ⚡ Quick win

Clarify the temporal relationship between scorecard and summary verdicts.

The scorecard states "PASS WITH CONDITIONS" and notes that medium findings "warrant fixes before rollout" (lines 28-30), suggesting conditions are pending. However, summary.md declares "PASS — CONDITIONS RESOLVED" and explicitly states findings "were addressed in the subsequent organization refactor" (line 25). The documents describe two different timepoints—original review vs. post-remediation—but don't make this relationship explicit.

To avoid reader confusion, consider adding a header note or footer that clearly states: "This scorecard reflects the original review verdict before remediation. See summary.md for the post-remediation verdict."

docs/design/agent-workflows/sdk-local-tools/review/summary.md (1)

64-78: ⚡ Quick win

Clarify the "positional-ordering coupling" detail.

The invariant verification section (line 64) lists all six invariants as verified and notes they "all hold." However, progress.md line 20 mentions "one positional-ordering coupling noted," which isn't explained here. This creates ambiguity about whether this coupling:

  • Is a risk (but not listed in risks.md)
  • Is an observation (but unexplained)
  • Was already addressed (but not detailed)

Either elaborate on what this coupling is in the invariants section, or remove the unexplained reference from progress.md.

docs/design/agent-workflows/adapters/claude-code.md (1)

55-55: 💤 Low value

Add language specification to code block.

The fenced code block at line 55 should include a language identifier for syntax highlighting. Based on the context (showing mcp_servers and callback patterns), specify the language:

-```
+```python
 # Example or context here

Source: Linters/SAST tools

docs/design/agent-workflows/adapters/pi.md (1)

110-110: 💤 Low value

Add language specification to code block.

The fenced code block at line 110 showing the span tree structure should include a language identifier for consistency:

-```
+```
 invoke_agent            (AGENT)
   turn N                (CHAIN)
     chat <model>        (LLM)    real token usage from the provider call
     execute_tool <name> (TOOL)   one per tool the turn ran
-```
+```

Alternatively, if this is meant to be read as plain text structure, use `text `` as the language identifier, or leave it as a blockquote/list instead of a code block.

Source: Linters/SAST tools

docs/design/agent-workflows/trash/harness-port-redesign/plan.md (1)

93-93: 💤 Low value

Fix hyphenation in "Cross cutting" section header.

Line 93 uses "Cross cutting" but should use "Cross-cutting" as a compound adjective modifying the following noun.

Fix
-## Cross cutting
+## Cross-cutting
docs/design/agent-workflows/trash/harness-port-redesign/proposal.md (2)

141-141: 💤 Low value

Fix hyphenation: "client side" → "client-side" streaming.

Compound adjective modifying "streaming" should be hyphenated.

Fix
-  client side streaming (WP-4) becomes a small add on.
+  client-side streaming (WP-4) becomes a small add on.

146-146: 💤 Low value

Fix hyphenation: "cold per invoke" sandboxes.

This should be hyphenated to clarify that "cold-per-invoke" modifies "sandboxes".

Fix
-  decide warm vs cold (the WP-8 status calls this out). First class sessions and ACP
-  `session/load` want a daemon that survives between turns, which reopens the per session
-  env and folder jail questions in
+  decide warm vs cold (the WP-8 status calls this out). First class sessions and ACP
+  `session/load` want a daemon that survives between turns, which reopens the per-session
+  env and folder jail questions in
docs/design/agent-workflows/trash/harness-port-redesign/research.md (1)

150-154: 💤 Low value

Add language identifier to code block.

The fenced code block at lines 150–154 is missing a language identifier. This block contains a Rivet capability model list and should be marked as plain text or bash.

Fix
-\`\`\`
+\`\`\`text
 commandExecution, errorEvents, fileAttachments, fileChanges, images, itemStarted,
 mcpTools, permissions, planMode, questions, reasoning, sessionLifecycle, sharedProcess,
 status, streamingDeltas, textMessages, toolCalls, toolResults
-\`\`\`
+\`\`\`
docs/design/agent-workflows/trash/wp-8-rivet-acp-runtime/research.md (1)

91-91: 💤 Low value

Minor style: consider alternative to overused "exactly".

LanguageTool flags "exactly" as an overused word. Consider rephrasing for variety (e.g., "in the same way" or "identically").

Suggested alternative
-sandbox `env_vars` on Daytona. This is exactly how `DaytonaRunner`
+sandbox `env_vars` on Daytona. This mirrors how `DaytonaRunner`

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 94c1db69-7596-4692-a169-915065ff7567

📥 Commits

Reviewing files that changed from the base of the PR and between a97e608 and 1087fa2.

⛔ Files ignored due to path filters (1)
  • docs/design/agent-workflows/trash/wp-1-pi-tracing/poc/pnpm-lock.yaml is excluded by !**/pnpm-lock.yaml
📒 Files selected for processing (95)
  • docs/design/agent-workflows/README.md
  • docs/design/agent-workflows/adapters/agenta.md
  • docs/design/agent-workflows/adapters/claude-code.md
  • docs/design/agent-workflows/adapters/pi.md
  • docs/design/agent-workflows/agent-template.md
  • docs/design/agent-workflows/architecture.md
  • docs/design/agent-workflows/ground-truth.md
  • docs/design/agent-workflows/implementation-review.md
  • docs/design/agent-workflows/meeting-alignment.md
  • docs/design/agent-workflows/open-issues.md
  • docs/design/agent-workflows/ports-and-adapters.md
  • docs/design/agent-workflows/pr-stack.md
  • docs/design/agent-workflows/protocol.md
  • docs/design/agent-workflows/sdk-local-tools/README.md
  • docs/design/agent-workflows/sdk-local-tools/codebase-conventions.md
  • docs/design/agent-workflows/sdk-local-tools/context.md
  • docs/design/agent-workflows/sdk-local-tools/conventions-review.md
  • docs/design/agent-workflows/sdk-local-tools/organization-proposal.md
  • docs/design/agent-workflows/sdk-local-tools/plan.md
  • docs/design/agent-workflows/sdk-local-tools/research.md
  • docs/design/agent-workflows/sdk-local-tools/review/evidence/app-mcp-reassign.md
  • docs/design/agent-workflows/sdk-local-tools/review/evidence/attach-orthogonal-mutation.md
  • docs/design/agent-workflows/sdk-local-tools/review/evidence/description-default-inconsistency.md
  • docs/design/agent-workflows/sdk-local-tools/review/evidence/gateway-no-logging.md
  • docs/design/agent-workflows/sdk-local-tools/review/evidence/gateway-orthogonal-untested.md
  • docs/design/agent-workflows/sdk-local-tools/review/evidence/handler-resolution-error.md
  • docs/design/agent-workflows/sdk-local-tools/review/findings.md
  • docs/design/agent-workflows/sdk-local-tools/review/metadata.json
  • docs/design/agent-workflows/sdk-local-tools/review/plan.md
  • docs/design/agent-workflows/sdk-local-tools/review/progress.md
  • docs/design/agent-workflows/sdk-local-tools/review/questions.md
  • docs/design/agent-workflows/sdk-local-tools/review/risks.md
  • docs/design/agent-workflows/sdk-local-tools/review/scope.md
  • docs/design/agent-workflows/sdk-local-tools/review/scorecard.md
  • docs/design/agent-workflows/sdk-local-tools/review/summary.md
  • docs/design/agent-workflows/sdk-local-tools/status.md
  • docs/design/agent-workflows/sessions.md
  • docs/design/agent-workflows/status.md
  • docs/design/agent-workflows/trash/README.md
  • docs/design/agent-workflows/trash/harness-port-redesign/README.md
  • docs/design/agent-workflows/trash/harness-port-redesign/implementation.md
  • docs/design/agent-workflows/trash/harness-port-redesign/plan.md
  • docs/design/agent-workflows/trash/harness-port-redesign/proposal.md
  • docs/design/agent-workflows/trash/harness-port-redesign/research.md
  • docs/design/agent-workflows/trash/harness-port-redesign/status.md
  • docs/design/agent-workflows/trash/old-rfcs/agent-protocol-rfc.md
  • docs/design/agent-workflows/trash/old-rfcs/streaming-and-sessions.md
  • docs/design/agent-workflows/trash/research/auth-secrets.md
  • docs/design/agent-workflows/trash/research/daytona-sandbox.md
  • docs/design/agent-workflows/trash/research/diskless-in-memory-config.md
  • docs/design/agent-workflows/trash/research/open-questions.md
  • docs/design/agent-workflows/trash/research/otel-instrumentation.md
  • docs/design/agent-workflows/trash/research/pi-interaction.md
  • docs/design/agent-workflows/trash/research/sandbox-sharing.md
  • docs/design/agent-workflows/trash/sdk-local-backend/status.md
  • docs/design/agent-workflows/trash/wp-1-pi-tracing/README.md
  • docs/design/agent-workflows/trash/wp-1-pi-tracing/integrating-the-tracing-extension.md
  • docs/design/agent-workflows/trash/wp-1-pi-tracing/poc/.env.example
  • docs/design/agent-workflows/trash/wp-1-pi-tracing/poc/README.md
  • docs/design/agent-workflows/trash/wp-1-pi-tracing/poc/agenta-otel.ts
  • docs/design/agent-workflows/trash/wp-1-pi-tracing/poc/package.json
  • docs/design/agent-workflows/trash/wp-1-pi-tracing/poc/run.ts
  • docs/design/agent-workflows/trash/wp-1-pi-tracing/tracing-in-the-agent-service.md
  • docs/design/agent-workflows/trash/wp-2-agent-service/README.md
  • docs/design/agent-workflows/trash/wp-2-agent-service/implementation-plan.md
  • docs/design/agent-workflows/trash/wp-2-agent-service/qa.md
  • docs/design/agent-workflows/trash/wp-3-daytona-sandbox/README.md
  • docs/design/agent-workflows/trash/wp-3-daytona-sandbox/poc/README.md
  • docs/design/agent-workflows/trash/wp-3-daytona-sandbox/poc/bench_coldstart.py
  • docs/design/agent-workflows/trash/wp-3-daytona-sandbox/poc/build_snapshot.py
  • docs/design/agent-workflows/trash/wp-3-daytona-sandbox/poc/cleanup.py
  • docs/design/agent-workflows/trash/wp-3-daytona-sandbox/poc/run_agent.py
  • docs/design/agent-workflows/trash/wp-4-multi-message-output/README.md
  • docs/design/agent-workflows/trash/wp-5-chat-vs-completion/README.md
  • docs/design/agent-workflows/trash/wp-6-workflow-type-and-template/README.md
  • docs/design/agent-workflows/trash/wp-7-tools/README.md
  • docs/design/agent-workflows/trash/wp-8-rivet-acp-runtime/README.md
  • docs/design/agent-workflows/trash/wp-8-rivet-acp-runtime/architecture.md
  • docs/design/agent-workflows/trash/wp-8-rivet-acp-runtime/context.md
  • docs/design/agent-workflows/trash/wp-8-rivet-acp-runtime/isolation-and-fork.md
  • docs/design/agent-workflows/trash/wp-8-rivet-acp-runtime/plan.md
  • docs/design/agent-workflows/trash/wp-8-rivet-acp-runtime/poc/build_rivet_snapshot.py
  • docs/design/agent-workflows/trash/wp-8-rivet-acp-runtime/poc/commit_agent_config.py
  • docs/design/agent-workflows/trash/wp-8-rivet-acp-runtime/poc/debug-events.ts
  • docs/design/agent-workflows/trash/wp-8-rivet-acp-runtime/poc/dump-full.ts
  • docs/design/agent-workflows/trash/wp-8-rivet-acp-runtime/poc/package.json
  • docs/design/agent-workflows/trash/wp-8-rivet-acp-runtime/poc/spike.ts
  • docs/design/agent-workflows/trash/wp-8-rivet-acp-runtime/research.md
  • docs/design/agent-workflows/trash/wp-8-rivet-acp-runtime/status.md
  • docs/design/agent-workflows/triggers.md
  • docs/design/vault-named-secrets/README.md
  • docs/design/vault-named-secrets/context.md
  • docs/design/vault-named-secrets/plan.md
  • docs/design/vault-named-secrets/research.md
  • docs/design/vault-named-secrets/status.md


# Status

Source of truth for this design effort. Keep it current.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Archive intent conflicts with “source of truth” wording.

Line 5 asks to “keep it current,” but Line 1 marks this file as superseded and historical. Please reword Line 5 to avoid signaling this as active design authority.

Suggested wording
-Source of truth for this design effort. Keep it current.
+Historical status snapshot for this design effort at the time. Kept for provenance only.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
Source of truth for this design effort. Keep it current.
Historical status snapshot for this design effort at the time. Kept for provenance only.

1. **Ambition: full A to E arc.** Plan all five phases, including first class sessions and
retiring the `Runtime.exec` port. See [`plan.md`](plan.md).
2. **Session model: stay cold and replay.** Keep WP-8's one daemon per invoke. Do not
stand up a warm daemon. This avoids the per session env channel and the folder jail.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Hyphenate “per-session” for clarity.

Use a compound modifier here to match standard technical writing style.

Suggested edit
-This avoids the per session env channel and the folder jail.
+This avoids the per-session env channel and the folder jail.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
stand up a warm daemon. This avoids the per session env channel and the folder jail.
stand up a warm daemon. This avoids the per-session env channel and the folder jail.
🧰 Tools
🪛 LanguageTool

[grammar] ~49-~49: Use a hyphen to join words.
Context: ...nd up a warm daemon. This avoids the per session env channel and the folder jail....

(QB_NEW_EN_HYPHEN)

Source: Linters/SAST tools

Comment on lines +66 to +526
```
┌─────────────────────────── client (useChat) ───────────────────────────┐
│ │
POST /messages (Accept: text/event-stream) POST /load-session │
│ │ │
▼ ▼ │
┌──────────────────┐ AgentEvent stream ┌───────────────────────────────────┐ │
│ agent run │ ─────────────────────▶│ streaming edge → UI Message Stream │──┘
│ (harness loop) │ └───────────────────────────────────┘
└──────────────────┘ persists per turn
│ │
└──────────────── trace store (ag.session.id) ◀───────┘ load-session reads here
```

## 3. Relationship to `/invoke`

`/messages` is a new endpoint. It does not change `/invoke`. The generic, stateless workflow
invoke keeps its exact request and response, and a client that does not run a chat agent never
touches `/messages`.

`/messages` reuses two things from the workflow contract so the backend does not fork: the
response envelope (`WorkflowServiceResponse`, with the answer in `data.outputs`) and revision
resolution (`references`). It diverges from `/invoke` in three ways, which is why it is its own
endpoint:

1. The conversation is a first-class member, `data.messages`, in the `UIMessage` shape, rather
than nested in `data.inputs.messages` as `{role, content}`.
2. The response can stream as a UI Message Stream (Section 6.2).
3. A turn belongs to a session (`session_id`, Section 4).

A server **SHOULD** map a `/messages` request onto the same internal agent invocation that
`/invoke` uses, after lifting `data.messages` and `data.inputs` into the handler's `messages`
and `inputs` arguments.

## 4. Session model

### 4.1 Identity

A `session_id` is an opaque string scoped to a project. The pair `(project_id, session_id)`
**MUST** be unique. A bare `session_id` is not a global identifier.

A client-supplied `session_id`:

- **MUST** be treated as an opaque token. A server **MUST NOT** interpolate it into a storage
path, a query, or a trace attribute without escaping.
- **SHOULD** be constrained by the server to a bounded length and a restricted character set.
A server **MAY** reject an id outside those bounds with `400 Bad Request`.

### 4.2 Resolution

On receiving a turn, the server resolves the session as follows:

1. If the request omits `session_id`, the server **MUST** mint a new unique id, associate the
turn with it, and return that id (Section 6).
2. If the request supplies a `session_id` that does not exist for the caller's project, the
server **MUST** create a session with that id and associate the turn with it.
3. If the request supplies a `session_id` that exists for the caller's project, the server
**MUST** associate the turn with that existing session.
4. If the request supplies a `session_id` that exists under a **different** project, the
server **MUST NOT** resume it. The server **MUST** treat it as case 2 within the caller's
own project, or reject the turn. A server **MUST NOT** disclose the existence of a session
the caller does not own.

Rule 4 is the ownership boundary. "Resume if it exists" means "resume if it exists and
belongs to the caller."

### 4.3 Continuation semantics for this version

In this version, associating a turn with a session records the turn under that session for
tracing and later retrieval. The conversation context the model sees is supplied by the
`messages` in the request (Section 5.2), not reconstructed from the server's record.

A future version MAY make the server's record authoritative, at which point a turn carries
only the new message and the server supplies the prior history. The request field is
unchanged by that evolution. See [streaming-and-sessions.md](streaming-and-sessions.md).

### 4.4 Concurrency

Two turns that create the same new `(project_id, session_id)` concurrently **MUST** resolve
to a single session. A server **SHOULD** enforce this with a unique constraint and treat the
losing creation as a resume (case 3).

## 5. Request format (`POST /messages`)

### 5.1 Envelope

```jsonc
{
"session_id": "sess_123", // OPTIONAL (Section 4)
"references": { ... }, // OPTIONAL: selects the workflow revision (as on /invoke)
"data": {
"messages": [ /* UIMessage[] */ ], // REQUIRED: the conversation (Section 5.2)
"inputs": { "<name>": <value> }, // OPTIONAL: named input variables (Section 5.3)
"parameters": { /* agent config */ } // OPTIONAL (Section 5.4)
}
}
```

`session_id` sits at the envelope top level, alongside the existing `trace_id` and `span_id`.
It **MUST NOT** be required in a request header.

`data.messages`, `data.inputs`, and `data.parameters` are siblings. They map onto the agent
handler's `messages`, `inputs`, and `parameters` arguments. On `/invoke` the conversation is
nested at `data.inputs.messages`; on `/messages` it is lifted out to `data.messages`, because
the conversation is the primary input of this endpoint.

### 5.2 `data.messages`

`data.messages` is the conversation as an array of `UIMessage` objects (Appendix B). It is
REQUIRED. The last element is the new user turn.

In this version the client **MUST** send the full conversation in `data.messages`. Each
element uses the parts-based `UIMessage` shape (Appendix B), not the `{role, content}` shape
of `/invoke`.

### 5.3 `data.inputs`

`data.inputs` carries the agent's named input variables for the turn: the workflow's declared
inputs and any per-turn context the caller supplies (for example a retrieved document or a
record id). Keys are input names; values are arbitrary JSON. This is the same `inputs` as the
workflow contract, with the conversation no longer nested inside it.

`data.inputs` is OPTIONAL and MAY be sent on every turn, since its values can change between
turns.

### 5.4 `data.parameters` and `references`

The agent configuration (instructions, model, tools, harness, sandbox, permission policy)
travels as on `/invoke`: inline in `data.parameters.agent`, or resolved by the platform from
`references` when the request targets a stored revision. This protocol does not change that
resolution.

### 5.5 Content negotiation

The response mode is selected by the `Accept` request header:

| `Accept` | Response |
| --- | --- |
| `application/json` (or absent) | Single JSON response (Section 6.1) |
| `text/event-stream` | UI Message Stream over SSE (Section 6.2) |

A server that cannot satisfy the `Accept` header **MUST** respond `406 Not Acceptable`.

## 6. Response formats

### 6.1 Single JSON response

For `Accept: application/json`, the server returns `200 OK` with a body extending
`WorkflowServiceResponse`:

```jsonc
{
"trace_id": "...",
"span_id": "...",
"session_id": "sess_123", // the resolved id (minted or echoed)
"status": { "code": 200 },
"data": { "outputs": { "role": "assistant", "content": "Berlin." } }
}
```

The response **MUST** include `session_id`, set to the resolved session (Section 4). The
assistant answer rides in `data.outputs` as today. Token usage is not in the body; it is
recorded on the trace.

### 6.2 UI Message Stream (SSE)

For `Accept: text/event-stream`, the server returns `200 OK` and streams the run in the
Vercel UI Message Stream format (AI SDK v5/v6).

#### 6.2.1 Response headers

The response **MUST** set:

```
content-type: text/event-stream
x-vercel-ai-ui-message-stream: v1
```

and **SHOULD** set:

```
cache-control: no-cache
connection: keep-alive
x-accel-buffering: no
```

`x-accel-buffering: no` disables proxy buffering so parts flush immediately.

#### 6.2.2 Framing

Each part is one SSE event: the literal bytes `data: `, followed by the part as compact JSON
(no insignificant whitespace), followed by `\n\n`.

```
data: {"type":"text-delta","id":"t1","delta":"Hello"}\n\n
```

The stream **MUST** terminate with the literal line `data: [DONE]\n\n`.

#### 6.2.3 Part registry

The parts a server emits, with their REQUIRED fields. Fields not listed are OPTIONAL and MAY
be omitted.

| `type` | Required fields | Meaning |
| --- | --- | --- |
| `start` | none | Begin a message. Carries `messageId` and `messageMetadata` (Section 6.2.4). |
| `start-step` | none | Begin a step of the agent loop. |
| `finish-step` | none | End the current step. |
| `finish` | none | End the message. Carries `finishReason`, `messageMetadata`. |
| `text-start` | `id` | Begin a text block. |
| `text-delta` | `id`, `delta` | Append `delta` to the text block `id`. |
| `text-end` | `id` | End the text block. |
| `reasoning-start` | `id` | Begin a reasoning block. |
| `reasoning-delta` | `id`, `delta` | Append to the reasoning block. |
| `reasoning-end` | `id` | End the reasoning block. |
| `tool-input-start` | `toolCallId`, `toolName` | A tool call begins. |
| `tool-input-delta` | `toolCallId`, `inputTextDelta` | Append a fragment of the tool arguments (note: `inputTextDelta`, not `delta`). |
| `tool-input-available` | `toolCallId`, `toolName`, `input` | The full tool arguments are known. |
| `tool-output-available` | `toolCallId`, `output` | The tool result. |
| `tool-output-error` | `toolCallId`, `errorText` | The tool failed. |
| `file` | `url`, `mediaType` | A file or image. `url` MAY be an `https:` or `data:` URL. |
| `data-<name>` | `data` | An application-defined part (generative UI). MAY carry `id` and `transient`. |
| `error` | `errorText` | A stream-level error (Section 8.2). |

A server **MUST** order parts so that for any `id` or `toolCallId`, a `*-start` precedes its
deltas, which precede its `*-end` or `*-available`. Text and reasoning deltas are
concatenated by `id`. Tool parts are keyed by `toolCallId`.

#### 6.2.4 Session id in the stream

The server **MUST** convey the resolved `session_id` as `messageMetadata.sessionId` on the
`start` part, which is the first part of the stream:

```
data: {"type":"start","messageId":"msg_1","messageMetadata":{"sessionId":"sess_123"}}
```

A server **MAY** additionally mirror `session_id` to a response header. The body remains the
normative source.

#### 6.2.5 Mapping from agent events

The streaming edge consumes the agent's internal `AgentEvent` stream
(`services/agent/src/protocol.ts:74`) and emits parts as follows:

| `AgentEvent` | Parts |
| --- | --- |
| run start (synthesized) | `start` (with `messageId`, `messageMetadata.sessionId`), then `start-step` |
| `message` | `text-start`, one or more `text-delta`, `text-end` |
| `thought` | `reasoning-start`, `reasoning-delta`, `reasoning-end` |
| `tool_call` | `tool-input-start`, then `tool-input-available` |
| `tool_result` with `isError=false` | `tool-output-available` |
| `tool_result` with `isError=true` | `tool-output-error` |
| `usage` | `messageMetadata` on the `finish` part |
| `error` | `error` (Section 8.2) |
| `done` | `finish-step`, then `finish` (`finishReason` = `stopReason`), then `[DONE]` |

A harness that reports `capabilities.streamingDeltas` produces token-level `text-delta`
parts. A harness that does not produces one `text-delta` carrying the whole text. The wire
shape is identical, so the client does not distinguish them.

The protocol streams deltas only. There is no full-message snapshot part. The client
assembles the final `UIMessage` from the parts. The server **SHOULD** record the assembled
turn on the trace (`ag.session.id`), which is the source `load-session` reads.

## 7. The `load-session` endpoint (`POST /load-session`)

Returns the history of a session so a client can rebuild a conversation it does not hold
locally.

### 7.1 Request

```jsonc
{ "session_id": "sess_123" }
```

`session_id` is REQUIRED. The server **MUST** apply the ownership rule of Section 4.2: if the
session does not exist for the caller's project, the server **MUST** respond `404 Not Found`
and **MUST NOT** reveal a session owned by another project.

### 7.2 Response (default, `Accept: application/json`)

The server returns `200 OK` with the conversation as `UIMessage` objects, the shape `useChat`
accepts as its initial `messages`:

```jsonc
{
"session_id": "sess_123",
"messages": [
{ "id": "m1", "role": "user", "parts": [ { "type": "text", "text": "capital of France?" } ] },
{ "id": "m2", "role": "assistant", "parts": [ { "type": "text", "text": "Paris." } ] }
]
}
```

### 7.3 Response (negotiated replay, `Accept: text/event-stream`)

A server **MAY** support a delta replay of the stored history under
`Accept: text/event-stream`, re-emitting the session as a UI Message Stream (Section 6.2).
This is OPTIONAL. Whether the folded form or the replay is the primary form is left open by
this draft; a conformant client **SHOULD** request `application/json` for rebuilding a static
view.

## 8. Error handling

### 8.1 Request and endpoint errors (JSON)

Before a stream begins, the server reports errors with an HTTP status and the existing
`status` envelope (`WorkflowServiceStatus`: `code`, `message`, `type`, `stacktrace`):

| Status | Condition |
| --- | --- |
| `400 Bad Request` | Malformed body, or a `session_id` that violates Section 4.1. |
| `401 Unauthorized` / `403 Forbidden` | Missing or invalid credentials. |
| `404 Not Found` | `load-session` on a session the caller does not own. |
| `406 Not Acceptable` | The `Accept` header cannot be satisfied. |
| `5xx` | Server failure before streaming starts. |

### 8.2 In-stream errors

A failure after the stream has started **MUST** be reported as an `error` part:

```
data: {"type":"error","errorText":"the agent run failed: ..."}
```

After emitting an `error` part, the server **SHOULD** terminate the stream. It **MAY** omit
the `finish` part. It **SHOULD** still emit `[DONE]` to close the SSE channel cleanly. The
client surfaces the error to the user.

## 9. Security considerations

- **Session ownership.** Section 4.2 rule 4 is a security requirement, not a convenience.
Because a client may supply a `session_id` for an unknown id (case 2), a server that keys
sessions on `session_id` alone would let a caller read or extend another tenant's
conversation. Servers **MUST** key on `(project_id, session_id)` and scope every resume,
every `load-session`, and every existence check to the caller's project.
- **Opaque ids.** A client-supplied `session_id` is untrusted input. See Section 4.1.
- **Secrets.** Provider keys and tool credentials travel and resolve as in the current
contract. This protocol adds no new secret-bearing field. `inputs` is caller-supplied
input and **MUST NOT** be used to smuggle credentials in place of the existing `secrets`
and signed-credential mechanisms.
- **Content negotiation and buffering.** A streaming response disables proxy buffering
(Section 6.2.1). Operators **MUST** ensure intermediaries do not re-buffer `text/event-
stream` responses, or streaming degrades to a single delayed flush.

## 10. Interaction sequences

### 10.1 New session, streaming turn

```
client server
│ POST /messages │
│ Accept: text/event-stream │
│ { data:{ messages:[...] } } │ (no session_id)
│───────────────────────────────────────▶│
│ │ mint sess_123
│ 200 text/event-stream │
│ data: {"type":"start", │
│ "messageMetadata": │
│ {"sessionId":"sess_123"}} │
│◀───────────────────────────────────────│
│ data: {"type":"start-step"} ... │
│ ... tool / text parts ... │
│ data: {"type":"finish"} │
│ data: [DONE] │
│◀───────────────────────────────────────│
│ (client stores sess_123 for next turn) │
```

### 10.2 Returning to a known session

```
client server
│ POST /load-session │
│ { "session_id": "sess_123" } │
│───────────────────────────────────────▶│ check ownership
│ 200 { messages: [ UIMessage, ... ] } │
│◀───────────────────────────────────────│
│ (render history; hold it) │
│ │
│ POST /messages │
│ Accept: text/event-stream │
│ { session_id:"sess_123", │
│ data:{ messages:[...full] } } │
│───────────────────────────────────────▶│ resolve existing sess_123
│ 200 text/event-stream → parts → [DONE] │
│◀───────────────────────────────────────│
```

## Appendix A: Full stream transcript

One turn: the agent calls a weather tool, reads the result, and answers. Every `data:` line
in order, each followed by a blank line.

```
data: {"type":"start","messageId":"msg_1","messageMetadata":{"sessionId":"sess_123"}}

data: {"type":"start-step"}

data: {"type":"tool-input-start","toolCallId":"call_1","toolName":"getWeather"}

data: {"type":"tool-input-available","toolCallId":"call_1","toolName":"getWeather","input":{"city":"Paris"}}

data: {"type":"tool-output-available","toolCallId":"call_1","output":{"weather":"sunny","temp":24}}

data: {"type":"finish-step"}

data: {"type":"start-step"}

data: {"type":"text-start","id":"t1"}

data: {"type":"text-delta","id":"t1","delta":"It is sunny "}

data: {"type":"text-delta","id":"t1","delta":"and 24°C in Paris."}

data: {"type":"text-end","id":"t1"}

data: {"type":"finish-step"}

data: {"type":"finish","messageMetadata":{"usage":{"input":820,"output":36,"cost":0.004}}}

data: [DONE]
```

## Appendix B: `UIMessage` schema

A message accumulated by the client and accepted by `load-session`:

```jsonc
{
"id": "m2",
"role": "user | assistant | system",
"parts": [
{ "type": "text", "text": "..." },
{ "type": "reasoning", "text": "..." },
{ "type": "tool-<name>", "toolCallId": "...", "state": "output-available", "input": {}, "output": {} },
{ "type": "file", "url": "...", "mediaType": "image/png" },
{ "type": "data-<name>", "data": { } },
{ "type": "step-start" }
],
"metadata": { }
}
```

A `UIMessage` carries no top-level `content` string in v5/v6. All content lives in `parts`.

## Appendix C: References

- RFC 2119, RFC 8174: requirement keywords.
- RFC 8259: JSON.
- WHATWG HTML, Server-Sent Events: `text/event-stream`.
- Vercel AI SDK UI Message Stream (v5/v6): https://ai-sdk.dev, and the chunk schema at
https://github.com/vercel/ai/blob/main/packages/ai/src/ui-message-stream/ui-message-chunks.ts
- Current contract: `sdks/python/agenta/sdk/models/workflows.py`,
`sdks/python/agenta/sdk/decorators/routing.py` (Accept negotiation at `:236`).
- Agent events and session id: `services/agent/src/protocol.ts:74`,
`sdks/python/agenta/sdk/agents/dtos.py`, `services/oss/src/agent/app.py`.
- Design rationale and trade-offs: [streaming-and-sessions.md](streaming-and-sessions.md).
```

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix markdownlint violations for fenced blocks and code span formatting.

This file currently triggers MD040 across multiple fences and MD038 around the code span near Line 256. Please add explicit fence languages (text, json, jsonc) and remove extra spaces inside the affected code span.

Example fixes
-```
+```text
            ┌─────────────────────────── client (useChat) ───────────────────────────┐
...
-```
+```

-```
+```text
content-type: text/event-stream
x-vercel-ai-ui-message-stream: v1
-```
+```

-Each part is one SSE event: the literal bytes `data: `, followed by ...
+Each part is one SSE event: the literal bytes `data:`, followed by ...
🧰 Tools
🪛 LanguageTool

[grammar] ~283-~283: Ensure spelling is correct
Context: ..., toolName| A tool call begins. | |tool-input-delta|toolCallId, inputTextDelta`...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)


[grammar] ~283-~283: Ensure spelling is correct
Context: ... | | tool-input-delta | toolCallId, inputTextDelta | Append a fragment of the tool argumen...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)


[grammar] ~283-~283: Ensure spelling is correct
Context: ...a fragment of the tool arguments (note: inputTextDelta, not delta). | | `tool-input-availabl...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)

🪛 markdownlint-cli2 (0.22.1)

[warning] 66-66: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


[warning] 239-239: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


[warning] 246-246: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


[warning] 256-256: Spaces inside code span elements

(MD038, no-space-in-code)


[warning] 259-259: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


[warning] 300-300: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


[warning] 389-389: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


[warning] 417-417: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


[warning] 439-439: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


[warning] 462-462: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


[warning] 526-526: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

Source: Linters/SAST tools

Comment on lines +2 to +3
AGENTA_HOST=http://144.76.237.122:8280/
AGENTA_API_KEY=your-agenta-project-api-key

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Avoid publishing an API-key example over plaintext HTTP to a fixed public IP.

Line 2 currently encourages sending AGENTA_API_KEY over http://..., which can leak credentials in transit and creates a brittle environment default.

🔒 Suggested fix
-AGENTA_HOST=http://144.76.237.122:8280/
+AGENTA_HOST=https://cloud.agenta.ai
 AGENTA_API_KEY=your-agenta-project-api-key
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
AGENTA_HOST=http://144.76.237.122:8280/
AGENTA_API_KEY=your-agenta-project-api-key
AGENTA_HOST=https://cloud.agenta.ai
AGENTA_API_KEY=your-agenta-project-api-key
🧰 Tools
🪛 dotenv-linter (4.0.0)

[warning] 3-3: [UnorderedKey] The AGENTA_API_KEY key should go before the AGENTA_HOST key

(UnorderedKey)

Comment on lines +184 to +185
const toolSpans = new Map<string, Span>();

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Close tool spans even when toolCallId is absent.

At Line 362 a tool span is always started, but at Line 371/375 it is only tracked and closed when toolCallId exists. Missing IDs leave spans open and skew the exported trace.

🛠️ Suggested fix
 const toolSpans = new Map<string, Span>();
+const anonymousToolSpans: Span[] = [];
@@
   pi.on("tool_execution_start", async (event: any) => {
@@
-    if (event?.toolCallId) toolSpans.set(event.toolCallId, span);
+    if (event?.toolCallId) toolSpans.set(event.toolCallId, span);
+    else anonymousToolSpans.push(span);
   });
@@
   pi.on("tool_execution_end", async (event: any) => {
-    const span = event?.toolCallId ? toolSpans.get(event.toolCallId) : undefined;
+    const span = event?.toolCallId
+      ? toolSpans.get(event.toolCallId)
+      : anonymousToolSpans.shift();
     if (!span) return;
     setOutput(span, toolResultText(event?.result));
     if (event?.isError) span.setStatus({ code: SpanStatusCode.ERROR });
     span.end();
-    toolSpans.delete(event.toolCallId);
+    if (event?.toolCallId) toolSpans.delete(event.toolCallId);
   });

Also applies to: 362-380

Comment on lines +33 to +42
t = time.monotonic()
sb = daytona.create(
CreateSandboxFromSnapshotParams(snapshot=snap, auto_stop_interval=0),
timeout=120,
)
dt = time.monotonic() - t
times.append(dt)
print(f"{snap:20} run {i + 1}/{N}: {dt:.2f}s state={sb.state}", flush=True)
daytona.delete(sb)
results[snap] = times

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Guarantee sandbox teardown per run to prevent leaks.

daytona.delete(sb) only runs on the happy path. Any exception after create() can leave sandboxes running and stop the benchmark early.

Suggested fix
 for snap in SNAPSHOTS:
     times: list[float] = []
     for i in range(N):
-        t = time.monotonic()
-        sb = daytona.create(
-            CreateSandboxFromSnapshotParams(snapshot=snap, auto_stop_interval=0),
-            timeout=120,
-        )
-        dt = time.monotonic() - t
-        times.append(dt)
-        print(f"{snap:20} run {i + 1}/{N}: {dt:.2f}s  state={sb.state}", flush=True)
-        daytona.delete(sb)
+        sb = None
+        try:
+            t = time.monotonic()
+            sb = daytona.create(
+                CreateSandboxFromSnapshotParams(snapshot=snap, auto_stop_interval=0),
+                timeout=120,
+            )
+            dt = time.monotonic() - t
+            times.append(dt)
+            print(f"{snap:20} run {i + 1}/{N}: {dt:.2f}s  state={sb.state}", flush=True)
+        finally:
+            if sb is not None:
+                daytona.delete(sb)
     results[snap] = times
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
t = time.monotonic()
sb = daytona.create(
CreateSandboxFromSnapshotParams(snapshot=snap, auto_stop_interval=0),
timeout=120,
)
dt = time.monotonic() - t
times.append(dt)
print(f"{snap:20} run {i + 1}/{N}: {dt:.2f}s state={sb.state}", flush=True)
daytona.delete(sb)
results[snap] = times
for snap in SNAPSHOTS:
times: list[float] = []
for i in range(N):
sb = None
try:
t = time.monotonic()
sb = daytona.create(
CreateSandboxFromSnapshotParams(snapshot=snap, auto_stop_interval=0),
timeout=120,
)
dt = time.monotonic() - t
times.append(dt)
print(f"{snap:20} run {i + 1}/{N}: {dt:.2f}s state={sb.state}", flush=True)
finally:
if sb is not None:
daytona.delete(sb)
results[snap] = times

Comment on lines +183 to +184
def arg(name: str, default: str) -> str:
return sys.argv[sys.argv.index(name) + 1] if name in sys.argv else default

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

arg() can crash on missing flag values.

If a user passes --model (or --auth) without a value, sys.argv[index + 1] raises IndexError and exits with a traceback instead of a clear CLI error.

Suggested fix
 def arg(name: str, default: str) -> str:
-    return sys.argv[sys.argv.index(name) + 1] if name in sys.argv else default
+    if name not in sys.argv:
+        return default
+    idx = sys.argv.index(name) + 1
+    if idx >= len(sys.argv) or sys.argv[idx].startswith("--"):
+        raise ValueError(f"Missing value for {name}")
+    return sys.argv[idx]

Comment on lines +259 to +266
pi_cmd = (
f"cd {run_dir} && TMPDIR={run_dir}/tmp "
f"pi -p {json.dumps(PROMPT)} "
f"--mode json --approve --provider {provider} --model {model} "
f"-t read,bash,edit,write,ls "
f"--session-dir {run_dir}/.pi-sessions --name {session_id} "
f"< /dev/null"
)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Shell command assembly is vulnerable to argument injection via --model.

model is user-controlled and interpolated unquoted into pi_cmd. A crafted value can append shell operators and execute unintended commands in the sandbox.

Suggested fix
+import shlex
...
             pi_cmd = (
-                f"cd {run_dir} && TMPDIR={run_dir}/tmp "
+                f"cd {shlex.quote(run_dir)} && TMPDIR={shlex.quote(run_dir)}/tmp "
                 f"pi -p {json.dumps(PROMPT)} "
-                f"--mode json --approve --provider {provider} --model {model} "
+                f"--mode json --approve --provider {shlex.quote(provider)} --model {shlex.quote(model)} "
                 f"-t read,bash,edit,write,ls "
-                f"--session-dir {run_dir}/.pi-sessions --name {session_id} "
+                f"--session-dir {shlex.quote(run_dir)}/.pi-sessions --name {shlex.quote(session_id)} "
                 f"< /dev/null"
             )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
pi_cmd = (
f"cd {run_dir} && TMPDIR={run_dir}/tmp "
f"pi -p {json.dumps(PROMPT)} "
f"--mode json --approve --provider {provider} --model {model} "
f"-t read,bash,edit,write,ls "
f"--session-dir {run_dir}/.pi-sessions --name {session_id} "
f"< /dev/null"
)
import shlex
pi_cmd = (
f"cd {shlex.quote(run_dir)} && TMPDIR={shlex.quote(run_dir)}/tmp "
f"pi -p {json.dumps(PROMPT)} "
f"--mode json --approve --provider {shlex.quote(provider)} --model {shlex.quote(model)} "
f"-t read,bash,edit,write,ls "
f"--session-dir {shlex.quote(run_dir)}/.pi-sessions --name {shlex.quote(session_id)} "
f"< /dev/null"
)
🧰 Tools
🪛 ast-grep (0.43.0)

[info] 260-260: use jsonify instead of json.dumps for JSON output
Context: json.dumps(PROMPT)
Note: Security best practice.

(use-jsonify)

import os
import httpx

BASE = os.getenv("AGENTA_HOST", "http://144.76.237.122:8280").rstrip("/")

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Unencrypted HTTP default with API key exposure.

Line 15 defaults BASE to http://144.76.237.122:8280 (unencrypted HTTP). When AGENTA_API_KEY is sent to this endpoint at line 28/62, credentials are transmitted in plaintext. Although the AGENTA_HOST environment variable allows override to HTTPS, the hardcoded default should be HTTPS or explicitly documented as development-only.

Recommendation: Either (a) change the default to https://..., (b) add a prominent warning in the script's docstring, or (c) validate that AGENTA_HOST is HTTPS before sending credentials.

Comment on lines +69 to +71
out = resp.json()
new = out.get("workflow_revision") or out
print("new revision id:", new.get("id"), "version:", new.get("version"))

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Robust error handling for HTTP response.

Line 69 calls resp.json() without checking the status code first. If the POST fails (4xx/5xx), the response may not contain the expected JSON structure, causing .get() on line 70 to mask the real error.

For a POC script, consider adding:

resp.raise_for_status()  # after line 67, before line 69

This makes the script fail loudly if the API call fails.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

documentation Improvements or additions to documentation size:XXL This PR changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant