See also — Parent: docs/README.md, overview.md · Where rules land by layer: manifest.md, agent.md, tools.md, communication.md, memory.md, orchestration.md, observability.md, security.md · Autonomy levels that select required rules: compliance.md · Runtime gates that apply these rules live: runtime.md (completion gate chain, L0/R34, R35 repair fixpoint), critique.md, evaluation.md · Deterministic purity (R33): orchestration.md · Refinement guard (R36): refinement.md · Normative spec: spec/versions/1.0/validation-rules.md
AWP uses four complementary tiers of validation, each operating at a different point in the workflow lifecycle and answering a different question. They are not redundant — they catch different classes of problems and you almost always want all four enabled in production.
| Tier | When | Cost | Catches | Configured under |
|---|---|---|---|---|
| 1. Deterministic schema/rule validation (R1–R32) | Load time, before any agent runs | Free, instant | Structural bugs: missing fields, cycles, ID collisions, reserved-namespace abuse, missing output contracts, broken sandbox/codemode config, A4 max_depth termination | Built into runtime; this document |
| 2. LLM semantic validation | Per-agent, after output is produced | 1 LLM call per agent (skippable if confidence is high) | Output that parses correctly but is semantically wrong (hallucinated facts, ignored instructions) | Implicit in delegation loop; gated by confidence threshold |
| 3. Critique loop | Per-worker inside delegation loop, after a defect is suspected | LLM calls for diagnose + repair | Defects in worker output, with targeted repair rather than full retry; learns cross-worker patterns into a defect memory | delegation_loop.critique — see critique.md |
| 4. Evaluation layer | Workflow level, after the run (or step-scored during) | Multiple LLM calls + deterministic tests | Quality scoring against rubrics, deterministic assertions, budget utility, policy checks. Can trigger retry/repair across the whole workflow | observability.evaluation — see evaluation.md |
Rule of thumb: R1–R32 reject invalid workflows; LLM/critique/evaluation reject bad outputs. The four tiers compose — a workflow that passes all R-rules can still fail evaluation, and a worker that passes critique can still produce a low evaluation score.
A separate, security-flavored validation runs whenever a worker creates a new tool at runtime: the B1–B6 sandbox auto-repair pipeline (runtime-tool-generation.md). It is not a workflow validator but a tool validator, and it sits between Tier 1 and Tier 2 conceptually.
This document specifies the deterministic Tier 1 rules. AWP runtimes must enforce these when loading a workflow. Each rule has a unique identifier, category, and description. Rules marked RECOMMENDED apply primarily to the Python reference implementation; other runtimes may adapt them.
| Rule | Category | Level | Summary |
|---|---|---|---|
| R1 | Manifest | MUST | Valid SemVer in awp field |
| R2 | Manifest | MUST | Workflow name matches regex |
| R3 | Agent Identity | RECOMMENDED | Python class named Agent |
| R4 | Agent Identity | RECOMMENDED | self.name matches identity.id (Python) |
| R5 | Orchestration | MUST | Unique agent IDs |
| R6 | Orchestration | MUST | DAG has no cycles |
| R7 | Orchestration | MUST | All dependencies resolve |
| R8 | File Structure | MUST | Agent config files exist |
| R9 | Agent Identity | MUST | Output contract present |
| R10 | Capabilities | MUST | No reserved namespace collisions |
| R11 | Capabilities | MUST | Unique tool FQNs |
| R12 | Agent Identity | MUST | Agent ID matches regex |
| R13 | Memory & State | MUST | No writes to reserved keys |
| R14 | Security | MUST | Sensitive fields redacted |
| R15 | Communication | MUST | Channel schema validated |
| R16 | Memory & State | MUST | Sharing strategy enforced |
| R17 | Orchestration | MUST | Timeouts enforced |
| R18 | Observability | MUST | Audit hash chain integrity |
| R19 | Capabilities | MUST | Code Mode requires tools enabled |
| R20 | Capabilities | MUST | Code Mode requires sandbox |
| R21 | Capabilities | MUST | Code Mode language is valid |
| R22 | Capabilities | MUST | Explicit SDK surface has tools |
| R23 | Capabilities | MUST | SDK excludes reference valid tools |
| R24 | Capabilities | MUST | Isolate sandbox requires network config |
| R25 | Capabilities | MUST | Dynamic tool namespace compliance |
| R26 | Capabilities | MUST | Dynamic tool creation requires Code Mode and workflow-level flag |
| R27 | Evaluation | MUST | Evaluation metric IDs unique and well-formed |
| R28 | Evaluation | MUST | Evaluation weights normalized and metric kinds valid |
| R29 | Evaluation | MUST | Thresholds in range and consistent (accept >= retry >= fail) |
| R30 | Evaluation | MUST | step_scores.hooks use valid hooks; retry_policy.actions valid |
| R31 | Orchestration (A4) | MUST | delegation_loop.budget.max_depth present and >= 0 for A4 workflows |
| R32 | Orchestration (A4) | MUST | max_depth must not exceed the hard ceiling that guarantees termination |
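
As an illustration of how rows such as R1, R2, and R12 can become load-time checks, here is a minimal Python sketch. The name and ID patterns are the ones given in the rule descriptions below. The function name `check_identifiers` and the `(rule, message)` result shape are hypothetical, and the SemVer pattern is a simplified assumption that matches pre-release and build identifiers loosely rather than with the full SemVer 2.0.0 grammar.

```python
import re

# Simplified SemVer 2.0.0 pattern (assumption: pre-release and build
# identifiers are matched loosely, not with the full grammar).
SEMVER_RE = re.compile(
    r"^(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)"
    r"(?:-[0-9A-Za-z.-]+)?(?:\+[0-9A-Za-z.-]+)?$"
)
WORKFLOW_NAME_RE = re.compile(r"^[a-z][a-z0-9_-]{0,62}[a-z0-9]$")  # R2
AGENT_ID_RE = re.compile(r"^[a-z][a-z0-9_]{0,46}[a-z0-9]$")        # R12


def check_identifiers(awp, workflow_name, agent_ids):
    """Return a list of (rule, message) violations for R1, R2, and R12."""
    errors = []
    # R1: the value must be a string, so a bare YAML number such as 1 fails.
    if not isinstance(awp, str) or not SEMVER_RE.fullmatch(awp):
        errors.append(("R1", f"awp is not a valid SemVer string: {awp!r}"))
    if not isinstance(workflow_name, str) or not WORKFLOW_NAME_RE.fullmatch(workflow_name):
        errors.append(("R2", f"invalid workflow name: {workflow_name!r}"))
    for agent_id in agent_ids:
        if not isinstance(agent_id, str) or not AGENT_ID_RE.fullmatch(agent_id):
            errors.append(("R12", f"invalid agent id: {agent_id!r}"))
    return errors
```

`fullmatch` is used instead of `match` so that a trailing newline cannot slip past the `$` anchor.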
- Category: Manifest
- Level: MUST
- Description: The `awp` field in `workflow.awp.yaml` must be a valid Semantic Versioning 2.0.0 string.

Valid:

```yaml
awp: "1.0.0"
```

Invalid:

```yaml
awp: "1.0"     # Missing patch version
awp: "v1.0.0"  # Prefix not allowed
awp: 1         # Not a string
```

- Category: Manifest
- Level: MUST
- Description: The `workflow.name` field must match `^[a-z][a-z0-9_-]{0,62}[a-z0-9]$` (lowercase kebab-case, underscores also accepted, 2-64 characters).

Valid:

```yaml
workflow:
  name: research-and-write
  name: my_workflow_v2
```

Invalid:

```yaml
workflow:
  name: Research-And-Write  # Uppercase
  name: a                   # Too short
  name: my-workflow-        # Trailing hyphen
```

- Category: Agent Identity
- Level: RECOMMENDED (Python convention)
- Description: Python implementations should use a class named `Agent` (specified in `runtime.class_name`). Non-Python runtimes may use any class name.

Valid (Python):

```yaml
runtime:
  class_name: Agent
```

Valid (non-Python or custom):

```yaml
runtime:
  class_name: CustomResearchAgent
```

- Category: Agent Identity
- Level: RECOMMENDED (Python convention)
- Description: In Python implementations, the `self.name` property should return the same string as `identity.id`. Non-Python runtimes may use any mechanism to associate the agent instance with its declared identity.

Valid (Python):

```python
class Agent(AWPAgent):
    @property
    def name(self):
        return "research_analyst"  # Matches identity.id
```

- Category: Orchestration
- Level: MUST
- Description: Every `id` in `orchestration.graph` must be unique within the workflow.

Valid:

```yaml
orchestration:
  graph:
    - id: researcher
    - id: writer
```

Invalid:

```yaml
orchestration:
  graph:
    - id: researcher
    - id: researcher  # Duplicate
```

- Category: Orchestration
- Level: MUST
- Description: The `orchestration.graph` must form a Directed Acyclic Graph. Cycles must cause a validation error.

Valid:

```yaml
graph:
  - id: a
    depends_on: []
  - id: b
    depends_on: [a]
  - id: c
    depends_on: [b]
```

Invalid:

```yaml
graph:
  - id: a
    depends_on: [c]
  - id: b
    depends_on: [a]
  - id: c
    depends_on: [b]  # Cycle: a -> c -> b -> a
```

- Category: Orchestration
- Level: MUST
- Description: Every entry in `depends_on` must reference a valid `id` in the graph.

Invalid:

```yaml
graph:
  - id: writer
    depends_on: [nonexistent_agent]
```

- Category: File Structure
- Level: MUST
- Description: Every agent referenced in the graph must have a corresponding `agent.awp.yaml` file at `agents/{agent_id}/agent.awp.yaml`.
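The three graph checks described above (unique IDs, no cycles, resolvable dependencies; R5 to R7 in the summary table) can be sketched in a single pass over the parsed graph. This is a minimal illustration rather than the reference implementation: `check_graph` and its `(rule, message)` result shape are hypothetical, and cycle detection uses Kahn's topological-sort algorithm as one reasonable choice.

```python
from collections import deque


def check_graph(graph):
    """Check a list of {"id": ..., "depends_on": [...]} nodes.

    Returns (rule, message) violations for unique IDs (R5), acyclicity (R6),
    and resolvable dependencies (R7).
    """
    errors = []
    seen = set()
    for node in graph:
        if node["id"] in seen:
            errors.append(("R5", f"duplicate agent id: {node['id']}"))
        seen.add(node["id"])

    deps = {}
    for node in graph:
        deps.setdefault(node["id"], set()).update(node.get("depends_on", []))
    for node_id, targets in deps.items():
        for target in sorted(targets - seen):
            errors.append(("R7", f"{node_id} depends on unknown agent: {target}"))

    # Kahn's algorithm over the resolvable edges: a node that is never
    # released lies on a cycle or depends on one.
    indegree = {n: len(targets & seen) for n, targets in deps.items()}
    dependents = {n: [] for n in deps}
    for n, targets in deps.items():
        for target in targets & seen:
            dependents[target].append(n)
    queue = deque(n for n, degree in indegree.items() if degree == 0)
    released = 0
    while queue:
        current = queue.popleft()
        released += 1
        for child in dependents[current]:
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)
    if released != len(deps):
        errors.append(("R6", "dependency graph contains a cycle"))
    return errors
```

Unknown dependencies are reported under R7 and left out of the cycle check, so a single typo does not also surface as a spurious cycle.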
- Category: Agent Identity
- Level: MUST
- Description: Every agent must have an `output.contract` defined. When `output.format` is `"json"`, the contract must be a valid JSON Schema.

Valid:

```yaml
output:
  format: json
  contract:
    type: object
    required: [decision, summary, confidence]
    properties:
      decision:
        type: string
      summary:
        type: string
      confidence:
        type: number
```

Invalid:

```yaml
output:
  format: json
  # No contract defined
```

- Category: Capabilities
- Level: MUST
- Description: Custom tools must not use reserved namespaces: `web`, `http`, `file`, `shell`, `agent`, `memory`, `arithmetic`, `numpy`, `matplot`, `pandas`, `doc`, `sklearn`.

Valid:

```python
@app.tool("myns.custom_action")
```

Invalid:

```python
@app.tool("web.custom_search")  # "web" is reserved
```

- Category: Capabilities
- Level: MUST
- Description: All tool FQNs within a workflow must be unique. Duplicate tool names must cause a validation error.
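The two tool-naming rules above (reserved namespaces and unique FQNs) could be checked together along these lines. The sketch assumes that a tool FQN has the form `namespace.tool_name` and that the namespace is everything before the first dot, which matches the examples but is not stated normatively here. `check_custom_tools` is a hypothetical helper.

```python
from collections import Counter

RESERVED_NAMESPACES = frozenset({
    "web", "http", "file", "shell", "agent", "memory",
    "arithmetic", "numpy", "matplot", "pandas", "doc", "sklearn",
})


def check_custom_tools(fqns):
    """Return (rule, message) violations for a list of custom tool FQNs."""
    errors = []
    for fqn in fqns:
        # Assumption: the namespace is the text before the first dot.
        namespace = fqn.split(".", 1)[0]
        if namespace in RESERVED_NAMESPACES:
            errors.append(("R10", f"{fqn!r} uses reserved namespace {namespace!r}"))
    for fqn, count in Counter(fqns).items():
        if count > 1:
            errors.append(("R11", f"duplicate tool FQN: {fqn!r}"))
    return errors
```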
- Category: Agent Identity
- Level: MUST
- Description: The `identity.id` field must match `^[a-z][a-z0-9_]{0,46}[a-z0-9]$` (snake_case, 2-48 characters).

Valid:

```yaml
identity:
  id: research_analyst
  id: a1
```

Invalid:

```yaml
identity:
  id: Research_Analyst  # Uppercase
  id: r                 # Too short
  id: research-analyst  # Hyphens not allowed
```

- Category: Memory & State
- Level: MUST
- Description: Agents must not write to reserved state keys: `_meta`, `_errors`, `_trace`, `_workflow`.

Valid:

```python
state["research_analyst"] = {"findings": [...]}
```

Invalid:

```python
state["_meta"] = {"custom": "data"}
```

- Category: Security
- Level: MUST
- Description: Fields listed in `state.sharing.sensitive_fields` and environment variables with `sensitive: true` must not appear in any log output, metric label, span attribute, or audit entry.

See Security Reference for details.
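One way a runtime might apply this rule before emitting a record is to mask sensitive keys recursively. The sketch below is an assumption-laden illustration: the `redact` helper and the `[REDACTED]` marker are hypothetical, and masking by key name does not catch sensitive values that have leaked into free-text fields, which would need separate handling.

```python
REDACTED = "[REDACTED]"  # hypothetical marker, not defined by AWP


def redact(record, sensitive_fields):
    """Return a copy of `record` with sensitive keys masked at any depth."""
    if isinstance(record, dict):
        return {
            key: REDACTED if key in sensitive_fields
            else redact(value, sensitive_fields)
            for key, value in record.items()
        }
    if isinstance(record, list):
        return [redact(item, sensitive_fields) for item in record]
    return record
```

Because the function builds new containers instead of mutating its input, the agent's own state keeps the original values while logs, spans, and audit entries receive the masked copy.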
- Category: Communication
- Level: MUST
- Description: When a channel defines a `schema`, the runtime must validate message `content` against the schema before delivery. Messages that fail validation must be rejected.

See Communication Reference for details.

- Category: Memory & State
- Level: MUST
- Description: The runtime must enforce the declared `state.sharing.strategy`. Under `selective`, agents must not access fields not listed in `share_output` or sharing rules. Under `isolated`, agents must not access other agents' state.

See Memory & State Reference for details.

- Category: Orchestration
- Level: MUST
- Description: The runtime must enforce `timeout_per_agent` and `timeout_total` limits. When a timeout expires, the runtime must terminate the agent's execution and apply the configured `on_failure` strategy.

See Orchestration Reference for details.
- Category: Observability
- Level: MUST
- Description: When `audit.integrity` is `"hash_chain"`, each audit entry must include a `prev_hash` field containing the hash of the previous entry. The first entry must have `prev_hash` set to a zero hash.

Valid:

```json
[
  {"id": 1, "event": "workflow.start", "prev_hash": "000...000", "hash": "a1b2c3..."},
  {"id": 2, "event": "agent.start", "prev_hash": "a1b2c3...", "hash": "d4e5f6..."}
]
```

Invalid:

```json
[
  {"id": 1, "event": "workflow.start", "hash": "a1b2c3..."},
  {"id": 2, "event": "agent.start", "hash": "d4e5f6..."}
]
```

(Missing `prev_hash` field.)

See Observability Reference for details.
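To make the chaining concrete, here is a minimal sketch of building and verifying such a log. The text above does not say which hash function is used or which bytes are hashed, so everything specific here is an assumption: SHA-256, a 64-character zero hash, and a canonical JSON encoding of every field except `hash`. The helper names are hypothetical.

```python
import hashlib
import json

ZERO_HASH = "0" * 64  # assumption: zero hash sized for SHA-256 hex digests


def entry_hash(entry):
    """Hash every field except "hash" itself (assumed canonical form)."""
    body = {k: v for k, v in entry.items() if k != "hash"}
    payload = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_entry(log, event):
    """Append an audit entry linked to the previous entry's hash."""
    prev_hash = log[-1]["hash"] if log else ZERO_HASH
    entry = {"id": len(log) + 1, "event": event, "prev_hash": prev_hash}
    entry["hash"] = entry_hash(entry)
    log.append(entry)


def verify_chain(log):
    """Return True when every entry links to and matches its predecessor."""
    expected_prev = ZERO_HASH
    for entry in log:
        if entry.get("prev_hash") != expected_prev:
            return False
        if entry.get("hash") != entry_hash(entry):
            return False
        expected_prev = entry["hash"]
    return True
```

Since `prev_hash` is part of the hashed body, editing any earlier entry changes its hash and breaks the link held by the next entry.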
- Category: Capabilities
- Level: MUST
- Description: If `capabilities.codemode.enabled` is `true`, then `capabilities.tools.enabled` MUST be `true`. Code Mode generates an SDK from the allowed tools.

- Category: Capabilities
- Level: MUST
- Description: If `capabilities.codemode.enabled` is `true`, then `capabilities.sandbox.type` MUST be set and MUST NOT be `"none"`.

- Category: Capabilities
- Level: MUST
- Description: `capabilities.codemode.language` MUST be one of: `"typescript"`, `"python"`, `"javascript"`.

- Category: Capabilities
- Level: MUST
- Description: If `capabilities.codemode.sdk_surface.mode` is `"explicit"`, then `capabilities.codemode.sdk_surface.include` MUST contain at least one tool FQN.

- Category: Capabilities
- Level: MUST
- Description: Every entry in `capabilities.codemode.sdk_surface.exclude` MUST match at least one tool in `capabilities.tools.allowed`.

- Category: Capabilities
- Level: MUST
- Description: If `capabilities.sandbox.type` is `"isolate"`, the `capabilities.sandbox.network` section MUST be present with at least `network.enabled` defined.
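The six Code Mode and sandbox rules above (R19 to R24 in the summary table) are all simple cross-field checks on one agent's `capabilities` block. A minimal sketch follows, with two assumptions that the text does not settle: the language and SDK-surface checks are applied only when Code Mode is enabled, and "match" in the exclude rule is treated as exact string equality (a runtime may support patterns instead). `check_codemode` is a hypothetical helper.

```python
VALID_LANGUAGES = {"typescript", "python", "javascript"}


def check_codemode(capabilities):
    """Return the IDs of violated rules for one agent's capabilities mapping."""
    errors = []
    codemode = capabilities.get("codemode") or {}
    tools = capabilities.get("tools") or {}
    sandbox = capabilities.get("sandbox") or {}

    if codemode.get("enabled") is True:
        if tools.get("enabled") is not True:
            errors.append("R19")
        if sandbox.get("type") in (None, "none"):
            errors.append("R20")
        if codemode.get("language") not in VALID_LANGUAGES:
            errors.append("R21")
        surface = codemode.get("sdk_surface") or {}
        if surface.get("mode") == "explicit" and not surface.get("include"):
            errors.append("R22")
        # Assumption: an exclude entry "matches" a tool by exact equality.
        allowed = set(tools.get("allowed") or [])
        if any(name not in allowed for name in surface.get("exclude") or []):
            errors.append("R23")

    if sandbox.get("type") == "isolate":
        network = sandbox.get("network")
        if not isinstance(network, dict) or "enabled" not in network:
            errors.append("R24")
    return errors
```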
- Category: Capabilities
- Level: MUST
- Description: When an agent has `capabilities.codemode.tool_creation: true`, its `tool_creation_namespace` MUST NOT match any reserved namespace, and it MUST be listed in the workflow-level `dynamic_tools.allowed_namespaces`. The default namespace is `"dynamic"`. This prevents runtime-generated tools from shadowing built-ins or using undeclared namespaces.

- Category: Capabilities
- Level: MUST
- Description: When `capabilities.codemode.tool_creation: true`, both `capabilities.codemode.enabled` and workflow-level `dynamic_tools.enabled` MUST be `true`. Tool creation operates through the Code Mode SDK and requires an active `DynamicToolFactory` on the runtime.
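The two dynamic-tool rules above (R25 and R26 in the summary table) can be sketched as one check over an agent's `codemode` block and the workflow-level `dynamic_tools` block. The sketch assumes `tool_creation_namespace` lives next to `tool_creation` inside `capabilities.codemode`; the exact location is not stated here. `check_tool_creation` is a hypothetical helper.

```python
RESERVED_NAMESPACES = frozenset({
    "web", "http", "file", "shell", "agent", "memory",
    "arithmetic", "numpy", "matplot", "pandas", "doc", "sklearn",
})


def check_tool_creation(codemode, dynamic_tools):
    """Return the IDs of violated rules for one agent."""
    errors = []
    if codemode.get("tool_creation") is not True:
        return errors  # neither rule applies without tool creation

    # Assumption: the namespace key sits inside the codemode block.
    namespace = codemode.get("tool_creation_namespace", "dynamic")
    allowed = dynamic_tools.get("allowed_namespaces") or []
    if namespace in RESERVED_NAMESPACES or namespace not in allowed:
        errors.append("R25")

    if codemode.get("enabled") is not True or dynamic_tools.get("enabled") is not True:
        errors.append("R26")
    return errors
```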
- Category: Evaluation
- Level: MUST
- Description: When `observability.evaluation.enabled: true`, every metric's `kind` MUST be one of the valid metric kinds (`llm_rubric`, `deterministic`, `schema`, `budget`, `policy`). Invalid kinds are rejected at load time so misconfigured evaluators cannot silently skip scoring.

- Category: Evaluation
- Level: MUST
- Description: Evaluation thresholds (`accept`, `retry`, `fail`) MUST each lie in `[0.0, 1.0]` and MUST satisfy `accept >= retry >= fail`. Inverted or out-of-range thresholds are rejected so retry/accept decisions remain well-defined.

- Category: Evaluation
- Level: MUST
- Description: Every evaluation metric's `weight` MUST be `>= 0`, and at least one metric MUST have a strictly positive weight. This prevents degenerate aggregations where the weighted score is always zero.
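The metric-kind, threshold, and weight checks above could be combined roughly as follows. This is a sketch under assumptions: the `metrics` list key and the idea that all three checks are skipped when evaluation is disabled are not stated in this document, and `check_evaluation` is a hypothetical helper that returns plain messages instead of rule IDs.

```python
VALID_KINDS = {"llm_rubric", "deterministic", "schema", "budget", "policy"}


def _is_number(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def check_evaluation(evaluation):
    """Return a list of problems for an `observability.evaluation` mapping."""
    problems = []
    if evaluation.get("enabled") is not True:
        return problems

    metrics = evaluation.get("metrics") or []  # assumption: key name
    for metric in metrics:
        if metric.get("kind") not in VALID_KINDS:
            problems.append(f"invalid metric kind: {metric.get('kind')!r}")
        if metric.get("weight", 0) < 0:
            problems.append(f"negative weight: {metric.get('weight')!r}")
    if not any(metric.get("weight", 0) > 0 for metric in metrics):
        problems.append("no metric has a strictly positive weight")

    thresholds = evaluation.get("thresholds") or {}
    values = [thresholds.get(key) for key in ("accept", "retry", "fail")]
    if not all(_is_number(v) and 0.0 <= v <= 1.0 for v in values):
        problems.append("thresholds must be numbers in [0.0, 1.0]")
    elif not (values[0] >= values[1] >= values[2]):
        problems.append("thresholds must satisfy accept >= retry >= fail")
    return problems
```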
- Category: Evaluation
- Level: MUST
- Description: `step_scores.hooks` MUST only contain valid hook names, and `retry_policy.actions.below_retry` / `below_fail` MUST reference valid actions. This ensures the evaluation loop can always resolve a concrete action when a threshold is crossed.
- Category: Orchestration (A4)
- Level: MUST
- Description: When `orchestration.delegation_loop.budget` is present, `max_depth` MUST be set to an integer `>= 0`. Use `max_depth: 0` to disable recursive submanager spawning, or `>= 1` to allow A4 delegation. Missing or negative values are rejected so recursion always has a finite ceiling.

Note on the R31 label. The validator rule `R31` above is the A4 `max_depth` gate. A different label also called "R31" appears inside the manager prompt in `packages/awp-runtime/src/awp/data/prompts.py` as "R31 Plan-Tool-Closure". That is an unrelated prompt-level plan validator applied to each PLAN subtask's `tool_manifest`. The two live in different layers (static YAML validation vs. runtime plan grading) and share the label only by historical accident; do not conflate them.
- Category: Orchestration (A4)
- Level: MUST
- Description: `delegation_loop.budget.max_depth` MUST NOT exceed the hard ceiling of `10`. Values `> 5` emit a warning ("most A4 workflows complete with depth <= 3"). Deep recursion makes budget reasoning and debugging intractable; a flatter decomposition is always preferred.
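Taken together, the two `max_depth` rules (R31 and R32) reduce to a range check with a soft warning band. A minimal sketch, where `check_max_depth` and its `(errors, warnings)` return shape are hypothetical and the constants come from the descriptions above:

```python
MAX_DEPTH_CEILING = 10     # hard ceiling from R32
MAX_DEPTH_WARN_ABOVE = 5   # values above this emit a warning


def check_max_depth(budget):
    """Validate a `delegation_loop.budget` mapping.

    Returns (errors, warnings): lists of rule IDs and warning messages.
    """
    max_depth = budget.get("max_depth")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(max_depth, bool) or not isinstance(max_depth, int) or max_depth < 0:
        return ["R31"], []
    if max_depth > MAX_DEPTH_CEILING:
        return ["R32"], []
    if max_depth > MAX_DEPTH_WARN_ABOVE:
        return [], ["most A4 workflows complete with depth <= 3"]
    return [], []
```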
In addition to the static R-rules above, the delegation-loop runner enforces a deterministic completion-gate chain on every manager COMPLETE decision. These gates are runtime checks (they require filesystem and result-state access) and therefore do not carry an R-label, but they are normative for conformant delegation-loop implementations. A rejection by any gate bumps the _rejected_completions counter (see Completion-Retry Circuit Breaker below) and forces another manager iteration.
| Gate | MUST/SHOULD | Description |
|---|---|---|
| `critique` | MUST | Mean critique score across the latest iteration's worker critiques MUST be `>= critique.min_score_to_complete` when the critique engine is enabled and at least one critique score exists. |
| `deliverable_presence` | MUST | Every manager-declared deliverable path (from subtask `required_outputs`, or regex-scraped from `success_criteria` / `description` anchored on `_output_dir` / `_workspace_dir`) MUST exist on disk AND be a non-empty file. On rejection, emit a `deliverable_presence` gate event with `missing: [...]`, `empty: [...]`, `source: "required_outputs" \| …` |
| `placeholder` | MUST | Declared output files and the `final_result` dict MUST NOT contain placeholder strings (`TODO`, `XX%`, `???`, `your_value`, `FIXME`, etc.). Code-comment exemptions apply (e.g. `# TODO:` on a line that looks like a code comment). |
| `file` | MUST | Declared output files MUST NOT be broken placeholders (1×1 PNGs, zero-length PDFs, truncated CSVs without headers). |
| `deliverable` | SHOULD | Legacy keyword-based check: if the task text implies a file deliverable (via hint keywords like `image`, `report`, `chart`, `dataset`) and the run's `_output_dir` contains no file `>= 512` bytes, reject. Complementary to `deliverable_presence`. |
| `structural_integrity` | SHOULD | Markdown deliverables MUST pass deterministic structural checks: anchor adjacency, reference-format consistency, paragraph-duplication ratio, figure inline-ref presence. |
| `eval` | MUST when enabled | When `observability.evaluation.enabled: true`, the aggregated evaluation score MUST be `>= thresholds.retry`. Below `retry` → `retry_with_repair`; below `fail` → hard failure. |
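
As one concrete instance of a gate in this chain, the `deliverable_presence` check can be sketched with the standard library. The sketch covers only paths that come from `required_outputs`; scraping paths from `success_criteria` or `description` is left out. The function name and the `gate` key in the returned event are hypothetical, while `missing`, `empty`, and `source` follow the table above.

```python
from pathlib import Path


def deliverable_presence_gate(required_outputs):
    """Check that every declared path exists and is a non-empty file.

    Returns None when the gate passes, otherwise a gate-event payload.
    """
    missing, empty = [], []
    for raw in required_outputs:
        path = Path(raw)
        if not path.is_file():
            missing.append(str(raw))
        elif path.stat().st_size == 0:
            empty.append(str(raw))
    if not missing and not empty:
        return None
    return {
        "gate": "deliverable_presence",  # hypothetical key name
        "missing": missing,
        "empty": empty,
        "source": "required_outputs",
    }
```

A runner would treat a non-`None` result as a rejection, bump `_rejected_completions`, and hand the payload to the manager on the next iteration.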
When the manager issues more than `MAX_PRE_PROGRESS_PLANS` consecutive PLAN decisions without any worker progress (`pre_progress_plans > MAX_PRE_PROGRESS_PLANS`, the `plan_loop` gate), the runtime MUST pick one of the following deterministic transitions:

- `forced_delegate`: if the active task plan has at least one subtask with `status == "pending"`, the runner sets `state["_plan_locked"]` to a textual nudge and continues the loop. The manager MUST issue DELEGATE on the next turn.
- `forced_terminate`: if the plan has no pending subtasks (all completed, failed, or skipped), the runtime MUST terminate the run with status `partial` and reason `plan_loop_stall`.

The gate event MUST record `transition: "forced_delegate" | "forced_terminate"`, `pre_progress_plans`, and `pending_subtasks` so the decision can be audited after the fact.
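The transition logic above is small enough to sketch directly. Several details are assumptions made for illustration: the value of `MAX_PRE_PROGRESS_PLANS`, the wording of the nudge, the `status` and `reason` keys written into `state`, the `gate` key in the event, and recording `pending_subtasks` as a list of subtask IDs. `plan_loop_gate` is a hypothetical helper.

```python
MAX_PRE_PROGRESS_PLANS = 3  # assumption: the actual value is not given here


def plan_loop_gate(pre_progress_plans, subtasks, state):
    """Return a gate event when the plan_loop gate fires, else None."""
    if pre_progress_plans <= MAX_PRE_PROGRESS_PLANS:
        return None

    pending = [s["id"] for s in subtasks if s.get("status") == "pending"]
    if pending:
        transition = "forced_delegate"
        # Textual nudge; the manager must DELEGATE on its next turn.
        state["_plan_locked"] = "Plan is locked: DELEGATE a pending subtask."
    else:
        transition = "forced_terminate"
        state["status"] = "partial"          # hypothetical state keys
        state["reason"] = "plan_loop_stall"

    return {
        "gate": "plan_loop",
        "transition": transition,
        "pre_progress_plans": pre_progress_plans,
        "pending_subtasks": pending,
    }
```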
The runtime MUST track a `_rejected_completions` counter that only increases between resets. Every rejection by the completion-gate chain (`critique`, `deliverable_presence`, `placeholder`, `file`, `deliverable`, `structural_integrity`, `eval`) MUST bump the counter by 1. Any successful DELEGATE decision MUST reset the counter to 0.

When the counter reaches `budget.max_rejected_completions` (default `2`):

- If the last gate-rejection payload describes a concrete defect, the runtime MUST synthesize a repair subtask (priority `critical`, `required_outputs` derived from the defect) and force the next iteration into DELEGATE mode. The counter is reset after repair synthesis.
- If no repair can be derived, the runtime MUST terminate the run with status `partial` and reason `max_rejected_completions`.

Both paths MUST emit a `completion_circuit_breaker` gate event with the counter value and, when applicable, the synthesized `repair_subtask_id`.
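The counter behaviour described above can be sketched as a small state machine. The class name, the `action` field, the shape of the `defect` payload, and the way the repair subtask ID is derived are all hypothetical; only the counter semantics, the default of 2, and the `partial` / `max_rejected_completions` outcome come from the text.

```python
class CompletionCircuitBreaker:
    """Tracks rejected COMPLETE decisions and trips at a fixed limit."""

    def __init__(self, max_rejected_completions=2):
        self.max_rejected = max_rejected_completions
        self.rejected = 0

    def on_delegate_success(self):
        """Any successful DELEGATE decision resets the counter."""
        self.rejected = 0

    def on_gate_rejection(self, defect=None):
        """Record one rejection; return a gate event if the breaker trips."""
        self.rejected += 1
        if self.rejected < self.max_rejected:
            return None
        event = {"gate": "completion_circuit_breaker", "counter": self.rejected}
        if defect:
            # Concrete defect: synthesize a repair subtask and force DELEGATE.
            event["action"] = "repair"
            event["repair_subtask_id"] = f"repair_{defect['gate']}"  # hypothetical ID scheme
            self.rejected = 0
        else:
            event["action"] = "terminate"
            event["status"] = "partial"
            event["reason"] = "max_rejected_completions"
        return event
```

The event captures the counter value before the reset, so the audit trail shows the value that tripped the breaker.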