diff --git a/README.md b/README.md
index d62405bf..b91bcf15 100644
--- a/README.md
+++ b/README.md
@@ -9,143 +9,122 @@

-The framework where agents build their own OS.
+A verified operating layer for autonomous agents

 CI License Rust
-Tests
+Pre-release

 ---
-Agents are starting to build their own tools — generating MCP servers at runtime, synthesizing helpers mid-session, evolving workflow topologies. At the same time, the infrastructure for making this safe is developing: policy-based authorization on tool invocations, behavioral contracts, state-machine-constrained agents, formal verification becoming practical. Harness engineering, durable execution, declarative agent specs — all moving forward.
-
-Temper is our attempt to explore what happens when you connect these ideas into one framework: agent-created tools as formally verified state machines, authorization policies derived from behavioral specs, and an evolution loop where unmet intents feed back into spec proposals with human approval.
-
-## How It Works
-
-An agent describes what it needs as declarative specs — state machines, data models, integrations, authorization policies. Temper formally verifies the specs, deploys them as a live API, mediates every action through [Cedar](https://www.cedarpolicy.com/) policies, and records everything. The human approves or rejects. The agent operates through what it built.
-
-```python
-# Agent gives itself long-term memory — Temper verifies and deploys it
-await temper.submit_specs("my-app", {
-    "Knowledge.ioa.toml": knowledge_spec,  # state machine: agent-generated
-    "model.csdl.xml": data_model           # data model: agent-generated
-})
-# → Verification cascade: Z3 SMT, model checking, simulation, property tests
-# → If all levels pass, the knowledge system is live
-
-# Agent stores and retrieves its own knowledge through the verified API
-await temper.create("my-app", "KnowledgeEntries", {
-    "content": "service X fails under concurrent writes — use advisory locks",
-    "source": "incident-247"
-})
-await temper.action("my-app", "KnowledgeEntries", "k-42", "Link", {
-    "related": ["k-12", "k-31"]  # connect insights across sessions
-})
-# → Cedar checks every operation — the agent can read its own entries
-# but can't access another agent's knowledge without approval
-```
-
-The kernel is a thin Rust runtime that interprets whatever the mediation pipeline feeds it. Everything agents touch — specs, policies, WASM modules, reaction rules — hot-reloads. The kernel itself rarely changes.
+## What is Temper?
-## Why Temper?
+Agents build tools at runtime. They generate helpers and create workflows. Those tools have no verification, no governance, no memory of why they exist.
-Agent scaffolding — prompt templates, tool wrappers, output parsers — shrinks as models get smarter. What compounds is the world-facing infrastructure: verified state machines, authorization policies, persistent trajectories. The kernel is a [universal interpreter](https://en.wikipedia.org/wiki/Von_Neumann_universal_constructor) — everything else is a spec. Tools, harnesses, applications are all declarative descriptions with a signature that agents write, verify, deploy, and rewrite. The kernel rarely changes. The descriptions evolve.
+Temper is an operating layer where agents describe capabilities as specifications. The kernel verifies each spec before deployment. Every action flows through authorization policies. A human approves changes to scope.
-| What's developing in the field | Temper's angle |
-|---|---|
-| Agents synthesize tools at runtime | Those tools are verified state machines that persist as specs |
-| Policy-based authorization on tool invocations | Policies derived from a behavioral spec, not authored separately |
-| Runtime guardrails check outputs | State machine checked exhaustively *before* deployment (model checking + SMT) |
-| Observability shows what happened | Unmet intents feed back into spec proposals with human approval |
-| Declarative agent specs for portability | Declarative specs for correctness — verified, then deployed |
-| Durable execution engines | Spec defines what the system does; durability follows from event sourcing |
-| Harnesses as static scaffolding | Harnesses as specs — agents program and rewrite them through the same verify-deploy loop |
+As agents and users operate through a skill, the evolution engine identifies gaps. It adds missing capabilities, fixes broken ones, and removes redundant ones. The human approves each change.
-It's an exploration of what happens when you put formal verification, Cedar authorization, and evolution feedback into the same loop.
+| | Step | What happens |
+| ------ | --------------- | ----------------------------------------------------------------------- |
+| **01** | Describe | An agent describes what it needs: states, transitions, guards, data shape. |
+| **02** | Verify | The kernel proves the spec is sound before anything runs. |
+| **03** | Operate | The agent works through the verified API. Every action is governed and recorded. |
+| **04** | Evolve | Usage patterns surface gaps. The spec adapts. The human approves. |
-## Key Features
+
-### Spec-First Development +## Constructor, description, evolution -- Agents write declarative specifications, not application code -- IOA TOML specs define states, transitions, guards, and invariants; CSDL models define the data shape; Cedar policies define authorization -- The kernel derives all runtime behavior from these artifacts — if you lose the generated code, you regenerate it from the spec -- Specs hot-reload: transition tables, policies, WASM modules, and reaction rules update live +In 1949, Von Neumann designed a self-replicating machine with three parts: a *description* (the blueprint), a *constructor* that reads any description and builds the machine it encodes, and a copy mechanism that duplicates and mutates descriptions over time. The machine grows in complexity by changing its descriptions. The constructor stays the same. -### Formal Verification +Temper follows this pattern. -- Every spec passes a four-level cascade before it can deploy -- **L0 — Z3 SMT**: guards satisfiable, invariants inductive, no unreachable states -- **L1 — [Stateright](https://github.com/stateright/stateright) model checking**: exhaustive state space exploration, safety + liveness properties -- **L2 — Deterministic simulation**: fault injection (message delays, drops, crashes), reproducible via seeded PRNG -- **L3 — Property-based testing**: random action sequences with shrinking to minimal counterexamples -- The model checker verifies the same Rust code that runs in production — not a separate formal model +The **kernel** is the constructor. It reads specifications, verifies them, and deploys actors. It does not know what you are building. It interprets whatever you feed it. -### Cedar Authorization +**Skills** are the descriptions. Each skill bundles a verified state machine, a data model, authorization policies, and integration declarations into a single deployable capability. Agents create new skills by describing what they need. 
Other agents and users operate through them. -- Every action flows through [Cedar](https://www.cedarpolicy.com/) authorization with a default-deny posture -- Denied actions surface to the human as pending decisions — approve narrowly (this agent, this action, this resource), broadly (this agent, any action on this resource type), or deny -- Temper generates the Cedar policy from the approval; the human never writes policies from scratch -- Over time, the policy set converges on what the agent actually needs +The **evolution engine** observes how agents use skills, clusters failure patterns, and proposes spec changes. Agents can also create new skills when they encounter problems the current set does not cover. -### Self-Describing API +
-- Generated OData v4 endpoints with `$metadata` discovery -- Agents discover entity types, available actions, and valid transitions without documentation -- Full query support: `$filter`, `$select`, `$expand`, bound actions +## Temper is right for you if -### WASM Integrations +- ✅ You give agents tools and worry about what those tools do unsupervised +- ✅ You want agents to create their own capabilities, with proof those capabilities are safe +- ✅ You need an audit trail connecting every agent action to an authorization decision +- ✅ You want agent-built tools to improve through use, without manual rewrites +- ✅ You're building multi-agent systems that need shared, governed state +- ✅ You want a default-deny security posture where permissions grow as trust builds -- External systems accessed through sandboxed WASM modules with per-call resource budgets -- Cedar mediates which integrations an agent can use — no raw API keys or direct network access -- Integrations declared in the spec, verified as part of the state machine +
-### Trajectories and Evolution +## Features -- Every action — success or failure — is recorded as a trajectory entry with agent identity, before/after state, and authorization decision -- The evolution engine analyzes trajectory patterns: repeated failures, friction points, unmet intents -- Patterns become spec proposals — an O-P-A-D-I record chain (Observation, Problem, Analysis, Decision, Impact) — surfaced for human approval -- The agent can propose changes to its own harness; the human holds the gate + + + + + + + + + + + +
+

Verified Skills

+Agents describe capabilities as specs. A four-level verification cascade proves them sound before deployment. +
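The "proves them sound" step can be pictured with a toy exhaustive check: a minimal Python sketch (not the kernel's Rust cascade) that walks every reachable state of a Knowledge-style machine — the states and the `IndexRequiresContent` invariant are borrowed from the IOA example elsewhere in this README — and asserts the invariant at each step.

```python
# Toy exhaustive check of a Knowledge-style state machine.
# Illustration only — Temper's real cascade (Z3, Stateright, DST,
# proptest) runs over the compiled spec, not Python dicts like these.

STATES = ["Draft", "Indexed", "Linked", "Archived"]
ACTIONS = {
    # name: (allowed source states, target state, guard on entity data)
    "Index":   (["Draft"], "Indexed", lambda d: d["content"] != ""),
    "Link":    (["Indexed"], "Linked", lambda d: True),
    "Archive": (["Indexed", "Linked"], "Archived", lambda d: True),
}

def invariant(state, data):
    # IndexRequiresContent: past Draft, content must be non-empty
    return state == "Draft" or data["content"] != ""

def explore(initial_data):
    """Walk every reachable (state, data) pair, checking the invariant."""
    frontier = [("Draft", dict(initial_data))]
    seen = set()
    while frontier:
        state, data = frontier.pop()
        key = (state, data["content"])
        if key in seen:
            continue
        seen.add(key)
        assert invariant(state, data), f"invariant violated in {state}"
        for name, (srcs, dst, guard) in ACTIONS.items():
            if state in srcs and guard(data):
                frontier.append((dst, dict(data)))
    return seen

# With empty content, the Index guard blocks every path out of Draft,
# so the invariant holds vacuously; with content, all states are reachable.
assert explore({"content": ""}) == {("Draft", "")}
assert {s for s, _ in explore({"content": "note"})} == set(STATES)
```

The real verifier additionally proves guard satisfiability symbolically and injects faults; this sketch only shows why exhaustive exploration can establish an invariant before anything runs.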
+

Governed by Default

+Every action flows through authorization with a default-deny posture. Denied actions surface to the human for approval. The policy set grows as the agent works. +
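A minimal sketch of the default-deny posture and the approval scopes this README describes (narrow: this agent, this action, this resource; broad: this agent, any action on a resource type). All names here are hypothetical — the real engine is Cedar, with policies generated from human approvals.

```python
# Minimal default-deny sketch. Temper's real engine is Cedar with
# generated policies; this toy mirrors only the approval scopes.
policies = []  # each policy: (agent, action, resource_type, resource_id)

def authorize(agent, action, resource_type, resource_id):
    """Allow only if some approved policy matches; otherwise deny."""
    for p_agent, p_action, p_type, p_id in policies:
        if (p_agent == agent
                and p_action in (action, "*")
                and p_type == resource_type
                and p_id in (resource_id, "*")):
            return True
    return False  # default-deny posture

# Everything starts denied — the denial surfaces to a human.
assert not authorize("agent-a", "Link", "KnowledgeEntries", "k-42")

# Narrow approval: this agent, this action, this resource.
policies.append(("agent-a", "Link", "KnowledgeEntries", "k-42"))
assert authorize("agent-a", "Link", "KnowledgeEntries", "k-42")
assert not authorize("agent-a", "Link", "KnowledgeEntries", "k-43")

# Broad approval: this agent, any action on this resource type.
policies.append(("agent-a", "*", "KnowledgeEntries", "*"))
assert authorize("agent-a", "Archive", "KnowledgeEntries", "k-99")
assert not authorize("agent-b", "Archive", "KnowledgeEntries", "k-99")
```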
+

Self-Evolving

+The evolution engine observes usage patterns and failures. It proposes spec changes. Agents create new skills. The human approves every change. +
+

Self-Describing API

+Every skill generates a queryable API with schema discovery. Agents find available actions and valid transitions without documentation. +
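Conceptually, that discovery reduces to a lookup over the skill's transition table: given an entity's current state, the valid next actions are computable without documentation. A toy sketch with a hypothetical table:

```python
# Toy discovery sketch: "what can I do from here?" is a lookup over
# the skill's transition table. Table contents here are hypothetical.
TRANSITIONS = {
    # (state, action) -> next state
    ("Draft", "Index"): "Indexed",
    ("Indexed", "Link"): "Linked",
    ("Indexed", "Archive"): "Archived",
    ("Linked", "Archive"): "Archived",
}

def available_actions(state):
    """Valid actions from a state, derived from the table itself."""
    return sorted(a for (s, a) in TRANSITIONS if s == state)

assert available_actions("Indexed") == ["Archive", "Link"]
assert available_actions("Archived") == []  # terminal state
```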
+

Full Audit Trail

+Every action records agent identity, before/after state, and the authorization decision. Agents can query their own history. +
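A toy sketch of what such a trajectory log can look like, with hypothetical field names: each action appends an immutable record, and current state falls out of replaying the allowed entries.

```python
# Toy audit-trail sketch: every action appends an immutable record
# with agent identity, before/after state, and the authorization
# decision — and current state is a replay over the log.
from dataclasses import dataclass

@dataclass(frozen=True)
class TrajectoryEntry:
    agent: str
    action: str
    before: str
    after: str
    decision: str  # "allow" or "deny"

log = []

def record(agent, action, before, after, decision):
    log.append(TrajectoryEntry(agent, action, before, after, decision))

record("agent-a", "Index", "Draft", "Indexed", "allow")
record("agent-a", "Link", "Indexed", "Indexed", "deny")   # denied: no change
record("agent-a", "Link", "Indexed", "Linked", "allow")

# Replay: the last *allowed* entry determines current state.
current = next(e.after for e in reversed(log) if e.decision == "allow")
assert current == "Linked"

# An agent can query its own history, including denials.
denials = [e for e in log if e.agent == "agent-a" and e.decision == "deny"]
assert len(denials) == 1
```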
+

Hot-Reload

+Skills deploy and update without downtime. Specs, policies, and integrations reload live. +
-## Quick Start +
-### For agents (via MCP)
+## Without Temper vs. With Temper
-Temper exposes a single MCP tool — `execute` — which runs Python in a sandboxed REPL against a running Temper server. The agent discovers specs, creates entities, invokes actions, and manages governance all through the `temper.*` API.
+| Without Temper | With Temper |
+|---|---|
+| ❌ Agents build tools with no proof those tools are correct | ✅ Every tool is a verified state machine, proven sound before it runs |
+| ❌ Agent permissions live in prompts and hope | ✅ Authorization policies enforce boundaries. Denied actions surface for human approval |
+| ❌ Agent state lives in markdown files and JSON blobs | ✅ State lives in event-sourced entities with queryable APIs |
+| ❌ No audit trail for agent actions | ✅ Every action records who did what, when, under which policy |
+| ❌ Adding agent capabilities means writing code | ✅ Agents describe new capabilities. The kernel verifies and deploys them |
+| ❌ Tools break silently | ✅ The verification cascade catches violations before deployment |
-```python
-# 1. Discover what's deployed
-specs = await temper.specs("my-app")
-
-# 2. Submit specs — the agent describes what it needs
-await temper.submit_specs("my-app", {
-    "Task.ioa.toml": task_spec,   # state machine
-    "model.csdl.xml": data_model  # entity schema
-})
-# → Verification cascade runs automatically
-# → If it passes, the API is live
+## What Temper is not
-# 3. Create entities and take actions
-task = await temper.create("my-app", "Tasks", {
-    "title": "Review PR #42",
-    "assignee": "agent-codereview"
-})
-await temper.action("my-app", "Tasks", task["id"], "Start", {})
+| | |
+| ---------------------------- | -------------------------------------------------------------------------------------------------------------------- |
+| **Not an agent framework.** | Temper does not build agents. It provides the layer agents run on. Bring your own: Claude Code, OpenClaw, Pydantic AI, LangChain, or anything with MCP support. |
+| **Not a workflow builder.** | No drag-and-drop pipelines. Temper models capabilities as verified state machines. |
+| **Not a backend-as-a-service.** | Temper generates APIs from specifications. You do not write controllers or service layers. |
+| **Not a prompt manager.** | Agent prompts, models, and runtimes are yours. Temper governs what agents *do*. |
-# 4. Query through OData
-open_tasks = await temper.list(
-    "my-app", "Tasks", "status eq 'InProgress'"
-)
-```
-### For humans
+## Quick start
-Start a Temper server, then give your agent the MCP client. Add to your project's `.mcp.json`:
+Add Temper as an MCP server. Your agent gets a sandboxed Python REPL with the `temper.*` API.
 ```json
 {
@@ -158,284 +137,132 @@ Start a Temper server, then give your agent the MCP client. Add to your project'
 }
 ```
-This gives the agent the `execute` tool — a sandboxed Python REPL with the `temper.*` API. The MCP server is a thin client that connects to a running Temper server.
-
 ```bash
-temper serve --port 3000                     # start the server
-temper mcp --port 3000                       # connect to local server
-temper mcp --url https://temper.railway.app  # connect to remote server
-temper mcp --port 3000 --agent-id bot        # set agent identity
+temper serve --port 3000                     # start the kernel
 ```
-Once agents are running, you manage them through the **Observe dashboard** (Next.js UI) or the CLI:
-
-- **Decisions page**: When an agent hits a deny, you see the request and can approve at three scopes or deny. Temper generates the Cedar policy for you.
-- **Agents page**: Action counts, denial rates, timelines.
-- **Evolution page**: Spec proposals from the evolution engine. Approve to deploy, deny to discard.
+Through the REPL, agents discover deployed skills, create entities, invoke actions, submit new specifications, and manage governance. You manage pending decisions through the Observe dashboard or CLI.
 ```bash
-temper serve --port 3000         # start the server
-temper decide --list             # see pending decisions
-temper decide --approve medium   # approve with medium scope
+temper decide --list             # see pending authorization decisions
+temper decide --approve          # approve with a scope
 ```
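To make the REPL call flow concrete without a running kernel, here is a sketch against an in-memory stub that only mirrors the `temper.*` method names used in this README (`submit_specs`, `create`, `action`, `list`). The real client is asynchronous and performs the verification and authorization that this stub deliberately omits.

```python
# Hypothetical in-memory stand-in for the temper.* client, so the
# quick-start flow is runnable on its own. Not the real MCP client.
import asyncio
import itertools

class StubTemper:
    def __init__(self):
        self._entities = {}            # (app, entity_set, id) -> fields
        self._ids = itertools.count(1)

    async def submit_specs(self, app, specs):
        # The real kernel runs the verification cascade here.
        return {"app": app, "deployed": sorted(specs)}

    async def create(self, app, entity_set, fields):
        eid = f"e-{next(self._ids)}"
        entity = {"id": eid, **fields}
        self._entities[(app, entity_set, eid)] = entity
        return entity

    async def action(self, app, entity_set, eid, name, args):
        entity = self._entities[(app, entity_set, eid)]
        entity["status"] = name        # toy: record last action as status
        return entity

    async def list(self, app, entity_set, filter_expr=None):
        return [e for (a, s, _), e in self._entities.items()
                if a == app and s == entity_set]

async def main():
    temper = StubTemper()
    # 1. Describe: submit specs (verified by the real kernel)
    await temper.submit_specs("my-app", {"Task.ioa.toml": "...",
                                         "model.csdl.xml": "..."})
    # 2. Operate: create an entity and invoke an action on it
    task = await temper.create("my-app", "Tasks", {"title": "Review PR"})
    await temper.action("my-app", "Tasks", task["id"], "Start", {})
    # 3. Query the governed state
    return await temper.list("my-app", "Tasks")

tasks = asyncio.run(main())
assert tasks[0]["status"] == "Start"
```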
+ ## Architecture ``` -┌────────────────────────────────────────────────────────────────────────────┐ -│ Agent (Claude Code, Pi, Pydantic AI, OpenClaw, Cursor, LangChain, etc.) │ -└───────────────────────────────────┬────────────────────────────────────────┘ - │ MCP (execute) - ▼ -┌────────────────────────────────────────────────────────┐ -│ Monty Sandbox (Python REPL) │ -│ temper.submit_specs() · create() · action() · list() │ -└───────────────────────┬────────────────────────────────┘ - │ - ▼ -┌──────────────────────────────────────────────────────────┐ -│ Temper Kernel │ -│ │ -│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ -│ │ Spec │→ │ Verify │→ │ Deploy │ │ -│ │ IOA+CSDL │ │ L0-L3 │ │ Actor RT │ │ -│ └──────────┘ └──────────┘ └────┬─────┘ │ -│ │ │ -│ ┌──────────┐ ┌──────────┐ ┌────▼─────┐ │ -│ │ Cedar │ │ WASM │ │ OData │ │ -│ │ AuthZ │ │ Integr. │ │ API │ │ -│ └──────────┘ └──────────┘ └──────────┘ │ -│ │ -│ ┌──────────┐ ┌─────────────────────┐ ┌──────────┐ │ -│ │ Event │ │ OTEL → Logfire, │ │ Evolution│ │ -│ │ Sourcing │ │ Datadog, ClickHouse │ │ Engine │ │ -│ └──────────┘ └─────────────────────┘ └──────────┘ │ -└──────────────────────────────────────────────────────────┘ - │ - ▼ -┌────────────────────────────────────────────────────────┐ -│ Persistence: Postgres or Turso/libSQL │ -└────────────────────────────────────────────────────────┘ +┌─────────────────────────────────────────────────────┐ +│ Agent (Claude Code, OpenClaw, Pydantic AI, etc.) │ +└────────────────────────┬────────────────────────────┘ + │ MCP (execute) + ▼ +┌─────────────────────────────────────────────────────┐ +│ Sandboxed REPL │ +│ temper.submit_specs() · create() · action() · ... 
│ +└────────────────────────┬────────────────────────────┘ + │ + ▼ +┌─────────────────────────────────────────────────────┐ +│ Temper Kernel │ +│ │ +│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ +│ │ Specs │→ │ Verify │→ │ Deploy │ │ +│ └──────────┘ └──────────┘ └────┬─────┘ │ +│ │ │ +│ ┌──────────┐ ┌──────────┐ ┌────▼─────┐ │ +│ │ AuthZ │ │ Integr. │ │ Query │ │ +│ └──────────┘ └──────────┘ └──────────┘ │ +│ │ +│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ +│ │ Events │ │ Observe │ │ Evolve │ │ +│ └──────────┘ └──────────┘ └──────────┘ │ +└─────────────────────────────────────────────────────┘ ``` -**Hot-reloadable** (what agents create and modify): -- IOA specs → transition tables rebuild live -- Cedar policies → authorization engine reloads live -- WASM modules → re-instantiate live -- Reaction rules → reload live - -**Static** (the kernel): -- Spec interpreter, Cedar evaluator, WASM host, HTTP server, persistence - -## The Agent OS - -The kernel is the foundation — spec interpreter, verification cascade, Cedar authorization, persistence. On top of it sit apps: sets of specs (state machines, data models, policies) verified and deployed on the kernel. - -### Apps on the kernel - -- **Bundled apps.** Some capabilities are general enough to ship with Temper: agent execution, task management, a notification pipeline. These arrive as pre-verified spec bundles — ready to use out of the box, or modify to fit. -- **Agent-built apps.** Others are specific to what the agent does. A deployment orchestrator for a DevOps agent. A patient intake workflow for a healthcare agent. The agent designs these as specs, submits them, and they become part of its operating environment. - -### Composability - -An agent's apps are entities on the same kernel. The task manager can reference knowledge entries. The code review workflow can spawn tasks. The notification pipeline can trigger on any state transition in any app. 
They compose because they share the same runtime, the same authorization model, and the same query surface (OData). - -### Sharing - -Everything is a spec, so agents can share them. An incident response workflow one agent built can be exported as a spec bundle and imported by another agent on another Temper instance. The verification cascade runs again on import, so the receiving agent knows the specs are sound in their context. - -## How Agents Grow New Capabilities - -Temper records every action — successes and failures — as trajectory entries. The evolution engine analyzes these trajectories for patterns and surfaces spec proposals for human approval. This creates a feedback loop where agents accumulate capabilities over time. - -**Example: an agent keeps re-investigating solved bugs.** Trajectories show repeated context loss across sessions. The evolution engine surfaces the pattern. The agent designs a Knowledge spec (`Draft → Indexed → Linked → Archived`) with semantic search and Cedar-scoped access. You review the reachable states, approve, and the knowledge system hot-reloads. The agent starts retaining what it learns. - -**Example: an agent hits a throughput bottleneck.** Trajectories show a growing queue of unprocessed work. The agent designs a TaskDelegation spec — entities that spawn scoped sub-agents with Cedar permissions narrowed to the delegated task. The spec's invariant guarantees a sub-agent can never escalate beyond its parent's authorization. You review, approve, and the agent can now distribute work. - -The pattern repeats. Each cycle — trajectory analysis, spec proposal, verification, human approval — adds a new capability to the agent's operating environment. +**The kernel** (static): spec interpreter, verification cascade, actor runtime, authorization engine, event sourcing, telemetry. -## Running Agents +**Skills** (what agents create and modify): state machines, data models, authorization policies, integrations. All hot-reloadable. 
-Temper already provides the shared state layer for multiple agents — verified entities queryable via OData, Cedar-mediated access between agents, and trajectories recording every action. The natural next step is building agent execution on top of these same primitives: modeling agents, tasks, and plans as Temper entities, with background execution, spawning, and coordination built in. +
-### Agents as entities +## Status -An Agent would be a Temper entity with its own state machine — just like any other entity. So would Plans, Tasks, and ToolCalls. Creating an agent, assigning it work, tracking its progress — all state transitions, all mediated by Cedar, all recorded as trajectories. +> **Temper is pre-release (0.1.0).** The architecture is stabilizing. The API surface is not frozen. Expect breaking changes. We are building and exploring. -### Background execution - -A headless executor daemon would watch for Agent entities via SSE, claim them, and run them concurrently: - -- **Claiming.** Executor sets `executor_id` on the Agent entity — first-come-first-serve across multiple executor instances. -- **Concurrency.** Bounded by semaphore. Multiple executors share the load. -- **Fault tolerance.** Conversation state checkpointed after each turn. If an executor crashes, another resumes from the checkpoint. - -### Spawning and coordination - -- **Parent → child.** An agent spawns children through a `SpawnChild` action — same as creating any entity. The child gets a scoped role, goal, and Cedar permissions narrowed to its delegated task. -- **Cross-entity gates.** A parent's completion gates on all children reaching a terminal state — a cross-entity invariant verified before the spec deploys. -- **Shared state, not messaging.** Temper is the shared state layer. Agents coordinate by reading each other's entities through the same OData API. One agent's completed task unblocks another's next step — because they query the same verified state. - -### Same primitives all the way down - -The Agent state machine would be a spec. The Task lifecycle would be a spec. Cross-entity guards would be verified. Cedar would mediate every tool call. Trajectories would record every action. An agent spawning a child would go through the same verification-mediation-recording pipeline as an agent creating a knowledge entry. 
- -### Where this is heading - -Orchestration patterns as specs. What polls what, what supervises what, how agents form teams, what triggers a new agent — all expressible as state machines that go through the verification cascade. An agent could design its own orchestration topology, submit it, and have it verified before it runs. - -## The Layers - -Temper is being built bottom-up. Each layer enables the next. - -| Layer | Description | Status | -|-------|-------------|--------| -| **6. Agent Execution** | Agents as entities. Background executor, spawning, scheduling, multi-agent coordination. | Planned | -| **5. Pure Temper Agent** | Agent's only tool is Temper. No raw shell, no bespoke tools. Everything mediated. | Planned | -| **4. Harness Composition** | Agents design harnesses as specs — what polls what, what reviews what, what gates what. | Planned | -| **3. Integration Framework** | Streaming-capable integrations (LLM calls, HTTP, databases) as WASM modules, mediated by Cedar. | In Progress | -| **2. Temper as Filesystem** | OData-queryable entity persistence replaces markdown files and JSON blobs. | Planned | -| **1. CRUD Apps** | Agents build applications as entity specs. Other agents consume them through the generated API. | Working | -| **Foundation: Kernel** | Spec parser, verification cascade, actor runtime, Cedar authZ, OData API, event sourcing, evolution. 950+ tests. 
| Done | +| Working | Next | +|---|---| +| Spec parser and verification cascade (SMT, model checking, simulation, property tests) | Agent execution (agents as entities, background executor) | +| Authorization engine (default-deny, approval flows, policy generation) | Streaming integrations | +| API generation with schema discovery | Harness composition (agents design harnesses as specs) | +| Event sourcing (Postgres, Turso/libSQL) | Distributed deployment | +| MCP integration (sandboxed REPL) | | +| Sandboxed integrations with resource budgets | | +| Evolution engine (trajectory capture, failure clustering, spec proposals) | | +| Observe dashboard (decisions, agents, entities) | | +| Pre-built skills: project management, filesystem, agent orchestration | | -**Layer 1 — CRUD apps.** Temper entities are queryable via OData. An agent can build something like an issue tracker or project board entirely as Temper specs. Other agents consume it through the generated API. *Working today.* +950+ tests across 25 crates. -**Layer 2 — Filesystem.** Agents tend to store state in markdown files, JSON blobs, or ad-hoc memory — fragile and unqueryable. If Temper's OData layer becomes the filesystem, every file is an entity, every write is a transition, every read is a query. Checkpointing becomes entity state. Version history becomes event sourcing. Search becomes `$filter`. +
+Technical details for the curious -**Layer 3 — Integrations.** Agents need to reach external systems. Instead of bespoke tool implementations per agent, Temper provides an integration layer where agents write integrations as WASM modules + specs. Cedar mediates which integrations an agent can use. +### Verification cascade -**Layer 4 — Harness composition.** The harness should always be rewritable. With apps for tracking work, a filesystem for state, and integrations for external systems — agents have what they need to design complete harnesses as specs: what polls what, what reviews what, what gates what. Tools and harnesses are both code with a signature — declarative specs that agents author, verify, and rewrite as they evolve. +Every spec passes four levels before deployment: -**Layer 5 — Pure Temper agent.** An agent whose only tool is Temper. No raw filesystem, no shell, no bespoke API clients. Everything mediated, queryable, auditable. +- **L0**: SMT solver checks guard satisfiability and invariant inductiveness +- **L1**: Exhaustive model checking explores the full state space +- **L2**: Deterministic simulation with fault injection (message drops, delays, crashes) +- **L3**: Property-based testing with random action sequences and shrinking -**Layer 6 — Agent execution.** The top of the stack: Temper runs the agents themselves. Agents as entities with verified state machines. Background executors claim and run them. Agents spawn children, schedule work, coordinate through shared state. The orchestration runs on the same primitives — specs, verification, Cedar, trajectories — as everything else. +The model checker verifies the same Rust code that runs in production. 
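The point of the L2 level is reproducibility: the same seed yields the same fault schedule, so any failing trace can be replayed exactly. A toy illustration, not the kernel's simulator:

```python
# Toy deterministic-simulation sketch: faults (message drops) are
# drawn from a seeded PRNG, so a failing run can be replayed exactly
# from its seed. Illustration only — not Temper's simulator.
import random

def simulate(seed, n_messages=20, drop_rate=0.3):
    """Deliver messages, dropping some per the seeded PRNG."""
    rng = random.Random(seed)
    delivered = []
    for msg in range(n_messages):
        if rng.random() >= drop_rate:  # fault injection: drop otherwise
            delivered.append(msg)
    return delivered

# Same seed -> identical trace: any counterexample is reproducible.
trace_a = simulate(seed=42)
trace_b = simulate(seed=42)
assert trace_a == trace_b
assert all(0 <= m < 20 for m in trace_a)
```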
-## What's Implemented +### Specifications -| Feature | Status | -|---------|--------| -| I/O Automaton spec parser (states, actions, guards, invariants, integrations) | **Done** | -| CSDL data model parser (OData-compatible entity types) | **Done** | -| Verification cascade — L0 Z3 SMT, L1 Stateright, L2 DST with fault injection, L3 proptest | **Done** | -| Actor runtime with event sourcing, deterministic scheduling, bounded mailboxes | **Done** | -| OData v4 API generation (CRUD, $filter, $select, $expand, bound actions) | **Done** | -| Cedar authorization (default-deny, per-action policies, agent identity) | **Done** | -| OTEL observability (wide events, dual projection to metrics + spans) | **Done** | -| Postgres and Turso/libSQL persistence backends (multi-tenant) | **Done** | -| MCP integration — Monty sandbox with `execute` tool (thin client to running server) | **Done** | -| WASM sandboxed integrations (resource budgets, Cedar-gated) | **Done** | -| Evolution Engine — O-P-A-D-I record chain, unmet intent capture, approval gate | **Done** | -| JIT transition tables with hot-swap (live spec updates, zero downtime) | **Done** | -| Human approval flow (default-deny, pending decisions, Cedar policy generation) | **Done** | -| Observe dashboard — Next.js UI for decisions, agents, entities, specs, evolution | **Done** | -| Programmatic spec submission API (agents generate and deploy specs) | **Done** | -| Cross-entity choreography via reaction engine | **Done** | -| Agent runtime with LLM-driven execution loop and tool registries | In Progress | -| Headless executor — SSE-driven agent claiming, concurrent execution, checkpointing | Planned | -| Agent spawning — parent→child with cross-entity state gates and Cedar inheritance | Planned | -| Deterministic simulation store with configurable fault injection | **Done** | -| Temper as agent filesystem (OData-queryable entity persistence) | Planned | -| Streaming integration framework (LLM calls, HTTP, databases) | In 
Progress | -| Harness composition — agents design harnesses as specs | Planned | -| Formal verification of WASM integration modules | Planned | -| Cross-entity invariants (formal proofs spanning multiple entity types) | Planned | -| Orchestration patterns as specs — agent teams, supervision, triggers | Planned | -| Scheduled agent invocations — cron/timer-triggered execution | Planned | -| Distributed deployment — multi-node actor placement | Planned | +Skills are defined by three declarative artifacts: -950+ tests across 25 crates. +- **I/O Automaton specs** (.ioa.toml): states, transitions, guards, invariants, integration declarations +- **CSDL data models** (.csdl.xml): entity types, relationships, actions (OData v4 standard) +- **Cedar policies** (.cedar): authorization rules with default-deny posture -
-What agents generate (IOA spec example) - -Agents generate specs — nobody writes them by hand. But if you want to see what gets generated: - -```toml -[automaton] -name = "Knowledge" -states = ["Draft", "Indexed", "Linked", "Archived"] -initial = "Draft" - -[[state]] -name = "content" -type = "string" - -[[state]] -name = "source" -type = "string" - -[[state]] -name = "links" -type = "counter" -initial = "0" - -[[action]] -name = "Index" -from = ["Draft"] -to = "Indexed" -guard = "content != ''" - -[[action]] -name = "Link" -from = ["Indexed"] -to = "Linked" - -[[action]] -name = "Archive" -from = ["Indexed", "Linked"] -to = "Archived" - -[[invariant]] -name = "IndexRequiresContent" -when = ["Indexed", "Linked", "Archived"] -assert = "content != ''" - -[[integration]] -name = "semantic_search" -trigger = "Index" -type = "wasm" -module = "search_service" -on_success = "IndexSucceeded" -on_failure = "IndexFailed" -``` +Agents generate these. Nobody writes them by hand. -States, transitions, guards, invariants, and WASM integrations — all in one declarative file. The verification cascade operates on this directly. The kernel derives a transition table from it. - -
- -
-Crate overview (25 crates) +### Crate overview (25 crates) | Crate | Purpose | |-------|---------| | **temper-spec** | IOA TOML + CSDL parsers, compiles to StateMachine IR | | **temper-verify** | L0-L3 verification cascade (Z3, Stateright, DST, proptest) | -| **temper-jit** | TransitionTable builder, hot-swap controller, shadow testing | +| **temper-jit** | TransitionTable builder, hot-swap controller | | **temper-runtime** | Actor system, bounded mailboxes, event sourcing, SimScheduler | -| **temper-server** | HTTP/axum, OData routing, entity dispatch, webhooks, idempotency | +| **temper-server** | HTTP/axum, OData routing, entity dispatch, idempotency | | **temper-odata** | OData v4: path parsing, query options, $filter/$select/$expand | -| **temper-authz** | Cedar-based authorization on every action | +| **temper-authz** | Cedar-based authorization engine | | **temper-observe** | OTEL spans + metrics, trajectory tracking | -| **temper-evolution** | O-P-A-D-I record chain, Evolution Engine | +| **temper-evolution** | O-P-A-D-I record chain, evolution engine | | **temper-wasm** | WASM sandboxed integrations with per-call resource budgets | -| **temper-mcp** | MCP server, Monty sandbox (execute tool, thin client) | -| **temper-platform** | Hosting platform, verify-deploy pipeline, system OData API | -| **temper-optimize** | Query + cache optimizer, N+1 detection, safety checker | +| **temper-mcp** | MCP server, Monty sandbox (execute tool) | +| **temper-platform** | Hosting platform, verify-deploy pipeline, skill catalog | +| **temper-optimize** | Query + cache optimizer, N+1 detection | | **temper-store-postgres** | Postgres event journal + snapshots (multi-tenant) | | **temper-store-turso** | Turso/libSQL event journal + snapshots | -| **temper-store-redis** | Distributed mailbox, placement, cache traits (stubs) | +| **temper-store-redis** | Distributed mailbox, placement, cache traits | | **temper-cli** | CLI: parse, verify, serve, mcp, decide | -| 
**temper-agent-runtime** | Agent execution loop with pluggable LLM providers and tool registries | -| **temper-executor** | Headless agent runner — watches for Agent entities, claims and executes them | -| **temper-sandbox** | Shared Monty sandbox infrastructure: JSON/Monty conversion, HTTP dispatch | -| **temper-sdk** | HTTP client library for Temper server (OData entities, governance API, SSE) | -| **temper-codegen** | Generates Rust actor code from CSDL entity models and behavioral specs | -| **temper-store-sim** | In-memory deterministic event store for simulation testing with fault injection | -| **temper-wasm-sdk** | SDK crate for writing WASM integration modules | +| **temper-agent-runtime** | Agent execution loop with pluggable LLM providers | +| **temper-executor** | Headless agent runner (watches for Agent entities, claims and executes) | +| **temper-sandbox** | Shared Monty sandbox infrastructure | +| **temper-sdk** | HTTP client library for Temper server | +| **temper-codegen** | Generates Rust actor code from CSDL + behavioral specs | +| **temper-store-sim** | In-memory deterministic event store with fault injection | +| **temper-wasm-sdk** | SDK for writing WASM integration modules | | **temper-macros** | Proc macros: `#[derive(Message)]`, `#[derive(DomainEvent)]` |
+
+ ## Contributing Contributions are welcome. See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines. diff --git a/docs/PAPER.md b/docs/PAPER.md index f4a2efe2..de56b40b 100644 --- a/docs/PAPER.md +++ b/docs/PAPER.md @@ -1,108 +1,113 @@ -# Temper: What If Agents Could Converse Their Way to a Verified Backend? +# Temper: Descriptions All the Way Down -*A framework and platform research for agent-built, agent-operated API backends* +*An operating layer where agents write verified specifications, and specifications evolve through use* **Seshendra Nalla** -*Draft -- February 2026* +*Draft -- February 2026, revised March 2026* --- ## Abstract -Most enterprise SaaS applications follow a remarkably similar pattern: entities -move through state machines, guards prevent invalid transitions, integrations -notify external systems, and authorization policies control who can do what. -An e-commerce order, a support ticket, a subscription, an approval workflow -- -the business logic in each case is a set of states, transitions, and invariants. -The rest is infrastructure. - -This observation suggests that if the state machine is the essential artifact, -then much of the surrounding code -- controllers, service layers, ORM mappings, -webhook plumbing, instrumentation -- might be derivable rather than written. -And if specifications can be generated from conversation rather than hand-coded, -the feedback loop between what a user needs and what the system provides -tightens considerably. - -We explore this idea through Temper, an actor-based framework where I/O Automaton -specifications define behavioral state machines, a four-level verification cascade -(SMT symbolic checking, exhaustive model checking, deterministic simulation, and -property-based testing) establishes correctness before deployment, and an evolution -engine captures unmet user intents from production to propose specification changes -back to the developer. 
The state machine stays pure and deterministically verifiable; -external integrations follow an outbox pattern, dispatched asynchronously from the -event journal. A self-describing HTTP API is derived automatically from the data -model, giving agents a structured interface they can navigate without documentation. -A pre-built rule index yields sub-30ns action evaluation, while end-to-end benchmarks -through the full HTTP stack with PostgreSQL persistence show ~18ms per action and -~2,200 persisted actions per second under concurrent load. - -Beyond serving as a framework for building applications, the same architecture -positions Temper as an operating layer for autonomous agents. When every -state-changing action flows through a governed, verified state machine -- with -Cedar authorization enforcing default-deny policies, a pending decision flow -surfacing denied actions for human approval, and trajectory logging recording -every transition with agent identity -- the result is an auditable, secure -execution environment where agents operate within formally verified boundaries. - -The framework is implemented as a 16-crate Rust workspace with 440+ tests and a -reference e-commerce application (three entity types, seven verified specifications -across different SaaS domains). We do not claim this approach generalizes to all -backend systems -- but for the substantial class of applications whose core logic -is state machine shaped, the results suggest that specification-first, conversation- -driven development is a practical path worth investigating further. +In 1949, Von Neumann designed a self-replicating machine with three parts: a +description encoding a blueprint, a universal constructor that reads any description +and builds the machine it encodes, and a copy mechanism that duplicates and mutates +descriptions over time [21, 22]. The machine grows in complexity by changing its +descriptions. The constructor stays the same. 
Von Neumann arrived at this
+architecture before Watson and Crick discovered the structure of DNA. Biology
+uses the same separation.
+
+Temper follows this pattern for autonomous agents. The kernel is the constructor:
+it reads specifications -- I/O Automaton state machines, CSDL data models, Cedar
+authorization policies, WASM integration declarations -- and builds a running
+system from them. Skills are the descriptions: each one bundles a verified state
+machine, data model, policies, and integrations into a deployable capability.
+Agents create new skills and operate through existing ones. The GEPA (Guided
+Evolution of Pareto-optimal Artifacts) observes how skills are used, clusters
+failure patterns, and proposes specification changes. The descriptions evolve.
+The constructor stays stable.
+
+Every description passes a four-level verification cascade before deployment:
+Z3 SMT symbolic checking, exhaustive model checking, deterministic simulation
+with fault injection, and property-based testing. Every action flows through
+Cedar authorization with a default-deny posture. Denied actions surface to the
+human for approval. Policies accumulate as agents work. Every transition is
+recorded with agent identity, before/after state, and the authorization decision.
+
+The system is implemented as a 25-crate Rust workspace with 950+ tests. Three
+pre-built skills ship with the platform (project management, filesystem, agent
+orchestration). A reference e-commerce application demonstrates the full
+development flow across three verified entity types. A pre-built rule index
+yields sub-30ns action evaluation; end-to-end benchmarks through the full HTTP
+stack with PostgreSQL persistence show ~18ms per action and ~2,200 persisted
+actions per second under concurrent load.
+
+We do not claim this approach generalizes to all backend systems.
For the +substantial class of applications whose core logic is state machine shaped -- +and for agents that need governed, verified, evolving capabilities -- the +results suggest this is a practical path worth investigating further. --- ## 1. Introduction -The emergence of autonomous LLM agents as first-class API consumers -fundamentally changes the contract between a backend system and its callers. -Traditional web frameworks--Rails, Django, Spring Boot--are designed around a -request/response model where a human user clicks a button, the framework -routes the request to a controller, and a response is rendered. The developer -writes imperative handler code; correctness is established, if at all, by unit -tests and code review. - -Agentic backends face three compounding challenges that this model does not -address: +Agents build tools at runtime. They generate MCP servers, synthesize helpers, +create workflow automation. At the same time, the infrastructure for making +this safe is developing: policy-based authorization on tool invocations, +behavioral contracts, state-machine-constrained agents, formal verification +becoming practical. + +The tools agents build have a recurring shape. A project tracker, a knowledge +base, a deployment pipeline, a notification system -- the core logic in each case +is a state machine: states, transitions, guards, invariants. The surrounding +infrastructure -- persistence, API endpoints, authorization, webhooks, +observability -- follows mechanically from the state machine definition. + +This observation leads to a structural question: if the state machine is the +essential artifact, and agents are already building these artifacts at runtime, +what happens when you formalize the pattern? + +Temper's answer draws from Von Neumann's 1949 architecture for self-replicating +machines [21, 22]. 
Von Neumann separated the *description* (a blueprint +encoded on a tape) from the *constructor* (a universal machine that reads any +description and builds what it encodes). Evolution happens by mutating +descriptions, not the constructor. The constructor stays stable. + +In Temper, the kernel is the constructor. It reads specifications and builds +running systems from them. Skills are the descriptions. Each skill bundles +a verified state machine, a data model, authorization policies, and integration +declarations. Agents create skills by describing what they need. The GEPA +(Guided Evolution of Pareto-optimal Artifacts) observes how skills are used and +proposes specification changes. Agents also create new skills when they encounter +gaps. Both paths -- mutation and creation -- go through a four-level verification +cascade and require human approval. + +Four challenges motivate this architecture: 1. **Correctness under autonomy.** An agent may issue hundreds of API calls - per minute, exploring state spaces that no human tester would traverse. A - subtle invariant violation--shipping an order without captured payment, for - instance--can propagate silently because the agent has no intuition to catch - it. - -2. **Evolvability without code archaeology.** When an agent's trajectory - analysis reveals that users want to split an order into multiple shipments, - the system must evolve. In a code-first framework, this means modifying - controllers, models, migrations, and tests. If an agent is performing the - modification, it must understand the full codebase. - -3. **Optimizability under production load.** Agents generate access patterns - that differ from human browsing. N+1 query patterns, suboptimal cache TTLs, - and shard hotspots manifest at runtime. The system should be able to - detect and correct these autonomously. - -4. 
**Governance under autonomy.** An agent acting as an enterprise employee + per minute, exploring state spaces no human tester would traverse. A + subtle invariant violation -- shipping an order without captured payment -- + can propagate silently because the agent has no intuition to catch it. + +2. **Evolvability without code archaeology.** When trajectory analysis reveals + that users want to split an order into multiple shipments, the system must + evolve. In a code-first framework, this means modifying controllers, + models, migrations, and tests. If the capability is a description, you + modify the description and re-verify. + +3. **Governance under autonomy.** An agent acting as an enterprise employee or personal assistant must operate within boundaries: which APIs it can call, which data it can modify, which external systems it can reach. These boundaries must be enforceable (not advisory), auditable (every action recorded with agent identity), and evolvable (new permissions granted as needs arise, not anticipated upfront). -Temper's key insight is that **specifications, not code, should be the durable -artifact** — and that **specifications themselves should be generated from -conversation**. A developer describes their domain through a conversational -interview; the system generates I/O Automaton behavioral specifications, OData -CSDL data models, and Cedar authorization policies from that conversation. -Code is generated from these specifications and can be regenerated whenever the -specifications change. -When end users encounter capabilities the system lacks, their unmet intents -flow through an Evolution Engine that proposes specification changes for -developer approval. The system continuously evolves from both developer -intent and production feedback, with a four-level verification cascade -gating every change. +4. 
**Interpretability under evolution.** When skills evolve through use, + a human must be able to read the specification and understand what the + system does, trace why any change was made, and verify that the change + is sound. The specification is the system's behavior, complete and + inspectable. The remainder of this paper is organized as follows. Section 2 presents the overall architecture. Sections 3--5 describe the specification layer, the actor @@ -181,8 +186,8 @@ surface as structured proposals for specification changes. Approved changes run through the verification cascade and deploy via hot-swap, closing the loop between production behavior and system evolution. -The framework is implemented in Rust (edition 2024) as 16 crates plus a -reference application, totaling 440+ tests. +The framework is implemented in Rust (edition 2024) as 25 crates plus a +reference application, totaling 950+ tests. --- @@ -888,6 +893,26 @@ for agent consumption. standard. Google Zanzibar [16] provides relationship-based authorization at scale. Temper uses Amazon Cedar [2]; Section 3.3 discusses the rationale. +**Agent operating systems.** AgentOS [23] proposes a skill-based architecture +where agents discover, compose, and execute modular capabilities rather than +monolithic applications. Their skills are natural language contracts parsed +into intent specifications and capability constraints. Temper's skills share +the modularity and composability but differ in the contract representation: +Temper skills are formally verified state machines with mathematical proofs +of correctness, not natural language intent specifications. AgentOS's +semantic firewall for intent verification, trajectory mining for pattern +learning, and state rollback mechanisms have parallels in Temper's Cedar +authorization, GEPA trajectory analysis, and event-sourced state recovery. 
+ +**Self-replicating machines.** Von Neumann's universal constructor [21, 22] +established the separation between description (blueprint) and constructor +(universal builder) as a prerequisite for open-ended evolution. Temper's +kernel/skill architecture follows this separation: the kernel is the +constructor that interprets any specification, and skills are the +descriptions that agents create, mutate, and evolve. The GEPA serves +as the copy-and-mutate mechanism, proposing description changes based on +observed usage patterns. + **Self-optimizing systems.** CockroachDB [17] performs automatic range splitting and rebalancing. Neon [18] adjusts compute and storage resources based on workload. Temper's optimizer actors operate at the application layer, @@ -900,7 +925,7 @@ observability data, with a safety checker ensuring correctness. ### 11.1 Test Coverage -The Temper workspace contains 450 tests across 16 crates and one reference +The Temper workspace contains 950+ tests across 25 crates and one reference application. Key categories: | Category | Count | Crates | @@ -1184,51 +1209,61 @@ As described in Section 9.3, the write path is backend-agnostic via Temper makes the following contributions: -1. A conversation-first architecture where specifications are generated from - developer interviews, code is derived from specifications, and the system - self-evolves from production trajectory intelligence. +1. A constructor/description architecture where the kernel (constructor) interprets + any specification fed to it, and skills (descriptions) are the evolving artifacts + that agents create, operate through, and improve. The kernel stays stable. The + descriptions evolve. 2. A four-level verification cascade combining Z3 SMT symbolic verification, exhaustive model checking with multi-variable state and liveness properties, deterministic simulation with fault injection, and property-based testing. + Every description passes this cascade before deployment. + +3. 
A skill system where each skill bundles a verified state machine, data model, + authorization policies, and integration declarations into a deployable + capability. Three pre-built skills ship with the platform; agents create + new ones at runtime. -3. An Evolution Engine with Lamport-style problem formalization, human approval - gates, and portable SQL evidence. +4. GEPA (Guided Evolution of Pareto-optimal Artifacts): a trajectory-driven + evolution engine that observes how skills are used, clusters failure patterns, + and proposes specification changes. Combined with agent-initiated skill + creation, this provides two paths for capability evolution -- mutation + of existing descriptions and creation of new ones -- both gated by + verification and human approval. -4. Trajectory intelligence that extracts product signal (unmet intents, friction, +5. An Evolution Engine with Lamport-style problem formalization, human approval + gates, the O-P-A-D-I record chain, and portable SQL evidence. + +6. Trajectory intelligence that extracts product signal (unmet intents, friction, workarounds) from agent execution traces. -5. A three-tier JIT execution model with atomic hot-swap and shadow testing for - zero-downtime state machine evolution. +7. A three-tier JIT execution model with atomic hot-swap and shadow testing for + zero-downtime skill evolution. -6. A DST-first development methodology where actor-level simulation tests +8. A DST-first development methodology where actor-level simulation tests validate state machine behavior before HTTP wiring, catching guard resolution bugs that would be invisible to integration tests. -7. End-to-end functional validation: the same `TransitionTable` verified by +9. End-to-end functional validation: the same `TransitionTable` verified by Stateright, deterministic simulation, and property tests runs inside HTTP-serving entity actors, establishing a provable chain from formal specification to production behavior. -8. 
Adoption of TigerStyle [19] as a cross-cutting engineering methodology: - assertion density at the state machine level, bounded execution throughout - the actor runtime, static resource budgets, and DST-first development - where simulation testing is the primary--not supplementary--testing - strategy. +10. Adoption of TigerStyle [19] as a cross-cutting engineering methodology: + assertion density at the state machine level, bounded execution throughout + the actor runtime, static resource budgets, and DST-first development + where simulation testing is the primary testing strategy. -9. A reference e-commerce application that demonstrates the full self-hosted - development flow: three verified entity state machines (Order, Payment, - Shipment), 22 DST tests with determinism proofs, three cascade tests, - infrastructure as code, and a complete O-P-A-D-I evolution chain showing - how production observations lead to spec improvements. +11. An interpretability guarantee at every layer: specifications are readable, + verification produces counterexamples (not just pass/fail), evolution is + auditable through the O-P-A-D-I chain, and the human retains approval + authority over every mutation. -10. An agent governance layer where autonomous agents operate within formally +12. An agent governance layer where autonomous agents operate within formally verified state machines, Cedar authorization enforces default-deny policies with reactive human approval, every action is recorded with agent identity in an auditable trajectory log, and agents can generate and submit their - own specifications through a programmatic API -- positioning Temper as the - operating layer for autonomous agents acting as enterprise employees or - personal assistants. + own specifications through a programmatic API. ### 12.2 Conversational Development Vision @@ -1513,3 +1548,13 @@ https://github.com/tigerbeetle/tigerbeetle/blob/main/docs/TIGER_STYLE.md [20] OpenTelemetry. 
*OpenTelemetry Specification.* https://opentelemetry.io/docs/specs/otel/ + +[21] J. von Neumann. *Theory of Self-Reproducing Automata.* Edited and +completed by A. W. Burks. University of Illinois Press, 1966. + +[22] J. von Neumann. *The General and Logical Theory of Automata.* +In L. A. Jeffress (ed.), Cerebral Mechanisms in Behavior: The Hixon +Symposium, pp. 1-31. Wiley, 1951. + +[23] T. Li et al. *AgentOS: A Natural Language-Driven Data Ecosystem for +Autonomous Agents.* arXiv:2603.08938, 2026. diff --git a/docs/POSITIONING.md b/docs/POSITIONING.md index 504362b4..30cf2a76 100644 --- a/docs/POSITIONING.md +++ b/docs/POSITIONING.md @@ -1,143 +1,116 @@ -# Temper: An Observation About Enterprise SaaS +# Temper: From Agent-Built Tools to Verified Skills -## 1. The Pattern +## 1. Agents Are Starting to Build Their Own Tools -Spend enough time looking at enterprise SaaS backends and a pattern starts to emerge. An e-commerce order moves through Draft, Submitted, Confirmed, Shipped, Delivered. A support ticket goes from Open to InProgress to Resolved to Closed. A subscription cycles between Active, PastDue, Suspended, Cancelled. +Agent scaffolding shrinks as models get smarter. Prompt templates, tool wrappers, output parsers: the model absorbs what used to require code around it. Two things remain. Agents need infrastructure to run on: filesystems, sandboxes, persistence, authorization. And agents need tools to do their work: trackers, pipelines, coordinators. -The business logic in each case is a state machine: states, transitions between them, guards that prevent invalid transitions ("you can't submit an empty order"), and invariants that must always hold ("cancelled is final"). The entities are different, but the shape of the problem is the same. +Agents are starting to build the second category for themselves. A coding agent generates an MCP server mid-session because the tool it needs does not exist. 
A planning agent synthesizes a workflow tracker to coordinate subtasks. An operations agent creates a notification pipeline to monitor deployments. These are not pre-built tools the agent was given. The agent decided it needed them and made them. -What surrounds this core? Persistence, API endpoints, authorization, webhooks, observability. These layers are important -- critical, even -- but they follow mechanically from the state machine definition. If I know the states, transitions, and invariants, I can derive the rest. +Most agents still operate with a fixed set of tools handed to them by a developer. As models get more capable, agents will build more of their own tools. The question is what happens to them. -This is not a new observation. State machines are a well-studied formalism. What's interesting is asking: *how far can you push this?* If the state machine is the essential artifact, what becomes possible? +## 2. Most Tools Are State Machines -## 2. What Falls Out +Consider what agents tend to build. A project tracker with statuses: backlog, in progress, in review, done. A deployment pipeline: pending, building, testing, deployed, rolled back. A notification system: draft, scheduled, sent, failed, retried. -If you accept the premise that the state machine is the core, each layer of a traditional SaaS backend maps to a declarative primitive: +The same shape runs through enterprise SaaS. An e-commerce order moves through draft, submitted, confirmed, shipped, delivered. A support ticket goes from open to in progress to resolved to closed. A subscription cycles between active, past due, suspended, cancelled. 
-| What you'd normally write | What it maps to | -|---|---| -| ORM models, migrations | A CSDL data model | -| Controllers, service layers | IOA TOML specifications | -| if/else workflow logic | TransitionTable guards and effects | -| Auth middleware | Cedar ABAC policies | -| Webhook integrations | Integration declarations (outbox pattern) | -| Manual instrumentation | Automatic telemetry from transitions | +The core logic in each case is a state machine. Statuses, transitions between them, rules about which transitions are allowed ("you can't ship without confirming payment"), constraints that must hold ("cancelled is final"). The entities are different. The shape of the problem is the same. -The question is whether this mapping is a useful simplification or an over-reduction that loses expressiveness. Temper is an attempt to find out. +This is the hypothesis Temper operates on. A large class of the tools agents build, and a large class of the applications developers build, share this structure. If the state machine is the essential artifact, two things follow. -## 3. Five Patterns +First, verification becomes tractable. You can prove, before anything runs, that every rule is satisfiable, every constraint holds across all reachable statuses, and no failure scenario violates the contract. You cannot do this for arbitrary code in general. You can do it for state machines. -To test how far the IOA approach stretches, we wrote specifications for five different SaaS patterns. All five parse, verify through a four-level cascade (SMT symbolic checking, exhaustive model checking, deterministic simulation, and property-based testing), and run in the same actor runtime. +Second, the tools agents build are not code. They are *descriptions* of behavior: statuses, transitions, rules. This distinction matters for everything that comes next. -### E-Commerce Order (`reference-apps/ecommerce/specs/order.ioa.toml`) +## 3. 
Descriptions Need a Constructor -The most complex spec: 10 states, 12 transition actions. Multi-state cancellation (from Draft, Submitted, or Confirmed). A counter guard (`items > 0`) prevents empty orders from reaching Submitted. Terminal states (Cancelled, Refunded) have no outbound transitions. Integration hooks fire on SubmitOrder, ConfirmOrder, and ShipOrder. +A description sitting in a file does nothing. Someone has to read it and build the running system it encodes: the persistence layer, the API endpoints, the authorization checks, the event journal. If you build all of that by hand for each description, you have not gained much. -``` -Draft --> Submitted --> Confirmed --> Processing --> Shipped --> Delivered - | | | | - +----+-----+ | ReturnRequested - | | | - Cancelled Cancelled Returned - | - Refunded -``` +In 1949, Von Neumann asked what threshold of complexity a machine must cross before it can evolve. His answer was a machine with three parts: a *description* that encodes a blueprint, a *universal constructor* that reads any description and builds whatever it encodes, and a copy mechanism that duplicates descriptions. The constructor is generic. It does not know what it is building. It interprets whatever you feed it. Evolution happens by changing descriptions, not the constructor. -### Support Ticket (`test-fixtures/specs/ticket.ioa.toml`) +Temper follows this separation. The kernel is the constructor. It reads specifications and builds a running system from them: verification cascade, actor runtime, event sourcing, authorization engine, API generation. The kernel does not know whether you are building a project tracker or a deployment pipeline. It interprets whatever you feed it. -A back-and-forth workflow: agents reply, customers respond, the ticket bounces between InProgress and WaitingOnCustomer. The `replies` counter prevents resolution without engagement. Closed is terminal; Resolved can be reopened. 
+An agent that needs a project tracker writes a description of a project tracker. The kernel verifies the description, deploys it, and the agent operates through it. An agent that needs sprint planning writes a description of sprint planning. Same kernel, different description. -### Approval Workflow (`test-fixtures/specs/approval.ioa.toml`) +We call a verified, deployed description a *skill*. A skill bundles: -Boolean guards (`is_true has_reviewer`) prevent submission without a reviewer. Revise resets the boolean, forcing reassignment. The `approvals` counter proves approval happened. +- A natural language description of the capability ("issue tracking with projects, cycles, labels, and comments") for discovery and indexing +- Guidance for agents on how to use the skill: patterns, examples, constraints +- One or more state machine specifications defining statuses, transitions, guards, and invariants +- A data model defining the entity schema +- Authorization policies defining who can do what +- Integration declarations for external systems (sandboxed) -### Subscription Management (`test-fixtures/specs/subscription.ioa.toml`) +The natural language description and guidance are what agents and humans read. The specifications are what the kernel verifies and executes. -Payment failure escalation: Active → PastDue → Suspended → Expired. Self-transitions (EnableAutoRenew, DisableAutoRenew) modify booleans without changing status, demonstrating that state variables and status are orthogonal. Integration hooks fire on PaymentFailed and SuspendSubscription. +## 4. Verified Skills Still Need Governance -### Issue Tracker (`test-fixtures/specs/issue.ioa.toml`) +A verified skill is correct: its state machine does what the description says. But the agent operating through it can still do things it should not. It can reassign every issue to itself. It can access another agent's project. It can call an external API through an integration without anyone approving that access. 
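The default-deny posture can be sketched in a few lines. This is a toy stand-in, not the Cedar engine: policies here are hypothetical `(agent, action, resource)` triples with `"*"` as a wildcard, standing in for the scoped Cedar policies Temper generates on approval.

```python
# Toy default-deny check -- hypothetical policy triples, not real Cedar policies.
POLICIES = set()  # (agent, action, resource); "*" widens action or resource scope

def authorize(agent, action, resource):
    for pol_agent, pol_action, pol_resource in POLICIES:
        if (pol_agent == agent
                and pol_action in (action, "*")
                and pol_resource in (resource, "*")):
            return True
    return False  # default deny: in Temper this surfaces as a pending decision

# Nothing is permitted until the human approves a scope.
assert not authorize("agent-1", "Reassign", "Project/X")
POLICIES.add(("agent-1", "Reassign", "Project/X"))        # narrow approval
assert authorize("agent-1", "Reassign", "Project/X")
assert not authorize("agent-2", "Reassign", "Project/X")  # other agents still denied
```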
-Assignee tracking via boolean, review cycle counting. StartWork requires an assignee. Both RequestChanges and ApproveReview increment the review counter, giving a built-in velocity metric. +Verification handles correctness. It does not handle who can use the skill, or how. -Each of these took minutes to write and passed the full verification cascade on the first or second attempt. The harder question -- whether this pattern library covers enough of the real-world design space to be useful -- remains open. +Temper uses a default-deny authorization posture. When an agent attempts an action that no policy permits, the denial surfaces to the human: "Your agent tried to reassign issues in Project X. Allow?" The human approves with a scope: narrow (this agent, this action, this resource), medium (this agent, this action, any resource of this type), or broad (this agent, any action on this resource type). Temper generates the authorization policy and hot-loads it. -## 4. What Works +The human does not write policies from scratch or anticipate what the agent will need. The human responds as needs arise. Over time, the policy set converges on what the agent requires. -Everything below maps to working code. 441 tests pass across 16 crates. +## 5. 
Skills Must Evolve -- **IOA TOML parser** with six section types: automaton, state, action, invariant, liveness, integration -- **4-level verification cascade**: L0 SMT symbolic, L1 Stateright exhaustive, L2 deterministic simulation with fault injection, L3 proptest -- **Actor runtime** with Postgres event sourcing, hot-swap via SwapController, multi-tenant SpecRegistry -- **OData API** auto-generated from CSDL entity types -- **Conversational platform** that interviews developers, generates specs, and deploys through the cascade -- **Integration engine** (outbox pattern): webhooks dispatched asynchronously from the event journal -- **Automatic telemetry**: two-layer OTEL spans (HTTP + actor) with real durations verified in ClickHouse -- **Cedar ABAC** authorization evaluated per action -- **Evolution engine** that captures unmet user intents from production +An agent described a project tracker last week. This week the team needs sprint planning. The tracker's state machine does not cover it. Someone has to write a new description, re-verify, redeploy. Next month the team wants labels and priorities. Another description. The skills the agent built are frozen at the moment of creation. -Performance through the full OData HTTP stack with Postgres: ~28ns for rule evaluation, ~18ms per persisted action end-to-end, ~591ms for 100 concurrent checkouts (2,200 actions/sec). +Two paths address this. -## 5. What Doesn't Work (Yet) +**Agents create new skills.** When an agent encounters a problem the current skill set does not cover, it writes a new description. The kernel verifies and deploys it. An agent that needs sprint planning describes sprint planning. 
-| Gap | Why it matters | Current workaround |
-|---|---|---|
-| No floating-point state variables | Can't track prices as state | Use Postgres event payload fields |
-| No cross-entity invariants | Can't express "Shipped implies Payment captured" | Integration engine orchestrates |
-| No conditional effects | Can't do "if items > 5 then bulk discount" | Decompose into actions with guards |
-| Single-node only | No horizontal scaling | Redis traits designed but not wired |
-| No temporal guards | Can't do "if idle > 30 days" | Integration engine cron triggers |
-| No UI layer | API only | OData is a standard; any frontend works |
-| Spec gen needs an LLM | Interview agent requires Claude/GPT | Specs are hand-writable IOA TOML |
-| No string state variables | Status + counters + booleans only | Finite automaton by design; strings in payload |
+**Existing skills evolve through use.** The kernel records every agent action as an entity transition. Separately, the MCP bridge captures each agent's full execution trace: what the agent tried to do, what succeeded, what failed, what it gave up on. The GEPA (Guided Evolution of Pareto-optimal Artifacts) replays these execution traces against the current specs, clusters failure patterns, and proposes changes to existing descriptions. Suppose agents keep trying to assign issues to teams, but the tracker only supports individual assignees. The GEPA produces a spec diff that adds team assignment. The change goes through the verification cascade before it takes effect.
-Some of these are fundamental to the approach (finite automaton = no strings in state). Others are engineering work (Redis wiring, temporal guards). Being clear about which is which matters.
+Both paths are evolution in von Neumann's sense: changes to the descriptions, not the constructor. The kernel stays stable. The skills change.
-## 6. The Agent Operating Layer
+## 6.
Evolution Needs a Trust Gradient
-There's a stronger claim than "agents can generate specs" or "agents can consume the API." It's this: **Temper is the operating layer for autonomous agents.**
+If a human must approve every description change, the system cannot scale. If the system evolves autonomously, you lose control.
-Agents today run with whatever tools they're given. They call APIs directly, write to databases, execute code in sandboxes. There is no shared governance model. There is no formal verification of what an agent is about to do. There is no audit trail that connects an agent's intent to its effects. When something goes wrong, you grep through logs.
+The answer is a spectrum. On one end, the human approves every significant decision. On the other end, the system operates within pre-approved boundaries. You move along this spectrum as trust builds: as the policy set matures and the execution traces accumulate, the scope of autonomy expands within explicit boundaries, backed by data.
-The thesis: every state-changing action an agent takes should flow through a governed, verified, auditable layer. Not optionally. By design.
+That is the chain: **unverified → verified → governed → evolving → trust-calibrated**. Each step solves a problem the previous step could not. Each step depends on the one before it. Governance assumes verification. Evolution assumes governance. Trust calibration assumes evolution data.
-### The agent is both developer and operator
+## 7. Interpretability
-In the personal assistant and enterprise employee use cases, the agent builds its own specifications. When an agent needs to execute a multi-step plan -- process an expense report, coordinate a deployment, manage a customer interaction -- it generates an IOA specification describing the states, transitions, guards, and integrations of that plan. Temper verifies the spec through the four-level cascade before the agent can execute through it.
The agent's plan itself is a verified state machine. +A system that evolves its own descriptions raises a concern: changes humans cannot understand or control. Temper tries to address this at several points, from the spec format down to individual agent actions. -The agent then operates through the verified spec: calling actions, transitioning state, triggering integrations. The spec is the contract. The verification cascade is the proof. The runtime enforces the contract on every action. +**The spec is the documentation.** In a traditional system, the code is the source of truth and documentation trails behind it. In Temper, the description is both. You can open a spec file and read the statuses, transitions, guards, and invariants. That file is what the kernel verifies and what the runtime executes. There is no separate implementation that could diverge from the spec. -### The human is the policy setter +**Verification produces counterexamples, not just pass/fail.** When the cascade finds a property violation, it returns a concrete trace: the exact sequence of actions that leads to the broken state. "If an agent calls Assign, then StartWork, then SubmitForReview, then ApproveReview without a reviewer assigned, the invariant 'reviewed implies reviewer exists' is violated." You debug at the domain level, not the code level. -Cedar policies define what agents can and cannot do. The default posture is deny-all. When an agent attempts something not yet permitted, the denial surfaces to the human: "Your agent tried to call the Stripe API and was blocked. Allow?" The human approves with a scope -- narrow, medium, or broad -- and Temper generates the Cedar policy and hot-loads it. Over time, the policy set converges on what the agent actually needs. The human doesn't anticipate permissions upfront; they respond as needs arise. 
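The counterexample idea can be shown with a toy checker — a breadth-first search over a hand-written issue workflow, not Temper's cascade. The action names mirror the example above; the guards and state shape are invented, and `ApproveReview` deliberately omits the reviewer check so the search has a violation to find.

```python
# Toy model checker (illustrative, not Temper's verifier): exhaustively
# search a small issue workflow for the shortest action sequence that
# violates the invariant "reviewed implies reviewer exists".
from collections import deque

ACTIONS = {
    "Assign":          lambda s: {**s, "assignee": True},
    "StartWork":       lambda s: {**s, "status": "InProgress"} if s["assignee"] else None,
    "SubmitForReview": lambda s: {**s, "status": "InReview"} if s["status"] == "InProgress" else None,
    "SetReviewer":     lambda s: {**s, "reviewer": True},
    # Bug under test: approval does not require a reviewer to be set.
    "ApproveReview":   lambda s: {**s, "status": "Reviewed"} if s["status"] == "InReview" else None,
}

def invariant(s):
    # "reviewed implies reviewer exists"
    return s["status"] != "Reviewed" or s["reviewer"]

def find_counterexample(initial):
    """Breadth-first search over reachable states; returns the shortest
    violating trace, or None if the invariant holds everywhere."""
    seen, queue = set(), deque([(initial, [])])
    while queue:
        state, trace = queue.popleft()
        if not invariant(state):
            return trace
        key = tuple(sorted(state.items()))
        if key in seen:
            continue
        seen.add(key)
        for name, step in ACTIONS.items():
            nxt = step(state)
            if nxt is not None:
                queue.append((nxt, trace + [name]))
    return None

print(find_counterexample({"status": "Open", "assignee": False, "reviewer": False}))
# → ['Assign', 'StartWork', 'SubmitForReview', 'ApproveReview']
```

A real checker works at the level of the verified spec, but the output shape is the point: a concrete, domain-level action sequence rather than a bare "failed".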
+**Every action is an event.** The kernel persists every entity transition as an immutable event: the action name, the agent that performed it, the before and after state, and the authorization decision that allowed it. You can reconstruct the full history of any entity from its event journal. You can answer "who moved this issue to Done, when, and which policy permitted it?" -### Everything is recorded +**Every authorization decision is recorded.** Each allow or deny captures the policy that applied, the principal (which agent), the action, and the resource. Denied actions create pending decisions for the human. The authorization history shows how the policy set developed over time and why each permission exists. -Every action an agent takes through Temper is a state transition. Every transition is persisted with the agent's identity, the before/after state, whether authorization succeeded or was denied, and the Cedar policy that governed the decision. This gives you an audit trail, agent self-awareness (the agent can query its own state), and cross-agent visibility (multiple agents sharing a Temper instance see each other's state changes). +**Agent trajectories are captured.** The MCP bridge records the agent's session trajectory: every tool call, every decision, every success and failure. The Temper-native agent captures even richer trajectories, since every agent action is itself a governed entity transition. These trajectories are stored separately from entity events. The GEPA replays them against current specs to find gaps. A human can inspect any agent session to understand what the agent attempted and where it got stuck. -### External access is governed +**Evolution changes are traceable.** The O-P-A-D-I record chain (Observation, Problem, Analysis, Decision, Impact) connects every spec change back to the observation that motivated it. You can ask "why does this state exist?" 
and trace the answer: an observation from agent execution traces, the problem the GEPA identified, the analysis it performed, the decision a human approved, and the measured impact after deployment. -When agents need to call outside systems, they do so through integrations declared in the IOA spec. Cedar policies govern which external calls are permitted. WASM modules for integrations run in a sandbox. In the vision, these modules can be reviewed by a security agent or formally verified -- the same way state machine specs are verified today. +**The Temper-native agent takes this further.** When agents run as entities inside Temper, their state machines, budgets, and lifecycle are governed by the same kernel. Every agent action is an event. You can pause an agent, resume it from any point in its history, or replay its entire execution. The agent becomes as inspectable as the skills it operates through. -### The interface is a REPL +## 8. Where This Stands -The vision for how agents interact with Temper is a sandboxed code execution environment -- in the style of Symbolica's Agentica or Cloudflare's Code Mode. Agents write code against a typed API surface; the sandbox mediates all external access through Temper. The REPL is the only tool the agent is given for state-changing operations. +Temper is version 0.1.0. The constructor works. The description format is stabilizing. The evolution loop runs end-to-end in testing. 950+ tests pass across 25 crates. -### What this means +**The constructor can do this today.** Parse a description (I/O Automaton spec + data model + authorization policies). Run a four-level verification cascade. Deploy it as a live actor with event sourcing, a generated API, and authorization enforcement. Hot-reload when the description changes. Record every entity transition. Capture full agent execution traces through the MCP bridge. Run the GEPA against those traces to propose description changes. 
Enforce cross-entity invariants (both hard constraints and eventual consistency with bounded convergence). Surface denied actions to the human for approval. -Agents generating specifications is already possible -- the spec submission API and verification cascade exist today. Cedar default-deny governance, pending decision approval flows, per-agent audit trails, and the observe dashboard for agent activity -- these are built and working. The REPL interface and security review agents are the vision, not yet implemented. +**Three pre-built skills ship with the platform:** project management (5 entity types), filesystem (4 entity types), agent orchestration (3 entity types). Agents can install them, operate through them, and propose changes. Agents can also submit new descriptions. -The question is not whether agents need governance. The question is whether governance can be formal, verified, and transparent rather than ad hoc. That is what Temper is for. +**The constructor cannot do this yet.** No floating-point state variables (prices live in payload fields, not state). No conditional effects ("if items > 5 then discount" requires decomposing into separate guarded actions). Single-node only (Redis traits are designed but not wired). No temporal guards ("if idle > 30 days" requires scheduled actions or integration engine cron triggers). Some of these are fundamental to finite automata. Others are engineering work. -## 7. The Evolution Loop +**Temper is being built bottom-up.** Each layer enables the next. -The most interesting part might be what happens after deployment. +| Layer | What it does | Status | +|-------|-------------|--------| +| **6. Agent Execution** | Agents as entities with their own state machines, budgets, and lifecycle. | In Progress | +| **5. Pure Temper Agent** | Agent's only tool is Temper. Every action mediated. | In Progress | +| **4. Harness Composition** | Agents describe harnesses as specs. | In Progress | +| **3. 
Integration Framework** | External APIs as sandboxed WASM modules, governed by authorization. | In Progress | +| **2. Temper as Filesystem** | Entity persistence replaces markdown files and JSON blobs. | In Progress | +| **1. Skills** | Agents write descriptions. The kernel verifies and deploys them. Others consume them through the generated API. | Done | +| **Foundation: Kernel** | Spec interpreter, verification cascade, actor runtime, authorization, event sourcing, evolution engine. | Done | -Production usage generates trajectory data. When a user tries an action the current spec doesn't support, the system captures it as an observation record. These observations surface to the developer as structured proposals: "users are trying to split orders into multiple shipments; here's a spec diff that would enable it." The developer approves or rejects. Approved changes run through the verification cascade and deploy via hot-swap. - -This creates a feedback loop between production behavior and system evolution. The developer stays in the approval seat, but the system does the discovery work. Over time, the specs converge toward what users actually need rather than what someone imagined at design time. - -The O-P-A-D-I record chain (Observation, Proposal, Approval, Deployment, Impact) provides a complete audit trail for every behavioral change. It's early, but the loop is operational in the current implementation. - ---- - -*This document describes the current state of the project. The five verified specs, the benchmark numbers, and the test counts reflect what exists today. The agent operating layer -- Cedar governance, pending decisions, audit trails -- is built and working. The REPL interface and security review agents are the next things to build. Whether this pattern holds across a broader set of real-world agent deployments is the next thing to find out.* +Today, agents interact with Temper through an MCP bridge. 
The next layers close that gap: agents as first-class entities inside the constructor, then agents that compose harnesses as descriptions, then agents whose only tool is Temper itself. Whether this pattern holds across a broader set of real-world agent deployments is the next thing to find out.