A deterministic AI execution runtime for production-grade AI workloads.
This repository provides a reference implementation of a distributed, state-driven runtime for executing AI workflows with deterministic DAG orchestration, context resolution, Redis Lua coordination, retry/recovery, retention/compaction, distributed concurrency control, execution control state, replay validation, correlated metrics and tracing, execution-correlated decision ledger, and executable enterprise demo scenarios.
The current runtime foundations are intentionally designed as the base for a broader AI execution and MLOps-oriented platform.
The latest major updates focused on turning the runtime from a DAG executor into a controllable, observable, distributed execution platform.
| Area | Summary |
|---|---|
| Distributed multi-runtime-instance execution | Added foundations for multiple runtime instances and workers to coordinate through shared Redis-backed execution state. |
| Runtime control plane foundation | Added adapter-neutral control-plane foundations for replay, execution control, local runtime queue control, runtime instance registry/control, and run admission decisions. |
| Distributed concurrency and throttling demo | Added an executable throttling-100 enterprise demo scenario with provider-level concurrency control, realtime throttling visibility, Redis lease-based admission control, randomized provider distribution, and bounded provider capacity under worker pressure. |
| Execution control state | Added durable ExecutionId-level pause, resume, cancel, waiting-for-input, and human input submission. |
| Runtime queue control | Added RunId-level queue pause/resume, queued cancellation, running cancellation bridge, and hot enqueue support. |
| Execution replay validation | Added Replay API foundations that load persisted snapshots by ExecutionId, validate deterministic fingerprints, validate dependency graph and step state, optionally restore runtime state, and expose replay metadata, ledger events, and timeline events. |
| Execution-correlated decision ledger | Added durable execution-correlated ledger events for execution lifecycle, run lifecycle, queue control, claim acquisition, retry, recovery, concurrency, policy evaluation, snapshot persistence, replay lifecycle, and runtime control observability. |
| Correlated metrics and tracing storage modes | Added runtime correlation for metrics and traces with configurable Disabled, Memory, Mongo, and MemoryAndMongo storage modes. |
| Context resolution and helpers | Added a dedicated helper layer for input bindings, previous step outputs, payload rehydration, provider/model/operation context, policy context, RAG context, and replay-safe helpers. |
| Documentation restructure | Completed Phase 0 V1 with a shorter README, preserved runtime internals, documentation index, roadmap, enterprise readiness matrix, ecosystem comparison, and focused runtime documentation under docs/ai/. |
| Long-term platform direction | Added a dedicated road-to-MLOps direction document to describe how the runtime foundations can evolve toward AI execution infrastructure and MLOps-oriented runtime operations. |
For detailed changes, see CHANGELOG.md, docs/index.md, and docs/road-to-mlops.md.
Deterministic AI Runtime is a .NET runtime for executing complex AI workflows as controlled, observable, recoverable, replayable, and auditable distributed executions.
It is designed for workloads such as:
- LLM orchestration
- RAG pipelines
- tool execution
- decision workflows
- long-running AI processes
- multi-step distributed execution
The runtime treats AI orchestration as a systems problem, not only as a prompt engineering problem.
It provides a state-driven execution layer where:
- workflows are modeled as DAGs
- workers are stateless
- Redis stores hot execution state
- Redis Lua scripts enforce atomic coordination
- MongoDB stores durable payloads and snapshots
- context helpers resolve inputs, payloads, provider metadata, and policy context
- policies control retry, retention, and concurrency
- metrics, traces, and ledger events share runtime correlation
- executions can be replay-validated from persisted snapshots
- execution can be paused, resumed, cancelled, or blocked for human input
The project should be read as an AI execution infrastructure foundation. The runtime core is already substantial, while the longer-term direction is to evolve toward a broader AI operations and MLOps-oriented platform.
Most AI projects focus on prompts, agents, RAG, embeddings, or model calls.
That is enough for prototypes.
Once AI moves to production, the hard problem becomes execution:
- How do you coordinate multiple workers?
- How do you avoid duplicate execution?
- How do you recover after crashes?
- How do you replay an execution?
- How do you control retries?
- How do you throttle providers and models?
- How do you keep memory bounded?
- How do you resolve context safely across steps, payloads, providers, and policies?
- How do you pause, resume, or cancel safely?
- How do you support human-in-the-loop workflows?
- How do you prove deterministic convergence?
This runtime exists to address those production execution concerns.
Production AI workloads are no longer single prompts running in isolation.
They become distributed execution systems with:
- multiple pipeline steps
- parallel branches
- dependencies
- retries
- external providers
- large payloads
- compacted or externalized state
- context resolution across previous step outputs
- failure recovery
- operational controls
- audit and replay requirements
- multiple workers or runtime instances
Without a real execution runtime, these systems often become fragile:
- hidden retry loops
- duplicated work
- lost state
- corrupted execution progress
- unbounded memory growth
- unclear ownership
- inconsistent input/context reconstruction
- poor or fragmented observability
- impossible replay
This project explores what an AI execution runtime should look like when reliability, determinism, context resolution, and distributed coordination are treated as first-class design requirements.
| Capability | Status | Summary |
|---|---|---|
| Deterministic DAG execution | Implemented | Workflows execute through dependency-aware DAG state. |
| Redis hot state | Implemented | Active execution state is stored in Redis. |
| Redis Lua atomic coordination | Implemented | Critical transitions use Lua-backed atomic operations. |
| Distributed workers | Implemented | Workers can claim and execute steps safely. |
| Multi-runtime-instance execution foundations | Implemented | Runtime instances can coordinate through shared Redis-backed execution state. |
| Context resolution and helpers | Foundation available | Input bindings, previous step outputs, provider metadata, policy context, and payload rehydration are resolved through helper layers. |
| Deterministic convergence | Implemented | Final state is derived from state transitions, not worker ordering. |
| Retry and recovery | Implemented | Retry state, waiting windows, and stale running-step recovery are separated. |
| Retention and compaction | Implemented | Hot state can be compacted/evicted while payloads remain resolvable. |
| Distributed concurrency and throttling | Implemented | Redis ZSET leases enforce global, pipeline, step, execution, instance, provider, model, and operation limits. |
| Enterprise throttling scenario | Implemented | throttling-100 demonstrates provider-level distributed throttling with realtime visibility and deterministic convergence. |
| Policy-driven execution | Implemented | Retry, retention, and concurrency use configurable policy definitions. |
| Execution control state | Implemented | ExecutionId-level pause, resume, cancel, waiting-for-input, and human input submission. |
| Runtime queue control | Implemented | RunId-level queue pause/resume, queued cancellation, running cancellation bridge, and hot enqueue. |
| Runtime control plane foundation | Implemented | Adapter-neutral facades expose replay, execution control, local queue control, runtime instance control, and admission foundations for API/MCP/CLI/dashboard/Kubernetes adapters. |
| Runtime instance registry and control | Implemented | Runtime instances can register, heartbeat, expose queue capacity, be listed, marked draining, or unregistered. |
| Run admission / slot decisions | Implemented | Admission can assign runs to an available runtime instance, request scale-out, queue globally later, or reject according to policy. |
| RunId vs ExecutionId separation | Implemented | Controller lifecycle identity is separated from durable DAG execution identity. |
| Snapshot and Replay API foundations | Implemented | Terminal snapshots, replay metadata, deterministic fingerprint validation, audit-only replay, restore replay, ledger loading, and timeline loading are available. |
| Execution-correlated decision ledger | Implemented | Durable correlated ledger events exist for execution lifecycle, run lifecycle, queue control, claims, steps, retry, recovery, policy evaluation, concurrency, execution control, human input, snapshots, storage failures, replay lifecycle, retention, compaction, and finalization. |
| Observability, metrics, and tracing | Foundation available | Runtime metrics, trace recording, realtime events, correlated trace timelines, and configurable Memory/Mongo/MemoryAndMongo persistence exist; production-grade OpenTelemetry integration and dashboarding remain planned. |
| MLOps-oriented platform evolution | Direction defined | Long-term platform direction is documented in docs/road-to-mlops.md. |
| Durable decision ledger | Foundation available | Execution-correlated runtime ledger foundations are implemented and aligned with runtime correlation, including replay lifecycle visibility. |
| Public API / SDK polish | Planned | Future work for cleaner external developer experience. |
Client / API / Controller
|
v
Runtime Orchestration Layer
|
v
Pipeline Definition + DAG Resolution
|
v
Context Resolution and Helper Layer
|
+--> Input binding resolution
+--> Previous step output resolution
+--> Payload rehydration
+--> Provider / model / operation context
+--> Policy context
+--> Concurrency context
+--> RAG retrieval / merge / compose context
|
v
DAG Execution Engine
|
+--> Execution Control Gate
|
+--> Concurrency / Throttling Engine
|
v
Redis Hot State + Redis Lua Coordination
|
+--> Atomic step claims
+--> Retry scheduling
+--> Worker recovery
+--> Control state
+--> Distributed leases
|
v
Stateless Workers / Runtime Instances
|
v
Step Executors
|
+--> LLM
+--> RAG
+--> Tools
+--> Decisions
|
v
MongoDB Payloads / Snapshots / Replay Validation
|
v
Execution-Correlated Decision Ledger
|
v
Correlated Observability / Metrics / Tracing Foundations
The runtime is intentionally split into layers:
- orchestration starts and manages executions
- DAG state determines what can run
- context helpers resolve inputs, payloads, metadata, and policy context
- Redis coordinates distributed workers
- policies control runtime behavior
- workers execute claimed steps
- persistence stores large payloads and snapshots
- replay validates deterministic reconstruction from persisted snapshots
- correlated observability records runtime behavior across ledger, metrics, traces, workers, and executions
The project is designed around production questions that enterprise AI systems must answer.
| Enterprise Question | Runtime Direction |
|---|---|
| What happens if a worker crashes? | Running steps can be recovered through stale-claim detection and Redis-backed recovery. |
| How do you prevent duplicate executions? | Atomic Redis Lua claims and claim tokens enforce single step ownership. |
| How do you replay a workflow? | Terminal snapshots, replay metadata, deterministic fingerprint validation, audit-only replay, and restore replay provide replay foundations. |
| How do you audit an AI decision? | Execution-correlated decision ledger events, correlated metrics/traces, execution state, step results, retry metadata, recovery state, snapshots, replay reports, and observability provide audit foundations. |
| How do you limit concurrency? | Distributed Redis ZSET leases and policy-driven throttling enforce limits. |
| How do you resolve execution context safely? | Context helpers resolve inputs, step outputs, payload references, provider metadata, and policy context consistently. |
| How do you pause/resume/cancel safely? | Execution control state blocks new claims and coordinates deterministic finalization. |
| How do you control human-in-the-loop? | WaitingForInput and SubmitHumanInput are supported through durable control state. |
| How do you keep memory/state bounded? | Retention, compaction, eviction, and payload externalization control hot state size. |
| How do you coordinate multiple runtime instances? | Shared Redis state, Lua coordination, leases, and deterministic convergence enable coordination. |
| How do you prove deterministic convergence? | Integration tests and enterprise demo scenarios validate completion, replay fingerprints, distributed execution, throttling, recovery behavior, atomic retention, compaction consistency, ledger visibility, and trace timeline visibility. |
| How does this evolve toward AI operations and MLOps? | Runtime foundations are designed to support future AI execution control planes, governance, observability, replay, and operational workflows. |
For the detailed enterprise matrix, see docs/enterprise-readiness.md.
The runtime includes an adapter-neutral control-plane foundation for operational runtime control.
The control plane currently exposes foundations for:
- replay and audit control
- execution control
- local runtime queue control
- runtime instance registration and heartbeat
- runtime instance visibility and draining
- run admission / slot decisions
- future API, MCP, CLI, dashboard, and Kubernetes adapters
The control plane is also split into two identity levels.
RunId
= controller / queue / background job lifecycle id
ExecutionId
= authoritative durable DAG execution id
This separation allows the runtime to manage queue-level work without confusing it with durable execution state.
Implemented execution control capabilities include:
- pause execution
- resume execution
- cancel execution
- wait for human input
- submit human input
- block new claims based on control state
- cancellation finalization override
Implemented controller queue capabilities include:
- pause queue
- resume queue
- cancel queued run
- cancel running run by bridging to ExecutionId cancellation
- hot enqueue while controller is running
- hot enqueue while queue is paused
This makes the runtime controllable, not only executable.
Implemented runtime instance capabilities include:
- register runtime instance
- heartbeat runtime instance
- get runtime instance status
- list runtime instances
- expose local queue pressure
- expose available run slots
- mark runtime instance as draining
- unregister runtime instance
These foundations prepare the runtime for Kubernetes pod visibility, shared admission, dashboards, and future autoscaling.
Implemented admission capabilities include:
- assign a run to an available runtime instance
- prefer a requested runtime instance when available
- select the least-loaded available instance
- request scale-out when no instance has capacity
- queue globally later when shared queue fallback is enabled
- reject when no capacity or fallback exists
Admission does not enqueue runs yet.
It only decides what should happen next.
For details, see docs/ai/runtime-control-plane.md.
The runtime uses Redis as a hot state and coordination layer.
Redis is not only used as a cache. It is used as the active distributed execution state.
Critical operations include:
- creating execution state
- claiming ready DAG steps
- validating claim ownership
- completing steps
- failing steps
- scheduling retries
- recovering stale running steps
- enforcing distributed concurrency leases
- storing execution control state
Redis Lua scripts are used for atomic transitions where race conditions must be avoided.
This allows multiple workers or runtime instances to cooperate safely without direct worker-to-worker communication.
The runtime includes a helper layer that connects declarative configuration to concrete execution behavior.
This layer resolves:
- input bindings from execution state
- previous step outputs
- compacted or externalized payloads
- provider, providerKey, model, and operation metadata
- retry policy context
- retention policy context
- concurrency context
- RAG retrieval, merge, and compose context
- replay-safe comparison data
This keeps the DAG engine focused on orchestration and prevents plugins, policies, and providers from manually reconstructing raw execution state.
For details, see docs/ai/context-resolution-and-helpers.md.
The runtime includes foundations for production visibility and replayability:
- execution lifecycle metrics
- retry and recovery metrics
- retention metrics
- resolver metrics
- storage metrics
- context resolution diagnostics
- concurrency admission diagnostics
- realtime runtime events
- readable console runtime events
- execution-correlated decision ledger
- durable runtime lifecycle audit events
- claim, concurrency, retry, and recovery audit visibility
- queue and execution control audit visibility
- atomic retention and compaction auditability
- snapshot persistence audit events
- replay lifecycle ledger events
- replay metadata
- replay report generation
- replay snapshot loading
- replay deterministic fingerprint validation
- replay dependency graph validation
- replay step state validation
- replay payload reference validation
- replay ledger event loading
- replay trace timeline loading
- replay diagnostic output for ledger and timeline inspection
- correlated trace recording foundations
- correlated metric and trace storage modes
- in-memory, MongoDB, and MemoryAndMongo observability persistence
- terminal snapshots
- replay restoration
- deterministic replay fingerprint validation
Runtime metrics and traces can be configured as Disabled, Memory, Mongo, or MemoryAndMongo. This allows local diagnostics, durable MongoDB-backed inspection, or both at the same time while keeping the execution runtime independent from observability storage choices.
Replay can validate persisted executions without re-running LLMs, tools, external providers, or side effects. It can expose replay metadata, decision ledger events, and trace timeline events when requested.
OpenTelemetry-style distributed tracing, richer dashboards, HTTP replay APIs, replay audit tooling, and advanced decision lineage remain roadmap items.
This project is under active development.
It should be treated as an advanced reference implementation and evolving AI infrastructure project, not as a polished commercial product.
The strongest areas today are:
- deterministic DAG execution
- Redis-backed distributed state
- Redis Lua atomic coordination
- context resolution and helper foundations
- retry/recovery semantics
- retention/compaction
- distributed concurrency and throttling
- executable distributed throttling scenario
- execution control state
- runtime queue control
- runtime control-plane foundations
- runtime instance registry and control
- run admission / slot decisioning
- queue and execution control observability
- replay/snapshot validation foundations
- replay metadata, ledger, and trace timeline diagnostics
- correlated observability, tracing, metrics, and realtime logging foundations
- integration-test-driven validation
Areas still evolving include:
- public API/SDK polish
- HTTP replay APIs and controller abstractions
- OpenTelemetry/exporter polish for tracing and metrics
- operational dashboarding
- Kubernetes deployment assets
- real enterprise sample workflows
- MLOps-oriented platform capabilities
- production documentation split
A local enterprise-oriented demo is available in demo/enterprise-runtime/.
The demo is designed to prove that the runtime behaves like distributed AI execution infrastructure, not a toy agent framework.
It currently includes:
- Docker Compose infrastructure for Redis and MongoDB
- scenario documentation for enterprise runtime behaviors
- an external sample step plugin under
Samples/Multiplexed.Sample.External.Plugins.Steps - a JSON pipeline at
demo/enterprise-runtime/pipelines/enterprise-demo-pipeline.json - interactive console scenario selection
- distributed runtime worker participation
- realtime readable runtime logs
- live progress output
- execution pause, resume, and cancel controls
- retry recovery summaries
- retention and hot-state summaries
- replay validation for supported scenarios
- replay metadata, ledger, and timeline diagnostics in integration tests
- distributed provider throttling through the
throttling-100scenario RunIdandExecutionIdseparation- terminal completion through the controller path
- execution-correlated runtime ledger visibility
- correlated metrics and tracing diagnostics
- MemoryAndMongo observability validation
- queue and execution control audit events
- retry, recovery, and concurrency ledger validation
- atomic retention and compaction auditability
The current executable console scenarios are:
json
chaos-100
chaos-500
throttling-100
The throttling-100 scenario demonstrates:
- distributed provider throttling
- Redis lease-based concurrency admission
- realtime
[THROTTLED]visibility - randomized provider distribution with OpenAI as the throttled target
- bounded provider capacity under worker pressure
- deterministic convergence despite throttling delays
The demo validates the controller execution path, distributed worker participation, runtime controls, realtime logging, correlated observability, and terminal completion behavior. It is intended to show distributed AI execution infrastructure, not only a simple batch or in-memory execution path.
Future demo work will expand further into crash recovery, human-in-the-loop, advanced replay workflows, Kubernetes deployment assets, real enterprise sample workflows, and broader AI operations/MLOps-oriented runtime capabilities.
The roadmap is organized into phases.
| Phase | Focus | Status |
|---|---|---|
| Completed | Core runtime foundations already implemented | Implemented / validated by tests |
| Phase 0 | README review and documentation restructure | Completed (V1) |
| Phase 1 | Enterprise demo | Completed (V1) - controller demo, distributed workers, runtime controls, chaos scenarios, retention/replay, and throttling scenario validated |
| Phase 2 | Real enterprise sample | Planned |
| Phase 3 | Correlated observability, tracing, and metrics | Foundations available / active polish |
| Phase 4 | Kubernetes deployment demo | Planned |
| Phase 5 | Public API / SDK polish | Planned |
| Phase 6 | Deterministic Replay Engine and Audit Foundations | Completed (V1) |
| Phase 7 | Replay Controller, HTTP APIs, Dashboard, and Operational Tooling | Planned |
| Phase 8 | Cost and Provider Governance | Planned |
| Phase 9 | Articles and public positioning | Planned |
The roadmap above tracks the current runtime and enterprise demo evolution.
The longer-term platform direction is tracked separately in docs/road-to-mlops.md, which describes how these deterministic execution foundations can evolve toward broader AI execution infrastructure, governance, observability, replay, and MLOps-oriented runtime operations.
For the detailed roadmap, see docs/roadmap.md.
The full documentation map is available here:
docs/index.md— Documentation index and reading guide.docs/runtime-internals.md— Complete technical reference preserved from the original README.docs/enterprise-readiness.md— Enterprise readiness matrix.docs/comparison-existing-tools.md— Ecosystem positioning and comparison with existing tools.docs/roadmap.md— Project roadmap.docs/road-to-mlops.md— Long-term evolution from deterministic AI runtime foundations toward a broader AI execution and MLOps-oriented platform.docs/ai/runtime-control-plane.md— Runtime control-plane foundation for replay, execution control, runtime queue control, runtime instance visibility/control, and run admission.demo/enterprise-runtime/README.md— Local enterprise runtime demo using Docker Compose, Redis, MongoDB, external demo steps, controller execution, distributed workers, and scenario documentation.demo/enterprise-runtime/scenarios/06-distributed-throttling.md— Executable distributed throttling scenario documentation.demo/enterprise-runtime/scenarios/08-deterministic-convergence.md— Deterministic convergence scenario documentation.
Focused AI runtime documentation:
docs/ai/architecture-overview.mddocs/ai/distributed-execution.mddocs/ai/execution-control-state.mddocs/ai/runtime-queue-control.mddocs/ai/runtime-control-plane.mddocs/ai/retry-and-recovery.mddocs/ai/retention-and-compaction.mddocs/ai/distributed-concurrency-throttling.mddocs/ai/replay-and-audit.mddocs/ai/observability.mddocs/ai/observability-tracing.mddocs/ai/runtime-metrics.mddocs/ai/execution-correlated-ledger.mddocs/ai/testing-strategy.mddocs/ai/config-driven-runtime.mddocs/ai/policy-driven-execution.mddocs/ai/context-resolution-and-helpers.mddocs/ai/step-plugins.mddocs/ai/rag-pipelines.md
These files were extracted progressively from docs/runtime-internals.md.
This project is licensed under the Business Source License 1.1 (BSL).
- Free for development, testing, and internal use
- Commercial production use requires a license
- Automatically converts to Apache 2.0 on 2029-01-01
See the repository license file for full terms.