Deterministic AI Runtime

A deterministic AI execution runtime for production-grade AI workloads.

This repository provides a reference implementation of a distributed, state-driven runtime for executing AI workflows. It covers deterministic DAG orchestration, context resolution, Redis Lua coordination, retry and recovery, retention and compaction, distributed concurrency control, execution control state, replay validation, correlated metrics and tracing, an execution-correlated decision ledger, and executable enterprise demo scenarios.

The current runtime foundations are intentionally designed as the base for a broader AI execution and MLOps-oriented platform.

Latest Updates

The latest major updates focused on turning the runtime from a DAG executor into a controllable, observable, distributed execution platform.

Area	Summary
Distributed multi-runtime-instance execution	Added foundations for multiple runtime instances and workers to coordinate through shared Redis-backed execution state.
Runtime control plane foundation	Added adapter-neutral control-plane foundations for replay, execution control, local runtime queue control, runtime instance registry/control, and run admission decisions.
Distributed concurrency and throttling demo	Added an executable `throttling-100` enterprise demo scenario with provider-level concurrency control, realtime throttling visibility, Redis lease-based admission control, randomized provider distribution, and bounded provider capacity under worker pressure.
Execution control state	Added durable `ExecutionId`-level pause, resume, cancel, waiting-for-input, and human input submission.
Runtime queue control	Added `RunId`-level queue pause/resume, queued cancellation, running cancellation bridge, and hot enqueue support.
Execution replay validation	Added Replay API foundations that load persisted snapshots by `ExecutionId`, validate deterministic fingerprints, validate dependency graph and step state, optionally restore runtime state, and expose replay metadata, ledger events, and timeline events.
Execution-correlated decision ledger	Added durable execution-correlated ledger events for execution lifecycle, run lifecycle, queue control, claim acquisition, retry, recovery, concurrency, policy evaluation, snapshot persistence, replay lifecycle, and runtime control observability.
Correlated metrics and tracing storage modes	Added runtime correlation for metrics and traces with configurable `Disabled`, `Memory`, `Mongo`, and `MemoryAndMongo` storage modes.
Context resolution and helpers	Added a dedicated helper layer for input bindings, previous step outputs, payload rehydration, provider/model/operation context, policy context, RAG context, and replay-safe helpers.
Documentation restructure	Completed Phase 0 V1 with a shorter README, preserved runtime internals, documentation index, roadmap, enterprise readiness matrix, ecosystem comparison, and focused runtime documentation under `docs/ai/`.
Long-term platform direction	Added a dedicated road-to-MLOps direction document to describe how the runtime foundations can evolve toward AI execution infrastructure and MLOps-oriented runtime operations.

For detailed changes, see CHANGELOG.md, docs/index.md, and docs/road-to-mlops.md.

Overview

Deterministic AI Runtime is a .NET runtime for executing complex AI workflows as controlled, observable, recoverable, replayable, and auditable distributed executions.

It is designed for workloads such as:

LLM orchestration
RAG pipelines
tool execution
decision workflows
long-running AI processes
multi-step distributed execution

The runtime treats AI orchestration as a systems problem, not only as a prompt engineering problem.

It provides a state-driven execution layer where:

workflows are modeled as DAGs
workers are stateless
Redis stores hot execution state
Redis Lua scripts enforce atomic coordination
MongoDB stores durable payloads and snapshots
context helpers resolve inputs, payloads, provider metadata, and policy context
policies control retry, retention, and concurrency
metrics, traces, and ledger events share runtime correlation
executions can be replay-validated from persisted snapshots
execution can be paused, resumed, cancelled, or blocked for human input

The project should be read as an AI execution infrastructure foundation. The runtime core is already substantial, while the longer-term direction is to evolve toward a broader AI operations and MLOps-oriented platform.

Why This Exists

Most AI projects focus on prompts, agents, RAG, embeddings, or model calls.

That is enough for prototypes.

Once AI moves to production, the hard problem becomes execution:

How do you coordinate multiple workers?
How do you avoid duplicate execution?
How do you recover after crashes?
How do you replay an execution?
How do you control retries?
How do you throttle providers and models?
How do you keep memory bounded?
How do you resolve context safely across steps, payloads, providers, and policies?
How do you pause, resume, or cancel safely?
How do you support human-in-the-loop workflows?
How do you prove deterministic convergence?

This runtime exists to address those production execution concerns.

The Production AI Execution Problem

Production AI workloads are no longer single prompts running in isolation.

They become distributed execution systems with:

multiple pipeline steps
parallel branches
dependencies
retries
external providers
large payloads
compacted or externalized state
context resolution across previous step outputs
failure recovery
operational controls
audit and replay requirements
multiple workers or runtime instances

Without a real execution runtime, these systems often become fragile:

hidden retry loops
duplicated work
lost state
corrupted execution progress
unbounded memory growth
unclear ownership
inconsistent input/context reconstruction
poor or fragmented observability
impossible replay

This project explores what an AI execution runtime should look like when reliability, determinism, context resolution, and distributed coordination are treated as first-class design requirements.

Core Capabilities

Capability	Status	Summary
Deterministic DAG execution	Implemented	Workflows execute through dependency-aware DAG state.
Redis hot state	Implemented	Active execution state is stored in Redis.
Redis Lua atomic coordination	Implemented	Critical transitions use Lua-backed atomic operations.
Distributed workers	Implemented	Workers can claim and execute steps safely.
Multi-runtime-instance execution foundations	Implemented	Runtime instances can coordinate through shared Redis-backed execution state.
Context resolution and helpers	Foundation available	Input bindings, previous step outputs, provider metadata, policy context, and payload rehydration are resolved through helper layers.
Deterministic convergence	Implemented	Final state is derived from state transitions, not worker ordering.
Retry and recovery	Implemented	Retry state, waiting windows, and stale running-step recovery are separated.
Retention and compaction	Implemented	Hot state can be compacted/evicted while payloads remain resolvable.
Distributed concurrency and throttling	Implemented	Redis ZSET leases enforce global, pipeline, step, execution, instance, provider, model, and operation limits.
Enterprise throttling scenario	Implemented	`throttling-100` demonstrates provider-level distributed throttling with realtime visibility and deterministic convergence.
Policy-driven execution	Implemented	Retry, retention, and concurrency use configurable policy definitions.
Execution control state	Implemented	ExecutionId-level pause, resume, cancel, waiting-for-input, and human input submission.
Runtime queue control	Implemented	RunId-level queue pause/resume, queued cancellation, running cancellation bridge, and hot enqueue.
Runtime control plane foundation	Implemented	Adapter-neutral facades expose replay, execution control, local queue control, runtime instance control, and admission foundations for API/MCP/CLI/dashboard/Kubernetes adapters.
Runtime instance registry and control	Implemented	Runtime instances can register, heartbeat, expose queue capacity, be listed, marked draining, or unregistered.
Run admission / slot decisions	Implemented	Admission can assign runs to an available runtime instance, request scale-out, queue globally later, or reject according to policy.
RunId vs ExecutionId separation	Implemented	Controller lifecycle identity is separated from durable DAG execution identity.
Snapshot and Replay API foundations	Implemented	Terminal snapshots, replay metadata, deterministic fingerprint validation, audit-only replay, restore replay, ledger loading, and timeline loading are available.
Execution-correlated decision ledger	Implemented	Durable correlated ledger events exist for execution lifecycle, run lifecycle, queue control, claims, steps, retry, recovery, policy evaluation, concurrency, execution control, human input, snapshots, storage failures, replay lifecycle, retention, compaction, and finalization.
Observability, metrics, and tracing	Foundation available	Runtime metrics, trace recording, realtime events, correlated trace timelines, and configurable Memory/Mongo/MemoryAndMongo persistence exist; production-grade OpenTelemetry integration and dashboarding remain planned.
MLOps-oriented platform evolution	Direction defined	Long-term platform direction is documented in `docs/road-to-mlops.md`.
Durable decision ledger	Foundation available	Execution-correlated runtime ledger foundations are implemented and aligned with runtime correlation, including replay lifecycle visibility.
Public API / SDK polish	Planned	Future work for cleaner external developer experience.

Architecture at a Glance

Client / API / Controller
        |
        v
Runtime Orchestration Layer
        |
        v
Pipeline Definition + DAG Resolution
        |
        v
Context Resolution and Helper Layer
        |
        +--> Input binding resolution
        +--> Previous step output resolution
        +--> Payload rehydration
        +--> Provider / model / operation context
        +--> Policy context
        +--> Concurrency context
        +--> RAG retrieval / merge / compose context
        |
        v
DAG Execution Engine
        |
        +--> Execution Control Gate
        |
        +--> Concurrency / Throttling Engine
        |
        v
Redis Hot State + Redis Lua Coordination
        |
        +--> Atomic step claims
        +--> Retry scheduling
        +--> Worker recovery
        +--> Control state
        +--> Distributed leases
        |
        v
Stateless Workers / Runtime Instances
        |
        v
Step Executors
        |
        +--> LLM
        +--> RAG
        +--> Tools
        +--> Decisions
        |
        v
MongoDB Payloads / Snapshots / Replay Validation
        |
        v
Execution-Correlated Decision Ledger
        |
        v
Correlated Observability / Metrics / Tracing Foundations

The runtime is intentionally split into layers:

orchestration starts and manages executions
DAG state determines what can run
context helpers resolve inputs, payloads, metadata, and policy context
Redis coordinates distributed workers
policies control runtime behavior
workers execute claimed steps
persistence stores large payloads and snapshots
replay validates deterministic reconstruction from persisted snapshots
correlated observability records runtime behavior across ledger, metrics, traces, workers, and executions
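
The idea that DAG state, rather than worker ordering, determines what can run is sketched below. This is a minimal in-memory illustration, not the runtime's engine: the `PIPELINE` contents and the `ready_steps` and `run` helpers are hypothetical names introduced here.

```python
from typing import Callable, Dict, List, Set

# Hypothetical pipeline: step name -> steps it depends on.
PIPELINE: Dict[str, List[str]] = {
    "retrieve": [],
    "classify": [],
    "compose": ["retrieve", "classify"],
    "decide": ["compose"],
}


def ready_steps(dag: Dict[str, List[str]], completed: Set[str]) -> List[str]:
    """Steps that are not done yet and whose dependencies are all completed."""
    return sorted(
        step
        for step, deps in dag.items()
        if step not in completed and all(dep in completed for dep in deps)
    )


def run(dag: Dict[str, List[str]], pick: Callable[[List[str]], str]) -> List[str]:
    """Complete one ready step at a time; `pick` models arbitrary worker ordering."""
    completed: Set[str] = set()
    order: List[str] = []
    while len(completed) < len(dag):
        ready = ready_steps(dag, completed)
        if not ready:
            raise ValueError("cycle or missing dependency in DAG")
        step = pick(ready)
        completed.add(step)
        order.append(step)
    return order


# Two different "worker orderings" over the same DAG.
first = run(PIPELINE, pick=lambda ready: ready[0])
last = run(PIPELINE, pick=lambda ready: ready[-1])
```

The two runs complete the independent steps in a different order, yet both finish with the same set of completed steps, and `compose` never runs before its dependencies.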

Enterprise Readiness

The project is designed around production questions that enterprise AI systems must answer.

Enterprise Question	Runtime Direction
What happens if a worker crashes?	Running steps can be recovered through stale-claim detection and Redis-backed recovery.
How do you prevent duplicate executions?	Atomic Redis Lua claims and claim tokens enforce single step ownership.
How do you replay a workflow?	Terminal snapshots, replay metadata, deterministic fingerprint validation, audit-only replay, and restore replay provide replay foundations.
How do you audit an AI decision?	Execution-correlated decision ledger events, correlated metrics/traces, execution state, step results, retry metadata, recovery state, snapshots, replay reports, and observability provide audit foundations.
How do you limit concurrency?	Distributed Redis ZSET leases and policy-driven throttling enforce limits.
How do you resolve execution context safely?	Context helpers resolve inputs, step outputs, payload references, provider metadata, and policy context consistently.
How do you pause/resume/cancel safely?	Execution control state blocks new claims and coordinates deterministic finalization.
How do you control human-in-the-loop?	WaitingForInput and SubmitHumanInput are supported through durable control state.
How do you keep memory/state bounded?	Retention, compaction, eviction, and payload externalization control hot state size.
How do you coordinate multiple runtime instances?	Shared Redis state, Lua coordination, leases, and deterministic convergence enable coordination.
How do you prove deterministic convergence?	Integration tests and enterprise demo scenarios validate completion, replay fingerprints, distributed execution, throttling, recovery behavior, atomic retention, compaction consistency, ledger visibility, and trace timeline visibility.
How does this evolve toward AI operations and MLOps?	Runtime foundations are designed to support future AI execution control planes, governance, observability, replay, and operational workflows.

For the detailed enterprise matrix, see docs/enterprise-readiness.md.

Runtime Control Plane

The runtime includes an adapter-neutral control-plane foundation for operational runtime control.

The control plane currently exposes foundations for:

replay and audit control
execution control
local runtime queue control
runtime instance registration and heartbeat
runtime instance visibility and draining
run admission / slot decisions
future API, MCP, CLI, dashboard, and Kubernetes adapters

The control plane is also split into two identity levels.

RunId = controller / queue / background job lifecycle id

ExecutionId = authoritative durable DAG execution id

This separation allows the runtime to manage queue-level work without confusing it with durable execution state.

ExecutionId-Level Control

Implemented execution control capabilities include:

pause execution
resume execution
cancel execution
wait for human input
submit human input
block new claims based on control state
cancellation finalization override
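
The capabilities above can be pictured as a small state machine that gates step claims. The sketch below is an assumption-laden illustration: `ControlState`, `ExecutionControl`, and the exact transition rules are hypothetical, and the runtime stores this state durably rather than in a Python object.

```python
from enum import Enum
from typing import Any, Optional


class ControlState(Enum):
    RUNNING = "Running"
    PAUSED = "Paused"
    WAITING_FOR_INPUT = "WaitingForInput"
    CANCELLED = "Cancelled"


class ExecutionControl:
    """Hypothetical in-memory model of ExecutionId-level control state."""

    def __init__(self) -> None:
        self.state = ControlState.RUNNING
        self.human_input: Optional[Any] = None

    def _move(self, expected: ControlState, target: ControlState) -> bool:
        # Only transition when the current state matches what the caller expects.
        if self.state is not expected:
            return False
        self.state = target
        return True

    def pause(self) -> bool:
        return self._move(ControlState.RUNNING, ControlState.PAUSED)

    def resume(self) -> bool:
        return self._move(ControlState.PAUSED, ControlState.RUNNING)

    def wait_for_input(self) -> bool:
        return self._move(ControlState.RUNNING, ControlState.WAITING_FOR_INPUT)

    def submit_human_input(self, payload: Any) -> bool:
        if not self._move(ControlState.WAITING_FOR_INPUT, ControlState.RUNNING):
            return False
        self.human_input = payload
        return True

    def cancel(self) -> None:
        # Assumed to be terminal: no transition leaves CANCELLED.
        self.state = ControlState.CANCELLED

    def can_claim(self) -> bool:
        # New claims are blocked unless the execution is running.
        return self.state is ControlState.RUNNING

    def final_status(self, dag_status: str) -> str:
        # Assumed reading of "finalization override": cancellation wins.
        return "Cancelled" if self.state is ControlState.CANCELLED else dag_status
```

A worker would call `can_claim()` before attempting a claim, so a paused or waiting execution stops handing out new steps without interrupting steps that are already running.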

RunId-Level Queue Control

Implemented controller queue capabilities include:

pause queue
resume queue
cancel queued run
cancel running run by bridging to ExecutionId cancellation
hot enqueue while controller is running
hot enqueue while queue is paused

This makes the runtime controllable, not only executable.
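
As a rough illustration of the queue-level behavior, the sketch below models pause, resume, queued cancellation, and hot enqueue. `RunQueue` and its method names are hypothetical, and the running-cancellation bridge to `ExecutionId` is deliberately left out.

```python
from collections import deque
from typing import Deque, Optional


class RunQueue:
    """Hypothetical in-memory model of RunId-level queue control."""

    def __init__(self) -> None:
        self._runs: Deque[str] = deque()
        self.paused = False

    def enqueue(self, run_id: str) -> None:
        # Hot enqueue: accepted whether or not the queue is paused.
        self._runs.append(run_id)

    def pause(self) -> None:
        self.paused = True

    def resume(self) -> None:
        self.paused = False

    def cancel_queued(self, run_id: str) -> bool:
        # Only runs that have not been dequeued yet can be cancelled here.
        try:
            self._runs.remove(run_id)
            return True
        except ValueError:
            return False

    def next_run(self) -> Optional[str]:
        # A paused queue keeps its runs but hands none out.
        if self.paused or not self._runs:
            return None
        return self._runs.popleft()
```

Pausing affects only dispatch: runs enqueued during a pause are kept in order and start flowing again after `resume()`.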

Runtime Instance Control

Implemented runtime instance capabilities include:

register runtime instance
heartbeat runtime instance
get runtime instance status
list runtime instances
expose local queue pressure
expose available run slots
mark runtime instance as draining
unregister runtime instance

These foundations prepare the runtime for Kubernetes pod visibility, shared admission, dashboards, and future autoscaling.

Run Admission / Slot Decisions

Implemented admission capabilities include:

assign a run to an available runtime instance
prefer a requested runtime instance when available
select the least-loaded available instance
request scale-out when no instance has capacity
queue globally later when shared queue fallback is enabled
reject when no capacity or fallback exists

Admission does not enqueue runs yet.

It only decides what should happen next.
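
The decision rules listed above can be sketched as a pure function that returns a decision without enqueueing anything. The names `InstanceStatus`, `AdmissionDecision`, and `decide` are hypothetical, and two details are assumptions: "least-loaded" is taken to mean fewest queued runs, and scale-out is checked before the shared-queue fallback.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class InstanceStatus:
    instance_id: str
    available_slots: int
    queued_runs: int
    draining: bool = False


@dataclass(frozen=True)
class AdmissionDecision:
    action: str  # "Assign", "ScaleOut", "QueueGlobally", or "Reject"
    instance_id: Optional[str] = None


def decide(
    instances: List[InstanceStatus],
    preferred: Optional[str] = None,
    allow_scale_out: bool = False,
    shared_queue_fallback: bool = False,
) -> AdmissionDecision:
    # Draining or full instances are never candidates.
    candidates = [i for i in instances if not i.draining and i.available_slots > 0]

    # Prefer the requested instance when it still has capacity.
    for inst in candidates:
        if inst.instance_id == preferred:
            return AdmissionDecision("Assign", inst.instance_id)

    # Otherwise pick the least-loaded candidate; ties break on instance id.
    if candidates:
        best = min(candidates, key=lambda i: (i.queued_runs, i.instance_id))
        return AdmissionDecision("Assign", best.instance_id)

    # No capacity anywhere: fall through the configured options.
    if allow_scale_out:
        return AdmissionDecision("ScaleOut")
    if shared_queue_fallback:
        return AdmissionDecision("QueueGlobally")
    return AdmissionDecision("Reject")
```

Keeping the decision separate from the enqueue step means the same function can back an API, a CLI, or a dashboard adapter without side effects.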

For details, see docs/ai/runtime-control-plane.md.

Distributed Execution and Coordination

The runtime uses Redis as a hot state and coordination layer.

Redis is not used only as a cache. It holds the active distributed execution state.

Critical operations include:

creating execution state
claiming ready DAG steps
validating claim ownership
completing steps
failing steps
scheduling retries
recovering stale running steps
enforcing distributed concurrency leases
storing execution control state

Redis Lua scripts are used for atomic transitions where race conditions must be avoided.

This allows multiple workers or runtime instances to cooperate safely without direct worker-to-worker communication.
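
To make the claim semantics concrete, here is a minimal in-memory model of single-owner step claims with claim tokens. It is not the runtime's Lua code: `StepStore` and `race` are hypothetical, and a `threading.Lock` stands in for the atomicity that a Redis Lua script would provide.

```python
import threading
import uuid
from typing import Dict, List, Optional


class StepStore:
    """In-memory stand-in for Redis hot state; the lock models script atomicity."""

    def __init__(self, steps: List[str]) -> None:
        self._lock = threading.Lock()
        self._state: Dict[str, dict] = {
            s: {"status": "Ready", "token": None, "owner": None} for s in steps
        }

    def try_claim(self, step: str, worker: str) -> Optional[str]:
        # Check-and-set happens as one atomic unit, so only one worker can win.
        with self._lock:
            entry = self._state[step]
            if entry["status"] != "Ready":
                return None
            token = uuid.uuid4().hex
            entry.update(status="Running", token=token, owner=worker)
            return token

    def complete(self, step: str, token: str) -> bool:
        # Completion is accepted only from the current claim owner.
        with self._lock:
            entry = self._state[step]
            if entry["status"] != "Running" or entry["token"] != token:
                return False
            entry["status"] = "Completed"
            return True


def race(store: StepStore, step: str, workers: List[str]) -> List[str]:
    """Let several workers try to claim the same step; return the winning tokens."""
    results: List[Optional[str]] = []

    def attempt(name: str) -> None:
        results.append(store.try_claim(step, name))

    threads = [threading.Thread(target=attempt, args=(w,)) for w in workers]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [token for token in results if token is not None]
```

Because workers only ever talk to the shared store, none of them needs to know the others exist. A stale or forged token is rejected on completion, which is what prevents a recovered step from being completed twice.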

Context Resolution and Runtime Helpers

The runtime includes a helper layer that connects declarative configuration to concrete execution behavior.

This layer resolves:

input bindings from execution state
previous step outputs
compacted or externalized payloads
provider, providerKey, model, and operation metadata
retry policy context
retention policy context
concurrency context
RAG retrieval, merge, and compose context
replay-safe comparison data

This keeps the DAG engine focused on orchestration and prevents plugins, policies, and providers from manually reconstructing raw execution state.
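
A simplified picture of input binding resolution with payload rehydration follows. Everything in it is illustrative: the `step.field` binding syntax, the `$payloadRef` marker, and the `PAYLOAD_STORE` dictionary (standing in for durable storage) are assumptions, not the runtime's actual formats.

```python
from typing import Any, Dict, List

# Hypothetical durable payload store keyed by reference id.
PAYLOAD_STORE: Dict[str, Any] = {
    "payload:42": {"documents": ["doc-a", "doc-b"]},
}


def rehydrate(value: Any, store: Dict[str, Any]) -> Any:
    """Swap an externalized payload reference for the stored payload."""
    if isinstance(value, dict) and set(value) == {"$payloadRef"}:
        return store[value["$payloadRef"]]
    return value


def resolve_path(data: Any, path: List[str]) -> Any:
    """Walk nested dictionaries along the given keys."""
    for key in path:
        data = data[key]
    return data


def resolve_inputs(
    bindings: Dict[str, str],
    step_outputs: Dict[str, Any],
    store: Dict[str, Any],
) -> Dict[str, Any]:
    """Resolve each binding of the form 'step.field.subfield' to a concrete value."""
    resolved: Dict[str, Any] = {}
    for input_name, expression in bindings.items():
        step, *path = expression.split(".")
        output = rehydrate(step_outputs[step], store)
        resolved[input_name] = resolve_path(output, path)
    return resolved


# The "retrieve" output was compacted to a reference; "classify" is still inline.
step_outputs = {
    "retrieve": {"$payloadRef": "payload:42"},
    "classify": {"label": "invoice"},
}
inputs = resolve_inputs(
    {"docs": "retrieve.documents", "label": "classify.label"},
    step_outputs,
    PAYLOAD_STORE,
)
```

The step executor sees the same `inputs` whether or not an upstream output was compacted, which is the point of routing resolution through one helper layer.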

For details, see docs/ai/context-resolution-and-helpers.md.

Observability, Replay, and Audit Foundations

The runtime includes foundations for production visibility and replayability:

execution lifecycle metrics
retry and recovery metrics
retention metrics
resolver metrics
storage metrics
context resolution diagnostics
concurrency admission diagnostics
realtime runtime events
readable console runtime events
execution-correlated decision ledger
durable runtime lifecycle audit events
claim, concurrency, retry, and recovery audit visibility
queue and execution control audit visibility
atomic retention and compaction auditability
snapshot persistence audit events
replay lifecycle ledger events
replay metadata
replay report generation
replay snapshot loading
replay deterministic fingerprint validation
replay dependency graph validation
replay step state validation
replay payload reference validation
replay ledger event loading
replay trace timeline loading
replay diagnostic output for ledger and timeline inspection
correlated trace recording foundations
correlated metric and trace storage modes
in-memory, MongoDB, and MemoryAndMongo observability persistence
terminal snapshots
replay restoration

Runtime metrics and traces can be configured as Disabled, Memory, Mongo, or MemoryAndMongo. This allows local diagnostics, durable MongoDB-backed inspection, or both at the same time while keeping the execution runtime independent from observability storage choices.
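
One way to picture the four storage modes is as a pair of independent switches. The sketch below is hypothetical: `StorageMode` mirrors the mode names from the text, while `TraceRecorder` and the `durable_sink` callable (standing in for a MongoDB writer) are names introduced for illustration.

```python
from enum import Flag, auto
from typing import Callable, Dict, List


class StorageMode(Flag):
    DISABLED = 0
    MEMORY = auto()
    MONGO = auto()
    MEMORY_AND_MONGO = MEMORY | MONGO


class TraceRecorder:
    """Routes events to memory, a durable sink, both, or neither."""

    def __init__(self, mode: StorageMode, durable_sink: Callable[[Dict], None]) -> None:
        self.mode = mode
        self.memory: List[Dict] = []
        self.durable_sink = durable_sink

    def record(self, event: Dict) -> None:
        # The caller never branches on the mode; the recorder does.
        if StorageMode.MEMORY in self.mode:
            self.memory.append(event)
        if StorageMode.MONGO in self.mode:
            self.durable_sink(event)
```

Because the execution path only calls `record`, switching modes is a configuration change and does not touch step execution code.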

Replay can validate persisted executions without re-running LLMs, tools, external providers, or side effects. It can expose replay metadata, decision ledger events, and trace timeline events when requested.
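
Fingerprint validation of this kind can be sketched with a canonical serialization and a hash. The snapshot shape, the choice of SHA-256 over canonical JSON, and the `fingerprint` and `validate_replay` helpers are all assumptions made for illustration; the runtime's actual fingerprint inputs are not specified here.

```python
import hashlib
import json
from typing import Dict, List


def fingerprint(steps: Dict[str, dict]) -> str:
    """Hash a canonical JSON form so key order cannot change the result."""
    canonical = json.dumps(steps, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def validate_replay(snapshot: Dict, expected_fingerprint: str) -> List[str]:
    """Check a persisted snapshot without re-running any step."""
    problems: List[str] = []
    steps = snapshot["steps"]

    if fingerprint(steps) != expected_fingerprint:
        problems.append("fingerprint mismatch")

    for name, step in steps.items():
        for dep in step["depends_on"]:
            if dep not in steps:
                problems.append(f"{name}: unknown dependency {dep}")
            elif step["status"] == "Completed" and steps[dep]["status"] != "Completed":
                problems.append(f"{name}: completed before dependency {dep}")
    return problems


# Hypothetical terminal snapshot for one execution.
snapshot = {
    "execution_id": "exec-1",
    "steps": {
        "retrieve": {"depends_on": [], "status": "Completed"},
        "compose": {"depends_on": ["retrieve"], "status": "Completed"},
    },
}
recorded = fingerprint(snapshot["steps"])
```

Validation reads only persisted data, so no LLM, tool, or external provider is called while checking the snapshot.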

OpenTelemetry-style distributed tracing, richer dashboards, HTTP replay APIs, replay audit tooling, and advanced decision lineage remain roadmap items.

Current Status

This project is under active development.

It should be treated as an advanced reference implementation and evolving AI infrastructure project, not as a polished commercial product.

The strongest areas today are:

deterministic DAG execution
Redis-backed distributed state
Redis Lua atomic coordination
context resolution and helper foundations
retry/recovery semantics
retention/compaction
distributed concurrency and throttling
executable distributed throttling scenario
execution control state
runtime queue control
runtime control-plane foundations
runtime instance registry and control
run admission / slot decisioning
queue and execution control observability
replay/snapshot validation foundations
replay metadata, ledger, and trace timeline diagnostics
correlated observability, tracing, metrics, and realtime logging foundations
integration-test-driven validation

Areas still evolving include:

public API/SDK polish
HTTP replay APIs and controller abstractions
OpenTelemetry/exporter polish for tracing and metrics
operational dashboarding
Kubernetes deployment assets
real enterprise sample workflows
MLOps-oriented platform capabilities
production documentation split

Enterprise Runtime Demo

A local enterprise-oriented demo is available in demo/enterprise-runtime/.

The demo is designed to prove that the runtime behaves like distributed AI execution infrastructure, not a toy agent framework.

It currently includes:

Docker Compose infrastructure for Redis and MongoDB
scenario documentation for enterprise runtime behaviors
an external sample step plugin under Samples/Multiplexed.Sample.External.Plugins.Steps
a JSON pipeline at demo/enterprise-runtime/pipelines/enterprise-demo-pipeline.json
interactive console scenario selection
distributed runtime worker participation
realtime readable runtime logs
live progress output
execution pause, resume, and cancel controls
retry recovery summaries
retention and hot-state summaries
replay validation for supported scenarios
replay metadata, ledger, and timeline diagnostics in integration tests
distributed provider throttling through the throttling-100 scenario
RunId and ExecutionId separation
terminal completion through the controller path
execution-correlated runtime ledger visibility
correlated metrics and tracing diagnostics
MemoryAndMongo observability validation
queue and execution control audit events
retry, recovery, and concurrency ledger validation
atomic retention and compaction auditability

The current executable console scenarios are:

json
chaos-100
chaos-500
throttling-100

The throttling-100 scenario demonstrates:

distributed provider throttling
Redis lease-based concurrency admission
realtime [THROTTLED] visibility
randomized provider distribution with OpenAI as the throttled target
bounded provider capacity under worker pressure
deterministic convergence despite throttling delays
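
The lease-based admission behind this scenario can be modeled in memory as follows. This is a sketch, not the runtime's implementation: `LeaseLimiter` is a hypothetical class, a dictionary of expiry times stands in for a Redis sorted set, and the comments name the sorted-set commands a Redis version would plausibly use inside one atomic script.

```python
from typing import Dict


class LeaseLimiter:
    """Models a sorted set per scope: member = lease id, score = expiry time."""

    def __init__(self, limit: int, ttl_seconds: float) -> None:
        self.limit = limit
        self.ttl = ttl_seconds
        self._leases: Dict[str, Dict[str, float]] = {}

    def try_acquire(self, scope: str, lease_id: str, now: float) -> bool:
        leases = self._leases.setdefault(scope, {})
        # Drop expired leases first (ZREMRANGEBYSCORE in a Redis version).
        for member in [m for m, expiry in leases.items() if expiry <= now]:
            del leases[member]
        # Refuse admission when the scope is at capacity (ZCARD).
        if len(leases) >= self.limit:
            return False
        # Record the lease with its expiry as the score (ZADD).
        leases[lease_id] = now + self.ttl
        return True

    def release(self, scope: str, lease_id: str) -> None:
        # Explicit release when the step finishes (ZREM).
        self._leases.get(scope, {}).pop(lease_id, None)
```

A refused `try_acquire` is what would surface as a throttled step. Expiry-based cleanup means a crashed worker's lease frees itself after the TTL instead of blocking the provider forever.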

The demo validates the controller execution path, distributed worker participation, runtime controls, realtime logging, correlated observability, and terminal completion behavior. It is intended to show distributed AI execution infrastructure, not only a simple batch or in-memory execution path.

Future demo work will expand further into crash recovery, human-in-the-loop, advanced replay workflows, Kubernetes deployment assets, real enterprise sample workflows, and broader AI operations/MLOps-oriented runtime capabilities.

Roadmap

The roadmap is organized into phases.

Phase	Focus	Status
Completed	Core runtime foundations already implemented	Implemented / validated by tests
Phase 0	README review and documentation restructure	Completed (V1)
Phase 1	Enterprise demo	Completed (V1) - controller demo, distributed workers, runtime controls, chaos scenarios, retention/replay, and throttling scenario validated
Phase 2	Real enterprise sample	Planned
Phase 3	Correlated observability, tracing, and metrics	Foundations available / active polish
Phase 4	Kubernetes deployment demo	Planned
Phase 5	Public API / SDK polish	Planned
Phase 6	Deterministic Replay Engine and Audit Foundations	Completed (V1)
Phase 7	Replay Controller, HTTP APIs, Dashboard, and Operational Tooling	Planned
Phase 8	Cost and Provider Governance	Planned
Phase 9	Articles and public positioning	Planned

The roadmap above tracks the current runtime and enterprise demo evolution.

The longer-term platform direction is tracked separately in docs/road-to-mlops.md, which describes how these deterministic execution foundations can evolve toward broader AI execution infrastructure, governance, observability, replay, and MLOps-oriented runtime operations.

For the detailed roadmap, see docs/roadmap.md.

Documentation

The full documentation map is available here:

docs/index.md — Documentation index and reading guide.
docs/runtime-internals.md — Complete technical reference preserved from the original README.
docs/enterprise-readiness.md — Enterprise readiness matrix.
docs/comparison-existing-tools.md — Ecosystem positioning and comparison with existing tools.
docs/roadmap.md — Project roadmap.
docs/road-to-mlops.md — Long-term evolution from deterministic AI runtime foundations toward a broader AI execution and MLOps-oriented platform.
docs/ai/runtime-control-plane.md — Runtime control-plane foundation for replay, execution control, runtime queue control, runtime instance visibility/control, and run admission.
demo/enterprise-runtime/README.md — Local enterprise runtime demo using Docker Compose, Redis, MongoDB, external demo steps, controller execution, distributed workers, and scenario documentation.
demo/enterprise-runtime/scenarios/06-distributed-throttling.md — Executable distributed throttling scenario documentation.
demo/enterprise-runtime/scenarios/08-deterministic-convergence.md — Deterministic convergence scenario documentation.

Focused AI runtime documentation lives under docs/ai/. These files were extracted progressively from docs/runtime-internals.md.

License

This project is licensed under the Business Source License 1.1 (BSL).

Free for development, testing, and internal use
Commercial production use requires a license
Automatically converts to Apache 2.0 on 2029-01-01

See the repository license file for full terms.
