
ienugr/gallon-simulation


Gallon Delivery Simulation

A fully containerised, multi-service simulation that demonstrates how easy it is to build your own AI agent using the Strands Agents framework. The agent manages a water-gallon delivery operation: it receives requests from autonomous clients, decides how to fulfil them, tracks inventory, and records everything to a vector knowledge base — all in real time, all observable through a live dashboard.

This project was built as a hands-on showcase. The core message: you don't need a complex platform to run your own agent. A handful of Python files, a @tool decorator, and a model of your choice is all it takes. The rest of the stack (database, knowledge base, monitoring) is ordinary software that you already know how to build.


Background

Agents are often presented as magical black boxes sitting behind a managed platform. This project cuts through that and shows the mechanics directly:

  • What does a tool actually look like in code?
  • How does the agent decide what to call?
  • How do you wire an agent into an HTTP service?
  • How do you give an agent memory across calls?
  • How do you swap the underlying model without rewriting your agent?

The simulation is intentionally concrete. Gallons are delivered, inventory is depleted, clients die if they run dry for too long. You can watch it happen in a browser. Every agent decision is logged. The knowledge base accumulates real behavioral history. It is small enough to read in an afternoon and real enough to learn from.


What the simulation does

Five clients start with a small stock of water gallons. On each tick (one simulated day) they consume water, and one random "burst" day per week has higher consumption. When stock drops below a threshold, a client calls the agent to request a delivery. The agent checks inventory, fulfils or rejects the request, and logs the event to a vector knowledge base. A background monthly loop summarises operations (total deliveries, gallons, top client, anomalies) and records that summary to the knowledge base. A dashboard shows everything live.
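The per-tick client behaviour fits in a few lines of plain Python. This is a minimal sketch, not the code in client.py. The `SimConfig` and `ClientState` classes and the `tick()` function are hypothetical. The defaults are copied from the configuration reference. `deliver` stands in for the call to the agent and returns the gallons that actually arrived. The sketch also assumes that days are numbered from 0 and that a new burst day is drawn at the start of each 7-day week.

```python
import random
from dataclasses import dataclass
from typing import Callable


@dataclass
class SimConfig:
    # hypothetical mirror of the .env defaults
    daily_consumption: float = 1.0
    burst_consumption: float = 5.0
    reorder_threshold: float = 2.0
    delivery_amount_min: float = 3.0
    delivery_amount_max: float = 10.0
    death_threshold_days: int = 3


@dataclass
class ClientState:
    stock: float = 5.0   # INITIAL_STOCK
    zero_days: int = 0   # consecutive days spent at zero stock
    alive: bool = True
    burst_day: int = 0   # day of the current week (0-6) with burst consumption


def tick(state: ClientState, day: int, cfg: SimConfig,
         deliver: Callable[[float], float], rng: random.Random) -> ClientState:
    """Advance one client by one simulated day (hypothetical sketch)."""
    if not state.alive:
        return state
    if day % 7 == 0:
        # assumption: a fresh random burst day is drawn at the start of each week
        state.burst_day = rng.randrange(7)
    used = cfg.burst_consumption if day % 7 == state.burst_day else cfg.daily_consumption
    state.stock = max(0.0, state.stock - used)

    if state.stock < cfg.reorder_threshold:
        # randomised request size; deliver() returns what arrived (0.0 if rejected)
        requested = round(rng.uniform(cfg.delivery_amount_min, cfg.delivery_amount_max), 1)
        state.stock += deliver(requested)

    if state.stock <= 1e-9:  # treat float residue as empty
        state.zero_days += 1
        if state.zero_days >= cfg.death_threshold_days:
            state.alive = False
    else:
        state.zero_days = 0
    return state
```

If you call `tick()` on consecutive days with `deliver=lambda g: 0.0`, you model a client whose requests are always rejected. That client dies after three consecutive empty days.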

              tick every N seconds
                      │
         ┌────────────▼────────────┐
         │      gallon-clients     │  5 concurrent async clients
         │  (consume, request, die)│  PostgreSQL → gallon_clients DB
         └────────────┬────────────┘
                      │  POST /invocations
         ┌────────────▼────────────┐
         │      gallon-agent       │  Strands agent + tools
         │  (deliver, monthly loop)│  connects to inference provider
         └────────────┬────────────┘
                      │  HTTP
         ┌────────────▼────────────┐
         │     gallon-service      │  FastAPI data layer
         │  inventory + knowledge  │  PostgreSQL → gallon_service DB
         │         base            │  LanceDB (vector store)
         └─────────────────────────┘

         ┌─────────────────────────┐
         │     gallon-monitor      │  dashboard at :8888
         │  reads all three above  │
         └─────────────────────────┘

Project structure

workshop/
├── docker-compose.yml          # one command to run everything
├── .env                        # all runtime config in one place
├── init-scripts/
│   └── init-db.sql             # creates gallon_service + gallon_clients DBs
│
├── gallon-agent/               # the agent service
│   └── gallon_agent/
│       ├── agent.py            # model construction + agent factory
│       ├── tools.py            # @tool definitions (the agent's hands)
│       ├── server.py           # FastAPI: /invocations + /ping + monthly loop
│       ├── service_client.py   # HTTP client for gallon-service
│       └── config.py           # pydantic-settings
│
├── gallon-service/             # data layer (no agent logic)
│   └── gallon_service/
│       ├── server.py           # inventory, deliveries, KB endpoints
│       ├── db.py               # asyncpg queries
│       ├── vector_store.py     # LanceDB + fastembed
│       └── config.py
│
├── gallon-clients/             # simulation clients
│   └── gallon_clients/
│       ├── runner.py           # spawns N concurrent clients
│       ├── client.py           # GallonClient: consume, request, die
│       ├── agent_gateway.py    # calls /invocations on the agent
│       └── db.py               # client state in PostgreSQL
│
└── gallon-monitor/             # live dashboard
    └── gallon_monitor/
        ├── server.py           # aggregates state from all services
        └── dashboard.html      # single-page dashboard (vanilla JS)

Quickstart

Prerequisites

  • Docker + Docker Compose (or Colima on Mac)
  • An inference provider — see Inference providers below

1. Clone and configure

git clone <repo>
cd workshop
cp .env.example .env   # then edit .env — see below

2. Start everything

docker compose up -d --build

3. Open the dashboard

http://localhost:8888

That is it. Clients start running immediately. The agent processes deliveries in the background and runs a monthly summary every 30 ticks. The dashboard refreshes every two seconds.

4. Restart clients (after they all die)

docker compose restart gallon-clients

Inference providers

The agent supports three inference providers, switched entirely through environment variables — no code changes.

Local (default) — Rapid-MLX or any OpenAI-compatible server

Best for development on Apple Silicon. Rapid-MLX runs models natively via MLX.

# install and start the model (run on the host, not in Docker)
pip install rapid-mlx
rapid-mlx serve qwen3-0.6b-8bit --port 8000

.env:

INFERENCE_PROVIDER=local
MODEL_ID=qwen3-0.6b-8bit
INFERENCE_PORT=8000

Works with any OpenAI-compatible local server (Ollama, LM Studio, vLLM, etc.). For Ollama, pull the model first (ollama pull smollm2:135m). Then set MODEL_ID=smollm2:135m and INFERENCE_PORT=11434, which is Ollama's default port.

Note on small models: models under ~4B parameters often struggle with reliable tool calling. The simulation works around this in two ways. It makes the deterministic decisions (restocking, the out-of-stock/unavailability roll) in Python. It also asks the LLM to call only a single tool per invocation.

OpenAI

INFERENCE_PROVIDER=openai
MODEL_ID=gpt-4o-mini
OPENAI_API_KEY=sk-...

Amazon Bedrock

INFERENCE_PROVIDER=bedrock
MODEL_ID=us.anthropic.claude-haiku-4-5-20251001-v1:0
AWS_REGION=us-east-1
AWS_ACCESS_KEY_ID=...
AWS_SECRET_ACCESS_KEY=...

Cross-region inference profile IDs (the us. / apac. / eu. prefix) are required for most Bedrock models. Find yours in the Bedrock console under Cross-region inference.


Configuration reference

All config lives in .env at the workspace root. Every value except the API keys and AWS credentials has a default, so you only need to set what you want to change.

| Variable | Default | Description |
|---|---|---|
| INFERENCE_PROVIDER | local | local / openai / bedrock |
| MODEL_ID | qwen3-0.6b-8bit | Model identifier for the chosen provider |
| INFERENCE_PORT | 8000 | Port of the local inference server |
| OPENAI_API_KEY | | Required when INFERENCE_PROVIDER=openai |
| AWS_REGION | us-east-1 | AWS region used when INFERENCE_PROVIDER=bedrock |
| AWS_ACCESS_KEY_ID | | Required when INFERENCE_PROVIDER=bedrock |
| AWS_SECRET_ACCESS_KEY | | Required when INFERENCE_PROVIDER=bedrock |
| NUM_CLIENTS | 5 | Number of simulation clients |
| SIM_DAYS | 0 | Ticks per client before stopping (0 = run indefinitely) |
| TICK_SECONDS | 5.0 | Real seconds per simulation day (client side) |
| SIM_TICK_SECONDS | 1.0 | Real seconds per simulation day (agent side) |
| INITIAL_STOCK | 5.0 | Starting gallons per client |
| DAILY_CONSUMPTION | 1.0 | Gallons consumed on a normal day |
| BURST_CONSUMPTION | 5.0 | Gallons consumed on a burst day |
| REORDER_THRESHOLD | 2.0 | Stock level that triggers a delivery request |
| DELIVERY_AMOUNT_MIN | 3.0 | Minimum gallons requested per delivery (randomised per request) |
| DELIVERY_AMOUNT_MAX | 10.0 | Maximum gallons requested per delivery |
| DEATH_THRESHOLD_DAYS | 3 | Consecutive zero-stock days before a client dies |
| OOS_PROBABILITY | 0.2 | Probability that the agent reports itself unavailable on each check_inventory call |
| RESTOCK_THRESHOLD | 20.0 | Inventory level below which the warehouse restocks automatically |
| MAX_CAPACITY | 100.0 | Warehouse maximum capacity (gallons) |
| EMBED_MODEL | BAAI/bge-small-en-v1.5 | fastembed model for the knowledge base |
| POSTGRES_PASSWORD | postgres | Shared PostgreSQL password |

How the agent works

The agent contract

The agent exposes two HTTP endpoints following the Bedrock AgentCore contract:

  • GET /ping — health check, returns {"status": "Healthy"}
  • POST /invocations — all client interactions, prompt in body

Clients send plain text prompts:

| Client action | Prompt | Agent response |
|---|---|---|
| Check warehouse | check_inventory CLIENT_ID:uuid | {"available_gallons": 80.0, "is_out_of_stock": false, "unavailability_reason": null} |
| Request delivery | deliver CLIENT_ID:uuid CLIENT_NAME:Alice GALLONS:5.0 | {"delivered": true, "gallons_delivered": 5.0, "status": "delivered"} |
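The prompts above are an action word followed by whitespace-separated `KEY:VALUE` tokens. Below is a minimal sketch for building and parsing that format on either side of `/invocations`. `build_prompt` and `parse_prompt` are hypothetical helpers, not functions from this repository. The format assumes values contain no spaces, so a client name like "Mary Ann" would need a different encoding.

```python
from typing import Dict, Tuple


def build_prompt(action: str, **fields: object) -> str:
    """Render an action plus KEY:VALUE tokens, rejecting values with whitespace."""
    parts = [action]
    for key, value in fields.items():
        text = str(value)
        if not text or any(ch.isspace() for ch in text):
            raise ValueError(f"{key} must be a non-empty value without whitespace")
        parts.append(f"{key}:{text}")
    return " ".join(parts)


def parse_prompt(prompt: str) -> Tuple[str, Dict[str, str]]:
    """Split a prompt into its action word and a KEY -> VALUE dict."""
    tokens = prompt.split()
    if not tokens:
        raise ValueError("empty prompt")
    action, fields = tokens[0], {}
    for token in tokens[1:]:
        key, sep, value = token.partition(":")  # split on the first colon only
        if not sep or not key:
            raise ValueError(f"malformed token: {token!r}")
        fields[key] = value
    return action, fields
```

For example, `parse_prompt("deliver CLIENT_ID:abc CLIENT_NAME:Alice GALLONS:5.0")` returns `("deliver", {"CLIENT_ID": "abc", "CLIENT_NAME": "Alice", "GALLONS": "5.0"})`. All values come back as strings, so the receiver still has to convert `GALLONS` to a float.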

Tools

Tools are plain Python functions decorated with @tool. The Strands framework automatically generates the JSON schema from the type annotations and docstring, passes them to the model, and calls the function when the model requests it.

from strands import tool

@tool
def deliver_gallons(client_id: str, client_name: str, gallons_requested: float, sim_day: int) -> dict:
    """
    Execute a delivery request: deduct gallons from inventory and log it.
    Returns gallons_delivered and status.
    """
    # ... implementation calls gallon-service HTTP endpoints
    # auto-restocks if inventory drops below threshold after delivery

The full tool set:

| Tool | Purpose |
|---|---|
| deliver_gallons | Fulfil a delivery: check live inventory, deduct, log to DB and KB, auto-restock if needed |
| record_monthly_summary | Persist the monthly operational summary to the knowledge base |

Two focused agents

Rather than one general-purpose agent, the simulation uses two single-tool agents. This makes the model's job trivial: read the values in the prompt, call the one tool, done. It works reliably even with very small models.

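from strands import Agent
from gallon_agent.tools import deliver_gallons, record_monthly_summary

# _make_model(), _delivery_prompt() and _monthly_prompt() are helpers defined elsewhere (not shown)

def build_delivery_agent() -> Agent:
    return Agent(model=_make_model(), system_prompt=_delivery_prompt(), tools=[deliver_gallons])

def build_monthly_agent() -> Agent:
    return Agent(model=_make_model(), system_prompt=_monthly_prompt(), tools=[record_monthly_summary])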

Deterministic vs LLM decisions

Not everything should be delegated to the model. The simulation handles several things in Python and passes pre-computed values to the agent as facts:

  • Restock: triggered automatically inside deliver_gallons when inventory drops below threshold. No LLM needed.
  • Agent unavailability: at check_inventory time, random.random() < oos_probability decides whether the agent is unavailable. If it is, a random human-readable reason (holiday, force majeure, etc.) is picked. A dice roll, not a judgment call.
  • Monthly stats: all arithmetic (top client, failure rate, anomaly detection) is computed in Python. The LLM only calls record_monthly_summary with the pre-computed values.

The LLM's job is to call the right tool with the right arguments — that is what it is actually good at.
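The unavailability roll is a plain probability check. The sketch below uses a hypothetical reason list and a hypothetical `check_inventory_response()` helper. The returned keys match the check_inventory response in the agent contract table. Two further behaviours are assumptions: an unavailable agent is reported as out of stock, and zero gallons also counts as out of stock.

```python
import random
from typing import Optional, Sequence

# Hypothetical human-readable reasons; the project's actual list may differ.
UNAVAILABILITY_REASONS: Sequence[str] = (
    "public holiday, warehouse closed",
    "force majeure, roads closed",
    "delivery truck under maintenance",
)


def check_inventory_response(available_gallons: float, oos_probability: float,
                             rng: random.Random) -> dict:
    """Deterministic Python decision; no LLM call involved."""
    reason: Optional[str] = None
    if rng.random() < oos_probability:
        # dice roll: the agent is unavailable this time, pick a reason at random
        reason = rng.choice(UNAVAILABILITY_REASONS)
    return {
        "available_gallons": available_gallons,
        "is_out_of_stock": reason is not None or available_gallons <= 0,
        "unavailability_reason": reason,
    }
```

With OOS_PROBABILITY=0.2, roughly one check in five reports the agent as unavailable. Seeding the `random.Random` instance makes a run reproducible.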

Concurrency

Calling a Strands agent as agent(prompt) blocks until the model has finished. The simulation runs each call through asyncio.to_thread so the FastAPI event loop stays responsive. It also serialises concurrent calls with an asyncio.Lock, so only one agent invocation runs at a time.
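Both ideas fit in one small wrapper: the thread offload and the lock, plus the optional `agent.messages` reset mentioned under "Adapting this for your own agent". This is a sketch under assumptions, not the project's server.py. `SerializedAgentRunner` is a hypothetical name. `agent` can be any object callable as `agent(prompt)` that has a `messages` list attribute, which is how a Strands `Agent` is used here. `asyncio.to_thread` needs Python 3.9+.

```python
import asyncio
from typing import Any


class SerializedAgentRunner:
    """Run a blocking agent callable off the event loop, one call at a time."""

    def __init__(self, agent: Any, stateless: bool = True) -> None:
        self._agent = agent
        self._stateless = stateless
        # on Python < 3.10, construct this object inside the running event loop
        self._lock = asyncio.Lock()

    async def invoke(self, prompt: str) -> Any:
        async with self._lock:  # concurrent requests queue here
            try:
                # the blocking call runs in a worker thread, so /ping stays responsive
                return await asyncio.to_thread(self._agent, prompt)
            finally:
                if self._stateless:
                    # forget the conversation so the next invocation starts fresh
                    self._agent.messages.clear()
```

A `POST /invocations` handler would then just `await runner.invoke(prompt)`. The lock trades throughput for predictability, which suits a single small local model.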

Model switching

The inference provider is selected at startup from the INFERENCE_PROVIDER env var. The _make_model() factory returns the matching Strands model object: OpenAIModel for local and OpenAI, BedrockModel for Bedrock. Everything above that layer (agents, tools, server) is provider-agnostic.
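The env-var switch can be kept separate from the model construction, which makes it testable without any provider installed. Below is a minimal sketch; `ModelSpec` and `resolve_model_spec()` are hypothetical, not the project's `_make_model()`. It makes three assumptions:

- A host-side local server is reachable from containers at host.docker.internal.
- The local server exposes the OpenAI-compatible API under `/v1`.
- The local server accepts a dummy API key.

```python
import os
from dataclasses import dataclass, field
from typing import Any, Dict, Mapping


@dataclass(frozen=True)
class ModelSpec:
    provider_class: str  # which Strands model class to construct
    kwargs: Dict[str, Any] = field(default_factory=dict)


def resolve_model_spec(env: Mapping[str, str] = os.environ) -> ModelSpec:
    """Map INFERENCE_PROVIDER and related variables to constructor arguments."""
    provider = env.get("INFERENCE_PROVIDER", "local").strip().lower()
    model_id = env.get("MODEL_ID", "qwen3-0.6b-8bit")

    if provider == "local":
        # assumption: OpenAI-compatible server on the host, reached from Docker
        port = env.get("INFERENCE_PORT", "8000")
        client_args = {"base_url": f"http://host.docker.internal:{port}/v1", "api_key": "local"}
        return ModelSpec("OpenAIModel", {"model_id": model_id, "client_args": client_args})
    if provider == "openai":
        api_key = env.get("OPENAI_API_KEY")
        if not api_key:
            raise ValueError("OPENAI_API_KEY is required when INFERENCE_PROVIDER=openai")
        return ModelSpec("OpenAIModel", {"model_id": model_id, "client_args": {"api_key": api_key}})
    if provider == "bedrock":
        # AWS credentials are read from the environment by the AWS SDK itself
        return ModelSpec("BedrockModel", {"model_id": model_id,
                                          "region_name": env.get("AWS_REGION", "us-east-1")})
    raise ValueError(f"unknown INFERENCE_PROVIDER: {provider!r}")
```

A factory along these lines would then call `OpenAIModel(**spec.kwargs)` (from `strands.models.openai`) or `BedrockModel(**spec.kwargs)` (from `strands.models`). Check those constructor arguments against the Strands documentation for your installed version.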


Knowledge base

Every delivery and every monthly summary is written to LanceDB, a local vector database. Text is embedded with fastembed using BAAI/bge-small-en-v1.5 (384 dimensions, runs on CPU, no GPU needed).

Two collections:

client_behaviors — one record per delivery attempt

"Alice on day 42 requested 5.0g, received 5.0g (delivered)"
metadata: client_id, client_name, sim_day, gallons_requested, gallons_delivered, status
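Each client_behaviors entry pairs a short sentence with structured metadata; the sentence is presumably the part that gets embedded and matched by semantic search. Below is a minimal sketch of building one entry. The function name and the `{"text": ..., "metadata": ...}` shape are hypothetical, since the /kb/ingest payload is not documented here.

```python
from typing import Any, Dict


def client_behavior_record(client_id: str, client_name: str, sim_day: int,
                           gallons_requested: float, gallons_delivered: float,
                           status: str) -> Dict[str, Any]:
    """Build the sentence to embed plus its metadata (hypothetical record shape)."""
    text = (f"{client_name} on day {sim_day} requested {gallons_requested:.1f}g, "
            f"received {gallons_delivered:.1f}g ({status})")
    metadata = {
        "client_id": client_id,
        "client_name": client_name,
        "sim_day": sim_day,
        "gallons_requested": gallons_requested,
        "gallons_delivered": gallons_delivered,
        "status": status,
    }
    return {"text": text, "metadata": metadata}
```

For example, `client_behavior_record("c-1", "Alice", 42, 5.0, 5.0, "delivered")["text"]` is `"Alice on day 42 requested 5.0g, received 5.0g (delivered)"`, the sample record above.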

agent_decisions — one record per monthly summary

"Month 3: 47 deliveries, 218.5g delivered, 12.0g restocked, 6% failure rate.
 Top client: Alice (64.0g). Anomalies: No anomalies detected"
metadata: month_number, total_gallons_delivered, total_gallons_restocked,
          total_deliveries, failed_deliveries, top_client_name, top_client_gallons,
          anomaly_summary
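The monthly statistics are plain arithmetic done in Python before the monthly agent is called. The sketch below is not the project's implementation, and `summarise_month()` is a hypothetical name. It renders a sentence in the same shape as the sample record and the matching metadata keys. It makes three assumptions:

- Failed attempts count towards the total number of deliveries.
- A status of "delivered" means success.
- Anomaly detection happens elsewhere, so anomalies are passed in as ready-made strings.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List, Sequence, Tuple


def summarise_month(month_number: int, deliveries: Sequence[Dict[str, Any]],
                    total_gallons_restocked: float,
                    anomalies: Iterable[str] = ()) -> Tuple[str, Dict[str, Any]]:
    """Compute the monthly stats and render the knowledge-base summary text."""
    total = len(deliveries)
    failed = sum(1 for d in deliveries if d["status"] != "delivered")
    delivered = sum(d["gallons_delivered"] for d in deliveries)

    per_client: Dict[str, float] = defaultdict(float)
    for d in deliveries:
        per_client[d["client_name"]] += d["gallons_delivered"]
    top_name, top_gallons = max(per_client.items(), key=lambda kv: kv[1],
                                default=("none", 0.0))

    failure_pct = 100.0 * failed / total if total else 0.0
    anomaly_list: List[str] = list(anomalies)
    anomaly_summary = "; ".join(anomaly_list) if anomaly_list else "No anomalies detected"

    text = (f"Month {month_number}: {total} deliveries, {delivered:.1f}g delivered, "
            f"{total_gallons_restocked:.1f}g restocked, {failure_pct:.0f}% failure rate. "
            f"Top client: {top_name} ({top_gallons:.1f}g). Anomalies: {anomaly_summary}")
    metadata = {
        "month_number": month_number,
        "total_gallons_delivered": delivered,
        "total_gallons_restocked": total_gallons_restocked,
        "total_deliveries": total,
        "failed_deliveries": failed,
        "top_client_name": top_name,
        "top_client_gallons": top_gallons,
        "anomaly_summary": anomaly_summary,
    }
    return text, metadata
```

The LLM then only has to pass these pre-computed values to record_monthly_summary.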

The knowledge base is browsable in the dashboard (bottom two panels) and queryable via:

curl -s http://localhost:8081/kb/client_behaviors?limit=10
curl -s http://localhost:8081/kb/agent_decisions?limit=10

# semantic search
curl -s -X POST http://localhost:8081/kb/search \
  -H 'Content-Type: application/json' \
  -d '{"collection": "client_behaviors", "query_text": "out of stock", "limit": 5}'
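The same calls can be made from Python with only the standard library, for example from a notebook or a test script. Below is a minimal sketch mirroring the curl examples; the helper names are hypothetical. The response structure is not documented here, so `send()` simply returns the decoded JSON.

```python
import json
import urllib.parse
import urllib.request
from typing import Any

BASE_URL = "http://localhost:8081"  # gallon-service, same host port as the curl examples


def kb_recent_request(collection: str, limit: int = 10) -> urllib.request.Request:
    """GET /kb/{collection}?limit=N"""
    query = urllib.parse.urlencode({"limit": limit})
    return urllib.request.Request(f"{BASE_URL}/kb/{urllib.parse.quote(collection)}?{query}")


def kb_search_request(collection: str, query_text: str, limit: int = 5) -> urllib.request.Request:
    """POST /kb/search with the same JSON body as the curl example."""
    body = json.dumps({"collection": collection, "query_text": query_text, "limit": limit})
    return urllib.request.Request(
        f"{BASE_URL}/kb/search",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 5.0) -> Any:
    """Execute a request and decode the JSON response (needs the stack running)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the stack up, `send(kb_search_request("client_behaviors", "out of stock"))` is equivalent to the semantic-search curl command above.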

Service API reference

gallon-service :8081

| Method | Path | Description |
|---|---|---|
| GET | /inventories | Current inventory |
| POST | /inventories/restocks | Refill to max capacity |
| POST | /deliveries | Log a delivery (atomically deducts inventory) |
| GET | /deliveries | Delivery history (optional ?client_id= filter) |
| GET | /kb/{collection} | Recent KB entries (?limit=50) |
| POST | /kb/ingest | Add a record to the KB |
| POST | /kb/search | Semantic search |
| POST | /kb/embed | Embed a string (returns vector) |

gallon-agent :8080

| Method | Path | Description |
|---|---|---|
| GET | /ping | Health check |
| POST | /invocations | Agent entry point |
| GET | /conversations | Last 200 interactions (for the dashboard) |

Dashboard

Open http://localhost:8888. Refreshes every 2 seconds.

  • Header — agent health badge + last-updated time
  • Summary stats — alive clients, dead clients, total requests, gallons delivered
  • Inventory — live gauge with capacity bar
  • Clients table — per-client stock, status, request/fulfil counts
  • Delivery feed — last 20 deliveries, newest first
  • Agent conversation log — all agent interactions including monthly summaries, rendered as a chat
  • Knowledge base — last 30 entries from each KB collection, with metadata

All scrollable panels are resizable (drag the bottom edge).


Adapting this for your own agent

The gallon delivery domain is a placeholder. The pattern is the point.

  1. Define your tools in tools.py — plain Python functions with @tool and type annotations. The framework handles schema generation and invocation.

  2. Write a focused system prompt — tell the agent what it is and what single action it should take. One tool per agent is more reliable than a Swiss-army prompt.

  3. Wire it into FastAPI in server.py — call asyncio.to_thread(agent, prompt) from your handler. Clear agent.messages after each call if you want stateless invocations.

  4. Choose your model in agent.py via the provider factory. Switch with an env var, not a code change.

  5. Separate data access into its own service (gallon-service) so the agent never touches the database directly. This makes the agent testable, replaceable, and deployable independently.
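Step 5 is what makes the tool logic unit-testable. Below is a minimal sketch; the `InventoryService` interface, the `fulfil_delivery()` helper and the `FakeInventory` class are all hypothetical, not code from this repository. The delivery rule depends only on the interface, so a test can swap gallon-service for an in-memory fake. The "rejected" status is a placeholder. The real POST /deliveries deducts atomically, but this check-then-record version relies on agent calls being serialised.

```python
from typing import Protocol


class InventoryService(Protocol):
    """What the delivery logic needs from the data layer (hypothetical interface)."""
    def available(self) -> float: ...
    def record_delivery(self, client_id: str, gallons: float) -> float: ...  # returns remaining
    def restock(self) -> float: ...  # returns the new level


def fulfil_delivery(service: InventoryService, client_id: str,
                    gallons_requested: float, restock_threshold: float = 20.0) -> dict:
    """Core rule a @tool function such as deliver_gallons could delegate to."""
    if service.available() < gallons_requested:
        # placeholder rejection status; the project's own values may differ
        return {"delivered": False, "gallons_delivered": 0.0, "status": "rejected"}
    remaining = service.record_delivery(client_id, gallons_requested)
    if remaining < restock_threshold:
        service.restock()  # deterministic auto-restock, no LLM involved
    return {"delivered": True, "gallons_delivered": gallons_requested, "status": "delivered"}


class FakeInventory:
    """In-memory stand-in for gallon-service, for tests only."""

    def __init__(self, level: float, capacity: float = 100.0) -> None:
        self.level, self.capacity, self.restocks = level, capacity, 0

    def available(self) -> float:
        return self.level

    def record_delivery(self, client_id: str, gallons: float) -> float:
        self.level -= gallons
        return self.level

    def restock(self) -> float:
        self.restocks += 1
        self.level = self.capacity
        return self.level
```

In production the same function would receive an HTTP-backed implementation of the interface, for example one built on service_client.py.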


Tech stack

| Layer | Technology |
|---|---|
| Agent framework | Strands Agents |
| Local inference | Rapid-MLX (Apple Silicon) |
| Cloud inference | Amazon Bedrock, OpenAI |
| HTTP server | FastAPI + Uvicorn |
| HTTP client | httpx |
| Relational DB | PostgreSQL 16 via asyncpg |
| Vector DB | LanceDB |
| Embeddings | fastembed (BAAI/bge-small-en-v1.5) |
| Config | pydantic-settings |
| Packaging | uv + hatchling |
| Containers | Docker Compose |

Running on Apple Silicon (Colima)

If you are using Colima instead of Docker Desktop:

# first-time setup — allocate enough disk and mount your repo volume
colima start --disk 100 --mount /Volumes/Repository:w --mount $HOME:w

# start Rapid-MLX on the host (accessible from containers via host.docker.internal)
rapid-mlx serve qwen3-0.6b-8bit --port 8000

# then start the stack
docker compose up -d --build

If Colima was previously stopped and the runtime directory is missing:

mkdir -p ~/tmp/colima
colima start --disk 100 --mount /Volumes/Repository:w --mount $HOME:w

About

Fully Local AI Gallon Agent Simulation
