vix-eval

Docker-based test harness for evaluating coding agents in isolated containers.

Each test case is a YAML config. A single run_test.sh orchestrates: parse config, build image, run container with mitmproxy + agent, user interacts, then extract artifacts (diff, plan, flow file).

Prerequisites

Docker
Python 3.10+ with pip
Bash 4+

Install host-side dependencies:

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

.env Setup

cp .env.example .env
# Edit .env and set your Anthropic API key

Quickstart

pip install -r requirements.txt
cp .env.example .env && vim .env   # set ANTHROPIC_API_KEY
./run_test.sh tasks/example-test.yaml  --agent cc --settings settings/vix/

For single-agent configs (using the agent: key), the --agent flag is optional:

./run_test.sh tasks/single-agent-test.yaml

Config Reference

Test configs live in tasks/ as YAML files.

Single agent (backward compatible)

test_id: my-test              # Required. Unique identifier.
description: "What to do"     # Optional.

repository:
  enabled: true               # Default: true
  url: https://github.com/... # Repo to clone
  commit: ""                  # Optional. Specific commit to checkout.

proxy:
  enabled: true               # Default: true
  listen_host: "127.0.0.1"    # Default. Internal proxy bind address.
  listen_port: 58000          # Default. Proxy port inside container.
  web_port: 8099              # Default. mitmweb UI port (published to host).
  target: "https://api.anthropic.com"  # Default.
  flow_filename: "cc.flow"    # Default.

agent:
  type: claude-code            # Default.
  version: latest              # Default.

environment:
  extra_env:                   # Optional. Additional env vars.
    MY_VAR: "value"
    FROM_HOST: "${HOST_VAR}"   # Interpolated from host environment.

Multi-agent

test_id: my-test
description: "What to do"

repository:
  url: https://github.com/...

agents:
  claude-code:
    type: claude-code
    version: latest
  vix:
    type: vix
    version: latest

Use --agent <name> to select which agent to run:

./run_test.sh --agent vix tasks/my-test.yaml

The agent type maps to a Dockerfile: docker/Dockerfile.<type>. You cannot specify both agent: and agents: in the same config.

Output Artifacts

After a test run, artifacts are saved to tasks/<task_dir>/results/<agent_name>/<timestamp>/:

File	Description
`changes.diff`	Combined staged + unstaged + untracked changes
`plan.md`	Agent's plan file (from `.claude/plans/`)
`plans/`	Raw plan files as written by the agent
`cc.flow`	mitmproxy flow file (API traffic capture)

What Happens When the Container Exits

When you exit the interactive shell (Ctrl-D or exit), the following happens automatically before the container shuts down:

Diff generation (inside container): All changes the agent made to the cloned repo (staged, unstaged, and untracked files) are captured into /output/changes.diff. The repo itself is ephemeral and discarded with the container.
Container stops: The --rm flag removes the container. Only the /output volume (mapped to tasks/<task_dir>/results/<agent_name>/<timestamp>/) survives.
Artifact summary (on host): extract_artifacts.sh runs and prints a summary of what was captured — diff stats, plan files, and flow file size.

Plans and the flow file are written directly to /output during the session (via symlinks and mitmproxy's --save-stream-file), so they are available immediately.

Architecture

run_test.sh
├── scripts/parse_config.py       # YAML → .env key=value file
├── docker/Dockerfile.<type>      # Agent-specific Dockerfiles
│   ├── Dockerfile.claude-code    # node:20-bookworm + mitmproxy + claude-code
│   └── Dockerfile.vix            # golang:1.22-bookworm + mitmproxy + vix
├── docker/entrypoint.sh          # Clone repo, start proxy, launch shell, generate diff on exit
├── scripts/wait_for_proxy.sh     # Poll mitmweb until ready
└── scripts/extract_artifacts.sh  # Post-run: summarize and copy plan artifacts

Each agent type has its own Dockerfile at docker/Dockerfile.<type> with all runtime dependencies and the agent itself baked in.

Isolation: Each test runs in a fresh Docker container. The cloned repo lives only inside the container.
Proxy: mitmproxy runs as a reverse proxy to Anthropic's API, capturing all traffic.
Artifacts only: Only the diff, plans, and flow file are persisted to the host — no repo checkout on disk.
Security: ANTHROPIC_API_KEY is passed via -e, never written to disk in env files.

Proxy Web UI

While a test is running with proxy.enabled: true (the default), mitmweb's web interface is accessible from the host at:

http://localhost:8099

The UI lets you inspect all API traffic between the agent and Anthropic in real time — requests, responses, headers, and bodies. The port is configurable via proxy.web_port in the test config.

vixd

For type: vix agents, vixd starts automatically in the background when the container launches. Its working directory is /workspace (the cloned repo).

Logs are written to /output/vixd.log and persisted to the host alongside other artifacts:

# Inside container
tail -f /output/vixd.log

# From host
tail -f tasks/<task_dir>/results/<agent_name>/<timestamp>/vixd.log

The daemon PID and log path are shown in the container banner at startup.

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
docker		docker
results		results
scripts		scripts
settings/vix		settings/vix
tasks		tasks
.dockerignore		.dockerignore
.env.example		.env.example
.gitignore		.gitignore
README.md		README.md
comparison_dashboard.html		comparison_dashboard.html
dashboard.html		dashboard.html
requirements.txt		requirements.txt
run_test.sh		run_test.sh
test_build.sh		test_build.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

vix-eval

Prerequisites

.env Setup

Quickstart

Config Reference

Single agent (backward compatible)

Multi-agent

Output Artifacts

What Happens When the Container Exits

Architecture

Proxy Web UI

vixd

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

vix-eval

Prerequisites

.env Setup

Quickstart

Config Reference

Single agent (backward compatible)

Multi-agent

Output Artifacts

What Happens When the Container Exits

Architecture

Proxy Web UI

vixd

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages