This repository contains the full source code for the "PromptPilot" major project.
It is organized into two main parts:
- Frontend: built with React, Vite, TailwindCSS, and Shadcn/UI.
  - src/: Contains the UI logic, components (PromptEditor, VersionHistory), and API client.
  - public/: Static assets.
  - Configuration: vite.config.ts, tailwind.config.js, tsconfig.json, package.json (required for building and running the web app).
- Backend: built with Python, FastAPI, Gemini, and Groq.
  - Located in the chronicle/ folder.
  - Handles database (PostgreSQL), AI simulations, and Version Control logic.
- Start Backend:
      cd chronicle
      docker-compose up -d
- Start Frontend:
      # (From root)
      npm run dev
Open http://localhost:5173 to view the app.
A prompt version control system with execution tracking, built on FastAPI and PostgreSQL.
Chronicle lets you create prompts, manage immutable versions with full history, promote versions to production, execute them against an LLM (Groq), and track every run with cost and latency data. Deleted records are soft-deleted by default and can be restored from the Trash Bin. Items in the Trash Bin are automatically pruned after 20 days.
All API endpoints require an X-API-Key header. The GUI and docs are excluded from authentication.
chronicle/
├── main.py # FastAPI app, mounts GUI and API routers
├── config.py # Settings via pydantic-settings (.env)
├── db.py # Async SQLAlchemy engine + session factory
├── version_control/ # Prompt & version CRUD, promotion, alias history
│ ├── models.py # Prompt, PromptVersion (SQLAlchemy)
│ ├── alias_history.py # AliasHistory model (promotion audit trail)
│ ├── schemas.py # Pydantic request/response models
│ └── routes.py # API endpoints for prompts, versions, promotion
├── execution/ # Prompt execution against LLM
│ ├── models.py # Run model (stores every execution)
│ ├── llm_service.py # Groq API wrapper
│ ├── variable_engine.py # {{variable}} injection with strict mode
│ ├── pricing.py # Model pricing table + cost calculation
│ ├── schemas.py # ExecuteRequest, ExecuteResponse, RunRead
│ └── routes.py # POST /execute/{prompt_key}
├── files/ # Frontend (served at /gui)
│ ├── index.html
│ ├── style.css
│ └── app.js
├── alembic/ # Database migrations
├── docs/ # Documentation
│ ├── design-specification.md
│ ├── project-structure.md
│ ├── local-setup-guide.md
│ ├── phase1-implementation-status.md
│ └── testing-guide.md
├── evaluation/ # Eval pipelines, metrics, Pareto computation
│ ├── evaluators.py # Exact match, LLM Judge, and Confidence calcs
│ ├── metrics.py # Pareto frontier logic and geometric knee points
│ ├── models.py # Evaluation schemas for the database
│ ├── orchestrator.py # Background task runner and rollback logic
│ ├── routes.py # API endpoints for executing tests and fetching scores
│ └── schemas.py # Data validation for eval API requests
├── synthetic/ # Synthetic data generation and validation
│ ├── cli.py # Command-line hooks for synthetic job execution
│ ├── config.py # YAML configuration parsing logic
│ ├── diversity.py # Jaccard similarity calculation for deduplication
│ ├── generator.py # Primary generator model LLM logic
│ ├── prompts.py # System prompts utilized strictly for synthesizing
│ └── validator.py # Secondary model grading logic ensuring output schema
└── tests/ # 183 integration + unit tests
| Method | Endpoint | Description |
|---|---|---|
| POST | /prompts | Create a prompt (title, created_by; key is auto-generated if omitted) |
| GET | /prompts | List all prompts with latest version |
| GET | /prompts/{id} | Get prompt by ID with latest version |
| DELETE | /prompts/{id} | Delete prompt and all its versions |
| POST | /prompts/{id}/promote | Set production version (body: {version_id}) |
| GET | /prompts/{id}/alias-history | Get promotion audit trail |
| POST | /versions | Create immutable version (prompt_id, prompt_text, model_settings, created_by) |
| GET | /versions/{prompt_id}/history | List all versions for a prompt |
| GET | /versions/latest/{prompt_id} | Get latest version |
| Method | Endpoint | Description |
|---|---|---|
| POST | /execute/{prompt_key} | Execute production version with variables |
The execute endpoint injects the supplied variables into the {{placeholders}} of the production version, calls the LLM, records a Run row with status, latency, and cost, and returns the response. The X-PromptOps-Run-ID response header contains the run ID.
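The injection step described above can be sketched as follows. This is a minimal illustration, not the contents of variable_engine.py: the function name `inject_variables`, the `strict` parameter, and the choice to raise `KeyError` are assumptions made for the example.

```python
import re

# Matches {{ name }} with optional inner whitespace. The exact syntax is
# assumed from the {{placeholders}} notation used in this README.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def inject_variables(template: str, variables: dict[str, str], strict: bool = True) -> str:
    """Replace each {{name}} in the template with variables[name].

    In strict mode (an assumed behaviour), any placeholder without a
    matching variable raises KeyError before anything is substituted.
    """
    missing = sorted({n for n in PLACEHOLDER.findall(template) if n not in variables})
    if strict and missing:
        raise KeyError(f"missing variables: {', '.join(missing)}")

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        # In non-strict mode an unknown placeholder is left as written.
        return str(variables.get(name, match.group(0)))

    return PLACEHOLDER.sub(substitute, template)
```

Failing before the LLM call in strict mode keeps a half-filled prompt from being sent and billed.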
All API endpoints require an X-API-Key header matching the configured key. Excluded paths: /gui/*, /health, /docs, /redoc, /openapi.json.
# Example authenticated request
Invoke-WebRequest -Uri http://localhost:8000/api/v1/version-control/prompts `
  -Headers @{"X-API-Key"="chronicle-dev-key"}

The GUI automatically includes the API key in all fetch calls.
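The exclusion rule for authentication can be expressed as a small path check. The sketch below uses only the standard library and is not the project's middleware: `requires_api_key` and `is_authorized` are hypothetical names, and treating `/gui/*` as a simple prefix match is an assumption.

```python
import hmac

# Paths this README lists as excluded from authentication.
EXCLUDED_PREFIXES = ("/gui/",)
EXCLUDED_EXACT = {"/health", "/docs", "/redoc", "/openapi.json"}


def requires_api_key(path: str) -> bool:
    """Return True when the path is not on the exclusion list."""
    return not (path in EXCLUDED_EXACT or path.startswith(EXCLUDED_PREFIXES))


def is_authorized(path: str, headers: dict[str, str], configured_key: str) -> bool:
    """Check the X-API-Key header against the configured key."""
    if not requires_api_key(path):
        return True
    # HTTP header names are case-insensitive, so normalise before lookup.
    supplied = {k.lower(): v for k, v in headers.items()}.get("x-api-key", "")
    # compare_digest avoids leaking key contents through timing differences.
    return hmac.compare_digest(supplied.encode(), configured_key.encode())
```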
The header contains a segmented control with three themes:
| Theme | Description |
|---|---|
| Dark | Cold dark (blue accent, #0B0E11 canvas) |
| Light | Clean white (blue accent, #FFFFFF canvas) |
| Matte | Warm off-black inspired by Claude.ai (amber-orange accent, #1a1612 canvas) |
Matte is the default. Selection persists to localStorage under key chronicle-theme.
- Live token bar — below the prompt editor textarea, shows estimated token count and per-call cost as you type
- Estimate Cost button — next to Execute in the execution panel, shows a detailed breakdown (tokens, input cost, model) for the production version before running it
Chronicle now includes a fully-featured Command Line Interface (CLI) to manage the prompt execution environment directly from your terminal.
- init: Configure backend connections to your Chronicle instance.
- use: Switch between different environments easily.
- list: List all available prompts natively.
- versions: Show the version history for specific prompts.
- execute: Run a prompt with injected variables straight from the terminal.
- eval datasets: List all evaluation datasets with task type and example counts.
- eval run: Launch an eval job interactively or via flags (--prompt, --version, --dataset, --evaluators).
- eval status <job-id>: Check job status with optional --watch for live polling.
- eval report <job-id>: View full job report: summary metrics, results table, and calibration data.
- eval compare <dataset-id>: Pareto frontier analysis with recommendation panel, ranked table, and ASCII scatter plot.
- Version Control System: Immutable prompt versions with full history, safe experimentation, and promotion workflows.
- Execution Tracking: Every LLM execution is recorded with cost and latency data natively linked to PromptOps traces.
- GUI Interface: A robust 3-column layout web interface featuring theme support and live cost/latency estimation.
- CLI Tooling: A dynamic terminal interface for developers to interact with Chronicle's API securely without context switching.
- Reliable Backend Platform: FastAPI and PostgreSQL powering a fast, strongly typed, and reliable PromptOps operational lifecycle.
Dataset Management
- Create datasets with name, description, task_type (classification/qa/generation)
- Add examples individually, in bulk, or via CSV upload
- Soft-delete protection — no data loss
Eval Job Orchestrator
- POST /api/v1/eval/jobs — creates job, runs in background via FastAPI BackgroundTasks
- Pre-inserts Run row with pending status before LLM call, for crash-safe execution
- Per-example rollback: one failed example does not abort the job
- Polls each example against the dataset, stores raw output and is_correct
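The pending-first pattern and per-example isolation listed above can be illustrated in memory. This sketch assumes a hypothetical `Run` dataclass and a `call_llm` callable in place of the database row and the Groq wrapper, and it uses try/except where the real orchestrator may be structured differently.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Run:
    """Hypothetical in-memory stand-in for the Run row."""
    example_id: int
    status: str = "pending"
    output: Optional[str] = None
    error: Optional[str] = None


def run_examples(examples: dict[int, str], call_llm: Callable[[str], str]) -> list[Run]:
    """Execute every example, recording each run before the LLM is called."""
    runs: list[Run] = []
    for example_id, prompt in examples.items():
        run = Run(example_id=example_id)  # recorded as pending before the call
        runs.append(run)
        try:
            run.output = call_llm(prompt)
            run.status = "success"
        except Exception as exc:
            # A failure is confined to this example; the loop continues.
            run.status = "error"
            run.error = str(exc)
    return runs
```

Because the record exists before the call, a crash mid-request leaves a visible pending entry rather than no trace at all.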
Three Evaluators
- exact_match: normalised string comparison (lowercase, stripped)
- llm_judge: secondary LLM scores output 0.0–1.0 against expected
- confidence_calibration: MCE computation across confidence buckets
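As a rough illustration of the exact_match evaluator, the comparison below lowercases and strips both strings before testing equality. The helper names and the `accuracy` aggregation are assumptions for this sketch; evaluators.py may normalise further.

```python
def normalise(text: str) -> str:
    """Lowercase and strip surrounding whitespace."""
    return text.strip().lower()


def exact_match(output: str, expected: str) -> bool:
    """True when output and expected are equal after normalisation."""
    return normalise(output) == normalise(expected)


def accuracy(pairs: list[tuple[str, str]]) -> float:
    """Fraction of (output, expected) pairs that match exactly."""
    if not pairs:
        return 0.0
    return sum(exact_match(o, e) for o, e in pairs) / len(pairs)
```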
Metrics & Summary
- compute_summary — accuracy, mean_evaluator_score, cost_per_correct, p50/p95 latency, MCE
- compute_pareto_frontier — dominance loop across cost_per_correct vs accuracy
- identify_knee_point — perpendicular geometric distance in normalised [0,1] space
- generate_recommendation — plain English deployment recommendation
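The dominance loop behind the Pareto frontier can be sketched in a few lines. The `Point` type and function names here are hypothetical and not taken from metrics.py; the sample rows are the four versions with a known cost in the results table of this README.

```python
from typing import NamedTuple


class Point(NamedTuple):
    label: str
    accuracy: float          # higher is better
    cost_per_correct: float  # lower is better


def dominates(a: Point, b: Point) -> bool:
    """a dominates b if it is no worse on both axes and strictly better on one."""
    no_worse = a.accuracy >= b.accuracy and a.cost_per_correct <= b.cost_per_correct
    better = a.accuracy > b.accuracy or a.cost_per_correct < b.cost_per_correct
    return no_worse and better


def pareto_frontier(points: list[Point]) -> list[Point]:
    """Keep every point that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


rows = [
    Point("v1141", 100.0, 0.000094),
    Point("v1131", 96.7, 0.000068),
    Point("v1136", 92.9, 0.000064),
    Point("v1135", 95.0, 0.000100),
]
print([p.label for p in pareto_frontier(rows)])  # ['v1141', 'v1131', 'v1136']
```

v1135 drops out because v1131 is both more accurate and cheaper per correct answer.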
Comparison & Dashboard API
- GET /api/v1/eval/compare/dataset/{id} — auto-compare all jobs on a dataset
- GET /api/v1/eval/dashboard — platform-wide metrics
- GET /api/v1/eval/jobs/{id}/report — full job report for CLI
- GET /api/v1/eval/datasets/{id}/leaderboard — ranked versions with Pareto flags
Synthetic Data Generator
- Two-model pipeline: generator + validator (different models for independence)
- Jaccard similarity deduplication — enforces diversity across examples
- YAML config driven — task_type, num_examples, output_schema, diversity_threshold
- Datasets: ecommerce-sentiment-eval (21 examples), science-qa-eval (24 examples)
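Jaccard-based deduplication can be sketched as below. Whitespace tokenisation and the rule that a candidate is dropped when its similarity to any kept example reaches the threshold are assumptions for this example; diversity.py may tokenise and compare differently.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the lowercase whitespace-token sets of a and b."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty strings are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def deduplicate(examples: list[str], threshold: float) -> list[str]:
    """Keep an example only if it is below the threshold against all kept ones."""
    kept: list[str] = []
    for candidate in examples:
        if all(jaccard(candidate, existing) < threshold for existing in kept):
            kept.append(candidate)
    return kept
```

A lower threshold enforces more diversity at the cost of discarding more generated examples.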
Frontend
- Full React UI wired to all backend APIs — zero mock data
- Pareto scatter plot with proper min/max normalisation
- Real-time eval job polling (2s interval, auto-clears on completion)
- Run Eval Job modal with prompt/version/evaluator selection
- Health bar wired to GET /api/v1/eval/dashboard
6-point frontier computed across 3 prompt versions × 2 models:
| Version | Model | Accuracy | Cost/Correct | Pareto Optimal | Knee |
|---|---|---|---|---|---|
| v1141 | llama-3.3-70b-versatile | 100% | $0.000094 | ✓ | |
| v1131 | llama-3.3-70b-versatile | 96.7% | $0.000068 | ✓ | ✓ |
| v1136 | llama-3.3-70b-versatile | 92.9% | $0.000064 | ✓ | |
| v1135 | llama-3.3-70b-versatile | 95.0% | $0.000100 | ✗ (dominated) | |
| v1130 | llama-3.3-70b-versatile | 0.0% | — | ✗ | |
| v1134 | llama-3.1-8b-instant | 90.0% | — (pricing gap) | ✗ | |
Recommendation generated: "Version 1131 on llama-3.3-70b-versatile offers the best cost-accuracy balance (knee point). Recommended for production unless accuracy above 97% is required, in which case version 1141 achieves 100% at 1.4x the cost."
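The knee flag in the table can be reproduced with a perpendicular-distance calculation in normalised space. The sketch below is one common reading of that idea, not the code in metrics.py: it min-max scales both axes over the frontier points, draws a chord between the cheapest and the most expensive point, and picks the point farthest from that chord. The function name and input shape are assumptions.

```python
import math


def knee_point(frontier: list[tuple[str, float, float]]) -> str:
    """frontier holds (label, accuracy, cost_per_correct); returns the knee label."""
    accs = [a for _, a, _ in frontier]
    costs = [c for _, _, c in frontier]

    def scale(value: float, lo: float, hi: float) -> float:
        return 0.0 if hi == lo else (value - lo) / (hi - lo)

    # (label, x = normalised cost, y = normalised accuracy), sorted by cost.
    pts = sorted(
        ((label, scale(c, min(costs), max(costs)), scale(a, min(accs), max(accs)))
         for label, a, c in frontier),
        key=lambda p: p[1],
    )
    (_, x1, y1), (_, x2, y2) = pts[0], pts[-1]
    length = math.hypot(x2 - x1, y2 - y1)
    if length == 0:
        return pts[0][0]  # a single point (or identical points) is its own knee

    def distance(p: tuple[str, float, float]) -> float:
        _, x, y = p
        # Perpendicular distance from (x, y) to the line through the endpoints.
        return abs((y2 - y1) * x - (x2 - x1) * y + x2 * y1 - y2 * x1) / length

    return max(pts, key=distance)[0]


frontier = [("v1141", 100.0, 0.000094), ("v1131", 96.7, 0.000068), ("v1136", 92.9, 0.000064)]
print(knee_point(frontier))  # v1131
```

The 1.4x figure in the recommendation is the ratio of the two costs per correct answer: 0.000094 / 0.000068 is about 1.38.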
- Versions are immutable. No update or delete on versions. Fix a bad version by creating a new one. All deletions are now soft-deletes with an auto-pruning job to prevent accidental hard deletion of alias history.
- Promotion model. Each prompt has a production_version_id. Promoting a version updates this pointer and logs the change to AliasHistory.
- Pre-insert pending run. The Run row is inserted with status=pending before the LLM call, then updated to success or error in a try/finally. No silent failures.
- Cost tracking. Per-execution cost is calculated from token usage and stored on the Run row. Unknown models get cost_usd=None.
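The promotion model can be pictured as a pointer update paired with an append-only log. The in-memory classes below are hypothetical stand-ins for the Prompt and AliasHistory models; field names other than production_version_id are assumptions.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class AliasChange:
    """One entry in the promotion audit trail (hypothetical shape)."""
    old_version_id: Optional[int]
    new_version_id: int
    changed_at: datetime


@dataclass
class Prompt:
    key: str
    version_ids: list[int] = field(default_factory=list)
    production_version_id: Optional[int] = None
    alias_history: list[AliasChange] = field(default_factory=list)

    def promote(self, version_id: int) -> None:
        """Point production at an existing version and log the change."""
        if version_id not in self.version_ids:
            raise ValueError(f"version {version_id} does not belong to {self.key}")
        self.alias_history.append(
            AliasChange(self.production_version_id, version_id, datetime.now(timezone.utc))
        )
        self.production_version_id = version_id
```

Since versions never change, rolling back is just another promotion to an earlier version ID.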
| Model | Provider | Tier |
|---|---|---|
| llama-3.3-70b-versatile | Groq | Primary |
| llama-3.1-8b-instant | Groq | Fast/Cheap |
| llama3-70b-8192 | Groq | Legacy |
| mixtral-8x7b-32768 | Groq | Legacy |
| gemma2-9b-it | Groq | Compact |
| openai/gpt-oss-120b | OpenRouter | Validation |
Note: A model with no entry in the pricing table is a known gap. Jobs that use such a model have no cost, which affects Pareto cost computation.
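The effect of a pricing gap can be seen in a small cost calculation. The model name and prices below are placeholders invented for illustration, not real Groq pricing, and both function names are hypothetical rather than taken from pricing.py.

```python
from typing import Optional

# Hypothetical USD prices per million tokens as (input, output).
PRICING: dict[str, tuple[float, float]] = {
    "example-large-model": (0.50, 1.00),
}


def calculate_cost(model: str, input_tokens: int, output_tokens: int) -> Optional[float]:
    """Cost of one execution in USD, or None when the model has no pricing."""
    rates = PRICING.get(model)
    if rates is None:
        return None
    input_rate, output_rate = rates
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


def cost_per_correct(costs: list[Optional[float]], num_correct: int) -> Optional[float]:
    """Total cost divided by correct answers; None if it cannot be computed."""
    if num_correct == 0 or any(c is None for c in costs):
        return None
    return sum(costs) / num_correct
```

Under these assumptions a job has no cost per correct answer either when pricing is missing or when nothing was answered correctly, which matches the two dashes in the results table.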
- Python 3.11+
- PostgreSQL (local or Docker)
# Clone and enter project
cd chronicle
# Create and activate virtual environment
python -m venv .venv
.venv\Scripts\Activate.ps1
# Install dependencies
pip install -r requirements.txt
# Configure environment
# Edit .env — set DATABASE_URL and GROQ_API_KEY
# Run migrations
alembic upgrade head
# Start server
uvicorn main:app --reload

Open http://localhost:8000/gui/ for the web UI, or http://localhost:8000/docs for Swagger.

With Docker:
docker-compose up --build

# Run all 183 tests
.venv\Scripts\python.exe -m pytest -v
# Run a specific section
.venv\Scripts\python.exe -m pytest test_chronicle_full.py -v -k "TestExecutionBoundary"

Test coverage spans prompt CRUD, version control, promotion + alias history, execution boundary, run integrity, cost calculation, authentication (missing key, wrong key, correct key, auto-generated key), and the alias history endpoint.
See docs/testing-guide.md for the full breakdown of what each test verifies and manual frontend testing instructions.
| Variable | Required | Default | Description |
|---|---|---|---|
| DATABASE_URL | Yes | localhost | Async PostgreSQL connection string |
| GROQ_API_KEY | Yes | — | Groq API key for LLM calls |
| API_KEY | No | chronicle-dev-key | API key for authenticating requests |
| DEFAULT_LLM_MODEL | No | llama-3.3-70b-versatile | Default model for execution |
| ENVIRONMENT | No | development | Runtime environment |
| LOG_LEVEL | No | INFO | Logging level |
| Document | Description |
|---|---|
| Design Specification | Full system design, schema diagrams, and phase planning |
| Project Structure | Module-by-module breakdown of the codebase |
| Local Setup Guide | Step-by-step Windows local setup (no Docker) |
| Phase 1 Status | Implementation checklist for all Phase-1 components |
| Testing Guide | What to test, how to test, and expected results |

