A stateless, deterministic evaluation engine that orchestrates domain-specific agents in a multi-agent workflow and applies fixed aggregation rules to produce structured decisions.
When evaluating complex, multi-dimensional decisions (startup ideas, technical proposals, design reviews), manual review is serial and subjective. Running multiple independent evaluators in parallel and combining their outputs deterministically allows:
- Parallel assessment across orthogonal dimensions (market, business, technical)
- Reproducible decision logic auditable by inspection
- Decoupling of decision rules from LLM outputs
This engine provides the scaffolding for that pattern.
This engine is:
- A stateless orchestration layer for multi-agent workflows
- A framework for plugging in domain-specific evaluators
- Deterministic (given the same inputs and LLM, produces identical results)
- Designed for evaluation tasks that decompose into independent parallel assessments
- Currently deployed to evaluate early-stage B2B startup ideas
This engine is not:
- A general-purpose LLM orchestration platform (no built-in retries, rate limiting, cost controls, or concurrency management)
- A persistent system (no database, no state durability across restarts)
- An optimization or tuning system (no statistical learning, no parameter adaptation across runs)
- An autonomous agent framework (agents follow fixed evaluation logic, not free exploration)
- A user-facing product (no UI, no auth, no multi-tenancy)
- A research framework (no hypothesis testing infrastructure, no logging designed for analysis)
The engine treats evaluation as a directed acyclic graph (DAG):
- Generator: Produces initial structured input (e.g., startup brief)
- Evaluators (parallel): Each independently assesses the input on a specific dimension, producing a partial state update
- Arbiter: Applies fixed aggregation rules to evaluator outputs, producing a final decision
Each node receives the full engine state and returns a dict of keys to merge back into state. The graph runtime manages orchestration.
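The node contract can be sketched as follows. This is illustrative only: the function body is a stub, and the `merge` helper stands in for whatever the graph runtime actually does when folding a node's patch back into state.

```python
from typing import Any

def market_evaluator(state: dict, context: Any) -> dict:
    """Hypothetical node following the (state, context) -> dict contract."""
    brief = state["brief"]  # read the input produced by the generator node
    # A real evaluator would call context.llm.generate(...) and parse the reply.
    return {  # partial state patch: only the keys this node owns
        "market_eval": {
            "component": "market",
            "status": "PASS",
            "confidence": 0.8,
            "reason": f"Stubbed assessment of: {brief}",
        }
    }

def merge(state: dict, patch: dict) -> dict:
    """How a graph runtime might fold a node's patch into shared state."""
    return {**state, **patch}
```

Because each node only returns the keys it owns, parallel evaluators never write to the same slot and their patches can be merged in any order.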
EngineState is the canonical data structure passed through the graph. It contains:
- `run_id`: Unique identifier for this evaluation
- `brief`: Generated input (e.g., startup concept)
- `market_eval`: EvalResult from market dimension
- `business_eval`: EvalResult from business dimension
- `technical_eval`: EvalResult from technical dimension
- `final_decision`: Arbiter output ("BUILD" or "KILL")
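A sketch of what the `core/state.py` definitions might look like, using the field names above (`total=False` is an assumption, since keys are filled in incrementally as the graph runs):

```python
from typing import Literal, TypedDict

class EvalResult(TypedDict):
    component: str                    # evaluator name, e.g. "market"
    status: Literal["PASS", "KILL"]
    confidence: float                 # in [0, 1]
    reason: str

class EngineState(TypedDict, total=False):
    run_id: str
    brief: str
    market_eval: EvalResult
    business_eval: EvalResult
    technical_eval: EvalResult
    final_decision: Literal["BUILD", "KILL"]
```

Note that `TypedDict` is a static typing aid only; it performs no runtime validation, which is why agents must validate LLM output themselves.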
EvalResult is the typed output of each evaluator:
```
{
  "component": str,          # Evaluator name (e.g., "market")
  "status": "PASS" | "KILL",
  "confidence": float,       # [0, 1]
  "reason": str              # Explanation
}
```

The arbiter applies a simple, fixed rule: all evaluators must return "PASS" for the final decision to be "BUILD"; any "KILL" result sets the final decision to "KILL".
This rule is rejection-first: conservative, auditable, and biased toward caution.
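The rule fits in a few lines; a sketch (the function name is assumed, and the real logic lives in `agents/arbiter.py`):

```python
def arbitrate(state: dict) -> dict:
    """Fixed aggregation: BUILD only if every evaluator passed."""
    evals = [state["market_eval"], state["business_eval"], state["technical_eval"]]
    decision = "BUILD" if all(e["status"] == "PASS" for e in evals) else "KILL"
    return {"final_decision": decision}
```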
Each run is fully described by EngineState. Given identical inputs and the same LLM implementation (mock or real with fixed random seed), the engine produces identical outputs. No run depends on previous runs or external state.
This enables:
- Reproducible evaluation (same brief always produces same decision)
- Hermetic testing (no flaky integration)
- Debugging by rerunning with the same state
| Module | Responsibility |
|---|---|
| `main.py` | Entry point; composes initial state, instantiates context, invokes graph, persists shadow log |
| `graph.py` | Builds and compiles the `StateGraph` with nodes and edges |
| `core/state.py` | TypedDict definitions for `EngineState` and `EvalResult` |
| `core/context.py` | `ExecutionContext` container passed to agents (holds LLM, optional retriever/tools/config) |
| `core/logger.py` | File-based shadow logging (writes per-run JSON snapshot to `logs/`) |
| `agents/*` | Domain-specific evaluators and generator; implement `(state, context) -> dict` |
| `llm/base.py` | `LLMClient` interface (minimal: `generate(system: str, user: str) -> str`) |
| `llm/factory.py` | Selects LLM implementation via the `LLM_PROVIDER` env var |
| `llm/mock_llm.py` | Deterministic mock for local dev and tests |
| `llm/openai_llm.py` | OpenAI API client |
- Core: State management, graph orchestration, logging. Domain-agnostic.
- Domain: Evaluator prompts, parsing logic, evaluation criteria. Replaceable.
- LLM: Client implementations. Pluggable via factory pattern.
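The factory selection might look like the sketch below. The class bodies and the default provider are assumptions; only the `generate(system, user) -> str` interface and the `LLM_PROVIDER` env var come from the layout above.

```python
import os

class LLMClient:
    """Minimal interface; concrete backends override generate()."""
    def generate(self, system: str, user: str) -> str:
        raise NotImplementedError

class MockLLM(LLMClient):
    def generate(self, system: str, user: str) -> str:
        # Deterministic canned response for local dev and tests.
        return '{"component": "mock", "status": "PASS", "confidence": 1.0, "reason": "stub"}'

def make_llm() -> LLMClient:
    """Select a backend from the LLM_PROVIDER environment variable."""
    provider = os.environ.get("LLM_PROVIDER", "mock")
    if provider == "mock":
        return MockLLM()
    raise ValueError(f"Unknown LLM_PROVIDER: {provider}")
```

Keeping selection behind a single factory function is what makes the domain and core layers agnostic to which backend is in use.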
1. **Initialization** (`main.py::run_once`): Create initial `EngineState` with a unique `run_id`, instantiate `ExecutionContext` with the selected LLM, build the graph.
2. **Generation** (`agents/generator.py`): LLM produces a structured brief (e.g., startup concept). Returns `{"brief": ...}`.
3. **Parallel Evaluation**: Three evaluators run in parallel, each independently reading the brief and applying domain-specific logic:
   - Market evaluator: Assesses market size, timing, TAM. Returns `{"market_eval": EvalResult}`.
   - Business evaluator: Assesses business model, monetization, defensibility. Returns `{"business_eval": EvalResult}`.
   - Technical evaluator: Assesses technical feasibility, architecture, risk. Returns `{"technical_eval": EvalResult}`.
4. **Arbitration** (`agents/arbiter.py`): Reads all three eval results, applies the fixed rule (all PASS → BUILD, any KILL → KILL). Returns `{"final_decision": "BUILD" | "KILL"}`.
5. **Logging** (`core/logger.py`): Full final state written as JSON to `logs/{run_id}_{timestamp}.json`.
6. **Output** (`main.py`): Final state printed to stdout.
- Each agent function accepts `(state: EngineState, context: ExecutionContext)` and returns a dict of keys to merge into state.
- LLM outputs are not validated by the engine; agents must parse and validate before returning.
- `EngineState` keys are explicitly defined; new keys should not be added without updating the TypedDict.
- Status values must match the `Status` enum: `"PASS"` or `"KILL"`.
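Since the engine itself performs no validation, each agent should guard its own parse before returning a patch. A stdlib-only sketch (the helper name is hypothetical; pydantic models would serve the same purpose):

```python
import json

def parse_eval_result(raw: str) -> dict:
    """Parse and validate an LLM response before merging it into state."""
    data = json.loads(raw)  # raises ValueError on malformed JSON
    for key in ("component", "status", "reason"):
        if not isinstance(data.get(key), str):
            raise ValueError(f"missing or non-string field: {key}")
    if data["status"] not in ("PASS", "KILL"):
        raise ValueError(f"invalid status: {data['status']!r}")
    conf = data.get("confidence")
    if not isinstance(conf, (int, float)) or not 0.0 <= conf <= 1.0:
        raise ValueError(f"confidence must be in [0, 1], got {conf!r}")
    return data
```

Failing fast here keeps malformed LLM output from silently corrupting `EngineState` downstream.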
- Statelessness first: No run-to-run state; each evaluation is self-contained.
- Explicit over implicit: Contracts are documented in docstrings; magic is avoided.
- Rejection-first arbitration: Conservative bias toward KILL reduces downstream costs.
- LLM as a component: Treated as an external dependency, not the source of truth. Outputs must be parsed and validated.
- Auditability: Logs and decision rules are simple enough to inspect and understand without tooling.
- Evaluators run in parallel; they cannot depend on each other's results.
- Graph topology is fixed (not dynamically generated).
- No inter-run state or learning; each run is isolated.
- No cost controls, retries, or rate limiting in the engine (delegate to LLM layer or orchestrator).
- Multi-agent evaluation of B2B startup ideas across market, business, and technical dimensions
- Deterministic arbitration rule and shadow logging
- Pluggable LLM backends (mock and OpenAI)
- Reproducible execution with mock LLM
- Per-run JSON logs for debugging
- Core engine (graph orchestration, state merging, arbitration rule)
- TypedDict contract for EngineState and EvalResult
- LLMClient interface
- Evaluator prompts and parsing logic (domain-specific; may be refactored as use cases expand)
- Shadow logging format (currently simple JSON; may add structured fields for analytics)
1. Create `agents/new_eval.py` with function `new_evaluator(state: EngineState, context: ExecutionContext) -> dict`.
2. Load your prompt from `prompts/new.txt`.
3. Call `context.llm.generate(system, user)` and parse the result.
4. Return `{"new_eval": EvalResult}` (update the `EngineState` TypedDict to include this key).
5. Add the node and edges in `graph.py`.
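The steps above might come together like this. The prompt is inlined for a self-contained sketch (the real one would be loaded from `prompts/new.txt`), and schema validation is elided:

```python
import json

def new_evaluator(state: dict, context) -> dict:
    """Hypothetical evaluator following the (state, context) -> dict contract."""
    # In the real layout this prompt would be loaded from prompts/new.txt.
    system = "You are an evaluator. Reply with EvalResult JSON."
    raw = context.llm.generate(system, state["brief"])
    result = json.loads(raw)  # production code should validate the schema here
    # Remaining steps: add "new_eval" to the EngineState TypedDict and
    # register this node (plus its edges) in graph.py.
    return {"new_eval": result}
```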
1. Create `llm/new_llm.py` with class `NewLLM(LLMClient)` implementing `generate(system: str, user: str) -> str`.
2. Update `llm/factory.py` to instantiate your backend when `LLM_PROVIDER=new_llm`.
3. Optionally add environment variable configuration for API keys or endpoints.
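A backend skeleton matching the interface (the base class is re-declared here so the sketch is self-contained; the real one lives in `llm/base.py`, and the canned return value is a stand-in for a real API call):

```python
class LLMClient:
    """Minimal interface, mirroring llm/base.py."""
    def generate(self, system: str, user: str) -> str:
        raise NotImplementedError

class NewLLM(LLMClient):
    """Hypothetical backend; swap the body for a real provider call."""
    def __init__(self, api_key: str):
        self.api_key = api_key

    def generate(self, system: str, user: str) -> str:
        # A real implementation would send system/user to the provider
        # and return the raw completion text.
        return '{"component": "stub", "status": "PASS", "confidence": 1.0, "reason": "demo"}'
```

Registration in `llm/factory.py` is then one extra branch that constructs `NewLLM` when `LLM_PROVIDER=new_llm` (the API-key env-var name is up to you).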
1. Replace `agents/generator.py` to accept and parse the domain input (e.g., a feature spec instead of a startup brief).
2. Rewrite or replace evaluators in `agents/*` to assess domain-specific criteria.
3. Update `EngineState` in `core/state.py` if new dimensions are needed (e.g., add `security_eval`, `performance_eval`).
4. Optionally adjust the arbitration rule in `agents/arbiter.py` if the domain requires different logic.
5. Update prompts in `prompts/*`.
Logging: Each run produces a JSON file in logs/ containing the full final state. Search logs by run_id or timestamp for debugging.
Console Output: Final decision and structured state are printed to stdout for immediate feedback.
Tracing: Currently minimal. To add detailed tracing (e.g., intermediate LLM calls, parsing errors), hook into agent functions or the graph runtime.
- Mock LLM: Returns fixed JSON. Runs are fully deterministic.
- OpenAI LLM: The same brief with the same model and temperature is usually reproducible, but OpenAI does not guarantee bit-for-bit determinism even at temperature 0, and model updates can change outputs.
- Re-running: Save the `brief` from a previous run's log, inject it into a fresh run, and the arbiter will make the same decision.
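Re-running from a saved log can be sketched as below (the helper name and the `replay-` run-id scheme are hypothetical; the log fields come from the shadow-logging description):

```python
import json

def seed_state_from_log(log_path: str) -> dict:
    """Build a fresh initial state that reuses a previous run's brief."""
    with open(log_path) as f:
        previous = json.load(f)
    return {"run_id": "replay-" + previous["run_id"], "brief": previous["brief"]}
```

Skipping the generator and injecting this state into the graph exercises only the evaluators and arbiter, which is useful when debugging a surprising decision.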
| Failure | Cause | Impact | Mitigation |
|---|---|---|---|
| LLM API unavailable | Network/service outage | Run halts, exception raised | Retry logic at orchestrator level; use mock LLM for dev |
| Invalid JSON from LLM | Model returns non-JSON or an unexpected schema | Agent parse fails, run halts | Validate schema at agent; add error handling; use mock LLM to test |
| Missing brief when evaluator runs | Generator didn't populate state | Evaluator gets `None`, likely crashes | Graph ordering ensures generator runs first; add asserts |
| All evaluators return KILL | Domain evaluation agrees the idea is bad | Final decision is KILL (correct) | By design; the rule is rejection-first |
- Check the shadow log (`run_id` from console output).
- Re-run with `LLM_PROVIDER=mock` to isolate LLM issues.
- Add print statements or logging in agents to trace LLM calls.
- Manually inspect the `brief` in the log to understand why evaluators voted KILL.
- Python 3.10+
- `virtualenv` or `conda`
```bash
# Clone and navigate to the project
git clone <repo>
cd blackbox-engine

# Create and activate virtual environment
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Run with mock LLM (deterministic, no API keys needed)
export LLM_PROVIDER=mock
python main.py

# Or use OpenAI (requires OPENAI_API_KEY in environment)
export LLM_PROVIDER=openai
export OPENAI_API_KEY=sk-...
python main.py
```

```bash
# Run the test suite
pytest -v

# Run a specific test
pytest tests/test_arbiter.py -v

# Run with coverage
pytest --cov=core --cov=agents tests/
```

- Console: Prints final state as formatted JSON.
- Logs: A `.json` file is written to `logs/` for each run.
Version: Pre-release
Stability: Core engine and TypedDict contracts are stable. Evaluator logic, prompts, and shadow log format may change.
Maturity: Functional prototype suitable for evaluation and iteration within a team. Not production-hardened (no async, no concurrency controls, no persistence, no comprehensive error handling).
- No schema validation of evaluator outputs (agents must validate manually)
- No cost tracking or rate limiting
- No retry logic or exponential backoff
- Shadow logs are not indexed or queryable (plain JSON files)
- No support for dynamic graph topology or conditional edges
- LLM calls are synchronous (blocks until response)
This project is maintained as an internal tool. External contributions are not currently accepted. Issues and suggestions from internal stakeholders are welcome.
- Each run writes a JSON snapshot to `logs/` (run_id, timestamp, brief, market_eval, business_eval, technical_eval, final_decision).
- Shadow logs are intended for audit and debugging; they are file-based and not atomic. Replace with an atomic writer or central store when needed.
- Console output includes a formatted, read-only view of the final state.
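When atomicity matters, the usual stdlib fix is write-then-rename; a sketch (not the current `core/logger.py`, and the filename scheme here omits the timestamp for brevity):

```python
import json
import os
import tempfile

def write_shadow_log_atomic(log_dir: str, run_id: str, state: dict) -> str:
    """Write the snapshot to a temp file, then atomically rename into place."""
    os.makedirs(log_dir, exist_ok=True)
    final_path = os.path.join(log_dir, f"{run_id}.json")
    fd, tmp_path = tempfile.mkstemp(dir=log_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f, indent=2)
    os.replace(tmp_path, final_path)  # atomic on POSIX; readers never see a partial file
    return final_path
```

Creating the temp file in the same directory as the target keeps the rename on one filesystem, which is what makes `os.replace` atomic.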
- Add new domains: implement an agent that returns a partial state patch and wire it into `graph.py`.
- Add new LLM backends: implement `LLMClient.generate(system: str, user: str) -> str` and register it in `llm/factory.py`.
- Policy changes: keep the arbiter deterministic; for complex policies, add a separate policy node that produces a final patch.
- Validation: add pydantic models in agents to validate LLM outputs before merging into state.
- v0.1 (intentionally scoped):
- Core orchestration and graph execution in place.
- Deterministic mock LLM for local runs, basic OpenAI adapter available.
- File-based shadow logging and unit tests for core rules.
- Next steps for production readiness:
- Add schema validation and stronger LLM wrappers (retry, rate limit).
- Introduce atomic logging or an audit store.
- Add integration tests covering live LLM paths in a controlled environment.
- Preserve EngineState and EvalResult shapes; many components rely on them.
- Treat evaluators as untrusted inputs: validate and sanitize before merge.
- The system deliberately biases toward rejection to contain cost and risk.