Design Decisions Log

This document records the major product, methodology, and implementation decisions for Forecaster Arena. It is intended to answer three questions:

What did the project choose to do?
Why was that choice made?
What constraints does that choice impose on operations and future changes?

This file is a living design-history document. Historical entries remain in place even when later implementation details evolve, but entries should reflect the current production code unless explicitly marked as superseded.

Current System Snapshot

As of March 7, 2026, the production codebase currently implements:

Weekly cohort creation at Sunday 00:00 UTC
One frozen benchmark slot per active family per cohort
Top-500 market selection by Polymarket volume
Sequential decision execution per cohort to reduce provider contention
Timestamped portfolio snapshots (10-minute cadence)
Public, redacted health reporting
Admin-only bounded CSV+ZIP exports
Database-level uniqueness for:
- one frozen benchmark slot per cohort (agents)
- one weekly cohort start timestamp (cohorts.started_at)
- one decision per agent/cohort/week (decisions)

The active family lineup in code is:

GPT-5.2
Gemini 3 Pro
Grok 4.1
Claude Opus 4.5
DeepSeek V3.2
Kimi K2
Qwen 3

Decision Log

D001: Weekly Cohort System

Status: Active
Decision: The benchmark runs as independent weekly cohorts rather than as one continuous portfolio.

Rationale:

Multiple cohorts provide repeated samples across different market regimes.
Weekly cohort boundaries are easy to explain publicly and audit internally.
A cohort can be analyzed independently without mixing capital paths.
Cohorts that run until all positions resolve avoid arbitrary finalization.

Operational implications:

Cohort IDs are the primary boundary for most analytics and exports.
Leaderboards can be aggregated across cohorts or inspected per cohort.
Operations need clear logic for “current week” detection and deduplication.

D002: One Agent Per Benchmark Family Slot Per Cohort

Status: Active
Decision: Each active benchmark family slot has exactly one frozen agent assignment in each cohort.

Rationale:

Keeps competition conditions symmetric.
Prevents duplication or weighting of one provider by extra agent instances.
Simplifies attribution in leaderboards, charts, and trade history.

Implementation constraints:

Enforced physically by UNIQUE(cohort_id, benchmark_config_model_id) in agents.
Agent creation is safe to rerun because inserts use INSERT OR IGNORE.
The semantic identity of that cohort participant is frozen by family_id, release_id, and benchmark_config_model_id.

D003: $10,000 Starting Balance

Status: Active
Decision: Each agent starts with $10,000 in virtual cash.

Rationale:

Easy to reason about for percentages and dashboard display.
Large enough to allow portfolio construction across multiple weeks.
Small enough that position sizing remains meaningful.

Related constraints:

Minimum bet is $50
Maximum single bet is 25% of current cash balance

D004: Top-500 Market Selection

Status: Active
Decision: LLMs see the top 500 markets by trading volume.

Rationale:

Volume acts as a rough liquidity filter.
Keeping the selection bounded controls prompt size and cost.
A global ranking avoids manual category curation bias.

Trade-offs:

Models do not see the full Polymarket universe.
Lower-volume niche markets are intentionally excluded.

D005: Deterministic Prompting

Status: Active
Decision: All model calls use temperature 0.

Rationale:

Makes weekly decisions reproducible given identical inputs.
Reduces evaluation noise attributable to sampling randomness.
Makes debugging malformed responses and regression testing easier.

D006: Decision Action Space = BET / SELL / HOLD

Status: Active
Decision: Models are limited to three high-level actions: BET, SELL, and HOLD.

Rationale:

Keeps parsing strict and auditable.
Maps cleanly to paper-trading portfolio operations.
Avoids introducing hidden intermediate semantics that are hard to score.

Implementation note:

Malformed or invalid responses can fall back to a synthetic HOLD.
Trade execution may still fail even when the parsed action is valid.

D007: Historical V1 Confidence Derived From Bet Size

Status: Historical
Decision: v1 Brier diagnostics derived confidence from bet size rather than a separately requested probability field. v2 does not use this as active scoring methodology; v2 treats bet size as capital allocation and ranks by paper portfolio value / P&L.

Rationale:

Forces sizing and confidence to stay coupled.
Keeps the protocol smaller and easier to audit.
Makes the benchmark evaluate practical forecasting decisions, not only verbalized probabilities.

Formula:

max_possible_bet = cash_balance * 0.25
implied_confidence = bet_amount / max_possible_bet

D008: No Refills Within a Cohort

Status: Active
Decision: Agents are not recapitalized mid-cohort.

Rationale:

Preserves the consequences of poor position sizing and liquidation choices.
Makes risk management part of benchmark performance.
Prevents artificial “reset” strategies.

D009: SQLite as the Single-Node Source of Truth

Status: Active
Decision: SQLite is used as the operational database.

Rationale:

The system is single-node and write volume is moderate.
Backups are simple, fast, and operationally cheap.
Shipping a self-contained research benchmark matters more than horizontal scaling complexity.

Trade-offs:

Concurrency must be handled carefully in application code.
Multi-node execution would require a different storage strategy.

D010: OpenRouter as the Unified Model Gateway

Status: Active
Decision: All models are accessed through OpenRouter rather than through provider-specific integrations.

Rationale:

One transport surface for all participating models.
Simplifies usage accounting and operational support.
Reduces the number of secrets and SDK behaviors the system must manage.

D011: Full Prompt and Response Retention

Status: Active
Decision: The system stores the full prompts and model responses for each decision row.

Rationale:

Reproducibility is a core benchmark requirement.
Prompt-level auditing is necessary for fair comparisons.
Research users need more than aggregate performance figures.

Operational implications:

Decision exports can optionally include prompts.
Storage footprint is intentionally higher than a metrics-only system.

D012: Timestamped Portfolio Snapshots

Status: Active
Decision: Portfolio snapshots are stored by timestamp, not by calendar date.

Rationale:

The public charts operate over intraday ranges (10M, 1H, 1D) as well as longer windows.
A daily-only snapshot model cannot support accurate portfolio curves.
Timestamped snapshots preserve operational history for debugging and audits.

Implementation constraints:

Snapshot uniqueness is (agent_id, snapshot_timestamp).
Downstream docs and queries must use snapshot_timestamp, not snapshot_date.

D013: Sequential Weekly Decision Execution

Status: Active
Decision: Agents in a cohort are processed sequentially.

Rationale:

Minimizes provider and gateway rate-limit contention.
Keeps logs and operational tracing easier to follow.
Reduces the blast radius of partial failures.

Trade-offs:

End-to-end weekly runtime is longer than a parallel fan-out design.
Timeout budgeting must account for the full sequential run.

D014: Database-Level Decision Idempotency

Status: Active
Decision: One decision row per (agent_id, cohort_id, decision_week) is enforced in the database, and weekly decision execution claims that row before calling the model.

Rationale:

Read-before-write checks alone are not safe under overlapping cron runs.
Duplicate decisions can double-spend cash and invalidate benchmark results.
Claiming the row before network I/O prevents two workers from executing the same agent/week concurrently.

Implementation details:

decisions has a unique index on (agent_id, cohort_id, decision_week).
The engine writes an in-progress placeholder row, then finalizes it after the model response.
Retryable zero-trade outcomes reuse the same row instead of inserting another.
Stale in-progress rows can be reclaimed after a timeout window.

D015: Database-Level Weekly Cohort Uniqueness

Status: Active
Decision: Weekly cohort creation is keyed by normalized UTC week start and is uniqueness-constrained at the database layer.

Rationale:

Cron drift or operator-triggered retries should not create two cohorts for the same Sunday window.
A sequential cohort_number alone does not prevent duplicate weekly starts.

Implementation details:

Week start is normalized to Sunday 00:00:00.000Z.
cohorts.started_at is unique.
Cohort creation re-checks within an immediate transaction.

D016: Resolve Markets Only After Successful Settlement

Status: Active
Decision: A market is marked locally resolved only after all position settlements for that market complete successfully.

Rationale:

Marking a market resolved too early can strand open positions permanently, because later runs may skip it as already processed.
Settlement safety is more important than optimistic status transitions.

Implementation details:

Resolution checks still detect an upstream resolved market first.
The system settles positions, records scoring, and only then updates the local markets.status = 'resolved'.
Partial settlement failure leaves the market locally closed, making the next resolution pass retryable.

D017: Redacted Public Health Reporting

Status: Active
Decision: The public health endpoint reports subsystem status without disclosing exact missing secret names or raw internal exception text.

Rationale:

Health endpoints are useful operationally but are often internet-exposed.
Exact secret names and detailed exception text add low-value disclosure risk.
External monitors only need component-level health, not sensitive internals.

Implementation notes:

environment.message now says configuration is incomplete instead of listing variable names.
Database and integrity failures return generic labels such as Database unavailable or Integrity issues detected.

D018: Bounded, Admin-Only Exports

Status: Active
Decision: Admin exports are intentionally bounded by cohort, time window, and per-table row caps, and are packaged without shell interpolation.

Rationale:

Research exports should be available, but not as an unrestricted data dump.
Bounded exports keep server resource use predictable.
Shell-based ZIP command construction created avoidable injection risk.

Implementation details:

Max export window: 7 days
Max rows per exported table: 50,000
Files are written to a temp directory, zipped with argv-based invocation, and stored under backups/exports
Export archives are cleaned up after approximately 24 hours

D019: Truthful Empty-State UX

Status: Active
Decision: The public UI should distinguish between:
- a live benchmark with real cohort data,
- a synced-but-not-yet-trading preview,
- and a system still awaiting the first cohort or market sync.

Rationale:

Hardcoded “live” claims are misleading before data exists.
Research credibility depends on the UI accurately reflecting operational state.

Implementation notes:

The home page status badge is runtime-aware.
The home page no longer hardcodes synced-market counts in empty states.
The models page still renders all seven competitors even before the first live leaderboard exists.

Superseded or Clarified Assumptions

S001: Daily Snapshots

Status: Superseded
Original assumption: portfolio snapshots were effectively daily.
Current reality: snapshots are timestamped and intended for 10-minute cadence.

S002: Health Endpoint as a Detailed Diagnostic Surface

Status: Clarified
Original assumption: public health responses could safely expose exact missing secret names or raw database errors.
Current reality: health remains operationally useful, but intentionally redacts sensitive implementation details.

Future Decision Areas

The following areas remain open for future design decisions:

Whether to move from sequential to queued/background decision execution
Whether to split public health and private diagnostics into separate endpoints
Whether to add first-class migrations rather than schema-on-start
Whether to support historical benchmark versions with family-lineup versioning
Whether to provide signed export URLs instead of session-gated downloads

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Design Decisions Log

Current System Snapshot

Decision Log

D001: Weekly Cohort System

D002: One Agent Per Benchmark Family Slot Per Cohort

D003: $10,000 Starting Balance

D004: Top-500 Market Selection

D005: Deterministic Prompting

D006: Decision Action Space = BET / SELL / HOLD

D007: Historical V1 Confidence Derived From Bet Size

D008: No Refills Within a Cohort

D009: SQLite as the Single-Node Source of Truth

D010: OpenRouter as the Unified Model Gateway

D011: Full Prompt and Response Retention

D012: Timestamped Portfolio Snapshots

D013: Sequential Weekly Decision Execution

D014: Database-Level Decision Idempotency

D015: Database-Level Weekly Cohort Uniqueness

D016: Resolve Markets Only After Successful Settlement

D017: Redacted Public Health Reporting

D018: Bounded, Admin-Only Exports

D019: Truthful Empty-State UX

Superseded or Clarified Assumptions

S001: Daily Snapshots

S002: Health Endpoint as a Detailed Diagnostic Surface

Future Decision Areas

FilesExpand file tree

DECISIONS.md

Latest commit

History

DECISIONS.md

File metadata and controls

Design Decisions Log

Current System Snapshot

Decision Log

D001: Weekly Cohort System

D002: One Agent Per Benchmark Family Slot Per Cohort

D003: $10,000 Starting Balance

D004: Top-500 Market Selection

D005: Deterministic Prompting

D006: Decision Action Space = BET / SELL / HOLD

D007: Historical V1 Confidence Derived From Bet Size

D008: No Refills Within a Cohort

D009: SQLite as the Single-Node Source of Truth

D010: OpenRouter as the Unified Model Gateway

D011: Full Prompt and Response Retention

D012: Timestamped Portfolio Snapshots

D013: Sequential Weekly Decision Execution

D014: Database-Level Decision Idempotency

D015: Database-Level Weekly Cohort Uniqueness

D016: Resolve Markets Only After Successful Settlement

D017: Redacted Public Health Reporting

D018: Bounded, Admin-Only Exports

D019: Truthful Empty-State UX

Superseded or Clarified Assumptions

S001: Daily Snapshots

S002: Health Endpoint as a Detailed Diagnostic Surface

Future Decision Areas