The ACE system is a self-improving agent framework that learns from experience by generating, evaluating, and evolving actionable "bullets" (heuristics/rules) that improve agent performance. The system uses Darwin-Gödel evolutionary principles combined with LLM-based evaluation to continuously improve agent decision-making.
- BulletPlaybook - Stores and manages bullets with database persistence
- HybridSelector - Intelligent bullet selection using 5-stage algorithm
- Reflector - Generates new bullets from agent failures
- Curator - Deduplicates bullets using semantic similarity
- TrainingPipeline - Orchestrates offline and online training
- DarwinBulletEvolver - Evolves bullets using genetic programming
- PatternManager - Classifies inputs and tracks bullet effectiveness per pattern
- LLMJudge - Evaluates agent outputs using LLM-based judgment
Input → Agent → Output
↓
LLM Judge → Evaluation
Agent Execution:
- Input is classified into a pattern category using PatternManager
- Intelligent bullet selection using HybridSelector:
  - Gets bullets for each evaluator from the database
  - Selects top bullets per evaluator using semantic similarity
  - Limits to a maximum of 10 bullets per evaluator
- Agent executes with selected bullets
- Output is evaluated by LLM Judge (or ground truth comparison)
- Transaction is saved to database with pattern association
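The execution loop above can be sketched in Python. The components are reduced to plain callables here, since the real `PatternManager`, `HybridSelector`, and `LLMJudge` interfaces are not shown in this document and the stub shapes below are assumptions:

```python
MAX_BULLETS_PER_EVALUATOR = 10

def run_agent(tx_input, classify, select, execute, evaluate, save, evaluators):
    """One pass of the execution loop, with components passed as callables."""
    pattern_id = classify(tx_input)                        # 1. pattern classification
    bullets = {ev: select(tx_input, ev)[:MAX_BULLETS_PER_EVALUATOR]
               for ev in evaluators}                       # 2. per-evaluator selection, capped at 10
    output = execute(tx_input, bullets)                    # 3. agent execution with selected bullets
    verdict = evaluate(tx_input, output)                   # 4. LLM judge / ground-truth comparison
    save(tx_input, output, verdict, pattern_id)            # 5. persist with pattern association
    return output, verdict
```

Any concrete component that matches these call shapes can be dropped in; the cap on bullets per evaluator is applied at selection time.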
Failure → Reflector → New Bullet → Curator → Playbook
↓
Darwin Evolution (if enabled)
When bullets are generated:
- Agent output is incorrect (wrong prediction)
- OR fewer than 5 bullets were used (coverage issue)
Generation process:
- Reflector generates a new bullet from the failure
- Darwin-Gödel Evolution triggers (if 2+ bullets exist):
  - Use last 6 bullets as parents (NOT tested)
  - Generate 4 candidates via crossover
  - Test ONLY the 5 newly generated bullets (new + 4 candidates) on 4 transactions
  - Keep only the best bullet
- Curator checks for duplicates (semantic similarity > 0.85)
- If not a duplicate, the bullet is added to the Playbook
- Bullet is associated with evaluator and saved to database
New Bullet → Crossover Parents (last 6 bullets) → Generate 4 Candidates
↓
Test 5 NEW Bullets on 4 Scenarios
↓
Keep Only Best Bullet
Old bullets are NOT re-tested - only used as parents for crossover
Complete Evolution Flow:
Crossover Phase (Genetic Generation):
- Take last 6 bullets from playbook as parents
- For each of 4 candidates:
- Randomly select 2 different parents
- Use LLM to intelligently merge them via crossover
- Generate novel child bullet combining features from both parents
Example:
Parent 1: "New user (< 90 days) + large amount (> $1000) = 80% fraud"
Parent 2: "VPN usage + crypto merchant = 95% fraud"
↓ (LLM crossover)
Child: "New user (< 90 days) + VPN + crypto merchant + amount > $1000 = 97% fraud"
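A minimal sketch of the crossover step, with `llm_merge` standing in for the LLM call that merges two parent bullets (any `(parent_a, parent_b) -> child` function works for illustration):

```python
import random

def generate_candidates(parents, llm_merge, n_candidates=4):
    """Create candidate bullets by crossover of two random distinct parents."""
    candidates = []
    for _ in range(n_candidates):
        a, b = random.sample(parents, 2)    # pick two different parents
        candidates.append(llm_merge(a, b))  # LLM combines features from both
    return candidates
```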
Testing Phase (Fitness Evaluation):
- Take all 5 newly generated bullets (1 new + 4 candidates)
- For each bullet:
- Test on 4 random transactions from database
- LLM Judge evaluates: "Would this bullet help with this transaction?"
- Count how many transactions it would help with
- Calculate fitness score (accuracy = helpful_count / 4)
Example:
Bullet: "New user + VPN + crypto = 97% fraud"
Transaction 1: "New user from Nigeria using VPN, $500 crypto" → Judge: "YES, helpful" ✓
Transaction 2: "Existing user buying groceries, $50" → Judge: "NO, not applicable" ✗
Transaction 3: "New user + VPN + crypto, $2000" → Judge: "YES, helpful" ✓
Transaction 4: "VPN + crypto transaction" → Judge: "YES, helpful" ✓
Fitness Score: 3/4 = 75%
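The fitness computation reduces to counting judge approvals over the sampled transactions (accuracy = helpful_count / 4). A sketch, with the LLM judge replaced by any boolean predicate:

```python
def fitness(bullet, transactions, judge_is_helpful):
    """Fraction of sampled transactions the judge says the bullet helps with.

    `judge_is_helpful(bullet, tx) -> bool` stands in for the LLM-judge call.
    """
    helpful = sum(judge_is_helpful(bullet, tx) for tx in transactions)
    return helpful / len(transactions)
```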
Selection Phase (Survival of the Fittest):
- Sort all 5 bullets by fitness score
- Keep ONLY the best bullet
- Add it to playbook (if not duplicate)
Example:
Results:
1. Bullet A: 80% fitness → Keep ✓
2. Bullet B: 60% fitness → Discard
3. Bullet C: 40% fitness → Discard
4. Bullet D: 20% fitness → Discard
5. Bullet E: 20% fitness → Discard
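Selection is then a single `max` over the scored bullets:

```python
def select_best(scored_bullets):
    """Keep only the highest-fitness bullet; everything else is discarded.

    `scored_bullets` is a list of (bullet, fitness) pairs.
    """
    return max(scored_bullets, key=lambda pair: pair[1])[0]
```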
Evolution triggers:
- EVERY time a new bullet is generated
- Requires at least 2 existing bullets for crossover
- Integrated into bullet generation process
Evaluator-Specific Evolution: Bullets are generated per evaluator (not globally). This means:
- Get evaluator-specific parents: uses only bullets from the SAME evaluator
  recent_bullets = playbook.get_bullets_for_node(node, evaluator=evaluator)
- Generate candidates: creates 4 candidates using only those evaluator-specific parents
- Test and select: tests all 5 bullets (new + 4 candidates) for this evaluator
- Add to evaluator: saves bullets with the same evaluator tag
  self.curator.merge_bullet(
      content=bullet_content,
      node=node,
      playbook=playbook,
      source="evolution",
      evaluator=evaluator,  # same evaluator as parents
  )
Example:
Evaluator: "fraud_detection"
├─ Get last 6 bullets for "fraud_detection" evaluator
├─ Generate 4 candidates from those bullets
├─ Test all 5 bullets on transactions
└─ Keep the best bullet → Add to "fraud_detection" evaluator
Evaluator: "risk_assessment"
├─ Get last 6 bullets for "risk_assessment" evaluator
├─ Generate 4 candidates from those bullets
├─ Test all 5 bullets on transactions
└─ Keep the best bullet → Add to "risk_assessment" evaluator
Each evaluator evolves independently!
How Crossover Connects to Testing:
┌─────────────────────────────────────────────────────────────────┐
│ CROSSOVER PHASE │
│ (Genetic Generation - Creates Novel Bullets) │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Parent Pool (last 6 bullets) │
│ ┌────────────────┐ ┌────────────────┐ ┌────────────────┐ │
│ │ Parent 1 │ │ Parent 2 │ │ ... (4 more) │ │
│ └────────────────┘ └────────────────┘ └────────────────┘ │
│ │ │ │ │
│ ├───────────────────┼────────────────────┤ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌────────────────┐ ┌────────────────┐ ┌────────────────┐ │
│ │ Candidate 1 │ │ Candidate 2 │ │ Candidate 3 │ │
│ │ (LLM merged) │ │ (LLM merged) │ │ (LLM merged) │ │
│ └────────────────┘ └────────────────┘ └────────────────┘ │
│ │
│ ┌────────────────┐ │
│ │ Candidate 4 │ │
│ │ (LLM merged) │ │
│ └────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ TESTING PHASE │
│ (Fitness Evaluation - Tests Novel Bullets) │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌────────────────┐ │
│ │ New Bullet │──┐ │
│ └────────────────┘ │ │
│ │ │
│ ┌────────────────┐ │ ┌────────────────────────────────┐ │
│ │ Candidate 1 │──┼─────│ Transaction 1: Nigeria + VPN │ │
│ └────────────────┘ │ │ Judge: YES ✓ │ │
│ │ └────────────────────────────────┘ │
│ ┌────────────────┐ │ │
│ │ Candidate 2 │──┼─────│ Transaction 2: Groceries $50 │ │
│ └────────────────┘ │ │ Judge: NO ✗ │ │
│ │ └────────────────────────────────┘ │
│ ┌────────────────┐ │ │
│ │ Candidate 3 │──┼─────│ Transaction 3: Crypto $2000 │ │
│ └────────────────┘ │ │ Judge: YES ✓ │ │
│ │ └────────────────────────────────┘ │
│ ┌────────────────┐ │ │
│ │ Candidate 4 │──┼─────│ Transaction 4: ... │ │
│ └────────────────┘ │ │ Judge: ... │ │
│ │ └────────────────────────────────┘ │
│ │ │
│ ▼ │
│ Fitness Scores: [80%, 60%, 40%, 20%, 20%] │
│ │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ SELECTION PHASE │
│ (Survival of the Fittest) │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Sort by fitness → Keep only best bullet → Add to Playbook │
│ │
│ ✓ Bullet A (80% fitness) → Added │
│ ✗ Bullet B (60% fitness) → Discarded │
│ ✗ Bullet C (40% fitness) → Discarded │
│ ✗ Bullet D (20% fitness) → Discarded │
│ ✗ Bullet E (20% fitness) → Discarded │
│ │
└─────────────────────────────────────────────────────────────────┘
Evolution process:
- Generate new bullet from reflection
- Get last 6 bullets from playbook (used ONLY as parents for crossover, NOT tested)
- Generate 4 candidate bullets via crossover from those parents
- Test ONLY the 5 newly generated bullets (new bullet + 4 candidates) on 4 random transactions from database
- LLM Judge evaluates if each bullet would help with each transaction
- Select top 1 bullet based on fitness scores
- Add only the best bullet to playbook
Important: Old bullets from the playbook are NOT re-tested. Evolution only tests newly generated bullets.
Evolution parameters:
- n_samples: 4 scenarios (reduced for speed)
- min_bullets_for_crossover: 2
- candidates_generated: 4
- bullets_kept: 1 (only the best)
- evaluator: uses the LLM judge's domain field for context
- Filter bullets by node (agent name)
- Filter by evaluator (perspective)
- Filter by source (offline/online)
- Minimum success rate threshold: 0.3
- Exclude bullets with poor performance
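The quality filter above can be sketched as follows. The dict shape of a bullet record mirrors the schema fields described later (`helpful_count`, `harmful_count`), though that exact shape, and keeping bullets with no usage data yet, are assumptions here:

```python
MIN_SUCCESS_RATE = 0.3

def quality_filter(bullets, min_uses=1):
    """Drop bullets whose empirical success rate is below the 0.3 threshold."""
    kept = []
    for b in bullets:
        total = b["helpful_count"] + b["harmful_count"]
        if total < min_uses:
            kept.append(b)  # no usage data yet: keep by default (assumption)
            continue
        if b["helpful_count"] / total >= MIN_SUCCESS_RATE:
            kept.append(b)
    return kept
```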
- Generate embedding for input query
- Compute cosine similarity with bullet embeddings
- Minimum similarity threshold: 0.5
- Uses pre-calculated embeddings (cached in database)
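Semantic filtering is a cosine-similarity threshold over the cached embeddings; a self-contained sketch (the `content_embedding` dict key follows the schema described later, and short vectors stand in for the 1536-dim embeddings):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

SEMANTIC_THRESHOLD = 0.5

def semantic_filter(query_embedding, bullets):
    """Keep bullets whose cached embedding clears the 0.5 similarity floor."""
    return [b for b in bullets
            if cosine_similarity(query_embedding, b["content_embedding"])
            >= SEMANTIC_THRESHOLD]
```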
Combines multiple factors:
- Quality Score (30%): Bullet's empirical success rate
- Semantic Score (40%): Cosine similarity to query
- Thompson Sampling (30%): Exploration-exploitation balance
- Pattern Boost: Additional boost if bullet is effective for this pattern
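The weighted blend above can be written directly: 30% quality, 40% semantic similarity, 30% Thompson sample, plus the pattern boost. Modeling the Thompson draw as `Beta(helpful + 1, harmful + 1)` is an assumption about the implementation; the weights are the ones stated:

```python
import random

def hybrid_score(bullet, semantic_sim, pattern_boost=0.0, rng=random):
    """Hybrid selection score: 0.3*quality + 0.4*semantic + 0.3*thompson + boost."""
    total = bullet["helpful_count"] + bullet["harmful_count"]
    quality = bullet["helpful_count"] / total if total else 0.5
    # Thompson sampling: draw from Beta(helpful+1, harmful+1) to balance
    # exploration (uncertain bullets) and exploitation (proven bullets)
    thompson = rng.betavariate(bullet["helpful_count"] + 1,
                               bullet["harmful_count"] + 1)
    return 0.3 * quality + 0.4 * semantic_sim + 0.3 * thompson + pattern_boost
```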
- Promote diverse bullets
- Avoid redundant patterns
- Diversity weight: 0.15
bullets
- id: Unique bullet identifier
- content: Bullet text
- node: Agent node name
- evaluator: Evaluator/perspective name
- content_embedding: Pre-calculated 1536-dim embedding
- helpful_count: Times the bullet helped
- harmful_count: Times the bullet harmed
- times_selected: Times the bullet was selected
- source: 'offline', 'online', or 'evolution'
- created_at, last_used: Timestamps
transactions
- transaction_data: JSON with system prompt, user prompt, output, and reasoning
- mode: 'vanilla', 'offline_online', or 'online_only'
- predicted_decision: Agent's decision
- correct_decision: Ground truth
- is_correct: Boolean
- input_pattern_id: Pattern classification
input_patterns
- pattern_summary: LLM-generated summary
- node: Agent node
- category: Pattern category
- features: JSON with extracted features
bullet_input_effectiveness
- pattern_id: Input pattern
- bullet_id: Bullet
- helpful_count: Times the bullet helped for this pattern
- harmful_count: Times the bullet harmed for this pattern
llm_judges
- node: Agent node
- model: LLM model (default: gpt-4o-mini)
- temperature: 0.0
- system_prompt: Judge's system prompt
- evaluation_criteria: JSON array of criteria
- domain: Domain context (used for evaluator identification)
judge_evaluations
- judge_id: LLM judge
- transaction_id: Transaction evaluated
- input_text, output_text: Transaction data
- ground_truth: Ground truth (if available)
- is_correct: Judge's decision
- confidence: Confidence score
- reasoning: Judge's reasoning
bullet_selections
- transaction_id: Transaction
- bullet_id: Selected bullet
- rank: Selection rank
- score: Selection score
Evaluators represent different perspectives or criteria for evaluating agent outputs:
- Risk-focused: Evaluates risk assessment correctness
- Pattern-focused: Evaluates pattern detection accuracy
- Context-focused: Evaluates contextual understanding
- LLM Judge Domain Field: Identifies evaluator context
- Bullet Association: Bullets are associated with evaluators
- Selection: Bullets are selected per evaluator
- Evolution: Evolution is evaluator-aware
EVALUATOR1 Rules:
- bullet 1
- bullet 2
EVALUATOR2 Rules:
- bullet 1
- bullet 2
Each evaluator can have up to 10 bullets in the prompt.
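Rendering the per-evaluator rule sections shown above is a simple formatting step; a sketch (the function name and dict input are illustrative, not the system's actual API):

```python
def format_bullet_sections(bullets_by_evaluator, max_per_evaluator=10):
    """Render one '<EVALUATOR> Rules:' section per evaluator, capped at 10 bullets."""
    lines = []
    for evaluator, bullets in bullets_by_evaluator.items():
        lines.append(f"{evaluator.upper()} Rules:")
        lines.extend(f"- {b}" for b in bullets[:max_per_evaluator])
    return "\n".join(lines)
```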
- No bullets used
- Baseline performance
- Evaluated with ground truth
- Pre-train on training set (70%)
- Test on test set (30%)
- Evaluated with ground truth
- Bullets generated during training
- Bullets used during testing
- Real-time learning
- No pre-training
- Uses LLM Judge for evaluation
- Bullets generated on-the-fly
- Embeddings generated when creating bullets
- Stored in database to avoid repeated API calls
- Used for semantic similarity search
- Inputs classified into patterns
- Bullet effectiveness tracked per pattern
- Pattern-specific bullet selection
- Semantic similarity threshold: 0.85
- Prevents duplicate bullets
- Saves context space
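The Curator's duplicate check reduces to comparing the new bullet's embedding against existing ones; a sketch, with the similarity function passed in so the snippet stays self-contained:

```python
SIMILARITY_THRESHOLD = 0.85

def is_duplicate(new_embedding, existing_embeddings, cosine):
    """A bullet is a duplicate if any existing bullet exceeds 0.85 similarity."""
    return any(cosine(new_embedding, e) > SIMILARITY_THRESHOLD
               for e in existing_embeddings)
```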
- All bullets persisted to database
- Transactions saved for evolution
- Pattern classifications tracked
- Bullet effectiveness per pattern
- Contextual filtering
- Quality filtering
- Semantic filtering
- Hybrid scoring
- Diversity promotion
- Crossover: 4 candidates generated via LLM crossover (4 API calls)
- Testing: 5 bullets × 4 scenarios = 20 LLM judge calls
- Selection: Keep best 1 bullet only
- Total: ~24 API calls per evolution cycle
- Reduced from 60+ to ~24 API calls with same quality
- Pre-calculated embeddings in database
- No on-the-fly generation
- Faster semantic similarity
- Pattern classifications cached
- Bullet effectiveness per pattern cached
- Faster retrieval
- Max 10 bullets per evaluator
- Hyperparameter: max_bullets (default: 5)
- Prevents context rot
- Input: "New user from Nigeria using VPN, $500 crypto purchase"
- Pattern Classification: pattern_id=5, confidence=0.92
- Bullet Selection:
  - Evaluator: "fraud_detection"
  - Selected 5 bullets matching input
- Agent Execution: Returns "DECLINE"
- Evaluation: Ground truth says "DECLINE" → Correct
- Pattern Tracking: Update bullet effectiveness for pattern_id=5
- Transaction Saved: Full transaction data persisted
- Agent Error: Predicted "APPROVE" but should be "DECLINE"
- Reflection: Reflector generates new bullet
- Evolution:
  - Generate 4 candidates via crossover from the last 6 bullets
  - Test all 5 bullets on 4 random transactions
  - LLM Judge evaluates fitness
  - Keep only the best bullet
- Curator Check: Not a duplicate (similarity 0.72)
- Add to Playbook: Best bullet saved with evaluator="fraud_detection"
Analyze transaction with selected bullets
Evaluate agent output with LLM Judge
Get bullets for a node (with optional query for intelligent selection)
Get playbook statistics
Get bullets for a specific node
Run comprehensive test suite
- DATABASE_URL: PostgreSQL connection string
- OPENAI_API_KEY: OpenAI API key
- DEBUG: Debug mode (default: False)
- LOG_LEVEL: Logging level (default: INFO)
- max_bullets: Maximum bullets per evaluator (default: 10)
- quality_threshold: Minimum success rate (default: 0.3)
- semantic_threshold: Minimum similarity (default: 0.5)
- diversity_weight: Diversity promotion weight (default: 0.15)
- similarity_threshold: Deduplication threshold (default: 0.85)
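These hyperparameters could be collected in a single config object; a sketch using a dataclass with the stated defaults (the class name `SelectorConfig` is illustrative):

```python
from dataclasses import dataclass

@dataclass
class SelectorConfig:
    """Selection/curation hyperparameters with their documented defaults."""
    max_bullets: int = 10            # maximum bullets per evaluator
    quality_threshold: float = 0.3   # minimum empirical success rate
    semantic_threshold: float = 0.5  # minimum cosine similarity to query
    diversity_weight: float = 0.15   # diversity promotion weight
    similarity_threshold: float = 0.85  # deduplication threshold
```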
- Genetic programming allows novel combinations
- Testing on real transactions ensures fitness
- Top performers survive, weak ones die
- Different perspectives need different rules
- Prevents confusion between criteria
- More targeted improvements
- Performance: No repeated API calls
- Cost: Reduce API usage
- Consistency: Same embeddings over time
- Different inputs need different bullets
- Pattern-specific effectiveness tracking
- More intelligent selection
- Tests new bullets immediately against real transactions
- Ensures only the best bullets survive and get added
- Prevents accumulation of weak bullets
- Creates competitive pressure for improvement
- Fast enough (5 bullets × 4 transactions = 20 API calls)
- Multi-Objective Evolution: Evolve for multiple criteria simultaneously
- Adaptive Thresholds: Automatically adjust thresholds based on performance
- Bullet Hierarchies: Organize bullets into hierarchies
- Explainability: Track why bullets were selected
- A/B Testing: Compare different bullet strategies
- Darwin-Gödel Machines: https://arxiv.org/abs/1509.08784
- Thompson Sampling: https://web.stanford.edu/~bvr/pubs/TS_Tutorial.pdf
- Semantic Similarity: https://en.wikipedia.org/wiki/Semantic_similarity