MemEvolve includes an independent, parity-based quality scoring system that provides unbiased evaluation of LLM responses across different model types and architectures.
The quality scoring system addresses a critical challenge in LLM evaluation: ensuring fair assessment regardless of whether a model uses direct responses or reasoning/thinking processes.
- Parity-Based Evaluation: Treats all models fairly regardless of reasoning capabilities
- Independent Assessment: Separate from memory system to avoid bias
- Multi-Factor Analysis: Considers content quality, structure, and insight
- Model-Agnostic: Works with any OpenAI-compatible model
- Adaptive Learning: Tracks performance patterns and adjusts automatically
ResponseQualityScorer
├── _calculate_response_quality() # Main scoring logic
├── _evaluate_reasoning_response() # For thinking models
├── _evaluate_direct_response() # For standard models
├── _calculate_content_factors() # Semantic density, structure, insight
├── _evaluate_reasoning_consistency() # Reasoning-answer alignment
└── _apply_bias_correction() # Model-type bias adjustment
- Content Analysis: Evaluate semantic density, logical structure, and insight
- Query Alignment: Assess how well response addresses the user's question
- Model Type Detection: Identify if response includes reasoning content
- Parity Application: Apply appropriate scoring methodology:
- Direct responses: 100% answer quality evaluation
- Reasoning responses: 70% answer + 30% reasoning quality
- Bias Correction: Adjust for systematic model-type biases
- Final Score: Normalized score between 0.0 and 1.0
- Definition: Meaningful concepts per word
- Purpose: Encourages concise, information-rich responses
- Calculation: Concept-to-word ratio with uniqueness weighting
- Definition: Presence of step-by-step reasoning and examples
- Purpose: Rewards well-organized, educational responses
- Indicators: Numbered steps, bullet points, illustrative examples
- Definition: Beyond generic responses, provides unique perspectives
- Purpose: Encourages creative, valuable contributions
- Detection: Comparison against common response patterns
- Definition: Direct addressing of question aspects
- Purpose: Ensures relevance and completeness
- Scoring: Coverage of question components, accuracy
- Step-by-step logic: Coherent thought process
- Error handling: Identification and correction of mistakes
- Self-correction: Acknowledgment and revision of initial thoughts
- Alignment: How well reasoning leads to final answer
- Contradiction detection: Identifying logical inconsistencies
- Support: Whether reasoning supports the conclusion
Without bias correction, models without reasoning capabilities often receive lower scores simply because they're evaluated on different criteria than thinking models.
# Track performance by model type
bias_tracker = {
"reasoning_models": {
"scores": [],
"avg_score": 0.0,
"count": 0
},
"direct_models": {
"scores": [],
"avg_score": 0.0,
"count": 0
}
}
# Apply bias correction
if model_type == "reasoning":
bias_adjustment = -bias_adjustment_factor
elif model_type == "direct":
bias_adjustment = +bias_adjustment_factor- Performance Tracking: Monitor average scores by model type
- Statistical Analysis: Identify significant performance gaps
- Automatic Adjustment: Apply corrections when bias is detected
- Continuous Learning: System adapts as more data is collected
| Score Range | Quality Level | Characteristics |
|---|---|---|
| 0.8 - 1.0 | Excellent | Insightful, well-structured, novel, perfectly aligned |
| 0.6 - 0.8 | Good | Clear, relevant, some insight, decent structure |
| 0.4 - 0.6 | Average | Basic relevance, minimal structure, generic content |
| 0.2 - 0.4 | Poor | Partial relevance, poor structure, generic |
| 0.0 - 0.2 | Very Poor | Irrelevant, incoherent, incorrect |
For models with reasoning content, scores are distributed:
- 70% based on final answer quality
- 30% based on reasoning process quality
This ensures:
- Fair Competition: Direct models aren't penalized for lacking reasoning
- Reasoning Value: High-quality reasoning is properly rewarded
- Answer Focus: Final response quality remains primary
| Variable | Description | Default | Notes |
|---|---|---|---|
MEMEVOLVE_LOG_MIDDLEWARE_ENABLE |
Enable detailed scoring logs | false |
Set to true for debugging |
MEMEVOLVE_QUALITY_BIAS_CORRECTION |
Enable bias correction | true |
Disable for raw scores |
MEMEVOLVE_QUALITY_MIN_THRESHOLD |
Minimum score for experience storage | 0.1 |
Filter low-quality responses |
# Reasoning model weights
reasoning_weights = {
"answer_quality": 0.7,
"reasoning_quality": 0.3
}
# Direct model weights
direct_weights = {
"answer_quality": 1.0
}from utils.quality_scorer import ResponseQualityScorer
scorer = ResponseQualityScorer(debug=True)
# Score a direct response
response = {
"role": "assistant",
"content": "Water appears wet due to surface tension..."
}
context = {
"original_query": "Why does water feel wet?",
"messages": [{"role": "user", "content": "Why does water feel wet?"}]
}
score = scorer.calculate_response_quality(response, context, "Why does water feel wet?")
print(f"Quality score: {score:.3f}")# Score a thinking model response
reasoning_response = {
"role": "assistant",
"content": "Water feels wet due to surface tension...",
"reasoning_content": "First, consider what 'wet' means..."
}
score = scorer.calculate_response_quality(
reasoning_response, context, "Why does water feel wet?"
)
print(f"Reasoning model score: {score:.3f}")# Track quality over time
import json
from datetime import datetime
quality_log = []
for response_batch in responses:
scores = []
for response in response_batch:
score = scorer.calculate_response_quality(response, context, query)
scores.append(score)
quality_log.append({
"timestamp": datetime.now().isoformat(),
"avg_score": sum(scores) / len(scores),
"count": len(scores),
"has_reasoning": any(r.get("reasoning_content") for r in response_batch)
})
# Save for analysis
with open("quality_trends.json", "w") as f:
json.dump(quality_log, f, indent=2)Cause: Likely missing query context or response content
Solution: Ensure both content and original query are provided
# ❌ Missing context
score = scorer.calculate_response_quality(response, {}, query)
# ✅ Proper context
context = {"original_query": query, "messages": [{"role": "user", "content": query}]}
score = scorer.calculate_response_quality(response, context, query)Cause: Bias correction may need adjustment or insufficient data Solution: Monitor bias tracking and allow system to learn
# Enable bias correction logging
export MEMEVOLVE_LOG_MIDDLEWARE_ENABLE=true
export MEMEVOLVE_QUALITY_BIAS_CORRECTION=true
# Check bias tracking in logs
grep "bias correction" logs/api-server.logCause: May need threshold adjustment for your specific use case Solution: Adjust minimum threshold or weighting factors
# Custom scorer with adjusted thresholds
custom_scorer = ResponseQualityScorer(
debug=True,
min_threshold=0.2, # Higher threshold
bias_correction=False # Disable if bias not needed
)Enable detailed scoring analysis:
export MEMEVOLVE_LOG_MIDDLEWARE_ENABLE=true
python scripts/start_api.pyDebug output includes:
- Content factor breakdown
- Reasoning evaluation details
- Bias correction calculations
- Final score composition
- Direct responses: ~5-10ms additional processing
- Reasoning responses: ~15-25ms due to consistency analysis
- Bias correction: Minimal overhead (<1ms)
- Bias tracking: ~100KB for performance history
- Scoring cache: Optional, ~1MB for recent scores
- Total impact: Negligible for typical deployments
Extend scoring with custom evaluation criteria:
class CustomQualityScorer(ResponseQualityScorer):
def _calculate_custom_factors(self, content, query):
# Add your custom scoring logic
technical_accuracy = self._evaluate_technical_accuracy(content)
code_quality = self._evaluate_code_quality(content)
return {
"technical_accuracy": technical_accuracy,
"code_quality": code_quality
}
def _evaluate_technical_accuracy(self, content):
# Implementation-specific logic
return scoreQuality scores feed directly into the evolution framework:
# Quality scores influence fitness evaluation
fitness_calculator = FitnessCalculator(
quality_weight=0.6,
performance_weight=0.4
)
# Evolution prioritizes high-quality responses
best_genotypes = evolution_manager.select_top_performers(
quality_scores=quality_history,
performance_metrics=timing_data
)- API Reference - Complete API endpoints
- Troubleshooting - Common issues and solutions
- Architecture Overview - System design
- Getting Started - Quick setup guide
Last updated: January 24, 2026