Quality Scoring System

MemEvolve includes an independent, parity-based quality scoring system designed to evaluate LLM responses fairly across model types and architectures.

🎯 Overview

The quality scoring system addresses a critical challenge in LLM evaluation: ensuring fair assessment regardless of whether a model uses direct responses or reasoning/thinking processes.

Key Principles

Parity-Based Evaluation: Treats all models fairly regardless of reasoning capabilities
Independent Assessment: Separate from memory system to avoid bias
Multi-Factor Analysis: Considers content quality, structure, and insight
Model-Agnostic: Works with any OpenAI-compatible model
Adaptive Learning: Tracks performance patterns and adjusts automatically

🏗️ Architecture

Core Components

ResponseQualityScorer
├── _calculate_response_quality()     # Main scoring logic
├── _evaluate_reasoning_response()    # For thinking models
├── _evaluate_direct_response()       # For standard models
├── _calculate_content_factors()      # Semantic density, structure, insight
├── _evaluate_reasoning_consistency() # Reasoning-answer alignment
└── _apply_bias_correction()          # Model-type bias adjustment

Scoring Process

Content Analysis: Evaluate semantic density, logical structure, and insight
Query Alignment: Assess how well response addresses the user's question
Model Type Detection: Identify if response includes reasoning content
Parity Application: Apply appropriate scoring methodology:
- Direct responses: 100% answer quality evaluation
- Reasoning responses: 70% answer + 30% reasoning quality
Bias Correction: Adjust for systematic model-type biases
Final Score: Normalized score between 0.0 and 1.0
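The parity, bias-correction, and normalization steps above can be sketched as follows. This is a minimal illustration, not the scorer's actual implementation: `combine_scores` and `has_reasoning` are hypothetical helper names, and the component scores are assumed to be in the range 0.0 to 1.0 already.

```python
from typing import Optional

ANSWER_WEIGHT = 0.7      # share of the final answer in a reasoning response
REASONING_WEIGHT = 0.3   # share of the reasoning trace


def has_reasoning(response: dict) -> bool:
    """Assumed detection rule: a non-empty `reasoning_content` field."""
    return bool(response.get("reasoning_content"))


def combine_scores(answer_quality: float,
                   reasoning_quality: Optional[float] = None,
                   bias_adjustment: float = 0.0) -> float:
    """Blend component scores, apply the adjustment, clamp to [0.0, 1.0]."""
    if reasoning_quality is None:
        # Direct response: the answer carries 100% of the score.
        raw = answer_quality
    else:
        raw = (ANSWER_WEIGHT * answer_quality
               + REASONING_WEIGHT * reasoning_quality)
    return max(0.0, min(1.0, raw + bias_adjustment))
```

For example, an answer quality of 0.8 with a reasoning quality of 0.6 blends to 0.7 × 0.8 + 0.3 × 0.6 = 0.74, while the same answer from a direct model scores 0.8.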

📊 Scoring Factors

Content Quality Factors

Semantic Density

Definition: Meaningful concepts per word
Purpose: Encourages concise, information-rich responses
Calculation: Concept-to-word ratio with uniqueness weighting
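One plausible reading of "concept-to-word ratio with uniqueness weighting" is sketched below. The stopword list, the tokenizer, and the function name `semantic_density` are assumptions made for illustration; the actual scorer may define concepts differently.

```python
import re

# Small illustrative stopword list (an assumption, not the scorer's list).
STOPWORDS = {
    "a", "an", "the", "is", "are", "of", "to", "and", "in", "it",
    "that", "this", "for", "on", "with", "as", "by", "be",
}


def semantic_density(text: str) -> float:
    """Share of non-stopword tokens, scaled by how many of them are unique."""
    words = re.findall(r"[a-z']+", text.lower())
    concepts = [w for w in words if w not in STOPWORDS]
    if not concepts:
        return 0.0
    concept_ratio = len(concepts) / len(words)
    uniqueness = len(set(concepts)) / len(concepts)
    return concept_ratio * uniqueness
```

Under this definition, repeating the same term lowers the score, which matches the stated goal of rewarding concise, information-rich responses.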

Logical Structure

Definition: Presence of step-by-step reasoning and examples
Purpose: Rewards well-organized, educational responses
Indicators: Numbered steps, bullet points, illustrative examples
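The indicators listed above could be detected with simple regular expressions. The patterns and the equal weighting in `structure_score` are assumptions for this sketch, not the scorer's documented behavior.

```python
import re

# Line-start patterns; [ \t]* keeps each match on a single line.
NUMBERED_STEP = re.compile(r"^[ \t]*\d+[.)]\s+", re.MULTILINE)
BULLET = re.compile(r"^[ \t]*[-*•]\s+", re.MULTILINE)
EXAMPLE = re.compile(r"\b(?:for example|for instance|e\.g\.)", re.IGNORECASE)


def structure_indicators(text: str) -> dict:
    """Count each structural indicator in the text."""
    return {
        "numbered_steps": len(NUMBERED_STEP.findall(text)),
        "bullets": len(BULLET.findall(text)),
        "examples": len(EXAMPLE.findall(text)),
    }


def structure_score(text: str) -> float:
    """Fraction of indicator types that appear at least once."""
    found = structure_indicators(text)
    present = sum(1 for count in found.values() if count > 0)
    return present / len(found)
```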

Insight & Novelty

Definition: Unique perspectives that go beyond generic responses
Purpose: Encourages creative, valuable contributions
Detection: Comparison against common response patterns
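As a rough illustration of comparing a response against common patterns, the sketch below penalizes sentences that contain stock phrases. The phrase list and the name `novelty_score` are hypothetical; a production scorer would presumably use a richer pattern set.

```python
import re

# Hypothetical examples of generic phrasing.
GENERIC_PATTERNS = (
    "it depends",
    "there are many factors",
    "in conclusion",
)


def novelty_score(text: str) -> float:
    """1.0 minus the fraction of sentences containing a generic phrase."""
    sentences = [s.strip().lower()
                 for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0.0
    generic = sum(
        1 for s in sentences if any(p in s for p in GENERIC_PATTERNS)
    )
    return 1.0 - generic / len(sentences)
```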

Query Alignment

Definition: How directly the response addresses each aspect of the question
Purpose: Ensures relevance and completeness
Scoring: Coverage of question components, accuracy
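Coverage of question components can be approximated by checking how many query terms reappear in the response. The sketch below assumes exact word matching and a small illustrative stopword list. It does not attempt the accuracy check mentioned above, which needs more than lexical overlap.

```python
import re

# Question words and fillers ignored when extracting query terms (assumed).
QUERY_STOPWORDS = {
    "why", "what", "how", "does", "do", "is", "are", "the", "a", "an",
}


def query_coverage(query: str, response: str) -> float:
    """Fraction of query terms that also occur in the response."""
    query_terms = {
        w for w in re.findall(r"[a-z]+", query.lower())
        if w not in QUERY_STOPWORDS
    }
    if not query_terms:
        return 0.0
    response_terms = set(re.findall(r"[a-z]+", response.lower()))
    return len(query_terms & response_terms) / len(query_terms)
```

Because matching is exact, "feel" in the query does not match "feels" in the response. Stemming or embedding similarity would be a natural refinement.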

Reasoning-Specific Factors

Reasoning Quality

Step-by-step logic: Coherent thought process
Error handling: Identification and correction of mistakes
Self-correction: Acknowledgment and revision of initial thoughts

Reasoning-Answer Consistency

Alignment: How well reasoning leads to final answer
Contradiction detection: Identifying logical inconsistencies
Support: Whether reasoning supports the conclusion

🔄 Bias Correction System

Problem Statement

Without bias correction, models without reasoning capabilities often receive lower scores simply because they're evaluated on different criteria than thinking models.

Solution: Adaptive Bias Tracking

# Track performance by model type
bias_tracker = {
    "reasoning_models": {
        "scores": [],
        "avg_score": 0.0,
        "count": 0
    },
    "direct_models": {
        "scores": [], 
        "avg_score": 0.0,
        "count": 0
    }
}

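# Apply bias correction: lower reasoning scores and raise direct scores
# by the same factor; unknown model types are left unadjusted
if model_type == "reasoning":
    bias_adjustment = -bias_adjustment_factor
elif model_type == "direct":
    bias_adjustment = +bias_adjustment_factor
else:
    bias_adjustment = 0.0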

Bias Detection Mechanisms

Performance Tracking: Monitor average scores by model type
Statistical Analysis: Identify significant performance gaps
Automatic Adjustment: Apply corrections when bias is detected
Continuous Learning: System adapts as more data is collected
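The detection steps above might look like the following sketch. The `min_samples` and `gap_threshold` values, the even split of the gap, and the function names are assumptions. A fixed threshold also stands in for whatever statistical test the system actually uses.

```python
from statistics import mean


def detect_gap(tracker: dict, min_samples: int = 20,
               gap_threshold: float = 0.05) -> float:
    """Return (reasoning avg - direct avg), or 0.0 if not significant."""
    reasoning = tracker["reasoning_models"]["scores"]
    direct = tracker["direct_models"]["scores"]
    if len(reasoning) < min_samples or len(direct) < min_samples:
        return 0.0  # Not enough data to judge
    gap = mean(reasoning) - mean(direct)
    return gap if abs(gap) >= gap_threshold else 0.0


def apply_correction(score: float, model_type: str, gap: float) -> float:
    """Move each group halfway toward the other, then clamp to [0, 1]."""
    shift = gap / 2
    if model_type == "reasoning":
        score -= shift
    elif model_type == "direct":
        score += shift
    return max(0.0, min(1.0, score))
```

With 20 reasoning scores of 0.8 and 20 direct scores of 0.6, the gap is 0.2, so both groups are moved to about 0.7.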

📈 Score Interpretation

Score Ranges

Score Range	Quality Level	Characteristics
0.8 - 1.0	Excellent	Insightful, well-structured, novel, perfectly aligned
0.6 - 0.8	Good	Clear, relevant, some insight, decent structure
0.4 - 0.6	Average	Basic relevance, minimal structure, generic content
0.2 - 0.4	Poor	Partial relevance, poor structure, generic
0.0 - 0.2	Very Poor	Irrelevant, incoherent, incorrect
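A small helper can map a score to the quality levels in the table. Since adjacent ranges share a boundary value, this sketch assumes each lower bound is inclusive. `quality_level` is a hypothetical name.

```python
def quality_level(score: float) -> str:
    """Map a normalized score to its quality level (lower bound inclusive)."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be in [0.0, 1.0], got {score}")
    if score >= 0.8:
        return "Excellent"
    if score >= 0.6:
        return "Good"
    if score >= 0.4:
        return "Average"
    if score >= 0.2:
        return "Poor"
    return "Very Poor"
```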

Thinking Model Scores

For models with reasoning content, the final score is weighted as follows:

70% based on final answer quality
30% based on reasoning process quality

This ensures:

Fair Competition: Direct models aren't penalized for lacking reasoning
Reasoning Value: High-quality reasoning is properly rewarded
Answer Focus: Final response quality remains primary

🔧 Configuration

Environment Variables

Variable	Description	Default	Notes
`MEMEVOLVE_LOG_MIDDLEWARE_ENABLE`	Enable detailed scoring logs	`false`	Set to `true` for debugging
`MEMEVOLVE_QUALITY_BIAS_CORRECTION`	Enable bias correction	`true`	Disable for raw scores
`MEMEVOLVE_QUALITY_MIN_THRESHOLD`	Minimum score for experience storage	`0.1`	Filter low-quality responses
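For reference, these variables could be read in Python as shown below. The variable names and defaults come from the table. The `QualitySettings` class, `load_settings`, and the set of accepted truthy strings are assumptions, and MemEvolve's own configuration loader may parse them differently.

```python
import os
from dataclasses import dataclass


def _env_bool(name: str, default: bool) -> bool:
    """Parse a boolean variable; the accepted truthy values are an assumption."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class QualitySettings:
    log_middleware: bool
    bias_correction: bool
    min_threshold: float


def load_settings() -> QualitySettings:
    """Read the three variables, falling back to the documented defaults."""
    return QualitySettings(
        log_middleware=_env_bool("MEMEVOLVE_LOG_MIDDLEWARE_ENABLE", False),
        bias_correction=_env_bool("MEMEVOLVE_QUALITY_BIAS_CORRECTION", True),
        min_threshold=float(
            os.environ.get("MEMEVOLVE_QUALITY_MIN_THRESHOLD", "0.1")
        ),
    )
```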

Score Weighting

# Reasoning model weights
reasoning_weights = {
    "answer_quality": 0.7,
    "reasoning_quality": 0.3
}

# Direct model weights  
direct_weights = {
    "answer_quality": 1.0
}

📋 Usage Examples

Basic Quality Scoring

from utils.quality_scorer import ResponseQualityScorer

scorer = ResponseQualityScorer(debug=True)

# Score a direct response
response = {
    "role": "assistant", 
    "content": "Water appears wet due to surface tension..."
}
context = {
    "original_query": "Why does water feel wet?",
    "messages": [{"role": "user", "content": "Why does water feel wet?"}]
}

score = scorer.calculate_response_quality(response, context, "Why does water feel wet?")
print(f"Quality score: {score:.3f}")

Scoring Reasoning Content

# Score a thinking model response
reasoning_response = {
    "role": "assistant",
    "content": "Water feels wet due to surface tension...",
    "reasoning_content": "First, consider what 'wet' means..."
}

score = scorer.calculate_response_quality(
    reasoning_response, context, "Why does water feel wet?"
)
print(f"Reasoning model score: {score:.3f}")

Monitoring Quality Trends

# Track quality over time
import json
from datetime import datetime

quality_log = []
# `responses` is assumed to be an iterable of response batches (lists of dicts)
for response_batch in responses:
    scores = [
        scorer.calculate_response_quality(response, context, query)
        for response in response_batch
    ]
    if not scores:
        continue  # Skip empty batches to avoid dividing by zero

    quality_log.append({
        "timestamp": datetime.now().isoformat(),
        "avg_score": sum(scores) / len(scores),
        "count": len(scores),
        "has_reasoning": any(r.get("reasoning_content") for r in response_batch)
    })

# Save for analysis
with open("quality_trends.json", "w") as f:
    json.dump(quality_log, f, indent=2)
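Once the log is saved, a rolling average helps separate a real trend from batch-to-batch noise. The sketch below assumes the entry layout written by the snippet above. `load_quality_log` and `rolling_average` are hypothetical helper names.

```python
import json
from statistics import mean


def load_quality_log(path: str) -> list:
    """Load the list of entries written by the monitoring loop."""
    with open(path) as f:
        return json.load(f)


def rolling_average(entries: list, window: int = 3) -> list:
    """Trailing mean of `avg_score` over up to `window` entries."""
    averages = [entry["avg_score"] for entry in entries]
    return [
        mean(averages[max(0, i - window + 1): i + 1])
        for i in range(len(averages))
    ]
```

Usage: `rolling_average(load_quality_log("quality_trends.json"), window=5)`.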

🐛 Troubleshooting

Common Issues

"All scores are the same"

Cause: Likely missing query context or response content
Solution: Ensure both content and original query are provided

# ❌ Missing context
score = scorer.calculate_response_quality(response, {}, query)

# ✅ Proper context
context = {"original_query": query, "messages": [{"role": "user", "content": query}]}
score = scorer.calculate_response_quality(response, context, query)

"Reasoning models score lower than expected"

Cause: Bias correction may need adjustment or insufficient data
Solution: Monitor bias tracking and allow system to learn

# Enable bias correction logging
export MEMEVOLVE_LOG_MIDDLEWARE_ENABLE=true
export MEMEVOLVE_QUALITY_BIAS_CORRECTION=true

# Check bias tracking in logs
grep "bias correction" logs/api-server.log

"Scores seem too high/low"

Cause: May need threshold adjustment for your specific use case
Solution: Adjust minimum threshold or weighting factors

# Custom scorer with adjusted thresholds
custom_scorer = ResponseQualityScorer(
    debug=True,
    min_threshold=0.2,  # Higher threshold
    bias_correction=False  # Disable if bias not needed
)

Debug Mode

Enable detailed scoring analysis:

export MEMEVOLVE_LOG_MIDDLEWARE_ENABLE=true
python scripts/start_api.py

Debug output includes:

Content factor breakdown
Reasoning evaluation details
Bias correction calculations
Final score composition

📊 Performance Impact

Computational Overhead

Direct responses: ~5-10ms additional processing
Reasoning responses: ~15-25ms due to consistency analysis
Bias correction: Minimal overhead (<1ms)

Memory Usage

Bias tracking: ~100KB for performance history
Scoring cache: Optional, ~1MB for recent scores
Total impact: Negligible for typical deployments
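The figures above will vary with hardware and response length, so it may be worth measuring overhead in your own deployment. The helper below is a generic timing sketch. `measure_overhead_ms` is a hypothetical name, and `score_fn` stands for any callable, such as a bound scorer method.

```python
import time
from statistics import median


def measure_overhead_ms(score_fn, *args, repeats: int = 50) -> float:
    """Median wall-clock time of score_fn(*args) in milliseconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        score_fn(*args)
        samples.append((time.perf_counter() - start) * 1000.0)
    return median(samples)
```

The median is used rather than the mean so that a single slow run, for example a cold cache, does not skew the result.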

🔮 Advanced Features

Custom Quality Factors

Extend scoring with custom evaluation criteria:

class CustomQualityScorer(ResponseQualityScorer):
    def _calculate_custom_factors(self, content, query):
        # Add your custom scoring logic
        technical_accuracy = self._evaluate_technical_accuracy(content)
        code_quality = self._evaluate_code_quality(content)

        return {
            "technical_accuracy": technical_accuracy,
            "code_quality": code_quality
        }

    def _evaluate_technical_accuracy(self, content):
        # Implementation-specific logic; should return a score between 0.0 and 1.0
        raise NotImplementedError

    def _evaluate_code_quality(self, content):
        # Implementation-specific logic; should return a score between 0.0 and 1.0
        raise NotImplementedError

Integration with Evolution System

Quality scores feed directly into the evolution framework:

# Quality scores influence fitness evaluation
fitness_calculator = FitnessCalculator(
    quality_weight=0.6,
    performance_weight=0.4
)

# Evolution prioritizes high-quality responses
best_genotypes = evolution_manager.select_top_performers(
    quality_scores=quality_history,
    performance_metrics=timing_data
)
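How `FitnessCalculator` combines its inputs is not shown here. One simple interpretation of the 0.6/0.4 weights is a linear blend, sketched below. The functions `fitness` and `top_performers` are hypothetical, and the performance value is assumed to be normalized to the range 0.0 to 1.0 beforehand.

```python
def fitness(quality: float, performance: float,
            quality_weight: float = 0.6,
            performance_weight: float = 0.4) -> float:
    """Linear blend of a quality score and a normalized performance score."""
    return quality_weight * quality + performance_weight * performance


def top_performers(candidates: dict, k: int = 2) -> list:
    """Names of the k candidates with the highest blended fitness.

    `candidates` maps a name to a (quality, performance) tuple.
    """
    ranked = sorted(
        candidates,
        key=lambda name: fitness(*candidates[name]),
        reverse=True,
    )
    return ranked[:k]
```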

📚 Related Documentation

API Reference - Complete API endpoints
Troubleshooting - Common issues and solutions
Architecture Overview - System design
Getting Started - Quick setup guide

Last updated: January 24, 2026
