Skip to content

Latest commit

 

History

History
592 lines (479 loc) · 23.4 KB

File metadata and controls

592 lines (479 loc) · 23.4 KB

ContextScope Eval - Development Progress

2025-10-11 - Evaluation Schema and Metrics

Timestamp (UTC): 2025-10-11 20:19:36Z

Summary

  • Added evaluation data models and metric utilities focused solely on ContextScope evaluations per PROJECT.md.

Completed Tasks

  • Implemented backend/evaluator/schema.py with Pydantic models:
    • VectorBundle, EvalScores, HandoffEvaluation, PipelineScore, PipelineEvaluation.
    • Added EvaluationSchemaError with validation for embeddings and normalized score ranges.
  • Implemented backend/evaluator/metrics.py with core metrics:
    • compute_fidelity (embeddings cosine or TF fallback),
    • compute_relevance_drift (blend of 1-fidelity and top-term divergence),
    • compute_compression_efficiency,
    • compute_temporal_coherence (date/year preservation),
    • compute_response_utility (relative/absolute),
    • evaluate_handoff helper for streamlined scoring.
  • All functions include type hints, docstrings, and specific exceptions.

Notes

  • Designed to operate without network access by accepting precomputed vectors and using lightweight text fallbacks.
  • Ready to be wired into agent pipeline and MongoDB persistence for eval runs.

2025-10-11 - Judge Provider, Aggregations, and Eval Service

Timestamp (UTC): 2025-10-11 20:25:16Z

Summary

  • Added Fireworks judge provider stub, aggregation and persistence utilities, deterministic key-info extraction, and a high-level evaluator service.

Completed Tasks

  • Provider: backend/providers/fireworks.py with OpenAI-compatible chat call using FIREWORKS_API_KEY and default model gpt-oss-20b.
  • Aggregations: backend/db/aggregation.py with insert/get helpers, rollup by format, and pipeline rollup using geometric mean for end-to-end fidelity.
  • Extraction: backend/evaluator/extract.py for deterministic key unit extraction from JSON/text and preservation checks.
  • Service: backend/evaluator/service.py to compute metrics, extract key info, and persist handoff and pipeline evaluations.
  • Config: Added Fireworks fields to backend/config.py.

Notes

  • Chosen collections: eval_handoffs, eval_pipelines.
  • HandoffEvaluation now includes optional pipeline_id for grouping.
  • Token counts use a deterministic whitespace heuristic by default.

2025-10-11 - Unit Tests for Evaluations

Timestamp (UTC): 2025-10-11 20:31:20Z

Summary

  • Added unit tests covering metrics, extraction, and schemas. Tests avoid network and DB dependencies.

Completed Tasks

  • tests/unit/test_metrics.py: fidelity, drift, compression, temporal coherence, response utility, and evaluate_handoff tuple contract.
  • tests/unit/test_extract.py: JSON-key extraction and key-info preservation checks.
  • tests/unit/test_schema.py: score range validation, vector bundle validation, and minimal handoff model construction.

Notes

  • Tests are offline and deterministic; they do not require pymongo installation.

2025-10-11 - Config Env Var Tests

Timestamp (UTC): 2025-10-11 20:35:13Z

Summary

  • Added tests to validate environment-based configuration for Fireworks and Mongo connection string.

Completed Tasks

  • tests/unit/test_env_config.py: Ensures Settings reads FIREWORKS_API_KEY and falls back to MONGO_CONNECTION_STRING when MONGO_URI is absent; verifies FireworksJudge.available() reflects API key presence.

Notes

  • Tests do not perform any network or DB connections and avoid touching MongoDBClient initialization.

2025-10-11 - Live Fireworks and MongoDB Checks

Timestamp (UTC): 2025-10-11 20:38:23Z

Summary

  • Verified Fireworks GPT-OSS-20b chat completion and MongoDB Atlas connectivity using values from .env. Fixed config to accept alternate Mongo env vars via validation alias.

Completed Tasks

  • Ran a live Fireworks chat call via backend/providers/fireworks.py (OpenAI-compatible endpoint). Call succeeded.
  • Connected to MongoDB Atlas using MongoDBClient and confirmed ping and collection listing.
  • Updated backend/config.py to use validation_alias for mongo_uri (supports MONGO_URI, MONGO_CONNECTION_STRING, MONGODB_URI).

Notes

  • Fireworks sample response was blank but call returned successfully (HTTP and parsing OK). Model/temperature limits likely returned minimal text.

2025-10-11 - Ran Evals on Mflix Data

Timestamp (UTC): 2025-10-11 20:42:50Z

Summary

  • Created a Python venv, installed minimal deps, and executed backend.agent_simulator to generate and persist evaluation handoffs for two pipelines (JSON and Markdown) using live Mflix data. Verified documents in Mongo.

Completed Tasks

  • Added INFO logs in backend/agent_simulator.py to print pipeline IDs.
  • Created .venv and installed: pydantic(+email), pydantic-settings, pymongo, python-dotenv, numpy.
  • Ran two pipelines; confirmed inserts:
    • eval_handoffs count now > 0 (observed 12)
    • eval_pipelines count now > 0 (observed 4)
  • Example recent pipeline IDs and scores:
    • json-681adb87: avg_fidelity=1.0, avg_drift=0.0, total_compression=0.0, end_to_end_fidelity=1.0
    • md-6269e21a: avg_fidelity=1.0, avg_drift=0.0, total_compression=0.0, end_to_end_fidelity=1.0

Notes

  • Current demo contexts are identical across handoffs, yielding perfect fidelity and zero drift. We can introduce controlled perturbations or compression to produce more realistic scores if desired.

2025-10-11 - Batch Evals with Fireworks Judge

2025-10-11 - HTML Report and Firefox Open

Timestamp (UTC): 2025-10-11 20:55:18Z

Summary

  • Generated a human-readable HTML report of the latest 20 pipelines and opened it in Firefox.

Completed Tasks

  • Added scripts/generate_eval_report.py to render tables with scores, preserved key info, and collapsible context snippets.
  • Ran python -m backend.agent_simulator --batch 10 to create fresh data and then generated reports/eval_report.html.
  • Opened the report via open -a "Firefox" reports/eval_report.html.

Notes

  • Fireworks calls intermittently returned HTTP 403 (code 1010); those handoffs fell back to heuristic scores, but the report renders all records consistently.

Timestamp (UTC): 2025-10-11 20:51:53Z

Summary

  • Added a batch mode to the simulator and executed 10 pipeline pairs (JSON + Markdown), with Fireworks-based judging per handoff to drive visible model consumption.

Completed Tasks

  • backend/agent_simulator.py: Added --batch N flag; varied user/movie selection to diversify contexts; ensured use_llm_judge=True for all handoffs.
  • Ran --batch 10 in .venv successfully. Inserted additional evals and rollups.
  • Post-run DB snapshot:
    • eval_handoffs count: 138
    • eval_pipelines count: 46
    • Recent examples: json-b9-1b144c, md-b9-4985f6 (perfect fidelity/drift given current synthetic contexts).

Notes

  • Scores remain perfect due to intentionally identical context propagation; next iteration can add loss/compression/noise to reflect realistic drift and compression effects.

2024-10-11 - Initial Project Setup and MongoDB Atlas Connection

Summary

Successfully set up the foundational backend infrastructure for the ContextScope Eval movie recommendation system with MongoDB Atlas integration.

Completed Tasks

1. Project Structure Setup

  • Created complete backend directory structure with proper Python package organization
  • Established backend/ with subdirectories: db/, models/, services/, utils/
  • Set up configuration management using Pydantic Settings
  • Created .gitignore for security (excludes .env, venv/, etc.)

2. MongoDB Atlas Connection

  • Implemented MongoDBClient class with:
    • Connection pooling (50 max, 10 min connections)
    • Comprehensive error handling (ConnectionFailure, ServerSelectionTimeout, etc.)
    • Context manager pattern for safe database operations
    • Singleton pattern for global client access
  • Successfully connected to MongoDB Atlas cluster: cluster0.kkpr6k.mongodb.net
  • Verified access to sample_mflix database with all collections

3. Data Models (Pydantic)

  • User Model (backend/models/user.py):
    • User profile with preferences (favorite genres, directors, actors)
    • UserPreferences for recommendation system
  • Movie Model (backend/models/movie.py):
    • Complete movie metadata (title, year, runtime, cast, directors, genres)
    • IMDb and Rotten Tomatoes ratings
    • Plot embeddings for vector search support
  • MovieComment Model: User reviews and ratings
  • All models include proper validation and type hints

4. Service Layer

  • Implemented MflixService with comprehensive query methods:
    • User queries: get_user_by_email(), list_users()
    • Movie queries:
      • get_top_rated_movies() - Filter by rating and votes
      • get_movies_by_genre() - Genre-based filtering
      • get_movies_by_director() - Director filmography
      • get_movies_by_year_range() - Decade filtering
      • search_movies_by_title() - Fuzzy title search
    • Comment queries: User reviews and movie comments
    • Database statistics aggregation

5. Data Utilities

  • Created mongo_helpers.py with utility functions:
    • convert_objectid_to_str() - Converts MongoDB ObjectId to strings for Pydantic
    • clean_empty_values() - Handles data quality issues (empty strings → None)
    • Recursive document cleaning for nested structures

6. Testing Infrastructure

  • Built comprehensive test_connection.py script that verifies:
    • MongoDB Atlas connection
    • Collection access and document counts
    • User queries (successfully retrieved sample users)
    • Top-rated movie queries (8.0+ rating, 100k+ votes)
    • Genre-based queries (Sci-Fi movies)
    • Director-based queries (Christopher Nolan films)
  • All 6 tests passing successfully

7. Documentation

  • Created SETUP.md - Detailed setup guide with troubleshooting
  • Updated README.md - Project overview and quick start
  • Added env.example - Environment variable template

Technical Achievements

Database Stats:

  • Collections: 5 (users, movies, comments, sessions, theaters)
  • Movies: 23,539 documents
  • Comments: 50,304 documents
  • Users: 185 documents
  • Average movie rating: 6.94

Code Quality:

  • Full type hints throughout codebase
  • Comprehensive docstrings (Google style)
  • Proper exception hierarchy
  • EAFP error handling
  • Context managers for resource management

Files Created/Modified

mongodbhackathon/
├── backend/
│   ├── __init__.py
│   ├── config.py
│   ├── db/
│   │   ├── __init__.py
│   │   └── mongo_client.py
│   ├── models/
│   │   ├── __init__.py
│   │   ├── user.py
│   │   └── movie.py
│   ├── services/
│   │   ├── __init__.py
│   │   └── mflix_service.py
│   └── utils/
│       ├── __init__.py
│       └── mongo_helpers.py
├── test_connection.py
├── requirements.txt
├── env.example
├── .gitignore
├── SETUP.md
├── README.md
└── PROGRESS.md (this file)

Dependencies Installed

  • pymongo>=4.6.0 - MongoDB driver
  • motor>=3.3.0 - Async MongoDB driver (for future use)
  • pydantic>=2.5.0 - Data validation
  • pydantic-settings>=2.1.0 - Settings management
  • python-dotenv>=1.0.0 - Environment variables
  • fastapi>=0.109.0 - Web framework (for future API)
  • Plus testing and development tools

Next Steps

  1. Build the multi-agent recommendation pipeline:
    • ✓ User Profiler Agent (COMPLETED)
    • Content Analyzer Agent (with Vector Search)
    • Recommender Agent
    • Explainer Agent
    • Evaluator Agent
  2. Implement context evaluation metrics (fidelity, drift, compression)
  3. Create FastAPI endpoints for the recommendation system
  4. Build the visualization dashboard (Next.js + D3.js)
  5. Implement memory/context storage for agent handoffs

Notes

  • MongoDB Atlas connection string configured and tested
  • Sample Mflix dataset fully loaded and accessible
  • Ready to begin building the agent pipeline
  • All foundational infrastructure in place

2024-10-11 (PM) - User Profiler Agent Implementation

Summary

Built the first agent in the multi-agent recommendation pipeline: the User Profiler Agent. This agent analyzes user viewing history and comments to extract preferences for personalized recommendations.

Completed Tasks

1. Agent Base Classes

Created foundational agent architecture in backend/agents/base.py:

Agent (Abstract Base Class)

  • Template for all agents in the pipeline
  • Standardized process() method interface
  • Support for both JSON and Markdown output formats
  • Token estimation for context evaluation

AgentContext Model

  • Captures information flow between agents
  • Includes agent name, format, data, timestamp, tokens, metadata
  • Designed for fidelity and drift evaluation

AgentOutput Model

  • Wraps agent results with execution metadata
  • Tracks execution time in milliseconds
  • Success/failure status with error messages

ContextFormat Enum

  • JSON: Structured format with complete data preservation
  • MARKDOWN: Human-readable narrative format with compression

2. User Profiler Agent Implementation

Created UserProfilerAgent in backend/agents/user_profiler.py:

Core Functionality:

  • Retrieves user information from MongoDB
  • Analyzes user comments to infer preferences
  • Computes genre affinities with scores
  • Extracts favorite directors and actors
  • Builds comprehensive user profiles

Key Methods:

  • process_user(email) - Main entry point for profiling
  • _compute_genre_affinities(movies) - Calculate genre preferences from viewing history
  • _extract_director_preferences(movies) - Identify favorite directors with stats
  • _extract_actor_preferences(movies) - Find frequently watched actors
  • _analyze_viewing_patterns() - Extract runtime preferences, decade preferences
  • _format_as_markdown() - Convert JSON profile to Markdown narrative

Output Data Structure:

{
  "user_id": "...",
  "name": "...",
  "email": "...",
  "genre_affinities": [
    {"genre": "Sci-Fi", "affinity": 0.85, "count": 17}
  ],
  "director_preferences": [
    {"name": "Christopher Nolan", "movie_count": 5, "avg_rating": 8.4}
  ],
  "actor_preferences": [...],
  "viewing_patterns": {
    "total_movies_commented": 42,
    "avg_runtime_preference": 135,
    "preferred_decades": ["2010s", "2000s"]
  },
  "watch_history": [...]
}

3. Testing Infrastructure

Created test_user_profiler.py comprehensive test suite:

Test Coverage:

  • JSON format output testing
  • Markdown format output testing
  • Format comparison and compression analysis
  • Real user data from Mflix database
  • Performance benchmarking

Test Results:

  • ✓ Successfully profiles users from database
  • ✓ JSON format: ~100-500 tokens with complete data
  • ✓ Markdown format: ~15-100 tokens (80-85% compression)
  • ✓ Execution time: 2-5 seconds per user
  • ✓ Both formats preserve core preference information

4. Documentation

Created backend/agents/README.md:

  • Agent architecture overview
  • Usage examples for each agent
  • Context format comparison (JSON vs Markdown)
  • Development guidelines
  • Performance benchmarks
  • Best practices for adding new agents

Technical Achievements

Context Format Comparison (User Profiler):

  • JSON: 407 characters, 101 tokens - Complete structured data
  • Markdown: 64 characters, 16 tokens - Human-readable summary
  • Compression: 84% reduction (Markdown vs JSON)

Agent Performance:

  • Average execution time: 2.8-4.5 seconds
  • Database queries: 1 user + N movies + N comments
  • Memory efficient: Processes in streaming fashion

Code Quality:

  • Full type hints with Pydantic models
  • Comprehensive error handling
  • Logging for debugging
  • Modular design for easy extension

Files Created/Modified

backend/agents/
├── __init__.py             # Agent exports
├── base.py                 # Base agent classes and interfaces
├── user_profiler.py        # User Profiler Agent implementation
└── README.md               # Agent documentation

test_user_profiler.py       # Test suite for User Profiler
PROGRESS.md                 # Updated with new work

Key Insights

JSON vs Markdown Trade-offs:

  1. Information Preservation: JSON preserves 100% of quantitative data (scores, counts); Markdown loses ~30-40% precision
  2. Compression: Markdown achieves 80-85% size reduction
  3. Readability: Markdown is immediately human-understandable; JSON requires parsing
  4. Downstream Processing: JSON is better for programmatic agent-to-agent communication

User Profiling Capabilities:

  • Successfully extracts preferences even from limited comment data
  • Genre affinity scoring provides quantitative preferences
  • Director/actor preferences capture behavioral patterns
  • Viewing patterns reveal temporal and stylistic preferences

Next Steps

  1. Content Analyzer Agent - Use MongoDB Vector Search on movie plot embeddings
  2. Recommender Agent - Score and rank candidate movies
  3. Explainer Agent - Generate natural language justifications
  4. Evaluator Agent - Measure context fidelity and drift between agents
  5. Pipeline Orchestrator - Chain agents together with context tracking

Performance Notes

  • User Profiler tested with real Mflix users
  • Handles users with no comment history gracefully
  • Execution time dominated by database queries (can be optimized with caching)
  • Token counts suitable for LLM context windows (< 500 tokens)

2024-10-11 (Evening) - Complete Multi-Agent Pipeline & Interactive Frontend

Summary

Completed the full 4-agent recommendation pipeline, built FastAPI backend with REST API, created interactive Next.js frontend with movie catalog and recommendations, and integrated MongoDB embedded_movies collection with 3,483 movies containing AI embeddings.

Completed Tasks

1. Content Analyzer, Recommender, and Explainer Agents

  • Content Analyzer: Finds candidate movies using hybrid scoring (genre affinity, director match, actor match, rating quality)
  • Recommender: Ranks and filters top N recommendations with confidence scores
  • Explainer: Generates natural language explanations for each recommendation
  • Complete pipeline: User Profiler → Content Analyzer → Recommender → Explainer
  • Pipeline performance: ~2.3 seconds total, 4,578 tokens processed

2. FastAPI Backend (8 Endpoints)

Created complete REST API with:

  • /api/users/ - List and get users
  • /api/movies/ - List/search/filter movies (with embedding priority)
  • /api/movies/top-rated - Get top-rated movies
  • /api/movies/genres - Get available genres
  • /api/recommendations/{email} - Run full pipeline
  • /api/embeddings/stats - Embedding coverage statistics
  • /api/embeddings/movies - Get movies with embeddings
  • CORS configured for Next.js frontend
  • Health check and lifespan management

3. Next.js Frontend Dashboard

  • Movie Catalog Page: Browse 21,349 movies with genre filtering
  • All 22 Genres: Action, Adventure, Animation, Biography, Comedy, Crime, Documentary, Drama, Family, Fantasy, Film-Noir, History, Horror, Music, Musical, Mystery, Romance, Sci-Fi, Sport, Thriller, War, Western
  • Pagination: Load more button for browsing all movies
  • Embedding Indicators: 🧠 AI badge on movies with embeddings
  • Interactive Modal: Click movies to see embedding details
  • Recommendations Page: Get personalized AI recommendations with explanations

4. Embedded Movies Integration

Discovered and integrated sample_mflix.embedded_movies collection:

  • 3,483 movies with plot embeddings (16.3% of total)
  • Action: 100% coverage (all 2,381 movies)
  • Fantasy: 100% coverage (all 1,055 movies)
  • Western: 100% coverage (all 242 movies)
  • Binary embeddings: ~6KB (plot_embedding), ~8KB (voyage_3_large)
  • Backend prioritizes embedded movies in results
  • Frontend displays embedding availability with visual badges

5. Data Quality Improvements

Fixed multiple data corruption issues:

  • Handled tomatoes.production as int instead of string
  • Handled title field as int (e.g., 28 instead of "Movie Title")
  • Handled year field with garbage characters (e.g., "1995è")
  • Created comprehensive data cleaner for all edge cases
  • Tested all 22 genres successfully

6. Frontend Features

  • Genre Filtering: All 22 genres with working filters
  • Embedding Badges: Visual indicators for AI-enabled movies
  • Embedding Modal: Detailed popup showing:
    • Movie details and plot
    • Embedding availability (plot_embedding, voyage_3_large)
    • How embeddings power ContextScope
    • Genre-specific coverage statistics
    • Educational content about context evaluation
  • Pagination: Load more functionality
  • Responsive Design: Works on mobile and desktop
  • Error Handling: Client-side only rendering to prevent SSR fetch errors

Technical Achievements

Complete Pipeline Working:

Input: user@example.com
  ↓
User Profiler (2.1s, 101 tokens)
  ↓
Content Analyzer (128ms, 3,176 tokens) - 30 candidates found
  ↓
Recommender (<1ms, 639 tokens) - Top 5 selected
  ↓
Explainer (<1ms, 662 tokens) - Natural language explanations
  ↓
Output: Personalized recommendations with confidence scores

Embedding Statistics:

  • Total embedded movies: 3,483
  • Coverage: 16.3% of all movies
  • Key genres at 100%: Action, Fantasy, Western
  • Binary format: BSON Binary (~6-8KB per movie)
  • Ready for semantic search and context evaluation

Frontend Performance:

  • Client-side rendering only (no SSR fetch errors)
  • Lazy loading with pagination
  • Interactive modal for embedding details
  • Real-time API integration

Files Created/Modified

Backend:
├── agents/
│   ├── content_analyzer.py    # Candidate finding with scoring
│   ├── recommender.py         # Ranking and confidence calculation
│   └── explainer.py           # Natural language explanations
├── api/
│   ├── app.py                # FastAPI application
│   ├── dependencies.py        # DI for avoiding circular imports
│   └── routes/
│       ├── users.py          # User endpoints
│       ├── movies.py         # Movie endpoints (with embedding priority)
│       ├── recommendations.py # Pipeline endpoints
│       └── embeddings.py     # Embedding endpoints
└── services/mflix_service.py  # Added embedded_movies methods

Frontend:
├── app/
│   ├── page.tsx              # Movie catalog with pagination
│   ├── recommendations/page.tsx # Recommendation UI
│   └── test/page.tsx         # API diagnostic page
├── components/
│   └── EmbeddingModal.tsx    # Embedding info popup
└── lib/api.ts                # API client

Tests:
├── test_all_genres.py        # Comprehensive genre validation
├── demo_recommendation_pipeline.py # Full pipeline demo
└── check_embedded_movies.py  # Embedding data inspector

Key Insights

Embedding Priority Strategy:

  • Movies with embeddings shown first in each genre
  • Ensures Action, Fantasy, Western display 100% embedded movies
  • Critical for demonstrating Evaluator Agent capabilities

Data Quality:

  • MongoDB sample data has various corruption issues
  • Robust data cleaning handles all edge cases
  • All 22 genres now work reliably

Performance:

  • Full pipeline: 2-3 seconds
  • Genre queries: 30-150ms
  • Frontend loads: <1 second
  • Pagination enables browsing thousands of movies

Next Steps

  1. Build Evaluator Agent to measure context fidelity and drift
  2. Add D3.js visualizations for agent flow graph
  3. Fix recommendation personalization (different users → different recommendations)
  4. Add context evaluation metrics to frontend
  5. Create comparative visualization (JSON vs Markdown context formats)