Automated AI Code Comparison & Evaluation System
Anubis is an intelligent code evaluation platform that automates the process of comparing code generation across multiple AI models. Instead of manually testing different models and hoping for the best, Anubis takes a single prompt and automatically runs it through multiple AI models in parallel, evaluates each candidate using a comprehensive scoring pipeline, and surfaces the best result based on your priorities.
Developers often face a frustrating workflow when using AI code generation:
- Manual Testing: Developers try multiple AI models manually, one at a time
- Subjective Comparison: Code quality is judged by eye, leading to inconsistent decisions
- Uncertainty: No objective metrics to determine which model produces the best code for a specific task
- No Priority System: Can't easily prioritize what matters most (e.g., performance vs. readability)
This process is slow, inconsistent, and doesn't scale.
Anubis automates the entire workflow:
- Parallel Multi-Model Generation: Takes a single prompt and runs it through multiple AI models simultaneously in different branches
- Automated Evaluation: Each code candidate is automatically evaluated using a comprehensive scoring pipeline
- Priority-Based Weighting: Metrics are weighted according to user-selected priorities (e.g., if performance matters most, time complexity gets higher weight)
- Objective Ranking: Produces a final weighted score for each generated code and ranks them objectively
- Best Code Surfacing: Automatically surfaces the best code based on your priorities
Demo: `anubis-demo.mov`
Result: Fast, consistent, and objective code comparison that scales.
Anubis follows a modular, agentic architecture with clear separation of concerns:
```
┌──────────────────────────────────────────────────────────────┐
│                      Frontend (Next.js)                      │
│             User Interface & Real-time Streaming             │
└────────────────────────┬─────────────────────────────────────┘
                         │
                         │ HTTP/SSE
                         ▼
┌──────────────────────────────────────────────────────────────┐
│                     Backend API (Flask)                      │
│  ┌────────────────────────────────────────────────────────┐  │
│  │             Code Generator (Coding Agent)              │  │
│  │  • Multi-model parallel code generation                │  │
│  │  • Google AI SDK integration                           │  │
│  │  • Streaming support                                   │  │
│  └────────────────────┬───────────────────────────────────┘  │
│                       │                                      │
│                       ▼                                      │
│  ┌────────────────────────────────────────────────────────┐  │
│  │            Code Evaluator (Evaluator Agent)            │  │
│  │  • Orchestrates metric evaluation                      │  │
│  │  • Dynamic weight calculation                          │  │
│  │  • Overall score computation                           │  │
│  └────────────────────┬───────────────────────────────────┘  │
│                       │                                      │
│                       ▼                                      │
│  ┌────────────────────────────────────────────────────────┐  │
│  │                   Metric Analyzers                     │  │
│  │  • ReadabilityAnalyzer                                 │  │
│  │  • ConsistencyAnalyzer                                 │  │
│  │  • ComplexityAnalyzer                                  │  │
│  │  • DocumentationAnalyzer                               │  │
│  │  • DependencyAnalyzer                                  │  │
│  └────────────────────┬───────────────────────────────────┘  │
│                       │                                      │
│                       ▼                                      │
│  ┌────────────────────────────────────────────────────────┐  │
│  │                   Output Formatter                     │  │
│  │  • JSON structure generation                           │  │
│  │  • Ranking computation                                 │  │
│  │  • Summary creation                                    │  │
│  └────────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────────┘
                         │
                         │
                         ▼
             ┌────────────────────────┐
             │    Google AI Studio    │
             │     Gemini Models      │
             └────────────────────────┘
```
- Input Processing: User provides a prompt, a model list, and optional metric priorities
- Dynamic Weight Calculation: If priorities are provided, calculate exponential decay weights
- Parallel Code Generation:
  - Build prompts with metric priority instructions
  - Generate code via the Google AI SDK for each model simultaneously
  - Stream results in real time (SSE endpoint)
- Code Evaluation:
  - Run all metric analyzers on each generated code candidate
  - Calculate individual metric scores
  - Compute the overall weighted score
- Ranking & Output:
  - Rank models by overall score
  - Format results with the best code highlighted
  - Return JSON or stream via SSE
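The steps above can be sketched end to end as a toy pipeline. This is illustrative only: the function names, stub scores, and thread-pool approach are assumptions, not the actual Anubis internals.

```python
from concurrent.futures import ThreadPoolExecutor

def generate(model: str, prompt: str) -> str:
    """Stand-in for a real call to the Google AI SDK."""
    return f"# code from {model} for: {prompt}"

def evaluate(code: str, weights: dict[str, float]) -> float:
    """Stand-in for the metric analyzers; returns a weighted overall score."""
    scores = {"readability": 7.0, "time_complexity": 8.0}  # pretend analyzer output
    return sum(scores[metric] * weight for metric, weight in weights.items())

def run_pipeline(prompt: str, models: list[str], weights: dict[str, float]):
    # Parallel Code Generation: one worker per model
    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        candidates = dict(zip(models, pool.map(lambda m: generate(m, prompt), models)))
    # Code Evaluation + Ranking: score each candidate, best first
    ranked = sorted(
        ((model, evaluate(code, weights)) for model, code in candidates.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked
```

With this shape, a failure in one model's future can be caught individually, so the remaining candidates are still scored and ranked.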
Backend:
- Python 3.12+
- Flask - Web framework
- Google AI SDK (genai) - Gemini API integration
- PyYAML - Configuration management
Frontend:
- Next.js 16+ - React framework
- TypeScript - Type safety
- Server-Sent Events (SSE) - Real-time streaming
Infrastructure:
- Turborepo - Monorepo management
- pnpm - Package manager
Anubis leverages Google AI Studio and Gemini models as core components of its agentic workflow:
- Prompt Design & Testing: We designed and tested our prompt templates inside Google AI Studio to optimize output quality across different Gemini variants
- Rapid Iteration: Google AI Studio enabled quick iteration on:
  - System instructions for code generation
  - Code-generation prompts
  - Scoring heuristics and evaluation criteria
- Model Selection: Tested various Gemini models (gemini-2.0-flash-exp, gemini-1.5-pro, gemini-1.5-flash) to understand their strengths
Gemini models serve as one of the core "branches" in our multi-model AB testing pipeline:
- Comparator Agent: Uses Gemini models via Google AI SDK to generate code candidates
- Streaming Support: Real-time code generation with Server-Sent Events
- Multi-Model Comparison: Gemini variants are compared against each other and other models
- Priority-Aware Generation: Prompts include metric priorities to guide Gemini's code generation focus
- SDK: Google AI SDK (`google.genai`) for programmatic access
- Streaming: Real-time streaming of code chunks as they're generated
- Error Handling: Robust retry logic and graceful degradation
- Performance: Parallel execution of multiple Gemini model variants
- ✅ Multi-Model Code Generation: Generate code from multiple AI models simultaneously
- ✅ Real-Time Streaming: Watch code generation happen in real time via Server-Sent Events
- ✅ Comprehensive Evaluation: 5 key metrics with detailed analysis:
  - Readability: Variable naming, structure, comments
  - Consistency: Naming conventions, code style uniformity
  - Time Complexity: Algorithm efficiency, Big O analysis
  - Code Documentation: Docstrings, inline comments
  - External Dependencies: Standard library preference
- ✅ Priority-Based Weighting: Customize metric importance with exponential decay weighting
- ✅ Automated Ranking: Objective ranking by weighted overall score
- ✅ Best Code Surfacing: Automatically highlights the best generated code
- ✅ RESTful API: Easy-to-use JSON API endpoints
- ✅ Streaming API: Real-time updates via Server-Sent Events
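To make the analyzer layer concrete, here is a toy sketch of the pattern. The heuristics and class bodies are invented for illustration; the real analyzers live in `apps/backend/anubis/evaluators/`.

```python
import re

class BaseAnalyzer:
    """Shared interface for metric analyzers (illustrative sketch)."""
    name = "base"

    def score(self, code: str) -> float:
        """Return a score on the 0-10 scale."""
        raise NotImplementedError

class ReadabilityAnalyzer(BaseAnalyzer):
    """Toy heuristic: reward comments and descriptive (3+ character) names."""
    name = "readability"

    def score(self, code: str) -> float:
        lines = [line for line in code.splitlines() if line.strip()]
        if not lines:
            return 0.0
        comment_ratio = sum("#" in line for line in lines) / len(lines)
        names = re.findall(r"\b[a-z_][a-z0-9_]*\b", code)
        descriptive = sum(len(n) > 2 for n in names) / max(len(names), 1)
        return round(10 * (0.5 * comment_ratio + 0.5 * descriptive), 2)
```

The evaluator agent can then iterate over a list of such analyzers, collect each `score`, and combine them with the metric weights.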
- Parallel Execution: Multiple models generate code simultaneously, significantly reducing total time
- Dynamic Weighting: Exponential decay algorithm for priority-based metric weighting
- Graceful Error Handling: Continues evaluation even if one model fails
- Retry Logic: Automatic retries for failed API calls
- Configurable: YAML-based configuration for weights and models
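The retry behavior described above might look like this minimal sketch; the parameter names and backoff schedule are assumptions, not Anubis's actual implementation.

```python
import time

def with_retries(fn, attempts: int = 3, base_delay: float = 0.5):
    """Call fn(); on failure, wait base_delay * 2**attempt and try again."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries; the caller degrades gracefully
            time.sleep(base_delay * (2 ** attempt))
```

Wrapping each per-model API call this way lets one flaky model recover (or fail alone) without aborting the whole evaluation run.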
```
anubis/
├── apps/
│   ├── backend/                     # Python Flask backend
│   │   ├── anubis/                  # Core Anubis package
│   │   │   ├── code_generator.py    # Code generation via Google AI SDK
│   │   │   ├── code_evaluator.py    # Orchestrates metric evaluation
│   │   │   ├── output_formatter.py  # Formats results to JSON
│   │   │   └── evaluators/          # Metric analyzers
│   │   │       ├── base_analyzer.py
│   │   │       ├── readability_analyzer.py
│   │   │       ├── consistency_analyzer.py
│   │   │       ├── complexity_analyzer.py
│   │   │       ├── documentation_analyzer.py
│   │   │       └── dependency_analyzer.py
│   │   ├── app.py                   # Flask application
│   │   ├── config.yaml              # Configuration
│   │   └── requirements.txt         # Python dependencies
│   │
│   └── web/                         # Next.js frontend
│       ├── app/                     # Next.js app directory
│       └── package.json
│
├── packages/                        # Shared packages
│   ├── ui/                          # Shared UI components
│   ├── eslint-config/               # ESLint configuration
│   └── typescript-config/           # TypeScript configuration
│
├── package.json                     # Root package.json (Turborepo)
└── turbo.json                       # Turborepo configuration
```
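The `config.yaml` referenced in the tree might look roughly like the following. Every field name here is an illustrative assumption, not the actual schema.

```yaml
# apps/backend/config.yaml (illustrative sketch, not the real schema)
default_models:
  - gemini-2.0-flash-exp
  - gemini-1.5-pro
default_weights:            # used when no metric priorities are supplied
  readability: 0.2
  consistency: 0.2
  time_complexity: 0.2
  code_documentation: 0.2
  external_dependencies: 0.2
```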
- Node.js 18+ and pnpm 9.0.0+
- Python 3.12+
- Google API Key (available from Google AI Studio)
1. Clone the repository:

   ```bash
   git clone https://github.com/w3joe/anubis.git
   cd anubis
   ```

2. Install dependencies:

   ```bash
   pnpm install
   ```

3. Set up the backend:

   ```bash
   cd apps/backend
   python -m venv .venv
   source .venv/bin/activate  # On Windows: .venv\Scripts\activate
   pip install -r requirements.txt
   ```

4. Configure environment variables:

   ```bash
   # In apps/backend/
   export GOOGLE_API_KEY=your_api_key_here
   # Or create a .env file
   ```

5. Run the development servers:

   ```bash
   # From the project root
   pnpm run dev
   ```

   This will start:
   - Backend API at http://localhost:5001
   - Frontend at http://localhost:3000
`GET /health`

`POST /api/v1/evaluate`

Content-Type: `application/json`

Request:

```json
{
  "prompt": "Write a function to find the longest palindromic substring",
  "models": ["gemini-2.0-flash-exp", "gemini-1.5-pro"],
  "metrics": ["time_complexity", "readability", "consistency"]
}
```

Response: See the Backend README for the full response schema.
`POST /api/v1/evaluate/stream`

Content-Type: `application/json`

Request: Same as above

Response: Server-Sent Events stream with real-time updates:

- `generation_start`: Model begins generating
- `code_chunk`: Incremental code chunks
- `generation_complete`: Model finished
- `evaluation_result`: Metrics and scores
- `summary`: Final summary with rankings
- `complete`: Stream finished
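A client consumes the stream by splitting on blank lines and reading `event:`/`data:` fields. The sketch below uses the event names documented above, but the parsing helper and the sample payload fields (e.g. `model`) are assumptions for illustration.

```python
import json

def parse_sse(raw: str):
    """Parse a raw SSE response body into (event, payload) pairs."""
    events = []
    for block in raw.strip().split("\n\n"):   # SSE events are blank-line separated
        event, data_lines = "message", []
        for line in block.splitlines():
            if line.startswith("event:"):
                event = line[len("event:"):].strip()
            elif line.startswith("data:"):
                data_lines.append(line[len("data:"):].strip())
        events.append((event, json.loads("\n".join(data_lines))))
    return events

raw = (
    "event: generation_start\ndata: {\"model\": \"gemini-1.5-pro\"}\n\n"
    "event: complete\ndata: {}\n"
)
for event, payload in parse_sse(raw):
    print(event, payload)
```

In the browser, the same stream is handled natively by `EventSource` (or `fetch` with a reader), which is how the Next.js frontend renders updates in real time.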
Each metric is scored on a 0-10 scale:
| Metric | Description | What It Measures |
|---|---|---|
| Readability | Code clarity and maintainability | Variable naming, structure, comments |
| Consistency | Code style uniformity | Naming conventions, pattern adherence |
| Time Complexity | Algorithm efficiency | Big O notation, performance optimization |
| Code Documentation | Documentation quality | Docstrings, inline comments |
| External Dependencies | Dependency management | Standard library usage, dependency count |
When you provide a metrics array, Anubis uses exponential decay weighting:
- Higher-ranked metrics get exponentially higher weights
- Weights are normalized to sum to 1.0
- Example: `["time_complexity", "readability"]` gives time_complexity ~70% weight and readability ~30%
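The decay-and-normalize step can be reproduced in a few lines. The decay factor here is an assumption for illustration; the factor Anubis actually uses may differ, which is why the resulting split is approximate.

```python
def priority_weights(metrics: list[str], decay: float = 0.5) -> dict[str, float]:
    """Rank i gets raw weight decay**i; weights are then normalized to sum to 1.0."""
    raw = [decay ** i for i in range(len(metrics))]
    total = sum(raw)
    return {metric: weight / total for metric, weight in zip(metrics, raw)}

print(priority_weights(["time_complexity", "readability"]))
# with decay=0.5: time_complexity ≈ 0.67, readability ≈ 0.33
```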
```bash
curl -X POST http://localhost:5001/api/v1/evaluate \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Write a binary search function",
    "models": ["gemini-2.0-flash-exp", "gemini-1.5-pro"],
    "metrics": ["time_complexity", "readability"]
  }'
```

- Navigate to http://localhost:3000
- Enter your coding prompt
- Select models to compare
- Optionally set metric priorities
- Click "Evaluate" and watch real-time results
```bash
cd apps/backend
pytest tests/
```

- Backend: Follows PEP 8 style guidelines
- Frontend: ESLint + Prettier configured
Contributions are welcome! Please open an issue or submit a pull request.
[Add your license here]
- GitHub Repository: https://github.com/w3joe/anubis
- Google AI Studio: https://aistudio.google.com
- Gemini API Docs: https://ai.google.dev/docs
Anubis - Weighing the code of the AI gods ⚖️
Automated. Objective. Fast.