Automated AI Code Comparison & Evaluation System
Anubis is an intelligent code evaluation platform that automates the process of comparing code generation across multiple AI models. Instead of manually testing different models and hoping for the best, Anubis takes a single prompt and automatically runs it through multiple AI models in parallel, evaluates each candidate using a comprehensive scoring pipeline, and surfaces the best result based on your priorities.
Developers often face a frustrating workflow when using AI code generation:
- Manual Testing: Developers try multiple AI models manually, one at a time
- Subjective Comparison: Code quality is judged by eye, leading to inconsistent decisions
- Uncertainty: No objective metrics to determine which model produces the best code for a specific task
- No Priority System: Can't easily prioritize what matters most (e.g., performance vs. readability)
This process is slow, inconsistent, and doesn't scale.
Anubis automates the entire workflow:
- Parallel Multi-Model Generation: Takes a single prompt and runs it through multiple AI models simultaneously in different branches
- Automated Evaluation: Each code candidate is automatically evaluated using a comprehensive scoring pipeline
- Priority-Based Weighting: Metrics are weighted according to user-selected priorities (e.g., if performance matters most, time complexity gets higher weight)
- Objective Ranking: Produces a final weighted score for each generated code and ranks them objectively
- Best Code Surfacing: Automatically surfaces the best code based on your priorities
Demo: `anubis-demo.mov`
Result: Fast, consistent, and objective code comparison that scales.
Anubis follows a modular, agentic architecture with clear separation of concerns:
```
┌──────────────────────────────────────────────────────────────┐
│                      Frontend (Next.js)                      │
│             User Interface & Real-time Streaming             │
└────────────────────────┬─────────────────────────────────────┘
                         │
                         │ HTTP/SSE
                         ▼
┌──────────────────────────────────────────────────────────────┐
│                     Backend API (Flask)                      │
│  ┌────────────────────────────────────────────────────────┐  │
│  │             Code Generator (Coding Agent)              │  │
│  │  • Multi-model parallel code generation                │  │
│  │  • Google AI SDK integration                           │  │
│  │  • Streaming support                                   │  │
│  └────────────────────┬───────────────────────────────────┘  │
│                       │                                      │
│                       ▼                                      │
│  ┌────────────────────────────────────────────────────────┐  │
│  │            Code Evaluator (Evaluator Agent)            │  │
│  │  • Orchestrates metric evaluation                      │  │
│  │  • Dynamic weight calculation                          │  │
│  │  • Overall score computation                           │  │
│  └────────────────────┬───────────────────────────────────┘  │
│                       │                                      │
│                       ▼                                      │
│  ┌────────────────────────────────────────────────────────┐  │
│  │                   Metric Analyzers                     │  │
│  │  • ReadabilityAnalyzer                                 │  │
│  │  • ConsistencyAnalyzer                                 │  │
│  │  • ComplexityAnalyzer                                  │  │
│  │  • DocumentationAnalyzer                               │  │
│  │  • DependencyAnalyzer                                  │  │
│  └────────────────────┬───────────────────────────────────┘  │
│                       │                                      │
│                       ▼                                      │
│  ┌────────────────────────────────────────────────────────┐  │
│  │                   Output Formatter                     │  │
│  │  • JSON structure generation                           │  │
│  │  • Ranking computation                                 │  │
│  │  • Summary creation                                    │  │
│  └────────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────────┘
                         │
                         │
                         ▼
             ┌────────────────────────┐
             │    Google AI Studio    │
             │     Gemini Models      │
             └────────────────────────┘
```
- Input Processing: User provides a prompt, a model list, and optional metric priorities
- Dynamic Weight Calculation: If priorities are provided, calculate exponential decay weights
- Parallel Code Generation:
  - Build prompts with metric priority instructions
  - Generate code via the Google AI SDK for each model simultaneously
  - Stream results in real time (SSE endpoint)
- Code Evaluation:
  - Run all metric analyzers on each generated code candidate
  - Calculate individual metric scores
  - Compute the overall weighted score
- Ranking & Output:
  - Rank models by overall score
  - Format results with the best code highlighted
  - Return JSON or stream via SSE
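The steps above can be sketched end to end as a toy pipeline. This is illustrative only: the function names, stub scores, and thread-pool approach are assumptions, not the actual Anubis internals.

```python
from concurrent.futures import ThreadPoolExecutor

def generate(model: str, prompt: str) -> str:
    """Stand-in for a real call to the Google AI SDK."""
    return f"# code from {model} for: {prompt}"

def evaluate(code: str, weights: dict[str, float]) -> float:
    """Stand-in for the metric analyzers; returns a weighted overall score."""
    scores = {"readability": 7.0, "time_complexity": 8.0}  # pretend analyzer output
    return sum(scores[metric] * weight for metric, weight in weights.items())

def run_pipeline(prompt: str, models: list[str], weights: dict[str, float]):
    # Parallel Code Generation: one worker per model
    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        candidates = dict(zip(models, pool.map(lambda m: generate(m, prompt), models)))
    # Code Evaluation + Ranking: score each candidate, best first
    ranked = sorted(
        ((model, evaluate(code, weights)) for model, code in candidates.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked
```

With this shape, a failure in one model's future can be caught individually, so the remaining candidates are still scored and ranked.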
Backend:
- Python 3.12+
- Flask - Web framework
- Google AI SDK (genai) - Gemini API integration
- PyYAML - Configuration management
Frontend:
- Next.js 16+ - React framework
- TypeScript - Type safety
- Server-Sent Events (SSE) - Real-time streaming
Infrastructure:
- Turborepo - Monorepo management
- pnpm - Package manager
Anubis leverages Google AI Studio and Gemini models as core components of its agentic workflow:
- Prompt Design & Testing: We designed and tested our prompt templates inside Google AI Studio to optimize output quality across different Gemini variants
- Rapid Iteration: Google AI Studio enabled quick iteration on:
  - System instructions for code generation
  - Code-generation prompts
  - Scoring heuristics and evaluation criteria
- Model Selection: Tested various Gemini models (gemini-2.0-flash-exp, gemini-1.5-pro, gemini-1.5-flash) to understand their strengths
Gemini models serve as one of the core "branches" in our multi-model AB testing pipeline:
- Comparator Agent: Uses Gemini models via Google AI SDK to generate code candidates
- Streaming Support: Real-time code generation with Server-Sent Events
- Multi-Model Comparison: Gemini variants are compared against each other and other models
- Priority-Aware Generation: Prompts include metric priorities to guide Gemini's code generation focus
- SDK: Google AI SDK (`google.genai`) for programmatic access
- Streaming: Real-time streaming of code chunks as they're generated
- Error Handling: Robust retry logic and graceful degradation
- Performance: Parallel execution of multiple Gemini model variants
- ✅ Multi-Model Code Generation: Generate code from multiple AI models simultaneously
- ✅ Real-Time Streaming: Watch code generation happen in real time via Server-Sent Events
- ✅ Comprehensive Evaluation: 5 key metrics with detailed analysis:
  - Readability: Variable naming, structure, comments
  - Consistency: Naming conventions, code style uniformity
  - Time Complexity: Algorithm efficiency, Big O analysis
  - Code Documentation: Docstrings, inline comments
  - External Dependencies: Standard library preference
- ✅ Priority-Based Weighting: Customize metric importance with exponential decay weighting
- ✅ Automated Ranking: Objective ranking by weighted overall score
- ✅ Best Code Surfacing: Automatically highlights the best generated code
- ✅ RESTful API: Easy-to-use JSON API endpoints
- ✅ Streaming API: Real-time updates via Server-Sent Events
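To make the analyzer layer concrete, here is a toy sketch of the pattern. The heuristics and class bodies are invented for illustration; the real analyzers live in `apps/backend/anubis/evaluators/`.

```python
import re

class BaseAnalyzer:
    """Shared interface for metric analyzers (illustrative sketch)."""
    name = "base"

    def score(self, code: str) -> float:
        """Return a score on the 0-10 scale."""
        raise NotImplementedError

class ReadabilityAnalyzer(BaseAnalyzer):
    """Toy heuristic: reward comments and descriptive (3+ character) names."""
    name = "readability"

    def score(self, code: str) -> float:
        lines = [line for line in code.splitlines() if line.strip()]
        if not lines:
            return 0.0
        comment_ratio = sum("#" in line for line in lines) / len(lines)
        names = re.findall(r"\b[a-z_][a-z0-9_]*\b", code)
        descriptive = sum(len(n) > 2 for n in names) / max(len(names), 1)
        return round(10 * (0.5 * comment_ratio + 0.5 * descriptive), 2)
```

The evaluator agent can then iterate over a list of such analyzers, collect each `score`, and combine them with the metric weights.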
- Parallel Execution: Multiple models generate code simultaneously, significantly reducing total time
- Dynamic Weighting: Exponential decay algorithm for priority-based metric weighting
- Graceful Error Handling: Continues evaluation even if one model fails
- Retry Logic: Automatic retries for failed API calls
- Configurable: YAML-based configuration for weights and models
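The retry behavior described above might look like this minimal sketch; the parameter names and backoff schedule are assumptions, not Anubis's actual implementation.

```python
import time

def with_retries(fn, attempts: int = 3, base_delay: float = 0.5):
    """Call fn(); on failure, wait base_delay * 2**attempt and try again."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries; the caller degrades gracefully
            time.sleep(base_delay * (2 ** attempt))
```

Wrapping each per-model API call this way lets one flaky model recover (or fail alone) without aborting the whole evaluation run.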
```
anubis/
├── apps/
│   ├── backend/                     # Python Flask backend
│   │   ├── anubis/                  # Core Anubis package
│   │   │   ├── code_generator.py    # Code generation via Google AI SDK
│   │   │   ├── code_evaluator.py    # Orchestrates metric evaluation
│   │   │   ├── output_formatter.py  # Formats results to JSON
│   │   │   └── evaluators/          # Metric analyzers
│   │   │       ├── base_analyzer.py
│   │   │       ├── readability_analyzer.py
│   │   │       ├── consistency_analyzer.py
│   │   │       ├── complexity_analyzer.py
│   │   │       ├── documentation_analyzer.py
│   │   │       └── dependency_analyzer.py
│   │   ├── app.py                   # Flask application
│   │   ├── config.yaml              # Configuration
│   │   └── requirements.txt         # Python dependencies
│   │
│   └── web/                         # Next.js frontend
│       ├── app/                     # Next.js app directory
│       └── package.json
│
├── packages/                        # Shared packages
│   ├── ui/                          # Shared UI components
│   ├── eslint-config/               # ESLint configuration
│   └── typescript-config/           # TypeScript configuration
│
├── package.json                     # Root package.json (Turborepo)
└── turbo.json                       # Turborepo configuration
```
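The `config.yaml` referenced in the tree might look roughly like the following. Every field name here is an illustrative assumption, not the actual schema.

```yaml
# apps/backend/config.yaml (illustrative sketch, not the real schema)
default_models:
  - gemini-2.0-flash-exp
  - gemini-1.5-pro
default_weights:            # used when no metric priorities are supplied
  readability: 0.2
  consistency: 0.2
  time_complexity: 0.2
  code_documentation: 0.2
  external_dependencies: 0.2
```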
- Node.js 18+ and pnpm 9.0.0+
- Python 3.12+
- Google API Key (available from Google AI Studio)
1. Clone the repository:

   ```bash
   git clone https://github.com/w3joe/anubis.git
   cd anubis
   ```

2. Install dependencies:

   ```bash
   pnpm install
   ```

3. Set up the backend:

   ```bash
   cd apps/backend
   python -m venv .venv
   source .venv/bin/activate  # On Windows: .venv\Scripts\activate
   pip install -r requirements.txt
   ```

4. Configure environment variables:

   ```bash
   # In apps/backend/
   export GOOGLE_API_KEY=your_api_key_here
   # Or create a .env file
   ```

5. Run the development servers:

   ```bash
   # From the project root
   pnpm run dev
   ```

   This will start:
   - Backend API at http://localhost:5001
   - Frontend at http://localhost:3000
`GET /health`

`POST /api/v1/evaluate`

Content-Type: `application/json`

Request:

```json
{
  "prompt": "Write a function to find the longest palindromic substring",
  "models": ["gemini-2.0-flash-exp", "gemini-1.5-pro"],
  "metrics": ["time_complexity", "readability", "consistency"]
}
```

Response: See the Backend README for the full response schema.
`POST /api/v1/evaluate/stream`

Content-Type: `application/json`

Request: Same as above

Response: Server-Sent Events stream with real-time updates:

- `generation_start`: Model begins generating
- `code_chunk`: Incremental code chunks
- `generation_complete`: Model finished
- `evaluation_result`: Metrics and scores
- `summary`: Final summary with rankings
- `complete`: Stream finished
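A client consumes the stream by splitting on blank lines and reading `event:`/`data:` fields. The sketch below uses the event names documented above, but the parsing helper and the sample payload fields (e.g. `model`) are assumptions for illustration.

```python
import json

def parse_sse(raw: str):
    """Parse a raw SSE response body into (event, payload) pairs."""
    events = []
    for block in raw.strip().split("\n\n"):   # SSE events are blank-line separated
        event, data_lines = "message", []
        for line in block.splitlines():
            if line.startswith("event:"):
                event = line[len("event:"):].strip()
            elif line.startswith("data:"):
                data_lines.append(line[len("data:"):].strip())
        events.append((event, json.loads("\n".join(data_lines))))
    return events

raw = (
    "event: generation_start\ndata: {\"model\": \"gemini-1.5-pro\"}\n\n"
    "event: complete\ndata: {}\n"
)
for event, payload in parse_sse(raw):
    print(event, payload)
```

In the browser, the same stream is handled natively by `EventSource` (or `fetch` with a reader), which is how the Next.js frontend renders updates in real time.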
Each metric is scored on a 0-10 scale:
| Metric | Description | What It Measures |
|---|---|---|
| Readability | Code clarity and maintainability | Variable naming, structure, comments |
| Consistency | Code style uniformity | Naming conventions, pattern adherence |
| Time Complexity | Algorithm efficiency | Big O notation, performance optimization |
| Code Documentation | Documentation quality | Docstrings, inline comments |
| External Dependencies | Dependency management | Standard library usage, dependency count |
When you provide a metrics array, Anubis uses exponential decay weighting:
- Higher-ranked metrics get exponentially higher weights
- Weights are normalized to sum to 1.0
- Example: `["time_complexity", "readability"]` gives time_complexity ~70% weight and readability ~30%
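The decay-and-normalize step can be reproduced in a few lines. The decay factor here is an assumption for illustration; the factor Anubis actually uses may differ, which is why the resulting split is approximate.

```python
def priority_weights(metrics: list[str], decay: float = 0.5) -> dict[str, float]:
    """Rank i gets raw weight decay**i; weights are then normalized to sum to 1.0."""
    raw = [decay ** i for i in range(len(metrics))]
    total = sum(raw)
    return {metric: weight / total for metric, weight in zip(metrics, raw)}

print(priority_weights(["time_complexity", "readability"]))
# with decay=0.5: time_complexity ≈ 0.67, readability ≈ 0.33
```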
```bash
curl -X POST http://localhost:5001/api/v1/evaluate \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Write a binary search function",
    "models": ["gemini-2.0-flash-exp", "gemini-1.5-pro"],
    "metrics": ["time_complexity", "readability"]
  }'
```

- Navigate to http://localhost:3000
- Enter your coding prompt
- Select models to compare
- Optionally set metric priorities
- Click "Evaluate" and watch real-time results
```bash
cd apps/backend
pytest tests/
```

- Backend: Follows PEP 8 style guidelines
- Frontend: ESLint + Prettier configured
Contributions are welcome! Please open an issue or submit a pull request.
[Add your license here]
- GitHub Repository: https://github.com/w3joe/anubis
- Google AI Studio: https://aistudio.google.com
- Gemini API Docs: https://ai.google.dev/docs
Anubis - Weighing the code of the AI gods ⚖️
Automated. Objective. Fast.