Skip to content

w3joe/anubis

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

20 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

Anubis βš–οΈπ“£

Automated AI Code Comparison & Evaluation System

GitHub

πŸ“– Project Description

Anubis is an intelligent code evaluation platform that automates the process of comparing code generation across multiple AI models. Instead of manually testing different models and hoping for the best, Anubis takes a single prompt and automatically runs it through multiple AI models in parallel, evaluates each candidate using a comprehensive scoring pipeline, and surfaces the best result based on your priorities.

🎯 The Problem

Developers often face a frustrating workflow when using AI code generation:

  1. Manual Testing: Developers try multiple AI models manually, one at a time
  2. Subjective Comparison: Code quality is judged by eye, leading to inconsistent decisions
  3. Uncertainty: No objective metrics to determine which model produces the best code for a specific task
  4. No Priority System: Can't easily prioritize what matters most (e.g., performance vs. readability)

This process is slow, inconsistent, and doesn't scale.

✨ The Solution

Anubis automates the entire workflow:

  1. Parallel Multi-Model Generation: Takes a single prompt and runs it through multiple AI models simultaneously in different branches
  2. Automated Evaluation: Each code candidate is automatically evaluated using a comprehensive scoring pipeline
  3. Priority-Based Weighting: Metrics are weighted according to user-selected priorities (e.g., if performance matters most, time complexity gets higher weight)
  4. Objective Ranking: Produces a final weighted score for each generated code and ranks them objectively
  5. Best Code Surface: Automatically surfaces the best code based on your priorities
anubis-demo.mov

Result: Fast, consistent, and objective code comparison that scales.

πŸ—οΈ Architecture

Anubis follows a modular, agentic architecture with clear separation of concerns:

Core Components

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                      Frontend (Next.js)                     β”‚
β”‚              User Interface & Real-time Streaming           β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                        β”‚
                        β”‚ HTTP/SSE
                        β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                   Backend API (Flask)                       β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚
β”‚  β”‚         Code Generator (Coding Agent)                β”‚   β”‚
β”‚  β”‚  β€’ Multi-model parallel code generation              β”‚   β”‚
β”‚  β”‚  β€’ Google AI SDK integration                         β”‚   β”‚
β”‚  β”‚  β€’ Streaming support                                 β”‚   β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚
β”‚                      β”‚                                      β”‚
β”‚                      β–Ό                                      β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚
β”‚  β”‚      Code Evaluator (Evaluator Agent)                β”‚   β”‚
β”‚  β”‚  β€’ Orchestrates metric evaluation                    β”‚   β”‚
β”‚  β”‚  β€’ Dynamic weight calculation                        β”‚   β”‚
β”‚  β”‚  β€’ Overall score computation                         β”‚   β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚
β”‚                      β”‚                                      β”‚
β”‚                      β–Ό                                      β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚
β”‚  β”‚              Metric Analyzers                        β”‚   β”‚
β”‚  β”‚  β€’ ReadabilityAnalyzer                               β”‚   β”‚
β”‚  β”‚  β€’ ConsistencyAnalyzer                               β”‚   β”‚
β”‚  β”‚  β€’ ComplexityAnalyzer                                β”‚   β”‚
β”‚  β”‚  β€’ DocumentationAnalyzer                             β”‚   β”‚
β”‚  β”‚  β€’ DependencyAnalyzer                                β”‚   β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚
β”‚                      β”‚                                      β”‚
β”‚                      β–Ό                                      β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚
β”‚  β”‚           Output Formatter                           β”‚   β”‚
β”‚  β”‚  β€’ JSON structure generation                         β”‚   β”‚
β”‚  β”‚  β€’ Ranking computation                               β”‚   β”‚
β”‚  β”‚  β€’ Summary creation                                  β”‚   β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                        β”‚
                        β”‚
                        β–Ό
            β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
            β”‚   Google AI Studio    β”‚
            β”‚   Gemini Models       β”‚
            β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Data Flow

  1. Input Processing: User provides prompt, model list, and optional metric priorities
  2. Dynamic Weight Calculation: If priorities provided, calculate exponential decay weights
  3. Parallel Code Generation:
    • Build prompts with metric priority instructions
    • Generate code via Google AI SDK for each model simultaneously
    • Stream results in real-time (SSE endpoint)
  4. Code Evaluation:
    • Run all metric analyzers on each generated code
    • Calculate individual metric scores
    • Compute overall weighted score
  5. Ranking & Output:
    • Rank models by overall score
    • Format results with best code highlighted
    • Return JSON or stream via SSE

Technology Stack

Backend:

  • Python 3.12+
  • Flask - Web framework
  • Google AI SDK (genai) - Gemini API integration
  • PyYAML - Configuration management

Frontend:

  • Next.js 16+ - React framework
  • TypeScript - Type safety
  • Server-Sent Events (SSE) - Real-time streaming

Infrastructure:

  • Turborepo - Monorepo management
  • pnpm - Package manager

πŸ€– Google AI Studio & Gemini Integration

Anubis leverages Google AI Studio and Gemini models as core components of its agentic workflow:

How We Use Google AI Studio

  1. Prompt Design & Testing: We designed and tested our prompt templates inside Google AI Studio to optimize output quality across different Gemini variants
  2. Rapid Iteration: Google AI Studio enabled quick iteration on:
    • System instructions for code generation
    • Code-generation prompts
    • Scoring heuristics and evaluation criteria
  3. Model Selection: Tested various Gemini models (gemini-2.0-flash-exp, gemini-1.5-pro, gemini-1.5-flash) to understand their strengths

Gemini Models in the Pipeline

Gemini models serve as one of the core "branches" in our multi-model AB testing pipeline:

  • Comparator Agent: Uses Gemini models via Google AI SDK to generate code candidates
  • Streaming Support: Real-time code generation with Server-Sent Events
  • Multi-Model Comparison: Gemini variants are compared against each other and other models
  • Priority-Aware Generation: Prompts include metric priorities to guide Gemini's code generation focus

Integration Details

  • SDK: Google AI SDK (google.genai) for programmatic access
  • Streaming: Real-time streaming of code chunks as they're generated
  • Error Handling: Robust retry logic and graceful degradation
  • Performance: Parallel execution of multiple Gemini model variants

πŸš€ Features

Core Features

  • βœ… Multi-Model Code Generation: Generate code from multiple AI models simultaneously
  • βœ… Real-Time Streaming: Watch code generation happen in real-time via Server-Sent Events
  • βœ… Comprehensive Evaluation: 5 key metrics with detailed analysis:
    • Readability: Variable naming, structure, comments
    • Consistency: Naming conventions, code style uniformity
    • Time Complexity: Algorithm efficiency, Big O analysis
    • Code Documentation: Docstrings, inline comments
    • External Dependencies: Standard library preference
  • βœ… Priority-Based Weighting: Customize metric importance with exponential decay weighting
  • βœ… Automated Ranking: Objective ranking by weighted overall score
  • βœ… Best Code Surface: Automatically highlights the best generated code
  • βœ… RESTful API: Easy-to-use JSON API endpoints
  • βœ… Streaming API: Real-time updates via Server-Sent Events

Advanced Features

  • Parallel Execution: Multiple models generate code simultaneously, significantly reducing total time
  • Dynamic Weighting: Exponential decay algorithm for priority-based metric weighting
  • Graceful Error Handling: Continues evaluation even if one model fails
  • Retry Logic: Automatic retries for failed API calls
  • Configurable: YAML-based configuration for weights and models

πŸ“¦ Project Structure

anubis/
β”œβ”€β”€ apps/
β”‚   β”œβ”€β”€ backend/              # Python Flask backend
β”‚   β”‚   β”œβ”€β”€ anubis/          # Core Anubis package
β”‚   β”‚   β”‚   β”œβ”€β”€ code_generator.py      # Code generation via Google AI SDK
β”‚   β”‚   β”‚   β”œβ”€β”€ code_evaluator.py      # Orchestrates metric evaluation
β”‚   β”‚   β”‚   β”œβ”€β”€ output_formatter.py    # Formats results to JSON
β”‚   β”‚   β”‚   └── evaluators/            # Metric analyzers
β”‚   β”‚   β”‚       β”œβ”€β”€ base_analyzer.py
β”‚   β”‚   β”‚       β”œβ”€β”€ readability_analyzer.py
β”‚   β”‚   β”‚       β”œβ”€β”€ consistency_analyzer.py
β”‚   β”‚   β”‚       β”œβ”€β”€ complexity_analyzer.py
β”‚   β”‚   β”‚       β”œβ”€β”€ documentation_analyzer.py
β”‚   β”‚   β”‚       └── dependency_analyzer.py
β”‚   β”‚   β”œβ”€β”€ app.py           # Flask application
β”‚   β”‚   β”œβ”€β”€ config.yaml      # Configuration
β”‚   β”‚   └── requirements.txt  # Python dependencies
β”‚   β”‚
β”‚   └── web/                 # Next.js frontend
β”‚       β”œβ”€β”€ app/             # Next.js app directory
β”‚       └── package.json
β”‚
β”œβ”€β”€ packages/                # Shared packages
β”‚   β”œβ”€β”€ ui/                 # Shared UI components
β”‚   β”œβ”€β”€ eslint-config/      # ESLint configuration
β”‚   └── typescript-config/  # TypeScript configuration
β”‚
β”œβ”€β”€ package.json            # Root package.json (Turborepo)
└── turbo.json              # Turborepo configuration

πŸ› οΈ Setup

Prerequisites

  • Node.js 18+ and pnpm 9.0.0+
  • Python 3.12+
  • Google API Key (Get one here)

Installation

  1. Clone the repository:

    git clone https://github.com/w3joe/anubis.git
    cd anubis
  2. Install dependencies:

    pnpm install
  3. Set up backend:

    cd apps/backend
    python -m venv .venv
    source .venv/bin/activate  # On Windows: .venv\Scripts\activate
    pip install -r requirements.txt
  4. Configure environment variables:

    # In apps/backend/
    export GOOGLE_API_KEY=your_api_key_here
    # Or create a .env file
  5. Run the development servers:

    # From project root
    pnpm run dev

    This will start:

    • Backend API at http://localhost:5001
    • Frontend at http://localhost:3000

πŸ“‘ API Endpoints

1. Health Check

GET /health

2. Code Evaluation (Main Endpoint)

POST /api/v1/evaluate
Content-Type: application/json

Request:

{
  "prompt": "Write a function to find the longest palindromic substring",
  "models": ["gemini-2.0-flash-exp", "gemini-1.5-pro"],
  "metrics": ["time_complexity", "readability", "consistency"]
}

Response: See Backend README for full response schema.

3. Streaming Evaluation

POST /api/v1/evaluate/stream
Content-Type: application/json

Request: Same as above

Response: Server-Sent Events stream with real-time updates:

  • generation_start: Model begins generating
  • code_chunk: Incremental code chunks
  • generation_complete: Model finished
  • evaluation_result: Metrics and scores
  • summary: Final summary with rankings
  • complete: Stream finished

πŸ“Š Evaluation Metrics

Each metric is scored on a 0-10 scale:

Metric Description What It Measures
Readability Code clarity and maintainability Variable naming, structure, comments
Consistency Code style uniformity Naming conventions, pattern adherence
Time Complexity Algorithm efficiency Big O notation, performance optimization
Code Documentation Documentation quality Docstrings, inline comments
External Dependencies Dependency management Standard library usage, dependency count

Priority-Based Weighting

When you provide a metrics array, Anubis uses exponential decay weighting:

  • Higher-ranked metrics get exponentially higher weights
  • Weights are normalized to sum to 1.0
  • Example: ["time_complexity", "readability"] gives time_complexity ~70% weight, readability ~30%

🎬 Example Usage

Using cURL

curl -X POST http://localhost:5000/api/v1/evaluate \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Write a binary search function",
    "models": ["gemini-2.0-flash-exp", "gemini-1.5-pro"],
    "metrics": ["time_complexity", "readability"]
  }'

Using the Frontend

  1. Navigate to http://localhost:3000
  2. Enter your coding prompt
  3. Select models to compare
  4. Optionally set metric priorities
  5. Click "Evaluate" and watch real-time results

πŸ§ͺ Development

Running Tests

cd apps/backend
pytest tests/

Code Style

  • Backend: Follows PEP 8 style guidelines
  • Frontend: ESLint + Prettier configured

Contributing

Contributions are welcome! Please open an issue or submit a pull request.

πŸ“ License

[Add your license here]

πŸ”— Links


Anubis - Weighing the code of the AI gods βš–οΈ

Automated. Objective. Fast.

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors