
Reproducibility as Accuracy (RaA) Benchmark

License: Apache 2.0

RaA is a framework to evaluate information fidelity in iterative multimodal transformations. It assesses how well AI models preserve semantic content as data "loops" between different modalities, quantifying drift and degeneration.

➡️ View the Full Documentation in /docs

Table of Contents

  • Overview
  • Technical Workflow
  • Evaluation Criteria
  • Reporting
  • Getting Started
  • Usage
  • Project Structure
  • License

Overview

The Reproducibility as Accuracy (RaA) Benchmark is a framework for measuring how well AI models preserve information when translating it between different formats, like text and images. At its core, RaA quantifies how much "meaning" is lost in translation.

The Analogy: A High-Tech Game of "Telephone" 📞

Imagine playing the game "Telephone," but with AI. You start with an image—say, a photo of a red car on a book.

  1. Whisper to the AI (Image → Text): The first AI model looks at the image and "whispers" a description of it: "A photorealistic image of a small, red toy car on top of a large, open book."
  2. Whisper Back (Text → Image): This text description is then given to a second AI model, which tries to draw the image based only on that description.
  3. Repeat: The newly generated image is then shown to the first model, which generates a new description, and the cycle repeats.

After several rounds, how much does the final image resemble the original? This change, or "semantic drift," is precisely what RaA is designed to measure.

Technical Workflow

The benchmark operationalizes this "game" through a configurable, automated pipeline:

  1. The Loop (I-T-I or T-I-T): The core of the benchmark is the LoopController, which manages the iterative process. It starts with a "seed" (an image or text) and passes it through a loop for a specified number of iterations.

  2. Generative Models (prompt_engine.py): At each step, the prompt_engine calls a generative model (like Google's Gemini) to perform the transformation.

  3. Automated Evaluation (evaluation_engine.py): Once the loop is complete, the EvaluationEngine assesses the drift by performing pairwise comparisons using a multimodal LLM guided by a detailed set of criteria.

By the end of the process, RaA provides a clear report on the model's ability to maintain information fidelity through repeated transformations.
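The three steps above can be sketched as a minimal Python loop. This is a control-flow illustration only: the function names (describe_image, render_image, run_iti_loop) are placeholders, not the repository's actual LoopController or prompt_engine API.

```python
def describe_image(image):
    """Image -> Text step (stand-in for a captioning model call)."""
    return f"description of {image}"

def render_image(text):
    """Text -> Image step (stand-in for an image-generation model call)."""
    return f"image drawn from '{text}'"

def run_iti_loop(seed_image, iterations):
    """Alternate I->T and T->I, collecting every artifact for later evaluation."""
    artifacts = [seed_image]
    current = seed_image
    for _ in range(iterations):
        text = describe_image(current)
        artifacts.append(text)
        current = render_image(text)
        artifacts.append(current)
    return artifacts

trace = run_iti_loop("seed.png", iterations=3)
assert len(trace) == 1 + 2 * 3  # one seed plus two artifacts per iteration
```

The key design point is that every intermediate artifact is kept, so the evaluation stage can compare any pair of steps, not just the first and last.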


Evaluation Criteria

For a detailed breakdown, see Evaluation Criteria in the docs.

The EvaluationEngine uses detailed prompts to score similarity across five key criteria:

  • Content Correspondence (The "What"): Checks if the core subjects and objects are the same in both outputs.
  • Compositional Alignment (The "How"): Assesses if the arrangement and relationships of elements are consistent.
  • Fidelity & Completeness (The "Detail"): Measures if both outputs have a similar level of detail.
  • Stylistic Congruence (The "Feel"): Evaluates if the artistic style, tone, and feel are the same.
  • Overall Semantic Intent (The "Message"): Determines if the overall meaning and purpose are preserved.
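One plausible way to hold the five scores for a single pairwise comparison is a small record type. The field names below are hypothetical (they mirror the criteria list, not the project's actual schema), and the unweighted mean is purely illustrative, not the benchmark's real aggregation rule.

```python
from dataclasses import dataclass, asdict
from statistics import mean

@dataclass
class SimilarityScores:
    """Hypothetical per-comparison record, one field per criterion."""
    content_correspondence: float   # the "what"
    compositional_alignment: float  # the "how"
    fidelity_completeness: float    # the "detail"
    stylistic_congruence: float     # the "feel"
    semantic_intent: float          # the "message"

    def overall(self) -> float:
        """Unweighted mean across the five criteria (illustrative only)."""
        return mean(asdict(self).values())

s = SimilarityScores(0.9, 0.8, 0.7, 0.6, 1.0)
print(round(s.overall(), 2))  # 0.8
```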

Reporting

For a detailed breakdown, see Reporting in the docs.

After the evaluation is complete, the RaA benchmark generates two types of reports to help you understand the results:

  • Quantitative Charts: Visualizations of how the evaluation scores change over each iteration of the loop.
  • Qualitative Summaries: A narrative summary of the semantic drift, generated by an AI model.
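At its simplest, the quantitative view reduces to tracking how the overall score falls from one iteration to the next. A sketch with made-up numbers (the real values come from the EvaluationEngine's pairwise comparisons):

```python
# Hypothetical overall similarity scores for one run, indexed by iteration.
scores = [1.00, 0.91, 0.84, 0.80, 0.79]

# Per-step drift: how much similarity was lost at each iteration.
drift = [round(a - b, 2) for a, b in zip(scores, scores[1:])]
print(drift)                                          # [0.09, 0.07, 0.04, 0.01]
print(f"total drift: {scores[0] - scores[-1]:.2f}")   # total drift: 0.21
```

A flattening drift curve like this one would suggest the loop is converging on a stable (if degraded) description, while a steadily falling curve indicates ongoing degeneration.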

Getting Started

For a more detailed guide, see Getting Started in the docs.

Prerequisites

  • Python 3.8+
  • An API key for the generative models you intend to use (e.g., Google AI Studio)

Installation

  1. Clone the repository:

    git clone https://github.com/pranavagrawai/raa.git
    cd raa
  2. Install dependencies:

    pip install -r requirements.txt

Configuration

  1. API Key: Create a .env file by copying the example:

    cp .env.example .env

    Open the .env file and add your Google API Key:

    GOOGLE_API_KEY="YOUR_API_KEY_HERE"
    
  2. Benchmark Parameters: All experiment parameters are defined in a YAML file. See configs/benchmark_config.yaml for a complete example. You can create your own configuration file based on this template.
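As an illustration only, a configuration file might look something like the fragment below. The field names here are hypothetical; consult configs/benchmark_config.yaml in the repository for the actual schema.

```yaml
# Hypothetical sketch — see configs/benchmark_config.yaml for the real schema.
loop:
  mode: I-T-I          # or T-I-T
  iterations: 5        # number of round trips through the loop
  seed: data/seeds/example.png
output:
  directory: results/run_01
```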


Usage

The entire pipeline is orchestrated by src/main.py.

Full Run

To run the complete benchmark pipeline (generation, evaluation, and reporting), use the following command:

python src/main.py --config configs/benchmark_config.yaml

Evaluation Only

If you have already generated the artifacts and only want to run the evaluation, use the --eval flag:

python src/main.py --config configs/benchmark_config.yaml --eval

Reporting Only

To generate reports from existing evaluation results, use the --report flag:

python src/main.py --config configs/benchmark_config.yaml --report

Project Structure

For a deeper dive, see Project Architecture in the docs.

  • main.py: The main entry point that orchestrates the entire pipeline.
  • loop_controller.py: Manages the core recursive loop (I-T-I or T-I-T).
  • evaluation_engine.py: Performs automated, pairwise comparisons between artifacts.
  • graph_creator.py: Generates plots and charts from the evaluation ratings.
  • reporting_summary.py: Generates qualitative summaries of the evaluation results.
  • benchmark_config.py: Defines and loads the YAML configuration.

License

This project is licensed under the Apache 2.0 License — see the LICENSE file for details.

