
Reproducibility as Accuracy (RaA) Benchmark

License: Apache 2.0

RaA is a framework to evaluate information fidelity in iterative multimodal transformations. It assesses how well AI models preserve semantic content as data "loops" between different modalities, quantifying drift and degeneration.

➡️ View the Full Documentation in /docs

Table of Contents

  • Overview
  • Technical Workflow
  • Evaluation Criteria
  • Reporting
  • Getting Started
  • Usage
  • Project Structure
  • License

Overview

The Reproducibility as Accuracy (RaA) Benchmark is a framework for measuring how well AI models preserve information when translating it between different formats, like text and images. At its core, RaA quantifies how much "meaning" is lost in translation.

The Analogy: A High-Tech Game of "Telephone" 📞

Imagine playing the game "Telephone," but with AI. You start with an image—say, a photo of a red car on a book.

  1. Whisper to the AI (Image → Text): The first AI model looks at the image and "whispers" a description of it: "A photorealistic image of a small, red toy car on top of a large, open book."
  2. Whisper Back (Text → Image): This text description is then given to a second AI model, which tries to draw the image based only on that description.
  3. Repeat: The newly generated image is then shown to the first model, which generates a new description, and the cycle repeats.

After several rounds, how much does the final image resemble the original? This change, or "semantic drift," is precisely what RaA is designed to measure.

Technical Workflow

The benchmark operationalizes this "game" through a configurable, automated pipeline:

  1. The Loop (I-T-I or T-I-T): The core of the benchmark is the LoopController, which manages the iterative process. It starts with a "seed" (an image or text) and passes it through a loop for a specified number of iterations.

  2. Generative Models (prompt_engine.py): At each step, the prompt_engine calls a generative model (like Google's Gemini) to perform the transformation.

  3. Automated Evaluation (evaluation_engine.py): Once the loop is complete, the EvaluationEngine assesses the drift by performing pairwise comparisons using a multimodal LLM guided by a detailed set of criteria.

By the end of the process, RaA provides a clear report on the model's ability to maintain information fidelity through repeated transformations.
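The three steps above can be sketched as a minimal Python loop. This is a control-flow illustration only: the function names (describe_image, render_image, run_iti_loop) are placeholders, not the repository's actual LoopController or prompt_engine API.

```python
def describe_image(image):
    """Image -> Text step (stand-in for a captioning model call)."""
    return f"description of {image}"

def render_image(text):
    """Text -> Image step (stand-in for an image-generation model call)."""
    return f"image drawn from '{text}'"

def run_iti_loop(seed_image, iterations):
    """Alternate I->T and T->I, collecting every artifact for later evaluation."""
    artifacts = [seed_image]
    current = seed_image
    for _ in range(iterations):
        text = describe_image(current)
        artifacts.append(text)
        current = render_image(text)
        artifacts.append(current)
    return artifacts

trace = run_iti_loop("seed.png", iterations=3)
assert len(trace) == 1 + 2 * 3  # one seed plus two artifacts per iteration
```

The key design point is that every intermediate artifact is kept, so the evaluation stage can compare any pair of steps, not just the first and last.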


Evaluation Criteria

For a detailed breakdown, see Evaluation Criteria in the docs.

The EvaluationEngine uses detailed prompts to score similarity across five key criteria:

  • Content Correspondence (The "What"): Checks if the core subjects and objects are the same in both outputs.
  • Compositional Alignment (The "How"): Assesses if the arrangement and relationships of elements are consistent.
  • Fidelity & Completeness (The "Detail"): Measures if both outputs have a similar level of detail.
  • Stylistic Congruence (The "Feel"): Evaluates if the artistic style, tone, and feel are the same.
  • Overall Semantic Intent (The "Message"): Determines if the overall meaning and purpose are preserved.
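One plausible way to hold the five scores for a single pairwise comparison is a small record type. The field names below are hypothetical (they mirror the criteria list, not the project's actual schema), and the unweighted mean is purely illustrative, not the benchmark's real aggregation rule.

```python
from dataclasses import dataclass, asdict
from statistics import mean

@dataclass
class SimilarityScores:
    """Hypothetical per-comparison record, one field per criterion."""
    content_correspondence: float   # the "what"
    compositional_alignment: float  # the "how"
    fidelity_completeness: float    # the "detail"
    stylistic_congruence: float     # the "feel"
    semantic_intent: float          # the "message"

    def overall(self) -> float:
        """Unweighted mean across the five criteria (illustrative only)."""
        return mean(asdict(self).values())

s = SimilarityScores(0.9, 0.8, 0.7, 0.6, 1.0)
print(round(s.overall(), 2))  # 0.8
```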

Reporting

For a detailed breakdown, see Reporting in the docs.

After the evaluation is complete, the RaA benchmark generates two types of reports to help you understand the results:

  • Quantitative Charts: Visualizations of how the evaluation scores change over each iteration of the loop.
  • Qualitative Summaries: A narrative summary of the semantic drift, generated by an AI model.
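At its simplest, the quantitative view reduces to tracking how the overall score falls from one iteration to the next. A sketch with made-up numbers (the real values come from the EvaluationEngine's pairwise comparisons):

```python
# Hypothetical overall similarity scores for one run, indexed by iteration.
scores = [1.00, 0.91, 0.84, 0.80, 0.79]

# Per-step drift: how much similarity was lost at each iteration.
drift = [round(a - b, 2) for a, b in zip(scores, scores[1:])]
print(drift)                                          # [0.09, 0.07, 0.04, 0.01]
print(f"total drift: {scores[0] - scores[-1]:.2f}")   # total drift: 0.21
```

A flattening drift curve like this one would suggest the loop is converging on a stable (if degraded) description, while a steadily falling curve indicates ongoing degeneration.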

Getting Started

For a more detailed guide, see Getting Started in the docs.

Prerequisites

  • Python 3.8+
  • An API key for the generative models you intend to use (e.g., Google AI Studio)

Installation

  1. Clone the repository:

    git clone https://github.com/pranavagrawai/raa.git
    cd raa
  2. Install dependencies:

    pip install -r requirements.txt

Configuration

  1. API Key: Create a .env file by copying the example:

    cp .env.example .env

    Open the .env file and add your Google API Key:

    GOOGLE_API_KEY="YOUR_API_KEY_HERE"
    
  2. Benchmark Parameters: All experiment parameters are defined in a YAML file. See configs/benchmark_config.yaml for a complete example. You can create your own configuration file based on this template.
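As an illustration only, a configuration file might look something like the fragment below. The field names here are hypothetical; consult configs/benchmark_config.yaml in the repository for the actual schema.

```yaml
# Hypothetical sketch — see configs/benchmark_config.yaml for the real schema.
loop:
  mode: I-T-I          # or T-I-T
  iterations: 5        # number of round trips through the loop
  seed: data/seeds/example.png
output:
  directory: results/run_01
```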


Usage

The entire pipeline is orchestrated by src/main.py.

Full Run

To run the complete benchmark pipeline (generation, evaluation, and reporting), use the following command:

python src/main.py --config configs/benchmark_config.yaml

Evaluation Only

If you have already generated the artifacts and only want to run the evaluation, use the --eval flag:

python src/main.py --config configs/benchmark_config.yaml --eval

Reporting Only

To generate reports from existing evaluation results, use the --report flag:

python src/main.py --config configs/benchmark_config.yaml --report

Project Structure

For a deeper dive, see Project Architecture in the docs.

  • main.py: The main entry point that orchestrates the entire pipeline.
  • loop_controller.py: Manages the core recursive loop (I-T-I or T-I-T).
  • evaluation_engine.py: Performs automated, pairwise comparisons between artifacts.
  • graph_creator.py: Generates plots and charts from the evaluation ratings.
  • reporting_summary.py: Generates qualitative summaries of the evaluation results.
  • benchmark_config.py: Defines and loads the YAML configuration.

License

This project is licensed under the Apache 2.0 License — see the LICENSE file for details.

