🧠 SuperMemory-VQA: An Egocentric Visual Question Answering Benchmark for Long-Horizon Memory

Official repository for SuperMemory-VQA. The paper is currently under review at NeurIPS 2026 (Datasets and Benchmarks Track).

📹 52.9 Hours of Video	🧩 4,853 Grounded Q&As	🛡️ Hallucination Robust	🕶️ Rich Sensor Modalities
Everyday activities recorded via Gen 1 Meta Aria Glasses	Human-in-the-loop verified episodic, conversational, and procedural QA	Multiple-choice including ordered vague & "unanswerable" options	Synchronized RGB, audio transcript, eye gaze, IMU, and SLAM

🌟 Overview

As AI agents integrate into Augmented Reality (AR) glasses, they have the potential to act as personalized memory assistants—helping users locate misplaced objects, recall spoken details, and reconstruct daily timelines. However, existing datasets predominantly focus on short-term perception or action recognition.

SuperMemory-VQA is a multi-modal egocentric Visual Question Answering (VQA) benchmark designed around questions people actually ask memory assistants. Built from continuous recordings that span hours (and up to two weeks), it tests models along five key dimensions:

Natural Conversational Phrasing: Context-dependent queries instead of predictable templates.
Long-Horizon Context: Multi-hour recordings that test the boundaries of context scaling.
Dense Multi-Evidence Retrieval: Questions requiring linking disjoint moments across vast temporal gaps (e.g., matching a spoken plan to later visual results).
Grounded Multi-Modal Reasoning: Seamlessly aligning video, audio transcript, gaze tracking, motion, and spatial context.
Epistemic Calibration & Abstention: Each multiple-choice question offers ordered answer options (Correct > Vague > Wrong > Unanswerable) to test whether a model recognizes when it has sufficient evidence or hallucinates an answer instead.
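
The ordered option scheme can be pictured as a small data structure plus a per-tier tally. The sketch below is illustrative only: the `Tier` enum, the `MCQ` dataclass, and `tier_breakdown` are hypothetical names, not the benchmark's released schema or its official metric.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    """Option tiers named in this README; the enum itself is hypothetical."""
    CORRECT = "correct"
    VAGUE = "vague"
    WRONG = "wrong"
    UNANSWERABLE = "unanswerable"


@dataclass(frozen=True)
class MCQ:
    question: str
    options: dict[str, Tier]  # option letter -> tier of that option


def tier_breakdown(questions: list[MCQ], picks: list[str]) -> dict[Tier, float]:
    """Return the fraction of model picks that landed in each tier."""
    if len(questions) != len(picks):
        raise ValueError("need exactly one pick per question")
    total = len(picks)
    if total == 0:
        return {tier: 0.0 for tier in Tier}
    counts = Counter(q.options[p] for q, p in zip(questions, picks))
    return {tier: counts.get(tier, 0) / total for tier in Tier}


# Illustrative data only, not taken from the dataset.
example = MCQ(
    "Where did I leave my keys?",
    {"A": Tier.CORRECT, "B": Tier.VAGUE, "C": Tier.WRONG, "D": Tier.UNANSWERABLE},
)
breakdown = tier_breakdown([example, example], ["A", "D"])
```

Under this reading, a large share of `UNANSWERABLE` picks on answerable questions would suggest over-abstention, while `WRONG` picks would point toward hallucination.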

📂 Dataset Tasks

SuperMemory-VQA evaluates agents across six user-validated memory categories reflecting actual human memory needs:

📍 Object & Location Memory: Recalling the last known position of an object, its state modifications, and its spatial trajectory over time.
💬 Conversational Memory: Retrieving spoken commitments, instruction corrections, deferred answers, and dialogue states from audio transcripts.
👁️ Visual Scene Recall: Retrieving specific fine-grained visual details (e.g., text on screens, manual ingredients, visible landmarks).
🔗 In-Context Retrieval: Synthesizing current visual cues with prior facts and associations to navigate complex relational memory tasks.
⏱️ Timeline Reconstruction: Chronologically sequencing disjoint events to evaluate temporal and procedural episodic memory.
🎯 Intent Recall: Recovering stated or implied future goals, reminders, and prospective action intentions.

📊 Comparison with Existing Benchmarks

Unlike typical egocentric benchmarks that focus on short clips or simple action labels, SuperMemory-VQA provides a comprehensive environment for long-context multi-evidence retrieval.

Dataset	Focus	Hours	Context	QAs	Multi-Evidence	Natural Queries	Evaluation Type
EPIC-KITCHENS-100	Action Recognition	100	≈ 8.5 min	--	--	No	Verb-noun labels & narrations
Ego4D	Ego Activities	3,670	≈ 23 min	--	--	No	Temporal/spatial localization
EgoSchema	Long Video QA	> 250	3 min	5,063	Single	No	5-way MCQ over localized clips
EgoLife	Life Assistant	300	> 1 h	6,000	Limited	No	Generic MCQ with evidence timestamps
SuperMemory-VQA (Ours)	SuperMemory	52.9	> 1 h	4,853	34%	Yes	Ordered MCQs with time spans

⚙️ Setup and Installation

1. Prerequisites

Ensure you have Python 3.10+ and Node.js (for building the Svelte frontend UI) installed.
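
As a quick sanity check before installing, the two prerequisites can be verified from Python itself. This is a minimal sketch using only the standard library; `check_prerequisites` is a hypothetical helper, not part of the repository, and it assumes the Node.js executable is named `node` on your PATH.

```python
import shutil
import sys


def check_prerequisites(min_python: tuple[int, int] = (3, 10)) -> dict[str, bool]:
    """Report whether the interpreter version and Node.js availability look OK."""
    return {
        "python": sys.version_info[:2] >= min_python,
        "node": shutil.which("node") is not None,  # assumes the binary is called "node"
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'ok' if ok else 'missing or too old'}")
```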

2. Clone and Install Dependencies

git clone https://github.com/AIoT-MLSys-Lab/supermemory-vqa.git
cd supermemory-vqa

Create a virtual environment and install Python requirements:

python -m venv venv
# On Windows (PowerShell):
.\venv\Scripts\Activate.ps1
# On macOS/Linux:
source venv/bin/activate

pip install -r requirements.txt

3. Environment Configuration (`.env`)

Copy the template .env.example to create your own configuration file:

# On Windows:
copy .env.example .env
# On macOS/Linux:
cp .env.example .env

Open .env and set your Gemini API Key:

GEMINI_API_KEY=your_actual_gemini_api_key_here
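
To confirm the key is picked up before launching a long run, a small check like the following can help. It is a sketch under stated assumptions: the parser is a hand-rolled, standard-library-only reader for simple `KEY=VALUE` lines, and both `load_env_file` and `require_gemini_key` are hypothetical helpers. The repository's own loading mechanism may differ.

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env") -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    env_path = Path(path)
    if not env_path.is_file():
        return values
    for raw in env_path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def require_gemini_key(path: str = ".env") -> str:
    """Return GEMINI_API_KEY from the process environment or the .env file."""
    key = os.environ.get("GEMINI_API_KEY") or load_env_file(path).get("GEMINI_API_KEY", "")
    if not key or key == "your_actual_gemini_api_key_here":
        raise RuntimeError("GEMINI_API_KEY is missing or still set to the placeholder")
    return key
```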

🚀 Running the Annotation Pipeline

SuperMemory-VQA includes a scalable, human-in-the-loop annotation pipeline organized into two sequential stages.

🎬 Stage 1: Dense Sequential Captioning (`stage1_v2`)

This stage processes video chunks sequentially in chronological order. To maintain narrative consistency across chunks and video boundaries, it feeds previous caption summaries back to the LLM. To reduce cost and stay within context limits, it uses Gemini's explicit context caching for the video inputs and system instructions, leaving only the sliding-window text history uncached.

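To run Stage 1 V2 (shown in PowerShell syntax; on macOS/Linux, replace each trailing backtick with a backslash and use forward slashes in the config path):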

python -m src.pipeline.stage1_v2 "<video_folder>" `
  --output "<narration_output_folder>" `
  --config "src\pipeline\conf\pipeline_v2.yaml" `
  -O stage1_model=gemini-3-flash-preview `
  -O stage1_fallback_model=gemini-3-flash-preview `
  --run-id "<run_id>"

Key Options:

--output or -o: Folder to save the generated caption narrations (default: saves alongside source videos).
--model or -m: Specify a custom Gemini model to use.
--max-context or -c: Set the maximum number of previous chunks in the sliding context window (default: 30).
--config: Load the Hydra/OmegaConf pipeline config, for example src\pipeline\conf\pipeline_v2.yaml.
--config-override or -O: Repeatable Hydra-style override, for example -O chunk_duration=60.
--run-id: Optional stable identifier used in manifests and run-state logs.
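
The sliding context window controlled by `--max-context` can be pictured with a bounded queue. The sketch below is not the pipeline's code: `caption_chunks` and the `caption_fn` callable (a stand-in for the Gemini call) are hypothetical names used only to illustrate the idea.

```python
from collections import deque
from typing import Callable, Sequence


def caption_chunks(
    chunks: Sequence[str],
    caption_fn: Callable[[str, list[str]], str],
    max_context: int = 30,
) -> list[str]:
    """Caption chunks in order, passing at most `max_context` prior summaries."""
    history: deque[str] = deque(maxlen=max_context)  # oldest entries drop off automatically
    captions: list[str] = []
    for chunk in chunks:
        caption = caption_fn(chunk, list(history))  # history is the uncached text context
        captions.append(caption)
        history.append(caption)
    return captions
```

With `max_context=2`, for example, the fourth chunk sees only the summaries of the second and third chunks.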

📝 Stage 2: Question Generation & Verification (`stage2_loop_concurrent`)

This stage reads the narrations generated in Stage 1, creates a global Super Ledger of events, drafts challenging memory Q/A pairs, and submits them to an automated verifier loop. The verifier checks each pair for factual correctness, causality, and naturalness.
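
The draft-then-verify loop described above can be sketched generically. This is an illustration of the control flow only, assuming a `draft` callable that proposes candidates and a `verify` callable that stands in for the correctness, causality, and naturalness checks; `generate_and_verify` is a hypothetical name and does not reflect the actual module internals.

```python
from typing import Callable, Iterable


def generate_and_verify(
    draft: Callable[[int], Iterable[str]],
    verify: Callable[[str], bool],
    target: int,
    max_loops: int,
) -> list[str]:
    """Keep drafting until `target` items pass `verify` or `max_loops` is reached."""
    accepted: list[str] = []
    for _ in range(max_loops):
        needed = target - len(accepted)
        if needed <= 0:
            break
        for candidate in draft(needed):  # ask only for what is still missing
            if len(accepted) < target and verify(candidate):
                accepted.append(candidate)
    return accepted
```

Bounding the loop count keeps a strict verifier from causing unbounded regeneration when the target cannot be met.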

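To run Stage 2 Loop Concurrent (shown in PowerShell syntax; on macOS/Linux, replace each trailing backtick with a backslash and use forward slashes in the config path):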

python -m src.pipeline.stage2_loop_concurrent `
  "<narration_folder>" `
  "<video_folder>" `
  --output "<qa_output_folder>" `
  --config "src\pipeline\conf\pipeline_v2.yaml" `
  -O stage2_planner_model=gemini-3-flash-preview `
  -O stage2_retriever_model=gemini-3-flash-preview `
  -O stage2_verifier_model=gemini-3-flash-preview `
  -O stage2_enhancer_model=gemini-3-flash-preview `
  -O stage2_retriever_fallback_model=gemini-3-flash-preview `
  -O stage2_verifier_fallback_model=gemini-3-flash-preview `
  -O stage2_enhancer_fallback_model=gemini-3-flash-preview `
  --run-id "<run_id>"

Key Options:

--output or -o: Folder to save final Q/A pairs.
--planner-model: Custom Gemini model for question generation.
--verifier-model: Custom Gemini model for verification checks.
--target or -t: Target number of QA annotations.
--qa-per-minute or -qpm: Desired density of QA generation per minute of video.
--global-qa-ratio or -g: Proportion of global multi-evidence questions vs. localized ones (default: 0.5).
--generate-only: Run the generator only, bypassing the automated verifier loop.
--ledger-only or -l: Compile the Super Ledger event database only.
--max-loops: Maximum loops of verification/re-generation to execute.
--force or -f: Force reprocessing, ignoring previous caches.
--config: Load the Hydra/OmegaConf pipeline config, for example src\pipeline\conf\pipeline_v2.yaml.
--config-override or -O: Repeatable Hydra-style override, for example -O qa_batch_size=20.
--run-id: Optional stable identifier used in manifests and run-state logs.

🎮 Starting the Annotation Review UI

Once the annotation pipeline generates Q/A pairs, you can inspect, verify, and refine them using our interactive review dashboard.

The application uses a Flask backend and a Svelte frontend. On startup, the backend automatically detects Node.js and builds the Svelte app on the fly.

Option A: Platform-Specific Startup Scripts (Recommended)

Windows (PowerShell):
```
.\start_server.ps1
```

macOS / Linux (Bash):

chmod +x start_server.sh
./start_server.sh

These scripts automate virtual environment checks, load environment variables from .env, install missing Python packages, build/update Svelte assets, and start the local Flask server.

Option B: Manual Startup

If you prefer running directly:

python app.py

Accessing the Interface

Once the server starts up, open your web browser and navigate to:

💻 Main UI: http://localhost:5000
📖 Interactive App & Docs: http://localhost:5000/app
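
Since the frontend is built at startup, the server may take a moment to become reachable. A small polling helper such as the one below can tell you when the port is accepting connections. It is a sketch: `wait_for_port` is a hypothetical helper, and the default host and port simply mirror the URLs listed above.

```python
import socket
import time


def wait_for_port(host: str = "localhost", port: int = 5000, timeout: float = 30.0) -> bool:
    """Poll until a TCP connection to host:port succeeds or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(0.5)  # brief pause before the next attempt
```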

📄 Citation

If you use our dataset or code in your research, please cite our paper (currently under review at NeurIPS 2026):

@inproceedings{supermemory_vqa2026,
  title={SuperMemory-VQA: An Egocentric Visual Question Answering Benchmark for Long-Horizon Memory},
  author={Anonymous Authors},
  booktitle={NeurIPS 2026 (Evaluations & Datasets Track)},
  year={2026},
  note={Under review}
}
