Repository status: this repo is currently organized around an industrial paper-style artifact and experiment plan. The README will be rewritten into a user-facing RAG system README after the first paper acceptance.
- 15-Dec-2025:
  - Initial commit of ArcadeRAG README and experiment plan.
  - The Python package is still under development; no released version yet.
ArcadeRAG investigates whether a single multi-model database can support end-to-end Retrieval-Augmented Generation (RAG) workloads—documents + graph relationships + vector search + analytics—with production-relevant constraints (high ingest, fast retrieval, updates, and reproducibility). We present (i) a unified data model for RAG artifacts, (ii) an experimental methodology, and (iii) a benchmarking plan focused on comparing ArcadeDB’s multi-model approach against PostgreSQL + pgvector as the primary baseline. PostgreSQL with the pgvector extension represents a widely adopted approach for implementing vector search within existing relational infrastructures and is commonly used as a baseline in production RAG systems. ArcadeDB additionally supports fully embedded execution.
Unless stated otherwise, ArcadeDB is evaluated in embedded mode to isolate engine-level behavior and eliminate network-induced variability.
Note: Experiments are planned; results are not yet included. We do include preliminary implementation experience from large-scale pipelines (e.g., Stack Overflow-style workloads) to inform the design and measurement plan.
RAG systems are usually assembled from multiple components (document store, vector DB, optional graph DB, analytics engine). This improves specialization but increases operational complexity: duplicated data, cross-system consistency, multi-step pipelines, and fractured observability.
ArcadeRAG’s core question:
Can a native multi-model database serve as a single, coherent substrate for industrial RAG—without sacrificing retrieval quality, latency, throughput, or operational simplicity?
We focus on industrial RAG constraints:
- Continuous ingestion and updates (new documents, revised embeddings, metadata changes)
- Fast top-k vector retrieval and filtering
- Optional graph-structured knowledge (entities, links, provenance, citations, threads, conversations)
- Reproducible, scriptable experiments with clear cost/performance metrics
Non-goals:
- Proposing new ANN algorithms or embedding models
- Optimizing LLM prompting or generation strategies
- Distributed or multi-node execution
ArcadeDB is a multi-model database whose smallest persisted unit is a record, and records come in three types: Document, Vertex, and Edge. At a physical level, ArcadeDB adopts a document-centric storage model: graph vertices and edges are specialized document records sharing the same storage, indexing, and transactional infrastructure.
Key properties we leverage for RAG prototyping:
- Document model for text + metadata (schema-full or schema-less).
- Graph model for relations (entities, citations, threads, “mentions”, provenance), where vertices are documents with additional graph features, and edges connect vertices.
- Types and buckets: “Type” is close to the concept of a table; buckets are the physical storage units and can be scaled for parallelism.
- Built-in support for vector indexes and vector neighbor queries (SQL-level vector search functions and index configuration parameters).
ArcadeRAG is designed around a single principle:
Represent every artifact needed by RAG (data, structure, embeddings, logs, and evaluation traces) in a single database with one consistent ID space.
- Ingest raw corpora (documents, posts, comments, tickets, PDFs, webpages)
- Normalize into canonical Document records (text + metadata + provenance)
- Structure into an optional Graph layer (entities, links, citations, thread/author relations)
- Embed text units and store vectors alongside records
- Index vectors and run retrieval (top-k + filters)
- Assemble context windows (unit selection, deduplication, re-ranking hooks)
- Evaluate retrieval + end-to-end QA (and store traces for reproducibility)
ArcadeDB supports both:
- Embedded mode, where the database runs in-process with the application, and
- Client–server mode, accessible via REST (HTTP) or gRPC (HTTP/2).
In this work, we focus on embedded execution to study unified graph–vector query behavior without network overhead. Server-based deployment is discussed but not benchmarked.
ArcadeDB stores records as:
- Document: flexible JSON-like structure (can be schema-less or schema-full).
- Vertex: a document “with some additional features” for graph modeling; can hold arbitrary properties and embedded records, like documents.
- Edge: a connection between two vertices.
ArcadeDB’s Type concept is closest to a relational table (and can be schema-less/full, with inheritance).
In short: table-like collections can be modeled as Document types (document records, with no graph connectivity by default); when relationships and traversals are needed, Vertex and Edge types are used instead.
Each type can have one or more buckets (physical storage files). Multiple buckets per type can be used to increase parallelism for heavy ingestion.
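The type/bucket modeling above can be illustrated with a small schema setup. The statements follow the general shape of ArcadeDB's CREATE DOCUMENT/VERTEX/EDGE TYPE DDL, but the type names are hypothetical and the exact clauses (notably BUCKETS) should be verified against the target ArcadeDB release before use.

```python
# Hypothetical schema for a Stack Overflow-style corpus, expressed as
# DDL strings an embedded session would execute. Syntax approximates
# ArcadeDB's documented DDL; verify against the version in use.
schema_statements = [
    "CREATE DOCUMENT TYPE RunLog",        # table-like, no graph connectivity
    "CREATE VERTEX TYPE Post BUCKETS 8",  # graph-enabled; extra buckets for parallel ingest
    "CREATE VERTEX TYPE User",
    "CREATE EDGE TYPE AuthoredBy",        # connects User and Post vertices
]
for stmt in schema_statements:
    print(stmt)  # in practice: execute each statement via the embedded database session
```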
We focus on the retrieval layer of RAG pipelines; downstream generation quality is evaluated only optionally and is not the primary optimization target.
We will benchmark multiple retrieval modes common in production:
- Atomic text units with embeddings (e.g., one embedding per Stack Overflow question or answer)
- Top-k retrieval + metadata filters
- Deduplication and “citation” assembly
- Vector retrieval constrained by fields (time ranges, source, tenant, language, tags)
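Top-k retrieval with metadata filters has a simple exact reference implementation, useful as a correctness oracle when checking approximate index results. The cosine scoring and field names below are illustrative; the filter is applied before scoring, as a filtered vector query would do.

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0

def top_k(query_vec, items, k, pred=lambda meta: True):
    """Exact top-k over (id, vector, meta) triples, applying the
    metadata predicate before scoring."""
    scored = [(cosine(query_vec, vec), rid)
              for rid, vec, meta in items if pred(meta)]
    scored.sort(reverse=True)
    return [rid for _, rid in scored[:k]]

items = [
    ("a1", [1.0, 0.0], {"tag": "python"}),
    ("a2", [0.9, 0.1], {"tag": "java"}),
    ("a3", [0.0, 1.0], {"tag": "python"}),
]
hits = top_k([1.0, 0.0], items, k=2, pred=lambda m: m["tag"] == "python")
# hits == ["a1", "a3"]: "a2" is closer than "a3" but filtered out by tag
```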
- Retrieve seed nodes by vector similarity
- Expand neighborhood by graph edges (citations, reply chains, entity mentions)
- Aggregate evidence and assemble context windows
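The seed-then-expand pattern above can be sketched over an in-memory adjacency map; in the actual system, a graph traversal inside the database would replace the dict, and re-ranking hooks would replace the simple budget cut. All names here are illustrative.

```python
def expand(seeds, adjacency, hops=1):
    """Expand vector-retrieved seed ids along graph edges, breadth-first."""
    frontier, seen = set(seeds), set(seeds)
    for _ in range(hops):
        frontier = {nbr for node in frontier
                    for nbr in adjacency.get(node, ())} - seen
        seen |= frontier
    return seen

def assemble_context(ids, texts, budget=3):
    """Aggregate evidence: deduplicate ids, keep a bounded context window."""
    ordered = [texts[i] for i in sorted(ids) if i in texts]
    return ordered[:budget]

# Toy graph: a question with two answers, one answer carrying a comment.
adjacency = {"q1": ["a1", "a2"], "a1": ["c1"]}
texts = {"q1": "question", "a1": "answer", "a2": "other answer", "c1": "comment"}
nodes = expand(["q1"], adjacency, hops=2)   # {"q1", "a1", "a2", "c1"}
context = assemble_context(nodes, texts)
```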
We use PostgreSQL + pgvector as the primary comparator:
- widely deployed
- clear operational story
- strong SQL baseline for filters and joins
- well-known vector index options
While PostgreSQL can be extended with graph query layers (e.g., Apache AGE), such extensions emulate graph functionality on top of relational storage and do not fundamentally alter the polyglot execution model studied here; we therefore leave them out of scope.
- Recall@k (approximate NN quality)
- Hit@k (at least one relevant item in top-k, defined per dataset labeling)
- Latency (p50/p95) for:
  - vector search only
  - vector + filters
  - (optional) vector + graph expansion
- Index build time
- Index size on disk and RAM footprint
- Update cost:
  - insert throughput (docs + vectors)
  - delete/update throughput (re-embedding, document revisions)
- QA accuracy metrics appropriate to dataset (EM/F1 or task-specific)
- Context quality:
  - redundancy / duplicate rate
  - citation correctness (when provenance exists)
- Cost proxies:
  - retrieved token count vs. accuracy
  - end-to-end wall time per query
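Recall@k, Hit@k, and the latency percentiles above have straightforward reference implementations; the sketch below uses nearest-rank percentiles and plain Python, purely for illustration.

```python
import math

def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant items found in the top-k (ANN quality vs. ground truth)."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def hit_at_k(retrieved, relevant, k):
    """1 if at least one relevant item appears in the top-k, else 0."""
    return int(bool(set(retrieved[:k]) & set(relevant)))

def percentile(samples, p):
    """Nearest-rank percentile (p in [0, 100]) over latency samples."""
    s = sorted(samples)
    idx = max(0, math.ceil(p / 100 * len(s)) - 1)
    return s[idx]

lat_ms = [4.1, 5.0, 4.3, 9.8, 4.9, 5.2, 4.4, 6.0, 5.1, 4.7]
p50, p95 = percentile(lat_ms, 50), percentile(lat_ms, 95)   # 4.9, 9.8
r = recall_at_k(["a1", "a3", "a9"], ["a1", "a2"], k=3)      # 0.5
h = hit_at_k(["a1", "a3", "a9"], ["a1", "a2"], k=3)         # 1
```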
Each experiment run records:
- dataset version + preprocessing checksum
- embedding model name + settings
- index parameters
- hardware + OS + JVM settings
- full query traces and outputs
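A run manifest covering the fields above can be serialized and checksummed so every reported number is traceable to an exact configuration. Field names and the SHA-256 choice below are illustrative, not the project's fixed format.

```python
import hashlib, json, platform

def make_manifest(dataset_version, embed_model, index_params):
    """Build a reproducibility manifest and stamp it with a content checksum."""
    manifest = {
        "dataset_version": dataset_version,
        "embedding_model": embed_model,   # name + settings
        "index_params": index_params,
        "platform": platform.platform(),  # stand-in for hardware/OS/JVM details
    }
    blob = json.dumps(manifest, sort_keys=True).encode()
    manifest["checksum"] = hashlib.sha256(blob).hexdigest()
    return manifest

m = make_manifest("so-2024-06",
                  {"name": "example-embedder", "dim": 384},
                  {"m": 16, "ef": 128})
```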
All experiments are conducted on single-node machines with fixed hardware configurations. Embedded systems are evaluated in-process, while PostgreSQL is evaluated in its standard client–server configuration using local connections.
ArcadeDB supports vector indexing and “neighbors” queries via SQL, with configurable parameters such as distanceFunction, m, and ef.
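For reproducibility, those parameters can be pinned in one configuration object per experiment. The parameter names follow the ones mentioned above; the values here are illustrative placeholders, not ArcadeDB defaults.

```python
# Illustrative HNSW-style index configuration, pinned per experiment run.
# Values are placeholders, not ArcadeDB defaults.
index_config = {
    "distanceFunction": "cosine",  # similarity metric for neighbor queries
    "m": 16,                       # max graph connections per node
    "ef": 128,                     # search-time candidate list size
}
```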
Concrete SQL examples and index configurations will be added once the experimental setup is finalized.
We use the Stack Overflow data dump (approx. 5–6 GB compressed) as our primary industrial dataset. This corpus provides a rich multi-model structure:
- Text: Questions, Answers, Comments
- Graph: User interactions, badges, related posts, thread structure
- Metadata: Tags, votes, dates, history
We use the June 2024 snapshot (stackoverflow.com-*.7z from archive.org), including Posts, Users, Comments, Tags, Votes, Badges, PostLinks, and PostHistory.
For datasets with naturally atomic text units (e.g., Stack Overflow questions and answers), we do not apply text chunking. Each unit is modeled as a single record (Document or Vertex) and assigned exactly one embedding.
This choice reflects the inherent semantic coherence of the data and avoids fragmenting graph relationships (e.g., authorship, replies, citations). Chunk-based modeling for long-form documents is a common alternative but is out of scope for the current experiments.
Implementation note: we have prior experience building multi-model pipelines over Stack Overflow-style dumps (high-ingest XML → documents → graph → embeddings), which informs batching, schema design, and validation checks. (Formal measurements will be added after experiments run.)
- Embedding dependence: results can vary substantially by embedding model; we will fix models per experiment and report sensitivity where possible.
- Dataset labeling: defining “relevance” for Stack Overflow (e.g., accepted answers vs. high-voted answers) requires clear heuristics; we will document our specific relevance criteria.
- Parameter tuning fairness: we will enforce a tuning budget per system to avoid overfitting a single baseline.
- Execution model asymmetry: embedded and client–server systems differ in lifecycle and communication overhead; we explicitly evaluate embedded execution to isolate engine behavior.
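The dataset-labeling point above calls for explicit relevance heuristics. One concrete example of the kind of rule involved (accepted answer, or score above a threshold) is sketched below; the field names and threshold are illustrative, not the criteria the paper will ultimately fix.

```python
def is_relevant(answer, min_score=5):
    """Label an answer relevant if it was accepted or well-voted.
    Heuristic for Stack Overflow-style data; the threshold is illustrative."""
    return answer.get("accepted", False) or answer.get("score", 0) >= min_score

answers = [
    {"id": "a1", "accepted": True,  "score": 2},
    {"id": "a2", "accepted": False, "score": 7},
    {"id": "a3", "accepted": False, "score": 1},
]
relevant_ids = [a["id"] for a in answers if is_relevant(a)]
# relevant_ids == ["a1", "a2"]
```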
- ArcadeDB “Type” ≈ relational “Table”.
- ArcadeDB “Document Type” can represent table-like collections without graph connectivity.
- ArcadeDB “Vertex” is a document with graph features; “Edge” connects vertices.
- “Text unit” refers to the smallest semantically meaningful unit assigned an embedding (e.g., a Stack Overflow answer), not a fixed-size text chunk.