Repository status: this repo is currently organized around an industrial paper-style artifact and experiment plan. The README will be rewritten into a user-facing RAG system README after the first paper acceptance.
- 15-Dec-2025:
  - Initial commit of ArcadeRAG README and experiment plan.
  - The Python package is still under development; no released version yet.
ArcadeRAG investigates whether a single multi-model database can support end-to-end Retrieval-Augmented Generation (RAG) workloads—documents + graph relationships + vector search + analytics—with production-relevant constraints (high ingest, fast retrieval, updates, and reproducibility). We present (i) a unified data model for RAG artifacts, (ii) an experimental methodology, and (iii) a benchmarking plan focused on comparing ArcadeDB’s multi-model approach against PostgreSQL + pgvector as the primary baseline. PostgreSQL with the pgvector extension represents a widely adopted approach for implementing vector search within existing relational infrastructures and is commonly used as a baseline in production RAG systems. ArcadeDB additionally supports fully embedded execution.
Unless stated otherwise, ArcadeDB is evaluated in embedded mode to isolate engine-level behavior and eliminate network-induced variability.
Note: Experiments are planned; results are not yet included. We do include preliminary implementation experience from large-scale pipelines (e.g., Stack Overflow-style workloads) to inform the design and measurement plan.
RAG systems are usually assembled from multiple components (document store, vector DB, optional graph DB, analytics engine). This improves specialization but increases operational complexity: duplicated data, cross-system consistency, multi-step pipelines, and fractured observability.
ArcadeRAG’s core question:
Can a native multi-model database serve as a single, coherent substrate for industrial RAG—without sacrificing retrieval quality, latency, throughput, or operational simplicity?
We focus on industrial RAG constraints:
- Continuous ingestion and updates (new documents, revised embeddings, metadata changes)
- Fast top-k vector retrieval and filtering
- Optional graph-structured knowledge (entities, links, provenance, citations, threads, conversations)
- Reproducible, scriptable experiments with clear cost/performance metrics
Non-goals:
- Proposing new ANN algorithms or embedding models
- Optimizing LLM prompting or generation strategies
- Distributed or multi-node execution
ArcadeDB is a multi-model database whose smallest persisted unit is a record, and records come in three types: Document, Vertex, and Edge. At a physical level, ArcadeDB adopts a document-centric storage model: graph vertices and edges are specialized document records sharing the same storage, indexing, and transactional infrastructure.
Key properties we leverage for RAG prototyping:
- Document model for text + metadata (schema-full or schema-less).
- Graph model for relations (entities, citations, threads, “mentions”, provenance), where vertices are documents with additional graph features, and edges connect vertices.
- Types and buckets: “Type” is close to the concept of a table; buckets are the physical storage units and can be scaled for parallelism.
- Built-in support for vector indexes and vector neighbor queries (SQL-level vector search functions and index configuration parameters).
ArcadeRAG is designed around a single principle:
Represent every artifact needed by RAG (data, structure, embeddings, logs, and evaluation traces) in a single database with one consistent ID space.
- Ingest raw corpora (documents, posts, comments, tickets, PDFs, webpages)
- Normalize into canonical Document records (text + metadata + provenance)
- Structure into an optional Graph layer (entities, links, citations, thread/author relations)
- Embed text units and store vectors alongside records
- Index vectors and run retrieval (top-k + filters)
- Assemble context windows (unit selection, deduplication, re-ranking hooks)
- Evaluate retrieval + end-to-end QA (and store traces for reproducibility)
ArcadeDB supports both:
- Embedded mode, where the database runs in-process with the application, and
- Client–server mode, accessible via REST (HTTP) or gRPC (HTTP/2).
In this work, we focus on embedded execution to study unified graph–vector query behavior without network overhead. Server-based deployment is discussed but not benchmarked.
ArcadeDB stores records as:
- Document: flexible JSON-like structure (can be schema-less or schema-full).
- Vertex: a document “with some additional features” for graph modeling; can hold arbitrary properties and embedded records, like documents.
- Edge: a connection between two vertices.
ArcadeDB’s Type concept is closest to a relational table (and can be schema-less/full, with inheritance).
In short: table-like collections can be modeled as Document types (document records, with no graph connectivity by default); when relationships and traversals are needed, Vertex and Edge types are used instead.
Each type can have one or more buckets (physical storage files). Multiple buckets per type can be used to increase parallelism for heavy ingestion.
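The type/bucket modeling above can be illustrated with a small schema setup. The statements follow the general shape of ArcadeDB's CREATE DOCUMENT/VERTEX/EDGE TYPE DDL, but the type names are hypothetical and the exact clauses (notably BUCKETS) should be verified against the target ArcadeDB release before use.

```python
# Hypothetical schema for a Stack Overflow-style corpus, expressed as
# DDL strings an embedded session would execute. Syntax approximates
# ArcadeDB's documented DDL; verify against the version in use.
schema_statements = [
    "CREATE DOCUMENT TYPE RunLog",        # table-like, no graph connectivity
    "CREATE VERTEX TYPE Post BUCKETS 8",  # graph-enabled; extra buckets for parallel ingest
    "CREATE VERTEX TYPE User",
    "CREATE EDGE TYPE AuthoredBy",        # connects User and Post vertices
]
for stmt in schema_statements:
    print(stmt)  # in practice: execute each statement via the embedded database session
```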
We focus on the retrieval layer of RAG pipelines; downstream generation quality is evaluated only optionally and is not the primary optimization target.
We will benchmark multiple retrieval modes common in production:
- Atomic text units with embeddings (e.g., one embedding per Stack Overflow question or answer)
- Top-k retrieval + metadata filters
- Deduplication and “citation” assembly
- Vector retrieval constrained by fields (time ranges, source, tenant, language, tags)
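Top-k retrieval with metadata filters has a simple exact reference implementation, useful as a correctness oracle when checking approximate index results. The cosine scoring and field names below are illustrative; the filter is applied before scoring, as a filtered vector query would do.

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0

def top_k(query_vec, items, k, pred=lambda meta: True):
    """Exact top-k over (id, vector, meta) triples, applying the
    metadata predicate before scoring."""
    scored = [(cosine(query_vec, vec), rid)
              for rid, vec, meta in items if pred(meta)]
    scored.sort(reverse=True)
    return [rid for _, rid in scored[:k]]

items = [
    ("a1", [1.0, 0.0], {"tag": "python"}),
    ("a2", [0.9, 0.1], {"tag": "java"}),
    ("a3", [0.0, 1.0], {"tag": "python"}),
]
hits = top_k([1.0, 0.0], items, k=2, pred=lambda m: m["tag"] == "python")
# hits == ["a1", "a3"]: "a2" is closer than "a3" but filtered out by tag
```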
- Retrieve seed nodes by vector similarity
- Expand neighborhood by graph edges (citations, reply chains, entity mentions)
- Aggregate evidence and assemble context windows
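The seed-then-expand pattern above can be sketched over an in-memory adjacency map; in the actual system, a graph traversal inside the database would replace the dict, and re-ranking hooks would replace the simple budget cut. All names here are illustrative.

```python
def expand(seeds, adjacency, hops=1):
    """Expand vector-retrieved seed ids along graph edges, breadth-first."""
    frontier, seen = set(seeds), set(seeds)
    for _ in range(hops):
        frontier = {nbr for node in frontier
                    for nbr in adjacency.get(node, ())} - seen
        seen |= frontier
    return seen

def assemble_context(ids, texts, budget=3):
    """Aggregate evidence: deduplicate ids, keep a bounded context window."""
    ordered = [texts[i] for i in sorted(ids) if i in texts]
    return ordered[:budget]

# Toy graph: a question with two answers, one answer carrying a comment.
adjacency = {"q1": ["a1", "a2"], "a1": ["c1"]}
texts = {"q1": "question", "a1": "answer", "a2": "other answer", "c1": "comment"}
nodes = expand(["q1"], adjacency, hops=2)   # {"q1", "a1", "a2", "c1"}
context = assemble_context(nodes, texts)
```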
We use PostgreSQL + pgvector as the primary comparator:
- widely deployed
- clear operational story
- strong SQL baseline for filters and joins
- well-known vector index options
While PostgreSQL can be extended with graph query layers (e.g., Apache AGE), such extensions emulate graph functionality on top of relational storage and do not fundamentally alter the polyglot execution model studied here; we therefore leave them out of scope.
- Recall@k (approximate NN quality)
- Hit@k (at least one relevant item in top-k, defined per dataset labeling)
- Latency (p50/p95) for:
  - vector search only
  - vector + filters
  - (optional) vector + graph expansion
- Index build time
- Index size on disk and RAM footprint
- Update cost:
  - insert throughput (docs + vectors)
  - delete/update throughput (re-embedding, document revisions)
- QA accuracy metrics appropriate to dataset (EM/F1 or task-specific)
- Context quality:
  - redundancy / duplicate rate
  - citation correctness (when provenance exists)
- Cost proxies:
  - retrieved token count vs. accuracy
  - end-to-end wall time per query
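Recall@k, Hit@k, and the latency percentiles above have straightforward reference implementations; the sketch below uses nearest-rank percentiles and plain Python, purely for illustration.

```python
import math

def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant items found in the top-k (ANN quality vs. ground truth)."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def hit_at_k(retrieved, relevant, k):
    """1 if at least one relevant item appears in the top-k, else 0."""
    return int(bool(set(retrieved[:k]) & set(relevant)))

def percentile(samples, p):
    """Nearest-rank percentile (p in [0, 100]) over latency samples."""
    s = sorted(samples)
    idx = max(0, math.ceil(p / 100 * len(s)) - 1)
    return s[idx]

lat_ms = [4.1, 5.0, 4.3, 9.8, 4.9, 5.2, 4.4, 6.0, 5.1, 4.7]
p50, p95 = percentile(lat_ms, 50), percentile(lat_ms, 95)   # 4.9, 9.8
r = recall_at_k(["a1", "a3", "a9"], ["a1", "a2"], k=3)      # 0.5
h = hit_at_k(["a1", "a3", "a9"], ["a1", "a2"], k=3)         # 1
```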
Each experiment run records:
- dataset version + preprocessing checksum
- embedding model name + settings
- index parameters
- hardware + OS + JVM settings
- full query traces and outputs
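A run manifest covering the fields above can be serialized and checksummed so every reported number is traceable to an exact configuration. Field names and the SHA-256 choice below are illustrative, not the project's fixed format.

```python
import hashlib, json, platform

def make_manifest(dataset_version, embed_model, index_params):
    """Build a reproducibility manifest and stamp it with a content checksum."""
    manifest = {
        "dataset_version": dataset_version,
        "embedding_model": embed_model,   # name + settings
        "index_params": index_params,
        "platform": platform.platform(),  # stand-in for hardware/OS/JVM details
    }
    blob = json.dumps(manifest, sort_keys=True).encode()
    manifest["checksum"] = hashlib.sha256(blob).hexdigest()
    return manifest

m = make_manifest("so-2024-06",
                  {"name": "example-embedder", "dim": 384},
                  {"m": 16, "ef": 128})
```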
All experiments are conducted on single-node machines with fixed hardware configurations. Embedded systems are evaluated in-process, while PostgreSQL is evaluated in its standard client–server configuration using local connections.
ArcadeDB supports vector indexing and “neighbors” queries via SQL, with configurable parameters such as distanceFunction, m, and ef.
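For reproducibility, those parameters can be pinned in one configuration object per experiment. The parameter names follow the ones mentioned above; the values here are illustrative placeholders, not ArcadeDB defaults.

```python
# Illustrative HNSW-style index configuration, pinned per experiment run.
# Values are placeholders, not ArcadeDB defaults.
index_config = {
    "distanceFunction": "cosine",  # similarity metric for neighbor queries
    "m": 16,                       # max graph connections per node
    "ef": 128,                     # search-time candidate list size
}
```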
Concrete SQL examples and index configurations will be added once the experimental setup is finalized.
We use the Stack Overflow data dump (approx. 5–6 GB compressed) as our primary industrial dataset. This corpus provides a rich multi-model structure:
- Text: Questions, Answers, Comments
- Graph: User interactions, badges, related posts, thread structure
- Metadata: Tags, votes, dates, history
We use the June 2024 snapshot (stackoverflow.com-*.7z from archive.org), including Posts, Users, Comments, Tags, Votes, Badges, PostLinks, and PostHistory.
For datasets with naturally atomic text units (e.g., Stack Overflow questions and answers), we do not apply text chunking. Each unit is modeled as a single record (Document or Vertex) and assigned exactly one embedding.
This choice reflects the inherent semantic coherence of the data and avoids fragmenting graph relationships (e.g., authorship, replies, citations). Chunk-based modeling for long-form documents is a common alternative but is out of scope for the current experiments.
Implementation note: we have prior experience building multi-model pipelines over Stack Overflow-style dumps (high-ingest XML → documents → graph → embeddings), which informs batching, schema design, and validation checks. (Formal measurements will be added after experiments run.)
- Embedding dependence: results can vary substantially by embedding model; we will fix models per experiment and report sensitivity where possible.
- Dataset labeling: defining “relevance” for Stack Overflow (e.g., accepted answers vs. high-voted answers) requires clear heuristics; we will document our specific relevance criteria.
- Parameter tuning fairness: we will enforce a tuning budget per system to avoid overfitting a single baseline.
- Execution model asymmetry: embedded and client–server systems differ in lifecycle and communication overhead; we explicitly evaluate embedded execution to isolate engine behavior.
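The dataset-labeling point above calls for explicit relevance heuristics. One concrete example of the kind of rule involved (accepted answer, or score above a threshold) is sketched below; the field names and threshold are illustrative, not the criteria the paper will ultimately fix.

```python
def is_relevant(answer, min_score=5):
    """Label an answer relevant if it was accepted or well-voted.
    Heuristic for Stack Overflow-style data; the threshold is illustrative."""
    return answer.get("accepted", False) or answer.get("score", 0) >= min_score

answers = [
    {"id": "a1", "accepted": True,  "score": 2},
    {"id": "a2", "accepted": False, "score": 7},
    {"id": "a3", "accepted": False, "score": 1},
]
relevant_ids = [a["id"] for a in answers if is_relevant(a)]
# relevant_ids == ["a1", "a2"]
```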
- ArcadeDB “Type” ≈ relational “Table”.
- ArcadeDB “Document Type” can represent table-like collections without graph connectivity.
- ArcadeDB “Vertex” is a document with graph features; “Edge” connects vertices.
- “Text unit” refers to the smallest semantically meaningful unit assigned an embedding (e.g., a Stack Overflow answer), not a fixed-size text chunk.