Winner: ArangoDB for Phase 1-2, with optional Virtuoso addition in Phase 3+
| Criteria | ArangoDB | SurrealDB | Virtuoso |
|---|---|---|---|
| Multi-model | ✅ Native | ✅ Native | ❌ RDF-only |
| Graph queries | ✅ AQL | ✅ SurrealQL | ✅ SPARQL |
| Elixir support | ✅ arangox | ||
| Production ready | ✅ Yes | ✅ Yes | |
| Managed hosting | ✅ €45/month | ❌ Self-host | ✅ €€€ |
| JSON storage | ✅ Native | ✅ Native | ❌ RDF triples |
| Semantic web | ✅ Native | ||
| Learning curve | Medium | Medium | Steep (RDF) |
- Multi-model done right: Document + Graph + Key-Value in one database
- AQL is intuitive: SQL-like syntax for graph traversals
- Strong Elixir integration:
arangoxlibrary mature and maintained - Production-proven: Used by companies like Cisco, Barclays
- Managed hosting: ArangoDB Oasis (€45/month includes backups, monitoring)
- JSON-first: Schema-less documents, perfect for Evidence metadata
- Cost: Not free at scale (but €45/month is reasonable)
- No native RDF: Must export to JSON-LD for semantic web
- Smaller community: Than PostgreSQL/Neo4j
// Setup
db._createDocumentCollection("investigations");
db._createDocumentCollection("claims");
db._createDocumentCollection("evidence");
db._createDocumentCollection("navigation_paths");
db._createEdgeCollection("relationships");// Insert investigation
db.investigations.insert({
_key: "uk_inflation_2023",
title: "UK Inflation Crisis 2023",
status: "published",
lead_journalist: "Sarah Johnson",
created_at: "2023-01-15"
});
// Insert claims
db.claims.insert({
_key: "claim_1",
investigation_id: "uk_inflation_2023",
text: "UK inflation reached 40-year high of 11.1% in October 2022",
claim_type: "primary",
confidence_level: 0.95,
prompt_scores: {
provenance: 95,
replicability: 100,
objective: 90,
methodology: 85,
publication: 100,
transparency: 90
}
});
db.claims.insert({
_key: "claim_2",
investigation_id: "uk_inflation_2023",
text: "Energy price cap increase was primary driver of inflation spike",
claim_type: "supporting",
confidence_level: 0.85,
prompt_scores: {
provenance: 80,
replicability: 75,
objective: 70,
methodology: 80,
publication: 85,
transparency: 75
}
});
// Insert evidence
db.evidence.insert({
_key: "ons_cpi_data",
title: "Consumer Price Index Data - October 2022",
evidence_type: "dataset",
source_url: "https://www.ons.gov.uk/economy/inflationandpriceindices",
zotero_key: "ABC123",
dublin_core: {
creator: "Office for National Statistics",
date: "2022-11-16",
type: "Dataset",
format: "CSV"
},
prompt_scores: {
provenance: 100,
replicability: 100,
objective: 95,
methodology: 95,
publication: 100,
transparency: 95
},
tags: ["inflation", "cpi", "uk", "2022"]
});
db.evidence.insert({
_key: "ofgem_price_cap",
title: "Ofgem Energy Price Cap Announcement Q4 2022",
evidence_type: "document",
source_url: "https://www.ofgem.gov.uk/price-cap",
zotero_key: "DEF456",
dublin_core: {
creator: "Office of Gas and Electricity Markets",
date: "2022-08-26",
type: "Press Release"
},
prompt_scores: {
provenance: 95,
replicability: 90,
objective: 85,
methodology: 80,
publication: 90,
transparency: 85
},
tags: ["energy", "price-cap", "ofgem", "uk"]
});
// Create relationships
db.relationships.insert({
_from: "claims/claim_1",
_to: "evidence/ons_cpi_data",
relationship_type: "supports",
weight: 1.0,
confidence: 0.95,
reasoning: "ONS CPI data directly confirms the 11.1% figure",
created_by: "sarah.johnson@example.com"
});
db.relationships.insert({
_from: "claims/claim_2",
_to: "evidence/ofgem_price_cap",
relationship_type: "supports",
weight: 0.8,
confidence: 0.85,
reasoning: "Price cap increase correlates with inflation spike timing",
created_by: "sarah.johnson@example.com"
});
db.relationships.insert({
_from: "claims/claim_1",
_to: "claims/claim_2",
relationship_type: "contextualizes",
weight: 0.7,
confidence: 0.8,
reasoning: "Energy prices provide context for overall inflation rate",
created_by: "sarah.johnson@example.com"
});FOR claim IN claims
FILTER claim._key == "claim_1"
FOR v, e, p IN 1..1 OUTBOUND claim relationships
FILTER e.relationship_type == "supports"
FILTER IS_SAME_COLLECTION("evidence", v)
RETURN {
claim: claim.text,
evidence: v.title,
weight: e.weight,
confidence: e.confidence,
reasoning: e.reasoning,
prompt_overall: (
v.prompt_scores.provenance * 0.2 +
v.prompt_scores.replicability * 0.15 +
v.prompt_scores.objective * 0.15 +
v.prompt_scores.methodology * 0.2 +
v.prompt_scores.publication * 0.15 +
v.prompt_scores.transparency * 0.15
)
}Output:
{
"claim": "UK inflation reached 40-year high of 11.1% in October 2022",
"evidence": "Consumer Price Index Data - October 2022",
"weight": 1.0,
"confidence": 0.95,
"reasoning": "ONS CPI data directly confirms the 11.1% figure",
"prompt_overall": 97.5
}// Find all nodes within 3 hops of a claim
FOR claim IN claims
FILTER claim._key == "claim_1"
FOR v, e, p IN 1..3 ANY claim relationships
RETURN DISTINCT {
node_type: IS_SAME_COLLECTION("claims", v) ? "claim" : "evidence",
node_id: v._key,
node_text: v.text || v.title,
depth: LENGTH(p.edges),
path_weight: PRODUCT(p.edges[*].weight)
}Output:
[
{
"node_type": "evidence",
"node_id": "ons_cpi_data",
"node_text": "Consumer Price Index Data - October 2022",
"depth": 1,
"path_weight": 1.0
},
{
"node_type": "claim",
"node_id": "claim_2",
"node_text": "Energy price cap increase was primary driver",
"depth": 1,
"path_weight": 0.7
},
{
"node_type": "evidence",
"node_id": "ofgem_price_cap",
"node_text": "Ofgem Energy Price Cap Announcement",
"depth": 2,
"path_weight": 0.56
}
]FOR claim IN claims
FILTER claim.investigation_id == "uk_inflation_2023"
LET supporting = (
FOR v, e IN 1..1 OUTBOUND claim relationships
FILTER e.relationship_type == "supports"
RETURN {evidence: v, weight: e.weight}
)
LET contradicting = (
FOR v, e IN 1..1 OUTBOUND claim relationships
FILTER e.relationship_type == "contradicts"
RETURN {evidence: v, weight: e.weight}
)
FILTER LENGTH(contradicting) > 0
RETURN {
claim: claim.text,
support_count: LENGTH(supporting),
contradiction_count: LENGTH(contradicting),
net_support: SUM(supporting[*].weight) - SUM(contradicting[*].weight),
contradictions: contradicting[*].evidence.title
}// Researcher perspective (prioritizes methodology, replicability)
LET researcher_weights = {
provenance: 0.15,
replicability: 0.30,
objective: 0.15,
methodology: 0.35,
publication: 0.10,
transparency: 0.20
}
FOR evidence IN evidence
FILTER evidence.investigation_id == "uk_inflation_2023"
LET researcher_score = (
evidence.prompt_scores.provenance * researcher_weights.provenance +
evidence.prompt_scores.replicability * researcher_weights.replicability +
evidence.prompt_scores.objective * researcher_weights.objective +
evidence.prompt_scores.methodology * researcher_weights.methodology +
evidence.prompt_scores.publication * researcher_weights.publication +
evidence.prompt_scores.transparency * researcher_weights.transparency
)
SORT researcher_score DESC
RETURN {
title: evidence.title,
researcher_score: researcher_score,
scores: evidence.prompt_scores
}// Requires fulltext index:
db.evidence.ensureIndex({ type: "fulltext", fields: ["title", "tags"] });
FOR doc IN FULLTEXT(evidence, "title,tags", "inflation,energy")
RETURN {
title: doc.title,
tags: doc.tags,
score: BM25(doc) // Built-in relevance scoring
}# mix.exs
{:arangox, "~> 0.5.2"}
# config/dev.exs
config :evidence_graph, EvidenceGraph.ArangoDB,
endpoints: "http://localhost:8529",
database: "evidence_graph",
username: "root",
password: "dev"
# lib/evidence_graph/arango.ex
defmodule EvidenceGraph.ArangoDB do
use Arangox
def start_link(opts) do
Arangox.start_link(
endpoints: opts[:endpoints],
database: opts[:database],
username: opts[:username],
password: opts[:password]
)
end
def find_claim(claim_id) do
query = """
FOR claim IN claims
FILTER claim._key == @claim_id
RETURN claim
"""
Arangox.transaction(
__MODULE__,
fn cursor ->
cursor
|> Arangox.cursor(query, %{claim_id: claim_id})
|> Enum.to_list()
end
)
end
def evidence_chain(claim_id, max_depth \\ 3) do
query = """
FOR claim IN claims
FILTER claim._key == @claim_id
FOR v, e, p IN 1..@max_depth ANY claim relationships
RETURN {
node: v,
edge: e,
depth: LENGTH(p.edges)
}
"""
Arangox.transaction(
__MODULE__,
fn cursor ->
cursor
|> Arangox.cursor(query, %{claim_id: claim_id, max_depth: max_depth})
|> Enum.to_list()
end
)
end
end- Modern architecture: Built with Rust, async-first
- Multi-model: Documents + Graph + Realtime
- GraphQL-like syntax: Familiar to web developers
- Embedded mode: Can run in-process (great for testing)
- Permissions system: Fine-grained access control built-in
- New/immature: Only released in 2022, smaller ecosystem
- No managed hosting: Must self-host (more ops burden)
- Elixir support lacking: No native driver, must use HTTP API
- Limited production deployments: Unproven at scale
- Smaller community: Fewer Stack Overflow answers
-- Create claim with relationship
CREATE claims:claim_1 SET
text = "UK inflation reached 11.1%",
investigation = investigations:uk_inflation_2023,
prompt_scores = {
provenance: 95,
methodology: 85
};
-- Create relationship
RELATE claims:claim_1->supports->evidence:ons_cpi_data SET
weight = 1.0,
confidence = 0.95;
-- Traverse graph
SELECT ->supports->evidence.* FROM claims:claim_1;Wait and see. Exciting tech, but too risky for Phase 1. Revisit in 2026 if community grows.
- RDF native: Best-in-class SPARQL performance
- Semantic web integration: Direct compatibility with academic repositories
- Mature: 20+ years of development
- Federated queries: Can query across multiple SPARQL endpoints
- Linked Open Data: Natural fit for cross-investigation linking
- RDF-only: Documents must be converted to triples (awkward for JSON)
- Steep learning curve: SPARQL + ontologies require semantic web knowledge
- Overkill for Phase 1: Don't need semantic web until Phase 3
- Schema rigidity: RDF schemas less flexible than JSON documents
- Elixir support: HTTP SPARQL only, no native driver
PREFIX eg: <http://evidencegraph.org/ontology#>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
SELECT ?claim ?evidence ?weight
WHERE {
?claim a eg:Claim ;
eg:text "UK inflation reached 11.1%" ;
eg:supports ?rel .
?rel eg:to ?evidence ;
eg:weight ?weight .
?evidence dc:title ?evidenceTitle .
}
ORDER BY DESC(?weight)Phase 3 addition. Store JSON-LD from day 1 in ArangoDB. Export to Virtuoso only if semantic web features become critical.
Phase 1-2: ArangoDB only
├── Store documents as JSON (flexible schema)
├── Store JSON-LD context for future RDF export
├── Use AQL for graph queries
└── €45/month ArangoDB Oasis
Phase 3+: Add Virtuoso (optional)
├── Export investigations to RDF triples
├── Add SPARQL endpoint for academic integration
├── Keep ArangoDB as primary database
└── Virtuoso as read-only semantic web layer
{
"@context": "https://schema.org",
"@type": "Claim",
"@id": "http://evidencegraph.org/claims/claim_1",
"text": "UK inflation reached 11.1%",
"dateCreated": "2023-01-15",
"author": {
"@type": "Person",
"name": "Sarah Johnson"
},
"evidence": {
"@type": "Dataset",
"@id": "http://evidencegraph.org/evidence/ons_cpi_data",
"name": "Consumer Price Index Data"
},
"prompt_scores": {
"provenance": 95,
"replicability": 100
}
}Benefits:
- Start simple (ArangoDB)
- Preserve migration path (JSON-LD)
- Add semantic web when needed
- No vendor lock-in
# Pull image
podman pull arangodb/arangodb:3.11
# Run container
podman run -d \
--name arangodb \
-p 8529:8529 \
-e ARANGO_ROOT_PASSWORD=dev \
arangodb/arangodb:3.11
# Access web UI
open http://localhost:8529
# Username: root
# Password: dev# Create database
curl -X POST http://localhost:8529/_db/_system/_api/database \
-u root:dev \
-H "Content-Type: application/json" \
-d '{"name": "evidence_graph"}'
# Create collections (see queries above)
# Use ArangoDB web UI > Collections > New Collection// In ArangoDB web UI > Queries tab
// Test 1: Simple claim lookup (should be <10ms)
FOR claim IN claims
FILTER claim._key == "claim_1"
RETURN claim
// Test 2: 3-hop traversal (should be <100ms for 100 nodes)
FOR claim IN claims
FILTER claim._key == "claim_1"
FOR v, e, p IN 1..3 ANY claim relationships
RETURN v
// Test 3: Full-text search (should be <200ms)
FOR doc IN FULLTEXT(evidence, "title,tags", "inflation")
RETURN doc
// Test 4: Aggregation (PROMPT score average)
FOR evidence IN evidence
COLLECT investigation = evidence.investigation_id
AGGREGATE avg_provenance = AVG(evidence.prompt_scores.provenance)
RETURN {investigation, avg_provenance}- Simple lookup: ~5ms
- 3-hop traversal (100 nodes): ~50ms
- Full-text search: ~100ms
- Aggregations: ~30ms
| Use Case | ArangoDB | SurrealDB | Virtuoso |
|---|---|---|---|
| Flexible JSON storage | ✅ Perfect | ✅ Good | ❌ Poor (triples) |
| Graph traversals | ✅ AQL excellent | ✅ Good | ✅ SPARQL excellent |
| Elixir integration | ✅ arangox mature | ||
| Production hosting | ✅ Managed €45 | ❌ Self-host | ✅ Managed €€€ |
| Learning curve | ✅ Medium | ✅ Medium | ❌ Steep |
| Academic integration | ✅ Native RDF | ||
| Cost | ✅ €540/year | ✅ Free (self-host) | ❌ €€€ |
| Risk | ✅ Low | ✅ Low (mature) |
Final Decision: ArangoDB for Phase 1-2, evaluate Virtuoso in Phase 3.
- ArangoDB Docs: https://www.arangodb.com/docs/stable/
- SurrealDB Docs: https://surrealdb.com/docs
- Virtuoso Docs: http://virtuoso.openlinksw.com/
- JSON-LD: https://json-ld.org/
- arangox GitHub: https://github.com/ArangoDB-Community/arangox
Last Updated: 2025-11-22 Next Review: Month 3 (benchmark decision point)