GraphBuilder-RAG: A Hybrid Knowledge Graph Construction and Retrieval-Augmented Generation Framework for Trustworthy AI
Workshop Submission for GLOW @ WWW 2026
Graph-enhanced LLMs for Trustworthy Web Data Management
GraphBuilder-RAG is a novel framework that addresses critical trustworthiness challenges in Large Language Models (LLMs) through integrated knowledge graph construction, hybrid retrieval, and graph-based verification. The system combines automated KG extraction and fusion from heterogeneous web and domain data with an update-resilient hybrid KG+RAG pipeline maintained by autonomous agents.
- Automated Multi-Stage KG Construction Pipeline: Heterogeneous data ingestion → normalization → extraction → validation → fusion with provenance tracking
- Hybrid Retrieval Architecture: Combines FAISS vector search + Neo4j graph queries with confidence-based fusion
- GraphVerify Module: Graph-grounded hallucination detection that verifies LLM claims against knowledge graph edges
- NL2Cypher Interface: LLM-powered natural language to Cypher query translation with safety validation
- Agentic Maintenance System: Self-maintaining KG through ReverifyAgent, ConflictResolverAgent, and SchemaSuggestorAgent
- End-to-End Provenance: Evidence spans, source tracking, edge versioning, and confidence scoring throughout
Large Language Models exhibit critical failure modes that undermine their trustworthiness in real-world applications:
- Hallucination: Generation of factually incorrect or unsupported claims
- Missing Relationships: Inability to reason over structured multi-hop relationships
- Unverifiable Outputs: Lack of citations and evidence tracing
- Knowledge Staleness: Inability to maintain accuracy as information evolves
- Opacity: No explanation of reasoning paths or confidence levels
These limitations prevent deployment in high-stakes domains (medical, legal, financial) where verifiable accuracy is mandatory.
Knowledge graphs provide:
- Structural Grounding: Explicit entity-relationship representations
- Multi-hop Reasoning: Traversable relationship paths
- Verification Substrate: Ground truth for claim validation
- Explainability: Traceable reasoning chains
- Updateability: Localized corrections without retraining
- Automated KG Construction: Build a pipeline that extracts, validates, and fuses knowledge from heterogeneous sources without manual curation
- Hybrid Retrieval: Combine semantic (vector) and structural (graph) retrieval for comprehensive context
- Verified Generation: Ensure LLM outputs are grounded in verified graph knowledge
- Provenance Tracking: Maintain full evidence chains from source documents to final answers
- Self-Maintenance: Deploy agents that keep the KG accurate and consistent over time
GraphBuilder-RAG directly addresses 5 primary GLOW topics while implicitly covering all 10:
| GLOW Topic | Our Contribution | System Component |
|---|---|---|
| Topic #6: Automated KG Construction & Validation | Multi-stage pipeline: ingestion → extraction → validation → fusion | IngestionService, ExtractionService, ValidationEngine, FusionService |
| Topic #7: Graph-Based Prompting + RAG | Hybrid retrieval with graph-augmented prompt building | HybridRetrievalService, QueryService |
| Topic #2: Hallucination Detection & Mitigation | GraphVerify module for claim verification | GraphVerifyEngine |
| Topic #5: NL2Cypher Interface | LLM-powered natural language → Cypher translation | NL2QueryEngine in QueryService |
| Topic #3: Explainability & Provenance | Evidence spans, source tracking, confidence scores | EvidenceSpan, edge versioning, audit logs |
- Topic #1: Graph-based retrieval via Neo4j subgraph traversal
- Topic #4: Property graphs with rich node/edge properties and versioning
- Topic #8: Trustworthiness metrics (confidence, fairness, provenance completeness)
- Topic #9: Domain-agnostic architecture applicable to medical, finance, legal domains
- Topic #10: Agentic AI through ReverifyAgent, ConflictResolverAgent, SchemaSuggestorAgent
✅ Requirement 1: Automated KG extraction & fusion from heterogeneous Web/domain data
Implemented via:
- Multi-source ingestion (URLs, files, APIs)
- Normalization layer for format unification
- Dual extraction (table-based + LLM-based)
- Multi-validator pipeline (ontology + domain + external)
- Entity resolution and idempotent fusion
✅ Requirement 2: Hybrid KG + RAG + Update-resilient pipelines
Implemented via:
- FAISS vector index for semantic search
- Neo4j graph database for structural retrieval
- Confidence-based fusion of retrieval signals
- Autonomous agents (ReverifyAgent, ConflictResolverAgent) for continuous validation
- Versioned edges with conflict detection
```mermaid
graph TB
subgraph "Data Ingestion Layer"
A[Document Sources] --> B[Ingestion Service]
B --> C[MongoDB GridFS]
end
subgraph "KG Construction Pipeline"
C --> D[Normalization Service]
D --> E[Extraction Service]
E --> F[Validation Engine]
F --> G[Entity Resolution]
G --> H[Fusion Service]
H --> I[Neo4j Graph DB]
end
subgraph "Retrieval & Query Layer"
J[User Query] --> K[Query Service]
K --> L[Hybrid Retrieval]
L --> M[FAISS Vector Search]
L --> N[Neo4j Graph Search]
M --> O[Context Fusion]
N --> O
end
subgraph "Generation & Verification"
O --> P[Graph-Augmented Prompt]
P --> Q[LLM Generation]
Q --> R[GraphVerify]
R --> S[Verified Response]
end
subgraph "Agentic Maintenance"
T[ReverifyAgent] --> I
U[ConflictResolver] --> I
V[SchemaSuggestor] --> I
end
subgraph "Storage"
W[MongoDB] --> X[Documents & Metadata]
I --> Y[Knowledge Graph]
Z[Redis] --> AA[Task Queue]
AB[FAISS] --> AC[Vector Index]
end
```mermaid
sequenceDiagram
participant User
participant API
participant Ingestion
participant Pipeline
participant Neo4j
participant Query
participant LLM
participant Verify
User->>API: Upload Document
API->>Ingestion: Store in GridFS
Ingestion->>Pipeline: Trigger Processing
Pipeline->>Pipeline: Normalize
Pipeline->>Pipeline: Extract Triples
Pipeline->>Pipeline: Validate
Pipeline->>Neo4j: Fuse into Graph
User->>API: Submit Query
API->>Query: Process Query
Query->>Query: FAISS Search
Query->>Neo4j: Graph Retrieval
Query->>Query: Fuse Context
Query->>LLM: Generate Answer
LLM->>Verify: Verify Claims
Verify->>Neo4j: Check Graph
Verify->>User: Return Verified Answer
```
| Layer | Technology | Purpose |
|---|---|---|
| API | FastAPI | RESTful endpoints for ingestion and query |
| Graph DB | Neo4j | Property graph storage with Cypher queries |
| Document DB | MongoDB + GridFS | Raw documents, metadata, audit logs |
| Vector Store | FAISS | Dense embeddings for semantic search |
| Cache | Redis | Task queue, session management |
| Workers | Celery | Asynchronous processing, agent orchestration |
| LLM | Groq (Llama 3.3 70B) | Extraction, NL2Cypher, QA, verification |
| Embeddings | SentenceTransformers | all-MiniLM-L6-v2 for text encoding |
Purpose: Unified entry point for heterogeneous data sources
Functionality:
- Multi-source Support: URLs, file uploads, API endpoints
- Format Detection: Automatic detection of PDF, HTML, JSON, CSV, etc.
- Content Hashing: SHA-256 deduplication
- GridFS Storage: Large binary storage in MongoDB
- Metadata Creation: Document ID generation, timestamps, source tracking
Key Implementation:
```python
import hashlib
from datetime import datetime, timezone

async def ingest_from_url(url: str, source_type: DocumentType) -> RawDocument:
    # Fetch document content
    response = await http_client.get(url)
    content = response.content

    # Generate SHA-256 hash for deduplication
    content_hash = hashlib.sha256(content).hexdigest()

    # Return the existing record if this content was already ingested
    if existing := await collection.find_one({"content_hash": content_hash}):
        return RawDocument(**existing)

    # Store the raw bytes in GridFS
    gridfs_id = await gridfs.upload(content)

    # Create the metadata record
    raw_doc = RawDocument(
        document_id=generate_id(),
        source=url,
        content_hash=content_hash,
        gridfs_id=gridfs_id,
        timestamp=datetime.now(timezone.utc),
    )

    # Emit the downstream normalization task
    await emit_task("normalize_document", doc_id=raw_doc.document_id)
    return raw_doc
```

GLOW Alignment: Foundation for Topic #6 (Automated KG Construction)
Purpose: Convert heterogeneous formats into a unified structured representation
Functionality:
- PDF Extraction: Text + layout analysis using PyMuPDF
- HTML Parsing: Clean extraction with BeautifulSoup
- Table Detection: Identify and structure tabular data
- Chunking: Split into semantically coherent segments
- Metadata Preservation: Maintain section headers, page numbers, table IDs
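A minimal sketch of the two main extractors listed above, assuming PyMuPDF (imported as `fitz`) for PDFs and BeautifulSoup for HTML; `split_into_chunks` is a hypothetical stand-in for the semantic chunker:

```python
import fitz  # PyMuPDF
from bs4 import BeautifulSoup

def normalize_pdf(path: str) -> list[dict]:
    """Extract per-page text from a PDF and chunk it with character offsets."""
    chunks, offset = [], 0
    with fitz.open(path) as doc:
        for page_num, page in enumerate(doc, start=1):
            text = page.get_text()
            for piece in split_into_chunks(text):  # hypothetical semantic chunker
                chunks.append({
                    "text": piece,
                    "start_char": offset,
                    "end_char": offset + len(piece),
                    "metadata": {"page": page_num},
                })
                offset += len(piece)
    return chunks

def normalize_html(html: str) -> str:
    """Strip tags and boilerplate from an HTML document."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style"]):
        tag.decompose()
    return soup.get_text(separator=" ", strip=True)
```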
Normalization Output:
```json
{
"document_id": "doc_abc123",
"text_chunks": [
{
"chunk_id": "chunk_001",
"text": "Albert Einstein was born in Ulm, Germany...",
"start_char": 0,
"end_char": 156,
"metadata": {"section": "Biography", "page": 1}
}
],
"tables": [
{
"table_id": "table_001",
"headers": ["Name", "Born", "Died"],
"rows": [["Albert Einstein", "1879", "1955"]]
}
]
}
```

GLOW Alignment: Enables extraction from diverse sources (Topic #6)
Purpose: Extract knowledge triples (subject, predicate, object) from normalized documents
Dual Extraction Strategy:
Table-Based Extraction:
- Input: Structured tables
- Method: Rule-based column mapping (sketched below)
- Confidence: 0.95 (high reliability)
- Example:

  Table:

  | Person | Born | Died |
  |---|---|---|
  | Einstein | 1879 | 1955 |

  Extracts:
  - (Einstein, born_year, 1879)
  - (Einstein, died_year, 1955)
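A minimal sketch of the rule-based column mapping, assuming a hypothetical `COLUMN_PREDICATES` map from header names to predicates and the first column as the entity key:

```python
# Hypothetical mapping from table headers to KG predicates
COLUMN_PREDICATES = {"Born": "born_year", "Died": "died_year"}

def extract_table_triples(table: dict) -> list[tuple[str, str, str]]:
    """Map each non-key column of every row to a (subject, predicate, object) triple."""
    headers = table["headers"]  # e.g. ["Name", "Born", "Died"]
    key_col = headers[0]        # first column holds the entity name
    triples = []
    for row in table["rows"]:
        record = dict(zip(headers, row))
        subject = record[key_col]
        for col, predicate in COLUMN_PREDICATES.items():
            if col in record:
                triples.append((subject, predicate, record[col]))
    return triples

# extract_table_triples({"headers": ["Name", "Born", "Died"],
#                        "rows": [["Einstein", "1879", "1955"]]})
# -> [("Einstein", "born_year", "1879"), ("Einstein", "died_year", "1955")]
```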
LLM-Based Extraction:
- Input: Unstructured text chunks
- Method: Groq LLM with a structured prompt
- Output: JSON array of triples with evidence spans
- Confidence: Model-generated, adjusted by heuristics
LLM Extraction Prompt:
```
Extract all factual relationships from this text as triples.
TEXT:
Albert Einstein was born in Ulm, Germany on March 14, 1879.
FORMAT (JSON):
[
{
"subject": "Albert Einstein",
"predicate": "born_in",
"object": "Ulm, Germany",
"subject_type": "Person",
"object_type": "Location",
"evidence_span": {"start": 0, "end": 56}
}
]
```
Output: CandidateTriple objects with evidence spans
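A minimal sketch of sending the prompt above through the Groq SDK and parsing the response; the model identifier and the condensed prompt wording are assumptions:

```python
import json
from groq import Groq

client = Groq()  # reads GROQ_API_KEY from the environment

def extract_triples_llm(text: str) -> list[dict]:
    """Send a chunk through the extraction prompt and parse the JSON triple list."""
    prompt = (
        "Extract all factual relationships from this text as triples.\n\n"
        f"TEXT:\n{text}\n\n"
        "FORMAT (JSON): a JSON array of objects with subject, predicate, object, "
        "subject_type, object_type, and evidence_span fields."
    )
    response = client.chat.completions.create(
        model="llama-3.3-70b-versatile",  # assumed model identifier
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,  # deterministic output for extraction
    )
    return json.loads(response.choices[0].message.content)
```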
GLOW Alignment: Core of Topic #6 (Automated KG Construction)
Purpose: Validate candidate triples before graph fusion
Multi-Validator Pipeline:
Ontology Validator:
- Type Constraints: Check that subject/object types match predicate expectations
- Example: `founded_by` expects an Organization subject and a Person object: (Organization, founded_by, Person)

Domain Validator:
- Domain Rules: Numeric ranges, date formats, cardinality constraints
- Sanity Checks: Non-empty fields, no self-loops, reasonable lengths
- Evidence Quality: Verify that evidence spans exist and are non-trivial
- Logical Consistency: No obvious contradictions

External Validator:
- Wikipedia API: Cross-check entities and relationships
- Wikidata SPARQL: Query for confirming triples (sketched below)
- Caching: Avoid redundant API calls
- Confidence Boosting: +0.2 confidence if externally verified
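A minimal sketch of the Wikidata check, assuming entity and property identifiers (QIDs/PIDs) are resolved upstream; the ASK-query pattern against the public endpoint is standard, while the cache and user-agent details are illustrative:

```python
import requests

WIKIDATA_SPARQL = "https://query.wikidata.org/sparql"
_cache: dict[str, bool] = {}  # avoid redundant endpoint calls

def wikidata_confirms(subject_qid: str, property_pid: str, object_qid: str) -> bool:
    """ASK Wikidata whether the triple exists; results are cached per query."""
    query = f"ASK {{ wd:{subject_qid} wdt:{property_pid} wd:{object_qid} . }}"
    if query in _cache:
        return _cache[query]
    resp = requests.get(
        WIKIDATA_SPARQL,
        params={"query": query, "format": "json"},
        headers={"User-Agent": "GraphBuilder-RAG/0.1"},
        timeout=10,
    )
    result = resp.json().get("boolean", False)
    _cache[query] = result
    return result

# Confidence boosting: +0.2 if externally verified
# if wikidata_confirms(s, p, o): confidence = min(1.0, confidence + 0.2)
```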
Confidence Fusion:
```python
final_confidence = (
    extraction_confidence * 0.4 +
    ontology_score * 0.2 +
    domain_score * 0.2 +
    external_score * 0.2
)
```

Validation Result:
```python
ValidatedTriple(
triple_id="triple_xyz",
triple=Triple(...),
validation=ValidationResult(
passed=True,
confidence_score=0.87,
checks_passed=["ontology", "domain", "external"],
external_sources=["wikipedia", "wikidata"]
),
evidence=[EvidenceSpan(...)]
)
```

GLOW Alignment: Ensures trustworthiness (Topic #6, Topic #8)
Purpose: Map entity mentions to canonical entity IDs to avoid duplicates
Resolution Strategy:
- Exact Match: Check for exact canonical_name match
- Alias Match: Check against known aliases
- Fuzzy Match: Levenshtein distance with threshold
- Create New: If no match, create new canonical entity
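A minimal sketch of this cascade, using `difflib.SequenceMatcher` as a stand-in for Levenshtein distance; the 0.85 threshold and the `create_entity` factory are hypothetical:

```python
from difflib import SequenceMatcher

FUZZY_THRESHOLD = 0.85  # assumed similarity cutoff

def resolve_entity(mention: str, entities: list["Entity"]) -> "Entity":
    """Resolve a mention via exact -> alias -> fuzzy match, else create a new entity."""
    # 1. Exact canonical-name match
    for ent in entities:
        if ent.canonical_name == mention:
            return ent
    # 2. Alias match
    for ent in entities:
        if mention in ent.aliases:
            return ent
    # 3. Fuzzy match (difflib ratio as a stand-in for Levenshtein distance)
    best, best_score = None, 0.0
    for ent in entities:
        score = SequenceMatcher(None, mention.lower(), ent.canonical_name.lower()).ratio()
        if score > best_score:
            best, best_score = ent, score
    if best is not None and best_score >= FUZZY_THRESHOLD:
        return best
    # 4. No match: create a new canonical entity
    return create_entity(mention)  # hypothetical factory
```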
Entity Record:
```python
Entity(
entity_id="ent_einstein_001",
canonical_name="Albert Einstein",
aliases=["Einstein", "A. Einstein"],
entity_type=EntityType.PERSON,
confidence=0.95
)
```

GLOW Alignment: Enables entity linking and deduplication (Topic #4, Topic #6)
Purpose: Upsert validated triples into Neo4j with versioning and conflict detection
Fusion Workflow:
```mermaid
graph TD
A[ValidatedTriple] --> B[Resolve Source Entity]
A --> C[Resolve Target Entity]
B --> D{Check Existing Edges}
C --> D
D -->|No Conflict| E[Insert New Edge]
D -->|Conflict Detected| F[Log Conflict]
F --> G[Insert with Conflict Flag]
E --> H[Audit Log]
G --> H
H --> I[Update Triple Status]
```
Edge Versioning:
- Each edge has a `version` property
- Updates increment the version
- Old versions are retained in audit logs
- Conflicts are detected when the same (source, relationship) pair maps to different targets
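A minimal sketch of this versioned upsert with conflict detection, using the official `neo4j` Python driver; the Cypher, parameter names, and connection details are illustrative (Cypher cannot parameterize relationship types, so `rel_type` must be validated against a whitelist before interpolation):

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# %s is filled with a whitelisted relationship type; Cypher cannot parameterize it.
UPSERT_EDGE = """
MATCH (s:Entity {entity_id: $src}), (t:Entity {entity_id: $dst})
MERGE (s)-[r:%s]->(t)
ON CREATE SET r.version = 1, r.confidence = $conf, r.evidence_ids = $evidence
ON MATCH SET r.version = r.version + 1, r.confidence = $conf
RETURN r.version AS version
"""

CONFLICT_CHECK = """
MATCH (s:Entity {entity_id: $src})-[r:%s]->(t:Entity)
WHERE t.entity_id <> $dst
RETURN t.entity_id AS conflicting_target
"""

def fuse_edge(src_id, dst_id, rel_type, confidence, evidence_ids):
    """Upsert a validated edge, incrementing its version and flagging conflicts."""
    with driver.session() as session:
        # Conflict: the same (source, relationship) already points at a different target
        conflicts = session.run(CONFLICT_CHECK % rel_type, src=src_id, dst=dst_id).data()
        record = session.run(
            UPSERT_EDGE % rel_type,
            src=src_id, dst=dst_id, conf=confidence, evidence=evidence_ids,
        ).single()
        return {"version": record["version"], "conflicts": conflicts}
```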
Neo4j Schema:
```cypher
CREATE (s:Entity {
entity_id: "ent_001",
canonical_name: "Albert Einstein",
entity_type: "Person"
})
CREATE (t:Entity {
entity_id: "ent_002",
canonical_name: "Ulm, Germany",
entity_type: "Location"
})
CREATE (s)-[r:BORN_IN {
edge_id: "edge_abc",
confidence: 0.87,
evidence_ids: ["doc_001"],
version: 1,
timestamp: "2025-12-08T10:00:00Z"
}]->(t)
```

GLOW Alignment: Update-resilient KG (Mentor Requirement #2, Topic #4)
GraphBuilder-RAG implements a confidence-based hybrid retrieval system that combines:
- Semantic Retrieval: FAISS vector similarity search
- Structural Retrieval: Neo4j graph traversal with NL2Cypher
```mermaid
graph LR
A[User Query] --> B[Query Preprocessing]
B --> C[Parallel Retrieval]
C --> D[FAISS Vector Search]
C --> E[NL2Cypher Generation]
E --> F[Neo4j Execution]
D --> G[Confidence Fusion]
F --> G
G --> H[Ranked Context]
H --> I[Graph-Augmented Prompt]
```
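The Confidence Fusion step (node G above) merges the two retrieval streams into one ranked context list. A minimal sketch, where the 0.6/0.4 weights and the result-dict shape are assumptions:

```python
VECTOR_WEIGHT, GRAPH_WEIGHT = 0.6, 0.4  # assumed fusion weights

def fuse_context(vector_hits: list[dict], graph_hits: list[dict], top_k: int = 10) -> list[dict]:
    """Merge FAISS and Neo4j results into one ranked context list."""
    fused: dict[str, dict] = {}
    for hit in vector_hits:
        fused[hit["id"]] = {**hit, "score": VECTOR_WEIGHT * hit["score"]}
    for hit in graph_hits:
        if hit["id"] in fused:
            # Item surfaced by both retrievers: sum the weighted scores
            fused[hit["id"]]["score"] += GRAPH_WEIGHT * hit["score"]
        else:
            fused[hit["id"]] = {**hit, "score": GRAPH_WEIGHT * hit["score"]}
    return sorted(fused.values(), key=lambda h: h["score"], reverse=True)[:top_k]
```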
Implementation:
- Query Encoding: Embed query using SentenceTransformer
- Index Search: L2 distance search in FAISS index
- Chunk Retrieval: Fetch top-k chunks from MongoDB
- Score Normalization: Convert distances to similarity scores
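A minimal sketch of these four steps using `sentence-transformers` and `faiss`; the index handle and the distance-to-similarity mapping are illustrative choices, and chunk texts would then be fetched from MongoDB by the returned IDs:

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
index = faiss.IndexFlatL2(384)  # all-MiniLM-L6-v2 produces 384-dim embeddings

def vector_search(query: str, k: int = 5) -> list[dict]:
    """Embed the query, search FAISS, and convert L2 distances to similarity scores."""
    vec = model.encode([query]).astype(np.float32)
    distances, ids = index.search(vec, k)
    results = []
    for dist, idx in zip(distances[0], ids[0]):
        if idx == -1:  # fewer than k vectors in the index
            continue
        results.append({
            "faiss_id": int(idx),
            "score": 1.0 / (1.0 + float(dist)),  # one possible distance-to-similarity mapping
        })
    return results
```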
Chunk Match:
```python
ChunkMatch(
    chunk_id="chunk_001",
    document_id="doc_abc",
    text="Albert Einstein was born...",
    score=0.89,  # normalized similarity score
    metadata={"section": "Biography"}
)
```

[Continue to Part 2 for remaining sections...]