GraphBuilder-RAG: A Hybrid Knowledge Graph Construction and Retrieval-Augmented Generation Framework for Trustworthy AI

Workshop Submission for GLOW @ WWW 2026
Graph-enhanced LLMs for Trustworthy Web Data Management


Executive Summary

GraphBuilder-RAG is a novel framework that addresses critical trustworthiness challenges in Large Language Models (LLMs) through integrated knowledge graph construction, hybrid retrieval, and graph-based verification. The system pairs automated KG extraction and fusion from heterogeneous web and domain data with update-resilient hybrid KG+RAG pipelines maintained by autonomous agents.

Key Innovations

  1. Automated Multi-Stage KG Construction Pipeline: Heterogeneous data ingestion → normalization → extraction → validation → fusion with provenance tracking
  2. Hybrid Retrieval Architecture: Combines FAISS vector search + Neo4j graph queries with confidence-based fusion
  3. GraphVerify Module: Graph-grounded hallucination detection that verifies LLM claims against knowledge graph edges
  4. NL2Cypher Interface: LLM-powered natural language to Cypher query translation with safety validation
  5. Agentic Maintenance System: Self-maintaining KG through ReverifyAgent, ConflictResolverAgent, and SchemaSuggestorAgent
  6. End-to-End Provenance: Evidence spans, source tracking, edge versioning, and confidence scoring throughout

1. Introduction

1.1 Problem Statement

Large Language Models exhibit critical failure modes that undermine their trustworthiness in real-world applications:

  • Hallucination: Generation of factually incorrect or unsupported claims
  • Missing Relationships: Inability to reason over structured multi-hop relationships
  • Unverifiable Outputs: Lack of citations and evidence tracing
  • Knowledge Staleness: Inability to maintain accuracy as information evolves
  • Opacity: No explanation of reasoning paths or confidence levels

These limitations prevent deployment in high-stakes domains (medical, legal, financial) where verifiable accuracy is mandatory.

1.2 Why Knowledge Graphs?

Knowledge graphs provide:

  • Structural Grounding: Explicit entity-relationship representations
  • Multi-hop Reasoning: Traversable relationship paths
  • Verification Substrate: Ground truth for claim validation
  • Explainability: Traceable reasoning chains
  • Updateability: Localized corrections without retraining

1.3 Research Objectives

  1. Automated KG Construction: Build a pipeline that extracts, validates, and fuses knowledge from heterogeneous sources without manual curation
  2. Hybrid Retrieval: Combine semantic (vector) and structural (graph) retrieval for comprehensive context
  3. Verified Generation: Ensure LLM outputs are grounded in verified graph knowledge
  4. Provenance Tracking: Maintain full evidence chains from source documents to final answers
  5. Self-Maintenance: Deploy agents that keep the KG accurate and consistent over time

2. GLOW Workshop Alignment

GraphBuilder-RAG directly addresses 5 primary GLOW topics while implicitly covering all 10:

Primary Topics (Explicit Contributions)

| GLOW Topic | Our Contribution | System Component |
| --- | --- | --- |
| Topic #6: Automated KG Construction & Validation | Multi-stage pipeline: ingestion → extraction → validation → fusion | IngestionService, ExtractionService, ValidationEngine, FusionService |
| Topic #7: Graph-Based Prompting + RAG | Hybrid retrieval with graph-augmented prompt building | HybridRetrievalService, QueryService |
| Topic #2: Hallucination Detection & Mitigation | GraphVerify module for claim verification | GraphVerifyEngine |
| Topic #5: NL2Cypher Interface | LLM-powered natural language → Cypher translation | NL2QueryEngine in QueryService |
| Topic #3: Explainability & Provenance | Evidence spans, source tracking, confidence scores | EvidenceSpan, edge versioning, audit logs |

Implicit Topics (Covered by Design)

  • Topic #1: Graph-based retrieval via Neo4j subgraph traversal
  • Topic #4: Property graphs with rich node/edge properties and versioning
  • Topic #8: Trustworthiness metrics (confidence, fairness, provenance completeness)
  • Topic #9: Domain-agnostic architecture applicable to medical, finance, legal domains
  • Topic #10: Agentic AI through ReverifyAgent, ConflictResolverAgent, SchemaSuggestorAgent

Mentor's Two Core Requirements

Requirement 1: Automated KG extraction & fusion from heterogeneous Web/domain data

Implemented via:

  • Multi-source ingestion (URLs, files, APIs)
  • Normalization layer for format unification
  • Dual extraction (table-based + LLM-based)
  • Multi-validator pipeline (ontology + domain + external)
  • Entity resolution and idempotent fusion

Requirement 2: Hybrid KG + RAG + Update-resilient pipelines

Implemented via:

  • FAISS vector index for semantic search
  • Neo4j graph database for structural retrieval
  • Confidence-based fusion of retrieval signals
  • Autonomous agents (ReverifyAgent, ConflictResolverAgent) for continuous validation
  • Versioned edges with conflict detection

3. System Architecture

3.1 High-Level Architecture

graph TB
    subgraph "Data Ingestion Layer"
        A[Document Sources] --> B[Ingestion Service]
        B --> C[MongoDB GridFS]
    end
    
    subgraph "KG Construction Pipeline"
        C --> D[Normalization Service]
        D --> E[Extraction Service]
        E --> F[Validation Engine]
        F --> G[Entity Resolution]
        G --> H[Fusion Service]
        H --> I[Neo4j Graph DB]
    end
    
    subgraph "Retrieval & Query Layer"
        J[User Query] --> K[Query Service]
        K --> L[Hybrid Retrieval]
        L --> M[FAISS Vector Search]
        L --> N[Neo4j Graph Search]
        M --> O[Context Fusion]
        N --> O
    end
    
    subgraph "Generation & Verification"
        O --> P[Graph-Augmented Prompt]
        P --> Q[LLM Generation]
        Q --> R[GraphVerify]
        R --> S[Verified Response]
    end
    
    subgraph "Agentic Maintenance"
        T[ReverifyAgent] --> I
        U[ConflictResolver] --> I
        V[SchemaSuggestor] --> I
    end
    
    subgraph "Storage"
        W[MongoDB] --> X[Documents & Metadata]
        I --> Y[Knowledge Graph]
        Z[Redis] --> AA[Task Queue]
        AB[FAISS] --> AC[Vector Index]
    end

3.2 Data Flow: Document Ingestion → Query Response

sequenceDiagram
    participant User
    participant API
    participant Ingestion
    participant Pipeline
    participant Neo4j
    participant Query
    participant LLM
    participant Verify
    
    User->>API: Upload Document
    API->>Ingestion: Store in GridFS
    Ingestion->>Pipeline: Trigger Processing
    
    Pipeline->>Pipeline: Normalize
    Pipeline->>Pipeline: Extract Triples
    Pipeline->>Pipeline: Validate
    Pipeline->>Neo4j: Fuse into Graph
    
    User->>API: Submit Query
    API->>Query: Process Query
    Query->>Query: FAISS Search
    Query->>Neo4j: Graph Retrieval
    Query->>Query: Fuse Context
    Query->>LLM: Generate Answer
    LLM->>Verify: Verify Claims
    Verify->>Neo4j: Check Graph
    Verify->>User: Return Verified Answer

3.3 Technology Stack

| Layer | Technology | Purpose |
| --- | --- | --- |
| API | FastAPI | RESTful endpoints for ingestion and query |
| Graph DB | Neo4j | Property graph storage with Cypher queries |
| Document DB | MongoDB + GridFS | Raw documents, metadata, audit logs |
| Vector Store | FAISS | Dense embeddings for semantic search |
| Cache | Redis | Task queue, session management |
| Workers | Celery | Asynchronous processing, agent orchestration |
| LLM | Groq (Llama 3.3 70B) | Extraction, NL2Cypher, QA, verification |
| Embeddings | SentenceTransformers | all-MiniLM-L6-v2 for text encoding |

4. Core System Components

4.1 Document Ingestion Service

Purpose: Unified entry point for heterogeneous data sources

Functionality:

  • Multi-source Support: URLs, file uploads, API endpoints
  • Format Detection: Automatic detection of PDF, HTML, JSON, CSV, etc.
  • Content Hashing: SHA-256 deduplication
  • GridFS Storage: Large binary storage in MongoDB
  • Metadata Creation: Document ID generation, timestamps, source tracking

Key Implementation:

import hashlib
from datetime import datetime, timezone

async def ingest_from_url(url: str, source_type: DocumentType) -> RawDocument:
    # Fetch the document bytes (http_client, collection, gridfs are service singletons)
    response = await http_client.get(url)
    content = response.content

    # Generate a SHA-256 hash for deduplication
    content_hash = hashlib.sha256(content).hexdigest()

    # Short-circuit if this content was already ingested
    if existing := await collection.find_one({"content_hash": content_hash}):
        return RawDocument(**existing)

    # Store the raw bytes in GridFS
    gridfs_id = await gridfs.upload(content)

    # Create the metadata record
    raw_doc = RawDocument(
        document_id=generate_id(),
        source=url,
        source_type=source_type,
        content_hash=content_hash,
        gridfs_id=gridfs_id,
        timestamp=datetime.now(timezone.utc),
    )

    # Emit the downstream normalization task
    await emit_task("normalize_document", doc_id=raw_doc.document_id)

    return raw_doc

GLOW Alignment: Foundation for Topic #6 (Automated KG Construction)


4.2 Normalization Service

Purpose: Convert heterogeneous formats into unified structured representation

Functionality:

  • PDF Extraction: Text + layout analysis using PyMuPDF
  • HTML Parsing: Clean extraction with BeautifulSoup
  • Table Detection: Identify and structure tabular data
  • Chunking: Split into semantically coherent segments
  • Metadata Preservation: Maintain section headers, page numbers, table IDs

Normalization Output:

{
  "document_id": "doc_abc123",
  "text_chunks": [
    {
      "chunk_id": "chunk_001",
      "text": "Albert Einstein was born in Ulm, Germany...",
      "start_char": 0,
      "end_char": 156,
      "metadata": {"section": "Biography", "page": 1}
    }
  ],
  "tables": [
    {
      "table_id": "table_001",
      "headers": ["Name", "Born", "Died"],
      "rows": [["Albert Einstein", "1879", "1955"]]
    }
  ]
}
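
A minimal sketch of how the text_chunks above could be produced for PDFs, assuming PyMuPDF (fitz) and a fixed character window with overlap; the real service also parses HTML, detects tables, and carries section/page metadata:

import fitz  # PyMuPDF

def chunk_pdf(path: str, window: int = 800, overlap: int = 100) -> list[dict]:
    """Extract page text with PyMuPDF and split it into overlapping chunks."""
    with fitz.open(path) as doc:
        text = "\n".join(page.get_text() for page in doc)
    chunks = []
    step = window - overlap
    for i, start in enumerate(range(0, len(text), step)):
        piece = text[start:start + window]
        if not piece.strip():
            continue  # skip whitespace-only segments
        chunks.append({
            "chunk_id": f"chunk_{i + 1:03d}",
            "text": piece,
            "start_char": start,
            "end_char": start + len(piece),
        })
    return chunks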

GLOW Alignment: Enables extraction from diverse sources (Topic #6)


4.3 Extraction Service

Purpose: Extract knowledge triples (subject, predicate, object) from normalized documents

Dual Extraction Strategy:

4.3.1 Table Extractor (Deterministic)

  • Input: Structured tables
  • Method: Rule-based column mapping
  • Confidence: 0.95 (high reliability)
  • Example:
    Table: | Person | Born | Died |
           | Einstein | 1879 | 1955 |
    
    Extracts:
    - (Einstein, born_year, 1879)
    - (Einstein, died_year, 1955)
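
A minimal sketch of the deterministic mapping; PREDICATE_MAP and the subject-first convention are illustrative assumptions, not the service's exact rules:

# Illustrative header -> predicate schema; the real mapping is rule-based per table type.
PREDICATE_MAP = {"Born": "born_year", "Died": "died_year"}

def table_to_triples(headers: list[str], rows: list[list[str]]) -> list[tuple[str, str, str]]:
    """First column holds the subject; remaining columns map to predicates."""
    triples = []
    for row in rows:
        subject = row[0]
        for header, value in zip(headers[1:], row[1:]):
            predicate = PREDICATE_MAP.get(header, header.strip().lower())
            triples.append((subject, predicate, value))
    return triples

# table_to_triples(["Person", "Born", "Died"], [["Einstein", "1879", "1955"]])
# -> [("Einstein", "born_year", "1879"), ("Einstein", "died_year", "1955")]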
    

4.3.2 LLM Extractor (Semantic)

  • Input: Unstructured text chunks
  • Method: Groq LLM with structured prompt
  • Output: JSON array of triples with evidence spans
  • Confidence: Model-generated + heuristics

LLM Extraction Prompt:

Extract all factual relationships from this text as triples.

TEXT:
Albert Einstein was born in Ulm, Germany on March 14, 1879.

FORMAT (JSON):
[
  {
    "subject": "Albert Einstein",
    "predicate": "born_in",
    "object": "Ulm, Germany",
    "subject_type": "Person",
    "object_type": "Location",
    "evidence_span": {"start": 0, "end": 56}
  }
]

Output: CandidateTriple objects with evidence spans
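
A hedged sketch of the extraction call using the groq Python SDK and the prompt above; the model id, temperature, and the extract_triples helper are illustrative:

import json
from groq import Groq

client = Groq()  # reads GROQ_API_KEY from the environment

def extract_triples(text: str) -> list[dict]:
    """Ask the LLM for triples in the JSON format shown above."""
    prompt = (
        "Extract all factual relationships from this text as triples.\n\n"
        f"TEXT:\n{text}\n\n"
        "FORMAT (JSON): a JSON array of objects with subject, predicate, "
        "object, subject_type, object_type, evidence_span."
    )
    response = client.chat.completions.create(
        model="llama-3.3-70b-versatile",  # assumed Groq model id for Llama 3.3 70B
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,  # deterministic extraction
    )
    try:
        return json.loads(response.choices[0].message.content)
    except json.JSONDecodeError:
        return []  # the real service retries or repairs malformed JSON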

GLOW Alignment: Core of Topic #6 (Automated KG Construction)


4.4 Validation Engine

Purpose: Validate candidate triples before graph fusion

Multi-Validator Pipeline:

4.4.1 Ontology Validator

  • Type Constraints: Check subject/object types match predicate expectations
  • Example: founded_by requires (Organization, founded_by, Person)
  • Domain Rules: Numeric ranges, date formats, cardinality constraints
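
A minimal sketch of the type check, with an illustrative constraint table (the actual ontology lives in configuration):

# Predicate -> (expected subject type, expected object type); illustrative entries.
TYPE_CONSTRAINTS = {
    "founded_by": ("Organization", "Person"),
    "born_in": ("Person", "Location"),
}

def check_types(triple) -> bool:
    """Enforce type constraints for known predicates; pass unknown ones through."""
    expected = TYPE_CONSTRAINTS.get(triple.predicate)
    if expected is None:
        return True  # no constraint registered for this predicate
    return (triple.subject_type, triple.object_type) == expected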

4.4.2 Domain Constraint Validator

  • Sanity Checks: Non-empty fields, no self-loops, reasonable lengths
  • Evidence Quality: Verify evidence spans exist and are non-trivial
  • Logical Consistency: No obvious contradictions

4.4.3 External Verifier

  • Wikipedia API: Cross-check entities and relationships
  • Wikidata SPARQL: Query for confirming triples
  • Caching: Avoid redundant API calls
  • Confidence Boosting: +0.2 confidence if externally verified
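
A sketch of the Wikipedia cross-check with caching, using the public MediaWiki opensearch endpoint; the matching rule is simplified, and the boost value mirrors the +0.2 rule above:

from functools import lru_cache
import requests

WIKI_API = "https://en.wikipedia.org/w/api.php"

@lru_cache(maxsize=4096)  # cache to avoid redundant API calls
def entity_on_wikipedia(name: str) -> bool:
    """Return True if the entity name matches a Wikipedia article title."""
    resp = requests.get(
        WIKI_API,
        params={"action": "opensearch", "search": name, "limit": 1, "format": "json"},
        timeout=5,
    )
    titles = resp.json()[1]  # opensearch returns [query, titles, descriptions, urls]
    return bool(titles) and titles[0].lower() == name.lower()

def external_boost(triple) -> float:
    """+0.2 confidence when both endpoints are externally verified."""
    if entity_on_wikipedia(triple.subject) and entity_on_wikipedia(triple.object):
        return 0.2
    return 0.0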

Confidence Fusion:

final_confidence = (
    extraction_confidence * 0.4 +
    ontology_score * 0.2 +
    domain_score * 0.2 +
    external_score * 0.2
)

Validation Result:

ValidatedTriple(
    triple_id="triple_xyz",
    triple=Triple(...),
    validation=ValidationResult(
        passed=True,
        confidence_score=0.87,
        checks_passed=["ontology", "domain", "external"],
        external_sources=["wikipedia", "wikidata"]
    ),
    evidence=[EvidenceSpan(...)]
)

GLOW Alignment: Ensures trustworthiness (Topic #6, Topic #8)


4.5 Entity Resolution Service

Purpose: Map entity mentions to canonical entity IDs to avoid duplicates

Resolution Strategy:

  1. Exact Match: Check for exact canonical_name match
  2. Alias Match: Check against known aliases
  3. Fuzzy Match: Levenshtein distance with threshold
  4. Create New: If no match, create new canonical entity

Entity Record:

Entity(
    entity_id="ent_einstein_001",
    canonical_name="Albert Einstein",
    aliases=["Einstein", "A. Einstein"],
    entity_type=EntityType.PERSON,
    confidence=0.95
)
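
A condensed sketch of the four-step strategy, using rapidfuzz as a stand-in for the Levenshtein threshold (the threshold value and create_entity factory are hypothetical):

from rapidfuzz import fuzz

FUZZY_THRESHOLD = 90  # 0-100 similarity score; assumed cutoff

def resolve_entity(mention: str, entities: list) -> "Entity":
    # 1. Exact match on canonical name
    for e in entities:
        if e.canonical_name == mention:
            return e
    # 2. Alias match
    for e in entities:
        if mention in e.aliases:
            return e
    # 3. Fuzzy match above the threshold
    best = max(entities, key=lambda e: fuzz.ratio(mention, e.canonical_name), default=None)
    if best and fuzz.ratio(mention, best.canonical_name) >= FUZZY_THRESHOLD:
        return best
    # 4. No match: create a new canonical entity
    return create_entity(mention)  # hypothetical factory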

GLOW Alignment: Enables entity linking and deduplication (Topic #4, Topic #6)


4.6 Fusion Service

Purpose: Upsert validated triples into Neo4j with versioning and conflict detection

Fusion Workflow:

graph TD
    A[ValidatedTriple] --> B[Resolve Source Entity]
    A --> C[Resolve Target Entity]
    B --> D{Check Existing Edges}
    C --> D
    D -->|No Conflict| E[Insert New Edge]
    D -->|Conflict Detected| F[Log Conflict]
    F --> G[Insert with Conflict Flag]
    E --> H[Audit Log]
    G --> H
    H --> I[Update Triple Status]

Edge Versioning:

  • Each edge has a version property
  • Updates increment version
  • Old versions retained in audit logs
  • Conflicts detected when same (source, relationship) maps to different targets

Neo4j Schema:

CREATE (s:Entity {
    entity_id: "ent_001",
    canonical_name: "Albert Einstein",
    entity_type: "Person"
})

CREATE (t:Entity {
    entity_id: "ent_002", 
    canonical_name: "Ulm, Germany",
    entity_type: "Location"
})

CREATE (s)-[r:BORN_IN {
    edge_id: "edge_abc",
    confidence: 0.87,
    evidence_ids: ["doc_001"],
    version: 1,
    timestamp: "2025-12-08T10:00:00Z"
}]->(t)
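
A sketch of the conflict-aware upsert using the neo4j Python driver. The relationship type is interpolated from a validated whitelist because Cypher cannot parameterize it; the query shape and credentials are illustrative:

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def fuse_edge(rel_type: str, source_id: str, target_id: str, confidence: float):
    assert rel_type in {"BORN_IN", "FOUNDED_BY"}  # whitelist before interpolation
    with driver.session() as session:
        # Conflict check: same (source, relationship) pointing at a different target
        conflicts = session.run(
            f"""
            MATCH (s:Entity {{entity_id: $sid}})-[r:{rel_type}]->(o:Entity)
            WHERE o.entity_id <> $tid
            RETURN o.entity_id AS other
            """,
            sid=source_id, tid=target_id,
        ).value("other")
        # Upsert with version increment; conflicts are flagged, not overwritten
        session.run(
            f"""
            MATCH (s:Entity {{entity_id: $sid}}), (t:Entity {{entity_id: $tid}})
            MERGE (s)-[r:{rel_type}]->(t)
            ON CREATE SET r.version = 1
            ON MATCH SET r.version = r.version + 1
            SET r.confidence = $conf, r.conflict = $conflict
            """,
            sid=source_id, tid=target_id, conf=confidence, conflict=bool(conflicts),
        )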

GLOW Alignment: Update-resilient KG (Mentor Requirement #2, Topic #4)


5. Hybrid Retrieval Architecture

5.1 Overview

GraphBuilder-RAG implements a confidence-based hybrid retrieval system that combines:

  • Semantic Retrieval: FAISS vector similarity search
  • Structural Retrieval: Neo4j graph traversal with NL2Cypher

5.2 Retrieval Flow

graph LR
    A[User Query] --> B[Query Preprocessing]
    B --> C[Parallel Retrieval]
    C --> D[FAISS Vector Search]
    C --> E[NL2Cypher Generation]
    E --> F[Neo4j Execution]
    D --> G[Confidence Fusion]
    F --> G
    G --> H[Ranked Context]
    H --> I[Graph-Augmented Prompt]

5.3 FAISS Vector Search

Implementation:

  1. Query Encoding: Embed query using SentenceTransformer
  2. Index Search: L2 distance search in FAISS index
  3. Chunk Retrieval: Fetch top-k chunks from MongoDB
  4. Score Normalization: Convert distances to similarity scores

Chunk Match:

ChunkMatch(
    chunk_id="chunk_001",
    document_id="doc_abc",
    text="Albert Einstein was born...",
    score=0.89,  # Similarity score (normalized from L2 distance)
    metadata={"section": "Biography"}
)
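
A compact sketch of steps 1-4, assuming a prebuilt IndexFlatL2 over all-MiniLM-L6-v2 embeddings (384 dimensions) and an in-memory chunk store standing in for the MongoDB lookup:

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dim embeddings
index = faiss.IndexFlatL2(384)                   # assume chunk vectors already added
chunk_store: dict[int, dict] = {}                # stand-in for MongoDB chunk lookup

def vector_search(query: str, k: int = 5) -> list[dict]:
    # 1. Encode the query
    q = model.encode([query]).astype(np.float32)
    # 2. L2 distance search in the FAISS index
    distances, ids = index.search(q, k)
    results = []
    for dist, idx in zip(distances[0], ids[0]):
        if idx == -1:  # fewer than k vectors in the index
            continue
        chunk = chunk_store[int(idx)]        # 3. fetch chunk metadata
        score = 1.0 / (1.0 + float(dist))    # 4. convert distance to similarity
        results.append({**chunk, "score": score})
    return results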

[Continue to Part 2 for remaining sections...]