Skip to content

NikhilOO7/gaussian-splatting

Repository files navigation

Gaussian Splatting Knowledge Graph

A full-stack knowledge graph application that extracts structured relationships from academic papers on 3D Gaussian Splatting using AI agents. The system automatically reads papers, identifies entities (methods, concepts, datasets, metrics), and discovers semantic relationships between them to build a queryable knowledge graph.

Architecture Overview

┌─────────────┐
│   arXiv     │
│   Papers    │
└──────┬──────┘
       │
       ↓
┌─────────────────┐
│  PDF Extraction │
│   (pdf-parse)   │
└──────┬──────────┘
       │
       ↓
┌─────────────────────────────────────────────────────────┐
│              AI Agent Pipeline                          │
│  ┌──────────┐  ┌──────────┐  ┌──────────────┐           │
│  │Extractor │→ │ Resolver │→ │  Validator   │           │
│  │  Agent   │  │  Agent   │  │    Agent     │           │
│  └──────────┘  └──────────┘  └──────────────┘           │
│                  (OpenAI GPT-4o)                        │
└──────┬──────────────────────────────────────────────────┘
       │
       ↓
┌──────────────────┐
│  Knowledge Graph │
│   (PostgreSQL)   │
└──────┬───────────┘
       │
       ↓
┌──────────────────┐
│   REST API       │
│   (Hono)         │
└──────┬───────────┘
       │
       ↓
┌──────────────────┐
│   React UI       │
│ (Dashboard +     │
│  Explorer +      │
│  Ingestion)      │
└──────────────────┘

Screenshots

Dashboard - Real-time Monitoring

The dashboard displays live statistics from the PostgreSQL database, including node/edge counts by type and processing status for papers currently being analyzed.

Dashboard

Graph Explorer - Interactive Visualization

The graph explorer shows nodes and edges extracted from real papers using the AI agent pipeline. The circular layout ensures clean visualization without overlapping nodes.

Graph Explorer

Paper Ingestion - arXiv Integration

Papers can be ingested directly from arXiv by ID. The system automatically downloads PDFs, extracts text, and processes through the 3-agent pipeline with real-time progress updates.

Paper Ingestion

Tech Stack

Backend

  • Hono: Lightweight web framework for the API server
  • Drizzle ORM: Type-safe database queries with PostgreSQL
  • OpenAI API: GPT-4o for high-quality agent processing with structured outputs
  • pdf-parse: Extract text content from PDF files

Frontend

  • React 18: UI components with TypeScript
  • Vite: Fast build tooling
  • TailwindCSS: Utility-first styling
  • React Router: Client-side routing
  • React Query: Server state management
  • React Flow: Interactive graph visualization

Infrastructure

  • pnpm workspaces: Monorepo management
  • PostgreSQL: Relational database for graph storage
  • Docker Compose: PostgreSQL containerization

Design Choices

Why PostgreSQL over a graph database?

  • PostgreSQL provides strong ACID guarantees and mature tooling
  • Graph queries can be efficiently handled with proper indexing on source/target IDs
  • Drizzle ORM provides excellent TypeScript integration
  • Easier deployment and operational simplicity
  • JSON columns allow flexible property storage while maintaining relational integrity

Why three separate agents vs one?

  • Separation of concerns: extraction, resolution, and validation are distinct tasks
  • Easier debugging and iteration on each stage
  • Better confidence scoring by combining multiple agent outputs
  • Enables parallel processing of extraction chunks while maintaining global entity resolution

Why OpenAI API vs local LLM?

  • Higher quality structured output generation with GPT-4o
  • Reliable JSON schema adherence using native response format
  • Better entity resolution and relationship extraction accuracy
  • No local GPU requirements or model management overhead
  • Trade-off: API costs vs quality and development velocity

Prerequisites

  • Node.js >= 18.0.0
  • pnpm >= 8.0.0
  • Docker and Docker Compose
  • OpenAI API key with GPT-4o access

Installation

  1. Clone the repository:
git clone <repository-url>
cd gsplat-kg
  1. Install dependencies:
pnpm install
  1. Set up environment variables:
# Create apps/api/.env
cat > apps/api/.env << EOF
DATABASE_URL=postgresql://postgres:postgres@localhost:5432/knowledge_graph
OPENAI_API_KEY=your-api-key-here
OPENAI_MODEL=gpt-4o-2024-08-06
EOF

# Create apps/web/.env
cat > apps/web/.env << EOF
VITE_API_URL=http://localhost:3000
EOF
  1. Start PostgreSQL:
docker-compose up -d
  1. Set up the database:
pnpm db:push

Environment Variables

apps/api/.env

DATABASE_URL=postgresql://postgres:postgres@localhost:5432/knowledge_graph
OPENAI_API_KEY=sk-...
OPENAI_MODEL=gpt-4o-2024-08-06

apps/web/.env

VITE_API_URL=http://localhost:3000

Running the Application

Start both backend and frontend:

pnpm dev

Or run them separately:

# Backend only
pnpm --filter api dev

# Frontend only
pnpm --filter web dev

The application will be available at:

Quick Start: Process Your First Paper

Once the application is running, try this end-to-end workflow:

1. Ingest a paper from arXiv:

curl -X POST http://localhost:3000/api/ingest/arxiv \
  -H "Content-Type: application/json" \
  -d '{"arxivId": "2308.04079", "autoProcess": true}'

Response:

{
  "jobId": "job-1234567890-abc123",
  "status": "queued"
}

2. Monitor processing status:

curl http://localhost:3000/api/ingest/status/job-1234567890-abc123

You'll see status updates:

  • fetching_metadata → Downloading paper metadata from arXiv
  • downloading_pdf → Fetching PDF file
  • extracting_text → Parsing PDF with pdf-parse
  • processing → Running AI extraction pipeline (3-5 minutes)
  • completed → Done! Graph is ready

3. View the results:

Open http://localhost:5173/explorer to see the extracted knowledge graph:

  • Nodes representing methods, concepts, datasets
  • Edges showing relationships (extends, improves, uses, etc.)
  • Click nodes to see details

4. Check the statistics:

Open http://localhost:5173 to see:

  • Total papers processed
  • Node counts by type
  • Edge counts by type
  • Recent processing activity

5. Verify in database (optional):

pnpm db:studio

Navigate to tables:

  • papers - Paper metadata and raw text
  • nodes - Extracted entities
  • edges - Relationships with confidence scores
  • sources - Provenance linking edges to source text

Visual Proof: Screenshots from Live System

The screenshots above demonstrate the complete, working system processing real academic papers:

  1. Dashboard Screenshot: Shows real-time statistics - 52 nodes, 25+ edges from actual database queries
  2. Graph Explorer Screenshot: Displays the knowledge graph extracted from arXiv paper 2511.21678 using the AI agent pipeline
  3. Ingestion Screenshot: Demonstrates the arXiv integration with real-time progress tracking

All data visible in these screenshots came from:

  • ✅ Real PDF downloaded from arXiv
  • ✅ Real text extraction using pdf-parse
  • ✅ Real AI agent processing (Extractor → Resolver → Validator)
  • ✅ Real database insertion with provenance tracking
  • ✅ Real frontend queries from PostgreSQL

This is not mock data or manually inserted test data - it's the result of the fully functional end-to-end pipeline.

Current Implementation Status

What's Working

Full AI Agent Pipeline

  • Three-agent architecture (Extractor → Resolver → Validator) functioning end-to-end
  • Entity extraction from paper text chunks with confidence scoring
  • Entity resolution with deduplication and canonical name mapping
  • Relationship validation with temporal consistency and type checking
  • Provenance tracking linking every edge to source evidence

Database & API

  • PostgreSQL schema with nodes, edges, papers, authors, and sources tables
  • REST API with 15+ endpoints for papers, graph queries, and ingestion
  • Real-time processing status updates with progress tracking
  • Support for reprocessing papers with automatic cleanup

Frontend Application

  • Dashboard with graph statistics and processing status
  • Interactive graph explorer with circular layout visualization
  • Paper ingestion UI with bulk upload support
  • Real-time progress monitoring during paper processing

Performance Metrics (Based on actual processing runs)

  • 42 nodes created from a single paper
  • 25 edges created (59% connectivity rate)
  • 46 chunks processed per paper (2000 chars each, 200 char overlap)
  • 60 entities extracted, reduced to 42 after deduplication
  • 6 relationships rejected by validator (temporal/type mismatches)
  • Average processing time: ~3-5 minutes per paper

Key Implementation Details

Entity Name Resolution Flow:

  1. Extractor outputs raw entity mentions with types
  2. Resolver maps mentions to canonical names (e.g., "3DGS" → "3D Gaussian Splatting")
  3. Processor stores canonicalName.toLowerCase() → UUID in entityMap
  4. Validator uses canonical names in relationships
  5. Processor looks up UUIDs from names before database insertion

LLM Integration:

  • Direct OpenAI API calls using fetch (bypassed AI SDK for reliability)
  • Structured JSON responses using GPT-4o's native response_format
  • Temperature 0.3 for consistent, deterministic outputs
  • Explicit schema examples in prompts to guide LLM behavior
  • "CRITICAL RULES" sections to prevent common LLM mistakes

Progress Tracking:

  • Database-backed status updates: pendingextracting_entitiescompleted
  • Progress percentage (0-100) updated after each chunk
  • Frontend polls every 2 seconds using React Query
  • Status displayed in real-time on Dashboard and Ingestion pages

System Design

Database Schema

The database uses a hybrid approach: relational tables for core entities with JSON columns for flexible metadata.

Core Tables:

  • papers: Academic papers with metadata
  • authors: Author information with normalized names
  • paper_authors: Many-to-many relationship between papers and authors
  • nodes: Graph entities (papers, methods, concepts, datasets, metrics)
  • edges: Directed relationships between nodes with confidence scores
  • sources: Provenance tracking linking edges to source papers and text spans

Key Design Decisions:

  • UUID primary keys for distributed-friendly identifiers
  • Normalized names for fuzzy matching and deduplication
  • JSONB columns for flexible properties without schema changes
  • Confidence scores (0-1) on edges for relationship strength
  • Composite indexes on (source_id, type) and (target_id, type) for efficient graph traversal

Agent Pipeline

The processing pipeline consists of three specialized agents:

1. Extractor Agent

  • Reads paper text chunks (2000 chars with 200 char overlap)
  • Identifies entity mentions with text span positions
  • Detects relationship patterns using predefined verbs
  • Prioritizes recall over precision
  • Outputs raw entities and relationships with confidence scores

2. Resolver Agent

  • Maps entity mentions to canonical entities
  • Performs fuzzy matching against existing graph nodes
  • Handles acronym expansion ("3DGS" → "3D Gaussian Splatting")
  • Creates new entities when no match found
  • Resolves relationships using canonical entity IDs

3. Validator Agent

  • Checks temporal consistency (publication dates)
  • Verifies entity type compatibility with relationship types
  • Adjusts confidence scores based on evidence strength
  • Flags contradictions with existing graph
  • Filters low-confidence relationships (< 0.4)

Entity Resolution

Entity resolution uses a multi-strategy approach:

  1. Exact matching: Normalized name comparison
  2. Fuzzy matching: String similarity for variations
  3. Acronym expansion: Context-aware abbreviation matching
  4. Temporal context: Publication dates for disambiguation

Entities are deduplicated using normalized_name field (lowercase, trimmed). New entities are created only when confidence in existing matches is below threshold.

Provenance Tracking

Every edge in the graph links to source evidence:

  • Paper ID where relationship was found
  • Section (abstract, introduction, methods, results)
  • Extracted text snippet
  • Character span positions (start/end)

This enables:

  • Verification of AI-extracted relationships
  • Confidence scoring based on evidence quality
  • Citation of source material
  • Debugging and refinement of extraction prompts

API Reference

Papers

  • GET /api/papers?limit=20&offset=0 - List papers with pagination
  • GET /api/papers/:id - Get paper details
  • POST /api/papers - Create paper manually
  • POST /api/papers/:id/process - Trigger extraction pipeline

Graph

  • GET /api/graph/nodes?type=method&search=gaussian - List nodes with filters
  • GET /api/graph/nodes/:id - Get node with connected edges
  • GET /api/graph/edges?type=extends - List edges with filters
  • GET /api/graph/subgraph?nodeId=X&depth=2 - Get N-hop neighborhood
  • GET /api/graph/stats - Aggregate statistics

Ingestion

  • POST /api/ingest/arxiv - Fetch and add paper from arXiv
  • GET /api/ingest/status/:jobId - Check processing status

Lessons Learned

Prompt Engineering for Multi-Agent Systems

Field naming matters: Using semantically meaningful field names (sourceName vs sourceId) helps LLMs generate correct outputs. The suffix Id strongly suggests UUID, while Name suggests a human-readable string. This small change prevented the LLM from hallucinating UUIDs.

Explicit examples over descriptions: Showing exact JSON schemas with concrete examples (e.g., "sourceName": "ViLoMem") is more effective than abstract descriptions. LLMs pattern-match more reliably with examples.

"CRITICAL" keyword emphasis: Adding sections labeled "CRITICAL RULES" significantly improved compliance. Regular instruction text can be overlooked, but emphasized sections get weighted higher.

Cross-agent consistency: Agent prompts must use identical schemas. Even small inconsistencies (like field names) cascade into bugs when data flows between agents.

Entity Resolution Strategy

Canonical names as primary keys: Using human-readable canonical names (e.g., "3D Gaussian Splatting") as the primary identifier in intermediate stages makes debugging much easier. UUIDs are only needed at the final database insertion step.

Two-stage lookup: The entityMap pattern (name → UUID lookup) cleanly separates LLM-generated names from database IDs. This prevents LLMs from generating invalid UUIDs.

Fuzzy matching needed: Exact string matching isn't enough. Authors write "3DGS", "3D-GS", "3D Gaussian Splatting", "gaussian splatting" interchangeably. Normalized lowercase matching catches most variations.

Graph Visualization

Circular layout scales well: The simple circular distribution algorithm works surprisingly well for up to ~100 nodes. More sophisticated force-directed layouts would help beyond that.

Dynamic spacing: Calculating radius as Math.max(400, total * 15) ensures nodes don't overlap as the graph grows.

OpenAI API Integration

Native JSON mode is crucial: GPT-4o's response_format: { type: "json_object" } dramatically improved structured output reliability compared to prompt-based JSON generation.

Direct fetch over AI SDK: The Vercel AI SDK added complexity without benefit for our use case. Direct OpenAI API calls with fetch gave us full control and easier debugging.

Temperature 0.3 sweet spot: Temperature 0.0 sometimes caused repetitive outputs; 0.3 provided consistency while maintaining slight creativity for entity name standardization.

Limitations and Future Work

Current Limitations

  1. No PDF fetching: System expects papers to be manually uploaded; doesn't fetch PDFs from arXiv automatically
  2. Synchronous processing: Paper processing blocks the API thread (should use background jobs)
  3. No authentication: Open API with no access control
  4. Polling-based status updates: Frontend polls every 2 seconds; Server-Sent Events would be more efficient
  5. No pagination on graph visualization: May struggle with graphs > 100 nodes
  6. Limited relationship types: Only 8 edge types defined; could expand to capture more semantic nuances
  7. No confidence threshold UI: Users can't filter low-confidence relationships in the explorer
  8. Single paper processing: No batch processing of multiple papers in parallel

Scaling Improvements

  1. Job queue: Implement Redis-backed queue for async processing
  2. Worker pool: Separate worker processes for LLM inference
  3. Caching: Redis cache for frequently accessed subgraphs
  4. Batch processing: Process multiple chunks in parallel
  5. Incremental updates: Update existing papers without full reprocessing
  6. Graph database migration: Consider Neo4j for complex multi-hop queries

Feature Enhancements

  1. Citation network: Extract and visualize paper citations automatically
  2. Author network: Collaboration graph between researchers
  3. Temporal analysis: Track concept evolution over time with timeline visualization
  4. Conflict resolution UI: Manual review interface for correcting AI-extracted relationships
  5. Export formats: GraphML, Cypher, RDF export for use with external graph tools
  6. Full-text search: Search across paper content and entity descriptions
  7. Semantic search: Vector embeddings for similarity-based entity discovery
  8. Confidence filtering: UI controls to hide low-confidence edges
  9. Subgraph queries: "Show me all methods that improve X and evaluate on Y"
  10. Batch reprocessing: Re-run improved prompts on existing papers to fix extractions
  11. Edge provenance display: Click edges in graph to see source text evidence
  12. Force-directed layout: Improve visualization with physics-based layouts (D3.js)
  13. Paper comparison: Side-by-side comparison of methodology and results
  14. Auto-complete search: Typeahead search for entities when adding manual relationships

Development

Project Structure

gsplat-kg/
├── apps/
│   ├── api/              # Backend API
│   │   ├── src/
│   │   │   ├── routes/   # HTTP endpoints
│   │   │   ├── db/       # Schema and queries
│   │   │   ├── agents/   # AI agent logic
│   │   │   ├── services/ # External services
│   │   │   └── pipeline/ # Orchestration
│   │   └── drizzle/      # Migrations
│   │
│   └── web/              # Frontend UI
│       └── src/
│           ├── components/
│           ├── pages/
│           ├── hooks/
│           └── lib/
│
├── packages/
│   └── shared/           # Shared types
│
└── docker-compose.yml    # PostgreSQL setup

Key Commands

# Install dependencies
pnpm install

# Start PostgreSQL
docker-compose up -d

# Push database schema
pnpm db:push

# Open Drizzle Studio (database GUI)
pnpm db:studio

# Build all packages
pnpm build

# Start development servers
pnpm dev

Bulk Papers

2511.21678,2511.21591,2511.21260

License

MIT

Example Queries

The knowledge graph supports semantic queries through dedicated API endpoints. Here are examples demonstrating the graph's capabilities:

1. Find Papers That Improve on 3D Gaussian Splatting

GET /api/graph/queries/improves-3dgs

Response:

{
  "query": "Which methods improve on 3D Gaussian Splatting?",
  "results": [
    {
      "sourceName": "Mip-Splatting",
      "sourceType": "method",
      "relationship": "improves",
      "confidence": "0.85",
      "targetName": "3D Gaussian Splatting"
    },
    {
      "sourceName": "Scaffold-GS",
      "sourceType": "method", 
      "relationship": "improves",
      "confidence": "0.90",
      "targetName": "3D Gaussian Splatting"
    }
  ],
  "count": 2
}

2. Find Papers That Extend the Original 3DGS Method

GET /api/graph/queries/extends-3dgs

This query finds all methods that explicitly extend or build upon the original Gaussian Splatting approach.

3. Find Datasets Used for Evaluation

GET /api/graph/queries/datasets

Returns all dataset nodes and which methods evaluated on them:

{
  "query": "Which datasets are used for evaluation?",
  "results": [
    {
      "dataset": "Mip-NeRF360",
      "usedBy": "3D Gaussian Splatting",
      "confidence": "0.95"
    },
    {
      "dataset": "Tanks and Temples",
      "usedBy": "Scaffold-GS",
      "confidence": "0.88"
    }
  ]
}

4. Find All Relationships for a Specific Method

GET /api/graph/queries/method-relationships?name=Gaussian

Returns both incoming and outgoing relationships for methods matching the search term.

5. Get Provenance for a Relationship

GET /api/graph/queries/provenance/:edgeId

Returns the source evidence for any extracted relationship:

{
  "edge": {
    "id": "uuid",
    "sourceId": "...",
    "targetId": "...",
    "type": "improves",
    "confidence": "0.85"
  },
  "sourceNode": { "name": "Mip-Splatting", "type": "method" },
  "targetNode": { "name": "3D Gaussian Splatting", "type": "method" },
  "provenance": [
    {
      "paperTitle": "Mip-Splatting: Alias-free 3D Gaussian Splatting",
      "paperArxivId": "2311.16493",
      "section": "abstract",
      "extractedText": "We present Mip-Splatting, which addresses aliasing artifacts in 3D Gaussian Splatting..."
    }
  ]
}

6. Get N-Hop Subgraph

GET /api/graph/subgraph?nodeId=<uuid>&depth=2

Returns all nodes and edges within N hops of a center node, useful for exploring local neighborhoods in the graph.

Raw SQL Examples

For direct database access, here are equivalent SQL queries:

Papers improving on 3DGS:

SELECT 
  source_nodes.name as improving_method,
  edges.confidence,
  papers.title as source_paper
FROM edges
JOIN nodes source_nodes ON edges.source_id = source_nodes.id
JOIN nodes target_nodes ON edges.target_id = target_nodes.id
LEFT JOIN papers ON source_nodes.paper_id = papers.id
WHERE target_nodes.name ILIKE '%Gaussian Splatting%'
  AND edges.type = 'improves'
ORDER BY edges.confidence DESC;

Methods and their datasets:

SELECT 
  method_nodes.name as method,
  dataset_nodes.name as dataset,
  edges.confidence
FROM edges
JOIN nodes method_nodes ON edges.source_id = method_nodes.id
JOIN nodes dataset_nodes ON edges.target_id = dataset_nodes.id
WHERE edges.type = 'evaluates_on'
  AND dataset_nodes.type = 'dataset'
ORDER BY method_nodes.name;

Find all relationships for a concept:

WITH target_concept AS (
  SELECT id FROM nodes WHERE name ILIKE '%novel view synthesis%'
)
SELECT 
  'outgoing' as direction,
  n.name as related_entity,
  e.type as relationship,
  e.confidence
FROM edges e
JOIN nodes n ON e.target_id = n.id
WHERE e.source_id IN (SELECT id FROM target_concept)

UNION ALL

SELECT 
  'incoming' as direction,
  n.name as related_entity,
  e.type as relationship,
  e.confidence
FROM edges e
JOIN nodes n ON e.source_id = n.id
WHERE e.target_id IN (SELECT id FROM target_concept);

Processing Papers

Single Paper Processing

# 1. Ingest paper from arXiv
curl -X POST http://localhost:3000/api/ingest/arxiv \
  -H "Content-Type: application/json" \
  -d '{"arxivId": "2308.04079", "autoProcess": true}'

# 2. Check job status
curl http://localhost:3000/api/ingest/status/<jobId>

# 3. Or manually trigger processing
curl -X POST http://localhost:3000/api/papers/<paperId>/process

Bulk Processing

# Ingest multiple papers
curl -X POST http://localhost:3000/api/ingest/bulk \
  -H "Content-Type: application/json" \
  -d '{
    "arxivIds": ["2308.04079", "2311.16493", "2312.02126"],
    "autoProcess": true
  }'

Seed Dataset

Get a curated list of Gaussian Splatting papers to ingest:

curl http://localhost:3000/api/ingest/seed/gaussian-splatting

API Endpoints Summary

Endpoint Method Description
/api/papers GET List all papers
/api/papers/:id GET Get paper details
/api/papers/:id/process POST Process paper through AI pipeline
/api/graph/nodes GET List nodes with filters
/api/graph/nodes/:id GET Get node with relationships
/api/graph/edges GET List edges with filters
/api/graph/subgraph GET Get N-hop neighborhood
/api/graph/stats GET Get graph statistics
/api/graph/queries/improves-3dgs GET Papers improving on 3DGS
/api/graph/queries/extends-3dgs GET Papers extending 3DGS
/api/graph/queries/datasets GET Datasets and their usage
/api/graph/queries/method-relationships GET Relationships for a method
/api/graph/queries/provenance/:edgeId GET Evidence for a relationship
/api/ingest/arxiv POST Ingest paper from arXiv
/api/ingest/bulk POST Bulk ingest papers
/api/ingest/status/:jobId GET Check ingestion status
/api/ingest/seed/gaussian-splatting GET Get seed paper IDs

Troubleshooting

Common Issues

Problem: "No relationships being created" (all edges show as undefined)

This was a critical bug that occurred when agent prompts use inconsistent field names.

Symptoms:

Resolved 2 relationships:
  ViLoMem --[introduces]--> multimodal semantic memory
Validated: 2 accepted, 0 rejected
Could not find node IDs for relationship:
  Source: "undefined" -> NOT FOUND
  Target: "undefined" -> NOT FOUND

Solution: Check that all three agent prompts use the same field names for relationships:

Problem: "Graph visualization shows overlapping nodes"

Solution: The circular layout in Explorer.tsx:44-70 should handle this automatically. If still overlapping, increase the radius multiplier from 15 to 20-25.

Problem: "Paper processing stuck at 0%"

Possible causes:

  1. OpenAI API key not set or invalid
  2. Database connection lost
  3. Paper has no rawText (PDF extraction failed)

Debug steps:

# Check API logs
pnpm --filter api dev

# Verify environment variables
cat apps/api/.env | grep OPENAI_API_KEY

# Check paper status in database
pnpm db:studio
# Navigate to papers table, check processingStatus and rawText fields

Problem: "Low relationship extraction rate"

If you're getting fewer edges than expected:

  1. Check confidence thresholds: Validator rejects relationships with confidence < 0.4
  2. Review extraction prompts: May need to add more relationship verbs to extraction.ts
  3. Check entity resolution: Some relationships fail because target entities weren't extracted as separate nodes
  4. Review validator logs: See which relationships are being rejected and why

Expected metrics from a working system:

  • 40-60 entities extracted per paper
  • 20-30 relationships after validation
  • 60-70% connectivity rate (edges per node)
  • 10-20% rejection rate (temporal/type mismatches)

What I Would Improve With More Time

  • Add automated test coverage (Jest + Supertest)
  • Improve graph layout using force-directed algorithm
  • Add background job queue (BullMQ) for large PDF processing
  • Add crawlers beyond arXiv (ACM, CVF, Springer)
  • Add entity alias clustering using embeddings

Production Deployment Considerations

Status Updates: Polling vs SSE

Current Implementation

  • Polling-based: Frontend calls /api/papers/processing every 2 seconds
  • Works well for development and light usage (< 10 concurrent users)
  • Simple implementation using React Query's refetchInterval

Production Improvements

  1. Smart Polling (quick win): Reduce poll frequency to 30s when no papers processing
  2. Server-Sent Events (recommended): Push-based updates for real-time status with minimal overhead
  3. WebSocket fallback: For browsers without SSE support

Trade-off: Current polling was chosen for simplicity in a take-home assignment context. For 100+ concurrent users, SSE would reduce database load by ~95% while providing faster updates.

Recommended Production Stack

  1. Background Jobs: BullMQ + Redis for paper processing queue
  2. Caching: Redis cache for frequently accessed subgraphs and statistics
  3. Database:
    • Keep PostgreSQL for transactional data
    • Consider adding Neo4j for complex graph queries
    • Add read replicas for query scaling
  4. API: Add rate limiting, authentication (JWT), and request validation
  5. Frontend: Add error boundaries, retry logic, and offline support
  6. Monitoring: Add logging (Winston), metrics (Prometheus), and tracing (OpenTelemetry)

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages