RAG Question Answering System
A production-grade Retrieval-Augmented Generation pipeline that answers questions grounded strictly in your documents with source citations.
- Hybrid Retrieval Strategy: Combines Dense Vector Similarity (semantic meaning) with BM25 Keyword Search (exact terms) using Reciprocal Rank Fusion (RRF) for highly accurate context retrieval.
- Production-ready Vector Storage: Uses Neon (Serverless PostgreSQL + pgvector) for robust, persistent cloud storage.
- Open-Weight LLM Integration: Powered by Llama 3.3-70B, whose weights are openly available. Inference is delegated to the Groq API for low-latency responses without requiring heavy local GPU resources.
- Smart Semantic Chunking: Splits documents based on shifts in topic and semantics, not arbitrary character counts, preserving contextual meaning.
This system lets you ingest a set of PDF or text documents and ask natural language questions about them. Every answer is generated strictly from the retrieved document context, with citations showing exactly which file and page the answer came from.
Under the hood: documents are semantically chunked, embedded using a local model, and stored in a PostgreSQL vector database. At query time, a hybrid retriever combining dense vector search and BM25 keyword search finds the most relevant passages, which are injected into a grounded prompt and sent to Llama 3.3-70B via Groq for generation.
Here is a visual walk-through of the end-to-end functionality using the Next.js frontend and FastAPI backend.
Documents intended for the knowledge base are placed into the backend/docs/ directory. For this demo, we use the original "Attention Is All You Need" paper and a sample text file.
The documents are processed via the /ingest endpoint. They are semantically chunked, embedded using BAAI/bge-small-en-v1.5, and safely stored in the Neon PostgreSQL database.
Questions asked through the UI are answered strictly from the retrieved context. The exact source chunks, including file names and page numbers (for PDFs), are shown below the answer.
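As a rough illustration of how retrieved chunks could be turned into the citation entries shown in the UI, here is a minimal sketch. The input shape (a list of dicts with `text` and `metadata`) and the `extract_sources` helper are assumptions made for this example, not the project's actual code. Only the `file_name` and `page_label` metadata keys come from the loader described in this README.

```python
def extract_sources(retrieved, excerpt_chars=200):
    """Map retrieved chunks to citation entries with file, page, and excerpt.

    `retrieved` is a hypothetical list of dicts, each holding the chunk text
    and the metadata populated at load time (file_name, page_label).
    """
    sources = []
    for item in retrieved:
        meta = item.get("metadata", {})
        sources.append(
            {
                "file": meta.get("file_name", "unknown"),
                # Plain text files carry no page label, so fall back to "N/A".
                "page": str(meta.get("page_label", "N/A")),
                "excerpt": item["text"][:excerpt_chars],
            }
        )
    return sources


# Made-up sample data, for illustration only.
sample_hits = [
    {"text": "Sample PDF passage.", "metadata": {"file_name": "example.pdf", "page_label": "3"}},
    {"text": "Sample text passage.", "metadata": {"file_name": "notes.txt"}},
]
sources = extract_sources(sample_hits)
```

Each entry mirrors the `file`, `page`, and `excerpt` fields of the /ask response, so the frontend can render it without further mapping.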
The prompt rigidly enforces grounding. If a question is asked that cannot be answered using the retrieved chunks, the system safely refuses rather than hallucinating an answer.
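To make the grounding behaviour concrete, here is a minimal sketch of what such a prompt could look like. The template wording, the `REFUSAL` message, and the `build_prompt` helper are all hypothetical. The real template lives in generation/prompt.py and may differ.

```python
# Hypothetical template: the wording is an assumption, not the project's prompt.
GROUNDED_QA_TEMPLATE = (
    "Answer the question using only the context below.\n"
    "Cite the file and page for every claim.\n"
    "If the context does not contain the answer, reply exactly: {refusal}\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n"
    "Answer:"
)

REFUSAL = "I cannot answer this from the provided documents."


def build_prompt(chunks, question):
    """Join the retrieved chunks into one context block and fill the template."""
    context = "\n---\n".join(chunks)
    return GROUNDED_QA_TEMPLATE.format(
        refusal=REFUSAL, context=context, question=question
    )


# Made-up chunk text, for illustration only.
prompt = build_prompt(
    ["[example.pdf, page 3] Sample passage about attention."],
    "What is multi-head attention?",
)
```

Spelling out the exact refusal string in the prompt makes refusals easy to detect downstream, for example to hide the citation panel when no answer was found.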
┌─────────────────────────────────────────────────────────────────┐
│ INGESTION PIPELINE │
│ │
│ PDF / TXT Files │
│ │ │
│ ▼ │
│ SimpleDirectoryReader (ingestion/loader.py) │
│ │ Loads files, populates file_name + page_label │
│ ▼ │
│ SemanticSplitterNodeParser (ingestion/chunker.py) │
│ │ Splits on topic shifts, not character count │
│ ▼ │
│ HuggingFaceEmbedding (ingestion/embedder.py) │
│ │ BAAI/bge-small-en-v1.5 → 384-dim vectors │
│ ▼ │
│ PGVectorStore (retrieval/vectorstore.py) │
│ │ Neon (pgvector) │
│ ▼ │
│ Stored ✓ │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ QUERY PIPELINE │
│ │
│ User Question │
│ │ │
│ ▼ │
│ VectorIndexRetriever ──┐ │
│ (dense similarity search) │ (retrieval/retriever.py) │
│ ├─► QueryFusionRetriever │
│ BM25Retriever ──┘ (RRF merge) │
│ (keyword search, in-memory) │ │
│ ▼ │
│ Top-K fused chunks │
│ │ │
│ ▼ │
│ Custom QA PromptTemplate (generation/prompt.py) │
│ │ Injects context + enforces grounding │
│ ▼ │
│ RetrieverQueryEngine (generation/pipeline.py) │
│ │ compact response mode │
│ ▼ │
│ Groq API — Llama 3.3-70B (generation/llm.py) │
│ │ │
│ ▼ │
│ Answer + Source Citations (api/routes.py → FastAPI) │
│ │ │
│ ▼ │
│ Next.js Frontend (frontend/) │
└─────────────────────────────────────────────────────────────────┘
| Layer | Choice | Reason |
|---|---|---|
| Language | Python 3.11+ | Assignment requirement |
| RAG Orchestration | LlamaIndex | First-class pgvector support, modular retriever/engine design |
| Embeddings | BAAI/bge-small-en-v1.5 | Outperforms MiniLM on retrieval benchmarks, 384-dim, lightweight |
| Chunking | SemanticSplitterNodeParser | Preserves topical coherence over naive character splitting |
| Vector DB | Neon (PostgreSQL + pgvector) | Production-grade, persistent, cloud-native |
| Retrieval | Hybrid: Dense + BM25 via RRF | Dense handles semantics, BM25 handles exact terms, RRF fuses both |
| LLM | Llama 3.3-70B via Groq | Open-weight model, low-latency inference, free tier |
| API | FastAPI | Async, typed, assignment suggestion |
| Frontend | Next.js 16 + shadcn/ui | Clean chat UI, collapsible source citations |
helpora-rag/
├── backend/
│ ├── config.py # All env vars — single source of truth
│ ├── ingestion/
│ │ ├── embedder.py # BGE embedding model (singleton)
│ │ ├── loader.py # PDF/TXT loader via SimpleDirectoryReader
│ │ └── chunker.py # Semantic chunking with metadata enrichment
│ ├── retrieval/
│ │ ├── vectorstore.py # Neon pgvector integration
│ │ └── retriever.py # Hybrid retriever (dense + BM25 + RRF)
│ ├── generation/
│ │ ├── llm.py # Groq LLM wrapper
│ │ ├── prompt.py # Grounded QA prompt template
│ │ └── pipeline.py # RetrieverQueryEngine + source extraction
│ ├── api/
│ │ ├── main.py # FastAPI app, lifespan, CORS
│ │ ├── routes.py # /health, /ingest, /ask endpoints
│ │ └── schemas.py # Pydantic request/response models
│ ├── ingest_docs.py # CLI ingestion script
│ ├── docs/ # Place your PDFs here
│ └── requirements.txt
└── frontend/
└── src/
├── app/ # Next.js App Router
├── components/ # ChatWindow, MessageBubble, InputBar
├── lib/api.ts # Typed fetch wrapper for FastAPI
└── types/index.ts # Shared TypeScript interfaces
- Python 3.11+
- Node.js 18+ and pnpm
- A Neon account (free tier is sufficient)
- A Groq API key (free tier)
git clone <your-repo-url>
cd helpora-rag
# Backend
cd backend
python -m venv venv
source venv/bin/activate # macOS/Linux
# .\venv\Scripts\Activate.ps1 # Windows PowerShell
pip install -r requirements.txt
# Frontend (from project root)
cd frontend
pnpm install
In your Neon dashboard, open the SQL Editor and run once:
CREATE EXTENSION IF NOT EXISTS vector;
LlamaIndex handles table creation automatically after this. You never write CREATE TABLE yourself.
cd backend
cp .env.example .env
Edit backend/.env:
GROQ_API_KEY=gsk_your_key_here
NEON_DATABASE_URL=postgresql://user:password@ep-xxxx.us-east-2.aws.neon.tech/dbname?sslmode=require
VECTOR_STORE=neon
EMBEDDING_MODEL=BAAI/bge-small-en-v1.5
TOP_K=5
Edit frontend/.env.local:
NEXT_PUBLIC_API_URL=http://localhost:8000
Place your PDF or TXT files in backend/docs/. The knowledge base used in this project:
- attention_is_all_you_need.pdf: The original "Attention Is All You Need" paper by Vaswani et al., 2017
- sample_paper.txt: A sample test document discussing RAG and LLM patterns
These files demonstrate the system's ability to extract, index, and retrieve content from both complex PDFs and plain text files.
cd backend
python ingest_docs.py --path ./docs
Expected output:
2025-01-01 12:00:00 | INFO | 📂 Loading documents from: ./docs
2025-01-01 12:00:01 | INFO | → Loaded 3 document(s)
2025-01-01 12:00:01 | INFO | ✂️ Chunking documents semantically...
2025-01-01 12:00:45 | INFO | → Created 142 chunk(s)
2025-01-01 12:00:45 | INFO | 📦 Embedding and storing chunks...
2025-01-01 12:01:10 | INFO | → Stored 142 chunk(s) in vector store
2025-01-01 12:01:10 | INFO | ✅ Ingestion complete!
cd backend
uvicorn api.main:app --reload --host 0.0.0.0 --port 8000
Verify it's running:
curl http://localhost:8000/health
# {"status":"healthy","vector_store":"neon","llm":"llama-3.3-70b-versatile"}
cd frontend
pnpm dev
Open http://localhost:3000.
Returns API status and backend configuration.
{
"status": "healthy",
"vector_store": "neon",
"llm": "llama-3.3-70b-versatile"
}
Ingest documents from a directory.
// Request
{ "path": "./docs" }
// Response
{
"message": "Successfully ingested 3 document(s)",
"chunks_ingested": 142
}
Ask a question. Returns answer + source citations.
// Request
{ "question": "What is retrieval-augmented generation?" }
// Response
{
"answer": "Retrieval-Augmented Generation (RAG) is... [Source: rag_paper.pdf, Page 1]",
"sources": [
{
"file": "rag_paper.pdf",
"page": "1",
"excerpt": "Retrieval-Augmented Generation combines..."
}
]
}
I opted for Neon with pgvector for vector storage rather than a local file-based database. It's cloud-native, always available, and removes local setup friction, ensuring the database state persists predictably across demo runs for anyone reviewing the application.
Instead of the typical MiniLM model, I chose BGE-small (BAAI/bge-small-en-v1.5). It outperforms MiniLM on the BEIR retrieval benchmark suite while producing embeddings of the same size (384 dimensions), so retrieval quality improves without increasing vector storage size.
Instead of using naive recursive character splitting, which can slice sentences in half or separate definitions from their examples, I used a semantic splitter (SemanticSplitterNodeParser). It compares the embeddings of adjacent sentences and cuts only where similarity drops, which signals a topic shift. This keeps each chunk topically coherent and gives the retriever more self-contained passages to match.
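The idea can be sketched in a few lines of plain Python. This is a simplification under stated assumptions: it uses a fixed similarity threshold and hand-written toy vectors, whereas LlamaIndex's parser applies its own breakpoint logic on real embeddings. The `semantic_chunks` helper and its `threshold` parameter are hypothetical names for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def semantic_chunks(sentences, embeddings, threshold=0.5):
    """Group consecutive sentences, starting a new chunk when similarity drops."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for i in range(1, len(sentences)):
        # A low similarity to the previous sentence is treated as a topic shift.
        if cosine_similarity(embeddings[i - 1], embeddings[i]) < threshold:
            chunks.append([])
        chunks[-1].append(sentences[i])
    return [" ".join(chunk) for chunk in chunks]


sentences = [
    "Attention maps queries to keys.",
    "Each head attends to different positions.",
    "BM25 ranks documents by term frequency.",
]
# Toy 2-dimensional vectors standing in for real sentence embeddings.
toy_embeddings = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0]]
chunks = semantic_chunks(sentences, toy_embeddings)
```

With these toy vectors the first two sentences stay together and the third starts a new chunk, because only the second boundary falls below the threshold.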
Standard dense vector search struggles with exact-term matching (like specific acronyms, proper nouns, or version numbers). To solve this, I built a Hybrid Retriever that couples semantic dense search with an exact-keyword BM25 index. The results from both lists are combined using Reciprocal Rank Fusion (RRF), ensuring robust retrieval for both broad conceptual questions and highly specific keyword queries.
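Reciprocal Rank Fusion itself is simple: each chunk earns 1 / (k + rank) from every list it appears in, and the sums decide the final order. The sketch below is a standalone illustration, not the project's retriever, which relies on LlamaIndex's QueryFusionRetriever. The function name, the chunk IDs, and the choice of k = 60 (a commonly used constant) are assumptions for this example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60, top_k=5):
    """Fuse several ranked lists of chunk IDs into a single ranking."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, chunk_id in enumerate(ranking, start=1):
            # Higher-ranked items contribute more; k dampens the top ranks.
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by fused score (descending), breaking ties by ID for determinism.
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [chunk_id for chunk_id, _ in fused[:top_k]]


# Hypothetical results from the two retrievers, best match first.
dense_hits = ["c3", "c1", "c7"]
bm25_hits = ["c1", "c9", "c3"]
fused = reciprocal_rank_fusion([dense_hits, bm25_hits])
```

Chunks that appear in both lists (here `c1` and `c3`) rise to the top, since RRF only uses ranks and never has to reconcile the incompatible score scales of cosine similarity and BM25.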
The assignment encouraged using an open-source model. I used the open-weight Llama 3.3-70B but chose to delegate inference to the Groq API. Running a 70-billion-parameter model locally requires far more GPU memory than most consumer hardware provides. Serving an open-weight model through Groq honors the assignment's constraints while keeping responses fast and avoiding any local GPU requirement.
I preferred LlamaIndex over LangChain for building the pipeline. LlamaIndex is purpose-built for RAG, and its native abstractions such as QueryFusionRetriever and SemanticSplitterNodeParser fit this architecture cleanly out of the box.
- No streaming: The /ask endpoint returns a complete response. A streaming implementation via StreamingResponse and Groq's streaming API would improve perceived latency for long answers.
- Index rebuilt per request: load_existing_index() is called on every /ask request. In production this would be cached at startup and refreshed only after /ingest.
- BM25 runs in-memory: BM25 retrieval loads all nodes into memory. For very large corpora this would need to move to a dedicated sparse index (e.g. Elasticsearch BM25).
- No re-ranking: A cross-encoder re-ranker (e.g. BAAI/bge-reranker-base) applied after hybrid retrieval would further improve precision on the final top-k.
- Single-turn only: The system has no conversation memory. Follow-up questions referencing previous answers are treated as independent queries.
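The per-request index rebuild listed above could be avoided with a small cache that loads the index once and reloads it only after ingestion. The following is a minimal sketch under stated assumptions: `IndexCache` is a hypothetical helper, and `fake_load_existing_index` is a stand-in for the project's load_existing_index(), whose real signature is not shown in this README.

```python
import threading


class IndexCache:
    """Load the index lazily, reuse it across requests, reload after invalidation."""

    def __init__(self, loader):
        self._loader = loader
        self._index = None
        self._lock = threading.Lock()

    def get(self):
        # The lock keeps concurrent requests from loading the index twice.
        with self._lock:
            if self._index is None:
                self._index = self._loader()
            return self._index

    def invalidate(self):
        # Call this after a successful ingestion so the next read reloads.
        with self._lock:
            self._index = None


load_calls = []


def fake_load_existing_index():
    """Stand-in loader that records how often it is called."""
    load_calls.append(1)
    return {"version": len(load_calls)}


cache = IndexCache(fake_load_existing_index)
first = cache.get()    # triggers the first load
second = cache.get()   # served from the cache
cache.invalidate()     # simulates what /ingest would do
third = cache.get()    # triggers a reload
```

In a FastAPI app the cache instance could be created during startup and shared by the /ask and /ingest handlers, though how that is wired up depends on the project's existing lifespan code.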






