A Retrieval-Augmented Generation (RAG) system designed to help Toastmasters members access accurate, document-grounded information about pathways, roles, and club processes. The system implements a three-stage pipeline: Ingestion, RAG Pipeline, and Evaluation, with advanced features like query classification, reranking, and metadata-driven filtering for optimal performance.
## Table of Contents

- Overview
- Repository Structure
- Installation
- Usage
- Pipeline Stages
- Configuration
- Key Features
- Performance Notes
- Dependencies
## Overview

This RAG system processes Toastmasters documents (PDFs and DOCX files) and provides intelligent question-answering capabilities through:
- Intelligent Query Classification: Distinguishes between broad questions (instant lookup) and specific questions (full RAG pipeline)
- Advanced Retrieval: Vector similarity search with optional metadata-driven filtering
- Reranking: Cross-encoder reranking for improved relevance
- Local LLM Generation: Document-grounded answer generation
- Comprehensive Evaluation: Metrics for retrieval quality and system performance
## Repository Structure

```
RAG_toastmasters/
│
├── ingestion/ # Stage 1: Data Ingestion Pipeline
│ ├── main.py # Main orchestration script for ingestion
│ ├── extract_data.py # PDF and DOCX extraction with table handling
│ ├── chunk.py # Text chunking using RecursiveCharacterTextSplitter
│ └── vectorise.py # Embedding generation and FAISS index creation
│
├── rag_pipeline/ # Stage 2: RAG Query Processing Pipeline
│ ├── main.py # Main RAG class orchestrating the pipeline
│ ├── query_classifier.py # Broad question classification (roles/pathways)
│ ├── core_lookup.py # Fast lookup for broad questions
│ ├── retriever.py # Vector retrieval with metadata filtering
│ ├── reranker.py # Cross-encoder reranking for relevance
│ └── generate.py # Local LLM response generation
│
├── evaluation/ # Stage 3: System Evaluation Framework
│ ├── evaluation_script.py # Comprehensive evaluation metrics
│ ├── test_cases.json # Ground truth test cases
│ └── evaluation_report.txt # Generated evaluation reports
│
├── streamlit/ # Web Interface
│ ├── app.py # Streamlit web application
│ ├── feedback_db.py # Feedback database management
│ └── suggested_questions.json # Pre-defined question suggestions
│
├── data/ # Data Directory
│ ├── raw/ # Original source documents (PDFs, DOCX)
│ ├── extracted/ # Extracted plain text files
│ ├── chunks/ # JSON files with chunked text and metadata
│ ├── vectordb/ # Vector database files
│ │ ├── vector_index.faiss # FAISS vector index
│ │ └── metadata.json # Chunk metadata with categories
│ └── static/ # Static knowledge files
│ └── core_knowledge.json # Pre-written answers for broad questions
│
├── config.yaml # Configuration file for all pipeline settings
├── requirements.txt # Python dependencies
└── README.md # This file
```
## Installation

1. Clone the repository (if applicable).
2. Install dependencies:
   ```bash
   pip install -r requirements.txt
   ```
3. Set up the OpenAI API key (if using OpenAI models):
   - Windows (PowerShell): `$env:OPENAI_API_KEY="your-api-key-here"`
   - Windows (Command Prompt): `set OPENAI_API_KEY=your-api-key-here`
   - Linux/Mac: `export OPENAI_API_KEY=your-api-key-here`
   - Permanent setup (recommended): add the line to your shell profile (`.bashrc`, `.zshrc`, etc.) or use a `.env` file

   Important: Never commit your API key to the repository. The system only reads the API key from environment variables for security.
4. Set up the data pipeline:
   - Place your source documents (PDFs, DOCX) in `data/raw/`
   - Run the ingestion pipeline (see Stage 1: Ingestion)
## Usage

### 1. Run the Ingestion Pipeline

Process raw documents and create the vector database:

```bash
cd ingestion/
python main.py
```

This will:
- Extract text from PDFs and DOCX files
- Chunk the text into manageable pieces
- Generate embeddings and create a FAISS index
- Save metadata for filtering and retrieval

### 2. Query the System (Python)

```python
from rag_pipeline.main import RAG

# Initialize the RAG pipeline
rag = RAG(config_path="config.yaml", verbose=True)

# Query the system
result = rag.query("What is the timer's role?")
print(result["response"])
```

### 3. Launch the Web Interface

Launch the Streamlit web application:

```bash
cd streamlit/
streamlit run app.py
```

The web interface provides:
- Interactive question-answering
- Suggested questions
- A feedback collection system

### 4. Evaluate the System

Evaluate system performance:

```bash
cd evaluation/
python evaluation_script.py
```

Review the generated report in `evaluation_report.txt` for:
- Retrieval quality metrics (Precision, Recall, F1, MRR)
## Pipeline Stages

### Stage 1: Ingestion

The ingestion pipeline processes raw documents and prepares them for retrieval through four main steps.

#### Step 1: Document Extraction (`extract_data.py`)

- PDF Processing: Uses PyMuPDF (fitz) for text extraction and Camelot for table extraction
- DOCX Processing: Uses python-docx to extract text and tables
- Text Cleaning: Normalizes Unicode characters, removes non-printable characters, and collapses excessive whitespace
- Output: Clean plain text files saved to `data/extracted/`
#### Step 2: Chunking (`chunk.py`)

- Uses LangChain's `RecursiveCharacterTextSplitter` for intelligent text segmentation
- Configurable chunk size (default: 1000 characters) and overlap (default: 100 characters)
- Respects paragraph boundaries, sentences, and word boundaries
- Output: JSON files saved to `data/chunks/`, with structured chunks containing:
  - `id`: chunk identifier
  - `text`: chunk content
  - `metadata`: source file information
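For illustration, a single chunk entry in one of these JSON files might look like this (the exact metadata keys are an assumption; check the files in `data/chunks/` for the real layout):

```json
{
  "id": "timer_role_003",
  "text": "The Timer is responsible for monitoring the time used by each speaker...",
  "metadata": {
    "source_file": "timer_role.pdf",
    "category": "Role"
  }
}
```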
#### Step 3: Category Inference

Automatically infers document categories based on filename patterns:
- "Role" - Role-related documents (grammarian, timer, etc.)
- "Eval" - Evaluation-related content
- "Leadership" - Leadership and officer information
- "Contest" - Speech contest rules and procedures
- "Generic" - General Toastmasters content
Categories are stored in metadata for later filtering.
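A minimal sketch of this kind of filename-based inference, with illustrative keyword lists (the real patterns live in the ingestion code):

```python
def infer_category(filename: str) -> str:
    """Infer a document category from keywords in its filename."""
    name = filename.lower()
    # Illustrative patterns; the first match wins.
    patterns = [
        ("Role", ("role", "grammarian", "timer")),
        ("Eval", ("eval",)),
        ("Leadership", ("leadership", "officer")),
        ("Contest", ("contest",)),
    ]
    for category, keywords in patterns:
        if any(kw in name for kw in keywords):
            return category
    return "Generic"  # Fallback for general Toastmasters content

print(infer_category("timer_role.pdf"))  # -> "Role"
```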
#### Step 4: Vectorisation (`vectorise.py`)

- Generates embeddings using a sentence-transformers model (`all-MiniLM-L6-v2` by default)
- Creates FAISS indices for fast similarity search:
  - Main index: `data/vectordb/vector_index.faiss` (all chunks, for backward compatibility)
  - Category indices: `data/vectordb/vector_index_{Category}.faiss` (one per category, for latency reduction)
- Assigns a `global_id` to each chunk for tracking
- Stores metadata linking chunks to source files and categories
- Output:
  - `data/vectordb/vector_index.faiss` - main FAISS vector index (all chunks)
  - `data/vectordb/metadata.json` - complete metadata mapping
  - `data/vectordb/vector_index_{Category}.faiss` - category-specific indices (Role, Eval, Leadership, Contest, Generic)
  - `data/vectordb/metadata_{Category}.json` - category-specific metadata files
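A condensed sketch of this step, assuming the chunk JSON layout shown earlier and a single combined chunks file (a hypothetical path; the real script may read per-source files from `data/chunks/`):

```python
import json

import faiss
from sentence_transformers import SentenceTransformer

# Hypothetical combined chunks file; the real pipeline may use per-source files.
with open("data/chunks/chunks.json") as f:
    chunks = json.load(f)

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode([c["text"] for c in chunks], convert_to_numpy=True)

# L2-normalize so inner-product search is equivalent to cosine similarity.
faiss.normalize_L2(embeddings)
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(embeddings)
faiss.write_index(index, "data/vectordb/vector_index.faiss")

# Persist metadata with a global_id per chunk for later lookup and filtering.
metadata = [{"global_id": i, **c["metadata"]} for i, c in enumerate(chunks)]
with open("data/vectordb/metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)
```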
### Stage 2: RAG Pipeline

The RAG pipeline processes user queries through a multi-step pipeline designed to provide accurate, contextually relevant answers.

```
Query → Classification → [Broad Questions → Fast Lookup]
                ↓
       [Specific Questions → Full RAG Pipeline]
                ↓
       Retrieval → Reranking → Generation
```
#### Step 1: Query Classification (`query_classifier.py`)

The system implements a broad-question classifier that distinguishes between general questions and specific questions.

Broad Questions:
- Pattern-based classification using keyword matching
- Categories detected: "roles", "pathways"
- Handled via `CoreLookup` for instant responses
- Uses pre-written answers from `data/static/core_knowledge.json`
- Benefits: zero latency, always accurate for predefined categories

Example Broad Patterns:
- "different roles", "list of roles", "all roles"
- "different pathways", "list of pathways", "all pathways"
Specific Questions:
- All other queries routed to full RAG pipeline
- Undergo vector retrieval, reranking, and generation
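A minimal sketch of the pattern-based routing (the patterns shown are the examples above; the real set lives in `query_classifier.py`):

```python
BROAD_PATTERNS = {
    "roles": ("different roles", "list of roles", "all roles"),
    "pathways": ("different pathways", "list of pathways", "all pathways"),
}

def classify_query(query: str) -> str | None:
    """Return a broad category name, or None for a specific question."""
    q = query.lower()
    for category, patterns in BROAD_PATTERNS.items():
        if any(p in q for p in patterns):
            return category  # answered instantly from core_knowledge.json
    return None  # routed to the full RAG pipeline
```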
#### Step 2: Retrieval (`retriever.py`)

The retriever performs semantic similarity search using the FAISS vector index.
Key Features:
- Encodes query using same embedding model as ingestion
- Performs cosine similarity search (L2-normalized vectors)
- Returns top-k most similar chunks with relevance scores
- Supports metadata-driven filtering (see below)
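In sketch form, retrieval from the main index looks like this (assuming the index and metadata files produced by the ingestion sketch above):

```python
import json

import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
index = faiss.read_index("data/vectordb/vector_index.faiss")
with open("data/vectordb/metadata.json") as f:
    metadata = json.load(f)

def retrieve(query: str, top_k: int = 8) -> list[tuple[dict, float]]:
    """Return (chunk metadata, score) pairs for the top-k most similar chunks."""
    vec = model.encode([query], convert_to_numpy=True)
    faiss.normalize_L2(vec)  # match the normalization used at index time
    scores, ids = index.search(vec, top_k)
    return [(metadata[i], float(s)) for i, s in zip(ids[0], scores[0])]
```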
#### Metadata-Driven Filtering

To improve both latency and relevance, the system implements intelligent metadata filtering using category-specific FAISS indices. Searching only the relevant category's index, rather than the full index, reduces search latency directly.

How It Works:
1. Query Analysis: Analyzes query keywords to infer a category:
   - Contest keywords → "Contest" category
   - Role keywords (grammarian, timer, etc.) → "Role" category
   - Leadership keywords (president, VPE, etc.) → "Leadership" category
   - Evaluation keywords → "Eval" category
2. Category-Specific Indices (created during ingestion):
   - Separate FAISS indices are created for each category
   - Each category index contains only vectors from that category
   - Files saved: `vector_index_{Category}.faiss` and `metadata_{Category}.json`
   - The main index (`vector_index.faiss`) is still created for backward compatibility
3. Filtered Retrieval with Latency Reduction:
   - When a category is detected and a category-specific index exists, the retriever searches only that category's index (much smaller, so faster). No post-filtering is needed, since all results are already in the relevant category. For example, if the Role category holds 20% of the chunks, the search is roughly 5x faster.
   - When no category is detected or the category index is unavailable, the retriever falls back to the main index with post-retrieval filtering (slower, but still works).
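A sketch of this index-selection logic (the keyword lists and helper name are illustrative):

```python
import os

import faiss

# Illustrative keyword lists; the real mapping lives in the retriever.
CATEGORY_KEYWORDS = {
    "Contest": ("contest",),
    "Role": ("grammarian", "timer", "ah-counter"),
    "Leadership": ("president", "vpe", "officer"),
    "Eval": ("evaluation", "evaluator"),
}

def pick_index(query: str, base_dir: str = "data/vectordb"):
    """Return (index, category), preferring a category-specific index when available."""
    q = query.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        path = os.path.join(base_dir, f"vector_index_{category}.faiss")
        if any(kw in q for kw in keywords) and os.path.exists(path):
            return faiss.read_index(path), category  # smaller index, faster search
    # No category detected (or its index is missing): fall back to the main index.
    return faiss.read_index(os.path.join(base_dir, "vector_index.faiss")), None
```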
Performance Benefits:
- True Latency Reduction: Searches only relevant category's index (30-80% faster depending on category size)
- Better Relevance: Eliminates irrelevant category results
- Automatic: No manual filtering required
- Backward Compatible: Falls back to main index if category indices unavailable
Configuration:

```yaml
filters:
  enable: true  # Enable automatic category-based filtering
```

#### Step 3: Reranking (`reranker.py`)

After initial retrieval, the system optionally applies cross-encoder reranking to improve result relevance beyond what pure embedding similarity provides.
How It Works:
- Uses a cross-encoder model (`ms-marco-MiniLM-L-6-v2` by default)
- Scores each query-document pair independently
- Reranks retrieved chunks by relevance scores
- Selects top-k reranked chunks for generation
Performance Benefits:
- Improved Relevance: Cross-encoders consider query-document interaction
- Better Precision: Filters out false positives from initial retrieval
- Contextual Understanding: Better semantic matching than embeddings alone
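A minimal sketch using the CrossEncoder wrapper from `sentence-transformers`:

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, chunks: list[dict], top_k: int = 5) -> list[dict]:
    """Score each (query, chunk text) pair and keep the top_k highest-scoring chunks."""
    scores = reranker.predict([(query, c["text"]) for c in chunks])
    ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]
```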
Configuration:

```yaml
reranker:
  enable: true
  top_k: 5  # Number of top chunks after reranking
```

#### Step 4: Generation (`generate.py`)

The final step generates natural language responses using a local LLM.
- Takes selected chunks and user query
- Constructs RAG prompt with context and question
- Generates coherent, document-grounded answer
- Model: configurable via `config.yaml` (default: `phi3`)
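The prompt assembly might look roughly like this (the template wording is illustrative; `generate.py` holds the actual prompt):

```python
def build_prompt(query: str, chunks: list[dict]) -> str:
    """Assemble a document-grounded prompt from the selected chunks."""
    context = "\n\n".join(c["text"] for c in chunks)
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )
```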
### Stage 3: Evaluation

The evaluation framework provides comprehensive metrics to assess system performance across multiple dimensions.

- Retrieval Quality Metrics:
  - Precision@k: proportion of retrieved chunks that are relevant
  - Recall@k: proportion of relevant chunks that were retrieved
  - F1@k: harmonic mean of precision and recall
  - Mean Reciprocal Rank (MRR): quality of the first relevant result's ranking
  - Average Retrieval Time: latency measurement
- Coverage Analysis:
  - Total chunks and sources
  - Chunks-per-source distribution
  - Source balance score (lower is more balanced)
- Latency Benchmarking:
  - Mean, median, P95, and P99 latency metrics
  - Statistical analysis across multiple runs
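For reference, the core retrieval metrics reduce to a few lines each (a sketch; `evaluation_script.py` is the authoritative implementation):

```python
def precision_recall_f1_at_k(retrieved: list[int], relevant: set[int], k: int):
    """Precision@k, Recall@k, and F1@k over retrieved global_ids."""
    hits = sum(1 for gid in retrieved[:k] if gid in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def reciprocal_rank(retrieved: list[int], relevant: set[int]) -> float:
    """1/rank of the first relevant hit (0.0 if none); MRR averages this over queries."""
    for rank, gid in enumerate(retrieved, start=1):
        if gid in relevant:
            return 1.0 / rank
    return 0.0
```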
Running an Evaluation:
1. Load test cases with ground truth (`test_cases.json`), in the format:
   `{"query": "...", "relevant_global_ids": [...]}`
2. Run the evaluation:
   ```bash
   cd evaluation/
   python evaluation_script.py
   ```
3. Review the generated report (`evaluation_report.txt`):
   - System statistics
   - Knowledge base coverage
   - Retrieval quality metrics
   - Performance benchmarks
Utility script to generate test cases for evaluation:
- Create queries covering different categories
- Manually or automatically annotate relevant chunks
- Export to JSON format for evaluation
## Configuration

The system is fully configurable via `config.yaml`:

```yaml
rag_pipeline:
  # Paths
  index_path: "../data/vectordb/vector_index.faiss"
  metadata_path: "../data/vectordb/metadata.json"
  chunks_path: "../data/chunks"
  core_knowledge_path: "../data/static/core_knowledge.json"

  # Models
  embedding_model: "all-MiniLM-L6-v2"
  reranker_model: "cross-encoder/ms-marco-MiniLM-L-6-v2"
  llm_model: "gpt-4o-mini"  # Options: "gpt-3.5-turbo", "gpt-4", "gpt-4-turbo", "gpt-4o", "gpt-4o-mini", or "phi3" for Ollama

  # OpenAI settings (required if using OpenAI models)
  # IMPORTANT: the API key must be set via the OPENAI_API_KEY environment variable
  openai:
    base_url: null     # Optional: custom base URL for OpenAI-compatible APIs (e.g., Azure OpenAI)
    temperature: 0.7   # Sampling temperature (0.0 to 2.0)
    max_tokens: null   # Maximum tokens to generate (null = model default)

  # Retrieval settings
  top_k_retrieve: 8    # Initial retrieval count

  # Metadata filtering
  filters:
    enable: false      # Enable automatic category-based filtering

  # Reranking settings
  reranker:
    enable: false
    top_k: 5           # Top chunks after reranking

  # Chunking settings (for ingestion)
  chunk_size: 1000
  chunk_overlap: 100

  # Verbose output
  verbose: false
```

Security Note: The OpenAI API key is never stored in `config.yaml`. It must be set as the `OPENAI_API_KEY` environment variable for security. See Installation for setup instructions.
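For completeness, a sketch of loading these settings (assuming the nesting shown above):

```python
import os

import yaml

with open("config.yaml") as f:
    cfg = yaml.safe_load(f)["rag_pipeline"]

top_k = cfg.get("top_k_retrieve", 8)
rerank_enabled = cfg.get("reranker", {}).get("enable", False)
# The API key comes from the environment, never from config.yaml.
api_key = os.environ.get("OPENAI_API_KEY")
```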
## Key Features

### Intelligent Query Classification
- Fast pattern-based classification for general questions
- Instant responses via `core_knowledge.json` lookup
- Zero retrieval overhead for predefined categories

### Cross-Encoder Reranking
- Cross-encoder reranking for superior relevance
- Filters false positives from initial retrieval
- Better semantic understanding than embeddings alone
- Configurable enable/disable

### Metadata-Driven Filtering
- Automatic category detection from query keywords
- Category-specific FAISS indices created during ingestion for true latency reduction
- Searches only the relevant category's index (30-80% faster than a full-index search)
- Better relevance by eliminating irrelevant categories
- Automatic fallback to the main index if no category is detected
- Configurable enable/disable

### Comprehensive Evaluation
- Precision, Recall, F1, and MRR metrics
- Latency benchmarking
- Coverage analysis
- Automated report generation

### Flexible Architecture
- Modular design allows easy component swapping
- Configurable via YAML without code changes
- Supports both broad and specific queries efficiently
## Performance Notes

- Metadata filtering with category indices: Reduces retrieval latency by 30-80% when a category is detected (searches the smaller category index instead of the full index)
- Metadata filtering without category indices: Improves relevance but doesn't reduce latency (filters after retrieval from full index)
- Reranking adds ~100-200ms per query but improves precision significantly
- Broad question classification provides instant responses (<10ms)
- Vector retrieval (full index): typically 50-150ms depending on index size
- Vector retrieval (category index): typically 10-50ms (much faster!)
- Total pipeline latency (with category filtering, without reranking): ~100-200ms
- Total pipeline latency (with category filtering, with reranking): ~200-400ms
- Total pipeline latency (without filtering, without reranking): ~200-300ms
- Total pipeline latency (without filtering, with reranking): ~300-500ms
## Dependencies

- sentence-transformers: Embedding generation
- faiss-cpu: Vector similarity search
- langchain: Text splitting and document handling
- pymupdf: PDF text extraction
- camelot-py: PDF table extraction
- python-docx: DOCX processing
- pyyaml: Configuration management
- streamlit: Web interface (optional)