A Retrieval-Augmented Generation (RAG) application using LLM models and vector stores.
Requires Python 3.11.
- Embeddings: Uses `LlamaCppEmbeddings`; requires GGUF format models (quantized llama.cpp models)
- LLM: Uses `LlamaCpp`; requires GGUF format models
- Vector Store: FAISS (model-agnostic), via the `faiss-cpu` package
- RAG Pipeline: LangChain's `RetrievalQA` (works with any LangChain-compatible LLM)
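As a sketch of how these pieces fit together (assuming the `langchain` and `langchain-community` packages and valid GGUF model paths; the function name and parameters here are illustrative, not the application's actual code):

```python
def build_qa_chain(embeddings_model_path, llm_model_path, docs):
    """Wire LlamaCpp embeddings + FAISS + RetrievalQA into one pipeline."""
    # Imports are local so the sketch can be read without the dependencies.
    from langchain_community.embeddings import LlamaCppEmbeddings
    from langchain_community.llms import LlamaCpp
    from langchain_community.vectorstores import FAISS
    from langchain.chains import RetrievalQA

    # Embed the documents and index them in a FAISS vector store.
    embeddings = LlamaCppEmbeddings(model_path=embeddings_model_path)
    vector_store = FAISS.from_documents(docs, embeddings)

    # The LLM answers questions over the retrieved chunks.
    llm = LlamaCpp(model_path=llm_model_path, n_ctx=4096)
    return RetrievalQA.from_chain_type(
        llm=llm,
        retriever=vector_store.as_retriever(),
    )
```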
The application has been tested with the following models:

- Embeddings: `bge-small-en-v1.5-q4_k_m.gguf`
- LLM: `llama-2-7b.Q4_K_M.gguf`
Models in other formats are not supported.
The application supports the following file types:
- PDF (*.pdf)
- CSV (*.csv)
- JSON (*.json)
- HTML (*.html)
- Text (*.txt)
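How the application discovers source files is not shown here; a minimal sketch of the idea, assuming files are matched by extension under a source directory (the function name is hypothetical):

```python
from pathlib import Path

# Supported extensions, mirroring the list above.
SUPPORTED_TYPES = {"pdf", "csv", "json", "html", "txt"}

def find_source_files(source_dir: str, source_type: str) -> list[Path]:
    """Return all files of the given type directly under source_dir."""
    if source_type not in SUPPORTED_TYPES:
        raise ValueError(f"Unsupported source type: {source_type}")
    return sorted(Path(source_dir).glob(f"*.{source_type}"))
```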
The application can be configured using environment variables:
- `SOURCE_DIR`: Directory containing the source files to process
- `SOURCE_TYPE`: Type of source files to process (pdf, csv, json, html, txt)
- `EMBEDDINGS_MODEL_PATH`: Path to the embeddings model
- `LLM_MODEL_PATH`: Path to the LLM model
- `FLASK_PORT`: Port for the Flask API server
- `LANGCHAIN_TRACING_V2`: Set to "true" to enable LangSmith tracing
- `LANGCHAIN_ENDPOINT`: LangSmith API endpoint (default: https://api.smith.langchain.com)
- `LANGCHAIN_API_KEY`: Your LangSmith API key
- `LANGCHAIN_PROJECT`: LangSmith project name for organizing traces
- `CPU_THREADS`: Number of CPU threads to use (default: 6; recommended: match your CPU cores)
- `GPU_LAYERS`: Number of layers to offload to the GPU (-1 means all)
- `GPU_BATCH_SIZE`: Batch size for processing multiple inputs at once (default: 256)
- `N_CTX`: Context window size/token limit (default: 4096; should match your LLM model's context size)
- `TEMPERATURE`: Controls randomness in generation (0.0 = deterministic; higher = more random)
- `MAX_TOKENS`: Maximum number of tokens to generate in responses (default: 2000)
- `TOP_P`: Nucleus sampling probability threshold (default: 0.9)
- `TOP_K`: Limits sampling to the top K tokens (default: 40)
- `REPEAT_PENALTY`: Penalizes token repetition (default: 1.3; higher = less repetition)
- `GRAMMAR_PATH`: Optional path to a JSON grammar for structured output
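The generation-related variables can be read with the documented defaults roughly as follows (a sketch, not the application's actual code; the `TEMPERATURE` fallback of 0.0 is an assumption, since no default is documented for it):

```python
import os

def load_generation_config() -> dict:
    """Read generation settings from the environment with documented defaults."""
    return {
        "cpu_threads": int(os.environ.get("CPU_THREADS", "6")),
        "gpu_layers": int(os.environ.get("GPU_LAYERS", "-1")),
        "gpu_batch_size": int(os.environ.get("GPU_BATCH_SIZE", "256")),
        "n_ctx": int(os.environ.get("N_CTX", "4096")),
        # Assumed default: deterministic output unless overridden.
        "temperature": float(os.environ.get("TEMPERATURE", "0.0")),
        "max_tokens": int(os.environ.get("MAX_TOKENS", "2000")),
        "top_p": float(os.environ.get("TOP_P", "0.9")),
        "top_k": int(os.environ.get("TOP_K", "40")),
        "repeat_penalty": float(os.environ.get("REPEAT_PENALTY", "1.3")),
    }
```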
The easiest way to run the application is using Docker Compose:
- Make sure you have a `.env` file in the project root directory with all required environment variables (see the Configuration section above):

  ```bash
  cp .env.example .env
  ```

- Run the application:

  ```bash
  docker-compose up
  ```

The Docker container will use the environment variables from your `.env` file. Make sure this file exists before running the container.
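A compose file along these lines would match the steps above (the service name, build context, and volume path are hypothetical; only the `env_file` setting and the port from the API examples below are taken from this document):

```yaml
services:
  rag-app:
    build: .
    env_file: .env
    ports:
      - "8080:8080"
    volumes:
      - ./models:/app/models
```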
You can also run the application directly:
- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Create a `.env` file based on `.env.example` and configure your environment variables.

- Run the application:

  ```bash
  python -m rag_app.run
  ```

To force recreation of the vector store:

```bash
python -m rag_app.run --force-create
```

The application provides a simple API to query the vector store. You can access the interactive Swagger documentation at:
http://localhost:8080/api/docs
The API endpoints are:
`POST /api/query`
Request body:

```json
{
  "question": "your question here"
}
```

Example:

```bash
curl -X POST "http://localhost:8080/api/query" \
  -H "Content-Type: application/json" \
  -d '{"question": "What is the main topic of the document?"}'
```

Response:

```json
{
  "data": {
    "answer": "string",
    "question": "string"
  }
}
```

Run the tests with:
```bash
python -m unittest discover tests
```
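The `/api/query` endpoint can also be called from Python with only the standard library; a minimal client sketch (the helper names are hypothetical, and it assumes the server is running locally on port 8080 as in the curl example):

```python
import json
from urllib import request

def build_query_payload(question: str) -> bytes:
    """Encode the request body exactly as the API expects it."""
    return json.dumps({"question": question}).encode("utf-8")

def query_rag(question: str, base_url: str = "http://localhost:8080") -> dict:
    """POST a question and return the "data" object from the response."""
    req = request.Request(
        f"{base_url}/api/query",
        data=build_query_payload(question),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())["data"]
```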