LLKMS is a powerful tool for processing and querying documents in various formats, designed to support language learning and knowledge management. It integrates with Amazon S3 for cloud storage, uses LangChain and FAISS for advanced Retrieval Augmented Generation (RAG), and supports configurable language models like OpenAI and DeepSeek. Whether you’re a learner, researcher, or knowledge enthusiast, LLKMS makes it easy to manage and extract insights from your documents.
- Multi-format Support: Process `.pdf`, `.txt`, `.png`/`.jpg`/`.jpeg` (with OCR), `.docx`, and `.html`/`.htm` files.
- Cloud Integration: Seamlessly connect to Amazon S3 for document storage and retrieval.
- Smart Retrieval: Leverage RAG with FAISS for fast, context-aware answers (limited to three sentences).
- Flexible Models: Use language models from OpenAI, DeepSeek, or others via a configurable `ModelFactory`.
- Usage Tracking: Monitor token usage and API costs with a summary on exit.
- Detailed Logging: Comprehensive logs for debugging and transparency (`logs/llkms.log`).
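
As a rough illustration of the multi-format support listed above, the sketch below maps a file extension to a processing category. Only the extensions come from the feature list; the function name `classify_document` and the category labels are hypothetical and are not taken from the LLKMS source.

```python
from pathlib import Path

# Extensions come from the feature list; the category labels are illustrative.
EXTENSION_CATEGORIES = {
    ".pdf": "pdf",
    ".txt": "text",
    ".png": "image_ocr",
    ".jpg": "image_ocr",
    ".jpeg": "image_ocr",
    ".docx": "docx",
    ".html": "html",
    ".htm": "html",
}


def classify_document(filename: str) -> str:
    """Return the processing category for a file, or raise ValueError if unsupported."""
    suffix = Path(filename).suffix.lower()  # case-insensitive match on the extension
    try:
        return EXTENSION_CATEGORIES[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file type: {filename!r}") from None
```

A lookup table like this keeps the list of supported formats in one place, so adding a format would only need a new entry and a matching loader.
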
- Python: 3.9 or higher
- API Keys:
  - OpenAI (`OPENAI_API_KEY`) or DeepSeek (`DEEPSEEK_API_KEY`)
  - AWS (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`) for S3
- Tesseract OCR: For image processing (install separately)
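
Before launching, it can help to confirm that the required keys are actually set. The following is a minimal sketch, not part of LLKMS: the helper `missing_credentials` is hypothetical, and it assumes that one LLM provider key plus both AWS variables are enough, as the list above suggests.

```python
import os

# Variable names are taken from the prerequisites list above.
LLM_KEYS = ("OPENAI_API_KEY", "DEEPSEEK_API_KEY")
AWS_KEYS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def missing_credentials(env=None):
    """Return a list of human-readable problems; an empty list means all set."""
    env = os.environ if env is None else env
    problems = []
    # Assumption: at least one LLM provider key is needed.
    if not any(env.get(name) for name in LLM_KEYS):
        problems.append("set OPENAI_API_KEY or DEEPSEEK_API_KEY")
    # Assumption: both AWS variables are needed for S3 access.
    for name in AWS_KEYS:
        if not env.get(name):
            problems.append(f"set {name}")
    return problems


if __name__ == "__main__":
    for problem in missing_credentials():
        print("Missing credential:", problem)
```
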
- Clone the Repository
  - `git clone https://github.com/Butterski/llkms.git`
  - `cd llkms`
- Install Dependencies
  - `pip install -r requirements.txt`
- Set Up Environment Variables
  - Copy the example `.env` file: `cp .env.example .env`
  - Edit `.env` with your credentials:

        AWS_ACCESS_KEY_ID=your_aws_access_key
        AWS_SECRET_ACCESS_KEY=your_aws_secret_key
        OPENAI_API_KEY=your_openai_api_key  # Optional
        DEEPSEEK_API_KEY=your_deepseek_api_key  # Optional

- Install Tesseract OCR
  - See the Tesseract installation guide for your OS.
- Start the Application
  - `python src/llkms/main.py`
  - Loads `config.yaml`, connects to S3, processes documents, and opens an interactive menu.
- Query Your Documents
  - Select "RAG Pipeline with S3" from the menu.
  - Ask questions (e.g., "What’s in my documents?") and get concise answers.
  - Optionally view retrieved documents.
  - Type `quit` to exit and see usage stats.
- Force Reindexing
  - Rebuild the vector store (skips cache): `python src/llkms/main.py --reindex`
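
To make the effect of `--reindex` concrete, here is a small sketch of how such a flag could be parsed and combined with a cache check. How the real `main.py` handles the flag is not shown in this README, so `parse_args` and `should_rebuild` are hypothetical names and the logic is an assumption.

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    """Parse the command line; only the --reindex flag comes from the README."""
    parser = argparse.ArgumentParser(description="LLKMS launcher (illustrative)")
    parser.add_argument(
        "--reindex",
        action="store_true",
        help="rebuild the vector store instead of using the cache",
    )
    return parser.parse_args(argv)


def should_rebuild(reindex: bool, cache_dir: Path) -> bool:
    """Rebuild when forced, or when no cache directory exists yet (assumed behaviour)."""
    return reindex or not cache_dir.exists()
```

With this shape, `should_rebuild(parse_args().reindex, Path("vector_store_cache"))` would decide whether to reuse or rebuild the index.
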
Customize settings in `config.yaml`:
- AWS: Bucket (`eng-llkms`), prefix (`knowledge`)
- Model: Provider (`deepseek`/`openai`), model name, temperature, max tokens
- App: Temp directory (`temp`), vector store cache (`vector_store_cache`)
Example snippet:

    aws:
      bucket: eng-llkms
      prefix: knowledge
    model:
      provider: deepseek
      model: deepseek-chat
      temperature: 0.7
      max_tokens: 1024

LLKMS tracks:
- Total Tokens: All tokens used
- Prompt/Completion Tokens: Detailed breakdown
- Requests: Number of successful API calls
- Cost: Estimated USD cost

View the summary when exiting the app.

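
The tracked figures above can be pictured with a small accumulator. This is a sketch under stated assumptions, not the LLKMS implementation: the class name `UsageTracker` is hypothetical, and the per-token prices are placeholders rather than real provider rates.

```python
from dataclasses import dataclass


@dataclass
class UsageTracker:
    # Placeholder prices in USD per 1,000 tokens; NOT real provider rates.
    prompt_price_per_1k: float = 0.001
    completion_price_per_1k: float = 0.002
    prompt_tokens: int = 0
    completion_tokens: int = 0
    requests: int = 0

    def record(self, prompt_tokens: int, completion_tokens: int) -> None:
        """Add one successful API call to the running totals."""
        self.prompt_tokens += prompt_tokens
        self.completion_tokens += completion_tokens
        self.requests += 1

    @property
    def total_tokens(self) -> int:
        return self.prompt_tokens + self.completion_tokens

    @property
    def cost_usd(self) -> float:
        """Estimated cost from the placeholder prices above."""
        return (
            self.prompt_tokens * self.prompt_price_per_1k
            + self.completion_tokens * self.completion_price_per_1k
        ) / 1000

    def summary(self) -> str:
        return (
            f"Requests: {self.requests}, total tokens: {self.total_tokens} "
            f"(prompt {self.prompt_tokens}, completion {self.completion_tokens}), "
            f"estimated cost: ${self.cost_usd:.4f}"
        )
```

Calling `record` after each successful request and printing `summary()` on exit would reproduce the kind of report described above.
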
- Downloads documents from S3 to a temp directory.
- Processes files into chunks using `RecursiveCharacterTextSplitter`.
- Indexes chunks with FAISS for efficient retrieval.
- Answers queries via a RAG pipeline with your chosen language model.
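
The chunk, index, and retrieve steps above can be sketched without any external dependencies. Note the substitutions: fixed-size overlapping windows stand in for `RecursiveCharacterTextSplitter`, and word-count vectors with cosine similarity stand in for real embeddings stored in FAISS. All function names here are hypothetical, and the final answer-generation step with a language model is left out.

```python
import math
import re
from collections import Counter


def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that overlap."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors; 0.0 if either is empty."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query (brute force, unlike FAISS)."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]
```

In a RAG pipeline the retrieved chunks would then be passed to the language model as context for the answer.
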
- Fork the repo: `https://github.com/Butterski/llkms`
- Create a branch: `git checkout -b feature/your-feature`
- Commit changes: `git commit -m "Add your feature"`
- Push: `git push origin feature/your-feature`
- Submit a Pull Request.
- LangChain: RAG and document processing framework
- OpenAI: Optional LLM provider
- DeepSeek: Default LLM provider
- FAISS: Vector storage
- Tesseract OCR: Image text extraction
- Questionary: Interactive CLI