raphael-ph/industrial-rag-pipeline


industrial-rag-pipeline

industrial-rag-pipeline is a Retrieval-Augmented Generation (RAG) pipeline for question answering (QA) about electrical motors. It combines document extraction, indexing, retrieval, and answer generation, using Elasticsearch for storage and search and Google Generative AI for embeddings and responses.


Overview

This project provides a pipeline that processes PDF documents, indexes their content into Elasticsearch, and supports semantic search and question answering over the indexed content. Google Generative AI produces both the embeddings and the generated responses.
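To make the idea of semantic search concrete, here is a minimal sketch of ranking stored chunks by cosine similarity to a query vector. It is not the project's code: in the real pipeline the vectors would come from Google Generative AI and the ranking would be done by Elasticsearch. The names `cosine_similarity`, `top_k`, and `TOY_INDEX`, the three-dimensional vectors, and the sample sentences are all made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, indexed, k=2):
    """Return the texts of the k chunks whose vectors are most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in indexed]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Hypothetical toy index: (chunk text, embedding). Real embeddings have far more dimensions.
TOY_INDEX = [
    ("Rated voltage is printed on the motor nameplate.", [0.9, 0.1, 0.0]),
    ("Bearings should be lubricated at regular intervals.", [0.1, 0.9, 0.1]),
    ("Store spare parts in a dry location.", [0.0, 0.2, 0.9]),
]

if __name__ == "__main__":
    # A query vector pointing mostly along the first dimension.
    print(top_k([1.0, 0.0, 0.0], TOY_INDEX, k=1))
```

The retrieved texts are what a RAG agent would then pass to the generative model as context.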


Features

  • PDF Extraction: Extracts text from PDF files and splits it into chunks for indexing.
  • Elasticsearch Integration: Indexes document chunks and performs semantic search using vector embeddings.
  • RAG Agent: Combines retrieved documents with generative AI to answer user queries.
  • FastAPI: Provides RESTful endpoints for document indexing and question answering.
  • Streamlit Playground: Interactive interface for testing the RAG pipeline.
  • Logging: Color-coded logging for better debugging and monitoring.
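The chunking step mentioned under PDF Extraction can be sketched as a fixed-size character window with overlap. This is an assumption about one common approach, not a description of how this repository splits text; the function name `chunk_text` and the default sizes are hypothetical.

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into character windows of chunk_size that overlap by `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


if __name__ == "__main__":
    for piece in chunk_text("abcdefghij", chunk_size=4, overlap=2):
        print(piece)
```

Overlap keeps a sentence that straddles a boundary visible in both neighbouring chunks, at the cost of indexing some text twice.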

Project Structure

.
├── api/                # FastAPI endpoints for the RAG pipeline
├── app/                # Core application logic (pipeline, prompts, schemas, utils)
├── notebooks/          # Jupyter notebooks for evaluation and experimentation
├── tests/              # Unit and integration tests
├── playground.py       # Streamlit app for interactive RAG testing
├── Makefile            # Automation scripts for development and testing
├── .env                # Environment variables (not included in version control)
├── pyproject.toml      # Project dependencies and metadata
└── README.md           # Project documentation

Installation

Prerequisites

  • Python 3.12 or higher
  • uv package manager
  • Elasticsearch instance
  • Google Generative AI API access

Steps

  1. Install uv (if not already installed):

    # On macOS and Linux
    curl -LsSf https://astral.sh/uv/install.sh | sh
    
    # On Windows
    powershell -c "irm https://astral.sh/uv/install.ps1 | iex"
    
    # Or via pip
    pip install uv
  2. Clone the repository:

    git clone <repository-url>
    cd industrial-rag-pipeline
  3. Create and activate a virtual environment with uv:

    uv venv
    source .venv/bin/activate  # On Windows: .venv\Scripts\activate
  4. Install dependencies:

    uv sync
  5. Set up environment variables in .env (see Environment Variables).


Usage

Running the API

  1. Start the FastAPI server:

    make dev

    Or directly with uv:

    uv run uvicorn api.main:app --reload --host 0.0.0.0 --port 8000
  2. Access the API documentation:

Running the Playground

  1. Start the Streamlit playground:

    make playground

    Or directly with uv:

    uv run streamlit run playground.py
  2. Open the playground in your browser:

Testing

Run all tests using the Makefile:

make test

Run specific tests:

  • Local integration tests:
    make test-local
    # Or: uv run pytest tests/local/
  • API tests:
    make test-api
    # Or: uv run pytest tests/api/

Development

Adding Dependencies

Add new dependencies using uv:

# Add a regular dependency
uv add package-name

# Add a development dependency
uv add --dev package-name

# Add with a version constraint
uv add "package-name>=1.0.0"

Updating Dependencies

# Update all dependencies
uv sync --upgrade

# Update a specific package
uv sync --upgrade-package package-name

Running Scripts

Use uv run to execute scripts within the project environment:

# Run Python scripts
uv run python scripts/your_script.py

# Run CLI tools
uv run black .
uv run isort .
uv run mypy .

Environment Variables

The project uses a .env file to manage sensitive configuration. The variables below are required unless marked optional:

# Google API Key
GEMINI_API_KEY="your-google-api-key"

# Elasticsearch configs
ELASTIC_SEARCH_API_KEY="your-elasticsearch-api-key"
ELASTIC_SEARCH_URL="your-elasticsearch-url"

# Optional
INDEX_NAME="your-default-index-name"
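A startup check along the following lines can fail fast when a required variable is missing. It is a sketch, not the project's own loader: it assumes the values from .env have already been exported into the process environment (for example by the shell or a dotenv loader), and the function name `load_settings` and the fallback index name `"documents"` are hypothetical.

```python
import os

# Variable names taken from the list above; INDEX_NAME is the only optional one.
REQUIRED_VARS = ("GEMINI_API_KEY", "ELASTIC_SEARCH_API_KEY", "ELASTIC_SEARCH_URL")


def load_settings(env=None, default_index="documents"):
    """Return a dict of settings, raising RuntimeError if a required variable is unset or empty."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing required environment variables: " + ", ".join(missing))

    settings = {name: env[name] for name in REQUIRED_VARS}
    # "documents" is a placeholder default, not a value defined by this project.
    settings["INDEX_NAME"] = env.get("INDEX_NAME") or default_index
    return settings


if __name__ == "__main__":
    try:
        print(sorted(load_settings()))
    except RuntimeError as error:
        print(error)
```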

Endpoints

Health Check

  • GET /health
    • Returns the health status of the API.

Document Indexing

  • POST /documents/
    • Uploads and indexes PDF documents into Elasticsearch.

Question Answering

  • POST /question/
    • Generates answers to user queries using the RAG pipeline.
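As a rough illustration of calling these endpoints from Python, the sketch below builds requests with the standard library only. Several details are assumptions: the base URL presumes the server runs locally on port 8000 as in the uvicorn command shown earlier, and the JSON field name `question` is a guess, since the request schema is not documented here. Check the API's own documentation for the actual body format. The helper names are hypothetical.

```python
import json
import urllib.request

# Assumed local address; adjust to wherever the API is actually served.
BASE_URL = "http://localhost:8000"


def build_health_request(base_url=BASE_URL):
    """Build a GET request for the health check endpoint."""
    return urllib.request.Request(f"{base_url}/health", method="GET")


def build_question_request(question, base_url=BASE_URL):
    """Build a POST request for /question/ with a JSON body.

    The "question" field name is an assumption, not taken from the project's schema.
    """
    body = json.dumps({"question": question}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/question/",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=10):
    """Send a prepared request and return (status code, response text)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status, response.read().decode("utf-8")
```

With the server running, `send(build_health_request())` performs the health check, and `send(build_question_request("..."))` submits a question. Uploading PDFs to /documents/ needs a multipart form body, which is easier with an HTTP client library than with urllib.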

Why uv?

This project uses uv as the Python package manager for several advantages:

  • Speed: The uv project reports dependency resolution and installation 10-100x faster than pip
  • Reliability: Consistent dependency resolution with lockfile support
  • Simplicity: Single tool for virtual environments, dependency management, and script running
  • Modern: Built with Rust for performance and reliability
  • Compatibility: Works with existing pyproject.toml and requirements.txt files

If you prefer using pip, you can still generate a requirements file:

uv export --format requirements-txt --output-file requirements.txt
