A RAG API for document Q&A: it ingests PDFs, retrieves the most relevant chunks, and returns grounded answers with low latency.
On upload, PDFs are extracted, chunked with overlap, embedded (768-d), and stored in PostgreSQL/pgvector for cosine search. On ask, the query is embedded, the top-k chunks are retrieved, and Gemini 1.5 Flash generates a context-bound answer, with Redis caching and SQLAlchemy-managed sessions. Endpoints: `POST /upload`, `POST /ask`.
Cuts manual reading time, keeps answers faithful to your documents, and runs locally with simple ops—fast to prototype, easy to productionize.
Tech stack

FastAPI, LangGraph, Google GenAI `embedding-001` (768-d), Gemini 1.5 Flash, PostgreSQL + pgvector (IVFFlat), Redis cache, SQLAlchemy, Docker Compose, `.env` config (`DATABASE_URL`, `REDIS_URL`, `GEMINI_API_KEY`), batching + exponential backoff, cosine similarity search.
Prerequisites

- Python 3.11+
- Docker Desktop
- Google AI Studio API key (`GEMINI_API_KEY`)
Quickstart

Create and activate a virtual environment:

```
python -m venv .venv
.\.venv\Scripts\Activate.ps1   # Windows PowerShell
source .venv/bin/activate      # macOS/Linux
```

Install dependencies:

```
pip install -r requirements.txt
```

Set up environment variables

Create `.env` in the repo root:

```
DATABASE_URL=postgresql+psycopg://postgres:postgres@localhost:5432/postgres
REDIS_URL=redis://localhost:6379
GEMINI_API_KEY=YOUR_GOOGLE_AI_STUDIO_KEY
```

Start Postgres + Redis:

```
docker compose -f docker-compose.dev.yml up -d
```

Initialize the database schema:

```
docker cp infra/sql_init.sql docs-ai-minimal-pg-1:/sql_init.sql
docker exec -it docs-ai-minimal-pg-1 psql -U postgres -f /sql_init.sql
```

Run the API server:

```
python -m uvicorn api.main:app --reload
```

Upload a PDF:

```
curl -X POST -F "file=@data/examplepdf.pdf" http://localhost:8000/upload
```

Ask a question:

```
curl -X POST http://localhost:8000/ask -H "Content-Type: application/json" -d "{\"question\":\"Summarize the document in 2 sentences.\"}"
```

- `/upload` → Extracts PDF text, chunks it (with overlap), generates embeddings via `google-genai` (`output_dimensionality=768`), and stores them in Postgres pgvector.
- `/ask` → Embeds the query, searches for similar chunks in pgvector, and sends the top results to Gemini 1.5 Flash for the answer.
Chunking config (in `core/ingest.py`):

```
def chunk(text: str, max_chars=2400, overlap=150):
    ...
```

- `max_chars` → chunk size
- `overlap` → shared characters between chunks
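The repo shows only the signature above; below is a minimal sketch of what an overlap chunker with these parameters typically does, not the project's exact implementation:

```python
def chunk(text: str, max_chars: int = 2400, overlap: int = 150):
    """Split text into windows of at most max_chars characters,
    each sharing `overlap` trailing characters with the previous one."""
    if overlap >= max_chars:
        raise ValueError("overlap must be smaller than max_chars")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        # step back so consecutive chunks share context across the boundary
        start = end - overlap
    return chunks
```

Larger `max_chars` means fewer embedding calls but coarser retrieval; the overlap keeps sentences that straddle a boundary retrievable from either side.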
Teardown (removes all DB + Redis data):

```
docker compose -f docker-compose.dev.yml down -v
```

High-level flow

- `POST /upload` → extract PDF text → chunk → embed chunks (batching + backoff) → store text + vectors in Postgres (pgvector).
- `POST /ask` → embed query → cosine search in pgvector → pass top docs as context to Gemini → return grounded answer.
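The `/ask` flow can be sketched as a plain function. Here `embed_query`, `search_similar`, and `generate` are illustrative stand-ins for the project's `core.llm` and `db.pg` helpers, not the repo's actual signatures:

```python
def answer_question(question, embed_query, search_similar, generate, top_k=5):
    """POST /ask flow: embed the query, fetch the top-k most similar
    chunks from pgvector, and generate a context-bound answer."""
    query_vec = embed_query(question)                  # 768-d query vector
    context_chunks = search_similar(query_vec, top_k)  # cosine search in pgvector
    return generate(question, context_chunks)          # grounded generation
```

Keeping the three stages as separate callables is what makes the pipeline easy to wire up as LangGraph nodes (retrieve → generate) and to stub out in tests.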
Directories

- `api/`
  - `main.py` – FastAPI app + router registration.
  - `routes.py` – HTTP endpoints: `/healthz`, `/upload`, `/ask`.
- `core/`
  - `ingest.py` – PDF → text, chunking; calls `embed_docs()` and persists rows.
  - `llm.py` – Text generation (Gemini 1.5 Flash) and embeddings (google-genai) with batching + retry/backoff.
  - `workflows.py` – Minimal RAG pipeline (retrieve → generate) built with LangGraph.
- `db/`
  - `session.py` – SQLAlchemy engine/session from `DATABASE_URL`.
  - `pg.py` – `insert_embeddings()` and `similarity_search()` with pgvector.
  - `redis.py` – tiny JSON cache for retrieved docs.
- `infra/`
  - `sql_init.sql` – pgvector schema (768-d vectors + IVFFlat index).
Endpoints

- `GET /healthz` – liveness probe.
- `POST /upload` (multipart form) – field `file` (PDF). Returns `{status, chunks}`.
- `POST /ask` (JSON) – body `{"question": "..."}`. Returns `{answer}`.
Chunking (`core/ingest.py`)

`chunk(text, max_chars=2400, overlap=150)` → the overlap preserves context across chunk boundaries.
Embeddings (`core/llm.py`)

- Model: `models/embedding-001` with `output_dimensionality=768` (matches `vector(768)`).
- Task pairing: docs → `RETRIEVAL_DOCUMENT`, queries → `RETRIEVAL_QUERY`.
- Batching: default `EMBED_BATCH_SIZE=24` (env override) with exponential backoff on 429/5xx.
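The batching-plus-backoff behavior can be sketched as below. `embed_batch` is a placeholder for the real google-genai embedding call, and the exact retry policy (doubling delay with jitter, five attempts) is an assumption, not necessarily what `core/llm.py` does:

```python
import random
import time

def embed_with_backoff(texts, embed_batch, batch_size=24, max_retries=5):
    """Embed texts in batches; retry each batch with exponential backoff
    on transient errors (e.g. 429/5xx from the embedding API)."""
    vectors = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        for attempt in range(max_retries):
            try:
                vectors.extend(embed_batch(batch))
                break
            except Exception:
                if attempt == max_retries - 1:
                    raise  # retries exhausted: surface the error
                # back off 1s, 2s, 4s, ... plus jitter before retrying
                time.sleep(2 ** attempt + random.random())
    return vectors
```

Lowering `batch_size` (via `EMBED_BATCH_SIZE`) trades throughput for a smaller chance of hitting the rate limit in the first place.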
Retrieval (`db/pg.py`)

- Cosine distance: `ORDER BY (e.embedding <=> :query_vec)`.
- Returns top-k chunk texts to the generator.
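For reference, pgvector's `<=>` operator computes cosine distance, i.e. 1 minus cosine similarity, so ascending `ORDER BY` puts the most similar chunks first. In pure Python:

```python
import math

def cosine_distance(a, b):
    """What pgvector's `<=>` computes: 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)
```

Identical directions give distance 0; orthogonal vectors give 1; magnitude is ignored, which is why the same metric works regardless of chunk length.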
Generation (`core/llm.py` → `generate()`)

- Model: `gemini-1.5-flash`.
- Prompt instructs the model to answer strictly from the provided context, and to say it can't find the answer if it's missing.
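A hypothetical sketch of how such a context-bound prompt might be assembled; the actual wording inside `generate()` may differ:

```python
def build_prompt(question, context_chunks):
    """Assemble a grounding prompt: the model must answer only from the
    retrieved chunks and admit when the answer is not present."""
    context = "\n\n---\n\n".join(context_chunks)
    return (
        "Answer strictly from the provided context. If the answer is not "
        "in the context, say you can't find it.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
```

Putting the instruction before the context, and the question last, keeps the constraint and the query adjacent to the model's continuation point, a common pattern for grounded Q&A.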
Config

- `.env` – `DATABASE_URL`, `REDIS_URL`, `GEMINI_API_KEY`.
- (Optional) `EMBED_BATCH_SIZE` – tune request burst size to avoid throttling.
- Adjust `max_chars`/`overlap` in `core/ingest.py` for ingestion speed vs. recall.
