Skip to content

feat(ai): on-device hybrid RAG for Ask your PDF#85

Merged
sumitsahoo merged 1 commit into
devfrom
feature/ai-local-hybrid-rag
May 17, 2026
Merged

feat(ai): on-device hybrid RAG for Ask your PDF#85
sumitsahoo merged 1 commit into
devfrom
feature/ai-local-hybrid-rag

Conversation

@sumitsahoo
Copy link
Copy Markdown
Collaborator

Summary

Ships Ask your PDF — a fully on-device RAG pipeline that lets users chat with any PDF without uploading bytes anywhere. Everything (chunking, embedding, retrieval, reranking, chat) runs in the browser via Transformers.js + WebGPU.

  • Two chat tiers (user-selectable, persisted to localStorage): Compact (LFM2.5-1.2B, ~810 MB) and Quality (LFM2-2.6B, ~1.55 GB). Shared embedder (EmbeddingGemma-300M) and cross-encoder reranker (MS MARCO MiniLM-L-6-v2).
  • Hybrid retrieval: BM25 + dense fused via Reciprocal Rank Fusion, anchor-chunk merge for identity questions, cosine relevance gate to refuse off-topic queries before they reach the LLM.
  • Deterministic fast-paths for verbatim contact extraction, doc-type identification, and topic-absence refusal — defends against small-model failure modes (mis-extracted digits, hallucinated topics).
  • Cache + consent UX: model weights persisted in CacheStorage keyed by SHA-256 of the PDF for instant re-opens, two-step consent dialog with per-model progress, free-RAM vs delete-cache affordances.
  • Latest commit: corrected the registry's approxSizeBytes against actual HuggingFace file listings — the Quality bundle reads "≈ 1.9 GB total" (was "1.8 GB" / actual 1.88 GB) and Compact reads "≈ 1.1 GB" (was implicitly 1.5 GB / actual 1.13 GB). Embed peak RAM bumped 400 → 500 MB and now includes the 26 MB Gemma SentencePiece tokenizer that was previously missed.

Deep-dive in docs/local-ai.md.

Test plan

  • vp check passes (format + lint + type-check)
  • vp test passes (unit suite — 126 tests)
  • pnpm test:e2e runs Ask PDF end-to-end with real model weights (résumé fixture)
  • pnpm exec tsx tests/e2e/retrieval-probe.ts dumps per-retriever hits + relevance scores per question for tuning
  • Manual: load Ask PDF on a fresh browser profile — gate shows correct aggregate size; pick each tier and verify per-tile download/RAM labels render from registry
  • Manual: refresh after first download — warm-load reads from CacheStorage in seconds, no network
  • Manual: open an encrypted PDF — friendly notice renders instead of crashing

Registry `approxSizeBytes` values had drifted from the real on-disk
weights — the Quality tier was showing "≈ 1.8 GB total" in the gate
when the actual download is ~1.88 GB, and the Compact tier overstated
by ~390 MB on the chat model alone. Verified against HuggingFace file
listings for each pinned dtype:

  - LFM2.5-1.2B (q4):  1.2 GB → 810 MB (model_q4.onnx_data is 850 MB)
  - LFM2-2.6B (q4f16): 1.5 GB → 1.55 GB
  - EmbeddingGemma q8: 309 MB → 320 MB (was missing the 26 MB Gemma
                                       SentencePiece tokenizer)
  - Embed peak RAM:    400 MB → 500 MB (int8 dequant overhead)

Aggregate now reads "≈ 1.1 GB total" on Compact and "≈ 1.9 GB total"
on Quality. Prose updated in README, docs/local-ai.md, tool-registry,
ChatModelPicker, AiModelDetailsModal, ai-runtime, useRagModels, and
the AskPdf timing-weight comment.
@cloudflare-workers-and-pages
Copy link
Copy Markdown

Deploying with  Cloudflare Workers  Cloudflare Workers

The latest updates on your project. Learn more about integrating Git with Workers.

Status Name Latest Commit Preview URL Updated (UTC)
✅ Deployment successful!
View logs
cloakpdf cb7da66 Commit Preview URL

Branch Preview URL
May 17 2026, 12:44 PM

@sumitsahoo sumitsahoo merged commit db278fa into dev May 17, 2026
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant