
Demo Upgrade Plan: Real Semantic Search with Transformers.js

Current State

  • 20 synthetic docs about altor-vec itself, 64-dim term-frequency vectors
  • 8.6 KB demo-index.bin in public/
  • 5 pre-computed query vectors — users can only click chips or type text that keyword-matches to a pre-computed vector
  • No real semantic understanding — findClosestQuery() does keyword overlap to pick the closest pre-computed vector

Target State

  • 10,000 real MS MARCO passages with 384-dim embeddings
  • Live query embedding via Transformers.js (Xenova/all-MiniLM-L6-v2) in a Web Worker
  • Users type ANY query and get semantically relevant results in ~150ms (embedding) + <1ms (search)

Key Research Findings

1. GitHub Pages can serve the large files

  • GitHub Pages repo limit: 1GB, individual file soft limit: 100MB
  • Our files: index.bin (17MB) + metadata.json (3.5MB) = 20.5MB total → well within limits
  • GitHub Pages serves with proper Content-Type and supports gzip/brotli compression
  • The deploy workflow (actions/upload-pages-artifact + actions/deploy-pages) handles arbitrary files in dist/

2. Transformers.js approach: npm vs. CDN

Option A: npm install — adds ~5MB to node_modules, but the ONNX model (~23MB) still downloads at runtime from the HuggingFace CDN. Increases bundle size slightly.

Option B: CDN dynamic import — import { pipeline } from 'https://cdn.jsdelivr.net/npm/@huggingface/transformers'. Zero bundle impact. The model downloads from the HuggingFace CDN on first use and is cached by the browser.

Decision: Use npm install (@huggingface/transformers) for TypeScript types and build-time safety. The ONNX model still loads from the HuggingFace CDN at runtime (~23MB model_quantized.onnx); this is how Transformers.js is designed to work.

3. ONNX model download UX

The quantized model (model_quantized.onnx) is 23MB. On first visit:

  • Fast connection (50 Mbps): ~4 seconds
  • Moderate (10 Mbps): ~18 seconds
  • Slow (2 Mbps): ~90 seconds

The model is cached by the browser after first download (HuggingFace sets proper cache headers). Subsequent visits: instant.

UX strategy:

  • Show a multi-phase progress indicator: "Loading WASM engine..." → "Downloading embedding model (23MB)..." → "Ready"
  • Keep the suggested query chips functional immediately (with pre-computed vectors) so users can try the demo while the model loads
  • Once model is ready, switch to live embedding for all queries
  • Show "Embedding your query..." briefly when user types custom query

4. Keep suggested query chips — YES

  • They provide instant gratification while the model downloads
  • They guide users who don't know what to search for
  • With real MS MARCO data, we need good example queries. Suggested chips:
    1. "what is a bank transit number" (finance)
    2. "how does photosynthesis work" (science)
    3. "what causes earthquakes" (geology)
    4. "define machine learning" (tech)
    5. "who invented the telephone" (history)

These should have pre-computed 384-dim vectors so they work instantly even before the model loads.

5. Web Worker for embedding — YES

Transformers.js model inference takes ~100-200ms. Running on main thread would cause jank. Use a dedicated Web Worker that:

  1. Loads @huggingface/transformers pipeline
  2. Accepts text queries via postMessage
  3. Returns 384-dim Float32Array embeddings
  4. Reports download progress back to main thread

6. Metadata format

Current metadata.json: [{"id": 0, "text": "..."}, ...] — 10K passages, avg 337 chars each, 3.5MB total.

For display, we want a title + snippet. MS MARCO passages don't have titles, so we'll:

  • Use the first ~60 chars as a "title" (or first sentence)
  • Use the full text as the snippet
  • Generate this at build time in a new script, outputting a compact format

Implementation Plan

Files to Modify

1. src/demo-data.ts — Rewrite

Remove: all 20 synthetic docs and all 64-dim pre-computed vectors.

Replace with:

export interface DemoDoc {
  title: string;   // First sentence or first ~60 chars
  snippet: string; // Full passage text
}

export interface DemoQuery {
  query: string;
  vector: number[]; // 384-dim pre-computed vector
}

// Will be generated by new build script
export const DEMO_QUERIES: DemoQuery[] = [...]; // 5 queries with 384-dim vectors
// DEMO_DOCS removed — loaded dynamically from metadata.json

2. src/App.tsx — Modify LiveSearchDemo component (~lines 685-937)

Changes:

  • Remove DEMO_DOCS import (docs come from metadata.json loaded at runtime)
  • Add state for: modelState ("idle" | "downloading" | "ready" | "error"), modelProgress, passages (loaded from metadata.json)
  • Add Web Worker ref for Transformers.js embedding
  • Change runQuery to:
    1. If pre-computed vector available (chip click) → use directly
    2. Else if model ready → send to worker, await embedding, then search
    3. Else → show "Model still loading..." message
  • Update status bar to show model download progress
  • Update search footer to show "embedding: Xms + search: Xms"
  • Add debounced auto-search on typing (300ms debounce) when model is ready

Keep unchanged:

  • Overall section structure, styling, animations
  • Reveal components
  • Engine loading logic (init + fetch index)
  • Result rendering cards
  • Search input UI

3. src/embed-worker.ts — NEW file (Web Worker)

// Web Worker for Transformers.js embedding
import { pipeline, env } from '@huggingface/transformers';

let embedder: any = null;

self.onmessage = async (e) => {
  if (e.data.type === 'init') {
    env.allowLocalModels = false; // always fetch from the HuggingFace CDN
    embedder = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2', {
      dtype: 'q8', // quantized weights; v3 replaces the old `quantized: true` flag
      progress_callback: (progress: any) => {
        self.postMessage({ type: 'progress', ...progress });
      }
    });
    self.postMessage({ type: 'ready' });
  }

  if (e.data.type === 'embed') {
    if (!embedder) {
      self.postMessage({ type: 'error', id: e.data.id, message: 'Model not loaded yet' });
      return;
    }
    const output = await embedder(e.data.text, { pooling: 'mean', normalize: true });
    self.postMessage({ type: 'embedding', vector: Array.from(output.data), id: e.data.id });
  }
};

4. scripts/generate-demo-data.mjs — Rewrite

New script that:

  1. Reads data/metadata.json (10K passages)
  2. Generates title/snippet pairs for each passage
  3. Uses Transformers.js (Node.js) to compute embeddings for 5 demo queries
  4. Outputs:
    • public/metadata.json — compact [{title, snippet}, ...] (id = array index)
    • src/demo-data.ts — only DEMO_QUERIES with pre-computed 384-dim vectors
  5. Copies data/index.bin → public/demo-index.bin

5. vite.config.ts — Minor update

Add worker config for proper Web Worker bundling:

worker: {
  format: 'es'
}

6. package.json — Add dependency

"@huggingface/transformers": "^3.4.0"

Files to Add to public/

File            Size     Source
demo-index.bin  17 MB    copy of data/index.bin (replaces the existing 8.6 KB file)
metadata.json   ~2-3 MB  generated from data/metadata.json (title + snippet only)

Files to Remove

  • Nothing removed, but public/demo-index.bin is replaced (8.6KB → 17MB)
  • scripts/generate-demo-data.mjs is rewritten (old toy version replaced)

UX Flow with Loading States

Phase 1: Page Load (0-2s)

Status: ● Loading WASM engine...
[Search input disabled]
[Query chips disabled]

Phase 2: WASM Ready, Model Downloading (2-20s)

Status: ● WASM engine running — 10,000 vectors
        ⟳ Downloading embedding model... 45%
[Query chips ENABLED — use pre-computed vectors]
[Search input ENABLED for chips, typing shows "Model loading..." hint]

Phase 3: Everything Ready

Status: ● WASM engine running — 10,000 vectors
        ● Embedding model ready
[Query chips ENABLED]
[Search input ENABLED — live embedding on type/enter]
[Auto-search as you type with 300ms debounce]

Query Execution Flow

  1. User types query → 300ms debounce
  2. Show "Embedding..." spinner in input
  3. Worker returns vector (~150ms)
  4. WASM search (~0.5ms)
  5. Display results with "embed: 142ms + search: 0.48ms" in footer

Fallback Strategy

If Transformers.js fails to load:

  • Query chips still work (pre-computed vectors)
  • Show message: "Live embedding unavailable. Try a suggested query above."
  • Log error for debugging

If model download is too slow:

  • Show progress bar with percentage
  • After 30s without progress, show: "Download taking longer than expected. The model is cached after first load."

If WASM fails:

  • Existing error state: "Failed to load engine. Try refreshing."
  • This is already handled

If metadata.json fails to load:

  • Fall back to showing just doc IDs and distances (no title/snippet)
  • Show message: "Could not load passage text"

Build & Deploy Changes

Local dev workflow:

  1. npm install @huggingface/transformers
  2. Run new scripts/generate-demo-data.mjs to generate public/metadata.json + src/demo-data.ts + copy public/demo-index.bin
  3. npm run dev — works as before

CI/CD (.github/workflows/deploy.yml):

  • No changes needed! The large files go in public/, Vite copies them to dist/, and upload-pages-artifact handles them
  • The ONNX model is NOT in our repo — it's fetched from HuggingFace CDN at runtime

Estimated bundle impact:

  • @huggingface/transformers JS glue: ~200KB (tree-shaken)
  • Worker file: ~5KB
  • Public assets: +17MB (index) + ~2.5MB (metadata) = ~19.5MB
  • ONNX model: 23MB downloaded from CDN (not in our bundle)

Summary of Changes

Component        Current                                    Upgraded
Documents        20 synthetic, 64-dim                       10,000 MS MARCO, 384-dim
Index size       8.6 KB                                     17 MB
Query embedding  keyword match to 5 pre-computed vectors    live Transformers.js in a Web Worker
Model            none                                       all-MiniLM-L6-v2 quantized (23 MB, cached)
Search latency   <0.1 ms (toy data)                         <1 ms (real HNSW)
User can search  5 pre-defined topics                       anything
Privacy          100% client-side                           100% client-side (model from CDN, cached)