- 20 synthetic docs about altor-vec itself, 64-dim term-frequency vectors
- 8.6 KB `demo-index.bin` in `public/`
- 5 pre-computed query vectors — users can only click chips or type text that keyword-matches to a pre-computed vector
- No real semantic understanding — `findClosestQuery()` does keyword overlap to pick the closest pre-computed vector
- 10,000 real MS MARCO passages with 384-dim embeddings
- Live query embedding via Transformers.js (`Xenova/all-MiniLM-L6-v2`) in a Web Worker
- Users type ANY query and get relevant semantic search results in <1ms (search) + ~150ms (embedding)
- GitHub Pages repo limit: 1GB, individual file soft limit: 100MB
- Our files: `index.bin` (17MB) + `metadata.json` (3.5MB) = 20.5MB total → well within limits
- GitHub Pages serves with proper `Content-Type` and supports gzip/brotli compression
- The deploy workflow (`actions/upload-pages-artifact` + `actions/deploy-pages`) handles arbitrary files in `dist/`
Option A: npm install — adds ~5MB to node_modules, but the ONNX model (~23MB) still downloads at runtime from HuggingFace CDN. Increases bundle size slightly.
Option B: CDN dynamic import (recommended) — `import { pipeline } from 'https://cdn.jsdelivr.net/npm/@huggingface/transformers'`. Zero bundle impact. Model downloads from HuggingFace CDN on first use and is cached by the browser.
Decision: Use npm install (@huggingface/transformers) for TypeScript types and build-time safety, but the ONNX model still loads from HuggingFace CDN at runtime (~23MB model_quantized.onnx). This is how Transformers.js is designed to work.
The quantized model (model_quantized.onnx) is 23MB. On first visit:
- Fast connection (50 Mbps): ~4 seconds
- Moderate (10 Mbps): ~18 seconds
- Slow (2 Mbps): ~90 seconds
The model is cached by the browser after first download (HuggingFace sets proper cache headers). Subsequent visits: instant.
UX strategy:
- Show a multi-phase progress indicator: "Loading WASM engine..." → "Downloading embedding model (23MB)..." → "Ready"
- Keep the suggested query chips functional immediately (with pre-computed vectors) so users can try the demo while the model loads
- Once model is ready, switch to live embedding for all queries
- Show "Embedding your query..." briefly when user types custom query
- The suggested query chips provide instant gratification while the model downloads
- They guide users who don't know what to search for
- With real MS MARCO data, we need good example queries. Suggested chips:
- "what is a bank transit number" (finance)
- "how does photosynthesis work" (science)
- "what causes earthquakes" (geology)
- "define machine learning" (tech)
- "who invented the telephone" (history)
These should have pre-computed 384-dim vectors so they work instantly even before the model loads.
Transformers.js model inference takes ~100-200ms. Running on main thread would cause jank. Use a dedicated Web Worker that:
- Loads the `@huggingface/transformers` pipeline
- Accepts text queries via `postMessage`
- Returns 384-dim Float32Array embeddings
- Reports download progress back to main thread
Current metadata.json: [{"id": 0, "text": "..."}, ...] — 10K passages, avg 337 chars each, 3.5MB total.
For display, we want a title + snippet. MS MARCO passages don't have titles, so we'll:
- Use the first ~60 chars as a "title" (or first sentence)
- Use the full text as the snippet
- Generate this at build time in a new script, outputting a compact format
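A possible shape for this derivation (the helper name and the exact truncation rule are illustrative, not decided):

```ts
// Derive a display title from a passage: first sentence if short enough, otherwise a hard cut.
function makeTitle(text: string, maxLen = 80): string {
  const firstSentence = text.split(/(?<=[.!?])\s/)[0] ?? text;
  return firstSentence.length <= maxLen
    ? firstSentence
    : firstSentence.slice(0, maxLen - 1).trimEnd() + '…';
}

// Build-time mapping: the full passage text stays as the snippet.
const toDemoDoc = (text: string) => ({ title: makeTitle(text), snippet: text });
```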
Remove: all 20 synthetic docs and all 64-dim pre-computed vectors.

Replace with:
```ts
export interface DemoDoc {
  title: string;   // First sentence or first 80 chars
  snippet: string; // Full passage text
}

export interface DemoQuery {
  query: string;
  vector: number[]; // 384-dim pre-computed vector
}

// Will be generated by new build script
export const DEMO_QUERIES: DemoQuery[] = [...]; // 5 queries with 384-dim vectors

// DEMO_DOCS removed — loaded dynamically from metadata.json
```

Changes:
- Remove `DEMO_DOCS` import (docs come from `metadata.json` loaded at runtime)
- Add state for: `modelState` (`"idle" | "downloading" | "ready" | "error"`), `modelProgress`, `passages` (loaded from `metadata.json`)
- Add a Web Worker ref for Transformers.js embedding
- Change `runQuery` to:
  - If a pre-computed vector is available (chip click) → use it directly
  - Else if the model is ready → send to worker, await embedding, then search
  - Else → show a "Model still loading..." message
- Update status bar to show model download progress
- Update search footer to show "embedding: Xms + search: Xms"
- Add debounced auto-search on typing (300ms debounce) when model is ready
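A minimal sketch of the new query path and the debounce, assuming the worker protocol shown later and a `search()` wrapper around the existing engine call; everything in the `declare` block stands in for state and handles that already live in App.tsx:

```ts
import { DEMO_QUERIES } from './demo-data';

// Placeholders for state/handles assumed to exist in App.tsx.
declare const embedWorker: Worker;
declare const modelState: 'idle' | 'downloading' | 'ready' | 'error';
declare function search(vector: Float32Array, embedMs?: number): void; // existing WASM search
declare function showHint(message: string): void;

// Promise wrapper around the embedding worker's postMessage protocol.
function embedQuery(text: string): Promise<Float32Array> {
  return new Promise((resolve) => {
    const id = crypto.randomUUID();
    const onMessage = (e: MessageEvent) => {
      if (e.data.type === 'embedding' && e.data.id === id) {
        embedWorker.removeEventListener('message', onMessage);
        resolve(Float32Array.from(e.data.vector));
      }
    };
    embedWorker.addEventListener('message', onMessage);
    embedWorker.postMessage({ type: 'embed', text, id });
  });
}

async function runQuery(text: string) {
  // 1. Chip click: a pre-computed 384-dim vector is available, use it directly.
  const precomputed = DEMO_QUERIES.find((q) => q.query === text);
  if (precomputed) return search(Float32Array.from(precomputed.vector));

  // 2. Model ready: embed live in the worker, then search.
  if (modelState === 'ready') {
    const t0 = performance.now();
    const vector = await embedQuery(text);
    return search(vector, performance.now() - t0); // footer shows "embedding: Xms + search: Xms"
  }

  // 3. Otherwise the model is still downloading.
  showHint('Model still loading...');
}

// Debounced auto-search while typing, only once the model is ready.
let debounceTimer: ReturnType<typeof setTimeout> | undefined;
function onInputChange(text: string) {
  clearTimeout(debounceTimer);
  if (modelState === 'ready') debounceTimer = setTimeout(() => runQuery(text), 300);
}
```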
Keep unchanged:
- Overall section structure, styling, animations
- Reveal components
- Engine loading logic (init + fetch index)
- Result rendering cards
- Search input UI
```ts
// Web Worker for Transformers.js embedding
import { pipeline, env } from '@huggingface/transformers';

let embedder: any = null;

self.onmessage = async (e) => {
  if (e.data.type === 'init') {
    env.allowLocalModels = false;
    embedder = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2', {
      quantized: true,
      progress_callback: (progress: any) => {
        self.postMessage({ type: 'progress', ...progress });
      }
    });
    self.postMessage({ type: 'ready' });
  }
  if (e.data.type === 'embed') {
    const output = await embedder(e.data.text, { pooling: 'mean', normalize: true });
    self.postMessage({ type: 'embedding', vector: Array.from(output.data), id: e.data.id });
  }
};
```

New script that:
- Reads `data/metadata.json` (10K passages)
- Generates title/snippet pairs for each passage
- Uses Transformers.js (Node.js) to compute embeddings for 5 demo queries
- Outputs:
  - `public/metadata.json` — compact `[{title, snippet}, ...]` (id = array index)
  - `src/demo-data.ts` — only `DEMO_QUERIES` with pre-computed 384-dim vectors
- Copies `data/index.bin` → `public/demo-index.bin`
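A minimal sketch of that script (the paths come from this plan; the output formatting and the sentence-split heuristic are illustrative):

```js
// Hypothetical sketch of scripts/generate-demo-data.mjs.
import { readFile, writeFile, copyFile } from 'node:fs/promises';
import { pipeline } from '@huggingface/transformers';

const DEMO_QUERY_TEXTS = [
  'what is a bank transit number',
  'how does photosynthesis work',
  'what causes earthquakes',
  'define machine learning',
  'who invented the telephone',
];

// 1. Compact metadata: derive a title per passage, keep the full text as the snippet.
const passages = JSON.parse(await readFile('data/metadata.json', 'utf8'));
const compact = passages.map(({ text }) => ({
  title: (text.split(/(?<=[.!?])\s/)[0] || text).slice(0, 80),
  snippet: text,
}));
await writeFile('public/metadata.json', JSON.stringify(compact));

// 2. Pre-compute 384-dim vectors for the 5 demo queries with the same model the browser uses.
const embed = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');
const queries = [];
for (const query of DEMO_QUERY_TEXTS) {
  const output = await embed(query, { pooling: 'mean', normalize: true });
  queries.push({ query, vector: Array.from(output.data) });
}
await writeFile(
  'src/demo-data.ts',
  '// Generated by scripts/generate-demo-data.mjs. Do not edit.\n' +
    `export const DEMO_QUERIES = ${JSON.stringify(queries)};\n`,
);

// 3. Ship the real HNSW index to the demo site.
await copyFile('data/index.bin', 'public/demo-index.bin');
```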
Add worker config for proper Web Worker bundling:

```ts
worker: {
  format: 'es'
}
```

New dependency in `package.json`:

```json
"@huggingface/transformers": "^3.4.0"
```
| File | Size | Source |
|---|---|---|
| `demo-index.bin` | 17 MB | Copy from `data/index.bin` (replaces existing 8.6 KB) |
| `metadata.json` | ~2-3 MB | Generated from `data/metadata.json` (title + snippet only) |
- Nothing removed, but `public/demo-index.bin` is replaced (8.6 KB → 17 MB)
- `scripts/generate-demo-data.mjs` is rewritten (old toy version replaced)
```
Status: ● Loading WASM engine...
[Search input disabled]
[Query chips disabled]

Status: ● WASM engine running — 10,000 vectors
⟳ Downloading embedding model... 45%
[Query chips ENABLED — use pre-computed vectors]
[Search input ENABLED for chips; typing shows "Model loading..." hint]

Status: ● WASM engine running — 10,000 vectors
● Embedding model ready
[Query chips ENABLED]
[Search input ENABLED — live embedding on type/enter]
[Auto-search as you type with 300ms debounce]
```
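A small helper could map the embedding-model state from the worker to the second status line shown above (the names are illustrative):

```ts
type ModelState = 'idle' | 'downloading' | 'ready' | 'error';

// Map the embedding-model state to the status line used in the mockups.
const MODEL_STATUS_LABEL: Record<ModelState, (pct: number) => string> = {
  idle: () => '',
  downloading: (pct) => `⟳ Downloading embedding model... ${Math.round(pct)}%`,
  ready: () => '● Embedding model ready',
  error: () => 'Live embedding unavailable. Try a suggested query above.',
};

// e.g. MODEL_STATUS_LABEL[modelState](modelProgress)
```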
- User types query → 300ms debounce
- Show "Embedding..." spinner in input
- Worker returns vector (~150ms)
- WASM search (~0.5ms)
- Display results with "embed: 142ms + search: 0.48ms" in footer
- Query chips still work (pre-computed vectors)
- Show message: "Live embedding unavailable. Try a suggested query above."
- Log error for debugging
- Show progress bar with percentage
- After 30s without progress, show: "Download taking longer than expected. The model is cached after first load."
- Existing error state: "Failed to load engine. Try refreshing."
- This is already handled
- Fall back to showing just doc IDs and distances (no title/snippet)
- Show message: "Could not load passage text"
- `npm install @huggingface/transformers`
- Run new `scripts/generate-demo-data.mjs` to generate `public/metadata.json` + `src/demo-data.ts` and copy `public/demo-index.bin`
- `npm run dev` — works as before
- No changes needed! The large files go in `public/`, Vite copies them to `dist/`, and `upload-pages-artifact` handles them
- The ONNX model is NOT in our repo — it's fetched from HuggingFace CDN at runtime
- `@huggingface/transformers` JS glue: ~200KB (tree-shaken)
- Worker file: ~5KB
- Public assets: +17MB (index) + ~2.5MB (metadata) = ~19.5MB
- ONNX model: 23MB downloaded from CDN (not in our bundle)
| Component | Current | Upgraded |
|---|---|---|
| Documents | 20 synthetic, 64-dim | 10,000 MS MARCO, 384-dim |
| Index size | 8.6 KB | 17 MB |
| Query embedding | Keyword match to 5 pre-computed vectors | Live Transformers.js in Web Worker |
| Model | None | all-MiniLM-L6-v2 quantized (23MB, cached) |
| Search latency | <0.1ms (toy data) | <1ms (real HNSW) |
| User can search for | 5 pre-defined topics | Anything |
| Privacy | 100% client-side | 100% client-side (model from CDN, cached) |