python-searchengine

A simple search engine implemented in Python for illustrative purposes, written to accompany this blog post.

It also has a simple vector search implementation to go with the follow-up post.

Requirements

Python 3.10 or greater, and uv.

Usage

Install dependencies:

uv sync

Run the full-text search from the command line. On first run, the Wikipedia dataset (~20GB) will be downloaded from Hugging Face and cached automatically:

uv run python run.py

Run the semantic (vector) search:

uv run python run_semantic.py

On first run, this builds a vector index by embedding all 6.4M documents. Embeddings are checkpointed to data/checkpoints/, so an interrupted run resumes from the last completed checkpoint instead of starting over. The finished index is saved to data/vector_index.* and memory-mapped on subsequent runs.
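The checkpoint-and-resume pattern described above can be sketched roughly as follows. This is a standard-library illustration, not the code in this repository: `fake_embed`, `build_index`, `merge`, `open_index`, the batch file naming, and the raw float32 file layout are all assumptions made for the example. A real run would use an embedding model in place of `fake_embed`.

```python
import mmap
import tempfile
from array import array
from pathlib import Path

DIM = 4  # toy embedding size (assumption; real models use hundreds of dimensions)


def fake_embed(text: str) -> list[float]:
    """Hypothetical stand-in for a real embedding model."""
    return [float((len(text) + i) % 7) for i in range(DIM)]


def build_index(docs: list[str], ckpt_dir: Path, batch_size: int = 2) -> int:
    """Embed docs in batches, skipping batches that already have a checkpoint.

    Returns the number of batches encoded during this call.
    """
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    encoded = 0
    for start in range(0, len(docs), batch_size):
        path = ckpt_dir / f"batch_{start:08d}.bin"
        if path.exists():
            continue  # finished in an earlier run, so resume past it
        vectors = array("f")  # float32 values, stored back to back
        for doc in docs[start:start + batch_size]:
            vectors.extend(fake_embed(doc))
        tmp = path.with_suffix(".tmp")
        with open(tmp, "wb") as fh:
            vectors.tofile(fh)
        tmp.replace(path)  # rename last, so a half-written batch is never counted
        encoded += 1
    return encoded


def merge(ckpt_dir: Path, out_path: Path) -> None:
    """Concatenate all checkpoint batches, in order, into one index file."""
    with open(out_path, "wb") as out:
        for path in sorted(ckpt_dir.glob("batch_*.bin")):
            out.write(path.read_bytes())


def open_index(path: Path) -> memoryview:
    """Memory-map the index and view it as float32 values without reading it all."""
    with open(path, "rb") as fh:
        mapped = mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ)
    return memoryview(mapped).cast("f")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp_dir:
        root = Path(tmp_dir)
        docs = ["a", "bb", "ccc", "dddd", "eeeee"]
        print(build_index(docs, root / "ckpt"))  # encodes every batch
        print(build_index(docs, root / "ckpt"))  # nothing left to encode
        merge(root / "ckpt", root / "index.bin")
        print(len(open_index(root / "index.bin")) // DIM, "vectors")
```

Writing each batch to a temporary name and renaming it only once complete means an interruption mid-write leaves no partial checkpoint behind, so a rerun simply redoes that batch.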

To skip the multi-hour encoding step, download the pre-computed embeddings from Hugging Face, place the JSON and .npy files in data/checkpoints/, and then run the semantic search command above.

If you'd like to download the dataset separately (e.g. before a demo):

uv run python download.py

To get higher download rate limits, set a Hugging Face token:

export HF_TOKEN=hf_...

Run from an interactive console:

uv run ipython

In [1]: run run.py
In [2]: index.search('python programming language', rank=True)[:5]
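For readers who want a feel for what a call like `index.search(..., rank=True)` does conceptually, here is a toy inverted index with TF-IDF ranking. It is a sketch under assumptions, not this repository's implementation: the `ToyIndex` class, the `tokenize` helper, the scoring formula, and the return types are all made up for illustration. Only the `search(query, rank=True)` call shape mirrors the console session above.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text: str) -> list[str]:
    """Lowercase and split on non-alphanumerics (a deliberately naive analyzer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


class ToyIndex:
    """Hypothetical minimal inverted index; all names here are illustrative."""

    def __init__(self) -> None:
        self.postings: dict[str, set[int]] = defaultdict(set)  # token -> doc ids
        self.term_counts: dict[int, Counter] = {}  # doc id -> token frequencies

    def add(self, doc_id: int, text: str) -> None:
        tokens = tokenize(text)
        self.term_counts[doc_id] = Counter(tokens)
        for token in tokens:
            self.postings[token].add(doc_id)

    def search(self, query: str, rank: bool = False) -> list:
        tokens = tokenize(query)
        if not tokens:
            return []
        # AND semantics: a document must contain every query token.
        matches = set.intersection(*(self.postings.get(t, set()) for t in tokens))
        if not rank:
            return sorted(matches)
        scored = [(doc_id, self._score(doc_id, tokens)) for doc_id in matches]
        return sorted(scored, key=lambda pair: pair[1], reverse=True)

    def _score(self, doc_id: int, tokens: list[str]) -> float:
        # Sum of term frequency times inverse document frequency per query token.
        total_docs = len(self.term_counts)
        return sum(
            self.term_counts[doc_id][t] * math.log10(total_docs / len(self.postings[t]))
            for t in tokens
        )


if __name__ == "__main__":
    index = ToyIndex()
    index.add(1, "Python is a programming language")
    index.add(2, "Python snakes are not a programming language, but python python")
    index.add(3, "Rust is a programming language too")
    index.add(4, "Cooking recipes")
    for doc_id, score in index.search("python programming language", rank=True)[:5]:
        print(doc_id, round(score, 3))
```

Tokens that appear in fewer documents carry more weight, so a document that mentions a rare query term several times rises above one that mentions it once.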

Development

Lint and type check:

uv run ruff check .
uv run mypy search/

Run tests:

uv run pytest -v