A simple, readable, and extensible search engine implementation in Python. Designed to be a clean reference for understanding how search engines work.
- Crawler: Fetches pages and extracts text.
- Indexer: Normalizes text and builds an inverted index.
- Search: Ranks results using Term Frequency (TF).
- Storage: Uses local JSON files for simplicity.
- `crawler.py`: Fetches URLs and parses HTML.
- `indexer.py`: Tokenizes text, removes stopwords, and manages the index.
- `search.py`: Implements the search ranking logic.
- `main.py`: The CLI runner.
No external dependencies are required. Just ensure you have Python 3 installed.
Create a text file (e.g., seed_urls.txt) with URLs to index:
```
https://www.python.org/
https://en.wikipedia.org/wiki/Search_engine
```
Run the crawler:
```
python main.py crawl seed_urls.txt
```

This will create `index_data.json` and `metadata.json`.
Run a search query:
```
python main.py search "python"
```

- Crawling: The crawler downloads HTML, strips tags, and extracts raw text.
- Indexing: Text is split into tokens. Punctuation is removed, and common "stopwords" (like 'the', 'is') are filtered out.
- Inverted Index: We build a map where keys are words and values are lists of documents containing those words (along with frequency).
- Searching: When you search, the engine looks up your keywords in the inverted index, finds matching documents, and adds up the occurrence counts (Term Frequency) to score them.
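The pipeline above can be sketched in a few lines. This is a minimal illustration of tokenization, inverted-index construction, and TF scoring, not the repository's actual code; the function names, stopword list, and sample documents are assumptions:

```python
from collections import defaultdict

STOPWORDS = {"the", "is", "a", "an", "of", "and", "to", "in"}

def tokenize(text):
    # Lowercase, replace punctuation with spaces, drop stopwords.
    words = "".join(c if c.isalnum() else " " for c in text.lower()).split()
    return [w for w in words if w not in STOPWORDS]

def build_index(docs):
    # Inverted index: word -> {doc_id: term frequency}.
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token][doc_id] = index[token].get(doc_id, 0) + 1
    return index

def search(index, query):
    # Sum term frequencies over all query words, highest score first.
    scores = defaultdict(int)
    for token in tokenize(query):
        for doc_id, tf in index.get(token, {}).items():
            scores[doc_id] += tf
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical sample documents.
docs = {
    "doc1": "Python is a programming language. Python is popular.",
    "doc2": "Search engines index the web.",
}
index = build_index(docs)
print(search(index, "python"))  # [('doc1', 2)]
```

The real implementation persists the index to `index_data.json` instead of keeping it in memory, but the lookup and scoring logic are the same shape.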
- Better Ranking: Implement TF-IDF (Term Frequency-Inverse Document Frequency) to downweight common terms.
- Advanced Crawler: Add link extraction to visit new pages recursively (BFS).
- Stemming: Use a library like `nltk` to reduce words to their root (e.g., "running" -> "run").
- Web Interface: Build a simple Flask/Django app to serve results.
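As a sketch of the TF-IDF idea from the first item above (hypothetical names, not part of the repository; the index here is a plain `word -> {doc_id: tf}` dict and `num_docs` is the collection size):

```python
import math
from collections import defaultdict

def tf_idf_search(index, query_tokens, num_docs):
    # idf = log(N / df) downweights terms that appear in many documents,
    # so rare query terms contribute more to the score than common ones.
    scores = defaultdict(float)
    for token in query_tokens:
        postings = index.get(token, {})
        if not postings:
            continue
        idf = math.log(num_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy index: "python" occurs in two of three docs, "rare" in one.
toy_index = {"python": {"d1": 2, "d2": 1}, "rare": {"d1": 1}}
print(tf_idf_search(toy_index, ["python", "rare"], num_docs=3))
```

Note that a term present in every document gets `idf = log(1) = 0` and is effectively ignored, which is exactly the downweighting the plain TF ranking lacks.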