A high-performance, from-scratch search engine capable of indexing and querying millions of Wikipedia articles. This project implements advanced data structures and ranking algorithms in C++ to provide a complete search experience with real-time suggestions and authority-based ranking.
- Custom Search Engine: Built from the ground up in C++ without external search libraries (like Lucene or Whoosh).
- Hybrid Ranking: Combines BM25 (Best Matching 25) for content relevance and PageRank for link-based authority.
- Real-Time Autocomplete: An optimized Trie (Prefix Tree) structure that provides sub-millisecond query suggestions.
- Distributed Indexing: A multi-step "MapReduce-style" pipeline that shards raw data, builds partial indices, and merges them into a single binary index.
- Modern GUI: A dedicated desktop application built with Qt 6 for an intuitive search experience.
- Cross-Language Stemming: Integrated Python-based Porter Stemmer for accurate word normalization.
The engine is divided into several specialized components:
splitter.cpp: Shards raw Wikipedia JSONL data into manageable chunks based on term hashes.indexer_shard.cpp: Processes individual shards to create partial inverted indices.mergebins.cpp: Merges all partial indices into a final, high-performanceindex.bin.
pageRank.cpp: Processes millions of Wikipedia page links using the iterative PageRank algorithm to assign "authority" scores to every page.search.cpp: The core query processor. It calculates BM25 scores and combines them with PageRank values for the final result list.
autoCompleteTrie.cpp: Builds a binary-serialized Trie structure from the entire Wikipedia title set, enabling instant "search-as-you-type" functionality.
- Primary Language: C++17/20
- Interface: Qt 6.9 Framework
- Data Processing: Python (NLTK for stemming, BeautifulSoup for parsing)
- Data Formats: Binary (Custom serialization), JSONL, CSV
- Algorithms: BM25, PageRank, Trie DFS, Hash-based Sharding
- C++ Compiler (GCC/Clang/MinGW)
- Python 3.x (with
nltkinstalled) - Qt 6 (for the GUI)
- Build the Index:
# Run the indexing pipeline ./splitter ./indexer_shard <shard_id> ./mergebins
- Build the Autocomplete Trie:
./autoCompleteTrie build
- Run the Search Console:
./search
- Launch the GUI: Open
WikiEngineGUI/CMakeLists.txtin Qt Creator and build.
├── WikiEngineGUI/ # Qt-based Desktop Application
├── BinsAndTxtx/ # Final binary index files
├── data_files/ # Raw and processed Wikipedia dumps
├── Parsers/ # Python scripts for data preparation
├── search.cpp # Main query engine logic
├── pageRank.cpp # Link authority calculator
└── autoCompleteTrie.cpp # Real-time prefix suggestions