🔍 Mini Google — Document Search Engine

A high-performance, console-based document search engine built in pure Java — implementing the same core indexing logic used by large-scale search engines like Google.

📌 What This Project Does

Given a directory of text files, this engine:

Crawls all .txt files recursively across nested folders
Tokenizes raw text — normalizing case, stripping punctuation, and handling whitespace
Indexes every word using an Inverted Index data structure
Ranks search results by term frequency — most relevant files appear first
Serves queries in O(1) average time via HashMap-based lookups

Think of it as a local, offline version of a search engine that can index and search your own documents instantly.

⚙️ Core Data Structure — Inverted Index

The heart of this project is the Inverted Index, implemented as a nested HashMap:

HashMap<String, HashMap<String, Integer>>
       │                │         │
    keyword          filename   frequency

Example:

"gravity" → {
    "formation.txt" → 11,
    "intro.txt"     → 11,
    "culture.txt"   → 7,
    "physics.txt"   → 4,
    "eht.txt"       → 1
}

This mirrors how real-world search engines build their posting lists — mapping terms to the documents that contain them, along with relevance signals (here, term frequency).

🏗️ Architecture

The project follows a clean, single-responsibility design. Each class has one job:

┌────────────────────────────────────────────────────────┐
│                        Main.java                       │
│          Entry point · User I/O · Orchestration        │
└────────────────────────┬───────────────────────────────┘
                         │
           ┌─────────────▼──────────────┐
           │       SearchEngine.java     │
           │   Ties all modules together │
           └──┬───────────┬─────────────┘
              │           │
   ┌──────────▼──┐   ┌────▼──────────────┐
   │ FileLoader  │   │  InvertedIndex    │
   │  .java      │   │  .java            │
   │ Recursive   │   │ HashMap<String,   │
   │ dir scan +  │   │  HashMap<String,  │
   │ file reader │   │   Integer>>       │
   └──────────┬──┘   └────▲──────────────┘
              │           │
         ┌────▼───────────┴──┐
         │   Tokenizer.java  │
         │ Lowercase · strip │
         │ punctuation ·     │
         │ split tokens      │
         └───────────────────┘

Class	Responsibility
`Main.java`	Entry point, manages user interaction loop
`FileLoader.java`	Recursively walks directories, reads `.txt` file content
`Tokenizer.java`	Normalizes text: lowercase, removes special characters, splits tokens
`InvertedIndex.java`	Builds and stores the `word → {file → frequency}` mapping
`SearchEngine.java`	Accepts a query, fetches from the index, sorts and formats results

🚀 Performance

Metric	Value
Indexing Speed	~5 files indexed in 20 ms
Vocabulary (sample data)	1,330 unique words
Lookup Time Complexity	O(1) average (HashMap)
Indexing Time Complexity	O(N × M) where N = files, M = avg tokens per file
Space Complexity	O(V × D) where V = vocabulary size, D = number of documents

📂 Project Structure

Document-Search-Engine-using-Inverted-Index/
│
├── src/
│   └── com/
│       └── minigoogle/
│           ├── Main.java            # Application entry point
│           ├── FileLoader.java      # Recursive file crawling & reading
│           ├── Tokenizer.java       # Text normalization & tokenization
│           ├── InvertedIndex.java   # Core inverted index data structure
│           └── SearchEngine.java    # Query handling & ranked result output
│
├── sample_data/
│   └── black_hole_data/             # Sample .txt dataset for testing
│       ├── intro.txt
│       ├── formation.txt
│       ├── physics.txt
│       ├── culture.txt
│       └── eht.txt
│
├── out/                             # Compiled .class files
└── README.md

🖥️ Getting Started

Prerequisites

Java Development Kit (JDK) 17 or higher
Any terminal / command-line interface

1. Clone the Repository

git clone https://github.com/ShreeKumbhar/Document-Search-Engine-using-Inverted-Index.git
cd Document-Search-Engine-using-Inverted-Index

2. Compile

javac -d out src/com/minigoogle/*.java

3. Run

java -cp out com.minigoogle.Main

4. Use It

=========================================
   Mini Google - File Based Search Engine
=========================================

Enter directory path to index: /your/path/to/sample_data/black_hole_data

[INFO] Starting Indexing Process...
[INFO] Indexing Complete. Indexed 5 files in 20 ms.
[INFO] Vocabulary Size: 1330 words.

Search (type ':q' to exit): gravity

Found 5 result(s):
 - File: formation.txt | Frequency: 11
 - File: intro.txt     | Frequency: 11
 - File: culture.txt   | Frequency: 7
 - File: physics.txt   | Frequency: 4
 - File: eht.txt       | Frequency: 1

💡 Quick Test: Use the included sample_data/black_hole_data directory — just provide its absolute path when prompted.

🧠 Key Concepts Demonstrated

Concept	Application
Inverted Index	Core data structure — maps terms to documents with frequencies
HashMap & Nested Maps	O(1) average lookup time for keyword queries
Recursive File I/O	Walking nested directory trees to find all `.txt` files
String Processing	Regex-based tokenization, normalization pipeline
Frequency-Based Ranking	Results sorted by term frequency as a relevance signal
Modular OOP Design	Clean separation of concerns across 5 single-responsibility classes
Time & Space Complexity Analysis	Documented and justified for indexing and query phases

🔮 Planned Improvements

Phrase Search — Query for exact phrases like "black hole"
Boolean Operators — Support AND, OR, NOT in queries
Index Persistence — Save index to disk (JSON/binary) and reload without re-indexing
Multi-format Support — Extend to PDF, DOCX, and CSV files
TF-IDF Ranking — Replace raw frequency with TF-IDF for smarter relevance
Web Interface — Expose search via a lightweight REST API or browser UI

🛠️ Tech Stack

Language: Java 17
Data Structures: HashMap, ArrayList, List
I/O: java.nio.file (NIO) for recursive file traversal
Build: Manual javac compilation (no external build tools required)

👤 Author

Shree Kumbhar

GitHub: @ShreeKumbhar

📄 License

This project is open source and available for educational and personal use.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

🔍 Mini Google — Document Search Engine

📌 What This Project Does

⚙️ Core Data Structure — Inverted Index

🏗️ Architecture

🚀 Performance

📂 Project Structure

🖥️ Getting Started

Prerequisites

1. Clone the Repository

2. Compile

3. Run

4. Use It

🧠 Key Concepts Demonstrated

🔮 Planned Improvements

🛠️ Tech Stack

👤 Author

📄 License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
out		out
sample_data		sample_data
src/com/minigoogle		src/com/minigoogle
README.md		README.md

Folders and files

Latest commit

History

Repository files navigation

🔍 Mini Google — Document Search Engine

📌 What This Project Does

⚙️ Core Data Structure — Inverted Index

🏗️ Architecture

🚀 Performance

📂 Project Structure

🖥️ Getting Started

Prerequisites

1. Clone the Repository

2. Compile

3. Run

4. Use It

🧠 Key Concepts Demonstrated

🔮 Planned Improvements

🛠️ Tech Stack

👤 Author

📄 License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages