Hifa is a high-performance, parallel text indexing and search tool for large text corpora. It uses an FM-index for efficient searching and LZ4 compression (via `lz4_flex`) to keep the index small on disk. The project is written in Rust and uses Rayon to spread work across CPU cores.
- Fast Indexing: Hifa can quickly index large directories of text files.
- Parallel Processing: Leveraging the Rayon library, Hifa uses all available CPU cores for both indexing and searching.
- Memory Efficient: Files are processed in chunks, allowing Hifa to handle massive datasets with a configurable memory footprint.
- Compressed Indexes: Using `lz4_flex`, index files are highly compressed to save disk space.
- Substring and Regex Search: Hifa supports both simple substring and complex regular expression searches.
- Cross-Platform: Built with Rust, Hifa can be compiled and run on Windows, macOS, and Linux.
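
To give a feel for how an FM-index answers substring queries, here is a minimal, dependency-free sketch of backward search. It is not Hifa's implementation: `ToyFmIndex` and its methods are hypothetical names, the suffix array is built with a naive sort, and occurrence counts are computed by a linear scan, where a real index would precompute rank tables. It also assumes the text contains no zero byte, which is used as the sentinel.

```rust
use std::collections::BTreeMap;

/// Toy FM-index (hypothetical, for illustration): the BWT of the text plus
/// the C table (for each byte, how many text bytes are strictly smaller).
struct ToyFmIndex {
    bwt: Vec<u8>,
    c: BTreeMap<u8, usize>,
}

impl ToyFmIndex {
    /// Builds the index with a naive suffix sort; suitable for small demos only.
    fn new(text: &str) -> Self {
        let mut t = text.as_bytes().to_vec();
        t.push(0); // sentinel, assumed not to occur in the text
        let n = t.len();

        // Suffix array: start positions of all suffixes in sorted order.
        let mut sa: Vec<usize> = (0..n).collect();
        sa.sort_by(|&a, &b| t[a..].cmp(&t[b..]));

        // BWT: the byte preceding each sorted suffix (cyclically).
        let bwt: Vec<u8> = sa.iter().map(|&i| t[(i + n - 1) % n]).collect();

        // C table: cumulative counts of bytes in sorted order.
        let mut counts: BTreeMap<u8, usize> = BTreeMap::new();
        for &b in &t {
            *counts.entry(b).or_insert(0) += 1;
        }
        let mut c = BTreeMap::new();
        let mut total = 0;
        for (&b, &k) in &counts {
            c.insert(b, total);
            total += k;
        }
        ToyFmIndex { bwt, c }
    }

    /// Occurrences of `byte` in `bwt[..i]` (linear scan in this sketch).
    fn occ(&self, byte: u8, i: usize) -> usize {
        self.bwt[..i].iter().filter(|&&b| b == byte).count()
    }

    /// Counts occurrences of a non-empty `pattern` via backward search.
    fn count(&self, pattern: &str) -> usize {
        // Half-open range [lo, hi) of suffix-array rows matching the suffix
        // of the pattern processed so far.
        let (mut lo, mut hi) = (0, self.bwt.len());
        for &b in pattern.as_bytes().iter().rev() {
            let base = match self.c.get(&b) {
                Some(&base) => base,
                None => return 0, // byte never appears in the text
            };
            lo = base + self.occ(b, lo);
            hi = base + self.occ(b, hi);
            if lo >= hi {
                return 0;
            }
        }
        hi - lo
    }
}

fn main() {
    let index = ToyFmIndex::new("abracadabra");
    println!("abra: {}", index.count("abra"));
    println!("xyz: {}", index.count("xyz"));
}
```

The search cost depends on the pattern length rather than the text length (once `occ` is backed by precomputed tables), which is the property that makes this structure attractive for large corpora.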
- Rust programming language and Cargo package manager. You can install them from rustup.rs.
- Clone the repository:

  ```bash
  git clone https://github.com/your-username/hifa.git
  cd hifa
  ```

- Build the project in release mode for optimal performance:

  ```bash
  cargo build --release
  ```

  The binaries, `hifa-index` and `hifa-search`, will be available in the `./target/release/` directory.
Hifa has two main commands: `hifa-index` for creating an index and `hifa-search` for querying it.
To index a directory of text files, use the `hifa-index` command:
```bash
./target/release/hifa-index --input <path/to/your/text_files> --output <path/to/your/index_directory>
```
Options:
- `-i`, `--input <DIR>`: The directory containing the text files to index.
- `-o`, `--output <DIR>`: The directory where the index files will be stored.
- `-c`, `--chunk-size <SIZE>`: The size of each text chunk in megabytes (default: 256).
- `-t`, `--threads <NUM>`: The number of threads to use for parallel processing (default: number of CPU cores).
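
As a rough illustration of what the chunk size controls, the sketch below splits an input of a given total size into byte ranges of at most the chunk size. `plan_chunks` is a hypothetical helper, not part of Hifa, and it assumes one megabyte means 1024 × 1024 bytes; Hifa's actual chunking (for example, how it treats file boundaries) may differ.

```rust
/// Splits `total_bytes` of input into half-open byte ranges of at most
/// `chunk_size_mb` megabytes each (assumed here to be 1024 * 1024 bytes).
/// Hypothetical helper for illustration, not Hifa's code.
fn plan_chunks(total_bytes: u64, chunk_size_mb: u64) -> Vec<(u64, u64)> {
    assert!(chunk_size_mb > 0, "chunk size must be positive");
    let chunk_bytes = chunk_size_mb * 1024 * 1024;
    let mut ranges = Vec::new();
    let mut start = 0;
    while start < total_bytes {
        // The last chunk may be shorter than the others.
        let end = (start + chunk_bytes).min(total_bytes);
        ranges.push((start, end));
        start = end;
    }
    ranges
}

fn main() {
    // 600 MB of text with the default chunk size of 256.
    let ranges = plan_chunks(600 * 1024 * 1024, 256);
    for (i, (start, end)) in ranges.iter().enumerate() {
        println!("chunk {i}: bytes {start}..{end}");
    }
}
```

A smaller chunk size lowers peak memory per worker at the cost of more index files; a larger one does the opposite.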
Once the index is created, you can search it using the `hifa-search` command:
```bash
./target/release/hifa-search --index <path/to/your/index_directory> --pattern "your search query"
```
Options:
- `-i`, `--index <DIR>`: The directory containing the Hifa index.
- `-p`, `--pattern <PATTERN>`: The search pattern.
- `-r`, `--regex`: Treat the pattern as a regular expression.
- `-m`, `--max-results <NUM>`: The maximum number of results to display (default: 1000).
- `-c`, `--context <NUM>`: The number of context lines to show around each match (default: 0).
- `--no-color`: Disable colored output.
- `--case-insensitive`: Perform a case-insensitive search.
- `--stats`: Show detailed search statistics.
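
The interaction between the context, result-limit, and case-insensitivity options can be sketched with a plain line scan. This is only a model of the behavior described above, not Hifa's code: `Hit` and `search_lines` are hypothetical names, and it assumes that context means the same number of lines before and after the match and that case-insensitive matching is done by lowercasing both sides.

```rust
/// One match: the 1-based line number and the lines shown for it
/// (the matching line plus surrounding context). Hypothetical type.
#[derive(Debug, PartialEq)]
struct Hit {
    line_no: usize,
    lines: Vec<String>,
}

/// Scans `text` line by line for `pattern`, stopping after `max_results` hits.
fn search_lines(
    text: &str,
    pattern: &str,
    case_insensitive: bool,
    context: usize,
    max_results: usize,
) -> Vec<Hit> {
    let lines: Vec<&str> = text.lines().collect();
    let needle = if case_insensitive {
        pattern.to_lowercase()
    } else {
        pattern.to_string()
    };

    let mut hits = Vec::new();
    for (i, line) in lines.iter().enumerate() {
        if hits.len() >= max_results {
            break;
        }
        let matched = if case_insensitive {
            line.to_lowercase().contains(&needle)
        } else {
            line.contains(&needle)
        };
        if matched {
            // Clamp the context window to the bounds of the text.
            let start = i.saturating_sub(context);
            let end = (i + context + 1).min(lines.len());
            hits.push(Hit {
                line_no: i + 1,
                lines: lines[start..end].iter().map(|s| s.to_string()).collect(),
            });
        }
    }
    hits
}

fn main() {
    let text = "Alpha\nbeta\nGamma alpha\ndelta";
    for hit in search_lines(text, "alpha", true, 1, 1000) {
        println!("line {}: {:?}", hit.line_no, hit.lines);
    }
}
```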
Hifa's workflow is divided into two main phases:

- Indexing: The `hifa-index` tool scans the input directory and divides the text files into chunks of a specified size. It then processes these chunks in parallel, creating a compressed index file (`.fmi`) for each one. A manifest file (`manifest.bin`) is also created to keep track of the chunks and other metadata.
- Searching: The `hifa-search` tool loads the manifest and performs a parallel search across all indexed chunks for the given query. Results are aggregated, sorted, and displayed to the user.

This approach allows Hifa to scale with the number of available cores and efficiently search through massive amounts of text.
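
The search phase described above (search every chunk in parallel, then aggregate and sort) can be sketched as follows. This is a simplified stand-in, not Hifa's code: it uses scoped threads from the standard library instead of Rayon to stay dependency-free, spawns one thread per chunk rather than using a thread pool, and scans raw chunk text with `str::match_indices` where Hifa would query each chunk's FM-index. `parallel_search` is a hypothetical name.

```rust
use std::thread;

/// Searches each chunk on its own thread and returns sorted
/// (chunk index, byte offset) pairs for every non-overlapping match.
fn parallel_search(chunks: &[&str], pattern: &str) -> Vec<(usize, usize)> {
    let mut results: Vec<(usize, usize)> = thread::scope(|scope| {
        // Fan out: one worker per chunk.
        let handles: Vec<_> = chunks
            .iter()
            .enumerate()
            .map(|(chunk_id, chunk)| {
                scope.spawn(move || {
                    chunk
                        .match_indices(pattern)
                        .map(|(offset, _)| (chunk_id, offset))
                        .collect::<Vec<_>>()
                })
            })
            .collect();

        // Fan in: gather the per-chunk results into one list.
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("search thread panicked"))
            .collect()
    });

    // Sort so the output order does not depend on thread scheduling.
    results.sort();
    results
}

fn main() {
    let chunks = ["foo bar foo", "bar", "xfoo"];
    for (chunk_id, offset) in parallel_search(&chunks, "foo") {
        println!("chunk {chunk_id}, offset {offset}");
    }
}
```

Because each chunk is searched independently, the work divides cleanly across cores; the only shared step is the final merge and sort.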