lsiddd/hifa

Hifa

Hifa is a parallel text indexing and search tool for large text corpora, written in Rust. It builds an FM-index over the text for fast substring search, compresses its index files with LZ4 (via lz4_flex) to keep the disk footprint small, and uses Rayon to spread indexing and searching across all available CPU cores.


Features

  • Fast Indexing: Hifa can quickly index large directories of text files.

  • Parallel Processing: Leveraging the Rayon library, Hifa uses all available CPU cores for both indexing and searching.

  • Memory Efficient: Files are processed in chunks, allowing Hifa to handle massive datasets with a configurable memory footprint.

  • Compressed Indexes: Index files are compressed with lz4_flex to save disk space.

  • Substring and Regex Search: Hifa supports both simple substring and complex regular expression searches.

  • Cross-Platform: Built with Rust, Hifa can be compiled and run on Windows, macOS, and Linux.
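
For readers unfamiliar with the FM-index named in the overview, the following is a minimal textbook sketch of how backward search counts substring occurrences. This README does not say which FM-index crate or construction Hifa uses, so everything here (the `FmIndex` type, the naive suffix sort, the linear-scan `occ`) is an illustrative assumption rather than Hifa's implementation.

```rust
/// Minimal FM-index sketch over bytes (hypothetical type, not Hifa's).
/// A sentinel byte 0 is appended, so the input must not contain 0.
struct FmIndex {
    /// Burrows-Wheeler transform of the text plus sentinel.
    bwt: Vec<u8>,
    /// c_table[b] = number of bytes in the text (sentinel included)
    /// that are strictly smaller than b.
    c_table: [usize; 256],
}

impl FmIndex {
    fn new(text: &[u8]) -> Self {
        assert!(!text.contains(&0), "text must not contain the sentinel byte 0");
        let mut t = text.to_vec();
        t.push(0);

        // Naive suffix array: sort suffix start positions lexicographically.
        // Real tools use far faster construction algorithms.
        let mut sa: Vec<usize> = (0..t.len()).collect();
        sa.sort_by(|&a, &b| t[a..].cmp(&t[b..]));

        // BWT: the byte that precedes each sorted suffix (wrapping around).
        let n = t.len();
        let bwt: Vec<u8> = sa.iter().map(|&i| t[(i + n - 1) % n]).collect();

        // Cumulative counts of smaller bytes.
        let mut counts = [0usize; 256];
        for &b in &t {
            counts[b as usize] += 1;
        }
        let mut c_table = [0usize; 256];
        let mut sum = 0;
        for (slot, &count) in c_table.iter_mut().zip(counts.iter()) {
            *slot = sum;
            sum += count;
        }

        FmIndex { bwt, c_table }
    }

    /// Number of occurrences of `c` in `bwt[..i]`. A linear scan keeps the
    /// sketch short; practical indexes store sampled rank tables instead.
    fn occ(&self, c: u8, i: usize) -> usize {
        self.bwt[..i].iter().filter(|&&b| b == c).count()
    }

    /// Backward search: narrow the half-open suffix-array interval [lo, hi)
    /// one pattern byte at a time, from the last byte to the first.
    fn count(&self, pattern: &[u8]) -> usize {
        let (mut lo, mut hi) = (0, self.bwt.len());
        for &c in pattern.iter().rev() {
            lo = self.c_table[c as usize] + self.occ(c, lo);
            hi = self.c_table[c as usize] + self.occ(c, hi);
            if lo >= hi {
                return 0;
            }
        }
        hi - lo
    }
}
```

With this sketch, `FmIndex::new(b"abracadabra").count(b"abra")` evaluates to 2. The number of interval updates depends only on the pattern length, not on the size of the text, which is what makes the structure attractive for large corpora.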


Getting Started

Prerequisites

  • Rust programming language and Cargo package manager. You can install them from rustup.rs.

Building the Project

  1. Clone the repository:

    git clone https://github.com/lsiddd/hifa.git
    cd hifa
    
  2. Build the project in release mode for optimal performance:

    cargo build --release
    

    The binaries, hifa-index and hifa-search, will be available in the ./target/release/ directory.


Usage

Hifa has two main commands: hifa-index for creating an index and hifa-search for querying it.

Indexing

To index a directory of text files, use the hifa-index command:

./target/release/hifa-index --input <path/to/your/text_files> --output <path/to/your/index_directory>

Options:

  • -i, --input <DIR>: The directory containing the text files to index.

  • -o, --output <DIR>: The directory where the index files will be stored.

  • -c, --chunk-size <SIZE>: The size of each text chunk in megabytes (default: 256).

  • -t, --threads <NUM>: The number of threads to use for parallel processing (default: number of CPU cores).
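
To give a feel for what `--chunk-size` implies, here is a small sketch of how a corpus of a given size could be divided into chunk ranges. It is not taken from Hifa's source: the `plan_chunks` helper is hypothetical, and the sketch assumes one megabyte means 2^20 bytes, which this README does not specify.

```rust
/// Bytes per megabyte. Assumption: the README does not say whether
/// "megabytes" means 10^6 or 2^20 bytes; this sketch uses 2^20.
const BYTES_PER_MB: u64 = 1024 * 1024;

/// Hypothetical helper: split `total_bytes` of input into
/// (offset, length) ranges of at most `chunk_size_mb` megabytes each.
fn plan_chunks(total_bytes: u64, chunk_size_mb: u64) -> Vec<(u64, u64)> {
    assert!(chunk_size_mb > 0, "chunk size must be positive");
    let chunk_bytes = chunk_size_mb * BYTES_PER_MB;
    let mut ranges = Vec::new();
    let mut offset = 0;
    while offset < total_bytes {
        // The last chunk may be shorter than the configured size.
        let len = chunk_bytes.min(total_bytes - offset);
        ranges.push((offset, len));
        offset += len;
    }
    ranges
}
```

Under these assumptions, a 600 MB corpus with the default chunk size of 256 yields three chunks: two of 256 MB and a final one of 88 MB. Smaller chunks lower the memory needed per worker at the cost of more index files.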

Searching

Once the index is created, you can search it using the hifa-search command:

./target/release/hifa-search --index <path/to/your/index_directory> --pattern "your search query"

Options:

  • -i, --index <DIR>: The directory containing the Hifa index.

  • -p, --pattern <PATTERN>: The search pattern.

  • -r, --regex: Treat the pattern as a regular expression.

  • -m, --max-results <NUM>: The maximum number of results to display (default: 1000).

  • -c, --context <NUM>: The number of context lines to show around each match (default: 0).

  • --no-color: Disable colored output.

  • --case-insensitive: Perform a case-insensitive search.

  • --stats: Show detailed search statistics.
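
The interaction between `--context` and `--max-results` can be sketched as follows. This is a plain line scan written for illustration, not Hifa's search path; the `MatchWithContext` type and `search_with_context` function are hypothetical names, and the sketch assumes that context means the same number of lines before and after each match.

```rust
/// Hypothetical result type: the 1-based line number of a match plus the
/// lines that would be printed for it (the match and its context).
#[derive(Debug, PartialEq)]
struct MatchWithContext<'a> {
    line_number: usize,
    lines: Vec<&'a str>,
}

/// Find lines containing `pattern`, keep at most `max_results` of them, and
/// attach up to `context` lines before and after each matching line.
fn search_with_context<'a>(
    text: &'a str,
    pattern: &str,
    context: usize,
    max_results: usize,
) -> Vec<MatchWithContext<'a>> {
    let lines: Vec<&str> = text.lines().collect();
    lines
        .iter()
        .enumerate()
        .filter(|(_, line)| line.contains(pattern))
        .take(max_results)
        .map(|(i, _)| {
            // Clamp the context window to the bounds of the text.
            let start = i.saturating_sub(context);
            let end = i.saturating_add(context).saturating_add(1).min(lines.len());
            MatchWithContext {
                line_number: i + 1,
                lines: lines[start..end].to_vec(),
            }
        })
        .collect()
}
```

With a context of 0, each result carries only the matching line, which corresponds to the documented default.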


How it Works

Hifa's workflow is divided into two main phases:

  1. Indexing: The hifa-index tool scans the input directory and divides the text files into chunks of a specified size. It then processes these chunks in parallel, creating a compressed index file (.fmi) for each one. A manifest file (manifest.bin) is also created to keep track of the chunks and other metadata.

  2. Searching: The hifa-search tool loads the manifest and performs a parallel search across all indexed chunks for the given query. Results are aggregated, sorted, and displayed to the user. This approach allows Hifa to scale with the number of available cores and efficiently search through massive amounts of text.
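
The two phases above can be sketched in a few lines of dependency-free Rust. Several simplifications are assumed: `std::thread::scope` stands in for Rayon, a plain substring scan stands in for the FM-index, and compression and the manifest are left out entirely. The `Chunk` type and both functions are hypothetical names, not Hifa's API.

```rust
use std::thread;

/// Hypothetical stand-in for one indexed chunk: its id and raw text.
struct Chunk {
    id: usize,
    text: String,
}

/// "Indexing" phase of the sketch: split the corpus on line boundaries into
/// chunks of at most `max_bytes` (a longer single line gets its own chunk).
fn build_chunks(corpus: &str, max_bytes: usize) -> Vec<Chunk> {
    let mut chunks = Vec::new();
    let mut current = String::new();
    for line in corpus.split_inclusive('\n') {
        if !current.is_empty() && current.len() + line.len() > max_bytes {
            let id = chunks.len();
            chunks.push(Chunk { id, text: std::mem::take(&mut current) });
        }
        current.push_str(line);
    }
    if !current.is_empty() {
        let id = chunks.len();
        chunks.push(Chunk { id, text: current });
    }
    chunks
}

/// "Searching" phase of the sketch: scan every chunk on its own thread, then
/// aggregate and sort the hits as (chunk id, byte offset within the chunk).
fn parallel_search(chunks: &[Chunk], pattern: &str) -> Vec<(usize, usize)> {
    let mut hits: Vec<(usize, usize)> = thread::scope(|scope| {
        // Spawn all workers first so they run concurrently.
        let handles: Vec<_> = chunks
            .iter()
            .map(|chunk| {
                scope.spawn(move || {
                    chunk
                        .text
                        .match_indices(pattern)
                        .map(|(offset, _)| (chunk.id, offset))
                        .collect::<Vec<_>>()
                })
            })
            .collect();
        // Join the workers and merge their per-chunk results.
        handles
            .into_iter()
            .flat_map(|handle| handle.join().expect("search thread panicked"))
            .collect()
    });
    hits.sort();
    hits
}
```

Because the sketch cuts chunks on line boundaries, a match that lies within a single line is never split across two chunks. Spawning one thread per chunk is only reasonable for a handful of chunks; a work-stealing pool such as Rayon's bounds the thread count instead.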

About

A high-performance, async-first, and modular full-text search engine library for Rust, built on top of Tantivy.
