Hifa

Hifa is a high-performance, parallel text indexing and search tool for large text corpora. It uses an FM-index for efficient searching and LZ4 compression to keep the index small on disk. The project is written in Rust and builds on Rayon for parallelism and lz4_flex for compression.
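
For readers unfamiliar with FM-indexes, the following is a minimal, self-contained sketch of the underlying idea: counting pattern occurrences with backward search over a Burrows-Wheeler transform. It is not Hifa's implementation. The names `FmIndex`, `burrows_wheeler`, `rank`, and `count` are hypothetical, the suffix sort is naive, and `rank` is a linear scan where a real index would use sampled rank tables.

```rust
/// Burrows-Wheeler transform of `text`, built with a naive suffix sort.
/// A sentinel byte 0 is appended, so `text` must not contain the byte 0.
fn burrows_wheeler(text: &[u8]) -> Vec<u8> {
    let mut s = text.to_vec();
    s.push(0);
    let mut suffixes: Vec<usize> = (0..s.len()).collect();
    suffixes.sort_by(|&a, &b| s[a..].cmp(&s[b..]));
    // Each BWT byte is the byte that precedes the suffix (wrapping around).
    suffixes
        .iter()
        .map(|&i| s[(i + s.len() - 1) % s.len()])
        .collect()
}

/// Hypothetical, unoptimized FM-index used only for illustration.
struct FmIndex {
    bwt: Vec<u8>,
    /// smaller[b] = number of bytes in the indexed text strictly smaller than b.
    smaller: [usize; 256],
}

impl FmIndex {
    fn new(text: &[u8]) -> Self {
        let bwt = burrows_wheeler(text);
        let mut counts = [0usize; 256];
        for &b in &bwt {
            counts[b as usize] += 1;
        }
        let mut smaller = [0usize; 256];
        let mut total = 0;
        for (slot, &n) in smaller.iter_mut().zip(counts.iter()) {
            *slot = total;
            total += n;
        }
        FmIndex { bwt, smaller }
    }

    /// Occurrences of `b` in bwt[..i]. Linear scan, for clarity only.
    fn rank(&self, b: u8, i: usize) -> usize {
        self.bwt[..i].iter().filter(|&&x| x == b).count()
    }

    /// Count occurrences of a non-empty `pattern` using backward search.
    fn count(&self, pattern: &[u8]) -> usize {
        let (mut lo, mut hi) = (0, self.bwt.len());
        // Process the pattern from its last byte to its first.
        for &b in pattern.iter().rev() {
            lo = self.smaller[b as usize] + self.rank(b, lo);
            hi = self.smaller[b as usize] + self.rank(b, hi);
            if lo >= hi {
                return 0;
            }
        }
        hi - lo
    }
}

fn main() {
    let index = FmIndex::new(b"abracadabra");
    println!("abra: {}", index.count(b"abra"));
    println!("cad:  {}", index.count(b"cad"));
    println!("xyz:  {}", index.count(b"xyz"));
}
```

The cost of `count` depends on the pattern length and the cost of `rank`, not on how many times the pattern occurs, which is why this structure suits substring queries over large text.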

Features

Fast Indexing: Hifa can quickly index large directories of text files.
Parallel Processing: Leveraging the Rayon library, Hifa uses all available CPU cores for both indexing and searching.
Memory Efficient: Files are processed in chunks, allowing Hifa to handle massive datasets with a configurable memory footprint.
Compressed Indexes: Index files are compressed with lz4_flex (LZ4) to save disk space.
Substring and Regex Search: Hifa supports both simple substring and complex regular expression searches.
Cross-Platform: Built with Rust, Hifa can be compiled and run on Windows, macOS, and Linux.

Getting Started

Prerequisites

You need the Rust toolchain (the Rust compiler and the Cargo package manager). You can install both from rustup.rs.

Building the Project

Clone the repository:

Bash
```
git clone https://github.com/your-username/hifa.git
cd hifa
```

Build the project in release mode for optimal performance:

Bash
```
cargo build --release
```
The binaries, hifa-index and hifa-search, will be available in the ./target/release/ directory.

Usage

Hifa has two main commands: hifa-index for creating an index and hifa-search for querying it.

Indexing

To index a directory of text files, use the hifa-index command:

Bash
```
./target/release/hifa-index --input <path/to/your/text_files> --output <path/to/your/index_directory>
```

Options:

-i, --input <DIR>: The directory containing the text files to index.
-o, --output <DIR>: The directory where the index files will be stored.
-c, --chunk-size <SIZE>: The size of each text chunk in megabytes (default: 256).
-t, --threads <NUM>: The number of threads to use for parallel processing (default: number of CPU cores).
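
To get a feel for how --chunk-size and --threads relate to a corpus, here is a small sketch that estimates the number of chunks for a given input size and looks up the core count that the default thread setting presumably corresponds to. The helper names `chunk_count` and `default_threads` are hypothetical, and treating one megabyte as 1024 * 1024 bytes is an assumption; this README does not state which definition Hifa uses.

```rust
use std::thread;

/// Assumption: one "megabyte" is 1024 * 1024 bytes.
const BYTES_PER_MB: u64 = 1024 * 1024;

/// Number of chunks needed to cover `total_bytes` of input when each chunk
/// holds at most `chunk_size_mb` megabytes (rounded up).
fn chunk_count(total_bytes: u64, chunk_size_mb: u64) -> u64 {
    assert!(chunk_size_mb > 0, "chunk size must be positive");
    let chunk_bytes = chunk_size_mb * BYTES_PER_MB;
    (total_bytes + chunk_bytes - 1) / chunk_bytes
}

/// Parallelism reported by the standard library, falling back to 1 if the
/// platform cannot report it.
fn default_threads() -> usize {
    thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1)
}

fn main() {
    let corpus_bytes = 10 * 1024 * BYTES_PER_MB; // a 10 GiB corpus
    println!("chunks at 256 MB: {}", chunk_count(corpus_bytes, 256));
    println!("chunks at 64 MB:  {}", chunk_count(corpus_bytes, 64));
    println!("threads: {}", default_threads());
}
```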

Searching

Once the index is created, you can search it using the hifa-search command:

Bash
```
./target/release/hifa-search --index <path/to/your/index_directory> --pattern "your search query"
```

Options:

-i, --index <DIR>: The directory containing the Hifa index.
-p, --pattern <PATTERN>: The search pattern.
-r, --regex: Treat the pattern as a regular expression.
-m, --max-results <NUM>: The maximum number of results to display (default: 1000).
-c, --context <NUM>: The number of context lines to show around each match (default: 0).
--no-color: Disable colored output.
--case-insensitive: Perform a case-insensitive search.
--stats: Show detailed search statistics.
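
The sketch below illustrates one plausible reading of --max-results, --context, and --case-insensitive on a list of lines. It is an assumption about the semantics, not Hifa's code: `matching_lines` and `context_window` are hypothetical helpers, matching is done per line with a plain substring test, and case folding uses simple lowercasing.

```rust
use std::ops::Range;

/// Indices of lines containing `pattern`, stopping after `max_results` hits.
fn matching_lines(
    lines: &[&str],
    pattern: &str,
    case_insensitive: bool,
    max_results: usize,
) -> Vec<usize> {
    let needle = if case_insensitive {
        pattern.to_lowercase()
    } else {
        pattern.to_string()
    };
    lines
        .iter()
        .enumerate()
        .filter(|(_, line)| {
            if case_insensitive {
                line.to_lowercase().contains(needle.as_str())
            } else {
                line.contains(needle.as_str())
            }
        })
        .map(|(i, _)| i)
        .take(max_results)
        .collect()
}

/// Range of line indices to display for a match at `match_line` with
/// `context` lines on each side, clamped to the bounds of the file.
fn context_window(match_line: usize, context: usize, total_lines: usize) -> Range<usize> {
    let start = match_line.saturating_sub(context);
    let end = (match_line + context + 1).min(total_lines);
    start..end
}

fn main() {
    let lines = ["alpha", "Beta", "gamma beta", "delta"];
    for hit in matching_lines(&lines, "beta", true, 1000) {
        let window = context_window(hit, 1, lines.len());
        println!("match on line {}: {:?}", hit, &lines[window]);
    }
}
```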

How it Works

Hifa's workflow is divided into two main phases:

Indexing: The hifa-index tool scans the input directory and divides the text files into chunks of a specified size. It then processes these chunks in parallel, creating a compressed index file (.fmi) for each one. A manifest file (manifest.bin) is also created to keep track of the chunks and other metadata.
Searching: The hifa-search tool loads the manifest and performs a parallel search across all indexed chunks for the given query. Results are aggregated, sorted, and displayed to the user. This approach allows Hifa to scale with the number of available cores and efficiently search through massive amounts of text.
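
The two phases above can be sketched in a few lines of Rust. This is a simplified illustration under stated assumptions, not Hifa's source: it uses scoped threads from the standard library instead of Rayon, a naive substring scan instead of an FM-index, keeps everything in memory with no .fmi files or manifest, and the function names are hypothetical. It also splits at fixed byte boundaries, so a match that straddles two chunks is missed; how Hifa handles that case is not described here.

```rust
use std::thread;

/// Split `text` into consecutive chunks of at most `chunk_size` bytes,
/// paired with the byte offset at which each chunk starts.
fn split_into_chunks(text: &[u8], chunk_size: usize) -> Vec<(usize, &[u8])> {
    assert!(chunk_size > 0, "chunk size must be positive");
    text.chunks(chunk_size)
        .enumerate()
        .map(|(i, chunk)| (i * chunk_size, chunk))
        .collect()
}

/// Naive substring scan of one chunk; returns offsets relative to the chunk.
fn find_in_chunk(chunk: &[u8], pattern: &[u8]) -> Vec<usize> {
    if pattern.is_empty() || pattern.len() > chunk.len() {
        return Vec::new();
    }
    chunk
        .windows(pattern.len())
        .enumerate()
        .filter(|(_, window)| *window == pattern)
        .map(|(i, _)| i)
        .collect()
}

/// Search every chunk on its own thread, then merge and sort the results
/// as offsets into the whole text.
fn parallel_search(text: &[u8], pattern: &[u8], chunk_size: usize) -> Vec<usize> {
    let chunks = split_into_chunks(text, chunk_size);
    let mut hits: Vec<usize> = thread::scope(|scope| {
        let handles: Vec<_> = chunks
            .iter()
            .map(|&(start, chunk)| {
                scope.spawn(move || {
                    find_in_chunk(chunk, pattern)
                        .into_iter()
                        .map(|i| start + i)
                        .collect::<Vec<usize>>()
                })
            })
            .collect();
        handles
            .into_iter()
            .flat_map(|handle| handle.join().expect("worker thread panicked"))
            .collect()
    });
    hits.sort_unstable();
    hits
}

fn main() {
    let text = b"abracadabra abracadabra";
    // One large chunk finds all four occurrences.
    println!("{:?}", parallel_search(text, b"abra", 64));
    // With 8-byte chunks the occurrence at offset 7 crosses a boundary
    // and is not reported by this simplified sketch.
    println!("{:?}", parallel_search(text, b"abra", 8));
}
```

Because each chunk is searched independently, the work divides evenly across threads, and only the final merge and sort are sequential.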
