imagibee/Gigantor


Gigantor

Boosting performance of System.Text.RegularExpressions.Regex

It helps with the following problems:

  • searching gigantic files that exceed RAM
  • replacing matches with new text
  • poor performance due to CPU underutilization
  • an unresponsive main thread
  • searching streams
  • searching compressed data

The approach is to partition the data into chunks which are simultaneously processed on multiple threads. Since processing takes place on worker threads, the main thread remains responsive. Since the chunks are reasonably sized and the whole file does not need to fit into memory, files that exceed RAM can be processed.
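As a rough illustration of the partitioning idea, here is a minimal sketch that is not taken from Gigantor's source. The names `ChunkSketch`, `Partition`, and `CountByte` are hypothetical, and an in-memory byte array stands in for a file so that the example stays self-contained. It splits the data into fixed-size chunks and lets `Parallel.For` hand each chunk to a worker thread.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical sketch of chunked, multi-threaded processing (not Gigantor's code).
public static class ChunkSketch
{
    // Split [0, length) into consecutive (Start, Size) ranges of at most chunkSize bytes.
    public static List<(long Start, int Size)> Partition(long length, int chunkSize)
    {
        if (chunkSize <= 0) throw new ArgumentOutOfRangeException(nameof(chunkSize));
        var chunks = new List<(long Start, int Size)>();
        for (long start = 0; start < length; start += chunkSize)
        {
            chunks.Add((start, (int)Math.Min(chunkSize, length - start)));
        }
        return chunks;
    }

    // Count occurrences of a byte, processing the chunks on worker threads.
    public static long CountByte(byte[] data, byte target, int chunkSize)
    {
        var chunks = Partition(data.Length, chunkSize);
        var counts = new long[chunks.Count];
        Parallel.For(0, chunks.Count, i =>
        {
            var (start, size) = chunks[i];
            long n = 0;
            for (long p = start; p < start + size; p++)
            {
                if (data[p] == target) n++;
            }
            // Each worker writes only its own slot, so no locking is needed.
            counts[i] = n;
        });
        long total = 0;
        foreach (long c in counts) total += c;
        return total;
    }
}
```

Note that `Parallel.For` blocks its caller until all chunks are done. To keep a main thread responsive, a call such as `ChunkSketch.CountByte(data, (byte)'\n', 1 << 20)` could be wrapped in `Task.Run`.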

API

Here is a brief overview of the API. For details refer to the source code or the unit tests.

RegexSearcher uses multi-threading to boost the performance of searches with System.Text.RegularExpressions.Regex compared to the single-threaded approach. It also supports searching gigantic files that exceed RAM, and directly searching streams. Roughly a 4x improvement was measured on the test system.

It uses an overlap to handle matches that fall on partition boundaries. De-duping of the overlap regions is performed automatically at the end of the search so that the final results are free of duplicates.
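The overlap and de-duping steps can be pictured with a small sketch. It is an illustration under stated assumptions, not Gigantor's implementation: `OverlapSketch` and `Search` are hypothetical names, the chunks are searched sequentially over an in-memory string, no match is assumed to be longer than `overlap` characters, and the pattern is assumed not to depend on text outside the chunk (anchors, `\b`, and open-ended greedy patterns can behave differently at chunk edges).

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical sketch of overlapped chunk searching with de-duplication.
public static class OverlapSketch
{
    public static List<(int Position, string Value)> Search(
        string text, Regex regex, int chunkSize, int overlap)
    {
        if (chunkSize <= 0) throw new ArgumentOutOfRangeException(nameof(chunkSize));

        // Search each chunk, extended by `overlap` characters past its boundary
        // so that a match straddling the boundary is still seen whole.
        var all = new List<(int Position, string Value)>();
        for (int start = 0; start < text.Length; start += chunkSize)
        {
            int length = Math.Min(chunkSize + overlap, text.Length - start);
            string chunk = text.Substring(start, length);
            foreach (Match m in regex.Matches(chunk))
            {
                // Convert the chunk-relative index to an absolute position.
                all.Add((start + m.Index, m.Value));
            }
        }

        // De-dupe at the end: matches in an overlap region are reported by two
        // chunks, so keep one entry per absolute position (the longest one).
        var unique = new SortedDictionary<int, string>();
        foreach (var (position, value) in all)
        {
            if (!unique.TryGetValue(position, out var existing)
                || value.Length > existing.Length)
            {
                unique[position] = value;
            }
        }

        var result = new List<(int Position, string Value)>();
        foreach (var pair in unique) result.Add((pair.Key, pair.Value));
        return result;
    }
}
```

With a chunk size of 4 and an overlap of 4, the text `xxxxxabcxxxx` places the match `abc` inside the overlap region, so two chunks report it and the final pass collapses the two reports into one.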

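// Create a regular expression to match urls
System.Text.RegularExpressions.Regex regex = new(
    @"[\w]+://[^/\s?#]+[^\s?#]+(?:\?[^\s#]*)?(?:#[^\s]*)?",
    System.Text.RegularExpressions.RegexOptions.Compiled);

// Create the searcher (progress is assumed to have been created earlier;
// see the source code or the unit tests for its type)
Imagibee.Gigantor.RegexSearcher searcher = new("myfile", regex, progress);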

// Do the search
Imagibee.Gigantor.Background.StartAndWait(searcher, progress, (_) => {});

foreach (var match in searcher.GetMatchData()) {
    // Do something with the matches
}

// Replace all the urls with stackoverflow.com in a new file
using System.IO.FileStream output = System.IO.File.Create("myfile2");
searcher.Replace(output, (match) => { return "https://www.stackoverflow.com"; });

LineIndexer creates a mapping between line numbers and file positions for gigantic files. Once the mapping has been created, it can be used to quickly find the position at the start of a line or the number of the line that contains a position.

// Create the indexer
LineIndexer indexer = new("myfile", progress);

// Do the indexing
Imagibee.Gigantor.Background.StartAndWait(indexer, progress, (_) => {});

// Use indexer to print the middle line
using System.IO.FileStream fs = new("myfile", FileMode.Open);
Imagibee.Gigantor.StreamReader reader = new(fs);
fs.Seek(indexer.PositionFromLine(indexer.LineCount / 2), SeekOrigin.Begin);
Console.WriteLine(reader.ReadLine());
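To make the mapping between line numbers and positions concrete, here is a minimal single-threaded sketch. It is not the library's implementation: `LineIndexSketch` and its helpers are hypothetical names, line numbers are assumed to be 1-based, and lines are assumed to end with `\n`. It records the byte position at which each line starts and then answers both kinds of lookup, using a binary search for the reverse direction.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical sketch of a line-number/position index (not Gigantor's code).
public static class LineIndexSketch
{
    // Record the byte position at which each line starts.
    public static List<long> BuildLineStarts(Stream stream)
    {
        var starts = new List<long> { 0 };
        var buffer = new byte[64 * 1024];
        long position = 0;
        int read;
        while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
        {
            for (int i = 0; i < read; i++)
            {
                // A new line starts right after each newline byte.
                if (buffer[i] == (byte)'\n') starts.Add(position + i + 1);
            }
            position += read;
        }
        // A trailing newline does not start another line.
        if (starts.Count > 1 && starts[starts.Count - 1] == position)
        {
            starts.RemoveAt(starts.Count - 1);
        }
        return starts;
    }

    // Position of the first byte of a 1-based line number.
    public static long PositionFromLine(List<long> starts, int line)
    {
        return starts[line - 1];
    }

    // 1-based number of the line that contains a byte position.
    public static int LineFromPosition(List<long> starts, long position)
    {
        int index = starts.BinarySearch(position);
        // When not found, BinarySearch returns the bitwise complement of the
        // index of the next larger element; the containing line is one before it.
        if (index < 0) index = ~index - 1;
        return index + 1;
    }
}
```

Because the list of line starts is sorted, a lookup by line number is a direct index and a lookup by position costs a logarithmic number of comparisons.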

Input Data

The input data can be either uncompressed files or streams. Files offer better performance than streams, but streams allow searching compressed data without decompressing it to disk first.
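The idea of searching compressed data through a stream can be sketched with standard .NET classes alone. This example deliberately avoids the Gigantor API and makes no claim about how the library handles streams. `CompressedSearchSketch` is a hypothetical name, the data is gzip-compressed in memory, and matches are assumed not to span lines because the text is decompressed and searched one line at a time.

```csharp
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical sketch: search gzip data without writing decompressed output to disk.
public static class CompressedSearchSketch
{
    // Helper that produces gzip-compressed bytes for the example.
    public static byte[] Compress(string text)
    {
        using var buffer = new MemoryStream();
        using (var gzip = new GZipStream(buffer, CompressionMode.Compress, leaveOpen: true))
        {
            byte[] bytes = Encoding.UTF8.GetBytes(text);
            gzip.Write(bytes, 0, bytes.Length);
        }
        // The gzip stream is disposed above, so all compressed data is flushed.
        return buffer.ToArray();
    }

    // Decompress on the fly and collect every match, line by line.
    public static List<string> Search(Stream compressed, Regex regex)
    {
        var matches = new List<string>();
        using var gzip = new GZipStream(compressed, CompressionMode.Decompress);
        using var reader = new StreamReader(gzip, Encoding.UTF8);
        while (reader.ReadLine() is string line)
        {
            foreach (Match m in regex.Matches(line)) matches.Add(m.Value);
        }
        return matches;
    }
}
```

Only one line of decompressed text is held at a time, which is why this style of processing does not need the decompressed data on disk. The trade-off is that the stream has to be read sequentially.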

Examples

Here are some more detailed examples.

Benchmarks

Here is a document that discusses the benchmarks in greater detail.

Testing

Before running the tests, run Scripts/setup to prepare the test files. This script creates some large files in the temporary folder, which are deleted on reboot. Once setup has completed, run Scripts/test.

License

MIT

Versioning

This package uses semantic versioning. Tags on the main branch indicate versions. It is recommended to use a tagged version. The latest version on the main branch should be considered under development when it is not tagged.

Issues

Report and track issues here.

Contributing

Minor changes such as bug fixes are welcome. Simply make a pull request. Please discuss more significant changes prior to making the pull request by opening a new issue that describes the change.
