Skip to content

rvarki/RLZ

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

44 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Relative Lempel-Ziv (RLZ)

ooooooooo.   ooooo         oooooooooooo 
`888   `Y88. `888'        d'""""""d888' 
 888   .d88'  888               .888P   
 888ooo88P'   888              d888'    
 888`88b.     888            .888P      
 888  `88b.   888       o   d888'    .P 
o888o  o888o o888ooooood8 .8888888888P  
                                        v1.1.0

Description

This software computes the Relative Lempel Ziv (RLZ) parse of the target sequence file using a reference file. By default, the software does character-level encoding. However, character-level encoding can fail if the reference file does not contain all the unique characters in the sequence file. We have also provided an option to do the encoding at the bit-level. Doing the encoding this way prevents the issue that occurs when the sequence contains a character that is not present in the reference. However, the parse might be larger than the character-level parse due to encoding the bit representation of the sequence file. The decompression expects that the files were originally ASCII (8 bit) encoded.

The software performs pattern matching by streaming the sequence file char by char (or bit by bit) against the FM-index of the reversed reference. This approach enables the software to simulate forward matching of the sequence file against the reference. The correct reference positions of the matches are obtained by applying an involution to the suffix array positions retrieved from the FM-index (which is built on the reversed reference text), a constant-time operation.

Algorithm Workflow

To compress the target sequence file relative to a reference file, the software follows these steps:

Bit-level encoding:

  1. Convert both the reference and sequence files to their "binary" representation.

Common steps (for all encoding types):

  1. Reverse the reference sequence.

  2. Build an FM-index from the reversed reference (in bit or character form).

  3. Perform "backwards" (forward) match: Match each character or bit of the sequence against the reversed reference using the FM-index's backward matching capabilities (to simulate forward matching).

    3a. If a match is found, check if the next character or bit also matches.

    3b. If a match is found and it's the end of the sequence, apply involution to the suffix array (SA) position, subtract the length of the match from the position, and record the (pos, len) pair.

    3c. If a mismatch occurs, apply involution to the SA position, subtract the length of the match from the position, record the (prev pos, len - 1) pair, and restart the search from the mismatched character or bit.

  4. Write all (pos, len) pairs sequentially to a file. This constitutes the RLZ parse.

Note

The RLZ parse is in reference to the reference file.

Prerequisites

  • CMake 3.15 or higher.
  • GCC 9+
  • C++17-compatible compiler.
  • OpenMP

Getting Started

Building the Project

git clone https://github.com/rvarki/RLZ.git
cd RLZ
git submodule update --init --recursive
mkdir build
cd build
cmake ..
make -j

Running the project

After building the project, an executable named rlz will be created in the build directory. Run it with:

./rlz compress -r [reference file] -s [file to compress] [options] 

Default Compression Example

In this section, we will go through a small example using the default character-level compression. In the data/dna directory, we have provided an example reference and target sequence file that were derived from DNA FASTA files.

  1. To compress the sequence file with character compression, run the following command from the build directory
./rlz compress -r ../data/dna/dna_ref.txt -s ../data/dna/dna_seq.txt

This command will produce the following file in the data/dna directory: dna_seq.txt.rlz. The .rlz file contains the RLZ parse.

Note

Multithreading is supported in the compression step with the -t [num. of threads] option which can significantly make the compression step faster. However, the RLZ parse is slightly different from what you would get if you run with a single thread. The reason is we cannot identify phrases that span where the file was split. Potentially might add an additional thread number of parse entries that would not exist if you ran with a single thread.

  1. To decompress the file, run the following command
./rlz decompress -r ../data/dna/dna_ref.txt -p ../data/dna/dna_seq.txt.rlz 

This command should produce a file called dna_seq.txt.out in the data/dna directory. This is the decompressed sequence file.

  1. Check to see if the file decompressed correctly
diff ../data/dna/dna_seq.txt ../data/dna/dna_seq.txt.out

There should be no output from this command if compressed and decompressed correctly.

Bit Compression Example

In this section, we will go through a small example using the bit-level compression option. In the data/english directory, we have provided an example reference and target sequence file that were derived from the English text in the Pizza&Chili Corpus.

  1. To compress the sequence file, run the following command from the build directory
./rlz compress -r ../data/english/english_ref.txt -s ../data/english/english_seq.txt --bit

This command will produce the following file in the data/english directory: english_seq.txt.rlz. The .rlz file contains the RLZ parse.

  1. To decompress the file, run the following command
./rlz decompress -r ../data/english/english_ref.txt -p ../data/english/english_seq.txt.rlz --bit

This command should produce a file called english_seq.txt.out in the data/english directory. This is the decompressed sequence file.

  1. Check to see if the file decompressed correctly
diff ../data/english/english_seq.txt ../data/english/english_seq.txt.out

There should be no output from this command if compressed and decompressed correctly.

Note

To get more information from the tool. Run the command with --verbose flag.

License

This project is licensed under the GNU License - see the LICENSE file for details

Dependencies

Acknowledgements

About

RLZ - compress sequence files using a reference file

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors