ooooooooo. ooooo oooooooooooo
`888 `Y88. `888' d'""""""d888'
888 .d88' 888 .888P
888ooo88P' 888 d888'
888`88b. 888 .888P
888 `88b. 888 o d888' .P
o888o o888o o888ooooood8 .8888888888P
v1.1.0
This software computes the Relative Lempel Ziv (RLZ) parse of the target sequence file using a reference file. By default, the software does character-level encoding. However, character-level encoding can fail if the reference file does not contain all the unique characters in the sequence file. We have also provided an option to do the encoding at the bit-level. Doing the encoding this way prevents the issue that occurs when the sequence contains a character that is not present in the reference. However, the parse might be larger than the character-level parse due to encoding the bit representation of the sequence file. The decompression expects that the files were originally ASCII (8 bit) encoded.
The software performs pattern matching by streaming the sequence file char by char (or bit by bit) against the FM-index of the reversed reference. This approach enables the software to simulate forward matching of the sequence file against the reference. The correct reference positions of the matches are obtained by applying an involution to the suffix array positions retrieved from the FM-index (which is built on the reversed reference text), a constant-time operation.
To compress the target sequence file relative to a reference file, the software follows these steps:
Bit-level encoding:
- Convert both the reference and sequence files to their "binary" representation.
Common steps (for all encoding types):
-
Reverse the reference sequence.
-
Build an FM-index from the reversed reference (in bit or character form).
-
Perform "backwards" (forward) match: Match each character or bit of the sequence against the reversed reference using the FM-index's backward matching capabilities (to simulate forward matching).
3a. If a match is found, check if the next character or bit also matches.
3b. If a match is found and it's the end of the sequence, apply involution to the suffix array (SA) position, subtract the length of the match from the position, and record the (pos, len) pair.
3c. If a mismatch occurs, apply involution to the SA position, subtract the length of the match from the position, record the (prev pos, len - 1) pair, and restart the search from the mismatched character or bit.
-
Write all (pos, len) pairs sequentially to a file. This constitutes the RLZ parse.
Note
The RLZ parse is in reference to the reference file.
- CMake 3.15 or higher.
- GCC 9+
- C++17-compatible compiler.
- OpenMP
git clone https://github.com/rvarki/RLZ.git
cd RLZ
git submodule update --init --recursive
mkdir build
cd build
cmake ..
make -j
After building the project, an executable named rlz will be created in the build directory. Run it with:
./rlz compress -r [reference file] -s [file to compress] [options]
In this section, we will go through a small example using the default character-level compression. In the data/dna directory, we have provided an example reference and target sequence file that were derived from DNA FASTA files.
- To compress the sequence file with character compression, run the following command from the build directory
./rlz compress -r ../data/dna/dna_ref.txt -s ../data/dna/dna_seq.txt
This command will produce the following file in the data/dna directory: dna_seq.txt.rlz. The .rlz file contains the RLZ parse.
Note
Multithreading is supported in the compression step with the -t [num. of threads] option which can significantly make the compression step faster. However, the RLZ parse is slightly different from what you would get if you run with a single thread. The reason is we cannot identify phrases that span where the file was split. Potentially might add an additional thread number of parse entries that would not exist if you ran with a single thread.
- To decompress the file, run the following command
./rlz decompress -r ../data/dna/dna_ref.txt -p ../data/dna/dna_seq.txt.rlz
This command should produce a file called dna_seq.txt.out in the data/dna directory. This is the decompressed sequence file.
- Check to see if the file decompressed correctly
diff ../data/dna/dna_seq.txt ../data/dna/dna_seq.txt.out
There should be no output from this command if compressed and decompressed correctly.
In this section, we will go through a small example using the bit-level compression option. In the data/english directory, we have provided an example reference and target sequence file that were derived from the English text in the Pizza&Chili Corpus.
- To compress the sequence file, run the following command from the build directory
./rlz compress -r ../data/english/english_ref.txt -s ../data/english/english_seq.txt --bit
This command will produce the following file in the data/english directory: english_seq.txt.rlz. The .rlz file contains the RLZ parse.
- To decompress the file, run the following command
./rlz decompress -r ../data/english/english_ref.txt -p ../data/english/english_seq.txt.rlz --bit
This command should produce a file called english_seq.txt.out in the data/english directory. This is the decompressed sequence file.
- Check to see if the file decompressed correctly
diff ../data/english/english_seq.txt ../data/english/english_seq.txt.out
There should be no output from this command if compressed and decompressed correctly.
Note
To get more information from the tool. Run the command with --verbose flag.
This project is licensed under the GNU License - see the LICENSE file for details
- S. Kuruppu, S. J. Puglisi and J. Zobel, Relative Lempel-Ziv Compression of Genomes for Large-Scale Storage and Retrieval. Proc. 17th International Symposium on String Processing and Information Retrieval (SPIRE 2010) Lecture Notes in Computer Science, Volume 6393, (2010) pp. 201-206.