Efficiently parse transcript metadata from the detected_transcripts.csv file generated by the Vizgen MERFISH pipeline.
To see my original implementation of the fast file parsing script, check out commit 76e6014.
Version 2 of the script is an adaptation of Naveen Naidu's take on The One Billion Row Challenge (1brc). It adds extra run-time reliability, delivers more consistent performance across different hardware setups and operating systems, and avoids the use of unsafe mmap.
The benchmarks below were run on the following system:

- AMD Ryzen 9 9950X, 16 cores, 32 threads
- 64 GB RAM
- SSD: Samsung 990 Pro 2TB
- HDD: Seagate IronWolf Pro 8TB
Test data:

- ~35 GB Vizgen `detected_transcripts.csv` result file from a MERFISH run of 800 gene probes on a mouse brain (horizontal slice).
- The file has ~400 million rows (detected transcripts) and 10 columns.
The script essentially maxes out the IO throughput: >250 MB/s on HDD drives and around 2-2.5 GB/s on SSD drives on a cold start. The data is streamed and processed in chunks (chunk size can be set with the `-c` flag), so memory usage stays low (typically < 1 GB, though it depends on your `-c` and `-p` settings). The time taken for the analysis itself (calculating and formatting the metadata) is negligible compared to the IO time.
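The streaming strategy can be sketched in Python (a simplified illustration only — the actual tool is written in Rust). The key detail is that chunk boundaries are aligned to newlines, so no row is ever split across workers and peak memory stays around `chunk_size × worker count`:

```python
def stream_chunks(path, chunk_size=512 * 1024):
    """Yield chunks of the file, each cut at a newline boundary,
    so every row (transcript) lands wholly inside one chunk."""
    with open(path, "rb") as f:
        leftover = b""
        while True:
            block = f.read(chunk_size)
            if not block:
                if leftover:
                    yield leftover
                break
            block = leftover + block
            # Cut at the last newline; carry the partial row forward
            cut = block.rfind(b"\n")
            if cut == -1:
                leftover = block
                continue
            yield block[: cut + 1]
            leftover = block[cut + 1 :]
```

Each yielded chunk can then be handed to a worker thread for counting, which is why memory usage is bounded regardless of the total file size.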
If the detected_transcripts.csv file is cached by the system (say, some operations were just performed on the file, or you have just copied it over, AND your system has sufficient memory / fast paging), the speed can be much higher, around 3-5 GB/s or more.
Processing a detected_transcripts.csv file of size ~35 GB on a Ryzen 9 9950X, 16 cores, 32 threads:
```
./detected_transcripts_metadata \
    -i /path/to/detected_transcripts.csv \
    -c 1024
```

- Memory usage: ~1 GB
- HDD cold start: 142 seconds
- SSD cold start: 18.2 seconds
- Cached file (either HDD or SSD): ~12 seconds
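As a quick sanity check, these timings are consistent with the throughput figures quoted earlier (taking the file size as ~35 GB):

```python
size_gb = 35  # approximate size of detected_transcripts.csv

print(f"HDD cold start: {size_gb * 1000 / 142:.0f} MB/s")  # ~246 MB/s
print(f"SSD cold start: {size_gb / 18.2:.2f} GB/s")        # ~1.92 GB/s
print(f"Cached file:    {size_gb / 12:.1f} GB/s")          # ~2.9 GB/s
```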
The script expects one transcript per row, and at least the following data columns must be present: `global_x`, `global_y`, `global_z`, `cell_id`, `gene`. The column names are case-sensitive.
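Since the column names are case-sensitive, it can be worth verifying the header before pointing the tool at a ~35 GB file. A minimal Python check (reads only the first line; the required column set matches the list above):

```python
import csv

REQUIRED = {"global_x", "global_y", "global_z", "cell_id", "gene"}

def check_header(path):
    """Read only the CSV header and verify that all required
    (case-sensitive) columns are present."""
    with open(path, newline="") as f:
        header = next(csv.reader(f))
    missing = REQUIRED - set(header)
    if missing:
        raise ValueError(f"missing required columns: {sorted(missing)}")
    return header
```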
Call the executable from the terminal; pass the `-h` flag to see the help message.
```
./detected_transcripts_metadata -h
```

```
Extract metadata from the detected_transcripts.csv file generated on the Vizgen MERFISH platform.

Usage: detected_transcripts_metadata.exe [OPTIONS] --input <INPUT>

Options:
  -i, --input <INPUT>            Path to "detected_transcripts.csv" file to process
  -l, --lines                    Count the number of lines in the input file and exit
  -o, --output <OUTPUT>          Path to save the metadata JSON file
                                 If not provided, the output will be saved in the same directory as the input file
                                 Use special value "-" to print the metadata JSON to console without saving to a file
  -c, --chunk-size <CHUNK_SIZE>  Chunk size (in KiB) for each processing thread [default: 512]
  -p, --process <PROCESS>        Number of processing threads [default: available_parallelism()]
  -q, --quiet                    Run quietly without printing to stdout (except for errors)
  -h, --help                     Print help
  -V, --version                  Print version
```
The defaults for `-c` (512 KiB) and `-p` (available number of threads) should be sufficient for most use cases. On computers with more than 16 threads and with data stored on a fast SSD, increasing the chunk size to 1024 KiB or even higher will likely improve performance.
```
./detected_transcripts_metadata \
    -i /path/to/detected_transcripts.csv \
    -o /path/to/output.json \
    -c 2048 \
    -p 64
```

The output of this script is a JSON file with the following structure:
```jsonc
{
    // Number of cells with transcripts
    "cell_count": number,
    // Number of transcripts within each cell
    "transcripts_per_cell": [number], // length of unique cell_ids
    // Total number of transcripts
    "total_transcripts": number,
    // Number of transcripts within cells (cell_id != "NA" or "N/A" or "-1")
    "transcripts_within_cells": number,
    // Number of transcripts for each z-layer
    "layers_transcripts_count": {
        "0": number,
        "1": number,
        // ...
        // "<global_z index>": number
    },
    // Number of transcripts within cells (intracellular) for each z-layer
    "layers_transcripts_within_cells_count": {
        "0": number,
        "1": number,
        // ...
        // "<global_z index>": number
    },
    // Frequency of each gene
    "gene_frequency": {
        "<gene name>": number,
        // ...
    },
    // Frequency of each gene per z-layer
    "genes_frequency_per_layer": {
        "0": {
            "<gene name>": number,
            // ...
        },
        "1": {
            "<gene name>": number,
            // ...
        },
        // ...
        // "<global_z index>": {
        //     "<gene name>": number,
        //     // ...
        // }
    }
}
```

In addition to saving a JSON file, you can use the special value `-` with the `-o` flag to print the output to the terminal. Combined with the `-q` flag to suppress the info and progress messages, this produces clean JSON output on the terminal that can be piped to another command or redirected to a file.
```
./detected_transcripts_metadata \
    -i /path/to/detected_transcripts.csv \
    -o - \
    -c 500000 \
    -p 16 \
    -q > /path/to/output.json
```

For example, you can use this script within a Python program to parse the metadata and use it for further analysis/visualization.
```python
import json, subprocess

result = subprocess.run(
    [
        "../target/release/detected_transcripts_metadata.exe",  # or path to the executable
        "-i",
        "/path/to/detected_transcripts.csv",
        "-o",
        "-",
        "-c",
        "512",
        "-q",
    ],
    shell=False, capture_output=True
)
data = json.loads(result.stdout)
print(data)
```

Simply clone the repository and run the standard Rust build command:
```
git clone https://github.com/codynhanpham/detected_transcripts_metadata.git
cd detected_transcripts_metadata
cargo build --release
```

The executable will be located in the `target/release` directory.
This project is licensed under the MIT License - see the LICENSE file for details.