Efficiently parse transcript metadata from the detected_transcripts.csv file generated by the Vizgen MERFISH pipeline.
To see my original implementation of the fast file parsing script, check out commit 76e6014.
Version 2 of the script is an adaptation of Naveen Naidu's take on The One Billion Row Challenge (1brc). It adds extra run-time reliability, delivers more consistent performance across different hardware setups and operating systems, and avoids the use of unsafe mmap.
The benchmarks below were run on the following system:

- AMD Ryzen 9 9950X, 16 cores, 32 threads
- 64 GB RAM
- SSD: Samsung 990 Pro 2TB
- HDD: Seagate IronWolf Pro 8TB
Test data:

- ~35 GB Vizgen `detected_transcripts.csv` result file from a MERFISH run of 800 gene probes on a mouse brain (horizontal slice).
- The file has ~400 million rows (detected transcripts) and 10 columns.
The script essentially maxes out the IO throughput: >250 MB/s on HDD drives and around 2-2.5 GB/s on SSD drives on a cold start. The data is streamed and processed in chunks (chunk size can be set with the `-c` flag), so memory usage stays low (typically < 1 GB, though it depends on your `-c` and `-p` settings). The time taken for the analysis itself (calculating and formatting the metadata) is negligible compared to the IO time.
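The streaming strategy can be sketched in Python (a simplified illustration only — the actual tool is written in Rust). The key detail is that chunk boundaries are aligned to newlines, so no row is ever split across workers and peak memory stays around `chunk_size × worker count`:

```python
def stream_chunks(path, chunk_size=512 * 1024):
    """Yield chunks of the file, each cut at a newline boundary,
    so every row (transcript) lands wholly inside one chunk."""
    with open(path, "rb") as f:
        leftover = b""
        while True:
            block = f.read(chunk_size)
            if not block:
                if leftover:
                    yield leftover
                break
            block = leftover + block
            # Cut at the last newline; carry the partial row forward
            cut = block.rfind(b"\n")
            if cut == -1:
                leftover = block
                continue
            yield block[: cut + 1]
            leftover = block[cut + 1 :]
```

Each yielded chunk can then be handed to a worker thread for counting, which is why memory usage is bounded regardless of the total file size.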
If the detected_transcripts.csv file is cached by the system (say, some operations were just performed on the file, or you have just copied it over, AND your system has sufficient memory / fast paging), the speed can be much higher, around 3-5 GB/s or more.
Processing a detected_transcripts.csv file of size ~35 GB on a Ryzen 9 9950X, 16 cores, 32 threads:
```
./detected_transcripts_metadata \
    -i /path/to/detected_transcripts.csv \
    -c 1024
```

- Memory usage: ~1 GB
- HDD cold start: 142 seconds
- SSD cold start: 18.2 seconds
- Cached file (either HDD or SSD): ~12 seconds
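As a quick sanity check, these timings are consistent with the throughput figures quoted earlier (taking the file size as ~35 GB):

```python
size_gb = 35  # approximate size of detected_transcripts.csv

print(f"HDD cold start: {size_gb * 1000 / 142:.0f} MB/s")  # ~246 MB/s
print(f"SSD cold start: {size_gb / 18.2:.2f} GB/s")        # ~1.92 GB/s
print(f"Cached file:    {size_gb / 12:.1f} GB/s")          # ~2.9 GB/s
```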
The script expects one transcript per row, and at least the following data columns must be present: `global_x`, `global_y`, `global_z`, `cell_id`, `gene`. The column names are case-sensitive.
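Since the column names are case-sensitive, it can be worth verifying the header before pointing the tool at a ~35 GB file. A minimal Python check (reads only the first line; the required column set matches the list above):

```python
import csv

REQUIRED = {"global_x", "global_y", "global_z", "cell_id", "gene"}

def check_header(path):
    """Read only the CSV header and verify that all required
    (case-sensitive) columns are present."""
    with open(path, newline="") as f:
        header = next(csv.reader(f))
    missing = REQUIRED - set(header)
    if missing:
        raise ValueError(f"missing required columns: {sorted(missing)}")
    return header
```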
Call the executable from the terminal; pass the `-h` flag to see the help message.
```
./detected_transcripts_metadata -h
```

```
Extract metadata from the detected_transcripts.csv file generated on the Vizgen MERFISH platform.

Usage: detected_transcripts_metadata.exe [OPTIONS] --input <INPUT>

Options:
  -i, --input <INPUT>            Path to "detected_transcripts.csv" file to process
  -l, --lines                    Count the number of lines in the input file and exit
  -o, --output <OUTPUT>          Path to save the metadata JSON file
                                 If not provided, the output will be saved in the same directory as the input file
                                 Use special value "-" to print the metadata JSON to console without saving to a file
  -c, --chunk-size <CHUNK_SIZE>  Chunk size (in KiB) for each processing thread [default: 512]
  -p, --process <PROCESS>        Number of processing threads [default: available_parallelism()]
  -q, --quiet                    Run quietly without printing to stdout (except for errors)
  -h, --help                     Print help
  -V, --version                  Print version
```
The defaults for `-c` (512 KiB) and `-p` (available number of threads) should be sufficient for most use cases. On computers with more than 16 threads and with data stored on a fast SSD, increasing the chunk size to 1024 KiB or even higher will likely improve performance.
```
./detected_transcripts_metadata \
    -i /path/to/detected_transcripts.csv \
    -o /path/to/output.json \
    -c 2048 \
    -p 64
```

The output of this script is a JSON file with the following structure:
```jsonc
{
    // Number of cells with transcripts
    "cell_count": number,
    // Number of transcripts within each cell
    "transcripts_per_cell": [number], // length of unique cell_ids
    // Total number of transcripts
    "total_transcripts": number,
    // Number of transcripts within cells (cell_id != "NA" or "N/A" or "-1")
    "transcripts_within_cells": number,
    // Number of transcripts for each z-layer
    "layers_transcripts_count": {
        "0": number,
        "1": number,
        // ...
        // "<global_z index>": number
    },
    // Number of transcripts within cells (intracellular) for each z-layer
    "layers_transcripts_within_cells_count": {
        "0": number,
        "1": number,
        // ...
        // "<global_z index>": number
    },
    // Frequency of each gene
    "gene_frequency": {
        "<gene name>": number,
        // ...
    },
    // Frequency of each gene per z-layer
    "genes_frequency_per_layer": {
        "0": {
            "<gene name>": number,
            // ...
        },
        "1": {
            "<gene name>": number,
            // ...
        },
        // ...
        // "<global_z index>": {
        //     "<gene name>": number,
        //     // ...
        // }
    }
}
```

In addition to saving a JSON file, you can use the special value `-` with the `-o` flag to print the output to the terminal. Combined with the `-q` flag to suppress the info and progress messages, this produces clean JSON output on the terminal that can be piped to another command or redirected to a file.
```
./detected_transcripts_metadata \
    -i /path/to/detected_transcripts.csv \
    -o - \
    -c 500000 \
    -p 16 \
    -q > /path/to/output.json
```

For example, you can use this script within a Python program to parse the metadata and use it for further analysis/visualization.
```python
import json, subprocess

result = subprocess.run(
    [
        "../target/release/detected_transcripts_metadata.exe",  # or path to the executable
        "-i",
        "/path/to/detected_transcripts.csv",
        "-o",
        "-",
        "-c",
        "512",
        "-q",
    ],
    shell=False, capture_output=True
)
data = json.loads(result.stdout)
print(data)
```

Simply clone the repository and run the standard Rust build command:
```
git clone https://github.com/codynhanpham/detected_transcripts_metadata.git
cd detected_transcripts_metadata
cargo build --release
```

The executable will be located in the `target/release` directory.
This project is licensed under the MIT License - see the LICENSE file for details.