
Parse Vizgen MERFISH Detected Transcripts Data

Efficiently parse the transcripts metadata from the detected_transcripts.csv file generated by Vizgen MERFISH pipeline.

About v2

To see my original implementation of the fast file parsing script, check out commit 76e6014.

Version 2 of the script is an adaptation of Naveen Naidu's take on The One Billion Row Challenge (1brc). It adds extra run-time reliability and more consistent performance across different hardware setups and operating systems, and it avoids the use of unsafe mmap.

Benchmark

Hardware

  • AMD Ryzen 9 9950X, 16 cores, 32 threads
  • 64 GB RAM
  • SSD: Samsung 990 Pro 2TB
  • HDD: Seagate IronWolf Pro 8TB

Data

  • ~35 GB Vizgen's detected_transcripts.csv result file from a MERFISH run of 800 gene probes on a mouse brain (horizontal slice).
  • The file has ~400 million rows (detected transcripts) and 10 columns.

Performance

The parser essentially saturates I/O throughput: more than 250 MB/s on HDDs and around 2-2.5 GB/s on SSDs from a cold start. The data is streamed and processed in chunks (the chunk size can be set with the -c flag), so memory usage stays low (typically < 1 GB, though it depends on your -c and -p settings). The time taken for analysis (calculating and formatting the metadata) is negligible compared to the I/O time.
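The chunked streaming idea can be sketched in a few lines. The following is a simplified Python illustration of reading a file in chunks realigned to line boundaries (not the actual Rust implementation, which parallelizes this across threads):

```python
def iter_chunks(path, chunk_kib=512):
    """Yield chunks of the file, each ending on a newline boundary,
    so no CSV row is ever split across two chunks."""
    chunk_size = chunk_kib * 1024
    with open(path, "rb") as f:
        leftover = b""
        while True:
            block = f.read(chunk_size)
            if not block:
                if leftover:
                    yield leftover
                break
            block = leftover + block
            # Cut at the last newline; carry the partial row forward.
            cut = block.rfind(b"\n")
            if cut == -1:
                leftover = block
                continue
            yield block[: cut + 1]
            leftover = block[cut + 1:]

# Each yielded chunk can then be handed to a worker that splits it
# into rows and tallies per-gene / per-layer counts independently.
```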

If the detected_transcripts.csv file is already cached by the system (say, some operation was just performed on the file, or you have just copied it over, AND your system has sufficient memory / fast paging), processing can be much faster, around 3-5 GB/s or more.

Results

Processing a detected_transcripts.csv file of size ~35 GB on a Ryzen 9 9950X, 16 cores, 32 threads:

    ./detected_transcripts_metadata \
        -i /path/to/detected_transcripts.csv \
        -c 1024
  • Memory usage: ~ 1 GB
  • HDD cold start: 142 seconds
  • SSD cold start: 18.2 seconds
  • Cached file (either HDD or SSD): ~12 seconds

Usage

The script expects transcripts as rows, and at least these data columns must be present: global_x, global_y, global_z, cell_id, gene. The column names are case-sensitive.
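A quick pre-flight check of the header is easy to do before launching the tool. The snippet below is a hypothetical helper, not part of this project:

```python
import csv

# The columns the tool expects, matched case-sensitively.
REQUIRED_COLUMNS = {"global_x", "global_y", "global_z", "cell_id", "gene"}

def check_header(path):
    """Return the set of required columns missing from the CSV header."""
    with open(path, newline="") as f:
        header = next(csv.reader(f))
    return REQUIRED_COLUMNS - set(header)

# missing = check_header("detected_transcripts.csv")
# if missing:
#     raise SystemExit(f"missing columns: {sorted(missing)}")
```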

Call the executable from the terminal, or with the -h flag to see the help message.

./detected_transcripts_metadata -h
Extract metadata from the detected_transcripts.csv file generated on the Vizgen MERFISH platform.   
Usage: detected_transcripts_metadata.exe [OPTIONS] --input <INPUT>

Options:
  -i, --input <INPUT>            Path to "detected_transcripts.csv" file to process
  -l, --lines                    Count the number of lines in the input file and exit
  -o, --output <OUTPUT>          Path to save the metadata JSON file
                                 If not provided, the output will be saved in the same directory as the input file
                                 Use special value "-" to print the metadata JSON to console without saving to a file
  -c, --chunk-size <CHUNK_SIZE>  Chunk size (in KiB) for each processing thread [default: 512]
  -p, --process <PROCESS>        Number of processing threads [default: available_parallelism()]
  -q, --quiet                    Run quietly without printing to stdout (except for errors)
  -h, --help                     Print help
  -V, --version                  Print version

The default for -c (512 KiB) and -p (available number of threads) should be sufficient for most use cases. On computers with more than 16 threads and with data stored on fast SSD, increasing the chunk size to 1024 KiB or even higher will likely improve performance.

./detected_transcripts_metadata \
    -i /path/to/detected_transcripts.csv \
    -o /path/to/output.json \
    -c 2048 \
    -p 64
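As a rough rule of thumb (an approximation, not an exact accounting of the tool's memory use), the in-flight chunk-buffer memory scales with chunk size times thread count:

```python
def estimated_buffer_mib(chunk_kib=512, threads=32):
    """Rough upper bound on chunk-buffer memory in MiB:
    one chunk of chunk_kib KiB in flight per processing thread.
    Actual usage also includes per-thread tally maps, so it runs higher."""
    return chunk_kib * threads / 1024

# Defaults on a 32-thread machine: 512 KiB x 32 threads = 16 MiB of buffers,
# so even -c 2048 with -p 64 only accounts for 128 MiB of chunk buffers.
```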

Output

The output of this script is a JSON file with the following structure:

{
    // Number of cells with transcripts
    "cell_count": number,

    // Number of transcripts within each cell
    "transcripts_per_cell": [number], // length of unique cell_ids

    // Total number of transcripts
    "total_transcripts": number,

    // Number of transcripts within cells (cell_id is not "NA", "N/A", or "-1")
    "transcripts_within_cells": number,

    // Number of transcripts for each z-layer
    "layers_transcripts_count": {
        "0": number,
        "1": number,
        // ...
        // "<global_z index>" : number
    },

    // Number of transcripts within cells (intracellular) for each z-layer
    "layers_transcripts_within_cells_count": {
        "0": number,
        "1": number,
        // ...
        // "<global_z index>" : number
    },

    // Frequency of each gene
    "gene_frequency": {
        "<gene name>": number,
        // ...
        // "<gene name>" : number
    },

    // Frequency of each gene per z-layer
    "genes_frequency_per_layer": {
        "0": {
            "<gene name>": number,
            // ...
            // "<gene name>" : number
        },
        "1": {
            "<gene name>": number,
            // ...
            // "<gene name>" : number
        },
        // ...
        // "<global_z index>" : {
        //     "<gene name>": number,
        //     // ...
        //     // "<gene name>" : number
        // }
    }
}
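Once loaded, the metadata can be summarized directly from these fields; for example (a small illustrative helper, not provided by the tool):

```python
def summarize(metadata):
    """Derive a few summary figures from the parsed metadata dict."""
    total = metadata["total_transcripts"]
    within = metadata["transcripts_within_cells"]
    # Top 5 genes by frequency, as (gene, count) pairs.
    top_genes = sorted(
        metadata["gene_frequency"].items(), key=lambda kv: kv[1], reverse=True
    )[:5]
    return {
        "cells": metadata["cell_count"],
        "pct_within_cells": 100.0 * within / total if total else 0.0,
        "top_genes": top_genes,
    }

# import json
# with open("detected_transcripts_metadata.json") as f:
#     print(summarize(json.load(f)))
```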

In addition to saving a JSON file, one can use the special value - with the -o flag to print the output to the terminal. Combined with the -q flag to suppress the info and progress messages, this produces clean JSON output that can be piped to another command or redirected to a file.

./detected_transcripts_metadata \
    -i /path/to/detected_transcripts.csv \
    -o - \
    -c 500000 \
    -p 16 \
    -q > /path/to/output.json

For example, you can use this script within a Python program to parse the metadata and use it for further analysis/visualization.

import json
import subprocess

result = subprocess.run(
    [
        "../target/release/detected_transcripts_metadata.exe", # or path to the executable
        "-i",
        "/path/to/detected_transcripts.csv",
        "-o",
        "-",
        "-c",
        "512",
        "-q",
    ],
    shell=False,
    capture_output=True,
    check=True, # raise CalledProcessError if the tool exits with an error
)

data = json.loads(result.stdout)
print(data)

Build From Source

Simply clone the repository and run the Rust standard build command:

git clone https://github.com/codynhanpham/detected_transcripts_metadata.git
cd detected_transcripts_metadata
cargo build --release

The executable will be located in the target/release directory.

License

This project is licensed under the MIT License - see the LICENSE file for details.
