Transipedia/RiboKastIndex

RiboKastIndex

Makes k-mer indexes from ribo-seq data

Introduction

RiboKastIndex is a bioinformatics workflow designed for processing Ribo-seq data.
The RiboKastIndex pipeline automates the key steps of ribosome profiling in a streamlined, reproducible way, from raw sequencing reads to ribosome profiling outputs and k-mer analyses. Once phased ribo-seq k-mers are generated, they are organized into a phased k-mer matrix, where rows represent k-mers and columns correspond to input datasets. Using KaMRaT, the pipeline then builds the comprehensive k-mer index (RS index) and generates a contig count matrix (contigs are merged k-mers). These outputs enable further analyses: for example, you can query a list of RNA sequences against the RS index to assess which of them are actively translated.

Additional Information:

Framed (phased) reads are defined per read length. For each read length, the pipeline checks in which of the three possible frames of the CDS (coding sequence) the reads fall. If a large majority of the reads of that length, for example more than 70%, map to the same frame, so that their P-site aligns codons to P0 of the CDS, all reads of that length are considered phased.
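The phasing rule above can be sketched in a few lines of Python. This is only an illustration, not the pipeline's code: the function name `phased_read_lengths`, the `(read_length, frame)` input format, and the choice of keeping whichever frame dominates are assumptions made for the example, and the 70% default simply mirrors the figure quoted above.

```python
from collections import Counter, defaultdict


def phased_read_lengths(reads, threshold_pct=70.0):
    """Return {read_length: dominant_frame} for lengths that pass the threshold.

    `reads` is an iterable of (read_length, frame) pairs, with frame in {0, 1, 2}.
    Hypothetical helper: the real pipeline derives this from riboWaltz tables.
    """
    per_length = defaultdict(Counter)
    for length, frame in reads:
        per_length[length][frame] += 1

    phased = {}
    for length, frames in per_length.items():
        total = sum(frames.values())
        frame, count = frames.most_common(1)[0]
        # Keep the length only if one frame holds more than threshold_pct of reads.
        if 100.0 * count / total > threshold_pct:
            phased[length] = frame
    return phased


# Toy input: 28-nt reads are strongly periodic (80% in frame 0), 31-nt reads are not.
reads = [(28, 0)] * 8 + [(28, 1), (28, 2)] + [(31, 0)] * 4 + [(31, 1)] * 3 + [(31, 2)] * 3
print(phased_read_lengths(reads))  # {28: 0}
```

Lowering `threshold_pct` makes more read lengths qualify, at the cost of admitting less periodic reads.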

From the phased reads, the pipeline extracts phased k-mers, that is, k-mers aligned to the same translation frame. These phased k-mers populate the phased k-mer matrix (rows are k-mers, columns are input datasets), from which KaMRaT builds the k-mer index and the contig count matrix (contigs being merged k-mers) used in downstream analysis.
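To make the matrix layout concrete, here is a minimal Python sketch that merges per-sample k-mer counts into one table with k-mers as rows and samples as columns. In the pipeline this merge is performed by joinCounts; the function `build_kmer_matrix`, the in-memory dictionaries, and the zero-fill for absent k-mers are assumptions made for illustration only.

```python
def build_kmer_matrix(sample_counts):
    """Merge {sample: {kmer: count}} into (header, rows).

    Rows are k-mers, columns are samples; a k-mer absent from a sample gets 0.
    """
    samples = sorted(sample_counts)
    kmers = sorted(set().union(*(counts.keys() for counts in sample_counts.values())))
    header = ["kmer"] + samples
    rows = [[kmer] + [sample_counts[s].get(kmer, 0) for s in samples] for kmer in kmers]
    return header, rows


# Toy counts with short k-mers for readability (the example config uses kmerSize 25).
counts = {
    "sampleA": {"ACGTA": 3, "CGTAC": 1},
    "sampleB": {"ACGTA": 2},
}
header, rows = build_kmer_matrix(counts)
print("\t".join(header))
for row in rows:
    print("\t".join(str(value) for value in row))
```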

Requirements and Setup

Before running the RiboKastIndex pipeline, make sure the following prerequisites are in place: the required Ribodoc Conda environment, KaMRaT for k-mer analysis, and joinCounts for merging k-mer counts.

1. Set Up and Activate Conda Environment

The pipeline relies on a Conda environment defined in the RiboKastIndex.yaml file. Follow the steps below to set up and activate the environment.

Step 1: Create the Conda Environment

If the environment is not already created, follow these steps to create it:

  1. Install Miniconda or Conda if it's not already installed.

  2. Create the environment from the RiboKastIndex.yaml file:

    conda env create -f /path/to/RiboKastIndex/RiboKastIndex.yaml
  3. List your environments: After creating the environment, run the following to get the path of the created environment:

    conda info --envs
  4. Activate the environment: Using the path or the environment name from the previous command, activate your environment:

    source /home/yourusername/miniconda3/bin/activate your_environment_name

2. Install and Configure KaMRaT

The pipeline uses KaMRaT for k-mer analysis. Follow these steps to download and configure the KaMRaT Singularity image:

  1. Download the KaMRaT Singularity image (.sif) from the official GitHub repository.

  2. Use the following command to download the image:

    singularity pull KaMRaT.sif docker://transipedia/kamrat:latest
  3. Configure the path to the downloaded image in the config.yaml file under the kamratImg key:

    kamratImg: "/path/to/KaMRaT.sif"

    Replace /path/to/KaMRaT.sif with the actual path where the Singularity image is located.


3. Install and Configure joinCounts

joinCounts is used for merging k-mer counts. It is available on GitHub at https://github.com/Transipedia/dekupl-joinCounts.

  1. Clone and install joinCounts:

    git clone https://github.com/Transipedia/dekupl-joinCounts.git
    cd dekupl-joinCounts
    make
  2. Set the path to joinCounts in the config.yaml file:

    pathJoinCounts: "$PATH:/path/to/dekupl-joinCounts"

    Replace /path/to/dekupl-joinCounts with the actual path to the directory containing the joinCounts executable.
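Before launching the pipeline, it can help to check that the pathJoinCounts value really leads to an executable. The sketch below is a hypothetical helper, not part of RiboKastIndex: it assumes the compiled binary is named `joinCounts` and that the configured value is a colon-separated search path in which `$PATH` stands for the current environment variable.

```python
import os
import shutil


def find_join_counts(path_join_counts, executable="joinCounts"):
    """Return the full path of the executable found on the configured path, or None."""
    # Expand the literal "$PATH" placeholder used in config.yaml.
    search_path = path_join_counts.replace("$PATH", os.environ.get("PATH", ""))
    return shutil.which(executable, path=search_path)


# Prints the resolved location, or None if the executable cannot be found.
print(find_join_counts("$PATH:/path/to/dekupl-joinCounts"))
```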

Results Directory Structure

The results generated by the RiboKastIndex pipeline are organized into several key directories:

  • BAM_transcriptome.25-35/: Contains BAM and BAM index files of aligned reads to the transcriptome.
  • adapter_lists/: Stores adapter sequences detected or used for trimming.
  • annex_database/: Contains reference indices (Bowtie2 and Hisat2), GFF files, and other annotations used for the analysis.
  • cutadapt/: Trimmed FastQ files for each sample after adapter removal.
  • fastqc/: FastQC quality reports before and after trimming.
  • kmerCount/: Results of k-mer counting, including individual k-mer counts for each sample, the final merged k-mer result (merged-res.tsv), and the k-mer index.
  • no-outRNA/: FastQ files with rRNA reads removed.
  • riboWaltz.25-35/: Results from riboWaltz analysis, including P-site offset data, periodicity plots, and frame-shift analysis.

Below is a concise tree structure of the key output directories:

RESULTS/
├── config.yaml
├── adapter_lists/
│   └── <sample>.txt
├── annex_database/
│   ├── NamedCDS_human.gff3
│   ├── transcriptome_elongated.nfasta
│   └── index_files/                # Bowtie2 / Hisat2 indexes (outRNA + transcriptome_elongated)
├── cutadapt/
│   └── <sample>.cutadapt.25-35.fastq.gz
├── fastqc/
│   ├── fastqc_before_trimming/
│   └── fastqc_after_trimming/
├── no-outRNA/
│   └── <sample>.25-35.no-outRNA.fastq.gz
├── BAM_transcriptome.25-35/
│   └── transcriptome_elongated.<sample>.25-35.bam
├── riboWaltz.25-35/
│   ├── psite_table_forKmerCount.txt
│   ├── best_offset.csv
│   ├── frame_psite.csv
│   ├── frame_psite_length.csv
│   ├── psite_offset.csv
│   ├── psite_table_offset.csv
│   └── <sample>/                   # per-sample riboWaltz outputs (plots + tables)
└── kmerCount/
    ├── samples_kmer.txt
    ├── Kmer/
    │   └── <sample>.tsv
    ├── Matrix/
    │   └── matrixFilteredHeader.tsv
    └── Kamrat/
        ├── index/
        │   ├── idx-mat.bin
        │   ├── idx-meta.bin
        │   └── idx-pos.bin
        └── merged-res.tsv

The main outputs are the RS index (kmerCount/Kamrat/index), the k-mer count table (kmerCount/Matrix/matrixFilteredHeader.tsv), and the contig count table (kmerCount/Kamrat/merged-res.tsv), where contigs represent merged k-mers.

Configuration

The pipeline uses a configuration file (config.yaml) that defines project-specific settings, including paths, reference files, trimming parameters, and k-mer analysis settings. This file must be tailored to your environment and data.

Your working directory should contain a fastq/ directory for your FASTQ files and a database/ directory for the reference files specified in the configuration file.

In practice, you only need to define one main path: paths.local_path, the root folder of your project (the folder where you placed fastq/ and database/). All other pipeline outputs (e.g. RESULTS/, logs/, stats/) are created automatically inside this same local_path, using the relative subpaths defined in config.yaml.
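As an illustration of how the relative subpaths combine with local_path, the following Python sketch joins each entry of a `paths` mapping onto the project root. The helper `resolve_paths` and the placeholder root `/path/to/project/` are assumptions for the example; the pipeline's own path handling may differ.

```python
from pathlib import PurePosixPath


def resolve_paths(paths_cfg):
    """Join every relative subpath in the `paths` section onto local_path."""
    root = PurePosixPath(paths_cfg["local_path"])
    return {key: root / value for key, value in paths_cfg.items() if key != "local_path"}


# Keys mirror the `paths` section of config.yaml; the root is a placeholder.
paths_cfg = {
    "local_path": "/path/to/project/",
    "results_path": "RESULTS/",
    "logs_path": "logs/",
    "fastq_path": "fastq/",
}
resolved = resolve_paths(paths_cfg)
for key, path in resolved.items():
    print(f"{key}: {path}")
```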

P-site / forced phase selection (k-mer cutting position)

In config.yaml, you can control the P-site position (also called phase) used to cut k-mers in-frame.

You have two options:

  1. Force a specific P-site position
    Set forced_phase to an integer (e.g., 12) to always cut k-mers using that P-site position:

    mode: "phase"
    forced_phase: 12
  2. Automatically select the P-site position (riboWaltz-based)
    When forced_phase is not set (null), the pipeline selects the smallest valid P-site reported by riboWaltz, and accepts positions that follow the codon periodicity:
      • P-site
      • P-site + 3
      • P-site + 6

This ensures that k-mers are cut consistently with the 3-nt reading frame.
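The two options can be summarized with a short Python sketch. It is an interpretation of the description above rather than the pipeline's implementation: the helper names are hypothetical, and "valid" is assumed here to mean a non-negative offset, which the README does not define.

```python
def select_psite(reported_offsets, forced_phase=None):
    """Return the P-site position to use for cutting k-mers.

    A forced_phase value takes priority; otherwise the smallest valid offset
    among those reported (e.g. by riboWaltz) is chosen.
    """
    if forced_phase is not None:
        return int(forced_phase)
    # Assumption: an offset is "valid" when it is present and non-negative.
    valid = [int(o) for o in reported_offsets if o is not None and int(o) >= 0]
    if not valid:
        raise ValueError("no valid P-site offset available")
    return min(valid)


def accepted_positions(psite):
    """Positions in the same reading frame: P-site, P-site + 3, P-site + 6."""
    return [psite, psite + 3, psite + 6]


psite = select_psite([13, 12, 15])  # forced_phase is null: smallest valid offset
print(psite, accepted_positions(psite))
```

All accepted positions differ by a multiple of 3, so they share the same reading frame as the selected P-site.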

# -----------------------------------------------------------------
#                      RiboKastIndex Configuration
# -----------------------------------------------------------------
project_name: "RiboKastIndex"

# -----------------------------------------------------------------
#                      Path Settings
# -----------------------------------------------------------------
paths:
  # Root directory of the project (write it once)
  local_path: "/store/EQUIPES/SSFA/MEMBERS/safa.maddouri/RiboKastIndex_test/"

  # Everything below is relative to local_path
  RibokastIndex_tools: "tools/"
  results_path: "RESULTS/"
  stats_path: "stats/"
  logs_path: "logs/"
  snakemake_log_path: ".snakemake/log/"
  fastq_path: "fastq/"
  conda_env: "RiboKastIndex.yaml"

# -----------------------------------------------------------------
#                      Reference Files
# -----------------------------------------------------------------
fasta: "human.fa"
gff: "human.gff3"
fasta_outRNA: "rRNA.fasta"

# -----------------------------------------------------------------
#                      Adapter Trimming Settings
# -----------------------------------------------------------------
already_trimmed: "no"
adapt_sequence: ""

# -----------------------------------------------------------------
#                      Length Selection for Profiling
# -----------------------------------------------------------------
readsLength_min: "25"
readsLength_max: "35"

# -----------------------------------------------------------------
#                      GFF File Settings
# -----------------------------------------------------------------
gff_cds_feature: "CDS"
gff_mRNA_feature: "mRNA"
gff_5UTR_feature: "five_prime_UTR"
gff_parent_attribut: "Parent"
gff_name_attribut: "Name"

# -----------------------------------------------------------------
#                      riboWaltz Analysis Settings
# -----------------------------------------------------------------
kmercount_pct_threshold: "60"

# -----------------------------------------------------------------
#                      K-mer Index Construction Settings
# -----------------------------------------------------------------
pathJoinCounts: "$PATH:/data/work/I2BC/safa.maddouri/tools/dekupl-joinCounts"

kmerSize: "25"
mode: "phase"          # normal | phase

forced_phase: 12
#forced_phase: null

kamrat_normalize: true
kamrat_nfbase: 1000000
kamratImg: "/store/EQUIPES/SSFA/MEMBERS/safa.maddouri/KaMRaT.sif"
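The kamrat_normalize and kamrat_nfbase settings above control count normalization when the index is built. The Python sketch below shows the usual scaling this kind of setting implies, under the assumption that a normalized count equals nfbase × raw count / total count of the sample; check the KaMRaT documentation for the exact definition. The function `normalize_counts` is hypothetical and not part of the pipeline.

```python
def normalize_counts(matrix, nfbase=1_000_000):
    """Scale raw counts so that each sample sums to nfbase.

    `matrix` maps sample -> {kmer: raw_count}.
    """
    normalized = {}
    for sample, counts in matrix.items():
        total = sum(counts.values())
        # Samples with no counts are left at zero to avoid division by zero.
        factor = nfbase / total if total else 0.0
        normalized[sample] = {kmer: count * factor for kmer, count in counts.items()}
    return normalized


# Toy example with a small base so the numbers are easy to follow.
raw = {"sampleA": {"ACGTA": 3, "CGTAC": 1}}
print(normalize_counts(raw, nfbase=1000))  # {'sampleA': {'ACGTA': 750.0, 'CGTAC': 250.0}}
```

Normalizing this way makes counts comparable across samples with different sequencing depths.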
