RiboKastIndex is a bioinformatics workflow designed for processing Ribo-seq data.
The RiboKastIndex pipeline automates the key steps of ribosome profiling analysis in a streamlined, reproducible way. It processes data from raw sequencing reads through to ribosome profiling outputs and k-mer analyses.
Once phased ribo-seq k-mers are generated, they are organized into a phased k-mer matrix, where rows represent k-mers and columns correspond to input datasets.
Using KaMRaT, the pipeline builds a comprehensive k-mer index (the RS index) and generates a contig count matrix, where contigs are merged k-mers.
These outputs enable further analyses, such as determining which sequences from a list are actively translated. You can query RNA sequences in the RS index to assess their translation status.
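Conceptually, querying a sequence against a k-mer index amounts to decomposing the sequence into k-mers and checking which of them are present. The sketch below illustrates that idea with a plain Python set standing in for the index. It is not KaMRaT's implementation, and the function names and the toy 5-mers are hypothetical. In a phased index, only k-mers cut in the translated frame may be expected to match.

```python
def kmers_of(seq, k):
    """Return all overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def fraction_in_index(seq, index, k):
    """Fraction of the k-mers of seq that are present in index (a set of k-mers)."""
    kmers = kmers_of(seq, k)
    if not kmers:
        return 0.0  # sequence shorter than k: nothing to look up
    return sum(kmer in index for kmer in kmers) / len(kmers)


# Toy stand-in for an index (hypothetical 5-mers, for illustration only).
toy_index = {"ATGGC", "TGGCC", "GGCCA"}

for query in ("ATGGCCA", "ATGGCTT"):
    print(query, round(fraction_in_index(query, toy_index, k=5), 2))
```

A sequence whose k-mers are mostly found in the index is a candidate for being translated; the threshold to apply is a design choice left to the user.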
Reads of a given length are considered framed (phased) when a large majority of them, for example more than 70%, map to the same frame of the CDS (coding sequence), which indicates a P-site that aligns codons to P0 of the CDS. The translated frame is determined among the three possible frames: if, for instance, more than 70% of the reads of a given length fall in frame 1, all reads of that length are considered phased.
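The phasing rule can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline's code: the function name, the input format (pairs of read length and frame), and the frame labels 0, 1, 2 are assumptions, and the 70% threshold is the example value given in the text.

```python
from collections import Counter, defaultdict


def phased_lengths(reads, threshold=0.70):
    """Return {read_length: dominant_frame} for lengths that pass the threshold.

    reads: iterable of (read_length, frame) pairs, with frame in {0, 1, 2}.
    A length is phased when strictly more than `threshold` of its reads
    share the same frame.
    """
    per_length = defaultdict(Counter)
    for length, frame in reads:
        per_length[length][frame] += 1

    phased = {}
    for length, frames in per_length.items():
        total = sum(frames.values())
        frame, count = frames.most_common(1)[0]
        if count / total > threshold:
            phased[length] = frame
    return phased


# Toy data: 80% of the 28-nt reads are in frame 0; 30-nt reads are spread out.
toy_reads = (
    [(28, 0)] * 8 + [(28, 1)] + [(28, 2)]
    + [(30, 0)] * 4 + [(30, 1)] * 3 + [(30, 2)] * 3
)
print(phased_lengths(toy_reads))  # {28: 0}
```

Here only the 28-nt reads would be kept for phased k-mer extraction.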
From the phased reads, the pipeline extracts phased k-mers, that is, k-mers aligned to the same translation frame. These k-mers populate the phased k-mer matrix, and KaMRaT then builds the k-mer index and the contig count matrix from it, as described above.
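The phased k-mer matrix (rows are k-mers, columns are input datasets) can be pictured as follows. This sketch only illustrates the layout: in the pipeline the merging is done by joinCounts, and the function name, sample names, and 6-mers below are hypothetical.

```python
def build_kmer_matrix(sample_counts):
    """Merge per-sample k-mer counts into one table.

    sample_counts: {sample_name: {kmer: count}}
    Returns (header, rows); a k-mer absent from a sample gets a count of 0.
    """
    samples = sorted(sample_counts)
    kmers = sorted(set().union(*(counts.keys() for counts in sample_counts.values())))
    header = ["kmer"] + samples
    rows = [[kmer] + [sample_counts[s].get(kmer, 0) for s in samples] for kmer in kmers]
    return header, rows


# Toy per-sample counts (hypothetical values).
toy_counts = {
    "sampleA": {"ATGGCC": 5, "GCCAAA": 2},
    "sampleB": {"ATGGCC": 3, "AAATTT": 1},
}

header, rows = build_kmer_matrix(toy_counts)
print("\t".join(header))
for row in rows:
    print("\t".join(str(value) for value in row))
```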
Before running the RiboKastIndex pipeline, make sure the following prerequisites are in place: the required Ribodoc Conda environment, KaMRaT for k-mer analysis, and joinCounts for merging k-mer counts.
The pipeline relies on a Conda environment defined in the RiboKastIndex.yaml file. Follow the steps below to set up and activate the environment.
If the environment has not been created yet, follow these steps:
- Install Miniconda or Conda if it is not already installed.
- Create the environment from the RiboKastIndex.yaml file:
  conda env create -f /path/to/RiboKastIndex/RiboKastIndex.yaml
- List your environments to get the name and path of the environment you just created:
  conda info --envs
- Activate the environment, using the path or the environment name from the previous command:
  source /home/yourusername/miniconda3/bin/activate your_environment_name
The pipeline uses KaMRaT for k-mer analysis. Follow these steps to download and configure the KaMRaT Singularity image:
- Download the KaMRaT Singularity image (.sif) referenced in the official GitHub repository, using the following command:
  singularity pull KaMRaT.sif docker://transipedia/kamrat:latest
- Configure the path to the downloaded image in the config.yaml file under the kamratImg key:
  kamratImg: "/path/to/KaMRaT.sif"
  Replace /path/to/KaMRaT.sif with the actual path where the Singularity image is located.
joinCounts is used for merging k-mer counts. It is available on GitHub at https://github.com/Transipedia/dekupl-joinCounts.
- Clone and install joinCounts:
  git clone https://github.com/Transipedia/dekupl-joinCounts.git
  cd dekupl-joinCounts
  make
- Set the path to joinCounts in the config.yaml file:
  pathJoinCounts: "$PATH:/path/to/dekupl-joinCounts"
  Replace /path/to/dekupl-joinCounts with the actual path to the joinCounts executable.
The results generated by the RS_Framed_kmers pipeline are organized into several key directories:
- BAM_transcriptome.25-35/: Contains BAM and BAM index files of reads aligned to the transcriptome.
- adapter_lists/: Stores adapter sequences detected or used for trimming.
- annex_database/: Contains reference indices (Bowtie2 and Hisat2), GFF files, and other annotations used for the analysis.
- cutadapt/: Trimmed FastQ files for each sample after adapter removal.
- fastqc/: FastQC quality reports before and after trimming.
- kmerCount/: Results of k-mer counting, including individual k-mer counts for each sample, the final merged k-mer result (merged-res.tsv), and the k-mer index.
- no-outRNA/: FastQ files with rRNA reads removed.
- riboWaltz.25-35/: Results from riboWaltz analysis, including P-site offset data, periodicity plots, and frame-shift analysis.
Below is a concise tree structure of the key output directories:
RESULTS/
├── config.yaml
├── adapter_lists/
│ └── <sample>.txt
├── annex_database/
│ ├── NamedCDS_human.gff3
│ ├── transcriptome_elongated.nfasta
│ └── index_files/ # index bowtie2 / hisat2 (outRNA + transcriptome_elongated)
├── cutadapt/
│ └── <sample>.cutadapt.25-35.fastq.gz
├── fastqc/
│ ├── fastqc_before_trimming/
│ └── fastqc_after_trimming/
├── no-outRNA/
│ └── <sample>.25-35.no-outRNA.fastq.gz
├── BAM_transcriptome.25-35/
│ └── transcriptome_elongated.<sample>.25-35.bam
├── riboWaltz.25-35/
│ ├── psite_table_forKmerCount.txt
│ ├── best_offset.csv
│ ├── frame_psite.csv
│ ├── frame_psite_length.csv
│ ├── psite_offset.csv
│ ├── psite_table_offset.csv
│ └── <sample>/ # outputs riboWaltz per sample (plots + tables)
└── kmerCount/
  ├── samples_kmer.txt
  ├── Kmer/
  │ └── <sample>.tsv
  ├── Matrix/
  │ └── matrixFilteredHeader.tsv
  └── Kamrat/
    ├── index/
    │ ├── idx-mat.bin
    │ ├── idx-meta.bin
    │ └── idx-pos.bin
    └── merged-res.tsv
The main outputs are the RS index (kmerCount/Kamrat/index), the k-mer count table (kmerCount/Matrix/matrixFilteredHeader.tsv), and the contig count table (kmerCount/Kamrat/merged-res.tsv), where contigs represent merged k-mers.
The pipeline uses a configuration file (config.yaml) that defines project-specific settings, including paths, reference files, trimming parameters, k-mer analysis settings, and more. This file must be tailored to your specific environment and data. Your working directory should contain a fastq/ directory for your FASTQ files, as well as a database/ directory for the reference files specified in the configuration file. In practice, you only need to define one main path: paths.local_path, which is the root folder of your project (the folder where you placed fastq/ and database/). All other pipeline outputs (e.g. RESULTS/, logs/, stats/, etc.) will be created automatically inside this same local_path, using the relative subpaths defined in config.yaml.
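To make the path layout concrete, here is a small sketch of how the relative subpaths could be resolved against local_path, and how the two expected input directories could be checked. The helper names and the example root /data/my_project/ are hypothetical; the pipeline itself may resolve paths differently.

```python
from pathlib import Path


def resolve_paths(paths):
    """Join every relative subpath in `paths` onto paths['local_path']."""
    root = Path(paths["local_path"])
    return {key: root / sub for key, sub in paths.items() if key != "local_path"}


def missing_inputs(root):
    """Return the expected input directories that do not exist under root."""
    return [name for name in ("fastq", "database") if not (Path(root) / name).is_dir()]


# Hypothetical project root; the subpaths mirror those used in config.yaml.
example_paths = {
    "local_path": "/data/my_project/",
    "results_path": "RESULTS/",
    "logs_path": "logs/",
    "fastq_path": "fastq/",
}

for key, path in resolve_paths(example_paths).items():
    print(key, "->", path)

print("missing inputs:", missing_inputs(example_paths["local_path"]))
```

Running a check like this before launching the pipeline gives an early warning when fastq/ or database/ is not where local_path says it should be.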
In config.yaml, you can control the P-site position (also called phase) used to cut k-mers in-frame.
You have two options:
- Force a specific P-site position
  Set forced_phase to an integer (e.g., 12) to always cut k-mers using that P-site position:
  mode: "phase"
  forced_phase: 12
- Automatically select the P-site position (riboWaltz-based)
  When forced_phase is not set (null), the pipeline selects the smallest valid P-site reported by riboWaltz, and accepts positions that follow the codon periodicity:
  P-site, P-site + 3, P-site + 6, …
This ensures that k-mers are cut consistently with the 3-nt reading frame.
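The two options can be summarised with a short sketch. It assumes the offsets passed in have already been judged valid (the text does not define validity here), and the function names are hypothetical rather than taken from the pipeline.

```python
def choose_psite(reported_offsets, forced_phase=None):
    """Pick the reference P-site position.

    forced_phase set    -> use it as is.
    forced_phase unset  -> take the smallest offset among the reported ones
                           (assumed here to be already filtered as valid).
    """
    if forced_phase is not None:
        return forced_phase
    return min(reported_offsets)


def follows_periodicity(position, psite):
    """True for psite, psite + 3, psite + 6, ... (same 3-nt reading frame)."""
    return position >= psite and (position - psite) % 3 == 0


# Toy offsets (hypothetical values, not real riboWaltz output).
reported = [15, 12, 18, 13]

psite = choose_psite(reported)
accepted = [p for p in sorted(reported) if follows_periodicity(p, psite)]
print(psite, accepted)  # 12 [12, 15, 18]
```

In this toy case the offset 13 is rejected because it is not a multiple of 3 away from the selected P-site.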
# -----------------------------------------------------------------
# RiboKastIndex Configuration
# -----------------------------------------------------------------
project_name: "RiboKastIndex"
# -----------------------------------------------------------------
# Path Settings
# -----------------------------------------------------------------
paths:
  # Root directory of the project (write it once)
  local_path: "/store/EQUIPES/SSFA/MEMBERS/safa.maddouri/RiboKastIndex_test/"
  # Everything below is relative to local_path
  RibokastIndex_tools: "tools/"
  results_path: "RESULTS/"
  stats_path: "stats/"
  logs_path: "logs/"
  snakemake_log_path: ".snakemake/log/"
  fastq_path: "fastq/"
conda_env: "RiboKastIndex.yaml"
# -----------------------------------------------------------------
# Reference Files
# -----------------------------------------------------------------
fasta: "human.fa"
gff: "human.gff3"
fasta_outRNA: "rRNA.fasta"
# -----------------------------------------------------------------
# Adapter Trimming Settings
# -----------------------------------------------------------------
already_trimmed: "no"
adapt_sequence: ""
# -----------------------------------------------------------------
# Length Selection for Profiling
# -----------------------------------------------------------------
readsLength_min: "25"
readsLength_max: "35"
# -----------------------------------------------------------------
# GFF File Settings
# -----------------------------------------------------------------
gff_cds_feature: "CDS"
gff_mRNA_feature: "mRNA"
gff_5UTR_feature: "five_prime_UTR"
gff_parent_attribut: "Parent"
gff_name_attribut: "Name"
# -----------------------------------------------------------------
# riboWaltz Analysis Settings
# -----------------------------------------------------------------
kmercount_pct_threshold: "60"
# -----------------------------------------------------------------
# K-mer Index Construction Settings
# -----------------------------------------------------------------
pathJoinCounts: "$PATH:/data/work/I2BC/safa.maddouri/tools/dekupl-joinCounts"
kmerSize: "25"
mode: "phase" # normal | phase
forced_phase: 12
#forced_phase: null
kamrat_normalize: true
kamrat_nfbase: 1000000
kamratImg: "/store/EQUIPES/SSFA/MEMBERS/safa.maddouri/KaMRaT.sif"
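As an illustration of what kamrat_normalize and kamrat_nfbase might mean in practice, the sketch below scales the counts of one sample so that they sum to the chosen base (1000000 in the configuration above). This reading of the two settings is an assumption; refer to the KaMRaT documentation for the exact formula. The function name and toy counts are hypothetical.

```python
def normalize_counts(counts, nfbase=1_000_000):
    """Scale raw counts of one sample so that they sum to nfbase.

    counts: {feature: raw_count} for a single sample.
    Assumed formula: normalized = raw * nfbase / sum of raw counts in the sample.
    """
    total = sum(counts.values())
    if total == 0:
        return {feature: 0.0 for feature in counts}  # avoid division by zero
    return {feature: raw * nfbase / total for feature, raw in counts.items()}


# Toy counts for one sample (hypothetical values).
toy_sample = {"ATGGCC": 2, "GCCAAA": 6, "AAATTT": 2}
print(normalize_counts(toy_sample))
```

Normalizing this way makes samples with different sequencing depths comparable column by column.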