Basic Usage¶
rapidStats¶
This script computes basic statistics: read counts, the distribution of reads across the two DNA strands, and small RNA modifications, each stratified by the defined regions.
Input¶
- Trimmed sequence file (FASTQ) or an alignment file (BAM/SAM)
- BED file containing the localization and names of genes/regions to be quantified
If FASTQ files are provided as input, the alignments are generated with bowtie2. A two-step alignment can also be performed if necessary: reads are first aligned against contaminant sequences and the aligning reads are removed, and the remaining reads are then aligned against the reference genome. For these alignments, the bowtie2 index files must be supplied through the corresponding input parameters along with the FASTQ file. The aligned reads are then quantified for the regions given in the BED file. This quantification step produces an output file containing read counts broken down by read length, modification, strandedness, etc.
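The two-step alignment described above can be sketched with plain bowtie2 calls. This is only an illustration of the idea, not RAPID's internal implementation: the function name, the intermediate file names and the index paths are hypothetical, while `-x`, `-U`, `--un`, `-k`, `-p` and `-S` are standard bowtie2 options. The defaults of 4 threads and 100 reported alignments mirror the documented `-p` and `-m` defaults. With `DRY_RUN=1` the commands are printed instead of executed.

```shell
#!/usr/bin/env bash
# Sketch of a two-step alignment; function and file names are hypothetical.

# Run a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Usage: two_step_align READS CONTAM_INDEX GENOME_INDEX OUTDIR
two_step_align() {
  local reads=$1 contam_index=$2 genome_index=$3 outdir=$4
  local threads=${THREADS:-4} max_aln=${MAX_ALN:-100}

  run mkdir -p "$outdir"

  # Step 1: align against contaminants; reads that do NOT align are
  # written to the file given to --un and kept for step 2.
  run bowtie2 -p "$threads" -x "$contam_index" -U "$reads" \
      --un "$outdir/clean.fq" -S "$outdir/contaminants.sam"

  # Step 2: align the remaining reads against the reference genome,
  # reporting up to $max_aln alignments per read.
  run bowtie2 -p "$threads" -k "$max_aln" -x "$genome_index" \
      -U "$outdir/clean.fq" -S "$outdir/aligned.sam"
}
```

For example, `DRY_RUN=1 two_step_align reads.fq /path_to_contaminants_index /path_to_index out` prints the three commands without running bowtie2.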
Sample script¶
If using a previously aligned BAM file:
./rapidStats.sh -o=/path_to_output_directory/ -f=reads.bam -ft=BAM --remove=no -a=file.bed -r=/rapidPath/
If using a FASTQ file and quantifying multiple BED files (results are stored in separate folders named after each annotation file):
./rapidStats.sh -o=/path_to_output_directory/ -f=reads.fq -a=file.bed,file2.bed -i=/path_to_index -r=/rapidPath/
If using a FASTQ file and performing a two-step alignment:
./rapidStats.sh -o=/path_to_output_directory/ -f=reads.fq -a=file.bed -i=/path_to_index --contamin=yes --indexco=/path_to_contaminants_index -r=/rapidPath/
The currently available parameters are listed below.
short | long params | explanation |
---|---|---|
-h | --help | show the help on screen |
-o | --out | path to the output directory; the directory will be created if it does not exist |
-f | --file | path to the read FASTQ/BAM/SAM file |
-ft | --filetype | BAM/SAM/fq: specify either BAM/SAM or FASTQ (default: FASTQ) |
-a | --annot | BED file with regions that should be annotated with read alignments (multiple BED files should be separated by commas) |
-r | --rapid | set location of the rapid installation bin folder (e.g. /home/software/RAPID/bin/) if not in PATH |
-i | --index | set location of the bowtie2 index for alignment |
-p | --proc | an INTEGER for the number of processors used by bowtie2 (default: 4) |
-m | --multi | an INTEGER for the number of alignments to report; the ‘-k’ parameter of bowtie2 (default: 100) |
 | --contamin=yes | use a two-step alignment, first aligning to a contamination index (default: no) |
 | --indexco | set location of the contamination bowtie2 index for alignment (only with contamin=yes) |
 | --remove=yes | remove unnecessary intermediate files (default: yes) |
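Because rapidNorm and rapidDiff later compare several rapidStats output folders, it can be convenient to run rapidStats once per sample in a loop. The following is a minimal sketch under stated assumptions: the wrapper name `batch_rapid_stats`, the `RAPID_STATS` and `RAPID_BIN` variables and the one-folder-per-sample layout are hypothetical conveniences, and only the flags listed in the table above are used. With `DRY_RUN=1` the commands are printed instead of executed.

```shell
#!/usr/bin/env bash
# Hypothetical batch wrapper: one rapidStats run per FASTQ file.
# RAPID_STATS and RAPID_BIN are placeholders, not defaults of RAPID.
RAPID_STATS=${RAPID_STATS:-./rapidStats.sh}
RAPID_BIN=${RAPID_BIN:-/rapidPath/}

# Run a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Usage: batch_rapid_stats OUT_ROOT BED INDEX FASTQ...
batch_rapid_stats() {
  out_root=$1; bed=$2; index=$3; shift 3
  for fq in "$@"; do
    sample=$(basename "$fq" .fq)      # reads/Ctrl1.fq -> Ctrl1
    # Each sample gets its own output folder under OUT_ROOT.
    run "$RAPID_STATS" "-o=$out_root/$sample/" "-f=$fq" \
        "-a=$bed" "-i=$index" "-r=$RAPID_BIN"
  done
}
```

Keeping one output folder per sample makes it straightforward to list those folders afterwards in the config files of the downstream modules.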
BED file format (do not provide a header; it is shown here only for clarity)¶
chromosome | start | end | geneName | type | strand (Gene Direction) |
---|---|---|---|---|---|
chr1 | 1234 | 1368 | geneA | region | + |
chr2 | 1234 | 1368 | geneB | region | - |
chr2 | 1432 | 1568 | geneB | region | - |
chr3 | 1234 | 1368 | geneC | background | - |
The type column in the BED file indicates whether a gene should be treated as background (knockdown) during normalization.
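A quick sanity check of a BED file before running rapidStats might look like the sketch below. It is not part of RAPID: the function name `check_bed` is hypothetical, and the checks assume tab-delimited input with exactly the six columns shown above, integer coordinates with start smaller than end, and a strand of `+` or `-`. A header line left in the file is caught by the integer check.

```shell
#!/usr/bin/env bash
# Hypothetical helper: sanity-check a BED file before passing it to rapidStats.
# Assumes tab-delimited columns: chromosome, start, end, geneName, type, strand.
check_bed() {
  awk -F'\t' '
    BEGIN { bad = 0 }
    # Wrong number of columns: report and skip the remaining checks.
    NF != 6 {
      printf "line %d: expected 6 columns, found %d\n", NR, NF
      bad = 1; next
    }
    # Non-numeric coordinates usually mean a header line was left in.
    $2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+$/ {
      printf "line %d: start and end must be integers\n", NR
      bad = 1; next
    }
    $2 + 0 >= $3 + 0 {
      printf "line %d: start must be smaller than end\n", NR
      bad = 1
    }
    $6 != "+" && $6 != "-" {
      printf "line %d: strand must be + or -\n", NR
      bad = 1
    }
    END { exit bad }   # non-zero exit status if any line failed
  ' "$1"
}
```

Calling `check_bed file.bed` prints one message per problem and returns a non-zero status if any line fails, so it can be used as a guard in a wrapper script.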
rapidNorm¶
The normalization module facilitates the comparison of genes across samples, and of samples across genes. Because sequencing depth differs between samples, the read counts have to be normalized. RAPID provides two kinds of normalization: (i) DESeq2-based, and (ii) a variant of the Total Count Scaling (TCS) method that accounts for the knockdown-associated small RNAs inherent in sequencing. For a detailed description of the normalization strategy, see the bioRxiv preprint.
By default, RAPID uses the modified TCS based normalization method. However, in order to provide flexibility with the choice of normalization, we have also incorporated the DESeq2 based normalization.
Input¶
- BED file containing the localization and names of genes/regions to be compared. Include only genes/regions that were quantified in rapidStats
- Config file containing the location of rapidStats output folders
Sample script:¶
If normalizing using the TCS based normalization:
./rapidNorm.sh --out=/path_to_output_directory/ --conf=data.config --annot=regions.bed --rapid=/rapidPath/
If normalizing using the DESeq2 based normalization:
./rapidNorm.sh --out=/path_to_output_directory/ --conf=data.config --annot=regions.bed --rapid=/rapidPath/ -d=T
If normalizing using TCS-based scaling while considering only reads of length 23 bp and 25 bp:
./rapidNorm.sh --out=/path_to_output_directory/ --conf=data.config --annot=regions.bed --rapid=/rapidPath/ -l=23,25
short | long params | explanation |
---|---|---|
-h | --help | output help |
-o | --out | path to the output directory; the directory will be created if it does not exist |
-c | --conf | the config file that defines which rapidStats analysis folders should be used |
-a | --annot | BED file with regions that should be used for the comparison; this must be a subset of the regions that were used for the rapidStats calls |
-r | --rapid | set location of the rapid installation bin folder (e.g. /home/software/RAPID/bin/) or put into PATH variable |
-d | --deseq | LOGICAL value. Use only TRUE or FALSE. Set this to TRUE if you wish to use DESeq2-based normalization. Default is FALSE, which does a total count based scaling. |
-l | --restrictlength | INTEGER read length(s) to be considered. If not provided, all reads are used. (Multiple read lengths should be separated by commas) |
The config file is a simple tab-delimited file with three columns: the path to the folder produced by rapidStats, the name of the experiment, and the list of regions that need to be corrected for in TCS-based normalization. Each line is one dataset to be included in the normalization. The normalized statistics can later be used to make comparison plots with rapidVis.
Config file format¶
location | name | background |
---|---|---|
/Control1/ | Ctrl1 | none |
/Control2/ | Ctrl2 | none |
/Condition1/ | Cond1 | geneA,geneB |
/Condition2/ | Cond2 | none |
geneA,geneB - Gene names provided as background must match those provided in the rapidStats BED file.
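Since the config file must be tab-delimited, writing it with `printf` avoids accidental spaces from a text editor. The sketch below is an illustration only: the helper name `write_norm_config` is hypothetical, and it writes no header line, which assumes that the header in the table above is shown for clarity only (this documentation does not state it either way).

```shell
#!/usr/bin/env bash
# Hypothetical helper: write a rapidNorm config as tab-delimited text.
# Assumption: no header line is written; adjust if your setup expects one.
# Usage: write_norm_config OUTFILE location name background [location name background ...]
write_norm_config() {
  outfile=$1; shift
  : > "$outfile"                         # create or empty the file
  while [ "$#" -ge 3 ]; do
    # One dataset per line: rapidStats folder, experiment name, background genes.
    printf '%s\t%s\t%s\n' "$1" "$2" "$3" >> "$outfile"
    shift 3
  done
}

# Example call, using the values from the table above:
#   write_norm_config data.config \
#     /Control1/ Ctrl1 none \
#     /Condition1/ Cond1 geneA,geneB
```

The background column takes `none` or a comma-separated list of gene names, exactly as in the example table.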
rapidVis¶
The visualization module of RAPID creates informative plots from the output of rapidStats and rapidNorm.
Input¶
- Path of the output folder from rapidStats, and rapidNorm
- BED file containing the localization and names of genes/regions to be visualized. Include only genes/regions that were quantified in rapidStats
Sample script:¶
If you want to plot rapidStats output:
./rapidVis.sh -t=stats -o=/path_to_output_directory_rapidStats/ -a=regions.bed -r=/rapidPath/
If you want to plot rapidNorm output:
./rapidVis.sh -t=compare -o=/path_to_output_directory_rapidNorm/ -r=/rapidPath/
short | long params | explanation |
---|---|---|
-h | --help | output help |
-o | --out | output folder of rapidStats.sh or rapidNorm.sh (where Statistics and other files are located) |
-t | --type | stats OR compare: use stats to visualize rapidStats results, or compare to visualize rapidNorm results |
-a | --annot | BED file with regions that should be visualized (not required for compare). Caution: include only the genes/regions that were quantified in rapidStats |
-r | --rapid | set location of the rapid installation bin folder (e.g. /home/software/RAPID/bin/) or put into PATH variable |
rapidDiff¶
This module of RAPID runs DESeq2 and generates basic graphs to highlight the differentially expressed genes/regions among the samples.
Input¶
- Path of the output folder from rapidStats
- Config file describing the DESeq2 analysis setup
Sample script:¶
Generic Format:
./rapidDiff.sh --out=complete/path/outputDirectory/ --conf=data.config
If a different q-value cut-off is required:
./rapidDiff.sh --out=complete/path/outputDirectory/ --conf=data.config --alpha=0.01
If only reads of length 23 bp and 25 bp should be considered:
./rapidDiff.sh --out=complete/path/outputDirectory/ --conf=data.config --alpha=0.01 -l=23,25
short | long params | explanation |
---|---|---|
-h | --help | output help |
-o | --out | path to the output directory; the directory will be created if it does not exist |
-c | --conf | the config file that defines which rapidStats analysis folders should be used for extracting the raw counts of the genes/regions analyzed |
-a | --alpha | qValue (adjusted p-value) cut-off to highlight in the MA-plot. Default is 0.05 |
-n | --nVal | top ‘n’ values to be shown as a heatmap. The top ‘n’ values are chosen in ascending order of qValue |
-r | --rapid | set location of the rapid installation bin folder (e.g. /home/software/RAPID/bin/) or put into PATH variable |
-l | --restrictlength | INTEGER read length(s) to be considered (default: all). Separate multiple values by commas. |
Config file format¶
sampleName | location | condition |
---|---|---|
Control1 | Ctrl1 | untreated |
Condition1 | Cond1 | treated |
This config file is a simple tab-delimited file with three columns, using the same headers as shown in the format above.
- sampleName: the name to be used in the analysis output.
- location: the location of the rapidStats analysis folder from which the raw counts of the analyzed genes/regions are extracted (USE ONLY ABSOLUTE PATHS).
- condition: whether the sample is untreated or treated. For example, use treated for drug-treated cancer samples and untreated for cancer samples that received no treatment.
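The rapidDiff config can be generated in the same spirit. The sketch below is a hypothetical helper, not part of RAPID: `write_diff_config` writes the documented header line followed by one tab-delimited row per sample, and it rejects any location that is not an absolute path, in line with the note above. The `/data/...` paths in the usage comment are placeholders.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write a rapidDiff config with the documented header.
# Usage: write_diff_config OUTFILE sampleName location condition [...]
write_diff_config() {
  outfile=$1; shift
  printf 'sampleName\tlocation\tcondition\n' > "$outfile"
  while [ "$#" -ge 3 ]; do
    case $2 in
      /*) ;;   # absolute path: accepted
      *)  echo "location must be an absolute path: $2" >&2
          return 1 ;;
    esac
    printf '%s\t%s\t%s\n' "$1" "$2" "$3" >> "$outfile"
    shift 3
  done
}

# Example call (placeholder paths):
#   write_diff_config data.config \
#     Control1   /data/Ctrl1/ untreated \
#     Condition1 /data/Cond1/ treated
```

Failing early on a relative path is a design choice of this sketch; it surfaces the most likely configuration mistake before DESeq2 is started.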