CUSHAW3 - sensitive and accurate gapped short-read alignment

Introduction

CUSHAW is a well-established leading next-generation sequencing read alignment software package based on multi-core and many-core computing. CUSHAW3, the third distribution of CUSHAW software package for next-generation sequencing read alignment, is an open-source parallelized, sensitive and accurate short-read aligner for shared-memory (multi-threaded) and distributed-memory sytems (MPI). We have compared CUSHAW3 to other leading aligners: Novoalign, CUSHAW2, BWA-MEM, Bowtie2 and GEM, by aligning both simulated and real reads to the human genome. The results show that CUSHAW3 consistently outperforms CUSHAW2, BWA-MEM, Bowtie2 and GEM in terms of single-end and paired-end alignment. Furthermore, our aligner has demonstrated better paired-end alignment performance than Novoalign for short-reads with high error rates. For color-space alignment, CUSHAW3 is consistently one of the best aligners compared to SHRiMP2 and BFAST. This algorithm has been presented in the paper "CUSHAW3: sensitive and accurate base-space and color-space short-read alignment with hybrid seeding" and further extended to support distributed-memory systems using UPC++, which is presented in the paper Parallel and scalable short-read alignment on multi-core clusters using UPC++.

Note:

We have noticed an exceptional problem that CUSHAW3 compiled by GCC 4.9 produces inconsistent alignment results with the binaries compiled by older GCC compilers. We are investigating the problem and expect to fix it as soon as possible. Please do not use GCC 4.9 to compile CUSHAW3 for the present time. Alternatively, you can also choose to download the statically linked binary from the project website.

How to produce all mappings per read? specify a very large integer value to options "-multi" and "-max_occ" simultaneously.

Feedback

Click here to get the feedback from users.

Downloads

CUSHAW3 (v3.0.3)
More details about the changes in this version are availabe at changelog.
CUSHAW3-UPC (v1.0)^NEW
A parallel and scalable implementation of CUSHAW3 (v3.0.3) for multi-core compute clusters based on the UPC++ parallel programming language. The manual is available at here and the work is published in this paper.
Simulated reads

GCAT Benchmarks

We have evaluated the alignment quality, as well as the variant calling performance, using the public benchmark datasets in GCAT (Genome Comparison & Analytic Testing). Click the following links to get the results and comparisons with graphical interfaces.

Alignment on Illumina-like 100-bp single-end reads with SMALL indels
Alignment on Illumina-like 100-bp paired-end reads with SMALL indels
Alignment on Illumina-like 100-bp single-end reads with LARGE indels
Alignment on Illumina-like 100-bp paired-end reads with LARGE indels
Variant calling (with SAMtools v0.1.19) on Illumina 100-bp paired-end exome dataset with 30x coverage

Citation

Yongchao Liu, Bernt Popp, and Bertil Schmidt: "CUSHAW3: sensitive and accurate base-space and color-space short-read alignment with hybrid seeding." PLoS ONE, 2014, 9(1):e86869
Jorge González-Domínguez, Yongchao Liu, Bertil Schmidt: "Parallel and scalable short-read alignment on multi-core clusters using UPC++". PLoS One, 2016, 11(1): e0145490.

Other related papers

Yongchao Liu, Bertil Schmidt, and Douglas L. Maskell: "A fast CUDA compatible short read aligner to large genomes". GPU Technology Conference 2012 (GTC 2012), San Jose, USA, 2012
Yongchao Liu, Bertil Schmidt, and Douglas L. Maskell: "CUSHAW: a CUDA compatible short read aligner to large genomes based on the Burrows-Wheeler transform". Bioinformatics, 2012, 28(14): 1830-1837
Yongchao Liu and Bertil Schmidt: "Long read alignment based on maximal exact match seeds".Bioinformatics, 2012, 28(18): i318-324
Yongchao Liu and Bertil Schmidt: "CUSHAW2-GPU: empowering faster gapped short-read alignment using GPU computing". IEEE Design & Test, 2014, 31(1): 31-39
Yongchao Liu and Bertil Schmidt: "CUSHAW Software Package: harnessing CUDA-enabled GPUs for next generation sequencing read alignment". GPU Technology Conference 2014 (GTC 2014), San Jose, USA, 2014
Yongchao Liu, Thomas Hankeln, and Bertil Schmidt: "Parallel and space-efficient construction of Burrows-Wheeler transform and suffix array for big genome data". IEEE Transactions on Computational Biology and Bioinformatics, 2016, 13(3): 592-598.
Yongchao Liu and Bertil Schmidt: "CUSHAW Suite: parallel and efficient algorithms for NGS read alignment". Algorithms for Next-Generations Sequencing Data: Techniques, Approaches and Applications, edited by Mourad Elloumi, Springer, 2017.
Yuandong Chan, Kai Xu, Haidong Lan, Weiguo Liu, Yongchao Liu and Bertil Schmidt: "PUNAS: a parallel ungapped-alignment-featured seed verification for next-generation sequencing read alignment", 31st IEEE International Parallel and Distributed Processing Symposium (IPDPS 2017), 2017, pp. 52-61.

Parameters

In CUSHAW3, we have intergrated both the genome indexing and the alignment algorithms into a single executable binary, and have given three commands, i.e. index, align and calign for the genoe indexing, base-space alignment and color-space alignment, respectively. The following list the parameters for all of the three commands.

Commands

index build BWT and FM-index
align perform base-space read alignments (e.g. Illumina, 454, Ion Torrent and PacBio)
calign perform color-space read alignments (ABI SOLiD)

Genome indexing

-p <string> (prefix of the index, default = fasta name)
-i <int> (interval for reduced suffix array [3], 2^#INT)
-c (build color-space index)

Base-space alignment

Input:

-r <string> (the file name base for the reference genome)
-f <string> file1 [file2] (single-end sequence files in FASTA/FASTQ format)
-b <string> file1 [file2] (single-end sequence files in BAM format)
-s <string> file1 [file2] (single-end sequence files in SAM format)
-q <string> file1_1 file1_2 [file2_1 file2_2] (paired-end sequence files in FASTA/FASTQ format)
-mode <int> (paired-end/mate-paired reads, default = 0)
- 0 means paired-end reads
- 1 means mate-paired reads

Output:

-o <string> (SAM output file path, default = STDOUT)
-min_qual <int> (minimal mapping quality score of a reported alignment, default = 0)
-multi <int> (output <= #int [1, 1000] equivalent best alignments, default = 10)
-rgid <string> (read group identifier [tag RG:ID])
-rgsm <string> (read group sample name [tag RG:SM], required if #rgid is given)
-rglb <string> (read group library [tag RG:LD], ineffective if #rgid is not given)
-rgpl <string> (read group platform/technology [tag RG:PL], ineffective if #rigid is not given)
supported values: capillary, ls454, illumina, solid, helicos, iontorrent, and pacbio
-rgpu <string> (read group platform unit identifier [tag RG:PU], ineffective if #rgid is not given)
-rgcn <string> (name of sequencing center produced the read [tag RG:CN], ineffiective if #rgid is not given)
-rgds <string> (description about the reads [tag RG:DS], ineffiective if #rgid is not given)
-rgdt <string> (date on which the run was produced [tag RG:DT], ineffiective if #rgid is not given)
-rgpi <string> (predicated median insert size [tag RG:PI], ineffiective if #rgid is not given)

Scoring:

-match <int> (score for a match, default = 1)
-mismatch <int> (penalty for a mismatch, default = 3)
-gopen <int> (gap open penalty, default = 5)
-gext <int> (gap extension penalty, default = 2)

Align:

-atype <int> (alignment type used, default = 1)
- 0 [local alignment]
- 1 [local + semiglobal alignment]
-mask_amb <int> (masking ambiguous bases in the reference, default = 0)
-min_score <int> (minimal optimal local alignment score divided by matching score, default = 30)
-min_id <float> (minimal identity of optical local alignments, default = 0.90)
-gmin_id <float> (minimal identity of optimal semiglobal alignments, default = 0.65)
-min_ratio <float> (minimal ratio of reads in optimal local/semiglobal alignments, default = 0.80)
-max_edit_dist <int> (maximum edit distance in optimal local/semiglobal alignments, default = -1 [-1 means disabled])

Seed:

-min_seed <int> (lower bound of minimal seed size, default = 13)
-max_seed <int> (upper bound of minimal seed size, default = 49)
-miss_prob <float> (missing probability to estimate the seed sizes, default = 0.04)
-max_occ <int> (maximal number of occurrences per seed, default = 2000)
-max_seeds_per_batch <int> (maximum number of seeds per batch for top hit selection, default = 16384)

Pairing:

-sensitive_pair <int> (more sensitive read pairing, default = 0)
-avg_ins <int> (insert size for paired-end reads, estimated from input if not specified)
-ins_std <int> (standard deviation of insert size, estimated from input if not specified)
-ins_npairs <int> (top number of read pairs for insert size estimation [#int times 0x10000], default = 1)
-ins_mapq <int> (minimal mapping quality score of a SE alignment for insert size estimation, default = 20)
-max_seedpairs <int> (maximal number of seed pairs per read pair, default = 10000000)
-no_rescue (do not rescue the mate using Smith-Waterman for an un-paired read)
-rescue_top_hits <int> (attempt the second rescuing from #int top hits for each end, default = 100)

Compute:

-t <int> (number of threads, default = 1)

Others:

-h (print out the usage of the program)

Color-space alignment

Input:

-r <string> (the file name base for the reference genome)
-f <string> file1 [file2] (single-end sequence files in FASTA/FASTQ format)
-b <string> file1 [file2] (single-end sequence files in BAM format)
-s <string> file1 [file2] (single-end sequence files in SAM format)
-q <string> file1_1 file1_2 [file2_1 file2_2] (paired-end sequence files in FASTA/FASTQ format)
-mode <int> (paired-end/mate-paired reads, default = 1)
- 0 means paired-end reads
- 1 means mate-paired reads
-trim_primer <int> (trim the primer base in color-space reads, default = 1)

Output:

-o <string> (SAM output file path, default = STDOUT)
-min_qual <int> (minimal mapping quality score of a reported alignment, default = 0)
-multi <int> (output <= #int [1, 1000] equivalent best alignments, default = 10)
-rgid <string> (read group identifier [tag RG:ID])
-rgsm <string> (read group sample name [tag RG:SM], required if #rgid is given)
-rglb <string> (read group library [tag RG:LD], ineffective if #rgid is not given)
-rgpl <string> (read group platform/technology [tag RG:PL], ineffective if #rigid is not given)
supported values: capillary, ls454, illumina, solid, helicos, iontorrent, and pacbio
-rgpu <string> (read group platform unit identifier [tag RG:PU], ineffective if #rgid is not given)
-rgcn <string> (name of sequencing center produced the read [tag RG:CN], ineffiective if #rgid is not given)
-rgds <string> (description about the reads [tag RG:DS], ineffiective if #rgid is not given)
-rgdt <string> (date on which the run was produced [tag RG:DT], ineffiective if #rgid is not given)
-rgpi <string> (predicated median insert size [tag RG:PI], ineffiective if #rgid is not given)

Scoring:

-match <int> (score for a match, default = 1)
-mismatch <int> (penalty for a mismatch, default = 3)
-gopen <int> (gap open penalty, default = 5)
-gext <int> (gap extension penalty, default = 2)

Align:

-atype <int> (alignment type used, default = 1)
- 0 [local alignment]
- 1 [local + semiglobal alignment]
-mask_amb <int> (masking ambiguous bases in the reference, default = 1)
-min_score <int> (minimal optimal local alignment score divided by matching score, default = 10)
-min_id <float> (minimal identity of optical local alignments, default = 0.90)
-gmin_id <float> (minimal identity of optimal semiglobal alignments, default = 0.65)
-min_ratio <float> (minimal ratio of reads in optimal local/semiglobal alignments, default = 0.80)
-max_edit_dist <int> (maximum edit distance in optimal local/semiglobal alignments, default = -1 [-1 means disabled])

Seed:

-min_seed <int> (lower bound of minimal seed size, default = 13)
-max_seed <int> (upper bound of minimal seed size, default = 49)
-miss_prob <float> (missing probability to estimate the seed sizes, default = 0.04)
-max_occ <int> (maximal number of occurrences per seed, default = 1024)
-max_seeds_per_batch <int> (maximum number of seeds per batch for top hit selection, default = 16384)

Pairing:

-sensitive_pair <int> (more sensitive read pairing, default = 0)
-avg_ins <int> (insert size for paired-end reads, estimated from input if not specified)
-ins_std <int> (standard deviation of insert size, estimated from input if not specified)
-ins_npairs <int> (top number of read pairs for insert size estimation [#int times 0x10000], default = 1)
-ins_mapq <int> (minimal mapping quality score of a SE alignment for insert size estimation, default = 20)
-max_seedpairs <int> (maximal number of seed pairs per read pair, default = 10000000)
-no_rescue (do not rescue the mate using Smith-Waterman for an un-paired read)
-rescue_top_hits <int> (attempt the second rescuing from #int top hits for each end, default = 100)

Compute:

-t <int> (number of threads, default = 1)

Others:

-h (print out the usage of the program)

Installation and Usage

Installation from source code

Preparation

Users can configure CUSHAW3 to use SSE2, instead of SSE4, when SSSE3 is not available on your CPUs. This configuration can be done by changing "have_ssse3 = 1" to "have_ssse3 = 0" in the Makefile. If users do not know whether their CPUs support SSSE3 or not, please just simply change to "have_ssse3=0" in the makefile because SSE2 is supported in nearlly all Intel and AMD CPUs.
How to known when to modify the Makefile to determine the use of either SSSE3 or SSSE2?
- run command "cat /proc/cpuinfo" to check the CPU information. In the "flags" line, check the existence of word "ssse3". If existing, it means that your CPU support SSSE3 and otherwise, not support.
- When you failed to compile CUSHAW3, please first check whether it is caused by unidentified SSSE3 assembly instructions.

Compile the algorithm

Type "make" command in the root directory of the software to compile the aligner.

Build the BWT and the FM-index

Type command "cushaw3 index -p nameprefix genome.fa" to construct BWT and FM-index for genomes, regardless of the genome size.

Typical Usage

cushaw3 align -r bwt_file_base -f infile1.fa -t 12 -o myalignment.sam
cushaw3 align -r bwt_file_base -f infile1.fa infile2.fq -t 12 -o myalignment.sam
cushaw3 align -r bwt_file_base -q infile_1.fq infile_1.fq -o myalignment.sam
cushaw3 calign -r bwt_file_base -q infile1.fa infile2.fq -t 12 -o myalignment.sam

Want all mappings per read?

specify a very large integer value to options "-multi" and "-max_occ" simultaneously. Please do not exceed the range of the signed integer type.

Important Notes:

gzip-Compressed FASTA and FASTQ formats, SAM and BAM foramts are supported as input.
When inputing multiple paired-end read files, the paired-end reads must have the same insert-size information.
The default scoring scheme is generally good enough for long read alignment. Certainly, better performance might be able to be obtained after making more efforts to finely tune the scoring scheme.
Users can use the parameter "-multi" to enable the output of mutlipe alignments per read.
Both aligned and unaligned reads are printed out to the SAM output file. In addition, for paired-end alignment, if an aligned read failed to be paired, it is outputted in single-end mode.
By default, CUSHAW3 estimates the insert size information from the input. The insert size is estimated from a fixed number of read pairs starting from the head of the inputs. This will take some extra time at startup time (e.g. takes about 1 minute using a single thread for the first 65536 100-bp read pairs). However, since this estimation is only conducted once, this extra time can be neligible. If users customize the insert size, this automatic estimation will be disabled, thus saving some time.
The parameter "-mask_amb" can be used to specify whether all ambiguous bases in the reference genome will be marsked or not.
By default, the maximum number of occurrences per seed is 2000. However, for some datasets from hightly repeatitive regions, the number of significant seeds may exceed this values. In this case, some signficant alignments may be lost. In order to solve this problem, users can use the option "-max_occ" to specify the maximum number of occurrences per seed. The drawbacks of increasing this value, however, is (1) more memory consumped for seed storage and (2) slower speed to process large quantitites of seeds.

Change Log

April 17, 2014
1. Removed some advanced compiling options for GCC in the Makefile.

Contact

If any questions or improvements, please feel free to contact Liu, Yongchao.

CUSHAW3 - sensitive and accurate gapped short-read alignment

Site Map

Project Links

List of My Software

Big Data

Machine Learning

Scientific Computing

Sequence Alignment

Motif Discovery

NGS Read Alignment

NGS Read Error Correction

NGS de novo Assembly

NGS SNV calling

NGS Metagenomics

Inspire Innovation

Other related papers

Commands

Input:

Output:

Scoring:

Align:

Seed:

Pairing:

Compute:

Others:

Input:

Output:

Scoring:

Align:

Seed:

Pairing:

Compute:

Others:

Installation from source code

Preparation

Compile the algorithm

Build the BWT and the FM-index

Typical Usage

Want all mappings per read?

Important Notes: