Introduction

CUSHAW is a well-established software package for next-generation sequencing read alignment on multi-core and many-core architectures. CUSHAW3, the third distribution of the CUSHAW package, is an open-source, parallelized, sensitive and accurate short-read aligner for shared-memory (multi-threaded) and distributed-memory (MPI) systems. We have compared CUSHAW3 to other leading aligners (Novoalign, CUSHAW2, BWA-MEM, Bowtie2 and GEM) by aligning both simulated and real reads to the human genome. The results show that CUSHAW3 consistently outperforms CUSHAW2, BWA-MEM, Bowtie2 and GEM in both single-end and paired-end alignment. Furthermore, CUSHAW3 achieves better paired-end alignment performance than Novoalign for short reads with high error rates. For color-space alignment, CUSHAW3 is consistently one of the best aligners compared to SHRiMP2 and BFAST. The algorithm is presented in the paper "CUSHAW3: sensitive and accurate base-space and color-space short-read alignment with hybrid seeding" and has been further extended to support distributed-memory systems using UPC++, as presented in the paper "Parallel and scalable short-read alignment on multi-core clusters using UPC++".

Note:


Feedback

Click here to get the feedback from users.


Downloads


GCAT Benchmarks

We have evaluated the alignment quality, as well as the variant calling performance, using the public benchmark datasets in GCAT (Genome Comparison & Analytic Testing). Click the following links to get the results and comparisons with graphical interfaces.


Citation

Other related papers


Parameters

In CUSHAW3, we have integrated both the genome indexing and the alignment algorithms into a single executable binary, which provides three commands: index, align and calign, for genome indexing, base-space alignment and color-space alignment, respectively. The parameters of all three commands are listed below.

Commands

Genome indexing

Base-space alignment

Input:

Output:

Scoring:

Align:

Seed:

Pairing:

Compute:

Others:

Color-space alignment

Input:

Output:

Scoring:

Align:

Seed:

Pairing:

Compute:

Others:


Installation and Usage

Installation from source code

Preparation

  1. Users can configure CUSHAW3 to use SSE2 instead of SSSE3 when SSSE3 is not available on their CPUs. This is done by changing "have_ssse3 = 1" to "have_ssse3 = 0" in the Makefile. Users who do not know whether their CPUs support SSSE3 can simply set "have_ssse3 = 0" in the Makefile, because SSE2 is supported by nearly all Intel and AMD CPUs.

  2. How do you know whether to modify the Makefile to use SSSE3 or SSE2?
    • Run the command "cat /proc/cpuinfo" to check the CPU information. In the "flags" line, look for the word "ssse3". If it is present, your CPU supports SSSE3; otherwise, it does not.
    • If CUSHAW3 fails to compile, first check whether the failure is caused by unrecognized SSSE3 assembly instructions.
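The check and the Makefile change described above can be scripted. The following is a minimal sketch, assuming a Linux system (so that /proc/cpuinfo exists) and a Makefile containing a line of the form "have_ssse3 = 1"; the helper names cpu_has_ssse3 and set_have_ssse3 are hypothetical and not part of CUSHAW3.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the given cpuinfo file lists the "ssse3" flag.
# Defaults to /proc/cpuinfo, which is available on Linux only.
cpu_has_ssse3() {
    grep -qw ssse3 "${1:-/proc/cpuinfo}"
}

# Hypothetical helper: set have_ssse3 to the given value (0 or 1) in a Makefile.
# Assumes the Makefile has a line starting with "have_ssse3", as described above.
set_have_ssse3() {
    makefile=$1
    value=$2
    # Write to a temporary file and then replace the original,
    # which avoids relying on the non-portable "sed -i".
    sed "s/^have_ssse3[[:space:]]*=.*/have_ssse3 = $value/" "$makefile" > "$makefile.tmp" &&
        mv "$makefile.tmp" "$makefile"
}

# Usage, from the CUSHAW3 source root before running "make":
#   if cpu_has_ssse3; then set_have_ssse3 Makefile 1; else set_have_ssse3 Makefile 0; fi
```

The helpers take the file path as an argument so that they can be tried on a copy of the Makefile before touching the real one.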

Compile the algorithm

Type "make" in the root directory of the software to compile the aligner.

Build the BWT and the FM-index

Run the command "cushaw3 index -p nameprefix genome.fa" to construct the BWT and FM-index for a genome, regardless of the genome size.

Typical Usage

  1. cushaw3 align -r bwt_file_base -f infile1.fa -t 12 -o myalignment.sam
  2. cushaw3 align -r bwt_file_base -f infile1.fa infile2.fq -t 12 -o myalignment.sam
  3. cushaw3 align -r bwt_file_base -q infile_1.fq infile_2.fq -o myalignment.sam
  4. cushaw3 calign -r bwt_file_base -q infile1.fa infile2.fq -t 12 -o myalignment.sam
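The indexing and alignment steps can be chained in a small wrapper. This is only a sketch: it uses the "index -p" and "align -r/-f/-t/-o" options shown above, and it assumes that the prefix given to "-p" is the same base name that "-r" expects. The function name index_and_align and the CUSHAW3 variable are hypothetical.

```shell
#!/bin/sh
# Name of the aligner binary; override it (e.g. CUSHAW3=echo) to preview
# the commands without running them.
CUSHAW3=${CUSHAW3:-cushaw3}

# Hypothetical wrapper: build the index, then align single-end reads.
# Arguments: genome FASTA, index prefix, read file, thread count, output SAM.
index_and_align() {
    genome=$1
    prefix=$2
    reads=$3
    threads=$4
    out=$5
    # Assumption: the "-p" prefix is what "-r" takes as the index base name.
    "$CUSHAW3" index -p "$prefix" "$genome" &&
        "$CUSHAW3" align -r "$prefix" -f "$reads" -t "$threads" -o "$out"
}

# Example call:
#   index_and_align genome.fa nameprefix infile1.fa 12 myalignment.sam
```

Because the alignment step is joined with "&&", it only runs when indexing has succeeded.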

Want all mappings per read?

  1. Specify a very large integer value for both the "-multi" and "-max_occ" options. The value must not exceed the range of the signed integer type.
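As an illustration, the sketch below checks a candidate value before passing it to both options. It assumes that the options are stored in a 32-bit signed integer, so that 2147483647 is the upper limit; this width is an assumption, not something stated by CUSHAW3. The helper name fits_signed_int, the value 1000000 and the output file name are likewise chosen for illustration only.

```shell
#!/bin/sh
# Assumption: the option values are parsed into a 32-bit signed integer.
INT_MAX=2147483647

# Hypothetical helper: succeeds if $1 is a positive decimal integer <= INT_MAX.
fits_signed_int() {
    case $1 in
        ''|*[!0-9]*) return 1 ;;   # reject empty or non-numeric input
    esac
    # The length check keeps very long digit strings away from the comparison.
    [ "${#1}" -le 10 ] && [ "$1" -ge 1 ] && [ "$1" -le "$INT_MAX" ]
}

limit=1000000   # arbitrary large value, for illustration
if fits_signed_int "$limit"; then
    # Prints the command; remove "echo" to actually run it.
    echo cushaw3 align -r bwt_file_base -f infile1.fa \
        -multi "$limit" -max_occ "$limit" -o all_mappings.sam
fi
```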

Important Notes:

  1. gzip-compressed FASTA and FASTQ formats, as well as SAM and BAM formats, are supported as input.
  2. When inputting multiple paired-end read files, all of the paired-end reads must share the same insert-size information.
  3. The default scoring scheme is generally good enough for long read alignment. Better performance may be obtainable by fine-tuning the scoring scheme.
  4. Users can use the parameter "-multi" to enable the output of multiple alignments per read.
  5. Both aligned and unaligned reads are written to the SAM output file. In addition, for paired-end alignment, if an aligned read fails to be paired, it is output in single-end mode.
  6. By default, CUSHAW3 estimates the insert size from the input, using a fixed number of read pairs taken from the head of the input files. This takes some extra time at startup (e.g. about 1 minute using a single thread for the first 65536 100-bp read pairs). However, since the estimation is conducted only once, this extra time is usually negligible. If users specify the insert size themselves, the automatic estimation is disabled, which saves some time.
  7. The parameter "-mask_amb" can be used to specify whether all ambiguous bases in the reference genome are masked or not.
  8. By default, the maximum number of occurrences per seed is 2000. However, for some datasets from highly repetitive regions, the number of occurrences of significant seeds may exceed this value, in which case some significant alignments may be lost. To address this, users can use the option "-max_occ" to raise the maximum number of occurrences per seed. The drawbacks of increasing this value are (1) higher memory consumption for seed storage and (2) lower speed, because larger quantities of seeds must be processed.
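Because aligned and unaligned reads end up in the same SAM file (note 5), it can be useful to count them separately. In the SAM format, bit 0x4 of the FLAG field (the second column) marks an unmapped read. The following is a minimal sketch using awk; the function name sam_mapping_counts is hypothetical, and it counts SAM records, so a read reported with several alignments is counted once per record.

```shell
#!/bin/sh
# Hypothetical helper: print "<mapped> <unmapped>" record counts for a SAM file.
sam_mapping_counts() {
    awk -F '\t' '
        /^@/ { next }                     # skip header lines
        {
            if (int($2 / 4) % 2 == 1)     # FLAG bit 0x4 set: read is unmapped
                unmapped++
            else
                mapped++
        }
        END { printf "%d %d\n", mapped, unmapped }
    ' "$1"
}

# Example call:
#   sam_mapping_counts myalignment.sam
```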

Change Log


Contact

If you have any questions or suggestions for improvement, please feel free to contact Liu, Yongchao.