Next-gen sequence alignment and RNA-seq analysis tools

  1. Bowtie is an ultrafast system for aligning short reads from next-generation sequencers to the human genome or any other genome.  Bowtie2, which supports gapped alignments and longer reads and is equally fast, appeared in late 2011. The Bowtie project has been led from the beginning by Ben Langmead.
  2. TopHat is a fast splice junction mapper for RNA-Seq reads.  TopHat doesn’t need annotation, meaning it can find novel exons and splice sites even if they are missing from standard gene annotation.  TopHat was originally developed by Cole Trapnell, and TopHat2 was developed primarily by Daehwan Kim.
  3. HISAT is a very fast splice junction mapper, a successor to TopHat that is equally accurate and up to 50 times faster. The HISAT project and its spinoff, HISAT-Genotype, are led by Daehwan Kim.
  4. StringTie is a very fast and accurate transcript assembler and abundance estimator for RNA-seq data. Designed as a successor to Cufflinks, StringTie assembles transcripts from the alignments produced by TopHat/HISAT, identifying novel isoforms and estimating expression levels for all transcripts. The StringTie project is led by Ela Pertea.
  5. Cufflinks assembles the reads from an RNA-seq experiment, producing full-length transcripts in multiple isoforms and quantitating the levels of expression of each gene and each isoform.  The Cufflinks project is led by former student Cole Trapnell.
  6. MUMmer is a system for aligning whole genomes, chromosomes, and other very long DNA sequences.  It includes the Nucmer and Promer alignment tools. MUMmer has been in continuous use for more than 15 years, and is still actively used and supported by the lab. MUMmer4 is on GitHub as of 2018.
  7. DIAMUND is an efficient algorithm for variant detection that compares DNA sequences directly to one another, without aligning them to the reference genome. When used on exome sequences from family trios, or to compare normal and diseased samples from the same individual, it produces a dramatically smaller list of candidate mutations than previous methods. Original developers: Steven Salzberg and Ela Pertea.
  8. TopHat-Fusion is an enhanced version of TopHat with the ability to align reads across chromosomal fusion points, which result from the breakage and re-joining of different chromosomes, a common event in some tumors. Original developer: Daehwan Kim.
  9. EDGE-pro aligns and quantitates transcript data from bacterial and archaeal RNA-seq experiments. Original developer: Tanja Magoc.

Genome Assembly

  1. FLASH, Fast Length Adjustment of SHort reads, is a very fast program to merge paired-end reads that were sequenced from fragments that are shorter than twice the read length. Original developer: Tanja Magoc. Read the paper.
  2. MaSuRCA is a whole-genome assembler developed originally at the University of Maryland by Jim Yorke, Aleksey Zimin, and their colleagues. Ongoing development is a joint effort between JHU and UMD led by Aleksey Zimin. The latest version of the assembler includes modules designed to create assemblies using both short reads (Illumina) and long reads (PacBio/Oxford Nanopore).
  3. Quake is an error-correction package that detects and corrects substitution sequencing errors in whole-genome sequencing data sets with deep coverage, primarily for next-generation sequencing projects. Original developer: David Kelley.  Read the paper.
  4. AMOScmp is a comparative genome assembler, which uses one genome as a reference on which to assemble the genome of another, closely related species.  Original developers: Mihai Pop and Adam Phillippy. Read the paper here.
  5. Minimus is a small, lightweight assembler for small jobs such as assembling a viral genome, assembling a set of reads from a single gene, or other tasks that don’t require a large-genome assembler. Original developers: Dan Sommer, Art Delcher, Steven Salzberg, Mihai Pop.  Read the paper.
  6. The AMOS Assembler project is a set of tools, libraries, and freestanding genome assemblers, all open source. AMOS is also an open consortium that we started at TIGR, and that now includes multiple institutions.
  7. Hawkeye, a flexible graphical interface to genome assemblies from a variety of assemblers.  Original developers: Mike Schatz and Adam Phillippy. Read the paper.
  8. Bambus was the first publicly available, standalone genome assembly scaffolder. It orders and orients contigs into scaffolds based on various types of linking information.  Mihai Pop’s lab subsequently released Bambus2.
  9. AutoEditor, an older tool for correcting sequencing and basecaller errors using sequence assembly and chromatogram data from Sanger sequencing machines. On average AutoEditor corrected 80% of erroneous base calls, with an accuracy of 99.99%.  Original developers: Pavel Gajer and Mike Schatz. Read the paper.
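
The read-merging idea described for FLASH above can be illustrated with a minimal sketch. This is not FLASH's actual implementation: the function names, the default thresholds, and the rule of keeping the first read's bases at mismatched positions are all assumptions made for illustration (a real merger would also use base quality scores). The sketch reverse-complements the second read, tries every candidate overlap length, and keeps the overlap with the lowest mismatch ratio.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq))


def merge_pair(read1, read2, min_overlap=10, max_mismatch_ratio=0.25):
    """Merge a paired-end read pair into one sequence, or return None.

    read2 is assumed to come from the opposite strand, so it is
    reverse-complemented first. Thresholds are illustrative assumptions.
    """
    mate = reverse_complement(read2)
    best = None  # (mismatch_ratio, -overlap_length, overlap_length)
    for olen in range(min_overlap, min(len(read1), len(mate)) + 1):
        tail = read1[-olen:]   # end of the first read
        head = mate[:olen]     # start of the reoriented second read
        mismatches = sum(a != b for a, b in zip(tail, head))
        ratio = mismatches / olen
        if ratio <= max_mismatch_ratio:
            candidate = (ratio, -olen, olen)
            # Prefer fewer mismatches, then the longer overlap.
            if best is None or candidate[:2] < best[:2]:
                best = candidate
    if best is None:
        return None
    # Simplification: keep read1's bases across the whole overlap.
    return read1 + mate[best[2]:]
```

For example, two 25-base reads taken from opposite ends of a 36-base fragment overlap by 14 bases, and `merge_pair` would be expected to return the original fragment.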

Computational Gene Finding and Metagenomics

  1. Kraken is a very fast system for identifying the species represented by short (or long) DNA sequences, usually obtained through microbiome or metagenomic studies. The Kraken project is led by former Ph.D. student Derrick Wood.
  2. KrakenUniq (formerly KrakenHLL) is a version of Kraken 1 that runs as fast as Kraken and can work with the same databases, but additionally counts the number of unique k-mers using the stream sketching algorithm HyperLogLog (HLL). This allows the user to filter and rank results by the coverage of genomes in the database, rather than the read counts. (Here’s the paper.)
  3. Centrifuge is a very fast and memory-efficient system for metagenomic sequence analysis. It uses the Burrows-Wheeler transform (BWT) and the Ferragina-Manzini (FM) index to save memory. Centrifuge was developed by Daehwan Kim, Li Song, and Florian Breitwieser.
  4. Glimmer uses interpolated Markov models (IMMs) to find genes in microbial DNA. It has been used around the world on thousands of genomes.  Originally developed by Art Delcher and Steven Salzberg.
  5. Phymm and PhymmBL, first released in 2009, are systems for classifying short DNA sequences from metagenomics projects, labelling them with their likely species name. Originally developed by Arthur Brady.
  6. JIGSAW, a program that predicts gene models using the output from multiple sources of evidence, including other gene finders, Blast searches, and other alignment data. Originally developed by Jonathan Allen.
  7. GlimmerHMM, an interpolated Markov Model system for finding genes in many eukaryotes, including P. falciparum, A. thaliana, rice (O. sativa), mosquito (A. aegypti), B. malayi, C. neoformans, and others. Originally developed by Mihaela Pertea.
  8. GeneZilla, a generalized HMM for eukaryotic gene finding developed by Bill Majoros, a former Salzberg lab member (when the lab was at TIGR).
  9. GeneSplicer, a fast system for detecting splice sites in genomic DNA of various eukaryotes. Originally developed by Mihaela Pertea.
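
The unique k-mer counting mentioned for KrakenUniq above can be sketched with a small HyperLogLog estimator. This is a simplified illustration, not KrakenUniq's code: the class and function names, the choice of hash (BLAKE2b from Python's standard library), the register count, and k = 31 are all assumptions. The point it shows is that many copies of the same read add almost nothing to the unique k-mer estimate, which is why the estimate can separate real genome coverage from a pile of duplicate reads.

```python
import hashlib
import math


def kmers(seq, k=31):
    """Yield every k-mer (substring of length k) in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


class HyperLogLog:
    """Minimal HyperLogLog sketch with 2**p registers (assumes p >= 7)."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        digest = hashlib.blake2b(item.encode(), digest_size=8).digest()
        h = int.from_bytes(digest, "big")           # 64-bit hash value
        idx = h >> (64 - self.p)                    # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)       # remaining bits
        rank = (64 - self.p) - rest.bit_length() + 1  # leading zeros + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)       # bias constant, m >= 128
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction (linear counting).
            return self.m * math.log(self.m / zeros)
        return raw


def unique_kmer_estimate(reads, k=31):
    """Estimate the number of distinct k-mers across all reads."""
    sketch = HyperLogLog()
    for read in reads:
        for kmer in kmers(read, k):
            sketch.add(kmer)
    return sketch.estimate()
```

A taxon supported by thousands of reads but only a handful of unique k-mers would look suspicious under this measure, whereas read counts alone would not reveal the problem.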

Transcription terminators, operons, and motif analysis tools

  1. TransTermHP (updated in 2010), a program that finds rho-independent transcription terminators in bacterial genomes. Originally developed in 2000 by Maria Ermolaeva.  Re-designed and re-implemented in 2007 by Carl Kingsford.
  2. OperonDB (update in progress, 2015), results from our operon-finding software on a large number of prokaryotic genomes. Described in OperonDB: a comprehensive database of predicted operons in microbial genomes (Pertea et al. 2009). Originally developed in 2001 by Maria Ermolaeva.  Redesigned and re-implemented in 2008 by Mihaela Pertea.
  3. ELPH, a motif finder that can find ribosome binding sites, exon splicing enhancers, or regulatory sites. Original developer: Mihaela Pertea.
  4. SeeESE, an online tool for identifying exon splicing enhancers (ESEs) in Arabidopsis, Drosophila, and other species. Original developers: Mihaela Pertea and Steven Mount.
  5. Skewed oligomers from bacterial and archaeal genomes (described in Salzberg et al., Gene 217:1-2, 1998).  Get the source code.

Machine learning systems, pre-1995 and pre-computational biology