Next-gen sequence alignment and RNA-seq analysis tools
- Bowtie and Bowtie2 are ultrafast systems for aligning short reads from next-generation sequencers to the human genome or any other genome. The Bowtie project has been led from the beginning by former student Ben Langmead, who continues to develop it in his own lab.
- HISAT2 is a very fast splice junction mapper, a successor to TopHat that is equally accurate and up to 50 times faster. The HISAT2 project and its spinoff, HISAT-Genotype, were developed and are led by Daehwan Kim, who continues to improve them in his own lab.
- Tophat is a fast splice aligner for RNA-Seq reads. TopHat doesn’t need annotation, meaning it can find novel exons and splice sites even if they are missing from standard gene annotation. TopHat was originally developed by Cole Trapnell, and TopHat2 was developed primarily by Daehwan Kim.
- StringTie is a very fast and accurate transcript assembler and abundance estimator for RNA-seq data. StringTie assembles transcripts from the alignments produced by TopHat/HISAT, identifying novel isoforms and estimating expression levels for all transcripts. The StringTie project has been led from the beginning by Prof. Ela Pertea.
- Cufflinks assembles the reads from an RNA-seq experiment, producing full-length transcripts in multiple isoforms, quantitating the levels of expression of each gene and each isoform. The Cufflinks project was led by former student Cole Trapnell.
- MUMmer is a system for aligning whole genomes, chromosomes, and other very long DNA sequences. It includes the Nucmer and Promer alignment tools. MUMmer has been in continuous usage for >15 years, and is still actively used. MUMmer4 has been on github since 2018.
- DIAMUND is an efficient algorithm for variant detection that compares DNA sequences directly to one another, without aligning them to the reference genome. When used on exome sequences from family trios, or to compare normal and diseased samples from the same individual, it produces a dramatically smaller list of candidate mutations than previous methods. Original developers: Steven Salzberg and Ela Pertea.
- TopHat-Fusion is an enhanced version of TopHat with the ability to align reads across chromosomal fusion points, which results from the breakage and re-joining of different chromosomes, a common event in some tumors. Original developer: Daehwan Kim.
- EDGE-pro aligns and quantitates transcript data from bacterial and archaeal RNA-seq experiments. Original developer: Tanja Magoc.
Genome Assembly and contamination detection
- Conterminator is an efficient method for detecting incorrectly labeled sequences across kingdoms by an exhaustive all-against-all sequence comparison. It is developed on top of modules provided by MMseqs2. Both programs were developed by former postdoc Martin Steinegger who continues to improve them in his own lab.
- FLASH, Fast Length Adjustment of SHort reads, is a very fast program to merge paired-end reads that were sequenced from fragments that are shorter than twice the read length. Original developer: Tanja Magoc. Read the paper.
- MaSuRCA is a whole-genome assembler developed originally at the University of Maryland by Aleksey Zimin, Jim Yorke, and their colleagues. Ongoing development is led by Aleksey Zimin who is now at JHU. The latest version of the assembler includes modules designed to create assemblies using both short reads (Illumina) and long reads (PacBio/Oxford Nanopore).
- Quake is an error-correction package that detects and correct substitution sequencing errors in whole-genome sequencing data sets with deep coverage, primarily for next-generation sequencing projects. Original developer: David Kelley. Read the paper.
- AMOScmp is a comparative genome assembler, which uses one genome as a reference on which to assemble another, closely related species. Original developers: Mihai Pop and Adam Phillippy. Read the paper here.
- Minimus is a small, lightweight assembler for small jobs such as assembling a viral genome, assembling a set of reads from a single gene, or other tasks that don’t require a large-genome assembler. Original developers: Dan Sommer, Art Delcher, Steven Salzberg, Mihai Pop. Read the paper.
- The AMOS Assembler project is a set of tools, libraries, and freestanding genome assemblers, all open source. AMOS is also an open consortium that we started at TIGR, and that now includes multiple institutions.
- Hawkeye, a flexible graphical interface to genome assemblies from a variety of assemblers. Original developers: Mike Schatz and Adam Phillippy. Read the paper.
- Bambus was the first publicly available, standalone genome assembly scaffolder. It orders and orients contigs into scaffolds based on various types of linking information. Mihai Pop’s lab subsequently released Bambus2.
- AutoEditor, an older tool for correcting sequencing and basecaller errors using sequence assembly and chromatogram data from Sanger sequencing machines. On average AutoEditor corrected 80% of erroneous base calls, with an accuracy of 99.99%. Original developers: Pavel Gajer and Mike Schatz. Read the paper.
Gene Finding, Genome annotation, and Metagenomics
- CHESS is a project to create a comprehensive annotation of the human genome, combining deep RNA-sequencing data from thousands of experiments to create a more accurate gene catalog than the widely-used RefSeq or Gencode gene sets. CHESS is a joint project with Ela Pertea’s lab. The 2018 paper appeared here.
- Liftoff is a new (2020) genome annotation system that maps genes and other genome annotations between assemblies of the same or closely-related species. Unlike previous systems, Liftoff that doesn’t require you to first align the genomes base-by-base, and it can also find additional gene copies present in the target assembly that are not annotated in the reference. The Liftoff project is led by Alaina Shumate.
- You can find a complete mapping of the RefSeq v110 human annotation from GRCh38 onto the finished CHM13 genome (from the T2T consortium), on our page here. We created the mapping using Liftoff.
- Balrog is a recent (2020) bacterial gene finder that uses a novel machine learning design to find protein-coding genes in any prokaryotic genome, without the need for species-specific training. Balrog has a sensitivity approaching 99% for most species. The Balrog project is led by Markus Sommer.
- Kraken is a very fast system for identifying the species represented by short (or long) DNA sequences, usually obtained through microbiome or metagenomic studies. The Kraken and Kraken2 projects are led by former Ph.D. student Derrick Wood.
- KrakenUniq is version of Kraken 1 that runs just as fast but also counts the number of unique k-mers using the stream sketching algorithm HyperLogLog. This allows the user to filter and rank results by the coverage of genomes, in addition to read counts. KrakenUniq was originally developed by Florian Breitwieser (here’s the paper). NEW (May 2022): a new release implemented by Christopher Pockrandt now allows KrakenUniq to run on small-memory machines, even with huge databases. Install it from github, or use bioconda here: https://anaconda.org/bioconda/krakenuniq.
- Bracken (Bayesian Reestimation of Abundance with KrakEN) is a statistical method that computes the abundance of species in DNA sequences from a metagenomics sample. Braken uses the taxonomy labels assigned by KrakenUniq or Kraken2 to estimate the number of reads originating from each species present in a sample. Bracken (here’s the paper, from 2017) was developed by Jennifer Lu.
- Pavian is a graphical viewer for exploring metagenomics classification results, with a special focus on infectious disease diagnosis. With Pavian, researchers can analyze, display and transform results from the Kraken and Centrifuge classifiers using interactive tables, heatmaps and flow diagrams. Pavian (code is here, journal paper here) was developed by Florian Breitwieser.
- Centrifuge is a very fast and memory-efficient system for metagenomic sequence analysis. It uses the Burrows-Wheeler transform (BWT) and the Ferragina-Manzini (FM) index to save memory. Centrifuge was developed by Daehwan Kim, Li Song, and Florian Breitwieser.
- Glimmer uses interpolated Markov models (IMMs) to find genes in microbial DNA. Used around the world for thousands of genomes. Originally developed by Art Delcher and Steven Salzberg.
- Phymm and PhymmBL, first released in 2009, are systems for classifying short DNA sequences from metagenomics projects, labelling them with their likely species name. Originally developed by Arthur Brady.
- JIGSAW, a program that predicts gene models using the output from multiple sources of evidence, including other gene finders, Blast searches, and other alignment data. Originally developed by Jonathan Allen.
- GlimmerHMM, an interpolated Markov Model system for finding genes in many eukaryotes, including P. falciparum, A. thaliana, rice (O. sativa), mosquito (A. aegypti), B. malayi, C. neoformans, and others. Originally developed by Mihaela Pertea.
- GeneZilla, a generalized HMM for eukaryotic gene finding developed by Bill Majoros, a former Salzberg lab member (when the lab was at TIGR), now a faculty member at Duke University.
- GeneSplicer, a fast system for detecting splice sites in genomic DNA of various eukaryotes. Originally developed by Mihaela Pertea.
Transcription terminators, operons, and motif analysis tools
- TransTermHP (updated in 2010), a program that finds rho-independent transcription terminators in bacterial genomes. Originally developed in 2000 by Maria Ermolaeva. Re-designed and re-implemented in 2007 by Carl Kingsford.
- OperonDB (update in progress, 2015), results from our operon-finding software on a large number of prokaryotic genomes. Described in OperonDB: a comprehensive database of predicted operons in microbial genomes (Pertea et al. 2009). Originally developed in 2001 by Maria Ermolaeva. Redesigned and re-implemented in 2008 by Mihaela Pertea.
- ELPH, a motif finder that can find ribosome binding sites, exon splicing enhancers, or regulatory sites. Original developer: Mihaela Pertea.
- SeeESE, an online tool for identifying exon splicing enhancers (ESEs) in Arabidopsis, Drosophila, and other species. Original developers: Mihaela Pertea and Steven Mount.
- Skewed oligomers from bacterial and archaeal genomes (described in Salzberg et al., Gene 217:1-2, 1998). Get the source code.
Machine learning systems, pre-1995 and pre-computational biology
- The OC1 decision tree system. Originally developed by S.K. Murthy.
- The PEBLS memory-based reasoning system. Originally developed by Scott Cost.