Software

Argonaut: Genome Assembly

Emily Trybulec 

EASEL: Structural Annotation

Cynthia Webster, Karl Fetter, Sumaira Zaman, Vidya Vuruputoor, Akriti Bhattarai, Vikesh Chinta, Jill Wegrzyn

EnTAP: Functional Annotation

Cynthia Webster, Alex Hart, Vidya Vuruputoor, Vikesh Chinta, Sumaira Zaman, Akriti Bhattarai, Karl Fetter, Jill Wegrzyn

Argonaut streamlines the steps of genome assembly in a single Nextflow workflow: short- and long-read quality control, genome size estimation, genome assembly, polishing, redundancy reduction, assembly-to-assembly scaffolding, and assembly quality assessment. Given the variability of sequence inputs, depth, and genome complexity, Argonaut is designed with flexibility in mind, including the ability to modify read input types and assembly tools. It is compatible with both long (ONT and PacBio HiFi) and short (Illumina) reads, and supports hybrid assembly. Argonaut uniquely provides users with significant adjustability within the framework of current best practices for genome assembly. The workflow produces well-documented products at each stage, including decontaminated reads, genome size estimates, coverage estimates, up to six de novo assemblies (Flye, Canu, Hifiasm, MaSuRCA for short and hybrid assembly, and Redundans), and assembly quality statistics (BUSCO, QUAST, Merqury).
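As a rough illustration of how a coverage estimate relates to a genome size estimate, the sketch below divides the total number of sequenced bases by the estimated genome size. This is the standard definition of sequencing depth, not Argonaut's own code; the function name `estimate_coverage` and its inputs are hypothetical.

```python
def estimate_coverage(read_lengths, genome_size_bp):
    """Return approximate sequencing depth (x-fold coverage).

    read_lengths   -- iterable of read lengths in bases (hypothetical input)
    genome_size_bp -- estimated genome size in bases (hypothetical input)
    """
    if genome_size_bp <= 0:
        raise ValueError("genome_size_bp must be positive")
    # Depth = total sequenced bases / genome size
    return sum(read_lengths) / genome_size_bp


# Example: 300 reads of 10 kb each against a 100 kb genome
reads = [10_000] * 300
coverage = estimate_coverage(reads, 100_000)
print(f"{coverage:.1f}x")
```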
EASEL is an open-source genome annotation tool that leverages machine learning, RNA folding, and functional annotations to enhance gene prediction accuracy. It aligns high-throughput short-read data (RNA-Seq) and assembles putative transcripts via StringTie2 and PsiCLASS. Complete open reading frames are subsequently predicted through TransDecoder using a gene family database (EggNOG), and coding region hints are generated. Gene models are independently used to train AUGUSTUS, and the resulting predictions are combined into a single gene set using AGAT. Predicted gene structures are filtered by primary and secondary features (RNA folding structure, free energy, primary sequences, EggNOG protein homology, OrthoDB alignments, RNA expression, exon number, length, and prediction overlap) with a random forest algorithm and a clade-specific training set. Transcripts that have a predicted translation initiation site and score above a set F1 threshold are retained and functionally annotated with EnTAP.
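The final retention rule can be sketched as follows. This is a minimal illustration, not EASEL's implementation: the `Transcript` fields, the `retain` function, and the 0.8 cutoff are all hypothetical, and the description above does not specify how the score or threshold is derived. The `f1` helper is included only as a reminder of the standard F1 definition (the harmonic mean of precision and recall).

```python
from dataclasses import dataclass


def f1(tp, fp, fn):
    """Standard F1 score from true positives, false positives, false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


@dataclass
class Transcript:
    name: str
    has_start_site: bool  # predicted translation initiation site (hypothetical field)
    score: float          # classifier-derived score (hypothetical field)


def retain(transcripts, threshold):
    """Keep transcripts with a predicted start site and a score above the threshold."""
    return [t for t in transcripts if t.has_start_site and t.score > threshold]


candidates = [
    Transcript("t1", True, 0.92),
    Transcript("t2", False, 0.95),  # dropped: no predicted start site
    Transcript("t3", True, 0.40),   # dropped: below threshold
]
kept = retain(candidates, threshold=0.8)  # 0.8 is an arbitrary example value
print([t.name for t in kept])
```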
EnTAP is designed to improve the accuracy, speed, and flexibility of functional gene annotation. EnTAP integrates taxonomic scope to optimize the selection of the most appropriate descriptors, as well as to filter for contaminants common in transcriptomes and genomes. EnTAP addresses the challenges associated with de novo transcriptome assembly that lead to inflated and inaccurate transcripts through target transcript coverage (RSEM expression estimates) and the prediction of viable open reading frames (TransDecoder). After filtering by expression and frame selection, translated proteins are compared against up to three protein databases and independently assigned to gene families via EggNOG. Sequence similarity search is performed with DIAMOND for rapid assessment, and Gene Ontology terms are assigned from a combination of high-quality UniProt alignments, when available, and EggNOG. EnTAP can process header information from both EBI and NCBI databases to aid in selecting a single alignment based on a combination of weighted metrics describing similarity search score, taxonomic relationship, and informativeness.
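The idea of selecting a single alignment from weighted metrics can be sketched as below. This is an illustrative assumption, not EnTAP's actual scoring: the `Hit` fields, the weights, and the list of uninformative terms are all hypothetical, chosen only to show how similarity, taxonomic relationship, and informativeness might be combined into one ranking.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    subject: str
    bit_score: float    # similarity search score
    shares_taxon: bool  # subject falls within the target taxonomic scope
    description: str    # text parsed from the database header


# Hypothetical terms that mark a description as uninformative
UNINFORMATIVE = ("hypothetical", "uncharacterized", "unknown", "predicted protein")


def is_informative(description):
    text = description.lower()
    return not any(term in text for term in UNINFORMATIVE)


def weighted_score(hit, max_bit, w_sim=0.5, w_tax=0.3, w_info=0.2):
    """Combine the three metrics; the weights are arbitrary example values."""
    similarity = hit.bit_score / max_bit if max_bit > 0 else 0.0
    return (w_sim * similarity
            + w_tax * float(hit.shares_taxon)
            + w_info * float(is_informative(hit.description)))


def select_best_hit(hits):
    """Return the highest-scoring hit, or None if there are no hits."""
    if not hits:
        return None
    max_bit = max(h.bit_score for h in hits)
    return max(hits, key=lambda h: weighted_score(h, max_bit))


hits = [
    Hit("A", 500.0, False, "hypothetical protein"),
    Hit("B", 450.0, True, "Cellulose synthase A"),
]
best = select_best_hit(hits)
print(best.subject)
```

In this example the hit with the slightly lower similarity score is preferred because it is taxonomically closer and carries an informative description.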