UNSW BABS Genome Project

Monday 18 June 2018

Sequencing snakes: the BABS Genomes at SBRS 2018

This month saw the first presentation of the BABS genomes outside UNSW. A poster on the genome assemblies was presented at the Sydney Bioinformatics Research Symposium 2018 (abstract below):

Pseudodiploid pseudo-long-read whole genome sequencing and assembly of Pseudonaja textilis (eastern brown snake) and Notechis scutatus (mainland tiger snake)

Richard J Edwards, Timothy G Amos, Joshua Tang, Beni Cawood, Sabrina Rispin, Daniel Enosi Tuipulotu & Paul Waters.

If you use anything from this work, please cite:

Edwards RJ et al. Pseudodiploid pseudo-long-read whole genome sequencing and assembly of Pseudonaja textilis (eastern brown snake) and Notechis scutatus (mainland tiger snake) [version 1; not peer reviewed]. F1000Research 2018, 7:753 (poster) (doi: 10.7490/f1000research.1115550.1)

Abstract

The precipitous drop in sequencing costs over recent years has seen the bottleneck in vertebrate whole genome sequencing (WGS) shift from data generation (sequencing) to data processing (assembly and annotation). Draft genomes generated from cheap shotgun Illumina sequencing tend to be highly fragmented with many tens of thousands of short contigs or scaffolds. This can be improved by preparing multiple paired end and “mate pair” libraries with different insert sizes, but this increases the cost of both sequencing and data storage/analysis. PacBio or Oxford Nanopore long read sequencing enables massive improvements in assembly quality but tends to be prohibitively expensive for organisms with large genome sizes, such as vertebrates. 10x Genomics Chromium “linked read” sequencing offers a solution to this problem. High molecular weight molecules of DNA are barcoded prior to standard shotgun Illumina sequencing. These barcodes can then be used for pseudo-long-read assembly, with improved handling of repetitive regions. Where heterozygous variants are dense enough, haplotypes can be phased to generate a “pseudodiploid” assembly with some regions represented as two alleles. This is all for the cost of an additional library prep with no extra sequencing. But does it work?

We have sequenced two of the deadliest venomous snakes in Australia using 10x Chromium linked reads: the mainland tiger snake (Notechis scutatus) and the eastern brown snake (Pseudonaja textilis). Supernova v2 assemblies of the data generated exceptionally high quality genomes for the price, with maximum scaffolds over 50 Mb and N50 values of 5.99 Mb for the tiger snake and 14.7 Mb for the brown snake. This was reflected in BUSCO (v2.0.1 short) completeness estimates of 87.3% (tiger snake) and 90.5% (brown snake). These data will be compared to tiger snake WGS using standard paired end Illumina NovaSeq shotgun sequencing, and discussed with respect to some of the downstream opportunities and challenges provided by pseudodiploid genome assemblies. In particular, BUSCO analysis of haploid, pseudodiploid, and non-redundant genome assemblies revealed some interesting and unexpected behaviour of this widely-used tool. We also present results from GenomeR, a Shiny app (in development) for batch kmer genome size estimation (http://shiny.slimsuite.unsw.edu.au/GenomeR/).

Snake genomes and ongoing annotation are being made available through the lab Web Apollo browser and search tool (https://slimsuite.unsw.edu.au/servers/apollo.php). We welcome contact from anyone interested in getting involved with the annotation and analysis of these genomes.

Watch this blog for more details on different aspects of the analysis.
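
As a footnote to the assembly statistics quoted in the abstract: N50 is the length of the scaffold at which, counting from the longest scaffold down, at least half of the total assembly length has been covered. The snippet below is a minimal sketch of that calculation on a made-up list of scaffold lengths. The function name and the toy numbers are purely illustrative and are not taken from the snake assemblies or from any assembler's own reporting code.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold (or contig) lengths.

    Scaffolds are sorted from longest to shortest and summed until the
    running total reaches half of the total assembly length; the length
    of the scaffold that crosses that point is the N50.
    """
    if not lengths:
        raise ValueError("need at least one sequence length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy scaffold lengths (illustrative only, arbitrary units).
toy_scaffolds = [50, 40, 30, 20, 10, 5, 5]
print(n50(toy_scaffolds))  # total is 160, so the 40-unit scaffold crosses 80
```

A higher N50 means more of the assembly sits in long scaffolds, which is why it is often reported alongside a completeness measure such as BUSCO.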

Sunday 5 November 2017

UNSW Genome Annotation workshop, Tuesday 21st November 2017

I am pleased to announce that we will be running a replacement for July’s cancelled Genome Annotation workshop at UNSW on Tuesday 21st November 2017, 1100-1400. This will use WebApollo, which is the genome annotation browser we will be using for community annotation of our snake genomes.

Places are limited but it’s free and you can sign up here through Eventbrite.

DESCRIPTION

This workshop will include a short background lecture on the fundamentals of gene prediction and genome annotation followed by a hands-on component where we will conduct manual curation exercises using Apollo.

The workshop has been organised by EMBL-ABR and will be led by Dr Monica Munoz-Torres from Phoenix Bioinformatics who is an expert in genome annotation, current chair of the International Society for Biocuration Executive Committee, and former Project Manager of the Apollo Project.

Monica will be joining us direct from the San Francisco Bay Area, and we will have locally trained trainers on hand to help facilitate the workshop.

TOPICS TO BE COVERED

Genome Annotation - why is it important?
Gene prediction
- what is a gene
- computation
- annotation
Genome curation
- knowledge
- curation - why is this necessary?
Structural Annotation using Apollo
Biological principles for curation with Apollo
Apollo functionality: step by step
Curation example

Requirements

Participants must bring their own eduroam-enabled laptop with either Chrome or Firefox installed.

Further information

https://www.embl-abr.org.au/genome-annotation-using-apollo-monica-munoz-torres/ or contact Richard Edwards.

Monday 28 August 2017

Where do our snakes come from?

The snakes we are sequencing for the BABS Genome project were kindly supplied by Nathan Dunstan at Venom Supplies as a collaborative contribution to Paul Waters and Denis O’Meally when they were at ANU. Thanks Nathan!

We have sequenced two tiger snake parents, originally caught in the southeast of South Australia (just north of Mt Gambier) in about 2004. They were bred at Venom Supplies, and we have also sequenced one of the babies (sex unknown) born in February 2013.

The brown snake was a female from a clutch of eggs from a gravid (pregnant) female caught locally in the Barossa.

Photo Credits

Tiger Snake (left): Teneche [CC BY-SA 3.0] | Brown Snake (right): Denis O'Meally.

Tuesday 15 August 2017

Linked read sequencing is go!

We already have over four billion reads and 620 GB of NovaSeq Illumina data for our three tiger snakes; next week’s BABS3291 prac will look at some of the early ABySS assemblies of one of these snakes.

Phase 2 of the sequencing is now go! 10x Chromium linked read libraries were prepared at the Ramaciotti Centre for Genomics last week for one tiger snake and one eastern brown snake. These data promise to make genome assembly much easier and the resulting assemblies more intact. We received notification today that the samples have arrived in the KCCG sequencing laboratory at the Garvan Institute for Illumina HiSeq X (“XTen”) sequencing.

Nobody knows how well linked read sequencing, which is optimised for human genomes, will work in a snake but we look forward to finding out!

Friday 4 August 2017

Important considerations for sample preparation

Today we had a tutorial on the things you need to think about during a genome sequencing project. The first student suggestion for sample selection and handling is good advice for life:

Thursday 3 August 2017

Sequencing technologies used for the BABS Genome

Sequencing for the BABS Genome is being performed at the Ramaciotti Centre for Genomics at UNSW, which is one of Australia’s top sequencing centres and has a long, rich history of genome sequencing.

The Gold Standard for genome assembly is currently to combine three technologies:

High coverage short read sequencing for accurate base calling of unique regions.
Long read sequencing for assembling complex and small repetitive regions of the genome.
Long range sequencing for scaffolding contigs across larger repetitive regions of the genome.

We will be using a combination of the latest technologies from each of these three categories for the BABS genome:

Illumina NovaSeq and HiSeq X

Short read Illumina sequencing is still the starting point for sequencing large (>0.5 Gb) genomes. Although it is impossible to assemble short read data alone into a high-quality genome, it remains the most cost-effective technology in terms of high-quality bases sequenced per dollar. Illumina sequencing struggles with regions of the genome with certain compositional bias and short read assembly fails at repetitive regions. Nevertheless, it is possible to get a useful assembly of a large portion of the “unique” genome, which includes most of the protein-coding genes.

For the 2017 BABS genome, we are using two of the latest - and most cost-effective - Illumina sequencing platforms: the HiSeq X (XTen) and the new NovaSeq. These machines have a phenomenal output per run. The NovaSeq is being used for pure Illumina sequencing, whereas the HiSeq X is being used for the sequencing component of the 10X Genomics Linked Read sequencing (below).

PacBio Sequel

Whole genome sequencing and assembly has been revolutionised by the development of long read sequencing technologies by Pacific Biosciences (PacBio) and Oxford Nanopore (MinION). With typical read lengths a hundred times longer than Illumina reads, long read sequencing enables resolution of many of the shorter repetitive regions in the genome.

Long read sequencing is still comparatively expensive and the budget does not stretch to a pure PacBio assembly this year. However, we will be getting some sequencing done on the new PacBio Sequel, which will help with scaffolding Illumina contigs. We also hope to be able to generate a pure PacBio mitochondrial genome; mitochondria are present in multiple copies per cell, which effectively increases the depth of coverage!

10X Genomics Chromium Linked Reads

Due to the cost (and DNA requirements) of long read sequencing, there has been considerable effort in recent years to combine cost-effective Illumina short read sequencing with additional experimental approaches to leverage long-range information. The long range service offered by the Ramaciotti Centre is 10X Genomics Chromium linked read sequencing. Unlike PacBio or MinION, this does not contiguously sequence a long DNA molecule. Instead, it uses a clever barcoding system to link short reads back to their DNA molecule of origin. 10X Genomics software then uses this linkage to regenerate pseudo-long-reads that can be used for both genome assembly and haplotype phasing.
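
To make the barcoding idea a little more concrete, here is a toy sketch of the first step: collecting short reads that share a barcode, and so are likely to come from the same long DNA molecule. It is a conceptual illustration only. The reads are made-up (barcode, sequence) pairs rather than real 10X data, the function name is hypothetical, and this is not how the 10X Genomics software itself is implemented.

```python
from collections import defaultdict


def group_reads_by_barcode(reads):
    """Group read sequences by barcode.

    `reads` is an iterable of (barcode, sequence) pairs. Reads with the
    same barcode are assumed to derive from the same high molecular
    weight DNA molecule (or a small pool of molecules).
    """
    groups = defaultdict(list)
    for barcode, sequence in reads:
        groups[barcode].append(sequence)
    return dict(groups)


# Made-up reads: short barcodes and sequences for illustration only.
toy_reads = [
    ("AAAC", "GATTACA"),
    ("GGTT", "CCGGAAT"),
    ("AAAC", "TTGACCA"),
    ("AAAC", "ACGTACG"),
    ("GGTT", "TGCATGC"),
]

linked = group_reads_by_barcode(toy_reads)
for barcode, sequences in sorted(linked.items()):
    print(barcode, len(sequences), sequences)
```

Each group of reads then acts as a sparse sample of one long molecule, which is the long-range information an assembler can use to bridge repeats and to phase haplotypes.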

Friday 28 July 2017

What would YOU do with six billion sequencing reads?

The Ramaciotti Centre for Genomics, where we get all the sequencing done for the BABS Genome, is holding a competition to win a full sequencing run on their new NovaSeq 6000. This is one of the technologies we are using for our snake genomes - in fact, our three tiger snakes were part of the very first sequencing run on the new machine.

The capacity of this thing is awesome. In addition to the three snake samples, we had three cane toads as part of the ongoing cane toad genome project and, for a control, sequenced one of our yeast strains about 10,000 times!

The Competition

NovaSeq Mini Grant – How would you use 3 billion reads?

To celebrate the opening of our new genomics facility we are pleased to announce a mini grant valued up to $28,000. Researchers with innovative, collaborative projects are invited to submit a 250-word application outlining how 3 billion reads can be utilised to advance their research. The winner will receive an Illumina NovaSeq 6000 S2 100bp PE run (up to 3.3B reads/660Gb), with heavily subsidised library construction. Submit your entry by completing an application form and emailing it to Nextgenseq@unsw.edu.au with the subject heading “NovaSeq mini grant”. Terms and conditions apply.
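
For anyone wondering how the run figures above fit together, the arithmetic can be sketched as below. One reading that reconciles "3.3B reads" with "660Gb" is that the 3.3 billion counts read pairs, each sequenced as 2 x 100 bp; that interpretation is an assumption on our part. The genome size used for the depth calculation is a purely hypothetical placeholder, not a measured snake genome size, and the function names are illustrative.

```python
def run_yield_bases(fragments, read_length, paired=True):
    """Total bases from a run: fragments x reads per fragment x read length."""
    reads_per_fragment = 2 if paired else 1
    return fragments * reads_per_fragment * read_length


def mean_depth(total_bases, genome_size):
    """Average depth of coverage if all bases map evenly across the genome."""
    return total_bases / genome_size


# Assumption: 3.3 billion paired-end fragments at 100 bp per read.
yield_bases = run_yield_bases(3_300_000_000, 100)
print(yield_bases / 1e9, "Gb")

# Hypothetical genome size for illustration only (1.5 Gb).
HYPOTHETICAL_GENOME_SIZE = 1_500_000_000
print(mean_depth(yield_bases, HYPOTHETICAL_GENOME_SIZE), "x mean depth")
```

In practice a run like this would usually be shared between several samples, so the depth per genome is the total divided by however many samples are pooled.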