introduction to genomes with ensembl - tufts...
Dr. Giulietta M. Spudich
Ensembl Outreach Team
Introduction to Genomes
with Ensembl
Objectives
What information about a gene can I find?
What about a region of the genome?
How do I navigate the data?
Introduction
1977: first genome to be sequenced (5 kb)
2004: finished human sequence (3 Gb)
Large amounts of raw DNA sequence data
Genome Sequencing
• Fragment → BAC clones
• Sequence → contigs
• Assemble → scaffolds
• Assemble → genome sequence
CGCCGGGAGAAGCGTGAGGGGACAGATTTGTGACCGGCGCGGTTTTTGTCAG
CTTACTCCGGCCAAAAAAGAACTGCACCTCTGGAGCGGACTTATTTACCAAGCA
TTGGAGGAATATCGTAGGTAAAAATGCCTATTGGATCCAAAGAGAGGCCAACATT
GCACTGCTGCGCCTCTGCTGCGCCTCGGGTGTCTTTTGCGGCGGTGGGTCGC
TTTTGAAATTTTTAAGACACGCTGCAACAAAGCAGATTTAGGACCAATAAGTCTT
AATTGGTTTGAAGAACTTTCTTCAGAAGCTCCACCCTATAATTCTGAACCTGCAG
ACTAAAATGGATCAAGCAGATGATGTTTCCTGTCCACTTCTAAATTCTTGTCTTAG
AAGAATCTGAACATAAAAACAACAATTACGAACCAAACCTATTTAAAACTCCACAA
AGGAAACCATCTTATAATCAGCTGGCTTCAACTCCAATAATATTCAAAGAGCAAG
GGCTGACTCTGCCGCTGTACCAATCTCCTGTAAAAGAATTAGATAAATTCAAATT
AGACTTAGGAAGGAATGTTCCCAATAGTAGACTAAAAGTCTTCGCACAGTGAAAT
TGAAAGTCCTGTTGTTCTACAATGTACACATGTAACACCACAAAGAGATAAGTCA
Genome sequence
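As a rough illustration of the "assemble" step, the Python sketch below merges overlapping fragments into one longer sequence. The function names, the minimum overlap of 5 bases, the fragment boundaries, and the assumption that fragments arrive already in genome order are all simplifications introduced here; real assemblers are far more involved.

```python
def overlap(left, right, min_len=5):
    """Return the length of the longest suffix of `left` that is a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def assemble(fragments, min_len=5):
    """Merge fragments left to right (assumes they are already in genome order)."""
    contig = fragments[0]
    for fragment in fragments[1:]:
        size = overlap(contig, fragment, min_len)
        if size == 0:
            raise ValueError("no overlap of at least %d bases" % min_len)
        # Append only the part of the fragment that extends the contig.
        contig += fragment[size:]
    return contig


# First line of raw sequence shown on the slide.
sequence = "CGCCGGGAGAAGCGTGAGGGGACAGATTTGTGACCGGCGCGGTTTTTGTCAG"
# Hypothetical overlapping fragments cut from it (boundaries chosen arbitrarily).
fragments = [sequence[0:20], sequence[12:35], sequence[28:]]
print(assemble(fragments) == sequence)
```

The same pairwise merge applies at each level of the slide's workflow: reads into contigs, contigs into scaffolds.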
21 May 2012
The Ensembl genome browser: making it interesting
• Gene: splice variants, proteins, non-coding RNA
• Allele: small and large scale sequence variation, phenotype associations
• Conserved sequence: whole genome alignments, protein trees
• Regulation: potential promoters and enhancers, DNA methylation
• User upload, custom data
Figure adapted from the ENCODE project www.nature.com/nature/focus/encode/
Genome Browsers
• Ensembl Genome Browsers
http://www.ensemblgenomes.org
• NCBI Map Viewer
http://www.ncbi.nlm.nih.gov/mapview/
• UCSC Genome Browser
http://genome.ucsc.edu
Ensembl is Used Worldwide
Top users:
UK
US
Canada
China
France
Germany
Italy
Japan
Spain
Data Volume Challenge
• UniProtKB/Swiss-Prot (reviewed)
536,029 (25,871 human) protein sequences
• UniProtKB/TrEMBL
22,128,511 (217,918 human)
www.uniprot.org
• NCBI RefSeq (reviewed)
15,744,232 (24,539 human)
Accessions shown: NP_006570, NM_006579 (RefSeq); Q8IU82 (UniProt)
A consensus set of protein coding sequences (CCDS)
• Reaching a consensus coding sequence set for human and mouse.
• 26,473 (human), 22,187 (mouse) (as of Sept 2011)
• If you see a “CCDS ID”, the coding sequence is agreed upon.
Genome Res. 2009 Jul;19(7):1316-23. Epub 2009 Jun 4
What are the gold transcripts?
Figure legend: UTR, coding, intron
Ensembl
• Automatic annotation pipeline: gene building all at once (whole genome)
VEGA/Havana (human, mouse, zebrafish)
• Manual curation: reviewed by experts
• VEGA: Vertebrate Genome Annotation
Genes and Transcripts in Ensembl
High Quality:
• CCDS transcripts
• Ensembl/Havana merged (gold) transcripts
Ensembl/Havana
• Transcripts are from: Ensembl, Havana, or Ensembl/Havana (merged)
• Figure labels: Ensembl (20_), Havana (00_), both (“gold”)
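Assuming the 20_ and 00_ labels refer to the numeric suffix of a transcript name (for example a hypothetical GENE-201 or GENE-001), a minimal Python sketch for reading the annotation source off a name could look like this. The function name and the example names are made up for illustration, and the suffix alone is not taken to say whether a transcript is a merged “gold” one.

```python
def annotation_source(transcript_name):
    """Guess the annotation source from a name such as 'GENE-201' (assumed format)."""
    gene, dash, number = transcript_name.rpartition("-")
    if not dash or not gene or not number.isdigit():
        raise ValueError("expected a name like GENE-001, got %r" % transcript_name)
    if number.startswith("20"):
        return "Ensembl"   # slide label: Ensembl (20_)
    if number.startswith("00"):
        return "Havana"    # slide label: Havana (00_)
    return "unknown"       # anything else is outside what the slide shows


for name in ["GENE-001", "GENE-201", "GENE-401"]:
    print(name, annotation_source(name))
```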
Gene Names in Ensembl
• ENSG### Ensembl Gene ID
• ENST### Ensembl Transcript ID
• ENSP### Ensembl Peptide ID
• ENSE### Ensembl Exon ID
• For non-human species a species code is added after ENS:
MUS (Mus musculus) for mouse: ENSMUSG###
DAR (Danio rerio) for zebrafish: ENSDARG###
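The naming scheme above is regular enough to take apart with a short script. The Python sketch below splits an identifier into species code, feature type and number. The function name, the lookup tables and the example numbers are illustrative assumptions; only the two species codes named on the slide are included.

```python
import re

# Feature-type letters listed on the slide.
FEATURE_TYPES = {"G": "gene", "T": "transcript", "P": "peptide", "E": "exon"}

# Species codes from the slide; human IDs carry no code.
SPECIES_CODES = {"": "human", "MUS": "mouse (Mus musculus)", "DAR": "zebrafish (Danio rerio)"}

# ENS + optional species code + one feature letter + digits.
STABLE_ID = re.compile(r"^ENS([A-Z]*?)([GTPE])(\d+)$")


def parse_stable_id(stable_id):
    """Split an Ensembl-style identifier into species, feature type and number."""
    match = STABLE_ID.match(stable_id)
    if match is None:
        raise ValueError("not an Ensembl-style identifier: %r" % stable_id)
    code, letter, number = match.groups()
    return {
        "species": SPECIES_CODES.get(code, "unknown species code " + code),
        "feature": FEATURE_TYPES[letter],
        "number": number,
    }


# Example identifiers in the slide's format (numbers are placeholders).
for example in ["ENSG00000000001", "ENSMUSG00000000001", "ENSDART00000000001"]:
    print(example, parse_stable_id(example))
```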
Ensembl Features
• The gene set.
• Comparative analysis
• Variation and regulation
• BioMart (data export)
• Display of external data (DAS)
• Programmatic access via the Perl API
• Open Source
Objectives
What information about a gene can I find?
What about a region of the genome?
How do I navigate the data?
See our coursebook for walk-throughs and
exercises using our browser:
http://www.ensembl.org/info/website/tutorials/coursebook.pdf
Variation
• Nucleotide level
   • Single nucleotide polymorphisms (SNPs)
   • Small insertions and deletions (InDels)
   • Microsatellites (short tandem repeats)
• Structural
   • Copy number variations (CNVs)
   • Large insertions and deletions
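To make the size-based classes concrete, here is a small Python sketch that labels a variant from its reference and alternate alleles. The function name, the 50-base cutoff between "small" and "large", and the "substitution" label are assumptions made for this example, not Ensembl definitions. Microsatellites and copy number variations need more context than two allele strings, so they are not covered.

```python
def classify_variant(ref, alt, large_threshold=50):
    """Label a variant by comparing reference and alternate allele lengths."""
    if ref == alt:
        return "no change"
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    size = abs(len(ref) - len(alt))
    if size == 0:
        # Same length, several bases changed: not one of the slide's classes.
        return "substitution"
    kind = "insertion" if len(alt) > len(ref) else "deletion"
    scale = "large" if size >= large_threshold else "small"
    return scale + " " + kind


print(classify_variant("A", "G"))
print(classify_variant("A", "ATT"))
print(classify_variant("A" * 101, "A"))
```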
Sequence displays
Gene: Sequence
Transcript: Exons
Transcript: cDNA
Comparative Genomics
69 species in Ensembl release 67 (e!67)
Ensembl tools
Phenotype for a gene
How is all this information organised?
• Ensembl Views (Website)
• Ensembl Database (open source)
• BioMart “data mining tool”
Help and documentation
• Comments and questions?
• Mailing lists [email protected], [email protected]
• Course online www.ensembl.info/ecourse
• Our tutorials page www.ensembl.org/info/website/tutorials
• YouTube channel www.youtube.com/user/EnsemblHelpdesk
Follow us
• Facebook www.facebook.com/Ensembl.org
• Twitter https://twitter.com/Ensembl
• Come visit our blog! www.ensembl.info
Publications
• Flicek, P. et al. Ensembl 2012. Nucleic Acids Res 40:D84-90 (2012) http://nar.oxfordjournals.org/content/40/D1/D84.long
• Xosé M. Fernández-Suárez and Michael K. Schuster Using the Ensembl Genome Server to Browse Genomic Sequence Data. Current Protocols in Bioinformatics 1.15.1-1.15.48 (2010) www.ncbi.nlm.nih.gov/pubmed/20521244
• Giulietta M Spudich and Xosé M Fernández-Suárez Touring Ensembl: A practical guide to genome browsing BMC Genomics 11:295 (2010) www.biomedcentral.com/1471-2164/11/295
http://www.ensembl.org/info/about/publications.html
Ensembl Paul Flicek (EBI), Steve Searle (Wellcome Trust Sanger Institute)
Software Andy Yates, Stephen Keenan, Monika Komorowska, Rhoda Kinsella, Thomas Maurel, Kieron Taylor
Comparative Genomics
Javier Herrero, Kathryn Beal, Stephen Fitzgerald, Leo Gordon, Matthieu Muffato, Miguel Pignatelli
Regulation Ian Dunham, Ikhlak Ahmed, Nathan Johnson, Thomas Juettemann, Steven Wilder
Variation Fiona Cunningham, Laurent Gil, Sarah Hunt, Will McLaren, Graham Ritchie, Anja Thormann
Analysis and Annotation
Bronwen Aken, Amonida Zadissa, Dan Barrell, Susan Fairley, Carlos García Girón, Thibaut Hourlier, Andreas Kähäri, Rishi Nag, Magali Ruffier, Simon White
Web Team Anne Parker, Ridwan Amode, Simon Brent, Bethan Pritchard, Harpreet Riat, Dan Sheppard, Steve Trevanion
Outreach Giulietta M. Spudich, Jeff Almeida-King, Denise Carvalho-Silva, Bert Overduin, Michael Schuster
Ensembl Genomes
Paul Kersey, Paul Derwent, Jay Humphrey, Arnaud Kerhornou, Eugene Kulesha, Nick Langridge, Uma Maheswari, Mark McDowall, Michael Nuhn, Helder Pedro, Claudia Rato da Silva, Dan Staines, Iliana Toneva
Ensembl Strategy
Ewan Birney, Richard Durbin, Paul Flicek, Jen Harrow, Tim Hubbard, Glenn Proctor, Steve Searle
Ensembl Team