TRANSCRIPT
From the Human Genome Project to today: evolution of sequencing techniques, genomic and proteomic analysis, and future prospects
David Horner, Dipartimento di Bioscienze, Università degli Studi di Milano
How is DNA sequenced?
• Sanger sequencing (1978 to today):
 - Relatively high costs
 - Sample preparation takes a long time
 - Produces few, LONG reads (1000 nt)
 - Few sequencing errors
Sanger sequencing (1978)
Genome
1) Fragment the genome "at random" and clone the fragments into plasmids
2) Sequence one fragment (chosen at random)
3) Find an overlapping clone, sequence it, and build a longer fragment
4) Go back to step 2 (until the end!)
Genomes: how big are they?
[Figure: genome size ranges on a logarithmic axis from 10^4 to 10^11, for viruses, plasmids, bacteria, fungi, plants, algae, insects, mollusks, reptiles, birds, mammals, bony fish, and amphibians]
Sanger sequencing (1990s)
96 reactions in parallel, 1000 nt per reaction
Robots!
1981 • Sinclair ZX-81
Computers
Whole Genome Shotgun Approach
Assembly by overlap
Repeated sequences
Unique sequences
Repeated sequences
If the repeated sequences are shorter than the sequencing reads, there is no problem
A B C
A B C
Repeated sequences
If they are longer than the reads, WE CANNOT ASSEMBLE!
A B C ?
A C B ?
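The A B C / A C B ambiguity can be checked on a toy example. The sketch below is only an illustration under stated assumptions: the segment sequences, the repeat `R`, and the extra flanking segment `D` are invented for the demonstration and do not come from the slides. It builds two genomes that differ only in the order of B and C between copies of the same repeat, and compares the full set of reads of a given length that each genome can produce.

```python
from collections import Counter

def read_spectrum(genome, read_len):
    """Multiset of every possible error-free read of length read_len."""
    return Counter(genome[i:i + read_len]
                   for i in range(len(genome) - read_len + 1))

# Hypothetical unique segments and repeat (not taken from the slides).
A, B, C, D = "ACGTAC", "TTGCA", "CAGGT", "GATCC"
R = "TCTAGAGCTA"  # the repeat, 10 nt

genome_abc = A + R + B + R + C + R + D  # order A B C
genome_acb = A + R + C + R + B + R + D  # order A C B

# Reads that cannot span the repeat plus one base on each side.
short_reads = len(R) + 1
# Reads just long enough to bridge a whole repeat copy.
long_reads = len(R) + 2

same_with_short = (read_spectrum(genome_abc, short_reads)
                   == read_spectrum(genome_acb, short_reads))
same_with_long = (read_spectrum(genome_abc, long_reads)
                  == read_spectrum(genome_acb, long_reads))

print("identical read sets with short reads:", same_with_short)
print("identical read sets with long reads: ", same_with_long)
```

With the short reads both genomes yield exactly the same reads, so no assembler could tell the two orders apart; once a read can bridge an entire repeat copy, the read sets differ and the order becomes recoverable.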
Steps to Assemble a Genome
1. Find overlapping reads
2. Merge some “good” pairs of reads into longer contigs
3. Link contigs to form supercontigs
4. Derive consensus sequence ..ACGATTACAATAGGTT..
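Finding overlaps and merging reads into contigs can be sketched with a toy greedy assembler. This is a minimal illustration, not the method used by any particular assembler: the function names, the `min_len` threshold, and the three example reads (cut by hand from the consensus string shown on the slide) are assumptions. It ignores sequencing errors, reverse complements, reads contained inside other reads, and the mate-pair scaffolding step.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that is a prefix of b (0 if < min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs)
                   if k not in (best_i, best_j)] + [merged]

# Hypothetical reads covering the slide's example consensus.
reads = ["ACGATTACA", "TTACAATAG", "AATAGGTT"]
print(greedy_assemble(reads))  # ['ACGATTACAATAGGTT']
```

Greedy merging is easy to follow but is exactly the strategy that breaks down on repeats, since a repeat creates overlaps between reads that do not come from the same place in the genome.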
Some Terminology
read: a 500-900 long word that comes out of the sequencer
mate pair: a pair of reads from the two ends of the same insert fragment
contig: a contiguous sequence formed by several overlapping reads with no gaps
supercontig (scaffold): an ordered and oriented set of contigs, usually linked by mate pairs
consensus sequence: the sequence derived from the multiple alignment of the reads in a contig
Contigs and scaffolds
1. Genome fragmentation
2. Library
3. Sequences
4. Genome assembly by overlap
Shotgun Sequencing
Timeline
Meet Your Genome
(The wheat genome (16.9 Gbp) is more than 5 times bigger than the human genome, and 80% of it consists of repetitive sequences)
The Human Genome
How COMPLEX is the genome?
The human genome: c. 3 Gb
Physical Mapping
Top down sequencing
1. Genome fragmentation
2. Physical map
3. Subclone library
4. Sequence clones by walking or by SHOTGUN strategy
Human Genome Project 16/02/2001
OK, we have sequenced the genome... Now what?
Where are the genes? Sequence cDNA (mRNA) and align it to the genome
But which genes/alleles are responsible for phenotypes of interest?
We need to compare the genomes of many different individuals and use statistics to understand complex phenotypes. That is, we need to sequence MANY individuals of the same species and associate phenotypes with genotypes: Genome Wide Association Studies (GWAS)
“GWAS” + “Human” in the literature: before 2004, 60 articles; from 2004 onwards, >14000 articles. More than 10000 human genomes have been sequenced since 2004.
How was it done?
Revolutionary techniques in molecular genetics
Molecular cloning
Sanger sequencing
PCR
Gel electrophoresis
Blotting (Southern/Northern/Western etc.)
Expression cloning (microarrays)
Next Generation Sequencing
• (Massively Parallel / Second Generation)
• HIGH throughput (lots of data)
• Relatively low cost
• Transversal in terms of application
Read Length is Not As Important For Resequencing
[Figure: % of paired K-mers with uniquely assignable location (0% to 100%) versus length of K-mer reads (8 to 20 bp), for E. coli and human]
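The quantity plotted in this figure can be approximated on any sequence by counting how many k-mers occur only once. The sketch below is a simplified illustration: it uses single (unpaired) k-mers, exact matching, and an invented toy sequence, so it shows the trend rather than reproducing the figure. The function name is hypothetical.

```python
from collections import Counter

def unique_kmer_fraction(sequence, k):
    """Fraction of k-mer positions whose k-mer occurs exactly once in sequence."""
    if k < 1 or k > len(sequence):
        raise ValueError("k must be between 1 and len(sequence)")
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    counts = Counter(kmers)
    unique = sum(1 for kmer in kmers if counts[kmer] == 1)
    return unique / len(kmers)

# Toy sequence containing a short internal repeat (illustrative only).
toy_genome = "ACGTACGT"
for k in (2, 4, 5):
    print(k, unique_kmer_fraction(toy_genome, k))
```

Even on this tiny example the uniquely placeable fraction rises as k grows, which is why fairly short reads can still be assigned to a single location when a reference genome is available.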
Cost per megabase of DNA sequence
Next-Generation Sequencing
Illumina / Solexa Genetic Analyzer, HiSeq 2000 (150x2 bp, 600 Gb / run)
Applied Biosystems SOLiD 4 System™ (100x2 bp, 400 Gb / run)
Roche / 454 Genome Sequencer FLX Titanium (800 bp, 800 Mb / run)
Ion Proton
PacBio
A number of platforms using different strategies and chemistries, and with different throughput are entering the market.
Fold coverage   % sequenced
0.25            22
0.5             39
0.75            53
1               63
2               87.5
3               95
4               98.2
5               99.4
6               99.75
7               99.91
8               99.97
9               99.99
10              99.995
When has a genome been fully sequenced?
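The coverage figures are close to what a simple Poisson model predicts: if reads land at random, the chance that a given base is never hit at fold coverage c is e^(-c), so the expected fraction sequenced is 1 - e^(-c). The slides do not say how the table was derived, so treating it as a Poisson approximation is an assumption; the function name below is likewise hypothetical.

```python
import math

def fraction_sequenced(fold_coverage):
    """Expected fraction of bases covered at least once (Poisson approximation)."""
    return 1.0 - math.exp(-fold_coverage)

for c in (0.25, 0.5, 0.75, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10):
    print(f"{c:>5}x  {100 * fraction_sequenced(c):.3f}%")
```

Most rows of the table match this formula after rounding. At 2x and 5x the formula gives roughly 86.5% and 99.3%, slightly below the tabulated 87.5 and 99.4. Either way the conclusion is the same: coverage approaches 100% only asymptotically, so a genome is never "fully" sequenced by random sampling alone.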
Illumina
• Bridge PCR
• Sequencing by synthesis using fluorescent reversible terminators
Technology Overview: Solexa/Illumina Sequencing
http://www.illumina.com/
Immobilize DNA to Surface
Source: www.illumina.com
Technology Overview: Solexa Sequencing
Bridge PCR
• DNA fragments are flanked with adaptors.
• A flat surface is coated with two types of primers, corresponding to the adaptors.
• Amplification proceeds in cycles, with one end of each bridge tethered to the surface.
• Used by Solexa.
Sequence Colonies
The bases are “reversible terminators”: only one base can be added per cycle. They are then modified so that the next round of extension can occur.
Sequence Colonies
Each base has a different Fluor (color). Excited by laser, and color is read.
Illumina sequencers: sequencing-by-synthesis coupled with bridge amplification
Available versions:
§ HiSeq 2000 (up to 600 Gb, 250x2 bp reads)
§ HiSeq 1000 (up to 300 Gb, 250x2 bp reads)
§ Genome Analyzer (up to 95 Gb, 150x2 bp reads)
§ MiSeq platform (up to 6 Gb, 250x2 bp reads)
Since 2008
SNP calling
• The basic principle is simple
• This looks like a homozygous SNP
ACTTTTGCCCTGTGTCTAAAATGCGTCGTAGCATGT   reference
ACTTTTGCCCTGTGACTAAAATG                read1
    TTGCCCTGTGACTAAAATGCGT             read2
     TGCCCTGTGACTAAAATGCGTA            read3
      GCCCTGTGACTAAAATGCGTAG           read4
      GCCCTGTGACTAAAATGCGTAG           read5
        CCTGTGACTAAAATGCGTAGTAG        read6
SNP calling
• And this one looks heterozygous
ACTTTTGCCCTGTGTCTAAAATGCGTCGTAGCATGT   reference
ACTTTTGCCCTGTGACTAAAATG                read1
    TTGCCCTGTGTCTAAAATGCGT             read2
     TGCCCTGTGACTAAAATGCGTA            read3
      GCCCTGTGTCTAAAATGCGTAG           read4
      GCCCTGTGACTAAAATGCGTAG           read5
        CCTGTGTCTAAAATGCGTAGTAG        read6
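The "basic principle" in these two slides amounts to stacking the aligned reads over the reference and counting bases column by column. The sketch below is a deliberately naive illustration, not how production variant callers work: the function names, the depth and allele-fraction thresholds, and the read offsets are assumptions. The reads are the slide's reads trimmed to the first 25 reference bases, to keep the example focused on the one site the slides highlight.

```python
from collections import Counter

def pileup(reads):
    """reads: list of (offset, sequence). Returns {ref_position: Counter of bases}."""
    columns = {}
    for offset, seq in reads:
        for i, base in enumerate(seq):
            columns.setdefault(offset + i, Counter())[base] += 1
    return columns

def call_snps(reference, reads, min_depth=4, hom_fraction=0.9, het_fraction=0.3):
    """Naive genotype calls; all three thresholds are arbitrary illustrative values."""
    calls = {}
    for pos, counts in sorted(pileup(reads).items()):
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # too few reads to say anything
        ref_base = reference[pos]
        for base, n in counts.items():
            if base == ref_base:
                continue
            fraction = n / depth
            if fraction >= hom_fraction:
                calls[pos] = (ref_base, base, "homozygous")
            elif fraction >= het_fraction:
                calls[pos] = (ref_base, base, "heterozygous")
    return calls

REFERENCE = "ACTTTTGCCCTGTGTCTAAAATGCG"  # first 25 bases of the slide's reference

hom_reads = [(0, "ACTTTTGCCCTGTGACTAAAATG"), (4, "TTGCCCTGTGACTAAAATGCG"),
             (5, "TGCCCTGTGACTAAAATGCG"), (6, "GCCCTGTGACTAAAATGCG"),
             (6, "GCCCTGTGACTAAAATGCG"), (8, "CCTGTGACTAAAATGCG")]
het_reads = [(0, "ACTTTTGCCCTGTGACTAAAATG"), (4, "TTGCCCTGTGTCTAAAATGCG"),
             (5, "TGCCCTGTGACTAAAATGCG"), (6, "GCCCTGTGTCTAAAATGCG"),
             (6, "GCCCTGTGACTAAAATGCG"), (8, "CCTGTGTCTAAAATGCG")]

print(call_snps(REFERENCE, hom_reads))
print(call_snps(REFERENCE, het_reads))
```

Position 14 (0-based) is reported as a T to A change in both cases: all six reads carry A in the first set, while only half of them do in the second. Real callers also weigh base qualities, mapping qualities, and sequencing error rates, which this sketch ignores.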
On average, we expect to find a SNP (Single Nucleotide Polymorphism) between 2 human individuals about every 2000 bases.
That is 99.95% identity: across a genome of c. 3 Gb,
3,000,000,000 / 2,000 gives maybe 1,500,000 differences!
Structural Variation (SV)
• Any DNA sequence alteration other than a single nucleotide substitution:
 - copy number variations (CNV)
 - transposon movement
 - expansion of trinucleotide and other simple repeats
 - insertions-deletions (indels)
 - translocations
 - inversions
• The vast majority of SV events are small indels
• Human genomes differ more as a consequence of structural variation than of single-base-pair differences*
 - Causal events in hereditary diseases
 - Somatic SV
 - Markers for GWAS / mapping studies
Copy Number Variation (CNVs)
so... how representative is the reference genome?
Applications of NGS platforms
• DNA sequencing
 - Genome resequencing (SNPs, CNV, GWAS)
 - De novo sequencing
 - Identification of genome structural variants (cancer genome)
 - 3D chromatin interactions
 - Epigenomics (chromatin state and genome methylation)
 - Metagenomics (taxonomic analysis of environmental samples)
• RNA sequencing
 - Qualitative and quantitative analysis of the transcriptome
 - Identification and characterization of miRNAs and other ncRNAs
 - RNA editing
 - Metatranscriptomics (functional analysis of environmental samples)