Schedule
10-feb, 9.00-12.00 (A1.14): Chapter 5, Genetic Variation, SNPs, Haplotype, Genetic Distance. No computer practicum.
11-feb, 9.00-11.30 (C1.112): Score matrices. Computer exercises: 13.30-15.30 (B1.24H&L).
15-feb, 9.00-11.00 (A1.10): Chapter 6, Natural Selection / Ka/Ks / HIV. No computer practicum.
16-feb, 9.00-11.00 (A1.06): Chapter 7, Phylogenetics / SARS. Computer exercises: 11.30-13.30 (B1.24H&L).
Mini projects: literature and computation study
Aim: to further demonstrate how bioinformatics is used in scientific research, we defined a complementary track in which you will perform (a) a literature study on a selected topic and (b) computations related to that topic.
Literature report: about 3000 words, in English.
Computational exercise: programming language of choice (e.g., Perl, Matlab).
Example topics for literature study
We selected several topics that you may choose for your literature study and computations. You are free to propose another topic, but it has to be approved first (please contact Antoine van Kampen).
Before you start: Contact Antoine van Kampen,
[email protected]: April 1, 2010
Course material
http://www.bioinformaticslaboratory.nl
EDUCATION
BIOMEDICAL SCIENCES
Chapter 5
Are Neanderthals among us?
Variation within and between species
Prof. dr. Antoine van Kampen
Biosystems Data Analysis
Swammerdam Institute for Life Sciences
Bioinformatics Laboratory
Academic Medical Centre
Neanderthal Man
Skeleton discovered in 1856 in Germany (Neander Thal)
Skeleton dated to about 44 thousand years ago
Unusual features of skeleton
Belongs to ancient species of hominid: any member of the biological family Hominidae (the "great apes")
Biologically different from modern humans
First reconstruction of Neanderthal Man
Are Neanderthals our ancestors?
Are modern Europeans the offspring of Neanderthals? Has recently been settled by genetic analysis
Human evolution and spread
Consider dynamic nature of DNA
Determine how DNA sequences change over time
Use this information to infer the history and function of different parts of the genome
Variation data is also used in
• Medical research
• Forensics
• Genome annotation
Variation in DNA sequences 1
Questions about human origins can be answered
• Exploit the fact that every individual has a slightly different genome sequence
• Differences within and between species
• Even siblings with the same parents have differences
Variation in DNA accumulates via
• Mutations (mistakes made by the cellular machinery that are then encoded in the genome)
  The cell's proof-reading machinery is very good
  Estimate: one mistake for every 2 million to 1 billion bases
• Recombination (when the organism is diploid)
Variation in DNA sequences 2
Mutation rates
• differ between organisms
• differ between the mitochondrial and the nuclear genome
• In most animals the rate in the mitochondrial genome is higher than in the nuclear genome
New mutations at any one nucleotide position are rare
Most genetic differences between individuals are inherited mutations and not newly arisen variants
• exploit this fact to study the history of species
• shared mutations are indicative of shared ancestry
Germline mutations
Mutations occur at every cell duplication
• the genome is replicated each time
Creating a fully grown human: trillions of cells
• each of which dies off and is replaced multiple times during a lifetime
• This is one reason that cancer is an illness of the elderly
Mutations in skin cells or heart-muscle cells are not passed on to our offspring
Only mutations that occur in the germline cells have a chance of spreading through the population
• There are exceptions, such as plants
Geneticists classify animal cells into two types
• Germ-line cells: cells that give rise to gametes such as eggs and sperm
• Somatic cells: all other cells
Germ-line mutations are those that occur directly in a sperm or egg cell, or in one of their precursor cells
Somatic mutations are those that occur directly in a body cell, or in one of its precursor cells
Mutations Can Occur in Germ-Line or Somatic Cells
Germ-line mutation: the mutation can be passed on to future generations
Somatic mutation: the size of the affected patch depends on the timing of the mutation (the earlier the mutation, the larger the patch); the mutation cannot be passed on to future generations
Mutations
• Neutral: no effect (this might still be a non-synonymous mutation!)
• Deleterious: disrupts some biological function
• Advantageous: improves some biological function
If the mutation is not passed on to a child then it is lost
Polymorphism: any difference among individuals at a specific position in the genome (regardless of frequency)
Point mutation: change of one base. When these mutations are polymorphic within a species we call them Single Nucleotide Polymorphisms (SNPs)
The various versions of the DNA sequence are called alleles: we might find a SNP with an A allele and a T allele at a certain position
Example: sequence polymorphism
GTCCTTCATAATCATCACGGGACT
GACCTTCATAACCATCACGGGACT
AACCTTCATAACCATCTCCGGACC
3 sequences, 6 polymorphisms
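The count of 6 polymorphisms can be checked by scanning the alignment column by column. The sketch below is only an illustration: the function name `polymorphic_sites` is an assumption, and the three sequences are the ones from the slide, split into rows of 24 bases.

```python
# Three aligned sequences from the slide (24 bases each).
sequences = [
    "GTCCTTCATAATCATCACGGGACT",
    "GACCTTCATAACCATCACGGGACT",
    "AACCTTCATAACCATCTCCGGACC",
]

def polymorphic_sites(seqs):
    """Return the 1-based positions at which the aligned sequences are not all identical."""
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    # zip(*seqs) walks over the alignment one column at a time.
    return [i + 1 for i, column in enumerate(zip(*seqs)) if len(set(column)) > 1]

sites = polymorphic_sites(sequences)
print(len(sites), sites)
```

Each reported position is a site where at least two alleles occur among the three sequences.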
www.hapmap.org
Hemoglobin beta gene (HBB)
Human variation
SNPs
• SNPs account for a large part of genetic variation
• Humans: on average 1 SNP / 1500 bp
• Any two sequences will differ at about 0.067% of positions
Short tandem repeats (STRs or microsatellites)
• Repeats of short DNA words (e.g., CACACACACA)
• Due to slippage during replication
• Mutation rate at microsatellites is much higher than for SNPs
Rare types of variation
• Indels: insertions and deletions
• Rearrangements: inversions, duplications, transpositions (copy-paste or cut-paste of genome sequences)
Applications of microsatellites
Forensics. In forensic identification cases, the goal is typically to link a suspect with a sample of blood, semen or hair taken from a crime scene. Alternatively, the goal may be to link a sample found on a suspect's clothing with a victim.
Relatedness testing in criminal work may involve investigating paternity in order to establish rape or incest.
Because the lengths of microsatellites may vary from one person to the next, scientists use them for the above applications, a procedure known as DNA profiling or "fingerprinting".
Applications of microsatellites
Diagnosis and Identification of Human Diseases
Because microsatellites change in length early in the development of some cancers, they are useful markers for early cancer detection.
Because they are polymorphic, they are useful in linkage studies, which attempt to locate genes responsible for various genetic disorders.
Transitions and transversions
Not all point mutations are equally likely
• Even if the mutation has no effect
• Due to the molecular structure of nucleotides
4 transitions (purine ↔ purine, A↔G; pyrimidine ↔ pyrimidine, C↔T), 8 transversions (purine ↔ pyrimidine)
Transitions are more common than transversions
Genetic code is more robust to transitions
When only two synonymous codons code for the same AA, they always differ by only one transition
Transitions within coding sequence are on average less harmful than transversions
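The counts "4 transitions, 8 transversions" follow from enumerating all 12 directed single-base changes. The sketch below is a minimal illustration; the function and variable names are assumptions, not part of the course material.

```python
from itertools import permutations

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def is_transition(ref, alt):
    """True for purine<->purine or pyrimidine<->pyrimidine changes, False for transversions."""
    if ref == alt:
        raise ValueError("not a substitution")
    return {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES

# All 12 directed substitutions between the four bases.
changes = list(permutations("ACGT", 2))
transitions = [c for c in changes if is_transition(*c)]
transversions = [c for c in changes if not is_transition(*c)]
print(len(transitions), len(transversions))
```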
DNA and amino acid substitutions
Not all nucleotide mutations occur with the same frequency
• Due to chemical structure
• Due to the sometimes deleterious consequence of a change in DNA
Not all changes between amino acids are seen with the same frequency
• Sometimes because two AAs are multiple nucleotide-mutation steps apart (Ala → Cys requires 3 mutations: GCA → TGT)
• Some AAs are more interchangeable due to biochemical characteristics such as size, polarity and hydrophobicity
Analysis of DNA-sequence variation
Human DNA sequence is 99.9% identical between individuals → about 3,000,000 varying nucleotides
Polymorphism: normal variation between individuals
Genetic variation
• May cause or predispose to inheritable diseases
• Determines e.g. individual drug response
• Used as markers to identify disease genes
Genotyping
• Genetic marker
  Polymorphisms that are highly variable between individuals: microsatellites and single nucleotide polymorphisms (SNPs)
• A marker may be inherited together with the disease-predisposing gene because of linkage disequilibrium (LD)
Important terms
• Allele
  Alternative form of a gene or DNA sequence at a specific chromosomal location (locus)
  At each locus an individual possesses two alleles, one inherited from each parent
• Genotype
  Genetic constitution of an individual; combination of alleles
Linkage disequilibrium, LD
Alleles are in LD, if they are inherited together more often than could be expected based on allele frequencies
Two loci are inherited together, because recombination during meiosis (formation of gametes) separates them only seldom
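"More often than expected based on allele frequencies" can be made concrete with the standard LD coefficient D = p(AB) - p(A)·p(B) and its normalised form r². These two measures are not named on the slides; the sketch below brings them in as an assumption, and the function name and input format (one two-letter haplotype string per chromosome) are hypothetical.

```python
from collections import Counter

def linkage_disequilibrium(haplotypes, allele1, allele2):
    """Return (D, r_squared) for allele1 at the first locus and allele2 at the second.

    haplotypes: list of two-character strings, one per sampled chromosome.
    """
    n = len(haplotypes)
    counts = Counter(haplotypes)
    p_ab = counts[allele1 + allele2] / n                               # haplotype frequency
    p_a = sum(c for h, c in counts.items() if h[0] == allele1) / n     # allele frequency, locus 1
    p_b = sum(c for h, c in counts.items() if h[1] == allele2) / n     # allele frequency, locus 2
    d = p_ab - p_a * p_b                                               # excess over independence
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else float("nan")
    return d, r2

# Two made-up samples: complete LD versus no LD.
print(linkage_disequilibrium(["AG"] * 50 + ["TC"] * 50, "A", "G"))
print(linkage_disequilibrium(["AG", "AC", "TG", "TC"] * 25, "A", "G"))
```

D is zero when the alleles at the two loci are combined independently, and r² reaches 1 when knowing one allele fully determines the other.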
Haplotype
A haplotype is a series of genetic variants (e.g., SNPs) on one chromosome that are inherited from one parent
In subsequent generations the chromosomal haplotype is broken up by crossing-over events in meiosis
In practice, "haplotype" refers to closely linked genetic loci.
SNPs that are located in close proximity tend to travel together; this is known as linkage disequilibrium (LD)
• In general, loci that are located more closely together on a chromosome will be in stronger LD
• The correlation between LD and the physical distance separating two loci is modest
• Some loci that are separated by 20 bp will not be in LD, while other loci separated by 200,000 bp will be in tight LD.
Haplotype
Multiple loci on the same chromosome that are inherited together
Usually a string of SNPs that are linked
[Figure: alleles at each locus; a haplotype is the combination of three alleles on one chromosome]
Haplotype construction
No good experimental methods available to identify haplotypes
→ Computational methods to create haplotypes from genotype data
...Haplotype construction
Family-based haplotype construction
• Linkage analysis software: Simwalk, Merlin, Genehunter, Allegro...
Population-based haplotype construction
• Not as reliable as family-based
• EM algorithm (expectation-maximization algorithm), described in http://www-gene.cimr.cam.ac.uk/clayton/software/
• SnpHap
• PHASE
Haplotype blocks
• Low recombination rate in the region
• Strong LD
• Low haplotype diversity
• A small number of SNPs in the block is enough to identify the common haplotypes: tag SNPs
Formation of haplotype blocks
Recombination events that shuffle the components of a haplotype do not occur at random
Some locations in the genome have much higher recombination rates
• Recombination hotspots
The occurrence of recombination hotspots has contributed to the limited haplotype diversity of the genome
• There are fewer observed haplotypes than would be expected by chance
The size of haplotype blocks varies from about 9 kb to over 100 kb. What is the size of a gene?
• Average gene size: 10-15 kb
Formation of haplotype blocks
[Figure: recombination between chromosomes during meiosis, shown after a few generations (haplotypes 111, 222, 221, 112) and after hundreds of generations (haplotypes 221, 231)]
Average block size
• African populations: 11 kb
• Non-African populations: 22 kb
• 60%-80% of the genome is in blocks of > 10 kb
1-150 kb
Block frequencies
Typically, only 3-5 common haplotypes account for >90% of the observed haplotypes
Benefits of haplotypes instead of individual SNPs
• Information content is higher
• Gene function may depend on more than one SNP
• Smaller number of required markers
• The number of false-positive associations is reduced
• Missing genotypes can be replaced by computational methods
• Elimination of genotyping errors
Challenges:
• Haplotypes are difficult to determine directly in the lab; computational methods are needed
• Defining block borders is ambiguous; several different algorithms exist
Haplotype example
Example: β2-adrenergic receptor gene (ADRB2)
Consider 8 SNPs, each with two alleles
One would expect 2^8 = 256 haplotypes
The observed number of haplotypes is much smaller, and only 3 haplotypes are estimated to occur with large frequency
Genotypes and haplotypes
• Diploid organism.
• 2 bi-allelic loci on the same chromosome (e.g., SNPs).
• First locus, alleles A and T: 3 genotypes AA, AT, and TT.
• Second locus, alleles G and C: 3 genotypes GG, GC, and CC.
• Individual: 9 possible configurations for the genotypes at these two loci.
• Punnett square (next slide) shows the possible genotypes that an individual may carry and the corresponding haplotypes.
Genotypes and haplotypes
[Figure: two homologous chromosomes, each carrying one haplotype (for example A-G); the two haplotypes together make up the genotype, which can be heterozygous at either locus]
The number of haplotypes grows exponentially with the number of polymorphisms
Question: what is the haplotype given the genotype?
Nuclear DNA: Genotypes and haplotypes
[Figure, built up over several slides: Punnett square of the haplotype pairs on the homologous chromosomes for each of the 9 two-locus genotypes. When at most one locus is heterozygous, the haplotypes follow directly from the genotype. For the double heterozygote (A/T at the first locus, G/C at the second) the haplotypes are ambiguous: either A-G with T-C, or A-C with T-G]
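The ambiguity of the double heterozygote can be shown by listing every haplotype pair that is consistent with an unphased genotype. This is a minimal sketch; the function name and the input format (one pair of alleles per locus) are assumptions made for illustration.

```python
from itertools import product

def compatible_haplotype_pairs(genotype):
    """List the unordered haplotype pairs consistent with an unphased genotype.

    genotype: one 2-tuple of alleles per locus, e.g. [("A", "T"), ("G", "C")].
    """
    pairs = set()
    # At each locus, choose which of the two alleles sits on the first chromosome.
    for choice in product((0, 1), repeat=len(genotype)):
        hap1 = "".join(locus[c] for locus, c in zip(genotype, choice))
        hap2 = "".join(locus[1 - c] for locus, c in zip(genotype, choice))
        pairs.add(tuple(sorted((hap1, hap2))))
    return sorted(pairs)

print(compatible_haplotype_pairs([("A", "A"), ("G", "C")]))  # one heterozygous locus
print(compatible_haplotype_pairs([("A", "T"), ("G", "C")]))  # double heterozygote
```

With k heterozygous loci there are 2^(k-1) possible phasings, which is why haplotypes usually have to be inferred computationally from genotype data.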
Mitochondrial DNA (mtDNA): a model for the analysis of variation
mtDNA is ideal for studying human evolution because of
• High mutation rate
• Other technical advantages (e.g., easier to isolate)
Mitochondria contain a high number of mutagenic oxygen molecules, which leads to a high mutation rate
Small circular genome
• 16,569 bases long in humans
• 37 genes (13 protein-coding genes, the rest RNA genes)
• Slightly different genetic code than the nuclear genome
Mitochondrial DNA (mtDNA)
D-loop
• non-coding sequence
• origin of replication
• promoter
• hypervariable regions (HVR-I, HVR-II), L = 400-500 bp
HVR: high variability among humans. Ideal for studying the relationships among individuals
Advantage of mtDNA
mtDNA is inherited only from the mother
• Every individual will have only one version of mtDNA
• We automatically know the haplotype
mtDNA: Genotypes and haplotypes
[Figure: mtDNA haplotypes; each individual carries a single haplotype, so genotype and haplotype coincide]
mtDNA is only passed down through the mother. Since we only have one version, we automatically know the haplotype if we know the genotype.
Variation between species
Genetic differences between species are responsible for differences in
• behavior
• morphology
• physiology
Variation between species
• Tells us about relationships between species: closely related species have on average more similar DNA
• Tells us how evolution proceeded over millions of years
Key to understanding differences between species: the number of nucleotide substitutions that separate two DNA sequences
Substitution rate
The substitution rate between homologous sequences from different species tells us about
• Time since divergence of species
• Biological function of genomic sequences
• Relationships among species
Mutation
• Originates in a single individual
• may be lost if the individual leaves no offspring
• may become fixed throughout the species: every individual in the species will have the new allele at the specific nucleotide position.
Substitution rate: rate at which species accumulate such fixed differences
Substitution vs. polymorphism
What happens after a mutation changes a nucleotide in a locus?
Polymorphism: mutant allele is one of several present in population
Substitution: the mutant allele fixes in the population. (New mutations at other nucleotides may occur later.)
Substitution schematic
Individual: 1 2 3 4 5 6 7
Time 0: aaat aaat aaat aaat aaat aaat aaat
Time 10: aaat aaat aaat aaat acat aaat aaat
Time 20: aaat aaat acat aaat acat acat acat
Time 30: acat acat acat acat acat acat acat
Time 40: acat acat actt acat acat acat acat
(1) times 10-29: polymorphism
(2) time 30: mutation fixed -> substitution
(3) time 40: new mutation: polymorphism
Substitution rates for neutral mutations
Most neutral mutations are lost
Only 1 out of 2N becomes fixed
Most that are lost go quickly (< 20 generations for population sizes from 100 - 2000)
Substitution rate is expressed as number of substitutions per site per million years
Neutral theory (neutral mutations)
[Figure: population of N = 10 diploid individuals (2 copies of each gene, so 2N = 20 alleles), all shown as Aa; a mutation (rate = µ) changes one individual to AB]
Any new mutation is initially present at a frequency of 1/(2N), which equals its chance of becoming fixed in the population
Number of new mutations: 2Nµ
Substitution rate: ρ = 2Nµ · 1/(2N) = µ
Neutral theory
ρ = 2Nµ · 1/(2N) = µ (e.g., 2 mutations per site per million years)
Substitution rate of new mutations is independent of the population size and is equal to the neutral mutation rate.
Larger populations:
• create more mutations
• smaller chance of each becoming fixed
Smaller populations:
• create fewer mutations
• larger chance of each becoming fixed
Genetic drift
Change in gene frequency due to chance fluctuations in a finite population
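The claim that a new neutral mutation fixes with probability 1/(2N) can be checked by simulating drift. The sketch below assumes a simple Wright-Fisher model (each generation, all 2N gene copies are drawn at random from the previous generation); that model choice, the function name and the parameter values are assumptions, not taken from the slides.

```python
import random

def fixation_fraction(n_individuals, trials, rng):
    """Fraction of new neutral mutations that reach fixation under Wright-Fisher drift."""
    copies_total = 2 * n_individuals  # diploid population: 2N gene copies
    fixed = 0
    for _ in range(trials):
        mutant = 1  # a new mutation starts as a single copy, frequency 1/(2N)
        while 0 < mutant < copies_total:
            freq = mutant / copies_total
            # Next generation: each of the 2N copies is mutant with probability freq.
            mutant = sum(rng.random() < freq for _ in range(copies_total))
        fixed += mutant == copies_total  # the loop ends at loss (0) or fixation (2N)
    return fixed / trials

estimate = fixation_fraction(n_individuals=10, trials=5000, rng=random.Random(1))
print(estimate, "compared with 1/(2N) =", 1 / 20)
```

Most simulated mutations are lost within a few generations, and the fraction that fixes should fluctuate around 1/(2N) = 0.05 for N = 10.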
Number of substitutions per site (K)
Genetic distance (K)
K = number of substitutions per site = number of substitutions / length of sequence (this controls for the length of the sequence)
If the divergence time (T) is known, then
Substitution rate ρ = K / (2T)
(we divide by 2T because both lineages that come from the common ancestor can accumulate mutations independently)
Substitution rate is expressed as number of substitutions per site per million years
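As a worked example of K and ρ = K / (2T): the numbers below (20 substitutions in a 1000 bp alignment, divergence 5 million years ago) are made up for illustration, and the function names are assumptions.

```python
def genetic_distance(n_substitutions, sequence_length):
    """K: number of substitutions per site."""
    return n_substitutions / sequence_length

def substitution_rate(k, divergence_time_myr):
    """rho = K / (2T): both lineages accumulate substitutions since the common ancestor."""
    return k / (2 * divergence_time_myr)

k = genetic_distance(n_substitutions=20, sequence_length=1000)  # hypothetical counts
rho = substitution_rate(k, divergence_time_myr=5.0)             # hypothetical divergence time
print(k, "substitutions per site;", rho, "substitutions per site per million years")
```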
Estimating genetic distance
Genetic distance (K) between two homologous sequences: number of substitutions since they diverged from their common ancestor
Problem
A simple count will underestimate the true number of differences when multiple substitutions have occurred at the same site.
Multiple substitutions
G A C C T T C A A T C A C G G G A C T
T T C C T T C A A T C A C G G G A C T
T T C C T T C A A T C A C G G G A C T
T T C C T T C A A T C A C G G G A C T
T T C C T T C A A T C A C C G G A C T
T T C C T T C A A T C T C C G G A C T
C A C C T T C A A T C T C C G G A C T
1 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0   (observed differences per site)
2 2 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0   (actual substitutions per site)
Observed: 3
Actual: 6
Observed: compare first and last sequence
Intermediate substitutions are not observed
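The two totals can be reproduced from the sequence series on the slide: the observed count compares only the first and last sequence, while the actual count adds up the changes between consecutive sequences. The sketch below uses the seven sequences as they appear in the transcript; the function name is an assumption.

```python
# The series of sequences from the slide, oldest first.
lineage = [
    "GACCTTCAATCACGGGACT",
    "TTCCTTCAATCACGGGACT",
    "TTCCTTCAATCACGGGACT",
    "TTCCTTCAATCACGGGACT",
    "TTCCTTCAATCACCGGACT",
    "TTCCTTCAATCTCCGGACT",
    "CACCTTCAATCTCCGGACT",
]

def count_differences(s1, s2):
    """Number of positions at which two aligned sequences differ."""
    return sum(a != b for a, b in zip(s1, s2))

observed = count_differences(lineage[0], lineage[-1])
actual = sum(count_differences(a, b) for a, b in zip(lineage, lineage[1:]))
print("Observed:", observed, "Actual:", actual)
```

Sites 1 and 2 each change twice along the series; at site 2 the second change restores the original base, so that site contributes nothing to the observed count.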
Multiple substitutions
Need probabilistic model to correct for multiple changes
• True genetic distance: K
• Observed differences: d
• Due to back-mutations K ≥ d
Saturation
At the extreme: on average one substitution per site across the sequence during evolution (saturation)
Two random sequences of equal length will match at approximately ¼ of their sites
At saturation, therefore, the observed proportional distance d approaches ¾
Process of substitution is random
• Genetic distance (K) and observed differences (d) are random variables
• Various ways to estimate K from d
Sequence evolution: Markov process. The sequence at time t depends only on the sequence at time t-1
The Jukes-Cantor model
Correction for multiple substitutions
Assumes that all substitutions are equally likely (e.g. transitions and transversions)
Substitution probability per site per second is α
Substitution means there are 3 possible replacements (e.g. C → {A,G,T})
Non-substitution means there is 1 possibility (e.g. C → C)
Transition matrix
Therefore, the one-step Markov process has the following transition matrix:
MJC =
A C G T
A 1-α α/3 α/3 α/3
C α/3 1-α α/3 α/3
G α/3 α/3 1-α α/3
T α/3 α/3 α/3 1-α
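Applying the one-step matrix repeatedly shows where saturation comes from: after many steps every entry approaches ¼, so the starting base is forgotten. The sketch below builds the matrix above in plain Python; the value α = 0.01 and the function names are assumptions chosen for illustration.

```python
def jc_matrix(alpha):
    """One-step Jukes-Cantor transition matrix over the bases A, C, G, T."""
    return [[1 - alpha if i == j else alpha / 3 for j in range(4)] for i in range(4)]

def matmul(a, b):
    """Product of two 4x4 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)] for i in range(4)]

def evolve(alpha, steps):
    """Transition probabilities after the given number of steps (matrix power)."""
    result = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]  # identity
    step = jc_matrix(alpha)
    for _ in range(steps):
        result = matmul(result, step)
    return result

for steps in (1, 100, 2000):
    p_same = evolve(0.01, steps)[0][0]  # probability that a site shows its original base
    print(steps, round(p_same, 4))
```

The probability of seeing the original base falls from 1 - α after one step towards ¼, which corresponds to an observed difference d approaching ¾.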
This leads to the Jukes-Cantor formula
K = -(3/4) ln(1 - (4/3) d)
For small d, using ln(1+x) ≈ x: K ≈ d
So: actual distance ≈ observed distance
For saturation: d ↑ ¾ : K → ∞
So: if the observed distance corresponds to the distance between random sequences, the actual distance becomes indeterminate
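The two limiting cases of the Jukes-Cantor correction, K = -(3/4) ln(1 - (4/3) d), are easy to see numerically. A minimal sketch follows; the function name is an assumption, and returning infinity for d ≥ ¾ is one possible way to represent the indeterminate case.

```python
import math

def jukes_cantor(d):
    """Jukes-Cantor distance K for an observed proportion of differing sites d."""
    if d < 0:
        raise ValueError("d must be non-negative")
    if d >= 0.75:
        return math.inf  # at or beyond saturation the correction is undefined
    return -0.75 * math.log(1 - 4 * d / 3)

for d in (0.01, 0.1, 0.3, 0.6, 0.74):
    print(d, round(jukes_cantor(d), 4))
```

For small d the corrected distance stays close to d; as d approaches ¾ the correction grows without bound.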
The Kimura two-parameter model
Assumes that transitions are more likely than transversions
Use two probabilities
• Transitions: α
• Transversions: β
    a        g        c        t
a   1-α-2β   α        β        β
g   α        1-α-2β   β        β
c   β        β        1-α-2β   α
t   β        β        α        1-α-2β
K = -0.5 ln(1-2P-Q) - 0.25 ln(1-2Q)
P: fraction of transitions
Q: fraction of transversions
d = P + Q
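The Kimura formula can be evaluated directly from P and Q. The sketch below is only an illustration with made-up P and Q values; the function name is an assumption. It also checks a useful consistency property: when transitions make up one third of the differences (P = d/3, Q = 2d/3), the result equals the Jukes-Cantor distance for the same d.

```python
import math

def kimura_2p(p, q):
    """Kimura two-parameter distance from transition (p) and transversion (q) fractions."""
    a = 1 - 2 * p - q
    b = 1 - 2 * q
    if a <= 0 or b <= 0:
        return math.inf  # saturated: the logarithms are undefined
    return -0.5 * math.log(a) - 0.25 * math.log(b)

d = 0.3
jc = -0.75 * math.log(1 - 4 * d / 3)           # Jukes-Cantor distance for the same d
print(round(kimura_2p(d / 3, 2 * d / 3), 4), round(jc, 4))  # equal rates: same value
print(round(kimura_2p(0.2, 0.1), 4))           # transition-rich differences: larger K
```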
Case study: are Neanderthals still among us?
Are Homo sapiens related to Neanderthals?
From GenBank
• Take 206 mtDNA sequences (modern humans)
• Take 2 Neanderthal mtDNA sequences
Extract comparable parts from the hypervariable regions for the modern humans (only parts of the HVR were available for Neanderthals)
208 sequences of 800 bp
Compute genetic distance corrected by the Jukes-Cantor formula
Results
Average distance between any two Homo sapiens: 0.025 (out of 1000 bases, 25 will be different on average)
Average distance between Homo sapiens and Neanderthal: 0.140 (140 out of 1000 bases): much higher
Make matrix of all pair-wise distances between sequences and use multi-dimensional scaling for visualization.
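The matrix of pair-wise distances can be sketched as follows. The three short sequences are made-up toy data, not the GenBank sequences used in the case study, and the function names are assumptions; the multi-dimensional scaling step itself is not shown.

```python
import math

def p_distance(s1, s2):
    """Observed proportion of differing sites between two aligned sequences."""
    if len(s1) != len(s2):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(s1, s2)) / len(s1)

def jc_distance(s1, s2):
    """Jukes-Cantor corrected distance between two aligned sequences."""
    d = p_distance(s1, s2)
    return math.inf if d >= 0.75 else -0.75 * math.log(1 - 4 * d / 3)

def distance_matrix(seqs):
    """Symmetric matrix of pair-wise Jukes-Cantor distances."""
    n = len(seqs)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = jc_distance(seqs[i], seqs[j])
    return matrix

# Made-up toy sequences, for illustration only.
toy = {
    "human_1": "ACGTACGTACGTACGTACGT",
    "human_2": "ACGTACGTACGTACGTACGA",
    "neanderthal": "ACGAACGTTCGTACGTACCA",
}
matrix = distance_matrix(list(toy.values()))
for name, row in zip(toy, matrix):
    print(name, [round(x, 3) for x in row])
```

A matrix like this is the input for multi-dimensional scaling or for building a phylogenetic tree.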
Alternative: use phylogenetic tree
Results: multi-dimensional scaling
[Figure: distances between points reflect genetic distance]
Neanderthals: not a sub-population of humans (different species)
Results: phylogenetic distance
[Figure: phylogenetic tree with Homo sapiens, Neanderthals and apes]
Neanderthals: more closely related to humans