schedule 10-feb, 9.00-12.00 (a1.14) chapter 5 genetic variation, snps, haplotype, genetic distance ...

Schedule

10-feb, 9.00-12.00 (A1.14) Chapter 5 Genetic Variation, SNPs, Haplotype, Genetic Distance. No computer practicum

11-feb, 9.00-11.30 (C1.112) Score matrices. Computer exercises: 13.30-15.30 (B1.24H&L)

15-feb, 9.00-11.00 (A1.10) Chapter 6 Natural Selection / Ka/Ks / HIV. No computer practicum

16-feb, 9.00-11.00 (A1.06) Chapter 7 Phylogenetics / SARS. Computer exercises: 11.30-13.30 (B1.24H&L)

Mini projects: literature and computation study

Aim: To further demonstrate how bioinformatics is used in scientific research, we defined a complementary track during which you will perform (a) a literature study on a selected topic and (b) computations related to the topic investigated.

Literature report: about 3000 words, in English

Computational exercise: programming language of choice (e.g., Perl, Matlab)

Example topics for literature study

We selected several topics that you may choose for your literature study and computations. However, you are free to propose another topic, but this has to be approved first (please contact Antoine van Kampen).

Before you start: Contact Antoine van Kampen,

[email protected]: April 1, 2010


Course material

http://www.bioinformaticslaboratory.nl

EDUCATION > BIOMEDICAL SCIENCES

Chapter 5

Are Neanderthals among us?

Variation within and between species

Prof. dr. Antoine van Kampen

Biosystems Data Analysis, Swammerdam Institute for Life Sciences

Bioinformatics Laboratory, Academic Medical Centre

Neanderthal Man

Skeleton discovered in 1856 in Germany (Neander Thal)

Skeleton dated to about 44 thousand years ago

Unusual features of skeleton

Belongs to ancient species of hominid: any member of the biological family Hominidae (the "great apes")

Biologically different from modern humans

First reconstruction of Neanderthal Man

http://upload.wikimedia.org/wikipedia/commons/9/9b/Neanderthaler_Fund.png

http://upload.wikimedia.org/wikipedia/commons/e/e0/Homo_sapiens_neanderthalensis.jpg

Are Neanderthals our ancestors?

Are modern Europeans the offspring of Neanderthals? This question has recently been settled by genetic analysis

Human evolution and spread

Consider dynamic nature of DNA

Determine how DNA sequences change over time

Use this information to infer the history and function of different parts of the genome

Variation data is also used in
- Medical research
- Forensics
- Genome annotation

Variation in DNA sequences 1

Questions about human origins can be answered
- Exploit the fact that every individual has a slightly different genome sequence
- Differences within and between species
- Siblings with the same parents have differences

Variation in DNA accumulates via mutations (mistakes made by the cellular machinery that are then encoded in the genome)
- Cell’s proof-reading machinery is very good
- Estimate: one mistake for every 2 million – 1 billion bases

Variation is also introduced by recombination (when the organism is diploid)

Variation in DNA sequences 2

Mutation rates differ
- between organisms
- between the mitochondrial and nuclear genome
- In most animals the rate in the mitochondrial genome is higher than in the nuclear genome

New mutations at any one nucleotide position are rare

Most genetic differences between individuals are inherited mutations and not newly arisen variants
- We exploit this fact to study the history of species
- Shared mutations are indicative of shared ancestry

Germline mutations

Mutations occur at every cell duplication
- The genome is replicated each time

Creating a fully grown human: trillions of cells, each of which dies off and is replaced multiple times during a lifetime
- This is one reason that cancer is an illness of the elderly

Mutations in skin cells or heart-muscle cells are not passed on to our offspring

Only mutations that occur in the germline cells have a chance of spreading through the population
- There are exceptions, such as plants

Geneticists classify animal cells into two types
- Germ-line cells: cells that give rise to gametes such as eggs and sperm
- Somatic cells: all other cells

Germ-line mutations are those that occur directly in a sperm or egg cell, or in one of their precursor cells

Somatic mutations are those that occur directly in a body cell, or in one of its precursor cells

Mutations Can Occur in Germ-Line or Somatic Cells

[Figure: germ-line mutation versus somatic mutation]

Germ-line mutation: the mutation can be passed on to future generations

Somatic mutation: the mutation affects only a patch of the body
- The size of the patch will depend on the timing of the mutation
- The earlier the mutation, the larger the patch
- Therefore, the mutation cannot be passed on to future generations

Mutations

Neutral: no effect (this might still be a non-synonymous mutation!)

Deleterious: disrupts some biological function

Advantageous: improves some biological function

If the mutation is not passed on to a child then it is lost

Polymorphism: Any difference among individuals at a specific position in the genome (regardless of frequency)

Point mutations: change of one base. When these mutations are polymorphic within a species, we call them Single Nucleotide Polymorphisms (SNPs)

Various versions of the DNA sequence are called alleles: we might find a SNP with an A allele and a T allele at a certain position

Example: sequence polymorphism

GTCCTTCATAATCATCACGGGACT
GACCTTCATAACCATCACGGGACT
AACCTTCATAACCATCTCCGGACC

3 sequences, 6 polymorphisms
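The count of 6 polymorphisms can be checked with a short script. This is a minimal sketch that assumes the 72-base string on the slide is three aligned sequences of 24 bases each; the function name `polymorphic_sites` is illustrative and not part of the course material.

```python
# The three example sequences, assumed to be aligned and 24 bases each.
sequences = [
    "GTCCTTCATAATCATCACGGGACT",
    "GACCTTCATAACCATCACGGGACT",
    "AACCTTCATAACCATCTCCGGACC",
]

def polymorphic_sites(seqs):
    """Return the 1-based positions where the sequences do not all agree."""
    return [
        position + 1
        for position, column in enumerate(zip(*seqs))
        if len(set(column)) > 1
    ]

sites = polymorphic_sites(sequences)
print(len(sites), sites)
```

A site counts as polymorphic as soon as at least one sequence differs there, regardless of how frequent each allele is, which matches the definition of polymorphism given earlier.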

www.hapmap.org

Hemoglobin beta gene (HBB)

Human variation

SNPs
- SNPs account for a large part of genetic variation
- Humans: on average 1 SNP / 1500 bp
- Any two sequences will differ at about 0.067% of positions

Short tandem repeats (STRs or microsatellites)
- Repeats of short DNA words (e.g., CACACACACA)
- Due to slippage during replication
- Mutation rate at microsatellites is much higher than for SNPs

Rare types of variation
- Indels: insertions and deletions
- Rearrangements: inversions, duplications, transpositions (copy-paste or cut-paste of genome sequences)

Applications of microsatellites

Forensics. In forensic identification cases, the goal is typically to link a suspect with a sample of blood, semen or hair taken from a crime scene. Alternatively, the goal may be to link a sample found on a suspect's clothing with a victim.

Relatedness testing in criminal work may involve investigating paternity in order to establish rape or incest.

Because the lengths of microsatellites may vary from one person to the next, scientists have begun to use them for the above applications; a procedure known as DNA profiling or "fingerprinting".

Applications of microsatellites

Diagnosis and Identification of Human Diseases

Because microsatellites change in length early in the development of some cancers, they are useful markers for early cancer detection.

Because they are polymorphic, they are useful in linkage studies, which attempt to locate genes responsible for various genetic disorders.

Transitions and transversions

Not all point mutations are equally likely
- Even if the mutation has no effect
- Due to the molecular structure of nucleotides

4 transitions (A↔G and C↔T, in both directions), 8 transversions (purine↔pyrimidine)

Transitions are more common than transversions

Genetic code is more robust to transitions

When only two synonymous codons code for the same AA, they always differ by only one transition

Transitions within a coding sequence are on average less harmful than transversions
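The counts of 4 transitions and 8 transversions follow from enumerating all 12 ordered changes between the four bases. A minimal sketch, using the standard grouping of A and G as purines and C and T as pyrimidines; the helper name `is_transition` is illustrative.

```python
from itertools import permutations

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def is_transition(old, new):
    """A transition keeps the base within purines or within pyrimidines."""
    return {old, new} <= PURINES or {old, new} <= PYRIMIDINES

# All 12 ordered point mutations between distinct bases.
changes = list(permutations("ACGT", 2))
transitions = [c for c in changes if is_transition(*c)]
transversions = [c for c in changes if not is_transition(*c)]

print(len(transitions), transitions)
print(len(transversions), transversions)
```

Although there are twice as many possible transversions, transitions are observed more often, which is why the Kimura model later in this chapter gives them a separate probability.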

DNA and amino acid substitutions

Not all nucleotide mutations occur with the same frequency
- Due to chemical structure
- Due to the sometimes deleterious consequences of a change in DNA

Not all changes between amino acids are seen with the same frequency
- Sometimes because two AAs are multiple nucleotide-mutation steps apart (e.g., Ala → Cys via GCA → TGT requires 3 mutations)
- Some AAs are more interchangeable due to biochemical characteristics such as size, polarity and hydrophobicity

Analysis of DNA-sequence variation

Human DNA sequence is 99.9% identical between individuals → 3,000,000 varying nucleotides
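The figure of 3,000,000 varying nucleotides is simple arithmetic, sketched below. The genome size of 3 billion bases is an assumption (it is implied by the slide but not stated), and the 1 SNP per 1500 bp density comes from the earlier slide on human variation.

```python
GENOME_SIZE = 3_000_000_000   # assumed human genome size in bases
IDENTITY = 0.999              # 99.9% identical between individuals

varying_nucleotides = round(GENOME_SIZE * (1 - IDENTITY))
print(varying_nucleotides)

# SNP density quoted earlier: on average 1 SNP per 1500 bp.
percent_different = 100 / 1500
print(round(percent_different, 3), "% of positions")
```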

Polymorphism: normal variation between individuals

Genetic variation
- May cause or predispose to inheritable diseases
- Determines e.g. individual drug response
- Used as markers to identify disease genes

Genotyping

• Genetic marker

• Polymorphisms that are highly variable between individuals: microsatellites and single nucleotide polymorphisms (SNPs)

• Marker may be inherited together with the disease-predisposing gene because of linkage disequilibrium (LD)

Important terms

• Allele
  • Alternative form of a gene or DNA sequence at a specific chromosomal location (locus)
  • At each locus an individual possesses two alleles, one inherited from each parent

• Genotype
  • Genetic constitution of an individual, combination of alleles

Linkage disequilibrium, LD

Alleles are in LD, if they are inherited together more often than could be expected based on allele frequencies

Two loci are inherited together because recombination during meiosis (formation of gametes) only rarely separates them
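"Inherited together more often than expected from the allele frequencies" can be made quantitative. The sketch below uses two common LD measures, D = p(AB) - p(A)p(B) and r², which are standard in population genetics but are not named on the slides; the haplotype samples are hypothetical.

```python
from collections import Counter

def linkage_disequilibrium(haplotypes, allele1, allele2):
    """Return (D, r_squared) for allele1 at locus 1 and allele2 at locus 2.

    haplotypes: list of 2-character strings, one per sampled chromosome.
    """
    n = len(haplotypes)
    counts = Counter(haplotypes)
    p_a = sum(c for h, c in counts.items() if h[0] == allele1) / n
    p_b = sum(c for h, c in counts.items() if h[1] == allele2) / n
    p_ab = counts[allele1 + allele2] / n
    d = p_ab - p_a * p_b            # excess of the A-B combination over expectation
    denominator = p_a * (1 - p_a) * p_b * (1 - p_b)
    r_squared = d * d / denominator if denominator > 0 else 0.0
    return d, r_squared

# Hypothetical samples: alleles always travel together vs. independent loci.
linked = ["AG"] * 5 + ["TC"] * 5
independent = ["AG", "AC", "TG", "TC"] * 3

print(linkage_disequilibrium(linked, "A", "G"))
print(linkage_disequilibrium(independent, "A", "G"))
```

In the linked sample the A and G alleles always occur on the same chromosome, so r² is 1; in the independent sample every combination is equally frequent and D is 0.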

Haplotype

A haplotype is a series of genetic variants (e.g., SNPs) on one chromosome that are inherited from one parent

In subsequent generations the chromosomal haplotype is broken up by crossing-over events in meiosis

In practice, “haplotype” refers to closely linked genetic loci.

SNPs that are located in close proximity tend to travel together
- This is known as linkage disequilibrium (LD)
- In general, loci that are located more closely together on a chromosome will be in stronger LD
- The correlation between LD and the physical distance separating two loci is modest
- Some loci that are separated by 20 bp will not be in LD, while other loci separated by 200,000 bp will be in tight LD.

Haplotype

Multiple loci in the same chromosome that are inherited together

Usually a string of SNPs that are linked

[Figure labels: alleles; locus; haplotypes (combination of three alleles on one chromosome)]

Haplotype construction

No good experimental methods available to identify haplotypes

→ Computational methods to create haplotypes from genotype data

...Haplotype construction

Family-based haplotype construction
- Linkage analysis software: Simwalk, Merlin, Genehunter, Allegro...

Population-based haplotype construction
- Not as reliable as family-based
- EM algorithm (expectation-maximization algorithm), described in http://www-gene.cimr.cam.ac.uk/clayton/software/
- SnpHap
- PHASE

Haplotype blocks

• Low recombination rate in the region

• Strong LD

• Low haplotype diversity

• A small number of SNPs in the block is enough to identify common haplotypes: tag SNPs

Formation of haplotype blocks

Recombination events that shuffle the components of a haplotype do not occur at random

Some locations in the genome have much higher recombination rates
- Recombination hotspots

The occurrence of recombination hotspots has contributed to the limited haplotype diversity of the genome
- There are fewer observed haplotypes than would be expected by chance

The size of haplotype blocks varies from about 9 kb to over 100 kb. What is the size of a gene?

Average gene size: 10-15 kb

Formation of haplotype blocks

[Figure: recombination (x) between chromosomes during meiosis. Panels: few generations; hundreds of generations. Haplotype labels shown: 111, 222, 221, 112, 221, 231.]

Average block size

• African populations: 11 kb

• Non-African populations: 22 kb

• 60%-80% of the genome is in blocks of > 10 kb

1-150 kb

Block frequencies

Typically, only 3-5 common haplotypes account for >90% of the observed haplotypes

Benefits of haplotypes instead of individual SNPs

Information content is higher

Gene function may depend on more than one SNP

Smaller number of required markers

The number of false-positive associations is reduced

Replacement of missing genotypes by computational methods

Elimination of genotyping errors

Challenges:
- Haplotypes are difficult to define directly in the lab; computational methods are needed
- Defining block borders is ambiguous; several different algorithms exist

Haplotype example

Example: β2-adrenergic receptor gene (ADRB2)

Consider 8 SNPs, each with two alleles

One would expect 2^8 = 256 haplotypes

The observed number of haplotypes is much smaller, and only 3 haplotypes are estimated to occur with large frequency

Genotypes and haplotypes

• Diploid organism.

• 2 bi-allelic loci on the same chromosome (e.g., SNPs).

• First locus, alleles A and T: 3 genotypes AA, AT, and TT.

• Second locus, alleles G and C: 3 genotypes GG, GC, and CC.

• Individual: 9 possible configurations for the genotypes at these two loci.

• Punnett square (next slide) shows the possible genotypes that an individual may carry and the corresponding haplotypes.
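The nine genotype configurations and their possible haplotypes can be enumerated directly. This is a minimal sketch; the function name `compatible_haplotype_pairs` is illustrative, and a genotype is written here as one two-letter string per locus.

```python
from itertools import product

def compatible_haplotype_pairs(genotype):
    """List the unordered haplotype pairs consistent with an unphased genotype.

    genotype: one 2-character string per locus, e.g. ["AT", "GC"].
    """
    pairs = set()
    # For every locus, choose which of the two alleles sits on chromosome 1;
    # the other allele then sits on chromosome 2.
    for choice in product((0, 1), repeat=len(genotype)):
        hap1 = "".join(g[c] for g, c in zip(genotype, choice))
        hap2 = "".join(g[1 - c] for g, c in zip(genotype, choice))
        pairs.add(tuple(sorted((hap1, hap2))))
    return sorted(pairs)

locus1 = ["AA", "AT", "TT"]
locus2 = ["GG", "GC", "CC"]
configurations = list(product(locus1, locus2))

for g1, g2 in configurations:
    print(g1, g2, compatible_haplotype_pairs([g1, g2]))
```

Only the configuration that is heterozygous at both loci (AT with GC) yields more than one haplotype pair, which is why haplotypes generally have to be inferred computationally from genotype data.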

Genotypes and haplotypes

[Figure: a pair of homologous chromosomes, showing the genotype at each locus (here AA and GG) and the resulting haplotype (A-G on both chromosomes); heterozygous loci are marked.]

Number of haplotypes grows exponentially with the number of polymorphisms

Question: what is the haplotype given the genotype?

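Nuclear DNA: Genotypes and haplotypes

[Figure: Punnett square, built up over several slides, showing the haplotypes on the two homologous chromosomes for each two-locus genotype. The AA and TT rows combined with GG, GC or CC give unambiguous haplotypes. The double heterozygote (AT with GC) is ambiguous: the haplotypes may be A-G and T-C, or A-C and T-G.]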

Mitochondrial DNA (mtDNA): a model for the analysis of variation

mtDNA is ideal for studying human evolution because of
- High mutation rate
- Other technical advantages (e.g., easier to isolate)

Mitochondria contain a high number of mutagenic oxygen molecules, which leads to a high mutation rate

Small circular genome
- 16,569 bases long in humans
- 37 genes (13 protein-coding genes and 24 RNA genes)
- Slightly different genetic code than the nuclear genome

Mitochondrial DNA (mtDNA)

D-loop
- non-coding sequence
- origin of replication
- promoter
- hypervariable regions (HVR-I, HVR-II), L = 400-500 bp

HVR: high variability among humans. Ideal for studying the relationships among individuals

http://upload.wikimedia.org/wikipedia/commons/3/3e/Mitochondrial_DNA_en.svg

Advantage of mtDNA

mtDNA is inherited only from the mother

Every individual will have only one version of mtDNA
- We automatically know the haplotype

mtDNA: Genotypes and haplotypes

[Figure: a single mtDNA molecule; the alleles observed at each locus (the genotype) directly give the haplotype.]

mtDNA is only passed down through the mother. Since we only have one version, we automatically know the haplotype if we know the genotype.

Variation between species

Genetic differences between species are responsible for differences in
- behavior
- morphology
- physiology

Variation between species
- Tells us about relationships between species: closely related species have on average more similar DNA
- Tells us how evolution proceeded over millions of years

Key to understanding differences between species: the number of nucleotide substitutions that separate two DNA sequences

Substitution rate

The substitution rate between homologous sequences from different species tells us about
- Time since divergence of species
- Biological function of genomic sequences
- Relationships among species

Mutation
- Originates in a single individual
- May be lost if the individual leaves no offspring
- May become fixed throughout the species: every individual in the species will have the new allele at the specific nucleotide position.

Substitution rate: rate at which species accumulate such fixed differences

Substitution vs. polymorphism

What happens after a mutation changes a nucleotide in a locus?

Polymorphism: mutant allele is one of several present in population

Substitution: the mutant allele fixes in the population. (New mutations at other nucleotides may occur later.)

Substitution schematic

Individual: 1 2 3 4 5 6 7

Time 0: aaat aaat aaat aaat aaat aaat aaat

Time 10: aaat aaat aaat aaat acat aaat aaat

Time 20: aaat aaat acat aaat acat acat acat

Time 30: acat acat acat acat acat acat acat

Time 40: acat acat actt acat acat acat acat

(1) times 10-29: polymorphism

(2) time 30: mutation fixed -> substitution

(3) time 40: new mutation: polymorphism
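The schematic can be replayed in code to label each time point. This is a minimal sketch that assumes the population is monomorphic at the first time point; the function name `classify` and the label strings are illustrative.

```python
# Population snapshots copied from the schematic (7 individuals per time point).
population_by_time = {
    0: "aaat aaat aaat aaat aaat aaat aaat",
    10: "aaat aaat aaat aaat acat aaat aaat",
    20: "aaat aaat acat aaat acat acat acat",
    30: "acat acat acat acat acat acat acat",
    40: "acat acat actt acat acat acat acat",
}

def classify(history):
    """Label each time point as 'monomorphic', 'polymorphic' or 'substitution'."""
    labels = {}
    times = sorted(history)
    # Sequence shared by everyone at the start (assumed monomorphic).
    fixed = history[times[0]].split()[0]
    for t in times:
        alleles = set(history[t].split())
        if len(alleles) > 1:
            labels[t] = "polymorphic"        # several alleles coexist
        elif alleles == {fixed}:
            labels[t] = "monomorphic"        # still the previously fixed allele
        else:
            labels[t] = "substitution"       # a new allele has replaced the old one
            fixed = next(iter(alleles))
    return labels

labels = classify(population_by_time)
for t, label in labels.items():
    print(t, label)
```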

Substitution rates for neutral mutations

Most neutral mutations are lost

Only 1 out of 2N fix

Most that are lost go quickly (< 20 generations for population sizes from 100 - 2000)

Substitution rate is expressed as number of substitutions per site per million years

Neutral theory (neutral mutations)

Population size N = 10, 2 copies of each gene: 2N = 20 alleles

Aa Aa Aa Aa Aa Aa Aa Aa Aa Aa

mutation (rate = µ)

Aa Aa Aa Aa Aa AB Aa Aa Aa Aa

Any new mutation is initially present at a frequency of 1/(2N), which equals its chance of becoming fixed in the population

Number of new mutations: 2Nµ

Substitution rate: ρ = 2Nµ(1/2N) = µ

Neutral theory

ρ = 2Nµ(1/2N) = µ (e.g., 2 mutations per site per Myear)

The substitution rate of new mutations is independent of the population size and is equal to the neutral mutation rate.

Larger populations:
- create more mutations
- smaller chance of becoming fixed

Smaller populations:
- create fewer mutations
- larger chance of becoming fixed
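The claim that a new neutral mutation fixes with probability 1/(2N) can be checked by simulation. The sketch below assumes a Wright-Fisher model (each generation, all 2N gene copies are drawn at random from the previous generation); that model, the function names, the seed and the trial count are choices made for this illustration and are not taken from the slides.

```python
import random

def mutation_fixes(n_individuals, rng):
    """Follow one new neutral mutation until it is lost (False) or fixed (True)."""
    total_copies = 2 * n_individuals   # diploid population: 2N gene copies
    copies = 1                         # a new mutation starts at frequency 1/(2N)
    while 0 < copies < total_copies:
        frequency = copies / total_copies
        # Next generation: each of the 2N copies is sampled from the current one.
        copies = sum(rng.random() < frequency for _ in range(total_copies))
    return copies == total_copies

def fixation_fraction(n_individuals, trials, seed=1):
    rng = random.Random(seed)
    fixed = sum(mutation_fixes(n_individuals, rng) for _ in range(trials))
    return fixed / trials

N = 10
estimate = fixation_fraction(N, trials=10_000)
print("simulated:", estimate, "expected:", 1 / (2 * N))
```

With 2Nµ new mutations arising and each fixing with probability 1/(2N), the population size cancels, which is the ρ = µ result stated above.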

Genetic drift

Change in gene frequency due to chance fluctuations in a finite population

Number of substitutions per site (K)

Genetic distance (K)

K = number of substitutions per site = number of substitutions / length of sequence (this controls for the length of the sequence)

If the divergence time (T) is known, then

Substitution rate ρ = K / (2T) (we divide by 2T because both lineages that come from the common ancestor can accumulate mutations independently)

Substitution rate is expressed as number of substitutions per site per million years
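The two definitions translate into two one-line functions. The numbers in the example (25 substitutions in 1000 sites, divergence 5 million years ago) are hypothetical and only illustrate the units.

```python
def genetic_distance(n_substitutions, sequence_length):
    """K: number of substitutions per site."""
    return n_substitutions / sequence_length

def substitution_rate(k, divergence_time_myr):
    """rho = K / (2T): substitutions per site per million years.

    The factor 2 accounts for both lineages accumulating changes
    independently since the common ancestor.
    """
    return k / (2 * divergence_time_myr)

# Hypothetical example values.
k = genetic_distance(25, 1000)
rho = substitution_rate(k, divergence_time_myr=5.0)
print(k, rho)
```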

Estimating genetic distance

Genetic distance (K) between two homologous sequences: number of substitutions since they diverged from their common ancestor

Problem

A simple count will underestimate the true number of differences when multiple substitutions have occurred at the same site.

Multiple substitutions

G A C C T T C A A T C A C G G G A C T
T T C C T T C A A T C A C G G G A C T
T T C C T T C A A T C A C G G G A C T
T T C C T T C A A T C A C G G G A C T
T T C C T T C A A T C A C C G G A C T
T T C C T T C A A T C T C C G G A C T
C A C C T T C A A T C T C C G G A C T

1 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 (observed changes per site)
2 2 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 (actual changes per site)

Observed: 3, Actual: 6

Observed: compare first and last sequence

Intermediate substitutions are not observed
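The observed and actual counts can be reproduced from the sequence series on the slide. This is a minimal sketch; the seven sequences are transcribed from the extracted slide text, and the helper name `differences` is illustrative.

```python
# Successive states of one lineage, as transcribed from the slide.
lineage = [
    "GACCTTCAATCACGGGACT",
    "TTCCTTCAATCACGGGACT",
    "TTCCTTCAATCACGGGACT",
    "TTCCTTCAATCACGGGACT",
    "TTCCTTCAATCACCGGACT",
    "TTCCTTCAATCTCCGGACT",
    "CACCTTCAATCTCCGGACT",
]

def differences(a, b):
    """Number of sites at which two aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b))

# Observed: only the first and last sequences are available for comparison.
observed = differences(lineage[0], lineage[-1])

# Actual: add up every change between consecutive states.
actual = sum(differences(a, b) for a, b in zip(lineage, lineage[1:]))

print("Observed:", observed, "Actual:", actual)
```

Site 2 changes from A to T and back to A, so it contributes two actual substitutions but no observed difference.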

Multiple substitutions

Need probabilistic model to correct for multiple changes

• True genetic distance: K

• Observed differences: d

• Due to back-mutations K ≥ d

Saturation

At the extreme: on average one substitution per site across the sequence during evolution (saturation)

Two random sequences of equal length will match at approximately ¼ of their sites

At saturation, therefore, the proportional observed distance is ¾

Process of substitution is random
- Genetic distance (K) and observed differences (d) are random variables
- Various ways to estimate K from d

Sequence evolution: Markov process. A sequence at time t depends only on the sequence at time t-1
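The saturation level can be illustrated by comparing two unrelated random sequences. The sketch assumes all four bases are equally frequent; the sequence length and seed are arbitrary choices for the illustration.

```python
import random

def fraction_different(length, seed=0):
    """Fraction of sites that differ between two random sequences."""
    rng = random.Random(seed)
    first = [rng.choice("ACGT") for _ in range(length)]
    second = [rng.choice("ACGT") for _ in range(length)]
    return sum(x != y for x, y in zip(first, second)) / length

d_random = fraction_different(100_000)
print("fraction matching:", 1 - d_random)
print("fraction different:", d_random)
```

About a quarter of the sites match by chance, so the observed fraction of differences d cannot rise much above three quarters, however many substitutions have actually occurred.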

The Jukes-Cantor model

Correction for multiple substitutions

Assumes that all substitutions are equally likely (e.g. transitions and transversions)

Substitution probability per site per second is α

Substitution means there are 3 possible replacements (e.g. C → {A,G,T})

Non-substitution means there is 1 possibility (e.g. C → C)

Transition matrix

Therefore, the one-step Markov process has the following transition matrix:

MJC =

A C G T

A 1-α α/3 α/3 α/3

C α/3 1-α α/3 α/3

G α/3 α/3 1-α α/3

T α/3 α/3 α/3 1-α
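A short sketch of this transition matrix and of what repeated application does to it. The value of α is hypothetical, and the closed form used in the last check, P(same base after t steps) = 1/4 + 3/4 (1 - 4α/3)^t, is the standard result for this matrix rather than something stated on the slide.

```python
BASES = "ACGT"

def jukes_cantor_matrix(alpha):
    """One-step matrix: 1 - alpha on the diagonal, alpha / 3 elsewhere."""
    return [[1 - alpha if i == j else alpha / 3 for j in range(4)] for i in range(4)]

def mat_mul(x, y):
    return [[sum(x[i][k] * y[k][j] for k in range(4)) for j in range(4)] for i in range(4)]

def mat_pow(matrix, steps):
    """Multiply the matrix by itself `steps` times, starting from the identity."""
    result = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    for _ in range(steps):
        result = mat_mul(result, matrix)
    return result

alpha = 0.01                      # hypothetical substitution probability per step
m = jukes_cantor_matrix(alpha)
m_long = mat_pow(m, 1000)

for base, row in zip(BASES, m_long):
    print(base, [round(p, 4) for p in row])
```

After many steps every entry approaches 1/4: the sequence retains no information about its starting base, which is the saturation situation described earlier.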

Jukes-Cantor formula

This leads to the Jukes-Cantor formula: $K = -\frac{3}{4} \ln\left(1 - \frac{4}{3} d\right)$

For small d, using ln(1+x) ≈ x: K ≈ d. So: actual distance ≈ observed distance

For saturation: d ↑ ¾: K → ∞. So: if the observed distance corresponds to the distance between random sequences, then the actual distance becomes indeterminate
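The Jukes-Cantor correction, K = -(3/4) ln(1 - (4/3) d), is easy to evaluate for a range of observed distances. A minimal sketch; the function name and the example values of d are illustrative.

```python
import math

def jukes_cantor_distance(d):
    """Corrected distance K from the observed fraction of differing sites d."""
    if not 0 <= d < 0.75:
        raise ValueError("d must satisfy 0 <= d < 0.75 (the saturation level)")
    return -0.75 * math.log(1 - 4 * d / 3)

for d in (0.01, 0.1, 0.3, 0.6, 0.74):
    print(d, round(jukes_cantor_distance(d), 4))
```

For small d the correction is negligible, while close to d = ¾ the estimate grows without bound, so tiny sampling errors in d produce very large changes in K.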

The Kimura two-parameter model

Assumes that transitions are more likely than transversions

Use two probabilities
- Transitions: α
- Transversions: β

a g c t

a 1-α-2β α β β

g α 1-α-2β β β

c β β 1-α-2β α

t β β α 1-α-2β

$K = -\frac{1}{2} \ln(1 - 2P - Q) - \frac{1}{4} \ln(1 - 2Q)$

P: fraction of transitions

Q: fraction of transversions

d = P + Q
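A sketch of the Kimura distance applied to a pair of sequences. The two sequences are the first and third from the sequence polymorphism example earlier in the chapter, reused here purely as illustration; the function names are not from the slides.

```python
import math

PURINES = {"A", "G"}

def transition_transversion_fractions(seq1, seq2):
    """Return (P, Q): fractions of sites differing by a transition / transversion."""
    transitions = transversions = 0
    for x, y in zip(seq1, seq2):
        if x == y:
            continue
        if (x in PURINES) == (y in PURINES):
            transitions += 1       # purine <-> purine or pyrimidine <-> pyrimidine
        else:
            transversions += 1     # purine <-> pyrimidine
    n = len(seq1)
    return transitions / n, transversions / n

def kimura_distance(p, q):
    """K = -0.5 ln(1 - 2P - Q) - 0.25 ln(1 - 2Q)."""
    a = 1 - 2 * p - q
    b = 1 - 2 * q
    if a <= 0 or b <= 0:
        raise ValueError("distance undefined: sequences are saturated")
    return -0.5 * math.log(a) - 0.25 * math.log(b)

seq_a = "GTCCTTCATAATCATCACGGGACT"
seq_b = "AACCTTCATAACCATCTCCGGACC"
p, q = transition_transversion_fractions(seq_a, seq_b)
print("P =", p, "Q =", q, "d =", p + q, "K =", kimura_distance(p, q))
```

As with Jukes-Cantor, the corrected distance K is larger than the observed proportion d = P + Q, and the gap widens as the sequences become more divergent.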

Case study: are Neanderthals still among us?

Are Homo sapiens related to Neanderthals?

From GenBank
- Take 206 mtDNAs (modern humans)
- Take 2 Neanderthal mtDNAs

Extract comparable parts from the hypervariable regions for the modern humans (only parts of the HVR were available for Neanderthals)

208 sequences of 800 bp

Compute genetic distance corrected by Jukes-Cantor formula

Results

Average distance between any two Homo sapiens: 0.025 (out of 1000 bases, 25 will be different on average)

Average distance between Homo sapiens and Neanderthal: 0.140 (140 out of 1000 bases): much higher

Make matrix of all pair-wise distances between sequences and use multi-dimensional scaling for visualization.

Alternative: use phylogenetic tree
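The pairwise distance matrix that feeds the visualization can be sketched as follows. The three 20-base sequences are invented toy data, not the GenBank sequences used in the case study, and the names `jc_distance` and `distance_matrix` are illustrative.

```python
import math

def jc_distance(a, b):
    """Jukes-Cantor corrected distance between two aligned sequences."""
    d = sum(x != y for x, y in zip(a, b)) / len(a)
    return -0.75 * math.log(1 - 4 * d / 3)

def distance_matrix(sequences):
    """Dictionary of all pairwise distances, keyed by (name, name)."""
    names = list(sequences)
    return {(m, n): jc_distance(sequences[m], sequences[n]) for m in names for n in names}

# Toy data only: hypothetical 20-base fragments.
toy = {
    "human_1": "ACGTACGTACGTACGTACGT",
    "human_2": "ACGTACGTACGTACGTACGA",
    "neanderthal_1": "ACGAACGTTCGTACCTACGA",
}

matrix = distance_matrix(toy)
for (m, n), k in matrix.items():
    if m < n:
        print(m, n, round(k, 3))
```

A matrix like this, computed for all 208 sequences, is the input for multi-dimensional scaling or for building a phylogenetic tree.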

Results: multi-dimensional scaling

Neanderthals: not a sub-population of humans (different species)

[Figure: multi-dimensional scaling plot; distances between points reflect genetic distance]

Results: phylogenetic distance

Neanderthals: more closely related to humans

[Figure: phylogenetic tree with labels Homo sapiens and Apes]
