cse182-l17 clustering population genetics: basics
Post on 21-Dec-2015
CSE182-L17
Clustering
Population Genetics: Basics
Unsupervised Clustering
• Given a set of points (in n-dimensions), and k, compute the k “best clusters”.
• In k-means, clustering is done by choosing k centers (means).
• Each point is assigned to the closest center.
• The notion of “best” is defined by distances to the center.
• Question: How can we compute the k best centers?
Clusters
Distance
• Given a data point v and a set of points X,
define the distance from v to X
d(v, X)
as the (Euclidean) distance from v to the closest point from X.
• Given a set of n data points V = {v1, …, vn} and a set of k points X,
define the Squared Error Distortion
$$d(V, X) = \frac{1}{n} \sum_{i=1}^{n} d(v_i, X)^2$$
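The two definitions above translate directly into code. The following is a minimal Python sketch; the function names `dist_to_set` and `squared_error_distortion` and the sample points are illustrative assumptions, not part of the slides.

```python
import math

def dist_to_set(v, X):
    """d(v, X): Euclidean distance from point v to the closest point in X."""
    return min(math.dist(v, x) for x in X)

def squared_error_distortion(V, X):
    """d(V, X): mean over all data points of the squared distance to X."""
    return sum(dist_to_set(v, X) ** 2 for v in V) / len(V)

# Hypothetical toy data: three points and two candidate centers.
V = [(0.0, 0.0), (0.0, 2.0), (4.0, 0.0)]
X = [(0.0, 1.0), (4.0, 0.0)]
print(squared_error_distortion(V, X))  # squared distances 1, 1, 0 averaged over 3 points
```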
K-Means Clustering Problem: Formulation
• Input: A set, V, consisting of n points and a parameter k
• Output: A set X consisting of k points (cluster centers) that minimizes the squared error distortion d(V,X) over all possible choices of X
This problem is NP-hard in general.
1-Means Clustering Problem: an Easy Case
• Input: A set, V, consisting of n points.
• Output: A single point X that minimizes d(V,X) over all possible choices of X.
This problem is easy.
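The reason it is easy is a standard result: for a single center, the squared error distortion is minimized by the centroid (coordinate-wise mean) of V. A small Python sketch of this, where `centroid`, `distortion_single`, and the sample points are hypothetical names and data chosen for illustration:

```python
def centroid(V):
    """Coordinate-wise mean of the points in V."""
    n = len(V)
    return tuple(sum(coords) / n for coords in zip(*V))

def distortion_single(V, x):
    """Squared error distortion d(V, {x}) for a single center x."""
    return sum(sum((a - b) ** 2 for a, b in zip(v, x)) for v in V) / len(V)

V = [(0.0, 0.0), (2.0, 0.0), (1.0, 3.0)]
c = centroid(V)
print(c)                                  # (1.0, 1.0)
print(distortion_single(V, c))            # distortion at the centroid
print(distortion_single(V, (1.0, 1.5)))   # a nearby point gives a larger value
```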
However, it becomes very difficult for more than one center.
An efficient heuristic method for k-Means clustering is the Lloyd algorithm
K-means: Lloyd’s algorithm
• Choose k centers at random:
– X’ = {x1, x2, x3, …, xk}
• Repeat
– X = X’
– Assign each v ∈ V to the closest cluster j:
  • if d(v, xj) = d(v, X), then Cj = Cj ∪ {v}
– Recompute X’:
  • x’j ← (∑ v ∈ Cj v) / |Cj|
• until (X’ = X)
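The loop above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a reference implementation: initial centers are sampled from the data points, an empty cluster keeps its old center, and `max_iter` is a safety cap that the slides do not mention.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def lloyd(V, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    centers = rng.sample(V, k)            # assumption: initial centers are data points
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Assignment step: each point goes to its closest center.
        clusters = [[] for _ in range(k)]
        for v in V:
            j = min(range(k), key=lambda c: sq_dist(v, centers[c]))
            clusters[j].append(v)
        # Update step: each center becomes the mean of its cluster.
        new_centers = []
        for j, C in enumerate(clusters):
            if C:
                new_centers.append(tuple(sum(xs) / len(C) for xs in zip(*C)))
            else:
                new_centers.append(centers[j])   # assumption: keep center of an empty cluster
        if new_centers == centers:               # until X' = X
            break
        centers = new_centers
    return centers, clusters

# Hypothetical toy data: two well-separated groups of three points.
V = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, clusters = lloyd(V, 2)
print(sorted(centers))
```

Because the result depends on the random start, a common practice is to run the procedure from several seeds and keep the clustering with the smallest distortion.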
[Figure, four panels: data points plotted by expression in condition 1 and expression in condition 2 (both axes 0 to 5), with the centers x1, x2, x3 marked in each panel.]
Conservative K-Means Algorithm
• The Lloyd algorithm is fast, but each iteration moves many data points at once, which does not necessarily improve convergence.
• A more conservative method is to move one point at a time, and only if the move improves the overall clustering cost.
• The smaller the clustering cost of a partition of the data points, the better the clustering.
• Different measures can be used for this clustering cost (for example, the last algorithm used the squared error distortion).
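A rough Python sketch of the conservative idea follows. It assumes the cost is the sum of squared distances to each cluster's mean, and it accepts the first improving move it finds; both choices, along with the names `cost` and `conservative_kmeans`, are assumptions made for illustration.

```python
def sq_dist(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))

def mean(C):
    return tuple(sum(xs) / len(C) for xs in zip(*C))

def cost(clusters):
    """Sum over clusters of squared distances to the cluster mean."""
    total = 0.0
    for C in clusters:
        if C:
            m = mean(C)
            total += sum(sq_dist(v, m) for v in C)
    return total

def conservative_kmeans(initial):
    """Move one point at a time, keeping a move only if the cost drops."""
    clusters = [list(C) for C in initial]
    best = cost(clusters)
    improved = True
    while improved:
        improved = False
        for i in range(len(clusters)):
            for v in list(clusters[i]):          # iterate over a snapshot
                for j in range(len(clusters)):
                    if j == i:
                        continue
                    clusters[i].remove(v)
                    clusters[j].append(v)
                    c = cost(clusters)
                    if c < best - 1e-12:         # keep the move
                        best, improved = c, True
                        break
                    clusters[j].remove(v)        # undo the move
                    clusters[i].append(v)
    return clusters, best

# Hypothetical starting partition that mixes two well-separated groups.
start = [[(0.0, 0.0), (0.0, 1.0), (10.0, 10.0)],
         [(10.0, 11.0), (1.0, 0.0), (11.0, 10.0)]]
final, final_cost = conservative_kmeans(start)
print(final, final_cost)
```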
Microarray summary
• Microarrays (like MS) are a technology for probing the dynamic state of the cell.
• We answered questions like the following:
– Which genes are coordinately regulated (i.e., have similar expression patterns in different conditions)?
– How can we reduce the dimensionality of the system?
– Using gene expression values from a sample, can you predict if the sample is normal (state A) or diseased (state B)?
• The techniques employed for classification/clustering etc. are general and can be employed in a number of contexts.
Microarray non-summary
• We did not cover:
– How are the gene expression values measured (the technology)? (CSE183)
– How do you control variability across different experiments (normalization)? (CSE183)
– What controls the expression of a gene (gene regulation), or a set of genes? (CSE 181)
Population Genetics
• The sequence of an individual does not say anything about the diversity of a population.
• Small individual genetic differences can have a profound impact on “phenotypes”
– Response to drugs
– Susceptibility to diseases
• Soon, we will have sequences of many individuals from the same species. Studying the differences will be a major challenge.
Population Structure
• 377 locations (loci) were sampled in 1000 people from 52 populations.
• 6 genetic clusters were obtained, which corresponded to 5 geographic regions (Rosenberg et al. Science 2003)
[Figure: genetic clusters labeled by region: Africa, Eurasia, East Asia, America, Oceania.]
Population Genetics
• What is it about our genetic makeup that makes us measurably different?
• These genetic differences are correlated with phenotypic differences
• With cost reduction in sequencing and genotyping technologies, we will know the sequence for entire populations of individuals.
• Here, we will study the basics of this polymorphism data, and tools that are being developed to analyze it.
What causes variation in a population?
• Mutations (may lead to SNPs)
• Recombinations
• Other genetic events (Ex: microsatellite repeats)
• Deletions, inversions
Single Nucleotide Polymorphisms
000001010111000110100101000101010010000000110001111000000101100110
Infinite Sites Assumption: Each site mutates at most once
Short Tandem Repeats
Sequence (repeat count)
GCTAGATCATCATCATCATTGCTAG (4)
GCTAGATCATCATCATTGCTAGTTA (3)
GCTAGATCATCATCATCATCATTGC (5)
GCTAGATCATCATCATTGCTAGTTA (3)
GCTAGATCATCATCATTGCTAGTTA (3)
GCTAGATCATCATCATCATCATTGC (5)
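The repeat counts can be recovered from the sequences themselves. The sketch below counts the longest run of a repeat unit with a regular expression; treating "CAT" as the unit is an assumption (the same runs could equally be read as "ATC" or "TCA"), and `longest_tandem_run` is a hypothetical helper name.

```python
import re

def longest_tandem_run(seq, unit):
    """Largest number of consecutive copies of `unit` found anywhere in seq."""
    runs = re.findall(f"(?:{re.escape(unit)})+", seq)
    return max((len(r) // len(unit) for r in runs), default=0)

# The six sequences from the slide, split into 25-base rows.
alleles = [
    "GCTAGATCATCATCATCATTGCTAG",
    "GCTAGATCATCATCATTGCTAGTTA",
    "GCTAGATCATCATCATCATCATTGC",
    "GCTAGATCATCATCATTGCTAGTTA",
    "GCTAGATCATCATCATTGCTAGTTA",
    "GCTAGATCATCATCATCATCATTGC",
]
counts = [longest_tandem_run(s, "CAT") for s in alleles]
print(counts)  # [4, 3, 5, 3, 3, 5]
```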
STR can be used as a DNA fingerprint
• Consider a collection of regions with variable length repeats.
• Variable length repeats will lead to variable length DNA
• The vector of lengths is a fingerprint.
[Figure: table of repeat lengths indexed by individuals and positions; values as extracted: 4 23 35 13 23 15 3]
Recombination
00000000
11111111
00011111
What if there were no recombinations?
• Life would be simpler
• Each sequence would have a single parent
• The relationship is expressed as a tree.
The Infinite Sites Assumption
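[Figure: tree of sequences; each edge is labeled with the site that mutates]
0 0 0 0 0 0 0 0
  └ site 3: 0 0 1 0 0 0 0 0
      ├ site 8: 0 0 1 0 0 0 0 1
      └ site 5: 0 0 1 0 1 0 0 0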
• The different sites are linked. A 1 in position 8 implies 0 in position 5, and vice versa.
• Some phenotypes could be linked to the polymorphisms
• Some of the linkage is “destroyed” by recombination
Infinite sites assumption and Perfect Phylogeny
• Each site is mutated at most once in the history.
• All descendants must carry the mutated value, and all others must carry the ancestral value
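[Figure: a tree with the mutation at site i on one edge; the subtree below that edge has 1 in position i, the rest of the tree has 0 in position i.]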
Perfect Phylogeny
• Assume an evolutionary model in which no recombination takes place, only mutation.
• The evolutionary history is explained by a tree in which every mutation is on an edge of the tree. All the species in one sub-tree contain a 0, and all species in the other contain a 1. Such a tree is called a perfect phylogeny.
• How can one reconstruct such a tree?
The 4-gamete condition
• A column i partitions the set of species into two sets, i0 and i1.
• A column is homogeneous w.r.t. a set of species if it has the same value for all species in the set. Otherwise, it is heterogeneous.
• EX: i is heterogeneous w.r.t. {A,D,E}
     i
A    0
B    0
C    0
D    1
E    1
F    1
i0 = {A, B, C}, i1 = {D, E, F}
4 Gamete Condition
• 4 Gamete Condition
– There exists a perfect phylogeny if and only if, for all pairs of columns (i,j), j is homogeneous w.r.t. i0 or w.r.t. i1.
– Equivalent to
– There exists a perfect phylogeny if and only if, for all pairs of columns (i,j), the following 4 rows do not all occur:
(0,0), (0,1), (1,0), (1,1)
4-gamete condition: proof
• Depending on the edge on which mutation j occurs, column j must be homogeneous w.r.t. either i0 or i1.
• (only if) Every perfect phylogeny satisfies the 4-gamete condition.
• (if) If the 4-gamete condition is satisfied, does a perfect phylogeny exist?
[Figure: a tree split by the edge carrying mutation i into the species sets i0 and i1.]
An algorithm for constructing a perfect phylogeny
• We will consider the case where 0 is the ancestral state, and 1 is the mutated state. This restriction will be removed later.
• In any tree, each node (except the root) has a single parent.– It is sufficient to construct a parent for every node.
• In each step, we add a column and refine some of the nodes containing multiple children.
• Stop if all columns have been considered.
Inclusion Property
• For any pair of columns i,j
– i < j if and only if i1 ⊇ j1
• Note that if i < j then the edge containing i is an ancestor of the edge containing j
[Figure: edge i lies above edge j in the tree.]
Example
     1 2 3 4 5
A    1 1 0 0 0
B    0 0 1 0 0
C    1 1 0 1 0
D    0 0 1 0 1
E    1 0 0 0 0
[Figure: root r with the leaves A, B, C, D, E attached directly to it.]
Initially, there is a single clade r, and each node has r as its parent
Sort columns
• Sort columns according to the inclusion property (note that the columns are already sorted here).
• This can be achieved by considering the columns as binary representations of numbers (most significant bit in row 1) and sorting in decreasing order
     1 2 3 4 5
A    1 1 0 0 0
B    0 0 1 0 0
C    1 1 0 1 0
D    0 0 1 0 1
E    1 0 0 0 0
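The sorting step can be sketched in Python by reading each column as a binary number with the first row as the most significant bit. The function name `sort_columns` is an assumption, and indices are 0-based, so column 1 of the slide is index 0.

```python
def sort_columns(M):
    """Column indices ordered by decreasing binary value (first row = most significant bit)."""
    n_rows, n_cols = len(M), len(M[0])

    def value(c):
        return int("".join(str(M[r][c]) for r in range(n_rows)), 2)

    return sorted(range(n_cols), key=value, reverse=True)

# Example matrix from the slide (rows A to E).
M = [
    [1, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [1, 0, 0, 0, 0],
]
print(sort_columns(M))  # [0, 1, 2, 3, 4]: the columns are already sorted

# The same matrix with its columns shuffled into the order 5, 3, 1, 4, 2.
shuffled = [[row[c] for c in (4, 2, 0, 3, 1)] for row in M]
print(sort_columns(shuffled))  # [2, 4, 1, 3, 0]
```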
Add first column
• In adding column i
– Check each edge and decide which side you belong to.
– Finally, add a node if you can resolve a clade.
[Figure: the tree after adding column 1, with root r, a new node u, and the leaves A to E, shown beside the example matrix.]
Adding other columns
• Add other columns on edges using the ordering property
[Figure: the final tree rooted at r with leaves A to E; its edges carry the labels 1, 2, 4, 3, 5 of the columns of the example matrix.]
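One common way to realize this construction in code is to sort the columns, then let each species trace a path from the root through the columns in which it carries a 1. The Python sketch below follows that approach, which may differ in detail from the edge-by-edge procedure on the slides. It assumes 0 is the ancestral state; the node labels `m1`, `m2`, … (the node just below the edge carrying that column's mutation) and the name `build_perfect_phylogeny` are hypothetical.

```python
def build_perfect_phylogeny(M, names):
    """Return a child -> parent map for a rooted perfect phylogeny, or None on conflict."""
    n_rows, n_cols = len(M), len(M[0])
    # Sort columns by decreasing binary value (first row = most significant bit).
    order = sorted(range(n_cols),
                   key=lambda c: [M[r][c] for r in range(n_rows)],
                   reverse=True)
    parent = {}
    for r, name in enumerate(names):
        current = "r"                       # every path starts at the root
        for c in order:
            if M[r][c] == 1:
                node = f"m{c + 1}"
                # A column's node must hang below the same parent for every species.
                if parent.setdefault(node, current) != current:
                    return None
                current = node
        parent[name] = current              # attach the species as a leaf
    return parent

M = [
    [1, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [1, 0, 0, 0, 0],
]
tree = build_perfect_phylogeny(M, ["A", "B", "C", "D", "E"])
for child in sorted(tree):
    print(child, "<-", tree[child])
```

For the example matrix, E hangs below the edge for column 1, A below column 2, C below column 4, B below column 3, and D below column 5.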
Unrooted case
• Switch the values in each column, so that 0 is the majority element.
• Apply the algorithm for the rooted case
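The switching step can be sketched as follows, assuming the data is a list of 0/1 rows. Leaving tied columns unchanged is an assumption, and `normalize_majority` is a hypothetical name.

```python
def normalize_majority(M):
    """Flip every column in which 1 is the strict majority, so that 0 becomes the majority value."""
    n_rows = len(M)
    flip = [2 * sum(row[c] for row in M) > n_rows for c in range(len(M[0]))]
    return [[1 - x if flip[c] else x for c, x in enumerate(row)] for row in M]

# Hypothetical 3 x 2 matrix: the first column has two 1s out of three, so it is flipped.
M = [[1, 0], [1, 0], [0, 1]]
print(normalize_majority(M))  # [[0, 0], [0, 0], [1, 1]]
```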
Handling recombination
• A tree is not sufficient as a sequence may have 2 parents
• Recombination leads to loss of correlation between columns
Linkage (Dis)-equilibrium (LD)
• Consider sites A & B
• Case 1: No recombination
– Pr[A,B = (0,1)] = 0.25
  • Linkage disequilibrium
• Case 2: Extensive recombination
– Pr[A,B = (0,1)] = Pr[A = 0] · Pr[B = 1] = 0.5 × 0.25 = 0.125
  • Linkage equilibrium
A B
0 1
0 1
0 0
0 0
1 0
1 0
1 0
1 0
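The two probabilities can be computed from the table directly. The sketch below compares the observed frequency of the pair (A, B) = (0, 1) with the product of the single-site frequencies; their difference is the usual disequilibrium coefficient D. The function name `ld_stats` is an assumption.

```python
def ld_stats(pairs, a=0, b=1):
    """Observed Pr[A=a, B=b], the product Pr[A=a] * Pr[B=b], and their difference D."""
    n = len(pairs)
    p_a = sum(1 for x, _ in pairs if x == a) / n
    p_b = sum(1 for _, y in pairs if y == b) / n
    p_ab = sum(1 for x, y in pairs if (x, y) == (a, b)) / n
    return p_ab, p_a * p_b, p_ab - p_a * p_b

# The eight (A, B) rows from the slide.
haplotypes = [(0, 1), (0, 1), (0, 0), (0, 0), (1, 0), (1, 0), (1, 0), (1, 0)]
observed, expected, D = ld_stats(haplotypes)
print(observed, expected, D)  # 0.25 0.125 0.125
```

A value of D near 0 corresponds to linkage equilibrium; here the observed frequency is twice the product of the single-site frequencies, so the two sites are in disequilibrium.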