cse182-l17 clustering population genetics: basics

CSE182-L17

Clustering

Population Genetics: Basics

Unsupervised Clustering

• Given a set of points (in n-dimensions), and k, compute the k “best clusters”.

• In k-means, clustering is done by choosing k centers (means).

• Each point is assigned to the closest center.

• The notion of “best” is defined by distances to the center.

• Question: How can we compute the k best centers?

Clusters

Distance

• Given a data point v and a set of points X, define the distance from v to X, d(v, X), as the (Euclidean) distance from v to the closest point of X.

• Given a set of n data points V = {v1, …, vn} and a set of k points X, define the Squared Error Distortion

$d(V, X) = \frac{1}{n} \sum_{i=1}^{n} d(v_i, X)^2$
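A minimal Python sketch of the two definitions on this slide. The function names dist_to_set and distortion are illustrative and not part of the lecture; points are assumed to be tuples of coordinates.

```python
import math

def dist_to_set(v, X):
    """d(v, X): Euclidean distance from v to the closest point of X."""
    return min(math.dist(v, x) for x in X)

def distortion(V, X):
    """d(V, X): mean of the squared distances d(v_i, X)^2 over all data points."""
    return sum(dist_to_set(v, X) ** 2 for v in V) / len(V)

# Toy data (hypothetical): two points at distance 1 from a center, one point on a center.
V = [(0.0, 0.0), (0.0, 2.0), (4.0, 0.0)]
X = [(0.0, 1.0), (4.0, 0.0)]
print(distortion(V, X))
```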

K-Means Clustering Problem: Formulation

• Input: A set, V, consisting of n points and a parameter k

• Output: A set X consisting of k points (cluster centers) that minimizes the squared error distortion d(V,X) over all possible choices of X

This problem is NP-hard in general.

1-Means Clustering Problem: an Easy Case

• Input: A set, V, consisting of n points.

• Output: A single point X that minimizes d(V,X) over all possible choices of X.

This problem is easy.
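One way to see why, sketched under the assumption that distances are Euclidean as in the definition of d(v, X): with a single center x, the distortion is a convex quadratic in x, so setting its gradient to zero gives the minimizer, which is the centroid (mean) of the data points.

```latex
d(V, \{x\}) = \frac{1}{n} \sum_{i=1}^{n} \lVert v_i - x \rVert^2,
\qquad
\nabla_x\, d(V, \{x\}) = -\frac{2}{n} \sum_{i=1}^{n} (v_i - x) = 0
\;\Longrightarrow\;
x = \frac{1}{n} \sum_{i=1}^{n} v_i .
```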

However, it becomes very difficult for more than one center.

An efficient heuristic method for k-means clustering is Lloyd's algorithm.

K-means: Lloyd’s algorithm

• Choose k centers at random:
– X’ = {x1, x2, x3, …, xk}

• Repeat
– X = X’
– Assign each v ∈ V to the closest cluster j:
• if d(v, xj) = d(v, X), then Cj ← Cj ∪ {v}
– Recompute X’:
• x’j ← (∑ v∈Cj v) / |Cj|

• until (X’ = X)
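The loop above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the lecture's implementation: the function name lloyd, the optional centers argument (used here to make the run reproducible), the max_iter safeguard, and the choice to leave the center of an empty cluster where it is are all additions for the sketch.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def lloyd(points, k, centers=None, seed=0, max_iter=100):
    """Alternate assignment and mean-update steps until the centers stop changing."""
    rng = random.Random(seed)
    if centers is None:
        centers = rng.sample(points, k)      # choose k centers at random
    centers = [tuple(c) for c in centers]
    clusters = []
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its closest center.
        clusters = [[] for _ in range(k)]
        for v in points:
            j = min(range(k), key=lambda c: sq_dist(v, centers[c]))
            clusters[j].append(v)
        # Update step: each center becomes the mean of its cluster.
        new_centers = []
        for j, members in enumerate(clusters):
            if members:
                dim = len(members[0])
                new_centers.append(
                    tuple(sum(p[d] for p in members) / len(members) for d in range(dim))
                )
            else:
                new_centers.append(centers[j])  # assumption: keep an empty cluster's center
        if new_centers == centers:              # until X' = X
            break
        centers = new_centers
    return centers, clusters

# Toy data (hypothetical): two well-separated pairs of points.
pts = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
final_centers, final_clusters = lloyd(pts, 2, centers=[(0.0, 0.0), (5.0, 5.0)])
print(final_centers)
```

Because the result depends on the starting centers, a common practice is to run the procedure from several random starts and keep the clustering with the smallest distortion.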

[Figure: two scatter plots of the data points, each with axes running from 0 to 5 (x-axis: expression in condition 1), showing the cluster centers x1, x2, x3.]

Conservative K-Means Algorithm

• The Lloyd algorithm is fast, but in each iteration it moves many data points, not necessarily causing better convergence.

• A more conservative method would be to move one point at a time, and only if the move improves the overall clustering cost.

• The smaller the clustering cost of a partition of the data points, the better the clustering.

• Different methods can be used to measure this clustering cost (for example, the Lloyd algorithm used the squared error distortion).
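A sketch of the conservative idea in Python, assuming the clustering cost is the squared error distortion measured against each cluster's own mean. The names partition_cost and conservative_kmeans are hypothetical, and recomputing the full cost for every trial move is done for clarity rather than speed.

```python
def partition_cost(points, labels, k):
    """Squared error distortion of a partition, using each cluster's mean as its center."""
    total = 0.0
    for j in range(k):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if not members:
            continue
        dim = len(members[0])
        center = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        total += sum(sum((p[d] - center[d]) ** 2 for d in range(dim)) for p in members)
    return total / len(points)

def conservative_kmeans(points, labels, k):
    """Move one point at a time, keeping a move only if it lowers the cost."""
    labels = list(labels)
    best = partition_cost(points, labels, k)
    improved = True
    while improved:
        improved = False
        for i in range(len(points)):
            for j in range(k):
                if j == labels[i]:
                    continue
                old = labels[i]
                labels[i] = j                      # tentatively move point i to cluster j
                trial = partition_cost(points, labels, k)
                if trial < best - 1e-12:
                    best = trial                   # keep the move
                    improved = True
                else:
                    labels[i] = old                # undo the move
    return labels, best

# Toy data (hypothetical) with a deliberately poor starting partition.
pts = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
final_labels, final_cost = conservative_kmeans(pts, [0, 1, 0, 1], 2)
print(final_labels, final_cost)
```

Each accepted move strictly lowers the cost and there are finitely many partitions, so the loop terminates, although possibly at a local optimum.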

Microarray summary

• Microarrays (like MS) are a technology for probing the dynamic state of the cell.

• We answered questions like the following:
– Which genes are coordinately regulated (they have similar expression patterns in different conditions)?
– How can we reduce the dimensionality of the system?
– Using gene expression values from a sample, can you predict if the sample is normal (state A) or diseased (state B)?

• The techniques employed for classification/clustering etc. are general and can be employed in a number of contexts.

Microarray non-summary

• We did not cover:
– How are the gene expression values measured (the technology)? (CSE183)
– How do you control variability across different experiments (normalization)? (CSE183)
– What controls the expression of a gene (gene regulation), or a set of genes? (CSE 181)

Population Genetics

• The sequence of an individual does not say anything about the diversity of a population.

• Small individual genetic differences can have a profound impact on “phenotypes”
– Response to drugs
– Susceptibility to diseases

• Soon, we will have sequences of many individuals from the same species. Studying the differences will be a major challenge.

Population Structure

• 377 locations (loci) were sampled in 1000 people from 52 populations.

• 6 genetic clusters were obtained, 5 of which corresponded to major geographic regions (Rosenberg et al. Science 2002)

[Figure: cluster membership of the sampled individuals, grouped by region: Africa, Eurasia, East Asia, America, Oceania.]

Population Genetics

• What is it about our genetic makeup that makes us measurably different?

• These genetic differences are correlated with phenotypic differences

• With cost reduction in sequencing and genotyping technologies, we will know the sequence for entire populations of individuals.

• Here, we will study the basics of this polymorphism data, and tools that are being developed to analyze it.

What causes variation in a population?

• Mutations (may lead to SNPs)
• Recombinations
• Other genetic events (Ex: microsatellite repeats)
• Deletions, inversions

Single Nucleotide Polymorphisms

[Figure: binary SNP matrix, one row per sequence; the rows were fused during extraction.]

000001010111000110100101000101010010000000110001111000000101100110

Infinite Sites Assumption: Each site mutates at most once

Short Tandem Repeats

Sequence                     Repeats
GCTAGATCATCATCATCATTGCTAG    4
GCTAGATCATCATCATTGCTAGTTA    3
GCTAGATCATCATCATCATCATTGC    5
GCTAGATCATCATCATTGCTAGTTA    3
GCTAGATCATCATCATTGCTAGTTA    3
GCTAGATCATCATCATCATCATTGC    5
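The repeat counts 4, 3, 5, 3, 3, 5 can be reproduced with a short Python sketch. Two assumptions are made here: the sequence data on this slide is read as six reads of 25 bases each, and the repeat unit is taken to be ATC (reading it as CAT or TCA gives the same counts). The function name max_tandem_repeats is illustrative.

```python
import re

def max_tandem_repeats(sequence, unit):
    """Largest number of back-to-back copies of `unit` found anywhere in `sequence`."""
    pattern = re.compile(f"(?:{re.escape(unit)})+")
    runs = [len(m.group(0)) // len(unit) for m in pattern.finditer(sequence)]
    return max(runs, default=0)

# Assumed split of the slide's sequence data into six 25-base reads.
reads = [
    "GCTAGATCATCATCATCATTGCTAG",
    "GCTAGATCATCATCATTGCTAGTTA",
    "GCTAGATCATCATCATCATCATTGC",
    "GCTAGATCATCATCATTGCTAGTTA",
    "GCTAGATCATCATCATTGCTAGTTA",
    "GCTAGATCATCATCATCATCATTGC",
]
repeat_counts = [max_tandem_repeats(r, "ATC") for r in reads]
print(repeat_counts)
```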

STR can be used as a DNA fingerprint

• Consider a collection of regions with variable length repeats.

• Variable length repeats will lead to variable length DNA

• Vector of lengths is a fingerprint

[Figure: table of repeat lengths, with positions as columns and individuals as rows; extracted values: 4 23 35 13 23 15 3]

Recombination

[Figure: two parental sequences, 00000000 and 11111111, recombine to give 00011111.]
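Reading the figure as two parental sequences and one recombinant (an interpretation of the slide), a single crossover can be sketched in Python as a prefix from one parent joined to a suffix from the other. The function name recombine and the breakpoint parameter are illustrative.

```python
def recombine(parent_a, parent_b, breakpoint):
    """Single crossover: the first `breakpoint` sites come from parent_a, the rest from parent_b."""
    if len(parent_a) != len(parent_b):
        raise ValueError("parents must have the same number of sites")
    return parent_a[:breakpoint] + parent_b[breakpoint:]

child = recombine("00000000", "11111111", 3)
print(child)
```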

What if there were no recombinations?

• Life would be simpler
• Each sequence would have a single parent
• The relationship is expressed as a tree.

The Infinite Sites Assumption

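[Figure: a tree of sequences. The root is 0 0 0 0 0 0 0 0. A mutation at site 3 gives 0 0 1 0 0 0 0 0. From there, a mutation at site 8 gives 0 0 1 0 0 0 0 1, and a mutation at site 5 gives 0 0 1 0 1 0 0 0.]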

• The different sites are linked. A 1 in position 8 implies 0 in position 5, and vice versa.

• Some phenotypes could be linked to the polymorphisms

• Some of the linkage is “destroyed” by recombination

Infinite sites assumption and Perfect Phylogeny

• Each site is mutated at most once in the history.

• All descendants must carry the mutated value, and all others must carry the ancestral value

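[Figure: a tree with mutation i on one edge; the subtree below that edge has 1 in position i, and the rest of the tree has 0 in position i.]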

Perfect Phylogeny

• Assume an evolutionary model in which no recombination takes place, only mutation.

• The evolutionary history is explained by a tree in which every mutation is on an edge of the tree. All the species in one sub-tree contain a 0, and all species in the other contain a 1. Such a tree is called a perfect phylogeny.

• How can one reconstruct such a tree?

The 4-gamete condition

• A column i partitions the set of species into two sets: i0 (species with a 0 in column i) and i1 (species with a 1 in column i).

• A column is homogeneous w.r.t. a set of species if it has the same value for all species in the set. Otherwise, it is heterogeneous.

• EX: i is heterogeneous w.r.t. {A, D, E}

    i
A   0
B   0
C   0
D   1
E   1
F   1

i0 = {A, B, C}, i1 = {D, E, F}

4 Gamete Condition

• 4 Gamete Condition
– There exists a perfect phylogeny if and only if, for all pairs of columns (i, j), j is homogeneous w.r.t. i0 or w.r.t. i1 (that is, j is not heterogeneous w.r.t. both).
– Equivalent to:
– There exists a perfect phylogeny if and only if, for all pairs of columns (i, j), the following 4 rows do not all exist:

(0,0), (0,1), (1,0), (1,1)
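The pairwise test can be sketched in Python as follows. The function name four_gamete_violations is illustrative; the matrix is assumed to be a list of rows (one per species) of 0/1 values, and the first test matrix is the 5 x 5 example used in the lecture's Example slide.

```python
from itertools import combinations

def four_gamete_violations(matrix):
    """Return the column pairs (0-based) in which all four rows 00, 01, 10, 11 occur."""
    n_cols = len(matrix[0])
    bad_pairs = []
    for i, j in combinations(range(n_cols), 2):
        gametes = {(row[i], row[j]) for row in matrix}
        if len(gametes) == 4:          # all of (0,0), (0,1), (1,0), (1,1) are present
            bad_pairs.append((i, j))
    return bad_pairs

example = [
    [1, 1, 0, 0, 0],  # A
    [0, 0, 1, 0, 0],  # B
    [1, 1, 0, 1, 0],  # C
    [0, 0, 1, 0, 1],  # D
    [1, 0, 0, 0, 0],  # E
]
conflicting = [[0, 0], [0, 1], [1, 0], [1, 1]]  # hypothetical matrix with all four rows

print(four_gamete_violations(example))      # no violating pairs
print(four_gamete_violations(conflicting))  # the single pair of columns violates the condition
```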

4-gamete condition: proof

• Depending on which edge the mutation j occurs on, column j must be homogeneous w.r.t. either i0 or i1.

• (only if) Every perfect phylogeny satisfies the 4-gamete condition.

• (if) If the 4-gamete condition is satisfied, does a perfect phylogeny exist?

[Figure: a tree in which the edge carrying mutation i separates the species in i0 from those in i1.]

An algorithm for constructing a perfect phylogeny

• We will consider the case where 0 is the ancestral state, and 1 is the mutated state. This restriction will be removed later.

• In any tree, each node (except the root) has a single parent.
– It is sufficient to construct a parent for every node.

• In each step, we add a column and refine some of the nodes containing multiple children.

• Stop if all columns have been considered.

Inclusion Property

• For any pair of columns i,j

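– i < j if and only if i1 ⊇ j1 (every species with a 1 in column j also has a 1 in column i)

• Note that if i < j then the edge containing i is an ancestor of the edge containing j

[Figure: edge i drawn above edge j on the same path from the root.]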

Example

    1 2 3 4 5
A   1 1 0 0 0
B   0 0 1 0 0
C   1 1 0 1 0
D   0 0 1 0 1
E   1 0 0 0 0

[Figure: root r with A, B, C, D, E as its children.]

Initially, there is a single clade r, and each node has r as its parent

Sort columns

• Sort columns according to the inclusion property (note that the columns are already sorted here).

• This can be achieved by considering the columns as binary representations of numbers (most significant bit in row 1) and sorting in decreasing order

    1 2 3 4 5
A   1 1 0 0 0
B   0 0 1 0 0
C   1 1 0 1 0
D   0 0 1 0 1
E   1 0 0 0 0
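The sorting step can be sketched in Python by reading each column top to bottom as a binary number (row A is the most significant bit) and sorting in decreasing order. The function names column_value and sort_columns are illustrative. For the matrix above the column values are 21, 20, 10, 4, 2, which confirms that the columns are already sorted.

```python
def column_value(matrix, j):
    """Read column j from the top row down as a binary number."""
    value = 0
    for row in matrix:
        value = value * 2 + row[j]
    return value

def sort_columns(matrix):
    """Return the 0-based column indices in decreasing order of their binary value."""
    n_cols = len(matrix[0])
    return sorted(range(n_cols), key=lambda j: column_value(matrix, j), reverse=True)

matrix = [
    [1, 1, 0, 0, 0],  # A
    [0, 0, 1, 0, 0],  # B
    [1, 1, 0, 1, 0],  # C
    [0, 0, 1, 0, 1],  # D
    [1, 0, 0, 0, 0],  # E
]
values = [column_value(matrix, j) for j in range(5)]
order = sort_columns(matrix)
print(values, order)
```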

Add first column

• In adding column i
– Check each edge and decide which side you belong.
– Finally add a node if you can resolve a clade

[Figure: tree after adding column 1, with root r, a new internal node u, and leaves A, B, C, D, E.]

    1 2 3 4 5
A   1 1 0 0 0
B   0 0 1 0 0
C   1 1 0 1 0
D   0 0 1 0 1
E   1 0 0 0 0

Adding other columns

• Add other columns on edges using the ordering property

[Figure: final tree rooted at r with leaves A, B, C, D, E; the edges are labeled with columns 1, 2, 4, 3, 5.]

    1 2 3 4 5
A   1 1 0 0 0
B   0 0 1 0 0
C   1 1 0 1 0
D   0 0 1 0 1
E   1 0 0 0 0
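One way to carry out the whole construction for the rooted case is sketched below in Python. It is a variant that processes one species at a time rather than one column at a time: after sorting the columns, each species walks down from the root through the columns in which it carries a 1. The function name build_perfect_phylogeny and the internal node names u1, u2, … (the node below the edge labeled with that column) are assumptions made for this sketch.

```python
def build_perfect_phylogeny(matrix, names):
    """Return {node: parent} for a rooted perfect phylogeny (0 = ancestral state)."""
    n_cols = len(matrix[0])

    def column_value(j):
        # Column read top to bottom as a binary number, first row most significant.
        return int("".join(str(row[j]) for row in matrix), 2)

    order = sorted(range(n_cols), key=column_value, reverse=True)

    parent = {}       # node name -> parent node name
    node_below = {}   # column index -> internal node below that column's edge
    for name, row in zip(names, matrix):
        node = "r"
        for j in order:
            if row[j] != 1:
                continue
            if j not in node_below:
                node_below[j] = f"u{j + 1}"       # new edge labeled with column j+1
                parent[node_below[j]] = node
            elif parent[node_below[j]] != node:
                # The same mutation would have to sit on two different edges.
                raise ValueError(f"column {j + 1} conflicts; no perfect phylogeny")
            node = node_below[j]
        parent[name] = node                       # attach the species as a leaf
    return parent

matrix = [
    [1, 1, 0, 0, 0],  # A
    [0, 0, 1, 0, 0],  # B
    [1, 1, 0, 1, 0],  # C
    [0, 0, 1, 0, 1],  # D
    [1, 0, 0, 0, 0],  # E
]
tree = build_perfect_phylogeny(matrix, ["A", "B", "C", "D", "E"])
for child, par in tree.items():
    print(child, "<-", par)
```

For the example matrix this places columns 1, 2, 4 on one path from the root (leading to C, with A hanging below edge 2 and E below edge 1) and columns 3, 5 on another (leading to D, with B below edge 3).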

Unrooted case

• In each column where 1 is the majority value, swap 0 and 1, so that 0 is the majority element in every column.

• Apply the algorithm for the rooted case
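The column-switching step can be sketched in Python as below. The function name normalize_majority is illustrative, and leaving a column unchanged when 0 and 1 are tied is an assumption of this sketch.

```python
def normalize_majority(matrix):
    """Flip every column in which 1 is the strict majority, so that 0 becomes the majority."""
    n_rows = len(matrix)
    n_cols = len(matrix[0])
    flip = [2 * sum(row[j] for row in matrix) > n_rows for j in range(n_cols)]
    return [[1 - v if flip[j] else v for j, v in enumerate(row)] for row in matrix]

# Toy matrix (hypothetical): column 1 has two 1s out of three and is flipped.
normalized = normalize_majority([[1, 0], [1, 0], [0, 1]])
print(normalized)
```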

Handling recombination

• A tree is not sufficient as a sequence may have 2 parents

• Recombination leads to loss of correlation between columns

Linkage (Dis)-equilibrium (LD)

• Consider sites A & B

• Case 1: No recombination
– Pr[A,B = (0,1)] = 0.25
• Linkage disequilibrium

• Case 2: Extensive recombination
– Pr[A,B = (0,1)] = 0.125
• Linkage equilibrium

A B
0 1
0 1
0 0
0 0
1 0
1 0
1 0
1 0
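Both probabilities on this slide can be checked against the (A, B) values listed, assuming the flattened table is read as eight rows. In that reading, 0.25 is the observed frequency of the pair (0,1), and 0.125 is the product of the marginal frequencies Pr[A=0] and Pr[B=1], which is what the pair frequency would be if the two sites were independent (linkage equilibrium). The function name joint_and_expected is illustrative.

```python
from fractions import Fraction

# Assumed reading of the slide's table: eight (A, B) rows.
rows = [(0, 1), (0, 1), (0, 0), (0, 0), (1, 0), (1, 0), (1, 0), (1, 0)]

def joint_and_expected(rows, a, b):
    """Observed Pr[A=a, B=b] and the value Pr[A=a] * Pr[B=b] expected under independence."""
    n = len(rows)
    p_joint = Fraction(sum(1 for r in rows if r == (a, b)), n)
    p_a = Fraction(sum(1 for r in rows if r[0] == a), n)
    p_b = Fraction(sum(1 for r in rows if r[1] == b), n)
    return p_joint, p_a * p_b

observed, expected = joint_and_expected(rows, 0, 1)
print(float(observed), float(expected))
```

The difference between the two values (here 1/8) is one simple measure of how far the sites are from linkage equilibrium.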
