an introduction to model-based clustering · an introduction to model-based clustering anish r....

42
An Introduction to Model-Based Clustering Anish R. Shah, CFA Northfield Information Services [email protected] London Nov 17, 2011

Upload: others

Post on 14-Oct-2020

20 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: An Introduction to Model-Based Clustering · An Introduction to Model-Based Clustering Anish R. Shah, CFA Northfield Information Services . Anish@northinfo.com . London . Nov 17,

An Introduction to Model-Based Clustering

Anish R. Shah, CFA Northfield Information Services

[email protected]

London Nov 17, 2011

Page 2: An Introduction to Model-Based Clustering · An Introduction to Model-Based Clustering Anish R. Shah, CFA Northfield Information Services . Anish@northinfo.com . London . Nov 17,

Clustering

• Observe characteristics of some objects – {x1, …, xN} N objects

• Goal: group alike objects

– say there are M clusters {z1, …, zN} cluster memberships zk = object k’s membership, a number from 1..M

– k, j in the same cluster → xk, xj similar -or- k, j in different clusters → xk, xj dissimilar

Page 3: An Introduction to Model-Based Clustering · An Introduction to Model-Based Clustering Anish R. Shah, CFA Northfield Information Services . Anish@northinfo.com . London . Nov 17,

Examples of Characteristics • Clustering dog breeds

– x = (snout length / width of face, dog’s BMI, % of day spent sleeping)

• Positioning sandwich carts (in cluster centers)

– x = (location of office worker)

• Clustering country indices via returns – x = (past 5 years of monthly returns)

• Clustering stocks via fundamentals & returns

– x = (beta to the market, dividend rate, past 2 years of monthly returns)

Page 4: An Introduction to Model-Based Clustering · An Introduction to Model-Based Clustering Anish R. Shah, CFA Northfield Information Services . Anish@northinfo.com . London . Nov 17,

Machine Learning • Rather than being programmed with rules, the system inferentially learns

the patterns/rules of reality from data

• Supervised Learning – Some of the training data is labeled – e.g. There are 5 company types - AAPL & MSFT are type 1, ... , XOM is type 5.

Find the prototype for each type and label the rest of the universe – e.g. Amazon & Netflix recommendations

• Unsupervised Learning

– None of the data has labels – Organize the system to maximize some criterion – e.g. Clustering maximizes similarity within each cluster – e.g. Principal Components Analysis maximizes explained variance – Vanilla clustering is the canonical example of unsupervised machine learning

Page 5: An Introduction to Model-Based Clustering · An Introduction to Model-Based Clustering Anish R. Shah, CFA Northfield Information Services . Anish@northinfo.com . London . Nov 17,

Review of Forms of Hard Clustering • ‘Hard’ means an object is assigned to only one cluster

– In contrast, model-based clustering can give a probability distribution over the clusters

• Hierarchical Clustering

– Maximize distance between clusters – Flavors come from different ways of measuring distance

• Single Linkage – distance between the two nearest elements • Complete Linkage – distance between the two farthest elements • Average Linkage – mean (or median) distance between all elements

• K-Means

– Minimize mean (median in K-medians) distance within clusters

Page 6: An Introduction to Model-Based Clustering · An Introduction to Model-Based Clustering Anish R. Shah, CFA Northfield Information Services . Anish@northinfo.com . London . Nov 17,

K-Means / K-Medians

• K-Means (heuristically) assigns objects to clusters to minimize the average squared distance (absolute distance in K-Medians) from object to cluster center

• Minimize ⅟N ∑k=1..N ||xk- μzk||2

over z1..zN = cluster assignments μ1..μM = centers of the clusters

Page 7: An Introduction to Model-Based Clustering · An Introduction to Model-Based Clustering Anish R. Shah, CFA Northfield Information Services . Anish@northinfo.com . London . Nov 17,

K-Means Algorithm 1. Randomly assign objects to clusters

2. Calculate the center (mean) of each cluster

3. Check assignments for all the objects: if another center is nearer,

reassign the object to that cluster

4. Repeat steps 2-3 until no reassignments occur

• Extremely fast • The solution is a local max, so several starting points are used in

practice • (K-Medians) For robustness, step 2 finds centers via median

Page 8: An Introduction to Model-Based Clustering · An Introduction to Model-Based Clustering Anish R. Shah, CFA Northfield Information Services . Anish@northinfo.com . London . Nov 17,

Mixture of Gaussians: A Model-Based Clustering Similar to K-Means

• Observe data for N objects, {x1, …, xN}

• Each cluster generates data distributed normally around its center – when object k is from cluster m, p(xk) ~ exp(||xk- μm||2 / σm

2)

• Some clusters appear more frequently than others – given no observation information, p(an object belongs to cluster m) = πM

• Find the setup that make the observations most likely to occur

– cluster centers {μ1 .. μM} – variances {σ1

2.. σM2}

– cluster frequencies {π1 .. πM}

Page 9: An Introduction to Model-Based Clustering · An Introduction to Model-Based Clustering Anish R. Shah, CFA Northfield Information Services . Anish@northinfo.com . London . Nov 17,

Model-Based Clustering • Observe characteristics of some objects

– {x1, …, xN} N objects

• An object belongs to one of M clusters, but which is unknown – {z1, …, zN} cluster memberships, numbers from 1..M

• Some clusters are more likely than others

– P(zk=m) = πm (πm = frequency cluster m occurs)

• Within a cluster, objects’ characteristics are generated by the same distribution, which has free parameters – P(xk|zk=m) = f(xk, λm) (λm = parameters of cluster m) – f need not be Gaussian

Page 10: An Introduction to Model-Based Clustering · An Introduction to Model-Based Clustering Anish R. Shah, CFA Northfield Information Services . Anish@northinfo.com . London . Nov 17,

Model-Based Clustering (2) • Now you have a model connecting the observations to the cluster

memberships and parameters – P(xk) = ∑m=1..M P(xk|zk=m) P(zk=m) – = ∑m=1..M f(xk, λm) πm

– P(x1 … xN) = ∏k=1..N P(xk) (assuming x’s are independent)

1. Find the values of the parameters by maximizing the likelihood (usually

the log of the likelihood) of the observations – max log P(x1 … xN) over λ1 … λM and π1 … πM – This turns out to be a nonlinear mess and is greatly aided by the “EM

Algorithm” (next slide) 2. With parameters in hand, calculate the probability of membership given

the observations – P(z|x) = P(x|z) P(z) / P(x)

Page 11: An Introduction to Model-Based Clustering · An Introduction to Model-Based Clustering Anish R. Shah, CFA Northfield Information Services . Anish@northinfo.com . London . Nov 17,

EM (Expectation-Maximization) Algorithm Setup

• Let θ = (λ1 … λM, π1 … πM), the parameters being maximized over

• Observe x. Don’t know z, the cluster memberships

• Want to maximize log p(x|θ), but it is too complicated

• EM can be used when – It’s possible to make an approximation of p(z|x,θ), the

conditional distribution of the hidden variables – log p(x,z|θ), the probability if all the variables were known, is

easy to manipulate

Page 12: An Introduction to Model-Based Clustering · An Introduction to Model-Based Clustering Anish R. Shah, CFA Northfield Information Services . Anish@northinfo.com . London . Nov 17,

The EM Algorithm • Want to maximizeθ log p(x|θ) = log ∫ p(x,z|θ) dz • (E Step)

– Create an approximate distribution of the missing data. Call it u(z) Ideally this is p(z|x,θ) – Let Q(θ) = the log likelihood under θ averaged by u(z) = ∫ log p(x,z|θ) u(z) dz

• (M Step) – Maximize Q(θ) over θ – θnew = the maximizer

• Repeat E & M steps until convergence

• EM switches between 1) finding an approximate distribution of missing

data given the parameters and 2) finding better parameters given the approximation

Page 13: An Introduction to Model-Based Clustering · An Introduction to Model-Based Clustering Anish R. Shah, CFA Northfield Information Services . Anish@northinfo.com . London . Nov 17,

Determining the # of Clusters: Information Criteria

• BayesianIC, AkaikeIC, AkaikeICcorrected, …

• Minimum Description Length ideas

• Log-likelihood of observations penalized by the model’s complexity (# of parameters)

• My experience – Optimal # of clusters varies greatly with the choice of

criterion – e.g. BIC says 2, AICc says 9

Page 14: An Introduction to Model-Based Clustering · An Introduction to Model-Based Clustering Anish R. Shah, CFA Northfield Information Services . Anish@northinfo.com . London . Nov 17,

Experiments • Country indices – monthly local currency returns of constituent

companies weighted by √cap

• An aside: I looked into dividing index returns by VIX to account for time varying volatility, but high volatility periods shrink too much

• A period’s clusters are inferred from past 60 months equally weighted

• 2 cluster distributions recall P(xk|zk=m) = f(xk, λm) (λm = parameters of m)

– Gaussian (2pi σm2)-D/2 ∏i=1..D exp(-½[(xk

i-μmi)/σm]2)

– Laplace (2σm2)-D/2 ∏i=1..D exp(-√2|xk

i-μmi|/σm)

Page 15: An Introduction to Model-Based Clustering · An Introduction to Model-Based Clustering Anish R. Shah, CFA Northfield Information Services . Anish@northinfo.com . London . Nov 17,

Jan 1998 – Gaussian, 6 clusters

33% 27% 25%

10%

2% 2% 0%

5%

Spre

ad

(% R

et/M

o)

Spread (MSE of monthly returns) and Size (% of probability mass)

0%

25%

50%

75%

100%

Uni

ted

Stat

es

Uni

ted

King

dom

Ca

nada

Fr

ance

Ita

ly

Germ

any

Denm

ark

Russ

ia

Spai

n Be

lgiu

m

Aust

ria

Switz

erla

nd

Swed

en

Nor

way

N

ethe

rland

s Tu

rkey

Ja

pan

Aust

ralia

Ho

ng K

ong

Mal

aysia

In

dia

Sout

h Ko

rea

Taiw

an

Sout

h Af

rica

Sing

apor

e Th

aila

nd

New

Zea

land

Ph

ilipp

ines

Pa

kist

an

Finl

and

Chile

In

done

sia

Mex

ico

Braz

il Gr

eece

Po

rtug

al

Pola

nd

Colo

mbi

a Ar

gent

ina

Czec

h Re

publ

ic

Hung

ary

Irela

nd

Isra

el

Vene

zuel

a Pe

ru

Chin

a Lu

xem

bour

g Sr

i Lan

ka

Page 16: An Introduction to Model-Based Clustering · An Introduction to Model-Based Clustering Anish R. Shah, CFA Northfield Information Services . Anish@northinfo.com . London . Nov 17,

Jan 1998 – Gaussian, 5 clusters

0%

25%

50%

75%

100%

Uni

ted

Stat

es

Uni

ted

King

dom

Ca

nada

Fr

ance

Ita

ly

Germ

any

Denm

ark

Russ

ia

Spai

n Be

lgiu

m

Aust

ria

Switz

erla

nd

Swed

en

Nor

way

N

ethe

rland

s Tu

rkey

Ja

pan

Aust

ralia

Ho

ng K

ong

Mal

aysia

In

dia

Sout

h Ko

rea

Taiw

an

Sout

h Af

rica

Sing

apor

e Th

aila

nd

New

Zea

land

Ph

ilipp

ines

Pa

kist

an

Finl

and

Indo

nesia

M

exic

o Br

azil

Gree

ce

Pola

nd

Colo

mbi

a Ch

ile

Irela

nd

Port

ugal

Is

rael

Ve

nezu

ela

Peru

Ar

gent

ina

Czec

h Re

publ

ic

Hung

ary

Chin

a Lu

xem

bour

g Sr

i Lan

ka

33% 27% 15%

23%

2% 0%

5%

Spre

ad

(% R

et/M

o)

Spread (MSE of monthly returns) and Size (% of probability mass)

Page 17: An Introduction to Model-Based Clustering · An Introduction to Model-Based Clustering Anish R. Shah, CFA Northfield Information Services . Anish@northinfo.com . London . Nov 17,

Jan 1998 – Gaussian, 4 clusters

0%

25%

50%

75%

100%

Uni

ted

Stat

es

Uni

ted

King

dom

Ca

nada

Fr

ance

Ita

ly

Germ

any

Denm

ark

Russ

ia

Spai

n Be

lgiu

m

Aust

ria

Switz

erla

nd

Swed

en

Nor

way

N

ethe

rland

s Ja

pan

Aust

ralia

Ho

ng K

ong

Mal

aysia

In

dia

Sout

h Ko

rea

Taiw

an

Sout

h Af

rica

Sing

apor

e Th

aila

nd

New

Zea

land

Ph

ilipp

ines

Pa

kist

an

Finl

and

Indo

nesia

M

exic

o Br

azil

Turk

ey

Gree

ce

Port

ugal

Po

land

Ch

ile

Irela

nd

Sri L

anka

Is

rael

Ve

nezu

ela

Peru

Co

lom

bia

Arge

ntin

a Cz

ech

Repu

blic

Hu

ngar

y Ch

ina

Luxe

mbo

urg

31% 27% 17%

25%

0%

5%

Spre

ad

(% R

et/M

o)

Spread (MSE of monthly returns) and Size (% of probability mass)

Page 18: An Introduction to Model-Based Clustering · An Introduction to Model-Based Clustering Anish R. Shah, CFA Northfield Information Services . Anish@northinfo.com . London . Nov 17,

Jan 1998 – Gaussian, 3 clusters

0%

25%

50%

75%

100%

Uni

ted

Stat

es

Uni

ted

King

dom

Ca

nada

Fr

ance

Ita

ly

Germ

any

Finl

and

Denm

ark

Russ

ia

Spai

n Be

lgiu

m

Aust

ria

Switz

erla

nd

Swed

en

Nor

way

N

ethe

rland

s Tu

rkey

Gr

eece

Ja

pan

Aust

ralia

Ho

ng K

ong

Mal

aysia

In

dia

Sout

h Ko

rea

Taiw

an

Sout

h Af

rica

Sing

apor

e Th

aila

nd

New

Zea

land

Ph

ilipp

ines

Pa

kist

an

Chile

In

done

sia

Mex

ico

Braz

il Ire

land

Po

rtug

al

Sri L

anka

Po

land

Is

rael

Ve

nezu

ela

Peru

Co

lom

bia

Arge

ntin

a Cz

ech

Repu

blic

Hu

ngar

y Ch

ina

Luxe

mbo

urg

37% 27%

35%

0%

5%

Spre

ad

(% R

et/M

o)

Spread (MSE of monthly returns) and Size (% of probability mass)

Page 19: An Introduction to Model-Based Clustering · An Introduction to Model-Based Clustering Anish R. Shah, CFA Northfield Information Services . Anish@northinfo.com . London . Nov 17,

Jan 1998 – Gaussian, 2 clusters

0%

25%

50%

75%

100%

Uni

ted

Stat

es

Uni

ted

King

dom

Ca

nada

Fr

ance

Ita

ly

Germ

any

Finl

and

Denm

ark

Russ

ia

Spai

n Be

lgiu

m

Chile

Au

stria

Sw

itzer

land

In

done

sia

Swed

en

Mex

ico

Nor

way

N

ethe

rland

s Br

azil

Turk

ey

Gree

ce

Irela

nd

Port

ugal

Sr

i Lan

ka

Pola

nd

Isra

el

Vene

zuel

a Pe

ru

Colo

mbi

a Ar

gent

ina

Czec

h Re

publ

ic

Hung

ary

Chin

a Lu

xem

bour

g Ja

pan

Aust

ralia

Ho

ng K

ong

Mal

aysia

In

dia

Sout

h Ko

rea

Taiw

an

Sout

h Af

rica

Sing

apor

e Th

aila

nd

New

Zea

land

Ph

ilipp

ines

Pa

kist

an

73% 27%

0%

5%

Spre

ad

(% R

et/M

o)

Spread (MSE of monthly returns) and Size (% of probability mass)

Page 20: An Introduction to Model-Based Clustering · An Introduction to Model-Based Clustering Anish R. Shah, CFA Northfield Information Services . Anish@northinfo.com . London . Nov 17,

Jan 1998 – Laplace, 4 clusters

94%

2% 2% 2% 0%

10%

Spre

ad

(% R

et/M

o)

Spread (MSE of monthly returns) and Size (% of probability mass)

0%

25%

50%

75%

100%

Uni

ted

Stat

es

Uni

ted

King

dom

Ja

pan

Germ

any

Chin

a Sp

ain

Russ

ia

Switz

erla

nd

Italy

Au

stra

lia

Sout

h Ko

rea

Hong

Kon

g N

ethe

rland

s In

dia

Braz

il Ta

iwan

Sw

eden

Si

ngap

ore

Finl

and

Luxe

mbo

urg

Belg

ium

Th

aila

nd

Sout

h Af

rica

Nor

way

Ire

land

M

exic

o M

alay

sia

Denm

ark

Aust

ria

Pola

nd

Turk

ey

Gree

ce

Chile

In

done

sia

Isra

el

Czec

h Re

publ

ic

Hung

ary

Port

ugal

Co

lom

bia

New

Zea

land

Pe

ru

Paki

stan

Ar

gent

ina

Sri L

anka

Ve

nezu

ela

Cana

da

Phili

ppin

es

Fran

ce

Clusters estimated under a different distribution:

Page 21: An Introduction to Model-Based Clustering · An Introduction to Model-Based Clustering Anish R. Shah, CFA Northfield Information Services . Anish@northinfo.com . London . Nov 17,

Jan 1998 – Laplace, 3 clusters

96%

2% 2% 0%

10%

Spre

ad

(% R

et/M

o)

Spread (MSE of monthly returns) and Size (% of probability mass)

0%

25%

50%

75%

100%

Uni

ted

Stat

es

Uni

ted

King

dom

Ja

pan

Fran

ce

Germ

any

Chin

a Sp

ain

Russ

ia

Cana

da

Switz

erla

nd

Italy

So

uth

Kore

a Ho

ng K

ong

Net

herla

nds

Indi

a Ta

iwan

Sw

eden

Si

ngap

ore

Finl

and

Luxe

mbo

urg

Belg

ium

Th

aila

nd

Sout

h Af

rica

Nor

way

Ire

land

M

exic

o M

alay

sia

Denm

ark

Aust

ria

Pola

nd

Turk

ey

Gree

ce

Chile

In

done

sia

Isra

el

Czec

h Re

publ

ic

Hung

ary

Port

ugal

Co

lom

bia

Phili

ppin

es

New

Zea

land

Pe

ru

Paki

stan

Ar

gent

ina

Sri L

anka

Ve

nezu

ela

Aust

ralia

Br

azil

Page 22: An Introduction to Model-Based Clustering · An Introduction to Model-Based Clustering Anish R. Shah, CFA Northfield Information Services . Anish@northinfo.com . London . Nov 17,

Jan 2001 – Gaussian, 6 clusters

0%

25%

50%

75%

100%

Uni

ted

Stat

es

Japa

n U

nite

d Ki

ngdo

m

Cana

da

Germ

any

Fran

ce

Taiw

an

Sout

h Af

rica

Sout

h Ko

rea

Braz

il Ch

ina

Aust

ralia

M

alay

sia

Thai

land

N

ethe

rland

s Si

ngap

ore

Swed

en

Indi

a Ita

ly

Hong

Kon

g Sw

itzer

land

Be

lgiu

m

Gree

ce

Mex

ico

Nor

way

Ch

ile

New

Zea

land

Po

land

Is

rael

Fi

nlan

d De

nmar

k Tu

rkey

Pa

kist

an

Aust

ria

Spai

n Ru

ssia

In

done

sia

Peru

Lu

xem

bour

g Cz

ech

Repu

blic

Ar

gent

ina

Sri L

anka

Co

lom

bia

Irela

nd

Phili

ppin

es

Hung

ary

Port

ugal

Ve

nezu

ela

33%

27%

15% 15% 8%

2% 0%

10%

Spre

ad

(% R

et/M

o)

Spread (MSE of monthly returns) and Size (% of probability mass)

Page 23: An Introduction to Model-Based Clustering · An Introduction to Model-Based Clustering Anish R. Shah, CFA Northfield Information Services . Anish@northinfo.com . London . Nov 17,

Jan 2001 – Gaussian, 5 clusters

25% 36% 29% 8%

2% 0%

5%

Spre

ad

(% R

et/M

o)

Spread (MSE of monthly returns) and Size (% of probability mass)

0%

25%

50%

75%

100%

Uni

ted

Stat

es

Japa

n U

nite

d Ki

ngdo

m

Cana

da

Germ

any

Fran

ce

Sout

h Af

rica

Sout

h Ko

rea

Braz

il Au

stra

lia

Mal

aysia

Th

aila

nd

Taiw

an

Swed

en

Chin

a N

ethe

rland

s In

dia

Sing

apor

e De

nmar

k Ita

ly

Hong

Kon

g Sw

itzer

land

Be

lgiu

m

Phili

ppin

es

Gree

ce

Mex

ico

Nor

way

Po

rtug

al

Pola

nd

Indo

nesia

Fi

nlan

d Tu

rkey

Ire

land

Au

stria

Ch

ile

Hung

ary

Luxe

mbo

urg

Spai

n Cz

ech

Repu

blic

N

ew Z

eala

nd

Isra

el

Russ

ia

Arge

ntin

a Pa

kist

an

Peru

Sr

i Lan

ka

Colo

mbi

a Ve

nezu

ela

Page 24: An Introduction to Model-Based Clustering · An Introduction to Model-Based Clustering Anish R. Shah, CFA Northfield Information Services . Anish@northinfo.com . London . Nov 17,

Jan 2001 – Gaussian, 4 clusters

21% 40%

31% 8%

0%

5%

Spre

ad

(% R

et/M

o)

Spread (MSE of monthly returns) and Size (% of probability mass)

0%

25%

50%

75%

100%

Uni

ted

Stat

es

Japa

n U

nite

d Ki

ngdo

m

Cana

da

Germ

any

Fran

ce

Sout

h Ko

rea

Braz

il Au

stra

lia

Thai

land

Ta

iwan

So

uth

Afric

a Sw

eden

Ch

ina

Mal

aysia

N

ethe

rland

s In

dia

Sing

apor

e De

nmar

k Ita

ly

Hong

Kon

g Sw

itzer

land

Be

lgiu

m

Phili

ppin

es

Gree

ce

Mex

ico

Nor

way

Po

rtug

al

Pola

nd

Indo

nesia

Fi

nlan

d Tu

rkey

Pa

kist

an

Irela

nd

Aust

ria

Chile

Hu

ngar

y Pe

ru

Spai

n Cz

ech

Repu

blic

N

ew Z

eala

nd

Isra

el

Russ

ia

Sri L

anka

Lu

xem

bour

g Ar

gent

ina

Vene

zuel

a Co

lom

bia

Page 25: An Introduction to Model-Based Clustering · An Introduction to Model-Based Clustering Anish R. Shah, CFA Northfield Information Services . Anish@northinfo.com . London . Nov 17,

Jan 2001 – Gaussian, 3 clusters

21% 40%

40%

0%

5%

Spre

ad

(% R

et/M

o)

Spread (MSE of monthly returns) and Size (% of probability mass)

0%

25%

50%

75%

100%

Uni

ted

Stat

es

Japa

n U

nite

d Ki

ngdo

m

Cana

da

Germ

any

Fran

ce

Sout

h Ko

rea

Braz

il Au

stra

lia

Thai

land

Ta

iwan

So

uth

Afric

a Sw

eden

Ch

ina

Mal

aysia

N

ethe

rland

s In

dia

Sing

apor

e De

nmar

k Ita

ly

Hong

Kon

g Sw

itzer

land

Be

lgiu

m

Phili

ppin

es

Gree

ce

Mex

ico

Nor

way

Po

rtug

al

Pola

nd

Indo

nesia

Fi

nlan

d Tu

rkey

Pa

kist

an

Irela

nd

Aust

ria

Chile

Hu

ngar

y Pe

ru

Luxe

mbo

urg

Spai

n Cz

ech

Repu

blic

N

ew Z

eala

nd

Isra

el

Russ

ia

Arge

ntin

a Sr

i Lan

ka

Vene

zuel

a Co

lom

bia

Page 26: An Introduction to Model-Based Clustering · An Introduction to Model-Based Clustering Anish R. Shah, CFA Northfield Information Services . Anish@northinfo.com . London . Nov 17,

Jan 2001 – Gaussian, 2 clusters

56%

44%

0%

5%

Spre

ad

(% R

et/M

o)

Spread (MSE of monthly returns) and Size (% of probability mass)

0%

25%

50%

75%

100%

Uni

ted

Stat

es

Japa

n U

nite

d Ki

ngdo

m

Cana

da

Germ

any

Fran

ce

Taiw

an

Sout

h Af

rica

Sout

h Ko

rea

Braz

il Sw

eden

Ch

ina

Aust

ralia

M

alay

sia

Thai

land

N

ethe

rland

s In

dia

Sing

apor

e De

nmar

k Ita

ly

Hong

Kon

g Sw

itzer

land

Be

lgiu

m

Gree

ce

Mex

ico

Nor

way

Po

land

In

done

sia

Finl

and

Turk

ey

Paki

stan

Ire

land

Au

stria

Ph

ilipp

ines

Ch

ile

Hung

ary

Peru

Lu

xem

bour

g Sp

ain

Czec

h Re

publ

ic

New

Zea

land

Po

rtug

al

Isra

el

Russ

ia

Arge

ntin

a Sr

i Lan

ka

Vene

zuel

a Co

lom

bia

Page 27: An Introduction to Model-Based Clustering · An Introduction to Model-Based Clustering Anish R. Shah, CFA Northfield Information Services . Anish@northinfo.com . London . Nov 17,

Jan 2004 – Gaussian, 6 clusters

0%

25%

50%

75%

100%

Uni

ted

Stat

es

Uni

ted

King

dom

Fr

ance

Ge

rman

y N

ethe

rland

s Ca

nada

Sw

itzer

land

Sp

ain

Italy

Be

lgiu

m

Nor

way

Ire

land

De

nmar

k Po

rtug

al

Chin

a Ta

iwan

So

uth

Kore

a Si

ngap

ore

Mal

aysia

Ru

ssia

In

done

sia

Arge

ntin

a Ve

nezu

ela

Paki

stan

Co

lom

bia

Sri L

anka

Ja

pan

Aust

ralia

Th

aila

nd

Sout

h Af

rica

Mex

ico

Chile

Au

stria

N

ew Z

eala

nd

Phili

ppin

es

Peru

Ho

ng K

ong

Swed

en

Finl

and

Braz

il Is

rael

Lu

xem

bour

g Po

land

Hu

ngar

y Cz

ech

Repu

blic

In

dia

Gree

ce

Turk

ey

29%

25%

21% 19% 4%

2% 0%

10%

Spre

ad

(% R

et/M

o)

Spread (MSE of monthly returns) and Size (% of probability mass)

Page 28: An Introduction to Model-Based Clustering · An Introduction to Model-Based Clustering Anish R. Shah, CFA Northfield Information Services . Anish@northinfo.com . London . Nov 17,

Jan 2004 – Gaussian, 5 clusters

0%

25%

50%

75%

100%

Uni

ted

Stat

es

Uni

ted

King

dom

Fr

ance

Ge

rman

y N

ethe

rland

s Sw

itzer

land

Sp

ain

Italy

Sw

eden

Fi

nlan

d N

orw

ay

Port

ugal

Lu

xem

bour

g Ch

ina

Taiw

an

Sout

h Ko

rea

Indi

a Gr

eece

Ru

ssia

Is

rael

Ar

gent

ina

Czec

h Re

publ

ic

Vene

zuel

a Pa

kist

an

Colo

mbi

a Sr

i Lan

ka

Japa

n Ca

nada

Au

stra

lia

Belg

ium

Br

azil

Sout

h Af

rica

Irela

nd

Mex

ico

Denm

ark

Chile

Au

stria

N

ew Z

eala

nd

Peru

Ho

ng K

ong

Sing

apor

e M

alay

sia

Thai

land

In

done

sia

Pola

nd

Phili

ppin

es

Hung

ary

Turk

ey

27%

27%

27% 17%

2% 0%

10%

Spre

ad

(% R

et/M

o)

Spread (MSE of monthly returns) and Size (% of probability mass)

Page 29: An Introduction to Model-Based Clustering · An Introduction to Model-Based Clustering Anish R. Shah, CFA Northfield Information Services . Anish@northinfo.com . London . Nov 17,

Jan 2004 – Gaussian, 4 clusters

33%

33%

31%

2% 0%

10%

Spre

ad

(% R

et/M

o)

Spread (MSE of monthly returns) and Size (% of probability mass)

0%

25%

50%

75%

100%

Uni

ted

Stat

es

Uni

ted

King

dom

Fr

ance

Ge

rman

y N

ethe

rland

s Ca

nada

Sw

itzer

land

Sp

ain

Italy

Sw

eden

Be

lgiu

m

Nor

way

Ire

land

De

nmar

k Po

rtug

al

Luxe

mbo

urg

Chin

a Ta

iwan

So

uth

Kore

a Si

ngap

ore

Indi

a M

alay

sia

Gree

ce

Russ

ia

Isra

el

Indo

nesia

Ar

gent

ina

Czec

h Re

publ

ic

Vene

zuel

a Pa

kist

an

Colo

mbi

a Sr

i Lan

ka

Japa

n Au

stra

lia

Hong

Kon

g Fi

nlan

d Br

azil

Thai

land

So

uth

Afric

a M

exic

o Ch

ile

Pola

nd

Aust

ria

New

Zea

land

Ph

ilipp

ines

Hu

ngar

y Pe

ru

Turk

ey

Page 30: An Introduction to Model-Based Clustering · An Introduction to Model-Based Clustering Anish R. Shah, CFA Northfield Information Services . Anish@northinfo.com . London . Nov 17,

Jan 2004 – Gaussian, 3 clusters

0%

25%

50%

75%

100%

Uni

ted

Stat

es

Uni

ted

King

dom

Fr

ance

Ge

rman

y N

ethe

rland

s Ca

nada

Sw

itzer

land

Sp

ain

Italy

Sw

eden

Be

lgiu

m

Nor

way

Ire

land

De

nmar

k Po

rtug

al

Luxe

mbo

urg

Chin

a Ta

iwan

So

uth

Kore

a In

dia

Mal

aysia

Gr

eece

Ru

ssia

Tu

rkey

Is

rael

In

done

sia

Arge

ntin

a Cz

ech

Repu

blic

Ve

nezu

ela

Paki

stan

Co

lom

bia

Sri L

anka

Ja

pan

Aust

ralia

Ho

ng K

ong

Finl

and

Sing

apor

e Br

azil

Thai

land

So

uth

Afric

a M

exic

o Ch

ile

Pola

nd

Aust

ria

New

Zea

land

Ph

ilipp

ines

Hu

ngar

y Pe

ru

33% 34%

33% 0%

20%

Spre

ad

(% R

et/M

o)

Spread (MSE of monthly returns) and Size (% of probability mass)

Page 31: An Introduction to Model-Based Clustering · An Introduction to Model-Based Clustering Anish R. Shah, CFA Northfield Information Services . Anish@northinfo.com . London . Nov 17,

Jan 2004 – Gaussian, 2 clusters

56%

44%

0%

10%

Spre

ad

(% R

et/M

o)

Spread (MSE of monthly returns) and Size (% of probability mass)

0%

25%

50%

75%

100%

Uni

ted

Stat

es

Japa

n U

nite

d Ki

ngdo

m

Fran

ce

Germ

any

Net

herla

nds

Cana

da

Switz

erla

nd

Spai

n Ita

ly

Aust

ralia

Sw

eden

Fi

nlan

d Be

lgiu

m

Braz

il So

uth

Afric

a N

orw

ay

Irela

nd

Mex

ico

Denm

ark

Chile

Po

rtug

al

Luxe

mbo

urg

Pola

nd

Aust

ria

New

Zea

land

Hu

ngar

y Ch

ina

Hong

Kon

g Ta

iwan

So

uth

Kore

a Si

ngap

ore

Indi

a M

alay

sia

Thai

land

Gr

eece

Ru

ssia

Tu

rkey

Is

rael

In

done

sia

Phili

ppin

es

Arge

ntin

a Cz

ech

Repu

blic

Pe

ru

Vene

zuel

a Pa

kist

an

Colo

mbi

a Sr

i Lan

ka

Page 32: An Introduction to Model-Based Clustering · An Introduction to Model-Based Clustering Anish R. Shah, CFA Northfield Information Services . Anish@northinfo.com . London . Nov 17,

Jan 2007 – Gaussian, 6 clusters

0%

25%

50%

75%

100%

Uni

ted

Stat

es

Uni

ted

King

dom

Fr

ance

Ge

rman

y Sp

ain

Net

herla

nds

Switz

erla

nd

Italy

Sw

eden

Fi

nlan

d Be

lgiu

m

Irela

nd

Port

ugal

Ca

nada

Au

stra

lia

Hong

Kon

g Si

ngap

ore

Braz

il So

uth

Afric

a Th

aila

nd

Mal

aysia

M

exic

o Au

stria

De

nmar

k Ch

ile

New

Zea

land

Ja

pan

Sout

h Ko

rea

Chin

a Ta

iwan

N

orw

ay

Luxe

mbo

urg

Gree

ce

Arge

ntin

a Ph

ilipp

ines

Pa

kist

an

Peru

Ve

nezu

ela

Sri L

anka

Ru

ssia

Po

land

Tu

rkey

Hu

ngar

y Cz

ech

Repu

blic

Co

lom

bia

Indi

a In

done

sia

Isra

el

27% 27% 27% 13% 4%

2% 0%

10%

Spre

ad

(% R

et/M

o)

Spread (MSE of monthly returns) and Size (% of probability mass)

Page 33: An Introduction to Model-Based Clustering · An Introduction to Model-Based Clustering Anish R. Shah, CFA Northfield Information Services . Anish@northinfo.com . London . Nov 17,

Jan 2007 – Gaussian, 5 clusters

0%

25%

50%

75%

100%

Uni

ted

Stat

es

Uni

ted

King

dom

Fr

ance

Ge

rman

y Sp

ain

Net

herla

nds

Switz

erla

nd

Italy

Sw

eden

Fi

nlan

d Be

lgiu

m

Irela

nd

Port

ugal

Ca

nada

Au

stra

lia

Hong

Kon

g Si

ngap

ore

Braz

il So

uth

Afric

a Th

aila

nd

Mal

aysia

M

exic

o Au

stria

De

nmar

k Ch

ile

New

Zea

land

Ja

pan

Sout

h Ko

rea

Taiw

an

Nor

way

Lu

xem

bour

g Gr

eece

Po

land

Is

rael

In

done

sia

Hung

ary

Czec

h Re

publ

ic

Phili

ppin

es

Russ

ia

Chin

a In

dia

Arge

ntin

a Pa

kist

an

Colo

mbi

a Pe

ru

Vene

zuel

a Sr

i Lan

ka

Turk

ey

27% 27% 25% 19%

2% 0%

10%

Spre

ad

(% R

et/M

o)

Spread (MSE of monthly returns) and Size (% of probability mass)

Page 34: An Introduction to Model-Based Clustering · An Introduction to Model-Based Clustering Anish R. Shah, CFA Northfield Information Services . Anish@northinfo.com . London . Nov 17,

Jan 2007 – Gaussian, 4 clusters

0%

25%

50%

75%

100%

Uni

ted

Stat

es

Uni

ted

King

dom

Fr

ance

Ge

rman

y Sp

ain

Net

herla

nds

Switz

erla

nd

Italy

Sw

eden

Fi

nlan

d Be

lgiu

m

Irela

nd

Port

ugal

Ca

nada

Au

stra

lia

Hong

Kon

g Si

ngap

ore

Braz

il So

uth

Afric

a Th

aila

nd

Mex

ico

Aust

ria

Denm

ark

Chile

N

ew Z

eala

nd

Japa

n So

uth

Kore

a Ta

iwan

N

orw

ay

Luxe

mbo

urg

Mal

aysia

Gr

eece

Po

land

Is

rael

In

done

sia

Hung

ary

Czec

h Re

publ

ic

Phili

ppin

es

Russ

ia

Chin

a In

dia

Turk

ey

Arge

ntin

a Pa

kist

an

Colo

mbi

a Pe

ru

Vene

zuel

a Sr

i Lan

ka

27% 25% 27%

21%

0%

10%

Spre

ad

(% R

et/M

o)

Spread (MSE of monthly returns) and Size (% of probability mass)

Page 35: An Introduction to Model-Based Clustering · An Introduction to Model-Based Clustering Anish R. Shah, CFA Northfield Information Services . Anish@northinfo.com . London . Nov 17,

Jan 2007 – Gaussian, 3 clusters

0%

25%

50%

75%

100%

Uni

ted

Stat

es

Uni

ted

King

dom

Fr

ance

Ge

rman

y Sp

ain

Net

herla

nds

Switz

erla

nd

Italy

Sw

eden

Fi

nlan

d Be

lgiu

m

Irela

nd

Port

ugal

Ca

nada

Au

stra

lia

Hong

Kon

g Si

ngap

ore

Braz

il So

uth

Afric

a Th

aila

nd

Mal

aysia

M

exic

o Au

stria

De

nmar

k Ch

ile

New

Zea

land

Ja

pan

Russ

ia

Sout

h Ko

rea

Chin

a Ta

iwan

In

dia

Nor

way

Lu

xem

bour

g Gr

eece

Po

land

Tu

rkey

Is

rael

Ar

gent

ina

Indo

nesia

Hu

ngar

y Cz

ech

Repu

blic

Ph

ilipp

ines

Pa

kist

an

Colo

mbi

a Pe

ru

Vene

zuel

a Sr

i Lan

ka

27% 27% 46%

0%

10%

Spre

ad

(% R

et/M

o)

Spread (MSE of monthly returns) and Size (% of probability mass)

Page 36: An Introduction to Model-Based Clustering · An Introduction to Model-Based Clustering Anish R. Shah, CFA Northfield Information Services . Anish@northinfo.com . London . Nov 17,

Jan 2007 – Gaussian, 2 clusters

0%

25%

50%

75%

100%

Uni

ted

Stat

es

Uni

ted

King

dom

Fr

ance

Ge

rman

y Ca

nada

Sp

ain

Net

herla

nds

Switz

erla

nd

Aust

ralia

Ita

ly

Swed

en

Finl

and

Sing

apor

e Be

lgiu

m

Irela

nd

Mex

ico

Aust

ria

Denm

ark

Chile

Po

rtug

al

New

Zea

land

Ja

pan

Russ

ia

Hong

Kon

g So

uth

Kore

a Ch

ina

Taiw

an

Indi

a Br

azil

Nor

way

So

uth

Afric

a Th

aila

nd

Luxe

mbo

urg

Mal

aysia

Gr

eece

Po

land

Tu

rkey

Is

rael

Ar

gent

ina

Indo

nesia

Hu

ngar

y Cz

ech

Repu

blic

Ph

ilipp

ines

Pa

kist

an

Colo

mbi

a Pe

ru

Vene

zuel

a Sr

i Lan

ka

44% 56%

0%

10%

Spre

ad

(% R

et/M

o)

Spread (MSE of monthly returns) and Size (% of probability mass)

Page 37: An Introduction to Model-Based Clustering · An Introduction to Model-Based Clustering Anish R. Shah, CFA Northfield Information Services . Anish@northinfo.com . London . Nov 17,

Jan 2010 – Gaussian, 6 clusters

27% 23%

19% 17% 12%

2% 0%

10%

Spre

ad

(% R

et/M

o)

Spread (MSE of monthly returns) and Size (% of probability mass)

0%

25%

50%

75%

100%

Uni

ted

Stat

es

Uni

ted

King

dom

Fr

ance

Ge

rman

y Sw

itzer

land

Ita

ly

Net

herla

nds

Swed

en

Finl

and

Belg

ium

Ire

land

De

nmar

k Au

stria

So

uth

Kore

a Ho

ng K

ong

Indi

a Ta

iwan

Si

ngap

ore

Pola

nd

Turk

ey

Gree

ce

Indo

nesia

Cz

ech

Repu

blic

Hu

ngar

y Ca

nada

Au

stra

lia

Braz

il Lu

xem

bour

g Th

aila

nd

Sout

h Af

rica

Nor

way

M

exic

o Ar

gent

ina

Japa

n Sp

ain

Mal

aysia

Ch

ile

Isra

el

Port

ugal

Ph

ilipp

ines

N

ew Z

eala

nd

Russ

ia

Colo

mbi

a Pe

ru

Paki

stan

Sr

i Lan

ka

Vene

zuel

a Ch

ina

Page 38: An Introduction to Model-Based Clustering · An Introduction to Model-Based Clustering Anish R. Shah, CFA Northfield Information Services . Anish@northinfo.com . London . Nov 17,

Jan 2010 – Gaussian, 5 clusters

31% 37% 15%

15%

2% 0%

10%

Spre

ad

(% R

et/M

o)

Spread (MSE of monthly returns) and Size (% of probability mass)

0%

25%

50%

75%

100%

Uni

ted

Stat

es

Uni

ted

King

dom

Fr

ance

Ge

rman

y Sw

itzer

land

Ita

ly

Aust

ralia

N

ethe

rland

s Sw

eden

Fi

nlan

d Be

lgiu

m

Irela

nd

Denm

ark

Aust

ria

Port

ugal

Ca

nada

So

uth

Kore

a Ho

ng K

ong

Braz

il Ta

iwan

Si

ngap

ore

Luxe

mbo

urg

Thai

land

N

orw

ay

Pola

nd

Gree

ce

Indo

nesia

Is

rael

Cz

ech

Repu

blic

Hu

ngar

y Ph

ilipp

ines

Pe

ru

Arge

ntin

a Ja

pan

Spai

n So

uth

Afric

a M

exic

o M

alay

sia

Chile

N

ew Z

eala

nd

Russ

ia

Indi

a Tu

rkey

Co

lom

bia

Paki

stan

Sr

i Lan

ka

Vene

zuel

a Ch

ina

Page 39: An Introduction to Model-Based Clustering · An Introduction to Model-Based Clustering Anish R. Shah, CFA Northfield Information Services . Anish@northinfo.com . London . Nov 17,

Jan 2010 – Gaussian, 4 clusters

29% 38% 15%

19%

0%

10%

Spre

ad

(% R

et/M

o)

Spread (MSE of monthly returns) and Size (% of probability mass)

0%

25%

50%

75%

100%

Uni

ted

Stat

es

Uni

ted

King

dom

Fr

ance

Ge

rman

y Sw

itzer

land

Ita

ly

Aust

ralia

N

ethe

rland

s Sw

eden

Fi

nlan

d Be

lgiu

m

Irela

nd

Denm

ark

Aust

ria

Cana

da

Sout

h Ko

rea

Hong

Kon

g Br

azil

Sing

apor

e Lu

xem

bour

g Th

aila

nd

Nor

way

M

alay

sia

Pola

nd

Gree

ce

Indo

nesia

Is

rael

Cz

ech

Repu

blic

Hu

ngar

y Ph

ilipp

ines

Pe

ru

Arge

ntin

a Ja

pan

Spai

n So

uth

Afric

a M

exic

o Ch

ile

Port

ugal

N

ew Z

eala

nd

Chin

a Ru

ssia

In

dia

Taiw

an

Turk

ey

Colo

mbi

a Pa

kist

an

Sri L

anka

Ve

nezu

ela

Page 40: An Introduction to Model-Based Clustering · An Introduction to Model-Based Clustering Anish R. Shah, CFA Northfield Information Services . Anish@northinfo.com . London . Nov 17,

Jan 2010 – Gaussian, 3 clusters

0%

25%

50%

75%

100%

Uni

ted

Stat

es

Uni

ted

King

dom

Fr

ance

Ge

rman

y Sw

itzer

land

Ita

ly

Net

herla

nds

Swed

en

Finl

and

Belg

ium

Ire

land

De

nmar

k Au

stria

Ja

pan

Spai

n Ca

nada

Au

stra

lia

Braz

il Si

ngap

ore

Luxe

mbo

urg

Thai

land

So

uth

Afric

a N

orw

ay

Mex

ico

Mal

aysia

Gr

eece

Ch

ile

Isra

el

Czec

h Re

publ

ic

Port

ugal

Ph

ilipp

ines

N

ew Z

eala

nd

Arge

ntin

a Ch

ina

Russ

ia

Sout

h Ko

rea

Hong

Kon

g In

dia

Taiw

an

Pola

nd

Turk

ey

Indo

nesia

Hu

ngar

y Co

lom

bia

Peru

Pa

kist

an

Sri L

anka

Ve

nezu

ela

27% 42% 31%

0%

10%

Spre

ad

(% R

et/M

o)

Spread (MSE of monthly returns) and Size (% of probability mass)

Page 41: An Introduction to Model-Based Clustering · An Introduction to Model-Based Clustering Anish R. Shah, CFA Northfield Information Services . Anish@northinfo.com . London . Nov 17,

Jan 2010 – Gaussian, 2 clusters

0%

25%

50%

75%

100%

Uni

ted

Stat

es

Uni

ted

King

dom

Fr

ance

Ge

rman

y Sp

ain

Cana

da

Switz

erla

nd

Italy

Au

stra

lia

Net

herla

nds

Swed

en

Finl

and

Belg

ium

So

uth

Afric

a N

orw

ay

Irela

nd

Mex

ico

Denm

ark

Aust

ria

Port

ugal

N

ew Z

eala

nd

Japa

n Ch

ina

Russ

ia

Sout

h Ko

rea

Hong

Kon

g In

dia

Braz

il Ta

iwan

Si

ngap

ore

Luxe

mbo

urg

Thai

land

M

alay

sia

Pola

nd

Turk

ey

Gree

ce

Chile

In

done

sia

Isra

el

Czec

h Re

publ

ic

Hung

ary

Colo

mbi

a Ph

ilipp

ines

Pe

ru

Paki

stan

Ar

gent

ina

Sri L

anka

Ve

nezu

ela

44% 56%

0%

10%

Spre

ad

(% R

et/M

o)

Spread (MSE of monthly returns) and Size (% of probability mass)

Page 42: An Introduction to Model-Based Clustering · An Introduction to Model-Based Clustering Anish R. Shah, CFA Northfield Information Services . Anish@northinfo.com . London . Nov 17,

Closing Remarks • Idea works with different distributions

– Here, deviations from cluster centers were Gaussian or Laplace

• Can dress up the model to account for interactions – e.g. Countries A, B, and C are trading partners thus more likely to

belong to the same cluster

• Clustering helps identify what a market or security is, i.e. what forecasting model to use for it

• Purpose of presentation: switching from applying filters to thinking about underlying mathematical models – gives you your own custom tool set – makes understanding what something does infinitely easier