clustering financial time series using their correlations and their distributions


Upload: hellebore-capital-limited

Post on 16-Jan-2017


TRANSCRIPT

How to cluster random walks?

Paris Machine Learning #5 Season 2: Time Series and FinTech

Philippe Donnat (1), Gautier Marti (1,2), Frank Nielsen (2), Philippe Very (1)

(1) Hellebore Capital Management

(2) Ecole Polytechnique

th January

1 Introduction
   Data Science for the CDS market
   How to group random walks?
   What is a clustering program?

2 How to define a distance between two random walks?
   Standard approach on time series
   Comovements and distributions
   GNPR: the best of both worlds

3 Applications
   Results on synthetic datasets
   Clustering Credit Default Swaps

4 Conclusion


Hellebore Capital Management & Data Science

Current R&D projects in Data Science:

Data mining: parsing & natural language processing

Inference: incomplete data sources

Portfolio & Risk analysis: understanding joint behaviours


Do you see clusters?

[Figure labels: "Random walks"; "French banks and building materials CDS over 2006-2010"]


Do you see clusters?

[Figure labels: "Random walks"; "French banks and building materials CDS over 2006-2015"]


What is a clustering program?

Definition

Clustering is the task of grouping a set of objects in such a way that objects in the same group (cluster) are more similar to each other than to those in different groups.

Definition

We aim at finding K groups by positioning K group centers $\{c_1, \ldots, c_K\}$ that, for data points $\{x_1, \ldots, x_n\}$, solve

$$\min_{c_1, \ldots, c_K} \sum_{i=1}^{n} \min_{j=1}^{K} d(x_i, c_j)^2$$

But, what is the distance d between two random walks?
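As a minimal sketch of the objective above, assuming d is the plain Euclidean distance and that data points are rows of a NumPy array (the function name `clustering_cost` and the toy data are illustrative, not from the talk):

```python
import numpy as np


def clustering_cost(points, centers):
    """Sum over points of the squared distance to the nearest center."""
    # Pairwise squared Euclidean distances, shape (n_points, n_centers).
    diffs = points[:, None, :] - centers[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=2)
    # Each point contributes the squared distance to its closest center.
    return float(np.sum(np.min(sq_dists, axis=1)))


# Two obvious groups, around x = 0 and x = 10.
points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
good_centers = np.array([[0.0, 0.5], [10.0, 0.5]])
bad_centers = np.array([[5.0, 0.0], [5.0, 1.0]])

cost_good = clustering_cost(points, good_centers)
cost_bad = clustering_cost(points, bad_centers)
print(cost_good, cost_bad)
```

Centers placed inside the two groups give a much lower cost than centers placed between them. The value of the cost depends entirely on the choice of d, which is why the question of a distance between random walks comes first.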


Naive distance between two random walks

[Figure: random walks Y, Y′ → increments X, X′ → covariance scatterplot]

increments of a random walk: $X = (y_2 - y_1, \ldots, y_T - y_{T-1})$

X, X′ are points in $\mathbb{R}^{T-1}$: $\|X - X'\|_2 = \sqrt{\sum_{i=1}^{T-1} (X_i - X'_i)^2}$

normalizations can be applied first: e.g. $(X - \mu)/\sigma$, $(X - \min)/(\max - \min)$

captures comovements rather well

drawbacks: not robust to outliers, blind to signal shape
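A small sketch of this naive distance, assuming NumPy and simulated walks (the helper names `increments`, `zscore`, `minmax` and `naive_distance` are hypothetical, chosen for illustration). It also shows the outlier issue: a single jump in one walk inflates the distance between two otherwise comoving series.

```python
import numpy as np


def increments(walk):
    """X = (y_2 - y_1, ..., y_T - y_{T-1})."""
    return np.diff(walk)


def zscore(x):
    return (x - x.mean()) / x.std()


def minmax(x):
    return (x - x.min()) / (x.max() - x.min())


def naive_distance(walk_a, walk_b, normalize=None):
    """Euclidean distance between (optionally normalized) increments."""
    xa, xb = increments(walk_a), increments(walk_b)
    if normalize is not None:
        xa, xb = normalize(xa), normalize(xb)
    return float(np.linalg.norm(xa - xb))


rng = np.random.default_rng(0)
common = rng.standard_normal(500)
# walk_a and walk_b share most of their increments; walk_c is independent.
walk_a = np.cumsum(common + 0.1 * rng.standard_normal(500))
walk_b = np.cumsum(common + 0.1 * rng.standard_normal(500))
walk_c = np.cumsum(rng.standard_normal(500))

d_ab = naive_distance(walk_a, walk_b, zscore)
d_ac = naive_distance(walk_a, walk_c, zscore)

# A single jump in walk_b creates one outlying increment.
walk_b_jump = walk_b.copy()
walk_b_jump[250:] += 50.0
d_ab_jump = naive_distance(walk_a, walk_b_jump, zscore)
print(d_ab, d_ac, d_ab_jump)
```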


Our approach: split comovements and distributions


GNPR: A suitable representation

Definition

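GNPR (Generic Non-Parametric Representation) projection:

$$\mathcal{T} : \mathcal{V}^N \to \mathcal{U}^N \times \mathcal{G}^N, \qquad X \mapsto (G_X(X), G_X) \qquad (1)$$

$G_X : x \mapsto P[X \le x]$, the cumulative distribution function of X

$G_X(X) \sim \mathcal{U}[0, 1]$

$\frac{1}{T} \mathrm{rank}(X_t) = \frac{1}{T} \sum_{k \le T} \mathbf{1}\{X_k \le X_t\} \to_{T \to \infty} P[X \le X_t] = G_X(X_t)$

Property

$\mathcal{T}$ is a bijection.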

N.B. It replicates Sklar’s theorem, the seminal result of Copula Theory.
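A finite-sample sketch of this projection, under the assumption that the normalized ranks stand in for G_X(X) and the sorted sample stands in for the marginal G_X (the names `gnpr_project` and `gnpr_invert` are hypothetical). It illustrates the bijection: the pair (ranks, sorted values) is enough to rebuild the original series.

```python
import numpy as np


def gnpr_project(x):
    """Split a sample into a dependence part and a distribution part."""
    t = len(x)
    # Ranks in 1..T: the double argsort turns values into their positions.
    ranks = np.argsort(np.argsort(x)) + 1
    # Normalized ranks approximate G_X(X); sorted values encode the marginal.
    return ranks / t, np.sort(x)


def gnpr_invert(u, sorted_values):
    """Rebuild the sample from its normalized ranks and sorted values."""
    t = len(sorted_values)
    ranks = np.rint(u * t).astype(int)
    return sorted_values[ranks - 1]


rng = np.random.default_rng(1)
x = rng.standard_t(df=3, size=2000)  # heavy-tailed increments (assumed)

u, marginal = gnpr_project(x)
x_back = gnpr_invert(u, marginal)
print(u.min(), u.max(), np.array_equal(x, x_back))
```

The normalized ranks are spread evenly over (0, 1] whatever the distribution of x, which is the empirical counterpart of G_X(X) following a uniform law.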


A distance dθ leveraging GNPR

Definition

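Let $(X, Y) \in \mathcal{V}^2$. Let $G_X, G_Y$ be their marginal cdfs. Let $\theta \in [0, 1]$. We define the following distance

$$d_\theta^2(X, Y) = \theta \, d_1^2(G_X(X), G_Y(Y)) + (1 - \theta) \, d_0^2(G_X, G_Y), \qquad (2)$$

where

$$d_1^2(G_X(X), G_Y(Y)) = 3 \, E[|G_X(X) - G_Y(Y)|^2], \qquad (3)$$

and

$$d_0^2(G_X, G_Y) = \frac{1}{2} \int_{\mathbb{R}} \left( \sqrt{\frac{dG_X}{d\lambda}} - \sqrt{\frac{dG_Y}{d\lambda}} \right)^2 d\lambda. \qquad (4)$$

$d_0$ is the Hellinger distance;

$d_1 = \sqrt{(1 - \rho_S)/2}$, with $\rho_S$ the (Spearman) rank correlation between X and Y.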
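The definitions above can be sketched empirically as follows. This is an illustration under stated assumptions, not the authors' implementation: d1 is estimated from normalized ranks, d0 from histograms on a shared grid (the number of bins `n_bins` is an arbitrary choice), and all function names are hypothetical.

```python
import numpy as np


def normalized_ranks(x):
    """Empirical G_X(X): ranks in 1..T divided by T."""
    return (np.argsort(np.argsort(x)) + 1) / len(x)


def d1_squared(x, y):
    """Empirical version of 3 E[|G_X(X) - G_Y(Y)|^2]."""
    diff = normalized_ranks(x) - normalized_ranks(y)
    return float(3.0 * np.mean(diff ** 2))


def d0_squared(x, y, n_bins=50):
    """Histogram approximation of the squared Hellinger distance."""
    lo, hi = min(x.min(), y.min()), max(x.max(), y.max())
    edges = np.linspace(lo, hi, n_bins + 1)
    p = np.histogram(x, bins=edges)[0] / len(x)
    q = np.histogram(y, bins=edges)[0] / len(y)
    return float(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))


def d_theta(x, y, theta, n_bins=50):
    """Square root of theta * d1^2 + (1 - theta) * d0^2."""
    sq = theta * d1_squared(x, y) + (1.0 - theta) * d0_squared(x, y, n_bins)
    return float(np.sqrt(sq))


rng = np.random.default_rng(2)
x = rng.standard_normal(1000)
y = 0.8 * x + 0.6 * rng.standard_normal(1000)

# Check d1^2 against (1 - rho_S) / 2, with rho_S computed on the ranks.
rank_corr = np.corrcoef(normalized_ranks(x), normalized_ranks(y))[0, 1]
gap = abs(d1_squared(x, y) - (1.0 - rank_corr) / 2.0)
print(d_theta(x, y, 0.5), gap)
```

The factor 3 in (3) is what makes the identity work: for two uniform variables U and V, E[(U − V)²] = 2/3 − 2E[UV] and ρ_S = 12E[UV] − 3, so 3E[(U − V)²] = (1 − ρ_S)/2.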


GNPR θ = 1: Increase of correlation

[Figure: "Distribution of Correlations", density of correlation values (x-axis: Correlation, y-axis: Density) for Pearson correlation and Spearman correlation]

10% more correlation, on average, using GNPR θ = 1 (rank statistics) instead of standard correlation
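The 10% figure is an empirical observation on the authors' data and is not reproduced here. The toy sketch below only illustrates one mechanism by which rank statistics can report more correlation than the linear coefficient: a monotone but nonlinear dependence. The data and helper names are assumptions made for the example.

```python
import numpy as np


def pearson(x, y):
    return float(np.corrcoef(x, y)[0, 1])


def ranks(v):
    return np.argsort(np.argsort(v))


def spearman(x, y):
    """Pearson correlation computed on ranks (no ties assumed)."""
    return pearson(ranks(x), ranks(y))


rng = np.random.default_rng(3)
x = rng.standard_normal(5000)
y = np.exp(3.0 * x)  # increasing, strongly nonlinear function of x

rho_pearson = pearson(x, y)
rho_spearman = spearman(x, y)
print(rho_pearson, rho_spearman)
```

Because y is an increasing function of x, the ranks coincide and the rank correlation is 1, while the linear correlation is dragged down by the few very large values of y.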


GNPR θ = 0: Find distribution peculiarities

Parametric modelling: which distribution?

Real-life CDS variations:

[Figure: Nokia 5Y CDS, CDS spread against time (Jan-2006 to Jul-2014), and "Distribution Nokia", log density of the increments]

[Figure: Telecom Italia 5Y CDS, CDS spread against time (Jan-2006 to Mar-2014), and "Distribution Telecom Italia", log density of the increments]


Description of the testing datasets

We define some interesting test case datasets to study:

distribution clustering (dataset A),

dependence clustering (dataset B),

a mix of both (dataset C).


Results: GNPR works!

Adjusted Rand Index on datasets A, B and C

Representation            Algorithm    A       B       C
X                         Ward         0       0.94    0.42
(X − µX)/σX               Ward         0       0.94    0.42
(X − min)/(max − min)     Ward         0       0.48    0.45
GNPR θ = 0                Ward         1       0       0.47
GNPR θ = 1                Ward         0       0.91    0.72
GNPR θ*                   Ward         1       0.92    1

X                         k-means++    0       0.90    0.44
(X − µX)/σX               k-means++    0       0.91    0.45
(X − min)/(max − min)     k-means++    0.11    0.55    0.47
GNPR θ = 0                k-means++    1       0       0.53
GNPR θ = 1                k-means++    0.06    0.99    0.80
GNPR θ*                   k-means++    1       0.99    1
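For readers unfamiliar with the score used in this table, here is a self-contained sketch of the Adjusted Rand Index computed from the contingency counts of two labelings, using only the Python standard library. The function name is illustrative; it assumes at least two labelled items.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """ARI between two partitions of the same items (n >= 2)."""
    n = len(labels_true)
    # Pairs of items grouped together in both / in each partition.
    cells = Counter(zip(labels_true, labels_pred))
    rows = Counter(labels_true)
    cols = Counter(labels_pred)
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())
    # Agreement expected by chance, and the largest attainable agreement.
    expected = sum_rows * sum_cols / comb(n, 2)
    maximum = 0.5 * (sum_rows + sum_cols)
    if maximum == expected:
        return 1.0  # degenerate case, e.g. a single cluster on both sides
    return (sum_cells - expected) / (maximum - expected)


same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
partial = adjusted_rand_index([0, 0, 1, 1], [0, 0, 1, 2])
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
print(same, partial, crossed)
```

A score of 1 means the partition matches the ground truth up to a relabelling of the clusters, and a score around 0 means agreement no better than chance, which is how the 0 and 1 entries of the table should be read.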


HCMapper: Compare Hierarchical Clustering


“Western sovereigns” cluster


Conclusion: Take Home Message

GNPR is a way to deal separately with

dependence information,

distribution information,

without losing any.

Avenues for research:

better aggregation: generalized means?

consistency proof?

any idea of interesting random walks outside finance?

Check out www.datagrapple.com to follow our R&D stuff and news about the CDS market!


Internships

If interested, please contact

[email protected]

[email protected]

for further details and application.


References

[BDVL08] Shai Ben-David and Ulrike Von Luxburg, Relating clustering stability to properties of cluster boundaries, COLT, vol. 2008, 2008, pp. 379–390.

[BDVLP06] Shai Ben-David, Ulrike Von Luxburg, and David Pal, A sober look at clustering stability, Learning theory, Springer, 2006, pp. 5–19.

[BHEG01] Asa Ben-Hur, Andre Elisseeff, and Isabelle Guyon, A stability based method for discovering structure in clustered data, Pacific symposium on biocomputing, vol. 7, 2001, pp. 6–17.

[LRBB04] Tilman Lange, Volker Roth, Mikio L Braun, and Joachim M Buhmann, Stability-based validation of clustering solutions, Neural computation 16 (2004), no. 6, 1299–1323.


How to find θ*?

Clustering stability using perturbations due to:

bootstrap

time-sliding window

draws from an oracle

stability pros [BHEG01, LRBB04]

stability cons [BDVLP06, BDVL08]

[Figure: "Accuracy landscape", accuracy as a function of theta and T]

[Figure: "Accuracy and stability of clustering using GNPR", ARI (0.4 to 1.0) against θ (0.0 to 1.0), with curves for accuracy, bootstrap stability, time stability and cross-datasets stability]


Properties of dθ

Property

$\forall \theta \in [0, 1], \ 0 \le d_\theta \le 1$

Property

For $0 < \theta < 1$, $d_\theta$ is a metric

For $\theta \in \{0, 1\}$, it is not:

$U \sim \mathcal{U}[0, 1] \ne 1 - U$, yet $d_0(U, 1 - U) = 0$

$V \ne 2V$, yet $d_1(V, 2V) = 0$
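Both counterexamples can be checked numerically. The sketch below assumes empirical estimators that are not part of the slides: normalized ranks for d1 and a shared-grid histogram for d0 (function names and the bin count are illustrative).

```python
import numpy as np


def normalized_ranks(x):
    return (np.argsort(np.argsort(x)) + 1) / len(x)


def d1_squared(x, y):
    """Dependence part: 3 E[|G_X(X) - G_Y(Y)|^2] on normalized ranks."""
    diff = normalized_ranks(x) - normalized_ranks(y)
    return float(3.0 * np.mean(diff ** 2))


def d0_squared(x, y, n_bins=20):
    """Distribution part: histogram estimate of squared Hellinger."""
    lo, hi = min(x.min(), y.min()), max(x.max(), y.max())
    edges = np.linspace(lo, hi, n_bins + 1)
    p = np.histogram(x, bins=edges)[0] / len(x)
    q = np.histogram(y, bins=edges)[0] / len(y)
    return float(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))


rng = np.random.default_rng(4)
u = rng.uniform(size=100_000)
v = rng.standard_normal(1000)

# U and 1 - U: same distribution, opposite comovement.
d0_sq_u = d0_squared(u, 1.0 - u)
d1_sq_u = d1_squared(u, 1.0 - u)

# V and 2V: identical comovement, different distributions.
d1_sq_v = d1_squared(v, 2.0 * v)
d0_sq_v = d0_squared(v, 2.0 * v)
print(d0_sq_u, d1_sq_u, d1_sq_v, d0_sq_v)
```

Each extreme value of θ ignores one of the two kinds of information, so distinct variables can sit at distance zero. Any θ strictly between 0 and 1 mixes both terms and separates these pairs.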


Description of the testing datasets

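For $1 \le i \le N = pK \prod_{s=1}^{S} K_s$, we define

$$X_i = \sum_{s=1}^{S} \sum_{k=1}^{K_s} \beta^s_{k,i} Y^s_k + \sum_{k=1}^{K} \alpha_{k,i} Z^i_k, \qquad (5)$$

where

a) $\alpha_{k,i} = 1$ if $i \equiv k - 1 \pmod{K}$, 0 otherwise;

b) $\beta^s_k \in [0, 1]$;

c) $\beta^s_{k,i} = \beta^s_k$ if $\lceil i K_s / N \rceil = k$, 0 otherwise.

$(X_i)_{i=1}^{N}$ are partitioned into $Q = K \prod_{s=1}^{S} K_s$ clusters of p random variables each.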
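A sketch of how such a dataset could be generated, restricted to a single dependence level (S = 1) for brevity. The slide does not state the laws of Y and Z, so the Gaussian common factors and the two noise distributions below are assumptions, as are the function and parameter names.

```python
import numpy as np


def make_dataset(p, K, K1, T, betas, samplers, rng):
    """Equation (5) with S = 1: N = p * K * K1 series of length T."""
    N = p * K * K1
    # One common factor Y_k per dependence cluster (assumed Gaussian).
    Y = rng.standard_normal((K1, T))
    X = np.empty((N, T))
    labels = np.empty(N, dtype=int)
    for i in range(1, N + 1):
        k_dep = -(-i * K1 // N)   # ceil(i * K1 / N), in 1..K1
        k_dist = (i % K) + 1      # the k such that i = k - 1 (mod K)
        # Idiosyncratic term Z_k^i drawn from the k-th distribution.
        Z = samplers[k_dist - 1](rng, T)
        X[i - 1] = betas[k_dep - 1] * Y[k_dep - 1] + Z
        # A cluster is a (dependence group, distribution) pair.
        labels[i - 1] = (k_dep - 1) * K + (k_dist - 1)
    return X, labels


def gaussian_noise(rng, T):
    return rng.standard_normal(T)


def laplace_noise(rng, T):
    return rng.laplace(size=T)


rng = np.random.default_rng(5)
X, labels = make_dataset(p=5, K=2, K1=3, T=250, betas=[0.3, 0.6, 0.9],
                         samplers=[gaussian_noise, laplace_noise], rng=rng)
print(X.shape, np.bincount(labels))
```

With these settings the generator returns 30 series split into Q = K × K1 = 6 ground-truth clusters of p = 5 series each, matching the partition count stated above.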


OTC data flow processing


Conjecture: consistency of clustering random walks

[Figure: ARI (0.4 to 1.0) as a function of T (0 to 2000) and N (200 to 600)]

[Figure: "Clustering convergence to the ground-truth partition", ARI (0.0 to 1.0) against T (0 to 2000), with curves for clustering distribution (θ = 0), clustering dependence (θ = 1) and clustering total information (θ = θ*)]
