big data's big secret

45
BIG DATA’S DIRTY SECRET Harvey J. Stein Head, Quantitative Risk Analytics Bloomberg Machine Learning in Finance Workshop April 20, 2018

Upload: others

Post on 05-Oct-2021

7 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Big Data's Big Secret

BIG DATA’S DIRTY SECRET Harvey J. Stein

Head, Quantitative Risk Analytics Bloomberg

Machine Learning in Finance Workshop April 20, 2018

Page 2: Big Data's Big Secret

Bloomberg

OUTLINE

1. Introduction

2. Symptoms

3. Hole filling

4. Bad data detection

5. Summary

6. References

2

Page 3: Big Data's Big Secret

Introduction

Page 4: Big Data's Big Secret

Go gle "big data" OR "machine l'earning"

All News Images Videos

About 99,900,000 results (0. 75 seconds

Bloomberg

LOTS OF BIG DATA

Big data is big news!

3

Page 5: Big Data's Big Secret

Although I guess that’s not so surprising...

Go gle "president trump"

All News Images Videos

Bloomberg

TRUMPS TRUMP

Almost twice as popular as “President Trump”!

4

Page 6: Big Data's Big Secret

Go gle "president trump"

All News Images Videos

Bloomberg

TRUMPS TRUMP

Almost twice as popular as “President Trump”!

Although I guess that’s not so surprising...

4

Page 7: Big Data's Big Secret

Bloomberg

FAKE NEWS

But big data analysis doesn’t mean better data analysis I More variables I More outliers I More noise I More spurious results

Conclusion? I Data needs to be cleaned

We will discuss data anomalies and methods for cleaning data

5

Page 8: Big Data's Big Secret

Bloomberg

ACKNOWLEDGEMENTS

Work on data cleaning with: I Mario Bondioli I Jan Dash I Xipei Yang I Yan Zhang

6

Page 9: Big Data's Big Secret

Symptoms

Page 10: Big Data's Big Secret

Bloomberg

THE DATA

We worked with credit default swap (CDS) spread data I Spread = cost (in bp) of insuring against default of a given company

for a given time period I Quoted for 6 month, 1 year, 2 year, 3 year, 5 year, 7 year and 10

year horizons I Quoted for 1,000s of different individual companies I Quoted both for senior and subordinated debt I Consider market close data

7

Page 11: Big Data's Big Secret

Bloomberg

EXAMPLE

01/08 01/09 01/10 01/11 01/12 01/13 01/14 01/15 01/16-200

0

200

400

600

800

1000

1200

1400AVP Senior USD: Hybrid Angle/Distance

6mo

1yr

2yr

3yr

4yr

5yr

7yr

10yr

8

Page 12: Big Data's Big Secret

Bloomberg

DATA ISSUES

General data quality issues I Missing values I Bad values

Clean for a purpose I Relative valuation I Mark to market I Trading strategy development I Risk analysis

Risk I Missing data points

I Problematic return calculations I Problematic covariance calculations

I Bad values I Bad returns I Bad variances

9

Page 13: Big Data's Big Secret

Bloomberg

CDS DATA ISSUES

CDS data specific characteristics: I 6 month point missing for first 2.5 years I Often large range of values I High volatility makes detecting bad values difficult I Data used for risk analysis

I Deleting outliers reduces risk measures I Leaving anomalies inflates risk measures

10

Page 14: Big Data's Big Secret

Bloomberg

TYPICAL APPROACHES

Hole filling I Regression I Interpolation I Flat filling

Anomaly detection I Comparison to trailing volatility I Cluster analysis I Neural networks I Statistics-sensitive Non-linear Iterative Peak (SNIP) clipping

algorithm

11

Page 15: Big Data's Big Secret

Hole filling

Page 16: Big Data's Big Secret

Bloomberg

OVERVIEW

Hole filling Overview I Use Multi-channel Singular Spectrum Analysis (MSSA) hole filling

algorithm I Variant of Singular Spectrum Analysis (SSA) used simultaneously on

multiple time series I Decomposes each time series into a sum of components, one for

each eigenvector I Borrowed from geophysical data analysis I Makes use of both space relationships (covariance) and time

relationships (autocovariance and cross-autocovariance)

12

Page 17: Big Data's Big Secret

Bloomberg

SSA Uses:

I Inspect eigenvectors and components to extract specific features of data

I Smooth data by throwing away small eigenvalues I Helpful for stabilizing correlation calculations (smooth data then

compute) References:

I A beginner’s guide to SSA, Claessen and Groth, [CG] I Singular spectrum analysis, Wikipedia, [Wik16] I Analysis of Time Series Structure: SSA and Related

Techniques, Golyandina, Nekrutkin, and Zhigljavsky, [GNZ01] I A review on singular spectrum analysis for economic and

financial time series, Hassani and Thomakos, [HT10] I SSA, Random Matrix Theory, and Noise-Reduced Correlations,

Dash et al., [Das+16a] I Stable Reduced-Noise ’Macro’ SSA–Based Correlations for

Long-Term Counterparty Risk Management, Dash et al., [Das+16b]

13

Page 18: Big Data's Big Secret

Bloomberg

MSSA

Multi-channel Singular Spectrum Analysis (MSSA): I Applies SSA algorithm to a set of time series simultaneously

Uses: I Same as SSA, but takes relationships between different time series

into account I Used for forecasting

References: I Multivariate singular spectrum analysis for forecasting

revisions to real-time data, Patterson et al., [Pat+11] I Multivariate singular spectrum analysis: A general view and

new vector forecasting approach, Hassani and Mahmoudvand, [HM13]

I Advanced spectral methods for climatic time series, Ghil et al., [Ghi+02]

14

Page 19: Big Data's Big Secret

Bloomberg

MSSA BASED HOLE FILLING

MSSA hole filling algorithm: I Nominally fill holes (e.g. via interpolation) I Use level l hole filling algorithm for l = l0:

I Run MSSA algorithm I Replace holes with MSSA reconstruction using l biggest singular

values I Repeat until convergence

I Increment l by one and repeat until adding singular values doesn’t have much impact and used enough singular values

References: I Spatio-temporal filling of missing points in geophysical data

sets, Kondrashov and Ghil, [KG06]

15

Page 20: Big Data's Big Secret

Bloomberg

MIXED RESULTS Unfortunately, it doesn’t always work:

03/07 05/07 07/07 09/07 11/07 01/08-10

0

10

20

30

40

50

60

70

80NAB Senior USD: Original MSSA hole filling

6mo

1yr

2yr

3yr

4yr

5yr

7yr

10yr

16

Page 21: Big Data's Big Secret

Bloomberg

OBSERVATIONS

Observations: I Sometimes MSSA doesn’t line up with actual data I Sometimes MSSA bottoms out I Using too few singular values will smooth the data

Solutions: I Anchoring – patch in data in a more consistent fashion I Reparameterization – working in log space I Adjusting MSSA parameters I Avoid filling large gaps

17

Page 22: Big Data's Big Secret

Bloomberg

ANCHORING

Holes are replaced with MSSA partial reconstruction

I Can yield bias if remaining components shift results

Instead I Patch in differences relative to endpoints I Can be additive or multiplicative I One-sided holes need special treatment

18

Page 23: Big Data's Big Secret

Bloomberg

REPARAMETERIZATION

MSSA hole filling is like a fixed point algorithm I Trying to find points which match reconstruction I Similar to constrained optimization

Apply classic optimization techniques I Transform problem to eliminate constraints I Work in log space if values must be positive I Log space also helps to handle changes in magnitude

Fast drop-off of eigenvalues is evidence that working in log space is the right thing

19

Page 24: Big Data's Big Secret

Bloomberg

ADJUSTING MSSA PARAMETERS

Many parameters to adjust I Lag I Max/Min number of EVs I Max/Min percentage of sum of EVs I Measure of convergence

Smoothing caused by fast drop-off of EVs I Max/Min percentage ineffective I Can add more EVs, but leads to instability

20

Page 25: Big Data's Big Secret

r--Af I 1 . ..,,\

\,

,-,~- 'v\ ,~,-~ < •:, . '-_ ___,

Bloomberg

NEW RESULTS After adjustments NAB:

03/07 05/07 07/07 09/07 11/07 01/08-10

0

10

20

30

40

50

60

70

80

90NAB Senior USD: Angle/distance/MSSA spike detection, all SVs

6mo

1yr

2yr

3yr

4yr

5yr

7yr

10yr

21

Page 26: Big Data's Big Secret

Bad data detection

Page 27: Big Data's Big Secret

Bloomberg

BAD DATA

How to handle bad data? I Detect it I Remove it I In our case, replace it

22

Page 28: Big Data's Big Secret

Bloomberg

BAD DATA DETECTION

Many algorithms I Statistical – compare to statistical properties (like trailing SD) I Data science – clustering I Neural networks

References I Outlier Detection Techniques, Kriegel, Kroger, and Zimek, [KKZ10] I Detecting Local Outliers in Financial Time Series, Verhoeven and

McAleer, [VM] I Outlier Analysis, Aggarwal, [Agg13] I Algorithms for Mining Distance-Based Outliers in Large Datasets,

Knorr and Ng, [KN98] I Data Mining and Knowledge Discovery Handbook: A Complete

Guide for Practitioners and Researchers, Ben-Gal, [BG05] I An online spike detection and spike classification algorithm capable of

instantaneous resolution of overlapping spikes, Franke et al., [Fra+10] I A Survey of Outlier Detection Methodologies, Hodge and Austin,

[HA04]

23

Page 29: Big Data's Big Secret

Bloomberg

DIFFICULTIES

Regime changes and changing volatility

01/14 07/14 01/15 07/15 01/16 07/16-200

0

200

400

600

800

1000

1200

1400

1600

1800GNW Senior USD: Hybrid Angle/Distance

6mo

1yr

2yr

3yr

4yr

5yr

7yr

10yr

24

Page 30: Big Data's Big Secret

Bloomberg

HYBRID APPROACH

Data science approach – Cluster analysis I Angle-based I Distance-based

Hybrid approach I Run clustering on a windowed basis (in a neighborhood of each

point) I Combine MSSA with clustering I Remove points using analysis, then put them back if MSSA

reconstructs them close enough

Conservative approach I Do both angle and distance-based combined with MSSA I If both algorithms agree, then it’s really an anomaly

25

Page 31: Big Data's Big Secret

. · ...

Bloomberg

DISTANCE-BASED EXAMPLE

26

Page 32: Big Data's Big Secret

73~-----------------~

72

71

70

6Q

68

67

• •

no outlier 66 +----~~-~-~-~-~-~-~~----,

31 3.2 33 34 35 36 37 38 3ll 40 41

Bloomberg

ANGLE-BASED EXAMPLE

Angle-based, no outlier:

27

Page 33: Big Data's Big Secret

73

• n • • 71 • •

• • 70

• 00

68

151 outlier 66

31 32 33 34 35 36 37 38 311 «l 41

Bloomberg

ANGLE-BASED EXAMPLE

Angle-based outlier:

28

Page 34: Big Data's Big Secret

Bloomberg

RESULTS Filling of large holes

01/09 03/09 05/09 07/09 09/09 11/09 01/10-20

0

20

40

60

80

100

120

140

160COP Senior USD: Hybrid Angle/Distance

6mo

1yr

2yr

3yr

4yr

5yr

7yr

10yr

29

Page 35: Big Data's Big Secret

Bloomberg

RESULTS Ignoring regime changes

09/11 01/12 05/12 09/12 01/13 05/13-200

0

200

400

600

800

1000

1200CHK Senior USD: Hybrid Angle/Distance

6mo

1yr

2yr

3yr

4yr

5yr

7yr

10yr

30

Page 36: Big Data's Big Secret

X

Bloomberg

0

0 0

0 0 0

RESULTS Detecting and correcting bad data

01/15 03/15 05/15 07/15 09/15 11/15 01/160

0.5

1

1.5

2

2.5104 JNY Senior USD: Hybrid Angle/Distance

6mo

1yr

2yr

3yr

4yr

5yr

7yr

10yr

31

Page 37: Big Data's Big Secret

i

Bloomberg

.. 2012

_ LC240AS

_ LC2SOAS

_ LC280AS

_ LC290AS

LC300AS _ LC320AS

LC330AS

LC340AS

LC350AS

RESULTS

Even works on CMO OASs!

32

Page 38: Big Data's Big Secret

Summary

Page 39: Big Data's Big Secret

Bloomberg

SUMMARY

Moral of the story 1. Know your data!

I Bad data = bad results I Big data increases need for data cleaning I Look at your data!

2. Know its usage! I Cleaning must respect usage of data

3. Algorithms will often not work as advertised! I Your data can be different I Your data usage can be different

4. Expect substantial work modifying and adjusting algorithms I Tuning I Modifying algorithms I Combining algorithms I Performance must be inspected

33

Page 40: Big Data's Big Secret

SANFRANCISCO sAOP>WLO TOKYO .Pfesslhe<HELP> +496912041210 +n221neooo +442073307500 +12123112000 +14151122tl0 +55113(),114500 +0&2121000 +11332011900 ktYIW!uforinscanr -------------------------------- 1,.,.ns,stance.

Bloomberg

Thank you!

Harvey J. Stein

[email protected]

© 2018 Bloomberg Finance L.P. All rights reserved.

34

Page 41: Big Data's Big Secret

References

Page 42: Big Data's Big Secret

Bloomberg

REFERENCES

[Agg13] C. C. Aggarwal. Outlier Analysis. Springer, 2013.

[BG05] I. Ben-Gal. Data Mining and Knowledge Discovery Handbook: A Complete Guide for Practitioners and Researchers. Kluwer Academic Publishers, 2005.

[CG] David Claessen and Andreas Groth. A beginner’s guide to SSA. CERES-ERTI, Ecole Normale Superieure. url: http://environnement.ens.fr/IMG/file/DavidPDF/ SSA_beginners_guide_v9.pdf.

[Das+16a] Jan W Dash et al. SSA, Random Matrix Theory, and Noise-Reduced Correlations. Tech. rep. Bloomberg LP, Sept. 2016. url: https://ssrn.com/abstract=2808027.

35

Page 43: Big Data's Big Secret

Bloomberg

REFERENCES [Das+16b] Jan W Dash et al. Stable Reduced-Noise ’Macro’

SSA–Based Correlations for Long-Term Counterparty Risk Management. Tech. rep. Bloomberg LP, May 2016. url: https://ssrn.com/abstract=2808015.

[Fra+10] F. Franke et al. “An online spike detection and spike classification algorithm capable of instantaneous resolution of overlapping spikes.” In: J Comput Neurosci (2010), pp. 127–148.

[Ghi+02] Michael Ghil et al. “Advanced spectral methods for climatic time series.” In: Reviews of geophysics 40.1 (2002).

[GNZ01] Nina Golyandina, Vladimir Nekrutkin, and Anatoly A Zhigljavsky. Analysis of Time Series Structure: SSA and Related Techniques. CRC Monographs on Statistics & Applied Probability. Chapman & Hall, 2001.

[HA04] V. J. Hodge and J. Austin. A Survey of Outlier Detection Methodologies. Kluwer Academic Publishers., 2004.

36

Page 44: Big Data's Big Secret

Bloomberg

REFERENCES

[HM13] Hossein Hassani and Rahim Mahmoudvand. “Multivariate singular spectrum analysis: A general view and new vector forecasting approach.” In: International Journal of Energy and Statistics 1.01 (2013), pp. 55–83.

[HT10] Hossein Hassani and Dimitrios Thomakos. “A review on singular spectrum analysis for economic and financial time series.” In: Statistics and Its Interface 3.3 (2010), pp. 377–397. issn: 1938-7989.

[KG06] Dmitri Kondrashov and Michael Ghil. “Spatio-temporal filling of missing points in geophysical data sets.” In: Nonlinear Processes in Geophysics 13.2 (2006), pp. 151–159.

[KKZ10] H.-P. Kriegel, P. Kroger, and A. Zimek. “Outlier Detection Techniques.” In: The 2010 SIAM International Conference on Data Mining, 2010.

37

Page 45: Big Data's Big Secret

Bloomberg

REFERENCES [KN98] Edwin M. Knorr and Raymond T. Ng. “Algorithms for

Mining Distance-Based Outliers in Large Datasets.” In: Proceedings of the 24th VLDB Conference New York, 1998.

[Pat+11] Kerry Patterson et al. “Multivariate singular spectrum analysis for forecasting revisions to real-time data.” In: Journal of Applied Statistics 38.10 (2011), pp. 2183–2211.

[VM] P. Verhoeven and M. McAleer. Detecting Local Outliers in Financial Time Series. Department of Economics, University of Western Australia.

[Wik16] Wikipedia. Singular spectrum analysis. [Online; accessed 9-December-2016]. 2016. url: https://en.wikipedia.org/w/index.php?title= Singular_spectrum_analysis.

38