big data's big secret
TRANSCRIPT
BIG DATA’S DIRTY SECRET Harvey J. Stein
Head, Quantitative Risk Analytics Bloomberg
Machine Learning in Finance Workshop April 20, 2018
Bloomberg
OUTLINE
1. Introduction
2. Symptoms
3. Hole filling
4. Bad data detection
5. Summary
6. References
2
Introduction
Go gle "big data" OR "machine l'earning"
All News Images Videos
About 99,900,000 results (0. 75 seconds
Bloomberg
LOTS OF BIG DATA
Big data is big news!
3
Although I guess that’s not so surprising...
Go gle "president trump"
All News Images Videos
Bloomberg
TRUMPS TRUMP
Almost twice as popular as “President Trump”!
4
Go gle "president trump"
All News Images Videos
Bloomberg
TRUMPS TRUMP
Almost twice as popular as “President Trump”!
Although I guess that’s not so surprising...
4
Bloomberg
FAKE NEWS
But big data analysis doesn’t mean better data analysis I More variables I More outliers I More noise I More spurious results
Conclusion? I Data needs to be cleaned
We will discuss data anomalies and methods for cleaning data
5
Bloomberg
ACKNOWLEDGEMENTS
Work on data cleaning with: I Mario Bondioli I Jan Dash I Xipei Yang I Yan Zhang
6
Symptoms
Bloomberg
THE DATA
We worked with credit default swap (CDS) spread data I Spread = cost (in bp) of insuring against default of a given company
for a given time period I Quoted for 6 month, 1 year, 2 year, 3 year, 5 year, 7 year and 10
year horizons I Quoted for 1,000s of different individual companies I Quoted both for senior and subordinated debt I Consider market close data
7
Bloomberg
EXAMPLE
01/08 01/09 01/10 01/11 01/12 01/13 01/14 01/15 01/16-200
0
200
400
600
800
1000
1200
1400AVP Senior USD: Hybrid Angle/Distance
6mo
1yr
2yr
3yr
4yr
5yr
7yr
10yr
8
Bloomberg
DATA ISSUES
General data quality issues I Missing values I Bad values
Clean for a purpose I Relative valuation I Mark to market I Trading strategy development I Risk analysis
Risk I Missing data points
I Problematic return calculations I Problematic covariance calculations
I Bad values I Bad returns I Bad variances
9
Bloomberg
CDS DATA ISSUES
CDS data specific characteristics: I 6 month point missing for first 2.5 years I Often large range of values I High volatility makes detecting bad values difficult I Data used for risk analysis
I Deleting outliers reduces risk measures I Leaving anomalies inflates risk measures
10
Bloomberg
TYPICAL APPROACHES
Hole filling I Regression I Interpolation I Flat filling
Anomaly detection I Comparison to trailing volatility I Cluster analysis I Neural networks I Statistics-sensitive Non-linear Iterative Peak (SNIP) clipping
algorithm
11
Hole filling
Bloomberg
OVERVIEW
Hole filling Overview I Use Multi-channel Singular Spectrum Analysis (MSSA) hole filling
algorithm I Variant of Singular Spectrum Analysis (SSA) used simultaneously on
multiple time series I Decomposes each time series into a sum of components, one for
each eigenvector I Borrowed from geophysical data analysis I Makes use of both space relationships (covariance) and time
relationships (autocovariance and cross-autocovariance)
12
Bloomberg
SSA Uses:
I Inspect eigenvectors and components to extract specific features of data
I Smooth data by throwing away small eigenvalues I Helpful for stabilizing correlation calculations (smooth data then
compute) References:
I A beginner’s guide to SSA, Claessen and Groth, [CG] I Singular spectrum analysis, Wikipedia, [Wik16] I Analysis of Time Series Structure: SSA and Related
Techniques, Golyandina, Nekrutkin, and Zhigljavsky, [GNZ01] I A review on singular spectrum analysis for economic and
financial time series, Hassani and Thomakos, [HT10] I SSA, Random Matrix Theory, and Noise-Reduced Correlations,
Dash et al., [Das+16a] I Stable Reduced-Noise ’Macro’ SSA–Based Correlations for
Long-Term Counterparty Risk Management, Dash et al., [Das+16b]
13
Bloomberg
MSSA
Multi-channel Singular Spectrum Analysis (MSSA): I Applies SSA algorithm to a set of time series simultaneously
Uses: I Same as SSA, but takes relationships between different time series
into account I Used for forecasting
References: I Multivariate singular spectrum analysis for forecasting
revisions to real-time data, Patterson et al., [Pat+11] I Multivariate singular spectrum analysis: A general view and
new vector forecasting approach, Hassani and Mahmoudvand, [HM13]
I Advanced spectral methods for climatic time series, Ghil et al., [Ghi+02]
14
Bloomberg
MSSA BASED HOLE FILLING
MSSA hole filling algorithm: I Nominally fill holes (e.g. via interpolation) I Use level l hole filling algorithm for l = l0:
I Run MSSA algorithm I Replace holes with MSSA reconstruction using l biggest singular
values I Repeat until convergence
I Increment l by one and repeat until adding singular values doesn’t have much impact and used enough singular values
References: I Spatio-temporal filling of missing points in geophysical data
sets, Kondrashov and Ghil, [KG06]
15
Bloomberg
MIXED RESULTS Unfortunately, it doesn’t always work:
03/07 05/07 07/07 09/07 11/07 01/08-10
0
10
20
30
40
50
60
70
80NAB Senior USD: Original MSSA hole filling
6mo
1yr
2yr
3yr
4yr
5yr
7yr
10yr
16
Bloomberg
OBSERVATIONS
Observations: I Sometimes MSSA doesn’t line up with actual data I Sometimes MSSA bottoms out I Using too few singular values will smooth the data
Solutions: I Anchoring – patch in data in a more consistent fashion I Reparameterization – working in log space I Adjusting MSSA parameters I Avoid filling large gaps
17
Bloomberg
ANCHORING
Holes are replaced with MSSA partial reconstruction
I Can yield bias if remaining components shift results
Instead I Patch in differences relative to endpoints I Can be additive or multiplicative I One-sided holes need special treatment
18
Bloomberg
REPARAMETERIZATION
MSSA hole filling is like a fixed point algorithm I Trying to find points which match reconstruction I Similar to constrained optimization
Apply classic optimization techniques I Transform problem to eliminate constraints I Work in log space if values must be positive I Log space also helps to handle changes in magnitude
Fast drop-off of eigenvalues is evidence that working in log space is the right thing
19
Bloomberg
ADJUSTING MSSA PARAMETERS
Many parameters to adjust I Lag I Max/Min number of EVs I Max/Min percentage of sum of EVs I Measure of convergence
Smoothing caused by fast drop-off of EVs I Max/Min percentage ineffective I Can add more EVs, but leads to instability
20
r--Af I 1 . ..,,\
\,
,-,~- 'v\ ,~,-~ < •:, . '-_ ___,
Bloomberg
NEW RESULTS After adjustments NAB:
03/07 05/07 07/07 09/07 11/07 01/08-10
0
10
20
30
40
50
60
70
80
90NAB Senior USD: Angle/distance/MSSA spike detection, all SVs
6mo
1yr
2yr
3yr
4yr
5yr
7yr
10yr
21
Bad data detection
Bloomberg
BAD DATA
How to handle bad data? I Detect it I Remove it I In our case, replace it
22
Bloomberg
BAD DATA DETECTION
Many algorithms I Statistical – compare to statistical properties (like trailing SD) I Data science – clustering I Neural networks
References I Outlier Detection Techniques, Kriegel, Kroger, and Zimek, [KKZ10] I Detecting Local Outliers in Financial Time Series, Verhoeven and
McAleer, [VM] I Outlier Analysis, Aggarwal, [Agg13] I Algorithms for Mining Distance-Based Outliers in Large Datasets,
Knorr and Ng, [KN98] I Data Mining and Knowledge Discovery Handbook: A Complete
Guide for Practitioners and Researchers, Ben-Gal, [BG05] I An online spike detection and spike classification algorithm capable of
instantaneous resolution of overlapping spikes, Franke et al., [Fra+10] I A Survey of Outlier Detection Methodologies, Hodge and Austin,
[HA04]
23
Bloomberg
DIFFICULTIES
Regime changes and changing volatility
01/14 07/14 01/15 07/15 01/16 07/16-200
0
200
400
600
800
1000
1200
1400
1600
1800GNW Senior USD: Hybrid Angle/Distance
6mo
1yr
2yr
3yr
4yr
5yr
7yr
10yr
24
Bloomberg
HYBRID APPROACH
Data science approach – Cluster analysis I Angle-based I Distance-based
Hybrid approach I Run clustering on a windowed basis (in a neighborhood of each
point) I Combine MSSA with clustering I Remove points using analysis, then put them back if MSSA
reconstructs them close enough
Conservative approach I Do both angle and distance-based combined with MSSA I If both algorithms agree, then it’s really an anomaly
25
. · ...
Bloomberg
DISTANCE-BASED EXAMPLE
26
73~-----------------~
72
71
70
6Q
68
67
• •
no outlier 66 +----~~-~-~-~-~-~-~~----,
31 3.2 33 34 35 36 37 38 3ll 40 41
Bloomberg
ANGLE-BASED EXAMPLE
Angle-based, no outlier:
27
73
• n • • 71 • •
• • 70
• 00
68
151 outlier 66
31 32 33 34 35 36 37 38 311 «l 41
Bloomberg
ANGLE-BASED EXAMPLE
Angle-based outlier:
28
Bloomberg
RESULTS Filling of large holes
01/09 03/09 05/09 07/09 09/09 11/09 01/10-20
0
20
40
60
80
100
120
140
160COP Senior USD: Hybrid Angle/Distance
6mo
1yr
2yr
3yr
4yr
5yr
7yr
10yr
29
Bloomberg
RESULTS Ignoring regime changes
09/11 01/12 05/12 09/12 01/13 05/13-200
0
200
400
600
800
1000
1200CHK Senior USD: Hybrid Angle/Distance
6mo
1yr
2yr
3yr
4yr
5yr
7yr
10yr
30
X
Bloomberg
0
0 0
0 0 0
RESULTS Detecting and correcting bad data
01/15 03/15 05/15 07/15 09/15 11/15 01/160
0.5
1
1.5
2
2.5104 JNY Senior USD: Hybrid Angle/Distance
6mo
1yr
2yr
3yr
4yr
5yr
7yr
10yr
31
i
Bloomberg
.. 2012
_ LC240AS
_ LC2SOAS
_ LC280AS
_ LC290AS
LC300AS _ LC320AS
LC330AS
LC340AS
LC350AS
RESULTS
Even works on CMO OASs!
32
Summary
Bloomberg
SUMMARY
Moral of the story 1. Know your data!
I Bad data = bad results I Big data increases need for data cleaning I Look at your data!
2. Know its usage! I Cleaning must respect usage of data
3. Algorithms will often not work as advertised! I Your data can be different I Your data usage can be different
4. Expect substantial work modifying and adjusting algorithms I Tuning I Modifying algorithms I Combining algorithms I Performance must be inspected
33
SANFRANCISCO sAOP>WLO TOKYO .Pfesslhe<HELP> +496912041210 +n221neooo +442073307500 +12123112000 +14151122tl0 +55113(),114500 +0&2121000 +11332011900 ktYIW!uforinscanr -------------------------------- 1,.,.ns,stance.
Bloomberg
Thank you!
Harvey J. Stein
© 2018 Bloomberg Finance L.P. All rights reserved.
34
References
Bloomberg
REFERENCES
[Agg13] C. C. Aggarwal. Outlier Analysis. Springer, 2013.
[BG05] I. Ben-Gal. Data Mining and Knowledge Discovery Handbook: A Complete Guide for Practitioners and Researchers. Kluwer Academic Publishers, 2005.
[CG] David Claessen and Andreas Groth. A beginner’s guide to SSA. CERES-ERTI, Ecole Normale Superieure. url: http://environnement.ens.fr/IMG/file/DavidPDF/ SSA_beginners_guide_v9.pdf.
[Das+16a] Jan W Dash et al. SSA, Random Matrix Theory, and Noise-Reduced Correlations. Tech. rep. Bloomberg LP, Sept. 2016. url: https://ssrn.com/abstract=2808027.
35
Bloomberg
REFERENCES [Das+16b] Jan W Dash et al. Stable Reduced-Noise ’Macro’
SSA–Based Correlations for Long-Term Counterparty Risk Management. Tech. rep. Bloomberg LP, May 2016. url: https://ssrn.com/abstract=2808015.
[Fra+10] F. Franke et al. “An online spike detection and spike classification algorithm capable of instantaneous resolution of overlapping spikes.” In: J Comput Neurosci (2010), pp. 127–148.
[Ghi+02] Michael Ghil et al. “Advanced spectral methods for climatic time series.” In: Reviews of geophysics 40.1 (2002).
[GNZ01] Nina Golyandina, Vladimir Nekrutkin, and Anatoly A Zhigljavsky. Analysis of Time Series Structure: SSA and Related Techniques. CRC Monographs on Statistics & Applied Probability. Chapman & Hall, 2001.
[HA04] V. J. Hodge and J. Austin. A Survey of Outlier Detection Methodologies. Kluwer Academic Publishers., 2004.
36
Bloomberg
REFERENCES
[HM13] Hossein Hassani and Rahim Mahmoudvand. “Multivariate singular spectrum analysis: A general view and new vector forecasting approach.” In: International Journal of Energy and Statistics 1.01 (2013), pp. 55–83.
[HT10] Hossein Hassani and Dimitrios Thomakos. “A review on singular spectrum analysis for economic and financial time series.” In: Statistics and Its Interface 3.3 (2010), pp. 377–397. issn: 1938-7989.
[KG06] Dmitri Kondrashov and Michael Ghil. “Spatio-temporal filling of missing points in geophysical data sets.” In: Nonlinear Processes in Geophysics 13.2 (2006), pp. 151–159.
[KKZ10] H.-P. Kriegel, P. Kroger, and A. Zimek. “Outlier Detection Techniques.” In: The 2010 SIAM International Conference on Data Mining, 2010.
37
Bloomberg
REFERENCES [KN98] Edwin M. Knorr and Raymond T. Ng. “Algorithms for
Mining Distance-Based Outliers in Large Datasets.” In: Proceedings of the 24th VLDB Conference New York, 1998.
[Pat+11] Kerry Patterson et al. “Multivariate singular spectrum analysis for forecasting revisions to real-time data.” In: Journal of Applied Statistics 38.10 (2011), pp. 2183–2211.
[VM] P. Verhoeven and M. McAleer. Detecting Local Outliers in Financial Time Series. Department of Economics, University of Western Australia.
[Wik16] Wikipedia. Singular spectrum analysis. [Online; accessed 9-December-2016]. 2016. url: https://en.wikipedia.org/w/index.php?title= Singular_spectrum_analysis.
38