scaling multivariate statistics to massive data algorithmic problems and approaches alexander gray...

Scaling Multivariate Statistics to Massive Data

Algorithmic problems and approaches

Alexander GrayGeorgia Institute of Technology

www.fast-lab.org

Core methods ofstatistics / machine learning / mining

1. Querying: spherical range-search O(N), orthogonal range-search O(N), spatial join O(N2), nearest-neighbor O(N), all-nearest-neighbors O(N2)

2. Density estimation: mixture of Gaussians, kernel density estimation O(N2), kernel conditional density estimation O(N3)

3. Regression: linear regression, kernel regression O(N2), Gaussian process regression O(N3)

4. Classification: decision tree, nearest-neighbor classifier O(N2), nonparametric Bayes classifier O(N2), support vector machine O(N3)

5. Dimension reduction: principal component analysis, non-negative matrix factorization, kernel PCA O(N3), maximum variance unfolding O(N3)

6. Outlier detection: by density estimation or dimension reduction7. Clustering: by density estimation or dimension reduction, k-means, mean-

shift segmentation O(N2), hierarchical clustering O(N3)8. Time series analysis: Kalman filter, hidden Markov model, trajectory

tracking O(Nn)9. Feature selection and causality: LASSO, L1 SVM, Gaussian graphical

models, discrete graphical models10.Fusion and matching: sequence alignment, bipartite matching O(N3), n-

point correlation 2-sample testing O(Nn)

Now pretty fast (2011)…

1. Querying: spherical range-search O(logN)*, orthogonal range-search O(logN)*, spatial join O(N)*, nearest-neighbor O(logN), all-nearest-neighbors O(N)

2. Density estimation: mixture of Gaussians, kernel density estimation O(N), kernel conditional density estimation O(Nlog3)*

3. Regression: linear regression, kernel regression O(N), Gaussian process regression O(N)*

4. Classification: decision tree, nearest-neighbor classifier O(N), nonparametric Bayes classifier O(N)*, support vector machine

5. Dimension reduction: principal component analysis, non-negative matrix factorization, kernel PCA O(N)*, maximum variance unfolding O(N)*

shift segmentation O(N), hierarchical clustering O(NlogN)8. Time series analysis: Kalman filter, hidden Markov model, trajectory

tracking O(Nlogn)*9. Feature selection and causality: LASSO, L1 SVM, Gaussian graphical

models, discrete graphical models10.Fusion and matching: sequence alignment, bipartite matching O(N)**, n-

point correlation 2-sample testing O(Nlogn)*

Things we made fastfastest, fastest in some settings

1. Querying: spherical range-search O(logN)*, orthogonal range-search O(logN)*, spatial join O(N)*, nearest-neighbor O(logN), all-nearest-neighbors O(N)

2. Density estimation: mixture of Gaussians, kernel density estimation O(N), kernel conditional density estimation O(Nlog3)*

3. Regression: linear regression, kernel regression O(N), Gaussian process regression O(N)*

4. Classification: decision tree, nearest-neighbor classifier O(N), nonparametric Bayes classifier O(N)*, support vector machine O(N)/O(N2)

5. Dimension reduction: principal component analysis, non-negative matrix factorization, kernel PCA O(N)*, maximum variance unfolding O(N)*

shift segmentation O(N), hierarchical (FoF) clustering O(NlogN)8. Time series analysis: Kalman filter, hidden Markov model, trajectory

tracking O(Nlogn)*9. Feature selection and causality: LASSO, L1 SVM, Gaussian graphical

models, discrete graphical models10.Fusion and matching: sequence alignment, bipartite matching O(N)**, n-

point correlation 2-sample testing O(Nlogn)*

Core computational problems

What are the basic mathematical operations making things hard?

• Alternative to speeding up each of the 1000s of statistical methods: treat common computational bottlenecks

• Divide up the space of problems (and associated algorithmic strategies), so we can examine the unique challenges and possible ways forward within each

The “7 Giants” of data

1. Basic statistics

2. Generalized N-body problems

3. Graph-theoretic problems

4. Linear-algebraic problems

5. Optimizations

6. Integrations

7. Alignment problems

1. Basic statistics•e.g. counts, contingency tables, means, medians, variances, range queries (SQL queries)

2. Generalized N-body problems•e.g. nearest-nbrs (in NLDR, etc), kernel summations (in KDE, GP, SVM, etc), clustering, MST, spatial correlations

3. Graph-theoretic problems•e.g. betweenness centrality, commute distance, graphical model inference

4. Linear-algebraic problems•e.g. linear algebra, PCA, Gaussian process regression, manifold learning

5. Optimizations•e.g. LP/QP/SDP/SOCP/MINLPs in regularized methods, compressed sensing

6. Integrations•e.g. Bayesian inference

7. Alignment problems•e.g. BLAST in genomics, string matching, phylogenies, SLAM, cross-match

Back to our listbasic, N-body, graphs, linear algebra, optimization, integration, alignment