Scaling Multivariate Statistics to Massive Data Algorithmic problems and approaches Alexander Gray Georgia Institute of Technology www.fast-lab.org


Post on 27-Mar-2015


Page 1: Scaling Multivariate Statistics to Massive Data Algorithmic problems and approaches Alexander Gray Georgia Institute of Technology

Scaling Multivariate Statistics to Massive Data

Algorithmic problems and approaches

Alexander Gray

Georgia Institute of Technology

www.fast-lab.org

Page 2

Core methods of statistics / machine learning / mining

1. Querying: spherical range-search O(N), orthogonal range-search O(N), spatial join O(N^2), nearest-neighbor O(N), all-nearest-neighbors O(N^2)

2. Density estimation: mixture of Gaussians, kernel density estimation O(N^2), kernel conditional density estimation O(N^3)

3. Regression: linear regression, kernel regression O(N^2), Gaussian process regression O(N^3)

4. Classification: decision tree, nearest-neighbor classifier O(N^2), nonparametric Bayes classifier O(N^2), support vector machine O(N^3)

5. Dimension reduction: principal component analysis, non-negative matrix factorization, kernel PCA O(N^3), maximum variance unfolding O(N^3)

6. Outlier detection: by density estimation or dimension reduction

7. Clustering: by density estimation or dimension reduction, k-means, mean-shift segmentation O(N^2), hierarchical clustering O(N^3)

8. Time series analysis: Kalman filter, hidden Markov model, trajectory tracking O(N^n)

9. Feature selection and causality: LASSO, L1 SVM, Gaussian graphical models, discrete graphical models

10. Fusion and matching: sequence alignment, bipartite matching O(N^3), n-point correlation 2-sample testing O(N^n)

Page 3

Now pretty fast (2011)…

1. Querying: spherical range-search O(logN)*, orthogonal range-search O(logN)*, spatial join O(N)*, nearest-neighbor O(logN), all-nearest-neighbors O(N)

2. Density estimation: mixture of Gaussians, kernel density estimation O(N), kernel conditional density estimation O(N^log3)*

3. Regression: linear regression, kernel regression O(N), Gaussian process regression O(N)*

4. Classification: decision tree, nearest-neighbor classifier O(N), nonparametric Bayes classifier O(N)*, support vector machine

5. Dimension reduction: principal component analysis, non-negative matrix factorization, kernel PCA O(N)*, maximum variance unfolding O(N)*

6. Outlier detection: by density estimation or dimension reduction

7. Clustering: by density estimation or dimension reduction, k-means, mean-shift segmentation O(N), hierarchical clustering O(NlogN)

8. Time series analysis: Kalman filter, hidden Markov model, trajectory tracking O(N^logn)*

9. Feature selection and causality: LASSO, L1 SVM, Gaussian graphical models, discrete graphical models

10. Fusion and matching: sequence alignment, bipartite matching O(N)**, n-point correlation 2-sample testing O(N^logn)*

Page 4

Things we made fast

fastest, fastest in some settings

1. Querying: spherical range-search O(logN)*, orthogonal range-search O(logN)*, spatial join O(N)*, nearest-neighbor O(logN), all-nearest-neighbors O(N)

2. Density estimation: mixture of Gaussians, kernel density estimation O(N), kernel conditional density estimation O(N^log3)*

3. Regression: linear regression, kernel regression O(N), Gaussian process regression O(N)*

4. Classification: decision tree, nearest-neighbor classifier O(N), nonparametric Bayes classifier O(N)*, support vector machine O(N)/O(N^2)

5. Dimension reduction: principal component analysis, non-negative matrix factorization, kernel PCA O(N)*, maximum variance unfolding O(N)*

6. Outlier detection: by density estimation or dimension reduction

7. Clustering: by density estimation or dimension reduction, k-means, mean-shift segmentation O(N), hierarchical (FoF) clustering O(NlogN)

8. Time series analysis: Kalman filter, hidden Markov model, trajectory tracking O(N^logn)*

9. Feature selection and causality: LASSO, L1 SVM, Gaussian graphical models, discrete graphical models

10. Fusion and matching: sequence alignment, bipartite matching O(N)**, n-point correlation 2-sample testing O(N^logn)*

Page 5

Core computational problems

What are the basic mathematical operations making things hard?

• Alternative to speeding up each of the 1000s of statistical methods: treat common computational bottlenecks

• Divide up the space of problems (and associated algorithmic strategies), so we can examine the unique challenges and possible ways forward within each

Page 6

The “7 Giants” of data

1. Basic statistics

2. Generalized N-body problems

3. Graph-theoretic problems

4. Linear-algebraic problems

5. Optimizations

6. Integrations

7. Alignment problems

Page 7

The “7 Giants” of data

1. Basic statistics

• e.g. counts, contingency tables, means, medians, variances, range queries (SQL queries)

2. Generalized N-body problems

• e.g. nearest-nbrs (in NLDR, etc), kernel summations (in KDE, GP, SVM, etc), clustering, MST, spatial correlations

Page 8

The “7 Giants” of data

3. Graph-theoretic problems

• e.g. betweenness centrality, commute distance, graphical model inference

4. Linear-algebraic problems

• e.g. linear algebra, PCA, Gaussian process regression, manifold learning

5. Optimizations

• e.g. LP/QP/SDP/SOCP/MINLPs in regularized methods, compressed sensing

Page 9

The “7 Giants” of data

6. Integrations

• e.g. Bayesian inference

7. Alignment problems

• e.g. BLAST in genomics, string matching, phylogenies, SLAM, cross-match

Page 10

Back to our list

basic, N-body, graphs, linear algebra, optimization, integration, alignment

1. Querying: spherical range-search O(N), orthogonal range-search O(N), spatial join O(N^2), nearest-neighbor O(N), all-nearest-neighbors O(N^2)

2. Density estimation: mixture of Gaussians, kernel density estimation O(N^2), kernel conditional density estimation O(N^3)

3. Regression: linear regression, kernel regression O(N^2), Gaussian process regression O(N^3)

4. Classification: decision tree, nearest-neighbor classifier O(N^2), nonparametric Bayes classifier O(N^2), support vector machine O(N^3)

5. Dimension reduction: principal component analysis, non-negative matrix factorization, kernel PCA O(N^3), maximum variance unfolding O(N^3)

6. Outlier detection: by density estimation or dimension reduction

7. Clustering: by density estimation or dimension reduction, k-means, mean-shift segmentation O(N^2), hierarchical clustering O(N^3)

8. Time series analysis: Kalman filter, hidden Markov model, trajectory tracking O(N^n)

9. Feature selection and causality: LASSO, L1 SVM, Gaussian graphical models, discrete graphical models

10. Fusion and matching: sequence alignment, bipartite matching O(N^3), n-point correlation 2-sample testing O(N^n)

Page 11

5 settings

1. “Regular”: batch, in-RAM/core, one CPU

2. Streaming (non-batch)

3. Disk (out-of-core)

4. Distributed: threads/multi-core (shared memory)

5. Distributed: clusters/cloud (distributed memory)

Page 12

4 common data types

1. Vector data, iid

2. Time series

3. Images

4. Graphs

Page 13

3 desiderata

1. Fast experimental runtime/performance*

2. Fast theoretic (provable) runtime/performance*

3. Accuracy guarantees

*Performance: runtime, memory, communication, disk accesses; time-constrained, anytime, etc.

Page 14

7 general solution strategies

1. Divide and conquer (indexing structures)

2. Dynamic programming

3. Function transforms

4. Random sampling (Monte Carlo)

5. Non-random sampling (active learning)

6. Parallelism

7. Problem reduction

Page 15

1. Summary statistics

• Examples: counts, contingency tables, means, medians, variances, range queries (SQL queries)

• What’s unique/challenges: streaming, new guarantees

• Promising/interesting:
– Sketching approaches
– AD-trees
– MapReduce/Hadoop (Aster, Greenplum, Netezza)
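
To make the streaming setting concrete for the simplest summary statistics, here is a minimal sketch of a one-pass mean and variance using Welford's update. The class name `RunningStats` and the sample values are illustrative assumptions, not taken from the talk, and this is not one of the sketching approaches the slide refers to. It only shows that counts, means, and variances can be maintained in constant memory as data arrives.

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    """One-pass (streaming) count, mean, and variance via Welford's update."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the current mean

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        # Unbiased sample variance; undefined for fewer than two points.
        return self.m2 / (self.n - 1) if self.n > 1 else float("nan")


stats = RunningStats()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:  # illustrative stream
    stats.update(value)
print(stats.n, stats.mean, stats.variance)
```

Exact medians do not admit this kind of constant-memory update, which is one reason approximate summaries are of interest in the streaming setting.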

Page 16

2. Generalized N-body problems

• Examples: nearest-nbrs (in NLDR, etc), kernel summations (in KDE, GP, SVM, etc), clustering, MST, spatial correlations

• What’s unique/challenges: general dimension, non-Euclidean, new guarantees (e.g. in rank)

• Promising/interesting:
– Generalized/higher-order FMM: O(N^2) → O(N)
– Random projections
– GPUs
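
As an illustration of how an indexing structure (strategy 1, divide and conquer) speeds up the simplest N-body problem, nearest-neighbor search, the following is a minimal kd-tree sketch checked against brute force. It is not the generalized FMM or any specific algorithm from the talk; the function names (`build`, `nearest`, `brute_force`) and the random 2-D data are assumptions made for the example.

```python
import random
from typing import List, Optional, Tuple

Point = Tuple[float, ...]


class Node:
    def __init__(self, point: Point, axis: int,
                 left: "Optional[Node]", right: "Optional[Node]") -> None:
        self.point, self.axis, self.left, self.right = point, axis, left, right


def build(points: List[Point], depth: int = 0) -> Optional[Node]:
    """Recursively split on the median along a cycling coordinate axis."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return Node(points[mid], axis,
                build(points[:mid], depth + 1),
                build(points[mid + 1:], depth + 1))


def sq_dist(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(node: Optional[Node], query: Point,
            best: Optional[Point] = None) -> Optional[Point]:
    """Depth-first search that prunes subtrees unable to beat the current best."""
    if node is None:
        return best
    if best is None or sq_dist(query, node.point) < sq_dist(query, best):
        best = node.point
    diff = query[node.axis] - node.point[node.axis]
    near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    best = nearest(near, query, best)
    # Visit the far side only if the splitting plane is closer than the best point.
    if diff ** 2 < sq_dist(query, best):
        best = nearest(far, query, best)
    return best


def brute_force(points: List[Point], query: Point) -> Point:
    """O(N) scan per query, the baseline the tree is meant to beat."""
    return min(points, key=lambda p: sq_dist(query, p))


rng = random.Random(0)
data = [(rng.random(), rng.random()) for _ in range(1000)]  # illustrative data
tree = build(data)
q = (0.5, 0.5)
print(sq_dist(q, nearest(tree, q)) == sq_dist(q, brute_force(data, q)))
```

In low dimension the pruning step skips most of the tree, so a query touches roughly O(log N) nodes; the pruning becomes much less effective as dimension grows, which is the "general dimension" challenge named above.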

Page 17

3. Graph-theoretic problems

• Examples: betweenness centrality, commute dist, graphical model inference

• What’s unique/challenges: high interconnectivity (cliques), out-of-core

• Promising/interesting:
– Variational methods
– Stochastic composite likelihood methods
– MapReduce/Hadoop (Facebook, etc)

Page 18

4. Linear-algebraic problems

• Examples: linear algebra, PCA, Gaussian process regression, manifold learning

• What’s unique/challenges: probabilistic guarantees, kernel matrices

• Promising/interesting:
– Sampling-based methods
– Online methods
– Approximate matrix-vector multiply via N-body

Page 19

5. Optimizations

• Examples: LP/QP/SDP/SOCP/MINLPs in regularized methods, compressed sensing

• What’s unique/challenges: stochastic programming, streaming

• Promising/interesting:
– Reformulations/relaxations of various ML forms
– Online, mini-batch methods
– Parallel online methods
– Submodular functions
– Global optimization (non-convex)
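
To show what an online, mini-batch method looks like in its simplest form, here is a sketch of mini-batch stochastic gradient descent fitting a one-dimensional least-squares line. The synthetic data, the function name `minibatch_sgd`, and the step size and batch size are all assumptions chosen for the example, not settings from the talk.

```python
import random

rng = random.Random(0)
# Synthetic data from y = 3x + 1 plus small noise (illustrative only).
xs = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
data = [(x, 3.0 * x + 1.0 + rng.gauss(0.0, 0.1)) for x in xs]


def minibatch_sgd(data, batch_size=32, epochs=20, lr=0.1, seed=1):
    """Fit y ~ w*x + b by mini-batch stochastic gradient descent on squared error."""
    shuffler = random.Random(seed)
    w, b = 0.0, 0.0
    data = list(data)
    for _ in range(epochs):
        shuffler.shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Gradient of the mean squared error over this batch only.
            grad_w = sum(2.0 * (w * x + b - y) * x for x, y in batch) / len(batch)
            grad_b = sum(2.0 * (w * x + b - y) for x, y in batch) / len(batch)
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b


w, b = minibatch_sgd(data)
print(round(w, 2), round(b, 2))
```

Each update reads only `batch_size` points, so the per-step cost does not depend on N, which is what makes this family attractive in the streaming and out-of-core settings.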

Page 20

6. Integrations

• Examples: Bayesian inference

• What’s unique/challenges: general dimension

• Promising/interesting:
– MCMC
– ABC
– Particle filtering
– Adaptive importance sampling, active learning
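
As a small worked instance of integration by MCMC, the sketch below runs a random-walk Metropolis sampler on a one-parameter Bayesian posterior and compares the estimated posterior mean with the conjugate closed form. The model (standard normal prior, unit-variance Gaussian likelihood), the observations, and the tuning values are assumptions picked so the answer can be checked; none of them come from the talk.

```python
import math
import random

observations = [1.2, 0.8, 1.0, 1.4, 0.6]  # illustrative data, not from the talk


def log_posterior(mu: float) -> float:
    """Unnormalized log posterior: N(0, 1) prior, unit-variance Gaussian likelihood."""
    log_prior = -0.5 * mu * mu
    log_lik = -0.5 * sum((x - mu) ** 2 for x in observations)
    return log_prior + log_lik


def metropolis(n_samples: int, step: float = 0.8, seed: int = 0):
    """Random-walk Metropolis: propose mu + noise, accept with prob min(1, ratio)."""
    rng = random.Random(seed)
    mu, logp = 0.0, log_posterior(0.0)
    samples = []
    for _ in range(n_samples):
        proposal = mu + rng.gauss(0.0, step)
        logp_new = log_posterior(proposal)
        if rng.random() < math.exp(min(0.0, logp_new - logp)):
            mu, logp = proposal, logp_new
        samples.append(mu)
    return samples


samples = metropolis(20000)[2000:]  # discard the first 2000 draws as burn-in
estimate = sum(samples) / len(samples)
exact = sum(observations) / (len(observations) + 1)  # conjugate closed form
print(estimate, exact)
```

Here the integral is one-dimensional and has a closed form, so the sampler is only a demonstration; the "general dimension" challenge above is where such methods are actually needed.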

Page 21

7. Alignments

• Examples: BLAST in genomics, string matching, phylogenies, SLAM, cross-match

• What’s unique/challenges: greater heterogeneity, measurement errors

• Promising/interesting:
– Probabilistic representations
– Reductions to generalized N-body problems
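
For the string-matching example, the classical exact approach is dynamic programming (strategy 2 in the list of general solution strategies). The sketch below computes Levenshtein edit distance between two strings; it is a textbook illustration under that assumption, not BLAST and not a method attributed to the talk, and the function name `edit_distance` is chosen for the example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance by dynamic programming, O(len(a) * len(b)) time."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # match or substitute
        prev = curr
    return prev[-1]


print(edit_distance("kitten", "sitting"))
```

The quadratic cost per pair is what makes exact alignment expensive at scale and motivates heuristics and the reductions listed above.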

Page 22

Reductions/transformations between problems

• Gaussian graphical models → linear algebra

• Bayesian integration → MAP optimization

• Euclidean graphs → N-body problems

• Linear algebra on kernel matrices → N-body inside conjugate gradient

• Can featurize a graph or any other structure → matrix-based ML problem

• Create new ML methods with different computational properties
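
The reduction of kernel-matrix linear algebra to N-body computation inside conjugate gradient can be sketched as follows. Conjugate gradient needs only the map v → Av, so a regularized kernel system can be solved without storing the kernel matrix, with each matrix-vector product computed as a kernel summation. In this sketch the summation is the naive exact O(N^2) loop; a fast approximate N-body method would replace `kernel_matvec`. The Gaussian kernel, bandwidth, ridge term, and data are assumptions for the example.

```python
import math
import random


def kernel_matvec(points, v, bandwidth=0.5, ridge=0.1):
    """Return (K + ridge*I) v with K[i][j] = exp(-(xi - xj)^2 / (2 h^2)).

    Each output entry is a kernel summation over all points, computed
    without ever storing K.
    """
    out = []
    for i, xi in enumerate(points):
        total = ridge * v[i]
        for xj, vj in zip(points, v):
            total += math.exp(-((xi - xj) ** 2) / (2.0 * bandwidth ** 2)) * vj
        out.append(total)
    return out


def conjugate_gradient(matvec, b, tol=1e-10, max_iter=200):
    """Solve A x = b for symmetric positive definite A given only v -> A v."""
    x = [0.0] * len(b)
    r = list(b)  # residual b - A x for the starting point x = 0
    p = list(r)
    rs_old = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        if rs_old < tol:  # stop once the squared residual norm is small
            break
        ap = matvec(p)
        alpha = rs_old / sum(pi * api for pi, api in zip(p, ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rs_new = sum(ri * ri for ri in r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


rng = random.Random(0)
points = [rng.uniform(0.0, 3.0) for _ in range(60)]  # illustrative 1-D inputs
targets = [math.sin(2.0 * x) for x in points]


def matvec(v):
    return kernel_matvec(points, v)


coeffs = conjugate_gradient(matvec, targets)
residual = [a - t for a, t in zip(matvec(coeffs), targets)]
print(max(abs(e) for e in residual))
```

Because the solver touches the matrix only through `matvec`, swapping in a faster summation changes the cost per iteration without changing the solver.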

Page 23

General conclusions

• Algorithms can dramatically change the runtime order, e.g. O(N^2) to O(N)

• High dimensionality is a persistent challenge

• The non-default (e.g. streaming, disk…) settings need more research work

• Systems issues need more work, e.g. connection to data storage/management

• Hadoop does not solve everything

Page 24

General conclusions

• No general theory for the tradeoff between statistical quality and computational cost (lower/upper bounds, etc)

• More aspects of hardness (statistical and computational) are needed