hazy: making statistical applications easier to build and maintain
DESCRIPTION
Hazy: Making Statistical Applications Easier to Build and Maintain. Christopher Ré BigLearn Collaborators listed throughout. Big data is the future. Big data is great for vendors and consulting $$$, but is ‘Big’ the heart of the problem?. How big is `big’?. Is it a GB? - PowerPoint PPT PresentationTRANSCRIPT
![Page 1: Hazy: Making Statistical Applications Easier to Build and Maintain](https://reader035.vdocuments.net/reader035/viewer/2022062517/5681341c550346895d9b09bc/html5/thumbnails/1.jpg)
Hazy: Making Statistical Applications Easier to Build and Maintain
Christopher RéBigLearn
Collaborators listed throughout
![Page 2: Hazy: Making Statistical Applications Easier to Build and Maintain](https://reader035.vdocuments.net/reader035/viewer/2022062517/5681341c550346895d9b09bc/html5/thumbnails/2.jpg)
Big data is the future.
Big data is great for vendors and consulting $$$, but is ‘Big’ the heart of the problem?
![Page 3: Hazy: Making Statistical Applications Easier to Build and Maintain](https://reader035.vdocuments.net/reader035/viewer/2022062517/5681341c550346895d9b09bc/html5/thumbnails/3.jpg)
How big is `big’?Is it a GB?
That fits on your phone… but LeCun & Farabet has same
spirit.
Is it a TB?TB is big on a Hadoop…but is main memory in many
apps
Is it a PB?Do you have that problem?
NB: I love Hadoop, Y! and Cloudera.
“Let’s end Peta-philia” – John Doyle
Maybe something other than size is the common thread?
![Page 4: Hazy: Making Statistical Applications Easier to Build and Maintain](https://reader035.vdocuments.net/reader035/viewer/2022062517/5681341c550346895d9b09bc/html5/thumbnails/4.jpg)
Big Shifts Underlying Big Data
(1)More signal (data) beats a more complex model nine times out of ten. -- This is why we acquire big data.
(2)Move computation to the data – not data to to the computation.
My bet: The key is the ability to quickly sling signal together – where data live.
![Page 5: Hazy: Making Statistical Applications Easier to Build and Maintain](https://reader035.vdocuments.net/reader035/viewer/2022062517/5681341c550346895d9b09bc/html5/thumbnails/5.jpg)
Hazy: Making Statistical Applications Easier to Build and Maintain
![Page 6: Hazy: Making Statistical Applications Easier to Build and Maintain](https://reader035.vdocuments.net/reader035/viewer/2022062517/5681341c550346895d9b09bc/html5/thumbnails/6.jpg)
6
Two Trends that Drive Hazy1. Data in unprecedented number of formats
Hazy = statistical + data management techniques
2. Arms race for deeper understanding of data
Statistical tools attack both 1. and 2.
Hazy’s Goal: Understand common patterns in deploying & maintaining statistical tools on data.
![Page 7: Hazy: Making Statistical Applications Easier to Build and Maintain](https://reader035.vdocuments.net/reader035/viewer/2022062517/5681341c550346895d9b09bc/html5/thumbnails/7.jpg)
7
Hazy’s Thesis
The next breakthrough in data analysis may not be a new data analysis algorithm…
…but may be in the ability to rapidly combine, deploy, and maintain existing algorithms.
![Page 8: Hazy: Making Statistical Applications Easier to Build and Maintain](https://reader035.vdocuments.net/reader035/viewer/2022062517/5681341c550346895d9b09bc/html5/thumbnails/8.jpg)
8
Outline
Three Application Areas for Hazy
Victor/Bismarck (in Parallel RDBMS)
HogWild! (Shared Memory)
Other Hazy Projects
![Page 9: Hazy: Making Statistical Applications Easier to Build and Maintain](https://reader035.vdocuments.net/reader035/viewer/2022062517/5681341c550346895d9b09bc/html5/thumbnails/9.jpg)
9
Data constantly generated on the Web, Twitter, Blogs, and Facebook
Extract and Classify sentiment about products, ad campaigns, and customer facing entities.
Often deployed statistical tools: Extraction (CRFs) & Classification (SVM).
DARPA Machine Reading: “List members of the Brazilian Olympic Team in this corpus with years of membership”
![Page 10: Hazy: Making Statistical Applications Easier to Build and Maintain](https://reader035.vdocuments.net/reader035/viewer/2022062517/5681341c550346895d9b09bc/html5/thumbnails/10.jpg)
Simplified View of MR Data Flow
Entity Linking Slot Filling
Infrastructure (Felix)
ClueWeb Freebase News Corpus
Mention and EntityFeatures
Pairs of EntityFeatures
Goals: High quality. Low Effort. Easy to test.
500M Docs or 15TB
![Page 11: Hazy: Making Statistical Applications Easier to Build and Maintain](https://reader035.vdocuments.net/reader035/viewer/2022062517/5681341c550346895d9b09bc/html5/thumbnails/11.jpg)
Two Key Features our system, Felix1. Felix allows many best-in-breed algorithms to
work jointly together. - Extraction (CRF), Classification (SVM)
2. Simple SQL-like rule-language (Markov Logic)– Felix allows us to leverage Wikipedia, Freebase,
ClueWeb in one high-level language (TBs of data)
Key: Marry simple robust statistical tools with flexible rules – and be able to scale to TBs.
TAC-KBP F1=39. SotA Systems F1=20
Feng Niu, Ce Zhang, Josh Slauson:www.cs.wisc.edu/hazy/felix
![Page 12: Hazy: Making Statistical Applications Easier to Build and Maintain](https://reader035.vdocuments.net/reader035/viewer/2022062517/5681341c550346895d9b09bc/html5/thumbnails/12.jpg)
12
Rough Demo!
![Page 13: Hazy: Making Statistical Applications Easier to Build and Maintain](https://reader035.vdocuments.net/reader035/viewer/2022062517/5681341c550346895d9b09bc/html5/thumbnails/13.jpg)
A physicist interpolates sensor readings and uses regression to more deeply understand their data
![Page 14: Hazy: Making Statistical Applications Easier to Build and Maintain](https://reader035.vdocuments.net/reader035/viewer/2022062517/5681341c550346895d9b09bc/html5/thumbnails/14.jpg)
14
Digital Optical Module (DOM)
![Page 15: Hazy: Making Statistical Applications Easier to Build and Maintain](https://reader035.vdocuments.net/reader035/viewer/2022062517/5681341c550346895d9b09bc/html5/thumbnails/15.jpg)
15
Workflow of IceCube
In Ice: Detection occurs.
At Pole: Algorithm says “Interesting!”
In Madison: Lots of data analysis.
Via satellite: Interesting DOM readings
![Page 16: Hazy: Making Statistical Applications Easier to Build and Maintain](https://reader035.vdocuments.net/reader035/viewer/2022062517/5681341c550346895d9b09bc/html5/thumbnails/16.jpg)
16
A Key Step: Detecting Track
Mathematical structure used to track neutrinos similar to labeling text.
Mark’s code to the South Pole
in 2012.
Mark Wellons, Ben Recht, Francis Halzen, and Gary Hill
Rules + Simple Regression
NB: Many data analysts are ex-physicists
(Experts: Both are Convex Programs)
![Page 17: Hazy: Making Statistical Applications Easier to Build and Maintain](https://reader035.vdocuments.net/reader035/viewer/2022062517/5681341c550346895d9b09bc/html5/thumbnails/17.jpg)
17
OCR and Speech
Getting text is challenging! (statistical model errors)
A social scientist wants to extract the frequency of synonyms of English words in 18th century texts.
Output of speech and OCR models similar to text-labeling models (transducers)
OCR & Speech
Social Scientists @ UW
Want to unify content management systems with RDBMS.
![Page 18: Hazy: Making Statistical Applications Easier to Build and Maintain](https://reader035.vdocuments.net/reader035/viewer/2022062517/5681341c550346895d9b09bc/html5/thumbnails/18.jpg)
18
Application Takeaways
Statistical processing on large data enables a wide variety of new applications.
Algorithm (IGMs) that solves bold models close to data in (1) an RDBMS & (2) Shared-Memory.
Goal: Develop the ability to rapidly combine, deploy, and maintain existing algorithms.
![Page 19: Hazy: Making Statistical Applications Easier to Build and Maintain](https://reader035.vdocuments.net/reader035/viewer/2022062517/5681341c550346895d9b09bc/html5/thumbnails/19.jpg)
19
Outline
Two Application Areas for Hazy
Victor/Bismarck (in Parallel RDBMS)
HogWild! (Shared Memory)
Other Hazy Projects
![Page 20: Hazy: Making Statistical Applications Easier to Build and Maintain](https://reader035.vdocuments.net/reader035/viewer/2022062517/5681341c550346895d9b09bc/html5/thumbnails/20.jpg)
20
Classify publications by subject area
![Page 21: Hazy: Making Statistical Applications Easier to Build and Maintain](https://reader035.vdocuments.net/reader035/viewer/2022062517/5681341c550346895d9b09bc/html5/thumbnails/21.jpg)
21
Example: Linear Models
Label papers as DB Papers or Non-DB Papers
2. Classify via plane
1 2
3
45
DB Papers
Non-DB Papers
1. Map each papers to Rdx
Question is: How do we pick a good plane (x)?
![Page 22: Hazy: Making Statistical Applications Easier to Build and Maintain](https://reader035.vdocuments.net/reader035/viewer/2022062517/5681341c550346895d9b09bc/html5/thumbnails/22.jpg)
22
Example: Linear Models
Instead: change f to a smooth function of distance to plane.e.g., the squared distance (least squares), hinge loss (svm), log loss (logistic regression). Different f = different model.
Input: Labeled papers. Each point labeled as DB (+) or not (-)
+ -
-
-+
x says DB Papers
x says Non-DB
xIdea: Let’s score each plane
Idea: minimize misclassification.
yi is a paper vector and its label
![Page 23: Hazy: Making Statistical Applications Easier to Build and Maintain](https://reader035.vdocuments.net/reader035/viewer/2022062517/5681341c550346895d9b09bc/html5/thumbnails/23.jpg)
Framework: Inverse Problems
Examples: 1. Paper Classification: yi is the paper with its label
2. Neutrino Tracking: yi is a DOM (sensor) reading
3. CRFs: yi is (document, labeling)
4. Netflix: yi is (user,movie,rating), Claim: General data analysis technique that is no more
difficult to compute than a SQL AVG.
x the model
yia data item
f scores the error
P Enforces prior
a
Experts: f and P are convex
![Page 24: Hazy: Making Statistical Applications Easier to Build and Maintain](https://reader035.vdocuments.net/reader035/viewer/2022062517/5681341c550346895d9b09bc/html5/thumbnails/24.jpg)
24
Background: Gradient Methods
Gradient Methods: Iterative. 1. Start at current x, 2. Take gradient at x, 3. Move in opposite direction
F(x)
![Page 25: Hazy: Making Statistical Applications Easier to Build and Maintain](https://reader035.vdocuments.net/reader035/viewer/2022062517/5681341c550346895d9b09bc/html5/thumbnails/25.jpg)
25
Incremental Gradient Methods
Select a single data item to approximate gradient
Gradient Methods: Iterative. 1. Start at current x, 2. Approximate gradient at x by selecting j in {1…N}, 3. Move in opposite direction
![Page 26: Hazy: Making Statistical Applications Easier to Build and Maintain](https://reader035.vdocuments.net/reader035/viewer/2022062517/5681341c550346895d9b09bc/html5/thumbnails/26.jpg)
26
Incremental Gradient Methods (iGMs)
Why use iGMs? iGMs converge to an optimal for many problems, but the real reason is:
iGMs are fast.
Technical Connection: iGM processing isomorphic to computing a SQL AVG
(x is the accumulator, G is an expression on a single tuple)
Solve statistical models in an RDBMS for free
![Page 27: Hazy: Making Statistical Applications Easier to Build and Maintain](https://reader035.vdocuments.net/reader035/viewer/2022062517/5681341c550346895d9b09bc/html5/thumbnails/27.jpg)
27
Most RDBMS 3 Functions in a UDA1. Initialize (State)2. Transition(State, data)3. Terminate(State)
iGMs ≈ User Defined Aggregates
State in R2 : (# terms, running total)Transition( (n, T), d ) = (n, T) + (1,d) Terminate( (n,T) ) = T/n
AVG
State in Rd : the model xTransition( x, yj ) = x - G(x,yj) Terminate( x ) = x
Gradient
![Page 28: Hazy: Making Statistical Applications Easier to Build and Maintain](https://reader035.vdocuments.net/reader035/viewer/2022062517/5681341c550346895d9b09bc/html5/thumbnails/28.jpg)
28
Some subtle differences…
UDAs typically commutative and algebraic
IGMs are formally neither, but are morally both
Morally Algebraic: One can average models, so morally algebraic [Zinkevich et al 10].
Morally Commutative: Different orders give different exact result, but any order converges to same result.
Key: IGMs work (in parallel) off the shelf in a UDA. (This is how we do Logistic Regression on the Web)
MADLib & Oracle versions coming. Terminology from Gray et al
![Page 29: Hazy: Making Statistical Applications Easier to Build and Maintain](https://reader035.vdocuments.net/reader035/viewer/2022062517/5681341c550346895d9b09bc/html5/thumbnails/29.jpg)
29
Hogwild!
[Niu, Recht, Ré, & Wright NIPS 11] Code on Web.
![Page 30: Hazy: Making Statistical Applications Easier to Build and Maintain](https://reader035.vdocuments.net/reader035/viewer/2022062517/5681341c550346895d9b09bc/html5/thumbnails/30.jpg)
Other Hazy work• Dealing with data in different formats
– Staccato: Arun Kumar. Query OCR documents [PODS 10 & VLDB 12]– Hazy goes to the South Pole: Mark Wellons Trigger Software for IceCube
• Machine Learning Algorithms– Hogwild run SGD on convex problems with no locking [NIPS 11]– Jellyfish faster than Hogwild! on Matrix Completion [Opt Online 11]
• Maintain Supervised Techniques on Evolving Corpora– iClassify: M. Levent Koc. Maintain classifiers on evolving examples [VLDB 11]– CRFLex: Aaron Feng. Maintain CRFs on Evolving Corpora [ICDE 12]
• Populate Knowledge Bases from Text– Tuffy: SQL + weights to combine multiple models [VLDB 11]– Felix: Running Markov Logic on Millions of Documents [Preprint ARXIV 11]– Feng Niu, Ce Zhang, Josh Slauson,and Adel Ardalan
• Infrastructure with Mike Cafarella (Michigan)– Manimal: Optimizing MapReduce with Relational Operations [VLDB 11]
30
Student who did real work in red.
![Page 31: Hazy: Making Statistical Applications Easier to Build and Maintain](https://reader035.vdocuments.net/reader035/viewer/2022062517/5681341c550346895d9b09bc/html5/thumbnails/31.jpg)
31
Conclusion
Many statistical data analyses are no more complicated to compute than a SQL AVG.
Hogwild! approach to multicore parallelism has no locking, but good speedup
Code, data, and papers:www.cs.wisc.edu/hazy