hazy: making statistical applications easier to build and maintain

Hazy: Making Statistical Applications Easier to Build and Maintain

Christopher RéBigLearn

Collaborators listed throughout

Big data is the future.

Big data is great for vendors and consulting $$$, but is ‘Big’ the heart of the problem?

How big is `big’?Is it a GB?

That fits on your phone… but LeCun & Farabet has same

spirit.

Is it a TB?TB is big on a Hadoop…but is main memory in many

apps

Is it a PB?Do you have that problem?

NB: I love Hadoop, Y! and Cloudera.

“Let’s end Peta-philia” – John Doyle

Maybe something other than size is the common thread?

Big Shifts Underlying Big Data

(1)More signal (data) beats a more complex model nine times out of ten. -- This is why we acquire big data.

(2)Move computation to the data – not data to to the computation.

My bet: The key is the ability to quickly sling signal together – where data live.

Hazy: Making Statistical Applications Easier to Build and Maintain

6

Two Trends that Drive Hazy1. Data in unprecedented number of formats

Hazy = statistical + data management techniques

2. Arms race for deeper understanding of data

Statistical tools attack both 1. and 2.

Hazy’s Goal: Understand common patterns in deploying & maintaining statistical tools on data.

7

Hazy’s Thesis

The next breakthrough in data analysis may not be a new data analysis algorithm…

…but may be in the ability to rapidly combine, deploy, and maintain existing algorithms.

8

Outline

Three Application Areas for Hazy

Victor/Bismarck (in Parallel RDBMS)

HogWild! (Shared Memory)

Other Hazy Projects

9

Data constantly generated on the Web, Twitter, Blogs, and Facebook

Extract and Classify sentiment about products, ad campaigns, and customer facing entities.

Often deployed statistical tools: Extraction (CRFs) & Classification (SVM).

DARPA Machine Reading: “List members of the Brazilian Olympic Team in this corpus with years of membership”

Simplified View of MR Data Flow

Entity Linking Slot Filling

Infrastructure (Felix)

ClueWeb Freebase News Corpus

Mention and EntityFeatures

Pairs of EntityFeatures

Goals: High quality. Low Effort. Easy to test.

500M Docs or 15TB

Two Key Features our system, Felix1. Felix allows many best-in-breed algorithms to

work jointly together. - Extraction (CRF), Classification (SVM)

2. Simple SQL-like rule-language (Markov Logic)– Felix allows us to leverage Wikipedia, Freebase,

ClueWeb in one high-level language (TBs of data)

Key: Marry simple robust statistical tools with flexible rules – and be able to scale to TBs.

TAC-KBP F1=39. SotA Systems F1=20

Feng Niu, Ce Zhang, Josh Slauson:www.cs.wisc.edu/hazy/felix

12

Rough Demo!

A physicist interpolates sensor readings and uses regression to more deeply understand their data

14

Digital Optical Module (DOM)

15

Workflow of IceCube

In Ice: Detection occurs.

At Pole: Algorithm says “Interesting!”

In Madison: Lots of data analysis.

Via satellite: Interesting DOM readings

16

A Key Step: Detecting Track

Mathematical structure used to track neutrinos similar to labeling text.

Mark’s code to the South Pole

in 2012.

Mark Wellons, Ben Recht, Francis Halzen, and Gary Hill

Rules + Simple Regression

NB: Many data analysts are ex-physicists

(Experts: Both are Convex Programs)

17

OCR and Speech

Getting text is challenging! (statistical model errors)

A social scientist wants to extract the frequency of synonyms of English words in 18th century texts.

Output of speech and OCR models similar to text-labeling models (transducers)

OCR & Speech

Social Scientists @ UW

Want to unify content management systems with RDBMS.

18

Application Takeaways

Statistical processing on large data enables a wide variety of new applications.

Algorithm (IGMs) that solves bold models close to data in (1) an RDBMS & (2) Shared-Memory.

Goal: Develop the ability to rapidly combine, deploy, and maintain existing algorithms.

19

Outline

Two Application Areas for Hazy

Victor/Bismarck (in Parallel RDBMS)

HogWild! (Shared Memory)

Other Hazy Projects

20

Classify publications by subject area

21

Example: Linear Models

Label papers as DB Papers or Non-DB Papers

2. Classify via plane

1 2

3

45

DB Papers

Non-DB Papers

1. Map each papers to Rdx

Question is: How do we pick a good plane (x)?

22

Example: Linear Models

Instead: change f to a smooth function of distance to plane.e.g., the squared distance (least squares), hinge loss (svm), log loss (logistic regression). Different f = different model.

Input: Labeled papers. Each point labeled as DB (+) or not (-)

+ -

-

-+

x says DB Papers

x says Non-DB

xIdea: Let’s score each plane

Idea: minimize misclassification.

yi is a paper vector and its label

Framework: Inverse Problems

Examples: 1. Paper Classification: yi is the paper with its label

2. Neutrino Tracking: yi is a DOM (sensor) reading

3. CRFs: yi is (document, labeling)

4. Netflix: yi is (user,movie,rating), Claim: General data analysis technique that is no more

difficult to compute than a SQL AVG.

x the model

yia data item

f scores the error

P Enforces prior

a

Experts: f and P are convex

24

Background: Gradient Methods

Gradient Methods: Iterative. 1. Start at current x, 2. Take gradient at x, 3. Move in opposite direction

F(x)

25

Incremental Gradient Methods

Select a single data item to approximate gradient

Gradient Methods: Iterative. 1. Start at current x, 2. Approximate gradient at x by selecting j in {1…N}, 3. Move in opposite direction

26

Incremental Gradient Methods (iGMs)

Why use iGMs? iGMs converge to an optimal for many problems, but the real reason is:

iGMs are fast.

Technical Connection: iGM processing isomorphic to computing a SQL AVG

(x is the accumulator, G is an expression on a single tuple)

Solve statistical models in an RDBMS for free

27

Most RDBMS 3 Functions in a UDA1. Initialize (State)2. Transition(State, data)3. Terminate(State)

iGMs ≈ User Defined Aggregates

State in R2 : (# terms, running total)Transition( (n, T), d ) = (n, T) + (1,d) Terminate( (n,T) ) = T/n

AVG

State in Rd : the model xTransition( x, yj ) = x - G(x,yj) Terminate( x ) = x

Gradient

28

Some subtle differences…

UDAs typically commutative and algebraic

IGMs are formally neither, but are morally both

Morally Algebraic: One can average models, so morally algebraic [Zinkevich et al 10].

Morally Commutative: Different orders give different exact result, but any order converges to same result.

Key: IGMs work (in parallel) off the shelf in a UDA. (This is how we do Logistic Regression on the Web)

MADLib & Oracle versions coming. Terminology from Gray et al

29

Hogwild!

[Niu, Recht, Ré, & Wright NIPS 11] Code on Web.

Other Hazy work• Dealing with data in different formats

– Staccato: Arun Kumar. Query OCR documents [PODS 10 & VLDB 12]– Hazy goes to the South Pole: Mark Wellons Trigger Software for IceCube

• Machine Learning Algorithms– Hogwild run SGD on convex problems with no locking [NIPS 11]– Jellyfish faster than Hogwild! on Matrix Completion [Opt Online 11]

• Maintain Supervised Techniques on Evolving Corpora– iClassify: M. Levent Koc. Maintain classifiers on evolving examples [VLDB 11]– CRFLex: Aaron Feng. Maintain CRFs on Evolving Corpora [ICDE 12]

• Populate Knowledge Bases from Text– Tuffy: SQL + weights to combine multiple models [VLDB 11]– Felix: Running Markov Logic on Millions of Documents [Preprint ARXIV 11]– Feng Niu, Ce Zhang, Josh Slauson,and Adel Ardalan

• Infrastructure with Mike Cafarella (Michigan)– Manimal: Optimizing MapReduce with Relational Operations [VLDB 11]

30

Student who did real work in red.

31

Conclusion

Many statistical data analyses are no more complicated to compute than a SQL AVG.

Hogwild! approach to multicore parallelism has no locking, but good speedup

Code, data, and papers:www.cs.wisc.edu/hazy

http://www.cs.wisc.edu/hazy