TRANSCRIPT

Page 1: October hug

1

Scaling by Cheating: Approximation, Sampling and Fault-Friendliness for Scalable Big Learning

Sean Owen / Director, Data Science @ Cloudera

Page 2: October hug

2

Two Big Problems

Page 3: October hug

3

Grow Bigger


“Today’s big is just tomorrow’s small. We’re expected to process arbitrarily large data sets by just adding computers. You can’t tell the boss that anything’s too big to handle these days.”

David, Sr. IT Manager

Page 4: October hug

4

And Be Faster


“Speed is king. People expect up-to-the-second results, and millisecond response times. No more overnight reporting jobs. My data grows 10x but my latency has to drop 10x.”

Shelly, CTO

Page 5: October hug

5

Two Big Solutions

Page 6: October hug

6

Plentiful Resources


“Disk and CPU are cheap, on-demand. Frameworks to harness them, like Hadoop, are free and mature. We can easily bring to bear plenty of resources to process data quickly and cheaply.”

“Scooter”, White Lab

Page 7: October hug

7

Not Right, but Close Enough

Cheating

Page 8: October hug

8

Kirk: What would you say the odds are on our getting out of here?

Spock: Difficult to be precise, Captain. I should say approximately seven thousand eight hundred twenty four point seven to one.

Kirk: Difficult to be precise? Seven thousand eight hundred and twenty four to one?

Spock: Seven thousand eight hundred twenty four point seven to one.

Kirk: That's a pretty close approximation.

Star Trek, “Errand of Mercy”
http://www.redbubble.com/people/feelmeflow

Page 9: October hug

When To Cheat (Approximate)

9

• Only a few significant figures matter

• Least-significant figures are noise

• Only relative rank matters

• Only care about “high” or “low”

Do you care about 37.94% vs. simply 40%?

Page 10: October hug

10

Page 11: October hug

11

Approximation

Page 12: October hug

The Mean

12

• Huge stream of values: x1, x2, x3, … *

• Finding the entire population mean µ is expensive

• Mean of a small sample of N values is close:

µN = (1/N)(x1 + x2 + … + xN)

• How much gets close enough?

* independent, roughly normal distribution
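As a concrete illustration (not code from the talk), the sample mean over the first N values of a stream is just the running sum divided by N:

// Illustrative only: µN = (1/N)(x1 + x2 + … + xN) over the first N values.
static double sampleMean(double[] x, int n) {
  double sum = 0.0;
  for (int i = 0; i < n; i++) {
    sum += x[i];
  }
  return sum / n;
}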

Page 13: October hug

“Close Enough” Mean

13

• Want: with high probability p, at most ε error: µ = (1 ± ε) µN

• Use Student’s t-distribution (N-1 d.o.f.): t = (µ - µN) / (σN / √N)

• Describes how the unknown µ behaves relative to the known sample statistics

Page 14: October hug

“Close Enough” Mean

14

• Critical value for one tail: tcrit = CDF⁻¹((1+p)/2)

• Use a library like Commons Math3: TDistribution.inverseCumulativeProbability()

• Solve for the critical µcrit:

CDF⁻¹((1+p)/2) = (µcrit - µN) / (σN / √N)

• µ is “probably” at most µcrit

• Stop when (µcrit - µN) / µN is small (< ε)
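Putting the last two slides together, a minimal sketch of this test in Java might look like the following. This is a reconstruction from the formulas above, not the talk's actual CloseEnoughMean.java (see github.com/srowen/commoncrawl for that); the class name, fields, and method signatures here are assumptions, and only TDistribution.inverseCumulativeProbability() from Commons Math3 comes from the slide.

import org.apache.commons.math3.distribution.TDistribution;

// Sketch of a "close enough" running mean: with confidence p, stop once the
// true mean µ is probably within ε (relative error) of the sample mean µN.
// Reconstructed from the slide's formulas; names and structure are assumptions.
public class CloseEnoughMeanSketch {

  private final double p;        // required confidence, e.g. 0.90
  private final double epsilon;  // acceptable relative error, e.g. 0.10
  private long n;                // samples seen so far
  private double sum;            // Σ x
  private double sumSquares;     // Σ x²

  public CloseEnoughMeanSketch(double p, double epsilon) {
    this.p = p;
    this.epsilon = epsilon;
  }

  public void add(double x) {
    n++;
    sum += x;
    sumSquares += x * x;
  }

  public double mean() {
    return sum / n;  // µN
  }

  public boolean isCloseEnough() {
    if (n < 2) {
      return false;
    }
    double meanN = sum / n;
    double varianceN = (sumSquares - n * meanN * meanN) / (n - 1);  // sample variance
    double stdError = Math.sqrt(varianceN / n);                     // σN / √N
    // tcrit = CDF⁻¹((1+p)/2) for the t-distribution with N-1 degrees of freedom
    double tCrit = new TDistribution(n - 1).inverseCumulativeProbability((1.0 + p) / 2.0);
    double muCrit = meanN + tCrit * stdError;  // µ is "probably" at most µcrit
    // Stop when the relative gap (µcrit - µN) / µN is small
    return Math.abs(muCrit - meanN) <= epsilon * Math.abs(meanN);
  }
}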

Page 15: October hug

15

Sampling

Page 16: October hug

16

Page 17: October hug

Word Count: Toy Example

17

• Input: text documents

• Exactly how many times does each word occur?

• Necessary precision?

• Interesting question?

Why?

Page 18: October hug

Word Count: Useful Example

18

• About how many times does each word occur?

• Which 10 words occur most frequently?

• What fraction are Capitalized?

Hmm!

Page 19: October hug

Common Crawl

19

• s3n://aws-publicdatasets/common-crawl/parse-output/segment/*/textData-*

• Count top words, % Capitalized, and “zucchini” in a 35GB subset

• github.com/srowen/commoncrawl

• Amazon EMR, 4 c1.xlarge instances

Page 20: October hug

Raw Results

20

• 40 minutes

• 40.1% Capitalized

• Most frequent words: the and to of a in de for is

• zucchini occurs 9,571 times

Page 21: October hug

Sample 10% of Documents

21

• 21 minutes

• 39.9% Capitalized

• Most frequent words: the and to of a in de for is

• zucchini occurs 967 times (≈ 9,670 overall)

...
if (Math.random() >= 0.1) continue;  // randomly skip ~90% of documents
...

Page 22: October hug

Stop When “Close Enough”

22

• CloseEnoughMean.java

• Stop mapping when % Capitalized is close enough

• 10% error, 90% confidence, per Mapper

• 18 minutes

• 39.8% Capitalized

...
if (m.isCloseEnough()) {
  break;  // this Mapper's estimate has converged; stop reading input
}
...
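For concreteness, here is one way such an early stop could be wired into a Hadoop Mapper. This is an illustrative sketch only, not the talk's actual job: the mapper name, the tokenization, and the reuse of the CloseEnoughMeanSketch class from the earlier sketch are all assumptions. Overriding run() lets the record loop break once the per-Mapper estimate of % Capitalized converges.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative sketch: stop reading this Mapper's split early once the running
// "% Capitalized" estimate is close enough (10% error, 90% confidence).
public class EarlyStopMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

  private static final LongWritable ONE = new LongWritable(1);
  // Running estimate of the fraction of Capitalized words (sketch class from above).
  private final CloseEnoughMeanSketch m = new CloseEnoughMeanSketch(0.9, 0.1);

  @Override
  public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    while (context.nextKeyValue()) {
      map(context.getCurrentKey(), context.getCurrentValue(), context);
      if (m.isCloseEnough()) {
        break;  // estimate has converged; skip the rest of this split
      }
    }
    cleanup(context);
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    for (String word : value.toString().split("\\s+")) {
      if (!word.isEmpty()) {
        m.add(Character.isUpperCase(word.charAt(0)) ? 1.0 : 0.0);
        context.write(new Text(word), ONE);
      }
    }
  }
}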

Page 23: October hug

23

More Sampling

Page 24: October hug

24

Page 25: October hug

Item-Item Similarity

25

• Input: user-item click counts

• Compute all-pairs item-item similarity

• Output size is (# Items x # Items)

• Far too large to consume in the next job

• But virtually all similarities are noise, near 0

[Figure: sparse matrix of user-item click counts (Users × Items), mostly empty with small integer counts]

Page 26: October hug

Pruning

26

• ItemSimilarityJob

• --threshold: discard similarities < value

• --maxSimilaritiesPerItem: keep only the top n pairs per item

• --maxPrefsPerUser: ignore excess prefs from “prolific” users

Item × Item similarity matrix (symmetric; most entries are near 0):

 1.0   0     0.5   0     0     0.5   0     0
 0     1.0   0.1   0     0     0.2   0     0.1
 0.5   0.1   1.0   0    -0.2   0     0     0
 0     0     0     1.0   0     0     0     0
 0     0    -0.2   0     1.0   0.2   0     0.2
 0.5   0.2   0     0     0.2   1.0   0     0
 0     0     0     0     0     0     1.0   0
 0     0.1   0     0     0.2   0     0     1.0
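For reference, a hedged sketch of how the pruning flags above might be passed to Mahout's ItemSimilarityJob from Java. The input/output paths are placeholders, and aside from the three pruning flags named on this slide, the remaining options (--input, --output, --similarityClassname and its Pearson value) are assumptions about the standard Mahout job interface.

import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob;

// Illustrative sketch: run ItemSimilarityJob with the pruning flags above.
// Paths and the choice of similarity measure are placeholders.
public class PrunedItemSimilarity {
  public static void main(String[] args) throws Exception {
    ToolRunner.run(new ItemSimilarityJob(), new String[] {
        "--input", "/path/to/userID,itemID,count",       // placeholder input
        "--output", "/path/to/item-item-similarities",   // placeholder output
        "--similarityClassname", "SIMILARITY_PEARSON_CORRELATION",
        "--threshold", "0.3",              // discard similarities < 0.3
        "--maxSimilaritiesPerItem", "10",  // keep only the top 10 pairs per item
        "--maxPrefsPerUser", "100"         // ignore excess prefs from prolific users
    });
  }
}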

Page 27: October hug

Pruning Experiment

27

• Líbímseti dating site data set

• 135K users x 165K profiles

• 17M data points

• Ratings on a 1-10 scale

• Compute all item-item Pearson correlations

• Amazon EMR, 2 m1.xlarge instances

Page 28: October hug

Pruning Experiment

28

No Pruning:

• 0 threshold
• <10,000 pairs per item
• <1,000 prefs per user
• 178 minutes
• 20,400 MB output

Pruning:

• >0.3 threshold
• <10 pairs per item
• <100 prefs per user
• 11 minutes
• 2 MB output

Page 29: October hug

Resources

29

• Apache Mahout: mahout.apache.org

• Commons Math: commons.apache.org/proper/commons-math/

• github.com/srowen/commoncrawl

[email protected]

Page 30: October hug