Ted Dunning, Chief Application Architect, MapR at MLconf ATL - 9/18/15
![Page 1: Ted Dunning, Chief Application Architect, MapR at MLconf ATL - 9/18/15](https://reader036.vdocuments.net/reader036/viewer/2022062412/5882862c1a28ab24788b78c1/html5/thumbnails/1.jpg)
© 2014 MapR Technologies
Cheap Learning Complements Deep Learning
Ted Dunning
Me, Us
• Ted Dunning, Chief Application Architect, MapR
  – Committer PMC member Zookeeper, Drill
  – VP Incubator
  – Bought the beer at the first HUG
• MapR
  – Distributes more open source components for Hadoop
  – Adds major technology for performance, HA, industry standard API’s
• Info
  – Hash tag - #mapr #mlconfatl
  – See also - @ApacheDrill, @ted_dunning and @mapR
Agenda
• Rationale
• Why cheap isn't the same as simple-minded
• Some techniques
• Examples
Why is cheap better than deep (sometimes)
Greenfield problems can be
  – Easy (large number of these)
  – Impossible (large number of these)
  – Hard but possible (right on the boundary)
Mature problems can be
  – Easy (these are already done)
  – Impossible (still a large number of these)
  – Hard but possible (now the majority of the effort)
Most data isn’t worth much in isolation
First data is valuable
Later data is dregs
Suddenly worth processing
First data is valuable
Later data is dregs
But has high aggregate value
If we can handle the scale
It’s really big
With great scale comes great opportunity
• Increasing scale by 1000x changes the game
• We essentially have green fields opening up all around
• Most of the opportunities don’t require advanced learning
A simple example - security monitoring
• “Small” data
  – Capture IDS logs
  – Detect what you already know
• “Big” data
  – Capture switch, server, firewall logs as well
  – New patterns emerge immediately
Another example – fraud detection
• “Small” data
  – Maintain card profiles
  – Segment models
  – Evaluate all transactions
• “Big” data
  – Maintain card profiles, full 90 day transaction history
  – Per user hierarchical models
  – Evaluate all transactions
Easy != Stupid
• You still have to do things reasonably well
  – Techniques that are not well founded are still problems
• Heuristic frequency ratios still fail
  – Coincidences still dominate the data
  – Accidental 100% correlations abound
• Related techniques still broken for coincidence
  – Pearson’s χ²
  – Simple correlations
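To make the coincidence problem concrete, here is a minimal sketch (not from the talk) that applies Pearson's χ² to two 2 x 2 tables of counts borrowed from tables shown later in the deck. The function name `pearson_chi2` and the argument order (k11, k12, k21, k22) are assumptions made for illustration.

```python
def pearson_chi2(k11, k12, k21, k22):
    """Pearson's chi-squared statistic for a 2 x 2 table of counts."""
    n = k11 + k12 + k21 + k22
    row1, row2 = k11 + k12, k21 + k22
    col1, col2 = k11 + k21, k12 + k22
    # Closed form for 2 x 2 tables: N (ad - bc)^2 / (product of the margins)
    return n * (k11 * k22 - k12 * k21) ** 2 / (row1 * row2 * col1 * col2)

# One co-occurrence in 10,001 observations: a 100% "correlation"
# that rests on a single event.
single = pearson_chi2(1, 0, 0, 10_000)

# Thirteen co-occurrences against a background of thousands.
frequent = pearson_chi2(13, 1000, 1000, 100_000)

print(f"single coincidence: {single:.1f}")
print(f"13 co-occurrences:  {frequent:.2f}")
```

Pearson's χ² scores the single coincidence at roughly 10,001, far beyond any usual significance threshold, even though the evidence is one event. The χ² approximation assumes expected cell counts that are not tiny, and here the expected count for the co-occurrence cell is about 0.0001.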
Blast from the past
Scale does not cure wrong
It just makes easy more common
A core technique
• Many of these easy problems reduce to finding interesting coincidences
• This can be summarized as a 2 x 2 table
• Actually, many of these tables

        A      Other
B       k11    k12
Other   k21    k22
How do you do that?
• This is well handled using G-test
  – See wikipedia
  – See http://bit.ly/surprise-and-coincidence
• Original application in linguistics now cited > 2000 times
• Available in ElasticSearch, in Solr, in Mahout
• Available in R, C, Java, Python
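For reference, the G-test statistic for a 2 x 2 table of counts can be written as below. This is the standard textbook form rather than something shown on the slides; the row totals R_i, column totals C_j, and grand total N are notation introduced here, and cells with a zero count are taken to contribute zero.

```latex
G^2 = 2 \sum_{i=1}^{2} \sum_{j=1}^{2} k_{ij} \,\ln\frac{k_{ij}\, N}{R_i\, C_j},
\qquad
R_i = \sum_{j} k_{ij}, \quad
C_j = \sum_{i} k_{ij}, \quad
N = \sum_{i,j} k_{ij}
```

When the rows and columns are independent and counts are large, G² is approximately χ²-distributed with one degree of freedom. Unlike Pearson's statistic, it stays modest when a perfect-looking association rests on only one or two events.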
Which one is the anomalous co-occurrence?
        A      not A
B       13     1000
not B   1000   100,000

        A      not A
B       1      0
not B   0      10,000

        A      not A
B       10     0
not B   0      100,000

        A      not A
B       1      0
not B   0      2
Which one is the anomalous co-occurrence?
        A      not A
B       13     1000
not B   1000   100,000
Score: 0.90

        A      not A
B       1      0
not B   0      10,000
Score: 4.52

        A      not A
B       10     0
not B   0      100,000
Score: 14.3

        A      not A
B       1      0
not B   0      2
Score: 1.95

Dunning, Ted. Accurate Methods for the Statistics of Surprise and Coincidence. Computational Linguistics 19(1), 1993.
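The four scores on this slide appear to be the square root of the G-test statistic for each table. That reading is an inference from the numbers, not something the slide states. A minimal sketch under that assumption follows; the function names `x_log_x`, `entropy`, `llr`, and `root_llr` are illustrative.

```python
import math

def x_log_x(x):
    # Convention: 0 * ln(0) = 0, so empty cells contribute nothing.
    return 0.0 if x == 0 else x * math.log(x)

def entropy(*counts):
    # Unnormalized entropy: N ln N - sum(k ln k)
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)

def llr(k11, k12, k21, k22):
    """G-test statistic (log-likelihood ratio) for a 2 x 2 table."""
    rows = entropy(k11 + k12, k21 + k22)
    cols = entropy(k11 + k21, k12 + k22)
    cells = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point round-off.
    return max(0.0, 2.0 * (rows + cols - cells))

def root_llr(k11, k12, k21, k22):
    return math.sqrt(llr(k11, k12, k21, k22))

# The four tables from the slide, as (k11, k12, k21, k22).
tables = [
    (13, 1000, 1000, 100_000),
    (1, 0, 0, 10_000),
    (10, 0, 0, 100_000),
    (1, 0, 0, 2),
]
scores = [root_llr(*t) for t in tables]
for table, score in zip(tables, scores):
    print(table, round(score, 2))
```

Under this reading the tables score roughly 0.90, 4.52, 14.3, and 1.95 in the order listed. That makes the table with ten co-occurrences and no exceptions the anomalous one, while the table with thirteen co-occurrences is close to what chance would produce.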
So we can find interesting coincidence
and that gets us exactly what?
Cooccurrence Analysis
Real-life example
• Query: “Paco de Lucia”
• Conventional meta-data search results:
  – “hombres de paco” times 400
  – not much else
• Recommendation based search:
  – Flamenco guitar and dancers
  – Spanish and classical guitar
  – Van Halen doing a classical/flamenco riff
Real-life example
Any other domains?
Document classification
Language identification
OK … Works for language
Anything else?
Species identification
Anything useful?
Like, to do with money?
Common Point of Compromise
• Scenario:
  – Merchant 0 is compromised, leaks account data during compromise
  – Fraud committed elsewhere during exploit
  – High background level of fraud
  – Limited detection rate for exploits
• Goal:
  – Find merchant 0
• Meta-goal:
  – Screen algorithms for this task without leaking sensitive data
Simulation Setup
Simulation Strategy
• For each consumer
  – Pick consumer parameters such as transaction rate, preferences
  – Generate transactions until end of sim-time
• If merchant 0 during compromise time, possibly mark as compromised
• For all transactions, possibly mark as fraud, probability depends on history
• Merchants are selected using hierarchical Pitman-Yor
• Restate data
  – Flatten transaction streams
  – Sort by time
• Tunables
  – Compromise probability, background fraud, detection probability
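The strategy above can be sketched in a few dozen lines. This is a heavily simplified illustration, not the author's simulator. It swaps hierarchical Pitman-Yor merchant selection for fixed skewed weights, marks fraud once per consumer rather than per transaction, and ranks merchants with a signed root G-test. All function names, parameter names, and default values are assumptions.

```python
import math
import random

def llr(k11, k12, k21, k22):
    """G-test statistic for a 2 x 2 table, taking 0 * ln(0) as 0."""
    def xlx(x):
        return 0.0 if x == 0 else x * math.log(x)
    def h(*ks):
        return xlx(sum(ks)) - sum(xlx(k) for k in ks)
    g2 = 2.0 * (h(k11 + k12, k21 + k22) + h(k11 + k21, k12 + k22)
                - h(k11, k12, k21, k22))
    return max(0.0, g2)

def simulate(n_consumers=2000, n_merchants=50, seed=1,
             compromise_prob=0.8, background_fraud=0.02,
             detection_prob=0.6, window=(0.25, 0.75)):
    """Return one (transactions, fraud_flag) record per consumer."""
    rng = random.Random(seed)
    # Assumption: fixed skewed weights stand in for hierarchical Pitman-Yor.
    weights = [1.0 / (m + 5) for m in range(n_merchants)]
    consumers = []
    for _ in range(n_consumers):
        n_tx = rng.randint(3, 8)  # consumer parameter: transaction count
        merchants = rng.choices(range(n_merchants), weights=weights, k=n_tx)
        txs = sorted((rng.random(), m) for m in merchants)  # (time, merchant)
        compromised = False
        for t, m in txs:
            # Visiting merchant 0 inside the window may compromise the card.
            if m == 0 and window[0] <= t < window[1]:
                if rng.random() < compromise_prob:
                    compromised = True
        p_fraud = detection_prob if compromised else background_fraud
        consumers.append((txs, rng.random() < p_fraud))
    return consumers

def flatten(consumers):
    """Restate the data as one time-sorted stream of (time, consumer, merchant)."""
    return sorted((t, cid, m)
                  for cid, (txs, _) in enumerate(consumers) for t, m in txs)

def score_merchants(consumers, n_merchants=50):
    """Rank merchants by signed root G-test of 'visited' against 'fraud'."""
    ranked = []
    for m in range(n_merchants):
        k11 = k12 = k21 = k22 = 0
        for txs, fraud in consumers:
            visited = any(mm == m for _, mm in txs)
            if visited and fraud:
                k11 += 1
            elif visited:
                k12 += 1
            elif fraud:
                k21 += 1
            else:
                k22 += 1
        score = math.sqrt(llr(k11, k12, k21, k22))
        # Negative sign when fraud is rarer among visitors than non-visitors.
        if k11 * (k21 + k22) < k21 * (k11 + k12):
            score = -score
        ranked.append((m, score))
    return sorted(ranked, key=lambda r: r[1], reverse=True)

consumers = simulate()
ranked = score_merchants(consumers)
print(ranked[:3])
```

With these made-up tunables the compromise signal is strong, so merchant 0 should rank first by a wide margin. Lowering `compromise_prob` or `detection_prob`, or raising `background_fraud`, gives a harder and more realistic screening problem.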
But that isn’t very realistic!
• No details of the fraud
• No details of the fraudsters
• No details on the transactions
• No details on the models
• How can this be any good at all?
Secure Development is Hard
Secure Development is Hard
Outside collaborators are outside the security perimeter
They can’t see the data and they can’t tune new algorithms to fit reality
How To Make Realistic Data
Parametric Simulation
Parametric matching of failure signatures allows emulation of complex data properties
Matching on KPI’s and failure modes guarantees practical fidelity
Performance Indicators to Match
• User and merchant population
• Transaction count/consumer
• Merchant propensity skew
• Level of detected fraud
• Spectrum of meta-model scores
So how does it work in practice?
Really truly bad guys
Summary
• We live in a golden age of newly achieved scale
• That scale has lowered the tree
  – Hard problems are much easier
  – Lots of low-hanging fruit all around us
• Cheap learning has huge value
• Code available at http://github.com/tdunning