Intelligent Ruby: Machine Learning @igrigorik
Intelligent Ruby + Machine Learning
Ilya Grigorik (@igrigorik)
what, why, the trends, and the toolkit
Machine Learning is ___________ (speak up!)
“Machine learning is a discipline that is concerned with the design and development of algorithms that allow
computers to evolve behaviors based on empirical data”
ML & AI in academia
and how it’s commonly taught
[Diagram: Algorithm, Data Input, Data Output, Runtime]
ML & AI in the real world
or, at least, where the trends are going
[Diagram: Algorithm, Data Input, Data Output, Runtime]
Runtime is a practical constraint
which is often overlooked by academia
[Diagram: Algorithm, Data Input, Data Output, with Runtime repeated several times]
• compute constraints matter (duh)
• CPU vs GPU?
• on-demand supercomputing
• supercomputer by the hour (cloud)
Data is often no longer scarce…
in fact, we (Rubyists) are responsible for generating a lot of it…
[Diagram: Algorithm, Data Output, with Runtime and Data Input each repeated several times]
• Trillion+ page web
• Trillions of social connections
• Petabytes of unstructured data
• Growing at exponential rate
Mo’ data, mo’ problems? Requires more resources? No better off…?
[Diagram: Runtime and Data Input, each repeated several times, leading to “?”]
“Mitigating the Paucity-of-Data Problem: Exploring the Effect of Training Corpus Size on Classifier Performance for Natural Language Processing”
Michele Banko, Eric Brill
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.12.646
“More input data vs. Better Algorithms”
"We were able [to] significantly reduce the error rate, compared to the best system trained on the standard training set size, simply by adding more training data... We see that even out to a billion words the learners continue to benefit from additional training data."
“Data-Driven Learning”
Brute-forcing “learning” with Big-Data
data as the algorithm…
NLP with Big-Data
Google does this better than anyone else…
Word|segmentation|is|tricky
Strategy 1: Grammar for dummies
Strategy 2: Natural language toolkit (encode a language model)
Strategy 3: Take a guess!
新星歐唐尼爾 保守特立獨行
Wordsegmentationistricky
Word Segmentation: Take a guess!
Estimate the probability of every segmentation, pick the best performer
argmax over:
P(W) x P(ordsegmentationistricky)
P(Wo) x P(rdsegmentationistricky)
…
P(Word) x P(segmentationistricky)
P(W) = ????
Word Segmentation: Take a guess!
That’s how Google does it, and does it well…
P(W) = # of Google hits for W / ~ # of pages on the web
Not kidding… it works.
Exercise: write a Ruby script for it.
P(W) = count of W in Google’s n-gram dataset / # of n-grams
• Algorithm: scrape the web, count the words, done.
• Adding a new language: scrape the web, count the words, done.
http://bit.ly/dyTvLO
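One way to approach the exercise, as a minimal sketch: the tiny COUNTS table below is made up and merely stands in for Google hit counts or the n-gram dataset, and the unknown-word penalty in word_prob is an assumption of this sketch, not something from the slides. The segment method tries every first-word split, recurses on the remainder, and keeps the split with the highest product of probabilities (the argmax).

```ruby
# Toy unigram counts standing in for web-scale data (made-up numbers).
COUNTS = {
  "word" => 500, "segmentation" => 20, "is" => 4000, "tricky" => 30,
  "seg" => 5, "mentation" => 1, "trick" => 40, "a" => 5000
}.freeze
TOTAL = COUNTS.values.sum.to_f

# P(W): relative frequency for known words. Unknown strings get a penalty
# that shrinks with length, so long runs of gibberish never win (assumption).
def word_prob(word)
  count = COUNTS[word]
  return count / TOTAL if count
  1.0 / (TOTAL * 10**word.length)
end

# Returns [probability, words] for the most probable segmentation of text.
# Results for each suffix are memoized so every suffix is solved only once.
def segment(text, memo = {})
  return [1.0, []] if text.empty?
  memo[text] ||= (1..text.length).map { |i|
    first = text[0, i]
    rest = text[i..-1]
    rest_prob, rest_words = segment(rest, memo)
    [word_prob(first) * rest_prob, [first] + rest_words]
  }.max_by { |prob, _| prob }
end

_prob, words = segment("wordsegmentationistricky")
puts words.join("|")
```

Swapping COUNTS for real counts is the only change needed to try another language, which is the point of the two bullets above.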
Of course, smarter algorithms still matter!
Don’t get me wrong…
[Diagram: Algorithm, Data Output, with Runtime and Data Input each repeated several times]
Learning vs. Compression
closely correlated concepts
If we can identify significant concepts (within a dataset) then we can represent a large dataset with fewer bits.
If we can represent our data with fewer bits (compress our data), then we have identified “significant” concepts!
“Machine Learning”
Ex: Classification
Predicting a “tasty fruit”
with the perceptron algorithm (y = mx + b)
[Plot: axes Color and Feel; Red = Not tasty, Green = Tasty; new point “?”: Tasty…]
Exercise: maximize the margin
http://bit.ly/bMcwhI
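A minimal perceptron sketch for the fruit example. The training points are hypothetical (color and feel scaled to 0..1), and the learning rate and epoch cap are arbitrary choices for this sketch. The learned weights and bias describe a separating line, the two-feature version of y = mx + b. Note that this plain perceptron stops at the first line that separates the data; it does not maximize the margin, which is what the exercise asks for.

```ruby
# Hypothetical training data: [color, feel] features scaled to 0..1,
# label +1 = tasty, -1 = not tasty. The numbers are made up.
TRAINING = [
  [[0.9, 0.8], 1], [[0.8, 0.9], 1], [[0.7, 0.7], 1],
  [[0.2, 0.3], -1], [[0.1, 0.2], -1], [[0.3, 0.1], -1]
].freeze

# Which side of the line w . x + b = 0 the point falls on.
def predict(weights, bias, features)
  sum = bias + weights.zip(features).sum { |w, x| w * x }
  sum >= 0 ? 1 : -1
end

# Classic perceptron rule: on each mistake, nudge the line toward the point.
def train(examples, epochs: 100, rate: 0.1)
  weights = [0.0, 0.0]
  bias = 0.0
  epochs.times do
    mistakes = 0
    examples.each do |features, label|
      next if predict(weights, bias, features) == label
      mistakes += 1
      weights = weights.zip(features).map { |w, x| w + rate * label * x }
      bias += rate * label
    end
    break if mistakes.zero? # every training example is classified correctly
  end
  [weights, bias]
end

weights, bias = train(TRAINING)
puts predict(weights, bias, [0.85, 0.75]) # a new fruit near the tasty cluster
```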
Where perceptron breaks down
we need a better model…
[Plot: Green = Positive, Purple = Negative]
Idea: y = x^2
Throw the data into a “higher dimensional” space!
[Plot: Green = Positive, Purple = Negative]
Perfect!
http://bit.ly/dfG7vD
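A small sketch of the y = x^2 idea, using made-up one-dimensional points: positives sit at both extremes and negatives in the middle, so no single threshold on x separates them. After lifting each point to (x, x^2), a threshold on the second coordinate does. The helper names lift and separable_by_threshold? are hypothetical, chosen for this illustration.

```ruby
# Hypothetical 1-D data: positives at both extremes, negatives in the middle.
POINTS = [
  [-3.0, :positive], [-2.5, :positive], [2.5, :positive], [3.0, :positive],
  [-1.0, :negative], [0.0, :negative], [0.5, :negative], [1.0, :negative]
].freeze

# Lift a point from x into the higher dimensional space (x, x^2).
def lift(x)
  [x, x**2]
end

# True if a single threshold on the chosen coordinate splits the classes cleanly.
def separable_by_threshold?(points, &coordinate)
  pos = points.select { |_, label| label == :positive }.map { |x, _| coordinate.call(x) }
  neg = points.select { |_, label| label == :negative }.map { |x, _| coordinate.call(x) }
  pos.min > neg.max || neg.min > pos.max
end

puts separable_by_threshold?(POINTS) { |x| x }          # original space
puts separable_by_threshold?(POINTS) { |x| lift(x)[1] } # x^2 coordinate after lifting
```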
Support Vector Machines
That’s the core insight! Simple as that.
require 'SVM'
# Training set: a label plus a feature vector per example
sp = Problem.new
sp.addExample("spam", [1,1,0])
sp.addExample("ham", [0,1,1])
pa = Parameter.new
# Train a model, then classify a new feature vector
m = Model.new(sp, pa)
m.predict [1, 0, 0]
http://bit.ly/a2oyMu
Ex: Recommendations
Linear Algebra + Singular Value Decomposition
A bit of linear algebra for good measure…
        A  B  C  D
Ben     1  0  0  1
Fred    1  1  0  0
Tom     0  0  1  0
James   0  1  1  1
Bob     1  0  0  ?
Any M x N matrix (where M >= N) can be decomposed into:
M x M - call it U
M x N - call it S
N x N - call it V
Observation: we can use this decomposition to approximate the original M x N matrix (by fiddling with S, e.g. zeroing its smallest values, and then recomputing U x S x V^T)
SVD in action
bread and butter of computer vision systems
gem install linalg
to do the heavy lifting…
require 'linalg'
m = Linalg::DMatrix[[1,0,1,0], [1,1,1,1], ... ]
# Compute the SVD
u, s, vt = m.singular_value_decomposition
# ... compute user similarity
# ... make recommendations based on similar users!
http://bit.ly/9lXuOL
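To make the "user similarity" and "recommendations" steps concrete, here is a deliberately simplified sketch in plain Ruby. It skips the SVD projection entirely and compares users with cosine similarity on the raw ratings, so it illustrates the last two steps only, not the linalg-based approach itself. The ratings are the matrix from the recommendations slide, with the row-to-name mapping assumed from the order the names appear in; Bob's rating for item D is the unknown.

```ruby
# Ratings for items A, B, C, D (row-to-name mapping is an assumption).
RATINGS = {
  "Ben"   => [1, 0, 0, 1],
  "Fred"  => [1, 1, 0, 0],
  "Tom"   => [0, 0, 1, 0],
  "James" => [0, 1, 1, 1]
}.freeze
BOB = [1, 0, 0].freeze # Bob's ratings for A, B, C; D is unknown

# Cosine of the angle between two rating vectors (0.0 if either is all zeros).
def cosine(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  norm = Math.sqrt(a.sum { |x| x * x }) * Math.sqrt(b.sum { |y| y * y })
  norm.zero? ? 0.0 : dot / norm
end

# Similarity of every user to the target, over the items the target has rated.
def similarities(ratings, target)
  ratings.transform_values { |row| cosine(row.first(target.length), target) }
end

# Similarity-weighted average of the other users' ratings for one item.
def predict_rating(ratings, target, item_index)
  sims = similarities(ratings, target)
  total = sims.values.sum
  return 0.0 if total.zero?
  ratings.sum { |user, row| sims[user] * row[item_index] } / total
end

closest, = similarities(RATINGS, BOB).max_by { |_, sim| sim }
puts "Most similar to Bob: #{closest}"
puts "Predicted rating for D: #{predict_rating(RATINGS, BOB, 3).round(3)}"
```

With an SVD in hand, the same cosine comparison would typically be run on the users' coordinates in the reduced space rather than on the raw rows.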
Ex: Clustering
Learning & compression
are closely correlated concepts
1. AAAA AAA AAAA AAA AAAAA
2. BBBBB BBBBBB BBBBB BBBBB
3. AAAA BBBBB AAA BBBBB AA
Raw data
Similarity?
similarity(1, 3) > similarity(1, 2)
similarity(2, 3) > similarity(1, 2)
Yeah.. but how did you figure that out?
Some of you ran Lempel-Ziv on it…
Clustering with Zlib
no knowledge of the domain, just straight up compression
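require 'zlib'
require 'pp'

files = Dir['data/*']

# Size in bytes of the given files when deflated together
def deflate(*files)
  z = Zlib::Deflate.new
  z.deflate(files.collect { |f| File.read(f) }.join("\n"), Zlib::FINISH).size
end

# Score every pair of files by the bytes saved when compressing them together
pairwise = files.combination(2).collect do |f1, f2|
  a = deflate(f1)
  b = deflate(f2)
  both = deflate(f1, f2)
  { :files => [f1, f2], :score => (a + b) - both }
end

# Print the 20 most similar pairs, best first
pp pairwise.sort { |a, b| b[:score] <=> a[:score] }.first(20)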
Similarity = amount of space saved when compressed together vs. individually
Exercise: cluster your iTunes library…
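The same idea can be tried without any files, as a sketch on in-memory strings. The three samples are the raw-data strings from the earlier slide; the method names compressed_size and similarity are hypothetical. On inputs this short the fixed zlib header and checksum overhead makes up a large part of every score, so the numbers are noisy; the effect is much clearer on larger texts.

```ruby
require 'zlib'

# Bytes needed to store the text after deflate compression.
def compressed_size(text)
  Zlib::Deflate.deflate(text).bytesize
end

# Bytes saved by compressing two texts together rather than separately.
def similarity(a, b)
  compressed_size(a) + compressed_size(b) - compressed_size(a + "\n" + b)
end

samples = {
  1 => "AAAA AAA AAAA AAA AAAAA",
  2 => "BBBBB BBBBBB BBBBB BBBBB",
  3 => "AAAA BBBBB AAA BBBBB AA"
}

# Score every pair of samples.
samples.keys.combination(2).each do |i, j|
  puts "similarity(#{i}, #{j}) = #{similarity(samples[i], samples[j])}"
end
```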
“Ensemble Methods in Machine Learning”
Thomas G. Dietterich (2000)
[Diagram: Data Output, with Algorithm, Data Input and Runtime each repeated several times]
“Ensemble methods are learning algorithms that construct a set of classifiers and then classify new data points by taking a vote of their predictions… ensembles can often perform better than any single classifier.”
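The voting step described in the quote can be sketched in a few lines. The three classifiers below are hypothetical, deliberately naive spam rules invented for this illustration; a real ensemble would combine trained models. Each one casts a vote and the most common label wins.

```ruby
# Three deliberately simple, hypothetical spam classifiers.
CLASSIFIERS = [
  ->(msg) { msg =~ /free|winner/i ? :spam : :ham }, # suspicious words
  ->(msg) { msg.count("!") >= 3 ? :spam : :ham },   # too many exclamation marks
  ->(msg) { msg == msg.upcase ? :spam : :ham }      # written entirely in capitals
].freeze

# Each classifier casts one vote; the label with the most votes wins.
def ensemble_predict(classifiers, input)
  votes = classifiers.map { |classifier| classifier.call(input) }
  votes.group_by { |vote| vote }.max_by { |_, group| group.size }.first
end

puts ensemble_predict(CLASSIFIERS, "FREE MONEY!!!")
puts ensemble_predict(CLASSIFIERS, "You are a winner, see you at lunch")
```

The second message shows the benefit: one rule fires on "winner", but the other two outvote it.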
The Ensemble = 30+ members
BellKor = 7 members
http://nyti.ms/ccR7ul
Collaborative, Collaborative Filtering?
Unfortunately, the GitHub crew didn’t buy into the idea…
require 'open-uri'

class Crowdsource
  def initialize
    load_leaderboard # scrape github contest leaders
    parse_leaders    # find their top performing results
    fetch_results    # download best results
    cleanup_leaders  # cleanup missing or incorrect data
    crunchit         # build an ensemble
  end
  #...
end

Crowdsource.new
[Diagram: Data Output, with Algorithm, Data Input and Runtime each repeated several times]
In Summary:
• Data-driven: simple models and a lot of data trump elaborate models based on less data
• Ensembles: embrace the complexity of many small, independent models!
• Complex ideas are constructed on simple ideas: explore the simple ideas
More resources, More data, More Models = Collaborative, Data-Driven Learning
Phew, time for questions?
Hope this convinced you to explore the area further…
Support Vector Machines in Ruby: http://www.igvita.com/2008/01/07/support-vector-machines-svm-in-ruby/
Collaborative Filtering with Ensembles: http://www.igvita.com/2009/09/01/collaborative-filtering-with-ensembles/
SVD Recommendation System in Ruby: http://www.igvita.com/2007/01/15/svd-recommendation-system-in-ruby/
gem install ai4r
http://ai4r.rubyforge.org/