flickr: computer vision at scale with hadoop and storm (huy nguyen)
DESCRIPTION
This presentation focuses on how Flickr recently brought scalable computer vision and deep learning to Flickr's multi-billion collection of photos using Apache Hadoop and Storm. Video of the presentation is available here: https://www.youtube.com/watch?v=OrhAbZGkW8k From the May 14, 2014 San Francisco Hadoop User Group meeting. For more information on the SFHUG see http://www.meetup.com/hadoopsf/ If you'd like to work on interesting challenges like this at Flickr in San Francisco, please see http://www.flickr.com/jobsTRANSCRIPT
![Page 1: Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)](https://reader034.vdocuments.net/reader034/viewer/2022051322/540de0458d7f72767e8b4ba3/html5/thumbnails/1.jpg)
Presented by: Huy Nguyen @huyng
Computer vision at scale with Hadoop and Storm
Copyright © 2014 Yahoo, Inc.
![Page 2: Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)](https://reader034.vdocuments.net/reader034/viewer/2022051322/540de0458d7f72767e8b4ba3/html5/thumbnails/2.jpg)
Outline
▪ Computer vision at 10,000 feet
▪ Hadoop training system
▪ Storm continuous stream processing
▪ Hadoop batch processing
▪ Q&A
![Page 3: Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)](https://reader034.vdocuments.net/reader034/viewer/2022051322/540de0458d7f72767e8b4ba3/html5/thumbnails/3.jpg)
![Page 4: Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)](https://reader034.vdocuments.net/reader034/viewer/2022051322/540de0458d7f72767e8b4ba3/html5/thumbnails/4.jpg)
How can we help Flickr users better organize, search, and browse
their photo collection?
photo credit: Quinn Dombrowski
![Page 5: Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)](https://reader034.vdocuments.net/reader034/viewer/2022051322/540de0458d7f72767e8b4ba3/html5/thumbnails/5.jpg)
What is AutoTagging?
Any photo you upload to Flickr is now automatically classified using computer vision software
Class Probability
Flowers 0.98
Outdoors 0.95
Cat 0.001
Grass 0.6
ClassifiersClassifiersClassifiers
![Page 6: Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)](https://reader034.vdocuments.net/reader034/viewer/2022051322/540de0458d7f72767e8b4ba3/html5/thumbnails/6.jpg)
Demo
![Page 7: Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)](https://reader034.vdocuments.net/reader034/viewer/2022051322/540de0458d7f72767e8b4ba3/html5/thumbnails/7.jpg)
Auto Tags Feeding Search
▪
▪
▪
▪
▪
▪
![Page 8: Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)](https://reader034.vdocuments.net/reader034/viewer/2022051322/540de0458d7f72767e8b4ba3/html5/thumbnails/8.jpg)
Our Goals
▪ Train thousands of classifiers
▪ Classify millions of photos per day
▪ Compute auto tags on backlog of billions of photos
▪ Enable reprocessing bi-weekly and every quarter
▪ Do it all within 90 days of being acquired!
![Page 9: Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)](https://reader034.vdocuments.net/reader034/viewer/2022051322/540de0458d7f72767e8b4ba3/html5/thumbnails/9.jpg)
Our system for training classifiers
▪ High level overview of image classification
▪ Hadoop workflow
![Page 10: Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)](https://reader034.vdocuments.net/reader034/viewer/2022051322/540de0458d7f72767e8b4ba3/html5/thumbnails/10.jpg)
Classification in a nutshell
How do we define a classifier?
A classifier is simply the line “z = wx - b” which divides positive and negative examples with the largest margin. Entirely defined by w and b
Training is simply adjusting w and b to find the line with the largest margin dividing negative and positive example data
Classification is done by computing z, given some data point x
![Page 11: Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)](https://reader034.vdocuments.net/reader034/viewer/2022051322/540de0458d7f72767e8b4ba3/html5/thumbnails/11.jpg)
Image classification in a nutshell
To classify images we need to convert raw pixels into a “feature” to plug into x
computefeature
![Page 12: Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)](https://reader034.vdocuments.net/reader034/viewer/2022051322/540de0458d7f72767e8b4ba3/html5/thumbnails/12.jpg)
How are image classifiers made?
flower Class Name
Negative Examples
(approx. 100K)
Positive Examples(approx. 10K)
Repeat a few thousand times for each class
ComputeFeatures Train
Classifier
w,b
![Page 13: Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)](https://reader034.vdocuments.net/reader034/viewer/2022051322/540de0458d7f72767e8b4ba3/html5/thumbnails/13.jpg)
Hadoop Classifier Training
![Page 14: Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)](https://reader034.vdocuments.net/reader034/viewer/2022051322/540de0458d7f72767e8b4ba3/html5/thumbnails/14.jpg)
Hadoop Classifier Training
HDFS
Pixel Data+
Label Data
1 <base64>2 <base64>3 <base64>4 <base64>..n <base64>
dog 1 TRUEdog 2 FALSEdog 3 TRUEcat 1 FALSEcat 2 TRUEcat 3 FALSE..
Pixel Data Label Data
photo_id, pixel_data class, photo_id, label
![Page 15: Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)](https://reader034.vdocuments.net/reader034/viewer/2022051322/540de0458d7f72767e8b4ba3/html5/thumbnails/15.jpg)
Hadoop Classifier Training - JOIN
map
map
map
reduce
Pixels
Labels
1 <pixelsb64> [(cat, FALSE), (dog, TRUE)]2 <pixelsb64> [(cat,TRUE), (dog, FALSE)]3 <pixelsb64> [(cat, FALSE), (dog, TRUE)]..
▪ mappers emit (photo_id, pixels) OR (photo_id, class, label)▪ reducers, takes advantage of sorting to emit (photo_id, pixels, json(label data))
![Page 16: Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)](https://reader034.vdocuments.net/reader034/viewer/2022051322/540de0458d7f72767e8b4ba3/html5/thumbnails/16.jpg)
Hadoop Classifier Training - TRAIN
1 <pixelsb64> [(cat, FALSE), (dog, TRUE)]2 <pixelsb64> [(cat,TRUE), (dog, FALSE)]3 <pixelsb64> [(cat, FALSE), (dog, TRUE)]..
map
map
map
reduce
computefeature
trainingclassifier
dogclf.
catclf.
▪ mappers compute features, emits a (class, feat64, label) for each class image associated with image
▪ reducers, takes advantage of sorting to train. Emits (class, w, b)
![Page 17: Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)](https://reader034.vdocuments.net/reader034/viewer/2022051322/540de0458d7f72767e8b4ba3/html5/thumbnails/17.jpg)
Hadoop Classifier Training
▪ Workflow coordinated with OOZIE
![Page 18: Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)](https://reader034.vdocuments.net/reader034/viewer/2022051322/540de0458d7f72767e8b4ba3/html5/thumbnails/18.jpg)
Hadoop Classifier Training
▪ 10,000 mappers▪ 1,000 reducers▪ A few thousand classifiers in less than 6 hours▪ Replaced OpenMPI based system
▪ Easier deployment▪ Access to more hardware▪ Much simpler code
![Page 19: Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)](https://reader034.vdocuments.net/reader034/viewer/2022051322/540de0458d7f72767e8b4ba3/html5/thumbnails/19.jpg)
Area of Improvements
▪ Reducers have all training data in memory▪ A potential bottleneck if we want to expand training set for any given
set of classifiers▪ Known path to avoiding this bottleneck
▪ Incrementally update models as new data arrives
![Page 20: Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)](https://reader034.vdocuments.net/reader034/viewer/2022051322/540de0458d7f72767e8b4ba3/html5/thumbnails/20.jpg)
Our system for tagging the flickr corpus
▪ Batch processing with Hadoop
▪ Stream processing with Storm
![Page 21: Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)](https://reader034.vdocuments.net/reader034/viewer/2022051322/540de0458d7f72767e8b4ba3/html5/thumbnails/21.jpg)
Batch processing in Hadoop
www HDFS AutotagsMapper
Search Engine Indexer
HDFS IndexingReducer
AutotagsMapperAutotagsMapper
![Page 22: Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)](https://reader034.vdocuments.net/reader034/viewer/2022051322/540de0458d7f72767e8b4ba3/html5/thumbnails/22.jpg)
Hadoop Auto tagging
HDFS
Pixel Data
Autotag Results
1 <base64>2 <base64>3 <base64>4 <base64>..n <base64>
1 { “cat”: .98, “dog”: 0.3 }2 { “cat”: .97, “dog”: 0.2 }3 { “cat”: .2, “dog”: 0.3 }4 { “cat”: .5, “dog”: 0.9 }..n { “cat”: .90, “dog”: 0.23 }
Pixel Data Autotag Results
photo_id, pixel_data photo_id, results
![Page 23: Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)](https://reader034.vdocuments.net/reader034/viewer/2022051322/540de0458d7f72767e8b4ba3/html5/thumbnails/23.jpg)
Hadoop classification performance
▪ 10,000 Mappers▪ Processed in about 1 week entire corpus
![Page 24: Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)](https://reader034.vdocuments.net/reader034/viewer/2022051322/540de0458d7f72767e8b4ba3/html5/thumbnails/24.jpg)
Main challenge: packaging our code
▪ Heterogenous Python & C++ codebase
▪ How do we get all of the shared library dependencies into a heterogenous system distributed across many machines?
![Page 25: Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)](https://reader034.vdocuments.net/reader034/viewer/2022051322/540de0458d7f72767e8b4ba3/html5/thumbnails/25.jpg)
Main challenge: packaging our code
▪ tried: pyinstaller, didn’t work
▪ rolled our own solution using strace▪ /autotags_package
▪ /lib▪ /lib/python2.7/site-packages▪ /bin▪ /app
![Page 26: Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)](https://reader034.vdocuments.net/reader034/viewer/2022051322/540de0458d7f72767e8b4ba3/html5/thumbnails/26.jpg)
Stream processing (our first attempt)
www Redis
PHPWorkersPHPWorkersPHPWorkers
fork autotags a.jpg
searchdb
![Page 27: Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)](https://reader034.vdocuments.net/reader034/viewer/2022051322/540de0458d7f72767e8b4ba3/html5/thumbnails/27.jpg)
Why we started looking into Storm
![Page 28: Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)](https://reader034.vdocuments.net/reader034/viewer/2022051322/540de0458d7f72767e8b4ba3/html5/thumbnails/28.jpg)
Why we started looking into Storm
FeatureComputation Classification
300 ms 10 ms
ClassifierLoad
300 ms
Classifier loading accounted for 49% of total processing time
![Page 29: Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)](https://reader034.vdocuments.net/reader034/viewer/2022051322/540de0458d7f72767e8b4ba3/html5/thumbnails/29.jpg)
Solutions we evaluated at the time
▪ Mem-mapping model
▪ Batching up operations
▪ Daemonize our autotagging code
![Page 30: Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)](https://reader034.vdocuments.net/reader034/viewer/2022051322/540de0458d7f72767e8b4ba3/html5/thumbnails/30.jpg)
Enter Storm
![Page 31: Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)](https://reader034.vdocuments.net/reader034/viewer/2022051322/540de0458d7f72767e8b4ba3/html5/thumbnails/31.jpg)
![Page 32: Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)](https://reader034.vdocuments.net/reader034/viewer/2022051322/540de0458d7f72767e8b4ba3/html5/thumbnails/32.jpg)
Storm computational framework
▪ Spouts▪ The source of your data
stream. Spouts “emit” tuples
▪ Tuples▪ An atomic unit of work
▪ Bolts▪ Small programs that
processes your tuple stream
▪ Topology▪ A graph of bolts and
spouts
![Page 33: Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)](https://reader034.vdocuments.net/reader034/viewer/2022051322/540de0458d7f72767e8b4ba3/html5/thumbnails/33.jpg)
Defining a topology
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("images", new ImageSpout(), 10);
builder.setBolt("autotag", new AutoTagBolt(), 3)
.shuffleGrouping("images")
builder.setBolt("dbwriter", new DBWriterBolt(), 2)
.shuffleGrouping("autotag")
![Page 34: Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)](https://reader034.vdocuments.net/reader034/viewer/2022051322/540de0458d7f72767e8b4ba3/html5/thumbnails/34.jpg)
Defining a bolt
class Autotag(storm.BasicBolt):
def __init__(self):
self.classifier.load()
def process(self, tuple):
<process tuple here>
![Page 35: Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)](https://reader034.vdocuments.net/reader034/viewer/2022051322/540de0458d7f72767e8b4ba3/html5/thumbnails/35.jpg)
Our final solution
www Redis AutotagsSpout
AutotagsBolt
Search Engine Indexer
Bolt
DB WriterBolt
Topology
![Page 36: Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)](https://reader034.vdocuments.net/reader034/viewer/2022051322/540de0458d7f72767e8b4ba3/html5/thumbnails/36.jpg)
Storm architecture
Supervisors
(10 nodes)
Nimbus
(1 node)
Zookeepers
(3 nodes)
Spouts and bolts live here
![Page 37: Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)](https://reader034.vdocuments.net/reader034/viewer/2022051322/540de0458d7f72767e8b4ba3/html5/thumbnails/37.jpg)
Fault tolerant against node failures
Nimbus server dies▪ no new jobs can be submitted▪ current jobs stay running▪ simply restart
Supervisor node dies▪ All unprocessed jobs for that node get reassigned▪ Supervisor restarted
Zookeeper node dies▪ Restart node, no data loss as long as not too many die at once
![Page 38: Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)](https://reader034.vdocuments.net/reader034/viewer/2022051322/540de0458d7f72767e8b4ba3/html5/thumbnails/38.jpg)
storm commands
storm jar topology-jar-path class
![Page 39: Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)](https://reader034.vdocuments.net/reader034/viewer/2022051322/540de0458d7f72767e8b4ba3/html5/thumbnails/39.jpg)
storm commands
storm list
![Page 40: Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)](https://reader034.vdocuments.net/reader034/viewer/2022051322/540de0458d7f72767e8b4ba3/html5/thumbnails/40.jpg)
storm commands
storm kill topology-name
![Page 41: Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)](https://reader034.vdocuments.net/reader034/viewer/2022051322/540de0458d7f72767e8b4ba3/html5/thumbnails/41.jpg)
storm commands
storm deactivate topology-jar-path class
![Page 42: Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)](https://reader034.vdocuments.net/reader034/viewer/2022051322/540de0458d7f72767e8b4ba3/html5/thumbnails/42.jpg)
storm commands
storm activate topology-name
![Page 43: Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)](https://reader034.vdocuments.net/reader034/viewer/2022051322/540de0458d7f72767e8b4ba3/html5/thumbnails/43.jpg)
Is Storm just another queuing system?
![Page 44: Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)](https://reader034.vdocuments.net/reader034/viewer/2022051322/540de0458d7f72767e8b4ba3/html5/thumbnails/44.jpg)
Yes, and more
▪ A framework for separating application design from scaling concerns
▪ Has a great deployment story▪ atomic start, kill, deactivate for free▪ no more rsync
▪ No single point of failure▪ Wide industry adoption▪ Is programming language agnostic
![Page 45: Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)](https://reader034.vdocuments.net/reader034/viewer/2022051322/540de0458d7f72767e8b4ba3/html5/thumbnails/45.jpg)
Try storm out!
https://github.com/apache/incubator-storm/tree/master/examples/storm-starter
https://github.com/nathanmarz/storm-deploy
![Page 46: Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)](https://reader034.vdocuments.net/reader034/viewer/2022051322/540de0458d7f72767e8b4ba3/html5/thumbnails/46.jpg)
Thank you
Presentation copyright © 2014 Yahoo, Inc.