Deep Learning Made Easy with Deep Features
Piotr Teterwak, Machine Learning Engineer, Dato

TRANSCRIPT

1. Deep Learning Made Easy with Deep Features. Piotr Teterwak, Dato, Machine Learning Engineer
2. Hello, my name is Piotr Teterwak. Machine Learning Engineer, Dato
3. Who is Dato?
4. GraphLab Create: production ML pipeline. DATA → ML algorithm → your web service or intelligent app, via data cleaning & feature engineering, offline evaluation & parameter search, and model deployment. Data engineering, data intelligence, deployment. Goal: a platform to help implement, manage, and optimize the entire pipeline.
5. Deep Learning
6. Today's talk
7. Today's talk: features in ML; deep neural networks learn features; using those learned features for new tasks; productionizing.
8. Features are key to machine learning
9. Simple example: spam filtering. A user opens an email - will she think it's spam? What's the probability the email is spam? Input x (text of email, user info, source info) → MODEL → output: probability of y (Yes! / No).
10. Feature engineering: the painful black art of transforming raw inputs into useful inputs for an ML algorithm - e.g., important words, complex transformations of the input, ... Input x (text of email, user info, source info) → feature extraction → features φ(x) → MODEL → output: probability of y.
11. Deep Learning for Learning Features
12. Linear classifiers. The most common classifiers - logistic regression, SVMs - have a decision boundary that corresponds to a hyperplane, i.e., a line in high-dimensional space: w0 + w1x1 + w2x2 > 0 on one side, w0 + w1x1 + w2x2 < 0 on the other.
13. What can a simple linear classifier represent? AND.
14. What can a simple linear classifier represent? OR.
15. What can't a simple linear classifier represent? XOR. We need non-linear features.
16. Non-linear feature embedding.
17. Graph representation of a classifier - useful for defining neural networks. Inputs x1, x2, ..., xd feed a single output unit y: if w0 + w1x1 + w2x2 + ... + wdxd > 0, output 1; if < 0, output 0.
18. What can a linear classifier represent? x1 OR x2: y = threshold(-0.5 + x1 + x2). x1 AND x2: y = threshold(-1.5 + x1 + x2).
19. Solving the XOR problem: adding a layer. XOR = (x1 AND NOT x2) OR (NOT x1 AND x2). With hidden units thresholded to 0 or 1: z1 = threshold(-0.5 + x1 - x2), z2 = threshold(-0.5 - x1 + x2), and the output y = threshold(-0.5 + z1 + z2). The sketch below verifies these weights.
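A quick NumPy sketch (not from the talk) that plugs in exactly the weights from slides 18-19 and checks that the added layer solves XOR:

```python
import numpy as np

def threshold(v):
    # Step activation: 1 where the pre-activation is positive, else 0.
    return (v > 0).astype(int)

def xor_net(x1, x2):
    # Hidden layer (slide 19): z1 fires for "x1 AND NOT x2",
    # z2 fires for "NOT x1 AND x2".
    z1 = threshold(-0.5 + x1 - x2)
    z2 = threshold(-0.5 - x1 + x2)
    # Output layer: y = z1 OR z2, the same OR unit as on slide 18.
    return threshold(-0.5 + z1 + z2)

x1 = np.array([0, 0, 1, 1])
x2 = np.array([0, 1, 0, 1])
print(xor_net(x1, x2))  # -> [0 1 1 0], i.e., XOR
```

Each hidden unit is itself just a linear classifier; it is the threshold between the layers that supplies the non-linearity a single hyperplane cannot.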
20. Deep neural networks. P(cat|x), P(dog|x). [Figure: http://deeplearning.stanford.edu/wiki/images/4/40/Network3322.png]
21. Deep neural networks can model any function with enough hidden units. This is tremendously powerful: given enough units, a neural network can be trained to solve arbitrarily difficult problems. But they are also very difficult to train: many parameters mean a lot of memory and computation time.
22. Neural nets and GPUs. Many operations in neural net training can happen in parallel; training reduces to matrix operations, many of which are easily parallelized on a GPU.
23.-28. Convolutional neural nets: strategic removal of edges between the input layer and the hidden layer (built up step by step across these slides).
29. Convolutional neural nets. [Animation: http://ufldl.stanford.edu/wiki/images/6/6c/Convolution_schematic.gif]
30. Pooling layer. (Ranzato, LSVR tutorial @ CVPR 2014, www.cs.toronto.edu/~ranzato)
31. Pooling layer. [Animation: http://ufldl.stanford.edu/wiki/images/6/6c/Pooling_schematic.gif]
32. Final network: Krizhevsky et al. '12.
33. Applications to computer vision
34. Image features. Features = local detectors, combined to make a prediction (in reality, features are more low-level). Example: eye + eye + nose + mouth → Face!
35. Standard image classification approach: input → extract hand-crafted computer vision features (SIFT, Spin image, HoG, RIFT, Textons, GLOH; slide credit: Honglak Lee) → use a simple classifier, e.g., logistic regression or SVMs → Face.
36. Many hand-crafted features exist... but they are very painful to design.
37. Change the image classification approach? Can we learn the features from data instead?
38. Use a neural network to learn features: input → learned hierarchy → output. (Lee et al., "Convolutional Deep Belief Networks for Scalable Unsupervised Learning of Hierarchical Representations", ICML 2009)
39. Sample results. Traffic sign recognition (GTSRB): 99.2% accuracy. House number recognition (Google): 94.3% accuracy.
40. Krizhevsky et al. '12: 60M parameters; won the 2012 ImageNet competition.
41. ImageNet 2012 competition: 1.2M images, 1000 categories.
42. Application to scene parsing [Farabet et al., ICML 2012; PAMI 2013] (slide credit: Y. LeCun, M. A. Ranzato). Semantic labeling: labeling every pixel with the object it belongs to. Would help identify obstacles, targets, landing sites, and dangerous areas; would help line up depth maps with edge maps.
43. A quick demo!
44. Challenges of deep learning
45. Deep learning scorecard. Pros: enables learning of features rather than hand-tuning; impressive performance gains in computer vision, speech recognition, and some text analysis; potential for much more impact. Cons: (see slide 47).
46. Deep learning workflow: lots of labeled data → training set (80%) / validation set (20%) → learn a deep neural net model → validate.
47. Deep learning scorecard, continued. Cons: computationally really expensive; requires a lot of data for high accuracy; extremely hard to tune - choice of architecture, parameter types, hyperparameters, learning algorithm, ... Computational cost + so many choices = incredibly hard to tune.
48. Can we do better? Input → learned hierarchy → output. (Lee et al., ICML 2009)
49. Deep features: deep learning + transfer learning
50. Transfer learning: use data from one domain to help learn on another. Lots of data: learn a neural net → great accuracy. Some data: neural net as a feature extractor + a simple classifier → great accuracy on the new problem. An old idea, explored for deep learning by Donahue et al. '14.
51. What's learned in a neural net: in a network trained for Task 1, the early layers are more generic (and can be used as a feature extractor), while the later layers are very specific to Task 1.
52. Transfer learning in more detail: keep the weights of the generic layers fixed! For Task 2, learn only the end part, using a simple classifier, e.g., logistic regression or SVMs. Class?
53. Using an ImageNet-trained network as an extractor for general features, with the classic AlexNet architecture pioneered by Alex Krizhevsky et al. in "ImageNet Classification with Deep Convolutional Neural Networks". It turns out that a neural network trained on ~1 million images from about 1000 classes makes a surprisingly general feature extractor. First illustrated by Donahue et al. in "DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition".
54. Caltech-101.
55. Transfer learning with deep features: some labeled data → extract features with a neural net trained on a different task → training set (80%) / validation set (20%) → learn a simple model → validate → deploy in production. (See the sketch below.)
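A minimal sketch of slide 55's workflow in GraphLab Create - an illustration, not the talk's actual demo code; the paths 'my_labeled_images/' and 'imagenet_model' and the 'label' column name are placeholder assumptions:

```python
import graphlab as gl

# Some labeled data for the new task (placeholder path; assumes an SFrame
# with an image column and a 'label' column).
data = gl.SFrame('my_labeled_images/')

# A neural net trained on a *different* task (placeholder path to a
# pretrained ImageNet model).
deep_model = gl.load_model('imagenet_model')

# Use the pretrained network as a feature extractor; its weights stay fixed.
data['deep_features'] = deep_model.extract_features(data)

# 80/20 split, then learn a simple model on top of the deep features.
train, valid = data.random_split(0.8)
simple_model = gl.logistic_classifier.create(train,
                                             target='label',
                                             features=['deep_features'])

# Validate before deploying in production.
print(simple_model.evaluate(valid))
```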
56. Demo
57. What else can we do with deep features?
58. Finding similar images.
59. Applications to text data
60. Simple text classification with bag of words: one feature per word, e.g., aardvark 0, about 2, all 2, Africa 1, apple 0, anxious 0, ..., gas 1, ..., oil 1, Zaire 0 → use a simple classifier, e.g., logistic regression or SVMs → Class?
61. Word2Vec: a neural network for finding word representations (Mikolov et al. '13). Skip-gram model: from a word, predict nearby words in the sentence (e.g., from "dog" in "A dog went for a walk"). The network's internal representation can be viewed as deep features.
62. Word2Vec: a neural network for finding a high-dimensional representation per word (Mikolov et al. '13). http://www.folgertkarsdorp.nl/word2vec-an-introduction/
63. Related words are placed nearby in the high-dimensional space. Projecting the 300-dim space into 2 dims with PCA (Mikolov et al. '13).
64. Blog corpus: LOL and its closest words in the 300-dim space: Haha, Yea, Hahaha, Hahah, Lisxc, Umm, Hehe, laughingoutloud. Predicts the gender of the author with 79% accuracy.
65. ML in production (or: how this is relevant to data scientists)
66. 2015: production ML pipeline. DATA → ML algorithm (using deep learning) → your web service or intelligent app, via data cleaning & feature engineering, offline evaluation & parameter search, and model deployment. Data engineering, data intelligence, deployment. Goal: a platform to help implement, manage, and optimize the entire pipeline.
67. In real life
68. Take Home Message
69. Take home message: a simple classifier (e.g., logistic regression, SVMs) on deep features. Class? Deep features are remarkable!
70. CONF.DATO.COM
71. Dato office hours @ Galvanize SF. Bring your laptop & some data & we'll help you get started. When: Thurs (tomorrow), 2:30p-5p, followed by beers. Where: Galvanize, 44 Tehama St. (SOMA) in SF. Talk to me / email me: [email protected]
72. Get the software: dato.com/download. Learn: dato.com/learn. Learn more: blog.dato.com. Join us: we're hiring lots! Contact me: [email protected]
73. Go create something! [with Dato] Data engineering: fast & scalable, rich data type support, visualization. Data intelligence: app-oriented ML, supporting utils, extensibility. Deployment: batch & always-on, RESTful interface, elastic & robust.