TRANSCRIPT
-
Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective
FACEBOOK, INC.
PRESENTED BY RAVI RAHMAN
-
Overview of Presentation
1. Applications of Machine Learning at Facebook
2. Hardware Architectures for Machine Learning
3. FBLearner Flow and Predictor
-
Scale of Facebook
2,000,000,000+ Users
12 Datacenters
11+ Machine Learning Services
Trillions of ML inference queries per day
-
Machine Learning Applications at Facebook
[Chart: Facebook's ML applications plotted by model complexity]
-
Machine Learning Process
-
Machine Learning Process at Facebook
-
Machine Learning Process at Facebook
Service Layer
Hardware Layer
Application Layer
-
CPU vs GPU Architecture
CPU
◦ General purpose
◦ Ideal for sequential or independent operations
◦ Low throughput
◦ Low latency
◦ High power consumption per flop
GPU
◦ Specialized for highly parallel workloads such as machine learning
◦ Ideal for parallelized operations on batch inputs
◦ High throughput
◦ High latency
◦ Low power consumption per flop
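To make the throughput contrast concrete, here is a minimal PyTorch sketch (PyTorch is the framework discussed later in this deck) that times the same matrix multiply on CPU and, when available, GPU. Absolute numbers will vary with hardware; this is illustrative, not a benchmark:

```python
import time
import torch

def time_matmul(device: str, n: int = 2048, iters: int = 10) -> float:
    """Average seconds per n x n matrix multiply on the given device."""
    a = torch.randn(n, n, device=device)
    b = torch.randn(n, n, device=device)
    torch.matmul(a, b)  # warm-up
    if device == "cuda":
        torch.cuda.synchronize()  # GPU work is async; wait before timing
    start = time.perf_counter()
    for _ in range(iters):
        torch.matmul(a, b)
    if device == "cuda":
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

print(f"CPU: {time_matmul('cpu'):.4f} s/iter")
if torch.cuda.is_available():
    # High throughput on large batched operations is the GPU's strength.
    print(f"GPU: {time_matmul('cuda'):.4f} s/iter")
```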
-
Facebook Hardware Options
|           | 2U Single Socket         | 2U Dual Socket                  | 3U: Big Sur                   | 3U: Big Basin                                        |
| CPUs      | 1x Broadwell/Skylake     | 2x Broadwell/Skylake            | 1x                            | 1x                                                   |
| GPUs      | --                       | --                              | 8x M40                        | 8x P100 / V100                                       |
| Memory    | 32 GB                    | >> 32 GB                        | 96 GB                         | 128 GB                                               |
| Teraflops | 0.8                      | 1.6                             | 56                            | 125.6                                                |
| Use Cases | Web Tier                 | Compute and Memory Intensive    | Deep Neural Network ML Training | Deep Neural Network ML Training                    |
| Products  | Facebook.com             | Facer, Search, News Feed, Sigma | Deprecated                    | Facer, Lumos, Language Translation, Speech Recognition |
-
Training Pipelines
-
Deep Neural Network Parallelism
Data Parallelism
◦ Train the full model on multiple machines, and combine the results
◦ Best suited for small models that fit on a single GPU and support large batch sizes
◦ Achieved 4x throughput using 5x the machine count
Model Parallelism
◦ Train a portion of the model on each machine and synchronize intermediate results across machines
◦ Best suited for large models that require small batch sizes
-
Data Parallelism
https://static.googleusercontent.com/media/research.google.com/en//archive/large_deep_networks_nips2012.pdf
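The linked Dean et al. paper describes data parallelism with parameter servers. As a minimal single-process sketch of the same mechanics, with in-process replicas standing in for separate machines and a gradient average standing in for the all-reduce/parameter-server step (a real deployment would use torch.nn.parallel.DistributedDataParallel):

```python
import copy
import torch
import torch.nn as nn

# Toy model and a batch to be sharded across "workers".
model = nn.Linear(10, 1)
inputs, targets = torch.randn(8, 10), torch.randn(8, 1)
num_workers = 2
shards = zip(inputs.chunk(num_workers), targets.chunk(num_workers))

# Each worker runs the FULL model on its shard of the batch...
replicas = [copy.deepcopy(model) for _ in range(num_workers)]
for replica, (x, y) in zip(replicas, shards):
    loss = nn.functional.mse_loss(replica(x), y)
    loss.backward()

# ...then gradients are averaged (the combine/all-reduce step) and
# applied to the shared parameters.
for p, *reps in zip(model.parameters(), *(r.parameters() for r in replicas)):
    p.grad = torch.stack([rp.grad for rp in reps]).mean(dim=0)
torch.optim.SGD(model.parameters(), lr=0.1).step()
```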
-
Model Parallelism
https://static.googleusercontent.com/media/research.google.com/en//archive/large_deep_networks_nips2012.pdf
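A minimal PyTorch sketch of model parallelism's shape: each stage of the network lives on its own device, and the intermediate activation crosses the boundary. In a real multi-node setup that activation transfer is the synchronization traffic between machines. Device names fall back to CPU so the sketch runs anywhere:

```python
import torch
import torch.nn as nn

class TwoStageModel(nn.Module):
    """Each stage lives on its own device; activations cross the boundary."""
    def __init__(self, dev0: str, dev1: str):
        super().__init__()
        self.dev0, self.dev1 = dev0, dev1
        self.stage0 = nn.Linear(512, 256).to(dev0)
        self.stage1 = nn.Linear(256, 10).to(dev1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = torch.relu(self.stage0(x.to(self.dev0)))
        # This device-to-device copy is the "synchronize intermediate
        # results" step from the slide, in miniature.
        return self.stage1(x.to(self.dev1))

devices = ("cuda:0", "cuda:1") if torch.cuda.device_count() >= 2 else ("cpu", "cpu")
model = TwoStageModel(*devices)
out = model(torch.randn(4, 512))
```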
-
Deep Neural Network Parallelism
Questions:
◦ What is the overhead with parallelism?
◦ Which type of parallelism is better?
◦ What are some alternative solutions to parallelism?
-
Service Layer
PyTorch
◦ Research ML workloads
◦ Python DSL
◦ Generic support for CPU and GPU backends
◦ Open source
ONNX (Open Neural Network Exchange)
◦ Framework-independent representation of machine learning models and networks
◦ Vendor implementations optimize for performance
Caffe2
◦ Productionized ML workloads
◦ Specialized, hardware-specific backends
◦ Optimized for performance
◦ Deprecated – now part of PyTorch
FBLearner
◦ Feature Store: collection of features
◦ Flow: pipeline management system for training workflows, including job scheduling
◦ Predictor: low-latency inference engine
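FBLearner Flow is internal to Facebook, but the engineering blog post linked two slides below describes workflows as decorated Python functions whose steps become schedulable, tracked operators. A purely illustrative sketch of that shape; every name here is hypothetical, not Facebook's actual API:

```python
# Hypothetical names illustrating the workflow shape described in the
# FBLearner Flow blog post -- NOT the real internal API.
from dataclasses import dataclass, field

@dataclass
class ModelArtifact:
    path: str
    metrics: dict = field(default_factory=dict)

def workflow(fn):
    """Stand-in for a pipeline-registration decorator; in the real system
    each step would run as a schedulable operator with tracked I/O."""
    fn.is_workflow = True
    return fn

# Trivial stand-ins for operators backed by the Feature Store and trainers.
def fetch_features(table: str) -> list:
    return [(0.1, 0), (0.9, 1)]

def train(rows: list) -> dict:
    return {"threshold": sum(x for x, _ in rows) / len(rows)}

def evaluate(model: dict, rows: list) -> dict:
    correct = sum((x > model["threshold"]) == bool(y) for x, y in rows)
    return {"accuracy": correct / len(rows)}

@workflow
def train_ranking_model(features_table: str) -> ModelArtifact:
    rows = fetch_features(features_table)   # data-worker step
    model = train(rows)                     # training-worker step
    return ModelArtifact(path="/models/ranking/v1",
                         metrics=evaluate(model, rows))

artifact = train_ranking_model("user_engagement_features")
```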
-
PyTorch Example
https://colab.research.google.com/drive/1yhOm3g5EiNSElrZcZFLPawP6rbkazlkY?usp=sharing
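The notebook's contents are not reproduced in this transcript; for reference, here is a minimal PyTorch training loop of the kind such an introductory example typically walks through:

```python
import torch
import torch.nn as nn

# Toy regression task: learn y = 3x + 1 from noisy samples.
x = torch.linspace(-1, 1, 128).unsqueeze(1)
y = 3 * x + 1 + 0.1 * torch.randn_like(x)

model = nn.Linear(1, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for epoch in range(200):
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()      # autograd computes gradients
    optimizer.step()     # gradient descent update

print(model.weight.item(), model.bias.item())  # approaches 3 and 1
```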
-
FBLearner Flow
https://engineering.fb.com/2016/05/09/core-data/introducing-fblearner-flow-facebook-s-ai-backbone/
-
Data Workload vs Training Workload
Data Workers
◦ Retrieve, pre-process, and condense data
◦ Must adapt to new data and features
Training Workers
◦ Train the underlying model with data from the data workers
◦ Fully utilize machine resources
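PyTorch's DataLoader shows this division of labor in miniature: worker processes handle retrieval and pre-processing while the main process, playing the training worker, consumes ready batches. A minimal sketch with a synthetic dataset:

```python
import torch
from torch.utils.data import DataLoader, Dataset

class SyntheticDataset(Dataset):
    """Stands in for the data workers' retrieve/pre-process/condense stage."""
    def __len__(self) -> int:
        return 1024

    def __getitem__(self, idx: int):
        x = torch.randn(16)                  # "retrieved" raw features
        return x, (x.sum() > 0).float()      # "pre-processed" label

if __name__ == "__main__":  # guard required when workers are subprocesses
    # num_workers > 0 moves loading into separate processes, so the
    # training loop can keep the machine's compute fully utilized.
    loader = DataLoader(SyntheticDataset(), batch_size=64, num_workers=2)
    for features, labels in loader:
        pass  # training step would go here
```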
-
“Free” Compute
-
Complexity of Datacenter-Scale ML Training
Hyperparameter optimization (see the sketch after this slide)
Optimal utilization of heterogeneous compute resources
Proper placement of machine learning jobs relative to data
Network overhead
Code and infrastructure re-use for a wide variety of applications and machine learning architectures
Regulations (e.g. GDPR)
Question: How does the Facebook datacenter differ from that of Google, Amazon, or Microsoft?
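As referenced above, hyperparameter optimization is one reason training parallelizes well across spare datacenter capacity: each sampled configuration is an independent job. A minimal random-search sketch; the search space and scoring stub are illustrative:

```python
import random

# Illustrative search space; in the datacenter each sampled configuration
# would be submitted as an independent training job on spare capacity.
SPACE = {
    "learning_rate": [1e-4, 1e-3, 1e-2],
    "batch_size": [64, 256, 1024],
    "hidden_units": [128, 512],
}

def train_and_score(config: dict) -> float:
    """Stand-in for launching a training job and reading back its metric."""
    return random.Random(str(sorted(config.items()))).random()

trials = [{k: random.choice(v) for k, v in SPACE.items()} for _ in range(8)]
best = max(trials, key=train_and_score)
print("best config:", best)
```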
-
FBLearner Predictor
Serves online predictions for models trained using FBLearner Flow
Online inference runs on CPUs
Stringent SLA requirements for inference applications
◦ "Nice to have" deadline: approximate results, which can be overridden later, can be returned to the user early
◦ Firm deadline: proper content must be returned to the user
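One way to read the two SLA classes in code: run inference against a deadline, return a cheap approximate result when a "nice to have" deadline expires, and block for the real result when the deadline is firm. A sketch using a thread pool; the model call and fallback are placeholders:

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError

pool = ThreadPoolExecutor(max_workers=4)

def run_model(request: str) -> str:
    time.sleep(0.2)  # placeholder for real inference latency
    return f"exact-prediction({request})"

def approximate(request: str) -> str:
    return f"cached-default({request})"  # cheap result, overridden later

def predict(request: str, deadline_s: float, firm: bool) -> str:
    future = pool.submit(run_model, request)
    try:
        return future.result(timeout=deadline_s)
    except TimeoutError:
        if firm:
            return future.result()   # firm: wait for proper content
        return approximate(request)  # "nice to have": return early

print(predict("req-1", deadline_s=0.05, firm=False))  # approximate, early
print(predict("req-2", deadline_s=0.05, firm=True))   # blocks for exact result
```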
-
Fitting Models for the Inference Hardware
Latency vs throughput
◦ SLA requirements
Quantization of weights without sacrificing model accuracy
◦ Fewer bits per weight
Pruning the model so it fits within the LLC (last-level cache)
◦ Reduced tail latency
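PyTorch's dynamic quantization is a one-call illustration of the "fewer bits" point: linear-layer weights are stored as int8 and dequantized on the fly, shrinking the model so more of it fits in the last-level cache. This uses PyTorch's public API, not Facebook's internal tooling:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(1024, 1024),
    nn.ReLU(),
    nn.Linear(1024, 10),
)

# Quantize Linear weights to int8; activations stay float and are
# quantized dynamically at inference time.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

# Sanity check: the quantized model's outputs should stay close to the
# float model's, i.e. accuracy is largely preserved.
x = torch.randn(1, 1024)
print(torch.max(torch.abs(model(x) - quantized(x))).item())
```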
-
New Hardware for Machine Learning
TPU, Cerebras
-
Conclusion
Facebook’s machine learning requirements span a diverse set of applications, which must be compatible with a heterogeneous set of hardware
Spare capacity from diurnal compute cycles is available for machine learning training workloads
Distributed training must account for network performance
Optimal performance requires tight coupling of training and inference hardware with model design
-
Questions
What data would you like to see?
Is their approach generalizable to machine learning use cases outside of Facebook?
How does Facebook’s approach compare to that of Microsoft, Amazon, and Google – all of whom offer a public cloud?