TRANSCRIPT
© 2017 NVIDIA Corporation. All rights reserved. NVIDIA, the NVIDIA logo, TensorRT, and Tesla are trademarks and/or registered trademarks of NVIDIA Corporation in the U.S. and other countries. All other trademarks and copyrights are the property of their respective owners.
A good user experience requires that inference takes no more than seven
milliseconds, but CPU-powered inference cannot meet this bar.
The combination of TensorRT with NVIDIA GPUs delivers the world’s fastest inference for AI-enabled services, with latency under
that seven-millisecond mark.
[Chart: inference latency, CPU INFERENCE vs. GPU INFERENCE]
GPU POWERED DEEP LEARNING TRAINING
Welcome to the era of AI and intelligent machines. An era fueled by big data, driven by deep learning, and powered by GPUs. Deep learning training is the process by which a neural network learns from data in the form of images, video, text, speech, and transactions, converting it into intelligence.
AI IS ONLY USEFUL IF IT’S FAST
A neural network can be trained to understand natural conversation, monitor hundreds of live video streams, or navigate a vehicle safely through a city. However, inference needs to be fast to deliver that learned intelligence to users.
ACCELERATING AI
GPU Deep Learning with the NVIDIA® TensorRT™ Programmable Inference Accelerator
GPU POWERED INFERENCE
Inference is when a trained neural network is deployed into a product or application so that it can do things such as recognize images, understand conversational speech, or make a shopping recommendation.
DEEP LEARNING
BIG DATA
FASTER AI. LOWER COST.
www.nvidia.com/inference
ONE UNIFIED SOLUTION FOR AI
With the introduction of TensorRT on GPUs, NVIDIA offers an AI inferencing solution that sharply boosts performance and slashes the cost of inferencing from the data center to the cloud to edge devices, including self-driving cars and robots.
ACCURACY
LOW LATENCY
EFFICIENCY
VERSATILITY
PERFORMANCE
INFERENCE COST SAVINGS
SPEECH RECOGNITION
PROBLEM: 500M ACTIVE USERS × 15 MIN PER DAY = COST OF CPU DATA CENTERS: 200K CPU SERVERS, 100MW OF POWER, $1 BILLION
SOLUTION: 1 HGX SERVER = 160 CPU SERVERS
INFERENCE PERFORMANCE
40X faster inference with TensorRT 3 on NVIDIA Tesla® V100 compared to CPU-only inference.
[Chart: relative inference performance (images/sec), CPU-ONLY: 1X vs. TESLA V100 + TensorRT: 40X]
Deep learning is now used to build AI into everything from kitchen appliances to cars to robots. With every
new use case, the cost of supporting these applications and products increases as well.
MORE FEATURES
THE EXPLOSION OF AI
The uses for deep learning inference are becoming more complex and widespread. People have come to expect fast and natural interactions with their devices. At the same time, unprecedented growth is dramatically increasing the number and variety of AI-powered applications and products.
MORE PRODUCTS
MORE COMPLEXITY
MORE APPLICATIONS
ANALYZING STRATEGY
INTERPRETING DATA
Neural networks are getting more complex because they deliver more sophisticated services, and this drives up development and deployment costs.
UNDERSTANDING SPEECH
PERSONALIZING CONTENT
RECOGNIZING BEHAVIOR
PREDICTING EVENTS
AVOIDING COLLISIONS
DRONES
AUTONOMOUS CARS
SMART CITIES
VIRTUAL ASSISTANTS
MOBILE PHONES
MEDICAL DEVICES
THE POWER OF NVIDIA TensorRT
TensorRT is a high-performance optimizing compiler and runtime engine for production deployment of AI applications. It can rapidly optimize, validate, and deploy trained neural networks for inference to hyperscale data center, embedded, or automotive GPU platforms.
Multi-Stream Execution
Scales to multiple input streams by processing them in parallel using the same model and weights.
Dynamic Tensor Memory
Reduces memory footprint and improves memory re-use by allocating memory for each tensor only for the duration of its usage.
Layer and Tensor Fusion
Improves GPU utilization and optimizes memory storage and bandwidth by combining successive nodes into a single node for single-kernel execution.
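The fusion idea above can be illustrated with a toy example (not TensorRT's actual implementation; all names here are hypothetical): a multiply, a bias add, and a ReLU that would each launch their own kernel are collapsed into a single pass, so intermediate results never round-trip through memory.

```python
import numpy as np

def mul_bias_relu_unfused(x, w, b):
    """Three separate 'kernels': multiply, add bias, then ReLU.
    (A 1-D pointwise stand-in for successive graph nodes.)"""
    y = x * w                  # kernel 1: multiply
    y = y + b                  # kernel 2: bias add
    return np.maximum(y, 0.0)  # kernel 3: ReLU

def mul_bias_relu_fused(x, w, b):
    """One fused 'kernel': the same math in a single pass over the data."""
    return np.maximum(x * w + b, 0.0)

x = np.array([-1.0, 0.5, 2.0])
w, b = 2.0, -0.5
# Fusion changes the execution schedule, not the result.
assert np.allclose(mul_bias_relu_unfused(x, w, b),
                   mul_bias_relu_fused(x, w, b))
```

On a GPU, the unfused version writes and re-reads the intermediate tensor between kernels; the fused version keeps it in registers, which is where the bandwidth savings come from.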
Weight and Activation Precision Calibration
Significantly improves inference performance of models trained in FP32 full precision by quantizing them to INT8, while minimizing accuracy loss.
Kernel Auto-Tuning
Optimizes execution time by choosing the best data layers and best parallel algorithms for the target Jetson, Tesla, or DRIVE PX GPU platform.
TensorRT Optimizer
TensorRT Runtime
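The precision calibration described above can be sketched in plain Python. This is a deliberately naive toy, not TensorRT's method (TensorRT uses an entropy-based calibration): it picks a per-tensor scale by abs-max over calibration data, maps FP32 values to signed INT8, then dequantizes to show the bounded error.

```python
import numpy as np

def calibrate_scale(activations: np.ndarray) -> float:
    """Pick a per-tensor scale from calibration data (naive abs-max calibration)."""
    return float(np.abs(activations).max()) / 127.0

def quantize_int8(x: np.ndarray, scale: float) -> np.ndarray:
    """Map FP32 values into the symmetric signed INT8 range [-127, 127]."""
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8)

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate FP32 values from the INT8 codes."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
fp32_acts = rng.normal(0.0, 1.0, size=10_000).astype(np.float32)

scale = calibrate_scale(fp32_acts)
q = quantize_int8(fp32_acts, scale)
recovered = dequantize(q, scale)

# With abs-max calibration nothing clips, so the round-off error
# is bounded by half the quantization step (scale / 2).
max_err = float(np.abs(fp32_acts - recovered).max())
assert max_err <= scale / 2 + 1e-6
```

Each value now occupies 1 byte instead of 4 and feeds integer math units, which is where the inference speedup comes from; the calibration step exists to keep `max_err` small enough that accuracy is preserved.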