Accelerating AI Inference Infographic - NVIDIA



TRANSCRIPT

Page 1: Accelerating AI Inference Infographic - NVIDIA
images.nvidia.com/content/pdf/infographic/inference-infographic.pdf

© 2017 NVIDIA Corporation. All rights reserved. NVIDIA, the NVIDIA logo, TensorRT, and Tesla are trademarks and/or registered trademarks of NVIDIA Corporation in the U.S. and other countries. All other trademarks and copyrights are the property of their respective owners.

A good user experience requires that inference takes no more than seven milliseconds, but CPU-powered inference cannot meet this bar. The combination of TensorRT with NVIDIA GPUs delivers the world’s fastest inference for AI-enabled services, with latency under that seven-millisecond mark.

[Chart: inference latency, CPU inference vs. GPU inference]

Welcome to the era of AI and intelligent machines. An era fueled by big data, driven by deep learning, and powered by GPUs.

GPU POWERED DEEP LEARNING TRAINING

Deep learning training is the process by which a neural network learns from data in the form of images, video, text, speech, and transactions, converting it into intelligence.

AI IS ONLY USEFUL IF IT’S FAST

A neural network can be trained to understand natural conversation, monitor hundreds of live video streams, or navigate a vehicle safely through a city. However, inference needs to be fast to deliver that learned intelligence to users.

ACCELERATING AI

GPU Deep Learning with the NVIDIA® TensorRT™ Programmable Inference Accelerator

GPU POWERED INFERENCE

Inference is when a trained neural network is deployed into a product or application so that it can do things such as recognize images, understand conversational speech, or make a shopping recommendation.

DEEP LEARNING · BIG DATA

FASTER AI. LOWER COST.
www.nvidia.com/inference

ONE UNIFIED SOLUTION FOR AI

With the introduction of TensorRT on GPUs, NVIDIA offers an AI inferencing solution that sharply boosts performance and slashes the cost of inferencing from the data center to cloud to edge devices, including self-driving cars and robots.

ACCURACY · LOW LATENCY · EFFICIENCY · VERSATILITY · PERFORMANCE

INFERENCE COST SAVINGS

SPEECH RECOGNITION

PROBLEM: 500M ACTIVE USERS × 15 MIN PER DAY = 200K CPU SERVERS, 100MW OF POWER, $1 BILLION COST OF CPU DATA CENTERS

SOLUTION: 1 HGX SERVER = 160 CPU SERVERS
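The arithmetic implied by these figures can be made explicit. Below is a back-of-envelope sketch in Python: only the inputs (500M users, 15 minutes per day, 200K servers) come from the infographic; the derived per-server concurrency is an inference from those numbers, not a figure NVIDIA states.

```python
# Back-of-envelope reconstruction of the infographic's data-center math.
users = 500_000_000        # active users of the speech service
minutes_per_user = 15      # minutes of speech per user per day
servers = 200_000          # CPU servers quoted by the infographic

total_minutes_per_day = users * minutes_per_user             # 7.5 billion
avg_concurrent_streams = total_minutes_per_day / (24 * 60)   # ~5.2 million
streams_per_server = avg_concurrent_streams / servers        # ~26

print(f"{total_minutes_per_day / 1e9:.1f}B speech-minutes per day")
print(f"~{avg_concurrent_streams / 1e6:.1f}M concurrent streams on average")
print(f"~{streams_per_server:.0f} concurrent streams per CPU server (implied)")
```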

INFERENCE PERFORMANCE

40X faster inference with TensorRT 3 on NVIDIA Tesla® V100 compared to CPU-only inference.

[Chart: relative inference performance (images/sec). CPU-only: 1X; Tesla V100 + TensorRT: 40X]

Deep learning is now used to build AI into everything from kitchen appliances to cars to robots. With every new use case, the cost of supporting these applications and products increases as well.

THE EXPLOSION OF AI

The uses for deep learning inference are becoming more complex and widespread. People have come to expect fast and natural interactions with their devices. At the same time, unprecedented growth is dramatically increasing the number and variety of AI-powered applications and products.

MORE FEATURES · MORE PRODUCTS · MORE COMPLEXITY · MORE APPLICATIONS

Neural networks are getting more complex because they are delivering more sophisticated services, and this drives up development and deployment costs.

ANALYZING STRATEGY · INTERPRETING DATA · UNDERSTANDING SPEECH · PERSONALIZING CONTENT · RECOGNIZING BEHAVIOR · PREDICTING EVENTS · AVOIDING COLLISIONS

DRONES · AUTONOMOUS CARS · SMART CITIES · VIRTUAL ASSISTANTS · MOBILE PHONES · MEDICAL DEVICES

THE POWER OF NVIDIA TensorRT

TensorRT is a high-performance optimizing compiler and runtime engine for production deployment of AI applications. It can rapidly optimize, validate, and deploy trained neural networks for inference to hyperscale data centers, embedded, or automotive GPU platforms.
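To make the optimizer/runtime split concrete, here is a minimal build-then-deploy sketch. It uses the later TensorRT 8.x Python API rather than the 2017-era TensorRT 3 interface this infographic describes, and "model.onnx" / "model.engine" are placeholder paths:

```python
import tensorrt as trt

# Phase 1: TensorRT Optimizer. Parse a trained network and build an
# optimized inference engine offline ("model.onnx" is a placeholder).
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open("model.onnx", "rb") as f:
    parser.parse(f.read())

config = builder.create_builder_config()
engine_bytes = builder.build_serialized_network(network, config)
with open("model.engine", "wb") as f:
    f.write(engine_bytes)

# Phase 2: TensorRT Runtime. Deserialize the engine and get an
# execution context ready for inference in production.
runtime = trt.Runtime(logger)
engine = runtime.deserialize_cuda_engine(engine_bytes)
context = engine.create_execution_context()
```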

Multi-Stream Execution
Scales to multiple input streams by processing them in parallel using the same model and weights.
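A minimal sketch of the idea, continuing from the engine built above: one copy of the model and weights serves several input streams concurrently, each on its own CUDA stream. It assumes the TensorRT 8.x-era Python API, PyCUDA, and an engine with static input shapes:

```python
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit  # noqa: F401  (initializes a CUDA context)

logger = trt.Logger(trt.Logger.WARNING)
with open("model.engine", "rb") as f:
    engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())

NUM_STREAMS = 4  # four independent input streams, one shared set of weights

contexts, streams, bindings = [], [], []
for _ in range(NUM_STREAMS):
    contexts.append(engine.create_execution_context())
    streams.append(cuda.Stream())
    # One set of device buffers per stream so concurrent runs never collide.
    bufs = [cuda.mem_alloc(trt.volume(engine.get_binding_shape(i))
                           * engine.get_binding_dtype(i).itemsize)
            for i in range(engine.num_bindings)]
    bindings.append([int(b) for b in bufs])

# Launch all inferences; the GPU overlaps them across CUDA streams.
for ctx, stream, binds in zip(contexts, streams, bindings):
    ctx.execute_async_v2(bindings=binds, stream_handle=stream.handle)

for stream in streams:
    stream.synchronize()  # wait for every parallel inference to finish
```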

Dynamic Tensor Memory
Reduces memory footprint and improves memory reuse by allocating memory for each tensor only for the duration of its usage.
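The effect is easy to see with a toy liveness calculation. This is a conceptual sketch only, not TensorRT's actual allocator: each tensor occupies memory just between its first and last use, so peak usage is the largest total live at any one step.

```python
# Each tensor: (name, size_bytes, first_use_step, last_use_step).
tensors = [
    ("input", 4_000_000, 0, 1),
    ("conv1", 8_000_000, 1, 2),
    ("conv2", 8_000_000, 2, 3),
    ("fc",    1_000_000, 3, 4),
]

# Naive: every tensor keeps its own buffer for the entire run.
naive_peak = sum(size for _, size, _, _ in tensors)

# Liveness-based reuse: a tensor's memory is held only while it is in use.
steps = range(max(last for *_, last in tensors) + 1)
reuse_peak = max(
    sum(size for _, size, first, last in tensors if first <= t <= last)
    for t in steps
)

print(f"naive peak: {naive_peak / 1e6:.0f} MB; with reuse: {reuse_peak / 1e6:.0f} MB")
```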

Layer and Tensor Fusion
Improves GPU utilization and optimizes memory storage and bandwidth by combining successive nodes into a single node, for single kernel execution.
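A toy sketch of the idea on a made-up op list (not TensorRT's internal graph representation): successive conv, bias, and ReLU nodes collapse into one node that executes as a single kernel, avoiding extra kernel launches and memory round-trips for intermediates.

```python
graph = ["conv", "bias", "relu", "conv", "bias", "relu", "pool"]

def fuse(ops, pattern=("conv", "bias", "relu"), fused="conv_bias_relu"):
    """Collapse every occurrence of `pattern` into a single fused node."""
    out, i = [], 0
    while i < len(ops):
        if tuple(ops[i:i + len(pattern)]) == pattern:
            out.append(fused)        # three nodes become one kernel
            i += len(pattern)
        else:
            out.append(ops[i])
            i += 1
    return out

print(fuse(graph))  # ['conv_bias_relu', 'conv_bias_relu', 'pool']
```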

Weight and Activation Precision Calibration
Significantly improves inference performance of models trained in FP32 full precision by quantizing them to INT8, while minimizing accuracy loss.
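A numpy toy showing the core idea with a simple max-abs scale. TensorRT's actual calibrator chooses scales more carefully (entropy calibration over representative data) to minimize accuracy loss; this sketch only shows why a data-derived scale keeps INT8 close to FP32.

```python
import numpy as np

rng = np.random.default_rng(0)
activations = rng.normal(0.0, 0.5, size=100_000).astype(np.float32)

scale = np.abs(activations).max() / 127.0            # map FP32 range to INT8
q = np.clip(np.round(activations / scale), -127, 127).astype(np.int8)
dq = q.astype(np.float32) * scale                    # dequantize to compare

err = np.abs(activations - dq).max()
print(f"scale={scale:.5f}, max abs quantization error={err:.5f}")
```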

Kernel Auto-Tuning
Optimizes execution time by choosing the best data layers and parallel algorithms for the target GPU platform (Jetson, Tesla, or Drive PX).
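The principle is empirical selection: time each candidate implementation of the same operation on the actual target hardware and keep the fastest. A toy sketch, with numpy variants standing in for CUDA kernel candidates:

```python
import time
import numpy as np

a = np.random.rand(256, 256).astype(np.float32)
b = np.random.rand(256, 256).astype(np.float32)

# Candidate "kernels" computing the same matrix product.
candidates = {
    "matmul": lambda: a @ b,
    "einsum": lambda: np.einsum("ij,jk->ik", a, b),
    "dot":    lambda: np.dot(a, b),
}

def bench(fn, reps=50):
    start = time.perf_counter()
    for _ in range(reps):
        fn()
    return (time.perf_counter() - start) / reps

timings = {name: bench(fn) for name, fn in candidates.items()}
best = min(timings, key=timings.get)
print(f"selected kernel: {best} ({timings[best] * 1e6:.1f} us per call)")
```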

[Diagram: TensorRT Optimizer and TensorRT Runtime]