Accelerating AI Inference Infographic - NVIDIA



TRANSCRIPT

Page 1: Accelerating AI Inference Infographic - NVIDIA
images.nvidia.com/content/pdf/infographic/inference-infographic.pdf

© 2017 NVIDIA Corporation. All rights reserved. NVIDIA, the NVIDIA logo, TensorRT, and Tesla are trademarks and/or registered trademarks of NVIDIA Corporation in the U.S. and other countries. All other trademarks and copyrights are the property of their respective owners.

A good user experience requires that inference takes no more than seven milliseconds, but CPU-powered inference cannot meet this bar. The combination of TensorRT with NVIDIA GPUs delivers the world’s fastest inference for AI-enabled services, with latency under that seven-millisecond mark.

[Chart: inference latency, CPU inference vs. GPU inference]

Welcome to the era of AI and intelligent machines. An era fueled by big data, driven by deep learning, and powered by GPUs.

GPU POWERED DEEP LEARNING TRAINING

Deep learning training is the process by which a neural network learns from data in the form of images, video, text, speech, and transactions, converting it into intelligence.

AI IS ONLY USEFUL IF IT’S FAST

A neural network can be trained to understand natural conversation, monitor hundreds of live video streams, or navigate a vehicle safely through a city. However, inference needs to be fast to deliver that learned intelligence to users.

ACCELERATING AI

GPU Deep Learning with the NVIDIA® TensorRT™ Programmable Inference Accelerator

GPU POWERED INFERENCE

Inference is when a trained neural network is deployed into a product or application so that it can do things such as recognize images, understand conversational speech, or make a shopping recommendation.

DEEP LEARNING · BIG DATA

FASTER AI. LOWER COST.
www.nvidia.com/inference

ONE UNIFIED SOLUTION FOR AI

With the introduction of TensorRT on GPUs, NVIDIA offers an AI inferencing solution that sharply boosts performance and slashes the cost of inferencing from the data center to cloud to edge devices, including self-driving cars and robots.

ACCURACY · LOW LATENCY · EFFICIENCY · VERSATILITY · PERFORMANCE

INFERENCE COST SAVINGS

SPEECH RECOGNITION

PROBLEM: 500M ACTIVE USERS × 15 MIN PER DAY = 200K CPU SERVERS, 100MW OF POWER, $1 BILLION COST OF CPU DATA CENTERS

SOLUTION: 1 HGX SERVER = 160 CPU SERVERS
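The arithmetic implied by these figures can be made explicit. Below is a back-of-envelope sketch in Python: only the inputs (500M users, 15 minutes per day, 200K servers) come from the infographic; the derived per-server concurrency is an inference from those numbers, not a figure NVIDIA states.

```python
# Back-of-envelope reconstruction of the infographic's data-center math.
users = 500_000_000        # active users of the speech service
minutes_per_user = 15      # minutes of speech per user per day
servers = 200_000          # CPU servers quoted by the infographic

total_minutes_per_day = users * minutes_per_user             # 7.5 billion
avg_concurrent_streams = total_minutes_per_day / (24 * 60)   # ~5.2 million
streams_per_server = avg_concurrent_streams / servers        # ~26

print(f"{total_minutes_per_day / 1e9:.1f}B speech-minutes per day")
print(f"~{avg_concurrent_streams / 1e6:.1f}M concurrent streams on average")
print(f"~{streams_per_server:.0f} concurrent streams per CPU server (implied)")
```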

INFERENCE PERFORMANCE

40X faster inference with TensorRT 3 on NVIDIA Tesla® V100 compared to CPU-only inference.

[Chart: relative inference performance (images/sec). CPU-only: 1X; Tesla V100 + TensorRT: 40X]

Deep learning is now used to build AI into everything from kitchen appliances to cars to robots. With every new use case, the cost of supporting these applications and products increases as well.

THE EXPLOSION OF AI

The uses for deep learning inference are becoming more complex and widespread. People have come to expect fast and natural interactions with their devices. At the same time, unprecedented growth is dramatically increasing the number and variety of AI-powered applications and products.

MORE FEATURES · MORE PRODUCTS · MORE COMPLEXITY · MORE APPLICATIONS

Neural networks are getting more complex because they are delivering more sophisticated services, and this drives up development and deployment costs.

ANALYZING STRATEGY · INTERPRETING DATA · UNDERSTANDING SPEECH · PERSONALIZING CONTENT · RECOGNIZING BEHAVIOR · PREDICTING EVENTS · AVOIDING COLLISIONS

DRONES · AUTONOMOUS CARS · SMART CITIES · VIRTUAL ASSISTANTS · MOBILE PHONES · MEDICAL DEVICES

THE POWER OF NVIDIA TensorRT

TensorRT is a high-performance optimizing compiler and runtime engine for production deployment of AI applications. It can rapidly optimize, validate, and deploy trained neural networks for inference to hyperscale data centers, embedded, or automotive GPU platforms.
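To make the optimizer/runtime split concrete, here is a minimal build-then-deploy sketch. It uses the later TensorRT 8.x Python API rather than the 2017-era TensorRT 3 interface this infographic describes, and "model.onnx" / "model.engine" are placeholder paths:

```python
import tensorrt as trt

# Phase 1: TensorRT Optimizer. Parse a trained network and build an
# optimized inference engine offline ("model.onnx" is a placeholder).
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open("model.onnx", "rb") as f:
    parser.parse(f.read())

config = builder.create_builder_config()
engine_bytes = builder.build_serialized_network(network, config)
with open("model.engine", "wb") as f:
    f.write(engine_bytes)

# Phase 2: TensorRT Runtime. Deserialize the engine and get an
# execution context ready for inference in production.
runtime = trt.Runtime(logger)
engine = runtime.deserialize_cuda_engine(engine_bytes)
context = engine.create_execution_context()
```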

Multi-Stream Execution
Scales to multiple input streams by processing them in parallel using the same model and weights.
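A minimal sketch of the idea, continuing from the engine built above: one copy of the model and weights serves several input streams concurrently, each on its own CUDA stream. It assumes the TensorRT 8.x-era Python API, PyCUDA, and an engine with static input shapes:

```python
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit  # noqa: F401  (initializes a CUDA context)

logger = trt.Logger(trt.Logger.WARNING)
with open("model.engine", "rb") as f:
    engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())

NUM_STREAMS = 4  # four independent input streams, one shared set of weights

contexts, streams, bindings = [], [], []
for _ in range(NUM_STREAMS):
    contexts.append(engine.create_execution_context())
    streams.append(cuda.Stream())
    # One set of device buffers per stream so concurrent runs never collide.
    bufs = [cuda.mem_alloc(trt.volume(engine.get_binding_shape(i))
                           * engine.get_binding_dtype(i).itemsize)
            for i in range(engine.num_bindings)]
    bindings.append([int(b) for b in bufs])

# Launch all inferences; the GPU overlaps them across CUDA streams.
for ctx, stream, binds in zip(contexts, streams, bindings):
    ctx.execute_async_v2(bindings=binds, stream_handle=stream.handle)

for stream in streams:
    stream.synchronize()  # wait for every parallel inference to finish
```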

Dynamic Tensor Memory
Reduces memory footprint and improves memory reuse by allocating memory for each tensor only for the duration of its usage.
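The effect is easy to see with a toy liveness calculation. This is a conceptual sketch only, not TensorRT's actual allocator: each tensor occupies memory just between its first and last use, so peak usage is the largest total live at any one step.

```python
# Each tensor: (name, size_bytes, first_use_step, last_use_step).
tensors = [
    ("input", 4_000_000, 0, 1),
    ("conv1", 8_000_000, 1, 2),
    ("conv2", 8_000_000, 2, 3),
    ("fc",    1_000_000, 3, 4),
]

# Naive: every tensor keeps its own buffer for the entire run.
naive_peak = sum(size for _, size, _, _ in tensors)

# Liveness-based reuse: a tensor's memory is held only while it is in use.
steps = range(max(last for *_, last in tensors) + 1)
reuse_peak = max(
    sum(size for _, size, first, last in tensors if first <= t <= last)
    for t in steps
)

print(f"naive peak: {naive_peak / 1e6:.0f} MB; with reuse: {reuse_peak / 1e6:.0f} MB")
```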

Layer and Tensor Fusion
Improves GPU utilization and optimizes memory storage and bandwidth by combining successive nodes into a single node, for single kernel execution.
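A toy sketch of the idea on a made-up op list (not TensorRT's internal graph representation): successive conv, bias, and ReLU nodes collapse into one node that executes as a single kernel, avoiding extra kernel launches and memory round-trips for intermediates.

```python
graph = ["conv", "bias", "relu", "conv", "bias", "relu", "pool"]

def fuse(ops, pattern=("conv", "bias", "relu"), fused="conv_bias_relu"):
    """Collapse every occurrence of `pattern` into a single fused node."""
    out, i = [], 0
    while i < len(ops):
        if tuple(ops[i:i + len(pattern)]) == pattern:
            out.append(fused)        # three nodes become one kernel
            i += len(pattern)
        else:
            out.append(ops[i])
            i += 1
    return out

print(fuse(graph))  # ['conv_bias_relu', 'conv_bias_relu', 'pool']
```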

Weight and Activation Precision Calibration
Significantly improves inference performance of models trained in FP32 full precision by quantizing them to INT8, while minimizing accuracy loss.
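A numpy toy showing the core idea with a simple max-abs scale. TensorRT's actual calibrator chooses scales more carefully (entropy calibration over representative data) to minimize accuracy loss; this sketch only shows why a data-derived scale keeps INT8 close to FP32.

```python
import numpy as np

rng = np.random.default_rng(0)
activations = rng.normal(0.0, 0.5, size=100_000).astype(np.float32)

scale = np.abs(activations).max() / 127.0            # map FP32 range to INT8
q = np.clip(np.round(activations / scale), -127, 127).astype(np.int8)
dq = q.astype(np.float32) * scale                    # dequantize to compare

err = np.abs(activations - dq).max()
print(f"scale={scale:.5f}, max abs quantization error={err:.5f}")
```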

Kernel Auto-Tuning
Optimizes execution time by choosing the best data layers and parallel algorithms for the target GPU platform (Jetson, Tesla, or Drive PX).
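The principle is empirical selection: time each candidate implementation of the same operation on the actual target hardware and keep the fastest. A toy sketch, with numpy variants standing in for CUDA kernel candidates:

```python
import time
import numpy as np

a = np.random.rand(256, 256).astype(np.float32)
b = np.random.rand(256, 256).astype(np.float32)

# Candidate "kernels" computing the same matrix product.
candidates = {
    "matmul": lambda: a @ b,
    "einsum": lambda: np.einsum("ij,jk->ik", a, b),
    "dot":    lambda: np.dot(a, b),
}

def bench(fn, reps=50):
    start = time.perf_counter()
    for _ in range(reps):
        fn()
    return (time.perf_counter() - start) / reps

timings = {name: bench(fn) for name, fn in candidates.items()}
best = min(timings, key=timings.get)
print(f"selected kernel: {best} ({timings[best] * 1e6:.1f} us per call)")
```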

[Diagram: TensorRT Optimizer and TensorRT Runtime]