GPU Inference Platform - NVIDIA · 2018-11-26 · HW/SW service ... for efficient inference
TRANSCRIPT
This session is intended for:
Those designing an HW/SW service architecture for efficient inference
Those building an inference platform that can serve models trained with various frameworks such as TensorFlow, Caffe, and PyTorch
Those building an inference platform that makes the most efficient use of GPU performance and QoS when building a service
Summary
TESLA T4
TENSORRT INFERENCE SERVER
KUBERNETES ON NVIDIA GPUs (KONG)
TENSORRT 5
TESLA T4
INFERENCE GPU – TESLA T4: WORLD'S MOST ADVANCED INFERENCE GPU
SPECIFICATION
75 watts
NEW TURING TENSOR CORE: 4 x 4 Matrix Processing Array
T4 INFERENCE SPEEDUP
ResNet-50 (27x), DeepSpeech 2 (21x), GNMT (36x)
INFERENCE EFFICIENCY (images/sec/watt)
CPU-only Server: 1
Tesla P4: 25
Tesla V100: 21
Tesla T4: 56
https://www.nvidia.com/en-us/data-center/resources/inference-technical-overview/
TURING MPS (MULTI-PROCESS SERVICE)
HW TRANSCODING ENGINE: enables effective integration of deep learning and video applications
Decode: 2x vs. P4 (supports decoding 38 Full-HD video streams)
Encode: Performance Mode / Efficiency Mode
TENSORRT INFERENCE SERVER (TRTIS)
WHAT IS THE TRT INFERENCE SERVER?
Software stack: NV DL SDK / NV Docker / DNN Models / TensorRT Inference Server / Kubernetes
A production-quality module for datacenter services
A container-based, high-performance inference server
Maximizes CPU / GPU resource utilization
Features
Supports multiple model formats: TensorRT, TensorFlow, Caffe2, ONNX
Multi-GPU support: distributes inference requests across all GPUs
Multi-tenancy support: multiple models and multiple instances can run concurrently, with multiple versions per instance
Batch request support: improves throughput
Provides monitoring metrics: usable for service orchestration, HA, LB, QoS, etc.
Basic Architecture
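As a rough sketch of running the server (the NGC container image tag, the trtserver --model-store flag, and the port numbers are assumptions based on the 18.10 release; check the TRTIS User Guide listed in the references):

# Assumed model repository layout: one directory per model,
# one numeric subdirectory per version.
#   /path/to/models/resnet50_plan/config.pbtxt
#   /path/to/models/resnet50_plan/1/model.plan

# Run the TRTIS container from NGC with the model repository mounted.
# Assumed default ports: 8000 = HTTP, 8001 = gRPC, 8002 = metrics.
nvidia-docker run --rm \
  -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v /path/to/models:/models \
  nvcr.io/nvidia/tensorrtserver:18.10-py3 \
  trtserver --model-store=/models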
SAMPLE CONFIGURATION
https://github.com/NVIDIA/dl-inference-server/blob/18.10/src/core/model_config.proto
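For illustration, a minimal config.pbtxt for a TensorRT plan model, following the fields defined in model_config.proto; the model name, tensor names, shapes, and batch size below are made up:

name: "resnet50_plan"
platform: "tensorrt_plan"
max_batch_size: 32
input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [ 3, 224, 224 ]
  }
]
output [
  {
    name: "prob"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]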
CLIENT SDK: https://github.com/NVIDIA/dl-inference-server/
C++ / Python client libraries for the TRT Inference Server, over HTTP / gRPC
Client sample code provided: image_client, perf_client
The client branch version matches the corresponding TRT Inference Server release
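The bundled image_client and perf_client samples are the reference examples for sending inference requests. As a minimal complement, the sketch below checks server health and status directly over HTTP with plain Python, assuming the /api/health/ready and /api/status paths and the default HTTP port 8000 of the 18.10 release:

import requests

TRTIS_URL = "http://localhost:8000"  # assumed default HTTP endpoint

# Readiness probe: returns HTTP 200 once the server and its models are ready.
ready = requests.get(TRTIS_URL + "/api/health/ready")
print("server ready:", ready.status_code == 200)

# Server status: description of loaded models, versions, and statistics.
status = requests.get(TRTIS_URL + "/api/status")
print(status.text[:500])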
HOW TO USE
- Scalable architecture via front-end / back-end separation
- Online model updates using model versioning
- Throughput / latency optimization using the instance-group option
HOW TO USE - SCALABLE ARCHITECTURE VIA FRONT-END / BACK-END SEPARATION
Client requests → LB → Front-end servers → Back-end inference servers with TRTIS
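One way to realize this split on Kubernetes is to run the TRTIS back-end as a Deployment behind a load-balancing Service; the sketch below is illustrative only (image tag, replica count, and object names are assumptions, and the model-repository volume is omitted for brevity):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: trtis-backend
spec:
  replicas: 4                      # scale the inference back-end horizontally
  selector:
    matchLabels:
      app: trtis
  template:
    metadata:
      labels:
        app: trtis
    spec:
      containers:
      - name: trtis
        image: nvcr.io/nvidia/tensorrtserver:18.10-py3
        args: ["trtserver", "--model-store=/models"]
        ports:
        - containerPort: 8000      # HTTP
        - containerPort: 8001      # gRPC
        - containerPort: 8002      # metrics
        resources:
          limits:
            nvidia.com/gpu: 1      # one GPU per back-end instance
---
apiVersion: v1
kind: Service
metadata:
  name: trtis-lb
spec:
  type: LoadBalancer               # LB in front of the back-end pods
  selector:
    app: trtis
  ports:
  - name: http
    port: 8000
    targetPort: 8000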
HOW TO USE - ONLINE MODEL UPDATES USING MODEL VERSIONING
- Multiple versions are supported per model
- Model versions can be changed dynamically while TRTIS is running: atomic inference is supported
- Effective for rolling updates, A/B tests, etc.
- Version Policy: All / Latest / Specific (see the sketch below)
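The policy is set per model in config.pbtxt through the version_policy field; a minimal sketch of the three variants (version numbers are illustrative, and a real configuration uses only one of them):

# Serve every version present in the model repository
version_policy: { all: {} }

# Serve only the newest N versions (here, the latest 2)
version_policy: { latest: { num_versions: 2 } }

# Serve an explicit list of versions, e.g. for an A/B test
version_policy: { specific: { versions: [ 1, 3 ] } }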
HOW TO USE - THROUGHPUT / LATENCY OPTIMIZATION USING THE INSTANCE-GROUP OPTION
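The instance_group setting in config.pbtxt controls how many execution instances of a model are created and on which GPUs, which is the main knob for trading per-request latency against overall throughput. An illustrative sketch (instance counts and GPU indices are made up):

# Two instances on GPU 0 and two on GPU 1: more instances generally raise
# throughput under concurrent load at some cost in per-request latency.
instance_group [
  {
    count: 2
    kind: KIND_GPU
    gpus: [ 0 ]
  },
  {
    count: 2
    kind: KIND_GPU
    gpus: [ 1 ]
  }
]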
REPORT METRICS FOR QOS
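The server exposes Prometheus-format metrics (GPU utilization and power, request counts, queue and compute times) on a dedicated metrics port that orchestration and QoS tooling can scrape. A minimal sketch, assuming the default metrics port 8002 and the nv_ metric-name prefix used by TRTIS:

import requests

# Fetch the Prometheus text-format metrics from the TRTIS metrics endpoint.
metrics = requests.get("http://localhost:8002/metrics").text

# Keep only the NVIDIA/TRTIS metrics (assumed to be prefixed with "nv_"),
# e.g. GPU utilization and per-model inference request counters.
for line in metrics.splitlines():
    if line.startswith("nv_"):
        print(line)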
KUBERNETES ON NVIDIA GPU (KONG)
KUBERNETES ON NVIDIA GPUS
Scale up to thousands of GPUs
Self-healing cluster orchestration
GPU optimized out-of-the-box
Powered by NVIDIA Container Runtime
Upstream all diffs
Features
GPU resources in Kubernetes are managed through the NVIDIA Device Plugin
In heterogeneous GPU environments, GPUs can be managed effectively using GPU type and memory requirements (see the pod spec sketch below)
Various GPU metrics and health checks can be fed into a monitoring system using NVIDIA DCGM (https://developer.nvidia.com/data-center-gpu-manager-dcgm), Prometheus, and Grafana
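For illustration, a pod that requests GPUs through the nvidia.com/gpu resource exposed by the NVIDIA Device Plugin; the accelerator node label used here to pin the pod to T4 nodes is only an example of a labeling convention, not a fixed name:

apiVersion: v1
kind: Pod
metadata:
  name: trtis-t4
spec:
  nodeSelector:
    accelerator: nvidia-tesla-t4   # example node label for selecting T4 nodes
  containers:
  - name: trtis
    image: nvcr.io/nvidia/tensorrtserver:18.10-py3
    resources:
      limits:
        nvidia.com/gpu: 2          # GPUs exposed by the NVIDIA Device Plugin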
DCGM EXPORTER FOR PROMETHEUS
https://github.com/NVIDIA/gpu-monitoring-tools/tree/master/exporters/prometheus-dcgm
References
NVIDIA GPU Cloud (NGC) container registry: https://ngc.nvidia.com/
T4: https://www.nvidia.com/en-us/data-center/tesla-t4/
TRTIS Client SDK: https://github.com/NVIDIA/dl-inference-server
TRTIS User Guide: https://docs.nvidia.com/deeplearning/sdk/inference-user-guide/index.html
Kubernetes on NVIDIA GPUs: https://developer.nvidia.com/kubernetes-gpu
SEOUL | NOVEMBER 7-8, 2018
www.nvidia.com/ko-kr/ai-conference/