
Page 1:

© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Presenter: Animesh Jain, Amazon SageMaker Neo

End-to-End Deep Learning Compiler Stack

AWS AI

Page 2:

Deep Learning is Pervasive

Amazon Translate: fluent translation of text

Amazon Comprehend: discover insights and relationships in text

Amazon Transcribe: automatic speech recognition

Amazon Polly: natural-sounding text to speech

Amazon Rekognition: deep learning-based image and video analysis

Amazon Lex: build chatbots to engage customers

Page 3:

ARM – Unique Role in AWS Ecosystem

A1 EC2 Instance: optimized cost and performance for scale-out workloads

Alexa Devices: build natural voice experiences

3rd-party Edge Devices: portable deep learning

Page 4:

How to Accelerate Deep Learning?

Page 5:

Agenda

• Overview of Neo
• Relay Graph Optimizations
• TVM Tensor IR Optimizations
• Evaluation

Page 6:

Deep Learning Inference

Before: Model Marketplace → Inference Target
After: Model Marketplace → Deep Learning Compiler → Inference Target

Page 7:

TVM: end-to-end optimization stack

Models and hardware targets are far apart!

Page 8:

TVM Overview

Framework parsers: framework graphs (MXNet, TF, …) are parsed into a Relay graph.

Graph-level optimizations: target-independent Relay passes, then target-dependent Relay passes, produce a target-optimized graph.

Tensor-level optimizations: schedule templates written in TVM Tensor IR for each target (Intel x86, ARM CPU, NVIDIA GPU, ARM GPU, and more targets); AutoTVM tunes the kernels.

Machine code generation: codegen via LLVM, CUDA, C, … produces the optimized binary.
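As a concrete illustration of this flow, the sketch below compiles an MXNet model through Relay to a native shared library. It is a minimal sketch, assuming MXNet with the Gluon model zoo is installed and a TVM release from roughly the 2019 dmlc/tvm era; the target string and opt_level are illustrative choices, and later releases rename some of these entry points.

import tvm
from tvm import relay
from mxnet.gluon.model_zoo import vision

# Framework model (MXNet Gluon) -> Relay graph via the framework parser
block = vision.get_model('resnet18_v1', pretrained=True)
shape_dict = {'data': (1, 3, 224, 224)}
mod, params = relay.frontend.from_mxnet(block, shape_dict)

# Graph- and tensor-level optimizations, then machine code generation
target = 'llvm -mcpu=cortex-a72'          # example ARM CPU target (assumption)
with relay.build_config(opt_level=3):
    graph, lib, params = relay.build(mod, target=target, params=params)

# Optimized binary
lib.export_library('resnet18.so')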

Page 9:

Agenda

• Overview of TVM
• Relay Graph Optimizations
• TVM Tensor IR Optimizations
• Evaluation

Page 10:

Computation Graph Optimization

Example graph: data → conv2d → relu → conv2d → relu → flatten → dense → softmax, with weights w1, w2, w3 as additional inputs. Nodes are operations, edges are dataflow dependencies, and each node carries attributes, e.g. conv2d with channels=32, kernel_size=(3,3), padding=(1,1), use_bias=0, and output shape=(1,10) for the final layer.

The Relay graph represents high-level deep learning computations.

Target-independent passes:
• Constant propagation
• Dead code elimination
• Operation fusion

Target-dependent passes:
• Data layout transform
• Graph partitioning
• Legalization

Page 11:

Operation Fusion Example

conv2d → bn → relu is collapsed into a single fused-conv2d-bn-relu operation, avoiding intermediate memory traffic between the three ops.
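A minimal sketch of this kind of fusion at the Relay level, written against a recent mainline TVM release (module paths differed slightly in the 2019 dmlc/tvm codebase); the shapes and the exact pass list are illustrative assumptions.

import tvm
from tvm import relay

# Tiny conv2d -> batch_norm -> relu graph
data = relay.var('data', shape=(1, 16, 32, 32))
weight = relay.var('weight', shape=(32, 16, 3, 3))
gamma, beta = relay.var('gamma', shape=(32,)), relay.var('beta', shape=(32,))
mean, var = relay.var('mean', shape=(32,)), relay.var('var', shape=(32,))

conv = relay.nn.conv2d(data, weight, channels=32, kernel_size=(3, 3), padding=(1, 1))
bn = relay.nn.batch_norm(conv, gamma, beta, mean, var)[0]   # [0] = normalized output
out = relay.nn.relu(bn)
mod = tvm.IRModule.from_expr(relay.Function(relay.analysis.free_vars(out), out))

# Target-independent passes: rewrite batch_norm, fold constants, fuse operations
seq = tvm.transform.Sequential([
    relay.transform.InferType(),
    relay.transform.SimplifyInference(),
    relay.transform.FoldConstant(),
    relay.transform.FuseOps(fuse_opt_level=2),
])
with tvm.transform.PassContext(opt_level=3):
    print(seq(mod))   # conv2d and the elementwise ops end up in one fused function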

Page 12:

Target-Dependent Layout Transformation

Before AlterOpLayout: Data → CONV → BATCH_NORM → RELU → POOLING → CONV → FLATTEN, with activations in NCHW (layout undefined after FLATTEN), kernels in OIHW, and batch-norm mean/variance as 1-D C tensors.

After AlterOpLayout: LayoutTransform nodes convert activations from NCHW to the blocked NCHW16c layout, kernels from OIHW to OIHW16i16o, and mean/variance from C to C16c; both convolutions become CONV_NCHW16c, and a final LayoutTransform converts back to NCHW before FLATTEN.

LayoutTransform of parameters can be pre-computed at compile time.

Observation: the blocked NCHWc data layout leads to better memory accesses.

ATC’19 Optimizing CNN Model Inference on CPUs
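To make the NCHW16c layout concrete, here is an illustrative NumPy-only sketch (not TVM code) of how an NCHW activation tensor is packed into NCHW16c: channels are blocked into groups of 16 so the innermost dimension maps onto contiguous vector loads.

import numpy as np

def nchw_to_nchw16c(x):
    # (N, C, H, W) -> (N, C//16, H, W, 16); C is assumed to be a multiple of 16
    n, c, h, w = x.shape
    assert c % 16 == 0
    return x.reshape(n, c // 16, 16, h, w).transpose(0, 1, 3, 4, 2)

x = np.random.rand(1, 32, 28, 28).astype('float32')
print(nchw_to_nchw16c(x).shape)   # (1, 2, 28, 28, 16)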

Page 13:

Agenda

• Overview of TVM
• Relay Graph Optimizations
• TVM Tensor IR Optimizations
• Evaluation

Page 14:

TVM Tensor IR

• Compute definition
• TVM Schedule – developer-friendly loop transformations
• No hardware ISA knowledge is needed to perform loop optimizations

# Slide snippet completed to run on the TVM 0.x API (newer releases use tvm.te);
# M, K, N, bn and the placeholders A, B are illustrative assumptions.
import tvm

M = K = N = 1024
bn = 32
A = tvm.placeholder((M, K), name='A')
B = tvm.placeholder((K, N), name='B')
k = tvm.reduce_axis((0, K), name='k')
C = tvm.compute((M, N), lambda i, j: tvm.sum(A[i, k] * B[k, j], axis=k), name='C')

s = tvm.create_schedule(C.op)
xo, yo, xi, yi = s[C].tile(C.op.axis[0], C.op.axis[1], bn, bn)
ko, ki = s[C].split(k, factor=4)
s[C].reorder(ko, xi, ki, yi)
s[C].unroll(ki)
s[C].vectorize(yi)
s[C].parallel(xo)
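To see the effect of the schedule without touching any hardware ISA, the lowered loop nest can be printed and the kernel compiled; a short sketch continuing the snippet above, with the same TVM 0.x API caveat:

# Inspect the scheduled loop nest as TVM's low-level IR
print(tvm.lower(s, [A, B, C], simple_mode=True))

# Generate machine code through LLVM for the host CPU
func = tvm.build(s, [A, B, C], target='llvm')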

Page 15:

Large Search Space for Schedule

Optimization choices form a large search space: loop transformations, thread bindings, cache locality, thread cooperation, tensorization, and latency hiding.

Compute description:

C = tvm.compute((m, n), lambda y, x: tvm.sum(A[k, y] * B[k, x], axis=k))

Page 16:

AutoTVM – Learning-based Program Optimizer

A templated schedule program feeds the code generator; measurements of the generated candidates become training data for a statistical cost model, which in turn guides the search over the template's space. Learning works well here because of:

• Relatively low experiment cost
• Domain-specific problem structure
• Large quantity of similar tasks

NeurIPS’19 Learning to Optimize Tensor Programs
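A hedged sketch of how this tuning loop is driven through TVM's 0.x Python API (exact signatures vary across releases); mod, params, and target are assumed to come from a Relay frontend import as in the earlier sketch.

from tvm import autotvm, relay

# Extract tunable tasks (here: conv2d templates) from the Relay module
tasks = autotvm.task.extract_from_program(
    mod['main'], target=target, params=params,
    ops=(relay.op.get('nn.conv2d'),))

# Each candidate schedule is built locally and measured on the device
measure_option = autotvm.measure_option(
    builder=autotvm.LocalBuilder(),
    runner=autotvm.LocalRunner(number=10, repeat=3))

for task in tasks:
    tuner = autotvm.tuner.XGBTuner(task)     # learned statistical cost model
    tuner.tune(n_trial=1000,
               measure_option=measure_option,
               callbacks=[autotvm.callback.log_to_file('tuning.log')])

# Compiling inside autotvm.apply_history_best('tuning.log') then picks the
# best configurations that were found during tuning.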

Page 17:

Agenda

• Overview of TVM
• Relay Graph Optimizations
• TVM Tensor IR Optimizations
• Evaluation

Page 18:

Experiment Setup

• Server: EC2 A1 instance (16-core ARMv8)

• Edge device: Acer aiSage (Rockchip RK3399 SoC with ARM Mali-T860 GPU)

Page 19:

TVM – Evaluation on ARM A1 Server

[Bar chart: speedup of TVM execution normalized to TensorFlow-Eigen, for resnet18, resnet34, resnet50, resnet101, resnet152, vgg11_bn, vgg13_bn, vgg16_bn, vgg19_bn, inception_v3, densenet_121, densenet_161, densenet_169, densenet_201, and SSD_resnet50.]

Page 20:

TVM – Evaluation on Edge Device Acer aiSage

Baseline: ACL (Arm Compute Library); batch size = 1

ICPP’19 A Unified Optimization Approach for CNN Model Inference on Integrated GPUs

Page 21:

Effects of Tuning Convolution Operators in TVM (batch size = 1, Acer aiSage)

ICPP’19 A Unified Optimization Approach for CNN Model Inference on Integrated GPUs

Page 22:

Takeaways

• Deep learning compilation is essential for portability and performance across a variety of targets.

• Optimizations are important at all levels – graph- and tensor-level.

• Separating the compute definition from the hardware-dependent schedule enables developers to write kernels without extensive knowledge of the hardware ISA.

• Open-source collaboration is key to achieving the goal of running deep learning everywhere.

Page 23:

Linaro Community

• TVM benefits from LLVM's ARM codegen support
  • Better support for Int8 instructions
  • Better support for different ARM variants

• Better schedules
  • Data layout optimizations are hardware-dependent
  • TVM performance on ARM GPUs can be improved

• ACL support for TVM

Page 24:

Thank you! Check it out:

https://github.com/dmlc/tvm

https://github.com/neo-ai/