learning to optimize tensor programs04-15... · learning to optimize tensor programs high-level...

31
Learning to Optimize Tensor Programs Tianqi Chen, Lianmin Zheng, Eddie Yan, Ziheng Jiang, Thierry Moreau, Luis Ceze, Carlos Guestrin, Arvind Krishnamurthy

Upload: others

Post on 23-Jul-2020

4 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Learning to Optimize Tensor Programs04-15... · Learning to Optimize Tensor Programs High-level data flow graph and optimizations Learning to generate optimized program for new operator

Learning to Optimize Tensor ProgramsTianqi Chen, Lianmin Zheng, Eddie Yan, Ziheng Jiang, Thierry Moreau,

Luis Ceze, Carlos Guestrin, Arvind Krishnamurthy

Page 2: Learning to Optimize Tensor Programs04-15... · Learning to Optimize Tensor Programs High-level data flow graph and optimizations Learning to generate optimized program for new operator

Goal: Deploy Deep Learning Everywhere

Explosion of models and frameworks

Huge gap between model/frameworks and hardware backends

Frameworks

Explosion of hardware backends

Acclerator

Page 3: Learning to Optimize Tensor Programs04-15... · Learning to Optimize Tensor Programs High-level data flow graph and optimizations Learning to generate optimized program for new operator

Existing Approach

Hardware

Frameworks

Page 4: Learning to Optimize Tensor Programs04-15... · Learning to Optimize Tensor Programs High-level data flow graph and optimizations Learning to generate optimized program for new operator

Existing Approach

High-level data flow graph

Hardware

Frameworks

Page 5: Learning to Optimize Tensor Programs04-15... · Learning to Optimize Tensor Programs High-level data flow graph and optimizations Learning to generate optimized program for new operator

Existing Approach

High-level data flow graph

Hardware

Primitive Tensor operators such as Conv2D

Frameworks

Page 6: Learning to Optimize Tensor Programs04-15... · Learning to Optimize Tensor Programs High-level data flow graph and optimizations Learning to generate optimized program for new operator

Existing Approach

High-level data flow graph

Hardware

Primitive Tensor operators such as Conv2D

eg. cuDNN Offload to heavily optimized DNN operator library

Frameworks

Page 7: Learning to Optimize Tensor Programs04-15... · Learning to Optimize Tensor Programs High-level data flow graph and optimizations Learning to generate optimized program for new operator

Limitations of Existing Approach

cuDNN

Frameworks

Page 8: Learning to Optimize Tensor Programs04-15... · Learning to Optimize Tensor Programs High-level data flow graph and optimizations Learning to generate optimized program for new operator

Limitations of Existing Approach

cuDNN

Frameworks

Page 9: Learning to Optimize Tensor Programs04-15... · Learning to Optimize Tensor Programs High-level data flow graph and optimizations Learning to generate optimized program for new operator

Limitations of Existing Approach

cuDNN

Frameworks

Page 10: Learning to Optimize Tensor Programs04-15... · Learning to Optimize Tensor Programs High-level data flow graph and optimizations Learning to generate optimized program for new operator

Limitations of Existing Approach

cuDNN

Frameworks

New operators

Page 11: Learning to Optimize Tensor Programs04-15... · Learning to Optimize Tensor Programs High-level data flow graph and optimizations Learning to generate optimized program for new operator

Limitations of Existing Approach

cuDNN

Frameworks

New operators

Page 12: Learning to Optimize Tensor Programs04-15... · Learning to Optimize Tensor Programs High-level data flow graph and optimizations Learning to generate optimized program for new operator

Limitations of Existing Approach

cuDNN

Frameworks

New operators

Page 13: Learning to Optimize Tensor Programs04-15... · Learning to Optimize Tensor Programs High-level data flow graph and optimizations Learning to generate optimized program for new operator

Limitations of Existing Approach

cuDNN

Frameworks

New operators

Engineering intensive

Page 14: Learning to Optimize Tensor Programs04-15... · Learning to Optimize Tensor Programs High-level data flow graph and optimizations Learning to generate optimized program for new operator

Learning to Optimize Tensor Programs

High-level data flow graph and optimizations

Hardware

Frameworks

Page 15: Learning to Optimize Tensor Programs04-15... · Learning to Optimize Tensor Programs High-level data flow graph and optimizations Learning to generate optimized program for new operator

Learning to Optimize Tensor Programs

High-level data flow graph and optimizations

Hardware

Frameworks

Page 16: Learning to Optimize Tensor Programs04-15... · Learning to Optimize Tensor Programs High-level data flow graph and optimizations Learning to generate optimized program for new operator

Machine Learning based Program Optimizer

Learning to Optimize Tensor Programs

High-level data flow graph and optimizations

Hardware

Frameworks

Page 17: Learning to Optimize Tensor Programs04-15... · Learning to Optimize Tensor Programs High-level data flow graph and optimizations Learning to generate optimized program for new operator

Machine Learning based Program Optimizer

Learning to Optimize Tensor Programs

High-level data flow graph and optimizations

Learning to generate optimized program for new operator workloads and hardware

Hardware

Frameworks

Page 18: Learning to Optimize Tensor Programs04-15... · Learning to Optimize Tensor Programs High-level data flow graph and optimizations Learning to generate optimized program for new operator

Search over Possible Program Transformations

Hardware

Loop Transformations

Thread Bindings

Cache Locality

Thread Cooperation Tensorization Latency

Hiding

C = tvm.compute((m, n), lambda y, x: tvm.sum(A[k, y] * B[k, x], axis=k))

Compute Description

Page 19: Learning to Optimize Tensor Programs04-15... · Learning to Optimize Tensor Programs High-level data flow graph and optimizations Learning to generate optimized program for new operator

Search over Possible Program Transformations

Hardware

Loop Transformations

Thread Bindings

Cache Locality

Thread Cooperation Tensorization Latency

Hiding

C = tvm.compute((m, n), lambda y, x: tvm.sum(A[k, y] * B[k, x], axis=k))

Compute Description

Page 20: Learning to Optimize Tensor Programs04-15... · Learning to Optimize Tensor Programs High-level data flow graph and optimizations Learning to generate optimized program for new operator

Search over Possible Program Transformations

Hardware

Loop Transformations

Thread Bindings

Cache Locality

Thread Cooperation Tensorization Latency

Hiding

Billions of possible optimization choices

C = tvm.compute((m, n), lambda y, x: tvm.sum(A[k, y] * B[k, x], axis=k))

Compute Description

Page 21: Learning to Optimize Tensor Programs04-15... · Learning to Optimize Tensor Programs High-level data flow graph and optimizations Learning to generate optimized program for new operator

Learning-based Program Optimizer

Program Optimizer Program Code Generator

Page 22: Learning to Optimize Tensor Programs04-15... · Learning to Optimize Tensor Programs High-level data flow graph and optimizations Learning to generate optimized program for new operator

Learning-based Program Optimizer

Program Optimizer Program Code Generator

D<latexit sha1_base64="1Z6CzjBl0OMVztfQ+m452YDkcY0=">AAAB8nicbVDLSsNAFL2pr1pfVZdugkVwVRIRdFnUhcsK9gFtKJPppB06mQkzN0IJ/Qw3LhRx69e482+ctFlo64GBwzn3MueeMBHcoOd9O6W19Y3NrfJ2ZWd3b/+genjUNirVlLWoEkp3Q2KY4JK1kKNg3UQzEoeCdcLJbe53npg2XMlHnCYsiMlI8ohTglbq9WOCY0pEdjcbVGte3ZvDXSV+QWpQoDmofvWHiqYxk0gFMabnewkGGdHIqWCzSj81LCF0QkasZ6kkMTNBNo88c8+sMnQjpe2T6M7V3xsZiY2ZxqGdzCOaZS8X//N6KUbXQcZlkiKTdPFRlAoXlZvf7w65ZhTF1BJCNbdZXTommlC0LVVsCf7yyaukfVH3vbr/cFlr3BR1lOEETuEcfLiCBtxDE1pAQcEzvMKbg86L8+58LEZLTrFzDH/gfP4AdN2RWg==</latexit><latexit sha1_base64="1Z6CzjBl0OMVztfQ+m452YDkcY0=">AAAB8nicbVDLSsNAFL2pr1pfVZdugkVwVRIRdFnUhcsK9gFtKJPppB06mQkzN0IJ/Qw3LhRx69e482+ctFlo64GBwzn3MueeMBHcoOd9O6W19Y3NrfJ2ZWd3b/+genjUNirVlLWoEkp3Q2KY4JK1kKNg3UQzEoeCdcLJbe53npg2XMlHnCYsiMlI8ohTglbq9WOCY0pEdjcbVGte3ZvDXSV+QWpQoDmofvWHiqYxk0gFMabnewkGGdHIqWCzSj81LCF0QkasZ6kkMTNBNo88c8+sMnQjpe2T6M7V3xsZiY2ZxqGdzCOaZS8X//N6KUbXQcZlkiKTdPFRlAoXlZvf7w65ZhTF1BJCNbdZXTommlC0LVVsCf7yyaukfVH3vbr/cFlr3BR1lOEETuEcfLiCBtxDE1pAQcEzvMKbg86L8+58LEZLTrFzDH/gfP4AdN2RWg==</latexit><latexit sha1_base64="1Z6CzjBl0OMVztfQ+m452YDkcY0=">AAAB8nicbVDLSsNAFL2pr1pfVZdugkVwVRIRdFnUhcsK9gFtKJPppB06mQkzN0IJ/Qw3LhRx69e482+ctFlo64GBwzn3MueeMBHcoOd9O6W19Y3NrfJ2ZWd3b/+genjUNirVlLWoEkp3Q2KY4JK1kKNg3UQzEoeCdcLJbe53npg2XMlHnCYsiMlI8ohTglbq9WOCY0pEdjcbVGte3ZvDXSV+QWpQoDmofvWHiqYxk0gFMabnewkGGdHIqWCzSj81LCF0QkasZ6kkMTNBNo88c8+sMnQjpe2T6M7V3xsZiY2ZxqGdzCOaZS8X//N6KUbXQcZlkiKTdPFRlAoXlZvf7w65ZhTF1BJCNbdZXTommlC0LVVsCf7yyaukfVH3vbr/cFlr3BR1lOEETuEcfLiCBtxDE1pAQcEzvMKbg86L8+58LEZLTrFzDH/gfP4AdN2RWg==</latexit><latexit sha1_base64="1Z6CzjBl0OMVztfQ+m452YDkcY0=">AAAB8nicbVDLSsNAFL2pr1pfVZdugkVwVRIRdFnUhcsK9gFtKJPppB06mQkzN0IJ/Qw3LhRx69e482+ctFlo64GBwzn3MueeMBHcoOd9O6W19Y3NrfJ2ZWd3b/+genjUNirVlLWoEkp3Q2KY4JK1kKNg3UQzEoeCdcLJbe53npg2XMlHnCYsiMlI8ohTglbq9WOCY0pEdjcbVGte3ZvDXSV+QWpQoDmofvWHiqYxk0gFMabnewkGGdHIqWCzSj81LCF0QkasZ6kkMTNBNo88c8+sMnQjpe2T6M7V3xsZiY2ZxqGdzCOaZS8X//N6KUbXQcZlkiKTdPFRlAoXlZvf7w65ZhTF1BJCNbdZXTommlC0LVVsCf7yyaukfVH3vbr/cFlr3BR1lOEETuEcfLiCBtxDE1pAQcEzvMKbg86L8+58LEZLTrFzDH/gfP4AdN2RWg==</latexit>

Training data

Page 23: Learning to Optimize Tensor Programs04-15... · Learning to Optimize Tensor Programs High-level data flow graph and optimizations Learning to generate optimized program for new operator

Learning-based Program Optimizer

Program Optimizer Program Code Generator

D<latexit sha1_base64="1Z6CzjBl0OMVztfQ+m452YDkcY0=">AAAB8nicbVDLSsNAFL2pr1pfVZdugkVwVRIRdFnUhcsK9gFtKJPppB06mQkzN0IJ/Qw3LhRx69e482+ctFlo64GBwzn3MueeMBHcoOd9O6W19Y3NrfJ2ZWd3b/+genjUNirVlLWoEkp3Q2KY4JK1kKNg3UQzEoeCdcLJbe53npg2XMlHnCYsiMlI8ohTglbq9WOCY0pEdjcbVGte3ZvDXSV+QWpQoDmofvWHiqYxk0gFMabnewkGGdHIqWCzSj81LCF0QkasZ6kkMTNBNo88c8+sMnQjpe2T6M7V3xsZiY2ZxqGdzCOaZS8X//N6KUbXQcZlkiKTdPFRlAoXlZvf7w65ZhTF1BJCNbdZXTommlC0LVVsCf7yyaukfVH3vbr/cFlr3BR1lOEETuEcfLiCBtxDE1pAQcEzvMKbg86L8+58LEZLTrFzDH/gfP4AdN2RWg==</latexit><latexit sha1_base64="1Z6CzjBl0OMVztfQ+m452YDkcY0=">AAAB8nicbVDLSsNAFL2pr1pfVZdugkVwVRIRdFnUhcsK9gFtKJPppB06mQkzN0IJ/Qw3LhRx69e482+ctFlo64GBwzn3MueeMBHcoOd9O6W19Y3NrfJ2ZWd3b/+genjUNirVlLWoEkp3Q2KY4JK1kKNg3UQzEoeCdcLJbe53npg2XMlHnCYsiMlI8ohTglbq9WOCY0pEdjcbVGte3ZvDXSV+QWpQoDmofvWHiqYxk0gFMabnewkGGdHIqWCzSj81LCF0QkasZ6kkMTNBNo88c8+sMnQjpe2T6M7V3xsZiY2ZxqGdzCOaZS8X//N6KUbXQcZlkiKTdPFRlAoXlZvf7w65ZhTF1BJCNbdZXTommlC0LVVsCf7yyaukfVH3vbr/cFlr3BR1lOEETuEcfLiCBtxDE1pAQcEzvMKbg86L8+58LEZLTrFzDH/gfP4AdN2RWg==</latexit><latexit sha1_base64="1Z6CzjBl0OMVztfQ+m452YDkcY0=">AAAB8nicbVDLSsNAFL2pr1pfVZdugkVwVRIRdFnUhcsK9gFtKJPppB06mQkzN0IJ/Qw3LhRx69e482+ctFlo64GBwzn3MueeMBHcoOd9O6W19Y3NrfJ2ZWd3b/+genjUNirVlLWoEkp3Q2KY4JK1kKNg3UQzEoeCdcLJbe53npg2XMlHnCYsiMlI8ohTglbq9WOCY0pEdjcbVGte3ZvDXSV+QWpQoDmofvWHiqYxk0gFMabnewkGGdHIqWCzSj81LCF0QkasZ6kkMTNBNo88c8+sMnQjpe2T6M7V3xsZiY2ZxqGdzCOaZS8X//N6KUbXQcZlkiKTdPFRlAoXlZvf7w65ZhTF1BJCNbdZXTommlC0LVVsCf7yyaukfVH3vbr/cFlr3BR1lOEETuEcfLiCBtxDE1pAQcEzvMKbg86L8+58LEZLTrFzDH/gfP4AdN2RWg==</latexit><latexit sha1_base64="1Z6CzjBl0OMVztfQ+m452YDkcY0=">AAAB8nicbVDLSsNAFL2pr1pfVZdugkVwVRIRdFnUhcsK9gFtKJPppB06mQkzN0IJ/Qw3LhRx69e482+ctFlo64GBwzn3MueeMBHcoOd9O6W19Y3NrfJ2ZWd3b/+genjUNirVlLWoEkp3Q2KY4JK1kKNg3UQzEoeCdcLJbe53npg2XMlHnCYsiMlI8ohTglbq9WOCY0pEdjcbVGte3ZvDXSV+QWpQoDmofvWHiqYxk0gFMabnewkGGdHIqWCzSj81LCF0QkasZ6kkMTNBNo88c8+sMnQjpe2T6M7V3xsZiY2ZxqGdzCOaZS8X//N6KUbXQcZlkiKTdPFRlAoXlZvf7w65ZhTF1BJCNbdZXTommlC0LVVsCf7yyaukfVH3vbr/cFlr3BR1lOEETuEcfLiCBtxDE1pAQcEzvMKbg86L8+58LEZLTrFzDH/gfP4AdN2RWg==</latexit>

Training data

LearningStatistical Cost Model

Page 24: Learning to Optimize Tensor Programs04-15... · Learning to Optimize Tensor Programs High-level data flow graph and optimizations Learning to generate optimized program for new operator

Learning-based Program Optimizer

• Relatively low experiment cost • Domain-specific problem structure • Large quantity of similar tasks

Unique Problem Characteristics

Program Optimizer Program Code Generator

D<latexit sha1_base64="1Z6CzjBl0OMVztfQ+m452YDkcY0=">AAAB8nicbVDLSsNAFL2pr1pfVZdugkVwVRIRdFnUhcsK9gFtKJPppB06mQkzN0IJ/Qw3LhRx69e482+ctFlo64GBwzn3MueeMBHcoOd9O6W19Y3NrfJ2ZWd3b/+genjUNirVlLWoEkp3Q2KY4JK1kKNg3UQzEoeCdcLJbe53npg2XMlHnCYsiMlI8ohTglbq9WOCY0pEdjcbVGte3ZvDXSV+QWpQoDmofvWHiqYxk0gFMabnewkGGdHIqWCzSj81LCF0QkasZ6kkMTNBNo88c8+sMnQjpe2T6M7V3xsZiY2ZxqGdzCOaZS8X//N6KUbXQcZlkiKTdPFRlAoXlZvf7w65ZhTF1BJCNbdZXTommlC0LVVsCf7yyaukfVH3vbr/cFlr3BR1lOEETuEcfLiCBtxDE1pAQcEzvMKbg86L8+58LEZLTrFzDH/gfP4AdN2RWg==</latexit><latexit sha1_base64="1Z6CzjBl0OMVztfQ+m452YDkcY0=">AAAB8nicbVDLSsNAFL2pr1pfVZdugkVwVRIRdFnUhcsK9gFtKJPppB06mQkzN0IJ/Qw3LhRx69e482+ctFlo64GBwzn3MueeMBHcoOd9O6W19Y3NrfJ2ZWd3b/+genjUNirVlLWoEkp3Q2KY4JK1kKNg3UQzEoeCdcLJbe53npg2XMlHnCYsiMlI8ohTglbq9WOCY0pEdjcbVGte3ZvDXSV+QWpQoDmofvWHiqYxk0gFMabnewkGGdHIqWCzSj81LCF0QkasZ6kkMTNBNo88c8+sMnQjpe2T6M7V3xsZiY2ZxqGdzCOaZS8X//N6KUbXQcZlkiKTdPFRlAoXlZvf7w65ZhTF1BJCNbdZXTommlC0LVVsCf7yyaukfVH3vbr/cFlr3BR1lOEETuEcfLiCBtxDE1pAQcEzvMKbg86L8+58LEZLTrFzDH/gfP4AdN2RWg==</latexit><latexit sha1_base64="1Z6CzjBl0OMVztfQ+m452YDkcY0=">AAAB8nicbVDLSsNAFL2pr1pfVZdugkVwVRIRdFnUhcsK9gFtKJPppB06mQkzN0IJ/Qw3LhRx69e482+ctFlo64GBwzn3MueeMBHcoOd9O6W19Y3NrfJ2ZWd3b/+genjUNirVlLWoEkp3Q2KY4JK1kKNg3UQzEoeCdcLJbe53npg2XMlHnCYsiMlI8ohTglbq9WOCY0pEdjcbVGte3ZvDXSV+QWpQoDmofvWHiqYxk0gFMabnewkGGdHIqWCzSj81LCF0QkasZ6kkMTNBNo88c8+sMnQjpe2T6M7V3xsZiY2ZxqGdzCOaZS8X//N6KUbXQcZlkiKTdPFRlAoXlZvf7w65ZhTF1BJCNbdZXTommlC0LVVsCf7yyaukfVH3vbr/cFlr3BR1lOEETuEcfLiCBtxDE1pAQcEzvMKbg86L8+58LEZLTrFzDH/gfP4AdN2RWg==</latexit><latexit sha1_base64="1Z6CzjBl0OMVztfQ+m452YDkcY0=">AAAB8nicbVDLSsNAFL2pr1pfVZdugkVwVRIRdFnUhcsK9gFtKJPppB06mQkzN0IJ/Qw3LhRx69e482+ctFlo64GBwzn3MueeMBHcoOd9O6W19Y3NrfJ2ZWd3b/+genjUNirVlLWoEkp3Q2KY4JK1kKNg3UQzEoeCdcLJbe53npg2XMlHnCYsiMlI8ohTglbq9WOCY0pEdjcbVGte3ZvDXSV+QWpQoDmofvWHiqYxk0gFMabnewkGGdHIqWCzSj81LCF0QkasZ6kkMTNBNo88c8+sMnQjpe2T6M7V3xsZiY2ZxqGdzCOaZS8X//N6KUbXQcZlkiKTdPFRlAoXlZvf7w65ZhTF1BJCNbdZXTommlC0LVVsCf7yyaukfVH3vbr/cFlr3BR1lOEETuEcfLiCBtxDE1pAQcEzvMKbg86L8+58LEZLTrFzDH/gfP4AdN2RWg==</latexit>

Training data

LearningStatistical Cost Model

Page 25: Learning to Optimize Tensor Programs04-15... · Learning to Optimize Tensor Programs High-level data flow graph and optimizations Learning to generate optimized program for new operator

Program-aware Cost Modeling

High-Level Configuration

Page 26: Learning to Optimize Tensor Programs04-15... · Learning to Optimize Tensor Programs High-level data flow graph and optimizations Learning to generate optimized program for new operator

Program-aware Cost Modeling

High-Level Configuration

for y in range(8): for x in range(8): C[y][x]=0 for k in range(8): C[y][x]+=A[k][y]*B[k][x]

Low-level Abstract Syntax Tree (shared between tasks)

Page 27: Learning to Optimize Tensor Programs04-15... · Learning to Optimize Tensor Programs High-level data flow graph and optimizations Learning to generate optimized program for new operator

Program-aware Cost Modeling

High-Level Configuration

for y in range(8): for x in range(8): C[y][x]=0 for k in range(8): C[y][x]+=A[k][y]*B[k][x]

Low-level Abstract Syntax Tree (shared between tasks)

C A By 64 64 64x 8 8 64k 1 8 8

y 1x 8k 64

touched memory

outer looplength

statistical features

Boosted Tree Ensembles

Page 28: Learning to Optimize Tensor Programs04-15... · Learning to Optimize Tensor Programs High-level data flow graph and optimizations Learning to generate optimized program for new operator

Program-aware Cost Modeling

High-Level Configuration

for y in range(8): for x in range(8): C[y][x]=0 for k in range(8): C[y][x]+=A[k][y]*B[k][x]

Low-level Abstract Syntax Tree (shared between tasks)

for

context vec of x

for

for

context vec of y

context vec of k

+

soft scatter

finalembedding

TreeGRU

C A By 64 64 64x 8 8 64k 1 8 8

y 1x 8k 64

touched memory

outer looplength

statistical features

Boosted Tree Ensembles

Page 29: Learning to Optimize Tensor Programs04-15... · Learning to Optimize Tensor Programs High-level data flow graph and optimizations Learning to generate optimized program for new operator

Transfer Learning Among Different WorkloadsHistorical Optimization Tasks

Domain Invariant Program Representations

Transferable Models to speedup new tasks

Page 30: Learning to Optimize Tensor Programs04-15... · Learning to Optimize Tensor Programs High-level data flow graph and optimizations Learning to generate optimized program for new operator

State of Art Performance

Poster #104

Nvidia GPU ARM CPU ARM GPU

Page 31: Learning to Optimize Tensor Programs04-15... · Learning to Optimize Tensor Programs High-level data flow graph and optimizations Learning to generate optimized program for new operator

State of Art Performance

Poster #104

Nvidia GPU ARM CPU ARM GPU

 In production use inside several major companies