Perception Systems for Autonomous Vehicles using Energy-Efficient Deep Neural Networks
Forrest Iandola, Ben Landen, Kyle Bertin, Kurt Keutzer and the DeepScale Team
IMPLEMENTING AUTONOMOUS DRIVING
THE FLOW
SENSORS (LIDAR, ULTRASONIC, CAMERA, RADAR) + OFFLINE MAPS → REAL-TIME PERCEPTION → PATH PLANNING & ACTUATION
What does a car need to see?
Note: above visuals are an artist’s rendering created to help convey concepts. They should not be judged for accuracy.
What does a car need to see?
Object Detection
Note: above visuals are an artist’s rendering created to help convey concepts. They should not be judged for accuracy.
[Figure: a street scene with detected objects boxed and labeled: five vehicles, two cyclists, and two pedestrians, each with a 98-100% confidence score.]
What does a car need to see?
Distance
Note: above visuals are an artist’s rendering created to help convey concepts. They should not be judged for accuracy.
[Figure: the same detections, each annotated with an estimated distance: vehicles at 10-20m, cyclists at 14-16m, pedestrians at 7m.]
What does a car need to see?
Object Tracking
Note: above visuals are an artist’s rendering created to help convey concepts. They should not be judged for accuracy.
[Figure: the same detections, each now carrying a persistent track ID (1-9) and the number of frames it has been tracked (60-140).]
What does a car need to see?
Free Space & Driveable Area
Note: above visuals are an artist’s rendering created to help convey concepts. They should not be judged for accuracy.
[Figure: the same tracked detections, with the free space / driveable area highlighted on the road surface.]
What does a car need to see?
Lane Recognition
Note: above visuals are an artist’s rendering created to help convey concepts. They should not be judged for accuracy.
[Figure: the same tracked detections, with recognized lane markings highlighted.]
Audi https://www.slashgear.com/man-vs-machine-my-rematch-against-audis-new-self-driving-rs-7-21415540/
BMW + Intel https://newsroom.intel.com/news-releases/bmw-group-intel-mobileye-will-autonomous-test-vehicles-roads-second-half-2017/
Ford http://cwc.ucsd.edu/content/connected-cars-long-road-autonomous-vehicles
Today's autonomous cars require a lot of computing hardware!
…and perception is the most computationally-intensive part of the software stack
Big computers = expensive cars
As a workaround, companies want people to share autonomous vehicles to amortize hardware costs.
Shared autonomous vehicles will likely have some of the same downsides as public transportation.
Will better computer chips make autonomous cars affordable?
Deep Learning Processors have arrived!
THE SERVER SIDE

Platform        | Computation (GFLOP/s)  | Memory Bandwidth (GB/s) | Computation-to-bandwidth ratio | Power (TDP, Watts) | Year
NVIDIA K20 [1]  | 3,500 (32-bit float)   | 208 (GDDR5)             | 17                             | 225                | 2012
NVIDIA V100 [2] | 112,000 (16-bit float) | 900 (HBM2)              | 124 (yikes!)                   | 250                | 2018

[1] https://www.nvidia.com/content/PDF/kepler/Tesla-K20-Passive-BD-06455-001-v05.pdf
[2] http://www.nvidia.com/content/PDF/Volta-Datasheet.pdf (PCIe version)
Uh-oh… Processors are improving much faster than Memory.
Deep Learning Processors have arrived!
MOBILE PLATFORMS

Device                     | Cores                  | Computation (GFLOP/s) | Memory Bandwidth (GB/s) | Computation-to-bandwidth ratio | System Power (TDP, Watts) | Year
Samsung Galaxy Note 3      | Arm Mali T-628 GPU [1] | 120 (32-bit float)    | 12.8 (LPDDR3)           | 9.3                            | ~10                       | 2013
Huawei P20                 | Kirin 970 NPU [2]      | 1,920 (16-bit float)  | 30 (LPDDR4X)            | 64 (ouch!)                     | ~10                       | 2018
NVIDIA Jetson Xavier [3,4] | NVIDIA Tensor Cores    | 30,000 (8-bit int)    | 137                     | 218 (yikes!)                   | 10 to 30 (multiple modes) | 2018

[1] https://indico.cern.ch/event/319744/contributions/1698147/attachments/616065/847693/gdb_110215_cesini.pdf
[2] https://www.androidauthority.com/huawei-announces-kirin-970-797788
[3] https://blogs.nvidia.com/blog/2018/01/07/drive-xavier-processor/
[4] https://developer.nvidia.com/jetson-xavier
What will the next-generation Deep Learning servers look like?
https://medium.com/@shan.tang.g/a-list-of-chip-ip-for-deep-learning-48d05f1759ae
20 TOP/W COMPUTATION
Platform            | Efficiency (TOP/s/W) | Computation (TOP/s) | Memory Bandwidth (TB/s) | Computation-to-bandwidth ratio | Power (TDP, Watts) | Year
NVIDIA K20 [1]      | 0.015                | 3.50 (32-bit float) | 0.208 (GDDR5)           | 17                             | 225                | 2012
NVIDIA V100 [2]     | 0.45                 | 112 (16-bit float)  | 0.900 (HBM2)            | 124                            | 250                | 2018
Next-gen: 20 TOP/W  | 20                   | 2,500*              | 1.800 (HBM3) [3]        | 1,389 (oh no!)                 | 250                | 2020 (est.)

[1] https://www.nvidia.com/content/PDF/kepler/Tesla-K20-Passive-BD-06455-001-v05.pdf
[2] http://www.nvidia.com/content/PDF/Volta-Datasheet.pdf (PCIe version)
[3] https://www.eteknix.com/gddr6-hbm3-details-emerge/
* Assuming half the power is spent on computation and the other half on memory and other devices: 20 TOP/s/W × 250 W × 0.5 = 2,500 TOP/s.
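The computation-to-bandwidth ratios in these tables, and the starred 2,500 TOP/s estimate, are simple arithmetic. A small Python sketch that reproduces them (the platform numbers are the ones from the tables above):

```python
# Computation-to-bandwidth ratio: how many operations a processor can
# perform for every byte it can fetch from memory.
platforms = {
    # name: (compute in TOP/s, memory bandwidth in TB/s)
    "NVIDIA K20":  (3.50, 0.208),
    "NVIDIA V100": (112,  0.900),
}

for name, (tops, tbps) in platforms.items():
    ratio = tops / tbps  # ops per byte of memory traffic
    print(f"{name}: {ratio:.0f} ops per byte")  # K20 -> ~17, V100 -> ~124

# Next-gen estimate: a 20 TOP/s/W part with a 250 W TDP, assuming half
# the power budget goes to computation and half to memory/other devices.
efficiency_tops_per_watt = 20
tdp_watts = 250
compute_fraction = 0.5
est_tops = efficiency_tops_per_watt * tdp_watts * compute_fraction
print(f"Estimated next-gen compute: {est_tops:.0f} TOP/s")       # -> 2500
print(f"Next-gen ratio vs HBM3 (1.8 TB/s): {est_tops / 1.8:.0f}")  # -> ~1389
```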
Small Neural Nets to the rescue
squeeze (verb): to make an AI system use fewer resources by whatever means necessary

Resources to squeeze:
• Memory footprint and bandwidth
• Computational operations
• Power and energy
• Time

How to squeeze:
• New DNN models
• Application-specific quantization and pruning
• Superior implementations
• Differentiated data and training strategies
Most CV Applications Rely on Only a Few Core CV Capabilities
• Image Classification
• Object Detection
• Semantic Segmentation
And the best accuracy for each of these capabilities is achieved by convolutional neural nets.
But We Need a Very Different Kind of DNN
• DGX-1: 170 TFLOPS, 3.2 kW, 128 GB memory
• TitanX: 11 TFLOPS, 223 W, 12 GB memory
• VGG16 [1] model: 552 MB of parameters; 93 MB of activation memory per image; 15.8 GFLOPs of computation per image
• Smartphones: 100s of GFLOPs, ~3 W, 2-4 GB of memory
• IoT devices: 100s of MHz, <1 W, <1 GB of memory
[1] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556, 2014.
Speed is More Related to Memory Accesses than to Operations
Samsung Exynos M1 (Galaxy S7) access times:

               | L1 D-Cache (per core) | L2 Cache (shared) | Off-chip DRAM
Size           | 32 KB                 | 2 MB              | 4 GB
Read Latency   | 4 cycles              | 22 cycles         | ~200 cycles
Read Bandwidth | 20.8 GB/s             | 166.4 GB/s        | 28.7 GB/s
Energy is More Related to Memory Accesses than to Operations (45nm, 0.9V)
[Figure: bar chart of energy per operation (pJ) for 8b INT mult, 16b FP mult, 32b FP mult, 64b cache read (32KB), 64b cache read (1MB), and DRAM access. Annotated ratios of 5.5x, 18.5x, 100x, 500x, and 10,000x compare these costs; a DRAM access costs roughly 10,000x the energy of an 8b INT multiply.]
Mark Horowitz, “Computing’s Energy Problem (and what we can do about it),” ISSCC 2014
10,000 DNN Architectural Configurations Later: SqueezeNet (2016)

CNN            | Top-5 ImageNet Accuracy | Model Parameters | Model Size
AlexNet [1]    | 80.3%                   | 60M              | 243 MB
SqueezeNet [2] | 80.3%                   | 1.2M             | 4.8 MB (compresses to 500 KB)

[1] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet classification with deep convolutional neural networks." NIPS 2012.
[2] Iandola, Forrest N., et al. "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size." arXiv:1602.07360 (February 2016).
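Most of SqueezeNet's parameter savings come from its Fire module: a 1x1 "squeeze" convolution that reduces the channel count, feeding an "expand" stage that mixes 1x1 and 3x3 filters. Below is a minimal PyTorch sketch; the channel sizes match an early Fire module from the paper, but this is illustrative, not the reference implementation.

```python
import torch
import torch.nn as nn

class Fire(nn.Module):
    """SqueezeNet-style Fire module: squeeze with 1x1 convs, then
    expand with a mix of 1x1 and 3x3 convs (outputs concatenated)."""
    def __init__(self, in_ch, squeeze_ch, expand1x1_ch, expand3x3_ch):
        super().__init__()
        self.squeeze = nn.Conv2d(in_ch, squeeze_ch, kernel_size=1)
        self.expand1x1 = nn.Conv2d(squeeze_ch, expand1x1_ch, kernel_size=1)
        self.expand3x3 = nn.Conv2d(squeeze_ch, expand3x3_ch, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.squeeze(x))  # fewer channels -> cheaper 3x3 convs
        return torch.cat([self.relu(self.expand1x1(x)),
                          self.relu(self.expand3x3(x))], dim=1)

# Example: 96 input channels squeezed to 16 before a (64 + 64)-channel expand.
fire = Fire(96, squeeze_ch=16, expand1x1_ch=64, expand3x3_ch=64)
out = fire(torch.randn(1, 96, 55, 55))
print(out.shape)  # torch.Size([1, 128, 55, 55])
```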
SqueezeNet: Immediate Success in Embedded Vision
• Enabled embedded processor vendors (ARM, NXP, Qualcomm) to demo CNNs
• Quickly ported to all the major deep learning frameworks
Examples: NXP – Embedded Vision Summit; Qualcomm – Facebook F8; Apple CoreML
SqueezeDet for Object Detection (2017)
[Pipeline: input image → convolutional layers → feature map → ConvDet → bounding boxes → filtering → final detections]
Best Paper Award: Bichen Wu, Forrest Iandola, Peter H. Jin, and Kurt Keutzer. "SqueezeDet: Unified, small, low power fully convolutional neural networks for real-time object detection for autonomous driving." In Proceedings, CVPR Embedded Computer Vision Workshop, July 2017.
• ~2M model parameters
• 57 FPS
• 1.4 Joules/frame
SqueezeSeg: Semantic Segmentation for LIDAR (2018)
LIDAR point cloud segmentation. SqueezeSegV2:
• Higher accuracy: v1 [1] 64.6% → v2 [2] 73.2% (+8.6%)
• Better Sim2Real performance: v1 [1] 30% → v2 [2] 57.4% (+27.4%)
• Outperforms v1 trained on real data without intensity
[1] Wu, Bichen, et al. "Squeezeseg: Convolutional neural nets with recurrent crf for real-time road-object segmentation from 3d lidar point cloud." ICRA18 [2] Wu, Bichen, et al. "SqueezeSegV2: Improved Model Structure and Unsupervised Domain Adaptation for Road-Object Segmentation from a LiDAR Point Cloud." arXiv:1809.08495 (2018).
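SqueezeSeg makes the LIDAR point cloud digestible by a 2D CNN by projecting it onto a spherical grid, producing a dense range image. The NumPy sketch below illustrates that projection under assumed grid dimensions and a Velodyne-like vertical field of view; the exact resolution and channel layout in the papers may differ.

```python
import numpy as np

def spherical_projection(points, h=64, w=512, fov_up=3.0, fov_down=-25.0):
    """Project an (N, 4) LIDAR point cloud [x, y, z, intensity] onto an
    (h, w, 5) grid of [x, y, z, intensity, range], SqueezeSeg-style.
    Grid size and vertical FOV here are illustrative."""
    x, y, z, intensity = points.T
    r = np.sqrt(x**2 + y**2 + z**2)              # range of each point
    yaw = np.arctan2(y, x)                       # azimuth angle
    pitch = np.arcsin(z / np.maximum(r, 1e-8))   # zenith angle

    fov_up_rad, fov_down_rad = np.radians(fov_up), np.radians(fov_down)
    u = (1.0 - (pitch - fov_down_rad) / (fov_up_rad - fov_down_rad)) * h
    v = 0.5 * (yaw / np.pi + 1.0) * w            # map [-pi, pi] -> [0, w)
    u = np.clip(u.astype(int), 0, h - 1)
    v = np.clip(v.astype(int), 0, w - 1)

    grid = np.zeros((h, w, 5), dtype=np.float32)
    grid[u, v] = np.stack([x, y, z, intensity, r], axis=1)
    return grid

cloud = np.random.randn(10000, 4).astype(np.float32)  # stand-in point cloud
print(spherical_projection(cloud).shape)  # (64, 512, 5)
```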
Squeeze Family
• Image Classification: SqueezeNet, SqueezeNext, ShiftNet
• Object Detection: SqueezeDet
• Semantic Segmentation: SqueezeSeg-{v1, v2}
• Plus: DiracDeltaNet, DNASNet
Andrew Howard's MobileNets: Efficient On-Device Computer Vision Models
"MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications" / "MobileNetV2: Inverted Residuals and Linear Bottlenecks"
• Designed for efficiency on mobile phones
• A family of Pareto-optimal models to target the needs of the user
• V1 is based on depthwise separable convolutions (see the sketch below)
• V2 introduces inverted residuals and linear bottlenecks
• Supports classification, detection, segmentation, and more
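A depthwise separable convolution factors a dense KxK convolution into a per-channel KxK "depthwise" convolution plus a 1x1 "pointwise" convolution, cutting parameters and FLOPs by roughly K² for typical channel counts. A minimal PyTorch sketch with illustrative channel counts:

```python
import torch.nn as nn

def depthwise_separable(in_ch, out_ch, kernel_size=3):
    """MobileNet-style block: per-channel spatial conv, then 1x1 channel mix."""
    return nn.Sequential(
        # Depthwise: groups=in_ch means each filter sees only one channel.
        nn.Conv2d(in_ch, in_ch, kernel_size, padding=kernel_size // 2, groups=in_ch),
        nn.ReLU(inplace=True),
        # Pointwise: a 1x1 conv mixes information across channels.
        nn.Conv2d(in_ch, out_ch, kernel_size=1),
        nn.ReLU(inplace=True),
    )

def n_params(m):
    return sum(p.numel() for p in m.parameters())

dense = nn.Conv2d(128, 128, 3, padding=1)
separable = depthwise_separable(128, 128)
print(n_params(dense), n_params(separable))  # ~147k vs ~18k parameters
```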
Model Compression
[Figure: model compression results, annotated with ≥50x and 10x reduction factors]
Slide Credit: Prof. Warren Gross (McGill Univ.)
DNN Architecture Search
Anatomy of a convolution layer
[Figure: a 13×13×384 input activation is convolved (⨷) with 384 filters of size 3×3×384, producing a 13×13×384 output activation.]
Filters: Kernel Reduction
[Figure: replacing the 3×3×384 kernels with 1×1×384 kernels]
→ 9x reduction in model parameters
Filters/Channel Reduction
[Figure: shrinking the layer from 384 filters of 3×3×384 to 128 filters of 3×3×128]
→ 9x reduction in model parameters (see the parameter-count sketch below)
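Both 9x figures fall out of the parameter-count formula for a convolution layer, params = num_filters × k × k × in_channels. A quick check in Python:

```python
def conv_params(num_filters, k, in_channels):
    """Weight count of a conv layer (bias terms omitted for simplicity)."""
    return num_filters * k * k * in_channels

baseline = conv_params(384, 3, 384)          # 3x3 conv, 384 -> 384 channels
kernel_reduced = conv_params(384, 1, 384)    # swap 3x3 kernels for 1x1
channel_reduced = conv_params(128, 3, 128)   # shrink both sides 384 -> 128

print(baseline // kernel_reduced)   # 9  (3x3 -> 1x1 is a 9x reduction)
print(baseline // channel_reduced)  # 9  (3x on each side: 3 * 3 = 9x)
```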
Model Distillation/Compression
Model distillation: a large, accurate "teacher" network supervises the training of a small "student" network (see the loss sketch below).
Li, et al. "Mimicking Very Efficient Network for Object Detection." CVPR, 2017.
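For a concrete flavor, here is a sketch of the common Hinton-style formulation, where the student trains on a blend of hard labels and the teacher's temperature-softened outputs. Note this is a generic recipe, not the feature-mimicking method of Li et al. cited above, and T and alpha are illustrative hyperparameters.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Blend of a soft-target loss (teacher's softened distribution) and
    ordinary cross-entropy on the hard labels."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)  # rescale so gradients match the hard-label term
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Example: 8 samples, 10 classes.
s, t = torch.randn(8, 10), torch.randn(8, 10)
y = torch.randint(0, 10, (8,))
print(distillation_loss(s, t, y))
```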
Examples of what's on a DNN Architect's Palette
• Spatial convolution (e.g. 3x3)
• Shift
• Channel shuffle
• Depthwise convolution
• Pointwise convolution (1x1)
The Art of Small Model Design: Small Neural Nets Are Beautiful (ESWeek 2017)
The palette of an adept mobile/embedded DNN designer has grown very rich!
• Overall architecture: economize on layers while retaining accuracy
• Layer types:
  • Kernel reduction: 5x5 → 3x3 → 1x1
  • Channel reduction: e.g. Fire layer
  • Experiment with novel layer types that consume no FLOPs: shuffle, shift (see the sketch below)
• Model distillation: let big models teach smaller ones
• Apply pruning
• Tailor bit precision (aka quantization) to the target processor
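Shuffle and shift are attractive precisely because they consume no FLOPs. Channel shuffle (popularized by ShuffleNet) is a pure reshape-transpose-reshape; a minimal PyTorch sketch:

```python
import torch

def channel_shuffle(x, groups):
    """ShuffleNet-style channel shuffle: interleave channels across groups
    so that stacked grouped convolutions can exchange information.
    Costs zero FLOPs -- it is pure data movement."""
    n, c, h, w = x.shape
    assert c % groups == 0
    # (n, groups, c/groups, h, w) -> swap the two channel axes -> flatten
    return (x.view(n, groups, c // groups, h, w)
             .transpose(1, 2)
             .reshape(n, c, h, w))

x = torch.arange(8).float().view(1, 8, 1, 1)
print(channel_shuffle(x, groups=2).flatten().tolist())
# [0, 4, 1, 5, 2, 6, 3, 7] -- channels from the two groups interleaved
```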
Iandola, Forrest, and Kurt Keutzer. "Small neural nets are beautiful: enabling embedded systems with small deep-neural-network architectures." In Proceedings of the Twelfth International Conference on Hardware/Software Codesign and System Synthesis Companion, p. 1. ACM, 2017. (ESWeek 2017). Also, (arXiv:1710.02759)
Artistic/Engineering Process of Designing a Deep Neural Net
Manual design:
• Each iteration to evaluate a point in the design space is very expensive
• Exploration is limited by human imagination
Can we automate this?
DNAS: Differentiable Neural Architecture Search
• Extremely fast: 8 GPUs, 24 hours
• Can search under different conditions, case by case
• Optimizes for actual latency (see the sketch below)
Bichen Wu, Kurt Keutzer, Peizhao Zhang, Yanghan Wang, Fei Sun, Yiming Wu, Yuandong Tian, Peter Vajda, Yangqing Jia
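The differentiable trick behind DNAS is to replace the discrete choice among candidate layer types with a Gumbel-softmax-weighted sum governed by learnable architecture parameters, so the architecture itself can be trained by gradient descent. The PyTorch sketch below shows one searchable layer built on that idea; the candidate operations are illustrative, not the actual DNASNet search space.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SearchableLayer(nn.Module):
    """One layer of a DNAS-style supernet: a Gumbel-softmax-weighted
    mixture of candidate operations with learnable architecture logits."""
    def __init__(self, ch):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Conv2d(ch, ch, 3, padding=1),   # candidate: 3x3 conv
            nn.Conv2d(ch, ch, 1),              # candidate: 1x1 conv
            nn.Identity(),                     # candidate: skip (zero FLOPs)
        ])
        self.arch_logits = nn.Parameter(torch.zeros(len(self.ops)))

    def forward(self, x, tau=1.0):
        # Differentiable, near-one-hot weights over the candidates.
        w = F.gumbel_softmax(self.arch_logits, tau=tau)
        return sum(wi * op(x) for wi, op in zip(w, self.ops))

layer = SearchableLayer(ch=16)
y = layer(torch.randn(2, 16, 32, 32))
y.sum().backward()  # gradients flow into both weights and arch_logits
print(layer.arch_logits.grad)
```

In the full method, an expected-latency term (per-op latencies from a lookup table, weighted by the same probabilities) is added to the loss, which is how the search optimizes for actual latency on a target device.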
DNAS in context (FLOPs to normalize the comparison)
(In the original chart: X-axis = FLOPs, where more FLOPs is bad; Y-axis = ImageNet top-1 accuracy, where higher is good; mark size = search cost; circles = search cost unknown.)

Model           | ImageNet Top-1 Accuracy | FLOPs | Search Cost
NAS [1]         | 74.0%                   | 564M  | 48,000 GPU-hrs
PNAS [2]        | 74.2%                   | 588M  | 6,000 GPU-hrs*
DARTS [3]       | 73.1%                   | 595M  | 288 GPU-hrs
MobileNetV2 [4] | 71.8%                   | 300M  | (not reported)
AMC [5]         | 70.8%                   | 150M  | (not reported)
MnasNet [6]     | 74.0%                   | 317M  | 91,000 GPU-hrs*
DNASNet (ours)  | 74.2%                   | 295M  | 216 GPU-hrs

* Estimated from the paper description

[1] Zoph, Barret, et al. "Learning transferable architectures for scalable image recognition." arXiv:1707.07012 (2017).
[2] Liu, Chenxi, et al. "Progressive neural architecture search." arXiv:1712.00559 (2017).
[3] Liu, Hanxiao, Karen Simonyan, and Yiming Yang. "DARTS: Differentiable architecture search." arXiv:1806.09055 (2018).
[4] Sandler, Mark, et al. "MobileNetV2: Inverted Residuals and Linear Bottlenecks." CVPR 2018.
[5] He, Yihui, et al. "AMC: AutoML for model compression and acceleration on mobile devices." ECCV 2018.
[6] Tan, Mingxing, et al. "MnasNet: Platform-aware neural architecture search for mobile." arXiv:1807.11626 (2018).
DNAS for device-aware search

Net          | Latency on iPhone X  | Latency on Samsung S8 | Top-1 Acc
DNAS-iPhoneX | 19.84 ms             | 23.33 ms (20% slower) | 73.20%
DNAS-S8      | 27.53 ms (25% slower)| 22.12 ms              | 73.27%

• For different target devices, both DNASNets achieve similar accuracy.
• However, per-target DNN optimization was required.
The Future: Breaking Down the Wall between DNN Design & Hardware Design
DNN designers are unaware of:
• Arithmetic intensity
• Floating-point vs. fixed-point costs
• Memory hierarchy and latency
NN HW accelerator architects are:
• Using outdated models: AlexNet, VGG16
• Using irrelevant datasets: MNIST, CIFAR
Key Takeaways
• Autonomous vehicles currently need thousands (or even hundreds of thousands) of dollars of computing hardware
• Processing is on a trajectory of rapid improvement (in operations-per-Watt)
  • but other aspects of the system (e.g. memory) are improving much more slowly
  • today's neural networks will be choked by slow memory on tomorrow's DNN accelerators (this is already happening and will get worse)
• Designing new (smaller) neural networks helps with all of the following:
  • making full use of next-generation computing platforms
  • reducing the hardware costs of autonomous vehicles
  • enabling lower-cost, larger-scale rollouts of autonomous vehicles