Presentation By: Brendan Adkins and Tarun Khubchandani | EECS 507 Winter 2020 | 1
NeuFlow: A Runtime Reconfigurable Dataflow Processor for Vision
Paper by: Clement Farabet, Berin Martini, Benoit Corda, Polina Akselrod, Eugenio Culurciello and Yann LeCun
Presentation by: Brendan Adkins and Tarun Khubchandani
Farabet and LeCun’s original talk: https://www.youtube.com/watch?v=KaJtT1K3GtI
Introduction and Background
Computer Vision
● Extract high-level information from images
○ Form a relationship between high-dimensional data and a low-dimensional space
● Object Recognition
○ Dense feature extraction from regularly spaced samples
● GPUs increasing in prominence in CV
○ Inexpensive, easily available, easily programmable
○ Poor performance/power consumption compared to custom HW
Contribution
● Provide real-time detection, categorization and localization in pipelined megapixel images
○ ~10 W power consumption, about 10x less than a laptop computer
○ 100x speedup in applications
● Similar work being carried out at NEC Labs, Stanford and KAIST
Architecture
Dataflow Grid
● 2D Grid of Processing Tiles (PT)
○ Bank of Processing Operators
○ Routing MUX connecting local data lines to global/neighbor tiles
● Smart DMA (SDMA)
○ Asynchronous data transfers with priority to off-chip memory
● Global/Local Data Lines
○ Global lines connect PT to SDMA
○ Local data lines connect neighbors
● Runtime Configuration Bus
○ Reconfigure grid to specialize at runtime
Runtime Reconfiguration
● FPGA:
○ Versatile, simple processing elements (~10^4/package)
○ ~ms reconfiguration time but ~hr synthesis time
● Multicore Processor:
○ Simple usage (extensions to programming languages for parallelism)
○ Far fewer processing elements (10-100)
● Proposed Architecture:
○ Halfway between the above options
○ Specialised for vision applications
Optimized Processing Tiles
● Heavily specialized for 2D convolutions
○ Top row PTs are MACs used as convolvers, implemented in the FPGA by hardwired MACs
○ Middle row is general-purpose ops
○ Bottom row is non-linear mapping (normalization, linear activation, etc.), done with look-up or linear decomposition
● Pipelined to produce 1 result/cycle
● Pixels stored as Q8.8 and scaled to 32-bit in operations
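The Q8.8 storage and MAC-based convolution described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`to_q88`, `from_q88`, `conv2d_q88`) are hypothetical, saturation on overflow is assumed, and Python's unbounded integers stand in for the wider accumulator rather than modelling an exact 32-bit datapath.

```python
Q_FRAC_BITS = 8  # Q8.8: 8 integer bits, 8 fractional bits (signed 16-bit storage)
Q_MIN, Q_MAX = -32768, 32767


def to_q88(x: float) -> int:
    """Quantize a real value to Q8.8, saturating at the 16-bit limits (assumed behaviour)."""
    raw = round(x * (1 << Q_FRAC_BITS))
    return max(Q_MIN, min(Q_MAX, raw))


def from_q88(q: int) -> float:
    """Convert a Q8.8 integer back to a real value."""
    return q / (1 << Q_FRAC_BITS)


def conv2d_q88(image, kernel):
    """'Valid' 2D convolution (correlation form) on Q8.8 integers.

    Each product of two Q8.8 values carries 16 fractional bits, so products
    are summed in a wide accumulator and shifted back to Q8.8 at the end.
    """
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = len(image) - kh + 1, len(image[0]) - kw + 1
    out = [[0] * ow for _ in range(oh)]
    for r in range(oh):
        for c in range(ow):
            acc = 0  # wide accumulator (stands in for the 32-bit datapath)
            for i in range(kh):
                for j in range(kw):
                    acc += image[r + i][c + j] * kernel[i][j]  # one MAC
            acc >>= Q_FRAC_BITS  # rescale from 16 to 8 fractional bits
            out[r][c] = max(Q_MIN, min(Q_MAX, acc))
    return out
```

For example, convolving a 2x2 patch of ones with a 2x2 kernel of 0.5 accumulates four products of 256 * 128 and rescales to 512, which reads back as 2.0.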
Architecture Constraints
● High throughput, but not necessarily low latency
○ Operations replicated in both dimensions
○ Number of similar computations exceeds the latency of a pipelined processor
● Must be stallable
○ Allows any path to be configured, even if it requires more bandwidth than is available
○ Achieved with FIFO buffers
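To make the stallable requirement concrete, here is a minimal cycle-level sketch of a bounded FIFO sitting between a fast producer and a slower consumer. The class and function names (`StallableFifo`, `simulate`) and the one-push-per-cycle model are assumptions for illustration, not details taken from the paper.

```python
from collections import deque


class StallableFifo:
    """Bounded FIFO: a full buffer tells the upstream stage to stall."""

    def __init__(self, depth):
        self.depth = depth
        self.items = deque()

    def can_push(self):
        return len(self.items) < self.depth

    def can_pop(self):
        return len(self.items) > 0

    def push(self, item):
        if not self.can_push():
            raise RuntimeError("push on full FIFO; producer should have stalled")
        self.items.append(item)

    def pop(self):
        return self.items.popleft()


def simulate(n_items, depth, consumer_period):
    """Producer offers one item per cycle; consumer pops every `consumer_period` cycles.

    Returns (items in the order received, number of cycles the producer stalled).
    """
    fifo = StallableFifo(depth)
    received, stalls, next_item, cycle = [], 0, 0, 0
    while len(received) < n_items:
        if cycle % consumer_period == 0 and fifo.can_pop():
            received.append(fifo.pop())
        if next_item < n_items:
            if fifo.can_push():
                fifo.push(next_item)
                next_item += 1
            else:
                stalls += 1  # back-pressure: producer holds its data this cycle
        cycle += 1
    return received, stalls
```

With a consumer that keeps pace (`consumer_period=1`) the producer never stalls; with a slower consumer it stalls, but no data is lost or reordered, which is the property that lets an over-subscribed path still be configured.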
Architecture Constraints
● Configuration time ≈ system latency
○ Crucial to runtime reconfiguration, achieved with configuration bus
● Coarse-grained processing elements
○ Maximize ratio between computing and routing
Smart DMA
● Custom engine to allow multiple asynchronous accesses
● Arbiter MUX/DEMUX access to memory with high bandwidth
● Ports can be configured to R/W specific chunks and communicate status to Control Unit
○ Dataflow: Operation driven fully by data
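As a rough illustration of arbitrated memory access, the sketch below grants one transfer per cycle to the highest-priority port that still has pending work. The slides only say that transfers are prioritized; the fixed-priority policy, the port names, and the `run_arbiter` function are assumptions made for this example.

```python
def run_arbiter(ports):
    """Fixed-priority arbiter (assumed policy, not confirmed by the source).

    ports: list of (name, priority, n_transfers); a lower priority number wins.
    Returns the port names in grant order, one grant per cycle.
    """
    pending = {name: n for name, _, n in ports}
    priority = {name: p for name, p, _ in ports}
    grants = []
    while any(pending.values()):
        ready = [name for name, n in pending.items() if n > 0]
        winner = min(ready, key=lambda name: priority[name])
        pending[winner] -= 1  # one memory transfer granted this cycle
        grants.append(winner)
    return grants
```

A fixed-priority scheme like this can starve low-priority ports, so a real design would likely bound waiting time in some way; the sketch only shows the multiplexing idea.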
Compiler
Purpose
● Extracting levels of parallelism from graph descriptions of algorithms
● Graphs are given in the Torch5 environment
○ Matrix representation similar to MATLAB
● Known sequences of operators are matched to pre-optimized routines
Training a XOR gate in Torch 5, http://torch5.sourceforge.net/manual/newbieTutorial.html
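The idea of matching known operator sequences to pre-optimized routines can be sketched as a greedy longest-match over a flat list of operators. Everything named here is hypothetical: the `ROUTINES` table, the operator names, and `match_routines` are illustrative stand-ins and do not reflect LuaFlow's actual tables or API. Python is used for the sketch rather than Torch5's Lua.

```python
# Hypothetical table of operator sequences and the routine each maps to.
ROUTINES = {
    ("conv2d", "add_bias", "tanh"): "fused_conv_bias_tanh",
    ("conv2d", "tanh"): "fused_conv_tanh",
    ("conv2d",): "conv_only",
}


def match_routines(ops):
    """Greedy longest-match of operator sequences against ROUTINES.

    Operators that match no known sequence are emitted as generic steps.
    """
    patterns = sorted(ROUTINES, key=len, reverse=True)  # try longest first
    plan, i = [], 0
    while i < len(ops):
        for pattern in patterns:
            if tuple(ops[i:i + len(pattern)]) == pattern:
                plan.append(ROUTINES[pattern])
                i += len(pattern)
                break
        else:
            plan.append("generic:" + ops[i])  # no pre-optimized routine known
            i += 1
    return plan
```

Trying the longest patterns first means a convolution followed by bias and `tanh` maps to one fused routine instead of three separate steps.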
Parallelization Methods
● Across modules
○ Special cases
○ Cascading convolutions and nonlinear mapping
● Across images
○ Can use multiple PTs to convolve multiple inputs with a kernel at once
○ NeuFlow/LuaFlow’s strength and the simplest method
● Within an image
Application and Performance Comparison
Application: Street Scenes
● Trained with LabelMe dataset of Spanish cities
● 3 phases of training
● Post-training, network mapped to NeuFlow using LuaFlow
Phase 1: CN1
● 3 Convolutions
● Small kernels (5x5)
○ Small receptive field
● Focus on minimizing cross entropy to promote rare categories
Phase 2: CN2
● 3 Convolutions
● Kernels increased to 9x9 size
○ Receptive field 2% of image
Phase 3: CN3
● 4 Convolutions
● Kernels kept at 9x9 size
○ Receptive field 5% of image
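The growth of the receptive field across the three phases follows from the standard rule for stacked convolutions, sketched below. The slides do not give strides, subsampling layers, or the image size, so the `receptive_field` helper and the stride-1 examples are assumptions and are not meant to reproduce the 2% and 5% figures.

```python
def receptive_field(layers):
    """Side length, in input pixels, of the region seen by one output unit.

    layers: list of (kernel_size, stride) pairs, applied in order.
    """
    rf, jump = 1, 1  # jump = spacing of adjacent units measured in input pixels
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf
```

With stride 1 throughout, three 5x5 layers see 13 pixels per side, three 9x9 layers see 25, and four 9x9 layers see 33. Any subsampling between layers multiplies `jump` and enlarges the field further, which is consistent with the trend on the slides.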
Performance Comparison
● V6 (Virtex-6 FPGA implementation)
○ Competitive GOP rate
○ Strength in power efficiency indicates a potential use case in power-constrained systems for which the speed of an mGPU would suffice
● IBM (projected ASIC implementation)
○ Vast projected improvement in GOP rate and efficiency
○ Could fully eclipse the mGPU in speed and efficiency
Questions?