Proprietary and confidential. Do not distribute.
Nervana and the Future of Computing
Arjun Bansal, Co-founder & VP Algorithms, Nervana
26 April 2016
MAKING MACHINES SMARTER.™
AI on demand using Deep Learning
[Diagram: many data types feed the Nervana Platform's deep learning (DL) core, which powers applications such as image classification, object localization, video indexing, text analysis, and machine translation.]
Image classification and video activity detection
Deep learning model:
• Trained on a public dataset[1] of 13K videos in 100 categories
• Training was approximately 3x faster than a competing framework
• Can be extended to scene and object detection, action-similarity labeling, video retrieval, and anomaly detection

Potential applications:
• Activity detection and monitoring for security
• Automatic editing of captured moments from video cameras
• Facial recognition and image-based retrieval
• Sense-and-avoid systems for autonomous driving
• Baggage screening at airports and other public venues

Demo: https://www.youtube.com/watch?v=ydnpgUOpdBw
[1] UCF101 dataset: http://crcv.ucf.edu/data/UCF101.php
Object localization and recognition
Speech to text
Demo: https://youtu.be/NaqZkV_fBIM
Question answering
Story:
Mary journeyed to Texas. John went to Maryland.
Mary went to Iowa. John travelled to Florida.

Question: Where is John located?
Answer: Florida
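Story/question pairs like this follow the format of Facebook's bAbI tasks. As a toy illustration of the task itself (not the trained model behind the demo, which is an end-to-end neural network), a hypothetical baseline that simply tracks each entity's most recent location answers this example correctly:

import re

def answer_where(story, question):
    # Toy baseline: remember the last stated location of each person.
    locations = {}
    for sentence in story.split('. '):
        m = re.match(r'(\w+) (?:journeyed|went|travelled|moved) to (\w+)', sentence)
        if m:
            locations[m.group(1)] = m.group(2)
    person = re.search(r'Where is (\w+)', question).group(1)
    return locations.get(person)

story = ("Mary journeyed to Texas. John went to Maryland. "
         "Mary went to Iowa. John travelled to Florida.")
print(answer_where(story, "Where is John located?"))  # Florida

A learned solution, such as an end-to-end memory network, answers from the text without hand-written rules.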
Reinforcement learning
Demos: Pong and Breakout, learned from pixels with reinforcement learning.
https://youtu.be/KkIf0Ok5GCE
https://youtu.be/0ZlgrQS3krg
Application areas
Healthcare, Agriculture, Finance, Online Services, Automotive, Energy
Nervana is building the future of computing
The Economist, March 12, 2016
• Cloud Computing
• Custom ASIC
• Deep Learning / AI
nervana cloud
[Diagram: data of all types (images, text, tabular, speech, time series, video) enters the cloud pipeline: import → train → build → deploy.]
nervana neon

• Fastest library
• Model support
  Models: ConvNet, RNN, LSTM, MLP, DQN, NTM
  Domains: images, video, speech, text, time series
• Cloud integration (see commands below)
• Multiple backends: CPU, GPU, multiple GPUs, parameter server, (Xeon Phi), nervana TPU
• Optimized at assembler level

Running locally:
% python rnn.py                      # or: neon rnn.yaml

Running in nervana cloud:
% ncloud submit --py rnn.py          # or: --yaml rnn.yaml
% ncloud show <model_id>
% ncloud list
% ncloud deploy <model_id>
% ncloud predict <model_id> <data>   # or use the REST API
nervana tensor processing unit (TPU)

• 10-100x gain
• Architecture optimized for:
  • Unprecedented compute density: 1 nervana engine ≈ 10 GPUs ≈ 200 CPUs
  • Scalable distributed architecture
  • Memory near computation [diagram contrasts a CPU (control, ALU, and instruction/data memory) with the nervana design (control and data memory placed next to the compute)]
  • Learning and inference
  • Exploiting limited precision
  • Power efficiency
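On the limited-precision point: training tolerates short fixed-point formats if rounding is unbiased, which is why stochastic rounding (Gupta et al., 2015) matters for hardware like this. A small numpy sketch of the idea, with an arbitrary assumed grid spacing:

import numpy as np

STEP = 2.0 ** -8                     # assumed fixed-point grid spacing

def round_nearest(x):
    return np.round(x / STEP) * STEP

def round_stochastic(x, rng):
    scaled = x / STEP
    lo = np.floor(scaled)
    # Round up with probability equal to the fractional part,
    # so the rounded value equals x in expectation.
    return (lo + (rng.random() < scaled - lo)) * STEP

rng = np.random.default_rng(0)
grad = 1e-3                          # update smaller than half a grid step
acc_nearest = acc_stochastic = 0.0
for _ in range(1000):
    acc_nearest = round_nearest(acc_nearest + grad)
    acc_stochastic = round_stochastic(acc_stochastic + grad, rng)

print(acc_nearest)      # 0.0  -- every small update rounds away
print(acc_stochastic)   # ~1.0 -- updates survive in expectation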
Special purpose computation
1940s: Turing Bombe
Motivation: Automating calculations, code breaking
General purpose computation
2000s: SoC
Motivation: reduce power and cost, fungible computing.
Enabled inexpensive mobile devices.
Dennard scaling has ended
What business and technology constraints do we have now?
Many-core tiled architectures
[Figure: Tilera TILEPro64 block diagram, from the TILEPro datasheet. The Tile Processor implements Tilera's multicore architecture: a two-dimensional 8x8 array of processing elements ("tiles") connected by multiple 2D mesh networks (the iMesh interconnect), with external memory (DDR2) and I/O interfaces (PCIe, XAUI 10GbE, RGMII GbE, FlexI/O) integrated on chip. Each tile is a full-featured computing system that can independently run an operating system such as Linux: a 32-bit three-way VLIW processor engine with its own program counter, cache, and DMA subsystem, executing up to three operations per cycle.]
2010s: multi-core, GPGPU
Motivation: increased performance without clock-rate increases or further device scaling. Requires changes in programming paradigm.
Examples: Tilera, NVIDIA GM204, Intel Xeon Phi (Knights Landing)
FPGA architectures
Altera Arria 10
Motivation: fine-grained parallelism, reconfigurability, lots of I/O, scalability.
But: slow clock speeds and insufficient compute density for machine learning.
Neuromorphic architectures
IBM TrueNorth
[Excerpt from Merolla et al., Science 345(6197), 8 August 2014. TrueNorth is a fully functional digital chip with 1 million spiking neurons and 256 million (nonplastic) synapses. Spike events are carried between cores by time-multiplexed wires forming a two-dimensional mesh network of routers over a 64-by-64 core array, with deadlock-free dimension-order routing and merge-split structures at the mesh edges to scale across chip boundaries. Built from 5.4 billion transistors on a 4.3 cm² die in Samsung's 28 nm process, the chip holds ~428 million bits of on-chip memory, and its power density is 20 mW per cm².]
Neural network parallelism
[Diagram: two parallelization schemes.
Data parallelism: each of n processors holds the full deep network; the training data is split into chunks (data chunk 1 … data chunk n), one per processor, and a parameter server coordinates the parameters.
Model parallelism: the network itself is split across processors.]
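A toy numpy sketch of the data-parallel scheme above, with synchronous updates and a linear model standing in for the deep network (all names hypothetical):

import numpy as np

# Each "worker" holds a full copy of the model and its own data chunk;
# the parameter server averages gradients and broadcasts new weights.
rng = np.random.default_rng(0)
n_workers, dim, lr = 4, 8, 0.1
w_true = rng.standard_normal(dim)
chunks = [rng.standard_normal((100, dim)) for _ in range(n_workers)]
targets = [X @ w_true for X in chunks]

w = np.zeros(dim)                          # parameter server state
for step in range(200):
    grads = []
    for X, y in zip(chunks, targets):      # one gradient per worker
        err = X @ w - y
        grads.append(X.T @ err / len(X))
    w -= lr * np.mean(grads, axis=0)       # coordinate: average and update

print(np.allclose(w, w_true, atol=1e-3))   # True: replicas converge together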
Existing computing topologies are lacking
[Diagram, built up over several slides: a typical node pairs two CPUs with an SSD and an InfiniBand/10G NIC, and hangs four GPUs off PCIe switches; scaling out replicates the same node, so every GPU-to-GPU exchange crosses PCIe switches and the comparatively thin network link.]
nervana compute topology
[Diagram: eight nervana engines ("n") interconnected directly with one another, while the CPUs, SSDs, and InfiniBand/10G NICs sit at the edge behind PCIe switches, keeping accelerator-to-accelerator traffic off the host fabric.]
Distributed linear algebra and convolution
Summary of parallel matrix multiply (from Jim Demmel's CS267 lecture notes):
• SUMMA (Scalable Universal Matrix Multiply Algorithm): attains the communication lower bounds (within a factor of log p); used in practice in PBLAS (Parallel BLAS). Presentation follows van de Geijn and Watts, www.netlib.org/lapack/lawns/lawn96.ps; see also www.netlib.org/lapack/lawns/lawn100.ps.
• Cannon's algorithm: historically first; attains the lower bounds but needs more assumptions (A and B square, P a perfect square).
• 2.5D SUMMA: uses more memory to communicate even less.
• Parallel Strassen: attains different, even lower bounds.

SUMMA uses the outer-product form of matmul. C = A*B means C(i,j) = Σ_k A(i,k)*B(k,j); column-wise, C = Σ_k (k-th column of A)*(k-th row of B). With block size b (b = 4 for illustration), C = A(:,1:4)*B(1:4,:) + A(:,5:8)*B(5:8,:) + …, a sum over blocks of b columns of A times blocks of b rows of B.

For an n x n matmul on a √P x √P processor grid: C[i,j] is the n/√P x n/√P submatrix of C on processor P_ij, A[i,k] is an n/√P x b submatrix of A, and B[k,j] is a b x n/√P submatrix of B; each processor computes C[i,j] = C[i,j] + Σ_k A[i,k]*B[k,j], where the summation runs over submatrices. The processor grid need not be square.

Matrix multiplication on multidimensional torus networks
Edgar Solomonik and James Demmel, Division of Computer Science, University of California at Berkeley
Abstract: Blocked matrix multiplication algorithms such as Cannon's algorithm and SUMMA have a 2-dimensional communication structure. We introduce a generalized 'Split-Dimensional' version of Cannon's algorithm (SD-Cannon) with higher-dimensional and bidirectional communication structure. This algorithm is useful for higher-dimensional torus interconnects that can achieve more injection bandwidth than single-link bandwidth. On a bidirectional torus network of dimension d, SD-Cannon can lower the algorithmic bandwidth cost by a factor of up to d. With rectangular collectives, SUMMA also achieves the lower bandwidth cost but has a higher latency cost. We use Charm++ virtualization to efficiently map SD-Cannon on unbalanced and odd-dimensional torus network partitions. Our performance study on Blue Gene/P demonstrates that an MPI version of SD-Cannon can exploit multiple communication links and improve performance.

From the paper's introduction: Cannon's near-neighbor shifts saturate at most 2 network links per node, while a d-dimensional bidirectional torus offers 2d links; SUMMA's row and column broadcasts can exploit all 2d links through rectangular collectives, though SUMMA typically sends more messages, O(√p) broadcasts rather than Cannon's O(√p) near-neighbor sends.
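A single-process numpy sketch of the outer-product formulation (hypothetical code; the real SUMMA broadcasts each block-column of A along processor rows and each block-row of B along processor columns rather than holding everything in one memory):

import numpy as np

def summa_blocked(A, B, b=4):
    # C = A @ B accumulated as a sum of rank-b block outer products,
    # the unit of work SUMMA broadcasts across the processor grid.
    n, k = A.shape
    C = np.zeros((n, B.shape[1]))
    for s in range(0, k, b):
        C += A[:, s:s+b] @ B[s:s+b, :]   # rank-b update
    return C

rng = np.random.default_rng(0)
A = rng.standard_normal((16, 12))
B = rng.standard_normal((12, 8))
assert np.allclose(summa_blocked(A, B), A @ B)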
Summary
• Computers are tools for solving the problems of their time
  • Was: coding, calculation, graphics, the web
  • Today: learning and inference on data
• Deep learning as a computational paradigm
• Custom architectures can do vastly better