interconnect for machine learning jan… · 7/8/2019 · regular topology generation • simple...
TRANSCRIPT
Interconnect for Machine LearningM L = I N T E R C O N N E C T + N E U R O N S + S O F T W A R E
K. CHARLES JANACPresident and CEO
Arteris IP, the Data Highway of the SoC
Copyright © Arteris IP 2019 2
Arteris IP Ncore® cache coherent interconnect IP
Arteris IP FlexNoC® non-coherent interconnect IP
Arteris IP CodaCache™ last level cache
ChipletLinkFlexNoC® Non-coherent Interconnect
High Speed Wired Peripherals
USB 3USB 2
PHY3.0, 2.0
PCIe
PHY
Ethernet
PHY
Wireless Subsystem
WiFi
GSM
LTE
LTE Adv.
Memory Subsystem
HBM2LP DDR
DDR 4/5
PHY PHY
Memory Scheduler
Memory Controller
FlexWay Subsystem Interconnect
HDMI
MIPI
Display
PMU
JTAG
I/O Peripherals
CRICrypto
Firewall (PCF+)
RSA-PSSCert.
Engine
Security Subsystem
Design-Specific Subsystems
Accelerator Subsystem(s)
Ncore® Cache Coherent Interconnect
Arm® Cortex® CPU Subsystem
A76AE
DSU (L3 cache)
A76AE
A76AE
A76AE
A76AE
DSU (L3 cache)
A76AE
A76AE
A76AE
CodaCache™
Last Level Cache
CMC $
Proxy $
Application-specific IP Subsystem
DSP
SRAM
IP
IP
IP
IP
FlexNoC Interconnect
Machine Learning Subsystem
Ncore Interconnect
CHI & ACE Protocols
Accel Accel Accel
Chiplet
Link
AI Package
8 July 2019
Machine Learning – Training vs Inference
• Deep Learning is a branch of machine learning that involves layering algorithms in an effort to gain greater understanding of the data.
8 July 2019 Copyright © Arteris IP 2019 3
– Deep learning can take raw information without any meaning and construct hierarchical representations that allow insights to be generated
– Deep learning’s results improve more and more with large training data sets
– Machine learning allows machines to make predictions based on “experience” data sets
8 July 2019 Copyright © Arteris IP 2019 4
The Brain is neurons and interconnect
Machine Learning SoC is– Neurons (processors)
– Interconnect
– I/Os and peripherals
– Lots of software
– Lots of data
Deep Learning Specific Interconnect FeaturesFOR ACCELERATED DEVELOPMENT OF MACHINE LEARNING SOCS
8 July 2019 Copyright © Arteris IP 2019 5
• Automated Topology generation
• Customization of automated results
• Flexible router architecture
Regular (AI) Topologies
• Source synchronous communications
• VC-Links™ - Virtual Channels
Large Chips
NEW!
NEW!NEW!
NEW!
NEW!
• Multicast• Multi-channel HBM2
memory support• High bandwidth
datapaths
Huge Bandwidth
NEW!
NEW!
NEW!
Characteristics of mesh and ring architectures
• One or more processing elements (PE), memories (caches) or I/O controllers per corner router
• Multicast / Broadcast writes for BW efficiency
• Sophisticated interleaving for optimal off-chip HBM2 access
Copyright © Arteris IP 2019 6
Mem
Con
trol
ler
Mem
Con
trol
ler M
em C
ontrollerM
em C
ontroller
Mem ControllerMem Controller
Mem ControllerMem Controller
20+ mm
20+
mm
ROUTERS
PE
8 July 2019
Regular Topology Generation
• Simple elementary components enable building essentially any topology provided that:– Routing function is static and defined at configuration time
– Routes have no dependencies – deadlock avoidance is guaranteed by the tool
• It is then possible to build regular topologies such as:– Large crossbars with distributed pipelining
– 2D Meshes
– Rings (1D) and Torus (2D)
8 July 2019 Copyright © Arteris IP 2019 7
a) Mesh topology b) Bi-directional Torus topology
c) Torus topology
Editing Regular Topology Generator Results
Copyright © Arteris IP 2019 88 July 2019
Topology description file
(JSON)
Project Description File (PDD)
1) Generate 2) View and Edit NoC topology
3) View and Edit NoC
RTL
SystemC
UVM
Fle
xEdi
t
Fle
xArt
ist
Welcome to Arteris FlexEdit
Mesh tiling benefits and drawbacks
PRO
• Can be easier for place & route team to massively scale (hard macros)
• Can simplify top-level interconnect, or allow scalable reuse of existing NoC architecture
CONS
• Requires additional logic and memory in tile
• Adds complexity to NoC addressing
• Reduces flexibility for system-level optimizations (QoS, memory interleaving, power management, etc.)
Copyright © Arteris IP 2019 9
Tile A1 Tile A2
Tile B1 Tile B2
8 July 2019
Write Broadcast Station for Multicast
• Broadcast Station ingress ports mapped as Write Only slave in FlexNoC memory map– There can be any number of stations in a FlexNoC
– Broadcast done as close as possible to the destinations
• Writing to Broadcast Station will make it send in turn the writes to multiple destinations– Based on address prefix per egress port
– Can be other broadcast stations
– Can be ”normal” slaves
• Broadcast Stations egress ports as masters of the NoC– Optional: selection of egress ports by user bits: dynamic multicast
• Optional : Support of posted writes for higher performance
Copyright © Arteris IP 2019 10
M_source(CPU, DMA, etc…)
BC
_S0
BC_S1
BC_S2
T0
T1
T2
T3
8 July 2019
Getting data off-chip: HBM2 load balancing
• Advanced interleaving and response reordering
• Distributed controller ports → long distances → source-synchronous and virtual channels
8 July 2019 Copyright © Arteris IP 2019 11
I-NIU
I-NIU
I-NIU
T-NIU
T-NIU
T-NIU
T-NIU
T-NIU
T-NIU
T-NIU
T-NIU
I-NIU
RB
RB
RB
RB
8 ch
anne
ls H
BM
2 co
ntro
ller
256
256
512
128
1024
Mem
Con
trol
ler
Mem
Con
trol
ler M
em C
ontrollerM
em C
ontroller
Mem ControllerMem Controller
Mem ControllerMem Controller
20 mm
20 m
m
From this… …to this.
Scale-up alternative: “Islands” of cache coherency
• Multiple subsystems that are internally cache-coherent
• Each subsystem usually different, optimized for set(s) of tasks
• Can provide scalability for edge inference processing while meeting latency and power requirements
8 July 2019 Copyright © Arteris IP 2019 12
HBM2, GDDR6 or DRAM
Acc1
Acc2
Acc3
Acc4
Acc5
Directory
Transport Interconnect
Non-coherent
bridgeP
roxy cache ($)
Non-coherent
bridgeP
roxy cache ($)
CPU1Cache ($)
Coherent Memory Interface
Coherent Agent Interface
AI AccelCache ($)
Coherent Agent Interface
Coherent Memory Interface
Snoop Filters(s)
Snoop Filters(s)
CMC ($) CMC ($)
Acc1
Acc2
Acc3
Acc4
Acc5
Directory
Transport Interconnect
Non-coherent
bridgeP
roxy cache ($)
Non-coherent
bridgeP
roxy cache ($)
CPU1Cache ($)
Coherent Memory Interface
Coherent Agent Interface
AI AccelCache ($)
Coherent Agent Interface
Coherent Memory Interface
Snoop Filters(s)
Snoop Filters(s)
CMC ($) CMC ($)
Acc1
Acc2
Acc3
Acc4
Acc5
Directory
Transport Interconnect
Non-coherent
bridgeP
roxy cache ($)
Non-coherent
bridgeP
roxy cache ($)
CPU1Cache ($)
Coherent Memory Interface
Coherent Agent Interface
AI AccelCache ($)
Coherent Agent Interface
Coherent Memory Interface
Snoop Filters(s)
Snoop Filters(s)
CMC ($) CMC ($)
NoC Interconnect
AI and ML are changing SoCs architecturesAFFECTING AT LEAST 3 LEVELS OF DEVELOPMENT
8 July 2019 Copyright © Arteris IP 2019 13
• Mesh and ring architectures
• Tiling• Many “islands” of
cache coherency
SoC Level
• Complex memory hierarchies
• Systolic arrays
• And everything in between
Processing Elements
• Interleaved HBM2 memory support
• Multiple inter-chipletcommunications standards
Off-Chip Communication
PE
Interconnect is key to architecture at all 3 levels!
Active Arteris IP Machine Learning Customers
Transportation AI / Machine Learning
ToshibaNXP
Major FPGA Company
AutomotiveSoC Maker
8 July 2019 Copyright © Arteris IP 2019 14
Major Automotive Tier-1 #2
Major ADAS System Maker
Major Automotive EV OEM
Stealth AI Company #1
Stealth AI Company #2
Challenges: Determinism, Visibility, Controllability
15
• How do you verify a deep learning system?
• How do you debug the Neural Network?
• How to you trace intermediate results?
• How do you determine how ML arrives @ solution?
• How do you make a Neural Network “safe”?
• What are the ethics and biases of these systems?
Copyright © Arteris IP 20198 July 2019
Conclusion
• ML today is not particularly intelligent, but it is clever and improving– CNNs being deployed in automotive and data center acceleration SoCs
– Eventually majority of complex SoCs will contain machine learning sections
– A simple continuation of the exponential increasing processing performance trend will not break ML out of the scope of applications that it can do well today
– CNNs are non-linear curve fitting engines with a functionality ceiling
• New architectures and approaches needed for next generation of machine learning problems to deliver real AI solutions– Interconnect technology innovation is key to AI architecture evolution
8 July 2019 Copyright © Arteris IP 2019 16
8 July 2019 Copyright © Arteris IP 2019 17