interconnect for machine learning jan… · 7/8/2019  · regular topology generation • simple...

9
Interconnect for Machine Learning ML = INTERCONNECT + NEURONS + SOFTWARE K. CHARLES JANAC President and CEO Arteris IP, the Data Highway of the SoC Copyright © Arteris IP 2019 2 Arteris IP Ncore® cache coherent interconnect IP Arteris IP FlexNoC® non-coherent interconnect IP Arteris IP CodaCache™ last level cache Chiplet Link FlexNoC ® Non-coherent Interconnect High Speed Wired Peripherals USB 3 USB 2 PHY 3.0, 2.0 PCIe PHY Ethernet PHY Wireless Subsystem WiFi GSM LTE LTE Adv. Memory Subsystem HBM2 LP DDR DDR 4/5 PHY PHY Memory Scheduler Memory Controller FlexWay Subsystem Interconnect HDMI MIPI Display PMU JTAG I/O Peripherals CRI Crypto Firewall (PCF+) RSA-PSS Cert. Engine Security Subsystem Design-Specific Subsystems Accelerator Subsystem(s) Ncore® Cache Coherent Interconnect Arm® Cortex® CPU Subsystem A76AE DSU (L3 cache) A76AE A76AE A76AE A76AE DSU (L3 cache) A76AE A76AE A76AE CodaCache™ Last Level Cache CMC $ Proxy $ Application-specific IP Subsystem DSP SRAM IP IP IP IP FlexNoC Interconnect Machine Learning Subsystem Ncore Interconnect CHI & ACE Protocols Accel Accel Accel Chiplet Link AI Package 8 July 2019

Upload: others

Post on 25-Aug-2020

8 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Interconnect for Machine Learning Jan… · 7/8/2019  · Regular Topology Generation • Simple elementary components enable building essentially any topology provided that: –

Interconnect for Machine LearningM L = I N T E R C O N N E C T + N E U R O N S + S O F T W A R E

K. CHARLES JANACPresident and CEO

Arteris IP, the Data Highway of the SoC

Copyright © Arteris IP 2019 2

Arteris IP Ncore® cache coherent interconnect IP

Arteris IP FlexNoC® non-coherent interconnect IP

Arteris IP CodaCache™ last level cache

ChipletLinkFlexNoC® Non-coherent Interconnect

High Speed Wired Peripherals

USB 3USB 2

PHY3.0, 2.0

PCIe

PHY

Ethernet

PHY

Wireless Subsystem

WiFi

GSM

LTE

LTE Adv.

Memory Subsystem

HBM2LP DDR

DDR 4/5

PHY PHY

Memory Scheduler

Memory Controller

FlexWay Subsystem Interconnect

HDMI

MIPI

Display

PMU

JTAG

I/O Peripherals

CRICrypto

Firewall (PCF+)

RSA-PSSCert.

Engine

Security Subsystem

Design-Specific Subsystems

Accelerator Subsystem(s)

Ncore® Cache Coherent Interconnect

Arm® Cortex® CPU Subsystem

A76AE

DSU (L3 cache)

A76AE

A76AE

A76AE

A76AE

DSU (L3 cache)

A76AE

A76AE

A76AE

CodaCache™

Last Level Cache

CMC $

Proxy $

Application-specific IP Subsystem

DSP

SRAM

IP

IP

IP

IP

FlexNoC Interconnect

Machine Learning Subsystem

Ncore Interconnect

CHI & ACE Protocols

Accel Accel Accel

Chiplet

Link

AI Package

8 July 2019

Page 2: Interconnect for Machine Learning Jan… · 7/8/2019  · Regular Topology Generation • Simple elementary components enable building essentially any topology provided that: –

Machine Learning – Training vs Inference

• Deep Learning is a branch of machine learning that involves layering algorithms in an effort to gain greater understanding of the data.

8 July 2019 Copyright © Arteris IP 2019 3

– Deep learning can take raw information without any meaning and construct hierarchical representations that allow insights to be generated

– Deep learning’s results improve more and more with large training data sets

– Machine learning allows machines to make predictions based on “experience” data sets

8 July 2019 Copyright © Arteris IP 2019 4

The Brain is neurons and interconnect

Machine Learning SoC is– Neurons (processors)

– Interconnect

– I/Os and peripherals

– Lots of software

– Lots of data

Page 3: Interconnect for Machine Learning Jan… · 7/8/2019  · Regular Topology Generation • Simple elementary components enable building essentially any topology provided that: –

Deep Learning Specific Interconnect FeaturesFOR ACCELERATED DEVELOPMENT OF MACHINE LEARNING SOCS

8 July 2019 Copyright © Arteris IP 2019 5

• Automated Topology generation

• Customization of automated results

• Flexible router architecture

Regular (AI) Topologies

• Source synchronous communications

• VC-Links™ - Virtual Channels

Large Chips

NEW!

NEW!NEW!

NEW!

NEW!

• Multicast• Multi-channel HBM2

memory support• High bandwidth

datapaths

Huge Bandwidth

NEW!

NEW!

NEW!

Characteristics of mesh and ring architectures

• One or more processing elements (PE), memories (caches) or I/O controllers per corner router

• Multicast / Broadcast writes for BW efficiency

• Sophisticated interleaving for optimal off-chip HBM2 access

Copyright © Arteris IP 2019 6

Mem

Con

trol

ler

Mem

Con

trol

ler M

em C

ontrollerM

em C

ontroller

Mem ControllerMem Controller

Mem ControllerMem Controller

20+ mm

20+

mm

ROUTERS

PE

8 July 2019

Page 4: Interconnect for Machine Learning Jan… · 7/8/2019  · Regular Topology Generation • Simple elementary components enable building essentially any topology provided that: –

Regular Topology Generation

• Simple elementary components enable building essentially any topology provided that:– Routing function is static and defined at configuration time

– Routes have no dependencies – deadlock avoidance is guaranteed by the tool

• It is then possible to build regular topologies such as:– Large crossbars with distributed pipelining

– 2D Meshes

– Rings (1D) and Torus (2D)

8 July 2019 Copyright © Arteris IP 2019 7

a) Mesh topology b) Bi-directional Torus topology

c) Torus topology

Editing Regular Topology Generator Results

Copyright © Arteris IP 2019 88 July 2019

Topology description file

(JSON)

Project Description File (PDD)

1) Generate 2) View and Edit NoC topology

3) View and Edit NoC

RTL

SystemC

UVM

Fle

xEdi

t

Fle

xArt

ist

Welcome to Arteris FlexEdit

Page 5: Interconnect for Machine Learning Jan… · 7/8/2019  · Regular Topology Generation • Simple elementary components enable building essentially any topology provided that: –

Mesh tiling benefits and drawbacks

PRO

• Can be easier for place & route team to massively scale (hard macros)

• Can simplify top-level interconnect, or allow scalable reuse of existing NoC architecture

CONS

• Requires additional logic and memory in tile

• Adds complexity to NoC addressing

• Reduces flexibility for system-level optimizations (QoS, memory interleaving, power management, etc.)

Copyright © Arteris IP 2019 9

Tile A1 Tile A2

Tile B1 Tile B2

8 July 2019

Write Broadcast Station for Multicast

• Broadcast Station ingress ports mapped as Write Only slave in FlexNoC memory map– There can be any number of stations in a FlexNoC

– Broadcast done as close as possible to the destinations

• Writing to Broadcast Station will make it send in turn the writes to multiple destinations– Based on address prefix per egress port

– Can be other broadcast stations

– Can be ”normal” slaves

• Broadcast Stations egress ports as masters of the NoC– Optional: selection of egress ports by user bits: dynamic multicast

• Optional : Support of posted writes for higher performance

Copyright © Arteris IP 2019 10

M_source(CPU, DMA, etc…)

BC

_S0

BC_S1

BC_S2

T0

T1

T2

T3

8 July 2019

Page 6: Interconnect for Machine Learning Jan… · 7/8/2019  · Regular Topology Generation • Simple elementary components enable building essentially any topology provided that: –

Getting data off-chip: HBM2 load balancing

• Advanced interleaving and response reordering

• Distributed controller ports → long distances → source-synchronous and virtual channels

8 July 2019 Copyright © Arteris IP 2019 11

I-NIU

I-NIU

I-NIU

T-NIU

T-NIU

T-NIU

T-NIU

T-NIU

T-NIU

T-NIU

T-NIU

I-NIU

RB

RB

RB

RB

8 ch

anne

ls H

BM

2 co

ntro

ller

256

256

512

128

1024

Mem

Con

trol

ler

Mem

Con

trol

ler M

em C

ontrollerM

em C

ontroller

Mem ControllerMem Controller

Mem ControllerMem Controller

20 mm

20 m

m

From this… …to this.

Scale-up alternative: “Islands” of cache coherency

• Multiple subsystems that are internally cache-coherent

• Each subsystem usually different, optimized for set(s) of tasks

• Can provide scalability for edge inference processing while meeting latency and power requirements

8 July 2019 Copyright © Arteris IP 2019 12

HBM2, GDDR6 or DRAM

Acc1

Acc2

Acc3

Acc4

Acc5

Directory

Transport Interconnect

Non-coherent

bridgeP

roxy cache ($)

Non-coherent

bridgeP

roxy cache ($)

CPU1Cache ($)

Coherent Memory Interface

Coherent Agent Interface

AI AccelCache ($)

Coherent Agent Interface

Coherent Memory Interface

Snoop Filters(s)

Snoop Filters(s)

CMC ($) CMC ($)

Acc1

Acc2

Acc3

Acc4

Acc5

Directory

Transport Interconnect

Non-coherent

bridgeP

roxy cache ($)

Non-coherent

bridgeP

roxy cache ($)

CPU1Cache ($)

Coherent Memory Interface

Coherent Agent Interface

AI AccelCache ($)

Coherent Agent Interface

Coherent Memory Interface

Snoop Filters(s)

Snoop Filters(s)

CMC ($) CMC ($)

Acc1

Acc2

Acc3

Acc4

Acc5

Directory

Transport Interconnect

Non-coherent

bridgeP

roxy cache ($)

Non-coherent

bridgeP

roxy cache ($)

CPU1Cache ($)

Coherent Memory Interface

Coherent Agent Interface

AI AccelCache ($)

Coherent Agent Interface

Coherent Memory Interface

Snoop Filters(s)

Snoop Filters(s)

CMC ($) CMC ($)

NoC Interconnect

Page 7: Interconnect for Machine Learning Jan… · 7/8/2019  · Regular Topology Generation • Simple elementary components enable building essentially any topology provided that: –

AI and ML are changing SoCs architecturesAFFECTING AT LEAST 3 LEVELS OF DEVELOPMENT

8 July 2019 Copyright © Arteris IP 2019 13

• Mesh and ring architectures

• Tiling• Many “islands” of

cache coherency

SoC Level

• Complex memory hierarchies

• Systolic arrays

• And everything in between

Processing Elements

• Interleaved HBM2 memory support

• Multiple inter-chipletcommunications standards

Off-Chip Communication

PE

Interconnect is key to architecture at all 3 levels!

Active Arteris IP Machine Learning Customers

Transportation AI / Machine Learning

ToshibaNXP

Major FPGA Company

AutomotiveSoC Maker

8 July 2019 Copyright © Arteris IP 2019 14

Major Automotive Tier-1 #2

Major ADAS System Maker

Major Automotive EV OEM

Stealth AI Company #1

Stealth AI Company #2

Page 8: Interconnect for Machine Learning Jan… · 7/8/2019  · Regular Topology Generation • Simple elementary components enable building essentially any topology provided that: –

Challenges: Determinism, Visibility, Controllability

15

• How do you verify a deep learning system?

• How do you debug the Neural Network?

• How to you trace intermediate results?

• How do you determine how ML arrives @ solution?

• How do you make a Neural Network “safe”?

• What are the ethics and biases of these systems?

Copyright © Arteris IP 20198 July 2019

Conclusion

• ML today is not particularly intelligent, but it is clever and improving– CNNs being deployed in automotive and data center acceleration SoCs

– Eventually majority of complex SoCs will contain machine learning sections

– A simple continuation of the exponential increasing processing performance trend will not break ML out of the scope of applications that it can do well today

– CNNs are non-linear curve fitting engines with a functionality ceiling

• New architectures and approaches needed for next generation of machine learning problems to deliver real AI solutions– Interconnect technology innovation is key to AI architecture evolution

8 July 2019 Copyright © Arteris IP 2019 16

Page 9: Interconnect for Machine Learning Jan… · 7/8/2019  · Regular Topology Generation • Simple elementary components enable building essentially any topology provided that: –

8 July 2019 Copyright © Arteris IP 2019 17