heterogeneous multi-processing for sw- defined multi-tiered … · 2017. 9. 12. · heterogeneous...

54
Heterogeneous Multi-Processing for SW- Defined Multi-Tiered Storage Architectures Endric Schubert (MLE) Ulrich Langenbach (MLE) Michaela Blott (Xilinx Research) SDC, 2017

Upload: others

Post on 05-Sep-2020

8 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Heterogeneous Multi-Processing for SW- Defined Multi-Tiered … · 2017. 9. 12. · Heterogeneous Multi-Processing for Software-Defined Multi-Tiered Storage Architectures Who –

Heterogeneous Multi-Processing for SW-Defined Multi-Tiered Storage Architectures Endric Schubert (MLE) Ulrich Langenbach (MLE) Michaela Blott (Xilinx Research) SDC, 2017

Page 2: Heterogeneous Multi-Processing for SW- Defined Multi-Tiered … · 2017. 9. 12. · Heterogeneous Multi-Processing for Software-Defined Multi-Tiered Storage Architectures Who –

© MLE .

Heterogeneous Multi-Processing for

Software-Defined Multi-Tiered Storage Architectures

Who – Xilinx Research and Missing Link Electronics

Why – Multi-tiered storage needs predictable performance scalability,

deterministic low-latency and cost-efficient flexibility / programmability

What – Tera-OPS processing performance in a single-chip heterogeneous

compute solution running Linux

How – Combine “unconventional” dataflow architectures for acceleration &

offloading with Dynamic Partial Reconfiguration and High-Level Synthesis

Content

Page 2

Page 3: Heterogeneous Multi-Processing for SW- Defined Multi-Tiered … · 2017. 9. 12. · Heterogeneous Multi-Processing for Software-Defined Multi-Tiered Storage Architectures Who –

© MLE .

Page 3

Xilinx Research and Missing Link Electronics

Page 4: Heterogeneous Multi-Processing for SW- Defined Multi-Tiered … · 2017. 9. 12. · Heterogeneous Multi-Processing for Software-Defined Multi-Tiered Storage Architectures Who –

© MLE .

Page 4

Xilinx – The All Programmable Company

$2.38B FY15 revenue

>55% market segment share

3,500+ employees worldwide

20,000 customers worldwide

3,500+ patents

60 industry firsts

XILINX - Founded 1984

Headquarters

Research and Development

Sales and Support

Manufacturing

Page 5: Heterogeneous Multi-Processing for SW- Defined Multi-Tiered … · 2017. 9. 12. · Heterogeneous Multi-Processing for Software-Defined Multi-Tiered Storage Architectures Who –

© MLE .

Xilinx Research - Ireland

Page 5

Applications & Architectures

Through application-driven

technology development with

customers, partners, and

engineering & marketing

Page 6: Heterogeneous Multi-Processing for SW- Defined Multi-Tiered … · 2017. 9. 12. · Heterogeneous Multi-Processing for Software-Defined Multi-Tiered Storage Architectures Who –

© MLE .

Vision: The convergence of software and off-the-shelf programmable logic

opens-up more economic system realizations with predictable scalability!

Mission: To de-risk the adoption of heterogeneous compute technology by

providing pre-validated IP and expert design services.

Certified Xilinx Alliance Partner since 2011, Preferred Xilinx PetaLinux Design

Service Partner since 2013.

Missing Link Electronics Xilinx Ecosystem Partner

Page 6

Page 7: Heterogeneous Multi-Processing for SW- Defined Multi-Tiered … · 2017. 9. 12. · Heterogeneous Multi-Processing for Software-Defined Multi-Tiered Storage Architectures Who –

© MLE .

Missing Link Electronics Products & Services

TCP/IP & UDP/IP Network Protocol Accelerators at 10/25/50 GigE line-rate.

Patented Mixed Signal systems solutions with integrated Delta-Sigma converters in FPGA logic.

SATA Storage Extension for Xilinx Zynq All-Programmable Systems-on-Chip.

A team of FPGA and Linux engineers to support our customer’s technology projects in the USA and Europe.

Key-Value-Store Accelerator for hybrid SSD/HDD memcached and object storage.

Low-Latency Ethernet MAC form German Fraunhofer HHI.

Page 7

Page 8: Heterogeneous Multi-Processing for SW- Defined Multi-Tiered … · 2017. 9. 12. · Heterogeneous Multi-Processing for Software-Defined Multi-Tiered Storage Architectures Who –

© MLE .

Page 8

Motivation

Page 9: Heterogeneous Multi-Processing for SW- Defined Multi-Tiered … · 2017. 9. 12. · Heterogeneous Multi-Processing for Software-Defined Multi-Tiered Storage Architectures Who –

© MLE .

Software significantly impacts

latency and energy efficiency

in systems with nonvolatile

memory

However, software-defined

flexibility is necessary to fully

utilize novel storage

technologies

Hyper-capacity hyper-

converged storage systems

need more performance, but

within cost and energy

envelopes

Technology Forces in Storage

Page 9

Source: Steven Swanson and Adrian M. Caulfield, UCSD

IEEE Computer, August 2013

Page 10: Heterogeneous Multi-Processing for SW- Defined Multi-Tiered … · 2017. 9. 12. · Heterogeneous Multi-Processing for Software-Defined Multi-Tiered Storage Architectures Who –

© MLE .

CPU system performance scalability is limited

The Von Neumann Bottleneck [J. Backus, 1977]

Page 10

New Compute Architectures are needed

Page 11: Heterogeneous Multi-Processing for SW- Defined Multi-Tiered … · 2017. 9. 12. · Heterogeneous Multi-Processing for Software-Defined Multi-Tiered Storage Architectures Who –

© MLE .

CPU system performance scalability is limited

Spatial computing offers further scaling opportunity

Spatial vs. Temporal Computing

Page 11

New Compute Architectures are needed to

take advantage of this

Sequential Processing

with CPU

Parallel Processing

with Logic Gates

Source: Dr. Andre DeHon, Upenn: “Spatial vs. Temporal Computing”

Page 12: Heterogeneous Multi-Processing for SW- Defined Multi-Tiered … · 2017. 9. 12. · Heterogeneous Multi-Processing for Software-Defined Multi-Tiered Storage Architectures Who –

© MLE .

Architectural Choices for Storage Devices

Page 12

Log P E R F O R M A N C E

Lo

g

F L

E X

I B

I L

I T

Y

Lo

g P

O W

E R

D

I S

S I P

A T

I O

N

103 . . . 104

10

5 . . . 1

06

Application

Specific Signal

Processors

Digital

Signal

Processors

General

Purpose

Processors

Application

Specific

ICs

Physically

Optimized

ICs

StrongARM110

0.4 MIPS/mW TMS320C54x

3MIPS/mW ICORE

20-35 MOPS/mW

Source: T.Noll, RWTH Aachen

Field

Programmable

Devices

Tera OPS Processing Power At low Wattage

Page 13: Heterogeneous Multi-Processing for SW- Defined Multi-Tiered … · 2017. 9. 12. · Heterogeneous Multi-Processing for Software-Defined Multi-Tiered Storage Architectures Who –

© MLE .

Page 13

Use Case: Image/ Video Storage

Page 14: Heterogeneous Multi-Processing for SW- Defined Multi-Tiered … · 2017. 9. 12. · Heterogeneous Multi-Processing for Software-Defined Multi-Tiered Storage Architectures Who –

© MLE .

A Flexible All Programmable Storage Node

Page 14

Storage Node

MPSoC FPGA

Reconfigurable

Processing Network

NVMe

Drive

NVMe

Drive

NVMe

Drive Monitoring

Page 15: Heterogeneous Multi-Processing for SW- Defined Multi-Tiered … · 2017. 9. 12. · Heterogeneous Multi-Processing for Software-Defined Multi-Tiered Storage Architectures Who –

© MLE .

Storage Node

Meta Data Extraction, e.g. Image Quality Metrics

Page 15

Meta Data Extraction Network

NVMe

Drive

NVMe

Drive

NVMe

Drive Monitoring

• Partially under and over exposed

• Medium contrast

Page 16: Heterogeneous Multi-Processing for SW- Defined Multi-Tiered … · 2017. 9. 12. · Heterogeneous Multi-Processing for Software-Defined Multi-Tiered Storage Architectures Who –

© MLE .

Storage Node

Processing, e.g. Thumbnailing, Auto-Correction

Page 16

Thumbnailing

Auto Correction Network

NVMe

Drive

NVMe

Drive

NVMe

Drive Monitoring

Page 17: Heterogeneous Multi-Processing for SW- Defined Multi-Tiered … · 2017. 9. 12. · Heterogeneous Multi-Processing for Software-Defined Multi-Tiered Storage Architectures Who –

© MLE .

Storage Node

Semantic Feature Extraction, e.g. Classification

Page 17

Semantic Feature

Extraction Network

NVMe

Drive

NVMe

Drive

NVMe

Drive Monitoring

• Group of People

• Xilinx

• Outdoor

Page 18: Heterogeneous Multi-Processing for SW- Defined Multi-Tiered … · 2017. 9. 12. · Heterogeneous Multi-Processing for Software-Defined Multi-Tiered Storage Architectures Who –

© MLE .

Storage Node

Semantic Search Support

Page 18

Semantic Feature Search Network

NVMe

Drive

NVMe

Drive

NVMe

Drive Monitoring

• Group of People

• Xilinx

• Outdoor Scene

Page 19: Heterogeneous Multi-Processing for SW- Defined Multi-Tiered … · 2017. 9. 12. · Heterogeneous Multi-Processing for Software-Defined Multi-Tiered Storage Architectures Who –

© MLE .

Storage Node

Performance Metrics, e.g. Bandwidth, Latency

Page 19

Semantic Feature Search Network

NVMe

Drive

NVMe

Drive

NVMe

Drive Monitoring

• Performance Counter

• Pattern Matching

• ID Generation for Tracing

Page 20: Heterogeneous Multi-Processing for SW- Defined Multi-Tiered … · 2017. 9. 12. · Heterogeneous Multi-Processing for Software-Defined Multi-Tiered Storage Architectures Who –

© MLE .

Runtime Programmability

Page 20

Storage Node

Reconfigurable

Processing Network

NVMe

Drive

NVMe

Drive

NVMe

Drive Monitoring

Meta Data

Extraction

Semantic

Feature

Extraction

Page 21: Heterogeneous Multi-Processing for SW- Defined Multi-Tiered … · 2017. 9. 12. · Heterogeneous Multi-Processing for Software-Defined Multi-Tiered Storage Architectures Who –

© MLE .

Page 21

Architectural Concepts

Page 22: Heterogeneous Multi-Processing for SW- Defined Multi-Tiered … · 2017. 9. 12. · Heterogeneous Multi-Processing for Software-Defined Multi-Tiered Storage Architectures Who –

© MLE .

Heterogeneous compute device as a single-chip solution

Direct network interface with full accelerator for protocols

Performance scaling with dataflow architectures

Scaling capacity and cost with a Hybrid Storage subsystem

Software-defined services

Key Concepts Presented at SDC-2016

Page 22

Page 23: Heterogeneous Multi-Processing for SW- Defined Multi-Tiered … · 2017. 9. 12. · Heterogeneous Multi-Processing for Software-Defined Multi-Tiered Storage Architectures Who –

© MLE .

SDC-2016: Single-Chip Solution for Storage

Page 23

DDRx

channels DDRx

channels

M.2 NVMe

drives M.2 NVMe

drives M.2 NVMe

drives M.2 NVMe

drives

DDRx

channels DDRx

channels DDRx

channels

Data Node

FPGA fabric (PL)

Processing System with quad core 64b processors (A53)

TCP/IP stack

Petalinux

Memory management

Key Value Store Abstraction

(memcached)

NVMe interface

Memory controller

Hybrid Memory System

Network management

Router

Xilinx IP SD Services MLE IP

Page 24: Heterogeneous Multi-Processing for SW- Defined Multi-Tiered … · 2017. 9. 12. · Heterogeneous Multi-Processing for Software-Defined Multi-Tiered Storage Architectures Who –

© MLE .

SDC-2016: Hardware Accelerated Network Stack

Page 24

DDRx

channels DDRx

channels

M.2 NVMe

drives M.2 NVMe

drives M.2 NVMe

drives M.2 NVMe

drives

DDRx

channels DDRx

channels DDRx

channels

Data Node

FPGA fabric (PL)

Processing System with quad core 64b processors (A53)

TCP/IP stack

Petalinux

Memory management

Key Value Store Abstraction

(memcached)

NVMe interface

Memory controller

Hybrid Memory System

Network management

Router

TCP/IP Full Accelerator Supports 10/25/50

GigE line-rates

Page 25: Heterogeneous Multi-Processing for SW- Defined Multi-Tiered … · 2017. 9. 12. · Heterogeneous Multi-Processing for Software-Defined Multi-Tiered Storage Architectures Who –

© MLE .

Now: 10 Gbps demonstrated with a 64b data path @ 156MHz using 20% of FPGA

Next: 100 Gbps can be achieved by using a 512b @ 200MHz pipeline for example

SDC-2016: Dataflow architectures for performance scaling

Page 25

Streaming Architecture:

Flow-controlled series of processing

stages which manipulate and pass

through packets and their associated

state

Source: Blott et al: Achieving 10Gbps line-rate key-value stores with FPGAs; HotCloud 2013

Page 26: Heterogeneous Multi-Processing for SW- Defined Multi-Tiered … · 2017. 9. 12. · Heterogeneous Multi-Processing for Software-Defined Multi-Tiered Storage Architectures Who –

© MLE .

SSDs combined with DDRx channels can be used to build high

capacity & high performance object stores

Concepts and early prototype to scale to 40TB & 80Gbps key

value stores

SDC-2016: Scaling Capacity via hybrids

Page 26

Parser Hash

Lookup

Value

Store

Access

Formatter

DDRx

channels DDRx

channels

M.2 NVMe

drives M.2 NVMe

drives M.2 NVMe

drives M.2 NVMe

drives

DDRx

Hash Table DDRx

Channels

Hash Table

Value

Store

Value

Store

Hybrid MemorySystem

Source: HotStorage 2015, Scaling out to a Single-Node 80Gbps Memcached Server with 40Terabytes of Memory

Page 27: Heterogeneous Multi-Processing for SW- Defined Multi-Tiered … · 2017. 9. 12. · Heterogeneous Multi-Processing for Software-Defined Multi-Tiered Storage Architectures Who –

© MLE .

SDC-2016: Handling High Latency Accesses without Sacrificing Throughput

Read SSD Read SSD Read SSD

100usec

• Dataflow architectures: no limit to number of outstanding requests

• Flash can be serviced at maximum speed

Read SSD Read SSD Read SSD

time

Read SSD Read SSD Read SSD Read SSD Read SSD

Request

Buffer

Read SSD Read SSD Read SSD Read SSD Read SSD

Response Response Response Response Response Response Response Response Response Response Response Response Response Response Response

Cmd:

Rsp:

Read SSD Read SSD Read SSD Read SSD Read SSD Read SSD Read SSD Read SSD Read SSD Read SSD Read SSD Read SSD Read SSD

Page 27

Page 28: Heterogeneous Multi-Processing for SW- Defined Multi-Tiered … · 2017. 9. 12. · Heterogeneous Multi-Processing for Software-Defined Multi-Tiered Storage Architectures Who –

© MLE .

Software-Defined Services

Page 28

Spatial computing of additional services at no performance cost until

resource limitations are reached

Software-Defined Services Software-Defined Services

Page 29: Heterogeneous Multi-Processing for SW- Defined Multi-Tiered … · 2017. 9. 12. · Heterogeneous Multi-Processing for Software-Defined Multi-Tiered Storage Architectures Who –

© MLE .

Page 29

Software-Defined Services

Page 30: Heterogeneous Multi-Processing for SW- Defined Multi-Tiered … · 2017. 9. 12. · Heterogeneous Multi-Processing for Software-Defined Multi-Tiered Storage Architectures Who –

© MLE .

Offload engines for Linux Kernel Crypto-API

Non-intrusive latency analysis via PCIe TLP “Tracers”

Inline processing with Deep Convolutional Neural Networks

Declarative Linux Kernel Support Partial Reconfiguration

Software-Defined Services – Proof-of-Concepts

Page 30

Page 31: Heterogeneous Multi-Processing for SW- Defined Multi-Tiered … · 2017. 9. 12. · Heterogeneous Multi-Processing for Software-Defined Multi-Tiered Storage Architectures Who –

© MLE .

Crypto-API is a cryptography framework in the Linux kernel used

for encryption, decryption, compression, de-compression, etc.

Needs acceleration to support processing at higher line-rates

(100 GigE).

Open Source software implementation that follows a streaming

dataflow processing architecture

– Hardware Interface: AXI Streaming

– Software/ Hardware Interface: SG-DMA in, SG-DMA out

High-Level Synthesis generated accelerator blocks from

reference C code

Software-Defined Services - Example 1) Accelerating the Linux Kernel Crypto-API

Page 31

Page 32: Heterogeneous Multi-Processing for SW- Defined Multi-Tiered … · 2017. 9. 12. · Heterogeneous Multi-Processing for Software-Defined Multi-Tiered Storage Architectures Who –

© MLE .

System Architecture of Crypto-API Accelerator

Page 32

Page 33: Heterogeneous Multi-Processing for SW- Defined Multi-Tiered … · 2017. 9. 12. · Heterogeneous Multi-Processing for Software-Defined Multi-Tiered Storage Architectures Who –

© MLE .

Performance analysis and ongoing monitoring of bandwidth and

latency in distributed systems is difficult.

– Round-trip times

– Time-outs

– Throttling

When done in software, results get distorted by additional

compute burden.

When done in Programmable Logic, it can be (clock cycle)

accurate and non-intrusive via adding so-called “Tracers” into

the dataflow.

Software-Defined Services - Example 2) Non-Intrusive Latency Analysis via PCIe TLP Tracers

Page 33

Page 34: Heterogeneous Multi-Processing for SW- Defined Multi-Tiered … · 2017. 9. 12. · Heterogeneous Multi-Processing for Software-Defined Multi-Tiered Storage Architectures Who –

© MLE .

Tracers within PCIe Transaction Layer Packets (TLP)

– Based on addresses/ IDs, detected at PCIe switches and endpoints

– Transparent for transport layer (Ethernet, etc)

Tracer-Based Performance Analysis

Page 34

Page 35: Heterogeneous Multi-Processing for SW- Defined Multi-Tiered … · 2017. 9. 12. · Heterogeneous Multi-Processing for Software-Defined Multi-Tiered Storage Architectures Who –

© MLE .

Full implementation on network with multiple boards

Proof-of-Concept Implementation

Page 35

Page 36: Heterogeneous Multi-Processing for SW- Defined Multi-Tiered … · 2017. 9. 12. · Heterogeneous Multi-Processing for Software-Defined Multi-Tiered Storage Architectures Who –

© MLE .

Latency Monitoring WithTracers - Overview

Page 36

Page 37: Heterogeneous Multi-Processing for SW- Defined Multi-Tiered … · 2017. 9. 12. · Heterogeneous Multi-Processing for Software-Defined Multi-Tiered Storage Architectures Who –

© MLE .

Latency Monitoring with Tracers - Results

Page 37

Page 38: Heterogeneous Multi-Processing for SW- Defined Multi-Tiered … · 2017. 9. 12. · Heterogeneous Multi-Processing for Software-Defined Multi-Tiered Storage Architectures Who –

© MLE .

Deep Convolutional Neural Networks (CNN) have demonstrated

values in classification, recognition and data-mining.

However, CNN can be very compute intensive, when done at

single or double float precision.

Recent approaches involve reduced precision (INT8, or even

less), as well as dataflow-oriented compute architectures.

– Taps into tremendous compute power within Programmable Logic

What if, CNN can be run close to the data, within the storage

node?

Software-Defined Services - Example 3) Inline Processing w/ Neural Networks

Page 38

Page 39: Heterogeneous Multi-Processing for SW- Defined Multi-Tiered … · 2017. 9. 12. · Heterogeneous Multi-Processing for Software-Defined Multi-Tiered Storage Architectures Who –

© MLE .

Streaming Dataflow Processing in BNN Inference

Page 39

Courtesy “FINN: A Framework for Fast, Scalable Binarized Neural Network Inference”,

Umuroglu, Fraser, Blott et al., 25th Symp. on FPGA, 2017

Page 40: Heterogeneous Multi-Processing for SW- Defined Multi-Tiered … · 2017. 9. 12. · Heterogeneous Multi-Processing for Software-Defined Multi-Tiered Storage Architectures Who –

© MLE .

BNN Results

Page 40

Courtesy “FINN: A Framework for Fast, Scalable Binarized Neural Network Inference”,

Umuroglu, Fraser, Blott et al., 25th Symp. on FPGA, 2017

Page 41: Heterogeneous Multi-Processing for SW- Defined Multi-Tiered … · 2017. 9. 12. · Heterogeneous Multi-Processing for Software-Defined Multi-Tiered Storage Architectures Who –

© MLE .

Supports both full and partial reconfiguration of FPGAs

Adds a device tree interface for controlling the partial

reconfiguration process

Handles all FPGA internal processes

Abstract device and vendor neutral interface

Software-Defined Services – Infrastructure Linux Kernel FPGA Framework

Page 41

Page 42: Heterogeneous Multi-Processing for SW- Defined Multi-Tiered … · 2017. 9. 12. · Heterogeneous Multi-Processing for Software-Defined Multi-Tiered Storage Architectures Who –

© MLE .

Linux FPGA Framework Architecture

Page 42

Page 43: Heterogeneous Multi-Processing for SW- Defined Multi-Tiered … · 2017. 9. 12. · Heterogeneous Multi-Processing for Software-Defined Multi-Tiered Storage Architectures Who –

© MLE .

A Declarative Partial Reconfiguration Framework

Page 43

Platform: ZC706

Bitstream Size: 5.9 MiB

Overall latency: ≈135 ms

Page 44: Heterogeneous Multi-Processing for SW- Defined Multi-Tiered … · 2017. 9. 12. · Heterogeneous Multi-Processing for Software-Defined Multi-Tiered Storage Architectures Who –

© MLE .

Page 44

Conclusion & Outlook

Page 45: Heterogeneous Multi-Processing for SW- Defined Multi-Tiered … · 2017. 9. 12. · Heterogeneous Multi-Processing for Software-Defined Multi-Tiered Storage Architectures Who –

© MLE .

Trend towards unconventional architectures

– A diversification of increasingly heterogeneous devices and systems

– Convergence of networking, compute and storage within single

nodes

– CPU-only processing runs out of steam

Key concepts for demonstrating Software-Defined Services

– Offload engines for Linux Kernel Crypto-API

– Non-intrusive latency analysis via PCIe TLP “Tracers”

– Inline processing with Deep Convolutional Neural Networks

Results:

– On commercially available hardware

– Available for collaboration or in-house development

Conclusion

Page 45

Page 46: Heterogeneous Multi-Processing for SW- Defined Multi-Tiered … · 2017. 9. 12. · Heterogeneous Multi-Processing for Software-Defined Multi-Tiered Storage Architectures Who –

© MLE .

Xilinx Zynq UltraScale+ MPSoC (XCZU19EG)

– ARM Cortex A-53 quad-core, ARM Coretx R5 dual-core, 1,968 DSP slices

– 1.1 million system logic cells, 34Mbit BRAM, 36Mbit UltraRAM

– 5x PCIe Gen3/4, 4x 100GigE, 44x 16.3Gbps, 28x 32.72Gbps

Single-Chip Implementation

Page 46

Page 47: Heterogeneous Multi-Processing for SW- Defined Multi-Tiered … · 2017. 9. 12. · Heterogeneous Multi-Processing for Software-Defined Multi-Tiered Storage Architectures Who –

© MLE .

Sidewinder-100 from Fidus Systems

Accelerator IP and Linux BSP from MLE

Commercially Available Development System

Page 47

Dual m.2 SSDs

Dual

DDR4LP

SODIMM

NGFF

8643

Dual

QSFP28

PCIe Gen4 x8

PCIe Gen3 x8

Page 48: Heterogeneous Multi-Processing for SW- Defined Multi-Tiered … · 2017. 9. 12. · Heterogeneous Multi-Processing for Software-Defined Multi-Tiered Storage Architectures Who –

© MLE .

Page 48

Backup

Page 49: Heterogeneous Multi-Processing for SW- Defined Multi-Tiered … · 2017. 9. 12. · Heterogeneous Multi-Processing for Software-Defined Multi-Tiered Storage Architectures Who –

© MLE .

Binarized Neural Networks (BNN):

Training with float, CNN Inference runs at reduced precision

– Less data (Mbytes) for parameters, less compute burdon.

Reduced Precision Neural Networks

Page 49

Binar ized Neural Networks: Training Neural Networks with Weights and Activations Constrained to + 1 or − 1

Table 1. Classification test error rates of DNNs trained on MNIST (MLP architecture without unsupervised pretraining), CIFAR-10

(without data augmentation) and SVHN.Data set MNIST SVHN CIFAR-10

Binarized activations+weights, during training and test

BNN (Torch7) 1.40% 2.53% 10.15%

BNN (Theano) 0.96% 2.80% 11.40%

CommitteeMachines’ Array (Baldassi et al., 2015) 1.35% - -

Binarized weights, during training and test

BinaryConnect (Courbariaux et al., 2015) 1.29± 0.08% 2.30% 9.90%

Binarized activations+weights, during test

EBP (Cheng et al., 2015) 2.2± 0.1% - -

Bitwise DNNs (Kim & Smaragdis, 2016) 1.33% - -

Ternary weights, binary activations, during test

(Hwang & Sung, 2014) 1.45% - -

No binarization (standard results)

Maxout Networks (Goodfellow et al.) 0.94% 2.47% 11.68%

Network in Network (Lin et al.) - 2.35% 10.41%

Gated pooling (Lee et al., 2015) - 1.69% 7.62%

Figure 1. Training curves of a ConvNet on CIFAR-10 depend-

ing on the method. The dotted lines represent the training costs

(square hinge losses) and the continuous lines the corresponding

validation error rates. Although BNNs are slower to train, they

are nearly asaccurate as 32-bit float DNNs.

and AdaMax variants, which are detailed in Algo-

rithms 3 and 4, whereas in our Theano experiments,

we use vanilla BN and ADAM.

2.1. MLP on MNIST (Theano)

MNIST is an image classification benchmark dataset (Le-

Cun et al., 1998). It consists of a training set of 60K and

a test set of 10K 28 ⇥ 28 gray-scale images represent-

ing digits ranging from 0 to 9. In order for this bench-

mark to remain a challenge, we did not use any convo-

lution, data-augmentation, preprocessing or unsupervised

learning. The MLP we train on MNIST consists of 3 hid-

den layers of 4096 binary units (see Section 1) and a L2-

SVM output layer; L2-SVM has been shown to perform

better than Softmax on several classification benchmarks

Figure 2. Binary weight filters, sampled from of thefirst convolu-

tion layer. Since we have only 2k 2

unique 2D filters (where k is

the filter size), filter replication is very common. For instance, on

our CIFAR-10 ConvNet, only 42% of the filtersare unique.

(Tang, 2013; Lee et al., 2014). We regularize the model

with Dropout (Srivastava, 2013; Srivastava et al., 2014).

The square hinge loss is minimized with the ADAM adap-

tive learning rate method (Kingma & Ba, 2014). We use

an exponentially decaying global learning rate, as per Al-

gorithm 1, and also scale the learning rates of the weights

with their initialization coefficients from (Glorot & Bengio,

2010), as suggested by Courbariaux et al. (2015). We use

Batch Normalization with aminibatch of size 100 to speed

up the training. As is typical, we use the last 10K samples

of the training set as a validation set for early stopping and

model selection. We report the test error rate associated

with thebest validation error rateafter 1000 epochs (wedo

not retrain on the validation set). The results are reported

in Table 1.

2.2. MLP on MNIST (Torch7)

Weuseasimilar architecture asin our Theano experiments,

without dropout, and with 2048 binary units per layer in-

stead of 4096. Additionally, we use the shift base AdaMax

Page 50: Heterogeneous Multi-Processing for SW- Defined Multi-Tiered … · 2017. 9. 12. · Heterogeneous Multi-Processing for Software-Defined Multi-Tiered Storage Architectures Who –

© MLE .

Design automation runs scheduling and resource binding to generate

RTL code comprising data paths plus state machines for control flow

Working Principles of High-Level Synthesis

Page 50

Page 51: Heterogeneous Multi-Processing for SW- Defined Multi-Tiered … · 2017. 9. 12. · Heterogeneous Multi-Processing for Software-Defined Multi-Tiered Storage Architectures Who –

© MLE .

Automated performance

optimizations via parallelization

at dataflow level

Benefits of HLS-Based C/C++ FPGA Design

Page 51

Automatic interface synthesis

and driver code generation for

HW/SW connectivity

Page 52: Heterogeneous Multi-Processing for SW- Defined Multi-Tiered … · 2017. 9. 12. · Heterogeneous Multi-Processing for Software-Defined Multi-Tiered Storage Architectures Who –

© MLE .

Reconfiguration Performance

Page 52

Config

ura

tion

La

ten

cy

Decon

figura

tion

La

ten

cy

Page 53: Heterogeneous Multi-Processing for SW- Defined Multi-Tiered … · 2017. 9. 12. · Heterogeneous Multi-Processing for Software-Defined Multi-Tiered Storage Architectures Who –

© MLE .

Scheduling Latency - Profiling Results

53

Measurement of example system (AES accelerator on ZC706 board)

Measured latencies via ftrace function entry and exit timestamps

Bitstream Size: 5.9 MiB

Overall latency: ≈135 ms

20%

63%

16%

1%

Runtime [%]

Load Bitstream

Partial Reconfiguration

Load Platform Driver

Rest incl. Framework

Page 54: Heterogeneous Multi-Processing for SW- Defined Multi-Tiered … · 2017. 9. 12. · Heterogeneous Multi-Processing for Software-Defined Multi-Tiered Storage Architectures Who –

© MLE .

Endric Schubert

Email: [email protected]

Ulrich Langenbach

Email: [email protected]

Missing Link Electronics

www.missinglinkelectronics.com

Ph US: +1-408-475-1490

Ph GER: +49-731-141149-0

Contact

Page 54