
Accelerator Design for Big Data Processing Frameworks

Hiroki Matsutani

Dept. of ICS, Keio University
http://www.arc.ics.keio.ac.jp/~matutani

July 5th, 2017 International Forum on MPSoC for Software-Defined Hardware (MPSoC'17)


Batch processing

[Diagram labels: Big data (Surveillance, Network service, SNS, UAV, IoT); Input stream data; Message queuing; Batch processing, Learning; BatchView; Database layer (Polyglot persistence); DB queries; Data exchange (serialization); Customer analysis, Topic prediction, Blockchain records, Geolocation query, …]

• Batch processing takes time, so recent data cannot be reflected in the analysis result
• Combine batch and stream processing to provide the realtime capability

Batch + Stream processing

[Diagram labels: Big data (Surveillance, Network service, SNS, UAV, IoT); Batch processing, Learning; BatchView; Stream processing, Inference; RealtimeView; DB queries; Customer analysis, Topic prediction, Blockchain records, Geolocation query, …]

Nathan Marz, et al., "Big Data: Principles and best practices of scalable realtime data systems", Manning Publications (2015).

Batch + Stream processing

[Diagram: the batch + stream pipeline (BatchView, RealtimeView, DB queries), annotated with software components]

• Message queuing middleware (Apache Kafka)
• Stream processing (Apache Spark Streaming)
• Batch processing (Apache Spark)
• Machine learning (Apache Spark MLlib)
• Serialization (Apache Thrift)
• KVS / Column DB (Redis, HBase)
• Document DB (MongoDB)
• Graph DB (Neo4j), graph processing
• Online machine learning (Classification, Outlier detection, Change point detection, Abnormal behavior detection)

Components range from I/O intensive to compute intensive:
• Tight integration of I/O and compute: FPGA (Four 10GbE)
• Massive parallelism: Networked GPU cluster (GPUs, Host, Switch)

Acceleration for data store


Multilevel NOSQL cache: FPGA NIC

[Diagram: Request, Reply, Add, and Result flows for three cases: NIC Cache (L1$) Hit, In-Kernel KVS Cache (L2$) Hit, and Normal DB Access. Labels: 10GbE x4; NetFPGA-10G; NetFPGA-10G Device Driver; User Space; In-Kernel KVS (Key-Value Store) Cache; 512GB; DRAM; 288MB; NIC HW Cache; Machine Learning] [IEEE HoTI’16]

Multilevel NOSQL cache:
• L1 NOSQL cache … FPGA-based hardware cache
• L2 NOSQL cache … In-kernel software cache

Tradeoffs between capacity and speed:
• L1 NOSQL cache … Very fast/efficient but small
• L2 NOSQL cache … Fast and large

Design space exploration [IEEE HoTI’16]
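The lookup order implied by the L1 / L2 hierarchy can be sketched in software. This is a minimal illustration, not the design evaluated in the cited paper: the class names, the LRU replacement policy, the promotion of L2 hits into L1, and the dict used as the backing database are all assumptions made for the example.

```python
from collections import OrderedDict


class LRUCache:
    """Fixed-capacity cache that evicts the least recently used key."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)  # mark as most recently used
        return self.items[key]

    def put(self, key, value):
        self.items[key] = value
        self.items.move_to_end(key)
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # drop the oldest entry


class MultilevelCache:
    """Small L1 (stands in for the NIC cache) in front of a larger L2
    (stands in for the in-kernel cache) in front of a backing store."""

    def __init__(self, l1_capacity, l2_capacity, database):
        self.l1 = LRUCache(l1_capacity)
        self.l2 = LRUCache(l2_capacity)
        self.database = database  # hypothetical backing DB: a plain dict

    def get(self, key):
        value = self.l1.get(key)
        if value is not None:
            return value, "L1 hit"
        value = self.l2.get(key)
        if value is not None:
            self.l1.put(key, value)  # assumed policy: promote to L1
            return value, "L2 hit"
        value = self.database.get(key)
        if value is not None:
            self.l2.put(key, value)
            self.l1.put(key, value)
        return value, "DB access"


db = {f"key{i}": i for i in range(10)}
cache = MultilevelCache(l1_capacity=2, l2_capacity=4, database=db)
trace = [cache.get(k)[1] for k in ["key1", "key2", "key3", "key1", "key3"]]
print(trace)
```

With an L1 of two entries, the second request for `key1` misses L1 (it was evicted by `key3`) but is still served from the larger L2, which is the capacity versus speed tradeoff listed above.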

FPGA NIC Cache for Blockchain

• Blockchain
– A chain of blocks, each containing transactions that are verified and shared by all the parties

1. Bob wants to send money to Alice
2. TX(Bob→Alice) is represented as a block
3. The block is broadcast to all the nodes
4. After it is verified, the block is added to the blockchain
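As background for the "chain of blocks" idea, the sketch below links each block to its predecessor through a hash, so that altering an earlier block is detectable. The slides do not describe the block format; the field names, the JSON encoding, and the use of SHA-256 here are assumptions for illustration, and consensus and transaction verification are left out entirely.

```python
import hashlib
import json


def block_hash(block):
    """SHA-256 over a canonical JSON encoding of the block."""
    encoded = json.dumps(block, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def append_block(chain, transactions):
    """Link a new block to the current tip through the tip's hash."""
    prev_hash = block_hash(chain[-1]) if chain else "0" * 64
    chain.append({"index": len(chain), "prev_hash": prev_hash,
                  "transactions": transactions})


def is_valid(chain):
    """Check that every block references the hash of its predecessor."""
    return all(chain[i]["prev_hash"] == block_hash(chain[i - 1])
               for i in range(1, len(chain)))


chain = []
append_block(chain, ["genesis"])
append_block(chain, ["Bob -> Alice: 1 coin"])
append_block(chain, ["Alice -> Carol: 1 coin"])
ok_before = is_valid(chain)

chain[1]["transactions"] = ["Bob -> Mallory: 1 coin"]  # tamper with history
ok_after = is_valid(chain)
print(ok_before, ok_after)
```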

FPGA NIC Cache for Blockchain

• IoT devices (SPV nodes)
– Cannot maintain the whole blockchain data (>100GB) due to resource limitations
• Simple payment verification
– Ask a full node to check whether a transaction of interest has been completed or not

1. Alice wants to verify TX(Bob→Alice)
2. Alice contacts a full node to verify it

• The number of IoT devices that join the blockchain will increase
• To reduce full node accesses from SPV nodes, an FPGA NIC KVS is used as a “cache” of the blockchain [HEART’17]


Acceleration for data processing

Data processing w/ Spark+GPU

[Diagram labels: User Space; Array Cache (Converted from Spark RDDs); Array Cache (RDDs); RDD 1, RDD 2, RDD 3; Value]

• Reduction & transformation of RDDs are offloaded to the GPU
– RDD reduction (count, max, reduce, …)
– RDD-to-RDD transformation (sort, filter, map, …)
• Data are stored in RDDs (i.e., distributed shared memory)
• RDDs are converted to an array structure & transferred to the GPU [HEART’16]
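The two classes of operations named above (reductions that return a value, and transformations that return a new dataset) can be illustrated on a flat array. This sketch runs on the CPU with the Python standard library only; it does not use Spark or a GPU, and the partition contents are hypothetical. It only shows the shape of the conversion from partitioned data to one contiguous array and the operations applied to it.

```python
from array import array
from functools import reduce

# Hypothetical partitions standing in for one RDD of integers.
partitions = [[5, 3, 8], [1, 9], [4, 7, 2]]

# Flatten the partitions into one contiguous typed array, the kind of
# layout that could be copied to device memory in a single transfer.
flat = array("q", (x for part in partitions for x in part))

# Reductions: array -> single value.
count = len(flat)
maximum = max(flat)
total = reduce(lambda a, b: a + b, flat)

# Transformations: array -> new array.
sorted_arr = array("q", sorted(flat))
filtered = array("q", (x for x in flat if x % 2 == 0))
mapped = array("q", (x * 10 for x in flat))

print(count, maximum, total)
print(list(sorted_arr), list(filtered), list(mapped))
```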

Data processing w/ Spark+GPUs

[Diagram labels: GPUs; PCIe card inserted in the server; 10/40GbE Switch; 10G+10G links]

• Many GPUs are directly connected to the Apache Spark server via NEC ExpEther (20Gbps)

Data processing w/ Spark+GPUs

[Diagram labels: User Space; Array Cache (Converted from Spark RDDs); Array Cache (RDDs) with RDD 1, RDD 2, RDD 3; 10GbE Switch; RDD 1 to RDD 6]

• Reduction & transformation of RDDs are offloaded to GPUs [ICPADS’16]


Acceleration for data processing

10Gbps outlier filtering: FPGA NIC

[Diagram labels: Sensor Data Explosion (Huge Size); 10GbE x4; NetFPGA-10G; Outlier (Small Size); User Space; Data Mining]

• Only anomaly-valued packets are received
• Machine learning algorithms
– Mahalanobis Distance
– Local Outlier Factor (LOF)
– K-Nearest Neighbor (KNN)

• Density-based approach to find outliers (e.g., higher LOF value when k neighbors are distant)
• All reference data are needed for density computation
• Frequently-accessed reference data clusters are cached in FPGA NIC [PDP’17]
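To make the density-based idea concrete, here is a small software sketch of the Local Outlier Factor. It is a simplified textbook version, not the hardware design from the cited paper: it takes exactly k neighbors (ties at the k-distance are not expanded), assumes all points are distinct, and uses hypothetical 2-D data and a hypothetical threshold of 1.5.

```python
import math


def lof_scores(points, k):
    """Local Outlier Factor of every point with respect to the others."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    # k nearest neighbors of each point (excluding the point itself).
    neighbors = []
    k_distance = []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: dist[i][j])
        neighbors.append(order[:k])
        k_distance.append(dist[i][order[k - 1]])

    # Local reachability density: inverse of the mean reachability distance.
    def lrd(i):
        reach = [max(k_distance[j], dist[i][j]) for j in neighbors[i]]
        return len(reach) / sum(reach)

    density = [lrd(i) for i in range(n)]

    # LOF: mean density of the neighbors relative to the point's own density.
    return [sum(density[j] for j in neighbors[i]) / (k * density[i])
            for i in range(n)]


# Hypothetical reference cluster plus one distant sample.
samples = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
scores = lof_scores(samples, k=2)

THRESHOLD = 1.5  # assumed cut-off; values near 1.0 mean "as dense as neighbors"
outliers = [p for p, s in zip(samples, scores) if s > THRESHOLD]
print(outliers)
```

Every score depends on distances to the reference points, which is why the reference data must be available wherever the filtering runs.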

Spark Streaming: FPGA NIC

[Diagram labels: 10GbE x4; NetFPGA-10G (One-at-a-time); User Space (Micro-batch); batches with an interval of, e.g., 1 sec]

• Stream processing
– One-at-a-time style
– Micro-batch style
• Spark Streaming
– Micro-batch style for compatibility w/ Spark
– Large latency
• Results of one-at-a-time processing (e.g., filter) are received in user space
• Stream processing components which can be executed in “one-at-a-time style” are offloaded to FPGA NIC [IEEE BigData WS’16]
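The latency difference between the two styles can be seen in a toy model. This sketch is only an illustration under simplifying assumptions: processing time is ignored, the arrival times and values are hypothetical, and the micro-batch model emits results exactly when the batch interval that contains the event closes.

```python
import math


def one_at_a_time(events, predicate):
    """Emit each matching event as soon as it arrives."""
    return [(t, t, v) for t, v in events if predicate(v)]


def micro_batch(events, predicate, interval):
    """Buffer events and emit matches when their batch interval closes."""
    out = []
    for t, v in events:
        if predicate(v):
            emit_time = (math.floor(t / interval) + 1) * interval
            out.append((t, emit_time, v))
    return out


def mean_latency(results):
    """Average of (emit time - arrival time) over all emitted events."""
    return sum(emit - arrive for arrive, emit, _ in results) / len(results)


# Hypothetical (arrival time in seconds, value) pairs.
events = [(0.1, 5), (0.4, 42), (0.9, 7), (1.2, 99), (1.7, 3)]


def is_large(v):
    return v > 6


fast = one_at_a_time(events, is_large)
slow = micro_batch(events, is_large, interval=1.0)
print(mean_latency(fast), mean_latency(slow))
```

Both styles emit the same filtered values; only the emit times differ, which is the latency a one-at-a-time filter avoids.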

Summary

[Diagram: the batch + stream processing pipeline (RealtimeView, BatchView, DB queries) and its software components, with the FPGA (Four 10GbE) and the networked GPU cluster (GPUs, Host, Switch)]

References (1/2)

• FPGA-based KVS accelerator
– Yuta Tokusashi, et al., "A Multilevel NOSQL Cache Design Combining In-NIC and In-Kernel Caches", IEEE Hot Interconnects 2016.
• FPGA-based Blockchain cache
– Yuma Sakakibara, et al., "An FPGA NIC Based Hardware Caching for Blockchain", HEART 2017.
• FPGA-based Spark Streaming accelerator
– Kohei Nakamura, et al., "An FPGA-Based Low-Latency Network Processing for Spark Streaming", IEEE BigData Workshops 2016.

References (2/2)

• FPGA-based machine learning accelerator
– Ami Hayashi, et al., "An FPGA-Based In-NIC Cache Approach for Lazy Learning Outlier Filtering", PDP 2017.
– Ami Hayashi, et al., "A Line Rate Outlier Filtering FPGA NIC using 10GbE Interface", ACM SIGARCH CAN (2015).
• GPU-based acceleration of Apache Spark
– Yasuhiro Ohno, et al., "Accelerating Spark RDD Operations with Local and Remote GPU Devices", ICPADS 2016.

Thank you!

Acknowledgement: This work was supported by MIC SCOPE, NEDO IoT, JSPS Kaken B, and SECOM
