FPGAs for High Performance Computing
Admintech 2018 – Valencia – May 9th
Francisco Perez, Field Applications Engineer, francisco.perez@intel.com
AGENDA
• FPGAs as hardware accelerators for applications
• Introduction to FPGA architecture
• Available tools and platforms -> FPGAs for programmers
• Real-world FPGA use cases in the data center
• Artificial Intelligence: benefits of FPGAs for CNN inference
Intro to FPGA
Multi-purpose accelerator engine
Programmable Solutions Group
What is an FPGA?
Definition
• An FPGA, or Field-Programmable Gate Array, is a configurable device containing thousands of digital logic blocks.
• How these blocks are connected together, and their functionality, can be implemented using a specific hardware description language.
• This array of programmable logic can reproduce anything from quite simple circuits, like a logic gate or a combinational function, up to really complex system-on-chip solutions.
• It is reprogrammable, so its functionality can be changed when needed.
What is an FPGA today?
An advanced, multi-function accelerator
• Offers greater throughput, execution speed, and energy efficiency than CPUs on computationally intensive parts of algorithms
• Adapts quickly to changes in algorithms, new standards, data patterns, and performance needs
• Can be reconfigured in the field to accelerate any algorithm
Transforming Data Centers to a Single Accelerator Architecture
Accelerator options: CPU, GPU, ASSP, ASIC, FPGA
Workloads:
• Artificial Intelligence
• Big Data Analytics (Hadoop, Spark, SQL, NoSQL)
• Video Transcoding
• NFV/SDN
• Storage Acceleration
• Security and DPI (Deep Packet Inspection)
Accelerating Key Network Functions
Many kinds of boxes: routers, firewalls, switches, special-purpose appliances
Functions: switching, security, inspection & reporting
Acceleration domains: data analytics, artificial intelligence, video transcoding, cyber security, financial acceleration, genomics
What can FPGAs do for your application?

Hyper-acceleration of Apache Spark: Data Analytics Solution
Bigstream: Spark acceleration solution
The only platform to offer seamless acceleration of Apache Spark using Intel FPGAs
• Zero code change for Spark
• Intelligent, automatic computation slicing
• Multilevel acceleration strategies
• Abstracts away the programming front end and the processor back end
• Intelligently, automatically programs the FPGA
Bigstream stack: Dataflow Adaptation Layer -> Bigstream Dataflow -> Bigstream Hypervisor
Use of FPGAs for acceleration is limited by a skill gap and a programming-model difference:
• FPGA developers lack Spark programming models and big data knowledge
• Big Data developers lack FPGA experience

Spark acceleration: Accelerating Performance
• 0 code changes or additions to queries(1)
• 8X performance acceleration(1)
1: Running the TPC-DS benchmark per Spark/SQL Business Intelligence Benchmarks. TPC-DS is a widely used industry-standard decision-support benchmark used to evaluate the performance of data processing engines. Compared to open source Apache Spark running on an Intel® Xeon® CPU E5-2650 v3 @ 2.30GHz.
Video Application – Image Processing Acceleration
Image Processing Needs and Challenges
Decoding, resizing, cropping, and encoding of image files are typical processes that need large numbers of servers. This becomes cost prohibitive.
Booming image volumes vs. CPU computational performance:
• Internet traffic is increasing by 24%* annually, and images are a large portion of internet data
• Companies are handling huge volumes of images in the data center: cloud storage, mobile instant messaging, social networking, e-commerce
• CPU performance per core is struggling to keep pace. FPGA to the rescue.
*source: Cisco VNI Forecast Highlights Tool
CTAccel Accelerates Image Processing
CTAccel Image Processing (CIP) effectively accelerates the following image processing/analytics workflows:
• Thumbnail generation/transcoding
• Image processing (sharpen/color filter)
• Image analytics
CIP includes the following FPGA-based accelerated functions:
• Decoder: JPEG
• Pixel processing: resizing/crop
• Encoder: JPEG, WebP, Lepton
Software compatibility with OpenCV, ImageMagick and Lepton

Image Processing: Accelerating Performance
• 4.9x faster JPEG to WebP(1)
• 5X lower latency(1)
1: Compared to an Intel® Xeon® E5-2630 v2 CPU, JPEG to WebP.
Database Access Acceleration: High-Velocity Cloud Data Applications
Database Access Latency Challenges
• Increasing data volumes: a flood of data from multiple sources (big data, Internet of Things (IoT), business analysis, e-commerce)
• Faster decision-making: companies are increasingly reliant on data to fuel innovation and decision making
• Real-time performance: database analytics requires real-time performance (SaaS, finance, industrial, resource management)
Cloud/relational database analytics is employed across all industries – access times impact business results.
Swarm64 Accelerates Database Access Times
Performance, scalability, and throughput optimization
A seamless plug-in that enables popular databases and supports any configuration – in the cloud or on-premise
• High-velocity data access
• Accelerates filtering, SQL-query pre-processing and de/compression
• Compatible with existing applications
• MySQL, PostgreSQL and MariaDB support (others in development)
• No change to IT infrastructure required, easy to deploy

Solving Real-World Problems: Database Acceleration
• 2X+ traditional data warehousing(2)
• 3X+ faster real-time data analytics(1)
• 10X+ storage compression(3)
1. Based on database queries run with Swarm64 acceleration vs. no acceleration. Testing performed by Swarm64.
2. Data warehousing tested with queries and data taken from the TPC-DS benchmark. Testing performed by Swarm64.
3. Based on database size run with Swarm64 acceleration vs. no acceleration. Testing performed by Swarm64.
Solution Roadmap (2017–2018)
2017: DCP 1.0 Beta -> DCP 1.0 Production -> DCP 1.1 Alpha (with network)
2018: DCP 1.1 Production (with network connectivity)
Demos: Key Value Store (Algo-Logic), PairHMM (Broad/Intel), GZIP (Accelize/CAST), SDR to HDR Conversion (Accelize/b<>com)
Production*: Key Value Store (Algo-Logic), PairHMM (Broad/Intel), SQL DB Acceleration (Swarm64), Spark Acceleration (Bigstream), AI Training & Inference (i-abra), Broadcast H.264/H.265 Codecs (SoC Technologies), Financial back testing (Levyx), Genomics GATK Pipeline (Falcon Computing); Beta: Deep Learning Acceleration Suite (Intel)
Production*: NoSQL DB Acceleration (Reniac), C/C++ to OpenCL Compiler (Falcon Computing), JPEG to WebP (CTAccel), H.264 Transcode (Adaptive Microware), Spark/Hadoop Shuffle Accel (A3Cube), High Frequency Trading (Algo-Logic), Deep Learning Acceleration Suite (Intel)
Production*: PAL/SAP Hana Acceleration (Xelera), Advanced Firewall (F5 Networks), Security NIC (Napatech), 40G TCP/IP Offload (Enyx), Real-time Financial Analytics (Velocidata), Machine Learning Compiler (Myrtle Software), High Frequency Trading (Celerix Technology)
Production: H.264 Encoder/Decoder and H.265 Decoder (IBEX), Kafka/Spark Accelerator Engine (Megh Computing), Algorithmic trading (Xcelerit), AV1 Hybrid Codec (ATEME), Oil & Gas (Senai), Risk Check Compliance (Aplicata), Hadoop/Spark Acceleration (Wasai)
* Production status by end of quarter
Basic Architecture Description

FPGA Overview
▪ Field Programmable Gate Array (FPGA)
– Millions of logic elements
– Thousands of embedded memory blocks
– Thousands of DSP blocks
– Programmable routing
– High speed transceivers
– Various built-in hardened IP
▪ Used to create custom hardware!
[Die diagram: DSP blocks, memory blocks, programmable routing switches, logic modules]
Let's zoom in.
Basic Elements
A Basic Element is a 1-bit configurable operation plus a 1-bit register to store the result. It can be configured to perform any 1-bit operation: AND, OR, NOT, ADD, SUB.
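The configurable 1-bit operation described above is essentially a small look-up table (LUT). A minimal C sketch, assuming a 4-input LUT modeled as a 16-bit truth-table mask (modern Intel FPGAs actually use fracturable 6-input adaptive logic modules; the 4-input LUT here is a simplification for illustration):

```c
#include <stdint.h>

/* A k-input LUT is a 2^k-entry truth table. Here: 4 inputs, so the
 * "configuration" is a 16-bit mask chosen when the FPGA is programmed. */
typedef uint16_t lut4_config;

/* Evaluate the LUT: the four input bits form an index that selects one
 * bit of the mask. */
static inline int lut4(lut4_config mask, int a, int b, int c, int d) {
    int index = (a & 1) | ((b & 1) << 1) | ((c & 1) << 2) | ((d & 1) << 3);
    return (mask >> index) & 1;
}

/* Two example configurations: identical hardware, different masks. */
#define LUT4_AND4 ((lut4_config)0x8000) /* a AND b AND c AND d           */
#define LUT4_XOR2 ((lut4_config)0x6666) /* a XOR b (inputs c, d ignored) */
```

The point of the model: "reprogramming" the FPGA changes only the mask bits, never the evaluation hardware.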
Flexible Interconnect
Wider custom operations are implemented by configuring and interconnecting Basic Elements.

Custom Operations Using Basic Elements
Examples: a 16-bit add, a 32-bit square root, or your custom 64-bit bit-shifter and encoder – each built from interconnected Basic Elements.
Memory Blocks
Each memory block holds 20 Kb and exposes addr, data_in and data_out ports. Blocks can be configured and grouped using the interconnect to create various cache architectures: lots of smaller caches, or a few larger caches.
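The grouping arithmetic can be sketched in a few lines of C. This assumes a hypothetical block aspect ratio of 512 words x 32 bits; real devices support several configurable aspect ratios per block, so treat the numbers as illustrative only:

```c
/* Rough sizing sketch: each embedded memory block holds 20 Kb of data.
 * Assume a block can be configured as 512 words x 32 bits. Blocks are
 * grouped by the interconnect: in parallel for wider words, stacked
 * (plus a little addressing logic) for deeper memories. */
enum { BLOCK_DEPTH = 512, BLOCK_WIDTH = 32 };

static int div_round_up(int a, int b) { return (a + b - 1) / b; }

/* Number of blocks needed for a depth x width on-chip buffer. */
int blocks_needed(int depth, int width) {
    int wide = div_round_up(width, BLOCK_WIDTH); /* parallel group */
    int deep = div_round_up(depth, BLOCK_DEPTH); /* depth group    */
    return wide * deep;
}
```

For example, a 1024 x 64-bit buffer needs 2 blocks in width times 2 in depth, i.e. 4 blocks.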
Floating Point Multiplier/Adder Blocks
Dedicated floating point multiply and add blocks, with data_in and data_out ports.

Configurable Routing
Blocks are connected into a custom datapath that matches your application.

Configurable IO
The custom datapath can be connected directly to custom or standard IO interfaces for inline data processing: PCIe, network interfaces, cameras, disk drives.
Traditional FPGA Design Entry
▪ Used by hardware designers only
▪ Circuits described using Hardware Description Languages (HDL) such as VHDL or Verilog
▪ A designer must describe the behavior of the algorithm to create a low-level digital circuit
– Logic, Registers, Memories, State Machines, etc.
▪ Complete design times of up to several months!

always @(a or b or c or d or sel)
begin
  case (sel)
    2'b00: mux_out = a;
    2'b01: mux_out = b;
    2'b10: mux_out = c;
    2'b11: mux_out = d;
  endcase
end

[Schematic: 4-to-1 multiplexer with inputs a, b, c, d, a 2-bit sel, and output mux_out]
FPGA High Level Design with OpenCL™
Goal: Design FPGA custom hardware with C-based software language
▪ Benefits
– Makes FPGA acceleration available to software engineers
– Debug and optimize in a software-like environment
– Significant productivity gains compared to hardware-centric flow
– Easier to perform design exploration
– Abstracts away FPGA design flow and FPGA hardware
__kernel void _foo (__global float *x) {
int i …
}
*OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission of Khronos
Pipeline Generation for FPGAs
Why does a software designer want an FPGA?
A simple program

OpenCL code:
kernel void
add( global int* Mem ) {
  ...
  Mem[100] += 42*Mem[101];
}

Instruction level (IR):
add:
R0 <- Load Mem[100]
R1 <- Load Mem[101]
R2 <- Load #42
R2 <- Mul R1, R2
R0 <- Add R2, R0
Store R0 -> Mem[100]
Why execute your program on an FPGA over a CPU?

A simple 3-address CPU
[Figure: datapath with instruction fetch, program counter, register file (Aaddr/Baddr/Caddr), ALU, and load/store units]

The CPU runs the program one instruction at a time on this shared datapath:
1. Load memory value into register: R0 <- Load Mem[100]
2. Load memory value into register: R1 <- Load Mem[101]
3. Load immediate value into register: R2 <- Load #42
4. Multiply two registers, store result in register: R2 <- Mul R1, R2
5. Add two registers, store result in register: R0 <- Add R2, R0
6. Store register value into memory: Store R0 -> Mem[100]

CPU activity, step by step: each instruction occupies the ALU and datapath in turn, so the six instructions execute strictly sequentially over time.
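The sequential execution model above can be captured in a few lines of C. This is a hypothetical minimal simulator of the walkthrough, one step per instruction, not a model of any real processor:

```c
#include <stdint.h>

/* Minimal model of the sequential walkthrough: a 3-entry register file,
 * a flat memory, and the six instructions executed one per step. */
int32_t mem[256];
int32_t reg[3]; /* R0, R1, R2 */

/* Returns the number of sequential steps taken. */
int run_program(void) {
    int steps = 0;
    reg[0] = mem[100];        steps++; /* R0 <- Load Mem[100]  */
    reg[1] = mem[101];        steps++; /* R1 <- Load Mem[101]  */
    reg[2] = 42;              steps++; /* R2 <- Load #42       */
    reg[2] = reg[1] * reg[2]; steps++; /* R2 <- Mul R1, R2     */
    reg[0] = reg[2] + reg[0]; steps++; /* R0 <- Add R2, R0     */
    mem[100] = reg[0];        steps++; /* Store R0 -> Mem[100] */
    return steps;
}
```

Whatever the data, the shared datapath costs six steps per execution of this kernel; that fixed sequential cost is what the FPGA mapping removes.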
How about the FPGA?
An FPGA does not have a fixed architecture; it is massively parallel and configurable. You should not limit yourself to fixing the architecture and executing instructions sequentially – instead, try to execute the instructions in parallel.
Unroll the CPU Hardware and Specialize by Position
Lay out one copy of the datapath per instruction, then specialize each copy:
1. Instructions are fixed – remove "Fetch".
2. Remove unused ALU ops.
3. Remove unused Load/Store units.
4. Wire up the registers properly, and propagate state.
5. Remove dead data.
6. Reschedule!
FPGA Custom Hardware
Custom Datapath: Your algorithm, in silicon!
▪ Typically creates a very deeply pipelined version of a kernel
– Huge number of operations simultaneously in flight
▪ Data can more easily be localized on chip
Build exactly what you need:
• Operations
• Data widths
• Memory size & configuration
Efficiency: throughput / latency / power
[Figure: the high-level code Mem[100] += 42 * Mem[101] mapped to a custom datapath of two loads, a multiply by the constant 42, an add, and a store]
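A back-of-envelope model shows why deep pipelining pays off. The cycle counts below are notional assumptions chosen for illustration, not measurements of any device:

```c
/* Illustrative throughput model. A CPU re-uses one datapath, spending
 * several cycles per work item; a pipelined FPGA datapath pays a one-time
 * fill latency and then retires one item per cycle. */
long cpu_cycles(long items, int cycles_per_item) {
    return items * cycles_per_item;
}

long pipeline_cycles(long items, int pipeline_depth) {
    /* First result appears after 'pipeline_depth' cycles, then one
     * result per cycle while the pipeline stays full. */
    return items == 0 ? 0 : pipeline_depth + (items - 1);
}
```

With, say, 6 cycles per item on the CPU and a 100-stage pipeline, a million items cost 6,000,000 cycles sequentially but only about 1,000,099 cycles pipelined: the fill latency becomes irrelevant at scale.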
Summary
▪ FPGAs are composed of millions of reconfigurable logic elements, memory blocks and DSP blocks.
▪ Algorithms can be implemented in hardware by chaining blocks together through a programmable interconnection matrix.
▪ FPGAs are an ideal solution for building high-performance processing datapaths that offload general-purpose processors.
▪ FPGAs provide a flexible, deterministic, low-latency, high-throughput, and energy-efficient solution for accelerating workloads.
Accelerating Workloads with Intel® Xeon® CPUs and FPGAs
Network Platforms Group
Data Movement and Processing Explosion
A hyper-connected world creates high-performance computing demands:
▪ 5G wireless and high-speed wireline links bring data to the data center at ever-increasing rates
▪ >20B connected devices by 2020‡
▪ Big data processing and analytics: an explosion in data processing needs across network, storage and compute
▪ Processing must be done within a fixed space and power budget – data centers cannot grow unbounded
▪ By leveraging Intel accelerators, like FPGAs, these processing needs can be addressed
‡ Source: "Gartner Says 8.4 Billion Connected 'Things' Will Be in Use in 2017, Up 31 Percent From 2016", 2/7/2017, http://www.gartner.com/newsroom/id/3598917 (Table 1 - IoT Units Installed Base by Category, 2020 column – Grand Total, including consumer+business units)
Intel® Xeon® Scalable Processor Family Acceleration Options
• Intel® Xeon® CPU (general purpose): general-purpose workloads (AVX-512); product focus on software and instruction acceleration; software flexibility
• Intel® FPGA (optimized): flexible and versatile set of algorithmic workloads; product focus on low-latency, high-bandwidth, parallel stream processing and custom acceleration; hardware flexibility
• Intel® QuickAssist Technology (optimized): standard cryptography and compression; fixed hardware acceleration
Intel® FPGA Data Center Platform Options
Enabled by the Acceleration Stack for Intel® Xeon® CPU with FPGAs

PCIe Acceleration Cards (PCIe Gen3 x8)
System flexibility with Intel Xeon CPU SKU options; can be slotted into 1U servers.

Intel® Programmable Acceleration Card (PAC) with Arria® 10 GX FPGA
• Versatile workload acceleration: customizable hardware architecture using Arria® 10 GX FPGAs
• High performance: 1150K logic elements available with 53 Mb of embedded memory; 8 GB DDR4 memory with ECC (2 banks) at 2133 Mbps
• High data ingestion and lower latency: PCIe x8 Gen3 electrical, x16 mechanical*; 1x QSFP with 4x 10GbE or 40GbE support
• Low power in a small form factor: 70W TDP (45W FPGA); 650 LFM at Tla 55°C – passively cooled; 1 RU, as small as ½ length, ½ height
Sampling today; general availability 1H2018.

Next-Generation PACs & Platforms, powered by Intel® Xeon® CPU with FPGAs
• Higher performance, increased connectivity, more integration options
• Application & IP migration to multiple platforms
HW Accelerator Execution Model
The host machine runs a software application that uses the accelerator; the host configures the HW accelerator device and communicates with it over a PCIe link.
New Cloud-Scale Services with FPGA in the Data Center
Software-Defined Infrastructure: compute, storage and network resource pools, orchestrated for secure public and private cloud users by FPGA-enabled orchestration software.
• Workload accelerators – end-user developed, Intel developed, or 3rd-party developed – are published to an Accelerator Store
• To launch a workload, orchestration pulls the accelerator from the library, allocates a compute unit, and programs the FPGA statically or dynamically
• Virtualized Intel® Xeon® processor VMs run workloads 1..N alongside their FPGA accelerators
Microsoft Azure
A faster, more efficient, more intelligent cloud
• Data explosion: from 4.4 ZB in 2013 to 44 ZB in 2020 (Source: IDC 2014)
• ML, DNN and AI are driving requirements up faster: autonomous decision making, real-time insights into connected devices, interactive user experiences
• Cloud-scale services: searches and recommendations (indexing the Internet!)
• The need for SCALE, LOW LATENCY, and THROUGHPUT
WCS Gen4.1 Blade with NIC and Catapult FPGA
• Catapult v2 mezzanine card: management fabric, hardware (FPGA), and a super-low-latency network, with QSFP ports and 40Gb/s links to the ToR switch
• Traditional software (CPU) server plane: CPUs connected over QPI, running workloads such as web search ranking, with the FPGA inline between the NIC and the network
• Hardware acceleration plane: interconnected FPGAs form a separate plane of computation that can be managed and used independently from the CPU
• Accelerated workloads: web search ranking, deep neural networks, SDN offload, SQL
https://insidehpc.com/2018/04/cray-build-fpga-accelerated-supercomputer-paderborn-university/
FPGAs for Every Programmer
Software Developers are the New FPGA Developers
"I don't speak FPGA! What is the programming model, and where are the compilers, libraries and tools I am used to?"
Intel® investment in board design & qualification, software development, and FPGA accelerator development democratizes FPGA acceleration.
Acceleration stack layers (top to bottom):
• Applications / orchestration: rack scale design, orchestration and rack-level management
• Vertical software frameworks/libraries (DL, networking, genomics, etc.): Intel® DAAL, Intel® MKL, Intel® MKL-DNN, Intel® DL Deployment Toolkit
• Intel Xeon FPGA acceleration libraries and IP libraries: DLA, GEMM, VirtIO, pHMM, compression, encryption, etc.
• Open Programmable Acceleration Engine (OPAE software API): drivers, virtualization, APIs, acceleration engine; Intel FPGA SDK for OpenCL™, Intel Quartus® Prime
• FPGA images: loadable AFU image (.gbs) plus the FPGA Interface Manager (FIM)
• Operating systems: OS enablement for Linux and Windows
• Hardware: FPGA platforms (Programmable Acceleration Cards)
✓ A common infrastructure of FPGA HW & SW tool chains that simplifies the FPGA programming model for user applications in deep learning, networking, genomics, etc.
What is the Acceleration Stack for Xeon with FPGA?
It connects an ecosystem of FPGA workloads, application & FPGA development, and FPGA deployment & management – serving data center operators, integrated services vendors, HW & SW developers, and end application users.
Out-of-Box Flow for the Acceleration Stack
Deployment flow (end application user):
1. Buy a server with a PAC
2. Install the server OS
3. Download & install the Deployment Package of the Acceleration Stack (Intel website)
4. Download & install the workload (vendor website)
Development flow (HW & SW developer):
1. Download & install the Developer Package of the Acceleration Stack
2. Download & install the simulator, and HLS or OpenCL tools (optional)
3. Create & simulate the workload
4. Write the host application
How Can FPGA Accelerators Be Created?
An Accelerator Functional Unit (AFU) can be:
• Self-developed in VHDL or Verilog – performance optimized
• Self-developed in a C/C++ programming language – higher productivity, via the Intel® HLS Compiler or the Intel® FPGA SDK for OpenCL™
• Externally sourced – a contracted engagement with an ecosystem partner, or Intel® reference designs
Components of the Acceleration Stack for Xeon with FPGA: Overview
Software (Intel® Xeon®):
• Application: developed by the user
• Libraries: user, Intel, and 3rd party
• Open Programmable Acceleration Engine (OPAE): provided by Intel
• PCIe* drivers: provided by Intel
Hardware (FPGA):
• Accelerator Functional Unit (AFU): user, Intel, or 3rd-party IP; plugs into the AFU slot
• FPGA Interface Manager (signal bridge and management): provided by Intel
• FPGA platforms (Programmable Acceleration Cards): qualified and validated for volume deployment; provided by OEMs
Components of the Acceleration Stack: FPGA Interface Manager (FIM)
The FIM is the Intel-provided signal bridge and management layer inside the FPGA; it simplifies the use of FPGAs by presenting a standard platform to the AFU.
How Accelerator Functions Interface to the FPGA Interface Manager
• The FPGA Interface Unit (FIU) inside the FIM contains a PCIe Gen3 x8 hard IP controller running at 400 MHz
• CCI-P, a 512-bit bidirectional data path, is the standard framework and abstraction layer for AFU integration with the Acceleration Stack
• The user accelerator logic (e.g. matrix multiply) sits in the Accelerator Function Unit (AFU), with 400 MHz, 200 MHz, 100 MHz and user clocks (Usr_Clk, Usr_Clk/2) available
• Avalon-MM master/slave interfaces connect the AFU to two SDRAM bank interfaces (512-bit at 267 MHz), each backed by a DIMM (64-bit with ECC at 1067 MHz)
• The AFU interfaces to the Xeon host via the common API drivers (OPAE)
Components of the Acceleration Stack: Open Programmable Acceleration Engine (OPAE)
OPAE is the Intel-provided software layer sitting between applications/libraries and the PCIe drivers; it simplifies the use of FPGAs.
OPAE: Simplified FPGA Programming Model for Application Developers
Stack: applications, frameworks and Intel® acceleration libraries call the FPGA API (C) for enumeration, management and access; the API sits on FPGA drivers (a physical function – PF – on bare metal, a virtual function – VF – in a virtual machine, plus a common driver for the AFU, local memory and HSSI) on top of the OS or hypervisor and the FPGA hardware + Interface Manager.
▪ Consistent API across product generations and platforms; abstracts hardware-specific FPGA resource details
▪ Designed for minimal software overhead and latency: a lightweight user-space library (libfpga)
▪ Open ecosystem for industry and the developer community; licensing: FPGA API (BSD), FPGA driver (GPLv2); the FPGA driver is being upstreamed into the Linux kernel
▪ Supports both virtual machines and bare-metal platforms
▪ Faster development and debugging of accelerator functions with the included AFU Simulation Environment (ASE)
▪ Includes guides, command-line utilities and sample code
Start developing for Intel FPGAs with OPAE today: http://01.org/OPAE
What an FPGA Accelerator Looks Like to Application Software
From the OS's point of view:
▪ The FPGA hardware appears as a regular PCIe device
▪ The FPGA accelerator appears as a set of features accessible by software programs running on the host
Unified C API model:
▪ Resource management and orchestration services in a data center use it to discover and select FPGA resources and organize them for use by workloads
The architecture supports the Single Root I/O Virtualization (SR-IOV) PCIe extension, enabling host software to access the accelerator:
▪ Via a hypervisor/VMM (virtual function)
▪ Bypassing the VMM/hypervisor (physical function)
[Diagram: user application software, orchestration services and application libraries run on the operating system, with drivers, hypervisor and OPAE below, down to the AFU in the FPGA]

www.intel.com/fpgaaccelerationhub
Intel® portal for all things related to FPGA acceleration:
• Acceleration Stack for Intel® Xeon® with FPGAs
• FPGA acceleration platforms
• Acceleration solutions & ecosystem
• Knowledge center
• FPGA as a service
• Academia
• 01.org* (* 01.org is an open source community site)
Summary
▪ FPGAs provide a flexible, deterministic, low-latency, high-throughput, and energy-efficient solution for accelerating workloads
▪ Intel® Programmable Acceleration Cards are PCIe cards already certified for servers
▪ The Acceleration Stack simplifies FPGA adoption for software programmers
▪ There is a growing list of ready-to-use workload accelerators that solve real use cases
▪ Intel® provides high-level synthesis tools to develop HW accelerators for custom needs
Deep Learning for Intel FPGAs
High Performance and Custom Inference

Amazing new capabilities: a car -> a black car -> a Volkswagen Passat -> its license plate number -> not the owner!! People detection, people tracking, and analyzing behavior/intentions (e.g. spotting a thief).
Challenges, Markets and Applications
Deployment spans the edge, gateway/fog, and data center/cloud, across image, audio, speech, text and NLP workloads, in markets including medical imaging, automotive, industrial, digital surveillance, smart city and smart classroom.
• Data center applications require efficient, low-latency compute across multiple nodes and a diverse set of workloads including image, speech, and text.
• Digital surveillance solutions need to support many input cameras and provide real-time, low-latency identification of specific people, faces, vehicle plates & gestures.
• Location-aware applications require real-time detection and identification of objects using a variety of input sensors and hybrid/heterogeneous processing.
The Full System: Edge Gateway to Data Center
Moving more analytics to the edge gives faster response time and more controllability at the edge, needs less bandwidth, and requires less storage.
Multiple Approaches to AI
Machine learning: how do you engineer the best features? An N x N image is reduced to hand-crafted features (f1, f2, ..., fK: roundness of face, distance between eyes, nose width, eye socket depth, cheek bone structure, jaw line length, etc.), which are fed to a classifier algorithm (SVM, random forest, naive Bayes, decision trees, logistic regression, ensemble methods) to recognize, e.g., "Arjun".
Deep learning: how do you guide the model to find the best features? The N x N image is fed directly to a neural network that learns the features itself.
Deep Learning: Training vs. Inference
Training: forward and backward passes over lots of labeled data (human, bicycle, strawberry) iteratively adjust the model weights to reduce the error.
Inference: a forward pass with the trained weights answers questions like "bicycle?" or "strawberry?" on new data.
Did you know? In most cases, training requires a very large data set and a deep neural network (i.e. many layers) to achieve the highest accuracy – accuracy grows with data set size.
End-to-End AI Compute
Intel offers silicon across the spectrum, from data center/workstation training and inference to gateway/edge deployment: NNP for intensive and mainstream training and higher inference throughput, flexible acceleration for mainstream and real-time inference, custom inference (e.g. autonomous driving), vision at 1-20W, speech/audio at 1-100+ mW, and the Intel GNA (IP) for low-power inference.
All products, computer systems, dates, and figures are preliminary based on current expectations, and are subject to change without notice.
Intel® AI Portfolio
• Experiences
• Frameworks: Mlib, BigDL, Intel® Nervana™ Graph
• Tools: Intel® Deep Learning Deployment Toolkit, Intel® Computer Vision SDK (visual intelligence), Intel Python Distribution
• Libraries: Intel® DAAL, Intel® Math Kernel Library (MKL, MKL-DNN), Intel® FPGA DL Acceleration Suite, Associative Memory Base
• Hardware: compute, memory & storage, networking, and more*
Design Flow with Machine Learning
Data collection -> data store -> choose network -> train network -> inference engine
• Choose the network topology: use a framework (e.g. Caffe, TensorFlow)
• Train the network: a high-performance computing (HPC) workload over a large dataset – a weeks-to-months process of architecture and parameter selection
• Inference engine (FPGA focus): the implementation of the neural network performing real-time inferencing
• Improvement strategies: collect more data, improve the network
Deep Learning Topology Processing (Vision: CNNs)
A typical pipeline: pre-processing (resize/crop the image) -> the neural net "body", where most of the compute is and which the Intel FPGA Deep Learning Acceleration Suite accelerates -> multiple "heads" (1, 2, ..., 10) producing features, feature vectors for indexing, tags, and object detection -> post-processing.
Intel® Computer Vision SDK & Components
What's inside the Intel® Computer Vision SDK:
• Intel® Deep Learning Deployment Toolkit: the Model Optimizer (convert & optimize trained models into IR) and the Inference Engine (optimized inference), targeting CPU, GPU, FPGA and VPU
• Traditional computer vision for Intel CPU/CPU with integrated graphics: optimized computer vision libraries (OpenCV*, OpenVX*)
• Drivers & runtimes: OpenCL™ for Intel® Integrated Graphics, the Intel® Media SDK (open source version) to increase processor graphics performance (Linux* only), and the FPGA RunTime Environment (RTE) with bitstreams (from the Intel® FPGA SDK for OpenCL™; Linux for FPGA only)
IR = Intermediate Representation format
GPU = Intel CPU with integrated graphics processing unit / Intel® Processor Graphics; VPU = Intel® Movidius™ Vision Processing Unit
OpenVX and the OpenVX logo are trademarks of the Khronos Group Inc. OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission of Khronos.
Intel® Deep Learning Deployment Toolkit: Take Full Advantage of the Power of Intel® Architecture
Trained models from Caffe, TensorFlow or MxNet are converted & optimized by the Model Optimizer into an Intermediate Representation (IR) that fits all targets; the Inference Engine then loads the IR and infers through per-device plugins (CPU, GPU, FPGA, Myriad), each with its own extensibility mechanism (C++, OpenCL™, etc.).
Model Optimizer
▪ What it is: a preparation step that imports trained models
▪ Why important: it optimizes for performance/space with conservative topology transformations; the biggest boost comes from conversion to data types matching the hardware.
Inference Engine
▪ What it is: a high-level inference API – a common API (C++) for optimized cross-platform inference
▪ Why important: the interface is implemented as dynamically loaded plugins for each hardware type, delivering the best performance for each type without requiring users to implement and maintain multiple code pathways.
OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission of Khronos.
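The "one API, per-device plugins" idea above can be sketched with a C function-pointer table. All names here are hypothetical stand-ins, not the toolkit's actual API, and the "networks" are trivial placeholders:

```c
#include <stddef.h>
#include <string.h>

/* Sketch of a plugin architecture: one common entry point, per-device
 * backends selected at run time by name. */
typedef struct {
    const char *device; /* "CPU", "FPGA", ... */
    float (*infer)(const float *in, size_t n);
} plugin;

static float cpu_infer(const float *in, size_t n) {
    float acc = 0.0f; /* trivial stand-in for a real network */
    for (size_t i = 0; i < n; i++) acc += in[i];
    return acc;
}

static float fpga_infer(const float *in, size_t n) {
    return cpu_infer(in, n); /* same result, different engine */
}

static const plugin plugins[] = {
    { "CPU",  cpu_infer  },
    { "FPGA", fpga_infer },
};

/* Common entry point: application code never changes when the target does. */
float infer_on(const char *device, const float *in, size_t n) {
    for (size_t i = 0; i < sizeof plugins / sizeof *plugins; i++)
        if (strcmp(plugins[i].device, device) == 0)
            return plugins[i].infer(in, n);
    return -1.0f; /* unknown device */
}
```

Swapping `"CPU"` for `"FPGA"` retargets the same call site, which is the portability property the toolkit's plugin design provides.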
Improve Performance with Model Optimizer

▪ Easy-to-use, Python*-based workflow that does not require rebuilding frameworks.
▪ Imports models from various frameworks (Caffe*, TensorFlow*, MXNet*; more are planned).
▪ More than 100 models validated for Caffe, MXNet, and TensorFlow.
▪ Caffe is not required to generate IRs for models consisting of standard layers, or when the user provides custom layer implementations.
Flow: Trained Model → Model Optimizer (analyze, quantize, optimize topology, convert) → Intermediate Representation (IR) file.
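The payoff of the quantize step (converting to data types matching the hardware, such as the FP16 used in the FPGA results later in this deck) can be sketched with Python's built-in half-precision support. This is an illustration of the precision trade-off only, not the Model Optimizer's actual implementation:

```python
import struct

def to_fp16(x: float) -> float:
    """Round-trip a float through IEEE 754 half precision ('e' format)."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

# Example weights: FP16 halves the storage and memory bandwidth per
# value while keeping roughly 3 significant decimal digits.
weights = [0.123456789, -1.000244, 3.14159265, 6.1e-5]
converted = [to_fp16(w) for w in weights]
errors = [abs(w - c) for w, c in zip(weights, converted)]

for w, c, e in zip(weights, converted, errors):
    print(f"fp32={w:+.9f}  fp16={c:+.9f}  abs err={e:.2e}")
```

For inference workloads this small rounding error usually has a negligible effect on accuracy, which is why lower-precision data types are such an attractive optimization target.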
Optimal Model Performance Using the Inference Engine

Plug-in architecture: applications and services call the Inference Engine Common API; the Inference Engine Runtime dispatches to a device plugin:

▪ Intel® Math Kernel Library (MKL-DNN) plugin → intrinsics → CPU: Intel® Xeon®/Core™/Atom®
▪ clDNN plugin → OpenCL™ → Intel® Integrated Graphics (GPU)
▪ FPGA plugin → DLA → Intel® FPGA
▪ Movidius plugin → Movidius API → Movidius™ Myriad 2
▪ Simple & Unified API for Inference across all Intel® architecture (IA)
▪ Optimized inference on large IA hardware targets (CPU/GEN/FPGA)
▪ Heterogeneity support allows execution of layers across hardware types
▪ Asynchronous execution improves performance
▪ Future-proof: scale your development for future Intel® processors
Transform Models & Data into Results & Intelligence
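The idea behind the plugin architecture, one common API whose per-device backends are loaded at run time, can be sketched as a minimal registry. Names such as `register_plugin` and `infer` are invented for illustration; they are not the Inference Engine API:

```python
from typing import Callable, Dict, List

# Hypothetical plugin registry: each device name maps to a callable
# that runs inference on a batch. Real plugins would wrap MKL-DNN,
# clDNN, the DLA runtime, or the Movidius API.
PLUGINS: Dict[str, Callable[[List[float]], List[float]]] = {}

def register_plugin(device: str, fn: Callable[[List[float]], List[float]]) -> None:
    PLUGINS[device] = fn

def infer(device: str, batch: List[float]) -> List[float]:
    """Common API: the same call regardless of which hardware executes."""
    if device not in PLUGINS:
        raise ValueError(f"no plugin registered for {device!r}")
    return PLUGINS[device](batch)

# Stand-in "plugins" that happen to compute the same result through
# different (here trivial) code paths.
register_plugin("CPU", lambda b: [x * 1.0 for x in b])
register_plugin("FPGA", lambda b: [x * 1.0 for x in b])

print(infer("CPU", [1.0, 2.0]))   # application code is identical
print(infer("FPGA", [1.0, 2.0]))  # when retargeting hardware
```

This is what lets one application retarget hardware by changing a device string instead of maintaining multiple code pathways.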
Intel® FPGA DLA Suite Usage

• Supports common software frameworks (Caffe, TensorFlow)
• Intel DL software stack (the Intel Deep Learning Deployment Toolkit's Model Optimizer and Inference Engine) provides graph optimizations
• Intel FPGA Deep Learning Acceleration Suite provides turn-key or customized CNN acceleration for common topologies

Flow: standard ML frameworks → Intel Deep Learning Deployment Toolkit → heterogeneous CPU/FPGA deployment on an Intel® Xeon® processor plus Intel® FPGA.

Pre-compiled graph architectures ship as optimized templates (GoogleNet, ResNet, SqueezeNet, VGG) plus additional generic CNN templates; hardware customization is supported. The optimized acceleration engine comprises a Conv PE array, crossbar, feature map cache, memory reader/writer, config engine, and DDR interfaces.
Machine Learning on Intel® FPGA Platform

Acceleration Stack platform solution, from application down to hardware:

▪ Application and ML framework (Caffe*, TensorFlow*)
▪ DL Deployment Toolkit
▪ Software stack: DLA Runtime Engine and OpenCL™ Runtime (Acceleration Stack)
▪ Hardware platform & IP: DLA workload and BBS on PAC family boards, attached to an Intel® Xeon® CPU
For more information on the Acceleration Stack for Intel® Xeon® CPU with FPGAs on
the Intel® Programmable Acceleration Card, visit the Intel® FPGA Acceleration Hub
Increase Deep Learning Performance on Public Models Using the Intel® Computer Vision SDK, and Even More with FPGA Accelerator Cards (Frames Per Second, FPS)

Baseline: out-of-box Caffe* framework. Each value is a multiple of how much faster than baseline the model runs:

Public model     Batch  OpenCV* optimized  Intel® CV SDK  Intel CV SDK w/   Intel CV SDK on Intel®
                 size   (non-Intel)        on CPU         FP16¹             Arria® 10-1150GX FPGA
Squeezenet* 1.1  1      4.27x              7.03x          4.39x             16.51x
Vgg16*           1      1.83x              2.39x          4.32x             5.57x
GoogLeNet* v1    1      3.37x              6x             6.11x             16.89x
SSD 300*         1      1.85x              2.66x          4.54x             8.61x
Squeezenet* 1.1  32     4.22x              5.95x          7.52x             19.91x
Vgg16*           32     1.91x              2.64x          4.35x             8.08x
GoogLeNet* v1    32     3.48x              5.77x          7.11x             18.81x
SSD 300*         32     1.89x              2.72x          3.87x             8.87x

Optimize with Intel tools, or offload to Intel® Iris™ Pro Graphics or to an Intel® FPGA. The Intel Computer Vision SDK accelerates deep learning models on Intel hardware by exploiting hardware parallelism: faster results with less work.
CNN Computation in One Slide

Inew(x, y) = Σ_{x'=-1}^{1} Σ_{y'=-1}^{1} Iold(x + x', y + y') × F(x', y')

A filter (a 3D volume) is applied across the input feature map (a set of 2D images) to produce one output image; repeat for multiple filters to create multiple "layers" (channels) of the output feature map.
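The formula above is an ordinary sliding-window correlation with a 3×3 filter. A direct, unoptimized sketch for a single 2D slice (indexing images as [row][column]; the valid output shrinks by the one-pixel border):

```python
def conv3x3(image, filt):
    """Inew(y, x) = sum over dy, dx in [-1, 1] of
    Iold(y + dy, x + dx) * F(dy, dx), over the valid interior."""
    h, w = len(image), len(image[0])
    out = []
    for y in range(1, h - 1):
        row = []
        for x in range(1, w - 1):
            acc = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    acc += image[y + dy][x + dx] * filt[dy + 1][dx + 1]
            row.append(acc)
        out.append(row)
    return out

# 3x3 averaging filter over a 4x4 ramp image.
img = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
box = [[1.0 / 9.0] * 3 for _ in range(3)]
print(conv3x3(img, box))
```

Running this for several filters and stacking the results gives the multiple output "layers" the slide describes; the deeply nested loops are exactly what makes CNNs compute intensive and such a good match for parallel hardware.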
Why Intel® FPGAs for Machine Learning? – Reason 1

Convolutional Neural Networks are compute intensive. FPGAs offer pipeline parallelism: functions are chained back-to-back (IO → Function 1 → Function 2 → Function 3 → IO, with optional memories between stages), giving fine-grained, low-latency coupling between compute and memory.

Feature: Highly parallel architecture
Benefit: Facilitates efficient low-batch video stream processing and reduces latency

Feature: Configurable distributed floating-point DSP blocks
Benefit: FP32 (9 TFLOPS), FP16, FP11; accelerates computation by tuning compute performance

Feature: Tightly coupled high-bandwidth memory
Benefit: >50 TB/s on-chip SRAM bandwidth with random access; reduces latency and minimizes external memory accesses

Feature: Programmable data path
Benefit: Reduces unnecessary data movement, improving latency and efficiency

Feature: Configurability
Benefit: Supports variable precision (trading off throughput and accuracy); future-proofs designs and system connectivity
Why Intel® FPGAs for Machine Learning? – Reason 2
Future-Proof: Rapid Innovation of DL Topologies

▪ Deep learning is undergoing constant innovation
– Better accuracy / higher compute density
▪ Efforts to improve throughput and efficiency are ongoing
– Batching, sparsity, weight sharing, compression, etc.
▪ This rapid and constant evolution can present a challenge if implemented on a fixed architecture (e.g. a GPU); a reprogrammable FPGA can adapt to it.
Intel® FPGA Deep Learning Acceleration Suite
▪ CNN acceleration engine for common topologies executed in a graph loop architecture
– AlexNet, GoogleNet, LeNet, SqueezeNet, VGG16, ResNet, Yolo, SSD, LSTM…
▪ Software Deployment
– No FPGA compile required
– Run-time reconfigurable
▪ Customized Hardware Development
– Custom architecture creation w/ parameters
– Custom primitives using OpenCL™ flow
Engine block diagram: Convolution PE Array and Crossbar (primitive | primitive | primitive | custom), Feature Map Cache, Memory Reader/Writer, Config Engine, and DDR interfaces.
DLA Architecture: Built for Performance
▪ Maximize Parallelism on the FPGA
– Filter Parallelism (Processing Elements)
– Input-Depth Parallelism
– Winograd Transformation
– Batching
– Feature Stream Buffer
– Filter Cache
▪ Choosing FPGA Bitstream
– Data Type / Design Exploration
– Primitive Support
Diagram: the graph loop architecture. A Stream Buffer feeds the Convolution/Fully Connected, ReLU, Norm, and MaxPool blocks; the Conv PE Array connects through a Crossbar to the Feature Map Cache, Memory Reader/Writer, Config Engine, and DDR interfaces, and the configured loop executes.
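Of the parallelism sources listed above, the Winograd transformation is the least obvious. The minimal 1D case F(2,3) computes two outputs of a 3-tap correlation with 4 multiplies instead of 6, following Lavin and Gray's well-known formulation; this is a sketch of the idea, not the DLA's implementation:

```python
def winograd_f23(d, g):
    """F(2,3): two outputs of a 3-tap correlation over d[0..3],
    using 4 multiplies instead of the direct method's 6."""
    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) / 2.0
    m3 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) / 2.0
    m4 = (d[1] - d[3]) * g[2]
    return [m1 + m2 + m3, m2 - m3 - m4]

def direct(d, g):
    """Reference: sliding 3-tap correlation (6 multiplies)."""
    return [sum(d[i + k] * g[k] for k in range(3)) for i in range(2)]

d = [1.0, 2.0, 3.0, 4.0]
g = [0.5, -1.0, 0.25]
print(winograd_f23(d, g), direct(d, g))  # same two outputs
```

In 2D the savings compound, which is why a Winograd-transformed convolution engine gets more effective throughput out of the same number of DSP blocks.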
Mapping Graphs in DLA

AlexNet graph: Conv → ReLU → Norm → MaxPool → … → Fully Connected

The engine's blocks (Stream Buffer, Convolution/Fully Connected, ReLU, Norm, MaxPool) are run-time reconfigurable and bypassable: each pass from input to output enables only the blocks the current graph segment needs, so successive layers of the graph are mapped onto the same loop in turn.
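The run-time reconfigurable, bypassable blocks can be modeled as a fixed pipeline where each stage is enabled or bypassed per pass, so one engine executes different graph segments. This is an illustrative toy model, not the DLA's configuration mechanism, and the stage functions are stand-ins:

```python
# Fixed hardware pipeline order, as in the DLA loop: each pass
# enables only the blocks the current graph segment needs.
STAGES = {
    "conv": lambda xs: [2.0 * x for x in xs],                 # stand-in for Convolution/FC
    "relu": lambda xs: [max(0.0, x) for x in xs],
    "norm": lambda xs: [x / (max(xs) or 1.0) for x in xs],    # toy normalization
    "maxpool": lambda xs: [max(xs[i:i + 2]) for i in range(0, len(xs), 2)],
}
ORDER = ["conv", "relu", "norm", "maxpool"]

def run_pass(xs, enabled):
    """One trip through the engine; disabled blocks are bypassed."""
    for name in ORDER:
        if name in enabled:
            xs = STAGES[name](xs)
    return xs

x = [-1.0, 2.0, 3.0, -4.0]
# Pass 1: Conv -> ReLU -> Norm -> MaxPool (an AlexNet-style early layer)
y = run_pass(x, {"conv", "relu", "norm", "maxpool"})
# Pass 2: Conv -> ReLU only (Norm and MaxPool bypassed, as in a later layer)
z = run_pass(x, {"conv", "relu"})
print(y, z)
```

The point of the design is that the bypass configuration changes per pass at run time, with no FPGA recompile, which is how one bitstream serves a whole graph.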
Support for Different Topologies

The base block set (Stream Buffer, Convolution/Fully Connected, ReLU, Norm, MaxPool) can be extended with additional primitives: Permute, Flatten, PriorBox, SoftMax, Concat, LRN, Reshape.
Support for Different Topologies

Trade-off between features and performance: a lean engine (Convolution PE Array, Crossbar with ReLU, LRN Norm, and MaxPool primitives, Feature Map Cache, Memory Reader/Writer, Config Engine) versus a fuller engine that also implements the PriorBox, Permute, Concat, Flatten, SoftMax, and Reshape primitives.
User Flows for Intel® FPGA DL Acceleration Suite

▪ Architecture development flow (IP architect): design a customized architecture with custom primitives and custom layers, then compile it with the Intel® FPGA SDK for OpenCL™ into a bitstream library and program the FPGA.
▪ Software deployment flow (data scientist): take the neural net design through the CV SDK Model Optimizer and the offline compiler, then deploy via the Inference Engine API, the DLA graph compiler, and the DLA runtime API and engine on the programmed device.
Summary
▪ FPGAs provide a flexible, deterministic low-latency, high-throughput, and
energy-efficient solution for accelerating AI applications
▪ Intel® FPGA DLA Suite supports CNN inference on FPGAs
▪ Accessed through Intel® Computer Vision SDK
▪ Available for Intel® Programmable Acceleration Card
▪ Future Proof: can adapt to rapid innovation of DL Topologies