FPGAs for High Performance Computing
Admintech 2018 – Valencia – May 9th
Francisco Perez, Field Applications Engineer, francisco.perez@intel.com
AGENDA
• FPGAs as hardware accelerators for applications
• Introduction to FPGA architecture
• Available tools and platforms -> FPGAs for programmers
• Real-world FPGA use cases in the data center
• Artificial Intelligence: benefits of FPGAs for CNN inference
Intro to FPGA
Multi-purpose accelerator engine
Programmable Solutions Group
What is an FPGA?
Definition
• An FPGA, or Field-Programmable Gate Array, is a configurable device containing thousands of digital logic blocks.
• How these blocks are connected together, and their functionality, can be implemented using a specific hardware description language.
• This array of programmable logic can reproduce anything from quite simple circuits, like a logic gate or a combinational function, up to really complex system-on-chip solutions.
• It is reprogrammable, so its functionality can be changed when needed.
What is an FPGA today?
An advanced, multi-function accelerator
• Offers greater throughput, execution speed, and energy efficiency than CPUs on computationally intensive parts of algorithms
• Adapts quickly to changes in algorithms, new standards, data patterns, and performance needs
• Can be reconfigured in the field to accelerate any algorithm
Transforming Data Centers to a Single Accelerator Architecture
Accelerator options: CPU, GPU, ASSP, ASIC, FPGA
Workloads:
• Artificial Intelligence
• Big Data Analytics (Hadoop, Spark, SQL, NoSQL)
• Video Transcoding
• NFV/SDN
• Storage Acceleration
• Security and DPI (Deep Packet Inspection)
Accelerating Key Network Functions
Many kinds of boxes: routers, firewalls, switches, special-purpose appliances
Functions: switching, security, inspection & reporting
Acceleration domains: data analytics, artificial intelligence, video transcoding, cyber security, financial acceleration, genomics
What can FPGAs do for your application?

Hyper-acceleration of Apache Spark: Data Analytics Solution
Bigstream: Spark acceleration solution
The only platform to offer seamless acceleration of Apache Spark using Intel FPGAs
• Zero code change for Spark
• Intelligent, automatic computation slicing
• Multilevel acceleration strategies
• Abstracts away the programming front end and the processor back end
• Intelligently, automatically programs the FPGA
Bigstream stack: Dataflow Adaptation Layer -> Bigstream Dataflow -> Bigstream Hypervisor
Use of FPGAs for acceleration is limited by a skill gap and a programming-model difference:
• FPGA developers lack Spark programming models and big data knowledge
• Big Data developers lack FPGA experience

Spark acceleration: Accelerating Performance
• 0 code changes or additions to queries(1)
• 8X performance acceleration(1)
1: Running the TPC-DS benchmark per Spark/SQL Business Intelligence Benchmarks. TPC-DS is a widely used industry-standard decision-support benchmark used to evaluate the performance of data processing engines. Compared to open source Apache Spark running on an Intel® Xeon® CPU E5-2650 v3 @ 2.30GHz.
Video Application – Image Processing Acceleration
Image Processing Needs and Challenges
Decoding, resizing, cropping, and encoding of image files are typical processes that need large numbers of servers. This becomes cost prohibitive.
Booming image volumes vs. CPU computational performance:
• Internet traffic is increasing by 24%* annually, and images are a large portion of internet data
• Companies are handling huge volumes of images in the data center: cloud storage, mobile instant messaging, social networking, e-commerce
• CPU performance per core is struggling to keep pace. FPGA to the rescue.
*source: Cisco VNI Forecast Highlights Tool
CTAccel Accelerates Image Processing
CTAccel Image Processing (CIP) effectively accelerates the following image processing/analytics workflows:
• Thumbnail generation/transcoding
• Image processing (sharpen/color filter)
• Image analytics
CIP includes the following FPGA-based accelerated functions:
• Decoder: JPEG
• Pixel processing: resizing/crop
• Encoder: JPEG, WebP, Lepton
Software compatibility with OpenCV, ImageMagick and Lepton

Image Processing: Accelerating Performance
• 4.9x faster JPEG to WebP(1)
• 5X lower latency(1)
1: Compared to an Intel® Xeon® E5-2630 v2 CPU, JPEG to WebP.
Database Access Acceleration: High-Velocity Cloud Data Applications
Database Access Latency Challenges
• Increasing data volumes: a flood of data from multiple sources (big data, Internet of Things (IoT), business analysis, e-commerce)
• Faster decision-making: companies are increasingly reliant on data to fuel innovation and decision making
• Real-time performance: database analytics requires real-time performance (SaaS, finance, industrial, resource management)
Cloud/relational database analytics is employed across all industries – access times impact business results.
Swarm64 Accelerates Database Access Times
Performance, scalability, and throughput optimization
A seamless plug-in that enables popular databases and supports any configuration – in the cloud or on-premise
• High-velocity data access
• Accelerates filtering, SQL-query pre-processing and de/compression
• Compatible with existing applications
• MySQL, PostgreSQL and MariaDB support (others in development)
• No change to IT infrastructure required, easy to deploy

Solving Real-World Problems: Database Acceleration
• 2X+ traditional data warehousing(2)
• 3X+ faster real-time data analytics(1)
• 10X+ storage compression(3)
1. Based on database queries run with Swarm64 acceleration vs. no acceleration. Testing performed by Swarm64.
2. Data warehousing tested with queries and data taken from the TPC-DS benchmark. Testing performed by Swarm64.
3. Based on database size run with Swarm64 acceleration vs. no acceleration. Testing performed by Swarm64.
Solution Roadmap (2017–2018)
2017: DCP 1.0 Beta -> DCP 1.0 Production -> DCP 1.1 Alpha (with network)
2018: DCP 1.1 Production (with network connectivity)
Demos: Key Value Store (Algo-Logic), PairHMM (Broad/Intel), GZIP (Accelize/CAST), SDR to HDR Conversion (Accelize/b<>com)
Production*: Key Value Store (Algo-Logic), PairHMM (Broad/Intel), SQL DB Acceleration (Swarm64), Spark Acceleration (Bigstream), AI Training & Inference (i-abra), Broadcast H.264/H.265 Codecs (SoC Technologies), Financial back testing (Levyx), Genomics GATK Pipeline (Falcon Computing); Beta: Deep Learning Acceleration Suite (Intel)
Production*: NoSQL DB Acceleration (Reniac), C/C++ to OpenCL Compiler (Falcon Computing), JPEG to WebP (CTAccel), H.264 Transcode (Adaptive Microware), Spark/Hadoop Shuffle Accel (A3Cube), High Frequency Trading (Algo-Logic), Deep Learning Acceleration Suite (Intel)
Production*: PAL/SAP Hana Acceleration (Xelera), Advanced Firewall (F5 Networks), Security NIC (Napatech), 40G TCP/IP Offload (Enyx), Real-time Financial Analytics (Velocidata), Machine Learning Compiler (Myrtle Software), High Frequency Trading (Celerix Technology)
Production: H.264 Encoder/Decoder and H.265 Decoder (IBEX), Kafka/Spark Accelerator Engine (Megh Computing), Algorithmic trading (Xcelerit), AV1 Hybrid Codec (ATEME), Oil & Gas (Senai), Risk Check Compliance (Aplicata), Hadoop/Spark Acceleration (Wasai)
* Production status by end of quarter
Basic Architecture Description

FPGA Overview
▪ Field Programmable Gate Array (FPGA)
– Millions of logic elements
– Thousands of embedded memory blocks
– Thousands of DSP blocks
– Programmable routing
– High speed transceivers
– Various built-in hardened IP
▪ Used to create custom hardware!
[Die diagram: DSP blocks, memory blocks, programmable routing switches, logic modules]
Let's zoom in.
Basic Elements
A Basic Element is a 1-bit configurable operation plus a 1-bit register to store the result. It can be configured to perform any 1-bit operation: AND, OR, NOT, ADD, SUB.
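The configurable 1-bit operation described above is essentially a small look-up table (LUT). A minimal C sketch, assuming a 4-input LUT modeled as a 16-bit truth-table mask (modern Intel FPGAs actually use fracturable 6-input adaptive logic modules; the 4-input LUT here is a simplification for illustration):

```c
#include <stdint.h>

/* A k-input LUT is a 2^k-entry truth table. Here: 4 inputs, so the
 * "configuration" is a 16-bit mask chosen when the FPGA is programmed. */
typedef uint16_t lut4_config;

/* Evaluate the LUT: the four input bits form an index that selects one
 * bit of the mask. */
static inline int lut4(lut4_config mask, int a, int b, int c, int d) {
    int index = (a & 1) | ((b & 1) << 1) | ((c & 1) << 2) | ((d & 1) << 3);
    return (mask >> index) & 1;
}

/* Two example configurations: identical hardware, different masks. */
#define LUT4_AND4 ((lut4_config)0x8000) /* a AND b AND c AND d           */
#define LUT4_XOR2 ((lut4_config)0x6666) /* a XOR b (inputs c, d ignored) */
```

The point of the model: "reprogramming" the FPGA changes only the mask bits, never the evaluation hardware.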
Flexible Interconnect
Wider custom operations are implemented by configuring and interconnecting Basic Elements.

Custom Operations Using Basic Elements
Examples: a 16-bit add, a 32-bit square root, or your custom 64-bit bit-shifter and encoder – each built from interconnected Basic Elements.
Memory Blocks
Each memory block holds 20 Kb and exposes addr, data_in and data_out ports. Blocks can be configured and grouped using the interconnect to create various cache architectures: lots of smaller caches, or a few larger caches.
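The grouping arithmetic can be sketched in a few lines of C. This assumes a hypothetical block aspect ratio of 512 words x 32 bits; real devices support several configurable aspect ratios per block, so treat the numbers as illustrative only:

```c
/* Rough sizing sketch: each embedded memory block holds 20 Kb of data.
 * Assume a block can be configured as 512 words x 32 bits. Blocks are
 * grouped by the interconnect: in parallel for wider words, stacked
 * (plus a little addressing logic) for deeper memories. */
enum { BLOCK_DEPTH = 512, BLOCK_WIDTH = 32 };

static int div_round_up(int a, int b) { return (a + b - 1) / b; }

/* Number of blocks needed for a depth x width on-chip buffer. */
int blocks_needed(int depth, int width) {
    int wide = div_round_up(width, BLOCK_WIDTH); /* parallel group */
    int deep = div_round_up(depth, BLOCK_DEPTH); /* depth group    */
    return wide * deep;
}
```

For example, a 1024 x 64-bit buffer needs 2 blocks in width times 2 in depth, i.e. 4 blocks.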
Floating Point Multiplier/Adder Blocks
Dedicated floating point multiply and add blocks, with data_in and data_out ports.

Configurable Routing
Blocks are connected into a custom datapath that matches your application.

Configurable IO
The custom datapath can be connected directly to custom or standard IO interfaces for inline data processing: PCIe, network interfaces, cameras, disk drives.
Traditional FPGA Design Entry
▪ Used by hardware designers only
▪ Circuits described using Hardware Description Languages (HDL) such as VHDL or Verilog
▪ A designer must describe the behavior of the algorithm to create a low-level digital circuit
– Logic, Registers, Memories, State Machines, etc.
▪ Complete design times of up to several months!

always @(a or b or c or d or sel)
begin
  case (sel)
    2'b00: mux_out = a;
    2'b01: mux_out = b;
    2'b10: mux_out = c;
    2'b11: mux_out = d;
  endcase
end

[Schematic: 4-to-1 multiplexer with inputs a, b, c, d, a 2-bit sel, and output mux_out]
FPGA High Level Design with OpenCL™
Goal: Design FPGA custom hardware with C-based software language
▪ Benefits
– Makes FPGA acceleration available to software engineers
– Debug and optimize in a software-like environment
– Significant productivity gains compared to hardware-centric flow
– Easier to perform design exploration
– Abstracts away FPGA design flow and FPGA hardware
__kernel void _foo (__global float *x) {
int i …
}
*OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission of Khronos
Pipeline Generation for FPGAs
Why does a software designer want an FPGA?
A simple program

OpenCL code:
kernel void
add( global int* Mem ) {
  ...
  Mem[100] += 42*Mem[101];
}

Instruction level (IR):
add:
R0 <- Load Mem[100]
R1 <- Load Mem[101]
R2 <- Load #42
R2 <- Mul R1, R2
R0 <- Add R2, R0
Store R0 -> Mem[100]
Why execute your program on an FPGA over a CPU?

A simple 3-address CPU
[Figure: datapath with instruction fetch, program counter, register file (Aaddr/Baddr/Caddr), ALU, and load/store units]

The CPU runs the program one instruction at a time on this shared datapath:
1. Load memory value into register: R0 <- Load Mem[100]
2. Load memory value into register: R1 <- Load Mem[101]
3. Load immediate value into register: R2 <- Load #42
4. Multiply two registers, store result in register: R2 <- Mul R1, R2
5. Add two registers, store result in register: R0 <- Add R2, R0
6. Store register value into memory: Store R0 -> Mem[100]

CPU activity, step by step: each instruction occupies the ALU and datapath in turn, so the six instructions execute strictly sequentially over time.
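The sequential execution model above can be captured in a few lines of C. This is a hypothetical minimal simulator of the walkthrough, one step per instruction, not a model of any real processor:

```c
#include <stdint.h>

/* Minimal model of the sequential walkthrough: a 3-entry register file,
 * a flat memory, and the six instructions executed one per step. */
int32_t mem[256];
int32_t reg[3]; /* R0, R1, R2 */

/* Returns the number of sequential steps taken. */
int run_program(void) {
    int steps = 0;
    reg[0] = mem[100];        steps++; /* R0 <- Load Mem[100]  */
    reg[1] = mem[101];        steps++; /* R1 <- Load Mem[101]  */
    reg[2] = 42;              steps++; /* R2 <- Load #42       */
    reg[2] = reg[1] * reg[2]; steps++; /* R2 <- Mul R1, R2     */
    reg[0] = reg[2] + reg[0]; steps++; /* R0 <- Add R2, R0     */
    mem[100] = reg[0];        steps++; /* Store R0 -> Mem[100] */
    return steps;
}
```

Whatever the data, the shared datapath costs six steps per execution of this kernel; that fixed sequential cost is what the FPGA mapping removes.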
How about the FPGA?
An FPGA does not have a fixed architecture; it is massively parallel and configurable. You should not limit yourself to fixing the architecture and executing instructions sequentially – instead, try to execute the instructions in parallel.
Unroll the CPU Hardware and Specialize by Position
Lay out one copy of the datapath per instruction, then specialize each copy:
1. Instructions are fixed – remove "Fetch".
2. Remove unused ALU ops.
3. Remove unused Load/Store units.
4. Wire up the registers properly, and propagate state.
5. Remove dead data.
6. Reschedule!
FPGA Custom Hardware
Custom Datapath: Your algorithm, in silicon!
▪ Typically creates a very deeply pipelined version of a kernel
– Huge number of operations simultaneously in flight
▪ Data can more easily be localized on chip
Build exactly what you need:
• Operations
• Data widths
• Memory size & configuration
Efficiency: throughput / latency / power
[Figure: the high-level code Mem[100] += 42 * Mem[101] mapped to a custom datapath of two loads, a multiply by the constant 42, an add, and a store]
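A back-of-envelope model shows why deep pipelining pays off. The cycle counts below are notional assumptions chosen for illustration, not measurements of any device:

```c
/* Illustrative throughput model. A CPU re-uses one datapath, spending
 * several cycles per work item; a pipelined FPGA datapath pays a one-time
 * fill latency and then retires one item per cycle. */
long cpu_cycles(long items, int cycles_per_item) {
    return items * cycles_per_item;
}

long pipeline_cycles(long items, int pipeline_depth) {
    /* First result appears after 'pipeline_depth' cycles, then one
     * result per cycle while the pipeline stays full. */
    return items == 0 ? 0 : pipeline_depth + (items - 1);
}
```

With, say, 6 cycles per item on the CPU and a 100-stage pipeline, a million items cost 6,000,000 cycles sequentially but only about 1,000,099 cycles pipelined: the fill latency becomes irrelevant at scale.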
Summary
▪ FPGAs are composed of millions of reconfigurable logic elements, memory blocks and DSP blocks.
▪ Algorithms can be implemented in hardware by chaining blocks together through a programmable interconnection matrix.
▪ FPGAs are an ideal solution for building high-performance processing datapaths that offload general-purpose processors.
▪ FPGAs provide a flexible, deterministic, low-latency, high-throughput, and energy-efficient solution for accelerating workloads.
Accelerating Workloads with Intel® Xeon® CPUs and FPGAs
Network Platforms Group
Data Movement and Processing Explosion
A hyper-connected world creates high-performance computing demands:
▪ 5G wireless and high-speed wireline links bring data to the data center at ever-increasing rates
▪ >20B connected devices by 2020‡
▪ Big data processing and analytics: an explosion in data processing needs across network, storage and compute
▪ Processing must be done within a fixed space and power budget – data centers cannot grow unbounded
▪ By leveraging Intel accelerators, like FPGAs, these processing needs can be addressed
‡ Source: "Gartner Says 8.4 Billion Connected 'Things' Will Be in Use in 2017, Up 31 Percent From 2016", 2/7/2017, http://www.gartner.com/newsroom/id/3598917 (Table 1 - IoT Units Installed Base by Category, 2020 column – Grand Total, including consumer+business units)
Intel® Xeon® Scalable Processor Family Acceleration Options
• Intel® Xeon® CPU (general purpose): general-purpose workloads (AVX-512); product focus on software and instruction acceleration; software flexibility
• Intel® FPGA (optimized): flexible and versatile set of algorithmic workloads; product focus on low-latency, high-bandwidth, parallel stream processing and custom acceleration; hardware flexibility
• Intel® QuickAssist Technology (optimized): standard cryptography and compression; fixed hardware acceleration
Intel® FPGA Data Center Platform Options
Enabled by the Acceleration Stack for Intel® Xeon® CPU with FPGAs

PCIe Acceleration Cards (PCIe Gen3 x8)
System flexibility with Intel Xeon CPU SKU options; can be slotted into 1U servers.

Intel® Programmable Acceleration Card (PAC) with Arria® 10 GX FPGA
• Versatile workload acceleration: customizable hardware architecture using Arria® 10 GX FPGAs
• High performance: 1150K logic elements available with 53 Mb of embedded memory; 8 GB DDR4 memory with ECC (2 banks) at 2133 Mbps
• High data ingestion and lower latency: PCIe x8 Gen3 electrical, x16 mechanical*; 1x QSFP with 4x 10GbE or 40GbE support
• Low power in a small form factor: 70W TDP (45W FPGA); 650 LFM at Tla 55°C – passively cooled; 1 RU, as small as ½ length, ½ height
Sampling today; general availability 1H2018.

Next-Generation PACs & Platforms, powered by Intel® Xeon® CPU with FPGAs
• Higher performance, increased connectivity, more integration options
• Application & IP migration to multiple platforms
HW Accelerator Execution Model
The host machine runs a software application that uses the accelerator; the host configures the HW accelerator device and communicates with it over a PCIe link.
New Cloud-Scale Services with FPGA in the Data Center
Software-Defined Infrastructure: compute, storage and network resource pools, orchestrated for secure public and private cloud users by FPGA-enabled orchestration software.
• Workload accelerators – end-user developed, Intel developed, or 3rd-party developed – are published to an Accelerator Store
• To launch a workload, orchestration pulls the accelerator from the library, allocates a compute unit, and programs the FPGA statically or dynamically
• Virtualized Intel® Xeon® processor VMs run workloads 1..N alongside their FPGA accelerators
Microsoft Azure
A faster, more efficient, more intelligent cloud
• Data explosion: from 4.4 ZB in 2013 to 44 ZB in 2020 (Source: IDC 2014)
• ML, DNN and AI are driving requirements up faster: autonomous decision making, real-time insights into connected devices, interactive user experiences
• Cloud-scale services: searches and recommendations (indexing the Internet!)
• The need for SCALE, LOW LATENCY, and THROUGHPUT
WCS Gen4.1 Blade with NIC and Catapult FPGA
• Catapult v2 mezzanine card: management fabric, hardware (FPGA), and a super-low-latency network, with QSFP ports and 40Gb/s links to the ToR switch
• Traditional software (CPU) server plane: CPUs connected over QPI, running workloads such as web search ranking, with the FPGA inline between the NIC and the network
• Hardware acceleration plane: interconnected FPGAs form a separate plane of computation that can be managed and used independently from the CPU
• Accelerated workloads: web search ranking, deep neural networks, SDN offload, SQL
https://insidehpc.com/2018/04/cray-build-fpga-accelerated-supercomputer-paderborn-university/
FPGAs for Every Programmer
Software Developers are the New FPGA Developers
"I don't speak FPGA! What is the programming model, and where are the compilers, libraries and tools I am used to?"
Intel® investment in board design & qualification, software development, and FPGA accelerator development democratizes FPGA acceleration.
Acceleration stack layers (top to bottom):
• Applications / orchestration: rack scale design, orchestration and rack-level management
• Vertical software frameworks/libraries (DL, networking, genomics, etc.): Intel® DAAL, Intel® MKL, Intel® MKL-DNN, Intel® DL Deployment Toolkit
• Intel Xeon FPGA acceleration libraries and IP libraries: DLA, GEMM, VirtIO, pHMM, compression, encryption, etc.
• Open Programmable Acceleration Engine (OPAE software API): drivers, virtualization, APIs, acceleration engine; Intel FPGA SDK for OpenCL™, Intel Quartus® Prime
• FPGA images: loadable AFU image (.gbs) plus the FPGA Interface Manager (FIM)
• Operating systems: OS enablement for Linux and Windows
• Hardware: FPGA platforms (Programmable Acceleration Cards)
✓ A common infrastructure of FPGA HW & SW tool chains that simplifies the FPGA programming model for user applications in deep learning, networking, genomics, etc.
What is the Acceleration Stack for Xeon with FPGA?
It connects an ecosystem of FPGA workloads, application & FPGA development, and FPGA deployment & management – serving data center operators, integrated services vendors, HW & SW developers, and end application users.
Out-of-Box Flow for the Acceleration Stack
Deployment flow (end application user):
1. Buy a server with a PAC
2. Install the server OS
3. Download & install the Deployment Package of the Acceleration Stack (Intel website)
4. Download & install the workload (vendor website)
Development flow (HW & SW developer):
1. Download & install the Developer Package of the Acceleration Stack
2. Download & install the simulator, and HLS or OpenCL tools (optional)
3. Create & simulate the workload
4. Write the host application
How Can FPGA Accelerators Be Created?
An Accelerator Functional Unit (AFU) can be:
• Self-developed in VHDL or Verilog – performance optimized
• Self-developed in a C/C++ programming language – higher productivity, via the Intel® HLS Compiler or the Intel® FPGA SDK for OpenCL™
• Externally sourced – a contracted engagement with an ecosystem partner, or Intel® reference designs
Components of the Acceleration Stack for Xeon with FPGA: Overview
Software (Intel® Xeon®):
• Application: developed by the user
• Libraries: user, Intel, and 3rd party
• Open Programmable Acceleration Engine (OPAE): provided by Intel
• PCIe* drivers: provided by Intel
Hardware (FPGA):
• Accelerator Functional Unit (AFU): user, Intel, or 3rd-party IP; plugs into the AFU slot
• FPGA Interface Manager (signal bridge and management): provided by Intel
• FPGA platforms (Programmable Acceleration Cards): qualified and validated for volume deployment; provided by OEMs
Components of the Acceleration Stack: FPGA Interface Manager (FIM)
The FIM is the Intel-provided signal bridge and management layer inside the FPGA; it simplifies the use of FPGAs by presenting a standard platform to the AFU.
How Accelerator Functions Interface to the FPGA Interface Manager
• The FPGA Interface Unit (FIU) inside the FIM contains a PCIe Gen3 x8 hard IP controller running at 400 MHz
• CCI-P, a 512-bit bidirectional data path, is the standard framework and abstraction layer for AFU integration with the Acceleration Stack
• The user accelerator logic (e.g. matrix multiply) sits in the Accelerator Function Unit (AFU), with 400 MHz, 200 MHz, 100 MHz and user clocks (Usr_Clk, Usr_Clk/2) available
• Avalon-MM master/slave interfaces connect the AFU to two SDRAM bank interfaces (512-bit at 267 MHz), each backed by a DIMM (64-bit with ECC at 1067 MHz)
• The AFU interfaces to the Xeon host via the common API drivers (OPAE)
Components of the Acceleration Stack: Open Programmable Acceleration Engine (OPAE)
OPAE is the Intel-provided software layer sitting between applications/libraries and the PCIe drivers; it simplifies the use of FPGAs.
OPAE: Simplified FPGA Programming Model for Application Developers
Stack: applications, frameworks and Intel® acceleration libraries call the FPGA API (C) for enumeration, management and access; the API sits on FPGA drivers (a physical function – PF – on bare metal, a virtual function – VF – in a virtual machine, plus a common driver for the AFU, local memory and HSSI) on top of the OS or hypervisor and the FPGA hardware + Interface Manager.
▪ Consistent API across product generations and platforms; abstracts hardware-specific FPGA resource details
▪ Designed for minimal software overhead and latency: a lightweight user-space library (libfpga)
▪ Open ecosystem for industry and the developer community; licensing: FPGA API (BSD), FPGA driver (GPLv2); the FPGA driver is being upstreamed into the Linux kernel
▪ Supports both virtual machines and bare-metal platforms
▪ Faster development and debugging of accelerator functions with the included AFU Simulation Environment (ASE)
▪ Includes guides, command-line utilities and sample code
Start developing for Intel FPGAs with OPAE today: http://01.org/OPAE
What an FPGA Accelerator Looks Like to Application Software
From the OS's point of view:
▪ The FPGA hardware appears as a regular PCIe device
▪ The FPGA accelerator appears as a set of features accessible by software programs running on the host
Unified C API model:
▪ Resource management and orchestration services in a data center use it to discover and select FPGA resources and organize them for use by workloads
The architecture supports the Single Root I/O Virtualization (SR-IOV) PCIe extension, enabling host software to access the accelerator:
▪ Via a hypervisor/VMM (virtual function)
▪ Bypassing the VMM/hypervisor (physical function)
[Diagram: user application software, orchestration services and application libraries run on the operating system, with drivers, hypervisor and OPAE below, down to the AFU in the FPGA]

www.intel.com/fpgaaccelerationhub
Intel® portal for all things related to FPGA acceleration:
• Acceleration Stack for Intel® Xeon® with FPGAs
• FPGA acceleration platforms
• Acceleration solutions & ecosystem
• Knowledge center
• FPGA as a service
• Academia
• 01.org* (* 01.org is an open source community site)
Summary
▪ FPGAs provide a flexible, deterministic, low-latency, high-throughput, and energy-efficient solution for accelerating workloads
▪ Intel® Programmable Acceleration Cards are PCIe cards already certified for servers
▪ The Acceleration Stack simplifies FPGA adoption for software programmers
▪ There is a growing list of ready-to-use workload accelerators that solve real use cases
▪ Intel® provides high-level synthesis tools to develop HW accelerators for custom needs
Deep Learning for Intel FPGAs
High Performance and Custom Inference

Amazing new capabilities: a car -> a black car -> a Volkswagen Passat -> its license plate number -> not the owner!! People detection, people tracking, and analyzing behavior/intentions (e.g. spotting a thief).
Challenges, Markets and Applications
Deployment spans the edge, gateway/fog, and data center/cloud, across image, audio, speech, text and NLP workloads, in markets including medical imaging, automotive, industrial, digital surveillance, smart city and smart classroom.
• Data center applications require efficient, low-latency compute across multiple nodes and a diverse set of workloads including image, speech, and text.
• Digital surveillance solutions need to support many input cameras and provide real-time, low-latency identification of specific people, faces, vehicle plates & gestures.
• Location-aware applications require real-time detection and identification of objects using a variety of input sensors and hybrid/heterogeneous processing.
The Full System: Edge Gateway to Data Center
Moving more analytics to the edge gives faster response time and more controllability at the edge, needs less bandwidth, and requires less storage.
Multiple Approaches to AI
Machine learning: how do you engineer the best features? An N x N image is reduced to hand-crafted features (f1, f2, ..., fK: roundness of face, distance between eyes, nose width, eye socket depth, cheek bone structure, jaw line length, etc.), which are fed to a classifier algorithm (SVM, random forest, naive Bayes, decision trees, logistic regression, ensemble methods) to recognize, e.g., "Arjun".
Deep learning: how do you guide the model to find the best features? The N x N image is fed directly to a neural network that learns the features itself.
Deep Learning: Training vs. Inference
Training: forward and backward passes over lots of labeled data (human, bicycle, strawberry) iteratively adjust the model weights to reduce the error.
Inference: a forward pass with the trained weights answers questions like "bicycle?" or "strawberry?" on new data.
Did you know? In most cases, training requires a very large data set and a deep neural network (i.e. many layers) to achieve the highest accuracy – accuracy grows with data set size.
End-to-End AI Compute
Intel offers silicon across the spectrum, from data center/workstation training and inference to gateway/edge deployment: NNP for intensive and mainstream training and higher inference throughput, flexible acceleration for mainstream and real-time inference, custom inference (e.g. autonomous driving), vision at 1-20W, speech/audio at 1-100+ mW, and the Intel GNA (IP) for low-power inference.
All products, computer systems, dates, and figures are preliminary based on current expectations, and are subject to change without notice.
Intel® AI Portfolio
• Experiences
• Frameworks: Mlib, BigDL, Intel® Nervana™ Graph
• Tools: Intel® Deep Learning Deployment Toolkit, Intel® Computer Vision SDK (visual intelligence), Intel Python Distribution
• Libraries: Intel® DAAL, Intel® Math Kernel Library (MKL, MKL-DNN), Intel® FPGA DL Acceleration Suite, Associative Memory Base
• Hardware: compute, memory & storage, networking, and more*
Design Flow with Machine Learning
Data collection -> data store -> choose network -> train network -> inference engine
• Choose the network topology: use a framework (e.g. Caffe, TensorFlow)
• Train the network: a high-performance computing (HPC) workload over a large dataset – a weeks-to-months process of architecture and parameter selection
• Inference engine (FPGA focus): the implementation of the neural network performing real-time inferencing
• Improvement strategies: collect more data, improve the network
Deep Learning Topology Processing (Vision: CNNs)
A typical pipeline: pre-processing (resize/crop the image) -> the neural net "body", where most of the compute is and which the Intel FPGA Deep Learning Acceleration Suite accelerates -> multiple "heads" (1, 2, ..., 10) producing features, feature vectors for indexing, tags, and object detection -> post-processing.
Intel® Computer Vision SDK & Components
What's inside the Intel® Computer Vision SDK:
• Intel® Deep Learning Deployment Toolkit: the Model Optimizer (convert & optimize trained models into IR) and the Inference Engine (optimized inference), targeting CPU, GPU, FPGA and VPU
• Traditional computer vision for Intel CPU/CPU with integrated graphics: optimized computer vision libraries (OpenCV*, OpenVX*)
• Drivers & runtimes: OpenCL™ for Intel® Integrated Graphics, the Intel® Media SDK (open source version) to increase processor graphics performance (Linux* only), and the FPGA RunTime Environment (RTE) with bitstreams (from the Intel® FPGA SDK for OpenCL™; Linux for FPGA only)
IR = Intermediate Representation format
GPU = Intel CPU with integrated graphics processing unit / Intel® Processor Graphics; VPU = Intel® Movidius™ Vision Processing Unit
OpenVX and the OpenVX logo are trademarks of the Khronos Group Inc. OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission of Khronos.
Intel® Deep Learning Deployment Toolkit: Take Full Advantage of the Power of Intel® Architecture
Trained models from Caffe, TensorFlow or MxNet are converted & optimized by the Model Optimizer into an Intermediate Representation (IR) that fits all targets; the Inference Engine then loads the IR and infers through per-device plugins (CPU, GPU, FPGA, Myriad), each with its own extensibility mechanism (C++, OpenCL™, etc.).
Model Optimizer
▪ What it is: a preparation step that imports trained models
▪ Why important: it optimizes for performance/space with conservative topology transformations; the biggest boost comes from conversion to data types matching the hardware.
Inference Engine
▪ What it is: a high-level inference API – a common API (C++) for optimized cross-platform inference
▪ Why important: the interface is implemented as dynamically loaded plugins for each hardware type, delivering the best performance for each type without requiring users to implement and maintain multiple code pathways.
OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission of Khronos.
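The "one API, per-device plugins" idea above can be sketched with a C function-pointer table. All names here are hypothetical stand-ins, not the toolkit's actual API, and the "networks" are trivial placeholders:

```c
#include <stddef.h>
#include <string.h>

/* Sketch of a plugin architecture: one common entry point, per-device
 * backends selected at run time by name. */
typedef struct {
    const char *device; /* "CPU", "FPGA", ... */
    float (*infer)(const float *in, size_t n);
} plugin;

static float cpu_infer(const float *in, size_t n) {
    float acc = 0.0f; /* trivial stand-in for a real network */
    for (size_t i = 0; i < n; i++) acc += in[i];
    return acc;
}

static float fpga_infer(const float *in, size_t n) {
    return cpu_infer(in, n); /* same result, different engine */
}

static const plugin plugins[] = {
    { "CPU",  cpu_infer  },
    { "FPGA", fpga_infer },
};

/* Common entry point: application code never changes when the target does. */
float infer_on(const char *device, const float *in, size_t n) {
    for (size_t i = 0; i < sizeof plugins / sizeof *plugins; i++)
        if (strcmp(plugins[i].device, device) == 0)
            return plugins[i].infer(in, n);
    return -1.0f; /* unknown device */
}
```

Swapping `"CPU"` for `"FPGA"` retargets the same call site, which is the portability property the toolkit's plugin design provides.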
Improve Performance with Model Optimizer

▪ Easy-to-use, Python*-based workflow that does not require rebuilding frameworks.
▪ Imports models from various frameworks (Caffe*, TensorFlow*, MXNet*; more are planned).
▪ More than 100 models validated for Caffe, MXNet, and TensorFlow.
▪ Caffe is not required to generate IRs for models consisting of standard layers, or when the user provides custom layer implementations.
Flow: Trained Model → Model Optimizer (analyze, quantize, optimize topology, convert) → Intermediate Representation (IR) file.
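The payoff of the quantize step (converting to data types matching the hardware, such as the FP16 used in the FPGA results later in this deck) can be sketched with Python's built-in half-precision support. This is an illustration of the precision trade-off only, not the Model Optimizer's actual implementation:

```python
import struct

def to_fp16(x: float) -> float:
    """Round-trip a float through IEEE 754 half precision ('e' format)."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

# Example weights: FP16 halves the storage and memory bandwidth per
# value while keeping roughly 3 significant decimal digits.
weights = [0.123456789, -1.000244, 3.14159265, 6.1e-5]
converted = [to_fp16(w) for w in weights]
errors = [abs(w - c) for w, c in zip(weights, converted)]

for w, c, e in zip(weights, converted, errors):
    print(f"fp32={w:+.9f}  fp16={c:+.9f}  abs err={e:.2e}")
```

For inference workloads this small rounding error usually has a negligible effect on accuracy, which is why lower-precision data types are such an attractive optimization target.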
Optimal Model Performance Using the Inference Engine

Plug-in architecture: applications and services call the Inference Engine Common API; the Inference Engine Runtime dispatches to a device plugin:

▪ Intel® Math Kernel Library (MKL-DNN) plugin → intrinsics → CPU: Intel® Xeon®/Core™/Atom®
▪ clDNN plugin → OpenCL™ → Intel® Integrated Graphics (GPU)
▪ FPGA plugin → DLA → Intel® FPGA
▪ Movidius plugin → Movidius API → Movidius™ Myriad 2
▪ Simple & Unified API for Inference across all Intel® architecture (IA)
▪ Optimized inference on large IA hardware targets (CPU/GEN/FPGA)
▪ Heterogeneity support allows execution of layers across hardware types
▪ Asynchronous execution improves performance
▪ Future-proof: scale your development for future Intel® processors
Transform Models & Data into Results & Intelligence
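The idea behind the plugin architecture, one common API whose per-device backends are loaded at run time, can be sketched as a minimal registry. Names such as `register_plugin` and `infer` are invented for illustration; they are not the Inference Engine API:

```python
from typing import Callable, Dict, List

# Hypothetical plugin registry: each device name maps to a callable
# that runs inference on a batch. Real plugins would wrap MKL-DNN,
# clDNN, the DLA runtime, or the Movidius API.
PLUGINS: Dict[str, Callable[[List[float]], List[float]]] = {}

def register_plugin(device: str, fn: Callable[[List[float]], List[float]]) -> None:
    PLUGINS[device] = fn

def infer(device: str, batch: List[float]) -> List[float]:
    """Common API: the same call regardless of which hardware executes."""
    if device not in PLUGINS:
        raise ValueError(f"no plugin registered for {device!r}")
    return PLUGINS[device](batch)

# Stand-in "plugins" that happen to compute the same result through
# different (here trivial) code paths.
register_plugin("CPU", lambda b: [x * 1.0 for x in b])
register_plugin("FPGA", lambda b: [x * 1.0 for x in b])

print(infer("CPU", [1.0, 2.0]))   # application code is identical
print(infer("FPGA", [1.0, 2.0]))  # when retargeting hardware
```

This is what lets one application retarget hardware by changing a device string instead of maintaining multiple code pathways.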
Intel® FPGA DLA Suite Usage

• Supports common software frameworks (Caffe, TensorFlow)
• Intel DL software stack (the Intel Deep Learning Deployment Toolkit's Model Optimizer and Inference Engine) provides graph optimizations
• Intel FPGA Deep Learning Acceleration Suite provides turn-key or customized CNN acceleration for common topologies

Flow: standard ML frameworks → Intel Deep Learning Deployment Toolkit → heterogeneous CPU/FPGA deployment on an Intel® Xeon® processor plus Intel® FPGA.

Pre-compiled graph architectures ship as optimized templates (GoogleNet, ResNet, SqueezeNet, VGG) plus additional generic CNN templates; hardware customization is supported. The optimized acceleration engine comprises a Conv PE array, crossbar, feature map cache, memory reader/writer, config engine, and DDR interfaces.
Machine Learning on Intel® FPGA Platform

Acceleration Stack platform solution, from application down to hardware:

▪ Application and ML framework (Caffe*, TensorFlow*)
▪ DL Deployment Toolkit
▪ Software stack: DLA Runtime Engine and OpenCL™ Runtime (Acceleration Stack)
▪ Hardware platform & IP: DLA workload and BBS on PAC family boards, attached to an Intel® Xeon® CPU
For more information on the Acceleration Stack for Intel® Xeon® CPU with FPGAs on
the Intel® Programmable Acceleration Card, visit the Intel® FPGA Acceleration Hub
Increase Deep Learning Performance on Public Models Using the Intel® Computer Vision SDK, and Even More with FPGA Accelerator Cards (Frames Per Second, FPS)

Baseline: out-of-box Caffe* framework. Each value is a multiple of how much faster than baseline the model runs:

Public model     Batch  OpenCV* optimized  Intel® CV SDK  Intel CV SDK w/   Intel CV SDK on Intel®
                 size   (non-Intel)        on CPU         FP16¹             Arria® 10-1150GX FPGA
Squeezenet* 1.1  1      4.27x              7.03x          4.39x             16.51x
Vgg16*           1      1.83x              2.39x          4.32x             5.57x
GoogLeNet* v1    1      3.37x              6x             6.11x             16.89x
SSD 300*         1      1.85x              2.66x          4.54x             8.61x
Squeezenet* 1.1  32     4.22x              5.95x          7.52x             19.91x
Vgg16*           32     1.91x              2.64x          4.35x             8.08x
GoogLeNet* v1    32     3.48x              5.77x          7.11x             18.81x
SSD 300*         32     1.89x              2.72x          3.87x             8.87x

Optimize with Intel tools, or offload to Intel® Iris™ Pro Graphics or to an Intel® FPGA. The Intel Computer Vision SDK accelerates deep learning models on Intel hardware by exploiting hardware parallelism: faster results with less work.
CNN Computation in One Slide

Inew(x, y) = Σ_{x'=-1}^{1} Σ_{y'=-1}^{1} Iold(x + x', y + y') × F(x', y')

A filter (a 3D volume) is applied across the input feature map (a set of 2D images) to produce one output image; repeat for multiple filters to create multiple "layers" (channels) of the output feature map.
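The formula above is an ordinary sliding-window correlation with a 3×3 filter. A direct, unoptimized sketch for a single 2D slice (indexing images as [row][column]; the valid output shrinks by the one-pixel border):

```python
def conv3x3(image, filt):
    """Inew(y, x) = sum over dy, dx in [-1, 1] of
    Iold(y + dy, x + dx) * F(dy, dx), over the valid interior."""
    h, w = len(image), len(image[0])
    out = []
    for y in range(1, h - 1):
        row = []
        for x in range(1, w - 1):
            acc = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    acc += image[y + dy][x + dx] * filt[dy + 1][dx + 1]
            row.append(acc)
        out.append(row)
    return out

# 3x3 averaging filter over a 4x4 ramp image.
img = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
box = [[1.0 / 9.0] * 3 for _ in range(3)]
print(conv3x3(img, box))
```

Running this for several filters and stacking the results gives the multiple output "layers" the slide describes; the deeply nested loops are exactly what makes CNNs compute intensive and such a good match for parallel hardware.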
Why Intel® FPGAs for Machine Learning? – Reason 1

Convolutional Neural Networks are compute intensive. FPGAs offer pipeline parallelism: functions are chained back-to-back (IO → Function 1 → Function 2 → Function 3 → IO, with optional memories between stages), giving fine-grained, low-latency coupling between compute and memory.

Feature: Highly parallel architecture
Benefit: Facilitates efficient low-batch video stream processing and reduces latency

Feature: Configurable distributed floating-point DSP blocks
Benefit: FP32 (9 TFLOPS), FP16, FP11; accelerates computation by tuning compute performance

Feature: Tightly coupled high-bandwidth memory
Benefit: >50 TB/s on-chip SRAM bandwidth with random access; reduces latency and minimizes external memory accesses

Feature: Programmable data path
Benefit: Reduces unnecessary data movement, improving latency and efficiency

Feature: Configurability
Benefit: Supports variable precision (trading off throughput and accuracy); future-proofs designs and system connectivity
Why Intel® FPGAs for Machine Learning? – Reason 2
Future-Proof: Rapid Innovation of DL Topologies

▪ Deep learning is undergoing constant innovation
– Better accuracy / higher compute density
▪ Efforts to improve throughput and efficiency are ongoing
– Batching, sparsity, weight sharing, compression, etc.
▪ This rapid and constant evolution can present a challenge if implemented on a fixed architecture (e.g. a GPU); a reprogrammable FPGA can adapt to it.
Intel® FPGA Deep Learning Acceleration Suite
▪ CNN acceleration engine for common topologies executed in a graph loop architecture
– AlexNet, GoogleNet, LeNet, SqueezeNet, VGG16, ResNet, Yolo, SSD, LSTM…
▪ Software Deployment
– No FPGA compile required
– Run-time reconfigurable
▪ Customized Hardware Development
– Custom architecture creation w/ parameters
– Custom primitives using OpenCL™ flow
Engine block diagram: Convolution PE Array and Crossbar (primitive | primitive | primitive | custom), Feature Map Cache, Memory Reader/Writer, Config Engine, and DDR interfaces.
DLA Architecture: Built for Performance
▪ Maximize Parallelism on the FPGA
– Filter Parallelism (Processing Elements)
– Input-Depth Parallelism
– Winograd Transformation
– Batching
– Feature Stream Buffer
– Filter Cache
▪ Choosing FPGA Bitstream
– Data Type / Design Exploration
– Primitive Support
Diagram: the graph loop architecture. A Stream Buffer feeds the Convolution/Fully Connected, ReLU, Norm, and MaxPool blocks; the Conv PE Array connects through a Crossbar to the Feature Map Cache, Memory Reader/Writer, Config Engine, and DDR interfaces, and the configured loop executes.
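Of the parallelism sources listed above, the Winograd transformation is the least obvious. The minimal 1D case F(2,3) computes two outputs of a 3-tap correlation with 4 multiplies instead of 6, following Lavin and Gray's well-known formulation; this is a sketch of the idea, not the DLA's implementation:

```python
def winograd_f23(d, g):
    """F(2,3): two outputs of a 3-tap correlation over d[0..3],
    using 4 multiplies instead of the direct method's 6."""
    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) / 2.0
    m3 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) / 2.0
    m4 = (d[1] - d[3]) * g[2]
    return [m1 + m2 + m3, m2 - m3 - m4]

def direct(d, g):
    """Reference: sliding 3-tap correlation (6 multiplies)."""
    return [sum(d[i + k] * g[k] for k in range(3)) for i in range(2)]

d = [1.0, 2.0, 3.0, 4.0]
g = [0.5, -1.0, 0.25]
print(winograd_f23(d, g), direct(d, g))  # same two outputs
```

In 2D the savings compound, which is why a Winograd-transformed convolution engine gets more effective throughput out of the same number of DSP blocks.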
Mapping Graphs in DLA

AlexNet graph: Conv → ReLU → Norm → MaxPool → … → Fully Connected

The engine's blocks (Stream Buffer, Convolution/Fully Connected, ReLU, Norm, MaxPool) are run-time reconfigurable and bypassable: each pass from input to output enables only the blocks the current graph segment needs, so successive layers of the graph are mapped onto the same loop in turn.
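The run-time reconfigurable, bypassable blocks can be modeled as a fixed pipeline where each stage is enabled or bypassed per pass, so one engine executes different graph segments. This is an illustrative toy model, not the DLA's configuration mechanism, and the stage functions are stand-ins:

```python
# Fixed hardware pipeline order, as in the DLA loop: each pass
# enables only the blocks the current graph segment needs.
STAGES = {
    "conv": lambda xs: [2.0 * x for x in xs],                 # stand-in for Convolution/FC
    "relu": lambda xs: [max(0.0, x) for x in xs],
    "norm": lambda xs: [x / (max(xs) or 1.0) for x in xs],    # toy normalization
    "maxpool": lambda xs: [max(xs[i:i + 2]) for i in range(0, len(xs), 2)],
}
ORDER = ["conv", "relu", "norm", "maxpool"]

def run_pass(xs, enabled):
    """One trip through the engine; disabled blocks are bypassed."""
    for name in ORDER:
        if name in enabled:
            xs = STAGES[name](xs)
    return xs

x = [-1.0, 2.0, 3.0, -4.0]
# Pass 1: Conv -> ReLU -> Norm -> MaxPool (an AlexNet-style early layer)
y = run_pass(x, {"conv", "relu", "norm", "maxpool"})
# Pass 2: Conv -> ReLU only (Norm and MaxPool bypassed, as in a later layer)
z = run_pass(x, {"conv", "relu"})
print(y, z)
```

The point of the design is that the bypass configuration changes per pass at run time, with no FPGA recompile, which is how one bitstream serves a whole graph.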
Support for Different Topologies

The base block set (Stream Buffer, Convolution/Fully Connected, ReLU, Norm, MaxPool) can be extended with additional primitives: Permute, Flatten, PriorBox, SoftMax, Concat, LRN, Reshape.
Support for Different Topologies

Trade-off between features and performance: a lean engine (Convolution PE Array, Crossbar with ReLU, LRN Norm, and MaxPool primitives, Feature Map Cache, Memory Reader/Writer, Config Engine) versus a fuller engine that also implements the PriorBox, Permute, Concat, Flatten, SoftMax, and Reshape primitives.
User Flows for Intel® FPGA DL Acceleration Suite

▪ Architecture development flow (IP architect): design a customized architecture with custom primitives and custom layers, then compile it with the Intel® FPGA SDK for OpenCL™ into a bitstream library and program the FPGA.
▪ Software deployment flow (data scientist): take the neural net design through the CV SDK Model Optimizer and the offline compiler, then deploy via the Inference Engine API, the DLA graph compiler, and the DLA runtime API and engine on the programmed device.
Summary
▪ FPGAs provide a flexible, deterministic low-latency, high-throughput, and
energy-efficient solution for accelerating AI applications
▪ Intel® FPGA DLA Suite supports CNN inference on FPGAs
▪ Accessed through Intel® Computer Vision SDK
▪ Available for Intel® Programmable Acceleration Card
▪ Future Proof: can adapt to rapid innovation of DL Topologies