Vision Acceleration API Landscape: Options and Trade-offs


Neil Trevett, Khronos President; VP Developer Ecosystem, NVIDIA

May 2017


© Copyright Khronos Group 2017

The Search for Vision Performance and Portability

Leads to a Layered Ecosystem – and Graph Programming

[Figure: layered ecosystem diagram]
• Graph Programming (TensorRT)
• Single Source C++ Languages
• Explicit Kernels
• Code Intermediate Representations (PTX, HSAIL)
• Diverse Hardware (GPUs, DSPs, FPGAs): GPU, FPGA, DSP, Dedicated Hardware


OpenVX – Performance Portable Vision

• OpenVX developers express a graph of image operations (‘Nodes’)
- Using a C API

• Nodes can be executed on any hardware or processor, coded in any language
- Implementers can optimize under the high-level graph abstraction

• Graphs are the key to run-time optimization opportunities…

[Figure: Feature Extraction Example Graph. An OpenVX graph sits between Camera Input and Rendering Output. OpenVX nodes: Color Conversion, Channel Extract, Image Pyramid, Harris Track, Optical Flow. Data objects: RGB Frame, YUV Frame, Gray Frame, Pyr t, Ftr t-1, Array of Keypoints, Array of Features.]
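The idea of declaring a pipeline as a graph of nodes, and letting a runtime decide how to execute it, can be sketched in a few lines. This is a toy Python model for illustration only, not the OpenVX C API: the `Graph` class, `add_node`, `run`, and the stand-in image operations are all hypothetical names.

```python
from graphlib import TopologicalSorter


class Graph:
    """Toy dataflow graph: each node reads named inputs and writes one named output."""

    def __init__(self):
        self.nodes = {}  # output name -> (function, input names)

    def add_node(self, output, func, *inputs):
        self.nodes[output] = (func, inputs)

    def execution_order(self):
        # Only inputs produced by other nodes are dependencies;
        # anything else must be supplied by the caller at run time.
        deps = {out: [i for i in ins if i in self.nodes]
                for out, (_, ins) in self.nodes.items()}
        return list(TopologicalSorter(deps).static_order())

    def run(self, **external):
        data = dict(external)
        for out in self.execution_order():
            func, ins = self.nodes[out]
            data[out] = func(*(data[i] for i in ins))
        return data


# Stand-in "image operations" on a flat list of (r, g, b) pixels.
def to_gray(rgb):
    return [(r + g + b) // 3 for r, g, b in rgb]


def threshold(gray):
    return [1 if v > 100 else 0 for v in gray]


def count_set(mask):
    return sum(mask)


graph = Graph()
# Nodes may be declared in any order; the graph resolves the dependencies.
graph.add_node("count", count_set, "mask")
graph.add_node("mask", threshold, "gray")
graph.add_node("gray", to_gray, "rgb")

result = graph.run(rgb=[(200, 200, 200), (10, 20, 30), (90, 120, 150)])
```

Because the application only describes what depends on what, the runtime in this sketch is free to pick the execution order. A real implementation can use the same freedom to choose processors, fuse nodes, or reuse memory.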


OpenVX Efficiency through Graphs...

• Memory Management: reuse pre-allocated memory for multiple intermediate data
- Less allocation overhead, more memory for other applications

• Kernel Fusion: replace a sub-graph with a single faster node
- Better memory locality, less kernel launch overhead

• Graph Scheduling: split the graph execution across the whole system: CPU / GPU / dedicated HW
- Faster execution or lower power consumption

• Data Tiling: execute a sub-graph at tile granularity instead of image granularity
- Better use of data cache and local memory
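The memory management point can be made concrete with a small buffer-assignment sketch. It assumes, for simplicity, that every intermediate image has the same size and that the steps are already in execution order. `assign_buffers` and the step names are hypothetical and are not part of any OpenVX implementation.

```python
def assign_buffers(steps):
    """Assign each step's output to a reusable buffer.

    steps: list of (output_name, [input_names]) in execution order.
    Returns (assignment, buffer_count), where assignment maps each
    output name to a buffer index.
    """
    # Record the last step that reads each piece of data.
    last_use = {}
    for idx, (_, inputs) in enumerate(steps):
        for name in inputs:
            last_use[name] = idx

    free, assignment, count = [], {}, 0
    for idx, (output, inputs) in enumerate(steps):
        # Take a recycled buffer if one is free, otherwise allocate a new one.
        if free:
            assignment[output] = free.pop()
        else:
            assignment[output] = count
            count += 1
        # An input whose last reader is this step can be recycled afterwards.
        for name in set(inputs):
            if name in assignment and last_use[name] == idx:
                free.append(assignment[name])
    return assignment, count


steps = [
    ("yuv", ["rgb"]),        # "rgb" comes from outside the graph
    ("gray", ["yuv"]),
    ("pyramid", ["gray"]),
    ("corners", ["pyramid"]),
]
assignment, buffer_count = assign_buffers(steps)
```

In this chain four results fit into two buffers, because each intermediate is dead as soon as the next step has consumed it. The runtime can only make this decision because the whole graph is known before execution starts.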


OpenVX Evolution

• OpenVX 1.0 Spec released October 2014
- Conformant Implementations

• OpenVX 1.1 Spec released May 2016
- Conformant Implementations
- New Functionality: Expanded Nodes Functionality, Enhanced Graph Framework

• AMD OpenVX Tools
- Open source, highly optimized for x86 CPU and OpenCL for GPU
- “Graph Optimizer” looks at entire processing pipeline and removes, replaces, merges functions to improve performance and bandwidth
- Scripting for rapid prototyping, without re-compiling, at production performance levels
- http://gpuopen.com/compute-product/amd-openvx/

• OpenVX 1.2 Spec released 1st May 2017
- New Functionality: Conditional node execution, Feature detection, Classification operators, Expanded imaging operations
- Extensions: Neural Network Acceleration, Graph Save and Restore, 16-bit image operation

• OpenVX Roadmap Discussions
- New Functionality Under Discussion: NNEF Import, Accelerated Programmable user nodes


Khronos NNEF (Neural Net Exchange Format)

• Range of Neural Network tools and inferencing architectures is rapidly increasing

• NNEF encapsulates neural network formal semantics
- Structure, Data formats
- Commonly used operations (such as convolution, pooling, normalization, etc.)

• Cross-vendor Neural Net file format removes industry friction
- Simple exchange between tools and inferencing engines
- Unified format for network optimizations

[Figure: NN Authoring Frameworks feeding Inference Engines 1, 2 and 3. Caption: Every Tool Needs an Exporter to Every Accelerator]
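The friction that a shared exchange format removes is largely a counting argument. A minimal sketch, assuming each tool-to-engine path needs its own exporter when there is no shared format; `exporters_needed` is a hypothetical helper written for this illustration.

```python
def exporters_needed(num_tools, num_engines, shared_format=False):
    """Count the converters the ecosystem has to write and maintain."""
    if shared_format:
        # Each tool writes the shared format once; each engine reads it once.
        return num_tools + num_engines
    # Every tool needs its own exporter for every engine.
    return num_tools * num_engines


without_format = exporters_needed(3, 3)                  # 3 tools, 3 engines
with_format = exporters_needed(3, 3, shared_format=True)
```

With three tools and three engines the count drops from nine to six, and the gap widens as either side grows, since one count is a product and the other a sum.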


OpenVX 1.2 and Neural Net Extension

• Convolutional Neural Network topologies can be represented as OpenVX graphs
- Layers are represented as OpenVX nodes
- Layers are connected by multi-dimensional tensor objects
- Layer types include convolution, activation, pooling, fully-connected, soft-max
- CNN nodes can be mixed with traditional vision nodes

• Import/Export Extension
- Efficient handling of network Weights/Biases or complete networks

• Will be able to import NNEF files into OpenVX Neural Nets

[Figure: An OpenVX graph mixing CNN nodes with traditional vision nodes. Elements: Native Camera Control, Vision Nodes, CNN Nodes, Downstream Application Processing]


Building Graphs in C++ Rather than an API

C++ Based Approaches to Graph Programming



Halide – C++ Based Image Processing

• Halide is a programming language embedded in C++
- C++ code builds an in-memory representation of a Halide graph
- Compiled to an object file, to be run in the same process

• Separate descriptions of ‘algorithm’ per output pixel and execution ‘schedule’
- Can hand or auto-optimize schedule without affecting the algorithm correctness

• Compiler based on LLVM/Clang
- Compiler targets: X86, ARM, CUDA, OpenCL, and OpenGL

[Figure: A Halide Program (image processing algorithm) and a Halide Schedule (mapping of algorithm to hardware), the schedule coming from a Human Expert or an Autoscheduler, feed the Halide Code Generator (LLVM/Clang), which produces a platform-optimized binary]
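The separation of algorithm and schedule can be illustrated without Halide itself. In the plain Python sketch below, the algorithm says what one output pixel is and the schedule only decides the order in which pixels are visited. None of these names are Halide APIs: `blur_at`, `row_major`, `tiled` and `realize` are hypothetical and exist only for this example.

```python
def blur_at(image, x, y):
    """Algorithm: one output pixel of a 3-tap horizontal box blur (edges clamped)."""
    width = len(image[0])
    left = image[y][max(x - 1, 0)]
    right = image[y][min(x + 1, width - 1)]
    return (left + image[y][x] + right) // 3


def row_major(width, height):
    """Schedule 1: visit pixels row by row."""
    for y in range(height):
        for x in range(width):
            yield x, y


def tiled(width, height, tile=2):
    """Schedule 2: visit pixels tile by tile, which keeps nearby data together."""
    for ty in range(0, height, tile):
        for tx in range(0, width, tile):
            for y in range(ty, min(ty + tile, height)):
                for x in range(tx, min(tx + tile, width)):
                    yield x, y


def realize(algorithm, schedule, image):
    """Evaluate the algorithm at every pixel, in the order the schedule dictates."""
    height, width = len(image), len(image[0])
    out = [[0] * width for _ in range(height)]
    for x, y in schedule(width, height):
        out[y][x] = algorithm(image, x, y)
    return out


image = [[0, 30, 60, 90],
         [90, 60, 30, 0],
         [10, 10, 10, 10]]
by_rows = realize(blur_at, row_major, image)
by_tiles = realize(blur_at, tiled, image)
```

Both schedules visit the pixels in a different order yet produce the same image, which is why a schedule can be tuned by hand or by an autoscheduler without touching the correctness of the algorithm.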


Khronos SYCL - Single Source C++

• Single-source heterogeneous programming using STANDARD C++
- Use C++ templates and lambda functions for host & device code

• Kernel Fusion in C++ is a widely used compiler technique - proven to work
- Halide, Eigen, Boost.Compute, …
- Optimization at the C++, not assembly, level
- Achieves better performance on complex software than hand-coding

• Rapid optimization of multiple libraries - more information at http://sycl.tech
- SYCLBLAS
- SYCL Eigen
- SYCL TensorFlow
- SYCL GTX
- triSYCL
- ComputeCpp
- VisionCpp
- ComputeCpp SDK


Graph Programming - Fusion Results

• C++ Kernel fusion provides optimization benefits
- Tiled operations in local memory
- Reduced bandwidth to off-chip memory

Courtesy Codeplay: https://www.slideshare.net/AndrewRichards28/open-standards-for-adas-andrew-richards-codeplay-at-autosens-2016-66476890
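What kernel fusion buys can be seen in a hand-written sketch. Here the fusion is done manually in Python purely to show the effect on memory traffic; in the C++ approaches discussed in these slides a compiler performs it through templates. The three stand-in kernels and the `unfused` / `fused` functions are hypothetical.

```python
def brighten(pixels, offset):
    return [p + offset for p in pixels]


def scale(pixels, factor):
    return [p * factor for p in pixels]


def clamp(pixels, lo, hi):
    return [min(max(p, lo), hi) for p in pixels]


def unfused(pixels):
    # Three separate kernels: three passes over memory and
    # two full-size intermediate buffers between them.
    return clamp(scale(brighten(pixels, 10), 2), 0, 255)


def fused(pixels):
    # One kernel: a single pass, with intermediates held in a local variable.
    out = []
    for p in pixels:
        v = (p + 10) * 2
        out.append(min(max(v, 0), 255))
    return out


pixels = [0, 50, 120, 200]
```

Both versions compute the same result. The fused one never writes the intermediate images back to memory, which is the source of the reduced off-chip bandwidth listed above.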


Convergence with Standard ISO C++

• SYCL aligns the hardware acceleration of OpenCL with the direction of the C++ standard
- Use of C++14 with open source C++17 Parallel STL hosted by Khronos

• Khronos working with others on bringing proposals to ISO C++ for:
- Executors – for scheduling work
- “Managed pointers” or “channels” – for sharing data

• Hoping to target C++20
- But timescales are tight

• The eventual aim is for standard C++ to support advanced acceleration offload


Explicit Programming and Run Times

Run-time support for Acceleration Offload



OpenCL as Parallel Language/Library Backend

Approaching 200 languages, frameworks and projects using OpenCL as a compiler target to access vendor-optimized, heterogeneous compute runtimes

[Figure: projects layered on OpenCL and other Low Level Explicit APIs]
• C++ based Neural network framework
• MulticoreWare open source project on Bitbucket
• Compiler directives for Fortran, C and C++
• Java language extensions for parallelism
• Language for image processing and computational photography
• Single Source C++ Programming for OpenCL
• Vision processing open source project
• Open source software library for machine learning


OpenCL 2.2 - Top to Bottom C++

• OpenCL 1.0 Specification - Dec08

• OpenCL 1.1 Specification - Jun10 (18 months)
- 3-component vectors
- Additional image formats
- Multiple hosts and devices
- Buffer region operations
- Enhanced event-driven execution
- Additional OpenCL C built-ins
- Improved OpenGL data/event interop

• OpenCL 1.2 Specification - Nov11 (18 months)
- Device partitioning
- Separate compilation and linking
- Enhanced image support
- Built-in kernels / custom devices
- Enhanced DX and OpenGL Interop

• OpenCL 2.0 Specification - Nov13 (24 months)
- Shared Virtual Memory
- On-device dispatch
- Generic Address Space
- Enhanced Image Support
- C11 Atomics
- Pipes
- Android ICD

• OpenCL 2.1 Specification - Nov15 (24 months)
- SPIR-V in Core
- Subgroups into core
- Subgroup query operations
- clCloneKernel
- Low-latency device timer queries

• OpenCL 2.2 PROVISIONAL - May16 (7 months)
- OpenCL C++ Kernel Language
- SPIR-V 1.1 with C++ support
- SYCL 2.2 for single source C++

• Single Source C++ Programming: Full support for features in C++14-based Kernel Language
• API and Language Specs: Brings C++14-based Kernel Language into core specification
• Portable Kernel Intermediate Language: Support for C++14-based kernel language e.g. constructors/destructors


OpenCL Conformant Implementations

[Figure: per-vendor timeline of first conformant implementations of each OpenCL spec generation, from 1.0 (May09) to 2.0 (Apr17), grouped into Desktop, Mobile, Embedded and FPGA]

Vendor timelines are first implementation of each spec generation


Strong Vulkan Momentum

• Games Studios publicly confirming that work is ongoing on Vulkan Titles

• All Major GPU Companies shipping Vulkan Drivers – for Desktop and Mobile Platforms

• Mobile, Embedded and Console Platforms Supporting Vulkan
- Embedded Linux, Android TV, Android 7.0, Nintendo Switch

• In first 12 months: #Vulkan Games on PC = 11
• In first 18 months: #DX12 Games on PC = 19
- https://en.wikipedia.org/wiki/List_of_games_with_Vulkan_support


Vulkan Explicit GPU Control

Layered GPU Control:
• Application: single thread per context
• High-level Driver Abstraction: context management, memory allocation, full GLSL compiler, error detection
• GPU

Explicit GPU Control:
• Application: memory allocation, thread management, synchronization, multi-threaded generation of command buffers
• Language Front-end Compilers (initially GLSL)
• SPIR-V pre-compiled shaders
• Loadable debug and validation layers
• Thin Driver
• GPU

Vulkan 1.0 provides access to OpenGL ES 3.1 / OpenGL 4.X-class GPU functionality but with increased performance and flexibility

Vulkan Benefits
• Loadable Layers: no error handling overhead in production code
• SPIR-V Pre-compiled Shaders: no front-end compiler in driver, future shading language flexibility
• Simpler drivers: improved efficiency/performance, reduced CPU bottlenecks, lower latency, increased portability
• Graphics, compute and DMA queues: work dispatch flexibility
• Command Buffers: command creation can be multi-threaded, multiple CPU cores increase performance
• Resource management in app code: fewer hitches and surprises
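The command buffer benefit follows a simple pattern: several threads each record commands into a private buffer, and a single point submits the buffers in a fixed order. The sketch below shows only that structure in Python and is not the Vulkan API. `record_commands`, `build_frame` and `submit` are hypothetical names, and Python threads will not show the multi-core speed-up that native code would.

```python
from concurrent.futures import ThreadPoolExecutor


def record_commands(object_ids):
    """Record draw commands for one slice of the scene into a private buffer."""
    return [("draw", obj) for obj in object_ids]


def build_frame(scene, workers=4):
    """Split the scene into contiguous chunks and record each on its own thread."""
    size = max(1, -(-len(scene) // workers))  # ceiling division
    chunks = [scene[i:i + size] for i in range(0, len(scene), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in chunk order, whichever thread finishes first.
        return list(pool.map(record_commands, chunks))


def submit(command_buffers):
    """Single submission point: consume the buffers in order, as a queue would."""
    return [cmd for buf in command_buffers for cmd in buf]


buffers = build_frame(list(range(10)), workers=4)
frame = submit(buffers)
```

No locking is needed while recording because no buffer is shared between threads. Ordering is imposed once, at submission.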


Vulkan Influence on OpenCL Roadmap

• OpenCL converging with Vulkan – expanding beyond graphics + more processor types
- Thin, powerful, explicit run-time for control and predictability
- Feature sets and dial-able precision for target market agility
- Installable tools and three layer ecosystem for flexibility
- Vulkan renderpasses are already a way to enable tiled processing

• OpenCL also considering optional reduced precision for vision and inferencing
- Will enable significant number of DSP implementations to become conformant

[Figure: three layer ecosystem]
• Applications
• Vendor-supplied and open source middleware: Math Libraries, Language Front-ends, Tool Layers
- Installable tool & validation layers
• API Definitions
- Thin, explicit run-time with rigorous memory/execution model. Low-latency, fine-grain pre-emption and synchronization
- Dial-able types and precision
- Features that can be enabled for particular target markets: Real-time Pre-emption and QoS scheduling, Explicit Asynch DMA, Self-synchronized, self-scheduled graphs, Stream Processing …


SPIR-V Ecosystem

SPIR-V
• Khronos defined and controlled cross-API intermediate language
• Native support for graphics and parallel constructs
• 32-bit Word Stream
• Extensible and easily parsed
• Retains data object and control flow information for effective code generation and translation

[Figure: SPIR-V ecosystem diagram]
• Source languages: OpenCL C++, OpenCL C, GLSL, HLSL, third party kernel and shader languages
• Front-end compilers leverage clang
• ‘glslang’ GLSL to SPIR-V compiler
• LLVM, with LLVM to SPIR-V Bi-directional Translator
• SPIR-V Validator and SPIR-V (Dis)Assembler: https://github.com/KhronosGroup/SPIRV-Tools
• Consumers: IHV Driver Runtimes, Other Intermediate Forms
• Legend: Khronos has open sourced these tools and translators
• Legend: Khronos plans to open source these tools soon
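"32-bit Word Stream" and "easily parsed" can be shown with a short reader. A SPIR-V module starts with a five-word header (magic number, version, generator, ID bound, reserved word), and each instruction's first word holds its word count in the upper 16 bits and its opcode in the lower 16. This sketch assumes a little-endian module; `parse_header` and `instructions` are hypothetical helper names, not part of SPIRV-Tools.

```python
import struct

SPIRV_MAGIC = 0x07230203
HEADER_WORDS = 5


def parse_header(blob):
    """Decode the five-word module header of a little-endian SPIR-V binary."""
    if len(blob) < HEADER_WORDS * 4:
        raise ValueError("too short to be a SPIR-V module")
    magic, version, generator, bound, schema = struct.unpack_from("<5I", blob, 0)
    if magic != SPIRV_MAGIC:
        raise ValueError("missing SPIR-V magic number (or big-endian module)")
    return {
        "version": ((version >> 16) & 0xFF, (version >> 8) & 0xFF),
        "generator": generator,
        "bound": bound,
        "schema": schema,
    }


def instructions(blob):
    """Yield (opcode, operand_words) for every instruction after the header."""
    if len(blob) % 4:
        raise ValueError("SPIR-V modules are a whole number of 32-bit words")
    words = struct.unpack("<%dI" % (len(blob) // 4), blob)
    pos = HEADER_WORDS
    while pos < len(words):
        word_count = words[pos] >> 16
        opcode = words[pos] & 0xFFFF
        if word_count == 0:
            raise ValueError("instruction with zero word count")
        yield opcode, words[pos + 1:pos + word_count]
        pos += word_count


# A tiny hand-built module: header, OpCapability Shader, OpMemoryModel Logical GLSL450.
module = (
    struct.pack("<5I", SPIRV_MAGIC, 0x00010100, 0, 1, 0)
    + struct.pack("<2I", (2 << 16) | 17, 1)
    + struct.pack("<3I", (3 << 16) | 14, 0, 1)
)
header = parse_header(module)
decoded = list(instructions(module))
```

Because every instruction carries its own length, a tool can skip instructions it does not understand, which is part of what makes the format extensible.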


Possible Convergence of Graph Technologies

• API-created Graphs such as OpenVX benefit from flexibility of user-programmed nodes
- OpenVX Tiling extension lets them participate in tiled/fused optimizations
- But currently user nodes can run only on the CPU

• Perhaps use a C++ language, such as a Halide algorithm, to program user nodes
- That can be offloaded and scheduled within the OpenVX graph
- Perhaps use SPIR-V to define node capabilities and store portable Node programs

[Figure: an OpenVX graph with Native Camera Control, Vision Nodes, CNN Nodes, a Programmed User Node and Downstream Application Processing]

Discussion – should an OpenVX user-programmed node written in a C++ domain-specific language have its executable stored as SPIR-V?


Safety Critical APIs

New Generation APIs for safety certifiable vision, graphics and compute, e.g. ISO 26262 and DO-178B/C

• OpenGL ES 1.0 - 2003: Fixed function graphics
• OpenGL ES 2.0 - 2007: Shader programmable pipeline
• OpenGL SC 1.0 - 2005: Fixed function graphics subset
• OpenGL SC 2.0 - April 2016: Shader programmable pipeline subset
• Experience and Guidelines

• Vulkan SC being discussed
- Small driver size
- Advanced functionality
- Graphics and compute

• OpenVX SC 1.1 Released 1st May 2017
- Restricted “deployment” implementation executes on the target hardware by reading the binary format and executing the pre-compiled graphs

• Khronos SCAP ‘Safety Critical Advisory Panel’
- Guidelines for designing APIs that ease system certification
- Open to Khronos members AND industry experts. If interested in joining, contact [email protected]


Key Takeaways and What’s Next?

• Vision Tools and APIs are becoming increasingly sophisticated
- Ecosystem is layering libraries, languages and run-times

• Compiler technologies becoming increasingly central – especially C++
- To enable graph-based language-based solutions with kernel fusion
- LLVM and Clang open source projects being used by many

• Safety-critical APIs becoming increasingly essential
- Many vision applications need system certification

• Still no cross-vendor camera APIs?
- Is the time yet right for this to be a target for standardization?

• Please join if your company is interested in Khronos open standards
- Neil Trevett | [email protected] | @neilt3d
