vision acceleration api landscape: options and trade-offs
TRANSCRIPT
Copyright © 2017 Your Company Name 1Your Company Logo goes here. See Slide 8 for instructions
Neil Trevett, Khronos PresidentVP Developer Ecosystem, NVIDIA
May2017
Vision Acceleration API Landscape: Options and Trade-offs
© Copyright Khronos Group 2017 - Page 2
The Search for Vision Performance and PortabilityLeads to a layered Ecosystem – and Graph Programming
Explicit Kernels
Diverse Hardware (GPUs, DSPs, FPGAs)
GPU FPGA
DSPDedicated Hardware
Single Source C++ Languages
Graph Programming
TensorRT
Code Intermediate Representations
PTX HSAIL
© Copyright Khronos Group 2017 - Page 3
OpenVX – Performance Portable Vision • OpenVX developers express a graph of image operations (‘Nodes’)
- Using a C API
• Nodes can be executed on any hardware or processor coded in any language- Implementers can optimize under the high-level graph abstraction
• Graphs are the key to run-time optimization opportunities…
Array of Keypoints
YUVFrame
GrayFrame
CameraInput
RenderingOutput
Pyrt
Color Conversion
Channel Extract
Optical Flow
Harris Track
Image Pyramid
RGBFrame
Array of Features
Ftrt-1OpenVX Graph
OpenVX Nodes
Feature Extraction Example Graph
© Copyright Khronos Group 2017 - Page 4
OpenVX Efficiency through Graphs..
Reuse pre-allocated memory for
multiple intermediate data
MemoryManagement
Less allocation overhead,more memory forother applications
Replace a sub-graph with a
single faster node
Kernel Fusion
Better memorylocality, less kernel launch overhead
Split the graph execution across
the whole system: CPU /
GPU / dedicated HW
GraphScheduling
Faster executionor lower powerconsumption
Execute a sub-graph at tile granularity
instead of image granularity
DataTiling
Better use of data cache andlocal memory
© Copyright Khronos Group 2017 - Page 5
OpenVX Evolution
OpenVX 1.0 Spec released October 2014
ConformantImplementations
OpenVX 1.1 Spec released May 2016
ConformantImplementations
AMD OpenVX Tools- Open source, highly
optimized for x86 CPU and OpenCL for GPU
- “Graph Optimizer” looks at entire processing pipeline and
removes, replaces, merges functions to improve
performance and bandwidth- Scripting for rapid
prototyping, without re-compiling, at production
performance levelshttp://gpuopen.com/compute-product/amd-openvx/
New FunctionalityExpanded Nodes FunctionalityEnhanced Graph Framework
OpenVX 1.2 Spec released 1st May 2017
New FunctionalityConditional node execution
Feature detection Classification operators
Expanded imaging operations
ExtensionsNeural Network Acceleration
Graph Save and Restore16-bit image operation
OpenVX Roadmap Discussions
New Functionality Under Discussion
NNEF Import
Accelerated Programmable user
nodes
© Copyright Khronos Group 2017 - Page 6
Khronos NNEF (Neural Net Exchange Format• Range of Neural Network tools and inferencing architectures is rapidly increasing• NNEF encapsulates neural network formal semantics
- Structure, Data formats- Commonly used operations (such as convolution, pooling, normalization, etc.)
• Cross-vendor Neural Net file format removes industry friction- Simple exchange between tools and inferencing engines- Unified format for network optimizations
NN Authoring Framework 1
NN Authoring Framework 2
NN Authoring Framework 3
Inference Engine 1
Inference Engine 2
Inference Engine 3
NN Authoring Framework 1
NN Authoring Framework 2
NN Authoring Framework 3
Inference Engine 1
Inference Engine 2
Inference Engine 3
Every Tool Needs an Exporter to Every Accelerator
© Copyright Khronos Group 2017 - Page 7
OpenVX 1.2 and Neural Net Extension• Convolution Neural Network topologies can be represented as OpenVX graphs
- Layers are represented as OpenVX nodes- Layers connected by multi-dimensional tensors objects- Layer types include convolution, activation, pooling, fully-connected, soft-max- CNN nodes can be mixed with traditional vision nodes
• Import/Export Extension- Efficient handling of network Weights/Biases or complete networks
• Will be able to import NNEF files into OpenVX Neural Nets
VisionNode
VisionNode
VisionNode
Downstream ApplicationProcessing
NativeCamera Control CNN Nodes
An OpenVX graph mixing CNN nodes with traditional vision nodes
© Copyright Khronos Group 2017 - Page 8
Building Graphs in C++ Rather than an APIC++ Based Approaches to Graph Programming
Explicit Kernels
Diverse Hardware (GPUs, DSPs, FPGAs)
GPU FPGA
DSPDedicated Hardware
Single Source C++ Languages
Graph Programming
TensorRT
Code Intermediate Representations
PTX HSAIL
© Copyright Khronos Group 2017 - Page 9
Halide – C++ Based Image Processing• Halide is a programming language embedded in C++
- C++ code builds an in-memory representation of a Halide graph- Compiled to an object file, to be run in the same process
• Separate descriptions of ‘algorithm’ per output pixel and execution ‘schedule’- Can hand or auto-optimize schedule without affecting the algorithm correctness
• Compiler based on LLVM/Clang - Compiler targets: X86, ARM, CUDA, OpenCL, and OpenGL
HalideProgram
(Image processing Algorithm)
Human Expert
Autoscheduler
HalideSchedule
(mapping of algorithm to hardware)
Halide Code Generator
(LLVM/Clang)Platform-
optimized binary
© Copyright Khronos Group 2017 - Page 10
Khronos SYCL - Single Source C++ • Single-source heterogeneous programming using STANDARD C++
- Use C++ templates and lambda functions for host & device code
• Kernel Fusion in C++ is a widely used compiler technique - proven to work- Halide, Eigen, Boost.Compute, …- Optimization at the C++, not assembly, level- Achieves better performance on complex software than hand-coding
• Rapid optimization of multiple libraries - more information at http://sycl.tech- SYCLBLAS- SYCL Eigen- SYCL TensorFlow- SYCL GTX- triSYCL- ComputeCpp- VisionCpp- ComputeCpp SDK
© Copyright Khronos Group 2017 - Page 11
Graph Programming - Fusion Results• C++ Kernel fusion provides optimization benefits
- Tiled operations in local memory- Reduced bandwidth to off-chip memory
Courtesy Codeplay: https://www.slideshare.net/AndrewRichards28/open-standards-for-adas-andrew-richards-codeplay-at-autosens-2016-66476890
© Copyright Khronos Group 2017 - Page 12
Convergence with Standard ISO C++• SYCL aligns the hardware acceleration of OpenCL with direction of the C++ standard
- Use of C++14 with open source C++17 Parallel STL hosted by Khronos
• Khronos working with others on bringing proposals to ISO C++ for:- Executors – for scheduling work- “Managed pointers” or “channels” – for sharing data
• Hoping to target C++ 20- But timescales are tight
• The eventual aim is for standard C++ tosupport advanced accelerationoffload
© Copyright Khronos Group 2017 - Page 13
Explicit Programming and Run TimesRun-time support for Acceleration Offload
Explicit Kernels
Diverse Hardware (GPUs, DSPs, FPGAs)
GPU FPGA
DSPDedicated Hardware
Single Source C++ Languages
Graph Programming
TensorRT
Code Intermediate Representations
PTX HSAIL
© Copyright Khronos Group 2017 - Page 14
OpenCL as Parallel Language/Library Backend
C++ based Neural
network framework
MulticoreWare open source project on Bitbucket
Compiler directives for
Fortran,C and C++
Java language extensions
for parallelism
Language for image
processing and computational photography
Single Source C++
Programming for OpenCL
Approaching 200 languages, frameworks and projects using OpenCL as a compiler
target to access vendor-optimized, heterogeneous compute runtimes
Low Level Explicit APIs
Vision processing
open source project
Open source software library
for machine learning
© Copyright Khronos Group 2017 - Page 15
OpenCL 2.2 - Top to Bottom C++
OpenCL 1.0Specification
Dec08 Jun10OpenCL 1.1
Specification
Nov11OpenCL 1.2
SpecificationOpenCL 2.0
Specification
Nov13
Device partitioningSeparate compilation and linking
Enhanced image supportBuilt-in kernels / custom devicesEnhanced DX and OpenGL Interop
Shared Virtual MemoryOn-device dispatch
Generic Address SpaceEnhanced Image Support
C11 AtomicsPipes
Android ICD
3-component vectorsAdditional image formatsMultiple hosts and devicesBuffer region operations
Enhanced event-driven executionAdditional OpenCL C built-ins
Improved OpenGL data/event interop
18 months 18 months 24 months
OpenCL 2.1 Specification
Nov1524 months
SPIR-V in CoreSubgroups into core
Subgroup query operationsclCloneKernel
Low-latency device timer queries
OpenCL 2.2 PROVISIONAL
May167months
Single Source C++ ProgrammingFull support for features in C++14-based Kernel Language
API and Language SpecsBrings C++14-based Kernel Language into core specification
Portable Kernel Intermediate LanguageSupport for C++14-based kernel language e.g. constructors/destructors
OpenCL C++ Kernel LanguageSPIR-V 1.1 with C++ support
SYCL 2.2 for single source C++
© Copyright Khronos Group 2017 - Page 16
OpenCL Conformant Implementations
OpenCL 1.0Specification
Dec08 Jun10OpenCL 1.1Specification
Nov11OpenCL 1.2 Specification
OpenCL 2.0 Specification
Nov13
1.0|Jul13
1.0|Aug09
1.0|May09
1.0|May10
1.0 | Feb11
1.0|May09
1.0|Jan10
1.1|Aug10
1.1|Jul11
1.2|May12
1.2|Jun12
1.1|Feb11
1.1|Mar11
1.1|Jun10
1.1|Aug12
1.1|Nov12
1.1|May13
1.1|Apr12
1.2|Apr14
1.2|Sep13
1.2|Dec12Desktop
Mobile
FPGA
2.0|Jul14
OpenCL 2.1 Specification
Nov15
1.2|May15
2.0|Dec14
1.0|Dec14
1.2|Dec14
1.2|Sep14
Vendor timelines are first implementation of each spec generation
1.2|May15
Embedded
1.2|Aug15
1.2|Mar16
2.0|Nov15
2.0|Apr17
© Copyright Khronos Group 2017 - Page 17
Strong Vulkan MomentumGames Studios publicly confirming that work is ongoing on Vulkan Titles
All Major GPU Companies shipping Vulkan Drivers – for Desktop and Mobile Platforms
Mobile, Embedded and Console Platforms Supporting Vulkan
Embedded Linux
In first 12months :#Vulkan Games on PC = 11
In first 18 months#DX12 Games on PC = 19
https://en.wikipedia.org/wiki/List_of_games_with_Vulkan_support
Android TVAndroid 7.0 Nintendo Switch
© Copyright Khronos Group 2017 - Page 18
Vulkan Explicit GPU Control
GPU
High-level Driver Abstraction
Context managementMemory allocationFull GLSL compiler
Error detectionLayered GPU Control
ApplicationSingle thread per context
GPU
Thin DriverExplicit GPU Control
ApplicationMemory allocation
Thread managementSynchronization
Multi-threaded generation of command buffers
Language Front-end CompilersInitially GLSL
Loadable debug and validation layers
Vulkan 1.0 provides access to OpenGL ES 3.1 / OpenGL 4.X-class GPU functionality
but with increased performance and flexibility
Loadable LayersNo error handling overhead in
production code
SPIR-V Pre-compiled Shaders:No front-end compiler in driver
Future shading language flexibility
Simpler drivers:Improved efficiency/performance
Reduced CPU bottlenecksLower latency
Increased portability
Graphics, compute and DMA queues: Work dispatch flexibility
Command Buffers:Command creation can be multi-threadedMultiple CPU cores increase performance
Resource management in app code: Less hitches and surprises
Vulkan Benefits
SPIR-V pre-compiled shaders
© Copyright Khronos Group 2017 - Page 19
Vulkan Influence on OpenCL Roadmap• OpenCL converging with Vulkan – expanding beyond graphics + more processor types
- Thin, powerful, explicit run-time for control and predictability- Feature sets and dial-able precision for target market agility- Installable tools and three layer ecosystem for flexibility- Vulkan renderpasses are already a way to enabled tiled processing
• OpenCL also considering optional reduced precision for vision and inferencing- Will enable significant number of DSP implementations to become conformant
Thin, explicit run-time with rigorous memory/execution model.
Low-latency, fine-grain pre-emption and synchronization
Dial-able types and precision
Features that can be enabled for particular target markets
Real-time Pre-emption and
QoS scheduling
Explicit Asynch
DMA
Self-synchronized, self-scheduled
graphs
Stream Processing …
Math Libraries
Vendor-supplied and open source middleware
Language Front-ends
Tool Layers
Installable tool & validation layers
Applications
API Definitions
© Copyright Khronos Group 2017 - Page 20
SPIR-V Ecosystem
LLVM
Third party kernel and shader Languages
SPIR-V• Khronos defined and controlled cross-API intermediate language• Native support for graphics
and parallel constructs • 32-bit Word Stream
• Extensible and easily parsed• Retains data object and control
flow information for effective code generation and translation
OpenCL C++OpenCL C
GLSLKhronos has open sourced these tools and translators
IHV Driver Runtimes
Other Intermediate
Forms
SPIR-V Validator
SPIR-V (Dis)Assembler LLVM to SPIR-V Bi-directional
Translator
Khronos plans to open source these tools soon
HLSL
https://github.com/KhronosGroup/SPIRV-Tools
‘glslang’ GLSL to SPIR-V compiler
Front-end compilers leverage clang
© Copyright Khronos Group 2017 - Page 21
Possible Convergence of Graph Technologies• API-created Graphs such as OpenVX benefit from flexibility of user-programmed nodes
- OpenVX Tiling extension lets them participate in tiled/fused optimizations- But currently user nodes can run only on the CPU
• Perhaps use a C++ language, such such as a Halide algorithm, to program user nodes - That can be offloaded and scheduled within the OpenVX graph- Perhaps use SPIR-V to define node capabilities and store portable Node programs
VisionNode
VisionNode
VisionNode
Downstream ApplicationProcessing
NativeCamera Control
CNN Nodes
Discussion – should an OpenVX user programmed node in a C++ domain specific language may have its executable stored as SPIR-V?
Programmed User Node
© Copyright Khronos Group 2017 - Page 22
Safety Critical APIs
New Generation APIs for safety certifiable vision, graphics and
computee.g. ISO 26262 and DO-178B/C
OpenGL ES 1.0 - 2003Fixed function graphics
OpenGL ES 2.0 - 2007Shader programmable pipeline
OpenGL SC 1.0 - 2005Fixed function graphics subset
OpenGL SC 2.0 - April 2016Shader programmable pipeline subset
Experience and Guidelines
Vulkan SC being discussedSmall driver size
Advanced functionalityGraphics and compute
OpenVX SC 1.1 Released 1st May 2017Restricted “deployment” implementation
executes on the target hardware by reading the binary format and executing the pre-
compiled graphs
Khronos SCAP ‘Safety Critical Advisory Panel’Guidelines for designing APIs that
ease system certification.Open to Khronos member AND
industry experts. If interested to join contact [email protected]
© Copyright Khronos Group 2017 - Page 23
Key Takeaways and What’s Next?• Vision Tools and APIs are becoming increasingly sophisticated
- Ecosystem is layering libraries, language and run-times
• Compiler technologies becoming increasingly central – especially C++- To enable graph-based language-based solutions with kernel fusion- LLVM and Clang open source projects being used by many
• Safety-critical APIs becoming increasingly essential- Many vision applications need system certification
• Still no cross-vendor camera APIs?- Is the time yet right for this to be a target for standardization?
• Please join if your company interested in Khronos open standards - Neil Trevett | [email protected] | @neilt3d