
D2.1 Application analysis and system requirements

Page 1 of 28

This document is Confidential, and was produced under the RAPID project (EC contract 644312).

Project No: 644312

D2.1 Application analysis and system requirements

June 16, 2015

Abstract:

This deliverable describes the application analysis and the system requirements for the selected use-cases: a real-time

face recognition engine, a hand tracking software, and an antivirus application. A brief introduction of the three use-

cases is initially presented followed by the performance/bandwidth requirements, and runtime library dependences.

Finally, the document concludes with a detailed review of the validation scenarios.

Document Manager

David Oro Herta Security

Document Id N°: rapid_D2.1 Version: 2.4 Date: 16/6/2015

Filename: rapid_D2.1_v2.4.docx

Confidentiality

This document contains proprietary and confidential material of certain RAPID contractors, and may not be

reproduced, copied, or disclosed without appropriate permission. The commercial use of any information

contained in this document may require a license from the proprietor of that information.

Ref. Ares(2015)2565621 - 18/06/2015


The RAPID Consortium consists of the following partners:

Participant No. Participant Organisation Names Short Name Country

1 Foundation of Research and Technology Hellas FORTH Greece

2 Sapienza University of Rome UROME Italy

3 Atos Spain S.A. ATOS Spain

4 Queen's University Belfast QUB United Kingdom

5 Herta Security S.L. HERTA Spain

6 SingularLogic S.A. SILO Greece

7 University of Naples "Parthenope" UNP Italy

The information in this document is provided “as is” and no guarantee or warranty is given that the

information is fit for any particular purpose. The user thereof uses the information at its sole risk and

liability.

Revision history

Version Author Notes

0.5 Javier Vera Initial version

1.0 David Oro Updated introduction and description of use-case apps

1.1 David Oro Added benchmarks

1.2 Iakovos Mavroidis Initial hand tracking use-case description

1.3 David Oro Proofreading and minor format/spacing changes

1.4 Nikolaos Kyriazis Description of the hand tracking application completed

1.5 Giorgos Vasiliadis Added description of the antivirus application

1.6 Carles Fernández Final formatting and homogenization of sections

1.7 Nikolaos Kyriazis Added required CUDA calls for hand tracking application

1.8 Javier Vera Added required CUDA calls and assembly

1.9 Carles Fernández Incorporated changes from SILO

2.0 Iakovos Mavroidis Added formal list with requirements from Herta and FORTH

2.1 David Oro Reformatted the text and modified table spacing

2.2 David Oro Text modification to ensure quality compliance

2.3 Iakovos Mavroidis Text modification to ensure quality compliance

2.4 Fco. Javier Nieto Modifications in requirements tables according to comments


Table of Contents

1. Introduction .................................................................................................................................. 5

1.1. Glossary of Acronyms ............................................................................................................. 6

2. BioSurveillance Application ........................................................................................................ 8

2.1. Client-Server Mode Analysis ................................................................................................... 9

2.2. Requirements ......................................................................................................................... 10

3. Kinect 3D Hand Tracking Application ...................................................................................... 12

3.1. Standalone Mode Analysis .................................................................................................... 13

3.2. Client-Server Mode Analysis ................................................................................................. 14

3.3. Requirements ......................................................................................................................... 16

4. Antivirus Application ................................................................................................................ 18

5. Application Requirements and Validation Scenarios ................................................................ 20

References .......................................................................................................................................... 28

List of Figures

Figure 1: RAPID infrastructure ............................................................................................................... 5

Figure 2: Tegra K1 development board ................................................................................................... 6

Figure 3: Tegra X1 development board ................................................................................................... 6

Figure 4: Face recognition pipeline ......................................................................................................... 8

Figure 5: Face recognition use-case ....................................................................................................... 10

Figure 6: Graphical illustration of the proposed method. A Kinect RGB image (a) and the

corresponding depth map (b). The hand is segmented (c) by jointly considering skin color and depth.

The proposed method fits the employed hand model (d) to this observation recovering the hand

articulation (e). ....................................................................................................................................... 12

Figure 7: A basic schematic of the manually derived client-server decomposition of the 3D hand

tracking application................................................................................................................................ 14

Figure 8: Antivirus application. Files are mapped onto pinned memory that can be copied via DMA onto the graphics card. The matching engine performs a first-pass filtering on the GPU and returns potential true positives to the CPU for further checking. .................................................................. 19


Executive Summary

The RAPID project aims to transparently offload compute-intensive kernel computations from low-power devices such as tablets and smartphones to a remote cloud populated with high-performance multicore CPUs and GPUs. Under this scenario, workloads designed for power envelopes as high as 150 W are expected to run in a datacenter that is not power-constrained. From a user-level perspective, the main goal of the project is thus to build a state-of-the-art acceleration as a service (AaaS) platform with QoS capabilities on top of a virtualized pool of nodes.

This deliverable describes the application analysis and requirements of three use-cases developed by the HERTA and FORTH research partners. These use-cases were carefully designed to determine how the full capabilities of the RAPID infrastructure would perform under real-world workloads. HERTA provides a GPU-based face recognition application called BioSurveillance that works with standard IP or USB cameras. The software is designed to scale in bandwidth, throughput, power, and I/O requirements. Even on an embedded device such as the NVIDIA Tegra K1 or Tegra X1, the face processing engine is capable of real-time performance.

FORTH provides two innovative applications. The first relies on a Kinect 3D depth sensor to provide an interactive graphical hand tracking experience. This task is highly parallelizable, and thus could also benefit from offloading compute-intensive operations to the RAPID cloud. The second application is antivirus software that performs pattern matching, also in parallel on GPUs.


1. Introduction

RAPID will develop an efficient heterogeneous CPU-GPU cloud computing infrastructure that can seamlessly offload CPU-based and GPU-based tasks of applications running on low-power devices, ranging from smartphones, notebooks, tablets, portable/wearable devices, and robots to cars, onto more powerful devices such as NVIDIA Tesla GPUs or Intel Core i7 multicore CPUs over a heterogeneous network (HetNet). The idea behind this infrastructure is to dynamically offload compute-intensive applications, which cannot be executed on embedded platforms due to the performance limitations derived from their power constraints, to a remote cloud infrastructure populated with both powerful CPUs and GPUs.

Figure 1: RAPID infrastructure

As depicted in Figure 1, the RAPID infrastructure consists of an embedded platform (colored in

blue) and a remote cloud (colored in green). Typically, the embedded platform is powered by a system

on chip (SoC) microarchitecture. This platform usually integrates a low-power multicore CPU paired

with a GPU with low core counts. Additionally, it also integrates a hardware video decoder and an I/O

interface (e.g. Ethernet 802.3 or 802.11). These hardware blocks are included on the same die and

typically dissipate less than 5 Watts.

On the other hand, the devices used in the cloud are high-end CPUs and GPUs with high core counts

and double precision floating point capabilities with a power envelope of roughly 150W.

Unfortunately, such high-end computing devices simply cannot be powered by a battery.

Therefore, the RAPID approach to deal with this issue is to identify the most power-intensive tasks

implemented using multithreaded data-parallel kernels with annotations, and then offload them to the

heterogeneous CPU/GPU cloud. Recent advances in equipment and link-level protocols that leverage

Wi-Fi and fiber optic networks enable low-latency and high-bandwidth communications between the

embedded devices and the remote pool of high-performance CPUs and GPUs.

The selected embedded platform will be powered by either an NVIDIA Tegra K1 or X1 SoC (see Figures 2 and 3). These families of chips feature, respectively, a general-purpose quad-core or eight-core ARM Cortex CPU paired with a CUDA-enabled GPU with 192 or 256 cores. Unlike OpenCL devices, CUDA-enabled GPUs are available only from NVIDIA. The decision to restrict the scope of supported GPUs to this



manufacturer is because it provides the best driver and software stack currently available on the market. Portability of kernel code between embedded NVIDIA and server/workstation GPUs is guaranteed by the CUDA API and related framework libraries.

The cloud infrastructure will be populated by virtualized multicore Intel Core i7 CPUs and NVIDIA

GeForce GTX 980 GPUs. Unlike the GPU code, the compute-intensive CPU host code needs to be

either manually ported or recompiled, since the embedded CPUs and the cloud platform CPUs employ

a different ISA.

Figure 2: Tegra K1 development board

Figure 3: Tegra X1 development board

1.1. Glossary of Acronyms

Acronym Definition

CO Confidential

D Deliverable

DMP Data Management Plan

DoA Description of the Action

EC European Commission

EU European Union

GA Grant Agreement

PU Public

SVN Subversion


WP Work Package

CPU Central Processing Unit

GPU Graphics Processing Unit

SoC System on Chip

CUDA Compute Unified Device Architecture

ISA Instruction Set Architecture

V4L Video For Linux

API Application Program Interface

RTSP Real Time Streaming Protocol

SIMD Single Instruction Multiple Data

SSE Streaming SIMD Extensions

CCTV Closed-circuit Television

FPS Frames Per Second


2. BioSurveillance Application

HERTA's BioSurveillance software performs unconstrained real-time face recognition using standard CCTV cameras. The face recognition engine is implemented as a pipeline that can be split into several stages: video decoding, face detection, feature extraction, and template matching.

Figure 4: Face recognition pipeline

As shown in Figure 4, the pipeline starts from an input video stream. This video feed is usually streamed from a surveillance IP camera or a high-definition webcam using the H.264 video codec. The software automatically handles both container demuxing and transport-layer decoding (e.g. the RTSP protocol), so it works with any surveillance camera manufacturer simply by specifying the camera's IP address on the command line. Similarly, it also works with any USB webcam through the Video for Linux (V4L) API. In the webcam scenario, video decoding is not required, since USB webcams usually deliver uncompressed video frames. If required, the parsed H.264 frames are sent to the video decoding stage for further processing.

Video decoding is conducted on the Tegra chip using the on-die hardware video decoder, by leveraging the OpenMAX IL abstraction API provided by NVIDIA. The decoded H.264 slices are then sent in NV12 format to the face detection stage, where the locations of faces are determined. This latter stage is performed on the GPU cores available in the Tegra K1/X1 using the CUDA parallel programming model. Both the (X,Y) coordinates and the size (MxN) of each detected face are packed and sent to the feature extraction stage, which analyzes multiple face regions to build a summarized template that characterizes the face. Finally, the template matching required by the classification stage computes the similarity between histograms.
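The deliverable does not specify which histogram similarity metric is used; as an illustrative sketch (the metric and template layout here are assumptions, not HERTA's actual implementation), a histogram-based template comparison could look as follows:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Cosine similarity between two histogram-based face templates.
// The metric is an illustrative choice; BioSurveillance's actual
// matching function is proprietary and not described in this document.
double template_similarity(const std::vector<double>& a,
                           const std::vector<double>& b) {
    assert(a.size() == b.size());
    double dot = 0.0, na = 0.0, nb = 0.0;
    for (size_t i = 0; i < a.size(); ++i) {
        dot += a[i] * b[i];  // accumulate dot product and squared norms
        na  += a[i] * a[i];
        nb  += b[i] * b[i];
    }
    return dot / (std::sqrt(na) * std::sqrt(nb));
}
```

A score near 1 indicates near-identical templates; scores for all enrolled subjects would then be ranked to produce the recognition result.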

The face recognition use-case application has several dependencies on third-party libraries and APIs. GPU offloading is achieved through the NVIDIA CUDA programming model. Specifically, the application requires at least CUDA 6.5 for launching and executing data-parallel kernels on GPU cores. Additionally, OpenCV (www.opencv.org) is required for managing the input camera/video and showing the results (e.g. detected/recognized faces) on screen.

Advanced CUDA features such as managed memory allocations and transfers, shuffle instructions, and pinned host memory are used in BioSurveillance. For this reason, a GPU board with Compute Capability 3.0 or greater is required to correctly offload computations.



2.1. Client-Server Mode Analysis

The BioSurveillance face recognition application is highly sensitive to variations in network activity, as it has to send face templates to the GPU cloud in order to perform the matching process. This step compares the face templates extracted from the input video against the templates of the subjects stored in a database. Table 1 shows the latency of each step of the face recognition pipeline for both an Intel Core i7 and a Tegra K1 SoC using a USB webcam as input. These statistics were gathered from the ARM CPUs after setting the clock frequency to the maximum performance profile available in the Linux kernel (i.e. via /sys/devices/system/cpu/).

The video decoding step was performed using a pure software decoding engine based on the FFmpeg/Libav open source projects [1][2]. For this specific test, GPU-based face detection was also intentionally disabled in order to accurately determine the slowdown/speedup (S) between a desktop CPU (Intel Core i7) and a low-power CPU (Tegra K1).

In both architectures, the use of SIMD instructions (e.g. SSE and NEON) was enabled to fully exploit the vector extensions available on the CPUs and thus increase the performance of floating point operations. These extensions were enabled simply by passing the compiler flags -msse4.2 -mfpmath=sse for the Intel x86-64 architecture and -mfpu=neon for ARMv7.

Step                 Intel Core i7 (ms)   Tegra K1 (ms)        S
Video Decoding       4.3                  24.1                 5.6
Face Detection       8.2                  26.3                 3.2
Feature Extraction   6.9                  58.6                 8.5
Template Matching    23.3                 58.2 [Local CPU]     2.4
                                          5.0 [Remote GPU]     0.21

Table 1: Face recognition pipeline latency comparison between Intel Core i7 and Tegra K1
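The S column can be read as the ratio of Tegra K1 latency to Core i7 latency; values below 1, as in the remote-GPU row, mean the alternative is faster than the i7. A quick check against the table values:

```cpp
#include <cassert>
#include <cmath>

// S in Table 1 is the ratio of Tegra K1 latency to Intel Core i7 latency
// (values above 1 mean the Tegra is that many times slower).
double speedup(double tegra_ms, double i7_ms) {
    return tegra_ms / i7_ms;
}
```

For example, video decoding gives 24.1 / 4.3 = 5.6, matching the table, and the remote-GPU template matching gives 5.0 / 23.3 = 0.21.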

Bandwidth measurements were performed using the open source Wireshark [3] packet analyzer (see Figure 5). Under the RAPID architecture, the face matching process is expected to run on the remote cloud. To simulate this environment, custom client software targeting the Tegra K1 architecture was developed to perform video decoding, face detection, and facial feature extraction. The final template matching step was performed on a remote host equipped with an NVIDIA Quadro K2200 GPU card, with the extracted templates sent to the remote host over TCP sockets.

Under this scenario, the average latency obtained for GPU-based template matching was 5 milliseconds. This means that the remote Quadro K2200 GPU was 10 times faster than the ARM CPU implementation, and 4.6 times faster than an Intel Core i7, including memory and network transfers. Regarding bandwidth requirements, the application sent on average one TCP packet every 0.017 ms over a wired Gigabit Ethernet link. It should be noted that this experiment was conducted by attaching a USB webcam to the Tegra K1 board. Finally, the obtained average frame rate was


increased from 6 FPS to roughly 9 FPS (1.5X speed up) by transparently offloading template matching

computations to the remote GPU.

Figure 5: Face recognition use-case

These preliminary results show great potential for the remote GPU offloading techniques to be developed in RAPID, as the tolerated latency for commercial face recognition deployments usually ranges between 500 and 1000 ms.

2.2. Requirements

The current implementation of client-server BioSurveillance has the following technical requirements:

Client: NVIDIA Tegra TK1 or TX1

Server: NVIDIA GTX 750 Ti or better

CUDA Compute Capability 3.0

CUDA Pinned memory

CUDA Managed memory

More concretely, the current implementation requires support for the following CUDA runtime calls and assembly instructions:

CUDA runtime API calls required by BioSurveillance:

cudaCreateTextureObject
cudaCreateChannelDesc
cudaFree
cudaMallocManaged
cudaStreamAttachMemAsync
cudaMallocPitch
cudaMemset2D
cudaMemcpy2D
cudaStreamDestroy
cudaStreamCreateWithFlags

Intrinsics required by BioSurveillance (Compute Capability >= 3.0):


__shfl_down, which calls the assembly: asm volatile ("shfl.down.b32 %0, %1, %2, %3;" : "=r"(ret) : "r"(var), "r"(delta), "r"(c));

__shfl_xor, which calls the assembly: asm volatile ("shfl.bfly.b32 %0, %1, %2, %3;" : "=r"(ret) : "r"(var), "r"(laneMask), "r"(c));

The minimum and optimal performance of the application in terms of latency and FPS are given in the following table, based on the conducted measurements. The minimum values reflect what is strictly necessary for a successful commercial deployment (typically a maximum delay of one second), whereas we consider optimal the values that reproduce the results measured in this single-client, single-server architecture.

                               Minimum performance   Desired performance
Latency                        1000 ms               115 ms
FPS                            6 FPS                 9 FPS
Required upstream bandwidth    1.0 Mbps              1.5 Mbps
Required downstream bandwidth  36 Kbps               55 Kbps

Table 2: Minimum and optimal performance for a database of 192 templates.

The bandwidth requirements are computed as follows: given the necessary frame rate, the amount of data to be sent to the server per frame is simply the size of one template (21,600 bytes), and the amount of data to be received corresponds to one floating point score per subject in the database. Since all measurements were carried out with 192 enrolled images in the database, this amounts to 768 bytes per template matching step. Multiplying these data sizes by the FPS yields the required bandwidth.
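The bandwidth figures in Table 2 follow directly from this calculation; a minimal sketch (assuming 8 bits per byte and the sizes quoted above):

```cpp
#include <cassert>

// Upstream: one 21,600-byte template per processed frame.
// Downstream: one 4-byte float score per enrolled subject,
// i.e. 192 * 4 = 768 bytes per matching step.
double mbps_up(double fps)   { return 21600.0 * 8.0 * fps / 1e6; }
double kbps_down(double fps) { return   768.0 * 8.0 * fps / 1e3; }
```

At 6 FPS this gives roughly 1.04 Mbps up and 36.9 Kbps down; at 9 FPS, roughly 1.56 Mbps up and 55.3 Kbps down, matching the rounded values in Table 2.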


3. Kinect 3D Hand Tracking Application

The 3D tracking of articulated objects is a theoretically interesting and challenging problem. One of its instances, the 3D tracking of human hands, has a number of diverse applications, including but not limited to human activity recognition, human-computer interaction, understanding human grasping, and robot learning by demonstration. Towards developing an effective and efficient solution, one has to deal with a number of complicating and interacting factors, such as the high dimensionality of the problem, the chromatically uniform appearance of a hand, and the severe self-occlusions that occur while a hand is in action. To ease some of these problems, some very successful methods employ specialized hardware for motion capture [4] or the use of visual markers [5]. Unfortunately, such methods require a complex and costly hardware setup, interfere with the observed scene, or both.

Several attempts have been made to address the problem by considering only markerless visual data.

Existing approaches can be categorized into model- and appearance-based. Model-based approaches

provide a continuum of solutions but are computationally costly and depend on the availability of a

wealth of visual information, typically provided by a multi-camera system. Appearance-based

methods are associated with much less computational cost and hardware complexity, but they

recognize a discrete number of hand poses that correspond typically to the method’s training set.

The input to the proposed method (see Figure 6) is an image acquired using the Kinect sensor, together with its accompanying depth map. Skin color detection followed by depth segmentation is used to isolate the hand in 2D and 3D. The adopted 3D hand model comprises a set of appropriately assembled geometric primitives, and each hand pose is represented as a vector of 27 parameters. Hand articulation tracking is formulated as the problem of estimating the 27 hand model parameters that minimize the discrepancy between hand hypotheses and the actual observations. To quantify this discrepancy, graphics rendering techniques are employed to produce comparable skin and depth maps for a given hand pose hypothesis. An appropriate objective function is thus formulated, and a variant of Particle Swarm Optimization (PSO) is employed to search for the optimal hand configuration. The result of this optimization process is the output of the method for the given frame. Temporal continuity is exploited to track the hand articulation over a sequence of frames.
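The specific PSO variant used by the tracker is not detailed in this deliverable; the generic sketch below, minimizing a one-dimensional toy objective, only illustrates the idea. In the real system the search space is 27-dimensional and the objective evaluates rendered skin/depth maps against observations.

```cpp
#include <cmath>
#include <random>
#include <vector>

// Generic particle swarm minimizer over [lo, hi]. Illustrative only;
// FORTH's tracker optimizes a 27-D hand pose, not a scalar.
template <typename F>
double pso_minimize(F f, double lo, double hi,
                    int n_particles = 30, int n_iters = 100) {
    std::mt19937 rng(42);
    std::uniform_real_distribution<double> pos(lo, hi), coef(0.0, 1.0);

    std::vector<double> x(n_particles), v(n_particles, 0.0), best_x(n_particles);
    for (int i = 0; i < n_particles; ++i) { x[i] = pos(rng); best_x[i] = x[i]; }

    double gbest = best_x[0];  // best position seen by any particle
    for (int i = 1; i < n_particles; ++i)
        if (f(best_x[i]) < f(gbest)) gbest = best_x[i];

    const double w = 0.7, c1 = 1.5, c2 = 1.5;  // inertia and attraction weights
    for (int it = 0; it < n_iters; ++it) {
        for (int i = 0; i < n_particles; ++i) {
            // Pull each particle toward its own best and the global best.
            v[i] = w * v[i] + c1 * coef(rng) * (best_x[i] - x[i])
                            + c2 * coef(rng) * (gbest - x[i]);
            x[i] += v[i];
            if (f(x[i]) < f(best_x[i])) best_x[i] = x[i];
            if (f(x[i]) < f(gbest))     gbest = x[i];
        }
    }
    return gbest;
}
```

In the tracking context, f would render a hand pose hypothesis and score its discrepancy against the observed skin and depth maps, and the swarm would be re-seeded each frame around the previous solution to exploit temporal continuity.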

Figure 6: Graphical illustration of the proposed method. A Kinect RGB image (a) and the corresponding

depth map (b). The hand is segmented (c) by jointly considering skin color and depth. The proposed

method fits the employed hand model (d) to this observation recovering the hand articulation (e).

The most computationally demanding part of the proposed method is the evaluation of a hypothesis-

observation discrepancy where a hypothetical image is compared against the actual observed image.

This computation involves rendering, pixel-wise operations between an observation and a hypothesis

map and summation over the results. We exploit the inherent parallelism of this computation by

performing these operations on a GPU. Hardware instancing is employed to accelerate the rendering

process, exploiting the fact that the hand model is made up of transformed versions of the same two


primitives (a cylinder and a sphere). The pixel-wise operations between maps are inherently parallel

and the summations of the maps are performed efficiently by employing a pyramidal scheme. More

details on the GPU implementation are provided in previous works [6].

3.1. Standalone Mode Analysis

Kinect is a motion sensing input device by Microsoft for the Xbox 360 video game console and

Windows PCs. Based around a webcam-style add-on peripheral for the Xbox 360 console, it enables

users to control and interact with the Xbox 360 without the need to touch a game controller. Recently,

there is an increasing research interest on pattern recognition (with emphasis on head pose and gesture

recognition) applications [7][8][9][10] due to the novel Kinect depth sensor. The Kinect 3D hand

tracking by FORTH has international recognition (Microsoft has shown a lot of interest) and it is

considered to be one of the state-of-the-art 3D hand tracking software [11].

While several applications have used Kinect 3D cameras, an important barrier remains for the wide adoption of this technology in the robotics domain. The main problem is that these applications require tremendous processing power and memory, and the associated energy. For example, the Kinect 3D hand tracking software achieves the following performance depending on the system configuration:

Tracking FPS = 1.73: CPU: Pentium(R) Dual-Core T4300 @ 2.10 GHz with 4096 MB of RAM; GPU: GeForce GT 240M with 1024 MB of RAM

Tracking FPS = 2.15: CPU: Intel(R) Core(TM)2 6600 @ 2.40 GHz with 4096 MB of RAM; GPU: GeForce 9600 GT with 1024 MB of RAM

Tracking FPS = 2.66: CPU: Intel(R) Core(TM)2 Duo T7500 @ 2.20 GHz with 4096 MB of RAM; GPU: Quadro FX 1600M with 256 MB of RAM

Tracking FPS = 19.94: CPU: Intel(R) Core(TM) i7 950 @ 3.07 GHz with 6144 MB of RAM; GPU: GeForce GTX 580 (www.geforce.com) with 1536 MB of RAM

Using large, power-hungry high-end servers for low-power robots is not an attractive approach. RAPID aims to solve both the high-energy and the low-performance issues. Within the RAPID project, we target 30 FPS, the maximum frame rate the Kinect cameras support, by offloading all the compute-intensive tasks to the RAPID-based accelerator. However, the resulting FPS depends on the performance of the server, the performance of the client, the network, and the RAPID infrastructure. The Kinect 3D hand tracking software exhibits huge parallelism (even in the processing of a single frame) that can be exploited by using many cloud GPUs in parallel. In this way, RAPID is expected to bridge the gap between 3D pattern tracking applications and the robotics domain.


3.2. Client-Server Mode Analysis


Figure 7: A basic schematic of the manually derived client-server decomposition of the 3D hand tracking

application.

The steps comprising the 3D hand tracking algorithm can be grouped into fundamental modules of operations, as shown in Figure 7. A brief walkthrough of the presented flowchart, depicting the 3D hand tracking loop, follows.

Observation: Whenever acquisition from the camera is required as input to some module, images (an RGB image and a depth map) are captured and stored. The output comprises an RGB image of 480x640x3 bytes (one byte triplet per pixel) and a depth image of 480x640x2 bytes (one unsigned short per pixel).

Compute bounding box: From the tracking solution of the previous frame, produced by Compute pose, and according to the assumption of temporal continuity, a region of interest (ROI) is formulated in the vicinity of this solution. Plainly stated, the next solution is expected to be in the vicinity of the previous one. The ROI amounts to a rectangular area defined as the 2D bounding box of the back-projected hand tracking solution for the previous frame. The bounding box is padded with extra space so as to also account for potential motion outside the tight bounds of the rendering for the previous solution.

Preprocessing: Features are extracted from the observations. To implement the required focus, feature extraction occurs only on the images produced by Observation, within the 2D bounding box computed by Compute bounding box. The features comprise a pair of 64x64 images: the normalized extracted hand silhouette (64x64 bytes) and the extracted hand depth measurements (64x64x2 bytes).

Upload preprocessed observations: The computations involved in finding the next 3D hand

pose are parallel and are mostly performed on a GPU (3D rendering, GPGPU). Thus,


observations provided by Observation and other information (e.g. camera calibration) are uploaded to the GPU for further processing.

Compute pose: Given the observations uploaded by Upload preprocessed observations and

the tracking solution for the previous frame, thousands of hand pose hypotheses are made, 3D

rendered and evaluated against the observations, in order to find the most compatible

hypothesis. This best hypothesis is dubbed the solution for the current frame. This is the

computationally heaviest step of the presented pipeline. Output from this module signifies the

end of the current loop and the start of the next one, i.e. acquiring observations anew,

preprocessing, etc.

Visualization: The images input from Observation and the best matching 3D hand pose

computed by Compute pose are fused into a single visualization. In this visualization, a 3D

rendering of the hand pose is superimposed on the RGB image.
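As a rough illustration, the loop described above can be sketched in Python. All function bodies below are trivial stand-ins for the real FORTH modules (their names are hypothetical); only the data sizes stated in the text (480x640 RGBD input, 64x64 features, a 27-parameter pose) are preserved.

```python
# Illustrative sketch of the 3D hand tracking loop. Every function body
# is a hypothetical stand-in; only the data dimensions follow the text.

POSE_PARAMS = 27  # each pose solution is a vector of 27 doubles

def observe():
    """Observation: capture a 480x640x3 RGB image and a 480x640x2 depth map."""
    return bytearray(480 * 640 * 3), bytearray(480 * 640 * 2)

def compute_bounding_box(prev_pose, padding=16):
    """ROI around the back-projected previous solution, padded for motion."""
    x, y = 200, 150  # stand-in for the projection of prev_pose
    return (x - padding, y - padding, x + 64 + padding, y + 64 + padding)

def preprocess(rgb, depth, roi):
    """Extract 64x64 silhouette (1 byte/px) and depth (2 bytes/px) features."""
    return bytearray(64 * 64), bytearray(64 * 64 * 2)

def compute_pose(features, prev_pose):
    """Stand-in for GPU evaluation of thousands of rendered hypotheses."""
    return list(prev_pose)  # the best-scoring hypothesis

def track(n_frames):
    pose = [0.0] * POSE_PARAMS
    for _ in range(n_frames):
        rgb, depth = observe()
        roi = compute_bounding_box(pose)
        silhouette, depth_feat = preprocess(rgb, depth, roi)
        pose = compute_pose((silhouette, depth_feat), pose)
        # Visualization would superimpose the rendered pose on rgb here.
    return pose

print(len(track(3)))  # 27 pose parameters per solution
```

The control flow mirrors Figure 7: each iteration closes the loop by feeding the new solution back into the next bounding-box computation.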

While the presented method can be used for offline processing, the variant most relevant to RAPID, and the most common and thus most interesting use of 3D hand tracking, is real-time operation. In real-time operation a camera captures the motion of a subject and a hand tracking solution is produced at interactive frame rates. In the following, we distinguish two rates: the tracking rate and the loop rate. The tracking rate is the speed at which Compute pose is processed. The loop rate is the rate at which the entire graph is processed (and thus includes the tracking rate). While the tracking rate should be considered fixed, as accelerating it further would require a significant amount of work, the loop rate has significant room for acceleration through interleaving.
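To see why interleaving targets the loop rate rather than the tracking rate, consider a simple stage-time model: in a sequential loop each frame traverses every stage in turn, whereas in an interleaved (pipelined) loop the stages overlap across consecutive frames and throughput is bounded by the slowest stage. The stage times below are illustrative assumptions, not measurements from the prototype.

```python
# Illustrative effect of interleaving on loop rate. Stage times are
# hypothetical examples, not measurements.

stages_ms = {
    "client (observe + preprocess + visualize)": 12.0,
    "upload": 3.0,
    "server (compute pose)": 35.0,
}

# Sequential loop: a frame must pass through every stage in turn.
sequential_ms = sum(stages_ms.values())   # 50.0 ms per frame
sequential_fps = 1000.0 / sequential_ms   # 20 FPS

# Interleaved loop: stages overlap across consecutive frames, so
# throughput is bounded by the slowest stage (pose computation).
pipelined_ms = max(stages_ms.values())    # 35.0 ms per frame
pipelined_fps = 1000.0 / pipelined_ms     # ~28.6 FPS

print(round(sequential_fps, 1), round(pipelined_fps, 1))
```

Note that the tracking rate (the server stage) is unchanged; only the end-to-end loop rate improves.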

The following numbers are indicative and regard a platform with the following specifications: Intel

Core i7 CPU 950 @ 3.07 GHz with an NVIDIA GTX 970 GPU.

In the standalone version of 3D hand tracking, real-time execution yields the following rates:

Tracking rate: 28.7 FPS

Loop rate: 26.4 FPS

In the client-server version of 3D hand tracking, with both client and server running on the same

machine, real-time execution yields the following rates:

Tracking rate: 28 FPS

Loop rate: 21 FPS

The two versions show an interesting difference in their rates. In the standalone version the tracking rate is slightly higher than the corresponding rate of the client-server version. This is because in the client-server version GPU resources are shared between the client and the server processes through time slicing, which is efficient but still incurs some overhead. The client-server loop rate is, as expected, slower than the corresponding rate of the standalone version, due to the interprocess communication overhead. Executing the server on the described machine and moving the execution of the client to a different machine yields the same tracking rate and varying loop rates, which are affected by the throughput and lag of the connecting network and by the processing power of the client machine. Indicatively, running the client on a 100 Mbps intranet, on a host with similar specifications to the server, yields the same results. Running the client on a laptop, over the


internet and across countries (server at Heraklion, Greece and client at Amsterdam, the Netherlands)

yields a loop rate of around 5 FPS.

3.3. Requirements

The current implementation of client-server 3D hand tracking has the following technical

requirements:

1 or 2 machines to execute the server and the client.

TCP/IP network connecting the server and the client.

o Remote Procedure Call ability.

Server

o A multi-core CPU

o A CUDA-enabled GPU with the runtime supporting the instruction set of Table 3.

Client

o A multi-core CPU

o Optionally a CUDA-enabled GPU, for better performance, with the runtime

supporting the instruction set of Table 3.

o A RGBD camera or a stored RGBD video.

To achieve the real-time figures presented in the previous section the baseline specifications are as

follows:

CPU: Intel Core i7 CPU 950 @ 3.07 GHz or better.

GPU: NVIDIA GTX 970 or better:

o Number of cores: 1664

o Clock frequency: 1050MHz

o Memory frequency: 7Gbps

o Memory interface: 256bit

o Memory bandwidth: 224GB/s

o A better GPU would be one with higher frequency figures rather than more cores.

Network: The payload (w/o TCP/IP overhead) that needs to be transferred amounts to:
o Client to Server: 20696 bytes per frame = 620880 bytes/s (30 FPS) ≈ 5 Mbps
RGB: 64x64x3 bytes
Depth: 64x64x2 bytes
Tracking solution: 27x8 bytes
o Server to Client: 216 bytes per frame = 6480 bytes/s (30 FPS) ≈ 0.05 Mbps
Tracking solution: 27x8 bytes
o The lower the lag, the greater the loop rate. The effect of lag on the loop rate can be mitigated with pipelining.
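These payload and bandwidth figures follow directly from the stated data sizes, and can be checked with a few lines of arithmetic:

```python
# Verifies the per-frame payload and bandwidth figures quoted above.

FPS = 30

# Client-to-server payload: features plus the previous tracking solution.
rgb      = 64 * 64 * 3   # normalized silhouette, byte triplets
depth    = 64 * 64 * 2   # depth measurements, unsigned shorts
solution = 27 * 8        # 27 doubles
upstream_bytes = rgb + depth + solution          # 20696 bytes per frame

# Server-to-client payload: only the new tracking solution comes back.
downstream_bytes = solution                      # 216 bytes per frame

upstream_bps   = upstream_bytes * FPS * 8        # bits per second
downstream_bps = downstream_bytes * FPS * 8

print(upstream_bytes * FPS, round(upstream_bps / 1e6, 2))      # 620880 4.97
print(downstream_bytes * FPS, round(downstream_bps / 1e6, 3))  # 6480 0.052
```

The upstream figure of roughly 4.97 Mbps is what the text rounds to 5 Mbps.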


cudaThreadSynchronize

__cudaRegisterFatBinary

__cudaUnregisterFatBinary

cudaGetErrorString

cudaMemcpy

__cudaRegisterFunction

cudaConfigureCall

cudaD3D9SetDirect3DDevice

cudaDeviceReset

cudaFree

cudaFuncGetAttributes

cudaGLSetGLDevice

cudaGetDevice

cudaGetDeviceProperties

cudaGetLastError

cudaGraphicsD3D9RegisterResource

cudaGraphicsGLRegisterBuffer

cudaGraphicsGLRegisterImage

cudaGraphicsMapResources

cudaGraphicsResourceGetMappedPointer

cudaGraphicsResourceSetMapFlags

cudaGraphicsSubResourceGetMappedArray

cudaGraphicsUnmapResources

cudaGraphicsUnregisterResource

cudaLaunch

cudaMalloc

cudaMemcpy2DFromArray

cudaSetupArgument

cudaThreadExit

cudaMallocPitch

cudaMemset

Table 3: The entry points of the CUDA runtime which 3D Hand Tracking software invokes.


4. Antivirus Application

The ever-increasing amount of malicious software in today's connected world poses a tremendous challenge to network operators, IT administrators, and ordinary home users. Antivirus software is one of the most widely used tools for detecting and stopping malicious or unwanted software. For an effective defense, virus scanning must be performed at central network traffic ingress points as well as at end-host computers. As such, anti-malware applications scan traffic at e-mail gateways and corporate gateway proxies, and also on edge compute devices such as file servers, desktops and laptops. Unfortunately, the constant increase in storage capacity, in the number of end devices, and in the sheer number of malware samples poses significant challenges to virus scanning applications, which end up requiring multi-gigabit scanning throughput.

Analysis. Typically, a malware scanner spends the bulk of its time matching data streams against a large set of known signatures. For instance, the signature set of ClamAV [12], the most popular open-source antivirus, contains more than 60 thousand string and regular-expression signatures that have to be matched against each incoming data stream.

Pattern matching algorithms analyze the data stream and compare it against the set of signatures to detect known malware. The signature patterns can be fairly complex, composed of different-size strings, wild-card characters, range constraints, and sometimes recursive forms. Every year, as the amount of malware grows, the number of signatures increases proportionally, exposing the scaling problems of anti-malware products.

Design. The antivirus application utilizes the highly parallel capabilities of commodity graphics processing units to improve the performance of malware scanning. From a high-level view, malware scanning is divided into two phases. First, all files are scanned by the GPU in order to quickly filter out the data segments that cannot contain any viruses. The GPU uses a prefix of each virus signature to quickly discard clean data. Most data do not contain any viruses, so this filtering is quite efficient. It identifies all potentially malicious files, but a number of clean files as well. The GPU then outputs the set of suspect files and the corresponding offsets within those files. In the second phase, those files are rescanned using a full pattern-matching algorithm.

The overall architecture of the antivirus application is shown in Figure 8. The contents of each file are stored in a buffer in a region of main memory that can be transferred via DMA into the memory of the GPU. The SPMD operation of the GPU is ideal for creating multiple search engine instances that scan for virus signatures on different data in a massively parallel fashion. If the GPU detects a suspicious virus, that is, if there is a prefix match, the file is passed to the verification module for further investigation. If the data stream is clean, no further computation takes place. The GPU is therefore employed as a first-pass high-speed filter, before any further signature-matching work is completed on the CPU.
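The two-phase scan can be sketched as follows. This is a deliberately simplified, CPU-only model: the real first pass runs many matcher instances on the GPU, and the signatures and prefix length below are invented for illustration.

```python
# Simplified model of the two-phase scan described above. The real
# first pass runs on the GPU; here both phases run on the CPU, and the
# signatures and prefix length are illustrative, not real ClamAV data.

PREFIX_LEN = 4

signatures = [b"EVILCODE", b"MALWARE!", b"TROJAN-X"]
prefixes = {sig[:PREFIX_LEN] for sig in signatures}

def first_pass(data):
    """Phase 1: cheap prefix filter; returns candidate offsets only."""
    return [i for i in range(len(data) - PREFIX_LEN + 1)
            if data[i:i + PREFIX_LEN] in prefixes]

def verify(data, offsets):
    """Phase 2: full pattern match, restricted to candidate offsets."""
    hits = []
    for off in offsets:
        for sig in signatures:
            if data.startswith(sig, off):
                hits.append((off, sig))
    return hits

clean = b"just an ordinary file with no malicious content"
infected = b"some header MALWARE! some trailer"

print(first_pass(clean))                          # clean data is filtered out cheaply
print(verify(infected, first_pass(infected)))     # only candidates reach phase 2
```

Because most data pass the cheap prefix filter untouched, the expensive full match runs only on the small set of candidate offsets.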


Figure 8: Antivirus application. Files are mapped onto pinned memory that can be copied via DMA onto the graphics card. The matching engine performs first-pass filtering on the GPU and returns potential matches for further checking on the CPU.

The current implementation of the antivirus application has the following technical requirements:
A CUDA-enabled GPU.
Optionally, TCP/IP network connectivity, in order to scan data received from the network.


5. Application Requirements and Validation Scenarios

This section provides the formal list of the system requirements as well as the validation scenarios of RAPID based on the previous sections. There are six main categories of application requirements:

Requirements related to the CUDA support derived from the BioSurveillance and 3D

Hand Tracking applications. These requirements are identified with the prefix

"BIOS_CUDA" and "HT3D_CUDA" respectively.

Requirements related to the RAPID devices derived from the BioSurveillance and 3D

Hand Tracking applications. These requirements are identified with the prefix

"BIOS_DEV" and "HT3D_DEV" respectively, and they are focused on the applications

themselves.

Requirements related to the network infrastructure derived from the BioSurveillance and

3D Hand Tracking applications. These requirements are identified with the prefix

"BIOS_NET" and "HT3D_NET".

Requirements related to the performance of the CPU and GPU of the RAPID server

derived from the 3D Hand Tracking application. These requirements are identified with

the prefix "HT3D_CPU".

System-level requirements related to the functionality of the RAPID infrastructure derived

from the 3D Hand Tracking application. These requirements are identified with the prefix

"HT3D_SYS".

More general system requirements which are related to added-value features in RAPID

that could be required by other applications, identified with the prefix “GEN_SYS”.

The application requirements are related to specific components of a RAPID-based system. The main

components of the RAPID infrastructure are ThinkAir [13], which is being developed by UROME and

provides the offloading mechanism, and GVirtuS [14], which is being developed by UNP and provides

the GPU-based virtualization on the server side. Moreover, other important components of a RAPID-based system are the low-power device on which the accelerated application is executed, the network infrastructure between the low-power device and the RAPID server, and the RAPID server onto which the tasks are offloaded.

The requirements derived from the two applications as well as from other potential applications are the

following:

# Id BIOS_CUDA_01 Name Support CUDA capabilities 3.0 or higher

Priority High Req. Type Functional

Description RAPID infrastructure must support compute capabilities 3.0

Purpose GPU accelerators are all about performance. RAPID must support at least CC 3.0 to fully exploit modern GPU hardware. The BioSurveillance use case will not work on GPUs that do not support compute capability 3.0.
Use Cases BioSurveillance use-case scenario, identifying faces by following the defined workflow.

Validation

scenario

If GVirtuS or underlying GPUs do not support compute capability 3.0, then

BioSurveillance will not work.

Related WPs WP6

Components GVirtuS

Relationships BIOS_CUDA_01, BIOS_CUDA_02, BIOS_CUDA_03, BIOS_CUDA_04,

HT3D_CUDA_04


# Id BIOS_CUDA_02 Name Support CUDA version 6.5

Priority High Req. Type Functional

Description RAPID must support at least CUDA toolchain 6.5 (this

includes cudaMallocManaged and variants)

Purpose GPU accelerators are all about performance. RAPID must support at least

CUDA libraries 6.5 to fully exploit modern GPU hardware. BioSurveillance

makes use of advanced memory allocation for improving performance on

unified-memory devices (cudaMallocManaged).

Use Cases BioSurveillance scenario, identifying faces by following the defined workflow.

Validation

scenario

If cudaMallocManaged or other CUDA 6.5 features are not supported,

BioSurveillance software will not work.

Related WPs WP6

Components GVirtuS

Relationships BIOS_CUDA_01, BIOS_CUDA_02, BIOS_CUDA_03, BIOS_CUDA_04,

HT3D_CUDA_04

# Id BIOS_CUDA_03 Name PTX Assembly instructions required

Priority High Req. Type Functional

Description Assembly instructions shfl.down and shfl.bfly are required to implement the __shfl_down and __shfl_xor intrinsics. These intrinsics are used in the BioSurveillance use case for efficient information exchange inside CUDA warps. Any CUDA-compatible card that supports compute capability 3.0 and beyond must implement these assembly instructions.

Purpose These assembly instructions allow fast and efficient data exchange between

threads in a warp without using shared memory.

Use Cases BioSurveillance scenario, identifying faces by following the defined workflow.

Validation

scenario

Run the BioSurveillance application on the RAPID infrastructure. If the GPU cards support these assembly instructions, the application will run. If not, the application will fail to load.

Related WPs WP6

Components GVirtuS

Relationships BIOS_CUDA_01, BIOS_CUDA_02, BIOS_CUDA_03, BIOS_CUDA_04,

HT3D_CUDA_04

# Id HT3D_CUDA_04 Name CUDA runtime requirements

Priority High Req. Type Functional

Description The CUDA runtime should be of at least version 6.5 and should support the runtime API subset listed in Table 3.

Purpose Set the minimum CUDA runtime requirements for 3D Hand Tracking.

Use Cases Online or offline 3D Hand Tracking.

Validation

scenario

CUDA-powered aspects of 3D Hand Tracking, e.g. tracking and

visualization, work successfully.

Related WPs WP6

Components GVirtuS

Relationships BIOS_CUDA_01, BIOS_CUDA_02, BIOS_CUDA_03, BIOS_CUDA_04,

HT3D_CUDA_04


# Id BIOS_DEV_01 Name Sensor devices required

Priority High Req. Type Functional

Description Standard webcams are required as sensing devices for the BioSurveillance

use-case application. Standard Logitech HD webcams are recommended for

the BioSurveillance application.

Purpose BioSurveillance requires real-time video streams as an input to be processed.

RAPID should take into account the kind of data managed by the

applications.

Actors N/A

Use Cases BioSurveillance scenarios

Validation

scenario

Run BioSurveillance in stand-alone mode. The application should work

correctly with the connected webcam devices.

Related WPs WP3

Components Low-power devices

Relationships BIOS_DEV_01, HT3D_DEV_02

# Id HT3D_DEV_02 Name Input for 3D Hand Tracking

Priority Low Req. Type Functional

Description 3D Hand Tracking requires RGBD input, i.e. a stream of RGB images accompanied by registered depth maps. For online execution, data need to be acquired from a depth sensor; for offline execution, data can be loaded from storage.

Purpose Establish input for 3D Hand Tracking. RAPID should take into account the

kind of data managed by the applications.

Use Cases Online or offline 3D Hand Tracking.

Validation

scenario

Run 3D Hand Tracking in stand-alone mode.

Related WPs WP3

Components Low-power devices

Relationships BIOS_DEV_01, HT3D_DEV_02

# Id BIOS_NET_01 Name Support minimum bandwidth

Priority Medium Req. Type Non-functional

Description The network used among RAPID servers and clients must facilitate at least

the aggregated minimum requirements of downstream and upstream

bandwidths for all clients and servers within the deployed RAPID

infrastructure.

Purpose The indicated upstream and downstream network bandwidth requirements are needed for the online use-case applications to work sufficiently fluently and within the expected frame-rate range.

Use Cases BioSurveillance application

Validation

scenario

Online use-case applications (such as BioSurveillance) running on RAPID

infrastructure will run correctly if minimum bandwidth requirements are

covered. Otherwise, they will experience additional network latencies.

Related WPs WP4-6

Components ThinkAir and Network

Relationships BIOS_NET_01, BIOS_NET_02, HT3D_NET_03, HT3D_NET_04


# Id BIOS_NET_02 Name Work under maximum latency

Priority Medium Req. Type Non-functional

Description The BioSurveillance use-case application needs to work below the maximum

indicated network latency. RAPID has to monitor the network status and deal

with network latency as a Quality of Service (QoS) parameter.

Purpose Given the sensitivity of real-time security systems to the rapid availability of

alarm events, especially in the context of critical infrastructures, maximum

latency has to be respected in order to carry out successful commercial

implementations.

Use Cases BioSurveillance use-case scenario.

Validation

scenario

The network latency has to be empirically estimated for a BioSurveillance client and server on a RAPID infrastructure, as the time elapsed from the sending of descriptors until the retrieval of scores. Valid latency values must be under 1 second.

Related WPs WP4-6

Components ThinkAir and Network

Relationships BIOS_NET_01, BIOS_NET_02, HT3D_NET_03, HT3D_NET_04

# Id HT3D_NET_03 Name Networking

Priority High Req. Type Functional

Description The 3D Hand Tracking server and client require intercommunication in order

to exchange data. The requirements include a Remote Procedure Call

infrastructure, built over TCP/IP. RAPID has to take into account this kind of

communication when doing offloading.

Purpose Establishing the networking infrastructure’s ability to support 3D Hand

Tracking.

Use Cases Online or offline 3D Hand Tracking.

Validation

scenario

Hand Tracking should be successful in offline mode, processing data from

storage, where speed is not a factor.

Related WPs WP6

Components Network

Relationships BIOS_NET_01, BIOS_NET_02, HT3D_NET_03, HT3D_NET_04

# Id HT3D_NET_04 Name Fast networking

Priority Medium Req. Type Non-functional

Description For 3D Hand Tracking to be fast, intercommunication also needs to be

relatively fast. Due to the dimensions of the images used by the application,

RAPID must guarantee it is possible to achieve certain transfer rates.

Otherwise, it will not be possible to perform smooth real-time executions in

the platform. Therefore, network speed must be taken into account as another

QoS parameter.

Latency affects the loop rate, as it lengthens the client's tracking loop in time. Latency should reflect the specifications of a fast local network, and this has to be taken into account by RAPID as well.

Purpose Establishing the networking infrastructure’s ability to support real time 3D

Hand Tracking.

Use Cases Online or offline 3D Hand Tracking.


Validation

scenario

The processing rate across distinct nodes should closely approximate the rate achieved when client and server are executed on the same node.
For real-time execution, data need to be exchanged at a rate of 30 FPS. The data to be exchanged comprise an RGB image of dimensions 64x64 (64x64x3 = 12288 bytes), a depth map of dimensions 64x64 (64x64x2 = 8192 bytes) and a vector of 27 doubles (27x8 = 216 bytes). The networking infrastructure should allow for this payload (20696 bytes per frame, approximately 5 Mbps at 30 FPS) to be transferred over TCP/IP.

Related WPs WP4-6

Components ThinkAir and Network

Relationships BIOS_NET_01, BIOS_NET_02, HT3D_NET_03, HT3D_NET_04

# Id HT3D_CPU_01 Name Manage CPU/GPU server specifications

Priority High Req. Type Functional

Description The server processing node should be equipped with a multi-core CPU and a

CUDA-enabled GPU. These are requirements that RAPID must take into

account when performing allocation of resources. Therefore, it is necessary

that RAPID will be able to detect and process this kind of requirements.

Purpose Establish that 3D Hand Tracking can be executed on the server by fulfilling a

set of minimum requirements related to the number of CPUs and GPUs.

Use Cases Online or offline 3D Hand Tracking.

Validation

scenario

A proof-of-concept server can handle more than two clients/devices that run the 3D Hand Tracking application.

Related WPs WP6

Components RAPID servers

Relationships HT3D_CPU_01, HT3D_CPU_02, HT3D_CPU_03

# Id HT3D_CPU_02 Name Manage server real-time specifications

Priority High Req. Type Non-functional

Description Apart from the type and number of CPU/GPUs, the application has concrete

requirements for this kind of resources, such as clock frequency, number of

cores, memory frequency, memory bandwidth, etc. RAPID has to take into

account that certain applications could have these concrete requirements, in

order to guarantee their correct execution.

Purpose Establish that 3D Hand Tracking can be executed on the server at about 30

FPS.

Use Cases Online or offline 3D Hand Tracking.

Validation

scenario

3D Hand Tracking operates at about 30 FPS in real-time execution. To do so, the assigned server's CPU and GPU specifications should meet the following minimum requirements:

CPU: Intel Core i7 CPU 950 @ 3.07 GHz or better.

GPU: NVIDIA GTX 970 or better:

Number of cores: 1664

Clock frequency: 1050 MHz

Memory frequency: 7 Gbps

Memory interface: 256 bit

Memory bandwidth: 224 GB/s

Related WPs WP6


Components RAPID servers

Relationships HT3D_CPU_01, HT3D_CPU_02, HT3D_CPU_03

# Id HT3D_CPU_03 Name Client real-time specifications

Priority Medium Req. Type Non-functional

Description If the client's CPU and GPU specifications meet the server's specifications (HT3D_CPU_02), hand tracking can be executed at about 30 FPS. Resource allocation should guarantee this is possible.

Purpose Establish that 3D Hand Tracking can be executed on the client at about 30fps.

Use Cases Online or offline 3D Hand Tracking.

Validation

scenario

3D Hand Tracking operates at about 30 FPS

Related WPs WP6, WP7

Components RAPID servers

Relationships HT3D_CPU_01, HT3D_CPU_02, HT3D_CPU_03

# Id HT3D_SYS_01 Name Number of processing nodes

Priority Low Req. Type Functional

Description The server-client 3D Hand Tracking software may run on a single processing

node or more. The common scenario involves two processing nodes, one for

processing (server) and one for data acquisition and visualization (client). The

manually derived server-client decomposition may run on a single PC or two

PCs. In RAPID, the heavy computations devoted to the server may be further delegated to even more processing nodes, if required.
Purpose Investigate the execution model of server-client 3D Hand Tracking.

Use Cases Online or offline 3D Hand Tracking.

Validation

scenario

The decomposition across processing nodes should maximize tracking

throughput. Ideally, real-time execution (30 FPS) should be achieved on

powerful processing nodes.

Related WPs WP5, WP6

Components RAPID infrastructure and ThinkAir Server

Relationships N/A

# Id GEN_SYS_01 Name Manage available resources

Priority High Req. Type Functional

Description In order to perform an adequate management of resources, RAPID needs to maintain a model of the available devices and of the resources they provide to the applications that will be running. This model should reflect the status of the infrastructure in terms of running applications, node capabilities (available CPUs and GPUs, available memory, etc.) and already allocated resources.
Moreover, nodes will be classified as Class-1/2/3/4/5, depending on their nature (from ultra-low-power devices to public clouds).
Purpose Maintain control of the available infrastructure, so that the best possible management of resources can be performed.

Use Cases All.

Validation

scenario

In every scenario, different nodes will be available for execution and

offloading. RAPID will identify with 100% accuracy these nodes and the

resources they provide.


Related WPs WP5, WP6

Components RAPID infrastructure and ThinkAir Server

Relationships HT3D_CPU_01, HT3D_CPU_02, BIOS_NET_01

# Id GEN_SYS_02 Name Support application requirements

Priority Medium Req. Type Functional

Description RAPID must provide an easy way for applications to state their infrastructure requirements, so that the applications deployed in the infrastructure can be managed adequately. These requirements may include the number of CPUs/GPUs required and their characteristics, memory, etc. This requirement implies defining a format for expressing these infrastructure requirements in a simple but effective way.

Purpose In order to perform the right allocation of resources, RAPID must be able to

retrieve infrastructure requirements from the applications, when these are

crucial for providing a minimum performance.

Use Cases All.

Validation

scenario

In each scenario, applications will provide these requirements and RAPID

will be able to extract all of them correctly.

Related WPs WP5, WP6

Components RAPID infrastructure and ThinkAir Server

Relationships HT3D_CPU_01, HT3D_CPU_02, BIOS_NET_01, BIOS_NET_02,

HT3D_NET_03, HT3D_NET_04
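The "simple but effective" requirements format called for above is left open by the deliverable; one plausible shape is a small JSON descriptor. The field names below (`app`, `cpus`, `gpus`, `memory_mb`) are illustrative assumptions, not the format RAPID actually defines.

```python
import json

# A hypothetical descriptor an application could ship to declare its needs.
requirements_json = """
{
  "app": "face-recognition",
  "cpus": 4,
  "gpus": 1,
  "gpu_memory_mb": 1024,
  "memory_mb": 2048
}
"""

def parse_requirements(text: str) -> dict:
    """Parse and minimally validate an application requirements descriptor."""
    req = json.loads(text)
    for key in ("app", "cpus", "memory_mb"):
        if key not in req:
            raise ValueError(f"missing required field: {key}")
    return req

req = parse_requirements(requirements_json)
```

A declarative descriptor like this keeps extraction trivial for the infrastructure side while letting applications omit fields that do not apply to them.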

# Id: GEN_SYS_03   Name: Support Quality of Service
Priority: High   Req. Type: Functional
Description: In certain cases, applications will have quality-related requirements that must be taken into account, such as network latency or minimum bandwidth. RAPID will treat these as Quality of Service (QoS) requirements that should be agreed upon and monitored. This means that, when deploying or offloading an application, RAPID will check whether the required quality levels can be provided, negotiating when they cannot and providing the best solution available. Moreover, these QoS parameters will be monitored whenever possible, in order to guarantee that the agreed quality levels are met.
Purpose: Some application requirements concern not the resources to be provided but non-functional aspects, and these must also be taken into account.
Use Cases: All.
Validation scenario: In each scenario, applications will provide these requirements and RAPID will deal with them. The scenarios will check that, whenever it is possible to fulfill the QoS aspects, they are fulfilled 100%.
Related WPs: WP5, WP6
Components: RAPID infrastructure, ThinkAir Server and ThinkAir Clients
Relationships: BIOS_NET_01, BIOS_NET_02, HT3D_NET_03, HT3D_NET_04, GEN_SYS_02
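The check-then-negotiate behaviour described in GEN_SYS_03 can be sketched as follows. This is a minimal illustration under assumed names: the keys `max_latency_ms`, `min_bandwidth_mbps`, `latency_ms` and `bandwidth_mbps`, and the fallback policy in `negotiate`, are hypothetical choices, not the project's actual QoS vocabulary.

```python
def check_qos(requested: dict, offered: dict) -> bool:
    """True if the offered levels satisfy every requested QoS constraint.

    Latency is an upper bound; bandwidth is a lower bound.
    """
    if "max_latency_ms" in requested and \
            offered.get("latency_ms", float("inf")) > requested["max_latency_ms"]:
        return False
    if "min_bandwidth_mbps" in requested and \
            offered.get("bandwidth_mbps", 0) < requested["min_bandwidth_mbps"]:
        return False
    return True

def negotiate(requested: dict, offers: list) -> dict:
    """Pick the first offer meeting the QoS; else fall back to the best available
    solution (here: the offer with the highest bandwidth)."""
    for offer in offers:
        if check_qos(requested, offer):
            return offer
    return max(offers, key=lambda o: o.get("bandwidth_mbps", 0))

requested = {"max_latency_ms": 50, "min_bandwidth_mbps": 10}
offers = [{"latency_ms": 80, "bandwidth_mbps": 100},
          {"latency_ms": 20, "bandwidth_mbps": 20}]
agreed = negotiate(requested, offers)   # first offer fails the latency bound
```

The same `check_qos` predicate could later be re-run against monitored values to detect when an agreed level is being violated.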


# Id: GEN_SYS_04   Name: Resource allocation mechanism
Priority: High   Req. Type: Functional
Description: There must be an adequate infrastructure management mechanism that facilitates deploying and offloading applications to the right devices. The required infrastructure and other QoS parameters will be taken into account when performing these operations. The mechanism will rely on the infrastructure model and apply certain policies for improving performance (e.g. offloading cannot target nodes belonging to a lower class).
Purpose: Applications need RAPID to allocate resources adequately, so that they perform as expected while the infrastructure resources are exploited optimally.
Use Cases: All.
Validation scenario: In each scenario, applications will provide their requirements and RAPID will allocate resources accordingly, guaranteeing adequate application performance. This will be validated by HT3D_CPU_01, HT3D_CPU_02, BIOS_NET_01, BIOS_NET_02, HT3D_NET_03 and HT3D_NET_04.
Related WPs: WP5, WP6
Components: RAPID infrastructure, ThinkAir Server and ThinkAir Clients
Relationships: GEN_SYS_01, GEN_SYS_02, GEN_SYS_03
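The allocation policy named in GEN_SYS_04, including the rule that offloading never targets a lower-class node, could look like the sketch below. The dictionary keys, the tie-breaking rule (most free CPUs), and the sample nodes are all assumptions for illustration.

```python
def eligible_targets(nodes: list, origin_class: int, req: dict) -> list:
    """Filter candidate nodes: never a lower class than the origin,
    and enough free resources for the stated requirements."""
    return [
        n for n in nodes
        if n["class"] >= origin_class
        and n["free_cpus"] >= req.get("cpus", 0)
        and n["free_gpus"] >= req.get("gpus", 0)
        and n["free_memory_mb"] >= req.get("memory_mb", 0)
    ]

def allocate(nodes: list, origin_class: int, req: dict):
    """Pick the eligible node with the most free CPUs (one simple policy)."""
    candidates = eligible_targets(nodes, origin_class, req)
    if not candidates:
        return None
    return max(candidates, key=lambda n: n["free_cpus"])

nodes = [
    {"id": "a", "class": 2, "free_cpus": 2,  "free_gpus": 0, "free_memory_mb": 1024},
    {"id": "b", "class": 4, "free_cpus": 16, "free_gpus": 2, "free_memory_mb": 8192},
]
# Offloading from a Class-3 node: node "a" is excluded by the class policy.
target = allocate(nodes, origin_class=3, req={"cpus": 4, "gpus": 1})
```

Returning `None` when no candidate qualifies is where the QoS negotiation of GEN_SYS_03 would take over.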

# Id: GEN_SYS_05   Name: Smart offloading
Priority: Medium   Req. Type: Functional
Description: RAPID will enable offloading through the ThinkAir functionality by marking the code that can be executed remotely. RAPID has to guarantee that offloading operations are performed optimally, by selecting the adequate resources and devices. As certain aspects are monitored (such as some QoS aspects), these will also be taken into account to re-formulate the resource allocation if potential issues are detected.
Purpose: In certain cases, it is necessary to offload part of an application's code for execution on more capable machines. It is important that this offloading is done adequately, so that performance increases.
Use Cases: All.
Validation scenario: In each scenario, applications will offload certain parts of their code in a smart way, so that QoS requirements are fulfilled. This will be validated by HT3D_CPU_01, HT3D_CPU_02, BIOS_NET_01, BIOS_NET_02, HT3D_NET_03 and HT3D_NET_04.
Related WPs: WP5, WP6
Components: RAPID infrastructure, ThinkAir Server and ThinkAir Clients
Relationships: GEN_SYS_01, GEN_SYS_02, GEN_SYS_03, GEN_SYS_04
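The core trade-off behind smart offloading is whether remote execution plus transfer beats local execution, given the monitored network figures. The sketch below is a simplified cost model of our own devising, not ThinkAir's actual decision engine; the parameter names are hypothetical.

```python
def should_offload(local_exec_ms: float, remote_exec_ms: float,
                   payload_mb: float, link: dict) -> bool:
    """Offload only if transfer time plus remote execution beats local execution.

    `link` carries monitored network figures, e.g.
    {"bandwidth_mbps": 50, "latency_ms": 20}.
    """
    # Transfer time: payload over the link, plus a round trip of latency.
    transfer_ms = (payload_mb * 8 / link["bandwidth_mbps"]) * 1000 \
                  + 2 * link["latency_ms"]
    return transfer_ms + remote_exec_ms < local_exec_ms

# A slow local task over a decent link: offloading pays off.
fast_link = {"bandwidth_mbps": 80, "latency_ms": 10}
```

Because the inputs come from monitoring, re-evaluating this predicate periodically is one way to "re-formulate" the allocation when link conditions degrade.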


References

[1] FFmpeg: https://www.ffmpeg.org/
[2] Libav, open source audio and video processing tools: http://libav.org/
[3] Wireshark: https://www.wireshark.org/
[4] Mark Schneider and Charles Stevens, "Development and Testing of a New Magnetic-Tracking Device for Image Guidance", SPIE Medical Imaging, pages 65090I–65090I–11, 2007.
[5] Robert Y. Wang and Jovan Popovic, "Real-Time Hand-Tracking With a Color Glove", ACM Transactions on Graphics, 28(3):1, July 2009.
[6] Nikolaos Kyriazis, Iason Oikonomidis and Antonis A. Argyros, "A GPU-powered Computational Framework for Efficient 3D Model-based Vision", Technical Report TR420, ICS-FORTH, July 2011.
[7] P. Padeleris, X. Zabulis and A. Argyros, "Head Pose Estimation on Depth Data Based on Particle Swarm Optimization", HAU3D, 2012.
[8] A. Shimada, K. Kondo, D. Deguchi, G. Morin and H. Stern, "Kitchen Scene Context Based Gesture Recognition", ICPR, 2012.
[9] Matthias Wölfel, "Kinetic Space", http://code.google.com/p/kineticspace/
[10] Alexey Kurakin, Zhengyou Zhang and Zicheng Liu, "A Real-Time System for Dynamic Hand Gesture Recognition with a Depth Sensor", EUSIPCO, 2012.
[11] Kinect 3D Hand Tracking, first prize at the CHALEARN Gesture Recognition competition, 11 Nov 2012.
[12] ClamAV Antivirus: http://www.clamav.net
[13] Sokol Kosta, Andrius Aucinas, Pan Hui, Richard Mortier and Xinwen Zhang, "ThinkAir: Dynamic Resource Allocation and Parallel Execution in Cloud for Mobile Code Offloading", IEEE INFOCOM, pages 945–953, 2012.
[14] GVirtuS: https://code.google.com/p/gvirtus/
[15] RAPID project description: http://www.rapid-project.eu/project-description.html