Post on 12-Sep-2021

1

Application Performance Analysis of the Cortex-A9 MPCore

Bryan Lawrence

This project in ARM is in part funded by ICT-eMuCo, a European project supported under the Seventh Framework Programme (7FP) for research and technological development

2

Agenda

Motivation

Experimentation platforms

Performance exploration of different application classes

Performance evaluation of multiple concurrent applications

Summary and conclusion

3

Phone ++ Upcoming Use Cases

Mobile Internet Browsing

Video conferencing

Gaming on the Go

Multi-player over 3G / 4G Network

3D Navigation

4

Mobile Phone Applications

Compute Intensive

5

Tablet Applications

Compute Intensive

6

Achieving Scalable Performance

Clock frequency of the processor is not the only metric of performance

Scalable, energy-efficient performance is required across devices, from mobile phones and tablets to large enterprise computing

Can multicore processors provide a solution?

7

Hardware Platforms

Versatile Express

ARM-NEC Cortex™-A9 processor test-chip at ~400MHz

Cortex-A9 x 4

4x NEON™/FPU

Individual 32KB I and D L1 caches

512KB L2 cache

1GB RAM (32b DDR2)

Early Partner Silicon

Cortex-A9 x 2 @ 1GHz

1GB RAM

8

Video Decode / Encode

Hardware encoders/decoders are common in consumer devices

Video/audio codec standards evolve rapidly

Many codecs are used too infrequently to justify dedicated hardware

Consumer applications involve other video processing, different from encode / decode (e.g. video editing)

Simultaneous encode / decode is required for video conferencing

9

H.264 Decode / Encode

FFmpeg used for decode

x264 library used with FFmpeg for video encode

CIF and VGA resolutions: commonly used in video conferencing

Movie trailers used: they require more computation than video conferencing streams

Compression factor of 100 - 200

10

H.264 Decode / Encode

Results for single core operation

Normalized logarithmic scales used

Encode is more compute intensive than decode (at least ~2-3 times)

Writing out decoded streams to secondary storage media is limited by media bandwidth

11

H.264 Decode / Encode

Concurrent video decode + encode

Important use case for video conferencing

Excellent scalability is observed up to all 4 cores

Encoding is at least 2-3 times more compute intensive than decoding

Ideally, more resources should be dedicated to encode
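
The point about dedicating more resources to encode can be sketched as a proportional core split. This is only an illustration: `split_cores` and its default 3:1 weighting are hypothetical, chosen from the "2-3 times" figure on the slide, and the presentation does not say how cores were actually assigned.

```python
def split_cores(total_cores, encode_weight=3.0, decode_weight=1.0):
    """Split cores between encode and decode in proportion to assumed cost.

    The weights are illustrative assumptions, not measured values.
    """
    if total_cores < 2:
        raise ValueError("need at least two cores to give each task its own")
    share = encode_weight / (encode_weight + decode_weight)
    encode_cores = round(total_cores * share)
    # Always leave at least one core for each task.
    encode_cores = max(1, min(total_cores - 1, encode_cores))
    return encode_cores, total_cores - encode_cores


if __name__ == "__main__":
    enc, dec = split_cores(4)
    print(f"encode cores: {enc}, decode cores: {dec}")
```

On a four-core platform such as the Versatile Express test chip, the default weights give three cores to encode and one to decode. In practice an operating system scheduler may make this decision dynamically.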

12

On2/Google VP8

libvpx library used for decoding VP8 (from the WebM project)

libvpx uses multi-threading and actively takes advantage of the parallelism available in the VP8 codec

Comparative results obtained on the Versatile Express and 1GHz dual-core platforms

13

On2/Google VP8

Shows good scalability with the number of cores

Scalability is relatively independent of the number of partitions in the video frame

Saturation is observed when the number of threads exceeds the number of cores

Designers can query the platform for the number of cores to determine the available parallelism

[Charts: 1GHz dual-core; Versatile Express]
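
Since saturation sets in once threads outnumber cores, one option is to cap the thread count at the core count. A minimal Python sketch, assuming the standard-library `os.cpu_count()` is an acceptable way to query the platform; `pick_thread_count` is a hypothetical helper and not part of libvpx:

```python
import os


def pick_thread_count(requested, cores=None):
    """Cap a requested thread count at the number of available cores."""
    if cores is None:
        # os.cpu_count() may return None if the count cannot be determined.
        cores = os.cpu_count() or 1
    return max(1, min(requested, cores))


if __name__ == "__main__":
    # Ask for 8 decoder threads; the result never exceeds the core count.
    print(pick_thread_count(8))
```

The `cores` parameter exists only so the cap can be exercised without depending on the machine the sketch runs on.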

14

Compilation - FFmpeg

Code compilation has inherent parallelism at the module level

Most build systems allow this parallelism to be exploited

E.g. make -j 4

Compilation of FFmpeg and the Linux kernel shown here

[Charts: 1GHz dual-core; Versatile Express]
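
Rather than hard-coding the job count in `make -j 4`, a build script could derive it from the platform. The following is a sketch under that assumption; `make_command` is a hypothetical helper, and the presentation does not describe how its builds were launched.

```python
import os


def make_command(jobs=None):
    """Build a `make -j N` argument list, defaulting N to the core count."""
    if jobs is None:
        # Fall back to a single job if the core count is unknown.
        jobs = os.cpu_count() or 1
    return ["make", "-j", str(jobs)]


if __name__ == "__main__":
    # To run the build from inside a source tree, pass this list to
    # subprocess.run(..., check=True); here it is only printed.
    print(" ".join(make_command()))
```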

15

Compilation – Linux Kernel

[Charts: 1GHz dual-core; Versatile Express]

Almost linear speed-up is observed with the number of cores in both cases

Effectively doubles (quadruples) the utilized memory bandwidth for 2 cores (4 cores)

16

Browsers

Browser benchmark using a collection of web pages similar to the mix found in common browsing

Speed-up of 1.54 times observed between single and dual core execution

The ‘webcore’ fraction of the pie grows for multicore execution

[Charts: normalized performance (1.54x); execution time decomposition]
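
One way to read the 1.54x figure is through Amdahl's law. The presentation does not use this model, so it is applied here only as an illustrative assumption. Inverting S = 1 / ((1 - p) + p / n) gives the fraction p of the work that would need to run in parallel on n cores to produce a speed-up S. `parallel_fraction` is a hypothetical helper name.

```python
def parallel_fraction(speedup, cores):
    """Invert Amdahl's law S = 1 / ((1 - p) + p / n) to solve for p."""
    if cores < 2:
        raise ValueError("cores must be at least 2")
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return (1.0 - 1.0 / speedup) * cores / (cores - 1)


if __name__ == "__main__":
    p = parallel_fraction(1.54, 2)
    print(f"implied parallel fraction: {p:.2f}")
```

Under this simplified model, a 1.54x dual-core speed-up corresponds to roughly 70% of the browser workload running in parallel. Real workloads also see cache and memory effects that the model ignores.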

17

Multiple Concurrent Applications

Multitasking is becoming mainstream in mobile devices today

Common combinations include

Browser + Audio playback (e.g. Internet Radio)

Browser + background download

Independent applications can benefit immensely from parallelization

18

Browser + Pandora Internet Radio

Speed-up factor of 1.9

Super-linear speed-up can sometimes be observed due to reduced cache pollution from conflicting applications

The speed-up can be traded for energy by slowing the cores down (depends on the fabrication process technology used)

[Charts: normalized performance (1.9x); execution time decomposition]

19

Browser + Internet File Download

Speed-up factor of 1.64x

A common use case involves downloading an app from an application store or marketplace while browsing the internet

Email synchronization in the background forms a similar use case

[Charts: normalized performance (1.64x); execution time decomposition]

20

Cortex-A9 MP Benefits – Performance

Normalized performance

Workload              1 Core  2 Core
Browser (single app)  1       1.54

21

Cortex-A9 MP Benefits – Richer Experience

Normalized performance

Workload              1 Core  2 Core
Browser (single app)  1       1.54
Browser + Pandora     0.78    1.50
Browser + Download    0.73    1.20

22

Cortex-A9 MP Benefits – Richer Experience

Normalized performance

Workload              1 Core  2 Core  Speed-up
Browser (single app)  1       1.54
Browser + Pandora     0.78    1.50    1.9x
Browser + Download    0.73    1.20    1.64x
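
The quoted speed-up factors follow from dividing the 2-core value by the 1-core value for each workload. A short Python check of that arithmetic, using only the normalized values on this slide (the dictionary layout and the `speedups` helper are illustrative):

```python
# Normalized performance from the chart: (1 core, 2 cores).
results = {
    "Browser (single app)": (1.00, 1.54),
    "Browser + Pandora": (0.78, 1.50),
    "Browser + Download": (0.73, 1.20),
}


def speedups(table):
    """Return the 2-core over 1-core ratio for each workload."""
    return {name: two / one for name, (one, two) in table.items()}


if __name__ == "__main__":
    for name, ratio in speedups(results).items():
        print(f"{name}: {ratio:.2f}x")
```

The ratios for the two concurrent workloads come out at about 1.92 and 1.64, in line with the 1.9x and 1.64x figures quoted on the slides.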

23

Summary and Conclusion

This presentation demonstrates the scalability of the ARM Cortex-A9 MPCore™ processor across various classes of applications on currently available software

Better power/performance can be achieved using an efficient, low-power ARM multicore processor than with a single processor at a much higher frequency

Next-generation software will make more intensive use of threads, and scalability will improve further

24

Thank You

Please visit www.arm.com for ARM related technical details

For any queries contact < Salesinfo-IN@arm.com >
