benchmarks for analyzing embedded multicore processor

Leveling the Field for Multicore Open Systems Architectures

Markus LevyPresident, EEMBC

President, Multicore Association

Analyzing the Multicore Ecosystem

• Embedded Microprocessor Benchmark Consortium® (EEMBC)

• Industry benchmarks since 1997• Tool for evaluating embedded processors,

compilers, systems• MultiBench for shredding multicore

processors

Enabling the Multicore Ecosystem

• Initial engagement began in May 2005• Industry-wide participation• Current efforts

• Communications APIs• Hypervisors• Multicore Programming Practices• Resource Management

Multicore Issues to Solve• Concurrent programming • Communications, synchronization, resource

management between/among cores• Performance analysis• Debugging• Distributed power management• OS virtualization• Modeling and simulation• Load balancing• Algorithm partitioning

Multicore Benchmarking Rules

• Do not rely on a single answer• Match your application requirements

– Small or large data sets– Few or many threads– Dependencies– OS overhead

Benchmarking Multicore – What’s Important?

• Measuring scalability• Memory and I/O bandwidth• Inter-core communications• OS scheduling support• Efficiency of synchronization• System-level functionality

EEMBC Multicore Strategy• Evaluation and future development of

scalable SMP architectures– Includes multicore and manycore

• Measure impact of parallelization and scalability across both data processing and computationally-intensive tasks

Some Results

Quad Core

1 2 4 8

Number of concurrent streams

64M-check-reassembly 64M-cmykw2 64M-rotatew2 64M-tcp-mixed

Huge drop in performance when oversubscribed Nice scaling on networking

only workloads

Some benchmarks plateau earlier then expected

MultiBench Results

Two Processor System Utilizing Single Memory Controller

QuadCore

Processor 1

QuadCore

Processor 2

DDR2Interface

Processors 1 and 2 must always arbitrate for memory via their front side bus connection through the North Bridge.

NorthBridge

Intel Front Side Bus

Dual Quads vs. Single Quad

1. Find max values for each workload2. Ratio of all scores

No Workloads Yield 2x Scaling

Two Processor System Utilizing Dual Memory Controllers

QuadCore

Processor 1

QuadCore

Processor 2

LinkDDR2Interface

DDR2Interface

Direct AccessShared Access

Doubly Shared Access

Two Processor System Utilizing Single Memory Controller

QuadCore

Processor 1

QuadCore

Processor 2

Link DDR2Interface

• Processor 1 must always access memory by traversing link to Processor 2• Requires arbitration to access Processor 2’s memory

• Processor 2 always has prioritized access to this memory since it is directly attached.

• Affinity can help performance

Dual Quad Cores –Single Memory Controller

Dual Quad Cores –Dual Memory Controllers

1 2 3 4 6 8 12 16 20

64M-cmykw2

64M-cmykw2-rotatew2

64M-rotatew2

1 2 3 4 6 8 12 16 20

64M-cmykw2

64M-cmykw2-rotatew2

64M-rotatew2

1 2 3 4 5 6 7 8 9

rotate-color-4M-90deg (DS)

rotate-color-4M-90deg (DD)

•Rotation by multiple workers cooperating to process a single image.•Each worker thread acquires slices and writes them to the output buffer.

•Potential bottlenecks related to memory interfaces and synchronization between worker threads.

Dual Quad Cores –Single Memory Controller

Dual Quad Cores –Dual Memory Controllers0

1 2 3 4 5 6 7 8 9

rotate-16x4Ms32w1rotate-16x4Ms32w2

1 2 3 4 5 6 7 8 9

rotate-16x4Ms32w4

rotate-16x4Ms32w8

The Multicore Association Roadmap

Communications- MCAPI: ultra-light weight

Resource Management - Memory management- Basic synchronization - Resource registration- Resource partitioning

Task Management-Task scheduling

The Four MCA Pillars

Virtualization (or OS)

Communication ResourceManagement

TaskManagement

Multicore SystemAdopted stds

MCA Foundation APIs

Value Added Functions• Languages

• Programming Models• Design Environments

• Application Generators• Benchmarks

Services•Load Balancing

•System Mgt.•Power Mgt.•Reliability

•Quality of Service

Why Virtualize?

• Hardware consolidation• Migration and hosting of legacy applications• Resource management and balancing• Faster provisioning• Fault tolerance

Virtualized Multicore System

Fractional CPU assignment for background tasks such as firmware update/system health monitor/power management

Benchmarking involves many system-level elements

HARDWARE PERIPHERALS

ARM CORE #1

TRANGO #1

ARM CORE #2

TRANGO #2

ARM CORE #3

TRANGO #3

ARM CORE #4

TRANGO #4

AppApp App AppAppApp App App

SMPSMP RTOS

AppApp App App

VPU A VPU B VPU C VPU D VPU E VPU F VPU G

VPU GVPU FVPU A VPU B VPU C VPU D VPU E

CORE #2 CORE #3 CORE #4 CORE #1

EEMBC HypermarkImportant Metrics

• Overall performance overhead of a hypervisor, i.e. CPU loading

• Static footprint (code and data size)• Jitter• Interrupt latency• Comparison to native performance

Multicore Programming Practices(MPP)

• Long term– Continue research into languages, methodologies, etc

• Short term– How today’s embedded C/C++ code may be written to be

“multicore ready”

• Influence of a group of like-minded methodology experts to ensure completeness, usefulness and industry-wide compatibility

• Creation of a standard “best practices” guide through a recognized, neutral industry body– Based on capturing current best practices

MPP Scope & Approach

Sequential C/C++

Architecture 1e.g. Shared

Memory

Architecture 2e.g. Programmer Managed Memory

Architecture 3e.g. Message

Passing

• Focus on existing C/C++ without extensions, targeting current architectures

• Draw up framework of common pitfalls when transitioning from serial to parallel

• Consider solutions or avoidance tactics

• Analyze performance of solution

Summary

• Performance analysis highlights benefits and pitfalls

• EEMBC licensing and membership

• Portability prevents entrapment• Multicore Association standards and

working groups

benchmarks for analyzing embedded multicore processor

Documents

bs6724 multicore

videocard benchmarks - warsaw university of technology ·...

from a to e: analyzing tpcs oltp benchmarks pınar tözün...

heterogeneous multicore

dynatrace performance benchmarks€¦ · performance...

multicore application debugging multicore debugging

benchmarks ramon zatarain. index benchmarks and benchmarking...

analyzing performance asymmetric multicore processors for...

grade specific benchmarks: - shelby county schools middle...

multicore function

analyzing performance of multicore applications in...

multicore processors: challenges, opportunities, emerging...

bs7846 multicore

multicore- und gpgpu- architekturen€¦ · multicore- und...

from a to e: analyzing tpc’s oltp benchmarks

tax benchmarks and variations statement · web viewtax...

multicore computing

iec60502 multicore

computing performance benchmarks among cpu, … performance...

intel multicore