
Real-Time and Embedded Systems

Wrocław 2013

Lecture 7

ARM Cores, Part 3

Plan

• ARM11

• Cortex-A5

• Cortex-A9

• Cortex-A15

• big.LITTLE processing

• Cortex-A50

ARM11

Source: [1]

ARM11

• The ARM11™ processor family powers many of the smartphones in production today.

• It is also widely used in consumer, home, and embedded applications.

• ARM11 delivers extremely low power consumption and a range of performance from 350 MHz in small-area designs up to 1 GHz in speed-optimized designs.

• ARM11 processor software is compatible with all previous generations of ARM processors.

• ARM11 introduces 32-bit SIMD for media processing.
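To make "32-bit SIMD" concrete: the ARMv6 media instructions treat one 32-bit register as several narrow lanes that are processed at once. The sketch below emulates in plain Python the result of a lane-wise unsigned 8-bit add, in the spirit of the ARMv6 `UADD8` instruction. The function name `uadd8` and the sample operands are illustrative assumptions, and the status flags that the real instruction updates are not modelled.

```python
def uadd8(a: int, b: int) -> int:
    """Add two 32-bit words as four independent unsigned 8-bit lanes."""
    result = 0
    for lane in range(4):
        shift = 8 * lane
        x = (a >> shift) & 0xFF
        y = (b >> shift) & 0xFF
        # Each lane wraps modulo 256; a carry never spills into the next lane.
        result |= ((x + y) & 0xFF) << shift
    return result


packed = uadd8(0x01FF7F80, 0x01010180)
print(f"{packed:#010x}")
```

A single instruction of this kind replaces four separate byte additions, which is where the gain for pixel and audio sample processing comes from.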

ARM11 features summary

• Main features:

– Powerful ARMv6 instruction set architecture

– ARM Thumb® instruction set reduces memory bandwidth and size requirements by up to 35%

– ARM Jazelle® technology for efficient embedded Java execution

– ARM DSP extensions

– SIMD (Single Instruction Multiple Data) media processing extensions deliver up to 2x performance for video processing


• Main features:

– ARM TrustZone® technology for on-chip security foundation (ARM1176JZ-S and ARM1176JZF-S processors)

– Low power consumption:

• 0.6 mW/MHz (0.13 µm, 1.2 V) including cache controllers

• Energy-saving power-down modes address static leakage currents in advanced processes

– High-performance integer processor:

• 8-stage integer pipeline delivers high clock frequency (9 stages for ARM1156T2(F)-S)

• Separate load-store and arithmetic pipelines

• Branch prediction and return stack


• Main features:

– Thumb-2 technology (ARM1156T2(F)-S only) for enhanced performance, energy efficiency and code density

– High-performance memory system design

• Supports 4-64 KB cache sizes

• Optional tightly coupled memories with DMA for multimedia applications

• High-performance 64-bit memory system speeds data access for media processing and networking applications

• ARMv6 memory system architecture accelerates OS context switches


• Main features:

– Vectored interrupt interface and low-interrupt-latency mode speed up interrupt response and improve real-time performance

– Optional Vector Floating Point coprocessor for automotive/industrial controls and 3D graphics acceleration (ARM1136JF-S, ARM1176JZF-S and ARM1156T2F-S processors)

ARM11 – core types

ARM1176

• ARM1176JZ(F)-S and ARM11 MPCore™ Processors:

– Designed for use as applications processors in consumer and wireless products.

– Both processors feature the ARMv6 instruction set architecture, with media processing extensions, ARM Jazelle® technology, and ARM Thumb® for compact code.

ARM1176

• ARM1176JZ(F)-S and ARM11 MPCore™ Processors:

– In the ARM11 processor family, only the ARM1176JZ(F)-S processor has ARM TrustZone™ technology. TrustZone technology provides support within the CPU and platform architecture for building the trusted computing environments required to protect critical system functions from downloaded applications, to enforce copyright protection of downloaded media, and to enable safe over-the-air system upgrades.

ARM1136

• ARM1136J(F)-S Processor:

– Designed for use as an applications processor; includes many features of the ARM1176JZ(F)-S processor

– Does not include AMBA® 3 AXI™ bus or TrustZone.

– Some users implement the ARM1136J(F)-S processor for compatibility with existing AMBA AHB bus peripherals from their ARM9 processor-based SoC designs

– AMBA AHB to AXI fabric enables simpler migration of AHB bus peripherals to ARM1176JZ(F)-S processor-based designs.

– Software-compatible migration path to latest generation ARM Cortex-A class processors.

ARM1156

• ARM1156T2-S Processor:

– First ARM processor to incorporate ARM Thumb-2 technology for even higher code density and instruction set efficiency.

– Thumb-2 technology uses 31 percent less memory than pure 32-bit code to reduce system cost, while delivering up to 38 percent better performance than existing Thumb technology.

– These processors also feature optional parity protection for caches and Tightly Coupled Memories (TCM), and non-maskable interrupts, making them ideal for embedded control applications where high reliability or high availability are paramount.

ARM1156

• ARM1156T2-S Processor:

– The ARM1156T2-S processors feature an enhanced Memory Protection Unit (MPU) and offer an ideal upgrade path for embedded control applications currently using ARM946E-S, ARM966E-S or older 16-bit processors.

– These processors feature AMBA 3 AXI specification interfaces, offering higher system bus bandwidth with fewer bus layers and rapid timing closure.

– Software-compatible migration path to latest generation ARM Cortex-R class processors

Cortex vs ARM9 vs ARM11

Jazelle

Jazelle

• Main features:

– Jazelle technology for acceleration of execution environments

– Jazelle is a combined hardware and software solution:

• the software is a full-featured multi-tasking Java Virtual Machine (JVM), highly optimized to take advantage of the Jazelle architecture extensions available in many ARM processor cores

• hardware support depends on the silicon vendor

– Jazelle architecture extensions deliver high-performance applications and games, fast start-up and application switching with a very low memory and power budget

Jazelle

• Main features:

– High-efficiency Java bytecode execution, >1000 Caffeine Marks @ 200 MHz

– Ultra-low Java system cost

– Low power consumption for battery-operated wireless embedded devices

– Single-chip MCU, DSP and Java solution

– Integrated into a number of ARM CPU cores

– Rapid ASIC or ASSP integration with reduced time-to-market

Jazelle

Jazelle – layer model

Jazelle – how it works

• A third instruction set: Java bytecode (besides ARM and Thumb)

• A new Java processor state (indicated by the J bit in the CPSR)

• Switching between Java state and the other states is simple and fast

• Interrupts are handled as normal and cause an immediate return from Java state to ARM state to run the interrupt handler. At the end of the interrupt routine, the normal return mechanism returns the processor to Java state
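As a rough illustration of how the J bit selects the instruction set, the sketch below decodes the J and T bits of a CPSR value. It assumes the usual bit positions (J at bit 24, T at bit 5); the function name and the sample CPSR values are made up for the example.

```python
J_BIT = 1 << 24  # CPSR.J: Jazelle (Java bytecode) state
T_BIT = 1 << 5   # CPSR.T: Thumb state


def instruction_set_state(cpsr: int) -> str:
    """Return the instruction set state encoded by the J and T bits of a CPSR value."""
    j = bool(cpsr & J_BIT)
    t = bool(cpsr & T_BIT)
    if not j and not t:
        return "ARM"
    if not j and t:
        return "Thumb"
    if j and not t:
        return "Jazelle"
    # J=1, T=1 is defined as ThumbEE on ARMv7; it is not a Jazelle state.
    return "ThumbEE"


for value in (0x00000010, 0x00000030, 0x01000010):
    print(f"{value:#010x} -> {instruction_set_state(value)}")
```

Because the state lives in the CPSR, it is saved into the SPSR on exception entry and restored on return, which is why an interrupt handler can run in ARM state and still resume Java execution afterwards.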

Jazelle – interrupts

Jazelle – registers

Vector Floating Point

VFP - architecture

• The ARM floating-point architecture (VFP) provides hardware support for floating-point operations in half-, single- and double-precision floating-point arithmetic

• There have been three main versions of VFP to date:

– VFPv1 is obsolete

– VFPv2 is an optional extension to the ARM instruction set in the ARMv5TE, ARMv5TEJ and ARMv6 architectures

– VFPv3 is an optional extension to the ARM instruction set in the ARMv7 architecture

VFP9 - coprocessor

• The VFP9-S synthesizable Vector Floating Point (VFP) coprocessor is compatible with all of the ARM9E cores

• The support code has two components:

– a library of routines which perform unimplemented functions (such as transcendental functions)

– some supported functions (such as division) and a set of exception handlers for processing exception conditions

VFP9 - coprocessor

• Features:

– ARM VFPv2 ISA

– 16 double-precision or 32 single-precision registers

– Full IEEE 754 compliance with ARM support code

– Run-Fast mode for near IEEE 754 compliance (hardware only)

– Binary compatible with VFP10 and VFP11

– Portable to any process with supporting tools and cell library

– 100-130K gates

– 1.3 MFLOPS/MHz

– Area <1.0 mm² on TSMC 0.13 µm G

– 180-210 MHz (worst case) on TSMC 0.13 µm G

– <0.4 mW/MHz (typical) power consumption on TSMC 0.13 µm G
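The "16 double-precision or 32 single-precision registers" entry describes one register bank seen through two views: each double register Dn occupies the same storage as the single registers S(2n) and S(2n+1). The toy model below illustrates that aliasing with Python's `struct` module. The class and method names are hypothetical, and it assumes the low word of Dn corresponds to S(2n).

```python
import struct


class VfpRegisterBank:
    """Toy model of a VFPv2 register bank: 128 bytes viewed as S0-S31 or D0-D15."""

    def __init__(self) -> None:
        self.raw = bytearray(32 * 4)

    def write_d(self, n: int, value: float) -> None:
        """Store a double-precision value into Dn (bytes 8n .. 8n+7)."""
        struct.pack_into("<d", self.raw, 8 * n, value)

    def read_d(self, n: int) -> float:
        return struct.unpack_from("<d", self.raw, 8 * n)[0]

    def write_s(self, n: int, value: float) -> None:
        """Store a single-precision value into Sn (bytes 4n .. 4n+3)."""
        struct.pack_into("<f", self.raw, 4 * n, value)

    def read_s_bits(self, n: int) -> int:
        """Return the raw 32-bit pattern currently held in Sn."""
        return struct.unpack_from("<I", self.raw, 4 * n)[0]


bank = VfpRegisterBank()
bank.write_d(1, 1.0)  # D1 overlaps S2 (low word) and S3 (high word)
print(hex(bank.read_s_bits(2)), hex(bank.read_s_bits(3)))
```

Writing S3 afterwards would corrupt the value in D1, which is why compilers and hand-written assembly have to track both views when allocating registers.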

VFP10 - coprocessor

• The VFP10-S synthesizable Vector Floating Point (VFP) coprocessor is compatible with all of the ARM10E cores

• The support code has two components:

– a library of routines which perform unimplemented functions (such as transcendental functions)

– some supported functions (such as division) and a set of exception handlers for processing exception conditions

VFP10 - coprocessor

• Features:

– ISA is ARM VFPv2

– 16 double-precision or 32 single-precision registers

– Large independent register file with 64-bit LD/ST interface

– Full IEEE 754 compliance with ARM support code

– Run-Fast mode for near IEEE 754 compliance (hardware only)

– Binary compatible with VFP9 and VFP11

– Scalar and vector operation support (ideal for FP DSP)

– Parallel LD/ST, FMAC, and DIV/SQRT execution engines

– 2.0 MFLOPS/MHz

– Area ~1.16 mm² on TSMC 0.13 µm LV

– Up to 325 MHz (worst case) on TSMC 0.13 µm LV

– <0.4 mW/MHz (typical) power consumption on TSMC 0.13 µm LV

VFP10 - coprocessor

• VFP10 Instruction Set (VFPv2):

– Arithmetic:

• Add, Sub, Mult, Neg-Mult, Negate, Abs Value, Compare, Div, Square Root

– FMAC (single and double versions):

• Multiply-Add, Multiply-Subtract, Neg-Multiply-Add, Neg-Multiply-Subtract

– Type conversions

– Load/Store scalars and vectors, 64 bits per cycle
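The four FMAC-class operations differ only in the signs applied to the accumulator and to the product. The sketch below spells them out as plain Python functions. The mnemonic-to-formula mapping follows the VFPv2 naming (FMAC, FNMAC, FMSC, FNMSC) as recalled here and should be checked against the architecture manual. Python floats are double precision, and each `*` and `+` rounds separately, which matches a non-fused multiply-accumulate.

```python
def fmac(fd: float, fn: float, fm: float) -> float:
    """Multiply-Add: Fd = Fd + Fn * Fm."""
    return fd + fn * fm


def fnmac(fd: float, fn: float, fm: float) -> float:
    """Negated Multiply-Add: Fd = Fd - Fn * Fm."""
    return fd - fn * fm


def fmsc(fd: float, fn: float, fm: float) -> float:
    """Multiply-Subtract: Fd = -Fd + Fn * Fm."""
    return -fd + fn * fm


def fnmsc(fd: float, fn: float, fm: float) -> float:
    """Negated Multiply-Subtract: Fd = -Fd - Fn * Fm."""
    return -fd - fn * fm


acc, a, b = 1.0, 2.0, 3.0
for op in (fmac, fnmac, fmsc, fnmsc):
    print(op.__name__, op(acc, a, b))
```

A dot product or FIR filter inner loop is essentially a chain of `fmac` calls on the same accumulator, which is why this pipeline gets its own execution engine.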

VFP11 - coprocessor

• The VFP11 synthesizable Vector Floating Point (VFP) coprocessor is compatible with all of the ARM11 cores (VFPv2 compatible)

• The VFP11 coprocessor is optimized for:

– high data transfer bandwidth through 64-bit split load and store buses

– fast hardware execution of a high percentage of operations on normalized data, resulting in higher overall performance while providing full IEEE 754 standard support when required

– hardware divide and square root operations in parallel with other arithmetic operations to reduce the impact of long-latency operations

– near IEEE 754 standard compatibility in RunFast mode without support code assistance, providing determinable run-time calculations for all input data

VFP11 - coprocessor

• The VFP11 coprocessor has three separate instruction pipelines:

– the Multiply and Accumulate (FMAC) pipeline

– the Divide and Square root (DS) pipeline

– the Load/Store (LS) pipeline.

• Each pipeline can operate independently of the other pipelines and in parallel with them

• The three pipelines share the first two pipeline stages, Decode and Issue

VFP11 - coprocessor

• More than one instruction can be completed per cycle.

• Instructions issued to the FMAC pipeline can complete out of order with respect to operations in the LS and DS pipelines

• Except for divide and square root operations, the pipelines support single-cycle throughput for all single-precision operations and most double-precision operations

VFP11 - coprocessor

• Double-precision multiply and multiply-accumulate operations have a two-cycle throughput.

• The LS pipeline is capable of supplying two single-precision operands or one double-precision operand per cycle, balancing the data transfer capability with the operand requirements.

FMAC Pipeline

VFPv3 FPU

• The VFPv3 version of the FPU can be found in Cortex-A processors

• The FPU features are:

– support for single-precision and double-precision floating-point formats

– support for conversion between half-precision and single-precision

– operation latencies reduced for most operations in single-precision and double-precision

– high data transfer bandwidth through 64-bit split load and store buses

– completion of load transfers can be performed out-of-order

– normalized and denormalized data are all handled in hardware

– trapless operation enabling fast execution

– support for speculative execution

– low power consumption with high-level clock gating and small die size.

VFPv3 FPU

• Unlike VFPv2 implementations, the VFPv3 implementation provides:

– fixed-point to floating-point conversion instructions and floating-point constant loads

– IEEE half-precision and alternative half-precision format support

– trapless exception support.
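Half-precision is a storage format: values are converted to single precision for arithmetic and back for storage. The sketch below shows what the IEEE half-precision conversion does to a value, using the `e` format code of Python's standard `struct` module. The helper names are illustrative, and the alternative half-precision format mentioned above is not modelled.

```python
import struct


def float_to_half_bits(value: float) -> int:
    """Round a Python float to IEEE 754 half precision and return its 16-bit pattern."""
    return struct.unpack("<H", struct.pack("<e", value))[0]


def half_bits_to_float(bits: int) -> float:
    """Expand a 16-bit IEEE 754 half-precision pattern back to a Python float."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]


for x in (1.0, 0.1, 65504.0):
    bits = float_to_half_bits(x)
    print(f"{x!r} -> {bits:#06x} -> {half_bits_to_float(bits)!r}")
```

With only 10 fraction bits, 0.1 cannot be stored exactly and comes back slightly smaller, which is acceptable for pixel or sensor data but not for accumulations.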

Trust Zone

Trust Zone

• ARM TrustZone® technology is a system-wide approach to security on high-performance computing platforms for a huge array of applications including secure payment, digital rights management (DRM), enterprise and web-based services

• TrustZone technology, tightly integrated into Cortex™-A and ARM1176 processors, extends throughout the system via the AMBA® AXI™ bus and specific TrustZone System IP blocks

• It is possible to secure peripherals such as secure memory, crypto blocks, keyboard and screen to ensure they can be protected from software attack.

Trust Zone - hardware

• The security of the system is achieved by partitioning all of the SoC hardware and software resources so that they exist in one of two worlds:

– the Secure world for the security subsystem

– the Normal world for everything else

• Hardware logic present in the TrustZone-enabled AMBA3 AXI™ bus fabric ensures that Normal world components do not access Secure world resources, enabling construction of a strong perimeter boundary between the two
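The access rule enforced by the bus fabric can be pictured as a simple check on every transaction: a Non-secure access to a Secure region is rejected, and everything else is allowed. The sketch below is a toy model of that rule only. The `Region` type, the memory map and the addresses are invented for the example and do not describe any real SoC.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Region:
    name: str
    base: int
    size: int
    secure: bool  # True if the region belongs to the Secure world


def access_allowed(regions: Iterable[Region], address: int, non_secure: bool) -> bool:
    """Return True if a bus access to `address` is permitted.

    `non_secure` models the security attribute carried with the transaction:
    True for Normal world masters, False for Secure world masters.
    """
    for region in regions:
        if region.base <= address < region.base + region.size:
            # Only one combination is forbidden: Normal world -> Secure resource.
            return not (region.secure and non_secure)
    return False  # unmapped addresses are rejected in this toy model


# Hypothetical memory map, for illustration only.
MEMORY_MAP = [
    Region("dram", 0x8000_0000, 0x1000_0000, secure=False),
    Region("secure_sram", 0x0010_0000, 0x0001_0000, secure=True),
    Region("crypto_block", 0x4000_0000, 0x0000_1000, secure=True),
]

print(access_allowed(MEMORY_MAP, 0x0010_0040, non_secure=True))
print(access_allowed(MEMORY_MAP, 0x0010_0040, non_secure=False))
```

Note the asymmetry: the Secure world can still read and write Normal world memory, which is how it exchanges data with the rich OS.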

Trust Zone - hardware

Trust Zone - hardware

• The TrustZone hardware architecture extensions enable a single physical processor core to execute code safely and efficiently from both the Normal world and the Secure world in a time-sliced fashion

• This removes the need for a dedicated security processor core, which saves silicon area and power

• The final aspect of the TrustZone hardware

architecture is a security-aware debug

infrastructure that can enable control over

access to Secure world debug, without

Trust Zone - software

• The implementation of a Secure world in the SoC hardware requires some secure software to run within it and to make use of the sensitive assets stored there

• There are many possible secure software architectures:

– The most advanced is a dedicated Secure world operating system

– The simplest is a synchronous library of code placed in the Secure world

Trust Zone - software

Cortex-A

Cortex-A

Application Examples for Cortex-A Processors

Cortex-A

• The ARM Cortex™-A series of applications processors provides a range of solutions:

– for devices hosting a rich OS platform

– for devices hosting user applications:

• ultra-low-cost handsets

• smartphones

• mobile computing platforms

• digital TV and set-top boxes

• enterprise networking

• printers

• server solutions

• etc.

Cortex-A

• All Cortex-A processors share a common architecture and feature set:

– ARMv7-A architecture

– Support for full operating systems (Symbian, Android, Ubuntu, etc.)

– Instruction set support: ARM, Thumb-2, Thumb, Jazelle®, DSP

– TrustZone® Security Extensions

– Advanced single-precision and double-precision floating-point support

– NEON™ media processing engine

Cortex-A features summary

Cortex-A5

Cortex-A5

• Main features:

– Architecture: ARMv7-A

– 1.57 DMIPS/MHz per core

– Single or multicore versions available (1-4 cores)

– MMU

– ARM/Thumb/Thumb-2

– TrustZone technology

– Configurable L1 caches (4-64 KB)

Cortex-A5

• Main features:

– NEON Media Processing Engine (MPE): extends the Cortex-A5 Floating Point Unit (FPU) with an additional register set supporting a rich set of SIMD operations over 8, 16, and 32-bit integer and 32-bit floating-point data types

– VFPv4-D16

– Jazelle

– AXI bus (over 3x the memory bandwidth of ARM1176JZ-S)

– Advanced multicore technologies

– pipeline with dynamic branch prediction

Cortex-A5

Cortex-A5

• Configurable options:

Cortex-A5

• Multicore technologies:

– SCU – Snoop Control Unit - central intelligence responsible for managing:

• interconnect,

• arbitration,

• communication,

• cache-to-cache and system memory transfers,

• cache coherence

• other capabilities for all multicore technology enabled processors

– ACP – Accelerator Coherence Port – AMBA 3 AXI compatible slave interface on the SCU providing an interconnect point for a range of system masters that, for reasons of overall system performance, power consumption or software simplification, are better interfaced directly with the Cortex-A5 MPCore processor


– GIC – Generic Interrupt Controller - provides a rich and flexible approach to inter-processor communication and the routing and prioritisation of system interrupts

– Supports up to 224 independent interrupts under software control:

• each interrupt can be distributed across CPUs,

• hardware prioritised,

• routed between the operating system and TrustZone software management layer
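A small sketch of the two ideas above: the count of 224 interrupts and hardware prioritisation. It assumes the common GIC layout in which interrupt IDs 0-31 are reserved for per-CPU interrupts, leaving IDs 32-255 for shared peripheral interrupts, and the GIC convention that a numerically lower priority value means higher priority. The function name and the sample interrupt IDs are hypothetical.

```python
from typing import Dict, Optional

# Assumed ID layout: 0-31 are per-CPU interrupts, 32-255 are shared peripherals.
FIRST_SHARED_ID = 32
LAST_SHARED_ID = 255
SHARED_INTERRUPT_COUNT = LAST_SHARED_ID - FIRST_SHARED_ID + 1


def highest_priority_pending(pending: Dict[int, int]) -> Optional[int]:
    """Pick the pending interrupt to deliver next.

    `pending` maps interrupt ID -> priority value. A lower priority value
    wins; on a tie the lower interrupt ID wins.
    """
    if not pending:
        return None
    return min(pending, key=lambda irq: (pending[irq], irq))


print(SHARED_INTERRUPT_COUNT)
print(highest_priority_pending({40: 0xA0, 33: 0x20, 50: 0x20}))
```

In real hardware this selection happens in the interrupt distributor without software involvement; software only programs the priority and target CPU of each interrupt.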

Cortex-A8

Cortex-A8

• Main features:

– Architecture: ARM v7-A


– Single core versions available only

– MMU


– TrustZone technology

– NEON

– VFP v3

Cortex-A9

Cortex-A9

• Main features:



– Single or Multicore versions available (1-4 cores)

– MMU


– Jazelle

– DSP extension


– NEON MPE

– VFP v3

Cortex-A9

• Main features:

– superscalar, variable-length, out-of-order pipeline with dynamic branch prediction

– two 64-bit AXI master interfaces, with Master 0 for the data side bus and Master 1 for the instruction side bus

– support for advanced power management with up to 3 power domains

– support for the Preload Engine

Cortex-A9 - options

Cortex-A

Source: [3]

Cortex-A

Cortex-A9 – PLE

• PLE – Preload Engine - loads selected regions of memory into L2

• PLE FIFO available

ARM NEON

• The ARM® NEON™ general-purpose SIMD engine efficiently processes current and future multimedia formats, enhancing the user experience.

• NEON technology can accelerate multimedia and signal processing algorithms such as video encode/decode, 2D/3D graphics, gaming, audio and speech processing, image processing, telephony, and sound synthesis by at least 3x the performance of ARMv5 and at least 2x the performance of ARMv6 SIMD.

ARM NEON

Source: [2]

ARM NEON - features

• Supports a wide range of multimedia codecs:

– Many soft codec standards: MPEG-4, H.264, On2 VP6/7/8, Real, AVS, ...

– Ideal solution for normal-size "internet streaming" decode of various formats

– Not just for codecs - also applicable to 2D and 3D graphics and other vector processing

– Off-the-shelf tools, OS support, and ecosystem support

ARM NEON - features

• Fewer cycles needed than in previous versions:

– NEON will give a 60-150% performance boost on complex video codecs

– Individual simple DSP algorithms can show a larger performance boost (4x-8x)

– Processor can sleep sooner, resulting in overall dynamic power saving

ARM NEON - features

• SIMD and scalar single-precision floating-point computation

• scalar double-precision floating-point computation

• SIMD and scalar half-precision floating-point conversion

• SIMD 8, 16, 32, and 64-bit signed and unsigned integer computation

• 8 or 16-bit polynomial computation for single-bit coefficients
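"Polynomial computation for single-bit coefficients" means multiplication in which each operand is read as a polynomial over GF(2): partial products are combined with XOR instead of addition, so no carries propagate. The sketch below shows this for a single 8-bit lane and returns the full product, which is the kind of result a widening polynomial multiply such as `VMULL.P8` produces per lane. The function name is an assumption for the example.

```python
def poly_mul8(a: int, b: int) -> int:
    """Carry-less multiply of two 8-bit polynomials over GF(2).

    Bit i of an operand is the coefficient of x**i. The result can be up to
    15 bits wide.
    """
    if not (0 <= a <= 0xFF and 0 <= b <= 0xFF):
        raise ValueError("operands must be 8-bit values")
    product = 0
    for bit in range(8):
        if (b >> bit) & 1:
            # Adding polynomials over GF(2) is XOR, so carries never occur.
            product ^= a << bit
    return product


# (x + 1) * (x + 1) = x^2 + 1 over GF(2), because the 2x term cancels.
print(bin(poly_mul8(0b11, 0b11)))
```

This operation is the building block of CRC calculations and of finite-field arithmetic used in error-correcting codes and some cryptographic algorithms.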

ARM NEON - features

• structured data load capabilities

• large, shared register file, addressable as:

– thirty-two 32-bit S (single) registers

– thirty-two 64-bit D (double) registers

– sixteen 128-bit Q (quad) registers
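The three register views overlap: each Q register is a pair of D registers, and each of the lower sixteen D registers is a pair of S registers. The index arithmetic can be sketched as below; the helper names are illustrative, and the mapping assumes the standard layout in which Qn covers D(2n) and D(2n+1) and only D0-D15 have an S view.

```python
from typing import Optional, Tuple


def q_to_d(q: int) -> Tuple[int, int]:
    """Return the two D registers (low half, high half) that make up Qn."""
    if not 0 <= q < 16:
        raise ValueError("Q register index must be 0-15")
    return (2 * q, 2 * q + 1)


def d_to_s(d: int) -> Optional[Tuple[int, int]]:
    """Return the two S registers aliased by Dn, or None for D16-D31."""
    if not 0 <= d < 32:
        raise ValueError("D register index must be 0-31")
    if d >= 16:
        return None  # the upper half of the bank has no single-precision view
    return (2 * d, 2 * d + 1)


print(q_to_d(3), d_to_s(6), d_to_s(20))
```

Thirty-two S registers of 32 bits cover 128 bytes, while thirty-two D registers (or sixteen Q registers) cover the full 256 bytes, which is why the S view stops at D15.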

ARM NEON - Operations

• Data operations include:

– addition and subtraction

– multiplication with optional accumulation

– maximum or minimum value driven lane selection operations

– inverse square-root approximation

– comprehensive data-structure load instructions, including register-bank-resident table lookup

ARM NEON - features

• Other performance-boosting features:

– Aligned and unaligned data access allows for efficient vectorization of SIMD operations.

– Clean instruction set architecture designed for autovectorizing compilers and hand coding.

– Efficient access to packed arrays such as ARGB or xyz coordinates

– Support for both integer and floating-point operations ensures adaptability to a broad range of applications, from codecs to High Performance Computing to 3D graphics.

ARM NEON - features

• Other performance-boosting features:

– Tight coupling to the ARM processor provides a single instruction stream and a unified view of memory, presenting a single development platform target with a simpler tool flow.

– The large NEON register file with its dual 128-bit/64-bit views enables efficient handling of data and minimizes access to memory, enhancing data throughput.

Cortex-A15

Cortex-A15

• Main features:

– 1-4X SMP within a single processor cluster

– Multiple coherent SMP processor clusters through AMBA® 4 technology

– MMU

– ARM/Thumb-2

– DSP & SIMD extensions - increased performance

– NEON Advanced SIMD – increased performance

– VFP v4

Cortex-A15

• Main features:

– TrustZone technology

– Hardware virtualization support: highly efficient hardware support for data management and arbitration, whereby multiple software environments and their applications are able to simultaneously access the system capabilities

– Large Physical Address Extensions (LPAE): enables the processor to access up to 1 TB of memory
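The 1 TB figure follows directly from the address width: LPAE widens physical addresses from 32 to 40 bits. A minimal arithmetic check, with a helper name chosen for the example:

```python
def addressable_bytes(address_bits: int) -> int:
    """Number of distinct byte addresses for a given physical address width."""
    return 1 << address_bits


GIB = 1 << 30
TIB = 1 << 40

print(addressable_bytes(32) // GIB, "GiB with 32-bit physical addresses")
print(addressable_bytes(40) // TIB, "TiB with 40-bit physical addresses (LPAE)")
```

Virtual addresses remain 32 bits wide, so a single process still sees at most 4 GB at a time; the extra physical space benefits the system as a whole, for example several virtual machines.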

big.LITTLE processing


• Main features:

– Combines a "big" processor (Cortex-A15) with a "LITTLE", power-efficient processor (Cortex-A7)

– This combination makes it easier to reconcile high device performance (e.g. in a smartphone) with long battery life

– Both processors can have 1-4 cores and implement a single AMBA® 4 coherent interface


• Cortex-A7:

– Cortex-A7 is an in-order, non-symmetric dual-issue processor with a pipeline length of between 8 and 10 stages

• Cortex-A15:

– Cortex-A15 is an out-of-order sustained triple-issue processor with a pipeline length of between 15 and 24 stages

• Performance & Energy comparison

– the energy consumed by the execution of an instruction is partially related to the number of pipeline stages it must traverse


• Interconnection – System architecture


• Task migration

Cortex-A50 series

Cortex-A50 Series

• The Cortex-A50 Series is the latest range of processors based on the ARMv8 architecture

• The series includes support for a new energy-efficient 64-bit execution state (AArch64) that operates alongside an enhanced version of ARM's existing 32-bit execution state

• The Cortex-A50 Series comprises the A53 and A57 processors

Cortex-A50 Series

• Cortex-A50 series processors are 32-bit processors with 64-bit capability

• They deliver more performance for ARMv7 32-bit code in AArch32 execution state, and offer support for 64-bit data and a larger virtual addressing space in AArch64 execution state

• Clean interworking between 32-bit and 64-bit is supported

Cortex-A50 Series – Why 64-bits?

• An obvious reason for 64-bit is the support of more than 4 GB of physical memory

• In server and desktop applications, OS and application software are frequently 64-bit today

• Support for 64-bit in ARMv8 will enable ARM processors to become more broadly deployed in server and desktop applications, and will provide future-proof support for the eventual migration of 64-bit operating systems to mobile applications

Cortex-A50 Series – processors

• The series currently consists of two models: A53 and A57

• Both processors can operate

independently or be combined into an


• Both processors are fully compatible with extensive ARM software assets

Cortex-A53 series

Cortex-A53

• The Cortex-A53 processor is ARM's most efficient application processor ever

• This processor can deliver the compute power of today's high-end smartphone in the lowest power and area footprint, enabling all-day battery life for typical device uses

• Cortex-A53 efficiently runs legacy ARM 32-bit applications

Cortex-A53

• Cortex-A53 features cache coherent interoperability with ARM Mali™ family graphics processing units (GPUs)

• Cortex-A53 connects seamlessly to ARM interconnect in configurations of up to 16 cores, with more in the future

• Cortex-A53 offers optional reliability and scalability features for high-performance enterprise applications

Cortex-A53 - performance

Cortex-A57 series

Cortex-A57

• The Cortex-A57 processor is ARM's most advanced, high-performance application processor

• The Cortex-A57 processor efficiently runs legacy ARM 32-bit applications

• Optional reliability and scalability features for high-performance enterprise applications

Cortex-A57

• The Cortex-A57 processor features interoperability with ARM Mali™ family graphics processing units (GPUs) for GPU compute applications

• The Cortex-A57 processor connects seamlessly to ARM interconnect in configurations of up to 16 cores, with more in the future

Cortex-A57 - performance

Thank you for your attention

References

[1] ARM11 core documentation; www.arm.com

[2] www.arm.com

[3] ARM9 family documentation; www.arm.com
