RT and Embedded Systems
Wrocław 2013
Lecture 7
ARM Cores, Part 3
Plan
• ARM11
• Cortex-A5
• Cortex-A9
• Cortex-A15
• big.LITTLE processing
• Cortex-A50
ARM11
ARM11
Source: [1]
ARM11
• The ARM11™ processor family provides the engine that powers many smartphones in production today.
• It is also widely used in consumer, home, and embedded applications.
• ARM11 delivers extremely low power consumption and a range of performance from 350 MHz in small-area designs up to 1 GHz in speed-optimized designs.
• ARM11 processor software is compatible with all previous generations of ARM processors.
• ARM11 introduces 32-bit SIMD for media processing.
ARM11 features summary
• Main features:
– Powerful ARMv6 instruction set architecture
– ARM Thumb® instruction set reduces memory
bandwidth and size requirements by up to
35%
– ARM Jazelle® technology for efficient
embedded Java execution
– ARM DSP extensions
– SIMD (Single Instruction Multiple Data) media
processing extensions deliver up to 2x
performance for video processing
ARM11 features summary
• Main features:
– ARM TrustZone® technology for on-chip security foundation (ARM1176JZ-S and ARM1176JZF-S processors)
– Low power consumption:
• 0.6 mW/MHz (0.13 µm, 1.2 V) including cache controllers
• Energy-saving power-down modes address static leakage currents in advanced processes
– High-performance integer processor:
• 8-stage integer pipeline delivers high clock frequency (9 stages for ARM1156T2(F)-S)
• Separate load-store and arithmetic pipelines
• Branch prediction and return stack
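The 0.6 mW/MHz figure scales linearly with clock frequency, so a rough dynamic-power estimate is a single multiplication. A minimal sketch: the function name is hypothetical, and applying the 0.13 µm figure at 350 MHz (the low end of the ARM11 range quoted earlier) is an assumption made only for illustration.

```python
# Rough dynamic-power estimate from a mW/MHz figure.
# 0.6 mW/MHz is the slide's figure for 0.13 um, 1.2 V, including cache controllers.
MW_PER_MHZ = 0.6


def dynamic_power_mw(clock_mhz: float, mw_per_mhz: float = MW_PER_MHZ) -> float:
    """Return the estimated dynamic power in milliwatts (static leakage excluded)."""
    return clock_mhz * mw_per_mhz


if __name__ == "__main__":
    # 350 MHz is an assumed operating point, not a measured one.
    print(f"{dynamic_power_mw(350):.0f} mW at 350 MHz")
```

Static leakage, which the power-down modes address, is not covered by this estimate.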
ARM11 features summary
• Main features:
– Thumb-2 technology (ARM1156T2(F)-S only) for
enhanced performance, energy efficiency and code
density
– High performance memory system design
• Supports cache sizes from 4 to 64 KB
• Optional tightly coupled memories with DMA for multi-
media applications
• High-performance 64-bit memory system speeds data access
for media processing and networking applications
• ARMv6 memory system architecture accelerates OS context-
switch
ARM11 features summary
• Main features:
– Vectored interrupt interface and low-interrupt-latency mode speed up interrupt response and improve real-time performance
– Optional Vector Floating Point coprocessor for automotive/industrial controls and 3D graphics acceleration (ARM1136JF-S, ARM1176JZF-S and ARM1156T2F-S processors)
ARM11 – core types
ARM1176
• ARM1176JZ(F)-S and ARM11 MPCore™
Processors:
– Designed for use as applications processors in
consumer and wireless products.
– Both processors feature the ARMv6 instruction
set architecture, with media processing
extensions, ARM Jazelle® technology, and ARM
Thumb® for compact code.
ARM1176
• ARM1176JZ(F)-S and ARM11 MPCore™ Processors:
– In the ARM11 processor family, only the
ARM1176JZ(F)-S processor has ARM TrustZone™
technology. TrustZone technology provides support
within the CPU and platform architecture for building
the trusted computing environments required to
enable protection of critical system functions from
downloaded applications, copyright protection of
downloaded media, safe over-the-air system
upgrades.
ARM1136
• ARM1136J(F)-S Processor:
– Designed for use as an applications processor; includes many features of the ARM1176JZ(F)-S processor.
– Does not include the AMBA® 3 AXI™ bus or TrustZone.
– Some users implement the ARM1136J(F)-S processor for compatibility with existing AMBA AHB bus peripherals from their ARM9 processor-based SoC designs.
– AMBA AHB to AXI fabric enables simpler migration of AHB bus peripherals to ARM1176JZ(F)-S processor-based designs.
– Software-compatible migration path to latest generation ARM Cortex-A class processors.
ARM1156
• ARM1156T2-S Processor:
– First ARM processor to incorporate ARM Thumb-2 technology for even higher code density and instruction set efficiency.
– Thumb-2 technology uses 31 percent less memory than pure 32-bit code to reduce system cost, while delivering up to 38 percent better performance than existing Thumb technology.
– These processors also feature optional parity protection for caches and Tightly Coupled Memories (TCM), and non-maskable interrupts, making them ideal for embedded control applications where high reliability or high availability are paramount.
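The Thumb-2 percentages can be turned into a quick back-of-the-envelope calculation. The sketch below is only illustrative: the function names are hypothetical, and reading "38 percent better performance" as 1.38x throughput is an assumption.

```python
# Back-of-the-envelope Thumb-2 estimates using the slide's figures.
THUMB2_SIZE_REDUCTION = 0.31    # less memory than pure 32-bit ARM code
THUMB2_SPEEDUP_VS_THUMB = 0.38  # "up to" figure versus original Thumb


def thumb2_code_size(arm_bytes: int) -> int:
    """Estimated Thumb-2 image size for code that takes arm_bytes as 32-bit ARM."""
    return round(arm_bytes * (1 - THUMB2_SIZE_REDUCTION))


def thumb2_best_case_time(thumb_time: float) -> float:
    """Best-case run time, assuming 38% better performance means 1.38x throughput."""
    return thumb_time / (1 + THUMB2_SPEEDUP_VS_THUMB)


if __name__ == "__main__":
    print(thumb2_code_size(100_000), "bytes for a 100,000-byte ARM image")
    print(f"{thumb2_best_case_time(1.0):.3f} s for a 1 s Thumb workload")
```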
ARM1156
• ARM1156T2-S Processor:
– The ARM1156T2-S processors feature an enhanced
Memory Protection Unit (MPU) and offer an ideal
upgrade path for embedded control applications
currently using ARM946E-S, ARM966E-S or older 16-
bit processors.
– These processors feature AMBA 3 AXI specification
interfaces, offering higher system bus bandwidth
with fewer bus layers and rapid timing closure.
– Software-compatible migration path to latest
generation ARM Cortex-R class processors
Cortex vs ARM9 vs ARM11
Jazelle
Jazelle
• Main features:
– Jazelle technology for acceleration of execution
environments
– Jazelle is a combined hardware and software
solution:
• software is a full featured multi-tasking Java Virtual
Machine (JVM), highly optimized to take advantage of
Jazelle technology architecture extensions available in
many ARM processor cores
• hardware support depends on the silicon vendor
– Jazelle architecture extensions deliver high
performance applications and games, fast start-up
and application switching with a very low memory
and power budget
Jazelle
• Main features:
– High-efficiency Java bytecode execution, >1000
Caffeine Marks @ 200MHz
– Ultra-low Java system cost
– Low power consumption for battery operated
wireless embedded devices
– Single chip MCU, DSP and Java solution
– Integrated into a number of ARM CPU cores
– Rapid ASIC or ASSP integration with reduced time-to-
market
Jazelle
Jazelle – layer model
Jazelle – how it works
• A third instruction set: Java Byte Code (besides
ARM and Thumb)
• A new Java processor mode (with J bit in CPSR)
• The switching between Java mode and other
modes is very simple and fast
• Interrupts are handled as normal, and cause an
immediate return from Java state to ARM state
to run the interrupt handler. At the end of the
interrupt routine, the normal return mechanism
will return the processor to Java state
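The instruction-set state is encoded in two CPSR bits: T (bit 5) and J (bit 24). A minimal sketch of decoding them; the function name and the sample CPSR values are made up for illustration.

```python
# Decode the instruction-set state from a CPSR value.
CPSR_T_BIT = 1 << 5    # Thumb state bit
CPSR_J_BIT = 1 << 24   # Jazelle (Java) state bit


def instruction_set_state(cpsr: int) -> str:
    """Return the instruction-set state selected by the J and T bits."""
    j = bool(cpsr & CPSR_J_BIT)
    t = bool(cpsr & CPSR_T_BIT)
    if j and t:
        return "reserved/ThumbEE"  # reserved on ARMv6, ThumbEE on ARMv7
    if j:
        return "Jazelle"
    if t:
        return "Thumb"
    return "ARM"


if __name__ == "__main__":
    # Illustrative CPSR values (mode bits 0x10 = User mode).
    for value in (0x00000010, 0x00000030, 0x01000010):
        print(f"{value:#010x} -> {instruction_set_state(value)}")
```

An interrupt taken in Java state switches the core to ARM state for the handler. The saved status register keeps J set, so the normal exception return restores Java state.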
Jazelle – interrupts
Jazelle – registers
Vector Floating Point
VFP - architecture
• ARM Floating Point architecture (VFP) provides
hardware support for floating point
operations in half-, single- and double-precision
floating point arithmetic
• There have been three main versions of VFP to
date:
– VFPv1 is obsolete
– VFPv2 is an optional extension to the ARM
instruction set in the ARMv5TE, ARMv5TEJ and
ARMv6 architectures
– VFPv3 is an optional extension to the ARM
instruction set in the ARMv7 architecture
VFP9 - coprocessor
• VFP9-S synthesizable Vector Floating Point
(VFP) coprocessor is compatible with all of the
ARM9E cores
• The support code has two components:
– a library of routines which perform unimplemented functions
(such as transcendental functions)
– some supported functions (such as division) and a set of
exception handlers for processing exception conditions
VFP9 - coprocessor
• Features:
– ARM VFPv2 ISA
– 16 double-precision or 32 single-precision registers
– Full IEEE 754 compliance with ARM support code
– Run-Fast mode for near IEEE 754 compliance (hardware only)
– Binary compatible with VFP10 and VFP11
– Portable to any process with supporting tools and cell library
– 100-130K gates
– 1.3 MFLOPS/MHz
– Area <1.0 mm² (TSMC 0.13 µm G)
– 180-210 MHz (worst case, TSMC 0.13 µm G)
– <0.4 mW/MHz (typical) power consumption on TSMC 0.13 µm G
VFP10 - coprocessor
• VFP10-S synthesizable Vector Floating Point
(VFP) coprocessor is compatible with all of the
ARM10E cores
• The support code has two components:
– a library of routines which perform unimplemented functions
(such as transcendental functions)
– some supported functions (such as division) and a set of
exception handlers for processing exception conditions
VFP10 - coprocessor
• Features:
– ISA is ARM VFPv2
– 16 double-precision or 32 single-precision registers
– Large independent register file with 64-bit LD/ST interface
– Full IEEE 754 compliance with ARM support code
– Run-Fast mode for near IEEE 754 compliance (hardware only)
– Binary compatible with VFP9 and VFP11
– Scalar and vector operation support (ideal for FP DSP)
– Parallel LD/ST, FMAC, and DIV/SQRT execution engines
– 2.0 MFLOPS/MHz
– Area ~1.16 mm² (TSMC 0.13 µm LV)
– Up to 325 MHz (worst case, TSMC 0.13 µm LV)
– <0.4 mW/MHz (typical) power consumption on TSMC 0.13 µm LV
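Multiplying the per-MHz throughput by the worst-case clock gives a rough peak figure for each coprocessor. A minimal sketch using the VFP9 and VFP10 numbers quoted on the slides; the function name is hypothetical and sustained throughput on real code will be lower.

```python
# Peak floating-point throughput from MFLOPS/MHz and clock frequency.
def peak_mflops(mflops_per_mhz: float, clock_mhz: float) -> float:
    """Return peak MFLOPS at the given clock."""
    return mflops_per_mhz * clock_mhz


if __name__ == "__main__":
    # VFP9: 1.3 MFLOPS/MHz at 180-210 MHz (worst case)
    print(f"VFP9:  {peak_mflops(1.3, 180):.0f}-{peak_mflops(1.3, 210):.0f} MFLOPS")
    # VFP10: 2.0 MFLOPS/MHz at up to 325 MHz (worst case)
    print(f"VFP10: up to {peak_mflops(2.0, 325):.0f} MFLOPS")
```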
VFP10 - coprocessor
• VFP10 Instruction Set (VFPv2):
– Arithmetic:
• Add, Sub, Mult, Neg-Mult, Negate, Abs Value,
Compare, Div, Square Root
– FMAC (Single and double versions):
• Multiply-Add, Multiply-Subtract, Neg-Multiply-
Add, Neg-Multiply-Subtract
– Type conversions
– Load/Store scalars and vectors, 64-bits per
cycle
VFP11 - coprocessor
• The VFP11 synthesizable Vector Floating Point (VFP) coprocessor is compatible with all of the ARM11 cores (VFPv2 compatible)
• The VFP11 coprocessor is optimized for:
– high data transfer bandwidth through 64-bit split load and store buses
– fast hardware execution of a high percentage of operations on normalized data, resulting in higher overall performance while providing full IEEE 754 standard support when required
– hardware divide and square root operations in parallel with other arithmetic operations to reduce the impact of long-latency operations
– near IEEE 754 standard compatibility in RunFast mode without support code assistance, providing determinable run-time calculations for all input data
VFP11 - coprocessor
• The VFP11 coprocessor has three separate
instruction pipelines:
– the Multiply and Accumulate (FMAC) pipeline
– the Divide and Square root (DS) pipeline
– the Load/Store (LS) pipeline.
• Each pipeline can operate independently of the
other pipelines and in parallel with them
• Each of the three pipelines shares the first two
pipeline stages, Decode and Issue
VFP11 - coprocessor
• More than one instruction can complete per cycle.
• Instructions issued to the FMAC pipeline can
complete out of order with respect to
operations in the LS and DS pipelines
• Except for divide and square root operations,
the pipelines support single-cycle throughput
for all single-precision operations and most
double-precision operations
VFP11 - coprocessor
• Double-precision multiply and multiply and
accumulate operations have a two-cycle
throughput.
• The LS pipeline is capable of supplying two
single-precision operands or one double-
precision operand per cycle, balancing the data
transfer capability with the operand
requirements.
FMAC Pipeline
VFPv3 FPU
• The VFPv3 version of the FPU can be found in Cortex-A processors
• The FPU features are:
– support for single-precision and double-precision floating-point formats
– support for conversion between half-precision and single-precision
– operation latencies reduced for most operations in single-precision and
double-precision
– high data transfer bandwidth through 64-bit split load and store buses
– completion of load transfers can be performed out-of-order
– normalized and denormalized data are all handled in hardware
– trapless operation enabling fast execution
– support for speculative execution
– low power consumption with high level clock gating and small die size.
VFPv3 FPU
• Unlike VFPv2 implementations, the VFPv3
implementation provides:
– fixed-point to floating-point conversion instructions
and floating-point constant loads
– IEEE half-precision and alternative half-precision
format support
– trapless exception support.
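The effect of a conversion to half precision can be explored on a host machine. The sketch below uses Python's `struct` format `e`, which implements the IEEE half-precision format only (not ARM's alternative half-precision format) and converts from a Python double rather than a true single. The helper names are hypothetical.

```python
import struct


def to_half_bits(value: float) -> int:
    """Round value to IEEE half precision and return its 16-bit pattern."""
    return struct.unpack("<H", struct.pack("<e", value))[0]


def from_half_bits(bits: int) -> float:
    """Expand a 16-bit IEEE half-precision pattern back to a Python float."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]


if __name__ == "__main__":
    for x in (1.0, 0.1, 65504.0):
        bits = to_half_bits(x)
        print(f"{x!r:>8} -> {bits:#06x} -> {from_half_bits(bits)!r}")
```

Values such as 0.1 lose precision in the round trip, which is why half precision is normally used for storage and conversion rather than arithmetic.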
TrustZone
TrustZone
• ARM TrustZone® technology is a system-wide approach to security on high-performance computing platforms for a huge array of applications including secure payment, digital rights management (DRM), enterprise and web-based services
• TrustZone technology, tightly integrated into Cortex™-A and ARM1176 processors, extends throughout the system via the AMBA® AXI™ bus and specific TrustZone System IP blocks
• It is possible to secure peripherals such as secure memory,
crypto blocks, keyboard and screen to ensure they can be
protected from software attack.
TrustZone - hardware
• The security of the system is achieved by partitioning all of the SoC hardware and software resources so that they exist in one of two worlds:
– the Secure world for the security subsystem
– the Normal world for everything else
• Hardware logic present in the TrustZone-
enabled AMBA3 AXI™ bus fabric ensures that
Normal world components do not access Secure
world resources, enabling construction of a
strong perimeter boundary between the two
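The perimeter check can be pictured as a filter in the bus fabric that compares each transaction's security state with the security attribute of the addressed region. The toy model below is a sketch, not ARM's implementation. The `non_secure` flag plays the role of the non-secure bit that AXI transactions carry in AxPROT[1]; the class names, addresses and region sizes are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    base: int
    size: int
    secure: bool  # True if the region belongs to the Secure world

    def contains(self, addr: int) -> bool:
        return self.base <= addr < self.base + self.size


class BusFabric:
    """Toy address filter: Normal world transactions cannot reach Secure regions."""

    def __init__(self, regions):
        self.regions = list(regions)

    def access_allowed(self, addr: int, non_secure: bool) -> bool:
        for region in self.regions:
            if region.contains(addr):
                # Reject only the Normal-world access to a Secure region.
                return not (region.secure and non_secure)
        return False  # unmapped address: reject


if __name__ == "__main__":
    fabric = BusFabric([
        Region(base=0x0000, size=0x1000, secure=True),   # e.g. key storage
        Region(base=0x1000, size=0x1000, secure=False),  # ordinary memory
    ])
    print(fabric.access_allowed(0x0010, non_secure=True))   # Normal -> Secure
    print(fabric.access_allowed(0x0010, non_secure=False))  # Secure -> Secure
    print(fabric.access_allowed(0x1800, non_secure=True))   # Normal -> Normal
```

Secure-world masters may access both kinds of region in this model; only the Normal-to-Secure direction is blocked.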
TrustZone - hardware
TrustZone - hardware
• The TrustZone hardware architecture
extensions enable a single physical processor
core to execute code safely and efficiently
from both the Normal world and the Secure
world in a time-sliced fashion
• This removes the need for a dedicated security
processor core, which saves silicon area and
power
• The final aspect of the TrustZone hardware
architecture is a security-aware debug
infrastructure that can enable control over
access to Secure world debug, without
TrustZone - software
• The implementation of a Secure world in the
SoC hardware requires some secure software to
run within it and to make use of the sensitive
assets stored there
• There are many possible secure software
architectures:
– The most advanced is a dedicated Secure world
operating system
– The simplest is a synchronous library of code placed
in the Secure world
TrustZone - software
Cortex-A
Cortex-A
Application Examples for Cortex-A Processors
Cortex-A
• The ARM Cortex™-A series of applications processors provides a range of solutions:
– for devices hosting a rich OS platform
– for devices hosting user applications:
• ultra-low-cost handsets
• smartphones
• mobile computing platforms
• digital TV and set-top boxes
• enterprise networking
• printers
• server solutions
• etc.
Cortex-A
• All Cortex-A processors share a common architecture and feature set:
– ARMv7-A architecture
– Support for full operating systems (Symbian, Android, Ubuntu, etc.)
– Instruction set support: ARM, Thumb-2, Thumb, Jazelle®, DSP
– TrustZone® Security Extensions
– Advanced single-precision and double-precision floating-point support
– NEON™ media processing engine
Cortex-A features summary
Cortex-A5
Cortex-A5
• Main features:
– Architecture: ARM v7-A
– 1.57 DMIPS / MHz per core
– Single or multicore versions available (1-4 cores)
– MMU
– ARM/Thumb/Thumb-2
– TrustZone technology
– Configurable L1 caches (4-64 KB)
Cortex-A5
• Main features:
– NEON Media Processing Engine (MPE): extends the Cortex-A5 Floating Point Unit (FPU) with an additional register set supporting a rich set of SIMD operations over 8, 16, and 32-bit integer and 32-bit floating-point data types
– VFPv4-D16
– Jazelle
– AXI bus (more than 3x the memory bandwidth of ARM1176JZ-S)
– Advanced Multicore Technologies
– Pipeline with dynamic branch prediction
Cortex-A5
Cortex-A5
• Configurable options:
Cortex-A5
• Multicore technologies:
– SCU – Snoop Control Unit: central intelligence responsible for managing:
• interconnect,
• arbitration,
• communication,
• cache-to-cache and system memory transfers,
• cache coherence,
• other capabilities for all multicore-technology-enabled processors
Cortex-A5
• Multicore technologies:
– ACP – Accelerator Coherency Port: an AMBA 3 AXI compatible slave interface on the SCU, providing an interconnect point for a range of system masters that, for reasons of overall system performance, power consumption or software simplification, are better interfaced directly with the Cortex-A5 MPCore processor
Cortex-A5
• Multicore technologies:
– GIC – Generic Interrupt Controller: provides a rich and flexible approach to inter-processor communication and the routing and prioritisation of system interrupts
– Supports up to 224 independent interrupts under software control:
• each interrupt can be distributed across CPUs,
• hardware prioritised,
• routed between the operating system and the TrustZone software management layer
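Routing and prioritisation can be pictured as "pick the most urgent pending interrupt among those targeted at this CPU". The sketch below is a toy model, not the GIC register interface. It follows the GIC convention that a lower priority value means higher urgency; the class and function names, and the tie-break on the lowest interrupt ID, are assumptions of this sketch.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Interrupt:
    irq_id: int
    priority: int            # lower value = more urgent (GIC convention)
    target_cpus: frozenset   # CPUs this interrupt is routed to
    pending: bool = False


def highest_priority_pending(interrupts: Iterable[Interrupt], cpu: int) -> Optional[Interrupt]:
    """Return the most urgent pending interrupt routed to cpu, or None."""
    candidates = [i for i in interrupts if i.pending and cpu in i.target_cpus]
    # Tie-break on lowest ID is an assumption made for this model.
    return min(candidates, key=lambda i: (i.priority, i.irq_id), default=None)


if __name__ == "__main__":
    table = [
        Interrupt(irq_id=34, priority=0xA0, target_cpus=frozenset({0, 1}), pending=True),
        Interrupt(irq_id=40, priority=0x20, target_cpus=frozenset({1}), pending=True),
        Interrupt(irq_id=50, priority=0x10, target_cpus=frozenset({0}), pending=False),
    ]
    for cpu in (0, 1):
        chosen = highest_priority_pending(table, cpu)
        print(f"CPU{cpu}: IRQ {chosen.irq_id if chosen else None}")
```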
Cortex-A8
Cortex-A8
• Main features:
– Architecture: ARM v7-A
– 2.0 DMIPS / MHz per core
– Single core versions available only
– MMU
– ARM/Thumb/Thumb-2
– TrustZone technology
– NEON
– VFP v3
Cortex-A9
Cortex-A9
• Main features:
– Architecture: ARM v7-A
– 2.5 DMIPS / MHz per core
– Single or Multicore versions available (1-4 cores)
– MMU
– ARM/Thumb/Thumb-2
– Jazelle
– DSP extension
– Advanced Multicore Technologies
– NEON MPE
– VFP v3
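The DMIPS/MHz ratings quoted for the Cortex-A5, Cortex-A8 and Cortex-A9 allow a quick comparison at a common clock. A minimal sketch: the 1000 MHz clock is an arbitrary assumption, the function name is hypothetical, and the multicore figure is an ideal upper bound that ignores memory and coherence overheads.

```python
# DMIPS/MHz per core, as quoted on the slides.
DMIPS_PER_MHZ = {"Cortex-A5": 1.57, "Cortex-A8": 2.0, "Cortex-A9": 2.5}


def dmips(core: str, clock_mhz: float, cores: int = 1) -> float:
    """Ideal aggregate DMIPS for the given core type, clock and core count."""
    return DMIPS_PER_MHZ[core] * clock_mhz * cores


if __name__ == "__main__":
    clock = 1000  # MHz, assumed for comparison only
    for name in DMIPS_PER_MHZ:
        print(f"{name}: {dmips(name, clock):.0f} DMIPS per core at {clock} MHz")
    print(f"Quad Cortex-A9 (ideal): {dmips('Cortex-A9', clock, cores=4):.0f} DMIPS")
```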
Cortex-A9
• Main features:
– superscalar, variable length, out-of-order
pipeline with dynamic branch prediction
– two 64-bit AXI master interfaces with Master
0 for the data side bus and Master 1 for the
instruction side bus
– support for advanced power management
with up to 3 power domains
– Support for Preload Engine
Cortex-A9 - options
Cortex-A
Source: [3]
Cortex-A
Cortex-A9 – PLE
• PLE – Preload Engine - loads selected
regions of memory into L2
• PLE FIFO available
ARM NEON
• The ARM® NEON™ general-purpose SIMD engine efficiently processes current and future multimedia formats, enhancing the user experience.
• NEON technology can accelerate multimedia and signal processing algorithms such as video encode/decode, 2D/3D graphics, gaming, audio and speech processing, image processing, telephony, and sound synthesis by at least 3x the performance of ARMv5 and at least 2x the performance of ARMv6 SIMD.
ARM NEON
Source: [2]
ARM NEON - features
• Supports wide range of multimedia
codecs:
– Many soft codec standards: MPEG-4, H.264,
On2 VP6/7/8, Real, AVS.....
– Ideal solution for normal size "internet
streaming" decode of various formats
– Not just for codecs - also applicable to 2D
and 3D graphics and other vector processing
– Off the shelf tools, OS support, and
ecosystem support
ARM NEON - features
• Fewer cycles needed than in previous
versions:
– NEON will give 60-150% performance boost on
complex video codecs
– Individual simple DSP algorithms can show
larger performance boost (4x-8x)
– Processor can sleep sooner, resulting in
overall dynamic power saving
ARM NEON - features
• SIMD and scalar single-precision floating-point
computation
• scalar double-precision floating-point
computation
• SIMD and scalar half-precision floating-point
conversion
• SIMD 8, 16, 32, and 64-bit signed and unsigned
integer computation
• 8 or 16-bit polynomial computation for single-
bit coefficients
ARM NEON - features
• structured data load capabilities
• large, shared register file, addressable as:
– thirty-two 32-bit S (single) registers
– thirty-two 64-bit D (double) registers
– sixteen 128-bit Q (quad) registers
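The three register views alias the same storage: Qn overlays D(2n) and D(2n+1), and each of the first sixteen D registers overlays S(2n) and S(2n+1). The toy model below makes the overlap concrete with a 256-byte array; the class and method names are hypothetical.

```python
class NeonRegisterFile:
    """Toy model of the shared S/D/Q register views (little-endian layout)."""

    def __init__(self) -> None:
        self._bytes = bytearray(32 * 8)  # 32 D registers x 8 bytes

    def write_q(self, n: int, data: bytes) -> None:
        if not 0 <= n < 16 or len(data) != 16:
            raise ValueError("Q index must be 0-15 and data 16 bytes")
        self._bytes[16 * n:16 * n + 16] = data

    def read_d(self, n: int) -> bytes:
        if not 0 <= n < 32:
            raise ValueError("D index must be 0-31")
        return bytes(self._bytes[8 * n:8 * n + 8])

    def read_s(self, n: int) -> bytes:
        # The 32 S registers cover only the first 128 bytes (D0-D15).
        if not 0 <= n < 32:
            raise ValueError("S index must be 0-31")
        return bytes(self._bytes[4 * n:4 * n + 4])


if __name__ == "__main__":
    regs = NeonRegisterFile()
    regs.write_q(1, bytes(range(16)))
    print(regs.read_d(2).hex(), regs.read_d(3).hex(), regs.read_s(4).hex())
```

Writing Q1 changes D2, D3 and S4 to S7 in one go, which is why code mixing the views has to track the overlap.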
ARM NEON - Operations
• Data operations include:
– addition and subtraction
– multiplication with optional accumulation
– maximum or minimum value driven lane selection
operations
– inverse square-root approximation
– comprehensive data-structure load instructions,
including register-bank-resident table lookup
ARM NEON - features
• Other performance-boosting features:
– Aligned and unaligned data access allows for efficient vectorization of SIMD operations.
– Clean instruction set architecture designed for autovectorizing compilers and hand coding.
– Efficient access to packed arrays such as ARGB or xyz coordinates.
– Support for both integer and floating point operations ensures adaptability to a broad range of applications, from codecs to High Performance Computing to 3D graphics.
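Efficient access to packed arrays refers to structured loads such as VLD4, which split interleaved data into one register per channel. The sketch below mimics that de-interleaving in plain Python to show the data movement only; the function name is hypothetical and no NEON instructions are involved.

```python
def deinterleave(data: bytes, channels: int = 4) -> list:
    """Split interleaved samples (e.g. ARGBARGB...) into one bytes object per channel."""
    if len(data) % channels:
        raise ValueError("data length must be a multiple of the channel count")
    return [data[i::channels] for i in range(channels)]


if __name__ == "__main__":
    # Two ARGB pixels: (A=1, R=2, G=3, B=4) and (A=5, R=6, G=7, B=8)
    pixels = bytes([1, 2, 3, 4, 5, 6, 7, 8])
    alpha, red, green, blue = deinterleave(pixels)
    print(list(alpha), list(red), list(green), list(blue))
```

Once the channels are separated, the same operation can be applied to every lane of a channel at once, which is the layout SIMD arithmetic needs.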
ARM NEON - features
• Other performance-boosting features:
– Tight coupling to the ARM processor provides
a single instruction stream and a unified view
of memory, presenting a single development
platform target with a simpler tool flow.
– The large NEON register file with its dual
128-bit/64-bit views enables efficient
handling of data and minimizes access to
memory, enhancing data throughput.
Cortex-A15
Cortex-A15
• Main features:
– Architecture: ARM v7-A
– 1-4X SMP within a single processor cluster
– Multiple coherent SMP processor clusters through AMBA® 4 technology
– MMU
– ARM/Thumb-2
– DSP & SIMD extensions: increased performance
– Advanced Multicore Technologies
– NEON Advanced SIMD: increased performance
– VFP v4
Cortex-A15
• Main features:
– TrustZone technology
– Hardware virtualization support: highly efficient hardware support for data management and arbitration, whereby multiple software environments and their applications are able to simultaneously access the system capabilities
– Large Physical Address Extensions (LPAE): enables the processor to access up to 1 TB of memory
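The 1 TB figure follows from LPAE widening physical addresses from 32 to 40 bits. A short check of the arithmetic; the helper name is hypothetical.

```python
GIB = 1 << 30
TIB = 1 << 40


def addressable_bytes(address_bits: int) -> int:
    """Number of bytes reachable with the given physical address width."""
    return 1 << address_bits


if __name__ == "__main__":
    print(addressable_bytes(32) // GIB, "GiB with 32-bit physical addresses")
    print(addressable_bytes(40) // TIB, "TiB with 40-bit physical addresses (LPAE)")
```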
big.LITTLE processing
big.LITTLE processing
• Main features:
– Combines a "big" high-performance processor (Cortex-A15) with a "LITTLE" power-efficient processor (Cortex-A7)
– This combination makes it easier to reconcile high device performance (e.g. in a smartphone) with long battery life
– Both processors can have 1-4 cores, and each implements a single AMBA® 4 coherent interface
big.LITTLE processing
• Cortex-A7:
– Cortex-A7 is an in-order, non-symmetric dual-issue processor with a pipeline length of 8 to 10 stages
big.LITTLE processing
• Cortex-A15:
– Cortex-A15 is an out-of-order, sustained triple-issue processor with a pipeline length of 15 to 24 stages
big.LITTLE processing
• Performance & Energy comparison
– the energy consumed by the execution of an
instruction is partially related to the number
of pipeline stages it must traverse
big.LITTLE processing
• Interconnection – System architecture
big.LITTLE processing
• Task migration
Cortex-A50 series
Cortex-A50 Series
• The Cortex-A50 Series is the latest range
of processors based on the ARMv8
Architecture
• The series includes support for new
energy efficient 64-bit execution state
(AArch64) that operates alongside an
enhanced version of ARM’s existing 32-bit
execution state
• The Cortex-A50 Series comprises the A53
and A57 processors
Cortex-A50 Series
• Cortex-A50 series processors are 32-bit
processors with 64-bit capability
• They deliver more performance for ARMv7
32-bit code in AArch32 execution state,
and offer support for 64-bit data and
larger virtual addressing space in AArch64
execution state
• Clean interworking between 32-bit and
64-bit is supported
Cortex-A50 Series – Why 64-bits?
• An obvious reason for 64-bit is the support of
more than 4GB of physical memory
• In server and desktop applications, OS and
application software are frequently 64-bit today
• Support for 64-bit in ARMv8 will enable ARM
processors to become more broadly deployed in
server and desktop applications, and will
provide future-proof support for the eventual
migration of 64-bit operating systems to mobile
applications
Cortex-A50 Series – processors
• Series consists currently of two models:
A53 and A57
• Both processors can operate independently or be combined in a big.LITTLE configuration
• Both processors are fully compatible with
extensive ARM software assets
Cortex-A53 series
Cortex-A53
• The Cortex-A53 processor is ARM's most
efficient application processor ever
• This processor can deliver the compute
power of today's high-end smartphone, in
lowest power and area footprint, enabling
all-day battery life for typical device uses
• Cortex-A53 efficiently runs legacy ARM 32-
bit applications
Cortex-A53
• Cortex-A53 features cache coherent interoperability with ARM Mali™ family graphics processing units (GPUs)
• Cortex-A53 connects seamlessly to ARM interconnect in configurations of up to 16 cores, with more in the future
• Cortex-A53 offers optional reliability and scalability features for high-performance enterprise applications
Cortex-A53 - performance
Cortex-A57 series
Cortex-A57
• The Cortex-A57 processor is ARM's most
advanced, high-performance application
processor
• The Cortex-A57 processor efficiently runs
legacy ARM 32-bit applications
• Optional reliability and scalability
features for high-performance enterprise
applications
Cortex-A57
• The Cortex-A57 processor features
interoperability with ARM Mali™ family
graphics processing units (GPUs) for GPU
compute applications
• The Cortex-A57 processor connects seamlessly to ARM interconnect in configurations of up to 16 cores, with more in the future
Cortex-A57 - performance
Thank you for your attention
References
[1] ARM11 core documentation; www.arm.com
[2] www.arm.com
[3] ARM9 family documentation; www.arm.com