Transcript
Page 1: ARM Cortex-A9  MPCore ™  processor

ARM Cortex-A9 MPCore™ processorPresented by- Chris Cai (xiaocai2)Rehana Tabassum (tabassu2)Sam Mussmann (mussmnn2)

Page 2: ARM Cortex-A9  MPCore ™  processor

Background“The architectural simplicity of ARM processors leads to very small implementations, and small implementations mean devices can have very low power consumption. Implementation size, performance, and very low power consumption are key attributes of the ARM architecture.”

ARM Architecture Reference Manual ARMv7-A edition

Page 3: ARM Cortex-A9  MPCore ™  processor

Background (2)ARM is RISC• Uniform register file• Load/store architecture• Simple addressing

Page 4: ARM Cortex-A9  MPCore ™  processor

Background (3)• The ARM Cortex-A9 processor is the high performance choice in a family of low power, cost-sensitive devices.• The Cortex-A9 microarchitecture is delivered either as a Cortex-A9 single core processor or a scalable multicore processor: the Cortex-A9 MPCore ™ processor

Page 5: ARM Cortex-A9  MPCore ™  processor

Where is it used? • Examples:

- Apple A5 (iPhone 4S, iPad 2, iPad mini)

http://en.wikipedia.org/wiki/ARM_Cortex-A9_MPCore#Implementationshttp://en.wikipedia.org/wiki/Iphone_4s

Page 6: ARM Cortex-A9  MPCore ™  processor

Where is it used? (2) • Examples:

- NVIDIA Tegra 2 (Motorola Xoom, Droid X2)

http://en.wikipedia.org/wiki/ARM_Cortex-A9_MPCore#Implementationshttp://en.wikipedia.org/wiki/Motorola_Xoom

Page 7: ARM Cortex-A9  MPCore ™  processor

Where is it used? (3) • Examples:

- PlayStation Vita

http://en.wikipedia.org/wiki/ARM_Cortex-A9_MPCore#Implementationshttp://en.wikipedia.org/wiki/PlayStation_Vita

Page 8: ARM Cortex-A9  MPCore ™  processor

What are its specs?• The Cortex A9 core:

- Gives 2.50 DMIPS/MHz/core (Dhrystone MIPS)- Generally clocked between 800MHz and 2GHz- Possible to run > 1GHz and < 250mW

http://arm.com/products/processors/cortex-a/cortex-a9.php?tab=Specificationshttp://www.linuxfordevices.com/c/a/News/ARM-spins-multicoreenabled-Cortex-core/

Page 9: ARM Cortex-A9  MPCore ™  processor

Presentation Overview• Micro-architecture• Memory System• Multi-core

Page 10: ARM Cortex-A9  MPCore ™  processor

Microarchitecture Overview• Variable length, out of order, superscalar pipeline– Two instructions are fetched in one cycle– Issue up to 4 instructions per cycle into:

• Primary data processing pipeline• Secondary data processing pipeline• Load-store pipeline • Compute engine (FPU/NEON) pipeline

• Speculative execution– Supporting virtual renaming of physical registers and removing pipelines stalls due to data dependencies

Page 11: ARM Cortex-A9  MPCore ™  processor

CortexA9 Microarchitecture

www.arm.com/files/pdf/armcortexa-9processors.pdf

Instruction Fetch

Decode

IssueRename Execute Writeback

Memory

Page 12: ARM Cortex-A9  MPCore ™  processor

Instruction Fetch• Instruction cache size: 16KB, 32KB, or 64KB• Superscalar pipeline: fetching two instructions at once• Branch Prediction:

– Global History Buffer: 1K ~ 16K entries– Branch-Target Address Cache: 512 ~ 4K entries– Return stack of 4 x 32 bits

• Fast-loop mode: instruction loop that are smaller than 64 bytes often complete without additional instruction cache accesses

Page 13: ARM Cortex-A9  MPCore ™  processor

Instruction Decode

• Super Scalar Decoder- Capable of decoding two full instructions per cycle

Page 14: ARM Cortex-A9  MPCore ™  processor

Rename

• Register Renaming- Resolving data dependencies and unroll small loops by hardware

Page 15: ARM Cortex-A9  MPCore ™  processor

Issue

• Issue can be fed maximum of 2 instructions per cycle• Issue can dispatch up to 4 instructions per cycle• Out of order selection of instructions from queue

Page 16: ARM Cortex-A9  MPCore ™  processor

Execute

• Variable length Executing Stage (1 ~ 3 cycles)- Most Instructions finish within 1 cycle- Instruction which folds shifts and rotates can take 3 cycles

• ADD r0, r1, r2 (1 cycle)• ADD r0, r1, r2 LSL #2 (2 cycle)

• Corresponds to a = b + (c << 2); • ADD r0, r1, r2 LSL r3 (3 cycle)

• Corresponds to a = b + (c << d);

Page 17: ARM Cortex-A9  MPCore ™  processor

Execute (2)• NEON Media Processing Engine

- NEON technology supports instructions targeted primarily at audio, video, 3D graphics, image and speech processing.

http://www.arm.com/files/pdf/AT_-_NEON_for_Multimedia_Applications.pdf

Page 18: ARM Cortex-A9  MPCore ™  processor

Execute (3)• What is NEON?

– NEON is a wide SIMD data processing architecture• 32 registers, 64 bit wide or 16 registers, 128 bit wide

– NEON instructions perform “Packed SIMD” processing • Registers can be considered as “vector” of same data type• Instructions perform the same operation in all lanes

http://www.arm.com/files/pdf/AT_-_NEON_for_Multimedia_Applications.pdf

Page 19: ARM Cortex-A9  MPCore ™  processor

Execute (4)• NEON Media Processing Engine supports vector computations on:

- half-precision (16bit), single-precision (32bit), double-precision (64bit) floating-point numbers- 8, 16, 32 and 64 bit signed and unsigned integers

• Supported Operations Include:- addition, subtraction, multiplication- maximum or minimum of a vector of operands- Inverse square-root approximation (y = x^-(1/2))- many more

Page 20: ARM Cortex-A9  MPCore ™  processor

Memory• dependent load-store instructions forwarded for resolution within memory system• 2-level TLB structure– micro TLB

• 32 entries on data side and 32 or 64 entries on instruction side • to reduce power consumed in translation and protection look-ups

– main TLB http://infocenter.arm.com/help/topic/com.arm.doc.ddi0388i/DDI0388I_cortex_a9_r4p1_trm.pdf

Page 21: ARM Cortex-A9  MPCore ™  processor

Memory (2)

• Data prefetcher– monitor cache line requests by processor and cache misses to determine how much data to prefetch– can prefetch up to 8 independent data streams– prefetch and allocate data in the L1 data cache, as long as it keeps hitting in the prefetched cache line– When stop prefetching?

Page 22: ARM Cortex-A9  MPCore ™  processor

Memory HierarchyCPU

Instruction Cache Data CacheCPU

Instruction Cache Data CacheCPU

Instruction Cache Data CacheCPU

Instruction Cache Data CacheSnoop Control Unit (SCU) Accelerator Coherence Port

L2 CacheMain Memory

Cortex A9 MPcore

http://infocenter.arm.com/help/topic/com.arm.doc.ddi0407i/DDI0407I_cortex_a9_mpcore_r4p1_trm.pdf

Page 23: ARM Cortex-A9  MPCore ™  processor

L1 caches• Non-unified

- 32 bytes line length - can be disabled independently

• 16, 32 or 64KB• 4 - way associative• support for Security Extensions• I cache: VIPT• D cache: PIPT

- reduce number of caches flushes and refills and save energy

http://infocenter.arm.com/help/topic/com.arm.doc.ddi0407i/DDI0407I_cortex_a9_mpcore_r4p1_trm.pdf

CPU

D$ I$

CPU

D$ I$

CPU

D$ I$

CPU

D$ I$

SCU ACP

L2 Cache

Main Memory

Cortex A9 MPcore

AXI RW 64-bit bus AXI RW 64-bit bus

Page 24: ARM Cortex-A9  MPCore ™  processor

L2 cache • shared, unified • Off-chip• 128KB to 8MB• 4 to 16-way associative

http://infocenter.arm.com/help/topic/com.arm.doc.ddi0407i/DDI0407I_cortex_a9_mpcore_r4p1_trm.pdf

CPU

D$ I$

CPU

D$ I$

CPU

D$ I$

CPU

D$ I$

SCU ACP

L2 Cache

Main Memory

Cortex A9 MPcore

AXI RW 64-bit bus AXI RW 64-bit bus

Page 25: ARM Cortex-A9  MPCore ™  processor

Snoop Control Unit• Integral part of cache memory systems • Connects processors to memory system through AXI interfaces

http://infocenter.arm.com/help/topic/com.arm.doc.ddi0407i/DDI0407I_cortex_a9_mpcore_r4p1_trm.pdf

CPUD$ I$

CPUD$ I$

CPUD$ I$

CPUD$ I$

SCU ACP

L2 CacheMain Memory

Cortex A9 MPcore

AXI RW 64-bit bus AXI RW 64-bit bus

Page 26: ARM Cortex-A9  MPCore ™  processor

• SCU functions :- maintain data cache coherency- initiate L2 memory accesses- arbitrate between processors’ simultaneous request for L2 accesses - manages accesses from ACP

• does not support instruction cache coherencyhttp://infocenter.arm.com/help/topic/com.arm.doc.ddi0407i/DDI0407I_cortex_a9_mpcore_r4p1_trm.pdf

Snoop Control Unit (1)

Page 27: ARM Cortex-A9  MPCore ™  processor

Accelerator Coherence Port

• optional AXI 64-bit slave port• allows to connect to non-cached system mastering peripherals and accelerators—DMA engine or cryptographic accelerator• SCU enforces memory coherency

http://www.arm.com/files/pdf/ARMCortexA-9Processors.pdf

Page 28: ARM Cortex-A9  MPCore ™  processor

Multi-Core

http://www.arm.com/files/pdf/ARMCortexA-9Processors.pdf

Page 29: ARM Cortex-A9  MPCore ™  processor

Cache Coherence – MESI

http://en.wikipedia.org/wiki/MESI_protocol

Page 30: ARM Cortex-A9  MPCore ™  processor

Cache Coherence – MESI (2)

System Level Benchmarking Analysis of the Cortex A9 MPCore Roberto Mijat, ARM Connected Community Technical Symposium, 2009

ARM MPCore has optimizations to MESI:• Duplicated tag RAMs

All done in the Snoop Control Unit

Page 31: ARM Cortex-A9  MPCore ™  processor

Cache Coherence – MESI (2)

System Level Benchmarking Analysis of the Cortex A9 MPCore Roberto Mijat, ARM Connected Community Technical Symposium, 2009

ARM MPCore has optimizations to MESI:• Duplicated tag RAMs• Cache-2-Cache transfer

All done in the Snoop Control Unit

Page 32: ARM Cortex-A9  MPCore ™  processor

Cache Coherence – MESI (2)

System Level Benchmarking Analysis of the Cortex A9 MPCore Roberto Mijat, ARM Connected Community Technical Symposium, 2009

ARM MPCore has optimizations to MESI:• Duplicated tag RAMs• Cache-2-Cache transfer• Migratory Lines

All done in the Snoop Control Unit

Page 33: ARM Cortex-A9  MPCore ™  processor

Generalized Interrupt Control

System Level Benchmarking Analysis of the Cortex A9 MPCore Roberto Mijat, ARM Connected Community Technical Symposium, 2009

• Which core services interrupts?• GIC gives the programmer control• Centralizes interrupts, then dispatches to individual core(s)

Page 34: ARM Cortex-A9  MPCore ™  processor

Top Related