ARM Cortex-A9 MPCore™ processor


Post on 24-Feb-2016







ARM Cortex-A9 MPCore processor
Presented by:
  Chris Cai (xiaocai2)
  Rehana Tabassum (tabassu2)
  Sam Mussmann (mussmnn2)

Background
  The architectural simplicity of ARM processors leads to very small implementations, and small implementations mean devices can have very low power consumption. Implementation size, performance, and very low power consumption are key attributes of the ARM architecture.
  Source: ARM Architecture Reference Manual, ARMv7-A edition

Background (2)
ARM is RISC:
  Uniform register file
  Load/store architecture
  Simple addressing

Background (3)
  The ARM Cortex-A9 processor is the high-performance choice in a family of low-power, cost-sensitive devices.
  The Cortex-A9 microarchitecture is delivered either as a Cortex-A9 single-core processor or as a scalable multicore processor, the Cortex-A9 MPCore processor.

Where is it used?
  Examples: Apple A5 (iPhone 4S, iPad 2, iPad mini)

Where is it used? (2)
  Examples: NVIDIA Tegra 2 (Motorola Xoom, Droid X2)

Where is it used? (3)
  Examples: PlayStation Vita

What are its specs?
The Cortex-A9 core:
  Delivers 2.50 DMIPS/MHz per core (Dhrystone MIPS)
  Is generally clocked between 800 MHz and 2 GHz
  Can run above 1 GHz while drawing less than 250 mW

Overview
  Microarchitecture
  Memory system
  Multicore

Microarchitecture Overview
  Variable-length, out-of-order, superscalar pipeline
  Two instructions are fetched per cycle
  Up to 4 instructions are issued per cycle into:
    Primary data processing pipeline
    Secondary data processing pipeline
    Load-store pipeline
    Compute engine (FPU/NEON) pipeline

  Speculative execution: supports virtual renaming of physical registers, which removes pipeline stalls due to data dependencies
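The fetch and issue widths in the overview can be made concrete with a toy cycle-by-cycle model. This is only a sketch under loud assumptions: every instruction is pre-tagged with exactly one target pipeline, each pipeline accepts one instruction per cycle, data dependencies are ignored, and the pipeline labels and the `simulate` function are hypothetical names invented for illustration. It is not a model of the real Cortex-A9 issue logic.

```python
from collections import deque

FETCH_WIDTH = 2   # instructions fetched per cycle (from the overview)
ISSUE_WIDTH = 4   # maximum instructions issued per cycle (from the overview)

# Hypothetical labels for the four issue targets listed in the overview.
PIPELINES = ("primary", "secondary", "load_store", "compute")


def simulate(program, cycles):
    """Run a toy fetch/issue loop.

    program: list of (name, pipeline) tuples in program order.
    Returns one list per cycle holding the names issued in that cycle.
    """
    pending = deque(program)   # not yet fetched
    queue = []                 # fetched, waiting to issue (oldest first)
    issued_per_cycle = []
    for _ in range(cycles):
        busy = set()           # pipelines already used this cycle
        issued = []
        # Scan oldest to youngest; a younger instruction may issue
        # ahead of an older one whose pipeline is already busy.
        for instr in list(queue):
            name, pipe = instr
            if pipe not in busy and len(issued) < ISSUE_WIDTH:
                busy.add(pipe)
                issued.append(name)
                queue.remove(instr)
        issued_per_cycle.append(issued)
        # Fetch up to FETCH_WIDTH new instructions for later cycles.
        for _ in range(FETCH_WIDTH):
            if pending:
                queue.append(pending.popleft())
    return issued_per_cycle


program = [("a", "load_store"), ("b", "load_store"),
           ("c", "load_store"), ("d", "primary")]
trace = simulate(program, cycles=4)
```

In this run `trace` is `[[], ['a'], ['b', 'd'], ['c']]`: instruction `d` issues one cycle before the older `c`, because `c` has to wait for the load-store pipeline while the primary pipeline is free.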

Cortex-A9 Microarchitecture
  [Figure: pipeline diagram with stages labeled Fetch, Decode, Issue, Rename, Execute, Writeback, Memory]

Instruction Fetch
  Instruction cache size: 16 KB, 32 KB, or 64 KB
  Superscalar pipeline: fetches two instructions at once
  Branch prediction:
    Global History Buffer: 1K to 16K entries
    Branch Target Address Cache: 512 to 4K entries
    Return stack of 4 x 32 bits
  Fast-loop mode: instruction loops smaller than 64 bytes often complete without additional instruction cache accesses

  Accurate branch and return prediction reduces the number of incorrect instruction fetch and decode operations, which saves energy

  Fast-loop mode also saves energy, because small loops avoid additional instruction cache accesses
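To illustrate how a global history buffer and a small return stack can predict control flow, here is a generic textbook-style sketch. It is not the Cortex-A9's actual predictor: the indexing scheme (program counter XORed with recent branch outcomes), the 2-bit saturating counters, the overwrite-oldest policy on return stack overflow, and the class names `GlobalHistoryPredictor` and `ReturnStack` are all assumptions made for illustration. Only the table size range and the 4-entry stack depth come from the slide.

```python
class GlobalHistoryPredictor:
    """Toy predictor: 2-bit counters indexed by PC XOR global history."""

    def __init__(self, entries=1024, history_bits=10):
        assert entries & (entries - 1) == 0, "entries must be a power of two"
        self.mask = entries - 1
        self.history_mask = (1 << history_bits) - 1
        self.history = 0                # most recent branch outcomes
        self.counters = [1] * entries   # 0..3, start at weakly not-taken

    def _index(self, pc):
        # Assumed hashing: drop the low PC bits, mix in the history.
        return ((pc >> 2) ^ self.history) & self.mask

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        self.history = ((self.history << 1) | int(taken)) & self.history_mask


class ReturnStack:
    """Toy 4-entry return address stack (overflow policy is assumed)."""

    def __init__(self, depth=4):
        self.depth = depth
        self.entries = []

    def push(self, return_address):
        if len(self.entries) == self.depth:
            self.entries.pop(0)         # assumed: overwrite the oldest entry
        self.entries.append(return_address)

    def pop(self):
        # Predicted return target, or None when the stack is empty.
        return self.entries.pop() if self.entries else None


predictor = GlobalHistoryPredictor()
for _ in range(20):                     # a loop branch that is always taken
    predictor.update(0x1000, True)
loop_prediction = predictor.predict(0x1000)

stack = ReturnStack()
for address in (0x10, 0x20, 0x30, 0x40, 0x50):   # five nested calls
    stack.push(address)
returns = [stack.pop() for _ in range(5)]
```

After the warm-up, `loop_prediction` is `True`. With five nested calls the oldest return address no longer fits in the 4-entry stack, so `returns` ends with `None`, which stands for a return the stack cannot predict.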

Instruction Decode

  Superscalar decoder: capable of decoding two full instructions per cycle

Rename
  Register renaming: resolves data dependencies and unrolls small loops in hardware
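Register renaming as described above can be sketched with a simple mapping table. This is a minimal illustration, assuming an unbounded supply of physical registers (no free list or reclamation) and a hypothetical three-operand instruction format `(dest, src1, src2)`. The function name `rename` and the register counts are illustrative, not taken from the Cortex-A9 design.

```python
def rename(instructions, num_arch_regs=16):
    """Map architectural registers to fresh physical registers.

    instructions: list of (dest, src1, src2) architectural register numbers.
    Returns the same instructions expressed in physical register numbers.
    """
    # Initially each architectural register maps to the same-numbered
    # physical register.
    mapping = {r: r for r in range(num_arch_regs)}
    next_phys = num_arch_regs          # assumed: unlimited physical registers
    renamed = []
    for dest, src1, src2 in instructions:
        # Read the sources through the current mapping first ...
        p1, p2 = mapping[src1], mapping[src2]
        # ... then give the destination a brand-new physical register.
        mapping[dest] = next_phys
        renamed.append((next_phys, p1, p2))
        next_phys += 1
    return renamed


code = [
    (0, 1, 2),   # r0 = r1 op r2
    (3, 0, 4),   # r3 = r0 op r4   (true dependency on the first r0)
    (0, 5, 6),   # r0 = r5 op r6   (reuses the name r0)
]
renamed = rename(code)
```

Here `renamed` is `[(16, 1, 2), (17, 16, 4), (18, 5, 6)]`. The second instruction still reads physical register 16, so the true dependency is preserved. The third instruction writes physical register 18 and no longer conflicts with the earlier uses of r0. The same effect lets successive iterations of a small loop write to different physical registers, which is one way to read the slide's remark about unrolling loops in hardware.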


Issue
  Issue can be fed a maximum of 2 instructions per cycle
  Issue can dispatch up to 4 instructions per cycle
  Instructions are selected from the queue out of order

Execute

  Variable-length execute stage (1 to 3 cycles)
  Most instructions finish within 1 cycle
  Instructions that fold shifts and rotates can take up to 3 cycles
    ADD r0, r1, r2 (1 cycle)
    ADD r0, r1, r2, LSL #2 (2 cycles)
    Corresponds to a = b + (c