performance optimization for mobile memory sub-systems


Performance Optimization for Mobile Memory Sub-Systems

Saira Malik MemCon 2015

What to Optimize / How to Optimize?

Need: Better QoS

Power

Bandwidth

Latency

Performance Exploration

Understanding Application Demands

Providing great Quality-Of-Service

Overview

§  ARM-based mobile SoC components and memory traffic requirements
§  Memory controller traffic optimization challenges
§  Role of memory controller and interconnect in performance optimization
§  Quality of Service (QoS) mechanisms
§  Performance results – block and system level
§  Summary and ongoing work

Example ARM-based Mobile Sub-system

[Figure: block diagram of an example ARM-based mobile sub-system. Labeled blocks: Cortex-A72 (big CPU), Cortex-A53 (little CPU), GIC-500 (interrupt controller), Mali-T880 GPU, Mali-V550 video, Mali-DP550 display, two MMU-500 memory management units, I/O coherent masters, peripherals, CoreLink CCI-500 (cache coherent interconnect with snoop filter), two HNC-Beckett blocks, four DMC-500 memory controllers (integrated TrustZone), LPDDR3/LPDDR4 DRAM, and ELA, STM, PTM and Archer blocks. The legend also lists a hybrid interconnect and a non-coherent interconnect.]

System Performance Analysis Flowchart

Define Use Cases
-  Mobile
-  Camera
-  Display/Video
-  CPU

Understand Agent Requirements / Analyze Workload (operation specific)

Performance contracts
-  Requirements
-  Loss profile
-  Latency sensitive
-  High BW
-  Power budget
-  Latency contracts
-  Avg. vs worst-case latency

Determine Traffic Profiles
-  Traffic nature
-  Address distribution
-  BW requirements

Characterize Agents

Memory Traffic Characterization

| Agent | Traffic Type | Addressing | Traffic Nature | Latency | Contract with System |
|---|---|---|---|---|---|
| CPU | Cache miss, write backs | Random | Low BW, persistent and urgent | Low | Low latency (LL) |
| GPU | Large packet compute or gaming data | Random, large contiguous access | Large bursts of traffic with quiet period | Variable | Min guaranteed bandwidth (HB) |
| Display | Continuous image data streaming | Longer bursts with spatial locality | Real time, non-disruptable traffic in large bursts with quiet period | Sensitive. Med-High (based on buffer size) | Max guaranteed latency for real-time operation (RT) |

Role of Memory Sub-System

[Figure: example ARM-based mobile sub-system block diagram (same blocks as on the earlier slide).]

Memory Controller Characteristics:
•  Fastest access to memory
•  Secure access
•  Max. system performance within a constrained energy budget
•  Max utilization of DRAM
   •  Highest possible bytes/ACT
   •  Parallelize access (multiple banks)
   •  Minimum turnaround time
   •  Per-bank refreshes
•  Obey DRAM timing parameters
•  Optimize for its critical resource (DRAM)
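
The bytes/ACT goal above can be made concrete with a minimal sketch. It assumes a simplified command trace of `(command, bank)` tuples and a fixed 32 bytes per read or write burst; the trace format, the burst size and the `bytes_per_act` helper are illustrative assumptions, not part of any DMC interface.

```python
from typing import Iterable, Tuple

BURST_BYTES = 32  # assumed bytes moved per RD/WR burst (illustrative value)

def bytes_per_act(trace: Iterable[Tuple[str, int]],
                  burst_bytes: int = BURST_BYTES) -> float:
    """Return data bytes transferred per row activation (ACT) in a trace."""
    acts = 0
    data_bytes = 0
    for cmd, _bank in trace:
        if cmd == "ACT":
            acts += 1
        elif cmd in ("RD", "WR"):
            data_bytes += burst_bytes
    if acts == 0:
        raise ValueError("trace contains no ACT command")
    return data_bytes / acts

# Poor locality: every access opens a new row.
scattered = [("ACT", 0), ("RD", 0), ("ACT", 1), ("RD", 1)]
# Good locality: four bursts served from one open row.
streaming = [("ACT", 0), ("RD", 0), ("RD", 0), ("RD", 0), ("RD", 0)]

print(bytes_per_act(scattered))  # 32.0
print(bytes_per_act(streaming))  # 128.0
```

A higher value means each row activation is amortized over more data, which is the sense in which the controller tries to maximize DRAM utilization.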

Memory Sub-System: SoC Challenges

•  Agents – different perf. requirements
•  Cannot treat all requests equally
•  Has an integrated role
•  Effective resource allocation – system intent
•  Close integration with system
•  Optimized with interconnect
•  Reduce bottlenecks
•  Should take informed decisions
•  Trade memory efficiency for system efficiency

Memory sub-system integrated in SoC: solve a multi-objective problem to satisfy all IP traffic types simultaneously in all conditions

System Performance Challenge/How to Solve?

§  How to efficiently arbitrate amongst agents that have different perf. requirements?
§  Apply QoS mechanisms
   §  Ensure traffic demands are met
   §  Avoid congestion
   §  Throttling to bound memory capacity
§  What are the steps involved?
   1.  Interpret system characteristics
   2.  Translate into quantitative attributes
   3.  Categorize traffic streams
   4.  Apply QoS mechanisms
   5.  Measure it
   6.  Desired system performance achieved? If yes, done; if no, change parameters and repeat
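
The measure-and-adjust loop in the steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the `tune_qos` helper, the single-agent adjustment policy and the toy latency model are all hypothetical, standing in for a real simulation or emulation run.

```python
from typing import Callable, Dict

def tune_qos(measure: Callable[[Dict[str, int]], float],
             target_latency: float,
             qv: Dict[str, int],
             agent: str,
             max_qv: int = 15) -> Dict[str, int]:
    """Raise one agent's Qv step by step until its latency meets the target."""
    qv = dict(qv)  # work on a copy so the caller's settings are untouched
    while measure(qv) > target_latency and qv[agent] < max_qv:
        qv[agent] += 1  # the "change parameters" branch of the flowchart
    return qv

# Toy stand-in for a measurement run: latency falls as the display Qv rises.
def toy_measure(qv: Dict[str, int]) -> float:
    return 200.0 - 10.0 * qv["display"]

result = tune_qos(toy_measure, target_latency=80.0,
                  qv={"cpu": 10, "display": 8}, agent="display")
print(result)  # {'cpu': 10, 'display': 12}
```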

Translation to QoS Attributes

§  Translate agent characteristics into measurable attributes
   §  Assign agents a priority value (Qv)
   §  Use AXI AxQoS signal
§  Each request entering DMC has Qv assigned
   §  Provides capability to understand priority
   §  DMC to honor Qv for arbitration
   §  Maps Q-values to a Qv band

| Fabric Master | Fabric QoS |
|---|---|
| Reserved | 15 |
| - | 14 |
| Disp-0 | 13 |
| Disp-1 | 12 |
| Video | 11 |
| Big-CPU/Little-CPU | 10 |
| GPU | 7 |

| Qv band | Qv range |
|---|---|
| High-High | 15 to 12 |
| High | 11 to 8 |
| Medium | 7 to 4 |
| Low | 3 to 0 |
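
As a minimal sketch, the Qv-to-band mapping on this slide can be written as a lookup. The master names and values are taken from the slide; the `qv_band` function and the dictionary name are illustrative, not an ARM API.

```python
# Fabric QoS values per master, as listed on the slide.
FABRIC_QOS = {
    "Disp-0": 13,
    "Disp-1": 12,
    "Video": 11,
    "Big-CPU/Little-CPU": 10,
    "GPU": 7,
}

def qv_band(qv: int) -> str:
    """Map a 4-bit Qv (0..15) to the band names used on the slide."""
    if not 0 <= qv <= 15:
        raise ValueError("Qv must be in the range 0..15")
    if qv >= 12:
        return "High-High"
    if qv >= 8:
        return "High"
    if qv >= 4:
        return "Medium"
    return "Low"

for master, qv in FABRIC_QOS.items():
    print(f"{master}: Qv={qv} -> {qv_band(qv)}")
```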

QoS Mechanisms and their Measurement

§  Key QoS mechanisms supported by ARM SoCs
   1.  Static QoS
   2.  Priority Escalation
   3.  Regulations
       a)  Macro-regulation based on system traffic
       b)  Micro-regulation via deadline scheduling
       c)  Complete regulation via QoSACCEPT protocol

§  Measurement Test Setup
   §  Unit Level Experiments
      §  Single DMC
      §  BFMs to inject equal traffic with different Qv
      §  Random traffic simulation
      §  Tested for un-congested and congested environments
   §  System Level Experiments
      §  Multiple DMC + CCI + rest of mobile sub-system
      §  RTL running in emulation environment with real-world traffic from standard benchmarks

§  Definitions
   §  Non-congested Environment: fair allocation of DMC buffer slots, fair allocation of DRAM bank resources
   §  Congested Environment: Qv is used to test for arbitration and resource allocation

1. Static QoS Arbitration

[Figure: measurement plots for a "Non-Congested System" and a "Congested System".]

Key Issue: Low priority requests can suffer starvation!
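
The starvation issue can be reproduced with a toy model. The sketch below is a simplification under stated assumptions (one request served per cycle, one new high-priority request arriving every cycle); it is not the DMC's arbiter, and all names are hypothetical.

```python
def static_arbitrate(queue):
    """Pick the pending request with the highest Qv; the oldest wins ties."""
    return max(queue, key=lambda r: (r["qv"], -r["arrival"]))

def simulate(cycles: int):
    """Serve one request per cycle while high-Qv traffic keeps arriving."""
    queue = [{"id": "low", "qv": 3, "arrival": 0}]
    served = []
    for t in range(cycles):
        # Congestion: a new high-priority request arrives every cycle.
        queue.append({"id": f"high{t}", "qv": 12, "arrival": t})
        winner = static_arbitrate(queue)
        queue.remove(winner)
        served.append(winner["id"])
    return served

served = simulate(5)
print(served)           # ['high0', 'high1', 'high2', 'high3', 'high4']
print("low" in served)  # False
```

With a fixed Qv, the low-priority request is never selected for as long as the high-priority stream persists.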

2. Priority Escalation

§  To permit liveness, Qv should be permitted to change throughout
§  Enable priority escalation
   §  Tracks aging on transactions
   §  Raises priority
   §  Prevents starvation
§  Measurements (Test Setup)

[Figure: measurement plot, "Congested System with Escalation".]

•  Priority escalation improves transaction latency response for transactions with low QoS settings

Key Issue: No guarantee for deadline requirements. No dynamic control.
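
A minimal sketch of aging-based escalation follows, using a toy congested queue (one request served per cycle, one new high-priority request per cycle). The escalation rate `age_step` and the helper names are assumptions made for illustration; the slides do not state how the DMC ages transactions.

```python
def effective_qv(req, now, age_step=4, max_qv=15):
    """Base Qv plus one level per `age_step` cycles waited, capped at max_qv."""
    age = now - req["arrival"]
    return min(max_qv, req["qv"] + age // age_step)

def simulate(cycles, age_step=4):
    """Serve one request per cycle, arbitrating on the escalated Qv."""
    queue = [{"id": "low", "qv": 3, "arrival": 0}]
    served = []
    for t in range(cycles):
        queue.append({"id": f"high{t}", "qv": 12, "arrival": t})
        winner = max(
            queue,
            key=lambda r: (effective_qv(r, t, age_step), -r["arrival"]),
        )
        queue.remove(winner)
        served.append(winner["id"])
    return served

served = simulate(40)
print(served.index("low"))  # 36
```

The low-priority request is eventually served once its escalated Qv matches the competing traffic, but the service time depends on the aging rate rather than on any deadline, which matches the key issue noted above.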

3. Qv Regulation

[Figure: example ARM-based mobile sub-system block diagram (same blocks as on the earlier slide), with the labels "Macro-Regulation" and "Micro-Regulation".]

§  Applies regulations at different levels
   §  Source (Agents/Interconnect)
   §  Destination (Memory sub-system)
§  Overrides source Qv based on system needs
§  Allows a transaction stream to modify its priority over time to achieve target service at minimal cost to the system

3a. Macro-Regulation at System Level: Example Scenario

[Figure: traffic classes Low Latency (LL), Real Time (RT) and High Bandwidth (HB) shown against the Qv bands High-High, High, Medium and Low.]

•  Low Latency traffic macro-regulated from H to M
   •  CPU swamps system: stops getting special treatment
•  Real Time is HH and regulated to manage deadline
   •  Get deterministic latency through network
   •  Differentiate deadline through Qv
•  High Bandwidth is L, macro-escalated to M
   •  Default to low priority
   •  Escalate if not achieving sufficient service

Approach can be used to infer state of system. Can set policy to achieve goals in context of system status and associated costs.
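
The example scenario can be sketched as a small policy function. The band transitions (LL from H to M, HB from L to M, RT at HH) come from the slide; the bandwidth-share thresholds `LL_BUDGET` and `HB_FLOOR` and the function itself are hypothetical, chosen only to make the sketch runnable.

```python
LL_BUDGET = 0.30  # assumed share above which the CPU is "swamping" the system
HB_FLOOR = 0.20   # assumed minimum share the high-bandwidth stream should get

def macro_regulate(traffic_class: str, bw_share: float) -> str:
    """Return the Qv band for a stream given its measured bandwidth share."""
    if traffic_class == "LL":
        # Demote H -> M once the stream exceeds its budget.
        return "Medium" if bw_share > LL_BUDGET else "High"
    if traffic_class == "HB":
        # Default to Low; escalate L -> M when service is insufficient.
        return "Medium" if bw_share < HB_FLOOR else "Low"
    if traffic_class == "RT":
        return "High-High"
    raise ValueError(f"unknown traffic class: {traffic_class}")

print(macro_regulate("LL", 0.10))  # High
print(macro_regulate("LL", 0.50))  # Medium
print(macro_regulate("HB", 0.05))  # Medium
print(macro_regulate("HB", 0.40))  # Low
```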

3b. Micro Regulation – DMC

[Figure: traffic classes Low Latency (LL), Real Time (RT) and High Bandwidth (HB) shown against the Qv bands High-High, High, Medium and Low.]

•  Real Time is remapped to M on ingress until it times out
   •  Remap based on timeout latency
•  Qv/QID to override and manage QoS over time

All requests are escalated over time to avoid starvation

Micro-Regulation via Deadline Scheduling

[Figure: "Deadline arbitration with inc. LL and RT Traffics"; series: Real Time, High BW, Low Latency; annotations: "Programmed Latency Deadline for RT Master" and "Latency of LL Master Remains Lower Prior to RT Deadline".]

•  Low latency (LL) transactions have better latency response than other transactions until the latency deadline approaches
•  Latency deadline forces near 100% of Real-time (RT) transactions to complete ahead of LL transactions
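
A minimal sketch of this deadline behavior is shown below. It assumes that an RT request held at Medium is restored to High-High once its age reaches a programmed deadline; the deadline value, the exact trigger point and the helper names are assumptions, since the slides do not give the DMC's actual remapping rule.

```python
RT_DEADLINE = 100  # assumed programmed latency deadline, in cycles
BAND_RANK = {"Low": 0, "Medium": 1, "High": 2, "High-High": 3}

def dmc_band(traffic_class: str, age: int, deadline: int = RT_DEADLINE) -> str:
    """Band used inside the controller for a request that has waited `age` cycles."""
    if traffic_class == "RT":
        # Remapped to Medium on ingress; raised to High-High once it times out.
        return "High-High" if age >= deadline else "Medium"
    if traffic_class == "LL":
        return "High"
    return "Low"  # HB default

def pick(requests, now):
    """Choose the request with the highest current band; the oldest wins ties."""
    return max(
        requests,
        key=lambda r: (BAND_RANK[dmc_band(r["cls"], now - r["arrival"])],
                       -r["arrival"]),
    )

reqs = [{"cls": "RT", "arrival": 0}, {"cls": "LL", "arrival": 50}]
print(pick(reqs, now=60)["cls"])   # LL
print(pick(reqs, now=100)["cls"])  # RT
```

Before the deadline the LL request wins arbitration; at the deadline the RT request overtakes it. A real controller would presumably act with some margin ahead of the deadline rather than exactly at it.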

3c. QoSAccept

[Figure: example ARM-based mobile sub-system block diagram (same blocks as on the earlier slide).]

§  Issue: CPU latency not bounded when Real Time traffic exceeds
§  Solution:
   §  QoSAccept threshold visibility from DMC to the interconnect
   §  Master (interconnect) decides which transactions to forward to slave (DMC) based on the QoS levels acceptable by slave

Ensures non-blocking access for real-time and minimizes CPU latency
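
The forwarding decision described above can be sketched as a simple filter. The rule that a transaction is forwardable when its Qv is at or above the advertised level is assumed here, and the `forwardable` helper and the transaction records are hypothetical. The threshold of 12 and the Qv values match those used elsewhere in this deck.

```python
def forwardable(pending, qos_accept: int):
    """Transactions the interconnect may forward: Qv at or above the accept level."""
    return [t for t in pending if t["qv"] >= qos_accept]

pending = [
    {"src": "CPU", "qv": 15},
    {"src": "GPU", "qv": 11},
    {"src": "Display", "qv": 13},
]

print([t["src"] for t in forwardable(pending, qos_accept=12)])  # ['CPU', 'Display']
print([t["src"] for t in forwardable(pending, qos_accept=0)])   # ['CPU', 'GPU', 'Display']
```

Holding back transactions below the threshold at the interconnect keeps controller buffers free for traffic the DMC is currently willing to service.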

Test Experiment on System with QoSAccept

Test Setup
§  1x CPUs and 3x GPUs, interleaved
§  Shareable traffic: CPU 100%, GPU 30%
§  Random address, read only, 0% hit
§  Used 3x IO masters to saturate DMC with RT-traffic
§  Threshold value for QoSACCEPT = 12

[Figure: bar chart "CCI Internal Latency for CPU accesses", in clock cycles; bars: "GPU - off", "GPU-on, GPU =11, CPU=11", "GPU-on, GPU =11, CPU=15", "All with QoSAccept"; annotations: 55% and 83%.]

| QoS Accept | GPU | GPU QoS Value | CPU QoS Value |
|---|---|---|---|
| OFF | OFF | - | 11 |
| OFF | ON | 11 | 11 |
| OFF | ON | 11 | 15 |
| ON | ON | 11 | 15 |

QoSACCEPT enables optimized CPU latency within interconnect with GPU traffic ON

Summary

§  Ongoing work
   §  Effect of enabling both macro and micro regulation on overall system efficiency
   §  Better QoS management by obtaining system heuristics
§  As a Systems-IP company we realize it's a system-level issue and not just a DMC issue.
§  We are trying to enable schemes and provide ways through our IPs to help our customers build better, more efficient high-performance products


Thank You

Performance runs on ARM RTL Emulation

[Figure: "Dual Display BW sweep"; y-axis: Bandwidth (%), x-axis: Total DPU BW (%); series: Total DPU, GPU_BW, LCPU_BW.]

[Figure: "Dual Display BW sweep"; y-axes: DPU FIFO Usage (%) and Master Rd Latency, x-axis: Total DPU BW (%); series: DPU0_RdLat, LCPU_RdLat, GPU_RdLat, DPU0_FIFO.]

•  Well managed DPU latency with heavy CPU and GPU traffic
•  GPU BW is impacted by CPU as expected (lower priority)
