Introduction

The evolution of today’s system-on-chip (SoC) devices from uni-processor systems to heterogeneous multi-processor designs has added a significant burden to the SoC designer’s job. Designers are confronted with integrating many high-performance masters and slaves with dynamically changing traffic profiles.

Figure 1 illustrates how functions with real-time, maximum-latency requirements compete with high-bandwidth streaming traffic, along with CPUs that need minimum latency to reach optimum performance. Advanced system intellectual property (IP), the “glue” that provides the interconnect tying all of the major functional blocks together and connecting them to main memory, is required to help solve these competing requirements. Just as each system may have its own unique set of design challenges, system IP is, by its nature, highly configurable, allowing designers to choose the best configuration for their design. Advanced system IP not only allows designers to select interconnect topologies but also places solutions such as hardware-managed cache coherency and dynamic end-to-end quality of service at their disposal.

Analyzing and Debugging Performance Issues with Advanced ARM CoreLink System IP Components

By William Orme, Strategic Marketing Manager, ARM Ltd. and Nick Heaton, Senior Solutions Architect, Cadence

Finding the optimal configuration options that meet the requirements of a particular system requires complementary design tools to enable the designer to rapidly explore and correlate trade-offs in performance, power, and area (PPA). This paper describes the challenges confronting the designer and proposes a new tool leveraging ARM® and Cadence technology to overcome the challenges of today’s highly integrated, multi-processor system-on-chip (SoC) designs.

Contents

Introduction
Accelerating SoC Integration with CoreLink NIC-400 and Interconnect Workbench
Performance Implications of Interconnect Choices
Interconnect Design Choices
Verifying Latency
Push-Button Testbench Generation
And There’s More...
Conclusion

[Figure 1 diagram labels: DMA controller, comms control, apps processor, geometry processor, tiling renderer, audio codec, image transform, motion estimation, motion compensation, display controller, network interface, static and dynamic memory controllers, NAND flash, peripherals, and the interconnect. Traffic sources: CPU, GPU, HD video. Traffic classes: minimum bandwidth, maximum latency, minimum latency.]

Figure 1: Typical smartphone traffic

The configuration options the designer chooses need to satisfy a multi-dimensional problem affecting the performance of each function as well as the physical size and power dissipation.

Figure 2 shows a typical SoC core, which uses the ARM CoreLink™ System IP components connected to a Cadence® Databahn DDR controller.

[Figure 2 diagram labels: CoreLink™ CCI-400 Cache Coherent Interconnect (128 bit @ up to 0.5 Cortex-A15 frequency); NIC-400 (configurable: AXI4/AXI3/AHB/APB); MMU-400; ADB-400; Thin Link; ACE, ACE-Lite, and ACE-Lite + DVM interfaces; AXI4 interfaces; 128b links; Databahn IP; DDR PHY with LPDDR2 and DDR3 models.]

Figure 2: Typical SoC configuration


The sophistication of these system IP components, which is necessary to allow the designer to integrate many functions together, provides many choices to the designer. Finding the optimal configuration options that meet the requirements of a particular system requires complementary design tools to enable the designer to rapidly explore and correlate trade-offs in performance, power, and area (PPA). This paper describes the challenges confronting the designer and proposes a new tool to accelerate the integration of many SoC functions with an optimized system IP configuration.

Accelerating SoC Integration with CoreLink NIC-400 and Interconnect Workbench

To better understand the range of choices SoC designers face, let us look at one of the system IP components from ARM, the CoreLink NIC-400™. This is a highly configurable component that can have any number of master and slave interfaces, within practical limits, using any of the AMBA® 3 family of protocols (AHB™, AXI™, APB™) as well as AXI4™ from the AMBA 4 family. Each of these interfaces can have configurable width, address maps, and clock speeds. In addition, the user can configure the internal implementation of the IP to control the paths from master to slave and add mechanisms for bandwidth and latency management called Quality of Service (QoS) and QoS Virtual Networks (QVN).

Adding to this high configurability, the IP also allows the user to make additional choices to help with routing congestion and layout through a mechanism called “Thin Links.” In a complex SoC with hundreds of IP blocks, connecting them all to the main system memory can mean that an AMBA bus has to be routed across the chip, which is far from ideal for wide AMBA buses. Thin Links allow the user to create a point-to-point AMBA connection using only a few wires, thereby alleviating the routing problem.

This connection is a user configuration choice for each interface. In fact, the NIC-400 is so configurable that ARM provides CoreLink AMBA Designer, an interactive tool created specifically to make it easy for users to select implementation options. Figure 3 shows an example of using AMBA Designer to configure a complex NIC-400 interconnect.

Figure 3: AMBA Designer

Performance Implications of Interconnect Choices

From the perspective of the SoC designer, using a contemporary GUI to configure and generate a complex IP is very efficient: they can quickly architect the SoC and, within minutes, have a new interconnect IP (or multiple cascaded interconnect IPs) configured, generated, and stitched together. Creating the design is, however, only one piece of the problem. How will this choice of implementation actually perform under the range of scenarios that will be encountered in the full SoC context? This question is a challenge to answer.

The Cadence Interconnect Workbench provides a suite of capabilities to enable this kind of “what if?” experimentation. Let’s look at an example of the kind of analysis that Interconnect Workbench enables.

Figure 4 illustrates a bandwidth plot from a performance scenario in which specific read bandwidth criteria are met; the displayed USB and High-Speed I/O bandwidths are in the 100-300 MBps range. Interconnect Workbench allows us to quickly visualize this kind of simulation running on cycle-accurate register-transfer level (RTL) models of the interconnect, using Cadence VIP for AMBA to model the masters and slaves.

Figure 4: Bandwidth split by source
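The idea behind a bandwidth-by-source plot can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not how Interconnect Workbench computes its plots: the trace format, the source names, and the function `bandwidth_by_source` are all hypothetical.

```python
from collections import defaultdict


def bandwidth_by_source(transactions, window_ns):
    """Return {source: MB/s} for read traffic observed over window_ns nanoseconds.

    Each transaction is a hypothetical (source, kind, num_bytes) tuple.
    """
    bytes_per_source = defaultdict(int)
    for source, kind, num_bytes in transactions:
        if kind == "read":
            bytes_per_source[source] += num_bytes
    # bytes/ns -> MB/s: multiply by 1e9 ns/s and divide by 1e6 bytes/MB.
    return {src: total * 1e3 / window_ns for src, total in bytes_per_source.items()}


# Hypothetical trace captured over a 1000 ns window.
trace = [
    ("USB", "read", 64),
    ("USB", "read", 64),
    ("HSIO", "read", 128),
    ("USB", "write", 64),  # ignored: only read bandwidth is measured here
]
read_bandwidth = bandwidth_by_source(trace, window_ns=1000)
```

Checking each value in `read_bandwidth` against a per-source target is then a simple comparison, which is the kind of criterion the scenario above is said to meet.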

Interconnect Design Choices

The SoC designer must design the interconnect to provide sufficient throughput at low enough latency while minimizing its area, power, and routing congestion. The CoreLink NIC-400 allows the designer to craft different topologies while varying the size and number of switch matrices. Smaller switches permit higher operating frequencies and lower latencies, and timing closure options for the insertion of register slices allow throughput to be traded against static latency. Different AMBA protocols can be selected and data bus widths varied to increase bandwidth.

To prevent blocking without adding more and more physical channels, virtual channels can be defined, allowing one virtual channel to remain clear for latency-critical masters even when another virtual channel is fully utilized. Dynamic regulators can be inserted at the ingress to the interconnect network to prioritize traffic within a single virtual or physical channel, thus ensuring the required quality of service is met for each master. Once an interconnect configuration is selected, the designer needs to be able to verify its performance under load.
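To make the effect of an ingress regulator concrete, the sketch below models one as a token bucket. This is a generic rate-limiting technique chosen for illustration; the paper does not describe how the NIC-400 regulators are implemented, and the class name, parameters, and cycle model are all assumptions.

```python
class TokenBucketRegulator:
    """Generic token-bucket model of an ingress rate regulator (illustrative only)."""

    def __init__(self, tokens_per_cycle, burst):
        self.rate = tokens_per_cycle  # sustained transactions per cycle
        self.burst = burst            # maximum tokens that can accumulate
        self.tokens = burst           # start with a full bucket

    def tick(self):
        """Advance one clock cycle, refilling up to the burst limit."""
        self.tokens = min(self.burst, self.tokens + self.rate)

    def try_issue(self, cost=1):
        """Issue a transaction if enough tokens are available."""
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


def simulate(regulator, cycles):
    """Count transactions issued by a master that requests on every cycle."""
    issued = 0
    for _ in range(cycles):
        if regulator.try_issue():
            issued += 1
        regulator.tick()
    return issued


# A greedy master limited to one transaction every four cycles, with a burst of two.
issued = simulate(TokenBucketRegulator(tokens_per_cycle=0.25, burst=2), cycles=16)
```

In this model the master spends its initial burst immediately and is then held to the sustained rate, which leaves the remaining cycles free for other masters sharing the same channel.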

Verifying Latency

An important question that designers should ask is, “What is the consequence of adding an asynchronous bridge into my architecture with respect to latency?” The graph in Figure 5 shows the latency of the accesses across a CCI-400 Interconnect with and without ADB-400 asynchronous domain bridges. The top chart is the latency distribution without bridges; the bottom chart is with bridges.

Figure 5: CCI-400 latency with and without ADB bridges

Interconnect Workbench also allows us to investigate latency through statistical distributions. Figure 6 shows a latency distribution view of a group of simulations. It is easy to identify the slowest transactions on a distribution, as the buckets to the right are the slowest. The chart also clearly shows that read and write latencies differ, with writes completing more quickly than reads.

Figure 6: Latency distribution
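A latency distribution of this kind amounts to bucketing per-transaction latencies and keeping reads and writes apart. The Python sketch below shows the bucketing and a simple drill-down into the slowest bucket; the sample data, bucket width, and function names are hypothetical and do not reflect the tool's internals.

```python
from collections import Counter


def latency_histogram(samples, bucket_width):
    """Group (kind, latency) samples into fixed-width buckets, split by kind."""
    hist = {}
    for kind, latency in samples:
        bucket = (latency // bucket_width) * bucket_width  # lower edge of the bucket
        hist.setdefault(kind, Counter())[bucket] += 1
    return hist


def transactions_in_bucket(samples, kind, bucket, bucket_width):
    """Return the samples of one kind that fall in the given bucket."""
    return [
        (k, lat) for k, lat in samples
        if k == kind and bucket <= lat < bucket + bucket_width
    ]


# Hypothetical latencies in clock cycles.
samples = [
    ("read", 42), ("read", 58), ("read", 131),
    ("write", 23), ("write", 37), ("write", 41),
]
hist = latency_histogram(samples, bucket_width=20)
slowest_read_bucket = max(hist["read"])  # right-most bucket holds the outliers
outliers = transactions_in_bucket(samples, "read", slowest_read_bucket, 20)
```

The drill-down step returns the individual transactions behind a bucket, which is the information a designer needs before opening a waveform on a latency outlier.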

From the latency distribution, Interconnect Workbench provides the ability to click on a bucket and show the transaction(s) in that bucket along with all the details, thus enabling the rapid debug of latency outliers. As shown in Figure 7, right-clicking on the transaction details further accelerates debugging by launching the SimVision tool. Within the tool, the simulation waveform is already configured and markers highlight the transaction of interest.

Figure 7: One-click waveform debug of slow transactions

Push-Button Testbench Generation

As has been shown, Interconnect Workbench provides comprehensive analysis capabilities for capturing cycle-accurate performance metrics. How are the testbenches created to run this kind of analysis?

Interconnect Workbench provides a complete solution for automatically generating a UVM testbench for any ARM Interconnect from the NIC-301™, NIC-400, and CCI-400™ CoreLink System IPs. Once a user has defined the interconnect implementation details, AMBA Designer generates the RTL as well as an IP-XACT XML file that matches the design. Interconnect Workbench has been architected to read this IP-XACT and automatically generate a UVM testbench in either of the most popular high-level verification languages (HVLs): e or SystemVerilog.
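To give a feel for what reading an IP-XACT description involves, the Python sketch below extracts the bus interfaces from a small, hand-written component fragment. The fragment, its interface names, and the vendor string are hypothetical, and it assumes the IEEE 1685-2009 namespace; the files AMBA Designer actually generates may use a different schema version and carry vendor extensions. This is not how Interconnect Workbench itself is implemented.

```python
import xml.etree.ElementTree as ET

# Assumed schema namespace (IEEE 1685-2009); real files may differ.
SPIRIT_NS = "http://www.spiritconsortium.org/XMLSchema/SPIRIT/1685-2009"
NS = {"spirit": SPIRIT_NS}

# Hypothetical, heavily simplified component description.
SAMPLE_IPXACT = f"""<spirit:component xmlns:spirit="{SPIRIT_NS}">
  <spirit:name>example_interconnect</spirit:name>
  <spirit:busInterfaces>
    <spirit:busInterface>
      <spirit:name>cpu_s0</spirit:name>
      <spirit:busType spirit:vendor="example.org" spirit:library="bus"
                      spirit:name="AXI4" spirit:version="1.0"/>
      <spirit:slave/>
    </spirit:busInterface>
    <spirit:busInterface>
      <spirit:name>ddr_m0</spirit:name>
      <spirit:busType spirit:vendor="example.org" spirit:library="bus"
                      spirit:name="AXI4" spirit:version="1.0"/>
      <spirit:master/>
    </spirit:busInterface>
  </spirit:busInterfaces>
</spirit:component>"""


def list_bus_interfaces(xml_text):
    """Return (interface name, protocol, mode) for each bus interface."""
    root = ET.fromstring(xml_text)
    interfaces = []
    for bus_if in root.findall("spirit:busInterfaces/spirit:busInterface", NS):
        name = bus_if.findtext("spirit:name", namespaces=NS)
        bus_type = bus_if.find("spirit:busType", NS)
        protocol = bus_type.get(f"{{{SPIRIT_NS}}}name")  # namespaced attribute
        if bus_if.find("spirit:master", NS) is not None:
            mode = "master"
        elif bus_if.find("spirit:slave", NS) is not None:
            mode = "slave"
        else:
            mode = "other"
        interfaces.append((name, protocol, mode))
    return interfaces


interfaces = list_bus_interfaces(SAMPLE_IPXACT)
```

A testbench generator could use such a list to decide which verification component to attach to each port: a master model on every slave interface of the interconnect, and a slave model on every master interface.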

In a typical SoC, a mix of components makes up the “glue” connecting all of the major IP together. Reading an IP-XACT description of these system IP cores enables Interconnect Workbench to provide performance analysis capabilities for not only the interconnect components but also the cycle-accurate models of the DDR controller. Figure 8 shows how Interconnect Workbench might be used to generate a testbench for the core of an SoC and the included DDR controller.

[Figure 8 diagram labels: the SoC core of Figure 2 (CoreLink™ CCI-400 Cache Coherent Interconnect, NIC-400, MMU-400, ADB-400, Thin Link, Databahn IP, DDR PHY with LPDDR2 and DDR3 models) surrounded by ACE VIP, AXI4 VIP, ACE-Lite VIP, and ICM VIP instances and plug-ins.]

Figure 8: Interconnect Workbench-generated testbench

And There’s More...

With AMBA 4, ARM introduced hardware coherency through the ACE™ protocol, in contrast to the relatively simple non-coherent systems described so far. Coherency enables multiple masters to share coherent data structures while enabling L1 and L2 local caching for higher performance, for example in the ARM big.LITTLE™ processor clusters using Cortex™-A15 and Cortex-A7 processors. Interconnect hardware keeps these caches coherent with each other and with other components in the system, which improves performance, reduces power consumption, and simplifies software. Coherency also adds to the performance scenarios the system designer has to predict.

In the same way that Interconnect Workbench can help with non-cached systems, it can be used with the AMBA 4 cache-coherent protocols. In a cached system using, for example, the ARM CCI-400 cache-coherent interconnect, traffic from an I/O master can share the L2 cache of either of two processor clusters connected via the ACE interfaces using snoop commands. If transactions have data cached in these clusters, there will be a “snoop hit.” If the corresponding data is not stored in these caches, then the transaction will eventually be forced to go to main memory, resulting in a “snoop miss.” The difference in latency between these hits and misses is significant, and it is of paramount importance for an SoC designer to characterize the behavior of the system under differing loads and conditions. Interconnect Workbench provides the perfect vehicle to do this kind of analysis. Figure 9 shows a latency distribution for a CCI-400 simulation with data split by hits and misses.

Figure 9: CCI-400 latency distribution showing hits and misses

As can be seen, the expected lower latency for hits is validated by the analysis. The value of visualization is that it is easy to see if hits were slower than expected or if the misses were quicker, which might point to either a functional problem or perhaps an error in the scenario.
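The same sanity check can be expressed numerically. The Python sketch below splits hypothetical latency samples by snoop result, compares the averages, and computes a blended latency for an assumed hit rate; the sample values, the hit rate, and the function names are invented for illustration and are not measurements from the CCI-400.

```python
from statistics import mean


def split_by_snoop_result(samples):
    """Separate (is_hit, latency) samples into hit and miss latency lists."""
    hits = [latency for is_hit, latency in samples if is_hit]
    misses = [latency for is_hit, latency in samples if not is_hit]
    return hits, misses


def expected_latency(hit_rate, hit_latency, miss_latency):
    """Average latency weighted by the fraction of transactions that hit."""
    return hit_rate * hit_latency + (1 - hit_rate) * miss_latency


# Hypothetical latencies in clock cycles.
samples = [(True, 30), (True, 34), (False, 90), (False, 110)]
hits, misses = split_by_snoop_result(samples)
avg_hit, avg_miss = mean(hits), mean(misses)

# Flag the suspicious case described above: hits that are not faster than misses.
hits_look_plausible = avg_hit < avg_miss
blended = expected_latency(hit_rate=0.5, hit_latency=avg_hit, miss_latency=avg_miss)
```

Sweeping `hit_rate` in such a model shows how strongly the average latency seen by a master depends on how often its data is found in another cluster's cache.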

Conclusion

The increase in complexity of SoCs based on heterogeneous, multi-core systems can be addressed by advanced system IP. Design integration can be accelerated with appropriate tools that simplify choice of architectures, clock schemes, power domains, memory sizes, cache sizes, QoS mechanisms, and other configuration options. ARM’s AMBA Designer provides a quick way to generate CoreLink interconnect designs from a large set of configurable options. The Cadence Interconnect Workbench is a valuable tool for measuring and comparing different architectures and configurations in cycle-accurate RTL simulations for a variety of scenarios. Understanding how these numerous and varying IP cores behave together in a system when pushed to their limits is key to ensuring that a new design delivers on its expected performance targets.

Cadence Design Systems enables global electronic design innovation and plays an essential role in the creation of today’s electronics. Customers use Cadence software, hardware, IP, and expertise to design and verify today’s mobile, cloud and connectivity applications. www.cadence.com

© 2013 Cadence Design Systems, Inc. All rights reserved worldwide. Cadence and the Cadence logo are registered trademarks of Cadence Design Systems, Inc. in the United States and other countries. ARM and AMBA are registered trademarks and ACE, AHB, APB, AXI, AXI4, big.LITTLE, CCI-400, CoreLink, Cortex, NIC-301, and NIC-400 are trademarks of ARM Ltd. 1496 10/13 CY/DM/PDF

