
Plasticity-on-Chip Design: Exploiting Self-Similarity for Data Communications

Yao Xiao, Student Member, IEEE, Shahin Nazarian, Member, IEEE, and Paul Bogdan, Senior Member, IEEE

Abstract—With the increasing demand for distributed big data analytics and data-intensive programs, which contribute to large volumes of packets among processing elements (PEs) and memory banks, we witness a pressing need for new mathematical models and algorithms that can engineer a brain-inspired plasticity into computing platforms by mining the topological complexity of high-level programs (HLPs) and exploiting their self-similar and fractal characteristics for designing reconfigurable domain-specific computing architectures. In this article, we present Plasticity-on-Chip (PoC) by engineering plasticity into "artificial brains" to mine and exploit the self-similarity of HLPs. First, we present a communication modeling of HLPs (e.g., C/C++ implementations of various applications) that relies on static and dynamic compiler analysis of programs with varying input seeds, performing comprehensive program analysis of all traces, and representing the HLPs as weighted directed acyclic graphs while capturing the intrinsic timing constraints and data/control flow requirements. Second, we propose a rigorous mathematical framework for determining the optimal parallel degree of executing a set of interacting HLPs (by partitioning them into clusters of densely interconnected supernodes - tasks), which helps us decide the number of available heterogeneous PEs, the amount of required memory, and the structure of the synthesized deadlock-free irregular NoC topology that offers an efficient communication medium. These clusters serve as abstract models of computation for the synthesized PEs within the parallel execution model. Finally, exploiting fractal and complex-network concepts, we extract in-depth features from graphs that serve as inputs for distributed reinforcement learning. Our experimental results on synthesized PEs and NoCs show performance improvements as high as 7.61x when compared to the traditional NoC and 2.6x compared to gem5-Aladdin.

Index Terms—Software hardware codesign, optimal parallelization degree, self-similarity, graph neural networks, intelligent scheduler


1 INTRODUCTION

COMPLEX cyber-physical applications consist of several processes/functionalities [13] constantly interacting and communicating with each other in order to solve a real-life problem [15]. This heterogeneous and complex interdependent organization makes HLPs extremely difficult to model and understand in terms of their dynamics and required allocated resources. Despite significant progress in parallel programming and languages [14], [12], [34], we still rely on old compiler techniques, invented back in 1973, such as the data-flow and control-flow graphs [2], when we had one Intel 4040 4-bit microprocessor. This needs to be reconciled with recent advancements in hardware design [13].

Over the past decades, the monolithic core has been replaced by complex multiprocessors consisting of hundreds and thousands of PEs in order to provide high performance, ease of computer system design, and low power consumption. In dual- or quad-core systems, bus- or crossbar-based interconnect designs are applied for simplicity. However, they do not scale well to meet the needs of large-scale CMPs. Consequently, the network-on-chip (NoC) has been developed as an alternative scalable interconnect to efficiently route messages among numerous cores [15], [4]. NoC offers (1) a simple interconnect design; and (2) low electrical loading, as each core is connected to a small number of cores.

Despite the advent of parallel programming models, such as Pthreads and OpenMP [25], that enable NoC-based parallel architectures, programmers have to understand the low-level details of hardware platforms, such as the amount of memory and the number of CPUs and GPUs, making software productivity low [9], [10]. This software-hardware codesign methodology is bottom-up: hardware designers build heterogeneous platforms first by design space exploration [29], and software programmers then write applications for these platforms. Normally, such applications are not optimal in terms of performance. We argue that in the future this bottom-up approach should be replaced with a top-down approach with feedback, as shown in Fig. 1. Software programmers write applications without the necessity of process spawning and cache tuning [24]. These applications, as we discuss in Section 3, encode high-level structures such as computations and communications. However, the hardware requires the intelligence to reconfigure itself [11] (e.g., communication patterns decide the topology) to reflect the high-level structures imposed by applications. The runtime system [2], sitting between software and hardware, is able to perform the analytical

The authors are with the Department of Electrical and Computer Engineering, Viterbi School of Engineering, University of Southern California, Los Angeles, CA 90089 USA. E-mail: {xiaoyao, shahin.nazarian, pbogdan}@usc.edu.

Manuscript received 15 Aug. 2020; revised 22 Mar. 2021; accepted 28 Mar. 2021. (Corresponding author: Yao Xiao.) Recommended for acceptance by L. Chen and Z. Lu. Digital Object Identifier no. 10.1109/TC.2021.3071507

IEEE TRANSACTIONS ON COMPUTERS 1

0018-9340 © 2021 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.


reasoning on the number of resources, such as CPUs, GPUs, and memory, required to satisfy the constraints for reconfigurable hardware [24].

In machine learning (ML) and artificial intelligence (AI), recent technological advances have greatly contributed to a rapid increase in the complexity of algorithms [1]. Several compiler techniques (e.g., data- or control-flow analysis) are used to understand programs. While they are simple and efficient enough to understand programs, with the advent of parallel computing and resource-aware computing, they fail to decipher the heterogeneous structural organizations, formation, and dynamics of parallel programs. Moreover, they cannot discover the relationship between programs and the underlying resources.

In this paper, we address the following urgent research questions: 1. How to mine and model applications in order to (i) learn their heterogeneous structures and (ii) capture their requirements for parallel computing? 2. How to decide the optimal number of resources for parallel execution? Technological miniaturization has made it possible to design general-purpose machines with hundreds and possibly even thousands of cores. However, because parallel programs tend to suffer from synchronization overhead, increasing the number of forked threads does not necessarily lead to better performance. This guides us to formulate an optimization model to determine the best number of threads while considering data communications. 3. How to design irregular deadlock-free NoCs with low latency from the model of programs? Data-intensive applications running on NoCs with regular topologies (e.g., ring, mesh) would result in

frequent and complex data communication patterns. In return, this would demand hierarchical storage of data in memories (i.e., caches, main memory banks, and disks). However, dedicated shortcut links inside irregular NoCs help packets bypass overloaded wires to improve latency. 4. How to design power-efficient routers in a NoC? Several NoC prototypes, such as the Intel 80-core Terascale chip [13] and the MIT RAW chip [4], consume 30-40 percent of their total energy on the NoC. In addition, on average 25 percent of the total energy is required for data communication [15].

Therefore, we propose the Plasticity-on-Chip (PoC) design flow to engineer plasticity on future chips (see Fig. 2). PoC includes a design flow that incorporates several feedback loops to provide the mechanisms for introducing and evaluating plasticity in the heterogeneous processing elements, memory

hierarchy, and interconnect components of future manycore systems, while realizing the inherent dynamics of programs. This PoC design flow enables future manycore systems to mimic brain computation and perform like ever-changing networks and systems with a very large number of processing, memory, and interconnect elements (i.e., PEs, MEs, and IEs) as well as decision-making agents (i.e., deep reinforcement learning (DRL), Bayesian optimization, etc.) that evolve over time, keep getting better and stronger as they run new programs, and adapt to varying environments and conditions. More precisely, we first profile the C/C++ programs of the data-intensive or data analytics applications to obtain low level virtual machine (LLVM) intermediate representation (IR) instructions with different input seeds. Next, we perform static and dynamic analysis to identify data and control dependencies and construct a self-similar and power-law distributed graph (SPDG). The graph is weighted, directed, and acyclic, where nodes represent instructions with the corresponding input seed; edges represent data and control dependencies; and weights represent data volumes and the time to transfer data. This graph is able to model the structural and dynamical nature of a complex real-life HLP. Next, we introduce two global structural attributes to best describe and learn the SPDG: (i) The degree distribution P(k) expresses the probability of having a node with k links. We illustrate that some graphs are scale-free, which means that their degree distributions follow a power law, at least asymptotically. (ii) The fractal dimension, which captures the degree of self-similarity or self-repeating patterns in graphs across all scales by applying a renormalization procedure.
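As a concrete illustration of these two attributes, the sketch below estimates the degree distribution P(k) of a small graph and a box-counting dimension from the scaling of the number of covering boxes N_B with the box size l_B. The greedy BFS-based covering, the log-log slope fit, and the toy star graph are simplifications for illustration, not the node-attribute box-covering algorithm used in the paper.

```python
import math
from collections import Counter, deque

def degree_distribution(adj):
    """P(k): fraction of nodes having degree k in adjacency dict adj."""
    counts = Counter(len(vs) for vs in adj.values())
    n = len(adj)
    return {k: c / n for k, c in sorted(counts.items())}

def greedy_box_cover(adj, l_b):
    """Number of boxes (BFS balls of radius l_b - 1) needed to cover all
    nodes, growing each box greedily over the still-uncovered nodes."""
    uncovered = set(adj)
    boxes = 0
    while uncovered:
        seed = min(uncovered)                 # deterministic seed choice
        box, frontier = {seed}, deque([(seed, 0)])
        while frontier:
            node, d = frontier.popleft()
            if d >= l_b - 1:
                continue
            for nxt in adj[node]:
                if nxt in uncovered and nxt not in box:
                    box.add(nxt)
                    frontier.append((nxt, d + 1))
        uncovered -= box
        boxes += 1
    return boxes

def box_dimension_slope(adj, sizes=(2, 3, 4)):
    """Least-squares slope of log N_B versus log l_B; for a self-similar
    graph the points fall on a line whose negative slope is d_B."""
    xs = [math.log(l) for l in sizes]
    ys = [math.log(greedy_box_cover(adj, l)) for l in sizes]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# A 9-node star: one hub (node 0) feeding 8 leaves, stored undirected.
star = {0: list(range(1, 9))}
star.update({i: [0] for i in range(1, 9)})
```

On the star, `degree_distribution` exposes the hub (one node with k = 8, eight with k = 1), and the slope returned by `box_dimension_slope` plays the role of the negative fractal exponents quoted for the examples in Section 3.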
By linking these two properties of complex networks, our PoC compiler flow realizes the communications of programs and connects them with low-level requirements such as memory sizes and the optimal parallelization degree. Next, we develop an optimization engine to find the optimal number of clusters while minimizing the data communication cost. At the same time, we ensure that (1) the generated cluster graph does not contain cyclic waiting, in order to resolve the main cause of deadlocks in existing NoCs; and (2) the number of links per router cannot exceed a predefined number, to minimize the number of buffers. These clusters serve as an abstract model for the PEs to be synthesized onto hardware. Finally, we provide a self-assessment of plasticity through several feedback loops that monitor the rewards via deep reinforcement learning (DRL) [5].

To better understand and learn the hidden structures and dynamics in the communications of HLPs, we define the premises of a bridge between programming languages, complex networks, and fractal geometry by representing programs as self-similar power-law distributed graphs, and make the following novel contributions:

- We link programming languages with complex networks (CNs) by modeling each application as a self-similar power-law distributed graph and providing two CN metrics to describe the graph complexity: degree distribution and fractal dimension. We demonstrate that the old compiler optimization techniques can also be transformed into graph algorithms.
- We define a node-attribute box-covering algorithm to mine the complexity of weighted graphs corresponding to HLPs.

Fig. 1. Design hierarchy: now (for SoC) and future (for PoC).



- We propose rigorous mathematical modeling for partitioning the weighted graph corresponding to various interdependent HLPs into interconnected dense clusters, and thus discovering the optimal parallel degree of these applications. The structure of the clusters is constrained to avoid cyclic resource waiting and dependencies that lead to poor performance and scalability.
- We propose a deep reinforcement learning (DRL) approach to learn the optimal mapping of the generated clusters onto heterogeneous platforms.

The rest of the paper is organized as follows. Section 2 describes the related work. Section 3 provides motivating examples to model the structural communications of programs. Section 4 provides graph definitions for PoC. Section 5 discusses our PoC framework. We provide experimental results in Section 6 and conclude the paper in Section 7.
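The structural constraints placed on a candidate partition (an acyclic cluster graph, so no cyclic resource waiting, and a bounded number of links per cluster/router) can be sketched as a feasibility check. The cluster-graph collapse and Kahn-style cycle test below are a simplified illustration of those two constraints, not the paper's actual optimization engine.

```python
from collections import defaultdict

def cluster_graph(edges, part):
    """Collapse an instruction-level edge list into cluster-to-cluster edges.
    edges: iterable of (src, dst); part: node -> cluster id."""
    cg = defaultdict(set)
    for u, v in edges:
        cu, cv = part[u], part[v]
        if cu != cv:
            cg[cu].add(cv)
    return cg

def is_acyclic(cg):
    """Kahn's algorithm: the cluster graph is deadlock-safe (no cyclic
    waiting) iff every node can be placed in a topological order."""
    nodes = set(cg) | {v for vs in cg.values() for v in vs}
    indeg = {n: 0 for n in nodes}
    for vs in cg.values():
        for v in vs:
            indeg[v] += 1
    queue = [n for n in nodes if indeg[n] == 0]
    seen = 0
    while queue:
        n = queue.pop()
        seen += 1
        for v in cg.get(n, ()):
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    return seen == len(nodes)

def feasible(edges, part, max_links):
    """A partition is feasible if its cluster graph is acyclic and every
    cluster touches at most max_links other clusters (in + out)."""
    cg = cluster_graph(edges, part)
    links = defaultdict(set)
    for u, vs in cg.items():
        for v in vs:
            links[u].add(v)
            links[v].add(u)
    return is_acyclic(cg) and all(len(s) <= max_links for s in links.values())
```

An optimization engine can then search over partitions, rejecting any candidate for which `feasible` returns False before scoring its communication cost.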

2 RELATED WORK

Fractality in Software and Hardware. Daniel et al. [12] show that dynamic data dependency graphs for numerous benchmarks exhibit fractal communication. This fractality can influence the spatiotemporal locality of NoC communication in a CMP. Concas et al. [6] demonstrate that, in an object-oriented system, fractal behavior can be exhibited by the static class network, where nodes represent classes and edges represent relationships between classes. This dimensionality is one of the fundamental metrics to capture the complexity of software systems. Alan et al. [22] speculate, without experimental observations, that programs exhibit fractal properties. In contrast to this prior work, we analyze the topological characteristics of graphs encoding the universal computational and communication features of HLPs. The box-covering algorithm used for analyzing the fractal dimension takes into account node attributes such as addition and subtraction. We also show that the out-degree distribution of the graph follows a power-law relationship.

Domain-Specific Accelerators. The works in [14], [10] apply machine learning techniques to understand and design domain-specific system-on-chips (DSSoCs). Uhrie et al. [14] propose a method to extract parallel kernels from dynamic application traces to build ontological representations of computation. The ontological inference is then used to design a DSSoC architecture. SOSPCS [10] uses neural networks (NNs) to identify similar structures in computation and reinforcement learning to map tasks onto heterogeneous NoC-based systems. However, the main differences between PoC and SOSPCS are in cluster identification and mapping. During cluster identification, in SOSPCS, clusters are first recognized using NNs to find common structures among graphs. Once these structures, which can be accelerated, are found, the rest of the graph is partitioned into clusters. In contrast, here, we consider the irregular topology as communication links among the different partitioned clusters. Inter-cluster edges are regarded as data communication patterns within the NoC. Therefore, we convert the problem of discovering the optimal degree of parallelization into solving an optimization engine that partitions an SPDG into highly interconnected clusters to be implemented and mapped onto hardware. The other difference is cluster mapping. SOSPCS uses traditional reinforcement learning whereas, in this work, we develop deep reinforcement learning with graph convolutional networks, which is more suitable because of the structure of the graphs we are dealing with. We design a graph neural network as a learnable agent in reinforcement learning to provide actions in the environment. Nowatzki et al. [25] propose the transformable dependence graph, a high-level alternative to the time-consuming compiler+simulator stack for studying behaviorally specialized accelerators.

However, our PoC design flow learns the structural and dynamical knowledge about HLPs and helps us re-engineer future manycore systems so that they are similar to brains and perform like ever-changing networks and systems with a very large number of processing, memory, and interconnect elements, providing performance improvement and power efficiency.

Fig. 2. PoC framework: First, we construct a self-similar power-law distributed graph abstraction through program tracing, analysis, and profiling of HLPs. Next, we propose an optimization strategy for optimal parallelization discovery (partitioning the graph into several interdependent clusters), ensuring no cyclic resource waiting (deadlock-free) and keeping the number of incoming and outgoing edges within a threshold (scalability). Finally, a DRL approach synthesizes these clusters into heterogeneous processing elements and an irregular NoC to improve network latency, application performance, and power consumption.



3 MOTIVATING EXAMPLES

In this section, we show that the SPDG captures structural communication in HLPs and that two fundamental techniques in programming languages (namely, iteration and recursion) exhibit self-similar and fractal characteristics, which are discussed in detail in Section 5.3.

Iteration is a core technique for a program to define a number of repetitions, usually via for-loops and while-loops, and is universally applied in complex algorithms in areas such as machine learning and autonomous systems. The first example, in Figs. 3a and 3d, is the summation of i from 0 to 10 using a for-loop. The structure of its graph representation is similar to a star shape. Zooming into the graph gives us a structure similar to the original graph, which implies a self-similar organization. Therefore, we can capture its self-similarity through a fractal dimension of -1.6. The second example, in Figs. 3b and 3e, contains a for-loop and an array, which differs from the first example, which does not involve an array. Its graph representation is similar to a lattice topology. Self-similarity appears in the nodes (addition) connected from the center node (array allocation), called a hub, which can be measured by a fractal dimension of -3.4. This can have implications for coded computing.

Recursion is another significant routine for a program to solve a problem where the solution depends on solutions to smaller instances of the same problem. The power of recursion is the possibility of defining an infinite set of possible

functions by a finite set of statements. It is widely applied in areas such as digital signal processing and divide-and-conquer. The third example, in Figs. 3c and 3f, calculates a Fibonacci number using recursion. The graph representation is similar to a tree structure. Each branch (a smaller instance of the fib function) can be regarded as a smaller version of the tree (the original problem). This self-similarity can be captured by a fractal exponent of -2.1.

Therefore, as indicated by these three examples, there exist patterns of communication in the graph representation of programs which we can exploit to capture and understand the intrinsic nature of approximate self-similarity and degree distribution in most algorithms and programs. Moreover, the three structures serve as the building blocks for complex graphs transformed from real-life programs in Fig. 4, with fractal dimensions whose exponents are -1.2 and -1.4, respectively.
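To make the star/tree intuition concrete, the sketch below builds toy dependency graphs for two of the patterns: a for-loop accumulation, where every iteration's add feeds the same accumulator and produces a hub, and a naive Fibonacci recursion, where each call spawns two subcalls and produces a tree. The edge encoding and node naming are illustrative assumptions, not the paper's SPDG construction.

```python
from collections import Counter

def loop_sum_edges(n):
    """Dependency edges of `s = 0; for i in range(n): s = s + i`:
    every iteration's add feeds the same accumulator -> star shape."""
    return [(f"add_{i}", "acc") for i in range(n)]

def fib_call_edges(n, name="fib"):
    """Call-dependency edges of naive Fibonacci -> tree shape:
    fib(k) depends on fib(k-1) and fib(k-2)."""
    edges = []
    def build(k, label):
        if k >= 2:
            for j, sub in enumerate((k - 1, k - 2)):
                child = f"{label}.{j}:{name}({sub})"
                edges.append((child, label))
                build(sub, child)
    build(n, f"{name}({n})")
    return edges

def in_degrees(edges):
    """How many producers feed each node."""
    return Counter(v for _, v in edges)
```

The loop graph concentrates all edges on one hub (a heavy tail in the degree distribution), while the recursion graph fans out with in-degree at most 2 everywhere, matching the star and tree shapes of Figs. 3a and 3c.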

4 POC GRAPH DEFINITIONS

In this section, we present four important definitions and elaborate on them in Section 5.

Definition 1. A program (P) is defined as a sequence of IR instructions in terms of a specific input seed S. Mathematically, we can express it as P(ir_i | S; i ∈ |N_P|), where ir_i is the ith IR instruction and N_P is the number of instructions.

Fig. 3. PoC graph models and fractal dimensions of iteration and recursion, i.e., 1.6, 3.4, and 2.1, respectively. Fruchterman-Reingold is a graph layout algorithm.

Fig. 4. PoC graph models and fractal dimensions of real-life examples, i.e., 1.2 and 1.4, respectively.



Definition 2. A self-similar and power-law distributed graph (SPDG) is defined as a weighted directed graph with annotations SPDG(n_i, attr_i, e_ij, w_ij | i, j ∈ |N|), where n_i represents the ith instruction in a program P; attr_i denotes a node attribute on node n_i; e_ij, associated with the corresponding weight w_ij, characterizes the dependence of the current node n_i on the previous node n_j to guarantee the strict program order; and N is the total number of nodes.

In our implementation, each weight w_ij is measured by the time it takes to execute the current instruction n_i multiplied by the amount of data volume required to transfer from node n_j to node n_i. In this way, we can encode the cost of data communication into an SPDG, which allows us to propose an optimization engine in Section 5.2 to discover the optimal parallelization degree while taking into account communication patterns to avoid deadlocks. Each node attribute attr_i describes the type of the current ith instruction, such as addition, subtraction, or multiplication. In addition, we construct the SPDG at the granularity of instructions rather than the functionalities of applications. There are two advantages: (1) It is coarse-grained enough to reduce the simulation time and the memory space for keeping track of all low-level assembly instructions and data structures. (2) It is fine-grained enough to express the inter-dependencies between each pair of dynamically collected instructions.
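Under this definition, an edge weight is simply w_ij = exec_time(n_i) × volume(n_j → n_i). A minimal sketch follows; the field names, the tick/packet units, and the profile values are assumptions for illustration, not the paper's data layout.

```python
from dataclasses import dataclass

@dataclass
class Edge:
    src: str        # producing instruction n_j
    dst: str        # consuming instruction n_i
    volume: int     # packets transferred from n_j to n_i

def edge_weight(exec_ticks, edge):
    """w_ij = time to execute the consumer n_i (in ticks)
    x data volume carried over the edge (in packets)."""
    return exec_ticks[edge.dst] * edge.volume

exec_ticks = {"load_a": 4, "mul_1": 3}       # hypothetical profile data
e = Edge(src="load_a", dst="mul_1", volume=8)
w = edge_weight(exec_ticks, e)               # 3 * 8 = 24
```

Summing these weights along any cut of the SPDG then gives the data communication cost that the optimization engine of Section 5.2 minimizes.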

Definition 3. A cluster graph (CG) is defined as CG(n'_i, e'_ij, w'_ij | i, j ∈ |N'|), where n'_i represents a sequence of instructions from an SPDG; e'_ij represents an edge connecting two clusters n'_i and n'_j; w'_ij represents the sum of the weights from cluster n'_i to n'_j; and N' is the number of clusters.
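A cluster graph can be obtained from an SPDG and a node-to-cluster assignment by summing the weights of all SPDG edges that cross cluster boundaries. The sketch below is a straightforward reading of Definition 3; the dict-of-tuples encoding is an assumption for illustration.

```python
from collections import defaultdict

def build_cluster_graph(spdg_edges, part):
    """spdg_edges: {(n_j, n_i): w_ij}; part: node -> cluster id.
    Returns {(c_j, c_i): summed weight} over inter-cluster edges only;
    intra-cluster edges carry no NoC traffic and are dropped."""
    cg = defaultdict(int)
    for (src, dst), w in spdg_edges.items():
        cs, cd = part[src], part[dst]
        if cs != cd:
            cg[(cs, cd)] += w
    return dict(cg)

edges = {("a", "b"): 5, ("b", "c"): 7, ("a", "c"): 2}
part = {"a": 0, "b": 0, "c": 1}
# ("b","c") and ("a","c") cross clusters 0 -> 1, so w'_01 = 7 + 2 = 9
```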

Definition 4. A data volume is defined as the number of required packets transferred between two instructions. We obtain data volumes during the application modeling, which further guides us to implement buffers with appropriate sizes inside routers.

Therefore, in this paper, our goal is to (1) infer a self-similar and power-law distributed graph from a program (P → SPDG); (2) partition the graph into interconnected clusters (SPDG → CG); and (3) map the clusters onto hardware (HW) with heterogeneous PEs and an irregular NoC topology (CG → HW).

5 POC - PLASTICITY-ON-CHIP DESIGN

In this section, we present our PoC framework in the following four steps: (A) communication modeling; (B) optimal parallelization discovery; (C) fractal dimension and degree distribution; and (D) PE mapping and irregular NoC synthesis via DRL.

5.1 Communication Modeling: P → SPDG

Fig. 5 shows the high-level and detailed procedures to model each C/C++ application as an SPDG via static and dynamic compiler analysis. First, the distributor compiles the input HLPs into IR instructions. It distributes instructions and assigns a specific input seed to each worker. In the meantime, it checks the code coverage to see if the code paths are fully covered. Next, in a pipelined approach, each worker thread (1) keeps track of the number of basic blocks and the instructions in each basic block through code tracing; (2) analyzes the instructions to identify the dependencies between instructions and construct an intermediate graph; and (3) profiles the memory instructions to obtain the data volumes and times required to calculate edge weights. Finally, the aggregator merges the intermediate graphs into an SPDG.
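The distributor-worker-aggregator flow above can be sketched as follows. The trace format (instruction id, destination register, source registers, weight) and the merge policy of summing edge weights observed under different seeds are assumptions for illustration, not the paper's implementation.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def worker(trace):
    """Turn one seed's instruction trace into an intermediate graph:
    an edge (producer -> consumer) with an observed weight."""
    graph = defaultdict(int)
    last_writer = {}                       # register -> producing instruction
    for instr, dst, srcs, weight in trace: # e.g. ("i2", "r2", ["r1"], 3)
        for s in srcs:
            if s in last_writer:           # data dependency on the last write
                graph[(last_writer[s], instr)] += weight
        last_writer[dst] = instr
    return graph

def aggregator(graphs):
    """Merge per-seed intermediate graphs into a single SPDG edge map."""
    merged = defaultdict(int)
    for g in graphs:
        for edge, w in g.items():
            merged[edge] += w
    return dict(merged)

def distributor(traces):
    """Fan the per-seed traces out to worker threads, then aggregate."""
    with ThreadPoolExecutor() as pool:
        return aggregator(pool.map(worker, traces))
```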

5.1.1 Distributor Assigns IR and Seeds

The advantage of IR is that it is not only architecture-independent (high-level) enough to abstract away the details of low-level machine code and prevent register spilling, but also low-level enough to model machine code without expanding into hundreds of translated instructions. The low level virtual machine (LLVM) is a compiler engine that makes program analysis lifelong and transparent by introducing an intermediate representation (IR) as a common model for analysis, transformation, and synthesis [18]. In our approach, (1) IR instructions are accumulated dynamically to obtain the correct instruction traces running under one specific input seed; and (2) IR instructions are collected with different input seeds to make sure that the code paths are fully covered. Different input seeds can exercise different execution paths (basic blocks), since programs normally have various statements, such as if-else, for, and while, that control the execution of programs. With only one input

Fig. 5. Modeling of C/C++ code in the PoC design flow. The distributor compiles HLPs into LLVM IR instructions and assigns an input seed to each worker thread to guarantee that the code paths are fully covered. Next, each worker, with the assigned seed, performs code tracing, analysis, and profiling to construct an IR graph. Finally, the aggregator merges these IR graphs into a self-similar graph.



seed, where only some execution paths are exercised, the rest of the paths remain undiscovered, because only dynamically executed IR instructions are collected. To model C/C++ applications, we need to consider all corner cases, i.e., all possible combinations of control paths must be covered. This is achieved by first transforming each program into a control-flow graph (CFG). We keep track of all nodes that are covered in the CFG and manually design input seeds if the goal is not reached within a time limit. Therefore, in order to comprehensively model C/C++ programs, the distributor assigns different seeds to worker threads to exercise all code segments.
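The seed-selection loop can be sketched as follows. This is a minimal sketch, not the paper's implementation: `run_and_trace` is a hypothetical stand-in for the instrumented LLVM-IR tracer that would execute the program under a seed and report the basic blocks it exercises.

```python
import random
import time

def cover_cfg(cfg_nodes, run_and_trace, time_limit_s=600):
    """Color CFG nodes with random seeds until fully covered or timed out.

    cfg_nodes:     set of basic-block IDs in the control-flow graph
    run_and_trace: callable(seed) -> set of basic-block IDs the seed exercises
                   (a hypothetical stand-in for the instrumented tracer)
    Returns (seeds_used, uncovered); a non-empty `uncovered` means
    manually designed seeds are still needed for the remaining paths.
    """
    colored, seeds = set(), []
    deadline = time.monotonic() + time_limit_s
    while colored != cfg_nodes and time.monotonic() < deadline:
        seed = random.randrange(2**32)
        seeds.append(seed)
        # color every basic block this seed's execution path touches
        colored |= run_and_trace(seed) & cfg_nodes
    return seeds, cfg_nodes - colored
```

The default 600-second budget mirrors the 10-minute stage limit described in the complexity analysis below; any uncovered nodes returned would then be handled with expert-designed seeds.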

5.1.2 Worker Threads Analyze and Profile Code

After the dynamic IR instructions are collected, worker threads identify all instructions inside each basic block via code tracing. Specifically, we maintain a map container with keys being a specific IR instruction inside the corresponding basic block and values being the index of that basic block. Next, workers perform code analysis to recognize data, control, and memory dependencies of the current instruction on previous instructions. Data dependency analysis checks whether the source registers of the current instruction depend on previous instructions, whereas control dependency analysis inspects the next instruction to which an instruction jumps if the instruction is a jump or function call. We also perform register renaming, i.e., if a register name appears again, we rename it to make sure no false dependencies exist, and memory dependency analysis to see whether two memory objects point to the same address.
In the implementation, we maintain two vectors called dest and instr. The dest vector keeps track of the destination registers of all IR instructions, whereas the instr vector stores all IR instructions that depend on registers defined in previous instructions. Finally, during code profiling, for each memory operation we use the rdtsc instruction to read the processor's time-stamp counter and determine the number of CPU ticks required for the instruction. Although the timing is architecture dependent, it provides intuition on how memory data are communicated; this communication is later partitioned to minimize data movement among clusters.
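The dest/instr bookkeeping can be sketched on a simplified three-address IR. The tuple format and the `build_dep_graph` name are illustrative assumptions, not the paper's actual parser:

```python
def build_dep_graph(instrs):
    """Build RAW data-dependence edges for a simplified three-address IR.

    Each instruction is (dest_register, opcode, source_registers). The
    `dest` list mirrors the paper's dest vector (destination registers
    seen so far); matching a source against the latest earlier
    destination yields a true (read-after-write) dependence edge
    (producer_index, consumer_index).
    """
    dest = []            # dest[i] = destination register of instruction i
    edges = set()
    for i, (d, _op, srcs) in enumerate(instrs):
        for s in srcs:
            # scan backwards so the latest definition of s wins; with
            # register renaming applied, stale matches cannot occur
            for j in range(len(dest) - 1, -1, -1):
                if dest[j] == s:
                    edges.add((j, i))
                    break
        dest.append(d)
    return edges
```

On the three-instruction trace from the example in Section 5.1.4, this yields the expected edges from the load to both consumers of %1 and from the add to the user of %9.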

Algorithm 1. Aggregation Algorithm

Input: A series of intermediate graphs G = (G1, G2, ..., Gn) generated by worker threads
Output: An SPDG with all of the nodes inside G
1: while G is not empty do
2:   k = 0
3:   for g ∈ G do
4:     Npartial[k++] = RemoveNodesWNoInEdges(g)
5:   if Npartial has common nodes then
6:     Insert one node with the maximum weight
7:   else
8:     Insert all nodes with weights

5.1.3 Aggregator Merges Graphs

After each worker thread produces an intermediate graph, the aggregator first extracts the nodes with no incoming edges from the intermediate graphs and deletes these nodes from the graphs. If identical nodes exist, only one node is inserted into the final graph, with the maximum weight. Otherwise, all nodes and their connections are pushed into the SPDG. In this way, the intermediate graphs are combined into an SPDG containing all of the nodes and edges.
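Algorithm 1 can be sketched as follows; the dict-of-edges graph representation and the `aggregate` name are illustrative simplifications, not the paper's data structures:

```python
def aggregate(graphs):
    """Merge intermediate graphs into one SPDG (a simplified Algorithm 1).

    Each input graph is a dict {(u, v): weight}. Identical nodes across
    graphs are inserted once, and every merged edge keeps the maximum
    weight seen in any intermediate graph (as in the paper's example,
    where node 0's out-edges keep weights 16, 34, and 57).
    Returns (node_order, merged_edges); node_order emits nodes in waves
    of zero in-degree, mirroring RemoveNodesWNoInEdges.
    """
    edges = {}
    for g in graphs:
        for (u, v), w in g.items():
            edges[(u, v)] = max(w, edges.get((u, v), w))
    nodes = {u for u, _ in edges} | {v for _, v in edges}
    order, remaining, live = [], set(nodes), dict(edges)
    while remaining:
        # nodes of `remaining` that no live edge still points into
        frontier = remaining - {v for (_, v) in live}
        if not frontier:      # defensive: SPDGs are acyclic by construction
            order.extend(sorted(remaining))
            break
        order.extend(sorted(frontier))
        remaining -= frontier
        live = {(u, v): w for (u, v), w in live.items() if u in remaining}
    return order, edges
```

Because the peel only fixes the insertion order, the merged edge set itself reduces to a per-edge maximum over the intermediate graphs.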

5.1.4 Complexity Analysis

In general, the complexity of the proposed algorithm is related to the number of seeds necessary per application, which in turn depends on the complexity of the application. Given N as the number of nodes in the control-flow graph, the complexity of the proposed algorithm is O(N^2). This is because, for each iteration, we also have to visit each node to see whether the graph is fully covered, and the number of seeds is O(N) on average. In this work, seeds are generated randomly or manually. At the beginning of the profiling, we maintain the control-flow graph of the program. A basic block is a straight-line code sequence with no branches in except at the entry and no branches out except at the exit. An execution path with respect to a seed is the series of nodes (basic blocks) traversed during the execution. During the first stage, where the program starts to execute, we use random seeds to run the program. If a certain path is exercised, we color the corresponding nodes in the control-flow graph. Once the graph is fully colored, we stop the execution. We set an upper time limit for this stage (10 minutes in this work), which can be adjusted by the user. If we reach the maximum time limit, we investigate the partially colored control-flow graph to see which paths are not covered and use expert knowledge to design seeds that cover these paths.
Example: Fig. 5 shows how, for a randomized input seed 1, an application can be compiled into its corresponding LLVM IR instructions. The parser reads one IR instruction at a time and analyzes whether the source register %i depends on previous destination registers. Assuming it does, this instruction waits to be profiled, is pushed into the instr vector, and its destination register %9 is hashed into the destination table. Next, the parser reads the second instruction. Since this instruction depends on the previous one (its source register %9 is the same as the destination register of the first instruction), it is inserted into the vector. Following the same rules, the parser constructs intermediate graphs with different input seeds. After obtaining three intermediate graphs, the parser performs aggregation to combine them into an SPDG, ensuring that the SPDG covers every intermediate graph. Following Algorithm 1, it first reads and removes nodes with no incoming edges, which is node 0 in this case. Since node 0 is identical among the intermediate graphs, the parser inserts it into the SPDG and stores the values 16, 34, and 57 as the maximum weights from node 0 to nodes 1, 5, and 4, respectively. Next, the parser reads and removes nodes 5 and 4 from the graphs and inserts them into the SPDG with the corresponding maximum weights 34 and 57, respectively. By the same token, three intermediate graphs with different input seeds are combined into an SPDG. Fig. 4 shows some real-life benchmarks.



5.2 Optimal Parallelization Discovery: SPDG → CG

In this section, stemming from Definitions 2 and 3, we propose an optimization engine that partitions an SPDG into several interdependent clusters, determining the optimal degree of parallelization while minimizing the inter-cluster weights (data communication cost). The specific communication patterns (data and control dependencies) of each program help to design the irregular NoC, thereby reducing network congestion and latency.
We consider the irregular topology as the set of communication links among the partitioned clusters. That is, each cluster is an abstract description of a sequence of instructions to be mapped onto hardware, and inter-cluster edges are regarded as the data communication patterns inside the NoC. Therefore, we convert the problem of discovering the optimal degree of parallelization into solving an optimization engine that partitions an SPDG into highly interconnected clusters to be mapped onto hardware, as follows:
Given an SPDG, find non-overlapping clusters such that the measure M is maximized.

$$M = \frac{1}{m}R_1 - \frac{\lambda_1}{e}R_2 - \frac{\lambda_2}{2n}R_3 \qquad (1)$$

$$R_1 = \sum_{i,j}\Bigg[\underbrace{A_{ij}\,\delta(c_i, c_j)}_{\text{the sum of weights in a cluster}} - \underbrace{\frac{k_i^{in} k_j^{out}}{m}\,\delta(c_i, c_j)}_{\text{inter-cluster weight sum}}\Bigg] \qquad (2)$$

$$R_2 = \sum_{i,j \in CG} \underbrace{\mathbf{1}(A_{ij} \neq 0)}_{\text{an edge exists}} \cdot \underbrace{\mathbf{1}\big(d_{DFS}(j) > d_{DFS}(i)\big)}_{\text{the edge is backward}} \qquad (3)$$

$$R_3 = \sum_{i \in CG}\Bigg\{\mathbf{1}\Bigg(\underbrace{\sum_{u \in CG:\, B_{iu} \neq 0} 1}_{\text{the number of in-coming edges}} > l_1\Bigg) + \mathbf{1}\Bigg(\underbrace{\sum_{v \in CG:\, B_{vi} \neq 0} 1}_{\text{the number of out-going edges}} > l_2\Bigg)\Bigg\} \qquad (4)$$

where m is the sum of the weights of all edges (m = Σ_{i,j} w_ij); e is the number of edges in the SPDG; n is the number of clusters; λ1 and λ2 are user-defined hyper-parameters that control the regularization terms; A_ij is the edge weight from node j to node i in the SPDG (0 means no edge exists); B_ij is the edge weight from cluster j to cluster i in the CG; k_i^in is the sum of the weights of all in-coming edges adjacent to node i (k_i^in = Σ_p w_pi); k_i^out is the sum of the weights of all out-going edges adjacent to node i (k_i^out = Σ_q w_iq); the delta function δ(u, v) equals 1 if u = v, and 0 otherwise; c_i is a cluster index from 1 to n; 1(s) is the indicator function, which equals 1 if s evaluates to true, and 0 otherwise; l_1 and l_2 are user-defined parameters that bound the number of in-coming and out-going edges each cluster should have; and d_DFS(i) measures the depth of node i via depth-first search (DFS).

Maximization of Eq. (1) has a direct influence on (a) maximizing intra-cluster edge weights and minimizing inter-cluster edge weights (data communication) in Eq. (2); (b) preventing the occurrence of backward edges in the CG, as indicated by Eq. (3); and (c) restricting the number of edges associated with a cluster in Eq. (4). The first term in Eq. (1) measures the fraction of edges within clusters minus the expected fraction of such edges; the higher the value, the more statistically surprising the fraction of edges that fall within a chosen cluster. The first regularization term constrains clusters so that they cannot exhibit cyclic resource waiting, resulting in a deadlock-free network when implemented on hardware. The idea is to identify whether any backward edge exists, which can cause resource waiting and, in turn, deadlock. We first assign a depth to each node via DFS; if an edge exists and is backward, R2 is non-zero, leading to a smaller M. The second regularization term keeps the numbers of in-coming and out-going edges within thresholds: the limited scalability of crossbars prevents routers from supporting a large number of input and output ports, so we restrict the number of edges adjacent to a cluster to within l1 and l2. Otherwise, R3 is non-zero, making M smaller compared to a case where R3 is 0.

However, solving this optimization problem is NP-hard. Therefore, inspired by the algorithm mentioned in [3], we solve the optimization engine to partition an SPDG into interconnected clusters, which constitute the CG in Fig. 6.

Connection to Standard Compiler Techniques. Our graph representation, the SPDG, not only captures the optimal degree of parallelization but can also support standard compiler techniques.

1. Checkpointing: a technique that allows programmers to debug code and provides fault tolerance for computing systems. It is equivalent to performing a cut in an SPDG and outputting the intermediate results from the graph.
2. Dead code elimination: a technique that removes code that does not influence the program results. It corresponds to checking the children of each node in an SPDG to see whether their destination registers are identical.
3. Unreachable code analysis: identifies code that will never be executed regardless of the values of variables and other runtime conditions. We can detect unreachable code by checking whether an SPDG is fully connected; if not, the two or more disconnected subgraphs indicate unreachable code.
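As a concrete reading of Eqs. (1)-(4), the sketch below scores a candidate partition of a small weighted digraph. It is illustrative, not the paper's solver: the edge convention (a dict keyed by (src, dst)), and the BFS-style depth assignment used as a rough stand-in for the DFS depth test, are simplifying assumptions.

```python
from collections import defaultdict

def partition_score(edges, cluster, lam1=1.0, lam2=1.0, l1=4, l2=4):
    """Score a candidate partition with the objective M of Eqs. (1)-(4).

    edges:   dict {(src, dst): weight} over SPDG nodes
    cluster: dict node -> cluster id
    Returns M = R1/m - lam1*R2/e - lam2*R3/(2n).
    """
    m = sum(edges.values())
    e = len(edges)
    n = len(set(cluster.values()))
    k_out, k_in = defaultdict(float), defaultdict(float)
    for (u, v), w in edges.items():
        k_out[u] += w
        k_in[v] += w
    # R1 (Eq. (2)): intra-cluster weight minus its null-model expectation
    r1 = sum(w - k_in[v] * k_out[u] / m
             for (u, v), w in edges.items() if cluster[u] == cluster[v])
    # cluster-graph (CG) edges between distinct clusters
    cg = defaultdict(float)
    for (u, v), w in edges.items():
        if cluster[u] != cluster[v]:
            cg[(cluster[u], cluster[v])] += w
    # R2 (Eq. (3)): backward CG edges; a BFS-style depth assignment is a
    # rough stand-in for the paper's DFS depth d_DFS
    depth = {}
    for (cu, cv) in cg:
        depth.setdefault(cu, 0)
        depth[cv] = max(depth.get(cv, 0), depth[cu] + 1)
    r2 = sum(1 for (cu, cv) in cg if depth[cv] < depth[cu])
    # R3 (Eq. (4)): clusters whose CG degree exceeds the port budget l1/l2
    indeg, outdeg = defaultdict(int), defaultdict(int)
    for (cu, cv) in cg:
        outdeg[cu] += 1
        indeg[cv] += 1
    r3 = sum((indeg[c] > l1) + (outdeg[c] > l2) for c in set(cluster.values()))
    return r1 / m - lam1 * r2 / e - lam2 * r3 / (2 * n)
```

A partition that keeps heavy edges inside clusters and avoids backward inter-cluster edges scores higher than one that cuts them, matching the intent of Eq. (1).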

Fig. 6. Example of graphs from an fft application.



5.3 Fractal Dimension and Degree Distribution

In this section, we discuss the fractal dimension and degree distribution of the graphs generated in Section 5.1 to extract structural and dynamical information as features, which are used in the next section to decide the optimal mapping of clusters and to design the irregular NoC.
The first fundamental property of complex networks arises with the discovery that the probability distribution of the number of edges per node, P(k) (also known as the degree distribution), can be represented by a power law [31]:

$$P(k) = k^{-\gamma}, \qquad (5)$$

where k is the degree and γ is the degree exponent. However, most LLVM instructions require at most two source registers and forward their results to many instructions. That is, most nodes in an SPDG have at most two in-coming edges and many out-going edges. It is not informative to model the in-degree, which is bounded between 0 and 2; therefore, in order to better discover structural patterns in an SPDG, we model the out-degree as a power law.
The second significant property of graphs is their self-similar or fractal behavior, measured through the fractal dimension (FD). The FD is a core concept used to measure the degree to which a geometric object is exactly or approximately similar to a part of itself. The box covering algorithm is commonly used to measure this dimension in unweighted [31] or weighted graphs [33], [30], [4]. The relationship between the number of boxes N_B and the size of the boxes l_B is given by [31]:

$$N_B \sim l_B^{-d_B}, \qquad (6)$$

where d_B is the fractal dimension. However, each SPDG not only has weights but also provides additional information that is not exploited in the box covering (BC) algorithm, namely the node attribute (e.g., addition, subtraction). Therefore, we propose a modified BC algorithm, given in Algorithm 2, to calculate the FD of weighted graphs with node attributes.

Algorithm 2. Modified Box Covering Algorithm

Input: A graph G
Output: NB, lB
1: Calculate the shortest path dij for any pair of nodes
2: lB = 0
3: while dmin = dij.pop() do
4:   lB += dmin
5:   for i, j ∈ G do
6:     if dij ≤ lB and attri == attrj then
7:       Insert an edge between i and j into G2
8:   NB.append(graph_coloring(G2))
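The distance/attribute test in Algorithm 2 can be sketched as follows, under two simplifications flagged in the comments: all-pairs distances come from repeated Dijkstra runs, and boxes are approximated by connected components of the compatibility graph rather than by the graph coloring step.

```python
import heapq

def modified_box_count(adj, attr, l_b):
    """Count boxes of size l_b in a weighted graph with node attributes.

    adj:  dict node -> list of (neighbor, weight) (undirected view)
    attr: dict node -> attribute label
    Two nodes may share a box only if their shortest-path distance is
    <= l_b AND their attributes match (the paper's modification). Boxes
    are approximated here by connected components of that compatibility
    graph, a simplification of Algorithm 2's coloring step.
    """
    def dijkstra(src):
        dist = {src: 0}
        pq = [(0, src)]
        while pq:
            d, u = heapq.heappop(pq)
            if d > dist.get(u, float("inf")):
                continue
            for v, w in adj.get(u, []):
                nd = d + w
                if nd < dist.get(v, float("inf")):
                    dist[v] = nd
                    heapq.heappush(pq, (nd, v))
        return dist

    nodes = list(adj)
    comp = {u: u for u in nodes}           # component label per node
    dists = {u: dijkstra(u) for u in nodes}
    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:
            if attr[u] == attr[v] and dists[u].get(v, float("inf")) <= l_b:
                ru, rv = comp[u], comp[v]
                if ru != rv:
                    # relabel rv's component (linear scan; fine for small graphs)
                    for node, root in comp.items():
                        if root == rv:
                            comp[node] = ru
    return len(set(comp.values()))
```

Repeating the count over increasing l_b and fitting Eq. (6) on a log-log plot would then yield d_B.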

Fig. 7 shows an example of the modified BC algorithm. At some iteration, l_B is set to 12. The algorithm calculates the shortest distance d_ij from node i to node j when the two nodes have the same attribute, and sets it to 0 otherwise. For example, d_15 = 12 whereas d_14 = 0. This is the central difference between our approach and prior work (for which d_14 = 12) [4]. d_ij ≤ l_B implies a connection in the second graph. After constructing this graph, we apply a graph coloring algorithm to identify the number of boxes N_B, which in this case is 2. Fig. 4c shows that the fractal dimensions for the BMI and drone controller benchmarks are 1.2 and 1.4, respectively.
We believe that the use of SPDGs is fundamental for modeling the structures and dynamics of a vast number of real-life programs. We can investigate the structural properties of such a network and infer outcomes about the behavior of the corresponding programs. These results have an impact on our understanding of the evolution and behavior of complex programs.

5.4 Mapping PEs on Hardware: CG → HW

So far, the PoC framework identifies the optimal number of heterogeneous PEs for parallel execution and designs an abstract description of an irregular, deadlock-free NoC. However, power efficiency is not yet optimal, even though Eq. (2) tries to keep most data movements local to each cluster. Compute-intensive and memory-intensive applications differ substantially in their data movements, and general-purpose NoCs cannot guarantee the best performance and power efficiency for all applications. In particular, excessive buffers in each router can cost over 30 percent of the total power. Therefore, in order for the hardware to be power efficient while maintaining desirable performance, we need to reconfigure the buffer sizes optimally for each application.
In this section, we propose a DRL-based mapper (see Fig. 8). We seek to overcome the drawbacks of the NP-hard integer linear programming based method and improve the performance of searching for an optimal solution at runtime. We represent the process of mapping clusters onto PEs and deciding the number of buffers in routers as a Markov decision process (MDP) and apply DRL to learn the optimal mapping and buffer sizes, given a CG and a reward function that scores the quality of a mapping. We represent the underlying hardware as an environment, and cluster mapping and buffer resizing as a set of actions that maximize the reward function. Moreover, we combine RL with graph neural networks (GNNs) to learn the best actions with the help of the structure of a given CG and its features, i.e., the fractal dimension and power-law distribution.

Fig. 7. Modified box covering algorithm. Colors indicate attributes.

Fig. 8. PoC’s DRL engine for PE mapping/synthesis.



1. Markov Decision Process (MDP). An MDP is a formalism that covers most RL problems with discrete actions. A fully observed MDP is a process with a state s_t at time t. At each state s_t, an action a_t is chosen, which determines the next possible state s_{t+1} with an unknown probability p(s_{t+1} | s_t, a_t). An agent A(a_t | s_t; θ) selects a sequence of actions a_t with a policy via learned parameters θ. Under an action a_t in state s_t, the immediate reward r(s_t, a_t) is forwarded back to the agent after the transition from state s_t under action a_t.
We consider the mapping of PEs and the resizing of buffers as an MDP. The state space consists of the locations of available PEs on the NoC and the buffer size of each router. The action space depends on the target platform: it consists of synthesizing the current PE onto the NoC and resizing buffers if the platform is reconfigurable; if the heterogeneous platform consists of CPUs and GPUs, it consists of mapping the current cluster/PE onto a CPU or GPU, as chosen by the agent, and resizing buffers. In the experimental results, we report results on both platforms. The agent's task, then, is to produce the next mapping/synthesis of a PE onto a given platform in a given state s_t such that the final hardware architecture, which serves as the environment, maximizes a cumulative reward function R_t.
2. Reward Function. The reward function is used to evaluate and learn the optimal policy. In our implementation, given a set of interconnected clusters represented as a CG, we design the following immediate reward for the current state s_t under the action a_t:

$$r(s_t, a_t) = r_p(s_t, a_t) - r_m(s_t, a_t) - r_e(s_t, a_t) = \frac{1}{N_1}\sum_{i \in N'} Perf(c_i \mid s_t) - \frac{1}{N_2}\sum_{i \in N'} s_t(buf_i) - \frac{1}{N_3}\sum_{ij \in N'} w'_{ij}\,\big[E_{read}(c_i) + E_{write}(c_j)\big],$$

where r_p, r_m, and r_e represent the rewards for performance, memory, and energy, respectively; N' is the number of clusters in the CG; Perf is a C clock function that measures the performance of a cluster c_i; w'_ij is the communication cost from cluster c_i to cluster c_j; E_read and E_write represent the amount of energy each read or write costs, respectively, with E_write = H E_sw^ff (H is the average switching activity and E_sw^ff is the energy to switch one bit) and E_read = (n − 1) E_write, where n is the number of flits held in buffers before the read operation, as defined in [14] to model the energy consumption of buffers; and N_1, N_2, and N_3 are the normalization terms.
The first metric measures the total execution time of a given application synthesized into or mapped onto hardware. The second metric counts the number of buffers returned by the agent, i.e., the graph convolutional neural network. The last models the energy consumption of data communication inside the NoC for each communication link from node i to node j (read from i and write to j). Therefore, in order to achieve high performance and low energy consumption, the reward grants high positive values for an efficient mapping and penalizes high energy consumption with negative values.
During PE mapping/synthesis, we want to learn the optimal policy which maximizes the accumulative reward

$$R = \sum_{t=1}^{T} \gamma^t\, r(s_t, a_t),$$

where γ ∈ [0, 1] is a decay factor. We use Q values to represent the maximum accumulative reward the agent can obtain by taking action a_t in state s_t. Then, the optimal value Q*(s_t, a_t) can be calculated by the Bellman equation [34] as follows:

$$Q^*(s_t, a_t) = \mathbb{E}\big[r_{t+1} + \gamma \max_{a_{t+1}} Q^*(s_{t+1}, a_{t+1}) \,\big|\, s_t, a_t\big]. \qquad (7)$$

Since the state transition is stochastic, we follow Q-learning [34] to update the Q values:

$$Q_{t+1}(s_t, a_t) \leftarrow (1-\alpha)\,Q_t(s_t, a_t) + \alpha\big[r(s_t, a_t) + \gamma \max_{a_{t+1}} Q_t(s_{t+1}, a_{t+1})\big], \qquad (8)$$

where α ∈ [0, 1] is the learning rate.
3. Agent: Graph Convolutional Neural Network. The inputs to the agent are features extracted from the clusters in a CG via the fractal dimension and degree distribution. A schematic diagram of our graph convolutional neural network (GCNN) architecture, which acts as an agent to approximate the Q values in Eq. (8), is shown in Fig. 9. The inputs to the graph neural network are the cluster graph (CG) obtained in Section 5.2 and a weight matrix in which each row is the feature vector of a node. In the implementation, we use node availability and the size of the buffers as node features. The input layer is followed by two identical hidden layers, each consisting of a graph convolutional network, which encapsulates the hidden representation by aggregating feature information from a node's neighbors, and a non-linear ReLU transformation. Finally, an output layer uses a traditional neural network with extra input features (the degree exponent and fractal dimension) as a regressor to obtain the Q values for the current state and action. We use the cross entropy between the target Q value and the inferred Q values as the loss function. During training, the experience tuples (previous state, action, reward, current state) are collected while exploring the search space. For each experience, the training takes place in four steps: (1) the graph neural network estimates the Q values of the previous state; (2) the graph neural network estimates the Q values of the current state; (3) the target Q values for the action under the current state are calculated using the reward function; and (4) the graph neural network is trained with the previous state as input and the target Q values as output.
4. Environment: Plasticity-on-Chip. The environment interacts with the agent to receive possible actions (i.e.,

Fig. 9. Graph neural network architecture as a learnable agent. In DRL, a GNN takes as input the cluster graph (CG) generated from an SPDG and its node features. The hidden layers include a graph convolutional network followed by a ReLU layer. The output layer consists of a traditional neural network performing regression to determine the Q values for states and actions.



the placement of PEs and routers, the number of required buffers, and the irregular interconnect). The environment provides an immediate reward, which consists of the execution time provided by the runtime system, the number of buffers in the routers, and the analytical energy consumption. This reward is returned to the agent to help it make better decisions next time. The environment (PEs, MEs, and IEs) keeps evolving over time with the help of the agents to maximize the long-term accumulative reward.
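The update in Eq. (8) can be sketched in tabular form. The actual agent is the GCNN of Fig. 9; the state and action names below are placeholders used only for illustration.

```python
def q_learning_step(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One tabular Q-learning update (Eq. (8)):

    Q(s,a) <- (1-alpha) Q(s,a) + alpha [ r + gamma * max_a' Q(s',a') ]

    Q is a plain dict keyed by (state, action); missing entries default
    to 0. The paper approximates Q with a GCNN over the CG, but a table
    is enough to show the update rule itself.
    """
    best_next = max((Q.get((s_next, a2), 0.0) for a2 in actions), default=0.0)
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + \
                alpha * (r + gamma * best_next)
    return Q[(s, a)]
```

Repeated calls propagate the reward for a good mapping/buffer-resizing action backwards through the visited states, which is what lets the agent improve its next placement decision.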

6 EXPERIMENTAL RESULTS

In this section, we discuss the simulation setup and present experimental results that investigate the validity, feasibility, and effectiveness of the proposed methodology.

6.1 Simulation Configuration

In the experiments, we demonstrate the PoC framework on two different platforms, i.e., a heterogeneous platform and a reconfigurable platform. First, for a heterogeneous platform that consists of CPUs, GPUs, and accelerators, we use the cycle-accurate performance simulator gem5-gpu, interfaced with the PEs obtained from the applications in Section 5.4, which are further mapped onto the reconfigurable NoC. Similar to the gem5-Aladdin setup [30], CPUs and GPUs in gem5-gpu running user programs can invoke PEs via the ioctl system call, which is mainly used for device communication. We assign a special file descriptor to each domain-specific processing element (DSPE) and run them simultaneously. The CPUs and GPUs, on the other hand, spin-wait for the DSPEs to finish the task. The baseline runs the applications on gem5. Table 1 lists the simulator setup for gem5, gem5-gpu, and gem5-Aladdin. We have also compared our results with MinBD [10] to validate the benefits of buffer resizing in a reconfigurable NoC. The technology node used for energy and area is 65nm [14]. Second, for a reconfigurable platform, we generate RTL for each application directly from the SPDG, which captures communications and computations, and compare it with Aladdin [29] and LegUp HLS [4] in terms of performance, area, and power using commercial 40nm standard cells. In addition, the RTL is generated such that each computation is mapped to an RTL instruction from the SPDG. The FPGA board we use is the Nexys 4 DDR XC7A100T-1CSG324C. This platform contains 63,400 slice LUTs, 4,860 Kbits of fast block RAM, 210 IOs, and 240 DSP slices, and the maximum clock frequency is 450 MHz. Unless explicitly mentioned, we set λ1, λ2, l1, and l2 to 1, 1, 4, and 4, respectively. Table 2 lists the set of benchmarks we consider in the evaluation.

6.2 Graph Statistics and Parallelization Degree

Table 3 shows the number of nodes, edges, and clusters for each benchmark. We select benchmarks and input seeds so that the graphs are complex enough to show the effectiveness of PoC. In addition, it is interesting to note that the optimal number of clusters for mandel is the highest among all of the benchmarks. The reason is that, in parallel computing, mandel is an embarrassingly parallel program: little or no effort is needed to separate the problem into a number of parallel tasks. Therefore, its number of clusters is higher than the others'.

6.3 Experimental Results (Heterogeneous Platforms)

6.3.1 Effects of l1 and l2 on Application Performance

l1 and l2 control the number of in-coming and out-going edges generated for each node in a CG. It is interesting to see how these two hyper-parameters affect application performance, so we choose two different configurations. The bar charts in Fig. 10a show application performance, whereas the lines indicate areas. Since dijkstra and fft are more data-intensive than the others, their buffer sizes are

TABLE 1
Configuration Parameters

CPU      Cores              32 in-order cores, 16 MSHRs
         Clock Frequency    2.4 GHz
         L1 private cache   64KB, 4-way associative, 32-byte blocks
         L2 shared cache    256KB, distributed
         Memory             4 GB, 8 GB/s bandwidth
GPU      Number             32
         Clock Frequency    575 MHz
         Memory             768 MB, 86.4 GB/s bandwidth
Network  Topology           Mesh
         Routing Algorithm  XY routing
         Flow Control       Virtual channel, flit-based

TABLE 2
Programs and Descriptions

Program   Description                Input Size
dijkstra  Find the shortest path     50 nodes
fft       Fast Fourier transform     vector of size 1024
k-means   K cluster partitioning     128 2D tuples
mandel    Calculate Mandelbrot set   4092 points
md        Molecular dynamics         512 particles
nn        Neural network             3 hidden FC layers
neuron    ReLU neurons               64 neurons
cnn       Conv. neural network       conv-pool-FC

TABLE 3
Graph Statistics

Program   No. Nodes  No. Edges  C    γ     dB    DI
dijkstra  248,959    291,112    53   2.29  1.32  0.6
fft       109,295    143,183    31   2.21  1.44  0.8
k-means   98,592     119,112    33   2.24  1.53  1.3
mandel    235,051    260,042    108  2.43  1.26  4.2
md        1,799,353  2,361,213  79   2.17  1.51  2.3
nn        124,496    161,428    64   2.16  1.37  1.7
neuron    57,883     73,431     38   2.20  1.24  1.2
cnn       361,464    520,596    57   2.18  1.45  2.5

C stands for the number of clusters. DI stands for data intensiveness, calculated as the number of computations divided by the number of memory operations.



nearly identical, leading to the same area and performance. However, for the remaining compute-intensive applications, increasing the number of links associated with each router can improve application performance by up to 45.7 percent, although the area overhead is larger than for the data-intensive applications due to the crossbars and buffers.

6.3.2 Application Performance

Fig. 10b shows the normalized application speedup compared to the baseline, MinBD, and gem5-Aladdin. In general, our approach achieves up to a 2.6x improvement over gem5-Aladdin and 7.61x over the baseline, owing to the fact that we restrict most data movements to the inside of each cluster (PE). Our approach works better with data-intensive applications (2.45x on average) than with compute-intensive applications (2x on average) because of the communication-aware partitioning and optimally buffered NoC. The performance improvement over MinBD is due to the irregular NoC architecture synthesized during the configuration discussed in Section 5.2: instead of relying on the XY routing algorithm to send a flit from one node to another, a direct link is synthesized in between to reduce the number of hops the flit takes.

6.3.3 Power Consumption

Fig. 10c shows the normalized power improvement compared to gem5-Aladdin and MinBD. Data-intensive applications show a smaller improvement over MinBD (1.14x on average) because, for a fixed area, they have sufficient buffering resources to prevent network congestion. For compute-intensive applications, our approach achieves higher power improvements over gem5-Aladdin (1.84x on average) and MinBD (1.53x on average). The power improvement over PE+MinBD is due to the NoC configuration and buffering in Sections 5.2 and 5.4, respectively. On the one hand, since routing a flit from one node to another requires a certain amount of power, reducing the number of hops for the flit can greatly reduce the dynamic power consumption. On the other hand, static power is consumed even when some NoC buffers are idle, so determining the optimal number of buffers required for the target applications improves the power efficiency. This improves the PoC's fitness for energy-efficient applications.

6.3.4 Average Network Latency

In Section 5.4, the DRL engine maps PEs onto CPUs, GPUs, or accelerators and decides the optimal buffer size for the irregular NoC topology. In order to quantify network congestion, network

Fig. 10. Experimental results for a heterogeneous platform.

Fig. 11. Experimental results for a reconfigurable platform.

Fig. 12. Average network latency.



latency is measured, since packets have to be buffered at a fixed throughput, which can lead to increased delay. Therefore, we evaluate the average network latency against the baseline and MinBD. In Fig. 12, the average improvement in network latency across all applications is 1.41x; the minimum and maximum improvements are 1.2x and 1.5x, respectively. Some applications show a higher improvement because of their heavily interconnected communities. In addition, data movement is constrained almost entirely inside each core, leading to a low network injection rate, which improves network latency.
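As a worked illustration of how such per-application latency improvements are aggregated into the average/min/max figures reported for Fig. 12, the snippet below uses made-up latencies, not the paper's measurements:

```python
# Illustrative only: per-application latency improvement is the ratio of
# baseline latency to PoC latency; min/max/avg summarize all applications.
# The latency values below are invented for the example.

baseline = {"appA": 120.0, "appB": 90.0, "appC": 150.0}
poc      = {"appA": 100.0, "appB": 60.0, "appC": 120.0}

improvements = {app: baseline[app] / poc[app] for app in baseline}
avg = sum(improvements.values()) / len(improvements)
print(min(improvements.values()), max(improvements.values()), round(avg, 2))
```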

6.4 Experimental Results (Reconfigurable Platforms)

Figs. 11a and 11b show the execution time and power consumption compared to Aladdin and LegUp. In general, our approach achieves up to a 1.8x improvement over Aladdin and 2.5x over LegUp in execution time, and 2.1x over Aladdin and 2.0x over LegUp in power consumption. This is because we minimize the cost of data communication among clusters via an optimization model: when clusters are synthesized into hardware, the injection rate of packets into the communication substrate is low, leading to performance improvement and power reduction. In addition, Fig. 11c compares areas with Aladdin and LegUp. For some applications, such as k-means, nn, and neuron, the hardware areas generated by PoC are worse than those of LegUp and Aladdin; we believe this is because the NoC synthesized by PoC takes space even when an application does not require much area. Nevertheless, for the remaining applications, which require a large amount of area, PoC achieves 1.06x compared to Aladdin and LegUp, thanks to buffer resizing, which reduces the NoC area, and the optimal degree of parallelization, which efficiently organizes the PE structures.
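The area trade-off described above, a roughly fixed NoC overhead that hurts small designs but is amortized in large ones, can be seen with a back-of-envelope calculation; all numbers below are assumed for illustration:

```python
# Back-of-envelope sketch (all numbers invented): why a synthesized NoC
# can inflate area for small designs but is amortized for large ones.

NOC_AREA = 2.0   # fixed NoC overhead (arbitrary units)

def relative_area(pe_area, baseline_area):
    """PoC area (PEs + NoC) relative to a NoC-free baseline design."""
    return (pe_area + NOC_AREA) / baseline_area

print(relative_area(3.0, 4.0))    # small app: overhead dominates (> 1.0)
print(relative_area(40.0, 44.0))  # large app: overhead amortized (< 1.0)
```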

7 CONCLUSION

As machine learning and artificial intelligence continue to evolve, high-level programs are becoming increasingly complex and elusive, making parallelization by runtime systems extremely difficult. The situation is made worse by software programmers' limited understanding of the architecture of the underlying platform. Therefore, understanding the structure and dynamics of programs becomes vital in parallel computing. In this paper, we presented PoC, the Plasticity-on-Chip design flow for future manycore systems. We first perform static and dynamic program analysis to construct a self-similar, power-law distributed graph by collecting LLVM IR instructions for different input seeds and combining all IR instructions into a graph. Next, inspired by network paradigms, we show that the structural and dynamical information can be analyzed via degree distribution and fractal dimension for two well-known programming techniques, i.e., iteration and recursion. PoC embeds an optimization model to partition the graph in order to identify interconnected clusters with minimal data-communication cost. PoC's DRL engine solves the configurability and synthesis/mapping problem by asking the agent, i.e., a graph convolutional neural network, to learn the optimal policy for the clusters to be synthesized/mapped.
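As a minimal illustration of the first analysis step summarized above, the sketch below computes the degree distribution of a toy program graph (our own example, not PoC's implementation); a heavy, power-law-like tail in this distribution is what the self-similarity analysis looks for:

```python
# Minimal sketch: degree distribution of a program graph given as an
# (undirected) edge list. The toy graph below is invented for the example.
from collections import Counter

edges = [(0, 1), (0, 2), (0, 3), (1, 2), (3, 4)]

degree = Counter()
for u, v in edges:
    degree[u] += 1
    degree[v] += 1

# dist maps a degree value to the number of nodes having that degree.
dist = Counter(degree.values())
print(dict(degree))   # node -> degree
print(dict(dist))     # degree -> node count
```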

ACKNOWLEDGMENTS

This work was supported in part by the National Science Foundation under the CAREER Award under Grant CPS/CNS-1453860, in part by the NSF under Grants CCF-1837131, MCB-1936775, and CNS-1932620, and in part by the DARPA Young Faculty Award and DARPA Director Award under Grant N66001-17-1-4044.

REFERENCES

[1] S. Amershi et al., "Software engineering for machine learning: A case study," in Proc. Int. Conf. Softw. Eng. Softw. Eng. Practice, 2019, pp. 291–300.
[2] T. Bergan et al., "CoreDet: A compiler and runtime system for deterministic multithreaded execution," ACM SIGPLAN Notices, vol. 45, no. 3, pp. 53–64, 2010.
[3] V. D. Blondel et al., "Fast unfolding of communities in large networks," J. Stat. Mechanics: Theory Exp., vol. 2008, no. 10, 2008, Art. no. P10008.
[4] A. Canis et al., "LegUp: High-level synthesis for FPGA-based processor/accelerator systems," in Proc. ACM/SIGDA Int. Symp. Field Program. Gate Arrays, 2011, pp. 33–36.
[5] M. Cheng, J. Li, P. Bogdan, and S. Nazarian, "H2O-Cloud: A resource and quality of service-aware task scheduling framework for warehouse-scale data centers," IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 39, no. 10, pp. 2925–2937, Oct. 2020.
[6] G. Concas et al., "Fractal dimension in software networks," Europhys. Lett., vol. 76, no. 6, Art. no. 1221, 2006.
[7] H. Dang, M. Snir, and W. Gropp, "Towards millions of communicating threads," in Proc. 23rd Eur. MPI Users' Group Meeting, 2016, pp. 1–14.
[8] J. Dongarra et al., "The international exascale software project roadmap," Int. J. High Perform. Comput. Appl., vol. 25, no. 1, pp. 3–60, 2011.
[9] W. Ecker, W. Müller, and R. Dömer, Hardware-Dependent Software. Dordrecht, The Netherlands: Springer, 2009.
[10] C. Fallin et al., "MinBD: Minimally-buffered deflection routing for energy-efficient interconnect," in Proc. IEEE/ACM Int. Symp. Networks-on-Chip, 2012, pp. 1–10.
[11] S. C. Goldstein et al., "PipeRench: A reconfigurable architecture and compiler," Computer, vol. 33, no. 4, pp. 70–77, Apr. 2000.
[12] D. Greenfield and S. W. Moore, "Fractal communication in software data dependency graphs," in Proc. Annu. Symp. Parallelism Algorithms Architectures, 2008, pp. 116–118.
[13] Y. Hoskote et al., "A 5-GHz mesh interconnect for a teraflops processor," IEEE Micro, vol. 27, no. 5, pp. 51–61, Sep.–Oct. 2007.
[14] A. B. Kahng et al., "ORION 2.0: A fast and accurate NoC power and area model for early-stage design space exploration," in Proc. Des., Automat. Test Eur. Conf. Exhib., 2009, pp. 423–428.
[15] G. Kestor et al., "Quantifying the energy cost of data movement in scientific applications," in Proc. IEEE Int. Symp. Workload Characterization, 2013, pp. 56–65.
[16] G. A. Kildall, "A unified approach to global program optimization," in Proc. Annu. ACM SIGACT-SIGPLAN Symp. Princ. Program. Lang., 1973, pp. 194–206.
[17] M. Kowarschik and C. Weiß, "An overview of cache optimization techniques and cache-aware numerical algorithms," in Algorithms for Memory Hierarchies. Berlin, Germany: Springer, 2003.
[18] C. Lattner and V. Adve, "LLVM: A compilation framework for lifelong program analysis & transformation," in Proc. Int. Symp. Code Gener. Optim., 2004, pp. 75–86.
[19] Y. Li et al., "A network-centric hardware/algorithm co-design to accelerate distributed training of deep neural networks," in Proc. 51st Annu. IEEE/ACM Int. Symp. Microarchitecture, 2018, pp. 175–188.
[20] R. Marculescu and P. Bogdan, "The chip is the network: Toward a science of network-on-chip design," Found. Trends Electron. Des. Automat., vol. 2, no. 4, pp. 371–461, 2009.
[21] R. Marculescu et al., "Outstanding research problems in NoC design: System, microarchitecture, and circuit perspectives," IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 28, no. 1, pp. 3–21, Jan. 2009.
[22] A. Mycroft, "Programming language design and analysis motivated by hardware evolution," in Proc. Int. Static Anal. Symp., 2007, pp. 18–33.



[23] L. Nai et al., "Exploring big graph computing: An empirical study from architectural perspective," J. Parallel Distrib. Comput., vol. 108, pp. 122–137, 2017.
[24] S. Nazarian and P. Bogdan, "S4oC: A self-optimizing, self-adapting secure system-on-chip design framework to tackle unknown threats: A network theoretic, learning approach," in Proc. IEEE Int. Symp. Circuits Syst., 2020, pp. 1–8.
[25] T. Nowatzki and K. Sankaralingam, "Analyzing behavior specialized acceleration," in Proc. Int. Conf. Architectural Support Program. Lang. Operating Syst., 2016, pp. 697–711.
[26] S. R. Paul et al., "A unified runtime for PGAS and event-driven programming," in Proc. IEEE/ACM 4th Int. Workshop Extreme Scale Program. Models Middleware, 2018, pp. 46–53.
[27] G. Schirner, A. Gerstlauer, and R. Dömer, "Automatic generation of hardware dependent software for MPSoCs from abstract system specifications," in Proc. Asia South Pacific Des. Automat. Conf., 2008, pp. 271–276.
[28] S. Seo et al., "Argobots: A lightweight low-level threading and tasking framework," IEEE Trans. Parallel Distrib. Syst., vol. 29, no. 3, pp. 512–526, Mar. 2018.
[29] Y. S. Shao et al., "Aladdin: A pre-RTL, power-performance accelerator simulator enabling large design space exploration of customized architectures," in Proc. ACM/IEEE 41st Int. Symp. Comput. Archit., 2014, pp. 97–108.
[30] Y. S. Shao et al., "Co-designing accelerators and SoC interfaces using gem5-Aladdin," in Proc. Annu. IEEE/ACM Int. Symp. Microarchit., 2016, pp. 1–12.
[31] C. Song et al., "Origins of fractality in the growth of complex networks," Nat. Phys., vol. 2, pp. 275–281, 2006.
[32] C. Song, S. Havlin, and H. A. Makse, "Self-similarity of complex networks," Nature, vol. 433, pp. 392–395, 2005.
[33] Y. Song et al., "Multifractal analysis of weighted networks by a modified sandbox algorithm," Sci. Rep., vol. 5, Art. no. 17628, 2015.
[34] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA, USA: MIT Press, 2018.
[35] M. B. Taylor et al., "Evaluation of the Raw microprocessor: An exposed-wire-delay architecture for ILP and streams," in Proc. Annu. Int. Symp. Comput. Archit., 2004, pp. 2–13.
[36] R. Uhrie et al., "Machine understanding of domain computation for domain-specific system-on-chips," in Proc. SPIE, 2019, pp. 180–187.
[37] E. Waingold et al., "Baring it all to software: Raw machines," Computer, vol. 30, no. 9, pp. 86–93, Sep. 1997.
[38] D. Wei et al., "Box-covering algorithm for fractal dimension of weighted networks," Sci. Rep., vol. 3, Art. no. 3049, 2013.
[39] Y. Xiao, S. Nazarian, and P. Bogdan, "Self-optimizing and self-programming computing systems: A combined compiler, complex networks, and machine learning approach," IEEE Trans. Very Large Scale Integration Syst., vol. 27, no. 6, pp. 1416–1427, Jun. 2019.
[40] Y. Xie et al., "Design space exploration for 3D architectures," Emerg. Technol. Comput. Syst., vol. 2, no. 2, pp. 65–103.
[41] Y. Xue and P. Bogdan, "Reliable multi-fractal characterization of weighted complex networks: Algorithms and implications," Sci. Rep., vol. 7, Art. no. 7487, 2017.

Yao Xiao (Student Member, IEEE) is currently working toward the PhD degree with the Ming Hsieh Department of Electrical and Computer Engineering, University of Southern California. His research interests include program analysis and compiler optimization using machine learning, parallel computer architecture and programming, graph representation of high-level programs, modeling applications as complex networks, mapping communities detected from complex networks onto NoCs, and designing high-performance NoCs.

Shahin Nazarian (Member, IEEE) is currently an associate professor of engineering practice with the Ming Hsieh Electrical and Computer Engineering Department, USC Viterbi School of Engineering. Before joining USC, he was a senior R&D software engineer with Magma Design Automation (now part of Synopsys), focusing on timing and noise solution development as part of the Talus and Tekton tools. He has extensive industrial experience, as a technical expert and software and hardware design engineer, in a wide range of areas including computer-aided design, architecture, and embedded system design. He is the founder and president of Vervecode, Inc., and is also a consultant to several technical companies. His current research interests include machine learning, system timing and power optimization of VLSI circuits, and emerging technologies.

Paul Bogdan (Senior Member, IEEE) received the PhD degree from Carnegie Mellon University. He is the Jack Munushian Early Career Chair associate professor with the Ming Hsieh Department of Electrical and Computer Engineering, University of Southern California. His research interests include performance analysis and design methodologies for manycore systems, theoretical foundations of cyber-physical systems, control of complex time-varying networks, modeling and analysis of biological systems and swarms, new control algorithms for dynamical systems exhibiting multifractal characteristics, modeling biological or molecular communication, and fractal mean field games to model and analyze biological, social, and technological system-of-systems. His work has been recognized with a number of distinctions, including the 2019 Defense Advanced Research Projects Agency Director's Fellowship, the 2018 IEEE CEDA Ernest S. Kuh Early Career Award, the 2017 DARPA Young Faculty Award, the 2017 Okawa Foundation Award, the 2015 National Science Foundation CAREER Award, the 2012 A.G. Jordan Award from Carnegie Mellon University for an outstanding PhD thesis and service, several best paper awards (including the 2013 Best Paper Award from the 18th Asia and South Pacific Design Automation Conference, the 2012 Best Paper Award from the Networks-on-Chip Symposium, the 2012 D.O. Pederson Best Paper Award from IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, and the 2012 Best Paper Award from the International Conference on Hardware/Software Codesign and System Synthesis), and the 2009 Roberto Rocca PhD Fellowship.

For more information on this or any other computing topic, please visit our Digital Library at www.computer.org/csdl.
