pycoram: yet another implementation of coram memory

PyCoRAM: Yet Another Implementation of CoRAM MemoryArchitecture for Modern FPGA-based Computing

ShinyaTakamaeda-Yamazaki

Tokyo Institute of TechnologyJSPS Research FellowTokyo, Japan 152-8552

[email protected]

Kenji KiseTokyo Institute of Technology

Tokyo, Japan [email protected]

James C. HoeCarnegie Mellon University

Pittsburgh, PA [email protected]

ABSTRACTDevelopment of FPGA system is very complicated becausedesigners should manage sea of logics and bare-metal com-ponents on FPGAs. CoRAM memory architecture is an en-deavor to simplify the FPGA accelerator development. Itssoft-logic implementation for existing FPGAs gives an ab-stract of on-chip/off-chip memory and on-chip interconnec-tions. This paper presents PyCoRAM which is yet anotherimplementation of CoRAM memory architecture, for mod-ern integrated development environments (IDE) provided bythe FPGA vendors. PyCoRAM supports AMBA AXI4, amajor commercial interconnect architecture, as the on-chipinterconnection. PyCoRAM synthesizes an application IP-core from a user-defined application RTL designs with ab-stracted CoRAM memory components and software threadswritten in Python. Then the designer can easily use thesynthesized accelerator as a typical IP-core, as same as theother IP-cores such like soft-core CPU. We evaluated thecurrent implementation of PyCoRAM on two actual FPGAboards. The evaluation result shows that PyCoRAM cansufficiently utilize the memory bandwidth by representationof the memory access pattern in the minimal software modelof Python. This paper is for Category 1 (new unpublishedmanuscript).

1. INTRODUCTIONWith the growing demands for higher energy efficiency ofcomputing, computer architects have to explore a sophis-ticated approach employing heterogeneous computing re-sources, such as GPGPU, FPGA and ASIC[1]. FPGA-basedcustom computing is a hopeful way for both high perfor-mance and energy efficiency. In FPGA accelerator devel-opment, system designers have to implement not only thedetailed computing logic, but also a time-consuming anderror-prone complicated control logic to manage the datamarshaling among the on-chip components and the off-chipmemory.

CoRAM[2] (Connected RAM) memory architecture is an en-deavor to simplify the FPGA accelerator development flow.Its soft-logic implementation[3] on existing FPGAs gives an

abstract of on-chip/off-chip memory components and on-chip interconnections. In CoRAM, the application designercan represent a complicated memory access pattern by us-ing the C-like software model. It simplifies the developmentwith keeping the accelerator performance.

Aside from this, FPGA vendors provide integrated develop-ment environments (IDE), such as Xilinx Platform Studio[4]and Altera Qsys[5], for IP-core-based system development.In the IDEs, most hardware components, including off-chipmemory interface and I/O interface, are handled as abstractfunction units connected to the specified interconnection. Ithelps developers to build a large system, such as FPGA-based SoC.

To integrate CoRAM memory abstraction into modern IDE-based development flow, we developed PyCoRAM, which isyet another implementation of CoRAM memory architecturefor modern IDEs provided by the FPGA vendors. Certainly,PyCoRAM also gives users the memory abstraction to sim-plify the accelerator development, as same as the originalCoRAM. Unlike the original CoRAM, PyCoRAM supportsAMBA AXI4[6, 7] as FPGA on-chip interconnect way. Py-CoRAM tool-chain automatically generates an AXI4 IP-coremodule of the accelerator from the user-defined applicationRTL design and a software program of memory access pat-terns. As the name suggests, PyCoRAM uses Python as theprogramming language to model the memory access pat-terns.

In this paper, we present the architecture and microarchi-tecture of PyCoRAM, including the Python-to-Verilog high-level synthesis compiler to generate a hardware componentto manage on-chip/off-chip data movements. As an initialevaluation, we implemented a very simple bandwidth in-tensive kernel on two kinds of actual FPGA boards. Theevaluation result shows that the current PyCoRAM imple-mentation can sufficiently utilize the memory bandwidth bystraightforward representation of the memory access patternin Python.

2. BACKGROUND2.1 CoRAM Memory ArchitectureCoRAM is a portable memory architecture for efficient FPGA-based computing. The programing view of CoRAM is sim-plified by the abstraction of memory fabrics, as illustratedin Figure.1. Data movement behavior among on-chip mem-ory fabrics and off-chip memory components is representedin the software model. Application development concern issplit into two parts; (1) computing kernels and (2) controlthreads. Computing kernels tightly connected to CoRAM

2013 Workshop on the Intersections of Computer Architecture and Reconfigurable Logic (CARL): Category 1

HW Kernels

(Computing Logics)!

Co

RA

M

Me

mo

ry!

Read

Write!

Manage!

Control

Threads

(FSMs)

CoRAM Channel!

Read/Write Read/Write

Communication

FIFOs (Registers)

Abstracted

On-chip Memories

Off-chip Memory

Figure 1: CoRAM Programming Model

function units (CoRAM Memory, which is an abstract mem-ory block, and CoRAM Channel, which is a communicationchannel between computing kernels and control threads).CoRAM abstract memory blocks are connected to off-chipmemory components via on-chip interconnection. Controlthreads manage the on-chip data movements. Designersspecify the behavior of a control thread by the represen-tation of software model of C. This separation simplifies thedevelopment with keeping the accelerator performance. Italso improves the portability of accelerator design for theother FPGA boards.

A CoRAM memory block has a standard interface as sameas on-chip SRAMs of FPGAs (BRAM). The user-logic canbe defined in a very simple structure, such like a computa-tion pipeline and on-chip memory blocks supplying data forthem. CoRAM channels are used for control between user-logic and control threads. For instance, the user-logic canknow that data for the computation are ready on on-chipCoRAM memory block from control threads via the chan-nels, and notify the completion of the computation to thecontrol threads.

Originally, CoRAM ideology is a standardization of memoryabstraction for future advanced FPGAs with hard-wired on-chip memory interconnections and sophisticated CPU-baseddata marshal units. In order to examine the CoRAM philos-ophy, a soft-logic implementation on existing FPGAs is pro-posed[3]. In the soft-logic implementation, a control threadis realized as a logic unit of FSM (Finite State Machine) us-ing FPGA reconfigurable resources. The soft-logic CoRAMhas an llvm-based HLS (high level synthesis) compiler totranslate a software model of control thread into an RTLdesign with FSMs. As an interconnect system to the off-chip memory, the soft-logic CoRAM employs CONNECT[8],which is a well-tuned FPGA-oriented high throughput NoC(network on chip). Additionally, an optimization methodcalled ShrinkWrap[9] is also proposed, which enhances theon-chip network performance and reduces the resource con-sumption.

2.2 Vendor-Provided IDE and On-chip Inter-connection

FPGA vendors provide useful IDEs for IP-core-based sys-tem development. A hardware component is treated as anabstract function units connected to the specified on-chipinterconnection. Most IDEs provide also many useful IP-cores as standard. System designers can effectively develop aFPGA system, with the provided standard IP-cores and/orsoft-core CPUs. As an interconnection mechanism amonghardware components on an FPGA, the IDEs support stan-

Control Thread HLS Compiler

Generation of FSMs and Control

Actions from Python AST

Converted RTL Design

Verilog HDL!

Control Thread Definition

Python!

User-logic RTL Translator

Static Analysis for

PyCoRAM Signal Insertion

Control Thread RTL Design

Verilog HDL !

System Builder

Writing Top-level RTL Design with AXI4, Test-bench and Configuration Files

User-logic RTL Design

Verilog HDL!

PyC

oR

AM

To

ol-ch

ain!

AXI4 IP-core!

RTL Design

Verilog HDL!

Test bench

Verilog HDL!

Configuration Files

.mpd, .pao!

Figure 2: PyCoRAM Tool-chain

dardized on-chip interconnect architectures, such as AMBAAXI[6, 7]. Designers can easily expand the system by ap-pending original IP-cores into the design, if the original IP-cores are designed for those standard interconnect architec-ture.

Additionally, these interconnect architectures supported bythe vendor IDEs have large capability for accelerator devel-opment. For instance, AMBA AXI4, supported by XilinxPlatform Studio[4], has several useful functions to improvethe system throughput and development efficiency; auto-matic data width conversion and frequency conversion be-tween multiple interfaces, Out-of-Order transmission usinginterface IDs, and so on.

3. PYCORAM: YET ANOTHER IMPLEMEN-TATION OF CORAM

3.1 OverviewPyCoRAM is an orchestrated framework of CoRAM mem-ory abstraction and a vendor-provided IDE with standardon-chip interconnect interface. PyCoRAM synthesizes anIP-core package from user-defined RTL designs of computingkernels and software description of memory access patterns.

PyCoRAM does not generate an RTL design of on-chip net-work. Instead, PyCoRAM utilizes a standard on-chip inter-connection architecture supported by vender IDEs. In thecurrent implementation, PyCoRAM supports AMBA AXI4as the interconnection. As the IDE, Xilinx Platform Stu-dio (XPS) is supported. At synthesis time of FPGA circuitimage, XPS automatically synthesizes an AXI4 interconnec-tion network and connects the IP-core instances.

Figure.2 illustrates the IP-core generation flow of PyCo-RAM. PyCoRAM provides (1) a Python-to-Verilog high-level synthesis compiler (Control Thread HLS Compiler)to generate a hardware design to manage data movementamong on-chip memory components and off-chip DRAMs,(2) an RTL design translator (User-logic Translator) to au-tomatically insert CoRAM memory signals into the inputuser design, and (3) a coordinated RTL synthesizer (SystemBuilder) to generate a top-level IP-core design with AXI4interconnect bridges, a test-bench and some essential con-figuration files for the IDE tool.

The microarchitecture overview of a generated IP-core isillustrated in Figure.3. The hardware kernel that the de-signer modeled with CoRAM memory abstraction is located


Control

Thread!

FSM!

AXI4 Interconnect!

DRAM Controller!

DRAM (Off-chip)!

DMAC!

AXI I/F!

CoRAM

Memory!

CoRAM

Memory!

AXI I/F!

CoRAM

Memory!

CoRAM

Memory!

DMAC! DMAC!

AXI I/F!

CoRAM

Stream!

DMAC!

AXI I/F!

CoRAM

Stream!

CoRAM

Channel!

HW Kernels

(Computing Logics)!

DMA

Cluster!

DMA

Cluster!

FPGA!

Other

AXI

IP-core

or

CPU

PyCoRAM IP

(Application)!

Figure 3: PyCoRAM Microarchitecture Overview

on the top of design. Abstract CoRAM memory objects inthe user-defined design are replaced with a combination ofphysical CoRAM memory objects (Block RAM) and DMAcontrollers. Several CoRAM memory objects may work asa cluster within a single DMA controller. The owner DMAcontroller treats these CoRAM memory objects as a vir-tual single Block RAM, in order to support scatter/gatheroperation and multi-banked Block RAM. Additionally, Py-CoRAM supports CoRAM stream interface connected withoff-chip DRAM in standard. CoRAM stream interfaces canbe accessed as FIFO from the user-defined logics.

Control threads, which represent the memory access pattern,are synthesized into FSM-based units by the HLS compiler.Control threads can send/receive tokens to/from the user-defined logics via CoRAM channels. CoRAM channel hasa FIFO interface. The user-defined logics can be controlledfrom the control thread, and the control threads can use thedynamic value from the user-defined logics. Control threadunits should manage the data on physical CoRAM memoryobjects, by using the DMA controllers. DMA controllersreceive a request of data movement between on-chip BlockRAM and off-chip DRAM. Then the DMA controllers senda transfer request to the off-chip DRAM interface via AXIinterconnection.

AXI interface (AXI I/F) has a general port for handshakingwith the DMA controller and an AXI port that treats a rawAXI4 protocol. Note that the data width of DMA controllerequals to the width of CoRAM memory object. However,the data width of root AXI interconnection may be widerthan the width of DMA controller. In this case, the IDEtool automatically inserts a data width converter betweentwo interfaces that the different data width have.

With PyCoRAM supports, PyCoRAM accelerator IP-coreand other IP-cores can share a same AXI4 interconnectionin a system. System designers can develop a heterogeneousSoC with multiple IP-cores on a single FPGA, in direct wayusing the vendor IDE.

3.2 Python-to-Verilog Compiler for ControlThread

As described, every data movement between on-chip CoRAMmemory (Block RAM) and off-chip DRAM is managed bycontrol threads. In PyCoRAM, control threads’ behavior isrepresented in Python, a major script programing language.

!"#$%&'()"*+*&,$"&-.+/0"123"/4122+5"&-.+/06,1*"7"%&,1*)+*&,$893:7;<"3101=93047>?<"@9A+7B;?CD6/4122+5"7"%&,1*%4122+5893:7;<"3101=93047>?D66,1*E@9A+"7"B?F66!"GH2/09&2"3+G92909&263+G"1,,1$E@H*8,+I+10E09*+DJ6""""133,"7";6""""@H*"7";6""""G&,"9"92",12K+8,+I+10E09*+DJ6""""""""!"0,12@G+,"G,&*"L'()"0&"%&'()"*+*&,$6"""""""",1*M=,90+8;<"133,<",1*E@9A+D6""""""""!"=,90+"1"N15H+"0&"04+"/4122+56""""""""/4122+5M=,90+8133,D6""""""""!"=190"G&,"1"0&O+2"G,&*"04+"H@+,P5&K9/6""""""""@H*"7"/4122+5M,+138D6""""""""133,"Q7",1*E@9A+"R"C6""""!"39@I51$"@010+*+20"G&,"S+,95&K"@9*H5109&26""""I,9208T@H*7T<"@H*D66!"GH2/09&2"/15561,,1$E@H*8FD6

!"

#"

$"

%"

&"

'"

("

)"

*"

+"

#!"

##"

#$"

#%"

#&"

#'"

#("

#)"

#*"

#+"

$!"

$#"

$$"

Figure 4: PyCoRAM Control Thread

An actual control thread is implemented as a state-machineby using some FPGA hardware resources. We developedPython-to-Verilog HLS compiler to generate an RTL de-sign with FSMs to control DMA controllers. Figure.4 showsan example of acceptable Python code of control thread.The example thread just transfers the data from an off-chipDRAM to on-chip CoRAM memory for several times.

A control thread in PyCoRAM consists of (1) instance cre-ations of CoRAM components (memory, stream, channeland register), (2) method invocations of CoRAM instance,and (3) general value definitions and method definitions. Inline 1 in Figure.4, CoRAM memory instance is defined forthis control thread. In line 2, CoRAM channel instance isdefined as a bridge between the control thread and the user-defined logic. To issue a DMA transfer for CoRAM memoryobject, read/write methods of them are called, as shown inline 12. As arguments of the method, a local address onthe CoRAM memory, remote address of DRAM and trans-fer size (in word) are passed. User-defined logics and con-trol threads can communicate with each other. User-definedlogics accept their controls from control threads via CoRAMchannels. Control threads can use dynamic values sent fromthe connected user-defined logics.

As well as general software programming, the basic con-trol syntaxes (if/for/while statements and return/break/-cotninue statements) are available. Designers can easily put


!""#$%$&'()*+$%$&',-#$)$).$#!./+0123'$$$$$4$(-5+$-6('

Control Thread (Python)!

Compile!

FSM part (Verilog HDL)!

Assignment part (Verilog HDL)!

7!(+0(8!8+2'9:$;$:9'<3$=+/).$$$$$$$$'$$(8!8+$>%$?@'+."$'?3$=+/).$$$$$$$$'$$(8!8+$>%$A@'+."$'A3$=+/).$$$$$$$$'$$(8!8+$>%$B@'+."$'B3$=+/).$$$$$$$$'$$),00CC(&C)$>$122$(8!8+$>%$D@$'+E(+$(8!8+$>%$<?@'+."$'9:$;$:9'<<3$=+/).$$$$$$$$'$$(8!8+$>%$B@'+."$'9:$;$:9'+."7!(+'

7!(+0(8!8+2'9:$;$:9'<3$=+/).$$$$$$$$'$$CC(&C!""#$>%$&@'+."$'?3$=+/).$$$$$$$$'$$CC(&C(F5$>%$&@'+."$'A3$=+/).$$$$$$$$'$$CC(&C)$>%$&@'+."$'''''9:$;$:9'<<3$=+/).$$$$$$$$'$$CC(&C)$>%$0CC(&C)$G$<2@'+."$'9:$;$:9'+."7!(+'

Figure 5: HSL Compilation Example

memory access patterns of the application in software man-ner.

The compiler creates a FSM (finite state machine) froma software model. The compiler obtains AST (AbstractSyntax Tree) by using the native code parser of Python.The AST is scanned by the general visitor pattern, and isconverted to Verilog HDL. In the current implementation,each value assignment is converted basically into an identicalstate of FSM, as illustrated in Figure.5. Control syntaxesare realized as conditional transitions. The generated RTLcontains two “always” statements: FSM definition and valueassignment at each FSM state.

The current compiler implementation is still naive, but it in-cludes a basic optimizer that simplifies arithmetic/logic ex-pressions, replaces high-cost operators (multiplication anddivision) with equivalent low-cost operators, and convertssingle-assigned value into constant. In Python, variable def-inition should be done with a value assignment. Basically,a register (“reg” entry in Verilog HDL) is prepared for everyvariable, and the value of a register is assigned in the al-ways statement. To handle constant value (like “parameter”in Verilog HDL) in control threads, the compiler optimizerautomatically converts the definition of variables with justa single assignment into constant definitions of parameter inVerilog HDL.

The current compiler implementation has some limitations.(1) Only “integer” is supported as the data type. Object-oriented programming, such like class definition, is not sup-ported. Array structures (“list”, “dict”, “tuple” and “set”)are not supported. (2) Nested function definition is not sup-ported. (3) Recursive call of methods is not supported.

3.3 User RTL Design ConversionHardware kernels are dominant logics of computing workingwith CoRAM objects. Therefore they should have an opti-mized computing pipeline implemented by traditional HDLsor well-tuned HLS environments. In the current implemen-tation of PyCoRAM, only Verilog HDL is supported as thelanguage of input hardware kernels. When HLS tools orother HDLs are used, RTL codes of hardware kernels shouldbe translated into Verilog HDL in some way.

When a hardware kernel is modeled, designers can use stubsas CoRAM objects in their kernel code. Figure.6(a) showsan example instance creation of the CoRAM memory ob-ject. To identify the personality of CoRAM object, someparameters are essential; (1) thread name, (2) object id, (3)data width of the interface and (4) address length to expressthe capacity.

Table 1: Target FPGA BoardsParameter Digilent Atlys[10] Xilinx ML605[11]

FPGA Xilinx XilinxSpartan-6 LX45 Virtex-6 LX240T

Logic Cells 43,661 241,152BRAM Capacity 261 KB 1849 KB

DRAM DDR2-800 DDR3-800DRAM Bandwidth 1.2 GB/s 6.4 GB/sDRAM Capacity 128 MB 512 MBClock Frequency 100 MHzInterconnection AMBA AXI4Interconnection Crossbar

Type (Performance Optimized)Interconnection 128 bit 256 bit

WidthInterconnection 100 MHz 200 MHz

Frequency

s3!

+!

s2!

+!

s1!

+!

s0!

+!

MUX!

D[0]!D[1]!D[2]!D[3]!D[0]!D[1]!D[2]!D[3]!

sum!

+!

MUX!

D[3]! D[2]! D[1]! D[0]!

Output!

CoRAM Memory

0!

CoRAM Memory

1!

from DMA Controller 0! from DMA Controller 1!

Figure 7: User-logic Design of Array Sum (# SIMDways = 4)

In PyCoRAM compilation step, the user RTL translator,shown in Figure.2, analyzes the kernel code to replace thestubs. The translator automatically inserts additional sig-nals of CoRAM objects into the design. As illustrated Fig-ure.6(a), a “generate” statement can be used to represent aparameterized module/instance hierarchy, unlike the currentsoft-logic implementation of the original CoRAM. As shownin Figure.6(b), some CoRAM-related signals are inserted tothe instance port. When a generate statement is used, some“if” statements are appended to switch the inserted signals.If a module with CoRAM objects is used twice or more, themodule definition in the design is replicated to avoid a nameconfliction by additional signals.

In order to convert designs in this way, syntax parser anddataflow analysis of Verilog HDL are used. We used Pyver-ilog for design analysis in PyCoRAM, which is our originalVerilog HDL design analysis tool-kit.

4. EVALUATION4.1 MethodologyWe evaluated the performance of generated IP-core designby PyCoRAM on two actual FPGA boards. We used Digi-lent Atlys (Xilinx Spartan-6 LX45) and Xilinx ML605 (Xil-inx Virtex-6 LX240T). The configurations of target FPGAboards are listed in Table.1. We used Xilinx Platform Studio


!"#"$%&"'()$*+,-.'+/0.'+,+123'4"!+#5'6))78''9)$%:;":)$<2=8''''>*8''''''?9@AB;CDEAFBGCHB;F*I&J$"%KC#%:"I3L8''''''?9@AB;CMG*+3L8''''''?9@AB;CBGGACNFH*OCB3L8''''''?9@AB;CGBDBCOMGDE*OCG38''''''38''+#P&C:":)$<C#%:"8'''''*8''''''?9NQ*9NQ3L8''''''?BGGA*:":C%KK$3L8''''''?G*:":CK3L8''''''?OF*:":CR"3L8''''''?S*:":CT38''''''3.8"#K'"#K!"#"$%&"8

!"

#"

$"

%"

&"

'"

("

)"

*"

!+"

!!"

!#"

!$"

!%"

!&"

!'"

!("

!)"

(a) Instantiation Declaration in Input RTL !

!"#"$%&"'()$*+,-.'+/0.'+,+123'4"!+#5'6))78''+(**+',,'-33'4"!+#''''''''8''''9)$%:;":)$<2=8''''''>*8''''''''?9@AB;CDEAFBGCHB;F*I&J$"%KC#%:"I3L8''''''''?9@AB;CMG*+3L8''''''''?9@AB;CBGGACNFH*OCB3L8''''''''?9@AB;CGBDBCOMGDE*OCG38''''''''38''''+#P&C:":)$<C#%:"8''''''*8'''''''?9NQ*9NQ3L8'''''''?BGGA*:":C%KK$3L8'''''''?G*:":CK3L8'''''''?OF*:":CR"3L8'''''''?S*:":CT3L8'''''''?U)$%:C%KK$*&J$"%KC#%:"CU)$%::":)$<C-C%KK$3L8'''''''?U)$%:CK*&J$"%KC#%:"CU)$%::":)$<C-CK3L8'''''''?U)$%:CR"*&J$"%KC#%:"CU)$%::":)$<C-CR"3L8'''''''?U)$%:CT*&J$"%KC#%:"CU)$%::":)$<C-CT38'''''''3.8''"#K''8

!"

#"

$"

%"

&"

'"

("

)"

*"

!+"

!!"

!#"

!$"

!%"

!&"

!'"

!("

!)"

!*"

#+"

#!"

##"

#$"

#%"

#&"

#'"

#("

#)"

#*"

$+"

$!"

$#"

$$"

$%"

$&"

$'"

$("

$)"

$*"

%+"

%!"

%#"

%$"

%%"

%&"

%'"

"6P"'+(**+',,'233'4"!+#''''''''8''''9)$%:;":)$<2=8''''''>*8''''''''?9@AB;CDEAFBGCHB;F*I&J$"%KC#%:"I3L8''''''''?9@AB;CMG*+3L8''''''''?9@AB;CBGGACNFH*OCB3L8''''''''?9@AB;CGBDBCOMGDE*OCG38''''''''38''''+#P&C:":)$<C#%:"8''''''*8'''''''?9NQ*9NQ3L8'''''''?BGGA*:":C%KK$3L8'''''''?G*:":CK3L8'''''''?OF*:":CR"3L8'''''''?S*:":CT3L8'''''''?U)$%:C%KK$*&J$"%KC#%:"CU)$%::":)$<C2C%KK$3L8'''''''?U)$%:CK*&J$"%KC#%:"CU)$%::":)$<C2CK3L8'''''''?U)$%:CR"*&J$"%KC#%:"CU)$%::":)$<C2CR"3L8'''''''?U)$%:CT*&J$"%KC#%:"CU)$%::":)$<C2CT38'''''''3.8''"#K8"#K'"#K!"#"$%&"8

(b) Instantiation Declaration in the Converted RTL !

Figure 6: RTL Conversion: Instance Declaration in generate statement

!"#"$%&'()*)+,-.)/)+,-.)/)+-0)1)234567"8$%&'()*)+,-.)/)+)1)594:36;%(!$7"8$%&'()*)+,-.)/)+)1)594:36%&8!$%&'()*)+-0)1)2<467(=("#)*)>>!"#"$%&'()?)>;%(!$7"8$%&'()/)>%&8!$%&'()?)0@@@)A)-@)?)-66:BC,)*)DE:BC85CE:3>,F)%&8!$%&'(F)7"8$%&'(@6:BC+)*)DE:BC85CE:3>+F)%&8!$%&'(F)7"8$%&'(@6GHB995I)*)DE:BCDHB995I>,F)J.@66:5BK$BKK:)*),6LMC)*),66:BC,NO:<45$9E92IEGP<9Q>,F):5BK$BKK:R,/;%(!$7"8$%&'(/>%&8!$%&'(?0@F);%(!$7"8$%&'(@6:BC,NOB<4>@6GHB995INO:<45>;%(!$7"8$%&'(@66:BC+NO:<45$9E92IEGP<9Q>,F):5BK$BKK:R+/;%(!$7"8$%&'(/>%&8!$%&'(?0@F);%(!$7"8$%&'(@6:BC+NOB<4>@6GHB995INO:<45>;%(!$7"8$%&'(@66SE:)<)<9):B9Q5>7(=("#@T6)))):5BK$BKK:)R*)>;%(!$7"8$%&'(/>%&8!$%&'(?0@@)/)-6))))6))))LMC)*)GHB995IN:5BK>@6)))):BC,NO:<45$9E92IEGP<9Q>,F):5BK$BKK:F);%(!$7"8$%&'(@6)))):BC,NOB<4>@6))))GHB995INO:<45>;%(!$7"8$%&'(@66))))LMC)*)GHB995IN:5BK>@6)))):BC+NO:<45$9E92IEGP<9Q>,F):5BK$BKK:R;%(!$7"8$%&'(/>%&8!$%&'(?0@F);%(!$7"8$%&'(@6)))):BC+NOB<4>@6))))GHB995INO:<45>;%(!$7"8$%&'(@66LMC)*)GHB995IN:5BK>@6LMC)*)GHB995IN:5BK>@6U:<94>VLMC*VF)LMC@6

!"

#"

$"

%"

&"

'"

("

)"

*"

+"

#!"

##"

#$"

#%"

#&"

#'"

#("

#)"

#*"

#+"

$!"

$#"

$$"

$%"

$&"

$'"

$("

$)"

$*"

$+"

%!"

%#"

%$"

%%"

%&"

%'"

%("

Figure 8: Control Thread for Array Sum

14.6 to synthesize a bit file and to generate an AXI inter-connect.

In this study, we used array-sum illustrated in Figure.7, atiny application to measure the memory bandwidth utiliza-tion. The array-sum calculates the summation value of anarray. We used a parameterized design to vary the number ofsimultaneous computations at 1 cycle. The computing ker-nel uses two CoRAM memory instances as double-buffered.The data size for each computation is 32-bit. A data com-bination is stored in a same row on CoRAM memory. Thetotal size of accessed data on the DRAM is 128MB. The con-trol thread description in Python is very short, illustratedin Figure.8. For double-buffered, two CoRAM memory in-stances are defined. This control thread issues DMA re-quests to transfer the data from the DRAM to the on-chipCoRAM memories.

The operation frequency of computing kernels is 100 MHzin both FPGA boards. However, each AXI4 interconnec-tion to off-chip DRAM interface is optimized for each mem-ory bandwidth. For Altys board, the maximum memorybandwidth is 1.2 GB/s 1. The data width of AXI4 inter-

1Official maximum bandwidth is 1.6 GB/s. However, theoperation frequency of the memory clock is degraded to300MHz from 400MHz. Therefore the maximum bandwidth

connection is 128 bit, and the clock frequency is 100MHz.Therefore the interconnection bandwidth is 1.6 GB/s. ForML605, the maximum memory bandwidth is 6.4 GB/s. Thedata width is 256 bit, and the clock frequency is 200MHz.Therefore the interconnection bandwidth is 6.4 GB/s. TheFPGA system is controlled by a host computer via RS-232C.The execution cycle count is measured by using a hardwarecounter, implemented as a user-logic on FPGA. The band-width utilization is calculated as below.

BWUtil =TotalDataSize[B] × Freq[Hz]

Cycle[Cycle] × MaxBW [B/s]

4.2 Memory Bandwidth UtilizationWe observed the impact of the number of SIMD ways to thememory bandwidth utilization. Figure.9 shows the mea-sured bandwidth utilization on Digilent Atlys. Figure.10shows the measured bandwidth utilization on Xilinx ML605.

On Atlys board, 85.5% of the maximum memory bandwidthis utilized when the SIMD size is 16 bytes. When the SIMDsize is 32 bytes, 86.3% of the bandwidth is utilized. In thesecases, the required bandwidths are 1.6 GB/s and 3.2GB/s,respectively. They are greater than the maximum mem-ory bandwidth. The bandwidth inefficiency of 14% thoughtto be due to manipulation overhead on the control thread,interconnection latency and DRAM latency. On ML605,84.9% of the maximum memory bandwidth is utilized, whenthe SIMD size is 64 bytes. In this case, the required memorybandwidth is 6.4 GB/s, which equals the maximum memorybandwidth. As alike as Atlys, about 15% loss of bandwidthis observed.

The current PyCoRAM can achieve sufficiently memory per-formance with abstracted memory system and easy-to-usesoftware-like model. However, a certain amount of perfor-mance inefficiency is observed. In the current implemen-tation of PyCoRAM, the DMA controller cannot hide thememory access latency because the DMA controller handlesup to only one memory transaction at a time. Due to thememory intensive kernel in this evaluation, the elapsed timeof computation is smaller than the time of the data mar-shaling for each memory block, so that the memory latencyimpact directly affected the performance. In order to im-prove the memory performance, the DMA controller shouldhave an advanced microarchitecture to handle multiple out-standing requests, but it will introduce some extra hardwareoverhead.

is limited up to 1.2 GB/s.


0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

4 8 16 32

Ba

nd

wid

th U

tili

za

tio

n!

SIMD size [byte]

Figure 9: Bandwidth Utilization (Digilent Atlys)

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

4 8 16 32 64

Ba

nd

wid

th U

tili

za

tio

n!

SIMD size [byte]

Figure 10: Bandwidth Utilization (Xilinx ML605)

4.3 Resource UtilizationIn the viewpoint of resource utilization, the generated IP-core uses 50% of slices that whole the accelerator systemused. The rest 50% is used by the AXI interconnect gener-ated by the IDE and system functions for DRAM, clockingand reset. The AXI interconnect consumes 33% slices ofwhole the system. In this evaluation, it is synthesized underthe performance optimized mode. Its resource utilizationcan be reduced by using area optimized mode.

The user-logic consumes 42% of slices that the IP-core to-tally used. The control thread utilizes 12% slices of wholethe IP-core. Both the DMA controllers and the AXI inter-faces consume 37% slices of whole the IP-core. The rest isused for glue logic to connect them. These results show thatthe hardware overhead is not small, because the applicationused for this evaluation is relatively small. If the more real-istic and larger application is adopted, the ratio of hardwareoverhead will be diminished.

5. CONCLUSIONThis paper presented PyCoRAM which is yet another im-plementation of CoRAM memory architecture for modernIDEs. We evaluated the current implementation of PyCo-RAM on two actual FPGA boards. The evaluation resultshows that PyCoRAM can sufficiently utilize the memorybandwidth under the abstraction of memory and memoryaccess patterns in Python.

PyCoRAM is still under research and development. We haveto evaluate the performance and the hardware overhead inmore realistic applications. We already have implementedsome memory intensive applications, such like matrix-matrixmultiplication and stencil computation.

In order to establish a more sophisticated programming model

for FPGAs, the programmability of PyCoRAM should becompared to the traditional HDLs and the other hardwaremodeling environments. Actually CoRAM and PyCoRAMcan abstract complex memory systems by software-basedmodeling. We think, however, another easy way should beintroduced to improve the programming efficiency to con-struct a well-tuned pipeline design.

Finally, we are planning to make our PyCoRAM toolkit pub-lic in the near future for efficient development of FPGA-based computing system.

6. ACKNOWLEDGMENTThis work is supported in part by Core Research for Evolu-tional Science and Technology (CREST), JST, Japan. Wethank Michael K. Papamichael, Gabriel Weisz and Eric S.Chung for their help with the original CoRAM environment.

7. REFERENCES[1] Eric S. Chung, Peter A. Milder, James C. Hoe, and

Ken Mai. Single-chip heterogeneous computing: Doesthe future include custom logic, fpgas, and gpgpus? InProceedings of the 2010 43rd Annual IEEE/ACMInternational Symposium on Microarchitecture,MICRO ’43, pages 225–236, Washington, DC, USA,2010. IEEE Computer Society.

[2] Eric S. Chung, James C. Hoe, and Ken Mai. Coram:an in-fabric memory architecture for fpga-basedcomputing. In Proceedings of the 19th ACM/SIGDAinternational symposium on Field programmable gatearrays, FPGA ’11, pages 97–106, New York, NY,USA, 2011. ACM.

[3] Eric S. Chung, Michael K. Papamichael, GabrielWeisz, James C. Hoe, and Ken Mai. Prototype andevaluation of the coram memory architecture forfpga-based computing. In Proceedings of theACM/SIGDA international symposium on FieldProgrammable Gate Arrays, FPGA ’12, pages139–142, New York, NY, USA, 2012. ACM.

[4] Xilinx platform studio.http://www.xilinx.com/tools/xps.htm.

[5] Qsys - altera’s system integration tool.http://www.altera.com/products/software/quartus-ii/subscription-edition/qsys/qts-qsys.html.

[6] Amba open specifications.http://www.arm.com/products/system-ip/amba/amba-open-specifications.php.

[7] Ug761 axi reference guide - xilinx.http://www.xilinx.com/support/documentation/ip documentation/ug761 axi reference guide.pdf.

[8] Michael K. Papamichael and James C. Hoe. Connect:re-examining conventional wisdom for designing nocsin the context of fpgas. In Proceedings of theACM/SIGDA international symposium on FieldProgrammable Gate Arrays, FPGA ’12, pages 37–46,New York, NY, USA, 2012. ACM.

[9] E.S. Chung and M.K. Papamichael. Shrinkwrap:Compiler-enabled optimization and customization ofsoft memory interconnects. In Field-ProgrammableCustom Computing Machines (FCCM), 2013 IEEE21st Annual International Symposium on, pages113–116, 2013.

[10] Atlys spartan-6 fpga development board.http://www.digilentinc.com/Products/Detail.cfm?NavPath=2,400,836&Prod=ATLYS.

[11] Virtex-6 fpga ml605 evaluation kit.http://www.xilinx.com/products/boards-and-kits/EK-V6-ML605-G.htm.


pycoram: yet another implementation of coram memory

Documents