pycoram: yet another implementation of coram memory architecture for modern fpga-based computing...
DESCRIPTION
Presentation slide for CARL2013 (Co-located with MICRO-46) at Davis, CA. PyCoRAM: Yet Another Implementation of CoRAM Memory Architecture for Modern FPGA-based ComputingTRANSCRIPT
PyCoRAM Yet Another Implementation of CoRAM Memory
Architecture for Modern FPGA-based Computing
Shinya Takamaeda-Yamazaki†‡, Kenji Kise†, James C. Hoe*
†Tokyo Institute of Technology ‡JSPS Research Fellow
*Carnegie Mellon University
Dec 7, 2013 CARL2013@Davis, CA
Agenda
n Background n PyCoRAM Overview
n PyCoRAM Microarchitecture
n Evaluation
n Conclusion
Dec 7, 2013 Shinya T-Y. Tokyo Tech 2
Dec 7, 2013 Shinya T-Y. Tokyo Tech 3
Background
FPGA as SoC
n Put together various components on a single FPGA l CPU core
• Microblaze (Soft-macro)
• Cortex-A9 (Hard-macro)
l Hardware accelerator logic • Modeled in traditional RTL
– Verilog HDL, VHDL
• Modeled in new modeling tool – Bluespec, AutoESL, Chisel, …
l DDRx DRAM interface
l PCI-express
l Ethernet, …
Dec 7, 2013 Shinya T-Y. Tokyo Tech 4
FPGA
CPU HW Acc
DRAM I/F Ether PCI-E
Interconnect
HW Acc
Portability Issue of Application Design n How to support various FPGA platforms?
l Different logic size, memory interface, peripherals and I/O propertyL
Dec 7, 2013 Shinya T-Y. Tokyo Tech 5
Digilent Atlys (Xilinx Spartan-6 LX45)
Xilinx ML605 (Xilinx Virtex-6 LX240T)
ScalableCore System (our FPGA system) (Xilinx Spartan-6 LX16 × 128-node)
IP-core Based System Development n To build a system, add IP-cores and connect themJ
l IP-cores are connected through a standard on-chip interconnect
l EDK automatically generates an on-chip interconnection and (some) device-dependent interfaces
• No (or few) annoying steps!
Dec 7, 2013 Shinya T-Y. Tokyo Tech 6 Xilinx Platform Studio (XPS)
IP-core List
Interconnect
FPGA
CPU HW Acc
DRAM I/F Ether PCI-E
Interconnect
HW Acc
DRAM
IP-core Instances
Abstract Memory System for FPGAs n CoRAM (Connected RAM) [FPGA’11]
l High-level abstraction for memory management • Decoupling computing logics and memory access behaviors
• Memory access patterns in software model (C language)
Dec 7, 2013 Shinya T-Y. Tokyo Tech 7
HW Kernels (Computing Logics)
CoR
AM
M
emor
y
Read Write
Manage
Control Threads (Memory Access
Pattern in C)
CoRAM Channel
Read/Write Read/Write
Communication FIFOs (Registers)
Abstracted On-chip Memories
Off-chip Memory
Dec 7, 2013 Shinya T-Y. Tokyo Tech 8
What “Runs” CoRAM?
Memory Translation (TLBs)
Architecture Microarchitecture
Memory Interfaces and Caches
Fabr
ic
Clus
ter D
MA
Control thread
programs
Control Logic
FPGA SR
AM Core Logic
Fabr
ic
Clus
ter D
MA
Control Logic
Network-on-Chip
6/19/2013 CoRAM Tutorial 18
From CoRAM Tutorial @FPGA’13
High-level synthesis from C to RTL using LLVM
RTL conversion
CONNECT NoC generator
Dec 7, 2013 Shinya T-Y. Tokyo Tech 9
PyCoRAM Overview
Motivation: CoRAM for EDK n Integration of CoRAM memory architecture for modern
EDK-based development flow with standard IP-cores
Dec 7, 2013 Shinya T-Y. Tokyo Tech 10
Standard On-chip Interconnect
CoRAM Abstraction
Accelerator logic Standard IP-core
Device-dependent Interfaces
CPU core
Portable application design with CoRAM Cooperation with standard IP-cores
PyCoRAM
n Python-based implementation of CoRAM memory architecture for modern FPGA EDKs l CoRAM memory abstraction for EDK development flow
n Key features l Control Thread in Python
• We developed Python-to-Verilog HLS Compiler from scratch
l AMBA AXI4 Interconnect for on-chip interconnect • For IP-core based development on Xilinx Platform Studio (XPS)
l Parameterized RTL Design Support for User-logic • Generate-statement and Parameter-statement analyzed by our
original Verilog analysis tool-chain (Pyverilog)
Dec 7, 2013 Shinya T-Y. Tokyo Tech 11
Comparison with Original CoRAM CoRAM PyCoRAM
Language for Control-Thread C Python
Supported Memory Operations
(Blocking/Non-Blocking) Read/Write
(Blocking/Non-Blocking) Read/Write
On-chip Interconnect CONNECT NoC [FPGA’12] AMBA AXI4
FSM Granularity in Control Thread LLVM-IR Python AST Node
Generate Statement Support for User logics No Yes
Supported FPGAs Xilinx ML605 Altera Terasic DE-4
Any FPGAs supporting AXI Bus
# Lines of Code 11,682 lines (w/o CONNECT)
4,922 lines (w/o Pyverilog)
Dec 7, 2013 Shinya T-Y. Tokyo Tech 12
FSM: Finite State Machine LLVM-IR: Low Level Virtual Machine Intermediate Representation AST: Abstract Syntax Tree
PyCoRAM Development Flow n PyCoRAM generates an IP-core package from user-logic
RTLs and control thread scripts in Python l Each part can be replaced with the original CoRAM’s component
Dec 7, 2013 Shinya T-Y. Tokyo Tech 13
User-logic (Verilog HDL)
Control Threads (Python)
Logic Hierarchy Analysis
Python-to-Verilog
Compilation
Control Signal
Insertion IP-core Packing
(RTL, .mpd, and
.pao)
IP-core Integration
on EDK
Synthesis Control
Signal Port Addition
FPGA Bit File
Portable Application
Design
PyCoRAM Tool-chain Vendor EDA Flow Python-to-Verilog HLS
RTL Conversion IP-core
generation with AXI4 Interface
Top design synthesis with
AXI4
FPGA Accelerator with PyCoRAM IP-core
Dec 7, 2013 Shinya T-Y. Tokyo Tech 14
Control Thread
FSM
AXI4 Interconnect
DRAM Controller
DRAM (Off-chip)
DMAC
AXI I/F
CoRAM Memory
CoRAM Memory
AXI I/F
CoRAM Memory
CoRAM Memory
DMAC DMAC
AXI I/F
CoRAM Stream
DMAC
AXI I/F
CoRAM Stream
CoRAM Channel
HW Kernels (Computing Logics)
DMA Cluster
DMA Cluster
FPGA
Other AXI
IP-core or
CPU
PyCoRAM IP (Application)
Dec 7, 2013 Shinya T-Y. Tokyo Tech 15
PyCoRAM Microarchitecture
PyCoRAM Microarchitecture (Logical View)
Dec 7, 2013 Shinya T-Y. Tokyo Tech 16
User I/O
User Logic
CoRAM Channel
CoRAM Register
Control Thread
DMAC
CoRAM Memory
DMAC
CoRAM Stream FSM
GPIO
PyCoRAM Microarchitecture (Logical View)
Dec 7, 2013 Shinya T-Y. Tokyo Tech 17
User I/O
User Logic
CoRAM Channel
CoRAM Register
Control Thread
DMAC
CoRAM Memory
DMAC
CoRAM Stream FSM
GPIO Modeled in RTL (Verilog HDL)
Memory Access Pattern
in Python
PyCoRAM Microarchitecture (Physical View)
Dec 7, 2013 Shinya T-Y. Tokyo Tech 18
PyCoRAM IP
AXI4 Interconnect
DRAM Controller FPGA
User I/O
User Logic
CoRAM Channel
CoRAM Register
Control Thread
DMAC
AXI I/F
CoRAM Memory
DMAC
AXI I/F
CoRAM Stream FSM
GPIO
PyCoRAM Microarchitecture (Physical View)
Dec 7, 2013 Shinya T-Y. Tokyo Tech 19
PyCoRAM IP
AXI4 Interconnect
DRAM Controller FPGA
User I/O
User Logic
CoRAM Channel
CoRAM Register
Control Thread
DMAC
AXI I/F
CoRAM Memory
DMAC
AXI I/F
CoRAM Stream FSM
GPIO
AXI4 Master Interface
Control Thread in Python
Parameterized RTL Design Support
Control Thread in Python n Operations for CoRAM objects
l To/from CoRAM Memory • Data movement pattern with DMA operations
between on-chip CoRAM memory and DRAM
l To/from CoRAM Channel • Token communication action
between user-logic and control thread
Dec 7, 2013 Shinya T-Y. Tokyo Tech 20
def calc_sum(times):� ram = CoramMemory(idx=0, datawidth=32, size=1024)� channel = CoramChannel(idx=0, datawidth=32)� addr = 0� sum = 0� for i in range(times):� ram.write(0, addr, 128)� channel.write(addr)� sum += channel.read()� addr += 128 * (32/8)� print(‘sum=’, sum)�calc_sum(8)�
# Transfer (off-chip DRAM to BRAM) # Notification to User-logic # Wait for Notification from User-logic # $display Verilog system task �
User I/O
User Logic
CoRAM Channel
Control Thread
DMAC
CoRAM Memory FSM
0�1�2�3�4�5�6�7�8�9�10�11�
CoRAM objects in User Logic n CoRAM objects as standard BRAM or FIFO
l Very similar interface to the standard memory components
l User-logic can use their contents in them in the same way
n Essential parameters to define object characteristics l Thread name, ID, data width, address length, …
Dec 7, 2013 Shinya T-Y. Tokyo Tech 21
CoramMemory1P�#(� .CORAM_THREAD_NAME("thread_name"),� .CORAM_ID(0),� .CORAM_ADDR_LEN(ADDR_LEN),� .CORAM_DATA_WIDTH(DATA_WIDTH)� )�inst_memory�(.CLK(CLK),� .ADDR(mem_addr),� .D(mem_d),� .WE(mem_we),� .Q(mem_q)� );�
(a) CoRAM Memory
CoramChannel�#(� .CORAM_THREAD_NAME("thread_name"),� .CORAM_ID(0),� .CORAM_ADDR_LEN(CHANNEL_ADDR_LEN),� .CORAM_DATA_WIDTH(CHANNEL_DATA_WIDTH)� )�inst_channel�(.CLK(CLK),� .RST(RST),� .D(comm_d),� .ENQ(comm_enq),� .FULL(comm_full),� .Q(comm_q),� .DEQ(comm_deq),� .EMPTY(comm_empty)� );�
(b) CoRAM Channel
AXI4 Master Interface n DMA controller works as AXI4 master IP-core interface
Dec 7, 2013 Shinya T-Y. Tokyo Tech 22
DMA Controller
・・・
DramAddr BramAddr
Size WrEn RdEn Busy
Ready
Control Thread
FSM
Deq
ue
Em
pty
RdD
ata
Alm
Full
Enq
ue
WrD
ata
Rea
dy
RdE
n
RdE
n
Siz
e
Add
r
AXI Master Interface (Protocol Conversion)
AXI4 Interconnect
WR
EA
DY
WVA
LID
WD
ATA
BVA
LID
AWR
EA
DY
AWVA
LID
AWLE
N
AWA
DD
R
RVA
LID
RR
EA
DY
RD
ATA
AR
RE
AD
Y
AR
VALI
D
AR
LEN
AR
AD
DR
Write Address Channel
Write Data Channel
Read Address Channel
Read Data Channel
HW Kernels (Computing Logic)
Add
r
WrD
ata
RdD
ata
WrE
n
CoRAM Memory (BRAM)
Add
r
WrD
ata
RdD
ata
WrE
n
Add
r
WrD
ata
RdD
ata
WrE
n
CoRAM Memory (BRAM)
Add
r
WrD
ata
RdD
ata
WrE
n
CoRAM Channel
Deq
ue
Em
pty
RdD
ata
Alm
Full
Enq
ue
WrD
ata
Deq
ue
Em
pty
RdD
ata
Alm
Full
Enq
ue
WrD
ata
・・・ DMA Cluster
For Parameterized RTL design support
Dec 7, 2013 Shinya T-Y. Tokyo Tech 23
Dataflow
State Machine
n Generate-statement support by advanced RTL analyzer l Not supported by the original CoRAM compiler
n Pyverilog: Python-based Tool-chain for Verilog HDL Design l Parser
l Dataflow Analysis
l Optimization
l RTL Code Generation
l Control flow Analysis
l Graphical Output
Dec 7, 2013 Shinya T-Y. Tokyo Tech 24
Evaluation
Evaluation n Point: Maximum memory bandwidth utilization
l PyCoRAM is a memory abstraction framework
n Setup l 2 FPGA boards
• Digilent Atlys – Spartan-6 LX45
– DDR2-800 DRAM 128MB (1.2GB/s*)
– AXI4 128-bit, 100MHz (1.6GB/s)
• Xilinx ML605 – Virtex-6 LX240T
– DDR3-800 DRAM 512MB (6.4GB/s)
– AXI4 256-bit, 200MHz (6.4GB/s)
l EDK • Xilinx Platform Studio (14.6)
Dec 7, 2013 Shinya T-Y. Tokyo Tech 25
Xilinx ML605 (Xilinx Virtex-6 LX240T)
Digilent Atlys (Xilinx Spartan-6 LX45)
*Due to 300MHz operation
Evaluation: Application n Array-sum: calculate summation value of an array
l Two CoRAM memories as Double-buffered
l Varying SIMD width (=# simultaneous ops) to check the effect to the memory bandwidth utilization
• 4, 8, 16, 32, 64 (bytes)
Dec 7, 2013 Shinya T-Y. Tokyo Tech 26
s3
+
s2
+
s1
+
s0
+
MUX
D[0]
D[1]
D[2]
D[3]
D[0]
D[1]
D[2]
D[3]
sum
+
MUX
D[3] D[2] D[1] D[0]
Output
CoRAM Memory
0
CoRAM Memory
1 from DMA Controller 0 from DMA Controller 1
Memory Bandwidth Utilization n Good bandwidth utilization
l Atlys: 85.5% (at 16-byte)
l ML605: 84.9% (at 64-byte)
n Degradation reasons l Sequential (single) transaction for each DMA controller
• Memory latency directly affects the performance adversely
Dec 7, 2013 Shinya T-Y. Tokyo Tech 27
0
0.2
0.4
0.6
0.8
1
4 8 16 32 64
Ban
dwid
th U
tiliz
atio
n
SIMD size [byte]
0
0.2
0.4
0.6
0.8
1
4 8 16 32
Ban
dwid
th U
tiliz
atio
n
SIMD size [byte]
Atlys (Spartan-6) ML605 (Virtex-6)
Dec 7, 2013 Shinya T-Y. Tokyo Tech 28
Conclusion and …
Conclusion n PyCoRAM: Python-based implementation of CoRAM
memory architecture for modern FPGA EDKs
n Future work l Further evaluation on more realistic applications
l AXI4 slave feature for control thread
l Tutorial slideJ
Dec 7, 2013 Shinya T-Y. Tokyo Tech 29
Standard On-chip Interconnect
CoRAM Abstraction
Accelerator logic Standard IP-core
Device-dependent Interfaces
CPU core
Automatically managed by EDK
Portable application design with CoRAM Cooperation with standard IP-cores
PyCoRAM and Pyverilog are ready for public!
n PyCoRAM (0.7.0-public) l https://github.com/shtaxxx/PyCoRAM
n Pyverilog (0.6.0-public) l https://github.com/shtaxxx/Pyverilog
Dec 7, 2013 Shinya T-Y. Tokyo Tech 30
Thanks!