![Page 1: Dr. Philip Brisk Department of Computer Science and Engineering](https://reader036.vdocuments.net/reader036/viewer/2022062501/5681668d550346895dda5bc6/html5/thumbnails/1.jpg)
FPGA Applications
IEEE Micro Special Issue on Reconfigurable Computing
Vol. 34, No. 1, Jan.-Feb. 2014
Dr. Philip Brisk
Department of Computer Science and Engineering
University of California, Riverside
CS 223
![Page 2: Dr. Philip Brisk Department of Computer Science and Engineering](https://reader036.vdocuments.net/reader036/viewer/2022062501/5681668d550346895dda5bc6/html5/thumbnails/2.jpg)
Guest Editors
Walid Najjar, UCR
Paolo Ienne, EPFL, Lausanne, Switzerland
![Page 3: Dr. Philip Brisk Department of Computer Science and Engineering](https://reader036.vdocuments.net/reader036/viewer/2022062501/5681668d550346895dda5bc6/html5/thumbnails/3.jpg)
High-Speed Packet Processing Using Reconfigurable Computing
Gordon Brebner and Weirong Jiang
Xilinx, Inc.
![Page 4: Dr. Philip Brisk Department of Computer Science and Engineering](https://reader036.vdocuments.net/reader036/viewer/2022062501/5681668d550346895dda5bc6/html5/thumbnails/4.jpg)
Contributions
• PX: a domain-specific language for packet processing
• PX-to-FPGA compiler
• Evaluation of PX-designed high-performance reconfigurable computing architectures
• Dynamic reprogramming of systems during live packet processing
• Demonstrated implementations running at 100 Gbps and higher rates
![Page 5: Dr. Philip Brisk Department of Computer Science and Engineering](https://reader036.vdocuments.net/reader036/viewer/2022062501/5681668d550346895dda5bc6/html5/thumbnails/5.jpg)
PX Overview
• Object-oriented semantics
  – Packet processing described as component objects
  – Communication between objects
• Engine
  – Core packet processing functions
    • Parsing, editing, lookup, encryption, pattern matching, etc.
• System
  – Collection of communicating engines and/or subsystems
    • Parsing, editing, lookup, encryption, pattern matching, etc.
![Page 6: Dr. Philip Brisk Department of Computer Science and Engineering](https://reader036.vdocuments.net/reader036/viewer/2022062501/5681668d550346895dda5bc6/html5/thumbnails/6.jpg)
Interface Objects
• Packet
  – Communication of packets between components
• Tuple
  – Communication of non-packet data between components
![Page 7: Dr. Philip Brisk Department of Computer Science and Engineering](https://reader036.vdocuments.net/reader036/viewer/2022062501/5681668d550346895dda5bc6/html5/thumbnails/7.jpg)
OpenFlow Packet Classification in PX
Send packet to parser engine
![Page 8: Dr. Philip Brisk Department of Computer Science and Engineering](https://reader036.vdocuments.net/reader036/viewer/2022062501/5681668d550346895dda5bc6/html5/thumbnails/8.jpg)
OpenFlow Packet Classification in PX
Parser engine extracts a tuple from the packet
Send the tuple to the lookup engine for classification
![Page 9: Dr. Philip Brisk Department of Computer Science and Engineering](https://reader036.vdocuments.net/reader036/viewer/2022062501/5681668d550346895dda5bc6/html5/thumbnails/9.jpg)
OpenFlow Packet Classification in PX
Obtain the classification response from the lookup engine
Forward the response to the flowstream output interface
![Page 10: Dr. Philip Brisk Department of Computer Science and Engineering](https://reader036.vdocuments.net/reader036/viewer/2022062501/5681668d550346895dda5bc6/html5/thumbnails/10.jpg)
OpenFlow Packet Classification in PX
Forward the packet (without modification) to the outstream output interface
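The four steps annotated on the preceding slides can be sketched in plain Python. This is not PX syntax: the names `FlowTuple`, `parser_engine`, `lookup_engine`, and `classify`, and the dictionary-based packet, are hypothetical stand-ins chosen only to show the data flow between a parser engine, a lookup engine, and the two output interfaces.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlowTuple:
    """Non-packet data handed from the parser engine to the lookup engine."""
    src: str
    dst: str
    port: int


def parser_engine(packet):
    """Extract a tuple from the packet; the packet itself is left untouched."""
    return FlowTuple(packet["src"], packet["dst"], packet["port"])


def lookup_engine(flow, table, default=0):
    """Classify the tuple; `table` maps tuples to flow identifiers (assumed)."""
    return table.get(flow, default)


def classify(packet, table):
    """Return (response for the flowstream output, packet for the outstream output)."""
    flow = parser_engine(packet)            # parser engine extracts a tuple
    response = lookup_engine(flow, table)   # lookup engine classifies the tuple
    return response, packet                 # the packet is forwarded unmodified
```

A caller would build `table` once and invoke `classify` per packet; in the generated hardware these stages would run as a pipeline rather than as sequential function calls.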
![Page 11: Dr. Philip Brisk Department of Computer Science and Engineering](https://reader036.vdocuments.net/reader036/viewer/2022062501/5681668d550346895dda5bc6/html5/thumbnails/11.jpg)
PX Compilation Flow
100 Gbps: 512-bit datapath
10 Gbps: 64-bit datapath
Faster to reconfigure the generated architecture than the FPGA itself (not always applicable)
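As a rough check on the datapath widths above, the minimum clock frequency needed to sustain a line rate is the rate divided by the datapath width. The sketch below assumes one full datapath word is consumed every cycle and ignores framing overhead and partially filled words; the function name is hypothetical.

```python
def required_clock_mhz(line_rate_gbps, datapath_bits):
    """Minimum clock (MHz) to move `line_rate_gbps` through a datapath of
    `datapath_bits` bits, assuming one full word per clock cycle."""
    bits_per_second = line_rate_gbps * 1e9
    return bits_per_second / datapath_bits / 1e6


clock_100g = required_clock_mhz(100, 512)   # about 195.3 MHz
clock_10g = required_clock_mhz(10, 64)      # about 156.25 MHz
```

Both results sit in a range that FPGA fabric can plausibly reach, which is consistent with widening the datapath rather than raising the clock to scale from 10 Gbps to 100 Gbps.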
![Page 12: Dr. Philip Brisk Department of Computer Science and Engineering](https://reader036.vdocuments.net/reader036/viewer/2022062501/5681668d550346895dda5bc6/html5/thumbnails/12.jpg)
OpenFlow Packet Parser (4 Stages)
Allowable Packet Structures:
(Ethernet, VLAN, IP, TCP/UDP)
(Ethernet, IP, TCP/UDP)
Stage 1: Ethernet
Stage 2: VLAN or IP
Stage 3: IP or TCP/UDP
Stage 4: TCP/UDP or bypass
![Page 13: Dr. Philip Brisk Department of Computer Science and Engineering](https://reader036.vdocuments.net/reader036/viewer/2022062501/5681668d550346895dda5bc6/html5/thumbnails/13.jpg)
OpenFlow Packet Parser
• Max. packet size
• Max. number of stacked sections
• Structure of the tuple
• I/O interface
• Ethernet header expected first
• Determine the type of the next section of the packet
• Determine how far to go in the packet to reach the next section
• Set relevant members in the tuple being populated
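The per-section pattern in these annotations (decide the next section type, compute the offset to it, populate tuple members) can be illustrated with a small software parser. This is a sketch in Python, not the PX description from the slide; the function name `parse` and the tuple field names are assumptions. The header constants (EtherType 0x8100 for VLAN, 0x0800 for IPv4, protocol 6 for TCP, 17 for UDP) are the standard values.

```python
import struct

ETH_VLAN, ETH_IPV4 = 0x8100, 0x0800
PROTO_TCP, PROTO_UDP = 6, 17


def parse(packet: bytes) -> dict:
    """Walk Ethernet [, VLAN], IPv4, TCP/UDP and return the extracted tuple."""
    tup = {}
    # Stage 1: the Ethernet header is expected first.
    tup["dst_mac"], tup["src_mac"] = packet[0:6], packet[6:12]
    (ethertype,) = struct.unpack_from("!H", packet, 12)
    offset = 14                                   # distance to the next section
    # Stage 2: optional VLAN tag; its last two bytes give the next type.
    if ethertype == ETH_VLAN:
        tci, ethertype = struct.unpack_from("!HH", packet, offset)
        tup["vlan_id"] = tci & 0x0FFF
        offset += 4
    if ethertype != ETH_IPV4:
        return tup                                # unsupported structure
    # Stage 2 or 3: IPv4; the header length field tells us how far to skip.
    header_len = (packet[offset] & 0x0F) * 4
    tup["ip_proto"] = packet[offset + 9]
    tup["src_ip"] = packet[offset + 12:offset + 16]
    tup["dst_ip"] = packet[offset + 16:offset + 20]
    offset += header_len
    # Stage 3 or 4: TCP and UDP both begin with source and destination ports.
    if tup["ip_proto"] in (PROTO_TCP, PROTO_UDP):
        tup["src_port"], tup["dst_port"] = struct.unpack_from("!HH", packet, offset)
    return tup
```

In hardware each stage would be a pipeline unit operating on a wide data word; the sequential code only mirrors the order of decisions.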
![Page 14: Dr. Philip Brisk Department of Computer Science and Engineering](https://reader036.vdocuments.net/reader036/viewer/2022062501/5681668d550346895dda5bc6/html5/thumbnails/14.jpg)
OpenFlow Packet Parser
![Page 15: Dr. Philip Brisk Department of Computer Science and Engineering](https://reader036.vdocuments.net/reader036/viewer/2022062501/5681668d550346895dda5bc6/html5/thumbnails/15.jpg)
Three-stage Parser Pipeline
• Internal units are customized based on PX requirements
• Units are firmware-controlled
  – Specific actions can be altered (within reason) without reconfiguring the FPGA
  – e.g., add or remove section classes handled at that stage
![Page 16: Dr. Philip Brisk Department of Computer Science and Engineering](https://reader036.vdocuments.net/reader036/viewer/2022062501/5681668d550346895dda5bc6/html5/thumbnails/16.jpg)
OpenFlow Packet Parser Results
Adjust throughput for wasted bits at the end of packets
![Page 17: Dr. Philip Brisk Department of Computer Science and Engineering](https://reader036.vdocuments.net/reader036/viewer/2022062501/5681668d550346895dda5bc6/html5/thumbnails/17.jpg)
Ternary Content Addressable Memory (TCAM)
X = Don’t Care
http://thenetworksherpa.com/wp-content/uploads/2012/07/TCAM_2.png
TCAM width and depth are configurable in PX
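A software model makes the ternary match concrete. The sketch below is illustrative only: it assumes entries are stored as strings over `0`, `1`, and `X`, and that the lowest-index matching entry wins (a common TCAM priority convention, not something stated on the slide). A real TCAM compares all entries in parallel; the loop only models the result.

```python
def tcam_lookup(entries, key):
    """Return (index, result) of the first entry whose pattern matches `key`.

    `entries` is a list of (pattern, result) pairs; every pattern has the same
    width as `key`, and an 'X' in a pattern matches either bit value.
    """
    for index, (pattern, result) in enumerate(entries):
        if len(pattern) == len(key) and all(
            p == "X" or p == k for p, k in zip(pattern, key)
        ):
            return index, result
    return None  # no entry matched


entries = [
    ("1010", "port A"),   # exact match
    ("10XX", "port B"),   # matches any key starting with 10
    ("XXXX", "default"),  # catch-all
]
```

Here the depth is the number of entries (3) and the width is the pattern length (4), the two parameters the slide says are configurable in PX.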
![Page 18: Dr. Philip Brisk Department of Computer Science and Engineering](https://reader036.vdocuments.net/reader036/viewer/2022062501/5681668d550346895dda5bc6/html5/thumbnails/18.jpg)
TCAM Implementation in PX
depth, key length, result bitwidth
The parser (previous example) extracted the tuple
Set up TCAM access
Collect the result
TCAM architecture is generated automatically as described by one of the authors’ previous papers
![Page 19: Dr. Philip Brisk Department of Computer Science and Engineering](https://reader036.vdocuments.net/reader036/viewer/2022062501/5681668d550346895dda5bc6/html5/thumbnails/19.jpg)
TCAM Architecture
![Page 20: Dr. Philip Brisk Department of Computer Science and Engineering](https://reader036.vdocuments.net/reader036/viewer/2022062501/5681668d550346895dda5bc6/html5/thumbnails/20.jpg)
TCAM Parameterization
• PX Description
  – Depth (N)
  – Width
  – Result width
• Operational Properties
  – Number of rows (R)
  – Units per row (L)
  – Internal pipeline stages per unit (H)
• Performance
  – Each unit handles N/(LR) TCAM entries
  – Lookup latency is LH + 2 clock cycles
    • LH to process the row
    • 1 cycle for priority encoding
    • 1 cycle for registered output
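The two performance expressions can be evaluated directly. The helper below is a small worked example; the function name and the sample parameter values are hypothetical, while the formulas N/(LR) and LH + 2 come from the slide.

```python
def tcam_properties(depth_n, rows_r, units_per_row_l, stages_per_unit_h):
    """Return (share of the depth handled by each unit, lookup latency in cycles)."""
    per_unit = depth_n / (rows_r * units_per_row_l)        # N / (L * R)
    latency_cycles = units_per_row_l * stages_per_unit_h + 2  # L * H + 2
    return per_unit, latency_cycles


# Example: N = 1024, R = 4 rows, L = 8 units per row, H = 2 stages per unit.
per_unit, latency = tcam_properties(1024, 4, 8, 2)   # (32.0, 18)
```

Increasing R shrinks the work per unit without adding latency, whereas increasing L or H lengthens the row pipeline and therefore the lookup latency.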
![Page 21: Dr. Philip Brisk Department of Computer Science and Engineering](https://reader036.vdocuments.net/reader036/viewer/2022062501/5681668d550346895dda5bc6/html5/thumbnails/21.jpg)
Results
![Page 22: Dr. Philip Brisk Department of Computer Science and Engineering](https://reader036.vdocuments.net/reader036/viewer/2022062501/5681668d550346895dda5bc6/html5/thumbnails/22.jpg)
Database Analytics: A Reconfigurable Computing Approach
Bharat Sukhwani, Hong Min, Mathew Thoennes, Parijat Dube, Bernard Brezzo, Sameh Asaad,
and Donna Eng. Dillenberger
IBM T.J. Watson Research Center
![Page 23: Dr. Philip Brisk Department of Computer Science and Engineering](https://reader036.vdocuments.net/reader036/viewer/2022062501/5681668d550346895dda5bc6/html5/thumbnails/23.jpg)
Example: SQL Query
![Page 24: Dr. Philip Brisk Department of Computer Science and Engineering](https://reader036.vdocuments.net/reader036/viewer/2022062501/5681668d550346895dda5bc6/html5/thumbnails/24.jpg)
Online Transaction Processing (OLTP)
• Rows are compressed for storage and I/O savings
• Rows are decompressed when issuing queries
• Data pages are cached in a dedicated memory space called the buffer pool
• I/O operations between buffer pool and disk are transparent
• Data in the buffer pool is always up-to-date
![Page 25: Dr. Philip Brisk Department of Computer Science and Engineering](https://reader036.vdocuments.net/reader036/viewer/2022062501/5681668d550346895dda5bc6/html5/thumbnails/25.jpg)
Table Traversal
• Indexing
  – Efficient for locating a small number of records
• Scanning
  – Sift through the whole table
  – Used when a large number of records match the search criteria
![Page 26: Dr. Philip Brisk Department of Computer Science and Engineering](https://reader036.vdocuments.net/reader036/viewer/2022062501/5681668d550346895dda5bc6/html5/thumbnails/26.jpg)
FPGA-based Analytics Accelerator
![Page 27: Dr. Philip Brisk Department of Computer Science and Engineering](https://reader036.vdocuments.net/reader036/viewer/2022062501/5681668d550346895dda5bc6/html5/thumbnails/27.jpg)
Workflow
• DBMS issues a command to the FPGA
  – Query specification and pointers to data
• FPGA
  – Pulls pages from main memory
  – Parses pages to extract rows
  – Queries the rows
  – Writes qualifying rows back to main memory in database-formatted pages
![Page 28: Dr. Philip Brisk Department of Computer Science and Engineering](https://reader036.vdocuments.net/reader036/viewer/2022062501/5681668d550346895dda5bc6/html5/thumbnails/28.jpg)
FPGA Query Processing
• Join and sort operations are not streaming
  – Data re-use is required
  – FPGA block RAM storage is limited
  – Perform predicate evaluation and projection before join and sort
    • Eliminate disqualified rows
    • Eliminate unneeded columns
![Page 29: Dr. Philip Brisk Department of Computer Science and Engineering](https://reader036.vdocuments.net/reader036/viewer/2022062501/5681668d550346895dda5bc6/html5/thumbnails/29.jpg)
Where is the Parallelism?
• Multiple tiles process DB pages in parallel
  – Concurrently evaluate multiple records from a page within a tile
    • Concurrently evaluate multiple predicates against different columns within a row
![Page 30: Dr. Philip Brisk Department of Computer Science and Engineering](https://reader036.vdocuments.net/reader036/viewer/2022062501/5681668d550346895dda5bc6/html5/thumbnails/30.jpg)
Predicate Evaluation
Stored predicate values
Logical Operations (Depends on query)
#PEs and reduction network size are configurable at synthesis time
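A software sketch of this structure follows: each predicate plays the role of one PE comparing a column against a stored value, and a pairwise reduction combines the PE outputs. It is only a model under stated assumptions: a single logical operator is used for the whole reduction, at least one predicate is supplied, and the names `evaluate_row`, `reduce_tree`, and `scan` are hypothetical. The `scan` helper also applies projection, in line with the earlier point that disqualified rows and unneeded columns are eliminated before join and sort.

```python
import operator

COMPARATORS = {"<": operator.lt, "<=": operator.le, "==": operator.eq,
               "!=": operator.ne, ">=": operator.ge, ">": operator.gt}


def reduce_tree(bits, combine):
    """Combine PE outputs pairwise, level by level, like a reduction network."""
    while len(bits) > 1:
        paired = [combine(bits[i], bits[i + 1]) for i in range(0, len(bits) - 1, 2)]
        if len(bits) % 2:
            paired.append(bits[-1])   # odd element passes through to the next level
        bits = paired
    return bits[0]


def evaluate_row(row, predicates, combine=operator.and_):
    """Each (column, op, value) predicate models one PE; at least one is required."""
    bits = [COMPARATORS[op](row[col], value) for col, op, value in predicates]
    return reduce_tree(bits, combine)


def scan(rows, predicates, columns, combine=operator.and_):
    """Keep qualifying rows and project them onto the requested columns."""
    return [{c: row[c] for c in columns}
            for row in rows if evaluate_row(row, predicates, combine)]
```

In the accelerator the PEs evaluate concurrently; the list comprehension here only reproduces the result, not the timing.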
![Page 31: Dr. Philip Brisk Department of Computer Science and Engineering](https://reader036.vdocuments.net/reader036/viewer/2022062501/5681668d550346895dda5bc6/html5/thumbnails/31.jpg)
Two-Phase Hash-Join
• Stream the smaller join table through the FPGA
• Hash the join columns to populate a bit vector
• Store the full rows in off-chip DRAM
• Join columns and row addresses are stored in the address table (BRAM)
• Rows that hash to the same position are chained in the address table
• Stream the second table through the FPGA
• Hash rows to probe the bit vector (eliminate non-matching rows)
• Matches issue reads from off-chip DRAM
• Reduces off-chip accesses and stalls
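The build and probe phases can be modeled in a few lines. This is a simplified sketch, not the authors' design: Python's built-in `hash` stands in for the hardware hash function, a list models off-chip DRAM, a dictionary of chains models the BRAM address table, and all function and variable names are hypothetical.

```python
def build(small_table, key, num_bits):
    """Phase 1: stream the smaller table, filling the bit vector and address table."""
    bit_vector = [False] * num_bits
    dram = []            # full rows, modeling off-chip storage
    address_table = {}   # hash position -> chain of (join key, row address)
    for row in small_table:
        address = len(dram)
        dram.append(row)
        position = hash(row[key]) % num_bits
        bit_vector[position] = True
        address_table.setdefault(position, []).append((row[key], address))
    return bit_vector, address_table, dram


def probe(large_table, key, state):
    """Phase 2: stream the second table; only bit-vector hits touch `dram`."""
    bit_vector, address_table, dram = state
    joined, dram_reads = [], 0
    for row in large_table:
        position = hash(row[key]) % len(bit_vector)
        if not bit_vector[position]:
            continue                      # eliminated without any off-chip access
        for join_key, address in address_table[position]:
            if join_key == row[key]:      # resolve hash collisions on the real key
                dram_reads += 1
                joined.append({**dram[address], **row})
    return joined, dram_reads
```

The `dram_reads` counter shows the benefit claimed on the slide: rows of the second table that miss in the bit vector never trigger an off-chip read.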
![Page 32: Dr. Philip Brisk Department of Computer Science and Engineering](https://reader036.vdocuments.net/reader036/viewer/2022062501/5681668d550346895dda5bc6/html5/thumbnails/32.jpg)
Database Sort
• Support long sort keys (tens of bytes)
• Handle large payloads (rows)
• Generate large sorted batches (millions of records)
• Coloring bins keys into sorted batches
https://en.wikipedia.org/wiki/Tournament_sort
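One common way to realize "coloring" with a tournament-style sorter is replacement selection: each key carries a batch number (its color), and a key that arrives too late for the current batch is colored for the next one. The sketch below uses a binary heap in place of a hardware tournament tree and is an assumption about the general technique, not a description of the authors' implementation; `sorted_batches` and `capacity` are hypothetical names.

```python
import heapq


def sorted_batches(keys, capacity):
    """Split `keys` into sorted batches using a buffer of `capacity` (>= 1) keys."""
    stream = iter(keys)
    heap = []
    for key in stream:                   # fill the buffer; everything starts in color 0
        heap.append((0, key))
        if len(heap) == capacity:
            break
    heapq.heapify(heap)
    batches = []
    end = object()                       # sentinel marking the end of the input
    while heap:
        color, key = heapq.heappop(heap)     # smallest key of the lowest color
        if color == len(batches):
            batches.append([])               # first key of a new batch
        batches[color].append(key)
        incoming = next(stream, end)
        if incoming is not end:
            # A key smaller than the one just emitted cannot join this batch.
            next_color = color if incoming >= key else color + 1
            heapq.heappush(heap, (next_color, incoming))
    return batches
```

For the input `[5, 1, 9, 3, 7, 2, 8, 6, 4]` with a three-key buffer this yields the batches `[1, 3, 5, 7, 8, 9]` and `[2, 4, 6]`, so a batch can be longer than the buffer itself.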
![Page 33: Dr. Philip Brisk Department of Computer Science and Engineering](https://reader036.vdocuments.net/reader036/viewer/2022062501/5681668d550346895dda5bc6/html5/thumbnails/33.jpg)
CPU Savings
Predicate Eval.
Decompression + Predicate Eval.
![Page 34: Dr. Philip Brisk Department of Computer Science and Engineering](https://reader036.vdocuments.net/reader036/viewer/2022062501/5681668d550346895dda5bc6/html5/thumbnails/34.jpg)
Throughput and FPGA Speedup
![Page 35: Dr. Philip Brisk Department of Computer Science and Engineering](https://reader036.vdocuments.net/reader036/viewer/2022062501/5681668d550346895dda5bc6/html5/thumbnails/35.jpg)
Scaling Reverse Time Migration Performance Through Reconfigurable Dataflow Engines
Haohuan Fu¹, Lin Gan¹, Robert G. Clapp², Huabin Ruan¹, Oliver Pell³, Oskar Mencer³, Michael Flynn², Xiaomeng Huang¹, and Guangwen Yang¹
¹Tsinghua University
²Stanford University
³Maxeler Technologies
![Page 36: Dr. Philip Brisk Department of Computer Science and Engineering](https://reader036.vdocuments.net/reader036/viewer/2022062501/5681668d550346895dda5bc6/html5/thumbnails/36.jpg)
Migration (Geology)
https://upload.wikimedia.org/wikipedia/commons/3/38/GraphicalMigration.jpg
![Page 37: Dr. Philip Brisk Department of Computer Science and Engineering](https://reader036.vdocuments.net/reader036/viewer/2022062501/5681668d550346895dda5bc6/html5/thumbnails/37.jpg)
Reverse Time Migration (RTM)
• Imaging algorithm• Used for oil and gas exploration• Computationally demanding
![Page 38: Dr. Philip Brisk Department of Computer Science and Engineering](https://reader036.vdocuments.net/reader036/viewer/2022062501/5681668d550346895dda5bc6/html5/thumbnails/38.jpg)
RTM Pseudocode
• Iterations over shots (sources) are independent and easy to parallelize
• Forward pass: iterate over time steps and 3D grids
  – Propagate source wave fields from time 0 to nt - 1
  – Add the recorded source signal to the corresponding location
  – Boundary conditions
• Backward pass: iterate over time steps and 3D grids
  – Propagate receiver wave fields from time nt - 1 to 0
  – Add the recorded receiver signal to the corresponding location
  – Boundary conditions
  – Cross-correlate the source and receiver wave fields at the same time step to accumulate the result
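The structure described by these annotations can be shown with a deliberately tiny model. The sketch below is a 1D toy with a 3-point stencil and fixed zero boundaries, whereas the slides concern 3D grids with far larger stencils; it stores every source wave field in memory, which is exactly what becomes infeasible at scale. All names and the constant `c2` are assumptions made for illustration.

```python
def step(prev, curr, c2):
    """One time step of a 1D wave equation; the two end points stay at zero."""
    nxt = [0.0] * len(curr)
    for i in range(1, len(curr) - 1):
        laplacian = curr[i - 1] - 2.0 * curr[i] + curr[i + 1]
        nxt[i] = 2.0 * curr[i] - prev[i] + c2 * laplacian
    return nxt


def rtm_shot(n, nt, src_pos, src_signal, rec_pos, rec_signal, c2=0.25):
    """Image contribution of one shot on an n-point grid over nt time steps."""
    # Forward pass: propagate the source wave field from time 0 to nt - 1.
    prev, curr = [0.0] * n, [0.0] * n
    source_fields = []
    for t in range(nt):
        nxt = step(prev, curr, c2)
        nxt[src_pos] += src_signal[t]        # inject the recorded source signal
        source_fields.append(nxt)
        prev, curr = curr, nxt
    # Backward pass: propagate the receiver wave field from time nt - 1 to 0.
    image = [0.0] * n
    prev, curr = [0.0] * n, [0.0] * n
    for t in range(nt - 1, -1, -1):
        nxt = step(prev, curr, c2)
        nxt[rec_pos] += rec_signal[t]        # inject the recorded receiver signal
        for i in range(n):                   # cross-correlate at the same time step
            image[i] += source_fields[t][i] * nxt[i]
        prev, curr = curr, nxt
    return image
```

Because each shot only reads its own inputs, a full image would be the element-wise sum of `rtm_shot` results over all shots, which is why the outer loop parallelizes easily.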
![Page 39: Dr. Philip Brisk Department of Computer Science and Engineering](https://reader036.vdocuments.net/reader036/viewer/2022062501/5681668d550346895dda5bc6/html5/thumbnails/39.jpg)
RTM Computational Challenges
• Cross-correlate source and receiver signals
  – Source/receiver wave signals are computed in different directions in time
  – The size of a source wave field for one time step can be 0.5 to 4 GB
  – Checkpointing: store the source wave field at certain time steps and recompute the remaining steps when needed
• Memory access pattern
  – Neighboring points may be distant in the memory space
  – High cache miss rate (when the domain is large)
![Page 40: Dr. Philip Brisk Department of Computer Science and Engineering](https://reader036.vdocuments.net/reader036/viewer/2022062501/5681668d550346895dda5bc6/html5/thumbnails/40.jpg)
Hardware
![Page 41: Dr. Philip Brisk Department of Computer Science and Engineering](https://reader036.vdocuments.net/reader036/viewer/2022062501/5681668d550346895dda5bc6/html5/thumbnails/41.jpg)
General Architecture
![Page 42: Dr. Philip Brisk Department of Computer Science and Engineering](https://reader036.vdocuments.net/reader036/viewer/2022062501/5681668d550346895dda5bc6/html5/thumbnails/42.jpg)
Java-like HDL / MaxCompiler
Stencil Example
Automated construction of a window buffer that covers the different points needed by the stencil
Data type: no reason that all floating-point data must be 32- or 64-bit IEEE compliant (float/double)
![Page 43: Dr. Philip Brisk Department of Computer Science and Engineering](https://reader036.vdocuments.net/reader036/viewer/2022062501/5681668d550346895dda5bc6/html5/thumbnails/43.jpg)
Performance Tuning
• Optimization strategies
  – Algorithmic requirements
  – Hardware resource limits
• Balance resource utilization so that none becomes a bottleneck
  – LUTs
  – DSP Blocks
  – block RAMs
  – I/O bandwidth
![Page 44: Dr. Philip Brisk Department of Computer Science and Engineering](https://reader036.vdocuments.net/reader036/viewer/2022062501/5681668d550346895dda5bc6/html5/thumbnails/44.jpg)
Algorithm Optimization
• Goal:
  – Avoid data transfer required to checkpoint source wave fields
• Strategies:
  – Add randomness to the boundary region
  – Make computation of source wave fields reversible
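Why reversibility removes the need for checkpoints can be seen on a toy problem: a second-order time-stepping update can be solved for the previous field, so the source wave field can be regenerated backward in time from its last two states. The sketch below uses a 1D grid with fixed zero boundaries, which is an assumption made to keep the example exact; the slide's random boundary region serves the same purpose in the real 3D setting, where absorbing boundaries would otherwise destroy information. Function names are hypothetical.

```python
def laplacian(u, i):
    return u[i - 1] - 2.0 * u[i] + u[i + 1]


def forward_step(prev, curr, c2):
    """next = 2*curr - prev + c2 * laplacian(curr), with the end points held at zero."""
    nxt = [0.0] * len(curr)
    for i in range(1, len(curr) - 1):
        nxt[i] = 2.0 * curr[i] - prev[i] + c2 * laplacian(curr, i)
    return nxt


def backward_step(curr, nxt, c2):
    """Solve the same update for prev: prev = 2*curr - next + c2 * laplacian(curr)."""
    prev = [0.0] * len(curr)
    for i in range(1, len(curr) - 1):
        prev[i] = 2.0 * curr[i] - nxt[i] + c2 * laplacian(curr, i)
    return prev


def round_trip(initial_prev, initial_curr, steps, c2=0.25):
    """Run `steps` forward, then `steps` backward, keeping only two fields in memory."""
    prev, curr = initial_prev, initial_curr
    for _ in range(steps):
        prev, curr = curr, forward_step(prev, curr, c2)
    for _ in range(steps):
        prev, curr = backward_step(prev, curr, c2), prev
    return prev, curr
```

After the round trip the two returned fields should equal the initial pair up to floating-point rounding, with no intermediate wave field ever written out.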
![Page 45: Dr. Philip Brisk Department of Computer Science and Engineering](https://reader036.vdocuments.net/reader036/viewer/2022062501/5681668d550346895dda5bc6/html5/thumbnails/45.jpg)
Custom BRAM Buffers
37 pt. Star Stencil on a MAX3 DFE
• 24 concurrent pipelines at 125 MHz
• Concurrent access to 37 points per cycle
• Internal memory bandwidth of 426 Gbytes/sec
![Page 46: Dr. Philip Brisk Department of Computer Science and Engineering](https://reader036.vdocuments.net/reader036/viewer/2022062501/5681668d550346895dda5bc6/html5/thumbnails/46.jpg)
More Parallelism
• Process multiple points concurrently
  – Demands more I/O
• Cascade multiple time steps in a deep pipeline
  – Demands more buffers
![Page 47: Dr. Philip Brisk Department of Computer Science and Engineering](https://reader036.vdocuments.net/reader036/viewer/2022062501/5681668d550346895dda5bc6/html5/thumbnails/47.jpg)
Number Representation
• 32-bit floating-point was default
• Convert many variables to 24-bit fixed-point
  – Smaller pipelines => MORE pipelines
Floating-point
- 16,943 LUTs
- 23,735 flip-flops
- 24 DSP48Es
Fixed-point
- 3,385 LUTs
- 3,718 flip-flops
- 12 DSP48Es
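A short example shows what converting a value to 24-bit fixed-point involves. The slide gives only the total width; the split into 16 fractional bits, the round-to-nearest conversion, and the saturation behavior below are assumptions chosen for illustration, as are the function names.

```python
def to_fixed(x, frac_bits=16, total_bits=24):
    """Quantize x to a signed fixed-point integer, saturating at the range limits."""
    scaled = round(x * (1 << frac_bits))
    lowest = -(1 << (total_bits - 1))
    highest = (1 << (total_bits - 1)) - 1
    return max(lowest, min(highest, scaled))


def from_fixed(q, frac_bits=16):
    """Convert the fixed-point integer back to a float."""
    return q / (1 << frac_bits)


q = to_fixed(0.1)                     # stored as an integer
error = abs(from_fixed(q) - 0.1)      # at most half of one step, i.e. 2**-17
```

The quantization step is fixed by `frac_bits`, so the format suits variables whose dynamic range is known and narrow, which is presumably why only some variables were converted.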
![Page 48: Dr. Philip Brisk Department of Computer Science and Engineering](https://reader036.vdocuments.net/reader036/viewer/2022062501/5681668d550346895dda5bc6/html5/thumbnails/48.jpg)
Hardware Decompression
• I/O is a bottleneck
  – Compress data off-chip
  – Decompress on the fly
  – Higher I/O bandwidth
• Wave field data
  – Must be read and written many times
  – Lossy compression acceptable
    • 16-bit storage of 32-bit data
• Velocity data and read-only Earth model parameters
  – Store values in a ROM
  – Create a table of indices into the ROM
• Decompression requires ~1300 LUTs and ~1200 flip-flops
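The ROM-plus-index scheme for read-only parameters amounts to dictionary coding: the distinct values are stored once, and each grid point stores only a small index. The sketch below illustrates the idea under the assumption that the number of distinct values is small; the function names and sample velocities are hypothetical.

```python
def compress(values):
    """Return (rom, indices): the distinct values and one ROM index per point."""
    rom = sorted(set(values))
    position = {value: i for i, value in enumerate(rom)}
    return rom, [position[value] for value in values]


def decompress(rom, indices):
    """On-the-fly decompression is a single ROM lookup per point."""
    return [rom[i] for i in indices]


def index_bits(rom):
    """Bits needed per stored index (at least one)."""
    return max(1, (len(rom) - 1).bit_length())


velocities = [1500.0, 2000.0, 1500.0, 3000.0, 2000.0]
rom, indices = compress(velocities)      # rom has 3 entries, indices need 2 bits each
```

With three distinct velocities each point needs 2 bits instead of 32, and the round trip is lossless, unlike the 16-bit storage used for wave field data.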
![Page 49: Dr. Philip Brisk Department of Computer Science and Engineering](https://reader036.vdocuments.net/reader036/viewer/2022062501/5681668d550346895dda5bc6/html5/thumbnails/49.jpg)
Results
![Page 50: Dr. Philip Brisk Department of Computer Science and Engineering](https://reader036.vdocuments.net/reader036/viewer/2022062501/5681668d550346895dda5bc6/html5/thumbnails/50.jpg)
Performance Model
• Memory bandwidth constraint, with terms:
  – # points processed in parallel
  – # bytes per point
  – compression ratio
  – frequency
  – memory bandwidth
• Resource constraint (details omitted)
![Page 51: Dr. Philip Brisk Department of Computer Science and Engineering](https://reader036.vdocuments.net/reader036/viewer/2022062501/5681668d550346895dda5bc6/html5/thumbnails/51.jpg)
Performance Model
• Cost (cycles consumed on redundant streaming of overlapping halos)
• Model, with terms:
  – # points processed in parallel
  – # time steps cascaded in one pass
  – frequency
![Page 52: Dr. Philip Brisk Department of Computer Science and Engineering](https://reader036.vdocuments.net/reader036/viewer/2022062501/5681668d550346895dda5bc6/html5/thumbnails/52.jpg)
Model Evaluation
![Page 53: Dr. Philip Brisk Department of Computer Science and Engineering](https://reader036.vdocuments.net/reader036/viewer/2022062501/5681668d550346895dda5bc6/html5/thumbnails/53.jpg)
Fast, Flexible High-Level Synthesis from OpenCL Using Reconfiguration Contexts
James Coole and Greg Stitt
University of Florida
![Page 54: Dr. Philip Brisk Department of Computer Science and Engineering](https://reader036.vdocuments.net/reader036/viewer/2022062501/5681668d550346895dda5bc6/html5/thumbnails/54.jpg)
Compiler Flow
• Intermediate Fabric
  – Coarse-grained network of arithmetic processing units synthesized on the FPGA
  – 1000x faster place-and-route than targeting the FPGA directly
  – 72-cycle maximum reconfiguration time
![Page 55: Dr. Philip Brisk Department of Computer Science and Engineering](https://reader036.vdocuments.net/reader036/viewer/2022062501/5681668d550346895dda5bc6/html5/thumbnails/55.jpg)
Intermediate Fabric
Multiple datapaths per kernel
• Reconfigure the FPGA to swap datapaths
The intermediate fabric can implement each kernel by reconfiguring the fabric routing network
The core arithmetic operators are often shared across kernels
![Page 56: Dr. Philip Brisk Department of Computer Science and Engineering](https://reader036.vdocuments.net/reader036/viewer/2022062501/5681668d550346895dda5bc6/html5/thumbnails/56.jpg)
Reconfiguration Contexts
• One intermediate fabric may not be enough
• Generate an application-specific set of intermediate fabrics
• Reconfigure the FPGA to switch between intermediate fabrics
![Page 57: Dr. Philip Brisk Department of Computer Science and Engineering](https://reader036.vdocuments.net/reader036/viewer/2022062501/5681668d550346895dda5bc6/html5/thumbnails/57.jpg)
Context Design Heuristic
• Maximize number of resources reused across kernels in a context
• Minimize area of individual contexts
• Use area savings to scale up contexts to support kernels that were not known at context design time
• Kernels grouped using K-means clustering
![Page 58: Dr. Philip Brisk Department of Computer Science and Engineering](https://reader036.vdocuments.net/reader036/viewer/2022062501/5681668d550346895dda5bc6/html5/thumbnails/58.jpg)
Compiler Results
![Page 59: Dr. Philip Brisk Department of Computer Science and Engineering](https://reader036.vdocuments.net/reader036/viewer/2022062501/5681668d550346895dda5bc6/html5/thumbnails/59.jpg)
Configuration Bitstream Sizes and Recompilation Times
![Page 60: Dr. Philip Brisk Department of Computer Science and Engineering](https://reader036.vdocuments.net/reader036/viewer/2022062501/5681668d550346895dda5bc6/html5/thumbnails/60.jpg)
ReconOS: An Operating Systems Approach for Reconfigurable Computing
Andreas Agne¹, Markus Happe², Ariane Keller², Enno Lübbers³, Bernhard Plattner², Marco Platzner¹, and Christian Plessl¹
¹University of Paderborn, Germany
²ETH Zürich, Switzerland
³Intel Labs, Europe
![Page 61: Dr. Philip Brisk Department of Computer Science and Engineering](https://reader036.vdocuments.net/reader036/viewer/2022062501/5681668d550346895dda5bc6/html5/thumbnails/61.jpg)
Does Reconfigurable Computing Need an Operating System?
• Application partitioning
  – Sequential: Software/CPU
  – Parallel/deeply pipelined: Hardware/FPGA
• Partitioning requires
  – Communication and synchronization
• The OS provides
  – Standardization and portability
• The alternative is
  – System- and application-specific services
  – Error-prone
  – Limited portability
  – Reduced designer productivity
![Page 62: Dr. Philip Brisk Department of Computer Science and Engineering](https://reader036.vdocuments.net/reader036/viewer/2022062501/5681668d550346895dda5bc6/html5/thumbnails/62.jpg)
ReconOS Benefits
• Application development is structured and starts with software
• Hardware acceleration is achieved by design space exploration
• OS-defined synchronization and communication mechanisms provide portability
• Hardware and software threads are the same from the application development perspective
• Implicit support for partial dynamic reconfiguration of the FPGA
![Page 63: Dr. Philip Brisk Department of Computer Science and Engineering](https://reader036.vdocuments.net/reader036/viewer/2022062501/5681668d550346895dda5bc6/html5/thumbnails/63.jpg)
Programming Model
• Application partitioned into threads (HW/SW)
• Threads communicate and synchronize using one of the programming model’s objects
  – Communication: Message queues, mailboxes, etc.
  – Synchronization: Mutexes
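The flavor of this programming model can be conveyed with ordinary software threads. The sketch below uses Python's standard `threading` and `queue` modules, not the ReconOS API; the worker threads stand in for threads that could be mapped to hardware, and `run_pipeline` and its internals are hypothetical names.

```python
import queue
import threading


def run_pipeline(items, num_workers=2):
    """Square each item using worker threads that talk only through OS objects."""
    inbox = queue.Queue()        # message queue: main thread -> workers
    outbox = queue.Queue()       # message queue: workers -> main thread
    stats = {"processed": 0}
    stats_lock = threading.Lock()    # mutex protecting the shared counter

    def worker():
        while True:
            item = inbox.get()
            if item is None:         # shutdown message
                break
            outbox.put(item * item)
            with stats_lock:
                stats["processed"] += 1

    workers = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in workers:
        thread.start()
    for item in items:
        inbox.put(item)
    for _ in workers:
        inbox.put(None)              # one shutdown message per worker
    for thread in workers:
        thread.join()
    results = sorted(outbox.get() for _ in items)
    return results, stats["processed"]
```

Because the workers touch only the queues and the mutex, the caller does not need to know where a worker executes, which is the property the slide attributes to OS-defined communication and synchronization objects.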
![Page 64: Dr. Philip Brisk Department of Computer Science and Engineering](https://reader036.vdocuments.net/reader036/viewer/2022062501/5681668d550346895dda5bc6/html5/thumbnails/64.jpg)
Stream Processing Software
![Page 65: Dr. Philip Brisk Department of Computer Science and Engineering](https://reader036.vdocuments.net/reader036/viewer/2022062501/5681668d550346895dda5bc6/html5/thumbnails/65.jpg)
Stream Processing Hardware
![Page 66: Dr. Philip Brisk Department of Computer Science and Engineering](https://reader036.vdocuments.net/reader036/viewer/2022062501/5681668d550346895dda5bc6/html5/thumbnails/66.jpg)
OSFSM in VHDL
• VHDL library wraps all OS calls with VHDL procedures
  – Transitions are guarded by an OS-controlled signal
    • done, Line 47
  – Blocking OS calls can pause execution of a HW thread
    • e.g., mutex_lock(),
![Page 67: Dr. Philip Brisk Department of Computer Science and Engineering](https://reader036.vdocuments.net/reader036/viewer/2022062501/5681668d550346895dda5bc6/html5/thumbnails/67.jpg)
ReconOS System Architecture
Delegate Thread
• Interface between HW thread and the OS via OSIF
• The OS is oblivious to HW acceleration
• Imposes non-negligible overhead on OS calls
![Page 68: Dr. Philip Brisk Department of Computer Science and Engineering](https://reader036.vdocuments.net/reader036/viewer/2022062501/5681668d550346895dda5bc6/html5/thumbnails/68.jpg)
OSFSM / OSIF / CPU Interface
• Handshaking provides synchronization
• OS requests are a sequence of words communicated via FIFOs
![Page 69: Dr. Philip Brisk Department of Computer Science and Engineering](https://reader036.vdocuments.net/reader036/viewer/2022062501/5681668d550346895dda5bc6/html5/thumbnails/69.jpg)
HW Thread Interfaces
![Page 70: Dr. Philip Brisk Department of Computer Science and Engineering](https://reader036.vdocuments.net/reader036/viewer/2022062501/5681668d550346895dda5bc6/html5/thumbnails/70.jpg)
ReconOS Toolflow
![Page 71: Dr. Philip Brisk Department of Computer Science and Engineering](https://reader036.vdocuments.net/reader036/viewer/2022062501/5681668d550346895dda5bc6/html5/thumbnails/71.jpg)
Example: Object Tracking