
Algorithm-Architecture Codesign for Structured Matrix Operations on Reconfigurable Systems

J. Kim, P. Mangalagiri, M. Kandemir, V. Narayanan ∗

L. Deng, K. Sobti, C. Chakrabarti †

N. Pitsianis ‡ and X. Sun §

August 14, 2006 (submission)

Abstract

This article introduces the new ideas and techniques of algorithm-architecture codesign (AA codesign) for customized computation on reconfigurable systems. This approach, enabled by recent advances in reconfigurable hardware, notably field programmable gate arrays (FPGAs), integrates the design of algorithms and architectures for improving computation performance and resource utilization. It utilizes software-hardware codesign techniques, but distinguishes itself in algorithm-architecture co-configuration, ending the isolation between algorithm and architecture designs. We report our investigation and experiments for certain structured matrix operations. The new approach is shown effective, for instance, in power reduction without compromising computation performance, or in accuracy improvement without using additional resources. This marks a significant shift from the conventional tradeoff between efficiency and accuracy, as well as the tradeoff between computation performance and resource consumption.


∗ Department of Computer Science and Engineering, Pennsylvania State University
† Department of Electrical Engineering, Arizona State University
‡ Department of Electrical Engineering, Duke University
§ Department of Computer Science, Duke University



1 Introduction

Algorithmic and architectural techniques have been advancing with mutual adaptations, shaping and re-shaping the state of the art in computation. They remain, however, segregated in the design and development stages. The segregation results in a tradeoff between efficiency and accuracy in computation performance, or a tradeoff between computation performance and power-area consumption. For instance, to reduce power consumption one may have to compromise efficiency or accuracy. We introduce in this paper the techniques and tools we have developed for algorithm-architecture codesign (AA codesign), especially for custom applications with reconfigurable hardware. We demonstrate that the proposed AA codesign approach makes a significant shift from the tradeoffs associated with the co-adaptation techniques in current systems.

The traditional software-hardware (SH) co-adaptation approaches fall into two categories. One is architecture centric. Despite tremendous advances in hardware technology, resource consumption in area, power and reliability requirements are still critical factors in the exploration of hardware design options such as the degree of parallelism, the depth of pipelining and the numerical representation. It is well known that one can improve the utilization of a system architecture by adapting software to hardware. For example, if the architecture is cache-memory constrained, then one can employ algorithm transformations such as loop tiling to enhance cache locality and improve timing performance. AA codesign is task centric, not architecture centric. We consider, especially, the use of reconfigurable hardware to realize the long-held desire that an algorithm need not sacrifice its dynamic or adaptive nature to a statically configured architecture.

The other category of SH co-adaptation approaches is algorithm centric. For example, existing FPGA mapping tools, such as AccelChip [1], explore only hardware configurations for an optimal FPGA mapping of the same procedure. The AA codesign techniques introduced here are radically different from these tools and, in a broader sense, distinguished from the conventional SH codesign paradigm [5]. Given a computation procedure in terms of a set of related functions, an SH codesign tool decides, automatically or semi-automatically, whether to implement each of the functions in hardware (as an ASIC) or in software (executed on a CPU), in order to better utilize area and power or to optimize performance under certain constraints. However, neither SH codesign nor the FPGA mapping tools question the earlier algorithmic decisions that may confine the optimization within a narrow scope in the first place. AA codesign, in contrast, is not algorithm centric: the software and hardware do not necessarily optimize for one particular algorithm.

By AA codesign, we increase the synergy between algorithms and architectures, redefining the relationship between algorithmic decisions and software-hardware configurations. At the stage of system design, we explore both algorithmic and architectural choices, including their dynamic changes. We develop context-conscious techniques to exploit task-specific knowledge for improving numerical accuracy, in addition to caching efficiency, or for reducing resource and power consumption without compromising numerical accuracy. In other words, algorithms and architectures co-script the computation procedure and co-adapt the hardware configuration, toward high-performance computation and low power consumption for a given computation task.

To illustrate the AA codesign concepts and techniques, we consider specifically the matrix-vector products for a class of large, dense but structured matrices, which dominate the time and affect the accuracy of many important computation or simulation problems arising frequently in engineering and scientific studies. We describe the structured matrices in Section 2.

The conventional approach for a matrix-vector product partitions the interaction matrix into sub-matrices (called tiles) for efficient data caching or for utilizing a parallel architecture. We refer to this approach as plain tiling. We introduce in Section 3 a different approach, called geometric tiling. The new approach is conscious of task specifics and hardware fabric specifics, providing an ideal and intimate interface between algorithmic configuration and architectural configuration.
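For concreteness, a minimal sketch of plain tiling is given below in Python/NumPy (an illustrative stand-in; the tool flow described later works from MATLAB kernels and generated HDL). The matrix is cut into fixed-size row and column blocks and the product is accumulated block by block, with no regard for the geometry of the particle sets.

```python
import numpy as np

def plain_tiled_matvec(M, v, tile):
    """Matrix-vector product accumulated tile by tile (plain tiling).

    Rows and columns are partitioned into fixed-size blocks, ignoring any
    geometric structure of the underlying source and target particle sets.
    """
    n_rows, n_cols = M.shape
    p = np.zeros(n_rows)
    for i0 in range(0, n_rows, tile):
        for j0 in range(0, n_cols, tile):
            i1, j1 = min(i0 + tile, n_rows), min(j0 + tile, n_cols)
            p[i0:i1] += M[i0:i1, j0:j1] @ v[j0:j1]
    return p
```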

We introduce in Section 4 our AA codesign system, an embodiment of our codesign methodologies. We then present in Section 6 several design options in a case study. The comparison among these options provides convincing evidence of a paradigm shift brought about by AA codesign.

2 A class of structured matrices

We are interested in a rich and important class of structured matrices arising frequently in numerical solutions or simulations of engineering and scientific problems, such as the study of molecular dynamics, galaxy dynamics, or the propagation of electromagnetic waves in various environments, including communication and computation environments. A matrix in the class is typically a linear or linearized convolutional transform on two discrete, finitely bounded data sets. We describe the matrices with two concrete examples, omitting a rigorous mathematical specification.

In the first example, we compute the gravitational field potential at a target set T of particles in the three-dimensional space R^3 due to the mass at a source set S of particles as follows,

    p(t_i) = c \sum_{s_j \in S} \frac{m(s_j)}{\| t_i - s_j \|}, \qquad t_i \in T,    (1)

where the particles located at t_i and s_j in T and S, respectively, may not be equally spaced in each and every dimension, ‖t − s‖ denotes the Euclidean distance between t and s, m(s_j) is the mass of the particle at location s_j, and c is a constant. The electric field potential at cluster T due to the charge at cluster S follows the same kind of interaction relationship.

We may write Equation (1) in an aggregated form, i.e., the matrix-vector product form,

    p(T) = M(T, S) \, v(S),

where p(T) is the vector composed of p(t_i) over T, v(S) is the vector composed of m(s_j) over S, and M(T, S) is the cluster-cluster interaction matrix composed of all the particle-particle interactions c/‖t_i − s_j‖ over T and S. For convenient illustration, we assume the two data sets T and S coincide. Then the matrix M(T, S) is symmetric. Figure 1 shows the images of two clusters in a data set, the mass distribution over the particles, and the potential field due to the mutual interactions among the particles.
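A direct, dense construction of M(T, S) and of the product p(T) = M(T, S) v(S) can be sketched as follows (illustrative Python/NumPy only; the self-interaction terms, which are singular when T = S, are simply zeroed here).

```python
import numpy as np

def interaction_matrix(T, S, c=1.0):
    """Dense cluster-cluster interaction matrix with entries c / ||t_i - s_j||."""
    dist = np.linalg.norm(T[:, None, :] - S[None, :, :], axis=-1)  # (n, m)
    with np.errstate(divide="ignore"):
        return c / dist

rng = np.random.default_rng(0)
S = rng.random((500, 3))        # source particles, here also the targets (T = S)
m = rng.random(500)             # masses m(s_j)
M = interaction_matrix(S, S)
np.fill_diagonal(M, 0.0)        # drop the singular self-interaction entries
p = M @ m                       # p(T) = M(T, S) v(S)
```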

In the second example, we consider the following discrete transform,

    g(t_i) = \sum_{s_j \in S} J_0(\| t_i - s_j \|), \qquad t_i \in T,    (2)

where S and T lie in the two-dimensional space R^2, and J_0 is the zero-order Bessel function of the first kind.
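A reference evaluation of Equation (2) can be written directly (an illustrative sketch; scipy.special.j0 is assumed available for the zero-order Bessel function).

```python
import numpy as np
from scipy.special import j0    # zero-order Bessel function of the first kind

def bessel_field(T, S):
    """Direct evaluation of Equation (2) for planar point sets T and S."""
    dist = np.linalg.norm(T[:, None, :] - S[None, :, :], axis=-1)   # (n, m)
    return j0(dist).sum(axis=1)                                     # g(t_i)
```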

We illustrate with these two examples the general ideas and techniques for exposing task-specific information in our AA codesign. We provide experimental results for the first example.

3 The co-configuration interface

We introduce in this section the basic ideas and scheme of geometric tiling and its role in the algorithm-architecture co-configuration interface.



Figure 1: Two particle clusters in a data set (top left); images of the interaction matrix associated with the two-cluster data set in a plain tiling (bottom left) and in a geometric tiling (bottom right), shown in log scale; the numerical landscape of a larger interaction matrix in geometric tiles (top right)

3.1 Particle clustering

In geometric tiling we partition a matrix as follows. We partition and cluster the particles first, based on a prescribed clustering rule. Imagine that all the particles, sources and targets alike, are bounded in a box B. We may assume that one of the box's corners is at the origin; see the top left of Figure 1. Specifically, one may divide the box B into eight non-overlapping smaller boxes of equal size, B0, B1, B2, up to B7. Now, each and every particle falls into one of the smaller boxes. Denote by Ti and Si the respective clusters of target particles and source particles in box Bi. This particle clustering scheme induces a virtual partition of the interaction matrix into 8 × 8 tiles (sub-matrices). The (i, j) tile is M(Ti, Sj), the interaction matrix between clusters Ti and Sj. This tile is empty if either of the clusters is empty. For the case illustrated in Figure 1, the 8 × 8 virtual partition renders a 2 × 2 partition of the interaction matrix without empty tiles.
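A minimal sketch of this level-1 clustering is given below (illustrative Python/NumPy; the index sets it returns are what induce the virtual 8 × 8 tile partition).

```python
import numpy as np

def octant_clusters(points):
    """Bucket particles into the eight equal sub-boxes B_0, ..., B_7 of box B.

    Returns one index array per sub-box (possibly empty). The (i, j) geometric
    tile is the interaction matrix restricted to cluster i rows and cluster j
    columns; tiles with an empty cluster are skipped.
    """
    lo = points.min(axis=0)
    mid = lo + (points.max(axis=0) - lo) / 2.0
    # 3-bit octant code: one bit per dimension, set when the coordinate
    # lies in the upper half of the box along that dimension
    codes = ((points >= mid) * np.array([1, 2, 4])).sum(axis=1)
    return [np.flatnonzero(codes == b) for b in range(8)]
```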



3.2 Geometric tiling vs. plain tiling

Figure 1 illustrates the images of the same interaction matrix associated with a two-cluster data set (top left), in a plain tiling (bottom left) and in a geometric tiling (bottom right), respectively. The numerical ranges are indicated by colors. A plain tiling makes a simple set partition of the matrix rows and columns (source particles and target particles), ignoring the geometric and numerical structures that a geometric tiling exposes. Any tile may be further partitioned with either geometric tiling or plain tiling, depending on the codesign circumstance.

Geometric clustering incurs an overhead. Fortunately, this overhead is linearly proportional to the total number of particles, which is smaller than the total number of interaction entries by an order of magnitude. This overhead is outweighed by the many advantages of geometric tiling over plain tiling, as described in the next section.

3.3 Geometric tiling: the scheme and rationale

Geometric tiling influences and integrates algorithm configuration and architectural synthesis in three main ways.

A. Reduction of dynamic range at the input of function evaluation. Computing the pairwise difference t − s and distance ‖t − s‖ is common to the class of computation problems we are interested in, as in the two concrete examples introduced earlier. Within each geometric tile, the difference t − s between any source-target pair (t, s) in the same sub-box is at most half of the maximum particle-to-particle difference in the big box, in every component as well as in Euclidean length. By this fact, we can represent the three components of t − s and the distance ‖t − s‖ for every source-target pair in a geometric tile with one bit fewer, without compromising the accuracy of their numerical representation.

At the hardware level, this reduces the bit-width requirements of the registers that store inputs and intermediate values. We can apply geometric tiling recursively to any of the geometric tiles and refine the tiling, depending on the particle distribution and the interaction relationship, as well as the desired architectural implementation complexity.
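The one-bit saving can be checked numerically. The sketch below (an illustrative calculation, not part of the tool flow) counts the integer bits a fixed-point format needs for the components of t − s over the whole box and over one sub-box.

```python
import numpy as np

def integer_bits(bound):
    """Bits before the binary point for fixed-point values of magnitude up to bound."""
    return max(int(np.ceil(np.log2(bound))), 0)

def max_component_diff(T, S):
    """Largest per-component magnitude of t - s over all target/source pairs."""
    return np.abs(T[:, None, :] - S[None, :, :]).max()

rng = np.random.default_rng(1)
P = rng.random((1000, 3)) * 4.0            # particles filling a 4 x 4 x 4 box B
sub = P[np.all(P < 2.0, axis=1)]           # particles in one sub-box of B

print(integer_bits(max_component_diff(P, P)))       # whole box: 2 bits
print(integer_bits(max_component_diff(sub, sub)))   # one sub-box: 1 bit
```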

B. Reduction of dynamic range at the output of function evaluation. The interaction functions considered here, such as those in the two examples, are smooth. Based on this property, a sufficiently refined geometric tiling can make the numerical range of the source-target interaction over a tile at most half of the range over the whole matrix. For the gravitational interaction in (1) in particular, a partition of the box B at the middle along each dimension is sufficient for this reduction in the numerical range of the interaction function over each and every geometric tile.

The reduced numerical range helps to achieve a specified accuracy with a smaller number of lookup table entries for functions such as sine, cosine and Bessel functions. It also reduces the bit-width requirements of the accumulation logic used in matrix-vector operations. These reductions have a direct impact on the area and power requirements of the design. Traditional algorithm design has mostly overlooked this architectural aspect and its effect.
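The effect on table size can be quantified with a standard error bound for piecewise-linear interpolation: the error is at most h^2 max|f''| / 8 for step h, so the entry count scales linearly with the input range at a fixed accuracy. The sketch below (illustrative only) sizes a cosine table this way.

```python
import numpy as np

def lut_entries(lo, hi, max_abs_err, second_deriv_bound):
    """Entries a linearly interpolated lookup table needs on [lo, hi].

    The piecewise-linear interpolation error is bounded by h^2 * max|f''| / 8,
    so the step h, and hence the entry count, follows from the input range
    and the accuracy target.
    """
    h = np.sqrt(8.0 * max_abs_err / second_deriv_bound)
    return int(np.ceil((hi - lo) / h)) + 1

# Halving the input range of cos(x) roughly halves the table size needed
# for the same accuracy (|cos''| <= 1 on any interval).
print(lut_entries(0.0, np.pi, 1e-6, 1.0))       # full range
print(lut_entries(0.0, np.pi / 2, 1e-6, 1.0))   # reduced range
```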

C. Increase of locality in data range. We exploit the range locality within each tile in table lookups and in calculating submatrix-vector products. We also extend this locality across tiles via geometric traversal, based on the translation-invariance property of the interaction functions. Specifically, we organize geometric tiles at different elevation levels according to their numerical ranges; see the top right of Figure 1. We traverse the tiles and accumulate submatrix-vector products by elevation level, in ascending order, to reduce the oscillation in numerical range across tiles.

On the hardware side, the increased locality in dynamic range enables us to configure specialized structures for the tiles in a designated range. For example, we can vary the number of lookup table entries required for evaluating the cosine function based on the expected dynamic range of the inputs, without suffering a high overhead from dynamic reconfiguration.
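One possible rendering of the range-ordered traversal is sketched below (illustrative Python; the magnitude of the tile entries stands in for the elevation level).

```python
import numpy as np

def accumulate_by_range_level(tiles, v_blocks, p_blocks):
    """Accumulate submatrix-vector products in ascending order of tile range.

    `tiles` is a list of (i, j, M_ij) blocks; `v_blocks` and `p_blocks` map
    cluster indices to input and output sub-vectors. Visiting tiles by the
    magnitude of their entries keeps successive accumulations within similar
    numerical ranges.
    """
    for i, j, M_ij in sorted(tiles, key=lambda t: np.abs(t[2]).max()):
        p_blocks[i] += M_ij @ v_blocks[j]
    return p_blocks
```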

3.4 The co-configuration perspective and interface

It is clear from the description above that the geometric tiling parameters form an interface between algorithm configuration and architecture configuration. Such an algorithm-architecture interface is essential for AA codesign.

Geometric tiling may play two more interface roles. The regulated clustering technique itself is familiar or natural to many computational scientists. The overhead may be lower, or eliminated, when the data are already collected or provided in a geometrically partitioned fashion. Geometric tiling also exposes the potential for compressed representation of certain tiles. As a matter of fact, the partition and clustering technique has been used for years as a pre-processing step for compressed representation and fast evaluation of various interactions. In recent years many compression-based fast algorithms have gained effective control over the loss due to compression or mathematical approximation; see [2, 3] and their references to the pioneering works. The same clustering smooths the transition from direct evaluation algorithms to approximate, fast algorithms. Nonetheless, the geometric tiling introduced above gives a new use for this kind of clustering. It also gives a new perspective from which to re-examine algorithms that are fast in algorithmic complexity analysis but challenging to implement with architectural efficiency and accuracy.

4 AA Codesign Tool Support

Our tool flow, depicted in Figure 2, permits the systematic exploration of both algorithmic and hardware level options. Both the hardware and the algorithmic explorations using this tool start with a MATLAB-based description of the kernel functions. The specified kernels interact with the codelets to configure the algorithm and restructure the data flow and data partitioning based on the desired accuracy, power and performance metrics. The codelets provide a set of generic templates that can be applied to generate various algorithmic configurations that differ in data clustering and indexing, tile traversal order, analytical and numerical compression within tiles, and tile sizes. For example, a combination of a particular clustering and compression scheme creates an algorithmic variant. It must be observed that the appropriate optimizations are influenced by the repertoire of chiplets available to the system. Chiplets are basic functional libraries that can be mapped directly onto the hardware. They correspond to a specific implementation of a function; in fact, a function can potentially have many different implementations (due to algorithmic and/or hardware choices), each differing from the rest in latency, power, or area requirements. The user specifications indicate the allowable bounds on these metrics, and the tool then selects, for each function, the most appropriate chiplet that satisfies the specified constraints.
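The chiplet selection step can be pictured as a small constrained optimization, sketched below with a hypothetical record layout for the chiplet metadata (the actual tool works from its HDL chiplet library).

```python
def select_chiplet(candidates, max_latency, max_area):
    """Pick the lowest-power chiplet implementation that meets the bounds.

    `candidates` is a list of dicts with 'latency', 'area' and 'power' fields,
    one per available implementation of a function (hypothetical layout).
    """
    feasible = [c for c in candidates
                if c["latency"] <= max_latency and c["area"] <= max_area]
    if not feasible:
        raise ValueError("no chiplet satisfies the constraints")
    return min(feasible, key=lambda c: c["power"])
```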

The tool builds an intermediate representation called the Abstract Syntax Graph (ASG) from the kernel specification after it is restructured by the codelets. Several automated optimizations are performed using the ASG, in which each node represents either an input or a computation and the edges capture the computation and data flow. The ASG also reveals opportunities for optimizations such as common sub-expression elimination to generate the minimal hardware design for the target application. In the gravitational kernel case, for instance, there are seven kernels (each of which can be represented with its own ASG) with substantial data and computation reuse among them. Therefore, as a first step, the ASGs of the different kernels are combined into a single ASG and redundant computations are eliminated. The resulting combined ASG is further refined to balance parallelism and resource requirements, based on user specifications.
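The sharing of redundant computations across kernels amounts to common sub-expression elimination on the combined graph. A minimal sketch, with expressions as nested tuples rather than the tool's actual ASG data structure, is shown below.

```python
def build_asg(exprs):
    """Combine kernel expressions into one DAG, sharing repeated subexpressions.

    Expressions are nested tuples such as ('add', ('mul', 'rd1', 'rd1'),
    ('mul', 'rd2', 'rd2')); leaves are input names. Structurally identical
    subtrees map to a single node id, which is the essence of common
    sub-expression elimination on the ASG.
    """
    nodes = {}                          # canonical key -> node id
    def intern(e):
        if isinstance(e, str):          # input leaf
            key = ("in", e)
        else:
            op, *args = e
            key = (op, tuple(intern(a) for a in args))
        return nodes.setdefault(key, len(nodes))
    roots = [intern(e) for e in exprs]
    return nodes, roots
```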



Figure 2: AA codesign automation flow

After an ASG is optimized and restructured, HDL code is generated and subsequently synthesized using back-end synthesis tools (Xilinx ISE in our implementation). To generate the HDL code, the complex operations involved in the kernel functions are transformed into a single operation format: any complex operation involving more than one operator is decomposed into a set of sub-operations, each having a single operator and at most two operands. This format considerably simplifies the mapping of an ASG onto the HDL library of chiplets. Once the kernel functions are available in the single operation format, they are input to our back-end HDL generator. Figure 3(a) shows the steps our approach implements and Figure 3(b) illustrates the ASG for one of the kernels.
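A sketch of the decomposition into the single operation format is given below (illustrative Python; the temporaries t0, t1, ... are hypothetical names, and the operators are kept symbolic).

```python
def to_single_op_format(expr):
    """Flatten a nested expression into two-operand sub-operations.

    For example, ('div', 1, ('add', ('add', 'rd12', 'rd33'), 'e')) becomes
    the sequence t0 = rd12 add rd33; t1 = t0 add e; t2 = 1 div t1.
    """
    ops, counter = [], [0]
    def walk(e):
        if not isinstance(e, tuple):
            return e                    # leaf: an input name or a constant
        op, *args = e
        flat_args = [walk(a) for a in args]
        name = f"t{counter[0]}"
        counter[0] += 1
        ops.append((name, op, *flat_args))
        return name
    walk(expr)
    return ops
```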

Figure 3: (a) Design automation flow. (b) An example ASG.

The HDL generator takes as input the ASG in the single operation format and generates fully pipelined HDL code. In doing so, it exploits the inherent parallelism in the design. As an example, by identifying the slack between the arrival times of inputs at a node from two different paths, it can choose a low-power hardware implementation for an operation on the path with the earlier arrival time. The back end is capable of generating both fixed and floating point HDL, depending on the task configuration. When necessary, delay elements are inserted to match the speeds of different pipelines. Moreover, in the case of a fixed point HDL representation, the dynamic range of each input datum is calculated at the algorithmic level. The tool employs a version of interval arithmetic, as part of the precision generation, to determine the dynamic ranges of the vectored variables and outputs.
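The interval-arithmetic range propagation can be sketched as follows (a simplified illustration; the example bounds are hypothetical and the product rule is intentionally conservative).

```python
def interval_add(a, b):
    return (a[0] + b[0], a[1] + b[1])

def interval_sub(a, b):
    return (a[0] - b[1], a[1] - b[0])

def interval_mul(a, b):
    prods = [a[0] * b[0], a[0] * b[1], a[1] * b[0], a[1] * b[1]]
    return (min(prods), max(prods))

# Propagating input intervals through the ASG bounds every intermediate value,
# from which a fixed-point integer bit-width for each node follows.
t1, s1 = (-1.0, 4.0), (-1.0, 4.0)     # assumed coordinate bounds
rd1 = interval_sub(t1, s1)            # (-5.0, 5.0)
rd11 = interval_mul(rd1, rd1)         # (-25.0, 25.0), conservative for a square
```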

The bitstream generated from the HDL is downloaded onto the target FPGA to configure the hardware. The data and control flow sequencing to and from the host during execution is controlled by policies determined by the codelets.



5 The AA codesign scenarios

Our AA codesign system makes several co-adaptation decisions based on iterative design variation and tradeoff evaluation. To name a few, it selects number representation formats, static or dynamic, to meet accuracy requirements; decides geometric tile sizes to relax memory and pipeline constraints; and determines an appropriate data width for pipeline units, exploiting knowledge of the dynamic data range. To reduce the implementation cost of lookup tables, it may also re-adjust the tile size and the function expansion and approximation. We describe in the following a few typical co-adaptation scenarios.

In the first case, the hardware data path supports only fixed point computation, a constraint that may have been imposed by area or power. At the algorithmic level, the data access pattern is then modified so that data with comparable dynamic range are accessed successively and mapped to the same hardware. This is achieved by geometrically clustering the data as described in Section 3. This hardware-aware algorithmic transform allows the values in a tile to be scaled so that they can be expressed as a combination of (a) a scaling factor that differs from tile to tile and (b) data with a smaller dynamic range, which can then be processed by the same fixed point hardware. This is elaborated in Section 6.
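A minimal block-scaling sketch of this idea is given below (illustrative Python with an assumed power-of-two scale per tile and a fixed fractional bit count; the actual data widths in the tool come from its range analysis).

```python
import numpy as np

def block_scale(tile, frac_bits=15):
    """Split a tile into a per-tile scale factor and small-range fixed-point data."""
    scale = 2.0 ** np.ceil(np.log2(max(np.abs(tile).max(), np.finfo(float).tiny)))
    q = np.round(tile / scale * (1 << frac_bits)).astype(np.int32)
    return scale, q

def block_matvec(scale, q, v, frac_bits=15):
    """Submatrix-vector product on the quantized tile, rescaled on accumulation."""
    return scale * (q @ v) / (1 << frac_bits)
```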

Consider next the hardware cost, or constraints, of implementing a function using a lookup table (LUT); it depends strongly on the tile structure. If the LUT is confined to a very limited area, the numerical range of the function inputs in successive cycles must be small in order to achieve good numerical accuracy. This requires that the data be organized at the algorithm level so that the numerical range is small over each tile. Furthermore, if tiles with the same dynamic range are accessed one after another (rather than in row-major or column-major order), then the same LUT can be used for computations over data in multiple tiles without having its contents updated frequently.

In a final scenario, we are concerned with architectural accuracy. In a LUT, the number of expansion terms for function evaluation, such as the number of truncated Taylor expansion terms for the sine function, is determined by the approximation accuracy requirement. The accuracy achieved by analytical truncation, however, can be lost to architectural truncation when the hardware uses an insufficient number of bits. Our framework takes account of both truncations and determines the best way to apportion the errors between the algorithm and the hardware, while meeting the other resource constraints such as power, area and delay.

In the next section, we use the interactions between algorithmic data range control and data path number representation choices as an illustrative representative of the power and numerical accuracy benefits of our codesign approach. The codesign approach can also be tailored to provide better area or throughput benefits, but this is not elaborated in this work. For example, the LUT size reduction obtained through algorithm modifications reduces area requirements and improves throughput by avoiding the access latencies of larger LUTs.

6 Case Study

We demonstrate the effectiveness of AA codesign for computing Equation (1) with example designs mapped onto a Virtex2Pro-100 FPGA-based target board using our automated tool flow. In these examples, we assume that the target set coincides with the source set, T = S, and that the data set is the union of two clusters C1 and C2, as shown in Figure 1.

The choice of algorithm-architecture framework is made by the AA codesign methodology, based on the optimizations requested in the input specification. For instance, if the input specification assigns a high priority to area, latency and power optimizations with a controlled loss of accuracy, a fixed point implementation with geometric tiling is selected. If, instead, numerical accuracy is prioritized over area and power optimizations, a floating point implementation with geometric tiling is favored.

6.1 Co-configuration options and experiment results

We present four particular options, among others, for the algorithm-architecture co-configuration; we refer to these as PT-FL, PT-FX, GT-FL and GT-FX. Here, PT and GT stand for plain-tiling and geometric-tiling based algorithms, respectively; FL stands for hardware architectures supporting numeric operations in a single-precision floating point format; and FX stands for fixed point implementations in which internal nodes use variable bit-widths, as determined by their dynamic range.

We present in Table 1 (top) the comparison in accuracy between the first two options, PT-FL and GT-FL, in terms of the maximum absolute and relative errors in the computed gravitational potential, relative to the result computed in double precision. These results demonstrate a consistent improvement in accuracy with geometric tiling over a wide range in the number of particles, as compared to plain tiling on the same FL hardware. A key factor in the accuracy improvement is the algorithmic awareness of the limitations of finite precision hardware. In contrast, hardware-only approaches would have required more complex number representation support from the hardware (such as double precision) to achieve higher accuracy.

# particles        1000        3000        5000        10000
Abs. Err.  PT   1.33e-004   1.26e-003   1.02e-003   2.76e-003
           GT   2.20e-005   9.78e-005   8.88e-005   1.52e-004
Rel. Err.  PT   9.13e-007   3.71e-006   1.90e-006   2.78e-006
           GT   1.69e-007   3.12e-007   1.87e-007   1.60e-007

Format   Area (no. of slices)   Power (mW)   Max freq. (MHz)   Latency (cycles)
FX       36553                  5861.88      125               119
FL       44094                  6062.87      125               144

Table 1: Comparison in accuracy between the PT-FL and GT-FL options (top), and in area and power utilization between FX and FL hardware (bottom)

While not shown here, the accuracy with option GT-FX is also improved over option PT-FX. This implies that it is possible to approach the numerical accuracy of plain tiling operated on more sophisticated hardware (e.g., FL versus FX, or a larger bit-width for FX) with geometric tiling operating on simpler hardware. We can therefore use algorithm-architecture codesign to provide more power- and area-efficient solutions while achieving the same numerical accuracy.

Table 1 (bottom) provides a comparison of the area and power utilization of the FX and FL hardware. Observe that the FX implementation achieves an overall 21% reduction in area. Consequently, we can fit more FX pipelines for evaluating the gravitational interaction in the same size FPGA as compared to FL pipelines, and obtain better throughput, albeit at a loss of accuracy. We can reduce the degree of this loss of accuracy, due to the use of FX in place of FL, by using geometric tiling and other effective algorithmic transformations.

7 Conclusion and Future work

Currently, our reconfigurable system is implemented on a single FPGA. The Xilinx Virtex2Pro-100 FPGA implements a PCI Express core, the PC interface and the user blocks. The bus interface for transmitting and receiving data between the host computer and the FPGA is a 4-lane PCI Express link, which has enough bandwidth to support the multiple-FPGA version. This version targets the DN6000k10PCIe-4 multiple-FPGA board and is illustrated in Figure 4. One of the six FPGAs on the board is directly connected to the PCI Express interface and used for interfacing with the PCI FIFO block, while the other five FPGAs implement the multiple user blocks.

Figure 4: Multiple FPGA architecture

The theoretical peak performance of the floating point implementation of the gravitational kernel using the multiple-FPGA configuration is 175 Gflops. In comparison, the GRAPE-6 system, which used ASICs to accelerate astrophysical N-body simulations, attained a theoretical peak performance of 63.4 Tflops [4]. Even at the chip level, GRAPE-6 has a peak performance of 30.9 Gflops per chip, whereas our system has a higher peak performance of 35 Gflops per chip.

In addition to providing peak performance comparable to a contemporary special-purpose computer, our system has the flexibility to adapt to new kernels and applications. The GRAPE project is also moving towards retargeting its systems to reconfigurable platforms because of the large cost benefits of being able to reconfigure and efficiently support different classes of N-body problems. We anticipate that reconfigurable systems will become even more important in the future because of their intrinsic ability to co-adapt along with algorithmic changes.

Our future efforts include a fully automated search engine that can select appropriate combinations of codelets and chiplets by exploring the search space and identifying the Pareto points. There are at least two dimensions to be explored. The first is the selection of the search algorithm(s) to employ. The second is the fast estimation of the metrics of interest for evaluating a given point in the search space, without necessarily going all the way to HDL code or hardware mapping. Work is also underway to enhance our set of metrics to include data resilience, which is becoming increasingly important in many application domains that must process large datasets reliably.

References

[1] AccelChip Inc. "Automatic Conversion of Floating-Point to Fixed-Point MATLAB", January 13, 2004.

[2] J. Carrier, L. Greengard, and V. Rokhlin. A fast adaptive multipole algorithm for particle simulations. SIAM J. Sci. Stat. Comput., 9(4):280–292, July 1988.

[3] L. Greengard and V. Rokhlin. A fast algorithm for particle simulations. Journal of Computational Physics, 73(2):325–348, December 1987.

[4] J. Makino, T. Fukushige, M. Koga, and K. Namura. GRAPE-6: Massively-parallel special-purpose computer for astrophysical particle simulations. Publications of the Astronomical Society of Japan, 55:1163–1187, 2003.

[5] Giovanni De Micheli, Rolf Ernst, and Wayne Wolf, editors. Readings in Hardware/Software Co-Design. Systems-on-Silicon Series. Morgan Kaufmann, 2003.
