
NUMERICAL LINEAR ALGEBRA WITH APPLICATIONS
Numer. Linear Algebra Appl. 2010; 00:1–??
Published online in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/nla

SMASH: Structured Matrix Approximation by Separation and Hierarchy

Difeng Cai1, Edmond Chow2, Lucas Erlandson3, Yousef Saad3 and Yuanzhe Xi4∗

1 Department of Mathematics, Purdue University, West Lafayette, USA
2 School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, USA
3 Department of Computer Science & Engineering, University of Minnesota, Twin Cities, Minneapolis, MN 55455
4 Department of Mathematics and Computer Science, Emory University, Atlanta, GA 30322

SUMMARY

This paper presents an efficient method to perform Structured Matrix Approximation by Separation and Hierarchy (SMASH), when the original dense matrix is associated with a kernel function. Given points in a domain, a tree structure is first constructed based on an adaptive partition of the computational domain to facilitate subsequent approximation procedures. In contrast to existing schemes based on either analytic or purely algebraic approximations, SMASH takes advantage of both approaches and greatly improves the efficiency. The algorithm follows a bottom-up traversal of the tree and is able to perform the operations associated with each node on the same level in parallel. A strong rank-revealing factorization is applied to the initial analytic approximation in the separation regime so that a special structure is incorporated into the final nested bases. As a consequence, the storage is significantly reduced on one hand, and a hierarchy of the original grid is constructed on the other. Due to this hierarchy, nested bases at upper levels can be computed in a similar way as the leaf level operations but on coarser grids. The main advantages of SMASH include its simplicity of implementation, its flexibility to construct various hierarchical rank structures and a low storage cost. The efficiency and robustness of SMASH are demonstrated through various test problems arising from integral equations, structured matrices, etc. Copyright © 2010 John Wiley & Sons, Ltd.

Received . . .

KEY WORDS: hierarchical rank structure, nested basis, complexity analysis, integral equation, Cauchy-like matrix

1. INTRODUCTION

The invention of the Fast Multipole Method (FMM) [1, 2] opened a new chapter in scientific computing methodology by unraveling a set of effective techniques revolving around the powerful principle of divide-and-conquer. When sets of points are far apart from each other, the physical equations that couple them can be approximately expressed by means of a low-rank matrix. Among the many variations of this elegant idea, a few schemes have been developed to gain further efficiency by building hierarchical bases in order to expand the various low-rank couplings. The resulting hierarchical rank structured matrices [3, 4, 5, 6], culminating in H2 matrices [7, 6], provide efficient solution techniques for structured linear systems (Toeplitz, Hankel, etc.) [8, 9, 10],

∗Correspondence to: Yuanzhe Xi, Department of Mathematics and Computer Science, Emory University, Atlanta, GA 30322, U.S.A. E-mail: [email protected]

Contract/grant sponsor: Minnesota Supercomputing Institute

Contract/grant sponsor: NSF; contract/grant number: NSF/ACI–1306573, NSF/DMS–1521573


integral equations [3, 4, 11, 12, 13, 14, 15, 16], partial differential equations [4, 17, 6, 18], matrix equations [4, 19, 20] and eigenvalue problems [21, 22]. Though these methods come under various representations, they all start with a block partition of the coefficient matrix and approximate certain blocks with low-rank matrices. The refinements of these techniques embodied in the H2 [4, 6, 23] and HSS [5, 24, 25] matrix representations take advantage of the relationships between different (numerically) low-rank blocks and use nested bases [6] to minimize computational costs and storage requirements. What is often not emphasized in the literature is that this additional efficiency in the solution phase is achieved at a rather high cost in the construction phase.

Both HSS and H2 matrices employ just two key ingredients: low-rank approximations and nested bases. The low-rank approximation, or compression, methods exploited in these techniques can be classified into three categories. The first category involves methods that rely on algebraic compression, such as the SVD and the rank-revealing QR (RRQR) [26], which are among the most common approaches. Utilizing these techniques to compress low-rank blocks [27, 23, 25] will result in nested bases that have orthogonal columns and an optimal rank. However, these methods require the matrix entries to be explicitly available and usually lead to a quadratic construction cost [23, 25]. Other compression techniques, such as adaptive cross approximation (ACA) [11, 28, 29], extract a low-rank factorization based only on part of the matrix entries, which leads to a nearly linear construction cost. However, ACA may fail for general kernel functions and complex geometries due to the heuristic nature of the method [30]. Other efficient low-rank approximation techniques include, but are not limited to, pseudo-skeleton approximations [31, 32], mosaic-skeleton approximations [33], cross approximations [34] and their latest developments [35, 36]. To the best of our knowledge, no algebraic approach is able to achieve linear cost for an H2 or HSS construction with guaranteed accuracy. The methods in the second category rely on information about the kernel to perform the compression. They include methods based on interpolation [4, 37], Taylor expansion [7] or multipole expansion (as in the FMM [2, 1]), etc. Although these methods lead to a linear construction cost, they usually yield nested bases whose rank is much larger than the optimal one [38]. Moreover, since bases are stored as dense matrices, these methods suffer from high storage costs [3]. The methods in the third category either combine algebraic compression with analytic kernel information to take advantage of both, or use other techniques like equivalent densities or randomized methods to obtain a low-rank factorization. For example, the hybrid cross approximation (HCA) technique [30] improves the robustness of ACA by applying it only to a small matrix arising from the interpolation of the kernel function. The kernel-independent fast multipole methods [39, 40] use equivalent densities to avoid explicit kernel expansions, but they are only valid for certain kernels arising in potential theory. The randomized construction algorithms [41, 42, 43, 22] compute hierarchical rank structured matrices by applying the SVD/RRQR to the product of the original matrix and a random matrix, and are effective when a fast matrix-vector multiplication routine is available.

1.1. Contributions

The aim of this paper is to introduce an efficient and unified framework to construct an n × n H2 or HSS matrix based on structured matrix approximation by separation and hierarchy (SMASH). In terms of the three categories discussed above, SMASH belongs to the third category in that it starts with an initial analytic approximation in the Separation regime; then algebraic techniques are employed to postprocess the approximation in order to build up a Hierarchy. The main features of SMASH are as follows.

1. Fast and stable O(n) construction. SMASH starts with an adaptive partition of the computational domain and then constructs a tree structure to facilitate subsequent operations as in [44, 3, 4, 45]. The construction process follows a bottom-up traversal of the tree and is able to compute the bases associated with each node on the same level in parallel. In fact, the construction procedure is entirely local in the sense that the compression for a parent node only depends on the information passed from its children. By combining the analytic compression technique with strong RRQR [26], a special structure is incorporated into the final nested bases.


In contrast to the methods used in [42, 40], SMASH is able to set the approximation accuracy to any tolerance. In addition, the nested bases at each non-leaf level can be computed directly, in a similar way as the leaf-level operations, but on a coarser grid extracted from the previous level of compression. Therefore, SMASH is also advantageous relative to an approach based on the HCA method [46], since it does not need to construct an H matrix first and then use a recompression scheme to recover the nested bases. SMASH can be easily adapted to construct either an n × n H2 or HSS matrix with O(n) complexity, depending on the properties of the underlying application. The guaranteed accuracy/robustness of SMASH is demonstrated on various test examples with complicated geometries (Section 5).

2. Low storage requirement. Construction algorithms that use analytic approximations usually lead to high storage costs. SMASH alleviates this issue in several ways. First, instead of storing nested bases as dense matrices, only one vector and one smaller dense matrix need to be saved for each basis. Second, each coupling matrix [3] is a submatrix of the original matrix in this scheme. Therefore, it suffices to store the row and column indices associated with the submatrix instead of the whole submatrix explicitly. Finally, the use of strong RRQR [26] can automatically reduce the rank of the nested bases if the columns obtained from the analytic approximation are not numerically linearly independent.

3. Simplicity and flexibility for approximation of variable order. Unlike analytic approaches (e.g., FMM), in which farfield approximations and transfer matrices are obtained differently and extra information is needed to compute transfer matrices (cf. [47]), SMASH only requires a farfield approximation, which can be readily obtained for almost all kernels, for example, via interpolation [37]. Moreover, the approximation rank in the compression on upper levels is independent of the rank used in lower levels, which means that the approximation rank can be chosen arbitrarily in the compression at any level while still maintaining the H2 or HSS structure. This is due to the fact that in each level of compression, SMASH produces transfer matrices directly, which is an advantage of algebraic approaches. For interpolation-based constructions, there are restrictions on the maximal degree of basis polynomials at each level in order to maintain the H2 structure [23].

1.2. Outline and Notation

The paper is organized as follows. In Section 2 we review low-rank approximations ([48, 3, 6]) associated with some kernel functions. Section 3 introduces SMASH for the construction of hierarchical rank structured matrices with nested bases. The complexity of SMASH is analyzed in Section 4. Numerical examples are provided in Section 5 and final concluding remarks are drawn in Section 6.

Throughout the paper, the following notation is used:

• A: a dense matrix associated with a kernel function κ;

• Ã: an H2 or HSS approximation to A;

• i = 1 : n denotes an enumeration of index i from 1 to n;

• |·| denotes the cardinality of a finite set if the input is a set;

• ‖·‖, ‖·‖F denote the L2 norm and the Frobenius norm, respectively, and ‖A‖max denotes the elementwise sup-norm of a matrix, i.e., ‖A‖max := max_{i,j} |a_{i,j}| for A = [a_{i,j}];

• diag(. . . ) denotes a block diagonal matrix;

• Given a tree T, children(i) and lv(i) represent the children and level of node i, respectively, where the root node is at level 1. The location of node i at level l, when enumerated from left to right, is denoted by l_i;


• Let X and Y be two nonempty finite sets of points and A be a matrix whose (i, j)th entry is determined by the ith point in X and the jth point in Y. If i denotes the index set corresponding to a subset X_i of X, then A|i denotes the submatrix of A with rows determined by X_i. Furthermore, if index set j corresponds to a subset Y_j of Y, then A|i×j denotes the submatrix of A whose rows and columns are determined by X_i and Y_j, respectively.

2. DEGENERATE AND LOW-RANK APPROXIMATIONS

Hierarchical rank structured matrices are often used to approximate matrices after a block partition such that most blocks display certain (numerical) low-rank characteristics. For matrices derived from kernel functions, a low-rank approximation can be determined when the kernel function can be locally approximated by degenerate functions [48]. In this section, we first review this property. For pedagogical reasons, we focus on the kernel function 1/(x − y), but more general kernel functions can be handled in a similar way, as demonstrated in the numerical experiments (Section 5).

2.1. Degenerate expansion

Consider the kernel function κ(x, y) on C × C defined by

\[
\kappa(x, y)=\begin{cases}\dfrac{1}{x-y} & \text{if } x\neq y,\\[4pt] d_x & \text{if } x=y,\end{cases}\tag{1}
\]

where the number d_x ∈ C can be arbitrary. If x and y are far from each other (see Definition 2.1 below), then κ(x, y) can be well approximated by a degenerate expansion

\[
\kappa(x, y)\approx\sum_{k=0}^{r-1}\sum_{l=0}^{k}c_{k,l}\,\phi_k(x)\,\psi_l(y),
\]

where φ_k and ψ_l are univariate functions. In fact, interpolation in the x variable yields the simplest, yet most general, way to obtain a degenerate approximation:

\[
\kappa(x, y)\approx\sum_{k=1}^{r}p_k(x)\,\kappa(x_k, y),\tag{2}
\]

where the x_k's are the interpolation points and the p_k's are the associated Lagrange polynomials.

Several ways to quantify the distance between two sets of points that are away from each other have been defined [3, 6, 47]. One of these ([47]), given below, is often used. For a bounded nonempty set S of C, let δ = min_{c∈C} sup_{s∈S} |s − c|. Then we refer to the minimizer c* as the center of S and to the corresponding minimum value δ as its radius.

Definition 2.1
Let X and Y be two nonempty bounded sets in C. Let a ∈ C and δ_a > 0 be the center and radius of X, so that |x − a| ≤ δ_a for all x ∈ X. Analogously, let b ∈ C and δ_b > 0 denote the center and radius of Y. Given a number τ ∈ (0, 1), we say that X and Y are well-separated with separation ratio τ if the following condition holds:

\[
\delta_a+\delta_b\le\tau\,|a-b|.\tag{3}
\]

Fig. 1 illustrates two well-separated intervals (centered at a = 0.5 and b = 2.5, respectively) with separation ratio τ = 0.5. Given two sets X and Y, if (3) only holds for τ ≈ 1, then X and Y are close to each other and we cannot regard them as well-separated. Hence we assume that τ is a given small or moderate constant (for example, τ ≤ 0.7) in the rest of this paper.
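As an illustration, the sketch below checks condition (3) for two point sets in the complex plane. For simplicity it approximates the enclosing-disk center by the bounding-box midpoint rather than the exact minimizer of Definition 2.1; all names here are ours, not the paper's.

```python
import numpy as np

def center_radius(S):
    # Approximate center and radius of a finite point set in C.
    # The exact center minimizes the maximum distance (Definition 2.1);
    # the bounding-box midpoint used here is a cheap stand-in.
    c = (S.real.min() + S.real.max()) / 2 + 1j * (S.imag.min() + S.imag.max()) / 2
    return c, np.abs(S - c).max()

def well_separated(X, Y, tau):
    # Separation condition (3): delta_a + delta_b <= tau * |a - b|.
    a, da = center_radius(X)
    b, db = center_radius(Y)
    return da + db <= tau * abs(a - b)

# The two intervals of Fig. 1: X around a = 0.5, Y around b = 2.5.
X = np.linspace(0.0, 1.0, 50) + 0j
Y = np.linspace(2.0, 3.0, 50) + 0j
print(well_separated(X, Y, tau=0.5))   # True: 0.5 + 0.5 <= 0.5 * 2.0
```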

Figure 1. Well-separated intervals X, Y (centered at a = 0.5, b = 2.5) with separation ratio τ = 0.5.

Consider the kernel function 1/(x − y) again. When X and Y are well-separated so that (3) holds, a degenerate expansion for the kernel function based on a Taylor expansion takes the following form [49]:

\[
\kappa(x, y)=\sum_{k=0}^{r-1}\sum_{l=0}^{k}c_{k,l}\,\phi_{a,l}(x-a)\,\phi_{b,k-l}(y-b)+\varepsilon_r,\qquad\forall x\in X,\ y\in Y,\tag{4}
\]

where

\[
c_{k,l}:=\begin{cases}-k!\,(b-a)^{-(k+1)}\,\eta_{a,l}^{-1}\,\eta_{b,k-l}^{-1}\,(-1)^{k-l} & \text{if } l\le k,\\[2pt] 0 & \text{if } l>k,\end{cases}\qquad
\phi_{v,l}(x):=\eta_{v,l}\,\frac{x^{l}}{l!},\qquad
\eta_{v,l}=\begin{cases}1 & \text{if } l=0,\\[2pt]\left(\dfrac{l}{e}\,(2\pi r)^{\frac{1}{2r}}\,\dfrac{1}{\delta_v}\right)^{l} & \text{if } l=1,\dots,r-1,\end{cases}\tag{5}
\]

and the approximation error \(\varepsilon_r\) satisfies

\[
|\varepsilon_r|\le\frac{(1+\tau)\,\tau^{r}}{1-\tau}\,|\kappa(x, y)|,\qquad\forall x\in X,\ y\in Y.\tag{6}
\]

The above expansion will be used to illustrate the construction of hierarchical rank structured matrices and to analyze the approximation error in the remaining sections. The scaling factor η_{v,l} is used to improve the numerical stability of the expansion (4); see [49] for more details.

2.2. Farfield and nearfield blocks

We now consider a dense matrix A defined by A := [κ(x, y)]_{x∈X, y∈Y}. The degenerate approximation (4) immediately indicates that certain blocks of A admit a low-rank approximation. In order to identify these low-rank blocks, it is necessary to distinguish nearfield and farfield matrix blocks as they are defined in [3].

Definition 2.2
Given two sets of points X_i and Y_j, a submatrix A|i×j is called a farfield block if X_i and Y_j are well-separated in the sense of Definition 2.1; otherwise, A|i×j is called a nearfield block.

A major difference between farfield and nearfield blocks is that each farfield block can be approximated by low-rank matrices, as a consequence of (4). The following theorem restates (4) in matrix form for the two-dimensional case.

Theorem 2.1
If X_i and Y_j are well-separated sets in C in the sense of (3), with centers a_i, a_j and radii δ_i, δ_j, respectively, then the farfield block A|i×j admits a low-rank approximation of the form

\[
A|_{\mathbf{i}\times\mathbf{j}}=\hat U_i\,\hat B_{i,j}\,\hat V_j^{T}+E_F|_{\mathbf{i}\times\mathbf{j}},\tag{7}
\]

where

\[
\hat U_i=\bigl[\phi_{a_i,l}(x-a_i)\bigr]_{x\in X_i,\,l=0:r-1},\qquad
\hat V_j=\bigl[\phi_{a_j,l}(y-a_j)\bigr]_{y\in Y_j,\,l=0:r-1},\qquad
\hat B_{i,j}=\bigl[c_{k,l}\bigr]_{k,l=0:r-1},\tag{8}
\]

with c_{k,l} and φ_{v,l} (v = a_i, a_j) defined in (5), and

\[
\|E_F|_{\mathbf{i}\times\mathbf{j}}\|_{\max}\le\varepsilon_{\mathrm{far}}\,\|A|_{\mathbf{i}\times\mathbf{j}}\|_{\max}\qquad\text{with}\qquad\varepsilon_{\mathrm{far}}=\frac{(1+\tau)\,\tau^{r}}{1-\tau}.\tag{9}
\]

Let n_i = |X_i| and n_j = |Y_j|. If the points x of X_i enumerate the rows and the functions φ_{a_i,l}(x − a_i), l = 0, . . . , r − 1, enumerate the columns, and similarly for y ∈ Y_j and φ_{a_j,l}(y − a_j), then the matrices Û_i, B̂_{i,j} and V̂_j have dimensions n_i × r, r × r, and n_j × r, respectively. The theorem is illustrated in Fig. 2.


Figure 2. Illustration of Theorem 2.1: the farfield block A|i×j is approximated by Û_i B̂_{i,j} V̂_j^T.
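To make Theorem 2.1 concrete, the following sketch builds a rank-r farfield approximation of a block of the kernel 1/(x − y) using the interpolation form (2) instead of the scaled Taylor coefficients (5); the choice of Chebyshev interpolation points is our assumption, made for illustration only.

```python
import numpy as np

# Two well-separated point sets on the real line (as in Fig. 1).
Xi = np.linspace(0.0, 1.0, 200)            # centered at a = 0.5
Yj = np.linspace(2.0, 3.0, 150)            # centered at b = 2.5
A = 1.0 / (Xi[:, None] - Yj[None, :])      # farfield block A|i×j

def lagrange_basis(xk, x):
    # Evaluate the Lagrange polynomials p_k at the points x (columns = k).
    P = np.ones((len(x), len(xk)))
    for k in range(len(xk)):
        for m in range(len(xk)):
            if m != k:
                P[:, k] *= (x - xk[m]) / (xk[k] - xk[m])
    return P

for r in (5, 10, 15):
    # Chebyshev points on [0, 1] as interpolation nodes in the x variable.
    xk = 0.5 + 0.5 * np.cos((2 * np.arange(r) + 1) * np.pi / (2 * r))
    U = lagrange_basis(xk, Xi)                  # n_i x r column basis
    K = 1.0 / (xk[:, None] - Yj[None, :])       # rows are kappa(x_k, y)
    err = np.abs(A - U @ K).max() / np.abs(A).max()
    print(r, err)   # decays geometrically in r, consistent with (6)
```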

2.3. Strong rank-revealing QR

Notice that in the approximation (7), Û_i only depends on the points in X_i, V̂_j only depends on the points in Y_j, and B̂_{i,j} depends on the centers of X_i and Y_j as well as their radii. This represents a standard expansion structure used in the FMM [2, 1, 47]. As will be seen in the next section, the construction of H2 and HSS matrices is significantly simplified by further postprocessing Û_i and V̂_j with a strong rank-revealing QR (SRRQR) factorization [26]. The following theorem summarizes Algorithm 4 in [26].

Theorem 2.2
([26, Algorithm 4]) Let M be an m × n matrix with M ≠ 0. Given a real number s > 1 and a positive integer r (r ≤ rank(M)), the strong rank-revealing QR algorithm computes a factorization of M in the form

\[
MP=Q\begin{bmatrix}R_{11}&R_{12}\\ &R_{22}\end{bmatrix},\tag{10}
\]

where P is a permutation matrix, Q ∈ R^{m×m} is an orthogonal matrix, R_{11} is an r × r upper triangular matrix and R_{12} is an r × (n − r) dense matrix satisfying

\[
\|R_{11}^{-1}R_{12}\|_{\max}\le s.
\]

The complexity is O(n^3 log_s n) if m ≈ n.

In all of our implementations, we set s = 2. SRRQR unravels a set of columns of M that nearly span the range of M, hence the term rank-revealing. Assume C is an n × r matrix with rank r. Applying SRRQR to C^T produces the factorization

\[
C^{T}P=Q\begin{bmatrix}R_{11}&R_{12}\end{bmatrix},
\]

where Q ∈ R^{r×r} is an orthogonal matrix. A rearrangement of the above equation leads to

\[
C=P\begin{bmatrix}I\\ (R_{11}^{-1}R_{12})^{T}\end{bmatrix}\hat C,
\]

where I is an identity matrix of order r and Ĉ = (QR_{11})^T. Note that the above relation implies that Ĉ ∈ R^{r×r} is a submatrix of C consisting of the first r rows of the row-permuted matrix P^T C. From this perspective, the whole aim of the procedure is to extract a set of r rows from C that nearly span its row space.
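A sketch of this extraction step is given below, using SciPy's column-pivoted QR as a stand-in for the strong RRQR of [26] (pivoted QR does not enforce the bound ‖R_{11}^{-1}R_{12}‖max ≤ s, but behaves similarly on most inputs):

```python
import numpy as np
from scipy.linalg import qr

def compr(C, idx, tol=1e-12):
    # Interpolative decomposition of a tall matrix C, cf. (11) and (14).
    # Returns (piv, G, idx_hat) such that C[piv, :] ~= [I; G] @ C[piv[:r], :],
    # where idx_hat = idx[piv[:r]] indexes the retained rows of C.
    Q, R, piv = qr(C.T, pivoting=True, mode='economic')
    d = np.abs(np.diag(R))
    r = int(np.sum(d > tol * d[0]))               # numerical rank
    G = np.linalg.solve(R[:r, :r], R[:r, r:]).T   # (n - r) x r coefficients
    return piv, G, idx[piv[:r]]

# Usage, e.g. on the interpolation basis U from the previous sketch:
# piv, G, sel = compr(U, np.arange(U.shape[0]))
# approx = np.vstack([np.eye(len(sel)), G]) @ U[piv[:len(sel)], :]
# err = np.abs(U - approx[np.argsort(piv)]).max()  # undo the permutation
```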

When Û_i and V̂_j in (7) both have more rows than columns, applying SRRQR to Û_i^T and then to V̂_j^T yields

\[
\hat U_i=P_i\begin{bmatrix}I\\ G_i\end{bmatrix}\hat U_i|_{\hat{\mathbf{i}}},\qquad
\hat V_j=F_j\begin{bmatrix}I\\ H_j\end{bmatrix}\hat V_j|_{\hat{\mathbf{j}}}.\tag{11}
\]


Note that, as explained above for C, Û_i|_î denotes a matrix made up of selected rows of Û_i. Substituting the above two equations into (7) leads to another form of the low-rank approximation to A|i×j:

\[
A|_{\mathbf{i}\times\mathbf{j}}\approx P_i\begin{bmatrix}I\\ G_i\end{bmatrix}\hat U_i|_{\hat{\mathbf{i}}}\,\hat B_{i,j}\,\bigl(\hat V_j|_{\hat{\mathbf{j}}}\bigr)^{T}\left(F_j\begin{bmatrix}I\\ H_j\end{bmatrix}\right)^{T}\tag{12}
\]
\[
=P_i\begin{bmatrix}I\\ G_i\end{bmatrix}\bigl(A|_{\hat{\mathbf{i}}\times\hat{\mathbf{j}}}-E_F|_{\hat{\mathbf{i}}\times\hat{\mathbf{j}}}\bigr)\left(F_j\begin{bmatrix}I\\ H_j\end{bmatrix}\right)^{T}
\approx P_i\begin{bmatrix}I\\ G_i\end{bmatrix}A|_{\hat{\mathbf{i}}\times\hat{\mathbf{j}}}\left(F_j\begin{bmatrix}I\\ H_j\end{bmatrix}\right)^{T},\tag{13}
\]

where î and ĵ represent subsets of i and j, respectively, and (13) results from (7).

A major advantage of this form of approximation over (7) is a reduction in storage: only four index sets (P_i, F_j, î, ĵ) and two smaller dense matrices (G_i, H_j) need to be stored, rather than three dense matrices (Û_i, B̂_{i,j}, V̂_j). This form is very memory efficient since A|î×ĵ can be quickly reconstructed on the fly from the index sets î, ĵ. There are other advantages that will be discussed in the next section.

The operations represented by (11) will be used extensively in the construction of hierarchical matrices in the next section. They will be denoted as follows:

\[
[P_i,\,G_i,\,\hat{\mathbf{i}}]=\mathrm{compr}(\hat U_i,\mathbf{i})\quad\text{and}\quad[F_j,\,H_j,\,\hat{\mathbf{j}}]=\mathrm{compr}(\hat V_j,\mathbf{j}).\tag{14}
\]

Each of the above operations is also called a structure-preserving rank-revealing (SPRR) factorization [10] or an interpolative decomposition [50]. For recent developments on rank-revealing algorithms, see [51]. Notice that the matrices Û_i and V̂_j in (7) serve as approximate column and row bases for A|i×j, respectively. The Taylor expansion (4) is used to illustrate the compression of low-rank blocks due to its simplicity. More advanced compression schemes, such as weighted truncation techniques [52], the modified ACA method [29] and the fast algorithm combining Green's representation formula with quadrature [53], can also be exploited to compute these bases.

3. CONSTRUCTION OF HIERARCHICAL RANK STRUCTURED MATRICES WITH NESTED BASES

This section presents SMASH, an algorithm to construct either an H2 or an HSS matrix approximation to an n × n matrix A := [κ(x, y)]_{x∈X, y∈Y}, where κ is a given kernel function and X and Y are two finite sets of points. Although the discussion focuses on square matrices, SMASH can be extended to rectangular ones [8] without any difficulty.

3.1. Adaptive partition

SMASH starts with a hierarchical partition of the computational domain Ω and then builds a tree structure T to facilitate subsequent operations. In order to deal with the case when X and Y are non-uniformly distributed, an adaptive partition scheme is necessary.

Without loss of generality, assume both X and Y are contained in a unit box Ω = [0, 1]^d in R^d (d = 1, 2, 3). The basic idea of this partition algorithm (similar to [44]) is to recursively subdivide the computational domain Ω into several subdomains until the number of points included in each resulting subdomain is less than a prescribed constant ν0 (usually much smaller than the number of points in the domain). Specifically, at level 1, Ω is not partitioned. Starting from level l (l ≥ 2), each box obtained at level l − 1 that contains more than ν0 points is bisected along each of the d dimensions.

For convenience we assume that the number of points from X and Y in each partition is the same. If a box is empty, it is discarded. Let L be the maximum level where the recursion stops. Then the information about the partition can be represented by a tree T with L levels, where the root node is at level 1 and corresponds to the domain Ω, and each nonroot node i corresponds to a partitioned subdomain Ω_i. See Fig. 3 for a 1D example. The adaptive partition guarantees that each subdomain corresponding to a leaf node contains a small number of points, less than the prescribed constant ν0.
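A minimal sketch of this adaptive bisection in one dimension follows (the d-dimensional version splits each box along every dimension); the node representation is hypothetical:

```python
import numpy as np

pts = np.sort(np.random.rand(8))     # the points, e.g. X = Y = {x1, ..., x8}

def adaptive_partition(idx, lo, hi, nu0, level=1):
    # Recursively bisect [lo, hi] until each box holds at most nu0 points;
    # returns a nested dict playing the role of the tree T of Fig. 3.
    node = {'interval': (lo, hi), 'idx': idx, 'level': level, 'children': []}
    if len(idx) > nu0:
        mid = (lo + hi) / 2
        for sub, box in ((idx[pts[idx] < mid], (lo, mid)),
                         (idx[pts[idx] >= mid], (mid, hi))):
            if len(sub):                         # empty boxes are discarded
                node['children'].append(
                    adaptive_partition(sub, *box, nu0, level + 1))
    return node

tree = adaptive_partition(np.arange(len(pts)), 0.0, 1.0, nu0=3)
```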


Figure 3. Illustration of an adaptive partition for the case X = Y = {x_1, x_2, . . . , x_8}. Left: the computational domain Ω is recursively bisected until the number of points in each sub-interval Ω_i centered at a_i is less than 4 (circled dots represent the points x_i). Right: the corresponding postordered binary tree T with the indices of the points stored at each node.

3.2. Review of H2 and HSS matrices

The low-rank property of a block A|i×j associated with a node pair (i, j) is related to the strong (or standard) admissibility condition employed to define H and H2 matrices ([4], [38]):

for a fixed τ ∈ (0, 1), the node pair (i, j) in T is admissible if X_i and Y_j are well-separated in the sense of Definition 2.1.

Hierarchical matrices are often defined in terms of the above condition, which, in essence, spells out when a given block in the matrix can be compressed. A matrix A (associated with a tree T) is called an H matrix ([37]) of rank r if there exists a positive integer r such that

rank(A|i×j) ≤ r whenever (i, j) is admissible.

Furthermore, A is called a uniform H matrix ([37]) if there exist a column basis set {U_i}_{i∈T} and a row basis set {V_i}_{i∈T} associated with T, such that when (i, j) is admissible, A|i×j admits a low-rank factorization:

\[
A|_{\mathbf{i}\times\mathbf{j}}=U_iB_{i,j}V_j^{T},\quad\text{for some matrix } B_{i,j}.
\]

This factorization is referred to as an AKB representation in [6], where B_{i,j} is called a coefficient matrix. In [3], B_{i,j} is termed a coupling matrix and we will follow this terminology here.

The class of H2 matrices [37] is a subset of uniform H matrices with a more refined structure. That is, A is an H2 matrix if it is a uniform H matrix with nested bases, in the sense that a basis at one level can be readily expressed in terms of those of its children (see (16)). What is exploited here is that admissible blocks are low-rank and, in addition, their factors (or generators) can be expressed from lower levels.

Figure 4. A parent node p with children c_1, . . . , c_4 and the corresponding partition of the matrix block A|p×Q, where Q is the collection of indices associated with all nodes q such that (p, q) is admissible, so that A|c_i×Q ≈ U_{c_i} T_{c_i} for each child. In the context of HSS matrices there are at most 2 children since the trees are binary. For H2 matrices the trees are more general.

Assume we have the situation illustrated in Fig. 4, where the parent node p has children c_1, . . . , c_4. According to the interpolation in (2), the column basis U_i associated with the set i (for any nonroot node i) can be chosen as


\[
U_i=\bigl[p^{(i)}_k(x)\bigr]_{x\in X_{\mathbf{i}},\,k=1:r},
\]

where p^{(i)}_k (k = 1, . . . , r) are the Lagrange basis polynomials corresponding to the interpolation points x_{i_1}, . . . , x_{i_r}. If i is a child of node p, we can write (cf. [37])

\[
p^{(p)}_k(x)=\sum_{l=1:r}p^{(p)}_k(x_{i_l})\,p^{(i)}_l(x).\tag{15}
\]

The matrix version of (15) then leads to the so-called nested basis:

\[
U_p=\begin{bmatrix}U_{c_1}R_{c_1}\\ \vdots\\ U_{c_4}R_{c_4}\end{bmatrix},\qquad\text{with}\qquad
R_i=\begin{bmatrix}p^{(p)}_1(x_{i_1})&\dots&p^{(p)}_r(x_{i_1})\\ \vdots& &\vdots\\ p^{(p)}_1(x_{i_r})&\dots&p^{(p)}_r(x_{i_r})\end{bmatrix}.
\]

The nested basis can also be obtained through algebraic compression based on a bottom-up procedure. Let A|p×Q denote the entire (numerically) low-rank block row associated with node p, i.e., Q is the union of all indices q such that (p, q) is admissible. As illustrated in Fig. 4, assuming that the column basis U_{c_i} has been obtained from a rank-r factorization of the submatrix A|c_i×Q ≈ U_{c_i} T_{c_i}, we then derive

\[
A|_{p\times Q}\approx\begin{bmatrix}U_{c_1}&&&\\ &U_{c_2}&&\\ &&U_{c_3}&\\ &&&U_{c_4}\end{bmatrix}\begin{bmatrix}T_{c_1}\\ T_{c_2}\\ T_{c_3}\\ T_{c_4}\end{bmatrix}.
\]

Applying a rank-r factorization to the transpose of \([T_{c_1}^{T}\ T_{c_2}^{T}\ T_{c_3}^{T}\ T_{c_4}^{T}]\) leads to

\[
\begin{bmatrix}T_{c_1}\\ T_{c_2}\\ T_{c_3}\\ T_{c_4}\end{bmatrix}\approx\begin{bmatrix}R_{c_1}\\ R_{c_2}\\ R_{c_3}\\ R_{c_4}\end{bmatrix}T_p
\quad\Longrightarrow\quad
A|_{p\times Q}\approx U_pT_p\quad\text{with}\quad U_p=\begin{bmatrix}U_{c_1}R_{c_1}\\ U_{c_2}R_{c_2}\\ U_{c_3}R_{c_3}\\ U_{c_4}R_{c_4}\end{bmatrix}.
\]

Thus, we can obtain the basis U_p for the parent node from the children's U_{c_i}'s and the matrices R_{c_i}, in both the analytic and the algebraic compression schemes. The R_i's are called transfer matrices. Clearly, a similar process can be applied to obtain a row basis V_p and so, more generally, we can write

\[
U_p=\begin{bmatrix}U_{c_1}R_{c_1}\\ \vdots\\ U_{c_k}R_{c_k}\end{bmatrix},\qquad
V_p=\begin{bmatrix}V_{c_1}W_{c_1}\\ \vdots\\ V_{c_k}W_{c_k}\end{bmatrix}.\tag{16}
\]

Hence only the matrices U_i, V_i for leaf nodes i need to be stored. The matrices U_p, V_p for a non-leaf node p can be obtained via transfer matrices, which require much less storage. This is at the origin of the improvement from the O(n log n) cost of the early method in this class developed by Barnes and Hut [44] (H structure) to the O(n) cost obtained later by the FMM [2] (H2 structure) for computing matrix-vector multiplications with some kernel matrices ([47]).
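The algebraic derivation above amounts to one extra rank-r factorization of the stacked coefficient blocks. A minimal sketch, assuming a truncated SVD as the rank-r factorization:

```python
import numpy as np

def transfer_matrices(T_children, r):
    # T_children: blocks T_c1, ..., T_ck with equal row counts.
    # Factor the stack [T_c1; ...; T_ck] ~= [R_c1; ...; R_ck] @ T_p and
    # return the transfer matrices R_ci together with the parent block T_p.
    T_stack = np.vstack(T_children)
    U, s, Vt = np.linalg.svd(T_stack, full_matrices=False)
    R_stack = U[:, :r] * s[:r]                 # stacked transfer matrices
    return np.vsplit(R_stack, len(T_children)), Vt[:r, :]
```

With U_p assembled from the children's bases and these R_{c_i} as in (16), A|p×Q ≈ U_p T_p follows.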

Note that, as described in the literature, H and H2 matrices are associated with more general trees than those traditionally used for HSS matrices [54, 5], which are binary trees, in accordance with the partition algorithm described in Section 3.1. In fact, HSS matrices can be viewed as a special class of H2 matrices in which the strong admissibility condition is replaced by the weak admissibility condition [38]:

the node pair (i, j) in T is admissible if i ≠ j.


The above weak admissibility condition implies that, if A is an HSS matrix and i, j are two children of the root node, then the matrix block A|i×j should admit a low-rank factorization.

In the context of integral equations, this requirement means that the HSS structure will face difficulties in situations where couplings between nearfield blocks require a relatively high-rank representation. Approximation by HSS matrices works well for integral equations in which the kernel function is discretized on a curve. In other cases, the numerical rank of A|i×j may not be small, even when a non-oscillatory kernel function is discretized on a surface or in a volume [3, 9].

The construction of H2 and HSS matrices involves computing the basis matrices U, V at the leaf level, along with the transfer matrices R, W and the coupling matrices B associated with a tree T. In particular, each leaf node i is assigned four matrices U_i, V_i, R_i, W_i and each nonleaf node i at level l ≥ 3 is assigned two matrices R_i, W_i.

There are two types of B_{i,j} matrices: those corresponding to the nearfield blocks at the leaf level, and those corresponding to the coupling matrices, where the product U_i B_{i,j} V_j^T approximates the block A|i×j for certain admissible pairs (i, j). In general, the computation of B_{i,j} is more complicated because one has to carefully specify the set of admissible node pairs (i, j) to be used for an efficient approximation of A. If the distribution of points is uniform, the corresponding node pairs (i, j) are related to what is called the interaction list in the FMM [2, 47]. In more general settings, where points can be arbitrarily distributed, they are called admissible leaves [3]. The set of admissible leaves corresponding to the minimal admissible partition [6] can be defined as follows:

\[
\begin{aligned}
\mathcal L=\;&\{(i,j):\ i,j\in\mathcal T\ \text{are nodes at the same level such that }(i,j)\text{ is admissible}\\
&\quad\text{but }(p_i,p_j)\text{ is not admissible, where }p_i,p_j\text{ are the parents of }i,j,\text{ respectively}\}\\
\cup\;&\{(i,j):\ i\in\mathcal T\text{ is a leaf node and }j\in\mathcal T\text{ with }\mathrm{lv}(j)>\mathrm{lv}(i)\text{ such that}\\
&\quad(i,j)\text{ is admissible but }(i,p_j)\text{ is not admissible, with }p_j\text{ the parent of }j\}\\
\cup\;&\{(i,j):\ j\in\mathcal T\text{ is a leaf node and }i\in\mathcal T\text{ with }\mathrm{lv}(i)>\mathrm{lv}(j)\text{ such that}\\
&\quad(i,j)\text{ is admissible but }(p_i,j)\text{ is not admissible, with }p_i\text{ the parent of }i\}.
\end{aligned}\tag{17}
\]

The node pairs (i, j) corresponding to blocks B_{i,j} that cannot be compressed or partitioned can be identified through inadmissible leaves, defined below (cf. [3]):

\[
\mathcal L^{-}:=\{(i,j):\ i,j\in\mathcal T\text{ are leaf nodes and }(i,j)\text{ is not admissible}\}.\tag{18}
\]

In particular, for HSS matrices, it can be seen that L and L− have the following simple expressions:

\[
\mathcal L=\{(i,j):\ i,j\in\mathcal T\text{ and }j=\text{sibling of }i\},\qquad
\mathcal L^{-}=\{(i,i):\ i\in\mathcal T\text{ is a leaf node}\}.\tag{19}
\]

This special feature will be used later (Section 3.3.1) to simplify the notation associated with HSS matrices. The U, V, R, W, B matrices are called H2 (or HSS) generators in the remaining sections.

3.3. Levelwise construction

Although the HSS structure may appear simpler than the H2 structure, based on their algebraic definitions, the HSS construction procedure is actually more complex. This is because HSS matrices require the compression of both nearfield and farfield blocks, while H2 matrices only require the compression of farfield blocks. For example, if two sets X_i and Y_j are almost adjacent to each other (τ ≈ 1 in (3)), then the analytic approximation will not produce a low rank, i.e., to get an accurate approximation, r has to be large in (6). In this case, the H2 matrix stores this block explicitly as a dense matrix, while the HSS matrix still requires the block to be factorized. In what follows, we first discuss SMASH for the HSS construction in detail and then present the H2 construction with an emphasis on the differences.

3.3.1. HSS construction Due to the simple structure of L, L− in (19), the notation for the coupling matrices and nearfield blocks can be simplified in the HSS representation.


Figure 5. Illustration of the sets N_i used in HSS constructions (Cases 1–3 of the definition (21)).

Specifically, for (i, j) ∈ L, B_{i,j} can be represented as B_i, with the second index j dropped, because j = sibling of i is uniquely determined in a binary tree. An additional symbol D_i is introduced to represent the diagonal block B_{i,i}, because (i, j) ∈ L− implies j = i.

The basic idea of SMASH for the HSS construction is to first apply a truncated SVD to obtain a basis for the nearfield blocks, use interpolation or expansion to obtain a basis for the farfield blocks, and then apply SRRQR to the combination of these two bases to obtain the U, V, R and W matrices. The D and B matrices are submatrices of the original kernel matrix, and their indices are readily available after the computation of the U, V, R, W matrices. In order to distinguish between column and row indices associated with a node i, we use the superscript row to mark its row indices and col to mark its column indices. For example, i^row and i^col denote the indices of the points from X and Y contained in Ω_i, respectively.

Assuming the HSS tree T has L levels, the HSS construction algorithm traverses T through the levels l = L, L − 1, . . . , 2. Before the construction, two intermediate sets ĩ^row and ĩ^col are initialized as follows for each node i:

\[
\tilde{\mathbf{i}}^{\mathrm{row}}=\begin{cases}\mathbf{i}^{\mathrm{row}} & \text{if } i \text{ is a leaf},\\ \emptyset & \text{otherwise},\end{cases}\qquad
\tilde{\mathbf{i}}^{\mathrm{col}}=\begin{cases}\mathbf{i}^{\mathrm{col}} & \text{if } i \text{ is a leaf},\\ \emptyset & \text{otherwise},\end{cases}\tag{20}
\]

where the index sets i^row and i^col have been saved for each leaf node after the partition of Ω.

Let N_i be the set of nodes that are nearfield to node i. We set N_i = ∅ when i = root. For the other cases, N_i is defined below, where p_i denotes the parent of i:

\[
N_i=\bigl\{k\in\mathcal T:\ \text{either } k \text{ is a sibling of } i,\ \text{or } p_k\in N_{p_i}\text{ and }(i,k)\notin\mathrm{Sep},\ \text{or } k \text{ is a leaf such that } k\in N_{p_i}\text{ and }(i,k)\notin\mathrm{Sep}\bigr\},\tag{21}
\]

where Sep denotes the set of all well-separated pairs of subdomains corresponding to a given partition:

\[
\mathrm{Sep}:=\{(i,j):\ \Omega_i,\Omega_j\ \text{are well-separated}\}.\tag{22}
\]

See Figure 5 for a pictorial illustration. Note that when i is a child of the root, N_{p_i} is empty and so only the first case can take place (k is a sibling of i). We also remark that the third case (k is a leaf such that k ∈ N_{p_i} and (i, k) ∉ Sep) is required for non-uniform distributions; it is empty if the distribution of the points is uniform. It is easy to see that if Ω = [0, 1] and X = Y is uniformly distributed, T is a perfect binary tree. In addition, if the separation ratio is set to τ = 0.5, then for any nonroot node i, N_i contains at most two nodes.

For each node i, let i denote the index set of the points in X ∩ Ω_i. Similarly, j represents the index set of the points in Y ∩ Ω_j. Namely, X_i = X ∩ Ω_i and Y_j = Y ∩ Ω_j.

Remark 3.1
Since the HSS structure [54, 5] is associated with a binary tree regardless of the dimension of the problem (see Section 3.2), bisection is used throughout the adaptive partition to construct HSS matrices. For example, given a domain or a curve enclosed in a square in R^2, we use bisection


in the horizontal direction and in the vertical direction alternately in consecutive stages of the adaptive partition, i.e., if horizontal bisection is used at partition level l, then vertical bisection will be employed at level l + 1. The numerical experiments in Section 5.3 and Section 5.4 provide illustrations. This partition strategy corresponds to the geometrically regular clustering (cf. [3]), and can be generalized to the geometrically balanced clustering (cf. [3]).

For each node i at level l, the construction algorithm first applies a truncated SVD to compute an approximate column basis for the nearfield block row in terms of X_{ĩ^row}:

\[
A_i^{-}:=\bigl[A|_{\tilde{\mathbf{i}}^{\mathrm{row}}\times\tilde{\mathbf{j}}^{\mathrm{col}}}\bigr]_{j\in N_i}
=S_i\,\Sigma_i^{-}\bigl[S_j\bigr]_{j\in N_i}+\bigl[E_\Sigma^{-}|_{\tilde{\mathbf{i}}^{\mathrm{row}}\times\tilde{\mathbf{j}}^{\mathrm{col}}}\bigr]_{j\in N_i},\tag{23}
\]

where the columns of S_i and [S_j]_{j∈N_i} are the left/right singular vectors of A_i^− and Σ_i^− is a diagonal matrix composed of the corresponding singular values of A_i^−, such that the following estimate holds:

\[
\|E_\Sigma^{-}|_{\tilde{\mathbf{i}}^{\mathrm{row}}\times\tilde{\mathbf{j}}^{\mathrm{col}}}\|_F\le\sqrt{|\tilde{\mathbf{i}}^{\mathrm{row}}||\tilde{\mathbf{j}}^{\mathrm{col}}|}\;\varepsilon_{\mathrm{SVD}}\,\|A_i^{-}\|_2,\qquad\forall j\in N_i.\tag{24}
\]

Here, ε_SVD is the relative tolerance used in the truncated SVD. The matrix S_i is then taken as an approximate column basis for the nearfield block row A_i^−.

For the farfield blocks with respect to X_{ĩ^row}, a column basis Û_i can be easily obtained through interpolation (2) or Taylor expansion (8), which only rely on X_{ĩ^row} and Ω_i. Next, we apply SRRQR to the combined basis [Û_i, S_i] as shown below:

\[
[P_i,\,G_i,\,\hat{\mathbf{i}}^{\mathrm{row}}]=\mathrm{compr}\bigl([\hat U_i,\,S_i],\,\tilde{\mathbf{i}}^{\mathrm{row}}\bigr).\tag{25}
\]

From these outputs, we set

\[
U_i:=P_i\begin{bmatrix}I\\ G_i\end{bmatrix}\quad\text{if } i \text{ is a leaf node},\qquad
\begin{bmatrix}R_{c_1}\\ R_{c_2}\end{bmatrix}:=P_i\begin{bmatrix}I\\ G_i\end{bmatrix}\quad\text{if } i \text{ is a parent with children } c_1, c_2.\tag{26}
\]

Similarly, in order to compute the V, W generators, a truncated SVD is first applied to the nearfield block column (transposed) in terms of Y_{ĩ^col}:

\[
A_i^{|}:=\Bigl[\bigl(A|_{\tilde{\mathbf{j}}^{\mathrm{row}}\times\tilde{\mathbf{i}}^{\mathrm{col}}}\bigr)^{T}\Bigr]_{j\in N_i}
=T_i\,\Sigma_i^{|}\bigl[T_j\bigr]_{j\in N_i}+\Bigl[\bigl(E_\Sigma^{|}|_{\tilde{\mathbf{j}}^{\mathrm{row}}\times\tilde{\mathbf{i}}^{\mathrm{col}}}\bigr)^{T}\Bigr]_{j\in N_i},\tag{27}
\]

where the truncation error satisfies

\[
\|E_\Sigma^{|}|_{\tilde{\mathbf{j}}^{\mathrm{row}}\times\tilde{\mathbf{i}}^{\mathrm{col}}}\|_F\le\sqrt{|\tilde{\mathbf{i}}^{\mathrm{col}}||\tilde{\mathbf{j}}^{\mathrm{row}}|}\;\varepsilon_{\mathrm{SVD}}\,\|A_i^{|}\|_2,\qquad\forall j\in N_i.\tag{28}
\]

In the next step, we compute a row basis V̂_i for the farfield blocks with respect to Y_{ĩ^col} based on (2) or (8) and apply SRRQR to [V̂_i, T_i]:

\[
[F_i,\,H_i,\,\hat{\mathbf{i}}^{\mathrm{col}}]=\mathrm{compr}\bigl([\hat V_i,\,T_i],\,\tilde{\mathbf{i}}^{\mathrm{col}}\bigr).
\]

Then we set

\[
V_i:=F_i\begin{bmatrix}I\\ H_i\end{bmatrix}\quad\text{if } i \text{ is a leaf node},\qquad
\begin{bmatrix}W_{c_1}\\ W_{c_2}\end{bmatrix}:=F_i\begin{bmatrix}I\\ H_i\end{bmatrix}\quad\text{if } i \text{ is a parent with children } c_1, c_2.\tag{29}
\]

Once the compressions for the children nodes (at level l) are complete, we update the intermediate index sets associated with the parent node (at level l − 1) as shown below:

\[
\tilde{\mathbf{p}}^{\mathrm{row}}=\hat{\mathbf{c}}_1^{\mathrm{row}}\cup\hat{\mathbf{c}}_2^{\mathrm{row}},\qquad
\tilde{\mathbf{p}}^{\mathrm{col}}=\hat{\mathbf{c}}_1^{\mathrm{col}}\cup\hat{\mathbf{c}}_2^{\mathrm{col}},\tag{30}
\]

where c_1, c_2 are the children of p. After the bottom-up traversal of T, and hence the computation of the U, V, R, W matrices, the B and D matrices can be extracted as follows:

\[
B_i:=A|_{\hat{\mathbf{i}}^{\mathrm{row}}\times\hat{\mathbf{j}}^{\mathrm{col}}},\ \ j=\text{sibling of }i,\qquad\text{and}\qquad
D_i:=A|_{\mathbf{i}^{\mathrm{row}}\times\mathbf{i}^{\mathrm{col}}},\ \ i\ \text{a leaf node}.\tag{31}
\]
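As an illustration of one compression step (23)–(26) for a single node, the sketch below reuses the compr routine sketched in Section 2.3; the assembled nearfield block row, the analytic farfield basis, and the working index set are assumed to be supplied by the traversal:

```python
import numpy as np

def node_column_basis(A_near, U_far, i_row, eps_svd):
    # One instance of (23)-(26): truncated SVD of the nearfield block row,
    # then SRRQR-type compression of the combined nearfield/farfield basis.
    #   A_near: nearfield block row [A|i×j]_{j in N_i} (dense);
    #   U_far:  analytic farfield column basis on the same rows;
    #   i_row:  the node's working row index set.
    # Returns the permutation piv, the matrix G_i and the retained subset of
    # i_row, from which U_i or [R_c1; R_c2] is formed as in (26).
    S, sig, _ = np.linalg.svd(A_near, full_matrices=False)
    k = int(np.sum(sig > eps_svd * sig[0]))      # truncation rank, cf. (24)
    return compr(np.hstack([U_far, S[:, :k]]), i_row)
```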


3.3.2. H2 construction As mentioned at the beginning of Section 3.3, the H2 construction is simpler because nearfield blocks will not be factorized; the only complication is that an H2 matrix may be associated with a more general tree structure, where a parent can have more than two children.

SMASH for the H2 construction also follows a bottom-up levelwise traversal of T through the levels l = L, L − 1, . . . , 2. For each node i at level l, a column/row basis Û_i/V̂_i corresponding to a farfield block row/column with index ĩ^row/ĩ^col can be obtained via either interpolation (2) or Taylor expansion (8). The bases Û_i and V̂_i are then passed into SRRQR:

\[
[P_i,\,G_i,\,\hat{\mathbf{i}}^{\mathrm{row}}]=\mathrm{compr}(\hat U_i,\,\tilde{\mathbf{i}}^{\mathrm{row}})\quad\text{and}\quad
[F_i,\,H_i,\,\hat{\mathbf{i}}^{\mathrm{col}}]=\mathrm{compr}(\hat V_i,\,\tilde{\mathbf{i}}^{\mathrm{col}}).\tag{32}
\]

The H2 generators U, R, V, W are then set as

\[
U_i:=P_i\begin{bmatrix}I\\ G_i\end{bmatrix}\quad\text{if } i \text{ is a leaf node},\qquad
\begin{bmatrix}R_{c_1}\\ \vdots\\ R_{c_k}\end{bmatrix}:=P_i\begin{bmatrix}I\\ G_i\end{bmatrix}\quad\text{if } i \text{ is a parent with children } c_1,\dots,c_k,
\]
\[
V_i:=F_i\begin{bmatrix}I\\ H_i\end{bmatrix}\quad\text{if } i \text{ is a leaf node},\qquad
\begin{bmatrix}W_{c_1}\\ \vdots\\ W_{c_k}\end{bmatrix}:=F_i\begin{bmatrix}I\\ H_i\end{bmatrix}\quad\text{if } i \text{ is a parent with children } c_1,\dots,c_k.\tag{33}
\]

Again, once the compressions for the children nodes (at level l) are complete, the intermediate index sets associated with the parent node (at level l − 1) can be updated as in (30). Namely,

\[
\tilde{\mathbf{p}}^{\mathrm{row}}=\hat{\mathbf{c}}_1^{\mathrm{row}}\cup\dots\cup\hat{\mathbf{c}}_k^{\mathrm{row}},\qquad
\tilde{\mathbf{p}}^{\mathrm{col}}=\hat{\mathbf{c}}_1^{\mathrm{col}}\cup\dots\cup\hat{\mathbf{c}}_k^{\mathrm{col}},\qquad\text{with }\mathrm{children}(p)=\{c_1,\dots,c_k\}.\tag{34}
\]

Finally, analogous to (31), the coupling matrices associated with admissible leaves are extracted based on the index sets î^row and ĵ^col as

\[
B_{i,j}:=A|_{\hat{\mathbf{i}}^{\mathrm{row}}\times\hat{\mathbf{j}}^{\mathrm{col}}},\quad\forall(i,j)\in\mathcal L,\tag{35}
\]

and the nearfield blocks associated with inadmissible leaves are formed by

\[
B_{i,j}:=A|_{\mathbf{i}^{\mathrm{row}}\times\mathbf{j}^{\mathrm{col}}},\quad\forall(i,j)\in\mathcal L^{-}.\tag{36}
\]

Compared with standard H2 constructions based on either expansion or interpolation, SMASH is more efficient and easier to implement. First, in order to complete the H2 construction procedure, it suffices to provide only a column/row basis for each farfield block, which can be easily obtained via interpolation (2), for example, and the coupling matrices B_{i,j} can simply be extracted from the original matrix without resorting to any other formulas. Second, no information about the translation is required to compute the transfer matrices, because the computation of R/W is essentially the same as that of U/V at the leaf level. For all the children i of a node p, R_i/W_i are calculated jointly based on a subset of points located inside Ω_p (i.e., X_{p̃^row}/Y_{p̃^col}). Therefore, SMASH essentially builds a hierarchy of grids and computes the bases at each level of the tree by repeating the same operations (32) on each coarse grid. In addition, the use of SRRQR guarantees that each entry of the U, V, R, W matrices is bounded by a user-specified constant, which ensures the numerical stability of the construction procedure. Note that the special structure of the nested bases (33) results not only in reduced storage but also in faster matrix operations, such as matrix-vector multiplications, linear system solutions, etc. Finally, since the computation of the basis matrices only relies on information local to each node, as can be seen from (25) and (32), the construction algorithm is inherently suitable for a parallel computing environment.


3.4. Matrix-vector multiplication

Among the various hierarchical rank structured matrix operations, the matrix-vector multiplication is the most widely used, as indicated by the popularity of the tree code [44] (for H matrices) and the fast multipole method [2, 47] (for H2 matrices).

The matrix-vector multiplication for an H2 matrix A follows first a bottom-up and then a top-down traversal of T [3, 6]; it is a succinct algebraic generalization of the fast multipole method (cf. [47]). Suppose T has L levels. The node-wise version of this algorithm to evaluate z = Aq can be summarized as follows (see the sketch after the list):

1. From level l = L down to level l = 2, for each node i at level l, compute q̂_i := V_i^T q|_{i^col} if l = L; otherwise, compute q̂_i := Σ_{c ∈ children(i)} W_c^T q̂_c.

2. For each nonroot node i ∈ T, compute ẑ_i = Σ_{j:(i,j)∈L} B_{i,j} q̂_j.

3. From level l = 2 to level l = L, for each node i at level l: if l < L, for each child c of i, compute ẑ_c = ẑ_c + R_c ẑ_i; otherwise, compute z|_{i^row} = U_i ẑ_i + Σ_{j:(i,j)∈L−} B_{i,j} q|_{j^col}.
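A compact node-wise sketch for the HSS case, where L and L− take the simple form (19); the tree encoding (a postordered node list with children/sibling maps and per-leaf index sets) is a hypothetical representation chosen for brevity:

```python
import numpy as np

def hss_matvec(nodes, children, sibling, rows, cols, U, V, R, W, B, D, q):
    # nodes: postordered node ids (children precede parents);
    # rows[i]/cols[i]: leaf index sets into the global vectors.
    qhat, zhat = {}, {}
    z = np.zeros_like(q)
    for i in nodes:                          # step 1: bottom-up pass
        if not children[i]:
            qhat[i] = V[i].T @ q[cols[i]]
        else:
            qhat[i] = sum(W[c].T @ qhat[c] for c in children[i])
    for i in nodes:                          # step 2: sibling coupling (19)
        if sibling[i] is not None:
            zhat[i] = B[i] @ qhat[sibling[i]]
    for i in reversed(nodes):                # step 3: top-down pass
        if i in zhat:                        # the root carries no zhat
            for c in children[i]:
                zhat[c] = zhat[c] + R[c] @ zhat[i]
        if not children[i]:
            z[rows[i]] = U[i] @ zhat[i] + D[i] @ q[cols[i]]
    return z
```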

When X and Y are uniformly distributed in [0, 1]^d, the resulting tree T is a perfect 2^d-tree (each parent has 2^d children and all leaf nodes are at the same level). If, in addition, we assume the ordering of the points to be consistent with the postordering of the tree T, i.e., for two siblings i, j ∈ T with i < j, the index of any point in box i is smaller than that of any point in box j, then an H2 matrix A has a telescoping representation:

\[
A=B^{(L)}+U^{(L)}\Bigl(U^{(L-1)}\bigl(\dots\bigl(U^{(2)}B^{(1)}(V^{(2)})^{T}+B^{(2)}\bigr)\dots\bigr)(V^{(L-1)})^{T}+B^{(L-1)}\Bigr)(V^{(L)})^{T},\tag{37}
\]

where U^(l), V^(l) are block diagonal matrices:

\[
U^{(l)}=\begin{cases}\operatorname{diag}(U_i)_{\mathrm{lv}(i)=l} & \text{if } l=L,\\[6pt]
\operatorname{diag}\left(\begin{bmatrix}R_{c_1}\\ \vdots\\ R_{c_k}\end{bmatrix}\right)_{\mathrm{lv}(i)=l} & \text{if } l<L,\end{cases}\qquad
V^{(l)}=\begin{cases}\operatorname{diag}(V_i)_{\mathrm{lv}(i)=l} & \text{if } l=L,\\[6pt]
\operatorname{diag}\left(\begin{bmatrix}W_{c_1}\\ \vdots\\ W_{c_k}\end{bmatrix}\right)_{\mathrm{lv}(i)=l} & \text{if } l<L,\end{cases}\tag{38}
\]

where c_1, . . . , c_k denote the children of node i,

and B^(l) has a block structure. B^(L) has #{i ∈ T : lv(i) = L} × #{i ∈ T : lv(i) = L} blocks, where each nonzero block corresponds to a nearfield block, while for l < L there are #{i ∈ T : lv(i) = l + 1} × #{i ∈ T : lv(i) = l + 1} blocks in B^(l) and each nonzero block corresponds to a coupling matrix. That is, in B^(L), for lv(i) = lv(j) = L, block (l_i, l_j) is equal to B_{i,j} if (i, j) ∈ L−; in B^(l) with l < L, for lv(i) = lv(j) = l + 1, block (l_i, l_j) is equal to B_{i,j} if (i, j) ∈ L. Here l_i denotes the location of node i at its level, enumerated from left to right. If A is an HSS matrix associated with a perfect binary tree T, the structures of U^(l) and V^(l) are identical to those in (38) with k = 2, but B^(l) has a much simpler block diagonal structure:

\[
B^{(l)}=\begin{cases}\operatorname{diag}(D_i)_{i\ \text{a leaf node}} & \text{if } l=L,\\[6pt]
\operatorname{diag}\left(\begin{bmatrix}0&B_{c_1}\\ B_{c_2}&0\end{bmatrix}\right)_{\mathrm{lv}(i)=l} & \text{if } l<L,\end{cases}\tag{39}
\]

where c_1 and c_2 are the children of node i.

Based on the explicit representation (37) of an H2 matrix associated with a perfect 2^d-tree, we can write down a levelwise version of the matrix-vector multiplication (a sketch in code follows the list):


1. At level l (2 ≤ l ≤ L), compute
\[
q^{(l)}=(V^{(l)})^{T}\dots(V^{(L)})^{T}q;\tag{40}
\]

2. at level l (2 ≤ l ≤ L), compute
\[
z^{(l)}=B^{(l-1)}q^{(l)};\tag{41}
\]

3. finally, accumulate
\[
z=B^{(L)}q+U^{(L)}\Bigl(\dots\bigl(U^{(3)}\bigl(U^{(2)}z^{(2)}+z^{(3)}\bigr)+z^{(4)}\bigr)+\dots+z^{(L)}\Bigr).\tag{42}
\]

Notice that when X, Y are not uniformly distributed, T is not necessarily a perfect tree. In this case, the nodes i, j corresponding to a coupling matrix B_{i,j} may not be at the same level of T and the telescoping expansion (37) does not exist.

As for linear system solutions, H2 and HSS matrices take completely different approaches to directly solve the resulting system. Linear complexity H2 matrix solvers ([3, 6]) heavily depend on recursion to reduce the number of floating point operations, while HSS matrices can benefit from highly parallelizable ULV-type algorithms (cf. [54], [5]) due to the special structure of HSS. However, as mentioned before, since the requirement of an HSS structure is too strong, the applicability of HSS matrices is limited compared to H2 matrices.

4. COMPLEXITY ANALYSIS

This section studies the complexity of SMASH for an n × n matrix. For simplicity, we only consider the case when X = Y and the points are uniformly distributed. Under this assumption, a perfect tree T will be used for both HSS and H2 structures.

4.1. Complexity for the HSS construction

We start with the HSS construction. Since the HSS matrix structure is only efficient for one-dimensional problems, we focus on those in this section. Suppose T has L levels such that n = O(r 2^L), where r is a positive integer such that the rank of the HSS generators is bounded above by r and

\[
|\tilde{\mathbf{i}}^{\mathrm{row}}|\le 2r,\qquad|\tilde{\mathbf{i}}^{\mathrm{col}}|\le 2r,\qquad\forall i\neq\mathrm{root}.\tag{43}
\]

Notice that in the context of integral equations in potential theory, the assumption (43) in general holds only for integral equations defined on a curve. Since the points are uniformly distributed, for each nonroot node i the number of nodes in N_i is very small; we assume it is bounded above by 3. Under these assumptions, we have the following complexity estimate.

Theorem 4.1
Let T be a perfect binary tree with L levels and let (43) hold. Then the complexity of SMASH for the HSS construction in Section 3.3.1 is O(n).

Proof
Based on (43), for each nonroot node i the compression cost for its nearfield blocks in (23) is O(r^3), because the size of the nearfield block row in (23) is no larger than 2r-by-6r under the above assumption on N_i. Besides, the farfield basis matrix Û_i has column size at most r, so the cost of the SRRQR procedure in (25) is O(r^3 log_s r). Therefore, the compression cost associated with each nonroot node is O(r^3 log_s r), and the complexity of the HSS construction is O(2^L r^3 log_s r) = O(n r^2 log_s r) = O(n), since r is a fixed constant.


4.2. Complexity for the H2 construction

For the H2 construction, we assume that when X ⊂ R^d, T is a perfect 2^d-tree with L levels such that n = O(r 2^{dL}), where r is a positive integer such that the rank of the H2 generators is bounded by r and

\[
|\tilde{\mathbf{i}}^{\mathrm{row}}|\le r2^{d},\qquad|\tilde{\mathbf{i}}^{\mathrm{col}}|\le r2^{d},\qquad\forall i\neq\mathrm{root}.\tag{44}
\]

The analysis here is simpler than that of the HSS construction in Section 4.1. Since each node i only involves the compression of the farfield basis Û_i|_{ĩ^row} (as well as V̂_i|_{ĩ^col}), whose size is no larger than r2^d-by-r under the assumption (44), the compression cost associated with each node is O(r^3). As a result, the complexity of the H2 construction is O(2^{dL} r^3) = O(n r^2) = O(n) for fixed r. Thus we conclude:

Theorem 4.2
Let T be a perfect 2^d-tree with L levels and let (44) hold. Then the complexity of SMASH for the H2 construction in Section 3.3.2 is O(n).

5. NUMERICAL EXAMPLES

In this section, we present numerical examples to illustrate the performance of SMASH. All of the numerical results were obtained in MATLAB R2014b on a MacBook Air with a 1.6 GHz CPU and 8 GB of RAM. The following notation is used throughout the section:

• n: the size of A;

• tconstr: wall clock time (in seconds) for constructing Ã;

• tmatvec: wall clock time (in seconds) for multiplying Ã with a vector;

• tsol: wall clock time (in seconds) for solving Ãx = b;

• εsvd: relative tolerance used in the truncated SVD for the nearfield compression;

• rand([0, 1]): a random number sampled from the uniform distribution in [0, 1].

5.1. Choice of parameters

Since the quality of a degenerate approximation depends on the underlying kernel function, there is in general no rule of thumb for choosing the parameters to satisfy a prescribed tolerance. For completeness, we present here the heuristic approach to the choice of parameters that we use in all numerical experiments.

Given a matrix A and a tolerance ε, suppose one wants to construct a hierarchical matrix Ã (H, H2, or HSS) such that ‖A − Ã‖_max ≈ ε. Then the following approach is adopted to determine the parameters τ and r.

The choice of the separation ratio τ ∈ (0, 1) only depends on the dimension of the problem, so it is chosen first. We choose τ such that τ ≤ 0.7; in general, a slightly larger τ is preferred for higher-dimensional problems. For example, we choose τ = 0.6 for essentially one-dimensional problems, such as those in Section 5.3 and Section 5.4, and τ = 0.65 for the problems in two or three dimensions in Section 5.2.

Having chosen a separation ratio τ, we use the following function to determine the farfield approximation rank r used in constructing U_i, V_i (before the SRRQR postprocessing):

r = \begin{cases} \lfloor \log\varepsilon/\log\tau - 20 \rfloor, & \text{if } \varepsilon < 10^{-8}, \\ \lfloor \log\varepsilon/\log\tau - 15 \rfloor, & \text{if } 10^{-8} \le \varepsilon < 10^{-6}, \\ \max\{\lfloor \log\varepsilon/\log\tau - 10 \rfloor, 5\}, & \text{otherwise}, \end{cases}

where ⌊x⌋ denotes the largest integer less than or equal to x. For example, in Section 5.2.1, ε = 10^-7, τ = 0.65, r = 22; in Section 5.3, ε = 10^-8, τ = 0.6, r = 21; in Section 5.4, ε = 10^-10, τ = 0.6, r = 25.


Table I. Numerical results for 2D test in Section 5.2.1

  n = m^2    ‖Au − Ãu‖/‖Au‖    tconstr    tmatvec
     1600    6.69 × 10^-13        0.52       0.02
     6400    2.00 × 10^-12        1.97       0.07
    25600    3.65 × 10^-12        9.53       0.30
   102400    4.87 × 10^-12       39.47       1.18

5.2. Construction and matrix-vector multiplication of H2 matrices

In this section, we perform numerical experiments to test the construction and matrix-vector multiplication of an H2 approximation associated with kernels in both two and three dimensions. A complicated three-dimensional geometry (see Figure 7) is used to illustrate the robustness of the algorithm.

5.2.1. Two dimensions We first consider the kernel in (1) with dx = 1. We chose X as a uniform m × m grid in [0, 1]^2 and A = [κ(x, y)]_{x,y∈X}. The computational domain [0, 1]^2 was recursively divided into 4 subdomains until the number of points inside each subdomain was less than or equal to 50. We embedded [0, 1]^2 in the complex plane and used the truncated Taylor expansion (4) with r = 22 terms and the separation ratio τ = 0.65 to compress the farfield blocks. The error is measured by the relative error ‖Au − Ãu‖/‖Au‖, where u is a random vector of length n = m^2 with entries generated by rand([0, 1]); this metric is natural because one of the main applications of hierarchical matrices is as an alternative to the FMM for performing matrix-vector multiplications. The numerical results are reported in Table I.
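As a sketch, the reported metric can be computed as follows, where h2matvec is a hypothetical handle standing in for the H2 matrix-vector product (not an interface defined in the paper) and A is the dense kernel matrix:

```matlab
u   = rand(n, 1);                            % entries from rand([0,1])
err = norm(A*u - h2matvec(u)) / norm(A*u);   % relative error, as in Table I
```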

As can be seen from Table I, SMASH for the H2 construction and the matrix-vector multiplication described in Section 3.4 scale linearly, which is consistent with the complexity analysis in Section 4.

5.2.2. Three dimensions We consider the following point distributions in three dimensions:

• uniform distribution inside the unit cube;

• uniform distribution on the unit sphere;

• random distribution on a complicated triceratops geometry embedded in the cube [−100, 100]^3, as shown in Figure 7†,

where the kernel function for the first two cases is κ(x, y) = 1/|x − y| (x ≠ y) with κ(x, x) = 1, and the kernel for the third case is taken as κ(x, y) = log|x − y| / |x − y| (x ≠ y) with κ(x, x) = 1. Note that the second kernel function has a stronger singularity than the first one near x = y. The H2 approximation Ã is constructed based on interpolation in three dimensions, where 5 Chebyshev points are used in each direction.

The numerical results are presented in Figure 6 and Table II. It is easily seen from Figure 6 that, for each case, the time per degree of freedom remains roughly constant as the matrix size increases, which implies O(n) construction cost. For the highly non-uniform triceratops geometry, Table II demonstrates nearly linear cost in terms of the matrix size, since the corresponding tree T is no longer perfect in this case.

†The datasets are from the point cloud tools http://www.geo.tuwien.ac.at/downloads/pg/pctools/pctools.html


Figure 6. Time per degree of freedom plot for 3D test: cube (left) and sphere (right)

Table II. Numerical results for 3D triceratops example in Section 5.2.2

      n    ‖Au − Ãu‖/‖Au‖    tconstr    tmatvec
  10000    1.98 × 10^-6         9.12       0.14
  20000    3.83 × 10^-6        23.20       0.38
  40000    5.83 × 10^-6        50.39       0.74
  80000    7.20 × 10^-6       115.03       1.57

Figure 7. The 3D triceratops geometry used for the numerical experiments in Table II.

5.3. Cauchy-like matrices

We consider in this section the numerical solution of Cauchy-like linear systems. It is known that Cauchy-like matrices are related to other types of structured matrices including Toeplitz matrices, Vandermonde matrices, Hankel matrices, and their variants [55, 56, 57, 58, 59, 60]. Consider the kernel κ(x, y) = 1/(x − y), x ≠ y ∈ C. Let x_i, y_j (i, j = 1 : n) be 2n pairwise distinct points in C. The Cauchy matrix is then given by C = [κ(x_i, y_j)]_{i,j=1:n}, which is known to be invertible [61]. Given two matrices w, v ∈ C^{n×p}, the (i, j)-entry of a Cauchy-like matrix A associated with generators w, v is defined by [62]

a_{i,j} = \frac{1}{x_i - y_j} \sum_{l=1}^{p} w_{i,l} v_{j,l}.   (45)

For simplicity, we consider the case p = 2, i.e., w (as well as v) is composed of two column vectors. Denote by w_1, w_2, v_1, v_2 the column vectors in w, v, i.e., w = [w_1, w_2], v = [v_1, v_2]. It can be seen that A can be written as

A = diag(w_1) C diag(v_1) + diag(w_2) C diag(v_2).   (46)
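As a small sanity check (our own sketch, with points chosen as in (47) below), the equivalence of (45) and (46) can be verified entrywise:

```matlab
n = 500;  p = 2;
x = (1:n).'/(n+1);  y = x + 1e-7*rand(n,1);     % points as in (47) below
w = rand(n, p);  v = rand(n, p);                % generators
C = 1 ./ bsxfun(@minus, x, y.');                % C(i,j) = 1/(x_i - y_j)
A = diag(w(:,1))*C*diag(v(:,1)) + diag(w(:,2))*C*diag(v(:,2));  % formula (46)
i = 3;  j = 7;                                  % spot check against (45)
aij = (w(i,:)*v(j,:).') / (x(i) - y(j));
assert(abs(A(i,j) - aij) <= 1e-12*abs(aij));
```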

Copyright c© 2010 John Wiley & Sons, Ltd. Numer. Linear Algebra Appl. (2010)Prepared using nlaauth.cls DOI: 10.1002/nla

Page 19: SMASH: Structured Matrix Approximation by Separation and ...yxi26/PDF/smash.pdf · The resulting hierarchical rank structured matrices [3,4,5,6], culminated in H2 matrices[7,6], provide

STRUCTURED MATRIX APPROXIMATION BY SEPARATION AND HIERARCHY 19

Existing approaches for solving Cauchy or Cauchy-like linear systems associated with points in R mainly rely on variants of Gaussian elimination with pivoting techniques. For example, fast O(n^2) algorithms for solving Cauchy linear systems can be found in [63, 64, 65, 66], etc.; a superfast O(n log^3 n) algorithm based on a sequential block Gaussian elimination process was proposed in [67]. The performance of most existing methods depends on the distribution of the point sets x, y. As pointed out in [64], if the two sets of points x, y cannot be separated, for example, when they are interlaced, existing algorithms (for example, the BP-type algorithm of [63]) suffer from backward stability issues. Moreover, due to the use of pivoting techniques, the accuracy of existing algorithms depends heavily on the ordering of the points [63, 64], and the analysis is limited to the case when the points are in R.

Therefore, in view of the issues mentioned above, we assume the x_i, y_j are mixed together such that in the adaptive partition (see Section 3.1), each box contains the same number of points from x_i and y_j. We also consider the case where x_i, y_j are distributed on a curve in R^2, as illustrated in Fig. 9, to demonstrate that the algorithm is independent of the ordering of the points and is applicable to points in C.

We construct the HSS approximation Ã to A using SMASH as discussed in Section 3.3.1 and then solve the linear system associated with Ã using a fast ULV factorization solver [5]. Due to the choice of the stable expansion in (4), arbitrarily high approximation accuracy can be achieved without stability issues [49].

Note that the HSS approximation to C can be readily obtained as in Section 3. Consequently, the HSS approximations to diag(w_1)C diag(v_1) and diag(w_2)C diag(v_2) can be derived by modifying the U, V, D generators, respectively. The sum of these two HSS representations is also an HSS matrix whose generators can be easily obtained using the technique presented in [68] by merging the two sets of HSS generators. Hence the HSS approximation to A is derived.

In the first experiment, the point sets {x_k}_{k=1}^n, {y_k}_{k=1}^n are chosen as follows:

x_k = k/(n+1),  y_k = x_k + 10^{-7} · rand([0, 1]),  k = 1, . . . , n.   (47)

In the second experiment, the point sets are distributed on the curve illustrated in Fig. 8, which is parametrized by

γ(t) = e^{−πi/6} [(0.5 + sin(4πt)) cos(2πt) + i (0.5 + sin(4πt)) sin(2πt)],  t ∈ [0, 1],

and {x_k}_{k=1}^n, {y_k}_{k=1}^n are given by

x_k = γ(t_k),  y_k = γ(t_k + 10^{-7} · rand([0, 1])),  t_k = k/(n+1),  k = 1, . . . , n.   (48)
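For example, the second point set can be generated as in the following sketch; plotting x_k reproduces the left panel of Fig. 8:

```matlab
n   = 12800;
gam = @(t) exp(-pi*1i/6) .* ((0.5 + sin(4*pi*t)).*cos(2*pi*t) ...
                        + 1i*(0.5 + sin(4*pi*t)).*sin(2*pi*t));
t  = (1:n).'/(n+1);
xk = gam(t);                     % points on the honeybee curve
yk = gam(t + 1e-7*rand(n,1));    % perturbed parameters, as in (48)
plot(real(xk), imag(xk), '.');  axis equal;
```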

In the third experiment, the point set {x_k}_{k=1}^n is on the snail geometry in C as illustrated in Fig. 9, and the point set {y_k}_{k=1}^n is given by

y_k = x_k + 10^{-7} · rand([0, 1]),  k = 1, . . . , n.   (49)

For the generators of this Cauchy-like matrix, i.e., w = [w_1, w_2], v = [v_1, v_2], we chose w_l, v_l (l = 1, 2) such that each entry in those vectors was given by rand([0, 1]). In order to solve the linear system Au = b, we constructed an HSS approximation Ã to A in (46) by SMASH. 1D boxes (i.e., intervals) and 2D boxes (i.e., rectangles) were used in the adaptive partition for the point sets in (47) and (49), respectively. The right subfigures in Fig. 9 illustrate the adaptive partition using 2D boxes as described in Remark 3.1. We chose the separation ratio τ = 0.6, and the adaptive partition stopped when the number of points inside each box was less than or equal to 50. The nearfield blocks were compressed through the SVD with truncation tolerance 10^-9. The exact solution was set to be a column vector u of length n with entries generated by rand([0, 1]), and the right-hand side b was formed by b = Au.

The numerical results for the three Cauchy-like matrix problems are reported in Table III. From Table III, we see that the computational times for both the construction and the solution scale linearly, and that SMASH as described in Section 3.3.1 is quite robust with respect to complex geometries. Moreover, it can be seen that SMASH is independent of the ordering of the points.



Figure 8. Honeybee geometry used for the numerical experiments in Table III. Left: Original curve; Right: Adaptive partition of the curve for the case when n = 12800 in Table III.


Figure 9. Snail geometry used for the numerical experiments in Table III. Left: Original curve; Right: Adaptive partition of the curve for the case when n = 12800 in Table III.

Table III. Numerical results for solving the Cauchy-like matrix when {x_k}_{k=1}^n are distributed on three different curves.

  curve      n       ‖u − ũ‖/‖u‖      ‖Au − Ãu‖/‖Au‖    tconstr    tsol
  [0, 1]     1600    7.69 × 10^-12    5.56 × 10^-15       0.33     0.11
             3200    1.01 × 10^-09    3.87 × 10^-14       0.63     0.19
             6400    5.58 × 10^-11    5.58 × 10^-14       1.29     0.35
            12800    1.47 × 10^-08    5.87 × 10^-14       2.58     0.69
  honeybee   1600    9.37 × 10^-11    8.09 × 10^-14       1.14     0.28
             3200    9.78 × 10^-10    5.60 × 10^-13       2.22     0.51
             6400    1.55 × 10^-09    9.16 × 10^-13       4.42     0.96
            12800    2.76 × 10^-09    1.54 × 10^-12       8.49     1.87
  snail      1600    2.65 × 10^-11    1.51 × 10^-15       1.58     0.36
             3200    7.48 × 10^-11    1.82 × 10^-15       3.28     0.73
             6400    8.38 × 10^-11    3.49 × 10^-15       6.46     1.42
            12800    3.86 × 10^-10    2.59 × 10^-15      11.82     2.64

5.4. Integral equations

In this section, we solve Laplace boundary value problems via the integral equation method. Assume Ω is a smooth simply-connected domain in R^2 and let Γ = ∂Ω be the boundary of Ω, of class C^2. Consider the interior Dirichlet problem: find u ∈ C^2(Ω) ∩ C(Ω̄) such that

Δu = 0 in Ω,
u = u_D on Γ,   (50)

where u_D ∈ C(Γ) is given. With smooth boundary curves and Dirichlet data, the well-posedness of this problem is well studied in potential theory [69, 70].


The fundamental solution and its gradient (in terms of y) for the Laplace equation in R^2 are given by

Φ(x, y) = −\frac{1}{2π} \log|x − y|,  and  ∇_y Φ(x, y) = −\frac{1}{2π} \frac{y − x}{|x − y|^2}.

Let ν_y denote the unit outer normal at the point y ∈ Γ. The double layer potential with continuous density σ is given by

Kσ(x) := \int_\Gamma \frac{\partial \Phi(x, y)}{\partial \nu_y}\, \sigma(y)\, ds_y = \int_0^1 \frac{\partial \Phi(x, r(t))}{\partial \nu_y}\, |r'(t)|\, \sigma(r(t))\, dt,  x ∈ Ω,   (51)

where we assume Γ is parametrized by r(t) : [0, 1] → R^2. Given Dirichlet data u_D ∈ C(Γ) in (50), we solve the following integral equation for σ ∈ C(Γ):

(K − \tfrac{1}{2} I) σ = u_D  on Γ.   (52)

It is well known [70] that the problem above for σ ∈ C(Γ) is well-posed, and the corresponding double layer potential u := Kσ solves the interior Dirichlet problem (50).

Denote the kernel of the second integral in (51) by

κ(s, t) := \frac{\partial \Phi(r(s), r(t))}{\partial \nu_y}\, |r'(t)|.   (53)

Several Laplace problems (50) with the same exact solution but different domains are considered here. The first domain Ω is a ram head whose boundary curve Γ is parametrized by r(t) = (r_1(t), r_2(t)) for t ∈ [0, 1]:

r_1(t) = 2 cos(2πt),
r_2(t) = 1 + sin(2πt) − 1.4 cos^4(4πt).   (54)

The second domain is a sunflower whose boundary curve Γ is parametrized by

r_1(t) = (1.3 + 1.25 cos(40πt)) cos(2πt),
r_2(t) = (1.3 + 1.25 cos(40πt)) sin(2πt).   (55)

We chose the Dirichlet data u_D such that the exact solution of (50) is

u(x) = log|x − x_0|,

where the source point x_0 = (2, 1.5) lies in the exterior of Ω. Illustrations of the curves parametrized in (54) and (55) are shown in Fig. 10 and Fig. 11, respectively. We used the Nyström method with the trapezoidal rule to discretize (52). Since the curve Γ and the kernel are both smooth, the Nyström discretization converges at a rate proportional to that of the quadrature rule.
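To make the discretization concrete, the following is a minimal dense Nyström sketch of (52) on the ram head curve (54). It is our own sketch: it assumes a counterclockwise parametrization with outward normal ν = (r_2', −r_1')/|r'|, and it uses a dense backslash solve in place of the HSS construction and ULV solver, so it illustrates only the discretization, not the fast algorithm:

```matlab
n  = 640;                                         % number of quadrature points
t  = (0:n-1).'/n;                                 % trapezoidal nodes, weight 1/n
c2 = cos(2*pi*t); s2 = sin(2*pi*t); c4 = cos(4*pi*t); s4 = sin(4*pi*t);
P   = [2*c2, 1 + s2 - 1.4*c4.^4];                 % r(t), equation (54)
dP  = [-4*pi*s2, 2*pi*c2 + 22.4*pi*c4.^3.*s4];    % r'(t)
ddP = [-8*pi^2*c2, -4*pi^2*s2 + 89.6*pi^2*(c4.^4 - 3*c4.^2.*s4.^2)];  % r''(t)
sp  = sqrt(sum(dP.^2, 2));                        % |r'(t)|
nu  = [dP(:,2), -dP(:,1)] ./ [sp, sp];            % outward unit normal (assumed CCW)
% Kernel (53): kappa(s,t) = (1/2pi) nu(t).(r(s)-r(t)) / |r(s)-r(t)|^2 * |r'(t)|
DX = bsxfun(@minus, P(:,1), P(:,1).');
DY = bsxfun(@minus, P(:,2), P(:,2).');
R2 = DX.^2 + DY.^2;  R2(1:n+1:end) = 1;           % avoid 0/0 on the diagonal
Kn = bsxfun(@times, DX, nu(:,1).') + bsxfun(@times, DY, nu(:,2).');
K  = bsxfun(@times, Kn ./ R2, sp.') / (2*pi);
K(1:n+1:end) = (dP(:,2).*ddP(:,1) - dP(:,1).*ddP(:,2)) ./ (4*pi*sp.^2);  % smooth limit
A  = K/n - 0.5*eye(n);                            % Nystrom matrix for (52)
x0 = [2, 1.5];                                    % exterior source point
uD = log(sqrt(sum(bsxfun(@minus, P, x0).^2, 2))); % u_D = log|x - x0| on Gamma
sigma = A \ uD;                                   % dense solve (HSS + ULV in the paper)
xs = [0.1, 0.1];                                  % interior evaluation point
d  = bsxfun(@minus, xs, P);  rr = sum(d.^2, 2);   % x* - r(t)
uxs = sum((d(:,1).*nu(:,1) + d(:,2).*nu(:,2)) ./ rr .* sp .* sigma) / (2*pi*n);
err = abs(uxs - log(norm(xs - x0)))               % compare with Table IV
```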

As in Section 5.3, we applied the HSS matrix techniques to approximate and solve the matrix resulting from the Nyström discretization of the integral equation (52). The adaptive partition based on bisection (cf. Remark 3.1) was applied to a box covering the domain Ω in R^2, and each box at the leaf level contained no more than 50 quadrature points on Γ. Empty boxes were discarded during the partition. Illustrations of the adaptive partition are shown in the right subfigures of Fig. 10 and Fig. 11, when 10240 quadrature points are in use. A binary tree T was then generated corresponding to each adaptive partition. In the construction of the HSS matrix, the bases for the farfield blocks were approximated by polynomial interpolation (with 25 interpolating points) with respect to the separation ratio 0.6, and the bases for the nearfield blocks were computed by the truncated SVD with tolerance εsvd = 10^-11. In order to test the convergence of the discretization, for the curve in Fig. 10, we compared the numerical solution ũ with the exact solution u by evaluating them at the point x∗ = (0.1, 0.1) inside Ω. For the curve in Fig. 11, the evaluation point was chosen as x∗ = (1.5, 0) inside Ω. The numerical results for the ram head and sunflower problems are shown in Table IV and Table V, respectively.



Figure 10. Ram head domain for the Dirichlet problem (50) with source point and evaluation point marked as green '+' and red '*', respectively. Left: Original curve; Right: Adaptive partition of the ram head boundary curve for the case when 10240 quadrature points are used in Table IV.


Figure 11. Sunflower domain for the Dirichlet problem (50) with source point and evaluation point marked as green '+' and red '*', respectively. Left: Original curve; Right: Adaptive partition of the sunflower boundary curve for the case when 10240 quadrature points are used in Table V.

Table IV. Numerical results for solving a 2D Laplace Dirichlet problem in a ram head domain as shown in Fig. 10.

      n    |u(x∗) − ũ(x∗)|    ‖A − Ã‖max       cond(A)        tconstr    tsol
    160    5.03 × 10^-08      1.06 × 10^-10    6.15 × 10^1      0.42     0.107
    320    9.54 × 10^-11      6.81 × 10^-10    6.92 × 10^1      1.54     0.042
    640    1.91 × 10^-12      7.98 × 10^-10    6.00 × 10^1      2.93     0.127
   1280    8.22 × 10^-13      2.26 × 10^-09    6.02 × 10^1      5.41     0.103
   2560    7.78 × 10^-13      3.90 × 10^-09    6.02 × 10^1      9.76     0.177
   5120    1.50 × 10^-13      9.63 × 10^-09    6.02 × 10^1     18.74     0.282
  10240    1.96 × 10^-12      1.01 × 10^-08    6.02 × 10^1     34.78     0.591

Table V. Numerical results for solving a 2D Laplace Dirichlet problem in a sunflower domain as shown in Fig. 11.

      n    |u(x∗) − ũ(x∗)|    ‖A − Ã‖max       cond(A)        tconstr    tsol
    640    2.96 × 10^-02      1.20 × 10^-08    5.65 × 10^3      5.42     0.105
   1280    1.25 × 10^-03      2.65 × 10^-08    2.23 × 10^3     16.80     0.256
   2560    1.88 × 10^-06      3.56 × 10^-08    1.85 × 10^3     41.76     0.624
   5120    1.02 × 10^-10      4.27 × 10^-07    1.72 × 10^4    102.65     1.429
  10240    1.66 × 10^-11      3.80 × 10^-07    1.12 × 10^4    192.78     2.165
  20480    8.03 × 10^-10      9.55 × 10^-07    6.86 × 10^3    316.68     3.219



From Table IV and Table V, it can be seen that the HSS matrix methods achieve linear complexity at both the construction and solution stages. In addition, it is worth noting that the numerical solutions for these Laplace Dirichlet problems converge exponentially fast regardless of the complicated geometries, and the solver is quite robust. For example, in view of Table V, which corresponds to the seemingly complicated geometry in Fig. 11, 10 digits of accuracy can be achieved using only 5120 quadrature points. The ε-rank for the ram head geometry (see Table VI in Section 5.5) is much smaller than for the sunflower geometry, since the former is essentially a 1D geometry while the latter is closer to a 2D geometry.

5.5. Nearly optimal compression

In this section, we compare the exact numerical rank of the largest off-diagonal block of A with the approximation rank obtained from SMASH. The numerical results show that the approximation rank is nearly optimal in the sense that it differs from the exact numerical rank by a small constant that is roughly independent of the kernel and the matrix size.

Definition 5.1 (ε-rank)
Let σ_1 ≥ σ_2 ≥ · · · ≥ σ_r be the singular values of a nonzero matrix A. Given a tolerance ε ∈ (0, 1), the relative ε-rank of A, denoted by r_ε(A), is the largest index i such that σ_i ≥ ε σ_1.
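In code, the ε-rank is a one-liner on the singular values; a minimal MATLAB sketch (the name eps_rank is ours):

```matlab
function r = eps_rank(A, epsilon)
% Relative eps-rank of Definition 5.1: the number of singular values
% that are at least epsilon times the largest one.
s = svd(A);
r = sum(s >= epsilon * s(1));
end
```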

Consider the numerical examples in Section 5.4, where different curves give rise to different kernels according to (53). Let i, j be the two children of the root node. We focus on the (largest) off-diagonal block A_{i_row × j_col}.

We consider three tolerances: ε = 10^-3, 10^-6, 10^-10. We list the size of A_{i_row × j_col}, the exact ε-rank, and the approximation rank characterized by the size of the B_i generator, with size(B_i) := the maximum of its row size and column size. The results are reported in Table VI.

Table VI. Comparison of the exact ε-rank and the approximation rank of M = A_{i_row × j_col} for the ram head geometry in Fig. 10 and the sunflower geometry in Fig. 11, with ε = 10^-3, 10^-6, 10^-10.

                               ε = 10^-3         ε = 10^-6         ε = 10^-10
  geometry      n   size(M)   rε(M)  size(Bi)   rε(M)  size(Bi)   rε(M)  size(Bi)
  ram head   1280      640      13      19        25      45        43      70
             2560     1280      13      18        25      45        43      71
             5120     2560      13      18        25      45        43      72
            10240     5120      13      19        25      45        43      72
  sunflower  1280      640      67      84       111     141       151     185
             2560     1280      83     117       159     187       213     251
             5120     2560      83     125       182     226       298     347
            10240     5120      83     123       187     237       328     380
            20480    10240      83     120       187     238       327     382

Note that no a priori information is needed to determine the approximation rank, as it is derived solely from the prescribed tolerance and the construction algorithm. Thus the numerical results also imply that the proposed method in Section 5.1 for choosing the parameters is satisfactory.

5.6. Storage cost

In this section, we demonstrate the benefit of the special structure in the generators produced by SMASH. We store the HSS generators using the strategy mentioned at the end of Section 2.3. For comparison, we also compute the cost of storing the generators as dense matrices, denoted by HSS0, as well as the storage cost for the original dense matrix. The test matrix A of order 10240 is derived from the kernels in Section 5.4, and all entries are stored in double precision. The results are collected in Table VII for different geometries and different approximation accuracies.


Table VII. A comparison of the storage costs (MB) of the HSS generators using the storage strategy in Section 2.3 (SMASH) and the standard approach of storing dense generators (HSS0), for a square matrix A of order 10240.

  εfar      εsvd      geometry     storage(A)    HSS0     SMASH
  10^-4     10^-5     ram head        800          5.9      4.4
  10^-4     10^-5     sunflower       800         35.2     11.0
  10^-10    10^-11    ram head        800         23.2      8.8
  10^-10    10^-11    sunflower       800        145.7     31.3

Table VIII. Memory comparison between H2Lib and SMASH on the Newton potential kernel 1/|x − y|, where n is the number of 2D points used.

      n     2.5 × 10^5    5 × 10^5     1 × 10^6     2 × 10^6     4 × 10^6
  H2Lib     919.0 MB      1705.8 MB    3688.0 MB    6841.7 MB    14832.4 MB
  SMASH     802.6 MB      1564.8 MB    3259.7 MB    6332.5 MB    13139.2 MB

Table IX. Memory comparison between H2Lib and SMASH on the exponential kernel e^{−|x−y|}, where n is the number of 2D points used.

      n     2.5 × 10^5    5 × 10^5     1 × 10^6     2 × 10^6     4 × 10^6
  H2Lib     767.8 MB      1321.6 MB    3050.5 MB    5305.8 MB    12290.9 MB
  SMASH     726.1 MB      1332.4 MB    2868.0 MB    5116.0 MB    11086.5 MB

Table X. Memory comparison between H2Lib and SMASH on the logarithmic kernel −0.5 log(|x − y|^2), where n is the number of 2D points used.

      n     2.5 × 10^5    5 × 10^5     1 × 10^6     2 × 10^6     4 × 10^6
  H2Lib     806.7 MB      1393.2 MB    3149.1 MB    5534.6 MB    12634.6 MB
  SMASH     766.3 MB      1454.2 MB    3095.1 MB    5843.5 MB    12400.0 MB

The reduction in storage justifies the use of the strong rank-revealing QR algorithm in the construction, and we see that SMASH is quite cheap even when a high approximation accuracy is in use.

5.7. Storage comparison with H2 recompression algorithm

In this section, we provide a memory comparison between the SMASH H2 algorithm and the H2 recompression algorithm implemented in H2Lib [72]. In order to make a fair comparison, we tested both algorithms on the same machine, with the same kernels (Newton, exponential, and logarithmic) and the same points. Both algorithms used the same interpolation method (5 interpolation points per direction) to obtain the bases for the farfield blocks. The recompression tolerance was set to 10^-4 in both algorithms, which resulted in an O(10^-6) relative error (measured as ‖Au − Ãu‖/‖Au‖) for matrix-vector products, using double precision data types.

It can be seen from Tables VIII–X that the SMASH H2 algorithm uses less memory in most experiments, except for the exponential kernel test with 5 × 10^5 points and the logarithmic kernel tests with 5 × 10^5 and 2 × 10^6 points. The other main difference between the two algorithms is that the SMASH algorithm in general has a much smaller peak memory usage. For example, the SMASH


algorithm has a peak memory of 14.75 GB for the logarithmic kernel with 4 × 10^6 points, while H2Lib uses 47.92 GB of peak memory.

6. CONCLUSION

We presented a unified framework, called SMASH, to construct either an n × n HSS or H2 matrix with O(n) cost. One appealing feature of this scheme is its simple implementation, which only requires a routine to compress farfield blocks. In addition, SMASH can greatly reduce the memory cost relative to existing analytic construction schemes. The numerical experiments illustrated the efficiency and robustness of SMASH through examples with various point distributions and kernel matrices.

We plan to extend this scheme to highly oscillatory kernels and to develop approximate inverse-type preconditioners for solving the resulting linear systems with H2 matrix representations.

ACKNOWLEDGMENTS

We would like to thank the anonymous referees for their useful suggestions, which led to substantial improvements of the original version of this paper. YX would like to thank Prof. Ming Gu for fruitful discussions about the strong rank-revealing QR algorithm and Prof. Steffen Börm for his kind explanation of the new developments of H2 matrices and for providing us with a kernel matrix interface for his H2Lib package [72].

REFERENCES

1. Rokhlin V. Rapid solution of integral equations of classical potential theory. Journal of Computational Physics 1985; 60(2):187–207, doi:10.1016/0021-9991(85)90002-6. URL http://www.sciencedirect.com/science/article/pii/0021999185900026.

2. Greengard L, Rokhlin V. A fast algorithm for particle simulations. J. Comput. Phys. 1987; 73:325–348.
3. Börm S. Efficient numerical methods for non-local operators: H2-matrix compression, algorithms and analysis. EMS Tracts in Mathematics, European Mathematical Society: Zürich, 2010. URL http://opac.inria.fr/record=b1133579.
4. Börm S, Grasedyck L, Hackbusch W. Introduction to hierarchical matrices with applications. Eng. Anal. Bound. Elem. 2003; 27(5):405–422, doi:10.1016/S0955-7997(02)00152-2. URL http://www.mis.mpg.de/de/publications/preprints/2002/prepr2002-18.html.
5. Chandrasekaran S, Gu M, Pals T. A fast ULV decomposition solver for hierarchically semiseparable representations. SIAM J. Matrix Anal. Appl. 2006; 28(3):603–622, doi:10.1137/S0895479803436652. URL http://dx.doi.org/10.1137/S0895479803436652.
6. Hackbusch W. Hierarchical Matrices: Algorithms and Analysis. Springer, 2015.
7. Hackbusch W, Khoromskij B, Sauter S. On H2-matrices. Lectures on Applied Mathematics, Bungartz HJ, Hoppe RHW, Zenger C (eds.). Springer: Berlin, 2000; 9–29. URL http://www.mis.mpg.de/de/publications/preprints/1999/prepr1999-50.html.
8. Xi Y, Xia J, Cauley S, Balakrishnan V. Superfast and stable structured solvers for Toeplitz least squares via randomized sampling. SIAM J. Matrix Anal. Appl. 2014; 35(1):44–72, doi:10.1137/120895755. URL http://dx.doi.org/10.1137/120895755.
9. Xia J. Multi-layer hierarchical structures and factorizations. submitted to SIAM J. Matrix Anal. Appl.
10. Xia J, Xi Y, Gu M. A superfast structured solver for Toeplitz linear systems via randomized sampling. SIAM J. Matrix Anal. Appl. 2012; 33(3):837–858, doi:10.1137/110831982. URL http://dx.doi.org/10.1137/110831982.

11. Bebendorf M. Approximation of boundary element matrices. Numer. Math. 2000; 86(4):565–589, doi:10.1007/PL00005410. URL http://dx.doi.org/10.1007/PL00005410.

12. Gillman A, Young PM, Martinsson PG. A direct solver with O(n) complexity for integral equations on one-dimensional domains. Frontiers of Mathematics in China 2012; 7(2):217–247, doi:10.1007/s11464-012-0188-3. URL http://dx.doi.org/10.1007/s11464-012-0188-3.

13. Ho KL, Greengard L. A fast direct solver for structured linear systems by recursive skeletonization. SIAM Journalon Scientific Computing 2012; 34(5):A2507–A2532, doi:10.1137/120866683. URL http://dx.doi.org/10.1137/120866683.

14. Ho KL, Ying L. Hierarchical interpolative factorization for elliptic operators: Integral equations. Communications on Pure and Applied Mathematics 2016; 69(7):1314–1353, doi:10.1002/cpa.21577. URL http://dx.doi.org/10.1002/cpa.21577.


15. Liu X, Xia J, de Hoop MV. Parallel randomized and matrix-free direct solvers for large structured dense linear systems. SIAM J. Sci. Comput. 2016; 38(5):S508–S538, doi:10.1137/15M1023774. URL http://dx.doi.org/10.1137/15M1023774.

16. Martinsson P, Rokhlin V. A fast direct solver for boundary integral equations in two dimensions. J. Comput. Phys.2005; 205(1):1–23, doi:http://dx.doi.org/10.1016/j.jcp.2004.10.033.

17. Borne SL, Grasedyck L. H-matrix preconditioners in convection-dominated problems. SIAM J. Matrix Anal. Appl.2006; 27(4):1172–1183, doi:10.1137/040615845. URL http://dx.doi.org/10.1137/040615845.

18. Xia J. Efficient structured multifrontal factorization for general large sparse matrices. SIAM J. Sci. Comput. 2013;35(2):A832–A860, doi:10.1137/120867032. URL http://dx.doi.org/10.1137/120867032.

19. Grasedyck L. Existence of a low rank or -matrix approximant to the solution of a sylvester equation. Numer. LinearAlgebra Appl. 2004; 11(4):371–389, doi:10.1002/nla.366. URL http://dx.doi.org/10.1002/nla.366.

20. Grasedyck L, Hackbusch W, Khoromskij NB. Solution of large scale algebraic matrix riccati equations by useof hierarchical matrices. Computing 2003; 70(2):121–165, doi:10.1007/s00607-002-1470-0. URL http://dx.doi.org/10.1007/s00607-002-1470-0.

21. Benner P, Mach T. Computing all or some eigenvalues of symmetric Hl-matrices. SIAM J. Sci. Comput. 2012;34(1):A485–A496, doi:10.1137/100815323. URL http://dx.doi.org/10.1137/100815323.

22. Xi Y, Xia J, Chan R. A fast randomized eigensolver with structured ldl factorization update. SIAM J. Matrix Anal.Appl. 2014; 35(3):974–996, doi:10.1137/130914966. URL http://dx.doi.org/10.1137/130914966.

23. Hackbusch W, Börm S. Data-sparse approximation by adaptive H2-matrices. Computing 2002; 69(1):1–35, doi:10.1007/s00607-002-1450-4. URL http://www.mis.mpg.de/de/publications/preprints/2001/prepr2001-86.html.
24. Xi Y, Xia J. On the stability of some hierarchical rank structured matrix algorithms. SIAM J. Matrix Anal. Appl. 2016; 37(3):1279–1303.
25. Xia J, Chandrasekaran S, Gu M, Li XS. Fast algorithms for hierarchically semiseparable matrices. Numer. Linear Algebra Appl. 2010; 17(6):953–976, doi:10.1002/nla.691. URL http://dx.doi.org/10.1002/nla.691.
26. Gu M, Eisenstat SC. Efficient algorithms for computing a strong rank-revealing QR factorization. SIAM J. Sci. Comput. 1996; 17(4):848–869, doi:10.1137/0917055. URL http://dx.doi.org/10.1137/0917055.
27. Börm S. Data-sparse approximation of non-local operators by H2-matrices. Linear Algebra Appl. 2007; 422(2):380–403, doi:10.1016/j.laa.2006.10.021. URL http://www.sciencedirect.com/science/article/pii/S0024379506004745.

28. Bebendorf M, Rjasanow S. Adaptive low-rank approximation of collocation matrices. Computing 2003; 70(1):1–24, doi:10.1007/s00607-002-1469-6. URL http://dx.doi.org/10.1007/s00607-002-1469-6.

29. Bebendorf M, Venn R. Constructing nested bases approximations from the entries of non-local operators.Numer. Math. 2012; 121(4):609–635, doi:10.1007/s00211-012-0449-9. URL http://dx.doi.org/10.1007/s00211-012-0449-9.

30. Börm S, Grasedyck L. Hybrid cross approximation of integral operators. Numer. Math. 2005; 101(2):221–249, doi:10.1007/s00211-005-0618-1. URL http://dx.doi.org/10.1007/s00211-005-0618-1.
31. Tyrtyshnikov E. Mosaic-skeleton approximations. CALCOLO Jun 1996; 33(1):47–57.
32. Goreinov SA, Tyrtyshnikov EE, Yeremin AY. Matrix-free iterative solution strategies for large dense linear systems. Numer. Linear Algebra Appl. 1997; 4(4):273–294, doi:10.1002/(SICI)1099-1506(199707/08)4:4<273::AID-NLA97>3.0.CO;2-T.

33. Goreinov S, Tyrtyshnikov E, Zamarashkin N. A theory of pseudoskeleton approximations. Linear Algebra Appl. 1997; 261(1):1–21, doi:10.1016/S0024-3795(96)00301-1. URL http://www.sciencedirect.com/science/article/pii/S0024379596003011.

34. Tyrtyshnikov E. Incomplete cross approximation in the mosaic-skeleton method. Computing Jun 2000; 64(4):367–380.

35. Pan VY, Zhao L. Low-rank approximation of a matrix: Novel insights, new progress, and extensions. arXiv 2015;doi:http://arxiv.org/abs/1510.06142.

36. Pan VY, Luan Q, Svadlenka J, Zhao L. Primitive and cynical low rank approximation, preprocessing and extensions.arXiv 2016; doi:http://arxiv.org/abs/1611.01391.

37. Hackbusch W, Börm S. H2-matrix approximation of integral operators by interpolation. Appl. Numer. Math. 2002; 43(1):129–143.

38. Hackbusch W, Khoromskij BN, Kriemann R. Hierarchical matrices based on a weak admissibility criterion.Computing 2004; 73(3):207–243, doi:10.1007/s00607-004-0080-4. URL http://dx.doi.org/10.1007/s00607-004-0080-4.

39. Anderson CR. An implementation of the fast multipole method without multipoles. SIAM J. Sci. Statist. Comput.1992; 13(4):923–947, doi:10.1137/0913055. URL http://dx.doi.org/10.1137/0913055.

40. Ying L, Biros G, Zorin D. A kernel-independent adaptive fast multipole algorithm in two and three dimensions.J. Comput. Phys. 2004; 196(2):591–626, doi:http://dx.doi.org/10.1016/j.jcp.2003.11.021. URL http://www.sciencedirect.com/science/article/pii/S0021999103006090.

41. Lin L, Lu J, Ying L. Fast construction of hierarchical matrix representation from matrix-vector multiplication. J. Comput. Phys. 2011; 230(10):4071–4087, doi:10.1016/j.jcp.2011.02.033. URL http://www.sciencedirect.com/science/article/pii/S0021999111001227.
42. Martinsson PG. A fast randomized algorithm for computing a hierarchically semiseparable representation of a matrix. SIAM J. Matrix Anal. Appl. 2011; 32(4):1251–1274, doi:10.1137/100786617. URL http://dx.doi.org/10.1137/100786617.
43. Rouet FH, Li XS, Ghysels P, Napov A. A distributed-memory package for dense hierarchically semi-separable matrix computations using randomization. submitted to ACM Trans. Math. Softw.

44. Barnes J, Hut P. A hierarchical O(N log N) force-calculation algorithm. Nature Dec 1986; 324:446–449, doi:10.1038/324446a0.


45. Carrier J, Greengard L, Rokhlin V. A fast adaptive multipole algorithm for particle simulations. SIAM Journal onScientific and Statistical Computing 1988; 9(4):669–686, doi:10.1137/0909044.

46. Börm S. Construction of data-sparse H2-matrices by hierarchical compression. SIAM J. Sci. Comput. 2009; 31(3):1820–1839, doi:10.1137/080720693. URL http://dx.doi.org/10.1137/080720693.

47. Sun X, Pitsianis N. A matrix version of the fast multipole method. SIAM Rev. Feb 2001; 43(2):289–300, doi:10.1137/S0036144500370835. URL http://dx.doi.org/10.1137/S0036144500370835.

48. Bebendorf M. Hierarchical Matrices: A Means to Efficiently Solve Elliptic Boundary Value Problems. LectureNotes in Computational Science and Engineering, Springer Berlin Heidelberg, 2008. URL https://books.google.com/books?id=hnIJOwyz9Z4C.

49. Cai D, Xia J. A stable and efficient matrix version of the fast multipole method. to be submitted.
50. Halko N, Martinsson PG, Tropp J. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review 2011; 53.
51. Gu M. Subspace iteration randomization and singular value problems. preprint; URL https://arxiv.org/pdf/1408.2208v1.pdf.
52. Börm S. Approximation of integral operators by H2-matrices with adaptive bases. Computing May 2005; 74(3):249–271, doi:10.1007/s00607-004-0106-y. URL https://doi.org/10.1007/s00607-004-0106-y.
53. Börm S, Christophersen S. Approximation of integral operators by Green quadrature and nested cross approximation. Numerische Mathematik Jul 2016; 133(3):409–442, doi:10.1007/s00211-015-0757-y. URL https://doi.org/10.1007/s00211-015-0757-y.

54. Chandrasekaran S, Gu M, Lyons W. A fast adaptive solver for hierarchically semiseparable representations.CALCOLO 2005; 42(3):171–185, doi:10.1007/s10092-005-0103-3. URL http://dx.doi.org/10.1007/s10092-005-0103-3.

55. Pan VY. Structured Matrices and Polynomials: Unified Superfast Algorithms. Springer-Verlag New York, Inc.: NewYork, NY, USA, 2001.

56. Pan VY. Transformations of matrix structures work again. Linear Algebra and its Applications 2015; 465:107–138, doi:http://dx.doi.org/10.1016/j.laa.2014.09.004. URL http://www.sciencedirect.com/science/article/pii/S0024379514005898.

57. Pan VY. Fast Approximate Computations with Cauchy Matrices, Polynomials and Rational Functions. SpringerInternational Publishing: Cham, 2014; 287–299, doi:10.1007/978-3-319-06686-8 22. URL http://dx.doi.org/10.1007/978-3-319-06686-8_22.

58. Pan VY, Wang X. Inversion of displacement operators. SIAM Journal on Matrix Analysis and Applications 2003; 24(3):660–677, doi:10.1137/S089547980238627X. URL http://dx.doi.org/10.1137/S089547980238627X.

59. Pan VY. How Bad Are Vandermonde Matrices? SIAM J. Matrix Anal. Appl. 2016; 37(2):676–694, doi:10.1137/15M1030170.

60. Pan VY. Fast approximate computations with cauchy matrices and polynomials. Math. Comput. 2017;86(308):2799–2826, doi:10.1090/mcom/3204. URL https://doi.org/10.1090/mcom/3204.

61. Calvetti D, Reichel L. Factorizations of Cauchy matrices. J. Comput. Appl. Math. 1997; 86(1):103–123, doi:http://dx.doi.org/10.1016/S0377-0427(97)00150-7. URL http://www.sciencedirect.com/science/article/pii/S0377042797001507.

62. Bini DA, Meini B, Poloni F. Fast solution of a certain Riccati equation through Cauchy-like matrices. Electronic Transactions on Numerical Analysis 2008; 33:84–104. URL http://eudml.org/doc/223449.
63. Boros T, Kailath T, Olshevsky V. A fast parallel Björck-Pereyra-type algorithm for solving Cauchy linear equations. Linear Algebra Appl. 1999; 302:265–293, doi:10.1016/S0024-3795(99)00115-9. URL http://www.sciencedirect.com/science/article/pii/S0024379599001159.
64. Boros T, Kailath T, Olshevsky V. Pivoting and backward stability of fast algorithms for solving Cauchy linear equations. Linear Algebra Appl. 2002; 343:63–99, doi:10.1016/S0024-3795(01)00519-5. URL http://www.sciencedirect.com/science/article/pii/S0024379501005195.
65. Gohberg I, Kailath T, Olshevsky V. Fast Gaussian elimination with partial pivoting for matrices with displacement structure. Math. Comput. Oct 1995; 64(212):1557–1576, doi:10.2307/2153371. URL http://dx.doi.org/10.2307/2153371.

66. Gu M. Stable and efficient algorithms for structured systems of linear equations. SIAM J. Matrix Anal.Appl. 1998; 19(2):279–306, doi:10.1137/S0895479895291273. URL http://dx.doi.org/10.1137/S0895479895291273.

67. Pan VY, Zheng A. Superfast algorithms for Cauchy-like matrix computations and extensions. Linear Algebra Appl. 2000; 310(1):83–108, doi:10.1016/S0024-3795(00)00041-0. URL http://www.sciencedirect.com/science/article/pii/S0024379500000410.

68. Xia J. On the complexity of some hierarchical structured matrix algorithms. SIAM Journal on Matrix Analysis and Applications 2012; 33(2):388–410, doi:10.1137/110827788. URL http://dx.doi.org/10.1137/110827788.

69. Hsiao GC, Wendland WL. Boundary integral equations. Applied Mathematical Sciences, Springer: Berlin,Heidelberg, 2008.

70. Kress R. Linear Integral Equations. Applied Mathematical Sciences, Springer New York, 2013. URL https://books.google.com/books?id=-L-8BAAAQBAJ.

71. Atkinson K. The Numerical Solution of Integral Equations of the Second Kind. Cambridge Monographs on Appliedand Computational Mathematics, Cambridge University Press, 1997.

72. Börm S. H2Lib. http://www.h2lib.org/.
