a vector implementation of gaussian elimination over...

45
A Vector Implementation of Gaussian Elimination over GF(2): Exploring the Design-Space of Strassen’s Algorithm as a Case Study Enric Morancho Departament d’Arquitectura de Computadors Universitat Politècnica de Catalunya, BarcelonaTech Barcelona, Spain [email protected] 23 rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing Turku, Finland, March 4 th - 6 th , 2015 Enric Morancho (UPC) A Vector Implementation of GE EuroPDP-2015 1 / 45

Upload: nguyenquynh

Post on 26-Mar-2018

218 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: A Vector Implementation of Gaussian Elimination over …personals.ac.upc.edu/enricm/Pubs/pdp2015_slides.pdf · A Vector Implementation of Gaussian Elimination over GF(2): Exploring

A Vector Implementation of Gaussian Eliminationover GF(2): Exploring the Design-Space of

Strassen’s Algorithm as a Case Study

Enric Morancho

Departament d’Arquitectura de ComputadorsUniversitat Politècnica de Catalunya, BarcelonaTech

Barcelona, [email protected]

23rd Euromicro International Conference onParallel, Distributed, and Network-Based Processing

Turku, Finland, March 4th − 6th, 2015

Enric Morancho (UPC) A Vector Implementation of GE EuroPDP-2015 1 / 45

Page 2: A Vector Implementation of Gaussian Elimination over …personals.ac.upc.edu/enricm/Pubs/pdp2015_slides.pdf · A Vector Implementation of Gaussian Elimination over GF(2): Exploring

Outline

1 Introduction

2 Background

3 Vector implementation of Gaussian Elimination

4 Case study

5 Conclusions and Future work

Enric Morancho (UPC) A Vector Implementation of GE EuroPDP-2015 2 / 45

Page 3: A Vector Implementation of Gaussian Elimination over …personals.ac.upc.edu/enricm/Pubs/pdp2015_slides.pdf · A Vector Implementation of Gaussian Elimination over GF(2): Exploring

Outline

1 Introduction

2 Background

3 Vector implementation of Gaussian Elimination

4 Case study

5 Conclusions and Future work

Enric Morancho (UPC) A Vector Implementation of GE EuroPDP-2015 3 / 45

Page 4: A Vector Implementation of Gaussian Elimination over …personals.ac.upc.edu/enricm/Pubs/pdp2015_slides.pdf · A Vector Implementation of Gaussian Elimination over GF(2): Exploring

Introduction

Gaussian Elimination (GE) is one of the key algorithms in linearalgebraWe discuss a vector implementation of GE over GF(2)We apply this implementation to a case study:

Enumerating all matrix-multiply algorithms over GF(2) similar toStrassen’s algorithm

Strassen’s algorithm: first known sub-cubic matrix-multiply algorithm

The search engine relies on solving more than 1012 GaussianEliminations over GF(2) matrices

Enric Morancho (UPC) A Vector Implementation of GE EuroPDP-2015 4 / 45

Page 5: A Vector Implementation of Gaussian Elimination over …personals.ac.upc.edu/enricm/Pubs/pdp2015_slides.pdf · A Vector Implementation of Gaussian Elimination over GF(2): Exploring

Outline

1 Introduction

2 BackgroundGaussian EliminationGaussian Elimination over GF(2)Vector extensions

3 Vector implementation of Gaussian Elimination

4 Case study

5 Conclusions and Future work

Enric Morancho (UPC) A Vector Implementation of GE EuroPDP-2015 5 / 45

Page 6: A Vector Implementation of Gaussian Elimination over …personals.ac.upc.edu/enricm/Pubs/pdp2015_slides.pdf · A Vector Implementation of Gaussian Elimination over GF(2): Exploring

Outline

1 Introduction

2 BackgroundGaussian EliminationGaussian Elimination over GF(2)Vector extensions

3 Vector implementation of Gaussian Elimination

4 Case study

5 Conclusions and Future work

Enric Morancho (UPC) A Vector Implementation of GE EuroPDP-2015 6 / 45

Page 7: A Vector Implementation of Gaussian Elimination over …personals.ac.upc.edu/enricm/Pubs/pdp2015_slides.pdf · A Vector Implementation of Gaussian Elimination over GF(2): Exploring

Gaussian Elimination (GE)

One of the key algorithms in linear algebraApplications: Solving LSE’s, inverting nonsingular matrices,...

GE transforms a matrix into a matrix in row (column) echelon formForward elimination

2 1 4 32 4 10 36 6 18 94 2 8 3

2 1 4 00 3 6 30 0 0 30 0 0 0

Gauss-Jordan Elimination (GJE) transforms a matrix into a matrixin reduced row (column) echelon form

Forward elimination + Back substitution2 1 4 32 4 10 36 6 18 94 2 8 3

2 1 4 00 3 6 30 0 0 30 0 0 0

1 0 1 00 1 2 00 0 0 10 0 0 0

Enric Morancho (UPC) A Vector Implementation of GE EuroPDP-2015 7 / 45

Page 8: A Vector Implementation of Gaussian Elimination over …personals.ac.upc.edu/enricm/Pubs/pdp2015_slides.pdf · A Vector Implementation of Gaussian Elimination over GF(2): Exploring

Gaussian Elimination (GE)

GE iteratively applies three elementary transformations:Swapping rows (columns)Scaling rows (columns)Adding to a row (column) a scalar multiple of another row (column)

GE is defined over an algebraic fieldInfinite fields: Q, R, ...

Computer arithmetic over infinite fields introduces round-off errorsGE implementations use pivoting techniques to minimize them

Finite fields: the Galois Field of two elements (GF(2)), ...Computer arithmetic over finite fields is always exact

Enric Morancho (UPC) A Vector Implementation of GE EuroPDP-2015 8 / 45

Page 9: A Vector Implementation of Gaussian Elimination over …personals.ac.upc.edu/enricm/Pubs/pdp2015_slides.pdf · A Vector Implementation of Gaussian Elimination over GF(2): Exploring

Outline

1 Introduction

2 BackgroundGaussian EliminationGaussian Elimination over GF(2)Vector extensions

3 Vector implementation of Gaussian Elimination

4 Case study

5 Conclusions and Future work

Enric Morancho (UPC) A Vector Implementation of GE EuroPDP-2015 9 / 45

Page 10: A Vector Implementation of Gaussian Elimination over …personals.ac.upc.edu/enricm/Pubs/pdp2015_slides.pdf · A Vector Implementation of Gaussian Elimination over GF(2): Exploring

Gaussian Elimination over GF(2)

GF(2) is the Galois field of two elements (aka F2, binary field)GF (2) = {0,1}addition ≡ bitwise XOR

subtraction and addition are the same operation (+1 = -1)

multiplication ≡ bitwise ANDImplementation remarks

Gaussian Elimination can be specialized for GF(2)The only element different from 0 is 1

Gauss-Jordan Elimination can be easily merged into GEPivoting is unnecessary

Computer arithmetic over GF(2) is always exact

ApplicationsFactoring large integer numbers, cryptography, pattern matching,...

Enric Morancho (UPC) A Vector Implementation of GE EuroPDP-2015 10 / 45

Page 11: A Vector Implementation of Gaussian Elimination over …personals.ac.upc.edu/enricm/Pubs/pdp2015_slides.pdf · A Vector Implementation of Gaussian Elimination over GF(2): Exploring

Outline

1 Introduction

2 BackgroundGaussian EliminationGaussian Elimination over GF(2)Vector extensions

3 Vector implementation of Gaussian Elimination

4 Case study

5 Conclusions and Future work

Enric Morancho (UPC) A Vector Implementation of GE EuroPDP-2015 11 / 45

Page 12: A Vector Implementation of Gaussian Elimination over …personals.ac.upc.edu/enricm/Pubs/pdp2015_slides.pdf · A Vector Implementation of Gaussian Elimination over GF(2): Exploring

Vector extensions

Most current processors implement vector instructionsx86 family supports several vector-instruction sets and lengths

MMX: 64-bit vector registersSSE: 128-bit vector registersAVX2: 256-bit vector registers

Scalar processors may emulate vector instructionsSWAR: SIMD within a registerInterprets general-purpose registers as a bitvectorsBitwise operations (AND, shifts,...) can be seen as SIMD operations

Enric Morancho (UPC) A Vector Implementation of GE EuroPDP-2015 12 / 45

Page 13: A Vector Implementation of Gaussian Elimination over …personals.ac.upc.edu/enricm/Pubs/pdp2015_slides.pdf · A Vector Implementation of Gaussian Elimination over GF(2): Exploring

Outline

1 Introduction

2 Background

3 Vector implementation of Gaussian EliminationFinding the pivot elementRow swappingForward elimination and Back substitution

4 Case study

5 Conclusions and Future work

Enric Morancho (UPC) A Vector Implementation of GE EuroPDP-2015 13 / 45

Page 14: A Vector Implementation of Gaussian Elimination over …personals.ac.upc.edu/enricm/Pubs/pdp2015_slides.pdf · A Vector Implementation of Gaussian Elimination over GF(2): Exploring

Preliminaries

We focus in row echelon formsVector instructions allow us to exploit the parallelism available inthe matrix transformationsWe represent GF(2) matrices as vector registers

Both row-major and column-major layoutsMain steps of GE

For each column:Finding the pivotRow swappingForward elimination and Back substitution

Enric Morancho (UPC) A Vector Implementation of GE EuroPDP-2015 14 / 45

Page 15: A Vector Implementation of Gaussian Elimination over …personals.ac.upc.edu/enricm/Pubs/pdp2015_slides.pdf · A Vector Implementation of Gaussian Elimination over GF(2): Exploring

Outline

1 Introduction

2 Background

3 Vector implementation of Gaussian EliminationFinding the pivot elementRow swappingForward elimination and Back substitution

4 Case study

5 Conclusions and Future work

Enric Morancho (UPC) A Vector Implementation of GE EuroPDP-2015 15 / 45

Page 16: A Vector Implementation of Gaussian Elimination over …personals.ac.upc.edu/enricm/Pubs/pdp2015_slides.pdf · A Vector Implementation of Gaussian Elimination over GF(2): Exploring

Finding the pivot element

For each column, GE must locate a pivot (an element 6= 0)First: column must be extracted

Depends on matrix layoutSecond: pivot must be located

Naïve solution: iterating bit by bitOptimized solution: using bit-scanning machine instructions

tzcnt (AVX2)

Enric Morancho (UPC) A Vector Implementation of GE EuroPDP-2015 16 / 45

Page 17: A Vector Implementation of Gaussian Elimination over …personals.ac.upc.edu/enricm/Pubs/pdp2015_slides.pdf · A Vector Implementation of Gaussian Elimination over GF(2): Exploring

Outline

1 Introduction

2 Background

3 Vector implementation of Gaussian EliminationFinding the pivot elementRow swappingForward elimination and Back substitution

4 Case study

5 Conclusions and Future work

Enric Morancho (UPC) A Vector Implementation of GE EuroPDP-2015 17 / 45

Page 18: A Vector Implementation of Gaussian Elimination over …personals.ac.upc.edu/enricm/Pubs/pdp2015_slides.pdf · A Vector Implementation of Gaussian Elimination over GF(2): Exploring

Row swapping

Depends on matrix layoutRow-major layout:

Vector instruction sets offer permute instructionsColumn-major layout:

Several vector instructions are required

Enric Morancho (UPC) A Vector Implementation of GE EuroPDP-2015 18 / 45

Page 19: A Vector Implementation of Gaussian Elimination over …personals.ac.upc.edu/enricm/Pubs/pdp2015_slides.pdf · A Vector Implementation of Gaussian Elimination over GF(2): Exploring

Row swapping

Example of swapping rows 3 and 5 on a 6× 4 GF(2) matrix storedin column-major layout

Enric Morancho (UPC) A Vector Implementation of GE EuroPDP-2015 19 / 45

Page 20: A Vector Implementation of Gaussian Elimination over …personals.ac.upc.edu/enricm/Pubs/pdp2015_slides.pdf · A Vector Implementation of Gaussian Elimination over GF(2): Exploring

Outline

1 Introduction

2 Background

3 Vector implementation of Gaussian EliminationFinding the pivot elementRow swappingForward elimination and Back substitution

4 Case study

5 Conclusions and Future work

Enric Morancho (UPC) A Vector Implementation of GE EuroPDP-2015 20 / 45

Page 21: A Vector Implementation of Gaussian Elimination over …personals.ac.upc.edu/enricm/Pubs/pdp2015_slides.pdf · A Vector Implementation of Gaussian Elimination over GF(2): Exploring

Forward elimination and Back substitution

Add pivot row to the rows with non-zero entries in the pivot columnWe perform all additions at the same timeExample on a 4× 6 GF(2) matrix stored in row-major layout

Enric Morancho (UPC) A Vector Implementation of GE EuroPDP-2015 21 / 45

Page 22: A Vector Implementation of Gaussian Elimination over …personals.ac.upc.edu/enricm/Pubs/pdp2015_slides.pdf · A Vector Implementation of Gaussian Elimination over GF(2): Exploring

Outline

1 Introduction

2 Background

3 Vector implementation of Gaussian Elimination

4 Case studyPreliminariesAdaptation to the case studyExperimental setupResults

5 Conclusions and Future work

Enric Morancho (UPC) A Vector Implementation of GE EuroPDP-2015 22 / 45

Page 23: A Vector Implementation of Gaussian Elimination over …personals.ac.upc.edu/enricm/Pubs/pdp2015_slides.pdf · A Vector Implementation of Gaussian Elimination over GF(2): Exploring

Outline

1 Introduction

2 Background

3 Vector implementation of Gaussian Elimination

4 Case studyPreliminariesAdaptation to the case studyExperimental setupResults

5 Conclusions and Future work

Enric Morancho (UPC) A Vector Implementation of GE EuroPDP-2015 23 / 45

Page 24: A Vector Implementation of Gaussian Elimination over …personals.ac.upc.edu/enricm/Pubs/pdp2015_slides.pdf · A Vector Implementation of Gaussian Elimination over GF(2): Exploring

Block-recursive matrix-multiply algorithms

A · B =

(A11 A12A21 A22

)·(

B11 B12B21 B22

)=

(C11 C12C21 C22

)= C

P1 = A11 · B11

P2 = A12 · B21

P3 = A11 · B12

P4 = A12 · B22

P5 = A21 · B11

P6 = A22 · B21

P7 = A21 · B12

P8 = A22 · B22

C11 = P1 + P2

C12 = P3 + P4

C21 = P5 + P6

C22 = P7 + P8

a) Conventional

n 3

P1 = (A11 + A22) · (B11 + B22)

P2 = (A21 + A22) · B11

P3 = A11 · (B12 − B22)

P4 = A22 · (−B11 + B21)

P5 = (A11 + A12) · B22

P6 = (−A11 + A21) · (B11 + B12)

P7 = (A12 − A22) · (B21 + B22)

C11 = P1 + P4 − P5 + P7

C12 = P3 + P5

C21 = P2 + P4

C22 = P1 − P2 + P3 + P6

b) Strassen’s algorithm

n 2.807

P1 = A11 · B11

P2 = A12 · B21

P3 = A22 · (B11 − B12 − B21 + B22)

P4 = (A11 − A21) · (−B12 + B22)

P5 = (A21 + A22) · (−B11 + B12)

P6 = (A11 + A12 − A21 − A22) · B22

P7 = (A11 − A21 − A22) · (B11 − B12 + B22)

C11 = P1 + P2

C12 = P1 + P5 + P6 − P7

C21 = P1 − P3 + P4 − P7

C22 = P1 + P4 + P5 − P7

c) Winograd’s variant

n 2.807

Enric Morancho (UPC) A Vector Implementation of GE EuroPDP-2015 24 / 45

Page 25: A Vector Implementation of Gaussian Elimination over …personals.ac.upc.edu/enricm/Pubs/pdp2015_slides.pdf · A Vector Implementation of Gaussian Elimination over GF(2): Exploring

Block-recursive matrix-multiply algorithms

A · B =

(A11 A12A21 A22

)·(

B11 B12B21 B22

)=

(C11 C12C21 C22

)= C

P1 = A11 · B11

P2 = A12 · B21

P3 = A11 · B12

P4 = A12 · B22

P5 = A21 · B11

P6 = A22 · B21

P7 = A21 · B12

P8 = A22 · B22

C11 = P1 + P2

C12 = P3 + P4

C21 = P5 + P6

C22 = P7 + P8

a) Conventional

n 3

P1 = (A11 + A22) · (B11 + B22)

P2 = (A21 + A22) · B11

P3 = A11 · (B12 − B22)

P4 = A22 · (−B11 + B21)

P5 = (A11 + A12) · B22

P6 = (−A11 + A21) · (B11 + B12)

P7 = (A12 − A22) · (B21 + B22)

C11 = P1 + P4 − P5 + P7

C12 = P3 + P5

C21 = P2 + P4

C22 = P1 − P2 + P3 + P6

b) Strassen’s algorithm

n 2.807

P1 = A11 · B11

P2 = A12 · B21

P3 = A22 · (B11 − B12 − B21 + B22)

P4 = (A11 − A21) · (−B12 + B22)

P5 = (A21 + A22) · (−B11 + B12)

P6 = (A11 + A12 − A21 − A22) · B22

P7 = (A11 − A21 − A22) · (B11 − B12 + B22)

C11 = P1 + P2

C12 = P1 + P5 + P6 − P7

C21 = P1 − P3 + P4 − P7

C22 = P1 + P4 + P5 − P7

c) Winograd’s variant

n 2.807

Enric Morancho (UPC) A Vector Implementation of GE EuroPDP-2015 25 / 45

Page 26: A Vector Implementation of Gaussian Elimination over …personals.ac.upc.edu/enricm/Pubs/pdp2015_slides.pdf · A Vector Implementation of Gaussian Elimination over GF(2): Exploring

Block-recursive matrix-multiply algorithms

A · B =

(A11 A12A21 A22

)·(

B11 B12B21 B22

)=

(C11 C12C21 C22

)= C

P1 = A11 · B11

P2 = A12 · B21

P3 = A11 · B12

P4 = A12 · B22

P5 = A21 · B11

P6 = A22 · B21

P7 = A21 · B12

P8 = A22 · B22

C11 = P1 + P2

C12 = P3 + P4

C21 = P5 + P6

C22 = P7 + P8

a) Conventional

n 3

P1 = (A11 + A22) · (B11 + B22)

P2 = (A21 + A22) · B11

P3 = A11 · (B12 − B22)

P4 = A22 · (−B11 + B21)

P5 = (A11 + A12) · B22

P6 = (−A11 + A21) · (B11 + B12)

P7 = (A12 − A22) · (B21 + B22)

C11 = P1 + P4 − P5 + P7

C12 = P3 + P5

C21 = P2 + P4

C22 = P1 − P2 + P3 + P6

b) Strassen’s algorithm

n 2.807

P1 = A11 · B11

P2 = A12 · B21

P3 = A22 · (B11 − B12 − B21 + B22)

P4 = (A11 − A21) · (−B12 + B22)

P5 = (A21 + A22) · (−B11 + B12)

P6 = (A11 + A12 − A21 − A22) · B22

P7 = (A11 − A21 − A22) · (B11 − B12 + B22)

C11 = P1 + P2

C12 = P1 + P5 + P6 − P7

C21 = P1 − P3 + P4 − P7

C22 = P1 + P4 + P5 − P7

c) Winograd’s variant

n 2.807

Enric Morancho (UPC) A Vector Implementation of GE EuroPDP-2015 26 / 45

Page 27: A Vector Implementation of Gaussian Elimination over …personals.ac.upc.edu/enricm/Pubs/pdp2015_slides.pdf · A Vector Implementation of Gaussian Elimination over GF(2): Exploring

Block-recursive matrix-multiply algorithms

Several matrix-multiply algorithms with sub-cubic complexityn2.81 [Strassen, 1969]n2.79 [Pan, 1979]n2.78 [Bini, 1979]n2.55 [Schönhage, 1981]n2.373 [Coppersmith and Winograd, 1987]n2.37286 [LeGall, 2014]

But Strassen’s algorithm is optimal for 2× 2 matricesThere are other algorithms similar to Strassen’s?

Using a genetic algorithm, [Oh and Moon, 2010] searched forStrassen-like algorithms over R

Their partial search discovered 608 Strassen-like algorithms

Enric Morancho (UPC) A Vector Implementation of GE EuroPDP-2015 27 / 45

Page 28: A Vector Implementation of Gaussian Elimination over …personals.ac.upc.edu/enricm/Pubs/pdp2015_slides.pdf · A Vector Implementation of Gaussian Elimination over GF(2): Exploring

Case Study

Enumerating all the Strassen-like algorithms for 2× 2 GF(2)matrices

Enric Morancho (UPC) A Vector Implementation of GE EuroPDP-2015 28 / 45

Page 29: A Vector Implementation of Gaussian Elimination over …personals.ac.upc.edu/enricm/Pubs/pdp2015_slides.pdf · A Vector Implementation of Gaussian Elimination over GF(2): Exploring

Outline

1 Introduction

2 Background

3 Vector implementation of Gaussian Elimination

4 Case studyPreliminariesAdaptation to the case studyExperimental setupResults

5 Conclusions and Future work

Enric Morancho (UPC) A Vector Implementation of GE EuroPDP-2015 29 / 45

Page 30: A Vector Implementation of Gaussian Elimination over …personals.ac.upc.edu/enricm/Pubs/pdp2015_slides.pdf · A Vector Implementation of Gaussian Elimination over GF(2): Exploring

Formulation

Strassen-like algorithms make 7 recursive callsEach call computes a bilinear form

Pi =(αi1A11 + αi2A12 + αi3A21 + αi4A22)

·(βi1B11 + βi2B12 + βi3B21 + βi4B22) αij , βij ∈ GF (2)

There are (24 − 1)2 = 225 possible Pi ’s and(225

7

)≈ 5.27× 1012

candidate algorithms

An algorithm is correct (computes A · B) if there existsimultaneously next four linear combinations:

C11 :∑7

k=1 φk1Pk = A11 · B11 + A12 · B21

C12 :∑7

k=1 φk2Pk = A11 · B12 + A12 · B22

C21 :∑7

k=1 φk3Pk = A21 · B11 + A22 · B21

C22 :∑7

k=1 φk4Pk = A21 · B12 + A22 · B22

φij ∈ GF (2)

Enric Morancho (UPC) A Vector Implementation of GE EuroPDP-2015 30 / 45

Page 31: A Vector Implementation of Gaussian Elimination over …personals.ac.upc.edu/enricm/Pubs/pdp2015_slides.pdf · A Vector Implementation of Gaussian Elimination over GF(2): Exploring

Representation

We represent each candidate algorithm as a 16× 11 GF(2) matrix

δijk = 1⇔ αij = βik = 1

δ1,1,1 ... δ7,1,1 1 0 0 0δ1,1,2 ... δ7,1,2 0 1 0 0δ1,1,3 ... δ7,1,3 0 0 0 0δ1,1,4 ... δ7,1,4 0 0 0 0δ1,2,1 ... δ7,2,1 0 0 0 0δ1,2,2 ... δ7,2,2 0 0 0 0δ1,2,3 ... δ7,2,3 1 0 0 0δ1,2,4 ... δ7,2,4 0 1 0 0δ1,3,1 ... δ7,3,1 0 0 1 0δ1,3,2 ... δ7,3,2 0 0 0 1δ1,3,3 ... δ7,3,3 0 0 0 0δ1,3,4 ... δ7,3,4 0 0 0 0δ1,4,1 ... δ7,4,1 0 0 0 0δ1,4,2 ... δ7,4,2 0 0 0 0δ1,4,3 ... δ7,4,3 0 0 1 0δ1,4,4 ... δ7,4,4 0 0 0 1

Enric Morancho (UPC) A Vector Implementation of GE EuroPDP-2015 31 / 45

Page 32: A Vector Implementation of Gaussian Elimination over …personals.ac.upc.edu/enricm/Pubs/pdp2015_slides.pdf · A Vector Implementation of Gaussian Elimination over GF(2): Exploring

Resolution

We apply Gauss-Jordan elimination to the 16× 11 matrixIf the form of the resulting matrix is:

(I7 φ7×4Z9×7 Z9×4

) I7 7× 7 identity matrixZ9×7, Z9×4 9× 7, 9× 4 zero matricesφ7×4 7× 4 arbitrary matrix

then φ7×4 are the coefficients of the linear combinations of P1...P7that generate the Cij ’s of A · B

Enric Morancho (UPC) A Vector Implementation of GE EuroPDP-2015 32 / 45

Page 33: A Vector Implementation of Gaussian Elimination over …personals.ac.upc.edu/enricm/Pubs/pdp2015_slides.pdf · A Vector Implementation of Gaussian Elimination over GF(2): Exploring

Search engines

Search engineIterates through

(2257

)≈ 5.2× 1012 candidate algorithms

Applies Gauss-Jordan elimination to the related 16× 11 GF(2)matricesChecks the form of the resulting matrix

SpecializationsIf a pivot is not found in anyone of the first seven columns, thealgorithm is discarded

Redundant Pi ’s

Gauss-Jordan is applied just on the first seven columnsAvoid explicitly swapping rowsFiltering-out algorithms that do not use all eight subproducts

A11 ·B11,A12 ·B21,A11 ·B12,A12 ·B22,A21 ·B11,A22 ·B21,A21 ·B12,A22 ·B22

Enric Morancho (UPC) A Vector Implementation of GE EuroPDP-2015 33 / 45

Page 34: A Vector Implementation of Gaussian Elimination over …personals.ac.upc.edu/enricm/Pubs/pdp2015_slides.pdf · A Vector Implementation of Gaussian Elimination over GF(2): Exploring

Outline

1 Introduction

2 Background

3 Vector implementation of Gaussian Elimination

4 Case studyPreliminariesAdaptation to the case studyExperimental setupResults

5 Conclusions and Future work

Enric Morancho (UPC) A Vector Implementation of GE EuroPDP-2015 34 / 45

Page 35: A Vector Implementation of Gaussian Elimination over …personals.ac.upc.edu/enricm/Pubs/pdp2015_slides.pdf · A Vector Implementation of Gaussian Elimination over GF(2): Exploring

Experimental setup

Evaluated search engines6 scenarios:

AVX2, SSE, Scalar SWAR, Scalar no-SWAR, M4RI library (GE),M4RI library (GJE)

2 search engines for each scenarioGeneric: applies GJE/GE to all candidate algorithmsSpecialized: implement specializationsIn M4RI scenarios, specialization includes only the filteringmechanism

Search engines differ on the implementation of Gauss-Jordanelimination

Column-major layoutEvaluation platform

Intel Xeon CPU E3-1220 v3 (3.1 GHz, 4-core, Haswell)Implements AVX2 vector extension (256-bit)

Search space is dynamically distributed among the 4 cores

Enric Morancho (UPC) A Vector Implementation of GE EuroPDP-2015 35 / 45

Page 36: A Vector Implementation of Gaussian Elimination over …personals.ac.upc.edu/enricm/Pubs/pdp2015_slides.pdf · A Vector Implementation of Gaussian Elimination over GF(2): Exploring

Outline

1 Introduction

2 Background

3 Vector implementation of Gaussian Elimination

4 Case studyPreliminariesAdaptation to the case studyExperimental setupResults

5 Conclusions and Future work

Enric Morancho (UPC) A Vector Implementation of GE EuroPDP-2015 36 / 45

Page 37: A Vector Implementation of Gaussian Elimination over …personals.ac.upc.edu/enricm/Pubs/pdp2015_slides.pdf · A Vector Implementation of Gaussian Elimination over GF(2): Exploring

Performance results

Generic Special.AVX2 (256-bit vectors) 3.74 1.00SSE (128-bit vectors) 5.01 1.28Scalar SWAR (64-bit vectors) 7.43 1.92Scalar no-SWAR (pure scalar) 21.13 6.88M4RI (library GE) 11.30 5.96M4RI (library GJE) 17.22 9.18

Relative execution time of the search enginesWith respect to specialized AVX2 engineThe lower, the better

Enric Morancho (UPC) A Vector Implementation of GE EuroPDP-2015 37 / 45

Page 38: A Vector Implementation of Gaussian Elimination over …personals.ac.upc.edu/enricm/Pubs/pdp2015_slides.pdf · A Vector Implementation of Gaussian Elimination over GF(2): Exploring

Performance results: vector vs. scalar

Generic Special.AVX2 (256-bit vectors) 3.74 1.00SSE (128-bit vectors) 5.01 1.28Scalar SWAR (64-bit vectors) 7.43 1.92Scalar no-SWAR (pure scalar) 21.13 6.88M4RI (library GE) 11.30 5.96M4RI (library GJE) 17.22 9.18

Speedup of AVX2 impl. wrt. vector and pure scalar impl.AVX2 vs. SSE: ≈ 1.3XAVX2 vs. Scalar SWAR: ≈ 1.9XAVX2 vs. Scalar no-SWAR: ≈ 5.7X-6.8X

Scalar implementations are not competitiveThe larger the vector-register length, the better the performance

Enric Morancho (UPC) A Vector Implementation of GE EuroPDP-2015 38 / 45

Page 39: A Vector Implementation of Gaussian Elimination over …personals.ac.upc.edu/enricm/Pubs/pdp2015_slides.pdf · A Vector Implementation of Gaussian Elimination over GF(2): Exploring

Performance results: vector vs. algebra library

Generic Special.AVX2 (256-bit vectors) 3.74 1.00SSE (128-bit vectors) 5.01 1.28Scalar SWAR (64-bit vectors) 7.43 1.92Scalar no-SWAR (pure scalar) 21.13 6.88M4RI (library GE) 11.30 5.96M4RI (library GJE) 17.22 9.18

In our case study, library implementations are not competitiveThey are optimized for larger matrices

Enric Morancho (UPC) A Vector Implementation of GE EuroPDP-2015 39 / 45

Page 40: A Vector Implementation of Gaussian Elimination over …personals.ac.upc.edu/enricm/Pubs/pdp2015_slides.pdf · A Vector Implementation of Gaussian Elimination over GF(2): Exploring

Performance results: impact of code specializations

Generic Special.AVX2 (256-bit vectors) 3.74 1.00SSE (128-bit vectors) 5.01 1.28Scalar SWAR (64-bit vectors) 7.43 1.92Scalar no-SWAR (pure scalar) 21.13 6.88M4RI (library GE) 11.30 5.96M4RI (library GJE) 17.22 9.18

Impact of code specialization: almost 4XLarger than doubling or even quadruplicating vector lengthImpact break down:

Discarding algorithms if a pivot is not found: negligibleApplying elimination just on seven columns: 1.4X - 1.6XAvoiding row interchange: 1.2X - 1.4XFiltering-out algorithms without eight subproducts: 1.9X

Enric Morancho (UPC) A Vector Implementation of GE EuroPDP-2015 40 / 45

Page 41: A Vector Implementation of Gaussian Elimination over …personals.ac.upc.edu/enricm/Pubs/pdp2015_slides.pdf · A Vector Implementation of Gaussian Elimination over GF(2): Exploring

Case-study results

We have found 20 Strassen-like algorithmsExcluding permutations and symmetric versionsDetailed in our paper

Results coherent with [Oh and Moon, 2010]They found additional algorithms where coefficient 0.5 is involved

Enric Morancho (UPC) A Vector Implementation of GE EuroPDP-2015 41 / 45

Page 42: A Vector Implementation of Gaussian Elimination over …personals.ac.upc.edu/enricm/Pubs/pdp2015_slides.pdf · A Vector Implementation of Gaussian Elimination over GF(2): Exploring

Outline

1 Introduction

2 Background

3 Vector implementation of Gaussian Elimination

4 Case study

5 Conclusions and Future work

Enric Morancho (UPC) A Vector Implementation of GE EuroPDP-2015 42 / 45

Page 43: A Vector Implementation of Gaussian Elimination over …personals.ac.upc.edu/enricm/Pubs/pdp2015_slides.pdf · A Vector Implementation of Gaussian Elimination over GF(2): Exploring

Conclusions

We have discussed a vector implementation of GaussianEliminationOur evaluation develops a case study

Requires solving more than 1012 GE’s over 16× 11 GF(2) matricesVector implementations clearly outperform scalar implementationWe point out the impact of code specialization for our case study

Speedup: almost 4XImpact larger than doubling or quadruplicating vector-register length

Performance of analyzed algebra libraries is not competitiveLibraries are optimized for larger matrices

Enric Morancho (UPC) A Vector Implementation of GE EuroPDP-2015 43 / 45

Page 44: A Vector Implementation of Gaussian Elimination over …personals.ac.upc.edu/enricm/Pubs/pdp2015_slides.pdf · A Vector Implementation of Gaussian Elimination over GF(2): Exploring

Future work

Developing efficient implementations of Gaussian Elimination forlarger matrices

Enric Morancho (UPC) A Vector Implementation of GE EuroPDP-2015 44 / 45

Page 45: A Vector Implementation of Gaussian Elimination over …personals.ac.upc.edu/enricm/Pubs/pdp2015_slides.pdf · A Vector Implementation of Gaussian Elimination over GF(2): Exploring

A Vector Implementation of Gaussian Eliminationover GF(2): Exploring the Design-Space of

Strassen’s Algorithm as a Case Study

Enric Morancho

Departament d’Arquitectura de ComputadorsUniversitat Politècnica de Catalunya, BarcelonaTech

Barcelona, [email protected]

23rd Euromicro International Conference onParallel, Distributed, and Network-Based Processing

Turku, Finland, March 4th − 6th, 2015

Enric Morancho (UPC) A Vector Implementation of GE EuroPDP-2015 45 / 45