a b c d read seq ref seq linear pe array 1 2 typical...



Post on 04-Mar-2021


Page 1: sc15.supercomputing.org/.../tech_poster/poster_files/post269s2-file2…

[Figure axis labels: Reference; Query (Sample)]

Platform / Architecture    Device                           GCUPS   Watts   GCUPS/Watt
Systolic FPGA / OpenCL     Virtex-7 690T                    77.0    28      2.8
SSW [2] / SSE2             Intel® Xeon E5-2697, 12 core     19.7    130     0.15
SWAPHI-LS [3] / AVX-512    Intel® Xeon Phi 5110P, 60 core   29.5    225     0.13
GSWABE [4] / CUDA          nVidia® Tesla K40, 2880 core     59.1    245     0.24

Comparison                 GCUPS   Watts   GCUPS/Watt
FPGA vs Xeon E5-2697       3.9x    0.22x   18.7x
FPGA vs Xeon Phi 5110P     2.6x    0.12x   21.5x
FPGA vs Tesla K40          1.3x    0.11x   11.6x
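The comparison rows can be re-derived from the GCUPS and Watts columns. The short Python sketch below does that arithmetic; the dictionary and function names are illustrative only, and the power figures are the TDP values quoted in the table.

```python
# Figures copied from the comparison table: (GCUPS, Watts). Power is TDP.
PLATFORMS = {
    "Virtex-7 690T": (77.0, 28),
    "Xeon E5-2697": (19.7, 130),
    "Xeon Phi 5110P": (29.5, 225),
    "Tesla K40": (59.1, 245),
}


def gcups_per_watt(name):
    """Power efficiency of one platform."""
    gcups, watts = PLATFORMS[name]
    return gcups / watts


def fpga_speedup(name):
    """Throughput of the FPGA relative to the named platform."""
    return PLATFORMS["Virtex-7 690T"][0] / PLATFORMS[name][0]


if __name__ == "__main__":
    for name in PLATFORMS:
        print(f"{name:15s} {gcups_per_watt(name):.2f} GCUPS/W, "
              f"FPGA speedup {fpga_speedup(name):.1f}x")
```

Efficiency ratios computed this way from unrounded values can differ slightly from the GCUPS/Watt ratios in the table, which appear to be derived from the rounded column entries.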

The Smith-Waterman algorithm [1], which produces the optimal pairwise alignment between two sequences of proteins or nucleotides, is frequently used in genomics but is computationally expensive. This work demonstrates an efficient and scalable implementation of the Smith-Waterman algorithm for nucleotides, written in OpenCL and run on a Xilinx Virtex-7 FPGA PCIe accelerator card. A linear systolic array performs the wavefront computation, with local memory caching and multiple simultaneous queries. The implementation achieved 77 GCUPS: about 30% faster than the GPU implementation, 2.6x the throughput of the 60-core coprocessor and 3.9x that of the 12-core CPU. The architectural choices behind these results are discussed.

• OpenCL can model systolic arrays on FPGAs, previously only accessible by writing RTL
• SIMD Striped Smith-Waterman on FPGA is inferior to the systolic array implementation
• Memory bandwidth internal and external to the accelerator was measured and optimized
• Significant performance increase achieved compared to published work using a single device, and an order of magnitude better GCUPS/Watt as a power efficiency metric
• System performance on the FPGA requires optimizing an area/throughput curve for choosing kernel sizes and balancing

Future Work:
Smith-Waterman is one of many algorithms used in platforms and tools such as GATK/Gamgee, BWA-MEM and Bowtie2. Similar optimization techniques will be required to accelerate the full genomics workflow.

References:
[1] Smith, T.F. and Waterman, M.S., 1981, Identification of common molecular subsequences. J. Mol. Biol., 147, 195-197.
[2] Zhao, M., et al., 2013, SSW library: an SIMD Smith-Waterman C/C++ library for use in genomic applications. PLoS One, 8, e82138.
[3] Liu, Y., et al., 2014, SWAPHI-LS: Smith-Waterman Algorithm on Xeon Phi Coprocessors for Long DNA Sequences. IEEE International Conf. on Cluster Computing 2014, 257-265.
[4] Liu, Y., Schmidt, B., 2014, GSWABE: faster GPU-accelerated sequence alignment for optimal alignment retrieval for short DNA sequences. Concurrency and Computat.: Pract. Exper. 2015; 27:958-972.

[Figure: GCUPS - One Kernel - Systolic. x-axis: #PE per kernel (0 to 144); y-axis: GCUPS (0 to 10).]

[Figure: GCUPS - Parallel Kernels - Systolic. x-axis: #PE per kernel (8, 16, 32, 64, 128); y-axes: #Instances and MAX GCUPS.]

Best system acceleration: 32 PE/Kernel

16 SW Kernels achieving 77 GCUPS
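As a rough sanity check on this operating point, the sketch below computes an upper bound on throughput under two assumptions that the poster does not state outright: each PE updates at most one cell per clock cycle, and the kernels run at the 200 MHz quoted in the memory table heading. The function name is illustrative.

```python
def peak_gcups(kernels, pes_per_kernel, clock_hz):
    """Upper bound on GCUPS if every PE updates one cell on every clock cycle."""
    return kernels * pes_per_kernel * clock_hz / 1e9


if __name__ == "__main__":
    peak = peak_gcups(kernels=16, pes_per_kernel=32, clock_hz=200e6)
    measured = 77.0  # GCUPS reported on the poster
    print(f"peak {peak:.1f} GCUPS, measured/peak = {measured / peak:.0%}")
```

Under those assumptions the bound is 102.4 GCUPS, so the reported 77 GCUPS would correspond to roughly three quarters of the ideal rate.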

* Power Based on TDP

Hardware:
• Xilinx Virtex-7 XC7VX690T
• Alpha Data ADM-PCIE-7V3
Software:
• OpenCL in Xilinx SDAccel 2015.1

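// rows: number of steps of the outer loop; MAXPE: PEs in the linear array
for (int i = 0; i < rows; i++) {
    // Xilinx SDAccel attribute: pipeline the inner loop over the PEs
    __attribute__((xcl_pipeline_loop))
    for (int j = 0; j < MAXPE; j++) {
        // each PE sees its own state and that of its left neighbour
        // (for j == 0, myPE[j-1] must refer to a boundary element)
        executePE(i, j, &myPE[j], &myPE[j-1]);
    }
}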

Future Optimizations:
• Adjusting score bitwidths
• Improving DDR3 memory interface
• Reducing kernel area
• Inter-Task vs Intra-Task Parallelism

Type      KBits      Memory update rate @200MHz (Tb/s)   Usage
FF        538        105.25                              Internal PE score values
LUTRAM    7078       21.60                               Store Sample/Reference strings per PE
RAMB      34380      13.12                               Read/write between DDR3 and FPGA
DDR3      8388608    0.08                                Read/write between host and accelerator card

[Figure: GCUPS by device (Virtex-7, Xeon 12 Core, Xeon Phi 60 Core, Tesla K40); y-axis 0 to 100.]
[Figure: GCUPS/Watt by device (Virtex-7, Xeon 12 Core, Xeon Phi 60 Core, Tesla K40); y-axis 0 to 3.]

STEPS: For each cell i,j
1) Compare characters at Q(i), R(j)
2) Add match/mismatch score to Northwest
3) Subtract mismatch score from North, West
4) Store max value at H(i,j)

Results of the Smith-Waterman calculation typically include the maximum of all H(i,j) scores along with its location (i,j). Some implementations also include backtracking, which can produce an alignment showing where the two substrings differ, e.g.

ACT-AGCAC
A-TCAGCAC
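The four steps can be written as a plain software reference model. The Python sketch below is not the FPGA kernel: the score values (match=2, mismatch=-1, gap=1) and all names are assumptions chosen for illustration, and a separate gap penalty is used in step 3. It returns the maximum score and its location, without backtracking.

```python
def smith_waterman(query, ref, match=2, mismatch=-1, gap=1):
    """Return (best_score, (i, j)) for the local alignment of query against ref.

    Scores are illustrative assumptions, not values from the poster.
    """
    rows, cols = len(query) + 1, len(ref) + 1
    H = [[0] * cols for _ in range(rows)]  # row 0 and column 0 stay at zero
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            # 1) compare characters at Q(i), R(j)
            s = match if query[i - 1] == ref[j - 1] else mismatch
            # 2) add the match/mismatch score to the Northwest cell
            diag = H[i - 1][j - 1] + s
            # 3) subtract the penalty from the North and West cells
            north = H[i - 1][j] - gap
            west = H[i][j - 1] - gap
            # 4) store the maximum (never below zero) at H(i,j)
            H[i][j] = max(0, diag, north, west)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    return best, best_pos


if __name__ == "__main__":
    print(smith_waterman("ACTAGCAC", "ATCAGCAC"))
```

Every cell depends only on its North, West and Northwest neighbours, which is what lets cells on the same anti-diagonal be updated in the same clock cycle in hardware.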

Algorithm (FPGA)              GCUPS
Striped Smith-Waterman [2]    0.6
Linear Systolic (w/32 PEs)    3.2

Approaches to Smith-Waterman on SIMD processors include striped approaches, tiled approaches, and intra-task and inter-task parallelism. Writing the algorithm in OpenCL let us try several of these. The Striped Smith-Waterman [2] approach was prototyped in OpenCL on the FPGA and compared with the linear systolic array. The linear systolic array was approximately 5.3x faster. The architecture chosen for an FPGA must be computationally efficient: every clock cycle should maximize the number of cells being updated. Striped Smith-Waterman, while performing well on SIMD processors compared to other algorithms, has the drawback of separating horizontal and vertical dependencies. During the vertical dependency calculation the vector units become idle, losing efficiency and performance.

$$
H_{i,j} = \max
\begin{cases}
H_{i-1,j-1} + s(a_i, b_j) \\
H_{i-1,j} - W_k \\
H_{i,j-1} - W_l \\
0
\end{cases}
$$

Systolic solutions to Smith-Waterman range from 2-D meshes to linear arrays. Linear arrays model the wavefront nature of a single query. The number of processing elements (PEs) on the wavefront can vary from 1 up to the maximum diagonal. Performance on an FPGA is directly tied to area efficiency, measured by the chip area used per kernel. Single-kernel performance, in both queries per second and GCUPS, increases with the number of PEs per kernel; total system performance, however, peaks at 32 PEs per kernel. At 32 PEs per kernel there was a balance between the idle time of a larger number of PEs and the area overhead of kernel communication.
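The trade-off between PE count and idle time can be seen in a small software model of the wavefront. The Python sketch below is only a model under stated assumptions, not the OpenCL kernel: it assigns one reference column to each PE, cuts a longer reference into stripes of `n_pe` columns, carries the last column of each stripe forward in a boundary array, and counts one clock step per anti-diagonal. The scores and every name are hypothetical.

```python
def sw_wavefront(query, ref, n_pe, match=2, mismatch=-1, gap=1):
    """Smith-Waterman best score computed in anti-diagonal (wavefront) order.

    Returns (best_score, steps), where steps counts wavefront clock steps
    summed over all stripes of n_pe reference columns.
    """
    m = len(query)
    best, steps = 0, 0
    boundary = [0] * (m + 1)  # last column of the previous stripe (zeros at first)
    for start in range(0, len(ref), n_pe):
        stripe = ref[start:start + n_pe]
        w = len(stripe)
        H = [[0] * (w + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            H[i][0] = boundary[i]
        # Cells with the same i + j are independent, so one anti-diagonal
        # corresponds to one step in which PE j handles row i = d - j.
        for d in range(2, m + w + 1):
            for j in range(1, w + 1):
                i = d - j
                if 1 <= i <= m:
                    s = match if query[i - 1] == stripe[j - 1] else mismatch
                    H[i][j] = max(0, H[i - 1][j - 1] + s,
                                  H[i - 1][j] - gap, H[i][j - 1] - gap)
                    best = max(best, H[i][j])
            steps += 1
        boundary = [H[i][w] for i in range(m + 1)]
    return best, steps


if __name__ == "__main__":
    for n_pe in (1, 2, 4):
        score, steps = sw_wavefront("ACGT", "ACGT", n_pe)
        busy = 16 / (steps * n_pe)  # 16 cell updates for a 4 x 4 matrix
        print(n_pe, score, steps, f"{busy:.0%}")
```

In this toy case 4 PEs finish in 7 steps but are busy only about 57% of the time, while 2 PEs need 10 steps at 80% utilisation, which mirrors the idle-time effect described above for larger PE counts.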

Genome sequencing software typically starts with sample and reference files (1) and, using various techniques, can identify areas of the genome where the initial alignment may have been sub-optimal. These initial alignments may use algorithms such as the Burrows-Wheeler Transform. The software then produces the reference and sample data (2) for these suspect areas for full Smith-Waterman alignment. Our acceleration platform begins at this step (3), where the sample and reference data have been identified and are sent through the OpenCL framework (3). The samples and references of various alignments are combined to minimize memory transfer overhead to the global memory area on the FPGA accelerator card (4,5). Inside the OpenCL compute unit, 16 sample/reference pairs are run simultaneously and the results are sent back to the OpenCL host (6). Alignment locations can be used by the alignment software for further sequencing.
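The poster does not give the packing format used on the host. As a minimal sketch, assuming a 2-bit code per nucleotide (consistent with the 2 bits per entry shown for the Sample Array) and four bases per byte, host-side packing and unpacking could look like the following; the code table and function names are hypothetical.

```python
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit encoding
BASES = "ACGT"


def pack(seq):
    """Pack a nucleotide string into bytes, four bases per byte, low bits first."""
    out = bytearray((len(seq) + 3) // 4)
    for k, base in enumerate(seq):
        out[k // 4] |= CODES[base] << (2 * (k % 4))
    return bytes(out)


def unpack(data, length):
    """Recover the first `length` bases from packed bytes."""
    return "".join(BASES[(data[k // 4] >> (2 * (k % 4))) & 3]
                   for k in range(length))


if __name__ == "__main__":
    packed = pack("ACTAGCAC")
    print(len(packed), unpack(packed, 8))
```

With this encoding a 150-base read occupies 38 bytes instead of 150, which is the kind of reduction that helps when many sample/reference pairs are combined into one transfer.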

* Goal: Find the optimal substring match using score values for matches and mismatches of nucleotides or proteins
* Performance measure in GCUPS: Billion Cell Updates/Sec

One of the key aspects of achieving performance was managing the various memory systems inside the FPGA. The OpenCL Platform Model separates memory into global, local and private, but within private memory Xilinx FPGAs have three different types. During step 4 above, samples are transferred to the accelerator card DDR3. In step 5, the FPGA begins a burst DDR3 read and copies all sample/reference pairs into RAMB blocks. Within the FPGA, each of our 16 Smith-Waterman kernels then reads from RAMB into LUTRAMs. Finally, to perform the PE computation, all intermediate values are stored in flip-flops.

[Figure: workflow diagram with numbered steps 1 to 7. Labels: DNA reads (FASTA/BAM format); nucleotide aligner (e.g. GATK, BWA); Read Seq and Ref Seq; OpenCL host data packing and compression; FPGA with 16 Smith-Waterman kernels (Alpha Data ADM-PCIE-7V3, Xilinx Virtex-7 XC7VX690T); OpenCL host score unpacking.]
Typical sizes: Read 50-150 base pairs; Ref 250-500 base pairs

Attribute to allow full anti-diagonal of PEs to run in parallel

[Figure: alignment matrix of ACTAGCAC against ATCAGCAC, with cells labelled 0 to 4.]

[Figure: linear PE array (PEs A, B, C, D). Sample Array (LUTRAM): 2 bits * ROW. Temporary Cell Array, used when reference length > #PE: sizeof(short) * ROW bits.]