Scalable Matrix Multiplication for the 16-Core Epiphany Co-Processor
Louis Loizides, May 2nd, 2015
Parallella Board
• 16-core MIMD Epiphany co-processor
• Zynq ARM processor / FPGA
Image from Adapteva
Epiphany Versions
• 32 GFLOPS: 16-core Epiphany on the Parallella
• 5 TFLOPS?: 4096-core Epiphany
Graphic from Adapteva
Compiling
• Host program: *.c → gcc → host prog (linked against the HAL)
• Device program: *.c → e-gcc (with the linker description file, the ELDF) → *.elf → e-objcopy → *.srec device prog
• Execution also requires the hardware definition file
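The flow above might look like this in a build script. This is a sketch: the file names, the `internal.ldf` linker file, and the exact flags follow the patterns used in eSDK examples and are assumptions here, not the author's actual build.

```make
# Host side: ordinary gcc, linked against the Epiphany HAL (libe-hal)
host: main.c
	gcc main.c -o hostprog -le-hal

# Device side: cross-compile with e-gcc against a linker description
# file (LDF) that pins code and data into the 32 kB of core SRAM
device: e_task.c
	e-gcc -T internal.ldf e_task.c -o e_task.elf -le-lib
	# Convert the ELF into an SREC image the loader streams to a core
	e-objcopy --srec-forceS3 --output-target srec e_task.elf e_task.srec
```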
Challenges
• Hard to code. The need for very manual memory allocation and management makes complex coding difficult.
• Hard to debug. The Epiphany doesn’t share memory with Linux.
• Temperature. After a week of frustration I realized I needed to put a fan over it.
• Documentation. The SDK and examples are poor and frequently broken; there are few beginner examples and a small community of users.
My “thermal management solution”
Process Synchronization
• Each core runs a process, not a thread
– Every core can run a different process
– “Workgroups” can be created in the SDK
• Functions exist in OpenCL, COPRTHR and the eSDK for synchronizing processes
– Mutexes are only provided between cores
– SDK examples tend to spin-wait for single bits to change for synchronization
• MPI and OpenMP are currently not supported for the co-processor
– Some “community” projects are in the works… not much of a community, though
Memory Management
• “Shared” DRAM
– Memory allocated specifically for the Epiphany using e_alloc
– 160 MB/s (https://parallella.org/forums/viewtopic.php?f=10&t=1978)
• SRAM in each core
– Only 32 kB available
– 4 GB/s (1 GB/s per DMA channel in practice)
– Use the DMA channel functions to transfer memory between cores
– Can’t use malloc! Allocations must be tracked manually
– Have to know the addresses on the other cores you want to send data to
– Must watch out for both code size and stack growth
[Diagram: the 32 kB of per-core memory holds the program, the stack, and the matrix buffers (essentially the heap)]
Chip Architecture
• 32 kB SRAM per core for program + stack
• ~2 GB/s DMA transfers between cores
• ~150 MB/s to transfer to/from shared DRAM
• The DMA engine frees up the processor
Graphic from Adapteva
SUMMA/Blocking Implementation
• Block the matrix
• Each core copies its designated sub-block (shared DRAM ↔ Epiphany at ~150 MB/s; core-to-core DMA at ~2 GB/s)
• Execute SUMMA on the sub-blocks
Example code: copy sub-blocks from shared DRAM to the Epiphany
Note: ~1000×1000 matrix size limitation due to the Parallella Linux shared-memory size
Results

[Chart: matrix multiplication execution time (s) vs. matrix side size (0–1000) for a single Epiphany core, 2×2, 3×3 and 4×4 core grids, and naive and blocked ARM implementations]
Epiphany Version | Grid Side Size | Epiphany Time (s) | Speedup vs. Single Core
                 | 1              | 317.2             | 1
                 | 2              | 80.9              | 3.92
                 | 3              | 35.43             | 8.95
E16G3            | 4              | 21.5              | 14.76
E64G4            | 8              | 7.7               | 41.24
E256G4           | 16             | 1.98              | 160.02
E1KG4            | 32             | 0.51              | 620.96
E4KG4            | 64             | 0.13              | 2409.56
More cores -> Larger Blocks -> Exponentially Less Blocking
[Chart: speedup (vs. single core) vs. grid side size (1–4), with power-law fit f(x) = 1.00827869658563·x^1.95623152724166, R² = 0.999495920088555; larger grid sizes estimated]
Conclusions
• Potentially powerful device, especially in embedded AI applications with large search spaces
– Needs passive cooling
• 32 kB SRAM is extremely limiting
– Needs either an L2 cache or some kind of faster near-chip shared memory
– Really a limitation of the Parallella architecture, not the Epiphany
• Incredibly difficult to code
– SDK & documentation need improvement
– Better debugging tools needed ASAP!