An Evaluation of an Energy Efficient Many-Core SoC with Parallelized Face Detection

Hiroyuki Usui, Jun Tanabe, Toru Sano, Hui Xu, and Takashi Miyamori

Toshiba Corporation, Kawasaki, Japan

Copyright 2013, Toshiba Corporation.

Executive Summary

• Future architectures will have many cores
• A key challenge: how to use them efficiently?
• We evaluated techniques to accelerate one important type of application (face detection)
• Performance scales up to 64 cores
• Energy efficiency is 20x better than a desktop CPU

Outline

• Introduction
• Face Detection using Joint Haar-Like Features
• Architecture of Energy Efficient Many-Core SoC
• Issues in Implementing Parallelized Face Detection
• Implementation and Evaluation of Parallelized Face Detection
  – On the Single Cluster
  – On the Dual Cluster
• Conclusion

Two Key Trends in Embedded Systems

• Trend 1: New applications (e.g. image recognition) need more computing power while keeping power consumption low
• Trend 2: New architectures can enable much more parallelism than before
  – Now: 500GOPS, Heterogeneous Multi-Core (Visconti™2 [ISSCC’12])
  – Future: more than 1TOPS, Heterogeneous Many-Core (Toshiba Many-Core [VLSI Sympo. ’12])

[Diagram: both chips drawn as CPU blocks plus Accelerator blocks, with many more of each in the many-core]

Result: a need for efficient and scalable application performance on many-core

Power and Performance Target of our Many-Core

[Figure: Power Consumption [W] (log scale, 0.1 to 1000) versus Performance [GOPS] (log scale, 10 to 1000)]

• High Performance Multi & Many-Cores for HPC (GHz cores): Tilera® Tile64, Intel® 80-Tile, Cell Broadband Engine™, Intel® SCC
• Energy Efficient Embedded Multi-Cores: Toshiba Multi-Core (Visconti™2), ARM® Cortex™-A5, Renesas RP-X
• Less than 3W is needed for embedded applications
• Our Target Area: Toshiba Energy Efficient Many-Core (64 cores)

Many-Core Scalability

[Figure: Performance versus The Number of Cores, with an "Ideal" curve and an "Actual?" curve]

Can we achieve good performance scaling on face detection?

Face Detection

• A 25 x 25 pixel ROI (Region of Interest) is moved across the image
• For each ROI, check if a face exists or not
• The 2nd ROI is offset by 2 pixels from the 1st, the 3rd by 4 pixels, and so on up to the Nth ROI
• The (N+1)th ROI starts the next row of ROIs, again with a 2-pixel offset
• The scan continues until the whole image, including the target ROI, has been checked
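The scan order described on these slides can be sketched in a few lines of Python. This is only an illustration: the function name `roi_positions` is hypothetical, and the sketch assumes a single scale and the same 2-pixel step horizontally and vertically, which the slides suggest but do not state outright.

```python
from typing import Iterator, Tuple

ROI_SIZE = 25  # ROI edge length in pixels (from the slides)
STEP = 2       # offset between neighbouring ROIs (assumed equal in x and y)

def roi_positions(width: int, height: int,
                  roi: int = ROI_SIZE, step: int = STEP) -> Iterator[Tuple[int, int]]:
    """Yield the top-left (x, y) corner of every ROI in raster order."""
    for y in range(0, height - roi + 1, step):
        for x in range(0, width - roi + 1, step):
            yield (x, y)

# Tiny 31 x 29 image: 4 ROIs per row, 3 rows.
positions = list(roi_positions(31, 29))
print(positions[:3])   # [(0, 0), (2, 0), (4, 0)]
print(positions[4])    # (0, 2): the first ROI of the second row
print(len(positions))  # 12
```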

Joint Haar-Like Features [ICCV ‘05]

• Extension of the widely used Viola and Jones method [CVPR ‘01] (using Haar-like features)
• Haar-like feature: difference of image intensities between the blue and red rectangles inside the ROI (Region of Interest)
  – Example: the eye is darker than the cheek
• Each Haar-like feature is compared to its own threshold: if greater, the result is 1, otherwise 0
• The resulting bits (e.g. 1 1 0) together form one Joint Haar-Like Feature
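A minimal sketch of how such a joint feature could be computed is shown below. It is not the authors' implementation: the rectangle layout, the choice of which rectangle is added and which is subtracted, the thresholds, and all function names are assumptions made for illustration. The integral image is the standard trick from the Viola and Jones method for summing a rectangle in constant time.

```python
from typing import List, Sequence, Tuple

Rect = Tuple[int, int, int, int]  # (x, y, width, height); hypothetical layout

def integral_image(img: Sequence[Sequence[int]]) -> List[List[int]]:
    """ii[y][x] holds the sum of img over rows < y and columns < x."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii

def rect_sum(ii: List[List[int]], r: Rect) -> int:
    """Sum of pixel intensities inside rectangle r, using four lookups."""
    x, y, w, h = r
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]

def haar_feature(ii: List[List[int]], plus: Rect, minus: Rect) -> int:
    """Difference of intensities between two rectangles."""
    return rect_sum(ii, plus) - rect_sum(ii, minus)

def joint_feature(ii, features, thresholds) -> int:
    """Threshold each Haar-like feature to one bit and pack the bits."""
    code = 0
    for (plus, minus), t in zip(features, thresholds):
        bit = 1 if haar_feature(ii, plus, minus) > t else 0
        code = (code << 1) | bit
    return code

img = [[1, 1, 9, 9],
       [1, 1, 9, 9],
       [5, 5, 5, 5],
       [5, 5, 5, 5]]
ii = integral_image(img)
features = [((2, 0, 2, 2), (0, 0, 2, 2)),   # 36 - 4  =  32
            ((0, 2, 2, 2), (0, 0, 2, 2)),   # 20 - 4  =  16
            ((0, 0, 2, 2), (2, 0, 2, 2))]   # 4  - 36 = -32
code = joint_feature(ii, features, thresholds=[10, 5, 0])
print(bin(code))  # 0b110, the "1 1 0" pattern used as an example on the slide
```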

Classifier using Joint Haar-Like Features

• Each Joint Haar-Like Feature is looked up in a table that gives the possibility of face or not face ("Face?")
• The table outputs are accumulated, each scaled by the weight of its feature
• The accumulated value decides: Face or Not Face
• Positions of features and tables are learned in advance and stored in the dictionary
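The lookup-and-accumulate step could look roughly like the sketch below. The slides do not give the table contents, the weight values, or the decision threshold, so the dictionary layout, the +1/-1 votes, and the name `classify` are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical dictionary entry: (weight, table). The table maps a joint-feature
# code to a vote: +1 for "face", -1 for "not face". Codes missing from the
# table are treated as "not face" in this sketch.
Entry = Tuple[float, Dict[int, int]]

def classify(codes: List[int], dictionary: List[Entry], threshold: float = 0.0) -> bool:
    """Accumulate weighted table votes and compare the total to a threshold."""
    score = 0.0
    for code, (weight, table) in zip(codes, dictionary):
        score += weight * table.get(code, -1)
    return score > threshold

dictionary = [(0.8, {6: +1}), (0.5, {3: +1}), (0.2, {1: +1})]
print(classify([6, 0, 1], dictionary))  # True:  0.8 - 0.5 + 0.2 > 0
print(classify([0, 3, 0], dictionary))  # False: -0.8 + 0.5 - 0.2 < 0
```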

Characteristics of Face Detection

• Face detection for each ROI can be executed in parallel
• There are a lot of ROIs in an image
  – 3M ROIs when the image size is 4000x3200
• A lot of coarse-grained thread parallelism based on ROIs
  – Overhead of thread scheduling can be minimized

Many-core is good for face detection!
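The 3M figure can be checked with a short calculation. The sketch assumes the 25-pixel ROI and 2-pixel step shown on the earlier Face Detection slides, applied in both directions at a single scale; `roi_count` is a hypothetical helper name.

```python
def roi_count(width: int, height: int, roi: int = 25, step: int = 2) -> int:
    """Number of ROI positions for a sliding window at one scale."""
    cols = (width - roi) // step + 1
    rows = (height - roi) // step + 1
    return cols * rows

print(roi_count(4000, 3200))  # 3156944, i.e. about 3M ROIs
```

Under these assumptions the count is 1988 x 1588 positions, which agrees with the "3M ROIs" stated on the slide.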

Chip Micrograph and Features

Technology          40nm LP Process
Interconnect        8 metal (Cu)
Chip Size           15.0mm x 14.0mm
Cluster Size        7.4mm x 5.7mm
Transistors         87.5 Million
Cluster Frequency   333MHz, 1.1V
Package             1369-pin FCBGA

[Chip micrograph, 15.0mm x 14.0mm. Labelled blocks: Cluster 0 and Cluster 1 (7.4mm x 5.7mm each), Core, L2 Cache Bank0 to Bank3, 2MB L2 Cache (one per cluster), Reconfigurable Engines, and two DDR3 I/F blocks]

Structure of Many-Core Cluster

• Tree-based NoC
  – Leaf nodes: cores
  – Root nodes: L2 cache banks
• Four L2 cache banks (Bank0 to Bank3), 512 KB each
• Cluster Control Module
  – Interrupt Controller
  – Hardware Semaphores

[Diagram: 32 cores in groups of four, connected through routers labelled 1, 2, and 3 to the four L2 cache banks]

Many-Core SoC Architecture

[Block diagram. Labelled blocks:]

• Many-Core Cluster 0: Core x 32, L2 Cache 2MB
• Many-Core Cluster 1: Core x 32, L2 Cache 2MB
• ARM Cortex-A9 x 2
• Reconfigurable Engine x 2
• Image Recognition & Processing Accelerators
• SRAM 512KB x 4
• DDR3 Controller x 2 (10.7GB/s)
• External Interface: PCIe x 1
• Video In x 4, Video Out, Peripherals

Core : Media Processing Block (MPB)

• 3-Way VLIW Processor
  – 32b RISC Core
  – 2-way SIMD Coprocessor
• L1 Instruction Cache: 32KB
• L1 Data Cache: 16KB
• 333 MHz

Exploits multi-grain parallelism
• Thread level by many cores
• Instruction level by VLIW architecture
• Data level by SIMD instructions

[Block diagram of the 3-Way VLIW Processor (MPB): RISC Core (Dec./RF, ALU, I$, D$), SIMD Co-Processor (Dec., Cop. RF, two ALUs, MUL, two Acc., two 64b paths), L1 Inst Cache 32KB, L1 Data Cache 16KB, Debug Module, NoC I/F, Address Protect Unit, Address Translate Unit]

Issues in Implementing Parallelized Face Detection

• High coarse-grained parallelism: good for parallelization
  – There are enough ROIs to keep many cores busy
• Imbalanced workload: bad for processor utilization
  – The workload of an ROI where a face exists is higher than that of an ROI without a face

Implementation of parallelized face detection
  – Minimize the number of threads in order to reduce synchronization cost
    • Allocate one thread to one core
  – Find a good thread partitioning that balances the workload of the threads
  – Reduce data bandwidth (L1$-L2$ and L2$-DDR3)

Implementation on the Single Cluster

• We implemented the face detection with two methods of allocating the image to cores
  – Allocating Cyclically
  – Splitting Equally

(1) Allocating Cyclically

[Diagram: lines of ROIs in the image assigned in turn to Core0, Core1, Core2, ..., Core31]

• This method allocates lines to each core cyclically
• Effective in balancing workload

(2) Splitting Equally

[Diagram: the image (Height lines) split into 32 bands of Height/32 lines, one each for Core0, Core1, Core2, ..., Core31]

• This method divides the image evenly
• Effective in reducing the data size read by each core
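The two allocation methods can be compared with a small sketch. The function names and the synthetic per-line costs are hypothetical; the only facts taken from the slides are that lines are handed out either cyclically or in equal contiguous bands, and that ROIs containing faces cost more than ROIs without faces.

```python
from typing import List, Sequence

def allocate_cyclically(num_lines: int, num_cores: int) -> List[List[int]]:
    """Core c gets lines c, c + num_cores, c + 2 * num_cores, ..."""
    return [list(range(core, num_lines, num_cores)) for core in range(num_cores)]

def split_equally(num_lines: int, num_cores: int) -> List[List[int]]:
    """Core c gets one contiguous band of about num_lines / num_cores lines."""
    chunk = -(-num_lines // num_cores)  # ceiling division
    return [list(range(c * chunk, min((c + 1) * chunk, num_lines)))
            for c in range(num_cores)]

def imbalance(assignment: List[List[int]], cost: Sequence[int]) -> float:
    """Load of the busiest core divided by the mean load (1.0 is perfect)."""
    loads = [sum(cost[i] for i in lines) for lines in assignment]
    return max(loads) / (sum(loads) / len(loads))

# Synthetic workload: 16 lines, faces (cost 5) clustered in lines 4 to 7.
cost = [1] * 4 + [5] * 4 + [1] * 8

print(allocate_cyclically(8, 4))  # [[0, 4], [1, 5], [2, 6], [3, 7]]
print(split_equally(8, 4))        # [[0, 1], [2, 3], [4, 5], [6, 7]]
print(imbalance(allocate_cyclically(16, 4), cost))  # 1.0
print(imbalance(split_equally(16, 4), cost))        # 2.5
```

In this toy case the expensive lines are spread over all cores by the cyclic scheme, while the equal split gives them all to one core, which is the load-balancing effect the slides attribute to Allocating Cyclically.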

Images for Evaluation

• High-resolution images (5.76-12.7Mp) including many faces

No.  Resolution  Number of Faces
0    4000x1440   30
1    3000x4082   37
2    4083x3062   78
3    4094x3107   148
4    3568x2568   9
5    3568x2568   10

[Figure: Face Detection Result of Image 4]

Evaluation Board

[Photo of the evaluation board]

• Many-Core SoC (fan-less cooling)
• I/O and switches for evaluation

Relative Performance on Single Cluster

[Figure: Relative Performance (0 to 35) versus Number of Cores (1, 2, 4, 8, 16, 32) for img.0 to img.5 and the ideal line, one panel each for Allocating Cyclically and Splitting Equally; annotated speedups: 15.5x, 30x, 11x, 21x]

Average Relative Performance on Single Cluster

[Figure: Average Relative Performance (0 to 35) versus The Number of Cores (0 to 40) for Allocating Cyclically, Splitting Equally, and ideal; annotated speedups: 11x, 21x, 15.5x, 30x]

With Allocating Cyclically, performance scales up to 32 cores

Execution Time of the Fastest and Slowest Cores

[Figure: Time (sec, 0 to 45) of the fastest core and the slowest core for Image Number 0 to 5, one panel each for Allocating Cyclically and Splitting Equally; annotated gaps: 11x and 1.1x]

Processor Utilization

• Allocating Cyclically: 90 ~ 95%
• Splitting Equally: 55 ~ 75%

Low processor utilization degrades the performance of Splitting Equally

[Figure: Processor Utilization (%, 0 to 100) for Image Number 0 to 5, Allocating Cyclically versus Splitting Equally]

Bandwidth of L2 Cache and DDR3

• L1-L2 bandwidth is nearly the same for both methods
  – The L1 cache is not large enough to store an ROI line
• For L2-DDR3, Allocating Cyclically is better
  – All cores access the same small area at the same time

[Figure: Relative Amount of Transferred Data (0 to 1.2) for L1-L2 and L2-DDR3, Allocating Cyclically versus Splitting Equally]

Implementation on Dual Cluster

• Each cluster has its own L2 cache, and the two clusters share DDR3
  – Because DDR3 bandwidth is narrower than that of the L1 and L2 caches, reducing traffic between the L2 cache and DDR3 is important
• We implemented two methods
  – Allocating Cyclically
  – Bisection

[Diagram: two clusters, each with MPB x 32 (every MPB with its own L1 cache) and one L2 cache, both connected to a shared DDR3]

(1) Allocating Cyclically

[Diagram: lines of ROIs in the image assigned in turn to Core0, Core1, Core2, ..., Core31 of Cluster0 and Core0, Core1, Core2, ..., Core31 of Cluster1]

• This method is the same as the one used on a single cluster
• Effective in balancing workload

(2) Bisection

[Diagram: the image divided into two blocks of Height/2 lines, one for Cluster0 and one for Cluster1; inside each block, lines of ROIs are assigned in turn to Core0, Core1, Core2, ..., Core31]

• This method divides the image into two blocks
• Each cluster processes one block
• Effective in reducing the data size read by each cluster
• Within each block, Allocating Cyclically is used
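A sketch of the Bisection assignment follows. The name `bisection` and the choice to split at the middle line with integer division are assumptions; the slides only state that the image is divided into two blocks and that lines are allocated cyclically within each block.

```python
from typing import List

def bisection(num_lines: int, cores_per_cluster: int = 32) -> List[List[List[int]]]:
    """Return result[cluster][core] = list of line indices for that core.

    The image is cut into two blocks of about num_lines / 2 lines, one per
    cluster, and lines inside each block are handed out cyclically.
    """
    half = num_lines // 2
    blocks = [range(0, half), range(half, num_lines)]
    return [[list(block[core::cores_per_cluster])
             for core in range(cores_per_cluster)]
            for block in blocks]

# Small example: 8 lines, 2 cores per cluster.
print(bisection(8, cores_per_cluster=2))
# [[[0, 2], [1, 3]], [[4, 6], [5, 7]]]
```

Each cluster only touches lines from its own half of the image, which is why the slide describes the method as reducing the data read by each cluster.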

Performance of Dual Cluster (64 Cores)

[Figure: Relative Performance (0 to 70) for Image Number 0 to 5, Allocating Cyclically versus Bisection, with the Ideal (64x) line; annotated speedups: 61x and 42x]

By Allocating Cyclically, performance scales up to 64 cores

Execution Time of the Fastest and Slowest Cores

[Figure: Time (sec, 0 to 18) of the fastest core and the slowest core for Image Number 0 to 5, one panel each for Allocating Cyclically and Bisection; annotated gaps: 1.3x and 2.8x]

Processor Utilization

• Allocating Cyclically: 87 ~ 95%
• Bisection: 66 ~ 91%

Low processor utilization degrades the performance of Bisection

[Figure: Processor Utilization (%, 0 to 100) for Image Number 0 to 5, Allocating Cyclically versus Bisection]

DDR3 Bandwidth

• Utilized bandwidth is 750MB/s (only 7% of the maximum, 10.7GB/s)
• Memory bandwidth is not a bottleneck even when two clusters operate

[Figure: DDR3 Bandwidth (MB/s, 0 to 800) per Image Number, Bandwidth in Allocating Cyclically]
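The 7% figure follows directly from the two numbers on the slide. The short check below assumes decimal units (1GB = 1000MB); the variable names are illustrative.

```python
used_mb_s = 750.0          # utilized DDR3 bandwidth from the slide
peak_mb_s = 10.7 * 1000.0  # 10.7GB/s maximum, assuming 1GB = 1000MB

fraction = used_mb_s / peak_mb_s
print(f"{fraction:.1%}")   # 7.0%
```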

Power Consumption

[Figure: Power (mW, 0 to 2500) versus The Number of Cores (1, 2, 4, 8, 16, 32, 64), broken down into Clusters, Bus, DDR3, IO, and Other. Typical process, room temperature, using Allocating Cyclically]

• 2 clusters: 1.18W
• SoC: 2.21W

Our many-core SoC achieves less than 3W

Comparison with Desk-Top CPU

• Compared with a desktop CPU (Core™-i7-3820: 3.6GHz, 4 cores, 8 threads)
• The TDP of the Core™-i7-3820 (130W) is used for calculating energy

[Figure: Relative Performance (0.0 to 1.2), Many-core versus Core™-i7-3820]

[Figure: Energy (J, 0 to 500): Many-core 16.1, Core™-i7-3820 443.5; annotated "20x better"]
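The relationship between the bar values can be worked through as below. This is an interpretation, not something the slides spell out: it assumes energy = power x time, that the desktop energy uses the 130W TDP as stated, and that the many-core energy uses the 2.21W SoC figure from the Power Consumption slide.

```python
manycore_energy_j = 16.1   # Energy (J) bar for the many-core
desktop_energy_j = 443.5   # Energy (J) bar for the Core i7-3820
desktop_tdp_w = 130.0      # TDP used for the desktop energy (from the slide)
manycore_power_w = 2.21    # assumed: SoC power from the Power Consumption slide

desktop_time_s = desktop_energy_j / desktop_tdp_w        # about 3.41 s
manycore_time_s = manycore_energy_j / manycore_power_w   # about 7.29 s
energy_ratio = desktop_energy_j / manycore_energy_j      # about 27.5

print(round(desktop_time_s, 2), round(manycore_time_s, 2), round(energy_ratio, 1))
```

Under these assumptions the many-core takes roughly twice as long as the desktop CPU but uses far less energy. The ratio of the two bar values is about 27.5, so the "20x better" quoted on the slide looks like a conservative round figure; the slides do not explain the difference.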

Conclusion

• Future architectures will have many cores
  – A key challenge: how to use them efficiently?
• We evaluated the many-core SoC with parallelized face detection
  – Many-core is suited to face detection because it exploits ROI-based coarse-grained parallelism efficiently
    • Scales up by 30x (32 cores) to 60x (64 cores)
    • Balancing workload is important
    • Power consumption is only 2.21W under actual workload: enables fan-less cooling
  – Our many-core SoC is remarkably energy efficient in image recognition applications
    • 20x better than the desktop CPU

Thank you!