
Page 1:

High Performance Low Power OSCAR Multicore and Compiler
Hironori Kasahara, Ph.D., IEEE Fellow
IEEE Computer Society President 2018
Professor, Dept. of Computer Science & Engineering
Director, Advanced Multicore Processor Research Institute
Waseda University, Tokyo, Japan
URL: http://www.kasahara.cs.waseda.ac.jp/

1987 IFAC World Congress Young Author Prize
1997 IPSJ Sakai Special Research Award
2005 STARC Academia-Industry Research Award
2008 LSI of the Year Second Prize
2008 Intel Asia Academic Forum Best Research Award
2010 IEEE CS Golden Core Member Award
2014 Minister of Edu., Sci. & Tech. Research Prize
2015 IPSJ Fellow
2017 IEEE Fellow, IEEE Eta Kappa Nu

Reviewed papers: 214; invited talks: 145; published unexamined patent applications: 59 (granted patents in Japan, US, GB, and China: 30); articles in newspapers, web news, and other media incl. TV: 572

1980 BS, '82 MS, '85 Ph.D., Dept. of EE, Waseda Univ.
1985 Visiting Scholar: U. of California, Berkeley
1986 Assistant Prof., 1988 Associate Prof., 1997 Professor, Waseda Univ., now Dept. of Computer Sci. & Eng.
1989-90 Research Scholar: U. of Illinois, Urbana-Champaign, Center for Supercomputing R&D
2004 Director, Advanced Multicore Research Institute
2017 Member: the Engineering Academy of Japan and the Science Council of Japan

Committees in Societies and Government: 245
IEEE Computer Society: President 2018, BoG (2009-14), Multicore STC Chair (2012-), Japan Chair (2005-07); IPSJ: Chair, HG for Mag. & J. Edit., SIG on ARC.
【METI/NEDO】 Project Leader: Multicore for Consumer Electronics, Advanced Parallelizing Compiler; Chair: Computer Strategy Committee
【Cabinet Office】 CSTP Supercomputer Strategic ICT PT, Japan Prize Selection Committees, etc.
【MEXT】 Info. Sci. & Tech. Committee; Supercomputer (Earth Simulator, HPCI Promotion, Next Gen. Supercomputer "K") Committees, etc.

Page 2:

2

Page 3:

IEEE Computer Society BoG (Board of Governors), Feb. 1, 2017

https://www.computer.org/web/cshistory/officers-2017

Page 4:

Past IEEE Computer Society Presidents

Chairs of the IRE Professional Group on Electronic Computers
1951-53 Morton M. Astrahan
1953-54 John H. Howard
1954-55 Harry Larson
1955-56 Jean H. Felker
1956-57 Jerre D. Noe
1957-58 Werner Buchholz
1958-59 Willis H. Ware
1959-60 Richard O. Endres
1960-62 Arnold A. Cohen
1962-64 Walter L. Anderson

Chairs of the AIEE Committee on Large-Scale Computing Devices
1946-49 Charles Concordia
1949-51 John Grist Brainerd
1951-53 Walter H. MacWilliams
1953-55 Frank J. Maginniss
1955-57 Edwin L. Harder
1957-59 Morris Rubinoff
1959-61 Ruben A. Imm
1961-63 Claude A. Kagan
1963-64 Gerhard L. Hollander

Chairs & Presidents of the IEEE Computer Society
1964-65 Keith Uncapher
1965-66 Richard I. Tanaka
1966-67 Samuel Levine
1968-69 Charles L. Hobbs
1970-71 Edward J. McCluskey
1972-73 Albert S. Hoagland
1974-75 Stephen S. Yau
1976 Dick B. Simmons
1977-78 Merlin G. Smith
1979-80 Tse-Yun Feng
1981 Richard E. Merwin
1982-83 Oscar N. Garcia
1984-85 Martha Sloan
1986-87 Roy L. Russo
1988 Edward A. Parrish
1989 Kenneth A. Anderson
1990 Helen M. Wood
1991 Duncan H. Lawrie
1992 Bruce D. Shriver
1993 James H. Aylor
1994 Laurel V. Kaleda
1995 Ronald G. Hoelzeman
1996 Mario R. Barbacci
1997 Barry W. Johnson
1998 Doris L. Carver
1999 Leonard L. Tripp
2000 Guylaine M. Pollock
2001 Benjamin W. Wah
2002 Willis K. King
2003 Stephen Diamond
2004 Carl K. Chang
2005 Gerald L. Engel
2006 Deborah M. Cooper
2007 Michael R. Williams
2008 Rangachar Kasturi
2009 Susan K. (Kathy) Land
2010 James D. Isaak
2011 Sorel Reisman
2012 John W. Walz
2013 David Alan Grier
2014 Dejan S. Milojicic
2015 Thomas M. Conte
2016 Roger U. Fujii
2017 Jean-Luc Gaudiot
2018 Hironori Kasahara

Page 5:

5

IEEE Computer Society
• 60,000+ members, volunteer-led organization
• 200 technical conferences
• industry-oriented "Rock Stars"
• 17 scholarly journals and 13 magazines
• awards program
• Digital Library with more than 550,000 articles and papers
• 400 local and regional chapters
• 40 technical committees

Page 6:

Toward 2018
1. Refining content and services to further improve the satisfaction of CS members;
2. Considering an incentive for volunteers to further accelerate CS activities and promptly provide technical benefits for people around the globe; to express appreciation to volunteers: CS Point (Mileage) System: Annual & Lifetime Honor, Premier Seating, Premier Registration, Distinguished Reviewer, etc.;
3. Offering more attractive services for practitioners in industry;
4. Providing the world's best educational content and historical treasures for future generations, which only the CS can create with our pioneering researchers (for example, the Multicore Compiler Video Series found at www.computer.org/web/education/multicore-video-series);
5. Thinking about sustainable membership fees while considering the diversity of economic situations within the 10 regions;
6. Cooperating with other IEEE societies and sister societies in a timely and efficient manner;
7. Intelligibly introducing the latest computer-related technologies to younger generations, including children, so that they can realize their technological dreams.

6

Page 7:

[Figure: RP-2 8-core chip layout. Core #0 to Core #7, each with I$, D$, ILRAM, DLRAM, and URAM; snoop controllers SNC0 and SNC1; DBSC, DDR pad, GCPG, SM, LBSC, and VSWC; connected by the SHWY on-chip bus]

Multicores for Performance and Low Power

IEEE ISSCC 2008, Paper No. 4.5: M. Ito, … and H. Kasahara, "An 8640 MIPS SoC with Independent Power-off Control of 8 CPUs and 8 RAMs by an Automatic Parallelizing Compiler"

Power ∝ Frequency × Voltage². Since Voltage ∝ Frequency, Power ∝ Frequency³.

If the frequency is reduced to 1/4 (e.g., 4 GHz → 1 GHz), power is reduced to 1/64 and performance falls to 1/4.
<Multicores> If 8 such cores are integrated on a chip, power is still only 1/8 while performance becomes 2 times the original.
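As a quick numeric check of this scaling argument, the following minimal C sketch (illustrative only, not part of the OSCAR tool chain; the function and variable names are ours) evaluates the cubic power model for the 1/4-frequency, 8-core case described above.

    #include <stdio.h>

    /* Relative power of one core running at relative frequency f (1.0 = full speed),
     * assuming Voltage scales with Frequency, so Power is proportional to Frequency^3. */
    static double relative_power(double f) { return f * f * f; }

    int main(void) {
        double f = 0.25;   /* frequency scaled to 1/4, e.g. 4 GHz -> 1 GHz */
        int cores = 8;     /* eight such cores integrated on one chip      */

        double power_one  = relative_power(f);   /* 1/64 of the original power       */
        double power_chip = cores * power_one;   /* 8/64 = 1/8 of the original power */
        double perf_chip  = cores * f;           /* 8 * 1/4 = 2x the original speed  */

        printf("one slow core : power %.4f, performance %.2f\n", power_one, f);
        printf("8-core chip   : power %.4f, performance %.2f\n", power_chip, perf_chip);
        return 0;
    }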

7

Power consumption is one of the biggest problems for performance scaling, from smartphones to cloud servers and supercomputers (the "K" computer consumes more than 10 MW).

Page 8:

Earthquake wave propagation simulation GMS developed by National Research Institute for Earth Science and Disaster Resilience (NIED)

An automatic parallelizing compiler available on the market gave no speedup on 64 cores over the 1-core execution time; execution with 128 cores was even slower than on 1 core (0.9 times speedup).

The advanced OSCAR parallelizing compiler gave a 211 times speedup with 128 cores against the 1-core execution time with the commercial compiler; on 1 core, the OSCAR compiler already gave a 2.1 times speedup over the commercial compiler through global cache optimization.

8

Parallel software is important for scalable multicore performance (LCPC 2015): just adding more cores does not give speedup. The development cost and time of parallel software are becoming a bottleneck in the development of embedded systems, e.g., IoT and automobiles.

Fujitsu M9000 SPARC Multicore Server

OSCAR Compiler gives us 211 times speedup with 128 cores

Commercial compiler gives 0.9 times speedup with 128 cores (slower than on 1 core)

Page 9:

Power Reduction of MPEG2 Decoding to 1/4 on the 8-Core Homogeneous Multicore RP-2 by the OSCAR Parallelizing Compiler

MPEG2 decoding with 8 CPU cores (cores #0 to #7):
Without power control (voltage 1.4 V): average power 5.73 [W]
With power control (frequency scaling and resume standby: power shutdown and voltage lowering between 1.4 V and 1.0 V): average power 1.52 [W]
73.5% power reduction

Page 10:

Demo of the NEDO Multicore for Real-Time Consumer Electronics at the Council of Science and Engineering Policy on April 10, 2008

CSTP Members
Prime Minister: Mr. Y. FUKUDA
Minister of State for Science, Technology and Innovation Policy: Mr. F. KISHIDA
Chief Cabinet Secretary: Mr. N. MACHIMURA
Minister of Internal Affairs and Communications: Mr. H. MASUDA
Minister of Finance: Mr. F. NUKAGA
Minister of Education, Culture, Sports, Science and Technology: Mr. K. TOKAI
Minister of Economy, Trade and Industry: Mr. A. AMARI

Page 11:

<R&D Target> Hardware, software, and applications for super low-power manycores: more than 64 cores, natural air cooling (no fan); cool, compact, clear, and quiet; operational by solar panel.
<Industry, Government, Academia> Hitachi, Fujitsu, NEC, Renesas, Olympus, Toyota, Denso, Mitsubishi, Toshiba, OSCAR Technology, etc.
<Ripple Effect> Low CO2 (carbon dioxide) emissions; creation of value-added products: automobiles, medical, IoT, servers.

Green Computing Systems R&D Center, Waseda University
Supported by METI (completed March 2011)
Beside the Waseda subway station, near the Waseda Univ. main campus

11

Hitachi SR16000: Power7 128-core SMP
Fujitsu M9000: SPARC VII 256-core SMP

Page 12:

To improve effective performance, cost-performance and software productivity and reduce power

OSCAR Parallelizing Compiler

• Multigrain Parallelization (LCPC 1991, 2001, 04): coarse grain parallelism among loops and subroutines (2000 on SMP) and near-fine-grain parallelism among statements (1992), in addition to loop parallelism
• Data Localization: automatic data management for distributed shared memory, cache, and local memory (local memory 1995, 2016 on RP2; cache 2001, 03); Software Coherent Control (2017)
• Data Transfer Overlapping (2016 partially): data transfer overlapping using Data Transfer Controllers (DMACs)
• Power Reduction (2005 for multicore, 2011 multi-process, 2013 on ARM): reduction of consumed power by compiler-controlled DVFS and power gating with hardware support

[Figure: macro task graph of 33 coarse grain tasks partitioned into data localization groups dlg0, dlg1, dlg2, and dlg3]

Page 13:

Multicore Program Development Using OSCAR API V2.0

A sequential application program in Fortran or C (consumer electronics, automobiles, medical, scientific computation, etc.) is processed either by manual parallelization / power reduction or by the Waseda OSCAR Parallelizing Compiler, which performs coarse grain task parallelization, data localization, DMAC data transfer, and power reduction using DVFS and clock/power gating. The result is a parallelized Fortran or C program with API directives, one code-with-directives per processor/thread (Proc0 / Thread 0, Proc1 / Thread 1, ...).

The OSCAR API, for homogeneous and/or heterogeneous multicores and manycores, defines directives for thread generation, memory, data transfer using DMA, and power management. For accelerators, an accelerator compiler or the user adds "hint" directives before a loop or a function to specify that it is executable by the accelerator and in how many clocks.

Parallel machine code is then generated using existing sequential compilers:
• Low-power homogeneous multicore code generation: API analyzer (available from Waseda) + existing sequential compiler → homogeneous multicores from Vendor A (SMP servers)
• Low-power heterogeneous multicore code generation: API analyzer + existing sequential compiler → heterogeneous multicores from Vendor B, including Accelerator 1 and Accelerator 2 code
• Server code generation: OpenMP compiler → shared memory servers
The generated code is executable on various multicores.

Project partners: Hitachi, Renesas, NEC, Fujitsu, Toshiba, Denso, Olympus, Mitsubishi, eSOL, CATS, Gaio, and 3 universities.
OSCAR: Optimally Scheduled Advanced Multiprocessor; API: Application Program Interface

Page 14:

Speedup Ratio for H.264 and Optical Flow on 3-Core ARM Cortex-A9 (Android) by OSCAR Automatic Parallelization
Speedup ratio against 1 PE:
H.264 decoder: 1PE 1.00, 2PE 1.35, 3PE 1.53
Optical Flow: 1PE 1.00, 2PE 1.99, 3PE 2.78

Page 15:

Automatic Power Reduction on ARM Cortex-A9 with Android
http://www.youtube.com/channel/UCS43lNYEIkC8i_KIgFZYQBQ
H.264 decoder & Optical Flow (on 3 cores)
ODROID X2: Samsung Exynos 4412 Prime, ARM Cortex-A9 quad core, 1.7 GHz to 0.2 GHz, used in Samsung's Galaxy S3

Average power consumption [W]:
H.264 decoder, without power control: 1 core 1.07, 2 cores 1.69, 3 cores 2.45; with power control: 1 core 0.79, 2 cores 0.57, 3 cores 0.51
Optical Flow, without power control: 1 core 0.95, 2 cores 1.50, 3 cores 2.23; with power control: 1 core 0.72, 2 cores 0.36, 3 cores 0.30

Power with 3 cores was reduced to about 1/5 (H.264, -79.2%) to 1/7 (Optical Flow, -86.5%) of execution without software power control, and to about 1/2 (-52.3%) to 1/3 (-68.4%) of ordinary 1-core execution.

Page 16:

Automatic Power Reduction of OpenCV Face Detection on big.LITTLE ARM Processor
• ODROID-XU3
• Samsung Exynos 5422 processor: 4x Cortex-A15 2.0 GHz and 4x Cortex-A7 1.4 GHz (big.LITTLE architecture)
• 2 GB LPDDR3 RAM; frequency can be changed by …

Power consumption [W] with 3 PEs (Cortex-A7 plus Cortex-A15): without power control 4.9 W; with power control 1.6 W (-67%, about 1/3)

Page 17:

Automatic Power Reduction on Intel Haswell
H.264 decoder & Optical Flow (3 cores)
H81M-A motherboard, Intel Core i7 4770K quad core, 3.5 GHz to 0.8 GHz

Average power consumption [W]:
H.264 decoder, without power control: 1 core 29.67, 2 cores 37.11, 3 cores 41.81; with power control: 1 core 17.37, 2 cores 16.15, 3 cores 12.50
Optical Flow, without power control: 1 core 29.29, 2 cores 36.59, 3 cores 41.58; with power control: 1 core 24.17, 2 cores 12.21, 3 cores 9.60

Power with 3 cores was reduced to about 1/3 (H.264, -70.1%) to 1/4 (Optical Flow, -76.9%) of execution without software power control, and to about 2/5 (-57.9%) to 1/3 (-67.2%) of ordinary 1-core execution.

Page 18:

110 Times Speedup against Sequential Processing for the GMS Earthquake Wave Propagation Simulation on the Hitachi SR16000 (Power7-based 128-core Linux SMP) (LCPC 2015)

18

Fortran: 15 thousand lines

First-touch data placement for distributed shared memory and cache optimization across loops are important for scalable speedup.
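The slide credits first-touch data placement for the scalable speedup, so here is a minimal hedged sketch of that idea in generic C with OpenMP (this is not OSCAR compiler output; the array names and sizes are made up): the data are initialized with the same static loop partitioning that the compute loop uses, so each page is placed on the node of the core that later works on it.

    #include <stdlib.h>
    #include <omp.h>

    #define N (1 << 24)

    int main(void) {
        double *a = malloc(N * sizeof *a);
        double *b = malloc(N * sizeof *b);

        /* First touch: initialize in parallel with the same static schedule ... */
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < N; i++) { a[i] = 0.0; b[i] = (double)i; }

        /* ... so each thread's compute iterations hit locally placed pages. */
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < N; i++) a[i] += 2.0 * b[i];

        free(a); free(b);
        return 0;
    }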

Page 19:

Performance on Multicore Server for Latest Cancer Treatment Using Heavy Particle (Proton, Carbon Ion)

327 times speedup on 144 cores

The original sequential execution time of 2948 sec (about 50 minutes) using GCC was reduced to 9 sec with 144 cores (327.6 times speedup). A reduction of treatment cost and reservation waiting periods is expected.

19

Hitachi 144-core SMP blade server BS500: Xeon E7-8890 v3 (2.5 GHz, 18 cores/chip) x 8 chips

[Chart: speedup over 1 PE (GCC); plotted values 1.00, 5.00, 109.20, 196.50, and 327.60 over the categories 1PE, 32pe, 64pe, and 144pe; 327.6 times speedup with 144 cores]

Page 20:

Cancer Treatment: Carbon Ion Radiotherapy, National Institute of Radiological Sciences (NIRS)
8.9 times speedup with 12 processors: Intel Xeon X5670 2.93 GHz 12-core SMP (Hitachi HA8000)
55 times speedup with 64 processors: IBM Power7 64-core SMP (Hitachi SR16000)
(The previous best was a 2.5 times speedup on 16 processors with hand optimization.)

Page 21:

Engine Control by multicore with Denso

21

Although parallel processing of engine control on multicores has so far been very difficult, Denso and Waseda achieved a 1.95 times speedup on a 2-core V850 multicore processor.
[Chart: execution time, 1 core vs. 2 cores]

Hard real-time automobile engine control by multicore using local memories
Millions of lines of C code consisting of conditional branches and basic blocks

Page 22:

Macro Task Fusion for Static Task Scheduling

22

MFG of the sample program before macro task fusion
MFG of the sample program after macro task fusion
MTG of the sample program after macro task fusion
Conditional branches are fused with their succeeding tasks.
[Figure legend: merged block, data dependency, control flow, conditional branch]
After fusion, only data dependencies remain.

Page 23:

3.1 Restructuring: Inline Expansion
Inline expansion is effective for increasing coarse grain parallelism: it expands functions that have inner parallelism.

23

MTG before inline expansion

MTG after inline expansion

Improves coarse grain parallelism
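To make the effect concrete, here is an illustrative C sketch (the functions stage_a, stage_b, stage_c, and pipeline are hypothetical, not taken from the slides) of how inline expansion turns parallelism hidden inside a callee into separate macro tasks in the caller.

    /* Before expansion: the call to pipeline() is one opaque macro task, so the
     * independence of stage_a() and stage_b() is invisible in the caller's MTG. */
    void stage_a(void) { /* independent work */ }
    void stage_b(void) { /* independent work, no dependence on stage_a */ }
    void stage_c(void) { /* uses the results of stage_a and stage_b */ }

    void pipeline(void)      { stage_a(); stage_b(); }
    void caller_before(void) { pipeline(); stage_c(); }

    /* After inline expansion: stage_a() and stage_b() become separate macro tasks
     * in the caller and can be scheduled on different cores before stage_c(). */
    void caller_after(void)  { stage_a(); stage_b(); stage_c(); }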

Page 24:

MTG of Crankshaft Program Using Inline Expansion

24

Not enough coarse grain parallelism yet!

Critical Path (CP): in the MTG of the crankshaft program before restructuring, the CP accounts for about 99% of the whole execution time; after inline expansion it still accounts for about 90%.

Page 25:

3.2 Restructuring: Duplicating If-statements
Duplicating if-statements is effective for increasing coarse grain parallelism: it duplicates fused tasks that have inner parallelism, copying the if-condition for each function, and thereby improves coarse grain parallelism.

Before duplicating if-statements (func1 depends on func3; no dependence otherwise):

    if (condition) {
        func2();
        func3();
        func4();
    }
    func1();

After duplicating if-statements (copying the if-condition for each function):

    if (condition) { func2(); }
    if (condition) { func3(); }
    if (condition) { func4(); }
    func1();

Page 26:

MTG of Crankshaft Program Using Inline Expansion and Duplicating If-statements

26

Successfully increased coarse grain parallelism

MTG of the crankshaft program before restructuring: the Critical Path (CP) accounts for over 99% of the whole execution time.
MTG of the crankshaft program after restructuring: the CP accounts for about 60% of the whole execution time.
The restructuring succeeded in reducing the CP from 99% to 60%.

Page 27:

Evaluation of Crankshaft Program with Multi-core Processors

A 1.54 times speedup is attained on the RP-X. The program has no loops, only many conditional branches and small basic blocks, which makes it difficult to parallelize. This result shows the potential of multicore processors for engine control programs.

[Chart: 1 core vs. 2 cores; speedup ratio 1.00 vs. 1.54; execution time 0.57 µs vs. 0.37 µs]

Page 28:

OSCAR Compile Flow for Simulink Applications

28

Simulink model → C code: C code is generated using Embedded Coder.
OSCAR Compiler:
(1) Generate the MTG → parallelism
(2) Generate a Gantt chart → scheduling on a multicore
(3) Generate parallelized C code using the OSCAR API → multiplatform execution (Intel, ARM, SH, etc.)

Page 29:

29

Speedups of MATLAB/Simulink Image Processing on Various 4-core Multicores (Intel Xeon, ARM Cortex-A15, and Renesas SH4A)
Benchmark sources:
Road Tracking, Image Compression: http://www.mathworks.co.jp/jp/help/vision/examples
Buoy Detection: http://www.mathworks.co.jp/matlabcentral/fileexchange/44706-buoy-detection-using-simulink
Color Edge Detection: http://www.mathworks.co.jp/matlabcentral/fileexchange/28114-fast-edges-of-a-color-image--actual-color--not-converting-to-grayscale-/
Vessel Detection: http://www.mathworks.co.jp/matlabcentral/fileexchange/24990-retinal-blood-vessel-extraction/

Page 30:

Parallel Processing of Face Detection on Manycore, Highend and PC Server

30

OSCAR compiler gives us 11.55 times speedup for 16 cores against 1 core on SR16000 Power7 highend server.

Automatic Parallelization of Face Detection
Speedup ratio against 1 core (1, 2, 4, 8, and 16 cores):
TILEPro64 (gcc): 1.00, 1.72, 3.01, 5.74, 9.30
SR16K (Power7, 8 cores x 4 CPUs x 4 nodes, xlc): 1.00, 1.93, 3.57, 6.46, 11.55
rs440 (Intel Xeon, 8 cores x 4 CPUs, icc): 1.00, 1.93, 3.67, 6.46, 10.92

Page 31:

OSCAR Heterogeneous Multicore

31

• DTU: Data Transfer Unit
• LPM: Local Program Memory
• LDM: Local Data Memory
• DSM: Distributed Shared Memory
• CSM: Centralized Shared Memory
• FVR: Frequency/Voltage Control Register

Page 32:

32

An Image of Static Schedule for Heterogeneous Multi-core with Data Transfer Overlapping and Power Control


Page 33:

OSCAR API Ver. 2.0 for Homogeneous/Heterogeneous Multicores and Manycores (LCPC 2009 for homogeneous, 2010 for heterogeneous)

33

Page 34:

33 Times Speedup Using the OSCAR Compiler and OSCAR API on RP-X (Optical Flow with a hand-tuned library)
Speedups against a single SH processor, for the configurations 1SH, 2SH, 4SH, 8SH, 2SH+1FE, 4SH+2FE, and 8SH+4FE: 1, 2.29, 3.09, 5.4, 18.85, 26.71, and 32.65
3.4 [fps] on 1SH → 111 [fps] on 8SH+4FE

Page 35:

Power Reduction in Real-Time Execution Controlled by the OSCAR Compiler and OSCAR API on RP-X (Optical Flow with a hand-tuned library)
Without power reduction: average 1.76 [W]; with power reduction by the OSCAR compiler: average 0.54 [W]
1 cycle: 33 [ms] → 30 [fps]
70% power reduction

Page 36:

Low-Power Optimization with OSCAR API

36

Scheduled result by the OSCAR compiler: VC0 executes MT1 and then MT4; VC1 executes MT2, sleeps, and then executes MT3.

Generated code image by the OSCAR compiler (the fvcontrol pragma with value 0 corresponds to the Sleep period in the schedule):

    void main_VC0() {
        MT1
        #pragma oscar fvcontrol \
            (1,(OSCAR_CPU(),100))
        MT4
    }

    void main_VC1() {
        MT2
        #pragma oscar fvcontrol \
            ((OSCAR_CPU(),0))    /* Sleep */
        MT3
    }

Page 37:

Automatic Local Memory Management
Data Localization: Loop Aligned Decomposition
• Decompose loops into LRs and CARs
  – LR (Localizable Region): data can be passed through the LDM
  – CAR (Commonly Accessed Region): data transfers are required among processors

37

[Figure panels: single-dimension decomposition; multi-dimension decomposition]
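A hedged sketch of loop aligned decomposition in plain C (illustrative only; the array names, sizes, and the sequential stand-in for per-core execution are assumptions): two loops that touch the same array are split with the same iteration boundaries, so each core's partition (the Localizable Region) can stay in its local data memory, and only the boundary element (the Commonly Accessed Region) has to be transferred between cores.

    #include <stdio.h>

    #define N     1024
    #define CORES 4
    #define CHUNK (N / CORES)

    static double a[N], b[N];

    /* Work assigned to one core: aligned chunks of both loops. */
    static void core_task(int core_id) {
        int lo = core_id * CHUNK;
        int hi = lo + CHUNK;

        /* Loop 1 (LR): produces a[lo..hi-1], which stays in this core's local memory. */
        for (int i = lo; i < hi; i++) a[i] = i * 0.5;

        /* Loop 2 (LR): consumes a[i] and a[i-1]; only a[lo-1], owned by the
         * neighboring core, is a CAR element that must be transferred. */
        for (int i = lo; i < hi; i++) b[i] = a[i] + (i > 0 ? a[i - 1] : 0.0);
    }

    int main(void) {
        for (int c = 0; c < CORES; c++) core_task(c);  /* sequential stand-in for per-core runs */
        printf("b[N-1] = %f\n", b[N - 1]);
        return 0;
    }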

Page 38:

Adjustable Blocks

Unlike the fixed block size of a cache, a suitable block size is chosen for each application, and each block can be divided into smaller blocks of integer-divisible sizes to handle small arrays and scalar variables.

38
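A hedged sketch of the adjustable-block idea (the allocator below is hypothetical, not the OSCAR implementation): one local-memory block is split into smaller sub-blocks whose sizes divide the block size evenly, so small arrays and scalar variables do not waste a whole block.

    #include <stdio.h>

    #define BLOCK_DOUBLES 512                    /* one LDM block holds 512 doubles (4 KB) */
    static double ldm_block[BLOCK_DOUBLES];

    /* Return sub-block 'index' of size BLOCK_DOUBLES / divisor (divisor must divide evenly). */
    static double *sub_block(int divisor, int index) {
        return &ldm_block[(BLOCK_DOUBLES / divisor) * index];
    }

    int main(void) {
        double *small_array = sub_block(4, 0);   /* a quarter block for a small array */
        double *scalars     = sub_block(16, 7);  /* a 1/16 block for scalar variables */
        small_array[0] = 1.0;
        scalars[0]     = 2.0;
        printf("%f %f\n", small_array[0], scalars[0]);
        return 0;
    }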

Page 39:

Multi-dimensional Template Arrays for Improving Readability

• A mapping technique for arrays with varying dimensions
  – Each block on the LDM corresponds to multiple empty arrays with varying dimensions
  – These arrays have an additional dimension to store the corresponding block number:
    • TA[Block#][] for single dimension
    • TA[Block#][][] for double dimension
    • TA[Block#][][][] for triple dimension
    • ...
• The LDM is represented as a one-dimensional array
  – Without Template Arrays, multi-dimensional arrays require complex index calculations: A[i][j][k] -> TA[offset + i' * L + j' * M + k']
  – Template Arrays provide readability: A[i][j][k] -> TA[Block#][i'][j'][k']
[Figure: blocks on the LDM]
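The index rewriting above can be illustrated with a small C sketch (the block count and tile sizes are assumptions made for the example): the same flat LDM buffer is accessed either with explicit offset arithmetic or through a typed "template array" view that adds the leading block-number dimension.

    #include <stdio.h>

    #define BLOCKS 4                      /* number of blocks on the LDM          */
    #define BI 4
    #define BJ 4
    #define BK 4                          /* each block holds a BI x BJ x BK tile */

    /* Flat, one-dimensional view of the local data memory. */
    static double ldm_flat[BLOCKS * BI * BJ * BK];

    /* Template-array view: the same storage with an explicit block dimension. */
    typedef double ta_view[BI][BJ][BK];
    static ta_view *TA = (ta_view *)ldm_flat;

    int main(void) {
        int blk = 2, i = 1, j = 2, k = 3;

        /* Without the template array: manual offset arithmetic into the flat LDM. */
        ldm_flat[((blk * BI + i) * BJ + j) * BK + k] = 42.0;

        /* With the template array: readable indexing of the same memory location. */
        printf("TA[%d][%d][%d][%d] = %f\n", blk, i, j, k, TA[blk][i][j][k]);
        return 0;
    }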

Page 40:

8-Core RP2 Chip Block Diagram
[Figure] Two 4-core clusters (Cluster #0: Core #0 to #3; Cluster #1: Core #4 to #7). Each core has a CPU, an FPU, a 16 KB I$, a 16 KB D$, local memory (I: 8 KB, D: 32 KB), 64 KB of User RAM (URAM), and a CCN/BAR. Each cluster has a snoop controller (Snoop controller 0 / Snoop controller 1), a local clock pulse generator (LCPG0 / LCPG1), and per-core power control registers (PCR0 to PCR7). The clusters connect through the on-chip system bus (SuperHyway) to the DDR2 control, SRAM control, and DMA control, and barrier synchronization lines run across the cores.
LCPG: local clock pulse generator; PCR: power control register; CCN/BAR: cache controller / barrier register; URAM: User RAM (distributed shared memory)

Page 41:

Speedups by the Proposed Local Memory Management Compared with Using Shared Memory on Benchmark Applications on RP2

41

A 20.12 times speedup was obtained for 8-core execution using local memory against sequential execution using the off-chip shared memory of RP2 for the AAC encoder (AACenc).

Page 42:

Software Coherence Control Method on OSCAR Parallelizing Compiler

Coarse grain task parallelization with earliest executable condition analysis (control and data dependency analysis to detect parallelism among coarse grain tasks).
The OSCAR compiler automatically controls coherence using the following simple program restructuring methods:
• To cope with the stale data problem: data synchronization by the compiler
• To cope with the false sharing problem: data alignment, array padding, and non-cacheable buffers

MTG generated by earliest executable condition analysis
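As a concrete illustration of one of these restructuring methods, here is a hedged sketch of array padding in generic C (not OSCAR compiler output; the 64-byte line size and the per-core accumulator example are assumptions): each core's data is padded to its own cache line so that writes from different cores never share a line, which removes false sharing on a non-coherent-cache (NCC) system.

    #include <stdio.h>

    #define CORES     4
    #define LINE_SIZE 64

    /* Unpadded: sums[0..3] share one cache line, so per-core updates false-share. */
    static double sums_unpadded[CORES];

    /* Padded: each core's accumulator occupies its own cache line. */
    struct padded_sum {
        double value;
        char   pad[LINE_SIZE - sizeof(double)];
    };
    static struct padded_sum sums_padded[CORES];

    static void accumulate(int core_id, const double *data, int n) {
        for (int i = 0; i < n; i++)
            sums_padded[core_id].value += data[i];  /* no line shared with other cores */
    }

    int main(void) {
        double data[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        for (int c = 0; c < CORES; c++) accumulate(c, data, 8);
        printf("core 0 sum: %f (unpadded kept only for comparison: %f)\n",
               sums_padded[0].value, sums_unpadded[0]);
        return 0;
    }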

Page 43:

Performance of Software Coherence Control by OSCAR Compiler on 8-core RP2

Speedup vs. the number of processor cores (normalized to 1-core execution):

SMP (hardware coherence), 1 / 2 / 4 cores:
equake 1.00 / 1.38 / 2.52; art 1.00 / 1.67 / 2.65 (SPEC2000)
lbm 1.00 / 1.76 / 2.90; hmmer 1.00 / 1.79 / 2.99 (SPEC2006)
cg 1.00 / 1.84 / 3.34; mg 1.00 / 1.32 / 2.36; bt 1.00 / 1.87 / 2.86; lu 1.00 / 1.79 / 2.86; sp 1.00 / 1.55 / 2.19 (NPB)
MPEG2 Encoder 1.00 / 1.70 / 3.17 (MediaBench)

NCC (software coherence), 1 / 2 / 4 / 8 cores:
equake 1.07 / 1.45 / 2.63 / 4.37; art 1.10 / 1.76 / 2.95 / 3.65 (SPEC2000)
lbm 1.06 / 1.90 / 3.28 / 4.76; hmmer 1.01 / 1.81 / 3.19 / 4.63 (SPEC2006)
cg 1.07 / 2.01 / 3.71 / 5.66; mg 1.03 / 1.32 / 2.36 / 3.67; bt 1.05 / 1.95 / 2.87 / 3.49; lu 1.05 / 1.77 / 2.70 / 3.32; sp 1.07 / 1.40 / 1.89 / 2.19 (NPB)
MPEG2 Encoder 1.02 / 1.67 / 3.02 / 4.92 (MediaBench)

Automatic software coherent control for manycores

Page 44:

44

1987: OSCAR (Optimally Scheduled Advanced Multiprocessor)
Co-design of compiler and architecture: looking at various applications, design a parallelizing compiler and a multiprocessor / multicore processor that supports the compiler's optimizations.

Page 45:

45

OSCAR (Optimally Scheduled Advanced Multiprocessor)
[Architecture diagram] A host computer connects to a control & I/O processor (CP) and 16 processing elements (PE1 to PE16), organized into 5-PE clusters (SPC1 to SPC3) and 8-PE processor clusters (LPC1, LPC2). The PEs share three centralized shared memories (CSM1 to CSM3; Bank1 to Bank3 at the same address) behind a read & write request arbitrator. Each PE contains a 5 MFLOPS 32-bit RISC processor (64 registers), two banks of program memory, data memory, stack memory, a DMA controller, an I/O processor, a bus interface, and a dual-port, simultaneously readable distributed shared memory.

Page 46:

46

OSCAR Memory Space (Global Address Space)
[Memory map] System memory space (00000000 to FFFFFFFF): regions for the centralized shared memories CSM1, 2, 3, the local memory areas of the control processor (CP) and PE0 to PE15, a broadcast area, and unused/undefined regions.
[Memory map] Local memory space of each PE (00000000 to FFFFFFFF): DSM (distributed shared memory), LPM (local program memory, Bank 0 and Bank 1), LDM (local data memory), a control area, and a system accessing area, separated by unused regions.

Page 47:

Hierarchical Barrier Synchronization

• Specifying a hierarchical group barrier:
  – #pragma oscar group_barrier (C)
  – !$oscar group_barrier (Fortran)
Kasahara-Kimura Lab., Waseda University
[Processor group hierarchy]
1st layer: 8 cores
2nd layer: PG0 (4 cores), PG1 (4 cores)
3rd layer: PG0-0, PG0-1, PG0-2, PG0-3; PG1-0 (2 cores), PG1-1 (2 cores)
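As a generic illustration of what such a hierarchy means at run time (POSIX threads, not the OSCAR group_barrier implementation; the group-to-thread mapping follows the figure above), the sketch below lets the 2-core groups PG1-0 and PG1-1 synchronize first, then all of PG1, and finally all 8 cores.

    #include <pthread.h>
    #include <stdio.h>

    static pthread_barrier_t top8;    /* 1st layer: all 8 cores     */
    static pthread_barrier_t pg1;     /* 2nd layer: PG1 (4 cores)   */
    static pthread_barrier_t pg1_0;   /* 3rd layer: PG1-0 (2 cores) */
    static pthread_barrier_t pg1_1;   /* 3rd layer: PG1-1 (2 cores) */

    static void *worker(void *arg) {
        int id = *(int *)arg;
        if (id == 4 || id == 5) pthread_barrier_wait(&pg1_0);   /* PG1-0 */
        if (id == 6 || id == 7) pthread_barrier_wait(&pg1_1);   /* PG1-1 */
        if (id >= 4)            pthread_barrier_wait(&pg1);     /* PG1   */
        pthread_barrier_wait(&top8);                            /* all cores */
        return NULL;
    }

    int main(void) {
        pthread_t t[8]; int ids[8];
        pthread_barrier_init(&top8,  NULL, 8);
        pthread_barrier_init(&pg1,   NULL, 4);
        pthread_barrier_init(&pg1_0, NULL, 2);
        pthread_barrier_init(&pg1_1, NULL, 2);
        for (int i = 0; i < 8; i++) { ids[i] = i; pthread_create(&t[i], NULL, worker, &ids[i]); }
        for (int i = 0; i < 8; i++) pthread_join(t[i], NULL);
        puts("all groups synchronized");
        return 0;
    }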

Page 48:

48

NWT

Machine cycle time: 9.5 ns (105 MHz)
PE performance: 1.68 GFLOPS
PE memory size: 256 MB/PE
Crossbar bandwidth: 4 B/cycle x 2 (simultaneous send/receive) per PE = 421 MB/s x 2 per PE
Number of PEs: 140 PEs + 2 control processors
NAL computer center, Chofu, Tokyo, Feb. 1, 1993

Page 49:

Crossbar Network: VPP500 256 x 256; VPP5000 1024 x 1024

Fujitsu Vector Parallel Supercomputer with Crossbar to a Chip

49

Page 50:

Copyright 2008 FUJITSU LIMITED

50

VPP500/NWT [photos: cabinet, cabinet (open), PE unit]

Page 51:

51

VPP500/NWT

PE Unit (separated)

PE LSI unit

Water Cooling system

Page 52:

52

VPP500/NWT

GaAs BiCMOS

Page 53:

The 4-core multicore RP1 (2007), the 8-core multicore RP2 (2008), and the 15-core heterogeneous multicore RP-X (2010) were developed in NEDO projects with Hitachi and Renesas.

Page 54:

Automatic Parallelization of JPEG-XR for a Drinkable Inner Camera (Endo Capsule)
Waseda U. & Olympus
• TILEPro64 manycore
Speed-ups on the TILEPro64 manycore against 1 core: 1.00, 1.96, 3.95, 7.86, 15.82, 30.79, and 55.11 for 1, 2, 4, 8, 16, 32, and 64 cores
1 core: 10.0 [s] → 64 cores: 0.18 [s]; 55 times speedup with 64 cores
10 times more speedup was needed after parallelization for 128 cores of Power7; less than 35 mW of power consumption is required.

Page 55:

OSCAR Vector Multicore and Compiler, from Embedded Systems to Servers, with OSCAR Technology
Target: solar powered. Compiler power reduction; fully automatic parallelization and vectorization, including local memory management and data transfer.
[Architecture figure] Each core contains a CPU, a vector unit, a data transfer unit, local memory, distributed shared memory, and a power control unit. Cores on a multicore chip share on-chip shared memory through a compiler co-designed connection network, and up to 4 chips connect to centralized shared memory through a compiler co-designed interconnection network.

Page 56:

OSCAR Technology Corp.

56

Started up on Feb. 28, 2013, licensing all the patents and the OSCAR compiler from Waseda Univ.
Founder and CEO: Dr. T. Ono (ex-CEO of a First Section-listed company, director of a national university, invited Prof. of Waseda U.)
Executives:
Dr. M. Ohashi: COO (ex-OO of Ono Sokki)
Mr. A. Nodomi: CTO (ex-Spansion)
Mr. N. Ito (ex-visiting Prof., Tokyo Agricult. and Tech. U.)
Dr. K. Shirai (ex-President of Waseda U., ex-Chairman of the Open University of Japan)
Mr. K. Ashida (ex-VP Sumitomo Trading, Ashida Consult. CEO)
Mr. S. Tsuchida (Co-Chief Investment Officer of Innovation Network Corp. of Japan)
Auditors:
Mr. S. Honda (ex-Senior VP and General Manager of MUFG)
Dr. S. Matuda (Emeritus Prof. of Waseda U., Chairman of WERU INVESTMENT)
Mr. Y. Hirowatari (President of AGS Consulting)
Advisors: Prof. H. Kasahara (Waseda U.), Prof. K. Kimura (Waseda U.)

Copyright 2017

Page 57:

Future Multicore Products
• Next generation automobiles: safer, more comfortable, energy efficient, and environment friendly; cameras, radar, car-to-car communication, and internet information integrated with brake, steering, engine, and motor control
• Personal / regional supercomputers: solar powered, with more than 100 times the power efficiency (FLOPS/W); regional disaster simulators saving lives from tornadoes, localized heavy rain, and fires accompanying earthquakes
• Smartphones: from everyday recharging to less than once a week; solar-powered operation in emergency conditions; keeping healthy
• Advanced medical systems: cancer treatment, drinkable inner camera; emergency solar powered; no cooling fan, no dust, clean, usable inside an operating room

Page 58:

Summary
The Waseda University Green Computing Systems R&D Center, supported by METI, has been researching low-power, high-performance green multicore hardware, software, and applications with industry partners including Hitachi, Fujitsu, NEC, Renesas, Denso, Toyota, Olympus, and OSCAR Technology.

The OSCAR automatic parallelizing and power reducing compiler has succeeded in speeding up and/or reducing the power of scientific applications including earthquake wave propagation, medical applications including cancer treatment using carbon ions and a drinkable inner camera, and industrial applications including automobile engine control, smartphones, and wireless-communication baseband processing, on various multicores from different vendors including Intel, ARM, IBM, AMD, Qualcomm, Freescale, Renesas, and Fujitsu.

In automatic parallelization, it achieved a 110 times speedup for the earthquake wave propagation simulation on 128 cores of IBM Power7 against 1 core, a 55 times speedup for carbon ion radiotherapy cancer treatment on 64 cores of IBM Power7, a 1.95 times speedup for automobile engine control on 2 Renesas cores (SH4A or V850), and a 55 times speedup for JPEG-XR encoding for capsule inner cameras on the 64-core Tilera TILEPro64 manycore. The compiler will be available on the market from OSCAR Technology.

In automatic power reduction, the power consumed by real-time multimedia applications such as human face detection, H.264, MPEG2, and optical flow was reduced to 1/2 or 1/3 using 3 cores of ARM Cortex-A9 and Intel Haswell, and to 1/4 using 8 cores of Renesas SH4A, against ordinary single-core execution.

Local memory management for automobiles and software coherent control have been patented and are already implemented in the OSCAR compiler.

58