
Page 1: Parallel GCC: Status and Expectations

Parallel GCC: Status and Expectations

Giuliano Belinassi

giulianob

September 13, 2019

GNU Cauldron 2019

FLUSP

Computer Science Department

Institute of Mathematics and Statistics

University of São Paulo (USP)


Page 2: Parallel GCC: Status and Expectations

Introduction

• FLUSP – FLOSS at USP
– Student extension group

– Aims to contribute to FLOSS

– Projects we currently contribute to: Linux Kernel, GCC, Git, Debian

– Promotes contribution events, such as the KernelDevDay


Page 3: Parallel GCC: Status and Expectations

Overview

1 Introduction
2 Where to Start?
3 Parallel Architecture
4 Results
5 TODOs
6 References

Page 4: Parallel GCC: Status and Expectations

Overview

1 Introduction
2 Where to Start?
3 Parallel Architecture
4 Results
5 TODOs
6 References

Page 5: Parallel GCC: Status and Expectations

Introduction

• How many of you have seen a compiler running in parallel?

Page 6: Parallel GCC: Status and Expectations

Introduction

• Parallelization of GCC internals
– Exploring parallelism in a single file

• Main objectives
– Reduction of compilation time

– Highlight GCC global states

Page 7: Parallel GCC: Status and Expectations

Motivation

• Compilers are really complex programs
– GCC dates from 1987

– Still sequential

• Automatic generation of big files
– gimple-match.c: 100358 lines of C++ (GCC 10.0.0)

• Exponential growth in the number of CPU cores

Page 8: Parallel GCC: Status and Expectations

Motivation

[Figure: "42 Years of Microprocessor Trend Data" (Rupp, 2018) – growth of computational power. Log-scale plot, 1970–2020, of transistor count (thousands), single-thread performance (SpecINT × 10³), frequency (MHz), typical power (watts), and number of logical cores. Original data up to 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten; data for 2010–2017 collected by K. Rupp.]

Page 9: Parallel GCC: Status and Expectations

Motivation

• Where can we use these parallel processors in a compiler?

• How large is the improvement?

• Are there projects that can benefit from this?

Page 10: Parallel GCC: Status and Expectations

Introduction

• Experiment 1
– GCC compilation on a machine with 4× AMD Opteron 6376

» 4 × 16 = 64 cores

– No bootstrap (--disable-bootstrap)

– $ make -j64

– Collected the compilation time of each file

– Collected the energy consumed by all CPUs

Page 11: Parallel GCC: Status and Expectations

[Figure: Flamegraph of GCC compilation – Makefile job (0–60) vs. time (0–175 s). Highlighted files: [A] insn-emit.o, [B] insn-recog.o, [C] gimple-match.o, [D] generic-match.o; the rest are grouped as "Other Files".]

Page 12: Parallel GCC: Status and Expectations

[Figure: Electric energy consumed by all CPUs (amd fam15h_power) – power (0–400 W) vs. time (0–175 s). Sequential build: consumed energy = 1.841616 ± 0.006624 Wh.]

Page 13: Parallel GCC: Status and Expectations

Introduction

• How about LTO?

• Experiment 2
– GCC compilation on a machine with 4× AMD Opteron 6376

» 4 × 16 = 64 cores

– No bootstrap (--disable-bootstrap)

– Enabled LTO (CFLAGS=-flto=64 CXXFLAGS=-flto=64)

– $ make -j64

– Collected the compilation time of each file

– Collected the energy consumed by all CPUs

Page 14: Parallel GCC: Status and Expectations

[Figure: Flamegraph of GCC compilation with LTO – Makefile job (0–60) vs. time (0–300 s). Highlighted processes: [A] lto1, [B] cc1, [C] cc1plus; the rest are grouped as "Other Files".]

Page 15: Parallel GCC: Status and Expectations

[Figure: Electric energy consumed by all CPUs (amd fam15h_power) – power (0–400 W) vs. time (0–300 s). Sequential build with LTO: consumed energy = 2.695716 ± 0.009216 Wh.]

Page 16: Parallel GCC: Status and Expectations

Introduction

• There is a parallelism bottleneck in the GCC project
– And the compilation consumes more electrical energy as a consequence

• Could we improve this by parallelizing GCC?

Page 17: Parallel GCC: Status and Expectations

Overview

1 Introduction
2 Where to Start?
3 Parallel Architecture
4 Results
5 TODOs
6 References

Page 18: Parallel GCC: Status and Expectations

Where to Start?

GCC's structure is divided into three parts.

• Front End
– Parsing

• Middle End
– Hardware-independent optimizations

• Back End
– Hardware-dependent optimizations

– Code generation

Page 19: Parallel GCC: Status and Expectations

Where to Start?

• Selected part: Middle End
– Applies hardware-independent optimizations to the code

– Why? Because it looked easier to start with than RTL

• Optimizations can be broken into two disjoint sets
– Intra-procedural

» Can be applied to a function without observing its interactions with other functions

– Inter-procedural

» Interactions with other functions must be considered
» GCC calls this Inter-Procedural Analysis (IPA)

Page 20: Parallel GCC: Status and Expectations

GCC Profile

• We used gimple-match.c to profile the compiler
– Intel Core i5-8250U (4 cores)

– 5400 RPM 1 TB hard drive

[Chart: GCC profile when compiling gimple-match.c]
Parser: 5.8%
IPA: 10.7%
GIMPLE IPA: 0.6%
GIMPLE: 15.4%
RTL: 48.1%
Others: 19.4%

Page 21: Parallel GCC: Status and Expectations

GCC Profile

• We used gimple-match.c to profile the compiler
– 48 s is spent compiling this file

– 63% of that time is spent in intra-procedural optimizations

– This part can be parallelized

– Maximum speedup: 2.7× according to Amdahl's Law (see below)
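A quick check of that bound: with a parallelizable fraction p = 0.63 and n threads, Amdahl's Law gives

S(n) = 1 / ((1 − p) + p/n), so lim n→∞ S(n) = 1 / (1 − p) = 1 / 0.37 ≈ 2.7.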


Page 22: Parallel GCC: Status and Expectations

Overview

1 Introduction
2 Where to Start?
3 Parallel Architecture
4 Results
5 TODOs
6 References

Page 23: Parallel GCC: Status and Expectations

Parallel Architecture

• Parallelize intra-procedural optimizations
– On a per-function basis

– Similarly to Wortman and Junkin (1992)

– With an arbitrary number of threads

Page 24: Parallel GCC: Status and Expectations

Parallel Architecture

• Architecture of the parallel GCC
– The Inter-Procedural Analysis inserts functions into a producer-consumer queue

– Each thread is responsible for removing its work from the queue

– When the queue is empty, threads block until work is inserted into it

– The threads die when the EMPTY token is inserted

– Measured overhead: 0.1 s for 2000 functions

[Diagram: IPA inserts functions f1…f7 into a queue; multiple threads run the GIMPLE passes on them in parallel; a barrier collects the threads before the sequential RTL passes and code generation. A minimal sketch of the queue follows.]
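The queue mechanics can be pictured with a minimal, self-contained sketch using std::thread and a condition variable; the names function_node, EMPTY_TOKEN, and run_gimple_passes are illustrative, not GCC's:

#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

struct function_node { int uid; };                  // stand-in for a cgraph node
static function_node *const EMPTY_TOKEN = nullptr;  // tells the workers to exit

static std::queue<function_node *> work_queue;
static std::mutex queue_mutex;
static std::condition_variable queue_cv;

// Producer side: IPA inserts each analyzed function into the queue.
static void enqueue (function_node *fn)
{
  { std::lock_guard<std::mutex> g (queue_mutex); work_queue.push (fn); }
  queue_cv.notify_one ();
}

static void run_gimple_passes (function_node *) { /* intra-procedural work */ }

// Consumer side: each thread removes its own work, blocking while the
// queue is empty.
static void worker ()
{
  for (;;)
    {
      std::unique_lock<std::mutex> lock (queue_mutex);
      queue_cv.wait (lock, [] { return !work_queue.empty (); });
      function_node *fn = work_queue.front ();
      if (fn == EMPTY_TOKEN)
        return;            // leave the token in place so every thread sees it
      work_queue.pop ();
      lock.unlock ();
      run_gimple_passes (fn);
    }
}

int main ()
{
  std::vector<std::thread> threads;
  for (int i = 0; i < 4; i++)
    threads.emplace_back (worker);

  function_node f1 {1}, f2 {2};
  enqueue (&f1);
  enqueue (&f2);

  // IPA is done: insert the EMPTY token and wake everyone up.
  { std::lock_guard<std::mutex> g (queue_mutex); work_queue.push (EMPTY_TOKEN); }
  queue_cv.notify_all ();

  for (std::thread &t : threads)
    t.join ();
}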


Page 25: Parallel GCC: Status and Expectations

Parallel Architecture

• Advantages
– Best candidate for linear speedup

– Worst case: a single big function

• Disadvantage
– Requires mapping per-pass global states in the compiler

Page 26: Parallel GCC: Status and Expectations

Implementation

• Split cgraph_node::expand into three methods:
1 expand_ipa_gimple

2 expand_gimple ←− parallelization effort in GSoC 2019

3 expand_rtl

• Serialized the Garbage Collector

• Serialized memory-related structures

Page 27: Parallel GCC: Status and Expectations

Implementation

Memory Pools:

• Problems:
– A linked list of several object instances

– Objects can be allocated/released upon request

– Most race conditions were found there

• Solution:
– Create an instance for every thread

– Merge the memory pools when all threads join

– Implementing the merge is easy: just append the two linked lists (see the sketch below)
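A minimal sketch of the merge-on-join idea, assuming the pool keeps a simple singly linked list of chunks (GCC's real memory pools are more elaborate):

// Per-thread object pool whose free storage is a singly linked list.
struct pool_chunk
{
  pool_chunk *next;
};

struct object_pool
{
  pool_chunk *free_list = nullptr;

  // Merging two pools is just appending one linked list to the other.
  void merge (object_pool &other)
  {
    pool_chunk **tail = &free_list;
    while (*tail)
      tail = &(*tail)->next;    // walk to the end of our list
    *tail = other.free_list;    // append the other thread's chunks
    other.free_list = nullptr;  // the other pool gives up ownership
  }
};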


Page 28: Parallel GCC: Status and Expectations

Implementation

Garbage Collection:

• Problem:
– GCC implements a garbage collector

– Variables can be marked to be watched by it

– Collection can also be triggered upon request

• Partial Solution:
– Serialize the entire Garbage Collector (sketched below)

– Disable collection between passes

• Research is needed to:
– Map the race conditions in the Garbage Collector

– Make it thread-safe
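The partial solution amounts to funneling every collector entry point through one global lock; a minimal sketch with hypothetical names (the real entry points in GCC's ggc differ):

#include <cstddef>
#include <mutex>

static std::mutex ggc_mutex;     // one global lock for every GC entry point
static bool collection_blocked;  // set while the worker threads are running

// Hypothetical stand-ins for the real allocator and collector.
static void *ggc_alloc_raw (size_t size) { return ::operator new (size); }
static void ggc_collect_raw () { /* mark & sweep elided */ }

void *ggc_alloc_serialized (size_t size)
{
  std::lock_guard<std::mutex> guard (ggc_mutex); // one thread in the GC at a time
  return ggc_alloc_raw (size);
}

void ggc_collect_serialized ()
{
  std::lock_guard<std::mutex> guard (ggc_mutex);
  if (collection_blocked)
    return;                                      // collection disabled between passes
  ggc_collect_raw ();
}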


Page 29: Parallel GCC: Status and Expectations

Implementation

rtl_data Structure:

• Problem:
– This class represents the current function being compiled in RTL

– GCC has only a single instance of this class

– GIMPLE uses it to decide optimizations related to instruction costs

• Solution
– Have one copy of this structure for each thread (sketched below)

– Make GIMPLE not rely on this?
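One way to get a per-thread copy is C++11 thread_local storage; a minimal sketch, assuming the single global instance is named x_rtl as in GCC's sources (the member shown is a simplified stand-in):

struct rtl_data
{
  int frame_size;              // stand-in for the real per-function RTL state
};

// Before: one shared instance for the whole compiler.
//   struct rtl_data x_rtl;
// After: each worker thread compiles its own function, so each gets a copy.
thread_local rtl_data x_rtl;
#define crtl (&x_rtl)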


Page 30: Parallel GCC: Status and Expectations

Implementation

tree-ssa-address.c: mem_addr_template_list

• Problem:
– Race condition in this structure

• Partial Solution:
– Serialize with a mutex (sketched below)

• Solution
– Replicate it for each thread?
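The serialize-with-a-mutex stopgap looks roughly like the following; the element type and accessor are illustrative, not the real tree-ssa-address.c code:

#include <mutex>
#include <vector>

typedef void *mem_addr_template;  // stand-in for the real template entries

static std::vector<mem_addr_template> mem_addr_template_list;
static std::mutex mem_addr_template_mutex;

static mem_addr_template get_template (unsigned idx)
{
  std::lock_guard<std::mutex> guard (mem_addr_template_mutex);
  if (idx >= mem_addr_template_list.size ())
    mem_addr_template_list.resize (idx + 1); // grow lazily, under the lock
  return mem_addr_template_list[idx];
}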


Page 31: Parallel GCC: Status and Expectations

Implementation

Integer-to-tree-node hash

• Problem:
– Race condition in this structure

• Partial Solution:
– Serialize with a mutex (same pattern as the sketch above)

• Solution
– Replicate it for each thread?

Page 32: Parallel GCC: Status and Expectations

Overview

1 Introduction
2 Where to Start?
3 Parallel Architecture
4 Results
5 TODOs
6 References

Page 33: Parallel GCC: Status and Expectations

Results

• Results from parallelizing GIMPLE
• Mean of 30 samples

[Figure: Time in the parallel GIMPLE intra-procedural passes with 1, 2, 4, and 8 threads – speedups of 1.72× (2 threads), 2.51× (4 threads), and 2.52× (8 threads). Overall GCC time broken into Parallel GIMPLE, RTL, Inter-Procedural Analysis, Parser, and Others – whole-compiler speedups of 1.06×, 1.09×, and 1.09×.]

Page 34: Parallel GCC: Status and Expectations

Results

• The same approach can be used to parallelize RTL
• Estimate based on the GIMPLE results

[Figure: GCC time and speedups (estimate), broken into Parallel GIMPLE, Parallel RTL (estimate), Inter-Procedural Analysis, Parser, and Others – estimated whole-compiler speedups of 1.35× (2 threads), 1.61× (4 threads), and 1.61× (8 threads).]

Page 35: Parallel GCC: Status and Expectations

Motivation

• Where can we use these parallel processors in a compiler?
– Intra-procedural analysis is an easy answer

• How large is the improvement?
– Without much optimization, 1.6× using 4 threads

– Better results will require more effort

• Are there projects that can benefit from this?
– GCC

– ...What else?

Page 36: Parallel GCC: Status and Expectations

Overview

1 Introduction
2 Where to Start?
3 Parallel Architecture
4 Results
5 TODOs
6 References

Page 37: Parallel GCC: Status and Expectations

TODOs

• What is left to do:
– Fix all race conditions in GIMPLE

– Fix all C11 __thread occurrences

» Initialize inter-pass objects at ::execute time
» Per-thread attributes initialized at spawn time

– Make this build pass the testsuite

– Parallelize RTL

– Parallelize IPA (useful for LTO?)

– Communicate with Make for automatic threading

Page 38: Parallel GCC: Status and Expectations

Overview

1 Introduction
2 Where to Start?
3 Parallel Architecture
4 Results
5 TODOs
6 References

Page 39: Parallel GCC: Status and Expectations

References

– Karl Rupp (2018). 42 Years of Microprocessor Trend Data. URL: https://github.com/karlrupp/microprocessor-trend-data.

– D.B. Wortman and M.D. Junkin (Jan. 1992). "A Concurrent Compiler for Modula-2+". In: ACM SIGPLAN Notices 27, pp. 68–81. DOI: 10.1145/143103.120025.

Page 40: Parallel GCC: Status and Expectations

Repositories


https://github.com/giulianobelinassi/gcc-timer-analysis

https://gitlab.com/flusp/gcc/tree/giulianob_parallel

https://gcc.gnu.org/wiki/ParallelGcc

Page 41: Parallel GCC: Status and Expectations

Thank you!
