Institute of Computing Technology, Chinese Academy of Sciences
Working and Researching on Open64
Hongtao Yu Feng Li Wei HuoWei Mi Li Chen Chunhui Ma
Wenwen Xu Ruiqi Lian Xiaobing Feng
Outline
– Reform Open64 as an aggressive program analysis tool
  - Source code analysis and error checking
– Source-to-source transformation: WHIRL to C
– Extending UPC for GPU clusters
– New targeting: the LOONGSON CPU
Part Ⅰ Aggressive program analysis
Whole Program Analysis (WPA)
– Aims at error checking
– A framework
– Pointer analysis
  - The foundation of other program analyses
  - Flow- and context-sensitive
– Program slicing
  - Interprocedural
  - Reduces program size for specific problems
(Figure: the WPA framework. The IPL summary phase feeds IPA_LINK, which builds the call graph, constructs SSA form for each procedure, and runs the FSCS pointer analysis (LevPA); the Whole Program Analyzer then drives the static slicer and the static error checker.)
LevPA: Level-by-Level Pointer Analysis
– A flow- and context-sensitive pointer analysis
– Fast: analyzes millions of lines of code
– The work has been published as:
  Hongtao Yu, Jingling Xue, Wei Huo, Zhaoqing Zhang, Xiaobing Feng. Level by Level: Making Flow- and Context-Sensitive Pointer Analysis Scalable for Millions of Lines of Code. In Proceedings of the 2010 International Symposium on Code Generation and Optimization (CGO), April 24-28, 2010, Toronto, Canada.
LevPA: level-by-level analysis
– Analyze the pointers in decreasing order of their points-to levels. Suppose int **q, *p, x; then q has level 2, p has level 1, and x has level 0. A variable can be referenced directly, or indirectly through dereferences of another pointer.
– Fast flow-sensitive analysis on full-sparse SSA
– Fast and accurate context-sensitive analysis using a full transfer function
Framework
(Figure 1. Level-by-level pointer analysis (LevPA): compute points-to levels; then, for each points-to level from the highest to the lowest, evaluate transfer functions bottom-up and propagate points-to sets top-down, incrementally building the call graph.)
Example
int o, t;main() { L1: int **x, **y;
L2: int *a, *b, *c, *d, *e; L3: x = &a; y = &b; L4: foo(x, y); L5: *b = 5; L6: if ( … ) { x = &c; y = &e; } L7: else { x = &d; y = &d; } L8: c = &t; L9: foo( x, y); L10: *e = 10; }
void foo( int **p, int **q) { L11: *p = *q; L12: *q = &obj;}
9
ptl(x, y, p, q) = 2; ptl(a, b, c, d, e) = 1; ptl(t, obj) = 0
Analyze first {x, y, p, q}, then {a, b, c, d, e}, and last {t, obj}.
Bottom-up analyze level 2

void foo(int **p, int **q) {
L11: *p = *q;
L12: *q = &obj;
}

main() {
L1:  int **x, **y;
L2:  int *a, *b, *c, *d, *e;
L3:  x = &a; y = &b;
L4:  foo(x, y);
L5:  *b = 5;
L6:  if ( … ) { x = &c; y = &e; }
L7:  else { x = &d; y = &d; }
L8:  c = &t;
L9:  foo(x, y);
L10: *e = 10;
}
Bottom-up analyze level 2 (foo in SSA form)

void foo(int **p, int **q) {
L11: *p1 = *q1;
L12: *q1 = &obj;
}

– p1's points-to set depends on formal-in p
– q1's points-to set depends on formal-in q
Bottom-up analyze level 2 (main in SSA form)

main() {
L1:  int **x, **y;
L2:  int *a, *b, *c, *d, *e;
L3:  x1 = &a; y1 = &b;
L4:  foo(x1, y1);
L5:  *b = 5;
L6:  if ( … ) { x2 = &c; y2 = &e; }
L7:  else { x3 = &d; y3 = &d; }
     x4 = ϕ(x2, x3); y4 = ϕ(y2, y3);
L8:  c = &t;
L9:  foo(x4, y4);
L10: *e = 10;
}

– x1 → {a}, y1 → {b}, x2 → {c}, y2 → {e}, x3 → {d}, y3 → {d}, x4 → {c, d}, y4 → {e, d}
Full-Sparse Analysis
– Achieve flow-sensitivity flow-insensitively
  - Regard each SSA name as a unique variable
  - Set-constraint-based pointer analysis
– Full sparse: saves both time and space
Top-down analyze level 2

main: propagate to call sites
L4: foo.p → {a}, foo.q → {b}
L9: foo.p → {c, d}, foo.q → {d, e}
Merged: foo.p → {a, c, d}, foo.q → {b, d, e}
Top-down analyze level 2

foo: expand pointer dereferences

void foo(int **p, int **q) {
     μ(b, d, e)
L11: *p1 = *q1;
     χ(a, c, d)
L12: *q1 = &obj;
     χ(b, d, e)
}

Calling contexts are merged here.
Context Condition

To be context-sensitive:
– Points-to relation: p ⟹ v (p → v) means p must (may) point to v, where p is a formal parameter.
– Context condition ℂ(c1, …, ck): a Boolean function over higher-level points-to relations.
– Context-sensitive μ and χ:
  - μ(vi, ℂ(c1, …, ck))
  - vi+1 = χ(vi, M, ℂ(c1, …, ck))
  - M ∈ {may, must} indicates a weak or strong update.
Context-sensitive μ and χ

void foo(int **p, int **q) {
     μ(b, q ⟹ b)
     μ(d, q → d)
     μ(e, q → e)
L11: *p1 = *q1;
     a = χ(a, must, p ⟹ a)
     c = χ(c, may, p → c)
     d = χ(d, may, p → d)
L12: *q1 = &obj;
     b = χ(b, must, q ⟹ b)
     d = χ(d, may, q → d)
     e = χ(e, may, q → e)
}
Bottom-up analyze level 1

void foo(int **p, int **q) {
     μ(b1, q ⟹ b)
     μ(d1, q → d)
     μ(e1, q → e)
L11: *p1 = *q1;
     a2 = χ(a1, must, p ⟹ a)
     c2 = χ(c1, may, p → c)
     d2 = χ(d1, may, p → d)
L12: *q1 = &obj;
     b2 = χ(b1, must, q ⟹ b)
     d3 = χ(d2, may, q → d)
     e2 = χ(e1, may, q → e)
}

Transfer functions:
– Trans(foo, a) = < { }, { <b, q ⟹ b>, <d, q → d>, <e, q → e> }, p ⟹ a, must >
– Trans(foo, c) = < { }, { <b, q ⟹ b>, <d, q → d>, <e, q → e> }, p → c, may >
– Trans(foo, b) = < { <obj, q ⟹ b> }, { }, q ⟹ b, must >
– Trans(foo, e) = < { <obj, q → e> }, { }, q → e, may >
– Trans(foo, d) = < { <obj, q → d> }, { <b, p → d ∧ q ⟹ b>, <d, p → d>, <e, p → d ∧ q → e> }, p → d ∨ q → d, may >
Bottom-up analyze level 1

int obj, t;
main() {
L1:  int **x, **y;
L2:  int *a, *b, *c, *d, *e;
L3:  x1 = &a; y1 = &b;
     μ(b1, true)
L4:  foo(x1, y1);
     a2 = χ(a1, must, true)
     b2 = χ(b1, must, true)
     c2 = χ(c1, may, true)
     d2 = χ(d1, may, true)
     e2 = χ(e1, may, true)
L5:  *b1 = 5;
L6:  if ( … ) { x2 = &c; y2 = &e; }
L7:  else { x3 = &d; y3 = &d; }
     x4 = ϕ(x2, x3); y4 = ϕ(y2, y3);
L8:  c1 = &t;
     μ(d1, true) μ(e1, true)
L9:  foo(x4, y4);
     a2 = χ(a1, must, true)
     b2 = χ(b1, must, true)
     c2 = χ(c1, may, true)
     d2 = χ(d1, may, true)
     e2 = χ(e1, may, true)
L10: *e1 = 10;
}
Full context-sensitive analysis
– Compute a complete transfer function for each procedure.
– The transfer function stays cheap to represent and apply:
  - Represent calling contexts by context conditions.
  - Merging similar calling contexts costs less than using calling strings.
  - Implement context conditions with BDDs, which represent them compactly and let Boolean operations be evaluated efficiently.
Experiment
– Analyzes millions of lines of code in minutes.
– Faster than the state-of-the-art FSCS pointer analysis algorithms.

Table 2. Performance (secs).

Benchmark        KLOC   LevPA 64-bit  LevPA 32-bit  Bootstrapping (PLDI'08) 32-bit
Icecast-2.3.1      22           2.18          5.73        29
sendmail          115          72.63        143.68       939
httpd             128          16.32         35.42       161
445.gobmk         197          21.37         40.78         /
wine-0.9.24      1905         502.29        891.16         /
wireshark-1.2.2  2383         366.63        845.23         /
Future work
– The points-to results are currently used only for error checking. We are working on:
  - Serving optimization: let the WPA framework generate code (connect to CG), make points-to sets usable by optimization passes, and build new optimizations on the WPA framework.
  - Serving parallelization: provide precise information that guides programmers in parallelizing code.
An interprocedural slicer
– Based on the PDG (program dependence graph)
– Compresses the PDG by merging nodes that are aliased
– Accommodates multiple pointer analyses
– Allows many problems to be solved on a slice, reducing time and space costs
Application of slicing
– Now aiding program error checking by reducing the number of states to be checked:
  - Use Saturn as the error checker.
  - Feed slices to Saturn instead of the whole program.
  - After slicing, Saturn detects errors in file and memory operations 11 and 2 times faster, respectively.
(Chart: speedup ratios of Saturn checking on slices versus whole programs, for the FILE and POINTER checkers; y-axis: ratio.)
Application of slicing
– Now aiding program error checking by reducing the number of states to be checked:
  - Use Saturn as the error checker; feed it slices instead of the whole program.
  - After slicing, Saturn detects errors in file and memory operations 11.59 and 2.06 times faster, respectively.
– Improving the accuracy of error-checking tools:
  - Use Fastcheck as the error checker; more true errors are detected by Fastcheck on slices.
(Chart: Fastcheck results on the original program versus on FSM slices; y-axis: ratio.)
Part Ⅱ Improvement on whirl2c
Improvement on whirl2c
Previous status:
– whirl2c was designed for compiler engineers to debug IPA and LNO.
– The Berkeley UPC group and the Houston OpenUH group extended whirl2c somewhat, but it still cannot support big applications and various optimizations.
Problem:
– Type information becomes incorrect because of transformations.
Improvement on whirl2c
Our work:
– Improve whirl2c so that its output can be recompiled and executed.
– Pass the SPEC CPU2000 C/C++ programs under O0/O2/O3+IPA, based on PathScale 2.2.
Motivation:
– Some customers require us not to touch their platforms.
– Support the retargetability of platform-independent optimizations.
– Support gdb on the whirl2c output.
Improvement on whirl2c
Incorrect information due to transformation:
(Figure: type information before and after structure folding. The frontend produces correct types; after structure folding, whirl2c emits wrong output.)
Improvement on whirl2c
– Incorrect type information mainly involves pointer, array, and structure types and their compositions.
– We re-infer the type information from basic types:
  - Basic type information is used to generate assembly code, so it is reliable.
  - Array element sizes are also reliable.
  - A series of rules derives the correct type information from basic type information, array element sizes, and operators.
– Information that is useful for whirl2c but made incorrect by various optimizations is corrected just before whirl2c runs, which needs little change to the existing IR.
Part Ⅲ Extending UPC for GPU clusters
Extending UPC with Hierarchical Parallelism
– UPC (Unified Parallel C) is a parallel extension to ISO C99:
  - A dialect of the PGAS (Partitioned Global Address Space) languages.
  - Suitable for distributed-memory machines, shared-memory systems, and hybrid memory systems.
  - Good performance, portability, and programmability.
– Important UPC features:
  - SPMD parallelism.
  - Shared data is partitioned into segments, each of which has affinity to one UPC thread, and is referenced through shared pointers.
  - Global workload partitioning: upc_forall with an affinity expression.
– ICT extends UPC with hierarchical parallelism:
  - Extended data distribution for shared arrays.
  - Hybrid SPMD with an implicit thread hierarchy.
  - Important optimizations targeting GPU clusters.
Source-to-source compiler, built upon Berkeley UPC (Open64)
– Frontend support.
– Analysis and transformation of upc_forall loops:
  - Shared-memory management based on reuse analysis.
  - Data-regrouping analysis for global memory coalescing: structure splitting and array transpose.
  - Instrumentation for memory consistency (in collaboration with the DSM system).
  - Affinity-aware loop tiling, for multidimensional data blocking on shared arrays.
  - Creation of data environments for kernel loops, leveraging array section analysis: copy-in, copy-out, private (allocation), formal arguments.
  - CUDA kernel code generation and runtime instrumentation: kernel functions and kernel invocation.
– Whirl2c translator: UPC => C + UPCR + CUDA.
Memory Optimizations for CUDA
– What data should be placed in shared memory?
  - First, pseudo tiling.
  - Extend REGION with reuse degree and region volume: inter-thread and intra-thread; average reuse degree for merged regions.
  - A 0-1 bin-packing problem (SM capacity). Quantify the profit: reuse degree integrated with the coalescing attribute; prefer inter-thread reuse.
– What is the optimal data layout in global memory?
  - Coalescing attributes of array references: only contiguity constraints are considered.
  - Legality analysis.
  - Cost model and amortization analysis.
– Code transformations (in a runtime library).
Extending UPC's Runtime System
– A DSM system on each UPC thread:
  - Demand-driven data transfer between GPU and CPU.
  - Manages all global variables.
  - Grain size: a UPC tile for shared arrays, a private array as a whole; shuffles remote and local array regions into one contiguous physical block before transferring.
– Data transformation for memory coalescing:
  - Implemented on the GPU side as a CUDA kernel.
  - Leverages shared memory.
Applications

Application  Description                                              Original language  Application field       Source
Nbody        n-body simulation                                        CUDA+MPI           Scientific computing    CUDA campus programming contest 2009
LBM          Lattice Boltzmann method in computational fluid dynamics C                  Scientific computing    SPEC CPU 2006
CP           Coulombic potential                                      CUDA               Scientific computing    UIUC Parboil benchmark
MRI-FHD      Magnetic resonance imaging, FHD                          CUDA               Medical image analysis  UIUC Parboil benchmark
MRI-Q        Magnetic resonance imaging, Q                            CUDA               Medical image analysis  UIUC Parboil benchmark
TPACF        Two-point angular correlation function                   CUDA               Scientific computing    UIUC Parboil benchmark
Benchmarks

(Charts: overall performance on a single CUDA node (nbody, lbm) and on the GPU cluster (nbody, mri-fhd, mri-q, tpacf, cp); bars show base DSM, memory coalescing, and SM reuse optimizations versus hand-written CUDA / CUDA+MPI; y-axis: speedup.)

Platform: each node has two dual-core AMD Opteron 880 CPUs and an NVIDIA GeForce 9800 GX2 GPU; compilers: nvcc 2.2 -O3 and GCC 3.4.6 -O3; a 4-node CUDA cluster connected by Ethernet.
UPC Performance on CUDA cluster
Part Ⅳ Open Source Loongcc
Open Source Loongcc
– Targets the LOONGSON CPU, a MIPS-like processor with new instructions.
– Based on the Open64 main trunk (r2716).
New features:
– LOONGSON machine model
– LOONGSON feature support in FE, LNO, WOPT, and CG
– Edge profiling
Thanks