Institute of Computing Technology, Chinese Academy of Sciences
Working and Researching on Open64
Hongtao Yu Feng Li Wei HuoWei Mi Li Chen Chunhui Ma
Wenwen Xu Ruiqi Lian Xiaobing Feng
Outline
– Reform Open64 as an aggressive program analysis tool
  - Source code analysis and error checking
– Source-to-source transformation: WHIRL to C
– Extending UPC for GPU clusters
– New targeting: the LOONGSON CPU
Part Ⅰ Aggressive program analysis
Whole Program Analysis (WPA)
– Aims at error checking
– A framework
– Pointer analysis
  - The foundation of other program analyses
  - Flow- and context-sensitive
– Program slicing
  - Interprocedural
  - Reduces program size for specific problems
(Figure: the WPA framework. The IPL summary phase feeds IPA_LINK, which builds the call graph, constructs SSA form for each procedure, and runs the FSCS pointer analysis (LevPA); the Whole Program Analyzer then drives the static slicer and the static error checker.)
LevPA: Level-by-Level Pointer Analysis
– A flow- and context-sensitive pointer analysis
– Fast: analyzes millions of lines of code
– The work has been published as:
  Hongtao Yu, Jingling Xue, Wei Huo, Zhaoqing Zhang, Xiaobing Feng. Level by Level: Making Flow- and Context-Sensitive Pointer Analysis Scalable for Millions of Lines of Code. In Proceedings of the 2010 International Symposium on Code Generation and Optimization (CGO), April 24-28, 2010, Toronto, Canada.
LevPA: level-by-level analysis
– Analyze the pointers in decreasing order of their points-to levels. Suppose int **q, *p, x; then q has level 2, p has level 1, and x has level 0. A variable can be referenced directly, or indirectly through dereferences of another pointer.
– Fast flow-sensitive analysis on full-sparse SSA
– Fast and accurate context-sensitive analysis using a full transfer function
Framework
(Figure 1. Level-by-level pointer analysis (LevPA): compute points-to levels; then, for each points-to level from the highest to the lowest, evaluate transfer functions bottom-up and propagate points-to sets top-down, incrementally building the call graph.)
Example
int o, t;main() { L1: int **x, **y;
L2: int *a, *b, *c, *d, *e; L3: x = &a; y = &b; L4: foo(x, y); L5: *b = 5; L6: if ( … ) { x = &c; y = &e; } L7: else { x = &d; y = &d; } L8: c = &t; L9: foo( x, y); L10: *e = 10; }
void foo( int **p, int **q) { L11: *p = *q; L12: *q = &obj;}
9
ptl(x, y, p, q) = 2; ptl(a, b, c, d, e) = 1; ptl(t, obj) = 0
Analyze first {x, y, p, q}, then {a, b, c, d, e}, and last {t, obj}.
Bottom-up analyze level 2

void foo(int **p, int **q) {
L11: *p = *q;
L12: *q = &obj;
}

main() {
L1:  int **x, **y;
L2:  int *a, *b, *c, *d, *e;
L3:  x = &a; y = &b;
L4:  foo(x, y);
L5:  *b = 5;
L6:  if ( … ) { x = &c; y = &e; }
L7:  else { x = &d; y = &d; }
L8:  c = &t;
L9:  foo(x, y);
L10: *e = 10;
}
Bottom-up analyze level 2 (foo in SSA form)

void foo(int **p, int **q) {
L11: *p1 = *q1;
L12: *q1 = &obj;
}

– p1's points-to set depends on formal-in p
– q1's points-to set depends on formal-in q
Bottom-up analyze level 2 (main in SSA form)

main() {
L1:  int **x, **y;
L2:  int *a, *b, *c, *d, *e;
L3:  x1 = &a; y1 = &b;
L4:  foo(x1, y1);
L5:  *b = 5;
L6:  if ( … ) { x2 = &c; y2 = &e; }
L7:  else { x3 = &d; y3 = &d; }
     x4 = ϕ(x2, x3); y4 = ϕ(y2, y3);
L8:  c = &t;
L9:  foo(x4, y4);
L10: *e = 10;
}

– x1 → {a}, y1 → {b}, x2 → {c}, y2 → {e}, x3 → {d}, y3 → {d}, x4 → {c, d}, y4 → {e, d}
Full-Sparse Analysis
– Achieve flow-sensitivity flow-insensitively
  - Regard each SSA name as a unique variable
  - Set-constraint-based pointer analysis
– Full sparse: saves both time and space
Top-down analyze level 2

main: propagate to call sites
L4: foo.p → {a}, foo.q → {b}
L9: foo.p → {c, d}, foo.q → {d, e}
Merged: foo.p → {a, c, d}, foo.q → {b, d, e}
Top-down analyze level 2

foo: expand pointer dereferences

void foo(int **p, int **q) {
     μ(b, d, e)
L11: *p1 = *q1;
     χ(a, c, d)
L12: *q1 = &obj;
     χ(b, d, e)
}

Calling contexts are merged here.
Context Condition

To be context-sensitive:
– Points-to relation: p ⟹ v (p → v) means p must (may) point to v, where p is a formal parameter.
– Context condition ℂ(c1, …, ck): a Boolean function over higher-level points-to relations.
– Context-sensitive μ and χ:
  - μ(vi, ℂ(c1, …, ck))
  - vi+1 = χ(vi, M, ℂ(c1, …, ck))
  - M ∈ {may, must} indicates a weak or strong update.
Context-sensitive μ and χ

void foo(int **p, int **q) {
     μ(b, q ⟹ b)
     μ(d, q → d)
     μ(e, q → e)
L11: *p1 = *q1;
     a = χ(a, must, p ⟹ a)
     c = χ(c, may, p → c)
     d = χ(d, may, p → d)
L12: *q1 = &obj;
     b = χ(b, must, q ⟹ b)
     d = χ(d, may, q → d)
     e = χ(e, may, q → e)
}
Bottom-up analyze level 1

void foo(int **p, int **q) {
     μ(b1, q ⟹ b)
     μ(d1, q → d)
     μ(e1, q → e)
L11: *p1 = *q1;
     a2 = χ(a1, must, p ⟹ a)
     c2 = χ(c1, may, p → c)
     d2 = χ(d1, may, p → d)
L12: *q1 = &obj;
     b2 = χ(b1, must, q ⟹ b)
     d3 = χ(d2, may, q → d)
     e2 = χ(e1, may, q → e)
}

Transfer functions:
– Trans(foo, a) = < { }, { <b, q ⟹ b>, <d, q → d>, <e, q → e> }, p ⟹ a, must >
– Trans(foo, c) = < { }, { <b, q ⟹ b>, <d, q → d>, <e, q → e> }, p → c, may >
– Trans(foo, b) = < { <obj, q ⟹ b> }, { }, q ⟹ b, must >
– Trans(foo, e) = < { <obj, q → e> }, { }, q → e, may >
– Trans(foo, d) = < { <obj, q → d> }, { <b, p → d ∧ q ⟹ b>, <d, p → d>, <e, p → d ∧ q → e> }, p → d ∨ q → d, may >
Bottom-up analyze level 1

int obj, t;
main() {
L1:  int **x, **y;
L2:  int *a, *b, *c, *d, *e;
L3:  x1 = &a; y1 = &b;
     μ(b1, true)
L4:  foo(x1, y1);
     a2 = χ(a1, must, true)
     b2 = χ(b1, must, true)
     c2 = χ(c1, may, true)
     d2 = χ(d1, may, true)
     e2 = χ(e1, may, true)
L5:  *b1 = 5;
L6:  if ( … ) { x2 = &c; y2 = &e; }
L7:  else { x3 = &d; y3 = &d; }
     x4 = ϕ(x2, x3); y4 = ϕ(y2, y3);
L8:  c1 = &t;
     μ(d1, true) μ(e1, true)
L9:  foo(x4, y4);
     a2 = χ(a1, must, true)
     b2 = χ(b1, must, true)
     c2 = χ(c1, may, true)
     d2 = χ(d1, may, true)
     e2 = χ(e1, may, true)
L10: *e1 = 10;
}
Full context-sensitive analysis
– Compute a complete transfer function for each procedure.
– The transfer function stays cheap to represent and apply:
  - Represent calling contexts by context conditions.
  - Merging similar calling contexts costs less than using calling strings.
  - Implement context conditions with BDDs, which represent them compactly and let Boolean operations be evaluated efficiently.
Experiment
– Analyzes millions of lines of code in minutes.
– Faster than the state-of-the-art FSCS pointer analysis algorithms.

Table 2. Performance (secs).

Benchmark        KLOC   LevPA 64-bit  LevPA 32-bit  Bootstrapping (PLDI'08) 32-bit
Icecast-2.3.1      22           2.18          5.73        29
sendmail          115          72.63        143.68       939
httpd             128          16.32         35.42       161
445.gobmk         197          21.37         40.78         /
wine-0.9.24      1905         502.29        891.16         /
wireshark-1.2.2  2383         366.63        845.23         /
Future work
– The points-to results are currently used only for error checking. We are working on:
  - Serving optimization: let the WPA framework generate code (connect to CG), make points-to sets usable by optimization passes, and build new optimizations on the WPA framework.
  - Serving parallelization: provide precise information that guides programmers in parallelizing code.
An interprocedural slicer
– Based on the PDG (program dependence graph)
– Compresses the PDG by merging nodes that are aliased
– Accommodates multiple pointer analyses
– Allows many problems to be solved on a slice, reducing time and space costs
Application of slicing
– Now aiding program error checking by reducing the number of states to be checked:
  - Use Saturn as the error checker.
  - Feed slices to Saturn instead of the whole program.
  - After slicing, Saturn detects errors in file and memory operations 11 and 2 times faster, respectively.
(Chart: speedup ratios of Saturn checking on slices versus whole programs, for the FILE and POINTER checkers; y-axis: ratio.)
Application of slicing
– Now aiding program error checking by reducing the number of states to be checked:
  - Use Saturn as the error checker; feed it slices instead of the whole program.
  - After slicing, Saturn detects errors in file and memory operations 11.59 and 2.06 times faster, respectively.
– Improving the accuracy of error-checking tools:
  - Use Fastcheck as the error checker; more true errors are detected by Fastcheck on slices.
(Chart: Fastcheck results on the original program versus on FSM slices; y-axis: ratio.)
Part Ⅱ Improvement on whirl2c
Improvement on whirl2c
Previous status:
– whirl2c was designed for compiler engineers to debug IPA and LNO.
– The Berkeley UPC group and the Houston OpenUH group extended whirl2c somewhat, but it still cannot support big applications and various optimizations.
Problem:
– Type information becomes incorrect because of transformations.
Improvement on whirl2c
Our work:
– Improve whirl2c so that its output can be recompiled and executed.
– Pass the SPEC CPU2000 C/C++ programs under O0/O2/O3+IPA, based on PathScale 2.2.
Motivation:
– Some customers require us not to touch their platforms.
– Support the retargetability of platform-independent optimizations.
– Support gdb on the whirl2c output.
Improvement on whirl2c
Incorrect information due to transformation:
(Figure: type information before and after structure folding. The frontend produces correct types; after structure folding, whirl2c emits wrong output.)
Improvement on whirl2c
– Incorrect type information mainly involves pointer, array, and structure types and their compositions.
– We re-infer the type information from basic types:
  - Basic type information is used to generate assembly code, so it is reliable.
  - Array element sizes are also reliable.
  - A series of rules derives the correct type information from basic type information, array element sizes, and operators.
– Information that is useful for whirl2c but made incorrect by various optimizations is corrected just before whirl2c runs, which needs little change to the existing IR.
Part Ⅲ Extending UPC for GPU clusters
Extending UPC with Hierarchical Parallelism
– UPC (Unified Parallel C) is a parallel extension to ISO C99:
  - A dialect of the PGAS (Partitioned Global Address Space) languages.
  - Suitable for distributed-memory machines, shared-memory systems, and hybrid memory systems.
  - Good performance, portability, and programmability.
– Important UPC features:
  - SPMD parallelism.
  - Shared data is partitioned into segments, each of which has affinity to one UPC thread, and is referenced through shared pointers.
  - Global workload partitioning: upc_forall with an affinity expression.
– ICT extends UPC with hierarchical parallelism:
  - Extended data distribution for shared arrays.
  - Hybrid SPMD with an implicit thread hierarchy.
  - Important optimizations targeting GPU clusters.
Source-to-source compiler, built upon Berkeley UPC (Open64)
– Frontend support.
– Analysis and transformation of upc_forall loops:
  - Shared-memory management based on reuse analysis.
  - Data-regrouping analysis for global memory coalescing: structure splitting and array transpose.
  - Instrumentation for memory consistency (in collaboration with the DSM system).
  - Affinity-aware loop tiling, for multidimensional data blocking on shared arrays.
  - Creation of data environments for kernel loops, leveraging array section analysis: copy-in, copy-out, private (allocation), formal arguments.
  - CUDA kernel code generation and runtime instrumentation: kernel functions and kernel invocation.
– Whirl2c translator: UPC => C + UPCR + CUDA.
Memory Optimizations for CUDA
– What data should be placed in shared memory?
  - First, pseudo tiling.
  - Extend REGION with reuse degree and region volume: inter-thread and intra-thread; average reuse degree for merged regions.
  - A 0-1 bin-packing problem (SM capacity). Quantify the profit: reuse degree integrated with the coalescing attribute; prefer inter-thread reuse.
– What is the optimal data layout in global memory?
  - Coalescing attributes of array references: only contiguity constraints are considered.
  - Legality analysis.
  - Cost model and amortization analysis.
– Code transformations (in a runtime library).
Extending UPC's Runtime System
– A DSM system on each UPC thread:
  - Demand-driven data transfer between GPU and CPU.
  - Manages all global variables.
  - Grain size: a UPC tile for shared arrays, a private array as a whole; shuffles remote and local array regions into one contiguous physical block before transferring.
– Data transformation for memory coalescing:
  - Implemented on the GPU side as a CUDA kernel.
  - Leverages shared memory.
Applications

Application  Description                                              Original language  Application field       Source
Nbody        n-body simulation                                        CUDA+MPI           Scientific computing    CUDA campus programming contest 2009
LBM          Lattice Boltzmann method in computational fluid dynamics C                  Scientific computing    SPEC CPU 2006
CP           Coulombic potential                                      CUDA               Scientific computing    UIUC Parboil benchmark
MRI-FHD      Magnetic resonance imaging, FHD                          CUDA               Medical image analysis  UIUC Parboil benchmark
MRI-Q        Magnetic resonance imaging, Q                            CUDA               Medical image analysis  UIUC Parboil benchmark
TPACF        Two-point angular correlation function                   CUDA               Scientific computing    UIUC Parboil benchmark
Benchmarks

(Charts: overall performance on a single CUDA node (nbody, lbm) and on the GPU cluster (nbody, mri-fhd, mri-q, tpacf, cp); bars show base DSM, memory coalescing, and SM reuse optimizations versus hand-written CUDA / CUDA+MPI; y-axis: speedup.)

Platform: each node has two dual-core AMD Opteron 880 CPUs and an NVIDIA GeForce 9800 GX2 GPU; compilers: nvcc 2.2 -O3 and GCC 3.4.6 -O3; a 4-node CUDA cluster connected by Ethernet.
UPC Performance on CUDA cluster
Part Ⅳ Open Source Loongcc
Open Source Loongcc
– Targets the LOONGSON CPU, a MIPS-like processor with new instructions.
– Based on the Open64 main trunk (r2716).
New features:
– LOONGSON machine model
– LOONGSON feature support in FE, LNO, WOPT, and CG
– Edge profiling
Thanks