performance and optimization
Post on 03-Jan-2016
Performance and Optimization
Measuring Performance
• The key measure of performance for a computing system is speed:
– Response time (also called execution time or latency).
– Throughput.
• We have seen how to increase throughput while slightly increasing the execution time of each single instruction: pipeline design.
• We now concentrate on measuring total execution time.
• Total execution time can mean:
– Elapsed time: includes all I/O, OS, and time spent on other jobs.
– CPU time: time spent by the processor on your job.
CPU Execution Time
• We consider CPU execution time on an unloaded system.
• Machine X is n times faster than machine Y if
$$\frac{\text{CPU Time}_Y}{\text{CPU Time}_X} = n \quad\text{or}\quad \frac{\text{Performance}_X}{\text{Performance}_Y} = n$$
where
– CPU Time = Execution Time
– Performance = 1 / CPU Time
• Basic measure of performance:
$$\text{CPU Time} = \frac{\text{Seconds}}{\text{Clock cycle}} \times \frac{\text{Clock cycles}}{\text{Program}} = \text{Cycles count} \times \text{Clock cycle time}$$
CPU Execution Time
• Clock cycle time is measured in nanoseconds ($10^{-9}$ sec) or microseconds ($10^{-6}$ sec).
• Clock cycle rate = 1 / (Clock cycle time) is measured in:
– Megahertz (MHz): $10^6$ cycles/sec
– Gigahertz (GHz): $10^9$ cycles/sec
CPI (Cycles per Instruction)
$$\text{Cycles count} = \frac{\text{Average clock cycles}}{\text{Instruction}} \times \frac{\text{Instructions}}{\text{Program}} = \text{IC} \times \text{CPI}$$
• CPI is one way to compare different implementations of the same Instruction Set Architecture (ISA), since the instruction count (IC) for a given program will be the same in both cases.
CPI - Pipelined Implementation
• In each cycle the execute stage either processes an instruction or a bubble, injected due to one of three special cases.
• With a total of Ci instructions and Cb bubbles, the processor has required Ci+Cb clock cycles to execute Ci instructions.
• In a pipelined implementation, CPI = (Ci+Cb)/Ci = 1 + Cb/Ci
• Cb/Ci indicates the average number of bubbles injected per instruction.

Cause        Frequency   Condition   Bubbles   Product
Load/Use     0.25        0.20        1         0.05
Mispredict   0.20        0.40        2         0.16
Return       0.02        1.00        3         0.06
Total                                          0.27

Thus, in our implementation CPI = 1.27
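The table arithmetic can be reproduced with a few lines of C. This is a minimal sketch, not part of the original slides: the `struct penalty` type and the `pipeline_cpi` function are hypothetical names introduced for illustration. Each row contributes frequency x condition x bubbles to Cb/Ci, and the CPI is 1 plus the sum.

```c
#include <assert.h>
#include <stddef.h>

/* One row of the penalty table (hypothetical type, for illustration only). */
struct penalty {
    double frequency;  /* fraction of all instructions of this kind */
    double condition;  /* fraction of those that trigger the special case */
    int bubbles;       /* bubbles injected each time it triggers */
};

/* CPI = 1 + Cb/Ci, where Cb/Ci is the sum of the per-cause products. */
double pipeline_cpi(const struct penalty *rows, size_t count)
{
    assert(rows != NULL || count == 0);
    double bubbles_per_instruction = 0.0;
    for (size_t k = 0; k < count; k++)
        bubbles_per_instruction +=
            rows[k].frequency * rows[k].condition * rows[k].bubbles;
    return 1.0 + bubbles_per_instruction;
}
```

Feeding in the three rows of the table (0.25 x 0.20 x 1, 0.20 x 0.40 x 2, 0.02 x 1.00 x 3) gives 1 + 0.27.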
CPI Example
• We have two machines with different implementations of the same ISA (Instruction Set Architecture). Machine A has a clock cycle time of 10 ns and a CPI of 2.0 for program P; machine B has a clock cycle time of 20 ns and a CPI of 1.2 for the same program. Which machine is faster?
• Let IC be the number of instructions to be executed (the same on both machines). Then:
Cycles count_A = 2.0 IC
Cycles count_B = 1.2 IC
Calculate CPU Time for each machine:
CPU Time_A = 2.0 IC x 10 ns = 20.0 IC ns
CPU Time_B = 1.2 IC x 20 ns = 24.0 IC ns
» Machine A is faster: CPU Time_B / CPU Time_A = 24/20 = 1.2, so A is 1.2 times as fast as B (20% faster).
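The comparison above can be sketched in C as follows. The function names `cpu_time_ns` and `times_faster` are assumptions made for this illustration; they simply encode CPU Time = IC x CPI x clock cycle time and the ratio CPU Time_Y / CPU Time_X.

```c
#include <assert.h>

/* CPU time in nanoseconds: instruction count x CPI x clock cycle time. */
double cpu_time_ns(double instruction_count, double cpi, double cycle_time_ns)
{
    assert(instruction_count >= 0.0 && cpi > 0.0 && cycle_time_ns > 0.0);
    return instruction_count * cpi * cycle_time_ns;
}

/* Machine X is n times faster than machine Y if CPU Time_Y / CPU Time_X = n. */
double times_faster(double cpu_time_x, double cpu_time_y)
{
    assert(cpu_time_x > 0.0);
    return cpu_time_y / cpu_time_x;
}
```

Because IC is the same on both machines, any positive instruction count gives the same ratio of 1.2 for machines A and B.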
Composite Performance Measure
$$\text{CPU Time} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Average clock cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Clock cycle}}$$
or
$$\text{CPU Time} = \text{Instruction Count} \times \text{CPI} \times \text{Clock cycle time}$$
or
$$\text{CPU Time} = \frac{\text{Instruction Count} \times \text{CPI}}{\text{Clock rate}}$$
• These formulas show that performance is always a function of 3 distinct factors; 1 or 2 factors alone are not sufficient.
• IC (Instruction Count) was once the main factor advertised (VAX); today clock rate is in the headlines (3 GHz Pentiums).
• CPI is more difficult to advertise.
• Changing one factor often affects the others. For example:
– Decreasing the instruction count means each instruction is doing more; hence either CPI or cycle time, or both, may increase.
• A smart compiler may decrease CPI by choosing the right kind and order of instructions, without a large increase in instruction count.
Amdahl's Law
• Make the common case fast. Why?
• Denote the part of the system that was enhanced as the enhanced fraction, or F_enhanced.
$$\text{Speedup} = \frac{\text{CPU Time}_{old}}{\text{CPU Time}_{new}} = \frac{\text{CPU Time}_{old}}{\text{CPU Time}_{old}\,(1 - F_{enhanced}) + \text{CPU Time}_{old}\,F_{enhanced}\,(1/\text{speedup}_{enhanced})} = \frac{1}{(1 - F_{enhanced}) + \dfrac{F_{enhanced}}{\text{speedup}_{enhanced}}}$$
Amdahl's Law (Example)
Suppose we have a technique for improving the performance of FP operations by a factor of 10. What fraction of the code must be floating point to achieve a 200% improvement in performance (that is, an overall speedup of 3)?
$$3 = \frac{1}{(1 - F_{enhanced}) + \dfrac{F_{enhanced}}{10}} \implies F_{enhanced} = \frac{20}{27} \approx 74\%$$
Even dramatic enhancements make a limited contribution unless they relate to a very common case.
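The formula and its inverse can be sketched in C. The names `amdahl_speedup` and `required_fraction` are hypothetical, chosen for this illustration. The inverse follows from 1/S = 1 - F(1 - 1/s), which gives F = (1 - 1/S) / (1 - 1/s), where S is the target overall speedup and s is the speedup of the enhanced part.

```c
#include <assert.h>

/* Overall speedup when a fraction f_enhanced of the time is sped up
   by a factor speedup_enhanced. */
double amdahl_speedup(double f_enhanced, double speedup_enhanced)
{
    assert(f_enhanced >= 0.0 && f_enhanced <= 1.0 && speedup_enhanced > 0.0);
    return 1.0 / ((1.0 - f_enhanced) + f_enhanced / speedup_enhanced);
}

/* Fraction that must be enhanced to reach target_speedup overall:
   F = (1 - 1/S) / (1 - 1/s). */
double required_fraction(double target_speedup, double speedup_enhanced)
{
    assert(target_speedup >= 1.0 && speedup_enhanced > 1.0);
    return (1.0 - 1.0 / target_speedup) / (1.0 - 1.0 / speedup_enhanced);
}
```

With a target of 3 and an enhancement factor of 10, `required_fraction` evaluates (2/3) / (9/10) = 20/27, matching the example.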
Amdahl's Law (example cont.)
Let us assume a SEQ processor in hardware, with the following percentage of time spent in each stage and the following speedup applied to each stage:

Stage       Time spent   Speedup
Fetch       11%          1x
Decode      18%          5x
Execute     23%          20x
Memory      40%          1.6x
WriteBack   8%           1x

0.11/1 + 0.18/5 + 0.23/20 + 0.40/1.6 + 0.08/1 = 0.4875
Speedup = 1/0.4875 ≈ 2
The 5x and 20x speedups have little effect.
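The same calculation generalizes to any number of components. The following is a rough sketch with a hypothetical function name, `overall_speedup`: the new time is the sum of each fraction divided by its own speedup, and the overall speedup is the reciprocal of that sum. It assumes the fractions add up to 1.

```c
#include <assert.h>
#include <stddef.h>

/* Overall speedup when component k takes fraction[k] of the original
   time and is sped up by speedup[k]. Assumes the fractions sum to 1. */
double overall_speedup(const double *fraction, const double *speedup,
                       size_t count)
{
    assert(fraction != NULL && speedup != NULL && count > 0);
    double new_time = 0.0;  /* new time relative to an old time of 1 */
    for (size_t k = 0; k < count; k++) {
        assert(speedup[k] > 0.0);
        new_time += fraction[k] / speedup[k];
    }
    assert(new_time > 0.0);
    return 1.0 / new_time;
}
```

For the five stages above the sum is 0.4875, dominated by the Memory term 0.40/1.6 = 0.25, which is why the large speedups on Decode and Execute contribute so little.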
Amdahl's Law (Parallel)
$$\text{Max speedup with } N \text{ processors} = \frac{1}{(1 - P) + \dfrac{P}{N}}$$
• P is the portion of the code that can be made parallel.
• N is the number of processors.
• With a very large number of processors, the speedup is bounded by 1/(1 - P).
• What does that mean about the efficiency of parallel computing? For which kinds of problems?
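A small C sketch makes the bound concrete. It uses the standard parallel form of Amdahl's Law, speedup = 1 / ((1 - P) + P/N); the function names `parallel_speedup` and `parallel_speedup_limit` are assumptions introduced for this illustration.

```c
#include <assert.h>

/* Speedup on n processors when a portion p of the code is parallel. */
double parallel_speedup(double p, unsigned long n)
{
    assert(p >= 0.0 && p <= 1.0 && n > 0);
    return 1.0 / ((1.0 - p) + p / (double)n);
}

/* Limit of the speedup as the number of processors grows without bound. */
double parallel_speedup_limit(double p)
{
    assert(p >= 0.0 && p < 1.0);
    return 1.0 / (1.0 - p);
}
```

For example, with P = 0.9 the speedup on 10 processors is 1/0.19, a little over 5, and no number of processors can push it past 10.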
Important to Keep in Mind
• 90/10 rule: 90% of the time is spent in 10% of the code.
• Readability vs. performance.
• Time vs. memory.
Machine-Independent Optimizations
– Optimizations you should do regardless of processor / compiler
• Code Motion
– Reduce the frequency with which a computation is performed
• If it will always produce the same result
• Especially moving code out of a loop

Original:
for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
        a[n*i + j] = b[j];

After code motion:
for (i = 0; i < n; i++) {
    int ni = n*i;
    for (j = 0; j < n; j++)
        a[ni + j] = b[j];
}
Reduction in Strength
– Replace costly operation with simpler one
– Shift, add instead of multiply or divide
    16*x --> x << 4
• Utility is machine dependent
• Depends on cost of multiply or divide instruction
• On Pentium II or III, integer multiply only requires 4 CPU cycles
– Recognize sequence of products

Original:
for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
        a[n*i + j] = b[j];

After strength reduction:
int ni = 0;
for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++)
        a[ni + j] = b[j];
    ni += n;
}
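One way to gain confidence in a transformation like this is to wrap both loop nests in functions and check that they fill the array identically. The sketch below does that; the function names `copy_rows_multiply` and `copy_rows_add` are hypothetical, and `size_t` indices are used instead of the `int` indices in the snippets above.

```c
#include <assert.h>
#include <stddef.h>

/* Baseline: recomputes n*i on every inner iteration. */
void copy_rows_multiply(int *a, const int *b, size_t n)
{
    assert(a != NULL && b != NULL);
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            a[n * i + j] = b[j];
}

/* Strength-reduced: the running offset ni takes the values 0, n, 2n, ...
   so an add per outer iteration replaces the multiply. */
void copy_rows_add(int *a, const int *b, size_t n)
{
    assert(a != NULL && b != NULL);
    size_t ni = 0;
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++)
            a[ni + j] = b[j];
        ni += n;
    }
}
```

Both functions write every row of the n-by-n array `a` with a copy of `b`, so comparing their outputs element by element is a cheap regression check.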
Using more efficient instructions
[Figure: examples labeled "Can" and "Cannot"]
Arrays and Loops Example
[Figure: original code and the corresponding assembly code]
Loop Optimization
Loop optimization is the process of increasing execution speed and reducing the overhead associated with loops. It plays an important role in improving cache performance and making effective use of parallel processing capabilities. Most of a program's execution time is spent in loops, so many compiler optimization techniques have been developed to make them faster.
Time Scales
• Absolute Time
– Typically use nanoseconds
• $10^{-9}$ seconds
– Time scale of computer instructions
• Clock Cycles
– Most computers are controlled by a high-frequency clock signal
– Typical range
• 100 MHz
– $10^8$ cycles per second
– Clock period = 10 ns
• 2 GHz
– $2 \times 10^9$ cycles per second
– Clock period = 0.5 ns
Cycles Per Element
– Convenient way to express the performance of a program that operates on vectors or lists
– Length = n
– T = CPE*n + Overhead
[Figure: cycles (0 to 1000) vs. number of elements (0 to 200) for vsum1 (slope = 4.0) and vsum2 (slope = 3.5)]
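Since T = CPE*n + Overhead is a straight line, two timing measurements at different lengths are enough to estimate both the slope (CPE) and the intercept (Overhead). The following is a minimal sketch under that assumption; `estimate_cpe` and `estimate_overhead` are hypothetical helper names, and real measurements would normally be fitted over many points rather than two.

```c
#include <assert.h>

/* Slope of the line T = CPE*n + Overhead through two measurements
   (n1, cycles1) and (n2, cycles2). */
double estimate_cpe(double n1, double cycles1, double n2, double cycles2)
{
    assert(n1 != n2);
    return (cycles2 - cycles1) / (n2 - n1);
}

/* Intercept of the same line, given one measurement and the slope. */
double estimate_overhead(double n1, double cycles1, double cpe)
{
    return cycles1 - cpe * n1;
}
```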
Optimization Example
• Procedure
– Compute sum of all elements of integer vector
– Store result at destination location
– Vector data structure and operations defined via abstract data type
• Pentium II/III Performance: Clock Cycles / Element
– 42.06 (Compiled -g)
– 31.25 (Compiled -O2)

void combine1(vec_ptr v, int *dest)
{
    int i;
    *dest = 0;
    for (i = 0; i < vec_length(v); i++) {
        int val;
        get_vec_element(v, i, &val);
        *dest += val;
    }
}
General GCC optimization commands
• Most optimizations are only enabled if -O is set on the command line. Otherwise they are disabled, even if individual optimization flags are specified.
• With -O, the compiler tries to reduce code size and execution time, without performing any optimizations that take a great deal of compilation time.
• With -O2, GCC optimizes even more: it performs nearly all supported optimizations that do not involve a space-speed tradeoff. Compared to -O, this option increases both compilation time and the performance of the generated code.
Understanding Loop
• Inefficiency
– Procedure vec_length called every iteration
– Even though result always the same

void combine1_goto(vec_ptr v, int *dest)
{
    int i = 0;
    int val;
    *dest = 0;
    if (i >= vec_length(v))
        goto done;
loop:                              /* start of 1 iteration */
    get_vec_element(v, i, &val);
    *dest += val;
    i++;
    if (i < vec_length(v))         /* vec_length is called again on every pass */
        goto loop;                 /* end of 1 iteration */
done:
    return;
}
Move vec_length Call Out of Loop
• Optimization
– Move call to vec_length out of inner loop
• Value does not change from one iteration to next
• Code motion
– CPE: 20.66 (Compiled -O2)
• vec_length requires only constant time, but significant overhead

void combine2(vec_ptr v, int *dest)
{
    int i;
    int length = vec_length(v);
    *dest = 0;
    for (i = 0; i < length; i++) {
        int val;
        get_vec_element(v, i, &val);
        *dest += val;
    }
}
Arrays optimizations - Loops
Arrays and loops optimizations:
• No need for a loop variable.
• Using pointer arithmetic I: instead of increasing a loop variable by one, the loop increases the pointer by the size of the data type.
• Using pointer arithmetic II: the loop computes the address of the final array element and uses a comparison against this address as the loop test (do-while loop).
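Written out by hand in C, the two pointer-arithmetic ideas might look like the sketch below. This illustrates the pattern a compiler may generate rather than recommending a source style; the function names `sum_indexed` and `sum_pointer` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Index version: uses a loop variable and recomputes data + i each time. */
int sum_indexed(const int *data, size_t length)
{
    assert(data != NULL || length == 0);
    int sum = 0;
    for (size_t i = 0; i < length; i++)
        sum += data[i];
    return sum;
}

/* Pointer version: no loop variable. The pointer itself advances
   (by sizeof(int) bytes per step), and the loop test compares it with
   the address of the final element in a do-while loop. */
int sum_pointer(const int *data, size_t length)
{
    assert(data != NULL || length == 0);
    int sum = 0;
    if (length == 0)
        return sum;                       /* do-while needs at least one element */
    const int *p = data;
    const int *last = data + length - 1;  /* address of the final element */
    do {
        sum += *p;
        p++;
    } while (p <= last);
    return sum;
}
```

The guard for an empty array is needed because a do-while loop always executes its body once.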
Optimization techniques (C++ oriented)
• For loops: use ++i instead of i++.
– i++ needs to be able to return the original, unincremented value and therefore must store it, whereas ++i can return the incremented value without storing the previous one. (This matters mainly with old compilers, but it is still good practice because you never know which machine will run your code.)
• Especially on non-primitive types, x += a is more efficient than x = x + a. (This probably doesn't matter to most compilers nowadays, but x = x + a once used to evaluate x twice.)
• Count down to 0 instead of up: it's usually faster to compare against 0.
Reduction in Strength
• Optimization
– Avoid procedure call to retrieve each vector element
• Get pointer to start of array before loop
• Within loop just do pointer reference
• Not as clean in terms of data abstraction
– CPE: 6.00 (Compiled -O2)
• Procedure calls are expensive!
• Bounds checking is expensive

void combine3(vec_ptr v, int *dest)
{
    int i;
    int length = vec_length(v);
    int *data = get_vec_start(v);
    *dest = 0;
    for (i = 0; i < length; i++) {
        *dest += data[i];
    }
}
Eliminate Unneeded Memory Refs
• Optimization
– Don't need to store in destination until end
– Local variable sum held in register
– Avoids 1 memory read, 1 memory write per iteration
– CPE: 2.00 (Compiled -O2)
• Memory references are expensive!

void combine4(vec_ptr v, int *dest)
{
    int i;
    int length = vec_length(v);
    int *data = get_vec_start(v);
    int sum = 0;
    for (i = 0; i < length; i++)
        sum += data[i];
    *dest = sum;
}
Detecting Unneeded Memory Refs.
• Performance
– Combine3
• 5 instructions in 6 clock cycles
• addl must read and write memory
– Combine4
• 4 instructions in 2 clock cycles

Combine3:
.L18:
    movl (%ecx,%edx,4),%eax
    addl %eax,(%edi)
    incl %edx
    cmpl %esi,%edx
    jl .L18

Combine4:
.L24:
    addl (%eax,%edx,4),%ecx
    incl %edx
    cmpl %esi,%edx
    jl .L24
Optimization Blocker: Memory Aliasing
• Aliasing
– Two different memory references specify a single location
• Example
– v: [3, 2, 17]
– combine3(v, get_vec_start(v)+2) --> ?
– combine4(v, get_vec_start(v)+2) --> ?
• Observations
– Easy to have happen in C
• Since allowed to do address arithmetic
• Direct access to storage structures
– Get in habit of introducing local variables
• Accumulating within loops
• Your way of telling compiler not to check for aliasing
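The effect of aliasing can be reproduced with plain arrays. The sketch below is an assumption-laden stand-in for the slide's example: it replaces the vec_ptr abstract data type with a raw `int *` and a length, and the names `sum_via_memory` (combine3 style) and `sum_via_local` (combine4 style) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* combine3 style: accumulates directly in *dest, so every iteration
   reads and writes memory that may overlap with data[]. */
void sum_via_memory(int *data, size_t length, int *dest)
{
    assert(data != NULL && dest != NULL);
    *dest = 0;
    for (size_t i = 0; i < length; i++)
        *dest += data[i];
}

/* combine4 style: accumulates in a local variable and stores once,
   so data[] is read in full before *dest is modified. */
void sum_via_local(int *data, size_t length, int *dest)
{
    assert(data != NULL && dest != NULL);
    int sum = 0;
    for (size_t i = 0; i < length; i++)
        sum += data[i];
    *dest = sum;
}
```

Tracing both on v = [3, 2, 17] with dest pointing at the last element: the first version zeroes v[2], adds 3 and 2 to get 5, then adds v[2] to itself and ends with 10. The second version sums the untouched elements and stores 22. Because the two versions disagree when dest aliases the data, the compiler cannot perform this transformation on its own.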
Loop Unrolling
• Optimization
– Combine multiple iterations into single loop body
– Amortizes loop overhead across multiple iterations
– Finish extras at end
– Measured CPE = 1.33

void combine5(vec_ptr v, int *dest)
{
    int length = vec_length(v);
    int limit = length-2;
    int *data = get_vec_start(v);
    int sum = 0;
    int i;
    /* Combine 3 elements at a time */
    for (i = 0; i < limit; i+=3) {
        sum += data[i] + data[i+1]
            + data[i+2];
    }
    /* Finish any remaining elements */
    for (; i < length; i++) {
        sum += data[i];
    }
    *dest = sum;
}
Parallel Loop Unrolling
• Code Version
– Integer product
• Optimization
– Accumulate in two different products
• Can be performed simultaneously
– Combine at end
• Performance
– CPE = 2.0
– 2X performance

void combine6(vec_ptr v, int *dest)
{
    int length = vec_length(v);
    int limit = length-1;
    int *data = get_vec_start(v);
    int x0 = 1;
    int x1 = 1;
    int i;
    /* Combine 2 elements at a time */
    for (i = 0; i < limit; i+=2) {
        x0 *= data[i];
        x1 *= data[i+1];
    }
    /* Finish any remaining elements */
    for (; i < length; i++) {
        x0 *= data[i];
    }
    *dest = x0 * x1;
}
Code Optimizing
Time
• Using efficient algorithms.
• Constant calculations: outside of loops.
• Accessing memory as little as possible.
• Using more efficient instructions (shift vs. mult).
• Minimizing calls to inefficient functions.
Memory
• Using the smallest data types that fit.
• Using more efficient structures that allow for the same functionality with less memory (see example later).
• Using as few variables as possible.