algorithm


Upload: shreeyesh-menon

Post on 23-Dec-2015


DESCRIPTION

Basics of algorithm design


Lecture 4

Session on Converting Algorithms Written in Pseudo Code to Programs and Experimental Evaluation of a Program for Performance Evaluation

A generic approach that illustrates the process of moving from algorithms written in pseudo code to evaluated programs. The steps are :

• Writing C++ programs from the pseudo code specification : involves making choices for data structures; an appropriate collection of functions and their inter-relationships; i/o handling; other features supported by C++.

• Generating input data sets : issues are i) how many different data sets, ii) what different sizes of input data, iii) scale factor for input data.

• Measuring run-time information from the OS.

• Processing run-time data to generate average performance.

Case Study

Take pseudo code for Algorithm 1 for finding the largest contiguous positive sum in an array.

int max ( int x, int y)
{
    if ( x <= y ) return y;
    else return x;
}

int max_pos_sum( int b[], int size)
{
    int maxsofar = 0;
    for (int i = 0; i < size; i++)
        for ( int j = i; j < size; j++)
        {
            int sum = 0;
            for ( int k = i; k <= j; k++)
                sum = sum + b[k];
            maxsofar = max(maxsofar, sum);
        }
    return maxsofar;
}
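The three nested loops are the reason the run time of Algorithm 1 is worth measuring. As a rough sketch (the function names inner_steps and inner_steps_formula are illustrative and not part of the lecture code), the number of times the innermost statement executes can be counted and compared with the closed form n(n+1)(n+2)/6 :

```cpp
#include <cassert>
#include <cstdint>

// Count how many times "sum = sum + b[k]" runs for an array of n elements.
// The two outer loops are replayed; for a given (i, j) the innermost loop
// runs k from i to j, i.e. j - i + 1 times.
std::uint64_t inner_steps(int n) {
    assert(n >= 0);
    std::uint64_t steps = 0;
    for (int i = 0; i < n; i++)
        for (int j = i; j < n; j++)
            steps += static_cast<std::uint64_t>(j - i + 1);
    return steps;
}

// Closed form of the same count: n(n+1)(n+2)/6, which grows as n cubed.
std::uint64_t inner_steps_formula(int n) {
    assert(n >= 0);
    std::uint64_t m = static_cast<std::uint64_t>(n);
    return m * (m + 1) * (m + 2) / 6;
}
```

For n = 100 both functions give 171700 additions, so a tenfold increase in input size should cost roughly a thousandfold increase in work.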

Design Issues

A. Convert to a C++ program :

1. Decide on the size and type of the array; choices are 100000 and float.
2. Three functions : max(), max_pos_sum() and a main().
3. main() uses file i/o for input and output; file names are hardcoded; reads input from a file named “random” and writes its output to a file named “output” in append mode.
4. The input file is supposed to have an integer followed by a list of real numbers, as many as the first data in the file.
5. The choices are not unique and a different set with suitable reasoning is quite fine.

// filename is "max-pos-sum-algo1.C"
#include <iostream>
#include <fstream>
using namespace std;

const int SIZE = 100000;

// returns the larger of x and y
float max ( float x, float y)
{
    if ( x <= y ) return y;
    else return x;
}

// largest sum over all contiguous stretches b[i..j]; 0 if no stretch has a positive sum
float max_pos_sum( float b[], int size)
{
    float maxsofar = 0.0;
    for (int i = 0; i < size; i++)
        for ( int j = i; j < size; j++)
        {
            float sum = 0.0;
            for ( int k = i; k <= j; k++)
                sum = sum + b[k];
            maxsofar = max(maxsofar, sum);
        }
    return maxsofar;
}

int main()
{
    float a[SIZE];
    int num;
    ifstream in1 ("random", ios::in);      // input file
    ofstream out1 ("output", ios::app);    // output file, opened in append mode
    in1 >> num;
    if ( num > SIZE ) num = SIZE;          // never read past the end of a[]
    for ( int i = 0; i < num; i++ )
        in1 >> a[i];
    out1 << " max +ve sum in array a[] = " << max_pos_sum(a, num) << endl;
}

• Generate the executable

$ g++ max-pos-sum-algo1.C -o algo1

generates the executable and names it as “algo1”

• Create a file named “random” with the desired structure. For example, a possible content of the file is :

5 -20 10 5 -10 25

which indicates that there are 5 numbers in the file, which are -20 10 5 -10 25.

$ cat random

cat command displays the contents of its arguments on the screen

• Test the correctness of the executable with the generated input file. The first shell command runs the executable algo1; note that it does not need any arguments because the program uses file i/o inside the program for input / output. The second command shows the content of the file “output” in which algo1 has written its output. Debug the program in case the output of the program is not correct.

$ ./algo1
$ cat output

The C++ program development task, say version 1, is complete at this point.
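One way to judge whether the output is correct is to compare against an independent computation. The sketch below is an assumption of this note, not part of the lecture code : it uses a well known linear-time scan (often called Kadane's algorithm) alongside a direct transcription of Algorithm 1 on a vector, so that the two can be checked against each other. The names max_pos_sum_linear and max_pos_sum_cubic are illustrative.

```cpp
#include <cstddef>
#include <vector>

// Linear-time reference. An empty stretch counts as sum 0, which matches
// the initial value maxsofar = 0 used by Algorithm 1.
float max_pos_sum_linear(const std::vector<float>& b) {
    float maxsofar = 0.0f;        // best sum seen anywhere so far
    float maxendinghere = 0.0f;   // best sum of a stretch ending at the current element
    for (float x : b) {
        maxendinghere = maxendinghere + x;
        if (maxendinghere < 0.0f) maxendinghere = 0.0f;
        if (maxendinghere > maxsofar) maxsofar = maxendinghere;
    }
    return maxsofar;
}

// Direct transcription of Algorithm 1, kept only for cross-checking.
float max_pos_sum_cubic(const std::vector<float>& b) {
    float maxsofar = 0.0f;
    for (std::size_t i = 0; i < b.size(); i++)
        for (std::size_t j = i; j < b.size(); j++) {
            float sum = 0.0f;
            for (std::size_t k = i; k <= j; k++) sum = sum + b[k];
            if (sum > maxsofar) maxsofar = sum;
        }
    return maxsofar;
}
```

For the sample input -20 10 5 -10 25 both functions return 30, the sum of the stretch 10 5 -10 25, so the file “output” should report 30 for that input.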


What have we achieved so far ? : An executable program with the following behaviour

           reads              writes to
random ----------> algo1 ------------> output
(file)                                 (file)

B. Generate Input Data

Issues are : i) decide on the size of input data, ii) decide on how to generate the elements in a desired form.

• Size of input data : choice is to let the value be given by the user at run time. For instance, a user may decide to have input of size 100 for one run while using 1000 for another run and so on.

• Distribution of the values : For the problem under discussion we need a mix of positive and negative numbers, as many as decided by the user. Choices made are the following : a) the user will also supply the range, [low, high], in which values are to be generated, b) each number, which must be in the specified range, is randomly generated, c) an output file named “random” is to be generated with the size as its first value, followed by size number of randomly distributed +ve and -ve values.

• Write a C++ program, named “same_random_gen.C”, that implements the tasks specified above. Generate the executable and test its correctness.

// filename "same_random_gen.C"
// The program generates the same sequence of numbers for every invocation,
// because rand() is never seeded with srand()
#include <iostream>
#include <fstream>
#include <stdlib.h>     // required for the definition of rand() library function
                        // rand() generates a random number in [0, RAND_MAX]
using namespace std;

int main()
{
    ofstream out ("random", ios::out);   // output file
    int num;                             // how many numbers
    cin >> num;                          // read from std input
    int low, high, range;                // range in which numbers are needed
    cin >> low >> high;                  // read from std input
    cout << endl;
    range = high - low + 1;              // number of distinct values in [low, high]
    out << num << endl;                  // write the size of generated data to the output file
    // use rand() function to generate in the desired range [low, high];
    // rand() % range lies in [0, range-1], so adding low shifts it to [low, high]
    for (int i = 1; i <= num; i++)
        out << (low + rand() % range) << " ";
    out << endl;                         // end of line in the output file
    return 0;
}

$ g++ same_random_gen.C -o gen_inp

Either test the executable gen_inp by typing its inputs at the keyboard (the program reads them from standard input)

$ ./gen_inp
10 -10 20

Or create a file with name say, “params”, populate it with the required data, such as

10 -10 20

and use the shell's i/p redirection operator <

$ ./gen_inp < params

In either case the generated file is “random”, whose contents can be seen using cat

$ cat random
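The file name same_random_gen.C hints that a generator producing a different sequence on each invocation may also be wanted. A minimal sketch of one way to control this, assuming the C++11 <random> facilities are acceptable (the function name gen_values and the seed parameter are illustrative, not taken from the lecture code) :

```cpp
#include <cassert>
#include <random>
#include <vector>

// Return n integers drawn uniformly from [low, high].
// The same seed always reproduces the same sequence; a different seed
// (for example one obtained from std::random_device) gives a different one.
std::vector<int> gen_values(int n, int low, int high, unsigned seed) {
    assert(n >= 0 && low <= high);
    std::mt19937 engine(seed);                           // seeded pseudo random engine
    std::uniform_int_distribution<int> dist(low, high);  // both ends inclusive
    std::vector<int> values;
    values.reserve(n);
    for (int i = 0; i < n; i++) values.push_back(dist(engine));
    return values;
}
```

Keeping the seed fixed makes repeated timing runs comparable, while varying it exercises the algorithm on different data of the same size.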

• By virtue of the design choices made so far, the two executables, “algo1” and “gen_inp”, are capable of communicating with each other using the file “random”, which is written into by gen_inp and read by algo1.

What have we achieved so far ? :

Two executable programs with the following behaviours :

             reads              writes to
random --------------> algo1 ------------> output
(file)                                     (file)
n+1 values                                 1 value

                reads                writes to
standard i/p -------------> gen_inp ------------> random
(device file)                                     (file)
3 values                                          n+1 values

where n is the input size (number of elements in the array).

C. Measure Run-time of an executable program

We choose the command time, which is capable of monitoring an executable program while it is under execution and reports a few time related items of information about the monitored program.

A quick recap about the program time. This is used as follows :

$ time ls -R . >lsout 2> lserr

• The command being monitored for execution is the shell command “ls”, which lists directory contents.

• ls -R : the switch -R is the first argument of the “ls” command; it denotes that the listing is to be done recursively for all subdirectories.

• ls -R . : the symbol . denotes the current directory and is the second argument to the ls command. The command “ls -R .” asks for the contents of the current directory and all its subdirectories to be displayed.

Example : If the current directory on my system is
“prof-biswas@prof-Biswas:~/Desktop/cs213m/lec4”
and the following command is executed

$ ls -R .

the following display appears on the screen. The display is a faithful representation of the directory structure on my laptop at the time of writing this document. You should be able to figure out the directory structure from the display.

.:
algo1 a.out lec4_outline.odt lserr lsout parse_functions.h perf_eval1 perf_eval_old progs same_inp stats test_sh

./progs:
diff_random_gen.C max-pos-sum-algo1.C process_output.C same_random_gen.C test.C test_rand.C

• Now back to the time command shown earlier

$ time ls -R . >lsout 2> lserr

The argument >lsout denotes that output is being redirected to the file named “lsout”, and 2> lserr denotes that error is being redirected to the file “lserr”. Since there are two executable commands, “time” and “ls -R .”, the question is whose i/o is being redirected by the above command. The result of executing the above command on my system gives the following display on screen:

real 0m0.004s
user 0m0.000s
sys  0m0.000s

while the o/p of “ls -R .” has been placed in file “lsout” and the file “lserr” is empty, easily verified by applying the cat command to them.

The output of the time command appears on the screen. It should be noted that the output produced by the executable being monitored goes to the output channel (redirected here by >), while time displays its own o/p through the error channel; the redirections in the command above apply only to “ls -R .”.

The following experiments can be used for validating this fact.

$ (time ls -R . )>out1 2>out2

The execution of the above command produces no display on screen; using cat on files “out1” and “out2” reveals the fact that the output of “ls -R .” is in file “out1” while the o/p of “time” is present in “out2”.

The parentheses around “time ls -R .” are required to redirect the o/p and error of the “time” command. In case the monitored command, “ls -R .” in this case, also needs to be redirected, do the needful within the parentheses, as shown below :

$ (time ls -R . 2>lserr )>out1 2>out2

• Understanding the output of the time command is the last business to be studied under this section. The figure given as real time is the elapsed (wall clock) time taken by the monitored program for its complete execution. A process (a program under execution) in the Linux OS runs under two distinct modes : user mode and system mode. In the user mode, the user written code is executed; while under system mode, the OS functions demanded by the user process are accounted for. For this course, we shall use the real time as our desired time measure. As an information item, you may note that contrary to intuition, in the time measures, real is not equal to user + sys; this is because user and sys count only the processor time charged to the process, while real also includes the time the process spends waiting, for instance for i/o or while other processes are using the processor.

real 0m0.004s
user 0m0.000s
sys  0m0.000s
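The difference between elapsed time and processor time can also be observed from inside a C++ program. The following is only a sketch under stated assumptions : std::chrono::steady_clock is taken as the counterpart of real time, and std::clock(), which on Linux reports the processor time used by the process, as the rough counterpart of user + sys. The names Timing and measure are illustrative.

```cpp
#include <chrono>
#include <ctime>
#include <functional>

struct Timing {
    double wall_seconds;  // elapsed wall clock time, comparable to "real"
    double cpu_seconds;   // processor time used by this process, roughly user + sys
};

// Run the given piece of work once and report both kinds of time.
Timing measure(const std::function<void()>& work) {
    auto wall_start = std::chrono::steady_clock::now();
    std::clock_t cpu_start = std::clock();
    work();
    std::clock_t cpu_end = std::clock();
    auto wall_end = std::chrono::steady_clock::now();
    Timing t;
    t.wall_seconds = std::chrono::duration<double>(wall_end - wall_start).count();
    t.cpu_seconds = static_cast<double>(cpu_end - cpu_start) / CLOCKS_PER_SEC;
    return t;
}
```

A piece of work that mostly waits (for a file or for the keyboard) would show a wall clock figure well above its processor time, which is the same effect seen in the output of time.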

What have we achieved so far ? :

Generated two executable programs, gen_inp and algo1, and learned the use of the time command to measure the run-time of algo1 with the interactions as depicted below. The processes are shown in rectangles, the files in circles and the 3 channels, inp, o/p and err are labeled on the arrows.

[Figure : gen_inp reads stdin (inp) and writes the file random; algo1, monitored by time, reads random and writes the file output; time's o/p is written on the err channel.]

D. Extract the Relevant Run Time Data and Calculate the Mean

The tasks are explained with the help of the following fictitious data in a file, say timeop, produced by monitoring some executable under time for 5 different runs. It is also assumed that an input of size 100 was used for all the runs.

5

Run No 1 :
real 0m0.442s
user 0m0.440s
sys  0m0.000s

Run No 2 :
real 0m0.438s
user 0m0.432s
sys  0m0.004s

Run No 3 :
real 0m0.444s
user 0m0.440s
sys  0m0.000s

Run No 4 :
real 0m0.439s
user 0m0.436s
sys  0m0.000s

Run No 5 :
real 0m0.438s
user 0m0.436s
sys  0m0.000s

Required Tasks :

1. Read the input file line by line. Requires string processing features of C++. Read the first number and save it, as this gives the number of runs that follow.

2. Pull out all lines starting with “real” and save the time string. Ignore any line that does not begin with “real”.

3. For a time string, say s, extract and separate its 3 fields (note that hours are not present in the data shown but this is permitted).

For instance, given the string “0m0.442s”, create an object of the form :

hr   min   sec
0    0     0.442

4. After processing the next string “0m0.438s” of real time, we may extract the relevant fields and add it to the first run to get the cumulative sum of both the runs.

hr   min   sec
0    0     0.88

After extracting the data for the 5th run, the cumulative sum of real time for all the runs is

hr   min   sec
0    0     2.201

5. The average run time is calculated as cumulative_sum / no_of_runs, which for our example run time data is

hr   min   sec
0    0     0.4402
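The arithmetic of tasks 4 and 5 can be sketched as follows. This is one possible arrangement, assumed for illustration only : every time value is converted to seconds, the values are summed and averaged, and the result is converted back, so that carries from seconds to minutes and from minutes to hours are handled in one place. The names HMS, to_seconds, from_seconds and mean_time are hypothetical.

```cpp
#include <cassert>
#include <vector>

struct HMS { int hr; int min; double secs; };   // one time value, as in the tables above

// Convert to a plain number of seconds so that values can be added directly.
double to_seconds(const HMS& t) {
    return t.hr * 3600.0 + t.min * 60.0 + t.secs;
}

// Convert a non-negative number of seconds back to hours, minutes and seconds.
HMS from_seconds(double total) {
    assert(total >= 0.0);
    HMS t;
    t.hr = static_cast<int>(total / 3600.0);
    total -= t.hr * 3600.0;
    t.min = static_cast<int>(total / 60.0);
    t.secs = total - t.min * 60.0;
    return t;
}

// Mean of all runs: cumulative sum divided by the number of runs.
HMS mean_time(const std::vector<HMS>& runs) {
    assert(!runs.empty());
    double total = 0.0;
    for (const HMS& r : runs) total += to_seconds(r);
    return from_seconds(total / runs.size());
}
```

With the five real times of the example (0.442, 0.438, 0.444, 0.439 and 0.438 seconds) the cumulative sum is 2.201 seconds and mean_time gives 0 hr, 0 min and approximately 0.4402 sec, in agreement with the tables.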


The programming problem of this section can now be stated. Given a file that stores the data generated by the time command, a) extract the real time data present in a line, b) separate the components of real time into hours, minutes and seconds, c) add two or more real time data, and finally d) compute the average of n data of type real time.

Implementation Decisions :

1. Real time data is treated as an object.

class hrminsecs {
public :
    short hr;
    short min;
    float secs;
    hrminsecs() { hr=0; min=0; secs =0.0;}
};

2. Parse the input line by line as a string and examine the lines beginning with “real”. Read a line of input as a sequence of 2 strings, say str and time, and check for str being “real”; sample code is given below.

string str, time;
while ( cin >> str >> time)
{
    if (str == "real" )
    {
        // write code for parsing the string "time" and extract the 3 components
    }
    // else skip the lines for user and sys times as we are not interested in them
}

3. Parse a string of the form “xxhyymzz.uus”. The digits preceding 'h' specify the hours elapsed, the digits between 'h' and 'm' denote the elapsed minutes and the real number between 'm' and 's' denotes the seconds elapsed.

• The choice of the short data type for hr and min and float for secs in the class hrminsecs should be obvious.

• The key processing to be done involves the following : find the position of occurrence of 'h' in the string time (if present) and assume it occurs at position k; divide the time into parts, time[0] to time[k-1] being the first part and the second part being time[k+1] .. time[time.length()-1]. The first part, if non-empty, gives the hours in the form of a string.

• Process the second part along similar lines by locating the position of 'm' and 's' successively to extract the minutes part and the seconds part. Both minutes and seconds would be in the form of a string.

• We require the hours and minutes to be in short and the seconds to be in float. We write two functions, s2i(string str) that given a string str returns the equivalent short integer and another function, s2f(string str) that given a string str returns the equivalent float value.

• The function parse() given below uses the design ideas mentioned above to do the needful. The comments inserted inline explain the purpose of the various code fragments.

hrminsecs* parse ( string s)
{
    // string is of the form ddhddmdd.ddds ; locate the indices of h/m/s
    string strh = "", strm = "", strs = "";
    short hr = 0, min = 0;
    float sec = 0.0;
    // search for chars 'h' 'm' and 's'
    int indexh = -1, indexm = -1, indexs = -1;   // assume that none are present in "s"

    indexh = s.find('h');                        // use find() member function to locate 'h'
    if ( indexh > 0 && indexh < s.length())
    {
        strh = s.substr(0, indexh);                         // extracts hours as a string
        s = s.substr(indexh+1, s.length() - indexh - 1);    // remaining string to be processed
    }

    indexm = s.find('m');                        // use find() member function to locate 'm'
    if ( indexm > 0 && indexm < s.length())
    {
        strm = s.substr(0, indexm);                         // extracts minutes as a string
        s = s.substr(indexm+1, s.length() - indexm - 1);    // string to be processed
    }

    indexs = s.find('s');                        // use find() member function to locate 's'
    if (indexs == s.length()-1)
        strs = s.substr(0, s.length()-1);        // extract seconds as string

    hr = s2i(strh);                              // convert hours to short
    min = s2i(strm);                             // convert minutes to short
    sec = s2f(strs);                             // convert seconds to float

    hrminsecs* p = new hrminsecs;                // create a new hrminsecs object on the heap
    p->hr = hr;                                  // assign the extracted time data to the
    p->min = min;                                // respective members of this object
    p->secs = sec;
    return p;                                    // return the newly constructed object
}

The supporting functions for converting from string to short / float are very basic algorithms and are given below without explanation.

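int s2i (string s);    // defined below; declared here because s2f() uses it

float s2f(string s)
{
    string left, right;
    int dot = s.find('.');                 // -1 when find() reports that '.' is absent
    if ( dot < 0 || dot >= s.length() )
    {
        cerr << " dot not in string \n";
        return -1.0;
    }
    left = s.substr(0, dot);                          // digits before the dot
    right = s.substr(dot+1, s.length() - dot - 1);    // digits after the dot
    int whole, fract;
    whole = s2i(left);
    fract = s2i(right);
    float val = fract;
    for ( int i = 0; i < right.length(); i++)         // shift the fraction right, one digit at a time
        val = val * 0.1;
    val = val + whole;
    return val;
}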

int s2i (string s)
{
    int num = 0, digit = 0;
    for ( int j = 0 ; j < s.size(); j++)
    {
        switch (s[j])
        {
            case ('0') : digit = 0; break;
            case ('1') : digit = 1; break;
            case ('2') : digit = 2; break;
            case ('3') : digit = 3; break;
            case ('4') : digit = 4; break;
            case ('5') : digit = 5; break;
            case ('6') : digit = 6; break;
            case ('7') : digit = 7; break;
            case ('8') : digit = 8; break;
            case ('9') : digit = 9; break;
        } // end of switch
        num = num*10 + digit;
    } // end of for
    return num;
}
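The same parsing can also be written with the conversion functions of the standard library. The sketch below is an alternative offered only for comparison; it assumes C++11 (std::stoi and std::stod) and uses the hypothetical names ParsedTime and parse_time, which do not appear in the lecture code.

```cpp
#include <cstddef>
#include <string>

struct ParsedTime { int hr = 0; int min = 0; double secs = 0.0; };

// Parse strings such as "0m0.442s" or "1h2m3.5s"; a field that is absent stays 0.
// std::stoi / std::stod throw an exception if a field holds no digits at all.
ParsedTime parse_time(std::string s) {
    ParsedTime t;
    std::size_t pos = s.find('h');
    if (pos != std::string::npos) {
        if (pos > 0) t.hr = std::stoi(s.substr(0, pos));   // digits before 'h'
        s = s.substr(pos + 1);                             // keep what follows 'h'
    }
    pos = s.find('m');
    if (pos != std::string::npos) {
        if (pos > 0) t.min = std::stoi(s.substr(0, pos));  // digits before 'm'
        s = s.substr(pos + 1);                             // keep what follows 'm'
    }
    pos = s.find('s');
    if (pos != std::string::npos && pos > 0)
        t.secs = std::stod(s.substr(0, pos));              // real number before 's'
    return t;
}
```

The hand written s2i() and s2f() show how the conversion works digit by digit; the library calls are shorter and report malformed input by throwing an exception instead of returning a wrong value silently.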

The program given in the file, process_output.C, is a program that reads a file which is the output generated by multiple runs of the time command and produces the mean real time over a given number of runs.

$ g++ process_output.C -o stats
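The reading step of such a program can be sketched as below. This is an assumption about one reasonable structure and is not the content of process_output.C : each line is read whole with getline() and then split, so that lines with an odd number of words (such as the leading run count) cannot shift the pairing of label and value. The function name collect_real_times is illustrative.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Scan the input line by line and keep the time field of every line
// whose first word is "real"; all other lines are ignored.
std::vector<std::string> collect_real_times(std::istream& in) {
    std::vector<std::string> times;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);   // split the line into words
        std::string label, value;
        if (fields >> label >> value && label == "real")
            times.push_back(value);
    }
    return times;
}
```

The function accepts any input stream, so it can be tried on std::cin, on an ifstream opened on the time output file, or on a string stream while testing.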

What have we achieved so far ? :

Generated three executable programs, gen_inp, algo1 and stats, and used the time command to monitor the executable algo1. The processes interact with each other using files as depicted below. The processes are shown in rectangles, the files in circles and the 3 channels, inp, o/p and err are labeled on the arrows.

[Figure : gen_inp reads stdin (inp) and writes the file random; algo1, monitored by time, reads random and writes the file output; time's o/p is written on the err channel and is read (inp) by stats, which produces the mean real time.]

E. Stitch the design solutions of Parts A to D above to create a single task that produces the desired result.

A shell program is a collection of shell commands which are interpreted in order. Note that the programming language of the shell : i) has “echo”, which is like cout, ii) has an assignment statement, iii) supports several control constructs : if then else, for, while, case etc. Read the notes on shell and shell programming posted on the course page.

For the programming problem under development, the main tasks are :

• Create a shell program say, perf_eval, which we wish to execute with 6 different arguments as shown below, where

arg1 is the executable that is the input generator program
arg2 is the executable program that implements algorithm 1
arg3 is a number that gives the number of runs for the same input for calculating the mean
arg4 is the input size to start the experimentation
arg5 is a number that is the scaling factor for increasing input size (it is added to the input size at each step)
arg6 is the maximum input size for measurement

$ ./perf_eval arg1 arg2 arg3 arg4 arg5 arg6

The program structure is quite flexible and automates a large part of the experimentation without any manual intervention.

• The file perf_eval is made executable by using the chmod command

$ chmod +x perf_eval

• Run the shell program with appropriate arguments

$ ./perf_eval ./gen_inp ./algo1 5 100 25 200

This will cause the shell program, perf_eval, to be executed as follows :

1. Run gen_inp to generate a list of 100 numbers (arg4) in the given range.
2. Run the time command to monitor algo1 for the input of step 1 and collect the time data.
3. Repeat steps 1 and 2 for 5 times and collect the time output for each run; calculate the mean real time for 5 runs for the same input of size 100. This is to account for varying system load. Write the input size and the mean real time recorded in a separate file, statfile.
4. Run gen_inp to generate a list of 125 numbers (arg4 + arg5) in the given range.
5. Run the time command to monitor algo1 for the input of step 4 and collect the time data.
6. Repeat steps 4 and 5 for 5 times and collect the time output for each run; calculate the mean real time for 5 runs for the same input of size 125. Write the input size and the mean real time recorded in statfile in append mode.

Repeat steps 1, 2 and 3 for input sizes of 150, 175 and 200. The file, statfile, contains the desired results of the experimentation.
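The sequence of input sizes visited by the steps above follows directly from arg4, arg5 and arg6. As a small sketch (the function name input_sizes is illustrative; the shell program itself computes the sizes with expr) :

```cpp
#include <cassert>
#include <vector>

// Input sizes used by the experiment: start, start + scale, start + 2*scale, ...
// for as long as the size does not exceed last.
std::vector<int> input_sizes(int start, int scale, int last) {
    assert(scale > 0);   // a non-positive step would never reach last
    std::vector<int> sizes;
    for (int n = start; n <= last; n += scale)
        sizes.push_back(n);
    return sizes;
}
```

For the arguments 100 25 200 this gives 100, 125, 150, 175 and 200, so algo1 is timed 5 x 5 = 25 times in all.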


Sample Run of perf_eval :

When the shell program was executed with the following arguments

./perf_eval ./gen_inp ./algo1 10 100 200 1000

the contents of statfile are shown below (first column is input size):

Summary of Real Time over 10 executions
100 Mean Real Time 0h0m0.007s
300 Mean Real Time 0h0m0.0278s
500 Mean Real Time 0h0m0.0881s
700 Mean Real Time 0h0m0.2143s
900 Mean Real Time 0h0m0.4418s
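Since Algorithm 1 has three nested loops, the mean times would be expected to grow roughly as the cube of the input size. One hedged way to check this against the table is to estimate the exponent p in time ≈ c · n^p from two rows; the function name growth_exponent is illustrative and the calculation is a sketch, not part of the lecture programs.

```cpp
#include <cassert>
#include <cmath>

// Estimate p in t = c * n^p from two measurements (n1, t1) and (n2, t2):
// t2 / t1 = (n2 / n1)^p, hence p = log(t2 / t1) / log(n2 / n1).
double growth_exponent(double n1, double t1, double n2, double t2) {
    assert(n1 > 0 && n2 > 0 && n1 != n2 && t1 > 0 && t2 > 0);
    return std::log(t2 / t1) / std::log(n2 / n1);
}
```

Using the rows for sizes 500 and 900 (0.0881 and 0.4418 seconds) the estimate comes out at roughly 2.7, which is close to the expected value of 3; the rows for the smallest sizes are less reliable because the measured times are only a few milliseconds.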

• Last piece : How to write the shell program that behaves as intended ?

a) Initialization : Introduce a few shell variables to save the arguments supplied through the command line; the shell uses special variables called $variables for this purpose.

ipgen=$1       # comment : $1 is argument 1, save the string in user variable ipgen
exec=$2        # $2 is argument 2, saved in variable exec
runs=$3
start=$4
scale=$5
last=$6        # $6 is the last argument, saved in variable last
low=-32768     # variable for the min value in range
high=32767     # variable for the max value

Display shell variables on the screen for cross check :

echo " Identical Input Generator Program: $ipgen "
echo " executable program : $exec"
echo " number of runs to compute average time : $runs"
echo " Input size to start : $start"
echo " next input size is size + factor : give factor : $scale"
echo " max input size : $last"

b) Basic processing step

• Create a file params with 3 numbers : start size of input; range specified by [low, high]

echo "$start $low $high " > params    # create a file with data that ipgen will read

• Run the input generator to create as many random numbers as specified in the params file

./$ipgen < params                     # run ipgen with input redirected

• Monitor the executable program, specified by $exec; the timing figures are redirected to the file timeoutput in append mode

(time ./$exec ) 2>> timeoutput        # run time command to monitor exec

c) Repeat the basic processing of step (b) for the same input but for $runs number of times. This is accomplished by a while command of the shell and the test command. The test command returns TRUE if the expression, $iter ≤ $runs, holds and FALSE otherwise. Two files, timeoutput and output, are created to save the results of time and algo1 respectively. The command iter=`expr $iter + 1` is an assignment whose rhs is the expr command. The backquotes surrounding expr $iter + 1 denote that the expr command is to be executed (which returns the value of its argument) and the returned value is assigned to the lhs variable.

iter=1                                    # a shell variable to count iterations
while test $iter -le $runs                # while loop
do
    echo "$start $low $high " > params    # create a file with data that ipgen will read from
    ./$ipgen < params                     # run ipgen with input redirected from params file
    echo " Run No $iter :" >> output      # append the run number in output file
    echo " Run No $iter :" >> timeoutput  # append run number to timeoutput
    (time ./$exec ) 2>> timeoutput        # run time command to monitor exec and append
                                          # error channel to timeoutput file
    iter=`expr $iter + 1`                 # iter++ in shell
done

d) Step (c) completes the task for one input of a given size. To get the data for varying input sizes, we need to repeat step (c) for different sizes, starting with $start and increasing it by $scale till we reach $last. An outer while-loop accomplishes this task.

echo " Summary of Real Time over $runs executions " > statfile
while test $start -le $last               # outer while loop
do
    # an iteration of the outer loop for a given size of input
    echo "$runs" > timeoutput             # initialize the first data in this file
    iter=1
    while test $iter -le $runs            # inner while loop
    do
        ................
    done
    # run a program to find the mean time and save size and time
    cat timeoutput >> alltime             # a file to collect time output of all runs, for debugging
    cat output >> allout                  # a file to collect output of exec of all runs, for debugging
    echo -n "$start" >> statfile          # append start input size to statfile
    ./stats timeoutput                    # run stats with timeoutput as argument
    start=`expr $start + $scale`          # start += scale
done

END OF DOCUMENT

Written by : Supratim Biswas, January 18, 2015
