1 awk awk is a file-processing programming language. makes it easy to perform text manipulation...


Upload: donna-fletcher

Post on 02-Jan-2016


awk

• awk is a file-processing programming language.

• Makes it easy to perform text manipulation tasks.

• Is used in

– Generating reports

– Matching patterns

– Validating data

– Filtering data for transmission

• An awk program is a sequence of statements of the form

– Pattern {action}

– awk scans the input lines in order, one at a time.

– It searches each line for the pattern and, if the pattern is found, performs the corresponding action.

– Every statement of the awk program is tried against every line of input.
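The pattern {action} form described above can be tried with a small sketch. The sample data (names and scores) is made up for illustration; the first rule uses a regular expression as its pattern and the second uses a comparison.

```shell
# Hypothetical sample data: one "name score" pair per line.
out=$(printf 'Smith 80\nJones 97\nChan 95\n' |
  awk '
    /Jones/ { print "matched:", $0 }      # regular expression pattern
    $2 > 90 { print "high score:", $1 }   # expression pattern
  ')
printf '%s\n' "$out"
```

Both rules are tested against every line, so a line that satisfies both patterns produces two lines of output.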



awk programming model

• An awk program consists of a main input loop (you don't write the loop, but the main program works as one).

• The main routine reads one line of input from a file and makes it available for processing. The main loop executes as many times as there are lines in the input.

• Preprocessing before the main loop and post processing after the loop are done with BEGIN and END.

• The routine is applied to each input line, one line at a time.
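A minimal sketch of the programming model, assuming three lines of made-up input: BEGIN runs before the implicit loop, the unnamed rule runs once per line, and END runs after the last line.

```shell
out=$(printf 'alpha\nbeta\ngamma\n' |
  awk '
    BEGIN { print "start" }           # runs once, before any input is read
          { n++; print NR ": " $0 }   # runs once for each input line
    END   { print n " lines read" }   # runs once, after the last line
  ')
printf '%s\n' "$out"
```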


awk

• Two ways to present the program to awk.

– Make the program the first argument on the command line, if the program is short.

– awk 'program' [filename ...]

– Examples:

% awk '/Smith/ {print}' people

% awk '/Smith/ {print}' -

– Put the program in a separate file and tell awk to use the program file on the input files.

– Example: awk -f awkprog file1 file2

• Keywords and some important functions

– BEGIN, END, FILENAME, FS, NF, NR, OFS, ORS, OFMT, RS

– break, close, continue, exit, exp, for, getline, if, in, index, int, length

– log, next, number, print, printf, split, sprintf, sqrt, string, substr, while

• Operators

– Assignment, compound assignment, arithmetic, relational, logical and regular expression matching operators.


Some Regular Expression Metacharacters

• \ - escapes any meta character that follows, including itself.

• ^ - anchors the following regular expression to the beginning of string.

• $ - anchors the preceding regular expression to the end of string.

• . (dot) Matches any character including newline

• […] – matches any one of the class characters enclosed between the brackets.

• [^] – A circumflex as first character inside [] reverses the match to all characters except those listed in the [].

• r1 | r2: between two regular expressions r1 and r2, it allows either of the regular expressions to be matched.

• r* - Matches any number (including zero) of the regular expression that precedes it.

• r+ - Matches one or more occurrences of the regular expression that precedes it.

• r? - Matches 0 or 1 occurrences of the regular expression that precedes it.

• () – groups regular expressions

• {n,m} – Matches a range of occurrences of the regular expression that precedes it: any number of occurrences between n and m. awk uses extended regular expressions, where the braces are not escaped; the \{n,m\} spelling belongs to the basic regular expressions of grep and sed. May not be available in very old versions.


Writing Regular Expressions

• Writing regular expressions involves three steps:

– Specification: Knowing what you want to match.

– Coding: Writing an expression to describe what you want to match

– Testing: Testing the pattern to see what it matches.

– Testing your regular expression may result in,

• Hits: Lines you wanted to match

• Misses: Lines you did not want to match

• Omissions: Lines you wanted to match but did not.

• False Alarms: The lines you matched but did not want to match.

– Eliminate false alarms by limiting the matches and capture the omissions by expanding the possible matches.


Some Examples

What do they match?

• [a-zA-Z?+!] -

• [a-zA-Z][?+!] -

• [-+*/] -

• AB\{2,4\}C -

• UNIX|LINUX -

• Compan(y|ies) -

• [0-9][0-9]*\.\{2,\}[a-z][a-z]* -


Multiline Records

• FS – default value is a single space, which splits fields on any run of blanks or tabs. FS can be set to a single character. When more than one character is given it is interpreted as a regular expression.

• RS – default value is a newline. Default value can be changed.

• Example:

BEGIN {RS = "" ; FS = "\n"} # Record separator is a blank line
{ print "Name", $1
  print "Zip", $NF
}

Input file:

John Smith

235 Alameda

Santa Clara

CA

95053

Output:

Name John Smith

Zip 95053


Examples

cat prog1.awk

# test for integer, string or a blank line.

/[0-9]+/ {print $0 ": An integer"}

/[A-Za-z]+/ { print $0 ": A String"}

/^$/ {print "A Blank line"}

# + metacharacter – one or more

cat testfile

1234

This is a test

789 Hello

% awk -f prog1.awk testfile

1234: An integer

This is a test: A String

789 Hello: An integer

789 Hello: A String

A Blank line

A Blank line


Examples

% cat prog2.awk
BEGIN {FS = ","} # Comma is the field separator
{ print $1
  print $2
  print $3
}

% cat prog3.awk
BEGIN {FS = ","}
/CA/ {print $1 "," $3} # matches CA anywhere in the line, in any field
$3 ~ /CA/ {print $1 "," $3} # field match

% cat testfile2

John Smith, Santa Clara, CA

Mary Jones, Red Bank, NJ

Susan Wang, Denver, CO

% awk -f prog2.awk testfile2

What is the output?

• More than one character can be specified as a field separator; it is then interpreted as a regular expression.

• Examples:

FS = "\t+"

How many fields are in the following line?

IJK\t\tXYZ

FS = "[':,\t]"


Examples

$cat prog4.awk

BEGIN {printf("Scores\n"); }

{ print $0; total = total + $2}

#NR – number of input records that are read

END {print "Average score is", total / NR }

$cat scores

Smith 80

Jones 97

Chan 95

King 78

$ awk -f prog4.awk scores

Scores

Smith 80

Jones 97

Chan 95

King 78

Average score is 87.5


Passing Parameters into awk script

• Parameters can be passed from the command line into an awk script. One or more variables are set on the command line and can be accessed from the awk script.

• Parameters that are passed in are not available in BEGIN. awk performs each assignment when it reaches that argument, just before it reads the input file that follows, so the values are available once the first line of input is read.

• Example – param.awk

BEGIN {print "Passing Parameters"}
{print "arg1 = ", arg1
 print "arg2 = ", arg2
}

From the command line, invoke

awk -f param.awk arg1=100 arg2=200 datafile

A shell script's command line arguments can be passed in as follows: Assume that the following line is in a shell script called awktest.sh

awk -f param.awk arg1="$1" arg2="$2" datafile

$1 and $2 are the positional parameters given as arguments on the command line when awktest.sh is invoked as

awktest.sh 100 200
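When a value is needed inside BEGIN, the standard -v option assigns the variable before the program starts. The sketch below contrasts the two mechanisms; the variable names early and late are made up for illustration, and the single line of dummy input only serves to make the main rule run once.

```shell
out=$(printf 'x\n' |
  awk -v early=1 '
    BEGIN { print "BEGIN: early=" early ", late=" late }   # late is still unset here
          { print "main: early=" early ", late=" late }    # both are set by now
  ' late=2 -)
printf '%s\n' "$out"
```

The trailing - names standard input as the input file, so the assignment late=2 is processed just before the first line is read.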


Patterns Using Regular Expressions

# print lines ending with ia
awk '/ia$/ {print}' countries

# print countries ending with ia
awk '$1 ~ /ia$/ {print $1}' countries

# select lines where the third field matches Asia or begins with North or South
# (no spaces around |, because a space inside a regular expression is matched literally)
$3 ~ /Asia|^North|^South/ {print}

# Pattern ranges
/Russia/,/Brazil/ {print}

# Replace USA by United States
/USA/ {$1 = "United States"; print}

%cat countries

Australia 3000 Australia

USA 3615 North America

Argentina 1072 South America

India 1270 Asia

Russia 8650 Asia

China 3692 Asia

Brazil 3286 South America
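As a sketch of how a range pattern and a field match behave on the countries data above, the following feeds the same seven lines to awk. The labels "range:" and "americas:" are added only to show which rule fired.

```shell
out=$(printf '%s\n' \
    'Australia 3000 Australia' \
    'USA 3615 North America' \
    'Argentina 1072 South America' \
    'India 1270 Asia' \
    'Russia 8650 Asia' \
    'China 3692 Asia' \
    'Brazil 3286 South America' |
  awk '
    /Russia/,/Brazil/      { print "range:", $1 }      # from the Russia line through the Brazil line
    $3 ~ /^(North|South)$/ { print "americas:", $1 }   # third field is exactly North or South
  ')
printf '%s\n' "$out"
```

With the default field separator, $3 of "USA 3615 North America" is just "North", which is why the field match tests for North or South rather than the full continent name.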


Associative Arrays

• Arrays in awk are associative arrays where the index can be a number or a string.

• The order in which the items are retrieved by a for (item in array) loop is unspecified and may look random.

%cat prog6.awk

{ x [$1] = $2 }

END {

for (item in x)

print item,x[item]

}

% awk -f prog6.awk scores

Jones 89

Smith 65

Chen 100

King 120

Lowel 200
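Because the traversal order is unspecified, a common approach is to pipe the output through sort. The sketch below counts how often each word occurs in some made-up input and prints the counts in sorted order.

```shell
out=$(printf 'Asia\nEurope\nAsia\nAfrica\nAsia\nEurope\n' |
  awk '
    { count[$1]++ }                          # string subscript; the entry is created on first use
    END {
      for (name in count)                    # traversal order is unspecified
        print name, count[name] | "sort"     # so let sort impose an order
      close("sort")                          # wait for sort to finish writing
    }
  ')
printf '%s\n' "$out"
```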


Example: Computing Grades

% cat prog7.awk

BEGIN { OFS = "\t" }
{
  # main loop applied to all input lines
  total = 0
  for (I = 2; I <= NF; ++I)
    total += $I
  average = total / (NF - 1)

  # store each student average
  stAvg[NR] = average
  avgByName[$1] = average

  # determine the letter grade
  if (average >= 90) grade = "A"
  else if (average >= 80) grade = "B"
  else if (average >= 70) grade = "C"
  else grade = "F"

  # store a count of the letter grades
  ++classGrade[grade]

  # print the per-student line that appears in the sample output
  print $1, average, grade
}


# class statistics
END {
  # calculate class average
  for (x = 1; x <= NR; x++)
    classTotal += stAvg[x]
  classAve = classTotal / NR
  print "Class Average = " classAve

  # determine how many above or below average
  # print number of students per letter grade
  print "Enter name "
  getline name < "-"
  print name ": " avgByName[name]
  for (letterGrade in classGrade)
    print letterGrade ":" classGrade[letterGrade] | "sort"
}


%cat grades

Smith 90 80 50

Jones 20 0 70

Wang 67 90 80

Wolf 70 100 90

Pratt 90 88 92

% awk -f prog7.awk grades

Smith 73.3333 C

Jones 30 F

Wang 79 C

Wolf 86.6667 B

Pratt 90 A

Class Average = 71.8

Enter name

Smith

Smith: 73.3333

A:1

B:1

C:2

F:1


Multidimensional arrays

# awk offers a syntax for subscripts that simulates a reference to multidimensional arrays
{
  for (i = 1; i <= NF; ++i)
    table[NR,i] = $i
}
END {
  for (k = 1; k <= NR; ++k) {
    for (i = 1; i <= 4; ++i) {
      total += table[k,i]
      printf("%d ", table[k,i])
    }
    printf("\n")
  }
  print "Total = " total
}
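The subscript table[k,i] is only simulated: awk joins the two values into a single string key, separated by the built-in variable SUBSEP. The sketch below, on a made-up 2 by 2 table, shows a membership test on a pair of subscripts and then splits a key back into its parts.

```shell
out=$(printf '1 2\n3 4\n' |
  awk '
    { for (i = 1; i <= NF; ++i) table[NR, i] = $i }
    END {
      if ((2, 1) in table)                   # membership test on a pair of subscripts
        print "table[2,1] = " table[2, 1]
      if (!((3, 1) in table))                # testing with "in" does not create the element
        print "no row 3"
      n = split(2 SUBSEP 1, part, SUBSEP)    # the key is really one string
      print n " parts: " part[1] " and " part[2]
    }
  ')
printf '%s\n' "$out"
```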


next and getline

• next causes the next input line to be read.

• The next statement passes control back to the top of the script.

% cat prog9.awk

NF == 2 {next} # skips to the next record and starts the program from the beginning

/USA/ {$4 = "United States Of America"; print $0}

{print NR }

% cat countries

Japan Asia

2: UK Europe

3: Brazil S.America

Egypt Africa

5: USA N.America

Canada N.America

% awk -f prog9.awk countries

2

3

5: USA N.America United States Of America

5


Using getline

# Using getline function to read the next line of input
/^\/+/ { getline
  print $1
}

# get input from the user (standard input)
BEGIN {
  printf "Enter your name: "
  getline name < "-"
  print name
}

/Smith/ { getline
  print $1
}


#Reading from a pipe using a getline

{while (("who" | getline) > 0) # > 0 stops the loop at end of output and on errors

terminal[$1] = $2

}

END{

for (item in terminal)

print item, terminal[item]

}


Example - A word lookup

# reads a file with acronyms and their expansions,

#handles users queries

BEGIN { FS = "\t"; OFS = "\t"

printf("Enter a word for lookup: ");

}

#Load the file named acronyms

FILENAME == "acronyms" {

wordList[$1] = $2

next

}


Example - A word lookup (cont)

# scan for command to exit program
$0 ~ /^(quit|[qQ]|[Xx]|exit)$/ { exit }

# process any non-empty line
$0 != "" {
  if ($0 in wordList) {
    print wordList[$0]
  }
  else print $0 " not found"
}

# Prompt user to enter another word
{ printf("Enter another word or q|Q to quit: ");
}

# File arguments on the command line: acronyms -
# (the acronyms file is loaded first, then queries are read from standard input)


split ()

• split() is a built-in function that can parse any string into elements of an array.

• Syntax:

• number of elements = split(string, array, separator). If no separator is specified, FS is used as the field separator.

{
  n = split($0, days)
  for (j = 1; j <= n; ++j)
    print days[j]
}
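A short sketch of split() with an explicit separator, using a made-up date string. The third argument overrides FS for this one call.

```shell
out=$(printf '02-Jan-2016\n' |
  awk '
    {
      n = split($0, part, "-")    # explicit separator instead of FS
      print n " pieces"
      for (j = 1; j <= n; ++j)
        print j ": " part[j]
    }
  ')
printf '%s\n' "$out"
```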


next

• The next statement forces awk to immediately stop processing the current record and go on to the next record. The rest of the current rule's action is not executed either.

• If you think of the main body in awk as a loop, the next statement is analogous to a continue statement: it skips to the end of the body of this implicit loop, and executes the increment (which reads another record).

• Note: getline function causes awk to read the next record immediately, but it does not alter the flow of control in any way. So the rest of the current action executes with a new input record.

• For example, if your awk program works only on records with four fields, and you don't want it to fail when given bad input, you might use this rule near the beginning of the program:
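One way such a guard rule could look is sketched here; the input lines and the skipped counter are made up for illustration. Records without exactly four fields are counted and abandoned with next, so the second rule only ever sees well-formed records.

```shell
out=$(printf 'a b c d\nbad line\n1 2 3 4\n' |
  awk '
    NF != 4 { skipped++; next }         # give up on this record right away
            { print $4, $3, $2, $1 }    # only reached for four-field records
    END     { print skipped " record(s) skipped" }
  ')
printf '%s\n' "$out"
```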


Example:

FILENAME == "names.txt" {

count += 1;

next

}

{print $0 }

END{

print count

}

# Counts each line in the file "names.txt".



getline

• getline is used to read the next line of input from the current input file, from a specified file, or from a pipe.

• The getline command can be used without arguments to read input from the current input file.

• It reads the next input record and splits it up into fields. This is useful if you've finished processing the current record, but you want to continue processing from the next record.

• Note: the new value of $0 is used in testing the patterns of any subsequent rules. The original value of $0 that triggered the rule which executed getline is lost.


Example:

/^[0-9]+/ {print "Line number", NR, ":", "starts with a number" }

/^\/\*/ { getline }

{print NR ":" $0 }

Input:

This is a cat

1234 a cat

A test

/* A comment line */

990 is the score

Output:

1:This is a cat

Line number 2 : starts with a number

2:1234 a cat

3:A test

5:990 is the score


getline

• Using getline to read a line into a variable

• You can use `getline variable' to read the next record from awk's input into the variable variable. No other processing is done.

• For example, suppose the next line is a comment, or a special string, and you want to read it, without triggering any rules. This form of getline allows you to read that line and store it in a variable so that the main read-a-line-and-check-each-rule loop of awk never sees it.

• The getline command used in this way sets only the variables NR and FNR.

• The record is not split into fields, so the values of the fields (including $0) and the value of NF do not change.


• What is the output of the following program on the input file given below:

/^[A-Za-z]/ { getline tmp
  print tmp
}
{print $0 }

Input file:

ABCD

1234

EFGH

5678
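The question can be checked by running the program on the four input lines, as in this sketch. Because getline tmp stores the next line in tmp without touching $0, each letter line is printed after the number line that follows it, and the number lines never reach the rules on their own.

```shell
out=$(printf 'ABCD\n1234\nEFGH\n5678\n' |
  awk '
    /^[A-Za-z]/ { getline tmp    # the next line goes into tmp; $0 is unchanged
                  print tmp }
                { print $0 }     # still the line that matched above
  ')
printf '%s\n' "$out"
```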


getline

• Using getline < file to read the next record from the file file.

• Here file is a string-valued expression that specifies the file name. `< file' is called a redirection since it directs input to come from a different place.

• For example, the following program reads its input record from the file `input.dat' when it encounters a first field with a value equal to 10 in the current input file.

awk '{ if ($1 == 10) {
         getline < "input.dat"
         print
       } else
         print
     }'

• Since the main input stream is not used, the values of NR and FNR are not changed. But the record read is split into fields in the normal manner, so the values of $0 and other fields are changed. So is the value of NF.


• Using getline to read the output of a command from a pipe:

• You can pipe the output of a command into getline, using `command | getline'. In this case, the string command is run as a shell command and its output is piped into awk to be used as input. This form of getline reads one record at a time from the pipe.

• For example, the following program copies its input to its output, except for lines that begin with `@execute', which are replaced by the output produced by running the rest of the line as a shell command:

awk '{ if ($1 == "@execute") {
         tmp = substr($0, 10)
         while ((tmp | getline) > 0)
           print
         close(tmp)
       } else
         print
     }' input

The close function is called to ensure that if two identical `@execute' lines appear in the input, the command is run for each one.


close()

• close() allows you to close open files and pipes.

– There may be a limitation on the number of files and pipes that can be open at the same time.

– Closing a pipe allows you to run the same command twice.

– Example: close("who")


What is the output for the given input file

Jsmith

Mjones

@execute who

TWolf


• Using getline to read the output of a command from pipe into a variable:

• When you use `command | getline var', the output of the command command is sent through a pipe to getline and into the variable var.

• Example:

awk 'BEGIN {
       "date" | getline current_time
       close("date")
       print "Report printed on " current_time
     }'

• In this version of getline, none of the built-in variables are changed, and the record is not split into fields.


Using system()

• The system() function executes a command supplied as an expression.

• The output generated from executing system() is not available within the program for processing.

• system() returns the exit status of the program that was executed.

Example:

#!/bin/awk -f

BEGIN{

status = system ("mkdir temp")

if (status != 0)

print "command failed"

}


User-defined functions

• A function definition can be anywhere that a pattern-action rule can be.

• Inputs to the function are passed as a list of parameters.

Example:

# inserts a string, insertStr after position in aString

function insertString(aString, position, insertStr){

before = substr(aString, 1,position)

after = substr(aString,position +1)

return before insertStr after

}

{ print insertString($1,5,"BBBB") }

# No spaces are allowed between the function name and the left parenthesis.


• All the variables in the parameter list are considered local to the function.

• All variables defined in the body of the function are treated as global variables.

• Therefore any temporary variables that are declared are put at the end of the parameter list.

• Example:

function insertString(aString, position, insertStr, after) {
  before = substr(aString, 1, position)
  after = substr(aString, position + 1)
  return before insertStr after
}
{ print insertString($1,5,"BBBB") }
{ print aString }
{ print "before: " before }
{ print "after: " after }


cat testFile

HelloWorld

This is a test

XYZ1234567890

awk -f fun2.awk testFile

HelloBBBBWorld

before: Hello

after:

ThisBBBB

before: This

after:

XYZ12BBBB34567890

before: XYZ12

after:
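The same scoping rule can be seen in isolation in the sketch below. The function demo and the variables scratch and leaked are made up for illustration: scratch is listed as an extra parameter and stays local, while leaked is assigned only in the body and so remains visible after the call.

```shell
out=$(awk '
    # scratch is an extra parameter, so it is local to demo;
    # leaked is assigned in the body only, so it is global.
    function demo(s,    scratch) {
      scratch = "local to demo"
      leaked  = "set inside demo"
      return toupper(s)
    }
    BEGIN {
      print demo("abc")
      print "scratch=[" scratch "]"    # empty: the local value is gone
      print "leaked=[" leaked "]"      # still set after the call
    }
  ')
printf '%s\n' "$out"
```

The extra spaces before scratch in the parameter list are a common convention for separating real arguments from local variables; awk itself ignores them.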


Functions

• Arrays are passed by reference

#!/bin/awk -f
function moveSmallest(LIST, SIZE, temp, small, smallIndex) {
  small = LIST[1]
  smallIndex = 1   # keeps the swap below harmless when LIST[1] is already the smallest
  for (i = 2; i <= SIZE; ++i) {
    if (LIST[i] < small) {
      small = LIST[i]
      smallIndex = i
    }
  }
  LIST[smallIndex] = LIST[1]
  LIST[1] = small
  return
}
END {
  array[1] = 12; array[2] = 0; array[3] = -1; array[4] = 100
  moveSmallest(array, 4)
  for (i = 1; i <= 4; ++i) {
    print array[i]
  }
}


Some built-in Functions

• Arithmetic Functions

• cos, exp, int, log, sin, sqrt, atan2, rand, srand

• Some useful String Functions

• index, length, split, sub, substr, tolower, toupper

• gsub(regExp, replaceWithString, inString) – globally substitutes replaceWithString for regExp in inString.

• match(string, regExp) – returns the position where regExp is found in string, or 0 if no occurrences are found.
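A few of these string functions can be exercised on a made-up sentence, as sketched below. gsub() returns the number of substitutions it made, and match() also sets the built-in variables RSTART and RLENGTH, which can be handed to substr().

```shell
out=$(printf 'the cat sat on the mat\n' |
  awk '
    {
      line = $0
      n = gsub(/at/, "AT", line)       # returns the number of replacements made
      print n ": " line
      if (match($0, /s[a-z]+/))        # sets RSTART and RLENGTH on success
        print RSTART, RLENGTH, substr($0, RSTART, RLENGTH)
      print toupper(substr($0, 1, 3)), index($0, "cat"), length($0)
    }
  ')
printf '%s\n' "$out"
```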


Passing parameters into a script

• Input is passed into an awk script by setting variables on the command line.

• Example:

– awk -f awkprog x=1 y=2 inputfile

– The variables x and y can be accessed in the main loop (not in the BEGIN section).

– The system variables ARGC and ARGV can be used to access the command line arguments

Example:

BEGIN { print "BEGIN: " n }
NR == 1 { print ARGC; print n
  for (i = 0; i < ARGC; ++i) {
    print ARGV[i]
  }
}

% awk -f param.awk n=20 testfile

BEGIN:

3

20

awk

n=20

testfile


An array of Environment variables

#!/bin/awk -f

BEGIN{

for (env in ENVIRON){

print env "=" ENVIRON[env]

}

print "Logname = ", ENVIRON["LOGNAME"]

}
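A sketch of looking up a single environment variable, with the in operator guarding against names that are not set. The variable GREETING is set only for this one command, and NO_SUCH_VARIABLE_12345 is assumed not to exist in the environment.

```shell
out=$(GREETING=hello awk '
    BEGIN {
      if ("GREETING" in ENVIRON)                    # set just for this command
        print "GREETING = " ENVIRON["GREETING"]
      if (!("NO_SUCH_VARIABLE_12345" in ENVIRON))   # assumed to be unset
        print "NO_SUCH_VARIABLE_12345 is not set"
    }
  ')
printf '%s\n' "$out"
```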