CSE2391/3391 / CMG / 2002 / 12 Unix Tools

1. awk Programming by Example

1.1. FIELDS

awk automatically splits each input line into fields, that is, strings of non-blank characters separated by <blanks> or <tabs>.

The fields are assigned to built-in variables called $1, $2, ... $NF where NF is a variable whose value is set to the number of fields in the current input record.

$NF is the last field in the line.

(a) Print the first, third and last fields of each record:

$ awk '{ print $1, $3, $NF }' filename

Note that the pattern is absent, hence the action is applied to every record of the input file.
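
As a quick check of how the fields are numbered, the following sketch feeds two invented lines to a variant of the command in (a) that also prints NF. The sample text and the shell variable out are assumptions made purely for illustration.

```shell
# Two invented input lines with different numbers of fields.
out=$(printf 'alpha beta gamma delta\none two three\n' |
      awk '{ print NF ":", $1, $3, $NF }')
echo "$out"
```

The second line has only three fields, so for that record $3 and $NF name the same field and "three" is printed twice.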

awk, by default, assumes that white space (any number of blanks or tabs) separates the fields. The field separator can, however, be changed to any single character in the ASCII character set. The built-in variable FS stores the field separator.

(b) awk has a command line option which can be used to set the field separator:

$ awk -F: '{ print $1 }' /etc/passwd

can be used to print the first field of each record in the system password file. In this file the fields are separated by : (colon). On most modern systems the encrypted passwords themselves are no longer stored in this file, owing to a security strategy called Hidden (shadow) Passwords, but the file normally remains readable and the example still works.
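
The sketch below shows two ways of setting the field separator, using two invented passwd-style records so that it does not depend on the real password file. The record contents and the shell variable names are assumptions; the BEGIN pattern used in the second command is described in section 1.3.

```shell
# Invented passwd-style records; real entries will differ.
data='root:x:0:0:Super User:/root:/bin/sh
guest:x:500:500:Guest Account:/home/guest:/bin/sh'

# Setting FS in a BEGIN action has the same effect as -F: on the command line.
with_option=$(printf '%s\n' "$data" | awk -F: '{ print $1, $5 }')
with_begin=$(printf '%s\n' "$data" | awk 'BEGIN { FS = ":" } { print $1, $5 }')
echo "$with_option"
```

Both commands print the login name and the comment field; note that the fifth field keeps its embedded blank because only the colon now separates fields.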

1.2. The RECORD

The built-in variable, NR, stores the current record number (or line-number) and the current record is assigned to the variable $0.

When the arguments of the print action are separated by commas, they are printed with the output field separator between them. This separator is stored in the built-in variable OFS, and is a single blank by default.
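
A minimal sketch of the effect of OFS follows. The input line and the choice of "-" as separator are invented for illustration, and OFS is set in a BEGIN action (see section 1.3).

```shell
# With commas, print inserts OFS between its arguments;
# without commas, the fields are simply concatenated.
out=$(printf 'a b c\n' |
      awk 'BEGIN { OFS = "-" } { print $1, $2, $3; print $1 $2 $3 }')
echo "$out"
```

The first print produces a-b-c and the second produces abc, which shows that OFS only appears where a comma separates the arguments.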

(c) To indent each line of a file by one tab:

$ awk '{ printf "\t%s\n", $0 }' filename

To produce a shell-script called indent, based on awk, edit the following text into a file called indent and make it executable (chmod +x indent):

awk '{ printf "\t%s\n", $0 }' "$@"

Then an example of its usage would be:

$ indent filename

CSE2391/3391 Unix Tools, 2002 Page 1

(d) Now, to use patterns to indent only non-blank (that is, non-empty) input lines, the file indent2 should contain either of the following:

awk '!($0 == "") { printf "\t%s\n", $0 }' "$@"

awk 'length($0) != 0 { printf "\t%s\n", $0 }' "$@"

(e) To print lines that are longer than 75 characters:

$ awk 'length($0) > 75 { print "Line:", NR, "is too long" }' filename

1.3. Built-in patterns BEGIN and END

The patterns BEGIN and END specify actions which are to occur before the first record is processed and after the last record has been processed, respectively.

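This script (called try-awk) prints the first, third and last fields of each record.

BEGIN { print "Start of the File:", ARGV[1] }
{ print $1, $3, $NF }
END { print "The number of lines:", NR }

Note that ARGV[1], the first file named on the command line, is used in the BEGIN action because FILENAME is not yet set there: no input file has been opened at that point.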

Now an example usage would be:

$ awk -f try-awk file1 file2 ...

One of the real strengths in awk lies in its ability to do calculations, as well as string operations.

(f) A common use of awk is to sum columns of numbers. To sum the numbers in the first column of a data file, the following code is required (assume that the awk-script file is called sum-col1):

{ s = s + $1 }
END { print "Sum of column 1 is:", s }

Now an example usage would be:

$ awk -f sum-col1 filename

NOTE: user variables are initialised to zero (or, in a string context, to the empty string) by default, so no initialisation step is required.
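
The same reliance on default initialisation can be sketched for a sum and a mean together. The three input numbers and the variable names s, n and out are invented for illustration.

```shell
# Invented column of numbers; s and n are never initialised explicitly.
out=$(printf '10\n20\n30\n' |
      awk '{ s += $1; n++ }
           END { if (n > 0) print "Sum:", s, "Mean:", s / n }')
echo "$out"
```

The test on n in the END action guards against dividing by zero when the input is empty.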

Variables in awk also store strings. Whether a variable is treated as a number or as a string depends on the context. Roughly speaking, in an arithmetic expression like s += $1 the numeric value is used; in a string context like x = "abc" the string value is used; and in an ambiguous case like x > y the string value is used unless both operands are clearly numeric.
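
The ambiguous case can be made concrete with a small sketch, assuming an awk that follows the POSIX comparison rules (as current versions do). The input line and the variable names are invented for illustration.

```shell
out=$(printf '10 9\n' | awk '{
    as_fields  = ($1 > $2)       # input fields that look numeric: 10 > 9 holds
    as_strings = ("10" > "9")    # string constants: "1" sorts before "9"
    print as_fields, as_strings
}')
echo "$out"
```

The first comparison yields 1 (true) because both fields look like numbers; the second yields 0 (false) because string constants are compared character by character.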

1.4. CONTROL FLOW

The control flow statements used in awk take the same form as those used in the C Programming Language.

We have thus:

• IF...THEN...ELSE

if (condition) code1 else code2

• FOR (){ ... }

for (initialisation; termination condition; iterator) code

(g) Now for an example containing FOR and IF.

The following awk-script folds lines which are longer than 80 characters; it is assumed that the file does NOT contain <tabs>.

BEGIN {
    N = 80
    for (i = 1; i <= N; i++)    # build a string of N blanks
        blanks = blanks " "
}
{   if ((n = length($0)) <= N)
        print
    else {
        for (i = 1; n > N; n -= N) {
            printf "%s\\\n", substr($0,i,N)
            i += N
        }
        printf "%s%s\n", substr(blanks,1,N-n), substr($0,i)
    }
}

The algorithm is:

IF the length of the line is <= 80, just print it
ELSE print the line in pieces of 80 characters, each followed by a backslash, and print the rest of the record, right-aligned, on the next line of output.
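
To see the folding at work without typing an 80-character line, the sketch below runs the same logic with the width reduced to 10. The reduced value of N, the two input lines and the shell variable out are assumptions for the demonstration; the format strings are written as they need to appear inside a quoted awk program.

```shell
out=$(printf 'short\nabcdefghijklmnopqrstuvwxyz\n' | awk '
BEGIN {
    N = 10                               # assumed width for the demonstration
    for (i = 1; i <= N; i++)             # build a string of N blanks
        blanks = blanks " "
}
{
    if ((n = length($0)) <= N)
        print                            # short enough: print unchanged
    else {
        for (i = 1; n > N; n -= N) {     # emit full-width pieces
            printf "%s\\\n", substr($0, i, N)
            i += N
        }
        # right-align whatever is left over
        printf "%s%s\n", substr(blanks, 1, N - n), substr($0, i)
    }
}')
echo "$out"
```

The 26-letter line comes out as two 10-character pieces, each ending in a backslash, followed by the last six letters preceded by four blanks.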

• WHILE(){ ... }

while (condition) { code }

(h) Now for an example containing WHILE. The following code shows a very naive example, which prints every second field (the first, third, fifth, ...) of each record, and I am sure that you get the idea!

{   field = 1
    while (field <= NF) {
        printf("%s ", $field)
        field += 2
    }
    print ""
}

An aside for all those of you who are veteran C programmers: exit causes an immediate transfer

to the END pattern.
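
A minimal sketch of this behaviour follows, using four invented one-letter records. The choice of stopping at the third record is arbitrary.

```shell
# exit stops the reading of input, but the END action still runs.
out=$(printf 'a\nb\nc\nd\n' |
      awk 'NR == 3 { exit } { print } END { print "stopped at record", NR }')
echo "$out"
```

Only the first two records are printed; the third triggers exit, and the END action then reports the record number at which processing stopped.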

1.5. ARRAYS

awk arrays, like other awk variables, do not need to be declared or initialised.

(i) The following example prints the lines of the input file in reverse order:

{ line[NR] = $0 }
END { for (i = NR; i > 0; i--) print line[i] }

1.6. The split() built-in function:

Normally awk’s input processing splits each input line into fields. It is, however, possible to perform the same splitting operation on any string with the built-in function split:

syntax: n = split(s, arr, sep)

where:

s - input string
arr - array whose elements 1 to n contain the fields of string s
sep - separator character
n - number of fields found (the value returned by split)

(j) Now a simple example to show the splitting of the lines of /etc/passwd into fields:

You would use:

split($0,a,":")

a - is the array which will contain the fields.

Similarly, in order to split the string /etc/passwd using / as a separator:

split("/etc/passwd",path_bits,"/")

path_bits - array which will contain the fields.
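
It can be instructive to print what split actually stores. The sketch below does so for the second call above; the output layout and the shell variable out are invented for illustration.

```shell
out=$(awk 'BEGIN {
    n = split("/etc/passwd", path_bits, "/")
    print "n =", n
    for (i = 1; i <= n; i++)
        print i ": [" path_bits[i] "]"
}')
echo "$out"
```

Because the string begins with the separator, split reports three fields and the first element of path_bits is empty; etc and passwd are elements 2 and 3.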

1.7. ASSOCIATIVE ARRAYS

Although one normally thinks of array subscripts as integers, in awk any value can be used as a subscript. This provides a means of dealing with name-value pairs.

(k) Now for an example where there is a need to sum the values associated with particular names. The input data are:

Susie 400
John 100
Mary 200
Mary 300
John 100
Susie 100
Mary 100

Then the awk-script:

{ sum[$1] += $2 }
END { for (name in sum) print name, sum[name] }

produces the output:

John 200
Mary 600
Susie 500

NOTE that the

for (var in array) statement

is only superficially similar to the C for loop; in fact this for construct loops over the subscripts in a generally unpredictable order, so the lines above may appear in a different order. A hashing scheme is used to store the elements of an associative array, so that it takes roughly the same amount of time to access any element.
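
When a predictable order is wanted, one common approach is to pipe the output through sort. The sketch below does this for the name-value data of example (k); holding the data in a shell variable and setting LC_ALL=C for sort are choices made for the illustration.

```shell
# The input data of example (k), held in a shell variable for convenience.
data='Susie 400
John 100
Mary 200
Mary 300
John 100
Susie 100
Mary 100'

# awk emits the totals in an unpredictable order; sort fixes the order by name.
out=$(printf '%s\n' "$data" |
      awk '{ sum[$1] += $2 } END { for (name in sum) print name, sum[name] }' |
      LC_ALL=C sort)
echo "$out"
```

With the sort in place the totals always appear alphabetically by name, whatever order the for loop happens to visit the subscripts in.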

(l) Lastly, let us look at an awk-script to determine word frequency:

{ for (i = 1; i <= NF; i++) num[$i]++ }
END { for (word in num) print word, num[word] }
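
A sketch of the word-frequency script in use follows, with the counts listed from most to least frequent. The two input lines are invented, and the sort options (numeric, descending on the count, then by word) are one possible choice rather than part of the script itself.

```shell
out=$(printf 'the cat and the dog\nthe end\n' |
      awk '{ for (i = 1; i <= NF; i++) num[$i]++ }
           END { for (word in num) print word, num[word] }' |
      LC_ALL=C sort -k2,2nr -k1,1)
echo "$out"
```

Here "the" is listed first with a count of 3, followed by the four words that occur once, in alphabetical order.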

Charles Greif, May 2002
