CSE2391/3391 / CMG / 2002 / 12 Unix Tools
1. awk Programming by Example
1.1. FIELDS
awk automatically splits each input line into fields, that is, strings of non-blank characters separated by <blanks> or <tabs>. The fields are assigned to built-in variables called $1, $2, ... $NF, where NF is a variable whose value is set to the number of fields in the current input record. $NF is therefore the last field in the line.
(a) Print the first, third and last field of each record:
    $ awk '{ print $1, $3, $NF }' filename
Note that the pattern is absent, hence the action is applied to every record of the input file.
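To see the field variables at work without needing a data file, the following sketch feeds two lines to awk through a pipe. The sample text is invented purely for illustration, and NF is printed first so that the field count of each record is visible.

```shell
# Two sample records (made-up data); fields are separated by blanks or tabs.
out=$(printf 'alpha beta gamma delta\none\ttwo three\n' |
      awk '{ print NF, $1, $3, $NF }')
echo "$out"
```

The second record has only three fields, so for it $3 and $NF name the same field.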
awk, by default, assumes that white space (any number of blanks or tabs) separates the fields. The field separator can, however, be changed to any single character in the ASCII character set. The built-in variable FS stores the field separator.
(b) awk has a command line option, -F, which can be used to set the field separator:
    $ awk -F: '{ print $1 }' /etc/passwd
This can be used to print the first field of each record in the system password file. In this file the fields are separated by : (colon). This example may not work for you since, on most modern systems, read access to the password file is limited by a security strategy called Hidden Passwords.
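If the password file is not readable, the same idea can be tried on inline data. The sketch below uses two made-up records in the colon-separated layout of a password file and sets FS inside a BEGIN action, which should have the same effect as the -F: option.

```shell
# Made-up records in password-file layout; they do not describe real accounts.
data='root:x:0:0:Super User:/root:/bin/sh
guest:x:500:500:Guest:/home/guest:/bin/sh'

# Set the field separator via the FS variable before any input is read.
out=$(printf '%s\n' "$data" | awk 'BEGIN { FS = ":" } { print $1, $NF }')

# The -F: command line option is the equivalent shorthand.
out2=$(printf '%s\n' "$data" | awk -F: '{ print $1, $NF }')
echo "$out"
```

Because the separator is now a colon, the blank inside "Super User" no longer splits that field.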
1.2. The RECORD
The built-in variable NR stores the current record number (or line number), and the current record is assigned to the variable $0. The print action prints its comma-separated arguments with the output field separator between them; this separator is stored in the built-in variable OFS.
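A small sketch of NR, $0 and OFS together, using invented input. OFS is set to a hyphen only to make its effect easy to spot.

```shell
# For every record: print the record number and first two fields (joined by OFS),
# then the untouched record $0.
out=$(printf 'a b c\nd e f\n' |
      awk 'BEGIN { OFS = "-" } { print NR, $1, $2; print $0 }')
echo "$out"
```

Note that OFS only appears between the comma-separated arguments of print; $0 is printed exactly as it was read.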
(c) Indenting lines:
    $ awk '{ printf "\t%s\n", $0 }' filename
To produce a shell script called indent, based on awk, you could edit the following text into a file called indent:
    awk '{ printf "\t%s\n", $0 }' "$@"
Then, once the file has been made executable (chmod +x indent), an example of its usage would be:
    $ indent filename
(d) Now, to use patterns to indent only non-blank (that is, non-empty) input lines, the file indent2 should contain:
    awk '!($0 == "") { printf "\t%s\n", $0 }' filename
OR
    awk 'length($0) != 0 { printf "\t%s\n", $0 }' filename
(e) To print a message for each line that is longer than 75 characters:
    $ awk 'length($0) > 75 { print "Line:", NR, "is too long" }' filename
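The two kinds of pattern used in (d) and (e) can be combined in one program. The following sketch uses a threshold of 10 characters instead of 75, an assumption made only so that the short sample lines trigger it.

```shell
# Each pattern is tested against every record; a record may match none, one or both.
out=$(printf 'short\n\na much longer line\n' |
      awk 'length($0) > 10 { print "Line:", NR, "is too long" }
           length($0) == 0 { print "Line:", NR, "is blank" }')
echo "$out"
```

The first line matches neither pattern, so nothing is printed for it.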
1.3. Built-in patterns BEGIN and END
The patterns BEGIN and END specify actions which are to occur before the first record is processed and after the last record has been processed, respectively.
This script (called try-awk) prints the first, third and last fields of each record. The opening brace of an action must be on the same line as its pattern. Note also that FILENAME is normally still empty inside BEGIN, because no input file has been opened yet.
    BEGIN { print "Start of the File:", FILENAME }
    { print $1, $3, $NF }
    END { print "The number of lines:", NR }
Now an example usage would be:
    $ awk -f try-awk file1 file2 ...
One of the real strengths in awk lies in its ability
to do calculations, as well as string operations.
(f) A common use of awk is to sum columns of numbers. To sum the numbers in the first column of a data file, the following code is required; assume that the awk-script file is called sum-col1:
    { s = s + $1 }
    END { print "Sum of column 1 is:", s }
Now an example usage would be:
    $ awk -f sum-col1 filename
NOTE: user variables are initialised to zero (or, in a string context, to the empty string) by default, so no initialisation step is required.
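As a quick check of the summing idiom and of default initialisation, here is a sketch on three invented records. The second command prints an unset variable (the name x is arbitrary) once as a number and once as a string.

```shell
# Sum the first column; s starts out as zero without being declared.
out=$(printf '10 apples\n20.5 pears\n3 plums\n' |
      awk '{ s += $1 } END { print "Sum of column 1 is:", s }')
echo "$out"

# An unset variable is 0 when used as a number and empty when used as a string.
init=$(awk 'BEGIN { print x + 0, "[" x "]" }')
echo "$init"
```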
Variables in awk also store strings. Whether a variable is to be treated as a number or as a string depends on the context. Roughly speaking, in an arithmetic expression like s+=$1, the numeric value is used; in a string context like x="abc", the string value is used; and in an ambiguous case like x>y, the string value is used unless the variables are clearly numeric.
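The ambiguous comparison case is easiest to understand by trying it. In the sketch below, fields read from input that look like numbers are compared numerically, string constants are compared as strings, and adding 0 is a common idiom for forcing a numeric comparison. The variable names are arbitrary.

```shell
# Input fields that look numeric: 10 > 9 is true, so 1 is printed.
a=$(echo '10 9' | awk '{ print ($1 > $2) }')

# String constants: "10" sorts before "9" character by character, so 0 is printed.
b=$(awk 'BEGIN { x = "10"; y = "9"; print (x > y) }')

# Adding 0 forces both sides to be treated as numbers, so 1 is printed.
c=$(awk 'BEGIN { x = "10"; y = "9"; print (x + 0 > y + 0) }')
echo "$a $b $c"
```

The parentheses around each comparison are needed because a bare > after print would be taken as output redirection.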
1.4. CONTROL FLOW
The control flow statements used in awk are modelled closely on those of the C Programming Language.
We have thus:
IF...THEN...ELSE
    if (condition)
        code1
    else
        code2
FOR (){ ... }
    for (initialisation; condition; iterator)
        code
The loop body is repeated for as long as the condition remains true.
(g) Now for an example containing FOR and IF. The following awk-script folds lines which are longer than 80 characters; it is assumed that the file does NOT contain <tabs>.
    BEGIN {
        N = 80
        # build a string of N blanks, used for padding the last piece
        for (i = 1; i <= N; i++)
            blanks = blanks " "
    }
    {
        if ((n = length($0)) <= N)
            print
        else {
            # print full pieces of N characters while more than N remain
            for (i = 1; n > N; n -= N) {
                printf "%s\n", substr($0, i, N)
                i += N
            }
            # pad the remainder on the left so that it is right-justified
            printf "%s%s\n", substr(blanks, 1, N - n), substr($0, i)
        }
    }
The algorithm is:
IF the length of the line is <= 80, just print it
ELSE print the line in successive pieces of 80 characters, with the final, shorter piece padded on the left with blanks so that it is right-justified on its own line of output.
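To watch the folding algorithm on lines that fit on a page, the sketch below runs the same logic with a width of 10 instead of 80. Passing N with the -v option is an adaptation made for this illustration, and the sample input is invented.

```shell
# Same folding logic as the script above, with the width N taken from the
# command line (-v N=10) so the effect is visible on short sample lines.
out=$(printf 'short\nabcdefghijklmnopqrstuvwxy\n' | awk -v N=10 '
BEGIN {
    for (i = 1; i <= N; i++)
        blanks = blanks " "
}
{
    if ((n = length($0)) <= N)
        print
    else {
        for (i = 1; n > N; n -= N) {
            printf "%s\n", substr($0, i, N)
            i += N
        }
        printf "%s%s\n", substr(blanks, 1, N - n), substr($0, i)
    }
}')
echo "$out"
```

The 25-character line comes out as two full pieces of 10 characters followed by the last 5 characters, pushed to the right by 5 blanks.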
WHILE(){ ... }
    while (condition) {
        code
    }
(h) Now for an example containing WHILE. The following code is a very naive example, which prints every second field of each record on one line, but I am sure that you get the idea!
    {
        field = 1
        while (field <= NF) {
            printf("%s ", $field)
            field += 2
        }
        print ""
    }
An aside for all those of you who are veteran C programmers: exit causes an immediate transfer
to the END pattern.
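The behaviour of exit can be checked with a short sketch on invented input: the program stops reading at the second record, yet the END action still runs.

```shell
# exit on record 2 skips the rest of the input but still runs the END action.
out=$(printf 'one\ntwo\nthree\n' |
      awk 'NR == 2 { exit }
           { print }
           END { print "END reached after record", NR }')
echo "$out"
```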
1.5. ARRAYS
awk arrays, like other awk variables, do not need to be initialised.
(i) The following example prints the lines of the input file in reverse order:
    { line[NR] = $0 }
    END { for (i = NR; i > 0; i--) print line[i] }
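A quick run of the reversing script on three invented lines, shown here as a one-line command rather than a script file:

```shell
# Store every record under its record number, then print from the last to the first.
out=$(printf 'first\nsecond\nthird\n' |
      awk '{ line[NR] = $0 } END { for (i = NR; i > 0; i--) print line[i] }')
echo "$out"
```

Since the whole input is held in the array until END, this approach is best suited to files that fit comfortably in memory.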
1.6. The split() built-in function
Normally awk's input processing splits each input line into fields. It is, however, possible to perform the same splitting operation on any string with the built-in function split:
syntax: n = split(s, arr, sep)
where:
s - input string
arr - array whose elements 1 to n contain the fields of string s
sep - separator character
n - the number of fields found
(j) Now a simple example to show the splitting of the lines of /etc/passwd into fields. You would use:
    split($0, a, ":")
a - is the array which will contain the fields.
Similarly, in order to split the string /etc/passwd using / as a separator:
    split("/etc/passwd", path_bits, "/")
path_bits - is the array which will contain the fields.
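It can be instructive to print what split actually returns. The sketch below splits the string /etc/passwd on / and shows the return value together with each array element in brackets.

```shell
# No input is read: everything happens in the BEGIN action.
out=$(awk 'BEGIN {
    n = split("/etc/passwd", path_bits, "/")
    print n
    for (i = 1; i <= n; i++)
        print i, "[" path_bits[i] "]"
}')
echo "$out"
```

Because the string starts with the separator, the first element is an empty string and split reports three fields rather than two.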
1.7. ASSOCIATIVE ARRAYS
Although one normally thinks of array subscripts as integers, in awk any value can be used as a subscript. This provides a means of dealing with name-value pairs.
(k) Now for an example where there is a need to sum the values associated with particular names. The input data are:
    Susie 400
    John 100
    Mary 200
    Mary 300
    John 100
    Susie 100
    Mary 100
Then the awk-script:
    { sum[$1] += $2 }
    END { for (name in sum) print name, sum[name] }
produces the output (the order of the lines may vary):
    John 200
    Mary 600
    Susie 500
NOTE that the for (var in array) statement is only superficially similar to the C for loop; in fact this for construct loops over the subscripts in a generally unpredictable order. A hashing scheme is used to store the elements of an associative array, so that it takes roughly the same amount of time to access any element.
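Since the loop order cannot be relied upon, one common workaround is to pipe the result through the standard sort utility. The following sketch runs the name-summing script from (k) on its sample data and sorts the output by name; using sort here is a suggestion rather than part of the original example.

```shell
# Sum the values per name, then let sort put the lines in a predictable order.
out=$(printf 'Susie 400\nJohn 100\nMary 200\nMary 300\nJohn 100\nSusie 100\nMary 100\n' |
      awk '{ sum[$1] += $2 } END { for (name in sum) print name, sum[name] }' |
      sort)
echo "$out"
```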
(l) Lastly, let us look at an awk-script to determine word frequency:
    { for (i = 1; i <= NF; i++) num[$i]++ }
    END { for (word in num) print word, num[word] }
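A word-frequency list is usually most useful when ordered by count. The sketch below is one possible variation: it prints the count before the word and passes the result to sort, with the most frequent words first and ties broken alphabetically. The sample sentences are invented.

```shell
# Count every field as a word, then sort by count (descending) and by word.
out=$(printf 'the cat saw the dog\nthe dog ran\n' |
      awk '{ for (i = 1; i <= NF; i++) num[$i]++ }
           END { for (word in num) print num[word], word }' |
      sort -k1,1nr -k2,2)
echo "$out"
```

Note that words are counted exactly as they appear, so "The" and "the", or "dog" and "dog.", would be treated as different words.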
Charles Greif, May 2002