Awk

Unix has many text manipulation utilities. The most flexible is awk. Its conciseness is paid for by
- speed of execution
- potentially hieroglyphic expressions

Even so, if you need to manipulate text files which have a fairly fixed line format, awk is ideal. It operates on the fields of a line (the default field separator, FS, being <space>).
When awk reads in a line, the first field can be referred to as `$1', the second as `$2', etc. The whole line is `$0'.
A short awk program can be written on the command line, e.g.

  cat file | awk '{print NF,$0}'

which prepends each line with the Number of Fields (i.e., words) on the line.
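As a quick sanity check, the one-liner can be tried on a couple of lines generated inline (the sample text here is illustrative):

```shell
# Prepend each line with its field (word) count.
printf 'one two three\nhello world\n' | awk '{print NF, $0}'
# 3 one two three
# 2 hello world
```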
The quotes are necessary because otherwise the shell would interpret special characters like `$' before awk had a chance to read them. Longer programs are best put into files.
Two examples in /export/Examples/Korn_shell (wordcount and awker) should give CUED users a start (the awk manual page has more examples). Once you have copied over wordcount and text, do

  wordcount text

and you will get a list of the words in text and their frequency. Here is wordcount:

  awk '{for (i = 1; i <= NF; i++) num[$i]++}
       END {for (word in num) print word, num[word]}' $*
The syntax is similar to that of C. awk lines take the form

  <pattern> { <action> }

Each input line is matched against each awk line in turn. If, as here in wordcount, there is no pattern on the awk line, then all input lines match. If there is a match but no action, the default action is to print the whole line.
Thus the for loop is executed for every line of the input. Each word on the line (NF is the number of words on the line) is used as an index into an array whose element is incremented with each instance of that word. The ability to have strings as array `subscripts' is very useful. END is a special pattern, matched at the end of the input file. Here its action is a different sort of for loop that prints out the words and their frequencies: the variable word takes successively the values of the string `subscripts' of the array num.
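The counting logic described above can be exercised directly from the command line (the sample text here is illustrative; the order of `for (word in num)' is unspecified, so the output is piped through sort):

```shell
# Count word frequencies, as wordcount does, on inline sample text.
printf 'the cat sat on the mat\nthe cat\n' |
awk '{for (i = 1; i <= NF; i++) num[$i]++}
     END {for (word in num) print word, num[word]}' |
sort
# cat 2
# mat 1
# on 1
# sat 1
# the 3
```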
Example 2 introduces some more concepts. Copy /export/Examples/Korn_shell/data (shown below)

  NAME   AMOUNT  STATUS
  Tom     1.35   paid
  Dick    3.87   Unpaid
  Harry  56.00   Unpaid
  Tom    36.03   unpaid
  Harry  22.60   unpaid
  Tom     8.15   paid
  Tom    11.44   unpaid

and /export/Examples/Korn_shell/awker if you haven't done so already. Here is the text of awker:

  awk '$3 ~ /^[uU]npaid$/ {total[$1] += $2; owing=1}
       END {if (owing)
                for (i in total) print i, "owes", total[i] > "invoice"
            else
                print "No one owes anything" > "invoice"}' $*
Typing

  awker data

will add up how much each person still owes and put the answer in a file called invoice.
In awker the 3rd field is matched against a regular expression (to find out more about these, type man 5 regexp). Note that both 'Unpaid' and 'unpaid' will match, but nothing else. If there is a match, the action is performed. Note that awk copes intelligently with strings that represent numbers; explicit conversion is rarely necessary. The `total' array has indices which are the people's names. If anyone owes, the variable `owing' is set to 1. At the end of the input, the amount each person owes is printed out.
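The same accumulation can be seen on a few inline sample lines (illustrative data, not the data file itself); for visibility this sketch prints to stdout rather than redirecting to the file invoice, and sorts the output since awk's array traversal order is unspecified:

```shell
# Sum the unpaid amounts per person, as awker does.
printf 'Dick 3.87 Unpaid\nTom 1.35 paid\nTom 11.44 unpaid\n' |
awk '$3 ~ /^[uU]npaid$/ {total[$1] += $2; owing = 1}
     END {if (owing)
              for (i in total) print i, "owes", total[i]
          else
              print "No one owes anything"}' |
sort
# Dick owes 3.87
# Tom owes 11.44
```

Note that the paid line for Tom is skipped by the pattern, so only his unpaid amount appears in the total.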
Other awk facilities are:
- fields can be split: n = split(field, new_array, separator)
- there are some string manipulation routines, e.g. substr(string, first_pos, max_chars) and index(string, substring)
- awk has built-in maths functions (exp, log, sqrt, int) and relational operators (==, !=, >, >=, <, <=, ~ (meaning ``contains''), !~)
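A few of these facilities can be tried together in a BEGIN block, which runs before any input is read, so no input file is needed (the strings below are illustrative):

```shell
awk 'BEGIN {
    n = split("a:b:c", parts, ":")   # n is 3, parts[2] is "b"
    print n, parts[2]
    print substr("abcdef", 2, 3)     # characters 2-4: "bcd"
    print index("abcdef", "cd")      # position of "cd": 3
    if ("filename.txt" ~ /txt/) print "matches"
    print int(sqrt(16))
}'
# 3 b
# bcd
# 3
# matches
# 4
```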
As
you see, awk is
almost a language in itself, and people used to C syntax can soon
create useful scripts with it.
To remove duplicate lines without sorting: the array x counts how many times each whole line ($0) has been seen, so a line is printed only on its first appearance, while !x[$0]++ is still true.
root@SHANKAR:~# cat /tmp/aaa
112121212
dsad
dsad
112121212
dasda
112121212
112121212
root@SHANKAR:~# awk '!x[$0]++' /tmp/aaa
112121212
dsad
dasda