awk

Awk is a filter command added to Unix in the late 1970s.  It is named after its three co-inventors: Aho, Weinberger, and Kernighan.  Awk proved so popular that it was extended with many useful features several years later.  Today Gnu provides gawk, and other dialects such as nawk (new awk) exist and are POSIX compliant.  Gnu awk (gawk) is the most popular version and includes many very useful features missing from POSIX awk, including regex backreferences (in gensub), TCP/IP networking, PostgreSQL DB commands (via extensions), and many minor improvements.

The oldest version of awk dates from the 1970s.  Awk version 2 was invented in the mid-1980s.  POSIX awk is based on awk v2.  However, many systems provide multiple versions (for backward compatibility), and the version called “awk” on your system may be the original version (sometimes called oawk), version 2/POSIX (sometimes called nawk), or the Gnu version (often also called gawk).  There are other versions too.

Until Perl, awk was the most powerful filter available.  (And since Perl isn’t part of POSIX you may not find it on all systems; awk is always available.)  Unlike most *nix filters that do a single task, awk (like sed) is a multi-tasker.  You can use awk instead of a pipeline of many simpler utilities.  Awk is well-suited for many tasks such as generating reports, validating data, managing small text databases (e.g., address book, calendar, rolodex), document preparation (e.g., producing an index), extracting data, sorting data, etc.

Among awk’s more interesting features are its abilities to automatically break each line into fields, to do math, and to perform a full set of programming language tasks: if, loops, variables, functions, etc.  These features make awk very useful for both one-liner scripts and for more complete programs.  Awk is normally used to process data files and to produce reports from them or to re-arrange the files.

Awk is used this way:  awk options 'script' [argument ...].  The script consists of one or more awk statements. Typically the script is enclosed in single quotes (and may be several lines).  For longer scripts you can also specify a file containing the script with the -f file option (which can be repeated; all the scripts are concatenated).  The arguments are filenames just like for other filter commands.

Actually the arguments may either be filenames or variable assignments.  (So avoid using filenames with an equal-sign in them!)  Such assignments are processed just before reading the following file(s).  (See also the ‑v assignment option, which processes the assignment before anything else.)
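The difference between the two kinds of assignment can be seen in a small sketch.  The file names (one, two) and variable names (tag, n) below are made up for illustration: tag is set with -v and so is already visible in BEGIN, while n is set by operands and changes just before each file is read.

```shell
# Scratch directory with two tiny input files (hypothetical names).
dir=$(mktemp -d)
printf 'a\n' > "$dir/one"
printf 'b\n' > "$dir/two"

# tag is assigned with -v, n by operands placed before each file.
out=$(awk -v tag=start '
BEGIN { print "BEGIN: tag=" tag ", n=[" n "]" }
      { print $0 ": n=" n }
' n=1 "$dir/one" n=2 "$dir/two")

printf '%s\n' "$out"
# BEGIN: tag=start, n=[]
# a: n=1
# b: n=2
rm -r "$dir"
```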

Awk has a cycle like that of sed:  A line (record) is read, then broken into words (fields).  Next each awk command is applied, in order, to the line, if it matches.  Then the cycle repeats.

Each awk statement has two parts:

          pattern        { action }   semicolon_or_newline

A missing pattern means to apply the action to all lines.  A pattern with no action means to print any matching lines (like grep).

A statement can appear on a single line and no space is needed between the pattern and action (or between statements).  For readability you usually put one statement per line, with spaces or tabs between the pattern and action.  The open curly brace of an action must be on the same line as the statement’s pattern.

Statements must be separated by a newline or by a semicolon.  (Gnu awk doesn’t require a semicolon or newline between statements; e.g.: awk '{print}{print}' instead of '{print};{print}').
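As a small sketch of the statement forms just described (the sample data is made up), this script has one statement with only a pattern, one with only an action, and an END statement:

```shell
out=$(printf 'apple\nbanana\ncherry\n' | awk '
/an/                        # pattern only: print each matching line
{ n++ }                     # action only: runs for every line
END { print n " lines read" }
')
printf '%s\n' "$out"
# banana
# 3 lines read
```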

An awk statement may also be a function definition. You can use those in actions.  (It doesn’t matter where in the script you define these.)  Function definitions look like this:

function name(parameter_list) { statements }

Discuss style: one liners, multiple short statements, or long actions.  Bad style:

awk '{print}function foo(){print "foo"}$0=="abc"{foo();next}' FILE
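For comparison, here is one possible more readable layout of the same script, run on two made-up input lines.  This is only a sketch of one style; other layouts are equally valid.

```shell
out=$(printf 'abc\nxyz\n' | awk '
# Helper function; it may be defined anywhere in the script.
function foo() {
    print "foo"
}

# Echo every input line.
{ print }

# For the line "abc", also call foo() and skip any later statements.
$0 == "abc" {
    foo()
    next
}
')
printf '%s\n' "$out"
# abc
# foo
# xyz
```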

Awk variables don’t have to be declared and can hold either numbers or strings.  Many variables are pre-defined, e.g.,  RS is the record separator (normally a newline).  Awk also provides one-dimensional arrays but the index is a string.  (See below.)

The value of a variable will be converted as needed to/from string, numeric, and Boolean values.  To force a variable to be treated as a number, add 0 to it; to force it to be treated as a string, concatenate it with the null (empty) string.  When converting a string to a number only a leading number is used.  A string without a leading number converts to zero.  True/false values:  An expression that evaluates to the number zero or the null string is considered false.  Anything else means true.  (Somewhat surprising is that the string constant "0" is true, not false!  An input field containing 0 looks like a number, though, and so is false.)
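These conversion rules can be checked with a short sketch.  The single input line containing 0 is only there so that $1 holds a numeric-looking field.

```shell
out=$(echo 0 | awk '{
    print "12abc" + 0          # only the leading number is used: 12
    print "abc" + 0            # no leading number: 0
    if ("0") print "string constant \"0\" is true"
    if (!0)  print "number 0 is false"
    # A field that looks like a number is treated as a number.
    if (!$1) print "input field 0 is false"
}')
printf '%s\n' "$out"
```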

String literals are enclosed in double-quotes and can use common (C) backslash escapes: \n, \t, \\, etc.

Awk reads input up to the RS value (record separator).  This is normally a newline but can be any single character.  If RS is set to null (“RS=”) then a blank line is the record separator.  In that case a newline also becomes an additional field separator.
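A minimal sketch of the blank-line case, using a made-up two-entry address list in which each record spans two lines:

```shell
out=$(printf 'Ann\n555-1234\n\nBob\n555-9876\n' | awk '
BEGIN { RS = "" }                  # records are separated by blank lines
{ print NR ": " $1 " -> " $2 }     # each line of a record is a field
')
printf '%s\n' "$out"
# 1: Ann -> 555-1234
# 2: Bob -> 555-9876
```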

Next awk splits the line into fields using the value of FS as the field separator.  FS may be a single char, a single space (then any run of white-space is a field separator, with leading and trailing white space ignored), or some ERE.  E.g., with the default “FS=" "”:

$ echo ' 1  2  3 ' | awk '
BEGIN { FS = " " }
{ printf "%s: %s", NF, $1
  for (i = 2; i <= NF; ++i) printf "|%s", $i
  print ""
}'

3: 1|2|3

(With FS set as “FS="[ ]"” instead, the output is “7: |1||2||3|”.)

Once the line (record) is parsed, awk sets the variables $1, $2, ... to each field’s value.  $0 is the whole line.  NF is the number of fields.  You can use $expression to refer to any field.  Using $NF and $(NF-1) is common.

Changing NF will add/remove fields from the line.  Assigning to a non-existing field adds fields to the line (and resets NF; skipped fields contain null strings).
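A small sketch of these field operations on a made-up three-field line.  OFS is set to “-” only to make the rebuilt record easy to see.

```shell
out=$(echo 'a b c' | awk '
BEGIN { OFS = "-" }
{
    print $NF, $(NF-1)   # last and next-to-last fields: c-b
    $5 = "e"             # field 4 is created empty; NF becomes 5
    print NF             # 5
    print                # record rebuilt using OFS: a-b-c--e
}')
printf '%s\n' "$out"
```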

Awk sets NR and FNR to the line (record) number; FNR starts over for each file.  You can set the OFS and ORS (output field/record separator) characters too.
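The following sketch shows NR and FNR diverging across two files, with OFS and ORS changed so their effect is visible.  The file names first and second are made up.

```shell
dir=$(mktemp -d)
printf 'a\nb\n' > "$dir/first"
printf 'c\n'    > "$dir/second"

out=$(awk '
BEGIN { OFS = ":"; ORS = ";\n" }
{ print NR, FNR, $0 }
' "$dir/first" "$dir/second")

printf '%s\n' "$out"
# 1:1:a;
# 2:2:b;
# 3:1:c;
rm -r "$dir"
```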

After parsing the input, awk checks (in order) each statement in the script, to see if the pattern matches.  The pattern is usually an ERE but can be any Boolean expression.  If true then the action is executed.  When the end of the script is reached, awk starts a fresh cycle.

AWK patterns may be one of the following:

BEGIN    (All statements with this pattern are run before reading any data)

END      (All statements with this pattern are run, in order, after all data is read.)

Expressions                   (generally used to test field values):

/ERE/    (matches against whole record/line)

text ~ /ERE/
text !~ /ERE/
lhs == rhs
    (or !=, <, <=, >, >=, in, etc.)

pattern && pattern

pattern || pattern

pattern ? pattern : pattern

 (pattern)       for grouping

! pattern

pattern1, pattern2        an inclusive range of lines

The EREs are the same kind used by egrep: unlike the basic REs of grep and sed, meta-chars such as +, ?, |, and ( ) don’t need a backslash.
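Several of these pattern types can be combined in one script.  The sketch below uses made-up input in which the first field is a line number and the second is a marker; the variable name odd is arbitrary.

```shell
out=$(printf '1 x\n2 START\n3 y\n4 STOP\n5 z\n' | awk '
BEGIN { print "begin" }
/START/, /STOP/ { print "in range: " $1 }      # inclusive range of lines
$1 % 2 == 1 && $2 ~ /^[xyz]$/ { odd++ }        # expression with a match test
END { print "odd rows: " odd }
')
printf '%s\n' "$out"
# begin
# in range: 2
# in range: 3
# in range: 4
# odd rows: 3
```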

Summary of pre-defined variables:

·        RS, FS       Record separator, field separator

·        OFS, ORS     print x,y is the same as print x OFS y followed by ORS

·        NR, FNR      Current record number; FNR restarts at 1 for each file

·        FILENAME     Current file being processed; stdin shows as “

·        ARGC, ARGV, ENVIRON   Used to access parameters and the environment;  cmd line args (minus the script itself) are in ARGV[1]..ARGV[ARGC-1]

·        RLENGTH, RSTART  Set by match function; see below

·        SUBSEP       ary[x,y] is the same as ary[x SUBSEP y]; see arrays below

The most useful actions include print and printf:

print [comma separated list of expressions] - adds ORS at end, OFS between expressions

printf format, list of expressions that match place-holders in format (one expression per place-holder):

    {printf " %03d: %s\n", NR, total}

Any print or printf can be followed with a redirection:

          print stuff | "some shell command"
and
          print stuff > "some file"      (>> works too)

(Don’t forget the quotes!)  Each distinct file or command named this way gets its own output stream, which stays open until you close it or awk exits, so you can redirect some output to one place, and other output to a different place.
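As a sketch of sending output to two places at once, the following pipes formatted lines to sort while writing the names to a file.  The awk variable namefile and the file seen.txt are made-up names.

```shell
dir=$(mktemp -d)
sorted=$(printf 'pear 3\napple 12\n' | awk -v namefile="$dir/seen.txt" '
{
    printf "%s:%03d\n", $1, $2 | "sort"   # formatted lines are piped to sort
    print $1 > namefile                   # names go to a file, in input order
}
END { close("sort"); close(namefile) }
')
seen=$(cat "$dir/seen.txt")

printf '%s\n---\n%s\n' "$sorted" "$seen"
# apple:012
# pear:003
# ---
# pear
# apple
rm -r "$dir"
```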

Operators are similar to those in C (and Java, JavaScript, Perl, ...), with these additions:  A space between two strings means to concatenate those strings.  A “^” means exponentiation.  “$numeric_exp” means a reference to a field.  A “~” (“!~”) is used to (not) match, e.g.,  var [!]~ /RE/.  The expression “val in array” is true if array[val] has been defined.

Awk supports one-dimensional arrays that use strings for the subscripts.  You also use in with a special form of the for loop to iterate over each index in the array:  for (var in array) statement.  To simulate a 2-d array you used to have to use something like “array[i "," j]”.  Modern/POSIX awk allows us to use “array[i, j]”, which is the same as “array[i SUBSEP j]”.  (Show two-d-arrays.awk.)  When using the in operator with these pseudo-2-d arrays, use parentheses around the subscripts, e.g., “(x,y) in ary”.
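A minimal sketch of both uses: counting words with a string-indexed array, then testing membership in a pseudo-2-d array.  The array names count and grid are arbitrary.  The for (var in array) loop visits indexes in no particular order, so the counts are piped through sort.

```shell
out=$(printf 'a b a\nc a b\n' | awk '
{ for (i = 1; i <= NF; i++) count[$i]++ }        # subscripts are strings
END {
    for (w in count) print w, count[w] | "sort"  # order of for-in is unspecified
    close("sort")
    grid[1, 2] = "x"                             # same as grid[1 SUBSEP 2]
    if ((1, 2) in grid)    print "grid[1,2] is set"
    if (!((2, 1) in grid)) print "grid[2,1] is not set"
}')
printf '%s\n' "$out"
```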

To remove an element use delete array[index].  (delete array, with no index, erases the whole array; this began as an extension in gawk and some other versions, so older awks may not support it.)

You can also use if, while, do statement while (condition), for (init; test; incr), for (var in array), break, continue, exit, block-statements with { and }, next, nextfile, and getline [var].
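A few of these statements in a short sketch.  The input is made up: each number is turned into a row of stars, a line saying “skip” is ignored, and a line saying “stop” ends the run early.

```shell
out=$(printf '3\nskip\n2\nstop\n9\n' | awk '
$1 == "skip" { next }          # go straight to the next record
$1 == "stop" { exit }          # stop reading input; END still runs
{
    line = ""
    for (i = 1; i <= $1; i++)  # C-style for loop
        line = line "*"
    print line
    total += $1
}
END { if (total > 4) print "total " total; else print "small" }
')
printf '%s\n' "$out"
# ***
# **
# total 5
```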

Using shell variables in awk:  You can use the ENVIRON array to access awk’s environment.  You can access command line arguments with the ARGV array (indexed from 1 to ARGC-1).  A more common solution is to create a shell (not awk) script that runs  awk '... '"$1"' ...' to access shell’s $1.  Complex quoting can result!  (“awk '$1 == "'"$1"'" {...}'”).
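Two ways to get a shell value into awk without splicing it into the script text are sketched below, using the -v option and the ENVIRON array.  The names user, who, and WHO, and the sample data, are made up.

```shell
user=alice
data='alice 1
bob 2'

# Pass the shell value in as an awk variable with -v ...
with_v=$(printf '%s\n' "$data" | awk -v who="$user" '$1 == who { print $2 }')

# ... or put it in the environment of awk and read it via ENVIRON.
with_env=$(printf '%s\n' "$data" | WHO=bob awk '$1 == ENVIRON["WHO"] { print $2 }')

printf '%s %s\n' "$with_v" "$with_env"   # 1 2
```

Either way the awk script stays inside one pair of single quotes, which avoids the complex quoting.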

In addition there are a slew of standard functions for math (int, exp, log, sqrt, rand, etc.) and for string manipulation (see the man page for details):

split(string, array [,ERE])       Splits the string into fields, storing each in the array.  If you don’t specify an ERE to use, the current value of FS is used instead to split the fields.  Note the array indexes start with 1, not 0.

sub(ERE, repl [,string]) Searches through string for the first occurrence of the ERE and replaces it with the text repl.  If the string is omitted $0 is used.  This works like the sed command s/ERE/repl/, except no back-references.  (Can use an “&” in repl to mean the text that matched.)

gsub(ERE, repl [,string])         The same as sub, but every occurrence is replaced.

length([string])     The length in characters (not bytes!) of the string, or $0 if string is omitted.

index(string, substring) Returns the position in string of substring, or zero if it doesn’t occur.

match(string, ERE)       Returns the position of ERE in string, or zero if it doesn’t occur.  This function also sets the variables RSTART and RLENGTH to the starting position in string that matched the ERE, and the length in characters of the matched text.

substr(string, start [, length])  Returns the substring of string starting at position start, of length characters (if there are that many).  If length is omitted, the rest of the string is returned.  Note the first character has position 1, and not 0.  (It is common to use RSTART and RLENGTH with substr, right after using match().)

getline    This function has several forms.  It reads the next record of data, or reads from a specified file, or from the output of a pipeline, into either $0 (resetting NF etc.) or into some specified variable.

tolower(string), toupper(string), sprintf(format, args)    These have the obvious meanings.

close(stream)        Close is used when printing to a file or pipe, or using getline from a file or pipe.
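Several of the functions above are exercised in the following sketch.  The sample string and the variable names are made up, and only POSIX functions are used.  The last three lines of the script show one form of getline, reading the output of a command.

```shell
out=$(awk 'BEGIN {
    s = "2024-01-15 error: disk full"

    n = split("a:b:c", part, ":")         # indexes start at 1
    print n, part[1], part[3]

    print index(s, "error")               # position of a fixed substring

    if (match(s, /[a-z]+:/))              # sets RSTART and RLENGTH
        print RSTART, RLENGTH, substr(s, RSTART, RLENGTH)

    t = s
    sub(/error/, "[&]", t)                # & stands for the matched text
    print t
    n = gsub(/l/, "L", t)                 # returns the number of replacements
    print n, t

    print length(s), toupper(substr(s, 12, 5))

    cmd = "echo one; echo two"
    while ((cmd | getline line) > 0) print "got " line
    close(cmd)
}')
printf '%s\n' "$out"
```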

Gnu awk also has the useful asort (sort an array by values), asorti (sort an array’s indexes), and gensub (like gsub with back-references; note the double backslash):

$ echo abc |awk '
{ s = gensub(/.*(b).*/, "x\\1y", "g"); print s }'

xby

Or to swap “<a,b>” to “<b,a>”:

$ echo '<a,b>' |awk '
 {s=gensub(/<(.*),(.*)>/,"<\\2,\\1>", 1); $0=s};1'

<b,a>

The match function also supports capture groups in Gnu awk.

Note that unlike sub and gsub, gensub doesn’t modify the string (optional 4th argument, $0 by default) but returns the result.  Also note the need to double up the backslashes in the replacement string (do that with sub and gsub too).

Examples are the best way to learn.  Awk one-liners are a handy way to solve tasks that would otherwise require long pipelines of sed, cut, sort, etc.  Try to convert your other filter pipelines to awk.  Examples:

Print and sort the login names of all users (from /etc/passwd):

BEGIN { FS = ":" }; { print $1 | "sort" }

Count lines in a file:     END { print NR }

Common AWK idiom: Work like sed (change lines that match, and output all):
  
awk ' /.../ { ... };1'
(The “1” is a pattern that matches anything, and the default action of print is done.  A non-zero integer expression for the pattern means “true”.)
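A concrete instance of this idiom, on made-up data: lines starting with foo have their second field multiplied by 10, and every line (changed or not) is printed.

```shell
out=$(printf 'foo 1\nbaz 2\n' | awk '/^foo/ { $2 = $2 * 10 } 1')
printf '%s\n' "$out"
# foo 10
# baz 2
```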

Precede each line by its number in the file:  { print FNR, $0 }

Print the first and fifth fields from /etc/passwd in the opposite order:

BEGIN { FS = ":" }; { print $5, $1 }
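To try the two /etc/passwd examples with predictable results, they can be run against a small sample in the same format.  The two entries below are made up, not taken from a real system.

```shell
passwd_sample='root:x:0:0:Super User:/root:/bin/sh
alice:x:1000:1000:Alice Smith:/home/alice:/bin/bash'

# Sorted login names (field 1).
logins=$(printf '%s\n' "$passwd_sample" |
    awk 'BEGIN { FS = ":" } { print $1 | "sort" }')

# Fifth field, then first field.
names=$(printf '%s\n' "$passwd_sample" |
    awk 'BEGIN { FS = ":" } { print $5, $1 }')

printf '%s\n---\n%s\n' "$logins" "$names"
# alice
# root
# ---
# Super User root
# Alice Smith alice
```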

Summarize data.  (Show interrupt_cnt.awk.)

Process INI files with awk: validate, display sorted list of name=values. (Show cfg.awk.)

Address book lookup script, with fields defined for first, last, phone, etc.

Users logged in, but not from HCC.  (Process who output.)

Reformat the output of some command.  (Modify df output to skip %used column, and any non-disk lines.  Show df.awk and df2.awk.)

Print the IP address for a given host.  Parse the output of nslookup:

  nslookup foo.com |awk '$4~/^foo.com/{print $6}' RS=''

(Parsing the output of “host -t a wpollock.com” might be easier.)

Process Apache error log to find the top 10 “File does not exist” files and the web page referrer (with the bad link):  (Convert this sed script to awk)

$ sed -e \
'/File does not exist/s/^.* exist: \(.*\)$/\1/' \
-e '/, referer/s/^\(.*\), referer.*$/\1/' error_log \
|sort |uniq -c |sort -nr |head

 

There are some incompatibilities between Gnu AWK and POSIX AWK.  If necessary you can make “awk” an alias for “gawk --posix”, and use gawk when you want to use the Gnu extensions.