John Lekberg


AWK

Using the AWK command line tool

I have some thoughts about the tool AWK. I wrote a bit of an essay a while back, so I'll put that here.


Using AWK in 2019

(This essay is a work in progress.)

AWK ("Aho Weinberger Kernighan") is a scripting language built at Bell Labs in 1977 by Al Aho, Peter Weinberger, and Brian Kernighan to address the problems they had with the tools of the time: grep, sed, shell, and C. More than 40 years have passed since AWK was created, and many of the problems that Aho, Weinberger, and Kernighan faced are better solved by modern tools like Python. How should AWK be used in 2019?

First, we need to ask "what are appropriate uses of AWK in 2019?" That means looking at the sorts of situations AWK could be used in, and then comparing it to alternative tools (like grep and sed). Once we know when we should be using AWK, we can ask "how do I effectively use AWK in 2019?" That means knowing what documentation for AWK is out there, what the common pitfalls are, and what implementations of AWK exist.

AWK has an implicit behavior that it follows:

for each file
  for each input line
    for each pattern
      if pattern matches input line
        do the action

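Now, this behavior can be changed slightly (using the RS variable, the record separator). For example, AWK can match patterns against chunks of text instead of single lines. The general case is plain text records delimited by a pattern of text that can be matched by a regular expression (GNU AWK and mawk accept a regular expression in RS; POSIX only specifies a single character, or the empty string for blank-line-separated paragraphs).

This implicit behavior removes some boilerplate code. For example, here is an AWK script: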

$3 > 5

and a corresponding Python script:

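import sys

for line in sys.stdin:
    fields = line.split()
    # Skip lines with fewer than three fields instead of crashing on them.
    if len(fields) >= 3 and float(fields[2]) > 5:
        # The line already ends with a newline, so don't add another.
        print(line, end="")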

However, the removed boilerplate is a constant factor. For scripts that are dozens or hundreds of lines long, the benefit of removing this boilerplate is negligible, and the downsides of AWK (lack of classes, inability to make multiple passes over data, difficulty keeping track of lots of state) become more problematic. This same removal of boilerplate is what makes AWK extremely useful for one-liners.
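As a sketch of such a one-liner, here is the RS variable mentioned earlier put to work. Setting RS to the empty string makes AWK treat each blank-line-separated paragraph as one record. The function name paragraphs_matching and the sample input are made up for illustration.

```shell
# Hypothetical helper: print every blank-line-separated paragraph that
# contains the given word. RS = "" switches AWK to paragraph records.
paragraphs_matching() {
    awk -v word="$1" 'BEGIN { RS = "" } index($0, word) { print; print "" }'
}

# Prints the second paragraph (the lines "gamma" and "delta").
printf 'alpha\nbeta\n\ngamma\ndelta\n' | paragraphs_matching gamma
```

The implicit loop is unchanged here; only the definition of a record is different.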

Aho, Weinberger, and Kernighan intended AWK to be used for simple data processing and analysis of plain text data. Tasks like:

They also included features to make one-liners possible, because Kernighan wanted to write the scripts "where you know in your heart it's one line long". Input is read automatically across multiple files, and lines are split into fields. Variables can contain string or numeric values, so text and numbers can be mixed freely. Variables don't have to be declared; their type is determined by context and use. An uninitialized variable acts as 0 in numeric contexts and as the empty string in string contexts. There are built-in variables for frequently used values, and operators work on strings or numbers.
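A minimal sketch of these features working together, assuming whitespace-separated input with a number in the second field (the function name sum_second_field is hypothetical):

```shell
# Hypothetical helper: sum the second field across all input.
# "total" is never declared; it starts out as 0. The fields are split
# automatically, and the number is mixed freely with the label text.
sum_second_field() {
    awk '{ total += $2 } END { print "total:", total + 0 }' "$@"
}

# Prints "total: 7.5".
printf 'a 1\nb 2.5\nc 4\n' | sum_second_field
```

File names given as arguments are read one after another; with no arguments the function reads standard input.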

Perl 5 actually includes all of this behavior (it was built to supersede AWK), and it can be used for one-liners as well. When perl is available on the system and you and your team are comfortable using it, it is a superior choice for a few reasons: it provides more powerful regular expressions (Perl's own syntax vs. POSIX ERE); some people report that it is faster; and it can also provide sed-like operations. However, AWK is a POSIX standard tool, so if you are building shell scripts that need to be portable to POSIX systems, AWK is a better choice. Additionally, because the AWK language is so small and simple compared to Perl, it can be easier to convince team members to allow its use in a project: introducing it into the ecosystem doesn't carry as much baggage as a full-scale programming language like Perl.

So, if you are writing a shell script and you want to include a one-liner in a pipeline, how do you compare AWK to grep and sed? The approach we will take is to briefly explain the tools and understand where they overlap and where they don't.

AWK is a tool for matching patterns of text in records. Grep is a tool for searching for patterns of text and extracting matches. Grep has a default behavior that happens every time a pattern is matched, which is to print the matching line; this is also AWK's default action for a pattern match. However, grep's default actions and patterns can be configured: the -i flag allows case-insensitive matching on all patterns. This could be simulated in AWK by wrapping every text pattern in tolower($0) ~ /.../ (with the pattern itself written in lower case). Grep's -F flag does fixed string matching (meaning there are no special metacharacters). This could be done in AWK by manually escaping the regex metacharacters in each string. For the output, grep can print out the number of the matching line using the -n flag. In AWK this would be done by having the action for each text pattern be { print NR, $0 }. Grep can also print out the names of files containing any matches (the -l flag), which would require a lot of extra code in POSIX AWK. Grep can also force the pattern to match a whole line of input using the -x flag. In AWK, this would mean wrapping a regular expression in /^...$/. So I guess the points from this are:
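The grep flags discussed in this section can be sketched in AWK roughly as follows. The function names and the pattern "error" are placeholders. For -F, this sketch swaps in index() instead of the manual escaping described above, which avoids regular expressions entirely; note that awk -v still interprets backslash escapes in the value.

```shell
# grep -i 'error': case-insensitive match (pattern written in lower case)
grep_i() { awk 'tolower($0) ~ /error/' "$@"; }

# grep -n 'error': prefix each matching line with its line number
grep_n() { awk '/error/ { print NR ":" $0 }' "$@"; }

# grep -x 'error': the pattern must match the whole line
grep_x() { awk '/^error$/' "$@"; }

# grep -F "$1": fixed-string match on standard input, using index()
grep_F() { awk -v s="$1" 'index($0, s)'; }

printf 'Error one\nerror\nno match\n' | grep_i
printf 'a.b\naxb\n' | grep_F 'a.b'
```

Each of these is noticeably longer than the grep flag it imitates, which is the practical argument for reaching for grep when a flag already covers the job.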

Sed is a tool for performing a few types of editing commands on streaming data. Sed uses basic regular expressions by default, while AWK uses extended regular expressions. Sed has a y command, which behaves like the tr utility; AWK has no equivalent, and although it can be emulated I would discourage that. Sed also has shorter editing commands. AWK can deal with more arbitrary patterns than grep or sed (like mathematical patterns, or patterns based on previous state: "have I seen this before?"). Sed can substitute the nth occurrence of a pattern, which POSIX AWK cannot do directly (to my knowledge; GNU AWK's gensub function can). Sed has no variables, so if any state needs to be tracked, AWK should be used. If the whole file is to be passed through with some text substitutions done on it, sed's s command is probably better than AWK.
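A small sketch of where each tool is the shorter spelling. The function names and the foo/baz strings are invented for illustration; the commands themselves are plain POSIX sed and AWK.

```shell
# Whole-stream substitution: sed's s command is the shorter spelling.
replace_all_sed() { sed 's/foo/baz/g'; }
replace_all_awk() { awk '{ gsub(/foo/, "baz"); print }'; }

# Replace only the 2nd occurrence on each line, using sed's numeric flag.
replace_second_sed() { sed 's/foo/baz/2'; }

# A pattern that depends on previous state: print a line only the
# first time it is seen. The "seen" array is the state being tracked.
first_occurrences() { awk '!seen[$0]++'; }

printf 'foo bar foo\n' | replace_all_sed      # baz bar baz
printf 'foo bar foo\n' | replace_all_awk      # baz bar baz
printf 'foo bar foo\n' | replace_second_sed   # foo bar baz
printf 'a\nb\na\nc\nb\n' | first_occurrences  # a, b, c on separate lines
```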

Compared to the shell, AWK can handle math much better.
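To make that concrete: POSIX shell arithmetic is integer-only, while AWK does its arithmetic in floating point. The average function below is a hypothetical helper, assuming one number per input line.

```shell
# Shell arithmetic truncates: this prints 3.
echo $((7 / 2))

# AWK uses floating point: this prints 3.5.
awk 'BEGIN { print 7 / 2 }'

# Hypothetical helper: average of the numbers on stdin, one per line.
# Prints nothing for empty input rather than dividing by zero.
average() { awk '{ sum += $1 } END { if (NR > 0) print sum / NR }'; }

# Prints 1.5.
printf '1\n2\n' | average
```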

Section Under Construction (below).

One key to using AWK effectively is knowing what resources are out there. Physical books: there is the "sed & awk" book from O'Reilly (as well as a pocket reference), and there is "The AWK Programming Language" by Aho, Weinberger, and Kernighan. There is also "Effective AWK Programming", which covers AWK and GNU AWK and is available both as a web page and as a physical book. Web resources: the man page for AWK is pretty good, and the POSIX standard, which specifies AWK's behavior, is available online.

GNU AWK includes a debugger, which is modeled after GDB. GNU AWK also includes a profiler. GNU AWK includes a linter to check for features not available in the original AWK or nonportable features. Even when I just use POSIX AWK, I will use this tooling available in GNU AWK. Even though I recommend AWK only for one liners, if you're maintaining a legacy script, these tools can be useful.

Some versions of AWK out there are awk, nawk, gawk, mawk, and awka. I should figure out which are used in which situations.