Awk
Awk is a programming language designed for text processing and data extraction. It was created in the 1970s and remains widely used today for tasks such as filtering and transforming text data, generating reports, and performing basic calculations. Awk is known for its simplicity and versatility, making it a popular tool for Unix system administrators and data analysts.
Invocation
We can pass an awk program directly on the command line, or we can reference .awk files for more elaborate scripts:
# CLI (input files are separated by spaces, not commas)
awk [program] file1 file2 file3
# Script file
awk -f [ref_to_script_file] file1 file2 file3
We can also pipe input to it. This command receives the output of echo and prints the value of the last field of each record:
echo -e "1 2 3 5\n2 2 3 8" | awk '{print $(NF)}'
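As a minimal sketch of the script-file form, the snippet below writes a one-rule program to a file and runs it with -f. The file name last_field.awk is hypothetical, chosen only for illustration, and printf stands in for echo -e so the input looks the same in any POSIX shell.

```shell
# Write a one-rule program to a script file (hypothetical name).
cat > last_field.awk <<'EOF'
# Print the last field of every record.
{ print $NF }
EOF

# Run it with -f; the input arrives through the pipe.
printf '1 2 3 5\n2 2 3 8\n' | awk -f last_field.awk
# 5
# 8
```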
Syntactic structure
awk is a line-oriented language. An awk program consists of a sequence of pattern-action statements and optional function definitions.
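To make that structure concrete, here is a small sketch with one function definition and two pattern-action statements. The function name shout and the three input words are assumptions made for the example, not part of any standard library.

```shell
printf 'cloud\nsky\nbookworm\n' | awk '
# A user-defined function (the name shout is hypothetical).
function shout(s) { return toupper(s) "!" }

# Rule 1: the pattern is a regex, the action prints the transformed line.
/o/ { print shout($0) }

# Rule 2: the pattern is a comparison, the action prints a note.
length($0) == 3 { print $0, "is short" }
'
# CLOUD!
# sky is short
# BOOKWORM!
```

Every rule is tried against every input line, in the order the rules appear in the program.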
For most of the examples we will use this list as the input:
cloud
existence
ministerial
falcon
town
sky
top
bookworm
bookcase
war
Peter 89
Lucia 95
Thomas 76
Marta 67
Joe 92
Alex 78
Sophia 90
Alfred 65
Kate 46
awk lends itself particularly well to input that is structured by whitespace or in columns, such as the output of commands like ls and grep.
Patterns and actions
The basic structure of an awk script is as follows:
pattern {action}
A pattern is the condition each input line is tested against. It can be a regex (including a plain literal string such as /Joe/) or an expression such as a comparison. The action is what you want to execute for each line of the input that matches the pattern.
The following script prints the line that matches Joe:
awk '/Joe/ {print}' list.txt
/Joe/ is the pattern and {print} is the action.
Lines, records, fields
When awk receives a file it divides the input into records; by default, each line is one record. Each record is then broken up into a sequence of fields.
The fields are accessed by special variables: $1 reads the first field, $2 reads the second field and so on. The variable $0 refers to the whole record.
So, in a record such as cloud existence ministerial, the three words correspond to $1, $2 and $3 respectively.
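The field variables can be tried on a single record. This is only a sketch: the labels printed before each value are there for readability and are not part of awk itself.

```shell
echo "cloud existence ministerial" | awk '{
    print "first:  " $1
    print "second: " $2
    print "third:  " $3
    print "whole:  " $0
}'
# first:  cloud
# second: existence
# third:  ministerial
# whole:  cloud existence ministerial
```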
Basic examples
Match a pattern
awk '/book/ { print }' list.txt
# bookworm
# bookcase
Print all words that are longer than five characters
awk 'length($1) > 5 { print $0 }' list.txt
For the first field of every line (we only have one field per line), if it is longer than 5 characters, print the line. $0 refers to the whole record, so print $0 outputs the entire matching line.
We actually don’t need to include the { print $0 } action, as this is the default behaviour. We could have just written awk 'length($1) > 5' list.txt
Print all words that do not have three characters
awk '!(length($1) == 3)' list.txt
Here we negate the pattern by prepending ! and wrapping it in parentheses.
Return words that are either three characters or four characters in length
awk '(length($1) == 3) || (length($1) == 4)' list.txt
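Here we use the logical OR (||) to match against more than one condition. Notice that whenever we use a Boolean operator such as NOT or OR, we wrap each condition in parentheses. With ! the parentheses are necessary, because ! binds more tightly than ==; with || they are optional, but they keep the pattern readable.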
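The effect of the parentheses can be checked with a short sketch. The three-word input is an assumption made for the example; the patterns are the ones used above, with and without parentheses.

```shell
# Without parentheses, ! applies to length($1) alone:
# (!length($1)) == 3 is never true here, so nothing is printed.
printf 'sky\ntown\ncloud\n' | awk '!length($1) == 3'

# With parentheses, the whole comparison is negated.
printf 'sky\ntown\ncloud\n' | awk '!(length($1) == 3)'
# town
# cloud

# With ||, the parentheses can be dropped without changing the result.
printf 'sky\ntown\ncloud\n' | awk 'length($1) == 3 || length($1) == 4'
# sky
# town
```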
Match and string-interpolate the output
awk 'length($1) > 0 {print $1, "has", length($1), "chars"}' list.txt
# cloud has 5 chars
# existence has 9 chars
# ministerial has 11 chars
# ...
Match against a numerical property
awk '$2 >= 90 { print $0 }' scores.txt
# Lucia 95
# Joe 92
# Sophia 90
This returns the records whose second field is a number greater than or equal to 90.
Match a field against a regular expression
awk '$1 ~ /^[bc]/ {print $1}' words.txt
This matches every line whose first field ($1) begins with ‘b’ or ‘c’.
The tilde (~) is the regex match operator. Its right-hand side should be a regex; to compare a field against an exact string, use == instead.
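The difference between ~ and == can be seen side by side in the sketch below. The four input words and the output labels are assumptions made for the example.

```shell
printf 'bookworm\ncloud\nsky\nbook\n' | awk '
# ~ tests whether the field matches the regex anywhere (here: at the start).
$1 ~ /^book/ { print "regex match:", $1 }

# == tests whether the field is exactly equal to the string.
$1 == "book" { print "exact match:", $1 }
'
# regex match: bookworm
# regex match: book
# exact match: book
```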
Syntactic shorthands
- For a statement like awk 'length($1) > 5 { print $0 }' list.txt, we don’t actually need to include the { print $0 } action, as this is the default behaviour and is implied. We could have just written awk 'length($1) > 5' list.txt.
Built-in variables
NF
The value of NF is the number of fields in the current record. Awk automatically updates the value of NF every time it reads a record.
No matter how many fields there are, the last field in a record can always be represented by $NF.
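A quick sketch of NF and $NF on records of different lengths; the three input lines are made up for the example.

```shell
# Print the field count, then the last field, for each record.
printf 'a b c\nd e\nf\n' | awk '{ print NF, $NF }'
# 3 c
# 2 e
# 1 f
```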
NR
NR represents the number of records read so far. Awk increments it each time it reads a record, so inside an action it holds the number of the current record.
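The following sketch uses NR to number each line. It also uses an END rule, which runs once after all input has been read; END is not covered elsewhere in these notes, and is included here only to show NR holding the final record count.

```shell
printf 'cloud\nsky\ntop\n' | awk '
{ print NR ": " $0 }          # NR is the current record number
END { print "total:", NR }    # after the last record, NR is the total
'
# 1: cloud
# 2: sky
# 3: top
# total: 3
```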
FS
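FS represents the field separator. The default is a single space, which awk treats specially: fields are split on any run of spaces or tabs, and leading and trailing whitespace is ignored. We can specify a different separator with the -F flag, e.g. to separate by comma: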
awk -F, '{print $1 }' list.txt
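As a sketch, the separator can be set either with -F or by assigning FS in a BEGIN rule, which runs before any input is read. The comma-separated input is an assumption made for the example, since the sample lists in these notes are separated by spaces.

```shell
# Set the separator with the -F flag.
printf 'Peter,89\nLucia,95\n' | awk -F, '{ print $1 }'
# Peter
# Lucia

# Set the same separator by assigning FS before the first record is read.
printf 'Peter,89\nLucia,95\n' | awk 'BEGIN { FS = "," } { print $2 }'
# 89
# 95
```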