DataScience Workbook / 02. Introduction to the Command Line / 3. Useful Text Manipulation Programs / 3.3 AWK – advanced text processing


Awk can be a powerful general-purpose scripting language to perform advanced text processing. It is a command-line tool and is particularly useful for working with structured data, such as tables or files with columns of data separated by delimiters. To understand awk more, let us create a file marksheet.txt.

In the command below, we are creating a new file named marksheet.txt and entering some data into it using the cat command.

The command cat > marksheet.txt below is used to create a file named marksheet.txt. When this command is executed, the shell waits for input from the user. Here, we can then enter the data line by line, pressing Enter/return key after each line. In this example, we are entering four lines of data along with one empty line. The fourth line is empty.

$ cat > marksheet.txt
Allan Lamb 78
Graeme Hick 56
David Gower 83

Graham Gooch 43

After entering Graham Gooch 43, press the Enter/return key and then enter Ctrl-C to return to the $ terminal prompt. We can now see what the marksheet.txt file holds:

$ cat marksheet.txt
Allan Lamb 78
Graeme Hick 56
David Gower 83

Graham Gooch 43

cat-like (Default) behavior of Awk:

$ awk '{print}' marksheet.txt
Allan Lamb 78
Graeme Hick 56
David Gower 83

Graham Gooch 43

Understanding Records and Fields in Awk

In the context of Awk, it’s vital to understand two core concepts: records and fields. These two concepts form the basis for how Awk processes and manipulates data.

Records in Awk

In Awk, a record is typically a line of input. The default record separator in Awk is the newline character, meaning that by default, Awk considers each line it reads from a file or from its standard input as a separate record.

Think of a record like a row in a spreadsheet or a database. Each record can contain multiple pieces of data.

Why is this concept significant? Awk is often used to process structured data, such as tables or comma-separated value (CSV) files. Each line in such a file typically represents a separate entity (like a user, a transaction, a product, etc.) Thus, each line is a record that Awk will process individually.

Fields in Awk

Now, let us talk about fields. Each record can be further split into smaller units called fields. A field is like a column in a spreadsheet or database.

By default, Awk treats any whitespace (space or tab) as a field separator. That means if you have a line (or record) like Allan Lamb 78, Awk will split it into three fields:

  • Field 1: Allan
  • Field 2: Lamb
  • Field 3: 78

You can access these fields within your Awk script using the dollar sign $ followed by the field number. For example, $1 would refer to Allan in the example above.

Understanding fields is crucial because it allows you to manipulate and process specific portions of your data. For example, you could write an Awk script to process only the third field of each record, ignoring the others. Or you could rearrange the fields in your output, or combine them in new ways.

In a nutshell, the concepts of records and fields underpin the power and flexibility of Awk as a tool for text processing. Once you understand these concepts, you can write more complex and useful Awk scripts.

The file marksheet.txt has five lines including an empty line. In the example below, each record has three words separated by a single space character. We can access the first and third fields of the file by,

$ awk '{print $1,$3}' marksheet.txt
Allan 78
Graeme 56
David 83

Graham 43

The first field (First name) and the third field (Marks scored) are returned. The empty line is also returned as it is also a record.

grep-like Awk

$ awk '/G/ {print}' marksheet.txt
Graeme Hick 56
David Gower 83
Graham Gooch 43

There are three records with the letter G. Now if we needed all Gs in the beginning of a record or line, then we might do:

awk '/^G/ {print}' marksheet.txt
Graeme Hick 56
Graham Gooch 43

There are just two records in this case.

Doing Arithmetic

Let us get the average marks scored by the candidates. For that we do:

$ awk '{sum+=$3} END {print sum}' marksheet.txt

We can get the average scores by:

$ awk '{sum+=$3} END {print sum/NR}' marksheet.txt

There are a couple of things to notice. One, the introduction of a new variable NR; and two, the output is incorrect. The average should be 65 and not 52. The NR command stores the number of current records, which in our example is five, including the empty line and thus explains the error in the output. To remove the empty line:

$ awk NF marksheet.txt > marksheet2.txt

The NF variable contains the number of fields in a line and is positive when the line is non-empty and prints it. Awk does nothing when the line is empty.

$ cat marksheet2.txt
Allan Lamb 78
Graeme Hick 56
David Gower 83
Graham Gooch 43


$ awk '{sum+=$3} END {print sum/NR}' marksheet2.txt

Sometimes the field separator could be a tab character. So, when we do

$ awk 'OFS="," {print $1, $2, $3}' marksheet2.txt > marksheet3.csv

we use OFS to set the output field separator to be a tab character. Looking at marksheet3.tsv, where tsv is tab separated value,

$ cat marksheet3.csv

So, doing

$ awk '{print $1}' marksheet3.csv

returns no output. That is because the default separator is a blank space. To set the field separator on the command line, we do

$ awk -F ',' '{print $1}' marksheet3.csv

Built-In Variables In Awk

FS: FS command contains the field separator character which is used to divide fields on the input line. The default is “white space”, meaning space and tab characters. FS can be reassigned to another character (typically in BEGIN) to change the field separator.

RS: RS command stores the current record separator character. Since, by default, an input line is the input record, the default record separator character is a newline.

OFS: OFS command stores the output field separator, which separates the fields when Awk prints them. The default is a blank space. Whenever print has several parameters separated with commas, it will print the value of OFS in between each parameter.

ORS: ORS command stores the output record separator, which separates the output lines when Awk prints them. The default is a newline character. print automatically outputs the contents of ORS at the end of whatever it is given to print. ___

Further Reading

Homepage Section Index Previous Next top of page