AWK Scripting Language

Praveen Alex Mathew
4 min read · Jul 4, 2020

Awk is an interpreted scripting language used for text manipulation. It is available by default in most Linux and Unix distributions.

awk command line

Awk divides the input file into records, and each record is divided into fields.

The awk command is as follows:

> awk '<the script to run in quotes>' input_file

The quotes around the script make the shell treat the entire script as a single argument to the awk command. Awk runs the script on each record in input_file unless the script specifies otherwise, for example with a pattern that selects only some records.
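As a minimal sketch of the record and field split, the snippet below feeds two made-up lines (not part of the sample dataset) to awk and prints, for each record, the number of fields and the first field. NF is awk's built-in field count; the shell variable name fields is only an illustration.

```shell
# Two hypothetical input lines. Awk treats each line as one record
# and, by default, splits it into fields on whitespace.
fields=$(printf 'alpha beta gamma\none two\n' | awk '{print NF ": " $1}')

# One output line per record: "<number of fields>: <first field>"
echo "$fields"
```

The first record has three fields and the second has two, so the script body ran once per record.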

Download the Sample Input File

Sample dataset used in this blog: https://gist.github.com/PraveenMathew92/fccec9b1f16fe4a776f4148e7bf82b03

The input file used in this blog, weather-data.csv, is available on GitHub. Download it to follow along; it won't take long.

> cd /tmp
> curl -O https://gist.githubusercontent.com/PraveenMathew92/fccec9b1f16fe4a776f4148e7bf82b03/raw/059c95dbee2366ff995ddd5a6b63c6d6c6cb037f/weather-data.csv

This should create a new file, weather-data.csv, in /tmp on your machine, containing the sample data.

Run awk Scripts on the Input File

For the csv file weather-data.csv, the contents can be displayed with the cat command.

> cat weather-data.csv
timestamp,Temperature,Precipitation
20200622T0000,23.907972,0.5
20200622T0100,24.277971,1.2
20200622T0200,24.517971,0.0
20200622T0300,23.917973,0.0

Let's try to do the same with awk.

Print the file (similar to cat) using awk

> awk '{print}' weather-data.csv
timestamp,Temperature,Precipitation
20200622T0000,23.907972,0.5
20200622T0100,24.277971,1.2
20200622T0200,24.517971,0.0
20200622T0300,23.917973,0.0

Now, let's try something more.

Removing headers in a csv

The headers may be removed to obtain only the contents of the file.

> awk 'FNR>1 {print}' weather-data.csv
20200622T0000,23.907972,0.5
20200622T0100,24.277971,1.2
20200622T0200,24.517971,0.0
20200622T0300,23.917973,0.0

FNR is the record number within the current file, so FNR>1 skips the first record (the header). Another variable, NR, is the total number of records read so far across all input files.
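The difference between NR and FNR only shows up with more than one input file. A small sketch, assuming two throwaway files created with mktemp (their contents are invented for illustration):

```shell
# Two small temporary files with hypothetical contents.
a=$(mktemp)
b=$(mktemp)
printf 'h1\nx\ny\n' > "$a"
printf 'h2\nz\n' > "$b"

# NR keeps counting across files; FNR restarts at 1 for each file.
counts=$(awk '{print NR, FNR, $0}' "$a" "$b")
echo "$counts"

rm -f "$a" "$b"
```

The second file's first line is reported with NR equal to 4 but FNR equal to 1, which is why FNR>1 drops the header of every file while NR>1 would drop only the very first line read.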

Passing a value to print

When print is given no arguments, it prints $0, which refers to the entire record. The following command gives the same output as the one above:

> awk 'FNR>1 {print $0}' weather-data.csv
20200622T0000,23.907972,0.5
20200622T0100,24.277971,1.2
20200622T0200,24.517971,0.0
20200622T0300,23.917973,0.0

To refer to a field in the record individually, use a dollar sign followed by the position of the field in the record. For instance, the first field in the record can be referenced with $1. More on this later.

Setting the Field Separators and the Record Separators

In awk, the FS and RS variables store the field separator and the record separator respectively. The default value of RS is the newline character, while the default FS is a single space, which awk treats as "split on runs of whitespace". For CSV files, however, the field separator needs to be ",". The value of FS can be set with the -F option in awk.

> awk -F, 'FNR>1 {print $1}' weather-data.csv
20200622T0000
20200622T0100
20200622T0200
20200622T0300

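Now the fields are split at "," instead of whitespace. The command below gives the same result, because a variable assignment placed before a file name on the command line takes effect before that file is read: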

> awk 'FNR>1 {print $1}' FS="," weather-data.csv
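RS can be changed in the same way as FS. The sketch below uses a made-up input string (not the weather data) in which records are separated by ";" and fields by ","; the -v option assigns an awk variable before any input is read.

```shell
# Hypothetical input: three records separated by ";", two fields each.
# -F, sets FS to "," and -v RS=';' sets the record separator to ";".
second=$(printf '1,2;3,4;5,6' | awk -F, -v RS=';' '{print $2}')

# Prints the second field of each record, one per line.
echo "$second"
```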

Example: show only the Temperature for each timestamp

> awk -F, 'FNR>1 {print $1,$2}' weather-data.csv
20200622T0000 23.907972
20200622T0100 24.277971
20200622T0200 24.517971
20200622T0300 23.917973
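The values above are separated by a space because a comma inside print inserts the output field separator, OFS, which defaults to a single space. If the output should stay comma-separated, OFS can be set as well. A minimal sketch, using a temporary copy of the first rows of the sample data so that it runs without the download:

```shell
# Temporary file holding the header and the first two rows of the sample data.
sample=$(mktemp)
cat > "$sample" <<'EOF'
timestamp,Temperature,Precipitation
20200622T0000,23.907972,0.5
20200622T0100,24.277971,1.2
EOF

# -F, splits the input on commas; -v OFS=, joins the printed fields with commas.
pairs=$(awk -F, -v OFS=, 'FNR>1 {print $1, $2}' "$sample")
echo "$pairs"

rm -f "$sample"
```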

Regex in Awk

The command to filter the records in input_file that match a regex:

> awk '/<regex enclosed in forward slashes>/' input_file

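A regex can also be tested against a single field with the "~" (matches) and "!~" (does not match) operators.

For instance, to get all the temperatures that are in the 23° range, match the second field against a regex anchored at the start of the field. An unanchored /23/ would also match any value that merely contains "23" somewhere in its digits.

> awk -F, '$2 ~ /^23\./' weather-data.csv
20200622T0000,23.907972,0.5
20200622T0300,23.917973,0.0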

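On the other hand, to get the temperatures not in the 23° range, use "!~". The FNR>1 condition is needed here because the header line does not match the regex either and would otherwise be printed.

> awk -F, 'FNR>1 && $2 !~ /^23\./' weather-data.csv
20200622T0100,24.277971,1.2
20200622T0200,24.517971,0.0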
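How much the exact regex matters can be seen with a small sketch. The two rows below are invented for illustration: the second is a 24° reading whose digits happen to contain "23". An unanchored pattern matches both rows, while a pattern anchored to the start of the field matches only the first.

```shell
# Hypothetical rows: id, temperature, precipitation.
rows='t1,23.907972,0.5
t2,24.230000,0.0'

# Unanchored: matches "23" anywhere in the second field.
loose=$(printf '%s\n' "$rows" | awk -F, '$2 ~ /23/ {print $1}')

# Anchored: the field must start with "23." to match.
strict=$(printf '%s\n' "$rows" | awk -F, '$2 ~ /^23\./ {print $1}')

echo "unanchored matches:"
echo "$loose"
echo "anchored matches:"
echo "$strict"
```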

Begin and End Block

Unlike the rest of the script, which is executed for each record in the file, the BEGIN and END blocks are executed only once: BEGIN before any input is read, and END after all input has been processed. These are the start-up and the clean-up blocks.

> awk 'BEGIN {print "Begin Block"} END {print "End Block"} {print}' weather-data.csv
Begin Block
timestamp,Temperature,Precipitation
20200622T0000,23.907972,0.5
20200622T0100,24.277971,1.2
20200622T0200,24.517971,0.0
20200622T0300,23.917973,0.0
End Block

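The BEGIN block is also a convenient place for configuration, such as setting FS before any input is read, and the END block can print results collected while the records were processed.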
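As a sketch of both uses together, the command below sets FS in the BEGIN block, adds up the Temperature column for every data row, and prints the average in the END block. The variable names sum, n and avg are illustrative, and the data is a temporary copy of the sample rows shown earlier so the snippet runs on its own.

```shell
# Temporary copy of the sample data used in this post.
sample=$(mktemp)
cat > "$sample" <<'EOF'
timestamp,Temperature,Precipitation
20200622T0000,23.907972,0.5
20200622T0100,24.277971,1.2
20200622T0200,24.517971,0.0
20200622T0300,23.917973,0.0
EOF

# BEGIN: configure the field separator once, before reading input.
# Main block: skip the header, accumulate the temperature and count rows.
# END: print the average, guarding against an empty file.
avg=$(awk 'BEGIN {FS=","} FNR>1 {sum += $2; n++} END {if (n) printf "%.2f\n", sum/n}' "$sample")
echo "$avg"

rm -f "$sample"
```

For the four sample rows this prints 24.16.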
