The awk command

Introduction

• Awk is a programming language used for manipulating data and generating reports. The data may come from standard input, one or more files, or as output from a process.

• Awk can be used at the command line for simple operations, or it can be written into programs for larger applications.

• Awk scans a file (or other input) line by line, from the first line to the last, searching for lines that match a specified pattern and performing the selected actions (enclosed in curly braces) on those lines.
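
The pattern-and-action cycle described above can be sketched with a three-line input fed through a pipe. The sample lines and the "matched:" label are made up for illustration and are not part of the original slides.

```shell
# Hypothetical sample input: three lines, two of which contain "alpha".
# awk reads each line in turn and runs the action only where the pattern matches.
result=$(printf '%s\n' 'alpha 10' 'beta 20' 'alphabet 30' |
  awk '/alpha/ { print "matched:", $0 }')
printf '%s\n' "$result"
```

The line "beta 20" is read but produces no output, because the pattern /alpha/ does not match it.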

• The name awk comes from the initials of its authors' last names: Alfred Aho, Peter Weinberger, and Brian Kernighan.

• There are several versions of awk: old awk, new awk (nawk), GNU awk (gawk), POSIX awk, and so on.

• Awk combines features of several filters, and it has two distinguishing features:

• 1. It can identify and manipulate individual fields in a line.

• 2. Unlike most other UNIX filters, it can perform computation.

• Further, awk accepts extended regular expressions (EREs) for pattern matching, and it has C-type programming constructs and several built-in variables and functions.
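
As a small illustration of the ERE support mentioned above, the sketch below uses grouping, alternation (|) and the + quantifier in a single pattern. The input names are hypothetical sample data.

```shell
# Hypothetical sample names. The ERE accepts "s" followed by one or more "h",
# or a capital "S", followed by "arma", anchored to the whole line.
result=$(printf '%s\n' 'sharma' 'Sarma' 'verma' 'shhharma' |
  awk '/^(sh+|S)arma$/ { print }')
printf '%s\n' "$result"
```

Only "verma" is rejected; the other three lines satisfy one of the two alternatives.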

awk Preliminaries

• The awk command follows the general syntax:

awk <options> 'selection_criteria { action }' <file(s)>

• Note the use of single quotes and curly braces.

• The selection_criteria (a form of addressing) filters the input and selects lines for the action component to act on. The action component is enclosed within curly braces.

• The selection_criteria and action together constitute an awk program, which is surrounded by a set of single quotes.

• These programs are often one-liners, though they can span several lines as well.

• Ex: to select the directors from the file emp.lst, the awk command is:

$ awk '/dir./ {print}' emp.lst
7898 | akash |dir. |mark. | 11/06/70 |9000

• Unlike other filters, awk uses a contiguous sequence of spaces and tabs as the default field delimiter. In the example below, the delimiter is changed to | with the -F option, and a comma separates the fields listed in the print statement.

$ awk -F"|" '/dir./ {print $2,$3,$4,$6}' emp.lst
akash dir. mark. 9000

• Fields in awk are numbered $1, $2, etc.

• Awk also addresses the entire line as $0.

• Ex: to display the number of records in the file e.lst:

$ awk '{print $0}' e.lst | wc -l
6

• The action section is represented by the statement { print }, which prints all the selected lines.

• If the selection_criteria is missing, the action applies to all lines of the file.

• If the action is missing, the entire selected line is printed.

• Either the selection_criteria or the action may be omitted, but not both, and the program must be enclosed within a pair of single quotes.

• All context patterns must be enclosed within a pair of /'s.

• The print statement, when used without any field specifiers, prints the entire line, though you can also use the variable $0 to indicate that explicitly.

• Since print is the default action of awk, there is no need to specify it if you want to print the entire line. All three forms are equivalent:

$ awk '/dir/' emp.lst
$ awk '/dir/ {print}' emp.lst
$ awk '/dir/ {print $0}' emp.lst

• For pattern matching, awk uses regular expressions of the egrep variety, with the same requirement that each expression be bounded on either side by a /. This lets you locate both 'sharma' and 'Sarma':

$ awk -F"|" '/[Ss]h*arma/' e.lst
9876 | sharma | mgr |product| 12/03/60 |15000
8888 | Sarma | dir.| sales | 05/09/60 |25000

• Awk also accepts a line address (a single line, or a pair giving a range) to select lines.

• Ex: to select lines 3 to 6 from a file, use the built-in variable NR to specify line numbers:

$ awk -F"|" 'NR==3,NR==6 {print NR, $2, $3,$6}' e.lst
3 akash dir. 9000
4 tiwary g.m 23000
5 kumar mgr 1500
6 Sarma dir. 25000

Formatting output with printf

• Awk uses the print and printf statements to write to standard output. print produces unformatted output.

• Ex: to print all fields except the 4th, we can assign an empty string to the field we don't want:

$ awk -F"|" '{ $4=""; print}' e.lst | head -2
2233 shukla g.m 12/12/52 20000
9876 sharma mgr 12/03/60 15000

• When placing multiple statements on a single line, use ; as their delimiter. print here is the same as print $0.

• With the C-like printf statement, you can use awk as a stream formatter. printf uses a quoted format specifier and a field list. The common format specifiers are:

%s - String
%d - Integer
%f - Floating-point number

• To produce formatted output from unformatted input, using a regular expression:

$ awk -F"|" '/[sS]h*arma/ {
> printf("%-20s %-12s %6d\n",$2,$3,$6) }' e.lst
sharma mgr 15000
Sarma dir. 25000
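
The three format specifiers listed earlier can be combined in one printf call. The sketch below is illustrative only: the two records are hypothetical, and the 15% figure is an arbitrary multiplier chosen to produce a fractional result for %f.

```shell
# Hypothetical records: name|salary.
# %-8s left-justifies the name, %6d right-justifies the integer,
# and %9.2f prints a floating-point value with two decimal places.
result=$(printf '%s\n' 'akash|9000' 'kumar|15000' |
  awk -F'|' '{ printf "%-8s %6d %9.2f\n", $1, $2, $2 * 0.15 }')
printf '%s\n' "$result"
```

Unlike print, printf does not add a newline on its own, which is why the format string ends with \n.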

The Logical And Relational Operators

• To print three fields for the directors and the managers, you can write each pattern-action pair on a separate line:

$ awk -F"|" '/director/ { printf "%-20s %-12s %d\n", $2,$3,$6}
> /manager/ { printf "%-20s %-12s %d\n", $2,$3,$6}' emp.lst

• But repeating the print action for each pattern can be tedious. Awk also provides the || and && logical operators, so both conditions can share one action:

$ awk -F"|" '$3==" mgr " || $3=="dir. " {
> printf("%-20s %-12s %6d\n",$2,$3,$6) }' e1.lst
akash dir. 9000
kumar mgr 15000

• If you want to print only those lines for persons who are neither directors nor managers, use the != and && operators:

$ awk -F"|" '$3!=" dir." && $3!=" mgr" {
> printf "%-20s %-12s %d\n", $2,$3,$6}' e1.lst

• When using the == and != operators for string matching, remember that they handle only fixed strings, not regular expressions.

• To match regular expressions, awk offers the ~ and !~ operators, which match and negate a match, respectively:

$ awk -F"|" '$3 ~/g.m/ {print}' e1.lst
2233 | shukla | g.m | sales | 12/12/52 | 20000
9876 | sharma |d.g.m|product| 12/03/60 | 15000
3456 | tiwary |g.m |product| 05/02/89 |23000

• The previous example prints the d.g.m's as well as the g.m's, since the pattern g.m is embedded in the larger string d.g.m.

• To avoid this, use the regular expression characters ^ and $. When a pattern is matched against a field, they anchor it to the beginning and the end of that field, respectively.

$ awk -F"|" '$3 ~/^g.m/ {print}' e1.lst
3456 | tiwary |g.m |product| 05/02/89 |23000
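
Using both anchors together gives an exact match on a field. The sketch below assumes hypothetical records whose fields carry no padding spaces (unlike the slides' sample files), and escapes the dots so that they match literal periods rather than any character.

```shell
# Hypothetical records without padding around the | delimiter.
# ^ and $ together require the whole third field to be exactly "g.m".
result=$(printf '%s\n' '2233|shukla|g.m' '9876|sharma|d.g.m' '3456|tiwary|g.m.a' |
  awk -F'|' '$3 ~ /^g\.m$/ { print $2 }')
printf '%s\n' "$result"
```

With padded fields, the pattern would also have to allow for the surrounding spaces.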

The relational and regular expression matching operators used by awk

Operator   Significance
<          less than
<=         less than or equal to
==         equal to
!=         not equal to
>=         greater than or equal to
>          greater than
~          matches a regular expression
!~         doesn't match a regular expression
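
As a brief illustration of the table, the sketch below combines a negated regular expression match with a numeric comparison. The three records are hypothetical and unpadded.

```shell
# Hypothetical records: name|designation|salary.
# Select people who are not general managers and earn at most 15000.
result=$(printf '%s\n' 'shukla|g.m|20000' 'akash|dir.|9000' 'kumar|mgr|15000' |
  awk -F'|' '$2 !~ /^g\.m$/ && $3 <= 15000 { print $1 }')
printf '%s\n' "$result"
```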

Number Processing

• Awk uses the arithmetic operators +, -, *, / and % (modulus).

• It also overcomes one of the major limitations of the shell: the inability to handle decimal numbers.

• You can use awk to print a pay-slip for the directors:

$ awk -F"|" '$3~/^dir./ {
> printf "%-20s %-12s,%d %d %d\n", $2,$3,$6,$6*0.4,$6*0.15}' e1.lst
akash dir. ,9000 3600 1350

• While awk has certain built-in variables, such as NR and $0, it also lets you define variables of your own. A user-defined variable needs no type declaration, and it is initialized by default to zero or a null string, depending on how it is used. Awk identifies the type of a variable from its context.

$ awk -F"|" '$6>=15000 {
> cnt = cnt+1
> print cnt,$2,$3,$6}' e1.lst
1 shukla g.m 20000
2 sharma d.g.m 15000
3 tiwary g.m 23000
4 kumar mgr 15000
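
The default initialization of user-defined variables can be checked directly. In the sketch below, the variable names total, names and never_set are hypothetical; never_set is deliberately never assigned, so it compares equal to both 0 and the empty string.

```shell
# Hypothetical input: a label and a number per line.
# total is used as a number (starts at 0); names is used as a string (starts empty).
# never_set is never assigned, so both comparisons evaluate to 1 (true).
result=$(printf '%s\n' 'a 5' 'b 7' |
  awk '{ total += $2; names = names $1 }
       END { print total, names, (never_set == 0), (never_set == "") }')
printf '%s\n' "$result"
```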

THE -f OPTION

• Awk offers the -f option to take the program from the file named after the option.

$ cat q1.awk
$6>=15000 {
    print ++count,$2,$3,$6}

$ awk -F"|" -f q1.awk e1.lst
1 shukla g.m 20000
2 sharma d.g.m 15000
3 tiwary g.m 23000
4 kumar mgr 15000

THE BEGIN AND END SECTIONS

• If you need to print something before processing the first line, for example a heading, use the BEGIN section. Similarly, if you want to print totals after processing is over, do it in the END section.

• The BEGIN and END sections are optional, and take the form:

BEGIN {action}
END {action}

• These two sections, when present, are separated by the body of the awk program. Each uses a pair of curly braces to enclose its action. You can use them to print a suitable heading at the beginning and the average salary at the end.

$ cat q2.awk
BEGIN {
    printf "\n\t\t EMPLOYEE ABSTRACT \n\n"
}
$6>15000 {
    # the # character is used for comments
    count++;
    tot+=$6
    printf "%3d%-20s%-12s%d\n", count,$2,$3,$6
}
END {
    printf "\n\t The average basic pay is %6d\n", tot/count
}

• $ awk -F"|" -f q2.awk e1.lst

EMPLOYEE ABSTRACT

1 shukla g.m 20000

2 tiwary g.m 23000

The average basic pay is 21500
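
One caveat with this END section: if no record satisfies $6>15000, count stays at zero and the division tot/count fails. A minimal sketch of a guarded variant follows; the two records and the fallback message are hypothetical, and this is one possible approach rather than part of the original q2.awk.

```shell
# Hypothetical records in which no salary exceeds 15000,
# so count is never incremented and the END block takes the else branch.
result=$(printf '%s\n' 'x|a|b|c|d|9000' 'y|a|b|c|d|12000' |
  awk -F'|' '
    $6 > 15000 { count++; tot += $6 }
    END {
      if (count > 0)
        printf "The average basic pay is %d\n", tot / count
      else
        print "No matching records"
    }')
printf '%s\n' "$result"
```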

Positional Parameters

• The program q1.awk could take a more generalized form if the number 15000 were replaced with a variable.

• To do that, the entire awk command (not just the program) should be stored in a shell script, and the value supplied as an argument to the script. This argument is then compared with the field. Such arguments are known as positional parameters, and the shell identifies them as $1, $2, $3, etc., in the order in which they appear on the command line.

• Inside the awk program, a positional parameter must itself be enclosed within single quotes (as in '$1'). This closes and reopens the quoting around the program, so that the shell rather than awk expands the parameter. It is what distinguishes a positional parameter from a field identifier.

$ cat q1.awk
awk -F"|" '$6>='$1' { print $2,$3,$6}' e1.lst

$ q1.awk 15000
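
An alternative to splicing the shell's $1 into the program text is awk's -v option, which assigns a value to an awk variable before the program runs. The sketch below is not the slides' method; the function filter_by_min, the variable min and the sample records are all hypothetical stand-ins for the shell script and its data.

```shell
# Hypothetical helper standing in for the shell script; "$1" is its first argument.
# -v min="$1" copies the shell parameter into the awk variable min,
# so the awk program itself can stay inside one pair of single quotes.
filter_by_min() {
  awk -F'|' -v min="$1" '$6 >= min { print $2, $3, $6 }'
}

result=$(printf '%s\n' \
  '2233|shukla|g.m|sales|12/12/52|20000' \
  '7898|akash|dir.|mark.|11/06/70|9000' | filter_by_min 15000)
printf '%s\n' "$result"
```

Keeping the program in a single quoted string avoids confusing the shell's $1 with awk's $1.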

BUILT-IN VARIABLES

Variable   Function
NR         Cumulative number of records read
FS         The input field separator
OFS        The output field separator
NF         Number of fields in the current record
FILENAME   The current input file
ARGC       Number of arguments in the command line
ARGV       The list of arguments

• NR stores the record number of the current line.

• FS defines the input field separator. It is an alternative to the -F option of the command. When used, it should be set in the BEGIN section so that the body of the program knows its value before it starts processing.

• The default output field separator, a space, can be reassigned using the variable OFS in the BEGIN section.

• Ex:

$ awk 'BEGIN {FS="|";OFS="~"}
> $6>15000 {print $1,$2,$3,$6}' e1.lst
2233 ~ shukla ~ g.m ~ 20000
3456 ~ tiwary ~g.m ~23000
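
OFS takes effect in two situations: between comma-separated arguments of print, and when awk rebuilds the record after a field is assigned. The sketch below shows both on a single hypothetical record; the self-assignment $3 = $3 is a common idiom used here only to force the rebuild.

```shell
# Hypothetical record. The first print joins $1 and $2 with OFS.
# Assigning to any field makes awk rebuild $0 using OFS, so the
# second print shows the whole record with ~ in place of |.
result=$(printf '%s\n' '2233|shukla|g.m' |
  awk 'BEGIN { FS = "|"; OFS = "~" } { print $1, $2; $3 = $3; print }')
printf '%s\n' "$result"
```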

• NF is useful for cleaning up a database by locating records that don't contain the right number of fields.

• Ex: to locate records that do not have 6 fields, and which have crept in due to faulty data entry:

$ awk 'BEGIN {FS="|"}
> NF!=6 {
> print "record no ",NR," has ",NF, " fields"}' emp.lst

• FILENAME stores the name of the current file being processed. By default, awk doesn't print the filename, but you can instruct it to do so:

$ awk -F "|" '$6<15000 {print FILENAME,$0}' e1.lst
e1.lst 7898 | akash |dir. |mark. | 11/06/70 |9000

• When using an awk program within shell scripts, you can arrange to pass parameters to the script. The array ARGV[ ] stores the entire list of arguments.

• The number of such arguments is stored in the variable ARGC.

$ emp.awk 3500 7000 director

• Then ARGC takes the value 4, while the array ARGV[ ] is filled with the words on the command line. ARGV[0] holds the command name (awk itself, since options and the program file are not stored in ARGV):

ARGV[0] = awk
ARGV[1] = 3500
ARGV[2] = 7000
ARGV[3] = director
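
The contents of ARGV and ARGC can be inspected from a BEGIN block. The sketch below passes the same three words directly to awk; it skips ARGV[0] because its exact value depends on which awk implementation is installed. Since the program has only a BEGIN section, awk does not try to open the arguments as input files.

```shell
# Print each argument after the command name, then the argument count.
# ARGC is 4 here: the command name plus the three words.
result=$(awk 'BEGIN {
  for (i = 1; i < ARGC; i++) print i, ARGV[i]
  print "ARGC =", ARGC
}' 3500 7000 director)
printf '%s\n' "$result"
```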

FUNCTIONS

• Awk has several built-in functions, performing both arithmetic and string operations.

• The parameters are passed to a function in C-style, delimited by commas, and enclosed by a matched pair of parentheses.

Built-in functions in awk

Function           Description
int(x)             Returns the integer value of x
sqrt(x)            Returns the square root of x
index(s1,s2)       Returns the position of the string s2 in the string s1
length( )          Returns the length of the argument (the complete record if none is given)
substr(s1,s2,s3)   Returns the portion of the string s1 of length s3, starting from position s2
split(s,a)         Splits the string s into the array a; returns the number of fields
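
A quick way to try these functions is a BEGIN-only program, which needs no input file. The sketch below uses hypothetical literal values, and it passes split a third argument (the separator "/"), which the table above does not list but which standard awk accepts.

```shell
# Each function is called on a literal so the result is easy to check by hand.
# split() breaks the date on "/" and returns the number of pieces.
result=$(awk 'BEGIN {
  s = "12/03/60"
  n = split(s, parts, "/")
  print int(7.9), sqrt(16), index("sharma", "arm"), length("sharma"), substr("sharma", 2, 3), n, parts[3]
}')
printf '%s\n' "$result"
```

Note that string positions in awk start at 1, so index("sharma", "arm") reports 3 and substr("sharma", 2, 3) yields "har".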

Control flow – THE if statement

• The condition of the if statement must be enclosed in parentheses.

$ awk -F"|" '{ if ($6 >15000) print($2,$6)}' e1.lst
shukla 20000
tiwary 23000

$ awk -F"|" '{ if ($6 >15000) commission = 0.15*$6
> else commission = 0.10*$6 } {print ($2,$6,commission)}' e1.lst
shukla 20000 3000
sharma 15000 1500
akash 9000 900
tiwary 23000 3450
kumar 15000 1500
