Awk search for and process a pattern in a file. Format awk [-Fc] –f program-file [file-list] awk...


  • awk

    Search for and process a pattern in a file.

    Format

    awk [-Fc] -f program-file [file-list]
    awk program [file-list]

    Summary

    The awk utility is a pattern-scanning and processing language. It searches one or more files to see if they contain lines that match specified patterns and then performs actions, such as writing the line to the standard output or incrementing a counter, each time it finds a match.

    You can use awk to generate reports or filter text. It works equally well with numbers and text; when you mix the two, awk will almost always come up with the right answer.

    The authors of awk (Alfred V. Aho, Peter J. Weinberger, and Brian W. Kernighan) designed it to be easy to use and, to this end, they sacrificed execution speed.

  • The awk utility takes its input from files you specify on the command line or from its standard input.

    The awk utility takes many of its constructs from the C programming language. It includes the following features: flexible format, conditional execution, looping statements, numeric variables, string variables, regular expressions, and C's printf.

    Arguments

    The first format uses a program-file, which is the pathname of a file containing an awk program. See Description below.

    The second format uses a program, which is an awk program included on the command line. This format allows you to write simple, short awk programs without having to create a separate program-file. To prevent the shell from interpreting the awk commands as shell commands, it is a good idea to enclose the program in single quotation marks.

    The file-list contains pathnames of the ordinary files that awk processes. These are the input files.

  • Options

    If you do not use the -f option, awk uses the first command line argument as its program.

    -f program-file (file) This option causes awk to read its program from the program-file given as the first command line argument.

    -Fc (field) This option specifies an input field separator, c, to be used in place of the default separators ([SPACE] and [TAB]). The field separator can be any single character.

    Description

    An awk program consists of one or more program lines containing a pattern and/or action in the following format:

    pattern { action }

    The pattern selects lines from the input file. The awk utility performs the action on all lines that the pattern selects. You must enclose the action within braces so that awk can differentiate it from the pattern. If a program line does not contain a pattern, awk selects all lines in the input file. If a program line does not contain an action, awk copies the selected lines to its standard output.

  • To start, awk compares the first line in the input file (from the file-list) with each pattern in the program-file or program. If a pattern selects the line (if there is a match), awk takes the action associated with the pattern. If the line is not selected, awk takes no action. When awk has completed its comparisons for the first line of the input file, it repeats the process for the next line of input. It continues this process, comparing subsequent lines in the input file, until it has read the entire file-list.

    If several patterns select the same line, awk takes the actions associated with each of the patterns in the order in which they appear. It is therefore possible for awk to send a single line from the input file to its standard output more than once.
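    The matching order described above can be sketched with a small run. The three input lines are a hypothetical sample in the style of the cars file used later, and the "make:" and "year:" labels are added only to show which program line fired. The last input line is selected by both patterns, so it is printed twice, in program order.

```shell
# Hypothetical three-line sample in the style of the cars file.
out=$(printf 'chevy nova 79\nford mustang 65\nchevy impala 65\n' |
  awk '
    /chevy/ { print "make:", $0 }   # selects lines 1 and 3
    /65/    { print "year:", $0 }   # selects lines 2 and 3
  ')
printf '%s\n' "$out"
# make: chevy nova 79
# year: ford mustang 65
# make: chevy impala 65
# year: chevy impala 65
```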

  • Patterns

    You can use a regular expression (refer to Appendix A), enclosed within slashes, as a pattern. The ~ operator tests to see if a field or variable matches a regular expression. The !~ operator tests for no match.

    You can process arithmetic and character relational expressions with the following relational operators.

    Operator   Meaning
    <          less than
    <=         less than or equal to
    ==         equal to
    !=         not equal to
    >=         greater than or equal to
    >          greater than

    You can combine any of the patterns described above using the Boolean operators || (OR) or && (AND).

  • The comma is the range operator. If you separate two patterns with a comma on a single awk program line, awk selects a range of lines beginning with the first line that contains the first pattern. The last line awk selects is the next subsequent line that contains the second pattern. After awk finds the second pattern, it starts the process over by looking for the first pattern again.

    Two unique patterns, BEGIN and END, allow you to execute commands before awk starts its processing and after it finishes. The awk utility executes the actions associated with the BEGIN pattern before, and with the END pattern after, it processes all the files in the file-list.

    Actions

    The action portion of an awk command causes awk to take action when it matches a pattern. If you do not specify an action, awk performs the default action, which is the Print command (explicitly represented as {print}). This action copies the record (normally a line; see Variables) from the input file to awk's standard output.

    You can follow a Print command with arguments, causing awk to print just the arguments you specify. The arguments can be variables or string constants. Using awk, you can send the output from a Print command to a file (>), append it to a file (>>), or pipe it to the input of another program (|).

    Unless you separate items in a Print command with commas, awk catenates them. Commas cause awk to separate the items with the output field separator (normally a [SPACE]; see Variables).

    You can include several actions on one line within a set of braces by separating them with semicolons.
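    A minimal sketch of these pieces working together, using a hypothetical four-line input: BEGIN and END bracket the run, the range pattern selects the lines from chevy through ford, and the two Print commands (separated by a semicolon) show the difference between items separated by a comma and items that are simply adjacent.

```shell
# Hypothetical sample input; only the first two fields matter here.
out=$(printf 'plym fury\nchevy nova\nford mustang\nvolvo gl\n' |
  awk '
    BEGIN { print "start" }                          # runs before any input is read
    /chevy/, /ford/ { print $1, $2; print $1 $2 }    # comma adds OFS, adjacency catenates
    END { print "end" }                              # runs after the last line
  ')
printf '%s\n' "$out"
# start
# chevy nova
# chevynova
# ford mustang
# fordmustang
# end
```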

  • Comments

    The awk utility disregards anything on a program line following a pound sign (#). You can document an awk program by preceding comments with this symbol.

    Variables

    You declare and initialize user variables when you use them (that is, you do not have to declare them before you use them). In addition, awk maintains program variables for your use. You can use both user and program variables in the pattern and in the action portion of an awk program. Following is a list of program variables.

    Variable   Represents
    NR         record number of current record
    $0         the current record (as a single variable)
    NF         number of fields in the current record
    $1-$n      fields in the current record
    FS         input field separator (default: [SPACE] or [TAB])
    OFS        output field separator (default: [SPACE])
    RS         input record separator (default: [NEWLINE])
    ORS        output record separator (default: [NEWLINE])
    FILENAME   name of the current input file

  • The input and output record separators are, by default, [NEWLINE] characters. Thus, awk takes each line in the input file to be a separate record and appends a [NEWLINE] to the end of each record that it sends to its standard output. The input field separators are, by default, [SPACE]s and [TAB]s. The output field separator is a [SPACE]. You can change the value of any of the separators at any time by assigning a new value to its associated variable. Also, the input field separator can be set on the command line using the -F option.

    Functions

    The functions that awk provides for manipulating numbers and strings follow.

    Name                   Function
    length(str)            returns the number of characters in str; if you do not supply an argument, it returns the number of characters in the current input record
    int(num)               returns the integer portion of num
    index(str1, str2)      returns the index of str2 in str1 or 0 if str2 is not present
    split(str, arr, del)   places elements of str, delimited by del, in the array arr[1]...arr[n]; returns the number of elements in the array
    sprintf(fmt, args)     formats args according to fmt and returns the formatted string; mimics the C programming language function of the same name
    substr(str, pos, len)  returns a substring of str that begins at pos and is len characters long
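    As a rough illustration, the following run calls each function in the table once on a single hypothetical record, chevy:nova:3000. The array name part and the variable n are arbitrary names chosen for this sketch.

```shell
# One hypothetical colon-delimited record is enough to exercise every function.
out=$(echo 'chevy:nova:3000' | awk '{
    n = split($0, part, ":")      # part[1] = chevy, part[2] = nova, part[3] = 3000
    print length                  # no argument: length of the current record
    print length(part[1])         # length of a given string
    print index($0, "nova")       # position where nova starts
    print substr($0, 7, 4)        # four characters starting at position 7
    print int(7 / 2)              # integer portion of 3.5
    print sprintf("%s-%s", part[1], part[2]), n
}')
printf '%s\n' "$out"
# 15
# 5
# 7
# nova
# 3
# chevy-nova 3
```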

  • Operators

    The following awk arithmetic operators are from the C programming language.

    Operator   Function
    *          multiplies the expression preceding the operator by the expression following it
    /          divides the expression preceding the operator by the expression following it
    %          takes the remainder after dividing the expression preceding the operator by the expression following it
    +          adds the expression preceding the operator and the expression following it
    -          subtracts the expression following the operator from the expression preceding it
    =          assigns the value of the expression following the operator to the variable preceding it
    ++         increments the variable preceding the operator
    --         decrements the variable preceding the operator
    +=         adds the expression following the operator to the variable preceding it and assigns the result to the variable preceding the operator
    -=         subtracts the expression following the operator from the variable preceding it and assigns the result to the variable preceding the operator
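    The operators in this table can be tried without any input file by placing the statements in a BEGIN action. This is only a sketch; the variable n and the starting value 17 are arbitrary.

```shell
# A BEGIN-only program reads no input, so no file-list is needed.
out=$(awk 'BEGIN {
    n = 17
    print n * 2, n / 4, n % 5, n + 3, n - 3
    n++; print n        # 17 becomes 18
    n--; print n        # back to 17
    n += 10; print n    # 27
    n -= 7; print n     # 20
}')
printf '%s\n' "$out"
# 34 4.25 2 20 14
# 18
# 17
# 27
# 20
```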

  • Associative Arrays

    An associative array is one of awk's most powerful features. An associative array uses strings as its indexes. Using an associative array, you can mimic a traditional array by using numeric strings as indexes.

    You assign a value to an element of an associative array just as you would assign a value to any other awk variable. The format is shown below.

    array[string] = value

    The array is the name of the array, string is the index of the element of the array you are assigning a value to, and value is the value you are assigning to the element of the array.

    Operator   Function
    *=         multiplies the variable preceding the operator by the expression following it and assigns the result to the variable preceding the operator
    /=         divides the variable preceding the operator by the expression following it and assigns the result to the variable preceding the operator
    %=         takes the remainder, after dividing the variable preceding the operator by the expression following it, and assigns the result to the variable preceding the operator

  • There is a special For structure you can use with an awk array. The format is:

    for (elem in array) action

    The elem is a variable that takes on the value of each index of the array in turn as the For structure loops through the array, array is the name of the array, and action is the action that awk takes for each element in the array. You can use the elem variable in this action; array[elem] gives the value stored under that index. The order in which awk visits the indexes is not specified.

    The Examples section contains programs that use associative arrays.

    Printf

    You can use the Printf command in place of Print to control the format of the output that awk generates. The awk version of Printf is similar to that of the C language. A Printf command takes the following format:

    printf "control-string", arg1, arg2, ..., argn

    The control-string determines how Printf will format arg1-n. The arg1-n can be variables or other expressions. Within the control-string, you can use \n to indicate a [NEWLINE] and \t to indicate a [TAB].

    The control-string contains conversion specifications, one for each argument (arg1-n). A conversion specification has the following format:

  • %[-][x[.y]]conv

    The - causes Printf to left justify the argument. The x is the minimum field width, and the .y is the number of places to the right of a decimal point in a number. The conv is a letter from the following list. Refer to the following Examples section for examples of how to use printf.

    conv   Conversion
    d      decimal
    e      exponential notation
    f      floating-point number
    g      use f or e, whichever is shorter
    o      unsigned octal
    s      string of characters
    x      unsigned hexadecimal
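    The following sketch combines an associative array, the For structure, and Printf. It assumes a hypothetical three-line input with a make, a model, and a price on each line; the array names count and total are invented for the illustration. Because awk does not promise any particular order when it loops over an array, the output is piped through sort.

```shell
# Hypothetical input: make, model, price.
out=$(printf 'chevy nova 3000\nford mustang 10000\nchevy impala 1550\n' |
  awk '
    { count[$1]++; total[$1] += $3 }     # indexes are the make names
    END {
      for (make in count)                # make takes on each index in turn
        printf "%-6s %2d %8.2f\n", make, count[make], total[make]
    }
  ' | sort)
printf '%s\n' "$out"
# chevy   2  4550.00
# ford    1 10000.00
```

    Here %-6s left justifies the make in a field six characters wide, %2d prints the count as a decimal number, and %8.2f prints the total with two places after the decimal point.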

  • Examples

    A simple awk program is shown below.

    { print }

    This program consists of one program line that is an action. It uses no pattern. Because the pattern is missing, awk selects all lines in the input file. Without any arguments, the Print command prints each selected line in its entirety. This program copies the input file to its standard output.

    The following program has a pattern part without an explicit action.

    /jenny/

    In this case, awk selects all lines from the input file that contain the string jenny. When you do not specify an action, awk assumes the action to be Print. This program copies all the lines in the input file that contain jenny to its standard output.

    The following examples work with the cars data file. From left to right, the columns in the file contain each car's make, model, year of manufacture, mileage, and price. All white space in this file is composed of single [TAB]s (there are no [SPACE]s in the file).

  • $ cat cars
    plym    fury    77      73      2500
    chevy   nova    79      60      3000
    ford    mustang 65      45      10000
    volvo   gl      78      102     9850
    ford    ltd     83      15      10500
    chevy   nova    80      50      3500
    fiat    600     65      115     450
    honda   accord  81      30      6000
    ford    thundbd 84      10      17000
    toyota  tercel  82      180     750
    chevy   impala  65      85      1550
    ford    bronco  83      25      9500

    The first example below selects all lines that contain the string chevy. The slashes indicate that chevy is a regular expression. This example has no action part.

    Although neither awk nor shell syntax requires single quotation marks on the command line, it is a good idea to use them, because they prevent many problems. If the awk program you create on the command line includes [SPACE]s or any special characters that the shell will interpret, you must quote them. Always enclosing the program in single quotation marks is the easiest way of making sure you have quoted any characters that need to be quoted.

    $ awk '/chevy/' cars
    chevy   nova    79      60      3000
    chevy   nova    80      50      3500
    chevy   impala  65      85      1550

    The next example selects all lines from the file (it has no pattern part). The braces enclose the action part; you must always use braces to delimit the action part, so that awk can distinguish the pattern part from the action part. This example prints the third field ($3), a [SPACE] (indicated by the comma), and the first field ($1) of each selected line.

  • $ awk '{print $3, $1}' cars
    77 plym
    79 chevy
    65 ford
    78 volvo
    83 ford
    80 chevy
    65 fiat
    81 honda
    84 ford
    82 toyota
    65 chevy
    83 ford

    The next example includes both a pattern and an action part. It selects all lines that contain the string chevy and prints the third and first fields from the lines it selects.

    $ awk '/chevy/ {print $3, $1}' cars
    79 chevy
    80 chevy
    65 chevy

  • The next example selects lines that contain a match for the regular expression h. Because there is no explicit action, it prints all the lines it selects.

    $ awk '/h/' cars
    chevy   nova    79      60      3000
    chevy   nova    80      50      3500
    honda   accord  81      30      6000
    ford    thundbd 84      10      17000
    chevy   impala  65      85      1550

    The next pattern uses the matches operator (~) to select all lines that contain the letter h in the first field.

    $ awk '$1 ~ /h/' cars
    chevy   nova    79      60      3000
    chevy   nova    80      50      3500
    honda   accord  81      30      6000
    chevy   impala  65      85      1550

    The caret (^) in a regular expression forces a match at the beginning of the line or, in this case, the beginning of the first field.

    $ awk '$1 ~ /^h/' cars
    honda   accord  81      30      6000

    A pair of brackets surrounds a character class definition (refer to Appendix A, Regular Expressions). Below, awk selects all lines that have a second field that begins with t or m. Then it prints the third and second fields, a dollar sign, and the fifth field.

  • $ awk '$2 ~ /^[tm]/ {print $3, $2, "$" $5}' cars
    65 mustang $10000
    84 thundbd $17000
    82 tercel $750

    The next example shows three roles that a dollar sign can play in an awk program. A dollar sign followed by a number forms the name of a field. Within a regular expression, a dollar sign forces a match at the end of a line or field (5$). Within a string, you can use a dollar sign as itself.

    $ awk '$3 ~ /5$/ {print $3, $1, "$" $5}' cars
    65 ford $10000
    65 fiat $450
    65 chevy $1550

    Below, the equals relational operator (==) causes awk to perform a numeric comparison between the third field in each line and the number 65. The awk command takes the default action, Print, on each line that matches.

    $ awk '$3 == 65' cars
    ford    mustang 65      45      10000
    fiat    600     65      115     450
    chevy   impala  65      85      1550

  • The next example finds all cars priced at or under $3000.

    $ awk '$5 <= 3000' cars
    plym    fury    77      73      2500
    chevy   nova    79      60      3000
    fiat    600     65      115     450
    toyota  tercel  82      180     750
    chevy   impala  65      85      1550

    When you use double quotation marks, awk performs textual comparisons, using the ASCII collating sequence as the basis of the comparison. Below, awk shows that the strings 450 and 750 fall in the range that lies between the strings 2000 and 9000.

    $ awk '$5 >= "2000" && $5 < "9000"' cars
    plym    fury    77      73      2500
    chevy   nova    79      60      3000
    chevy   nova    80      50      3500
    fiat    600     65      115     450
    honda   accord  81      30      6000
    toyota  tercel  82      180     750

    When you need a numeric comparison, do not use quotation marks. The next example gives the correct results. It is the same as the previous example but omits the double quotation marks.

  • $ awk '$5 >= 2000 && $5 < 9000' cars
    plym    fury    77      73      2500
    chevy   nova    79      60      3000
    chevy   nova    80      50      3500
    honda   accord  81      30      6000

    Next, the range operator (,) selects a group of lines. The first line it selects is the one specified by the pattern before the comma. The last line is the one selected by the pattern after the comma. If there is no line that matches the pattern after the comma, awk selects every line up to the end of the file. The example selects all lines starting with the line that contains volvo and concluding with the line that contains fiat.

    $ awk '/volvo/ , /fiat/' cars
    volvo   gl      78      102     9850
    ford    ltd     83      15      10500
    chevy   nova    80      50      3500
    fiat    600     65      115     450

    After the range operator finds its first group of lines, it starts the process over, looking for a line that matches the pattern before the comma. In the following example, awk finds three groups of lines that fall between chevy and ford. Although the fifth line in the file contains ford, awk does not select it because, at the time it is processing the fifth line, it is searching for chevy.

  • $ awk '/chevy/ , /ford/' cars
    chevy   nova    79      60      3000
    ford    mustang 65      45      10000
    chevy   nova    80      50      3500
    fiat    600     65      115     450
    honda   accord  81      30      6000
    ford    thundbd 84      10      17000
    chevy   impala  65      85      1550
    ford    bronco  83      25      9500

    When you are writing a longer awk program, it is convenient to put the program in a file and reference the file on the command line. Use the -f option, followed by the name of the file containing the awk program.

    Following is an awk program that has two actions and uses the BEGIN pattern. The awk utility performs the action associated with BEGIN before it processes any of the lines of the data file. The pr_header awk program uses BEGIN to print a header. The second action, {print}, has no pattern part and prints all the lines in the file.

    $ cat pr_header
    BEGIN   {print "Make    Model   Year    Miles   Price"}
            {print}

  • $ awk -f pr_header cars
    Make    Model   Year    Miles   Price
    plym    fury    77      73      2500
    chevy   nova    79      60      3000
    ford    mustang 65      45      10000
    volvo   gl      78      102     9850
    ford    ltd     83      15      10500
    chevy   nova    80      50      3500
    fiat    600     65      115     450
    honda   accord  81      30      6000
    ford    thundbd 84      10      17000
    toyota  tercel  82      180     750
    chevy   impala  65      85      1550
    ford    bronco  83      25      9500

    In the previous and following examples, the white space in the headers is composed of single [TAB]s, so that the titles line up with the columns of data.

    $ cat pr_header2
    BEGIN   {
    print "Make    Model   Year    Miles   Price"
    print "----------------------------------------"
    }
            {print}

  • $ awk -f pr_header2 cars
    Make    Model   Year    Miles   Price
    ----------------------------------------
    plym    fury    77      73      2500
    chevy   nova    79      60      3000
    ford    mustang 65      45      10000
    volvo   gl      78      102     9850
    ford    ltd     83      15      10500
    chevy   nova    80      50      3500
    fiat    600     65      115     450
    honda   accord  81      30      6000
    ford    thundbd 84      10      17000
    toyota  tercel  82      180     750
    chevy   impala  65      85      1550
    ford    bronco  83      25      9500

    When you call the length function without an argument, it returns the number of characters in the current line, including field separators. The $0 variable always contains the value of the current line. In the next example, awk prepends the length to each line, and then a pipe sends the output from awk to sort, so that the lines of the cars file appear in order of length. Because the formatting of the report depends on [TAB]s, including three extra characters at the beginning of each line throws off the format of the last line. A remedy for this situation will be covered shortly.

  • $ awk '{print length, $0}' cars | sort
    19 fiat  600     65      115     450
    20 ford  ltd     83      15      10500
    20 plym  fury    77      73      2500
    20 volvo gl      78      102     9850
    21 chevy nova    79      60      3000
    21 chevy nova    80      50      3500
    22 ford  bronco  83      25      9500
    23 chevy impala  65      85      1550
    23 honda accord  81      30      6000
    24 ford  mustang 65      45      10000
    24 ford  thundbd 84      10      17000
    24 toyota        tercel  82      180     750

    The NR variable contains the record (line) number of the current line. The following pattern selects all lines that contain more than 23 characters. The action prints the line number of all the selected lines.

    $ awk 'length > 23 {print NR}' cars
    3
    9
    10

  • You can combine the range operator (,) and the NR variable to display a group of lines of a file based on their line numbers. The next example displays lines 2 through 4.

    $ awk 'NR == 2 , NR == 4' cars
    chevy   nova    79      60      3000
    ford    mustang 65      45      10000
    volvo   gl      78      102     9850

    The END pattern works in a manner similar to the BEGIN pattern, except awk takes the actions associated with it after it has processed the last of its input lines. The following report displays information only after it has processed the entire data file. The NR variable retains its value after awk has finished processing the data file, so that an action associated with an END pattern can use it.

    $ awk 'END {print NR, "cars for sale." }' cars
    12 cars for sale.

    The next example uses If commands to change the values of some of the first fields. As long as awk does not make any changes to a record, it leaves the entire record, including separators, intact. Once it makes a change to a record, it changes all separators in that record to the default. The default output field separator is a [SPACE].

  • $ cat separ_demo
    {
    if ($1 ~ /ply/)  $1 = "plymouth"
    if ($1 ~ /chev/) $1 = "chevrolet"
    print
    }

    $ awk -f separ_demo cars
    plymouth fury 77 73 2500
    chevrolet nova 79 60 3000
    ford    mustang 65      45      10000
    volvo   gl      78      102     9850
    ford    ltd     83      15      10500
    chevrolet nova 80 50 3500
    fiat    600     65      115     450
    honda   accord  81      30      6000
    ford    thundbd 84      10      17000
    toyota  tercel  82      180     750
    chevrolet impala 65 85 1550
    ford    bronco  83      25      9500

  • You can change the default value of the output field separator by assigning a value to the OFS variable. There is one [TAB] character between the quotation marks in the following example. This fix improves the appearance of the report but does not properly line up the columns.

    $ cat ofs_demo
    BEGIN   {OFS = "[TAB]"}
    {
    if ($1 ~ /ply/)  $1 = "plymouth"
    if ($1 ~ /chev/) $1 = "chevrolet"
    print
    }

    $ awk -f ofs_demo cars
    plymouth        fury    77      73      2500
    chevrolet       nova    79      60      3000
    ford    mustang 65      45      10000
    volvo   gl      78      102     9850
    ford    ltd     83      15      10500
    chevrolet       nova    80      50      3500
    fiat    600     65      115     450
    honda   accord  81      30      6000
    ford    thundbd 84      10      17000
    toyota  tercel  82      180     750
    chevrolet       impala  65      85      1550
    ford    bronco  83      25      9500
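    The separator behavior described above can be seen in isolation with a two-line, [TAB]-separated sample. This is only a sketch: the data is hypothetical, and OFS is set to a hyphen purely so that the rebuilt record is easy to spot. Only the record whose first field is assigned to is rebuilt with OFS; the untouched record keeps its original [TAB]s.

```shell
# Two hypothetical TAB-separated records; only the first one is modified.
out=$(printf 'plym\tfury\t77\nford\tltd\t83\n' |
  awk 'BEGIN { OFS = "-" } $1 == "plym" { $1 = "plymouth" } { print }')
printf '%s\n' "$out"
# plymouth-fury-77
# ford<TAB>ltd<TAB>83   (printed with real TAB characters, unchanged from the input)
```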

  • You can use Printf to refine the output format (refer to Printf, above). The following example uses a backslash at the end of a program line to mask the following [NEWLINE] from awk. You can use this technique to continue a long line over one or more lines without affecting the outcome of the program.

    $ cat printf_demo
    BEGIN   {
    print "                         Miles"
    print "Make       Model    Year (000)      Price"
    print \
    "-----------------------------------------"
    }
    {
    if ($1 ~ /ply/)  $1 = "plymouth"
    if ($1 ~ /chev/) $1 = "chevrolet"
    printf "%-10s %-8s 19%2d %5d $ %8.2f\n",\
    $1, $2, $3, $4, $5
    }

  • $ awk -f printf_demo cars
                             Miles
    Make       Model    Year (000)      Price
    -----------------------------------------
    plymouth   fury     1977    73 $  2500.00
    chevrolet  nova     1979    60 $  3000.00
    ford       mustang  1965    45 $ 10000.00
    volvo      gl       1978   102 $  9850.00
    ford       ltd      1983    15 $ 10500.00
    chevrolet  nova     1980    50 $  3500.00
    fiat       600      1965   115 $   450.00
    honda      accord   1981    30 $  6000.00
    ford       thundbd  1984    10 $ 17000.00
    toyota     tercel   1982   180 $   750.00
    chevrolet  impala   1965    85 $  1550.00
    ford       bronco   1983    25 $  9500.00

  • The next example creates two new files, one with all the lines that contain chevy and the other with lines containing ford.

    $ cat redirect_out
    /chevy/ {print > "chevfile"}
    /ford/  {print > "fordfile"}
    END     {print "done."}

    $ awk -f redirect_out cars
    done.

    $ cat chevfile
    chevy   nova    79      60      3000
    chevy   nova    80      50      3500
    chevy   impala  65      85      1550

    The summary program produces a summary report on all cars and newer cars. The first two lines of declarations are not required; awk automatically declares and initializes variables as you use them. After awk reads all the input data, it computes and displays averages.

  • $ cat summary
    BEGIN   {
    yearsum = 0 ; costsum = 0
    newcostsum = 0 ; newcount = 0
    }
    {
    yearsum += $3
    costsum += $5
    }
    $3 > 80 {newcostsum += $5 ; newcount ++}
    END     {
    printf "Average age of cars is %3.1f years\n",\
    90 - (yearsum/NR)
    printf "Average cost of cars is $%7.2f\n",\
    costsum/NR
    printf "Average cost of newer cars is $%7.2f\n",\
    newcostsum/newcount
    }

    $ awk -f summary cars
    Average age of cars is 13.2 years
    Average cost of cars is $6216.67
    Average cost of newer cars is $8750.00

    Following, grep shows the format of a line from the passwd file that the next example uses.

  • $ grep mark /etc/passwd
    mark:4zvDGYGEbYHJg:107:ext 112:/home/mark:/bin/csh

    The next example demonstrates a technique for finding the largest number in a field. Because it works with the passwd file, which delimits fields with colons (:), it changes the input field separator (FS) before reading any data. (Alternatively, the -F option could be used on the command line to change the input field separator.) This example reads the passwd file and determines the next available user ID number (field 3). The numbers do not have to be in order in the passwd file for this program to work.

    The pattern causes awk to select records that contain a user ID number greater than any previous user ID number that it has processed. Each time it selects a record, it assigns the value of the new user ID number to the saveit variable. Then awk uses the new value of saveit to test the user ID of all subsequent records. Finally, awk adds 1 to the value of saveit and displays the result.

    $ cat find_uid
    BEGIN   {FS = ":"
            saveit = 0}
    $3 > saveit     {saveit = $3}
    END     {print "Next available UID is " saveit + 1}

    $ awk -f find_uid /etc/passwd
    Next available UID is 192
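    The same running-maximum technique can be written on the command line with the -F option in place of the assignment to FS. The four records below are hypothetical and abbreviated to four colon-separated fields, and the variable name max is arbitrary. Adding 0 to the field is a defensive choice made for this sketch, so that the comparison is numeric whatever the field looks like.

```shell
# Hypothetical, abbreviated passwd-style records: name:password:uid:gid
out=$(printf 'root:x:0:0\nmark:x:107:100\nalex:x:191:100\njenny:x:150:100\n' |
  awk -F: '
    $3 + 0 > max { max = $3 + 0 }                       # keep the largest UID seen so far
    END { print "Next available UID is " (max + 1) }
  ')
printf '%s\n' "$out"
# Next available UID is 192
```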

  • The next example shows another report based on the cars file. This report uses nested If Else statements to substitute values based on the contents of the price field. The program has no pattern part--it processes every record.

    $ cat price_range
    {
    if ($5 <= 5000)                   $5 = "inexpensive"
    else if ($5 > 5000 && $5 < 10000) $5 = "please ask"
    else if ($5 >= 10000)             $5 = "expensive"
    printf "%-10s %-8s 19%2d %5d %-12s\n",\
    $1, $2, $3, $4, $5
    }

    $ awk -f price_range cars
    plym       fury     1977    73 inexpensive
    chevy      nova     1979    60 inexpensive
    ford       mustang  1965    45 expensive
    volvo      gl       1978   102 please ask
    ford       ltd      1983    15 expensive
    chevy      nova     1980    50 inexpensive
    fiat       600      1965   115 inexpensive
    honda      accord   1981    30 please ask
    ford       thundbd  1984    10 expensive
    toyota     tercel   1982   180 inexpensive
    chevy      impala   1965    85 inexpensive
    ford       bronco   1983    25 please ask

  • Problem 1) Find the number of annotated genes on each strand of the E. coli genome sequence.

    Problem 2) Find the number of putatively identified, hypothetical, and unknown genes in the E. coli genome sequence.