Tools for Text - Khoury College of Computer Sciences2013/02/06  · Data Structures 1330 was 1177...

56
Tools for Text Unix Pipe Fitting for Data Analysis David A. Smith 1

Transcript of Tools for Text - Khoury College of Computer Sciences2013/02/06  · Data Structures 1330 was 1177...

Page 1: Tools for Text - Khoury College of Computer Sciences2013/02/06  · Data Structures 1330 was 1177 had 1159 her 949 be 926 not 855 it 853 that 819 she 787 as Sequence of lines Whitespace

Tools for TextUnix Pipe Fitting for Data Analysis

David A. Smith

1

Page 2: Tools for Text - Khoury College of Computer Sciences2013/02/06  · Data Structures 1330 was 1177 had 1159 her 949 be 926 not 855 it 853 that 819 she 787 as Sequence of lines Whitespace

Tools

2

Page 3: Tools for Text - Khoury College of Computer Sciences2013/02/06  · Data Structures 1330 was 1177 had 1159 her 949 be 926 not 855 it 853 that 819 she 787 as Sequence of lines Whitespace

3

Page 4: Tools for Text - Khoury College of Computer Sciences2013/02/06  · Data Structures 1330 was 1177 had 1159 her 949 be 926 not 855 it 853 that 819 she 787 as Sequence of lines Whitespace

4

Page 5: Tools for Text - Khoury College of Computer Sciences2013/02/06  · Data Structures 1330 was 1177 had 1159 her 949 be 926 not 855 it 853 that 819 she 787 as Sequence of lines Whitespace

Data

5

Page 6: Tools for Text - Khoury College of Computer Sciences2013/02/06  · Data Structures 1330 was 1177 had 1159 her 949 be 926 not 855 it 853 that 819 she 787 as Sequence of lines Whitespace

6

Page 7: Tools for Text - Khoury College of Computer Sciences2013/02/06  · Data Structures 1330 was 1177 had 1159 her 949 be 926 not 855 it 853 that 819 she 787 as Sequence of lines Whitespace

6

Page 8: Tools for Text - Khoury College of Computer Sciences2013/02/06  · Data Structures 1330 was 1177 had 1159 her 949 be 926 not 855 it 853 that 819 she 787 as Sequence of lines Whitespace

6

Page 9: Tools for Text - Khoury College of Computer Sciences2013/02/06  · Data Structures 1330 was 1177 had 1159 her 949 be 926 not 855 it 853 that 819 she 787 as Sequence of lines Whitespace

Data Structures

nobody:*:-2:nogroup:*:-1:wheel:*:0:rootdaemon:*:1:rootkmem:*:2:rootsys:*:3:roottty:*:4:rootoperator:*:5:rootmail:*:6:_teamsserver

Sequence of lines

Delimiters separate fields

7

Page 10: Tools for Text - Khoury College of Computer Sciences2013/02/06  · Data Structures 1330 was 1177 had 1159 her 949 be 926 not 855 it 853 that 819 she 787 as Sequence of lines Whitespace

Data Structures

1330 was1177 had1159 her 949 be 926 not 855 it 853 that 819 she 787 as

Sequence of lines

Whitespace separates fields

8

Page 11: Tools for Text - Khoury College of Computer Sciences2013/02/06  · Data Structures 1330 was 1177 had 1159 her 949 be 926 not 855 it 853 that 819 she 787 as Sequence of lines Whitespace

Data Structures

Sir Walter Elliot, of Kellynch Hall, in Somersetshire, was a man who,for his own amusement, never took up any book but the Baronetage;there he found occupation for an idle hour, and consolation in adistressed one; there his faculties were roused into admiration andrespect, by contemplating the limited remnant of the earliest patents;there any unwelcome sensations, arising from domestic affairs

Tokens separated by more complex

orthographic conventions

9

Page 12: Tools for Text - Khoury College of Computer Sciences2013/02/06  · Data Structures 1330 was 1177 had 1159 her 949 be 926 not 855 it 853 that 819 she 787 as Sequence of lines Whitespace

Data Structures

• “Plain” text

• “Flat file” databases

• Linear not random access

• Scales to arbitrary sizes

10

Page 13: Tools for Text - Khoury College of Computer Sciences2013/02/06  · Data Structures 1330 was 1177 had 1159 her 949 be 926 not 855 it 853 that 819 she 787 as Sequence of lines Whitespace

Tools for Research

11

Page 14: Tools for Text - Khoury College of Computer Sciences2013/02/06  · Data Structures 1330 was 1177 had 1159 her 949 be 926 not 855 it 853 that 819 she 787 as Sequence of lines Whitespace

Tools for Research

• If we knew what we wanted to do...

11

Page 15: Tools for Text - Khoury College of Computer Sciences2013/02/06  · Data Structures 1330 was 1177 had 1159 her 949 be 926 not 855 it 853 that 819 she 787 as Sequence of lines Whitespace

Tools for Research

• If we knew what we wanted to do...

• ... and what our data should look like, ...

11

Page 16: Tools for Text - Khoury College of Computer Sciences2013/02/06  · Data Structures 1330 was 1177 had 1159 her 949 be 926 not 855 it 853 that 819 she 787 as Sequence of lines Whitespace

Tools for Research

• If we knew what we wanted to do...

• ... and what our data should look like, ...

• ... we could build special-purpose programs and data structures.

11

Page 17: Tools for Text - Khoury College of Computer Sciences2013/02/06  · Data Structures 1330 was 1177 had 1159 her 949 be 926 not 855 it 853 that 819 she 787 as Sequence of lines Whitespace

Tools for Research

• If we knew what we wanted to do...

• ... and what our data should look like, ...

• ... we could build special-purpose programs and data structures.

• But we don’t know what we want!

11

Page 18: Tools for Text - Khoury College of Computer Sciences2013/02/06  · Data Structures 1330 was 1177 had 1159 her 949 be 926 not 855 it 853 that 819 she 787 as Sequence of lines Whitespace

Tools for Research

• If we knew what we wanted to do...

• ... and what our data should look like, ...

• ... we could build special-purpose programs and data structures.

• But we don’t know what we want!

✤ (It’s research)

11

Page 19: Tools for Text - Khoury College of Computer Sciences2013/02/06  · Data Structures 1330 was 1177 had 1159 her 949 be 926 not 855 it 853 that 819 she 787 as Sequence of lines Whitespace

Tools for Research

• If we knew what we wanted to do...

• ... and what our data should look like, ...

• ... we could build special-purpose programs and data structures.

• But we don’t know what we want!

✤ (It’s research)

• Some tools already exist; some don’t.

11

Page 20: Tools for Text - Khoury College of Computer Sciences2013/02/06  · Data Structures 1330 was 1177 had 1159 her 949 be 926 not 855 it 853 that 819 she 787 as Sequence of lines Whitespace

Tools for Research

• If we knew what we wanted to do...

• ... and what our data should look like, ...

• ... we could build special-purpose programs and data structures.

• But we don’t know what we want!

✤ (It’s research)

• Some tools already exist; some don’t.

• Tools communicate with simple text data.

11

Page 21: Tools for Text - Khoury College of Computer Sciences2013/02/06  · Data Structures 1330 was 1177 had 1159 her 949 be 926 not 855 it 853 that 819 she 787 as Sequence of lines Whitespace

Composability

Dis

cove

rabi

lity

12

Page 22: Tools for Text - Khoury College of Computer Sciences2013/02/06  · Data Structures 1330 was 1177 had 1159 her 949 be 926 not 855 it 853 that 819 she 787 as Sequence of lines Whitespace

Composability

Dis

cove

rabi

lity

Big Tools

12

Page 23: Tools for Text - Khoury College of Computer Sciences2013/02/06  · Data Structures 1330 was 1177 had 1159 her 949 be 926 not 855 it 853 that 819 she 787 as Sequence of lines Whitespace

Composability

Dis

cove

rabi

lity

Big Tools

CLI

12

Page 24: Tools for Text - Khoury College of Computer Sciences2013/02/06  · Data Structures 1330 was 1177 had 1159 her 949 be 926 not 855 it 853 that 819 she 787 as Sequence of lines Whitespace

Composability

Dis

cove

rabi

lity

Big Tools

CLI

CLI+Google

12

Page 25: Tools for Text - Khoury College of Computer Sciences2013/02/06  · Data Structures 1330 was 1177 had 1159 her 949 be 926 not 855 it 853 that 819 she 787 as Sequence of lines Whitespace

Today’s Tasks

• Sorting and counting words

• Regular expressions

✤ Matching and substitution

• Scripting and automation

13

Page 26: Tools for Text - Khoury College of Computer Sciences2013/02/06  · Data Structures 1330 was 1177 had 1159 her 949 be 926 not 855 it 853 that 819 she 787 as Sequence of lines Whitespace

Getting Some Data

• http://nltk.googlecode.com/svn/trunk/nltk_data/packages/corpora/gutenberg.zip

• Unzip (e.g., unzip -a gutenberg.zip)

• Open the Termimal app, or similar

• Go to the new directory

✤ cd ~/Downloads/gutenberg

•ls

14

Page 27: Tools for Text - Khoury College of Computer Sciences2013/02/06  · Data Structures 1330 was 1177 had 1159 her 949 be 926 not 855 it 853 that 819 she 787 as Sequence of lines Whitespace

Looking at Files

•head

✤ -100 : first 100 lines

•tail

✤ -20 : last 20 lines

✤ -n +20 : tail starting at 20th line

•less

• Your favorite text editor

15

Page 28: Tools for Text - Khoury College of Computer Sciences2013/02/06  · Data Structures 1330 was 1177 had 1159 her 949 be 926 not 855 it 853 that 819 she 787 as Sequence of lines Whitespace

Counting Words

•wc

✤ Discoverability: man wc

• But I really meant, counting different words

✤ Really simple tokenization: tr

✤ Sorting: sort

✤ Aggregation (counting): uniq

16

Page 29: Tools for Text - Khoury College of Computer Sciences2013/02/06  · Data Structures 1330 was 1177 had 1159 her 949 be 926 not 855 it 853 that 819 she 787 as Sequence of lines Whitespace

Directing Data

• < Input from a file

• > Output to a file

• | Pipe

17

Page 30: Tools for Text - Khoury College of Computer Sciences2013/02/06  · Data Structures 1330 was 1177 had 1159 her 949 be 926 not 855 it 853 that 819 she 787 as Sequence of lines Whitespace

Counting Words

• tr -sc 'A-Za-z' '\n'

✤ Turns one set of characters into another

✤ What about: tr 'A-Z' 'a-z'

• sort

✤ sort -n numerical order

✤ sort -r reverse order

✤ sort -k 2 sort by 2nd (whitespace) field

• uniq -c: count contiguous repeated lines

18

Page 31: Tools for Text - Khoury College of Computer Sciences2013/02/06  · Data Structures 1330 was 1177 had 1159 her 949 be 926 not 855 it 853 that 819 she 787 as Sequence of lines Whitespace

Counting Words

$ tr -sc 'A-Za-z' '\n' < austen-emma.txt | sort | uniq -c 1 125 A 31 Abbey 1 Abbots 1 Abdy...

Try piping to less, head, or out to a file.19

Page 32: Tools for Text - Khoury College of Computer Sciences2013/02/06  · Data Structures 1330 was 1177 had 1159 her 949 be 926 not 855 it 853 that 819 she 787 as Sequence of lines Whitespace

More Counting

• Sort by word frequency

• Merge counts for upper and lower case

• Sort by word endings (rev)

20

Page 33: Tools for Text - Khoury College of Computer Sciences2013/02/06  · Data Structures 1330 was 1177 had 1159 her 949 be 926 not 855 it 853 that 819 she 787 as Sequence of lines Whitespace

Sorting by Frequency

$ tr -sc 'A-Za-z' '\n' < austen-emma.txt | sort | uniq -c | sort -rn5186 to4846 the4673 and4281 of3192 I...

21

Page 34: Tools for Text - Khoury College of Computer Sciences2013/02/06  · Data Structures 1330 was 1177 had 1159 her 949 be 926 not 855 it 853 that 819 she 787 as Sequence of lines Whitespace

Merging Case$ tr -sc 'A-Za-z' '\n' < austen-emma.txt | tr 'A-Z' 'a-z' | sort | uniq -c 1 1595 a 1 abbreviation 1 abdication 1 abide 3 abilities 30 able...

22

Page 35: Tools for Text - Khoury College of Computer Sciences2013/02/06  · Data Structures 1330 was 1177 had 1159 her 949 be 926 not 855 it 853 that 819 she 787 as Sequence of lines Whitespace

Parallel Streams

•paste file1 file2

✤ paste <(pipe1) <(pipe2)

•cut -f 1

• Counting bigrams

• join -1 2 -2 2 : join on second field

✤ Inputs must be sorted on the joined field!

23

Page 36: Tools for Text - Khoury College of Computer Sciences2013/02/06  · Data Structures 1330 was 1177 had 1159 her 949 be 926 not 855 it 853 that 819 she 787 as Sequence of lines Whitespace

Parallel Streams$ tr -sc 'A-Za-z' '\n' < austen-emma.txt | tr 'A-Z' 'a-z' > austen-emma.words$ paste austen-emma.words <(tail -n +2 austen-emma.words) emmaemma byby janejane austenausten volumevolume ii chapterchapter ii emmaemma woodhousewoodhouse handsomehandsome cleverclever andand richrich with...

24

Page 37: Tools for Text - Khoury College of Computer Sciences2013/02/06  · Data Structures 1330 was 1177 had 1159 her 949 be 926 not 855 it 853 that 819 she 787 as Sequence of lines Whitespace

Counting Bigrams$ paste austen-emma.words <(tail -n +2 austen-emma.words) | sort | uniq -c | sort 608 to be 566 of the 449 it was 446 in the 395 i am 334 she had 331 she was 308 had been 301 it is 299 mr knightley 283 i have 278 could not 265 of her 256 mrs weston...

25

Page 38: Tools for Text - Khoury College of Computer Sciences2013/02/06  · Data Structures 1330 was 1177 had 1159 her 949 be 926 not 855 it 853 that 819 she 787 as Sequence of lines Whitespace

Comparing Counts

$ tr -sc 'A-Za-z' '\n' < austen-sense.txt | grep -v '^$' | tr 'A-Z' 'a-z' | sort | uniq -c > austen-sense.counts$ tr -sc 'A-Za-z' '\n' < austen-persuasion.txt | grep -v '^$' | tr 'A-Z' 'a-z' | sort | uniq -c > austen-persuasion.counts$ join -1 2 -2 2 austen-sense.counts austen-persuasion.countsa 2092 1595abilities 9 3able 46 30abode 5 1about 144 97above 20 6abroad 9 5absence 11 9...

26

Page 39: Tools for Text - Khoury College of Computer Sciences2013/02/06  · Data Structures 1330 was 1177 had 1159 her 949 be 926 not 855 it 853 that 819 she 787 as Sequence of lines Whitespace

Comparing Counts$ tr -sc 'A-Za-z' '\n' < austen-sense.txt | grep -v '^$' | wc -l 120734$ tr -sc 'A-Za-z' '\n' < austen-persuasion.txt | grep -v '^$' | wc -l 84123$ join -1 2 -2 2 austen-sense.counts austen-persuasion.counts | awk '{ print $1, $2 * 10000 / 120734, $3 * 10000 / 84123 }' > sense-persuasion.counts$ less sense-persuasion.countsa 173.273 189.603abilities 0.74544 0.356621able 3.81003 3.56621abode 0.414134 0.118874about 11.927 11.5307above 1.65653 0.713241abroad 0.74544 0.594368absence 0.911094 1.06986absent 0.24848 0.475494...

27

Page 40: Tools for Text - Khoury College of Computer Sciences2013/02/06  · Data Structures 1330 was 1177 had 1159 her 949 be 926 not 855 it 853 that 819 she 787 as Sequence of lines Whitespace

Comparing Counts

0.1 0.5 1.0 5.0 10.0 50.0

0.1

0.5

1.0

5.0

50.0

500.0

pcounts$sense

pcounts$persuasion

affection

also

anne

assured

bad

bath

bed

beginning

behaviour

body

colonelcompletely

concern

cousindecided

declaredeasilyedward

enteredequally

everybody

everything

expectation

fancy

fanny

father

fear

hall high

john

kind

likewise

looking

mary

morrow

mother

parkpounds

pride

resolved

rooms

smith

sort

spentt

thing

thousand

townvalue

whatever

Filtered out words w/similar

frequencies

28

Page 41: Tools for Text - Khoury College of Computer Sciences2013/02/06  · Data Structures 1330 was 1177 had 1159 her 949 be 926 not 855 it 853 that 819 she 787 as Sequence of lines Whitespace

Digression: R graphs

> counts <- read.table("sense-persuation.counts")> names(counts) <- alist(word,sense,persuasion)> pcount <- subset(counts, (sense > 2 | persuasion > 2) & abs(log(sense/persuasion)) > 1.1)> plot(pcounts$sense, pcounts$persuasion, type="n", log="xy")> text(pcount$sense, pcount$persuasion, label=pcount$word)

29

Page 42: Tools for Text - Khoury College of Computer Sciences2013/02/06  · Data Structures 1330 was 1177 had 1159 her 949 be 926 not 855 it 853 that 819 she 787 as Sequence of lines Whitespace

Filters

•grep '^[A-Z]'

✤ “generalized regular expression parser” ???!

✤ Matches words beginning with a capital

• Count results?

• What are regular expressions?

30

Page 43: Tools for Text - Khoury College of Computer Sciences2013/02/06  · Data Structures 1330 was 1177 had 1159 her 949 be 926 not 855 it 853 that 819 she 787 as Sequence of lines Whitespace

Regular Expressions

31

Page 44: Tools for Text - Khoury College of Computer Sciences2013/02/06  · Data Structures 1330 was 1177 had 1159 her 949 be 926 not 855 it 853 that 819 she 787 as Sequence of lines Whitespace

Regular Expressions

• Two types of characters

✤ Literal

• Every “normal” alphanumeric character is an RE and matches itself

✤ Meta-characters

• Special characters that allow you to combine REs in various ways

• Example

✤ a matches a

✤ a* matches ε or a or aa or aaa or ...

32

Page 45: Tools for Text - Khoury College of Computer Sciences2013/02/06  · Data Structures 1330 was 1177 had 1159 her 949 be 926 not 855 it 853 that 819 she 787 as Sequence of lines Whitespace

Regex Basics

Andrew McCallum, UMass Amherst,

including material from Chris Manning and Jason Eisner

Basic Regular Expressions

Character Concat went went

Alternatives (go|went) go went

[aeiou] a o u

disjunc. negation [^aeiou] b c d f g

wildcard char . a z &

Loops & skips a* ε a aa aaa ...

one or more a+ a aa aaa

zero or one colou?r color colour

Pattern Matches

33

Page 46: Tools for Text - Khoury College of Computer Sciences2013/02/06  · Data Structures 1330 was 1177 had 1159 her 949 be 926 not 855 it 853 that 819 she 787 as Sequence of lines Whitespace

More Regexes

Andrew McCallum, UMass Amherst,

including material from Chris Manning and Jason Eisner

More Fancy Regular Expressions• Special characters

– \t tab \v vertical tab

– \n newline \r carriage return

• Aliases (shorthand)– \d digits [0-9]

– \D non-digits [^0-9]

– \w alphabetic [a-zA-Z]

– \W non-alphabetic [^a-zA-Z]

– \s whitespace [\t\n\r\f\v]

– \w alphabetic [a-zA-Z]

• Examples– \d+ dollars 3 dollars, 50 dollars, 982 dollars

– \w*oo\w* food, boo, oodles

• Escape character– \ is the general escape character; e.g. \. is not a wildcard,

but matches a period .

– if you want to use \ in a string it has to be escaped \\

34

Page 47: Tools for Text - Khoury College of Computer Sciences2013/02/06  · Data Structures 1330 was 1177 had 1159 her 949 be 926 not 855 it 853 that 819 she 787 as Sequence of lines Whitespace

More Regexes

Andrew McCallum, UMass Amherst,

including material from Chris Manning and Jason Eisner

Yet More Fancy Regular Expressions

• Anchors. AKA, “zero width characters”.

• They match positions in the text.– ^ beginning of line

– $ end of line

– \b word boundary, i.e. location with \w on oneside but not on the other.

– \B negated word boundary, i.e. any locationthat would not match \b

• Examples:– \bthe\b the together

• Counters {1}, {1,2}, {3,}

35

Page 48: Tools for Text - Khoury College of Computer Sciences2013/02/06  · Data Structures 1330 was 1177 had 1159 her 949 be 926 not 855 it 853 that 819 she 787 as Sequence of lines Whitespace

More Regexes

Andrew McCallum, UMass Amherst,

including material from Chris Manning and Jason Eisner

Even More Fancy Regular Expressions

• Grouping– a (good|bad) movie

– He said it (again and )*again.

• Parens also indicate Registers (saved contents)– b(\w+)h\1

matches boohoo and baha, but not boohaaThe digit after the \ indicates which of multiple parengroups, as ordered by when then were opened.

• Grouping without the cost of register saving– He went (?:this|that) way.

36

Page 49: Tools for Text - Khoury College of Computer Sciences2013/02/06  · Data Structures 1330 was 1177 had 1159 her 949 be 926 not 855 it 853 that 819 she 787 as Sequence of lines Whitespace

Exercising grep

• How many upper case words?

✤ How many distinct upper case words?

• Are there any words with no vowels?

• Any one-syllable (i.e. one vowel) words?

✤ Omit words with silent e

• Gotchas: grep doesn’t support all modern “perl-style” regex extensions, try egrep

37

Page 50: Tools for Text - Khoury College of Computer Sciences2013/02/06  · Data Structures 1330 was 1177 had 1159 her 949 be 926 not 855 it 853 that 819 she 787 as Sequence of lines Whitespace

Regex Substitutions

• Lots of languages support them: perl, python, ruby, Java (and the JVM), ...

• perl -pe 's/regex/subst/g'

✤ easiest on the command line

• perl -000 -pe 's/regex/sub/g'

✤ Match paragraphs, not lines

• perl -ne 'print $1 while /foo (bar)/g'

✤ Print only the “captured” part

38

Page 51: Tools for Text - Khoury College of Computer Sciences2013/02/06  · Data Structures 1330 was 1177 had 1159 her 949 be 926 not 855 it 853 that 819 she 787 as Sequence of lines Whitespace

Munging with Regexes

• Remove common morphological suffixes from word list and get counts

• Patterns that indicate proper names

• Extract direct speech from novels

• Put each paragraph on a single line

✤ For, e.g., Mallet input

39

Page 52: Tools for Text - Khoury College of Computer Sciences2013/02/06  · Data Structures 1330 was 1177 had 1159 her 949 be 926 not 855 it 853 that 819 she 787 as Sequence of lines Whitespace

Replicable Research

40

Page 53: Tools for Text - Khoury College of Computer Sciences2013/02/06  · Data Structures 1330 was 1177 had 1159 her 949 be 926 not 855 it 853 that 819 she 787 as Sequence of lines Whitespace

Shell Scripts• If you’ve figured out a good pipeline, write it down!

✤ Better yet, make it executable.

✤ Make a file histogram containing

✤ Then run chmod +x histogram

#!/bin/shtr -sc 'A-Za-z' '\n' | \ sort | uniq -c | sort -rn

41

Page 54: Tools for Text - Khoury College of Computer Sciences2013/02/06  · Data Structures 1330 was 1177 had 1159 her 949 be 926 not 855 it 853 that 819 she 787 as Sequence of lines Whitespace

Makefiles

• Output from one process is input for another

✤ Capture dependencies in a Makefile

✤ Run make target to generate output

comparison:! emma.txt sense.txtjoin -1 2 -2 2 $^ | \hist-stats > $@

%.hist:!%.txt histogram! ./histogram < $< > $@

42

Page 55: Tools for Text - Khoury College of Computer Sciences2013/02/06  · Data Structures 1330 was 1177 had 1159 her 949 be 926 not 855 it 853 that 819 she 787 as Sequence of lines Whitespace

Where Next?

• Transforming data for more specialized analysis tools

✤ E.g., Gephi, R, Voyant

• Programming to supply missing tools

✤ E.g., R, python (“Programming Historian”)

• Tools for more structured (XML) text

✤ Though regexes can work pretty well!

43

Page 56: Tools for Text - Khoury College of Computer Sciences2013/02/06  · Data Structures 1330 was 1177 had 1159 her 949 be 926 not 855 it 853 that 819 she 787 as Sequence of lines Whitespace

Thanks

• Inspired by Ken Church’s “Unix for Poets”

• My colleagues at the NULab for Texts, Maps & Networks

44