Methods in Computational Linguistics II Queens College Lecture 3: Counting More Things.



Overview

• Basics of Probability
  – Example Bayes Rule Question

• Implementing
  – FreqDist
  – ConditionalFreqDist

• Using the Command Line


Definitions

• Joint Probability
• Marginal Probability
• Conditional Probability
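
For reference, the three terms can be written as follows for two random variables X and Y. The slide only names them; the notation below is the standard textbook form and is an addition, not taken from the lecture.

```latex
\begin{align*}
\text{Joint:}       \quad & P(X = x, Y = y) \\
\text{Marginal:}    \quad & P(X = x) = \sum_{y} P(X = x, Y = y) \\
\text{Conditional:} \quad & P(X = x \mid Y = y) = \frac{P(X = x, Y = y)}{P(Y = y)}, \quad P(Y = y) > 0
\end{align*}
```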


Bayes’ Rule

Bayes’ Rule relates conditional probability distributions:

P(h | e) = P(e | h) * P(h) / P(e)

or with additional conditioning information:

P(h | e, k) = P(e | h, k) * P(h | k) / P(e | k)


Bayes Rule Problem

• The probability that I think my cup of coffee tastes good is 0.80.
  P(G) = .80

• I add Equal to my coffee 60% of the time.
  P(E) = .60

• I think when coffee has Equal in it, it tastes good 50% of the time.
  P(G|E) = .50

• If I sip my coffee, and it tastes good, what is the probability that it has Equal in it?
  P(E|G) = P(G|E) * P(E) / P(G)
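
Plugging the slide's numbers into the formula gives the answer directly. The short sketch below just carries out that arithmetic; the variable names are chosen here for readability and are not from the lecture.

```python
# Values taken from the slide.
p_good = 0.80              # P(G): the coffee tastes good
p_equal = 0.60             # P(E): Equal was added
p_good_given_equal = 0.50  # P(G|E): tastes good when Equal was added

# Bayes' rule: P(E|G) = P(G|E) * P(E) / P(G)
p_equal_given_good = p_good_given_equal * p_equal / p_good

print(round(p_equal_given_good, 3))  # prints 0.375
```

So tasting good coffee lowers the probability that Equal was added from 0.60 to 0.375.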


Bayes’ Rule

• P(disease | symptom) = P(symptom | disease) P(disease) / P(symptom)

• Assess diagnostic probability from causal probability:
  – P(Cause | Effect) = P(Effect | Cause) P(Cause) / P(Effect)

• Prior: P(Cause). Likelihood: P(Effect | Cause). Posterior: P(Cause | Effect).


Bayes Example

• Imagine
  – disease = BirdFlu, symptom = coughing
  – P(disease | symptom) is different in a BirdFlu-indicated country vs. the USA
  – P(symptom | disease) should be the same

• It is more useful to learn P(symptom | disease)
  – What about the denominator, P(symptom)? How do we determine this? Use conditioning (next slide).


Conditioning

• Idea: Use conditional probabilities instead of joint probabilities

• P(A) = P(A, B) + P(A, ¬B) = P(A | B) P(B) + P(A | ¬B) P(¬B)
  Example: P(symptom) = P(symptom | disease) P(disease) + P(symptom | ¬disease) P(¬disease)

• More generally: P(Y) = Σ_z P(Y | z) P(z)

• Marginalization and conditioning are useful rules for derivations involving probability expressions.
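
The conditioning rule supplies the denominator that Bayes' rule needs. The sketch below shows the two steps together. All three input probabilities are hypothetical values picked only for illustration; they are not from the lecture and say nothing about any real disease.

```python
def total_probability(p_a_given_b, p_a_given_not_b, p_b):
    """P(A) = P(A|B) P(B) + P(A|not B) P(not B)."""
    return p_a_given_b * p_b + p_a_given_not_b * (1.0 - p_b)


# Hypothetical numbers, chosen only for illustration.
p_disease = 0.01
p_symptom_given_disease = 0.90
p_symptom_given_no_disease = 0.10

# Step 1: obtain P(symptom) by conditioning on disease / no disease.
p_symptom = total_probability(
    p_symptom_given_disease, p_symptom_given_no_disease, p_disease)

# Step 2: Bayes' rule, using the denominator from step 1.
p_disease_given_symptom = p_symptom_given_disease * p_disease / p_symptom

print(p_symptom)
print(p_disease_given_symptom)
```

With these made-up inputs the posterior is much larger than the 0.01 prior but still well below one half, because the symptom is also fairly common without the disease.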


Independence

• A and B are independent iff
  – P(A, B) = P(A) P(B)
  – P(A | B) = P(A)
  – P(B | A) = P(B)

• Independence is essential for efficient probabilistic reasoning
• 32 entries reduced to 12; for n independent biased coins, O(2^n) → O(n)
• Absolute independence is powerful but rare
• Dentistry is a large field with hundreds of variables, none of which are independent. What to do?

[Figure: the network over Cavity, Toothache, Xray, and Weather decomposes into two independent parts, {Cavity, Toothache, Xray} and {Weather}.]

P(T, X, C, W) = P(T, X, C) P(W)
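
The first definition, P(A, B) = P(A) P(B), can be checked mechanically on a small joint table. The sketch below does that for two coins. The function name and both tables are hypothetical, made up for illustration.

```python
from itertools import product


def is_independent(joint, tol=1e-9):
    """Check P(a, b) == P(a) * P(b) for every pair of values.

    `joint` maps (a, b) pairs to probabilities that sum to 1.
    """
    p_a, p_b = {}, {}
    for (a, b), p in joint.items():
        p_a[a] = p_a.get(a, 0.0) + p  # marginal of the first variable
        p_b[b] = p_b.get(b, 0.0) + p  # marginal of the second variable
    return all(abs(joint.get((a, b), 0.0) - p_a[a] * p_b[b]) <= tol
               for a, b in product(p_a, p_b))


# Two biased coins flipped separately: P(H) = 0.7 and P(H) = 0.6.
separate_coins = {("H", "H"): 0.42, ("H", "T"): 0.28,
                  ("T", "H"): 0.18, ("T", "T"): 0.12}

# Two coins glued together: they always show the same face.
glued_coins = {("H", "H"): 0.5, ("H", "T"): 0.0,
               ("T", "H"): 0.0, ("T", "T"): 0.5}

print(is_independent(separate_coins))  # prints True
print(is_independent(glued_coins))     # prints False
```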


Conditional Independence

• A and B are conditionally independent given C iff
  – P(A | B, C) = P(A | C)
  – P(B | A, C) = P(B | C)
  – P(A, B | C) = P(A | C) P(B | C)

• Toothache (T), Spot in Xray (X), Cavity (C)
  – None of these propositions are independent of one another
  – But: T and X are conditionally independent given C
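
One consequence is worth spelling out. It is a standard derivation, not shown on the slide: conditional independence lets the joint distribution over T, X, and C be built from small conditional tables.

```latex
\begin{align*}
P(T, X, C) &= P(T, X \mid C)\, P(C) \\
           &= P(T \mid C)\, P(X \mid C)\, P(C)
\end{align*}
```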


Boxes and Balls

• 2 Boxes, one red and one blue.
• Each contains colored balls.


Frequency Distribution

• Count up the number of occurrences of each member of a set of items.

• This counting can be used to calculate the probability of seeing any word.
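
As a sketch of how counts become probabilities, the usual relative-frequency estimate divides a word's count by the total number of tokens. The symbols c(w) and N are notation introduced here, not taken from the slide.

```latex
\hat{P}(w) = \frac{c(w)}{N}, \qquad N = \sum_{w'} c(w')
```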


nltk.FreqDist

• Let’s look at some code.

• Feel free to code along.


How would you implement a Frequency Distribution?

• First conceptually.
  – What needs to happen.

• Implementation
  – Dictionary Objects.
  – Let's see some examples.
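
One possible dictionary-based version is sketched below. It illustrates the idea only and is not how nltk.FreqDist is implemented; the function names and the sample sentence are made up for this example.

```python
def frequency_distribution(items):
    """Count how many times each item occurs, using a plain dictionary."""
    counts = {}
    for item in items:
        # dict.get supplies 0 the first time an item is seen.
        counts[item] = counts.get(item, 0) + 1
    return counts


def relative_frequency(counts, item):
    """Estimate the probability of `item` as its count over the total."""
    total = sum(counts.values())
    return counts.get(item, 0) / float(total) if total else 0.0


words = "the cat saw the dog and the bird".split()
counts = frequency_distribution(words)

print(counts["the"])                      # prints 3
print(relative_frequency(counts, "the"))  # prints 0.375
```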


Conditional Frequency Distribution

• Construct Frequency Distributions based on “conditions”.
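
A conditional frequency distribution can be pictured as one frequency distribution per condition. The sketch below builds that structure from (condition, item) pairs with standard-library containers. It is not NLTK's ConditionalFreqDist, and the genre/word pairs are hypothetical data.

```python
from collections import Counter, defaultdict


def conditional_frequency_distribution(pairs):
    """Build {condition: Counter of items} from (condition, item) pairs."""
    cfd = defaultdict(Counter)
    for condition, item in pairs:
        cfd[condition][item] += 1
    return cfd


# Hypothetical data: (genre, word) pairs.
pairs = [("news", "the"), ("news", "vote"), ("news", "the"),
         ("fiction", "the"), ("fiction", "dragon")]
cfd = conditional_frequency_distribution(pairs)

print(cfd["news"]["the"])        # prints 2
print(cfd["fiction"]["dragon"])  # prints 1
```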


Using the command line

• You can write and run code in IDLE or the interpreter.

• This is called 'interactive' mode.

• However, a more useful way to build tools is to write a file that
  – contains your code
  – can be run from the command line


Why do we do this?


Command Line Arguments

python my_code.py

python my_code.py readthisfile.txt writethisfile.txt

python my_code.py readthisfile.txt 4 writethisfile.txt


How do we get at this information?

sys.argv

We’ll code some examples.
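
A minimal sketch of reading sys.argv follows. It uses a simulated argument list modeled on the third command shown earlier, so it runs without any real files. The helper name split_arguments is hypothetical.

```python
import sys


def split_arguments(argv):
    """Separate the script name (argv[0]) from the user-supplied arguments."""
    return argv[0], argv[1:]


# The real argument list for this process; every element is a string.
print(sys.argv)

# Simulating: python my_code.py readthisfile.txt 4 writethisfile.txt
example_argv = ["my_code.py", "readthisfile.txt", "4", "writethisfile.txt"]
script_name, user_args = split_arguments(example_argv)

infile = user_args[0]
num_lines = int(user_args[1])  # arguments arrive as strings; convert by hand
outfile = user_args[2]

print(script_name, infile, num_lines, outfile)
```

Working with sys.argv directly means checking the number of arguments and converting types yourself, which is the bookkeeping that argparse takes over.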


Argparse

import argparse

parser = argparse.ArgumentParser(description='fun with files')

# Positional Arguments
parser.add_argument('infile')
parser.add_argument('outfile')

args = parser.parse_args()

# print() with a single argument works in both Python 2 and Python 3
print(args.infile)
print(args.outfile)


Help Information

import argparse

parser = argparse.ArgumentParser(description='fun with files')

# Positional Arguments; the help text is shown by: python my_code.py --help
parser.add_argument('infile', help='file to read')
parser.add_argument('outfile', help='file to write')

args = parser.parse_args()

print(args.infile)
print(args.outfile)


Optional Arguments

import argparse

parser = argparse.ArgumentParser(description='fun with files')

# Positional Arguments
parser.add_argument('infile', help='file to read')
parser.add_argument('outfile', help='file to write')

# Optional argument: a leading '--' makes it optional
parser.add_argument('--num_lines', help='number of lines to read')

args = parser.parse_args()

print(args.infile)
print(args.outfile)
print(args.num_lines)  # None unless --num_lines is given


Presence Arguments

import argparse

parser = argparse.ArgumentParser(description='fun with files')

# Positional Arguments
parser.add_argument('infile', help='file to read')
parser.add_argument('outfile', help='file to write')

# Presence argument: takes no value, only its presence matters
parser.add_argument('--verbose', help='how wordy',
                    action='store_true')

args = parser.parse_args()

print(args.infile)
print(args.outfile)
print(args.verbose)  # True only if --verbose was passed


Multiple descriptors

import argparse

parser = argparse.ArgumentParser(description='fun with files')

# Positional Arguments
parser.add_argument('infile', help='file to read')
parser.add_argument('outfile', help='file to write')

# Optional flag with both a short and a long name
parser.add_argument('-v', '--verbose', help='how wordy',
                    action='store_true')

args = parser.parse_args()

print(args.infile)
print(args.outfile)
print(args.verbose)


Typed Arguments

import argparse

parser = argparse.ArgumentParser(description='fun with files')

# Positional Arguments
parser.add_argument('infile', type=str, help='file to read')
parser.add_argument('outfile', type=str, help='file to write')

# type=int converts the value and rejects non-integer input
parser.add_argument('-n', '--num_lines', type=int,
                    help='number of lines to read')

args = parser.parse_args()

print(args.infile)
print(args.outfile)
print(args.num_lines)


Default values

import argparse

parser = argparse.ArgumentParser(description='fun with files')

# Positional Arguments
parser.add_argument('infile', type=str, help='file to read')
parser.add_argument('outfile', type=str, help='file to write')

# default is used when the option is omitted
parser.add_argument('-n', '--num_lines', type=int,
                    help='number of lines to read',
                    default=10)

args = parser.parse_args()

print(args.infile)
print(args.outfile)
print(args.num_lines)  # 10 unless -n/--num_lines is given
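
To see what typed arguments and default values produce without running a script from the shell, parse_args() can be given an explicit list of strings in place of the real command line. The sketch below does this with the same parser; the file names in.txt and out.txt are placeholders and are never opened.

```python
import argparse

parser = argparse.ArgumentParser(description="fun with files")
parser.add_argument("infile", type=str, help="file to read")
parser.add_argument("outfile", type=str, help="file to write")
parser.add_argument("-n", "--num_lines", type=int, default=10,
                    help="number of lines to read")

# Passing a list to parse_args() parses it instead of sys.argv[1:].
with_default = parser.parse_args(["in.txt", "out.txt"])
with_value = parser.parse_args(["in.txt", "out.txt", "-n", "4"])

print(with_default.num_lines)  # prints 10
print(with_value.num_lines)    # prints 4
```

Because of type=int, the value 4 arrives as an integer and not as the string "4".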


Documentation and Comments

Why bother?


Good Documentation

• Every file gets a header describing what it does.

• Every function includes a triple-quoted string (a docstring) describing what it does.
  – This allows help() to work
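
A small sketch of both bullets follows: a file header and a function docstring. The module description and the function count_words are hypothetical examples, not code from the lecture.

```python
"""Hypothetical file header: utilities for counting words in text."""


def count_words(text):
    """Return the number of whitespace-separated words in `text`."""
    return len(text.split())


# help(count_words) displays the docstring; it is also stored on __doc__.
print(count_words.__doc__)
print(count_words("counting more things"))  # prints 3
```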


Documentation vs. Comments

• There are differing philosophies here.

• Documentation is for ‘what is done’

• Comments are for ‘how it’s done’


Effective Variable & Function Names

x = a + b

x = x / m

y = x * x


num_things = thing_count + thang_count

avg_things = num_things / document_count

sq_avg_things = avg_things * avg_things


A simple, but good piece of code

• Let's write code that
  – reads a file
  – counts the number of words
  – writes a file containing the frequency of the N most-frequent words (or all of the words if N isn't specified).
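
One way the pieces from this lecture could fit together is sketched below. It is a sketch under stated assumptions, not the program written in class: the option name --num_words, the tab-separated output format, and splitting on whitespace are all choices made here. The demonstration at the bottom uses a temporary file so the block runs without any setup.

```python
import argparse
import os
import tempfile
from collections import Counter


def count_words(text):
    """Return a Counter mapping each whitespace-separated word to its count."""
    return Counter(text.split())


def main(argv=None):
    """Read infile, count words, and write 'word<TAB>count' lines to outfile."""
    parser = argparse.ArgumentParser(description="count word frequencies")
    parser.add_argument("infile", help="file to read")
    parser.add_argument("outfile", help="file to write")
    parser.add_argument("-n", "--num_words", type=int, default=None,
                        help="write only the N most frequent words")
    args = parser.parse_args(argv)  # argv=None means use sys.argv[1:]

    with open(args.infile) as in_handle:
        counts = count_words(in_handle.read())

    # most_common(None) returns every word, ordered by descending count.
    with open(args.outfile, "w") as out_handle:
        for word, count in counts.most_common(args.num_words):
            out_handle.write("%s\t%d\n" % (word, count))


# Demonstration on a temporary file (hypothetical input text).
workdir = tempfile.mkdtemp()
infile = os.path.join(workdir, "in.txt")
outfile = os.path.join(workdir, "out.txt")
with open(infile, "w") as handle:
    handle.write("a b a c a b\n")

main([infile, outfile, "-n", "2"])
with open(outfile) as handle:
    result = handle.read()
print(result)
```

To use this as a command-line tool, the demonstration lines would be replaced by a call to main() with no arguments, so that the values come from the shell.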


Next Time

• Matching Things
  – Regular Expressions and Finite State Machines