Vocabulary Size of Moby Dick

March 6, 2014

CSCI 0931 - Intro. to Comp. for the Humanities and Social Sciences

The Big Picture

Overall Goal: Build a concordance of a text

• Locations of words
• Frequency of words

Today: Summary Statistics

• Get the vocabulary size of Moby Dick (Attempt 1)
• Write test cases to make sure our program works
• Think of a faster way to compute the vocabulary size

Project 1

• Grades back

• Check the rubric that was shared with you for your proposal

HW 2-2

• What problems did you encounter?

– Error messages?

– Understanding how ‘for loops’ work

– Using a variable to accumulate some value (e.g., a count, a sum, a list) over the course of running the loop

From Last Class

• Finding average word length, longest word

Compute the Average Word Length of Moby Dick

def avgWordLengthInMobyDick():
    '''Gets the average word length in MobyDick.txt'''
    myList = readMobyDick()
    s = 0 # tally of lengths of all words encountered so far
    for word in myList:
        s = s + len(word)
    avg = s/float(len(myList))
    return avg

Is our Program Correct?

>>> MDList = readMobyDick()
>>> MDList[0:99]
['CHAPTER', '1', 'Loomings', 'Call', 'me', 'Ishmael.',
 'Some', 'years', 'ago--never', 'mind', 'how', 'long',
 'precisely--', 'having', 'little', 'or', 'no', 'money', 'in',
 'my', 'purse,', 'and', 'nothing', 'particular', 'to',
 'interest', 'me', 'on', 'shore,', 'I', 'thought', 'I',
 'would', 'sail', 'about', 'a', 'little', 'and', 'see', 'the',
 'watery', 'part', 'of', 'the', 'world.', 'It', 'is', 'a',
 'way', 'I', 'have', 'of', 'driving', 'off', 'the', 'spleen',
 'and', 'regulating', 'the', 'circulation.', 'Whenever', 'I',
 'find', 'myself', 'growing', 'grim', 'about', 'the',
 'mouth;', 'whenever', 'it', 'is', 'a', 'damp,', 'drizzly',
 'November', 'in', 'my', 'soul;', 'whenever', 'I', 'find',
 'myself', 'involuntarily', 'pausing', 'before', 'coffin',
 'warehouses,', 'and', 'bringing', 'up', 'the', 'rear', 'of',
 'every', 'funeral', 'I', 'meet;', 'and']
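The word list above keeps punctuation attached to words ('Ishmael.', 'precisely--'), so the average length counts those characters as letters. The following is a minimal sketch of one way to see the effect on a tiny sample. It is written for Python 3, and the helper names stripPunctuation and avgLength are assumptions for illustration, not part of the course code.

```python
import string

def stripPunctuation(word):
    '''Removes punctuation from both ends of a word (hypothetical helper).'''
    return word.strip(string.punctuation)

def avgLength(wordList):
    '''Average length of the words in a nonempty list.'''
    total = 0
    for word in wordList:
        total = total + len(word)
    return total / float(len(wordList))

sample = ['Call', 'me', 'Ishmael.']   # three tokens from the list above
cleaned = []
for word in sample:
    cleaned.append(stripPunctuation(word))

print(avgLength(sample))    # raw tokens: (4 + 2 + 8) / 3
print(avgLength(cleaned))   # punctuation stripped: (4 + 2 + 7) / 3
```

Whether punctuation should count is a decision about what "word" means for the analysis; the sketch only shows that the choice changes the answer.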

Get the Longest Word in Moby Dick

def getLongestWordInMobyDick():
    '''Returns the longest word in MobyDick.txt'''
    myList = readMobyDick()
    longestWord = ""
    for word in myList:
        if len(word) > len(longestWord):
            longestWord = word
    return longestWord

Is our program correct?

Why use functions?

• Break up tasks into smaller tasks

– Test smaller tasks; then assemble!

• Functions allow generalization!
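As a sketch of what generalization can look like, the longest-word function can take the word list as an argument instead of reading Moby Dick itself, which makes it easy to test on small lists. The name longestWord is an assumption for illustration, and the code is written for Python 3.

```python
def longestWord(wordList):
    '''Returns the longest word in wordList ("" if the list is empty).'''
    longest = ""
    for word in wordList:
        if len(word) > len(longest):   # strict >, so the first of any tie wins
            longest = word
    return longest

print(longestWord(['Call', 'me', 'Ishmael.']))
print(longestWord(['ab', 'cd']))
```

Because the comparison is strict, a later word of equal length never replaces an earlier one, which is worth knowing when writing test cases.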

Compute the Average Word Length of a word list

def avgWordLength(wordList):
    '''Average word length in a nonempty list of words'''
    s = 0 # tally of lengths of all words encountered so far
    for word in wordList:
        s = s + len(word)
    avg = s/float(len(wordList)) # assumes wordList nonempty!
    return avg

Designing functions

• What constitutes a “smaller task”?

• Bad choice: “find average word length in first third of Moby Dick”

• Good choice: “read in list of all words of Moby Dick”; “compute average word length in list”

• For now...we’ll guide you on this.

ACT2-4

• Do Task 1 – Practice spotting errors in functions

Debugging Programs

Function            Example                 Notes
addTwo(x,y)         addTwo(2,0)             output is wrong
subtractTwo(x,y)    subtractTwo(2,0)        any input causes an error
multiplyTwo(x,y)    z = multiplyTwo(2,0)    What is z after this assignment?
divideTwo(x,y)      divideTwo(2,0)          What happens when I run this? Use an “if” to catch the error and print a message to the screen.
addList(myList)     addList([2,0])          any input causes an error
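The divideTwo note above asks for an “if” that catches the error and prints a message. The body of divideTwo in ACT2-4 is not shown in these slides, so the version below is only a hypothetical sketch of that idea, written for Python 3; returning None after the message is also an assumption.

```python
def divideTwo(x, y):
    '''Returns x divided by y, or None (after printing a message) if y is 0.'''
    if y == 0:
        # check before dividing, so Python never raises ZeroDivisionError
        print("Error: cannot divide by zero")
        return None
    return x / float(y)

print(divideTwo(6, 3))   # 2.0
divideTwo(2, 0)          # prints the error message
```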

Boolean Expressions on Numbers

Expression                                Evaluation                 Final
(100 > 101) and (-1 != -1)                False and False            False
(1 <= 2) or ((1 == 1) and (1 != 2))       True or (True and True)    True
x = 1
not(x > 2) or (x <= x+1)                  True or True               True
y = 100
not(not(y == 100))                        not(False)                 True
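The evaluations in the table can be checked by typing the expressions into Python directly. A short sketch, written for Python 3; the names r1 to r4 are introduced here only to hold the results.

```python
r1 = (100 > 101) and (-1 != -1)
r2 = (1 <= 2) or ((1 == 1) and (1 != 2))

x = 1
r3 = not(x > 2) or (x <= x+1)

y = 100
r4 = not(not(y == 100))

print(r1)   # False
print(r2)   # True
print(r3)   # True
print(r4)   # True
```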

Boolean Expressions on Strings

Boolean Operators on Strings

Operator      Example       Result
Equality      'a' == 'b'    False
Inequality    'a' != 'b'    True

>>> 'apple' == 'apple'
True
>>> 'apple' == 'Apple'
False
>>> 'apple' == 'apple!'
False

Save ACT2-4.py and MobyDick.txt to the same directory

Writing a vocabSize Function

def noReplicates(wordList):
    '''takes a list as argument, returns a list free
    of replicate items. slow implementation.'''

def isElementOf(myElement,myList):
    '''takes a string and a list and returns True if the
    string is in the list and False otherwise.'''

def vocabSize():
    myList = readMobyDickShort()
    uniqueList = noReplicates(myList)
    return len(uniqueList)

def testNoReplicates():
    pass # test cases to be written

def testIsElementOf():
    pass # test cases to be written

Writing test cases is important to make sure your program works!
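The slides give only the docstrings for noReplicates and isElementOf. The following is one possible slow implementation that matches those docstrings, together with small test functions. It is a sketch written for Python 3, not the course's official solution, and the use of assert in the tests is an assumption about how test cases might be written.

```python
def isElementOf(myElement, myList):
    '''Returns True if myElement is in myList and False otherwise.'''
    for item in myList:
        if item == myElement:
            return True
    return False

def noReplicates(wordList):
    '''Returns a list free of replicate items. Slow implementation.'''
    uniqueList = []
    for word in wordList:
        # scan everything kept so far before adding a new word
        if not isElementOf(word, uniqueList):
            uniqueList.append(word)
    return uniqueList

def testIsElementOf():
    assert isElementOf('a', ['b', 'a'])
    assert not isElementOf('a', [])
    assert not isElementOf('apple', ['Apple', 'apple!'])

def testNoReplicates():
    assert noReplicates([]) == []
    assert noReplicates(['a', 'b', 'a']) == ['a', 'b']
    assert noReplicates(['the', 'The', 'the']) == ['the', 'The']

testIsElementOf()
testNoReplicates()
print(len(noReplicates(['Call', 'me', 'me'])))   # 2
```

The test cases include an empty list and words that differ only in capitalization, since string equality treats 'the' and 'The' as different words.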

What does slow implementation mean?

• Replace readMobyDickShort() with readMobyDickAll()

• Now, run vocabSize()

– Hint: Ctrl-C (or Command-C) will abort the call.

• Faster way to write noReplicates()

– What if we can sort the list? ['a','a','a','and','and','at',…,'zebra']
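One way to see why the scan-based approach is slow is to count how many comparisons it makes as the list grows. The sketch below assumes a noReplicates that checks each word against every unique word kept so far; countComparisons is a hypothetical helper, and the code is written for Python 3.

```python
def countComparisons(wordList):
    '''Counts the == comparisons made by a scan-based noReplicates.'''
    uniqueList = []
    comparisons = 0
    for word in wordList:
        found = False
        for item in uniqueList:
            comparisons = comparisons + 1
            if item == word:
                found = True
                break
        if not found:
            uniqueList.append(word)
    return comparisons

for n in [100, 200, 400]:
    words = []
    for i in range(n):
        words.append(str(i))      # n distinct "words"
    print(n, countComparisons(words))
```

For lists of distinct words the counts are 4950, 19900 and 79800: doubling the list length roughly quadruples the work, which is why the full text takes so long.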

Sorting Lists

Preloaded Functions

Name    Inputs    Outputs    CHANGES
sort    List      (none)     Original List!

>>> myList = [0,4,1,5,-1,6]
>>> myList.sort()
>>> myList
[-1, 0, 1, 4, 5, 6]
>>> myList = ['b','d','c','a','z','i']
>>> myList.sort()
>>> myList
['a', 'b', 'c', 'd', 'i', 'z']
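Because sort changes the original list rather than producing an output, assigning its result to a variable is a common mistake. The sketch below, written for Python 3, shows this, and also shows Python's built-in sorted(), which is not covered in these slides and which returns a new list while leaving the original alone.

```python
myList = [0, 4, 1, 5, -1, 6]
result = myList.sort()        # sorts myList itself; there is no output value
print(result)                 # None
print(myList)                 # [-1, 0, 1, 4, 5, 6]

original = ['b', 'd', 'c', 'a']
ordered = sorted(original)    # built-in sorted() returns a new sorted list
print(original)               # ['b', 'd', 'c', 'a']
print(ordered)                # ['a', 'b', 'c', 'd']
```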

Python For Statements (For Loops)

“For each element in list myList, do something”

>>> myList = [1,2,3]
>>> for i in range(0,3):
        print myList[i]

1
2
3
>>>

Here i is the loop variable, and range(0,3) is the list [0,1,2].

Q: What if we don’t know the length of the list?

A: Use len(myList) as the end of the range.

>>> myList = [1,2,3]
>>> for i in range(0,len(myList)):
        print myList[i]

1
2
3
>>>

The same index can step through two lists of equal length at once:

>>> myListA = [1,2,3]
>>> myListB = [4,5,6]
>>> for i in range(0,len(myListA)):
        print myListA[i], myListB[i]

1 4
2 5
3 6
>>>
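Sorting and index-based loops can be combined into a faster way to remove replicates: once the list is sorted, equal words sit next to each other, so each word only needs to be compared with the one just before it. This is a sketch of that idea written for Python 3, not necessarily the version developed in class, and the name noReplicatesSorted is an assumption.

```python
def noReplicatesSorted(wordList):
    '''Returns a sorted list of the distinct items in wordList.'''
    sortedList = sorted(wordList)     # a sorted copy; wordList is unchanged
    uniqueList = []
    for i in range(0, len(sortedList)):
        # keep an item only if it differs from the one just before it
        if i == 0 or sortedList[i] != sortedList[i-1]:
            uniqueList.append(sortedList[i])
    return uniqueList

words = ['and', 'a', 'at', 'a', 'and', 'zebra', 'a']
print(noReplicatesSorted(words))        # ['a', 'and', 'at', 'zebra']
print(len(noReplicatesSorted(words)))   # 4
```

Unlike the scan-based version, the result comes back in sorted order rather than in order of first appearance, which does not matter when only the vocabulary size is needed.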
