Reading Files - University of Michigan
Reading Files
Chapter 7
Python for Informatics: Exploring Information
www.py4inf.com
Unless otherwise noted, the content of this course material is licensed under a Creative Commons Attribution 3.0 License. http://creativecommons.org/licenses/by/3.0/
Copyright 2010, Charles Severance
[Figure: computer hardware diagram. Labels: Software, Input and Output Devices, Central Processing Unit, Main Memory, Secondary Memory, What Next? Code fragment shown: if x < 3: print. Caption: It is time to go find some Data to mess with!]
From [email protected] Sat Jan 5 09:14:16 2008
Return-Path: <[email protected]>
Date: Sat, 5 Jan 2008 09:12:18 -0500
To: [email protected]
From: [email protected]
Subject: [sakai] svn commit: r39772 - content/branches/
Details: http://source.sakaiproject.org/viewsvn/?view=rev&rev=39772
...
File Processing
• A text file can be thought of as a sequence of lines
From [email protected] Sat Jan 5 09:14:16 2008
Return-Path: <[email protected]>
Date: Sat, 5 Jan 2008 09:12:18 -0500
To: [email protected]
From: [email protected]
Subject: [sakai] svn commit: r39772 - content/branches/
Details: http://source.sakaiproject.org/viewsvn/?view=rev&rev=39772
Opening a File
• Before we can read the contents of the file we must tell Python which file we are going to work with and what we will be doing with the file
• This is done with the open() function
• open() returns a “file handle” - a variable used to perform operations on the file
• Kind of like “File -> Open” in a Word Processor
Using open()
• handle = open(filename, mode)
• returns a handle used to manipulate the file
• filename is a string
• mode is optional and should be “r” if we are planning to read the file and “w” if we are going to write to the file.
http://docs.python.org/lib/built-in-funcs.html
fhand = open("mbox.txt", "r")
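A minimal sketch of the two modes, written in Python 3 syntax (the slides use Python 2). The file name sample.txt and its contents are made up for illustration, and the with statement is an addition not covered on the slide; it closes the file automatically when the block ends.

```python
import os
import tempfile

# Hypothetical sample file so the sketch runs anywhere; the slides use mbox.txt.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")

# Mode "w" opens the file for writing, creating it if it does not exist.
with open(path, "w") as fhand:
    fhand.write("first line\nsecond line\n")

# Mode "r" opens the file for reading; it is also the default when mode is omitted.
with open(path, "r") as fhand:
    first = fhand.readline()

print(first)
```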
What is a Handle?
>>> fhand = open('mbox.txt')
>>> print fhand
<open file 'mbox.txt', mode 'r' at 0x1005088b0>

>>> fhand = open("stuff.txt")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IOError: [Errno 2] No such file or directory: 'stuff.txt'
When Files are Missing
>>> fhand = open("stuff.txt")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IOError: [Errno 2] No such file or directory: 'stuff.txt'
The newline Character
• We use a special character called the "newline" to indicate when a line ends
• We represent it as \n in strings
• Newline is still one character - not two
>>> stuff = 'Hello\nWorld!'
>>> stuff
'Hello\nWorld!'
>>> print stuff
Hello
World!
>>> stuff = 'X\nY'
>>> print stuff
X
Y
>>> len(stuff)
3
File Processing
• A text file can be thought of as a sequence of lines
From [email protected] Sat Jan 5 09:14:16 2008
Return-Path: <[email protected]>
Date: Sat, 5 Jan 2008 09:12:18 -0500
To: [email protected]
From: [email protected]
Subject: [sakai] svn commit: r39772 - content/branches/
Details: http://source.sakaiproject.org/viewsvn/?view=rev&rev=39772
File Processing
• A text file has newlines at the end of each line
From [email protected] Sat Jan 5 09:14:16 2008\n
Return-Path: <[email protected]>\n
Date: Sat, 5 Jan 2008 09:12:18 -0500\n
To: [email protected]\n
From: [email protected]\n
Subject: [sakai] svn commit: r39772 - content/branches/\n
Details: http://source.sakaiproject.org/viewsvn/?view=rev&rev=39772\n
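The trailing newline on each line can be made visible with repr(). This is a small sketch in Python 3 syntax (the slides use Python 2); the two-line sample file is invented for the example and stands in for a real mailbox file.

```python
import os
import tempfile

# Hypothetical two-line file standing in for a real mailbox file.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w") as fhand:
    fhand.write("Date: Sat, 5 Jan 2008\nSubject: hello\n")

# Collect the lines exactly as the file handle delivers them.
with open(path) as fhand:
    raw_lines = [line for line in fhand]

for line in raw_lines:
    # repr() displays the newline character as the escape sequence \n.
    print(repr(line))
```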
File Handle as a Sequence
• A file handle open for read can be treated as a sequence of strings where each line in the file is a string in the sequence
• We can use the for statement to iterate through a sequence
• Remember - a sequence is an ordered set
xfile = open("mbox.txt", "r")
for cheese in xfile:
    print cheese
Counting Lines in a File
• Open a file read-only
• Use a for loop to read each line
• Count the lines and print out the number of lines
fhand = open("mbox.txt")
count = 0
for line in fhand:
    count = count + 1
print "Line Count:", count

python open.py
Line Count: 132045
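The same counting pattern can be wrapped in a function. This is a sketch in Python 3 syntax (the slides use Python 2); the name count_lines and the three-line sample file are assumptions made for the example.

```python
import os
import tempfile

def count_lines(filename):
    """Return the number of lines in a text file (hypothetical helper)."""
    count = 0
    with open(filename) as fhand:
        # The file handle yields one line per iteration.
        for line in fhand:
            count = count + 1
    return count

# Made-up sample file with three lines.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w") as fhand:
    fhand.write("one\ntwo\nthree\n")

line_count = count_lines(path)
print("Line Count:", line_count)
```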
Reading the *Whole* File
• We can read the whole file (newlines and all) into a single string.
>>> fhand = open("mbox-short.txt")
>>> inp = fhand.read()
>>> print len(inp)
94626
>>> print inp[:20]
From stephen.marquar
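A short Python 3 sketch of the same idea on a tiny invented file (the slides use Python 2 and mbox-short.txt). Because read() returns one string, len() counts every character, newlines included, and slicing works as on any string.

```python
import os
import tempfile

# Made-up two-line file; its contents are not from the course data.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w") as fhand:
    fhand.write("From alice\nTo bob\n")

with open(path) as fhand:
    inp = fhand.read()   # the whole file as a single string

print(len(inp))          # character count, including both newlines
print(inp[:10])          # the first ten characters
```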
Searching Through a File
• We can put an if statement in our for loop to only print lines that meet some criteria
fhand = open("mbox-short.txt")
for line in fhand:
    if line.startswith('From:') :
        print line
OOPS!
From: [email protected]

From: [email protected]

From: [email protected]

From: [email protected]

...
What are all these blank lines doing here?
OOPS!
From: [email protected]\n
\n
From: [email protected]\n
\n
From: [email protected]\n
\n
From: [email protected]\n
...
What are all these blank lines doing here?
The print statement adds a newline to each line.
Each line from the file also has a newline at the end.
Searching Through a File (fixed)
• We can strip the whitespace from the right-hand side of the string using the rstrip() string method
• The newline is considered "white space" and is stripped
fhand = open("mbox-short.txt")
for line in fhand:
    line = line.rstrip()
    if line.startswith('From:') :
        print line
From: [email protected]
From: [email protected]
From: [email protected]
From: [email protected]
....
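A quick check of what rstrip() does, in Python 3 syntax (the slides use Python 2). The address is a made-up placeholder. Note that rstrip() with no arguments removes all trailing whitespace, not only the newline.

```python
# Hypothetical line as it would arrive from a file, newline included.
line = "From: alice@example.com\n"

stripped = line.rstrip()
print(repr(line))       # trailing \n still present
print(repr(stripped))   # trailing \n removed

# Trailing spaces and tabs are removed as well; leading whitespace is kept.
padded = "  data \t\n"
cleaned = padded.rstrip()
print(repr(cleaned))
```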
Skipping with continue
• We can conveniently skip a line by using the continue statement
fhand = open("mbox-short.txt")
for line in fhand:
    line = line.rstrip()
    # Skip 'uninteresting lines'
    if not line.startswith('From:') :
        continue
    # Process our 'interesting' line
    print line
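The skip-then-process pattern can be tried without a mailbox file. In this Python 3 sketch (the slides use Python 2), io.StringIO stands in for an open file handle, and the sample text, addresses, and the function name from_header_lines are all invented for illustration.

```python
import io

# Made-up mailbox-style text; io.StringIO iterates line by line like a file handle.
SAMPLE = (
    "From alice@example.com Sat Jan  5 09:14:16 2008\n"
    "Subject: hello\n"
    "From: alice@example.com\n"
    "\n"
    "From: bob@example.com\n"
)

def from_header_lines(fhand):
    """Collect the lines that start with 'From:' (hypothetical helper)."""
    matches = []
    for line in fhand:
        line = line.rstrip()
        # Skip uninteresting lines and go straight to the next iteration.
        if not line.startswith("From:"):
            continue
        # Only interesting lines reach this point.
        matches.append(line)
    return matches

matches = from_header_lines(io.StringIO(SAMPLE))
for line in matches:
    print(line)
```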
Using in to select lines
• We can look for a string anywhere in a line as our selection criteria
fhand = open("mbox-short.txt")
for line in fhand:
    line = line.rstrip()
    if (not '@uct.ac.za' in line) :
        continue
    print line
From [email protected] Sat Jan 5 09:14:16 2008
X-Authentication-Warning: set sender to [email protected] using -f
From: [email protected]
Author: [email protected]
From [email protected] Fri Jan 4 07:02:32 2008
X-Authentication-Warning: set sender to [email protected] using -f
...
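A sketch of substring selection in Python 3 syntax (the slides use Python 2). The sample text, the example.org and example.com domains, and the function name lines_containing are assumptions for illustration; io.StringIO stands in for an open file. The test is written as "text not in line", which behaves the same as "not text in line".

```python
import io

# Made-up text with two different mail domains.
SAMPLE = (
    "From alice@example.org Sat Jan  5 09:14:16 2008\n"
    "Subject: meeting notes\n"
    "From: alice@example.org\n"
    "From: bob@example.com\n"
)

def lines_containing(fhand, text):
    """Return the lines that contain text anywhere (hypothetical helper)."""
    selected = []
    for line in fhand:
        line = line.rstrip()
        # The in operator matches the substring at any position in the line.
        if text not in line:
            continue
        selected.append(line)
    return selected

selected = lines_containing(io.StringIO(SAMPLE), "@example.org")
for line in selected:
    print(line)
```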
Review: Splitting Lines
• The split() function breaks lines based on whitespace
>>> line = 'Have a nice day'
>>> words = line.split()
>>> print words
['Have', 'a', 'nice', 'day']
>>> print len(words)
4
>>> print words[2]
nice
fhand = open("mbox-short.txt")
for line in fhand:
    line = line.rstrip()
    if not line.startswith('From ') :
        continue
    words = line.split()
    print words[2]

Sat
Fri
Fri
Fri
...

From [email protected] Sat Jan 5 09:14:16 2008

>>> line = "From [email protected] Sat Jan 5 09:14:16 2008"
>>> words = line.split()
>>> print words
['From', '[email protected]', 'Sat', 'Jan', '5', '09:14:16', '2008']
>>>
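The day-of-week extraction can be checked on a single line. This Python 3 sketch (the slides use Python 2) uses a made-up address; the double space before the day number is deliberate, to show that split() with no arguments treats any run of whitespace as one separator.

```python
# Hypothetical 'From ' line; note the two spaces between "Jan" and "5".
line = "From alice@example.com Sat Jan  5 09:14:16 2008"

words = line.split()   # split on runs of whitespace
day = words[2]         # the third word is the day of the week

print(words)
print(day)
```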
Prompt for File Name
fname = raw_input("Enter the file name: ")
fhand = open(fname)
count = 0
for line in fhand:
    if line.startswith('Subject:') :
        count = count + 1
print 'There were', count, 'subject lines in', fname

python search6.py
Enter the file name: mbox.txt
There were 1797 subject lines in mbox.txt

python search6.py
Enter the file name: mbox-short.txt
There were 27 subject lines in mbox-short.txt
Bad File Names
fname = raw_input("Enter the file name: ")
try:
    fhand = open(fname)
except:
    print 'File cannot be opened:', fname
    exit()
count = 0
for line in fhand:
    if line.startswith('Subject:') :
        count = count + 1
print 'There were', count, 'subject lines in', fname

python search7.py
Enter the file name: mbox.txt
There were 1797 subject lines in mbox.txt

python search7.py
Enter the file name: na na boo boo
File cannot be opened: na na boo boo
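A variation on the same idea in Python 3 syntax (the slides use Python 2), offered as a sketch rather than the course's own solution. It catches OSError specifically instead of using a bare except, and returns None instead of calling exit() so the function can be reused; the function name and sample file are assumptions made for the example.

```python
import os
import tempfile

def count_subject_lines(fname):
    """Count lines starting with 'Subject:'; return None if the file cannot be opened."""
    try:
        fhand = open(fname)
    except OSError:
        # In Python 3 a missing file raises FileNotFoundError, a subclass of OSError.
        print("File cannot be opened:", fname)
        return None
    count = 0
    with fhand:
        for line in fhand:
            if line.startswith("Subject:"):
                count = count + 1
    return count

# Made-up sample file with two Subject: lines.
folder = tempfile.mkdtemp()
good = os.path.join(folder, "sample.txt")
with open(good, "w") as out:
    out.write("Subject: one\nFrom: alice@example.com\nSubject: two\n")

found = count_subject_lines(good)
missing = count_subject_lines(os.path.join(folder, "na na boo boo"))
print("There were", found, "subject lines in", good)
```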
Mystery Problem...
fhand = open("mbox-short.txt")
for line in fhand:
    words = line.split()
    if words[0] != 'From' : continue
    print words[2]

python search8.py
Sat
Traceback (most recent call last):
  File "search8.py", line 5, in <module>
    if words[0] != 'From' : continue
IndexError: list index out of range
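One likely explanation, assuming the file contains blank lines (the slide does not say): splitting a blank line gives an empty list, so words[0] does not exist. The Python 3 sketch below (the slides use Python 2) adds a guard before indexing; the sample text and the function name days_of_week are invented for illustration, and io.StringIO stands in for an open file.

```python
import io

# Made-up text that includes a blank line, which triggers the problem.
SAMPLE = (
    "From alice@example.com Sat Jan  5 09:14:16 2008\n"
    "\n"
    "From: alice@example.com\n"
    "From bob@example.com Fri Jan  4 07:02:32 2008\n"
)

def days_of_week(fhand):
    """Return the third word of each line whose first word is 'From' (hypothetical helper)."""
    days = []
    for line in fhand:
        words = line.split()
        # Guard: a blank line splits into [], so check the length before indexing.
        if len(words) < 3 or words[0] != "From":
            continue
        days.append(words[2])
    return days

days = days_of_week(io.StringIO(SAMPLE))
print(days)
```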
Summary
• Secondary storage
• Opening a file - file handle
• File structure - newline character
• Reading a file line-by-line with a for loop
• Reading the whole file as a string
• Searching for lines
• Stripping white space
• Using continue
• Using in as an operator
• Reading a file and splitting lines
• Reading file names
• Dealing with bad files
Exercise 7.3
Write a program to read through a file and print the contents of the file (line by line) all in upper case. Executing the program will look as follows:
python shout.py
Enter a file name: mbox-short.txt
FROM [email protected] SAT JAN 5 09:14:16 2008
RETURN-PATH: <[email protected]>
RECEIVED: FROM MURDER (MAIL.UMICH.EDU [141.211.14.90])
     BY FRANKENSTEIN.MAIL.UMICH.EDU (CYRUS V2.3.8) WITH LMTPA;
     SAT, 05 JAN 2008 09:14:16 -0500
Exercise 7.4
Write a program to loop through a mailbox-format file and look for lines of the form:
X-DSPAM-Confidence: 0.8475
Use find and string slicing to extract the portion of the string after the colon character and then use the float function to convert the extracted string into a floating point number. Count these lines and then compute the total of the spam confidence values from these lines. When you reach the end of the file, print out the average spam confidence.
Enter the file name: mbox.txt
Average spam confidence: 0.894128046745

Enter the file name: mbox-short.txt
Average spam confidence: 0.750718518519
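As a hint for the extraction step only, here is a Python 3 sketch (the slides use Python 2) applied to the single sample line from the exercise. It is one possible approach, not the official solution; looping over the file, counting, and averaging are left to the reader.

```python
line = "X-DSPAM-Confidence: 0.8475"

# find() returns the index of the first colon in the line.
colon = line.find(":")

# Slice everything after the colon; float() ignores the leading space.
value = float(line[colon + 1:])

print(value)
```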