Python in Action (Part 2)


Description

Official tutorial slides from USENIX LISA, Nov. 16, 2007.

Transcript of Python in Action (Part 2)

Page 1: Python in Action (Part 2)

Python in Action
(Part II - Systems Programming)

Presented at USENIX LISA Conference
November 16, 2007

David M. Beazley
http://www.dabeaz.com

Page 2: Python in Action (Part 2)

Section Overview

• In this section, we're going to get dirty

• Systems Programming

• Files, I/O, file-system

• Text parsing, data decoding

• Processes and IPC

• Networking

• Threads and concurrency

Page 3: Python in Action (Part 2)

Commentary

• I personally think Python is a fantastic tool for systems programming.

• Modules provide access to most of the major system libraries I used to access via C

• No enforcement of "morality"

• Decent performance

• It just "works" and it's fun

Page 4: Python in Action (Part 2)

Approach

• I've thought long and hard about how I would present this part of the class.

• A reference manual approach would probably be long and very boring.

• So instead, we're going to focus on building something more in tune with the times.

Page 5: Python in Action (Part 2)

"To Catch a Slacker"

• Write a collection of Python programs that can quietly monitor Firefox browser caches to find out who has been spending their day reading Slashdot instead of working on their TPS reports.

• Oh yeah, and be a real sneaky bugger about it.

Page 6: Python in Action (Part 2)

Why this Problem?

• Involves a real-world system and data

• Firefox already installed on your machine (?)

• Cross platform (Linux, Mac, Windows)

• Example of tool building

• Related to a variety of practical problems

• A good tour of "Python in Action"

Page 7: Python in Action (Part 2)

Disclaimers

• I am not involved in browser forensics (or spyware for that matter).

• I am in no way affiliated with Firefox/Mozilla nor have I ever seen Firefox source code

• I have never worked with the cache data prior to preparing this tutorial

• I have never used any third-party tools for looking at this data.

Page 8: Python in Action (Part 2)

More Disclaimers

• All of the code in this tutorial works with a standard Python installation

• No third party modules.

• All code is cross-platform

• Code samples are available online at http://www.dabeaz.com/action/

• Please look at that code and follow along

Page 9: Python in Action (Part 2)

Assumptions

• This is not a tutorial on systems concepts

• You should be generally familiar with the background material (files, filesystems, file formats, processes, threads, networking, protocols, etc.)

• Hopefully you can "extrapolate" from the material presented here to construct more advanced Python applications.

Page 10: Python in Action (Part 2)

The Big Picture

• We want to write a tool that allows someone to locate, inspect, and perform queries across a distributed collection of Firefox caches.

• For example, the cache directories on all machines on the LAN of a quasi-evil corporation.

Page 11: Python in Action (Part 2)

The Firefox Cache

• The Firefox browser keeps a disk cache of recently visited sites:

    % ls Cache/
    -rw------- 1 beazley  111169 Sep 25 17:15 01CC0844d01
    -rw------- 1 beazley  104991 Sep 25 17:15 01CC3844d01
    -rw------- 1 beazley   47233 Sep 24 16:41 021F221Ad01
    ...
    -rw------- 1 beazley   26749 Sep 21 11:19 FF8AEDF0d01
    -rw------- 1 beazley   58172 Sep 25 18:16 FFE628C6d01
    -rw------- 1 beazley 1939456 Sep 25 19:14 _CACHE_001_
    -rw------- 1 beazley 2588672 Sep 25 19:14 _CACHE_002_
    -rw------- 1 beazley 4567040 Sep 25 18:44 _CACHE_003_
    -rw------- 1 beazley   33044 Sep 23 21:58 _CACHE_MAP_

• A bunch of cryptically named files.

Page 12: Python in Action (Part 2)

Problem : Finding Files

• Find the Firefox cache

Write a program findcache.py that takes a directory name as input and recursively scans that directory and all subdirectories looking for Firefox/Mozilla cache directories.

• Example:

    % python findcache.py /Users/beazley
    /Users/beazley/Library/.../qs1ab616.default/Cache
    /Users/beazley/Library/.../wxuoyiuf.slt/Cache
    %

• Use case: Searching for things on the filesystem.

Page 13: Python in Action (Part 2)

findcache.py

    # findcache.py
    # Recursively scan a directory looking for
    # Firefox/Mozilla cache directories

    import sys
    import os

    if len(sys.argv) != 2:
        print >>sys.stderr,"Usage: python findcache.py dirname"
        raise SystemExit(1)

    caches = (path for path,dirs,files in os.walk(sys.argv[1])
                   if '_CACHE_MAP_' in files)

    for name in caches:
        print name

Page 14: Python in Action (Part 2)

The sys module

[findcache.py listing repeated]

The sys module has basic information related to the execution environment.

    sys.argv                           # A list of the command line options
    sys.stdin, sys.stdout, sys.stderr  # Standard I/O files

For the example invocation above:

    sys.argv = ['findcache.py', '/Users/beazley']

Page 15: Python in Action (Part 2)

Program Termination

[findcache.py listing repeated]

SystemExit exception

Forces Python to exit. Value is the return code.

Page 16: Python in Action (Part 2)

os Module

[findcache.py listing repeated]

The os module contains useful OS related functions (files, processes, etc.)

Page 17: Python in Action (Part 2)

os.walk()

[findcache.py listing repeated]

os.walk(topdir)

Recursively walks a directory tree and generates a sequence of tuples (path,dirs,files):

    path  = The current directory name
    dirs  = List of all subdirectory names in path
    files = List of all regular files (data) in path
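To make those tuples concrete, here is a minimal sketch; the directory tree 'top' and its contents are hypothetical, used only for illustration:

    import os

    # Hypothetical tree:
    # top/
    #     a.txt
    #     sub/
    #         b.txt

    for path, dirs, files in os.walk('top'):
        print path, dirs, files

    # Would print something like:
    # top ['sub'] ['a.txt']
    # top/sub [] ['b.txt']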

Page 18: Python in Action (Part 2)

A Sequence of Caches

[findcache.py listing repeated]

The generator expression produces a sequence of directory names where '_CACHE_MAP_' is contained in the file list: path is the directory name that is generated as a result, and the if '_CACHE_MAP_' in files clause is the file name check.

Page 19: Python in Action (Part 2)

Printing the Result

[findcache.py listing repeated]

This prints the sequence of cache directories generated by the previous statement.

Page 20: Python in Action (Part 2)

Commentary

• Our solution is strongly based on a "declarative" programming style (again)

• We simply write out a sequence of operations that produce what we want

• Not focused on the underlying mechanics of how to traverse all of the directories.

Page 21: Python in Action (Part 2)

Big Idea : Iteration

• Python allows iteration to be captured as a kind of object.

    caches = (path for path,dirs,files in os.walk(sys.argv[1])
                   if '_CACHE_MAP_' in files)

• This de-couples iteration from the code that uses the iteration:

    for name in caches:
        print name

• Another usage example:

    for name in caches:
        print len(os.listdir(name)), name

Page 22: Python in Action (Part 2)

Big Idea : Iteration

• Compare to this:

    for path,dirs,files in os.walk(sys.argv[1]):
        if '_CACHE_MAP_' in files:
            print len(os.listdir(path)),path

• This code is simple, but the loop and the code that executes in the loop body are coupled together

• Not as flexible, but this is somewhat subtle to wrap your brain around at first (see the sketch below).
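A minimal sketch of the de-coupled style, assuming the same caches generator as above; report() is a hypothetical consumer added for illustration. Note that a generator can only be consumed once, so a new one must be created for each pass:

    import sys
    import os

    caches = (path for path,dirs,files in os.walk(sys.argv[1])
                   if '_CACHE_MAP_' in files)

    # Any consumer can now drive the iteration
    def report(dirnames):
        for name in dirnames:
            print len(os.listdir(name)), name

    report(caches)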

Page 23: Python in Action (Part 2)

Mini-Reference : sys, os

• sys module

    sys.argv          # List of command line options
    sys.stdin         # Standard input
    sys.stdout        # Standard output
    sys.stderr        # Standard error
    sys.executable    # Full path of Python executable
    sys.exc_info()    # Information on current exception

• os module

    os.walk(dir)      # Recursively walk dir producing a
                      # sequence of tuples (path,dlist,flist)
    os.listdir(dir)   # Return a list of all files in dir

• SystemExit exception

    raise SystemExit(n)  # Exit with integer code n

Page 24: Python in Action (Part 2)

Problem: Searching for Text

• Extract all URL requests from the cache

Write a program requests.py that scans the contents of the _CACHE_00n_ files and prints a list of URLs for documents stored in the cache.

• Example:

    % python requests.py /Users/.../qs1ab616.default/Cache
    http://www.yahoo.com/
    http://us.js2.yimg.com/us.yimg.com/a/1-/java/promotions/js/ad_eo_1.1.js
    http://us.i1.yimg.com/us.yimg.com/i/ww/atty_hsi_free.gif
    http://us.i1.yimg.com/us.yimg.com/i/ww/thm/1/search_1.1.png
    ...
    %

• Use case: Searching the contents of files for text patterns.

Page 25: Python in Action (Part 2)

The Firefox Cache

• The cache directory holds two types of data

• Metadata (URLs, headers, etc.)

• Raw data (HTML, JPEG, PNG, etc.)

• This data is stored in two places

• Cryptic files in the Cache directory

• Blocks inside the _CACHE_00n_ files

• Metadata almost always in _CACHE_00n_

Page 26: Python in Action (Part 2)

Possible Solution : Regex

• The _CACHE_00n_ files are encoded in a binary format, but URLs are embedded inside as null-terminated text:

    \x00\x01\x00\x08\x92\x00\x02\x18\x00\x00\x00\x13F\xff\x9f\xceF\xff\x9f\xce
    \x00\x00\x00\x00\x00\x00H)\x00\x00\x00\x1a\x00\x00\x023HTTP:http://slashdot.org/
    \x00request-method\x00GET\x00request-User-Agent\x00Mozilla/5.0 (Macintosh; U;
    Intel Mac OS X; en-US; rv:1.8.1.7) Gecko/20070914 Firefox/2.0.0.7\x00
    request-Accept-Encoding\x00gzip,deflate\x00response-head\x00HTTP/1.1 200 OK\r\n
    Date: Sun, 30 Sep 2007 13:07:29 GMT\r\nServer: Apache/1.3.37 (Unix) mod_perl/1.29\r\n
    SLASH_LOG_DATA: shtml\r\nX-Powered-By: Slash 2.005000176\r\nX-Fry: How can I live
    my life if I can't tell good from evil?\r\nCache-Control:

• Maybe the requests could just be ripped out using a regular expression.

Page 27: Python in Action (Part 2)

A Regex Solution

    # requests.py
    import re
    import os
    import sys

    cachedir   = sys.argv[1]
    cachefiles = [ '_CACHE_001_', '_CACHE_002_', '_CACHE_003_' ]

    # A regex for embedded URL strings
    request_pat = re.compile(r'([a-z]+://.*?)\x00')

    # Loop over all files and search for URLs
    for name in cachefiles:
        data = open(os.path.join(cachedir,name),"rb").read()
        index = 0
        while True:
            m = request_pat.search(data,index)
            if not m: break
            print m.group(1)
            index = m.end()

Page 28: Python in Action (Part 2)

The re module

[requests.py listing repeated]

re module

Contains all functionality related to regular expression pattern matching, searching, replacing, etc.

Features are strongly influenced by Perl, but regexes are not directly integrated into the Python language.

Page 29: Python in Action (Part 2)

Using re

[requests.py listing repeated]

Patterns are first specified as strings and compiled into a regex object:

    pat = re.compile(pattern [,flags])

The pattern syntax is "standard":

    pat*   pat+   pat?   (pat)   .
    pat1|pat2   [chars]   [^chars]   pat{n}   pat{n,m}

Page 30: Python in Action (Part 2)

Using re

[requests.py listing repeated]

All subsequent operations are methods of the compiled regex pattern:

    m = pat.match(data [,start])   # Check for match
    m = pat.search(data [,start])  # Search for match
    newdata = pat.sub(repl, data)  # Pattern replace

Page 31: Python in Action (Part 2)

Searching for Matches

[requests.py listing repeated]

pat.search(text [,start])

Searches the string text for the first occurrence of the regex pattern starting at position start. Returns a MatchObject if a match is found.

In the requests.py loop, we're finding matches one at a time.

Page 32: Python in Action (Part 2)

Match Objects

[requests.py listing repeated]

Regex matches are represented by a MatchObject:

    m.group([n])  # Text matched by group n
    m.start([n])  # Starting index of group n
    m.end([n])    # End index of group n

In requests.py, m.group(1) is the matching text for just the URL, and m.end() is the end of the match.

Page 33: Python in Action (Part 2)

Groups

[requests.py listing repeated]

In patterns, parentheses () define groups, which are numbered left to right:

    group 0   # The entire pattern
    group 1   # Text in first ()
    group 2   # Text in next ()
    ...

Page 34: Python in Action (Part 2)

Mini-Reference : re

• re pattern compilation

    pat = re.compile(r'patternstring')

• Pattern syntax

    literal     # Match literal text
    pat*        # Match 0 or more repetitions of pat
    pat+        # Match 1 or more repetitions of pat
    pat?        # Match 0 or 1 repetitions of pat
    pat1|pat2   # Match pat1 or pat2
    (pat)       # Match pat (group)
    [chars]     # Match characters in chars
    [^chars]    # Match characters not in chars
    .           # Match any character except \n
    \d          # Match any digit
    \w          # Match alphanumeric character
    \s          # Match whitespace

Page 35: Python in Action (Part 2)

Mini-Reference : re

• Common pattern operations

    pat.search(text)     # Search text for a match
    pat.match(text)      # Search start of text for match
    pat.sub(repl,text)   # Replace pattern with repl

• Match objects

    m.group([n])   # Text matched by group n
    m.start([n])   # Starting position of group n
    m.end([n])     # Ending position of group n

• How to loop over all matches of a pattern (a concrete example follows)

    for m in pat.finditer(text):
        # m is a MatchObject that you process
        ...
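As an illustration (the pattern and text here are made up for the example), finditer() drives the same one-match-at-a-time search as the requests.py loop, but without manual index bookkeeping:

    import re

    pat  = re.compile(r'([a-z]+)=(\d+)')
    text = "x=10 y=20 z=30"

    for m in pat.finditer(text):
        print m.group(1), m.group(2)

    # prints:
    # x 10
    # y 20
    # z 30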

Page 36: Python in Action (Part 2)

Mini-Reference : re

• An example of pattern replacement

    # This replaces American dates of the form 'mm/dd/yyyy'
    # with European dates of the form 'dd/mm/yyyy'.

    # This function takes a MatchObject as input and returns
    # replacement text as output.

    def euro_date(m):
        month = m.group(1)
        day   = m.group(2)
        year  = m.group(3)
        # group() returns strings, so format with %s
        return "%s/%s/%s" % (day,month,year)

    # Date re pattern and replacement operation
    datepat = re.compile(r'(\d+)/(\d+)/(\d+)')
    newdata = datepat.sub(euro_date,text)

Page 37: Python in Action (Part 2)

Mini-Reference : re

• There are many more features of the re module

• Strongly influenced by Perl (feature set)

• Regexes are a library in Python, not integrated into the language.

• A book on regular expressions may be essential for advanced functions.

Page 38: Python in Action (Part 2)

File Handling

[requests.py listing repeated]

What is going on in this statement?

    data = open(os.path.join(cachedir,name),"rb").read()

Page 39: Python in Action (Part 2)

os.path module

[requests.py listing repeated]

os.path has portable file related functions:

    os.path.join(name1,name2,...)  # Join path names
    os.path.getsize(filename)      # Get the file size
    os.path.getmtime(filename)     # Get modification date

There are many more functions, but this is the preferred module for basic filename handling.

Page 40: Python in Action (Part 2)

os.path.join()

[requests.py listing repeated]

Creates a fully-expanded pathname:

    dirname  = '/foo/bar'
    filename = 'name'

    os.path.join(dirname,filename)  ->  '/foo/bar/name'

Aware of platform differences ('/' vs. '\').

Page 41: Python in Action (Part 2)

Mini-Reference : os.path

    os.path.join(s1,s2,...)   # Join pathname parts together
    os.path.getsize(path)     # Get file size of path
    os.path.getmtime(path)    # Get modify time of path
    os.path.getatime(path)    # Get access time of path
    os.path.getctime(path)    # Get creation time of path
    os.path.exists(path)      # Check if path exists
    os.path.isfile(path)      # Check if regular file
    os.path.isdir(path)       # Check if directory
    os.path.islink(path)      # Check if symbolic link
    os.path.basename(path)    # Return file part of path
    os.path.dirname(path)     # Return dir part of path
    os.path.abspath(path)     # Get absolute path

Page 42: Python in Action (Part 2)

Binary I/O

[requests.py listing repeated]

For all binary files, use modes "rb", "wb", etc.:

    data = open(os.path.join(cachedir,name),"rb").read()

This disables new-line translation (critical on Windows).

Page 43: Python in Action (Part 2)

Common I/O Shortcuts

[requests.py listing repeated]

    # Read an entire file into a string
    data = open(filename).read()

    # Write a string out to a file
    open(filename,"w").write(text)

    # Loop over all lines in a file
    for line in open(filename):
        ...
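When a file is too large to read in one gulp (a concern raised in the commentary that follows), a sketch like this keeps memory bounded; filename and process() are placeholders:

    # Read a large binary file in fixed-size chunks instead
    # of all at once (keeps memory use bounded)
    f = open(filename,"rb")       # filename is a placeholder
    while True:
        chunk = f.read(65536)
        if not chunk:
            break
        process(chunk)            # process() is a placeholder
    f.close()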

Page 44: Python in Action (Part 2)

Commentary on Solution

• This regex approach is mostly a hack for this particular application.

• Reads entire cache files into memory as strings (may be quite large)

• Only finds URLs, no other metadata

• Some risk of false positives, since URLs could also be embedded in data.

Page 45: Python in Action (Part 2)

Commentary

• We have started to build a collection of very simple command line tools

• Very much in the "Unix tradition"

• Python makes it easy to create such tools

• More complex applications could be assembled by simply gluing scripts together

Page 46: Python in Action (Part 2)

Working with Processes

• It is common to write programs that run other programs, collect their output, etc.

• Pipes

• Interprocess Communication

• Python has a variety of modules for supporting this.

Page 47: Python in Action (Part 2)

subprocess Module

• A module for creating and interacting with subprocesses

• Consolidates a number of low-level OS functions such as system(), execv(), spawnv(), pipe(), popen2(), etc. into a single module

• Cross platform (Unix/Windows); a minimal sketch follows
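A minimal sketch of the module in use; the command here ("ls -l") is arbitrary, and any argv-style list works:

    import subprocess

    # Run a command with its stdout connected to a pipe
    p = subprocess.Popen(["ls","-l"], stdout=subprocess.PIPE)
    for line in p.stdout:
        print line,
    p.wait()     # Collect the exit status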

Page 48: Python in Action (Part 2)

Example : Slackers

• Find slacker cache entries.

Using the programs findcache.py and requests.py as subprocesses, write a program that inspects cache directories and prints out all entries that contain the word 'slashdot' in the URL.

Page 49: Python in Action (Part 2)

slackers.py

    # slackers.py
    import sys
    import subprocess

    # Run findcache.py as a subprocess
    finder = subprocess.Popen(
        [sys.executable,"findcache.py",sys.argv[1]],
        stdout=subprocess.PIPE)

    dirlist = [line.strip() for line in finder.stdout]

    # Run requests.py as a subprocess
    for cachedir in dirlist:
        searcher = subprocess.Popen(
            [sys.executable,"requests.py",cachedir],
            stdout=subprocess.PIPE)
        for line in searcher.stdout:
            if 'slashdot' in line:
                print line,

Page 50: Python in Action (Part 2)

Launching a subprocess

[slackers.py listing repeated]

The Popen() call launches a Python script as a subprocess, connecting its stdout stream to a pipe. The list comprehension collects the output, stripping the newline from each line.

Page 51: Python in Action (Part 2)

Python Executable

[slackers.py listing repeated]

sys.executable is the full pathname of the Python interpreter.

Page 52: Python in Action (Part 2)

Subprocess Arguments

[slackers.py listing repeated]

The list of arguments to the subprocess corresponds to what would appear on a shell command line.

Page 53: Python in Action (Part 2)

slackers.py

[slackers.py listing repeated]

More of the same idea. For each directory we found in the last step, we run requests.py to produce requests.

Page 54: Python in Action (Part 2)

Commentary

• subprocess is a large module with many options.

• However, it takes care of a lot of annoying platform-specific details for you.

• Currently the "recommended" way of dealing with subprocesses.

Page 55: Python in Action (Part 2)

Low Level Subprocesses

• Running a simple system command

    os.system("shell command")

• Connecting to a subprocess with pipes

    pout, pin = popen2.popen2("shell command")

• Exec/spawn

    os.execv(), os.execl(), os.execle(), ...
    os.spawnv(), os.spawnl(), os.spawnle(), ...

• Unix fork() (a minimal sketch follows)

    os.fork(), os.wait(), os.waitpid(), os._exit(), ...
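For comparison, a minimal Unix-only sketch of the fork/exec/wait pattern; the program being run (/bin/ls) is arbitrary:

    import os

    pid = os.fork()
    if pid == 0:
        # Child: replace this process with a new program
        os.execv("/bin/ls",["ls","-l"])
    else:
        # Parent: wait for the child to finish
        os.waitpid(pid,0)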

Page 56: Python in Action (Part 2)

Interactive Processes

• Python does not have built-in support for controlling interactive subprocesses (e.g., "Expect")

• You must install third party modules for this

• Example: pexpect

• http://pexpect.sourceforge.net

Page 57: Python in Action (Part 2)

Commentary

• Writing small Unix-like utilities is fairly straightforward in Python

• Support for standard kinds of operations (files, regular expressions, pipes, subprocesses, etc.)

• However, our solution is also kind of clunky

• Only returns some information

• Not particularly memory efficient (reads large files into memory)

Page 58: Python in Action (Part 2)

Interlude

• Python is well-suited to building libraries and frameworks.

• In the next part, we're going to take a totally different approach than simply writing simple utilities.

• We will build libraries for manipulating cache data and use those libraries to build tools.

Page 59: Python in Action (Part 2)

Problem : Parsing Data

• Extract the cache data (for real)

Write a module ffcache.py that contains a set of functions for reading Firefox cache data into useful data structures that can be used by other programs. Capture all available information including URLs, timestamps, sizes, locations, content types, etc.

• Use case: Blood and guts

Writing programs that can process foreign file formats. Processing binary encoded data. Creating code for later reuse.

Page 60: Python in Action (Part 2)

The Firefox Cache

• There are four critical files:

    _CACHE_MAP_   # Cache index
    _CACHE_001_   # Cache data
    _CACHE_002_   # Cache data
    _CACHE_003_   # Cache data

• All files are binary-encoded

• _CACHE_MAP_ is used by Firefox to locate data, but it is not updated until Firefox exits.

• We will ignore _CACHE_MAP_ since we want to observe caches of live Firefox sessions.

Page 61: Python in Action (Part 2)

Firefox _CACHE_ Files

• _CACHE_00n_ file organization: a free/used block bitmap (4096 bytes) followed by up to 32768 blocks.

• The block size varies according to the file:

    _CACHE_001_   256 byte blocks
    _CACHE_002_   1024 byte blocks
    _CACHE_003_   4096 byte blocks

Page 62: Python in Action (Part 2)

Cache Entries

• Each cache entry:

• A maximum of 4 cache blocks

• Can either be data or metadata

• If >16K, written to a file instead

• Notice how all the "cryptic" files are >16K:

    -rw------- beazley 111169 Sep 25 17:15 01CC0844d01
    -rw------- beazley 104991 Sep 25 17:15 01CC3844d01
    -rw------- beazley  47233 Sep 24 16:41 021F221Ad01
    ...
    -rw------- beazley  26749 Sep 21 11:19 FF8AEDF0d01
    -rw------- beazley  58172 Sep 25 18:16 FFE628C6d01

Page 63: Python in Action (Part 2)

Cache Metadata

• Metadata is encoded as a binary structure: a 36-byte header, followed by a variable-length request string and a variable-length request info section (both lengths are given in the header).

• Header encoding (binary, big-endian):

    bytes 0-3     magic (???)   unsigned int (0x00010008)
    bytes 4-7     location      unsigned int
    bytes 8-11    fetchcount    unsigned int
    bytes 12-15   fetchtime     unsigned int (system time)
    bytes 16-19   modifytime    unsigned int (system time)
    bytes 20-23   expiretime    unsigned int (system time)
    bytes 24-27   datasize      unsigned int (byte count)
    bytes 28-31   requestsize   unsigned int (byte count)
    bytes 32-35   infosize      unsigned int (byte count)

Page 64: Python in Action (Part 2)

Solution Outline

• Part 1: Parsing Metadata Headers

• Part 2: Getting request information (URL)

• Part 3: Extracting additional content info

• Part 4: Scanning of individual cache files

• Part 5: Scanning an entire directory

• Part 6: Scanning a list of directories

Page 65: Python in Action (Part 2)

Part I - Reading Headers

• Write a function that can parse the binary metadata header and return the data in a useful format

Page 66: Python in Action (Part 2)

Reading Headers

    import struct

    # This function parses a cache metadata header into a dict
    # of named fields (listed in _headernames below)

    _headernames = ['magic','location','fetchcount',
                    'fetchtime','modifytime','expiretime',
                    'datasize','requestsize','infosize']

    def parse_meta_header(headerdata):
        head = struct.unpack(">9I",headerdata)
        meta = dict(zip(_headernames,head))
        return meta

Page 67: Python in Action (Part 2)

Reading Headers

• How this is supposed to work:

    >>> f = open("Cache/_CACHE_001_","rb")
    >>> f.seek(4096)            # Skip the bit map
    >>> headerdata = f.read(36) # Read 36 byte header
    >>> meta = parse_meta_header(headerdata)
    >>> meta
    {'fetchtime': 1190829792, 'requestsize': 27, 'magic': 65544,
     'fetchcount': 3, 'expiretime': 0, 'location': 2449473536L,
     'modifytime': 1190829792, 'datasize': 29448, 'infosize': 531}
    >>>

• Basically, we're parsing the header into a useful Python data structure (a dictionary)

Page 68: Python in Action (Part 2)

struct module

[parse_meta_header listing repeated]

The struct module parses binary encoded data into Python objects. You would use this module to pack/unpack raw binary data from Python strings.

The format ">9I" unpacks 9 unsigned 32-bit big-endian integers.

Page 69: Python in Action (Part 2)

struct module

[parse_meta_header listing repeated]

The result is always a tuple of converted values:

    head = (65544, 0, 1, 1191682051, 1191682051, 0, 8645, 190, 218)

Page 70: Python in Action (Part 2)

Dictionary Creation

[parse_meta_header listing repeated]

zip(s1,s2) makes a list of tuples:

    zip(_headernames,head) -> [('magic',head[0]),
                               ('location',head[1]),
                               ('fetchcount',head[2]),
                               ...]

dict() then makes a dictionary from those (key, value) tuples. (A small interactive example follows.)
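A small interactive example of the zip()/dict() combination; the names and values are abbreviated for illustration, and the printed dictionary order may vary:

    >>> names = ['magic','location','fetchcount']
    >>> head = (65544, 0, 3)
    >>> zip(names,head)
    [('magic', 65544), ('location', 0), ('fetchcount', 3)]
    >>> dict(zip(names,head))
    {'fetchcount': 3, 'magic': 65544, 'location': 0}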

Page 71: Python in Action (Part 2)

Commentary

• Dictionaries as data structures:

    meta = {
        'fetchtime'   : 1190829792,
        'requestsize' : 27,
        'magic'       : 65544,
        'fetchcount'  : 3,
        'expiretime'  : 0,
        'location'    : 2449473536L,
        'modifytime'  : 1190829792,
        'datasize'    : 29448,
        'infosize'    : 531
    }

• Useful if data has many parts:

    data = f.read(meta[8])           # Huh?!?

    vs.

    data = f.read(meta['infosize'])  # Better

Page 72: Python in Action (Part 2)

Mini-Reference : struct

• struct module

    items = struct.unpack(fmt,data)
    data  = struct.pack(fmt,item1,...,itemn)

• Sample format codes

    'c'   char (1 byte string)
    'b'   signed char (8-bit integer)
    'B'   unsigned char (8-bit integer)
    'h'   signed short (16-bit integer)
    'H'   unsigned short (16-bit integer)
    'i'   int (32-bit integer)
    'I'   unsigned int (32-bit integer)
    'f'   32-bit single precision float
    'd'   64-bit double precision float
    's'   char s[] (string)
    '>'   big endian modifier
    '<'   little endian modifier
    '!'   network order modifier
    'n'   repetition count modifier
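A quick round-trip through pack() and unpack(), using two big-endian unsigned shorts as a stand-in for the 9-integer header format:

    >>> import struct
    >>> data = struct.pack(">2H", 1, 2)
    >>> data
    '\x00\x01\x00\x02'
    >>> struct.unpack(">2H", data)
    (1, 2)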

Page 73: Python in Action (Part 2)

Part 2 : Parsing Requests

• Write a function that will read the URL request string and request information

• Request String : a null-terminated string

• Request Info : a sequence of null-terminated key-value pairs (like a dictionary)

Page 74: Python in Action (Part 2)

Parsing Requests

    import re
    part_pat = re.compile(r'[\n\r -~]*$')

    def parse_request_data(meta,requestdata):
        parts = requestdata.split('\x00')
        for part in parts:
            if not part_pat.match(part):
                return False

        request = parts[0]
        if len(request) != (meta['requestsize'] - 1):
            return False
        info = dict(zip(parts[1::2],parts[2::2]))
        meta['request'] = request.split(':',1)[1]
        meta['info'] = info
        return True

Page 75: Python in Action (Part 2)

Usage : Requests

• Usage of the function:

    >>> f = open("Cache/_CACHE_001_","rb")
    >>> f.seek(4096)            # Skip the bit map
    >>> headerdata = f.read(36) # Read 36 byte header
    >>> meta = parse_meta_header(headerdata)
    >>> requestdata = f.read(meta['requestsize']+meta['infosize'])
    >>> parse_request_data(meta,requestdata)
    True
    >>> meta['request']
    'http://www.yahoo.com/'
    >>> meta['info']
    {'request-method': 'GET', 'request-User-Agent': 'Mozilla/5.0
    (Macintosh; U; Intel Mac OS X; en-US; rv:1.8.1.7) Gecko/20070914
    Firefox/2.0.0.7', 'charset': 'UTF-8', 'response-head': 'HTTP/1.1
    200 OK\r\nDate: Wed, 26 Sep 2007 18:03:17 ...' }
    >>>

Page 76: Python in Action (Part 2)

String Stripping

[parse_request_data listing repeated]

The request data is a sequence of null-terminated strings. This splits the data up into parts:

    requestdata = 'part\x00part\x00part\x00part\x00...'

    requestdata.split('\x00') ->

    parts = ['part','part','part','part',...]

Page 77: Python in Action (Part 2)

String Validation

[parse_request_data listing repeated]

Individual parts consist of printable characters plus the newline characters '\n' and '\r' (that is what the pattern r'[\n\r -~]*$' matches).

We use the re module to match each string. This helps catch cases where we might be reading bad data (false headers, raw data, etc.).

Page 78: Python in Action (Part 2)

URL Request String

[parse_request_data listing repeated]

The request string is the first part. The check that follows makes sure it's the right size (a further sanity check on the data integrity).

Page 79: Python in Action (Part 2)

Request Info

[parse_request_data listing repeated]

Each request has a set of associated data represented as key/value pairs:

    parts = ['request','key','val','key','val','key','val']

    parts[1::2] -> ['key','key','key']
    parts[2::2] -> ['val','val','val']

    zip(parts[1::2],parts[2::2]) -> [('key','val'),
                                     ('key','val'),
                                     ('key','val')]

dict() makes a dictionary from the (key,val) tuples. (An interactive example follows.)
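The slicing steps, run interactively (the keys and values are made up for illustration; dictionary order may vary):

    >>> parts = ['request','key1','val1','key2','val2']
    >>> parts[1::2]
    ['key1', 'key2']
    >>> parts[2::2]
    ['val1', 'val2']
    >>> dict(zip(parts[1::2],parts[2::2]))
    {'key2': 'val2', 'key1': 'val1'}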

Page 80: Python in Action (Part 2)

Fixing the Request

    # Given a dictionary of header information and a file,
    # this function extracts the request data from a cache
    # metadata entry and saves it in the dictionary. Returns
    # True or False depending on success.

    def read_request_data(header,f):
        request  = f.read(header['requestsize']).strip('\x00')
        infodata = f.read(header['infosize']).strip('\x00')

        # Validate request and infodata here (nothing now)

        # Turn the infodata into a dictionary
        parts = infodata.split('\x00')
        info = dict(zip(parts[::2],parts[1::2]))

        header['request'] = request.split(':',1)[1]
        header['info'] = info
        return True

Cleaning up the request string:

    request = "HTTP:http://www.google.com"

    request.split(':',1)    -> ['HTTP','http://www.google.com']
    request.split(':',1)[1] -> 'http://www.google.com'

Page 81: Python in Action (Part 2)

Commentary

• Emphasize that Python has very powerful list manipulation primitives

• Indexing

• Slicing

• List comprehensions

• Etc.

• Knowing how to use these leads to rapid development and compact code

Page 82: Python in Action (Part 2)

Part 3: Content Info

• All documents on the internet have optional content-type, encoding, and character set information.

• Let's add this information, since it will make it easier for us to determine the type of files that are stored in the cache (i.e., images, movies, HTML, etc.)

Page 83: Python in Action (Part 2)

HTTP Responses

• The cache metadata includes an HTTP response header:

    >>> print meta['info']['response-head']
    HTTP/1.1 200 OK
    Date: Sat, 29 Sep 2007 20:51:37 GMT
    Cache-Control: private
    Vary: User-Agent
    Content-Type: text/html; charset=utf-8
    Content-Encoding: gzip

    >>>

• The Content-Type and Content-Encoding lines carry the content type, character set, and encoding.

Page 84: Python in Action (Part 2)

Solution

    # Given a metadata dictionary, this function adds additional
    # fields related to the content type, charset, and encoding

    import email

    def add_content_info(meta):
        info = meta['info']
        if 'response-head' not in info:
            return
        else:
            rhead = info.get('response-head').split("\n",1)[1]
            m = email.message_from_string(rhead)
            content  = m.get_content_type()
            encoding = m.get('content-encoding',None)
            charset  = m.get_content_charset()
            meta['content-type'] = content
            meta['content-encoding'] = encoding
            meta['charset'] = charset

Page 85: Python in Action (Part 2)

Internet Data Handling

[add_content_info listing repeated]

Python has a vast assortment of internet data handling modules.

email: parsing of email messages, MIME headers, etc.

Page 86: Python in Action (Part 2)

Internet Data Handling

[add_content_info listing repeated]

In this code, we parse the HTTP response headers using the email module and extract content-type, encoding, and charset information. (A small interactive example follows.)
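A small interactive sketch of the same idea, using a made-up response header:

    >>> import email
    >>> rhead = "Content-Type: text/html; charset=utf-8\r\nContent-Encoding: gzip\r\n\r\n"
    >>> m = email.message_from_string(rhead)
    >>> m.get_content_type()
    'text/html'
    >>> m.get_content_charset()
    'utf-8'
    >>> m.get('content-encoding')
    'gzip'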

Page 87: Python in Action (Part 2)

Commentary

• Python is heavily used in Internet applications

• There are modules for parsing common types of data (email, HTML, XML, etc.)

• There are modules for processing bits and pieces of internet data (URLs, MIME types, RFC822 headers, etc.)

Page 88: Python in Action (Part 2)

Part 4: File Scanning

• Write a function that scans a single cache file and produces a sequence of records containing all of the cache metadata.

• This is just one more of our building blocks

• The goal is to hide some of the nasty bits

Page 89: Python in Action (Part 2)

File Scanning

    # Scan a single file in the firefox cache
    def scan_cachefile(f,blocksize):
        maxsize = 4*blocksize   # Maximum size of an entry
        f.seek(4096)            # Skip the bit-map
        while True:
            headerdata = f.read(36)
            if not headerdata: break
            meta = parse_meta_header(headerdata)
            if (meta['magic'] == 0x00010008 and
                meta['requestsize'] + meta['infosize'] < maxsize):
                requestdata = f.read(meta['requestsize']+
                                     meta['infosize'])
                if parse_request_data(meta,requestdata):
                    add_content_info(meta)
                    yield meta

            # Move the file pointer to the start of the next block
            fp = f.tell()
            if (fp % blocksize):
                f.seek(blocksize - (fp % blocksize),1)

Page 90: Python in Action (Part 2)

Usage : File Scanning

• Usage of the scan function

• We can just open up a cache file and write a for-loop to iterate over all of the entries:

    >>> f = open("Cache/_CACHE_001_","rb")
    >>> for meta in scan_cachefile(f,256):
    ...     print meta['request']
    ...
    http://www.yahoo.com/
    http://us.js2.yimg.com/us.yimg.com/a/1-/java/promotions/
    http://us.i1.yimg.com/us.yimg.com/i/ww/atty_hsi_free.gif
    ...

Page 91: Python in Action (Part 2)

Python File I/O

[scan_cachefile listing repeated]

File Objects

Modeled after ANSI C. Files are just bytes; a file pointer keeps track of the current position.

    f.read()       # Read bytes
    f.tell()       # Current fp
    f.seek(n,off)  # Move fp

Page 92: Python in Action (Part 2)

Using Earlier Code

[scan_cachefile listing repeated]

Here we are using our header parsing functions written in previous parts.

Note: We are progressively adding more data to a dictionary.

Page 93: Python in Action (Part 2)

Data Validation

[scan_cachefile listing repeated]

The magic-number and size test is a sanity check to make sure the header data looks like a valid header.

Page 94: Python in Action (Part 2)

Generating Results

[scan_cachefile listing repeated]

We are using yield to produce data for a single cache entry. If someone uses a for-loop, they will get all of the entries.

Note: This allows us to process the cache without reading all of the data into memory. (See the sketch below.)
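To see the laziness, entries can also be pulled one at a time; the dictionary shown here is abbreviated for illustration:

    >>> f = open("Cache/_CACHE_001_","rb")
    >>> entries = scan_cachefile(f,256)   # No data read yet
    >>> entries.next()                    # Reads just enough for one entry
    {'fetchtime': 1190829792, 'requestsize': 27, ...}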

Page 95: Python in Action (Part 2)

Commentary

• Have created a function that can scan a single _CACHE_00n_ file and produce a sequence of dictionaries with metadata.

• It's still somewhat low-level

• Just need to package it a little better

Page 96: Python in Action (Part 2)

Part 5 : Scan a Directory

• Write a function that takes the name of a Firefox cache directory, scans all of the cache files for metadata, and produces a single sequence of records.

• Make it real easy to extract data

Page 97: Python in Action (Part 2)

Solution : Directory Scan

    # Given the name of a Firefox cache directory, the function
    # scans all of the _CACHE_00n_ files for metadata. A sequence
    # of dictionaries containing metadata is returned.

    import os

    def scan_cache(cachedir):
        files = [('_CACHE_001_',256),
                 ('_CACHE_002_',1024),
                 ('_CACHE_003_',4096)]

        for cname,blocksize in files:
            cfile = open(os.path.join(cachedir,cname),"rb")
            for meta in scan_cachefile(cfile,blocksize):
                meta['cachedir'] = cachedir
                meta['cachefile'] = cname
                yield meta
            cfile.close()

Page 98: Python in Action (Part 2)

Solution : Directory Scan

[scan_cache listing repeated]

General idea: we loop over the three _CACHE_00n_ files and produce a sequence of the cache records.

Page 99: Python in Action (Part 2)

Solution : Directory Scan

[scan_cache listing repeated]

We use the low-level file scanning function here to generate a sequence of records.

Page 100: Python in Action (Part 2)

More Generation

[scan_cache listing repeated]

By using yield here, we are chaining together the results obtained from all three cache files into one big long sequence of results.

The underlying mechanics and implementation details are hidden (the user doesn't care).

Page 101: Python in Action (Part 2)

Additional Data

[scan_cache listing repeated]

Adding path and file information to the data (may be useful later).

Page 102: Python in Action (Part 2)

Usage : Cache Scan

• Usage of the scan function

• Given the name of a cache directory, we can just loop over all of the metadata. Trivial!

    >>> for meta in scan_cache("Cache/"):
    ...     print meta['request']
    ...
    http://www.yahoo.com/
    http://us.js2.yimg.com/us.yimg.com/a/1-/java/promotions/
    http://us.i1.yimg.com/us.yimg.com/i/ww/atty_hsi_free.gif
    ...

• With work, could perform various kinds of queries and processing of the data

Page 103: Python in Action (Part 2)

Another Example

• Find all requests related to Slashdot

    >>> for meta in scan_cache("Cache/"):
    ...     if 'slashdot' in meta['request']:
    ...         print meta['request']
    ...
    http://www.slashdot.org/
    http://images.slashdot.org/topics/topiccommunications.gif
    http://images.slashdot.org/topics/topicstorage.gif
    http://images.slashdot.org/comments.css?T_2_5_0_176
    ...

• Well, that was pretty easy.

Page 104: Python in Action (Part 2)

Another Example

• Find all large JPEG images in the cache

    >>> jpegs = (meta for meta in scan_cache("Cache/")
    ...               if meta['content-type'] == 'image/jpeg'
    ...               and meta['datasize'] > 100000)
    >>> for j in jpegs:
    ...     print j['request']
    ...
    http://images.salon.com/ent/video_dog/comedy/2007/09/27/cereal/story.jpg
    http://images.salon.com/ent/video_dog/ifc/2007/09/28/apocalypse/story.jpg
    http://www.lakesideinns.com/images/fallroadphoto2006.jpg
    ...
    >>>

• That was also pretty easy

Page 105: Python in Action (Part 2)

Part 6 : Scan Everything

• Write a function that takes a list of cache directories and produces a sequence of all cache metadata found in all of them.

• A single utility function that lets us query everything.

Page 106: Python in Action (Part 2)

Copyright (C) 2007, http://www.dabeaz.com 2-

Scanning Everything

# scan an entire list of cache directories producing# a sequence of records

def scan(cachedirs): if isinstance(cachedirs,str): cachedirs = [cachedirs] for cdir in cachedirs: for meta in scan_cache(cdir): yield meta


Page 107: Python in Action (Part 2)

Type Checking

This bit of code is an example of type checking.

If the argument is a string, we convert it to a list with one item. This allows both of the following usages:

scan("CacheDir")
scan(["CacheDir1","CacheDir2",...])
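One refinement worth noting: in Python 2, isinstance(cachedirs,str) misses unicode strings; checking against basestring covers both. A sketch of that variant:

def scan(cachedirs):
    if isinstance(cachedirs, basestring):   # str or unicode
        cachedirs = [cachedirs]
    for cdir in cachedirs:
        for meta in scan_cache(cdir):
            yield meta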

Page 108: Python in Action (Part 2)

Putting it all together

# slack.py
# Find all of those slackers who should be working
import sys, os, ffcache

if len(sys.argv) != 2:
    print >>sys.stderr,"Usage: python slack.py dirname"
    raise SystemExit(1)

caches = (path for path,dirs,files in os.walk(sys.argv[1])
               if '_CACHE_MAP_' in files)

for meta in ffcache.scan(caches):
    if 'slashdot' in meta['request']:
        print meta['request']
        print meta['cachedir']
        print


Page 109: Python in Action (Part 2)


Intermission

• Have written a simple library ffcache.py

• The library takes a moderately complex data-processing problem and breaks it up into pieces.

• About 100 lines of code.

• Now, let's build an application...


Page 110: Python in Action (Part 2)

Problem : CacheSpy

• Big Brother (make an evil sound here)


Write a program that first locates all of the Firefox cache directories under a given directory. Then have that program run forever as a network server, waiting for connections. On each connection, send back all of the current cache metadata.

• Big Picture

We're going to write a daemon that will find and quietly report on browser cache contents.

Page 111: Python in Action (Part 2)

cachespy.py

import sys, os, pickle, SocketServer, ffcache

SPY_PORT = 31337

caches = [path for path,dname,files in os.walk(sys.argv[1])
               if '_CACHE_MAP_' in files]

def dump_cache(f):
    for meta in ffcache.scan(caches):
        pickle.dump(meta,f)

class SpyHandler(SocketServer.BaseRequestHandler):
    def handle(self):
        f = self.request.makefile()
        dump_cache(f)
        f.close()

SocketServer.TCPServer.allow_reuse_address = True
serv = SocketServer.TCPServer(("",SPY_PORT),SpyHandler)
print "CacheSpy running on port %d" % SPY_PORT
serv.serve_forever()


Page 112: Python in Action (Part 2)


SocketServer Module


A module for easily creating low-level internet applications using sockets.

Page 113: Python in Action (Part 2)


SocketServer Handlers


You define a simple class that implements handle().

This implements the server logic.
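To see the pattern in isolation, here is a minimal sketch of an unrelated SocketServer application, an echo server (the handler name and port are illustrative):

import SocketServer

class EchoHandler(SocketServer.BaseRequestHandler):
    def handle(self):
        # Echo everything the client sends until it disconnects
        while True:
            data = self.request.recv(8192)
            if not data:
                break
            self.request.sendall(data)

serv = SocketServer.TCPServer(("",9000),EchoHandler)
serv.serve_forever()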

Page 114: Python in Action (Part 2)


SocketServer Servers


Next, you just create a Server object, hook the handler up to it, and run the server.

Page 115: Python in Action (Part 2)


Data Serialization


Here, we are turning a socket into a file and dumping cache data onto it.

self.request is the socket corresponding to the client that connected; makefile() wraps it in a file-like object.

Page 116: Python in Action (Part 2)


pickle Module


The pickle module takes any Python object and serializes it into a byte string.

There are really only two ops:

pickle.dump(obj,f)      # Dump object
obj = pickle.load(f)    # Load object
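A self-contained sketch of the streaming pattern used here: dump several objects into one file, then load them back one at a time until EOFError (the file name is illustrative):

import pickle

f = open("records.p","wb")
for obj in [{'a':1},{'b':2},{'c':3}]:
    pickle.dump(obj,f)          # Objects are simply concatenated
f.close()

f = open("records.p","rb")
try:
    while True:
        print pickle.load(f)    # Reads one object at a time
except EOFError:
    pass
f.close()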

Page 117: Python in Action (Part 2)

Running our Server

% python cachespy.py /Users
CacheSpy running on port 31337



• Server is just sitting there waiting

• You can try connecting with telnet

% telnet localhost 31337
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
(dp0
S'info'
p1
... bunch of cryptic data ...

Page 118: Python in Action (Part 2)


Problem : CacheMon

• The Evil Overlord (make a more evil sound)


Write a program cachemon.py that contains a function for retrieving the cache contents from a remote machine.

• Big Picture

Writing network clients. Programs that make outgoing connections to internet services.

Page 119: Python in Action (Part 2)

# cachemon.py
import pickle, socket

def scan_remote_cache(host):
    s = socket.socket(socket.AF_INET,socket.SOCK_STREAM)
    s.connect(host)
    f = s.makefile()
    try:
        while True:
            meta = pickle.load(f)
            meta['host'] = host      # Add host to metadata
            yield meta
    except EOFError:
        pass
    f.close()
    s.close()

cachemon.py


Page 120: Python in Action (Part 2)


Solution : Socket Module


The socket module provides direct access to the low-level socket API:

s = socket.socket(family,type)

s.connect(addr)
s.bind(addr)
s.listen(n)
s.accept()
s.recv(n)
s.send(data)
...
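For instance, a bare-bones TCP client looks like this (a sketch; it assumes something is listening on the given port):

import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("localhost",31337))
data = s.recv(8192)          # Read up to 8K of the reply
s.close()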

Page 121: Python in Action (Part 2)


Unpickling a Sequence


Here we use pickle to repeatedly load objects off of the socket. We use yield to generate a sequence of received objects.

Page 122: Python in Action (Part 2)

Example Usage

>>> rcache = scan_remote_cache(("localhost",31337))
>>> jpegs = (meta for meta in rcache
...             if meta['content-type'] == 'image/jpeg'
...             and meta['datasize'] > 100000)
>>> for j in jpegs:
...     print j['request']
...
http://images.salon.com/ent/video_dog/comedy/2007/09/27/cereal/story.jpg
http://images.salon.com/ent/video_dog/ifc/2007/09/28/apocalypse/story.jpg
http://www.lakesideinns.com/images/fallroadphoto2006.jpg
...


• Example: Find all JPEG images > 100K on a remote machine

• This looks almost identical to old code

Page 123: Python in Action (Part 2)

Code Similarity

• A Remote Scan

rcache = scan_remote_cache(("localhost",31337))
jpegs = (meta for meta in rcache
             if meta['content-type'] == 'image/jpeg'
             and meta['datasize'] > 100000)
for j in jpegs:
    print j['request']

• A Local Scan

cache = ffcache.scan(cachedirs)
jpegs = (meta for meta in cache
             if meta['content-type'] == 'image/jpeg'
             and meta['datasize'] > 100000)
for j in jpegs:
    print j['request']

Page 124: Python in Action (Part 2)

Big Picture

cachespy.py (server side):

for meta in ffcache.scan(dirs):
    pickle.dump(meta,f)

cachemon.py (client side):

while True:
    meta = pickle.load(f)
    yield meta

for meta in scan_remote_cache(host):
    # ...

The two programs are connected by a socket: the server pickles metadata records into it, and the client unpickles them back out, one at a time.

Page 125: Python in Action (Part 2)


Problem : Clusters

• Scan a whole cluster of machines


Write a function that can easily scan the caches of an entire collection of remote hosts.

• Big Picture

Collecting data from a group of machines on the network.

Page 126: Python in Action (Part 2)

# cachemon.py
...
def scan_cluster(hostlist):
    for host in hostlist:
        try:
            for meta in scan_remote_cache(host):
                yield meta
        except (EnvironmentError,socket.error):
            pass

cachemon.py


A bit of exception handling deals with dead machines and other problems (it would probably need to be expanded).

Page 127: Python in Action (Part 2)

Example Usage

>>> hosts = [('host1',31337),('host2',31337),...]
>>> rcaches = scan_cluster(hosts)
>>> jpegs = (meta for meta in rcaches
...             if meta['content-type'] == 'image/jpeg'
...             and meta['datasize'] > 100000)
>>> for j in jpegs:
...     print j['request']
...
...


• Example: Find all JPEG images > 100K on a set of remote machines

• Think about the abstraction of "iteration" here. Query code is exactly the same.
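That abstraction can be made explicit. Since every scanner yields the same dictionaries, a query can be written once over any metadata iterable. A sketch, with find_big_jpegs as our own illustrative name:

def find_big_jpegs(metadata):
    return (meta for meta in metadata
                 if meta['content-type'] == 'image/jpeg'
                 and meta['datasize'] > 100000)

# The same query runs locally, on one host, or on a cluster:
# find_big_jpegs(ffcache.scan(cachedirs))
# find_big_jpegs(scan_remote_cache(host))
# find_big_jpegs(scan_cluster(hosts))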

Page 128: Python in Action (Part 2)


Problem : Concurrency

• Collect data from a large set of machines


In the last section, the scan_cluster() function retrieves data from one machine at a time. However, a world-wide quasi-evil organization is likely to have at least several dozen machines.

• Your task

Modify the scanner so that it can manage concurrent client connections, reading data from multiple sources at once.

Page 129: Python in Action (Part 2)


Concurrency


• Python provides full support for threads

• They are real threads (pthreads, system threads, etc.)

• However, a lock within the Python interpreter (the Global Interpreter Lock) prevents concurrency across more than one CPU.

Page 130: Python in Action (Part 2)


Programming with Threads


• threading module provides a Thread object.

• A variety of synchronization primitives are provided (Locks, Semaphores, Condition Variables, Events, etc.)

• Can program very traditional kinds of threaded programs (multiple threads, lots of locking, race conditions, horrible debugging, etc.).

Page 131: Python in Action (Part 2)


Threads with Queues


• One technique for thread programming is to have independent threads that share data via thread-safe message queues.

• Variations of "producer-consumer" problems.

• We will use this in our solution; a minimal sketch follows. Keep in mind, it's not the only way to program threads.
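A minimal producer/consumer sketch with a shared Queue (the names here are illustrative; this is the shape our monitor will take):

import threading, Queue

def producer(q):
    for i in range(5):
        q.put(i)
    q.put(None)                 # Sentinel: no more data

q = Queue.Queue()
threading.Thread(target=producer,args=(q,)).start()
while True:
    item = q.get()
    if item is None:
        break
    print item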

Page 132: Python in Action (Part 2)

# cachemon.py
...
import threading

class ScanThread(threading.Thread):
    def __init__(self,host,msg_q):
        threading.Thread.__init__(self)
        self.host = host
        self.msg_q = msg_q
    def run(self):
        for meta in scan_remote_cache(self.host):
            self.msg_q.put(meta)

A Cache Scanning Thread


Page 133: Python in Action (Part 2)


threading Module


The threading module contains most functionality related to threads.

Page 134: Python in Action (Part 2)


Thread Base Class


Threads are defined by inheriting from the Thread base class.

Page 135: Python in Action (Part 2)


Thread Initialization


The __init__() method performs initialization and setup, storing the host and the shared message queue.

Page 136: Python in Action (Part 2)


Thread Execution


run() method

Contains code that executes in the thread.

The thread performs a scan of a single host.

Page 137: Python in Action (Part 2)


Launching a Thread


• You create a thread object and start it

t1 = ScanThread(("host1",31337),msg_q)
t1.start()

t2 = ScanThread(("host2",31337),msg_q)
t2.start()

• .start() starts the thread and calls .run()

Page 138: Python in Action (Part 2)


Thread Safe Queues


• Queue module. Provides a thread-safe queue.

import Queue
msg_q = Queue.Queue()

• Queue insertion

msg_q.put(obj)

• Queue removal

obj = msg_q.get()

• Queue can be shared by as many threads as you want without worrying about locking.

Page 139: Python in Action (Part 2)


Use of a Queue Object


A Queue object. Where incoming objects are placed.

The run() method gets data from the remote machine and puts it into the Queue.

Page 140: Python in Action (Part 2)


Primitive Use of a Queue


• You first create a queue, then launch the threads to insert data into it.

msg_q = Queue.Queue()

t1 = ScanThread(("host1",31337),msg_q)
t1.start()

t2 = ScanThread(("host2",31337),msg_q)
t2.start()

while True:
    meta = msg_q.get()     # Get metadata

Page 141: Python in Action (Part 2)

Monitor Architecture

[Diagram: each remote Host connects over a socket to its own Thread inside the Monitor. Every thread .put()s metadata records into a shared msg_q; a single Consumer .get()s them off the queue.]

Page 142: Python in Action (Part 2)

Concurrent Monitor

import threading, Queue

def concurrent_scan(hostlist, msg_q):
    thr_list = []
    for host in hostlist:
        thr = ScanThread(host,msg_q)
        thr.start()
        thr_list.append(thr)
    for thr in thr_list:
        thr.join()
    msg_q.put(None)      # Sentinel

def scan_cluster(hostlist):
    msg_q = Queue.Queue()
    threading.Thread(target=concurrent_scan,
                     args=(hostlist,msg_q)).start()
    while True:
        meta = msg_q.get()
        if meta:
            yield meta
        else:
            break


Page 143: Python in Action (Part 2)


Launching Threads


concurrent_scan() runs in a thread of its own and launches the ScanThreads. It then waits for the threads to terminate by joining with them. After all threads have terminated, a sentinel is dropped into the Queue.

Page 144: Python in Action (Part 2)


Collecting Results


scan_cluster() creates a Queue and launches a thread that, in turn, launches all of the scanning threads.

It then produces a sequence of cache data until the sentinel (None) is pulled off of the queue.

Page 145: Python in Action (Part 2)


More on Threads


• There are many more issues to thread programming that we could discuss.

• All issues concerning locking, synchronization, event handling, and race conditions apply to Python.

• Because of the global interpreter lock, threads are generally not a way to achieve higher performance.

Page 146: Python in Action (Part 2)


Thread Synchronization


• threading module has various primitives

Lock()          # Mutex Lock
RLock()         # Reentrant Mutex Lock
Semaphore(n)    # Semaphore

• Example use:

x = value            # Some kind of shared object
x_lock = Lock()      # A lock associated with x
...

x_lock.acquire()
# Modify or do something with x (critical section)
...
x_lock.release()
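As of Python 2.5, locks also work with the with statement (enabled via from __future__ import with_statement; native in 2.6), which guarantees the lock is released even if the critical section raises an exception. A sketch:

from __future__ import with_statement
from threading import Lock

x = 0
x_lock = Lock()

with x_lock:
    # Critical section; the lock is released automatically on exit
    x = x + 1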

Page 147: Python in Action (Part 2)


Story so Far


• Wrote a module ffcache.py that parsed contents of caches (~100 lines)

• Wrote cachespy.py that allows cache data to be retrieved by a remote client (~25 lines)

• Wrote a concurrent monitor for getting that data (~50 lines)

Page 148: Python in Action (Part 2)


A subtle observation


• In none of our programs have we read the entire contents of any Firefox cache into memory.

• In cachespy.py, the contents are read iteratively and piped through a socket (not stored in memory).

• In cachemon.py, contents are received and routed through message queues. Processed iteratively (no temporary lists of results).

Page 149: Python in Action (Part 2)


Another Observation


• For every connection, cachespy sends the entire contents of the Firefox cache metadata back to the monitor.

• Given that caches are ~50 MB by default, this could result in large network traffic.

• Question: Given that we're normally performing queries on the data, could we do any of this work on the remote machines?

Page 150: Python in Action (Part 2)


Remote Filtering

• Distribute the work


Modify the cachespy program so that some of the query work can be performed remotely on each of the machines. Only send back a subset of the data to the monitor program.

• Big Picture

Distributed computation. Massive security nightmare.

Page 151: Python in Action (Part 2)

The idea

• Modify scan_cluster() and all related functions to accept an optional filter specification. Pass this on to the remote machine and use it to process the data remotely before returning results.

filter = """if meta['content-type'] == 'image/jpeg'
and meta['datasize'] > 100000"""

rcaches = scan_cluster(hostlist,filter)

Page 152: Python in Action (Part 2)

Changes to the Monitor

# cachemon.py
def scan_remote_cache(host,filter=""):
    s = socket.socket(socket.AF_INET,socket.SOCK_STREAM)
    s.connect(host)
    f = s.makefile()
    pickle.dump(filter,f)
    f.flush()
    try:
        while True:
            meta = pickle.load(f)
            meta['host'] = host
            yield meta
    except EOFError:
        pass


Send the filter to the remote host right after connecting.

Add a filter parameter

Page 153: Python in Action (Part 2)

Changes to the Monitor

# cachemon.py
...
class ScanThread(threading.Thread):
    def __init__(self,host,msg_q,filter=""):
        threading.Thread.__init__(self)
        self.host = host
        self.msg_q = msg_q
        self.filter = filter
    def run(self):
        try:
            for meta in scan_remote_cache(self.host,self.filter):
                self.msg_q.put(meta)
        except (EnvironmentError,socket.error):
            pass


filter added to thread data

Page 154: Python in Action (Part 2)

def concurrent_scan(hostlist, msg_q, filter):
    thr_list = []
    for host in hostlist:
        thr = ScanThread(host,msg_q,filter)
        thr.start()
        thr_list.append(thr)
    for thr in thr_list:
        thr.join()
    msg_q.put(None)      # Sentinel

Changes to the Monitor


filter passed to thread creation

Page 155: Python in Action (Part 2)


Changes to the Monitor

# cachemon.py
...
def scan_cluster(hostlist,filter=""):
    msg_q = Queue.Queue()
    threading.Thread(target=concurrent_scan,
                     args=(hostlist,msg_q,filter)).start()
    while True:
        meta = msg_q.get()
        if not meta:
            break
        yield meta


filter added

Page 156: Python in Action (Part 2)


Commentary

• Have modified the cache monitor program to accept a filter string and to pass that string to remote clients upon connecting.

• Next: how to use the filter in the spy server.


Page 157: Python in Action (Part 2)


Changes to CacheSpy

# cachespy.py
...
def dump_cache(f,filter):
    values = """(meta for meta in ffcache.scan(caches)
                      %s)""" % filter
    try:
        for meta in eval(values):
            pickle.dump(meta,f)
    except:
        pickle.dump({'error' : traceback.format_exc()},f)


Page 158: Python in Action (Part 2)


Changes to CacheSpy


The filter is spliced into a generator-expression string. For example:

filter = "if meta['datasize'] > 100000"

values = """(meta for meta in ffcache.scan(caches)
                  if meta['datasize'] > 100000)"""

Page 159: Python in Action (Part 2)


Eval()


eval(s) evaluates the string s as a Python expression.

There is also a bit of error handling: the traceback module produces a stack trace for any exception, which is pickled back to the client.
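Worth emphasizing: eval() on a string received from the network runs arbitrary code (the slides call it a security nightmare for good reason). A common partial mitigation is to evaluate with a restricted namespace. A sketch only, and explicitly not a real security boundary:

# Variant of the eval() call in dump_cache(): limit the names
# visible to the evaluated expression. NOT real security;
# eval() sandboxes like this are generally escapable.
env = {'__builtins__': None, 'ffcache': ffcache, 'caches': caches}
for meta in eval(values, env):
    pickle.dump(meta,f)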

Page 160: Python in Action (Part 2)


Changes to the Server

# cachespy.py
...
class SpyHandler(SocketServer.BaseRequestHandler):
    def handle(self):
        f = self.request.makefile()
        filter = pickle.load(f)
        dump_cache(f,filter)
        f.close()


Get filter from the monitor

Page 161: Python in Action (Part 2)


Putting it all Together

• A remote query to find slackers


# Find all of those slashdot slackers
import cachemon

hosts = [('host1',31337),('host2',31337),
         ('host3',31337),...]

filter = "if 'slashdot' in meta['request']"

rcaches = cachemon.scan_cluster(hosts,filter)
for meta in rcaches:
    print meta['request']
    print meta['host'],meta['cachedir']
    print

Page 162: Python in Action (Part 2)


Putting it all Together

• Queries run remotely on all the hosts

• Only data of interest is sent back

• No temporary lists or large data structures

• Concurrent execution on monitor

• Concurrency is hidden from user


Page 163: Python in Action (Part 2)

The Power of Iteration

• Loop over all entries in a cache file:

for meta in scan_cachefile(f,256):
    ...

• Loop over all entries in a cache directory:

for meta in scan_cache(dirname):
    ...

• Loop over all cache entries on a remote host:

for meta in scan_remote_cache(host):
    ...

• Loop over all cache entries on many hosts:

for meta in scan_cluster(hostlist):
    ...
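And because each of these is just an iterable of dictionaries, queries compose as generator pipelines. A sketch:

slashdot = (meta for meta in scan_cluster(hostlist)
                 if 'slashdot' in meta['request'])
big = (meta for meta in slashdot
            if meta['datasize'] > 100000)
for meta in big:
    print meta['request']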

Page 164: Python in Action (Part 2)


Wrapping Up

• A lot of material has been presented

• Again, the goal was to do something interesting with Python, not to be just a reference manual.

• This is only a small taste of what's possible

• And it's only a small taste of why people like programming in Python


Page 165: Python in Action (Part 2)


Other Python Examples

• Python makes many annoying tasks relatively easy.

• Will end by showing very simple examples of other modules.


Page 166: Python in Action (Part 2)

Fetching a Web Page

• urllib and urllib2 modules

import urllib
w = urllib.urlopen("http://www.foo.com")
for line in w:
    # ...

page = urllib.urlopen("http://www.foo.com").read()

• Additional options support uploading of form values, cookies, passwords, proxies, etc.
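For example, passing form values performs a POST (a sketch; the URL and field names are made up):

import urllib

params = urllib.urlencode({'user':'dave','page':'2'})
w = urllib.urlopen("http://www.foo.com/login", params)   # POST request
reply = w.read()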

Page 167: Python in Action (Part 2)


A Web Server with CGI


• Serve files and allow CGI scripts

from BaseHTTPServer import HTTPServer
from CGIHTTPServer import CGIHTTPRequestHandler
import os

os.chdir("/home/docs/html")
serv = HTTPServer(("",8080),CGIHTTPRequestHandler)
serv.serve_forever()

• Can easily throw up a server with just a few lines of Python code.

Page 168: Python in Action (Part 2)

A Custom HTTP Server

• BaseHTTPServer module

from BaseHTTPServer import BaseHTTPRequestHandler,HTTPServer

class MyHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        ...
    def do_POST(self):
        ...
    def do_HEAD(self):
        ...
    def do_PUT(self):
        ...

serv = HTTPServer(("",8080),MyHandler)
serv.serve_forever()

• Could use to put a web server in an application
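A minimal sketch of a working handler, filling in do_GET() with a plain-text response (the handler name and message are illustrative):

from BaseHTTPServer import BaseHTTPRequestHandler,HTTPServer

class HelloHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        msg = "Hello from %s\n" % self.path
        self.send_response(200)
        self.send_header("Content-type","text/plain")
        self.send_header("Content-length",str(len(msg)))
        self.end_headers()
        self.wfile.write(msg)

serv = HTTPServer(("",8080),HelloHandler)
serv.serve_forever()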

Page 169: Python in Action (Part 2)


XML-RPC Server/Client

• How to create a stand-alone server

from SimpleXMLRPCServer import SimpleXMLRPCServer

def add(x,y):
    return x+y

s = SimpleXMLRPCServer(("",8080))
s.register_function(add)
s.serve_forever()

• How to test it (xmlrpclib)

>>> import xmlrpclib
>>> s = xmlrpclib.ServerProxy("http://localhost:8080")
>>> s.add(3,5)
8
>>> s.add("Hello","World")
"HelloWorld"
>>>

Page 170: Python in Action (Part 2)


Where to go from here?

• Network/Internet programming. Python has a large user base developing network applications, web frameworks, and internet data handling tools.

• C/C++ extension building. Python is easily extended with C/C++ code. Can use Python as a high-level control application for existing systems software.


Page 171: Python in Action (Part 2)


Where to go from here?

• GUI programming. There are several major GUI packages for Python (Tkinter, wxPython, PyQT, etc.).

• Jython and IronPython. Implementations of the Python interpreter for Java and .NET.


Page 172: Python in Action (Part 2)


Where to go from here?

• Everything Pythonic:


http://www.python.org

• Get involved. PyCon'2008 (Chicago)

• Have an on-site course (shameless plug)

http://www.dabeaz.com/python.html

Page 173: Python in Action (Part 2)


Thanks for Listening!

• Hope you got something out of the class


• Please give me feedback!