Python in Action (Part 2)


Description

Official tutorial slides from USENIX LISA, Nov. 16, 2007.

Transcript of Python in Action (Part 2)

Page 1: Python in Action (Part 2)

Python in Action
(Part II - Systems Programming)

Presented at USENIX LISA Conference
November 16, 2007

David M. Beazley
http://www.dabeaz.com

Page 2: Python in Action (Part 2)

Section Overview

• In this section, we're going to get dirty

• Systems Programming

• Files, I/O, file-system

• Text parsing, data decoding

• Processes and IPC

• Networking

• Threads and concurrency

Page 3: Python in Action (Part 2)

Commentary

• I personally think Python is a fantastic tool for systems programming.

• Modules provide access to most of the major system libraries I used to access via C

• No enforcement of "morality"

• Decent performance

• It just "works" and it's fun

Page 4: Python in Action (Part 2)

Approach

• I've thought long and hard about how I would present this part of the class.

• A reference manual approach would probably be long and very boring.

• So instead, we're going to focus on building something more in tune with the times.

Page 5: Python in Action (Part 2)

"To Catch a Slacker"

• Write a collection of Python programs that can quietly monitor Firefox browser caches to find out who has been spending their day reading Slashdot instead of working on their TPS reports.

• Oh yeah, and be a real sneaky bugger about it.

Page 6: Python in Action (Part 2)

Why this Problem?

• Involves a real-world system and data

• Firefox already installed on your machine (?)

• Cross platform (Linux, Mac, Windows)

• Example of tool building

• Related to a variety of practical problems

• A good tour of "Python in Action"

Page 7: Python in Action (Part 2)

Disclaimers

• I am not involved in browser forensics (or spyware for that matter).

• I am in no way affiliated with Firefox/Mozilla nor have I ever seen Firefox source code

• I have never worked with the cache data prior to preparing this tutorial

• I have never used any third-party tools for looking at this data.

Page 8: Python in Action (Part 2)

More Disclaimers

• All of the code in this tutorial works with a standard Python installation

• No third party modules.

• All code is cross-platform

• Code samples are available online at http://www.dabeaz.com/action/

• Please look at that code and follow along

Page 9: Python in Action (Part 2)

Assumptions

• This is not a tutorial on systems concepts

• You should be generally familiar with the background material (files, filesystems, file formats, processes, threads, networking, protocols, etc.)

• Hopefully you can "extrapolate" from the material presented here to construct more advanced Python applications.

Page 10: Python in Action (Part 2)

The Big Picture

• We want to write a tool that allows someone to locate, inspect, and perform queries across a distributed collection of Firefox caches.

• For example, the cache directories on all machines on the LAN of a quasi-evil corporation.

Page 11: Python in Action (Part 2)

The Firefox Cache

• The Firefox browser keeps a disk cache of recently visited sites:

    % ls Cache/
    -rw------- 1 beazley  111169 Sep 25 17:15 01CC0844d01
    -rw------- 1 beazley  104991 Sep 25 17:15 01CC3844d01
    -rw------- 1 beazley   47233 Sep 24 16:41 021F221Ad01
    ...
    -rw------- 1 beazley   26749 Sep 21 11:19 FF8AEDF0d01
    -rw------- 1 beazley   58172 Sep 25 18:16 FFE628C6d01
    -rw------- 1 beazley 1939456 Sep 25 19:14 _CACHE_001_
    -rw------- 1 beazley 2588672 Sep 25 19:14 _CACHE_002_
    -rw------- 1 beazley 4567040 Sep 25 18:44 _CACHE_003_
    -rw------- 1 beazley   33044 Sep 23 21:58 _CACHE_MAP_

• A bunch of cryptically named files.

Page 12: Python in Action (Part 2)

Problem : Finding Files

• Find the Firefox cache

Write a program findcache.py that takes a directory name as input and recursively scans that directory and all subdirectories looking for Firefox/Mozilla cache directories.

• Example:

    % python findcache.py /Users/beazley
    /Users/beazley/Library/.../qs1ab616.default/Cache
    /Users/beazley/Library/.../wxuoyiuf.slt/Cache
    %

• Use case: Searching for things on the filesystem.

Page 13: Python in Action (Part 2)

findcache.py

    # findcache.py
    # Recursively scan a directory looking for
    # Firefox/Mozilla cache directories

    import sys
    import os

    if len(sys.argv) != 2:
        print >>sys.stderr,"Usage: python findcache.py dirname"
        raise SystemExit(1)

    caches = (path for path,dirs,files in os.walk(sys.argv[1])
                   if '_CACHE_MAP_' in files)

    for name in caches:
        print name

Page 14: Python in Action (Part 2)

The sys module

[findcache.py listing repeated]

The sys module has basic information related to the execution environment.

    sys.argv                           # A list of the command line options
    sys.stdin, sys.stdout, sys.stderr  # Standard I/O files

For the example invocation above:

    sys.argv = ['findcache.py', '/Users/beazley']

Page 15: Python in Action (Part 2)

Program Termination

[findcache.py listing repeated]

SystemExit exception

Forces Python to exit. Value is the return code.

Page 16: Python in Action (Part 2)

os Module

[findcache.py listing repeated]

The os module contains useful OS related functions (files, processes, etc.)

Page 17: Python in Action (Part 2)

os.walk()

[findcache.py listing repeated]

os.walk(topdir)

Recursively walks a directory tree and generates a sequence of tuples (path,dirs,files):

    path  = The current directory name
    dirs  = List of all subdirectory names in path
    files = List of all regular files (data) in path
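To make those tuples concrete, here is a minimal sketch; the directory tree 'top' and its contents are hypothetical, used only for illustration:

    import os

    # Hypothetical tree:
    # top/
    #     a.txt
    #     sub/
    #         b.txt

    for path, dirs, files in os.walk('top'):
        print path, dirs, files

    # Would print something like:
    # top ['sub'] ['a.txt']
    # top/sub [] ['b.txt']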

Page 18: Python in Action (Part 2)

A Sequence of Caches

[findcache.py listing repeated]

The generator expression produces a sequence of directory names where '_CACHE_MAP_' is contained in the file list: path is the directory name that is generated as a result, and the if '_CACHE_MAP_' in files clause is the file name check.

Page 19: Python in Action (Part 2)

Printing the Result

[findcache.py listing repeated]

This prints the sequence of cache directories generated by the previous statement.

Page 20: Python in Action (Part 2)

Commentary

• Our solution is strongly based on a "declarative" programming style (again)

• We simply write out a sequence of operations that produce what we want

• Not focused on the underlying mechanics of how to traverse all of the directories.

Page 21: Python in Action (Part 2)

Big Idea : Iteration

• Python allows iteration to be captured as a kind of object.

    caches = (path for path,dirs,files in os.walk(sys.argv[1])
                   if '_CACHE_MAP_' in files)

• This de-couples iteration from the code that uses the iteration:

    for name in caches:
        print name

• Another usage example:

    for name in caches:
        print len(os.listdir(name)), name

Page 22: Python in Action (Part 2)

Big Idea : Iteration

• Compare to this:

    for path,dirs,files in os.walk(sys.argv[1]):
        if '_CACHE_MAP_' in files:
            print len(os.listdir(path)),path

• This code is simple, but the loop and the code that executes in the loop body are coupled together

• Not as flexible, but this is somewhat subtle to wrap your brain around at first (see the sketch below).
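A minimal sketch of the de-coupled style, assuming the same caches generator as above; report() is a hypothetical consumer added for illustration. Note that a generator can only be consumed once, so a new one must be created for each pass:

    import sys
    import os

    caches = (path for path,dirs,files in os.walk(sys.argv[1])
                   if '_CACHE_MAP_' in files)

    # Any consumer can now drive the iteration
    def report(dirnames):
        for name in dirnames:
            print len(os.listdir(name)), name

    report(caches)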

Page 23: Python in Action (Part 2)

Mini-Reference : sys, os

• sys module

    sys.argv          # List of command line options
    sys.stdin         # Standard input
    sys.stdout        # Standard output
    sys.stderr        # Standard error
    sys.executable    # Full path of Python executable
    sys.exc_info()    # Information on current exception

• os module

    os.walk(dir)      # Recursively walk dir producing a
                      # sequence of tuples (path,dlist,flist)
    os.listdir(dir)   # Return a list of all files in dir

• SystemExit exception

    raise SystemExit(n)  # Exit with integer code n

Page 24: Python in Action (Part 2)

Problem: Searching for Text

• Extract all URL requests from the cache

Write a program requests.py that scans the contents of the _CACHE_00n_ files and prints a list of URLs for documents stored in the cache.

• Example:

    % python requests.py /Users/.../qs1ab616.default/Cache
    http://www.yahoo.com/
    http://us.js2.yimg.com/us.yimg.com/a/1-/java/promotions/js/ad_eo_1.1.js
    http://us.i1.yimg.com/us.yimg.com/i/ww/atty_hsi_free.gif
    http://us.i1.yimg.com/us.yimg.com/i/ww/thm/1/search_1.1.png
    ...
    %

• Use case: Searching the contents of files for text patterns.

Page 25: Python in Action (Part 2)

The Firefox Cache

• The cache directory holds two types of data

• Metadata (URLs, headers, etc.)

• Raw data (HTML, JPEG, PNG, etc.)

• This data is stored in two places

• Cryptic files in the Cache directory

• Blocks inside the _CACHE_00n_ files

• Metadata almost always in _CACHE_00n_

Page 26: Python in Action (Part 2)

Possible Solution : Regex

• The _CACHE_00n_ files are encoded in a binary format, but URLs are embedded inside as null-terminated text:

    \x00\x01\x00\x08\x92\x00\x02\x18\x00\x00\x00\x13F\xff\x9f\xceF\xff\x9f\xce
    \x00\x00\x00\x00\x00\x00H)\x00\x00\x00\x1a\x00\x00\x023HTTP:http://slashdot.org/
    \x00request-method\x00GET\x00request-User-Agent\x00Mozilla/5.0 (Macintosh; U;
    Intel Mac OS X; en-US; rv:1.8.1.7) Gecko/20070914 Firefox/2.0.0.7\x00
    request-Accept-Encoding\x00gzip,deflate\x00response-head\x00HTTP/1.1 200 OK\r\n
    Date: Sun, 30 Sep 2007 13:07:29 GMT\r\nServer: Apache/1.3.37 (Unix) mod_perl/1.29\r\n
    SLASH_LOG_DATA: shtml\r\nX-Powered-By: Slash 2.005000176\r\nX-Fry: How can I live
    my life if I can't tell good from evil?\r\nCache-Control:

• Maybe the requests could just be ripped out using a regular expression.

Page 27: Python in Action (Part 2)

A Regex Solution

    # requests.py
    import re
    import os
    import sys

    cachedir   = sys.argv[1]
    cachefiles = [ '_CACHE_001_', '_CACHE_002_', '_CACHE_003_' ]

    # A regex for embedded URL strings
    request_pat = re.compile(r'([a-z]+://.*?)\x00')

    # Loop over all files and search for URLs
    for name in cachefiles:
        data = open(os.path.join(cachedir,name),"rb").read()
        index = 0
        while True:
            m = request_pat.search(data,index)
            if not m: break
            print m.group(1)
            index = m.end()

Page 28: Python in Action (Part 2)

The re module

[requests.py listing repeated]

re module

Contains all functionality related to regular expression pattern matching, searching, replacing, etc.

Features are strongly influenced by Perl, but regexes are not directly integrated into the Python language.

Page 29: Python in Action (Part 2)

Using re

[requests.py listing repeated]

Patterns are first specified as strings and compiled into a regex object:

    pat = re.compile(pattern [,flags])

The pattern syntax is "standard":

    pat*   pat+   pat?   (pat)   .
    pat1|pat2   [chars]   [^chars]   pat{n}   pat{n,m}

Page 30: Python in Action (Part 2)

Using re

[requests.py listing repeated]

All subsequent operations are methods of the compiled regex pattern:

    m = pat.match(data [,start])   # Check for match
    m = pat.search(data [,start])  # Search for match
    newdata = pat.sub(repl, data)  # Pattern replace

Page 31: Python in Action (Part 2)

Searching for Matches

[requests.py listing repeated]

pat.search(text [,start])

Searches the string text for the first occurrence of the regex pattern starting at position start. Returns a MatchObject if a match is found.

In the requests.py loop, we're finding matches one at a time.

Page 32: Python in Action (Part 2)

Match Objects

[requests.py listing repeated]

Regex matches are represented by a MatchObject:

    m.group([n])  # Text matched by group n
    m.start([n])  # Starting index of group n
    m.end([n])    # End index of group n

In requests.py, m.group(1) is the matching text for just the URL, and m.end() is the end of the match.

Page 33: Python in Action (Part 2)

Groups

[requests.py listing repeated]

In patterns, parentheses () define groups, which are numbered left to right:

    group 0   # The entire pattern
    group 1   # Text in first ()
    group 2   # Text in next ()
    ...

Page 34: Python in Action (Part 2)

Mini-Reference : re

• re pattern compilation

    pat = re.compile(r'patternstring')

• Pattern syntax

    literal     # Match literal text
    pat*        # Match 0 or more repetitions of pat
    pat+        # Match 1 or more repetitions of pat
    pat?        # Match 0 or 1 repetitions of pat
    pat1|pat2   # Match pat1 or pat2
    (pat)       # Match pat (group)
    [chars]     # Match characters in chars
    [^chars]    # Match characters not in chars
    .           # Match any character except \n
    \d          # Match any digit
    \w          # Match alphanumeric character
    \s          # Match whitespace

Page 35: Python in Action (Part 2)

Mini-Reference : re

• Common pattern operations

    pat.search(text)     # Search text for a match
    pat.match(text)      # Search start of text for match
    pat.sub(repl,text)   # Replace pattern with repl

• Match objects

    m.group([n])   # Text matched by group n
    m.start([n])   # Starting position of group n
    m.end([n])     # Ending position of group n

• How to loop over all matches of a pattern (a concrete example follows)

    for m in pat.finditer(text):
        # m is a MatchObject that you process
        ...
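As an illustration (the pattern and text here are made up for the example), finditer() drives the same one-match-at-a-time search as the requests.py loop, but without manual index bookkeeping:

    import re

    pat  = re.compile(r'([a-z]+)=(\d+)')
    text = "x=10 y=20 z=30"

    for m in pat.finditer(text):
        print m.group(1), m.group(2)

    # prints:
    # x 10
    # y 20
    # z 30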

Page 36: Python in Action (Part 2)

Mini-Reference : re

• An example of pattern replacement

    # This replaces American dates of the form 'mm/dd/yyyy'
    # with European dates of the form 'dd/mm/yyyy'.

    # This function takes a MatchObject as input and returns
    # replacement text as output.

    def euro_date(m):
        month = m.group(1)
        day   = m.group(2)
        year  = m.group(3)
        # group() returns strings, so format with %s
        return "%s/%s/%s" % (day,month,year)

    # Date re pattern and replacement operation
    datepat = re.compile(r'(\d+)/(\d+)/(\d+)')
    newdata = datepat.sub(euro_date,text)

Page 37: Python in Action (Part 2)

Mini-Reference : re

• There are many more features of the re module

• Strongly influenced by Perl (feature set)

• Regexes are a library in Python, not integrated into the language.

• A book on regular expressions may be essential for advanced functions.

Page 38: Python in Action (Part 2)

File Handling

[requests.py listing repeated]

What is going on in this statement?

    data = open(os.path.join(cachedir,name),"rb").read()

Page 39: Python in Action (Part 2)

os.path module

[requests.py listing repeated]

os.path has portable file related functions:

    os.path.join(name1,name2,...)  # Join path names
    os.path.getsize(filename)      # Get the file size
    os.path.getmtime(filename)     # Get modification date

There are many more functions, but this is the preferred module for basic filename handling.

Page 40: Python in Action (Part 2)

os.path.join()

[requests.py listing repeated]

Creates a fully-expanded pathname:

    dirname  = '/foo/bar'
    filename = 'name'

    os.path.join(dirname,filename)  ->  '/foo/bar/name'

Aware of platform differences ('/' vs. '\').

Page 41: Python in Action (Part 2)

Mini-Reference : os.path

    os.path.join(s1,s2,...)   # Join pathname parts together
    os.path.getsize(path)     # Get file size of path
    os.path.getmtime(path)    # Get modify time of path
    os.path.getatime(path)    # Get access time of path
    os.path.getctime(path)    # Get creation time of path
    os.path.exists(path)      # Check if path exists
    os.path.isfile(path)      # Check if regular file
    os.path.isdir(path)       # Check if directory
    os.path.islink(path)      # Check if symbolic link
    os.path.basename(path)    # Return file part of path
    os.path.dirname(path)     # Return dir part of path
    os.path.abspath(path)     # Get absolute path

Page 42: Python in Action (Part 2)

Binary I/O

[requests.py listing repeated]

For all binary files, use modes "rb", "wb", etc.:

    data = open(os.path.join(cachedir,name),"rb").read()

This disables new-line translation (critical on Windows).

Page 43: Python in Action (Part 2)

Common I/O Shortcuts

[requests.py listing repeated]

    # Read an entire file into a string
    data = open(filename).read()

    # Write a string out to a file
    open(filename,"w").write(text)

    # Loop over all lines in a file
    for line in open(filename):
        ...
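When a file is too large to read in one gulp (a concern raised in the commentary that follows), a sketch like this keeps memory bounded; filename and process() are placeholders:

    # Read a large binary file in fixed-size chunks instead
    # of all at once (keeps memory use bounded)
    f = open(filename,"rb")       # filename is a placeholder
    while True:
        chunk = f.read(65536)
        if not chunk:
            break
        process(chunk)            # process() is a placeholder
    f.close()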

Page 44: Python in Action (Part 2)

Commentary on Solution

• This regex approach is mostly a hack for this particular application.

• Reads entire cache files into memory as strings (may be quite large)

• Only finds URLs, no other metadata

• Some risk of false positives, since URLs could also be embedded in data.

Page 45: Python in Action (Part 2)

Commentary

• We have started to build a collection of very simple command line tools

• Very much in the "Unix tradition"

• Python makes it easy to create such tools

• More complex applications could be assembled by simply gluing scripts together

Page 46: Python in Action (Part 2)

Working with Processes

• It is common to write programs that run other programs, collect their output, etc.

• Pipes

• Interprocess Communication

• Python has a variety of modules for supporting this.

Page 47: Python in Action (Part 2)

subprocess Module

• A module for creating and interacting with subprocesses

• Consolidates a number of low-level OS functions such as system(), execv(), spawnv(), pipe(), popen2(), etc. into a single module

• Cross platform (Unix/Windows); a minimal sketch follows
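A minimal sketch of the module in use; the command here ("ls -l") is arbitrary, and any argv-style list works:

    import subprocess

    # Run a command with its stdout connected to a pipe
    p = subprocess.Popen(["ls","-l"], stdout=subprocess.PIPE)
    for line in p.stdout:
        print line,
    p.wait()     # Collect the exit status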

Page 48: Python in Action (Part 2)

Example : Slackers

• Find slacker cache entries.

Using the programs findcache.py and requests.py as subprocesses, write a program that inspects cache directories and prints out all entries that contain the word 'slashdot' in the URL.

Page 49: Python in Action (Part 2)

slackers.py

    # slackers.py
    import sys
    import subprocess

    # Run findcache.py as a subprocess
    finder = subprocess.Popen(
        [sys.executable,"findcache.py",sys.argv[1]],
        stdout=subprocess.PIPE)

    dirlist = [line.strip() for line in finder.stdout]

    # Run requests.py as a subprocess
    for cachedir in dirlist:
        searcher = subprocess.Popen(
            [sys.executable,"requests.py",cachedir],
            stdout=subprocess.PIPE)
        for line in searcher.stdout:
            if 'slashdot' in line:
                print line,

Page 50: Python in Action (Part 2)

Launching a subprocess

[slackers.py listing repeated]

The Popen() call launches a Python script as a subprocess, connecting its stdout stream to a pipe. The list comprehension collects the output, stripping the newline from each line.

Page 51: Python in Action (Part 2)

Python Executable

[slackers.py listing repeated]

sys.executable is the full pathname of the Python interpreter.

Page 52: Python in Action (Part 2)

Subprocess Arguments

[slackers.py listing repeated]

The list of arguments to the subprocess corresponds to what would appear on a shell command line.

Page 53: Python in Action (Part 2)

slackers.py

[slackers.py listing repeated]

More of the same idea. For each directory we found in the last step, we run requests.py to produce requests.

Page 54: Python in Action (Part 2)

Commentary

• subprocess is a large module with many options.

• However, it takes care of a lot of annoying platform-specific details for you.

• Currently the "recommended" way of dealing with subprocesses.

Page 55: Python in Action (Part 2)

Low Level Subprocesses

• Running a simple system command

    os.system("shell command")

• Connecting to a subprocess with pipes

    pout, pin = popen2.popen2("shell command")

• Exec/spawn

    os.execv(), os.execl(), os.execle(), ...
    os.spawnv(), os.spawnl(), os.spawnle(), ...

• Unix fork() (a minimal sketch follows)

    os.fork(), os.wait(), os.waitpid(), os._exit(), ...
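For comparison, a minimal Unix-only sketch of the fork/exec/wait pattern; the program being run (/bin/ls) is arbitrary:

    import os

    pid = os.fork()
    if pid == 0:
        # Child: replace this process with a new program
        os.execv("/bin/ls",["ls","-l"])
    else:
        # Parent: wait for the child to finish
        os.waitpid(pid,0)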

Page 56: Python in Action (Part 2)

Interactive Processes

• Python does not have built-in support for controlling interactive subprocesses (e.g., "Expect")

• You must install third party modules for this

• Example: pexpect

• http://pexpect.sourceforge.net

Page 57: Python in Action (Part 2)

Commentary

• Writing small Unix-like utilities is fairly straightforward in Python

• Support for standard kinds of operations (files, regular expressions, pipes, subprocesses, etc.)

• However, our solution is also kind of clunky

• Only returns some information

• Not particularly memory efficient (reads large files into memory)

Page 58: Python in Action (Part 2)

Interlude

• Python is well-suited to building libraries and frameworks.

• In the next part, we're going to take a totally different approach than simply writing simple utilities.

• We will build libraries for manipulating cache data and use those libraries to build tools.

Page 59: Python in Action (Part 2)

Problem : Parsing Data

• Extract the cache data (for real)

Write a module ffcache.py that contains a set of functions for reading Firefox cache data into useful data structures that can be used by other programs. Capture all available information including URLs, timestamps, sizes, locations, content types, etc.

• Use case: Blood and guts

Writing programs that can process foreign file formats. Processing binary encoded data. Creating code for later reuse.

Page 60: Python in Action (Part 2)

The Firefox Cache

• There are four critical files:

    _CACHE_MAP_   # Cache index
    _CACHE_001_   # Cache data
    _CACHE_002_   # Cache data
    _CACHE_003_   # Cache data

• All files are binary-encoded

• _CACHE_MAP_ is used by Firefox to locate data, but it is not updated until Firefox exits.

• We will ignore _CACHE_MAP_ since we want to observe caches of live Firefox sessions.

Page 61: Python in Action (Part 2)

Firefox _CACHE_ Files

• _CACHE_00n_ file organization: a free/used block bitmap (4096 bytes) followed by up to 32768 blocks.

• The block size varies according to the file:

    _CACHE_001_   256 byte blocks
    _CACHE_002_   1024 byte blocks
    _CACHE_003_   4096 byte blocks

Page 62: Python in Action (Part 2)

Cache Entries

• Each cache entry:

• A maximum of 4 cache blocks

• Can either be data or metadata

• If >16K, written to a file instead

• Notice how all the "cryptic" files are >16K:

    -rw------- beazley 111169 Sep 25 17:15 01CC0844d01
    -rw------- beazley 104991 Sep 25 17:15 01CC3844d01
    -rw------- beazley  47233 Sep 24 16:41 021F221Ad01
    ...
    -rw------- beazley  26749 Sep 21 11:19 FF8AEDF0d01
    -rw------- beazley  58172 Sep 25 18:16 FFE628C6d01

Page 63: Python in Action (Part 2)

Cache Metadata

• Metadata is encoded as a binary structure: a 36-byte header, followed by a variable-length request string and a variable-length request info section (both lengths are given in the header).

• Header encoding (binary, big-endian):

    bytes 0-3     magic (???)   unsigned int (0x00010008)
    bytes 4-7     location      unsigned int
    bytes 8-11    fetchcount    unsigned int
    bytes 12-15   fetchtime     unsigned int (system time)
    bytes 16-19   modifytime    unsigned int (system time)
    bytes 20-23   expiretime    unsigned int (system time)
    bytes 24-27   datasize      unsigned int (byte count)
    bytes 28-31   requestsize   unsigned int (byte count)
    bytes 32-35   infosize      unsigned int (byte count)

Page 64: Python in Action (Part 2)

Solution Outline

• Part 1: Parsing Metadata Headers

• Part 2: Getting request information (URL)

• Part 3: Extracting additional content info

• Part 4: Scanning of individual cache files

• Part 5: Scanning an entire directory

• Part 6: Scanning a list of directories

Page 65: Python in Action (Part 2)

Part I - Reading Headers

• Write a function that can parse the binary metadata header and return the data in a useful format

Page 66: Python in Action (Part 2)

Reading Headers

    import struct

    # This function parses a cache metadata header into a dict
    # of named fields (listed in _headernames below)

    _headernames = ['magic','location','fetchcount',
                    'fetchtime','modifytime','expiretime',
                    'datasize','requestsize','infosize']

    def parse_meta_header(headerdata):
        head = struct.unpack(">9I",headerdata)
        meta = dict(zip(_headernames,head))
        return meta

Page 67: Python in Action (Part 2)

Reading Headers

• How this is supposed to work:

    >>> f = open("Cache/_CACHE_001_","rb")
    >>> f.seek(4096)            # Skip the bit map
    >>> headerdata = f.read(36) # Read 36 byte header
    >>> meta = parse_meta_header(headerdata)
    >>> meta
    {'fetchtime': 1190829792, 'requestsize': 27, 'magic': 65544,
     'fetchcount': 3, 'expiretime': 0, 'location': 2449473536L,
     'modifytime': 1190829792, 'datasize': 29448, 'infosize': 531}
    >>>

• Basically, we're parsing the header into a useful Python data structure (a dictionary)

Page 68: Python in Action (Part 2)

struct module

[parse_meta_header listing repeated]

The struct module parses binary encoded data into Python objects. You would use this module to pack/unpack raw binary data from Python strings.

The format ">9I" unpacks 9 unsigned 32-bit big-endian integers.

Page 69: Python in Action (Part 2)

struct module

[parse_meta_header listing repeated]

The result is always a tuple of converted values:

    head = (65544, 0, 1, 1191682051, 1191682051, 0, 8645, 190, 218)

Page 70: Python in Action (Part 2)

Dictionary Creation

[parse_meta_header listing repeated]

zip(s1,s2) makes a list of tuples:

    zip(_headernames,head) -> [('magic',head[0]),
                               ('location',head[1]),
                               ('fetchcount',head[2]),
                               ...]

dict() then makes a dictionary from those (key, value) tuples. (A small interactive example follows.)
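A small interactive example of the zip()/dict() combination; the names and values are abbreviated for illustration, and the printed dictionary order may vary:

    >>> names = ['magic','location','fetchcount']
    >>> head = (65544, 0, 3)
    >>> zip(names,head)
    [('magic', 65544), ('location', 0), ('fetchcount', 3)]
    >>> dict(zip(names,head))
    {'fetchcount': 3, 'magic': 65544, 'location': 0}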

Page 71: Python in Action (Part 2)

Commentary

• Dictionaries as data structures:

    meta = {
        'fetchtime'   : 1190829792,
        'requestsize' : 27,
        'magic'       : 65544,
        'fetchcount'  : 3,
        'expiretime'  : 0,
        'location'    : 2449473536L,
        'modifytime'  : 1190829792,
        'datasize'    : 29448,
        'infosize'    : 531
    }

• Useful if data has many parts:

    data = f.read(meta[8])           # Huh?!?

    vs.

    data = f.read(meta['infosize'])  # Better

Page 72: Python in Action (Part 2)

Mini-Reference : struct

• struct module

    items = struct.unpack(fmt,data)
    data  = struct.pack(fmt,item1,...,itemn)

• Sample format codes

    'c'   char (1 byte string)
    'b'   signed char (8-bit integer)
    'B'   unsigned char (8-bit integer)
    'h'   signed short (16-bit integer)
    'H'   unsigned short (16-bit integer)
    'i'   int (32-bit integer)
    'I'   unsigned int (32-bit integer)
    'f'   32-bit single precision float
    'd'   64-bit double precision float
    's'   char s[] (string)
    '>'   big endian modifier
    '<'   little endian modifier
    '!'   network order modifier
    'n'   repetition count modifier
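A quick round-trip through pack() and unpack(), using two big-endian unsigned shorts as a stand-in for the 9-integer header format:

    >>> import struct
    >>> data = struct.pack(">2H", 1, 2)
    >>> data
    '\x00\x01\x00\x02'
    >>> struct.unpack(">2H", data)
    (1, 2)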

Page 73: Python in Action (Part 2)

Part 2 : Parsing Requests

• Write a function that will read the URL request string and request information

• Request String : a null-terminated string

• Request Info : a sequence of null-terminated key-value pairs (like a dictionary)

Page 74: Python in Action (Part 2)

Parsing Requests

    import re
    part_pat = re.compile(r'[\n\r -~]*$')

    def parse_request_data(meta,requestdata):
        parts = requestdata.split('\x00')
        for part in parts:
            if not part_pat.match(part):
                return False

        request = parts[0]
        if len(request) != (meta['requestsize'] - 1):
            return False
        info = dict(zip(parts[1::2],parts[2::2]))
        meta['request'] = request.split(':',1)[1]
        meta['info'] = info
        return True

Page 75: Python in Action (Part 2)

Usage : Requests

• Usage of the function:

    >>> f = open("Cache/_CACHE_001_","rb")
    >>> f.seek(4096)            # Skip the bit map
    >>> headerdata = f.read(36) # Read 36 byte header
    >>> meta = parse_meta_header(headerdata)
    >>> requestdata = f.read(meta['requestsize']+meta['infosize'])
    >>> parse_request_data(meta,requestdata)
    True
    >>> meta['request']
    'http://www.yahoo.com/'
    >>> meta['info']
    {'request-method': 'GET', 'request-User-Agent': 'Mozilla/5.0
    (Macintosh; U; Intel Mac OS X; en-US; rv:1.8.1.7) Gecko/20070914
    Firefox/2.0.0.7', 'charset': 'UTF-8', 'response-head': 'HTTP/1.1
    200 OK\r\nDate: Wed, 26 Sep 2007 18:03:17 ...' }
    >>>

Page 76: Python in Action (Part 2)

String Stripping

[parse_request_data listing repeated]

The request data is a sequence of null-terminated strings. This splits the data up into parts:

    requestdata = 'part\x00part\x00part\x00part\x00...'

    requestdata.split('\x00') ->

    parts = ['part','part','part','part',...]

Page 77: Python in Action (Part 2)

String Validation

[parse_request_data listing repeated]

Individual parts consist of printable characters plus the newline characters '\n' and '\r' (that is what the pattern r'[\n\r -~]*$' matches).

We use the re module to match each string. This helps catch cases where we might be reading bad data (false headers, raw data, etc.).

Page 78: Python in Action (Part 2)

URL Request String

[parse_request_data listing repeated]

The request string is the first part. The check that follows makes sure it's the right size (a further sanity check on the data integrity).

Page 79: Python in Action (Part 2)

Request Info

[parse_request_data listing repeated]

Each request has a set of associated data represented as key/value pairs:

    parts = ['request','key','val','key','val','key','val']

    parts[1::2] -> ['key','key','key']
    parts[2::2] -> ['val','val','val']

    zip(parts[1::2],parts[2::2]) -> [('key','val'),
                                     ('key','val'),
                                     ('key','val')]

dict() makes a dictionary from the (key,val) tuples. (An interactive example follows.)
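The slicing steps, run interactively (the keys and values are made up for illustration; dictionary order may vary):

    >>> parts = ['request','key1','val1','key2','val2']
    >>> parts[1::2]
    ['key1', 'key2']
    >>> parts[2::2]
    ['val1', 'val2']
    >>> dict(zip(parts[1::2],parts[2::2]))
    {'key2': 'val2', 'key1': 'val1'}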

Page 80: Python in Action (Part 2)

Fixing the Request

    # Given a dictionary of header information and a file,
    # this function extracts the request data from a cache
    # metadata entry and saves it in the dictionary. Returns
    # True or False depending on success.

    def read_request_data(header,f):
        request  = f.read(header['requestsize']).strip('\x00')
        infodata = f.read(header['infosize']).strip('\x00')

        # Validate request and infodata here (nothing now)

        # Turn the infodata into a dictionary
        parts = infodata.split('\x00')
        info = dict(zip(parts[::2],parts[1::2]))

        header['request'] = request.split(':',1)[1]
        header['info'] = info
        return True

Cleaning up the request string:

    request = "HTTP:http://www.google.com"

    request.split(':',1)    -> ['HTTP','http://www.google.com']
    request.split(':',1)[1] -> 'http://www.google.com'

Page 81: Python in Action (Part 2)

Commentary

• Emphasize that Python has very powerful list manipulation primitives

• Indexing

• Slicing

• List comprehensions

• Etc.

• Knowing how to use these leads to rapid development and compact code

Page 82: Python in Action (Part 2)

Part 3: Content Info

• All documents on the internet have optional content-type, encoding, and character set information.

• Let's add this information, since it will make it easier for us to determine the type of files that are stored in the cache (i.e., images, movies, HTML, etc.)

Page 83: Python in Action (Part 2)

HTTP Responses

• The cache metadata includes an HTTP response header:

    >>> print meta['info']['response-head']
    HTTP/1.1 200 OK
    Date: Sat, 29 Sep 2007 20:51:37 GMT
    Cache-Control: private
    Vary: User-Agent
    Content-Type: text/html; charset=utf-8
    Content-Encoding: gzip

    >>>

• The Content-Type and Content-Encoding lines carry the content type, character set, and encoding.

Page 84: Python in Action (Part 2)

Solution

    # Given a metadata dictionary, this function adds additional
    # fields related to the content type, charset, and encoding

    import email

    def add_content_info(meta):
        info = meta['info']
        if 'response-head' not in info:
            return
        else:
            rhead = info.get('response-head').split("\n",1)[1]
            m = email.message_from_string(rhead)
            content  = m.get_content_type()
            encoding = m.get('content-encoding',None)
            charset  = m.get_content_charset()
            meta['content-type'] = content
            meta['content-encoding'] = encoding
            meta['charset'] = charset

Page 85: Python in Action (Part 2)

Internet Data Handling

[add_content_info listing repeated]

Python has a vast assortment of internet data handling modules.

email: parsing of email messages, MIME headers, etc.

Page 86: Python in Action (Part 2)

Internet Data Handling

[add_content_info listing repeated]

In this code, we parse the HTTP response headers using the email module and extract content-type, encoding, and charset information. (A small interactive example follows.)
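A small interactive sketch of the same idea, using a made-up response header:

    >>> import email
    >>> rhead = "Content-Type: text/html; charset=utf-8\r\nContent-Encoding: gzip\r\n\r\n"
    >>> m = email.message_from_string(rhead)
    >>> m.get_content_type()
    'text/html'
    >>> m.get_content_charset()
    'utf-8'
    >>> m.get('content-encoding')
    'gzip'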

Page 87: Python in Action (Part 2)

Commentary

• Python is heavily used in Internet applications

• There are modules for parsing common types of data (email, HTML, XML, etc.)

• There are modules for processing bits and pieces of internet data (URLs, MIME types, RFC822 headers, etc.)

Page 88: Python in Action (Part 2)

Part 4: File Scanning

• Write a function that scans a single cache file and produces a sequence of records containing all of the cache metadata.

• This is just one more of our building blocks

• The goal is to hide some of the nasty bits

Page 89: Python in Action (Part 2)

File Scanning

    # Scan a single file in the firefox cache
    def scan_cachefile(f,blocksize):
        maxsize = 4*blocksize   # Maximum size of an entry
        f.seek(4096)            # Skip the bit-map
        while True:
            headerdata = f.read(36)
            if not headerdata: break
            meta = parse_meta_header(headerdata)
            if (meta['magic'] == 0x00010008 and
                meta['requestsize'] + meta['infosize'] < maxsize):
                requestdata = f.read(meta['requestsize']+
                                     meta['infosize'])
                if parse_request_data(meta,requestdata):
                    add_content_info(meta)
                    yield meta

            # Move the file pointer to the start of the next block
            fp = f.tell()
            if (fp % blocksize):
                f.seek(blocksize - (fp % blocksize),1)

Page 90: Python in Action (Part 2)

Usage : File Scanning

• Usage of the scan function

• We can just open up a cache file and write a for-loop to iterate over all of the entries:

    >>> f = open("Cache/_CACHE_001_","rb")
    >>> for meta in scan_cachefile(f,256):
    ...     print meta['request']
    ...
    http://www.yahoo.com/
    http://us.js2.yimg.com/us.yimg.com/a/1-/java/promotions/
    http://us.i1.yimg.com/us.yimg.com/i/ww/atty_hsi_free.gif
    ...

Page 91: Python in Action (Part 2)

Python File I/O

[scan_cachefile listing repeated]

File Objects

Modeled after ANSI C. Files are just bytes; a file pointer keeps track of the current position.

    f.read()       # Read bytes
    f.tell()       # Current fp
    f.seek(n,off)  # Move fp

Page 92: Python in Action (Part 2)

Using Earlier Code

[scan_cachefile listing repeated]

Here we are using our header parsing functions written in previous parts.

Note: We are progressively adding more data to a dictionary.

Page 93: Python in Action (Part 2)

Data Validation

[scan_cachefile listing repeated]

The magic-number and size test is a sanity check to make sure the header data looks like a valid header.

Page 94: Python in Action (Part 2)

Generating Results

[scan_cachefile listing repeated]

We are using yield to produce data for a single cache entry. If someone uses a for-loop, they will get all of the entries.

Note: This allows us to process the cache without reading all of the data into memory. (See the sketch below.)
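To see the laziness, entries can also be pulled one at a time; the dictionary shown here is abbreviated for illustration:

    >>> f = open("Cache/_CACHE_001_","rb")
    >>> entries = scan_cachefile(f,256)   # No data read yet
    >>> entries.next()                    # Reads just enough for one entry
    {'fetchtime': 1190829792, 'requestsize': 27, ...}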

Page 95: Python in Action (Part 2)

Commentary

• Have created a function that can scan a single _CACHE_00n_ file and produce a sequence of dictionaries with metadata.

• It's still somewhat low-level

• Just need to package it a little better

Page 96: Python in Action (Part 2)

Part 5 : Scan a Directory

• Write a function that takes the name of a Firefox cache directory, scans all of the cache files for metadata, and produces a single sequence of records.

• Make it real easy to extract data

Page 97: Python in Action (Part 2)

Solution : Directory Scan

    # Given the name of a Firefox cache directory, the function
    # scans all of the _CACHE_00n_ files for metadata. A sequence
    # of dictionaries containing metadata is returned.

    import os

    def scan_cache(cachedir):
        files = [('_CACHE_001_',256),
                 ('_CACHE_002_',1024),
                 ('_CACHE_003_',4096)]

        for cname,blocksize in files:
            cfile = open(os.path.join(cachedir,cname),"rb")
            for meta in scan_cachefile(cfile,blocksize):
                meta['cachedir'] = cachedir
                meta['cachefile'] = cname
                yield meta
            cfile.close()

Page 98: Python in Action (Part 2)

Solution : Directory Scan

[scan_cache listing repeated]

General idea: we loop over the three _CACHE_00n_ files and produce a sequence of the cache records.

Page 99: Python in Action (Part 2)

Solution : Directory Scan

[scan_cache listing repeated]

We use the low-level file scanning function here to generate a sequence of records.

Page 100: Python in Action (Part 2)

More Generation

[scan_cache listing repeated]

By using yield here, we are chaining together the results obtained from all three cache files into one big long sequence of results.

The underlying mechanics and implementation details are hidden (the user doesn't care).

Page 101: Python in Action (Part 2)

Additional Data

[scan_cache listing repeated]

Adding path and file information to the data (may be useful later).

Page 102: Python in Action (Part 2)

Usage : Cache Scan

• Usage of the scan function

• Given the name of a cache directory, we can just loop over all of the metadata. Trivial!

    >>> for meta in scan_cache("Cache/"):
    ...     print meta['request']
    ...
    http://www.yahoo.com/
    http://us.js2.yimg.com/us.yimg.com/a/1-/java/promotions/
    http://us.i1.yimg.com/us.yimg.com/i/ww/atty_hsi_free.gif
    ...

• With work, could perform various kinds of queries and processing of the data

Page 103: Python in Action (Part 2)

Another Example

• Find all requests related to Slashdot

    >>> for meta in scan_cache("Cache/"):
    ...     if 'slashdot' in meta['request']:
    ...         print meta['request']
    ...
    http://www.slashdot.org/
    http://images.slashdot.org/topics/topiccommunications.gif
    http://images.slashdot.org/topics/topicstorage.gif
    http://images.slashdot.org/comments.css?T_2_5_0_176
    ...

• Well, that was pretty easy.

Page 104: Python in Action (Part 2)

Another Example

• Find all large JPEG images in the cache

    >>> jpegs = (meta for meta in scan_cache("Cache/")
    ...               if meta['content-type'] == 'image/jpeg'
    ...               and meta['datasize'] > 100000)
    >>> for j in jpegs:
    ...     print j['request']
    ...
    http://images.salon.com/ent/video_dog/comedy/2007/09/27/cereal/story.jpg
    http://images.salon.com/ent/video_dog/ifc/2007/09/28/apocalypse/story.jpg
    http://www.lakesideinns.com/images/fallroadphoto2006.jpg
    ...
    >>>

• That was also pretty easy

Page 105: Python in Action (Part 2)

Part 6 : Scan Everything

• Write a function that takes a list of cache directories and produces a sequence of all cache metadata found in all of them.

• A single utility function that lets us query everything.

Page 106: Python in Action (Part 2)

Copyright (C) 2007, http://www.dabeaz.com 2-

Scanning Everything

# scan an entire list of cache directories producing# a sequence of records

def scan(cachedirs): if isinstance(cachedirs,str): cachedirs = [cachedirs] for cdir in cachedirs: for meta in scan_cache(cdir): yield meta


Page 107: Python in Action (Part 2)

Type Checking

This bit of code is an example of type checking.

If the argument is a string, we convert it to a list with one item. This allows both of the following usages:

scan("CacheDir")
scan(["CacheDir1","CacheDir2",...])
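One refinement worth noting: in Python 2, isinstance(cachedirs,str) misses unicode strings; checking against basestring covers both. A sketch of that variant:

def scan(cachedirs):
    if isinstance(cachedirs, basestring):   # str or unicode
        cachedirs = [cachedirs]
    for cdir in cachedirs:
        for meta in scan_cache(cdir):
            yield meta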

Page 108: Python in Action (Part 2)

Putting it all together

# slack.py
# Find all of those slackers who should be working
import sys, os, ffcache

if len(sys.argv) != 2:
    print >>sys.stderr,"Usage: python slack.py dirname"
    raise SystemExit(1)

caches = (path for path,dirs,files in os.walk(sys.argv[1])
               if '_CACHE_MAP_' in files)

for meta in ffcache.scan(caches):
    if 'slashdot' in meta['request']:
        print meta['request']
        print meta['cachedir']
        print


Page 109: Python in Action (Part 2)


Intermission

• Have written a simple library ffcache.py

• The library takes a moderately complex data-processing problem and breaks it up into pieces.

• About 100 lines of code.

• Now, let's build an application...


Page 110: Python in Action (Part 2)

Problem : CacheSpy

• Big Brother (make an evil sound here)


Write a program that first locates all of the Firefox cache directories under a given directory. Then have that program run forever as a network server, waiting for connections. On each connection, send back all of the current cache metadata.

• Big Picture

We're going to write a daemon that will find and quietly report on browser cache contents.

Page 111: Python in Action (Part 2)

cachespy.py

import sys, os, pickle, SocketServer, ffcache

SPY_PORT = 31337

caches = [path for path,dname,files in os.walk(sys.argv[1])
               if '_CACHE_MAP_' in files]

def dump_cache(f):
    for meta in ffcache.scan(caches):
        pickle.dump(meta,f)

class SpyHandler(SocketServer.BaseRequestHandler):
    def handle(self):
        f = self.request.makefile()
        dump_cache(f)
        f.close()

SocketServer.TCPServer.allow_reuse_address = True
serv = SocketServer.TCPServer(("",SPY_PORT),SpyHandler)
print "CacheSpy running on port %d" % SPY_PORT
serv.serve_forever()


Page 112: Python in Action (Part 2)


SocketServer Module


A module for easily creating low-level internet applications using sockets.

Page 113: Python in Action (Part 2)


SocketServer Handlers


You define a simple class that implements handle().

This implements the server logic.
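To see the pattern in isolation, here is a minimal sketch of an unrelated SocketServer application, an echo server (the handler name and port are illustrative):

import SocketServer

class EchoHandler(SocketServer.BaseRequestHandler):
    def handle(self):
        # Echo everything the client sends until it disconnects
        while True:
            data = self.request.recv(8192)
            if not data:
                break
            self.request.sendall(data)

serv = SocketServer.TCPServer(("",9000),EchoHandler)
serv.serve_forever()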

Page 114: Python in Action (Part 2)


SocketServer Servers


Next, you just create a Server object, hook the handler up to it, and run the server.

Page 115: Python in Action (Part 2)


Data Serialization


Here, we are turning a socket into a file and dumping cache data onto it.

self.request is the socket corresponding to the client that connected; makefile() wraps it in a file-like object.

Page 116: Python in Action (Part 2)


pickle Module


The pickle module takes any Python object and serializes it into a byte string.

There are really only two ops:

pickle.dump(obj,f)      # Dump object
obj = pickle.load(f)    # Load object
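A self-contained sketch of the streaming pattern used here: dump several objects into one file, then load them back one at a time until EOFError (the file name is illustrative):

import pickle

f = open("records.p","wb")
for obj in [{'a':1},{'b':2},{'c':3}]:
    pickle.dump(obj,f)          # Objects are simply concatenated
f.close()

f = open("records.p","rb")
try:
    while True:
        print pickle.load(f)    # Reads one object at a time
except EOFError:
    pass
f.close()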

Page 117: Python in Action (Part 2)

Running our Server

% python cachespy.py /Users
CacheSpy running on port 31337



• Server is just sitting there waiting

• You can try connecting with telnet

% telnet localhost 31337
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
(dp0
S'info'
p1
... bunch of cryptic data ...

Page 118: Python in Action (Part 2)


Problem : CacheMon

• The Evil Overlord (make a more evil sound)


Write a program cachemon.py that contains a function for retrieving the cache contents from a remote machine.

• Big Picture

Writing network clients. Programs that make outgoing connections to internet services.

Page 119: Python in Action (Part 2)

# cachemon.py
import pickle, socket

def scan_remote_cache(host):
    s = socket.socket(socket.AF_INET,socket.SOCK_STREAM)
    s.connect(host)
    f = s.makefile()
    try:
        while True:
            meta = pickle.load(f)
            meta['host'] = host      # Add host to metadata
            yield meta
    except EOFError:
        pass
    f.close()
    s.close()

cachemon.py


Page 120: Python in Action (Part 2)


Solution : Socket Module


The socket module provides direct access to the low-level socket API:

s = socket.socket(family,type)

s.connect(addr)
s.bind(addr)
s.listen(n)
s.accept()
s.recv(n)
s.send(data)
...
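For instance, a bare-bones TCP client looks like this (a sketch; it assumes something is listening on the given port):

import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("localhost",31337))
data = s.recv(8192)          # Read up to 8K of the reply
s.close()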

Page 121: Python in Action (Part 2)


Unpickling a Sequence


Here we use pickle to repeatedly load objects off of the socket. We use yield to generate a sequence of received objects.

Page 122: Python in Action (Part 2)

Example Usage

>>> rcache = scan_remote_cache(("localhost",31337))
>>> jpegs = (meta for meta in rcache
...             if meta['content-type'] == 'image/jpeg'
...             and meta['datasize'] > 100000)
>>> for j in jpegs:
...     print j['request']
...
http://images.salon.com/ent/video_dog/comedy/2007/09/27/cereal/story.jpg
http://images.salon.com/ent/video_dog/ifc/2007/09/28/apocalypse/story.jpg
http://www.lakesideinns.com/images/fallroadphoto2006.jpg
...


• Example: Find all JPEG images > 100K on a remote machine

• This looks almost identical to old code

Page 123: Python in Action (Part 2)

Code Similarity

• A Remote Scan

rcache = scan_remote_cache(("localhost",31337))
jpegs = (meta for meta in rcache
             if meta['content-type'] == 'image/jpeg'
             and meta['datasize'] > 100000)
for j in jpegs:
    print j['request']

• A Local Scan

cache = ffcache.scan(cachedirs)
jpegs = (meta for meta in cache
             if meta['content-type'] == 'image/jpeg'
             and meta['datasize'] > 100000)
for j in jpegs:
    print j['request']

Page 124: Python in Action (Part 2)

Big Picture

cachespy.py (server side):

for meta in ffcache.scan(dirs):
    pickle.dump(meta,f)

cachemon.py (client side):

while True:
    meta = pickle.load(f)
    yield meta

for meta in scan_remote_cache(host):
    # ...

The two programs are connected by a socket: the server pickles metadata records into it, and the client unpickles them back out, one at a time.

Page 125: Python in Action (Part 2)


Problem : Clusters

• Scan a whole cluster of machines


Write a function that can easily scan the caches of an entire collection of remote hosts.

• Big Picture

Collecting data from a group of machines on the network.

Page 126: Python in Action (Part 2)

# cachemon.py
...
def scan_cluster(hostlist):
    for host in hostlist:
        try:
            for meta in scan_remote_cache(host):
                yield meta
        except (EnvironmentError,socket.error):
            pass

cachemon.py


A bit of exception handling deals with dead machines and other problems (it would probably need to be expanded).

Page 127: Python in Action (Part 2)

Example Usage

>>> hosts = [('host1',31337),('host2',31337),...]
>>> rcaches = scan_cluster(hosts)
>>> jpegs = (meta for meta in rcaches
...             if meta['content-type'] == 'image/jpeg'
...             and meta['datasize'] > 100000)
>>> for j in jpegs:
...     print j['request']
...
...


• Example: Find all JPEG images > 100K on a set of remote machines

• Think about the abstraction of "iteration" here. Query code is exactly the same.
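That abstraction can be made explicit. Since every scanner yields the same dictionaries, a query can be written once over any metadata iterable. A sketch, with find_big_jpegs as our own illustrative name:

def find_big_jpegs(metadata):
    return (meta for meta in metadata
                 if meta['content-type'] == 'image/jpeg'
                 and meta['datasize'] > 100000)

# The same query runs locally, on one host, or on a cluster:
# find_big_jpegs(ffcache.scan(cachedirs))
# find_big_jpegs(scan_remote_cache(host))
# find_big_jpegs(scan_cluster(hosts))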

Page 128: Python in Action (Part 2)


Problem : Concurrency

• Collect data from a large set of machines


In the last section, the scan_cluster() function retrieves data from one machine at a time. However, a world-wide quasi-evil organization is likely to have at least several dozen machines.

• Your task

Modify the scanner so that it can manage concurrent client connections, reading data from multiple sources at once.

Page 129: Python in Action (Part 2)


Concurrency


• Python provides full support for threads

• They are real threads (pthreads, system threads, etc.)

• However, a lock within the Python interpreter (the Global Interpreter Lock) prevents concurrency across more than one CPU.

Page 130: Python in Action (Part 2)


Programming with Threads


• threading module provides a Thread object.

• A variety of synchronization primitives are provided (Locks, Semaphores, Condition Variables, Events, etc.)

• Can program very traditional kinds of threaded programs (multiple threads, lots of locking, race conditions, horrible debugging, etc.).

Page 131: Python in Action (Part 2)


Threads with Queues


• One technique for thread programming is to have independent threads that share data via thread-safe message queues.

• Variations of "producer-consumer" problems.

• We will use this in our solution; a minimal sketch follows. Keep in mind, it's not the only way to program threads.
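A minimal producer/consumer sketch with a shared Queue (the names here are illustrative; this is the shape our monitor will take):

import threading, Queue

def producer(q):
    for i in range(5):
        q.put(i)
    q.put(None)                 # Sentinel: no more data

q = Queue.Queue()
threading.Thread(target=producer,args=(q,)).start()
while True:
    item = q.get()
    if item is None:
        break
    print item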

Page 132: Python in Action (Part 2)

# cachemon.py
...
import threading

class ScanThread(threading.Thread):
    def __init__(self,host,msg_q):
        threading.Thread.__init__(self)
        self.host = host
        self.msg_q = msg_q
    def run(self):
        for meta in scan_remote_cache(self.host):
            self.msg_q.put(meta)

A Cache Scanning Thread


Page 133: Python in Action (Part 2)


threading Module


The threading module contains most functionality related to threads.

Page 134: Python in Action (Part 2)


Thread Base Class


Threads are defined by inheriting from the Thread base class.

Page 135: Python in Action (Part 2)


Thread Initialization


The __init__() method performs initialization and setup, storing the host and the shared message queue.

Page 136: Python in Action (Part 2)


Thread Execution


run() method

Contains code that executes in the thread.

The thread performs a scan of a single host.

Page 137: Python in Action (Part 2)


Launching a Thread


• You create a thread object and start it

t1 = ScanThread(("host1",31337),msg_q)
t1.start()

t2 = ScanThread(("host2",31337),msg_q)
t2.start()

• .start() starts the thread and calls .run()

Page 138: Python in Action (Part 2)


Thread Safe Queues


• Queue module. Provides a thread-safe queue.

import Queue
msg_q = Queue.Queue()

• Queue insertion

msg_q.put(obj)

• Queue removal

obj = msg_q.get()

• Queue can be shared by as many threads as you want without worrying about locking.

Page 139: Python in Action (Part 2)


Use of a Queue Object


A Queue object. Where incoming objects are placed.

The run() method gets data from the remote machine and puts it into the Queue.

Page 140: Python in Action (Part 2)


Primitive Use of a Queue


• You first create a queue, then launch the threads to insert data into it.

msg_q = Queue.Queue()

t1 = ScanThread(("host1",31337),msg_q)
t1.start()

t2 = ScanThread(("host2",31337),msg_q)
t2.start()

while True:
    meta = msg_q.get()     # Get metadata

Page 141: Python in Action (Part 2)

Monitor Architecture

[Diagram: each remote Host connects over a socket to its own Thread inside the Monitor. Every thread .put()s metadata records into a shared msg_q; a single Consumer .get()s them off the queue.]

Page 142: Python in Action (Part 2)

Concurrent Monitor

import threading, Queue

def concurrent_scan(hostlist, msg_q):
    thr_list = []
    for host in hostlist:
        thr = ScanThread(host,msg_q)
        thr.start()
        thr_list.append(thr)
    for thr in thr_list:
        thr.join()
    msg_q.put(None)      # Sentinel

def scan_cluster(hostlist):
    msg_q = Queue.Queue()
    threading.Thread(target=concurrent_scan,
                     args=(hostlist,msg_q)).start()
    while True:
        meta = msg_q.get()
        if meta:
            yield meta
        else:
            break


Page 143: Python in Action (Part 2)


Launching Threads


concurrent_scan() runs in a thread of its own and launches the ScanThreads. It then waits for the threads to terminate by joining with them. After all threads have terminated, a sentinel is dropped into the Queue.

Page 144: Python in Action (Part 2)


Collecting Results


scan_cluster() creates a Queue and launches a thread that, in turn, launches all of the scanning threads.

It then produces a sequence of cache data until the sentinel (None) is pulled off of the queue.

Page 145: Python in Action (Part 2)


More on Threads


• There are many more issues to thread programming that we could discuss.

• All issues concerning locking, synchronization, event handling, and race conditions apply to Python.

• Because of the global interpreter lock, threads are generally not a way to achieve higher performance.

Page 146: Python in Action (Part 2)


Thread Synchronization


• threading module has various primitives

Lock()          # Mutex Lock
RLock()         # Reentrant Mutex Lock
Semaphore(n)    # Semaphore

• Example use:

x = value            # Some kind of shared object
x_lock = Lock()      # A lock associated with x
...

x_lock.acquire()
# Modify or do something with x (critical section)
...
x_lock.release()
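As of Python 2.5, locks also work with the with statement (enabled via from __future__ import with_statement; native in 2.6), which guarantees the lock is released even if the critical section raises an exception. A sketch:

from __future__ import with_statement
from threading import Lock

x = 0
x_lock = Lock()

with x_lock:
    # Critical section; the lock is released automatically on exit
    x = x + 1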

Page 147: Python in Action (Part 2)


Story so Far


• Wrote a module ffcache.py that parsed contents of caches (~100 lines)

• Wrote cachespy.py that allows cache data to be retrieved by a remote client (~25 lines)

• Wrote a concurrent monitor for getting that data (~50 lines)

Page 148: Python in Action (Part 2)


A subtle observation


• In none of our programs have we read the entire contents of any Firefox cache into memory.

• In cachespy.py, the contents are read iteratively and piped through a socket (not stored in memory).

• In cachemon.py, contents are received and routed through message queues. Processed iteratively (no temporary lists of results).

Page 149: Python in Action (Part 2)


Another Observation


• For every connection, cachespy sends the entire contents of the Firefox cache metadata back to the monitor.

• Given that caches are ~50 MB by default, this could result in large network traffic.

• Question: Given that we're normally performing queries on the data, could we do any of this work on the remote machines?

Page 150: Python in Action (Part 2)


Remote Filtering

• Distribute the work


Modify the cachespy program so that some of the query work can be performed remotely on each of the machines. Only send back a subset of the data to the monitor program.

• Big Picture

Distributed computation. Massive security nightmare.

Page 151: Python in Action (Part 2)

The idea

• Modify scan_cluster() and all related functions to accept an optional filter specification. Pass this on to the remote machine and use it to process the data remotely before returning results.

filter = """if meta['content-type'] == 'image/jpeg'
and meta['datasize'] > 100000"""

rcaches = scan_cluster(hostlist,filter)

Page 152: Python in Action (Part 2)

Changes to the Monitor

# cachemon.py
def scan_remote_cache(host,filter=""):
    s = socket.socket(socket.AF_INET,socket.SOCK_STREAM)
    s.connect(host)
    f = s.makefile()
    pickle.dump(filter,f)
    f.flush()
    try:
        while True:
            meta = pickle.load(f)
            meta['host'] = host
            yield meta
    except EOFError:
        pass


Send the filter to the remote host right after connecting.

Add a filter parameter

Page 153: Python in Action (Part 2)

Changes to the Monitor

# cachemon.py
...
class ScanThread(threading.Thread):
    def __init__(self,host,msg_q,filter=""):
        threading.Thread.__init__(self)
        self.host = host
        self.msg_q = msg_q
        self.filter = filter
    def run(self):
        try:
            for meta in scan_remote_cache(self.host,self.filter):
                self.msg_q.put(meta)
        except (EnvironmentError,socket.error):
            pass


filter added to thread data

Page 154: Python in Action (Part 2)

def concurrent_scan(hostlist, msg_q, filter):
    thr_list = []
    for host in hostlist:
        thr = ScanThread(host,msg_q,filter)
        thr.start()
        thr_list.append(thr)
    for thr in thr_list:
        thr.join()
    msg_q.put(None)      # Sentinel

Changes to the Monitor


filter passed to thread creation

Page 155: Python in Action (Part 2)


Changes to the Monitor

# cachemon.py
...
def scan_cluster(hostlist,filter=""):
    msg_q = Queue.Queue()
    threading.Thread(target=concurrent_scan,
                     args=(hostlist,msg_q,filter)).start()
    while True:
        meta = msg_q.get()
        if not meta:
            break
        yield meta


filter added

Page 156: Python in Action (Part 2)


Commentary

• Have modified the cache monitor program to accept a filter string and to pass that string to remote clients upon connecting.

• Next: how to use the filter in the spy server.


Page 157: Python in Action (Part 2)


Changes to CacheSpy

# cachespy.py
...
def dump_cache(f,filter):
    values = """(meta for meta in ffcache.scan(caches)
                      %s)""" % filter
    try:
        for meta in eval(values):
            pickle.dump(meta,f)
    except:
        pickle.dump({'error' : traceback.format_exc()},f)


Page 158: Python in Action (Part 2)


Changes to CacheSpy


The filter is spliced into a generator-expression string. For example:

filter = "if meta['datasize'] > 100000"

values = """(meta for meta in ffcache.scan(caches)
                  if meta['datasize'] > 100000)"""

Page 159: Python in Action (Part 2)


Eval()


eval(s) evaluates the string s as a Python expression.

There is also a bit of error handling: the traceback module produces a stack trace for any exception, which is pickled back to the client.
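Worth emphasizing: eval() on a string received from the network runs arbitrary code (the slides call it a security nightmare for good reason). A common partial mitigation is to evaluate with a restricted namespace. A sketch only, and explicitly not a real security boundary:

# Variant of the eval() call in dump_cache(): limit the names
# visible to the evaluated expression. NOT real security;
# eval() sandboxes like this are generally escapable.
env = {'__builtins__': None, 'ffcache': ffcache, 'caches': caches}
for meta in eval(values, env):
    pickle.dump(meta,f)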

Page 160: Python in Action (Part 2)


Changes to the Server

# cachespy.py
...
class SpyHandler(SocketServer.BaseRequestHandler):
    def handle(self):
        f = self.request.makefile()
        filter = pickle.load(f)
        dump_cache(f,filter)
        f.close()


Get filter from the monitor

Page 161: Python in Action (Part 2)


Putting it all Together

• A remote query to find slackers


# Find all of those slashdot slackers
import cachemon

hosts = [('host1',31337),('host2',31337),
         ('host3',31337),...]

filter = "if 'slashdot' in meta['request']"

rcaches = cachemon.scan_cluster(hosts,filter)
for meta in rcaches:
    print meta['request']
    print meta['host'],meta['cachedir']
    print

Page 162: Python in Action (Part 2)


Putting it all Together

• Queries run remotely on all the hosts

• Only data of interest is sent back

• No temporary lists or large data structures

• Concurrent execution on monitor

• Concurrency is hidden from user


Page 163: Python in Action (Part 2)

The Power of Iteration

• Loop over all entries in a cache file:

for meta in scan_cachefile(f,256):
    ...

• Loop over all entries in a cache directory:

for meta in scan_cache(dirname):
    ...

• Loop over all cache entries on a remote host:

for meta in scan_remote_cache(host):
    ...

• Loop over all cache entries on many hosts:

for meta in scan_cluster(hostlist):
    ...
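And because each of these is just an iterable of dictionaries, queries compose as generator pipelines. A sketch:

slashdot = (meta for meta in scan_cluster(hostlist)
                 if 'slashdot' in meta['request'])
big = (meta for meta in slashdot
            if meta['datasize'] > 100000)
for meta in big:
    print meta['request']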

Page 164: Python in Action (Part 2)


Wrapping Up

• A lot of material has been presented

• Again, the goal was to do something interesting with Python, not to be just a reference manual.

• This is only a small taste of what's possible

• And it's only a small taste of why people like programming in Python


Page 165: Python in Action (Part 2)


Other Python Examples

• Python makes many annoying tasks relatively easy.

• Will end by showing very simple examples of other modules.


Page 166: Python in Action (Part 2)

Fetching a Web Page

• urllib and urllib2 modules

import urllib
w = urllib.urlopen("http://www.foo.com")
for line in w:
    # ...

page = urllib.urlopen("http://www.foo.com").read()

• Additional options support uploading of form values, cookies, passwords, proxies, etc.
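For example, passing form values performs a POST (a sketch; the URL and field names are made up):

import urllib

params = urllib.urlencode({'user':'dave','page':'2'})
w = urllib.urlopen("http://www.foo.com/login", params)   # POST request
reply = w.read()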

Page 167: Python in Action (Part 2)


A Web Server with CGI


• Serve files and allow CGI scripts

from BaseHTTPServer import HTTPServer
from CGIHTTPServer import CGIHTTPRequestHandler
import os

os.chdir("/home/docs/html")
serv = HTTPServer(("",8080),CGIHTTPRequestHandler)
serv.serve_forever()

• Can easily throw up a server with just a few lines of Python code.

Page 168: Python in Action (Part 2)

A Custom HTTP Server

• BaseHTTPServer module

from BaseHTTPServer import BaseHTTPRequestHandler,HTTPServer

class MyHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        ...
    def do_POST(self):
        ...
    def do_HEAD(self):
        ...
    def do_PUT(self):
        ...

serv = HTTPServer(("",8080),MyHandler)
serv.serve_forever()

• Could use to put a web server in an application
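A minimal sketch of a working handler, filling in do_GET() with a plain-text response (the handler name and message are illustrative):

from BaseHTTPServer import BaseHTTPRequestHandler,HTTPServer

class HelloHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        msg = "Hello from %s\n" % self.path
        self.send_response(200)
        self.send_header("Content-type","text/plain")
        self.send_header("Content-length",str(len(msg)))
        self.end_headers()
        self.wfile.write(msg)

serv = HTTPServer(("",8080),HelloHandler)
serv.serve_forever()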

Page 169: Python in Action (Part 2)


XML-RPC Server/Client

• How to create a stand-alone server

from SimpleXMLRPCServer import SimpleXMLRPCServer

def add(x,y):
    return x+y

s = SimpleXMLRPCServer(("",8080))
s.register_function(add)
s.serve_forever()

• How to test it (xmlrpclib)

>>> import xmlrpclib
>>> s = xmlrpclib.ServerProxy("http://localhost:8080")
>>> s.add(3,5)
8
>>> s.add("Hello","World")
"HelloWorld"
>>>

Page 170: Python in Action (Part 2)


Where to go from here?

• Network/Internet programming. Python has a large user base developing network applications, web frameworks, and internet data handling tools.

• C/C++ extension building. Python is easily extended with C/C++ code. Can use Python as a high-level control application for existing systems software.


Page 171: Python in Action (Part 2)


Where to go from here?

• GUI programming. There are several major GUI packages for Python (Tkinter, wxPython, PyQT, etc.).

• Jython and IronPython. Implementations of the Python interpreter for Java and .NET.


Page 172: Python in Action (Part 2)


Where to go from here?

• Everything Pythonic:


http://www.python.org

• Get involved. PyCon'2008 (Chicago)

• Have an on-site course (shameless plug)

http://www.dabeaz.com/python.html

Page 173: Python in Action (Part 2)


Thanks for Listening!

• Hope you got something out of the class


• Please give me feedback!