Accessing File-Specific Attributes on Steroids - EuroPython 2008

Post on 23-Jan-2015

A presentation about a tool named "fileinfo" given at EuroPython 2008.

Accessing File-Specific Attributes on Steroids

Dinu C. Gherman
gherman@python.net

EuroPython Conference 2008-07-07, Vilnius

Motivation

• Get quick overview of file attributes for multiple files

• Compare attribute values between files

• Identify groups of files

• Reuse overview results

• Avoid “opening” files with applications

Background

wc

$ cd mercurial/hgweb
$ wc -lwc *.py
     16      66     502 __init__.py
    118     438    3988 common.py
    993    2876   36064 hgweb_mod.py
    305     910   12420 hgwebdir_mod.py
    228     683    7258 protocol.py
    101     320    3577 request.py
    298     863   10698 server.py
    127     414    3907 webcommands.py
     65     190    2090 wsgicgi.py
   2251    6760   80504 total

~1971(?)

pycount

$ cd mercurial/hgweb
$ pycount2.py *.py
  lines    code     doc comment   blank file
     16       5       0       7       4 __init__.py
    118      77      22       8      11 common.py
    993     809       3      31     150 hgweb_mod.py
    305     249       0      17      39 hgwebdir_mod.py
    228     174       0      19      35 protocol.py
    101      76       2       7      16 request.py
    298     244       5      10      39 server.py
    127      93       0      10      24 webcommands.py
     65      43       0      11      11 wsgicgi.py
   2251    1770      32     120     329 total

~2002

ttfinfo

$ cd fonts/truetype/
$ ttfinfo.py -a maxp.numGlyphs -a kern.nPairs -a head.unitsPerEm A*.ttf
    249       0    1000 AmericanTypewriter.ttf
   1320    3072    2048 Arial.ttf
    245    1536    2048 ArialBlack.ttf
   1320    3072    2048 ArialBold.ttf
    956    3072    2048 ArialBoldItalic.ttf
    956    3072    2048 ArialItalic.ttf
    244     384    2048 ArialNarrow.ttf
    245     384    2048 ArialNarrowBold.ttf
    244     384    2048 ArialNarrowBoldItalic.ttf
    244     384    2048 ArialNarrowItalic.ttf
    243    1536    2048 ArialRoundedMTBold.ttf

~2005

pyinfo

$ cd mercurial/hgweb
$ pyinfo.py -a nclass:ndef:ncalls:ndiffkw *.py
 nclass    ndef  ncalls ndiffkw file
      0       2       2       3 __init__.py
      1       9      31      18 common.py
      1      60     492      24 hgweb_mod.py
      1      15     133      23 hgwebdir_mod.py
      0      11     121      21 protocol.py
      1      12      30      16 request.py
      6      24     104      18 server.py
      0      14      50      15 webcommands.py
      0       3      15      13 wsgicgi.py
     10     150     978         total

2007…

pdfinfo

$ cd brandeins/200805_bildung
$ pdfinfo.py -a npages:nimgs:author *.pdf
 npages   nimgs author  file
      1       1 Kathrin 802053_008b10508m.pdf
      1       0 Kathrin 802055_010b10508w.pdf
      2       0 Kathrin 802056_012b10508m.pdf
      2       1 Kathrin 802057_018b10508d.pdf
      1       1 Kathrin 802060_020b10508m.pdf
      9       8 n/a     802064_022b10508d.pdf
      8       8 Kathrin 802067_036b10508w.pdf
      2       0 Kathrin 803048_136b10508w.pdf
     26      19         total

2007…

Fileinfo

Big Picture

• Describe input files & attributes

• Locate input files

• Investigate file attributes

• Process file attributes

• Present tabular output
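The five steps above can be sketched as a small pipeline. This is only an illustration under stated assumptions, not fileinfo's actual code: the function names (`locate`, `investigate`, `process`, `present`, `fileinfo`), the `GETTERS` table and the two sample attributes `size` and `lc` are all hypothetical.

```python
import glob
import os


def count_lines(path):
    """Count newline characters in a file, like `wc -l`."""
    with open(path, "rb") as f:
        return sum(chunk.count(b"\n") for chunk in iter(lambda: f.read(65536), b""))


# Hypothetical attribute table: attribute name -> function taking a path.
GETTERS = {"size": os.path.getsize, "lc": count_lines}


def locate(patterns):
    """Locate input files: expand glob patterns into a sorted list of files."""
    found = set()
    for pattern in patterns:
        found.update(p for p in glob.glob(pattern) if os.path.isfile(p))
    return sorted(found)


def investigate(path, attrs):
    """Investigate one file: collect the requested attributes ('n/a' if unknown)."""
    record = {"path": path}
    for name in attrs:
        getter = GETTERS.get(name)
        record[name] = getter(path) if getter else "n/a"
    return record


def process(records, attrs):
    """Process attributes: append a totals row for all-integer columns."""
    total = {"path": "total"}
    for name in attrs:
        values = [r[name] for r in records]
        total[name] = sum(values) if all(isinstance(v, int) for v in values) else ""
    return records + [total]


def present(records, attrs):
    """Present tabular output: right-aligned columns, then the file name."""
    lines = [" ".join(f"{name:>8}" for name in attrs) + " file"]
    for r in records:
        cells = " ".join(f"{str(r[name]):>8}" for name in attrs)
        lines.append(f"{cells} {os.path.basename(r['path'])}")
    return "\n".join(lines)


def fileinfo(patterns, attrs):
    """Describe the job (patterns and attribute names), then run all steps."""
    records = [investigate(p, attrs) for p in locate(patterns)]
    return present(process(records, attrs), attrs)
```

Calling `print(fileinfo(["*.py"], ["lc", "size"]))` in a source directory would print one row per file plus a totals row, in the spirit of the `wc` output shown earlier.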

Input Files Examples

• fileinfo [opts] /mypath/*.pdf

• fileinfo [opts] $(find /mypath -name "*.py")

• fileinfo [opts] $(mdfind -onlyin /mypath -name "*.py")

Attributes Examples

• --attrs nclasses:ndefs

• --sort size:ndefs

• --filter "rec.ndefs > 1000"
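One way the three options could act on already collected records is sketched below. It assumes (the slides do not say so) that the filter string is a Python expression evaluated against a record object named `rec`, and that `--attrs` and `--sort` take colon-separated attribute names. The function name `apply_options` is hypothetical, and the sample values are borrowed from the `wc` and `pyinfo` tables above purely for illustration.

```python
from types import SimpleNamespace


def apply_options(records, attrs, sort=None, filter_expr=None):
    """Filter, sort and project a list of record dicts.

    attrs and sort are colon-separated attribute names; filter_expr is a
    Python expression that may refer to the current record as `rec`.
    """
    rows = list(records)
    if filter_expr:
        # Assumption: eval() is tolerable here because the expression comes
        # from the user's own command line. Never do this with untrusted input.
        rows = [
            r for r in rows
            if eval(filter_expr, {"__builtins__": {}}, {"rec": SimpleNamespace(**r)})
        ]
    if sort:
        keys = sort.split(":")
        rows.sort(key=lambda r: tuple(r[k] for k in keys))
    columns = attrs.split(":") + ["path"]
    return [{name: r[name] for name in columns} for r in rows]


records = [
    {"path": "common.py", "nclasses": 1, "ndefs": 9, "size": 3988},
    {"path": "hgweb_mod.py", "nclasses": 1, "ndefs": 60, "size": 36064},
    {"path": "server.py", "nclasses": 6, "ndefs": 24, "size": 10698},
]
rows = apply_options(records, "nclasses:ndefs",
                     sort="size:ndefs", filter_expr="rec.ndefs > 10")
for row in rows:
    print(row)
```

With these sample records the filter drops `common.py`, and the remaining rows come out ordered by size, `server.py` before `hgweb_mod.py`.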

Output Formats

• Text, HTML, CSV, ReST (simple)

• Cocoa, WxPython

• Django
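For the text-based formats, a renderer can be little more than a function from a header and rows to a string. The sketch below covers ReST simple tables and CSV only. The function names `to_rest_simple` and `to_csv` are hypothetical and not taken from fileinfo, and cells are left-aligned here for simplicity.

```python
import csv
import io


def to_rest_simple(header, rows):
    """Render a header and rows as a reStructuredText simple table."""
    table = [list(header)] + [[str(cell) for cell in row] for row in rows]
    widths = [max(len(row[i]) for row in table) for i in range(len(header))]
    rule = " ".join("=" * w for w in widths)
    body = [
        " ".join(cell.ljust(w) for cell, w in zip(row, widths)).rstrip()
        for row in table
    ]
    return "\n".join([rule, body[0], rule] + body[1:] + [rule])


def to_csv(header, rows):
    """Render the same data as CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


header = ["npages", "nimgs", "path"]
rows = [(11, 3, "540237_058b11207s.pdf"), (8, 18, "540238_070b11207r.pdf")]
print(to_rest_simple(header, rows))
print(to_csv(header, rows))
```

Keeping every format behind the same (header, rows) interface is one possible way to make output formats pluggable later on.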

Selected Plug-ins

• General: counter, wc, lc, md5

• OS: uid, username, mtime, size, level

• XML: nattrs, ndattrs, ntags, ndtags, depht

• TTF: kern.nPairs, maxp.numGlyphs, maxp.version, head.unitsPerEm

• PDF: title, author, producer, creationdate, npages, nimgs

• MP3: album, artist

• Python: ndefs, nclasses, ncalls, nstrs, ndocstrs, nkws, ndkws, nimpstmts, nops, mlw, mil, …

• Quicktime: duration, box, datasize, ntracks

• OS X bundles: bundlename, bundleversion

• Spotlight: kMDItem*

Examples

$ cd /Data/brandeins/200712_design
$ fileinfo --format rest-simple -a npages:nimgs \
    -f "rec.nimgs > 2" *.pdf
====== ===== =====================
npages nimgs path
====== ===== =====================
    11     3 540237_058b11207s.pdf
     8    18 540238_070b11207r.pdf
     7    11 540240_082b11207a.pdf
     9     9 540242_096b11207f.pdf
     3     5 540243_106b11207r.pdf
    11    15 540244_110b11207s.pdf
     7     8 540245_122b11207s.pdf
     2     3 540246_136b11207s.pdf
     2     3 540248_148b11207d.pdf
     6     6 540252_138b11207b.pdf
     8     6 540260_026b11207h.pdf
     6     5 540261_038b11207o.pdf
     8    10 540262_048b11207m.pdf
     6     6 540263_156b11207d.pdf
     7     6 540265_170b11207h.pdf
   101   114 total
====== ===== =====================

Implementation

PDF-Plugin (1)

class PDFInvestigator(BaseInvestigator):
    "A class for determining attributes of PDF files."

    attrMap = {
        "title": "getTitle",
        "author": "getAuthor",
        "producer": "getProducer",
        "creationdate": "getCreationDate",
        "npages": "getNumPdfPages",
        "nimgs": "getNumImages",
    }

    totals = ("npages", "nimgs")

    def activate(self):
        "Try activating self, setting 'active' variable."

        # calculate self.active...
        return self.active
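The slides do not show `BaseInvestigator` itself. The sketch below is a hypothetical stand-in that shows how an `attrMap` of attribute names to method names could be dispatched with `getattr`, together with a toy plug-in. `getAttrValue`, `TextInvestigator` and its two attributes are assumptions made for illustration only.

```python
class BaseInvestigator:
    """Hypothetical stand-in for fileinfo's base class (not shown on the slides)."""

    attrMap = {}   # attribute name -> name of the method computing it
    totals = ()    # attributes for which a total makes sense

    def __init__(self, path):
        self.path = path
        self.active = False

    def activate(self):
        "Try activating self, setting 'active' variable."
        self.active = True
        return self.active

    def getAttrValue(self, name):
        "Dispatch to the method named in attrMap; return 'n/a' on failure."
        method_name = self.attrMap.get(name)
        if method_name is None:
            return "n/a"
        if not self.active and not self.activate():
            return "n/a"
        return getattr(self, method_name)()


class TextInvestigator(BaseInvestigator):
    "A toy plug-in counting lines and words of a text file."

    attrMap = {"lc": "getLineCount", "wc": "getWordCount"}
    totals = ("lc", "wc")

    def activate(self):
        "Activate only if the file can be read as UTF-8 text."
        try:
            with open(self.path, encoding="utf-8") as f:
                self.content = f.read()
            self.active = True
        except (OSError, UnicodeDecodeError):
            self.active = False
        return self.active

    def getLineCount(self):
        return len(self.content.splitlines())

    def getWordCount(self):
        return len(self.content.split())
```

Under this design a new plug-in only has to supply `attrMap`, `totals`, an `activate` method and one getter per attribute, which fits the "easy to write plug-ins" claim made in the summary.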

PDF-Plugin (2)

    def getNumPdfPages(self):
        "Return the number of pages in a PDF document."

        try:
            # uses PyPdf
            res = self.input.getNumPages()
        except Exception:
            # any failure while reading the PDF is reported as "n/a"
            res = "n/a"

        return res

PDF-Plugin (3)

    def getNumImages(self):
        "Return the number of images in a PDF document."
        # match every indirect object, "N G obj ... endobj"
        expr = r"\d+ +\d+ +obj.*?endobj\s+(?:%.*?[\r\n])?"
        objPat = re.compile(expr, re.M | re.S)
        items = re.findall(objPat, self.content)
        # keep only objects declaring /Type /XObject and /Subtype /Image
        for p in [re.compile(r"/%s\s*/%s" % (k, v), re.M | re.S)
                  for (k, v) in [("Type", "XObject"), ("Subtype", "Image")]]:
            items = [i for i in items if re.search(p, i) is not None]

        return len(items)
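To see what this heuristic counts, the same two-stage regular-expression filter can be run on a small hand-written string. The stand-alone function `count_image_objects` and the `SAMPLE` text are made up for this illustration. `SAMPLE` is not a valid PDF; it only mimics the `N G obj ... endobj` syntax.

```python
import re

# Same pattern as on the slide: one match per "N G obj ... endobj" block.
OBJ_PATTERN = re.compile(r"\d+ +\d+ +obj.*?endobj\s+(?:%.*?[\r\n])?", re.M | re.S)


def count_image_objects(content):
    """Count objects that declare both /Type /XObject and /Subtype /Image."""
    items = OBJ_PATTERN.findall(content)
    for key, value in [("Type", "XObject"), ("Subtype", "Image")]:
        pattern = re.compile(r"/%s\s*/%s" % (key, value))
        items = [item for item in items if pattern.search(item)]
    return len(items)


# Hand-written sample: a catalog, one image XObject and one form XObject.
SAMPLE = (
    "1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"
    "2 0 obj\n<< /Type /XObject /Subtype /Image /Width 10 >>\n"
    "stream\n...\nendstream\nendobj\n"
    "3 0 obj\n<< /Type /XObject /Subtype /Form >>\nendobj\n"
)

print(count_image_objects(SAMPLE))
```

Only the second object passes both filters. Because the approach scans the raw file text, it would presumably miss images stored inside compressed object streams and image dictionaries that omit the optional `/Type` entry, so the result is best read as an estimate.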

An Aside: Spotlight

Spotlight

• Desktop file search

• Mac OS X 10.4 and 10.5

• Deeply integrated in Mac OS X

• Index-based, with attributes

• Results based on relevance and recency

• Plug-ins/API for custom file formats

• GUI & command-line

Spotlight Menu

Spotlight Window

Spotlight

$ mdfind europython | egrep "\.pdf$"
/Users/dinu/Desktop/EuroPython2008Timetable.pdf
/Users/dinu/Developer/Python/fileinfo/presentation/fileinfo-slides.pdf
/Data/Perso/CV/cv-dg.pdf
/Users/dinu/Library/Mail Downloads/cv-dg.pdf
/Users/dinu/Developer/Python/epc2008/badge_data.pdf
/Data/Docs/dev/The Python Papers/ThePythonPapersVolume2Issue4.pdf
/Data/Docs/dev/The Python Papers/ThePythonPapersVolume3Issue1.pdf
/Data/Docs/dev/The Python Papers/ThePythonPapersVolume2Issue3.pdf
/Data/Docs/dev/The Python Papers/The Python Papers Volume 2, Issue 2.pdf
/Data/Docs/dev/The Python Papers/The Python Papers Volume 2, Issue 1.pdf
/Users/dinu/Developer/Python/epc2008/badge_data-hpda.pdf
/Users/dinu/Developer/Python/epc2008/badge_data-hpda-sliced.pdf
/Data/Perso/Travel/Vilnius2008/EuroPython 2008 Invoice.pdf
/Users/dinu/Developer/Python/hipsterpda/output/badges.pdf
...

Spotlight – Pro

• Great index/search technology

• Very fast, useful and easy to use

• ~125 search attributes in Mac OS X 10.5 (e.g. Aperture, Composer, …)

• Extensible (Python plug-in available)

Spotlight – Con

• Result on command-line not in table form

• Result in GUI is always a list of file names plus the attributes that the Finder (!) knows

• Weak on providing overview

• Mac OS X only

Future

Issues

• Testing, debugging & refactoring, …

• Better folder handling (e.g. OS X bundles)

• Attribute namespaces (pdf.npages)?

• Attribute parameters (nattr#h2)?

• Attribute Null values (“n/a”)?

• Better dependency handling

More Features?

• Output format plug-ins?

• Pylint plug-in for fileinfo?

• Fileinfo Python plug-in ⇒ pyinfo.py?

• Plug-ins for functions like total()

• Access intra-file dataset attributes?

• Multi-line attribute values?

• “Abbreviations” for attribute lists?

• Derived attributes (ncomments/loc)?

Summary

• Useful as general purpose attribute “browser”

• Access to Spotlight meta-data (Mac OS X)

• Easy to write plug-ins

• Fileinfo not like Spotlight (no index/search)

• More like iTunes (on the command-line ;-)

Questions?