Accessing File-Specific Attributes on Steroids - EuroPython 2008

Accessing File-Specific Attributes on Steroids, Dinu C. Gherman, EuroPython Conference 2008-07-07, Vilnius

description

A presentation about a tool named "fileinfo" given at EuroPython 2008.

Transcript of Accessing File-Specific Attributes on Steroids - EuroPython 2008

Page 1: Accessing File-Specific Attributes on Steroids - EuroPython 2008

Accessing File-Specific Attributes on Steroids

Dinu C. Gherman

EuroPython Conference 2008-07-07, Vilnius

Page 2: Accessing File-Specific Attributes on Steroids - EuroPython 2008

Motivation

• Get quick overview of file attributes for multiple files

• Compare attribute values between files

• Identify groups of files

• Reuse overview results

• Avoid “opening” files with applications

Page 3: Accessing File-Specific Attributes on Steroids - EuroPython 2008

Background

Page 4: Accessing File-Specific Attributes on Steroids - EuroPython 2008

wc

$ cd mercurial/hgweb
$ wc -lwc *.py

  16   66   502 __init__.py
 118  438  3988 common.py
 993 2876 36064 hgweb_mod.py
 305  910 12420 hgwebdir_mod.py
 228  683  7258 protocol.py
 101  320  3577 request.py
 298  863 10698 server.py
 127  414  3907 webcommands.py
  65  190  2090 wsgicgi.py
2251 6760 80504 total

~1971(?)

Page 5: Accessing File-Specific Attributes on Steroids - EuroPython 2008

pycount

$ cd mercurial/hgweb
$ pycount2.py *.py

lines  code  doc  comment  blank  file
   16     5    0        7      4  __init__.py
  118    77   22        8     11  common.py
  993   809    3       31    150  hgweb_mod.py
  305   249    0       17     39  hgwebdir_mod.py
  228   174    0       19     35  protocol.py
  101    76    2        7     16  request.py
  298   244    5       10     39  server.py
  127    93    0       10     24  webcommands.py
   65    43    0       11     11  wsgicgi.py
 2251  1770   32      120    329  total

~2002

Page 6: Accessing File-Specific Attributes on Steroids - EuroPython 2008

ttfinfo

$ cd fonts/truetype/
$ ttfinfo.py -a maxp.numGlyphs -a kern.nPairs -a head.unitsPerEm A*.ttf

 249     0  1000  AmericanTypewriter.ttf
1320  3072  2048  Arial.ttf
 245  1536  2048  ArialBlack.ttf
1320  3072  2048  ArialBold.ttf
 956  3072  2048  ArialBoldItalic.ttf
 956  3072  2048  ArialItalic.ttf
 244   384  2048  ArialNarrow.ttf
 245   384  2048  ArialNarrowBold.ttf
 244   384  2048  ArialNarrowBoldItalic.ttf
 244   384  2048  ArialNarrowItalic.ttf
 243  1536  2048  ArialRoundedMTBold.ttf

~2005

Page 7: Accessing File-Specific Attributes on Steroids - EuroPython 2008

pyinfo

$ cd mercurial/hgweb
$ pyinfo.py -a nclass:ndef:ncalls:ndiffkw *.py

nclass  ndef  ncalls  ndiffkw  file
     0     2       2        3  __init__.py
     1     9      31       18  common.py
     1    60     492       24  hgweb_mod.py
     1    15     133       23  hgwebdir_mod.py
     0    11     121       21  protocol.py
     1    12      30       16  request.py
     6    24     104       18  server.py
     0    14      50       15  webcommands.py
     0     3      15       13  wsgicgi.py
    10   150     978           total

2007…

Page 8: Accessing File-Specific Attributes on Steroids - EuroPython 2008

pdfinfo

$ cd brandeins/200805_bildung
$ pdfinfo.py -a npages:nimgs:author *.pdf

npages  nimgs  author   file
     1      1  Kathrin  802053_008b10508m.pdf
     1      0  Kathrin  802055_010b10508w.pdf
     2      0  Kathrin  802056_012b10508m.pdf
     2      1  Kathrin  802057_018b10508d.pdf
     1      1  Kathrin  802060_020b10508m.pdf
     9      8  n/a      802064_022b10508d.pdf
     8      8  Kathrin  802067_036b10508w.pdf
     2      0  Kathrin  803048_136b10508w.pdf
    26     19           total

2007…

Page 9: Accessing File-Specific Attributes on Steroids - EuroPython 2008

Fileinfo

Page 10: Accessing File-Specific Attributes on Steroids - EuroPython 2008

Big Picture

• Describe input files & attributes

• Locate input files

• Investigate file attributes

• Process file attributes

• Present tabular output
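The five steps above can be sketched in a few lines of Python. This is an illustration only, not fileinfo's actual code: the function names (`investigate`, `process`, `present`), the `GETTERS` table and its two attributes are assumptions made for this example.

```python
import glob
import os


def count_lines(path):
    """Count newline characters in a file, roughly like `wc -l`."""
    with open(path, "rb") as f:
        return sum(chunk.count(b"\n") for chunk in iter(lambda: f.read(65536), b""))


# Describe: map attribute names to the functions computing them (hypothetical).
GETTERS = {"size": os.path.getsize, "lc": count_lines}


def investigate(path, attrs):
    """Investigate: build one record (a dict) for one file."""
    rec = {"path": path}
    for name in attrs:
        try:
            rec[name] = GETTERS[name](path)
        except (KeyError, OSError):
            rec[name] = "n/a"  # unknown attribute or unreadable file
    return rec


def process(records, attrs):
    """Process: append a 'total' record summing the numeric values."""
    total = {"path": "total"}
    for name in attrs:
        total[name] = sum(r[name] for r in records if isinstance(r[name], int))
    return records + [total]


def present(records, attrs):
    """Present: render the records as plain-text rows with a header."""
    columns = attrs + ["path"]
    lines = [" ".join(columns)]
    for rec in records:
        lines.append(" ".join(str(rec[c]) for c in columns))
    return "\n".join(lines)


if __name__ == "__main__":
    attrs = "size:lc".split(":")
    paths = sorted(glob.glob("*.py"))  # Locate: here a simple glob
    print(present(process([investigate(p, attrs) for p in paths], attrs), attrs))
```

Keeping each step a separate function mirrors the list above: the locating step can be swapped for `find` or `mdfind` output without touching the rest.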

Page 11: Accessing File-Specific Attributes on Steroids - EuroPython 2008

Input Files Examples

• fileinfo [opts] /mypath/*.pdf

• fileinfo [opts] $(find /mypath -name "*.py")

• fileinfo [opts] $(mdfind -onlyin /mypath -name "*.py")

Page 12: Accessing File-Specific Attributes on Steroids - EuroPython 2008

Attributes Examples

• --attrs nclasses:ndefs

• --sort size:ndefs

• --filter "rec.ndefs > 1000"
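A filter such as `"rec.ndefs > 1000"` reads like a Python expression evaluated once per record, and `size:ndefs` like a colon-separated list of sort keys. The slides do not show how fileinfo implements this, so the following is only a sketch under that assumption; `apply_filter` and `apply_sort` are hypothetical names.

```python
from types import SimpleNamespace


def apply_filter(records, expr):
    """Keep the records for which expr (using the name 'rec') is true."""
    code = compile(expr, "<filter>", "eval")
    # Builtins are emptied to limit what the expression can reach; this is
    # NOT a real sandbox and should not be used with untrusted input.
    return [rec for rec in records
            if eval(code, {"__builtins__": {}}, {"rec": rec})]


def apply_sort(records, spec):
    """Sort by a colon-separated list of attribute names, e.g. 'size:ndefs'."""
    keys = spec.split(":")
    return sorted(records, key=lambda rec: tuple(getattr(rec, k) for k in keys))


if __name__ == "__main__":
    records = [
        SimpleNamespace(path="a.py", size=300, ndefs=1200),
        SimpleNamespace(path="b.py", size=100, ndefs=5),
        SimpleNamespace(path="c.py", size=100, ndefs=2000),
    ]
    for rec in apply_sort(apply_filter(records, "rec.ndefs > 1000"), "size:ndefs"):
        print(rec.size, rec.ndefs, rec.path)
```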

Page 13: Accessing File-Specific Attributes on Steroids - EuroPython 2008

Output Formats

• Text, HTML, CSV, ReST (simple)

• Cocoa, WxPython

• Django
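As an illustration of the simplest of these formats, the sketch below renders records as a ReST-style simple table. It is not fileinfo's formatter; the function name `rest_simple_table` and the alignment rule (right-align everything except the last column) are assumptions.

```python
def rest_simple_table(header, rows):
    """Return the rows formatted as a reStructuredText-style simple table."""
    cells = [[str(c) for c in row] for row in rows]
    # Each column is as wide as its longest entry, header included.
    widths = [
        max([len(h)] + [len(row[i]) for row in cells])
        for i, h in enumerate(header)
    ]
    rule = " ".join("=" * w for w in widths)

    def fmt(row):
        # Right-align every column except the last one.
        parts = [c.rjust(w) for c, w in zip(row[:-1], widths[:-1])]
        parts.append(row[-1])
        return " ".join(parts)

    body = [fmt(row) for row in cells]
    return "\n".join([rule, fmt(list(header)), rule] + body + [rule])


if __name__ == "__main__":
    print(rest_simple_table(
        ["npages", "nimgs", "path"],
        [(11, 3, "a.pdf"), (8, 18, "b.pdf")],
    ))
```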

Page 14: Accessing File-Specific Attributes on Steroids - EuroPython 2008

Selected Plug-ins

General: counter, wc, lc, md5
OS: uid, username, mtime, size, level
XML: nattrs, ndattrs, ntags, ndtags, depht
TTF: kern.nPairs, maxp.numGlyphs, maxp.version, head.unitsPerEm
PDF: title, author, producer, creationdate, npages, nimgs
MP3: album, artist
Python: ndefs, nclasses, ncalls, nstrs, ndocstrs, nkws, ndkws, nimpstmts, nops, mlw, mil, …
Quicktime: duration, box, datasize, ntracks
OS X bundles: bundlename, bundleversion
Spotlight: kMDItem*

Page 15: Accessing File-Specific Attributes on Steroids - EuroPython 2008

Examples

Page 16: Accessing File-Specific Attributes on Steroids - EuroPython 2008

$ cd /Data/brandeins/200712_design
$ fileinfo --format rest-simple -a npages:nimgs \
    -f "rec.nimgs > 2" *.pdf
====== ===== =====================
npages nimgs path
====== ===== =====================
    11     3 540237_058b11207s.pdf
     8    18 540238_070b11207r.pdf
     7    11 540240_082b11207a.pdf
     9     9 540242_096b11207f.pdf
     3     5 540243_106b11207r.pdf
    11    15 540244_110b11207s.pdf
     7     8 540245_122b11207s.pdf
     2     3 540246_136b11207s.pdf
     2     3 540248_148b11207d.pdf
     6     6 540252_138b11207b.pdf
     8     6 540260_026b11207h.pdf
     6     5 540261_038b11207o.pdf
     8    10 540262_048b11207m.pdf
     6     6 540263_156b11207d.pdf
     7     6 540265_170b11207h.pdf
   101   114 total
====== ===== =====================

Page 17: Accessing File-Specific Attributes on Steroids - EuroPython 2008
Page 18: Accessing File-Specific Attributes on Steroids - EuroPython 2008
Page 19: Accessing File-Specific Attributes on Steroids - EuroPython 2008

Implementation

Page 20: Accessing File-Specific Attributes on Steroids - EuroPython 2008

PDF-Plugin (1)

class PDFInvestigator(BaseInvestigator):
    "A class for determining attributes of PDF files."

    attrMap = {
        "title": "getTitle",
        "author": "getAuthor",
        "producer": "getProducer",
        "creationdate": "getCreationDate",
        "npages": "getNumPdfPages",
        "nimgs": "getNumImages",
    }

    totals = ("npages", "nimgs")

    def activate(self):
        "Try activating self, setting 'active' variable."

        # calculate self.active...
        return self.active
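The `attrMap` above maps attribute names to method names, and `totals` lists the attributes that are summed in the "total" row. The slides do not show `BaseInvestigator` itself, so the sketch below uses a hypothetical stand-in to illustrate how such a name-to-method lookup could work; `getAttrValue`, `TextInvestigator` and `totalsRow` are invented for this example.

```python
class BaseInvestigator:
    "Hypothetical stand-in for fileinfo's base class (not the real one)."

    attrMap = {}
    totals = ()

    def __init__(self, content):
        self.content = content

    def getAttrValue(self, name):
        "Look up the method registered for an attribute name and call it."
        methodName = self.attrMap.get(name)
        method = getattr(self, methodName, None) if methodName else None
        return method() if method else "n/a"


class TextInvestigator(BaseInvestigator):
    "A toy plug-in computing two attributes of a text."

    attrMap = {"nlines": "getNumLines", "nwords": "getNumWords"}
    totals = ("nlines", "nwords")

    def getNumLines(self):
        return len(self.content.splitlines())

    def getNumWords(self):
        return len(self.content.split())


def totalsRow(investigators, attrs):
    "Sum every attribute listed in 'totals'; leave the others blank."
    row = []
    for name in attrs:
        if all(name in inv.totals for inv in investigators):
            values = [inv.getAttrValue(name) for inv in investigators]
            row.append(sum(v for v in values if v != "n/a"))
        else:
            row.append("")
    return row


if __name__ == "__main__":
    a = TextInvestigator("one two\nthree\n")
    b = TextInvestigator("four\n")
    print(totalsRow([a, b], ["nlines", "nwords", "author"]))
```

Declaring the mapping as class data keeps a plug-in down to a dictionary plus one method per attribute, which fits the "easy to write plug-ins" claim in the summary.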

Page 21: Accessing File-Specific Attributes on Steroids - EuroPython 2008

PDF-Plugin (2)

    def getNumPdfPages(self):
        "Return the number of pages in a PDF document."

        try:
            # uses PyPdf
            res = self.input.getNumPages()
        except Exception:
            # any failure while reading the document is reported as "n/a"
            res = "n/a"

        return res

Page 22: Accessing File-Specific Attributes on Steroids - EuroPython 2008

PDF-Plugin (3)

    # requires "import re" at module level
    def getNumImages(self):
        "Return the number of images in a PDF document."
        # match every "N G obj ... endobj" block in the raw file content
        expr = r"\d+ +\d+ +obj.*?endobj\s+(?:%.*?[\r\n])?"
        objPat = re.compile(expr, re.M | re.S)
        items = re.findall(objPat, self.content)
        # keep only objects that are both /Type /XObject and /Subtype /Image
        for p in [re.compile(r"/%s\s*/%s" % (k, v), re.M | re.S)
                  for (k, v) in [("Type", "XObject"), ("Subtype", "Image")]]:
            items = [i for i in items if re.search(p, i) is not None]

        return len(items)
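The same regular expressions can be tried outside the plug-in class. The sketch below is a standalone variant operating on a string; the function name `count_pdf_images` and the sample content are made up for this example, and the approach presumably misses images stored in compressed object streams or as inline images, since those do not appear as plain `obj ... endobj` blocks.

```python
import re

# Every indirect object: "N G obj ... endobj", plus an optional trailing comment.
OBJ_PAT = re.compile(r"\d+ +\d+ +obj.*?endobj\s+(?:%.*?[\r\n])?", re.M | re.S)

# An image object must match both of these key/value pairs.
FILTERS = [
    re.compile(r"/%s\s*/%s" % (key, value))
    for key, value in [("Type", "XObject"), ("Subtype", "Image")]
]


def count_pdf_images(content):
    """Count objects that are /Type /XObject and /Subtype /Image."""
    items = OBJ_PAT.findall(content)
    for pat in FILTERS:
        items = [item for item in items if pat.search(item)]
    return len(items)


# Synthetic sample, not a valid PDF: one image, one page, one form XObject.
SAMPLE = (
    "1 0 obj\n<< /Type /XObject /Subtype /Image /Width 10 >>\nendobj\n"
    "2 0 obj\n<< /Type /Page >>\nendobj\n"
    "3 0 obj\n<< /Type /XObject /Subtype /Form >>\nendobj\n"
)

if __name__ == "__main__":
    print(count_pdf_images(SAMPLE))
```

For a real file under Python 3, the content could be read in binary mode and decoded as Latin-1 before being passed in, so that every byte maps to one character.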

Page 23: Accessing File-Specific Attributes on Steroids - EuroPython 2008

An Aside: Spotlight

Page 24: Accessing File-Specific Attributes on Steroids - EuroPython 2008

Spotlight

• Desktop file search

• Mac OS X 10.4 and 10.5

• Deeply integrated in Mac OS X

• Index-based, with attributes

• Results based on relevance and recency

• Plug-ins/API for custom file formats

• GUI & command-line

Page 25: Accessing File-Specific Attributes on Steroids - EuroPython 2008

Spotlight Menu

Page 26: Accessing File-Specific Attributes on Steroids - EuroPython 2008

Spotlight Window

Page 27: Accessing File-Specific Attributes on Steroids - EuroPython 2008

Spotlight

$ mdfind europython | egrep "\.pdf$"

/Users/dinu/Desktop/EuroPython2008Timetable.pdf
/Users/dinu/Developer/Python/fileinfo/presentation/fileinfo-slides.pdf
/Data/Perso/CV/cv-dg.pdf
/Users/dinu/Library/Mail Downloads/cv-dg.pdf
/Users/dinu/Developer/Python/epc2008/badge_data.pdf
/Data/Docs/dev/The Python Papers/ThePythonPapersVolume2Issue4.pdf
/Data/Docs/dev/The Python Papers/ThePythonPapersVolume3Issue1.pdf
/Data/Docs/dev/The Python Papers/ThePythonPapersVolume2Issue3.pdf
/Data/Docs/dev/The Python Papers/The Python Papers Volume 2, Issue 2.pdf
/Data/Docs/dev/The Python Papers/The Python Papers Volume 2, Issue 1.pdf
/Users/dinu/Developer/Python/epc2008/badge_data-hpda.pdf
/Users/dinu/Developer/Python/epc2008/badge_data-hpda-sliced.pdf
/Data/Perso/Travel/Vilnius2008/EuroPython 2008 Invoice.pdf
/Users/dinu/Developer/Python/hipsterpda/output/badges.pdf
...
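The shell pipeline above could also be driven from Python, for example to feed the resulting paths into another tool. This is only a sketch: the helper names are hypothetical, `mdfind` exists on Mac OS X only, and the suffix filter is a plain replacement for the `egrep` step.

```python
import subprocess


def build_mdfind_command(query, onlyin=None):
    """Build the argument list for an mdfind call."""
    cmd = ["mdfind"]
    if onlyin is not None:
        cmd += ["-onlyin", onlyin]  # restrict the search to one directory
    cmd.append(query)
    return cmd


def filter_by_suffix(paths, suffix):
    """Keep paths ending in suffix (case-sensitive, like egrep "\\.pdf$")."""
    return [p for p in paths if p.endswith(suffix)]


def mdfind(query, onlyin=None):
    """Run mdfind and return one path per output line (Mac OS X only)."""
    result = subprocess.run(
        build_mdfind_command(query, onlyin),
        capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()


if __name__ == "__main__":
    for path in filter_by_suffix(mdfind("europython"), ".pdf"):
        print(path)
```

Splitting the output on line ends rather than on whitespace keeps paths containing spaces, such as "Mail Downloads", intact.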

Page 28: Accessing File-Specific Attributes on Steroids - EuroPython 2008

Spotlight – Pro

• Great index/search technology

• Very fast, useful and easy to use

• ~125 search attributes in Mac OS X 10.5 (e.g. Aperture, Composer, …)

• Extensible (Python plug-in available)

Page 29: Accessing File-Specific Attributes on Steroids - EuroPython 2008

Spotlight – Con

• Result on command-line not in table form

• Result in GUI is always a list of file names plus the attributes that the Finder (!) knows

• Weak on providing overview

• Mac OS X only

Page 30: Accessing File-Specific Attributes on Steroids - EuroPython 2008

Future

Page 31: Accessing File-Specific Attributes on Steroids - EuroPython 2008

Issues

• Testing, debugging & refactoring, …

• Better folder handling (e.g. OS X bundles)

• Attribute namespaces (pdf.npages)?

• Attribute parameters (nattr#h2)?

• Attribute null values (“n/a”)?

• Better dependency handling

Page 32: Accessing File-Specific Attributes on Steroids - EuroPython 2008

More Features?

• Output format plug-ins?

• Pylint plug-in for fileinfo?

• Fileinfo Python plug-in ⇒ pyinfo.py?

• Plug-ins for functions like total()

• Access intra-file dataset attributes?

• Multi-line attribute values?

• “Abbreviations” for attribute lists?

• Derived attributes (ncomments/loc)?

Page 33: Accessing File-Specific Attributes on Steroids - EuroPython 2008

Summary

• Useful as general-purpose attribute “browser”

• Access to Spotlight meta-data (Mac OS X)

• Easy to write plug-ins

• Fileinfo not like Spotlight (no index/search)

• More like iTunes (on the command-line ;-)

Page 35: Accessing File-Specific Attributes on Steroids - EuroPython 2008

Questions?