Pads/ML: A Functional Data Description Language

54
Pads/ML: Pads/ML: A Functional Data Description Language A Functional Data Description Language David Walker Princeton University with: Yitzhak Mandelbaum (Princeton), Kathleen Fisher and Mary Fernandez (AT&T)

description

Pads/ML: A Functional Data Description Language. David Walker Princeton University with: Yitzhak Mandelbaum (Princeton), Kathleen Fisher and Mary Fernandez (AT&T). Data, data everywhere!. Incredible amounts of data stored in well-behaved formats:. Databases:. Tools Schema Browsers - PowerPoint PPT Presentation

Transcript of Pads/ML: A Functional Data Description Language

Page 1: Pads/ML: A Functional Data Description Language

Pads/ML:Pads/ML:A Functional Data Description LanguageA Functional Data Description Language

David Walker

Princeton University

with: Yitzhak Mandelbaum (Princeton),

Kathleen Fisher and Mary Fernandez (AT&T)

Page 2: Pads/ML: A Functional Data Description Language

2

Data, data everywhere!Data, data everywhere!

Incredible amounts of data stored in well-behaved formats:

Tools• Schema• Browsers• Query languages• Standards• Libraries• Books, documentation• Conversion tools• Vendor support• Consultants…

Databases:

XML:

Page 3: Pads/ML: A Functional Data Description Language

3

Ad hoc DataAd hoc Data

• Vast amounts of data in ad hoc formats.• Ad hoc data is semi-structured:

– Not free text.– Not as rigid as data in relational databases.

• Examples from many different areas:– Physics– Computer system maintenance and administration– Biology– Finance– Government– Healthcare– More!

Page 4: Pads/ML: A Functional Data Description Language

4

Ad Hoc Data in BiologyAd Hoc Data in Biology

format-version: 1.0date: 11:11:2005 14:24auto-generated-by: DAG-Edit 1.419 rev 3default-namespace: gene_ontologysubsetdef: goslim_goa "GOA and proteome slim"

[Term]id: GO:0000001name: mitochondrion inheritancenamespace: biological_processdef: "The distribution of mitochondria\, including the mitochondrial genome\, into daughter cells after mitosis or meiosis\, mediated by interactions between mitochondria and the cytoskeleton." [PMID:10873824,PMID:11389764, SGD:mcc]is_a: GO:0048308 ! organelle inheritanceis_a: GO:0048311 ! mitochondrion distribution

www.geneontology.orgwww.geneontology.org

Page 5: Pads/ML: A Functional Data Description Language

5

Ad Hoc Data in Chemistry Ad Hoc Data in Chemistry

O=C([C@@H]2OC(C)=O)[C@@]3(C)[C@]([C@](CO4)(OC(C)=O)[C@H]4C[C@@H]3O)([H])[C@H](OC(C7=CC=CC=C7)=O)[C@@]1(O)[C@@](C)(C)C2=C(C)[C@@H](OC([C@H](O)[C@@H](NC(C6=CC=CC=C6)=O)C5=CC=CC=C5)=O)C1

O O

O

OH

AcO

H

O

O

O

HO

NH

O

O

OHO

Page 6: Pads/ML: A Functional Data Description Language

6

Ad Hoc Data in FinanceAd Hoc Data in Finance

HA00000000START OF TEST CYCLEaA00000001BXYZ U1AB0000040000100B0000004200HL00000002START OF OPEN INTERESTd 00000003FZYX G1AB0000030000300000HM00000004END OF OPEN INTERESTHE00000005START OF SUMMARYf 00000006NYZX B1QB00052000120000070000B000050000000520000 00490000005100+00000100B00000005300000052500000535000HF00000007END OF SUMMARY

www.opradata.comwww.opradata.com

Page 7: Pads/ML: A Functional Data Description Language

7

Ad Hoc Data from Web Server Logs Ad Hoc Data from Web Server Logs (CLF)(CLF)

207.136.97.49 - - [15/Oct/1997:18:46:51 -0700] "GET /tk/p.txt HTTP/1.0" 200 30tj62.aol.com - - [16/Oct/1997:14:32:22 -0700] "POST /scpt/[email protected]/confirm HTTP/1.0" 200 941234.200.68.71 - - [15/Oct/1997:18:53:33 -0700] "GET /tr/img/gift.gif HTTP/1.0” 200 409240.142.174.15 - - [15/Oct/1997:18:39:25 -0700] "GET /tr/img/wool.gif HTTP/1.0" 404 178188.168.121.58 - - [16/Oct/1997:12:59:35 -0700] "GET / HTTP/1.0" 200 3082214.201.210.19 ekf - [17/Oct/1997:10:08:23 -0700] "GET /img/new.gif HTTP/1.0" 304 -

Page 8: Pads/ML: A Functional Data Description Language

8

Ad Hoc Data: DNS packetsAd Hoc Data: DNS packets

00000000: 9192 d8fb 8480 0001 05d8 0000 0000 0872 ...............r00000010: 6573 6561 7263 6803 6174 7403 636f 6d00 esearch.att.com.00000020: 00fc 0001 c00c 0006 0001 0000 0e10 0027 ...............'00000030: 036e 7331 c00c 0a68 6f73 746d 6173 7465 .ns1...hostmaste00000040: 72c0 0c77 64e5 4900 000e 1000 0003 8400 r..wd.I.........00000050: 36ee 8000 000e 10c0 0c00 0f00 0100 000e 6...............00000060: 1000 0a00 0a05 6c69 6e75 78c0 0cc0 0c00 ......linux.....00000070: 0f00 0100 000e 1000 0c00 0a07 6d61 696c ............mail00000080: 6d61 6ec0 0cc0 0c00 0100 0100 000e 1000 man.............00000090: 0487 cf1a 16c0 0c00 0200 0100 000e 1000 ................000000a0: 0603 6e73 30c0 0cc0 0c00 0200 0100 000e ..ns0...........000000b0: 1000 02c0 2e03 5f67 63c0 0c00 2100 0100 ......_gc...!...000000c0: 0002 5800 1d00 0000 640c c404 7068 7973 ..X.....d...phys000000d0: 0872 6573 6561 7263 6803 6174 7403 636f .research.att.co

Page 9: Pads/ML: A Functional Data Description Language

9

Properties of Ad hoc DataProperties of Ad hoc Data

• Data arrives “as is” -- you don’t choose the format

• Documentation is often out-of-date or nonexistent.– Hijacked fields.

– Undocumented “missing value” representations.

• Data is buggy.– Missing data, “extra” data, …

– Human error, malfunctioning machines, software bugs (e.g. race conditions on log entries), …

– Errors are sometimes the most interesting portion of the data.

• Data sources often have high volume.– Data might not fit into main memory.

• Data can be created by malicious sources attempting to exploit software vulnerabilities

– c.f. Ethereal network monitoring system

Page 10: Pads/ML: A Functional Data Description Language

10

The Goal(s)The Goal(s)• What can we do about ad hoc data?

– how do we read it into programs?– how do we detect errors?– how do we correct errors?– how do we query it?– how do we discover its structure and properties? – how do we view it?– how do we transform it into standard formats like CSV, XML?– how do we merge multiple data sources?

• In short: how do we do all the things we take for granted when dealing with standard formats in a fault-tolerant and efficient, yet nearly effortless way?

Page 11: Pads/ML: A Functional Data Description Language

11

Enter PadsEnter Pads

• Pads: a system for Processing Ad hoc Data Sources

• Three main components:– a data description language

• for concise and precise specifications of ad hoc data formats and semantic properties

– a compiler that automatically generates a suite of programming libraries & end-to-end applications

– a visual interface to support both novice and expert users

Page 12: Pads/ML: A Functional Data Description Language

12

One Description, Many ToolsOne Description, Many Tools

Data Description(Type T)

compiler

queryengine

parser printervisual data

browserxml

translator...

programming library

complete application

Page 13: Pads/ML: A Functional Data Description Language

13

Some Advantages Over Some Advantages Over Ad Hoc MethodsAd Hoc Methods

• Big bang for buck: 1 description, many tools

• Descriptions document data sources

– the documentation IS the tool generator so documentation is automatically

kept up-to-date with implementation

• Descriptions are easy to write, easy to understand.

– descriptions are high-level & declarative

– description syntax exploits programmer intuition concerning types

• Tools are robust

– Error handling code generated automatically; doesn’t clutter documentation.

• Descriptions & generated tools can be analyzed and reasoned about

– eg: data size, tool termination & safety properties, coherence of generated

parsers & printers

Page 14: Pads/ML: A Functional Data Description Language

14

The PADS ProjectThe PADS Project

• PADS/C [PLDI 05; POPL 06]– Based on C type structure.– Generates C libraries.

• too bad C doesn’t actually support libraries ....

– LaunchPADS visual interface [Daly et al., SIGMOD 06]

• PADS/ML (Mandelbaum’s thesis)– Based on the ML type structure.

• polymorphic, dependent datatypes– Generates ML modules.

• better reuse & library structure • functional data processing = far

greater programmer productivity– New framework for tool development.

• Format-independent algorithms architected using functors vs macros

– Implementation status.• Version 1.0 up and running• Many more exciting things to do

• Describe real formats:– Newick tree-structured data– Reglens galaxy catalogues– Palm PDA databases– AT&T call-detail data– AT&T billing data– Web server logs– Gene ontologies– DNS packets– OPRA data– More …

Page 15: Pads/ML: A Functional Data Description Language

15

OutlineOutline

• Motivation and PADS Overview

• Data Description in PADS/ML

• Implementation architecture

• The Semantic of PADS

• Conclusions

Page 16: Pads/ML: A Functional Data Description Language

16

Base Types and RecordsBase Types and Records

• Base types: C (e).– Describe atomic portions of data.

– Parameterized by host-language expression.

– Examples:

• Pint, Pchar, Pstring_FW(n), Pstring(c).

• Tuples and Records: t * t’ and {x:t; y:t’}.– Record fields are dependent: field names can be referenced

by types of later fields.

– Example to follow.

Page 17: Pads/ML: A Functional Data Description Language

17

122Joe|Wright|45|95|79n/aEd|Wood|10|47|31124Chris|Nolan|80|93|85 Burton|30|82|71126George|Lucas|32|62|40

Base Types and RecordsBase Types and Records

Tim

Pint * Pstring(‘|’) * Pchar

125 |

Movie-director Bowling Score (MBS) Format

Page 18: Pads/ML: A Functional Data Description Language

18

13C Programming31Types and Programming Languages20Twenty Years of PLDI36Modern Compiler Implementation in ML 27Elements o f ML Programming

Base Types and RecordsBase Types and Records

13C Programming

{ width: ; title: Pstring_FW(width) }Pint

Bookshelf Listing (BL) Format

Page 19: Pads/ML: A Functional Data Description Language

19

ConstraintsConstraints

• Constrained types: [x:t | e] .– Enforce the constraint e on the underlying type t.

BurtonTim125 | 30|

[c:Pchar | c = ‘|’]

ptype Scores = { min:Pint; ‘|’; max: [m:Pint | min ≤ m]; ‘|’; avg: [a:Pint | min ≤ a & a ≤ max] }

82 71| |

Pchar ‘|’

Page 20: Pads/ML: A Functional Data Description Language

20

122Joe|Wright|45|95|79n/aEd|Wood|10|47|31124Chris|Nolan|80|93|85125Tim|Burton|30|82|71126George|Lucas|32|62|40

DatatypesDatatypes

• Describe alternatives in data source with datatypes.– Parser tries each alternative in order.

pdatatype Id = None of “n/a” | Some of Pint

n/aEd|Wood|10|47|31124Chris|Nolan|80|93|85

Page 21: Pads/ML: A Functional Data Description Language

21

Recursive Datatypes Recursive Datatypes

• Describe inductively-defined formats.

pdatatype IntList = Cons of Pint * ‘|’

| Last of Pint

* IntList

79| 4031|71|

Page 22: Pads/ML: A Functional Data Description Language

22

Polymorphic Types Polymorphic Types

• Parameterize types by other types.

pdatatype (Elt) List = Cons of Elt * ‘|’

| Last of Elt

* (Elt) List

pdatatype IntList = Cons of Pint * ‘|’

| Last of Pint

* IntList

ptype IntList = Pint List

ptype CharList = Pchar List

Page 23: Pads/ML: A Functional Data Description Language

23

Dependent Types Dependent Types

• Parameterize types by values.

pdatatype IntList = Cons of Pint * ‘|’ * IntList

| Nil of Pint

ptype IntListBar = Pint List(‘|’)

ptype CharListComma = Pchar List (‘,’)

pdatatype (Elt) List (x:char) = Cons of Elt * x * (Elt) List(x)

| Nil of Elt

Page 24: Pads/ML: A Functional Data Description Language

24

More Dependent Types More Dependent Types

pdatatype GuidedOption (tag: int) =

pmatch tag of

0 => Zero of Pstring

| 1 => One of Pint

| 2 => Two of Pint * Pint

| _ => None

ptype source = {tag: Pint; payload: GuidedOption (tag)}

• “Switched” datatypes:

Page 25: Pads/ML: A Functional Data Description Language

ptype Timestamp = Ptimestamp_explicit_FW(8, "%H:%M:%S", gmt)

ptype Pip = Puint8 * ’.’ * Puint8 * ’.’ * Puint8 * ’.’ * Puint8

ptype (Alpha) Pnvp(p : string -> bool) = { name : [name : Pstring(’=’) | p name]; ’=’; value : Alpha }

ptype (Alpha) Nvp(name:string) = Alpha Pnvp(fun s -> s = name)

ptype SVString = Pstring_SE("/;|\\|/")

ptype Nvp_a = SVString Pnvp(fun _ -> true)

ptype Details = { source : Pip Nvp("src_addr");’;’; dest : Pip Nvp("dest_addr");’;’; start_time : Timestamp Nvp("start_time");’;’; end_time : Timestamp Nvp("end_time");’;’; cycle_time : Puint32 Nvp("cycle_time")}

ptype Semicolon = Pcharlit(’;’)ptype Vbar = Pcharlit(’|’)

pdatatype Info(alarm_code : int) = Pmatch alarm_code with 5074 -> Details of Details | _ -> Generic of (Nvp_a,Semicolon,Vbar) Plist

pdatatype Service = Dom of "DOMESTIC" | Int of "INTERNATIONAL" | Spec of "SPECIAL"

ptype Raw_alarm = { alarm : [ i : Puint32 | i = 2 or i = 3];’:’; start : Timestamp Popt;’|’; clear : Timestamp Popt;’|’; code : Puint32;’|’; src_dns : SVString Nvp("dns1");’;’; dest_dns : SVString Nvp("dns2");’|’; info : Info(code);’|’; service : Service}

let checkCorr ra = ...

ptype Alarm = [x:Raw_alarm | checkCorr x]ptype Source = (Alarm,Peor,Peof) Plist

2:3004092508||5001|dns1=abc.com;dns2=xyz.com|c=slow link;w=lost packets|INTERNATIONAL3:|3004097201|5074|dns1=bob.com;dns2=alice.com|src_addr=192.168.0.10;dst_addr=192.168.23.10;start_time=1234567890;end_time=1234568000;cycle_time=17412|SPECIAL

Sample Regulus Data:

PADS/ML Regulus Format:

Page 26: Pads/ML: A Functional Data Description Language

26

OutlineOutline

• Motivation and PADS Overview

• Data Description in PADS/ML

• Implementation architecture

• The Semantic of PADS

• Conclusions

Page 27: Pads/ML: A Functional Data Description Language

27

Parsing With PADSParsing With PADS

data description(type T)

0100100100111 user code

data rep(type ~ T)

parse descriptor(type ~ T)

compiler

parser

Page 28: Pads/ML: A Functional Data Description Language

28

Example: MBS Example: MBS Representation Representation

n/aEd|Wood|10|47|31

ptype MBS-Entry = { id: Id; first: Pstring(‘|’); ‘|’; last: Pstring(‘|’); ‘|’; scores: Scores }

pdatatype Id = None of “n/a” | Some of Pint

type MBS-Entry = { id: Id; first: string; last: string; scores: Scores }

datatype Id = None | Some of int

Page 29: Pads/ML: A Functional Data Description Language

29

Tool Generation With PADS/MLTool Generation With PADS/ML

data description(type T)

0100100100111

format-specific traversalfunctor

data rep(type ~ T)

parse descriptor(type ~ T)

compiler

parser

format-independent

toolmodule

tools in this pattern:accumulator, debugger, histograms, clusters, format converters

Page 30: Pads/ML: A Functional Data Description Language

30

Types as ModulesTypes as Modules

• PADS/ML generates a module for each type/description

• Parameterized types ==> Functors• Recursive types ==> Recursive modules

– sigh: combination of recursive modules & functors not supported in O’Caml, so we’re reduced to a bit of a hack for recursion

sig

type rep

type pd

fun parser : Pads.handle -> rep * pd

module Traverse (tool : TOOL) : sig ... end

end

Page 31: Pads/ML: A Functional Data Description Language

31

OutlineOutline

• Motivation and PADS Overview

• Data Description in PADS/ML

• Implementation architecture

• The Semantic of PADS

• Conclusions

Page 32: Pads/ML: A Functional Data Description Language

32

MotivationMotivation

• To crystallize design principles.

– Example: error counting methodology in PADS/C.

• To ensure system correctness.

– Example: parsers return data of expected type.

• As basis for evolution and experimentation.

– Critical to design of PADS/ML.

• To communicate core ideas.

– Designing the next 700 data description languages.

Page 33: Pads/ML: A Functional Data Description Language

33

PADS and DDCPADS and DDC

• Developed semantic framework based on Data

Description Calculus (DDC).

• Explains PADS/ML and other languages with DDC.

• Give denotational semantics to DDC.

PADS/CPADS/CPADS/MLPADS/ML

The Next 700 The Next 700

DDC

Page 34: Pads/ML: A Functional Data Description Language

34

Data Description CalculusData Description Calculus

• DDC: calculus of dependent types for describing data.

• Expressions e with type drawn from F-omega• A kinding judgment specifies well-formed descriptions.

t ::= unit | bottom | C(e)

| x:t.t | t + t | t & t | {x:t | e}

| t seq(t,e,t) | x.e | t e

| .t | t t | | .t

| compute (e:) | absorb(t) |

scan(t)

Page 35: Pads/ML: A Functional Data Description Language

35

Choosing a SemanticsChoosing a Semantics

• Semantics of REs, CFGs given as sets of strings but

fails to account for:

– Relationship between internal and external data.

– Error handling.

– Types of representation and parse descriptor.

• DDC

– Denotational semantics of types as parsers in F-omega

Page 36: Pads/ML: A Functional Data Description Language

36

A 3-Fold SemanticsA 3-Fold Semantics

t

0100100100...

Parser

Representation

Parse Descriptor

Description

[ t ][ t ]rep

[ t ]pd

Interpretations of t[ {x:t | e} ]rep = [ t ]rep + [ t ]rep

[ x:t.t’ ]rep = [ t ]rep * [ t’ ]rep

[ {x:t | e} ]pd = hdr * [ t ]pd

[ x:t.t’ ]pd = hdr * [ t ]pd * [ t’ ]pd

Page 37: Pads/ML: A Functional Data Description Language

37

Type CorrectnessType Correctness

t

0100100100...

Parser

Representation

Parse Descriptor

DescriptionTheorem [ t ] : bits [ t ]rep * [ t ]pd [ t ]

[ t ]rep

[ t ]pd

Interpretations of t

Page 38: Pads/ML: A Functional Data Description Language

38

OutlineOutline

• Motivation and PADS Overview

• Data Description in PADS/ML

• Implementation architecture

• The Semantic of PADS

• Conclusions

Page 39: Pads/ML: A Functional Data Description Language

39

Related WorkRelated Work

• parser generator technology:– Lex & Yacc

• no dependency

• semantic actions entwined with data description

• no higher-level tools

– Parser combinators

• semantic actions entwined with data description

• no higher-level tools

Page 40: Pads/ML: A Functional Data Description Language

40

Reminder:Reminder:One Description, Many ToolsOne Description, Many Tools

Data Description(Type T)

compiler

queryengine

parser printervisual data

browserxml

translator...

programming library

complete application

Page 41: Pads/ML: A Functional Data Description Language

41

Parser combinators:Parser combinators:One algorithm, One ToolOne algorithm, One Tool

parser

Page 42: Pads/ML: A Functional Data Description Language

42

Related WorkRelated Work

• Other “data description” languages– Data Format Description Language (DFDL)

– Binary Format Description Language (BFD)

– PacketTypes [SIGCOMM ’00]

– DataScript [GPCE ’02]

• None have a well-defined semantics or Pads tool

support

Page 43: Pads/ML: A Functional Data Description Language

43

Current & Future WorkCurrent & Future Work

• Tools and Applications– Description inference.

– Support for specific domains (microbiology)

• Language Design– Transformation language for ad hoc data.

– Description language for distributed

• Describe locations, versions, timing, relationships, etc.

• Theory– Analyze data descriptions for interesting properties, e.g.

equivalence, data size, termination, emptiness (always fails).

– Coherence of parsing & printing

Page 44: Pads/ML: A Functional Data Description Language

44

SummarySummary

• The PADS vision: reliable, efficient and effortless ad

hoc data processing

• PADS/ML:– Data description based on polymorphic, dependent

datatypes

– “Types as modules” implementation

– Solid theoretical basis.

• Visit www.padsproj.org

Page 45: Pads/ML: A Functional Data Description Language

45

The EndThe End

Questions?

Page 46: Pads/ML: A Functional Data Description Language

49

Existing ApproachesExisting Approaches

• C, Perl, or shell scripts: most popular.– Time consuming & error prone to hand code parsers.

– Difficult to maintain (worse than the ad hoc data itself in some cases!).

– Often incomplete, particularly with respect to errors.

• Error code, if written, swamps main-line computation.

• If not written, errors can corrupt “good” data.

• Lex & Yacc

– Good match for programming languages.

– Bad match for ad hoc data.

• Compiler converts descriptions into robust, format-specific

tools.

Page 47: Pads/ML: A Functional Data Description Language

50

Parsing With PADSParsing With PADS

• Robust parser at the core of generated tools.

Page 48: Pads/ML: A Functional Data Description Language

51

Using Ad hoc DataUsing Ad hoc Data

• Parsing only brings you part way.– Queries must be written in ML.– A lot of work.

• What about a declarative query?

122Joe|Wright|45|95|79

124Chris|Nolan|80|93|85125Tim|Burton|30|82|71126George|Lucas|32|62|40

• Can Ed Wood bowl?

n/aEd|Wood|10|47|31

Page 49: Pads/ML: A Functional Data Description Language

52

From Ad hoc Data To XMLFrom Ad hoc Data To XML

• XML– Encoding for semi-structured data.

– Good match!

• XQuery– Declarative XML query language for semi-structured

sources.

– Standardized by W3C, many implementations.

Page 50: Pads/ML: A Functional Data Description Language

53

PADX = PADS + XQueryPADX = PADS + XQuery

• Galax [Fernandez, et al.]

– Complete, open-source XQuery implementation.

• PADX

– Integrates PADS and Galax.

– Supports declarative queries over ad hoc data sources.

Page 51: Pads/ML: A Functional Data Description Language

54

Using PADXUsing PADX

• User describes format in PADS.

• PADX provides

– XML “view” of data in XML Schema.

– Customized XQuery engine.

• Query PADS-specific and other XML sources.

• User provides

– Ad hoc data

– Queries expressed in XQuery.

Page 52: Pads/ML: A Functional Data Description Language

55

Describing MBS Format Describing MBS Format

• Example Movie-director Bowling Score data

• PADS/ML Description

n/aEd|Wood|10|47|31

ptype MBS-Entry = { id: Id; first: Pstring(‘|’); ‘|’; last: Pstring(‘|’); ‘|’; scores: Scores }

Page 53: Pads/ML: A Functional Data Description Language

56

Viewing and Querying MBSViewing and Querying MBS• Virtual XML view

• Query: What is Ed Wood’s maximum score?

$pads/Psource/MBS-Entry[first = “Ed”][last = “Wood”]/scores/max

• Query: Which directors have scored less than 50?

$pads/Psource/MBS-Entry[scores/min < 50]

<MBS-Entry> <id><None>n/a</None></id> <first>Ed</first> <last>Wood</last> <scores> <min>10</min> <max>47</max> <avg>31</avg> <scores></MBS-Entry>

ptype MBS-Entry = { id: Id; first: Pstring(‘|’); ‘|’; last: Pstring(‘|’); ‘|’; scores: Scores }

n/aEd|Wood|10|47|31

Page 54: Pads/ML: A Functional Data Description Language

57

Challenges & SolutionsChallenges & Solutions

• Semantics

– Map PADS language to XML Schema.

• Re-engineer Galax Data Model

– Create abstract data model.

– Generate description-specific concrete data models.

• Efficiently query large-scale data sources.

– Provide lazy access to data.

– Implement custom memory-management.