The Next 700 Data Description Languages Yitzhak Mandelbaum Princeton University Computer Science...
The Next 700 Data Description Languages
Yitzhak Mandelbaum, Princeton University, Computer Science
Collaborators: Kathleen Fisher and David Walker
“The Next 700 …”
Program(s)       Programming Language(s)        PL Semantics
Data Format(s)   Data Description Language(s)   DDL Semantics
What Data Needs Describing?
• There's much data in databases and common formats like XML; there’s much data that’s ad hoc.
• Ad hoc data lacks readily available parsing, querying, analysis or transformation tools.
• It’s all over the place: financial, telecomm, chemistry, physics, biology, etc.
Ad Hoc Data in Biology
!autogenerated-by: DAG-Edit version 1.419 rev 3
!saved-by: gocvs
!date: Fri Mar 18 21:00:28 PST 2005
!version: $Revision: 3.223 $
!type: % is_a is a
!type: < part_of part of
!type: ^ inverse_of inverse of
!type: | disjoint_from disjoint from
$Gene_Ontology ; GO:0003673
<biological_process ; GO:0008150
%behavior ; GO:0007610 ; synonym:behaviour
%adult behavior ; GO:0030534 ; synonym:adult behaviour
%adult feeding behavior ; GO:0008343 ; synonym:adult feeding behaviour % feeding behavior ; GO:0007631
%adult locomotory behavior ; GO:0008344 ;
...
from www.geneontology.org
Ad Hoc Data in Chemistry
O=C([C@@H]2OC(C)=O)[C@@]3(C)[C@]([C@](CO4)(OC(C)=O)[C@H]4C[C@@H]3O)([H])[C@H](OC(C7=CC=CC=C7)=O)[C@@]1(O)[C@@](C)(C)C2=C(C)[C@@H](OC([C@H](O)[C@@H](NC(C6=CC=CC=C6)=O)C5=CC=CC=C5)=O)C1
[Figure: structure diagram of the molecule described by the string above]
Ad Hoc Data from Web Server Logs (CLF)
207.136.97.49 - - [15/Oct/1997:18:46:51 -0700] "GET /tk/p.txt HTTP/1.0" 200 30
tj62.aol.com - - [16/Oct/1997:14:32:22 -0700] "POST /scpt/[email protected]/confirm HTTP/1.0" 200 941
Ad Hoc Data: DNS packets
00000000: 9192 d8fb 8480 0001 05d8 0000 0000 0872 ...............r
00000010: 6573 6561 7263 6803 6174 7403 636f 6d00 esearch.att.com.
00000020: 00fc 0001 c00c 0006 0001 0000 0e10 0027 ...............'
00000030: 036e 7331 c00c 0a68 6f73 746d 6173 7465 .ns1...hostmaste
00000040: 72c0 0c77 64e5 4900 000e 1000 0003 8400 r..wd.I.........
00000050: 36ee 8000 000e 10c0 0c00 0f00 0100 000e 6...............
00000060: 1000 0a00 0a05 6c69 6e75 78c0 0cc0 0c00 ......linux.....
00000070: 0f00 0100 000e 1000 0c00 0a07 6d61 696c ............mail
00000080: 6d61 6ec0 0cc0 0c00 0100 0100 000e 1000 man.............
00000090: 0487 cf1a 16c0 0c00 0200 0100 000e 1000 ................
000000a0: 0603 6e73 30c0 0cc0 0c00 0200 0100 000e ..ns0...........
000000b0: 1000 02c0 2e03 5f67 63c0 0c00 2100 0100 ......_gc...!...
000000c0: 0002 5800 1d00 0000 640c c404 7068 7973 ..X.....d...phys
000000d0: 0872 6573 6561 7263 6803 6174 7403 636f .research.att.co
Data Description Languages
• Data description languages describe many ad hoc formats and provide the following features:
– Descriptions serve as documentation, including the semantics of the data.
– A compiler generates tools from a description: parser, printer, query engine, converter to XML, statistical profiler, etc.
– The parser includes robust error detection and recovery.
– Parsers can handle high data volume.
  • > 1GB/second Netflow traffic from Cisco routers.
Many Data Description Languages
• Logical Descriptions
– ASN.1
– ASDL
• Physical Descriptions
– PacketTypes (SIGCOMM ‘00)
– DataScript (GPCE ‘02)
– PADS (PLDI ‘05)
  • Basis for current work
[Figure: bit string 001010101101001001111001010001010100 with the labels “Logical” and “Physical”]
Contributions
• A core data description calculus (DDC)
– Based on dependent type theory
– Simple, orthogonal, composable types
– Types are transducers from external data source to internal data representation.
• Encodings of high-level DDLs in low-level DDC
– Explain semantics of PADS language in particular.
[Figure: PacketTypes, PADS and DataScript, each encoded in DDC]
Base Types and Sequences
• C(e): base type can be parameterized by expression e.
• Σx:T.T’: dependent product describes sequence of values.
– Variable x gives name to first value in sequence.
• Examples (data, description, parsed value):
  “123hello|”   int * string(‘|’) * char            (123, “hello”, ‘|’)
  “3513”        Σwidth:int_fw(1). int_fw(width)      (3, 513)
  “:hello:”     Σterm:char. string(term) * char      (‘:’, “hello”, ‘:’)
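The dependency in a sequence can be sketched with parser combinators. The following is a minimal illustration in Python, not the DDC semantics itself: the names `Parser`, `int_fw`, `string`, `char` and `dep_pair` are hypothetical, a parser is assumed to be a function from (data, offset) to (new offset, value), and failures are reported with exceptions rather than parse descriptors.

```python
from typing import Any, Callable, Tuple

# Assumption: a parser maps (data, offset) to (new offset, parsed value).
Parser = Callable[[str, int], Tuple[int, Any]]


def int_fw(width: int) -> Parser:
    """Base type int_fw(width): an integer of exactly `width` ASCII digits."""
    def parse(data: str, off: int) -> Tuple[int, Any]:
        text = data[off:off + width]
        if len(text) != width or not (text.isascii() and text.isdigit()):
            raise ValueError(f"expected {width} digits at offset {off}")
        return off + width, int(text)
    return parse


def char(data: str, off: int) -> Tuple[int, Any]:
    """Base type char: any single character."""
    if off >= len(data):
        raise ValueError("unexpected end of input")
    return off + 1, data[off]


def string(term: str) -> Parser:
    """Base type string(term): characters up to, but excluding, `term`."""
    def parse(data: str, off: int) -> Tuple[int, Any]:
        end = data.find(term, off)
        if end < 0:
            raise ValueError(f"terminator {term!r} not found")
        return end, data[off:end]
    return parse


def dep_pair(first: Parser, rest: Callable[[Any], Parser]) -> Parser:
    """Dependent pair: the second parser is built from the first value."""
    def parse(data: str, off: int) -> Tuple[int, Any]:
        off1, x = first(data, off)
        off2, y = rest(x)(data, off1)
        return off2, (x, y)
    return parse


# width:int_fw(1). int_fw(width)
sized_int = dep_pair(int_fw(1), lambda width: int_fw(width))

# term:char. string(term) * char  (a plain product ignores its binder)
delimited = dep_pair(char, lambda term: dep_pair(string(term), lambda _: char))

print(sized_int("3513", 0))     # (4, (3, 513))
print(delimited(":hello:", 0))  # (7, (':', ('hello', ':')))
```

In this sketch pairs nest to the right, so the second result is (‘:’, (“hello”, ‘:’)) rather than the flat triple shown on the slide.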
Constraints
• {x:T | e}: set types allow you to constrain the type T and express relationships between elements of the data.
• Examples (data, description, parsed value):
  ‘a’            {c:char | c = ‘a’} (abbrev: Sc(‘a’))     inl ‘a’
  “101”, “82”    {x:int | x > 100}                        inl 101, inr error(82)
  “43|105|67”    Σmin:int. Sc(‘|’) * Σmax:{m:int | min ≤ m}. Sc(‘|’) * {avg:int | min ≤ avg & avg ≤ max}
                 (43, inl ‘|’, inl 105, inl ‘|’, inl 67)
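A set type can be read as "parse the underlying type, then tag the value according to the constraint". The sketch below is a minimal Python illustration under that reading; `integer` and `constrained` are hypothetical names, and the string tags "inl" and "inr error" merely mimic the slide's notation.

```python
from typing import Any, Callable, Tuple

# Assumption: a parser maps (data, offset) to (new offset, parsed value).
Parser = Callable[[str, int], Tuple[int, Any]]


def integer(data: str, off: int) -> Tuple[int, Any]:
    """Base type int: a non-empty run of ASCII digits."""
    end = off
    while end < len(data) and data[end] in "0123456789":
        end += 1
    if end == off:
        raise ValueError(f"expected a digit at offset {off}")
    return end, int(data[off:end])


def constrained(base: Parser, pred: Callable[[Any], bool]) -> Parser:
    """Set type {x:T | e}: parse T, then tag the value by whether e holds."""
    def parse(data: str, off: int) -> Tuple[int, Any]:
        new_off, x = base(data, off)
        # The value is kept even when the constraint fails, so the
        # caller can still inspect the offending data.
        tag = "inl" if pred(x) else "inr error"
        return new_off, (tag, x)
    return parse


# {x:int | x > 100}
over_100 = constrained(integer, lambda x: x > 100)

print(over_100("101", 0))  # (3, ('inl', 101))
print(over_100("82", 0))   # (2, ('inr error', 82))
```

Note that a violated constraint does not stop the parse here: the input is still consumed and the value is returned with an error tag, which matches the slide's inr error(82) result.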
Unions and the Empty String
• true: matches the empty string.
• T + T’: deterministic, exclusive or: try T; on failure, try T’.
• Examples (data, description, parsed value):
  “54”, “n/a”    int + Ss(“n/a”)    inl 54, inr “n/a”
  “2341”, “”     int + true         inl 2341, inr ()
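The "try T; on failure, try T’" reading of unions can be sketched as an ordered choice. This Python fragment is an illustration only: `integer`, `literal`, `true` and `union` are hypothetical names, and failure is modelled with `ValueError` instead of a parse descriptor.

```python
from typing import Any, Callable, Tuple

# Assumption: a parser maps (data, offset) to (new offset, parsed value).
Parser = Callable[[str, int], Tuple[int, Any]]


def integer(data: str, off: int) -> Tuple[int, Any]:
    """Base type int: a non-empty run of ASCII digits."""
    end = off
    while end < len(data) and data[end] in "0123456789":
        end += 1
    if end == off:
        raise ValueError(f"expected a digit at offset {off}")
    return end, int(data[off:end])


def literal(text: str) -> Parser:
    """Stand-in for Ss(text): match exactly the string `text`."""
    def parse(data: str, off: int) -> Tuple[int, Any]:
        if not data.startswith(text, off):
            raise ValueError(f"expected {text!r} at offset {off}")
        return off + len(text), text
    return parse


def true(data: str, off: int) -> Tuple[int, Any]:
    """true: match the empty string, consuming nothing."""
    return off, ()


def union(left: Parser, right: Parser) -> Parser:
    """T + T': try `left`; only if it fails, try `right` at the same offset."""
    def parse(data: str, off: int) -> Tuple[int, Any]:
        try:
            new_off, value = left(data, off)
            return new_off, ("inl", value)
        except ValueError:
            new_off, value = right(data, off)
            return new_off, ("inr", value)
    return parse


int_or_na = union(integer, literal("n/a"))   # int + Ss("n/a")
optional_int = union(integer, true)          # int + true

print(int_or_na("54", 0), int_or_na("n/a", 0))
print(optional_int("2341", 0), optional_int("", 0))
```

Because the choice is ordered, the right branch is never consulted when the left one succeeds, which is what makes the union deterministic.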
Array Features
• What features do we need to handle data sequences?
– Elements
– Separator between elements
– Termination condition (“are we done yet?”)
– Terminator after sequence
• Examples:
  “192.168.1.1”
  “Bill|Cathy|Jane|Bob;”
False and Arrays
• T seq(Ts; e, Tt) specifies:
– Element type T.
– Separator type Ts.
– Termination condition e.
– Terminator type Tt.
• false: reads nothing, flagging an error.
• Example: IP address (data, description, parsed value):
  “192.168.1.1”    int seq(Sc(‘.’); len 4, false)    [192,168,1,1]
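The four ingredients of a sequence type can be sketched as a single loop. The Python below is an illustration under stated assumptions, not the calculus's definition: all names are hypothetical, the termination condition is a predicate on the elements read so far, and the terminator is only looked at, not consumed (an assumption made here so that a following type can still read it).

```python
from typing import Any, Callable, List, Tuple

# Assumption: a parser maps (data, offset) to (new offset, parsed value).
Parser = Callable[[str, int], Tuple[int, Any]]


def integer(data: str, off: int) -> Tuple[int, Any]:
    """Base type int: a non-empty run of ASCII digits."""
    end = off
    while end < len(data) and data[end] in "0123456789":
        end += 1
    if end == off:
        raise ValueError(f"expected a digit at offset {off}")
    return end, int(data[off:end])


def word(data: str, off: int) -> Tuple[int, Any]:
    """Hypothetical element type: a non-empty run of letters."""
    end = off
    while end < len(data) and data[end].isalpha():
        end += 1
    if end == off:
        raise ValueError(f"expected a letter at offset {off}")
    return end, data[off:end]


def literal(text: str) -> Parser:
    """Match exactly the string `text`."""
    def parse(data: str, off: int) -> Tuple[int, Any]:
        if not data.startswith(text, off):
            raise ValueError(f"expected {text!r} at offset {off}")
        return off + len(text), text
    return parse


def false(data: str, off: int) -> Tuple[int, Any]:
    """false: reads nothing and always signals an error."""
    raise ValueError("false never matches")


def matches(p: Parser, data: str, off: int) -> bool:
    """Lookahead: does `p` succeed at `off`? Nothing is consumed."""
    try:
        p(data, off)
        return True
    except ValueError:
        return False


def seq(elem: Parser, sep: Parser,
        done: Callable[[List[Any]], bool], term: Parser) -> Parser:
    """T seq(Ts; e, Tt): elements separated by Ts until e holds or Tt is next."""
    def parse(data: str, off: int) -> Tuple[int, Any]:
        items: List[Any] = []
        while not done(items) and not matches(term, data, off):
            if items:  # a separator precedes every element but the first
                off, _ = sep(data, off)
            off, value = elem(data, off)
            items.append(value)
        return off, items
    return parse


# int seq(Sc('.'); len 4, false)
ip_addr = seq(integer, literal("."), lambda xs: len(xs) == 4, false)
# Names separated by '|' and ended by the terminator ';'
names = seq(word, literal("|"), lambda xs: False, literal(";"))

print(ip_addr("192.168.1.1", 0))         # (11, [192, 168, 1, 1])
print(names("Bill|Cathy|Jane|Bob;", 0))  # (19, ['Bill', 'Cathy', 'Jane', 'Bob'])
```

With `false` as the terminator the lookahead never succeeds, so the IP address sequence can end only through its length condition.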
Abstraction and Application
• Can parameterize types over values: λx.T
• Correspondingly, can apply types to values: T e
• Example: IP address with terminator (data, description, parsed value):
  none          λterm. int seq(Sc(‘.’); len 4, Sc(term))    none
  “1.2.3.4|”    IP_addr ‘|’ * Sc(‘|’)                       ([1,2,3,4], inl ‘|’)
Absorb, Compute and Scan
• Absorb, Compute and Scan are active types.
– absorb(T): consume data from source; produce nothing.
– compute(e:σ): consume nothing; output result of computation e.
– scan(T): scan data source for type T.
• Examples (data, description, parsed value):
  “|”          absorb(Sc(‘|’))                                                  ()
  “10|12”      Σwidth:int. Sc(‘|’) * Σlength:int. area:compute(width × length:int)    (10,12,120)
  “^%$!&_|”    scan(Sc(‘|’))                                                    (6, inl ‘|’)
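The three active types can be sketched as small combinators. This is a hedged Python illustration with hypothetical names (`absorb`, `compute`, `scan`, `area_record`); in `area_record` the separator is absorbed, an assumption made so the result is the same triple as on the slide.

```python
from typing import Any, Callable, Tuple

# Assumption: a parser maps (data, offset) to (new offset, parsed value).
Parser = Callable[[str, int], Tuple[int, Any]]


def integer(data: str, off: int) -> Tuple[int, Any]:
    """Base type int: a non-empty run of ASCII digits."""
    end = off
    while end < len(data) and data[end] in "0123456789":
        end += 1
    if end == off:
        raise ValueError(f"expected a digit at offset {off}")
    return end, int(data[off:end])


def literal(text: str) -> Parser:
    """Match exactly the string `text`."""
    def parse(data: str, off: int) -> Tuple[int, Any]:
        if not data.startswith(text, off):
            raise ValueError(f"expected {text!r} at offset {off}")
        return off + len(text), text
    return parse


def absorb(p: Parser) -> Parser:
    """absorb(T): consume what T consumes, but produce only unit."""
    def parse(data: str, off: int) -> Tuple[int, Any]:
        new_off, _ = p(data, off)
        return new_off, ()
    return parse


def compute(fn: Callable[[], Any]) -> Parser:
    """compute(e): consume nothing; the value is the result of e."""
    def parse(data: str, off: int) -> Tuple[int, Any]:
        return off, fn()
    return parse


def scan(p: Parser) -> Parser:
    """scan(T): skip forward until T matches; report how much was skipped."""
    def parse(data: str, off: int) -> Tuple[int, Any]:
        for start in range(off, len(data) + 1):
            try:
                new_off, value = p(data, start)
            except ValueError:
                continue
            return new_off, (start - off, value)
        raise ValueError("scan reached end of input without a match")
    return parse


def area_record(data: str, off: int) -> Tuple[int, Any]:
    """width '|' length, followed by a computed (not parsed) area field."""
    off, width = integer(data, off)
    off, _ = absorb(literal("|"))(data, off)
    off, length = integer(data, off)
    off, area = compute(lambda: width * length)(data, off)
    return off, (width, length, area)


print(absorb(literal("|"))("|", 0))     # (1, ())
print(area_record("10|12", 0))          # (5, (10, 12, 120))
print(scan(literal("|"))("^%$!&_|", 0)) # (7, (6, '|'))
```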
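Type Kinding
• Kinding ensures types are well formed.

  Γ |- T : s → k     Γ |- e : s
  ------------------------------
          Γ |- T e : k

  Γ |- T : type     Γ |- T’ : type
  ---------------------------------
        Γ |- T + T’ : type

  Γ |- T : type     Γ,x:s |- e : bool     (s = …)
  ------------------------------------------------
             Γ |- {x:T | e} : type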
Parsing Semantics of Types
• Semantics expressed as parsing functions written in the polymorphic λ-calculus.
– Sem(T) : DDC Type → Function
– Input data and offset, output new offset, value and parse descriptor.
– For specifics, see upcoming technical report.
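The shape of such a parsing function (data and offset in; new offset, value and parse descriptor out) can be sketched as follows. This is an assumption-laden Python illustration, not the semantics from the technical report: the `PD` record, its fields and the `NOVAL` marker are hypothetical simplifications.

```python
from dataclasses import dataclass
from typing import Any, Tuple

# Hypothetical stand-in for the `noval` representation of a failed base type.
NOVAL = None


@dataclass
class PD:
    """A much simplified parse descriptor: error count plus a message."""
    nerr: int = 0
    msg: str = ""


def parse_int(data: bytes, off: int) -> Tuple[int, Any, PD]:
    """Parsing function for int: (data, offset) -> (offset, value, descriptor)."""
    end = off
    while end < len(data) and 48 <= data[end] <= 57:  # ASCII '0'..'9'
        end += 1
    if end == off:
        # Errors are recorded in the descriptor instead of being raised,
        # so the caller can continue and report them later.
        return off, NOVAL, PD(1, "expected digit")
    return end, int(data[off:end]), PD()


print(parse_int(b"42|", 0))  # (2, 42, PD(nerr=0, msg=''))
print(parse_int(b"x", 0))    # (0, None, PD(nerr=1, msg='expected digit'))
```

Returning the descriptor alongside the value, rather than raising, is what lets a generated parser keep going after bad data and still say exactly where the errors were.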
Types of Parser Output
• Parsers produce values with the following types in the host language:
                   DDC                       Host Language
  Base types       [C(e)]rep                 I(C) + noval
                   [true]rep                 unit
  Products         [Σx:T.T’]rep              [T]rep * [T’]rep
  Abs. and App.    [λx.T]rep, [T e]rep       [T]rep
  Union            [T + T’]rep               [T]rep + [T’]rep + noval
  Set types        [{x:T | e}]rep            [T]rep + ([T]rep error)
• Annotations:
– noval: unrecoverable error
– error: semantic error
– dependency erased: the representation types no longer mention x.
Properties of the Calculus
• Theorem: If |- T : k then
– [T] = F (well-formed types yield parsers)
– |- F : bits * offset → offset * [T]rep * [T]pd (a T-parser returns values with types that correspond to T)
• Theorem: Parsers report errors accurately.
– Errors in the parse descriptor correspond to actual errors in the data.
– Parsers check all semantic constraints.
– More …
Making Use of the Calculus
[Figure: IPADS is encoded in DDC by the judgment |- t ⇒ T (IPADS type t translates to DDC type T)]
IPADS
t ::= C(e) | Pfun(x:s) = t | t e | Pstruct{fields} | Punion{fields}
    | Pswitch e of {alts tdef;} | Popt t | t Pwhere x.e | Palt{fields}
    | t [t; e,t] | Pcompute e | Plit c
fields ::= | fields x : t;
alts ::= | alts e => t;
Example: Popt and Plit

  |- t ⇒ T
  ----------------------
  |- Popt t ⇒ T + true

  |- c : char
  --------------------------------------------
  |- Plit c ⇒ scan(absorb({x:char | x = c}))

DDC types used: true, T1 + T2, C(e), {x:T | e}, absorb(T), scan(T)
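The two encodings can be sketched as a small translation function over syntax trees. The Python below is illustrative only: every class name (`Base`, `TrueT`, `Sum`, `SetEq`, `Absorb`, `Scan`, `IBase`, `Popt`, `Plit`) and the function `translate` are hypothetical, and only the constructors needed for these two rules are modelled.

```python
from dataclasses import dataclass
from typing import Union


# --- DDC types (only the constructors these two rules need) ---
@dataclass(frozen=True)
class Base:       # C(e), e.g. Base("char")
    name: str


@dataclass(frozen=True)
class TrueT:      # true
    pass


@dataclass(frozen=True)
class Sum:        # T + T'
    left: "DDC"
    right: "DDC"


@dataclass(frozen=True)
class SetEq:      # {x:T | x = c}
    base: "DDC"
    const: str


@dataclass(frozen=True)
class Absorb:     # absorb(T)
    inner: "DDC"


@dataclass(frozen=True)
class Scan:       # scan(T)
    inner: "DDC"


DDC = Union[Base, TrueT, Sum, SetEq, Absorb, Scan]


# --- IPADS types (a fragment) ---
@dataclass(frozen=True)
class IBase:      # an IPADS base type
    name: str


@dataclass(frozen=True)
class Popt:       # Popt t
    inner: "IPADS"


@dataclass(frozen=True)
class Plit:       # Plit c
    char: str


IPADS = Union[IBase, Popt, Plit]


def translate(t: IPADS) -> DDC:
    """Encode an IPADS type as a DDC type, one rule per constructor."""
    if isinstance(t, IBase):
        return Base(t.name)
    if isinstance(t, Popt):
        # Popt t  becomes  T + true
        return Sum(translate(t.inner), TrueT())
    if isinstance(t, Plit):
        # Plit c  becomes  scan(absorb({x:char | x = c}))
        return Scan(Absorb(SetEq(Base("char"), t.char)))
    raise TypeError(f"unsupported IPADS type: {t!r}")


print(translate(Popt(IBase("int"))))
print(translate(Plit("|")))
```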
Example: Pswitch

  |- ti ⇒ Ti (i = 1…n)     |- tdef ⇒ Tdef
  ------------------------------------------------------------
  |- Pswitch e of {e1 => t1; e2 => t2; … tdef}
       ⇒ (λc.{x:T1 | c = e1} + {x:T2 | c = e2} + … + Tdef) e

DDC types used: T + T’, λx.T, {x:T|e}
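The shape of the Pswitch encoding (a guarded union under an abstraction, applied to the selector) can be shown by rendering it as text. This Python helper is purely illustrative: `encode_pswitch` is a hypothetical name, and expressions and already-translated branch types are passed in as plain strings.

```python
from typing import List, Tuple


def encode_pswitch(selector: str,
                   alts: List[Tuple[str, str]],
                   default: str) -> str:
    """Render the DDC encoding of `Pswitch selector of {alts default}`.

    `alts` pairs each guard expression ei with the DDC type Ti of its branch.
    """
    # Each branch becomes a set type guarded by "c = ei".
    branches = [f"{{x:{ddc} | c = {guard}}}" for guard, ddc in alts]
    # Branches are tried left to right, with the default last.
    body = " + ".join(branches + [default])
    # Abstract over the selector value c, then apply to the selector.
    return f"(λc.{body}) {selector}"


print(encode_pswitch("tag", [("1", "int"), ("2", "char")], "true"))
# (λc.{x:int | c = 1} + {x:char | c = 2} + true) tag
```

Because the union is ordered and each guard fails unless c equals its ei, at most one guarded branch can succeed before the default is reached.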
Future work
• What is the set of languages recognized by the DDC?
• How does the expressive power of the DDC relate to CFGs and regular expressions?
• Implement recursive types in the PADS system based on the recursive types of the DDC.
• Add polymorphism to DDC and PADS.
Summary
• Data description languages are well-suited to describing ad hoc data.
• No one DDL will ever be right: different domains and applications will demand different languages with differing levels of expressiveness and abstraction.
• Our work defines the first semantics for data description languages.
• For more information, visit www.padsproj.org.