Projecting XML Documents
9/11/2003 1
Projecting XML Documents
Amélie Marian, Columbia University
Jérôme Simeon, Bell Laboratories
9/11/2003 2
Motivation
XQuery is used in very different environments:
- XQuery implementations on XML stored in databases (with indexes).
- Main-memory XQuery implementations on XML in files, sent as streams, computed on the fly…
Example applications:
- Web Services (e.g., ActiveXML).
- Telecommunication apps (XML messages).
- XML documents.
- Information Integration.
9/11/2003 3
Memory Limitations
- Main-memory XQuery implementations cannot handle large documents.
- Complex XQuery expressions require materialization (DOM).
- DOM is the bottleneck.

XQuery Processor    Maximum Document Size
Quip                7Mb
Kweelt              17Mb
Galax               33Mb
Xalan (XSLT)        75Mb

XMark Query 1 on an IBM laptop T23 (256Mb RAM)
9/11/2003 4
Projection: Example
<site>
  <regions>...</regions>
  <people>
    ...
    <person id="person120">
      <name>Wagar Bougaut</name>
      <emailaddress>mailto:[email protected]</emailaddress>
    </person>
    <person id="person121">
      <name>Waheed Rando</name>
      <emailaddress>mailto:[email protected]</emailaddress>
      <address>
        <street>32 Mallela St</street>
        <city>Tucson</city>
        <country>United States</country>
        <zipcode>37</zipcode>
      </address>
      <creditcard>7486 5185 1962 7735</creditcard>
      <profile income="59224.09">...
XMark Query 1:
  for $b in /site/people/person[@id="person0"]
  return $b/name
Less than 2% of the original document!
9/11/2003 5
Projection: Intuition
Given a query:
  for $b in /site/people/person[@id="person0"]
  return $b/name
- Most nodes in the input document(s) are not required.
- The projection operation removes unnecessary nodes.
- Evaluating the query on the projected document yields the same results as on the original document.
How it works:
- Projection is defined by a set of paths.
- Static analysis infers the set of paths used within a query:
    /site/people/person
    /site/people/person/@id
    /site/people/person/name
9/11/2003 6
Projection: Challenges
For an XQuery expression, compute all the paths needed to reach the nodes required to evaluate the expression.
XQuery is complex:
- Variables
- Composition
- Syntactic sugar
- Complex expressions
The analysis has to handle all of XQuery.
9/11/2003 7
Contributions
- Definition of a notion of projection for XML documents, based on path expressions.
- Static analysis algorithm for arbitrary XQuery expressions, used to infer projection paths.
- Loading algorithm to build the projected XML document.
- Integration of XML projection in an XQuery processor (Galax).
- Experimental evaluation of projection on XML query processing.
9/11/2003 8
XML Projection
Similar to relational projection:
- One key operation.
- Prunes unnecessary parts of the data.
- Essential for memory management.
Specific problems related to XML:
- Projection must operate on trees.
- Requires analysis of the query.
- Need to address XQuery complexity.
9/11/2003 9
Notation
Projection paths:
- Path expressions are written as XPath expressions, e.g. /site/people/person/@id
- The "#" notation is used when the subtree should be kept, e.g. /site/people/person/name#
Static analysis: inference rule notation
  Expr => Paths
9/11/2003 10
Static Analysis: Variables
Variables can be bound to nodes coming from different paths.
  for $b in /site/people/(teacher | student)
  return $b/name
The analysis must remember the paths to which the variable was bound:
  /site/people/teacher
  /site/people/student
An environment is maintained during path analysis:
  Env |- Expr => Paths
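The role of the environment can be illustrated with a minimal Python sketch. The tuple-based expression encoding and the function name `analyze` are hypothetical, chosen only for illustration; this is not the analysis implemented in Galax, and it only tracks the paths reached by the result of the expression.

```python
# Hypothetical mini-AST (an assumption of this sketch, not XQuery Core):
#   ("paths", [p1, p2, ...])   absolute path expression, union already expanded
#   ("for", var, source, body) binds var to the nodes returned by source
#   ("step", var, name)        applies a child step to a bound variable

def analyze(expr, env):
    """Return the set of paths reached by expr; env maps variables to path sets."""
    kind = expr[0]
    if kind == "paths":
        return set(expr[1])
    if kind == "for":
        _, var, source, body = expr
        bound = analyze(source, env)
        # Remember every path the variable may be bound to.
        return analyze(body, {**env, var: bound})
    if kind == "step":
        _, var, name = expr
        # The step is applied to each path recorded for the variable.
        return {f"{path}/{name}" for path in env[var]}
    raise ValueError(f"unsupported expression: {kind}")

# for $b in /site/people/(teacher | student) return $b/name
query = ("for", "b",
         ("paths", ["/site/people/teacher", "/site/people/student"]),
         ("step", "b", "name"))
result = analyze(query, {})
print(sorted(result))
# ['/site/people/student/name', '/site/people/teacher/name']
```

Because the environment stores a set of paths per variable, the step `$b/name` is extended along both the teacher and the student branch.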
9/11/2003 11
Static Analysis: Example
Literals do not require any paths:
  Env |- Literal => {}
  Example: 32 => {}
Paths are propagated in a sequence:
  Env |- Expr1 => Paths1    Env |- Expr2 => Paths2
  ------------------------------------------------
  Env |- Expr1,Expr2 => Paths1 U Paths2
  Example:
    /a/b => {/a/b}
    /a/d => {/a/d}
    /a/b,/a/d => {/a/b,/a/d}
The static analysis algorithm is correct (see Technical Report).
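The two rules above can be read as a recursive function. The following is a minimal Python sketch under an assumed toy encoding (numbers are literals, strings are path expressions, tuples are sequences); the name `paths_of` is hypothetical and the environment is omitted because neither rule uses it.

```python
def paths_of(expr):
    """Infer the projection paths of a toy expression (illustrative only)."""
    if isinstance(expr, (int, float)):
        # Literal rule: Env |- Literal => {}
        return set()
    if isinstance(expr, str):
        # In this toy encoding, a string stands for a path expression.
        return {expr}
    if isinstance(expr, tuple):
        # Sequence rule: Env |- Expr1,Expr2 => Paths1 U Paths2
        paths = set()
        for item in expr:
            paths |= paths_of(item)
        return paths
    raise TypeError(f"unsupported expression: {expr!r}")

print(paths_of(32))                        # set()
print(sorted(paths_of(("/a/b", "/a/d"))))  # ['/a/b', '/a/d']
```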
9/11/2003 12
Static Analysis: Composition
  (if (count(/site/regions/*) = 3)
   then /site/people/person
   else /site/open_auctions/open_auction)/@id
- /@id does not apply to /site/regions/*
- The final set of paths should be:
    /site/regions/*
    /site/people/person/@id
    /site/open_auctions/open_auction/@id
We need to differentiate two sets of paths during analysis:
- Returned paths: returned by the expression; further path steps are applied to them.
- Used paths: used to compute the expression.
  Env |- Expr => Paths using UPaths
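The distinction between returned and used paths can be sketched in Python as follows. The tuple encoding of expressions and the function name `analyze` are assumptions made for this illustration, not the rules from the paper; the sketch only covers the constructs needed for the conditional example above.

```python
def analyze(expr):
    """Return (returned_paths, used_paths) for a hypothetical mini-AST."""
    kind = expr[0]
    if kind == "literal":
        return set(), set()
    if kind == "path":
        return {expr[1]}, set()
    if kind in ("count", "eq"):
        # The arguments are needed to compute a value, but no node is returned.
        used = set()
        for arg in expr[1:]:
            returned, arg_used = analyze(arg)
            used |= returned | arg_used
        return set(), used
    if kind == "if":
        cond_ret, cond_used = analyze(expr[1])
        then_ret, then_used = analyze(expr[2])
        else_ret, else_used = analyze(expr[3])
        # Only the branches return nodes; the condition is merely used.
        return then_ret | else_ret, cond_ret | cond_used | then_used | else_used
    if kind == "step":
        returned, used = analyze(expr[1])
        # A further step extends the returned paths only.
        return {f"{path}/{expr[2]}" for path in returned}, used
    raise ValueError(f"unsupported expression: {kind}")

query = ("step",
         ("if",
          ("eq", ("count", ("path", "/site/regions/*")), ("literal", 3)),
          ("path", "/site/people/person"),
          ("path", "/site/open_auctions/open_auction")),
         "@id")
returned, used = analyze(query)
print(sorted(returned))
# ['/site/open_auctions/open_auction/@id', '/site/people/person/@id']
print(sorted(used))
# ['/site/regions/*']
```

The union of the two sets is the final set of paths listed above: the step /@id is applied to the branches of the conditional, and /site/regions/* is kept unchanged.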
9/11/2003 13
XQuery Core
Subset of XQuery:
- Reduced grammar.
- All operations are made explicit.
- Same expressive power as XQuery.
- Removes syntactic sugar.
- Simplifies complex expressions.
The analysis only needs to be applied to a small set of expressions.
XQuery:
  for $b in /site/people/person
  return $b/name
XQuery Core:
  for . in /
  return for . in child::site
    return for . in child::people
      return for . in child::person
        return child::name
9/11/2003 14
Optimized Projection
XQuery Core decomposition may lead to redundant paths being kept:
  /site
  /site/people
  /site/people/person
  /site/people/person/name#
- Optimization of the inference rules can avoid this redundancy.
- Details in the paper; full optimized analysis in the technical report.
9/11/2003 15
XQuery Processing Architecture
[Figure: XQuery processing architecture diagram.
 Components: XQuery Parser, Path Analysis, SAX Parser, XML Data Model Loader, Query Evaluation.
 Data: XQuery Expression, XQuery Abstract Syntax Tree, Projection Paths, Input XML Document, SAX Events, Document Data Model, Projected Data Model, XML Query Result.]
9/11/2003 16
Loading Algorithm: Description
Input:
- Set of projection paths.
- Document SAX events.
Decide on the action to apply to document nodes:
- Skip: ignore the node and its subtree.
- KeepSubtree: keep the node and its subtree.
- Keep: keep the node without its subtree.
- Move: keep processing SAX events. The current node is only kept if some of its children are kept.
Keep a set of current paths.
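The four actions can be sketched with Python's standard `xml.sax` module. This is a simplified illustration, not the Galax loader: it assumes projection paths that use only child steps on element names (no wildcards, attributes, or descendant axis), it matches the full document path against the projection paths instead of maintaining a set of current paths, and the class name `ProjectionLoader` and the sample document are hypothetical.

```python
import xml.sax
import xml.etree.ElementTree as ET

class ProjectionLoader(xml.sax.ContentHandler):
    """Builds a projected tree from SAX events (child-axis paths only)."""

    def __init__(self, projection_paths):
        super().__init__()
        self.subtree_paths = {p[:-1] for p in projection_paths if p.endswith("#")}
        self.node_paths = {p for p in projection_paths if not p.endswith("#")}
        self.all_paths = self.subtree_paths | self.node_paths
        self.names = []         # names of the open, non-skipped elements
        self.stack = []         # [element, kept] pairs for those elements
        self.skip_depth = 0     # > 0 while inside a skipped subtree
        self.subtree_depth = 0  # > 0 while inside a kept subtree
        self.root = None

    def startElement(self, name, attrs):
        if self.skip_depth:                      # still inside a skipped subtree
            self.skip_depth += 1
            return
        path = "/" + "/".join(self.names + [name])
        if self.subtree_depth or path in self.subtree_paths:
            self.subtree_depth += 1              # KeepSubtree
            kept = True
        elif path in self.node_paths:
            kept = True                          # Keep
        elif any(p.startswith(path + "/") for p in self.all_paths):
            kept = False                         # Move: kept only if a child is kept
        else:
            self.skip_depth = 1                  # Skip the node and its subtree
            return
        self.names.append(name)
        # Simplification: attributes are copied only inside kept subtrees.
        attributes = dict(attrs.items()) if self.subtree_depth else {}
        self.stack.append([ET.Element(name, attributes), kept])

    def characters(self, content):
        if self.skip_depth or not self.subtree_depth:
            return                               # text is kept only in kept subtrees
        element = self.stack[-1][0]
        if len(element):
            element[-1].tail = (element[-1].tail or "") + content
        else:
            element.text = (element.text or "") + content

    def endElement(self, name):
        if self.skip_depth:
            self.skip_depth -= 1
            return
        self.names.pop()
        if self.subtree_depth:
            self.subtree_depth -= 1
        element, kept = self.stack.pop()
        if not kept:
            return                               # Move node without kept children
        if self.stack:
            self.stack[-1][0].append(element)
            self.stack[-1][1] = True             # a kept child keeps its parent
        else:
            self.root = element

# Hypothetical sample document, not the one shown on the slides.
document = b"<a><g><b/></g><b><c><f/></c></b><d/><e/></a>"
loader = ProjectionLoader(["/a/b/c#", "/a/d"])
xml.sax.parseString(document, loader)
loaded = [element.tag for element in loader.root.iter()]
print(loaded)  # ['a', 'b', 'c', 'f', 'd']
```

In this run, `g` and `e` are skipped with their subtrees, `a` and `b` are traversed with Move and kept because a descendant is kept, `c` is kept with its subtree, and `d` is kept as a single node.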
9/11/2003 17
Loading Algorithm: Example
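Projection paths:
  /a/b/c#
  /a/d
[Animated figure: a document stream with elements a, g, b, f, c, d, e is processed event by event (the tag order is not recoverable from the transcript). The slide tracks the current paths (/a/b/c# and /a/d, then /b/c# and /d, then /c#), the action applied (Move, Skip, Keep Subtree, Keep), and the loaded nodes (a, b, c, f, d).]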
Similar to XML filtering algorithms.
Limitations:
- Backward axis!
- The number of current paths can be huge (descendant axis).
9/11/2003 18
Experiments: Settings
XML Projection Evaluation:
- Effectiveness: projection impact on different queries.
- Maximum document size: largest document that can be processed.
- Processing time: effect on processing time.
Experimental Setup:
- Default XMark document size: 50Mb.

Configuration   CPU      Cache Size   RAM
A               1GHz     256Kb        256Mb
B               550MHz   512Kb        768Mb
C (default)     1.4GHz   256Kb        2Gb
9/11/2003 19
Experiments: Effectiveness
[Figure: bar chart. Y-axis: size as a percentage of the size of the original document (scale 0 to 10). Series: Projection, Optimized Projection. Off-scale bars are labeled 100%, 100%, 33%, 100%, 60%.]
All queries but one require less than 5% of the document.
9/11/2003 20
Experiments: Maximal Document Size
Query                                     Setting                A       B       C
XMark Query 3                             No Projection          33Mb    220Mb   520Mb
(simple selection with predicate)         Optimized Projection   1Gb     1.5Gb   1.5Gb
XMark Query 14                            No Projection          20Mb    20Mb    20Mb
(non-selective path query with            Optimized Projection   100Mb   100Mb   100Mb
predicates)
XMark Query 15                            No Projection          33Mb    220Mb   520Mb
(long, very selective path query)         Optimized Projection   1Gb     2Gb     2Gb
9/11/2003 21
Experiments: Query Execution Time
[Figure: two bar charts. Y-axis: total query execution time (in seconds), with scales 0 to 250 and 0 to 8000. The axis labels that survive name Query 8, Query 9, Query 10, Query 11, Query 12 and Query 14. Series: No Projection, Projection, Optimized Projection.]
- Projection significantly reduces query processing time.
- Next bottleneck: joins!
9/11/2003 22
Improvements
- A complete XQuery implementation with projection is available in Galax.
- Demo at VLDB 2003.
- Galax uses a more recent, pure streaming algorithm for applying projection to a document:
  - Better performance.
  - Can be used as a stand-alone operation, without loading.
9/11/2003 23
Conclusion and Future Work
Main contributions:
- Definition of a notion of projection for XML.
- Static analysis to infer projection paths from any XQuery expression.
- Full implementation in Galax.
Experimental results:
- Dramatic increase in the document size a main-memory XQuery processor can handle.
- Projection helps reduce query processing time.
Future work:
- Define a loading algorithm for the backward axis.
- Combine projection with other optimizations.
- Push down query operations during projection (e.g., predicate evaluation).
9/11/2003 24
Advertisement
“XQuery from the Experts”, just released by Addison Wesley.
Ask Jerome for 20% discount flyers!