Web Data Management Indexes. In this lecture Indexes –XSet –Region algebras –Indexes for...


Web Data Management

Indexes

In this lecture

• Indexes

– XSet

– Region algebras

– Indexes for Arbitrary Semistructured Data

– Dataguides

– T-indexes

– Index Fabric

Resources

• Index Structures for Path Expressions, by Milo and Suciu, in ICDT'99

• XSet description: http://www.openhealth.org/XSet/

• Data on the Web, by Abiteboul, Buneman, Suciu: section 8.2

The problem

• Input: large, irregular data graph

• Output: index structure for evaluating regular path expressions

The Data

Semistructured data instance = a large graph

The queries

Regular expressions (using Lorel-like syntax)

SELECT X
FROM (Bib.*.author).(lastname|firstname).Abiteboul X

SELECT x
FROM part._*.supplier.name x

Requires: traversing the data from the root and returning all nodes x reachable by a path matching the given path expression.

SELECT X
FROM part._*.supplier: {name: X, address: "Philadelphia"}

Needs an index on values to narrow the search to the parts of the database that contain the string "Philadelphia".

Analyzing the problem

• what kind of data

– tree data (XML): easier to index

– graph data: used in more complex applications

• what kind of queries

– restricted regular expressions (e.g. XPath): may be more efficient

– arbitrary regular expressions: rarely encountered in practice

XSet: a simple index for XML

• Part of the Ninja project at Berkeley

• Example XML data:

XSet: a simple index for XML

Each node = a hashtable

Each entry = list of pointers to data nodes (not shown)

XSet: Efficient query evaluation

• R1: look for part in the root hash table h1, follow the link to table h2, then look for name.

• R4: following part leads to h2; traverse all nodes in the index below h2 (corresponding to *), then continue with the path subpart.name. Thus, the entire subtree dominated by h2 is explored.

• This is efficient if the index is small and fits in memory.

• R3: the leading wildcard forces consideration of all nodes in the index tree, resulting in a less efficient computation than for R4.

• The index itself can be indexed: retrieve all hash tables that contain a supplier entry, then continue a normal search from there.

(R1) SELECT X FROM part.name X -yes

(R2) SELECT X FROM part.supplier.name X -yes

(R3) SELECT X FROM *.supplier.name X -maybe

(R4) SELECT X FROM part.*.subpart.name X -maybe
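
The evaluation strategy for R1 to R4 can be sketched in a few lines of Python. This is only an illustration, not the actual XSet implementation: the class IndexNode, the helpers insert, subtree and evaluate, and the sample data ids n1 to n4 are all hypothetical, and * is assumed to match any sequence of labels, including the empty one.

```python
from dataclasses import dataclass, field


@dataclass
class IndexNode:
    # The hash table of this index node: label -> child IndexNode.
    children: dict = field(default_factory=dict)
    # Pointers to data nodes; plain string ids stand in for them here.
    extent: list = field(default_factory=list)


def insert(root, label_path, data_id):
    """Register one data node under the label path that leads to it."""
    node = root
    for label in label_path:
        node = node.children.setdefault(label, IndexNode())
    node.extent.append(data_id)


def subtree(node):
    """Yield the node and every index node below it (used for '*')."""
    yield node
    for child in node.children.values():
        yield from subtree(child)


def evaluate(root, path):
    """Return the data ids reached by a dotted path expression."""
    frontier = [root]
    for step in path.split("."):
        if step == "*":
            # Wildcard: explore the whole subtree under each frontier node.
            reached = [m for n in frontier for m in subtree(n)]
        else:
            # Label constant: one hash-table lookup per frontier node.
            reached = [n.children[step] for n in frontier if step in n.children]
        # Drop duplicates (possible after '*') while keeping order.
        frontier = list({id(n): n for n in reached}.values())
    return [d for n in frontier for d in n.extent]


# Hypothetical sample data, loosely following the part/supplier example.
root = IndexNode()
insert(root, ["part", "name"], "n1")
insert(root, ["part", "supplier", "name"], "n2")
insert(root, ["part", "subpart", "name"], "n3")
insert(root, ["part", "subpart", "subpart", "name"], "n4")

r1 = evaluate(root, "part.name")
r2 = evaluate(root, "part.supplier.name")
r3 = evaluate(root, "*.supplier.name")
r4 = evaluate(root, "part.*.subpart.name")
```

R1 and R2 cost one lookup per step, while R3 and R4 visit a whole subtree of the index, which matches the yes/maybe annotations on the queries.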

Region Algebras

• structured text = text with tags (like XML)

• powerful indexing techniques [Baeza-Yates, Gonnet, Navarro, Salminen, Tompa, etc.]

• New Oxford English Dictionary

• critical limitation: ordered data only (like text)

• Assume: data given as an XML text file, and implicit ordering in the file.

• less critical limitation: restricted regular expressions

Region Algebras: Definitions

• data = sequence of characters $[c_1 c_2 c_3 \dots]$

• region = segment of the text in a file

– representation $(x, y) = [c_x, c_{x+1}, \dots, c_y]$, where x is the start position and y is the end position of the region

– example: <section> … </section>

• region set = a set of regions s.t. any two regions are either disjoint or one is included in the other

– example: all <section> regions (may be nested)

– tree data: each node defines a region, and each set of nodes defines a region set

– example: region p2 consists of the text under p2; the set {p2, s2, s1} is a region set with three regions

Representation of a region set

• Example: the <subpart> region set:

• region algebra = operators on region sets; $s_1$ op $s_2$ defines a new region set

Region algebra: some operators

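• $s_1$ intersect $s_2$ $= \{\, r \mid r \in s_1,\ r \in s_2 \,\}$

• $s_1$ included $s_2$ $= \{\, r \mid r \in s_1,\ \exists r' \in s_2,\ r \subseteq r' \,\}$

• $s_1$ including $s_2$ $= \{\, r \mid r \in s_1,\ \exists r' \in s_2,\ r \supseteq r' \,\}$

• $s_1$ parent $s_2$ $= \{\, r \mid r \in s_1,\ \exists r' \in s_2,\ r \text{ is a parent of } r' \,\}$

• $s_1$ child $s_2$ $= \{\, r \mid r \in s_1,\ \exists r' \in s_2,\ r \text{ is a child of } r' \,\}$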

Examples:

<subpart> included <part> = { s1, s2, s3, s5}

<part> including <subpart> = {p2, p3}

<name> child <part> = {n1, n3, n12}
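
As a rough illustration of these operators, the following Python sketch represents a region as a (start, end) pair and a region set as a list of such pairs. It is a naive quadratic version, not an efficient implementation. The positions in the sample data are made up, and the extra universe argument (the set of all regions in the document, needed to decide what counts as a direct parent) is an assumption of this sketch.

```python
def contains(outer, inner):
    """True if region inner lies within region outer (possibly equal)."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def is_parent(r, c, universe):
    """True if r contains c and no other region of the document lies between them."""
    if r == c or not contains(r, c):
        return False
    return not any(
        u != r and u != c and contains(r, u) and contains(u, c)
        for u in universe
    )


def intersect(s1, s2):
    return [r for r in s1 if r in s2]


def included(s1, s2):
    return [r for r in s1 if any(contains(r2, r) for r2 in s2)]


def including(s1, s2):
    return [r for r in s1 if any(contains(r, r2) for r2 in s2)]


def parent(s1, s2, universe):
    return [r for r in s1 if any(is_parent(r, r2, universe) for r2 in s2)]


def child(s1, s2, universe):
    return [r for r in s1 if any(is_parent(r2, r, universe) for r2 in s2)]


# Hypothetical document: one part holding a name, two nested subparts
# (each with a name) and a supplier with a name.
p1 = (0, 99)
n1, n2, n3, n4 = (5, 15), (25, 35), (42, 50), (70, 80)
sp1, sp2 = (20, 60), (40, 55)
su1 = (65, 95)

parts, subparts, names = [p1], [sp1, sp2], [n1, n2, n3, n4]
universe = parts + subparts + names + [su1]

subparts_in_parts = included(subparts, parts)
parts_with_subparts = including(parts, subparts)
part_names = child(names, parts, universe)
subpart_names = child(names, subparts, universe)
```

Here only n1 is a direct child of the part; n2 and n3 sit under subparts, and n4 sits under the supplier.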

Efficient computation of Region Algebra Operators

Example: s1 included s2

s1 = {(x1,x1'), (x2,x2'), …}

s2 = {(y1,y1'), (y2,y2'), …}

(i.e. assume each consists of disjoint regions, listed in order of start position)

Algorithm (repeat while both lists have regions left):

if xi < yj then i := i + 1

else if xi' > yj' then j := j + 1

otherwise: print (xi,xi'), do i := i + 1

Can be done in sub-linear time when one region set is very small
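
The merge step above can be written out as a short Python sketch. It assumes, as the algorithm does, that each list holds disjoint regions sorted by start position; the function name included_merge and the sample positions are made up for illustration.

```python
def included_merge(s1, s2):
    """Return the regions of s1 that lie inside some region of s2.

    Both inputs are lists of (start, end) pairs, each list consisting of
    disjoint regions sorted by start position.
    """
    i = j = 0
    result = []
    while i < len(s1) and j < len(s2):
        x_start, x_end = s1[i]
        y_start, y_end = s2[j]
        if x_start < y_start:
            # x begins before y; no later y can contain it, so skip x.
            i += 1
        elif x_end > y_end:
            # x ends after y; no later x fits in this y, so skip y.
            j += 1
        else:
            # y_start <= x_start and x_end <= y_end: x is included in y.
            result.append(s1[i])
            i += 1
    return result


s1 = [(1, 3), (5, 8), (12, 14), (20, 30)]
s2 = [(4, 10), (11, 15), (18, 25)]
inside = included_merge(s1, s2)
```

Each iteration advances i or j, so the loop runs at most len(s1) + len(s2) times. The sub-linear variant would additionally skip ahead in the longer list, for example with binary search; that is not shown here.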

From path expressions to region expressions

• Use region algebra operators to answer regular path expressions:

• Only restricted forms of regular path expressions can be translated into region algebra operators – expressions of the form R1.R2…Rn, where each Ri is either a label constant or the Kleene closure *.

Region expressions correspond to simple XPath expressions

Path expression          Region expression
part.name                name child (part child root)
part.supplier.name       name child (supplier child (part child root))
*.supplier.name          name child supplier
part.*.subpart.name      name child (subpart included (part child root))
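
The four translations above follow a pattern that can be sketched in Python. The rule used here is inferred from those examples only and is an assumption: a label directly after another label becomes child, a label after * becomes included, and a leading * simply drops the reference to root. The function name to_region_expression is hypothetical.

```python
def to_region_expression(path):
    """Translate a path R1.R2...Rn (labels and '*') into a region expression."""
    expr = "root"
    after_star = False
    for step in path.split("."):
        if step == "*":
            after_star = True
            continue
        if after_star and expr == "root":
            # Every region is included in root, so the constraint is dropped.
            expr = step
        else:
            op = "included" if after_star else "child"
            # Parenthesise compound sub-expressions, as in the table.
            inner = f"({expr})" if " " in expr else expr
            expr = f"{step} {op} {inner}"
        after_star = False
    if after_star:
        raise ValueError("a trailing '*' is not handled by this sketch")
    return expr


e1 = to_region_expression("part.name")
e2 = to_region_expression("part.supplier.name")
e3 = to_region_expression("*.supplier.name")
e4 = to_region_expression("part.*.subpart.name")
```

The sketch only builds the expression as a string; evaluating it still requires the region algebra operators themselves.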

From path expressions to region expressions

• Answering more complex queries:

SELECT X
FROM *.subpart: {name: X, *.supplier.address: "Philadelphia"}

• Translates into the following region algebra expression:

name child (subpart including (supplier parent (address intersect "Philadelphia")))

• "Philadelphia" denotes a region set consisting of all regions corresponding to the word "Philadelphia" in the text.

• Such a region set can be computed dynamically using a full-text index.

• Region expressions correspond to simple XPath expressions

Indexes for Arbitrary Semistructured Data

• A semistructured data instance that is a DAG

Indexes for Arbitrary Semistructured Data

• The data represents employees and projects in a company.

• Two kinds of employees: programmers and statisticians

• Three kinds of links to projects: leads, workson, consults

• Index graph: a reduced graph that summarizes all paths from the root in the data graph

• Example: node p1: the paths from the root to p1 are labeled with the following five sequences:

project
employee.leads
employee.workson
programmer.employee.leads
programmer.employee.workson

• Node p2: the paths from the root to p2 are labeled by the same five sequences

• p1 and p2 are language-equivalent

Indexes for Arbitrary Semistructured Data

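• For each node x in the data graph G,

$L_x = \{\, w \mid \exists \text{ a path from the root to } x \text{ labeled } w \,\}$

$\forall x, y:\ x \equiv y \iff L_x = L_y$

$[x] = \{\, y \mid x \equiv y \,\}$

The index graph I is defined by:

$\mathrm{Nodes}(I) = \{\, [x] \mid x \in \mathrm{nodes}(G) \,\}$

$\mathrm{Edges}(I) = \{\, [x] \xrightarrow{a} [y] \mid x' \in [x],\ y' \in [y],\ x' \xrightarrow{a} y' \,\}$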

Indexes for Arbitrary Semistructured Data

• We have the following equivalences:

$e_1 \equiv e_2$
$e_3 \equiv e_4 \equiv e_5$
$p_1 \equiv p_2$
$p_3 \equiv p_4$
$p_5 \equiv p_6 \equiv p_7$

Indexes for Arbitrary Semistructured Data

• Computing path expression queries

– Compute the query on I and obtain a set of index nodes

– Compute the union of all their extents

SELECT X
FROM statistician.employee.(leads|consults): X

• Returns nodes h8, h9.

• Their extents are [p5, p6, p7] and [p8], respectively;

• result set = [p5, p6, p7, p8]

• Always: size(I) ≤ size(G)

• Efficient when I can be stored in main memory

• Checking $x \equiv y$ is expensive.
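
The two-step evaluation (run the query on I, then take the union of the extents) can be sketched in Python as follows. Only h8, h9 and their extents come from the example; the rest of the index graph, including the node names h0, hS and hE, is a made-up fragment, and the query is encoded by hand as a list of label alternatives rather than parsed.

```python
def eval_on_index(edges, extent, root, steps):
    """Evaluate a path query on the index graph.

    edges:  index node -> list of (label, target index node)
    extent: index node -> list of data nodes it represents
    steps:  one set of allowed labels per step of the path expression
    """
    frontier = {root}
    for labels in steps:
        frontier = {
            target
            for node in frontier
            for (label, target) in edges.get(node, [])
            if label in labels
        }
    # Union of the extents of all index nodes reached.
    result = set()
    for node in frontier:
        result |= set(extent.get(node, ()))
    return frontier, result


# Hypothetical fragment of the index graph; only h8, h9 and their
# extents are taken from the example.
index_edges = {
    "h0": [("statistician", "hS")],
    "hS": [("employee", "hE")],
    "hE": [("leads", "h8"), ("consults", "h9")],
}
index_extent = {"h8": ["p5", "p6", "p7"], "h9": ["p8"]}

# statistician.employee.(leads|consults)
query = [{"statistician"}, {"employee"}, {"leads", "consults"}]
index_nodes, answer = eval_on_index(index_edges, index_extent, "h0", query)
```

The traversal touches only index nodes; the data graph is consulted solely through the extents at the end, which is why the approach pays off when I fits in main memory.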

Indexes for Arbitrary Semistructured Data

Use bisimulation instead of $\equiv$. Fact: $\forall x, y:\ x \approx_b y \Rightarrow x \equiv y$

Use the same construction, but [x] now refers to $\approx_b$ instead of $\equiv$.

Bisimulation: Let DB be a data graph. A relation $\approx$ is a bisimulation on the reversed graph (i.e. all edges have their direction reversed) if the following conditions hold:

1. If $x \approx y$ and x is a root, then so is y.

2. Conversely, if $x \approx y$ and y is a root, then so is x.

3. If $x \approx y$, then for any edge $x' \xrightarrow{a} x$ there exists an edge $y' \xrightarrow{a} y$ s.t. $x' \approx y'$.

4. Conversely, if $x \approx y$, then for any edge $y' \xrightarrow{a} y$ there exists an edge $x' \xrightarrow{a} x$ s.t. $x' \approx y'$.
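
One common way to compute the largest bisimulation on the reversed graph is naive partition refinement: start by separating the root from all other nodes, then repeatedly split blocks whose members differ in their incoming (label, block) pairs. The Python sketch below follows that idea; it is not tuned for efficiency, and the function names and the small sample graph are made up for illustration.

```python
def backward_bisimulation(nodes, edges, root):
    """Map each node to the id of its block in the coarsest bisimulation
    on the reversed graph. edges is a list of (source, label, target)."""
    incoming = {n: [] for n in nodes}
    for src, label, dst in edges:
        incoming[dst].append((label, src))

    # Conditions 1 and 2: a root may only be related to a root.
    block = {n: (n == root) for n in nodes}
    while True:
        # Conditions 3 and 4: related nodes need matching incoming edges.
        signature = {
            n: (block[n], frozenset((a, block[p]) for a, p in incoming[n]))
            for n in nodes
        }
        ids = {}
        refined = {n: ids.setdefault(signature[n], len(ids)) for n in nodes}
        # The new partition always refines the old one, so an unchanged
        # block count means nothing was split and the partition is stable.
        if len(set(refined.values())) == len(set(block.values())):
            return refined
        block = refined


def build_index(nodes, edges, block):
    """Index graph: one node per block (with its extent) and the induced edges."""
    extent = {}
    for n in nodes:
        extent.setdefault(block[n], []).append(n)
    index_edges = {(block[s], a, block[d]) for s, a, d in edges}
    return extent, index_edges


# Hypothetical data graph.
nodes = ["r", "e1", "e2", "p1", "p2", "p3"]
edges = [
    ("r", "employee", "e1"), ("r", "employee", "e2"),
    ("e1", "leads", "p1"), ("e2", "leads", "p2"),
    ("e2", "workson", "p3"),
]
block = backward_bisimulation(nodes, edges, "r")
extent, index_edges = build_index(nodes, edges, block)
```

In this example e1 and e2 end up in one block, as do p1 and p2, while p3 stays on its own because it is reached by a different label.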

DataGuides

• Goldman & Widom [VLDB 97]

– graph data

– arbitrary regular expressions

DataGuides

Definition

given a semistructured data instance DB, a DataGuide for DB is a graph G s.t.:

- every path in DB also occurs in G

- every path in G occurs in DB

- every path in G is unique

Dataguides

Example:

DataGuides

• Multiple DataGuides for the same data:

DataGuides

Definition

Let w, w' be two words (i.e. word queries) and G a graph

$w \equiv_G w'$ if $w(G) = w'(G)$

Definition

G is a strong dataguide for a database DB if $\equiv_G$ is the same as $\equiv_{DB}$

DataGuides

Example:

• G1 is a strong dataguide

• G2 is not strong

person.project !DB dept.project

person.project !G2 dept.project

DataGuides

• Constructing the strong DataGuide G:

Nodes(G) = {{root}}
Edges(G) = ∅
while changes do
    choose s in Nodes(G), a in Labels
    let s' = {y | x in s, (x -a-> y) in Edges(DB)}
    if s' is not empty: add s' to Nodes(G) and add (s -a-> s') to Edges(G)

• Use hash table for Nodes(G)

• This is precisely the powerset automaton construction.
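
The construction can be sketched in Python as a worklist version of the powerset construction. The function name strong_dataguide and the small person/dept/project database are hypothetical; a Python set of frozensets plays the role of the hash table for Nodes(G).

```python
from collections import deque


def strong_dataguide(db_edges, root):
    """Build the strong DataGuide of a data graph.

    db_edges is a list of (source, label, target) triples. Each DataGuide
    node is a frozenset of data nodes (its target set).
    """
    out = {}
    for src, label, dst in db_edges:
        out.setdefault(src, {}).setdefault(label, set()).add(dst)

    start = frozenset([root])
    guide_nodes = {start}          # hash table of the target sets seen so far
    guide_edges = set()
    queue = deque([start])
    while queue:
        s = queue.popleft()
        # Group the successors of all data nodes in s by label.
        targets = {}
        for x in s:
            for label, ys in out.get(x, {}).items():
                targets.setdefault(label, set()).update(ys)
        for label, ys in targets.items():
            t = frozenset(ys)
            guide_edges.add((s, label, t))
            if t not in guide_nodes:
                guide_nodes.add(t)
                queue.append(t)
    return guide_nodes, guide_edges


# Hypothetical database: both persons and the department reach the
# same two projects.
db_edges = [
    ("root", "person", "p1"), ("root", "person", "p2"),
    ("root", "dept", "d1"),
    ("p1", "project", "j1"), ("p2", "project", "j2"),
    ("d1", "project", "j1"), ("d1", "project", "j2"),
]
guide_nodes, guide_edges = strong_dataguide(db_edges, "root")
```

Because person.project and dept.project have the same target set {j1, j2} in this database, both paths lead to the same DataGuide node. Only non-empty target sets are ever created, since labels are taken from existing outgoing edges.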

DataGuides

• How large are the dataguides?

– if DB is a tree, then size(G) <= size(DB)

  • why? answer: every node is in exactly one extent of G

  • here: dataguide = XSet

– How many nodes does the strong dataguide have for this DB?

20 nodes (least common multiple of 4 and 5)

Dataguides usually fail on data with cyclic schemas, like: