Storage and Retrieval of E-Commerce Data R. Agrawal, A. Somani, Y. Xu: VLDB-2001.
Overview
E-Commerce Data Characteristics
Alternative Physical Representations
Query Mapping Layer
Performance evaluation of various approaches
Conclusions
Typical E-Commerce Data Characteristics
An Experimental E-marketplace for Computer components
– Nearly 2 million components
– More than 2000 leaf-level categories
– Large number of attributes (5000)
Constantly evolving schema
Sparsely populated data (about 50-100 attributes/component)
Outline
E-Commerce Data Characteristics
Alternative Physical Representations
– Horizontal: one n-ary relation
– Binary: N 2-ary relations
– Vertical: one 3-ary relation
Query Mapping Layer
Performance evaluation
Conclusion
Conventional horizontal representation (n-ary relation)
Name        | Monitor | Height | Recharge | Output  | Playback     | Smooth scan | Progressive Scan
PAN DVD-L75 | 7 inch  | -      | Built-in | Digital | -            | -           | -
KLH DVD221  | -       | 3.75   | -        | S-Video | -            | -           | No
SONY S-7000 | -       | -      | -        | -       | -            | -           | -
SONY S-560D | -       | -      | -        | -       | Cinema Sound | Yes         | -
…           | …       | …      | …        | …       | …            | …           | …
DB catalogs do not support thousands of columns (DB2/Oracle limit: 1012 columns)
Storage overhead of NULL values
– Nulls increase the index size, and they sort high in the DB2 B+ tree index
Hard to load/update
Schema evolution is expensive
Querying is straightforward
Binary Representation (N 2-ary relations)
Dense representation
Manageability is hard because of the large number of tables
Schema evolution is expensive
Decomposition Storage Model [Copeland et al SIGMOD 85], [Khoshafian et al ICDE 87]
Monet: Binary Attribute Tables [Boncz et al VLDB Journal 99]
Attribute Approach for storing XML Data [Florescu et al INRIA Tech Report 99]
Monitor
Name        | Val
PAN DVD-L75 | 7 inch
Height
Name        | Val
KLH DVD221  | 3.75
Output
Name        | Val
PAN DVD-L75 | Digital
KLH DVD221  | S-Video
Vertical representation (One 3-ary relation)
Columns: Oid (object identifier), Key (attribute name), Val (attribute value)
Objects can have a large number of attributes
Handles sparseness well
Schema evolution is easy
Oid | Key                | Val
0   | 'Name'             | 'PAN DVD-L75'
0   | 'Monitor'          | '7 inch'
0   | 'Recharge'         | 'Built-in'
0   | 'Output'           | 'Digital'
1   | 'Name'             | 'KLH DVD221'
1   | 'Height'           | '3.75'
1   | 'Output'           | 'S-Video'
1   | 'Progressive Scan' | 'No'
2   | 'Name'             | 'SONY S-7000'
…   | …                  | …
Implementation of SchemaSQL [LSS 99]
Edge Approach for storing XML Data [FK 99]
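The vertical layout above can be tried out with a small sketch. This is only an illustration using Python's built-in sqlite3 module, not the DB2 setup used in the talk; the helper name insert_object and the extra attribute 'Region Code' are assumptions made up for the example.

```python
import sqlite3

# Sparse objects: only non-null attributes are listed (values from the slide).
products = [
    {"Name": "PAN DVD-L75", "Monitor": "7 inch", "Recharge": "Built-in", "Output": "Digital"},
    {"Name": "KLH DVD221", "Height": "3.75", "Output": "S-Video", "Progressive Scan": "No"},
    {"Name": "SONY S-7000"},
]

conn = sqlite3.connect(":memory:")
# One 3-ary relation; "Key" is quoted because KEY is an SQL keyword.
conn.execute('CREATE TABLE vtable (Oid INTEGER, "Key" TEXT, Val TEXT)')

def insert_object(conn, oid, attrs):
    """Hypothetical helper: store one (Oid, Key, Val) row per non-null attribute."""
    conn.executemany(
        'INSERT INTO vtable (Oid, "Key", Val) VALUES (?, ?, ?)',
        [(oid, key, val) for key, val in attrs.items()],
    )

for oid, attrs in enumerate(products):
    insert_object(conn, oid, attrs)

# Schema evolution: a new attribute (made up here) needs no ALTER TABLE,
# only one more row.
insert_object(conn, 2, {"Region Code": "1"})

row_count = conn.execute("SELECT COUNT(*) FROM vtable").fetchone()[0]
print(row_count)
```

No NULLs are stored: an object with four attributes costs four rows, regardless of how many attributes exist overall.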
Querying over Vertical Representation
A simple query on the horizontal scheme:
SELECT Monitor FROM H WHERE Output = 'Digital'
becomes quite complex on the vertical scheme:
SELECT v1.Val
FROM vtable v1, vtable v2
WHERE v1.Key = 'Monitor' AND v2.Key = 'Output' AND v2.Val = 'Digital' AND v1.Oid = v2.Oid
Writing applications becomes much harder. What can we do?
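As a quick check of the self-join query, the following sketch loads the sample rows into an in-memory SQLite database and runs it. SQLite stands in for the real engine here, and "Key" is double-quoted only because KEY is an SQL keyword.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE vtable (Oid INTEGER, "Key" TEXT, Val TEXT)')
conn.executemany("INSERT INTO vtable VALUES (?, ?, ?)", [
    (0, "Name", "PAN DVD-L75"), (0, "Monitor", "7 inch"),
    (0, "Recharge", "Built-in"), (0, "Output", "Digital"),
    (1, "Name", "KLH DVD221"), (1, "Height", "3.75"),
    (1, "Output", "S-Video"), (1, "Progressive Scan", "No"),
    (2, "Name", "SONY S-7000"),
])

# One copy of vtable per attribute mentioned in the query:
# v1 supplies the projected attribute, v2 the filtered one.
query = '''
SELECT v1.Val
FROM vtable v1, vtable v2
WHERE v1."Key" = 'Monitor'
  AND v2."Key" = 'Output'
  AND v2.Val = 'Digital'
  AND v1.Oid = v2.Oid
'''
monitors = [row[0] for row in conn.execute(query)]
print(monitors)
```

Every additional attribute in the SELECT list or WHERE clause adds another copy of vtable to the join, which is why hand-written queries grow quickly.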
Solution: Query Mapping on Vertical Representation
Simplify querying of the data
– Provide a horizontal view of the vertical table
– Automatically transform queries on the horizontal view to the vertical table
Solution (Continued)
Translation layer maps relational algebraic operations on H to operations on V
[Diagram: Horizontal view (H) with columns Attr1, Attr2, …, Attrk, … on top; Query Mapping Layer in the middle; Vertical table (V) with columns Oid, Key, Val below]
Key Issues
Can we define a translation algebra?
– Standard database view mechanism does not work: attribute values become column names (need higher-order views)
Can the vertical representation support fast querying?
What are the new requirements from the database engines?
Outline
E-Commerce data characteristics
Alternative Approaches
Query Mapping Layer
– Transformation Algebra
– Implementation Strategies
Performance evaluation
Conclusion
Transformation Algebra
Defined an algebra for transforming expressions over horizontal views into expressions over the vertical representation.
Two key operators:
– v2h (Vertical to Horizontal)
– h2v (Horizontal to Vertical)
Notation
Select ($\sigma$), Project ($\pi$), Join ($\bowtie$), Left Outerjoin (⟕), Right Outerjoin (⟖), Union ($\cup$), Intersection ($\cap$), Difference ($-$), Aggregation ($\mathcal{F}$), Null Value ($\perp$)
Sample Algebraic Transforms
v2h operation: convert from vertical to horizontal
$$\mathrm{v2h}_k(V) = [\pi_{Oid}(V)] \;⟕\; [\,⟕_{i=1}^{k}\; \pi_{Oid,Val}(\sigma_{Key='A_i'}(V))\,]$$
h2v operation: convert from horizontal to vertical
$$\mathrm{h2v}_k(H) = [\,\cup_{i=1}^{k}\; \pi_{Oid,'A_i',A_i}(\sigma_{A_i \neq '\perp'}(H))\,] \;\cup\; [\,\cup_{i=1}^{k}\; \pi_{Oid,'A_i',A_i}(\sigma_{\wedge_{i=1}^{k} A_i = '\perp'}(H))\,]$$
Similar operations such as Unfold/Fold and Gather/Scatter exist in SchemaSQL [LSS 99] and [STA 98], respectively
Complete transforms in VLDB-2001 Paper
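The two operators can be illustrated on in-memory data. The sketch below is an assumption-laden reading of the transforms, not the paper's implementation: relations are plain Python lists and dicts, None plays the role of the null value, and objects whose attributes are all null are kept by emitting one null row per attribute.

```python
def v2h(vertical, attrs):
    """Vertical (oid, key, val) tuples -> {oid: {attr: val or None}}.

    Every Oid that occurs in the vertical table gets a row, mirroring the
    left outer joins against the projection on Oid.
    """
    horizontal = {}
    for oid, key, val in vertical:
        row = horizontal.setdefault(oid, dict.fromkeys(attrs))
        if key in row:
            row[key] = val
    return horizontal

def h2v(horizontal, attrs):
    """Horizontal rows -> vertical (oid, key, val) tuples."""
    vertical = []
    for oid, row in horizontal.items():
        non_null = [(oid, a, row[a]) for a in attrs if row.get(a) is not None]
        if non_null:
            # First union term: one tuple per non-null attribute.
            vertical.extend(non_null)
        else:
            # Second union term: keep all-null objects with null values.
            vertical.extend((oid, a, None) for a in attrs)
    return vertical

V = [
    (0, "Monitor", "7 inch"), (0, "Output", "Digital"),
    (1, "Output", "S-Video"),
    (2, "Name", "SONY S-7000"),  # no Monitor or Output value
]
attrs = ["Monitor", "Output"]
H = v2h(V, attrs)
round_trip = v2h(h2v(H, attrs), attrs)
```

Under these assumptions, converting H to vertical form and back returns the same horizontal rows, including the all-null object 2.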
From the Algebra to SQL
Equivalent SQL transforms for algebraic transforms
– Select, Project
– Joins (self, two verticals, a horizontal and a vertical)
– Cartesian Product
– Union, Intersection, Set difference
– Aggregation
Extend DDL to provide the Horizontal View
CREATE HORIZONTAL VIEW hview ON VERTICAL TABLE vtable
USING COLUMNS (Attr1, Attr2, … Attrk, …)
Four implementation strategies to provide Horizontal View
Implementation Strategies on Vertical Representation
VerticalSQL
– Uses only SQL-92 level capabilities
– Used XQGM code to represent parsed SQL queries
VerticalUDF
– Exploits User Defined Functions and Table Functions to provide a direct implementation
Alternative Implementation Strategies
Binary (hand-coded queries)
– 2-ary representation with one relation per attribute (using only SQL-92 transforms)
SchemaSQL
– Addresses a more general problem
– Performed 2-3X worse than Vertical representation because of temporary tables
Transformation strategy: VerticalSQL
Query on the horizontal view (columns Attr1, Attr2, …):
SELECT Attr1, Attr2 FROM hview
Query transformation to the vertical table (columns Oid, Key, Val):
SELECT t7.Attr1, t7.Attr2
FROM (SELECT t4.Oid, t4.Attr1, t6.Attr2
      FROM (SELECT t1.Oid, t3.Attr1
            FROM (SELECT DISTINCT t0.Oid FROM vtable AS t0) AS t1(Oid)
                 LEFT OUTER JOIN
                 (SELECT t2.Oid, t2.Val FROM vtable AS t2
                  WHERE t2.Key = 'Attr1') AS t3(Oid, Attr1)
                 ON t1.Oid = t3.Oid) AS t4(Oid, Attr1)
           LEFT OUTER JOIN
           (SELECT t5.Oid, t5.Val FROM vtable AS t5
            WHERE t5.Key = 'Attr2') AS t6(Oid, Attr2)
           ON t4.Oid = t6.Oid) AS t7(Oid, Attr1, Attr2)
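The nesting pattern of the VerticalSQL rewrite (start from the distinct Oids, then add one left outer join per requested attribute) can be generated mechanically. The sketch below is a hypothetical generator written for SQLite, not the XQGM-based implementation from the talk; because SQLite does not accept column lists such as AS t1(Oid) on derived tables, it uses per-column aliases instead. The function names are assumptions.

```python
import sqlite3

def quote_ident(name):
    """Double-quote an SQL identifier."""
    return '"' + name.replace('"', '""') + '"'

def quote_literal(text):
    """Single-quote an SQL string literal."""
    return "'" + text.replace("'", "''") + "'"

def vertical_sql(attrs, table="vtable"):
    """Build a SELECT returning Oid plus one column per requested attribute."""
    query = f"SELECT DISTINCT Oid FROM {table}"
    columns = ["Oid"]
    for attr in attrs:
        kept = ", ".join(f"l.{quote_ident(c)} AS {quote_ident(c)}" for c in columns)
        query = (
            f"SELECT {kept}, r.Val AS {quote_ident(attr)} "
            f"FROM ({query}) AS l "
            f"LEFT OUTER JOIN "
            f'(SELECT Oid, Val FROM {table} WHERE "Key" = {quote_literal(attr)}) AS r '
            f"ON l.Oid = r.Oid"
        )
        columns.append(attr)
    return query

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE vtable (Oid INTEGER, "Key" TEXT, Val TEXT)')
conn.executemany("INSERT INTO vtable VALUES (?, ?, ?)", [
    (0, "Monitor", "7 inch"), (0, "Output", "Digital"),
    (1, "Output", "S-Video"),
    (2, "Name", "SONY S-7000"),
])

sql = vertical_sql(["Monitor", "Output"])
rows = conn.execute(f"SELECT * FROM ({sql}) ORDER BY Oid").fetchall()
print(rows)
```

Objects that lack a requested attribute still appear, with NULL in that column, which is what the left outer joins are for.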
Transformation strategy: VerticalUDF
Query on the horizontal view (columns Attr1, Attr2, …):
SELECT Attr1, Attr2 FROM hview
Query transformation to the vertical table (columns Oid, Key, Val; sample row: 0, 'Attr1', …):
SELECT t1.Attr1, t1.Attr2
FROM vtable AS t0, TABLE(v2h(t0.Oid, t0.Key, t0.Val)) AS t1(Oid, Attr1, Attr2)
WHERE t0.Key = 'Attr1' OR t0.Key = 'Attr2'
The v2h table function reads tuples of the vertical table sorted on Oid and outputs a horizontal tuple for each Oid
Severely penalized because of lack of engine support
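The behavior described for the v2h table function (consume vertical tuples sorted on Oid, emit one horizontal tuple per Oid) can be sketched as a streaming generator. This is a plain-Python illustration under the assumption that the input is already sorted and pre-filtered on the requested keys; it is not the DB2 table function itself, and the name v2h_stream is made up.

```python
from itertools import groupby
from operator import itemgetter

def v2h_stream(sorted_tuples, attrs):
    """Yield one (oid, attr1, ..., attrk) tuple per Oid.

    Assumes the (oid, key, val) input is already sorted on oid, so each
    object's tuples arrive together and only one row is held in memory.
    """
    position = {attr: i for i, attr in enumerate(attrs)}
    for oid, group in groupby(sorted_tuples, key=itemgetter(0)):
        values = [None] * len(attrs)
        for _, key, val in group:
            if key in position:
                values[position[key]] = val
        yield (oid, *values)

vertical = [
    (0, "Monitor", "7 inch"), (0, "Output", "Digital"),
    (1, "Output", "S-Video"),
]
rows = list(v2h_stream(vertical, ["Monitor", "Output"]))
print(rows)
```

A single sorted pass replaces the chain of outer joins, which is the appeal of this strategy when the engine can supply the tuples in Oid order.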
Transformation strategy: Binary
Query on the horizontal view (columns Attr1, Attr2, …):
SELECT Attr1, Attr2 FROM hview
Query transformation to the binary relations ATTR1 (Oid, Val), ATTR2 (Oid, Val) and ALLPROD (Oid):
SELECT t2.Attr1, ATTR2.Val
FROM (SELECT t1.Oid, ATTR1.Val
      FROM (SELECT Oid FROM ALLPROD) AS t1(Oid)
           LEFT OUTER JOIN ATTR1
           ON t1.Oid = ATTR1.Oid) AS t2(Oid, Attr1)
     LEFT OUTER JOIN ATTR2
     ON t2.Oid = ATTR2.Oid
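For comparison, the binary layout can be exercised the same way. The sketch below uses SQLite and a flattened form of the query (two left outer joins directly on ALLPROD rather than nested derived tables with column lists, which SQLite does not accept); the sample values are taken from the earlier slides and the Oid column is added to the output only to make the result easier to read.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript('''
CREATE TABLE ALLPROD (Oid INTEGER PRIMARY KEY);  -- every object
CREATE TABLE ATTR1 (Oid INTEGER, Val TEXT);      -- one relation per attribute
CREATE TABLE ATTR2 (Oid INTEGER, Val TEXT);
INSERT INTO ALLPROD VALUES (0), (1), (2);
INSERT INTO ATTR1 VALUES (0, '7 inch');
INSERT INTO ATTR2 VALUES (0, 'Digital'), (1, 'S-Video');
''')

query = '''
SELECT ALLPROD.Oid, ATTR1.Val AS Attr1, ATTR2.Val AS Attr2
FROM ALLPROD
LEFT OUTER JOIN ATTR1 ON ALLPROD.Oid = ATTR1.Oid
LEFT OUTER JOIN ATTR2 ON ALLPROD.Oid = ATTR2.Oid
ORDER BY ALLPROD.Oid
'''
rows = conn.execute(query).fetchall()
print(rows)
```

Each attribute relation is already dense, so no Key predicate is needed, but adding an attribute means creating a new table.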
Story So Far …
E-Commerce Data
– High-Arity, Sparse and Constantly Evolving
Three Approaches
              | Horizontal | Vertical (w/ Mapping) | Binary (w/ Mapping)
Manageability | +          | +                     | -
Flexibility   | -          | +                     | -
Querying      | +          | +                     | +
But what about Performance?
Performance: Experimental setup
600 MHz dual-processor Intel Pentium machine
512 MB RAM
Windows NT 4.0
Database: IBM DB2 UDB 7.1
Two 30GB IDE drives
Buffer pool size: 50 MB
All numbers reported are cold numbers
Experimental Setup (Continued)
Parameters used for synthetic data
– Number of columns
– Number of rows
– Non-null density
– Selectivity of a predicate for a column
For example, 200x100K with density = 10% implies 200 columns, 100K rows and a non-null density of 10%
Complete results in technical report
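To make the first three parameters concrete, here is a rough sketch of a generator for sparse synthetic data in vertical form. It is not the generator used for the experiments: the attribute names A0, A1, …, the integer value domain and the fixed seed are all assumptions, and predicate selectivity is not modeled.

```python
import random

def synthetic_vertical(num_cols, num_rows, density, seed=0):
    """Yield (oid, key, val) tuples; each cell is non-null with probability `density`."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    for oid in range(num_rows):
        for col in range(num_cols):
            if rng.random() < density:
                # Hypothetical attribute name and value domain.
                yield (oid, f"A{col}", rng.randrange(1000))

# A scaled-down 200 x 100 table at 10% density (the slides use 200 x 100K).
sample = list(synthetic_vertical(num_cols=200, num_rows=100, density=0.10))
observed_density = len(sample) / (200 * 100)
print(len(sample), round(observed_density, 3))
```

At 10% density the vertical table holds roughly one tenth as many rows as the horizontal table has cells, which is where the storage saving over NULL-padded rows comes from.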
Clustering by Key outperforms clustering by Oid
[Chart: execution time (seconds, axis 0-25) vs. join selectivity (0.1%, 1%, 5%); series: VerticalSQL_oid, VerticalSQL_key; density = 10%, 1000 cols x 20K rows; labels: Join, Projection of 10 columns]
VerticalSQL comparable to Binary and outperforms Horizontal
[Chart: execution time (seconds, axis 0-60) vs. table size in #cols x #rows (200x100K, 400x50K, 800x25K, 1000x20K); series: HorizontalSQL, VerticalSQL, Binary; density = 10%]
VerticalUDF is the best approach
[Chart: execution time (seconds, axis 0-30) vs. table size in #cols x #rows (200x100K, 400x50K, 800x25K, 1000x20K); series: VerticalUDF, VerticalSQL, Binary; density = 10%; projection of 10 columns]
Wish List from Database Engines
– VerticalUDF approach showed need for
– Enhanced table functions
– First class treatment of table functions
– Native support for v2h and h2v operations
– Partial indices
Summary
              | Horizontal | Vertical (w/ Mapping) | Binary (w/ Mapping)
Manageability | +          | +                     | -
Flexibility   | -          | +                     | -
Querying      | +          | +                     | +
Performance   | -          | +                     | +
Conclusions
Vertical representation is attractive for dealing with E-Commerce data
– High-arity, sparse and constantly evolving schema
Get the best of both horizontal and vertical representations
– Translation layer => querying as easy as horizontal representation
– Great flexibility and manageability
Performance
– VerticalUDF best approach
– VerticalSQL comparable to Binary and outperforms Horizontal
Full Report @ http://www.almaden.ibm.com/cs/quest/