A Paradigm Shift in Database Optimization: From Indices to Aggregates

Presented to: The Data Warehousing & Data Mining mini-track – AMCIS 2002 as Research-in-Progress

Ryan LaBrie • Robert St. Louis • Lin Ye
Arizona State University
[email protected] • [email protected] • [email protected]

Agenda
- A need for a shift in optimization strategy
- What our research is focusing on
- How we performed this research
- Update on our results
- Next steps

Why a Shift, Why Now?

HISTORICALLY
- Relational database technology is really good at what it does…
  - Transaction-oriented, operational systems
  - Optimized for data INPUT
- FOCUS: Storage of DATA

TODAY’S ENVIRONMENT
- Large Data Warehouses
  - Used for decision support
  - Need to be optimized for information OUTPUT
- FOCUS: Retrieval of INFORMATION

The Decision Support Problem

Relational DBMS limitations
- Too much data
  - Tera- and petabytes, quickly approaching exabytes
- Queries are too complex
  - Structured Query Language
- Results take too long
  - Indexing limitations
    - Usage of (i.e., table scans)
    - B+ Trees

A Possible Decision Support Solution

Multidimensional Databases
- New effective storage techniques
- Simpler modeling techniques
- Potential for easier query interfaces
and
Intelligent Aggregation
- Appropriate use of redundancy
- More effective indexing algorithms
  - Bitmapped indices
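
The slide names bitmapped indices without showing one. As a rough illustration only (not the indexing scheme of any particular product), the sketch below keeps one bitmap per distinct column value, stored as a Python integer, and answers a two-column filter with a single bitwise AND. The column values and the helper names `build_bitmap_index` and `rows_matching` are hypothetical.

```python
def build_bitmap_index(values):
    """Map each distinct value to an int whose bit i is set when values[i] == value."""
    index = {}
    for row, value in enumerate(values):
        index[value] = index.get(value, 0) | (1 << row)
    return index


def rows_matching(bitmap):
    """Return the row numbers whose bit is set in the bitmap."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


# Hypothetical four-row table with two low-cardinality columns.
PRIORITIES = ["1-URGENT", "2-HIGH", "1-URGENT", "3-MEDIUM"]
QUARTERS = ["Q3", "Q3", "Q4", "Q3"]

priority_index = build_bitmap_index(PRIORITIES)
quarter_index = build_bitmap_index(QUARTERS)

# Rows that are both urgent and in Q3: one AND over two bitmaps.
urgent_in_q3 = priority_index["1-URGENT"] & quarter_index["Q3"]
print(rows_matching(urgent_in_q3))
```

Bitmaps of this kind suit columns with few distinct values, such as order priority or quarter, because each value costs one bit per row.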

The Focus of Our Research

CURRENT RESEARCH
1. Cost comparisons of Relational vs. Multidimensional Decision Support Systems
2. Working towards a multidimensional benchmarking system
   - TPC-H is positioned as a Decision Support benchmark; however, it is based on relational technologies
   - GOAL: Vendor-neutral benchmark for comparing multidimensional database products

FUTURE RESEARCH
- In the long term, show that decisions can be made more easily with multidimensional technology
  - Simpler design, simple interfaces, faster responses

Why Develop a Multidimensional Benchmark?
- Benchmarking is an established method for creating vendor-neutral tests
  - Transaction Processing Performance Council (TPC)
- Benchmarking has been examined in other IS fields, including
  - Server Platforms: Johnson & Gray, 1993
  - eCommerce: Menasce, 2002
- It has been called for specifically in the data warehousing academic community
  - Nemati et al., 2000
and
- Has yet to be done

How Are We Building Our Benchmark
- Based on the TPC-H relational decision support benchmark
- Create a relational dimensional model that forms the basis for the data mart
- Build a multidimensional cube off the dimensional model
- Convert the SQL statement to the equivalent MDX
- Run both the SQL query and the MDX query, report results
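
The last step, running both queries and reporting results, could be driven by a small timing harness like the one below. This is a sketch under assumptions: the slides do not say how timings were taken, so the best-of-N policy, the result-equality check, and the names `time_query` and `compare` are all illustrative. Each query is passed in as a zero-argument callable, so the harness does not depend on any database driver.

```python
import time


def time_query(run_query, repeats=3):
    """Run a zero-argument callable several times; return (best_seconds, last_result)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = run_query()
        best = min(best, time.perf_counter() - start)
    return best, result


def compare(relational_query, multidimensional_query, repeats=3):
    """Time both callables and report whether they returned the same rows."""
    rel_seconds, rel_rows = time_query(relational_query, repeats)
    md_seconds, md_rows = time_query(multidimensional_query, repeats)
    return {
        "relational_seconds": rel_seconds,
        "multidimensional_seconds": md_seconds,
        "results_match": rel_rows == md_rows,
    }


# Stand-in callables; a real run would wrap the SQL and MDX executions instead.
report = compare(lambda: sorted([3, 1, 2]), lambda: [1, 2, 3])
print(report["results_match"])
```

Checking that both sides return the same rows guards against comparing the speed of two queries that do not answer the same question.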

What We Have Done To Date
- Initially have mapped all 22 TPC-H relational queries to potential data marts
  - 3-4 data marts necessary
- Built 2 TPC-H data sets (1GB and 10GB)
- Converted TPC-H Query #4 to MDX
- Ran comparisons on both data sets
- In the process of converting a second query (TPC-H Query #7) for additional analysis/confirmation of gains

TPC-H: Query #4 – Relational SQL

SELECT o_orderpriority,
       COUNT(*) AS order_count
FROM orders
WHERE o_orderdate >= '1993-07-01'
  AND o_orderdate < '1993-10-01'
  AND EXISTS
      (SELECT *
       FROM lineitem
       WHERE l_orderkey = o_orderkey
         AND l_commitdate < l_receiptdate)
GROUP BY o_orderpriority
ORDER BY o_orderpriority

TPC-H schema (tables and columns shown on the slide):
REGION: R_REGIONKEY, R_NAME, R_COMMENT
NATION: N_NATIONKEY, N_NAME, N_REGIONKEY, N_COMMENT
part: P_PARTKEY, P_NAME, P_MFGR, P_BRAND, P_TYPE, P_SIZE, P_CONTAINER, P_RETAILPRICE, P_COMMENT
customer: C_CUSTKEY, C_NAME, C_ADDRESS, C_NATIONKEY, C_PHONE, C_ACCTBAL, C_MKTSEGMENT, C_COMMENT
orders: O_ORDERKEY, O_CUSTKEY, O_ORDERSTATUS, O_TOTALPRICE, O_ORDERDATE, O_ORDERPRIORITY, O_CLERK, O_SHIPPRIORITY, O_COMMENT
supplier: S_SUPPKEY, S_NAME, S_ADDRESS, S_NATIONKEY, S_PHONE, S_ACCTBAL, S_COMMENT
partsupp: PS_PARTKEY, PS_SUPPKEY, PS_AVAILQTY, PS_SUPPLYCOST, PS_COMMENT
lineitem: L_ORDERKEY, L_PARTKEY, L_SUPPKEY, L_LINENUMBER, L_QUANTITY, L_EXTENDEDPRICE, L_DISCOUNT, L_TAX, L_RETURNFLAG, L_LINESTATUS, L_SHIPDATE, L_COMMITDATE, L_RECEIPTDATE, L_SHIPINSTRUCT, L_SHIPMODE, L_COMMENT

Typical Decision Support Request: Answers the question, “How many orders were delivered late in Quarter 3 of 1993, sorted by priority?”
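
To see what Query #4 returns, it can be run on a handful of rows. The sketch below uses Python's built-in `sqlite3` module rather than the SQL Server setup described in the appendix. The two tables are cut down to the columns the query touches, and the rows are invented for illustration. Dates are stored as ISO-8601 text so that string comparison matches date order.

```python
import sqlite3


def make_toy_db():
    """Create an in-memory database with a few invented orders and line items."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE orders (o_orderkey INTEGER PRIMARY KEY,
                             o_orderdate TEXT, o_orderpriority TEXT);
        CREATE TABLE lineitem (l_orderkey INTEGER,
                               l_commitdate TEXT, l_receiptdate TEXT);
    """)
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
        (1, "1993-07-15", "1-URGENT"),
        (2, "1993-08-01", "1-URGENT"),
        (3, "1993-09-30", "2-HIGH"),
        (4, "1993-10-01", "2-HIGH"),    # ordered just after Quarter 3
        (5, "1993-08-20", "3-MEDIUM"),
    ])
    conn.executemany("INSERT INTO lineitem VALUES (?, ?, ?)", [
        (1, "1993-08-01", "1993-08-05"),  # received after commit date: late
        (1, "1993-08-01", "1993-08-07"),  # second late line on the same order
        (2, "1993-08-20", "1993-08-10"),  # on time
        (3, "1993-10-10", "1993-10-12"),  # late
        (4, "1993-10-20", "1993-10-25"),  # late, but order is outside Quarter 3
        (5, "1993-09-01", "1993-08-30"),  # on time
    ])
    return conn


def late_orders_by_priority(conn, start="1993-07-01", end="1993-10-01"):
    """Run the Query #4 text from the slide with the date bounds as parameters."""
    sql = """
        SELECT o_orderpriority, COUNT(*) AS order_count
        FROM orders
        WHERE o_orderdate >= ?
          AND o_orderdate < ?
          AND EXISTS (SELECT *
                      FROM lineitem
                      WHERE l_orderkey = o_orderkey
                        AND l_commitdate < l_receiptdate)
        GROUP BY o_orderpriority
        ORDER BY o_orderpriority
    """
    return conn.execute(sql, (start, end)).fetchall()


print(late_orders_by_priority(make_toy_db()))
```

Order 1 has two late line items but is counted once, because `EXISTS` tests only whether at least one late line item is present.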

TPC-H: Query #4 – Multidimensional Expression (MDX) Equivalent

SELECT
  {[Measures].[O Latecount]} ON COLUMNS,
  {[PriorityDim].children} ON ROWS
FROM Q4Cube
WHERE ([TimeDim].[All TimeDim].[1993].[Quarter 3])
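
The MDX query is short because the late-order count has already been aggregated when the cube is built. The sketch below imitates that idea in plain Python. It is not how Analysis Services stores a cube, and the helper names and rows are hypothetical. The build step scans the detail rows once and keeps one counter per (year, quarter, priority) cell. The query step then reads only those cells.

```python
from collections import Counter
from datetime import date


def quarter_of(d):
    """Calendar quarter (1-4) of a date."""
    return (d.month - 1) // 3 + 1


def build_late_count_cube(orders, lineitems):
    """Aggregate late orders into cells keyed by (year, quarter, priority).

    orders:    iterable of (orderkey, order_date, priority)
    lineitems: iterable of (orderkey, commit_date, receipt_date)
    """
    late_keys = {key for key, commit, receipt in lineitems if commit < receipt}
    cube = Counter()
    for key, order_date, priority in orders:
        if key in late_keys:
            cube[(order_date.year, quarter_of(order_date), priority)] += 1
    return cube


def late_count_by_priority(cube, year, quarter):
    """Answer the Query #4 question from the aggregate alone."""
    cells = {p: n for (y, q, p), n in cube.items() if y == year and q == quarter}
    return sorted(cells.items())


# Invented detail rows for illustration.
ORDERS = [
    (1, date(1993, 7, 15), "1-URGENT"),
    (2, date(1993, 8, 1), "1-URGENT"),
    (3, date(1993, 9, 30), "2-HIGH"),
    (4, date(1993, 10, 1), "2-HIGH"),
]
LINEITEMS = [
    (1, date(1993, 8, 1), date(1993, 8, 5)),     # late
    (2, date(1993, 8, 20), date(1993, 8, 10)),   # on time
    (3, date(1993, 10, 10), date(1993, 10, 12)), # late
    (4, date(1993, 10, 20), date(1993, 10, 25)), # late
]

cube = build_late_count_cube(ORDERS, LINEITEMS)
print(late_count_by_priority(cube, 1993, 3))
```

In this sketch the number of cells depends on how many distinct (year, quarter, priority) combinations exist, not on how many detail rows were scanned to build them.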

The Database Costs Dilemma
- Disk Space?
- Query Speed?
- Build Time?

Results To Date (Query Speed)

TPC-H Query 4                                | 1 GB Dataset               | 10 GB Dataset
Multidimensional                             | 0.33 seconds               | 0.33 seconds
Relational                                   | 46.6 seconds (140x slower) | 925 seconds (~15.5 min) (~2800x slower)
Relational (optimized w/Indices)             | 38 seconds (114x slower)   | Test not run
Relational (optimized w/Indices & Striping)  | 26 seconds (78x slower)    | 247 seconds (~4.0 min) (~750x slower)
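
The "x slower" figures are the relational time divided by the multidimensional time. The short check below recomputes them from the timings reported in the table. The dictionary labels are shorthand introduced here. The computed ratios land within about two percent of the slide's figures. The small gaps would be consistent with the 0.33 second timing being a rounded value, though the slides do not say so.

```python
MULTIDIMENSIONAL_SECONDS = 0.33

# (data set, configuration) -> (reported seconds, reported slowdown factor)
REPORTED = {
    ("1 GB", "relational"): (46.6, 140),
    ("1 GB", "relational, indices"): (38.0, 114),
    ("1 GB", "relational, indices and striping"): (26.0, 78),
    ("10 GB", "relational"): (925.0, 2800),
    ("10 GB", "relational, indices and striping"): (247.0, 750),
}


def slowdown(relational_seconds, cube_seconds=MULTIDIMENSIONAL_SECONDS):
    """How many times longer the relational query took than the cube query."""
    return relational_seconds / cube_seconds


def slowdown_table():
    """Recompute each slowdown factor next to the one reported on the slide."""
    return {
        key: (round(slowdown(seconds)), reported_factor)
        for key, (seconds, reported_factor) in REPORTED.items()
    }


for key, (computed, reported_factor) in slowdown_table().items():
    print(key, computed, reported_factor)
```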

Results To Date (Other Measures)

TPC-H Query 4                     | 1 GB Dataset | 10 GB Dataset
Relational DB                     | 1.2 GB       | 12.5 GB
Relational DB (w/Indices)         | 1.8 GB       | 28.9 GB
Multidimensional Cube Size        | 0.16 MB      | 0.16 MB
Multidimensional Cube Build Time  | 46 seconds   | 356 seconds (~6 minutes)
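
Build time is a one-off cost, so one way to read these numbers is to ask how many runs of Query 4 it takes before building the cube pays for itself. The sketch below works that out from the reported timings. It covers Query 4 only and ignores index build time and cube refreshes, which the slides do not report. The function name is hypothetical.

```python
import math


def break_even_queries(build_seconds, cube_query_seconds, relational_query_seconds):
    """Smallest number of queries n with build + n * cube < n * relational.

    Returns None when the cube query is not faster, so the build never pays off.
    """
    saved_per_query = relational_query_seconds - cube_query_seconds
    if saved_per_query <= 0:
        return None
    # n must be strictly greater than build_seconds / saved_per_query.
    return math.floor(build_seconds / saved_per_query) + 1


# Reported timings: cube query 0.33 s; cube build 46 s (1 GB) and 356 s (10 GB).
print(break_even_queries(46, 0.33, 46.6))   # 1 GB, unoptimized relational
print(break_even_queries(46, 0.33, 26))     # 1 GB, indices and striping
print(break_even_queries(356, 0.33, 925))   # 10 GB, unoptimized relational
print(break_even_queries(356, 0.33, 247))   # 10 GB, indices and striping
```

On these figures the cube build is recovered within the first one or two executions of the query in every configuration tested.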

Preliminary Conclusions
- For a very modest investment, organizations will be able to process very large data warehouses
- The multidimensional data mart is the only practical way (in terms of speed and processing time) to support the end-user decision maker
- Aggregation truly is a substitute for expensive hardware

Next Steps
- Acquire a larger server
- Build 100GB and 300GB TPC-H data sets
- Benchmark both relational and dimensional queries
- Publish results
- Consider ROLAP, HOLAP, MOLAP issues
- Possible extensions to some data mining research
- Possible extensions to decision making through technology research

Thank You for Your Time
Questions?
[email protected]/~rlabrie (for this presentation and paper)

Appendix A: Current System

SOFTWARE
- Microsoft Windows 2000 Advanced Server
- Microsoft SQL Server 2000 Enterprise Edition
- Microsoft SQL Server 2000 Analysis Services Enterprise Edition

HARDWARE
- (1) 1.8GHz Intel Pentium 4 processor
- 768MB RAM
- 240GB HD space (3 IDE 80GB 7200RPM Drives)
- Total cost: $1100 (Hardware only)