
A Paradigm Shift in Database Optimization:
From Indices to Aggregates

Presented to:
The Data Warehousing & Data Mining mini-track – AMCIS 2002 as Research-in-Progress

Ryan LaBrie • Robert St. Louis • Lin Ye
Arizona State University
[email protected] • [email protected] • [email protected]

Agenda

• A need for a shift in optimization strategy
• What our research is focusing on
• How we performed this research
• Update on our results
• Next steps

Why a Shift, Why Now?

HISTORICALLY
• Relational database technology is really good at what it does…
    • Transaction-oriented, operational systems
    • Optimized for data INPUT
• FOCUS: Storage of DATA

TODAY’S ENVIRONMENT
• Large Data Warehouses
    • Used for decision support
    • Need to be optimized for information OUTPUT
• FOCUS: Retrieval of INFORMATION

The Decision Support Problem

Relational DBMS limitations
• Too much data
    • Tera- and petabytes, quickly approaching exabytes
• Too complex queries
    • Structured Query Language
• Too long for results
    • Indexing limitations
        • Usage of (i.e. Table Scans)
        • B+ Trees

A Possible Decision Support Solution

• Multidimensional Databases
    • New effective storage techniques
    • Simpler modeling techniques
    • Potential for easier query interfaces
and
• Intelligent Aggregation
    • Appropriate use of redundancy
    • More effective indexing algorithms
        • Bitmapped indices
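As a rough illustration of the bitmapped-index idea named on this slide, the Python sketch below keeps one bitmap per distinct column value and answers a filtered count with a bitwise AND. The sample rows and the helper names `build_bitmap_index` and `count_matches` are hypothetical and do not come from the presentation; real products typically use compressed bitmaps rather than plain integers.

```python
from collections import defaultdict


def build_bitmap_index(values):
    """Map each distinct value to an int whose bit i is set when row i holds that value."""
    index = defaultdict(int)
    for row_id, value in enumerate(values):
        index[value] |= 1 << row_id
    return dict(index)


def count_matches(*bitmaps):
    """AND the bitmaps together and count the rows that satisfy every predicate."""
    combined = bitmaps[0]
    for bitmap in bitmaps[1:]:
        combined &= bitmap
    return bin(combined).count("1")


# Hypothetical sample columns (illustrative data, not TPC-H output).
priority = ["1-URGENT", "2-HIGH", "1-URGENT", "3-MEDIUM", "1-URGENT"]
quarter = ["Q3", "Q3", "Q2", "Q3", "Q3"]

priority_index = build_bitmap_index(priority)
quarter_index = build_bitmap_index(quarter)

# Rows 0 and 4 are both urgent and in Q3.
urgent_in_q3 = count_matches(priority_index["1-URGENT"], quarter_index["Q3"])
print(urgent_in_q3)  # 2
```

Because each predicate is a single integer, combining predicates costs one AND per bitmap and never touches the detail rows.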

The Focus of Our Research

CURRENT RESEARCH
1. Cost comparisons of Relational vs. Multidimensional Decision Support Systems
2. Working towards a multidimensional benchmarking system
    • TPC-H is positioned as a Decision Support benchmark; however, it is based on relational technologies
    • GOAL: Vendor-neutral benchmark for comparing multidimensional database products

FUTURE RESEARCH
• In the long term, show that decisions can be made more easily with multidimensional technology
    • Simpler design, simple interfaces, faster responses

Why Develop a Multidimensional Benchmark?

• Benchmarking is an established method for creating vendor-neutral tests
    • Transaction Processing Performance Council (TPC)
• Benchmarking has been examined in other IS fields, including
    • Server Platforms: Johnson & Gray, 1993
    • eCommerce: Menasce, 2002
• It has been called for specifically in the data warehousing academic community
    • Nemati et al., 2000
and
• Has yet to be done

How Are We Building Our Benchmark

• Based on the TPC-H relational decision support benchmark
• Create a relational dimensional model that forms the basis for the data mart
• Build a multidimensional cube off the dimensional model
• Convert the SQL statement to the equivalent MDX
• Run both the SQL query and the MDX query, report results
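The final step, running both queries and reporting results, could be organized along the lines of the Python sketch below. It is only an assumption about how such a harness might look, not the tooling used in this research: `time_query` and `compare` are hypothetical names, and the two lambdas stand in for a relational scan and a read of a precomputed aggregate.

```python
import time


def time_query(run_query, repeats=3):
    """Run a zero-argument callable several times; return (best_seconds, last_result)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = run_query()
        best = min(best, time.perf_counter() - start)
    return best, result


def compare(relational_query, multidimensional_query, repeats=3):
    """Time both callables and confirm they agree before reporting the timings."""
    rel_seconds, rel_result = time_query(relational_query, repeats)
    md_seconds, md_result = time_query(multidimensional_query, repeats)
    if rel_result != md_result:
        raise ValueError("queries disagree; timings are not comparable")
    return {
        "relational_s": rel_seconds,
        "multidimensional_s": md_seconds,
        "result": rel_result,
    }


# Hypothetical stand-ins: a full scan versus a lookup of a precomputed total.
data = list(range(100_000))
precomputed_total = sum(data)
report = compare(lambda: sum(data), lambda: precomputed_total)
print(report)
```

Checking that both queries return the same answer before comparing times guards against benchmarking an MDX translation that is not actually equivalent to the SQL.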

What We Have Done To Date

• Initially have mapped all 22 TPC-H relational queries to potential data marts
    • 3-4 data marts necessary
• Built 2 TPC-H data sets (1GB and 10GB)
• Converted TPC-H Query #4 to MDX
• Ran comparisons on both data sets
• In the process of converting a second query (TPC-H Query #7) for additional analysis/confirmation of gains

TPC-H: Query #4 – Relational SQL

    SELECT o_orderpriority,
           COUNT(*) AS order_count
    FROM orders
    WHERE o_orderdate >= '1993-07-01'
      AND o_orderdate < '1993-10-01'
      AND EXISTS
          (SELECT *
           FROM lineitem
           WHERE l_orderkey = o_orderkey
             AND l_commitdate < l_receiptdate)
    GROUP BY o_orderpriority
    ORDER BY o_orderpriority

[Schema diagram: TPC-H tables and columns]

REGION: R_REGIONKEY, R_NAME, R_COMMENT
NATION: N_NATIONKEY, N_NAME, N_REGIONKEY, N_COMMENT
part: P_PARTKEY, P_NAME, P_MFGR, P_BRAND, P_TYPE, P_SIZE, P_CONTAINER, P_RETAILPRICE, P_COMMENT
customer: C_CUSTKEY, C_NAME, C_ADDRESS, C_NATIONKEY, C_PHONE, C_ACCTBAL, C_MKTSEGMENT, C_COMMENT
orders: O_ORDERKEY, O_CUSTKEY, O_ORDERSTATUS, O_TOTALPRICE, O_ORDERDATE, O_ORDERPRIORITY, O_CLERK, O_SHIPPRIORITY, O_COMMENT
supplier: S_SUPPKEY, S_NAME, S_ADDRESS, S_NATIONKEY, S_PHONE, S_ACCTBAL, S_COMMENT
partsupp: PS_PARTKEY, PS_SUPPKEY, PS_AVAILQTY, PS_SUPPLYCOST, PS_COMMENT
lineitem: L_ORDERKEY, L_PARTKEY, L_SUPPKEY, L_LINENUMBER, L_QUANTITY, L_EXTENDEDPRICE, L_DISCOUNT, L_TAX, L_RETURNFLAG, L_LINESTATUS, L_SHIPDATE, L_COMMITDATE, L_RECEIPTDATE, L_SHIPINSTRUCT, L_SHIPMODE, L_COMMENT

Typical Decision Support Request: Answers the question, “How many orders were delivered late in Quarter 3 of 1993, sorted by priority?”
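To make the semantics of Query #4 concrete, the sketch below runs the same SQL against a tiny in-memory SQLite database from Python. The rows are invented for illustration and the tables carry only the columns the query touches; SQLite stands in for SQL Server here purely as an assumption of convenience, so nothing about the timings carries over.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (o_orderkey INTEGER PRIMARY KEY, o_orderdate TEXT, o_orderpriority TEXT);
CREATE TABLE lineitem (l_orderkey INTEGER, l_commitdate TEXT, l_receiptdate TEXT);
""")

# Hypothetical toy rows (not generated by the TPC-H data generator).
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (1, "1993-07-15", "1-URGENT"),  # in Q3, one late line item
    (2, "1993-08-01", "2-HIGH"),    # in Q3, received before the commit date
    (3, "1993-09-30", "1-URGENT"),  # in Q3, two late line items (order counted once)
    (4, "1993-10-01", "1-URGENT"),  # outside the date range
])
conn.executemany("INSERT INTO lineitem VALUES (?, ?, ?)", [
    (1, "1993-08-01", "1993-08-05"),
    (2, "1993-08-20", "1993-08-18"),
    (3, "1993-10-10", "1993-10-12"),
    (3, "1993-10-10", "1993-10-15"),
    (4, "1993-10-20", "1993-10-25"),
])

QUERY_4 = """
SELECT o_orderpriority, COUNT(*) AS order_count
FROM orders
WHERE o_orderdate >= '1993-07-01'
  AND o_orderdate < '1993-10-01'
  AND EXISTS (SELECT *
              FROM lineitem
              WHERE l_orderkey = o_orderkey
                AND l_commitdate < l_receiptdate)
GROUP BY o_orderpriority
ORDER BY o_orderpriority
"""

rows = conn.execute(QUERY_4).fetchall()
print(rows)  # [('1-URGENT', 2)]
```

The EXISTS subquery means an order counts once no matter how many of its line items arrived late, which is why order 3 contributes a single count.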

TPC-H: Query #4 – Multidimensional Expression (MDX) Equivalent

    SELECT
      {[Measures].[O Latecount]} ON COLUMNS,
      {[PriorityDim].children} ON ROWS
    FROM Q4Cube
    WHERE ([TimeDim].[All TimeDim].[1993].[Quarter 3])
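The MDX query reads a pre-aggregated measure instead of scanning detail rows. The Python sketch below shows that idea in miniature: late-order counts are computed once per (year, quarter, priority) cell, and the question is then answered by reading cells. It makes no claim about how Analysis Services stores a cube; the toy rows and the names `build_late_count_cube` and `late_counts` are hypothetical.

```python
from collections import Counter
from datetime import date


def quarter_of(d):
    """Return the calendar quarter (1-4) of a date."""
    return (d.month - 1) // 3 + 1


def build_late_count_cube(orders, lineitems):
    """Pre-aggregate late-order counts by (year, quarter, priority)."""
    late_orders = {key for key, commit, receipt in lineitems if commit < receipt}
    cube = Counter()
    for key, order_date, priority in orders:
        if key in late_orders:
            cube[(order_date.year, quarter_of(order_date), priority)] += 1
    return cube


def late_counts(cube, year, quarter):
    """Answer the Query #4 question by reading cells, without rescanning detail rows."""
    return sorted((p, n) for (y, q, p), n in cube.items() if (y, q) == (year, quarter))


# Hypothetical toy rows: (order key, order date, priority).
orders = [
    (1, date(1993, 7, 15), "1-URGENT"),
    (2, date(1993, 8, 1), "2-HIGH"),
    (3, date(1993, 9, 30), "1-URGENT"),
    (4, date(1993, 10, 1), "1-URGENT"),
]
# (order key, commit date, receipt date); late when commit < receipt.
lineitems = [
    (1, date(1993, 8, 1), date(1993, 8, 5)),
    (2, date(1993, 8, 20), date(1993, 8, 18)),
    (3, date(1993, 10, 10), date(1993, 10, 12)),
    (3, date(1993, 10, 10), date(1993, 10, 15)),
    (4, date(1993, 10, 20), date(1993, 10, 25)),
]

cube = build_late_count_cube(orders, lineitems)
print(late_counts(cube, 1993, 3))  # [('1-URGENT', 2)]
```

In this sketch the number of cells depends on the distinct dimension values, not on the number of detail rows, so the lookup cost stays the same as the underlying tables grow.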

The Database Costs Dilemma

• Disk Space?
• Query Speed?
• Build Time?

Results To Date (Query Speed)

TPC-H Query 4                                  1 GB Dataset                  10 GB Dataset
Multidimensional                               0.33 seconds                  0.33 seconds
Relational                                     46.6 seconds (140x slower)    925 seconds (~15.5 min) (~2800x slower)
Relational (optimized w/Indices)               38 seconds (114x slower)      Test not run
Relational (optimized w/Indices & Striping)    26 seconds (78x slower)       247 seconds (~4.0 min) (~750x slower)
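The "x slower" multipliers are simply each relational time divided by the 0.33-second multidimensional time. The short Python sketch below recomputes them from the reported timings; the variable names are illustrative. The results land close to the multipliers on the slide, and the small differences (for example 115 versus 114) presumably come from rounding in the reported timings.

```python
MULTIDIMENSIONAL_SECONDS = 0.33  # reported for both the 1 GB and 10 GB data sets

# Reported relational timings in seconds, keyed by (configuration, data set).
relational_seconds = {
    ("Relational", "1 GB"): 46.6,
    ("Relational", "10 GB"): 925,
    ("Relational (optimized w/Indices)", "1 GB"): 38,
    ("Relational (optimized w/Indices & Striping)", "1 GB"): 26,
    ("Relational (optimized w/Indices & Striping)", "10 GB"): 247,
}


def slowdown(seconds, baseline=MULTIDIMENSIONAL_SECONDS):
    """How many times slower a query is than the multidimensional baseline."""
    return seconds / baseline


ratios = {key: round(slowdown(s)) for key, s in relational_seconds.items()}
for (config, dataset), ratio in ratios.items():
    print(f"{config:<45} {dataset:>5}: ~{ratio}x slower")
```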

Results To Date (Other Measures)

TPC-H Query 4                       1 GB Dataset    10 GB Dataset
Relational DB                       1.2 GB          12.5 GB
Relational DB (w/Indices)           1.8 GB          28.9 GB
Multidimensional Cube Size          0.16 MB         0.16 MB
Multidimensional Cube Build Time    46 seconds      356 seconds (~6 minutes)
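One way to weigh cube build time against query speed is to ask how many queries it takes before building the cube pays for itself. The Python sketch below works that out from the reported figures, under the simplifying assumptions that only this one query is run, that the cube is built once, and that refresh costs are ignored. The function name `break_even_queries` is hypothetical.

```python
import math


def break_even_queries(build_seconds, cube_query_seconds, relational_query_seconds):
    """Smallest n with build + n * cube_query < n * relational_query."""
    saving_per_query = relational_query_seconds - cube_query_seconds
    if saving_per_query <= 0:
        raise ValueError("cube query must be faster than the relational query")
    # n must be strictly greater than build / saving, hence floor + 1.
    return math.floor(build_seconds / saving_per_query) + 1


CUBE_QUERY_SECONDS = 0.33

# (build seconds, relational seconds) taken from the two results slides.
scenarios = {
    "1 GB, unoptimized relational": (46, 46.6),
    "1 GB, indices & striping": (46, 26),
    "10 GB, unoptimized relational": (356, 925),
    "10 GB, indices & striping": (356, 247),
}

break_even = {
    name: break_even_queries(build, CUBE_QUERY_SECONDS, relational)
    for name, (build, relational) in scenarios.items()
}
for name, n in break_even.items():
    print(f"{name}: cube pays off from query {n}")
```

Under these assumptions the build cost is recovered within the first one or two executions of the query in every reported configuration.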

Preliminary Conclusions

• For a very modest investment organizations will be able to process very large data warehouses
• The multidimensional data mart is the only practical (speed, processing time) way to support the end-user decision maker.
• Aggregation truly is a substitute for expensive hardware

Next Steps

• Acquire a larger server
• Build 100GB and 300GB TPC-H data sets
• Benchmark both relational and dimensional queries
• Publish results
• Consider ROLAP, HOLAP, MOLAP issues
• Possible extensions to some data mining research
• Possible extensions to decision making through technology research

Thank You for Your Time

Questions?

[email protected]/~rlabrie (for this presentation and paper)

Appendix A: Current System

SOFTWARE
• Microsoft Windows 2000 Advanced Server
• Microsoft SQL Server 2000 Enterprise Edition
• Microsoft SQL Server 2000 Analysis Services Enterprise Edition

HARDWARE
• (1) 1.8GHz Intel Pentium 4 processor
• 768MB RAM
• 240GB HD space (3 IDE 80GB 7200RPM Drives)
• Total cost: $1100 (Hardware only)
