Index [] · declaring in metadata, 227 non-additive example, 281 semi-additive example, 182, 227...


Index

34 subsystems of ETL, 430–434
64 bit architectures for data warehouse, 554, 558, 582
2NF (second normal form), 177–178
3NF (third normal form), 133
  business rules, 145–147
  Chris Date criticisms, 147
  complex schemas, 146
  versus dimensional modeling, 140–141
  incompleteness, 146
  primary criticism, 137–139
  query complexity, 138
    performance for BI queries, 138
  real data, 146
  redundancy, 138
  uniqueness, 146
  usability, 138
  CIF use of 3NF, 173–178

A
abstract dimensions, reasons to avoid, 311
abused users, 119
accumulating snapshot fact table, 194. See grain (fact tables)
  combining with periodic and transaction grains, 249
  comparison to other grains, 244–246
  date dimension roles, 300
  fact table loader, ETL subsystem #13, 432
  nulls to be expected, 276
  pipelines and short processes, 246
  procurement example, 241–242
  real time partition, 509
  university and admissions example, 247–248
accurate counting, combining CASE and SUM, 314–315
actions, tracking, 590, 593
activity based costing
  difficult environments for, 265
  modeling income statements, 401
ad hoc attack, 63
Adaptive Software Development, 109
additive facts, 12, 139–142, 182
  examples, 31, 213, 264
  declaring in metadata, 227
  non-additive example, 281
  semi-additive example, 182, 227
address cleaning and standardizing, 374–388, 439
  international addresses, 274, 378–383, 475
administration criteria for dimensional DWs, 228–229
administrative costs, 59
admissions, university, accumulating snapshot example, 247
affinity grouping
  data mining, 617
  market basket example, 421
aggregate builder, ETL subsystem #19, 432
aggregate data quality measures, 470
aggregate fact table definition, 188
aggregate navigation
  criterion for dimensional DW, 228
  dimensional modeling advantages, 142
  of dissimilar fact table grains, 545
  example, 32
  main architecture articles, 536–546
  main algorithm, 542

563106bindex02.indd 693 12/23/09 10:53:32 PM


  metadata and, 539
  minimum metadata requirement, 542
  OLAP considerations, 548, 554
  query tool discipline, 542
  query tools, 639
  recommended data warehouse architecture, 537
aggregate navigator, 185, 188
aggregate processing during ETL, 440, 486
aggregated data
  anticipates the business question, 236
  characteristics, 239
  data mining and, 44
  data quality reporting, 470
  drilling down from, 53
  prematurely, 92, 143
aggregated dimensional models, 239
aggregates
  administration with Type 1 SCD, 25
  design requirements, 540
  fact provider responsibilities, 164
  goals for data warehouse, 539
  metadata requirements, 536, 569
  objection removers, 78–79
  positive and negative impacts, 536
  removing from real time partition, 495, 508
  server configurations, 583
  shrunken dimension tables, 540
aggregation. See aggregated data
  when premature defeats drill down, 143
agile development approach, 107–111
Agile Manifesto, 109
AI (artificial intelligence), 616
airline customer satisfaction dimension, 371
airline flight segment database design, 393–395
airline yield KPI use case, 22–24
airport role playing dimensions, 396
Alda, Alan, interviewing skills, 113
allocating costs
  conflicting requirements, 263–265
  danger of implementing, 4–5, 63, 71–72
allocation, environments, 265
allocation rules for calculating profit, 72
  compliance requirements, 426
allocations
  computing on the fly, 523
  implementing in OLAP, 549
  income statement fact tables, 401–403
  profitability fact tables, 402
  substituting rules of thumb, 44, 402
  version number in audit dimension, 466, 469
alternate reality, type 3 SCD, 27
An Introduction to Database Systems (Chris Date), 36, 137
analytic application lifecycle, five stages, 22, 590–596
  analytics matrix tracking, 158
analytic application reports, 602
  build versus buy, 603
analytic requirements, identifying, 126
analytic tools, 62, 63
analytics matrix, 158
Analytics Workshop, 127
AND queries, 349–350
architecture
  address matching and standardizing, 385–386
  aggregate navigation articles, 536–546
  archiving, long term preservation, 579–582
  BI architecture articles, 560–565, 607–610
  BI comparison queries, 631–634
  BI portal, dashboards, 610–612
  BI upgrading unsuccessful, 674–676
  bus architecture, 38–45, 51–52, 150–151
  catastrophe protection, 576–578
  change data capture, 452–453
  criteria, dimensional DWs, 226–228
  data architecture chapter, 133–178
  data mining articles, 615–629
  data quality, 460–467
  distributed EDW, 56
  drilling across, 189–191, 629–631
  drilling down, 22–24, 186–189
  EDW diagram, 51
  ETL, 105
    34 subsystems of, 430–434
  FTP-based integration, 450
  integrated EDW, 13–21
  late arriving data handling, 491–495
  Lifecycle place for, 97
  master data management (MDM), 516–520
  metadata, 567–571


  Microsoft SQL Server 2005 data architecture, 554–559
  real time, 503–510
  ROLAP versus OLAP, 549–553
  SCDs and time variance of dimensions, 24–27
  security, 83, 575
  separating IT systems, 50–51
  service oriented architecture (SOA), 513–515
  storage area network (SAN), 585–587
  surrogate key processing pipeline, 481–485
  time handling, 192–194
architecture phase, data marts, 39–40
archiving, 2, 8
  encapsulating and emulating strategy, 581
  examples of very long term requirements, 579
  historical letters case study, 347–351
  limitations of media, formats, software, hardware, 580
  metadata examples, 568–569
  migrate and refresh strategy, 581
  requirements affecting ETL design, 428
  very long term digital preservation, 579–582
Atkinson, Toby, multinational name and address resource, 381
atomic data, 61
  advantages, 43
  aggregations, 235–236
  as basis of dimensional models, 196
  drilling down, 47, 188
  normalized form, 200
  storage architectures, CIF versus Kimball, 174
atomic fact tables, 43
  as core foundation, 239
atomic grain, dimensionality, 235
atomic-level behavior data, 55
audit columns for change data capture, 452–453
audit dimension, 465–467
  assembler, ETL subsystem #6, 431
  in data mining, 619
  data quality measures, 468
  detailed design, 469–471
  environmental descriptors, 468
  fact tables, 187
automobile collisions, factless fact table, 257
automobile policy coverages, insurance case study, 278, 392
availability of data warehouse, 48, 53, 558
  minimizing offline time, 502
  taking aggregates offline, 440
averaging over time, 182
awkward formats, 92

B
B-tree indexes, 37, 269, 508
back pointers to operational systems, 487–488
back room, 653. See ETL
backup and recovery use cases, 79
backup system, ETL subsystem #23, 433, 578
backups
  data staging, 8
  objection removers, 79
balance transactions, 279
BEEP, 237
begin- and end-effective time stamps. See time stamps
behavior analysis, 598–600, 621–625
  from clickstream, 410–413, 415–417
  market basket analysis, 420–424
  purchase behavior security risks, 573
behavior dimension, 231, 324, 643
behavior tags, 368–371
  recency, frequency, intensity, 337–338, 368
behavioral queries, 640–644
  non-behavior, 26–262
Berry, Michael, 600, 625
best practices
  building DW/BI systems, 103
  establishing operating procedures, 655
BI (business intelligence)
  applications, chapter 13, 589–650
  architecture
    unsuccessful, 675
    upgrading, 674–676
  compliance, 596–597
  CRM, 599–600
  custom tools, 520–522
  dimension browsing, 28
  drilling across, accreting measures, 29
  drilling down, 28
  ease of use, 29


  environment
    launching, 652
    monitoring operations, 653–654
  pervasive, 532–533
  portal, 610–612
  queries, 28
    improving performance, 78
  reports, 28. See reporting
  sequential behavior analysis, 597–598
  tools, 20–21
    licenses, 682
    sequential computation difficulties, 635
  user interface, 28
  value with, 589–600
BI tool interfaces affecting ETL design, 429
bitmap indexes, 81, 269, 325, 559, 562
blended development approach, top-down and bottom up, 103
book references
  building interpersonal skills, 94
  building public speaking skills, 95
  building written communication skills, 95
  understanding the business world, 94
bottlenecks
  authentication and access, 578
  memory, 582
  scalability, 507
bottom-up approach, Kimball Lifecycle, 100, 128
bottom-up market basket algorithm, 423
boundaries with finance, IT, legal, and end users, 4–6
bridge tables
  account to customer in banking, 343, 344
  begin- and end-effective time stamps, 345
  correctly weighted report, 342
  definition, 335
  diagnosis tracking in health care, 342
  ETL subsystem #15, bridge table builder, 432
  for multiple alternate hierarchies, 366. See hierarchies
  for variable depth hierarchies, 336, 357–359. See hierarchies
  for satisfaction tracking, 373
  impact report, 342
  keyword tracking, 348
  Microsoft Analysis Services alternative, 554
  natural keys, 362–363
  need for surrogate keys, 344
  reports-to dimension, separate, 361
  SIC codes, 343
  surrogate keys, 344, 360–361
  updating, 346
  weighting factor, 342, 345
Brin, David (The Transparent Society: Will Technology Force Us to Choose Between Privacy and Freedom?), 574
browse a dimension, BI tool user interface design, 28, 135, 638
budgeting case study, 403–407
budgeting data aligned with planning data, 545
bug tracking system, 601
bus architecture, 46, 51, 150–151, 172. See architecture
  distributed systems, 151
  independent from centralization, 151
bus matrix
  analytics, 158
  consolidated processes in, 240
  detailed implementation matrix, 159–160
  drill down into, 159–161
  executive communication, 15, 154
  extensions, 158
  feasibility grid, benefit versus feasibility, 131
  grain, altering, 159
  for integrated EDW, 15, 129
  for manufacturing, 16
  mishaps, 157
  opportunity, 158
  processes versus departments, 130
  preliminary bus matrix and bubble chart, 218
  primary introductions, 151–159
  strategic initiatives versus business processes, 127, 158
business acceptance, 88, 99, 113, 664–670
Business Dimensional Lifecycle (Kimball Lifecycle), 96–99
business intelligence. See BI (business intelligence)
business needs affecting ETL design, 47, 53, 204, 426
business phase of data mining, 626
business processes


  as basis of dimensional models, 197
  versus departments, 123
  fact table grain, 126
  identifying, 124, 125–127
    consequences of incorrect, 126
  subject areas, 61
  tying to strategic initiatives with matrix, 127
business realignment, 667
business reengineering, 2, 461
  driven from poor data quality, 459
  organizational steps, 459
business requirements gathering, 2, 3, 5, 83, 113
  conversationality, 114–115
  curiosity, 114
  data audits, 116
  difficult users, 119
  listening skills, 116
  preparing beforehand, 115
  wants/needs determination, 118–119
business rule screens, 122, 463
business rules, 145
  screens, in ETL architecture, 463
  supported by data models, 145
business sponsor, 86–89, 149–150, 655, 662–670
business user’s responsibilities, 216

C
calendar date dimension design, 291
calendar dimension, 293–294. See date dimension; time dimension
  design, 435
  multi-enterprise, 339
  primary key, date format, 288
calendars
  international dates, 476
  multinational designs, 376
case studies
  budgeting, 403–407
  clickstream, 409–413, 413–417
  growth scenario, 658–661
  human resources, 396–400
  insurance, 389–393
  profitability, 400–403
  text document searching, 417–420
  travel, 393–396
catastrophic failures, 576
catastrophic SCD type 1 invalidation using OLAP, 552
causal dimensions
  describing promotions or behavior, 235, 308–311, 674
  design recommendations, 310
  sourcing the data, 309
causal factors
  analytic application, 22
  determining, 590, 592
CDI (customer data integration), 105, 155
central data warehouse team, 80–83
centralization, 168
  decentralized reality, 47, 52
  inappropriate, 60
  logical design and integration, 169
  objection remover false promise, 76, 78, 79
  risks, 103–104
  risks of physical but not logical, 169
  steps to migrate from disparate data, 170
centralized architecture comparison to planned economy, 178
centralized customer management system, 78
centralized DW/BI systems, risks, 59
change
  anticipating, 47, 53
  continuous, 60
  source data changes, 7, 195–196
change data capture, 6–7, 452–453
  with CRC (cyclic redundancy checksum), 486–487
  with diff compare, 453
  ETL subsystem #2, 431
change impact on dimensional models, 9. See graceful extensibility
checkups, 661–667
choice presentation in web-oriented data warehouse, 563
CIF (Corporate Information Factory), 99
  compared to Kimball bus architecture, 171
  hybrid with Kimball approach, 175
  and Kimball approaches, fundamental differences, 174
claims periodic snapshot fact table, insurance case study, 391


claims transactions fact table, insurance case study, 390
classification queries, 642
classifying, data mining activity, 616–617
cleaning data, 431, 439, 443, 454–459
  as prelude to data mining, 618–619
  using regular expressions, 477–481
clickstream case study, 409–413, 413–417
clickstream data source challenges, 411
clickstream dimension, 410–412
  events, 413
  session type, 415–417
  visitor dimension, 414–415
  web page object, 415
  web sessions, 413
clickstream facts, 412
clueless users, 121
clustering, data mining activity, 600, 616
codes
  expand as verbose text in dimensions, 199
  interpreting into text, 618
cognitive models, 91
column screens in data quality architecture, 122, 463
comatose users, 120
combinatorial explosion in market basket analysis, 422
comment fields, removing from fact tables, 275
commitments fact table as part of budgeting value chain, 404
common labels in conformed dimensions, 149
communication, bus matrix as a vehicle for, 154
comparisons
  difficulty using SQL, 631–634
  implementing with drill across, 23
compliance, 2, 3
  affecting ETL design, 426
  international data, 476
  as part of data steward role, 166
  requirements for data warehouse, 407–409, 596–597
compliance-enabled fact and dimension tables, 408
compliance reporting, ETL subsystem #33, 434
compression of fact tables with careful design, 274
  dictionary compression in SQL Server 2008, 557
  improving query performance with, 556–558
conceptual models as part of user interface design, 91
configuration choices for data warehouse servers, 583
conformed dimension assembler, ETL subsystem #8, 431
conformed dimensions, 15, 51–54, 141, 244, 516
  affecting ETL design, 428
  anonymous data warehouse key, 41
  bus matrix, 154
  commitment to use, 42
  cost if not conformed, 91
  criterion for dimensional DW, 227
  data warehouses and, 41
  definition, 40, 149, 190
  designing, 41
  establishing, 40
  executive support required, 149
  fixing stovepipe data marts, 202
  grain, 41
  importance of, 40–41
  integrated EDW, 198
  integration and, 105, 108
  no need for, 45
  replication, 485–486
  unconformed and, 91
  variations, 42
conformed facts, 15, 42
  cost if not conformed, 91
  cross-process calculations, 201
  definition, 150
  unconformed, 91
  using analytically, 162
conforming data, ETL subsystem #8, 431
conforming dimensions at query time, unrealistic goal, 523
conforming nonconformed dimensions, 676–677
consolidated fact tables, 240
  combining processes, 240
  example, 240–241
constraint targets, always in dimension tables, 198
constraints on data warehouse design, 58–60
cookies in clickstream data, 411


corporate data model, 93
Corporate Information Factory. See CIF
correctly weighted report using bridge table, 342, 345–346. See impact report
correlated dimensions, splitting or combining, 307
correlated subqueries, 641
cost allocations. See allocations
costs, 90
  administrative, 59
  hardware, 59
  implementation costs, 59
  software, 59
  sources, 90
  surprises, 59
coverage dimension in insurance examples, 243–246, 390–393
coverage fact table, promotion tracking example, 257–258
coverage tables, finding what didn’t happen, 260–262
CRC (cyclic redundancy checksum), 8
  in change data capture, 486–487
criteria for dimensional DWs, 226
critical thinking, 58
critique data warehouse, 673–674
CRM (customer relationship management), 599–600
cross-browsing, 638–639
cultural correctness, multinational name and addresses, 379
currencies
  conversion version in audit dimension, 466, 469
  design, 435
  international data, 476
  in multinational designs, 377
custody of data, compliance responsibility, 407
custom tool development for ETL and BI, 520
customer dimension extensions, 367
customer modeling issues, 366–374
customer profiling, factless fact table, 259
Customers.Com (Seybold), 525
cyber warfare, 576–577
cyclic redundancy checksum. See CRC

D
Dangermond, Jack on GIS systems, 384
dashboards, 612–613
data
  aggregated prematurely, 92
  awkward formats, 92
  as both fact and dimension, 277
  delivery, slow, 92
  integration, external, 449–450
  integration manager, ETL subsystem #21, 432
  locked, 92
  profiling, 2, 3, 121
    affecting ETL design, 427
    business rule screens, 122
    column screens, 122
    role in organization, 462
    structure screens, 122
  quality
    1996 perspective, 454
    affecting ETL design, 427
    aggregate data quality reporting, 470–471
    comprehensive architecture, 460
    critically dependent applications, 454
    culture steps, 461
    error event handler, ETL subsystem #5, 431
    error responses, 465
    estimating from historical data, 471
    international design issues, 474
    measures in audit dimension, 468
    no history, 473–474
    predictable changes, 474
    six sigma, 467
    standard deviation, 472
    X-11 ARIMA, 474
  staging, 8, 50. See ETL
    area, 437
  stewardship, 156, 165
    communications, 167
    goal of program, 165
    master data management, 517
    need for, 165
    qualifications necessary, 167
    responsibilities, 166–167
  transformations, 618–620
    tool-dependent, 620–621
  wrangling, 7


data audits during requirements gathering, 116. See data profiling
data cleaning, 439
  applications, 458
  ETL subsystem #4, 431
  regular expressions, 477–481
  steps, 456
data conformer, ETL subsystem #8, 431
data flow, ETL system, 446–447
data governance as foundation for MDM, 520. See governance
data marts, 39
  architecture phase, 39–40
  avoid departmental definition, 123
  bus architecture, 46
  business process subject areas, 61
  data warehouse bus architecture, 51–54
  dimensional modeling and, 50–51
  higher level, 44
  presentation area, 51
  quick and dirty data warehouse myth, 202
  stovepipe data marts, 39
data mining, 44, 63, 615–617
  affinity grouping, 617
  aggregated data and, 44
  business phase, 626
  categories, 616–617
  classifying, 616
  clustering, 616
  data mining phase, 626–628
  data transformations, 618–620
    tool-dependent, 620–621
  data warehouse responsibilities, 623
  database architecture and, 624
  estimating, 617
  explain variance of KPI, 24
  metadata, 629
  observations, 622–624
  operations phase, 628–629
  origins, 615–616
  predicting, 617
  process flow chart, 625
  references, 24
data profiling, 2–3, 121–123, 462
  data quality driver, 427
  ETL subsystem #1, 430
data quality architecture articles, 460–481
Data Quality: The Accuracy Dimension (Jack Olson), 427
data warehouse bus architecture. See bus architecture
data warehouse manager responsibilities, 70–73
data warehouse not needed, objection remover, 77
data warehouses
  building in 15 minutes, 80
  bus architecture, data marts, 51–54
  central data warehouse team, 80–83
  costs, sources, 90
  mission, 73–74
  planning, 38
  publishing results, 48–49, 54
  securing results, 49, 54
  as Web-enabled system, 55
data webhouses, 55–56, 410
data wrangling, 6–8. See change data capture
database market split, 35
Date, Chris
  An Introduction to Database Systems, 36, 137
  criticisms of E/R models and business rules, 147
  on dimensional models, 137, 147
date dimension
  activity date versus booking date, 492
  advantages, 225
  attached to every dimensional model, 41, 197, 225
  conformed EDW, 82, 154
  design, 291
  hierarchies in, 352
  incompatible rollups, 292
  keys, 198, 288, 297
  latest thinking, 295
  multiple dates in accumulating snapshot, 246
  as outrigger dimension, 292
  as outriggers, 299
  recommended design, 250, 294
  role playing, 298
  used in SCD2 processing, 26, 193
DBAs (database administrators), 7
decentralized development, 47, 52, 60. See distributed architecture


decodes needed in dimension tables, 198, 224
deduplicating source data in ETL system, 18, 439, 487
deduplication system, ETL subsystem #7, 431
degenerate dimensions, 182, 271
  airline flight segment example, 394
  bill of lading example, 396
  grouping fact table rows, 271
  health care billing example, 235
  invoice header example, 264, 267
  market basket analysis, 271
  multiple keys in reference dimension, 272
  order line item example, 300
  parent-child fact tables, 264
  shipment invoice example, 465–466
  storing control numbers, 225
  tie back to operational system, 271
  web page event example, 414
demographic tracking, factless fact table, 259
demographics mini-dimension table, 320–322, 367
  ETL processing steps, 321
  permissible snowflake table, 337–338
denormalized dimension tables, 136, 181, 197, 334, 352
denormalized models, 9
departmental data marts to be avoided, 10, 47, 86–88, 123, 128
dependency analysis during ETL, 442. See impact analysis
deployment of data warehouse, 97, 651–661
  back room, 653
  BI applications, 609
  dimensional relational (ROLAP) versus OLAP, 549–552
  front room, 651–653
  monitoring operations, 653–654
  rapid, 47, 53, 60
descriptive attributes, verbose, 199
descriptive model, 60–61. See normative model
design drivers, 74
design review, 221, 223–231
design steps, 210, 405
  1 Choose the process, 211
  2 Choose the grain, 211, 223
  3 Choose the dimensions, 211
    3b Confirm the dimensions, 212
  4 Choose the facts, 213
  5 Store precalculations, 213
  6 Round out dimensions, 214
  7 Choose database duration, 215
  8 Specify SCDs, 215
  9 Decide physical design, 215
  review and validation, 221–223
design team roles, 3, 40–41, 80, 93, 216
destruction of facility, 576
diagnosis bridge table dimension in health care, 341
diff compare, 453. See change data capture
digital preservation, 579–582
dimension design response for new attributes, 195
dimension independence, 180
dimension keys, durable, natural, surrogate, 18
dimension limitations in OLAP, 325
dimension manager, 17–19
  joint responsibilities with fact provider, 21, 514
  LDAP, 21
  MDM resource, 514
  responsibilities, 17, 163–164
dimension manager system, ETL subsystem #17, 432
dimension notification criterion for dimensional DW, 229
dimension processing in ETL, 438. See SCDs
dimension size limitations, 325
dimension tables
  accurate counting, 314–315
  conformed, 141
  decodes, 224
  design process, 214–215
  many-to-one relationships, 197
  primary keys, surrogate keys, 198
  replicating to fact providers, 18
  row labels, 198
  shrunken dimensions, 19
  snowflaked, 181, 333
  source of constraints and row headers, 139
  text facts, 199
  version numbers, 19, 20
dimension update strategies, 325


dimensional attributes, overwrites, 317. See SCDs
dimensional criterion for dimensional DW, 228
dimensional designs, graceful modifications, 194–195
dimensional DWs
  administration criteria, 228–229
  architecture criteria, 227–228
  criteria, 226
  expression criteria, 229–231
  rating scheme, 226
dimensional models. See DM
  aggregated, 239
  atomic data, 196
  based on reports, 200
  business processes, 197
  date dimensions, 197
  departmental data marts, 144
  extensible designs, 203
  graceful extensibility, 9, 43, 142, 194–195, 228
  motivation and advantages, 139
  normalized model comparison, 134–137
  versus normalized models, information content, 140
  null usage, 276–277. See nulls
  populating, 238
  query evaluation strategy, 135
  relational models and, 9, 181
  source data changes, 195–196
  summarized information, 238
  symmetrical approach, 141
  themes, 278
dimensional queries, processing, 136
dimensional relational versus OLAP, 550–551
dimensional replication criterion for dimensional DW, 228
dimensional scalability criterion for dimensional DW, 228
dimensional star schema, fact tables, 181. See star join, star schema
dimensional symmetry criterion for dimensional DW, 228
dimensions, 10, 87, 180
  abstract, 311–312
  behavior, 324
  causal, 308–311
  conformed, 15, 244
  correlated, splitting/combining, 307
  degenerate, market basket analysis, 271
  degenerate dimensions. See degenerate dimensions
  design process, 211–212
  facts, data as both, 277–278
  generic, 311–312
  hierarchies, multiple, 184
  hot-swappable, 312–314
  independence, 192
  joins, avoiding, 307
  junk dimensions. See junk dimensions
  keyword dimension, 347–351
  mini, 326–327, 367
  missing, 181
  reference dimensions, 272–273
  retaining headers as, 266
  smart keys, 199
  user interface, 11
  verbose description attributes, 199
directory server, 427
dirty data, cleaning up hierarchies, 354
disorders of data warehouse. See DW/BI checkups
distraction avoidance in web-oriented data warehouse, 564
distributed architecture, 56, 60, 103, 151. See integration
  catastrophic failure and, 577
distributed systems, 151
DM (dimensional modeling), 133–134. See dimensional models
  3NF comparison, 140–141
  data marts and, 50–51
  defending, 144
  Microsoft Analysis Services, 553–554
  myths. See myths
  overview, 139–140
  retail databases, 143
  rules, 196
  snowflaking, 143–144. See snowflaked dimension tables
  stovepiping, 143
  strengths, 141–142
  symmetrical approach, 141–142


  top-down design, 135
  understanding of, 143
drill-across reports, 14, 20, 42. See integration
  grouping columns, 185
drilling across, 14, 150, 162, 240. See integration
  article, 189–191
  conformed dimensions, 82
  danger using two fact tables in same query, 189
  definition, 185
  detailed implementation, 190
  different grains, 33
  fact tables, 185
  fact tables of dissimilar grains, 191
  implementing, 190–191
  outer join, 190
  queries, 190
  sort-merge, 190
  SQL, 629–631
drilling down, 186–189
  atomic data, 47, 53, 188
  BI tool user interface design, 28
  computation, 187
  data quality attributes, 187
  definition, 183
  grouping columns, 184
  not in a hierarchy, 184, 187
  predetermined hierarchy, 23
  precise technical comments, 187
  row headers, 186
  user interface, 188
drilling up, 184
durable key, 18–19, 27, 328–331, 337. See natural key
duration of data storage, 215
DW/BI (data warehouse/business intelligence), 1
  business acceptance, 667–670
  business realignment, 667
  business representatives, 668–669
  centralized, risks, 59
  checkups
    business acceptance disorder, 664–665
    business sponsor disorder, 662–663
    cultural/political disorder, 666
    data disorder, 663–664
    infrastructure disorder, 665–666
  custom tools, 520–522
  failure points, 93
  feedback, 669–670
  interview team, 668
  isolationist approach, 84
  management education, 670–673
  marketing system, 656–658
  performance, 682
  projects, listing, 106
  system operations planning, 654

E
E/R modeling, Chris Date on, 147
EAI (enterprise application integration), 523
early arriving facts in real time applications, 494–495
ease of use, 49, 54
  BI tool acceptability test, 30
EDM (enterprise data model), 9–10
education of senior staff, 672
EDW (enterprise data warehouse), 106. See CIF
  architectural requirements, 47
  architectures, normalized versus dimensional, 176–178
  bus matrix, 15–16
  conformed dimensions and facts, 17
  integrated, 13–21, 108, 161–164
  MDM and, 15
  practical approach, 530
  reports, 14–15
employee dimension
  best practice design, 359–365
  bridge table using natural keys, 362
  dimension outrigger example, 335
  fixed depth hierarchy compromise, 363
  human resources example, 396–400
  insurance matrix example, 160
  normalized design example, 495–497
  pathstring attribute for ragged hierarchy, 364
  reports-to bridge table, 360
  SCD processing example, 25–27
  separate reports-to dimension, 361
  telecomm matrix example, 152
  time stamps, 399–400
enterprise application integration (EAI), 523


ER (entity-relationship) models, 50, 133, 147
  normalized, 57, 62
ERP (enterprise resource planning)
  data warehouse limitations, 526
  role of, 526–528
  systems
    as primary data warehouses, 56
    relationship to data warehouse, 525
  vendors, 56
error event handler, ETL subsystem #5, 431
error event schema, recording data quality events, 463–464
error responses, 465
ESRI, GIS vendor evaluation, 384–386
  address standardizing, 385
  extending SQL for geographic queries, 386
estimating (data mining), 617
ETL (extract, transform, and load) staging systems, 62
  aggregate processing, 440
  architecture, 105
  audit dimension, 465–466
  bottlenecks, 683
  business rule screens, 463
  cleaning and conforming, 98
  column screens, 463
  custom tools, 520–522
  data quality error event handler, 431. See data quality
  data staging area, 437
  deduplicating source data, 487
  delivering, 98
  dependency analysis, 442–443
  design
    archiving requirements, 428
    BI tool interfaces, 429
    business needs, 426
    compliance, 426
    conformed dimensions, 428
    data profiling, 427
    data quality, 427. See data quality
    example, 445
    foundations, 443
    integration requirements, 428
    latency, 428
    licenses, 429–430
    lineage requirements, 428
    planning inputs, 446–447
    security, 427–428
    skills of staff, 429
    staging, 428
    tradeoffs, 434
  designer’s responsibilities, 216
  documentation, 445
  extract processing, 441
  extracting, 98
  hierarchy validation, 354, 440
  householding, 439–440
  impact analysis, 442–443
  junk dimensions and, 307
  lineage analysis, 442–443
  managing, 98
  operational resilience, 442
  planning steps, 445
  quality screens, 463
  referential integrity, 438
  requirements, 425
  self-documentation, 442
  snowflake design, 336
  structure screens, 463
  subsystems
    accumulating snapshot grain fact table loader, 432
    aggregate builder, 432
    audit dimension assembler, 431
    backup system, 433
    change data capture system, 431
    compliance reporting, 434
    conformed dimension assembler, 431
    data cleaning system, 431
    data conformer, 431
    data integration manager, 432
    data profiling system, 430
    deduplication system, 431
    dimension manager system, 432
    error event handler, 431
    extract system, 431
    fact table loader, 432
    fact table provider system, 432
    hierarchy dimension builder, 432
    impact analyzer, 433
    job scheduler, 433


    junk dimension builder, 432
    late-arriving data handler, 432
    lineage and dependency analyzer, 433
    metadata repository manager, 434
    multi-dimensional cube builder, 432
    multi-valued dimension bridge table loader, 432
    OLAP cube builder, 432
    parallelizing system, 433
    periodic snapshot grain fact table loader, 432
    pipelining system, 433
    problem escalation system, 433
    quality screen handler, 431
    recovery and restart system, 433
    SCD processor, 432
    security system, 433
    sort system, 433
    special dimension builder, 432
    surrogate key creation system, 432
    surrogate key pipeline, 432
    transaction grain fact table loader, 432
    version control system, 433
    version migration system, 433
    workflow monitor, 433
  time zones and, 434–435
  tool pros and cons, 442
  visual flow, 442
euro, special business rules, 377
event fact tables, 182
exception reports, analytic applications, 22
exceptions
  analytic application, 22
  handling, 451–452
  identification, 590, 592
explicit declaration criteria for dimensional DWs, 227
expression criteria for dimensional DWs, 229–231
extensibility
  of dimensional models, 142, 203, 206
  graceful, 9
  new data source, 205
extensible markup language. See XML
external data integration, 449–450
extract, transform, and load. See ETL
extract processing during ETL, 441
extract system, ETL subsystem #3, 431

F
fact dimension, modeling sparse facts, 281
fact provider, 17, 19–20, 164
  joint responsibilities with dimension manager, 21
  LDAP, 21
fact table grains. See grain (fact tables)
  transaction, periodic snapshot, accumulating snapshot, 193, 243–244
fact table loader, ETL subsystem #13, 432
fact table provider system, ETL subsystem #18, 432
fact tables, 37
  atomic, as core foundation, 239
  audit dimensions, 187
  combining types in periodic and accumulating snapshots, 249
  consolidated
    combining processes, 240
    example, 240–241
  cost cutting and, 682–683
  departmental views, 9
  design patterns, 273–283
  design process, 213–214
  dimensional star schema, 181
  drilling across, 185, 240. See drilling across; integration
  factless, 255–258
  flexible width, 281
  grain. See grain (fact tables)
    declaring, 30
    declaring before dimensions added, 199
    uniformity, 197
  granularity, 43–44
  instantaneous transactions, 278
  many-to-many relationships, 197
  parent/child, 262–268
  partitioning with smart date keys, 296–297
  pivoting, 282–283
  populating, 104
  primary keys, 181
  purpose of, 30
  response to measurement events, 9, 11
  rows, grouping with degenerate dimensions, 271


  scalable width, 281
  second-level, 240
  size reduction with careful design, 274
  sparse facts, 280–282
  surrogate keys, 33
    reader suggestions, 269
    where to use, 268
  time stamps, 192
  types. See grain (fact tables)
  used as dimensions, 278
factless fact tables, 182, 255
  attendance tracking, 256
  automobile collisions, 257
  customer profiling, 259
  demographic tracking, 259
  SCDs, 258
facts, 11, 134
  additive facts, 12, 182, 227
  conformed, 15, 42–43
  design process, 213
  dimensions, data as both, 277
  measurement events, 11
  non-additive, 31, 227, 281
  nulls as, 277
  numeric measurements, 179
  semi-additive, 182, 227, 509, 548, 554, 639
  unconformed, 91
  user interface, 11
feedback from end users, 669–670
finance, boundaries with, 5
financial product dimensions, 82, 313, 338–339, 436
financial services date dimension roles, 298
first-level data marts, 44. See second-level; integration
first-level subject area, 153. See second-level; integration
fixed-depth hierarchy, 363–364
fixed-width databases, 338
FK (foreign key), 31
flat file, 62, 437, 440–441, 624
flexible width fact tables, 281
foreign keys, nulls as, 276–277
foreign keys in dimensional schemas, 31, 180–181
Friedman, Thomas (The World is Flat), 474
front room, 32–33, 50–51, 651–653
  architecture, 560–565
  BI applications, 589–649
  metadata, 569–570
FTP based integration, 450–452
fundamental fact table grains, 243–246. See grain (fact tables)

G
general ledger account dimension in budgeting value chain, 406
general ledger (GL), 4–5
  fact table example, 336
  tying to operational results, 5
generic dimensions, avoiding, 311
geocoder for ESRI GIS parsing of addresses, 386
geographic information system (GIS), link to data warehouse, 383
  address standardizing, 385
  evaluation of ESRI GIS vendor, 384–386
  extending SQL for geographic queries, 386
geography dimension in a conformed EDW, 82
GIS. See geographic information system
GL. See general ledger
governance, 108
  driving MDM initiatives, 520
  driving SOA initiatives, 513–514
graceful extensibility, 9, 43, 53, 59
graceful modification criterion for dimensional DW, 228
graceful modifications to dimensional designs, 140, 142, 194–195, 309
grain (fact tables), 30
  accumulating snapshot grain, 32, 194
  capture lowest possible, 10
  clickstream data, 412
  conformed dimensions, 41
  declaration, 30, 182
    before design begins, 200, 233
    precedes key definition, 235
  definition, 11
    foundation of design, 223
    as definition of business event, 237
  design process, 211, 223
  drilling across, 33


  fundamental grains comparison, 243–246
  mismatches, mixed grain in fact table, 200
  mixed grain problems, 223–224
  periodic snapshot grain, 32, 194
  transaction grain, 32, 193
  uniform throughout each fact table, 197
grouping columns, 183
  drill-across reports, 185
  drilling down, 184
  drilling up, 184
  row headers, 186
  in a SELECT list, 186
growth management, 658–661

H
hardware cost, 59. See costs
header/line item designs, 262–268
heterogeneous product design, 82, 313, 338–339, 436
hierarchies, 351
  alternate, 365–366
  design for maintainability, 351
  dimensions, multiple, 184
  dirty sources, 354–355
  drilling down, 23, 28, 187
  fixed depth, 197, 363–364
  mistakes in design, 199, 224, 334
  multiple in a dimension, 229, 352
  ragged, 229–230, 355–356
    pathstring attribute, 364–365
    shared ownership, 358
  referential integrity, 352
  single dimension, 224
  splitting into multiple dimensions, 199
  validation
    during ETL, 440
    in ETL system, 354
hierarchy bridge table
  design, 357
  manufacturing parts explosion, 358–359
  shared ownership, 358
  time varying, 358
hierarchy dimension builder, ETL subsystem #11, 432
hierarchy management with custom tool, 521
historical dimension rows, 488–490
historical letter data warehouse, 347
historically accurate attributes, lack of, 531
history
  preservation, data warehouse requirements, 191, 579–582
  seamlessness, 61
Holtzman, David, 115
hot partition in real time systems. See real-time partition
hot response cache in real time systems, 504
hot-swappable dimensions, 312–314
  criterion for dimensional DW, 231
  multi-client security, 313
householding during ETL, 379, 439, 455, 458
hub and spoke architecture, 171. See CIF
human resources case study, 396–400. See employee dimension
hybrid approach, combining CIF and Kimball, 175
hybrid SCDs (slowly changing dimensions)
  combination type 1, 2, 3, 326
  type 1, 2, 3, 326–328
  type 1 + 2 tracking with natural keys in fact table, 328
  type 1 fact and type 2 mini-dimension, 327
  type 6 combination of all three types, 327

I
impact analysis during ETL, 442
impact analyzer, ETL subsystem #29, 433
impact report using bridge table, 343, 346. See correctly weighted report
implementation cost, 59
income statement fact table
  allocations, 402
  design, 401
incompatible data, dealing with, 21, 26, 43, 45, 83, 91, 111, 142, 149, 184, 191, 207, 282, 292, 313, 339, 373, 391, 514, 523
incompatible technologies, 14, 48, 53, 56, 60, 75, 78, 290, 380, 455
increased granularity of dimension, design response, 196
indexes for DW/BI databases
  B-tree indexes, 37, 269, 508


  bitmap indexes, 81, 269, 325, 559, 562
  substring index, pattern index for high speed searching, 350
Inf*Act, Nielsen syndicated reporting, 8
inheritance, line items inheriting dimensionality, 267
insurance coverage limits example, 12
insurance data warehouse examples, 129–130, 243–245, 257, 320–322, 389–393
integration. See dimension manager; fact provider, drill-across
  conformed dimensions and, 108, 198
  definition, 161–162
  drill-across as litmus test, 14
  EDW, 13, 38, 161–164
  MDM, 13
  measures, 162–163
  normalization and, 207–208
integration requirements affecting ETL design, 428
international. See multinational
internet. See web
interviews, 5
  gathering requirements, tactic and objective, 210
  interviewing techniques, 113–121, 668–670
investment banking
  custom hot-swappable dimensions, 312–313
  junk dimensions, 303
IT (information technology)
  boundaries, 6
  functions, centralizing, 79
  licenses, 2
  partnership with, 91
  review, 222

J
job scheduler, ETL subsystem #22, 433
joins between dimensions, avoiding, 307
junk dimension builder, ETL subsystem #12, 432
junk dimensions
  advantages, 306
  combining or separating, 305
  creating, using, maintaining, 497–499
  decided granularity, 305
  ETL choices for creating, 307
  investment banking example, 303
  when to use, 275

K
Kay, Alan (father of personal computer), 560
key performance indicator. See KPI
keys in dimensional schemas, 180
keyword dimension, 347–351
Kimball Approach, 99
  bus architecture. See bus architecture
  CIF, fundamental differences, 174
  enterprise versus departmental, 128
  hybrid with CIF, 175
  measurement processes versus departmental reports, 204
  myths, 206
Kimball Lifecycle, 96–99
  agile approach and, 110
  bottom up approach, 100
  business intelligence track, 99
  business requirements, 98
  data track, 98
  deployment, maintenance, and growth, 99
  diagram, 97
  Metaphor Computer Systems, 96–97
  program/project planning and management, 98
  technology track, 98
kitchen metaphor for DW/BI system, 65–68
know-it-all users, 120–121
KPI (key performance indicator), 2–5
  airline example, 22
  compliance impact, 597
  conformed facts, 428

L
labels, integrating, 162. See integration
languages. See multinational
late-arriving
  data, 19
  data handler, ETL subsystem #16, 432
  dimension records processing steps, 492
  fact records, processing steps, 491
latency affecting ETL design, 428
latent semantic analysis for unstructured text search, 418


launching BI environment, 652
LDAP (lightweight directory access protocol)
  managing role enabled security, 578
  server, dimension manager and fact provider responsibilities, 21
lean times DW fitness program, 680–684
legacy data formats, resolving inconsistent, 618
legal department, boundaries, 6
licenses
  cost cutting and, 682
  for software and systems, affecting ETL design, 429
lightweight methodologies, 109. See agile
line items, inheriting dimensionality, 267
lineage, 468
  analysis during ETL, 442
  and dependency analyzer, ETL subsystem #29, 433
  requirements affecting ETL design, 428
Linoff, Gordon, 600, 625
locked data, 92
log scraping for change data capture, 453
LSA (latent semantic analysis), 418

M
management education, 670–673
many-to-many bridge tables, 335
many-to-many dimension relationships, resolve in fact table, 197
many-to-many relationships, prevalence of, 146
many-to-one relationships
  outriggers, 334
  resolve in dimension table, 197
  snowflaking and, 334
MapObjects Visual Basic tool for GIS, 384
market basket analysis, 420–424
  degenerate dimensions, 271
  proposed dimensional design, 420
marketing of the DW/BI system, 656–658
marquee applications, 598
master data management. See MDM
Mastering Data Mining, The Art and Science of Customer Relationship Management (Berry and Linoff), 600, 625
matrix. See bus matrix
MDM (master data management), 13, 353
  business value, 515
  centralized enterprise source, 519–520
  deployment steps, 520
  dimension manager role, 514
  EDW and, 15
  importance of data governance, 520
  integration hub, 517–518
  need for, 516
  and SOA with agile development, 111
  solving data disparity, 516
  source system disparities, 515
  supported from data warehouse, 516–517
  three approaches, 516
MDX (multidimensional expressions), 648
measured facts, new, design response, 195
measurements, fact tables, 11, 104, 179
  reports and, 204
  snapshots. See grain (fact tables)
measures, integration. See conformed facts
media, formats in data warehouse, 55
  archiving, preservation and, 580
medical information privacy, 573
MERGE command (SQL) for SCD processing, 499
merge-sort, drilling across, 190
meta meta data data (data about metadata), 566
metadata, 613–614
  complete list for data warehouse, 567
  data mining, 629
  data warehouse scope, 566
  management tasks, 567
  management tools, 570–572
  repository manager, ETL subsystem #34, 434
  strategy recommendations, 571
Metaphor Computer Systems, 96–97
Microsoft Analysis Services 2005, 553–555
Microsoft SQL Server 2005, data warehouse architecture guidelines, 554
Microsoft SQL Server 2008
  database compression, 556–558
  new features, 556
  star schema optimization, 559
  table partitioning, 558
migrating from disparate data to centralized, 170
mini-dimension tables, 367


  customer attributes, 367
  demographics example, 320
  linking to primary dimension through fact table, 322
  monster dimensions, 320
  overwrite and, 326–327
mistakes in building a DW/BI system, 100
mixed grain problems, double counting, 223, 227
model alternatives, analytic applications step, 590, 592–593
monolithic approach, 38
monster dimensions, rapidly changing, 320
multi-dimensional cube builder, ETL subsystem #20, 432
multi-pass SQL, 150. See drill-across
multi-valued dimension
  health care diagnoses, 341
multi-valued dimension bridge table loader, ETL subsystem #15, 432
multinational data, 374–388
  addresses, 475
  calendars, 376–377, 476
  character sets, 475
  compliance, 476
  consistency criterion for dimensional DW, 229
  cultures, 475
  currencies, 377–378, 476
  customer information in real time applications, 379
  data warehouse design considerations, 387
  dimension translation, 387
  euro, 377–378
  geographies, 475
  languages, 475
  names, 475
  names and addresses, 378–383
    Atkinson, Toby, 381
    cultural correctness, 379
  numbers, 476
  postal address formats, 380
  quality architecture, 477
  quality issues, 474
  reporting issues, 566
  salutations, 475
  time zones, 476, 477
multiple dimension hierarchies criterion for dimensional DW, 229
multiple dimension roles criterion for dimensional DW, 230
multiple hierarchies in a dimension, 184, 351
multiple valued dimensions criterion for dimensional DW, 230
myths, 8–10, 38, 143–144, 201–203
  atomic data should be normalized, 239
  dimensional models pre-suppose the business question, 238
  facts and fables, 204–208

N
N-tiling, 636
name and address processing, 439
  cultural correctness, 379
  international design issues, 378–383
naming conventions, 220–221, 450
natural keys, 18. See durable keys
  bridge table, 362–363
  in fact table for type 1 and type 2 tracking, 328–331
  problems with, 287
  in surrogate key pipeline, 482
  surrogate keys, 285
navigation, aggregate navigation, 32–33
network database design, 393
Nielsen syndicated reporting. See Inf*Act
non-additive numeric fact, 281
  bad design example, 213
  computing from additive facts, 264
  handling in BI tool, 635–639
  handling in OLAP, 548
  summarizing across time, 293
non-behavior, explicit records for, 260
non-existence of events, techniques for querying, 259
nonconformed dimensions, conforming, 676–677
nonexistent users, 121
normalization, integration and, 207–208
normalized data models, 9, 12, 133, 137. See ER
  BI queries, 144
  complexity and BI, 146


  compared to dimensional model, 134
  creating dimensional views, 77
  uniqueness or completeness, 57, 146
normalized data warehouse. See CIF
normalized data warehouse, lack of procedure for slowly changing dimensions, 177
normalized EDW not for business intelligence, 176
normalized hierarchy disadvantages, 224
normative model, 60–61. See descriptive model
NOT EXISTS
  missing attributes, 262
  what didn’t happen, 261
nulls
  as dimension attributes, 277
  as fact table foreign keys, 276–277
  as facts, 277
numbers, international data, 476

O
objection removers, 76, 77
  aggregates, 78
  applications integrators, 78
  backups, 79
  centralized customer management system, 78
  centralizing IT functions, 79
  larger problem and, 77
  recognizing, 77
  security, 79
  solutions for, 77
ODS, operational data store hot cache, 504
offline delays during ETL processing, 502–503
OLAP (online analytical processing), 17, 46
  advantages versus dimensional relational, 551
  analytic syntax, 551
  catastrophic invalidation with SCD Type 1, 552
  cube builder, ETL subsystem #20, 432
  data cube, 63
  desktop versus server, 547
  dimension limitations, 325
  versus dimensional relational advantages, 550
  versus dimensional relational disadvantages, 551
  dimensions comparison with ROLAP dimensions, 547
  disadvantages versus dimensional relational, 552
  implementing aggregations via strong hierarchies, 548
  major advantages, 548–549
  as major data warehouse component, 546
  versus ROLAP, final deployment choice, 549–553
  SCDs contrasted with ROLAP SCDs, 548
  security scenarios, 551
  sensitivity to type 1 SCD, 25
  similarity to star schemas, 63
  SQL-99 extensions, 645–649
  time constraints contrasted with ROLAP, 548
Olson, Jack (Data Quality: The Accuracy Dimension), 427
OLTP (online transaction processing), 36
  data warehouse systems, 37
  models, 137
on-the-fly behavior dimensions criterion for dimensional DW, 231
on-the-fly fact range dimensions criterion for dimensional DW, 231
online analytical processing. See OLAP
online transaction processing. See OLTP
operating procedures, 655–656
operational systems back pointers, 487–488
operations phase of data mining, 628–629
operators, RegExp, 479
opportunity matrix, 158
  processes versus departments, 130
OR queries, 349–350
outrigger dimension, 135–136, 334–335
  cautions, 224
  date dimension as, 292, 299
  time dimension as, 292
  variation of snowflaking, 224–225, 336–339
overbooked users, 120
overwriting, type 1 SCD, 25–26, 317
overzealous users, 120

P
packaged applications
  avoiding stovepipes, 522–523
  data warehouses and, 522–524, 529


page events in clickstream dimensional design, 412–417
parallel communication paths, catastrophic failure and, 577
parallelizing system, ETL subsystem #31, 433
paralysis of project, 84–85
parent-child fact tables, 262–268
  degenerate dimensions, 264
  design alternatives, 263
partitioning
  fact tables with smart date keys, 296–297
  real time design, 507–510
  surrogate keys and, 297
  table partitioning, 558
  tricks to minimize offline time, 502
  type 2 SCD, 316
partnership between IT and business, 91
parts adding up to whole, 48, 53. See distributed architecture
pathstring attribute for ragged hierarchy, 364
pattern index for high speed searching, 350
payments fact table as part of budgeting value chain, 404
performance guidelines of web-oriented data warehouse, 562
periodic snapshot grain. See grain (fact tables)
periodic snapshot grain fact table loader, ETL subsystem #13, 432
periodic snapshot grain real time partition, 509
personal data
  ownership, 574
  uses and abuses, 573
personnel, staffing team, 70, 217–218
pipeline processes, accumulating snapshots, 246. See grain (fact tables)
pipelining system, ETL subsystem #31, 433
pivoting fact table with fact dimension, 282–283
P&L (profit and loss) fact table, 401–402, 436
playbooks for all operations, 656
populating dimensional models, 238
predicting (data mining), 617
presentation area, 51, 62–63, 67
preservation. See digital preservation
primary keys in dimensional models, 181
prioritization grid, benefit versus feasibility, 131
privacy
  concerns from RFID tags, 534–535
  data warehouse architecture and, 575
  information transfer and, 476
  tradeoffs in data warehouses, 572
private attributes in conformed dimensions, 16, 19
problem escalation system, ETL subsystem #30, 433
problem resolution in web-oriented data warehouse, 565
process-centric rows in bus matrix, 156–157
process steps, data warehouse design, 210
process streamlining in web-oriented data warehouse, 564
processes versus departments, 123
procurement pipeline, accumulating snapshot example, 241–242
product dimension, conformed in an EDW, 82
production keys, problems with, 287
production (source) transaction processing systems, 62
profitability case study, 400–403
profitability fact tables, allocations, 402, 436
progressive subsetting queries, 642
promotion dimension
  design example, 308
  design recommendations, 310
promotion profitability, 311
promotion tracking, factless fact table, 257
provenance, lineage, 468
pruning algorithm in market basket analysis, 423
publishing metaphor for data warehouse manager, 58, 70, 73
publishing reports, 590–591
purchase behavior privacy, 573

Q
quality culture, 461–462. See data quality architecture articles
quality screen handler, ETL subsystem #4, 431
quality screens in ETL architecture, 463
queries, BI
  AND, 349–350
  behavioral, 642
  browse queries, 135


  decomposition, 639. See drill-across reports; drilling across
  drill-across operations, 190. See drill-across reports; drilling across
  features for query tools needed, 638–649
  hot-swappable dimensions, 313
  OR, 349–350
  performance
    cost when too slow, 92
    priorities for improving, 201
  SQL, categories, 641–642
query time dimension conforming, goals, 523

R
ragged dimension hierarchies criterion for dimensional DW, 229
ragged hierarchies. See hierarchies
  bridge table solution, 355–358
  pathstring attribute solution, 364–365
  recursive pointer problems, 357
rapid deployment, 47, 53
rating scheme for dimensional DWs, 226
real-time architectures, 503–509
  customer information in multinational applications, 379
  late arriving dimensions, 494
  real-time partitions, 507
real-time partition design, 507
  accumulating snapshot grain, 509–510
  periodic snapshot grain, 509
  transaction grain, 508
real-time triage, judging user requirements, 510–511
realignment, business, 667
reason code, SCD2, 330–332. See SCDs
reassuring users, in web-oriented data warehouse, 565
recency, frequency, intensity. See RFI
recovery and restart system, ETL subsystem #24, 433
recursive pointer
  problems, modeling ragged hierarchies, 357
  replaced by hierarchy bridge table, 357
redundancy, data
  3NF and, 138
  reducing, 679–680
reference dimensions, 272–273
referential integrity
  in dimensional schemas, 181, 228
  enforcing during ETL, 438
  handling nulls, 276, 295
  in hierarchies, 352
regular expressions (RegExp)
  for data cleaning, 477–481
  operators, 479
  uses, 480–481
relational databases, business rules, Chris Date, 137, 147
relational models
  dimensional models and, 9, 181
  EDM and, 10
relational online analytical processing. See ROLAP
replicating conformed dimensions, 18, 485–486
reporting
  accuracy testing, 608
  analytic application, 22
  custom tool, 521
  dashboard development, 612–613
  deployment, 609
  development, 607
  documentation, 604–605
  EDW, 14–15
  maintenance, 609
  management, 609
  measurements and, 204
  navigation framework, 606
  performance testing, 608
  portal development, 610–612
  presentation area, 51, 62–63, 67
  publishing, 590, 591
  replication, 606
  report creation, 602–608
  reporting portal, 652
  specifications, 604–605
  standard, 602
  system design, 603–606
  target report list, 603
  template, 604
  user review, 606
  users’ involvement, 610
response time to data warehouse queries, 49, 54, 56. See queries, BI


responsibilities of DW/BI team
  data warehouse manager, 70
  team members, 217
results, preventing irrelevant, 59
return on investment. See ROI
review and validate design, 221
RFI (recency, frequency, intensity)
  behavior tags, 337, 368–369
  definitions, 369
RFID (radio frequency identification) tags
  application examples, 533
  impacting personal privacy, 534
  sequential behavior analysis, 534
  smart dust, 535
  tracked in data warehouse, 533
ROI (return on investment), data warehouse, 93
ROLAP (relational online analytical processing), 48
ROLAP versus OLAP, final deployment choice, 549–553
role playing dimensions, 10, 300, 312
  telecomm example, 301
  transportation example, 301
  in voyage and network designs, 395
rolling date reporting, 252
rolling operational results, tying to GL, 5
row change reason code, ETL. See SCDs, type 2
row headers. See grouping columns
row labels in dimension tables, 198
rules for dimensional modeling, 196

S
sabotage, 576
SANs (storage area networks)
  as counter to security catastrophes, 578
  data warehouse and, 585
  typical configuration, 586
Sarbanes-Oxley Act, 596
satisfaction metrics
  chaotic lists, 373
  design alternatives, 371
  simultaneous dimension and fact, 372
  standard fixed list, 371
scalable width fact tables, 281
scaling out, scaling up a data warehouse, 584
SCD processor, ETL subsystem #9, 432
SCDs (slowly changing dimensions), 315–332
  comprehensive overview, 24
  criterion for dimensional DW, 230
  delaying dealing with, 199
  dimension manager responsibilities, 18
  factless fact tables, 258
  handling, 193
  hybrid combinations
    type 1 + 2 tracking with natural keys in fact table, 328–329
    type 1 fact and type 2 mini-dimension, 327
    type 6 combination of all three types, 327–328
  MERGE command (SQL), 499–501
  place in dimensional modeling, 322
  processing in ROLAP, 438
    with OLAP, 548
  rapidly changing, mini-dimensions, 323
  slowly changing entities, normalized time variance tracking, 495–497
  strategies for, 225–226
  too fast, 324
  type 1 (overwrite), 25–26
  type 2 (new dimension record), 26–27
    begin- and end-effective time stamp, 193, 323
    change description, 193
    most recent flag, 193
    reason codes, 330–332
  type 3 (new field), 27
scorecards and dashboards, 612–613
screens, data quality. See data quality architecture articles
SCRUM, 109
SDE (spatial database engine)
  ESRI GIS semantics extender for SQL, 386
searches
  pattern index, 350–351
  substrings, 350
seasonal fluctuations, removing when testing data quality, 474
second-level subject area, 44, 155
  fact tables, 240
  profitability design, 400
  risks, profitability and satisfaction, 102
second normal form, dimension tables, 181


security, 2, 3
  architecture, 83
  catastrophes
    categories, 576–577
    techniques for countering, 577–578
  EDWs, 83
  ETL design, 427–428
  ETL subsystem #32, 433
  management with custom tool, 521
  objection removers, 79
  scenarios with OLAP, 551
  technique for multiple clients, hot-swappable dimensions, 313
self-documenting code, 601
semi-additive numeric facts, 182, 293
  BI application handling techniques, 639
  declaring in metadata, 227
  OLAP handling advantages, 548, 554
  real time partition handling, 509
sequential behavior analysis using RFID tags, 24, 597–598
sequential computations in BI tool, 635–638
server configuration choices for data warehouse, 583
service accounts versus personal DBA accounts, 655
service oriented architecture. See SOA
session type dimension in clickstream dimensional design, 415
Seybold, Patricia (Customers.Com), 525
shadow functions, office anthropology, 115
shapefiles, GIS data object for boundaries and areas, 385
shared ownership, hierarchy bridge table, 358
shrunken dimension tables in aggregate architecture, 540–542
similarity metrics for unstructured text, 417–420
six sigma data quality, 467
skills, for DW/BI team, 93
SLA (service level agreement), 655
slowly changing dimensions. See SCDs
smart dust. See RFID tags
smart keys
  date keys for partitioning fact tables, 296–297
  dimensions, not for fact table joins, 200
  disadvantages, 286
  problems in data warehouse, 288
snapshots, periodic, accumulating. See grain (fact tables)
snowflaked dimension tables, 135, 181
  classic design, 336
  complex calendar dimension, 339
  context-dependent, 338
  definition, 333
  financial product dimension, 338
  impact on usability, 104
  large custom dimension, 337
snowflaking
  as alternative to dimensional model, 143
  disk space and, 224
  as DM alternative, 143–144
  outriggers, 224
SOA (service oriented architecture)
  agile development, 111
  data warehouse and, 513–515
  services defined for dimension manager, 514
software development manager, lessons learned, 601
sort-merge, drilling across, 190
sort system, ETL subsystem #28, 433
sparse facts
  fact dimension, 281
  wide fact tables, 280–282
sparsity tolerance criterion for dimensional DW, 228
spatial database engine. See SDE
special dimension builder, ETL subsystem #12, 432
sponsor from business, 86–89
SQL-92, flexibility of, 645
SQL-99, OLAP extensions, 645–649
SQL (Structured Query Language)
  CASE expression, 633
  comparisons, 631
  drill across, 629–631
  as interim language, 631
  MERGE for SCD processing, 499
  multi-pass SQL, 150
  queries, categories, 641–642
staffing dimensional modeling team, 70, 216–217, 429
  skills development, 93


staging area, 62, 66. See archiving
  affecting ETL design, 428

standard deviation used for data quality estimating, 472

standard reports, 602. See reporting
star join model, relationship to dimensional model, 139
star schema optimization in Microsoft SQL Server 2008, 559
star schemas
  fact tables, 181
  OLAP data cubes and, 63–64

Star Workstation, Xerox, 57
statistical analysis as part of data mining, 616
steering committees, 672–673
stovepipes, 38–39
  avoiding, 522–523
  converting to architected dimensional data marts, 45
strategic business initiatives, 127
  matrix, 158–159
street segment data, TIGER Census Department, 386
structure screens, 122
  in ETL data quality architecture, 463
sub-types and super-types. See heterogeneous product design
subject area groups in conformed dimension design, 154
subject areas
  first level, 153
  second-level, 155

substring searching in keyword list, 350
subtransactions describing behavior, 368
sunsetting older environments, 681
super-types and sub-types. See heterogeneous product design
surrogate key administration criterion for dimensional DW, 229
surrogate key creation system, ETL subsystem #10, 432
surrogate key pipeline, 20, 26, 481–485
  ETL subsystem #14, 432
  inserting surrogate keys, 482
surrogate keys, 109, 285–289
  advantages, 225, 285–286
  bridge tables, 344–345, 360–361
  creating, 677–678
  dimension manager responsibilities, 18
  dimension table primary keys, 198
  example used incorrectly, 289
  fact tables, 33
  required by type 2 SCD, 26
  fact tables
    reader suggestions, 269
    where to use, 268
  natural keys, 285
  partitioning and, 297
  uncertainty, 287

surveillance privacy, 573

T
table partitioning. See partitioning
tape recorders during requirements gathering, 115
TCO (total cost of ownership) of data warehouse, 89
telecomm bus matrix, 152
telecomm dimensional roles example, 301
telephone system comparison, 60
text document searching, 417–420
text field problems in fact table, 224
text in fact tables, removal techniques, 275
text facts
  recency, frequency, intensity behavior tags, 369
  recommended design, 370

The Data Warehouse Lifecycle Toolkit (Kimball, et al), 97

The Transparent Society: Will Technology Force Us to Choose Between Privacy and Freedom? (Brin), 574

The World is Flat (Friedman), 474
third normal form, fact tables, 181. See 3NF
TIGER census department data, USA street segments, 386
time constraints, ultra precise, 251
time dimension, 192
  bad design, 293
  incompatible rollups, 292
  keys, 293
  as outrigger dimension, 292


  recommended design, 294
  role playing, 298

time spans created by transactions, 250
time stamps
  begin- and end-effective, 251, 289
    type 2 SCD, 323
  bridge tables, 345–346
  employee dimension table, 399
  fact tables, 192
  to nearest second, 490
  time zones and, 375

time variance in dimensions. See slowly changing dimensions

time zone discovery (www.timezoneconverter.com), 477

time zones
  ETL system tradeoff, 434
  international data, 476, 477
  synchronizing, 374–376

top-down design, dimensional modeling, 135. See bottom-up approach

total cost of ownership. See TCO
training
  data subsets used in data mining, 620–629
  DW/BI business users, 101, 652

transaction grain fact table, 32, 193, 243–244. See grain (fact tables)

transaction grain fact table loader, ETL subsystem #13, 432

transaction grain real time partition. See real-time partition design

transaction processing models, 137
Transaction Processing Performance Council, 37
transaction workloads in data warehouse, 532
translations in multinational data warehouse, 387, 477
transportation database design, 301, 393
travel case study, 393–396
trust building in web-oriented data warehouse, 61
type 1, type 2, type 3, type 6 SCDs. See SCDs

U
uncertainty, encoding with surrogate keys, 287
unconformed dimensions and facts, 91
UNICODE character set, 475
  multinational information, 380
units of measure, conflicts, 435
university admissions, accumulating snapshot example, 247
unstructured text applications, 420
  LSA (latent semantic analysis), 418
  similarity metrics, 417–418

unstructured text fact table, 417
user-focused cognitive and conceptual models, 91
user interface, 57
  advances driven by the Web, 561
  design, 56
    BI tools, 28, 57, 91
    dimensions, 11
    drilling down, 188
    facts, 11
  guidelines for web-oriented data warehouse, 562
  poorly performing, 92
  urgency, 561
  WYSIWYG (what you see is what you get), 560

user types
  abused, 119
  boundaries, 5
  clueless, 121
  comatose, 120
  control, 107
  know-it-all, 120–121
  nonexistent, 121
  overbooked, 120
  overzealous, 120

V
version control, 71, 215, 450
version control system, ETL subsystem #25, 433
version management
  audit dimension, 466–470
  fact and dimension tables, 19–21, 25, 163–164, 313, 344, 408, 450
version migration system, ETL subsystem #26, 433
voyage database design, 393


W
waterfall development approach compared to agile approach, 107
waterfall development risks, 102
web-oriented data warehouse, 48–51, 55
  choice presentation, 563–564
  dimensional design, 410
  distracted avoidance, 564
  page object dimension, 415
  performance guidelines, 562–563
  problem resolution, 565–566
  process streamlining, 564–565
  reassuring users, 565
  session modeling, protocol analysis 413, 416
  user interface guidelines, 562–566
  visitor dimension, 414
  web page characteristics, 409–413

weighting factor in bridge tables, 342
what didn’t happen, techniques for finding, 259
what if analysis, analytic applications, 22
workflow monitor, ETL subsystem #27, 433
worksheets during design phase, 219
WYSIWYG (what you see is what you get) user interfaces, 560

X
X-11 ARIMA statistic for data quality testing, 474
Xerox PARC, birthplace of personal computer, 560
XML (extensible markup language), 8
  data warehouse integration, 523
XP (Extreme Programming), 109
