Index
34 subsystems of ETL, 430–434
64 bit architectures for data warehouse, 554, 558, 582
2NF (second normal form), 177–178
3NF (third normal form), 133
  business rules, 145–147
  Chris Date criticisms, 147
  complex schemas, 146
  versus dimensional modeling, 140–141
  incompleteness, 146
  primary criticism, 137–139
  query complexity, 138
  performance for BI queries, 138
  real data, 146
  redundancy, 138
  uniqueness, 146
  usability, 138
  CIF use of 3NF, 173–178
A
abstract dimensions, reasons to avoid, 311
abused users, 119
accumulating snapshot fact table, 194. See grain (fact tables)
  combining with periodic and transaction grains, 249
  comparison to other grains, 244–246
  date dimension roles, 300
  fact table loader, ETL subsystem #13, 432
  nulls to be expected, 276
  pipelines and short processes, 246
  procurement example, 241–242
  real time partition, 509
  university and admissions example, 247–248
accurate counting, combining CASE and SUM, 314–315
actions, tracking, 590, 593
activity based costing
  difficult environments for, 265
  modeling income statements, 401
ad hoc attack, 63
Adaptive Software Development, 109
additive facts, 12, 139–142, 182
  examples, 31, 213, 264
  declaring in metadata, 227
  non-additive example, 281
  semi-additive example, 182, 227
address cleaning and standardizing, 374–388, 439
  international addresses, 274, 378–383, 475
administration criteria for dimensional DWs, 228–229
administrative costs, 59
admissions, university, accumulating snapshot example, 247
affinity grouping
  data mining, 617
  market basket example, 421
aggregate builder, ETL subsystem #19, 432
aggregate data quality measures, 470
aggregate fact table definition, 188
aggregate navigation
  criterion for dimensional DW, 228
  dimensional modeling advantages, 142
  of dissimilar fact table grains, 545
  example, 32
  main architecture articles, 536–546
  main algorithm, 542
563106bindex02.indd 693 12/23/09 10:53:32 PM
  metadata and, 539
  minimum metadata requirement, 542
  OLAP considerations, 548, 554
  query tool discipline, 542
  query tools, 639
  recommended data warehouse architecture, 537
aggregate navigator, 185, 188
aggregate processing during ETL, 440, 486
aggregated data
  anticipates the business question, 236
  characteristics, 239
  data mining and, 44
  data quality reporting, 470
  drilling down from, 53
  prematurely, 92, 143
aggregated dimensional models, 239
aggregates
  administration with Type 1 SCD, 25
  design requirements, 540
  fact provider responsibilities, 164
  goals for data warehouse, 539
  metadata requirements, 536, 569
  objection removers, 78–79
  positive and negative impacts, 536
  removing from real time partition, 495, 508
  server configurations, 583
  shrunken dimension tables, 540
aggregation. See aggregated data
  when premature defeats drill down, 143
agile development approach, 107–111
Agile Manifesto, 109
AI (artificial intelligence), 616
airline customer satisfaction dimension, 371
airline flight segment database design, 393–395
airline yield KPI use case, 22–24
airport role playing dimensions, 396
Alda, Alan, interviewing skills, 113
allocating costs
  conflicting requirements, 263–265
  danger of implementing, 4–5, 63, 71–72
allocation, environments, 265
allocation rules for calculating profit, 72
  compliance requirements, 426
allocations
  computing on the fly, 523
  implementing in OLAP, 549
  income statement fact tables, 401–403
  profitability fact tables, 402
  substituting rules of thumb, 44, 402
  version number in audit dimension, 466, 469
alternate reality, type 3 SCD, 27
An Introduction to Database Systems (Chris Date), 36, 137
analytic application lifecycle, five stages, 22, 590–596
  analytics matrix tracking, 158
  analytic application reports, 602
  build versus buy, 603
analytic requirements, identifying, 126
analytic tools, 62, 63
analytics matrix, 158
Analytics Workshop, 127
AND queries, 349–350
architecture
  address matching and standardizing, 385–386
  aggregate navigation articles, 536–546
  archiving, long term preservation, 579–582
  BI architecture articles, 560–565, 607–610
  BI comparison queries, 631–634
  BI portal, dashboards, 610–612
  BI upgrading unsuccessful, 674–676
  bus architecture, 38–45, 51–52, 150–151
  catastrophe protection, 576–578
  change data capture, 452–453
  criteria, dimensional DWs, 226–228
  data architecture chapter, 133–178
  data mining articles, 615–629
  data quality, 460–467
  distributed EDW, 56
  drilling across, 189–191, 629–631
  drilling down, 22–24, 186–189
  EDW diagram, 51
  ETL, 105
    34 subsystems of, 430–434
  FTP-based integration, 450
  integrated EDW, 13–21
  late arriving data handling, 491–495
  Lifecycle place for, 97
  master data management (MDM), 516–520
  metadata, 567–571
  Microsoft SQL Server 2005 data architecture, 554–559
  real time, 503–510
  ROLAP versus OLAP, 549–553
  SCDs and time variance of dimensions, 24–27
  security, 83, 575
  separating IT systems, 50–51
  service oriented architecture (SOA), 513–515
  storage area network (SAN), 585–587
  surrogate key processing pipeline, 481–485
  time handling, 192–194
architecture phase, data marts, 39–40
archiving, 2, 8
  encapsulating and emulating strategy, 581
  examples of very long term requirements, 579
  historical letters case study, 347–351
  limitations of media, formats, software, hardware, 580
  metadata examples, 568–569
  migrate and refresh strategy, 581
  requirements affecting ETL design, 428
  very long term digital preservation, 579–582
Atkinson, Toby, multinational name and address resource, 381
atomic data, 61
  advantages, 43
  aggregations, 235–236
  as basis of dimensional models, 196
  drilling down, 47, 188
  normalized form, 200
  storage architectures, CIF versus Kimball, 174
atomic fact tables, 43
  as core foundation, 239
atomic grain, dimensionality, 235
atomic-level behavior data, 55
audit columns for change data capture, 452–453
audit dimension, 465–467
  assembler, ETL subsystem #6, 431
  in data mining, 619
  data quality measures, 468
  detailed design, 469–471
  environmental descriptors, 468
  fact tables, 187
automobile collisions, factless fact table, 257
automobile policy coverages, insurance case study, 278, 392
availability of data warehouse, 48, 53, 558
  minimizing offline time, 502
  taking aggregates offline, 440
averaging over time, 182
awkward formats, 92
B
B-tree indexes, 37, 269, 508
back pointers to operational systems, 487–488
back room, 653. See ETL.
backup and recovery use cases, 79
backup system, ETL subsystem #23, 433, 578
backups
  data staging, 8
  objection removers, 79
balance transactions, 279
BEEP, 237
begin- and end-effective time stamps. See time stamps
behavior analysis, 598–600, 621–625
  from clickstream, 410–413, 415–417
  market basket analysis, 420–424
  purchase behavior security risks, 573
behavior dimension, 231, 324, 643
behavior tags, 368–371
  recency, frequency, intensity, 337–338, 368
behavioral queries, 640–644
  non-behavior, 26–262
Berry, Michael, 600, 625
best practices
  building DW/BI systems, 103
  establishing operating procedures, 655
BI (business intelligence)
  applications, chapter 13, 589–650
  architecture
    unsuccessful, 675
    upgrading, 674–676
  compliance, 596–597
  CRM, 599–600
  custom tools, 520–522
  dimension browsing, 28
  drilling across, accreting measures, 29
  drilling down, 28
  ease of use, 29
  environment
    launching, 652
    monitoring operations, 653–654
  pervasive, 532–533
  portal, 610–612
  queries, 28
    improving performance, 78
  reports, 28. See reporting
  sequential behavior analysis, 597–598
  tools, 20–21
    licenses, 682
    sequential computation difficulties, 635
  user interface, 28
  value with, 589–600
BI tool interfaces affecting ETL design, 429
bitmap indexes, 81, 269, 325, 559, 562
blended development approach, top-down and bottom up, 103
book references
  building interpersonal skills, 94
  building public speaking skills, 95
  building written communication skills, 95
  understanding the business world, 94
bottlenecks
  authentication and access, 578
  memory, 582
  scalability, 507
bottom-up approach, Kimball Lifecycle, 100, 128
bottom-up market basket algorithm, 423
boundaries with finance, IT, legal, and end users, 4–6
bridge tables
  account to customer in banking, 343, 344
  begin- and end-effective time stamps, 345
  correctly weighted report, 342
  definition, 335
  diagnosis tracking in health care, 342
  ETL subsystem #15, bridge table builder, 432
  for multiple alternate hierarchies, 366. See hierarchies
  for variable depth hierarchies, 336, 357–359. See hierarchies
  for satisfaction tracking, 373
  impact report, 342
  keyword tracking, 348
  Microsoft Analysis Services alternative, 554
  natural keys, 362–363
  need for surrogate keys, 344
  reports-to dimension, separate, 361
  SIC codes, 343
  surrogate keys, 344, 360–361
  updating, 346
  weighting factor, 342, 345
Brin, David (The Transparent Society: Will Technology Force Us to Choose Between Privacy and Freedom?), 574
browse a dimension, BI tool user interface design, 28, 135, 638
budgeting case study, 403–407
budgeting data aligned with planning data, 545
bug tracking system, 601
bus architecture, 46, 51, 150–151, 172. See architecture.
  distributed systems, 151
  independent from centralization, 151
bus matrix
  analytics, 158
  consolidated processes in, 240
  detailed implementation matrix, 159–160
  drill down into, 159–161
  executive communication, 15, 154
  extensions, 158
  feasibility grid, benefit versus feasibility, 131
  grain, altering, 159
  for integrated EDW, 15, 129
  for manufacturing, 16
  mishaps, 157
  opportunity, 158
  processes versus departments, 130
  preliminary bus matrix and bubble chart, 218
  primary introductions, 151–159
  strategic initiatives versus business processes, 127, 158
business acceptance, 88, 99, 113, 664–670
Business Dimensional Lifecycle (Kimball Lifecycle), 96–99
business intelligence. See BI (business intelligence)
business needs affecting ETL design, 47, 53, 204, 426
business phase of data mining, 626
business processes
  as basis of dimensional models, 197
  versus departments, 123
  fact table grain, 126
  identifying, 124, 125–127
    consequences of incorrect, 126
  subject areas, 61
  tying to strategic initiatives with matrix, 127
business realignment, 667
business reengineering, 2, 461
  driven from poor data quality, 459
  organizational steps, 459
business requirements gathering, 2, 3, 5, 83, 113
  conversationality, 114–115
  curiosity, 114
  data audits, 116
  difficult users, 119
  listening skills, 116
  preparing beforehand, 115
  wants/needs determination, 118–119
business rule screens, 122, 463
business rules, 145
  screens, in ETL architecture, 463
  supported by data models, 145
business sponsor, 86–89, 149–150, 655, 662–670
business user’s responsibilities, 216
C
calendar date dimension design, 291
calendar dimension, 293–294. See date dimension; time dimension
  design, 435
  multi-enterprise, 339
  primary key, date format, 288
calendars
  international dates, 476
  multinational designs, 376
case studies
  budgeting, 403–407
  clickstream, 409–413, 413–417
  growth scenario, 658–661
  human resources, 396–400
  insurance, 389–393
  profitability, 400–403
  text document searching, 417–420
  travel, 393–396
catastrophic failures, 576
catastrophic SCD type 1 invalidation using OLAP, 552
causal dimensions
  describing promotions or behavior, 235, 308–311, 674
  design recommendations, 310
  sourcing the data, 309
causal factors
  analytic application, 22
  determining, 590, 592
CDI (customer data integration), 105, 155
central data warehouse team, 80–83
centralization, 168
  decentralized reality, 47, 52
  inappropriate, 60
  logical design and integration, 169
  objection remover false promise, 76, 78, 79
  risks, 103–104
  risks of physical but not logical, 169
  steps to migrate from disparate data, 170
centralized architecture comparison to planned economy, 178
centralized customer management system, 78
centralized DW/BI systems, risks, 59
change
  anticipating, 47, 53
  continuous, 60
  source data changes, 7, 195–196
change data capture, 6–7, 452–453
  with CRC (cyclic redundancy checksum), 486–487
  with diff compare, 453
  ETL subsystem #2, 431
change impact on dimensional models, 9. See graceful extensibility
checkups, 661–667
choice presentation in web-oriented data warehouse, 563
CIF (Corporate Information Factory), 99
  compared to Kimball bus architecture, 171
  hybrid with Kimball approach, 175
  and Kimball approaches, fundamental differences, 174
claims periodic snapshot fact table, insurance case study, 391
claims transactions fact table, insurance case study, 390
classification queries, 642
classifying, data mining activity, 616–617
cleaning data, 431, 439, 443, 454–459
  as prelude to data mining, 618–619
  using regular expressions, 477–481
clickstream case study, 409–413, 413–417
clickstream data source challenges, 411
clickstream dimension, 410–412
  events, 413
  session type, 415–417
  visitor dimension, 414–415
  web page object, 415
  web sessions, 413
clickstream facts, 412
clueless users, 121
clustering, data mining activity, 600, 616
codes
  expand as verbose text in dimensions, 199
  interpreting into text, 618
cognitive models, 91
column screens in data quality architecture, 122, 463
comatose users, 120
combinatorial explosion in market basket analysis, 422
comment fields, removing from fact tables, 275
commitments fact table as part of budgeting value chain, 404
common labels in conformed dimensions, 149
communication, bus matrix as a vehicle for, 154
comparisons
  difficulty using SQL, 631–634
  implementing with drill across, 23
compliance, 2, 3
  affecting ETL design, 426
  international data, 476
  as part of data steward role, 166
  requirements for data warehouse, 407–409, 596–597
compliance-enabled fact and dimension tables, 408
compliance reporting, ETL subsystem #33, 434
compression of fact tables with careful design, 274
  dictionary compression in SQL Server 2008, 557
  improving query performance with, 556–558
conceptual models as part of user interface design, 91
configuration choices for data warehouse servers, 583
conformed dimension assembler, ETL subsystem #8, 431
conformed dimensions, 15, 51–54, 141, 244, 516
  affecting ETL design, 428
  anonymous data warehouse key, 41
  bus matrix, 154
  commitment to use, 42
  cost if not conformed, 91
  criterion for dimensional DW, 227
  data warehouses and, 41
  definition, 40, 149, 190
  designing, 41
  establishing, 40
  executive support required, 149
  fixing stovepipe data marts, 202
  grain, 41
  importance of, 40–41
  integrated EDW, 198
  integration and, 105, 108
  no need for, 45
  replication, 485–486
  unconformed and, 91
  variations, 42
conformed facts, 15, 42
  cost if not conformed, 91
  cross-process calculations, 201
  definition, 150
  unconformed, 91
  using analytically, 162
conforming data, ETL subsystem #8, 431
conforming dimensions at query time, unrealistic goal, 523
conforming nonconformed dimensions, 676–677
consolidated fact tables, 240
  combining processes, 240
  example, 240–241
constraint targets, always in dimension tables, 198
constraints on data warehouse design, 58–60
cookies in clickstream data, 411
corporate data model, 93
Corporate Information Factory. See CIF
correctly weighted report using bridge table, 342, 345–346. See impact report
correlated dimensions, splitting or combining, 307
correlated subqueries, 641
cost allocations. See allocations
costs, 90
  administrative, 59
  hardware, 59
  implementation costs, 59
  software, 59
  sources, 90
  surprises, 59
coverage dimension in insurance examples, 243–246, 390–393
coverage fact table, promotion tracking example, 257–258
coverage tables, finding what didn’t happen, 260–262
CRC (cyclic redundancy checksum), 8
  in change data capture, 486–487
criteria for dimensional DWs, 226
critical thinking, 58
critique data warehouse, 673–674
CRM (customer relationship management), 599–600
cross-browsing, 638–639
cultural correctness, multinational name and addresses, 379
currencies
  conversion version in audit dimension, 466, 469
  design, 435
  international data, 476
  in multinational designs, 377
custody of data, compliance responsibility, 407
custom tool development for ETL and BI, 520
customer dimension extensions, 367
customer modeling issues, 366–374
customer profiling, factless fact table, 259
Customers.Com (Seybold), 525
cyber warfare, 576–577
cyclic redundancy checksum. See CRC
D
Dangermond, Jack on GIS systems, 384
dashboards, 612–613
data
  aggregated prematurely, 92
  awkward formats, 92
  as both fact and dimension, 277
  delivery, slow, 92
  integration, external, 449–450
  integration manager, ETL subsystem #21, 432
  locked, 92
  profiling, 2, 3, 121
    affecting ETL design, 427
    business rule screens, 122
    column screens, 122
    role in organization, 462
    structure screens, 122
  quality
    1996 perspective, 454
    affecting ETL design, 427
    aggregate data quality reporting, 470–471
    comprehensive architecture, 460
    critically dependent applications, 454
    culture steps, 461
    error event handler, ETL subsystem #5, 431
    error responses, 465
    estimating from historical data, 471
    international design issues, 474
    measures in audit dimension, 468
    no history, 473–474
    predictable changes, 474
    six sigma, 467
    standard deviation, 472
    X-11 ARIMA, 474
  staging, 8, 50. See ETL
    area, 437
  stewardship, 156, 165
    communications, 167
    goal of program, 165
    master data management, 517
    need for, 165
    qualifications necessary, 167
    responsibilities, 166–167
  transformations, 618–620
    tool-dependent, 620–621
  wrangling, 7
data audits during requirements gathering, 116. See data profiling
data cleaning, 439
  applications, 458
  ETL subsystem #4, 431
  regular expressions, 477–481
  steps, 456
data conformer, ETL subsystem #8, 431
data flow, ETL system, 446–447
data governance as foundation for MDM, 520. See governance
data marts, 39
  architecture phase, 39–40
  avoid departmental definition, 123
  bus architecture, 46
  business process subject areas, 61
  data warehouse bus architecture, 51–54
  dimensional modeling and, 50–51
  higher level, 44
  presentation area, 51
  quick and dirty data warehouse myth, 202
  stovepipe data marts, 39
data mining, 44, 63, 615–617
  affinity grouping, 617
  aggregated data and, 44
  business phase, 626
  categories, 616–617
  classifying, 616
  clustering, 616
  data mining phase, 626–628
  data transformations, 618–620
    tool-dependent, 620–621
  data warehouse responsibilities, 623
  database architecture and, 624
  estimating, 617
  explain variance of KPI, 24
  metadata, 629
  observations, 622–624
  operations phase, 628–629
  origins, 615–616
  predicting, 617
  process flow chart, 625
  references, 24
data profiling, 2–3, 121–123, 462
  data quality driver, 427
  ETL subsystem #1, 430
data quality architecture articles, 460–481
Data Quality: The Accuracy Dimension (Jack Olson), 427
data warehouse bus architecture. See bus architecture
data warehouse manager responsibilities, 70–73
data warehouse not needed, objection remover, 77
data warehouses
  building in 15 minutes, 80
  bus architecture, data marts, 51–54
  central data warehouse team, 80–83
  costs, sources, 90
  mission, 73–74
  planning, 38
  publishing results, 48–49, 54
  securing results, 49, 54
  as Web-enabled system, 55
data webhouses, 55–56, 410
data wrangling, 6–8. See change data capture
database market split, 35
Date, Chris
  An Introduction to Database Systems, 36, 137
  criticisms of E/R models and business rules, 147
  on dimensional models, 137, 147
date dimension
  activity date versus booking date, 492
  advantages, 225
  attached to every dimensional model, 41, 197, 225
  conformed EDW, 82, 154
  design, 291
  hierarchies in, 352
  incompatible rollups, 292
  keys, 198, 288, 297
  latest thinking, 295
  multiple dates in accumulating snapshot, 246
  as outrigger dimension, 292
  as outriggers, 299
  recommended design, 250, 294
  role playing, 298
  used in SCD2 processing, 26, 193
DBAs (database administrators), 7
decentralized development, 47, 52, 60. See distributed architecture
decodes needed in dimension tables, 198, 224
deduplicating source data in ETL system, 18, 439, 487
deduplication system, ETL subsystem #7, 431
degenerate dimensions, 182, 271
  airline flight segment example, 394
  bill of lading example, 396
  grouping fact table rows, 271
  health care billing example, 235
  invoice header example, 264, 267
  market basket analysis, 271
  multiple keys in reference dimension, 272
  order line item example, 300
  parent-child fact tables, 264
  shipment invoice example, 465–466
  storing control numbers, 225
  tie back to operational system, 271
  web page event example, 414
demographic tracking, factless fact table, 259
demographics mini-dimension table, 320–322, 367
  ETL processing steps, 321
  permissible snowflake table, 337–338
denormalized dimension tables, 136, 181, 197, 334, 352
denormalized models, 9
departmental data marts to be avoided, 10, 47, 86–88, 123, 128
dependency analysis during ETL, 442. See impact analysis
deployment of data warehouse, 97, 651–661
  back room, 653
  BI applications, 609
  dimensional relational (ROLAP) versus OLAP, 549–552
  front room, 651–653
  monitoring operations, 653–654
  rapid, 47, 53, 60
descriptive attributes, verbose, 199
descriptive model, 60–61. See normative model
design drivers, 74
design review, 221, 223–231
design steps, 210, 405
  1 Choose the process, 211
  2 Choose the grain, 211, 223
  3 Choose the dimensions, 211
  3b Confirm the dimensions, 212
  4 Choose the facts, 213
  5 Store precalculations, 213
  6 Round out dimensions, 214
  7 Choose database duration, 215
  8 Specify SCDs, 215
  9 Decide physical design, 215
  review and validation, 221–223
design team roles, 3, 40–41, 80, 93, 216
destruction of facility, 576
diagnosis bridge table dimension in health care, 341
diff compare, 453. See change data capture
digital preservation, 579–582
dimension design response for new attributes, 195
dimension independence, 180
dimension keys, durable, natural, surrogate, 18
dimension limitations in OLAP, 325
dimension manager, 17–19
  joint responsibilities with fact provider, 21, 514
  LDAP, 21
  MDM resource, 514
  responsibilities, 17, 163–164
dimension manager system, ETL subsystem #17, 432
dimension notification criterion for dimensional DW, 229
dimension processing in ETL, 438. See SCDs
dimension size limitations, 325
dimension tables
  accurate counting, 314–315
  conformed, 141
  decodes, 224
  design process, 214–215
  many-to-one relationships, 197
  primary keys, surrogate keys, 198
  replicating to fact providers, 18
  row labels, 198
  shrunken dimensions, 19
  snowflaked, 181, 333
  source of constraints and row headers, 139
  text facts, 199
  version numbers, 19, 20
dimension update strategies, 325
dimensional attributes, overwrites, 317. See SCDs
dimensional criterion for dimensional DW, 228
dimensional designs, graceful modifications, 194–195
dimensional DWs
  administration criteria, 228–229
  architecture criteria, 227–228
  criteria, 226
  expression criteria, 229–231
  rating scheme, 226
dimensional models. See DM
  aggregated, 239
  atomic data, 196
  based on reports, 200
  business processes, 197
  date dimensions, 197
  departmental data marts, 144
  extensible designs, 203
  graceful extensibility, 9, 43, 142, 194–195, 228
  motivation and advantages, 139
  normalized model comparison, 134–137
  versus normalized models, information content, 140
  null usage, 276–277. See nulls
  populating, 238
  query evaluation strategy, 135
  relational models and, 9, 181
  source data changes, 195–196
  summarized information, 238
  symmetrical approach, 141
  themes, 278
dimensional queries, processing, 136
dimensional relational versus OLAP, 550–551
dimensional replication criterion for dimensional DW, 228
dimensional scalability criterion for dimensional DW, 228
dimensional star schema, fact tables, 181. See star join, star schema
dimensional symmetry criterion for dimensional DW, 228
dimensions, 10, 87, 180
  abstract, 311–312
  behavior, 324
  causal, 308–311
  conformed, 15, 244
  correlated, splitting/combining, 307
  degenerate, market basket analysis, 271
  degenerate dimensions. See degenerate dimensions
  design process, 211–212
  facts, data as both, 277–278
  generic, 311–312
  hierarchies, multiple, 184
  hot-swappable, 312–314
  independence, 192
  joins, avoiding, 307
  junk dimensions. See junk dimensions
  keyword dimension, 347–351
  mini, 326–327, 367
  missing, 181
  reference dimensions, 272–273
  retaining headers as, 266
  smart keys, 199
  user interface, 11
  verbose description attributes, 199
directory server, 427
dirty data, cleaning up hierarchies, 354
disorders of data warehouse. See DW/BI checkups
distraction avoidance in web-oriented data warehouse, 564
distributed architecture, 56, 60, 103, 151. See integration
  catastrophic failure and, 577
distributed systems, 151
DM (dimensional modeling), 133–134. See dimensional models
  3NF comparison, 140–141
  data marts and, 50–51
  defending, 144
  Microsoft Analysis Services, 553–554
  myths. See myths
  overview, 139–140
  retail databases, 143
  rules, 196
  snowflaking, 143–144. See snowflaked dimension tables
  stovepiping, 143
  strengths, 141–142
  symmetrical approach, 141–142
  top-down design, 135
  understanding of, 143
drill-across reports, 14, 20, 42. See integration
  grouping columns, 185
drilling across, 14, 150, 162, 240. See integration
  article, 189–191
  conformed dimensions, 82
  danger using two fact tables in same query, 189
  definition, 185
  detailed implementation, 190
  different grains, 33
  fact tables, 185
  fact tables of dissimilar grains, 191
  implementing, 190–191
  outer join, 190
  queries, 190
  sort-merge, 190
  SQL, 629–631
drilling down, 186–189
  atomic data, 47, 53, 188
  BI tool user interface design, 28
  computation, 187
  data quality attributes, 187
  definition, 183
  grouping columns, 184
  not in a hierarchy, 184, 187
  predetermined hierarchy, 23
  precise technical comments, 187
  row headers, 186
  user interface, 188
drilling up, 184
durable key, 18–19, 27, 328–331, 337. See natural key
duration of data storage, 215
DW/BI (data warehouse/business intelligence), 1
  business acceptance, 667–670
  business realignment, 667
  business representatives, 668–669
  centralized, risks, 59
  checkups
    business acceptance disorder, 664–665
    business sponsor disorder, 662–663
    cultural/political disorder, 666
    data disorder, 663–664
    infrastructure disorder, 665–666
  custom tools, 520–522
  failure points, 93
  feedback, 669–670
  interview team, 668
  isolationist approach, 84
  management education, 670–673
  marketing system, 656–658
  performance, 682
  projects, listing, 106
  system operations planning, 654
E
E/R modeling, Chris Date on, 147
EAI (enterprise application integration), 523
early arriving facts in real time applications, 494–495
ease of use, 49, 54
  BI tool acceptability test, 30
EDM (enterprise data model), 9–10
education of senior staff, 672
EDW (enterprise data warehouse), 106. See CIF
  architectural requirements, 47
  architectures, normalized versus dimensional, 176–178
  bus matrix, 15–16
  conformed dimensions and facts, 17
  integrated, 13–21, 108, 161–164
  MDM and, 15
  practical approach, 530
  reports, 14–15
employee dimension
  best practice design, 359–365
  bridge table using natural keys, 362
  dimension outrigger example, 335
  fixed depth hierarchy compromise, 363
  human resources example, 396–400
  insurance matrix example, 160
  normalized design example, 495–497
  pathstring attribute for ragged hierarchy, 364
  reports-to bridge table, 360
  SCD processing example, 25–27
  separate reports-to dimension, 361
  telecomm matrix example, 152
  time stamps, 399–400
enterprise application integration (EAI), 523
ER (entity-relationship) models, 50, 133, 147
  normalized, 57, 62
ERP (enterprise resource planning)
  data warehouse limitations, 526
  role of, 526–528
  systems
    as primary data warehouses, 56
    relationship to data warehouse, 525
  vendors, 56
error event handler, ETL subsystem #5, 431
error event schema, recording data quality events, 463–464
error responses, 465
ESRI, GIS vendor evaluation, 384–386
  address standardizing, 385
  extending SQL for geographic queries, 386
estimating (data mining), 617
ETL (extract, transform, and load) staging systems, 62
  aggregate processing, 440
  architecture, 105
  audit dimension, 465–466
  bottlenecks, 683
  business rule screens, 463
  cleaning and conforming, 98
  column screens, 463
  custom tools, 520–522
  data quality error event handler, 431. See data quality
  data staging area, 437
  deduplicating source data, 487
  delivering, 98
  dependency analysis, 442–443
  design
    archiving requirements, 428
    BI tool interfaces, 429
    business needs, 426
    compliance, 426
    conformed dimensions, 428
    data profiling, 427
    data quality, 427. See data quality
    example, 445
    foundations, 443
    integration requirements, 428
    latency, 428
    licenses, 429–430
    lineage requirements, 428
    planning inputs, 446–447
    security, 427–428
    skills of staff, 429
    staging, 428
    tradeoffs, 434
  designer’s responsibilities, 216
  documentation, 445
  extract processing, 441
  extracting, 98
  hierarchy validation, 354, 440
  householding, 439–440
  impact analysis, 442–443
  junk dimensions and, 307
  lineage analysis, 442–443
  managing, 98
  operational resilience, 442
  planning steps, 445
  quality screens, 463
  referential integrity, 438
  requirements, 425
  self-documentation, 442
  snowflake design, 336
  structure screens, 463
  subsystems
    accumulating snapshot grain fact table loader, 432
    aggregate builder, 432
    audit dimension assembler, 431
    backup system, 433
    change data capture system, 431
    compliance reporting, 434
    conformed dimension assembler, 431
    data cleaning system, 431
    data conformer, 431
    data integration manager, 432
    data profiling system, 430
    deduplication system, 431
    dimension manager system, 432
    error event handler, 431
    extract system, 431
    fact table loader, 432
    fact table provider system, 432
    hierarchy dimension builder, 432
    impact analyzer, 433
    job scheduler, 433
    junk dimension builder, 432
    late-arriving data handler, 432
    lineage and dependency analyzer, 433
    metadata repository manager, 434
    multi-dimensional cube builder, 432
    multi-valued dimension bridge table loader, 432
    OLAP cube builder, 432
    parallelizing system, 433
    periodic snapshot grain fact table loader, 432
    pipelining system, 433
    problem escalation system, 433
    quality screen handler, 431
    recovery and restart system, 433
    SCD processor, 432
    security system, 433
    sort system, 433
    special dimension builder, 432
    surrogate key creation system, 432
    surrogate key pipeline, 432
    transaction grain fact table loader, 432
    version control system, 433
    version migration system, 433
    workflow monitor, 433
  time zones and, 434–435
  tool pros and cons, 442
  visual flow, 442
euro, special business rules, 377
event fact tables, 182
exception reports, analytic applications, 22
exceptions
  analytic application, 22
  handling, 451–452
  identification, 590, 592
explicit declaration criteria for dimensional DWs, 227
expression criteria for dimensional DWs, 229–231
extensibility
  of dimensional models, 142, 203, 206
  graceful, 9
  new data source, 205
extensible markup language. See XML
external data integration, 449–450
extract, transform, and load. See ETL
extract processing during ETL, 441
extract system, ETL subsystem #3, 431
F
fact dimension, modeling sparse facts, 281
fact provider, 17, 19–20, 164
  joint responsibilities with dimension manager, 21
  LDAP, 21
fact table grains. See grain (fact tables)
  transaction, periodic snapshot, accumulating snapshot, 193, 243–244
fact table loader, ETL subsystem #13, 432
fact table provider system, ETL subsystem #18, 432
fact tables, 37
  atomic, as core foundation, 239
  audit dimensions, 187
  combining types in periodic and accumulating snapshots, 249
  consolidated
    combining processes, 240
    example, 240–241
  cost cutting and, 682–683
  departmental views, 9
  design patterns, 273–283
  design process, 213–214
  dimensional star schema, 181
  drilling across, 185, 240. See drilling across; integration
  factless, 255–258
  flexible width, 281
  grain. See grain (fact tables)
    declaring, 30
    declaring before dimensions added, 199
    uniformity, 197
  granularity, 43–44
  instantaneous transactions, 278
  many-to-many relationships, 197
  parent/child, 262–268
  partitioning with smart date keys, 296–297
  pivoting, 282–283
  populating, 104
  primary keys, 181
  purpose of, 30
  response to measurement events, 9, 11
  rows, grouping with degenerate dimensions, 271
scalable width, 281second-level, 240size reduction with careful design, 274sparse facts, 280–282surrogate keys, 33
reader suggestions, 269where to use, 268
time stamps, 192types. See grain (fact tables)used as dimensions, 278
factless fact tables, 182, 255attendance tracking, 256automobile collisions, 257customer profiling, 259demographic tracking, 259SCDs, 258
facts, 11, 134additive facts, 12, 182, 227conformed, 15, 42–43design process, 213dimensions, data as both, 277measurement events, 11non-additive, 31, 227, 281nulls as, 277numeric measurements, 179semi-additive, 182, 227, 509, 548, 554, 639unconformed, 91user interface, 11
feedback from end users, 669–670finance, boundaries with, 5financial product dimensions, 82, 313, 338–339,
436financial services date dimension roles, 298first-level data marts, 44. See second-level;
integrationfirst-level subject area, 153. See second-level;
integrationfixed-depth hierarchy, 363–364fixed-width databases, 338FK (foreign key), 31flat file, 62, 437, 440–441, 624flexible width fact tables, 281foreign keys, nulls as, 276–277foreign keys in dimensional schemas, 31,
180–181Friedman, Thomas (The World is Flat), 474
front room, 32–33, 50–51, 651–653architecture, 560–565BI applications, 589–649metadata, 569–570
FTP based integration, 450–452fundamental fact table grains, 243–246. See grain
(fact tables)
G
general ledger account dimension in budgeting
value chain, 406general ledger (GL), 4–5
fact table example, 336tying to operational results, 5
generic dimensions, avoiding, 311geocoder for ESRI GIS parsing of addresses, 386geographic information system (GIS), link to
data warehouse, 383address standardizing, 385evaluation of ESRI GIS vendor, 384–386extending SQL for geographic queries, 386
geography dimension in a conformed EDW, 82GIS. See geographic information systemGL. See general ledgergovernance, 108
driving MDM initiatives, 520driving SOA initiatives, 513–514
graceful extensibility, 9, 43, 53, 59graceful modification criterion for dimensional
DW, 228graceful modifications to dimensional designs,
140, 142, 194–195, 309grain (fact tables), 30
accumulating snapshot grain, 32, 194capture lowest possible, 10clickstream data, 412conformed dimensions, 41declaration, 30, 182
before design begins, 200, 233precedes key definition, 235
definition, 11foundation of design, 223
as definition of business event, 237design process, 211, 223drilling across, 33
fundamental grains comparison, 243–246mismatches, mixed grain in fact table, 200mixed grain problems, 223–224periodic snapshot grain, 32, 194transaction grain, 32, 193uniform throughout each fact table, 197
grouping columns, 183drill-across reports, 185drilling down, 184drilling up, 184row headers, 186in a SELECT list, 186
growth management, 658–661
H
hardware cost, 59. See costsheader/line item designs, 262–268heterogeneous product design, 82, 313,
338–339, 436hierarchies, 351
alternate, 365–366design for maintainability, 351dimensions, multiple, 184dirty sources, 354–355drilling down, 23, 28, 187fixed depth, 197, 363–364mistakes in design, 199, 224, 334multiple in a dimension, 229, 352ragged, 229–230, 355–356
pathstring attribute, 364–365shared ownership, 358
referential integrity, 352single dimension, 224splitting into multiple dimensions, 199validation
during ETL, 440in ETL system, 354
hierarchy bridge tabledesign, 357manufacturing parts explosion, 358–359shared ownership, 358time varying, 358
hierarchy dimension builder, ETL subsystem #11, 432
hierarchy management with custom tool, 521
historical dimension rows, 488–490historical letter data warehouse, 347historically accurate attributes, lack of, 531history
preservation, data warehouse requirements, 191, 579–582
seamlessness, 61Holtzman, David, 115hot partition in real time systems. See real-time
partitionhot response cache in real time systems, 504hot-swappable dimensions, 312–314
criterion for dimensional DW, 231multi-client security, 313
householding during ETL, 379, 439, 455, 458hub and spoke architecture, 171. See CIFhuman resources case study, 396–400. See
employee dimensionhybrid approach, combining CIF and Kimball, 175hybrid SCDs (slowly changing dimensions)
combination type 1, 2, 3, 326type 1, 2, 3, 326–328type 1 + 2 tracking with natural keys in fact
table, 328type 1 fact and type 2 mini-dimension, 327type 6 combination of all three types, 327
I
impact analysis during ETL, 442impact analyzer, ETL subsystem #29, 433impact report using bridge table, 343, 346. See
correctly weighted reportimplementation cost, 59income statement fact table
allocations, 402design, 401
incompatible data, dealing with, 21, 26, 43, 45, 83, 91, 111, 142, 149, 184, 191, 207, 282, 292, 313, 339, 373, 391, 514, 523
incompatible technologies, 14, 48, 53, 56, 60, 75, 78, 290, 380, 455
increased granularity of dimension, design response, 196
indexes for DW/BI databasesB-tree indexes, 37, 269, 508
bitmap indexes, 81, 269, 325, 559, 562substring index, pattern index for high speed
searching, 350Inf*Act, Nielsen syndicated reporting, 8inheritance, line items inheriting dimensionality,
267insurance coverage limits example, 12insurance data warehouse examples, 129–130,
243–245, 257, 320–322, 389–393integration. See dimension manager; fact
provider, drill-acrossconformed dimensions and, 108, 198definition, 161–162drill-across as litmus test, 14EDW, 13, 38, 161–164MDM, 13measures, 162–163normalization and, 207–208
integration requirements affecting ETL design, 428international. See multinationalinternet. See webinterviews, 5
gathering requirements, tactic and objective, 210
interviewing techniques, 113–121, 668–670investment banking
custom hot-swappable dimensions, 312–313junk dimensions, 303
IT (information technology)boundaries, 6functions, centralizing, 79licenses, 2partnership with, 91review, 222
J
job scheduler, ETL subsystem #22, 433joins between dimensions, avoiding, 307junk dimension builder, ETL subsystem #12, 432junk dimensions
advantages, 306combining or separating, 305creating, using, maintaining, 497–499decided granularity, 305ETL choices for creating, 307
investment banking example, 303when to use, 275
K
Kay, Alan (father of personal computer), 560key performance indicator. See KPIkeys in dimensional schemas, 180keyword dimension, 347–351Kimball Approach, 99
bus architecture. See bus architectureCIF, fundamental differences, 174enterprise versus departmental, 128hybrid with CIF, 175measurement processes versus departmental
reports, 204myths, 206
Kimball Lifecycle, 96–99agile approach and, 110bottom up approach, 100business intelligence track, 99business requirements, 98data track, 98deployment, maintenance, and growth, 99diagram, 97Metaphor Computer Systems, 96–97program/project planning and management, 98technology track, 98
kitchen metaphor for DW/BI system, 65–68know-it-all users, 120–121KPI (key performance indicator), 2–5
airline example, 22compliance impact, 597conformed facts, 428
L
labels, integrating, 162. See integrationlanguages. See multinationallate-arriving
data, 19data handler, ETL subsystem #16, 432dimension records processing steps, 492fact records, processing steps, 491
latency affecting ETL design, 428latent semantic analysis for unstructured text
search, 418
launching BI environment, 652LDAP (lightweight directory access protocol)
managing role enabled security, 578server, dimension manager and fact provider
responsibilities, 21lean times DW fitness program, 680–684legacy data formats, resolving inconsistent, 618legal department, boundaries, 6licenses
cost cutting and, 682for software and systems, affecting ETL design,
429lightweight methodologies, 109. See agileline items, inheriting dimensionality, 267lineage, 468
analysis during ETL, 442and dependency analyzer, ETL subsystem #29,
433requirements affecting ETL design, 428
Linoff, Gordon, 600, 625locked data, 92log scraping for change data capture, 453LSA (latent semantic analysis), 418
M
management education, 670–673many-to-many bridge tables, 335many-to-many dimension relationships, resolve
in fact table, 197many-to-many relationships, prevalence of, 146many-to-one relationships
outriggers, 334resolve in dimension table, 197snowflaking and, 334
MapObjects Visual Basic tool for GIS, 384market basket analysis, 420–424
degenerate dimensions, 271proposed dimensional design, 420
marketing of the DW/BI system, 656–658marquee applications, 598master data management. See MDMMastering Data Mining, The Art and Science of
Customer Relationship Management (Berry and Linoff), 600, 625
matrix. See bus matrix
MDM (master data management), 13, 353business value, 515centralized enterprise source, 519–520deployment steps, 520dimension manager role, 514EDW and, 15importance of data governance, 520integration hub, 517–518need for, 516and SOA with agile development, 111solving data disparity, 516source system disparities, 515supported from data warehouse, 516–517three approaches, 516
MDX (multidimensional expressions), 648measured facts, new, design response, 195measurements, fact tables, 11, 104, 179
reports and, 204snapshots. See grain (fact tables)
measures, integration. See conformed factsmedia, formats in data warehouse, 55
archiving, preservation and, 580medical information privacy, 573MERGE command (SQL) for SCD processing,
499merge-sort, drilling across, 190meta meta data (data about metadata), 566metadata, 613–614
complete list for data warehouse, 567data mining, 629data warehouse scope, 566management tasks, 567management tools, 570–572repository manager, ETL subsystem #34, 434strategy recommendations, 571
Metaphor Computer Systems, 96–97Microsoft Analysis Services 2005, 553–555Microsoft SQL Server 2005, data warehouse
architecture guidelines, 554Microsoft SQL Server 2008
database compression, 556–558new features, 556star schema optimization, 559table partitioning, 558
migrating from disparate data to centralized, 170mini-dimension tables, 367
customer attributes, 367demographics example, 320linking to primary dimension through fact
table, 322monster dimensions, 320overwrite and, 326–327
mistakes in building a DW/BI system, 100mixed grain problems, double counting, 223,
227model alternatives, analytic applications step,
590, 592–593monolithic approach, 38monster dimensions, rapidly changing, 320multi-dimensional cube builder, ETL subsystem
#20, 432multi-pass SQL, 150. See drill-acrossmulti-valued dimension
health care diagnoses, 341multi-valued dimension bridge table loader, ETL
subsystem #15, 432multinational data, 374–388
addresses, 475calendars, 376–377, 476character sets, 475compliance, 476consistency criterion for dimensional DW, 229cultures, 475currencies, 377–378, 476customer information in real time applications,
379data warehouse design considerations, 387dimension translation, 387euro, 377–378geographies, 475languages, 475names, 475names and addresses, 378–383
Atkinson, Toby, 381cultural correctness, 379
numbers, 476postal address formats, 380quality architecture, 477quality issues, 474reporting issues, 566salutations, 475time zones, 476, 477
multiple dimension hierarchies criterion for dimensional DW, 229
multiple dimension roles criterion for dimensional DW, 230
multiple hierarchies in a dimension, 184, 351multiple valued dimensions criterion for
dimensional DW, 230myths, 8–10, 38, 143–144, 201–203
atomic data should be normalized, 239dimensional models pre-suppose the business
question, 238facts and fables, 204–208
N
N-tiling, 636name and address processing, 439
cultural correctness, 379international design issues, 378–383
naming conventions, 220–221, 450natural keys, 18. See durable keys
bridge table, 362–363in fact table for type 1 and type 2 tracking,
328–331problems with, 287in surrogate key pipeline, 482surrogate keys, 285
navigation, aggregate navigation, 32–33network database design, 393Nielsen syndicated reporting. See Inf*Actnon-additive numeric fact, 281
bad design example, 213computing from additive facts, 264handling in BI tool, 635–639handling in OLAP, 548summarizing across time, 293
non-behavior, explicit records for, 260non-existence of events, techniques for querying,
259nonconformed dimensions, conforming,
676–677nonexistent users, 121normalization, integration and, 207–208normalized data models, 9, 12, 133, 137. See ER
BI queries, 144complexity and BI, 146
compared to dimensional model, 134creating dimensional views, 77uniqueness or completeness, 57, 146
normalized data warehouse. See CIFnormalized data warehouse, lack of procedure
for slowly changing dimensions, 177normalized EDW not for business intelligence,
176normalized hierarchy disadvantages, 224normative model, 60–61. See descriptive modelNOT EXISTS
missing attributes, 262what didn’t happen, 261
nullsas dimension attributes, 277as fact table foreign keys, 276–277as facts, 277
numbers, international data, 476
O
objection removers, 76, 77
aggregates, 78applications integrators, 78backups, 79centralized customer management system, 78centralizing IT functions, 79larger problem and, 77recognizing, 77security, 79solutions for, 77
ODS, operational data store hot cache, 504offline delays during ETL processing, 502–503OLAP (online analytical processing), 17, 46
advantages versus dimensional relational, 551analytic syntax, 551catastrophic invalidation with SCD Type 1, 552cube builder, ETL subsystem #20, 432data cube, 63desktop versus server, 547dimension limitations, 325versus dimensional relational advantages, 550versus dimensional relational disadvantages,
551dimensions comparison with ROLAP
dimensions, 547
disadvantages versus dimensional relational, 552
implementing aggregations via strong hierarchies, 548
major advantages, 548–549as major data warehouse component, 546versus ROLAP, final deployment choice,
549–553SCDs contrasted with ROLAP SCDs, 548security scenarios, 551sensitivity to type 1 SCD, 25similarity to star schemas, 63SQL-99 extensions, 645–649time constraints contrasted with ROLAP, 548
Olson, Jack (Data Quality: The Accuracy Dimension), 427
OLTP (online transaction processing), 36data warehouse systems, 37models, 137
on-the-fly behavior dimensions criterion for dimensional DW, 231
on-the-fly fact range dimensions criterion for dimensional DW, 231
online analytical processing. See OLAPonline transaction processing. See OLTPoperating procedures, 655–656operational systems back pointers, 487–488operations phase of data mining, 628–629operators, RegExp, 479opportunity matrix, 158
processes versus departments, 130OR queries, 349–350outrigger dimension, 135–136, 334–335
cautions, 224date dimension as, 292, 299time dimension as, 292variation of snowflaking, 224–225, 336–339
overbooked users, 120overwriting, type 1 SCD, 25–26, 317overzealous users, 120
P
packaged applications
avoiding stovepipes, 522–523data warehouses and, 522–524, 529
page events in clickstream dimensional design, 412–417
parallel communication paths, catastrophic failure and, 577
parallelizing system, ETL subsystem #31, 433paralysis of project, 84–85parent-child fact tables, 262–268
degenerate dimensions, 264design alternatives, 263
partitioningfact tables with smart date keys, 296–297real time design, 507–510surrogate keys and, 297table partitioning, 558tricks to minimize offline time, 502type 2 SCD, 316
partnership between IT and business, 91parts adding up to whole, 48, 53. See distributed
architecturepathstring attribute for ragged hierarchy, 364pattern index for high speed searching, 350payments fact table as part of budgeting value
chain, 404performance guidelines of web-oriented data
warehouse, 562periodic snapshot grain. See grain (fact tables)periodic snapshot grain fact table loader, ETL
subsystem #13, 432periodic snapshot grain real time partition, 509personal data
ownership, 574uses and abuses, 573
personnel, staffing team, 70, 217–218pipeline processes, accumulating snapshots, 246.
See grain (fact tables)pipelining system, ETL subsystem #31, 433pivoting fact table with fact dimension, 282–283P&L (profit and loss) fact table, 401–402, 436playbooks for all operations, 656populating dimensional models, 238predicting (data mining), 617presentation area, 51, 62–63, 67preservation. See digital preservationprimary keys in dimensional models, 181prioritization grid, benefit versus feasibility, 131
privacyconcerns from RFID tags, 534–535data warehouse architecture and, 575information transfer and, 476tradeoffs in data warehouses, 572
private attributes in conformed dimensions, 16, 19problem escalation system, ETL subsystem #30,
433problem resolution in web-oriented data
warehouse, 565process-centric rows in bus matrix, 156–157process steps, data warehouse design, 210process streamlining in web-oriented data
warehouse, 564processes versus departments, 123procurement pipeline, accumulating snapshot
example, 241–242product dimension, conformed in an EDW, 82production keys, problems with, 287production (source) transaction processing
systems, 62profitability case study, 400–403profitability fact tables, allocations, 402, 436progressive subsetting queries, 642promotion dimension
design example, 308design recommendations, 310
promotion profitability, 311promotion tracking, factless fact table, 257provenance, lineage, 468pruning algorithm in market basket analysis, 423publishing metaphor for data warehouse
manager, 58, 70, 73publishing reports, 590–591purchase behavior privacy, 573
Q
quality culture, 461–462. See data quality
architecture articlesquality screen handler, ETL subsystem #4, 431quality screens in ETL architecture, 463queries, BI
AND, 349–350behavioral, 642browse queries, 135
decomposition, 639. See drill-across reports; drilling across
drill-across operations, 190. See drill-across reports; drilling across
features for query tools needed, 638–649hot-swappable dimensions, 313OR, 349–350performance
cost when too slow, 92priorities for improving, 201
SQL, categories, 641–642query time dimension conforming, goals, 523
R
ragged dimension hierarchies criterion for
dimensional DW, 229ragged hierarchies. See hierarchies
bridge table solution, 355–358pathstring attribute solution, 364–365recursive pointer problems, 357
rapid deployment, 47, 53rating scheme for dimensional DWs, 226real-time architectures, 503–509
customer information in multinational applications, 379
late arriving dimensions, 494real-time partitions, 507
real-time partition design, 507accumulating snapshot grain, 509–510periodic snapshot grain, 509transaction grain, 508
real-time triage, judging user requirements, 510–511
realignment, business, 667reason code, SCD2, 330–332. See SCDsreassuring users, in web-oriented data
warehouse, 565recency, frequency, intensity. See RFIrecovery and restart system, ETL subsystem #24,
433recursive pointer
problems, modeling ragged hierarchies, 357replaced by hierarchy bridge table, 357
redundancy, data3NF and, 138reducing, 679–680
reference dimensions, 272–273referential integrity
in dimensional schemas, 181, 228enforcing during ETL, 438handling nulls, 276, 295in hierarchies, 352
regular expressions (RegExp)for data cleaning, 477–481operators, 479uses, 480–481
relational databases, business rules, Chris Date, 137, 147
relational modelsdimensional models and, 9, 181EDM and, 10
relational online analytical processing. See ROLAP replicating conformed dimensions, 18, 485–486reporting
accuracy testing, 608analytic application, 22custom tool, 521dashboard development, 612–613deployment, 609development, 607documentation, 604–605EDW, 14–15maintenance, 609management, 609measurements and, 204navigation framework, 606performance testing, 608portal development, 610–612presentation area, 51, 62–63, 67publishing, 590, 591replication, 606report creation, 602–608reporting portal, 652specifications, 604–605standard, 602system design, 603–606target report list, 603template, 604user review, 606users’ involvement, 610
response time to data warehouse queries, 49, 54, 56. See queries, BI
responsibilities of DW/BI teamdata warehouse manager, 70team members, 217
results, preventing irrelevant, 59return on investment. See ROIreview and validate design, 221RFI (recency, frequency, intensity)
behavior tags, 337, 368–369definitions, 369
RFID (radio frequency identification) tagsapplication examples, 533impacting personal privacy, 534sequential behavior analysis, 534smart dust, 535tracked in data warehouse, 533
ROI (return on investment), data warehouse, 93ROLAP (relational online analytical processing),
48ROLAP versus OLAP, final deployment choice,
549–553role playing dimensions, 10, 300, 312
telecomm example, 301transportation example, 301in voyage and network designs, 395
rolling date reporting, 252rolling operational results, tying to GL, 5row change reason code, ETL. See SCDs, type 2row headers. See grouping columnsrow labels in dimension tables, 198rules for dimensional modeling, 196
S
sabotage, 576SANs (storage area networks)
as counter to security catastrophes, 578data warehouse and, 585typical configuration, 586
Sarbanes-Oxley Act, 596satisfaction metrics
chaotic lists, 373design alternatives, 371simultaneous dimension and fact, 372standard fixed list, 371
scalable width fact tables, 281scaling out, scaling up a data warehouse, 584
SCD processor, ETL subsystem #9, 432SCDs (slowly changing dimensions), 315–332
comprehensive overview, 24criterion for dimensional DW, 230delaying dealing with, 199dimension manager responsibilities, 18factless fact tables, 258handling, 193hybrid combinations
type 1 + 2 tracking with natural keys in fact table, 328–329
type 1 fact and type 2 mini-dimension, 327type 6 combination of all three types,
327–328MERGE command (SQL), 499–501place in dimensional modeling, 322processing in ROLAP, 438
with OLAP, 548rapidly changing, mini-dimensions, 323slowly changing entities, normalized time
variance tracking, 495–497strategies for, 225–226too fast, 324type 1 (overwrite), 25–26type 2 (new dimension record), 26–27
begin- and end-effective time stamp, 193, 323change description, 193most recent flag, 193reason codes, 330–332
type 3 (new field), 27scorecards and dashboards, 612–613screens, data quality. See data quality
architecture articlesSCRUM, 109SDE (spatial database engine)
ESRI GIS semantics extender for SQL, 386searches
pattern index, 350–351substrings, 350
seasonal fluctuations, removing when testing data quality, 474
second-level subject area, 44, 155fact tables, 240profitability design, 400risks, profitability and satisfaction, 102
second normal form, dimension tables, 181
security, 2, 3architecture, 83catastrophes
categories, 576–577techniques for countering, 577–578
EDWs, 83ETL design, 427–428ETL subsystem #32, 433management with custom tool, 521objection removers, 79scenarios with OLAP, 551technique for multiple clients, hot-swappable
dimensions, 313self-documenting code, 601semi-additive numeric facts, 182, 293
BI application handling techniques, 639declaring in metadata, 227OLAP handling advantages, 548, 554real time partition handling, 509
sequential behavior analysis using RFID tags, 24, 597–598
sequential computations in BI tool, 635–638server configuration choices for data warehouse,
583service accounts versus personal DBA accounts,
655service oriented architecture. See SOAsession type dimension in clickstream
dimensional design, 415Seybold, Patricia (Customers.Com), 525shadow functions, office anthropology, 115shapefiles, GIS data object for boundaries and
areas, 385shared ownership, hierarchy bridge table, 358shrunken dimension tables in aggregate
architecture, 540–542similarity metrics for unstructured text,
417–420six sigma data quality, 467skills, for DW/BI team, 93SLA (service level agreement), 655slowly changing dimensions. See SCDssmart dust. See RFID tagssmart keys
date keys for partitioning fact tables, 296–297dimensions, not for fact table joins, 200
disadvantages, 286problems in data warehouse, 288
snapshots, periodic, accumulating. See grain (fact tables)
snowflaked dimension tables, 135, 181classic design, 336complex calendar dimension, 339context-dependent, 338definition, 333financial product dimension, 338impact on usability, 104large custom dimension, 337
snowflakingas alternative to dimensional model, 143disk space and, 224as DM alternative, 143–144outriggers, 224
SOA (service oriented architecture)agile development, 111data warehouse and, 513–515services defined for dimension manager, 514
software development manager, lessons learned, 601
sort-merge, drilling across, 190sort system, ETL subsystem #28, 433sparse facts
fact dimension, 281wide fact tables, 280–282
sparsity tolerance criterion for dimensional DW, 228
spatial database engine. See SDEspecial dimension builder, ETL subsystem #12, 432sponsor from business, 86–89SQL-92, flexibility of, 645SQL-99, OLAP extensions, 645–649SQL (Structured Query Language)
CASE expression, 633comparisons, 631drill across, 629–631as interim language, 631MERGE for SCD processing, 499multi-pass SQL, 150queries, categories, 641–642
staffing dimensional modeling team, 70, 216–217, 429
skills development, 93
staging area, 62, 66. See archivingaffecting ETL design, 428
standard deviation used for data quality estimating, 472
standard reports, 602. See reportingstar join model, relationship to dimensional
model, 139star schema optimization in Microsoft SQL
Server 2008, 559star schemas
fact tables, 181OLAP data cubes and, 63–64
Star Workstation, Xerox, 57statistical analysis as part of data mining, 616steering committees, 672–673stovepipes, 38–39
avoiding, 522–523converting to architected dimensional data
marts, 45strategic business initiatives, 127
matrix, 158–159street segment data, TIGER Census Department,
386structure screens, 122
in ETL data quality architecture, 463sub-types and super-types. See heterogeneous
product designsubject area groups in conformed dimension
design, 154subject areas
first level, 153second-level, 155
substring searching in keyword list, 350subtransactions describing behavior, 368sunsetting older environments, 681super-types and sub-types. See heterogeneous
product designsurrogate key administration criterion for
dimensional DW, 229surrogate key creation system, ETL subsystem
#10, 432surrogate key pipeline, 20, 26, 481–485
ETL subsystem #14, 432inserting surrogate keys, 482
surrogate keys, 109, 285–289advantages, 225, 285–286
bridge tables, 344–345, 360–361creating, 677–678dimension manager responsibilities, 18dimension table primary keys, 198example used incorrectly, 289fact tables, 33required by type 2 SCD, 26fact tables
reader suggestions, 269where to use, 268
natural keys, 285partitioning and, 297uncertainty, 287
surveillance privacy, 573
T
table partitioning. See partitioningtape recorders during requirements gathering,
115TCO (total cost of ownership) of data
warehouse, 89telecomm bus matrix, 152telecomm dimensional roles example, 301telephone system comparison, 60text document searching, 417–420text field problems in fact table, 224text in fact tables, removal techniques, 275text facts
recency, frequency, intensity behavior tags, 369recommended design, 370
The Data Warehouse Lifecycle Toolkit (Kimball, et al), 97
The Transparent Society: Will Technology Force Us to Choose Between Privacy and Freedom? (Brin), 574
The World is Flat (Friedman), 474third normal form, fact tables, 181. See 3NFTIGER census department data, USA street
segments, 386time constraints, ultra precise, 251time dimension, 192
bad design, 293incompatible rollups, 292keys, 293as outrigger dimension, 292
recommended design, 294role playing, 298
time spans created by transactions, 250time stamps
begin- and end-effective, 251, 289type 2 SCD, 323
bridge tables, 345–346employee dimension table, 399fact tables, 192to nearest second, 490time zones and, 375
time variance in dimensions. See slowly changing dimensions
time zone discovery (www.timezoneconverter.com), 477
time zonesETL system tradeoff, 434international data, 476, 477synchronizing, 374–376
top-down design, dimensional modeling, 135. See bottom-up approach
total cost of ownership. See TCOtraining
data subsets used in data mining, 620–629DW/BI business users, 101, 652
transaction grain fact table, 32, 193, 243–244. See grain (fact tables)
transaction grain fact table loader, ETL subsystem #13, 432
transaction grain real time partition. See real-time partition design
transaction processing models, 137Transaction Processing Performance Council, 37transaction workloads in data warehouse, 532translations in multinational data warehouse,
387, 477transportation database design, 301, 393travel case study, 393–396trust building in web-oriented data warehouse,
61type 1, type 2, type 3, type 6 SCDs. See SCDs
U
uncertainty, encoding with surrogate keys, 287unconformed dimensions and facts, 91
UNICODE character set, 475multinational information, 380
units of measure, conflicts, 435university admissions, accumulating snapshot
example, 247unstructured text applications, 420
LSA (latent semantic analysis), 418similarity metrics, 417–418
unstructured text fact table, 417user-focused cognitive and conceptual models,
91user interface, 57
advances driven by the Web, 561design, 56
BI tools, 28, 57, 91dimensions, 11drilling down, 188facts, 11guidelines for web-oriented data warehouse,
562poorly performing, 92urgency, 561WYSIWYG (what you see is what you get), 560
user typesabused, 119boundaries, 5clueless, 121comatose, 120control, 107know-it-all, 120–121nonexistent, 121overbooked, 120overzealous, 120
V
version control, 71, 215, 450version control system, ETL subsystem #25, 433version management
audit dimension, 466–470fact and dimension tables, 19–21, 25, 163–164,
313, 344, 408, 450version migration system, ETL subsystem #26,
433voyage database design, 393
W
waterfall development approach compared to
agile approach, 107waterfall development risks, 102web-oriented data warehouse, 48–51, 55
choice presentation, 563–564dimensional design, 410distracted avoidance, 564page object dimension, 415performance guidelines, 562–563problem resolution, 565–566process streamlining, 564–565reassuring users, 565session modeling, protocol analysis, 413, 416user interface guidelines, 562–566visitor dimension, 414web page characteristics, 409–413
weighting factor in bridge tables, 342what didn’t happen, techniques for finding, 259what if analysis, analytic applications, 22workflow monitor, ETL subsystem #27, 433worksheets during design phase, 219WYSIWYG (what you see is what you get) user
interfaces, 560
X
X-11 ARIMA statistic for data quality testing, 474Xerox PARC, birthplace of personal computer, 560XML (extensible markup language), 8
data warehouse integration, 523XP (Extreme Programming), 109