Data Warehouse Chapter 11
Multiple Files Problem
- Added complexity of multiple source files
- Start simple
Diagram labels: Multiple source files; Logic to detect correct source; Extracted data
Missing Values Problem
Solution:
- Ignore
- Wait
- Mark rows
- Extract when time-stamped
Example: If NULL then Field = 'A'
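The "mark rows" option can be sketched in a few lines of Python. This is only an illustration of the slide's "If NULL then Field = 'A'" rule; the function name `mark_missing` and the sample rows are hypothetical.

```python
from typing import Optional

DEFAULT_MARK = "A"  # the slide's example marker; any agreed flag value would do


def mark_missing(value: Optional[str], mark: str = DEFAULT_MARK) -> str:
    """Return the marker when the source value is NULL (None), else the value."""
    return mark if value is None else value


# Hypothetical extracted rows; the second one has a missing field.
rows = [{"id": 1, "field": "X"}, {"id": 2, "field": None}]
marked = [{**row, "field": mark_missing(row["field"])} for row in rows]
```

Marking keeps the row in the load while leaving a visible flag that the value was not supplied by the source.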
Duplicate Value Problem
Solution:
- SQL self-join techniques
- RDBMS constraint utilities
    SELECT …
    FROM table_a, table_b
    WHERE table_a.key(+) = table_b.key
    UNION
    SELECT …
    FROM table_a, table_b
    WHERE table_a.key = table_b.key(+)
Illustration: the value "ACME Inc" repeated in five rows.
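One way to apply a self-join to the duplicate problem is sketched below, using Python's built-in sqlite3 module rather than the Oracle syntax on the slide. The `customer` table, its columns, and the "Bigtown Pizza" row are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO customer (id, name) VALUES (?, ?)",
    [(1, "ACME Inc"), (2, "ACME Inc"), (3, "Bigtown Pizza"), (4, "ACME Inc")],
)

# Self-join: pair each row with every later row that carries the same name.
duplicates = conn.execute(
    """
    SELECT a.id, b.id, a.name
    FROM customer a
    JOIN customer b ON a.name = b.name AND a.id < b.id
    ORDER BY a.id, b.id
    """
).fetchall()
```

The `a.id < b.id` condition stops a row from matching itself and reports each duplicate pair only once.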
Element Meaning Problem
- Avoid misinterpretation
- Complex solution
- Document meaning in metadata
Diagram labels: Customer's name; All customer details; All details except name
Referential Integrity Problem
Solution:
- SQL anti-join
- Server constraints
- Dedicated tools

Department
10
20
30
40

Emp    Name    Department
1099   Smith   10
1269   Jones   20
1270   Doe     50
6787   Harris  60
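The anti-join can be tried against the slide's own data. The sketch below uses sqlite3 and a `NOT EXISTS` subquery, which is one common way to write an anti-join; the table and column names are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE department (dept_id INTEGER PRIMARY KEY)")
conn.execute(
    "CREATE TABLE employee (emp_id INTEGER PRIMARY KEY, name TEXT, dept_id INTEGER)"
)
conn.executemany("INSERT INTO department VALUES (?)", [(10,), (20,), (30,), (40,)])
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?)",
    [(1099, "Smith", 10), (1269, "Jones", 20), (1270, "Doe", 50), (6787, "Harris", 60)],
)

# Anti-join: keep only employees whose department has no matching parent row.
orphans = conn.execute(
    """
    SELECT e.emp_id, e.name, e.dept_id
    FROM employee e
    WHERE NOT EXISTS (SELECT 1 FROM department d WHERE d.dept_id = e.dept_id)
    ORDER BY e.emp_id
    """
).fetchall()
```

With the rows above, the query returns Doe and Harris, whose departments 50 and 60 do not appear in the department list.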
Name and Address Problem
- No unique key
- Missing values
- Personal and commercial names mixed
- Different addresses for same member
- Different names and spelling for same number
- Many names on one line
- One name on two lines
Name and Address Problem
- Single-field format:
  Mr.J.Smith, 100 Main St., Bigtown, County Luth, 23565
- Multiple-field format:
  Name:   Mr.J.Smith
  Street: 100 Main St.
  Town:   Bigtown
  County: County Luth
  Code:   23565
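Moving from the single-field to the multiple-field format can be sketched as a simple split. This assumes every address has exactly five comma-separated parts in the slide's order, which real data rarely guarantees; `split_address` and the lowercase field names are hypothetical.

```python
from typing import Dict

# Field order taken from the slide: Name, Street, Town, County, Code.
FIELDS = ("name", "street", "town", "county", "code")


def split_address(line: str) -> Dict[str, str]:
    """Split a single-field address into named fields, rejecting odd shapes."""
    parts = [part.strip() for part in line.split(",")]
    if len(parts) != len(FIELDS):
        raise ValueError(
            f"expected {len(FIELDS)} comma-separated parts, got {len(parts)}"
        )
    return dict(zip(FIELDS, parts))


record = split_address("Mr.J.Smith, 100 Main St., Bigtown, County Luth, 23565")
```

Raising on an unexpected number of parts sends problem rows to exception handling instead of loading shifted fields.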
Clean and Organize
1. Create atomic values.
2. Standardize formats.
3. Verify data accuracy.
4. Match with other records.
5. Identify private and commercial addresses and inhabitants.
6. Document in metadata.
Requires sophisticated tools and techniques
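Steps 2 and 4 above can be illustrated with a deliberately naive sketch: normalize case, punctuation, and spacing, then compare the results. Dedicated tools go much further (abbreviations, phonetic matching, address validation); the helper names here are hypothetical.

```python
import re


def standardize(name: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]", " ", name.lower())
    return " ".join(cleaned.split())


def same_entity(a: str, b: str) -> bool:
    """Treat two names as a match when their standardized forms are equal."""
    return standardize(a) == standardize(b)
```

For example, `same_entity("ACME, Inc.", "acme  inc")` is true, while "ACME Inc" and "ACME Incorporated" would still need a rule for abbreviations.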
Merging Data
- Operational transactions do not usually map one-to-one with warehouse data
- Data for the warehouse is merged to provide information for analysis
Sale 1/2/98 12:00:01 Ham Pizza $10.00
Sale 1/2/98 12:00:02 Cheese Pizza $15.00
Sale 1/2/98 12:00:02 Anchovy Pizza $12.00
Return 1/2/98 12:00:03 Ham Pizza -$12.00
Sale 1/2/98 12:00:04 Sausage Pizza $11.00
Pizza sales/returns by day, hour, and second
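As a rough sketch of merging, the five transactions above can be rolled up to one net amount per item per day. The grouping rule is an assumption chosen for illustration, not necessarily the merge the slides intend; `merge_by_day` is a hypothetical name.

```python
from collections import defaultdict
from decimal import Decimal

# The five operational transactions listed on the slide.
transactions = [
    ("Sale", "1/2/98 12:00:01", "Ham Pizza", Decimal("10.00")),
    ("Sale", "1/2/98 12:00:02", "Cheese Pizza", Decimal("15.00")),
    ("Sale", "1/2/98 12:00:02", "Anchovy Pizza", Decimal("12.00")),
    ("Return", "1/2/98 12:00:03", "Ham Pizza", Decimal("-12.00")),
    ("Sale", "1/2/98 12:00:04", "Sausage Pizza", Decimal("11.00")),
]


def merge_by_day(rows):
    """Sum amounts per (day, item), so a return offsets the matching sale."""
    totals = defaultdict(Decimal)  # Decimal() is zero, so missing keys start at 0
    for _kind, stamp, item, amount in rows:
        day = stamp.split()[0]  # drop the time of day
        totals[(day, item)] += amount
    return dict(totals)


daily = merge_by_day(transactions)
```

Five operational rows become four warehouse rows, which shows why the mapping is not one-to-one.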
Merging Data
Sale 1/2/98 12:00:01 Ham Pizza $10.00
Sale 1/2/98 12:00:02 Cheese Pizza $15.00
Sale 1/2/98 12:00:02 Anchovy Pizza $12.00
Return 1/2/98 12:00:03 Ham Pizza -$12.00
Sale 1/2/98 12:00:01 Ham Pizza $10.00
Sale 1/2/98 12:00:02 Cheese Pizza $10.00
Sale 1/2/98 12:00:04 Sausage Pizza $11.00
Sale 1/2/98 12:00:04 Sausage Pizza $11.00
Adding a Date Stamp
- Enables time analysis
- Label loaded data with a date stamp
- Add time to fact and dimension data
Adding a Date Stamp
Sales Fact Table: Item_id, Store_id, Time_key, Sales_dollars, Sales_units
Store Table: Store_id, District_id, Time_key
Item_Table: Item_id, Dept_id, Time_key
Time Table: Week_id, Period_id, Year_id, Time_key
Product Table: Product_id, Time_key, Product_desc
Adding a Date Stamp
- Fact table
  - Add triggers
  - Recode applications
  - Compare tables
- Dimension table
- Time representation
  - Point in time
  - Time span
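A point-in-time stamp can be applied while rows are staged. The sketch below assumes an integer `time_key` in YYYYMMDD form, which is only one possible encoding; the function name, sample row, and load date are hypothetical.

```python
from datetime import date


def add_date_stamp(rows, load_date: date):
    """Return copies of the rows with a time_key derived from the load date."""
    time_key = int(load_date.strftime("%Y%m%d"))  # e.g. 19980102
    return [{**row, "time_key": time_key} for row in rows]


# Hypothetical staged fact row using the column names from the schema slide.
staged = [{"item_id": 1, "store_id": 7, "sales_dollars": 10.0, "sales_units": 1}]
stamped = add_date_stamp(staged, date(1998, 1, 2))
```

Returning new dictionaries leaves the staged rows unchanged, so a failed load can be rerun with a different stamp.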
Adding Keys to Data
#1 Sale 1/2/98 12:00:01 Ham Pizza $10.00
#2 Sale 1/2/98 12:00:02 Cheese Pizza $15.00
#3 Sale 1/2/98 12:00:02 Anchovy Pizza $12.00
#4 Sale 1/2/98 12:00:03 Ham Pizza -$12.00
#dw1 Sale 1/2/98 12:00:01 Ham Pizza $10.00
#dw2 Sale 1/2/98 12:00:02 Cheese Pizza $10.00
#dw3 Sale 1/2/98 12:00:04 Sausage Pizza $11.00
#5 Sale 1/2/98 12:00:04 Sausage Pizza $11.00
Data values or artificial keys
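Artificial keys like the #dw1, #dw2, #dw3 labels above can be produced by a simple counter. This is a minimal sketch; in practice a database sequence or identity column would usually generate them, and `assign_keys` is a hypothetical helper.

```python
from itertools import count


def assign_keys(rows, prefix="dw", start=1):
    """Pair each row with an artificial key such as '#dw1', '#dw2', ..."""
    counter = count(start)
    return [(f"#{prefix}{next(counter)}", row) for row in rows]


# The three warehouse rows shown on the slide.
warehouse_rows = [
    "Sale 1/2/98 12:00:01 Ham Pizza $10.00",
    "Sale 1/2/98 12:00:02 Cheese Pizza $10.00",
    "Sale 1/2/98 12:00:04 Sausage Pizza $11.00",
]
keyed = assign_keys(warehouse_rows)
```

An artificial key stays stable even if the data values that would otherwise identify the row are later corrected.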
Summarizing Data
- During extraction on staging area
- After loading onto the warehouse server
Diagram labels: Operational databases; Staging area; Warehouse database
Maintaining Transformation Metadata
- Contains transformation rules, algorithms, and routines
Diagram labels: Sources; Stages; Rules; Publish; Extract; Transform; Load; Query
Transformation Timing and Location
- Transformation is performed:
  - Before load
  - In parallel
- May be initiated at different points
Diagram labels: Unlikely; Probable; Possible
Choosing a Transformation Point
* Workload
* Network bandwidth
* Environment
* Parallel execution
* CPU use
* Load window time
* Disk space
* User information needs
Monitoring and Tracking
Transformations should:
- Be self-documenting
- Provide summary statistics
- Handle process exceptions
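The last two requirements can be sketched as a small wrapper that counts rows and sets aside the ones that fail instead of aborting the run. `RunStats` and `run_transformation` are hypothetical names, and catching every `Exception` is a simplification for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Tuple


@dataclass
class RunStats:
    """Summary statistics for one transformation run."""
    processed: int = 0
    succeeded: int = 0
    rejected: List[Tuple[object, str]] = field(default_factory=list)


def run_transformation(rows: Iterable, transform: Callable):
    """Apply transform to each row, recording failures rather than stopping."""
    stats = RunStats()
    output = []
    for row in rows:
        stats.processed += 1
        try:
            output.append(transform(row))
            stats.succeeded += 1
        except Exception as exc:  # broad on purpose: log the row and continue
            stats.rejected.append((row, repr(exc)))
    return output, stats


out, stats = run_transformation(["10.00", "n/a", "11.00"], float)
```

Here the unparseable "n/a" row lands in `stats.rejected` with its error text, and the two valid rows are still converted.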
Designing Transformation Processes
- Analysis:
  - Sources and target mappings, business rules
  - Key users, metadata, grain
- Design options: PL/SQL, replication, custom, third-party tools
- Design issues:
  - Performance
  - Size of the staging area
  - Exception handling, integrity maintenance
Data Management, Quality, and Auditing Tools
- Data management:
  - Innovative Systems
  - Postalsoft
  - Vality Technology
- Data quality and auditing:
  - Innovative Systems
  - Vality Technology