Getting the data into the warehouse: extract, transform, load
![Page 1: Getting the data into the warehouse: extract, transform, load](https://reader035.fdocuments.us/reader035/viewer/2022062321/568133d2550346895d9aca9f/html5/thumbnails/1.jpg)
Getting the data into the warehouse: extract, transform, load
MIS2502 Data Analytics
![Page 2: Getting the data into the warehouse: extract, transform, load](https://reader035.fdocuments.us/reader035/viewer/2022062321/568133d2550346895d9aca9f/html5/thumbnails/2.jpg)
Getting the information into the data mart
Now let’s address this part…
![Page 3: Getting the data into the warehouse: extract, transform, load](https://reader035.fdocuments.us/reader035/viewer/2022062321/568133d2550346895d9aca9f/html5/thumbnails/3.jpg)
Extract, Transform, Load (ETL)
• The process of copying data from the transactional database to the analytical database
• Going from relational to dimensional
• Basically, it’s a matter of identifying where the data should come from to fill the data mart
![Page 4: Getting the data into the warehouse: extract, transform, load](https://reader035.fdocuments.us/reader035/viewer/2022062321/568133d2550346895d9aca9f/html5/thumbnails/4.jpg)
ETL Defined
![Page 5: Getting the data into the warehouse: extract, transform, load](https://reader035.fdocuments.us/reader035/viewer/2022062321/568133d2550346895d9aca9f/html5/thumbnails/5.jpg)
The Actual Process
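[Diagram: queries extract data from Transactional Database 1 and Transactional Database 2; each extract passes through data conversion; further queries load the converted data into the Data Mart. Stages, left to right: Extract, Transform, Load.]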
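The three stages above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two in-memory lists stand in for the transactional databases, and the column names (`sale_date`, `amount`, `total_cents`), the `sales_fact` table, and the date formats are hypothetical, not taken from the slides.

```python
import sqlite3

# Hypothetical rows standing in for two transactional databases.
SOURCE_1 = [
    {"Customer_ID": 1, "sale_date": "2014-03-01", "amount": "19.99"},
    {"Customer_ID": 2, "sale_date": "2014-03-02", "amount": "5.00"},
]
SOURCE_2 = [
    {"Account_Number": "A-7", "date": "03/02/2014", "total_cents": 1250},
]


def extract():
    """Extract: query each source (here, simply copy the in-memory rows)."""
    return list(SOURCE_1), list(SOURCE_2)


def transform(rows_1, rows_2):
    """Transform: convert both sources to one shared format.

    Dates become ISO strings and money becomes integer cents.
    """
    converted = []
    for r in rows_1:
        cents = round(float(r["amount"]) * 100)
        converted.append(("db1", str(r["Customer_ID"]), r["sale_date"], cents))
    for r in rows_2:
        month, day, year = r["date"].split("/")
        iso_date = f"{year}-{month}-{day}"
        converted.append(("db2", r["Account_Number"], iso_date, r["total_cents"]))
    return converted


def load(conn, facts):
    """Load: write the converted rows into the (hypothetical) data mart table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales_fact "
        "(source TEXT, customer_key TEXT, sale_date TEXT, amount_cents INTEGER)"
    )
    conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?)", facts)
    conn.commit()


def run_etl():
    """Run extract, transform, and load against an in-memory data mart."""
    conn = sqlite3.connect(":memory:")
    rows_1, rows_2 = extract()
    load(conn, transform(rows_1, rows_2))
    return conn
```

Calling `run_etl()` returns a connection whose `sales_fact` table holds rows from both sources in one format. Note that the two sources still use different identifiers (`Customer_ID` versus `Account_Number`); the sketch does not reconcile them.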
![Page 6: Getting the data into the warehouse: extract, transform, load](https://reader035.fdocuments.us/reader035/viewer/2022062321/568133d2550346895d9aca9f/html5/thumbnails/6.jpg)
Main ETL Issues: Conversion Stage
![Page 7: Getting the data into the warehouse: extract, transform, load](https://reader035.fdocuments.us/reader035/viewer/2022062321/568133d2550346895d9aca9f/html5/thumbnails/7.jpg)
Data Consistency: The Problem with Legacy Systems
• An IT infrastructure evolves over time
• Systems are created and acquired by different people using different specifications
This can happen through:
• Changes in management
• Mergers & Acquisitions
• Externally mandated standards
• General poor planning
![Page 8: Getting the data into the warehouse: extract, transform, load](https://reader035.fdocuments.us/reader035/viewer/2022062321/568133d2550346895d9aca9f/html5/thumbnails/8.jpg)
This leads to many issues
• Redundant data across the organization
  • Customer record maintained by accounts receivable and marketing
• The same data element stored in different formats
  • Social Security number (123-45-6789 versus 123456789)
• Different naming conventions
  • “Doritos” versus “Frito-Lay’s Doritos” versus “Regular Doritos”
• Different unique identifiers used
  • Account_Number versus Customer_ID
What are the problems with each of these?
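Two of these issues, differing formats and differing naming conventions, are typically handled in the conversion stage. The sketch below shows one possible approach; the function names and the alias table are hypothetical, and a real alias table would have to be agreed on by the business rather than guessed by the ETL developer.

```python
import re


def normalize_ssn(raw):
    """Return the nine digits of a Social Security number, or None if malformed."""
    digits = re.sub(r"\D", "", raw)  # drop dashes, spaces, and other non-digits
    return digits if len(digits) == 9 else None


# Hypothetical alias table mapping each known spelling to one canonical name.
PRODUCT_ALIASES = {
    "doritos": "Doritos",
    "frito-lay's doritos": "Doritos",
    "regular doritos": "Doritos",
}


def canonical_product(name):
    """Map a product name to its canonical form; unknown names pass through."""
    cleaned = name.strip()
    key = cleaned.lower().replace("’", "'")  # treat curly and straight apostrophes alike
    return PRODUCT_ALIASES.get(key, cleaned)
```

With these helpers, `normalize_ssn("123-45-6789")` and `normalize_ssn("123456789")` produce the same value, and all three Doritos spellings map to one product name.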
![Page 9: Getting the data into the warehouse: extract, transform, load](https://reader035.fdocuments.us/reader035/viewer/2022062321/568133d2550346895d9aca9f/html5/thumbnails/9.jpg)
What’s the big deal?
• This is a fundamental problem for creating data cubes
• We often need to combine information from several transactional databases
• How do we know if we’re talking about the same customer or product?
![Page 10: Getting the data into the warehouse: extract, transform, load](https://reader035.fdocuments.us/reader035/viewer/2022062321/568133d2550346895d9aca9f/html5/thumbnails/10.jpg)
Now think about this scenario

Hotel Reservation Database
• Hotels: Hotel_id, Country_code, Hotel_name, Hotel_address, Hotel_city, Hotel_zipcode
• Countries: Country_code, Country_currency, Country_name
• Hotel rooms: Room_number, Hotel_id, Room_type, Room_floor
• Room types: Room_type_code, Room_standard_rate, Room_description, Smoking_YN
• Room Bookings: Booking_id, Room_type_code, Hotel_id, Checkin_date, Number_of_days, Room_count
• Guest Bookings: Booking_id, Guest_number
• Guests: Guest_number, Guest_firstname, Guest_lastname, Guest_address, Guest_city, Guest_zipcode, Guest_email
• Hotel Amenities Lookup: Characteristic_id, Characteristic_description
• Hotel Amenities: Characteristic_id, Hotel_id

Café Database
• Customer: Customer_number, Customer_name, Customer_address, Customer_city, Customer_zipcode
• Order: Order_number, Customer_number, Hotel_id, Food_item_id, Order_date, Order_time, Table_number
• Food item: Order number, Food_item_id, Order_date, Order_time
• Hotels: Hotel_id, Country_code, Hotel_name, Hotel_address, Hotel_city, Hotel_zipcode
![Page 11: Getting the data into the warehouse: extract, transform, load](https://reader035.fdocuments.us/reader035/viewer/2022062321/568133d2550346895d9aca9f/html5/thumbnails/11.jpg)
Solution: “Single view” of data
• The entire organization understands a unit of data in the same way
• It’s both a business goal and a technology goal
and really more this…
…than this
![Page 12: Getting the data into the warehouse: extract, transform, load](https://reader035.fdocuments.us/reader035/viewer/2022062321/568133d2550346895d9aca9f/html5/thumbnails/12.jpg)
Closer look at the Guest/Customer
Guests: Guest_number, Guest_firstname, Guest_lastname, Guest_address, Guest_city, Guest_zipcode, Guest_email
vs.
Customer: Customer_number, Customer_name, Customer_address, Customer_city, Customer_zipcode
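To make the comparison concrete, here is a minimal sketch of matching Guests records to Customer records. It rests on assumptions the slides do not state: that Customer_name holds "first last" in that order, that zip codes can be compared on their first five characters, and that name plus zip code is enough to identify a person. The function names are hypothetical.

```python
def guest_key(guest):
    """Build a comparison key from a Guests row (first and last name are separate)."""
    name = f'{guest["Guest_firstname"]} {guest["Guest_lastname"]}'
    # Lowercase and collapse whitespace; keep only the 5-digit part of the zip.
    return (" ".join(name.lower().split()), guest["Guest_zipcode"].strip()[:5])


def customer_key(customer):
    """Build the same key from a Customer row (assumes "first last" in one column)."""
    name = " ".join(customer["Customer_name"].lower().split())
    return (name, customer["Customer_zipcode"].strip()[:5])


def match_guests_to_customers(guests, customers):
    """Return {Guest_number: [candidate Customer_number, ...]} for each guest."""
    by_key = {}
    for c in customers:
        by_key.setdefault(customer_key(c), []).append(c["Customer_number"])
    return {g["Guest_number"]: by_key.get(guest_key(g), []) for g in guests}
```

A guest with no candidates may be a new person or simply a spelling mismatch, and a guest with several candidates needs a human decision. That ambiguity is why the unique identifiers (Guest_number versus Customer_number) cannot be reconciled by code alone.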
![Page 13: Getting the data into the warehouse: extract, transform, load](https://reader035.fdocuments.us/reader035/viewer/2022062321/568133d2550346895d9aca9f/html5/thumbnails/13.jpg)
Organizational issues
• Why might there be resistance to data standardization?
• Is it an option to just “fix” the transactional databases?
• If two data elements conflict, whose standard “wins”?
![Page 14: Getting the data into the warehouse: extract, transform, load](https://reader035.fdocuments.us/reader035/viewer/2022062321/568133d2550346895d9aca9f/html5/thumbnails/14.jpg)
Data Quality
• The degree to which the data reflects the actual environment
![Page 15: Getting the data into the warehouse: extract, transform, load](https://reader035.fdocuments.us/reader035/viewer/2022062321/568133d2550346895d9aca9f/html5/thumbnails/15.jpg)
Finding the right data
Adapted from http://www2.ed.gov/about/offices/list/os/technology/plan/2004/site/docs_and_pdf/Data_Quality_Audits_from_ESP_Solutions_Group.pdf
![Page 16: Getting the data into the warehouse: extract, transform, load](https://reader035.fdocuments.us/reader035/viewer/2022062321/568133d2550346895d9aca9f/html5/thumbnails/16.jpg)
Ensuring accuracy
Adapted from http://www2.ed.gov/about/offices/list/os/technology/plan/2004/site/docs_and_pdf/Data_Quality_Audits_from_ESP_Solutions_Group.pdf
![Page 17: Getting the data into the warehouse: extract, transform, load](https://reader035.fdocuments.us/reader035/viewer/2022062321/568133d2550346895d9aca9f/html5/thumbnails/17.jpg)
Reliability of the collection process
Adapted from http://www2.ed.gov/about/offices/list/os/technology/plan/2004/site/docs_and_pdf/Data_Quality_Audits_from_ESP_Solutions_Group.pdf