MySQL Data Warehousing Survival Guide
Marius Moscovici ([email protected])
Steffan Mejia ([email protected])
Topics
• The size of the beast
• Evolution of a Warehouse
• Lessons Learned
• Survival Tips
• Q&A
Size of the beast
• 43 servers
o 36 active
o 7 standby spares
• 16 TB of data in MySQL
• 12 TB archived (pre-S3 staging)
• 4 TB archived (S3)
• 3.5B rows in main warehouse
• Largest table ~ 500M rows (MySQL)
Warehouse Evolution - First came slaving
Problems:
• Reporting slaves easily fall behind
• Reporting limited to one-pass SQL
Warehouse Evolution - Then came temp tables
Problems:
• Easy to lock replication with temp table creation
• Slaving becomes fragile
Warehouse Evolution - A Warehouse is Born
Problems:
• Warehouse workload limited by what can be performed by a single server
Warehouse Evolution - Workload Distributed
Problems:
• No real-time application integration support
Warehouse Evolution - Integrate Real Time Data
Lessons Learned - Warehouse Design
Problem: workload exceeds available memory
• Keep joins < available memory
• Heavily denormalize data for effective reporting
• Minimize joins between large tables
• Aggressively archive historical data
Lessons Learned - Data Movement
• mysqldump is your friend
• Sequence parent/child data loads based on ETL assumptions
o Orders without order lines
o Order lines without orders
• Data movement use cases
o Full
o Incremental
o Upsert (INSERT ... ON DUPLICATE KEY UPDATE)
Full Table Loads
• Good for small tables
• Works for tables with no primary key
• Data is fully replaced on each load
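A minimal sketch of what a full load amounts to, assuming the dump is taken with mysqldump's default options (which emit DROP TABLE IF EXISTS / CREATE TABLE / INSERT statements); the customer table and its columns here are hypothetical, not from the deck:

-- Shape of a full-table reload as replayed on the warehouse.
-- No primary key is needed; the whole table is replaced on each load.
DROP TABLE IF EXISTS customer;

CREATE TABLE customer (
  customer_id INT,
  name        VARCHAR(255),
  status      VARCHAR(32)
);

INSERT INTO customer VALUES
  (1, 'Alice', 'active'),
  (2, 'Bob',   'inactive');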
Incremental Loads
• Table contains new rows but no updates
• Good for insert-only tables
• High-water mark included in the mysqldump --where clause
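A minimal sketch of the high-water-mark pattern, with hypothetical table and column names; when the rows travel via mysqldump, the same predicate goes into its --where option instead of a local INSERT ... SELECT:

-- 1) High-water mark = newest event already present in the warehouse copy.
SELECT IFNULL(MAX(event_time), '2000-01-01') INTO @high_water_mark
  FROM event_log;

-- 2) Load only rows newer than the mark (insert-only source, no updates).
--    With mysqldump: --where="event_time > '<high water mark>'"
INSERT INTO event_log (event_id, user_id, event_type, event_time)
SELECT event_id, user_id, event_type, event_time
  FROM source_db.event_log
 WHERE event_time > @high_water_mark;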
Upsert Loads
• Table contains new and updated rows
• Table must have primary key
• Can be used to update only subset of columns
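A minimal sketch of the upsert pattern, INSERT ... ON DUPLICATE KEY UPDATE, against a hypothetical dim_user table keyed on user_id; only the columns named in the UPDATE clause are touched on existing rows:

-- New user_ids are inserted; existing ones only have status and
-- updated_at refreshed (a subset of columns).
INSERT INTO dim_user (user_id, type, status, updated_at)
VALUES (42, 'trial', 'active', NOW())
ON DUPLICATE KEY UPDATE
  status     = VALUES(status),
  updated_at = VALUES(updated_at);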
Lessons Learned - ETL Design
• Avoid large joins like the plague
• Break ETL jobs into bite-sized pieces
• Ensure target data integrity on ETL failure
• Use memory staging tables to boost performance
ETL Design - Sample Problem
Build a daily summary of customer event log activity
ETL Design - Sample Solution
ETL Pseudo code - Step 1
1) Create staging table & find high-water mark:

SELECT IFNULL(MAX(calendar_date), '2000-01-01') INTO @last_loaded_date
  FROM user_event_log_summary;

SET max_heap_table_size = <big enough number to hold several days of data>;

CREATE TEMPORARY TABLE user_event_log_summary_staging (.....) ENGINE = MEMORY;

CREATE INDEX user_idx USING HASH
  ON user_event_log_summary_staging (user_id);
ETL Pseudo code - Step 2
2) Summarize events:
INSERT INTO user_event_log_summary_staging
  (calendar_date, user_id, event_type, event_count)
SELECT DATE(event_time), user_id, event_type, COUNT(*)
  FROM event_log
 WHERE event_time > CONCAT(@last_loaded_date, ' 23:59:59')
 GROUP BY 1, 2, 3;
ETL Pseudo code - Step 3
3) Set denormalized user columns:
UPDATE user_event_log_summary_staging log_summary, user
   SET log_summary.type   = user.type,
       log_summary.status = user.status
 WHERE user.user_id = log_summary.user_id;
ETL Pseudo code - Step 4
4) Insert into target table:

INSERT INTO user_event_log_summary (...)
SELECT ...
  FROM user_event_log_summary_staging;
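One way to make this final step restartable, in line with "Ensure target data integrity on ETL failure"; this wrapper is an assumption, not shown in the deck, and it presumes the target table is InnoDB and carries the columns populated in steps 2-3:

-- Remove rows left behind by a previously failed run, then insert,
-- so re-running the job after a failure cannot create duplicates.
START TRANSACTION;

DELETE FROM user_event_log_summary
 WHERE calendar_date > @last_loaded_date;

INSERT INTO user_event_log_summary
  (calendar_date, user_id, event_type, event_count, type, status)
SELECT calendar_date, user_id, event_type, event_count, type, status
  FROM user_event_log_summary_staging;

COMMIT;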
Functional Partitioning
• Benefits depend on
o Partition Execution Times
o Data Move Times
o Dependencies between functional partitions
Job Management
• Run everything single-threaded on a server
• Handle dependencies between jobs across servers
• Smart re-start key to survival
• Implemented 3-level hierarchy of processing (sketched below)
o Process (collection of build steps and data moves)
o Build Steps (ETL 'units of work')
o Data Moves
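The hierarchy is essentially a dependency tree the job manager walks for smart re-start; a hypothetical minimal schema (the deck does not show its actual tables, so names and columns here are assumptions):

-- Top level: a process groups build steps and data moves.
CREATE TABLE process (
  process_id   INT PRIMARY KEY,
  process_name VARCHAR(100),
  status       VARCHAR(20)       -- e.g. pending / running / failed / done
);

-- Middle level: ETL 'units of work', ordered so a re-start can resume
-- at the first step that has not completed.
CREATE TABLE build_step (
  build_step_id INT PRIMARY KEY,
  process_id    INT,
  step_order    INT,
  status        VARCHAR(20)
);

-- Bottom level: data moves between servers, owned by a build step.
CREATE TABLE data_move (
  data_move_id  INT PRIMARY KEY,
  build_step_id INT,
  source_server VARCHAR(100),
  target_server VARCHAR(100),
  status        VARCHAR(20)
);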
DW Replication
• Similar to other MySQL environments
o Commodity hardware
o Master-slave pairs for all databases
• Mixed environments can be difficult
o Use rsync to create slaves
o But not with ssh (on private network)
• Monitoring
o Reporting queries need to be monitored
  - Beware of blocking queries (example query below)
  - Only run reporting queries on the slave (temp table issues)
o Nagios
o Ganglia
o Custom scripts
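A generic way to spot long-running or blocking reporting queries from SQL (MySQL 5.1+); this is an illustration, not the actual Nagios / Ganglia checks used here, and the 300-second cutoff is arbitrary:

-- List active statements running longer than 5 minutes.
SELECT id, user, host, db, time, state, info
  FROM information_schema.PROCESSLIST
 WHERE command <> 'Sleep'
   AND time > 300
 ORDER BY time DESC;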
Infrastructure Planning
• Replication latency
o Warehouse slave unable to keep up
o Disk utilization > 95%
o Required frequent re-sync
• Options evaluated
o Higher-speed conventional disks
o RAM increase
o Solid-state disks
Optimization
• Check / reset HW RAID settings
• Use general query log to track ETL and query activity (example below)
• Application timing
o Isolate poor-performing parts of the build
• Optimize data storage - automatic roll-off of older data
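One way to turn on the general query log at runtime (MySQL 5.1+) to see exactly which statements an ETL build runs; the file path is illustrative and the deck does not show its actual settings:

-- Enable the general query log for the duration of an investigation.
SET GLOBAL general_log_file = '/var/log/mysql/general.log';
SET GLOBAL general_log      = 'ON';

-- ... run the ETL build or the queries under investigation ...

SET GLOBAL general_log = 'OFF';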
Infrastructure Changes
• Increased memory 32GB -> 64GB
• New servers have 96GB RAM
• SSD solution
o 12- & 16-disk configurations
o RAID6 vs. RAID10
o 2.0TB or 1.6TB formatted capacity
o SATA2 HW BBU RAID6
o ~8 TB of data on SSD
Results
• Sometimes it pays to throw hardware at a problem
o 15-hour warehouse builds on old system
o 6 hours on optimized system
o No application changes
Finally...Archive
Two-tiered solution:
• Move data into archive tables in a separate DB
• Use SELECT to dump data - efficient and fast
• Archive server handles migration
o Dump data
o GPG-encrypt
o Push to S3
Survival Tips
• Efforts to scale are non-linear
o As you scale, it becomes increasingly difficult to manage
o Be prepared to supplement your warehouse strategy:
  - Dedicated appliance
  - Distributed processing (Hadoop, etc.)
• You can gain a great deal of headroom by optimizing I/O
o Optimize current disk I/O path
o Examine SSD / Flash solutions
o Be pragmatic about table designs
• It's important to stay ahead of the performance curve
o Be proactive - monitor growth, scale early
• Monitor everything, including your users
o Bad queries can bring replication down