MySQL Data Warehousing Survival Guide
Marius Moscovici ([email protected])
Steffan Mejia ([email protected])
Topics
• The size of the beast
• Evolution of a Warehouse
• Lessons Learned
• Survival Tips
• Q&A
Size of the beast
• 43 servers
o 36 active
o 7 standby spares
• 16 TB of data in MySQL
• 12 TB archived (pre-S3 staging)
• 4 TB archived (S3)
• 3.5B rows in main warehouse
• Largest table ~ 500M rows (MySQL)
Warehouse Evolution - First came slaving
Problems:
• Reporting slaves easily fall behind
• Reporting limited to one-pass SQL
Warehouse Evolution - Then came temp tables
Problems:
• Easy to lock replication with temp table creation
• Slaving becomes fragile
Warehouse Evolution - A Warehouse is Born
Problems:
• Warehouse workload limited by what can be performed by a single server
Warehouse Evolution - Workload Distributed
Problems:
• No real-time application integration support
Warehouse Evolution - Integrate Real Time Data
Lessons Learned - Warehouse Design
Problem: workload exceeds available memory
• Keep joins < available memory
• Heavily denormalize data for effective reporting
• Minimize joins between large tables
• Aggressively archive historical data
Lessons Learned - Data Movement
• mysqldump is your friend
• Sequence parent/child data loads based on ETL assumptions
o Orders without order lines
o Order lines without orders
• Data movement use cases
o Full
o Incremental
o Upsert (INSERT ... ON DUPLICATE KEY UPDATE)
Full Table Loads
• Good for small tables
• Works for tables with no primary key
• Data is fully replaced on each load
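A minimal sketch of what a full load amounts to, assuming the dump is taken with mysqldump's default options (which emit DROP TABLE IF EXISTS / CREATE TABLE / INSERT statements); the customer table and its columns here are hypothetical, not from the deck:

-- Shape of a full-table reload as replayed on the warehouse.
-- No primary key is needed; the whole table is replaced on each load.
DROP TABLE IF EXISTS customer;

CREATE TABLE customer (
  customer_id INT,
  name        VARCHAR(255),
  status      VARCHAR(32)
);

INSERT INTO customer VALUES
  (1, 'Alice', 'active'),
  (2, 'Bob',   'inactive');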
Incremental Loads
• Table contains new rows but no updates
• Good for insert-only tables
• High-water mark included in the mysqldump --where clause
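A minimal sketch of the high-water-mark pattern, with hypothetical table and column names; when the rows travel via mysqldump, the same predicate goes into its --where option instead of a local INSERT ... SELECT:

-- 1) High-water mark = newest event already present in the warehouse copy.
SELECT IFNULL(MAX(event_time), '2000-01-01') INTO @high_water_mark
  FROM event_log;

-- 2) Load only rows newer than the mark (insert-only source, no updates).
--    With mysqldump: --where="event_time > '<high water mark>'"
INSERT INTO event_log (event_id, user_id, event_type, event_time)
SELECT event_id, user_id, event_type, event_time
  FROM source_db.event_log
 WHERE event_time > @high_water_mark;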
Upsert Loads
• Table contains new and updated rows
• Table must have primary key
• Can be used to update only subset of columns
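A minimal sketch of the upsert pattern, INSERT ... ON DUPLICATE KEY UPDATE, against a hypothetical dim_user table keyed on user_id; only the columns named in the UPDATE clause are touched on existing rows:

-- New user_ids are inserted; existing ones only have status and
-- updated_at refreshed (a subset of columns).
INSERT INTO dim_user (user_id, type, status, updated_at)
VALUES (42, 'trial', 'active', NOW())
ON DUPLICATE KEY UPDATE
  status     = VALUES(status),
  updated_at = VALUES(updated_at);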
Lessons Learned - ETL Design
• Avoid large joins like the plague
• Break ETL jobs into bite-sized pieces
• Ensure target data integrity on ETL failure
• Use memory staging tables to boost performance
ETL Design - Sample Problem
Build a daily summary of customer event log activity
ETL Design - Sample Solution
ETL Pseudo code - Step 1
1) Create staging table & find high-water mark:

SELECT IFNULL(MAX(calendar_date), '2000-01-01') INTO @last_loaded_date
  FROM user_event_log_summary;

SET max_heap_table_size = <big enough number to hold several days of data>;

CREATE TEMPORARY TABLE user_event_log_summary_staging (.....) ENGINE = MEMORY;

CREATE INDEX user_idx USING HASH
  ON user_event_log_summary_staging (user_id);
ETL Pseudo code - Step 2
2) Summarize events:
INSERT INTO user_event_log_summary_staging
  (calendar_date, user_id, event_type, event_count)
SELECT DATE(event_time), user_id, event_type, COUNT(*)
  FROM event_log
 WHERE event_time > CONCAT(@last_loaded_date, ' 23:59:59')
 GROUP BY 1, 2, 3;
ETL Pseudo code - Step 3
3) Set denormalized user columns:
UPDATE user_event_log_summary_staging log_summary, user
   SET log_summary.type   = user.type,
       log_summary.status = user.status
 WHERE user.user_id = log_summary.user_id;
ETL Pseudo code - Step 4
4) Insert into target table:

INSERT INTO user_event_log_summary (...)
SELECT ...
  FROM user_event_log_summary_staging;
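One way to make this final step restartable, in line with "Ensure target data integrity on ETL failure"; this wrapper is an assumption, not shown in the deck, and it presumes the target table is InnoDB and carries the columns populated in steps 2-3:

-- Remove rows left behind by a previously failed run, then insert,
-- so re-running the job after a failure cannot create duplicates.
START TRANSACTION;

DELETE FROM user_event_log_summary
 WHERE calendar_date > @last_loaded_date;

INSERT INTO user_event_log_summary
  (calendar_date, user_id, event_type, event_count, type, status)
SELECT calendar_date, user_id, event_type, event_count, type, status
  FROM user_event_log_summary_staging;

COMMIT;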
Functional Partitioning
• Benefits depend on
o Partition Execution Times
o Data Move Times
o Dependencies between functional partitions
Job Management
• Run everything single-threaded on a server
• Handle dependencies between jobs across servers
• Smart re-start key to survival
• Implemented 3-level hierarchy of processing (sketched below)
o Process (collection of build steps and data moves)
o Build Steps (ETL 'units of work')
o Data Moves
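The hierarchy is essentially a dependency tree the job manager walks for smart re-start; a hypothetical minimal schema (the deck does not show its actual tables, so names and columns here are assumptions):

-- Top level: a process groups build steps and data moves.
CREATE TABLE process (
  process_id   INT PRIMARY KEY,
  process_name VARCHAR(100),
  status       VARCHAR(20)       -- e.g. pending / running / failed / done
);

-- Middle level: ETL 'units of work', ordered so a re-start can resume
-- at the first step that has not completed.
CREATE TABLE build_step (
  build_step_id INT PRIMARY KEY,
  process_id    INT,
  step_order    INT,
  status        VARCHAR(20)
);

-- Bottom level: data moves between servers, owned by a build step.
CREATE TABLE data_move (
  data_move_id  INT PRIMARY KEY,
  build_step_id INT,
  source_server VARCHAR(100),
  target_server VARCHAR(100),
  status        VARCHAR(20)
);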
DW Replication
• Similar to other MySQL environments
o Commodity hardware
o Master-slave pairs for all databases
• Mixed environments can be difficult
o Use rsync to create slaves
o But not with ssh (on private network)
• Monitoring
o Reporting queries need to be monitored
  - Beware of blocking queries (example query below)
  - Only run reporting queries on the slave (temp table issues)
o Nagios
o Ganglia
o Custom scripts
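A generic way to spot long-running or blocking reporting queries from SQL (MySQL 5.1+); this is an illustration, not the actual Nagios / Ganglia checks used here, and the 300-second cutoff is arbitrary:

-- List active statements running longer than 5 minutes.
SELECT id, user, host, db, time, state, info
  FROM information_schema.PROCESSLIST
 WHERE command <> 'Sleep'
   AND time > 300
 ORDER BY time DESC;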
Infrastructure Planning
• Replication latency
o Warehouse slave unable to keep up
o Disk utilization > 95%
o Required frequent re-sync
• Options evaluated
o Higher-speed conventional disks
o RAM increase
o Solid-state disks
Optimization
• Check / reset HW RAID settings
• Use general query log to track ETL and query activity (example below)
• Application timing
o Isolate poor-performing parts of the build
• Optimize data storage - automatic roll-off of older data
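One way to turn on the general query log at runtime (MySQL 5.1+) to see exactly which statements an ETL build runs; the file path is illustrative and the deck does not show its actual settings:

-- Enable the general query log for the duration of an investigation.
SET GLOBAL general_log_file = '/var/log/mysql/general.log';
SET GLOBAL general_log      = 'ON';

-- ... run the ETL build or the queries under investigation ...

SET GLOBAL general_log = 'OFF';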
Infrastructure Changes
• Increased memory 32GB -> 64GB
• New servers have 96GB RAM
• SSD solution
o 12- & 16-disk configurations
o RAID6 vs. RAID10
o 2.0TB or 1.6TB formatted capacity
o SATA2 HW BBU RAID6
o ~8 TB of data on SSD
Results
• Sometimes it pays to throw hardware at a problem
o 15-hour warehouse builds on old system
o 6 hours on optimized system
o No application changes
Finally...Archive
Two-tiered solution:
• Move data into archive tables in a separate DB
• Use SELECT to dump data - efficient and fast
• Archive server handles migration
o Dump data
o GPG-encrypt
o Push to S3
Survival Tips
• Efforts to scale are non-linear
o As you scale, it becomes increasingly difficult to manage
o Be prepared to supplement your warehouse strategy:
  - Dedicated appliance
  - Distributed processing (Hadoop, etc.)
• You can gain a great deal of headroom by optimizing I/O
o Optimize current disk I/O path
o Examine SSD / Flash solutions
o Be pragmatic about table designs
• It's important to stay ahead of the performance curve
o Be proactive - monitor growth, scale early
• Monitor everything, including your users
o Bad queries can bring replication down