© 2010 Quest Software, Inc. ALL RIGHTS RESERVED
Exchanging data with the Elephant: Connecting Hadoop and an RDBMS using SQOOP
Guy Harrison
Director, R&D Melbourne
www.guyharrison.net
Guy.harrison@quest.com
@guyharrison
Introductions
Agenda
• RDBMS-Hadoop interoperability scenarios
• Interoperability options
• Cloudera SQOOP
• Extending SQOOP
• Quest OraOop extension for Cloudera SQOOP
• Performance comparisons
• Lessons learned and best practices
Scenario #1: Reference data in RDBMS
[Diagram: Customers and Products reference tables live in the RDBMS; WebLogs live in HDFS]
Scenario #2: Hadoop for off-line analytics
[Diagram: Customers and Products remain in the RDBMS; Sales History is copied to HDFS for off-line analytics]
Scenario #3: Hadoop for RDBMS archive
[Diagram: Sales 2009 and Sales 2010 stay in the RDBMS; the Sales 2008 partition is archived to HDFS]
Scenario #4: MapReduce results to RDBMS
[Diagram: WebLogs in HDFS are reduced by MapReduce to a WebLog Summary, which is loaded into the RDBMS]
Options for RDBMS inter-op
• DBInputFormat:
  – Allows database records to be used as mapper inputs
  – BUT:
    • Not inherently scalable or efficient
    • For repeated analysis, better to stage in Hadoop
    • Tedious coding of DBWritable classes for each table
• SQOOP
  – Open-source utility provided by Cloudera
  – Configurable command-line interface to copy RDBMS -> HDFS
  – Support for Hive, HBase
  – Generates Java classes for future MapReduce tasks
  – Extensible to provide optimized adaptors for specific targets
  – Bi-directional
SQOOP Details
• SQOOP import
  – Divide table into ranges using primary key max/min
  – Create mappers for each range
  – Mappers write to multiple HDFS nodes
  – Creates text or sequence files
  – Generates Java class for resulting HDFS file
  – Generates Hive definition and auto-loads into Hive
• SQOOP export
  – Read files in HDFS directory via MapReduce
  – Bulk parallel insert into database table
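The import and export flows above map directly onto SQOOP's command line. A minimal sketch, in which the Oracle host, SID, credentials, table names, and HDFS paths are all hypothetical placeholders:

```shell
# Import: SQOOP splits SALES on the given key column, runs 8 parallel
# mappers, and writes text files under the target HDFS directory.
sqoop import \
  --connect jdbc:oracle:thin:@dbhost:1521:ORCL \
  --username scott --password tiger \
  --table SALES \
  --split-by SALES_ID \
  --num-mappers 8 \
  --target-dir /user/hadoop/sales

# Export: read the files a MapReduce job produced and bulk-insert
# them in parallel into an existing database table.
sqoop export \
  --connect jdbc:oracle:thin:@dbhost:1521:ORCL \
  --username scott --password tiger \
  --table WEBLOG_SUMMARY \
  --export-dir /user/hadoop/weblog_summary
```

These are invocation recipes that need a live Hadoop cluster and database; adjust the split column and mapper count to your data.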
SQOOP details
• SQOOP features:
  – Compatible with almost any JDBC-enabled database
  – Auto-load into Hive
  – HBase support
  – Special handling for database LOBs
  – Job management
  – Cluster configuration (jar file distribution)
  – WHERE clause support
  – Open source, and included in Cloudera distributions
• SQOOP fast paths & plug-ins
  – Invoke mysqldump, mysqlimport for MySQL jobs
  – Similar fast paths for PostgreSQL
  – Extensibility architecture for 3rd parties (like Quest)
    • Teradata, Netezza, etc.
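Several of these features combine on one command line. A sketch, with hypothetical connection details and column names, that imports only recent rows and auto-loads the result into Hive:

```shell
# Import a filtered subset (WHERE clause support) and create/load a
# matching Hive table (auto-load into Hive). -P prompts for the password.
sqoop import \
  --connect jdbc:oracle:thin:@dbhost:1521:ORCL \
  --username scott -P \
  --table SALES \
  --where "SALE_DATE >= TO_DATE('2010-01-01','YYYY-MM-DD')" \
  --hive-import
```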
Working with Oracle
• SQOOP approach is generic and applicable to all RDBMSs
• However, for Oracle it is sub-optimal in some respects:
  – Oracle may parallelize and serialize individual mappers
  – Oracle optimizer may decline to use index range scans
  – Oracle physical storage is often deliberately not in primary key order (reverse key indexes, hash partitioning, etc.)
  – Primary keys are often not evenly distributed
  – Index range scans use single-block random reads
    • vs. faster multi-block table scans
  – Index range scans load into the Oracle buffer cache
    • Pollutes the cache, increasing IO for other users
    • Of limited help to SQOOP, since rows are only read once
• Luckily, SQOOP extensibility allows us to add optimizations for specific targets
Oracle – parallelism
[Diagram: Oracle's own parallel query architecture for the statement below — one set of PQ slaves scans and sorts the SALES table, a second set aggregates, and the query coordinator (QC) funnels all results back through a single JDBC client connection]

    SELECT cust_id, SUM (amount_sold)
    FROM sh.sales
    GROUP BY cust_id
    ORDER BY 2 DESC
Oracle – parallelism
[Diagram: a SQOOP import, by contrast, runs four independent Hadoop mappers, each reading its own slice of the Oracle SALES table and writing to HDFS]
Oracle – parallelism
[Diagram: each Hadoop mapper opens its own Oracle session and performs an index range scan over its key range (e.g. ID > 0 AND ID < MAX/2, and ID > MAX/2), pulling index and table blocks through the shared Oracle buffer cache]
Ideal architecture
[Diagram: four Hadoop mappers, each paired with a dedicated Oracle session, reading the SALES table directly in parallel and writing straight to HDFS]
Quest/Cloudera OraOop for SQOOP
• Design goals
  – Partition data based on physical storage
  – By-pass Oracle buffering
  – By-pass Oracle parallelism
  – Do not require or use indexes
  – Never read the same data block more than once
  – Support esoteric datatypes (eventually)
  – Support RAC clusters
• Availability:
  – Freely available from www.quest.com/ora-oop
  – Packaged with Cloudera Enterprise
  – Commercial support from Quest/Cloudera within the Enterprise distribution
[Chart: elapsed time (s) vs. number of Hadoop mappers for a 50M row, 50GB Oracle table imported to a 16-node Hadoop cluster, comparing plain SQOOP against SQOOP with OraOop]
OraOop Throughput
[Chart: 16 mappers, 50M rows, 50 GB clustered data — percentage reduction in Oracle overhead with OraOop: elapsed time 80.84, CPU time 89.72, network round trips 98.95, IO requests 99.08, IO time 98.71]
Extending SQOOP
• SQOOP lets you concentrate on the RDBMS logic, not the Hadoop plumbing:
  – Extend ManagerFactory (what to handle)
  – Extend ConnManager (DB connection and metadata)
  – For imports:
    • Extend DataDrivenDBInputFormat (gets the data)
      – Data allocation (getSplits())
      – Split serialization ("io.serializations" property)
      – Data access logic (createDBRecordReader(), getSelectQuery())
      – Implement progress (nextKeyValue(), getProgress())
  – Similar procedure for extending exports
SQOOP/OraOop best practices
• Use sequence files for LOBs, OR
  – Set inline-lob-limit
• Directly control datanodes for widest destination bandwidth
  – Can't rely on mapred.max.maps.per.node
• Set the number of mappers realistically
• Disable speculative execution (our default)
  – Otherwise leads to duplicate DB reads
• Set Oracle row fetch size extra high
  – Keeps the mappers streaming to HDFS
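Put together, those practices might look like this on a single import, with hypothetical connection details and sizes:

```shell
# Generic -D options must come before tool-specific arguments.
# Disabling speculative execution prevents duplicate map attempts
# from re-reading the same Oracle rows.
sqoop import \
  -D mapred.map.tasks.speculative.execution=false \
  --connect jdbc:oracle:thin:@dbhost:1521:ORCL \
  --username scott -P \
  --table SALES \
  --num-mappers 16 \
  --fetch-size 10000 \
  --inline-lob-limit 16384 \
  --target-dir /user/hadoop/sales
```

The fetch size and LOB limit shown are illustrative; tune them to your row width and network.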
Conclusion
• RDBMS-Hadoop interoperability is key to Enterprise Hadoop adoption
• SQOOP provides a good general purpose tool for transferring data between any JDBC database and Hadoop
• SQOOP extensions can provide optimizations for specific targets
• Each RDBMS offers distinct tuning opportunities, so SQOOP extensions offer real value
• Try out OraOop for SQOOP!
Thank You