© 2010 Quest Software, Inc. ALL RIGHTS RESERVED
Exchanging data with the Elephant: Connecting Hadoop and an RDBMS using SQOOP
Guy Harrison
Director, R&D Melbourne
www.guyharrison.net
Guy.harrison@quest.com
@guyharrison
Introductions
Agenda
• RDBMS-Hadoop interoperability scenarios
• Interoperability options
• Cloudera SQOOP
• Extending SQOOP
• Quest OraOop extension for Cloudera SQOOP
• Performance comparisons
• Lessons learned and best practices
Scenario #1: Reference data in RDBMS
[Diagram: Customers and Products reference tables live in the RDBMS; WebLogs live in HDFS]
Scenario #2: Hadoop for off-line analytics
[Diagram: Customers and Products remain in the RDBMS; Sales History is copied to HDFS for off-line analytics]
Scenario #3: Hadoop for RDBMS archive
[Diagram: Sales 2009 and Sales 2010 stay in the RDBMS; the Sales 2008 partition is archived to HDFS]
Scenario #4: MapReduce results to RDBMS
[Diagram: WebLogs in HDFS are reduced by MapReduce to a WebLog Summary, which is loaded into the RDBMS]
Options for RDBMS inter-op
• DBInputFormat:
  – Allows database records to be used as mapper inputs
  – BUT:
    • Not inherently scalable or efficient
    • For repeated analysis, better to stage in Hadoop
    • Tedious coding of DBWritable classes for each table
• SQOOP
  – Open-source utility provided by Cloudera
  – Configurable command-line interface to copy RDBMS -> HDFS
  – Support for Hive, HBase
  – Generates Java classes for future MapReduce tasks
  – Extensible to provide optimized adaptors for specific targets
  – Bi-directional
SQOOP Details
• SQOOP import
  – Divide table into ranges using primary key max/min
  – Create mappers for each range
  – Mappers write to multiple HDFS nodes
  – Creates text or sequence files
  – Generates Java class for resulting HDFS file
  – Generates Hive definition and auto-loads into Hive
• SQOOP export
  – Read files in HDFS directory via MapReduce
  – Bulk parallel insert into database table
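The import and export flows above map directly onto SQOOP's command line. A minimal sketch, in which the Oracle host, SID, credentials, table names, and HDFS paths are all hypothetical placeholders:

```shell
# Import: SQOOP splits SALES on the given key column, runs 8 parallel
# mappers, and writes text files under the target HDFS directory.
sqoop import \
  --connect jdbc:oracle:thin:@dbhost:1521:ORCL \
  --username scott --password tiger \
  --table SALES \
  --split-by SALES_ID \
  --num-mappers 8 \
  --target-dir /user/hadoop/sales

# Export: read the files a MapReduce job produced and bulk-insert
# them in parallel into an existing database table.
sqoop export \
  --connect jdbc:oracle:thin:@dbhost:1521:ORCL \
  --username scott --password tiger \
  --table WEBLOG_SUMMARY \
  --export-dir /user/hadoop/weblog_summary
```

These are invocation recipes that need a live Hadoop cluster and database; adjust the split column and mapper count to your data.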
SQOOP details
• SQOOP features:
  – Compatible with almost any JDBC-enabled database
  – Auto-load into Hive
  – HBase support
  – Special handling for database LOBs
  – Job management
  – Cluster configuration (jar file distribution)
  – WHERE clause support
  – Open source, and included in Cloudera distributions
• SQOOP fast paths & plug-ins
  – Invoke mysqldump, mysqlimport for MySQL jobs
  – Similar fast paths for PostgreSQL
  – Extensibility architecture for 3rd parties (like Quest)
    • Teradata, Netezza, etc.
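Several of these features combine on one command line. A sketch, with hypothetical connection details and column names, that imports only recent rows and auto-loads the result into Hive:

```shell
# Import a filtered subset (WHERE clause support) and create/load a
# matching Hive table (auto-load into Hive). -P prompts for the password.
sqoop import \
  --connect jdbc:oracle:thin:@dbhost:1521:ORCL \
  --username scott -P \
  --table SALES \
  --where "SALE_DATE >= TO_DATE('2010-01-01','YYYY-MM-DD')" \
  --hive-import
```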
Working with Oracle
• SQOOP approach is generic and applicable to all RDBMSs
• However, for Oracle it is sub-optimal in some respects:
  – Oracle may parallelize and serialize individual mappers
  – Oracle optimizer may decline to use index range scans
  – Oracle physical storage is often deliberately not in primary key order (reverse key indexes, hash partitioning, etc.)
  – Primary keys are often not evenly distributed
  – Index range scans use single-block random reads
    • vs. faster multi-block table scans
  – Index range scans load into the Oracle buffer cache
    • Pollutes the cache, increasing IO for other users
    • Of limited help to SQOOP, since rows are only read once
• Luckily, SQOOP extensibility allows us to add optimizations for specific targets
Oracle – parallelism
[Diagram: Oracle's own parallel query architecture for the statement below — one set of PQ slaves scans and sorts the SALES table, a second set aggregates, and the query coordinator (QC) funnels all results back through a single JDBC client connection]

    SELECT cust_id, SUM (amount_sold)
    FROM sh.sales
    GROUP BY cust_id
    ORDER BY 2 DESC
Oracle – parallelism
[Diagram: a SQOOP import, by contrast, runs four independent Hadoop mappers, each reading its own slice of the Oracle SALES table and writing to HDFS]
Oracle – parallelism
[Diagram: each Hadoop mapper opens its own Oracle session and performs an index range scan over its key range (e.g. ID > 0 AND ID < MAX/2, and ID > MAX/2), pulling index and table blocks through the shared Oracle buffer cache]
Ideal architecture
[Diagram: four Hadoop mappers, each paired with a dedicated Oracle session, reading the SALES table directly in parallel and writing straight to HDFS]
Quest/Cloudera OraOop for SQOOP
• Design goals
  – Partition data based on physical storage
  – By-pass Oracle buffering
  – By-pass Oracle parallelism
  – Do not require or use indexes
  – Never read the same data block more than once
  – Support esoteric datatypes (eventually)
  – Support RAC clusters
• Availability:
  – Freely available from www.quest.com/ora-oop
  – Packaged with Cloudera Enterprise
  – Commercial support from Quest/Cloudera within the Enterprise distribution
[Chart: elapsed time (s) vs. number of Hadoop mappers for a 50M row, 50GB Oracle table imported to a 16-node Hadoop cluster, comparing plain SQOOP against SQOOP with OraOop]
OraOop Throughput
[Chart: 16 mappers, 50M rows, 50 GB clustered data — percentage reduction in Oracle overhead with OraOop: elapsed time 80.84, CPU time 89.72, network round trips 98.95, IO requests 99.08, IO time 98.71]
Extending SQOOP
• SQOOP lets you concentrate on the RDBMS logic, not the Hadoop plumbing:
  – Extend ManagerFactory (what to handle)
  – Extend ConnManager (DB connection and metadata)
  – For imports:
    • Extend DataDrivenDBInputFormat (gets the data)
      – Data allocation (getSplits())
      – Split serialization ("io.serializations" property)
      – Data access logic (createDBRecordReader(), getSelectQuery())
      – Implement progress (nextKeyValue(), getProgress())
  – Similar procedure for extending exports
SQOOP/OraOop best practices
• Use sequence files for LOBs, OR
  – Set inline-lob-limit
• Directly control datanodes for widest destination bandwidth
  – Can't rely on mapred.max.maps.per.node
• Set the number of mappers realistically
• Disable speculative execution (our default)
  – Otherwise leads to duplicate DB reads
• Set Oracle row fetch size extra high
  – Keeps the mappers streaming to HDFS
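Put together, those practices might look like this on a single import, with hypothetical connection details and sizes:

```shell
# Generic -D options must come before tool-specific arguments.
# Disabling speculative execution prevents duplicate map attempts
# from re-reading the same Oracle rows.
sqoop import \
  -D mapred.map.tasks.speculative.execution=false \
  --connect jdbc:oracle:thin:@dbhost:1521:ORCL \
  --username scott -P \
  --table SALES \
  --num-mappers 16 \
  --fetch-size 10000 \
  --inline-lob-limit 16384 \
  --target-dir /user/hadoop/sales
```

The fetch size and LOB limit shown are illustrative; tune them to your row width and network.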
Conclusion
• RDBMS-Hadoop interoperability is key to Enterprise Hadoop adoption
• SQOOP provides a good general purpose tool for transferring data between any JDBC database and Hadoop
• SQOOP extensions can provide optimizations for specific targets
• Each RDBMS offers distinct tuning opportunities, so SQOOP extensions offer real value
• Try out OraOop for SQOOP!
Thank You