Advanced Sqoop


Sqoop – Advanced Options

2015

Contents

1 What is Sqoop?

2 Import and Export data using Sqoop

3 Import and Export command in Sqoop

4 Saved Jobs in Sqoop

5 Option File

6 Important Sqoop Options

What is Sqoop?

Sqoop is a tool designed for efficiently transferring bulk data between Hadoop and structured data stores such as relational databases.

Import and Export using Sqoop

The import command in Sqoop transfers the data from RDBMS to HDFS/Hive/HBase. The export command in Sqoop transfers the data from HDFS/Hive/HBase back to RDBMS.

Import command in Sqoop

The command to import data into Hive:

sqoop import --connect <connect-string>/dbname --username uname -P \
  --table table_name --hive-import -m 1

The command to import data into HDFS:

sqoop import --connect <connect-string>/dbname --username uname -P \
  --table table_name -m 1

The command to import data into HBase:

sqoop import --connect <connect-string>/dbname --username uname -P \
  --table table_name --hbase-table table_name \
  --column-family col_fam_name --hbase-row-key row_key_name --hbase-create-table -m 1
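As a concrete sketch (the host, database, table, and user names below are illustrative), a Hive import from a MySQL database might look like:

sqoop import --connect jdbc:mysql://dbhost:3306/salesdb --username sales_user -P \
  --table customers --hive-import -m 1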

Export command in Sqoop

The command to export data from Hive to RDBMS:

sqoop export --connect <connect-string>/db_name --table table_name -m 1 \
  --export-dir <path_to_export_dir>

The command to export data from HDFS to RDBMS:

sqoop export --connect <connect-string>/db_name --table table_name -m 1 \
  --export-dir <path_to_export_dir>

For a Hive table, <path_to_export_dir> is the table's storage location in the Hive warehouse (typically under /user/hive/warehouse).
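As a concrete sketch (host, database, table, and directory names are illustrative), exporting an HDFS directory back into a MySQL table might look like:

sqoop export --connect jdbc:mysql://dbhost:3306/salesdb --username sales_user -P \
  --table customers_export --export-dir /user/cloudera/datasets/customers -m 1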

Limitations of the import and export commands: they are convenient when one wants to transfer data between an RDBMS and HDFS/Hive/HBase a limited number of times. So what if there is a requirement to execute the import and export commands several times a day? In such situations a saved Sqoop job can save your time.


Saved Jobs in Sqoop

A saved Sqoop job remembers the parameters used by a command, so the command can be re-executed by invoking the job any number of times.

The following command creates a saved job:

sqoop job --create job_name -- import --connect <connect-string>/dbname \
  --table table_name

The command above just creates a job with the job name you specify. The job is now available in your saved-jobs list and can be executed later.

The following command executes a saved job (arguments after the bare -- are passed to the underlying tool):

sqoop job --exec job_name -- --username uname -P

Sample Saved Job

sqoop job --create JOB1 -- import --connect jdbc:mysql://192.168.56.1:3306/adventureworks \
  --username XXX --password XXX --table transactionhistory \
  --target-dir /user/cloudera/datasets/trans -m 1 \
  --columns "TransactionID,ProductId,TransactionDate" \
  --check-column TransactionDate --incremental lastmodified \
  --last-value "2004-09-01 00:00:00"
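After each run, Sqoop stores the updated last-value back in the saved job, so subsequent executions import only rows modified since the previous run:

sqoop job --exec JOB1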

Important Options in Saved Jobs in Sqoop

Sqoop option             Usage
--connect                Connection string for the source database
--table                  Source table name
--columns                Columns to be extracted
--username               User name for accessing the source table
--password               Password for accessing the source table
--check-column           Specifies the column to be examined when determining which rows to import
--incremental            Specifies how Sqoop determines which rows are new
--last-value             Specifies the maximum value of the check column from the previous import; for the first execution of the job, last-value is treated as the upper bound and data is extracted from the first record up to that bound
--target-dir             Target HDFS directory
-m                       Number of mapper tasks
--compress               Specifies that compression is to be applied while loading data into the target
--fields-terminated-by   Field separator in the output directory
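As a sketch of how several of these options combine on a single import (all names and values below are illustrative):

sqoop import --connect jdbc:mysql://dbhost:3306/salesdb --username sales_user -P \
  --table orders --columns "OrderID,OrderDate,Amount" \
  --check-column OrderDate --incremental lastmodified --last-value "2015-01-01 00:00:00" \
  --target-dir /user/cloudera/datasets/orders -m 1 \
  --compress --fields-terminated-by ','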

Sqoop Metastore

• A Sqoop metastore keeps track of all jobs.

• By default, the metastore is contained in your home directory under .sqoop and is only used for your own jobs. If you want to share jobs, you would need to install a JDBC-compliant database and use the --meta-connect argument to specify its location when issuing job commands (a sketch follows the list below).

• Important Sqoop commands:

• sqoop job --list – Lists all jobs available in the metastore
• sqoop job --exec JOB1 – Executes JOB1
• sqoop job --show JOB1 – Displays the metadata of JOB1
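For example, a job command pointing at a shared metastore might look like the sketch below (the host name is illustrative; 16000 is the default sqoop-metastore port):

sqoop job --list --meta-connect jdbc:hsqldb:hsql://metastore-host:16000/sqoop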

Option File

Certain arguments in import and export commands and in saved jobs have to be written every time you execute them. What would be an alternative to this repetitive work? For instance, the following arguments are used repeatedly in import and export commands as well as in saved jobs:

• These arguments can be saved in a single text file, say option.txt.
• While executing a command, just supply this file with the --options-file argument.
• The following command shows the use of the --options-file argument:

Option.txt:

import
--connect
jdbc:mysql://localhost/dbname
--username
uname
-P

sqoop --options-file <path_to_option_file> --table table_name

Option File

1. Each argument in the option file should be on a new line.
2. Options keep their standard form inside the file: --connect cannot be shortened to -connect, and an option and its value go on separate lines.
3. The same is the case for the other arguments too.
4. An option file is generally used when a large number of Sqoop jobs share a common set of parameters, such as:
   1. Source RDBMS user ID and password
   2. Source database URL
   3. Field separator
   4. Compression type
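Option files may also contain blank lines and comments beginning with #. A sketch of a shared parameter file (all values illustrative; the tool name is omitted so the same file can be reused across commands):

# Connection parameters shared by all Sqoop jobs
--connect
jdbc:mysql://localhost/dbname
--username
uname
-P

It would then be invoked as, for example: sqoop import --options-file shared_options.txt --table table_name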

Sqoop Design Guidelines for Performance

1. Sqoop imports data in parallel from database sources. You can specify the number of map tasks (parallel processes) to use for the import with the -m argument. Some databases may see improved performance by increasing this value to 8 or 16. Do not increase the degree of parallelism beyond what is available within your MapReduce cluster.

2. By default, the import process uses JDBC. Some databases can perform imports in a more high-performance fashion by using database-specific data movement tools. For example, MySQL provides the mysqldump tool, which can export data from MySQL to other systems very quickly. By supplying the --direct argument, you specify that Sqoop should attempt the direct import channel. A combined sketch follows this list.
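Combining both guidelines for a MySQL source (all names are illustrative; the table is assumed to have a primary key for splitting, and --direct requires mysqldump to be available on the worker nodes):

sqoop import --connect jdbc:mysql://dbhost:3306/salesdb --username sales_user -P \
  --table transactions --direct -m 8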

Thank You