Apache Sqoop: Unlocking Hadoop for Your Relational Database

Post on 21-Nov-2014

969 views 3 download

description

Kathleen Ting, Technical Account Manager @ Cloudera and Sqoop Committer Unlocking data stored in an organization's RDBMS and transferring it to Apache Hadoop is a major concern in the big data industry. Apache Sqoop enables users with information stored in existing SQL tables to use new analytic tools like Apache HBase and Apache Hive. This talk will go over how to deploy and apply Sqoop in your environment as well as transferring data from MySQL, Oracle, PostgreSQL, SQL Server, Netezza, Teradata, and other relational systems. In addition, we'll show you how to keep table data and Hadoop in sync by importing data incrementally as well as how to customize transferred data by calling various database functions.

Transcript of Apache Sqoop: Unlocking Hadoop for Your Relational Database

           Unlocking  Hadoop  for  Your  Rela4onal  DB  

         Kathleen Ting | @kate_ting Technical Account Manager, Cloudera | Sqoop PMC Member Hadoop User Group UK 10 April 2014      

Who  Am  I?  

•  Started  3  yr  ago  as  1st  Cloudera  Support  Eng  •  Now  manages  Cloudera’s  2  largest  customers  

•  Sqoop  CommiJer,  PMC  Member  •  Co-­‐Author  of  the  Apache  Sqoop  Cookbook  

What  is  Sqoop?  

•  Apache  Top-­‐Level  Project  •  SQl  to  hadOOP  •  Tool  to  transfer  data  from  rela4onal  databases  

•  Teradata,  MySQL,  PostgreSQL,  Oracle,  Netezza  

•  To/From  Hadoop  ecosystem  •  HDFS  (text,  sequence  file),  Hive,  HBase,  Avro  

3

Why  Sqoop?  

•  Efficient/Controlled  resource  u4liza4on  •  Concurrent  connec4ons,  Time  of  opera4on  

•  Datatype  mapping  and  conversion  •  Automa4c,  and  User  override  

• Metadata  propaga4on  •  Sqoop  Record  •  Hive  Metastore  •  Avro  

Agenda  

Sqoop  1  •  Sqoop  1  Architecture  •  Sqoop  1  Command  Line  •  Sqoop  1  Examples  •  Sqoop  1  Challenges  •  Troubleshoo4ng  Sqoop  1  •  Common  Sqoop  1  Issues  

•  Protec4ng  Your  Password  •  Sqoop  Works  on  CLI  Not  in  Oozie  •  Choosing  Proper  Connector  •  Overriding  Type  Mapping  

Sqoop  2  •  Sqoop  2  Architecture  •  Sqoop  2  Design  Goals  •  Sqoop  2  UI  in  Hue  Resources  

Agenda  

Sqoop  1  •  Sqoop  1  Architecture  •  Sqoop  1  Command  Line  •  Sqoop  1  Examples  •  Sqoop  1  Challenges  •  Troubleshoo4ng  Sqoop  1  •  Common  Sqoop  1  Issues  

•  Protec4ng  Your  Password  •  Sqoop  Works  on  CLI  Not  in  Oozie  •  Choosing  Proper  Connector  •  Overriding  Type  Mapping  

Sqoop  2  •  Sqoop  2  Architecture  •  Sqoop  2  Design  Goals  •  Sqoop  2  UI  in  Hue  Resources  

Sqoop  1  Architecture  

7

Sqoop  1  Command  Line  

sqoop TOOL PROPS ARG [-- EXTRA] •  TOOL:  import,  export  •  PROPS

•  Hadoop  (java)  proper4es  •  -Dwhatever.whenever=yes

•  ARG •  Generic  SQOOP  arguments  •  --table, --connect,  ...  

•  EXTRA •  connector  specific  •  --schema (PostgreSQL  and  Microsoa  SQL  Server)  

Sqoop  1  Example  

sqoop import \ --connect jdbc:mysql://mysql.example.com/sqoop \ --username sqoop --password sqoop \ --table cities

sqoop export \ --connect jdbc:mysql://mysql.example.com/sqoop \ --username sqoop --password sqoop \ --table cities \ --export-dir /temp/cities

Sqoop  1  Challenges  

•  Cryp4c,  contextual  command  line  arguments  •  Security  concerns  •  Type  mapping  is  not  clearly  defined  •  Client  needs  access  to  Hadoop  binaries/configura4on  and  database  

•  JDBC  model  is  enforced  

10

Troubleshoo4ng  Sqoop  1  

•  Versions:  Sqoop,  Hadoop,  OS,  JDBC  •  Console  log  aaer  running  with  the  --verbose flag  

•  Capture  the  en4re  output  via  sqoop import … &> sqoop.log •  En4re  Sqoop  command  including  the  op4ons-­‐file  if  applicable  •  Expected  output  and  actual  output  •  Table  defini4on  •  Small  input  data  set  that  triggers  the  problem  

•  Especially  with  export,  malformed  data  is  oaen  the  culprit  •  Hadoop  task  logs  

•  Oaen  the  task  logs  contain  further  informa4on  describing  the  problem  •  Permissions  on  input  files  

Troubleshoo4ng  Sqoop  1  

Imported  table  has  more  rows  than  source  table?  •  Data  contains  char  used  as  Hive’s  delimiters  

•  Clean  up  data  •  --hive-drop-import-delims

•  Removes  \n, \t, and \01 char

•  --hive-delims-replacement “SPECIAL” •  Replaces  \n, \t, and \01  char  with  string  SPECIAL

•  Not  restricted  to  Hive  -­‐  any  import  job  using  text  files  •  Ensure  output  files  have  one  line  per  imported  row  

Agenda  

Sqoop  1  •  Sqoop  1  Architecture  •  Sqoop  1  Command  Line  •  Sqoop  1  Examples  •  Sqoop  1  Challenges  •  Troubleshoo4ng  Sqoop  1  •  Common  Sqoop  1  Issues  

•  Protec4ng  Your  Password  •  Sqoop  Works  on  CLI  Not  in  Oozie  •  Choosing  Proper  Connector  •  Overriding  Type  Mapping  

Sqoop  2  •  Sqoop  2  Architecture  •  Sqoop  2  Design  Goals  •  Sqoop  2  UI  in  Hue  Resources  

Common  Sqoop  1  Issues  

•  Protec4ng  Your  Password  •  Sqoop  Works  on  CLI  Not  in  Oozie  •  Choosing  Proper  Connector  •  Overriding  Type  Mapping  

Common  Sqoop  1  Issues  

•  Protec4ng  Your  Password  •  Sqoop  Works  on  CLI  Not  in  Oozie  •  Choosing  Proper  Connector  •  Overriding  Type  Mapping  

Protec4ng  Your  Password  

sqoop import \ --connect jdbc:mysql://mysql.example.com/sqoop \ --username sqoop \ --table cities \ -P

sqoop import \ --connect jdbc:mysql://mysql.example.com/sqoop \ --username sqoop \ --table cities \ --password-file my-sqoop-password

Common  Sqoop  1  Issues  

•  Protec4ng  Your  Password  •  Sqoop  Works  on  CLI  Not  in  Oozie  •  Choosing  Proper  Connector  •  Overriding  Type  Mapping  

Sqoop  Works  on  CLI  Not  in  Oozie  

Character parameter '|' has multiple characters; only the first will be used.

Got error creating database manager: java.io.IOException:

No manager for connect string: "jdbc:teradata...”

Sqoop  Works  on  CLI  Not  in  Oozie  

sqoop import --password "spEci@l\$" \ –connect 'jdbc:x:/yyy;db=sqoop’

•  Remove  all  escaping  that  you’ve  added  for  the  shell  •  Use  <arg>  vs  <command>  tags  as  content  is  considered  to  be  one  parameter  

•  Put  all  -­‐D  parameters  into  configura4on  sec4on  •  Install  driver  into  workflow’s  lib/  directory  or  shared  ac4on  library  /user/oozie/share/lib/sqoop/  

Common  Sqoop  1  Issues  

•  Protec4ng  Your  Password  •  Sqoop  Works  on  CLI  Not  in  Oozie  •  Choosing  Proper  Connector  •  Overriding  Type  Mapping  

Choosing  Proper  Connector  

•  JDBC  driver  is  dependency  for  all  three  connectors  

•  Sqoop  automa4cally  chooses  most  op4mal  connector  (OraOoop,  built-­‐in,    

       Generic  JDBC  Connector)  •  Or  explicitly  chose:    --connection-manager com.quest.oraoop.OraOopConnManager

Common  Sqoop  1  Issues  

•  Protec4ng  Your  Password  •  Sqoop  Works  on  CLI  Not  in  Oozie  •  Choosing  Proper  Connector  •  Overriding  Type  Mapping  

Overriding  Type  Mapping  

-­‐-­‐map-­‐column-­‐java  parameter  •  comma  separated  list  of  key-­‐value  pairs  

•  key  =  exact  column  name  •  value  =  target  Java  type    

sqoop import \

--map-column-java \

c1=Float,c2=String,c3=String ...

Agenda  

Sqoop  1  •  Sqoop  1  Architecture  •  Sqoop  1  Command  Line  •  Sqoop  1  Examples  •  Sqoop  1  Challenges  •  Troubleshoo4ng  Sqoop  1  •  Common  Sqoop  1  Issues  

•  Protec4ng  Your  Password  •  Sqoop  Works  on  CLI  Not  in  Oozie  •  Choosing  Proper  Connector  •  Overriding  Type  Mapping  

Sqoop  2  •  Sqoop  2  Architecture  •  Sqoop  2  Design  Goals  •  Sqoop  2  UI  in  Hue  Resources  

Sqoop  2  Architecture  

25

Sqoop  2  Design  Goals  

•  Security  and  Separa4on  of  Concerns  •  Role  based  access  and  use  

•  Ease  of  extension  •  No  low-­‐level  Hadoop  knowledge  needed    •  No  func4onal  overlap  between  Connectors  

•  Ease  of  Use  •  Uniform  func4onality  •  Domain  specific  interac4ons  

Sqoop  2  UI  in  Hue  

•  Troubleshoo4ng  •  sqoop.log  file  is  located  in  @LOGDIR@  and  the  rest  should  be  in  server/logs/*  

•  Look  for  catalina.out,  catalina.log,  localhost-­‐*.log  

28

29

30

31

32

33

34

35

36

37

Agenda  

Sqoop  1  •  Sqoop  1  Architecture  •  Sqoop  1  Command  Line  •  Sqoop  1  Examples  •  Sqoop  1  Challenges  •  Troubleshoo4ng  Sqoop  1  •  Common  Sqoop  1  Issues  

•  Protec4ng  Your  Password  •  Sqoop  Works  on  CLI  Not  in  Oozie  •  Choosing  Proper  Connector  •  Overriding  Type  Mapping  

Sqoop  2  •  Sqoop  2  Architecture  •  Sqoop  2  Design  Goals  •  Sqoop  2  UI  in  Hue  Resources  

Resources  

39

Sqoop 2 http://archive-primary.cloudera.com/cdh5/cdh/5/sqoop2/

Sqoop 1