Apache Sqoop: Unlocking Hadoop for Your Relational Database

Unlocking Hadoop for Your Relational DB
Kathleen Ting | @kate_ting
Technical Account Manager, Cloudera | Sqoop PMC Member
Hadoop User Group UK, 10 April 2014

description

Kathleen Ting, Technical Account Manager at Cloudera and Sqoop Committer

Unlocking data stored in an organization's RDBMS and transferring it to Apache Hadoop is a major concern in the big data industry. Apache Sqoop lets users with information stored in existing SQL tables use new analytic tools like Apache HBase and Apache Hive. This talk covers how to deploy and apply Sqoop in your environment and how to transfer data from MySQL, Oracle, PostgreSQL, SQL Server, Netezza, Teradata, and other relational systems. It also shows how to keep table data and Hadoop in sync by importing data incrementally, and how to customize transferred data by calling various database functions.

Transcript of Apache Sqoop: Unlocking Hadoop for Your Relational Database

Page 1: Apache Sqoop: Unlocking Hadoop for Your Relational Database

Unlocking Hadoop for Your Relational DB

Kathleen Ting | @kate_ting
Technical Account Manager, Cloudera | Sqoop PMC Member
Hadoop User Group UK, 10 April 2014

Page 2: Apache Sqoop: Unlocking Hadoop for Your Relational Database

Who Am I?

• Started 3 yr ago as 1st Cloudera Support Eng
• Now manages Cloudera's 2 largest customers
• Sqoop Committer, PMC Member
• Co-Author of the Apache Sqoop Cookbook

Page 3: Apache Sqoop: Unlocking Hadoop for Your Relational Database

What is Sqoop?

• Apache Top-Level Project
• SQl to hadOOP
• Tool to transfer data from relational databases
    • Teradata, MySQL, PostgreSQL, Oracle, Netezza
• To/From Hadoop ecosystem
    • HDFS (text, sequence file), Hive, HBase, Avro


Page 4: Apache Sqoop: Unlocking Hadoop for Your Relational Database

Why Sqoop?

• Efficient/Controlled resource utilization
    • Concurrent connections, Time of operation
• Datatype mapping and conversion
    • Automatic, and User override
• Metadata propagation
    • Sqoop Record
    • Hive Metastore
    • Avro

Page 5: Apache Sqoop: Unlocking Hadoop for Your Relational Database

Agenda

Sqoop 1
• Sqoop 1 Architecture
• Sqoop 1 Command Line
• Sqoop 1 Examples
• Sqoop 1 Challenges
• Troubleshooting Sqoop 1
• Common Sqoop 1 Issues
    • Protecting Your Password
    • Sqoop Works on CLI Not in Oozie
    • Choosing Proper Connector
    • Overriding Type Mapping

Sqoop 2
• Sqoop 2 Architecture
• Sqoop 2 Design Goals
• Sqoop 2 UI in Hue

Resources

Page 7: Apache Sqoop: Unlocking Hadoop for Your Relational Database

Sqoop 1 Architecture

Page 8: Apache Sqoop: Unlocking Hadoop for Your Relational Database

Sqoop 1 Command Line

sqoop TOOL PROPS ARG [-- EXTRA]

• TOOL: import, export
• PROPS
    • Hadoop (Java) properties
    • -Dwhatever.whenever=yes
• ARG
    • Generic Sqoop arguments
    • --table, --connect, ...
• EXTRA
    • Connector specific
    • --schema (PostgreSQL and Microsoft SQL Server)
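The slot order above can be sketched as a small shell snippet that assembles a command and prints it rather than running it. This is only an illustration: the PostgreSQL host, the job-name property value, and the schema name `public` are assumptions, not values from the talk.

```shell
# Each variable fills one slot of: sqoop TOOL PROPS ARG [-- EXTRA]
TOOL='import'
# Hadoop (Java) property; the job name is a hypothetical value.
PROPS='-Dmapreduce.job.name=cities-import'
# Generic Sqoop arguments; the host is hypothetical, -P prompts for the password.
ARG='--connect jdbc:postgresql://postgresql.example.com/sqoop --username sqoop -P --table cities'
# Connector-specific argument, placed after the bare "--"; schema name is hypothetical.
EXTRA='--schema public'

# Assemble and print the command instead of executing it.
sqoop_cmd="sqoop $TOOL $PROPS $ARG -- $EXTRA"
echo "$sqoop_cmd"
```

The point of the sketch is the ordering: Hadoop properties come right after the tool name, and anything connector specific goes after the standalone `--`.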

Page 9: Apache Sqoop: Unlocking Hadoop for Your Relational Database

Sqoop 1 Example

sqoop import \
  --connect jdbc:mysql://mysql.example.com/sqoop \
  --username sqoop --password sqoop \
  --table cities

sqoop export \
  --connect jdbc:mysql://mysql.example.com/sqoop \
  --username sqoop --password sqoop \
  --table cities \
  --export-dir /temp/cities

Page 10: Apache Sqoop: Unlocking Hadoop for Your Relational Database

Sqoop 1 Challenges

• Cryptic, contextual command line arguments
• Security concerns
• Type mapping is not clearly defined
• Client needs access to Hadoop binaries/configuration and database
• JDBC model is enforced

Page 11: Apache Sqoop: Unlocking Hadoop for Your Relational Database

Troubleshooting Sqoop 1

• Versions: Sqoop, Hadoop, OS, JDBC
• Console log after running with the --verbose flag
    • Capture the entire output via sqoop import … &> sqoop.log
• Entire Sqoop command including the options-file if applicable
• Expected output and actual output
• Table definition
• Small input data set that triggers the problem
    • Especially with export, malformed data is often the culprit
• Hadoop task logs
    • Often the task logs contain further information describing the problem
• Permissions on input files
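A minimal sketch of gathering the version information from the checklist above into one file. The report path is an assumption, and the JDBC driver version still has to be noted by hand because it depends on which driver jar is installed.

```shell
# Hypothetical output path for the collected details.
report="${TMPDIR:-/tmp}/sqoop-report.txt"

{
  echo "== OS =="
  uname -a
  # Record tool versions, skipping any tool that is not on PATH.
  for tool in sqoop hadoop; do
    echo "== $tool version =="
    if command -v "$tool" >/dev/null 2>&1; then
      "$tool" version 2>&1
    else
      echo "$tool not found on PATH"
    fi
  done
} > "$report"

# Next step (not run here): repeat the failing job with --verbose and keep
# the whole console log, for example:
#   sqoop import --verbose ... > sqoop.log 2>&1
```

The `> sqoop.log 2>&1` form in the comment is the portable spelling of the `&>` redirect shown on the slide, which is a bash shortcut.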

Page 12: Apache Sqoop: Unlocking Hadoop for Your Relational Database

Troubleshooting Sqoop 1

Imported table has more rows than source table?

• Data contains char used as Hive's delimiters
    • Clean up data
    • --hive-drop-import-delims
        • Removes \n, \r, and \01 char
    • --hive-delims-replacement "SPECIAL"
        • Replaces \n, \r, and \01 char with string SPECIAL
• Not restricted to Hive - any import job using text files
    • Ensure output files have one line per imported row
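Why the row count grows can be reproduced without Sqoop at all. The sketch below uses two made-up rows, one with a newline inside its text column, and counts lines before and after stripping the delimiter characters. The `tr` call only imitates the effect of --hive-drop-import-delims; it is not what Sqoop runs internally.

```shell
# Two logical rows; the second has a newline embedded in its text column.
row1='1,London'
row2='2,New
York'

# Written as-is, the text file holds three lines for two rows.
raw_lines=$(printf '%s\n%s\n' "$row1" "$row2" | wc -l | tr -d ' ')

# Dropping \n, \r, and \01 from the field first restores one line per row.
clean2=$(printf '%s' "$row2" | tr -d '\n\r\001')
clean_lines=$(printf '%s\n%s\n' "$row1" "$clean2" | wc -l | tr -d ' ')

echo "raw=$raw_lines clean=$clean_lines"   # prints: raw=3 clean=2
```

Any reader that treats a newline as the record terminator sees the raw file as three records, which is how an imported table can end up with more rows than its source.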

Page 14: Apache Sqoop: Unlocking Hadoop for Your Relational Database

Common Sqoop 1 Issues

• Protecting Your Password
• Sqoop Works on CLI Not in Oozie
• Choosing Proper Connector
• Overriding Type Mapping

Page 16: Apache Sqoop: Unlocking Hadoop for Your Relational Database

Protecting Your Password

sqoop import \
  --connect jdbc:mysql://mysql.example.com/sqoop \
  --username sqoop \
  --table cities \
  -P

sqoop import \
  --connect jdbc:mysql://mysql.example.com/sqoop \
  --username sqoop \
  --table cities \
  --password-file my-sqoop-password
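One way to prepare the file named in the second command is sketched below. The directory is an assumption, and the password value simply mirrors the earlier example. Sqoop uses the entire file content as the password, so the file is written without a trailing newline and locked down to its owner.

```shell
# Hypothetical location for the password file.
pw_file="${TMPDIR:-/tmp}/my-sqoop-password"
rm -f "$pw_file"

# printf '%s' writes no trailing newline; the subshell umask keeps the file
# private from the moment it is created.
( umask 077; printf '%s' 'sqoop' > "$pw_file" )

# Read-only for the owner, no access for anyone else.
chmod 400 "$pw_file"
```

Sqoop resolves the path through Hadoop's filesystem layer, so depending on the cluster configuration a local file may need a `file://` prefix or the file may need to live on HDFS; check the user guide for the version in use.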

Page 18: Apache Sqoop: Unlocking Hadoop for Your Relational Database

Sqoop Works on CLI Not in Oozie

Character parameter '|' has multiple characters; only the first will be used.

Got error creating database manager: java.io.IOException: No manager for connect string: "jdbc:teradata..."

Page 19: Apache Sqoop: Unlocking Hadoop for Your Relational Database

Sqoop Works on CLI Not in Oozie

sqoop import --password "spEci@l\$" \
  --connect 'jdbc:x:/yyy;db=sqoop'

• Remove all escaping that you've added for the shell
• Use <arg> tags instead of <command>: the content of each <arg> is passed as one parameter
• Put all -D parameters into the configuration section
• Install the driver into the workflow's lib/ directory or the shared action library /user/oozie/share/lib/sqoop/
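Applied to the command above, those rules might produce a workflow action like the one this sketch writes to disk. The action name, transition targets, queue property, and output path are all assumptions for illustration; only the password and connect string come from the slide, now with the shell quoting and backslash removed.

```shell
# Hypothetical output path for the workflow fragment.
wf="${TMPDIR:-/tmp}/sqoop-action.xml"

# The quoted 'EOF' stops the shell from touching ${jobTracker}, ${nameNode},
# or the trailing $ of the password, so the XML is written verbatim.
cat > "$wf" <<'EOF'
<action name="sqoop-import">
  <sqoop xmlns="uri:oozie:sqoop-action:0.2">
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <configuration>
      <!-- Former -D parameters go here (property chosen for illustration) -->
      <property>
        <name>mapreduce.job.queuename</name>
        <value>default</value>
      </property>
    </configuration>
    <!-- One arg element per parameter, with no shell quotes or backslashes -->
    <arg>import</arg>
    <arg>--password</arg>
    <arg>spEci@l$</arg>
    <arg>--connect</arg>
    <arg>jdbc:x:/yyy;db=sqoop</arg>
  </sqoop>
  <ok to="end"/>
  <error to="fail"/>
</action>
EOF
```

Because each value sits in its own element, the semicolon in the connect string and the `$` in the password need no protection; the quoting on the command line existed only for the shell's benefit.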

Page 21: Apache Sqoop: Unlocking Hadoop for Your Relational Database

Choosing Proper Connector

• JDBC driver is a dependency for all three connectors
• Sqoop automatically chooses the most suitable connector (OraOop, built-in, Generic JDBC Connector)
• Or explicitly choose: --connection-manager com.quest.oraoop.OraOopConnManager

Page 23: Apache Sqoop: Unlocking Hadoop for Your Relational Database

Overriding Type Mapping

--map-column-java parameter
• Comma-separated list of key-value pairs
    • key = exact column name
    • value = target Java type

sqoop import \
  --map-column-java \
  c1=Float,c2=String,c3=String ...

Page 25: Apache Sqoop: Unlocking Hadoop for Your Relational Database

Sqoop 2 Architecture

Page 26: Apache Sqoop: Unlocking Hadoop for Your Relational Database

Sqoop 2 Design Goals

• Security and Separation of Concerns
    • Role based access and use
• Ease of extension
    • No low-level Hadoop knowledge needed
    • No functional overlap between Connectors
• Ease of Use
    • Uniform functionality
    • Domain specific interactions

Page 27: Apache Sqoop: Unlocking Hadoop for Your Relational Database

Sqoop 2 UI in Hue

• Troubleshooting
    • sqoop.log file is located in @LOGDIR@ and the rest should be in server/logs/*
    • Look for catalina.out, catalina.log, localhost-*.log

Page 39: Apache Sqoop: Unlocking Hadoop for Your Relational Database

Resources

Sqoop 2 http://archive-primary.cloudera.com/cdh5/cdh/5/sqoop2/

Sqoop 1