Page 1: Apache Sqoop: Unlocking Hadoop for Your Relational Database

Unlocking Hadoop for Your Relational DB

Kathleen Ting | @kate_ting
Technical Account Manager, Cloudera | Sqoop PMC Member
Hadoop User Group UK, 10 April 2014

Page 2: Apache Sqoop: Unlocking Hadoop for Your Relational Database

Who Am I?

• Started 3 years ago as the first Cloudera Support Engineer
• Now manages Cloudera’s 2 largest customers
• Sqoop Committer, PMC Member
• Co-author of the Apache Sqoop Cookbook

Page 3: Apache Sqoop: Unlocking Hadoop for Your Relational Database

What is Sqoop?

• Apache Top-Level Project
• SQl to hadOOP
• Tool to transfer data from relational databases
  • Teradata, MySQL, PostgreSQL, Oracle, Netezza
• To/From Hadoop ecosystem
  • HDFS (text, sequence file), Hive, HBase, Avro


Page 4: Apache Sqoop: Unlocking Hadoop for Your Relational Database

Why Sqoop?

• Efficient/controlled resource utilization
  • Concurrent connections, time of operation
• Datatype mapping and conversion
  • Automatic, with user override
• Metadata propagation
  • Sqoop Record
  • Hive Metastore
  • Avro

Page 5: Apache Sqoop: Unlocking Hadoop for Your Relational Database

Agenda

Sqoop 1
• Sqoop 1 Architecture
• Sqoop 1 Command Line
• Sqoop 1 Examples
• Sqoop 1 Challenges
• Troubleshooting Sqoop 1
• Common Sqoop 1 Issues
  • Protecting Your Password
  • Sqoop Works on CLI Not in Oozie
  • Choosing Proper Connector
  • Overriding Type Mapping

Sqoop 2
• Sqoop 2 Architecture
• Sqoop 2 Design Goals
• Sqoop 2 UI in Hue

Resources

Page 7: Apache Sqoop: Unlocking Hadoop for Your Relational Database

Sqoop 1 Architecture

Page 8: Apache Sqoop: Unlocking Hadoop for Your Relational Database

Sqoop 1 Command Line

sqoop TOOL PROPS ARG [-- EXTRA]

• TOOL: import, export
• PROPS
  • Hadoop (Java) properties
  • -Dwhatever.whenever=yes
• ARG
  • Generic Sqoop arguments
  • --table, --connect, ...
• EXTRA
  • Connector specific
  • --schema (PostgreSQL and Microsoft SQL Server)
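To make the four parts concrete, here is a minimal dry-run sketch that assembles a command line and prints it instead of executing Sqoop. The PostgreSQL host, database, and schema name are assumptions made up for illustration; the -D property is the placeholder from the slide, not a real Hadoop setting.

```shell
#!/bin/sh
# Dry run: build a Sqoop 1 command line part by part and print it.
# Nothing is executed, so this runs without Sqoop installed.

TOOL="import"
PROPS="-Dwhatever.whenever=yes"   # placeholder property taken from the slide
# Generic arguments; host and database below are hypothetical.
ARGS="--connect jdbc:postgresql://db.example.com/sqoop --table cities"
# Connector-specific argument; the schema name is hypothetical.
EXTRA="--schema public"

# EXTRA arguments come last, after a bare "--" separator.
CMD="sqoop $TOOL $PROPS $ARGS -- $EXTRA"
echo "$CMD"
```

The point of the sketch is the ordering: Hadoop properties come right after the tool name, and connector-specific arguments come only after the standalone "--".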

Page 9: Apache Sqoop: Unlocking Hadoop for Your Relational Database

Sqoop 1 Example

sqoop import \
  --connect jdbc:mysql://mysql.example.com/sqoop \
  --username sqoop --password sqoop \
  --table cities

sqoop export \
  --connect jdbc:mysql://mysql.example.com/sqoop \
  --username sqoop --password sqoop \
  --table cities \
  --export-dir /temp/cities

Page 10: Apache Sqoop: Unlocking Hadoop for Your Relational Database

Sqoop 1 Challenges

• Cryptic, contextual command line arguments
• Security concerns
• Type mapping is not clearly defined
• Client needs access to Hadoop binaries/configuration and database
• JDBC model is enforced

Page 11: Apache Sqoop: Unlocking Hadoop for Your Relational Database

Troubleshooting Sqoop 1

• Versions: Sqoop, Hadoop, OS, JDBC
• Console log after running with the --verbose flag
  • Capture the entire output via sqoop import … &> sqoop.log
• Entire Sqoop command, including the options-file if applicable
• Expected output and actual output
• Table definition
• Small input data set that triggers the problem
  • Especially with export, malformed data is often the culprit
• Hadoop task logs
  • Often the task logs contain further information describing the problem
• Permissions on input files
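The log-capture step can be sketched without a cluster. In the example below, fake_job is a hypothetical stand-in for a real sqoop import --verbose run; it writes one line to stdout and one to stderr. The redirection `> file 2>&1` is the portable spelling of bash's `&>` and sends both streams to the same log file.

```shell
#!/bin/sh
# fake_job is a hypothetical stand-in for "sqoop import --verbose ...".
fake_job() {
  echo "INFO: transferred rows"        # normal progress goes to stdout
  echo "ERROR: something failed" >&2   # errors go to stderr
}

LOG="$(mktemp)"

# Send stdout and stderr to the same file (equivalent to bash's "&> file").
fake_job > "$LOG" 2>&1

line_count=$(wc -l < "$LOG" | tr -d ' ')
echo "captured $line_count lines in $LOG"
```

Redirecting only stdout would drop the error line, which is usually the line a support request needs most.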

Page 12: Apache Sqoop: Unlocking Hadoop for Your Relational Database

Troubleshooting Sqoop 1

Imported table has more rows than source table?

• Data contains characters used as Hive’s delimiters
  • Clean up data
  • --hive-drop-import-delims
    • Removes \n, \t, and \01 characters
  • --hive-delims-replacement "SPECIAL"
    • Replaces \n, \t, and \01 characters with the string SPECIAL
• Not restricted to Hive: applies to any import job using text files
  • Ensure output files have one line per imported row
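The row-count inflation is easy to reproduce with plain shell tools. The sketch below is an emulation only, not Sqoop itself: it takes one made-up logical row whose text column contains a newline, counts the lines it occupies in a text file, then strips the three characters listed on the slide and counts again.

```shell
#!/bin/sh
# One logical row (made-up sample data) with a newline inside a text column.
row='1,San Francisco,first line
second line'

# Written as text, the single row spans two lines, so a line-oriented
# reader such as Hive counts two rows.
raw_lines=$(printf '%s\n' "$row" | wc -l | tr -d ' ')

# Emulate the cleanup by deleting \n, \t and \01 from the field data.
clean=$(printf '%s' "$row" | tr -d '\n\t\001')
clean_lines=$(printf '%s\n' "$clean" | wc -l | tr -d ' ')

echo "raw: $raw_lines, cleaned: $clean_lines"
```

After the cleanup the row occupies exactly one line again, which is the invariant the last bullet asks for.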

Page 14: Apache Sqoop: Unlocking Hadoop for Your Relational Database

Common Sqoop 1 Issues

• Protecting Your Password
• Sqoop Works on CLI Not in Oozie
• Choosing Proper Connector
• Overriding Type Mapping

Page 16: Apache Sqoop: Unlocking Hadoop for Your Relational Database

Protecting Your Password

sqoop import \
  --connect jdbc:mysql://mysql.example.com/sqoop \
  --username sqoop \
  --table cities \
  -P

sqoop import \
  --connect jdbc:mysql://mysql.example.com/sqoop \
  --username sqoop \
  --table cities \
  --password-file my-sqoop-password
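A password file for --password-file can be prepared along these lines. This is a sketch under assumptions: the temporary path and the password value sqoop are placeholders, and the file would still need to be placed wherever your Sqoop installation expects to read it. Sqoop uses the entire file content as the password, so the file is written without a trailing newline and then locked down to owner read-only.

```shell
#!/bin/sh
# Placeholder location; in practice pick a path only the job owner can reach.
PW_FILE="$(mktemp)"

# printf '%s' writes the password without a trailing newline
# (a plain "echo" would append one, and it would become part of the password).
printf '%s' 'sqoop' > "$PW_FILE"

# Restrict the file to owner read-only.
chmod 400 "$PW_FILE"

bytes=$(wc -c < "$PW_FILE" | tr -d ' ')
echo "password file holds $bytes bytes"
```

Checking the byte count is a quick way to confirm that no stray newline slipped in.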

Page 18: Apache Sqoop: Unlocking Hadoop for Your Relational Database

Sqoop Works on CLI Not in Oozie

Character parameter '|' has multiple characters; only the first will be used.

Got error creating database manager: java.io.IOException: No manager for connect string: "jdbc:teradata..."

Page 19: Apache Sqoop: Unlocking Hadoop for Your Relational Database

Sqoop Works on CLI Not in Oozie

sqoop import --password "spEci@l\$" \
  --connect 'jdbc:x:/yyy;db=sqoop'

• Remove all escaping that you’ve added for the shell
• Use <arg> tags rather than <command>; the content of each <arg> is treated as one parameter
• Put all -D parameters into the configuration section
• Install the driver into the workflow’s lib/ directory or the shared action library /user/oozie/share/lib/sqoop/
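To see what "remove all escaping" means in practice, the sketch below prints the values exactly as the shell would hand them to Sqoop once quoting and backslashes have been processed. These printed values, not the quoted forms typed on the command line, are what belongs inside the Oozie action.

```shell
#!/bin/sh
# The shell strips the double quotes and the backslash before Sqoop sees them.
password=$(printf '%s' "spEci@l\$")

# Single quotes only protect the ';' from the shell; they are not part of the value.
connect=$(printf '%s' 'jdbc:x:/yyy;db=sqoop')

echo "$password"
echo "$connect"
```

Oozie does not run the arguments through a shell, so any leftover quote or backslash would be passed to Sqoop literally.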

Page 21: Apache Sqoop: Unlocking Hadoop for Your Relational Database

Choosing Proper Connector

• The JDBC driver is a dependency for all three connectors
• Sqoop automatically chooses the most suitable connector (OraOop, built-in, Generic JDBC Connector)
• Or explicitly choose one: --connection-manager com.quest.oraoop.OraOopConnManager

Page 23: Apache Sqoop: Unlocking Hadoop for Your Relational Database

Overriding Type Mapping

--map-column-java parameter
• Comma-separated list of key-value pairs
  • key = exact column name
  • value = target Java type

sqoop import \
  --map-column-java \
  c1=Float,c2=String,c3=String ...
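The mapping string is just a comma-separated list of column=type pairs. As a small illustration of that format, the helper below (describe_mapping, a hypothetical name and not part of Sqoop) splits such a string and prints each pair, which can help when checking a long mapping before passing it to --map-column-java.

```shell
#!/bin/sh
# describe_mapping is a hypothetical helper for illustration only.
# It splits "col=Type,col=Type" on commas and prints one pair per line.
describe_mapping() {
  old_ifs=$IFS
  IFS=','
  for pair in $1; do
    # Text before the first '=' is the column, text after it the Java type.
    printf '%s -> %s\n' "${pair%%=*}" "${pair#*=}"
  done
  IFS=$old_ifs
}

describe_mapping "c1=Float,c2=String,c3=String"
```

Because the key must be the exact column name, a listing like this makes a mistyped column easier to spot.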

Page 25: Apache Sqoop: Unlocking Hadoop for Your Relational Database

Sqoop 2 Architecture

Page 26: Apache Sqoop: Unlocking Hadoop for Your Relational Database

Sqoop 2 Design Goals

• Security and separation of concerns
  • Role-based access and use
• Ease of extension
  • No low-level Hadoop knowledge needed
  • No functional overlap between connectors
• Ease of use
  • Uniform functionality
  • Domain-specific interactions

Page 27: Apache Sqoop: Unlocking Hadoop for Your Relational Database

Sqoop 2 UI in Hue

• Troubleshooting
  • The sqoop.log file is located in @LOGDIR@ and the rest should be in server/logs/*
  • Look for catalina.out, catalina.log, localhost-*.log

Page 39: Apache Sqoop: Unlocking Hadoop for Your Relational Database

Resources

Sqoop 2 http://archive-primary.cloudera.com/cdh5/cdh/5/sqoop2/

Sqoop 1