Hadoop Summit EU 2014

DATA INTEGRATION AND SQL APPLICATION MIGRATION WITH CASCADING LINGUAL Chris K Wensel | Hadoop Summit EU 2014

Transcript of Hadoop Summit EU 2014

Page 1: Hadoop Summit EU   2014

DATA INTEGRATION AND SQL APPLICATION MIGRATION WITH CASCADING LINGUAL

Chris K Wensel | Hadoop Summit EU 2014

Page 2: Hadoop Summit EU   2014

• Not a “data scientist”

• No idea what “big data” means

• Used MR in anger once, and did it wrong

• Author of Cascading

• Co-Author of Lingual (w/ Julian Hyde)

CHRIS K WENSEL

2

Page 3: Hadoop Summit EU   2014

3

Why is Hadoop & “big data” a thing?

Page 4: Hadoop Summit EU   2014

More is better

HADOOP & BIG DATA

4

Page 5: Hadoop Summit EU   2014

• More Data

• More Machines

• More Algorithms

• More Tools

HADOOP & BIG DATA

5

Page 6: Hadoop Summit EU   2014

Worse is better

HADOOP & BIG DATA

6

Page 7: Hadoop Summit EU   2014

• Less red tape

• More degrees of freedom

• No upfront design

HADOOP & BIG DATA

7

Page 8: Hadoop Summit EU   2014

8

Why Cascading?

Page 9: Hadoop Summit EU   2014

Makes hard things possible.

CASCADING

9

Page 10: Hadoop Summit EU   2014

While helping to retain Conceptual Integrity.

CASCADING

10

Page 11: Hadoop Summit EU   2014

"the speed of innovation is proportional to the arrival rate of answers to questions"

HADOOP & BIG DATA

11

Page 12: Hadoop Summit EU   2014

True when you are questioning Data, Algorithms, and Architecture

CASCADING

12

Page 13: Hadoop Summit EU   2014

• Java API (alternative to Hadoop MapReduce)

• Separates business logic from integration

• Testable at every lifecycle stage

• Works with any JVM language

• Many integration adapters

CASCADING

13

[Diagram: Cascading architecture. Labels: Scripting (Scala, Clojure, JRuby, Jython, Groovy); Enterprise Java; Processing API; Integration API; Scheduler API; Process Planner; Cascading; Compute; Data Stores; Scheduler]

Page 14: Hadoop Summit EU   2014

ECOSYSTEM

14

[Diagram: ecosystem stack. Labels: Lingual; Pattern; Scalding; Cascalog; Cascading; Hadoop MR; Hadoop Tez; Whatever]

Page 15: Hadoop Summit EU   2014

• Started in 2007

• 2.0 released June 2012

• 2.5 stable out now

• 3.0 WIP (work in progress) now available

• Tez support coming soon

• Apache Licensed Open-Source

• Supports all Hadoop 1 & 2 distros

CASCADING

15

Page 16: Hadoop Summit EU   2014

ANSI SQL

on Cascading

on Whatever

LINGUAL

16

Page 17: Hadoop Summit EU   2014

How’s this different than all the other “SQL for Hadoop” projects?

LINGUAL

17

Page 18: Hadoop Summit EU   2014

Not intended as an ad-hoc query interface.

[Lingual is only as fast as Hadoop]

WHY LINGUAL?

18

Page 19: Hadoop Summit EU   2014

Is intended to be as standards compliant as possible.

WHY LINGUAL?

19

Page 20: Hadoop Summit EU   2014

Migrate workloads from expensive systems to less expensive Hadoop

WHY LINGUAL?

20

Page 21: Hadoop Summit EU   2014

Liberate the data trapped on Hadoop w/o involving an Engineer

WHY LINGUAL?

21

Page 22: Hadoop Summit EU   2014

• ANSI Compatible SQL

• JDBC Driver

• Cascading Java API

• SQL Command Shell

• Catalog Manager Tool

• Data Provider API

LINGUAL

22

[Diagram: Lingual architecture. Labels: CLI / Shell; Enterprise Java; Catalog; JDBC API; Lingual API; Provider API; Query Planner; Lingual; Cascading; Compute; Data Stores]

Page 23: Hadoop Summit EU   2014

• SQL-92

• Character, Numeric, and Temporal types

• IN and CASE

• FROM sub-queries

• CAST and CONVERT

• CURRENT_*

ANSI SQL

23

http://docs.cascading.org/lingual/1.1/#sql-support

Page 24: Hadoop Summit EU   2014

24

query:
      {
          select
      |   query UNION [ ALL ] query
      |   query EXCEPT query
      |   query INTERSECT query
      }
      [ ORDER BY orderItem [, orderItem ]* ]
      [ LIMIT { count | ALL } ]
      [ OFFSET start { ROW | ROWS } ]
      [ FETCH { FIRST | NEXT } [ count ] { ROW | ROWS } ]

orderItem:
      expression [ ASC | DESC ] [ NULLS FIRST | NULLS LAST ]

select:
      SELECT [ ALL | DISTINCT ]
          { * | projectItem [, projectItem ]* }
      FROM tableExpression
      [ WHERE booleanExpression ]
      [ GROUP BY { () | expression [, expression]* } ]
      [ HAVING booleanExpression ]
      [ WINDOW windowName AS windowSpec [, windowName AS windowSpec ]* ]

projectItem:
          expression [ [ AS ] columnAlias ]
      |   tableAlias . *

tableExpression:
          tableReference [, tableReference ]*
      |   tableExpression [ NATURAL ] [ LEFT | RIGHT | FULL ]
              JOIN tableExpression [ joinCondition ]

joinCondition:
          ON booleanExpression
      |   USING ( column [, column ]* )

tableReference:
      tablePrimary [ [ AS ] alias [ ( columnAlias [, columnAlias ]* ) ] ]

tablePrimary:
          [ TABLE ] [ [ catalogName . ] schemaName . ] tableName
      |   ( query )
      |   VALUES expression [, expression ]*
      |   ( TABLE expression )

windowRef:
          windowName
      |   windowSpec

windowSpec:
      [ windowName ]
      (
          [ ORDER BY orderItem [, orderItem ]* ]
          [ PARTITION BY expression [, expression ]* ]
          {
              RANGE numericOrInterval { PRECEDING | FOLLOWING }
          |   ROWS numeric { PRECEDING | FOLLOWING }
          }
      )

Lingual 1.1 -> Optiq 0.4.12.3

https://github.com/julianhyde/optiq/blob/master/REFERENCE.md

Page 25: Hadoop Summit EU   2014

Lingual provides two interfaces.

APIS

25

Page 26: Hadoop Summit EU   2014

Allows SQL and non-SQL Flows to work together as a single application via conceptually similar interfaces

CASCADING API

26

Page 27: Hadoop Summit EU   2014

27

Cascading API

FlowDef flowDef = FlowDef.flowDef()
  .setName( "sqlflow" )
  .addSource( "example.employee", emplTap )
  .addSource( "example.sales", salesTap )
  .addSink( "results", resultsTap );

SQLPlanner sqlPlanner = new SQLPlanner()
  .setSql( sqlStatement );

flowDef.addAssemblyPlanner( sqlPlanner );

Flow flow = new HadoopFlowConnector().connect( flowDef );

flow.complete();

Page 28: Hadoop Summit EU   2014

So Systems and People can talk directly to Hadoop visible data

JDBC API

28

Page 29: Hadoop Summit EU   2014

29

JDBC driver

public void run() throws ClassNotFoundException, SQLException {
  Class.forName( "cascading.lingual.jdbc.Driver" );
  Connection connection =
    DriverManager.getConnection(
      "jdbc:lingual:local;schemas=src/main/resources/data/example" );
  Statement statement = connection.createStatement();

  ResultSet resultSet = statement.executeQuery(
    "select *\n"
      + "from \"EXAMPLE\".\"SALES_FACT_1997\" as s\n"
      + "join \"EXAMPLE\".\"EMPLOYEE\" as e\n"
      + "on e.\"EMPID\" = s.\"CUST_ID\"" );

  // do something

  resultSet.close();
  statement.close();
  connection.close();
}
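The connection strings on these slides share one shape: the prefix `jdbc:lingual:`, a platform name (`local` here), then semicolon-separated `key=value` pairs such as `schemas`, `catalog`, and `schema`. A minimal sketch of a helper that assembles such a URL follows. The class `LingualUrl` and its `build` method are hypothetical and not part of Lingual, and the sketch assumes no escaping is needed for the values.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not part of Lingual): assembles a JDBC URL shaped
// like the ones on the slides, "jdbc:lingual:<platform>;key=value;...".
final class LingualUrl {
    private LingualUrl() {}

    static String build(String platform, Map<String, String> properties) {
        StringBuilder url = new StringBuilder("jdbc:lingual:").append(platform);
        // Append each property in insertion order, separated by semicolons.
        for (Map.Entry<String, String> entry : properties.entrySet()) {
            url.append(';').append(entry.getKey()).append('=').append(entry.getValue());
        }
        return url.toString();
    }

    public static void main(String[] args) {
        Map<String, String> properties = new LinkedHashMap<>();
        properties.put("schemas", "src/main/resources/data/example");
        // Rebuilds the URL used in the JDBC example.
        System.out.println(build("local", properties));
    }
}
```

The resulting string is what would be passed to `DriverManager.getConnection`.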

Page 30: Hadoop Summit EU   2014

JDBC

30

[Diagram: JDBC architecture. Labels: Server / Desktop; SQL (select * from employees ...); JDBC; lingual-hadoop-1.1.0-jdbc.jar; meta-data catalog; Assembly; Flow; Cluster; Job; Job]

Page 31: Hadoop Summit EU   2014

DEFAULT SHELL

31

Page 32: Hadoop Summit EU   2014

select dept_no, avg( max_salary )
from employees.dept_emp,
  ( select emp_no as sal_emp_no, max( salary ) as max_salary
    from employees.salaries
    group by emp_no )
where dept_emp.emp_no = sal_emp_no
group by dept_no;

SUB-QUERY

32
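To make the two-stage logic of the sub-query concrete, here is a plain-Java sketch of what it computes: the inner query takes each employee's maximum salary, and the outer query joins that to department membership and averages per department. This is only an in-memory illustration of the semantics, not how Lingual plans or runs the query. The class names and the sample rows in `main` are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Plain-Java illustration of the sub-query's logic; all names are hypothetical.
final class SubQuerySketch {
    static final class Salary {
        final int empNo;
        final int salary;
        Salary(int empNo, int salary) { this.empNo = empNo; this.salary = salary; }
    }

    static final class DeptEmp {
        final int empNo;
        final String deptNo;
        DeptEmp(int empNo, String deptNo) { this.empNo = empNo; this.deptNo = deptNo; }
    }

    static Map<String, Double> avgMaxSalaryByDept(List<DeptEmp> deptEmps, List<Salary> salaries) {
        // Inner query: select emp_no, max( salary ) ... group by emp_no
        Map<Integer, Integer> maxSalaryByEmp = salaries.stream()
            .collect(Collectors.toMap(s -> s.empNo, s -> s.salary, Integer::max));

        // Outer query: inner join on emp_no, group by dept_no, avg( max_salary )
        return deptEmps.stream()
            .filter(d -> maxSalaryByEmp.containsKey(d.empNo))
            .collect(Collectors.groupingBy(
                (DeptEmp d) -> d.deptNo,
                TreeMap::new,
                Collectors.averagingInt((DeptEmp d) -> maxSalaryByEmp.get(d.empNo))));
    }

    public static void main(String[] args) {
        // Made-up rows, for illustration only.
        List<DeptEmp> deptEmps = Arrays.asList(
            new DeptEmp(1, "d001"), new DeptEmp(2, "d001"), new DeptEmp(3, "d002"));
        List<Salary> salaries = Arrays.asList(
            new Salary(1, 100), new Salary(1, 200), new Salary(2, 300),
            new Salary(3, 50), new Salary(3, 150));
        System.out.println(avgMaxSalaryByDept(deptEmps, salaries));
    }
}
```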

Page 33: Hadoop Summit EU   2014

ACCESS HADOOP FROM R

33

# load the JDBC package
library(RJDBC)

# set up the driver
drv <- JDBC("cascading.lingual.jdbc.Driver",
  "~/src/concur/lingual/lingual-local/build/libs/lingual-local-1.0.0-wip-dev-jdbc.jar")

# set up a database connection to a local repository
connection <- dbConnect(drv,
  "jdbc:lingual:local;catalog=~/src/concur/lingual/lingual-examples/tables;schema=EMPLOYEES")

# query the repository: in this case the MySQL sample database (CSV files)
df <- dbGetQuery(connection,
  "SELECT * FROM EMPLOYEES.EMPLOYEES WHERE FIRST_NAME = 'Gina'")
head(df)

# use R functions to summarize and visualize part of the data
df$hire_age <- as.integer(as.Date(df$HIRE_DATE) - as.Date(df$BIRTH_DATE)) / 365.25
summary(df$hire_age)

library(ggplot2)
m <- ggplot(df, aes(x=hire_age))
m <- m + ggtitle("Age at hire, people named Gina")
m + geom_histogram(binwidth=1, aes(y=..density.., fill=..count..)) + geom_density()
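The `hire_age` line in the R script takes the whole number of days between birth date and hire date and divides by 365.25 to approximate years. A minimal Java sketch of the same arithmetic follows, for readers consuming the same JDBC result from Java instead of R. The class `HireAge` and its method are hypothetical, and the dates in `main` are made up.

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

// Hypothetical helper mirroring the R expression
// as.integer(as.Date(HIRE_DATE) - as.Date(BIRTH_DATE)) / 365.25
final class HireAge {
    private HireAge() {}

    static double years(LocalDate birthDate, LocalDate hireDate) {
        // Whole days between the two dates, like the integer day count in R.
        long days = ChronoUnit.DAYS.between(birthDate, hireDate);
        // 365.25 averages in one leap day every four years.
        return days / 365.25;
    }

    public static void main(String[] args) {
        // Made-up dates, for illustration only.
        System.out.println(years(LocalDate.of(1960, 1, 1), LocalDate.of(1990, 1, 1)));
    }
}
```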

Page 34: Hadoop Summit EU   2014

RESULTS

34

> summary(df$hire_age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  20.86   27.89   31.70   31.61   35.01   43.92

Page 35: Hadoop Summit EU   2014

INTEGRATION

35

But I use a custom data format!

Page 36: Hadoop Summit EU   2014

• Any Cascading Tap and/or Scheme can be used from JDBC

• Use a “fat jar” on local disk or from a Maven repo

‣ cascading-jdbc:cascading-jdbc-oracle-provider:1.0

• The Jar is dynamically loaded into cluster, on the fly

DATA PROVIDER API

36

Page 37: Hadoop Summit EU   2014

DATA PROVIDER

37

[Diagram: data provider loading. Labels: JDBC; lingual-hadoop-1.1.0-jdbc.jar; Maven Repo; cascading-jdbc-oracle-provider.jar; your-avro-provider.jar; Assembly; Flow; Cluster; Job; Job]

Page 38: Hadoop Summit EU   2014

AMAZON EMR & REDSHIFT

38

[Diagram labels: Amazon Elastic MapReduce (Job, Job, Job, Job); Amazon S3; Amazon RedShift; file1; file2; results]

SELECT ... FROM file1 JOIN file2 ON file1.id = file2.id ...

http://docs.cascading.org/tutorials/lingual-redshift/

Page 39: Hadoop Summit EU   2014

All Cascading applications can be visualized and monitored …

MANAGED

39

Page 40: Hadoop Summit EU   2014

• Understand how your application maps onto your cluster

• Identify bottlenecks (data, code, or the system)

• Jump to the line of code implicated on a failure

• Plugin available via Maven repo

• Beta UI hosted online

DRIVEN

40

http://cascading.io/driven/

Page 41: Hadoop Summit EU   2014

MANAGED WITH DRIVEN

41

Page 42: Hadoop Summit EU   2014

42

Page 43: Hadoop Summit EU   2014

A BOOK!

43

Enterprise Data Workflows with Cascading

O’Reilly, 2013 amazon.com/dp/1449358721

Page 44: Hadoop Summit EU   2014

[email protected]

@cwensel

DONE

44