HLoader – Automated Incremental Hadoop Data Loader Service and Framework

Page 1: HLoader – Automated Incremental Hadoop Data Loader Service and Framework
Page 2: HLoader – Automated Incremental Hadoop Data Loader Service and Framework

HLoader

Data Ingestion from Oracle Databases to Hadoop Clusters

Automatically, On-Demand

8/13/2015 HLoader – A. Bose, D. Stein 2


Page 3: HLoader – Automated Incremental Hadoop Data Loader Service and Framework

Problem

– Control and monitor data transfer: using Sqoop, a CLI tool for bulk data transfer

– Two in one: two distinct Summer Student task proposals for basically the same job

Page 4: HLoader – Automated Incremental Hadoop Data Loader Service and Framework

Problem

– Frequent requests: different users with different but similar use cases (ATLAS Job Monitoring, CMS Job Monitoring, CMS data popularity, ACCLOG)

– Manually executed job that can be partially automated

Page 5: HLoader – Automated Incremental Hadoop Data Loader Service and Framework

Requirements

– Run jobs…

… incrementally

… communicate with the end user

– Handle failures: retry, notify, prevent

– Be secure, stay safe: authorize and authenticate the users without exchanging passwords

– Use what’s provided: run on the CERN-provided infrastructure
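The "retry, notify, prevent" requirement could be sketched as a small wrapper around a transfer step. This is only an illustration: `run_with_retries`, `notify`, `flaky_transfer`, and the default of three attempts are hypothetical names and values, not taken from HLoader.

```python
import time
from typing import Callable, List


def run_with_retries(
    task: Callable[[], str],
    notify: Callable[[str], None],
    max_attempts: int = 3,
    delay_seconds: float = 0.0,
) -> str:
    """Run `task`, retrying on failure and notifying the user of each outcome."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            result = task()
        except Exception as exc:  # a real agent would catch narrower errors
            notify("attempt {} failed: {}".format(attempt, exc))
            if attempt == max_attempts:
                raise  # give up and surface the last error
            time.sleep(delay_seconds)
        else:
            notify("attempt {} succeeded".format(attempt))
            return result
    raise RuntimeError("unreachable")


# Hypothetical transfer step that fails twice before succeeding.
messages: List[str] = []
calls = {"count": 0}


def flaky_transfer() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise IOError("connection reset")
    return "done"


outcome = run_with_retries(flaky_transfer, messages.append)
```

Passing `notify` as a callable keeps the retry logic independent of how the end user is actually informed (e-mail, REST status, log entry).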


Page 6: HLoader – Automated Incremental Hadoop Data Loader Service and Framework

Solution overview

[Architecture diagram: Client, REST API, Meta DB, Agent, FIM, Oracle Databases, Hadoop Clusters]

1. Provided infrastructure: Oracle Databases and Hadoop Clusters

2. Transfer data: the user wants to transfer data, so they create a new job (what, when, where to transfer)

3. Execute the transfer on behalf of the user: schedule and execute the job at the requested time (also inform the user of the status)

4. Update if needed: if the user requested incremental updates, schedule it after the given interval
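Steps 2 to 4 can be illustrated with a minimal job description and a function that computes the next run time. This is a sketch under assumptions: the `Job` class and `next_run` function are hypothetical, the field names only loosely mirror the meta DB columns shown later in the deck, and the interval is assumed to be positive.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Job:
    source_object: str                    # what to transfer
    destination_path: str                 # where to put it
    start_time: datetime                  # when to run first
    interval: Optional[timedelta] = None  # None means a one-off transfer


def next_run(job: Job, now: datetime) -> Optional[datetime]:
    """Return the next scheduled run at or after `now`, or None if nothing is pending."""
    if now <= job.start_time:
        return job.start_time
    if job.interval is None:
        return None  # one-off job whose start time has already passed
    elapsed = now - job.start_time
    # Number of whole intervals since the start, rounded up (ceiling division).
    steps = -(-elapsed // job.interval)
    return job.start_time + steps * job.interval


daily = Job("MY_SCHEMA.MY_TABLE", "/user/example/my_table",
            datetime(2015, 8, 13), timedelta(days=1))
once = Job("MY_SCHEMA.MY_TABLE", "/user/example/my_table", datetime(2015, 8, 13))

upcoming = next_run(daily, datetime(2015, 8, 14, 6, 0))
finished = next_run(once, datetime(2015, 8, 14, 6, 0))
```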

Page 7: HLoader – Automated Incremental Hadoop Data Loader Service and Framework

Solution security

1. CERN SSO authentication: no password exchange

2. Authorization: only available (ownership) and enabled (configured) Oracle servers can be used

3. Kerberos SSH tunneling: a separate user logs in to the clusters, without a password

4. Secure password input: other users cannot see the password as plaintext anywhere
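The authorization rule in item 2 (a server must be both owned by the user and enabled in the service) amounts to a set intersection. The sketch below assumes hypothetical inputs: `owned_by_user` and `enabled_servers` stand in for whatever ownership lookup and configuration the service really uses, and the server names are made up.

```python
from typing import Dict, Set


def allowed_servers(
    username: str,
    owned_by_user: Dict[str, Set[str]],
    enabled_servers: Set[str],
) -> Set[str]:
    """Servers the user may import from: owned by them AND enabled in the service."""
    return owned_by_user.get(username, set()) & enabled_servers


# Hypothetical ownership data and configuration.
ownership = {
    "alice": {"oracle-a", "oracle-b"},
    "bob": {"oracle-c"},
}
enabled = {"oracle-a", "oracle-c"}

alice_servers = allowed_servers("alice", ownership, enabled)
unknown_servers = allowed_servers("mallory", ownership, enabled)
```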

Page 13: HLoader – Automated Incremental Hadoop Data Loader Service and Framework

Solution modularity

1. DB connector agnostic: SQLAlchemy supports several dialects, and other connectors can also be integrated

2. Interchangeable scheduler: based on the servers and the needed schedule complexity

3. Flexible communication with Hadoop: besides commands through SSH, Oozie could also be used

4. Client communicating using the REST API

5. Changeable Sqoop JDBC driver: normal or fast connectors if possible
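One way to get the "flexible communication with Hadoop" described in item 3 is to have the agent depend on a small interface rather than on SSH directly. The following is a sketch, not the HLoader code: `Runner`, `SSHRunner`, `LocalRunner`, `plan_transfer`, and the host name are all hypothetical, and the runners only build the command instead of executing it.

```python
import shlex
from abc import ABC, abstractmethod
from typing import List


class Runner(ABC):
    """Hypothetical interface the agent could use to reach a Hadoop cluster."""

    @abstractmethod
    def plan(self, sqoop_args: List[str]) -> List[str]:
        """Return the local command that would launch the Sqoop job."""


class SSHRunner(Runner):
    """Wraps the Sqoop command in an ssh call to a cluster node."""

    def __init__(self, host: str) -> None:
        self.host = host

    def plan(self, sqoop_args: List[str]) -> List[str]:
        # Quote each argument so the remote shell sees it unchanged.
        remote_command = " ".join(shlex.quote(arg) for arg in sqoop_args)
        return ["ssh", self.host, remote_command]


class LocalRunner(Runner):
    """Stand-in for any alternative backend; returns the command as given."""

    def plan(self, sqoop_args: List[str]) -> List[str]:
        return list(sqoop_args)


def plan_transfer(runner: Runner, sqoop_args: List[str]) -> List[str]:
    # The agent depends only on the Runner interface, not on a concrete backend.
    return runner.plan(sqoop_args)


ssh_plan = plan_transfer(
    SSHRunner("hadoop.example.org"),
    ["sqoop", "import", "--table", "MY TABLE"],
)
local_plan = plan_transfer(LocalRunner(), ["sqoop", "version"])
```

A backend such as Oozie would then be another `Runner` subclass, leaving the scheduling code untouched.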

Page 20: HLoader – Automated Incremental Hadoop Data Loader Service and Framework

Solution infrastructure

1. PostgreSQL On-Demand: with PostgreSQL and the SQLAlchemy connector

2. Central WebServices: DFS | Windows > IIS 8.5 > FastCGI > Python 2.7 > Flask

3. Agent running separately: on a DB locally managed server, OpenStack or WebServices (TBD)

4. Client hosted with the REST API: for easy usage and update, could be separate

Page 26: HLoader – Automated Incremental Hadoop Data Loader Service and Framework

Solution meta DB

HL_SERVERS: server_id (PK), server_address, server_name

HL_CLUSTERS: cluster_id (PK), cluster_address, cluster_name

HL_JOBS: job_id (PK), source_server_id (FK), source_schema_name, source_object_name, destination_cluster_id (FK), destination_path, owner_username, sqoop_nmap, sqoop_splitting_column, sqoop_incremental_method, sqoop_direct, start_time, interval, job_last_update

HL_TRANSFERS: transfer_id (PK), scheduler_transfer_id, job_id (FK), transfer_status, transfer_start, transfer_last_update, last_modified_value

HL_LOGS: log_id (PK), transfer_id (FK), log_source, log_path, log_content
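The schema on this slide can be written out as DDL to make the key relationships explicit. The sketch below uses SQLite from the Python standard library purely so it is self-contained (the deck itself names PostgreSQL On-Demand for the meta DB). All column types and the foreign-key targets are assumptions; only the table and column names come from the slide.

```python
import sqlite3

# Column types and REFERENCES targets are assumed; names follow the slide.
SCHEMA = """
CREATE TABLE HL_SERVERS (
    server_id      INTEGER PRIMARY KEY,
    server_address TEXT,
    server_name    TEXT
);
CREATE TABLE HL_CLUSTERS (
    cluster_id      INTEGER PRIMARY KEY,
    cluster_address TEXT,
    cluster_name    TEXT
);
CREATE TABLE HL_JOBS (
    job_id                   INTEGER PRIMARY KEY,
    source_server_id         INTEGER REFERENCES HL_SERVERS(server_id),
    source_schema_name       TEXT,
    source_object_name       TEXT,
    destination_cluster_id   INTEGER REFERENCES HL_CLUSTERS(cluster_id),
    destination_path         TEXT,
    owner_username           TEXT,
    sqoop_nmap               INTEGER,
    sqoop_splitting_column   TEXT,
    sqoop_incremental_method TEXT,
    sqoop_direct             INTEGER,
    start_time               TEXT,
    "interval"               TEXT,
    job_last_update          TEXT
);
CREATE TABLE HL_TRANSFERS (
    transfer_id           INTEGER PRIMARY KEY,
    scheduler_transfer_id TEXT,
    job_id                INTEGER REFERENCES HL_JOBS(job_id),
    transfer_status       TEXT,
    transfer_start        TEXT,
    transfer_last_update  TEXT,
    last_modified_value   TEXT
);
CREATE TABLE HL_LOGS (
    log_id      INTEGER PRIMARY KEY,
    transfer_id INTEGER REFERENCES HL_TRANSFERS(transfer_id),
    log_source  TEXT,
    log_path    TEXT,
    log_content TEXT
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# List the created tables to confirm the schema loaded.
table_names = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
]
```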

Page 27: HLoader – Automated Incremental Hadoop Data Loader Service and Framework

Solution restrictions

– Only allow tables and views to be imported: the DB is responsible for evaluating and checking the queries

– Selected (preconfigured) source databases: gradual introduction for new users

– Preset destination folder structure: with restricted access rights, avoiding collisions and unauthorized access

– Basic Sqoop command logic (for now): e.g., with a primary key, only one PK attribute
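The "basic Sqoop command logic" could look roughly like the function below, which turns a job row into a `sqoop import` argument list. This is a sketch under assumptions: `build_sqoop_import` and the sample values are hypothetical, the mapping from meta DB columns to Sqoop options is a guess, and using the splitting column as the incremental check column is an assumption suggested by the single-attribute primary key restriction. Password handling is deliberately left out.

```python
from typing import Dict, List, Optional


def build_sqoop_import(
    job: Dict[str, object],
    jdbc_url: str,
    username: str,
    last_value: Optional[str] = None,
) -> List[str]:
    """Build a `sqoop import` argument list from a job description (assumed mapping)."""
    table = "{}.{}".format(job["source_schema_name"], job["source_object_name"])
    args = [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--username", username,
        "--table", table,
        "--target-dir", str(job["destination_path"]),
        "--num-mappers", str(job["sqoop_nmap"]),
        "--split-by", str(job["sqoop_splitting_column"]),
    ]
    if job.get("sqoop_direct"):
        args.append("--direct")  # use the fast (direct) connector if requested
    method = job.get("sqoop_incremental_method")
    if method:
        # Assumption: the single-column primary key doubles as the check column.
        args += ["--incremental", str(method),
                 "--check-column", str(job["sqoop_splitting_column"])]
        if last_value is not None:
            args += ["--last-value", last_value]
    return args


# Hypothetical job row and connection details.
example_job = {
    "source_schema_name": "MY_SCHEMA",
    "source_object_name": "MY_TABLE",
    "destination_path": "/user/example/my_table",
    "sqoop_nmap": 4,
    "sqoop_splitting_column": "ID",
    "sqoop_incremental_method": "append",
    "sqoop_direct": True,
}
command = build_sqoop_import(
    example_job, "jdbc:oracle:thin:@//db.example.org:1521/SERVICE", "reader", "1000"
)
```

Returning a list rather than a single string avoids shell-quoting mistakes when the command is later handed to a runner.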

Page 28: HLoader – Automated Incremental Hadoop Data Loader Service and Framework

Solution current state

1. Client (Kate): in progress; meanwhile the REST interface can be used

2. REST API (Daniel): almost ready, missing the new job processing interface

3. Agent Scheduling (Ani): basically ready, can schedule jobs and update itself after job description modifications

4. Agent Runners (Daniel): working for initial imports, soon to be able to execute incremental updates; partially working SSH and REST monitoring

Page 30: HLoader – Automated Incremental Hadoop Data Loader Service and Framework

Solution future work


– Support more database connectors (SQLAlchemy/NoSQL)

– Support alternative runners like Oozie

– Prepare for Sqoop 2

– Integrate with Hive

– Resolve restrictions

– Release on GitHub with an Open Source license

Page 31: HLoader – Automated Incremental Hadoop Data Loader Service and Framework

Summary

– Easily expandable framework and service for transferring data from Oracle to Hadoop

– Designed with automation in mind: minimal administrator intervention needed

– Service built for easy usage: easy to use for the routine jobs

Page 32: HLoader – Automated Incremental Hadoop Data Loader Service and Framework

Workflow tools

– GitLab

– JIRA

– Slack

– Jenkins CI


Page 33: HLoader – Automated Incremental Hadoop Data Loader Service and Framework

Contributors

– Anirudha Bose
– Dániel Stein

– Antonio Romero Marin
– Domenico Giordano
– Kacper Surdy
– Katarzyna Maria Dziedziniewicz-Wójcik
– Manuel Martín Márquez
– Zbigniew Baranowski
