HLoader – Automated Incremental Hadoop Data Loader Service and Framework

Post on 15-Apr-2017


HLoader: Data Ingestion from Oracle Databases to Hadoop Clusters, Automatically, On-Demand

8/13/2015 HLoader – A. Bose, D. Stein 2


Problem

– Control and monitor data transfer: using Sqoop, a CLI tool for bulk data transfer

– Two in one: two distinct Summer Student task proposals for basically the same job


Problem

– Frequent requests: different users with similar use cases (ATLAS Job Monitoring, CMS Job Monitoring, CMS data popularity, ACCLOG)

– Manually executed job: one that can be partially automated


Requirements

– Run jobs…

… incrementally

… communicate with the end user

– Handle failures: retry, notify, prevent

– Be secure, stay safe: authenticate and authorize the users without exchanging passwords

– Use what's provided: run on the CERN-provided infrastructure

Solution overview

[Architecture diagram: Client, REST API, Meta DB, Agent, Oracle Databases, FIM, Hadoop Clusters]

1. Provided infrastructure: Oracle Databases and Hadoop Clusters

2. Transfer data: the user wants to transfer data, so they create a new job specifying what, when, and where to transfer

3. Execute the transfer on behalf of the user: schedule and execute the job at the requested time (and inform the user of the status)

4. Update if needed: if the user requested incremental updates, schedule the next one after the given interval
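The scheduling behaviour in steps 3 and 4 can be pictured with a small sketch. This is an illustration under stated assumptions, not HLoader's actual scheduler: the `Job` class and the `next_run` function are hypothetical, and only the `start_time` and `interval` names echo the job description in the slides.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Job:
    """Hypothetical job description: what, when, and how often to transfer."""
    source_object: str
    destination_path: str
    start_time: datetime
    interval: Optional[timedelta] = None  # None means a one-off transfer


def next_run(job: Job, now: datetime) -> Optional[datetime]:
    """Return the next time the job should run, or None if nothing is pending."""
    if now <= job.start_time:
        return job.start_time      # initial transfer is not due yet
    if job.interval is None:
        return None                # one-off job whose start time has passed
    # Incremental job: first multiple of the interval strictly after `now`.
    elapsed = now - job.start_time
    periods = elapsed // job.interval + 1
    return job.start_time + periods * job.interval
```

A job with no interval is treated as a single initial import; a job with an interval keeps producing run times aligned to its original start time, so a delayed run does not shift later ones.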

Solution security

1. CERN SSO authentication: no password exchange

2. Authorization: only available (ownership) and enabled (configured) Oracle servers can be used

3. Kerberos SSH tunneling: a separate user logs in to the clusters, without a password

4. Secure password input: other users cannot see the password as plaintext anywhere
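The authorization rule in item 2 is essentially a set intersection: a server is usable only if the user owns it and it has been enabled in the service configuration. A minimal sketch, with hypothetical function names; how ownership and configuration are actually looked up is not shown in the slides and is left out here.

```python
def allowed_servers(owned, enabled):
    """Servers a user may use: those they own that are also enabled (configured)."""
    return sorted(set(owned) & set(enabled))


def can_use(server, owned, enabled):
    """True if `server` passes both the ownership and the configuration check."""
    return server in allowed_servers(owned, enabled)
```

For example, a user owning `db1` and `db2` while only `db2` and `db3` are enabled would be allowed to create jobs against `db2` only.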

Solution modularity

1. DB connector agnostic: SQLAlchemy supports several dialects, and other connectors can also be integrated

2. Interchangeable scheduler: chosen based on the servers and the needed schedule complexity

3. Flexible communication with Hadoop: besides commands through SSH, Oozie could also be used

4. Client communicating using REST API

5. Changeable Sqoop JDBC driver: normal or fast connectors, if possible
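One way to picture the interchangeable communication layer from item 3 is a small runner interface with one implementation per backend. The sketch below is hypothetical throughout: the class names, the `build` method, and the stand-in second backend are assumptions for illustration, not HLoader's code. The SSH variant only assembles an `ssh` argument list (using the OpenSSH `GSSAPIAuthentication` option for Kerberos) and does not execute anything.

```python
import shlex
from abc import ABC, abstractmethod
from typing import List


class Runner(ABC):
    """Hypothetical interface: wraps a Sqoop argument list for one backend."""

    @abstractmethod
    def build(self, sqoop_args: List[str]) -> List[str]:
        ...


class SSHRunner(Runner):
    """Wraps the command for execution on a cluster gateway over SSH."""

    def __init__(self, user: str, host: str):
        self.user = user
        self.host = host

    def build(self, sqoop_args: List[str]) -> List[str]:
        remote = shlex.join(sqoop_args)  # quote arguments for the remote shell
        return ["ssh", "-o", "GSSAPIAuthentication=yes",
                f"{self.user}@{self.host}", remote]


class DryRunRunner(Runner):
    """Stand-in for another backend (e.g. Oozie); it only echoes the command."""

    def build(self, sqoop_args: List[str]) -> List[str]:
        return ["echo", shlex.join(sqoop_args)]
```

Because the agent would depend only on `Runner`, swapping SSH for another submission mechanism means adding a class rather than changing the scheduling code.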

Solution infrastructure

1. PostgreSQL On-Demand: with PostgreSQL and SQLAlchemy connector

2. Central WebServices: DFS | Windows > IIS 8.5 > FastCGI > Python 2.7 > Flask

3. Agent running separately: on DB locally managed server, OpenStack or WebServices (TBD)

4. Client hosted with the REST API: for easy usage and update, could be separate

Solution meta DB

HL_SERVERS
    server_id (PK)
    server_address
    server_name

HL_CLUSTERS
    cluster_id (PK)
    cluster_address
    cluster_name

HL_JOBS
    job_id (PK)
    source_server_id (FK)
    source_schema_name
    source_object_name
    destination_cluster_id (FK)
    destination_path
    owner_username
    sqoop_nmap
    sqoop_splitting_column
    sqoop_incremental_method
    sqoop_direct
    start_time
    interval
    job_last_update

HL_TRANSFERS
    transfer_id (PK)
    scheduler_transfer_id
    job_id (FK)
    transfer_status
    transfer_start
    transfer_last_update
    last_modified_value

HL_LOGS
    log_id (PK)
    transfer_id (FK)
    log_source
    log_path
    log_content
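To make the relationships between the five meta DB tables concrete, the sketch below creates them in an in-memory SQLite database. Table and column names come from the slide, with columns grouped by their name prefixes. The column types, the foreign-key targets, and the use of SQLite instead of the PostgreSQL service are assumptions made so the example runs with the standard library alone.

```python
import sqlite3

# Column types are assumed; the slide lists only names and PK/FK markers.
SCHEMA = """
CREATE TABLE HL_SERVERS (
    server_id      INTEGER PRIMARY KEY,
    server_address TEXT,
    server_name    TEXT
);
CREATE TABLE HL_CLUSTERS (
    cluster_id      INTEGER PRIMARY KEY,
    cluster_address TEXT,
    cluster_name    TEXT
);
CREATE TABLE HL_JOBS (
    job_id                   INTEGER PRIMARY KEY,
    source_server_id         INTEGER REFERENCES HL_SERVERS(server_id),
    source_schema_name       TEXT,
    source_object_name       TEXT,
    destination_cluster_id   INTEGER REFERENCES HL_CLUSTERS(cluster_id),
    destination_path         TEXT,
    owner_username           TEXT,
    sqoop_nmap               INTEGER,
    sqoop_splitting_column   TEXT,
    sqoop_incremental_method TEXT,
    sqoop_direct             INTEGER,
    start_time               TEXT,
    "interval"               INTEGER,
    job_last_update          TEXT
);
CREATE TABLE HL_TRANSFERS (
    transfer_id           INTEGER PRIMARY KEY,
    scheduler_transfer_id TEXT,
    job_id                INTEGER REFERENCES HL_JOBS(job_id),
    transfer_status       TEXT,
    transfer_start        TEXT,
    transfer_last_update  TEXT,
    last_modified_value   TEXT
);
CREATE TABLE HL_LOGS (
    log_id      INTEGER PRIMARY KEY,
    transfer_id INTEGER REFERENCES HL_TRANSFERS(transfer_id),
    log_source  TEXT,
    log_path    TEXT,
    log_content TEXT
);
"""


def create_meta_db() -> sqlite3.Connection:
    """Create the five tables in an in-memory database and return the connection."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default
    conn.executescript(SCHEMA)
    return conn
```

Read this way, a job links one source server to one destination cluster, each execution of a job is a transfer, and each transfer can have several log entries.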

Solution restrictions


– Only allow tables and views to be imported: the DB is responsible for evaluating and checking the queries

– Selected (preconfigured) source databases: gradual introduction for new users

– Preset destination folder structure: with restricted access rights, avoiding collisions and unauthorized access

– Basic Sqoop command logic (for now): e.g., with a primary key, only one PK attribute
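The "basic Sqoop command logic" could look roughly like the sketch below, which assembles a `sqoop import` argument list for a single table or view. The function name, its parameters, and the way job settings map onto flags are assumptions; the flags themselves (`--connect`, `--table`, `--target-dir`, `--num-mappers`, `--split-by`, `--direct`, `--incremental`, `--check-column`, `--last-value`) are standard Sqoop 1 import options. Credential handling is deliberately left out.

```python
from typing import List, Optional


def build_sqoop_import(jdbc_url: str, schema: str, table: str, target_dir: str,
                       num_mappers: int = 1,
                       split_by: Optional[str] = None,
                       direct: bool = False,
                       incremental: Optional[str] = None,  # "append" or "lastmodified"
                       check_column: Optional[str] = None,
                       last_value: Optional[str] = None) -> List[str]:
    """Assemble a `sqoop import` argument list for one table or view."""
    args = ["sqoop", "import",
            "--connect", jdbc_url,
            "--table", f"{schema}.{table}",  # OWNER.TABLE form is an assumption
            "--target-dir", target_dir,
            "--num-mappers", str(num_mappers)]
    if split_by:
        args += ["--split-by", split_by]
    if direct:
        args.append("--direct")  # request the fast (direct) connector
    if incremental:
        if incremental not in ("append", "lastmodified"):
            raise ValueError("unknown incremental mode: " + incremental)
        if not check_column:
            raise ValueError("incremental imports need a check column")
        args += ["--incremental", incremental, "--check-column", check_column]
        if last_value is not None:
            args += ["--last-value", last_value]
    return args
```

Returning a list rather than a single string keeps quoting decisions with whichever component eventually runs the command.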

Solution current state

1. Client (Kate): in progress; meanwhile the REST interface can be used

2. REST API (Daniel): almost ready, missing the new job processing interface

3. Agent Scheduling (Ani): basically ready, can schedule jobs and update itself after job description modifications

4. Agent Runners (Daniel): working for initial imports, soon to be able to execute incremental updates; partially working SSH and REST monitoring

Solution future work

– Support more database connectors (SQLAlchemy/NoSQL)

– Support alternative runners like Oozie

– Prepare for Sqoop 2

– Integrate with Hive

– Resolve restrictions

– Release on GitHub with an Open Source license

Summary

– Easily expandable framework and service for transferring data from Oracle to Hadoop

– Designed with automation in mind: minimal administrator intervention needed

– Service built for easy usage: easy to use for the routine jobs


Workflow tools

– GitLab

– JIRA

– Slack

– Jenkins CI


Contributors

– Anirudha Bose
– Dániel Stein

– Antonio Romero Marin
– Domenico Giordano
– Kacper Surdy
– Katarzyna Maria Dziedziniewicz-Wójcik
– Manuel Martín Márquez
– Zbigniew Baranowski
