Post on 20-Feb-2020
Module: Data Ingestion On Sqoop
www.abyres.net(c) 2017 Abyres Enterprise Technologies Sdn Bhd
Objectives

● At the end of this lesson, students shall be able to:
– Understand what Sqoop is, its uses, and its strengths
– Understand how Sqoop ingests data into HDFS
– Understand Sqoop 'direct' mode functionality
– Understand how to implement full and incremental RDBMS ingestion using Sqoop
– Use the Sqoop CLI to ingest data from an RDBMS
Introduction To Sqoop
● Distributed data ingestion tool for extracting large RDBMS tables
● Distributes ingestion by assigning each mapper a different section/partition of the source data
● High-performance connectors to the source
– Sqoop provides highly optimized data extraction strategies for different RDBMSs:
● Oracle
● MySQL
● PostgreSQL
Sqoop Data Ingestion Architecture
[Diagram: the Sqoop CLI triggers a Sqoop AppMaster, which manages multiple Sqoop Mappers; each mapper reads a data block from the source through optimized connectors and writes a file into HDFS.]
Sqoop Commands
● Getting help
– sqoop help
● Ingesting from MySQL into Hive with ORC storage
– sqoop import \
    --connect jdbc:mysql://server/dbname \
    --table tablename \
    --hcatalog-database destdbname \
    --hcatalog-table desttablename \
    --create-hcatalog-table \
    --hcatalog-storage-stanza "stored as orc"
Sqoop ‘Direct’ Mode
● By default, the sqoop command uses a standard JDBC connection to ingest data.
● This can be quite slow for large data sources, because the source may take time to unpack and prepare data for transmission through JDBC.
● Sqoop provides a 'direct' mode which attempts to ingest data through a database-specific, more optimized ingestion strategy.
● To use direct mode, add the --direct option to your sqoop command.
● Note: Direct mode may have database-specific requirements before it can be used.
– Refer to https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_compatibility_notes for more information.
Implementing Incremental Ingestion With Sqoop
● Incremental ingestion of transactional / log-like data can be done through the incremental-append strategy, which requires an incremental identifier such as:
– an incremental running number, OR
– an entry creation timestamp
● Incremental ingestion of operational tables with updates can be done through the incremental-merge strategy, which requires the following two (2) fields:
– a unique identifier column
– a modification timestamp
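The incremental-append bookkeeping can be pictured with a small pure-Python sketch. The rows, the id column, and the starting last value here are hypothetical; Sqoop itself tracks this state through its --check-column and --last-value options.

```python
# Minimal sketch of incremental-append bookkeeping (hypothetical data):
# only rows whose incremental identifier exceeds the last ingested value
# are pulled, then the saved "last value" is advanced for the next run.

def incremental_append(source_rows, last_value, key='id'):
    """Return (new_rows, new_last_value) for an append-only source."""
    new_rows = [r for r in source_rows if r[key] > last_value]
    if new_rows:
        last_value = max(r[key] for r in new_rows)
    return new_rows, last_value

source = [{'id': 1, 'msg': 'a'}, {'id': 2, 'msg': 'b'}, {'id': 3, 'msg': 'c'}]
rows, last = incremental_append(source, last_value=1)
print(rows, last)
# only ids 2 and 3 are new; the saved last value becomes 3
```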
Full Ingestion
[Diagram: a workflow triggers a Sqoop full ingestion; the Sqoop CLI starts the mappers, which query all data from the source table; the source returns all data, and the mappers write it into a Hive table.]
Incremental-Append
[Diagram: a workflow triggers Sqoop with a filtered query (select * where incremental_field > last_incremental_value); the Sqoop CLI starts the mappers, which query the filtered data from the source table and write it into a staging Hive table; the workflow then triggers a Hive append from the staging table into the target Hive table, followed by a Hive drop of the staging table.]
Incremental-Merge
[Diagram: a workflow triggers Sqoop with a filtered query (select * where last_modified > last_ingest_date); the Sqoop CLI starts the mappers, which query the filtered data from the source table and write it into a staging Hive table; the workflow then triggers a Hive merge (taking the latest value based on the unique ID) from the staging table into the target Hive table, followed by a Hive drop of the staging table.]
Incremental Merge SQL
CREATE VIEW RECONCILE_VIEW AS
SELECT t2.* FROM
  (SELECT *,
          ROW_NUMBER() OVER (PARTITION BY UNIQUE_ID_COLUMN
                             ORDER BY LAST_MODIFIED_COLUMN DESC) hive_rn
   FROM
     (SELECT * FROM HIVE_TABLE
      WHERE LAST_MODIFIED_COLUMN <= ${LAST_MODIFIED_TIMESTAMP}
         OR LAST_MODIFIED_COLUMN IS NULL
      UNION ALL
      SELECT * FROM STAGING_TABLE
      WHERE LAST_MODIFIED_COLUMN > ${LAST_MODIFIED_TIMESTAMP}) t1) t2
WHERE t2.hive_rn = 1;
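The reconcile view's logic can be illustrated with a pure-Python analogue: union the base table with the staging table, then keep, per unique ID, only the row with the latest modification value (the ROW_NUMBER ... = 1 filter). The column names id and last_modified below are stand-ins for UNIQUE_ID_COLUMN and LAST_MODIFIED_COLUMN.

```python
# Pure-Python analogue of the RECONCILE_VIEW (illustrative only):
# for each unique id, keep the row with the highest last_modified value.

def reconcile(hive_rows, staging_rows, uid='id', last_mod='last_modified'):
    merged = {}
    for row in hive_rows + staging_rows:
        cur = merged.get(row[uid])
        if cur is None or row[last_mod] > cur[last_mod]:
            merged[row[uid]] = row       # newer row wins, like hive_rn = 1
    return sorted(merged.values(), key=lambda r: r[uid])

hive = [{'id': 1, 'val': 'old', 'last_modified': 10},
        {'id': 2, 'val': 'keep', 'last_modified': 11}]
staging = [{'id': 1, 'val': 'new', 'last_modified': 20}]
print(reconcile(hive, staging))
# id 1 takes the staging value; id 2 is unchanged
```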
Better ways to Incremental-Merge?
● With Hive ACID enabled, it is possible to merge through the ACID MERGE command:

MERGE INTO HIVE_TABLE AS T
USING STAGING_TABLE AS S
ON T.UNIQUE_ID_COLUMN = S.UNIQUE_ID_COLUMN
WHEN MATCHED AND S.DELETE_COLUMN IS NOT NULL THEN DELETE
WHEN MATCHED THEN UPDATE SET T.VAL1_COLUMN = S.VAL1_COLUMN, T.VAL2_COLUMN = S.VAL2_COLUMN
WHEN NOT MATCHED THEN INSERT VALUES (S.UNIQUE_ID_COLUMN, S.VAL1_COLUMN, S.VAL2_COLUMN, S.LASTMODIFIED_COLUMN, S.DELETE_TIMESTAMP_COLUMN);

(Note that the conditional DELETE clause must come before the unconditional UPDATE clause, otherwise it would never be reached.)

● With Hive ACID enabled, a Spark / Python program can also be written to upsert new values from the staging table by looping through the data.
● Alternatively, Change Data Capture / data replication solutions such as SymmetricDS can be made to work with Hive ACID tables.
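The upsert approach mentioned above can be sketched in plain Python, using a dict keyed by unique ID in place of a Hive ACID table. The column names (id, val, deleted) are hypothetical; a real Spark job would apply the same three cases (delete, update, insert) per staging row.

```python
# Hedged sketch of the loop-based upsert: rows carrying a delete marker
# are removed, matching ids are updated, and the rest are inserted.

def upsert(target, staging, uid='id', delete_col='deleted'):
    """Apply staging rows onto target (a dict keyed by the unique id)."""
    for row in staging:
        if row.get(delete_col):
            target.pop(row[uid], None)   # matched + delete marker -> DELETE
        else:
            target[row[uid]] = row       # matched -> UPDATE, else INSERT
    return target

target = {1: {'id': 1, 'val': 'a'}, 2: {'id': 2, 'val': 'b'}}
staging = [{'id': 2, 'val': 'b2'},        # update
           {'id': 3, 'val': 'c'},         # insert
           {'id': 1, 'deleted': True}]    # delete
print(upsert(target, staging))
# {2: {'id': 2, 'val': 'b2'}, 3: {'id': 3, 'val': 'c'}}
```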
LAB: INGESTING DATA WITH SQOOP
Module: Data Ingestion On NiFi
Objectives
● At the end of this lesson, students shall be able to:
– Understand the key components and concepts in a NiFi flow
– Understand the NiFi Expression Language
– Use NiFi to build a data ingestion flow
Introduction To NiFi
● Centralized, web-based data flow management tool for moving data from various sources to various destinations
● Over 200 processors for:
– Extracting data
– Filtering data
– Transforming data formats
– Loading (saving) data
● Highly configurable:
– Loss tolerant vs guaranteed delivery
– Low latency vs high throughput
– Dynamic prioritization
– Flows can be modified at runtime
– Back pressure
● Data Provenance:
– Track dataflow from beginning to end
NiFi Use Case
● What Apache NiFi is good at:
– Reliable and secure transfer of data between systems
– Delivery of data from sources to analytic platforms
– Enrichment and preparation of data:
● Conversion between formats
● Extraction/parsing
● Routing decisions
● What Apache NiFi shouldn't be used for:
– Distributed computation
– Complex event processing
– Joins, rolling windows, aggregate operations
Key Concept: FlowFile
● A FlowFile is basically the data itself.
● It consists of 2 components:
– Header attributes
– Content body
● The attributes store metadata about the received file.
● The content body stores the actual data itself.
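As a rough illustration, a FlowFile can be pictured as a small structure of attributes plus content. The attribute names filename and path are standard NiFi core attributes, but the values below are made up.

```python
# Illustrative model of a FlowFile: header attributes (metadata about the
# data) plus a content body (the raw bytes themselves).

flowfile = {
    'attributes': {              # header attributes: metadata only
        'filename': 'orders.csv',
        'path': '/incoming/',
    },
    'content': b'id,amount\n1,9.99\n',   # content body: the actual data
}

# Processors typically route on attributes (cheap) and only read or
# rewrite the content body when a transformation actually requires it.
print(flowfile['attributes']['filename'])
```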
Processors
● The actual component that does the work
● Generates FlowFiles, or receives FlowFiles and acts on them
● Can be parallelized and load balanced across nodes
● Right-click on a processor and select Configure to configure it
Processor Configuration : Settings
● Name
– Human-readable name for the processor
● ID
– UUID of the processor object; can be used in the NiFi REST API
● Automatically Terminate Relationships
– Check to terminate the output relationships, i.e. you are not going to configure an output connection for the relationship
Processor Configuration: Scheduling
● Scheduling Strategy
– Timer driven: periodic scheduling
– Cron driven: cron-like scheduling
● Concurrent Tasks
– Number of parallel threads this task will run as
● Run Schedule
– Scheduling interval for the tasks
Processor Configuration: Properties
● Processor-specific configuration
● Refer to the processor documentation to learn what each property is for, and how to use it
Connections
● Represents a data flow queue from one processor to another
● Right-clicking and selecting Configure will load the connection settings page
● Right-clicking and selecting List Queue will load the queue's FlowFile listing page
Connection Settings
● Name
– Human-readable name of the connection
● FlowFile Expiration
– How long a FlowFile may stay in the queue before it is deleted
● Back Pressure Object Threshold
– Maximum number of FlowFiles that will be queued in this queue
● Back Pressure Data Size Threshold
– Maximum total size of FlowFiles that can be queued in this queue
● Selected Prioritizers
– Queue priority algorithm
Connection Item List
● This view lists all FlowFiles queued in the connection
● Clicking the information icon in the leftmost column loads an information page for that specific FlowFile
FlowFile Details
FlowFile Content
● Clicking the "View" button in the FlowFile details view will load a page showing the content of the FlowFile
Process Group
● Processors and connections can be grouped together into a single unit called a Process Group
● Process Groups may have variables for configuring the Processors inside them through the NiFi Expression Language
● Input / Output ports may be created inside Process Groups to allow connections from outside the Process Group to flow into it
● A Remote Process Group is a connection to a separate NiFi cluster
Funnel
● The output of multiple connections can be merged into a single flow using a Funnel
● A Funnel can also be used to temporarily stage data while downstream processors are still being developed
Expression Language
● Certain NiFi Processor properties fields can be configured using the NiFi Expression Language (EL).
● In simple terms, EL is a simple text templating language for filling property fields with programmatically acquired values from FlowFile attributes or Process Group variables.
● Examples:
– ${now():format("yyyy/MM/dd")} ← returns the current date in a format like 2018/09/01
– ${filename:substring(0,1)} ← returns the first character of the value of the filename attribute / variable
● https://nifi.apache.org/docs/nifi-docs/html/expression-language-guide.html
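The two EL examples above behave roughly like this Python analogue (the filename value here is made up):

```python
# Python analogue of the NiFi EL examples above.
from datetime import datetime

# ${now():format("yyyy/MM/dd")} -> format the current date
formatted = datetime.now().strftime('%Y/%m/%d')

# ${filename:substring(0,1)} -> first character of the attribute value
filename = 'orders.csv'          # hypothetical FlowFile attribute value
first_char = filename[0:1]

print(formatted, first_char)
```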
LAB: INGESTING DATA WITH NiFi
Module: Data Transformation With Spark
Objectives
● At the end of this lesson, students shall be able to:
– Understand the difference between Hive and Spark
– Know the components of Apache Spark
– Understand the architecture of job submission for Spark programs on HDP
– Understand what an RDD is and how to manipulate it using Spark data manipulation / transformation functions
– Understand what a DataFrame is and how to manipulate it using the DataFrame API
– Understand how to register a DataFrame as a temporary in-memory table for querying with Spark SQL
– Understand how to register custom transformation functions (UDFs) and use them in Spark SQL
– Understand how to save a Spark temporary table as a Hive table
APACHE SPARK
● Apache Spark is a general-purpose in-memory data analytics tool that can run on Hadoop YARN.
● Spark consists of 5 key components:
– Spark Core
● The core of Spark; base RDD operations
– Spark SQL
● DataFrames and HiveQL support on Spark
– Spark MLlib
● Distributed machine learning algorithms on Spark
– Spark GraphX
● Graph computation engine
– Spark Streaming
● Stream processing engine using micro-batching
● The Spark API is supported in multiple languages:
– Java, Scala, Python, R & SQL
Hive vs Spark

● Hive:
– Primarily a SQL engine
– Option to use Tez (default), Spark, or MapReduce (old) as the execution engine
– Performance is tied to the execution engine chosen
– Tez, the default execution engine, provides comparable performance to Spark, if not better
● Data is loaded into memory based on need
– MapReduce, the older engine, is slower to process due to high I/O, but more robust in handling failures
– Requires Hadoop core components to function, as it runs on YARN
● Spark:
– General-purpose data processing tool, supporting:
● RDDs
● DataFrames
● SQL
● Graph processing
● Machine learning
● Stream processing
– Data is loaded into memory before processing
– Can run standalone, without running on Hadoop YARN
Spark Job Submission Architecture
[Diagram: a user or client tool reaches the cluster either through Knox (port 8443), which proxies to the Spark Livy Server (port 8998) and the Spark2 Livy Server (port 8999), or through a shell on an edge node using spark-submit; both paths launch a Spark on-demand cluster on YARN.]
Spark Job Submission Tools
● spark-submit command
– Command-line tool for submitting code to Spark
– The default method, well supported by all major Spark and Hadoop distributions
– Requires shell access to a node with spark-submit installed
● Livy Server
– REST API based code submission
– Newer tools may use Livy for job submission
– Allows REST-based security for controlling access to the Hadoop cluster
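As a sketch, a Livy batch-submission payload might look like the following. The field names follow the Livy batch REST API, but the host, file path, and arguments are hypothetical; verify the fields against your Livy version's documentation. No HTTP request is sent here, only the JSON body is built.

```python
import json

# Sketch of a Livy batch-submission payload, normally POSTed to
# http://livy-host:8998/batches with Content-Type: application/json.
payload = {
    'file': '/user/student/script.py',  # application to run (hypothetical path)
    'args': ['2018-01-01'],             # arguments passed to the script
    'name': 'my-ingestion-job',         # hypothetical job name
}
body = json.dumps(payload)

# A REST client (curl, requests, ...) would send `body` as the request body.
print(body)
```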
Introduction To RDD
● A Resilient Distributed Dataset (RDD) is an abstraction used in Spark for manipulating in-memory data.
● An RDD represents a copy of data in the memory of the Spark cluster which can be manipulated, transformed, analyzed, etc.
● An RDD can be created either by parallelizing data (uploading data) from the driver program, or by referencing an existing dataset in HDFS, HBase, or any filesystem offering a Hadoop InputFormat.
● An RDD can be manipulated using 2 types of functions:
– Transform functions
– Action functions
● Transform functions are used to transform the RDD and are lazy: no actual processing is executed until an action function is triggered.
● Action functions trigger RDD processing and are used to get the transformation results for further processing. When an action function completes, the results are transferred from the in-memory RDD to the driver program.
Spark DAG Processing Flow

[Diagram: the user runs spark-submit / a driver program; through the Spark HDFS client, file partitions are read from HDFS into in-memory RDDs, and each partition is processed by its own task.]
Writing RDD Spark Programs With PySpark
Spark RDD Transformation Functions
● Transformation functions are used to modify RDDs into different structures.
● Transformation functions are 'lazy', i.e. only the DAG definition is added; no data is processed until an action function is triggered.
● Data manipulation is done through manipulating key-value tuples.
● Common functions:
– map(func) - Returns a new distributed dataset formed by passing each element of the source through a function func.
– filter(func) - Returns a new dataset formed by selecting those elements of the source on which func returns true.
– flatMap(func) - Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).
– distinct() - Returns a new dataset that contains the distinct elements of the source dataset.
– groupByKey() - When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs.
– reduceByKey(func) - When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V,V) => V.
– sortByKey([ascending]) - When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified in the boolean ascending argument.
– join(otherDataset) - When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. Outer joins are supported through leftOuterJoin, rightOuterJoin, and fullOuterJoin.
● Full list of functions: https://spark.apache.org/docs/latest/rdd-programming-guide.html#transformations
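To build intuition, the key-value transformations above can be mimicked on a plain Python list. This is only an analogue: on a cluster the same logic runs distributed across partitions.

```python
# Plain-Python analogues of groupByKey() and reduceByKey(), applied to
# the same (key, value) pairs Spark would shuffle across the cluster.
from itertools import groupby

pairs = [('h', 1), ('e', 1), ('l', 1), ('l', 1), ('o', 1)]

# groupByKey(): (K, V) pairs -> (K, [V, ...]); requires grouping by key
grouped = {k: [v for _, v in g]
           for k, g in groupby(sorted(pairs), key=lambda kv: kv[0])}

# reduceByKey(lambda a, b: a + b): aggregate the values for each key
reduced = {}
for k, v in pairs:
    reduced[k] = reduced.get(k, 0) + v

print(grouped)
print(reduced)  # {'h': 1, 'e': 1, 'l': 2, 'o': 1}
```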
Spark RDD Action Functions
● Action functions are used to get results out of RDDs.
● An action triggers the data transformation DAG defined on the RDD.
● Common RDD actions are:
– reduce(func) - Aggregates the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel.
– collect() - Returns all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.
– count() - Returns the number of elements in the dataset.
– first() - Returns the first element of the dataset (similar to take(1)).
– take(n) - Returns an array with the first n elements of the dataset.
● Full list of functions: https://spark.apache.org/docs/latest/rdd-programming-guide.html#actions
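The common actions above map onto familiar Python operations. This analogue uses a plain list where Spark would use a distributed RDD:

```python
# Plain-Python analogues of common RDD actions.
from functools import reduce

data = [1, 2, 3, 4, 5]

total = reduce(lambda a, b: a + b, data)  # rdd.reduce(lambda a, b: a + b)
collected = list(data)                    # rdd.collect()
n = len(data)                             # rdd.count()
first = data[0]                           # rdd.first()
first_three = data[:3]                    # rdd.take(3)

print(total, n, first, first_three)  # 15 5 1 [1, 2, 3]
```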
Initializing Spark Context
● Before any operations can be done using PySpark, you need to initialize a Spark context in your driver program.
– This is only necessary for jobs submitted through spark-submit or Livy; it is not needed in the interactive shell, which initializes the SparkContext itself.
from pyspark import SparkContext, SparkConf

appName = 'MyApp'
master = 'yarn'  # or e.g. 'local[*]' for local testing
conf = SparkConf().setAppName(appName).setMaster(master)
sc = SparkContext(conf=conf)
Parallelizing data from driver program to the cluster
● The driver program can read data from local files and load it into the Spark cluster memory. To do this, the .parallelize() function is called with the dataset.
Parallelizing dataset
data = [1,2,3,4,5,6,7,8,9,0]
rdd = sc.parallelize(data)
Reading Data From HDFS
● Spark supports reading various formats from HDFS, such as text files, SequenceFiles, Avro, ORC, and Parquet. Pure RDD operations, however, are meant to work with schema-less formats like text files and SequenceFiles.
Reading text file from HDFS
rdd1 = sc.textFile('/path/to/textfile')     # reading a single text file
rdd2 = sc.textFile('/path/to/folder/*')     # reading multiple files
rdd3 = sc.textFile('/path/to/folder/*.gz')  # reading compressed text files
Reading SequenceFiles from HDFS
rdd1 = sc.sequenceFile('/path/to/textfile')  # reading a single file
rdd2 = sc.sequenceFile('/path/to/folder/*')  # reading multiple files
Processing RDD
Transforming RDD (style #1)
rdd = sc.parallelize(['hello'])
rdd = rdd.flatMap(lambda x: list(x))
# ['h', 'e', 'l', 'l', 'o']
rdd = rdd.map(lambda x: (x, 1))
# [('h',1),('e',1),('l',1),('l',1),('o',1)]
rdd = rdd.reduceByKey(lambda a, b: a + b)
# [('h',1),('e',1),('l',2),('o',1)]
print(rdd.collect())
Transforming RDD (style #2)
rdd = sc.parallelize(['hello'])
rdd = (rdd.flatMap(lambda x: list(x))
          .map(lambda x: (x, 1))
          .reduceByKey(lambda a, b: a + b))
# [('h',1),('e',1),('l',2),('o',1)]
print(rdd.collect())
● Once data has been loaded into an RDD, we can begin to manipulate it using transformation and action functions.
● All transformation functions return a new RDD with the function added to its DAG, so you can chain RDD transform functions together to create a more complex DAG.
Submitting Job To Cluster
● To submit your Spark program to the cluster, run the spark-submit command
Submitting a PySpark program through spark-submit using Spark 2

export SPARK_MAJOR_VERSION=2
spark-submit --master yarn --deploy-mode client script.py
Lab: Spark RDD Programming