

Transcript of 01 introductiontobigdata 20161001

Big Data

Introduction to Big Data, Hadoop and Spark

Agenda

• About speaker and itversity
• Categorization of enterprise applications
• Typical data integration architecture
• Challenges with conventional technologies
• Big Data eco system
• Data Storage - Distributed file systems and databases
• Data Processing - Distributed computing frameworks
• Data Ingestion - Open source tools
• Data Visualization - BI tools as well as custom
• Role of Apache and other companies
• Different certifications in Big Data
• Resources for deep dive into Big Data
• Job roles in Big Data

About me

• IT Professional with 13+ years of experience in a vast array of technologies including Oracle, Java/J2EE, Data Warehousing, ETL, BI as well as Big Data

• Website: http://www.itversity.com
• YouTube: https://www.youtube.com/c/itversityin
• LinkedIn Profile: https://in.linkedin.com/in/durga0gadiraju
• Facebook: https://www.facebook.com/itversity
• Twitter: https://twitter.com/itversity
• Google Plus: https://plus.google.com/+itversityin
• Github: https://github.com/dgadiraju
• Meetup:
  – https://www.meetup.com/Hyderabad-Technology-Meetup/ (local Hyderabad meetup)
  – https://www.meetup.com/itversityin/ (typically online)

Categorization of Enterprise applications


Traditionally enterprise applications can be broadly categorized into Operational and Decision support systems.

Lately, a new set of applications such as Customer Analytics is gaining momentum (e.g., a YouTube channel serving different categories of users).


Enterprise Applications

• Most of the traditional applications are considered monoliths (n-tier architecture)
  – Monoliths are typically built on RDBMS databases such as Oracle
• Modern applications use micro services and are considered to be polyglot
  – Now we have a choice of different types of databases (we will see later)

Enterprise applications

Take the example use case (eCommerce platform):
• Operational
  – Transactional – check out
  – Not transactional – recommendation engine
• Decision support – sales trends
• Customer analytics – categories in which customers have spent money
* Data from transactional systems needs to be integrated into non-transactional, decision support, customer analytics systems etc.

Data Integration

• Data integration can be categorized into
  – Real time
  – Batch
• Traditionally, when there were no customer analytics and recommendation engines, we used to have
  – ODS (compliance and single source of truth)
  – EDW (facts and dimensions to support reports)
  – ETL (to perform transformations and load data into EDW)
  – BI (to visualize and publish reports)

[Diagram: Data Integration (Current Architecture) – sources (OLTP, closed mainframes, XML/external apps) flow through data integration (ETL/real time) into target systems (ODS, EDW), which feed visualization/reporting and decision support.]

Data Integration - Technologies

• Batch ETL – Informatica, DataStage etc.
• Real time data integration – GoldenGate, Shareplex etc.
• ODS – Oracle, MySQL etc.
• EDW/MPP – Teradata, Greenplum etc.
• BI Tools – Cognos, Business Objects etc.

Current Scenario - Challenges

• Almost all operational systems are using relational databases (RDBMS like Oracle)
  – RDBMS are originally designed for operational and transactional workloads
    • Transactions
    • Data integrity
• Not linearly scalable
• Expensive
• Predefined schema
• Data processing does not happen where data is stored (storage layer)
  – Some processing happens at the database server level (SQL)
  – Some processing happens at the application server level (Java/.NET)
  – Some processing happens at the client/browser level (JavaScript)
• Almost all Data Warehouse appliances are expensive and not very flexible for customer analytics and recommendation engines

Evolution of Databases

Now we have many choices of databases:
• Relational databases (Oracle, Informix, Sybase, MySQL etc.)
• Data warehouse and MPP appliances (Teradata, Greenplum etc.)
• NoSQL databases (Cassandra, HBase, MongoDB etc.)
• In-memory databases (Gemfire, Coherence etc.)
• Search based databases (Elasticsearch, Solr etc.)
• Batch processing frameworks (Map Reduce, Spark etc.)
• Graph databases (Neo4j)
* Modern applications need to be polyglot (different modules need different categories of databases)

Big Data eco system – History

• Started with the Google search engine
• Google's use case is different from that of typical enterprises
  – Crawl web pages
  – Index based on key words
  – Return search results
• As conventional database technologies did not scale, they implemented
  – GFS (distributed file system)
  – Google Map Reduce (distributed processing engine)
  – Google Big Table (distributed indexed table)

Big Data eco system - myths

• Big Data is Hadoop
• Big Data eco system can only solve problems with very large data sets
• Big Data is cheap
• Big Data provides a variety of tools and can solve problems quickly
• Big Data is a technology

Big Data eco system – Characteristics

• Distributed storage
  – Fault tolerance (RAID is replaced by replication)
• Distributed computing/processing
  – Data locality (code goes to data)
• Scalability (almost linear)
• Low cost hardware (commodity)
• Low licensing costs
* Low cost hardware and software does not mean that Big Data is cheap for enterprises

Big Data eco system

The Big Data eco system of tools can be categorized into:
• Distributed file systems
  – HDFS
  – Cloud storage (S3/Azure Blob)
• Distributed processing engines
  – Map Reduce
  – Spark
• Distributed databases (operational)
  – NoSQL databases (HBase, Cassandra)
  – Search databases (Elasticsearch)

Big Data eco system - Evolution

After successfully building its search engine on new technologies, Google published white papers on:
• Distributed file system – GFS
• Distributed processing engine – Map Reduce
• Distributed database – Big Table
* Development of Big Data technologies such as Hadoop started with these white papers

Data Storage

• Distributed file systems
  – HDFS
  – Cloud storage
• Distributed databases
  – Cassandra
  – HBase
  – MongoDB
  – Solr

Data processing

• I/O based
  – Map Reduce
  – Hive and Pig are wrappers on top of Map Reduce
• In memory
  – Spark
  – Spark Data Frames are a wrapper on top of core Spark

Data Ingestion

Data ingestion strategies are defined by the sources from which data is pulled and the sinks where data is stored.

• Sources
  – Relational databases
  – Non relational databases
  – Streaming web logs
  – Flat files
• Sinks
  – HDFS
  – Relational or non relational databases
  – Data processing frameworks

Data Processing

As part of data processing we typically focus on transformations such as:
• Aggregations
• Joins
• Sorting
• Ranking
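As a concrete illustration, here is a minimal PySpark sketch of these transformations. The input paths and column names (customer_id, category, amount) are hypothetical placeholders, not from the original material:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("transformations-demo").getOrCreate()

# Hypothetical inputs: one row per purchase, plus a customer dimension.
purchases = spark.read.csv("hdfs:///data/purchases", header=True, inferSchema=True)
customers = spark.read.csv("hdfs:///data/customers", header=True, inferSchema=True)

# Aggregation: total spend per customer and category.
spend = (purchases
         .groupBy("customer_id", "category")
         .agg(F.sum("amount").alias("total_spend")))

# Join and sorting: attach customer attributes, order by spend.
enriched = spend.join(customers, on="customer_id").orderBy(F.desc("total_spend"))

# Ranking: rank categories within each customer by spend.
w = Window.partitionBy("customer_id").orderBy(F.desc("total_spend"))
ranked = enriched.withColumn("category_rank", F.rank().over(w))

ranked.show()
```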

Data Analysis or Visualization

Processed data is analyzed or visualized using:
• BI Tools
• Custom visualization frameworks (d3js)
• Ad hoc query tools

Let us recollect how data integration is typically done in enterprises (eg: Data Warehousing)

[Diagram repeated: Data Integration (Current Architecture) – sources (OLTP, closed mainframes, XML/external apps) flow through data integration (ETL/real time) into target systems (ODS, EDW), which feed visualization/reporting and decision support.]

Use Case – EDW(Current Architecture)

• Enterprise Data Warehouse is built for enterprise reporting for a selected audience in executive management; hence the user base who view the reports will typically be in the tens or hundreds

• Data Integration
  – ODS (Operational Data Store)
    • Sources – Disparate
    • Real time – Tools/custom (GoldenGate, Shareplex etc.)
    • Batch – Tools/custom
    • Uses – Compliance, data lineage, reports etc.
  – Enterprise Data Warehouse
    • Sources – ODS or other sources
    • ETL – Tools/custom (Informatica, Ab Initio, Talend)
• Reporting/Visualization
  – ODS (compliance related reporting)
  – Enterprise Data Warehouse
  – Tools (Cognos, Business Objects, MicroStrategy, Tableau etc.)

Now we will see how data integration can be done using Big Data eco system

Data Ingestion

• Apache Sqoop to get data from relational databases into Hadoop

• Apache Flume or Kafka to get data from streaming logs

• If some processing needs to be done before loading into databases or HDFS, data is processed through streaming technologies such as Flink, Storm etc.
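For example, a minimal sketch of streaming ingestion using Spark Structured Streaming (one possible alternative to Flume/Flink/Storm for this step). The broker address, topic name and HDFS paths are hypothetical, and the spark-sql-kafka connector is assumed to be available to the job:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("log-ingestion").getOrCreate()

# Read a stream of web log events from Kafka.
# Broker, topic and paths below are placeholders.
raw_logs = (spark.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", "broker1:9092")
            .option("subscribe", "web_logs")
            .load())

# Kafka delivers key/value as binary; cast the payload to string before storing.
events = raw_logs.selectExpr("CAST(value AS STRING) AS line", "timestamp")

# Land the stream on HDFS as Parquet files.
query = (events.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/raw/web_logs")
         .option("checkpointLocation", "hdfs:///checkpoints/web_logs")
         .start())

query.awaitTermination()
```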

Data Processing

• There are 2 engines to apply transformation rules at scale
  – Map Reduce (uses I/O)
    • Hive is the most popular Map Reduce based tool
    • Map Reduce works well to process huge amounts of data in a few hours
  – Spark (in memory)
* We will see the disadvantages of Map Reduce and why Spark, with new programming languages such as Scala and Python, is gaining momentum

Disadvantages of Map Reduce

• Disadvantages of Map Reduce based solutions
  – Designed for batch; not meant for interactive and ad hoc reporting
  – I/O bound, and processing of micro batches can be an issue
  – Too many tools/technologies (Map Reduce, Hive, Pig, Sqoop, Flume etc.) to build applications
  – Not suitable for enterprise hardware where storage is typically network mounted

Apache Spark

• Spark can work with any file system including HDFS
• Processing is done in memory – hence I/O is minimized
• Suitable for ad hoc or interactive querying or reporting
• Streaming jobs can be done much faster than with Map Reduce
• Applications can be developed using Scala, Python, Java etc.
• Choose one programming language and perform
  – Data integration from RDBMS using JDBC (no need of Sqoop)
  – Stream data using Spark Streaming
  – Leverage Data Frames and SQL embedded in the programming language
  – As processing is done in memory, Spark works well with enterprise hardware with network file systems
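To illustrate the JDBC and embedded SQL points above, here is a hedged PySpark sketch. The connection URL, credentials, table and column names are hypothetical, and the relevant JDBC driver is assumed to be on the classpath:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdbms-integration").getOrCreate()

# Pull a table straight from an RDBMS over JDBC (no Sqoop needed).
# URL, table and credentials are placeholders.
orders = (spark.read
          .format("jdbc")
          .option("url", "jdbc:mysql://dbhost:3306/retail")
          .option("dbtable", "orders")
          .option("user", "retail_user")
          .option("password", "secret")
          .load())

# SQL embedded in the same program via a temporary view.
orders.createOrReplaceTempView("orders")
daily_revenue = spark.sql("""
    SELECT order_date, SUM(order_amount) AS revenue
    FROM orders
    GROUP BY order_date
    ORDER BY order_date
""")
daily_revenue.show()
```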

Data Integration(Big Data eco system)

[Diagram: Data Integration (Big Data eco system) – the same sources (OLTP, closed mainframes, XML/external apps) are loaded real time/batch (no ETL) into a multi-node Big Data cluster acting as EDW/ODS, with an optional reporting database feeding visualization/reporting and decision support.]

Big Data eco system

[Diagram: Big Data Technologies – Hadoop core components (HDFS, Map Reduce) surrounded by the Hadoop eco system (Hive, Pig, Sqoop, Oozie, Mahout) and non Map Reduce technologies: ad-hoc querying tools (Impala, Presto), NoSQL, data ingestion (Kafka, Flume), streaming analytics (Storm, Flink) and in-memory processing (Spark).]

Big Data eco system

[Diagram: Big Data Technologies mapped to roles – Hadoop core components: HDFS (distributed file system) and Map Reduce; Hadoop eco system: Hive (T and L, batch reporting), Sqoop (E and L), Oozie (workflows), custom Map Reduce (E, T and L); non Map Reduce: Impala (interactive/ad hoc reporting), HBase (real time data integration or reporting).]

Role of Apache

• Each of these are separate projects incubated under Apache
  – HDFS and MapReduce/YARN
  – Hive
  – Pig
  – Sqoop
  – HBase
  – etc.

Installation (plain vanilla)

• In plain vanilla mode, depending upon the architecture, each tool/technology needs to be manually downloaded, installed and configured
• Typically people use Puppet or Chef to set up clusters using plain vanilla tools
• Advantages
  – You can set up your cluster with the latest versions from Apache directly
• Disadvantages
  – Installation is tedious and error prone
  – Need to integrate with monitoring tools

Hadoop Distributions

• Different vendors pre-package the Apache suite of Big Data tools into their distributions to facilitate
  – Easier installation/upgrades using wizards
  – Better monitoring
  – Easier maintenance
  – and many more
• Leading distributions include, but are not limited to
  – Cloudera
  – Hortonworks
  – MapR
  – AWS EMR
  – IBM BigInsights
  – and many more

Hadoop Distributions

[Diagram: Hadoop Distributions – core Apache Foundation projects (HDFS/YARN/MR, Hive, Pig, Sqoop, Flume, Spark, HBase, Zookeeper) packaged by vendor distributions (Cloudera, Hortonworks, MapR, AWS), each adding distribution-specific components such as Impala, Tez or Ganglia.]

Certifications

• Why certify?
  – To promote skills
  – Demonstrate industry recognized validation for your expertise
  – Meet global standards required to ensure compatibility between Spark and Hadoop
  – Stay up to date with the latest advances in Big Data technologies such as Spark and Hadoop
• Take certifications only from vendors like
  – Cloudera
  – Hortonworks
  – MapR
  – Databricks (O'Reilly)
• More details:
  – http://www.itversity.com/2016/07/05/hadoop-and-spark-developer-certifications-faqs/
  – http://www.itversity.com/2016/07/02/hadoop-certifications/

Resources

• www.YouTube.com/itversityin
• www.itversity.com

Job Roles

• Go through this blog - http://www.itversity.com/2016/07/02/hadoop-certifications/