01 introductiontobigdata 20161001
Uploaded by durga-gadiraju (Category: Education)

Transcript of 01 introductiontobigdata 20161001
Agenda
• About speaker and itversity
• Categorization of enterprise applications
• Typical data integration architecture
• Challenges with conventional technologies
• Big Data eco system
• Data Storage - distributed file systems and databases
• Data Processing - distributed computing frameworks
• Data Ingestion - open source tools
• Data Visualization - BI tools as well as custom
• Role of Apache and other companies
• Different certifications in Big Data
• Resources for a deep dive into Big Data
• Job roles in Big Data
About me
• IT professional with 13+ years of experience in a vast array of technologies including Oracle, Java/J2EE, Data Warehousing, ETL, BI as well as Big Data
• Website: http://www.itversity.com
• YouTube: https://www.youtube.com/c/itversityin
• LinkedIn Profile: https://in.linkedin.com/in/durga0gadiraju
• Facebook: https://www.facebook.com/itversity
• Twitter: https://twitter.com/itversity
• Google Plus: https://plus.google.com/+itversityin
• Github: https://github.com/dgadiraju
• Meetup:
  – https://www.meetup.com/Hyderabad-Technology-Meetup/ (local Hyderabad meetup)
  – https://www.meetup.com/itversityin/ (typically online)
Categorization of Enterprise Applications

[Diagram: Enterprise applications branch into Operational, Decision Support, and Customer Analytics]

Traditionally, enterprise applications can be broadly categorized into Operational and Decision Support systems. Lately a new set of applications, such as Customer Analytics, is gaining momentum (e.g., a YouTube channel serving different categories of users).

Enterprise Applications
• Most traditional applications are considered monoliths (n-tier architecture)
  – Monoliths are typically built on RDBMS databases such as Oracle
• Modern applications use microservices and are considered polyglot
  – Now we have a choice of different types of databases (we will see later)
Enterprise Applications
Take an example use case (an eCommerce platform):
• Operational
  – Transactional - checkout
  – Not transactional - recommendation engine
• Decision support - sales trends
• Customer analytics - categories in which customers have spent money
* Data from transactional systems needs to be integrated into non-transactional, decision support, customer analytics and other systems
Data Integration
• Data integration can be categorized into
  – Real time
  – Batch
• Traditionally, when there were no customer analytics or recommendation engines, we used to have
  – ODS (compliance and single source of truth)
  – EDW (facts and dimensions to support reports)
  – ETL (to perform transformations and load data into the EDW)
  – BI (to visualize and publish reports)
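The ODS/EDW/ETL/BI flow above can be sketched in a few lines of Python, using the built-in sqlite3 module as a stand-in warehouse. This is a toy illustration, not a real ETL pipeline; the table names and rows are made up for the example.

```python
import sqlite3

# In-memory SQLite stands in for the EDW; tables and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT)")
conn.execute("CREATE TABLE fact_sales (product_id INTEGER, amount REAL)")

# "Extract" - rows as they might arrive from an operational source (the ODS)
source_rows = [(1, "books", 20.0), (2, "games", 35.0), (1, "books", 15.0)]

# "Transform + Load" - split each row into a dimension entry and a fact entry
for product_id, category, amount in source_rows:
    conn.execute("INSERT OR IGNORE INTO dim_product VALUES (?, ?)", (product_id, category))
    conn.execute("INSERT INTO fact_sales VALUES (?, ?)", (product_id, amount))

# "BI" - a report query joining the fact table with the dimension table
report = conn.execute(
    """SELECT d.category, SUM(f.amount)
       FROM fact_sales f JOIN dim_product d USING (product_id)
       GROUP BY d.category ORDER BY d.category"""
).fetchall()
print(report)  # [('books', 35.0), ('games', 35.0)]
```

Real ETL tools (Informatica, DataStage etc.) do the same extract/transform/load work at scale, with scheduling, lineage and error handling on top.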
[Diagram: Data Integration (Current Architecture) - sources (OLTP, closed mainframes, XML/external apps) flow through data integration (ETL/real time) into a target system (e.g., ODS/EDW), which feeds visualization/reporting for decision support]
Data Integration - Technologies
• Batch ETL - Informatica, DataStage etc.
• Real time data integration - GoldenGate, SharePlex etc.
• ODS - Oracle, MySQL etc.
• EDW/MPP - Teradata, Greenplum etc.
• BI tools - Cognos, Business Objects etc.
Current Scenario - Challenges
• Almost all operational systems use relational databases (RDBMS such as Oracle)
  – RDBMS were originally designed for operational and transactional workloads
    • Transactions
    • Data integrity
• Not linearly scalable
• Expensive
• Predefined schema
• Data processing does not happen where data is stored (the storage layer)
  – Some processing happens at the database server level (SQL)
  – Some processing happens at the application server level (Java/.NET)
  – Some processing happens at the client/browser level (JavaScript)
• Almost all data warehouse appliances are expensive and not very flexible for customer analytics and recommendation engines
Evolution of Databases
Now we have many choices of databases:
• Relational databases (Oracle, Informix, Sybase, MySQL etc.)
• Data warehouse and MPP appliances (Teradata, Greenplum etc.)
• NoSQL databases (Cassandra, HBase, MongoDB etc.)
• In-memory databases (GemFire, Coherence etc.)
• Search-based databases (Elasticsearch, Solr etc.)
• Batch processing frameworks (Map Reduce, Spark etc.)
• Graph databases (Neo4j)
* Modern applications need to be polyglot (different modules need different categories of databases)
Big Data eco system - History
• Started with the Google search engine
• Google's use case is different from that of typical enterprises
  – Crawl web pages
  – Index based on key words
  – Return search results
• As conventional database technologies do not scale, they implemented
  – GFS (distributed file system)
  – Google Map Reduce (distributed processing engine)
  – Google Big Table (distributed indexed table)
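The "index based on key words, return search results" idea above can be sketched with a minimal inverted index in plain Python. The pages and their contents are made up for illustration; a real search engine builds this structure across billions of pages on a distributed file system.

```python
# A toy inverted index: map each word to the set of pages containing it.
pages = {
    "page1": "big data processing at scale",
    "page2": "distributed file system design",
    "page3": "distributed data processing",
}

index = {}
for url, text in pages.items():
    for word in text.split():
        index.setdefault(word, set()).add(url)

def search(word):
    """Return the pages containing the given key word."""
    return sorted(index.get(word, set()))

print(search("distributed"))  # ['page2', 'page3']
print(search("data"))         # ['page1', 'page3']
```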
Big Data eco system - Myths
• Big Data is Hadoop
• The Big Data eco system can only solve problems with very large data sets
• Big Data is cheap
• Big Data provides a variety of tools and can solve problems quickly
• Big Data is a technology
Big Data eco system - Characteristics
• Distributed storage
  – Fault tolerance (RAID is replaced by replication)
• Distributed computing/processing
  – Data locality (code goes to data)
• Scalability (almost linear)
• Low cost hardware (commodity)
• Low licensing costs
* Low cost hardware and software does not mean that Big Data is cheap for enterprises
Big Data eco system
The Big Data eco system of tools can be categorized into:
• Distributed file systems
  – HDFS
  – Cloud storage (S3/Azure Blob)
• Distributed processing engines
  – Map Reduce
  – Spark
• Distributed databases (operational)
  – NoSQL databases (HBase, Cassandra)
  – Search databases (Elasticsearch)
Big Data eco system - Evolution
After successfully building its search engine on new technologies, Google published white papers on:
• Distributed file system - GFS
• Distributed processing engine - Map Reduce
• Distributed database - Big Table
* Development of Big Data technologies such as Hadoop started with these white papers
Data Storage
• Distributed file systems
  – HDFS
  – Cloud storage
• Distributed databases
  – Cassandra
  – HBase
  – MongoDB
  – Solr
Data Processing
• I/O based
  – Map Reduce
  – Hive and Pig are wrappers on top of Map Reduce
• In memory
  – Spark
  – Spark Data Frames are a wrapper on top of core Spark
Data Ingestion
Data ingestion strategies are defined by the sources from which data is pulled and the sinks where data is stored.
• Sources
  – Relational databases
  – Non-relational databases
  – Streaming web logs
  – Flat files
• Sinks
  – HDFS
  – Relational or non-relational databases
  – Data processing frameworks
Data Processing
As part of data processing we typically focus on transformations such as:
• Aggregations
• Joins
• Sorting
• Ranking
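The four transformations listed above can be shown on a tiny in-memory data set in plain Python. The order records and customer dimension here are hypothetical; distributed engines apply the same operations across many machines.

```python
from collections import defaultdict

# Hypothetical order records: (customer, category, amount)
orders = [("amy", "books", 20), ("bob", "games", 35), ("amy", "games", 10)]
# Hypothetical customer dimension: customer -> country
customers = {"amy": "IN", "bob": "US"}

# Aggregation: total spend per customer
totals = defaultdict(int)
for customer, _category, amount in orders:
    totals[customer] += amount

# Join: attach each customer's country to their total
joined = [(customer, customers[customer], total) for customer, total in totals.items()]

# Sorting + ranking: order by spend, highest first, and assign a rank
ranked = [(rank, row) for rank, row in enumerate(sorted(joined, key=lambda r: -r[2]), start=1)]
print(ranked)  # [(1, ('bob', 'US', 35)), (2, ('amy', 'IN', 30))]
```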
Data Analysis or Visualization
Processed data is analyzed or visualized using:
• BI tools
• Custom visualization frameworks (d3.js)
• Ad hoc query tools
Use Case - EDW (Current Architecture)
• An Enterprise Data Warehouse is built for enterprise reporting for a selected audience in executive management, hence the user base that views the reports is typically in the tens or hundreds
• Data Integration
  – ODS (Operational Data Store)
    • Sources - disparate
    • Real time - tools/custom (GoldenGate, SharePlex etc.)
    • Batch - tools/custom
    • Uses - compliance, data lineage, reports etc.
  – Enterprise Data Warehouse
    • Sources - ODS or other sources
    • ETL - tools/custom (Informatica, Ab Initio, Talend)
• Reporting/Visualization
  – ODS (compliance-related reporting)
  – Enterprise Data Warehouse
  – Tools (Cognos, Business Objects, MicroStrategy, Tableau etc.)
Data Ingestion
• Apache Sqoop to get data from relational databases into Hadoop
• Apache Flume or Kafka to get data from streaming logs
• If some processing needs to be done before loading into databases or HDFS, data is processed through streaming technologies such as Flink, Storm etc.
Data Processing
• There are two engines to apply transformation rules at scale
  – Map Reduce (uses I/O)
    • Hive is the most popular Map Reduce based tool
    • Map Reduce works well to process huge amounts of data in a few hours
  – Spark (in memory)
* We will see the disadvantages of Map Reduce and why Spark, with newer programming languages such as Scala and Python, is gaining momentum
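The Map Reduce model mentioned above can be sketched as a word count in plain Python, with the three phases made explicit. This is only a single-process illustration with made-up input lines; real Map Reduce runs the same phases across many nodes, spilling intermediate results to disk (which is why it is I/O bound).

```python
from collections import defaultdict

# Input lines, as if read from files on a distributed file system
lines = ["big data", "big data eco system", "data processing"]

# Map phase: emit a (word, 1) pair for every word
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle phase: group all values by key
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: sum the counts for each word
counts = {word: sum(values) for word, values in grouped.items()}
print(counts)  # {'big': 2, 'data': 3, 'eco': 1, 'system': 1, 'processing': 1}
```

In Hadoop, the map and reduce phases run as separate tasks and the shuffle moves data over the network between them; Hive generates this kind of job from SQL.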
Disadvantages of Map Reduce
• Disadvantages of Map Reduce based solutions
  – Designed for batch; not meant for interactive and ad hoc reporting
  – I/O bound, and processing of micro batches can be an issue
  – Too many tools/technologies (Map Reduce, Hive, Pig, Sqoop, Flume etc.) to build applications
  – Not suitable for enterprise hardware where storage is typically network mounted
Apache Spark
• Spark can work with any file system, including HDFS
• Processing is done in memory, hence I/O is minimized
• Suitable for ad hoc or interactive querying or reporting
• Streaming jobs can be done much faster than with Map Reduce
• Applications can be developed using Scala, Python, Java etc.
• Choose one programming language and perform
  – Data integration from RDBMS using JDBC (no need for Sqoop)
  – Stream data using Spark Streaming
  – Leverage data frames and SQL embedded in the programming language
  – As processing is done in memory, Spark works well with enterprise hardware with network file systems
[Diagram: Data Integration (Big Data eco system) - sources (OLTP, closed mainframes, XML/external apps) feed a multi-node Big Data cluster (EDW/ODS) via real time/batch ingestion (no ETL), with an optional reporting database serving visualization/reporting for decision support]
[Diagram: Big Data eco system - Hadoop core components (Distributed File System (HDFS), Map Reduce) surrounded by the Hadoop eco system (Hive, Pig, Sqoop, Oozie, Mahout), plus non Map Reduce Big Data technologies: ad hoc querying tools (Impala, Presto), NoSQL, data ingestion (Kafka, Flume), streaming analytics (Storm, Flink) and in-memory processing (Spark)]
[Diagram: Big Data eco system, with roles - Hadoop core components: Distributed File System (HDFS) and Map Reduce; Hadoop eco system: Hive (T and L, batch reporting), Sqoop (E and L), Oozie (workflows), custom Map Reduce (E, T and L); non Map Reduce Big Data technologies: Impala (interactive/ad hoc reporting), HBase (real time data integration or reporting)]
Role of Apache
• Each of these is a separate project incubated under Apache
  – HDFS and MapReduce/YARN
  – Hive
  – Pig
  – Sqoop
  – HBase
  – etc.
Installation (plain vanilla)
• In plain vanilla mode, depending upon the architecture, each tool/technology needs to be manually downloaded, installed and configured
• Typically people use Puppet or Chef to set up clusters using plain vanilla tools
• Advantages
  – You can set up your cluster with the latest versions from Apache directly
• Disadvantages
  – Installation is tedious and error prone
  – Need to integrate with monitoring tools
Hadoop Distributions
• Different vendors pre-package the Apache suite of big data tools into their distributions to facilitate
  – Easier installation/upgrades using wizards
  – Better monitoring
  – Easier maintenance
  – and many more
• Leading distributions include, but are not limited to
  – Cloudera
  – Hortonworks
  – MapR
  – AWS EMR
  – IBM BigInsights
  – and many more
[Diagram: Hadoop distributions matrix - Apache Foundation components (HDFS/YARN/MR, Hive, Pig, Sqoop, Flume, HBase, Zookeeper, Tez, Spark, Impala, Ganglia) as packaged by Cloudera, Hortonworks, MapR and AWS]
Certifications
• Why certify?
  – To promote skills
  – Demonstrate industry-recognized validation of your expertise
  – Meet global standards required to ensure compatibility between Spark and Hadoop
  – Stay up to date with the latest advances in Big Data technologies such as Spark and Hadoop
• Take certifications only from vendors like
  – Cloudera
  – Hortonworks
  – MapR
  – Databricks (O'Reilly)
• FAQs and details:
  – http://www.itversity.com/2016/07/05/hadoop-and-spark-developer-certifications-faqs/
  – http://www.itversity.com/2016/07/02/hadoop-certifications/