Big Data Introduction


Transcript of Big Data Introduction

Page 1: Big Data Introduction

Big Data Introduction

Page 2: Big Data Introduction

Agenda

• Current Scenario/Trends in IT

• Big Data

– Batch eco system

– NoSQL eco system

– Visualization

• Case Studies for Big Data

– Enterprise Data Warehouse

– Customer Analytics

Page 3: Big Data Introduction

Current Scenario

[Diagram: Enterprise applications branch into Operational and Decision Support]

Enterprise applications can be broadly categorized into Operational and Decision Support systems.

Page 4: Big Data Introduction

Current Scenario – Architecture (Typical Enterprise Application)

[Diagram: multiple Clients (Browsers) connect to App Servers, which connect to a Database]

Page 5: Big Data Introduction

Current Scenario - Architecture

• Recent trends

– Standardization and consolidation of hardware (servers, storage, network, etc.) to cut down costs

– Storage is physically separated from servers and connected over high-speed fiber optics

Page 6: Big Data Introduction

Current Scenario - Architecture

[Diagram: multiple Database Servers connect through Network Switches to a Storage Cluster]

*Typical database architecture in an enterprise

Page 7: Big Data Introduction

Current Scenario - Architecture

• Databases

– Databases are clustered (e.g., Oracle RAC)

• High availability

• Fault tolerance

• Load balancing

• Scalable (but not linearly)

– Common network storage

• File abstraction – file can be of any size

• Fault tolerance (using RAID)

Page 8: Big Data Introduction

Current Scenario - Architecture

• Almost all of these applications follow a similar n-tier architecture

– Core applications (operational)

– EAI (Enterprise Application Integration)

– CRM

– ERP

– DW/BI tools like Informatica, Cognos, Business Objects, etc.

• However, there are exceptions – legacy (mainframe-based) applications, which use a closed architecture

Page 9: Big Data Introduction

Current Scenario - Architecture

[Diagram: Application Servers, Database Servers, and Storage Servers consolidated]

*Bird's-eye view – after standardization and consolidation using cloud architecture

Page 10: Big Data Introduction

Current Scenario - Challenges

• Almost all operational systems are using relational databases (RDBMS like Oracle).

– RDBMSs were originally designed for operational and transactional workloads.

• Not linearly scalable.

– Transactions

– Data integrity

• Expensive

• Predefined Schema

• Data processing does not happen where the data is stored (the storage layer)

– Some processing happens at database server level (SQL)

– Some processing happens at application server level (Java/.net)

– Some processing happens at client/browser level (Java Script)

Page 11: Big Data Introduction

Current Scenario – Use case (E-Mail Campaigning)

[Diagram: multiple Clients connect to App Server(s), which connect to Mail Server(s) and a Database]

Page 12: Big Data Introduction

Current Scenario – Use case (E-Mail Campaigning)

• Customer (E-Mail recipient) data needs to be stored in real time

• Customer data can be in the hundreds of millions (if not billions)

• For every campaign, e-mails have to be pushed to all the customers (batch and ad-hoc)

• Customers have to be uniquely identified to avoid sending multiple coupons to the same recipient (batch and periodic)

Page 13: Big Data Introduction

Current Scenario – Use case (E-Mail Campaigning)

• Challenges

– Small client vs. big client

• Scalability issues can be significant

– Standard client vs. premium client

– Infrastructure

• Databases, application servers, or mail servers can each become a bottleneck

– Code development and deployment

– Standardization

*Keep these in mind; I will explain how these can be resolved using the Big Data eco system

Page 14: Big Data Introduction

Big Data

• Evolution of Big Data

• Understanding characteristics of Big Data

• Batch, operational and analytics in Big Data eco system

• Types, Technologies or tools, Techniques and Talent

Page 15: Big Data Introduction

Evolution of Big Data

• GFS (Google File System)

• Google Map Reduce

• Google Big Table

Page 16: Big Data Introduction

Understanding characteristics of Big Data

• Volume

• Variety

• Velocity

Page 17: Big Data Introduction

Batch, operational and analytics in Big Data eco system

• Batch – Hadoop eco system

– Map reduce

– Hive/Pig

– Sqoop

• Operational (but not transactional) – NoSQL eco system

– Cassandra

– HBase

– MongoDB

• Analytics and visualization

– Sentiment analysis

– Statistical analysis

– Machine Learning and Natural Language Processing

Page 18: Big Data Introduction

Big Data eco system – Advantages

• Distributed storage

– Fault tolerance (RAID is replaced by replication)

• Distributed computing/processing

– Data locality (code goes to data)

• Scalability (almost linear)

• Low cost hardware (commodity)

• Low licensing costs

Page 19: Big Data Introduction

Hadoop eco system

• Evolution of Hadoop eco system

• Use cases that can be addressed using Hadoop eco system

• Hadoop eco system tools/landscape

Page 20: Big Data Introduction

Evolution of Hadoop eco system

• GFS to HDFS

• Google Map Reduce to Hadoop Map Reduce

• Big Table to HBase

Page 21: Big Data Introduction

Use cases that can be addressed using Hadoop eco system

• ETL

• Real time reporting

• Batch reporting

• Operational but not transactional

Page 22: Big Data Introduction

Hadoop eco system tools/landscape

• Operational and real time data integration

– HBase

• ETL

– Map Reduce, Hive/Pig, Sqoop, etc.

• Reporting

– Hive (Batch)

– Impala/Presto (Real time)

• Analytics API

– Map reduce

– Other frameworks

• Miscellaneous/complementary tools

– ZooKeeper (coordination service for masters)

– Oozie (Workflow/Scheduler)

– Chef/Puppet (automation for administrators)

– Vendor-specific management tools (Cloudera Manager, Hortonworks Ambari, etc.)

Page 23: Big Data Introduction

NoSQL eco system

• Evolution of NoSQL eco system

• Use cases that can be addressed using NoSQL eco system

• NoSQL eco system tools/landscape

Page 24: Big Data Introduction

Evolution of NoSQL eco system

• Google Big Table

• Amazon Dynamo (the basis for DynamoDB)

• Apache HBase

• Apache Cassandra

• MongoDB

Page 25: Big Data Introduction

Use cases that can be addressed using NoSQL eco system

• Operational but not transactional

• Complements conventional RDBMS systems

• NoSQL is generally not a substitute for transactional systems.

• Facebook Messenger is implemented using HBase

Page 26: Big Data Introduction

NoSQL eco system tools/landscape

• NoSQL Tools

– Apache HBase

– Apache Cassandra

– MongoDB

• Miscellaneous/complementary tools

– ZooKeeper (coordination service for high availability of masters)

– Vendor specific DevOps tools

Page 27: Big Data Introduction

Analytics and Visualization

• Evolution of analytics and visualization tools

• Use cases that can be addressed

– Statistical analysis

– Machine learning and Natural language processing

– Conventional Reporting

• Eco system tools/landscape

– Datameer

– Tableau or any BI tool

– R (in-memory statistical analysis tool)

Page 28: Big Data Introduction

Use Case – E-Mail Campaigning

• Role of NoSQL

– Operational

• Role of Hadoop

– Decision support

*Both NoSQL and Hadoop can be installed on the same servers.

Page 29: Big Data Introduction

Current Scenario – Use case (E-Mail Campaigning)

[Diagram: multiple Clients connect to App Server(s), which connect to Mail Server(s) and a Database]

Page 30: Big Data Introduction

Current Scenario – Use case (E-Mail Campaigning)

• Customer (E-Mail recipient) data needs to be stored in real time

• Customer data can be in the hundreds of millions (if not billions)

• For every campaign, e-mails have to be pushed to all the customers (batch and ad-hoc)

• Customers have to be uniquely identified to avoid sending multiple coupons to the same recipient (batch and periodic)

Page 31: Big Data Introduction

Current Scenario – Use case (E-Mail Campaigning)

• Challenges

– Small client vs. big client

• Scalability issues can be significant

– Standard client vs. premium client

– Infrastructure

• Databases, application servers, or mail servers can each become a bottleneck

– Code development and deployment

– Standardization

*Keep these in mind; I will explain how these can be resolved using the Big Data eco system

Page 32: Big Data Introduction

Use Case (E-Mail Campaigning) – Big Data eco system

[Diagram: multiple Clients connect to Node 1 and Node 2; each node provides both Storage and Processing]

Page 33: Big Data Introduction

Use Case (E-Mail Campaigning) – Big Data eco system

• Storage

– Distributed storage (for example HDFS, CFS, GFS, etc.)

• Processing

– Operational (HBase, Cassandra)

• Data storage is operational – for example, customers might have to be stored in real time (see the sketch below)

– Batch (Map Reduce, Hive/Pig)

• E-Mail campaigning is batch

• Map Reduce can be integrated with e-mail notification to push the campaign

• Customer validation can be done in batch
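
A minimal sketch of the operational side, using HBase's standard shell; the table, column family, and values are hypothetical, not from the slides:

    # Create the table once; HBase is operational but not transactional.
    echo "create 'customers', 'profile'" | hbase shell

    # Store a recipient in real time as they register (row key = customer id).
    echo "put 'customers', 'cust-00001', 'profile:email', 'jane@example.com'" | hbase shell

Because the row key is the customer id, writing the same customer twice overwrites one row rather than creating a duplicate, which helps with identifying recipients uniquely.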

Page 34: Big Data Introduction

Use Case (LinkedIn)

• Most of the frames on linkedin.com are implemented using Big Data eco system tools

• Advantages

– Low cost to implement an idea (e.g., endorsements)

– No impact on existing applications

– Both operational (actual endorsement) and batch (consolidated e-mail) are done on same servers

– Distributed and scalable

Page 35: Big Data Introduction

Use Case – EDW (Current Architecture)

[Diagram: Source(s) – OLTP, closed mainframes, XML/external apps – feed Data Integration (ETL/Real Time) into the ODS and Data Warehouse; EDW/ODS feed Visualization/Reporting for Decision Support]

Page 36: Big Data Introduction

Use Case – EDW (Current Architecture)

• An Enterprise Data Warehouse is built for enterprise reporting for a selected audience in executive management; hence the user base viewing the reports is typically in the tens or hundreds

• Data Integration

– ODS (Operational Data Store)

• Sources – disparate

• Real time – tools/custom (GoldenGate, SharePlex, etc.)

• Batch – tools/custom

• Uses – compliance, data lineage, reports, etc.

– Enterprise Data Warehouse

• Sources – ODS or other sources

• ETL – tools/custom (Informatica, Ab Initio, Talend)

• Reporting/Visualization

– ODS (compliance-related reporting)

– Enterprise Data Warehouse

– Tools (Cognos, Business Objects, MicroStrategy, Tableau, etc.)

Page 37: Big Data Introduction

Use Case – EDW (Big Data eco system)

[Diagram: Source(s) – OLTP, closed mainframes, XML/external apps – feed a Hadoop Cluster (EDW/ODS, multiple nodes) via Real Time/Batch ingestion (no ETL); ETL within the cluster feeds a Reporting Database for Visualization/Reporting and Decision Support]

Page 38: Big Data Introduction

Hadoop eco system

[Diagram: Hadoop Core Components – Distributed File System (HDFS) and Map Reduce; Hadoop Components (Map Reduce based) – Hive, Pig, Flume, Sqoop, Oozie, Mahout; Non Map Reduce – Impala, Presto, HBase]

Page 39: Big Data Introduction

Use Case – EDW (Big Data eco system)

• ODS and EDW can be shared on the same Hadoop cluster

• Real time/batch data integration

– Flume (to get data from web logs)

– Use the HBase layer

• ETL (see the sketch below)

– Should leverage Hadoop Map Reduce capabilities

– Sqoop – to get data from relational databases

– Hive/Pig – to process/transform data as per reporting requirements

• Reporting/Visualization

– Reporting can be done either directly from Hadoop or from a separate reporting database
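
As a rough sketch of the ETL flow above, in shell commands; the connection string, credentials, and table names are hypothetical, not from the slides:

    # Pull a source table from a relational database into HDFS with Sqoop.
    sqoop import \
      --connect jdbc:mysql://ods-db.example.com/sales \
      --username etl_user -P \
      --table orders \
      --target-dir /data/ods/orders \
      -m 4

    # Transform it for reporting with Hive (Sqoop's default text output is
    # comma-delimited, which the external table below expects).
    hive -e "
      CREATE EXTERNAL TABLE IF NOT EXISTS ods_orders (
        order_id INT, customer_id INT, amount DOUBLE, order_date STRING)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
      LOCATION '/data/ods/orders';

      CREATE TABLE IF NOT EXISTS edw_daily_revenue (order_date STRING, revenue DOUBLE);

      INSERT OVERWRITE TABLE edw_daily_revenue
      SELECT order_date, SUM(amount) FROM ods_orders GROUP BY order_date;
    "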

Page 40: Big Data Introduction

Use Case – EDW(Big Data eco system)

• Pros over traditional EDW

– Low cost and consolidated hardware

– Low licensing costs

– Open source tools

– Facilitate advanced analytics

• Cons over traditional EDW

– Still evolving

– Learning curve

Page 41: Big Data Introduction

Use Case – Customer Analytics

• A company can often have thousands to millions of customers (e.g., eBay, Amazon, YouTube, LinkedIn, etc.)

• Analytics at the customer level can add significant value to both the customer and the enterprise

• Traditional EDW appliances will not be able to support customer analytics/reporting for large enterprises

• Big Data eco system of tools can handle customer analytics for an enterprise of any size

Page 42: Big Data Introduction

Hadoop eco system

[Diagram: Hadoop Core Components – Distributed File System (HDFS) and Map Reduce; Hadoop Components (Map Reduce based) – Hive, Pig, Flume, Sqoop, Oozie, Mahout; Non Map Reduce – Impala, Presto, HBase]

Page 43: Big Data Introduction

Use Case – Customer Analytics

• Capture data from web logs and load into Hadoop – Flume/custom solution (the full pipeline is sketched below)

• Load customer profile data from traditional MDM, EDW, or other sources into Hadoop – Sqoop/Hive/HBase

• Perform ETL to compute analytics at the customer level – Hive/Pig

• Database to store the pre-computed analytics for all customers – HBase

• Visualization – often custom, per the company's requirements
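
A rough end-to-end sketch of this pipeline as shell commands; the agent name, table names, and paths are hypothetical, not from the slides:

    # 1. Ingest web logs into HDFS with Flume (configuration file assumed).
    flume-ng agent --name weblog-agent --conf-file weblog-agent.conf

    # 2. Compute per-customer metrics with Hive, writing tab-delimited output.
    hive -e "
      CREATE TABLE IF NOT EXISTS customer_metrics (customer_id STRING, page_views BIGINT)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
      INSERT OVERWRITE TABLE customer_metrics
      SELECT customer_id, COUNT(*) FROM weblog_events GROUP BY customer_id;
    "

    # 3. Load the pre-computed metrics into HBase for low-latency serving.
    echo "create 'customer_metrics', 'metrics'" | hbase shell
    hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
      -Dimporttsv.columns=HBASE_ROW_KEY,metrics:page_views \
      customer_metrics /user/hive/warehouse/customer_metrics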

Page 44: Big Data Introduction

Jobs in Big Data

• Generalized

– Data Scientists

– Solutions Architects

– Infrastructure Architects

– And many more

• Specialized

– ETL developers/architects

– Advanced analytics developers/architects

– Data Analysts/Business Analysts

– Hadoop Admins

– NoSQL Admins/DBAs

– DevOps Engineers

– And many more

Page 45: Big Data Introduction

Industry reaction

• Oracle – Big Data Appliance

• IBM – BigInsights

• EMC – Pivotal HD

• ETL tools – Informatica, Syncsort, etc. are adding or re-architecting big data capabilities

Page 46: Big Data Introduction

Hadoop eco system

[Diagram: Hadoop Core Components – Distributed File System (HDFS) and Map Reduce; Hadoop Components (Map Reduce based) – Hive, Pig, Flume, Sqoop, Oozie, Mahout; Non Map Reduce – Impala, Presto, HBase]

Page 47: Big Data Introduction

Thank You

Page 48: Big Data Introduction

Hadoop eco system - Setup Environment

• https://www.youtube.com/watch?v=p7uCyFfWL-c&index=13&list=PLf0swTFhTI8o6LURHy7u3YIBC3FnuZIdP

Page 49: Big Data Introduction

HDFS

• Hadoop Distributed File System

– It is distributed storage

– HDFS files are logical (physically they are stored as blocks)

– The replication factor is used for fault tolerance (see the commands below)

https://www.youtube.com/watch?v=-Rc-jisdyKI&index=14&list=PLf0swTFhTI8o6LURHy7u3YIBC3FnuZIdP
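
To make the blocks-and-replication points concrete, a few common HDFS shell commands (the paths are hypothetical):

    # Copy a local file into HDFS; it is physically stored as blocks.
    hdfs dfs -mkdir -p /data/demo
    hdfs dfs -put weblogs.txt /data/demo/

    # Set the replication factor of the file to 3 and wait for completion.
    hdfs dfs -setrep -w 3 /data/demo/weblogs.txt

    # Inspect how the logical file maps to blocks across the cluster.
    hdfs fsck /data/demo/weblogs.txt -files -blocks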

Page 50: Big Data Introduction

Map Reduce

• https://www.youtube.com/watch?v=IRxgew6ytq8&index=16&list=PLf0swTFhTI8o6LURHy7u3YIBC3FnuZIdP

• https://www.youtube.com/watch?v=he8vt835cf8&index=17&list=PLf0swTFhTI8o6LURHy7u3YIBC3FnuZIdP
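
As a minimal illustration of the Map Reduce model (not from the slides), a word count can be run with Hadoop Streaming using plain shell commands as mapper and reducer; the streaming jar path varies by distribution and is assumed here:

    # Mapper: emit one word per line. The framework sorts mapper output by
    # key between the phases, so the reducer can count adjacent duplicates.
    printf '%s\n' '#!/bin/sh' 'tr -s " " "\n"' > map.sh
    printf '%s\n' '#!/bin/sh' 'uniq -c' > red.sh
    chmod +x map.sh red.sh

    hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
      -files map.sh,red.sh \
      -input /data/demo/weblogs.txt \
      -output /data/demo/wordcount \
      -mapper map.sh \
      -reducer red.sh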

Page 51: Big Data Introduction

Map Reduce based tools

• Hive

• Pig

• Sqoop

• Oozie

• Flume

• Many more

Page 52: Big Data Introduction

Non Map Reduce based tools

• Impala

• HBase

• Many more

Page 53: Big Data Introduction

Introduction to Hive

• Define logical structure on top of data in HDFS

• Provides commands to load/insert data into HDFS

• Provides SQL interface to process data in HDFS

• It typically uses map reduce to process the data

• Stores metadata/logical structure in traditional RDBMS such as MySQL

• A small demo on Hive (sketched below)
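
In the spirit of that demo, a minimal Hive session might look like this (the table and file names are hypothetical):

    # Define a logical structure over data, load a local file, and query it
    # with SQL; Hive compiles the query into map reduce jobs.
    hive -e "
      CREATE TABLE IF NOT EXISTS orders (
        order_id INT, customer_id INT, amount DOUBLE)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

      LOAD DATA LOCAL INPATH 'orders.csv' INTO TABLE orders;

      SELECT customer_id, SUM(amount) FROM orders GROUP BY customer_id;
    "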

Page 54: Big Data Introduction

Introduction to Pig

• Another map reduce based interface to process data in HDFS

• Provides commands to load and read data from HDFS

• No need to have pre-defined structure on data

• No need of rigid schemas

• Handy to process unstructured or semi-structured data

• A small demo on Pig (sketched below)
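
A minimal Pig sketch in the same spirit (the input file and field names are hypothetical):

    # Write a short Pig Latin script and run it; note that no pre-defined
    # schema is required on the underlying HDFS file.
    printf '%s\n' \
      "logs = LOAD '/data/demo/weblogs.txt' USING PigStorage(' ') AS (uid:chararray, url:chararray);" \
      'byuid = GROUP logs BY uid;' \
      'counts = FOREACH byuid GENERATE group AS uid, COUNT(logs) AS hits;' \
      'DUMP counts;' > demo.pig
    pig demo.pig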

Page 55: Big Data Introduction

Introduction to Sqoop

• Map reduce based data copying utility to and from HDFS

• It can understand HCatalog/Hive structure

• It can copy data from almost all traditional RDBMS, EDW appliances, as well as NoSQL data stores to HDFS and vice versa

• A small demo on Sqoop (sketched below)
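
A minimal Sqoop sketch (the JDBC URL, credentials, and tables are hypothetical):

    # Import a relational table; --hive-import also creates the matching
    # Hive structure, which is the HCatalog/Hive integration noted above.
    sqoop import \
      --connect jdbc:mysql://db.example.com/shop \
      --username demo -P \
      --table customers \
      --hive-import \
      -m 4

    # And the reverse direction: export HDFS data back into the RDBMS.
    sqoop export \
      --connect jdbc:mysql://db.example.com/shop \
      --username demo -P \
      --table customer_metrics \
      --export-dir /data/demo/customer_metrics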