Big Data Introduction


Agenda

• Current Scenario/Trends in IT

• Big Data

– Batch ecosystem

– NoSQL ecosystem

– Visualization

• Case Studies for Big Data

– Enterprise Data Warehouse

– Customer Analytics

Current Scenario

Enterprise applications can be broadly categorized into Operational and Decision Support systems.

Current Scenario – Architecture (Typical Enterprise Application)

*Diagram: Multiple clients (browsers) connect to application servers, which share a common database.

Current Scenario - Architecture

• Recent trends

– Standardization and consolidation of hardware (servers, storage, network) etc., to cut down the costs

– Storage is physically separated from servers and connected with high speed fiber optics

Current Scenario - Architecture

*Diagram: Database servers connected through redundant network switches to a storage cluster – typical database architecture in an enterprise.

Current Scenario - Architecture

• Databases

– Databases are clustered (Oracle – RAC)

• High availability

• Fault tolerance

• Load balancing

• Scalable (not linear)

– Common network storage

• File abstraction – file can be of any size

• Fault tolerance (using RAID)

Current Scenario - Architecture

• Almost all these applications follow similar n-tier architecture

– Core applications (operational)

– EAI (Enterprise Application Integration)

– CRM

– ERP

– DW/BI tools like Informatica, Cognos, Business Objects etc.

• However, there are exceptions – legacy (mainframe-based) applications which use a closed architecture

Current Scenario - Architecture

*Diagram: Bird's eye view after standardization and consolidation using cloud architecture – application servers, database servers and storage servers.

Current Scenario - Challenges

• Almost all operational systems use relational databases (RDBMS such as Oracle).

– RDBMS were originally designed for operational and transactional workloads.

• Not linearly scalable.

– Transactions

– Data integrity

• Expensive

• Predefined Schema

• Data processing does not happen where the data is stored (the storage layer)

– Some processing happens at the database server level (SQL)

– Some processing happens at the application server level (Java/.NET)

– Some processing happens at the client/browser level (JavaScript)

Current Scenario – Use case (E-Mail Campaigning)

*Diagram: Clients connect to app server(s), which use a database and drive mail server(s).

Current Scenario – Use case (E-Mail Campaigning)

• Customer (e-mail recipient) data needs to be stored in real time

• Customer data can run to hundreds of millions of records (if not billions)

• For every campaign, e-mails have to be pushed to all the customers (batch and ad hoc)

• Customers have to be uniquely identified to avoid sending multiple coupons to the same recipient (batch and periodic)

Current Scenario – Use case (E-Mail Campaigning)

• Challenges

– Small client vs. big client

• Scalability issues can be significant

– Standard client vs. premium client

– Infrastructure

• Databases, application servers or e-mail servers can each become the bottleneck

– Code development and deployment

– Standardization

*Keep these in mind – we will see later how they can be resolved using the Big Data ecosystem

Big Data

• Evolution of Big Data

• Understanding characteristics of Big Data

• Batch, operational and analytics in the Big Data ecosystem

• Types, Technologies or tools, Techniques and Talent

Evolution of Big Data

• GFS (Google File System)

• Google Map Reduce

• Google Big Table

Understanding characteristics of Big Data

• Volume

• Variety

• Velocity

Batch, operational and analytics in the Big Data ecosystem

• Batch – Hadoop ecosystem

– Map Reduce

– Hive/Pig

– Sqoop

• Operational (but not transactional) – NoSQL ecosystem

– Cassandra

– HBase

– MongoDB

• Analytics and visualization

– Sentiment analysis

– Statistical analysis

– Machine Learning and Natural Language Processing

Big Data ecosystem – Advantages

• Distributed storage

– Fault tolerance (RAID is replaced by replication)

• Distributed computing/processing

– Data locality (code goes to data)

• Scalability (almost linear)

• Low cost hardware (commodity)

• Low licensing costs

Hadoop ecosystem

• Evolution of the Hadoop ecosystem

• Use cases that can be addressed using the Hadoop ecosystem

• Hadoop ecosystem tools/landscape

Evolution of the Hadoop ecosystem

• GFS to HDFS

• Google Map Reduce to Hadoop Map Reduce

• Big Table to HBase

Use cases that can be addressed using the Hadoop ecosystem

• ETL

• Real time reporting

• Batch reporting

• Operational but not transactional

Hadoop ecosystem tools/landscape

• Operational and real time data integration

– HBase

• ETL

– Map Reduce, Hive/Pig, Sqoop etc.

• Reporting

– Hive (Batch)

– Impala/Presto (Real time)

• Analytics API

– Map Reduce

– Other frameworks

• Miscellaneous/complementary tools

– ZooKeeper (coordination service for masters)

– Oozie (Workflow/Scheduler)

– Chef/Puppet (automation for administrators)

– Vendor-specific management tools (Cloudera Manager, Hortonworks Ambari etc.)

NoSQL ecosystem

• Evolution of the NoSQL ecosystem

• Use cases that can be addressed using the NoSQL ecosystem

• NoSQL ecosystem tools/landscape

Evolution of the NoSQL ecosystem

• Google Big Table

• Amazon DynamoDB

• Apache HBase

• Apache Cassandra

• MongoDB

Use cases that can be addressed using the NoSQL ecosystem

• Operational but not transactional

• Complements conventional RDBMS systems

• NoSQL is generally not a substitute for transactional systems (see the sketch below)

• Facebook Messenger is implemented using HBase
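
Since "operational but not transactional" is the key distinction here, a minimal sketch may help. It uses MongoDB's Python driver (pymongo); the host, database and field names are hypothetical.

```python
# Minimal sketch: a customer profile store in MongoDB as an operational
# (but not transactional) system. Host, database and field names are
# hypothetical.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
customers = client["crm"]["customers"]

# Upsert keyed on e-mail address: each write is atomic on a single
# document, but there are no multi-row ACID transactions as in an RDBMS.
customers.update_one(
    {"email": "jane@example.com"},
    {"$set": {"name": "Jane", "segment": "premium"}},
    upsert=True,
)

print(customers.find_one({"email": "jane@example.com"}))
```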

NoSQL ecosystem tools/landscape

• NoSQL Tools

– Apache HBase

– Apache Cassandra

– MongoDB

• Miscellaneous/complementary tools

– ZooKeeper (coordination service for high availability of masters)

– Vendor-specific DevOps tools

Analytics and Visualization

• Evolution of analytics and visualization tools

• Use cases that can be addressed

– Statistical analysis

– Machine learning and Natural language processing

– Conventional Reporting

• Ecosystem tools/landscape

– Datameer

– Tableau or any BI tool

– R (in-memory statistical analysis tool)

Use Case – E-Mail Campaigning

• Role of NoSQL

– Operational

• Role of Hadoop

– Decision support

*Both NoSQL and Hadoop can be installed on the same servers.


Use Case (E-Mail Campaigning) – Big Data ecosystem

*Diagram: Clients connect to a cluster of nodes (Node 1, Node 2, ...); each node provides both storage and processing.

Use Case (E-Mail Campaigning) – Big Data ecosystem

• Storage

– Distributed storage (for example HDFS, CFS, GFS etc.)

• Processing

– Operational (HBase, Cassandra)

• Data storage is operational – for example, customers might have to be stored in real time (see the sketch after this list)

– Batch (Map Reduce, Hive/Pig)

• E-mail campaigning is batch

• Map Reduce can be integrated with e-mail notification to push the campaign

• Customer validation can be done in batch
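
The operational side might look like the following minimal sketch, which writes customer records into HBase through the happybase Python client; the table and column family names are hypothetical, and a pre-created table plus a running HBase Thrift server are assumed.

```python
# Minimal sketch: real-time customer storage in HBase via happybase.
# Assumes an HBase Thrift server on localhost and an existing table
# 'customers' with column family 'profile' (both hypothetical).
import happybase

connection = happybase.Connection("localhost")
table = connection.table("customers")

# Using the e-mail address as the row key means duplicate recipients
# collapse onto a single row, which helps avoid sending the same
# coupon twice.
table.put(b"jane@example.com", {
    b"profile:name": b"Jane",
    b"profile:segment": b"premium",
})

print(table.row(b"jane@example.com"))
```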

Use Case (LinkedIn)

• Most of the frames on linkedin.com are implemented using Big Data ecosystem tools

• Advantages

– Low cost to implement an idea (e.g. endorsements)

– No impact on existing applications

– Both operational (actual endorsement) and batch (consolidated e-mail) are done on same servers

– Distributed and scalable

Use Case – EDW (Current Architecture)

*Diagram: Sources (OLTP, closed mainframes, XML/external apps) → Data Integration (ETL/real time) → ODS → Data Warehouse → Visualization/Reporting for decision support.

Use Case – EDW (Current Architecture)

• An Enterprise Data Warehouse is built for enterprise reporting for a selected audience in executive management, so the user base viewing the reports is typically in the tens or hundreds

• Data Integration

– ODS (Operational Data Store)

• Sources – disparate

• Real time – tools/custom (GoldenGate, SharePlex etc.)

• Batch – tools/custom

• Uses – compliance, data lineage, reports etc.

– Enterprise Data Warehouse

• Sources – ODS or other sources

• ETL – tools/custom (Informatica, Ab Initio, Talend)

• Reporting/Visualization

– ODS (compliance-related reporting)

– Enterprise Data Warehouse

– Tools (Cognos, Business Objects, MicroStrategy, Tableau etc.)

Use Case – EDW (Big Data ecosystem)

*Diagram: The same sources (OLTP, closed mainframes, XML/external apps) feed a Hadoop cluster of nodes serving as EDW/ODS; data arrives in real time/batch with no separate ETL layer, ETL runs inside the cluster, and reporting is served either directly or via a reporting database.

Hadoop ecosystem

*Diagram – Hadoop ecosystem: core components are HDFS (distributed file system) and Map Reduce; Map Reduce based tools include Hive, Pig, Sqoop, Flume, Oozie and Mahout; non-Map-Reduce components include Impala, Presto and HBase.

Use Case – EDW (Big Data ecosystem)

• ODS and EDW can be shared on the same Hadoop cluster

• Real time/batch data integration

– Flume (to get data from web logs)

– Use an HBase layer

• ETL (see the sketch after this list)

– Should leverage Hadoop Map Reduce capabilities

– Sqoop – to get data from relational databases

– Hive/Pig – to process/transform data as per reporting requirements

• Reporting/Visualization

– Reporting can be done either directly from Hadoop or from a separate reporting database
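
As a rough illustration of this flow, here is a minimal sketch that drives Sqoop and Hive from Python; the JDBC URL, table names and directory paths are hypothetical, and both CLIs are assumed to be on the PATH.

```python
# Minimal sketch of the EDW flow above: Sqoop copies a source table from
# an RDBMS into HDFS, then Hive transforms it for reporting. All names,
# paths and the JDBC URL are hypothetical.
import subprocess

# 1. Ingest: copy the 'orders' table from MySQL into HDFS.
subprocess.run([
    "sqoop", "import",
    "--connect", "jdbc:mysql://dbhost/sales",
    "--username", "etl",
    "--table", "orders",
    "--target-dir", "/staging/orders",
], check=True)

# 2. Transform: expose the staged files to Hive and aggregate
#    (Hive runs the query as Map Reduce jobs).
hiveql = """
CREATE EXTERNAL TABLE IF NOT EXISTS orders (
    order_id INT, order_date STRING, amount DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/staging/orders';

CREATE TABLE IF NOT EXISTS daily_revenue AS
SELECT order_date, SUM(amount) AS revenue
FROM orders
GROUP BY order_date;
"""
subprocess.run(["hive", "-e", hiveql], check=True)
```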

Use Case – EDW (Big Data ecosystem)

• Pros over traditional EDW

– Low cost and consolidated hardware

– Low licensing costs

– Open source tools

– Facilitates advanced analytics

• Cons over traditional EDW

– Still evolving

– Learning curve

Use Case – Customer Analytics

• A company can often have thousands to millions of customers (e.g. eBay, Amazon, YouTube, LinkedIn etc.)

• Analytics at the customer level can add significant value to both the customer and the enterprise

• Traditional EDW appliances will not be able to support customer analytics/reporting for large enterprises

• The Big Data ecosystem of tools can handle customer analytics for an enterprise of any size


Use Case – Customer Analytics

• Capture data from web logs and load into Hadoop – Flume/custom solution

• Load customer profile data from a traditional MDM, EDW or other source into Hadoop – Sqoop/Hive/HBase

• Perform ETL to compute analytics at the customer level – Hive/Pig (see the sketch after this list)

• Database to store the pre-computed analytics for all customers – HBase

• Visualization – often custom, as per the company's requirements
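
The ETL step might look like this minimal sketch, which runs a per-customer aggregation in Hive from Python; the table and column names are hypothetical, and the weblogs table is assumed to have been populated already (e.g. by Flume).

```python
# Minimal sketch of the customer-level ETL step with Hive. Table and
# column names are hypothetical; 'weblogs' and 'customer_metrics' are
# assumed to exist already.
import subprocess

hiveql = """
INSERT OVERWRITE TABLE customer_metrics
SELECT customer_id,
       COUNT(*)                   AS page_views,
       COUNT(DISTINCT session_id) AS sessions
FROM weblogs
GROUP BY customer_id;
"""
subprocess.run(["hive", "-e", hiveql], check=True)

# The resulting rows can then be loaded into HBase for fast
# per-customer lookups when serving the visualization layer.
```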

Jobs in Big Data

• Generalized

– Data Scientists

– Solutions Architects

– Infrastructure Architects

– And many more

• Specialized

– ETL developers/architects

– Advanced analytics developers/architects

– Data Analysts/Business Analysts

– Hadoop Admins

– NoSQL Admins/DBAs

– DevOps Engineers

– And many more

Industry reaction

• Oracle – Big Data Appliance

• IBM – BigInsights

• EMC – Pivotal HD

• ETL tool vendors – Informatica, Syncsort etc. are adding or re-architecting Big Data capabilities


Thank You

Hadoop ecosystem – Setup Environment

• https://www.youtube.com/watch?v=p7uCyFfWL-c&index=13&list=PLf0swTFhTI8o6LURHy7u3YIBC3FnuZIdP

HDFS

• Hadoop Distributed File System

– It is distributed storage

– HDFS files are logical (physically they are stored as blocks)

– Replication factor is used for fault tolerance (see the sketch below)

https://www.youtube.com/watch?v=-Rc-jisdyKI&index=14&list=PLf0swTFhTI8o6LURHy7u3YIBC3FnuZIdP
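
A minimal usage sketch, driving the standard `hdfs dfs` and `hdfs fsck` commands from Python; it assumes a running cluster with the Hadoop client on the PATH, and the paths are hypothetical.

```python
# Minimal sketch of basic HDFS usage via the standard CLI. Assumes a
# running cluster and the Hadoop client on the PATH; paths are hypothetical.
import subprocess

# Copy a local file into HDFS; behind the single logical file name, HDFS
# splits the data into fixed-size blocks (128 MB by default in Hadoop 2).
subprocess.run(["hdfs", "dfs", "-put", "data.csv", "/user/demo/data.csv"],
               check=True)

# List the logical file...
subprocess.run(["hdfs", "dfs", "-ls", "/user/demo"], check=True)

# ...and inspect its physical blocks and replicas (replication factor 3
# by default provides fault tolerance without RAID).
subprocess.run(["hdfs", "fsck", "/user/demo/data.csv", "-files", "-blocks"],
               check=True)
```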

Map Reduce

• https://www.youtube.com/watch?v=IRxgew6ytq8&index=16&list=PLf0swTFhTI8o6LURHy7u3YIBC3FnuZIdP

• https://www.youtube.com/watch?v=he8vt835cf8&index=17&list=PLf0swTFhTI8o6LURHy7u3YIBC3FnuZIdP
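
To make the model concrete, here is a classic word-count sketch in the Map Reduce style, written for Hadoop Streaming (which lets any language read from stdin and write to stdout); the file name and the exact streaming jar path vary by installation.

```python
#!/usr/bin/env python3
# Word count in the Map Reduce style, runnable under Hadoop Streaming,
# e.g. (jar path varies by installation):
#   hadoop jar hadoop-streaming.jar -input /in -output /out \
#     -mapper "python3 wordcount.py map" \
#     -reducer "python3 wordcount.py reduce" -file wordcount.py
import sys

def mapper():
    # Map phase: emit (word, 1) for every word on stdin.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # Reduce phase: the framework sorts by key, so all counts for a word
    # arrive contiguously and a single running total suffices.
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```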

Map Reduce based tools

• Hive

• Pig

• Sqoop

• Oozie

• Flume

• Many more

Non Map Reduce based tools

• Impala

• HBase

• Many more

Introduction to Hive

• Defines a logical structure on top of data in HDFS

• Provides commands to load/insert data into HDFS

• Provides an SQL interface to process data in HDFS

• It typically uses Map Reduce to process the data

• Stores metadata/logical structure in a traditional RDBMS such as MySQL

• A small demo on Hive (see the sketch below)
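
In place of the live demo, here is a minimal Hive sketch run through the Hive CLI from Python; the table, columns and HDFS path are hypothetical.

```python
# Minimal Hive demo sketch: define a logical structure, load data, query
# with SQL. Table, columns and the HDFS path are hypothetical.
import subprocess

hiveql = """
-- Logical structure on top of files in HDFS:
CREATE TABLE IF NOT EXISTS pageviews (
    user_id STRING,
    url     STRING,
    ts      STRING
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- Load a file that is already in HDFS into the table:
LOAD DATA INPATH '/user/demo/pageviews.csv' INTO TABLE pageviews;

-- SQL query, executed under the hood as Map Reduce:
SELECT url, COUNT(*) AS hits FROM pageviews GROUP BY url;
"""
subprocess.run(["hive", "-e", hiveql], check=True)
```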

Introduction to Pig

• Another Map Reduce based interface to process data in HDFS

• Provides commands to load and read data from HDFS

• No need to have a pre-defined structure on the data

• No need for rigid schemas

• Handy for processing unstructured or semi-structured data

• A small demo on Pig (see the sketch below)
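
In place of the live demo, a minimal Pig sketch: it writes a small Pig Latin script and runs it in local mode, so no cluster is needed; the input file and its field layout are hypothetical.

```python
# Minimal Pig demo sketch: write a Pig Latin script and run it in local
# mode ('-x local' reads the local file system, so no cluster is needed).
# The input file and its field layout are hypothetical.
import subprocess

script = """
logs  = LOAD 'pageviews.csv' USING PigStorage(',')
        AS (user_id:chararray, url:chararray, ts:chararray);
byurl = GROUP logs BY url;
hits  = FOREACH byurl GENERATE group AS url, COUNT(logs) AS n;
DUMP hits;
"""
with open("demo.pig", "w") as f:
    f.write(script)

subprocess.run(["pig", "-x", "local", "demo.pig"], check=True)
```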

Introduction to Sqoop

• Map Reduce based data-copying utility to and from HDFS

• It can understand HCatalog/Hive structure

• It can copy data between HDFS and almost all traditional RDBMS, EDW appliances, as well as NoSQL data stores

• A small demo on Sqoop (see the sketch below)
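
In place of the live demo, a minimal Sqoop sketch showing both directions; the JDBC URL, credentials file, table names and HDFS paths are hypothetical.

```python
# Minimal Sqoop demo sketch: import an RDBMS table into HDFS and export
# results back. JDBC URL, credentials and all names are hypothetical.
import subprocess

# Import: RDBMS -> HDFS (Sqoop runs this as a Map Reduce job with
# parallel mappers).
subprocess.run([
    "sqoop", "import",
    "--connect", "jdbc:mysql://dbhost/shop",
    "--username", "etl", "--password-file", "/user/etl/.pw",
    "--table", "customers",
    "--target-dir", "/warehouse/customers",
    "--num-mappers", "4",
], check=True)

# Export: HDFS -> RDBMS (the reverse direction).
subprocess.run([
    "sqoop", "export",
    "--connect", "jdbc:mysql://dbhost/shop",
    "--username", "etl", "--password-file", "/user/etl/.pw",
    "--table", "customer_metrics",
    "--export-dir", "/warehouse/customer_metrics",
], check=True)
```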