Big Data Introduction
Uploaded by durga-gadiraju (Category: Technology)
Transcript of Big Data Introduction
Agenda
• Current Scenario/Trends in IT
• Big Data
– Batch eco system
– NoSQL eco system
– Visualization
• Case Studies for Big Data
– Enterprise Data Warehouse
– Customer Analytics
Current Scenario
Enterprise applications can be broadly categorized into
Operational and Decision support systems.
Current Scenario – Architecture (Typical Enterprise Application)
Clients (Browsers) → App Servers → Database
Current Scenario - Architecture
• Recent trends
– Standardization and consolidation of hardware (servers, storage, network) etc., to cut down the costs
– Storage is physically separated from servers and connected with high speed fiber optics
Current Scenario - Architecture
Database Servers → Network Switches → Storage Cluster
*Typical database architecture in an enterprise
Current Scenario - Architecture
• Databases
– Databases are clustered (Oracle – RAC)
• High availability
• Fault tolerance
• Load balancing
• Scalable (not linear)
– Common network storage
• File abstraction – file can be of any size
• Fault tolerance (using RAID)
Current Scenario - Architecture
• Almost all these applications follow similar n-tier architecture
– Core applications (operational)
– EAI (Integrating Enterprise Applications)
– CRM
– ERP
– DW/BI tools like Informatica, Cognos, Business Objects etc.
• However, there are exceptions – legacy (Mainframe-based) applications, which use a closed architecture
Current Scenario - Architecture
Application Servers → Database Servers → Storage Servers
*Bird's-eye view – after standardization and consolidation using cloud architecture
Current Scenario - Challenges
• Almost all operational systems are using relational databases (RDBMS like Oracle).
– RDBMS were originally designed for operational and transactional workloads.
• Not linearly scalable.
– Transactions
– Data integrity
• Expensive
• Predefined Schema
• Data processing does not happen where the data is stored (storage layer)
– Some processing happens at database server level (SQL)
– Some processing happens at application server level (Java/.net)
– Some processing happens at client/browser level (Java Script)
Current Scenario – Use case (E-Mail Campaigning)
App Server(s)
Mail Server(s)
Database
Client
Client
Client
Current Scenario – Use case (E-Mail Campaigning)
• Customer (E-Mail recipient) data needs to be stored in real time
• Customer data can be in hundreds of millions (if not billions)
• For every campaign, e-mails have to be pushed to all the customers (batch and ad-hoc)
• Customers have to be uniquely identified to avoid sending multiple coupons to the same recipient (batch and periodic)
Current Scenario – Use case (E-Mail Campaigning)
• Challenges
– Small client vs. big client
• Scalability issues can be significant
– Standard client vs. premium client
– Infrastructure
• Databases, application servers or e-mail servers can each become a bottleneck
– Code development and deployment
– Standardization
*Keep these in mind – I will explain how these can be resolved using the Big Data eco system
Big Data
• Evolution of Big Data
• Understanding characteristics of Big Data
• Batch, operational and analytics in Big Data eco system
• Types, Technologies or tools, Techniques and Talent
Evolution of Big Data
• GFS (Google File System)
• Google Map Reduce
• Google Big Table
Understanding characteristics of Big Data
• Volume
• Variety
• Velocity
Batch, operational and analytics in Big Data eco system
• Batch – Hadoop eco system
– Map reduce
– Hive/Pig
– Sqoop
• Operational (but not transactional) – NoSQL eco system
– Cassandra
– HBase
– Mongo DB
• Analytics and visualization
– Sentiment analysis
– Statistical analysis
– Machine Learning and Natural Language Processing
Big Data eco system – Advantages
• Distributed storage
– Fault tolerance (RAID is replaced by replication)
• Distributed computing/processing
– Data locality (code goes to data)
• Scalability (almost linear)
• Low cost hardware (commodity)
• Low licensing costs
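The replication point above can be made concrete with a small sketch (plain Python; the node names and the round-robin policy are illustrative, real HDFS placement is rack-aware):

```python
# Illustrative sketch: each block is copied to several distinct commodity
# nodes, so losing one node loses no data. Node names are invented.

def place_replicas(blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes (round-robin)."""
    placement = {}
    for i, block in enumerate(blocks):
        placement[block] = [nodes[(i + r) % len(nodes)] for r in range(replication)]
    return placement

nodes = ["node1", "node2", "node3", "node4"]
placement = place_replicas(["blk_0", "blk_1"], nodes)
print(placement["blk_0"])  # ['node1', 'node2', 'node3']
```

Because every block lives on three independent machines, fault tolerance comes from software replication rather than from RAID hardware.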
Hadoop eco system
• Evolution of Hadoop eco system
• Use cases that can be addressed using Hadoop eco system
• Hadoop eco system tools/landscape
Evolution of Hadoop eco system
• GFS to HDFS
• Google Map Reduce to Hadoop Map Reduce
• Big Table to HBase
Use cases that can be addressed using Hadoop eco system
• ETL
• Real time reporting
• Batch reporting
• Operational but not transactional
Hadoop eco system tools/landscape
• Operational and real time data integration
– HBase
• ETL
– Map reduce, Hive/Pig, Sqoop etc
• Reporting
– Hive (Batch)
– Impala/Presto (Real time)
• Analytics API
– Map reduce
– Other frameworks
• Miscellaneous/complementary tools
– ZooKeeper (co-ordination service for masters)
– Oozie (Workflow/Scheduler)
– Chef/Puppet (automation for administrators)
– Vendor specific management tools (Cloudera Manager, Hortonworks Ambari etc)
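To see what a workflow/scheduler such as Oozie coordinates, here is a toy dependency graph run in a valid order (pure Python; the action names are invented for illustration):

```python
from graphlib import TopologicalSorter

# Toy workflow: actions with dependencies, executed predecessors-first,
# the way an Oozie workflow orders its Sqoop/Hive/Pig actions.
workflow = {
    "sqoop_import": [],
    "hive_transform": ["sqoop_import"],
    "pig_cleanup": ["sqoop_import"],
    "load_reporting_db": ["hive_transform", "pig_cleanup"],
}
order = list(TopologicalSorter(workflow).static_order())
print(order[0], order[-1])  # sqoop_import load_reporting_db
```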
NoSQL eco system
• Evolution of NoSQL eco system
• Use cases that can be addressed using NoSQL eco system
• NoSQL eco system tools/landscape
Evolution of NoSQL eco system
• Google Big Table
• Amazon DynamoDB
• Apache HBase
• Apache Cassandra
• MongoDB
Use cases that can be addressed using NoSQL eco system
• Operational but not transactional
• Complements conventional RDBMS systems
• NoSQL is generally not a substitute for transactional systems.
• Facebook Messenger is implemented using HBase
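To make "operational but not transactional" concrete, here is a toy sketch of the wide-column data model that HBase and Cassandra use (row key → column family → qualifier → value); the class and all names are invented for illustration:

```python
# Toy wide-column store. Each put touches exactly one row, which is the
# consistency unit in HBase-style stores; there are no cross-row
# transactions, unlike an RDBMS.

class ToyWideColumnStore:
    def __init__(self):
        self.rows = {}

    def put(self, row_key, family, qualifier, value):
        self.rows.setdefault(row_key, {}).setdefault(family, {})[qualifier] = value

    def get(self, row_key, family, qualifier):
        return self.rows.get(row_key, {}).get(family, {}).get(qualifier)

store = ToyWideColumnStore()
store.put("user#1001", "profile", "email", "a@example.com")
store.put("user#1001", "msgs", "2014-01-01", "hello")
print(store.get("user#1001", "profile", "email"))  # a@example.com
```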
NoSQL eco system tools/landscape
• NoSQL Tools
– Apache HBase
– Apache Cassandra
– MongoDB
• Miscellaneous/complementary tools
– ZooKeeper (co-ordination service for high availability of masters)
– Vendor specific DevOps tools
Analytics and Visualization
• Evolution of analytics and visualization tools
• Use cases that can be addressed
– Statistical analysis
– Machine learning and Natural language processing
– Conventional Reporting
• Eco system tools/landscape
– Datameer
– Tableau or any BI tool
– R (In memory statistical analysis tool)
Use Case – E-Mail Campaigning
• Role of NoSQL
– Operational
• Role of Hadoop
– Decision support
*Both NoSQL and Hadoop can be installed on the same servers.
Use Case (E-Mail Campaigning) – Big Data eco system
Clients → Node 1, Node 2 (each node combines Storage and Processing)
Use Case (E-Mail Campaigning) – Big Data eco system
• Storage
– Distributed storage (e.g. HDFS, CFS, GFS)
• Processing
– Operational (HBase, Cassandra)
• Data storage is operational – for example, customers might have to be stored in real time
– Batch (Map Reduce, Hive/Pig)
• E-Mail campaigning is batch
• Map Reduce can be integrated with e-mail notification to push the campaign
• Customer validation can be done in batch
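The batch customer-validation step can be sketched in map-reduce style (a plain-Python stand-in with invented sample data): map keys each recipient by normalized e-mail, the shuffle groups by key, and reduce keeps one record per key so nobody receives two coupons.

```python
from collections import defaultdict

# Map-reduce style de-duplication of e-mail recipients.

def map_phase(records):
    for rec in records:
        yield rec["email"].strip().lower(), rec  # normalized e-mail as key

def shuffle(pairs):
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Keep the first record per e-mail address.
    return [values[0] for key, values in sorted(grouped.items())]

records = [
    {"email": "A@x.com", "name": "A"},
    {"email": "a@x.com ", "name": "A duplicate"},
    {"email": "b@x.com", "name": "B"},
]
unique = reduce_phase(shuffle(map_phase(records)))
print(len(unique))  # 2
```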
Use Case (LinkedIn)
• Most of the frames in linkedin.com are implemented using Big Data eco system tools
• Advantages
– Low cost to implement an idea (endorsements)
– No impact on existing applications
– Both operational (actual endorsement) and batch (consolidated e-mail) are done on same servers
– Distributed and scalable
Use Case – EDW (Current Architecture)
Source(s): OLTP, closed Mainframes, XML/external apps
→ Data Integration (ETL/Real Time) → ODS / Enterprise Data Warehouse
→ Visualization/Reporting → Decision Support
Use Case – EDW (Current Architecture)
• The Enterprise Data Warehouse is built for enterprise reporting for a selected audience in executive management; hence the user base viewing the reports is typically in the tens or hundreds
• Data Integration
– ODS (Operational Data Store)
• Sources – disparate
• Real time – tools/custom (GoldenGate, SharePlex etc.)
• Batch – tools/custom
• Uses – compliance, data lineage, reports etc.
– Enterprise Data Warehouse
• Sources – ODS or other sources
• ETL – tools/custom (Informatica, Ab Initio, Talend)
• Reporting/Visualization
– ODS (compliance-related reporting)
– Enterprise Data Warehouse
– Tools (Cognos, Business Objects, MicroStrategy, Tableau etc.)
Use Case – EDW (Big Data eco system)
Source(s): OLTP, closed Mainframes, XML/external apps
→ Real Time/Batch ingestion (no ETL) → Hadoop Cluster (EDW/ODS) of multiple nodes
→ ETL → Reporting Database → Visualization/Reporting → Decision Support
Hadoop eco system
• Hadoop Core Components
– Distributed File System (HDFS)
– Map Reduce
• Map Reduce based tools
– Hive, Pig, Flume, Sqoop, Oozie, Mahout
• Non Map Reduce tools
– Impala, Presto, HBase
Use Case – EDW (Big Data eco system)
• ODS and EDW can be shared on the same Hadoop cluster
• Real time/batch data integration
– Flume (to get data from web logs)
– Use the HBase layer
• ETL
– Should leverage Hadoop Map Reduce capabilities
– Sqoop – to get data from relational databases
– Hive/Pig – to process/transform data as per reporting requirements
• Reporting/Visualization
– Reporting can be done either directly from Hadoop or from a separate reporting database
Use Case – EDW (Big Data eco system)
• Pros over traditional EDW
– Low cost and consolidated hardware
– Low licensing costs
– Open source tools
– Facilitate advanced analytics
• Cons over traditional EDW
– Still evolving
– Learning curve
Use Case – Customer Analytics
• A company can often have thousands to millions of customers (e.g. eBay, Amazon, YouTube, LinkedIn)
• Analytics at customer level can add significant value to both customer as well as Enterprise
• Traditional EDW appliances will not be able to support customer analytics/reporting for large enterprises
• Big Data eco system of tools can handle customer analytics for an enterprise of any size
Use Case – Customer Analytics
• Capture data from web logs and load into Hadoop – Flume/custom solution
• Load customer profile data from traditional MDM, EDW or other sources into Hadoop – Sqoop/Hive/HBase
• Perform ETL to compute analytics at customer level – Hive/Pig
• Database to store the pre-computed analytics for all customers – HBase
• Visualization – often custom, as per the company's requirements
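The ETL step that computes per-customer analytics is a group-and-aggregate of the kind a Hive/Pig job would produce; a plain-Python stand-in (event fields invented for illustration):

```python
from collections import defaultdict

# Group web-log events by customer and compute simple per-customer
# metrics (visit count, distinct pages), the sort of aggregate that
# would be pre-computed in batch and then stored in HBase per customer.

def customer_metrics(events):
    visits = defaultdict(int)
    pages = defaultdict(set)
    for ev in events:
        visits[ev["customer_id"]] += 1
        pages[ev["customer_id"]].add(ev["page"])
    return {c: {"visits": visits[c], "distinct_pages": len(pages[c])}
            for c in visits}

events = [
    {"customer_id": "c1", "page": "/home"},
    {"customer_id": "c1", "page": "/home"},
    {"customer_id": "c1", "page": "/cart"},
    {"customer_id": "c2", "page": "/home"},
]
print(customer_metrics(events)["c1"])  # {'visits': 3, 'distinct_pages': 2}
```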
Jobs in Big Data
• Generalized
– Data Scientists
– Solutions Architects
– Infrastructure Architects
– And many more
• Specialized
– ETL developers/architects
– Advanced analytics developers/architects
– Data Analysts/Business Analysts
– Hadoop Admins
– NoSQL Admins/DBAs
– Devops Engineers
– And many more
Industry reaction
• Oracle – Big Data appliance
• IBM – Big Insights
• EMC created PivotalHD
• ETL tools – Informatica, Syncsort etc. are adding or re-architecting Big Data capabilities
Thank You
Hadoop eco system - Setup Environment
• https://www.youtube.com/watch?v=p7uCyFfWL-c&index=13&list=PLf0swTFhTI8o6LURHy7u3YIBC3FnuZIdP
HDFS
• Hadoop Distributed File System
– It is distributed storage
– HDFS files are logical (physically they are stored as blocks)
– The replication factor is used for fault tolerance
https://www.youtube.com/watch?v=-Rc-jisdyKI&index=14&list=PLf0swTFhTI8o6LURHy7u3YIBC3FnuZIdP
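The "files are logical, stored as blocks" idea can be sketched quickly (plain Python; the sizes here are illustrative, classic HDFS default block sizes were 64 or 128 MB):

```python
# Split a logical file into fixed-size blocks, the way HDFS stores a
# file physically. Sizes are in MB for readability.

def split_into_blocks(file_size, block_size):
    """Return (start, length) for each block of a file."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

# A 300 MB file with 128 MB blocks: two full blocks plus a 44 MB tail.
blocks = split_into_blocks(300, 128)
print(blocks)  # [(0, 128), (128, 128), (256, 44)]
```

Each of these blocks is then replicated to multiple nodes according to the replication factor.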
Map Reduce
• https://www.youtube.com/watch?v=IRxgew6ytq8&index=16&list=PLf0swTFhTI8o6LURHy7u3YIBC3FnuZIdP
• https://www.youtube.com/watch?v=he8vt835cf8&index=17&list=PLf0swTFhTI8o6LURHy7u3YIBC3FnuZIdP
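The canonical Map Reduce example is word count; in a plain-Python stand-in, the map phase emits (word, 1) pairs and the shuffle-and-reduce phase sums them per word:

```python
from collections import defaultdict

# Word count in map-reduce style: map emits (word, 1), the shuffle
# groups pairs by word, and reduce sums the counts for each word.

def mapper(line):
    for word in line.lower().split():
        yield word, 1

def shuffle_and_reduce(pairs):
    counts = defaultdict(int)
    for word, one in pairs:
        counts[word] += one
    return dict(counts)

lines = ["big data intro", "big data"]
pairs = [p for line in lines for p in mapper(line)]
print(shuffle_and_reduce(pairs))  # {'big': 2, 'data': 2, 'intro': 1}
```

In Hadoop, the mapper runs on the node holding each block (data locality) and only the shuffled pairs travel over the network.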
Map Reduce based tools
• Hive
• Pig
• Sqoop
• Oozie
• Flume
• Many more
Non Map Reduce based tools
• Impala
• HBase
• Many more
Introduction to Hive
• Define logical structure on top of data in HDFS
• Provides commands to load/insert data into HDFS
• Provides SQL interface to process data in HDFS
• It typically uses map reduce to process the data
• Stores metadata/logical structure in traditional RDBMS such as MySQL
• A small demo on Hive
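Hive's core idea, a logical table defined over raw data and processed with SQL, can be mimicked with Python's built-in sqlite3 (this is an analogy, not Hive itself; the table and column names are invented):

```python
import sqlite3

# Analogy for Hive: define a logical structure (CREATE TABLE), load raw
# records into it, then process with SQL. Hive does the same over HDFS
# files, typically translating the SQL into map reduce jobs.

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (user_id TEXT, page TEXT)")
raw_rows = [("u1", "/home"), ("u1", "/cart"), ("u2", "/home")]
conn.executemany("INSERT INTO page_views VALUES (?, ?)", raw_rows)

rows = conn.execute(
    "SELECT page, COUNT(*) FROM page_views GROUP BY page ORDER BY page"
).fetchall()
print(rows)  # [('/cart', 1), ('/home', 2)]
```

The key difference is scale: Hive keeps only the metadata in an RDBMS (such as MySQL) while the data itself stays in HDFS.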
Introduction to Pig
• Another map reduce based interface to process data in HDFS
• Provides commands to load and read data from HDFS
• No need to have pre-defined structure on data
• No need of rigid schemas
• Handy to process unstructured or semi-structured data
• A small demo on Pig
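Pig's load → filter → group flow over schema-less text can be sketched as a plain-Python pipeline (the log format here is invented for illustration):

```python
# Pig-style pipeline over schema-less text: LOAD raw lines, FILTER out
# malformed ones, GROUP by a field. No table definition is needed up
# front, which is what makes this handy for semi-structured data.

raw_lines = [
    "2014-01-01 u1 /home",
    "garbage-line",
    "2014-01-01 u2 /home",
    "2014-01-02 u1 /cart",
]

records = [line.split() for line in raw_lines]   # LOAD
good = [r for r in records if len(r) == 3]       # FILTER malformed rows
by_page = {}                                     # GROUP BY page
for date, user, page in good:
    by_page.setdefault(page, []).append(user)

print(sorted((p, len(u)) for p, u in by_page.items()))
```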
Introduction to Sqoop
• Map reduce based data copying utility to and from HDFS
• It can understand HCatalog/Hive structure
• It can copy data from almost all traditional RDBMS, EDW appliances as well as NoSQL data stores to HDFS, and vice versa
• A small demo on Sqoop
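What Sqoop does, a parallel copy of a database table into files on HDFS, can be sketched with sqlite3 as the source and in-memory "part files" as the target (the split-by-id logic is simplified, and the table and file names are illustrative):

```python
import sqlite3

# Sqoop-style import sketch: read a source table in id ranges
# ("splits") and write each split as one part file, the way Sqoop's
# parallel mappers each produce part-m-00000, part-m-00001, ...

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, email TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(i, f"user{i}@example.com") for i in range(1, 7)])

def import_table(conn, num_splits=2):
    lo, hi = conn.execute("SELECT MIN(id), MAX(id) FROM customers").fetchone()
    step = (hi - lo + num_splits) // num_splits
    parts = {}
    for m in range(num_splits):
        start, end = lo + m * step, lo + (m + 1) * step
        rows = conn.execute(
            "SELECT id, email FROM customers WHERE id >= ? AND id < ?",
            (start, end)).fetchall()
        parts[f"part-m-{m:05d}"] = [f"{i},{e}" for i, e in rows]
    return parts

parts = import_table(conn)
print(sorted(parts))  # ['part-m-00000', 'part-m-00001']
```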