The Practice of Big Data - The Hadoop ecosystem explained with usage scenarios
Big Data in Practice: A Pragmatic Approach to Adoption and Value Creation
Raj Nair, Data Practitioner and Consultant
Application Services
• Enterprise Resource Planning (ERP)
• eCommerce / eBusiness
• Enterprise App Dev and ECM
• Legacy Support, Systems Integration and Conversion
Info Management
• Business Intelligence and Analytics
• Dashboards, Scorecards, Reporting
• MDM & Data Modeling
• Data Marts, ODS, ETL, Data Mining
IT Infrastructure
• IT Professional Services
• Network Administration & Support
• DB Admin & Maintenance
• Hosting and Application Support
Process & Governance
• SDLC – Agile, TDD, TFD Iterative
• Requirements Analysis, PMP, Change Management and Automated QA
• Training & Knowledge Transition and Technical Documentation
Content NOT FOR DISTRIBUTION: Property of Raj Nair
© Copyright 2011 Object Technology Solutions, Inc. (OTSI)
Object Technology Solutions Inc. (OTSI) is a leading Information Technology (IT) Services and Solutions company founded in 1999.
Serves a clientele of Fortune 500 companies, providing IT solutions in the areas of SDLC, Information Management, Business Intelligence, ERP, eCommerce (B2B, B2C), Mobile, Enterprise Solutions, Middleware and Infrastructure.
Technology Expertise and Experience
• SAP: Business Objects, ERP
• Microsoft: SharePoint, .Net, SQL Server, Project Server
• IBM: WebSphere, Cognos, Rational Suite
• HP: Testing tools, PPM
• Data: Oracle, DB2, SQL Server, Teradata
• OS: Windows, Unix (AIX, Linux, HP-UX), etc.
• Open Source, Java
Certified Diversity Supplier in KS, MO and IL
1 Big Data – The Original Use Case
2 Mainstream Big Data
3 Real World Use Cases and Applications
4 Practical Adoption : Opportunity Identification
5 Big Data 2.0 – What’s on the Horizon ?
6 Conclusion
An Open Source Engine
The Year was 2002 ….
Doug Cutting, Mike Cafarella
Already Somebody’s Biz Problem
• Problem of Capacity & Scale
http://
The Perfect Storm
MapReduce, Google File System, BigTable
MapReduce + Google File System = [image not preserved in transcript]
2 Mainstream Big Data
Yes, But… We are not Google
Sears: Dynamic pricing
AT&T: Quantifying customer impact from failed cell towers
Nokia: Holistic view of how users interact with apps across the world
Zions Bancorp: Analyze 130 data sources for fraud
Cerner: Detecting health risks
Every Day Big Data
Reaching scale-up limits on your server
Represents tools, technologies, frameworks for storage and processing at scale
Represents Opportunity
Big Data 1.0 – The Hadoop Ecosystem
Software library
Framework for large scale distributed processing
Ability to scale to thousands of computers
Design Principles
- Large Data Sets
- Moving computation is cheaper than moving data
- Hardware Failure, redundancy
Classic Hadoop MapReduce – Batch Processing
This not “That”
Is                                             Is Not
A Software Framework (Storage/Compute)         A Database Management System; An appliance
Batch Processing                               For real-time or interaction
Write Once, Read Many                          Delete and Update or “ACID”
Unassuming of data formats                     Imposing any schemas
Open Source                                    Lock In
Made for commodity servers with local disks    Meant to be run in virtualized environments
What is this you call data?
Unlearn current notion of “Data”
Native Data Source
Data Movement
- Sqoop: Data Transfer
- Flume: Data Streaming
HDFS: Storage and Archival
Data Processing
- MapReduce: Programming Library
- Crunch: Data Pipeline processing
- Pig: M/R Abstraction
Workload Management
- HBase: Real time access (low latency)
- Hive: Data Warehouse (high latency)
Component    Purpose                   Use it for
HDFS         Distributed Storage       Raw data storage and archival
Flume        Data Movement             Continuous streaming into HDFS
Sqoop        Data Movement             Data transfer from RDBMS to HDFS/HBase
HBase        Workload Mgmt             Near real-time read/write access to large data sets
Hive         Workload Mgmt             Analytical queries; data warehouse
MapReduce    Data Processing           Low level custom code for data processing
Crunch       Data Processing (Java)    Coding M/R pipelines, aggregations
Pig          Data Processing           Scripting language; similar to Crunch
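To make the "low level custom code" row for MapReduce concrete, here is a minimal single-process sketch of the map, shuffle and reduce steps using the classic word count. It is not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are illustrative, and a real cluster would run the mappers and reducers on many machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group emitted values by key, as the framework would do."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reducer: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence on an iterable of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data big value", "data at scale"]))
```

Crunch and Pig, listed in the same table, exist largely so that pipelines like this do not have to be written by hand at this level.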
A Powerful Paradigm
Hadoop – Separate Layers
- Storage Layer
- Processing Engine
- Query Engine
- Metadata
- Multiple Query Engines
- Data in Native format
Oracle, SQL Server, DB2
- Storage and Query bundled in each product
- Tightly integrated proprietary stacks, cannot free your data
3 Real World Use Cases and Applications
Opportunity…
Transform Data Processing
Exploration
Information Enrichment
Data Archival
Data Processing Pipeline
Several sources
Varying Frequencies
Varying Formats
Quality check
Validations, Scrubbing
Transformations/Rules
Prune app data sources
Discard/Archive
[Diagram labels, three panels: Data Processing Engine / ETL Engine / ELT, each with Data Warehouse and Data Storage]
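The stages named on this slide (quality check, validation, scrubbing, transformation, discard/archive) can be sketched as a chain of small functions. This is only an illustration of the idea: the field names `customer_id` and `amount` and the validation rules are hypothetical, not taken from the talk.

```python
def scrub(record):
    """Scrubbing: trim surrounding whitespace from every string field."""
    return {k: v.strip() if isinstance(v, str) else v for k, v in record.items()}


def is_valid(record):
    """Validation (hypothetical rules): non-empty id and a numeric amount."""
    if not record.get("customer_id"):
        return False
    try:
        float(record["amount"])
    except (KeyError, TypeError, ValueError):
        return False
    return True


def transform(record):
    """Transformation: keep the id and convert the amount to a float."""
    return {"customer_id": record["customer_id"], "amount": float(record["amount"])}


def run_pipeline(records):
    """Return (clean, rejected): clean rows move on, rejects are set aside."""
    clean, rejected = [], []
    for raw in records:
        record = scrub(raw)
        if is_valid(record):
            clean.append(transform(record))
        else:
            rejected.append(raw)  # kept in original form for discard/archive
    return clean, rejected
```

Keeping the rejected rows in their original form mirrors the "Discard/Archive" step: nothing is lost, it is only routed elsewhere.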
From Source to Business Value
[Diagram labels: Sourcing; Staging, Distribution; Biz Rules; Validations, Scrubbing, Mapping, Transforms; Shoe-horning; Relational fit, Loading, Archiving / Purging; Prep, Tuning; Data stores. Annotations: Minutes/Hours, Subset of Data, Hours, Reliability]
Missed SLAs = Biz Frustration
From Source to Business Value
Significantly more data sources
Highly scalable, significantly performant data processing
New business value, Faster time to value
Data Exploration
Large reservoir of data
Descriptive Statistics
Central Tendencies
Dispersion
Visualization
Surprise Me!
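As a small illustration of the descriptive statistics listed above (central tendencies and dispersion), the sketch below summarizes one numeric column with Python's standard `statistics` module. The `describe` helper is a hypothetical name, and at real scale this would run through a distributed engine rather than on an in-memory list.

```python
import statistics


def describe(values):
    """Summarize a numeric column: central tendency and dispersion."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
    }


if __name__ == "__main__":
    print(describe([2, 4, 4, 4, 5, 5, 7, 9]))
```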
Data Exploration
Courtesy: Data Science Central http://www.datasciencecentral.com/profiles/blogs/r-hadoop-data-analytics-heaven
Information Enrichment
Data Archival
Recycle Policy
Data Archival
Storage in Native Format
Redundancy, Replication
Easily accessible, inexpensive
4 Practical Adoption : Opportunity Identification
Practical Adoption
Big Data Technologies don’t solve all problems
Leveraging existing investments
Complexities of existing systems
Proof of Concept
Use your own data – realistic results
Focus on very specific pain points
Know what you are going to measure
Opportunity Identification
[Diagram labels, as in "From Source to Business Value": Sourcing; Staging, Distribution; Biz Rules; Validations, Scrubbing, Mapping; Shoe-horning; Relational fit, Loading, Archiving / Purging; Prep, Tuning; Data stores. Annotations: Minutes/Hours, Subset of Data, Hours, Reliability]
[Diagram labels: Data Processing Engine, Data Warehouse, Data Storage]
- Keep all your raw data
- Cheaper hardware
- Low cost per byte $$, high value per byte
- Offload from RDBMS
- Improve scale, performance
- Leverage existing tools
Hardware on a budget
Master: $5000
- 12 cores
- 32 GB RAM
- 2 TB SATA Drives, 7.2K RPM
Workers: $5000 each
- 4 Nodes
- 12 cores
- 16 GB RAM
- 4 TB SATA Drives each, 7.2K RPM
4-Port 10 Gig Switch - $1500
Grand Total < $30,000
Software costs? - 0
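The "Grand Total < $30,000" figure can be checked with a few lines of arithmetic. The sketch assumes the two price labels on the slide mean $5000 for the master and $5000 for each of the 4 workers; the transcript does not preserve which price belongs to which box, so treat that pairing as an assumption.

```python
# Prices in USD, read off the slide; the master/worker pairing is an assumption.
MASTER_COST = 5000    # one master node
WORKER_COST = 5000    # per worker node
WORKER_COUNT = 4
SWITCH_COST = 1500    # 4-port 10 Gig switch
SOFTWARE_COST = 0     # open source stack


def cluster_cost():
    """Total hardware and software cost of the small cluster."""
    return MASTER_COST + WORKER_COUNT * WORKER_COST + SWITCH_COST + SOFTWARE_COST


if __name__ == "__main__":
    total = cluster_cost()
    print(total, total < 30000)
```

Under that assumption the total comes to $26,500, which is consistent with the slide's stated ceiling.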
NoSQL
[Diagram labels: Data Processing Engine, Data Warehouse, Data Storage, NoSQL]
- Keep all your raw data
- Cheaper hardware
- Low cost per byte $$, high value per byte
Exploratory BI / Analysis
[Diagram label: Data Storage]
- Makes data exploration practically cheaper and faster
- Use existing visualization tools (Tableau or other)
- Check for integration with R
Data Architecture
• Single most important factor
• Don’t miss technology trends
But ….
It’s more about the battle plan
5 Big Data 2.0 – What’s on the Horizon ?
What about that RDBMS?
Too many new data types
Extreme demands for loading & query access
Dynamic / just in time schemas
SQL is great, but why limit to relational?
Still great for transactional workloads
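The "dynamic / just in time schemas" point can be illustrated with a schema-on-read sketch: raw records are stored as they arrive, and a schema is applied only when a query needs particular fields. The sample events, field names and the `read_with_schema` helper are all hypothetical.

```python
import json

# Raw events kept in their native form; fields vary from record to record.
RAW_EVENTS = [
    '{"user": "a1", "action": "click", "ms": 120}',
    '{"user": "b2", "action": "view"}',
    '{"user": "c3", "action": "click", "ms": "95", "referrer": "mail"}',
]


def read_with_schema(raw_lines, schema):
    """Apply a schema at read time: pick and cast only the requested fields."""
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        row = {}
        for field, cast in schema.items():
            value = record.get(field)
            # Missing fields become None instead of failing the load.
            row[field] = cast(value) if value is not None else None
        rows.append(row)
    return rows


if __name__ == "__main__":
    print(read_with_schema(RAW_EVENTS, {"user": str, "ms": int}))
```

A different question tomorrow can pass a different schema over the same raw data, which is the contrast with a relational table whose columns are fixed at load time.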
What’s Next?
- Multi-tenant Hadoop
- SQL on Hadoop
- Security
- In-memory, Real Time
Multi-tenant Hadoop
- HDFS 2: Storage and Archival
- YARN (Yet Another Resource Negotiator): application container; scales resource management
- Workloads: MapReduce (batch), HBase (online), Hive (interactive), In-memory, Search
- MapReduce becomes “one type of application workload”
SQL on Hadoop
Impala
• Cloudera
• MPP Engine
Tez
• Hortonworks
• SQL on Hive
Phoenix
• Apache
• SQL on HBase
In memory and Real Time
Spark
• Up to 100x faster than M/R
Storm
• Event processing
Apache Drill
• Low latency ad hoc queries
• Interactive queries at scale
Honorable (Proprietary) mentions
RDBMS on Hadoop
Complete Package
MPP, SMP, DataFlow
HortonWorks underneath
Manage, Analyze machine generated data
6 Conclusion
Where can I get Hadoop?
Distributors
Open Source Apache Project
And these guys…
Cloud
Conclusion
The Power & Paradigm of Distributed Computing
“Nativity” of Data – Unlearn old notions
Identify, understand your data processing pipeline
POC with a measurable, specific use case
Data Architecture – key to sustainable scalability
Stay informed