Philip Russom, TDWI Research Director for Data Management
April 9, 2013
Integrating Hadoop Into Business Intelligence & Data Warehousing
TDWI would like to thank the following companies for sponsoring the 2013 TDWI Best Practices research report:
Integrating Hadoop into Business Intelligence and Data Warehousing
This presentation is based on the findings of that report.
Today’s Agenda
• Definitions
– What is Hadoop? Its components?
– Why care about Hadoop’s integration with BI & DW?
• State of Hadoop Integration
– Benefits and Barriers
– Problems and Opportunities
• Hadoop Best Practices
– Developer Productivity
– Specific Techniques
• Trends in Hadoop Integration
• Top Ten Priorities for Integrating Hadoop with BI & DW
PLEASE TWEET: @pRussom, #TDWI, #Hadoop, #HDFS, #Analytics, #BigData
Ten Facts About Hadoop
These bust Hadoop’s ten most common myths.
1. Hadoop consists of multiple products.
2. Hadoop is open source from the Apache Software Foundation (apache.org), but available from vendors, too.
3. Hadoop is an ecosystem, not a single product.
4. The Hadoop Distributed File System (HDFS) is a file system, not a database management system.
5. HiveQL resembles SQL, but isn’t standard SQL.
6. HDFS and MapReduce are related, but don’t require each other.
7. MapReduce provides control for analytics, not analytics per se.
8. Hadoop is about data diversity, not just data volume.
9. Hadoop complements a DW, rarely replaces one.
10. Hadoop enables many types of analytics, not just Web analytics.
Hadoop Technologies in Use Today and Tomorrow
• HDFS & a few add-ons are the most common Hadoop products today
– MapReduce – Distributed processing of hand-coded logic, whether for analytics or other apps
– Hive – Projects structure onto Hadoop data, to query it with a SQL-like language called HiveQL
– HBase – Simple, record-store database functions with HDFS’ data
• Some Hadoop tools are rare today:
– Chukwa, Ambari, Oozie, Hue, Flume
• Some will see aggressive growth:
– Mahout – Machine learning library (e.g., recommendation engines)
– R – Language for analytics
– HCatalog – Metadata management
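To make "distributed processing of hand-coded logic" concrete, here is a minimal single-process sketch of the map, shuffle, and reduce phases in Python. It is not Hadoop's API: the function names (map_phase, shuffle, reduce_phase, run_job) and the sample log lines are assumptions made up for illustration, and a real job would run the same kind of logic in parallel across cluster nodes.

```python
from collections import defaultdict

def map_phase(line):
    """Hand-coded map logic: emit a (word, 1) pair for each word in a line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group emitted values by key, as a framework would between the phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Hand-coded reduce logic: sum the counts collected for one key."""
    return key, sum(values)

def run_job(lines):
    """Run map, shuffle, and reduce in sequence on a list of text lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

# Hypothetical input records (stand-ins for lines of a file stored in HDFS)
sample_lines = ["error disk full", "warning disk slow", "error network down"]
counts = run_job(sample_lines)
```

The developer writes only the map and reduce functions. The grouping step in the middle is what the framework provides, which is why the deck describes MapReduce as control for analytics rather than analytics per se.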
Status of HDFS Implementations
• HDFS is used by a small minority of organizations today.
– Only 10% of survey respondents report having reached a production deployment.
• A whopping 73% of respondents expect to have HDFS in production.
– 10% are already in production, with another 63% coming.
– Only 27% of respondents say they will never put HDFS in production.
• HDFS usage will go from scarce to ensconced in three years.
– If survey respondents’ plans pan out, HDFS and other Hadoop products and technologies will be quite common in the near future.
• HDFS will have a large impact on
– BI, DW, DI, and analytics
– IT and data management in general
– How businesses leverage these
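The survey percentages on this slide fit together by simple addition. The short sketch below only restates the slide's figures; the variable names are assumptions chosen for readability.

```python
# Percentages quoted on the slide
in_production = 10   # % of respondents already in production
planned = 63         # % expecting to reach production later

# Respondents who are in production now or expect to be
expect_production = in_production + planned

# Remaining respondents, who say they will never put HDFS in production
never = 100 - expect_production
```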
Potential Benefits of Hadoop Integration
In priority order, based on survey responses
• Hadoop’s primary application = big data source for analytics (71%)
– Other apps: data archiving (20%); schema-free data staging (19%); managing machine data from robots, sensors, meters, etc (17%)
• Hadoop-based analytics yields new facts about a business
– Information exploration and discovery (33%); exploratory analytics with big data (48%)
• Hadoop supports advanced forms of analytics, beyond OLAP
– Data mining, statistical analysis, complex SQL, and so on (68%); often coupled with data visualization (25%)
• HDFS complements a data warehouse (30%)
– Handles advanced analytics and multi-structured data
– So DW can stay focused on reporting, OLAP, performance mgt, etc.
• Extreme scalability (19%) on low-cost hardware and software (26%)
– So users can capture more data than before (24%)
Challenges to Hadoop Integration
In priority order, based on survey responses
• Inadequate staffing or skills for big data analytics (62%)
– HDFS and Hadoop tools (in their current state) demand a fair amount of hand-coding in languages that the average BI professional does not know well, namely Java, R, and Hive. Tools will get better.
• Tools for Hadoop are few and immature (28%)
– Hadoop tools lack adequate metadata management (25%); don’t handle data in real time (22%); don’t support standard SQL
– Tools get better about these almost daily
• Changes required for successfully integrating Hadoop with BI/DW
– Adjustments to an existing user-defined DW architecture (27%)
– Best practices are emerging, so this point will become moot
• Good News – Scalability is not a barrier to Hadoop usage
– Only 8% anticipate problems scaling up HDFS & other Hadoop tools
Integrating Hadoop with BI, DW, & Analytics is an Opportunity, not a Problem
Why care about Hadoop integration now?
Because it enables new, compelling apps.
• Hadoop scales with file-based big data
– Imagine HDFS as shared infrastructure, similar to SAN & NAS
– Imagine a huge, live archive
– Imagine content mgt on steroids
– Imagine low price per terabyte
• HDFS extends BI, DW, analytics…
– Managing multi-structured data
– Repository for detail source data
– Processing big data for analytics
– Advanced forms of analytics
– Data staging on steroids
DW Architectures are growing more distributed.
• System on the Side (SOS) or Edge System
– A workload and its data, deployed on a platform separate from the EDW
– Usually integrates with EDW, so not a silo
• Long-standing tradition of SOSs w/EDWs
– Data marts, operational data stores (ODSs), data staging areas
– Workload types: analytics, real-time, detailed source data, unstructured data
• Trend
– As workloads increase in number, so do SOSs and Edge Systems
– Each analytic method (or even each analytic application) may need its own SOS
• Hadoop can enable some DW areas
– Data staging, analytic sandboxes, detailed source data, multi-structured data mgt
– MapReduce for analytic processing, HBase for record stores, Hive for unstructured queries…
• Core EDW remains a killer app for…
– Standard reports, OLAP, performance mgt, dashboards, real-time operational BI, etc.
[Figure: Many Systems on the Side (SOSs) or Edge Systems can surround a central DW in a heavily distributed architecture. Components shown around the EDW: federated data marts, real-time ODS, customer mart or ODS, NoSQL database, Hadoop Distributed File System, data staging area, metrics for performance mgt, OLAP cubes, multidimensional data models, detailed source data, analytic sandbox, DW appliance, columnar DBMS, MapReduce, data mining cache, star or snowflake schema.]
STAT BITES
Organizations surveyed that have HDFS in production…
• Have 12 HDFS clusters on average
– Median is 2
• Have 45 nodes per cluster on average
– Median is 12
• Manage a few TBs in HDFS today
– But expect a half PB within 3 years
• Load HDFS mostly via batch every 24 hrs
– So, not much streaming big data yet
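The gap between the average (12 clusters) and the median (2) suggests a skewed distribution: most organizations run a couple of clusters while a few run many. The sketch below illustrates the effect with made-up numbers. The list clusters_per_org is hypothetical and is not survey data; it was chosen only so that the mean and median come out as on the slide.

```python
from statistics import mean, median

# Hypothetical cluster counts for ten organizations (NOT survey data):
# nine organizations run one to three clusters, one runs a very large number.
clusters_per_org = [1, 1, 1, 2, 2, 2, 2, 3, 3, 103]

avg = mean(clusters_per_org)    # pulled upward by the single large value
mid = median(clusters_per_org)  # reflects the typical organization
```

For sizing a first deployment, the median is probably the better guide to what a typical peer organization runs.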
Most Needed Improvements in Hadoop Techs
• Security
– Needs to go beyond simple file-permission checks & become more granular
• Administration
– Need better tools for admin and deployment of clusters
• NameNode reliability
– Patches are available
• Latency issues
– Users want real-time (31%), fast queries (29%), streaming data (25%)
• Development tools
– For metadata, query design, less hand coding
Job Titles for Hadoop Workers
• Architects
– For data/BI, apps, generic
• Developers
– For apps, data/BI
• Data Scientists
– This job title is slowly replacing analyst titles
• Analysts
– Business, data, system
• Miscellaneous
– Ranges from engineers to marketers
– Ever-broadening range of end users who depend on data
Trends – What Tool Types are Users Integrating with Hadoop?
GROUP 1 – BI, DW, DI, and Analytics: Commonly integrated with Hadoop today. Will become a bit more common in the future.
GROUP 2 – Applications
GROUP 3 – Data Management: Rarely integrated with Hadoop today. Will soon experience aggressive adoption.
GROUP 4 – Machine Data: Half of Hadoop users don’t need these. But adoption will grow anyway.

Tool type                       Integrated Today;      Will Integrate     No Plans to
                                Will Stay Integrated   Within 3 Years     Integrate
Analytic tools                  46%                    44%                10%
Data warehouses                 46%                    42%                13%
Reporting tools                 44%                    40%                17%
Web servers                     44%                    38%                19%
Data integration tools          40%                    44%                17%
Analytic databases              38%                    46%                17%
Data visualization tools        38%                    42%                21%
Operational applications        27%                    38%                35%
Data marts                      25%                    40%                35%
Third-party data providers      21%                    50%                29%
Data quality tools              19%                    52%                29%
Master data management tools    13%                    52%                35%
Machinery (robots, vehicles)    13%                    35%                52%
Sensors (thermometers, etc.)    8%                     38%                54%

SOURCE: TDWI Best Practices Survey of late 2012. Based on 48 respondents who have experience with Hadoop. The chart is sorted by “Integrated Today,” in descending order.
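One way to read the chart is to add "integrated today" to "will integrate within 3 years" for a projected adoption figure per tool type. The sketch below does this for a handful of rows. It rests on an assumption: the pairing of each percentage with its series is inferred from the group notes and the stated sort order, since the chart itself is not reproduced here. The names chart, projection, and growth_factor are illustrative.

```python
# (integrated today %, will integrate within 3 years %) as read from the chart.
# Assumption: series assignment inferred from the group notes and sort order.
chart = {
    "Analytic tools": (46, 44),
    "Data warehouses": (46, 42),
    "Data quality tools": (19, 52),
    "Master data management tools": (13, 52),
    "Sensors (thermometers, etc.)": (8, 38),
}

# Projected share of respondents integrating each tool type if plans pan out
projection = {tool: today + planned for tool, (today, planned) in chart.items()}

# How many times larger the projected share is than today's share
growth_factor = {
    tool: round(projection[tool] / today, 1)
    for tool, (today, _planned) in chart.items()
}
```

Under these assumptions, analytic tools roughly double while master data management tools grow about fivefold, which matches the slide's description of Group 3 as heading for aggressive adoption.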
Top Ten Priorities for Hadoop Integration
These are recommendations, requirements, or rules that can guide you.
1. Embrace the new tool and platform ecosystem of Hadoop.
2. Know the 10 myths of Hadoop and bust them daily.
3. Don’t be fooled: Hadoop isn’t free.
4. Get training (and maybe new staff) for new Hadoop.
5. Look for capabilities that make Hadoop data look relational.
6. Expect to wait a while for certain Hadoop functionality to mature.
7. Beware siloed analytics, including Hadoop implementations.
8. Adjust your DW architecture to make place(s) for Hadoop.
9. Set up a proof of concept (POC), if you haven’t already.
10. Develop/apply a strategy for Hadoop integration with BI/DW.
Download a free copy of the report
• Download the report in a PDF file at: bit.ly/TDWI-BP-Rpt-List
• Feel free to distribute the PDF file of any TDWI Best Practices Report
Want to learn more about Big Data & Analytics?
Take courses at the TDWI World Conference in Chicago!
• May 5-10, 2013
• Chicago, Illinois
• New courses on big data, its mgt, its analysis
• Keynote addresses on big data best practices
• Peer networking, meals, social evenings, exhibits
• More information online
• Register online: tdwi.org/CH2013
Questions?
Contact Information
If you have further questions or comments:
Philip Russom, TDWI