Ask bigger questions
Uploaded by: south-west-data-meetup
Category: Data & Analytics
Transcript of Ask bigger questions
1
Becoming Information-Driven: Introduction to the Enterprise Data Hub
Mike Olson, Co-Founder & Chief Strategy Officer, Cloudera, Inc.
2
Expanding Data Requires A New Approach
©2014 Cloudera, Inc. All rights reserved. 2
1980s: Bring Data to Compute
Process-centric businesses use:
• Structured data mainly
• Internal data only
• “Important” data only
Now: Bring Compute to Data
Information-centric businesses use all data:
• Multi-structured, internal & external data of all types
[Diagram: relative size & complexity of Data and Compute, 1980s vs. now]
3
The Old Way: Bringing Data to Compute
Complex Architecture
• Many special-purpose systems
• Moving data around
• No complete views
Visibility
• Leaving data behind
• Risk and compliance
• High cost of storage
Time to Data
• Up-front modeling
• Transforms slow
• Transforms lose data
Cost of Analytics
• Existing systems strained
• No agility
• BI backlog
4
The New Way: Bringing Compute to Data
[Diagram labels: SERVERS, MARTS, EDWS, DOCUMENTS, STORAGE, SEARCH, ARCHIVE; ERP, CRM, RDBMS, MACHINES; FILES, IMAGES, VIDEOS, LOGS, CLICKSTREAMS; EXTERNAL DATA SOURCES. Callouts 1 to 4 are described below.]
1. Active archive
• Full-fidelity original data
• Indefinite time, any source
• Lowest-cost storage
2. Data management, transforms
• One source of data for all analytics
• Persist state of transformed data
• Significantly faster & cheaper
3. Self-service exploratory BI
• Simple search + BI tools
• “Schema on read” agility
• Reduce BI user backlog requests
4. Multi-workload analytic platform
• Bring applications to data
• Combine different workloads on common data (e.g. SQL + Search)
• True BI agility
5
Better, faster, cheaper and multi-framework
Process Data
BATCH PROCESSING: MR / Pig / Hive / Cascading
• Highly mature
• Wide range of clients
IN-MEMORY: Spark
• Significant advances in speed & usability
Train & Test Models
MACHINE LEARNING: SAS, R, H2O, MLlib
• Integration with the SAS & Revolution product portfolio
• Python / 0xdata / MLlib for advanced users
Respond to Events in Real Time
STREAM PROCESSING: Spark Streaming
• Low (1 second) latency
• Windows (collections) of events
NOSQL: HBase
• Very low (~10 ms) latency
• High volumes of single events
Explore & Analyze Data
SQL: Impala
• High speed
• High concurrency
• Workload management
• Broad BI support
SEARCH: Solr
• For unstructured & semi-structured data
• For business users
6
Rationalizing existing infrastructure
Migrating data sets, workloads or entire systems from more expensive or less flexible systems
Operational Data Store
• Consolidate, cleanse & stage data
• Promote to other operational systems or EDWs
Data Warehouse
• ELT
• Archive
7
What do we mean by data discovery?
Providing a flexible analytic sandbox where users can apply multiple tools & techniques to derive insights from new & traditional data
Combine & explore new data sets
• Scripting
• Data blending
• Traditional ETL
Support ad-hoc marts and self-serve BI users
• Tableau, Qlik et al.
Enable data scientists to train & test models
• ML libraries
• SAS, Revolution
8
What do we mean by pervasive analytics?
Using predictive analytics to improve business processes or augment professional judgment in an automated way across the organization
Analyze patterns over deep histories
• Recommendations
• Outliers
Automate responses to new data / observations
• Classifying or scoring new data
User exploration / judgment application
• Reviewing outliers
• Overriding suggestions
9
Big Data in Credit Card Processing
Fraud Detection
“Modern credit card fraud rings operate globally over long time scales – how can we collect, store & analyze the petabytes of data it takes to detect them?”
Regulatory Compliance
“Customer privacy is paramount, but we need to keep vast amounts of information online to run our business. Can we achieve both goals?”
Product & Service Innovation
“We obviously have vast and detailed information about customer purchases. Can we combine it with GPS & mobile data and browsing behavior to offer new products?”
Operational Efficiency
“How can we deliver what the business team wants, and faster, without spending tens of millions of dollars to expand our data warehouse?”
Roles: CFO & CRO; CIO & CRO; R&D, CMO; CIO
10
Big Data in Retail
360° Customer View
“We want to know what our customers do online and in our stores. How can we combine data from separate analytics silos to understand & serve them better?”
Fraud Prevention
“Theft, or ‘shrinkage’, in our stores is on the increase – can we combine POS data with video surveillance to reduce it without impacting customer service negatively?”
Logistics & Supply Chain
“How can we reduce stock-outs & ensure products are in the right stores at the right time? Can we combine data from our carriers with in-store historical data from thousands of stores?”
Operational Efficiency
“Our EDW infrastructure is being overwhelmed with data and workloads; we are running into capacity limits, and the annual costs of expansion are in the tens of millions. What can we do?”
Roles: CMO; CMO & Customer Service; CEO, VP Operations; CIO
11
Big Data in Health Care
360° Patient View
“Patient data ends up scattered across many different systems – is there a way to get a complete picture by combining it while ensuring HIPAA compliance?”
Regulatory Compliance
“The move to EMR combined with the strict regulations means we need to keep at least 7 years of data online – how can we afford to do that and make it searchable and available for analysis?”
Maximize Medical Efficacy
“We invest hundreds of millions in new equipment every year. How can we judge the long-term efficacy for patient outcomes, and make smarter investment decisions?”
Operational Efficiency
“Our EDW infrastructure is being overwhelmed with data and workloads; we are running into capacity limits, and the annual costs of expansion are in the tens of millions. What can we do?”
Roles: VP Operations, Chief of Compliance; VP Operations, Chief Medical Officer; CFO, Chief Medical Officer; CIO