Ask bigger questions

Becoming Information-Driven: Introduction to the Enterprise Data Hub. Mike Olson, Cloudera, Inc., Co-Founder & Chief Strategy Officer

Transcript of Ask bigger questions

1  

Becoming Information-Driven
Introduction to the Enterprise Data Hub

Mike Olson, Cloudera, Inc., Co-Founder & Chief Strategy Officer

2  

Expanding Data Requires A New Approach

1980s: Bring Data to Compute

Process-centric businesses use:

• Structured data mainly
• Internal data only
• “Important” data only

Now: Bring Compute to Data

Information-centric businesses use all data: multi-structured, internal & external data of all types

[Diagram: relative size & complexity of data and compute, 1980s vs. now]

©2014 Cloudera, Inc. All rights reserved.

3  

The Old Way: Bringing Data to Compute

Complex Architecture
• Many special-purpose systems
• Moving data around
• No complete views

Visibility
• Leaving data behind
• Risk and compliance
• High cost of storage

Time to Data
• Up-front modeling
• Transforms slow
• Transforms lose data

Cost of Analytics
• Existing systems strained
• No agility
• BI backlog

4  

The New Way: Bringing Compute to Data

[Diagram labels: SERVERS, MARTS, EDWS, DOCUMENTS, STORAGE, SEARCH, ARCHIVE; sources: ERP, CRM, RDBMS, MACHINES; FILES, IMAGES, VIDEOS, LOGS, CLICKSTREAMS; EXTERNAL DATA SOURCES]

1 Active archive
• Full fidelity original data
• Indefinite time, any source
• Lowest cost storage

2 Data management, transforms
• One source of data for all analytics
• Persist state of transformed data
• Significantly faster & cheaper

3 Self-service exploratory BI
• Simple search + BI tools
• “Schema on read” agility
• Reduce BI user backlog requests

4 Multi-workload analytic platform
• Bring applications to data
• Combine different workloads on common data (e.g. SQL + Search)
• True BI agility
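The “schema on read” bullet on this slide can be illustrated with a small sketch: raw records are stored exactly as they arrive, and field names and types are applied only when a query reads them. This is a minimal illustration in plain Python, not Cloudera's implementation; the sample records and the names `RAW_EVENTS` and `read_with_schema` are hypothetical.

```python
import json

# Raw events kept in full fidelity, exactly as they arrived
# (hypothetical sample data, not taken from the deck).
RAW_EVENTS = [
    '{"user": "a1", "amount": "19.99", "channel": "web"}',
    '{"user": "b2", "amount": "5", "store_id": 7}',
    '{"user": "c3"}',
]


def read_with_schema(lines, schema):
    """Apply a schema at read time (hypothetical helper).

    `schema` maps a field name to a type constructor. Fields missing
    from a record come back as None; fields not in the schema are ignored.
    """
    for line in lines:
        record = json.loads(line)
        yield {
            field: (cast(record[field]) if field in record else None)
            for field, cast in schema.items()
        }


# Two different views over the same stored data, with no up-front modeling.
spend_view = list(read_with_schema(RAW_EVENTS, {"user": str, "amount": float}))
store_view = list(read_with_schema(RAW_EVENTS, {"user": str, "store_id": int}))
```

Each view decides its own fields and types when it reads, so a new question needs only a new schema dictionary rather than a reload or transform of the stored data.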

5  

Better, faster, cheaper and multi-framework

Workload goals: Process Data • Train & Test Models • Respond to Events in RT • Explore & Analyze Data

BATCH PROCESSING: MR / Pig / Hive / Cascading
• Highly mature
• Wide range of clients

IN-MEMORY: SPARK
• Significant advances in speed & usability

MACHINE LEARNING: SAS, R, H2O, MLlib
• Integration with the SAS & Revolution product portfolio
• Python / 0xdata / MLlib for advanced users

NOSQL: HBASE
• Very low (~10ms) latency
• High volumes of single events

SQL: IMPALA
• High speed
• High concurrency
• Workload mgt
• Broad BI support

SEARCH: SOLR
• For unstructured & semi-structured data
• For business users

STREAM PROCESSING: SPARK STREAMING
• Low (1 second) latency
• Windows (collections) of events

6  

Rationalizing existing infrastructure

Migrating data sets, workloads or entire systems from more expensive or less flexible systems

Operational Data Store
• Consolidate, cleanse & stage data
• Promote to other operational systems or EDWs

Data Warehouse
• ELT
• Archive

7  

What do we mean by data discovery?

Providing a flexible analytic sandbox where users can apply multiple tools & techniques to derive insights from new & traditional data

Combine & explore new data sets
• Scripting
• Data blending
• Traditional ETL

Support ad-hoc marts and self-serve BI users
• Tableau, Qlik et al.

Enable data scientists to train & test models
• ML libraries
• SAS, Revolution

8  

What do we mean by pervasive analytics?

Using predictive analytics to improve business processes or augment professional judgment in an automated way across the organization

Analyze patterns over deep histories
• Recommendations
• Outliers

Automate responses to new data / observations
• Classifying or scoring new data

User exploration / judgment application
• Reviewing outliers
• Overriding suggestions

9  

Big Data in Credit Card Processing

Stakeholders (in slide order): CFO & CRO; CIO & CRO; R&D, CMO; CIO

Fraud Detection
“Modern credit card fraud rings operate globally over long time scales – how can we collect, store & analyze the petabytes of data it takes to detect them?”

Regulatory Compliance
“Customer privacy is paramount, but we need to keep vast amounts of information online to run our business. Can we achieve both goals?”

Product & Service Innovation
“We obviously have vast and detailed information about customer purchases. Can we combine it with GPS & mobile data and browsing behavior to offer new products?”

Operational Efficiency
“How can we deliver what the business team wants, and faster, without spending tens of millions of dollars to expand our data warehouse?”

10  

Big Data in Retail

Stakeholders (in slide order): CMO; CMO & Customer Service; CEO, VP Operations; CIO

360° Customer View
“We want to know what our customers do online and in our stores. How can we combine data from separate analytics silos to understand & serve them better?”

Fraud Prevention
“Theft, or ‘shrinkage’, in our stores is on the increase – can we combine POS data with video surveillance to reduce it without impacting customer service negatively?”

Logistics & Supply Chain
“How can we reduce stock-outs & ensure products are in the right stores at the right time? Can we combine data from our carriers with in-store historical data from thousands of stores?”

Operational Efficiency
“Our EDW infrastructure is being overwhelmed with data and workloads; we are running into capacity limits, and the annual costs of expansion are in the tens of millions. What can we do?”

11  

Big Data in Health Care

Stakeholders (in slide order): VP Operations, Chief of Compliance; VP Operations, Chief Medical Officer; CFO, Chief Medical Officer; CIO

360° Patient View
“Patient data ends up scattered across many different systems – is there a way to get a complete picture by combining it while ensuring HIPAA compliance?”

Regulatory Compliance
“The move to EMR combined with the strict regulations means we need to keep at least 7 years of data online – how can we afford to do that and make it searchable and available for analysis?”

Maximize Medical Efficacy
“We invest hundreds of millions in new equipment every year. How can we judge the long-term efficacy for patient outcomes, and make smarter investment decisions?”

Operational Efficiency
“Our EDW infrastructure is being overwhelmed with data and workloads; we are running into capacity limits, and the annual costs of expansion are in the tens of millions. What can we do?”

12

13

14

Mike Olson @mikeolson [email protected]