Sqrrl real time_big_data_20130411

Sqrrl Data, Inc. All Rights Reserved sqrrl Secure. Scale. Adapt. Adam Fuchs, CTO 11 April, 2013

description

Sqrrl CTO, Adam Fuchs, discusses Sqrrl and Accumulo at April 2013 Boston Hadoop User Group

Transcript of Sqrrl real time_big_data_20130411

Page 1: Sqrrl real time_big_data_20130411

sqrrl Secure. Scale. Adapt.

Adam Fuchs, CTO
11 April 2013

Page 2: Sqrrl real time_big_data_20130411


Who We Are

20+ years of combined Apache Accumulo engineering expertise

• Founded July 2012
• Funded August 2012
• Team includes former Tech Director of Accumulo at NSA and 6 committers/contributors

Management

• Mark Terenzoni, sqrrl CEO, F5
• Adam Fuchs, sqrrl CTO, NSA
• Ely Kahn, sqrrl VP BizDev, White House

Investors

Page 3: Sqrrl real time_big_data_20130411


Our Mission

Security

Adaptivity

Scalability

Page 4: Sqrrl real time_big_data_20130411


Apache Accumulo

• Sorted, Distributed Key/Value Store
• Based on Google’s Bigtable Design
• Built on Top of Apache Hadoop and Apache Zookeeper
• Augments and Integrates With the Hadoop ecosystem
• Originally developed at the National Security Agency, now an Apache Software Foundation project

Page 5: Sqrrl real time_big_data_20130411


Sqrrl Enterprise Architecture

Applications

Analytics APIs: Search, Statistics, Graph, Lucene, SQL, Custom Extensions

Security & Access Controls: IAM, Encryption, DAM, Secure Code

Data Integration: ETL, Hadoop

Accumulo

Page 6: Sqrrl real time_big_data_20130411

Big Data Lessons Learned

• Start small, but design for scalability
  – One application first, then grow to hundreds
  – One gigabyte first, then grow to petabytes

• Iterative schema refinement
  – Initially, let the data define the schema
  – Refine the schema in bulk as you better understand the data
  – Middle ground between flat files and complete ontologies

• Discovery analytics as application building blocks
  – Universal search: structured and unstructured data, across data sets, low latency
  – Basic statistics: aggregations of query results, parallelized, low latency, to support big picture analysis
  – Graphs: scalable graph analytics for analyzing how everything is connected

• Data-centric security
  – Separate modeling of security and analysis
  – Simplifies multi-tenancy and application accreditation
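One way to picture "let the data define the schema" is to tally, per field, which value types actually show up in the incoming records, and only later decide how to refine them. The sketch below is a minimal illustration under that assumption; `discover_schema` and the sample records are hypothetical and not part of Accumulo or Sqrrl.

```python
from collections import Counter, defaultdict


def discover_schema(records):
    """Count, per field name, which Python value types were observed."""
    schema = defaultdict(Counter)
    for record in records:
        for field, value in record.items():
            schema[field][type(value).__name__] += 1
    return schema


# Hypothetical sample records; the set of fields varies from record to record.
records = [
    {"name": "John Doe", "cholesterol": 183},
    {"name": "Jane Roe", "cholesterol": "n/a", "visit": "2012-09-12"},
]

schema = discover_schema(records)
for field, types in sorted(schema.items()):
    print(field, dict(types))
```

A field that reports more than one type (here `cholesterol`) is a candidate for the bulk refinement step mentioned above.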

Page 7: Sqrrl real time_big_data_20130411


Schema Discovery

Page 8: Sqrrl real time_big_data_20130411


Lightweight Apps

The future of Big Data innovation is Apps, built on:
• Universal Search
• Schema-less Statistics
• Graphs
• Intuitive Languages
• Secure, Scalable, and Adaptable platforms

Page 9: Sqrrl real time_big_data_20130411


Targeted Analysis

Page 10: Sqrrl real time_big_data_20130411


Big-Picture Analytics

Page 11: Sqrrl real time_big_data_20130411


Data-Centric Security

Definition: A form of security in which data carries with it the elements of provenance that are required to make policy decisions on its releasability.

• Separate data modeling for Security and Analysis
• Reusability of applications across security domains
• Distributed development of ingest and query applications
• Supported by Accumulo’s cell-level security
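As a rough illustration of the definition, labeling can happen once at ingest, driven by each record's provenance, so that query applications never have to encode the release policy themselves. The sketch below is a toy under stated assumptions: `label_record`, `POLICY`, `DEFAULT_LABEL`, and the record layout are hypothetical names, not Accumulo or Sqrrl APIs.

```python
# Hypothetical policy: provenance source -> visibility label.
POLICY = {
    "public_feed": "PUBLIC",
    "hr_system": "HR|ADMIN",
}

# Assumed fallback: unknown sources get the most restrictive label.
DEFAULT_LABEL = "ADMIN"


def label_record(record, policy=POLICY):
    """Derive a visibility label from the provenance carried by the record."""
    source = record["provenance"]["source"]
    return policy.get(source, DEFAULT_LABEL)


records = [
    {"value": "press release", "provenance": {"source": "public_feed"}},
    {"value": "salary table", "provenance": {"source": "hr_system"}},
    {"value": "unknown blob", "provenance": {"source": "legacy_dump"}},
]

# The label travels with the data from here on.
labeled = [(r["value"], label_record(r)) for r in records]
```

Because the policy lives in one place, the same ingest and query code could be reused across security domains by swapping the policy table.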

Page 12: Sqrrl real time_big_data_20130411


Cell-Level Security

Page 13: Sqrrl real time_big_data_20130411


Scalable Data-Centric Security

Diagram components: Data; Labeler; Accumulo; Apps; End Users; Policy Engine; Policies; Auth. Service; User Attributes; Audits; HDFS, Zookeeper

Page 14: Sqrrl real time_big_data_20130411


Accumulo’s Strengths

• Security
  – Cell-level security reduces the cost of application development in the presence of complex legal or policy restrictions on data use
  – IAM and encryption ties into enterprise security standards

• Scalability
  – Proven reliability and performance at the multi-petabyte scale
  – High-performance parallel I/O library

• Adaptivity
  – Flexible schema support to quickly ingest new data sources
  – Sorted key/value paradigm supports a multitude of search and analysis applications
  – Server-side programming framework “iterator trees” support best-in-class aggregation, filtering, and complex query semantics

Page 15: Sqrrl real time_big_data_20130411


Accumulo Key Structure

An Accumulo key is a 5-tuple, consisting of:
• Row: Controls Atomicity
• Column Family: Controls Locality
• Column Qualifier: Controls Uniqueness
• Visibility Label: Controls Access
• Timestamp: Controls Versioning

Accumulo Key/Value Example

Row       Col. Fam.     Col. Qual.     Visibility   Timestamp  Value
John Doe  Notes         PCP            PCP_JD       20120912   Patient suffers from an acute …
John Doe  Test Results  Cholesterol    JD|PCP_JD    20120912   183
John Doe  Test Results  Mental Health  JD|PSYCH_JD  20120801   Pass
John Doe  Test Results  X-Ray          JD|PHYS_JD   20120513   1010110110100…
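The key structure and the example rows can be modeled in a few lines to show how the visibility label gates what each reader sees. This is a simplified sketch, not Accumulo's client API: `Key`, `sort_key`, `is_visible`, and `scan` are hypothetical names, and the label evaluator only handles flat expressions (`|` between alternatives, `&` within one), whereas Accumulo's visibility expressions also allow parentheses.

```python
from typing import NamedTuple


class Key(NamedTuple):
    row: str
    family: str
    qualifier: str
    visibility: str
    timestamp: int


def sort_key(key):
    # Ascending on the first four fields, newest timestamp first.
    return (key.row, key.family, key.qualifier, key.visibility, -key.timestamp)


def is_visible(expression, authorizations):
    """Flat evaluation: any '|' alternative whose '&' tokens are all held."""
    return any(
        all(token in authorizations for token in alternative.split("&"))
        for alternative in expression.split("|")
    )


def scan(entries, authorizations):
    """Return (qualifier, value) pairs the given authorizations may read."""
    return [
        (key.qualifier, value)
        for key, value in entries
        if is_visible(key.visibility, authorizations)
    ]


# Rows taken from the example table; the truncated note is kept as shown.
entries = [
    (Key("John Doe", "Test Results", "Mental Health", "JD|PSYCH_JD", 20120801), "Pass"),
    (Key("John Doe", "Test Results", "Cholesterol", "JD|PCP_JD", 20120912), "183"),
    (Key("John Doe", "Notes", "PCP", "PCP_JD", 20120912), "Patient suffers from an acute …"),
]
entries.sort(key=lambda entry: sort_key(entry[0]))

patient_view = scan(entries, {"JD"})        # the patient's own authorization
psych_view = scan(entries, {"PSYCH_JD"})    # the psychologist's authorization
```

With these inputs the patient sees both test results but not the physician's notes, while the psychologist sees only the mental health result.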

Page 16: Sqrrl real time_big_data_20130411


Accumulo Architecture

Diagram components: Applications; Tablet Servers, each hosting Tablets; Master; Zookeeper; HDFS

Diagram edge labels: Read/Write; Store/Replicate; Assign/Balance; Delegate Authority

Page 17: Sqrrl real time_big_data_20130411


Tablet Data Flow

Diagram labels: Writes; Write Ahead Log (For Recovery); In-Memory Map; Minor Compaction; Sorted, Indexed Files; Merging / Major Compaction; Iterator Trees; Tablet Reads; Scan
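The data flow named on this slide (writes to a log and an in-memory map, minor compaction to sorted files, major compaction merging those files, and reads merging everything) can be sketched as a toy model. Everything here is an assumption for illustration: the `Tablet` class and its methods are hypothetical, deletes and iterators are ignored, and the log is kept per tablet only to keep the sketch short.

```python
import heapq


class Tablet:
    """Toy model of one tablet's write and read path (hypothetical class)."""

    def __init__(self):
        self.wal = []      # write-ahead log, appended before the in-memory update
        self.memory = {}   # in-memory map of recent writes
        self.files = []    # immutable sorted files, oldest first

    def write(self, key, value):
        self.wal.append((key, value))
        self.memory[key] = value

    def minor_compact(self):
        """Flush the in-memory map to a new sorted file."""
        if not self.memory:
            return
        self.files.append(sorted(self.memory.items()))
        self.memory = {}
        self.wal = []  # flushed data no longer needs log recovery

    def major_compact(self):
        """Merge all sorted files into one, keeping the newest value per key."""
        if len(self.files) > 1:
            self.files = [list(self._merged(self.files))]

    def scan(self):
        """Merge files and memory into one sorted stream of current values."""
        return list(self._merged(self.files + [sorted(self.memory.items())]))

    @staticmethod
    def _merged(sources):
        # Sources are ordered oldest first; negating the index makes the
        # newest source sort first among entries that share a key.
        tagged = [
            [(key, -age, value) for key, value in source]
            for age, source in enumerate(sources)
        ]
        last_key = object()
        for key, _, value in heapq.merge(*tagged):
            if key != last_key:
                yield key, value
                last_key = key


tablet = Tablet()
tablet.write("a", "1")
tablet.write("b", "1")
tablet.minor_compact()
tablet.write("a", "2")      # newer version of "a" lives only in memory for now
before = tablet.scan()
tablet.minor_compact()
tablet.major_compact()
after = tablet.scan()
```

A scan returns the same answer before and after compaction; compaction only changes how many sorted files have to be merged to produce it.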

Page 18: Sqrrl real time_big_data_20130411

Iterator Framework

Iterator Operations:
• File Reads
• Block Caching
• Merging
• Deletion
• Isolation
• Locality Groups
• Range Selection
• Column Selection
• Cell-level Security
• Versioning
• Filtering
• Aggregation
• Partitioned Joins
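Several of the operations listed here compose naturally as a stack of iterators over a sorted stream. The sketch below shows three of them (cell-level security, versioning, aggregation) as chained Python generators. It is an analogy rather than Accumulo's Java iterator interface: the function names, the single-token visibility check, and the entry layout `(row, column, visibility, timestamp, value)` are all assumptions.

```python
from itertools import groupby


def visibility_filter(entries, authorizations):
    """Drop entries whose single-token label is not among the authorizations."""
    return (e for e in entries if e[2] in authorizations)


def versioning(entries, max_versions=1):
    """Keep at most max_versions entries per (row, column); input is newest first."""
    for _, group in groupby(entries, key=lambda e: (e[0], e[1])):
        for position, entry in enumerate(group):
            if position < max_versions:
                yield entry


def summing(entries):
    """Aggregate the integer values of each row into one total."""
    for row, group in groupby(entries, key=lambda e: e[0]):
        yield row, sum(e[4] for e in group)


# Sorted by row, then column, then timestamp descending.
entries = [
    ("host1", "bytes", "ops", 3, 500),
    ("host1", "bytes", "ops", 2, 400),    # older version, dropped by versioning
    ("host1", "errors", "admin", 1, 7),   # hidden from a caller without "admin"
    ("host2", "bytes", "ops", 1, 250),
]

# Each stage consumes the sorted output of the stage below it.
stack = summing(versioning(visibility_filter(entries, {"ops"})))
totals = list(stack)
```

Because every stage preserves sort order and works on a stream, the stack never needs the whole table in memory.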

[email protected] | @sqrrl_inc | 617.520.4375

Page 19: Sqrrl real time_big_data_20130411


Table Design

• No built-in secondary indices
• Sort Order ⇔ Index
• Balance between ingest and query
• Avoid introducing bottlenecks
• Preserve cell-level security and scalability

Table:             Forward Index   Inverted Index
Row:               <UUID>          <Term>
Column Family:     <Type>          <Type> + <Field>
Column Qualifier:  <Field>         <UUID>
Value:             <Term>          <Digest of Event>
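To make the two-table layout concrete, the sketch below generates forward and inverted index entries for an event and answers a term query from the inverted rows. It is a minimal illustration under assumptions: `index_entries` and `lookup` are hypothetical helpers, the `+` separator between type and field is a guess at the slide's notation, and the digest is just a short placeholder string.

```python
def index_entries(uuid, event_type, fields, digest):
    """Build (table, row, family, qualifier, value) tuples for one event."""
    entries = []
    for field, term in fields.items():
        # Forward index: find an event's fields by its UUID.
        entries.append(("forward", uuid, event_type, field, term))
        # Inverted index: find matching UUIDs by term.
        entries.append(("inverted", term, f"{event_type}+{field}", uuid, digest))
    return entries


def lookup(entries, term):
    """Return the UUIDs of events containing the term, via inverted rows only."""
    return sorted(
        {qualifier
         for table, row, _, qualifier, _ in entries
         if table == "inverted" and row == term}
    )


entries = []
entries += index_entries("uuid-1", "login", {"user": "alice", "host": "web01"}, "alice@web01")
entries += index_entries("uuid-2", "login", {"user": "bob", "host": "web01"}, "bob@web01")

hits = lookup(entries, "web01")
```

In a sorted store the `row == term` filter would be a range scan over the inverted table, which is the sense in which sort order stands in for a secondary index.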

Page 20: Sqrrl real time_big_data_20130411


Ecosystem Architecture

Custom Ingester, Web Server, Custom Analytic, Map/Reduce Task

Sqrrl API over Apache Thrift RPC: Hierarchical Documents + Graphs, Lucene + SQL + more

Sqrrl Enterprise

Accumulo RPC: Sorted Key/Value I/O

Apache Accumulo

Hadoop RPC: File I/O

Apache HDFS

Page 21: Sqrrl real time_big_data_20130411


Contact

sqrrl data, inc.
275 Third St.
Cambridge, MA 02142

617-902-0784
www.sqrrl.com
@sqrrl_inc
[email protected]