Managing Big Data (Chapter 2, SC 11 Tutorial)


An Introduction to Data Intensive Computing

Chapter 2: Data Management

Robert Grossman, University of Chicago and Open Data Group

Collin Bennett, Open Data Group

November 14, 2011

1. Introduction (0830-0900)
   a. Data clouds (e.g. Hadoop)
   b. Utility clouds (e.g. Amazon)

2. Managing Big Data (0900-0945)
   a. Databases
   b. Distributed file systems (e.g. Hadoop)
   c. NoSQL databases (e.g. HBase)

3. Processing Big Data (0945-1000 and 1030-1100)
   a. Multiple virtual machines & message queues
   b. MapReduce
   c. Streams over distributed file systems

4. Lab using Amazon's Elastic MapReduce (1100-1200)

 

What Are the Choices?

• Databases (SQL Server, Oracle, DB2)
• File systems
• Distributed file systems (Hadoop, Sector)
• Clustered file systems (GlusterFS, …)
• NoSQL databases (HBase, Accumulo, Cassandra, SimpleDB, …)
• Applications (R, SAS, Excel, etc.)

What Is the Fundamental Trade-Off?

Scale up vs. scale out.

Section 2.1 Databases

Advice From Jim Gray

1. Analyzing big data requires scale-out solutions, not scale-up solutions (GrayWulf).
2. Move the analysis to the data.
3. Work with scientists to find the most common "20 queries" and make them fast.
4. Go from "working to working."

Pattern 1: Put the metadata in a database and point to files in a file system.

Example: Sloan Digital Sky Survey
• Two surveys in one
  – Photometric survey in 5 bands
  – Spectroscopic redshift survey
• Data is public
  – 40 TB of raw data
  – 5 TB of processed catalogs
  – 2.5 terapixels of images
• Catalog uses Microsoft SQL Server
• Started in 1992, finished in 2008
• JHU SkyServer serves millions of queries

Example: Bionimbus Genomics Cloud

www.bionimbus.org

[Architecture diagram] A GWT-based front end sits over database services (PostgreSQL), analysis pipeline and re-analysis services, data ingestion services, large data cloud services (Hadoop, Sector/Sphere), elastic utility cloud services (Eucalyptus, OpenStack), intercloud services, and an ID service (UDT, replication).

Section 2.2 Distributed File Systems

Hadoop's Large Data Cloud

[Diagram] Hadoop's large data cloud provides storage services and compute services; Sector/Sphere is organized the same way.

Hadoop's Stack

[Diagram] Applications sit on top of Hadoop's MapReduce, data services, and NoSQL databases, which in turn run over the Hadoop Distributed File System (HDFS).

Pattern 2: Put the data into a distributed file system.

Hadoop Design
• Designed to run over commodity components that fail.
• Data is replicated, typically three times.
• Block-based storage.
• Optimized for efficient scans with high throughput, not low-latency access.
• Designed for write once, read many.
• Append operation planned for the future.

Hadoop Distributed File System (HDFS) Architecture

[Architecture diagram] A single Name Node handles control traffic from clients, while Data Nodes spread across racks serve the data to clients directly.

• HDFS is block-based.
• Written in Java. (A minimal client sketch follows.)
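To make the control/data split concrete, here is a minimal sketch of an HDFS client using the Hadoop Java API. It is illustrative only: the NameNode address and the path are assumptions, and in practice the configuration usually comes from core-site.xml.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address (older releases use fs.default.name).
        conf.set("fs.defaultFS", "hdfs://namenode.example.org:9000");
        FileSystem fs = FileSystem.get(conf);

        // Write a small file: the client streams the blocks to Data Nodes,
        // while only metadata operations go through the Name Node.
        Path path = new Path("/user/tutorial/hello.txt");
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("hello, HDFS\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read it back.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
            System.out.println(in.readLine());
        }
        fs.close();
    }
}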

Sector Distributed File System (SDFS) Architecture

• Broadly similar to the Google File System and the Hadoop Distributed File System.
• Uses the native file system; it is not block-based.
• Has a security server that provides authorizations.
• Has multiple master name servers, so there is no single point of failure.
• Uses UDT to support wide area operations.

Sector Distributed File System (SDFS) Architecture

[Architecture diagram] Multiple Master Nodes handle control traffic from clients, a Security Server controls access to the masters, and Slave Nodes spread across racks serve the data to clients directly.

• Sector is file-based (not block-based).
• Written in C++.
• Security server.
• Multiple masters.

GlusterFS Architecture
• No metadata server.
• No single point of failure.
• Uses algorithms to determine the location of data.
• Can scale out by adding more bricks.

[Architecture diagram] Clients talk directly to GlusterFS server bricks spread across racks; with no metadata server, the location of data is computed algorithmically.

• File-based.

Section 2.3 NoSQL Databases

Evolution
• Standard architecture for simple web applications:
  – Presentation: front-end, load-balanced web servers
  – Business logic layer
  – Backend database
• The database layer does not scale with large numbers of users or large amounts of data.
• Alternatives arose:
  – Sharded (partitioned) databases or master-slave databases
  – memcache

Scaling RDBMSs
• Master-slave database systems
  – Writes go to the master.
  – Reads go to the slaves.
  – Propagating writes to the slaves can be a bottleneck, and reads can be inconsistent.
• Sharded databases
  – Applications and queries must understand the sharding schema (see the routing sketch below).
  – Both reads and writes scale.
  – No native, direct support for joins across shards.
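As a toy illustration of why the application must understand the sharding schema, the routing logic usually amounts to hashing a shard key to pick a database. Everything below (the shard count, the JDBC URLs) is hypothetical.

import java.util.Arrays;
import java.util.List;

// Toy shard router: the application hashes the shard key (here, a user id)
// to choose which database holds that user's rows. Cross-shard joins would
// have to be assembled in application code.
public class ShardRouter {
    private final List<String> shardUrls;   // hypothetical JDBC URLs, one per shard

    public ShardRouter(List<String> shardUrls) {
        this.shardUrls = shardUrls;
    }

    public String shardFor(long userId) {
        int shard = (int) Math.floorMod(userId, (long) shardUrls.size());
        return shardUrls.get(shard);
    }

    public static void main(String[] args) {
        ShardRouter router = new ShardRouter(Arrays.asList(
                "jdbc:postgresql://shard0.example.org/app",
                "jdbc:postgresql://shard1.example.org/app",
                "jdbc:postgresql://shard2.example.org/app"));
        System.out.println("user 42 lives on " + router.shardFor(42L));
    }
}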

NoSQL Systems

• The name suggests no SQL support, but is also read as "Not Only SQL".
• One or more of the ACID properties are not supported.
• Joins are generally not supported.
• Usually flexible schemas.
• Some well-known examples: Google's BigTable, Amazon's Dynamo, and Facebook's Cassandra.
• Quite a few recent open source systems.

Pattern 3: Put the data into a NoSQL application.

CAP – Choose Two Per Operation

[Diagram] The three corners are Consistency (C), Availability (A), and Partition-resiliency (P).

• CA: available and consistent, unless there is a partition.
• AP: a reachable replica provides service even in a partition, but may be inconsistent (e.g. Dynamo, Cassandra).
• CP: always consistent, even in a partition, but a reachable replica may deny service without quorum (e.g. BigTable, HBase).

CAP Theorem
• Proposed by Eric Brewer, 2000.
• Three properties of a system: consistency, availability, and partitions.
• You can have at most two of these three properties for any shared-data system.
• Scale-out requires partitions.
• Most large web-based systems choose availability over consistency.

Reference: Brewer, PODC 2000; Gilbert and Lynch, SIGACT News 2002.

Eventual Consistency
• If no updates occur for a while, all updates eventually propagate through the system and all the nodes become consistent.
• Eventually, a node is either updated or removed from service.
• Can be implemented with a gossip protocol.
• Amazon's Dynamo popularized this approach.
• Sometimes this is called BASE (Basically Available, Soft state, Eventual consistency), as opposed to ACID.

Different Types of NoSQL Systems

• Distributed key-value systems
  – Amazon's S3 key-value store (Dynamo)
  – Voldemort
  – Cassandra
• Column-based systems
  – BigTable
  – HBase
  – Cassandra
• Document-based systems
  – CouchDB

HBase Architecture

[Architecture diagram] Clients reach HBase through a Java client or a REST API. An HBaseMaster coordinates a set of HRegionServers, each of which serves its regions from disk.

Source: Raghu Ramakrishnan

HRegionServer
• Records are partitioned by column family into HStores.
  – Each HStore contains many MapFiles.
• All writes to an HStore are applied to a single memcache.
• Reads consult the MapFiles and the memcache.
• Memcaches are flushed to disk as MapFiles (HDFS files) when full.
• Compactions limit the number of MapFiles.

[Diagram] Inside an HRegionServer, writes go to the memcache, which is flushed to MapFiles on disk; reads consult both.

Source: Raghu Ramakrishnan

Facebook's Cassandra

• Data model modeled after BigTable.
• Eventual consistency modeled after Dynamo.
• Peer-to-peer storage architecture using consistent hashing (Chord-style hashing); a minimal ring sketch follows.
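To show what consistent hashing buys in a peer-to-peer design like Dynamo or Cassandra, here is a minimal ring sketch. It is not Cassandra's code: the node names, virtual-node count, and MD5-based hash are illustrative choices.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.TreeMap;

// Minimal consistent-hash ring: keys and nodes hash onto the same ring, and a
// key is stored on the first node clockwise from its position. Adding or
// removing a node only moves the keys adjacent to it, which is what makes
// peer-to-peer scaling cheap.
public class ConsistentHashRing {
    private final TreeMap<Long, String> ring = new TreeMap<>();
    private final int virtualNodes;

    public ConsistentHashRing(int virtualNodes) { this.virtualNodes = virtualNodes; }

    public void addNode(String node) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.put(hash(node + "#" + i), node);
        }
    }

    public String nodeFor(String key) {
        Long slot = ring.ceilingKey(hash(key));
        if (slot == null) slot = ring.firstKey();   // wrap around the ring
        return ring.get(slot);
    }

    private static long hash(String s) {
        try {
            byte[] d = MessageDigest.getInstance("MD5")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) h = (h << 8) | (d[i] & 0xffL);
            return h;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        ConsistentHashRing ring = new ConsistentHashRing(16);
        ring.addNode("node-a");
        ring.addNode("node-b");
        ring.addNode("node-c");
        System.out.println("row key 'user:42' -> " + ring.nodeFor("user:42"));
    }
}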

Databases vs. NoSQL Systems

• Scalability: databases scale to 100s of TB; NoSQL systems to 100s of PB.
• Functionality: databases provide full SQL-based queries, including joins; NoSQL systems provide optimized access to sorted tables (tables with single keys).
• Optimization: databases are optimized for safe writes; data clouds are optimized for efficient reads.
• Consistency model: databases provide ACID (Atomicity, Consistency, Isolation & Durability), so the database is always consistent; NoSQL systems provide eventual consistency, so updates eventually propagate through the system.
• Parallelism: difficult in databases because of the ACID model, though shared-nothing designs are possible; the basic NoSQL design incorporates parallelism over commodity components.
• Scale: databases scale to racks; NoSQL systems scale to a data center.

Section 2.3 Case Study: Project Matsu

Zoom Levels / Bounds
• Zoom level 1: 4 images
• Zoom level 2: 16 images
• Zoom level 3: 64 images
• Zoom level 4: 256 images

Source: Andrew Levine

Build Tile Cache in the Cloud - Mapper

• Step 1: Input to the mapper. The mapper input key is a bounding box, e.g. (minx = -135.0, miny = 45.0, maxx = -112.5, maxy = 67.5); the mapper input value is the image for that box.
• Step 2: Processing in the mapper. The mapper resizes and/or cuts up the original image into pieces.
• Step 3: Mapper output. One output record per piece, with the piece's bounding box as the output key and the piece as the output value. (A simplified sketch follows.)

Source: Andrew Levine
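The mapper step can be sketched with the Hadoop MapReduce Java API. This is not the Project Matsu code: the bounding-box key format is an assumption, and the actual image slicing is left as a comment so the key/value flow stays visible.

import java.io.IOException;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Simplified sketch of a tile-cache mapper: the input key is a bounding box
// ("minx,miny,maxx,maxy"), the input value is the image bytes. The real job
// would crop and resize the image for each child box; here we only compute
// the four child bounding boxes and re-emit the value.
public class TileMapper extends Mapper<Text, BytesWritable, Text, BytesWritable> {
    @Override
    protected void map(Text key, BytesWritable image, Context context)
            throws IOException, InterruptedException {
        String[] b = key.toString().split(",");
        double minx = Double.parseDouble(b[0]), miny = Double.parseDouble(b[1]);
        double maxx = Double.parseDouble(b[2]), maxy = Double.parseDouble(b[3]);
        double midx = (minx + maxx) / 2, midy = (miny + maxy) / 2;

        // One output record per quadrant of the input bounding box.
        emit(context, minx, miny, midx, midy, image);
        emit(context, midx, miny, maxx, midy, image);
        emit(context, minx, midy, midx, maxy, image);
        emit(context, midx, midy, maxx, maxy, image);
    }

    private void emit(Context ctx, double minx, double miny, double maxx, double maxy,
                      BytesWritable image) throws IOException, InterruptedException {
        ctx.write(new Text(minx + "," + miny + "," + maxx + "," + maxy), image);
    }
}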

Build Tile Cache in the Cloud - Reducer

• Step 1: Input to the reducer. The reducer input key is a bounding box, e.g. (minx = -45.0, miny = -2.8125, maxx = -43.59375, maxy = -2.109375); the reducer input values are the image pieces for that box.
• Step 2: Reducer output. Assemble the images based on the bounding box.
  – Output to HBase.
  – Builds up layers for WMS for various datasets.

(A matching reducer sketch follows.)

Source: Andrew Levine
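A matching reducer sketch, again simplified and not the Project Matsu code: it receives all pieces that share a bounding box and would assemble them into one tile and write it to HBase; the assembly and the HBase write are stubbed out with comments.

import java.io.IOException;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Simplified sketch of the tile-cache reducer: all image pieces that share a
// bounding-box key arrive together. A real reducer would mosaic them into one
// tile and write it to HBase (for example via TableOutputFormat); here we just
// count the pieces and emit a placeholder value to show the shape of the job.
public class TileReducer extends Reducer<Text, BytesWritable, Text, BytesWritable> {
    @Override
    protected void reduce(Text boundingBox, Iterable<BytesWritable> pieces, Context context)
            throws IOException, InterruptedException {
        int count = 0;
        for (BytesWritable piece : pieces) {
            count++;   // a real job would draw `piece` into the mosaic here
        }
        // Placeholder output: the assembled tile bytes would go here.
        context.write(boundingBox, new BytesWritable(new byte[] { (byte) count }));
    }
}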

HBase Tables

• An Open Geospatial Consortium (OGC) Web Mapping Service (WMS) query translates to the HBase schema: layers, styles, projection, size.
• Table name: WMS layer
  – Row ID: bounding box of the image
  – Column family: style name and projection
  – Column qualifier: width x height
  – Value: buffered image

(An illustrative write against this layout follows.)
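As a rough illustration of this table layout, using the HBase Java client of that era (not code from Project Matsu), storing one tile could look like the following. The layer name, row-key format, and column-family string are assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

// Illustrative write of one tile into a WMS layer table. The row key is the
// bounding box, the column family encodes style and projection, the qualifier
// is the tile size, and the value is the encoded image.
public class StoreTile {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "wms_layer_landsat");      // hypothetical layer name

        byte[] rowKey = Bytes.toBytes("-135.0,45.0,-112.5,67.5");  // bounding box
        byte[] family = Bytes.toBytes("default_style_EPSG4326");   // style + projection
        byte[] qualifier = Bytes.toBytes("256x256");               // width x height
        byte[] tileBytes = new byte[0];   // placeholder for the encoded buffered image

        Put put = new Put(rowKey);
        put.add(family, qualifier, tileBytes);
        table.put(put);
        table.close();
    }
}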

Section 2.4 Distributed Key-Value Stores

S3

Pattern 4: Put the data into a distributed key-value store.

S3 Buckets
• S3 bucket names must be unique across AWS.
• A good practice is to use a pattern like tutorial.osdc.org/dataset1.txt for a domain you own.
• The file is then referenced as tutorial.osdc.org.s3.amazonaws.com/dataset1.txt.
• If you own osdc.org, you can create a DNS CNAME entry to access the file as tutorial.osdc.org/dataset1.txt.

S3 Keys

• Keys must be unique within a bucket.
• Values can be as large as 5 TB (formerly 5 GB).

S3 Security

• AWS access key: functions as your S3 user name. It is an alphanumeric text string that uniquely identifies a user.
• AWS secret key: functions as your password.

AWS Account Information

[Screenshot] The Access Keys section of the AWS account page shows the access key ID (the user name) and the secret access key (the password). (A minimal client sketch follows.)
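A minimal sketch with the AWS SDK for Java, reusing the bucket and object names from the S3 Buckets slide; the placeholder access key and secret key stand in for your own credentials.

import java.io.File;

import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3Client;
import com.amazonaws.services.s3.model.S3Object;

// The access key / secret key pair plays the user name / password role
// described above; the bucket is the key-value store's namespace.
public class S3Example {
    public static void main(String[] args) {
        AmazonS3 s3 = new AmazonS3Client(
                new BasicAWSCredentials("YOUR_ACCESS_KEY", "YOUR_SECRET_KEY"));

        // Upload a local file as the object "dataset1.txt" in the bucket.
        s3.putObject("tutorial.osdc.org", "dataset1.txt", new File("dataset1.txt"));

        // Fetch it back; the object body is readable as a stream.
        S3Object obj = s3.getObject("tutorial.osdc.org", "dataset1.txt");
        System.out.println("Content length: " + obj.getObjectMetadata().getContentLength());
    }
}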

Other Amazon Data Services

• Amazon SimpleDB
• Amazon's Elastic Block Storage (EBS)

Section 2.5 Moving Large Data Sets

The Basic Problem

• TCP was never designed to move large data sets over wide area high performance networks.
• As a general rule, reading data off disks is slower than transporting it over the network.

[Figure] TCP throughput (Mb/s, up to 1000) vs. round trip time (1-400 ms) for packet loss rates of 0.01%, 0.05%, 0.1%, and 0.5%; typical LAN, US, US-EU, and US-Asia RTTs are marked. Throughput drops sharply as RTT and packet loss grow. (A back-of-the-envelope model follows.)

Source: Yunhong Gu, 2007, experiments over a wide area 1G network.
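The shape of these curves follows the well-known Mathis et al. estimate, throughput ≈ MSS / (RTT · √loss). The sketch below only reproduces that trend for a few RTT and loss values; it is not the experiment behind the figure.

// Rough illustration of why TCP throughput collapses with RTT and loss,
// using the Mathis et al. estimate: throughput ≈ MSS / (RTT * sqrt(loss)).
// This reproduces the shape of the curves above, not the measured values.
public class TcpThroughputEstimate {
    public static void main(String[] args) {
        double mssBits = 1460 * 8;                      // typical MSS in bits
        double[] rttMs = {1, 10, 100, 200, 400};        // LAN to intercontinental
        double[] loss = {0.0001, 0.0005, 0.001, 0.005}; // 0.01% .. 0.5%

        for (double p : loss) {
            for (double rtt : rttMs) {
                double mbps = mssBits / ((rtt / 1000.0) * Math.sqrt(p)) / 1e6;
                System.out.printf("loss=%.2f%% rtt=%3.0fms -> ~%7.1f Mb/s%n",
                        p * 100, rtt, mbps);
            }
        }
    }
}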

The Solution

• Use parallel TCP streams (GridFTP).
• Use specialized network protocols (UDT, FAST, etc.).
• Use RAID to stripe data across disks to improve throughput when reading.
• These techniques are well understood in HEP and astronomy, but not yet in biology.

Case Study: Bio-mirror

"[The open source GridFTP] from the Globus project has recently been improved to offer UDP-based file transport, with long-distance speed improvements of 3x to 10x over the usual TCP-based file transport." -- Don Gilbert, August 2010, bio-mirror.net

Moving 113 GB of Bio-mirror Data

Site     RTT (ms)   TCP (min)   UDT (min)   TCP/UDT   Km
NCSA     10         139         139         1         200
Purdue   17         125         125         1         500
ORNL     25         361         120         3         1,200
TACC     37         616         120         5         2,000
SDSC     65         750         475         1.6       3,300
CSTNET   274        3,722       304         12        12,000

GridFTP TCP and UDT transfer times for 113 GB from gridip.bio-mirror.net/biomirror/blast/ (Indiana, USA). All TCP and UDT times are in minutes. Source: http://gridip.bio-mirror.net/biomirror/

Case Study: CGI 60 Genomes

• Trace by Complete Genomics showing the performance of moving 60 complete human genomes from Mountain View to Chicago using the open source Sector/UDT.
• Approximately 18 TB at about 0.5 Gb/s on a 1G link.

Source: Complete Genomics.

Resource Use

Protocol        CPU usage*     Memory*
GridFTP (UDT)   1.0% - 3.0%    40 MB
GridFTP (TCP)   0.1% - 0.6%    6 MB

*CPU and memory usage collected by Don Gilbert. He reports that rsync uses more CPU than GridFTP with UDT. Source: http://gridip.bio-mirror.net/biomirror/

Sector/Sphere

• Sector/Sphere is a platform for data intensive computing built over UDT and designed to support geographically distributed clusters.

Questions?

For the most current version of these notes, see rgrossman.com