Distributed*DataManagement - - TU Kaiserslautern · DBMS*vs.*DSMS* Distributed*DataManagement,*...

19
Distributed Data Management Summer Semester 2013 TU Kaiserslautern Dr.Ing. Sebas4an Michel [email protected] Distributed Data Management, SoSe 2013, S. Michel 1

Transcript of Distributed*DataManagement - - TU Kaiserslautern · DBMS*vs.*DSMS* Distributed*DataManagement,*...

Page 1: Distributed*DataManagement - - TU Kaiserslautern · DBMS*vs.*DSMS* Distributed*DataManagement,* SoSe*2013,*S.*Michel* 11 Database&management&system&(DBMS)& Data&stream&management&system&(DSMS)&

Distributed  Data  Management  Summer  Semester  2013  

TU  Kaiserslautern  

Dr.-­‐Ing.  Sebas4an  Michel    

[email protected]­‐saarland.de  

Distributed  Data  Management,  SoSe  2013,  S.  Michel   1  

Page 2: Distributed*DataManagement - - TU Kaiserslautern · DBMS*vs.*DSMS* Distributed*DataManagement,* SoSe*2013,*S.*Michel* 11 Database&management&system&(DBMS)& Data&stream&management&system&(DSMS)&

(DISTRIBUTED)  DATA  STREAM  PROCESSING  (INTRODUCTION)    

Lecture  8+  

Distributed  Data  Management,  SoSe  2013,  S.  Michel   2  

Page 3: Distributed*DataManagement - - TU Kaiserslautern · DBMS*vs.*DSMS* Distributed*DataManagement,* SoSe*2013,*S.*Michel* 11 Database&management&system&(DBMS)& Data&stream&management&system&(DSMS)&

So  Far:  Databases/NoSQL  Datastores  •  Data  is  changing,  yes,  but  this  is  more  due  to  inserts  and  update  to  stored  data  items  

•  Historic  data  is  kept  •  Queries  operate  on  full  data  (tables)  •  MapReduce  is  extreme,  Write-­‐once  &  Read-­‐many  4mes  

•  Data  warehousing,  too:  periodically  loading  data  in  store  for  deep(er)  analy4cs  

•  Data  mining  Distributed  Data  Management,  SoSe  2013,  S.  Michel   3  

Page 4: Distributed*DataManagement - - TU Kaiserslautern · DBMS*vs.*DSMS* Distributed*DataManagement,* SoSe*2013,*S.*Michel* 11 Database&management&system&(DBMS)& Data&stream&management&system&(DSMS)&

Data  Stream  Management  vs.  Tradi4onal  Data  Management    

•  At  query  4me,  data  is  accessed  as  a  whole  •  Data  is  persistently  stored  •  Queries  are  ad-­‐hoc  (mainly)  

Distributed  Data  Management,  SoSe  2013,  S.  Michel   4  

DATA  Base/Store  

Query  &  Results  Insert  

Update  

Delete  

Page 5: Distributed*DataManagement - - TU Kaiserslautern · DBMS*vs.*DSMS* Distributed*DataManagement,* SoSe*2013,*S.*Michel* 11 Database&management&system&(DBMS)& Data&stream&management&system&(DSMS)&

Data  Stream  Management  vs.  Tradi4onal  Data  Management  (Cont’d)    

•  Data  is  moving!  Con4nuously  generated  (assumed  infinite!)  •  At  high  pace  •  Queries  are  (mainly)  con4nuous  (aka.  standing).  Registered  

once,  observed  “forever”.  •  Answer  to  queries  in  (near)  real-­‐4me  required  (o`en)  •  Probabilis4c  methods  for  efficiency  or  considering  only  part  

of  the  stream  (sliding  window)  Distributed  Data  Management,  SoSe  2013,  S.  Michel   5  

DATA  STREAM  

Set  of  queries  

results  

Page 6: Distributed*DataManagement - - TU Kaiserslautern · DBMS*vs.*DSMS* Distributed*DataManagement,* SoSe*2013,*S.*Michel* 11 Database&management&system&(DBMS)& Data&stream&management&system&(DSMS)&

Sensor  Networks  •  (Distributed)  Sensor  Networks;  “Smart  Dust”  •  Mainly  numeric  measurements  of  (natural)  phenomena  

•  Computa4on  of  queries          like  max,  average,  min,  quan4les,          value  >  τ  

Distributed  Data  Management,  SoSe  2013,  S.  Michel   6  

Page 7: Distributed*DataManagement - - TU Kaiserslautern · DBMS*vs.*DSMS* Distributed*DataManagement,* SoSe*2013,*S.*Michel* 11 Database&management&system&(DBMS)& Data&stream&management&system&(DSMS)&

Mobile  Ad-­‐Hoc  Networks  

•  Connec4on  between  sensors  are  ad-­‐hoc  •  Efficient  and  reliable  rou4ng  •  Understanding  (changing)  topology  •  Power  consump4on  •  Also:  Vehicular  Ad-­‐Hoc  Netw.  

Distributed  Data  Management,  SoSe  2013,  S.  Michel   7  

Page 8: Distributed*DataManagement - - TU Kaiserslautern · DBMS*vs.*DSMS* Distributed*DataManagement,* SoSe*2013,*S.*Michel* 11 Database&management&system&(DBMS)& Data&stream&management&system&(DSMS)&

Social  Sensors  •  Explicitly:  Snow  Tweets  (hkp://snowcore.uwaterloo.ca/snowtweets/)  – #snowtweets  50.0  cm.  at  K1A  0A2    – #snowtweets  10.0  in.  at  20500  – #snowtweets  4  cm  at  Palmerston  North  4414  

•  Implicitly:  By  men4oning  topics,  people,  in  social  communica4on  

Distributed  Data  Management,  SoSe  2013,  S.  Michel   8  4me  

Page 9: Distributed*DataManagement - - TU Kaiserslautern · DBMS*vs.*DSMS* Distributed*DataManagement,* SoSe*2013,*S.*Michel* 11 Database&management&system&(DBMS)& Data&stream&management&system&(DSMS)&

9  source:  hkp://blog.socialflow.com/  

Earthquake  News  on  Twiker  

Page 10: Distributed*DataManagement - - TU Kaiserslautern · DBMS*vs.*DSMS* Distributed*DataManagement,* SoSe*2013,*S.*Michel* 11 Database&management&system&(DBMS)& Data&stream&management&system&(DSMS)&

Stock  Market  

•  Real-­‐4me  analysis  of  stock  marked  changes  •  Compu4ng  sta4s4cs  over  streams,  e.g.,  for  decision  support  

•  Opportuni4es  for  reac4ng  in  real-­‐4me    •  Even  with  fully  automated  means:  algorithmic  trading  

Distributed  Data  Management,  SoSe  2013,  S.  Michel   10  

Page 11: Distributed*DataManagement - - TU Kaiserslautern · DBMS*vs.*DSMS* Distributed*DataManagement,* SoSe*2013,*S.*Michel* 11 Database&management&system&(DBMS)& Data&stream&management&system&(DSMS)&

DBMS  vs.  DSMS  

Distributed  Data  Management,  SoSe  2013,  S.  Michel   11  

Database  management  system  (DBMS)   Data  stream  management  system  (DSMS)  Persistent  data  (rela4ons)     vola4le  data  streams  Random  access   Sequen4al  access  One-­‐4me  queries   Con4nuous  queries  

(theore4cally)  unlimited  secondary  storage  limited  main  memory  

Only  the  current  state  is  relevant   Considera4on  of  the  order  of  the  input  

rela4vely  low  update  rate   poten4ally  extremely  high  update  rate  

Likle  or  no  4me  requirements   Real-­‐4me  requirements  

Assumes  exact  data   Assumes  outdated/inaccurate  data  

Plannable  query  processing  Variable  data  arrival  and  data  characteris4cs  

h"p://en.wikipedia.org/wiki/Data-­‐stream_management_system  

Page 12: Distributed*DataManagement - - TU Kaiserslautern · DBMS*vs.*DSMS* Distributed*DataManagement,* SoSe*2013,*S.*Michel* 11 Database&management&system&(DBMS)& Data&stream&management&system&(DSMS)&

Data  Stream  Model  

•  Stream  of  data  items  is  unbounded  (available  memory  is  not)  

•  No  way  to  store  en4re  stream  (how  could  we,  its  (probably)  not  ending)  

•  To  compute  query  results,  need  to  devise  algorithm  with  likle  memory  consump4on  

Distributed  Data  Management,  SoSe  2013,  S.  Michel   12  

Page 13: Distributed*DataManagement - - TU Kaiserslautern · DBMS*vs.*DSMS* Distributed*DataManagement,* SoSe*2013,*S.*Michel* 11 Database&management&system&(DBMS)& Data&stream&management&system&(DSMS)&

Overview  of  Data  Stream  Topics  •  Synopses:  – concise  representa4ons  of  stream  content  –  tailored  to  tasks,  e.g.,  coun4ng  dis4nct  elements  – usually  not  exact,  but  approxima4ons  (es4mators)  of  true  values.    

•  Windows:  –  focus  of  certain  recent  subset  of  data  – computa4on  of  func4ons/joins  over  window(s)  content  

Distributed  Data  Management,  SoSe  2013,  S.  Michel   13  

Page 14: Distributed*DataManagement - - TU Kaiserslautern · DBMS*vs.*DSMS* Distributed*DataManagement,* SoSe*2013,*S.*Michel* 11 Database&management&system&(DBMS)& Data&stream&management&system&(DSMS)&

Data  Stream  Mining:  Teasers  

•  I  tell  you  integer  numbers  between  1  and  N  •  I  will  tell  all  but  one  number  

•  A`er  N-­‐1  numbers  I  ask:  which  number  was  missing.  

Distributed  Data  Management,  SoSe  2013,  S.  Michel   14  

481   324   122   412   871   231   849   447   641   …  

Page 15: Distributed*DataManagement - - TU Kaiserslautern · DBMS*vs.*DSMS* Distributed*DataManagement,* SoSe*2013,*S.*Michel* 11 Database&management&system&(DBMS)& Data&stream&management&system&(DSMS)&

Data  Stream  Mining:  Teasers  (Cont’d)  

•  Keep  Boolean  array  of  length  N:  – Mark  posi4on  for  observed  number  – Size  required:  N  – Computa4on  at  end:  N  to  find  missing  number  

Distributed  Data  Management,  SoSe  2013,  S.  Michel   15  

•  Much  beker:  – keep  sum  of  numbers:  S  – Missing  number  is  N*(N+1)/2  -­‐  S  

Page 16: Distributed*DataManagement - - TU Kaiserslautern · DBMS*vs.*DSMS* Distributed*DataManagement,* SoSe*2013,*S.*Michel* 11 Database&management&system&(DBMS)& Data&stream&management&system&(DSMS)&

Coun4ng  Occurrences  

•  Consider  a  stream  of  elements  ai        …,  a2,  a84,  a41,  a2,  a77,  a231,  a2,  a4,  a54,  …  

•  How  o`en  does  a2  occur?    

•  How  to  implement?  

Distributed  Data  Management,  SoSe  2013,  S.  Michel   16  

•  Keep  counter  for  each  id  •  Required  space  #ids  (=N)  •  Not  feasible  of  N  is  very  large    

Page 17: Distributed*DataManagement - - TU Kaiserslautern · DBMS*vs.*DSMS* Distributed*DataManagement,* SoSe*2013,*S.*Michel* 11 Database&management&system&(DBMS)& Data&stream&management&system&(DSMS)&

Probabilis4c  Coun4ng:  Count-­‐Min  Sketch  •  Keep  2-­‐dim  array  (h,  r)  •  h  hash  func4ons*  that  map  to  range  0…(r-­‐1)    

Distributed  Data  Management,  SoSe  2013,  S.  Michel   17  

Cormode,    Muthukrishnan  (2004).  An  Improved  Data  Stream  Summary:  The  Count-­‐Min  Sketch  and  its  Applica4ons.  J.  Algorithms  55:  29–38.  

0   1   2   3   4   5  

•  Arriving  item  a  •  For  each  j:        array[j,  hj(a)]++  

h1  

h2  

h3  

h4  

Page 18: Distributed*DataManagement - - TU Kaiserslautern · DBMS*vs.*DSMS* Distributed*DataManagement,* SoSe*2013,*S.*Michel* 11 Database&management&system&(DBMS)& Data&stream&management&system&(DSMS)&

Count-­‐Min  Sketch:  Coun4ng  

•  How  o`en  did  we  see  item  a?  •  h1(a)  =  4,  h2(a)=5,  h3(a)=0,  h4(a)=2  •  Take  minimum  of  the  corresponding  values  in  the  2-­‐d  array.  Here:  4  

•  Es4mate  is  never  underes4ma4ng  •  Overes4ma4on  probabilis4cally  bounded  

Distributed  Data  Management,  SoSe  2013,  S.  Michel   18  

5   3   4   4   9   3  

4   7   1   4   4   8  

8   4   6   7   2   1  

3   1   4   8   7   5  

0   1   2   3   4   5  h1  

h2  

h3  

h4  

9  

8  

8  

4  

Page 19: Distributed*DataManagement - - TU Kaiserslautern · DBMS*vs.*DSMS* Distributed*DataManagement,* SoSe*2013,*S.*Michel* 11 Database&management&system&(DBMS)& Data&stream&management&system&(DSMS)&

Outlook  

•  Some  more  basics  of  stream  processing    •  Few  more  fundamentals  of  data  stream  mining    •  Then,  window  based  data  stream  management  •  Then  going  distributed:  – Distributed  DSMS  – Distributed  (ad-­‐hoc)  sensor  networks  –  Scalable  computa4on  of  massive  data  streams  (think:  Hadoop  for  data  streams)  

Distributed  Data  Management,  SoSe  2013,  S.  Michel   19