
Distributed Data Management
Summer Semester 2013

TU Kaiserslautern

Dr.-Ing. Sebastian Michel

smichel@mmci.uni-saarland.de

Distributed Data Management, SoSe 2013, S. Michel

Lecture 2

BIG DATA
(The Big Data Challenge)

[Figure: comic strip; source: Dilbert by Scott Adams (cropped)]

What is Big Data?

• Massive amounts of data from a variety of sources
  – Web search logs
  – social networks and blogs
  – RFID and other sensor data
  – sales data
  – scientific data

& it is a big buzzword!

What is Big Data? (Cont’d)

• Big data is often associated with the NoSQL and MapReduce tools used to process it.
• Processed in and across gigantic data centers
• The term “Big Data” denotes not only size but also the things we want to/can do with it (benefits)

Traditional Handling

• Data warehousing, e.g., at Walmart, Ebay, etc. Also super big and constantly growing.
• But you know your data, know what you are looking for
• Schema is “small” enough to allow human input (admin)
• It is “just” YOUR data

“Simple” Case: Shopping Patterns

• Famous story:
  – statistician at target.com (large retailer in US)
  – task: figure out whether a woman is pregnant, even if she doesn’t want them to know
  – even more: roughly which week/month
  – Why? To sell products!

Read more: e.g., http://www.nytimes.com/2012/02/19/magazine/shopping-habits.html?pagewanted=all&_r=0

“Simple” Case: Use of Search Logs

• Swine Flu epidemic of 2009
• Google tracks the epidemic by following searches for flu-related topics.

source: Google

What is different now?

• Large amounts of heterogeneous data
• Take all the PBs together, not only your own (→ from TB to PB and EB)
• Manual human input hardly scales
• Who understands complex data and its schema (if there is one) anyway?
• Far more data than we can handle (with traditional means, and most probably beyond that)
• It is now beyond asking SQL queries.

The 4th Paradigm

• For scientific discovery, traditionally
  – experimental (for thousands of years)
  – theoretical (for hundreds of years)
  – computational (like simulations) (for a few decades)
  – Now: data driven (i.e., discovery through analyzing huge amounts of data)

Read on: http://research.microsoft.com/en-us/collaboration/fourthparadigm/

Data Science: What it takes

• many fields touched
  – math, statistics
  – data engineering
  – pattern recognition and learning
  – natural language processing
  – visualization
  – uncertainty modeling
  – data warehousing
  – high performance computing

The BIG Data Challenge: The 4 Vs

• Volume
  – Lots of data
• Velocity
  – Changing / growing data
• Variety
  – Heterogeneity
• Verity
  – True or not?

[Slide annotation: Addressed in this lecture]

According to Gartner and others.

Example: Trend Mining in Twitter

• Mine trends in text streams (Twitter, RSS feeds, etc.)
• No human input. Massive amount of noisy unstructured text data.
• Want to find trends like:
  #benedictXVI #retirement
  #schavan #guttenberg
  #armstrong #doping
  #cyprus #bankruptcy

Example (Cont’d): Sliding Window Model and Objective

• Data is valid only for a certain time
• Now: Detect change in co-occurrence, thus emerging trend!

[Figure: sliding window on a time axis; tag A and tag B shown at two points in evolving time]
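The sliding window idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `SlidingWindowCooccurrence`, the integer timestamps, and the use of the Jaccard coefficient as the co-occurrence measure are hypothetical choices, not taken from the lecture.

```python
from collections import deque


class SlidingWindowCooccurrence:
    """Keeps only documents that are newer than `window` time units."""

    def __init__(self, window):
        self.window = window
        self.docs = deque()  # (timestamp, frozenset of tags), oldest first

    def add(self, timestamp, tags):
        """Insert a document and drop everything that has expired."""
        self.docs.append((timestamp, frozenset(tags)))
        while self.docs and self.docs[0][0] <= timestamp - self.window:
            self.docs.popleft()

    def jaccard(self, tag_a, tag_b):
        """Share of documents with both tags among those with at least one
        (assumed measure; any co-occurrence statistic could be used)."""
        both = either = 0
        for _, tags in self.docs:
            has_a, has_b = tag_a in tags, tag_b in tags
            if has_a and has_b:
                both += 1
            if has_a or has_b:
                either += 1
        return both / either if either else 0.0


if __name__ == "__main__":
    w = SlidingWindowCooccurrence(window=3)
    w.add(1, {"cyprus"})
    w.add(2, {"bankruptcy"})
    print(w.jaccard("cyprus", "bankruptcy"))  # tags appear separately
    w.add(4, {"cyprus", "bankruptcy"})
    w.add(5, {"cyprus", "bankruptcy"})
    print(w.jaccard("cyprus", "bankruptcy"))  # old documents have expired
```

As older documents fall out of the window, the measure reflects only recent data, so a rise in its value over time hints at an emerging trend for the tag pair.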

Example (Cont’d): Prediction Model and Trend Ranking

[Figure: plot of Correlation, Prediction, and Error; x-axis 1 to 10, y-axis 0 to 1]

• Intensity of trend as prediction error
• Exponential smoothing forecast
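A minimal sketch of how these two points could fit together, assuming simple (single) exponential smoothing over a sequence of correlation values. The function name `trend_intensity` and the smoothing factor `alpha=0.5` are illustrative assumptions; the slides do not specify the exact model or its parameters.

```python
def trend_intensity(correlations, alpha=0.5):
    """Return the prediction error for each observation after the first.

    The forecast for the next step is the exponentially smoothed value
    of all observations seen so far; the error (observed minus forecast)
    serves as the intensity of the trend.
    """
    forecast = correlations[0]  # start the forecast at the first value
    errors = []
    for observed in correlations[1:]:
        errors.append(observed - forecast)
        # Simple exponential smoothing update (alpha is an assumed setting).
        forecast = alpha * observed + (1 - alpha) * forecast
    return errors


if __name__ == "__main__":
    # A tag pair whose correlation is flat and then jumps.
    print(trend_intensity([0.1, 0.1, 0.1, 0.9]))
```

A flat correlation series gives errors near zero, while a sudden jump gives a large error; ranking tag pairs by this error would put the strongest emerging trends first.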

Data Sources are Heterogeneous

• super fast, not controlled (noisy) text, little structure
• super fast, structured
• static, structured, administered

… so is the Data

• Music
• Publications
• Health Data
• KB of Entire Wikipedia

Why is Big Data Interesting?

• Novel insights about customers
  – Beyond pure shopping cart analyses and purchase history
  – Beyond running separate surveys/polls
• Social media involvement
• Demographic data
• (Purchase) trend prediction in social media (=> investment)
• Why? Money

Need to be Careful

• Not only are facts often wrong
• Statistics, too, can give wrong clues.
• With enough data you can “tell” anything
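The last point can be illustrated with a small simulation, offered only as a sketch: all series below are independent random noise, and the function names, the number of series, and their length are arbitrary assumptions. If enough unrelated series are compared against a target, one of them is very likely to look strongly correlated purely by chance.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def best_spurious_correlation(num_series=1000, length=10, seed=0):
    """Strongest absolute correlation between one random target series
    and `num_series` other, independent random series."""
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(length)]
    best = 0.0
    for _ in range(num_series):
        candidate = [rng.gauss(0, 1) for _ in range(length)]
        best = max(best, abs(pearson(target, candidate)))
    return best


if __name__ == "__main__":
    print(best_spurious_correlation())
```

None of the candidate series has any real relationship to the target, so a high value here says something about the number of comparisons, not about the data.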