DRAFT -- not for distribution

bigdata@csail  

Mission

The goal of bigdata@csail is to identify and develop the technologies needed to solve the next generation of data challenges, which will require the ability to scale well beyond what today's computing platforms, algorithms, and methods can provide. We want to enable people to truly leverage Big Data by developing platforms that are reusable, scalable, and easy to deploy across multiple application domains.

Our approach includes two key aspects. First, we will collaborate closely with industry to provide real-world applications and drive impact. Second, we view the Big Data problem as fundamentally multi-disciplinary. Our team includes faculty and researchers across many related technology areas, including algorithms, architecture, data management, machine learning, privacy and security, user interfaces, and visualization, as well as domain experts in finance, medicine, smart infrastructure, education, and science.

The Big Data Problem

We define big data as data that is too big, too fast, or too hard for existing tools to process. Here, “too big” means that organizations increasingly have to deal with petabyte-scale collections of data, which come from click streams, transaction records, sensors, and many other places. “Too fast” means that not only is data big, but it needs to be processed quickly – for example, to perform fraud detection at a point of sale or determine what ad to show to a user on a web page. “Too hard” is a catchall for data that doesn’t fit neatly into an existing processing tool, i.e., data that needs more complex analysis than existing tools can readily provide.

Examples  of  the  big  data  problem  abound.      

Web  Analytics  

On the Internet, many websites now register millions of unique visitors per day. Each of these visitors may access and create a range of content. This can easily amount to tens to hundreds of gigabytes per day (tens of terabytes per year) of accumulating user and log data, even for medium-sized websites. Increasingly, companies want to be able to mine this data to understand the limitations of their site, improve response time, offer more targeted ads, and so on. Doing this requires tools that can perform complicated analytics on data that far exceeds the memory of a single machine or even a cluster of machines.

Finance  

As another example, consider the big data problem as it applies to banks and other financial organizations. These organizations have vast quantities of data about consumer spending habits, credit card transactions, financial markets, and so on. This data is massive: for example, Visa processes more than 35B transactions per year; if they record 1 KB of data per transaction, this represents roughly 35 terabytes of data per year. Visa and the large banks that issue Visa cards would like to use this data in a number of ways: to predict customers at risk of default, to detect fraud, to offer promotions, and so on. This requires complex analytics. Additionally, this processing needs to be done quickly and efficiently, and needs to be easy to tune as new models are developed and refined.

Medical  

As a third example, consider the impact of new sensors on our ability to continuously monitor a patient's health. Recent advances in wireless networking, miniaturization of sensors via MEMS processes, and incredible advances in digital imaging technology have made it possible to cheaply deploy wearable sensors that monitor a number of biological signals on patients, even outside of the doctor’s office. These signals measure the functioning of the heart, brain, circulatory system, etc. Additionally, accelerometers and touch screens can be used to assess mobility and cognitive function. This creates an unprecedented opportunity for doctors to provide outpatient care, by understanding how patients are progressing outside of the doctor’s office, and when they need to be seen urgently. Additionally, by correlating signals from thousands of different patients, it becomes possible to develop a new understanding of what is normal or abnormal, and of what kinds of signal features are indicative of potentially serious problems.

Similar challenges arise across industry sectors including healthcare, finance, government, transportation, biotech, drug discovery, insurance, and retail, as well as across many scientific fields including astronomy, genomics, oceanography, physics, and biology.

Our  approach  

We believe the solution to big data is fundamentally multi-disciplinary. Our approach is to bring together world leaders in parallel architecture, massive-scale data processing, algorithms, machine learning, visualization, and interfaces to collectively identify and address the fundamental technology challenges we face with Big Data.

Our approach focuses on four broad research themes:

• Computational Platforms
• Scalable Algorithms

• Machine Learning and Understanding
• Privacy and Security

Below we briefly summarize these areas of research at MIT CSAIL, referencing specific research projects that are described in more detail at the end of this document.

Computational Platforms: We are building parallel data processing platforms, including SciDB, BlinkDB, and several cloud-based deployment platforms, including FOS and Relational Cloud. The goal of these platforms is to make it easy for developers of big data applications to write programs much as they would in a single-node computational environment, and to be able to rapidly deploy those applications on tens or hundreds of nodes. Additionally, as the computation and storage requirements of applications change, these platforms should be able to dynamically and elastically adapt to those changes.

Scalable  Algorithms:    We  are  developing  a  range  of  algorithms  designed  to  deal  with  very  large  volumes  of  data,  and  to  process  that  data  in  parallel.    These  include  parallel  implementations  of  a  range  of  known  algorithms,  including  matrix  computations,  as  well  as  statistical  operations  like  regression,  optimization  methods  like  gradient  descent,  and  machine  learning  algorithms  like  clustering  and  classification.      

In addition, we are developing fundamentally new types of algorithms designed to handle the challenges of Big Data. For example, we are working on sublinear algorithms that can compute a range of statistics, such as estimates of the number of distinct items in a set, using space that is exponentially smaller than the input. Additionally, we are developing new algorithms for encoding, comparing, and searching massive data sets; specific examples include hash-based similarity search on massive-scale data, algorithms for compressed sensing that provide a new way to encode the sparse data that arise in a number of scientific applications, and algorithms for computing the Fourier Transform that are faster than the FFT for sparse data.
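
To make the sublinear idea concrete, the sketch below gives a toy distinct-count estimator in the spirit of Flajolet-Martin: each item is hashed and only the longest run of leading zero bits ever seen is retained, so the memory footprint is constant no matter how long the stream is. This is an illustrative Python sketch, not one of the specific algorithms under development here, and a single hash function like this has high variance.

    import hashlib

    def leading_zeros(h: int, bits: int = 64) -> int:
        # Number of leading zero bits in a 64-bit hash value.
        return bits - h.bit_length()

    def estimate_distinct(items) -> int:
        # Flajolet-Martin-style estimator: track only the maximum number of
        # leading zeros seen across all hashed items (constant space).
        max_zeros = 0
        for item in items:
            h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
            max_zeros = max(max_zeros, leading_zeros(h))
        # If the longest run of leading zeros is r, roughly 2^r distinct values were seen.
        return 2 ** max_zeros

    # A stream of one million items with about 10,000 distinct values.
    stream = [i % 10_000 for i in range(1_000_000)]
    print(estimate_distinct(stream))  # rough estimate near 10,000 (can be off by 2x or more)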

Machine Learning and Understanding: On top of these algorithms, we are deploying a number of novel machine learning applications focused on machine understanding in specific domains. For example, in work on scene understanding in images, we are building tools that automatically label parts of an image, or that classify an image as belonging to a certain category or categories based on the types of objects that appear in it. As a second example, we are using natural language processing to convert massive quantities of tweets and text reviews on the web into structured information about products, restaurants, and services, indicating the type of content in some text (e.g., a food review, a rating), an assessment of the sentiment of the text, etc.

Privacy and Security: Finally, because much of the mining and analysis involved in a big data context involves sensitive, private information, we are working on technologies and policies for protecting and anonymizing data, and for allowing people to retain control over their data. As an example, in the CryptDB project, we are building a database system that stores data in an encrypted format in the cloud, in such a way that a curious database or system administrator cannot decrypt the data. Users retain the encryption keys to their data, but have the ability to execute queries over that encrypted data on the database server, enabling much better performance than simply sending the data back and decrypting it on the client’s machine.

Work in these four areas is coupled with application experts in Finance (Professor Andrew Lo), Medicine (Professor John Guttag), Smart Infrastructure (Balakrishnan and Madden), Education (through a relationship with the MITx initiative), and Science (Stonebraker).

Membership Model

The goal of the bigdata@csail membership model is to promote in-depth interactions between industry and academia. Member companies will have the opportunity to be exposed to multiple research projects that span the work of about 20 MIT faculty and researchers, including their postdocs and students. The model has two components: bigdata@csail membership and optional additional engagements.

Membership  

bigdata@csail will involve a select group of member companies (approximately 10-15). There is an annual membership fee of $150K per company, to be provided by each member in the form of an unrestricted gift, with the expectation of an initial three-year commitment.

The  membership  fees  will  be  used  to  support  the  operation  of  the  initiative  and  provide  seed  funding  for  new  ideas  and  projects.    Our  faculty  will  continue  to  raise  research  support  from  NSF,  DARPA,  and  other  organizations  to  significantly  amplify  this  industrial  funding,  leveraging  the  investment  from  all  our  member  companies.  

Membership  provides  the  company  with  the  following  benefits:  

1. Each member company can appoint one representative to the advisory board of bigdata@csail, which will advise and provide feedback to the directors on research directions and priorities

2. Diversified seed funding of about 3-5 early-stage projects

3. Early exposure to a larger set of sponsored projects in the area of big data

4. In-depth interactions and shared learning on topics of particular interest to each member company -- these topics are chosen in consultation with the company representative on the advisory board

5. Interactions with graduate students for recruitment and internships

6. Annual meetings in which the students and faculty present relevant research and results, and the companies provide feedback and discuss the work in relation to industry

7. Discussions on key topics of interest to members

8. Ad-hoc interactions with members on an as-needed basis; bigdata@csail directors will facilitate connections between companies and researchers

9. Notifications of events, latest news, and publications

Optional  additional  engagements    

Members may engage in company-specific activities through separate agreements. For example, if a member company wishes to have CSAIL host one of its employees, this may be arranged via an Industry Visitor Agreement. Further, if a member company becomes highly interested in a particular research project and wants to sponsor future development of that project, this may also be arranged via a CSAIL Sponsored Research Agreement providing additional project-specific funding.

Directors

The Director of bigdata@csail is Professor Samuel Madden, CSAIL Principal Investigator.

Intellectual Property

The overall goal of bigdata@csail is to conduct basic research that will have a significant impact over a long time scale. Given the nature of our intended research, MIT anticipates that most of the research results and technology will be placed into the public domain via publication and open-source licensing. However, in certain cases, MIT may decide to obtain intellectual property protection for certain research results and license use of that technology under those intellectual property rights, as the most effective way to transfer technology we develop to industry for economic benefit to society.

MIT  Principal  Investigators    

Madden, Stonebraker, Lo, Barzilay, Fisher, Jaakkola, Karger, Miller, Oliva, Torralba, Rubinfeld, Guttag, Amarasinghe, Indyk, Pentland, Devadas, Glass, Balakrishnan, Zeldovich, Freeman

Example  Big  Data  Projects      

The following are examples of sponsored projects conducted by MIT Principal Investigators, illustrating the breadth and depth of the work underway at CSAIL.

UNDERSTANDING  

[Finance] Detecting Defaults - Andrew Lo et al. The goal of this research is to develop analytical models that can predict when a consumer is at risk of defaulting on a loan, based on their recent financial transactions. In a test on 1.5 TB of data from a major financial institution, the developed models were able to predict defaults much more accurately than traditional measures like FICO scores.

[Energy] Hydrocarbon Exploration - Indyk, Jaakkola, Poggio, Freeman, et al. In this project, the goal is to identify boundaries between different types of underground rock using seismic sensors. Such boundaries are of interest in hydrocarbon exploration, as they are places where oil is often present. These sensors produce massive streams of data that need to be mined to understand the location of boundaries. Researchers are working on these mining algorithms, as well as on advanced compression and encoding techniques to compactly summarize these data streams.

[Smart Transportation] CarTel - Balakrishnan and Madden. The goal of CarTel (“car telecommunications”) is to investigate how sensor-equipped cars and smartphones can be used to capture information about the transportation network and the urban environment in general. Example results include an interactive map of the biggest potholes in Cambridge and Boston, collected using car-mounted accelerometers, and traffic-aware routing, where real-time traffic delays reported from cars are used to find the fastest driving routes.

[Social] Influence Modeling - Alex Pentland et al. The goal of this project is to learn how people inside of large organizations influence each other, and to track the flow of influence throughout an organization. Relationships can be modeled as graphs, with edges indicating the degree of influence. Weights are learned from a variety of data sources, including personal communication and data gathered from sensors about face-to-face interaction. In large organizations, there can be billions of pieces of information that need to be incorporated into this influence graph, and the calculations to track influence throughout the graph are not readily expressed in existing query processing or database systems.
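
To give a flavor of the kind of graph computation involved (an illustrative sketch only, not the project's actual model), the Python fragment below propagates influence scores over a hypothetical weighted directed graph with a PageRank-style iteration, i.e., a loop over the whole edge set repeated many times, which is awkward to express as a single query in a conventional database system.

    # Illustrative sketch only (not the project's actual model): PageRank-style
    # influence propagation over a weighted directed graph. Edge (u, v, w) means
    # u influences v with strength w.
    def influence_scores(edges, num_iters=50, damping=0.85):
        nodes = {u for u, v, w in edges} | {v for u, v, w in edges}
        out_weight = {u: 0.0 for u in nodes}
        for u, v, w in edges:
            out_weight[u] += w
        score = {u: 1.0 / len(nodes) for u in nodes}
        for _ in range(num_iters):
            new = {u: (1 - damping) / len(nodes) for u in nodes}
            for u, v, w in edges:
                if out_weight[u] > 0:
                    new[v] += damping * score[u] * (w / out_weight[u])
            score = new
        return score

    # Toy organization: A influences B strongly, B influences C, C influences A weakly.
    edges = [("A", "B", 3.0), ("B", "C", 1.0), ("C", "A", 0.5)]
    print(influence_scores(edges))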

[Social] TwitInfo - Karger, Miller, Madden, et al. TwitInfo extracts a series of tweets that match a keyword from Twitter and arranges them on a timeline, providing a quick summary of a collection of tweets on a topic in a simple visualization. The key idea is to identify “peaks” in the frequency of tweets that represent interesting occurrences in time (e.g., points scored in a sporting event, or a major speech by a politician), and then to assign labels to peaks using information retrieval techniques. A related system, called TweeQL, is used to implement TwitInfo; TweeQL provides a SQL-like streaming language for running queries over the Twitter stream in real time.
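
As a rough illustration of what peak identification can look like (a simplified sketch, not necessarily the algorithm TwitInfo actually uses), the fragment below keeps an exponentially weighted estimate of the typical per-window tweet count and its deviation, and flags a window as a peak when its count exceeds the running mean by several deviations.

    # Simplified peak detection over per-window tweet counts (illustrative only).
    # A window is flagged as a peak when its count exceeds the running mean by
    # `threshold` mean deviations.
    def find_peaks(counts, alpha=0.125, threshold=3.0):
        peaks = []
        mean = float(counts[0])
        meandev = 1.0
        for i, c in enumerate(counts[1:], start=1):
            if meandev > 0 and (c - mean) / meandev > threshold:
                peaks.append(i)
            # Update the exponentially weighted mean and mean deviation.
            meandev = (1 - alpha) * meandev + alpha * abs(c - mean)
            mean = (1 - alpha) * mean + alpha * c
        return peaks

    # Toy example: a burst of activity starting around window 10.
    counts = [5, 6, 4, 5, 7, 5, 6, 5, 4, 6, 40, 55, 30, 7, 5, 6]
    print(find_peaks(counts))  # flags the onset of the burst, e.g. [10, 11]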

[Social] Condensr - Barzilay et al. Condensr is a review summarization system that processes Yelp restaurant reviews and categorizes them, breaking down reviews into comments about food, ambience, service, and value, as well as giving an overall summary of reviewer sentiment. The goal is to go beyond a simple star rating to give the overall consensus of diners about various aspects of a restaurant experience.

[Images] Large Scale Vision - A. Torralba and A. Oliva. The goal of this project is to study computer and human vision when large amounts of visual data become available. We are developing the Scene UNderstanding (SUN) database, a large database of images found on the web, organized by scene type and fully segmented and annotated. With this large database we are developing computer vision algorithms for scene understanding that combine a large training set with non-parametric (memory-based) methods. In parallel, we are also studying how humans memorize large amounts of visual information. As a result, we try to understand which representations might be useful for developing new, efficient computer vision algorithms, and also how computer vision models of human memory can be used to predict which images will be remembered.

ALGORITHMS  

Machine Learning - Jaakkola. Modern use of data relies heavily on predictive modeling. Machine learning methods are needed to distill large, heterogeneous, and fragmented data sources into useful pieces of information such as answers to search queries, purchasing patterns of customers, or likely actions of mobile users. This research focuses on predicting the behavior of mobile users -- actions they are likely to take in any particular context -- based on a collection of intermittent sensors such as GPS, WiFi, accelerometers, and others. Our goal is to develop methods that will be useful more broadly. Our work addresses the following key problems: 1) scaling to realistic problem sizes, 2) robustness, and 3) maintaining privacy even as data are used collaboratively.

Faster Fourier Transform - Indyk, Katabi et al. The Sparse Fast Fourier Transform (sFFT) is a new class of highly efficient algorithms for computing the frequency spectrum of a signal. The algorithms work for signals whose spectrum is sparse, i.e., signals that consist of a small number of dominating frequencies. Such signals often occur in areas such as image/audio/video compression, signal processing, and data communication. For such signals, the algorithms are significantly faster than state-of-the-art algorithms based on the Fast Fourier Transform (FFT). The goal of this project is to develop more efficient variants and implementations of sFFT, and apply them to concrete massive data problems.

Tunable Fast Similarity Search for High-Dimensional Data - Indyk et al. Locality-Sensitive Hashing (LSH) is an efficient algorithm for finding pairs of similar (or highly correlated) objects in a database without enumerating all pairs of such objects. Example applications include searching for near-duplicate documents, similar images, highly correlated stocks, etc. Although the algorithm is very fast, one can envision further improvements in its efficiency by adapting it to specific data sets. The goal of this project is to develop tools and techniques for performing such tuning.
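
The sketch below shows the basic LSH idea with random-hyperplane (SimHash-style) signatures for cosine similarity; it is a minimal Python illustration rather than the tuned variants this project develops. Vectors that land in the same hash bucket become candidate near neighbors, so only those pairs need an exact similarity check.

    import random

    # Minimal random-hyperplane LSH for cosine similarity (illustrative only).
    # Vectors whose bit signatures collide are candidate near neighbors.
    def make_hyperplanes(dim, num_bits, seed=0):
        rng = random.Random(seed)
        return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_bits)]

    def signature(vec, hyperplanes):
        # One bit per hyperplane: which side of the hyperplane the vector lies on.
        bits = 0
        for h in hyperplanes:
            dot = sum(v * w for v, w in zip(vec, h))
            bits = (bits << 1) | (1 if dot >= 0 else 0)
        return bits

    # Index a small set of vectors, then look up candidates for a query vector.
    planes = make_hyperplanes(dim=4, num_bits=8)
    data = {"a": [1.0, 0.9, 0.0, 0.1], "b": [0.9, 1.0, 0.1, 0.0], "c": [-1.0, 0.2, 0.9, 0.0]}
    table = {}
    for name, vec in data.items():
        table.setdefault(signature(vec, planes), []).append(name)

    query = [1.0, 1.0, 0.05, 0.05]
    print(table.get(signature(query, planes), []))  # likely ['a', 'b']; 'c' rarely collides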

COMPUTATIONAL PLATFORMS

SciDB - Stonebraker and Madden. The vast majority of machine learning, statistical, and scientific operations can be expressed via a small number of linear algebra operations. SciDB is a database system designed to support scalable linear algebra over massive arrays stored on the disks of a large cluster of machines. It is much faster than relational databases on these types of workloads, and scales to much larger datasets than main-memory, matrix-oriented systems like Matlab and R.
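
To illustrate the claim that common analytics reduce to a few array operations (this example uses plain NumPy rather than SciDB's own interface), ordinary least-squares regression comes down to a couple of matrix multiplies and one solve, exactly the kind of bulk array work such a system can parallelize across a cluster.

    import numpy as np

    # Ordinary least-squares regression expressed as a handful of array operations
    # (plain NumPy for illustration; an array database would run similar operations
    # over disk-resident arrays spread across a cluster).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 5))                     # observations x features
    true_w = np.array([2.0, -1.0, 0.5, 0.0, 3.0])
    y = X @ true_w + rng.normal(scale=0.1, size=1000)  # noisy responses

    # Normal equations: solve (X^T X) w = X^T y
    w = np.linalg.solve(X.T @ X, X.T @ y)
    print(w)  # recovers values close to true_w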

BlinkDB - Madden et al. BlinkDB is a database system that runs on top of Hadoop (MapReduce), accepting SQL queries and translating them into MapReduce jobs. The key idea is that rather than running queries over the entire data set, it runs queries on a random (precomputed) sample of the data, and uses sampling theory to estimate the true query answer.
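
A toy version of the sampling idea (not BlinkDB's implementation) is sketched below: an aggregate is computed on a small precomputed random sample, and standard sampling theory supplies an error bar, so the query never has to scan the full table.

    import math
    import random

    # Toy approximate aggregation (not BlinkDB's implementation): estimate AVG
    # from a precomputed 1% random sample and attach a confidence interval.
    random.seed(0)
    full_table = [random.lognormvariate(3.0, 1.0) for _ in range(1_000_000)]  # e.g. order values

    sample = random.sample(full_table, 10_000)  # the precomputed 1% sample

    n = len(sample)
    mean = sum(sample) / n
    var = sum((x - mean) ** 2 for x in sample) / (n - 1)
    stderr = math.sqrt(var / n)

    # Approximate answer with a ~95% confidence interval, from 1% of the data.
    print(f"AVG ~= {mean:.2f} +/- {1.96 * stderr:.2f}")
    print(f"exact AVG = {sum(full_table) / len(full_table):.2f}")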

Execution Migration Machine - Devadas et al. The Execution Migration Machine (EM²) is a novel data-centric multicore memory system architecture based on computation migration. Unlike traditional distributed memory multicores, which rely on complex cache coherence protocols to move the data to the core where the computation is taking place, our scheme always moves the computation to the core where the data resides. By doing away with the cache coherence protocol, we can boost the effectiveness of per-core caches while drastically reducing hardware complexity. Experimental results on a range of SPLASH-2 and PARSEC benchmarks indicate that EM² can significantly improve per-core cache performance in comparison to directory-based cache-coherent architectures, decreasing overall miss rates by as much as 84% and reducing average memory latency by up to 58%.

Crowd Computing - Miller et al. The goal of this work is to build and study systems that orchestrate small contributions from a crowd of people. Examples include Soylent, an add-in to Microsoft Word that uses crowd contributions to perform interactive document shortening, proofreading, and human-language macros, and TurKit, a Java/JavaScript API for running iterative tasks on Mechanical Turk.

PRIVACY  &  SECURITY  

CryptDB - Balakrishnan, Kaashoek, Madden, Zeldovich, et al. CryptDB is a system for processing queries over an encrypted database. The key idea is that, in a cloud-based setting, a database may be stored on machines that aren't completely trusted, and so keeping it encrypted may be necessary. In such a setting, processing queries naively would require transmitting the entire encrypted database back for local processing. Instead, in CryptDB, special types of encryption are used that protect the data while allowing queries to be processed on it; in this way, the user can encrypt his queries, send them to the database, and receive encrypted answers back while transferring far less data than the naïve solution requires.
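
The fragment below is a toy illustration of one building block of this style of system, namely answering an equality query over data the server cannot read. It is not CryptDB's actual design, and the "encryption" used here is deliberately simplistic and insecure; it is only meant to show where the keys live and what the server sees.

    import hashlib
    import hmac
    import os

    # Toy sketch (NOT CryptDB's design, and not secure): the server stores a
    # deterministic tag for the searchable column plus an "encrypted" payload,
    # so it can answer equality queries without ever holding plaintext or keys.
    TAG_KEY = os.urandom(32)  # held by the client only
    ENC_KEY = os.urandom(32)  # held by the client only

    def det_tag(value: str) -> str:
        # Deterministic tag: equal plaintexts map to equal tags, enabling equality search.
        return hmac.new(TAG_KEY, value.encode(), hashlib.sha256).hexdigest()

    def toy_encrypt(value: str) -> bytes:
        # Stand-in for a real cipher: XOR with a key-derived pad (insecure, demo only).
        pad = hashlib.sha256(ENC_KEY).digest() * (len(value) // 32 + 1)
        return bytes(a ^ b for a, b in zip(value.encode(), pad))

    def toy_decrypt(blob: bytes) -> str:
        pad = hashlib.sha256(ENC_KEY).digest() * (len(blob) // 32 + 1)
        return bytes(a ^ b for a, b in zip(blob, pad)).decode()

    # The "server" sees only tags and ciphertexts, never names or balances.
    server_rows = [(det_tag(name), toy_encrypt(payload))
                   for name, payload in [("alice", "balance=100"),
                                         ("bob", "balance=250"),
                                         ("alice", "balance=70")]]

    # Client query: SELECT * WHERE name = 'alice'. The server matches tags only;
    # the client decrypts the returned ciphertexts locally.
    query_tag = det_tag("alice")
    matches = [blob for tag, blob in server_rows if tag == query_tag]
    print([toy_decrypt(blob) for blob in matches])  # ['balance=100', 'balance=70']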