Page 1

School of Computer Science, Carnegie Mellon University

Big Arctic Data

Evangelos (Vagelis) Papalexakis, School of Computer Science, Carnegie Mellon University

Arctic Analysis 2014, Greenland

Page 2

Roadmap

• Motivation & Introduction

2

Page 3

Eric Fisher, "See something, say something"

3

Page 4

http://socialgraph.blogspot.com/2010/12/facebook-map-of-world-visualising.html

4

Page 5

How big is big?

Slide adapted from: http://graphlab.com/learn/presentations.html
Picture from: http://web.netenrich.com/Portals/128884/images/FB_SERVER_040_x900.jpg

Need many data centers to store the data

100 Hours a Minute on YouTube
28 Million Wikipedia Pages
1 Billion Facebook Users
6 Billion Flickr Photos

5

Page 6

Definition – The 3 V’s

• Volume: hard to store
• Variety: very diverse/rich
• Velocity: coming in faster than we can handle

6

Page 7

Success story

http://www.forbes.com/sites/kashmirhill/2012/02/16/how-target-figured-out-a-teen-girl-was-pregnant-before-her-father-did/

• Target assigns every customer an ID number, tied to their credit card (or name, or email), and gathers any additional information it can
• A combination of lotions and multivitamins was a strong predictor of the early stages of pregnancy
• Target figured out that a girl was pregnant before her father did: it sent her flyers with baby-related merchandise, and her father was furious. After a pregnancy test, they found out that the girl was indeed pregnant. More impressive: Target was able to estimate the due date somewhat accurately

7

Page 8

Success Story 2

• Google Translate: use large-scale "dirty" data instead of hoping for high-quality annotated data
• Many training instances found "in the wild": let the data guide the machine translation instead of using very complicated models

Alon Halevy et al. The Unreasonable Effectiveness of Data, IEEE Intelligent Systems 2009 

8

Page 9

But I don’t have that much data! Why should I care?

• Even with small/medium data one can benefit by borrowing ideas: speed up algorithms, improve memory efficiency

9

Page 10

Roadmap

• Matlab is great

10

Page 11

Matlab is great!

• Powerful tool
• Great implementations of matrix algorithms: eigendecomposition, Singular Value Decomposition, basic matrix operations
• Vector-based operations
• Instant "debugging" by plotting
• All of the above make it a great prototyping tool for math-intensive data analysis

11

Page 12

Data representation matters

• The original data size is often deceptive
• The data that we actually analyze ends up being much smaller in terms of storage necessary and number of observations
• Need to represent data carefully

12

Page 13

Sparse vs. dense storage

RAW DATA: Liquid-Chromatography Mass-Spectrometry (LC-MS) measurements form a three-way array of 29 mixtures x 18989 m/z values x 1054 retention times, which is only 1.55% dense.
Dense storage: ~ 4.4 GB. Sparse storage: ~ 275 MB.

LC-MS data is usually converted into a set of peaks and treated as a two-way array, i.e., mixtures by peaks, where each peak is a (m/z, retention time) pair. The original raw data is a three-way array, and we can explore its underlying structure by taking advantage of sparsity.

Note that this is a very small data set with only 29 samples!

Slide borrowed from Evrim Acar

13
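A back-of-the-envelope check of those numbers (a Python sketch, assuming 8-byte double values and a coordinate-format sparse representation; the exact figures depend on the storage format):

# Storage arithmetic for the 29 x 18989 x 1054 LC-MS tensor, assuming 8-byte doubles
# and a COO-style sparse format with three 8-byte indices per nonzero (a sketch).
n_mixtures, n_mz, n_rt = 29, 18989, 1054
density = 0.0155

n_cells = n_mixtures * n_mz * n_rt
dense_bytes = n_cells * 8                    # one double per cell
nnz = int(n_cells * density)
sparse_bytes = nnz * (3 * 8 + 8)             # three indices + one value per nonzero

print(f"dense  ~ {dense_bytes / 2**30:.1f} GB")   # ~ 4.3 GB
print(f"sparse ~ {sparse_bytes / 2**20:.0f} MB")  # ~ 275 MB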

Page 14

Tensor Toolbox

• Matlab toolbox for tensor computations
• Support for sparse tensor storage and computation: Matlab does not inherently support that; careful implementation of sparse computations for efficiency [1]
• Available at http://www.sandia.gov/~tgkolda/TensorToolbox/index-2.5.html

[1] Bader & Kolda, Efficient MATLAB computations with sparse and factored tensors, SIAM JSC’07 

14

Page 15

Matlab Parallel Computing Toolbox

• Support for parallel computations
• Provides "parallel for" (parfor): shared-memory parallel execution; the loop iterations have to be independent; you need to write them carefully… but it pays off!
• Can run code on multiple cores/CPUs or even clusters
• Can run random restarts of an algorithm in parallel, as in the sketch below
• Later today: example of using the above with sampling for fast PARAFAC
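A rough Python analogue of the random-restarts pattern (multiprocessing stands in for MATLAB's parfor here; the toy objective and function names are made up for illustration):

# Independent random restarts run in parallel; each restart is a toy random search.
from multiprocessing import Pool
import random

def one_restart(seed):
    # Hypothetical stand-in for one randomly-initialized run of an iterative algorithm.
    rng = random.Random(seed)
    best_x, best_f = None, float("inf")
    for _ in range(10_000):
        x = rng.uniform(-10, 10)
        f = (x - 3.0) ** 2
        if f < best_f:
            best_x, best_f = x, f
    return best_f, best_x

if __name__ == "__main__":
    with Pool(processes=4) as pool:        # restarts are independent, so they parallelize
        results = pool.map(one_restart, range(8))
    print(min(results))                    # keep the best restart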

15

Page 16

Roadmap

• Map/Reduce

16

Page 17

Map/Reduce Motivation

• Developed by Google
 Many terabytes of crawled webpages (mainly text)
 Need to create an inverted index: for each word, find how many documents contain it (useful for web search)
 Many "cheap"/commodity machines at their disposal: faulty and not efficient as units, but potentially very powerful if combined together

17

Page 18

The Map/Reduce Framework

• Map/Reduce:
 Provides a distributed file system (GFS – Google File System) where files are stored in the cloud
 Sees everything as <key, value> pairs
 Provides a Map() function; the system gathers data records with the same key on one worker machine
 Provides a Reduce() function, which tells the system how to combine the values of all records with the same key

18

Page 19

The Map/Reduce Framework

• Abstracts the computation into a Map() & Reduce() pair
• Can have chains of Map/Reduce operations: most non-elementary algorithms need more than one Map/Reduce operation!
• The programmer does not need to know details about the cluster

19

Page 20

Apache Hadoop

• Open source M/R implementation by Apache
• Provides HDFS (Hadoop Distributed File System)
• Mostly programmed in Java & Python

20

Page 21

Hadoop’s inner workings by example

Image from: http://blog.trifork.com/2009/08/04/introduction-to-hadoop/

21

Page 22

Map function

Image from: http://blog.trifork.com/2009/08/04/introduction-to-hadoop/

22

Page 23

Reduce function

Image from: http://blog.trifork.com/2009/08/04/introduction-to-hadoop/

23

Page 24

Putting it all together

Image from: http://blog.trifork.com/2009/08/04/introduction-to-hadoop/
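The word-count walkthrough above relies on figures; as a stand-in, here is a minimal Hadoop Streaming-style mapper and reducer in Python (a sketch assuming tab-separated <word, count> pairs, not the code from the original slides):

#!/usr/bin/env python
# mapper.py -- emits one <word, 1> pair per word, tab-separated.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

#!/usr/bin/env python
# reducer.py -- sums the counts for each word. Hadoop sorts mapper output by key,
# so all lines for the same word arrive consecutively.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")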

24

Page 25

Matrix Multiplication in Hadoop

• Slightly more complicated example
• Have two matrices A (m x n) and B (n x p), stored (in a single file) as (matrix, row, col, value) tuples
• How can we multiply them on Hadoop?

A 0 0 5

A 0 1 4

A 1 1 2

B 0 0 7

B 0 1 1

25

Page 26

Map()

Input:

http://importantfish.com/one-step-matrix-multiplication-with-hadoop/

Key ideas:
• The mapper has to emit <key, value> pairs with (i, k) as the key
• (i, k) indexes a single entry of A*B
• The inner index j is fixed by the input record
• For an entry of A, emit one pair per column k of B; for an entry of B, emit one pair per row i of A

26

Page 27

Reduce()

Output:

http://importantfish.com/one-step-matrix-multiplication-with-hadoop/

Key ideas:
• Each reducer works on one element (i, k) of A*B
• It collects all a_ij and b_jk with i and k fixed
• It calculates the sum of products for the (i, k)-th element

27
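A plain-Python sketch of the same one-step scheme (the shuffle phase is simulated in-process; dimensions M, N, P are assumed known to the job; this is not the Java code from the linked post):

# One-step matrix-multiplication Map/Reduce, simulated locally.
from collections import defaultdict

M, N, P = 2, 2, 2   # A is M x N, B is N x P (toy example)

def mapper(records):
    """records are ('A', i, j, value) or ('B', j, k, value) tuples."""
    for name, r, c, v in records:
        if name == 'A':                 # a_ij contributes to every (i, k)
            for k in range(P):
                yield (r, k), ('A', c, v)
        else:                           # b_jk contributes to every (i, k)
            for i in range(M):
                yield (i, c), ('B', r, v)

def reducer(key, values):
    """For a fixed (i, k), sum a_ij * b_jk over the inner index j."""
    a, b = {}, {}
    for name, j, v in values:
        (a if name == 'A' else b)[j] = v
    return key, sum(a[j] * b.get(j, 0) for j in a)

records = [('A', 0, 0, 5), ('A', 0, 1, 4), ('A', 1, 1, 2),
           ('B', 0, 0, 7), ('B', 0, 1, 1)]
shuffle = defaultdict(list)             # Hadoop's shuffle phase groups by key
for key, value in mapper(records):
    shuffle[key].append(value)
for key in sorted(shuffle):
    print(reducer(key, shuffle[key]))   # ((0, 0), 35), ((0, 1), 5), ((1, 0), 0), ((1, 1), 0)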

Page 28

Why should I care?

• Easy to program: no need to be a C++/MPI expert, no parallel programming knowledge needed!
• Portable: anything you write runs on any Hadoop cluster
• Scalable: you can run your code on 1 or 1000 machines without changes!
• Fault tolerant: even when cluster nodes fail, the job finishes

28

Page 29

Shortcomings

• If the data fits in memory, Map/Reduce can be much slower than in-memory approaches!
• Iterative algorithms: at the end of every iteration M/R has to write its results to HDFS (disk), and at the beginning of every iteration it has to read them back from HDFS. This slows down iterative algorithms!!
• Ways around it: HaLoop https://code.google.com/p/haloop/ and Twister http://www.iterativemapreduce.org/

29

Page 30

Applications

• Graph Mining: Pegasus, HEigen
• Machine Learning: Mahout
• Tensor Analysis: GigaTensor

30

Page 31

Graph Mining

• We are given a graph, e.g. who-talks-to-whom
• In graph mining we are interested in finding regular patterns in the graph: degree distribution of nodes, graph diameter, # connected components, # triangles, clustering coefficient, PageRank
• …and in finding anomalies: nodes that are "special", e.g. potential spammers/fraudsters in our example

31

Page 32

Pegasus

• Many graph mining tasks can be reduced to a "generalized matrix-vector product"
 Generalized: relax multiply to combine, relax sum to aggregate
 Different choices of combine & aggregate give us different graph features
 PageRank: combine = multiply, aggregate = sum
 Connected components: combine = multiply, aggregate = min
• Pegasus introduces the above abstraction and provides an efficient & scalable Hadoop implementation. Project page: http://www.cs.cmu.edu/~pegasus/
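A minimal Python sketch of the combine/aggregate abstraction (not Pegasus itself), using the connected-components choice combine = multiply, aggregate = min:

# Generalized matrix-vector product with pluggable combine/aggregate functions.
def gen_matvec(edges, vec, combine, aggregate):
    """edges: {node: [(neighbor, weight), ...]}; vec: {node: value}."""
    out = {}
    for node, neighbors in edges.items():
        vals = [combine(w, vec[nbr]) for nbr, w in neighbors]
        if vals:
            out[node] = aggregate(vals)
    return out

# Connected components: repeatedly propagate the minimum reachable node id.
edges = {1: [(2, 1)], 2: [(1, 1), (3, 1)], 3: [(2, 1)], 4: [(5, 1)], 5: [(4, 1)]}
labels = {n: n for n in edges}
for _ in range(len(edges)):
    prop = gen_matvec(edges, labels, combine=lambda w, v: w * v, aggregate=min)
    labels = {n: min(labels[n], prop.get(n, labels[n])) for n in labels}
print(labels)  # {1: 1, 2: 1, 3: 1, 4: 4, 5: 4} -- two components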

32

Page 33

Pegasus

1.4M nodes, 6.3M edges

33

Page 34

Pegasus

34

Page 35

Triangle Counting

• Triangle: a set of three nodes connected to each other. E.g., two people get introduced by a mutual friend at a party, completing a triangle in the social network
• Triangle counts: an unusual number of triangles around a node can indicate fraudsters/spammers
• There is a direct relation between the number of triangles and the eigenvalue decomposition of the adjacency matrix of the graph
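For reference, the standard identity behind that relation (assuming A is the adjacency matrix of an undirected graph with eigenvalues λ_i): the total number of triangles is

\#\triangle \;=\; \tfrac{1}{6}\,\operatorname{trace}(A^{3}) \;=\; \tfrac{1}{6}\sum_{i}\lambda_{i}^{3}

so a few top eigenvalues already give a good approximate count.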

35

Page 36

HEigen – Eigenvalue Decomposition

• Scalable tool for computing the eigenvalue decomposition of large graphs
• Uses the Lanczos algorithm with Selective Orthogonalization
• Uses selective parallelization to choose which subtasks to parallelize: Frobenius norms & small intermediate eigendecompositions are run locally

U Kang et al, Spectral Analysis for Billion‐Scale Graphs: Discoveries and Implementation, PAKDD 2011

36

Page 37

HEigen

U Kang et al, Spectral Analysis for Billion-Scale Graphs: Discoveries and Implementation, PAKDD 2011

37

Page 38

Machine Learning – Mahout

• Apache’s Hadoop Machine Learning Toolbox: Matrix Factorization (SVD, NMF), k-means clustering, Topic Modeling (LDA), Logistic Regression, Naïve Bayes Classification, and many more
• Download at: https://mahout.apache.org/

38

Page 39

School of Computer Science, Carnegie Mellon University

GigaTensor: Scaling Tensor Analysis Up By 100 Times – Algorithms and Discoveries

U Kang, Evangelos Papalexakis, Abhay Harpale, Christos Faloutsos

Page 40

Motivation

• Suppose we have Knowledge Base data, e.g. from the Read the Web project / Never Ending Language Learner (NELL) at CMU: subject – verb – object triplets, mined from the web
• Many gigabytes of data! How do we find potential new synonyms of a word using this knowledge base?
• Working problem: the NELL dataset, with 24M subjects, 24M objects, 46M verbs

40

Page 41

CP/PARAFAC decomposition

• Decompose X into a sum of rank-one tensors: X ≈ a1 ∘ b1 ∘ c1 + … + aF ∘ bF ∘ cF

Objective function:
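The formula itself appears as an image in the original slide; in standard form, the CP/PARAFAC least-squares objective is

\min_{\mathbf{A},\mathbf{B},\mathbf{C}} \; \Big\| \underline{\mathbf{X}} \;-\; \sum_{f=1}^{F} \mathbf{a}_f \circ \mathbf{b}_f \circ \mathbf{c}_f \Big\|_F^2

where ∘ denotes the outer product and a_f, b_f, c_f are the columns of A, B, C.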

41

Page 42

ALS algorithm for CP/PARAFAC

• The objective function is non-convex, but it is linear in each of the variables
• Most popular approach: Alternating Least Squares (ALS)
 Fix B, C and optimize for A; fix A, C and optimize for B; fix A, B and optimize for C
 A block coordinate descent algorithm, with monotone convergence to a local optimum

42

Page 43

ALS Zoom-In: Intermediate Data Explosion

Unfold/matricize the tensor X into the matrix X(1). By the CP/PARAFAC property, X(1) ≈ A (C ⊙ B)^T, where ⊙ is the Khatri-Rao product:
(C ⊙ B) = [C(:,1) ⊗ B(:,1) … C(:,F) ⊗ B(:,F)], a JK x F matrix (⊗ is the Kronecker product).

• (C ⊙ B) can be very large
• Materializing it is a showstopper!
• Intermediate Data Explosion
• Same issues for B and C!
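For reference, the standard matricized ALS update for A, which is where the (C ⊙ B) term comes from (∗ is the element-wise Hadamard product, † the pseudoinverse):

\mathbf{A} \;\leftarrow\; \mathbf{X}_{(1)}\,(\mathbf{C}\odot\mathbf{B})\,\big(\mathbf{C}^{\top}\mathbf{C} \ast \mathbf{B}^{\top}\mathbf{B}\big)^{\dagger}

GigaTensor's contribution is computing X(1)(C ⊙ B) without ever materializing the JK x F matrix (C ⊙ B).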

43

Page 44

Main Idea

• Avoiding the Intermediate Data Explosion

Size of intermediate data (NELL):
 Naïve (before): 100 PB
 Proposed (after): 1.5 GB

44

Page 45

Results

• GigaTensor solved 100x larger problems than the current state of the art

(Plot: GigaTensor scales 100x beyond the baseline, which runs out of memory.)

45

Page 46

BREAK

46

Page 47

Roadmap

• Other Distributed Approaches

47

Page 48

Other Distributed Approaches

• Map/Reduce has certain flaws
• What if we incorporate knowledge about the problem into the computational model?
• Three approaches (with a graph flavor): GraphLab, Pregel, GraphChi

48

Page 49

GraphLab

• Map/Reduce is perfect for embarrassingly data-parallel computations: WordCount is a good example, with no data dependencies
• In ML applications there usually are data dependencies and iterative algorithms
• GraphLab expresses data dependencies as a graph and performs computations distributed over that graph

49

Page 50

GraphLab – High level idea

• Update: analogous to Map(), but unlike Map() it can also be applied to overlapping pieces of the problem
• Sync: analogous to Reduce(); also applies to overlapping parts of the problem

Yucheng Low et al. Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud, VLDB 2012

50

Page 51

GraphLab Applications

• Not restricted to graph computations
• Can express many problems in this way: least squares regression, Lasso regression, matrix factorization
• Active community (software package and annual conference): http://graphlab.com/index.html

51

Page 52

Pregel

• Google’s response to graph computations
• Vertex-centric computations. A vertex can: receive messages, send messages to other vertices, modify its state, modify the graph topology
• Can express algorithms such as PageRank or Shortest Paths this way
• Very scalable (runs on Google’s various graphs)
• Easy to program (15 lines of code for PageRank)
• Internal to Google

Grzegorz Malewicz et al. Pregel: A System for Large-Scale Graph Processing, SIGMOD’10

52

Page 53

GraphChi

• GraphLab & Pregel run on clusters. What about a single machine?
• GraphChi: single machine, disk-based (local) storage; breaks a large graph into small parts and uses parallel sliding windows to process them
• Performance comparable to distributed approaches!

Aapo Kyrola et al. GraphChi: Large-Scale Graph Computation on Just a PC, USENIX’12

53

Page 54

Roadmap

• Databases

54

Page 55

Databases

• (Relational) Database Systems store data in "relations" (tables) and let you issue queries on the data, typically in SQL (Structured Query Language)
• E.g., a table STUDENT with entries (name, student_id, gpa)
• Find all students with gpa >= 3.5:
 SELECT * FROM STUDENT WHERE gpa >= 3.5;

55

Page 56

Example

STUDENT
Name      Student_id   GPA
Rasmus    1            4
Evrim     2            4
Vagelis   3            3

SELECT * FROM STUDENT WHERE gpa >= 3.5;

Result:
Name      Student_id   GPA
Rasmus    1            4
Evrim     2            4

56

Page 57

Joins of two tables

• We have two tables: STUDENT(name, student_id, gpa) and TAKES_CLASS(student_id, class_name)
• We can ask: what classes do students with gpa >= 3.5 take?
 SELECT UNIQUE(class_name) FROM STUDENT
 JOIN TAKES_CLASS ON STUDENT.student_id = TAKES_CLASS.student_id
 WHERE STUDENT.gpa >= 3.5;

57

Page 58

Example

STUDENT
Name      Student_id   GPA
Rasmus    1            4
Evrim     2            4
Vagelis   3            3

TAKES_CLASS
Class_name         Student_id
Chemometrics 101   1
Databases 201      1
Chemometrics 101   2
Chemometrics 101   3

SELECT UNIQUE(class_name)
FROM STUDENT JOIN TAKES_CLASS ON STUDENT.student_id = TAKES_CLASS.student_id
WHERE STUDENT.gpa >= 3.5;

Result:
Class_name
Chemometrics 101
Databases 201

58

Page 59

Databases

• That’s all nice…
• But, why would we want to use it?

59

Page 60

Matrix operations in a DBMS

• Say that we have two matrices A, B
• Store them in the DB as A(row, col, value) and B(row, col, value)
• Then
 SELECT A.row, B.col, SUM(A.value*B.value)
 FROM A JOIN B ON A.col = B.row
 GROUP BY A.row, B.col;
• gives us A*B!
• http://stackoverflow.com/questions/6582191/sql-query-for-multiplication

60

Page 61

What else can we do?

• We can find eigenvectors of a matrix A
• Simply do Power Iteration: start with a random x and iterate x(i) = A*x(i-1) until x converges
• A series of matrix-vector multiplications: SQL can do that
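As a reference for what the SQL loop computes, the same power iteration in plain Python/NumPy (a sketch for a small dense matrix):

# Power iteration: repeated matrix-vector products converge to the leading eigenvector.
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
x = np.random.rand(A.shape[0])

for _ in range(100):
    x_new = A @ x
    x_new /= np.linalg.norm(x_new)       # normalize to avoid overflow
    if np.linalg.norm(x_new - x) < 1e-10:
        break
    x = x_new

print("leading eigenvector ~", x)
print("leading eigenvalue  ~", x @ A @ x)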

61

Page 62

Why should I care?

• Re-usable: write a library of queries, use it at will
• Portable: SQL is a standard, so any DBMS supports basic SQL operations
• Scalable: DBMSs are the industrial workhorses, optimized for efficiency & speed

62

Page 63

NoSQL

• Traditional RDBMSs implement things like concurrency control and data integrity
• These are necessary when doing DB transactions: e.g., a bank DB needs to make sure that all transactions are either committed or rolled back, and the data should stay consistent
• Not really necessary for data analysis, where the data is usually immutable

63

Page 64

NoSQL

• Drop the concurrency control
• Drop the data integrity constraints
• What’s left is NoSQL systems
• NoSQL sometimes means "Not only SQL": some NoSQL systems support SQL-like queries, others have their own language

64

Page 65

SciDB

• Data management and analysis system
• Minimal support for transactions
• Data is stored as vectors
• Provides a high-level front-end, currently in R, soon in Python, Matlab, etc.
• All computation & storage takes place in the database server

65

Page 66

Roadmap

• Sampling

66

Page 67

Sampling

• Very powerful technique that reduces data size
• If done carefully, preserves data characteristics
• Is able to speed/scale up computations with a small price to pay
• Today: CUR decomposition, TensorCUR, ParCube

67

Page 68

Analysis using SVD

SVD: A ≈ U Σ V^T, e.g. a users x products matrix A factored into U (users x latent groups) and V (products x latent groups).

• Sometimes it is hard to interpret the columns of U, V: they might not directly correspond to something in the data
• (Alternative) CUR decomposition: instead of a latent approximation, use actual columns & rows of A

68

Page 69

CUR Decomposition

A ≈ C U R

• C contains columns of A sampled at random
• R contains rows of A sampled at random
• U = pinv(C)*A*pinv(R)
• If A is sparse then C, R are sparse too! (Not true for the SVD factors)

Mahoney et al. CUR matrix decompositions for improved data analysis, PNAS 2009

69
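A minimal NumPy sketch of this construction (uniform sampling for simplicity; the cited paper samples columns and rows by leverage scores):

# CUR sketch: actual columns and rows of A, glued together by U = pinv(C) A pinv(R).
import numpy as np

def cur(A, c, r, rng=np.random.default_rng(0)):
    cols = rng.choice(A.shape[1], size=c, replace=False)
    rows = rng.choice(A.shape[0], size=r, replace=False)
    C = A[:, cols]                        # actual columns of A
    R = A[rows, :]                        # actual rows of A
    U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)
    return C, U, R

A = np.random.rand(100, 50) @ np.random.rand(50, 80)   # a roughly low-rank matrix
C, U, R = cur(A, c=30, r=30)
err = np.linalg.norm(A - C @ U @ R) / np.linalg.norm(A)
print(f"relative reconstruction error: {err:.3f}")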

Page 70

CUR discussion

• Good for cases when we can’t interpret the latent dimensions
• Directly interpretable factors
• Retains sparsity in the factors

70

Page 71

Tensor CUR

• Extension of the CUR decomposition to tensors
• Assumes that the third mode is "special", e.g. time
• Approximates the tensor A of size n1 x n2 x n3 using a sampled subtensor C of size n1 x n2 x c (where c is small)

Mahoney et al. Tensor-CUR decompositions for tensor-based data, SIAM JMAA 2008

71

Page 72

Speeding up and parallelizing tensor decompositions

• Given a large tensor or a coupled matrix-tensor pair, how can we decompose them on a single machine (possibly multi-core)?
• Idea: use sampling and parallelization
 ParCube (ECML-PKDD 2012): approximate, parallel PARAFAC
 Turbo-SMT (SIAM SDM 2014): approximate, parallel Coupled Matrix-Tensor Factorization

72

Page 73

PARCUBE: The big picture

• Sampling selects a small portion of the indices
• The PARAFAC vectors ai, bi, ci will be sparse by construction

The three steps: (1) break up the tensor into small pieces using sampling; (2) fit a dense PARAFAC decomposition on the small sampled tensors; (3) match columns and distribute non-zero values to the appropriate indices in the original (non-sampled) space.

73

Page 74

Putting the pieces together

• Say we have the factor matrices As from each sample; we possibly have a re-ordering of the factors
• Each matrix corresponds to a different sampled index set of the original index space
• All factors share the "upper" part (by construction)

Proposition: Under mild conditions, the algorithm will stitch the components correctly & output what exact PARAFAC would

74

Page 75

Up to 200x speedup

75

Page 76

Neurosemantics

• Brain scan data*
• 9 persons in an fMRI machine
• Presented with 60 concrete nouns (e.g. airplane, dog)
• 7s pause between nouns to ‘neutralize’ activity…

*Mitchell et al. Predicting human brain activity associated with the meanings of nouns. Science, 2008. Data available @ http://www.cs.cmu.edu/afs/cs/project/theo-73/www/science2008/data.html

76

Page 77

Neurosemantics Results

77

Page 78

Roadmap

• Streaming & Sketching

78

Page 79

Problem 1

• You are given a series of N numbers
• N is much larger than anything you can store
• You see this series of numbers only once
• Suppose you can store M numbers
• How can you sample M of those numbers uniformly at random?

79

Page 80

Data Streams

• The previous problem is a Data Stream problem
• We are going to see: sketching and streaming algorithms
• Even without the streaming constraint, such algorithms offer useful insights: they make algorithms faster and more space efficient

80

Page 81

Reservoir Sampling

• R stores the M numbers of our sample
• For the first M numbers that we see, we add them to R
• After R is full, we need to decide whether to add a sample. For the i-th number of the stream, say S[i]: generate a random number j in the range 1…i; if j ≤ M then set R[j] = S[i], otherwise ignore S[i]
• The probability of adding samples to R decreases over time
• One can prove (by induction) that the sample is uniformly random
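A minimal Python sketch of exactly this procedure (with Python's 0-based lists standing in for the 1-based R[j]):

# Reservoir sampling: keep a uniform random sample of M items from a stream
# that is seen only once and is too large to store.
import random

def reservoir_sample(stream, M):
    R = []
    for i, x in enumerate(stream, start=1):   # i = number of items seen so far
        if i <= M:
            R.append(x)                        # fill the reservoir first
        else:
            j = random.randint(1, i)           # uniform in 1..i
            if j <= M:
                R[j - 1] = x                   # replace a uniformly chosen slot
    return R

print(reservoir_sample(range(10**6), M=5))     # 5 numbers sampled uniformly at random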

81

Page 82

Problem 2

• We have a stream of N numbers; again, we can’t store them
• Say we call them "vector a"
• How can we answer:
 Point queries: e.g., give me a(i)
 Dot products: given two big vectors a & b, what is aTb?

82

Page 83

CountMin sketch preliminaries

• We have a d x w matrix C
• A set of d hash functions mapping {1..N} to {1…w}
• Vector a is represented in an incremental fashion: at time t the state of the vector is [a1(t) a2(t) …. aN(t)]
• We see updates of its coordinates over time, e.g. an update (it, ct) means ait(t) = ait(t-1) + ct

83

Page 84

CountMin Sketch

• When we see an update (it, ct): for j = 1…d, update C[j, hj(it)] = C[j, hj(it)] + ct
• See Graham Cormode, Count-Min Sketch, Springer Encyclopedia of Database Systems

84

Page 85

CountMin at work

• How can I estimate a(i)?
 aest(i) = minj C[j, hj(i)], for j = 1…d
 Error guarantee: aest(i) ≤ a(i) + ε||a||1, which holds with probability at least 1-δ for the standard choice w = ⌈e/ε⌉, d = ⌈ln(1/δ)⌉
• How can I estimate aTb?
 Treat Ca, Cb as d w-dimensional vectors; aTb can be estimated as the minimum inner product between corresponding rows of Ca and Cb
 With probability 1-δ the estimate is at most ε||a||1||b||1 more than the true value
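A small Python sketch of the update and point-query logic (the salted md5 hashes are a stand-in; the real construction uses pairwise-independent hash functions):

# Count-Min sketch: d x w counters, point queries answered by the minimum counter.
import hashlib

class CountMin:
    def __init__(self, w, d):
        self.w, self.d = w, d
        self.C = [[0] * w for _ in range(d)]

    def _h(self, j, i):
        # j-th hash of item i, mapped into 0..w-1
        digest = hashlib.md5(f"{j}:{i}".encode()).hexdigest()
        return int(digest, 16) % self.w

    def update(self, i, c=1):
        for j in range(self.d):
            self.C[j][self._h(j, i)] += c

    def estimate(self, i):
        # point query: minimum over the d counters that item i hashes to
        return min(self.C[j][self._h(j, i)] for j in range(self.d))

cm = CountMin(w=272, d=5)
for x in [1, 2, 2, 3, 3, 3] * 1000:
    cm.update(x)
print(cm.estimate(3))   # close to (and never less than) the true count 3000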

85

Page 86

Streams of Co-evolving Time-Series

(Figure: (a) Sensor measurements, (b) Hidden variables. Sensors measure chlorine in drinking water and show a daily, near-sinusoidal periodicity during phases 1 and 3. During phase 2, some of the sensors are "stuck" due to a major leak. The extra hidden variable introduced during phase 2 captures the presence of a new trend. SPIRIT can also tell us which sensors participate in the new, "abnormal" trend (e.g., close to a construction site). In phase 3, everything returns to normal.)

• We are given n sensors
• We record their activity over time
• We would like to track their PCA as new measurements become available

86

Page 87

SPIRIT

• SPIRIT adapts the number of principal components k, adapts the loadings, tracks the scores/hidden variables, and does all of the above efficiently

Papadimitriou et al. Streaming Pattern Discovery in Multiple Time-Series, VLDB 2005

87

Page 88

Tensor Stream

Given a tensor that grows over time, track its decomposition without re-computing it from scratch

88

Page 89

Tensor Streams

• At least two approaches exist:
 Jimeng Sun et al. Incremental Tensor Analysis: Theory and Applications, ACM TKDD 2008 (amongst others, uses SPIRIT)
 Nion & Sidiropoulos, Adaptive Algorithms to Track the PARAFAC Decomposition of a Third-Order Tensor, IEEE TSP 2009

89

Page 90

The End

Web: www.cs.cmu.edu/~epapalex
Code: www.cs.cmu.edu/~epapalex/code.html
email: [email protected]

Questions?

90