Big Arctic Data - ku · 2014-04-17
School of Computer Science, Carnegie Mellon University
Big Arctic Data
Evangelos (Vagelis) Papalexakis
School of Computer Science, Carnegie Mellon University
Arctic Analysis 2014, Greenland
Roadmap
• Motivation & Introduction ←
• Matlab is great
• Map/Reduce
• Other Distributed Approaches
• Databases
• Sampling
• Streaming & Sketching
2
Eric Fisher, “See something, say something”
3
http://socialgraph.blogspot.com/2010/12/facebook‐map‐of‐world‐visualising.html
4
How big is big?
Slide adapted from: http://graphlab.com/learn/presentations.html
Picture from: http://web.netenrich.com/Portals/128884/images/FB_SERVER_040_x900.jpg
Need many data centers to store the data:
• 100 hours of video uploaded to YouTube every minute
• 28 million Wikipedia pages
• 1 billion Facebook users
• 6 billion Flickr photos
5
Definition – The 3 V’s
• Volume: hard to store
• Variety: very diverse/rich
• Velocity: coming in faster than we can handle
6
Success story
http://www.forbes.com/sites/kashmirhill/2012/02/16/how‐target‐figured‐out‐a‐teen‐girl‐was‐pregnant‐before‐her‐father‐did/
• Target assigns every customer an ID number, tied to their credit card (or name, or email), and gathers any additional information it can
• A combination of lotions and multivitamins turned out to be a strong predictor of the early stages of pregnancy
• Target figured out that a girl was pregnant before her father did: it sent her flyers with baby-related merchandise, and her father was furious. After a pregnancy test, the family found out that the girl was indeed pregnant. More impressive: Target was able to estimate the due date somewhat accurately
7
Success Story 2
• Google Translate uses large-scale “dirty” data instead of hoping for high-quality annotated data
• Many training instances are found “in the wild”: let the data guide the machine translation instead of using very complicated models
Alon Halevy et al. The Unreasonable Effectiveness of Data, IEEE Intelligent Systems 2009
8
But I don’t have that much data! Why should I care?
• Even with small/medium data, one can benefit by borrowing these ideas: speeding up algorithms and improving memory efficiency
9
Roadmap
• Motivation & Introduction
• Matlab is great ←
• Map/Reduce
• Other Distributed Approaches
• Databases
• Sampling
• Streaming & Sketching
10
Matlab is great!
• Powerful tool
• Great implementations of matrix algorithms: eigendecomposition, Singular Value Decomposition, basic matrix operations
• Vector-based operations
• Instant “debugging” by plotting
• All of the above make it a great prototyping tool for math-intensive data analysis
11
Data representation matters
• The original data size is many times deceptive
• The data that we analyze ends up being much smaller in terms of the storage necessary and the number of observations
• Need to represent the data carefully
12
RAW DATA: Liquid-Chromatography Mass-Spectrometry (LC-MS) measurements are usually converted into a set of peaks and treated as two-way arrays, i.e., mixtures by peaks, where each peak is a (m/z, retention time) pair. The original raw data is a three-way array (29 mixtures × 18989 m/z values × 1054 retention times, only 1.55% dense), and we can explore its underlying structure by taking advantage of sparsity: dense storage takes ~4.4 GB, sparse storage only ~275 MB. Note that this is a very small data set with only 29 samples!
Slide borrowed from Evrim Acar
13
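The storage gap on the previous slide is simple arithmetic; a quick back-of-the-envelope check in Python (assuming 8-byte floating-point values and, for the sparse case, a coordinate/COO layout with three 8-byte indices plus the value per nonzero) reproduces the orders of magnitude:

```python
# Back-of-the-envelope storage for the 29 x 18989 x 1054 LC-MS tensor.

def dense_bytes(dims, bytes_per_value=8):
    n = 1
    for d in dims:
        n *= d
    return n * bytes_per_value

def sparse_bytes(dims, density, bytes_per_entry=32):
    # bytes_per_entry: three 8-byte indices + one 8-byte value (COO layout)
    n = 1
    for d in dims:
        n *= d
    return int(n * density) * bytes_per_entry

dims = (29, 18989, 1054)
print(f"dense : {dense_bytes(dims) / 1e9:.2f} GB")           # ~4.6e9 bytes, the slide's ~4.4 GB (GiB)
print(f"sparse: {sparse_bytes(dims, 0.0155) / 1e6:.0f} MB")  # same order as the slide's ~275 MB
```

The exact sparse figure depends on the index width (int32 vs int64), which is why the result only matches the slide to within tens of MB.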
Sparse vs. dense storage

Tensor Toolbox
• Matlab toolbox for tensor computations
• Support for sparse tensor storage and computation: Matlab does not inherently support that; careful implementation of sparse computations for efficiency [1]
• Available at http://www.sandia.gov/~tgkolda/TensorToolbox/index‐2.5.html
[1] Bader & Kolda, Efficient MATLAB computations with sparse and factored tensors, SIAM JSC’07
14
Matlab Parallel Computing Toolbox
• Support for parallel computations
• Provides a “parallel for” loop (parfor): shared-memory parallel execution; loop iterations have to be independent, so you need to write them carefully… but it pays off!
• Can run code on multiple cores/CPUs or even clusters
• Can run random restarts of an algorithm in parallel
• Later today: an example of using the above with sampling for fast PARAFAC
15
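The same pattern is easy to sketch outside Matlab; the snippet below imitates parfor-style independent random restarts with a Python thread pool. The toy objective (x − 3)² is made up for illustration; only the pattern (independent iterations, keep the best result) is the point:

```python
# A rough in-Python analogue of parfor: farm independent random restarts
# of a toy fitting routine out to a pool of workers and keep the best.
from concurrent.futures import ThreadPoolExecutor
import random

def one_restart(seed):
    """One independent restart: random start, crude gradient descent on (x - 3)^2."""
    rng = random.Random(seed)
    x = rng.uniform(-10, 10)
    for _ in range(100):
        x -= 0.2 * (x - 3)        # gradient step on the toy objective
    return ((x - 3) ** 2, x)

def best_of_restarts(n_restarts):
    # like parfor: the iterations must not depend on one another
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(one_restart, range(n_restarts)))
    return min(results)           # (loss, x) with the smallest loss

loss, x = best_of_restarts(4)
```

For CPU-bound Matlab-style workloads a process pool is closer to parfor's behavior; a thread pool keeps the sketch portable.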
Roadmap
• Motivation & Introduction
• Matlab is great
• Map/Reduce ←
• Other Distributed Approaches
• Databases
• Sampling
• Streaming & Sketching
16
Map/Reduce Motivation
• Developed by Google
  Many terabytes of crawled webpages (mainly text)
  Need to create an inverted index: for each word, find how many documents contain it; useful for web search
  Many “cheap”/commodity machines at their disposal: faulty and not efficient as units, but potentially very powerful combined together
17
The Map/Reduce Framework
• Map/Reduce:
  Provides a distributed file system (GFS – the Google File System) where files are stored in the cloud
  Sees everything as <key, value> pairs
  Provides a Map() function; the system gathers data records with the same key on one worker machine
  Provides a Reduce() function, which tells the system how to combine the values of all records with the same key
18
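The canonical example of this <key, value> view is word count; here is a small in-process simulation (plain Python, no Hadoop) where the shuffle step is played by a dictionary grouping values by key:

```python
# Word count as MapReduce, simulated in one process: map_fn emits
# <word, 1> pairs, the "shuffle" groups pairs by key, reduce_fn sums.
from collections import defaultdict

def map_fn(document):
    for word in document.split():
        yield (word, 1)

def reduce_fn(word, counts):
    return (word, sum(counts))

def mapreduce(documents):
    groups = defaultdict(list)
    for doc in documents:                  # map phase
        for key, value in map_fn(doc):
            groups[key].append(value)      # shuffle: same key -> same worker
    return dict(reduce_fn(k, v) for k, v in groups.items())  # reduce phase

counts = mapreduce(["big arctic data", "big data"])
# counts == {"big": 2, "arctic": 1, "data": 2}
```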
The Map/Reduce Framework
• Abstracts the computation into a Map() & Reduce() pair
• Can have chains of Map/Reduce operations; most non-elementary algorithms need more than one Map/Reduce operation!
• The programmer does not need to know details about the cluster
19
Apache Hadoop
• Open-source M/R implementation by Apache
• Provides HDFS (the Hadoop Distributed File System)
• Mostly programmed in Java & Python
20
Hadoop’s inner workings by example
Image from: http://blog.trifork.com/2009/08/04/introduction‐to‐hadoop/
21
Map function
Image from: http://blog.trifork.com/2009/08/04/introduction‐to‐hadoop/
22
Reduce function
Image from: http://blog.trifork.com/2009/08/04/introduction‐to‐hadoop/
23
Putting it all together
Image from: http://blog.trifork.com/2009/08/04/introduction‐to‐hadoop/
24
Matrix Multiplication in Hadoop
• A slightly more complicated example
• We have two matrices A (m×n) and B (n×p), stored in a single file as (matrix, row, col, value) records:
• How can we multiply them on Hadoop?
A 0 0 5
A 0 1 4
A 1 1 2
B 0 0 7
B 0 1 1
…
25
Map()
Input: the (matrix, row, col, value) records from the previous slide
26
http://importantfish.com/one‐step‐matrix‐multiplication‐with‐hadoop/
Key ideas:
• The mapper has to emit <k, v> pairs with (i, k) as the key
• (i, k) indexes a single value of A*B
• The inner index j stays fixed
• For entries of A it iterates over the output column index k; for entries of B it iterates over the m output row indices i
Reduce()
27
Output: the (i, k)-th entries of A*B
http://importantfish.com/one‐step‐matrix‐multiplication‐with‐hadoop/
Key ideas:
• Each reducer works on one element (i, k) of A*B
• It collects all a_ij and b_jk with i and k fixed
• It computes the sum of products for the (i, k)-th element
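The whole one-step scheme can be simulated in-process; the sketch below (plain Python, no Hadoop) follows the mapper/reducer key ideas above. The B entries hidden behind the slide's “…” are invented here to complete a 2×2 example:

```python
# In-process simulation of the one-step MapReduce matrix multiply: the
# mapper replicates each A(i, j) entry to every output key (i, k) and
# each B(j, k) entry to every key (i, k); the reducer for key (i, k)
# sums a_ij * b_jk over the shared index j.
from collections import defaultdict

def mat_mul_mapreduce(entries, n_rows_a, n_cols_b):
    """entries: ('A', i, j, value) or ('B', j, k, value) records."""
    groups = defaultdict(list)
    for tag, r, c, v in entries:                     # map phase
        if tag == 'A':                               # A entry (i=r, j=c)
            for k in range(n_cols_b):
                groups[(r, k)].append(('A', c, v))
        else:                                        # B entry (j=r, k=c)
            for i in range(n_rows_a):
                groups[(i, c)].append(('B', r, v))
    result = {}
    for (i, k), vals in groups.items():              # reduce phase
        a = {j: v for tag, j, v in vals if tag == 'A'}
        b = {j: v for tag, j, v in vals if tag == 'B'}
        s = sum(a[j] * b[j] for j in a if j in b)
        if s:
            result[(i, k)] = s
    return result

# A's entries come from the slide; the B entries after the slide's "..."
# are made up for illustration.
entries = [('A', 0, 0, 5), ('A', 0, 1, 4), ('A', 1, 1, 2),
           ('B', 0, 0, 7), ('B', 0, 1, 1), ('B', 1, 0, 3), ('B', 1, 1, 6)]
result = mat_mul_mapreduce(entries, n_rows_a=2, n_cols_b=2)
# A = [[5, 4], [0, 2]], B = [[7, 1], [3, 6]]  =>  A*B = [[47, 29], [6, 12]]
```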
Why should I care?
• Easy to program: no need to be a C++/MPI expert! No parallel-programming knowledge needed!
• Portable: anything you write runs on any Hadoop cluster
• Scalable: you can run your code on 1 or 1000 machines without changes!
• Fault tolerant: even when cluster nodes fail, the job finishes
28
Shortcomings
• If the data fits in memory, Map/Reduce can be much slower than in-memory approaches!
• Iterative algorithms: at the end of every iteration M/R has to write things to HDFS (disk), and at the beginning of every iteration it has to read them back from HDFS. This slows down iterative algorithms!!
  Ways around it: Haloop https://code.google.com/p/haloop/ and Twister http://www.iterativemapreduce.org/
29
Applications
• Graph Mining: Pegasus, HEigen
• Machine Learning: Mahout
• Tensor Analysis: GigaTensor
30
Graph Mining
• We are given a graph, e.g. who-talks-to-whom
• In graph mining we are interested in:
  Finding regular patterns in the graph: degree distribution of nodes, graph diameter, # connected components, # triangles, clustering coefficient, PageRank
  Finding anomalies: nodes that are “special”, e.g. potential spammers/fraudsters in our example
31
Pegasus
• Many graph mining tasks can be reduced to a “generalized matrix-vector product”:
  Relax multiply to combine
  Relax sum to aggregate
  Different choices for combine & aggregate give us different graph features:
  PageRank: combine = multiply, aggregate = sum
  Connected components: combine = multiply, aggregate = min
• Pegasus introduces the above abstraction and provides an efficient & scalable Hadoop implementation. Project page: http://www.cs.cmu.edu/~pegasus/
32
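A toy version of the generalized matrix-vector product makes the abstraction concrete; below, combine = multiply and aggregate = min yields connected components by label propagation, as on the slide (the small graph and the plain-dict representation are illustrative):

```python
# Toy "generalized matrix-vector product": swap (multiply, sum) for
# (combine, aggregate). With combine = multiply and aggregate = min,
# repeated products propagate the minimum node id -> connected components.

def generalized_matvec(adj, x, combine, aggregate):
    """adj: {node: [neighbor, ...]} (unweighted, so every edge weight is 1)."""
    return {i: aggregate([combine(1, x[j]) for j in nbrs] + [x[i]])
            for i, nbrs in adj.items()}

def connected_components(adj):
    labels = {i: i for i in adj}              # each node starts with its own id
    while True:
        new = generalized_matvec(adj, labels, lambda w, v: w * v, min)
        if new == labels:
            return labels
        labels = new

# Two components: {0, 1, 2} and {3, 4}
adj = {0: [1], 1: [0, 2], 2: [1], 3: [4], 4: [3]}
components = connected_components(adj)
# every node ends up labeled with the smallest id in its component
```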
Pegasus
1.4M nodes, 6.3M edges
33
Pegasus
34
Triangle Counting
• Triangle: a set of three nodes connected to each other. E.g. two people get introduced by a mutual friend at a party, completing a triangle in the social network
• Triangle counts: an unusual number of triangles around a node can indicate fraudsters/spammers
• There is a direct relation between the #triangles and the eigenvalue decomposition of the adjacency matrix of the graph
35
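The relation alluded to in the last bullet is that, for an undirected graph, #triangles = trace(A³)/6 = (1/6) Σ λᵢ³ over the adjacency eigenvalues. A quick numpy check on a small graph:

```python
# Triangle counting from adjacency eigenvalues.
import numpy as np

def triangle_count(A):
    """#triangles = trace(A^3)/6 = (1/6) * sum of adjacency eigenvalues cubed."""
    eigvals = np.linalg.eigvalsh(A)      # A is symmetric -> real eigenvalues
    return int(round(np.sum(eigvals ** 3) / 6))

A = np.ones((4, 4)) - np.eye(4)          # K4: every pair of 4 nodes connected
n_triangles = triangle_count(A)          # C(4,3) = 4 triangles
```

At billion-node scale one only needs the top few eigenvalues, which is exactly what HEigen (next slide) computes.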
HEigen – Eigenvalue Decomposition
• Scalable tool for computing the eigenvalue decomposition
• Uses the Lanczos algorithm with Selective Orthogonalization
• Uses selective parallelization to choose which subtasks to parallelize: the Frobenius norm & small intermediate eigendecompositions are run locally
U Kang et al, Spectral Analysis for Billion‐Scale Graphs: Discoveries and Implementation, PAKDD 2011
36
HEigen
U Kang et al, Spectral Analysis for Billion‐Scale Graphs: Discoveries and Implementation, PAKDD 2011
37
Machine Learning – Mahout
• Apache’s Hadoop machine learning toolbox: matrix factorization (SVD, NMF), k-means clustering, topic modeling (LDA), logistic regression, naïve Bayes classification, and many more
• Download at: https://mahout.apache.org/
38
School of Computer Science, Carnegie Mellon University
GigaTensor: Scaling Tensor Analysis Up By 100 Times –Algorithms and Discoveries
U Kang, Evangelos Papalexakis, Abhay Harpale, Christos Faloutsos
Motivation
• Suppose we have knowledge-base data, e.g. the Read the Web project / Never-Ending Language Learner (NELL) at CMU: subject–verb–object triplets mined from the web
• Many gigabytes of data! How do we find potential new synonyms of a word using this knowledge base?
• Working problem: the NELL dataset (24M subjects, 24M objects, 46M verbs)
40
CP/PARAFAC decomposition
• Decompose X into a sum of F rank-one tensors:
  X ≈ a1 ∘ b1 ∘ c1 + … + aF ∘ bF ∘ cF
• Objective function: min over A, B, C of ‖X − Σ_{f=1}^{F} a_f ∘ b_f ∘ c_f‖²
41
ALS algorithm for CP/PARAFAC
• The objective function is non-convex!
• But it is linear in each of the variables
• Most popular approach: Alternating Least Squares (ALS)
  Fix B, C and optimize for A
  Fix A, C and optimize for B
  Fix A, B and optimize for C
• A block coordinate descent algorithm with monotone convergence to a local optimum
42
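A minimal numpy sketch of this ALS loop (dense tensors, fixed iteration count, no regularization or convergence check, so it is only a teaching-sized illustration of the three alternating least-squares solves):

```python
# Minimal ALS for a rank-F CP/PARAFAC of a dense I x J x K numpy array.
# Each step fixes two factor matrices and solves a linear least-squares
# problem for the third -- exactly the alternation described above.
import numpy as np

def khatri_rao(C, B):
    """Column-wise Kronecker product, shape (K*J) x F."""
    F = B.shape[1]
    return np.stack([np.kron(C[:, f], B[:, f]) for f in range(F)], axis=1)

def cp_als(X, F, n_iters=20, seed=0):
    I, J, K = X.shape
    rng = np.random.default_rng(seed)
    B = rng.standard_normal((J, F))
    C = rng.standard_normal((K, F))
    # mode-n unfoldings, column ordering chosen to match khatri_rao above
    X1 = X.transpose(0, 2, 1).reshape(I, K * J)
    X2 = X.transpose(1, 2, 0).reshape(J, K * I)
    X3 = X.transpose(2, 1, 0).reshape(K, J * I)
    for _ in range(n_iters):
        A = np.linalg.lstsq(khatri_rao(C, B), X1.T, rcond=None)[0].T  # fix B, C
        B = np.linalg.lstsq(khatri_rao(C, A), X2.T, rcond=None)[0].T  # fix A, C
        C = np.linalg.lstsq(khatri_rao(B, A), X3.T, rcond=None)[0].T  # fix A, B
    return A, B, C

# Sanity check: an exactly rank-1 tensor is recovered with F = 1
a, b, c = np.arange(1, 4.0), np.arange(1, 5.0), np.arange(1, 6.0)
X = np.einsum('i,j,k->ijk', a, b, c)
A, B, C = cp_als(X, F=1)
X_hat = np.einsum('if,jf,kf->ijk', A, B, C)
```

Note that materializing khatri_rao(C, B) explicitly is precisely the "intermediate data explosion" the next slide warns about; it is fine at this toy scale only.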
ALS Zoom-In: Intermediate Data Explosion
• Unfold/matricize X into X(1); the CP/PARAFAC property gives X(1) ≈ A (C ⊙ B)ᵀ, where ⊙ is the Khatri-Rao product:
  (C ⊙ B) = [C(:,1) ⊗ B(:,1) … C(:,F) ⊗ B(:,F)], a JK × F matrix (⊗ is the Kronecker product)
• (C ⊙ B) can be very large; materializing it is a showstopper!
• This is the intermediate data explosion; the same issue arises for B and C!
43
Main Idea
• Avoiding the intermediate data explosion
• Size of intermediate data (NELL): naïve, 100 PB (before); proposed, 1.5 GB (after)
44
Results
• GigaTensor solved 100× larger problems than the current state of the art, which runs out of memory
45
BREAK
46
Roadmap
• Motivation & Introduction
• Matlab is great
• Map/Reduce
• Other Distributed Approaches ←
• Databases
• Sampling
• Streaming & Sketching
47
Other Distributed Approaches
• Map/Reduce has certain flaws
• What if we incorporate knowledge about the problem into the computational model?
• Three approaches (with a graph flavor): GraphLab, Pregel, GraphChi
48
GraphLab
• Map/Reduce is perfect for embarrassingly data-parallel computations; WordCount is a good example (no data dependencies)
• In ML applications there usually are data dependencies and iterative algorithms
• GraphLab expresses data dependencies as a graph and performs computations distributed over that graph
49
GraphLab: High-level idea
• Update: analogous to Map(), but unlike Map() it can also be applied to overlapping pieces of the problem
• Sync: analogous to Reduce(); it also applies to overlapping parts of the problem
Yucheng Low et al. Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud VLDB 2012 50
GraphLab Applications
• Not restricted to graph computations
• Can express many problems this way: least-squares regression, Lasso regression, matrix factorization
• Active community (software package and annual conference): http://graphlab.com/index.html
51
Pregel
• Google’s response to graph computations
• Vertex-centric computation: a vertex can receive messages, send messages to other vertices, modify its state, and modify the graph topology
• Can express algorithms such as PageRank or shortest paths this way
• Very scalable (runs on Google’s various graphs)
• Easy to program (15 lines of code for PageRank)
• Internal to Google
Grzegorz Malewicz et al. Pregel: A System for Large‐Scale Graph Processing, SIGMOD’10
52
GraphChi
• GraphLab & Pregel run on clusters; what about a single machine?
• GraphChi: a single machine with disk-based (local) storage; breaks a large graph into small parts and uses parallel sliding windows to process them
• Performance comparable to distributed approaches!
Aapo Kyrola et al. GraphChi: Large‐Scale Graph Computation on Just a PC, USENIX’12
53
Roadmap
• Motivation & Introduction
• Matlab is great
• Map/Reduce
• Other Distributed Approaches
• Databases ←
• Sampling
• Streaming & Sketching
54
Databases
• (Relational) database systems store data in “relations” (tables) and let you issue queries on the data, typically in SQL (Structured Query Language)
• E.g. a table STUDENT with entries (name, student_id, gpa)
• Find all students with gpa >= 3.5:
  SELECT * FROM STUDENT WHERE gpa >= 3.5;
55
Example
STUDENT:
Name | Student_id | GPA
Rasmus | 1 | 4
Evrim | 2 | 4
Vagelis | 3 | 3

SELECT * FROM STUDENT WHERE gpa >= 3.5;

Result:
Name | Student_id | GPA
Rasmus | 1 | 4
Evrim | 2 | 4
56
Joins of two tables
• We have two tables: STUDENT(name, student_id, gpa) and TAKES_CLASS(student_id, class_name)
• We can ask: what do students with gpa >= 3.5 take?
  SELECT DISTINCT(class_name) FROM STUDENT
  JOIN TAKES_CLASS ON STUDENT.student_id = TAKES_CLASS.student_id
  WHERE STUDENT.gpa >= 3.5;
57
Example
STUDENT:
Name | Student_id | GPA
Rasmus | 1 | 4
Evrim | 2 | 4
Vagelis | 3 | 3

TAKES_CLASS:
Class_name | Student_id
Chemometrics 101 | 1
Databases 201 | 1
Chemometrics 101 | 2
Chemometrics 101 | 3

SELECT DISTINCT(class_name) FROM STUDENT JOIN TAKES_CLASS ON STUDENT.student_id = TAKES_CLASS.student_id WHERE STUDENT.gpa >= 3.5;

Result:
Class_name
Chemometrics 101
Databases 201
58
Databases
• That’s all nice… but why would we want to use a DBMS for data analysis?
59
Matrix operations in a DBMS
• Say that we have two matrices A, B
• Store them in a DB as A(row, col, value) and B(row, col, value)
• Then:
  SELECT A.row, B.col, SUM(A.value*B.value)
  FROM A JOIN B ON A.col = B.row
  GROUP BY A.row, B.col;
• gives us A*B!
• http://stackoverflow.com/questions/6582191/sql‐query‐for‐multiplication
60
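The query runs essentially as-is on any SQL engine; for instance, on SQLite through Python's built-in sqlite3 module (the 2×2 matrices are illustrative):

```python
# The slide's SQL matrix multiply, run on an in-memory SQLite database.
# "row" is quoted because it is a keyword in some SQL dialects.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE A ("row" INT, col INT, value REAL)')
conn.execute('CREATE TABLE B ("row" INT, col INT, value REAL)')
conn.executemany("INSERT INTO A VALUES (?, ?, ?)",
                 [(0, 0, 1), (0, 1, 2), (1, 0, 3), (1, 1, 4)])
conn.executemany("INSERT INTO B VALUES (?, ?, ?)",
                 [(0, 0, 5), (0, 1, 6), (1, 0, 7), (1, 1, 8)])
rows = conn.execute('''
    SELECT A."row", B.col, SUM(A.value * B.value)
    FROM A JOIN B ON A.col = B."row"
    GROUP BY A."row", B.col
''').fetchall()
product = {(r, c): v for r, c, v in rows}
# A = [[1, 2], [3, 4]], B = [[5, 6], [7, 8]]  =>  A*B = [[19, 22], [43, 50]]
```

Note that only nonzero entries need to be stored, so this is effectively a sparse matrix multiply for free.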
What else can we do?
• We can find eigenvectors of a matrix A
• Simply do power iteration: start with a random x and iterate x(i) = A*x(i-1) until x converges
• This is a series of matrix-vector multiplications, and SQL can do that
61
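The power iteration itself, written here with numpy matrix-vector products (in the SQL setting each step would be one matvec query; a normalization step is added to keep the values bounded):

```python
# Power iteration: repeated matrix-vector products converge to the
# dominant eigenvector of A.
import numpy as np

def power_iteration(A, n_iters=200, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(A.shape[0])
    for _ in range(n_iters):
        x = A @ x                     # one matvec per step (one query in SQL)
        x /= np.linalg.norm(x)        # normalize so values stay bounded
    return x

A = np.array([[2.0, 1.0], [1.0, 2.0]])   # eigenvalues 3 and 1
v = power_iteration(A)
# v is (up to sign) the dominant eigenvector, proportional to [1, 1]
```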
Why should I care?
• Re-usable: write a library of queries, use it at will
• Portable: SQL is a standard, so any DBMS supports basic SQL operations
• Scalable: DBMSs are the industrial workhorses, optimized for efficiency & speed
62
NoSQL
• Traditional RDBMSs implement things like concurrency control and data integrity
• These are necessary when doing DB transactions: e.g. a bank DB needs to make sure that all transactions are either committed or rolled back, and the data should stay consistent
• They are not really necessary for data analysis, where the data is usually immutable
63
NoSQL
• Drop the concurrency control and the data integrity constraints; what’s left is a NoSQL system
• NoSQL sometimes means “Not only SQL”: some NoSQL systems support SQL-like queries, others have their own language
64
SciDB
• Data management and analysis system
• Minimal support for transactions
• Data is stored as vectors
• Provides a high-level front-end: currently in R, soon in Python, Matlab etc.
• All computation & storage take place in the database server
65
Roadmap
• Motivation & Introduction
• Matlab is great
• Map/Reduce
• Other Distributed Approaches
• Databases
• Sampling ←
• Streaming & Sketching
66
Sampling
• Very powerful technique: reduces data size
• If done carefully, preserves data characteristics
• Able to speed/scale up computations with a small price to pay
• Today: CUR decomposition, TensorCUR, ParCube
67
Analysis using SVD
A ≈ U Σ Vᵀ, e.g. a users × products matrix A factored through latent groups: U is users × latent groups, V is products × latent groups
• Sometimes it is hard to interpret the columns of U, V; they might not directly correspond to something in the data
• (Alternative) CUR decomposition: instead of a latent approximation, use actual columns & rows of A
68
CUR Decomposition
A ≈ C U R
• C contains columns of A sampled at random
• R contains rows of A sampled at random
• U = pinv(C)*A*pinv(R)
• If A is sparse then C, R are sparse too! (Not true for the SVD factors)
Mahoney et al. CUR matrix decompositions for improved data analysis, PNAS 2009
69
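A minimal numpy sketch of the decomposition, using uniform sampling for simplicity (the cited paper samples columns and rows by leverage scores, so this illustrates the shape of the method, not the paper's algorithm):

```python
# Randomized CUR sketch: C and R are actual columns/rows of A,
# U = pinv(C) @ A @ pinv(R) as on the slide. Uniform sampling is an
# illustrative simplification of leverage-score sampling.
import numpy as np

def cur(A, c, r, seed=0):
    rng = np.random.default_rng(seed)
    cols = rng.choice(A.shape[1], size=c, replace=False)
    rows = rng.choice(A.shape[0], size=r, replace=False)
    C = A[:, cols]                     # actual columns of A
    R = A[rows, :]                     # actual rows of A
    U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)
    return C, U, R

# For a low-rank matrix, sampling enough columns/rows recovers A exactly
A = np.outer(np.arange(1, 5.0), np.arange(1, 7.0))   # rank 1, 4 x 6
C, U, R = cur(A, c=2, r=2)
err = np.linalg.norm(A - C @ U @ R)
```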
CUR discussion
• Good for cases when we can’t interpret latent dimensions
• Directly interpretable factors
• Retains sparsity in the factors
70
Tensor CUR
• Extension of the CUR decomposition to tensors
• Assumes that the third mode is “special”, e.g. time
• Approximates the n1 × n2 × n3 tensor A with a tensor C of size n1 × n2 × c (where c is small)
Mahoney et al. Tensor‐CUR decompositions for tensor‐based data, SIAM JMAA 2008 71
Speeding up and parallelizing tensor decompositions
• Given a large tensor or a matrix-tensor couple, how can we decompose them on a single machine (possibly multi-core)?
• Idea: use sampling and parallelization
  ParCube (ECML-PKDD 2012): approximate, parallel PARAFAC
  Turbo-SMT (SIAM SDM 2014): approximate, parallel coupled matrix-tensor factorization
72
PARCUBE: The big picture
1. Break up the tensor into small pieces using sampling
2. Fit a dense PARAFAC decomposition on the small sampled tensors
3. Match columns and distribute non-zero values to the appropriate indices in the original (non-sampled) space
• Sampling selects a small portion of the indices
• The PARAFAC vectors ai, bi, ci will be sparse by construction
73
Putting the pieces together
• Say we have matrices As from each sample, possibly with re-ordering of the factors
• Each matrix corresponds to a different sampled index set of the original index space
• All factors share the “upper” part (by construction)
• Proposition: under mild conditions, the algorithm will stitch the components correctly & output what exact PARAFAC would
74
Up to 200x speedup
75
Neurosemantics
• Brain scan data*: 9 persons in an fMRI machine were presented with 60 concrete nouns (e.g. “airplane”, “dog”), with a 7s pause between nouns to ‘neutralize’ activity; the data are indexed by nouns, voxels, and questions
*Mitchell et al. Predicting human brain activity associated with the meanings of nouns. Science, 2008. Data available @ http://www.cs.cmu.edu/afs/cs/project/theo‐73/www/science2008/data.html
76
Neurosemantics Results
77
Roadmap
• Motivation & Introduction
• Matlab is great
• Map/Reduce
• Other Distributed Approaches
• Databases
• Sampling
• Streaming & Sketching ←
78
Problem 1
• You are given a series of N numbers
• N is much larger than anything you can store
• You see this series of numbers only once
• Suppose you can store M numbers
• How can you sample M of those numbers uniformly at random?
79
Data Streams
• The previous problem is a data-stream problem
• We are going to see: sketching, streaming algorithms
• Even without the streaming constraint, such algorithms offer useful insights! They make algorithms faster and more space-efficient
80
Reservoir Sampling
• R stores the M numbers of our sample
• The first M numbers that we see are added to R
• After R is full, we need to decide whether to add a sample: for the i-th number of the stream, say S[i], generate a random number j in the range 1…i; if j ≤ M then set R[j] = S[i], otherwise ignore S[i]
• The probability of adding samples to R is decreasing
• One can prove (by induction) that the sample is uniformly random
81
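The procedure above, verbatim in Python (1-indexed draw, 0-based storage):

```python
# Reservoir sampling: store the first M items; for the i-th item
# (1-indexed) draw j uniformly from 1..i and, if j <= M, overwrite
# slot j of the reservoir.
import random

def reservoir_sample(stream, M, seed=None):
    rng = random.Random(seed)
    R = []
    for i, x in enumerate(stream, start=1):
        if i <= M:
            R.append(x)
        else:
            j = rng.randint(1, i)     # uniform over 1..i
            if j <= M:
                R[j - 1] = x          # reservoir is stored 0-based
    return R

sample = reservoir_sample(range(10_000), M=5, seed=42)
# 5 items, each a uniform draw from the 10,000-element stream
```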
Problem 2
• We have a stream of N numbers
• Again, we can't store them
• Say we call them "vector a"
• How can we answer:
   Point queries: e.g., give me a(i)
   Dot products: given two big vectors a and b, what is aTb?
82
CountMin sketch preliminaries
• We have a d×w matrix C
• A set of d hash functions hj: {1…N} → {1…w}
• Vector a is represented in an incremental fashion
   At time t the state of the vector is [a1(t) a2(t) … aN(t)]
   We see updates of its coordinates over time, e.g. update (it, ct): ait(t) = ait(t−1) + ct
83
CountMin Sketch
• When we see update (it, ct):
   For j = 1…d, update C[j, hj(it)] = C[j, hj(it)] + ct
• See Graham Cormode, Count‐Min Sketch, Springer Encyclopedia of Database Systems
84
CountMin at work
• How can I estimate a(i)?
   aest(i) = minj C[j, hj(i)], for j = 1…d
   Error guarantee: with probability 1−δ, aest(i) ≤ a(i) + ε‖a‖1, where w = ⌈e/ε⌉ and d = ⌈ln(1/δ)⌉
• How can I estimate aTb?
   Treat Ca, Cb as d w‑dimensional vectors
   aTb can be estimated as the minimum inner product between corresponding rows of Ca and Cb
   With probability 1−δ, the estimate is at most ε‖a‖1‖b‖1 more than the true value
85
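Putting the last three slides together, a CountMin sketch fits in a few dozen lines. A sketch under simplifying assumptions: per-row salts stand in for the pairwise-independent hash family the analysis assumes, and `CountMinSketch` / `inner_product_estimate` are names chosen here:

```python
import random

class CountMinSketch:
    """d rows of w counters; row j uses its own hash function h_j: item -> {0..w-1}."""

    def __init__(self, d, w, seed=0):
        self.d, self.w = d, w
        self.C = [[0] * w for _ in range(d)]
        rng = random.Random(seed)
        # One random salt per row simulates d independent hash functions
        # (a simplification of the pairwise-independent family in the analysis).
        self._salts = [rng.getrandbits(64) for _ in range(d)]

    def _h(self, j, i):
        return hash((self._salts[j], i)) % self.w

    def update(self, i, c=1):
        # Update (i_t, c_t): for j = 1..d, C[j, h_j(i_t)] += c_t
        for j in range(self.d):
            self.C[j][self._h(j, i)] += c

    def query(self, i):
        # Point query: a_est(i) = min_j C[j, h_j(i)]
        # (never underestimates a(i) when all updates are nonnegative)
        return min(self.C[j][self._h(j, i)] for j in range(self.d))

def inner_product_estimate(ca, cb):
    """Estimate a^T b as the minimum row-wise inner product of two same-shape sketches."""
    assert ca.d == cb.d and ca.w == cb.w
    return min(
        sum(x * y for x, y in zip(ca.C[j], cb.C[j]))
        for j in range(ca.d)
    )
```

The sketch uses O(d·w) counters regardless of N, which is the whole point: both queries only ever touch the small matrix C, never the full vector a.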
Streams of Co-evolving Time-Series
(a) Sensor measurements (b) Hidden variables
Figure 1: Illustration of problem. Sensors measure chlorine in drinking water and show a daily, near-sinusoidal periodicity during phases 1 and 3. During phase 2, some of the sensors are "stuck" due to a major leak. The extra hidden variable introduced during phase 2 captures the presence of a new trend. SPIRIT can also tell us which sensors participate in the new, "abnormal" trend (e.g., close to a construction site). In phase 3, everything returns to normal.
• We are given n sensors
• We record their activity over time
• We would like to track their PCA as new measurements become available
86
SPIRIT
• SPIRIT:
   Adapts the number of principal components k
   Adapts the loadings
   Tracks the scores/hidden variables
   Does all of the above efficiently
Papadimitriou et al. Streaming Pattern Discovery in Multiple Time‐Series, VLDB 2005
87
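SPIRIT's full algorithm (including adapting k) is in the cited paper, but the flavor of tracking a principal direction from a stream can be sketched with an Oja-style update, an assumption here rather than SPIRIT's exact rule; `track_principal_direction` and its step size `eta` are names and choices made for this illustration:

```python
import numpy as np

def track_principal_direction(stream, n, eta=0.01, seed=0):
    """Track the top principal direction of n co-evolving series, one tick at a time."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=n)
    w /= np.linalg.norm(w)            # start from a random unit vector
    for x in stream:                  # x: length-n measurement vector at one time tick
        y = w @ x                     # score / hidden variable for this tick
        w += eta * y * (x - y * w)    # Oja-style update toward the principal direction
        w /= np.linalg.norm(w)        # keep the loading vector unit-norm
    return w
```

Each tick costs O(n), with no matrix ever materialized, which is what makes this kind of tracking viable as new measurements arrive.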
Tensor Stream
Given: a tensor that grows over time,
track its decomposition without re‐computing it from scratch
88
Tensor Streams
• At least two approaches exist:
   Jimeng Sun et al., Incremental Tensor Analysis: Theory and Applications, ACM TKDD 2008 (amongst others, uses SPIRIT)
   Nion & Sidiropoulos, Adaptive Algorithms to Track the PARAFAC Decomposition of a Third‐Order Tensor, IEEE TSP 2009
89
The End
Web: www.cs.cmu.edu/~epapalex
Code: www.cs.cmu.edu/~epapalex/code.html
email: [email protected]
Questions?
90