Big Arctic Data - ku · 2014-04-17
School of Computer Science, Carnegie Mellon University
Big Arctic Data
Evangelos (Vagelis) Papalexakis
School of Computer Science, Carnegie Mellon University
Arctic Analysis 2014, Greenland
Roadmap
• Motivation & Introduction ←
• Matlab is great
• Map/Reduce
• Other Distributed Approaches
• Databases
• Sampling
• Streaming & Sketching
2
Eric Fisher, “See something, say something”
3
http://socialgraph.blogspot.com/2010/12/facebook‐map‐of‐world‐visualising.html
4
How big is big?
Slide adapted from: http://graphlab.com/learn/presentations.html
Picture from: http://web.netenrich.com/Portals/128884/images/FB_SERVER_040_x900.jpg
Need many data centers to store the data:
• 100 hours of video uploaded to YouTube every minute
• 28 million Wikipedia pages
• 1 billion Facebook users
• 6 billion Flickr photos
5
Definition – The 3 V’s
• Volume: hard to store
• Variety: very diverse/rich
• Velocity: coming in faster than we can handle
6
Success story
http://www.forbes.com/sites/kashmirhill/2012/02/16/how‐target‐figured‐out‐a‐teen‐girl‐was‐pregnant‐before‐her‐father‐did/
• Target assigns every customer an ID number, tied to their credit card (or name, or email), and gathers any additional information it can
• A combination of lotions and multivitamins turned out to be a strong predictor of the early stages of pregnancy
• Target figured out that a girl was pregnant before her father did: it sent her flyers with baby-related merchandise, and her father was furious. After a pregnancy test, the family found out that the girl was indeed pregnant. More impressive: Target was able to estimate the due date somewhat accurately
7
Success Story 2
• Google Translate uses large-scale “dirty” data instead of hoping for high-quality annotated data
• Many training instances are found “in the wild”: let the data guide the machine translation instead of using very complicated models
Alon Halevy et al. The Unreasonable Effectiveness of Data, IEEE Intelligent Systems 2009
8
But I don’t have that much data! Why should I care?
• Even with small/medium data, one can benefit by borrowing these ideas: speeding up algorithms and improving memory efficiency
9
Roadmap
• Motivation & Introduction
• Matlab is great ←
• Map/Reduce
• Other Distributed Approaches
• Databases
• Sampling
• Streaming & Sketching
10
Matlab is great!
• Powerful tool
• Great implementations of matrix algorithms: eigendecomposition, Singular Value Decomposition, basic matrix operations
• Vector-based operations
• Instant “debugging” by plotting
• All of the above make it a great prototyping tool for math-intensive data analysis
11
Data representation matters
• The original data size is many times deceptive
• The data that we analyze ends up being much smaller in terms of the storage necessary and the number of observations
• Need to represent the data carefully
12
RAW DATA: Liquid-Chromatography Mass-Spectrometry (LC-MS) measurements are usually converted into a set of peaks and treated as two-way arrays, i.e., mixtures by peaks, where each peak is a (m/z, retention time) pair. The original raw data is a three-way array (29 mixtures × 18989 m/z values × 1054 retention times, only 1.55% dense), and we can explore its underlying structure by taking advantage of sparsity: dense storage takes ~4.4 GB, sparse storage only ~275 MB. Note that this is a very small data set with only 29 samples!
Slide borrowed from Evrim Acar
13
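The storage gap on the previous slide is simple arithmetic; a quick back-of-the-envelope check in Python (assuming 8-byte floating-point values and, for the sparse case, a coordinate/COO layout with three 8-byte indices plus the value per nonzero) reproduces the orders of magnitude:

```python
# Back-of-the-envelope storage for the 29 x 18989 x 1054 LC-MS tensor.

def dense_bytes(dims, bytes_per_value=8):
    n = 1
    for d in dims:
        n *= d
    return n * bytes_per_value

def sparse_bytes(dims, density, bytes_per_entry=32):
    # bytes_per_entry: three 8-byte indices + one 8-byte value (COO layout)
    n = 1
    for d in dims:
        n *= d
    return int(n * density) * bytes_per_entry

dims = (29, 18989, 1054)
print(f"dense : {dense_bytes(dims) / 1e9:.2f} GB")           # ~4.6e9 bytes, the slide's ~4.4 GB (GiB)
print(f"sparse: {sparse_bytes(dims, 0.0155) / 1e6:.0f} MB")  # same order as the slide's ~275 MB
```

The exact sparse figure depends on the index width (int32 vs int64), which is why the result only matches the slide to within tens of MB.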
Sparse vs. dense storage

Tensor Toolbox
• Matlab toolbox for tensor computations
• Support for sparse tensor storage and computation: Matlab does not inherently support that; careful implementation of sparse computations for efficiency [1]
• Available at http://www.sandia.gov/~tgkolda/TensorToolbox/index‐2.5.html
[1] Bader & Kolda, Efficient MATLAB computations with sparse and factored tensors, SIAM JSC’07
14
Matlab Parallel Computing Toolbox
• Support for parallel computations
• Provides a “parallel for” loop (parfor): shared-memory parallel execution; loop iterations have to be independent, so you need to write them carefully… but it pays off!
• Can run code on multiple cores/CPUs or even clusters
• Can run random restarts of an algorithm in parallel
• Later today: an example of using the above with sampling for fast PARAFAC
15
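The same pattern is easy to sketch outside Matlab; the snippet below imitates parfor-style independent random restarts with a Python thread pool. The toy objective (x − 3)² is made up for illustration; only the pattern (independent iterations, keep the best result) is the point:

```python
# A rough in-Python analogue of parfor: farm independent random restarts
# of a toy fitting routine out to a pool of workers and keep the best.
from concurrent.futures import ThreadPoolExecutor
import random

def one_restart(seed):
    """One independent restart: random start, crude gradient descent on (x - 3)^2."""
    rng = random.Random(seed)
    x = rng.uniform(-10, 10)
    for _ in range(100):
        x -= 0.2 * (x - 3)        # gradient step on the toy objective
    return ((x - 3) ** 2, x)

def best_of_restarts(n_restarts):
    # like parfor: the iterations must not depend on one another
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(one_restart, range(n_restarts)))
    return min(results)           # (loss, x) with the smallest loss

loss, x = best_of_restarts(4)
```

For CPU-bound Matlab-style workloads a process pool is closer to parfor's behavior; a thread pool keeps the sketch portable.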
Roadmap
• Motivation & Introduction
• Matlab is great
• Map/Reduce ←
• Other Distributed Approaches
• Databases
• Sampling
• Streaming & Sketching
16
Map/Reduce Motivation
• Developed by Google
  Many terabytes of crawled webpages (mainly text)
  Need to create an inverted index: for each word, find how many documents contain it; useful for web search
  Many “cheap”/commodity machines at their disposal: faulty and not efficient as units, but potentially very powerful combined together
17
The Map/Reduce Framework
• Map/Reduce:
  Provides a distributed file system (GFS – the Google File System) where files are stored in the cloud
  Sees everything as <key, value> pairs
  Provides a Map() function; the system gathers data records with the same key on one worker machine
  Provides a Reduce() function, which tells the system how to combine the values of all records with the same key
18
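The canonical example of this <key, value> view is word count; here is a small in-process simulation (plain Python, no Hadoop) where the shuffle step is played by a dictionary grouping values by key:

```python
# Word count as MapReduce, simulated in one process: map_fn emits
# <word, 1> pairs, the "shuffle" groups pairs by key, reduce_fn sums.
from collections import defaultdict

def map_fn(document):
    for word in document.split():
        yield (word, 1)

def reduce_fn(word, counts):
    return (word, sum(counts))

def mapreduce(documents):
    groups = defaultdict(list)
    for doc in documents:                  # map phase
        for key, value in map_fn(doc):
            groups[key].append(value)      # shuffle: same key -> same worker
    return dict(reduce_fn(k, v) for k, v in groups.items())  # reduce phase

counts = mapreduce(["big arctic data", "big data"])
# counts == {"big": 2, "arctic": 1, "data": 2}
```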
The Map/Reduce Framework
• Abstracts the computation into a Map() & Reduce() pair
• Can have chains of Map/Reduce operations; most non-elementary algorithms need more than one Map/Reduce operation!
• The programmer does not need to know details about the cluster
19
Apache Hadoop
• Open-source M/R implementation by Apache
• Provides HDFS (the Hadoop Distributed File System)
• Mostly programmed in Java & Python
20
Hadoop’s inner workings by example
Image from: http://blog.trifork.com/2009/08/04/introduction‐to‐hadoop/
21
Map function
Image from: http://blog.trifork.com/2009/08/04/introduction‐to‐hadoop/
22
Reduce function
Image from: http://blog.trifork.com/2009/08/04/introduction‐to‐hadoop/
23
Putting it all together
Image from: http://blog.trifork.com/2009/08/04/introduction‐to‐hadoop/
24
Matrix Multiplication in Hadoop
• A slightly more complicated example
• We have two matrices A (m×n) and B (n×p), stored in a single file as (matrix, row, col, value) records:
• How can we multiply them on Hadoop?
A 0 0 5
A 0 1 4
A 1 1 2
B 0 0 7
B 0 1 1
…
25
Map()
Input: the (matrix, row, col, value) records from the previous slide
26
http://importantfish.com/one‐step‐matrix‐multiplication‐with‐hadoop/
Key ideas:
• The mapper has to emit <k, v> pairs with (i, k) as the key
• (i, k) indexes a single value of A*B
• The inner index j stays fixed
• For entries of A it iterates over the output column index k; for entries of B it iterates over the m output row indices i
Reduce()
27
Output: the (i, k)-th entries of A*B
http://importantfish.com/one‐step‐matrix‐multiplication‐with‐hadoop/
Key ideas:
• Each reducer works on one element (i, k) of A*B
• It collects all a_ij and b_jk with i and k fixed
• It computes the sum of products for the (i, k)-th element
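The whole one-step scheme can be simulated in-process; the sketch below (plain Python, no Hadoop) follows the mapper/reducer key ideas above. The B entries hidden behind the slide's “…” are invented here to complete a 2×2 example:

```python
# In-process simulation of the one-step MapReduce matrix multiply: the
# mapper replicates each A(i, j) entry to every output key (i, k) and
# each B(j, k) entry to every key (i, k); the reducer for key (i, k)
# sums a_ij * b_jk over the shared index j.
from collections import defaultdict

def mat_mul_mapreduce(entries, n_rows_a, n_cols_b):
    """entries: ('A', i, j, value) or ('B', j, k, value) records."""
    groups = defaultdict(list)
    for tag, r, c, v in entries:                     # map phase
        if tag == 'A':                               # A entry (i=r, j=c)
            for k in range(n_cols_b):
                groups[(r, k)].append(('A', c, v))
        else:                                        # B entry (j=r, k=c)
            for i in range(n_rows_a):
                groups[(i, c)].append(('B', r, v))
    result = {}
    for (i, k), vals in groups.items():              # reduce phase
        a = {j: v for tag, j, v in vals if tag == 'A'}
        b = {j: v for tag, j, v in vals if tag == 'B'}
        s = sum(a[j] * b[j] for j in a if j in b)
        if s:
            result[(i, k)] = s
    return result

# A's entries come from the slide; the B entries after the slide's "..."
# are made up for illustration.
entries = [('A', 0, 0, 5), ('A', 0, 1, 4), ('A', 1, 1, 2),
           ('B', 0, 0, 7), ('B', 0, 1, 1), ('B', 1, 0, 3), ('B', 1, 1, 6)]
result = mat_mul_mapreduce(entries, n_rows_a=2, n_cols_b=2)
# A = [[5, 4], [0, 2]], B = [[7, 1], [3, 6]]  =>  A*B = [[47, 29], [6, 12]]
```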
Why should I care?
• Easy to program: no need to be a C++/MPI expert! No parallel-programming knowledge needed!
• Portable: anything you write runs on any Hadoop cluster
• Scalable: you can run your code on 1 or 1000 machines without changes!
• Fault tolerant: even when cluster nodes fail, the job finishes
28
Shortcomings
• If the data fits in memory, Map/Reduce can be much slower than in-memory approaches!
• Iterative algorithms: at the end of every iteration M/R has to write things to HDFS (disk), and at the beginning of every iteration it has to read them back from HDFS. This slows down iterative algorithms!!
  Ways around it: Haloop https://code.google.com/p/haloop/ and Twister http://www.iterativemapreduce.org/
29
Applications
• Graph Mining: Pegasus, HEigen
• Machine Learning: Mahout
• Tensor Analysis: GigaTensor
30
Graph Mining
• We are given a graph, e.g. who-talks-to-whom
• In graph mining we are interested in:
  Finding regular patterns in the graph: degree distribution of nodes, graph diameter, # connected components, # triangles, clustering coefficient, PageRank
  Finding anomalies: nodes that are “special”, e.g. potential spammers/fraudsters in our example
31
Pegasus
• Many graph mining tasks can be reduced to a “generalized matrix-vector product”:
  Relax multiply to combine
  Relax sum to aggregate
  Different choices for combine & aggregate give us different graph features:
  PageRank: combine = multiply, aggregate = sum
  Connected components: combine = multiply, aggregate = min
• Pegasus introduces the above abstraction and provides an efficient & scalable Hadoop implementation. Project page: http://www.cs.cmu.edu/~pegasus/
32
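A toy version of the generalized matrix-vector product makes the abstraction concrete; below, combine = multiply and aggregate = min yields connected components by label propagation, as on the slide (the small graph and the plain-dict representation are illustrative):

```python
# Toy "generalized matrix-vector product": swap (multiply, sum) for
# (combine, aggregate). With combine = multiply and aggregate = min,
# repeated products propagate the minimum node id -> connected components.

def generalized_matvec(adj, x, combine, aggregate):
    """adj: {node: [neighbor, ...]} (unweighted, so every edge weight is 1)."""
    return {i: aggregate([combine(1, x[j]) for j in nbrs] + [x[i]])
            for i, nbrs in adj.items()}

def connected_components(adj):
    labels = {i: i for i in adj}              # each node starts with its own id
    while True:
        new = generalized_matvec(adj, labels, lambda w, v: w * v, min)
        if new == labels:
            return labels
        labels = new

# Two components: {0, 1, 2} and {3, 4}
adj = {0: [1], 1: [0, 2], 2: [1], 3: [4], 4: [3]}
components = connected_components(adj)
# every node ends up labeled with the smallest id in its component
```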
Pegasus
1.4M nodes, 6.3M edges
33
Pegasus
34
Triangle Counting
• Triangle: a set of three nodes connected to each other. E.g. two people get introduced by a mutual friend at a party, completing a triangle in the social network
• Triangle counts: an unusual number of triangles around a node can indicate fraudsters/spammers
• There is a direct relation between the #triangles and the eigenvalue decomposition of the adjacency matrix of the graph
35
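The relation alluded to in the last bullet is that, for an undirected graph, #triangles = trace(A³)/6 = (1/6) Σ λᵢ³ over the adjacency eigenvalues. A quick numpy check on a small graph:

```python
# Triangle counting from adjacency eigenvalues.
import numpy as np

def triangle_count(A):
    """#triangles = trace(A^3)/6 = (1/6) * sum of adjacency eigenvalues cubed."""
    eigvals = np.linalg.eigvalsh(A)      # A is symmetric -> real eigenvalues
    return int(round(np.sum(eigvals ** 3) / 6))

A = np.ones((4, 4)) - np.eye(4)          # K4: every pair of 4 nodes connected
n_triangles = triangle_count(A)          # C(4,3) = 4 triangles
```

At billion-node scale one only needs the top few eigenvalues, which is exactly what HEigen (next slide) computes.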
HEigen – Eigenvalue Decomposition
• Scalable tool for computing the eigenvalue decomposition
• Uses the Lanczos algorithm with Selective Orthogonalization
• Uses selective parallelization to choose which subtasks to parallelize: the Frobenius norm & small intermediate eigendecompositions are run locally
U Kang et al, Spectral Analysis for Billion‐Scale Graphs: Discoveries and Implementation, PAKDD 2011
36
HEigen
U Kang et al, Spectral Analysis for Billion‐Scale Graphs: Discoveries and Implementation, PAKDD 2011
37
Machine Learning – Mahout
• Apache’s Hadoop machine learning toolbox: matrix factorization (SVD, NMF), k-means clustering, topic modeling (LDA), logistic regression, naïve Bayes classification, and many more
• Download at: https://mahout.apache.org/
38
School of Computer Science, Carnegie Mellon University
GigaTensor: Scaling Tensor Analysis Up By 100 Times –Algorithms and Discoveries
U Kang, Evangelos Papalexakis, Abhay Harpale, Christos Faloutsos
Motivation
• Suppose we have knowledge-base data, e.g. the Read the Web project / Never-Ending Language Learner (NELL) at CMU: subject–verb–object triplets mined from the web
• Many gigabytes of data! How do we find potential new synonyms of a word using this knowledge base?
• Working problem: the NELL dataset (24M subjects, 24M objects, 46M verbs)
40
CP/PARAFAC decomposition
• Decompose X into a sum of F rank-one tensors:
  X ≈ a1 ∘ b1 ∘ c1 + … + aF ∘ bF ∘ cF
• Objective function: min over A, B, C of ‖X − Σ_{f=1}^{F} a_f ∘ b_f ∘ c_f‖²
41
ALS algorithm for CP/PARAFAC
• The objective function is non-convex!
• But it is linear in each of the variables
• Most popular approach: Alternating Least Squares (ALS)
  Fix B, C and optimize for A
  Fix A, C and optimize for B
  Fix A, B and optimize for C
• A block coordinate descent algorithm with monotone convergence to a local optimum
42
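A minimal numpy sketch of this ALS loop (dense tensors, fixed iteration count, no regularization or convergence check, so it is only a teaching-sized illustration of the three alternating least-squares solves):

```python
# Minimal ALS for a rank-F CP/PARAFAC of a dense I x J x K numpy array.
# Each step fixes two factor matrices and solves a linear least-squares
# problem for the third -- exactly the alternation described above.
import numpy as np

def khatri_rao(C, B):
    """Column-wise Kronecker product, shape (K*J) x F."""
    F = B.shape[1]
    return np.stack([np.kron(C[:, f], B[:, f]) for f in range(F)], axis=1)

def cp_als(X, F, n_iters=20, seed=0):
    I, J, K = X.shape
    rng = np.random.default_rng(seed)
    B = rng.standard_normal((J, F))
    C = rng.standard_normal((K, F))
    # mode-n unfoldings, column ordering chosen to match khatri_rao above
    X1 = X.transpose(0, 2, 1).reshape(I, K * J)
    X2 = X.transpose(1, 2, 0).reshape(J, K * I)
    X3 = X.transpose(2, 1, 0).reshape(K, J * I)
    for _ in range(n_iters):
        A = np.linalg.lstsq(khatri_rao(C, B), X1.T, rcond=None)[0].T  # fix B, C
        B = np.linalg.lstsq(khatri_rao(C, A), X2.T, rcond=None)[0].T  # fix A, C
        C = np.linalg.lstsq(khatri_rao(B, A), X3.T, rcond=None)[0].T  # fix A, B
    return A, B, C

# Sanity check: an exactly rank-1 tensor is recovered with F = 1
a, b, c = np.arange(1, 4.0), np.arange(1, 5.0), np.arange(1, 6.0)
X = np.einsum('i,j,k->ijk', a, b, c)
A, B, C = cp_als(X, F=1)
X_hat = np.einsum('if,jf,kf->ijk', A, B, C)
```

Note that materializing khatri_rao(C, B) explicitly is precisely the "intermediate data explosion" the next slide warns about; it is fine at this toy scale only.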
ALS Zoom-In: Intermediate Data Explosion
• Unfold/matricize X into X(1); the CP/PARAFAC property gives X(1) ≈ A (C ⊙ B)ᵀ, where ⊙ is the Khatri-Rao product:
  (C ⊙ B) = [C(:,1) ⊗ B(:,1) … C(:,F) ⊗ B(:,F)], a JK × F matrix (⊗ is the Kronecker product)
• (C ⊙ B) can be very large; materializing it is a showstopper!
• This is the intermediate data explosion; the same issue arises for B and C!
43
Main Idea
• Avoiding the intermediate data explosion
• Size of intermediate data (NELL): naïve, 100 PB (before); proposed, 1.5 GB (after)
44
Results
• GigaTensor solved 100× larger problems than the current state of the art, which runs out of memory
45
BREAK
46
Roadmap
• Motivation & Introduction
• Matlab is great
• Map/Reduce
• Other Distributed Approaches ←
• Databases
• Sampling
• Streaming & Sketching
47
Other Distributed Approaches
• Map/Reduce has certain flaws
• What if we incorporate knowledge about the problem into the computational model?
• Three approaches (with a graph flavor): GraphLab, Pregel, GraphChi
48
GraphLab
• Map/Reduce is perfect for embarrassingly data-parallel computations; WordCount is a good example (no data dependencies)
• In ML applications there usually are data dependencies and iterative algorithms
• GraphLab expresses data dependencies as a graph and performs computations distributed over that graph
49
GraphLab: High-level idea
• Update: analogous to Map(), but unlike Map() it can also be applied to overlapping pieces of the problem
• Sync: analogous to Reduce(); it also applies to overlapping parts of the problem
Yucheng Low et al. Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud VLDB 2012 50
GraphLab Applications
• Not restricted to graph computations
• Can express many problems this way: least-squares regression, Lasso regression, matrix factorization
• Active community (software package and annual conference): http://graphlab.com/index.html
51
Pregel
• Google’s response to graph computations
• Vertex-centric computation: a vertex can receive messages, send messages to other vertices, modify its state, and modify the graph topology
• Can express algorithms such as PageRank or shortest paths this way
• Very scalable (runs on Google’s various graphs)
• Easy to program (15 lines of code for PageRank)
• Internal to Google
Grzegorz Malewicz et al. Pregel: A System for Large‐Scale Graph Processing, SIGMOD’10
52
GraphChi
• GraphLab & Pregel run on clusters; what about a single machine?
• GraphChi: a single machine with disk-based (local) storage; breaks a large graph into small parts and uses parallel sliding windows to process them
• Performance comparable to distributed approaches!
Aapo Kyrola et al. GraphChi: Large‐Scale Graph Computation on Just a PC, USENIX’12
53
Roadmap
• Motivation & Introduction
• Matlab is great
• Map/Reduce
• Other Distributed Approaches
• Databases ←
• Sampling
• Streaming & Sketching
54
Databases
• (Relational) database systems store data in “relations” (tables) and let you issue queries on the data, typically in SQL (Structured Query Language)
• E.g. a table STUDENT with entries (name, student_id, gpa)
• Find all students with gpa >= 3.5:
  SELECT * FROM STUDENT WHERE gpa >= 3.5;
55
Example
STUDENT:
Name | Student_id | GPA
Rasmus | 1 | 4
Evrim | 2 | 4
Vagelis | 3 | 3

SELECT * FROM STUDENT WHERE gpa >= 3.5;

Result:
Name | Student_id | GPA
Rasmus | 1 | 4
Evrim | 2 | 4
56
Joins of two tables
• We have two tables: STUDENT(name, student_id, gpa) and TAKES_CLASS(student_id, class_name)
• We can ask: what do students with gpa >= 3.5 take?
  SELECT DISTINCT(class_name) FROM STUDENT
  JOIN TAKES_CLASS ON STUDENT.student_id = TAKES_CLASS.student_id
  WHERE STUDENT.gpa >= 3.5;
57
Example
STUDENT:
Name | Student_id | GPA
Rasmus | 1 | 4
Evrim | 2 | 4
Vagelis | 3 | 3

TAKES_CLASS:
Class_name | Student_id
Chemometrics 101 | 1
Databases 201 | 1
Chemometrics 101 | 2
Chemometrics 101 | 3

SELECT DISTINCT(class_name) FROM STUDENT JOIN TAKES_CLASS ON STUDENT.student_id = TAKES_CLASS.student_id WHERE STUDENT.gpa >= 3.5;

Result:
Class_name
Chemometrics 101
Databases 201
58
Databases
• That’s all nice… but why would we want to use a DBMS for data analysis?
59
Matrix operations in a DBMS
• Say that we have two matrices A, B
• Store them in a DB as A(row, col, value) and B(row, col, value)
• Then:
  SELECT A.row, B.col, SUM(A.value*B.value)
  FROM A JOIN B ON A.col = B.row
  GROUP BY A.row, B.col;
• gives us A*B!
• http://stackoverflow.com/questions/6582191/sql‐query‐for‐multiplication
60
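The query runs essentially as-is on any SQL engine; for instance, on SQLite through Python's built-in sqlite3 module (the 2×2 matrices are illustrative):

```python
# The slide's SQL matrix multiply, run on an in-memory SQLite database.
# "row" is quoted because it is a keyword in some SQL dialects.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE A ("row" INT, col INT, value REAL)')
conn.execute('CREATE TABLE B ("row" INT, col INT, value REAL)')
conn.executemany("INSERT INTO A VALUES (?, ?, ?)",
                 [(0, 0, 1), (0, 1, 2), (1, 0, 3), (1, 1, 4)])
conn.executemany("INSERT INTO B VALUES (?, ?, ?)",
                 [(0, 0, 5), (0, 1, 6), (1, 0, 7), (1, 1, 8)])
rows = conn.execute('''
    SELECT A."row", B.col, SUM(A.value * B.value)
    FROM A JOIN B ON A.col = B."row"
    GROUP BY A."row", B.col
''').fetchall()
product = {(r, c): v for r, c, v in rows}
# A = [[1, 2], [3, 4]], B = [[5, 6], [7, 8]]  =>  A*B = [[19, 22], [43, 50]]
```

Note that only nonzero entries need to be stored, so this is effectively a sparse matrix multiply for free.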
What else can we do?
• We can find eigenvectors of a matrix A
• Simply do power iteration: start with a random x and iterate x(i) = A*x(i-1) until x converges
• This is a series of matrix-vector multiplications, and SQL can do that
61
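The power iteration itself, written here with numpy matrix-vector products (in the SQL setting each step would be one matvec query; a normalization step is added to keep the values bounded):

```python
# Power iteration: repeated matrix-vector products converge to the
# dominant eigenvector of A.
import numpy as np

def power_iteration(A, n_iters=200, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(A.shape[0])
    for _ in range(n_iters):
        x = A @ x                     # one matvec per step (one query in SQL)
        x /= np.linalg.norm(x)        # normalize so values stay bounded
    return x

A = np.array([[2.0, 1.0], [1.0, 2.0]])   # eigenvalues 3 and 1
v = power_iteration(A)
# v is (up to sign) the dominant eigenvector, proportional to [1, 1]
```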
Why should I care?
• Re-usable: write a library of queries, use it at will
• Portable: SQL is a standard, so any DBMS supports basic SQL operations
• Scalable: DBMSs are the industrial workhorses, optimized for efficiency & speed
62
NoSQL
• Traditional RDBMSs implement things like concurrency control and data integrity
• These are necessary when doing DB transactions: e.g. a bank DB needs to make sure that all transactions are either committed or rolled back, and the data should stay consistent
• They are not really necessary for data analysis, where the data is usually immutable
63
NoSQL
• Drop the concurrency control and the data integrity constraints; what’s left is a NoSQL system
• NoSQL sometimes means “Not only SQL”: some NoSQL systems support SQL-like queries, others have their own language
64
SciDB
• Data management and analysis system
• Minimal support for transactions
• Data is stored as vectors
• Provides a high-level front-end: currently in R, soon in Python, Matlab etc.
• All computation & storage take place in the database server
65
Roadmap
• Motivation & Introduction
• Matlab is great
• Map/Reduce
• Other Distributed Approaches
• Databases
• Sampling ←
• Streaming & Sketching
66
Sampling
• Very powerful technique: reduces data size
• If done carefully, preserves data characteristics
• Able to speed/scale up computations with a small price to pay
• Today: CUR decomposition, TensorCUR, ParCube
67
Analysis using SVD
A ≈ U Σ Vᵀ, e.g. a users × products matrix A factored through latent groups: U is users × latent groups, V is products × latent groups
• Sometimes it is hard to interpret the columns of U, V; they might not directly correspond to something in the data
• (Alternative) CUR decomposition: instead of a latent approximation, use actual columns & rows of A
68
CUR Decomposition
A ≈ C U R
• C contains columns of A sampled at random
• R contains rows of A sampled at random
• U = pinv(C)*A*pinv(R)
• If A is sparse then C, R are sparse too! (Not true for the SVD factors)
Mahoney et al. CUR matrix decompositions for improved data analysis, PNAS 2009
69
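A minimal numpy sketch of the decomposition, using uniform sampling for simplicity (the cited paper samples columns and rows by leverage scores, so this illustrates the shape of the method, not the paper's algorithm):

```python
# Randomized CUR sketch: C and R are actual columns/rows of A,
# U = pinv(C) @ A @ pinv(R) as on the slide. Uniform sampling is an
# illustrative simplification of leverage-score sampling.
import numpy as np

def cur(A, c, r, seed=0):
    rng = np.random.default_rng(seed)
    cols = rng.choice(A.shape[1], size=c, replace=False)
    rows = rng.choice(A.shape[0], size=r, replace=False)
    C = A[:, cols]                     # actual columns of A
    R = A[rows, :]                     # actual rows of A
    U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)
    return C, U, R

# For a low-rank matrix, sampling enough columns/rows recovers A exactly
A = np.outer(np.arange(1, 5.0), np.arange(1, 7.0))   # rank 1, 4 x 6
C, U, R = cur(A, c=2, r=2)
err = np.linalg.norm(A - C @ U @ R)
```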
CUR discussion
• Good for cases when we can’t interpret latent dimensions
• Directly interpretable factors
• Retains sparsity in the factors
70
Tensor CUR
• Extension of the CUR decomposition to tensors
• Assumes that the third mode is “special”, e.g. time
• Approximates the n1 × n2 × n3 tensor A with a tensor C of size n1 × n2 × c (where c is small)
Mahoney et al. Tensor‐CUR decompositions for tensor‐based data, SIAM JMAA 2008 71
Speeding up and parallelizing tensor decompositions
• Given a large tensor or a matrix-tensor couple, how can we decompose them on a single machine (possibly multi-core)?
• Idea: use sampling and parallelization
  ParCube (ECML-PKDD 2012): approximate, parallel PARAFAC
  Turbo-SMT (SIAM SDM 2014): approximate, parallel coupled matrix-tensor factorization
72
PARCUBE: The big picture
1. Break up the tensor into small pieces using sampling
2. Fit a dense PARAFAC decomposition on the small sampled tensors
3. Match columns and distribute non-zero values to the appropriate indices in the original (non-sampled) space
• Sampling selects a small portion of the indices
• The PARAFAC vectors ai, bi, ci will be sparse by construction
73
Putting the pieces together
• Say we have matrices As from each sample, possibly with re-ordering of the factors
• Each matrix corresponds to a different sampled index set of the original index space
• All factors share the “upper” part (by construction)
• Proposition: under mild conditions, the algorithm will stitch the components correctly & output what exact PARAFAC would
74
Up to 200x speedup
75
Neurosemantics
• Brain scan data*: 9 persons in an fMRI machine were presented with 60 concrete nouns (e.g. “airplane”, “dog”), with a 7s pause between nouns to ‘neutralize’ activity; the data are indexed by nouns, voxels, and questions
*Mitchell et al. Predicting human brain activity associated with the meanings of nouns. Science, 2008. Data available @ http://www.cs.cmu.edu/afs/cs/project/theo‐73/www/science2008/data.html
76
Neurosemantics Results
77
Roadmap
• Motivation & Introduction
• Matlab is great
• Map/Reduce
• Other Distributed Approaches
• Databases
• Sampling
• Streaming & Sketching ←
78
Problem 1
• You are given a series of N numbers
• N is much larger than anything you can store
• You see this series of numbers only once
• Suppose you can store M numbers
• How can you sample M of those numbers uniformly at random?
79
Data Streams
• The previous problem is a data-stream problem
• We are going to see: sketching, streaming algorithms
• Even without the streaming constraint, such algorithms offer useful insights! They make algorithms faster and more space-efficient
80
Reservoir Sampling
• R stores the M numbers of our sample
• The first M numbers that we see are added to R
• After R is full, we need to decide whether to add a sample: for the i-th number of the stream, say S[i], generate a random number j in the range 1…i; if j ≤ M then set R[j] = S[i], otherwise ignore S[i]
• The probability of adding samples to R is decreasing
• One can prove (by induction) that the sample is uniformly random
81
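The procedure above, verbatim in Python (1-indexed draw, 0-based storage):

```python
# Reservoir sampling: store the first M items; for the i-th item
# (1-indexed) draw j uniformly from 1..i and, if j <= M, overwrite
# slot j of the reservoir.
import random

def reservoir_sample(stream, M, seed=None):
    rng = random.Random(seed)
    R = []
    for i, x in enumerate(stream, start=1):
        if i <= M:
            R.append(x)
        else:
            j = rng.randint(1, i)     # uniform over 1..i
            if j <= M:
                R[j - 1] = x          # reservoir is stored 0-based
    return R

sample = reservoir_sample(range(10_000), M=5, seed=42)
# 5 items, each a uniform draw from the 10,000-element stream
```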
Problem 2
• We have a stream of N numbers
• Again, we can't store them
• Say we call them "vector a"
• How can we answer:
   Point queries: e.g., give me a(i)
   Dot products: given two big vectors a and b, what is aTb?
82
CountMin sketch preliminaries
• We have a d×w matrix C
• A set of d hash functions hj: {1…N} → {1…w}
• Vector a is represented in an incremental fashion
   At time t the state of the vector is [a1(t) a2(t) … aN(t)]
   We see updates of its coordinates over time, e.g. update (it, ct): ait(t) = ait(t−1) + ct
83
CountMin Sketch
• When we see update (it, ct):
   For j = 1…d, update C[j, hj(it)] = C[j, hj(it)] + ct
• See Graham Cormode, Count‐Min Sketch, Springer Encyclopedia of Database Systems
84
CountMin at work
• How can I estimate a(i)?
   aest(i) = minj C[j, hj(i)], for j = 1…d
   Error guarantee: with probability 1−δ, aest(i) ≤ a(i) + ε‖a‖1, where w = ⌈e/ε⌉ and d = ⌈ln(1/δ)⌉
• How can I estimate aTb?
   Treat Ca, Cb as d w‑dimensional vectors
   aTb can be estimated as the minimum inner product between corresponding rows of Ca and Cb
   With probability 1−δ, the estimate is at most ε‖a‖1‖b‖1 more than the true value
85
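Putting the last three slides together, a CountMin sketch fits in a few dozen lines. A sketch under simplifying assumptions: per-row salts stand in for the pairwise-independent hash family the analysis assumes, and `CountMinSketch` / `inner_product_estimate` are names chosen here:

```python
import random

class CountMinSketch:
    """d rows of w counters; row j uses its own hash function h_j: item -> {0..w-1}."""

    def __init__(self, d, w, seed=0):
        self.d, self.w = d, w
        self.C = [[0] * w for _ in range(d)]
        rng = random.Random(seed)
        # One random salt per row simulates d independent hash functions
        # (a simplification of the pairwise-independent family in the analysis).
        self._salts = [rng.getrandbits(64) for _ in range(d)]

    def _h(self, j, i):
        return hash((self._salts[j], i)) % self.w

    def update(self, i, c=1):
        # Update (i_t, c_t): for j = 1..d, C[j, h_j(i_t)] += c_t
        for j in range(self.d):
            self.C[j][self._h(j, i)] += c

    def query(self, i):
        # Point query: a_est(i) = min_j C[j, h_j(i)]
        # (never underestimates a(i) when all updates are nonnegative)
        return min(self.C[j][self._h(j, i)] for j in range(self.d))

def inner_product_estimate(ca, cb):
    """Estimate a^T b as the minimum row-wise inner product of two same-shape sketches."""
    assert ca.d == cb.d and ca.w == cb.w
    return min(
        sum(x * y for x, y in zip(ca.C[j], cb.C[j]))
        for j in range(ca.d)
    )
```

The sketch uses O(d·w) counters regardless of N, which is the whole point: both queries only ever touch the small matrix C, never the full vector a.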
Streams of Co-evolving Time-Series
(a) Sensor measurements (b) Hidden variables
Figure 1: Illustration of problem. Sensors measure chlorine in drinking water and show a daily, near-sinusoidal periodicity during phases 1 and 3. During phase 2, some of the sensors are "stuck" due to a major leak. The extra hidden variable introduced during phase 2 captures the presence of a new trend. SPIRIT can also tell us which sensors participate in the new, "abnormal" trend (e.g., close to a construction site). In phase 3, everything returns to normal.
• We are given n sensors
• We record their activity over time
• We would like to track their PCA as new measurements become available
86
SPIRIT
• SPIRIT:
   Adapts the number of principal components k
   Adapts the loadings
   Tracks the scores/hidden variables
   Does all of the above efficiently
Papadimitriou et al. Streaming Pattern Discovery in Multiple Time‐Series, VLDB 2005
87
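SPIRIT's full algorithm (including adapting k) is in the cited paper, but the flavor of tracking a principal direction from a stream can be sketched with an Oja-style update, an assumption here rather than SPIRIT's exact rule; `track_principal_direction` and its step size `eta` are names and choices made for this illustration:

```python
import numpy as np

def track_principal_direction(stream, n, eta=0.01, seed=0):
    """Track the top principal direction of n co-evolving series, one tick at a time."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=n)
    w /= np.linalg.norm(w)            # start from a random unit vector
    for x in stream:                  # x: length-n measurement vector at one time tick
        y = w @ x                     # score / hidden variable for this tick
        w += eta * y * (x - y * w)    # Oja-style update toward the principal direction
        w /= np.linalg.norm(w)        # keep the loading vector unit-norm
    return w
```

Each tick costs O(n), with no matrix ever materialized, which is what makes this kind of tracking viable as new measurements arrive.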
Tensor Stream
Given: a tensor that grows over time,
track its decomposition without re‐computing it from scratch
88
Tensor Streams
• At least two approaches exist:
   Jimeng Sun et al., Incremental Tensor Analysis: Theory and Applications, ACM TKDD 2008 (amongst others, uses SPIRIT)
   Nion & Sidiropoulos, Adaptive Algorithms to Track the PARAFAC Decomposition of a Third‐Order Tensor, IEEE TSP 2009
89
The End
Web: www.cs.cmu.edu/~epapalex
Code: www.cs.cmu.edu/~epapalex/code.html
email: [email protected]
Questions?
90