15-826: Multimedia Databases and Data Miningchristos/courses/826.S10/FOILS-pdf/620_ICA.pdf15-826 (c)...
Transcript of 15-826: Multimedia Databases and Data Miningchristos/courses/826.S10/FOILS-pdf/620_ICA.pdf15-826 (c)...
1
CMU SCS
15-826 (c) C. Faloutsos and J-Y Pan (2010) #1
15-826: Multimedia Databases and Data Mining
BONUS LECTURE#2: Independent Component Analysis (ICA)
Jia-Yu Pan and Christos Faloutsos
NOT in Final exam
CMU SCS
15-826 (c) C. Faloutsos and J-Y Pan (2010) #2
Outline • Motivation • Formulation • PCA and ICA • Example applications • Conclusion
CMU SCS
15-826 (c) C. Faloutsos and J-Y Pan (2010) #3
Motivation: (Q1) Find patterns in data
• Motion capture data: broad jumps
Left Knee
Right Knee
Energy exerted
Take-off Landing
Energy exerted
2
CMU SCS
15-826 (c) C. Faloutsos and J-Y Pan (2010) #4
Motivation: (Q1) Find patterns in data
• Human would say – Pattern 1: along
diagonal – Pattern 2: along
vertical axis • How to find these
automatically? Each point is the measurement at a time tick (total 550 points).
Take-off
Landing R:L=60:1
R:L=1:1
CMU SCS
15-826 (c) C. Faloutsos and J-Y Pan (2010) #5
Motivation: (Q2) Find hidden variables
Hidden variables
“Internet bubble”
Alcoa
American Express
Boeing
Citi Group
…
“General trend”
Stock prices
CMU SCS
15-826 (c) C. Faloutsos and J-Y Pan (2010) #6
Motivation: (Q2) Find hidden variables
“General trend” “Internet bubble”
0.94
0.03 0.63
0.64
Caterpillar Intel
Hidden variables
3
CMU SCS
15-826 (c) C. Faloutsos and J-Y Pan (2010) #7
Motivation: (Q2) Find hidden variables
Hidden variable 1 Hidden variable 2
B1,CAT
B1,INTC B2,CAT
B2,INTC ?
? ?
Caterpillar Intel
CMU SCS
15-826 (c) C. Faloutsos and J-Y Pan (2010) #8
Motivation: Find hidden variables
• There are two sound sources in a cocktail party…
Also called the “blind source separation” problem. “Blind” because we don’t know the sources, nor how they are mixed.
CMU SCS
15-826 (c) C. Faloutsos and J-Y Pan (2010) #9
Outline • Motivation • Formulation • PCA and ICA • Example applications • Conclusion
4
CMU SCS
15-826 (c) C. Faloutsos and J-Y Pan (2010) #10
Formulation: Finding patterns
Given n data points, each with m attributes.
Find patterns that describe data properties the best.
CMU SCS
15-826 (c) C. Faloutsos and J-Y Pan (2010) #11
Linear representation
• Find patterns that are vectors that describe the data set the best. • Each point is described as a linear combination of the vectors
(patterns):
CMU SCS
15-826 (c) C. Faloutsos and J-Y Pan (2010) #12
Patterns as data “vocabulary”
Good pattern ≈ sparse coding
(Q) Given data xi’s, compute hi,j’s and bi’s that are “sparse”?
Only b1 is needed to describe xi.
5
CMU SCS
15-826 (c) C. Faloutsos and J-Y Pan (2010) #13
Patterns in motion capture data
b2
b1
n=550 ticks Data matrix
Left Right
Basis vectors
Hidden variables
? ?
Sparse ~ non-Gaussian ~ “Independent”
“Independent”: e.g., minimize mutual information.
CMU SCS
15-826 (c) C. Faloutsos and J-Y Pan (2010) #14
Outline • Motivation • Formulation • PCA and ICA • Example applications • Conclusion
CMU SCS
15-826 (c) C. Faloutsos and J-Y Pan (2010) #15
Basis vectors and hidden variables
• Goal: Knowing X, find H and B, where X = H B
• Problem: Under-constrained – Need additional assumptions/constraints
X: data set H: hidden variables B: basis vectors
6
CMU SCS
15-826 (c) C. Faloutsos and J-Y Pan (2010) #16
PCA and ICA
• PCA vectors: major variations – Together = good “low-rank approximation”/dimensional reduction – Individually ≠ meaningful patterns
• Luckily, ICA detects the major meaningful patterns.
PCA Vectors ICA Vectors
CMU SCS
15-826 (c) C. Faloutsos and J-Y Pan (2010) #17
PCA • PCA (Principal Component Analysis)
– Choose vectors which are orthonormal and – give smallest projection error L2
• Matrices H and B can be solved by – SVD, neural networks, or many optimization
methods
CMU SCS
15-826 (c) C. Faloutsos and J-Y Pan (2010) #18
PCA
• Extremely popular – Latent Semantic Indexing [Deerwester+90] – KL transform [Duda,Hart,Stork00] – EigenFaces [Turk,Pentlend91] – Multiple time series correlation
[Guha,Gunopulos,Koudas03] • But, there is room for improvement.
7
CMU SCS
15-826 (c) C. Faloutsos and J-Y Pan (2010) #19
ICA • ICA (Independent Component Analysis)
– Make hidden variables hi’s (columns of H) mutually independent.
• Many implementations – Many ways to define “independence” – Many ways to find the most independent H. – (B is found at the same time, since X=HB.)
CMU SCS
15-826 (c) C. Faloutsos and J-Y Pan (2010) #20
ICA
• Define “Independence”: – Zero mutual information – Non-Gaussianity, max. absolute Kurtosis
• To solve for H,B: – Neural networks, optimization methods
(gradient ascent, fixed-point, …)
CMU SCS
15-826 (c) C. Faloutsos and J-Y Pan (2010) #21
An non-gaussian distribution: Laplacian pdf
Sharper at 0, and more heavy tail than Gaussian pdf
8
CMU SCS
15-826 (c) C. Faloutsos and J-Y Pan (2010) #22
Maximizing non-gaussianity • Assume hi ~ non-gaussian pdf (e.g. Laplacian pdf)
– Fixed hi values, what is the most likely “B” ? • (data point x is given and fixed)
– Find B, s.t. likelihood P(x|B) is maximized.
CMU SCS
15-826 (c) C. Faloutsos and J-Y Pan (2010) #23
Maximize likelihood • Likelihood P(x|B) is a function of B, f(B) • Gradient ascent
– To find B which maximizes P(x|B)
f(B)
B B*
CMU SCS
15-826 (c) C. Faloutsos and J-Y Pan (2010) #24
ICA by maximum likelihood
H B
X: static
9
CMU SCS
15-826 (c) C. Faloutsos and J-Y Pan (2010) #25
Convergence of ICA
ICA
Start with the matrix B as an identity matrix.
CMU SCS
15-826 (c) C. Faloutsos and J-Y Pan (2010) #26
Outline • Motivation • Formulation • PCA and ICA • Example applications
– Find topics in documents – Hidden variables in stock prices – Visual vocabulary for retinal images
• Conclusion
CMU SCS
15-826 (c) C. Faloutsos and J-Y Pan (2010) #27
Pattern discovery with ICA: AutoSplit [PAKDD 04][WIRI 05]
Step 1: Data points (matrix)
Step 2: Compute patterns
Step 3: Interpret patterns
…
or
or
…
…
Data mining (Case studies)
(Q) What pattern?
(Q) Different modalities
(Q) How?
Video frames
Stock prices
Text documents
10
CMU SCS
15-826 (c) C. Faloutsos and J-Y Pan (2010) #28
Finding patterns in high-dimensional data
PCA finds the hyperplane. ICA finds the correct patterns.
Dimensionality reduction
CMU SCS
15-826 (c) C. Faloutsos and J-Y Pan (2010) #29
Outline • Motivation • Formulation • PCA and ICA • Example applications
– Find topics in documents – Hidden variables in stock prices – Visual vocabulary for retinal images
• Conclusion
CMU SCS
15-826 (c) C. Faloutsos and J-Y Pan (2010) #30
Topic discovery on text streams • Data: CNN headline news (Jan.-Jun. 1998) • Documents of 10 topics in one single text
stream – Documents are sorted by date/time – Subsequent documents may have different
topics
Date/Time Topic 1 Topic 3 Topic 1 …
11
CMU SCS
15-826 (c) C. Faloutsos and J-Y Pan (2010) #31
Topic discovery on text streams
• Known: number of topics = 10 • Unknown: (1) topic of each document (2)
topic description
Date/Time Topic 1 Topic 3 Topic 1 …
CMU SCS
15-826 (c) C. Faloutsos and J-Y Pan (2010) #32
Topic discovery in documents
X[nxm]
xi = [1, 5, …, 0]
Step 1
Windowing
Step 2
X[nxm] = H[nxm] B[mxm]
(1) Find hyperplane (m=10)
(2) Find patterns
aaron
zoo
m=3887 (dictionary size)
(n=1659) New stories
…
(30 words)
Step 3
b’i = [0, 0.7, …, 0.6] aaron
zoo animal
B’[10x3887] (Q) What does b’i mean?
CMU SCS
15-826 (c) C. Faloutsos and J-Y Pan (2010) #33
Step 3: Interpret the patterns
b’i = [0, 0.7, …, 0.6]
aaron
zoo animal
Topics found
A hidden topic!
Top words : “animal”, “zoo”, …
ID Sorted word list
A Mckinne Sergeant sexual Major Armi
B bomb Rudolph Clinic Atlanta Birmingham
C Winfrei Beef Texa Oprah Cattl
D Viagra Drug Impot Pill Doctor
E Zamora Graham Kill Former Jone
F Medal Olymp Gold Women Game
G Pope Cube Castro Cuban Visit
H Asia Economi Japan Econom Asian
I Super Bowl Game Team Re
J Peopl Tornado Florida Re bomb
General idea: related to the data attributes
m=3887 (dictionary size)
12
CMU SCS
15-826 (c) C. Faloutsos and J-Y Pan (2010) #34
Step 3: Evaluate the patterns ID True Topic 1 Sgt. Gene Mckinney is on trial for alleged sexual misconduct 2 A bomb explodes in a Birmingham, AL abortion clinic 3 The Cattle Industry in Texas sues Oprah Winfrey for defaming beef 4 New impotency drug Viagra is approved for use 5 Diane Zamora is convicted of helping to murder her lover’s girlfriend 6 1998 Winter Olympic games 7 The Pope’s historic visit to Cube in Winter 1998 8 The economic crisis in Asia 9 Superbowl XXXII 10 Tornado in Florida
AutoSplit finds correct topics.
ID Sorted word list A mckinne sergeant sexual major armi B bomb rudolph clinic atlanta birmingham C winfrei beef texa oprah cattl D viagra drug Impot pill doctor E zamora graham kill former jone
CMU SCS
15-826 (c) C. Faloutsos and J-Y Pan (2010) #35
Step 3: Evaluate the patterns ID AutoSplit A mckinne sergeant sexual major armi B bomb rudolph clinic atlanta birmingham C winfrei beef texa oprah cattl D viagra drug Impot pill doctor E zamora graham kill former jone
AutoSplit’s topics are better than PCA.
ID PCA A’ mckinne bomb women sexual sergeant B’ bomb mckinne rudolph clinic atlanta C’ winfrei viagra texa beef oprah D’ viagra winfrei drug texa beef E’ zamora viagra winfrei graham olymp
ID PCA A’ mckinne bomb women sexual sergeant B’ bomb mckinne rudolph clinic atlanta C’ winfrei viagra texa beef oprah D’ viagra winfrei drug texa beef E’ zamora viagra winfrei graham olymp
CMU SCS
15-826 (c) C. Faloutsos and J-Y Pan (2010) #36
Step 3: Evaluate the patterns AutoSplit
A B C D E
AutoSplit’s topics are better than PCA.
PCA A’ B’ C’ D’ E’
PCA vectors mix the topics.
Topic 1
Topic 2
13
CMU SCS
15-826 (c) C. Faloutsos and J-Y Pan (2010) #37
Outline • Motivation • Formulation • PCA and ICA • Example applications
– Find topics in documents – Hidden variables in stock prices – Visual vocabulary for retinal images
• Conclusion
CMU SCS
15-826 (c) C. Faloutsos and J-Y Pan (2010) #38
Find hidden variables (DJIA stocks)
• Weekly DJIA closing prices – 01/02/1990-08/05/2002, n=660 data points – A data point: prices of 29 companies at the time
Alcoa
American Express
Boeing
Caterpillar
Citi Group
CMU SCS
15-826 (c) C. Faloutsos and J-Y Pan (2010) #39
Formulation: Find hidden variables
AA1, …, XOM1
…
…
AAn, …, XOMn
H11, H12, …, H1m
…
…
Hn1, Hn2, …, Hnm
=
B11, B12, …, B1m
…
Bm1, Bm2, …, Bmm ? ?
Date Hidden variable
Date
14
CMU SCS
15-826 (c) C. Faloutsos and J-Y Pan (2010) #40
Characterize hidden variable by the companies it influences
“General trend” “Internet bubble”
0.94
0.63 0.03
0.64
Caterpillar Intel
B1,CAT
B1,INTC B2,CAT
B2,INTC
CMU SCS
15-826 (c) C. Faloutsos and J-Y Pan (2010) #41
Companies related to hidden variable 1
B1,j
Highest Lowest
Caterpillar 0.938512 AT&T 0.021885
Boeing 0.911120 WalMart 0.624570
MMM 0.906542 Intel 0.638010
Coca Cola 0.903858 Home Depot 0.647774
Du Pont 0.900317 Hewlett-Packard 0.658768
“General trend”
CMU SCS
15-826 (c) C. Faloutsos and J-Y Pan (2010) #42
Companies related to hidden variable 1
B1,j
Highest Lowest
Caterpillar 0.938512 AT&T 0.021885
Boeing 0.911120 WalMart 0.624570
MMM 0.906542 Intel 0.638010
Coca Cola 0.903858 Home Depot 0.647774
Du Pont 0.900317 Hewlett-Packard 0.658768
All companies are affected by the “general trend” variable (with weights 0.6~0.9), except AT&T.
15
CMU SCS
15-826 (c) C. Faloutsos and J-Y Pan (2010) #43
General trend (and outlier)
“General trend”
AT&T
United Technologies
Walmart
Exxon Mobil
CMU SCS
15-826 (c) C. Faloutsos and J-Y Pan (2010) #44
Companies related to hidden variable 2
B2,j
Highest Lowest Intel 0.641102 Philip Morris -0.194843
Hewlett-Packard 0.621159 International Paper -0.089569 GE 0.509164 Caterpillar 0.031678
American Express 0.504871 Procter and Gamble 0.109576 Disney 0.490529 Du Pont 0.133337
2000-2001 “Internet bubble”
Tech company
CMU SCS
15-826 (c) C. Faloutsos and J-Y Pan (2010) #45
Companies related to hidden variable 2
B2,j
Highest Lowest Intel 0.641102 Philip Morris -0.194843
Hewlett-Packard 0.621159 International Paper -0.089569 GE 0.509164 Caterpillar 0.031678
American Express 0.504871 Procter and Gamble 0.109576 Disney 0.490529 Du Pont 0.133337
Companies affected by the “internet bubble” variable (with weights 0.5~0.6) are tech-related. Other companies are un-related (weights < 0.15).
Tech company
16
CMU SCS
15-826 (c) C. Faloutsos and J-Y Pan (2010) #46
Outline • Motivation • Formulation • PCA and ICA • Example applications
– Find topics in documents – Hidden variables in stock prices – Visual vocabulary for retinal images
• Conclusion
CMU SCS
15-826 (c) C. Faloutsos and J-Y Pan (2010) #47
Mining cat retinal images [ICDM 05]
Normal 1 day after detachment
7 days after detachment
Retina
28 days after detachment
3d28dr 1h3dr 1d6dO2
Treatment
Detachment Development Distribution of 2 proteins
CMU SCS
15-826 (c) C. Faloutsos and J-Y Pan (2010) #48
“Vocabulary” for biomedical images?
• How to describe biomedical images? • Analogy: the topics for text
• Football reports: “touchdown”, “punt”, etc. • DB papers: “query”, “optimization”, etc.
• How to derive “visual vocabulary terms”?
Normal 7 days after detachment
“spongy”
17
CMU SCS
15-826 (c) C. Faloutsos and J-Y Pan (2010) #49
Visual Vocabulary (ViVo) by AutoSplit
Step 1: Tile image Step 2:
Extract tile features
Step 3: ViVo
generation
Visual���vocabulary
V1
V2
Feature 1
Feat
ure
2
8x12 tiles
CMU SCS
15-826 (c) C. Faloutsos and J-Y Pan (2010) #50
Finding ViVos
Each point is a tile. Projected to the 1st and 2nd PCA vectors. (Feature vector: 512 color structure features.)
Red lines indicate ViVos.
CMU SCS
15-826 (c) C. Faloutsos and J-Y Pan (2010) #51
Bio-mining with “ViVo”
• Visual Vocabulary for retinal images – using AutoSplit
• Evaluation – Qualitative: biological meanings of ViVos – Data mining: highlight “interesting” regions
18
CMU SCS
15-826 (c) C. Faloutsos and J-Y Pan (2010) #52
Biological interpretation of ViVos ID ViVo Description Condition
V1 GFAP in inner retina (Müller cells) Healthy
V10 Healthy outer segments of rod photoreceptors Healthy
V8 Redistribution of rod opsin into cell bodies of rod photoreceptors Detached
V11 Co-occurring processes: Müller cell hypertrophy and rod opsin redistribution
Detached
CMU SCS
15-826 (c) C. Faloutsos and J-Y Pan (2010) #53
Biological interpretation of ViVos ID ViVo Description Condition
2 GFAP in hypertrophy Müller cells
Morphological changes in inner retina
3 GFAP in hypertrophy Müller cells
Morphological changes in inner retina
4 GFAP in hypertrophy Müller cells
Morphological changes in inner retina
5
Healthy outer segments of rod photoreceptors (rod opsin)
Healthy
ID ViVo Description Condition
6 Rod photoreceptor cell body
Background labeling
7 GFAP in hypertrophy Müller cells
Morphological changes in inner retina
9 Outer segment degeneration (rod opsin)
Detached
12 GFAP in hypertrophy Müller cells
Morphological changes in inner retina
CMU SCS
15-826 (c) C. Faloutsos and J-Y Pan (2010) #54
Bio-mining with “ViVo”
• Visual Vocabulary for retinal images – using AutoSplit
• Evaluation – Qualitative: biological meanings of ViVos – Data mining: highlight “interesting” regions
19
CMU SCS
15-826 (c) C. Faloutsos and J-Y Pan (2010) #55
Finding “distinguishing ViVos” • Given: Images of two classes
– Find the class-distinguishing ViVo (“DiVo”) – Highlight distinguishing regions
Normal Detached 3 days
DiVo: “spongy”
CMU SCS
15-826 (c) C. Faloutsos and J-Y Pan (2010) #56
Summary: a system viewpoint
Our system
Input Output
(1) Left: “n”; Right: “3d”
Accurate classification
DiVo analysis
V8: “spongy”
(2) Regions shown (V8): “cells of rod photoreceptors” (3) Description: “Detachment occurs!”
“Rod opsin distributes from outer segment into cell bodies.”
ViVo interpretation
CMU SCS
15-826 (c) C. Faloutsos and J-Y Pan (2010) #57
Outline • Motivation • Formulation • PCA and ICA • Example applications
– Find topics in documents – Hidden variables in stock prices – Visual vocabulary for retinal images
• Conclusion
20
CMU SCS
15-826 (c) C. Faloutsos and J-Y Pan (2010) #58
Conclusion • ICA: more flexible than PCA in finding
patterns. • Many applications
– Find topics and “vocabulary” for images – Find hidden variables in time series (e.g., stock
prices) – Blind source separation
CMU SCS
15-826 (c) C. Faloutsos and J-Y Pan (2010) #59
Vocabulary for embryo gene expressions
with André Balan, Christos Faloutsos, Eric P. Xing
Vocabulary
CMU SCS
15-826 (c) C. Faloutsos and J-Y Pan (2010) #60
References • Jia-Yu Pan, Andre Guilherme Ribeiro Balan, Eric P. Xing, Agma Juci
Machado Traina, and Christos Faloutsos. Automatic Mining of Fruit Fly Embryo Images. In Proceedings of the Twelfth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2006.
• Arnab Bhattacharya, Vebjorn Ljosa, Jia-Yu Pan, Mark R. Verardo, Hyungjeong Yang, Christos Faloutsos, and Ambuj K. Singh. ViVo: Visual Vocabulary Construction for Mining Biomedical Images. In Proceedings of the Fifth IEEE International Conference on Data Mining (ICDM), 2005.
• Masafumi Hamamoto, Hiroyuki Kitagawa, Jia-Yu Pan, and Christos Faloutsos. A Comparative Study of Feature Vector-Based Topic Detection Schemes for Text Streams. In Proceedings of International Workshop on Challenges in Web Information Retrieval and Integration (WIRI), 2005, pp.125-130.
• Jia-Yu Pan, Hiroyuki Kitagawa, Christos Faloutsos, and Masafumi Hamamoto. AutoSplit: Fast and Scalable Discovery of Hidden Variables in Stream and Multimedia Databases. In Proceedings of the The Eighth Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), 2004.
21
CMU SCS
15-826 (c) C. Faloutsos and J-Y Pan (2010) #61
Acknowledgement
• Prof. Tai Sing Lee • Prof. Hiroyuki
Kitagawa • Prof. HyungJeong
Yang • Masafumi Hamamoto • Prof. Nancy Pollard • Prof. Jessica Hodgins
• Prof. Michael Lewicki • Prof. Eric Xing • CMU Informedia
project • UCSB DB Lab • CMU bio-imaging
center • CMU graphics lab
CMU SCS
15-826 (c) C. Faloutsos and J-Y Pan (2010) #62
Citation
• AutoSplit: Fast and Scalable Discovery of Hidden Variables in Stream and Multimedia Databases, Jia-Yu Pan, Hiroyuki Kitagawa, Christos Faloutsos and Masafumi Hamamoto
PAKDD 2004, Sydney, Australia
CMU SCS
15-826 (c) C. Faloutsos and J-Y Pan (2010) #63
References
• Aapo Hyvärinen, Juha Karhunen, Erkki Oja: Independent Component Analysis, John Wiley & Sons, 2001