CMU SCS 15-826: Multimedia Databases and Data Mining Lecture #25: Multimedia indexing C. Faloutsos.
Indexing and Data Mining in Multimedia Databases
Christos Faloutsos
CMU www.cs.cmu.edu/~christos
USC 2001 C. Faloutsos 2
Outline
Goal: ‘Find similar / interesting things’
• Problem - Applications
• Indexing - similarity search
• New tools for Data Mining: Fractals
• Conclusions
• Resources
Problem
Given a large collection of (multimedia) records, find similar/interesting things, i.e.:
• Allow fast, approximate queries, and
• Find rules/patterns
Sample queries
• Similarity search
  – Find pairs of branches with similar sales patterns
  – Find medical cases similar to Smith's
  – Find pairs of sensor series that move in sync
  – Find shapes like a spark-plug
Sample queries – cont’d
• Rule discovery
  – Clusters (of branches; of sensor data; ...)
  – Forecasting (total sales for next year?)
  – Outliers (e.g., unexpected part failures; fraud detection)
Outline
Goal: ‘Find similar / interesting things’
• Problem - Applications
• Indexing - similarity search
• New tools for Data Mining: Fractals
• Conclusions
• Related projects @ CMU and resources
Indexing - Multimedia
Problem:
• given a set of (multimedia) objects,
• find the ones similar to a desired query object
[Figure: three time series, each plotting $price against day (1 to 365)]
distance function: by expert
‘GEMINI’ - Pictorially
[Figure: sequences S1, ..., Sn (days 1 to 365) mapped to points F(S1), ..., F(Sn) in a feature space with axes such as avg and std]
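A minimal sketch of the filter-and-refine idea on the ‘GEMINI’ slide, assuming Euclidean distance between equal-length sequences and the two example features named there (avg and std). The function names, the toy data, and the linear scan standing in for a spatial index are illustrative assumptions, not taken from the slides. With the population standard deviation, sqrt(n) times the feature-space distance never exceeds the true distance, so the filter step cannot dismiss a true match.

```python
import math
import statistics


def features(seq):
    """Map a sequence S to the 2-d feature point F(S) = (avg, std)."""
    return (sum(seq) / len(seq), statistics.pstdev(seq))


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def range_query(sequences, query, eps):
    """Return the indices of all sequences within distance eps of query."""
    n = len(query)
    fq = features(query)
    hits = []
    for i, seq in enumerate(sequences):
        # Filter: cheap lower bound on the true distance, in feature space.
        # (A real system would answer this step with a spatial index.)
        if math.sqrt(n) * euclidean(features(seq), fq) > eps:
            continue
        # Refine: discard false alarms using the full distance.
        if euclidean(seq, query) <= eps:
            hits.append(i)
    return hits


# Toy data (made up): three short "price" series.
series = [
    [10, 11, 12, 11, 10],
    [10, 12, 11, 11, 10],
    [50, 55, 60, 55, 50],
]
print(range_query(series, [10, 11, 12, 11, 10], eps=2.0))  # prints [0, 1]
```

The third series is rejected by the filter alone, without ever computing its full distance to the query.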
Remaining issues
• how to extract features automatically?
• how to merge similarity scores from different media?
Outline
Goal: ‘Find similar / interesting things’
• Problem - Applications
• Indexing - similarity search
  – Visualization: FastMap
  – Relevance feedback: FALCON
• Data Mining / Fractals
• Conclusions
FastMap
     O1   O2   O3   O4   O5
O1    0    1    1  100  100
O2    1    0    1  100  100
O3    1    1    0  100  100
O4  100  100  100    0    1
O5  100  100  100    1    0
[Figure: target map of the objects as points, with distances of ~1 within a group and ~100 between groups: how do we obtain it?]
FastMap
• Multi-dimensional scaling (MDS) can do that, but in O(N**2) time
• We want a linear algorithm: FastMap [SIGMOD95]
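The following is a compact sketch of the FastMap projection step as it is usually described: pick two far-apart pivot objects, place every object on the line through them with the cosine law, then repeat on the residual distances. The pivot heuristic, the clamping of negative residuals, and all names are simplifying assumptions for illustration; the distance matrix is the one from the earlier FastMap slide.

```python
import math


def fastmap(dist, n, k):
    """Map n objects to k-d points, given only a distance function dist(i, j)."""
    coords = [[0.0] * k for _ in range(n)]

    def residual_sq(i, j, dim):
        # Squared distance left over after the first `dim` coordinates.
        d2 = dist(i, j) ** 2
        for h in range(dim):
            d2 -= (coords[i][h] - coords[j][h]) ** 2
        return max(d2, 0.0)  # clamp: non-Euclidean inputs can go negative

    for dim in range(k):
        # Pivot heuristic: start at object 0 and hop to the farthest object twice.
        b = max(range(n), key=lambda j: residual_sq(0, j, dim))
        a = max(range(n), key=lambda j: residual_sq(b, j, dim))
        d_ab2 = residual_sq(a, b, dim)
        if d_ab2 == 0.0:
            break  # nothing left to explain; later coordinates stay 0
        d_ab = math.sqrt(d_ab2)
        for i in range(n):
            # Cosine law: position of object i along the line through a and b.
            coords[i][dim] = (
                residual_sq(a, i, dim) + d_ab2 - residual_sq(b, i, dim)
            ) / (2 * d_ab)
    return coords


# Distance matrix from the FastMap slide: O1-O3 are close, O4-O5 are close.
D = [
    [0, 1, 1, 100, 100],
    [1, 0, 1, 100, 100],
    [1, 1, 0, 100, 100],
    [100, 100, 100, 0, 1],
    [100, 100, 100, 1, 0],
]
points = fastmap(lambda i, j: D[i][j], n=len(D), k=2)
for name, p in zip(["O1", "O2", "O3", "O4", "O5"], points):
    print(name, [round(c, 3) for c in p])
```

Each output dimension needs only a constant number of passes over the objects, so the cost is linear in N for a fixed k. The first coordinate already separates the two groups.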
Applications: time sequences
• given n co-evolving time sequences
• visualize them + find rules [ICDE00]
[Figure: rate vs. time for HKD, JPY, DEM]
Applications - financial
• currency exchange rates [ICDE00]
[Figure: currencies USD, HKD, JPY, FRF, DEM, GBP plotted as points, with USD(t) and USD(t-5) marked]
Application: VideoTrails
[ACM MM97]
VideoTrails - usage
• scene-cut detection (about 10% errors)
• scene classification (e.g., dialogue vs. action)
Outline
Goal: ‘Find similar / interesting things’
• Problem - Applications
• Indexing - similarity search
  – Visualization: FastMap
  – Relevance feedback: FALCON
• Data Mining / Fractals
• Conclusions
Merging similarity scores
• e.g., video: text, color, motion, audio
  – weights change with the query!
• solution 1: user specifies weights
• solution 2: user gives examples
  – and we ‘learn’ what he/she wants: rel. feedback (Rocchio, MARS, MindReader)
  – but: how about disjunctive queries?
‘FALCON’
[Figure: ‘Vs’ vs. ‘inverted Vs’]
Trader wants only ‘unstable’ stocks
“Single query point” methods
• Rocchio
• MARS
• MindReader
[Figure: one panel per method, each with a group of ‘+’ points and a single ‘x’]
The averaging effect in action...
Main idea: FALCON Contours
[Figure: ‘+’ points with contours; axes: feature1 (e.g., temperature) and feature2 (e.g., frequency)]
[Wu+, vldb2000]
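To make the contrast with the ‘single query point’ methods concrete, here is a hedged sketch. It compares scoring by distance to the centroid of the good examples with a soft-OR aggregate that is small whenever a candidate is close to any good example. The aggregate formula (a power mean of distances with a negative exponent), the value alpha = -5, and all names are assumptions made for illustration; the slides do not spell out FALCON's exact function.

```python
import math


def centroid_distance(x, good_points):
    """'Single query point' scoring: distance to the average of the examples."""
    dims = range(len(x))
    center = [sum(p[d] for p in good_points) / len(good_points) for d in dims]
    return math.dist(x, center)


def aggregate_dissimilarity(x, good_points, alpha=-5.0):
    """Soft-OR scoring: small if x is close to ANY good point (alpha < 0)."""
    dists = [math.dist(x, g) for g in good_points]
    if min(dists) == 0.0:
        return 0.0  # x coincides with one of the good examples
    mean_power = sum(d ** alpha for d in dists) / len(dists)
    return mean_power ** (1.0 / alpha)


# Made-up disjunctive feedback: the user likes two far-apart groups.
good = [(0, 0), (0, 1), (10, 10), (10, 11)]
near_group = (0.5, 0.5)  # sits inside the first group
in_between = (5, 5.5)    # the centroid of all examples, far from both groups

for name, score in [("centroid", centroid_distance),
                    ("aggregate", aggregate_dissimilarity)]:
    print(name, round(score(near_group, good), 3), round(score(in_between, good), 3))
```

The centroid score prefers the in-between point, which resembles none of the examples; the aggregate score prefers the point inside a group, which is the behaviour a disjunctive query needs.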
Conclusions for indexing + visualization
• GEMINI: fast indexing, exploiting off-the-shelf spatial access methods (SAMs)
• FastMap: automatic feature extraction in O(N) time
• FALCON: relevance feedback for disjunctive queries
Outline
Goal: ‘Find similar / interesting things’
• Problem - Applications
• Indexing - similarity search
• New tools for Data Mining: Fractals
• Conclusions
• Resources
Data mining & fractals – Road map
• Motivation – problems / case study
• Definition of fractals and power laws
• Solutions to posed problems
• More examples
Problem #1 - spatial d.m.
Galaxies (Sloan Digital Sky Survey w/ B. Nichol) - ‘spiral’ and ‘elliptical’ galaxies
(stores & households; mpg & MTBF...)
- patterns? (not Gaussian; not uniform)
- attraction/repulsion?
- separability??
Problem#2: dim. reduction
• given attributes x1, ... xn
– possibly, non-linearly correlated
• drop the useless ones
(Q: why?
A: to avoid the ‘dimensionality curse’)
Answer:
• Fractals / self-similarities / power laws
What is a fractal?
= self-similar point set, e.g., Sierpinski triangle:
...zero area;
infinite length!
Definitions (cont’d)
• Paradox: Infinite perimeter ; Zero area!
• ‘dimensionality’: between 1 and 2
• actually: Log(3)/Log(2) = 1.58… (long story)
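The short version of the ‘long story’ is the usual self-similarity argument (a sketch added here, not taken from the slides). Shrinking the Sierpinski triangle by a factor of 2 gives a piece of which 3 copies rebuild the whole, so its dimension D must satisfy:

```latex
2^{D} = 3 \quad\Longrightarrow\quad D = \frac{\log 3}{\log 2} \approx 1.585
```

The same argument gives D = 1 for a line segment (2 half-size copies) and D = 2 for a square (4 half-size copies), which is why the triangle falls between 1 and 2.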
Intrinsic (‘fractal’) dimension
• Q: fractal dimension of a line?
E.g.: #cylinders; miles / gallon
 x  y
 5  1
 4  2
 3  3
 2  4
Intrinsic (‘fractal’) dimension
• Q: fractal dimension of a line?
• A: nn(<= r) ~ r^1 (‘power law’: y = x^a), where nn(<= r) is the number of neighbors within distance r
• Q: fd of a plane?
• A: nn(<= r) ~ r^2
fd == slope of (log(nn) vs. log(r))
Sierpinski triangle
[Figure: log(#pairs within <= r) vs. log(r); slope 1.58]
== ‘correlation integral’
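As a rough illustration of the correlation integral, the sketch below samples points of the Sierpinski triangle with the ‘chaos game’, counts the pairs within each radius r, and fits the slope of log(#pairs) vs. log(r) by least squares. The sample size, the radii, and the function names are arbitrary assumptions. The fitted slope should come out near log(3)/log(2), though a finite sample only approximates it.

```python
import bisect
import math
import random


def sierpinski_points(n, seed=0):
    """Sample n points of the Sierpinski triangle with the 'chaos game'."""
    rng = random.Random(seed)
    corners = [(0.0, 0.0), (1.0, 0.0), (0.5, math.sqrt(3) / 2)]
    x, y = corners[0]  # a corner lies on the fractal, so every iterate does too
    pts = []
    for _ in range(n):
        cx, cy = rng.choice(corners)
        x, y = (x + cx) / 2, (y + cy) / 2  # jump halfway to a random corner
        pts.append((x, y))
    return pts


def correlation_dimension(points, radii):
    """Least-squares slope of log(#pairs within <= r) vs. log(r)."""
    dists = sorted(
        math.dist(points[i], points[j])
        for i in range(len(points))
        for j in range(i + 1, len(points))
    )
    xs = [math.log(r) for r in radii]
    # bisect_right counts the pairs whose distance is <= r.
    ys = [math.log(bisect.bisect_right(dists, r)) for r in radii]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )


radii = [0.01 * 1.5 ** k for k in range(6)]  # roughly 0.01 to 0.076
slope = correlation_dimension(sierpinski_points(1000), radii)
print(round(slope, 2), "vs.", round(math.log(3) / math.log(2), 2))
```

This brute-force version looks at all O(N^2) pairs, which is fine for a demonstration but not for large data sets.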
Road map
• Motivation – problems / case studies
• Definition of fractals and power laws
• Solutions to posed problems
• More examples
• Conclusions
Solution#1: spatial d.m.
Galaxies (Sloan Digital Sky Survey w/ B. Nichol - ‘BOPS’ plot - [sigmod2000])
• clusters?
• separable?
• attraction/repulsion?
• data ‘scrubbing’ – duplicates?
Solution#1: spatial d.m.
[Figure: log(#pairs within <= r) vs. log(r), with curves for spi-spi, spi-ell, and ell-ell pairs; annotations: 1.8 slope, plateau!, repulsion!]
[w/ Seeger, Traina, Traina, SIGMOD00]
spatial d.m.
[Figure: radii r1 and r2]
Heuristic on choosing # of clusters
Solution#1: spatial d.m.
[Figure: log(#pairs within <= r) vs. log(r), with curves for spi-spi, spi-ell, and ell-ell pairs; annotations: 1.8 slope, plateau!, repulsion!!, duplicates]
Problem #2: Dim. reduction
[Figure: three x-y panels, axes from 0 to 1: (a) Quarter-circle, (b) Line, (c) Spike]
Solution:
• drop the attributes that don’t increase the ‘partial f.d.’ PFD
• dfn: PFD of attribute set A is the f.d. of the projected cloud of points [w/ Traina, Traina, Wu, SBBD00]
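One way to read the rule above is as a greedy forward selection: keep an attribute only if adding it raises the partial fractal dimension noticeably. The sketch below does this with a brute-force correlation-dimension estimate. The selection order, the threshold `min_gain`, the radii, and the synthetic data are all assumptions for illustration; the algorithm in the cited paper may differ.

```python
import bisect
import math
import random


def fractal_dim(rows, attrs, radii):
    """Correlation fractal dimension of the points projected onto `attrs`."""
    pts = [[row[a] for a in attrs] for row in rows]
    dists = sorted(
        math.dist(pts[i], pts[j])
        for i in range(len(pts))
        for j in range(i + 1, len(pts))
    )
    xs = [math.log(r) for r in radii]
    ys = [math.log(bisect.bisect_right(dists, r)) for r in radii]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )


def select_attributes(rows, n_attrs, radii, min_gain=0.5):
    """Keep an attribute only if it raises the partial f.d. by >= min_gain."""
    kept = [0]  # start from the first attribute (arbitrary choice)
    pfd = fractal_dim(rows, kept, radii)
    for a in range(1, n_attrs):
        candidate = fractal_dim(rows, kept + [a], radii)
        if candidate - pfd >= min_gain:
            kept.append(a)
            pfd = candidate
    return kept


# Synthetic rows (made up): y = x^2 is non-linearly tied to x; z is independent.
rng = random.Random(0)
rows = []
for _ in range(500):
    x = rng.random()
    rows.append((x, x * x, rng.random()))

radii = [0.02 * 1.5 ** k for k in range(5)]
print(select_attributes(rows, n_attrs=3, radii=radii))
```

Here attribute 1 (y) should be dropped, because the (x, y) cloud is still a curve of dimension about 1, while attribute 2 (z) should be kept, because it lifts the dimension to about 2.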
Problem #2: dim. reduction
[Figure: panels (a) Quarter-circle, (b) Line, (c) Spike, annotated with the PFD of each attribute (PFD~1, PFD~1, PFD=1, PFD=1, PFD=0, PFD=1) and global FD=1]
Notice: ‘max variance’ would fail here
Notice: SVD would fail here
Road map
• Motivation – problems / case studies
• Definition of fractals and power laws
• Solutions to posed problems
• More examples
  – fractals
  – power laws
• Conclusions
disk traffic
• Not Poisson, not(?) iid - BUT: self-similar
• How to model it?
[Figure: #bytes vs. time]
traffic
• disk traces (80-20 ‘law’ = ‘multifractal’ [ICDE’02])
[Figure: #bytes vs. time, split 20% / 80%]
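A small sketch of how an 80-20 ‘law’ can generate bursty, self-similar traffic: split the total volume over the two halves of the time interval with an 80/20 bias, then recurse on each half. Whether this matches the cited paper in detail is an assumption; the function name, parameters, and the random choice of the heavier side are illustrative.

```python
import random


def b_model(total, levels, bias=0.8, seed=0):
    """Spread `total` over 2**levels time slots by recursive biased splits."""
    rng = random.Random(seed)
    volumes = [total]
    for _ in range(levels):
        nxt = []
        for v in volumes:
            big, small = v * bias, v * (1 - bias)
            # The half that receives the larger share is picked at random.
            nxt.extend((big, small) if rng.random() < 0.5 else (small, big))
        volumes = nxt
    return volumes


trace = b_model(total=1_000_000, levels=10)  # 1024 slots of "#bytes"
busiest = sorted(trace, reverse=True)[: len(trace) // 5]
print(f"busiest 20% of slots carry {sum(busiest) / sum(trace):.0%} of the bytes")
```

A Poisson-like trace would spread the bytes almost evenly over the slots; here a small fraction of the slots carries most of the volume, at every time scale.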
Traffic
Many other time-sequences are bursty/clustered: (such as?)
Tape accesses
[Figure: accesses over time, across Tape#1 ... Tape#N]
# tapes needed, to retrieve n records?
(# days down, due to failures / hurricanes / communication noise...)
Tape accesses
[Figure: # tapes retrieved vs. # qual. records; curves: ‘50-50 = Poisson’ and ‘real’]
More apps: Brain scans
• Oct-trees; brain-scans
[Figure: Log(#octants) vs. octree levels; slope 2.63 = fd]
Cross-roads of Montgomery county:
• any rules?
[Figure: GIS points]
GIS
A: self-similarity:
• intrinsic dim. = 1.51
• avg #neighbors(<= r) ~ r^D
[Figure: log(#pairs(within <= r)) vs. log(r); slope 1.51]
Examples: LB county
• Long Beach county of CA (road end-points)
More fractals:
• cardiovascular system: 3 (!)
• stock prices (LYCOS) - random walks: 1.5
• Coastlines: 1.2-1.58 (?)
[Figure: stock price plots over 1 year and 2 years]
Road map
• Motivation – problems / case studies
• Definition of fractals and power laws
• Solutions to posed problems
• More examples
  – fractals
  – power laws
• Conclusions
Fractals <-> Power laws
self-similarity ->
• <=> fractals
• <=> scale-free
• <=> power-laws (y = x^a, F = C*r^(-2))
[Figure: log(#pairs within <= r) vs. log(r); slope 1.58]
Zipf’s law
Bible - RANK-FREQUENCY plot (in log-log scales)
Zipf’s (first) Law:
[Figure: log(freq) vs. log(rank); top-ranked words “the” and “and”]
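The rank-frequency plot is easy to reproduce: count the words, sort the counts in decreasing order, and fit a line to log(freq) vs. log(rank). The sketch below uses a synthetic vocabulary rather than the Bible text, so the data and the function name are assumptions for illustration only.

```python
import math
from collections import Counter


def rank_frequency_slope(words):
    """Least-squares slope of log(freq) vs. log(rank)."""
    freqs = sorted(Counter(words).values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )


# Synthetic vocabulary (made up): the word of rank r occurs about 1000 / r times.
words = [f"w{r}" for r in range(1, 51) for _ in range(round(1000 / r))]
print(round(rank_frequency_slope(words), 2))
```

For real text, pass the tokenized words instead; a slope close to -1 on the log-log plot is what Zipf's law predicts.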
Zipf’s law
• similarly for first names (slope ~-1)
• last names (~ -0.7)
• etc
More power laws
• Energy of earthquakes (Gutenberg-Richter law) [simscience.org]
[Figures: log(count) vs. magnitude; amplitude vs. day]
Web Site Traffic
Clickstream data: <url, u-id, ....>
[Figure: log(count) vs. log(freq); Zipf]
Lotka’s law
• library science (Lotka’s law of publication count); and citation counts: (citeseer.nj.nec.com 6/2001)
[Figure: log(count) vs. log(#citations); J. Ullman marked]
Korcak’s law
Scandinavian lakes area vs complementary cumulative count (log-log axes)
[Figure: log(count( >= area)) vs. log(area)]
More power laws: Korcak
Japan islands: area vs. cumulative count (log-log axes)
[Figure: log(count( >= area)) vs. log(area)]
(Korcak’s law: Aegean islands)
Olympic medals:
[Figure: log(# medals) vs. log rank, with linear fit y = -0.9676x + 2.3054, R^2 = 0.9458; labeled points: USA, China, Russia]
SALES data – store#96
[Figure: count of products vs. # units sold]
TELCO data
[Figure: count of customers vs. # of service units]
More power laws on the Internet
degree vs rank, for Internet domains (log-log) [sigcomm99]
[Figure: log(degree) vs. log(rank); slope -0.82]
Even more power laws:
• Income distribution (Pareto’s law);
• duration of UNIX jobs [Harchol-Balter]
• Distribution of UNIX file sizes
• Web graph [CLEVER-IBM; Barabasi]
Overall Conclusions:
‘Find similar/interesting things’ in multimedia databases
• Indexing: feature extraction (‘GEMINI’)
  – automatic feature extraction: FastMap
  – Relevance feedback: FALCON
Conclusions - cont’d
• New tools for Data Mining: Fractals/power laws:
  – appear everywhere
  – lead to skewed distributions (not Gaussian, Poisson, uniform, or independent)
  – ‘correlation integral’ for separability/cluster detection
  – PFD for dimensionality reduction
Resources:
• Software and papers:
  – www.cs.cmu.edu/~christos
  – Fractal dimension (FracDim)
  – Separability (sigmod 2000, kdd2001)
  – Relevance feedback for query by content (FALCON – vldb 2000)
Resources
• Manfred Schroeder, “Fractals, Chaos, Power Laws: Minutes from an Infinite Paradise”