Fast Nonparametric Machine Learning Algorithms for High-dimensional Massive Data and Applications

Ting Liu
Carnegie Mellon University
February, 2005
Ph.D. Thesis Proposal
Ting Liu, CMU 2
Thesis Committee
• Andrew Moore (Chair)
• Martial Hebert
• Jeff Schneider
• Trevor Darrell (MIT)
Thesis Proposal
Goal: to make nonparametric methods tractable for high-dim, massive datasets
• Nonparametric methods:
  – K-nearest-neighbor (K-NN)
  – Kernel density estimation
  – SVM evaluation phase
  – and more…
Why K-NN?
• It is simple
  – goes back to as early as [Fix-Hodges 1951]
  – [Cover-Hart 1967] justifies k-NN theoretically
• It is easy to implement
  – sanity check for other (more complicated) algorithms
  – similar insights for other nonparametric algorithms
• It is useful in many applications
  – text categorization
  – drug activity detection
  – multimedia, computer vision
  – and more…
Application: Video Segmentation
Task: Shot transition detection
• Cut
• Gradual transition (fades, dissolves …)
Technically [Qi-Hauptmann-Liu 2003]
• Pipeline: video frames → color histogram → pair-wise similarity features → classification (normal: 0, cut: 1, gradual: 2)
• Data: 4 hours of MPEG-1 video (420,970 frames)
• K-NN: good performance, but very slow
• We want a fast k-NN classification method.
Application: Near-duplicate Detection and Sub-image Retrieval
Copyrighted Image Database
Algorithm Overview [Yan-Rahul 2004]
• Store: 12,100 copyrighted images, 1000 patches each → 12,100,000 patches in the database
• Transformation: DoG + PCA-SIFT; each patch is 36-dim
• Query: each query image also yields 1000 patches → 1000 k-NN searches per query
• We want a fast k-NN search method.
K-NN Methods
K-NN divides into K-NN search and K-NN classification; each can be exact or approximate.
• Exact K-NN search
  – Naïve (slow)
  – Spatial trees: SR-tree, Kd-tree, Metric-tree
• Approximate K-NN search
  – Random sample, PCA, LSH
  – Spill-tree (my work)
• K-NN classification (my work)
  – KNS2 (2-class)
  – KNS3 (2-class)
  – IOC (multi-class)
Problems with Exact K-NN Search: Efficiency
• Slow with huge datasets in high dimensions
• Complexity of algorithms:
  – Naïve (linear scan): O(dN) per query
  – Advanced (spatial data structures to avoid searching all points): O(d log N) ~ O(dN)
    • SR-tree [Katayama-Satoh 1997]
    • Kd-tree [Friedman-Bentley-Finkel 1977]
    • Metric-tree (ball-tree) [Uhlmann 1991, Omohundro 1991]
Metric-tree: an Example
• A set of points in R²
Build a Metric-tree [Uhlmann 1991, Omohundro 1991]
• Split the points by a plane L between two pivots p1 and p2; recurse on each side.
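The split above can be sketched in a few lines. This is a minimal illustration, not the thesis implementation: the pivot-selection heuristic, the leaf size, and the dict-based node layout are all assumptions made for the sketch.

```python
import numpy as np

def build_metric_tree(points, leaf_size=2):
    """Recursively build a metric-tree (ball-tree) node.

    Pivots p1 and p2 are picked by a common linear-time heuristic
    (p1 farthest from an arbitrary point, p2 farthest from p1); each
    point goes to the side of its nearer pivot, and every node stores
    a bounding ball (center, radius) over the points it owns.
    """
    center = points.mean(axis=0)
    radius = float(np.max(np.linalg.norm(points - center, axis=1)))
    node = {"center": center, "radius": radius, "points": points,
            "left": None, "right": None}
    if len(points) <= leaf_size:
        return node
    d0 = np.linalg.norm(points - points[0], axis=1)
    p1 = points[np.argmax(d0)]
    p2 = points[np.argmax(np.linalg.norm(points - p1, axis=1))]
    to_p1 = np.linalg.norm(points - p1, axis=1)
    to_p2 = np.linalg.norm(points - p2, axis=1)
    left_mask = to_p1 <= to_p2
    if left_mask.all() or (~left_mask).all():  # degenerate split: stop here
        return node
    node["left"] = build_metric_tree(points[left_mask], leaf_size)
    node["right"] = build_metric_tree(points[~left_mask], leaf_size)
    return node
```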
Metric-tree Data Structure [Uhlmann 1991, Omohundro 1991]
• Each node stores a bounding ball (center and radius) covering the points it owns; internal nodes have two children.
Metric-tree: the Triangle Inequality
• Let q be any query point
• Let x be a point inside ball B (center c, radius r)
• Then ||q − c|| − r ≤ ||q − x|| ≤ ||q − c|| + r
Metric-tree Based K-NN Search
• Depth-first search
• Pruning using the triangle inequality
• Significant speed-up when d is small: O(d log N)
• Little speed-up when d is large: O(dN)
• “Curse of dimensionality”
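The depth-first search with triangle-inequality pruning can be sketched as follows. This is a standalone illustration, not the thesis code: the dict node layout (center / radius / points / children) is an assumption of the sketch, and a max-heap over negated distances tracks the current k best.

```python
import heapq
import numpy as np

def knn_search(node, q, k, heap=None):
    """Exact k-NN via depth-first metric-tree search.

    A node is pruned when the triangle inequality guarantees every
    point inside its ball is farther than the current k-th nearest
    distance: ||q - center|| - radius >= d_k.
    """
    if heap is None:
        heap = []  # max-heap via negated distances: entries (-d, point)
    d_center = np.linalg.norm(q - node["center"])
    if len(heap) == k and d_center - node["radius"] >= -heap[0][0]:
        return heap  # pruned: the ball cannot contain a closer point
    if node["left"] is None:  # leaf: scan its points
        for p in node["points"]:
            d = np.linalg.norm(q - p)
            if len(heap) < k:
                heapq.heappush(heap, (-d, tuple(p)))
            elif d < -heap[0][0]:
                heapq.heapreplace(heap, (-d, tuple(p)))
        return heap
    # visit the closer child first, to tighten the bound early
    kids = sorted((node["left"], node["right"]),
                  key=lambda c: np.linalg.norm(q - c["center"]))
    for child in kids:
        knn_search(child, q, k, heap)
    return heap
```

Visiting the closer child first is what makes the later pruning test effective: a tight d_k lets the search skip the sibling ball entirely, which is where the O(d log N) behavior in low dimensions comes from.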
My Work (Part 1): Fast K-NN Classification Based on Metric-tree
Idea: do classification without finding the k-NNs
• KNS2: fast k-NN classification for skewed 2-class data
• KNS3: fast k-NN classification for 2-class data
• IOC: fast k-NN classification for multi-class data
KNS2: Fast K-NN Classification for Skewed 2-class Data
Assumptions:
(1) 2 classes: pos. / neg.
(2) pos. class much less frequent than neg. class
Example: video segmentation
(~10,000 shot transitions, ~400,000 normal frames)
Q: How many of the k-NN are from pos. class?
How Many of the K-NN are From pos. Class?
• Step 1 --- Find positive: find the k closest pos. points to q
• Example: k = 3, giving distances d1, d2, d3
• di: distance of the i-th closest pos. point to q
• Fewer pos. points → easy to compute
How Many of the K-NN are From pos. Class?
• Step 2 --- Count negative
• ci: number of neg. points within di of q
• Example: k = 3 → c1 = 1, c2 = 5, c3 = 8
How Many of the K-NN are From pos. Class?
• Step 2 --- Lowerbound negative
• Idea: lowerbound each ci instead of computing it exactly
• Example: k = 3. Is c1 ≥ 3? Is c2 ≥ 2? Is c3 ≥ 1? (If ci ≥ k − i + 1, the i-th pos. point cannot be among the k-NN.)
How Many of the K-NN are From pos. Class?
• Step 2 --- Estimate negative, walking down the nodes of a metric-tree on the neg. points
• Example: k = 3. Is c1 ≥ 3? Is c2 ≥ 2? Is c3 ≥ 1?
• Root node A (20 neg. points): no useful bound yet → c1 ≥ 0, c2 ≥ 0, c3 ≥ 0
• Descend to nodes B (12 points) and C (8 points); B’s points are provably within d3 → c1 ≥ 0, c2 ≥ 0, c3 ≥ 12
• Descend further to nodes D (5 points) and E (7 points); D’s points are provably within d2 → c1 ≥ 0, c2 ≥ 5, c3 ≥ 12
• Descend to nodes E (7 points) and F (4 points); F’s points are provably within d1 → c1 ≥ 4, c2 ≥ 5, c3 ≥ 12
• Now c1 ≥ 3, c2 ≥ 2, and c3 ≥ 1 all hold: no pos. point is among the 3-NN. We are done! Return 0
KNS2: the Algorithm

  Build two metric-trees (Pos_tree / Neg_tree)
  Search Pos_tree: find the k closest pos. points (distances d1 … dk)
  Search Neg_tree:
    repeat
      pick a node from Neg_tree
      refine the lower bounds C = {c1, c2, …, ck}
      if ci ≥ k − i + 1, remove ci from C
    end repeat
  Let k' = size(C) after the search
  return k'
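The quantities KNS2 bounds can also be computed brute-force, which makes the decision rule concrete. The sketch below is my own flat-scan illustration, not the KNS2 tree search: the function name is made up, and it counts each ci exactly instead of lower-bounding it. The i-th positive point is among the k-NN exactly when ci ≤ k − i (the complement of the removal test ci ≥ k − i + 1 above).

```python
import numpy as np

def kns2_answer(q, pos_pts, neg_pts, k):
    """How many of q's k nearest neighbors are positive?

    d_i: distance to the i-th closest positive point.
    c_i: number of negatives strictly within d_i.
    The i-th positive is inside the k-NN set iff (i - 1) positives
    plus c_i negatives leave room for it, i.e. c_i <= k - i; since
    c_i is nondecreasing in i, the valid i form a prefix, and the
    answer k' is the largest valid i.
    """
    d = np.sort(np.linalg.norm(pos_pts - q, axis=1))[:k]
    dn = np.linalg.norm(neg_pts - q, axis=1)
    kprime = 0
    for i in range(1, len(d) + 1):
        c_i = int(np.sum(dn < d[i - 1]))
        if c_i <= k - i:  # the i-th positive is still among the k-NN
            kprime = i
    return kprime
```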
Experimental Results (KNS2)

Dataset      Dimension (d)   Data Size (N)
ds1          10              26,733
Letter       16              20,000
Video        45              420,970
J_Lee        100             181,395
Blanc_Mel    100             186,414
ds2          1.1×10^6        88,358
CPU Time Speedup Over Naïve K-NN (k = 9)
KNS2: 3x – 60x speed-up over naïve
[Bar chart: speedups (0–70) of metric-tree and KNS2 over naïve on ds1 (d=10), Letter (d=16), Video (d=45), J_Lee (d=100), Blanc_Mel (d=100), and ds2 (d=1.1M)]
My Work (Part 2): a New Metric-tree Based Approximate NN Search
• “I’m Feeling Lucky” search
• spill-tree
Why is Metric-tree Slow?
Empirically, metric-tree search…
• takes 10% of the time finding the NN
• takes 90% of the time backtracking
“I’m Feeling Lucky” Search
• Algorithm: simple
  – descends the metric-tree without backtracking
  – returns the first point hit in a leaf node
• Complexity: super fast --- O(log N) per query
• Accuracy: quite low --- liable to make mistakes when q is near the decision boundary
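The descent above can be sketched directly. This is an illustrative stand-in, not the thesis code; it assumes the same dict-based node layout used in the earlier sketches (an assumption of the sketch).

```python
import numpy as np

def lucky_search(node, q):
    """Defeatist ("I'm Feeling Lucky") descent: no backtracking.

    At each internal node, go to the child whose ball center is
    closer to q; at a leaf, return the closest stored point. A single
    root-to-leaf path costs O(log N) distance computations per query,
    but the answer can be wrong when q lies near a split boundary.
    """
    while node["left"] is not None:
        d_left = np.linalg.norm(q - node["left"]["center"])
        d_right = np.linalg.norm(q - node["right"]["center"])
        node = node["left"] if d_left <= d_right else node["right"]
    pts = node["points"]
    return pts[np.argmin(np.linalg.norm(pts - q, axis=1))]
```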
Spill-tree: adding redundancy to help “I’m Feeling Lucky” search
Spill-tree
• A variant of metric-tree
• The children of a node can “spill over” onto each other, and contain shared data-points
A Spill-tree Data Structure
• Two planes L_L and L_R sit around the mid-plane L between pivots p1 and p2; the region between them is the overlapping buffer.
• Spill-tree: both children own the points between L_L and L_R
• Metric-tree: each child only owns points to one side of L
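The overlap rule can be shown as a single split step. This is a minimal sketch under my own parameterization (mid-plane between the two pivots, buffer half-width tau), not necessarily the exact construction in the thesis.

```python
import numpy as np

def spill_split(points, p1, p2, tau):
    """One spill-tree split: children share points near the split plane.

    L is the mid-plane between pivots p1 and p2. A point whose signed
    distance to L is within the overlap half-width tau goes to BOTH
    children: this redundancy is what makes backtracking-free search
    accurate for queries near the boundary.
    """
    u = (p2 - p1) / np.linalg.norm(p2 - p1)  # split direction
    mid = (p1 + p2) / 2.0
    s = (points - mid) @ u                   # signed distance to plane L
    left = points[s <= tau]                  # everything left of L_R
    right = points[s >= -tau]                # everything right of L_L
    return left, right
```

With tau = 0 this degenerates to the ordinary metric-tree split; larger tau trades tree size (shared points are stored twice) for accuracy.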
Advantage of Spill-tree
• higher accuracy
• makes a mistake only when the true NN is far away
Problem with Spill-tree
• uncontrolled depth
  – depth stays O(log N) when the overlapping buffer size is small
  – the tree may fail to shrink when the buffer is large
  – empirically, the buffer size is set relative to R_NN, the expected distance of a point to its NN
Hybrid Spill-tree Search
• Balance threshold ρ = 70% (empirically): if either child of a node v contains more than ρ of the total points, then split v in the conventional (non-overlapping) way.
• Overlapping node: “I’m Feeling Lucky” search
• Non-overlapping node: backtracking search
Further Efficiency Improvement by Random Projection
Intuition: random projection approximately preserves distances.
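One standard way to realize this intuition is a Gaussian Johnson-Lindenstrauss-style projection; the sketch below is such a generic construction, and the thesis may use a different one.

```python
import numpy as np

def random_project(X, d_new, seed=0):
    """Project d-dim rows of X down to d_new dims with a random matrix.

    By the Johnson-Lindenstrauss lemma, pairwise distances are
    preserved up to a small relative error w.h.p. when
    d_new = O(log N / eps^2), so a spill-tree can be built and
    queried in the cheaper projected space.
    """
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((X.shape[1], d_new)) / np.sqrt(d_new)
    return X @ R
```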
Experiments for Spill-tree

Dataset      Num. Data (N)   Num. Dim (d)
Aerial       275,465         60
Corel_hist   20,000          64
Corel_uci    68,040          64
Disk         40,000          1024
Galaxy       40,000          3838
Comparison Methods
• Naïve k-NN
• Metric-tree
• Locality Sensitive Hashing (LSH)
• Spill-tree
Spill-tree vs. Metric-tree
[Table: CPU time (s) speed-up of spill-tree over metric-tree]
Spill-tree enjoys a 3.3x to 706x speed-up over metric-tree
Spill-tree vs. LSH
[Table: CPU time (s) of spill-tree and its speedup (in parentheses) over LSH]
Spill-tree enjoys a 2.5x to 31x speed-up over LSH
My Contribution
• T. Liu, A. W. Moore, A. Gray. Efficient Exact k-NN and Nonparametric Classification in High Dimensions. NIPS 2003.
• Y. Qi, A. Hauptmann, T. Liu. Supervised Classification for Video Shot Segmentation. ICME 2003.
• T. Liu, K. Yang, A. W. Moore. The IOC Algorithm: Efficient Many-Class Non-parametric Classification for High-Dimensional Data. KDD 2004.
• T. Liu, A. W. Moore, A. Gray, K. Yang. An Investigation of Practical Approximate Nearest Neighbor Algorithms. NIPS 2004.
Related Work
• [Uhlmann 1991, Omohundro 1991] propose the idea of the metric-tree (ball-tree)
• [Omachi-Aso 1997] a similar idea to KNS2 for NN classification
• [Gionis-Indyk-Motwani 1999] a practical approximate NN method: LSH
• [Arya-Fu 2003] expected-case complexity of approximate NN searching
• [Yan-Rahul 2004] near-duplicate detection and sub-image retrieval
• [Indyk 1998] approximate NN under the L∞ norm
Future Work
• Improve my previous work
  – self-tuning spill-tree
  – theoretical analysis of spill-tree
• Explore new related areas
  – dual-tree
• Real-world applications
Future Work (1): Self-Tuning Spill-tree
• Two key factors of the spill-tree:
  – random projection dimension d'
  – overlapping buffer size
Benefits of Automatic Parameter Tuning
• Avoid tedious hand-tuning
• Gain more insight into approximate NN search
Future Work (2): Theoretical Analysis
• Spill-tree + “I’m Feeling Lucky” search
  – good performance in practice
  – no theoretical guarantee yet
Idea: when the number of points is large enough, “I’m Feeling Lucky” search finds the true NN w.h.p.
Idea: with an overlapping buffer, the probability of successfully finding the true NN can be increased.
Future Work (3): Dual-Tree Search
• N-body problems [Gray-Moore 2001]
  – NN classification
  – kernel density estimation
  – outlier detection
  – two-point correlation
• These require pair-wise comparison of all N points
  – naïve solution: O(N²)
  – advanced solutions based on metric-trees
• Single-tree: only build a tree on the training data
• Dual-tree: build trees on both the training and the query data
Metric-tree: the Triangle Inequality
• Let q be a point inside query node Q
• Let x be a point inside training node B
• Then Dmin(Q, B) ≤ ||q − x|| ≤ Dmax(Q, B)
Pruning Opportunity [Gray-Moore 2001]
• A, B: nodes from the training set; Q: a node from the test set (centers O_A, O_B, O_Q)
• Prune A when Dmin(Q, A) ≥ Dmax(Q, B): no query in Q can have its NN in A
• A can’t be pruned in the illustrated case
• But, this is too pessimistic!
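The ball-to-ball bounds behind this pruning rule are easy to write down. This sketch is my own illustration of the Dmin/Dmax test, assuming nodes represented simply as (center, radius) pairs:

```python
import numpy as np

def d_min(ca, ra, cb, rb):
    """Smallest possible distance between points of two balls."""
    return max(float(np.linalg.norm(ca - cb)) - ra - rb, 0.0)

def d_max(ca, ra, cb, rb):
    """Largest possible distance between points of two balls."""
    return float(np.linalg.norm(ca - cb)) + ra + rb

def can_prune(Q, A, B):
    """Dual-tree NN pruning test: training node A cannot contain the NN
    of ANY query in Q if even A's closest possible point is farther
    than B's farthest possible point."""
    return d_min(Q[0], Q[1], A[0], A[1]) >= d_max(Q[0], Q[1], B[0], B[1])
```

The test is pessimistic exactly as the slide says: it compares one worst case against another, which is what the per-query hyperbola refinement on the next slide tries to sharpen.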
More Pruning Opportunity
• Prune A when every query q in Q lies on B’s side of the hyperbola H determined by O_A, O_B, and r_A + r_B
• A can be pruned in this case
• Challenge: to compute this efficiently
Future Work (4): Applications
• Multimedia --- video segmentation
  – shot-based segmentation
  – story-based segmentation
• Image retrieval --- near-duplicate detection
• Computer vision --- object recognition
Time Line
• Now – Apr. 2005: dual-tree (design and implementation); testing on real-world datasets
• May – Aug. 2005: improving the spill-tree algorithm; theoretical analysis
• Sept. – Dec. 2005: applications of the new k-NN algorithms
• Jan. – Mar. 2006: write up the final thesis
Thank you!
QUESTIONS?