Hubness in the Context of Feature Selection and Generation
Miloš Radovanović1, Alexandros Nanopoulos2, Mirjana Ivanović1
1 Department of Mathematics and Informatics, Faculty of Science, University of Novi Sad, Serbia
2 Institute of Computer Science, University of Hildesheim, Germany
k-occurrences (Nk)
Nk(x), the number of k-occurrences of point x, is the number of times x occurs among the k nearest neighbors of all other points in a data set. Equivalently, Nk(x) is the in-degree of node x in the k-NN digraph.
It was observed that the distribution of Nk can become skewed, resulting in the emergence of hubs, points with high Nk:
- Music retrieval [Aucouturier 2007]
- Speech recognition [Doddington 1998]
- Fingerprint identification [Hicklin 2005]
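The definition above can be sketched directly: build the k-NN digraph and count in-degrees. A minimal numpy example, assuming Euclidean distance and a data set small enough for a full pairwise distance matrix (the data here is synthetic, purely for illustration):

```python
import numpy as np

def k_occurrences(X, k):
    """N_k(x): the in-degree of x in the k-NN digraph, i.e. how many
    times x appears among the k nearest neighbors of the other points."""
    g = X @ X.T                                   # Gram matrix
    sq = np.diag(g)
    d2 = sq[:, None] + sq[None, :] - 2.0 * g      # pairwise squared Euclidean distances
    np.fill_diagonal(d2, np.inf)                  # a point is not its own neighbor
    knn = np.argsort(d2, axis=1)[:, :k]           # row i: indices of the k-NN of point i
    return np.bincount(knn.ravel(), minlength=len(X))

X = np.random.default_rng(0).random((200, 20))
N5 = k_occurrences(X, k=5)                        # in-degrees; they sum to 200 * 5
```

Since every point contributes exactly k out-edges, the in-degrees always sum to n·k.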
FGSIR'10, July 23, 2010
Skewness of Nk
What causes the skewness of Nk?
- An artefact of the data? Are some songs more similar to others? Do some people have fingerprints or voices that are harder to distinguish from other people's?
- Specifics of the modeling algorithms? An inadequate choice of features?
- Something more general?
[Figure: distributions p(N5) of 5-occurrences for iid uniform data with d = 3, d = 20, and d = 100 (log10 scale for d = 100), under the l2, l0.5, and cosine distances. The distribution becomes increasingly right-skewed as d grows.]
Contributions - Outline
- Demonstrate the phenomenon: skewness in the distribution of k-occurrences
- Explain its main reasons: not an artifact of the data, not specifics of the models (inadequate features, etc.), but a new aspect of the "curse of dimensionality"
- Impact on feature selection and generation
Outline
- Demonstrate the phenomenon
- Explain its main reasons
- Impact on FSG
- Conclusions
Collection of 23 real text data sets
SNk, the skewness of Nk, is the standardized third moment of Nk:
SNk = E[(Nk - μNk)^3] / σNk^3
If SNk = 0 there is no skew; positive (negative) values signify right (left) skew.
High skewness indicates hubness.
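SNk is easy to estimate empirically. A hedged numpy sketch (sample sizes, seed, and the choice of iid uniform data are arbitrary, not taken from the slides), showing that the skewness of N5 rises sharply with dimensionality:

```python
import numpy as np

def k_occurrences(X, k):
    """In-degrees of the k-NN digraph (Euclidean distance)."""
    g = X @ X.T
    sq = np.diag(g)
    d2 = sq[:, None] + sq[None, :] - 2.0 * g
    np.fill_diagonal(d2, np.inf)
    return np.bincount(np.argsort(d2, axis=1)[:, :k].ravel(), minlength=len(X))

def skewness(x):
    """Standardized third moment: E[(x - mean)^3] / std^3."""
    x = np.asarray(x, dtype=float)
    return ((x - x.mean()) ** 3).mean() / x.std() ** 3

rng = np.random.default_rng(0)
S_low  = skewness(k_occurrences(rng.random((500, 3)),   k=5))   # low-dimensional
S_high = skewness(k_occurrences(rng.random((500, 100)), k=5))   # high-dimensional
# S_high exceeds S_low: hubness emerges with dimensionality
```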
Collection of 14 real UCI data sets + microarray data
Outline
- Demonstrate the phenomenon
- Explain its main reasons
- Impact on FSG
- Conclusions
Where are the hubs located?
CdmN10: the Spearman correlation between N10 and the distance from the data set mean.
Hubs are closer to the data center.
[Figure: scatter plots of N5 versus distance from the data set mean for iid uniform data: d = 3 (CdmN5 = -0.018), d = 20 (CdmN5 = -0.803), d = 100 (CdmN5 = -0.865). The negative correlation strengthens with dimensionality.]
Centrality and its amplification
- Hubs arise due to centrality: vectors closer to the center tend to be closer to all other vectors, and thus appear more often as k-NN
- Centrality is amplified by dimensionality: if point A is closer to the center than point B, the difference Σx sim(A, x) - Σx sim(B, x) is positive, and becomes more pronounced as dimensionality grows
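The centrality effect can be illustrated on synthetic data (a sketch, not the paper's experiments; sizes and seed are arbitrary). Spearman correlation is computed via a plain rank transform, with ties broken arbitrarily:

```python
import numpy as np

def k_occurrences(X, k):
    g = X @ X.T
    sq = np.diag(g)
    d2 = sq[:, None] + sq[None, :] - 2.0 * g
    np.fill_diagonal(d2, np.inf)
    return np.bincount(np.argsort(d2, axis=1)[:, :k].ravel(), minlength=len(X))

def spearman(a, b):
    ra = np.argsort(np.argsort(a))   # rank transform (ties broken arbitrarily)
    rb = np.argsort(np.argsort(b))
    return np.corrcoef(ra, rb)[0, 1]

rng = np.random.default_rng(1)
corr = {}
for d in (3, 20, 100):
    X = rng.random((500, d))
    dist_from_mean = np.linalg.norm(X - X.mean(axis=0), axis=1)
    corr[d] = spearman(k_occurrences(X, 10), dist_from_mean)
# corr[100] is strongly negative: in high dimensions, hubs sit near the data center
```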
Concentration of similarity
Concentration: as dimensionality grows to infinity, the ratio between the standard deviation of pairwise similarities (distances) and their expectation shrinks to zero
- Shown for Minkowski distances [François 2007, Beyer 1999, Aggarwal 2001], calling into question the meaningfulness of nearest neighbors
- Analytical proof for cosine similarity [Radovanović 2010]
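The concentration effect itself is easy to reproduce; a small numpy sketch for iid uniform data under Euclidean distance (sample sizes are arbitrary):

```python
import numpy as np

def relative_spread(d, n=200, seed=0):
    """std / mean of pairwise Euclidean distances for iid uniform data."""
    X = np.random.default_rng(seed).random((n, d))
    g = X @ X.T
    sq = np.diag(g)
    d2 = sq[:, None] + sq[None, :] - 2.0 * g
    dist = np.sqrt(np.maximum(d2[np.triu_indices(n, k=1)], 0.0))  # unordered pairs
    return dist.std() / dist.mean()

spreads = [relative_spread(d) for d in (3, 30, 300)]
# The ratio shrinks as dimensionality grows (roughly like 1/sqrt(d))
```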
The hyper-sphere view
- Most vectors are approximately equidistant from the center and from each other, and lie on the surface of a hyper-sphere
- Few vectors lie in the inner part of the hyper-sphere, closer to its center and thus closer to all others
- This is expected for large but finite dimensionality, since the ratio √V / E of the standard deviation of distances to their expectation is still non-negligible
What happens with real data?
- Real text data are usually clustered (a mixture of distributions)
- Cluster with k-means (#clusters = 3 * Cls)
- Compare CdmN10 with CcmN10, the Spearman correlations between N10 and the distance from the data set mean / the closest cluster center
- Generalization of the hyper-sphere view to clusters
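This comparison can be sketched on a synthetic Gaussian mixture with a tiny k-means (not the paper's experimental setup; the cluster counts, dimension, and separation are invented for illustration):

```python
import numpy as np

def k_occurrences(X, k):
    g = X @ X.T
    sq = np.diag(g)
    d2 = sq[:, None] + sq[None, :] - 2.0 * g
    np.fill_diagonal(d2, np.inf)
    return np.bincount(np.argsort(d2, axis=1)[:, :k].ravel(), minlength=len(X))

def spearman(a, b):
    ra, rb = np.argsort(np.argsort(a)), np.argsort(np.argsort(b))
    return np.corrcoef(ra, rb)[0, 1]

def kmeans(X, k, iters=30, seed=0):
    """Plain Lloyd iterations with random initial centers."""
    C = X[np.random.default_rng(seed).choice(len(X), k, replace=False)]
    for _ in range(iters):
        lbl = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2).argmin(axis=1)
        C = np.array([X[lbl == j].mean(axis=0) if np.any(lbl == j) else C[j]
                      for j in range(k)])
    return C, lbl

rng = np.random.default_rng(2)
# three well-separated Gaussian clusters in d = 50
X = np.concatenate([rng.standard_normal((200, 50)) + 10.0 * rng.standard_normal(50)
                    for _ in range(3)])
C, lbl = kmeans(X, 9)                 # 3 * Cls centers, mirroring the slide's heuristic
N10 = k_occurrences(X, 10)
c_dm = spearman(N10, np.linalg.norm(X - X.mean(axis=0), axis=1))   # vs. data mean
c_cm = spearman(N10, np.linalg.norm(X - C[lbl], axis=1))           # vs. own cluster center
# c_cm is clearly negative: hubs lie near their cluster's center
```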
UCI data
Can dim reduction help?
Intrinsic dimensionality is reached
UCI data
[Figure: skewness SN10 versus percentage of features retained by PCA, for musk1, mfeat-factors, spectrometer, and iid uniform data with d = 15 (no PCA).]
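The flavor of this PCA experiment can be probed on synthetic data (numpy only; all sizes are invented). For iid Gaussian data the intrinsic dimensionality equals the full dimensionality, so projecting far down removes hubness along with information:

```python
import numpy as np

def k_occurrences(X, k):
    g = X @ X.T
    sq = np.diag(g)
    d2 = sq[:, None] + sq[None, :] - 2.0 * g
    np.fill_diagonal(d2, np.inf)
    return np.bincount(np.argsort(d2, axis=1)[:, :k].ravel(), minlength=len(X))

def skewness(x):
    x = np.asarray(x, dtype=float)
    return ((x - x.mean()) ** 3).mean() / x.std() ** 3

rng = np.random.default_rng(0)
X = rng.standard_normal((400, 50))
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)   # principal axes

def S_after_pca(m):
    """Skewness of N_10 after projecting onto the first m principal components."""
    return skewness(k_occurrences(Xc @ Vt[:m].T, 10))

S_50, S_3 = S_after_pca(50), S_after_pca(3)
# Reducing far below the intrinsic dimensionality also removes the hubness
```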
Outline
- Demonstrate the phenomenon
- Explain its main reasons
- Impact on FSG
- Conclusions
“Bad” hubs as obstinate results
Based on information about classes, k-occurrences can be distinguished into:
- “Bad” k-occurrences, BNk(x), where the neighbor's label differs
- “Good” k-occurrences, GNk(x), where the labels agree
- Nk(x) = BNk(x) + GNk(x)
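The BNk/GNk split is straightforward to compute when labels are available. A sketch of the bookkeeping (random data and labels here, purely to exercise the decomposition):

```python
import numpy as np

def bad_good_k_occurrences(X, y, k):
    """Split each point's k-occurrences by label agreement:
    N_k(x) = BN_k(x) + GN_k(x)."""
    g = X @ X.T
    sq = np.diag(g)
    d2 = sq[:, None] + sq[None, :] - 2.0 * g
    np.fill_diagonal(d2, np.inf)
    knn = np.argsort(d2, axis=1)[:, :k]
    n = len(X)
    BN = np.zeros(n, dtype=int)
    GN = np.zeros(n, dtype=int)
    for i in range(n):
        for j in knn[i]:            # point j occurs in the k-NN list of point i
            if y[i] == y[j]:
                GN[j] += 1          # "good": labels agree
            else:
                BN[j] += 1          # "bad": label mismatch (CA violation)
    return BN, GN

rng = np.random.default_rng(0)
X = rng.random((300, 20))
y = rng.integers(0, 2, size=300)
BN, GN = bad_good_k_occurrences(X, y, k=5)
# (BN + GN) equals N_5 pointwise and sums to 300 * 5
```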
How do “bad” hubs originate?
- The mixture is also important: high dimensionality and skewness of Nk do not automatically induce “badness”
- “Bad” hubs originate from a combination of high dimensionality and violation of the CA
- Cluster Assumption (CA): most pairs of vectors in a cluster should be of the same class [Chapelle 2006]
Skewness of Nk vs. #features
- Skewness stays relatively constant
- It abruptly drops when the intrinsic dimensionality is reached
- Further feature selection may incur loss of information
Badness vs. #features
- Similar observations
- When the intrinsic dimensionality is reached, the BNk ratio increases
- The representation then ceases to reflect the information provided by the labels very well
Feature generation
When adding features to bring new information to the data:
- The representation will ultimately increase SNk and thus produce hubs
- The reduction of the BNk ratio “flattens out” fairly quickly, limiting the usefulness of adding new features for expressing the “ground truth”
- If classifier error rate is used instead of the BNk ratio, the results are similar
Conclusion
Research on feature selection/generation has paid little attention to the fact that, in intrinsically high-dimensional data, hubs will:
- Result in an uneven distribution of cluster-assumption violation (hubs attract more label mismatches with neighboring points)
- Result in an uneven distribution of responsibility for classification or retrieval error among data points
Investigating further the interaction between hubness and different notions of CA violation promises important new insights into feature selection/generation.
Thank You!
Alexandros Nanopoulos (nanopoulos@ismll.de)