Hubness in the Context of Feature Selection and Generation
Miloš Radovanović1, Alexandros Nanopoulos2, Mirjana Ivanović1
1 Department of Mathematics and Informatics, Faculty of Science, University of Novi Sad, Serbia
2 Institute of Computer Science, University of Hildesheim, Germany
k-occurrences (Nk)
Nk(x), the number of k-occurrences of point x, is the number of times x occurs among the k nearest neighbors of all other points in a data set. Equivalently, Nk(x) is the in-degree of node x in the k-NN digraph.
It was observed that the distribution of Nk can become skewed, resulting in the emergence of hubs, points with high Nk:
- Music retrieval [Aucouturier 2007]
- Speech recognition [Doddington 1998]
- Fingerprint identification [Hicklin 2005]
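The definition above can be sketched directly: build the k-NN digraph and count in-degrees. A minimal numpy example, assuming Euclidean distance and a data set small enough for a full pairwise distance matrix (the data here is synthetic, purely for illustration):

```python
import numpy as np

def k_occurrences(X, k):
    """N_k(x): the in-degree of x in the k-NN digraph, i.e. how many
    times x appears among the k nearest neighbors of the other points."""
    g = X @ X.T                                   # Gram matrix
    sq = np.diag(g)
    d2 = sq[:, None] + sq[None, :] - 2.0 * g      # pairwise squared Euclidean distances
    np.fill_diagonal(d2, np.inf)                  # a point is not its own neighbor
    knn = np.argsort(d2, axis=1)[:, :k]           # row i: indices of the k-NN of point i
    return np.bincount(knn.ravel(), minlength=len(X))

X = np.random.default_rng(0).random((200, 20))
N5 = k_occurrences(X, k=5)                        # in-degrees; they sum to 200 * 5
```

Since every point contributes exactly k out-edges, the in-degrees always sum to n·k.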
FGSIR'10, July 23, 2010
Skewness of Nk
What causes the skewness of Nk?
- An artefact of the data? Are some songs more similar to others? Do some people have fingerprints or voices that are harder to distinguish from other people's?
- Specifics of the modeling algorithms? An inadequate choice of features?
- Something more general?
[Figure: distributions p(N5) of 5-occurrences for iid uniform data with d = 3, d = 20, and d = 100 (log10 scale for d = 100), under the l2, l0.5, and cosine distances. The distribution becomes increasingly right-skewed as d grows.]
Contributions - Outline
- Demonstrate the phenomenon: skewness in the distribution of k-occurrences
- Explain its main reasons: not an artifact of the data, not specifics of the models (inadequate features, etc.), but a new aspect of the "curse of dimensionality"
- Impact on feature selection and generation
Outline
- Demonstrate the phenomenon
- Explain its main reasons
- Impact on FSG
- Conclusions
Collection of 23 real text data sets
SNk, the skewness of Nk, is the standardized third moment of Nk:
SNk = E[(Nk - μNk)^3] / σNk^3
If SNk = 0 there is no skew; positive (negative) values signify right (left) skew.
High skewness indicates hubness.
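SNk is easy to estimate empirically. A hedged numpy sketch (sample sizes, seed, and the choice of iid uniform data are arbitrary, not taken from the slides), showing that the skewness of N5 rises sharply with dimensionality:

```python
import numpy as np

def k_occurrences(X, k):
    """In-degrees of the k-NN digraph (Euclidean distance)."""
    g = X @ X.T
    sq = np.diag(g)
    d2 = sq[:, None] + sq[None, :] - 2.0 * g
    np.fill_diagonal(d2, np.inf)
    return np.bincount(np.argsort(d2, axis=1)[:, :k].ravel(), minlength=len(X))

def skewness(x):
    """Standardized third moment: E[(x - mean)^3] / std^3."""
    x = np.asarray(x, dtype=float)
    return ((x - x.mean()) ** 3).mean() / x.std() ** 3

rng = np.random.default_rng(0)
S_low  = skewness(k_occurrences(rng.random((500, 3)),   k=5))   # low-dimensional
S_high = skewness(k_occurrences(rng.random((500, 100)), k=5))   # high-dimensional
# S_high exceeds S_low: hubness emerges with dimensionality
```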
Collection of 14 real UCI data sets + microarray data
Outline
- Demonstrate the phenomenon
- Explain its main reasons
- Impact on FSG
- Conclusions
Where are the hubs located?
CdmN10: the Spearman correlation between N10 and the distance from the data set mean.
Hubs are closer to the data center.
[Figure: scatter plots of N5 versus distance from the data set mean for iid uniform data: d = 3 (CdmN5 = -0.018), d = 20 (CdmN5 = -0.803), d = 100 (CdmN5 = -0.865). The negative correlation strengthens with dimensionality.]
Centrality and its amplification
- Hubs arise due to centrality: vectors closer to the center tend to be closer to all other vectors, and thus appear more often as k-NN
- Centrality is amplified by dimensionality: if point A is closer to the center than point B, the difference Σx sim(A, x) - Σx sim(B, x) is positive, and becomes more pronounced as dimensionality grows
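The centrality effect can be illustrated on synthetic data (a sketch, not the paper's experiments; sizes and seed are arbitrary). Spearman correlation is computed via a plain rank transform, with ties broken arbitrarily:

```python
import numpy as np

def k_occurrences(X, k):
    g = X @ X.T
    sq = np.diag(g)
    d2 = sq[:, None] + sq[None, :] - 2.0 * g
    np.fill_diagonal(d2, np.inf)
    return np.bincount(np.argsort(d2, axis=1)[:, :k].ravel(), minlength=len(X))

def spearman(a, b):
    ra = np.argsort(np.argsort(a))   # rank transform (ties broken arbitrarily)
    rb = np.argsort(np.argsort(b))
    return np.corrcoef(ra, rb)[0, 1]

rng = np.random.default_rng(1)
corr = {}
for d in (3, 20, 100):
    X = rng.random((500, d))
    dist_from_mean = np.linalg.norm(X - X.mean(axis=0), axis=1)
    corr[d] = spearman(k_occurrences(X, 10), dist_from_mean)
# corr[100] is strongly negative: in high dimensions, hubs sit near the data center
```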
Concentration of similarity
Concentration: as dimensionality grows to infinity, the ratio between the standard deviation of pairwise similarities (distances) and their expectation shrinks to zero
- Shown for Minkowski distances [François 2007, Beyer 1999, Aggarwal 2001], calling into question the meaningfulness of nearest neighbors
- Analytical proof for cosine similarity [Radovanović 2010]
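The concentration effect itself is easy to reproduce; a small numpy sketch for iid uniform data under Euclidean distance (sample sizes are arbitrary):

```python
import numpy as np

def relative_spread(d, n=200, seed=0):
    """std / mean of pairwise Euclidean distances for iid uniform data."""
    X = np.random.default_rng(seed).random((n, d))
    g = X @ X.T
    sq = np.diag(g)
    d2 = sq[:, None] + sq[None, :] - 2.0 * g
    dist = np.sqrt(np.maximum(d2[np.triu_indices(n, k=1)], 0.0))  # unordered pairs
    return dist.std() / dist.mean()

spreads = [relative_spread(d) for d in (3, 30, 300)]
# The ratio shrinks as dimensionality grows (roughly like 1/sqrt(d))
```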
The hyper-sphere view
- Most vectors are approximately equidistant from the center and from each other, and lie on the surface of a hyper-sphere
- Few vectors lie in the inner part of the hyper-sphere, closer to its center and thus closer to all others
- This is expected for large but finite dimensionality, since the ratio √V / E of the standard deviation of distances to their expectation is still non-negligible
What happens with real data?
- Real text data are usually clustered (a mixture of distributions)
- Cluster with k-means (#clusters = 3 * Cls)
- Compare CdmN10 with CcmN10, the Spearman correlations between N10 and the distance from the data set mean / the closest cluster center
- Generalization of the hyper-sphere view to clusters
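This comparison can be sketched on a synthetic Gaussian mixture with a tiny k-means (not the paper's experimental setup; the cluster counts, dimension, and separation are invented for illustration):

```python
import numpy as np

def k_occurrences(X, k):
    g = X @ X.T
    sq = np.diag(g)
    d2 = sq[:, None] + sq[None, :] - 2.0 * g
    np.fill_diagonal(d2, np.inf)
    return np.bincount(np.argsort(d2, axis=1)[:, :k].ravel(), minlength=len(X))

def spearman(a, b):
    ra, rb = np.argsort(np.argsort(a)), np.argsort(np.argsort(b))
    return np.corrcoef(ra, rb)[0, 1]

def kmeans(X, k, iters=30, seed=0):
    """Plain Lloyd iterations with random initial centers."""
    C = X[np.random.default_rng(seed).choice(len(X), k, replace=False)]
    for _ in range(iters):
        lbl = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2).argmin(axis=1)
        C = np.array([X[lbl == j].mean(axis=0) if np.any(lbl == j) else C[j]
                      for j in range(k)])
    return C, lbl

rng = np.random.default_rng(2)
# three well-separated Gaussian clusters in d = 50
X = np.concatenate([rng.standard_normal((200, 50)) + 10.0 * rng.standard_normal(50)
                    for _ in range(3)])
C, lbl = kmeans(X, 9)                 # 3 * Cls centers, mirroring the slide's heuristic
N10 = k_occurrences(X, 10)
c_dm = spearman(N10, np.linalg.norm(X - X.mean(axis=0), axis=1))   # vs. data mean
c_cm = spearman(N10, np.linalg.norm(X - C[lbl], axis=1))           # vs. own cluster center
# c_cm is clearly negative: hubs lie near their cluster's center
```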
UCI data
Can dim reduction help?
Intrinsic dimensionality is reached
UCI data
[Figure: skewness SN10 versus percentage of features retained by PCA, for musk1, mfeat-factors, spectrometer, and iid uniform data with d = 15 (no PCA).]
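The flavor of this PCA experiment can be probed on synthetic data (numpy only; all sizes are invented). For iid Gaussian data the intrinsic dimensionality equals the full dimensionality, so projecting far down removes hubness along with information:

```python
import numpy as np

def k_occurrences(X, k):
    g = X @ X.T
    sq = np.diag(g)
    d2 = sq[:, None] + sq[None, :] - 2.0 * g
    np.fill_diagonal(d2, np.inf)
    return np.bincount(np.argsort(d2, axis=1)[:, :k].ravel(), minlength=len(X))

def skewness(x):
    x = np.asarray(x, dtype=float)
    return ((x - x.mean()) ** 3).mean() / x.std() ** 3

rng = np.random.default_rng(0)
X = rng.standard_normal((400, 50))
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)   # principal axes

def S_after_pca(m):
    """Skewness of N_10 after projecting onto the first m principal components."""
    return skewness(k_occurrences(Xc @ Vt[:m].T, 10))

S_50, S_3 = S_after_pca(50), S_after_pca(3)
# Reducing far below the intrinsic dimensionality also removes the hubness
```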
Outline
- Demonstrate the phenomenon
- Explain its main reasons
- Impact on FSG
- Conclusions
“Bad” hubs as obstinate results
Based on information about classes, k-occurrences can be distinguished into:
- “Bad” k-occurrences, BNk(x), where the neighbor's label differs
- “Good” k-occurrences, GNk(x), where the labels agree
- Nk(x) = BNk(x) + GNk(x)
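The BNk/GNk split is straightforward to compute when labels are available. A sketch of the bookkeeping (random data and labels here, purely to exercise the decomposition):

```python
import numpy as np

def bad_good_k_occurrences(X, y, k):
    """Split each point's k-occurrences by label agreement:
    N_k(x) = BN_k(x) + GN_k(x)."""
    g = X @ X.T
    sq = np.diag(g)
    d2 = sq[:, None] + sq[None, :] - 2.0 * g
    np.fill_diagonal(d2, np.inf)
    knn = np.argsort(d2, axis=1)[:, :k]
    n = len(X)
    BN = np.zeros(n, dtype=int)
    GN = np.zeros(n, dtype=int)
    for i in range(n):
        for j in knn[i]:            # point j occurs in the k-NN list of point i
            if y[i] == y[j]:
                GN[j] += 1          # "good": labels agree
            else:
                BN[j] += 1          # "bad": label mismatch (CA violation)
    return BN, GN

rng = np.random.default_rng(0)
X = rng.random((300, 20))
y = rng.integers(0, 2, size=300)
BN, GN = bad_good_k_occurrences(X, y, k=5)
# (BN + GN) equals N_5 pointwise and sums to 300 * 5
```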
How do “bad” hubs originate?
- The mixture is also important: high dimensionality and skewness of Nk do not automatically induce “badness”
- “Bad” hubs originate from a combination of high dimensionality and violation of the CA
- Cluster Assumption (CA): most pairs of vectors in a cluster should be of the same class [Chapelle 2006]
Skewness of Nk vs. #features
- Skewness stays relatively constant
- It abruptly drops when the intrinsic dimensionality is reached
- Further feature selection may incur loss of information
Badness vs. #features
- Similar observations
- When the intrinsic dimensionality is reached, the BNk ratio increases
- The representation then ceases to reflect the information provided by the labels very well
Feature generation
When adding features to bring new information to the data:
- The representation will ultimately increase SNk and thus produce hubs
- The reduction of the BNk ratio “flattens out” fairly quickly, limiting the usefulness of adding new features for expressing the “ground truth”
- If classifier error rate is used instead of the BNk ratio, the results are similar
Conclusion
Research on feature selection/generation has paid little attention to the fact that, in intrinsically high-dimensional data, hubs will:
- Result in an uneven distribution of cluster-assumption violation (hubs attract more label mismatches with neighboring points)
- Result in an uneven distribution of responsibility for classification or retrieval error among data points
Investigating further the interaction between hubness and different notions of CA violation promises important new insights into feature selection/generation.
Thank You!
Alexandros Nanopoulos (nanopoulos@ismll.de)