Learning Term-weighting Functions for Similarity Measures


Transcript of Learning Term-weighting Functions for Similarity Measures

Page 1: Learning Term-weighting Functions for Similarity Measures

Learning Term-weighting Functions for Similarity Measures

Scott Wen-tau Yih, Microsoft Research

Page 2: Learning Term-weighting Functions for Similarity Measures

Applications of Similarity Measures

Query Suggestion

How similar are they?

mariners vs. seattle mariners
mariners vs. 1st mariner bank

query: mariners

Page 3: Learning Term-weighting Functions for Similarity Measures

Applications of Similarity Measures

Ad Relevance

query: movie theater tickets

Page 4: Learning Term-weighting Functions for Similarity Measures

Similarity Measures based on TFIDF Vectors

Digital Camera Review
The new flagship of Canon’s S-series, PowerShot S80 digital camera, incorporates 8 megapixels for shooting still images and a movie mode that records an impressive 1024 x 768 pixels.

vp = { digital: 1.35, camera: 0.89, review: 0.32, … }

Each entry is a TFIDF score, e.g., tf(“review”, Dp) ⋅ idf(“review”)

Sim(Dp, Dq) = fsim(vp, vq), where fsim could be cosine, overlap, Jaccard, etc.
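The fixed TFIDF scheme on this slide can be sketched as follows; the function and variable names are illustrative, not from the talk:

```python
import math

def tfidf_vector(tokens, doc_freq, n_docs):
    """Term-weighting: tw(t, Dp) = tf(t, Dp) * log(N / df(t))."""
    vec = {}
    for t in tokens:
        vec[t] = vec.get(t, 0) + 1  # raw term frequency tf(t, Dp)
    return {t: tf * math.log(n_docs / doc_freq.get(t, 1)) for t, tf in vec.items()}

def cosine(vp, vq):
    """fsim as cosine over sparse term vectors vp, vq."""
    dot = sum(w * vq.get(t, 0.0) for t, w in vp.items())
    norm_p = math.sqrt(sum(w * w for w in vp.values()))
    norm_q = math.sqrt(sum(w * w for w in vq.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0
```

Swapping cosine for overlap or Jaccard only changes fsim; the term vectors stay the same.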

Page 5: Learning Term-weighting Functions for Similarity Measures

Vector-based Similarity Measures Pros & Cons

Advantages
Simple & efficient
Concise representation
Effective in many applications

Issues
Not trivial to adapt to the target domain
Lots of variations of TFIDF formulas
Not clear how to incorporate other information

e.g., term position, query log frequency, etc.

Page 6: Learning Term-weighting Functions for Similarity Measures

Approach: Learn Term-weighting Functions

TWEAK – Term-weighting Learning Framework
Instead of a fixed TFIDF formula, learn the term-weighting functions

Preserves the engineering advantages of vector-based similarity measures
Able to incorporate other term information and fine-tune the similarity measure
Flexible in choosing various loss functions to match the true objectives in the target applications

Page 7: Learning Term-weighting Functions for Similarity Measures

Outline
Introduction
Problem Statement & Model
  Formal definition
  Loss functions
Experiments
  Query suggestion
  Ad page relevance

Conclusions

Page 8: Learning Term-weighting Functions for Similarity Measures

Vector-based Similarity Measures Formal Definition

Compute the similarity between Dp and Dq

Vocabulary: V = {t1, t2, …, tn}
Term-vector: vp = (sp1, sp2, …, spn)
Term-weighting score: spi ≡ tw(ti, Dp)
Similarity: fsim(vp, vq)

Page 9: Learning Term-weighting Functions for Similarity Measures

TFIDF Cosine Similarity

Use the same fsim(∙, ∙) (i.e., cosine):

fsim(vp, vq) = (vp ⋅ vq) / (‖vp‖ ⋅ ‖vq‖)

TFIDF term-weighting function:

tw(ti, Dp) ≡ tf(ti, Dp) ⋅ log(N / df(ti))

Linear term-weighting function:

twλ(ti, Dp) ≡ Σj λj ⋅ φj(ti, Dp)
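A minimal sketch of the linear term-weighting function twλ, using hypothetical features φj (the real feature set is described on slide 14):

```python
def tw_lambda(term, doc_tokens, lam, features):
    """Learned term weight: sum over j of lam[j] * phi_j(term, doc)."""
    return sum(l * phi(term, doc_tokens) for l, phi in zip(lam, features))

# Hypothetical features phi_j(t, Dp); illustrative only.
features = [
    lambda t, d: float(d.count(t)),                # raw term frequency
    lambda t, d: 1.0 if t[:1].isupper() else 0.0,  # capitalization indicator
]
lam = [0.8, 0.3]  # weights lambda_j, to be learned from labeled pairs
```

With the fixed TFIDF formula replaced by this learned weighted sum, any term-level signal can be added as another φj.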

Page 10: Learning Term-weighting Functions for Similarity Measures

Learning Similarity Metric

Training examples: document pairs
(y1, (Dp1, Dq1)), …, (ym, (Dpm, Dqm))

Loss functions

Sum-of-squares error:
Lsse(λ) = ½ Σk=1..m (yk − fsim(vpk, vqk))²

Log-loss:
Llog(λ) = −Σk=1..m [ yk log fsim(vpk, vqk) + (1 − yk) log(1 − fsim(vpk, vqk)) ]

Smoothing: add (α/2)‖λ‖² to the loss
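Both losses can be computed directly from the similarity scores; a sketch (the (α/2)‖λ‖² smoothing term would be added separately):

```python
import math

def sse_loss(scores, labels):
    """Lsse = 1/2 * sum_k (y_k - fsim_k)^2."""
    return 0.5 * sum((y - s) ** 2 for y, s in zip(labels, scores))

def log_loss(scores, labels, eps=1e-12):
    """Llog = -sum_k [y_k log s_k + (1 - y_k) log(1 - s_k)];
    scores are clipped away from 0 and 1 to keep the logs finite."""
    total = 0.0
    for y, s in zip(labels, scores):
        s = min(max(s, eps), 1.0 - eps)
        total -= y * math.log(s) + (1.0 - y) * math.log(1.0 - s)
    return total
```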

Page 11: Learning Term-weighting Functions for Similarity Measures

Learning Preference Ordering

Training examples: pairs of document pairs
(y1, (xa1, xb1)), …, (ym, (xam, xbm)), where xak = (Dpak, Dqak), xbk = (Dpbk, Dqbk)

LogExpLoss [Dekel et al. NIPS-03]
Upper-bounds the pairwise ranking error

Δk = fsim(vpak, vqak) − fsim(vpbk, vqbk)

L(λ) = Σk=1..m log(1 + exp(−yk Δk + (1 − yk) Δk))
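A sketch of the LogExpLoss above, with yk ∈ {0, 1} indicating which pair should score higher:

```python
import math

def logexp_loss(deltas, labels):
    """L(lam) = sum_k log(1 + exp(-y_k*Delta_k + (1 - y_k)*Delta_k)).

    Delta_k = fsim(pair a) - fsim(pair b); y_k = 1 means pair a is
    preferred, y_k = 0 means pair b is."""
    total = 0.0
    for y, d in zip(labels, deltas):
        margin = d if y == 1 else -d  # signed margin toward the preferred pair
        total += math.log1p(math.exp(-margin))
    return total
```

The loss shrinks toward zero as the preferred pair's similarity pulls ahead, and grows when the ordering is violated.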

Page 12: Learning Term-weighting Functions for Similarity Measures

Outline
Introduction
Problem Definition & Model
  Term-weighting functions
  Objective functions
Experiments
  Query suggestion
  Ad page relevance

Conclusions

Page 13: Learning Term-weighting Functions for Similarity Measures

Experiment – Query Suggestion

Data: Query suggestion dataset [Metzler et al. ’07; Yih & Meek ’07]
|Q| = 122, |(Q,S)| = 4852; {Excellent, Good} vs. {Fair, Bad}

Query                   Suggestion              Label
shell oil credit card   shell gas cards         Excellent
shell oil credit card   texaco credit card      Fair
tarrant county college  fresno city college     Bad
tarrant county college  dallas county schools   Good

Page 14: Learning Term-weighting Functions for Similarity Measures

Term Vector Construction and Features

Query expansion of x using a search engine:
Issue the query x to a search engine
Concatenate top-n search result snippets
(titles and summaries of the top-n returned documents)

Features (of each term w.r.t. the document):
Term Frequency, Capitalization, Location
Document Frequency, Query Log Frequency
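A hypothetical extractor for these per-term features; the exact definitions and scaling used in the talk may differ:

```python
def term_features(term, snippet_tokens, doc_freq, qlog_freq):
    """Features of one term w.r.t. the expanded document: term frequency,
    capitalization, location of first occurrence, document frequency, and
    query-log frequency. All names and scalings here are illustrative."""
    lowered = [w.lower() for w in snippet_tokens]
    tf = lowered.count(term)
    capitalized = any(w[:1].isupper() and w.lower() == term for w in snippet_tokens)
    location = lowered.index(term) / len(lowered) if tf else 1.0
    return [float(tf),
            1.0 if capitalized else 0.0,
            location,
            float(doc_freq.get(term, 0)),
            float(qlog_freq.get(term, 0))]
```

Each such feature vector is one φ(t, D) that the learned weights λ combine into a term-weighting score.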

Page 15: Learning Term-weighting Functions for Similarity Measures

Results – Query Suggestion

[Chart: results across methods; labeled values include 0.782 and 0.597]

10-fold CV; smoothing parameter selected on dev set

Page 16: Learning Term-weighting Functions for Similarity Measures

Experiment – Ad Page Relevance

Data: a random sample of queries and ad landing pages collected during 2008
Collected 13,341 query/page pairs with reliable labels (8,309 relevant; 5,032 irrelevant)

Apply the same query expansion on queries
Additional HTML features:
Hypertext, URL, Title
Meta-keywords, Meta-description

Page 17: Learning Term-weighting Functions for Similarity Measures

Results – Ad Page Relevance

Features    AUC
TFIDF       0.794
TF&DF       0.806
Plaintext   0.832
HTML        0.855

Preference order learning on different feature sets

[Chart comparing the feature sets above]


Page 19: Learning Term-weighting Functions for Similarity Measures

Related Work

“Siamese” neural network framework
Vectors of objects being compared are generated by two-layer neural networks
Applications: fingerprint matching, face matching
TWEAK can be viewed as a single-layer neural network with many (vocabulary-size) output nodes

Learning the term-weighting scores directly [Bilenko & Mooney ’03]
May work for limited vocabulary sizes

Learning to combine multiple similarity measures [Yih & Meek ’07]
Features of each pair: similarity scores from different measures
Complementary to TWEAK

Page 20: Learning Term-weighting Functions for Similarity Measures

Future Work – Other Applications

Near-duplicate detection
Existing methods (e.g., shingles, I-Match):
Create hash codes of n-grams in a document as fingerprints
Detect duplicates when identical fingerprints are found
Learn which fingerprints are important

Paraphrase recognition
Vector-based similarity for surface matching
Deep NLP analysis may be needed and encoded as features for sentence pairs

Page 21: Learning Term-weighting Functions for Similarity Measures

Future Work – Model Improvement

Learn additional weights on terms
Create an indicator feature for each term
Create a two-layer neural network, where each term is a node; learn the weight of each term as well

A joint model for term-weighting learning and similarity function (e.g., kernel) learning
The final similarity function combines multiple similarity functions and incorporates pair-level features
The vector construction and term-weighting scores are trained using TWEAK

Page 22: Learning Term-weighting Functions for Similarity Measures

Conclusions

TWEAK: a term-weighting learning framework for improving vector-based similarity measures
Given labels of text pairs, it learns the term-weighting function
A principled way to incorporate more information and adapt to target applications
Can replace existing TFIDF methods directly
Flexible in using various loss functions
Potential for more applications and model enhancements