Similarity Measurement Preliminary Results

Similarity Measurement: Folksonomy vs. LSA

Preliminary Results

The Tripartite Structure of Tagging

• A folksonomy is a set of triples <user, tag, object>.

• Formally, a folksonomy is a tuple F := (U, T, R, Y), where U, T, and R are finite sets whose elements are called users, tags, and resources, and Y ⊆ U × T × R is a ternary relation among users, tags, and resources.
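The tuple definition can be made concrete with a small sketch. The data below is hypothetical toy data, not taken from the crawl, and U, T, and R are recovered here as projections of Y, which assumes every user, tag, and resource appears in at least one assignment.

```python
from collections import defaultdict

# Hypothetical toy data: each element of Y is one (user, tag, resource) assignment.
Y = {
    ("alice", "python", "docs.python.org"),
    ("alice", "programming", "docs.python.org"),
    ("bob", "python", "docs.python.org"),
    ("bob", "linguistics", "example.org/nlp"),
}

# U, T, and R are recovered as the projections of Y onto each position.
U = {u for u, _, _ in Y}
T = {t for _, t, _ in Y}
R = {r for _, _, r in Y}

# A "post" groups the tags that one user assigned to one resource.
posts = defaultdict(set)
for u, t, r in Y:
    posts[(u, r)].add(t)
```

Grouping assignments into posts is the form that tag co-occurrence counting works from, since two tags co-occur when the same user applies both to the same resource.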

Del.icio.us Tag Distribution

[Figures: tag distribution; log-log tag distribution]

After crawling the delicious.com site, the total number of tags (tokens) obtained was 7,528,528, of which 188,964 were distinct types. All tags were stemmed with the Porter stemmer, which reduced the number of distinct tags to 174,887.
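As a rough illustration of the token and type counts and of the log-log view of the distribution, the sketch below counts a hypothetical list of tag tokens. The tag list is invented for the example, and stemming is left out.

```python
import math
from collections import Counter

# Hypothetical tag tokens, one entry per tag assignment.
tokens = ["python", "web", "python", "design", "web", "python", "nlp"]

freq = Counter(tokens)
n_tokens = sum(freq.values())   # total tag assignments (tokens)
n_types = len(freq)             # distinct tags (types)

# Rank-frequency pairs in log-log space, most frequent tag first.
ranked = sorted(freq.values(), reverse=True)
loglog = [(math.log10(rank), math.log10(count))
          for rank, count in enumerate(ranked, start=1)]
```

Plotting the `loglog` pairs gives the kind of rank-frequency curve shown in a log-log tag distribution.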

LSA Processing Workflow in R

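library(lsa)
tm = textmatrix('dir/')                  # term-by-document matrix from the files in dir/
tm = lw_logtf(tm) * gw_idf(tm)           # log term-frequency local weight times IDF global weight
space = lsa(tm, dims=dimcalc_share())    # truncated SVD of the weighted matrix
as.textmatrix(space)                     # reduced-rank term-by-document matrix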
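The weighting step can be sketched outside R as follows. This is an illustration only: the formulas log(tf + 1) for the local weight and log2(N / df) + 1 for the global weight are assumptions about what `lw_logtf` and `gw_idf` compute, the function name `logtf_idf` is hypothetical, and the SVD performed by `lsa()` is not reproduced.

```python
import math

def logtf_idf(counts):
    """Weight a term-by-document count matrix (rows = terms, columns = documents).

    Assumes every term occurs in at least one document.
    """
    n_docs = len(counts[0])
    weighted = []
    for row in counts:
        doc_freq = sum(1 for c in row if c > 0)   # documents containing the term
        idf = math.log2(n_docs / doc_freq) + 1    # assumed form of the global weight
        # assumed form of the local weight: log(tf + 1)
        weighted.append([math.log(c + 1) * idf for c in row])
    return weighted

counts = [
    [2, 0, 1],   # term appearing in two of three documents
    [1, 1, 1],   # term appearing in every document
]
weighted = logtf_idf(counts)
```

Under these assumptions a term that occurs in every document gets a global weight of exactly 1, so only rarer terms are boosted.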

LSA Corpus Preparation

• A total of 17,085 web pages were crawled and then parsed to remove all HTML markup.

• Stemming and stop-word removal were applied.

• The processed corpus contains 14,993,620 tokens and 259,464 word types.

• Only words with a frequency of more than 100 were entered into the word-by-document matrix; 1047 words met this threshold. The documents kept in the corpus range in length from 51 to 4009 words.

• The resulting term-document matrix has 3465 columns (documents) and 1082 rows (words).
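The frequency filter and matrix construction described above might look like the following sketch. The function name `term_document_matrix` and the toy documents are hypothetical, and the documents are assumed to be already tokenized, stemmed, and stripped of stop words.

```python
from collections import Counter

def term_document_matrix(docs, min_freq):
    """Return (vocab, matrix), keeping only words whose corpus frequency exceeds min_freq."""
    total = Counter(word for doc in docs for word in doc)
    vocab = sorted(w for w, c in total.items() if c > min_freq)
    per_doc = [Counter(doc) for doc in docs]
    # rows = words, columns = documents
    matrix = [[counts[w] for counts in per_doc] for w in vocab]
    return vocab, matrix

# Hypothetical pre-processed documents; the threshold is 2 here rather than 100.
docs = [["tag", "web", "tag"], ["web", "tag", "blog"], ["tag", "web"]]
vocab, matrix = term_document_matrix(docs, min_freq=2)
```

With this toy input, "blog" falls below the threshold and is dropped, leaving one row per surviving word and one column per document.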

LSA document length distribution

The documents kept in the corpus range in length from 51 to 4009 words.

Three similarity measurements

• Tag co-occurrence counts

• Tag vector cosine similarity

• LSA

Similarity Measurement

• Tag co-occurrence counts:

1) Simple count: how many times two tags are used by the same user to annotate the same resource.

2) Normalized count (Jaccard index): the co-occurrence count of tag A and tag B divided by the joint frequency of A and B, that is, the number of annotations carrying either tag.
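Both counts can be sketched from a list of (user, tag, resource) triples. The function names and the toy triples are hypothetical, and the joint frequency is taken here to be the number of posts carrying either tag, the usual Jaccard denominator.

```python
from collections import defaultdict
from itertools import combinations

def cooccurrence(triples):
    """Count, per tag pair, the (user, resource) posts in which both tags appear."""
    posts = defaultdict(set)
    for user, tag, resource in triples:
        posts[(user, resource)].add(tag)
    tag_freq = defaultdict(int)    # number of posts carrying each tag
    pair_count = defaultdict(int)  # number of posts carrying both tags of a pair
    for tags in posts.values():
        for tag in tags:
            tag_freq[tag] += 1
        for a, b in combinations(sorted(tags), 2):
            pair_count[(a, b)] += 1
    return tag_freq, pair_count

def jaccard(a, b, tag_freq, pair_count):
    """Posts carrying both tags divided by posts carrying either tag."""
    both = pair_count.get(tuple(sorted((a, b))), 0)
    either = tag_freq.get(a, 0) + tag_freq.get(b, 0) - both
    return both / either if either else 0.0

# Hypothetical toy triples.
triples = [
    ("u1", "nlp", "r1"), ("u1", "linguistics", "r1"),
    ("u2", "nlp", "r1"), ("u2", "linguistics", "r1"),
    ("u3", "nlp", "r2"),
]
tag_freq, pair_count = cooccurrence(triples)
```

Pairs are stored with their tags in sorted order, so `jaccard("nlp", "linguistics", ...)` and `jaccard("linguistics", "nlp", ...)` look up the same entry.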

Distribution of Tag co-occurrence Counts (simple counts)

Distribution of Tag co-occurrence Counts (normalized)

Measurement 2: Cosine Similarity

• Based on the co-occurrence vector of each tag with every other tag.

• Since there are normalized and unnormalized tag co-occurrence counts, we ended up with two cosine similarity measures, one for each.

• cos(X, Y) = (X · Y) / (‖X‖ ‖Y‖), where X and Y are the co-occurrence vectors of two distinct tags.
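A minimal sketch of the cosine computation on two co-occurrence vectors follows. The vectors are hypothetical, and returning 0.0 for an all-zero vector is a convention chosen for this example.

```python
import math

def cosine(x, y):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return dot / (norm_x * norm_y)

# Hypothetical co-occurrence vectors over the same four context tags.
x = [2, 0, 1, 0]
y = [4, 0, 2, 0]   # same direction as x, twice the counts
z = [0, 3, 0, 0]   # shares no context tag with x
```

Because cosine ignores vector length, `x` and `y` are maximally similar even though `y` has twice the counts, while `x` and `z` share no context and score zero.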

Distribution of Tag Cosine Similarity

Distribution of Tag Cosine Similarity Based on Normalized Tag Co-occurrence Counts

Results

• Pairwise Pearson and Spearman correlations among the 5 measurements [tag co-occurrence count, tag cosine similarity, LSA, normalized tag co-occurrence count, tag cosine similarity based on the normalized tag-tag co-occurrence matrix]

Correlation (p = Pearson, s = Spearman)

Columns: Tag Cooccur | TagCosine | LSA | TagCooccurNorm | TagCosineNorm

Tag Cooccur: s: 0.299478, p: 0.17257 | s: 0.078392, p: 0.0861644

TagCosine: s: 0.098562, p: 0.114581 | s: 0.1042865, p: 0.307023

LSA: s: 0.0770016, p: 0.1930085 | s: 0.07058, p: 0.152654

TagCooccurNorm: s: 0.155882, p: 0.648709

TagCosineNorm:
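For reference, the two correlation coefficients can be sketched in plain Python. The input lists are hypothetical scores rather than the study's data, and both functions assume non-constant input, since a constant list would give a zero denominator.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(ranks(x), ranks(y))

# Hypothetical scores from two measures over the same five tag pairs.
a = [1, 2, 3, 4, 5]
b = [1, 4, 9, 16, 25]
```

Here `b` grows monotonically but not linearly with `a`, so the Spearman coefficient is 1 while the Pearson coefficient is slightly lower, which is one reason the two coefficients can differ for the same pair of measures.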

Qualitative Insight – ‘linguistics’

Top 10 words related to “linguistics” according to the 5 measurements

Tag Cooccur | LSA | Tag Cosine | Norm TagCooccur | Norm TagCosine

Languag | Teach | English | Languag | Languag

Refer | Languag | Alphabet | Nlp | Nlp

English | Teacher | Learn | Cultur | English

Nlp | bibliographi | Natur | English | Research

Cultur | Select | Chines | Word | Write

Grammar | Tesol | Russian | Grammar | Word

Write | Center | Pronunci | Research | Refer

Research | profession | Interest | Histori | Grammar

Dictionari | statement | Word | Dictionari | Dictionari

blog | standard | identifi | scienc | tool
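A top-k list like the ones above can be produced from any pairwise similarity table. The sketch below is illustrative only: the function name `top_related` is hypothetical and the scores are made-up values, not results from the study.

```python
def top_related(target, similarity, k=10):
    """Return the k tags most similar to target, highest score first."""
    scored = []
    for (a, b), score in similarity.items():
        if a == target:
            scored.append((score, b))
        elif b == target:
            scored.append((score, a))
    scored.sort(key=lambda item: (-item[0], item[1]))  # descending score, ties alphabetical
    return [tag for _, tag in scored[:k]]

# Hypothetical similarity scores keyed by unordered tag pairs.
similarity = {
    ("linguistics", "languag"): 0.9,
    ("nlp", "linguistics"): 0.7,
    ("linguistics", "blog"): 0.1,
    ("nlp", "blog"): 0.4,
}
```

Calling `top_related("linguistics", similarity, k=2)` on this toy table picks the two highest-scoring partners of "linguistics" regardless of which position the tag occupies in the pair.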

Correlation between two measurements: normalized tag co-occurrence counts vs. normalized tag cosine similarity, P = 0.7547
