Large-Scale Cost-sensitive Online Social Network Profile Linkage
![Page 1: Large -Scale Cost-sensitive Online Social Network Profile Linkage](https://reader036.fdocuments.us/reader036/viewer/2022062323/5681677a550346895ddc7954/html5/thumbnails/1.jpg)
Large-Scale Cost-sensitive Online Social Network Profile Linkage
![Page 2: Large -Scale Cost-sensitive Online Social Network Profile Linkage](https://reader036.fdocuments.us/reader036/viewer/2022062323/5681677a550346895ddc7954/html5/thumbnails/2.jpg)
Background & Motivation
Footprints in different social networks.
User identification in social analysis.
Privacy & security
Commercial & government applications
![Page 3: Large -Scale Cost-sensitive Online Social Network Profile Linkage](https://reader036.fdocuments.us/reader036/viewer/2022062323/5681677a550346895ddc7954/html5/thumbnails/3.jpg)
Outline
Problem definition
Related work
Approach
Experiment
Conclusion & future work
![Page 4: Large -Scale Cost-sensitive Online Social Network Profile Linkage](https://reader036.fdocuments.us/reader036/viewer/2022062323/5681677a550346895ddc7954/html5/thumbnails/4.jpg)
Problem Definition
Terminology
Identity: a person.
Profile/User: your footprint on social media.
Profile Linkage: linking your footprints together.
Input & Output
Input: profiles of one site as QUERY and profiles of the other site as TARGET.
Output: all pairs of profiles classified as matched.
![Page 5: Large -Scale Cost-sensitive Online Social Network Profile Linkage](https://reader036.fdocuments.us/reader036/viewer/2022062323/5681677a550346895ddc7954/html5/thumbnails/5.jpg)
Characteristics of profile
Name (semi-structured vs. structured)
{“given name”: “haochen”, “family name”: “zhang”}
name: zhang haochen
Semi-structured schema
Incompleteness & missing attributes
Privacy policy
Virtual identification
Free text description
Bio, About me, Tags
Multilingualism
![Page 6: Large -Scale Cost-sensitive Online Social Network Profile Linkage](https://reader036.fdocuments.us/reader036/viewer/2022062323/5681677a550346895ddc7954/html5/thumbnails/6.jpg)
Multilingualism
Top 5 languages in the Facebook dataset
English, Portuguese, Spanish, Chinese, French
Most frequent tokens in different languages
chris, john, michael
chen, wang, lee
carlos, garcia, daniel
sergey, olga, alexander
About 70% of users use English.
7.2% of users register with different locales.
Transliteration
昊辰 => Haochen
![Page 7: Large -Scale Cost-sensitive Online Social Network Profile Linkage](https://reader036.fdocuments.us/reader036/viewer/2022062323/5681677a550346895ddc7954/html5/thumbnails/7.jpg)
Feature Acquisition
Network communication costs too much time.
Usage limits of the web service.
1000 invocations per day for the Google Maps API
High computational complexity compared with string similarity.
Image processing algorithm.
![Page 8: Large -Scale Cost-sensitive Online Social Network Profile Linkage](https://reader036.fdocuments.us/reader036/viewer/2022062323/5681677a550346895ddc7954/html5/thumbnails/8.jpg)
Overview of approach
Classification of Potential Links: feature representation, supervised learning, cost-sensitive feature acquisition
Pruning with Canopy: parameter tuning, canopy construction
Entity-based Representation of Profiles: mapping, tokenization, entity extraction
![Page 9: Large -Scale Cost-sensitive Online Social Network Profile Linkage](https://reader036.fdocuments.us/reader036/viewer/2022062323/5681677a550346895ddc7954/html5/thumbnails/9.jpg)
Canopy: design
![Page 10: Large -Scale Cost-sensitive Online Social Network Profile Linkage](https://reader036.fdocuments.us/reader036/viewer/2022062323/5681677a550346895ddc7954/html5/thumbnails/10.jpg)
Canopy: efficiency
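The canopy slides survive only as titles here, so the following is a minimal sketch of generic canopy pruning rather than the deck's actual design. It assumes profiles are already reduced to token sets, uses Jaccard overlap as the cheap similarity, and picks illustrative `loose` and `tight` thresholds; all function names and values are hypothetical.

```python
def jaccard(a, b):
    """Cheap set-overlap similarity between two token sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def build_canopies(profiles, loose=0.2, tight=0.6):
    """Group profiles into possibly overlapping canopies.

    profiles: dict mapping profile id -> set of tokens.
    loose, tight: illustrative thresholds with tight >= loose.
    """
    remaining = set(profiles)
    canopies = []
    while remaining:
        center = min(remaining)  # deterministic pick; a random pick is also common
        canopy = {center}
        removed = {center}
        for pid in remaining:
            if pid == center:
                continue
            sim = jaccard(profiles[center], profiles[pid])
            if sim >= loose:
                canopy.add(pid)   # close enough to share a canopy
            if sim >= tight:
                removed.add(pid)  # too close to seed a canopy of its own
        remaining -= removed
        canopies.append(canopy)
    return canopies


def candidate_pairs(canopies, query_ids, target_ids):
    """Only (query, target) pairs inside a common canopy go on to classification."""
    pairs = set()
    for canopy in canopies:
        for q in canopy & query_ids:
            for t in canopy & target_ids:
                pairs.add((q, t))
    return pairs


profiles = {
    "tw:1": {"haochen", "zhang"},
    "li:1": {"zhang", "haochen"},
    "tw:2": {"carlos", "garcia"},
    "li:2": {"garcia", "carlos", "daniel"},
}
canopies = build_canopies(profiles)
pairs = candidate_pairs(canopies, {"tw:1", "tw:2"}, {"li:1", "li:2"})
```

On this toy input only two of the four possible query/target pairs remain, which is the point of pruning: the expensive classifier never sees pairs with no token overlap.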
![Page 11: Large -Scale Cost-sensitive Online Social Network Profile Linkage](https://reader036.fdocuments.us/reader036/viewer/2022062323/5681677a550346895ddc7954/html5/thumbnails/11.jpg)
Local Features
Username
Jaro-Winkler similarity
Language
Jaccard similarity
Description, URL
Cosine similarity with TF×IDF
Popularity
Defined as the number of friends a user has.
Adopt the following metric
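The username, language, and description similarities named on this slide can be sketched as below; the popularity metric is not covered. This is a hedged illustration, not the authors' code: the Jaro-Winkler prefix weight `p=0.1`, the `log(n/df) + 1` IDF smoothing, and all function names are assumptions.

```python
import math
from collections import Counter


def jaro(s1, s2):
    """Jaro similarity between two strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1, matched2 = [False] * len1, [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1, s2, p=0.1):
    """Jaro similarity boosted by a common prefix of up to 4 characters."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return j + prefix * p * (1 - j)


def jaccard(a, b):
    """Jaccard similarity of two sets, e.g. the languages of two profiles."""
    return len(a & b) / len(a | b) if (a or b) else 0.0


def tfidf_vectors(docs):
    """docs: list of token lists -> list of {token: tf * idf} dicts."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    return [
        {tok: count * (math.log(n / df[tok]) + 1.0)
         for tok, count in Counter(doc).items()}
        for doc in docs
    ]


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * \
        math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


vecs = tfidf_vectors([["data", "mining", "nus"],
                      ["data", "mining", "nus"],
                      ["photo", "travel"]])
```

Each function returns a value in [0, 1], so the scores can be fed to the classifier as comparable features.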
![Page 12: Large -Scale Cost-sensitive Online Social Network Profile Linkage](https://reader036.fdocuments.us/reader036/viewer/2022062323/5681677a550346895ddc7954/html5/thumbnails/12.jpg)
External Features
Geographic Location
Values are diverse, with different types.
Google Maps API:
string-represented location => geographic information
Spherical distance between two locations as the feature
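Once both location strings have been geocoded to latitude and longitude, the spherical distance can be computed with the haversine formula. The sketch below assumes coordinates in decimal degrees and a mean Earth radius of 6371 km; the geocoding call itself is left out, and the function name is hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an assumed constant


def spherical_distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points (haversine)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    # min() guards against rounding pushing the argument just above 1.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


quarter = spherical_distance_km(0.0, 0.0, 0.0, 90.0)  # a quarter of the equator
```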
Avatar
χ² dissimilarity of the avatars’ gray-scale histograms.
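A χ² histogram comparison can be sketched as follows. The slide does not state which χ² variant or bin count is used, so this assumes the common symmetric form, 0.5 · Σ (a − b)² / (a + b), over normalized 16-bin histograms; decoding the image into gray levels is assumed to happen elsewhere.

```python
def gray_histogram(pixels, bins=16):
    """pixels: iterable of gray levels in 0..255 -> normalized histogram."""
    hist = [0.0] * bins
    count = 0
    for p in pixels:
        hist[min(p * bins // 256, bins - 1)] += 1.0
        count += 1
    return [h / count for h in hist] if count else hist


def chi2_dissimilarity(h1, h2):
    """Symmetric chi-square distance between two equal-length histograms."""
    total = 0.0
    for a, b in zip(h1, h2):
        if a + b > 0:  # empty bins in both histograms contribute nothing
            total += (a - b) ** 2 / (a + b)
    return 0.5 * total


dark = gray_histogram([0, 0, 0, 5])
bright = gray_histogram([255, 250])
```

With normalized histograms this variant ranges from 0 (identical) to 1 (no shared bins).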
![Page 13: Large -Scale Cost-sensitive Online Social Network Profile Linkage](https://reader036.fdocuments.us/reader036/viewer/2022062323/5681677a550346895ddc7954/html5/thumbnails/13.jpg)
Classification: learning
Probabilistic model derived from naïve Bayes
Independent feature assumption
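Under the independent feature assumption, evidence from each feature multiplies into the odds of a match. The slide's own formulas are not preserved in this transcript, so the sketch below shows only the generic naïve Bayes combination: a prior match probability updated by per-feature likelihood ratios P(feature | match) / P(feature | non-match). The function name and example numbers are hypothetical.

```python
import math


def posterior_match_probability(prior, likelihood_ratios):
    """Combine a prior with independent per-feature likelihood ratios.

    prior: probability in (0, 1) that the two profiles match.
    likelihood_ratios: P(f | match) / P(f | non-match) for each feature.
    """
    log_odds = math.log(prior / (1.0 - prior))
    for ratio in likelihood_ratios:
        log_odds += math.log(ratio)  # independence: log-ratios simply add
    return 1.0 / (1.0 + math.exp(-log_odds))


# A username three times likelier under "match" lifts a 0.5 prior to 0.75.
p = posterior_match_probability(0.5, [3.0])
```

Working in log-odds keeps the update additive, which makes it easy to add one feature at a time.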
![Page 14: Large -Scale Cost-sensitive Online Social Network Profile Linkage](https://reader036.fdocuments.us/reader036/viewer/2022062323/5681677a550346895ddc7954/html5/thumbnails/14.jpg)
Classification: learning
Iterative inference
Terminate if S_n is discriminative.
Set the threshold for each feature by choosing an error rate on the training set; the threshold determines whether S_n is discriminative.
Order of the features
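The iterative procedure can be sketched as a loop that acquires features in a fixed order and stops as soon as the running score S_n is decisive. This is a simplified illustration: it uses a single pair of fixed thresholds (`lower`, `upper`) instead of per-feature thresholds derived from training error rates, and the feature callbacks, costs, and likelihood ratios are hypothetical.

```python
import math


def iterative_inference(prior, features, lower=0.05, upper=0.95):
    """Acquire features in order until the score is discriminative.

    features: ordered list of (name, cost, acquire), where acquire() returns
    the likelihood ratio P(f | match) / P(f | non-match) for that feature.
    Returns (score, names of acquired features, total cost spent).
    """
    log_odds = math.log(prior / (1.0 - prior))
    score = prior
    used, spent = [], 0.0
    for name, cost, acquire in features:
        if score <= lower or score >= upper:
            break  # S_n is already discriminative: skip the remaining features
        log_odds += math.log(acquire())
        score = 1.0 / (1.0 + math.exp(-log_odds))
        used.append(name)
        spent += cost
    return score, used, spent


calls = []


def fake_location_ratio():
    """Stand-in for an expensive external lookup; records that it ran."""
    calls.append("location")
    return 4.0


# Strong username evidence: the external location lookup is never made.
score, used, spent = iterative_inference(
    0.5, [("username", 0.0, lambda: 30.0), ("location", 1.0, fake_location_ratio)]
)
```

Ordering cheap local features before costly external ones means many pairs are decided before any web-service call is spent on them.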
![Page 15: Large -Scale Cost-sensitive Online Social Network Profile Linkage](https://reader036.fdocuments.us/reader036/viewer/2022062323/5681677a550346895ddc7954/html5/thumbnails/15.jpg)
Classification: learning
Initial value
The initial value is estimated from the prior that two profiles sharing rarer tokens are more likely to be matched.
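The exact estimate used for the initial value is not preserved in this transcript. The sketch below only illustrates the underlying intuition, under the assumption that rarity is measured by inverse document frequency (IDF) over profile tokens: a shared rare token contributes more than a shared common one. Function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def build_idf(token_sets):
    """token_sets: list of token sets, one per profile -> {token: idf}."""
    n = len(token_sets)
    df = Counter(tok for tokens in token_sets for tok in tokens)
    return {tok: math.log(n / count) for tok, count in df.items()}


def shared_rarity(tokens_a, tokens_b, idf):
    """Sum of IDF weights over tokens both profiles share.

    Tokens missing from the IDF table contribute nothing in this sketch.
    """
    return sum(idf.get(tok, 0.0) for tok in tokens_a & tokens_b)


corpus = [{"li", "peng"}, {"li", "wei"}, {"li", "na"}, {"haochen", "zhang"}]
idf = build_idf(corpus)
common = shared_rarity({"li", "peng"}, {"li", "ming"}, idf)           # shares "li"
rare = shared_rarity({"haochen", "zhang"}, {"haochen", "chen"}, idf)  # shares "haochen"
```

Here `rare` exceeds `common`, matching the intuition that sharing "haochen" is stronger evidence than sharing "li".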
![Page 16: Large -Scale Cost-sensitive Online Social Network Profile Linkage](https://reader036.fdocuments.us/reader036/viewer/2022062323/5681677a550346895ddc7954/html5/thumbnails/16.jpg)
Dataset of experiment
Data source
152,294 Twitter users
154,379 LinkedIn users
Ground truth: 9,750 identities
4,779 identities with both accounts.
3,339 identities with only a Twitter account.
1,632 identities with only a LinkedIn account.
![Page 17: Large -Scale Cost-sensitive Online Social Network Profile Linkage](https://reader036.fdocuments.us/reader036/viewer/2022062323/5681677a550346895ddc7954/html5/thumbnails/17.jpg)
Experiment: Performance on overall linkage
I-Acc (Identity Accuracy) = correctly identified identities / all identities in ground truth
Outperforms the naïve learning method because it adopts the prior.
Performance differs across learning methods.
![Page 18: Large -Scale Cost-sensitive Online Social Network Profile Linkage](https://reader036.fdocuments.us/reader036/viewer/2022062323/5681677a550346895ddc7954/html5/thumbnails/18.jpg)
Experiment: Cost-sensitive feature acquisition
5% improvement in F1 from 148,743 external feature acquisitions.
Different orders of external features:
Rank by cost
Rank by distinguishability
Three sections divided by two inflection points.
![Page 19: Large -Scale Cost-sensitive Online Social Network Profile Linkage](https://reader036.fdocuments.us/reader036/viewer/2022062323/5681677a550346895ddc7954/html5/thumbnails/19.jpg)
Discussion: dataset construction
Dataset construction
Connections
Cannot correctly reflect the web-scale setting.
Name is too significant.
People search
Difficult to construct the ground truth.
Solution?
![Page 20: Large -Scale Cost-sensitive Online Social Network Profile Linkage](https://reader036.fdocuments.us/reader036/viewer/2022062323/5681677a550346895ddc7954/html5/thumbnails/20.jpg)
Discussion: people search task
Query LinkedIn by the Twitter user’s name.
On average, 10 results per query.
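| Method | Precision | Recall | F1 |
| --- | --- | --- | --- |
| Human | 0.643 | 0.900 | 0.750 |
| NB_Local | 0.369 | 0.441 | 0.402 |
| NB_All | 0.418 | 0.493 | 0.453 |
| C4.5_Local | 0.594 | 0.240 | 0.342 |
| C4.5_All | 0.609 | 0.380 | 0.468 |
| CSPL_Local | 0.543 | 0.658 | 0.595 |
| CSPL_All | 0.578 | 0.713 | 0.638 |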
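The F1 column is the harmonic mean of precision and recall. The short check below recomputes it from the reported precision and recall; because those inputs are already rounded to three decimals, agreement is only expected to within about 0.001.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (method, precision, recall, reported F1) taken from the results above.
rows = [
    ("Human", 0.643, 0.900, 0.750),
    ("NB_Local", 0.369, 0.441, 0.402),
    ("NB_All", 0.418, 0.493, 0.453),
    ("C4.5_Local", 0.594, 0.240, 0.342),
    ("C4.5_All", 0.609, 0.380, 0.468),
    ("CSPL_Local", 0.543, 0.658, 0.595),
    ("CSPL_All", 0.578, 0.713, 0.638),
]
max_gap = max(abs(f1_score(p, r) - reported) for _, p, r, reported in rows)
```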
![Page 21: Large -Scale Cost-sensitive Online Social Network Profile Linkage](https://reader036.fdocuments.us/reader036/viewer/2022062323/5681677a550346895ddc7954/html5/thumbnails/21.jpg)
Discussion: feature dependency
Compare features independently.
2 people at Tsinghua with the same name, Li Peng
2 people at NUS with the same name, Li Peng
Construct a different IDF table for names in each locale.
Not general
Not significantly effective
![Page 22: Large -Scale Cost-sensitive Online Social Network Profile Linkage](https://reader036.fdocuments.us/reader036/viewer/2022062323/5681677a550346895ddc7954/html5/thumbnails/22.jpg)
Conclusion
We proposed a supervised probabilistic approach that solves the identity linkage problem effectively.
The prior that users sharing rarer tokens are more likely to match improves the performance of the approach.
Iterative inference reduces unnecessary feature acquisitions.
![Page 23: Large -Scale Cost-sensitive Online Social Network Profile Linkage](https://reader036.fdocuments.us/reader036/viewer/2022062323/5681677a550346895ddc7954/html5/thumbnails/23.jpg)
Thank you