Information excellence 2012feb_komli_srinivasan s h_making data repitions work
Confidential
Making Data Repetitions Work for You
Srinivasan H Sengamedu, Komli Labs
Information Excellence Summit, February 25, 2012, Bangalore
http://Informationexcellence.wordpress.com
Srinivasan H Sengamedu
Bio: Srinivasan H Sengamedu (SHS) is the Head of Komli Labs, where he works on real-time bidding, user modeling, and other areas related to computational advertising.
Earlier, he was Director of Audience and Search Sciences at Yahoo Labs, Bangalore, where he worked on information extraction, machine-learned ranking, pornography detection in images, and comment spam detection, among other problems. Most of these technologies power various Yahoo products.
He received his PhD from the Indian Institute of Science, Bangalore, and has held visiting positions at UCSD and NUS.
He has published over 100 papers and has more than 30 approved or filed patent applications. He's generally excited about creating and productizing advanced technologies.
Not all information is new!
Web pages about the same product, business, etc.
Near-duplicate images
Similar comments, tweets, etc.
Leveraging redundant information
Classical use
– Compression
  • Lossless compression (LZW)
  • Perceptually lossless compression (JPEG, MP3)
– Co-occurrence
  • Pointwise Mutual Information
Redundancy ≈ Confidence
Leveraging redundancy requires care.
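As a rough illustration of the co-occurrence use of redundancy: pointwise mutual information compares how often two items occur together with how often they would co-occur if they were independent, PMI(x, y) = log2( p(x, y) / (p(x) p(y)) ). The sketch below estimates it from document-level co-occurrence counts. The function name `pmi_scores` and the toy documents are assumptions made for illustration, not material from the talk.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_scores(documents):
    """Return PMI (base 2) for every unordered word pair that co-occurs
    in at least one document. Probabilities are document frequencies."""
    n_docs = len(documents)
    word_counts = Counter()
    pair_counts = Counter()
    for doc in documents:
        words = set(doc)  # count each word once per document
        word_counts.update(words)
        pair_counts.update(combinations(sorted(words), 2))
    scores = {}
    for (x, y), n_xy in pair_counts.items():
        p_xy = n_xy / n_docs
        p_x = word_counts[x] / n_docs
        p_y = word_counts[y] / n_docs
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores


# Toy corpus (hypothetical data).
docs = [
    ["new", "york", "pizza"],
    ["new", "york", "hotel"],
    ["new", "phone"],
    ["pizza", "oven"],
]
scores = pmi_scores(docs)
# Pairs are keyed in sorted order, e.g. ("new", "york").
```

A positive score means the pair co-occurs more often than independence would predict; a score near zero means the repetition carries little evidence.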
Any other uses of redundancy?
Commercial Spam Detection
– Min-closed sequences
Information Extraction
– Strong Similarity
Near-duplicate images
– Image Signatures
Face recognition
– Consistency Learning
Two Spam Comments
Happy to see she is progressing well. Happy 2011 to everyone....My friend Vanessa, a 25 yrs lady, has announced her wedding with a millionaire young man Ronald who is the CEO of a MNC. It's amazing, she said she just posted her profile on a millionaire d'ating s'ite called ----------Celeb Mingle.C○M-------- - and received his chat invitations a few days later. Then, everything went so well that I can't believe it's true!Every love story will unfold on it's own. Also happy to see that most Americans reject the blame-the-conservatives crap that some (not all) liberals from all social strata were trying to promote for political gain.
Texas and Israel forever. Happy 2011 everyone....This has got to be a better year!...My friend Vanessa, a 25 yrs lady, has announced her wedding with a millionaire young man Ronald who is the CEO of a MNC. It's amazing, she said she just posted her profile on a millionaire d'ating site called ----------Rich'Friends.Org----- - and received his chat invitations a few days later. I can't believe it's true! Every love story will unfold on it's own. you can start your own wealthy love story for real at there too !many famous and wealthy people had a profile there ,why not me ? Taking out the world's trash. Oooraaaah!-----
Sequence-based Spam Detection
Motivation: Commercial spammers repeat variations of the spam content and embed it in good content. Such comments usually avoid detection by spam filters.
Technical Challenge: Mine frequent subsequences efficiently.
– The general problem is NP-hard.
– The algorithms in the literature do not scale to web-scale data.
– The spam patterns change every few hours.
Basic Ideas
– A new sequence mining algorithm that scales to internet-scale data and is faster than those in the literature, even on other public data sets such as Gazelle.
– A new framework for spam detection using frequent subsequences.
– Experimental studies to measure the efficacy of the subsequence mining approach in detecting spam. We also study the life cycle of a typical spam pattern and use it to tune our mining parameters.
Results: Experiments on news comment data show
– Coverage > 70%
– Editorial savings of a factor of ~30
mcPrism
The main ingredients in the algorithm:
– A modified DFS on the lexicographically ordered sequence tree. The tree is pruned whenever we encounter a prefix-l-closed node.
– The set of prefix-l-closed nodes is pruned by an inclusion check.
– Prime Block Encodings for fast computation of joins. We enhance the encoding scheme to handle gap and closure constraints.
– On-the-fly closure checking. We use the bidirectional closure checking and the backscan pruning schemes in BIDE. This is done using an enhancement of the block encoding scheme. This enhancement also solves an open problem: how to use block encodings to speed up closed sequence mining.
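To make the target of the mining concrete, the brute-force sketch below enumerates the frequent subsequences of a toy data set and keeps only the closed ones, that is, those with no longer super-sequence of equal support. It is not mcPrism: it uses none of the pruning, prime block encodings, or BIDE-style closure checks listed above, it ignores gap constraints, and it is exponential in the sequence length. All names and the toy data are assumptions for illustration.

```python
from itertools import combinations


def subsequences(seq, max_len):
    """All distinct subsequences (order kept, gaps allowed) of length 1..max_len."""
    found = set()
    for k in range(1, min(max_len, len(seq)) + 1):
        for idx in combinations(range(len(seq)), k):
            found.add(tuple(seq[i] for i in idx))
    return found


def is_subsequence(pattern, seq):
    """True if pattern occurs in seq in order (not necessarily contiguously)."""
    it = iter(seq)
    return all(tok in it for tok in pattern)


def closed_frequent_subsequences(sequences, min_support, max_len=5):
    """Brute force: count support per subsequence, then drop every pattern
    that has a longer frequent super-sequence with the same support.
    Closedness is only exact if max_len covers the longest input sequence."""
    support = {}
    for seq in sequences:
        for sub in subsequences(seq, max_len):  # each sequence counts once
            support[sub] = support.get(sub, 0) + 1
    frequent = {p: s for p, s in support.items() if s >= min_support}
    closed = {}
    for p, s in frequent.items():
        absorbed = any(
            len(q) > len(p) and t == s and is_subsequence(p, q)
            for q, t in frequent.items()
        )
        if not absorbed:
            closed[p] = s
    return closed


# Toy token sequences (hypothetical data).
toy = [
    ["friend", "announced", "wedding", "millionaire"],
    ["happy", "friend", "announced", "wedding"],
    ["friend", "wedding", "photos"],
]
closed = closed_frequent_subsequences(toy, min_support=2)
```

On this toy input the seven frequent patterns collapse to two closed ones, which is why closed patterns are a compact summary of the repeated content.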
Commercial Spam Detection – Results
Subsequence: happy 2011 friend yrs lady announced wedding it' amazing posted received chat invitations days believe it' true love story unfold it' own
Match 1: Happy 2011 to everyone....My friend Vanessa, a 25 yrs lady, has announced her wedding with a millionaire young man Ronald who is the CEO of a MNC. It's amazing, she said she just posted her profile on a millionaire d'ating s'ite called ----------Celeb Mingle.C○M--------- and received his chat invitations a few days later. Then, everything went so well that I can't believe it's true! Every love story will unfold on it's own...=====Happy to see she is progressing well. Also happy to see that most Americans reject the blame-the-conservatives crap that some (not all) liberals from all social strata were trying to promote for political gain.
Match 2: Happy 2011 everyone....This has got to be a better year!...My friend Vanessa, a 25 yrs lady, has announced her wedding with a millionaire young man Ronald who is the CEO of a MNC. It's amazing, she said she just posted her profile on a millionaire d'ating site called ----------Rich'Friends.Org----- - and received his chat invitations a few days later. Then, everything went so well that I can't believe it's true! Every love story will unfold on it's own. you can start your own wealthy love story for real at there too !many famous and wealthy people had a profile there ,why not me ?Texas and Israel forever. Taking out the world's trash. Oooraaaah!-----
Total Matches: 35; Only 15 marked spam by existing classifiers/editors
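One way to apply a mined subsequence is to test whether its tokens occur, in order, in a new comment, optionally limiting how many tokens may be skipped between consecutive pattern items. The sketch below is a simplified matcher under that assumption. The tokenizer, the function names, and the `max_gap` parameter are hypothetical and are not taken from the talk.

```python
import re


def tokenize(text):
    """Lowercase and split into word-like tokens (a simplifying assumption)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def matches_pattern(pattern, tokens, max_gap=None):
    """True if pattern occurs in tokens in order, with at most max_gap
    skipped tokens between consecutive pattern items (None = unlimited).
    Tracks every position where the matched prefix can end, so a later
    occurrence is still found when the earliest one violates the gap."""
    if not pattern:
        return True
    positions = [i for i, tok in enumerate(tokens) if tok == pattern[0]]
    for item in pattern[1:]:
        reachable = []
        for j, tok in enumerate(tokens):
            if tok != item:
                continue
            if any(p < j and (max_gap is None or j - p - 1 <= max_gap)
                   for p in positions):
                reachable.append(j)
        positions = reachable
        if not positions:
            return False
    return bool(positions)


pattern = ["friend", "yrs", "lady", "announced", "wedding"]
spam = tokenize("My friend Vanessa, a 25 yrs lady, has announced her "
                "wedding with a millionaire young man")
ham = tokenize("Happy to see she is progressing well.")
```

Here `matches_pattern(pattern, spam, max_gap=3)` holds because the largest skip ("Vanessa, a 25") is three tokens, while a tighter `max_gap=2` rejects the same comment.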
Content Matching Approach
Key idea: Leverage redundant content across template-based sites for automatic information extraction.
Seed Database:
Name | Address
Chinese Mirrch | 120 Lexington Ave, New York, NY 10016
Tiffin Wallah | 127 E 28th St New York, NY 10079
[Figure: Web page]
Baseline Similarity Measure
Use q-grams to handle spelling errors
Weak Similarity = Cosine-similarity between IDF-weighted q-grams.
String | 3-grams
chinese mirch | { chi, hin, ine, nes, ese, se#, e#m, #mi, mir, irc, rch }
chinese mirrch | { chi, hin, ine, nes, ese, se#, e#m, #mi, mir, irr, rrc, rch }
• Weight of a q-gram (attribute specific) = Sum of the IDFs of the words it appears in.
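The weak similarity above can be sketched as follows: split each string into 3-grams with `#` standing for a space, weight each 3-gram by the summed IDF of the words it overlaps, and take the cosine of the two weight vectors. The IDF values, the default IDF, the treatment of 3-grams that span a word boundary (they receive the IDFs of both words), and all function names are assumptions for illustration.

```python
import math


def qgrams(text, q=3):
    """Character q-grams of the lowercased text, with '#' marking spaces."""
    s = text.lower().replace(" ", "#")
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def qgram_weights(text, idf, q=3, default_idf=1.0):
    """Map each q-gram to the sum of the IDFs of the words it overlaps."""
    words = text.lower().split(" ")
    owner = []  # word index for every character, None for a space
    for w_idx, word in enumerate(words):
        owner.extend([w_idx] * len(word))
        owner.append(None)
    owner.pop()  # no separator after the last word
    weights = {}
    for i, gram in enumerate(qgrams(text, q)):
        touched = {o for o in owner[i:i + q] if o is not None}
        w = sum(idf.get(words[o], default_idf) for o in touched)
        weights[gram] = weights.get(gram, 0.0) + w
    return weights


def cosine(u, v):
    dot = sum(w * v[g] for g, w in u.items() if g in v)
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0


def weak_similarity(a, b, idf, q=3):
    return cosine(qgram_weights(a, idf, q), qgram_weights(b, idf, q))


# Hypothetical IDF values: the common word is down-weighted.
idf = {"chinese": 1.0, "mirch": 3.0, "mirrch": 3.0}
ws = weak_similarity("chinese mirch", "chinese mirrch", idf)
```

The single-letter spelling difference only removes one 3-gram and adds two, so the score stays high even though the rare word does not match exactly.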
Strong Similarity
Address (Seed) | Address (Site) | WS
120 Lexington Avenue New York, NY 10016 | 120 Lexington Ave (between 28th and 29th St) New York, NY 10016 | 0.53
312 W 34th Street New York, NY 10001 | 312 W 34th St (between 8th and 9th Ave) New York, NY 10001 | 0.49
Strong similarity is defined between two sets of strings.
1. Calculate the matching pattern between weakly similar pairs in the two sets.
2. Pick matching patterns with sufficient “support”.
3. Use only the parts of a string selected by the matching pattern in the final similarity calculation.
Observations:
– Variations are systematic and site-dependent.
– They cannot be handled by term weighting.
Support & Strong Similarity
Address (Seed) | Address (Site) | Matching Pattern | Matching Segments
120 Lexington Avenue New York, NY 10016 | 120 Lexington Ave (between 28th and 29th St) New York, NY 10016 | 103 103 | 120 Lexington New York, NY 10016
312 W 34th Street New York, NY 10001 | 312 W 34th St (between 8th and 9th Ave) New York, NY 10001 | 103 103 | 312 W 34th New York, NY 10001
Matching Pattern: 103 103
Support(103 103) = |{“120 Lexington New York, NY 10016”, “312 W 34th New York, NY 10001”}| = 2 (100% support)
Address’ (Seed) | Address’ (Site) | SS
120 Lexington New York, NY 10016 | 120 Lexington New York, NY 10016 | 1
312 W 34th New York, NY 10001 | 312 W 34th New York, NY 10001 | 1
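The matching segments in the table above can be approximated by keeping only the tokens that the seed and site strings share, in order. The sketch below does this with token-level matching blocks from Python's `difflib`. That is a stand-in technique chosen for illustration, not the matching-pattern computation from the talk, and it does not reproduce the numeric pattern encoding (`103 103`). The function name is an assumption.

```python
from difflib import SequenceMatcher


def matching_segments(seed, site):
    """Return the tokens common to both strings, in order, joined by spaces.
    Common tokens are found as matching blocks over whitespace tokens."""
    a, b = seed.split(), site.split()
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    kept = []
    for block in matcher.get_matching_blocks():
        kept.extend(a[block.a:block.a + block.size])
    return " ".join(kept)


seg1 = matching_segments(
    "120 Lexington Avenue New York, NY 10016",
    "120 Lexington Ave (between 28th and 29th St) New York, NY 10016",
)
seg2 = matching_segments(
    "312 W 34th Street New York, NY 10001",
    "312 W 34th St (between 8th and 9th Ave) New York, NY 10001",
)
```

Once both sides of a pair are reduced to the same segments, any string similarity computed on the reduced strings is 1, which is the SS column shown above.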
Need for Support of a Matching Pattern
Address (Seed) | Address (Site) | Matching Pattern | Matching Segments
120 Lexington Avenue New York, NY 10016 | 1075 Fifth Ave New York, NY 10128 | 010 010 | New York, NY
312 W 34th Street New York, NY 10001 | 1167 Madison Ave New York, NY 10128 | 010 010 | New York, NY
Matching Pattern: 010 010
Support(010 010) = |{“New York, NY”}| = 1 (50% support)
Hence Strong Similarity = Weak Similarity
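The two support figures quoted in these slides follow from counting the distinct matching segments observed for a pattern. The sketch below reproduces that count. Expressing the percentage as distinct segments divided by the number of pairs showing the pattern is an assumption that agrees with both examples, and the function name and input layout are hypothetical.

```python
from collections import defaultdict


def pattern_support(matches):
    """matches: list of (pattern, matching_segment), one per weakly similar pair.
    Returns {pattern: (distinct_segment_count, fraction_of_pairs)}."""
    segments = defaultdict(set)
    pair_count = defaultdict(int)
    for pattern, segment in matches:
        segments[pattern].add(segment)
        pair_count[pattern] += 1
    return {
        p: (len(segs), len(segs) / pair_count[p])
        for p, segs in segments.items()
    }


# Pairs whose segments differ from one record to the next: informative pattern.
informative = pattern_support([
    ("103 103", "120 Lexington New York, NY 10016"),
    ("103 103", "312 W 34th New York, NY 10001"),
])

# Pairs that only share boilerplate text: the same segment every time.
boilerplate = pattern_support([
    ("010 010", "New York, NY"),
    ("010 010", "New York, NY"),
])
```

A pattern whose segments are all identical only captures text shared by every record on the site, so it gives no evidence that two specific records match; in that case the score falls back to weak similarity.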
Strong Similarity Scores
SS boosts the similarity scores of true positives (TPs) over a wide range of WS scores without boosting those of false positives (FPs).
SS is not always 1, even for true positives.
SS scores are very high for most true positives.
String 1 | String 2 | WS | SS
980 n michigan ave 14th floor chicago il | 980 n michigan ave chicago il 60611 | 0.57 | 1
1100 e north ave west chicago il 60185 | 300 w north ave west chicago il 60185 | 0.74 | 0.74
Near-duplicate Images: Approach
Feature
– DCT/FMT transform
– Choose low-frequency coefficients
Signature
– Median-based quantization
– Signature size depends on number of coefficients
Performance
– Large signature: near-duplicate detection
– Small signature: image similarity
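A minimal sketch of such a signature, assuming a small square grayscale image given as a list of lists: take a 2-D DCT, keep the k×k lowest-frequency coefficients, and emit one bit per coefficient depending on whether it exceeds the median. The DCT variant (unnormalized DCT-II), the choice of k, the toy image, and all function names are assumptions. The FMT option and any preprocessing used in the actual system are not covered.

```python
import math
from statistics import median


def dct_2d(block):
    """Unnormalized 2-D DCT-II of a square block given as a list of lists."""
    n = len(block)
    cos = [[math.cos(math.pi * (2 * x + 1) * u / (2 * n)) for x in range(n)]
           for u in range(n)]
    return [[sum(block[x][y] * cos[u][x] * cos[v][y]
                 for x in range(n) for y in range(n))
             for v in range(n)]
            for u in range(n)]


def signature(image, k=4):
    """Bit signature: 1 where a low-frequency coefficient exceeds the median
    of the k*k retained coefficients, else 0."""
    coeffs = dct_2d(image)
    low = [coeffs[u][v] for u in range(k) for v in range(k)]
    m = median(low)
    return [1 if c > m else 0 for c in low]


def hamming(a, b):
    """Number of positions where two equal-length signatures differ."""
    return sum(x != y for x, y in zip(a, b))


# Toy 8x8 image and a uniformly brightened copy (hypothetical data).
image = [[(3 * x * x + 7 * y + 5 * x * y) % 64 for y in range(8)]
         for x in range(8)]
brighter = [[p + 10 for p in row] for row in image]

sig = signature(image)
sig_brighter = signature(brighter)
distance = hamming(sig, sig_brighter)
```

A uniform brightness change mostly affects the lowest-frequency (DC) term, so the two signatures should stay very close. A larger k gives a longer, more selective signature, in line with the size trade-off on the slide.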
Face Recognition
Face recognition was an important open problem in computer vision.
The availability of text and image/video data has provided new directions in web-scale face recognition.
If an image occurs in a news article, the named entities in the article can be associated with the faces in the image. This provides weak labels.
With large amounts of data, such weak signals can be boosted.
Conclusions
There is an information explosion, but the information has lots of near-duplicates.
Spotting near-duplicates has lots of advantages but is a challenge.
Large datasets present an equally large opportunity (“Unreasonable effectiveness of data …”).
References
Ravi Kant, Srinivasan H. Sengamedu, Krishnan S. Kumar: Comment spam detection by sequence mining. WSDM 2012.
Pankaj Gulhane, Rajeev Rastogi, Srinivasan H. Sengamedu, Ashwin Tengli: Exploiting Content Redundancy for Web Information Extraction. PVLDB, 2010.
Srinivasan H. Sengamedu, Neela Sawant: Finding near-duplicate images on the web using fingerprints. ACM Multimedia 2008.
Ming Zhao, Jay Yagnik, Hartwig Adam, David Bau: Large Scale Learning and Recognition of Faces in Web Videos. FG 2008.