Quality, quantity, web and semantics

Quality to Quantity to Quality on the Web Andraž Tori, CTO at Zemanta @andraz

description

How is organized data used by web players with less than the best intentions? How can tools that try to help individual authors be subverted by spammers? Also, how does Zemanta work, and why are we interested in this topic?

Transcript of Quality, quantity, web and semantics

Page 1: Quality, quantity, web and semantics

Quality to Quantity to Quality on the Web

Andraž Tori, CTO at Zemanta @andraz

Page 2: Quality, quantity, web and semantics

Topics

- a bit about Zemanta

- how advanced “data tools” and spammers interact

Page 3: Quality, quantity, web and semantics

We are all trying to organize the web

Page 4: Quality, quantity, web and semantics

Making it right,

making it useful

and linked

Page 5: Quality, quantity, web and semantics
Page 6: Quality, quantity, web and semantics
Page 7: Quality, quantity, web and semantics

Not so long ago, in a city not far away...

Page 8: Quality, quantity, web and semantics

some other people

Page 9: Quality, quantity, web and semantics

are trying to do the opposite

Page 10: Quality, quantity, web and semantics

trying to disorganize it,

make it confusing,

and to profit from that

Page 11: Quality, quantity, web and semantics
Page 12: Quality, quantity, web and semantics

using the tools we have built!

Page 13: Quality, quantity, web and semantics
Page 14: Quality, quantity, web and semantics

Their motives are not sinister (mostly)

Page 15: Quality, quantity, web and semantics

it is about profit

Page 16: Quality, quantity, web and semantics

Profit

- publish as much content as possible

- quality is not (that) important

- get traffic or high page ranking for certain terms

- sell clicks, links or whole “fully built” sites to the highest bidder

- users and search engines are a necessary evil, to be tricked as cheaply as possible

Page 17: Quality, quantity, web and semantics
Page 18: Quality, quantity, web and semantics

So, why do I care?

Page 19: Quality, quantity, web and semantics

Job opening

You will get a spreadsheet with 180 blog url’s and logins. You will log into each blog and schedule 2 posts per week ...

You will spice up every post with images and/or related links within the content, using a Wordpress plugin called Zemanta

https://www.odesk.com/jobs/Wordpress-Blog-Poster_~~c8c04549b8e6b600

Page 20: Quality, quantity, web and semantics

And why might you care?

- organized information is a great tool for those who try to disorganize it

- they are poisoning “our web”, including Twitter and Facebook

- and it's hard to see in the fog they are causing

- it is just a matter of time before they start poisoning linked data too

Page 21: Quality, quantity, web and semantics
Page 22: Quality, quantity, web and semantics

What do we do at Zemanta?

Page 23: Quality, quantity, web and semantics

- Zemanta is a “personal writing assistant”

- suggesting content while you write (your blog)

- analyzing your text

- connecting it with background knowledge, other stories on the web, images

- you choose what suggestions to include

- to make your writing more informative, vivid and useful

Page 24: Quality, quantity, web and semantics
Page 25: Quality, quantity, web and semantics
Page 26: Quality, quantity, web and semantics
Page 27: Quality, quantity, web and semantics

Opening up the hood

Page 28: Quality, quantity, web and semantics
Page 29: Quality, quantity, web and semantics
Page 30: Quality, quantity, web and semantics

the reality

Page 31: Quality, quantity, web and semantics
Page 32: Quality, quantity, web and semantics

How it works

Plain text (article) → Analysis → Semantic search → Content suggestions

RSS feeds

Linked data

Page 33: Quality, quantity, web and semantics

Main design goals

- Input is a meaningful chunk of text (not a keyword or a phrase)

- Input is in (semi-)English

- Has to work across all domains in the open world

- music, celebrities, finance, entertainment, politics, gardening, parenting, …

Page 34: Quality, quantity, web and semantics

Analysis pipeline

Named entity extraction

Known phrases extraction (Aho-Corasick)

Triple store

Surface form features evaluation

Statistical comparison to background knowledge

Semantic coherence and hand-tuned heuristics

Disambiguated entities

etc.
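The "known phrases extraction (Aho-Corasick)" step can be pictured with a minimal sketch. This is not Zemanta's implementation: the function names, the tiny phrase list and the character-level matching are assumptions made purely for illustration.

```python
from collections import deque


def build_automaton(phrases):
    """Build an Aho-Corasick automaton: trie, failure links, output sets."""
    goto = [{}]    # goto[state][char] -> next state
    fail = [0]     # failure link of each state
    out = [set()]  # phrases that end at each state

    # 1. Insert every phrase into the trie.
    for phrase in phrases:
        state = 0
        for ch in phrase:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].add(phrase)

    # 2. Breadth-first pass: compute failure links and merge outputs.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] |= out[fail[nxt]]
    return goto, fail, out


def find_phrases(text, automaton):
    """Return sorted (start_index, phrase) pairs for every known phrase in text."""
    goto, fail, out = automaton
    state = 0
    matches = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for phrase in out[state]:
            matches.append((i - len(phrase) + 1, phrase))
    return sorted(matches)


# Hypothetical alias list; the real system matches against millions of aliases.
automaton = build_automaton(["health insurance", "insurance", "barack obama"])
matches = find_phrases("barack obama on health insurance", automaton)
```

The text is scanned once regardless of how many phrases are loaded, which is why this kind of automaton suits very large alias lists.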

Page 35: Quality, quantity, web and semantics

Analysis pipeline

Named entity extraction

Known phrases extraction (Aho-Corasick)

Triple store

Surface form features evaluation

Statistical comparison to background knowledge

Semantic coherence and hand-tuned heuristics

Disambiguated entities

etc.

Categorization to DMOZ categories

Ambiguous named entities

Page 36: Quality, quantity, web and semantics

Background knowledge

- Data from Wikipedia, MusicBrainz, Freebase… and the world wild web

- Includes linguistic and semantic properties + unstructured data

- Present in two forms:

- in “original” custom built triple store on top of MySQL (150 GB)

- processed into 7 GB optimized “memory mapped dump”
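As a rough picture of what a triple store on top of a relational database can look like, here is a small sketch. It uses SQLite from the Python standard library as a stand-in for MySQL, and the table layout, function names and `ex:` identifiers are all hypothetical; the slides do not describe the actual schema.

```python
import sqlite3


def open_store():
    """Create an in-memory triple table (SQLite stands in for MySQL here)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")
    # Indexes for the two lookups used below: by subject and by predicate/object.
    conn.execute("CREATE INDEX idx_sp ON triples (subject, predicate)")
    conn.execute("CREATE INDEX idx_po ON triples (predicate, object)")
    return conn


def add_triple(conn, subject, predicate, obj):
    conn.execute("INSERT INTO triples VALUES (?, ?, ?)", (subject, predicate, obj))


def match(conn, s=None, p=None, o=None):
    """Return triples matching the given pattern; None acts as a wildcard."""
    clauses, params = [], []
    for column, value in (("subject", s), ("predicate", p), ("object", o)):
        if value is not None:
            clauses.append(f"{column} = ?")
            params.append(value)
    sql = "SELECT subject, predicate, object FROM triples"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    sql += " ORDER BY subject, predicate, object"
    return conn.execute(sql, params).fetchall()


# Hypothetical sample data: one entity with two aliases and a type.
store = open_store()
add_triple(store, "ex:Barack_Obama", "alias", "Obama")
add_triple(store, "ex:Barack_Obama", "alias", "President Obama")
add_triple(store, "ex:Barack_Obama", "type", "ex:Person")
add_triple(store, "ex:Health_insurance", "alias", "health insurance")

# Which entities can the surface form "Obama" refer to?
candidates = match(store, p="alias", o="Obama")
```

A store like this is convenient to update but slow to scan, which is one plausible reason for also keeping a compact, read-only dump for the analysis step.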

Page 37: Quality, quantity, web and semantics

Background knowledge

- 7M mined and linked up entities and concepts

- 30M aliases

- Refreshed about once a month

- want to make it real-time

- Input data quality is really important

Triple store

etc.

Page 38: Quality, quantity, web and semantics

Text after analysis

- Solr (articles) → Related articles

- Solr (images) → Images

Page 39: Quality, quantity, web and semantics

Example SOLR query

Page 40: Quality, quantity, web and semantics

boost((( wiki_entities:Health insurance wiki_entities:Medical underwriting wiki_entities:United States wiki_entities:Affordable Care Act wiki_entities:Barack Obama wiki_entities:Lifetime (TV network) wiki_entities:Insurance wiki_entities:Preventive medicine wiki_entities:Child wiki_entities:Patient Protection and Affordable Care Act ) ^3.0)

(text:zemhealthinsurq^0.68 text:health^0.62 text:premium^0.36 text:zeminsurcompaniq^0.56 text:increas^0.29 text:rate^0.27 text:zemhealthinsurcompaniq^0.35 text:zempreventcareq^0.26 text:medic^0.26 text:compani^0.23 text:obamacar^0.21 text:todai^0.21 text:polici^0.21 text:care^0.19 ) ^105.0

((dmoz_categories:Top/Business/Financial_Services/Insurance/Agents_and_Marketers/Health dmoz_categories:Top/Business/Financial_Services/Insurance/Agents_and_Marketers/Health/United_States dmoz_categories:Top/Business/Financial_Services/Insurance/Agents_and_Marketers/Health/United_States/California) ^0.1),

(1 - 0.2) * sqrt(1.0/(1.15E-8*float(1285185600000 - date(published_datetime) ms)+1.0)) + 0.2)
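The last argument of the query above reads as a recency multiplier: it is 1.0 for an article published at query time and decays toward a floor of 0.2 as the article ages. The sketch below just re-evaluates that expression outside Solr. The constants are taken from the slide; the function and variable names are assumptions.

```python
import math

FLOOR = 0.2      # lowest possible multiplier, from the query
DECAY = 1.15e-8  # decay constant per millisecond of age, from the query


def recency_boost(now_ms, published_ms):
    """Evaluate (1 - 0.2) * sqrt(1 / (1.15E-8 * age_ms + 1)) + 0.2."""
    age_ms = now_ms - published_ms
    return (1 - FLOOR) * math.sqrt(1.0 / (DECAY * age_ms + 1.0)) + FLOOR


NOW_MS = 1285185600000  # the "now" timestamp that appears in the query
DAY_MS = 24 * 60 * 60 * 1000

fresh = recency_boost(NOW_MS, NOW_MS)
one_day_old = recency_boost(NOW_MS, NOW_MS - DAY_MS)
one_year_old = recency_boost(NOW_MS, NOW_MS - 365 * DAY_MS)
```

With these constants a one-day-old article keeps roughly three quarters of its score, and very old articles never drop below 20%, so relevance can still outweigh freshness.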

Page 41: Quality, quantity, web and semantics

Solr

- We adapted Solr for “query by document”

- 52% precision (at 10) on internal evaluations

- plain Lucene MLT comes to 44%

- the difference comes from using a “bag of terms” approach instead of “bag of words” (the terms come from the analysis step)

- Our live index is 5M articles

- Solr is really not optimized to handle 50 terms in a single query
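For readers unfamiliar with the metric, "precision at 10" is the share of the top ten returned results that are judged relevant. A minimal sketch follows; the function name and the sample judgments are hypothetical, and the slides do not say how relevance was judged in the internal evaluations.

```python
def precision_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the top-k ranked results that are judged relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / k


# Hypothetical example: 10 suggested articles, 5 of them judged relevant.
ranked = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"]
relevant = {"a", "c", "e", "g", "i"}
score = precision_at_k(ranked, relevant, k=10)
```

Averaging this score over many test documents gives a single figure comparable to the percentages quoted above.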

Page 42: Quality, quantity, web and semantics

Lucene plain “More Like This”

Page 43: Quality, quantity, web and semantics

Metrics & tests

- Every part of the system is being constantly evaluated

- Precision/recall at 5 different points in the system

- Mostly bi-weekly releases of new datasets and the engine

Page 44: Quality, quantity, web and semantics

Overview

- We do pretty deep processing to deliver the simple user experience of a “personal authoring assistant”

- And everything is available over the web API

- tagging

- named entity recognition and disambiguation to Linked Open Data URIs

Page 45: Quality, quantity, web and semantics

Most used

What does the API offer?

Most interesting

• Tags

• Categories

• Concepts and entities

• Related articles

• Related images

Page 46: Quality, quantity, web and semantics
Page 47: Quality, quantity, web and semantics

So mash-ups happen...

Page 48: Quality, quantity, web and semantics

Some API users

Page 49: Quality, quantity, web and semantics

We are just one of the many offering services based on large amounts of web data

each spending man-years trying to organize their data, trying to offer the best possible service

Page 50: Quality, quantity, web and semantics

now back to the bad guys

Page 51: Quality, quantity, web and semantics
Page 52: Quality, quantity, web and semantics

Job opening

You will get a spreadsheet with 180 blog url’s and logins. You will log into each blog and schedule 2 posts per week ...

You will spice up every post with images and/or related links within the content, using a Wordpress plugin called Zemanta

https://www.odesk.com/jobs/Wordpress-Blog-Poster_~~c8c04549b8e6b600

Page 53: Quality, quantity, web and semantics

There's more than meets the eye

Page 54: Quality, quantity, web and semantics

Gather search terms (extensions, logs, guess)

Analyze → what people search for?

Find / create such content

Cover your tracks

Use Zemanta or OpenCalais to add tags, images, links

Pull additional content from Freebase

Publish

Use Zemanta to find similar blogs

Amazon Mechanical Turk to post comments and links back to your site

Profit?

Page 55: Quality, quantity, web and semantics

Warnings

- I have not seen any single system that uses the whole pipeline as described; however, all of its parts have been found in the wild

- Examples used are from all kinds of sites – good, bad and ugly

- I am not trying to imply that all of the steps in the diagram are bad, but bad guys can use them efficiently

Page 56: Quality, quantity, web and semantics


Page 57: Quality, quantity, web and semantics

Finding their keywords, niches

- Domain expertise

- Users like to install extensions and say “yes”

- You observe referrers on sites you control

- You buy the data on the black market

Page 58: Quality, quantity, web and semantics
Page 59: Quality, quantity, web and semantics

The sophisticated part of the market

“Demand Media relies on a proprietary algorithm to help editors best determine what subjects their writers should tackle.”

Factors:

- Keyword competition

- Revenue

- Driving traffic to/from existing content

http://emediavitals.com/article/16/demand-media-s-content-assembly-line

Page 60: Quality, quantity, web and semantics


Page 61: Quality, quantity, web and semantics

Find / create content

- Steal

- Take from “open article directories”

- Have your own “content assembly line” like Demand Media

Page 62: Quality, quantity, web and semantics

Open article directories

Page 63: Quality, quantity, web and semantics


Page 64: Quality, quantity, web and semantics

Tһiѕ iѕ nοt the text you аre lookinɡ for.

Page 65: Quality, quantity, web and semantics

Tһiѕ iѕ nοt the text you аre lookinɡ for.
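The slide text above appears to demonstrate look-alike characters: letters from other scripts that render almost like Latin ones, so copied text no longer matches the original byte for byte. A minimal sketch of how such substitutions can be spotted is below. It assumes the text is expected to be plain ASCII English; the function name is hypothetical, and the sample string is a reconstruction of the slide.

```python
import unicodedata


def find_lookalikes(text):
    """Return (index, Unicode name) for every non-ASCII letter in the text."""
    hits = []
    for index, ch in enumerate(text):
        if ch.isalpha() and not ch.isascii():
            hits.append((index, unicodedata.name(ch, "UNKNOWN")))
    return hits


# Reconstructed slide sentence, written with escapes so the swaps are visible.
sample = (
    "T\u04bbi\u0455 i\u0455 n\u03bft the text you "
    "\u0430re lookin\u0261 for."
)
hits = find_lookalikes(sample)
scripts = {name.split()[0] for _, name in hits}
```

Legitimate multilingual text also contains non-ASCII letters, so a real filter would look for words that mix scripts rather than flag every such character.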

Page 66: Quality, quantity, web and semantics

Translate it to random language and back to English

- Übersetzen sie zufällig Sprache und wieder auf Englisch → Language and translate it happen again in English

- Μεταφράστε αυτό σε δειγματοληπτικούς γλώσσα και πίσω στην αγγλική γλώσσα → Translate this random language back to English

- Traduisez au langage aléatoire et revenir à l'anglais → Translate to random language to English and back

- 它翻译成随机的语言和回英文 → Translate it back into the English language and random

Page 67: Quality, quantity, web and semantics

Covering their tracks

- Trying to fool search engines or people?

- Search engines are catching up

- Google Translate API is being closed due to “abuse”?

- The trend is “rewriting” by human editors, procured on the global market

Page 68: Quality, quantity, web and semantics
Page 69: Quality, quantity, web and semantics


Page 70: Quality, quantity, web and semantics

Spammers say the darndest things

Page 71: Quality, quantity, web and semantics
Page 72: Quality, quantity, web and semantics


Page 73: Quality, quantity, web and semantics
Page 74: Quality, quantity, web and semantics

Remixing linked data and spam

- Currently it is mostly the good guys who use Linked Data

- However, it's just too tempting to be left alone

- Fully synthetic articles using factual information from linked data?

- Using advanced tools to form proper natural-language sentences and maybe even a storyline?

Page 75: Quality, quantity, web and semantics


Page 76: Quality, quantity, web and semantics

Publish

- On hosted third party platforms

- eating their resources

- Platforms have a hard time killing spammers

- Smaller ones don't necessarily have the incentive

- If they remove a spammer too fast, it is easier for the spammer to probe the limits

- Platforms use “kill with delay”

- Spam detection is resource intensive

Page 77: Quality, quantity, web and semantics


Page 78: Quality, quantity, web and semantics

Valuable comments

As I write this post, Zemanta is showing me 5 different articles that are related to my post. I could visit each one of these sites and reach out to the owner to see if they would be interested in linking to my post, or I could leave a valuable comment on the page and include a link back to my post.

http://www.mainelyseo.com/zemanta-review-seo-link-building-with-the-zemanta-plugin/

Page 79: Quality, quantity, web and semantics

- The guy on the previous slide is honest and well-meaning

- But what if you automate that via Amazon Mechanical Turk or oDesk?

Page 80: Quality, quantity, web and semantics


Page 81: Quality, quantity, web and semantics

Profit?

- sell ads

- sell links

- sell “fully developed site”

- to the highest bidder

Page 82: Quality, quantity, web and semantics

Search engines to the rescue?

- Mahalo cut 10% of the staff the day after Google announced ranking changes

- Demand Media's stock isn't doing that well anymore

- However, this is a never-ending story; we'll have co-evolution for the foreseeable future

Page 83: Quality, quantity, web and semantics

Ecosystem

- Very sophisticated, large players

- moving to more high quality content, video?

- Small-time operations

- using more and more sophisticated tools available on the market cheaply (modern asymmetric warfare?)

- A dark industry specifically builds tools to poison the web and sells them to small-time operators

Page 84: Quality, quantity, web and semantics

Food for thought

Page 85: Quality, quantity, web and semantics

Can we make spammers (and others) work for us, making linked data better?

(think reCAPTCHA)

Page 86: Quality, quantity, web and semantics

Could article directories be fruitfully used?

eZineArticles.com, GoArticles.com, etc...

Page 87: Quality, quantity, web and semantics

Find rewritten articles and use them as parallel corpus?

Page 88: Quality, quantity, web and semantics

Could we use the global workforce market more efficiently to get more linked data?

Page 89: Quality, quantity, web and semantics

Thesis, antithesis, synthesis?

http://xkcd.com/810/

Page 90: Quality, quantity, web and semantics

Thank you!

Questions?

Page 91: Quality, quantity, web and semantics

Image sources

http://www.flickr.com/photos/dzingeek/4587871752/

http://www.flickr.com/photos/25101572@N02/4393474025/

http://www.flickr.com/photos/billward/4740384434/

http://www.flickr.com/photos/jurvetson/542500748

http://www.flickr.com/photos/legofenris/4288913574

http://www.flickr.com/photos/ekilby/3733627940

http://www.flickr.com/photos/ekilby/3732799269/

http://www.flickr.com/photos/cipherswarm/38354452

http://xkcd.com/810/