Post on 29-Nov-2014
Duplicate Content Filters, Penalties and other Content Minefields
27th March 2012
Search Quality – the Duplicate Content Headache
Google can’t afford a SERP of:
1) Search engine optimization
Search engine optimization (SEO) is the process of improving the visibility of a website or a web page in search engines...
2) Search engine optimization
Search engine optimization (SEO) is the process of improving the visibility of a website or a web page in search engines...
3) Search engine optimization
Search engine optimization (SEO) is the process of improving the visibility of a website or a web page in search engines...
4) Search engine optimization
Search engine optimization (SEO) is the process of improving the visibility of a website or a web page in search engines...
Resource – the Duplicate Content Headache
Duplicate content has resource consequences for search engines:
• Wastes crawler resources - there is a finite number of crawlers
• Wastes bandwidth - how often can you crawl 1 trillion documents and keep your index fresh?
• Increases query CPU time - how do you search 1 trillion documents as quickly as possible?
Document importance – Duplicate Content Headache
Duplicate content can be a signal of an important document:
• Song lyrics
• Scholarly texts and historical documents, eg the Bible (1,000 pages)
• The Linux manual (2,000 pages)
• Breaking News – Associated Press, Reuters
etc.
Types of Duplicate Content
Duplicate content comes in many forms
Intentional vs non intentional
On-site vs off-site
On-Site Duplicate Content (Impacts Quality Score)
Intentional
• Printer friendly pages
• Different font sizes
• PDF documents
• Archive (non graphics versions)
• Shopping filters (sort by and pagination)
• RSS feeds
Non-intentional
• Affiliate URLs - www.example.com/?btag=123
• Adwords Campaigns - www.example.com/?utc=google
• Search results
• www vs non www URLs
• https vs http
• Stubs/plugins
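One way to spot the non-intentional variants listed above is to collapse each URL onto a single preferred form and see which ones collide. The sketch below is illustrative only: the choice of https plus www as the preferred form, the TRACKING_PARAMS set and the function names are assumptions, and only btag and utc are taken from the examples above.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: only the two tracking parameters named on the slide.
# A real site would need its own list.
TRACKING_PARAMS = {"btag", "utc"}


def normalise(url):
    """Collapse protocol, www and tracking-parameter variants onto one URL."""
    if "://" not in url:
        url = "http://" + url  # bare "www.example.com/..." style input
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if not host.startswith("www."):
        host = "www." + host  # assumption: the www form is preferred
    # Drop tracking parameters, keep everything else in its original order.
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in TRACKING_PARAMS
    ]
    path = parts.path or "/"
    # Assumption: https is the preferred protocol; fragments are discarded.
    return urlunsplit(("https", host, path, urlencode(kept), ""))


def group_variants(urls):
    """Return {normalised URL: [original URLs]} for URLs that collide."""
    groups = defaultdict(list)
    for url in urls:
        groups[normalise(url)].append(url)
    return {key: found for key, found in groups.items() if len(found) > 1}


variants = [
    "www.example.com/?btag=123",
    "http://example.com/?utc=google",
    "https://www.example.com/",
]
print(group_variants(variants))
```

Any group with more than one member is a set of URLs that probably serve the same page and may need a redirect, a canonical tag or a noindex.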
On-Site Duplicate Content (Impacts Quality Score)
10,000s of stub pages - a worst case scenario example:
This was 2 weeks after Andy had removed the duplicate links from the search pages on our advice, eg:
http://www.motors.co.uk/Ford-Escort-0-9999999---2
http://www.motors.co.uk/Ford-Escort-0-9999999--U-2-
http://www.motors.co.uk/Ford-Escort-0-9999999---2%20-
Off-Site Duplicate Content (Filters and Penalties)
Intentional vs non-intentional is somewhat grey
• Domain branding eg .com, .co.za
• (Mobile website)
• Content syndication
• Content theft
• Staging websites - a common problem!!
Quality signals are often used to filter off-site duplicates!!!
How Does Google Filter Off-site Duplicate Content
Authors feel they have a right to rank for their own content – Google’s loyalty is to its users!!!
Google doesn’t necessarily reward the source or original, but assesses:
• Relevance (eg is an article in context)
• Domain authority & links (eg Google Knol, Facebook)
• Fresh content boost
• Site quality signals (eg internal duplicate content!!!)
Examples of Off-site Duplicate Content and Quality
Client with a .com.au and a .com with https duplicates
Casino Client with a lot of stub pages (pre Panda)
Casino site – severe health issues;
How to Diagnose (on-site) Duplicate Content
Link building will exacerbate duplicate content indexing
Keep an eye on indexed pages (weekly) and look for spikes in Google indexing (and in Yahoo and Bing)
Look for site:example.com duplicates
Use Xenu link checker
Heed any Webmaster Tools warnings
Check your crawl and cache dates: frequent updates but stale cache dates = dupe content issues
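As a complement to the manual checks above, a crawl export can be grouped by page text to list exact on-site duplicates. This is a minimal sketch under stated assumptions: the pages dictionary (URL to extracted body text) and the function names are hypothetical, and hashing only catches copies that are identical once case and whitespace are ignored. Near-duplicates would need a fuzzier comparison.

```python
import hashlib
import re
from collections import defaultdict


def fingerprint(text):
    """Hash the text with case and whitespace differences removed."""
    normalised = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def find_duplicates(pages):
    """Group URLs whose text shares a fingerprint; return groups of 2+."""
    groups = defaultdict(list)
    for url, text in pages.items():
        groups[fingerprint(text)].append(url)
    return [sorted(urls) for urls in groups.values() if len(urls) > 1]


# Hypothetical crawl export: URL -> extracted body text.
pages = {
    "http://www.example.com/ford-escort": "Used Ford Escort cars for sale",
    "http://www.example.com/ford-escort?sort=price": "Used  Ford Escort cars\nfor sale",
    "http://www.example.com/about": "About us",
}
print(find_duplicates(pages))
```

Running this weekly alongside the indexed-page count gives an early list of URLs to fix before they show up as an indexing spike.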
How to address on-site and off-site duplicate content
You have a whole armoury of potential tools, including:
• Robots.txt exclusion
• Robots meta tag
• Canonical tag
• Webmaster URL exclusion
• Password protection
• (301 redirects)
(File a DMCA against serial content thieves?)
A lot of well-meaning people give bad advice, though
Google Engineers Can’t Agree
Adam Lasnik – “Deftly Dealing with Duplicate Content” 2006
Probably the authoritative guide to duplicate content;
• What is duplicate content?
• What isn't duplicate content?
• Why does Google care about duplicate content?
• What does Google do about it?
• How can Webmasters proactively address duplicate content issues?
Deftly Dealing with... - Our advice/experience
Robots.txt
Routinely ignored by Google, probably because of malware
User-agent: *
Allow: /the-good-stuff/
Disallow: /the-malware/
Robots.txt is ignored unless combined with emergency Webmaster Tools URL removal (3 months)
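To check what a set of robots.txt rules actually tells a compliant crawler, Python's standard urllib.robotparser can evaluate them offline. A minimal sketch using the example rules above; the is_crawlable helper and the "Googlebot" agent string are illustrative assumptions. It only answers "may this path be fetched?" and says nothing about whether the URL is already in Google's index, which is the problem described on this slide.

```python
from urllib.robotparser import RobotFileParser

# The example rules from the slide.
RULES = """\
User-agent: *
Allow: /the-good-stuff/
Disallow: /the-malware/
"""


def is_crawlable(path, agent="Googlebot"):
    """Return True if the rules allow the agent to fetch the path."""
    parser = RobotFileParser()
    parser.parse(RULES.splitlines())
    return parser.can_fetch(agent, path)


print(is_crawlable("/the-good-stuff/page.html"))  # True
print(is_crawlable("/the-malware/page.html"))     # False
```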
Our advice/experience
Canonical tag
Works great for cross-domain duplicate content
Largely ineffective for pagination eg shopping sites
Totally ineffective unless canonical URLs are VERY similar if not identical
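When auditing canonical tags it helps to pull the declared canonical URL out of each page and compare it with the URL that was requested. The sketch below uses only the standard html.parser module; the CanonicalFinder class and canonical_url helper are hypothetical names, and the sample markup is made up for illustration.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Remember the href of the first <link rel="canonical"> seen."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attributes = dict(attrs)
        rel_values = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_values:
            self.canonical = attributes.get("href")


def canonical_url(html):
    """Return the canonical URL declared in the markup, or None."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical


sample = (
    '<html><head><link rel="canonical" '
    'href="http://www.example.com/shoes"></head><body></body></html>'
)
print(canonical_url(sample))  # http://www.example.com/shoes
```

Pages whose canonical target has very different content from the page itself are the ones to review first, given the experience noted above.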
Our advice/experience
Robots Meta Tag
Noindex,Follow - 100% obeyed by Google and passes Page Rank too
Very effective for pagination eg shopping sites
Works well for tracking links too (www.example.com/?affid=123456)
Doesn’t work when used with blocking robots.txt
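The last point is easy to miss: if robots.txt blocks a URL, the crawler never fetches the page and so never sees its noindex tag. A rough check for that conflict is sketched below, assuming you already have each page's robots meta content to hand. The hidden_noindex name, the sample rules and the sample paths are all hypothetical.

```python
from urllib.robotparser import RobotFileParser


def hidden_noindex(robots_lines, pages, agent="Googlebot"):
    """Return paths carrying noindex that robots.txt stops the agent fetching.

    pages maps a URL path to the content of its robots meta tag.
    """
    parser = RobotFileParser()
    parser.parse(robots_lines)
    conflicts = []
    for path, meta in pages.items():
        directives = {part.strip().lower() for part in meta.split(",")}
        if "noindex" in directives and not parser.can_fetch(agent, path):
            conflicts.append(path)
    return conflicts


# Hypothetical site: search results are blocked in robots.txt AND noindexed.
robots_lines = ["User-agent: *", "Disallow: /search/"]
pages = {
    "/search/?page=2": "noindex,follow",
    "/shoes/?page=2": "noindex,follow",
    "/shoes/": "index,follow",
}
print(hidden_noindex(robots_lines, pages))  # ['/search/?page=2']
```

For any path this returns, one of the two mechanisms has to go: either lift the robots.txt block so the noindex can be read, or rely on the block alone.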
Our advice/experience
Password Protect/htaccess 403 Forbidden
Works great for staging sites
Stubs - the problem is that it generates Webmaster Tools errors
Our feeling: best avoided on your main domain
Extreme Techniques to Avoid Dupe Content
Make all your backend .exe with htaccess
Summary
Duplicate content is a minefield!
Filters usually apply, penalties are very rare
You have the answer in your own hands
Stay on top of your site’s health – especially internal duplicate content
Thank you for your attention!
Thanks to:
Anton Groeneveldt
Carla dos Santos