SEO Book


SEO Theory and Practice - A free SEO book that covers the technical side of SEO.


Contents

I Theory

0.1 Why SEO is important
0.2 Different needs from SEO

1 What is a Search Engine?
1.1 History of Search Engines
1.2 Important Issues
1.2.1 Performance
1.2.2 Dynamic Data
1.2.3 Scalability
1.2.4 Spam and Manipulation
1.3 How a Search Engine works
1.3.1 Text acquisition
1.3.2 Duplicate Content Detection
1.3.3 Text transformation
1.3.4 Index Creation
1.3.5 User Interaction
1.3.6 Ranking
1.3.7 Evaluation

2 How good can a search engine be?
2.1 NP Hard Problems
2.2 AI Hard Problems
2.3 Competitors

3 Ranking Factors
3.1 On Page Factors
3.2 Off Page Factors
3.3 Google PageRank Notes
3.3.1 Short Description
3.3.2 Mathematical Description
3.3.3 Interesting Notes on the Original Implementation of PageRank
3.3.4 Optimal Linking Strategies
3.3.5 Implementation to make computing PageRank faster
3.3.6 HITS
3.3.7 Is linking out a good thing?
3.3.8 TrustRank / Bad Page Rank
3.3.9 Improvements to Google's ranking algorithms

4 Detecting Spam and Manipulation
4.1 Google Webmaster Guidelines
4.2 Penalties
4.3 Detecting Manipulation in Content
4.4 Detecting Manipulation in Links
4.5 Other Methods

II Practice

5 An Example Campaign
5.1 Company Profile
5.2 Goals
5.3 Competitor Research
5.4 Keyword Research
5.5 Content Creation
5.6 Website Check
5.7 Link Building
5.8 Analysis


Preface

This book aims to provide a general overview of how search engines rank documents in practice, the core of which will remain true even as search engines' algorithms are refined.


Part I

Theory

0.1 Why SEO is important

• A higher search engine result will receive exponentially greater clicks than a lower one

For example, if a search was repeated 1000 times by different users, this is typically how many clicks each result would get.

Position   Clicks
1          222
2          63
3          45
4          32
5          26
6          21
7          18
8          16
9          15
10         16

Source: leaked AOL click data

• Paid adverts have low click-through rates, and get expensive quickly

Search Engine   % Organic Click-Through Rate   % Paid Result Click-Through Rate
Google          72                             28
Yahoo           61                             39
MSN             71                             29
AOL             50                             50
Average         63                             37

88% of online search dollars are spent on paid results, even though 85% of searchers click on organic results.

(Vanessa Fox, "Marketing in the Age of Google", May 3, 2010)

0.2 Different needs from SEO

There are many different reasons you may wish to engage in optimising your search results, including:

• Money - Sales for e-commerce sites are directly correlated with traffic.

• Reputation - Some companies go to the extent of pushing negative articles down in the rankings.

• Branding - Coming top of the results pages is impressive to customers, and is particularly important in industries where reputation is extremely important.


1 What is a Search Engine?

1.1 History of Search Engines

The first mechanised information retrieval systems were built by the US military to analyse the mass of documents being captured from the Germans. Research was boosted when the UK and US governments funded research to reduce a perceived "science gap" with the USSR. By the time the internet was becoming commonplace in the early 1990s, information retrieval was at an advanced stage. Complicated methods, primarily statistical, had been developed, and archives of thousands of documents could be searched in seconds.

Web search engines are a special case of information retrieval systems, applied to the massive collection of documents available on the internet. A typical search engine in 1990 was split into two parts: a web spider that traverses the web following links and creating a local index of the pages, then traditional information retrieval methods to search the index for pages relevant to the user's query and order the pages by some ranking function. Many factors influence a person's decision about what is relevant, such as the current task, context and freshness.

In 1998 pages were primarily ranked by their contextual content. Since this is entirely controlled by the owner of the page, results were easy to manipulate, and as the Internet became ever more commercialized the noise from spam in SERPs (search engine results pages) made search a frustrating activity. It was also hard to discern websites which more people would want to visit, for example a celebrity's official home page, from less wanted websites with similar content. For these reasons directory sites such as Yahoo were still popular, despite being out of date and making the user work out the relevance themselves.

Google's founders Larry Page and Sergey Brin's PageRank innovation (named after Larry Page), and that of a similar algorithm also released in 1998 called Hyperlink-Induced Topic Search (HITS) by Jon Kleinberg, was to use the additional meta information from the link structure of the Internet. A more detailed description of PageRank will follow in section 3.3, but for now Google's own description will suffice.

PageRank relies on the uniquely democratic nature of the web by using its vast link structure as an indicator of an individual page's value. In essence, Google interprets a link from page A to page B as a vote, by page A, for page B. But, Google looks at more than the sheer volume of votes, or links a page receives; it also analyzes the page that casts the vote. Votes cast by pages that are themselves "important" weigh more heavily and help to make other pages "important".

Whilst it is impossible to know how Google has evolved their algorithms since the 1998 paper that launched PageRank, and how a real-world efficient implementation differs from the theory, as Google themselves say the PageRank algorithm remains "the heart of Google's software ... and continues to provide the basis for all of [their] web search tools". The search engines continue to evolve at a blistering pace, improving their ranking algorithms (Google says there are now over 200 ranking factors considered for each search[1]), and indexing a growing Internet more rapidly.

1.2 Important Issues

The building of a system as complex as a modern search engine is all about balancing different positive qualities. For example, you could effectively prevent low quality spam by paying humans to review every document on the web, but the cost would be immense. Or you could speed up your search engine by considering only every other document your spider encounters, but the relevance of results would suffer. Some things, such as getting a computer to analyse a document with the same quality as a human, are theoretically impossible today, but Google in particular is pushing boundaries and getting ever closer.

Search engines have some particular considerations:

1.2.1 Performance

The response time to a user's query must be lightning fast.

1.2.2 Dynamic Data

Unlike a traditional information retrieval system in a library, the pages on the Internet are constantly changing.

1.2.3 Scalability

Search engines need to work with billions of users searching through trillions of documents, distributed across the Earth.

1.2.4 Spam and Manipulation

Actively engaging against other humans to maintain the relevancy of results is relatively unique to search engines. In a library system you may have an author that creates a long title packed with words their readers may be interested in, but that's about the worst of it. When designing your search engine you are in a constant battle with adversaries who will attempt to reverse engineer your algorithm to find the easiest ways to affect your results. A common term for this relationship is "Adversarial Information Retrieval". The relationship between the owner of a web site trying to rank high on a search engine and the search engine designer is an adversarial relationship in a zero-sum game. That is, assuming the results were better before, every gain for the web site owner is a loss for the search engine designer. Classifying where your efforts cross over from helping a search engine be aware of your web site's content and popularity, which should help to improve a search engine's results, to instead ranking beyond your means and decreasing the quality of a search engine's results, can be somewhat tricky.

[1] See http://googlewebmastercentral.blogspot.com/2008/10/good-times-with-inbound-links.html


The practicalities of what search engines consider to be spam, and as importantly what they can detect and fix, will be discussed later.

According to "Web Spam Taxonomy"[2], approximately 10-15% of indexed content on the web is spam. What is considered spam and duplicate content varies, which makes this statistic hard to verify. There is a core of about 56 million pages[3] that are highly interlinked at the center of the Internet, and are less likely to be spam. Documents further away (in link steps) from this core are more likely to be spam.

Deciding the quality of a document well (say whether it is a page written by an expert in the field, or generated by a computer program using natural language processing) is an AI-Complete problem, that is, it won't be possible until we have artificial intelligence that can match that of a human.

However, search engines hope to get spam under control by lessening the financial incentive of spam. This quote from a Microsoft Research paper[4] expresses this nicely:

"Effectively detecting web spam is essentially an arms race between search engines and site operators. It is almost certain that we will have to adapt our methods over time, to accommodate for new spam methods that the spammers use. It is our hope that our work will help the users enjoy a better search experience on the web. Victory does not require perfection, just a rate of detection that alters the economic balance for a would-be spammer. It is our hope that continued research on this front can make effective spam more expensive than genuine content."

Google developers for their part describe web spam as the following[5], citing the detrimental impact it has upon users:

These manipulated documents can be referred to as spam. When a user receives a manipulated document in the search results and clicks on the link to go to the manipulated document, the document is very often an advertisement for goods or services unrelated to the search query or a pornography website, or the manipulated document automatically forwards the user on to a website unrelated to the user's query.

1.3 How a Search Engine works

A typical search engine can be split into two parts: indexing, where the Internet is transformed into an internal representation that can be efficiently searched; and the query process, where the index is searched for the user's query and documents are ranked and returned to the user in a list.

Indexing

[2] Zoltán Gyöngyi and Hector Garcia-Molina, Stanford University. First International Workshop on Adversarial Information Retrieval on the Web, May 2005
[3] See "On Determining Communities in the Web" by K Verbeurg
[4] See "Detecting Spam Web Pages through Content Analysis" by A Ntoulas
[5] See patent 7302645: "Methods and systems for identifying manipulated articles"


1.3.1 Text acquisition

A crawler starts at a seed site such as the DMOZ directory, then repeatedly follows links to find documents across the web, storing the content of the pages and associated meta data (such as the date of indexing, and which page linked to the site). In a modern search engine the crawler is constantly running, downloading thousands of pages simultaneously, to continuously update and expand the index. A good crawler will cover a large percentage of the pages on the Internet, and visit popular pages frequently to keep its index fresh. A crawler will connect to the web server and use an HTTP request to retrieve the document, if it has changed. On average, web page updates follow the Poisson distribution - that is, the crawler can expect the time until the web page next updates to follow an exponential distribution. Crawlers are now also indexing near real time data through varying sources such as access to RSS feeds and the Twitter API, and are able to index a range of formats such as PDFs and Flash. These formats are converted into a common intermediate format such as XML. A crawler can also be asked to update its copy of a page via methods such as a ping or XML sitemap, but the update time will still be up to the crawler. The document data store stores the text and meta data the crawler retrieves; it must allow for very fast access to a large amount of documents. Text can be compressed relatively easily, and pages are typically indexed by a hash of their URL. Google's original patent used a system called BigTable; Google now keeps documents in sections called shards distributed over a range of data centres (this offers performance, redundancy and security benefits).
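To make the fetch-parse-queue cycle above concrete, here is a minimal crawler sketch using only Python's standard library. The seed URL, page limit and politeness delay are arbitrary illustrative choices, not details taken from this book or from any real search engine.

import time
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag found in a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, max_pages=10, delay=1.0):
    """Breadth-first crawl: fetch a page, store it, queue its outlinks."""
    queue, seen, store = [seed], {seed}, {}
    while queue and len(store) < max_pages:
        url = queue.pop(0)
        try:
            with urllib.request.urlopen(url, timeout=10) as response:
                html = response.read().decode("utf-8", errors="replace")
        except Exception:
            continue  # unreachable page; a real crawler would retry later
        store[url] = {"html": html, "fetched_at": time.time()}
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)  # resolve relative links
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
        time.sleep(delay)  # crude politeness between requests
    return store

pages = crawl("http://example.com/")
print(len(pages), "pages fetched")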

1.3.2 Duplicate Content Detection

Detecting exact duplicates is easy: remove the boilerplate content (menus etc.) then compare the core text through checksums. Detecting near duplicates is harder, particularly if you want to build an algorithm that is fast enough to compare a document against every other document in the index. To perform faster duplicate detection, fingerprints of a document are taken.

A simple fingerprinting algorithm for this is outlined here (a minimal code sketch follows the list):

1. Parse the document into words, and remove formatting content such as punctuation and HTML tags.

2. The words are grouped into groups of words (called n-grams, a 3-gram being 3 words, a 4-gram 4 words etc.)

3. Some of these n-grams are selected to represent a document

4. The selected n-grams are hashed to create a shorter description

5. The hash values are stored in a quick look up database

6. The documents are compared by looking at overlaps of fingerprints.
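A sketch of this shingling approach in Python follows. The 3-word n-gram size, the "keep hashes divisible by 2" selection rule and the Jaccard-style overlap score are illustrative choices only, not the parameters of any particular search engine.

import hashlib
import re

def fingerprints(text, n=3, modulus=2):
    """Steps 1-5: tokenize, build n-grams, hash them, keep a sample."""
    words = re.findall(r"[a-z0-9]+", text.lower())          # step 1
    ngrams = [" ".join(words[i:i + n])                       # step 2
              for i in range(len(words) - n + 1)]
    hashes = {int(hashlib.md5(g.encode()).hexdigest(), 16)   # step 4
              for g in ngrams}
    return {h for h in hashes if h % modulus == 0}           # steps 3 and 5

def overlap(doc_a, doc_b):
    """Step 6: compare two documents by the overlap of their fingerprints."""
    fa, fb = fingerprints(doc_a), fingerprints(doc_b)
    if not fa or not fb:
        return 0.0
    return len(fa & fb) / len(fa | fb)

a = "the quick brown fox jumps over the lazy dog near the river bank"
b = "a quick brown fox jumps over the lazy dog near the river bank today"
print(overlap(a, b))  # near-duplicate documents share most sampled fingerprints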


Fingerprinting in action

A paper[6] by four Google employees found the following statistics across their index of the web.

Number of tokens: 1,024,908,267,229
Number of sentences: 95,119,665,584
Number of unigrams: 13,588,391
Number of bigrams: 314,843,401
Number of trigrams: 977,069,902
Number of fourgrams: 1,313,818,354
Number of fivegrams: 1,176,470,663
Most common trigram in English: "all rights reserved"

Detecting unusual patterns of n-grams can also be used to detect low quality/spam documents[7].

1.3.3 Text transformation

Tokenization is the process of splitting a series of characters up into separate words. These tokens are then parsed to look for tokens such as <a ></a> to find which parts of the text are plain text, links and so on.

• Identifying Content

Sections of documents that are just content are found, in an attempt to ignore "boilerplate" content such as navigation menus. A simple way is to look for sections where there are few HTML tags; more complicated methods consider the visual layout of the page.

• Stopping

Common words such as "the" and "and" are removed to increase the efficiency of the search engine, resulting in a slight loss in accuracy. In general, the more unusual a word the better it is at determining if a document is relevant.

[6] See "N-gram Statistics in English and Chinese: Similarities and Differences"
[7] See http://www.seobythesea.com/?p=5108


• Stemming

Stemming reduces words to just their stem, for example "computer" and "computing" become "comput". Typically around a 10% improvement is seen in relevance in English, and up to 50% in Arabic. The classic stemming algorithm is the "Porter Stemmer", which works through a series of rules such as "replace sses with ss, so stresses -> stress".
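As a rough illustration of this kind of rule-driven suffix stripping, here is a toy stemmer with a handful of made-up rules in the spirit of the Porter Stemmer; the real algorithm has several more stages and conditions.

# A few Porter-style suffix rules, tried in order; illustrative only.
RULES = [
    ("sses", "ss"),   # stresses  -> stress
    ("ies", "i"),     # ponies    -> poni
    ("ing", ""),      # computing -> comput
    ("er", ""),       # computer  -> comput
    ("s", ""),        # cats      -> cat
]

def crude_stem(word):
    word = word.lower()
    for suffix, replacement in RULES:
        # only strip the suffix if a reasonable stem would remain
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)] + replacement
    return word

for w in ["stresses", "computer", "computing", "cats"]:
    print(w, "->", crude_stem(w))
# stresses -> stress, computer -> comput, computing -> comput, cats -> cat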

• Information Extraction

Trying to determine the meaning of text is very difficult in general, but certain words can give clues. For example the phrase "x has worked at y" is useful when building an index of employees.

1.3.4 Index Creation

Document statistics such as the count of words are stored for use in ranking algorithms. An inverted index[8] is created to allow for fast full text searches. The index is distributed across multiple data centres across the globe[9].
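A toy version of such an index might look like the sketch below; the tiny stopword list and the simple term-frequency scoring at query time are simplifications for illustration, not how any production engine scores documents.

from collections import defaultdict

STOPWORDS = {"the", "and", "a", "of", "to", "in", "is"}  # tiny illustrative stop list

def build_index(docs):
    """Map each term to the documents (and word positions) it appears in."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for position, word in enumerate(text.lower().split()):
            if word not in STOPWORDS:
                index[word][doc_id].append(position)
    return index

def search(index, query):
    """Score documents by how often the query terms occur in them."""
    scores = defaultdict(int)
    for term in query.lower().split():
        for doc_id, positions in index.get(term, {}).items():
            scores[doc_id] += len(positions)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

docs = {
    "page1": "driving school in Springfield Ohio",
    "page2": "the history of driving and the school system",
}
index = build_index(docs)
print(search(index, "driving school"))  # both pages match, ranked by term count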

1.3.5 User Interaction

The user is provided with an interface in which to give their query. The query is then transformed using similar techniques to those applied to documents, such as stemming, as well as spell checking and expanding the query to find other queries synonymous with the user's query. After ranking the document set, a top set of results is displayed together with snippets to show how they were matched.

1.3.6 Ranking

A scoring function calculates scores for documents. Some parts of the scoring can be performed at query time, others at document processing time.

1.3.7 Evaluation

Users' queries and their actions are logged in detail to improve results. For example, if a user clicks on a result then quickly performs the same search again, it is likely that they clicked a poor result.

[8] An inverted index is an index data structure storing a mapping from content, such as words or numbers, to its documents in a set of documents. The purpose of an inverted index is to allow fast full text searches, at a cost of increased processing when a document is added to the database. http://en.wikipedia.org/wiki/Inverted_index
[9] A good overview of Google's "shard" approach is at http://highscalability.com/google-architecture


2 How good can a search engine be?

There are some very specific limits in computer science as to what a computer program is capable of doing, and these have direct consequences for how search engines can index and rank your web pages. The two core sets of problems are NP-Complete problems, which for large sets of data take too long to solve perfectly, and AI-Complete problems, which can't be done perfectly until we have computers that are as intelligent as people. That doesn't mean search engines can't make approximations; for example, finding the shortest route on a map is an NP-Complete problem, yet Google Maps still manages to plot pretty good routes[10].

2.1 NP Hard Problems

Polynomial (P) problems can be solved in polynomial time, that is, relatively quickly. NP hard problems have no known polynomial-time solution, meaning they can't be solved exactly for any reasonably large set of inputs, such as the number of web pages in an index.

The time taken to solve an NP hard problem grows extremely quickly as the size of the problem grows.

These concepts become complex quickly, but the key thing to pick up is that if a problem is NP Hard there is no way it can ever be solved perfectly for something as large as a search engine's index, and approximations will have to be used. There are some NP Hard problems that are of particular interest to SEO:

• The Hamiltonian Path Problem - Detecting a greedy network (i.e. if you interlink your web pages to hoard PageRank) in the structure of a Hamiltonian path[11] is an NP hard problem

• Detecting Page Farms (the set of pages that link to a page) is NP hard[12]

[10] http://www.youtube.com/watch?v=-0ErpE8tQbw
[11] http://en.wikipedia.org/wiki/Hamiltonian_path
[12] See "Sketching Landscapes of Page Farms" by Bin Zhou and Jian Pei


• Detecting Phrase Level Duplication in a Search Engine's Index[13]

2.2 AI Hard Problems

AI Hard problems require intelligence matching that of a human being to be solved. Examples include the Turing Test (tricking a human into thinking they are talking to a human, not a computer), recognising difficult CAPTCHAs and translating text as well as an expert (who wouldn't be perfect either).

During a question-and-answer session after a presentation at his alma mater, Stanford University, in May 2002, Page said that Google would fulfil its mission only when its search engine was AI-complete, and said something similar in an interview with Newsweek and then Playboy.

"I think we're pretty far along compared to 10 years ago," he said. "At the same time, where can you go? Certainly if you had all the world's information directly attached to your brain, or an artificial brain that was smarter than your brain, you'd be better. Between that and today, there's plenty of space to cover." What would a perfect search engine look like? we asked. "It would be the mind of God."[14]

"And, actually, the ultimate search engine, which would understand, you know, exactly what you wanted when you typed in a query, and it would give you the exact right thing back, in computer science we call that artificial intelligence. That means it would be smart, and we're a long way from having smart computers."[15]

Of particular interest to SEO is that fully understanding the meaning of human text is an AI complete problem, and even getting close to understanding words in context is very difficult[16]. This means automatically detecting the quality of reasonable computer generated text against that of a human expert is tricky. It's not unusual to see websites packed with decent computer generated text (detecting which automatically is an AI complete problem) and single phrases stitched together from a variety of sources (which is an NP complete problem) ranking for Google Trends results. This is particularly hard to stop as for new news items there are fewer fresh sources available to choose from; this results in search engine poisoning[17]. Any site that receives a large amount of traffic from this will eventually be visited manually by a Google employee, and penalised manually[18].

Google's solution to the very similar machine translation problem is interesting; rather than attempting to build AI they use their massive resources and data stored from web pages and user queries to build a reliable statistical engine - their approach isn't necessarily far smarter than their competitors', but their resources make them the best translator out there.

[13] See "Detecting phrase-level duplication on the world wide web" by Microsoft Research employees
[14] http://searchenginewatch.com/2156601
[15] http://tech.fortune.cnn.com/2011/02/17/is-something-wrong-with-google/
[16] http://en.wikipedia.org/wiki/Natural_language_understanding
[17] http://igniteresearch.net/spam-in-poisoned-world-cup-results/
[18] http://www.google.co.uk/search?q="Google+Spam+Recognition+Guide+for+Quality+Rater"



2.3 Competitors

Although not a classic computer science problem, a big limit on how search engines can treat possible spam is that competitors could attempt to make your website look like it was spamming, to lower your ranking and increase theirs. For example, if your website suddenly receives an influx of low quality links from sites known to link to spam, how would Google know if you naively ordered this or a competitor did?

This is an unsolvable problem, short of non-stop surveillance of all website owners. This is what Google has to say on the matter[19]:

"There's almost nothing a competitor can do to harm your ranking or have your site removed from our index. If you're concerned about another site linking to yours, we suggest contacting the webmaster of the site in question. Google aggregates and organizes information published on the web; we don't control the content of these pages."

I can say from experience that Google bowling most certainly does happen, and there are a couple of experiments written up on the web[20], though it would be very difficult to Google bowl a popular website. Essentially, if a small percentage of links to a site are most likely spam they are just ignored; if a large percentage are likely spam then the links may result in a penalty rather than just being ignored.

It seems likely that poor quality links are increasingly being ignored. The paper "Link Spam Alliances" from Stanford, the Google founders' alma mater, discusses both dated methods of detecting and punishing potential link spam. Note that link spam isn't the only way that sites can potentially be Google bowled; if your competitor fills your comment section with duplicate content about organ enlargement and links to known phishing sites it is unlikely to help your rankings. Google now also takes into account users choosing to block sites from results[21], presumably with a negative effect.

3 Ranking Factors

Google engineers update their algorithms daily[22]. They then run many tests to check they have the right balance between all these factors.

The following is from an interview with Google's Udi Manber:

Q: How do you determine that a change actually improves a set of results?

A: We ran over 5,000 experiments last year. Probably 10 experiments for every successful launch. We launch on the order of 100 to 120 a quarter. We have dozens of people working just on the measurement part. We have statisticians who know how to analyze data, we have engineers to build the tools. We have at least 5 or 10 tools where I can go and see here are 5 bad things that happened. Like this particular query got bad results because it didn't find something or the pages were slow or we didn't get some spell correction.

[19] http://www.google.com/support/webmasters/bin/answer.py?answer=34449
[20] http://bit.ly/jEKzMa
[21] http://googlewebmastercentral.blogspot.com/2011/04/high-quality-sites-algorithm-goes.html
[22] http://www.nytimes.com/2007/06/03/business/yourmoney/03google

I have created a spreadsheet that shows how a search engine may calculate the ranking of a trivial set of documents for a particular query; you can view it and try changing things yourself at http://igniteresearch.net/poodle-a-simple-emulation-of-search-engine-ranking-factors/.

3.1 On Page Factors

• Keywords

Repetitions of the words in the query in the document, particularly in key areas such as the title and headers, are positive signals of relevance. The proximity of the words together is important, particularly having the exact query in the document. A very large repetition, particularly in non-grammatical sentences, can be a negative signal of spam. Presence of the query words in the domain and URL are useful signals of relevance. Related phrases to the query are also positive signals of relevance (see Latent Semantic Indexing). The meta keywords HTML tag, <meta name="keywords" content="my, keywords">, is largely ignored by modern search engines[23].

• Quality

A number of different authors on a website, good grammar, spelling and long pages written at reasonable time intervals are positive signs of high quality content[24].

• Geographical Locality

Mentions of an address close to the user show the document may be geographically relevant to the user, particularly for geographically sensitive queries such as "plumbers in london".

• Freshness

For time dependent queries, such as news events, recent pages are more likely to be helpful to the user. See Google's "Quality Deserves Freshness" drive, of which Google's faster indexing Caffeine update was a part.

• Duplicate Content

Large percentages of content duplicated either from the same site or others are an indicator of poor quality content, and users will only want to see the canonical copy.

[23] See http://googlewebmastercentral.blogspot.com/2009/09/google-does-not-use-keywords-meta-tag.html
[24] See http://www.seobythesea.com/?p=541


• Adverts

A very large number of adverts can reduce the user experience, and affiliate links are often associated with heavily SEO-manipulated websites.

• Outbound Links

Links to spammy or phishing websites, or an unusually large number of outbound links on a number of pages, are common indicators of a page that users will not want to visit[25].

• Spam

An unusual repetition of keywords, particularly outside of sentences, is a sign of spam. Techniques such as hidden text and sneaky JavaScript redirects are relatively easy to detect and punish.

3.2 Off Page Factors

• Site Reliability

Unreliable or slow sites provide a poor user experience, and so will have a penalty applied. You can be warned if this happens if you sign up for Google Webmaster Tools[26].

• Popularity of the Site

From aggregated ISP data that search engines buy, and search traffic[27].

• Incoming Links / PageRank

The link structure of the internet is a useful pointer to a website's popularity. Anchor text on incoming links related to the query shows a search engine the page is related to the query. Links that remain for a long time, from sites that have many links pointing to themselves, are rated highly. Links that are in boilerplate areas or are sitewide may be ignored. Links that are all identical in anchor text (i.e. blatantly machine generated), from spammy websites ("bad neighbourhoods"[28]), or thought to be paid for with the intention of manipulating rankings or spam can result in penalties. Links from sites that are most likely owned by the same owner, detected either from Whois data or if the sites are hosted within the same Class C IP range, are likely considered less reliable signals of importance. A normal rate of growth of incoming links is expected, as opposed to bursty stop-starts[29] that indicate link building campaigns[30].

[25] See "Improving Web Spam Classifiers Using Link Structure" for a very interesting Yahoo patent on detecting spam based on the number of inbound and outbound links
[26] See http://www.mattcutts.com/blog/site-speed/
[27] See http://trends.google.com/websites?q=bing.com&geo=all&date=all and http://www.compete.com
[28] See http://www.google.com/support/webmasters/bin/answer.py?answer=35769
[29] See http://www.seobook.com/link-growth-profile
[30] See http://www.wolf-howl.com/seo/google-patent-analysis/


• Other indirect signals of a website's popularity

Other data can include mentions in chats, emails and social networks.

• Links from trusted websites

The proximity on the web graph to important, trusted sites (links from old, high PageRank websites at the centre of the old heavily interconnected internet are useful signals that a website can be trusted and is important[31]).

• Links from other sites that rank for the query

Results may be reordered based on how they link to each other.

• Geographical Location

If the geographical location of the server, the website's location according to directories, its top level domain or the location set in Google Webmaster Tools match that of the user, it is a signal that the page will be more relevant to the user, particularly for location sensitive searches.

• User Click Data

If users often search again after clicking on the site's result, that is an indicator that the page is not a good match for the query. The personal history of results clicked, and the pattern of related searches, may help indicate what a user is looking for[32].

• Domain Information

Older domains are likely trusted more. Google is a domain registrar, so has extensive Whois information, and validates that address information associated with domains is correct.

• Manual Reviews

Google Quality Raters[33] manually review websites and tag them with categories such as "essential to query", "not relevant to query" and "spam".

3.3 Google PageRank Notes

Google's PageRank was the innovation that propelled Google to the top of the search engine pile. Whilst its implementation has changed much since its original description, and many other factors are now taken into account, it is still at the heart of modern search engines, so some extra notes will be made on it here.

[31] See http://www.touchgraph.com/seo and type in http://www.nasa.gov for a visual graph
[32] See http://www.seobythesea.com/?p=334
[33] See http://searchengineland.com/the-google-quality-raters-handbook-13575


3.3.1 Short Description

The key point is that PageRank considers each link a vote, and links from pages which have many links themselves are considered more important. Or as Google puts it:

"PageRank reflects our view of the importance of web pages by considering more than 500 million variables and 2 billion terms. Pages that we believe are important pages receive a higher PageRank and are more likely to appear at the top of the search results. PageRank also considers the importance of each page that casts a vote, as votes from some pages are considered to have greater value, thus giving the linked page greater value."

3.3.2 Mathematical Description

It's not essential to have a mathematical understanding of how PageRank is calculated, but for those familiar with basic graph theory and algebra it is useful. You may wish to skip this section, and read a slightly less mathematical description[34]. For a more complete treatment of the mathematics see the original PageRank paper[35], "Deeper Inside PageRank" by Amy N. Langville and Carl D. Meyer, and this thesis[36]. The following is summarised from "Sketching Landscapes of Page Farms"[37] by Bin Zhou and Jian Pei:

The Web can be modeled as a directed Web graph G = (V, E), where V is the set of Web pages, and E is the set of hyperlinks. A link from page p to page q is denoted by edge p -> q. An edge p -> q can also be written as a tuple (p, q).

PageRank measures the importance of a page p by considering how collectively other Web pages point to p directly or indirectly. Formally, for a Web page p, the PageRank score is defined as:

PR(p) = (1 - d)/N + d * sum over q in M(p) of PR(q)/OutDeg(q)

where M(p) = { q | q -> p } is the set of pages having a hyperlink pointing to p, OutDeg(q) is the out-degree of q (i.e., the number of hyperlinks from q pointing to some pages other than q), N is the total number of pages, and d is a damping factor (0.85 in the original PageRank implementation) which models the random transitions of the web. If a damping factor of 0.5 is used then at each page there is a 50/50 chance of the surfer clicking a link, or jumping to a random page on the internet. Without the damping factor the PageRank of any page with an outgoing link would be 0.

[34] See the introductions of http://www.sirgroane.net/google-page-rank/, http://www.webworkshop.net/pagerank.html or the Wikipedia article
[35] At http://ilpubs.stanford.edu:8090/422/1/1999-66.pdf
[36] http://web.engr.oregonstate.edu/~sheldon/papers/thesis.pdf
[37] See http://www.cs.sfu.ca/~bzhou/personal/paper/sdm07_page_farm.pdf



To calculate the PageRank scores for all pages in a graph, one can assign a random PageRank score value to each node in the graph, then apply the above equation iteratively until the PageRank scores in the graph converge.
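This iteration can be sketched in a few lines of Python. The four-page graph below is made up for illustration, and the damping factor of 0.85 simply matches the value quoted above; this is a sketch of the naive algorithm, not how any search engine computes PageRank at scale.

def pagerank(graph, d=0.85, iterations=50):
    """graph maps each page to the list of pages it links to.
    Returns an approximation of PR(p) for every page p."""
    pages = list(graph)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}  # any starting distribution works
    for _ in range(iterations):
        new_pr = {}
        for p in pages:
            incoming = sum(pr[q] / len(graph[q]) for q in pages if p in graph[q])
            new_pr[p] = (1 - d) / n + d * incoming
        pr = new_pr
    return pr

# A made-up four-page web: A and B link to each other and to C, C links to D.
graph = {
    "A": ["B", "C"],
    "B": ["A", "C"],
    "C": ["D"],
    "D": [],  # a dangling page; its rank leaks in this simplified version
}
for page, score in sorted(pagerank(graph).items(), key=lambda x: -x[1]):
    print(page, round(score, 3))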

The Google toolbar PageRank is a logarithmic scale out of 10, not the actual internal data. For example:

Domain        Calculated PageRank   PageRank displayed in Toolbar
small.com     47                    2
medium1.com   54093                 5
medium2.com   84063                 5
big.com       1234567               7
big2.com      2364854               7

3.3.3 Interesting Notes on the Original Implementation of PageRank

From "PageRank Uncovered"[38], essential reading for those looking to understand PageRank from an SEO perspective:

• PageRank is a multiplier, applied after relevant results are found

"Remember, PageRank alone cannot get you high rankings. We've mentioned before that PageRank is a multiplier; so if your score for all other factors is 0 and your PageRank is twenty billion, then you still score 0 (last in the results). This is not to say PageRank is worthless, but there is some confusion over when PageRank is useful and when it is not. This leads to many misinterpretations of its worth. The only way to clear up these misinterpretations is to point out when PageRank is not worthwhile. If you perform any broad search on Google, it will appear as if you've found several thousand results. However, you can only view the first 1000 of them. Understanding why this is so explains why you should always concentrate on "on the page" factors and anchor text first, and PageRank last."

• Each page is born with a small amount of PageRank

A page that is in the Google index has a vote, however small. Thus, the more pages you have in the index, the more overall vote you are likely to have. Or, simply put, bigger sites tend to hold a greater total amount of PageRank within their site (as they have more pages to work with).

Note that Google's original algorithm has most likely been amended since, to detect and reduce PageRank hoarding and the generation of PageRank by massive interlinking of auto generated pages. Also, for quicker calculations, an approximation of PageRank which only gives certain seed pages PageRank may be used[39].

[38] See http://www.bbs-consultant.net/IMG/pdf_PageRank.pdf


Interestingly, however, there are examples of this working; see "How to get billions of pages indexed in Google" at http://www.threadwatch.org/node/6999. In a related issue, at one point 10% of MSN Search's (now known as Bing) German index was computer generated content on a single domain[40].

[39] For more on why this shouldn't work see http://www.pagerank.dk/Pagerank/Generate-pagerank.htm
[40] See http://research.microsoft.com/pubs/65144/sigir2005.pdf

3.3.4 Optimal Linking Strategies

Deciding how to interlink pages that you own or have influence over is tricky; interlinking can be a good signal that pages are related and on a certain topic, build PageRank and control PageRank flow. However, heavy interlinking can be a signal of manipulation and spam, and different linking structures can make different sites in your possession rank higher. The mathematics gets tricky fast; here is a quick overview of the literature today:

• Note from "Web Spam Taxonomy"

Though written about spam farms, the maths holds true for good commercial sites too. Essentially this states that maximum PageRank for a target page is achieved by linking only to the target page from forums, blogs etc., then interlinking the network of sites owned (as if there are no outlinks on a page the "random surfer" will jump to a random page on the Internet).

"1. Inaccessible pages are those that a spammer cannot modify. These are the pages out of reach; the spammer cannot influence their outgoing links. (Note that a spammer can still point to inaccessible pages.)

2. Accessible pages are maintained by others (presumably not affiliated with the spammer), but can still be modified in a limited way by a spammer. For example, a spammer may be able to post a comment to a blog entry, and that comment may contain a link to a spam site.

3. Own pages are maintained by the spammer, who thus has full control over their contents.

We can observe how the presented structure maximizes the total PageRank score of the spam farm, and of page t in particular:

1. All available n own pages are part of the spam farm, maximizing the static score total PageRank.

2. All m accessible pages point to the spam farm, maximizing the incoming score incoming PageRank.


3. Links pointing outside the spam farm are suppressed, making the outgoing PageRank PRout zero.

4. All pages within the farm have some outgoing links, rendering a zero PRsink score component.

Within the spam farm, the score of page t is maximal because:

1. All accessible and own pages point directly to the target, maximizing its incoming score PRin(t).

2. The target points to all other own pages. Without such links, t would have lost a significant part of its score (PRsink(t) > 0), and the own pages would have been unreachable from outside the spam farm. Note that it would not be wise to add links from the target to pages outside the farm, as those would decrease the total PageRank of the spam farm."

• From "Link Spam Alliances"

The analysis that we have presented shows how the PageRank of target pages can be maximized in spam farms. Most importantly, we find that there is an entire class of farm structures that yield the largest achievable target PageRank score. All such optimal farm structures share the following properties:

1. All boosting pages point to and only to the target.

2. All hijacked pages point to the target.

3. There are some links from the target to one or more boosting pages.

• From "Maximizing PageRank via Outlinks"

In this paper we provide the general shape of an optimal link structure for a website in order to maximize its PageRank. This structure with a forward chain and every possible backward link may be not intuitive. To our knowledge, it has never been mentioned, while topologies like a clique, a ring or a star are considered in the literature on collusion and alliance between pages. Moreover, this optimal structure gives new insight into the affirmation of Bianchini et al. that, in order to maximize the PageRank of a website, hyperlinks to the rest of the webgraph should be in pages with a small PageRank and that have many internal hyperlinks. More precisely, we have seen that the leaking pages must be chosen with respect to the mean number of visits before zapping they give to the website, rather than their PageRank.

• From "The effect of New Links on PageRank" by Xie

Theorem: The optimal linking strategy for a Web page is to have only one outgoing link pointing to a Web page with a shortest mean first passage time back to the original page.

Conclusions: ... We conclude that having no outgoing link is a bad policy and that the best policy is to link to pages from the same Web community. Surprisingly, a new incoming link might not be good news if a page that points to us gives many other irrelevant links at the same time.

Reading this paper fully, it is only in very particular circumstances that a new incoming link is not good news.


3.3.5 Implementation to make computing PageRank faster

There have been a number of proposed improvements to the original PageRank algorithm to improve the speed of calculation[41], and to adapt it to be better at determining quality results. No search engine calculates PageRank as shown in the naive algorithm in the original paper[42].

3.3.6 HITS

HITS is another ranking algorithm that takes into account the pattern of links found throughout the web, and it was released at around the same time as PageRank. HITS treats some pages on the web as authorities, which are good documents on a topic, and hubs, which mostly link to authorities.

A page is given a high authority score by being linked to by pages that are recognized as hubs for information. A page is given a high hub score by linking to nodes that are considered to be authorities on the subject.

Unlike PageRank, which is query independent and so computed at indexing time, HITS hub and authority scores are query dependent and so computed (though likely cached) at query time.
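A sketch of the hub/authority iteration on a small, made-up graph follows; the normalisation scheme and fixed iteration count are arbitrary simplifications of the published algorithm.

import math

def hits(graph, iterations=20):
    """graph maps each page to the pages it links to.
    Returns (hub_scores, authority_scores)."""
    pages = list(graph)
    hubs = {p: 1.0 for p in pages}
    auths = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # authority score: sum of the hub scores of pages linking to you
        auths = {p: sum(hubs[q] for q in pages if p in graph[q]) for p in pages}
        norm = math.sqrt(sum(a * a for a in auths.values())) or 1.0
        auths = {p: a / norm for p, a in auths.items()}
        # hub score: sum of the authority scores of the pages you link to
        hubs = {p: sum(auths[q] for q in graph[p]) for p in pages}
        norm = math.sqrt(sum(h * h for h in hubs.values())) or 1.0
        hubs = {p: h / norm for p, h in hubs.items()}
    return hubs, auths

# "directory" mostly links out (a hub); "guide2" is mostly linked to (an authority).
graph = {
    "directory": ["guide1", "guide2"],
    "guide1": ["guide2"],
    "guide2": [],
}
hubs, auths = hits(graph)
print("top hub:", max(hubs, key=hubs.get))
print("top authority:", max(auths, key=auths.get))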

3.3.7 Is linking out a good thing?

Whilst TEOMA is the only search engine that uses HITS at its core, its thinking has heavily influenced search engine designers - so it is likely that linking out to high quality authorities can positively influence either a page's ranking (though potentially negatively, if designers want authorities rather than hubs to appear in their results[43]), or the importance of the other links it contains. Many webmasters fear linking out to sites as they would rather keep links internal to prevent PageRank "flowing out" (many webmasters also nofollow links for similar reasons, though this form of PageRank sculpting no longer works according to Matt Cutts, Google's head of [anti] web spam).

Matt Cutts also said a number of years ago:

"Of course, folks never know when we're going to adjust our scoring. It's pretty easy to spot domains that are hoarding PageRank; that can be just another factor in scoring."

Some search engines are even concerned about people linking out too much; whilst crawlers can now index a large number of links on a page, a very large number of outbound links often indicates that a site has been hacked with spam links or is machine generated.

"A spammer might manually add a number of outgoing links to well-known pages, hoping to increase the page's hub score. At the same time, the most wide-spread method for creating a massive number of outgoing links is directory cloning"[44].

[41] For example, see "Computing PageRank using Power Extrapolation" and "Efficient PageRank Approximation via Graph Aggregation"
[42] Matt Cutts discusses a couple of the implementation details at http://www.mattcutts.com/blog/more-info-on-pagerank/
[43] See http://www.wolf-howl.com/seo/seo-case-study-outbound-links/ and "Deeper Inside PageRank", discussed earlier



3.3.8 TrustRank / Bad Page Rank

It's likely that after results are generated based on relevance, PageRank is then applied to help order them, then TrustRank to help order the results further. A site may lose trust every time it fails some kind of spam test (for example if a large number of reciprocal links are found, cloaking, duplicate content, fake Whois data) and gain trust for certain properties (domain age, traffic, being one of a number of important "seed" sites that are manually tagged as trusted sites). These initial TrustRanks could then be propagated in a similar way to PageRank, so linking to and from "bad neighborhoods" would negatively affect a site's TrustRank through association[45].

From SEO By The Sea:

In 2004, a Yahoo whitepaper was published which described how the search engine might attempt to identify web spam by looking at how different pages linked to each other. That paper was mistakenly attributed to Google by a large number of people, most likely because Google was in the process of trademarking the term TrustRank around the same time, but for different reasons. Surprisingly, Google was granted a patent on something it referred to as Trust Rank in 2009, though the concept behind it was different than Yahoo's description of TrustRank. Instead of looking at the ways that different sites linked to each other, Google's Trust Rank works to have pages ranked according to a measure of the trust associated with entities that have provided labels for the documents.

[44] See "Web Spam Taxonomy"
[45] See http://bakara.eng.tau.ac.il/semcomm/GKRT.pdf and http://www.freepatentsonline.com/7603350.html and http://www.cs.toronto.edu/vldb04/protected/eProceedings/contents/pdf/RS15P3.PDF


...If you've ever heard or seen the phrase "TrustRank" before, it's possible that whoever was writing about it, or referring to it, was discussing a paper titled Combating Web Spam with TrustRank (pdf). While the paper was the joint work of researchers from Stanford University and Yahoo!, many writers have attributed it to Google since its publication date in 2004. The confusion over who came up with the idea of TrustRank wasn't helped by Google trademarking the term "TrustRank" in 2005. That trademark was abandoned by Google on February 29, 2008, according to the records at the US PTO TESS database. However, a patent called "Search result ranking based on trust" deals with something called trust rank, filed on May 9, 2006.

Google mentions distrust and trust changes as indicators. More than trust analysis, trust variation analysis is on the road. Fake reviews, sponsored blogs and e-commerce trust network influence are pointed out.

The paper "A Cautious Surfer for PageRank" comments on why TrustRank shouldn't be overused:

"However, the goal of a search engine is to find good quality results; spam-free is a necessary but not sufficient condition for high quality. If we use a trust-based algorithm alone to simply replace PageRank for ranking purposes, some good quality pages will be unfairly demoted and replaced, for example, by pages within the trusted seed sets, even though they may be much less authoritative. Considered from another angle, such trust-based algorithms propagate trust through paths originating from the seed set; as a result, some good quality pages may get low value if they are not well connected to those seeds."

3.3.9 Improvements to Google's ranking algorithms

There have been a number of notable algorithm changes which caused considerable changes to appear in results pages, though often the effects were later scaled back slightly.

• NoFollow

Matt Cutts and Jason Shellen created the nofollow specification to help limit the effect of, and incentive for, blog spam. If a search engine comes across a link tagged as nofollow, it will not treat the link as a vote, i.e. as a positive signal in rankings. Areas where untrusted users can post content are often tagged nofollow; roughly 80% of content management systems (the software that websites run on) implement nofollow.

The HTML code of a nofollow link:

<a href="signin.php" rel="nofollow">sign in</a>

• Increasing use of anchor text

Even the original PageRank algorithm took into account the anchor text of links, so links were used to give both a number that indicated the site's popularity and information about the content of a document, and so its relevance for user queries.


• Google Bombing Prevention, 2nd February 2007

Google bombing is the process of massively linking to a page with a specific anchor text, to give PageRank but, more importantly, indications that the document is related to the anchor text. For example, in 1999 a number of bloggers grouped together to link to Microsoft.com with the anchor text "more evil than Satan himself". This resulted in Microsoft being placed number one in searches for "more evil than Satan himself" despite not having the phrase anywhere on its page. Detecting a sudden influx of links with identical anchor text is very easy, and in 2007 Google changed their indexing structure so that Google bombs such as "miserable failure" would "typically return commentary, discussions, and articles" about the tactic itself. Matt Cutts said the Google bombs had not "been a very high priority for us. Over time, we've seen more people assume that they are Google's opinion, or that Google has hand-coded the results for these Google-bombed queries. That's not true, and it seemed like it was worth trying to correct that perception."[46] Some Google bombs still work, particularly those targeting unusual phrases, with varied anchor text, over a period of time, within paragraphs of text.

• Florida, November 2003

Results for highly commercial queries, likely identified from the cost of AdWords, became heavily filtered so that more trusted academic websites and less commercially optimised websites ranked. Some of these changes resulted in less relevance (for example, a user searching for "buy bricks" probably didn't want to mainly see websites about the process of creating bricks), and were rolled back. For more see [47] and [48].

• Bourbon, June 2005

A penalty was applied to sites with unusually fast or bursty patterns of link growth.

• Jagger, October 2005

A penalty applied to sites with unusually large amounts of reciprocal links, plus new methods for detecting hidden text.

• Big Daddy, December 2005

According to Matt Cutts, punished were "sites where our algorithms had very low trust in the inlinks or the outlinks of that site. Examples that might cause that include excessive reciprocal links, linking to spammy neighborhoods on the web, or link buying/selling."[49]

[46] See http://answers.google.com/answers/main?cmd=threadview&id=179922
[47] http://www.searchengineguide.com/barry-lloyd/been-gazumped-by-google-trying-to-make-sense-of-the-florida-update.php
[48] http://www.seoresearchlabs.com/seo-research-labs-google-report.pdf
[49] See http://www.webworkshop.net/googles-big-daddy-update.html


• Caffeine, October 2010

A faster indexing system that changed results little, but allowed for fresher results and some of the later Panda updates[50].

• Panda, April 2011

A penalty applied to content deemed low quality, detected primarily from user data. Websites which contained masses of articles, focusing on quantity over quality, were often hit[51].

4 Detecting Spam and Manipulation

You will often hear that your site has to look "natural" to the search engines. Just what "natural" means is hard to define, but essentially it means the profile of a site whose popularity was never engineered or promoted, and was instead based on people luckily coming across it and deciding to recommend it to their friends with links. What's more, you also need to make your site look "popular": creating no links to your site yourself will look "natural", but you will have no chance of competing with people who do unless you have the cash to buy large amounts of advertising. This section briefly covers what search engines consider to be acceptable, when and how they can detect violations, and what the potential penalties are.

4.1 Google Webmaster Guidelines

Google have created a page called "Webmaster Guidelines" to inform users of what they consider to be acceptable methods of promoting your website. Whilst the lines for crossing general principles such as "Would I do this if search engines didn't exist?" are somewhat vague, they do offer some specific notes of what not to do:

• Avoid hidden text or hidden links.

• Don't use cloaking or sneaky redirects.

• Don't send automated queries to Google.

• Don't load pages with irrelevant keywords.

• Don't create multiple pages, sub domains, or domains with substantially duplicate content.

[50] See http://googleblog.blogspot.com/2010/06/our-new-search-index-caffeine.html
[51] See http://blog.searchmetrics.com/us/2011/04/12/googles-panda-update-rolls-out-to-uk/ and http://www.seobook.com/questioning-questions and http://googlewebmastercentral.blogspot.com/2011/05/more-guidance-on-building-high-quality.html


• Don't create pages with malicious behavior, such as phishing or installing viruses, trojans, or other badware.

• Avoid "doorway" pages created just for search engines, or other "cookie cutter" approaches such as affiliate programs with little or no original content.

• If your site participates in an affiliate program, make sure that your site adds value. Provide unique and relevant content that gives users a reason to visit your site first.

Most of the methods listed above are naive and easy to detect. Google have been fairly successful in aligning successful manipulation with the creation of genuine content, though without any promotion it is unlikely even the best content will be noticed.

4.2 Penalties

Penalties[52] that Google applies to detected manipulation vary in length of time and effect, from small ranking penalties for certain keywords on a page to site-wide bans, depending upon the sophistication of the manipulating methods and the quality of the offending site. If you believe you have had one applied, you can submit a Google Reconsideration Request (http://www.google.com/support/webmasters/bin/answer.py?answer=35843) from Google Webmaster Tools, once you have fixed the offending issues.

4.3 Detecting Manipulation in Content

There is a fascinating paper by Microsoft which details a number of methods for detecting spam pages in a search engine's index based on their content. A simple way is to use Bayesian filters (one is included with Ignite SEO to test your content as the search engines would), so for example seeing the phrase "buy pills" would be a strong indicator of spam. Most of the research is on detecting blatantly computer generated lists of keywords, which is fairly easy to detect. Detecting the quality of human written content is very difficult, so unless you are endlessly repeating your keywords, if you are writing your own content you can be reasonably happy with its quality in search engines' eyes.
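A minimal sketch of such a Bayesian filter is shown below, trained on a handful of made-up example phrases; a real filter would be trained on a large labelled corpus, and this is not the implementation shipped with Ignite SEO or used by any search engine.

import math
from collections import Counter

def train(examples):
    """examples: list of (text, label) pairs with label 'spam' or 'ham'."""
    counts = {"spam": Counter(), "ham": Counter()}
    totals = Counter()
    for text, label in examples:
        counts[label].update(text.lower().split())
        totals[label] += 1
    return counts, totals

def classify(text, counts, totals):
    """Return the label with the highest naive Bayes log-probability."""
    vocab = set(counts["spam"]) | set(counts["ham"])
    best_label, best_score = None, -math.inf
    for label in ("spam", "ham"):
        score = math.log(totals[label] / sum(totals.values()))  # prior
        n_words = sum(counts[label].values())
        for word in text.lower().split():
            # Laplace smoothing so unseen words don't zero out the score
            score += math.log((counts[label][word] + 1) / (n_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

examples = [
    ("buy pills cheap pills online", "spam"),
    ("cheap viagra buy now", "spam"),
    ("driving school lessons in springfield", "ham"),
    ("our opening hours and prices", "ham"),
]
counts, totals = train(examples)
print(classify("buy cheap pills", counts, totals))         # spam
print(classify("driving lessons prices", counts, totals))  # ham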

The following graphs are taken from "Detecting Spam Web Pages through Content Analysis"[53] by Microsoft Research employees.

4.4 Detecting Manipulation in Links

Much research has focused on detecting spam pages through their backlinks or outlinks. Yahoo obtained a patent that uses the rate of link growth to detect manipulation.

[52] http://www.forbes.com/2007/04/29/sanar-google-skyfacet-tech-cx_ag_0430googhell.html
[53] http://cs.wellesley.edu/~cs315/Papers/Ntoulas-DetectingSpamThroughContentAnalysis.pdf


Essentially a constant rate of new backlinks, perhaps with a small growth over time, is expected for a typical site. A "saw-tooth" pattern of inlinks is a strong indicator of backlink campaigns that start and stop (though it could also be an indicator of, say, a site that releases new software monthly).
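As an illustration, here is a sketch of how one might flag that kind of bursty, start-stop link growth from a monthly series of new-link counts. The threshold of 5x the trailing average and the three-month window are arbitrary choices for the example, not figures from any patent.

def flag_link_bursts(new_links_per_month, factor=5.0, window=3):
    """Return the indices of months whose new-link count jumps far above the
    trailing average - a crude proxy for start-stop link building campaigns."""
    flagged = []
    for i in range(window, len(new_links_per_month)):
        trailing = new_links_per_month[i - window:i]
        average = sum(trailing) / window
        if average > 0 and new_links_per_month[i] > factor * average:
            flagged.append(i)
    return flagged

# Made-up monthly counts of newly discovered backlinks for one site.
history = [12, 15, 14, 13, 16, 250, 8, 11, 9, 14, 230, 13]
print(flag_link_bursts(history))  # [5, 10]: the two spikes stand out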

In their paper, Fetterly et al. analyse the indegree (incoming links/backlinks) and outdegree (links on the page) distributions of web pages:

"Most web pages have in and outdegrees that follow a power law distribution. Occasionally, however, search engines encounter substantially more pages with the exact same in or outdegrees than what is predicted by the distribution formula. The authors find that the vast majority of such outliers are spam pages."

As discussed in the TrustRank section earlier, a large amount of links from sites that have already been detected as linking to spam (so called "untrustworthy hubs") is a negative indicator. Links from unrelated websites, reciprocal links, links out of content, links from sites that are known to host paid links, and many other signals are likely taken into consideration.

Zhang et al. have identified a method for identifying unusually highly interconnected groups of web pages. More methods of identifying manipulative sites are listed in "Link Spam Alliances" by Gyöngyi and Garcia-Molina.

4.5 Other Methods

If you think a competitor has been using methods that violate the webmaster guidelines, you can report them to Google[54]. It's good practice to ensure that any site you wish to keep for a long time, and expect to get reasonable amounts of traffic from, stays within these guidelines.

Google will sometimes manually review websites without prompting; Google Quality Raters inspect sites for relevance to results but can also flag web pages as spam. Particular markets are inspected more often than others.

[54] https://www.google.com/webmasters/tools/spamreport?hl=en&pli=1


Part II

Practice

5 An Example Campaign

Now we've covered the theory, it's time for a real world example of putting it into practice.

5.1 Company Profile

John runs a driving school in Springfield, Ohio. He has a website he has owned for a couple of years, which ranks around the second page for most searches related to driving schools in Ohio and receives about 20 visitors a day, a third from search engines and two thirds from links from local websites.

A quick search for what he imagines would be his main keyword, "driving school Springfield Ohio", has a company directory site at the top, followed by other directories, companies and people asking on forums for recommendations. This mix of relevant small companies' web sites and small pages on big websites indicates the keyword to be of medium difficulty to rank for.

5.2 Goals

John thinks if he can get his site to rank 3rd instead of around the middle of the second page for his core keywords, he will increase his search traffic by around 1000%, his overall traffic by about 300%, and roughly double his sales. He aims to do this over a period of roughly one month.

5.3 Competitor Research

John finds his main competitors by searching, and gets estimates of their traffic sources using sites such as compete.com and serversiders.com. A tool such as Ignite SEO can automatically build SEO reports of competitors, listing their paid and organic keywords, demographics and backlinks. Looking at the HTML source code of some of his competitors displays their targeted keywords in the <meta name="keywords" content="keyword1, keyword2"> tag.

5.4 Keyword Research

John takes his initial guesses of what potential customers might search for, and those from his competitors and his existing traffic, and using the Google Keyword Tool[55] and Google Insights[56] expands this list.

[55] https://adwords.google.co.uk/select/KeywordToolExternal
[56] http://www.google.com/insights


5.5 Content Creation

John takes his keywords and creates a small amount of content on his website containing them. He then creates a large amount of content quickly and creates sites hosted on free hosting sites[57], each one targeting a different keyword. The content generator section of Ignite SEO[58] is perfect for this.

5.6 Website Check

Before investing in off site promotion (i.e. link building), it is worth performing a quick check that the site is search engine friendly. Creating an account in Google Webmaster Tools will let you know if Google has any issues indexing your website, and it is worth ensuring navigation isn't over-reliant on JavaScript or Flash.

5.7 Link Building

This is the core process that will actually improve John's rankings. By looking at his competitors' backlinks using Yahoo's linkdomain: command, John replicates their links to his website by visiting each site one by one. Using a tool such as Ignite SEO, he can automatically build links to the hosted sites he quickly created in 5.5, without the risk of a link campaign negatively affecting the rankings of his core website. Other signals of quality, such as Facebook and Twitter recommendations, are built here.

5.8 Analysis

The success of the campaign is measured with a good tracking system such as Google Analytics, as well as by tracking the new incoming links with Google Webmaster Tools and Yahoo's link: command. The results are compared with the goals, and the whole process is refined and repeated.

[57] http://igniteresearch.net/which-web-2-0-ranks-best-hubpages-vs-squidoo-vs-tumblr-vs-blogspot-etc/
[58] http://igniteresearch.net


About the Author

Christopher Doman is a partner of Ignite Research, a firm specialising in software and consultancy for search engine marketing. He holds a BA in Computer Science from the University of Cambridge.
