Risks of search engine dependency and its influence on data quality

Risks of search engine dependency and

its influence on data quality

Thesis intermediate report submitted for the European Master in Business Studies

(EMBS)

by Ronan CHARDONNEAU

Institut de Management de l'Université de Savoie d'Annecy (FR)

Università degli studi di Trento (IT)

Universität Kassel (GER)

Universidad de León (SP)

Date of submission: 26th January, 2009

Master Thesis

Disclaimer

Before starting this report I inform any readers that this is an intermediate

report of a final master thesis which will be finished in June 2009.

The work within is the result of months of work since June 2008. To find

what inspired me please have a look at my blog (in French) at:

http://moteurs-de-recherches-alternatifs.blogspot.com/

The following intermediate report has not taken yet ( at the date of February

the 13th 2009) in consideration the pieces of advice from my teachers.

Even if there are not many corrections to do this work is then not perfect yet.

I removed the acknowledgements part for the public version of the thesis.

Lastly I would like to inform any readers of this work that I will normally

be graduated in June 2009 and that from this date I will be actively looking for a

job or a phd in the field of international e-marketing.

I am very flexible and able to move at any place around the world.

Please feel free to contact me though my blog by posting a comment or at

[email protected].

CHARDONNEAU Ronan - European Master in Business Studies 2007/2009 2/66

mailto:[email protected]



Educational background

□ 2007/2009European Master in Business Studies Quadruple degree in international management (Trento, Annecy, Kassel Léon)

□ 2006/2007A three-year degree in Import-Export (IUT Quimper)

□ 2004-2006A two-year degree in Business and Administration: Finance, Accountancy and Computering(IUT Nantes)

• 2004Scientific A-level(Lycée Clemenceau Nantes)

Professional projects□ Market study

Possibilities of entering the Italian market for Easteq China Offshore

□ Market studyResearch of distributors and competitors in Germany for Thyssen Krupp Elevators

□ Setting up a prototype computer application for the students registration system of the university of Nantes

Foreign languages

□ Trilingual: French, English, ItalianAutonomous in those three languages

• Chinese Able to stand basic conversations

□ German and Spanish Elementary level (A2)

Professional experience

• Search Engine Optimizer(Eng-IT)June- September 2008 – Search engine

optimizer in English and Italian Edexon Venice - Italy

• International salesmanApril- august 2007 – Selling IT services

on the local foreign marketEasteq Shanghai - China

□ Assistant of the director of operationsApril- June 2006 – Creation of a search

engine for a database Northwest Folklife Seattle - USA

□ Track and Field coach trainer2006- 2008

□ MailmanSummer 2004-2006 – (La Poste - France)

Others

□ Extra scholars activitiesAuthor of several blogs dealing with trips and many articles about Web applications.

□ Centers of interestsForeign languages, traveling (more than 20 countries visited), computers and technology

Computer skills

• Search Engine Optimization6 months of experience in this field

□ Software Suite Competent in GIMP, Microsoft and Open Office. Notions of Linux/Ubuntu• Programming Notions of PHP and XHTML

[email protected]: http://moteurs-de-recherches-alternatifs.blogspot.com/

Ronan CHARDONNEAU



mailto:[email protected]

http://www.journaldugratuit.com/actu/inscription.php

http://quimpershanghai.travelblog.fr/

http://www.nwfolklife.org/

http://www.easteq.com/

http://www.edexon.com/

http://www.iutnantes.univ-nantes.fr/904/0/fiche___formation/



http://www.univ-brest.fr/iutquimp/IUT/licencepro_dev-int-PME-long.htm

http://www.imus.univ-savoie.fr/aid=346.phtml

Contents

Foreword.......................................................................................................................6

Chapter 1: Introduction of the topic background..........................................................8

1.1 Relevance of the subject...................................................................................10

1.2 Major terms......................................................................................................11

1.3 Focus, goals and structure of the report...........................................................11

Chapter 2: Concept of data quality.............................................................................13

2.1 Data quality definition......................................................................................14

2.2 The importance of data quality.........................................................................15

Chapter 3: Search engines dependency.......................................................................16

3.1 Search engine market configuration.................................................................17

3.1.1 Search engine categories..........................................................................17

3.1.2 Search engine market...............................................................................19

3.1.3 The search engines in the world...............................................................19

3.1.4 The search engine market shares per country...........................................22

3.1.5 The search engines competition...............................................................23

3.1.6 The semantic web.....................................................................................24

3.2 Search engines dependency aspect...................................................................25

3.2.1 Search engines dependency proves..........................................................25

3.2.2 Search engines dependency aspect...........................................................27

3.3 Search engines dependency problems..............................................................28

3.3.1 Privacy issues...........................................................................................29

3.3.2 Looking for other search engines.............................................................30

3.3.3 Search engine awareness..........................................................................30

3.3.4 Other search engines existence awareness...............................................32

3.3.5 Less confident regarding other search engines.........................................33

3.3.5 Less confident regarding other search engines.........................................33

3.3.6 Even the best cannot provide you everything...........................................34

Chapter 4: Risks of search engines dependency and its influence on data quality.....35

4.1 The information has been found but is poor....................................................36

4.2 What the search engines do not tell you...........................................................36

4.3 The best way to get data quality.......................................................................37


4.3.1 The sub-search engines.............................................................................37

4.3.2 The size of the Internet.............................................................................38

4.3.3 Single search engine Internet coverage....................................................39

4.3.4 Multiple search engine Internet coverage.................................................42

4.3.5 Others search engine Internet coverage....................................................44

4.3.6 A concrete representation of the World Wide Web...................................46

4.4 The gap between search engine dependency and data quality.........................47

Chapter 5: The Google example.................................................................................50

5.1 Google..............................................................................................................51

5.2 Google's success...............................................................................................51

5.3 Google dependency state..................................................................................52

5.4 Google functions..............................................................................................52

5.5 Google added functionalities............................................................................53

5.6 Google success is his weakness.......................................................................53

5.7 Google's disappearance hypothesis..................................................................54

Conclusion..................................................................................................................55

Declaration..................................................................................................................56

List of literature...........................................................................................................57

Afterword....................................................................................................................61


Table of figuresIllustration 1: Total Sites Across All Domains August 1995 - January 2009................9

Illustration 2: A light interface: the homepage of the search engine www.ask.fr.......17

Illustration 3: Home page of the Mail.ru portal..........................................................18

Illustration 4: Top 10: Search websites in the world August 2007.............................19

Illustration 5: The most visited website by country....................................................20

Illustration 6: Search engine market shares in Czech Republic..................................22

Illustration 7: Results page of Cuil.............................................................................24

Illustration 8: Search engines figures for France, Source: XitiMonitor .....................28

Illustration 9: Some sub-search engines of Google....................................................37

Illustration 10: Google's Index coverage....................................................................40

Illustration 11: “Powered by Google” search engines coverage.................................41

Illustration 12: Search engines coverage....................................................................43

Illustration 13: “Who is like it” search engine............................................................44

Illustration 14: Dogpile meta search engine...............................................................45

Illustration 15: A representation of the most visited websites....................................46

Illustration 16: Comparison among Google, Yahoo and MSN...................................48

Illustration 17: Google's domination in Europe..........................................................52

Illustration 18: Google's Advanced search engine......................................................53

Illustration 19: Eye Tracking on Google.....................................................................54

CHARDONNEAU Ronan - European Master in Business Studies 6/66

Foreword

As most of the students who has a computer one of my first move when I

wake up is to switch on the computer and to spend my first twenty minutes of the day

on the Internet.

From there I have a look at the last news, I check my e-mails and eventually

exchange some few words with a couple of friends by using online chat applications.

I also check my other email account as well as my blogs and analyze the traffic I got

during the last few days, to finish this process I consult my advertisement account to

see if I got some revenues. I often use as well search engine to look for information

which just came up into my mind during the night.

In the paragraph you just read was the description of my morning routine on

Internet. There is nothing special except that most of the moves I described above are

in fact done on two to three major search engines: Google, Yahoo and Microsoft.

I hardly ever use Yahoo or Microsoft for search purpose but Google is for

sure the website I visit the most to crawl the web but... is Google the Internet?

I got the idea to write about: « Risks of search engine dependency and its

influence on data quality » not because I was using all those Google applications

everyday and was scared about what will happen if I get in troubles with Google

such as privacy issues or if Google just closed. I just write about it because one day I

found Google results not accurate enough.

And from this observation a lot of questions came to my mind:

• Is it me who is not good enough at performing research on the Internet?

• Is it because no one wrote about the information I am looking for?

• Is it because the information is not on the first pages in Google that I have to

browse all the pages in order to find it?

• Is it because Google is not good enough?

• Is it because the information is hidden in some other documents such as PDF,

pictures, videos?

• Is it because I have to use another way to crawl the web and if yes how?

You see here how a simple observation can raise a lot of questions.

I hesitated a lot about writing on this topic, the main problem I got was that I

was not convinced that there is a potential risk of being search engine dependent. The


reason is that companies such as Google are working hard in order to fit Internet

users expectations and the vision we get is that they are doing a wonderful work. The

problem is that there could be a difference between perception and real facts and this

is exactly what I am eager to discover here.

Can we measure how huge is the gap between the information we were

looking for and the one of search engines as Google are providing us?

Search engines are set up to find information on the Internet, information

being the basis of any good decisions making we can then understand how important

and interesting it is to write on this topic.

I hope you will appreciate this reading as much as I did when making my

research.

Ronan CHARDONNEAU


Chapter 1: Introduction of the topic background


I will not surprise you if I say that Internet has been created to share

information and to communicate with each others.

It is hard to evaluate how big is the Internet, estimations among companies

are very different, it varies from 15 to some 30 billion Web pages1. The number of

websites is increasing everyday and estimated at 185,167,8972 with a constant

augmentation since the creation of the world wide web.

Illustration 1: Total Sites Across All Domains August 1995 - January 2009

Habits have changed since the creation of the Internet and websites are used now in

diverse manners if it comes to be a standard for companies (recognized as a mark of

trust, seriousness and quality) it is also a space for many individuals (blog

phenomenon). As an example regarding France, in June 2008 14% of French people

above 12 year-old which means 22% of French Internet users are authors of a blog or

a website3.

The banalization of the Internet and the fact that anyone can create his own

website for free increase the feeling we have regarding the Internet: a true jungle of

information and even sometimes real “dump” regarding information accuracy.

1 cf. Koch, P. / Koch, S. (2009): How big is the Internet?.2 Netcraft, (2009): January 2009 Web Server Survey.3 Crédoc, (2008): La diffusion des technologies de l'information et de la communication dans la

société française, p.120.


Websites can be accessible through three channels:

• Direct access (for example you know the website address by heart, you put it

in your favorites or you find a website on a business card and you are typing

it in the address bar);

• External links (you access to a website which has the link of another

website, this is the case in most of websites, catalogs, advertisement);

• Through Search Engines (you use a dedicated application by typing in some

keywords in order to get suggestions of what you are looking for);

As you can see from this list if you use only the first two ways to crawl the

web it comes to be too rigid and not wide enough. It has been said as well that the

first way is disappearing more and more in profit of search engines4.

So one could say that there is currently two main ways to crawl the web, from

link to link and by using search engine.

This last one being indispensable in order to crawl the web properly.

More and more information are put on the Internet which makes it

come a true jungle. The only way to crawl those information properly

is to use search engines.

1.1 Relevance of the subject

Internet is becoming more and more our information provider. "In 2002 a

study from the U.S. Department of Commerce estimated that 36% of the American

public over the age of 3 used the Internet to search for product and service

information in 2001. This usage represented a substantial increase in information

search behavior from 2000, when 26% of the American public reported using the

Internet to search for product and service information."5

Between 2000 and 2008 the USA got an increase of Internet users of 130,9

%6 we can then imagine how important is searching information through Internet

4 cf. Ohayon, O. (2008): Google, moteur de recherche ou moteur de navigation?.5 Peterson, Robert A. / Merino, Maria C. (2003):Consumer Information Search Behavior and the

Internet, p.6.6 Internet World Stats (2008): Internet Usage and Population in North America.


nowadays.

The number of Internet users is estimated to 1,463,632,361 (world population

6,676,120,288) with a growth rate from 2000-2008 fixed at 305.5 %7.

1.2 Major terms

In this thesis you will hear a lot about the following terms: search engines,

search engine dependency and data quality.

Search engine is the most flexible technology which has been created in

order to crawl the web. A search engine is no more than a web application which is

processing data. A search engine does not create data it just process some

information it has in his index.

I describe search engine dependency as the fact that one does not make the

choice between one search engine and another. Search engine dependency is very

relevant in most of the countries. Most of the people are using only one single search

engine when performing requests on the web. I will speak as well about bad habits

when dealing with search engines. For example you can be dependent of using a

search engine but using it badly.

Data quality is the quality of data. Data are of high quality "if they are fit for

their intended uses in operations, decision making and planning ". Alternatively, the

data are deemed of high quality if they correctly represent the real-world construct to

which they refer. These two views can often be in disagreement, even about the same

set of data used for the same purpose8.

1.3 Focus, goals and structure of the report

The focus of this work is to show if there are some risks of being search

engine dependent and if there are some, is the gap of information significant

between a search engine to another?

Goals are numerous:

• to put in evidence how bad it is to be dependent of an unique search engine;

7 Internet World Stats (2008): INTERNET USAGE STATISTICSThe Internet Big Picture.

8 Kaplan, I. (2008): Bad Data Can Cost You Big Time.


• to show how important it is to have reliable information;

• to show how big is the gap between a general search engine and a specialized

one;

• to look for alternatives if tomorrow the main search engines disappear;

• to show the weaknesses of general search engines;

• to show and discover other search engines and put in evidence their

usefulness;

• to show that general search engines are not well used;

• to discover the future expectations people have regarding search engines and

what are the coming technologies in this field;

The structure of the report should be done as follow:

The first idea is to introduce the concept of data quality. We all go on

search engines in order to find information whatever it is or at least to find the

answer to one of our question (how much does a Paris-Berlin train ticket costs? What

is the weather in New York? What is the last result of our favorite soccer team?).

The second point is about Data quality which is very interesting in order to

understand why search engines exist.

The next point is dealing with the world of search engines and the

dependency which is outcome from them.

Analyzing the world of search engines is fundamental to understand how

Internet is not as rational as we could think (search engines may be not the Internet,

search engines may be different from a country to another).

Then we will have a look on figures which show quite clearly that people are

not using several search engines when making research on the Internet but an unique

one that I will define as the dependency concept.

Once this definition given I will focus on the heart of this work which are the

risks that this dependency provide and its influence on data quality.

Google being in Europe the most used search engine I will use it as a major

example in my last part.


Chapter 2: Concept of data quality


When I firstly decided to write about the concept of data quality for search

engine I did not mean at the beginning that data has to be perfect. I just meant that

when I write a request on any search engine I expect to have as results some

answers which fit to my expectations. The last example I have in mind is when on

the 31th of December a friend of mine looked for the « average height of Korean

people » (request were made on Google) and all the results on the page were dealing

about the « average size of the sexual organs of Korean people ». I have high doubts

that no information regarding the average height of Korean people does not exist on

the web it however has been what Google seemed to tell me.

Here as you can see in this example it raises a lot of issues regarding data

quality. My first expectation regarding data quality was then to get some results

which fit to my request. But looking it deeper what is the value of a supposed correct

answer that I could have found written by someone like you and I (not specialized in

this topic).

In fact here everything depends on how professional the data as been written,

this is what I am explaining in this chapter.

2.1 Data quality definition

“This part needs additional information and improvements and is then not

finished yet.”

Data quality is defined by four criteria:

• Accuracy: it means that the information has to be true so based on real facts.

Here we see the importance of having the source of the document.

• Timely: the data given has to be dated. The most striking example could be

the one with stock exchange, what is the value of a data regarding a currency

change without the date?

• Meaningful: here it comes back to what I was explaining in the introduction.

Does the result fits with my request? What is the color of a frog? The answer

is green, is it useful? Yes but is it complete?;

• Complete: an information can be for sure useful but will be far more useful if


it is complete.

Here is then the minimum vital of data quality.

2.2 The importance of data quality

“This part needs additional information and improvements and is then not finished

yet.”

As I mentioned it there are two levels of data quality:

• Poor quality one: for example you and I are looking for information

whatever it is in order to get a quick answer or even a mere comment. Here

two theories are facing each other:

▪ An information should be given even if it is possibly wrong. It of

course can be useless if the information is a hoax but will it be in the

case of a warning of a terrorist attack?

▪ An information should be given only if it is 100% accurate. Here it is

interesting in order to not create polemical situations for nothing.

There are no rational choice to make here, sometimes you need to use the first

theory and sometimes the other one.

I would describe poor quality data as mass consumption information which

mean good enough for “lambda” people but not that useful for businesses and even

less for researchers. With the increasing of Internet users more and more mass

consumption data are created and of course their number is bypassing the one of

quality ones.

• High quality one: here we find the data which fits the four criteria I

described above. High quality data are useful in order to take complex and

rational decisions.

The Wikipedia Issue:

As you may know Wikipedia is an Open Source encyclopedia where


everyone can write what they want on a specific subject. His success made in come

the biggest encyclopedia of all time. Such a success is recognized as a mark of trust

and seriousness for search engines which is characterized by being displayed most of

the time in the 5th first result for any request. With most of the time an appearance on

the first result.

The main issue even if recently it has been added the possibility to add the author

name most of the articles are anonymous. It is also possible to write down an article

without mentioning any source of information. This is then explained to the reader

before starting the article but everything is then dependending on reader's behaviour.

Which quality of information does he or she want to have.


Chapter 3: Search engines dependency


In order to understand well this part we should have firstly a look at the

search engine market configuration. Even if for a lot of people search engine is a

synonym of Google it may be not true for some other people. I made a strong effort

during this work to make it as global as possible. Europe is good as well as the

United States but we will never got the all answers of our issues if we always

consider these two parts of the world.

3.1 Search engine market configuration

3.1.1 Search engine categories

The world of search engines is not as uniformed as we could have think. I

until now identified three kind of search engines:

• Standard: the most well known search engines such as www.google.com,

www.livesearch.com (Microsoft) , Ask (http://www.ask.com/?o=312) they are

looking for any kind of information through the Internet and are characterized

by a very light interface:

Illustration 2: A light interface: the homepage of the search engine www.ask.fr

• Portals: may be the most complex search engines to analyze on a figures


http://www.google.com/

http://www.ask.com/?o=312

http://www.livesearch.com/

http://www.livesearch.com/

point of view. Portals are characterized by a lot of information on their home

page including the search engine function. It is then difficult to make the

difference here between the success of the portal and the success of the search

engine on this web page. The best example I found is the one of mail.ru

which is the most visited website in Russia far before Google. It seems when

looking at the search engine statistics that the research made on Mail.ru are

not accounted or that no one is using the search function of Mail.ru. So it is

sometimes hard to measure the success of some search engines. The most

well known portal is Yahoo.

Illustration 3: Home page of the Mail.ru portal

• Specialized search engines: they to my point of view belong to a

subcategory of the first group. The major problem of standard search engines

is that they are too big. Specialized search engines are then no more than a

special function which is working as a filter. It is then far more easier to find

the information you are looking for on a specific website rather than using

standard ones. A good example of it can be the one of Ebay, it is of course far

more convenient to go on the Ebay website and use the search engine directly

from there rather than going on Google and writing a request such as 'Ebay

buying socks'. I mentioned it again but in most of the cases specialized search

engines are not a revolution they are just one part of standard search engine

and are not a new technology by themselves.


3.1.2 Search engine market

The search engine market is segmented by an enormous leader: Google, a

follower who is far from him: Yahoo and a large amounts of small search engines.

The dominant position of Google may stay for years and years and only a technical

revolution could really jeopardized him.

Illustration 4: Top 10: Search websites in the world August 2007

As you can see here I identified what we could call « The big four » which

represents the four major search engines. Then comes three specialized search

engines (7,49% all together) but their success are limited to the size of the website

they are browsing. Then came in last positions what I would called the dead

champions which once have been great and well known search engines but which

are nowadays in decline and will one day probably disappear.

3.1.3 The search engines in the world

The more I study the E-world and the more I realize that the web is exactly

the reflect of our society but in a non-physical aspect. Internet did not break cultural

differences and search engines really show it.

As we just saw Google represents 6 research out of 10 in the world but does it

mean that each country in the world has a population of 60% Google users?

To answer this question let's have a look at the following map:


Source9

As we can see here the world is not covered entirely by Google. We clearly

have some Google countries, Yahoo countries, Mail.ru countries and so on and so

fourth.

We can however notice some striking information such as almost all the

American continent is using Google as well as Europe, Northern Africa and

Southern Africa, Australia, India. In one word almost all countries which have strong

links with the Anglo-Saxon culture.

Then comes what I would qualified as a cultural wall which is starting in

Eastern Europe and which is finishing in Russia. Here are the ex-soviet countries. I

unfortunately have no concrete proves of what I am saying but I suppose that there is

a kind of a « boycott of American technologies » and support of Russian

technologies. The recent partnership between Yandex (main search engine in Russia)

and the browser Firefox raised those suspicions10.

Russia is not the only country in this situation, China also. The recent

advertisement broadcast by Baidu (the leader search engine in China) shows clearly

9 Alexa the Web information company (2008).10 cf. Houste, F. (2009): Russie : Yandex sera le moteur de recherche par défaut de Firefox.

cf. Schwartz, B. (2009): Firefox Drops Google For Yandex In Russia, But Big Loser May Be Rambler.


Illustration 5: The most visited website by country

the will that Chinese public institutions are ready to protect their territory.11

As you can see Asia is the region where Google is the least present. It is also

the continent where are gathering a lot of different search engines.

I may not emphasize the diversity of the Caribbean area as well as Center

Africa which are areas where Internet is not that well implemented which mean then

that the battle to take the lead is not finished yet. For example is it really relevant to

say that Yahoo is the leader in Cameroon which is a country with less than 500,000

Internet users?

I however will highlight that the Pacific area which is containing all the

« Tigers » (Taiwan, Thailand, Singapore...) are all in red: Yahoo.

The search engine world is then divided in two parts:

• The Google world: which is composed of all the Anglo-Saxon countries as

well as countries which have strong links with the United States or Great

Britain. It is clearly showed regarding India and Australia. We can at the date

of today still identify two countries which are not under Google control:

Czech Republic and Iceland but actually by looking at the figures and the

forecasts it is just a matter of time12.

• The Asian – Pacific world: Asia is composed of a lot of countries and of

course a lot of cultures. Among them we can identify four players:

◦ Mail.ru which is dominating all the ex-soviet countries;

◦ Baidu which has a total control over China;

◦ Naver, a 100% South Korean product which is the best example that

search engines work by culture;

◦ Yahoo which is leader in all the “Tigers” Asian countries.

Yahoo being an American technology such as Google, how is it possible

that Yahoo is so successful in the Pacific area and not elsewhere? The

reason I found is that Yahoo is a shiny portal and that Asian culture on

11 cf. Einhorn, B. (2007): Baidu Thinks It Can Play in Japan.cf. (2007): Baidu au Japon?.cf. Grallet, G. (2009): Baidu, un autre Google s'éveille.

12 cf. Rafat, A. (2008): Czech Portal Seznam Could Fetch $900 Million; Google, Apax, Warburg and Others in Fray.cf. Mar Hauksson, K. (2007): Global search report 2007, p8.


Internet recognize a quality website to the number of animations on it13. I will then

add that Japan is a strong pole of Internet with one of the highest rate of Internet

integration in the world per capita14. This is why I think Yahoo is so popular in this

region.

3.1.4 The search engine market shares per country

One of the most complete work I found on this topic after mine is the

« Global Search Report 2007 »15. What stroke me the most in all the countries

studied in this report as well as in all the report and research I made until now are the

search engines market configuration which looks like very often to this:

Illustration 6: Search engine market shares in Czech Republic

It is very rare to find a country where there is a close competition among

search engines. Even if in the High Technology world things change from a day to

another you have often the following configuration where the first search engine

is leading the game by more than 30 points on its followers.

This is typical from the search engines market or you are adopted by a

population or you are not. This trend seems quite relevant in the

software industry, people seem to look for a standard used by all. This is the case for

the Operating System industry, the browser industry, the e-learning industry. The 13 cf. Tobin, R. / Hotchkiss, G. / Lee, P. (2008): Chinese Search Engine Engagement, p28.14 Internet World Stats. (2008): Internet Usage in Asia.15 cf. Wilsdon, N. (2007): Global Search Report 2007.


explanation I found for the success of search engine within a population is the word

to mouth, this is how Google has been so successful isn 't it? How never heard

sentences such as « you just have to Google it » Google is even nowadays in

dictionaries as a verb16.

As a conclusion I would say that in the world of search engines you are first

or you are nothing.

I have to mention as well that the market has a lot of small local search

engines which are if original enough bought by the big ones or if not will disappear

quickly (some example are coming in the news every month). The only key of the

success on the short term seem to be advertisement but on the long run you need the

technology behind in order to compete.

3.1.5 The search engines competition

Google has been created in 1998 and at that time search engines were already

in place, it did not scared Google and one after the other Google bypassed all of

them. In fact among the big four Yahoo is the oldest (1994) and Microsoft the

youngest (2003). Even if the battle seems to be finished it will take a lot of time to

Google to be the number one in all countries (everything being linked to culture

rather than rationality) which in fact is giving hope to its followers.

At the time I am writing this thesis discussions are still on the way between

Yahoo and Microsoft in order for Microsoft to buy Yahoo search technologies. We

can understand how strategic a such acquisition could be. Yahoo having the research

knowledge and Microsoft the funds as well as the software ownership.

Regarding Baidu we cannot clearly see how they could compete against

Google outside of China.

As I said previously specialized search engines are limited to the website they

are linked to.

We could then think about new comers who starting from nothing could beat

famous search engines in a small period of time, it could have been the success of

some products such as Cuil launched in summer 2008 which received a lot of

advertisement through the news17. But search engines is a very ungrateful world

16 cf. Merriam Webster. (2001): Google - Definition from the Merriam-Webster Online Dictionary.17 cf. Arrington, M. (2008): Cuil On BusinessWeek's Most Successful of 2008 List. Huh?.


where visitors are giving no more than one chance: the product works or it does not.

This is a point that I discovered very quickly and that you can test by

yourself. People want the information as soon as they can. They are

ready to test the product but in a certain amount of tries. When you

move from Google to another search engine you are often intransigent. At the first

result which does not fit your expectations you will go back to Google. But is the

search engine wrong or is it because it is responding differently that on what you

were used to?

In order to conclude this part I would say that with the search engine history

we have and the search engine market configuration, I cannot see how Google

could lose its position. Until now only one company succeeds to make a such gap in

the world of search engine and it is Google itself and it was in a period where

everything had to be created on Internet.

So I would say that on this field I don't see how Google can be beaten and

even worried.

A new technology regarding research is however more and more recurrent in

this field and is called semantic research.

3.1.6 The semantic web


yet.”


Illustration 7: Results page of Cuil

The semantic web is another way of crawling the net. We all know how to

make a search on the Internet isn't it? We just type in some keywords and press the

return key in order to get the answer. In this configuration you have to feed the

search engine with the request.

With semantic Web the concept is a bit different and based on suggesting you

the request instead of typing it entirely. Each time that you are starting to type your

request a list of suggestions are coming to you. We are recently seeing more and

more this technology on the biggest search engines.

The purpose is in fact to guide you as best as they can in order to put you on

the right track and trying to avoid you to reach the labyrinth of the web.

This technology fits one of the main drawback of search engines and that I

call « search engine technology awareness » which consists in how to write good

requests for search engines.

The main drawback of the semantic web is that this is a very new technology

which then have a lot to do before reaching his maturity point. Here we are speaking

about a maximum length of a decade. We can also complain about the rigidity of the

system but it is true that with length and experience this issue could be fixed.

It said that Ask is one of the search engine which based a lot of R&D on this

new technology but according to me and without being a technician I think that

Google can have better results because of its huge database of requests. Future will

tell us what is going to happen.

3.2 Search engines dependency aspect

As I mentioned it in the introduction I define search engine dependency as the

fact that people are swearing only by one search engine when looking for

information on the Internet.

3.2.1 Search engines dependency proves


yet.”


This is not the studies which are lacking on this topic. When looking for

information regarding information literacy on the Internet you arrive on different

sources of studies and this regarding all the countries of the world. Information

literacy is a relevant topic and an issue. I focused on some very recent and

francophone research that I found on the Internet regarding Canadian students18,

French19 and Belgium students20. I also found information regarding Germany on this

topic. Many sources are as well saying that such research have been made in mostly

all Europe, China (Hong Kong)21 and the United States22.

All the studies I found until now (all done on students panels so literate

people) are all saying the same thing: search engine are the first source of

information when looking on the Internet and all students seem to have receive not

enough training on how to look for information on the Internet.

The best study I found on this topic is one made on all the registered PhD

students (2,218 with an answer rate of 23,4%) last year (2008) on a whole region of

France (not a high technological developed country but far to be the least on a

worldwide scale)23.

As we can imagine PhD students have a high requirements regarding quality

of information.

The study shows that 67,5% of the respondents have never received a training

regarding how to look for information during their whole stay at the university which

could explain the fact that people are running toward search engines directly.

Search engines are used in 96% of the cases when performing research

(which emphasize the necessity of how to well use those technologies).

94% of them do not use blogs which I take as a good thing (even if the survey

is saying the opposite, blogs being written by professionals as well).

The most used search engines are Google (85%) and Google Scholar 37%

18 cf. Crepuq. (2003): Etude sur les connaissances en recherche documentaire des étudiants entrant au 1er cycle dans les universités québécoises.

19 cf. Université de Lyon. (2007): De la documentation au plagiat.20 cf. EduDoc. (2008): Enquête sur les compétences documentaires et informationnelles des étudiants

qui accèdent à l'enseignement supérieur en Communauté française de Belgique.

21 cf. Leung, Hon-wing 梁漢榮. (2004): A study of computer science students' conceptions of information literacy and their experiences in information search process and use.

22 cf. Enquiro. (2004): Search Engine Usage in North America.23 cf. URFIST de Rennes. (2008): ENQUÊTE SUR LES BESOINS DE FORMATION DES

DOCTORANTS Á LA MAÎTRISE DE L’INFORMATION SCIENTIFIQUE dans les Ecoles doctorales de Bretagne.


(which is a sub search engine of Google).

60% of them do not know what is a meta search engine and only 5% of them

use them.

46 % do not know the search engine of their field and only 20% do use them.

Those figures are very interesting because they show clearly how people are

not adapted to the technology they are using. PhD students should be some of the

most search engines awarded people and it seems that for France they are not. They

are strictly dependent of a single search engine which is here Google. They know

very few of his sub search engine and as written above they do not know how to use

the technology. They are also not aware massively about other search engines.

3.2.2 Search engines dependency aspect

Search engines dependency can however comes from different ways:

• Search engine satisfaction: you are using a specific search engine which

give you entire satisfaction, so why should you change?;

• Search engine patriotism: you are using this search engine in order to

support your local technologies;

• Search engine convenience: the search engine is providing you all kind of

services which made it very convenient to use or even made the other ones

not convenient to use;

Of course the trend for all search engine is to go for convenience because it

gives to customer everything they need. The main drawback is that for the search

engine companies you have then to dedicate less people to your core activity and

then there is a risk that your search activity will pay the price.


So being search engine dependent means using massively a search engine for

one of those reasons and ignoring all the other ones. Search engines dependency

reach very high rate in Europe:

Most of the European countries have like France a strong addiction to Google with

more than 90%. What does it concretely mean? Almost all European when making

search on the Internet are fed by using the same way to process information.

3.3 Search engines dependency problems

At the first sight when using a search engine we are not thinking about all the

issues which are coming out from them. We make our research and we get results

from this and then we try the results one after the other until finding the one which

fits the best our expectations.

The first main problem is that when addicted to a specific search engine

which normally gave you satisfaction the day when the result will not be the one you

want you may think about different possibilities:

• The information is not displayed so the information you are looking for does

not exist yet;

• The request was not good enough so let's try with other keywords;

The main issue to highlight is that people are so confident with some search

engines that they will not normally look for alternatives or even consider that their


Illustration 8: Search engines figures for France, Source: XitiMonitor

favorite search engine can be wrong.

People are so confident with their search engine that they are not

thinking that the search engine can be wrong.

3.3.1 Privacy issues

I am a bit divided on this topic because I consider it as the mass public

« scarecrow » which is only good to animate polemical debates for nothing.

It finds its explanations in the way that search engines are collecting

information.

When analyzing search engines we have to consider that it is a free

product for all of us (in fact search engines get paid by displaying

advertisement on each web page). Each time you are making a

research on the Internet the search engine you are using registers the IP

number of your computer and of course the research you just made. All these data are

of course supposed to be confidential but some are used in order to make some

statistics such as how many Internet users from a specific country have visited this

website. It can be used for other purposes such as the rank of the most used research

and others data such as those. Of course the more information you give and the most

they collect so if you open an email account on a search engine for example they will

collect your name, address. Until now few are the cases where we got the proof that

information collected by search engines have been given to third parties. The most

famous one is the one of Yahoo in China which filtered some emails and gives the

names of some Chinese journalists who were denouncing things about the Chinese

government24.

To make it clear until now no mass exploitation of data have been observed

and the recent news given by major search engines (Microsoft and Google) are

saying that the trend is to eliminate those data as much as possible in the fear of

losing confidentiality25. We however have to consider an additional element the more

search engine know about what we are looking for and the most they can fit our

expectations, so I personally do not think that reducing the collection of data is in

24 cf. Kahn, J. (2005): Yahoo helped Chinese to prosecute journalist.25 cf. Boucq, I. (2009): Yahoo et vos données persos....


people interest and I will qualify the privacy issue as a global scarecrow in order to

bug the major search engines and putting on the first row some alternative ones

which on the long run may will not fix those issues.

3.3.2 Looking for other search engines

A recurrent question often asked is « Why people are not looking for other

search engines? ». Why do people knowledge regarding search engines is so low? If

you ask to someone how many search engines he may know he will for sure answer

you « Google » maybe « Yahoo » he may add « Microsoft » without telling you its

real name which is « Live Search » and it will for sure stops right here.

Many others search engines are however present on the market. We should

then conclude that people are satisfied by the current search engines.

However studies are showing that regarding Google people when getting their

results are only considering the first three results and not the others giving far more

importance to the first one.26 By just this observation we could then say that there are

things to improve in Google's interface. I guess again that people are looking for

something new even if it is not their priority. This is why I would like to raise another

issue which is search engine awareness.

3.3.3 Search engine awareness

Search engine awareness is for me the key issue of this all work and reveal

two parts:

• Poor search engine awareness regarding how to use a specific search engine;

• Poor search engine awareness regarding the existence of other search

engines;

Both parts are fundamental. The first one deals with what we call search

tools. It consists of a combination of keys in order to fit a specific request.

If you are on Google typing the following request:

search engine dependency

is different from:

26 Eye tools. (2009): Eyetools Eyetracking Research.


« search engine dependency »

In the first case you will look for pages which are containing the words:

search engine dependency as well as all the others combinations such as pages

with search dependency, engine dependency, search, engine, dependency. Whereas

in the other cases you are restricting the pages which include only those three

words and in the same order.

It exists dozens of those tools per search engine, the most famous ones are

called « boolean operators ». Some websites provide the list of those tools27. We have

also to consider that all the search engines are not using the same conventions so

some tools under search engine A will not work on search engine B.

If you are on Yahoo the following request:

• title:search engine

will give you all the web pages which have for titles the following

keywords. This tool is unfortunately not working on Google.

This example shows well how search engines are in fact complementary in order to

make precise research.

Is Google really better than Yahoo?

To get a piece of the answer you can make this empirical experience I

made a couple of months ago. I did the same request on three different

search engines (Google, Yahoo and MS Live Search) and compared the first five

results and gave my opinion on each of those results. If the result fit my

expectations I gave one point if it did not zero. Google got a four out of five, Yahoo

a two out of five and MS Live Search a zero out of five. However all results from a

search engine to another was different. So Google won for sure at the time I

performed the award of pertinence, however the information I was looking may

have been on the two results Yahoo gave me.

So is Google better than Yahoo for this example the answer is yes. Does it mean

that I should not consider Yahoo? Absolutely not.

27 cf. Koch, P. / Koch, S. (1999-2006): A short and easy search engine tutorial.


3.3.4 Other search engines existence awareness


yet.”

Here is another issue which is actually not hitting only the competitors of

major search engines but also themselves. In order to be recognized in the world of

search engines you need very strong guarantees. This year two of them succeeds their

advertisement campaign: « Cuil » by claiming that they had a database which is 6

times more than the one of Google. It said also that their success is coming from the

fact that the company has been built by a former Google employee but actually Cuil

disappears very soon as the change of their CEO. The main reproach which has been

done is that the results given were not numerous enough which of course is seen

from a very bad eye when you are claiming that it is your main strength.

In the world of search engines you cannot be a clown. People give you

one time the floor and will not give it to you back if you do not

perform as they wish.

The other example I can give is the one of Exalead, a French search engine

which is supported by the European Union, the purpose behind is to set up an

European technology in order to face the American ones. So as you can imagine

there is a big support and a huge project behind which make me coming to the bullet

point above if you are a search engine which wants just to be correctly advertised

you definitely need very strong supports.

I today January the 7th 2009 got twice in a day the same observation

about researching information on the Internet. I was looking for two

different songs but I did not know the title as well as the singer, all I

had was some couple of words. I made then a research on Google with

the words I knew from the song and added the word « lyrics » as well in order to


specify that what I was looking for was a song. I as well added the quotes « » in

order to say to Google that I wanted the words in this strict order in order for it to

identify better the song I was looking for among the jungle of the web.

I wrote then: lyrics « freaky people » the answers I got where dealing with a singer

called « Michael Franti and Spearhead » and this on three pages in a row. I tried to

set a different request and the result was the same. Then I gave a try to GrabAll

which is a double search engine displaying Google and Yahoo on the same screen but

on two different columns. I made then the same request and got very surprise to see

the name of the Fat Boy Slim band on the first page of Yahoo. This result was on the

fourth page of Google instead. Here I have two points to make. Firstly the utility of

using a second search engine in order to control the result and secondly the fact

that Google was taking in account that I was 100% of what I was looking for.

Google was in fact giving me those results because the two words « freaky people »

are a part of a title of a song of « Michael Franti and Spearhead » but did I asked for

that? No I just asked for lyrics which have the words « freaky people ».

The second example I am going to give you is as well quite relevant of how search

engine dependency can be dangerous. I was looking this time for lyrics with « lyrics

wanna be your doll » what I did not know is that I had it wrong and it was not

« lyrics wanna be your doll » but « wanna be your dog ». The fact is that both songs

containing those lyrics exist and that Google was displaying me the songs containing

the words I just fill in whereas actually Yahoo was presenting me diverse results

including the one I wanted.

I am not writing here that Yahoo is better than Google, I am just saying that

there is a high interest to have a look around in the world of search engines, I could

have a look around all night on Google looking for my song that I will have never

found because I had it wrong since the beginning. So looking for diversity should

be the first reflex when you cannot find what you are looking for in short

attempts.

3.3.5 Less confident regarding other search engines


yet.”


This part is in fact in close link with the example I just gave above. Each

search engine has his own methodology when displaying the result. When you get

used to get the results in a specific way then when it comes to be different you may

think that the results are totally absurd. The problem is that the more you get addicted

the more you will have the feeling that other search engines are bad and then will not

give them a try.

This is why I describe search engine dependency as the fact to be less and less

confident regarding other search engines.

3.3.6 Even the best cannot provide you everything


yet.”

Some functionalities are not included in the biggest search engines which

make not you think that they could even exist such as search engines for jobs or

people. How to get a visual representation of what I am looking for? And the main

problem of the big search engines is that they are too big so even if they have what

you are looking for it is not sure that you can even see it.


Chapter 4: Risks of search engines dependency

and its influence on data quality


This part is dealing with the measurement and the proves of what I am

writing about. Here I expect to show how search engine dependency is affecting our

everyday life and how we could overcome this situation.

4.1 The information has been found but is poor


yet.”

Let's imagine that you and I just performed a request under a specific search

engine and that the answers given do not satisfy you entirely. You found the

information but it is not developed enough, not signed, too old...

4.2 What the search engines do not tell you

One of my former English teacher taught me one day that there is different

ways of not giving information, one is to lie and one is to not say the information. I

don't think search engines are lying and cheating even if in the case of Baidu the

Chinese search engine there are high suspicions on it28. Some are saying that in order

to be well ranked you need to pay Baidu for it and it has been said as well that the

Chinese government as a big role to play when displaying the results.

The recent Milk scandal in China (Baidu accepted to high ranked unlicensed

companies which were providing fake milk in exchange of money) showed how

dangerous can be a poor data quality search29.

I however have the proof that search engines are not telling you everything

when displaying results and that this rule is general for all the major search engines.

In Google case, search engines are censored according to the country in

which you are making your research from30 and it touches all the country around the

world even countries such as France and Germany.

Most of the time those censorship acts are for your welfare or to protect

national security interest.

28 cf. China Tech News.com. (2007): CCTV: Baidu Search Engine Fraud Exposed?.29 cf. China Daily. (2008): Baidu cuts revenue forecast on ad scandal.30 cf. ROSEN, J. (2008): Google’s Gatekeepers.


4.3 The best way to get data quality

Here is maybe the most interesting part of my work which consists in giving

the solution to the main issue I highlighted since the beginning.

Until now I just raised questions regarding the risks of dependency and not

showed what you should do in order to explore the web properly.

4.3.1 The sub-search engines

As I said previously Google is too big, the bigger it is the more precise your

request have to be. Few people knowing the existence of Boolean operators it then

comes more and more difficult to get the right information. This is why in fact

Google put at your disposal some sub-search engines in order to make your research

easier, for example: Google books, Google videos, Google images. We all agree that

looking for pictures could be done on the research bar of Google, but it is far more

convenient to make it directly from « Google images » because it is displayed better.

The main problem is what I described previously: « search engine awareness

within a search engine ». Am I aware that Google has the following services?

I could sum up it all in a scheme:

For a Google search engine dependent everything starts from Google home

page. He is then aware here of those specialized search engines in order to make


Illustration 9: Some sub-search engines of Google

more precise research on the following information: images, maps, news and

products. It is however under his own initiative to discover what is going on under

those search engines and to discover them. I do not have a proof of what I am saying

but I may presume that in order to buy a product more people are going on Google,

type eBay or Amazon and go and buy on those websites rather than going on Google

products. Google products is however including all those websites in his database so

it should more convenient to have a look on Google products first in order to get the

best price rather than going individually on eBay and Amazon.

We can then symbolize the Internet users awareness of Google search engines

according to this scheme.

The more we get into deep of those levels and the least the Internet user is

aware of it.

Level 2 is still available on the first page but with two clicks.

Level 3 is no more available on the home page but can be accessible in three

clicks.

Level 4 is even not accessible from Google main website and that you have

too be aware of it in order to access it.

Level 5 is hypothetical and constitute the unknown Google projects, often

developed under another name.

All those Google sub levels have been created in order to make easier

research on a specific request. When looking for an image it is far

more convenient to pass through Google images rather than the general Google.

4.3.2 The size of the Internet

How big is the Internet? Is Google indexing all websites?

It is really hard to answer to those questions but however possible to set

up some estimations according to some information31. Those sources are

saying that in 2005 the size of the Internet is estimated to 5 million terabytes and

Google's index to 170 terabytes which would mean that Google is processing only

0,000034%. However the Internet is also containing what we call the invisible web

31 cf. Plesu, A. (2005): How Big Is the Internet?.


composed of websites that owners do not want its content indexed as well as

websites which are protected by a password. In 2004 this invisible web was

estimated to be 500 times bigger than the visible web. It has been said as well that

Google is indexing invisible web only recently32.

A clever calculation will then give us:

Internet size = Visible Web + Invisible Web;

Internet size = 501 * Visible Web;

Visible Web = Internet size / 501;

Visible Web = 5 000 000 / 501;

Visible Web = 9980 terabytes;

Google Index = 170 / Visible Web;

Google Index = 1,7%

This estimation is of course only an estimation and could be full of errors. I

however find it more useful than no information at all. I would also emphasize the

fact that Google has no interest in indexing bad quality websites and that

technologies have evolved in the last few years and that this rate should be of course

far higher than those 1,7%. Whatever is the final result my point is the following:

Google is not the Internet and is not processing all the web... but does

all the web need to be indexed?

4.3.3 Single search engine Internet coverage


yet.”

Here is a more optimistic representation of Google Internet coverage I made

in order to show that Google is not the Internet and that even within Google's sphere

Internet users cannot access to all the information:

32 cf. Chitu, A. (2008): Google Starts to Index the Invisible Web.


Google dependent people are not only using Google when making research but as

well Google partners all symbolized by the sign:

Powered by Google means according to an IT company called Alacra33:

Alacra uses Google Search Appliances to create the Alacra Compliance Web. The

use of Google Search Appliances combines the power of Google search technology,

including the ability to find the highest quality and most relevant documents, with

Alacra's domain expertise in selecting those web sites and pages which are relevant

for AML compliance.

Which means that here you have a partnership working more or less in the same way

that using a Google specialized search engine.33 Alacra. (2008?): What does "Powered By Google" mean?.


Illustration 10: Google's Index coverage

There are thousands and thousands search engines all specialized in a specific

field on the Internet which are using Google technology to search sites such as

Tourism, tutorials...

By using all those specialized search engines you are crawling better the

Google's coverage space.

As you can see here on this configuration by using specialized search engines

powered by Google you will always browse Google's cyberspace and when we know

that Google is not the Internet we can then ask ourselves how we can get the best out

of Internet.


Illustration 11: “Powered by Google” search engines coverage

On January the 11th I went on both websites http://www.aol.fr/ and

http://www.google.fr/ and type the following request 'les moteurs de

recherches' both results on the first page were identical. The only

difference is that Google gave me 3,210,000 results and AOL 290,000

(9% of Google results) so as I developed above AOL is looking at the same place as

Google but is applying more filters. Is it really useful considering that AOL is a

general search engine? The answer maybe be given in their home page at

http://search.aol.com/aol/webhome « The AOL Search engine delivers great search

results, enhanced by Google, plus relevant multimedia results delivered on a single

page-so you can search less and discover more. »34

If you go on http://www.google.com/coop/cse/ you will have a good example

of the search engine powered by Google. You can even create your own one. All is

done by using Google technology and you are just applying your own filter by listing

the websites where you want Google to look inside.

4.3.4 Multiple search engine Internet coverage


yet.”

Let's go now deeper in our analysis by taking in account search engine which

are not using Google technologies. All developed their own technology and are

sometimes better than Google in some fields worst in the other. We will then have

something like that:

34 cf. Boswell, W. (2004?): AOL Search: How to Search with AOL Search.


http://www.google.com/coop/cse/

http://search.aol.com/aol/webhome

http://www.google.fr/

http://www.aol.fr/

So here is my point the most the search engines are different and the most it

gives you the possibility to discover the web. Here I just put the main actors but we

have also to consider that it exists a lot of small search engines which developed their

own index and have then their own way to process data.

Of course one could say that there is no interest for an European person to

process information on a Chinese or even Russian search engine because Chinese

search engine should of course be better in looking for Chinese information rather

than an European one. It is definitely true if we are speaking about contents such as

texts, but when it is dealing with pictures or video the reality should be totally

different.

On the other hand the day where this European person is looking for Chinese


Illustration 12: Search engines coverage

information he should then be aware that he should use the Chinese search engine

rather than the Chinese version of the European one. And as we can imagine all the

other players: Yahoo, Baidu, Mail.ru have their own Boolean operators and their own

sub search engines. So it comes back to the idea of search engine awareness. The

more you know about them and the most you are increasing your knowledge about

how to get the best out of Internet.

4.3.5 Others search engine Internet coverage


yet.”

You have a last category of search engine which are the one coming from the

semantic web concept which are not based on keywords requests. I found one of

those which is called « Who is like it » and gives as results websites similar to the

one you just gave him.

Of course solutions have been found in order to create the perfect search

engine including all those technologies. But as you can imagine it is not an easy

thing if firstly search engines did not take the same standards as Boolean operators.

Secondly the more search engine you combine the more results you will get and it

then very complicated to decide which one is more pertinent than the other.


Illustration 13: “Who is like it” search engine

A good example of it is the meta search engine called Dogpile which is quite

popular in the United States it combines the results of 4 big search engines which are

Google, Yahoo, MSN live search and Ask. The main problem is already showed on

the advanced search web page of Dogpile, few are the Boolean operators limited to

4.

The second issue which is relevant is how to decide which results from those

search engines are the most relevant. I showed you previously that when making a

comparison between Google and Yahoo I found what I wanted on Yahoo on the 8th

position. So it is hard to decide which result is the most relevant and to this game you

are the only judge of the situation.

So the hypothetical search engine should have the following characteristics:

including all the technologies ever created regarding search engines (Google's

knowledge+Yahoo knowledge+MSN knowledge+...) in order to define the most

powerful algorithm ever. All those companies have of course different objectives

than being the most rational search engine all are looking to be the number one. This

day will may come but it will not be for sure for tomorrow so waiting for it we

should understand how to use one by one all those technologies.

The solution being then to test the search engines one by one but

before testing all of them you have to know that they exist and how to


Illustration 14: Dogpile meta search engine

use them but you should at least to know that they exist.

4.3.6 A concrete representation of the World Wide Web

In order to make this work easier a Japanese company set up regularly a web

map of the most famous websites in the world by referencing them by categories this

map could of course be improved but should be a good start:

This map is available at the following address: http://informationarchitects.jp/

start/ with all the links included towards the websites it is composed of. It is very

interesting in order to break the search engine dependency phenomenon. On this map

are located all the most famous websites for 2008 you can then see all the most


Illustration 15: A representation of the most visited websites

http://informationarchitects.jp/start/

http://informationarchitects.jp/start/

influential websites and where to seek for information. For example if I want to look

for a video I may follow the gray line and test all the websites which are on it to find

the video I am looking for.

In order to conclude this part which actually should not be the core of the all

thesis the Web is big and in order to discover it you need to know how to.

We could compare it to the real world when living in country A you receive as

feedback from country B by different ways (people who are moving from country B

to A, the news, the books and documentations you have about country B) but you

will never be physically in country B and for this there are some information that you

could never get. Of course you can get nearer to those information by crawling more

and more the web with your country A search engine (it will be like documenting

yourself more and more about country B) but it will never be like being and living in

country B.

4.4 The gap between search engine dependency and data quality


yet.”

In this part I will explain how to put in evidence the gap of information

between being search engine dependent of only one search engine and using the most

rational tool to search for information on the Internet.

It may not be easy to prove it concretely so I may need to prove it by making

empirical studies.

I succeed to put in evidence so far that:

• Internet users are search engines addicted;

• Internet users have few knowledge about search engines awareness;

• I have now to prove that this is bad;

My first point will be to show that search engines are using different

technologies provide different results.

Here is a comparison I made for three search engines through


http://www.thumbshots.org/Products/Thumbshots/Ranking.aspx which shows how

many similarities search engines have among them. Here I said if I type in

“Universität Kassel” what are the results that those search engines have in common

(in blue). And I moreover added the option to highlight me the website of the

university of Kassel (in red) which for me is a sign of relevancy of my request.

Illustration 16: Comparison among Google, Yahoo and MSN

Here as we can see:

• Google has 7 similarities with Yahoo out of 60;

• Google has 10 similarities with MSN out of 60;

• Google found two times the website I was looking for;

• Yahoo has 4 similarities with MSN out of 60;

• Yahoo found one time the website I was looking for;

• MSN found 4 times the website I was looking for;


http://www.thumbshots.org/Products/Thumbshots/Ranking.aspx

This analysis is then confirming what I was supposing before search engines do not

look for information at the same place.

Then one could ask about the pertinence of the results, is it worthwhile to display

more than one time on one page the website of the university of Kassel? Or is it a

sign of relevancy? So here we have an interpretation according to the Internet user.

One thing is sure search engines using different technology provide

different results.


Chapter 5: The Google example



yet.”

Google is for sure in some parts of the world the best example we can find of

the search engine dependency phenomenon I described.

5.1 Google

In January 1996 a 24 year-old PhD student called Larry Page studying at the

University of Stanford was looking for a theme for his thesis. Encouraged by his

supervisor he studied the following topic “exploring the mathematical properties of

the World Wide Web“ working in collaboration with another student called Sergey

Brin. To make it simple, it is from this work and collaboration which will came up

“Google Inc” (officially created in September the 7th 1998).

Two months later Google is already included in the Top 100 of world

websites of PC magazine (a reference in the United States for computers).

Even if Google is formerly a web based application in English it is a worldwide

service available on the Internet for all. As his creator (Larry Page) said "Google's

search engine has always had strong global appeal"35.

Google is nowadays the most famous search engine. In ten years as the

Millward Brown report said36 Google will become the most powerful brand in the

world.

5.2 Google's success

The broadest explanation I found is the following « Google provides for free

a useful service that people actively seek out »37. And when you think about it it is

definitely true. People are going on Google because they are all looking for that kind

of services. But how can it be more successful than its fellows? Here I could say that

in general Google is giving better results which may be one reason, but moreover it

35 cf. Chardonneau, R. (2008): International Marketing: Google.36 cf. Millward Brown. (2008): Top 100 most powerful brands 08.37 Eternal Dreamer. (2008): Why Google is so Successful?.


is providing added services such emails, blogs, news, Advertisement programs.

Moreover Google's health as a company has been well preserved by making

good choices when making acquisitions and mergers, they internationally developed

themselves very well.

5.3 Google dependency state

If we take Europe we can see that it is definitely a Google dependent

continent:

As we can see here Google (in blue) is not only the most used search engine

in those countries it is in fact like the only search engine present at the continental

level.

5.4 Google functions

Google Boolean operators are numerous but who really use them when

performing a research on Internet? A simplified interface of them is available at

http://www.google.com/advanced_search?hl=eng but here also who really makes

research under it?


Illustration 17: Google's domination in Europe

http://www.google.com/advanced_search?hl=eng

5.5 Google added functionalities

Here is another part I did not developed so far. It deals with other

functionalities and services that the search engine do have but do not put sufficiently

in advance to the Internet user. So let's say that the advanced research which is

located on the home page is already not that much used you can then imagine how

much is used a service which is hidden.

5.6 Google success is his weakness

The fate of Google is linked also to the one of Search Engine Optimization

which is the ability to well index a website on search engines. The main issue is that

Google being the most well known search engine a lot of computer scientists tried to

understand how Google was working in order to index web pages. The more

experiments are made and the more the secret algorithm of Google is known. As we

can imagine few are the web masters who are interested to have a website in the

latest pages of Google. This is why Google results are not the most rational and are

not the reflect of our physical society.

This observation is far more important if we consider that some studies38

showed that Internet users are considering only the first 3 results of Google when

making research on it.

38 cf. Enquiro. (2008): Eye Tracking Studies.


Illustration 18: Google's Advanced search engine

5.7 Google's disappearance hypothesis

It is hard to believe that a company such as Google can close his gates one

day but everything can happen and moreover in the world of ICT, so let's imagine

what will be the consequences of his disappearance. I truly believe that all Internet

users will go for its followers Yahoo and Microsoft with exactly the same behaviors.

It however said that the patent of Google algorithm belong to the Stanford

University and expires in 2011 what will then happen to Google at that time?

On January 31th 2009 a good example of this hypothesis has been put in

evidence on a global scale. Due to a Google technician mistake during 50 minutes

Google search engine was displaying all the results websites as potentially

dangerous:


Illustration 19: Eye Tracking on Google

According to French analysts this human error decreased the number of

Google visitors of more than 90%:


The first conclusion we can set is that people are definitely trusting Google. If

Google say something is that it is supposed to be true.

According to French analysts during that time Internet users did not went for

other search engines39. They probably have shut down their computers during that

time. However the day after all search engines competitors rose drastically their

market shares except for powered Google search engines40. It clearly shows that

Internet is stopping during a rather short period of time before that Internet users

goes for alternative solutions. They then go for the major alternative search engines.

39 http://www.pcworld.fr/actualite/google-panne-et-degats-collateraux/26021/ 40 http://fr.news.yahoo.com/16/20090203/ttc-le-bug-de-google-revele-la-forte-dep-c2f7783.html


http://fr.news.yahoo.com/16/20090203/ttc-le-bug-de-google-revele-la-forte-dep-c2f7783.html

http://www.pcworld.fr/actualite/google-panne-et-degats-collateraux/26021/

Conclusion

The purpose of this intermediate report was for me to show my knowledge

about the topic, to prove that the problematic is pertinent, interesting and that there is

enough to write about it.

I wanted to show as well that I have the ideas clear enough for the next

coming months. What I keep in mind from this work is that I like the research I am

doing, that I am very curious about the final result (how to make the difference when

looking for information on the Internet and that the gap between rational search of

information and mass search information is abyssal).

I am quite conscious that a lot of work has to be done and that a lot of

questions have still no answers.

However I would like as well to emphasize what I personally did so far and

what I learned:

• I know now how to structure a master thesis. I had until now no knowledge

about it and I learned from scratch how to set up an academical report;

• I studied since June 2008 the world of search engines and comes to know a

lot about the topic. Once graduated I would like to continue on the field of e-

marketing either on a PhD level or in an international e-marketing company.

This thesis is then making me more valuable to pretend to one of those two

goals;

• Studying this field brought me additional ideas regarding the e-marketing

field. I wrote done an additional 20-pages report (not included in this

intermediate report) about a hypothetical computer application which could

give information about how to launch e-marketing campaigns on a global

scale;

February will be for sure the month which will decide how this master thesis

will be achieved. I have concretely more than 3 weeks to dedicate days and nights to

this project.

The biggest issues I will have to think about are:

• Chapter 3.1.6: regarding semantic web I did not study yet in detail the topic


and I will have to come back on this point later;

• Chapter 4: how to measure the gap of information between mass behavior

when looking for information on the Internet and the most rational behavior?;

• Chapter 5: The Google example;

• Re writing the thesis from an informal point of view this time (using the third

person of the singular instead of the first which does not sound very

academic);


Declaration

I certify that this work has been done by myself and only myself. All the

sources used for its realization have been well indicated.

Ronan CHARDONNEAU

European Master in Business Studies

Institut de Management de l'Université de Savoie d'Annecy (FR)

Università degli studi di Trento (IT)

Universität Kassel (GER)

Universidad de León (SP)

22th January, 2009


List of literature Alacra. (2008?): What does "Powered By Google" mean? URL: http://www.alacra.com/compliance-

search/faq.asp Last access: 23/01/2009

Alexa the Web information company. (2008): Top sites by country URL:

http://www.alexa.com/site/ds/top_500 Last access: 23/01/2009

Anonymous. (2007): Baidu au Japon? URL: http://asia.2803.com/chine/baidu-au-japon/ Last access:

23/01/2009

Arrington, M. (2008): Cuil On BusinessWeek's Most Successful of 2008 List. Huh? URL:

http://www.washingtonpost.com/wp-dyn/content/article/2008/12/29/AR2008122901227.html Last

access: 23/01/2009

Battelle, John. (2005): The birth of Google URL:

http://www.wired.com/wired/archive/13.08/battelle.html?tw=wn_tophead_4 Last access:

23/01/2009

Boswell, W. (2004?): AOL Search: How to Search with AOL Search URL:

http://websearch.about.com/od/enginesanddirectories/a/aol_search_2.htm Last access: 23/01/2009

Boucq, I. (2009): Yahoo et vos données persos... URL:

http://www.erenumerique.fr/yahoo_et_vos_donnees_persos_-news-15162.html Last access:

23/01/2009

Chardonneau, R. (2008): International Marketing: Google URL:

http://storage.canalblog.com/24/01/274197/31730758.pdf Last access: 23/01/2009

China Tech News. (2007): CCTV: Baidu Search Engine Fraud Exposed? URL:

http://www.chinatechnews.com/2007/05/31/5459-cctv-baidu-search-engine-fraud-exposed/ Last

access: 23/01/2009

China Daily. (2008): Baidu cuts revenue forecast on ad scandal URL:

http://www.chinadaily.com.cn/bizchina/2008-12/14/content_7302788.htm Last access: 23/01/2009

Chitu, A. (2008): Google Starts to Index the Invisible Web URL:

http://googlesystem.blogspot.com/2008/04/google-starts-to-index-invisible-web.html Last access:

23/01/2009

Crédoc, (2008): La diffusion des technologies de l'information et de la communication dans la société

française URL: www.arcep.fr/uploads/tx_gspublication/etude- credoc - 2008 -101208.pdf Last acess

23/01/2009

Crepuq. (2003): Etude sur les connaissances en recherche documentaire des étudiants entrant au 1er

cycle dans les universités québécoises URL:

http://www.crepuq.qc.ca/documents/bibl/formation/etude.pdf Last access: 23/01/2009

EduDoc. (2008): Enquête sur les compétences documentaires et informationnelles des étudiants qui

accèdent à l'enseignement supérieur en Communauté française de Belgique URL:

http://www.edudoc.be/synthese.pdf Last access: 23/01/2009


http://www.edudoc.be/synthese.pdf

http://www.crepuq.qc.ca/documents/bibl/formation/etude.pdf

http://www.arcep.fr/uploads/tx_gspublication/etude-credoc-2008-101208.pdf





http://googlesystem.blogspot.com/2008/04/google-starts-to-index-invisible-web.html

http://www.chinadaily.com.cn/bizchina/2008-12/14/content_7302788.htm

http://www.chinatechnews.com/2007/05/31/5459-cctv-baidu-search-engine-fraud-exposed/

http://storage.canalblog.com/24/01/274197/31730758.pdf

http://www.erenumerique.fr/yahoo_et_vos_donnees_persos_-news-15162.html

http://websearch.about.com/od/enginesanddirectories/a/aol_search_2.htm

http://en.wikipedia.org/wiki/Larry_Page

http://www.washingtonpost.com/wp-dyn/content/article/2008/12/29/AR2008122901227.html

http://asia.2803.com/chine/baidu-au-japon/

http://www.alexa.com/site/ds/top_500

http://www.querycat.com/question/2b0388c814d37cef39bfe2ea3f2b507e

http://www.querycat.com/question/2b0388c814d37cef39bfe2ea3f2b507e

Einhorn, B. (2007): Baidu Thinks It Can Play in Japan URL:

http://www.businessweek.com/globalbiz/content/feb2007/gb20070215_649662.htm?

chan=globalbiz_asia_technology Last access: 23/01/2009

Enquiro. (2008): Eye Tracking Studies URL:http://www.enquiroresearch.com/eyetracking-report.aspx

Last access: 23/01/2009

Enquiro. (2004): Search Engine Usage in North America URL:

http://pages.enquiroresearch.com/search-engine-usage-in-na.html?

source=Search_Engine_Usage_In_North_America_whitepaper Last access: 19/01/2009

Eternal Dreamer. (2008): Why Google is so Successful? URL:

http://crumja.wordpress.com/2008/05/20/why-google-is-so-successful/ Last access: 23/01/2009

Eye tools. (2009): Eyetools Eyetracking Research URL:

http://www.eyetools.com/inpage/research_google_eyetracking_heatmap.htm Last access:

23/01/2009

Grallet, G. (2009): Baidu, un autre Google s'éveille URL: http://www.lexpress.fr/actualite/high-

tech/baidu-un-autre-google-s-eveille_734826.html Last access: 23/01/2009

Houste, F. (2009): Russie: Yandex sera le moteur de recherche par défaut de Firefox URL:

http://www.search-engine-feng-shui.com/2009/01/russie-yandex-sera-le-moteur-de-recherche-par-

defaut-de-firefox/ Last access: 23/01/2009

Internet World Stats (2008): Internet Usage and Population in North America URL:

http://www.internetworldstats.com/stats14.htm Last access: 23/01/2009

Internet World Stats (2008): INTERNET USAGE STATISTICS

The Internet Big Picture URL:http://www.internetworldstats.com/stats.htm Last access:

23/01/2009

Internet World Stats. (2008): Internet Usage in Asia

URL:http://www.internetworldstats.com/stats3.htm Last access: 23/01/2009

Kahn, J. (2005): Yahoo helped Chinese to prosecute journalist URL:

http://www.iht.com/articles/2005/09/07/business/yahoo.php Last access: 23/01/2009

Kaplan, I. (2008): Bad Data Can Cost You Big Time URL:

http://www.federationofcredit.com/base/document/Newsletter/IKaplanSept08.html Last

access:23/01/2009

Koch, P. / Koch, S. (2009): How big is the Internet? URL: http://www.pandia.com/sew/383-

web size.html . Last access: 19/01/2009

Koch, P. / Koch, S. (1999-2006): A short and easy search engine tutorial URL:

http://www.pandia.com/goalgetter/index.html Last access: 19/01/2009

Leung, Hon-wing 梁漢榮. (2004): A study of computer science students' conceptions of information

literacy and their experiences in information search process and use URL:

http://sunzi.lib.hku.hk/hkuto/record/B29597730 Last access: 19/01/2009

Lipsman, A. (2007): 61 Billion Searches Conducted Worldwide in August/Google Ranks as Top

Global Search Property URL: http://www.comscore.com/press/release.asp?press=1802 Last


http://www.comscore.com/press/release.asp?press=1802

http://sunzi.lib.hku.hk/hkuto/record/B29597730

http://www.pandia.com/goalgetter/index.html

http://www.pandia.com/sew/383-web-size.html

http://www.pandia.com/sew/383-web-size.html

http://www.pandia.com/sew/383-web-

http://www.pandia.com/sew/383-

http://www.pandia.com/sew/383-

http://www.federationofcredit.com/base/document/Newsletter/IKaplanSept08.html

http://www.iht.com/articles/2005/09/07/business/yahoo.php

http://www.internetworldstats.com/stats3.htm



http://www.search-engine-feng-shui.com/2009/01/russie-yandex-sera-le-moteur-de-recherche-par-defaut-de-firefox/

http://www.search-engine-feng-shui.com/2009/01/russie-yandex-sera-le-moteur-de-recherche-par-defaut-de-firefox/

http://www.lexpress.fr/actualite/high-tech/baidu-un-autre-google-s-eveille_734826.html

http://www.lexpress.fr/actualite/high-tech/baidu-un-autre-google-s-eveille_734826.html

http://www.eyetools.com/inpage/research_google_eyetracking_heatmap.htm

http://crumja.wordpress.com/2008/05/20/why-google-is-so-successful/

http://pages.enquiroresearch.com/search-engine-usage-in-na.html?source=Search_Engine_Usage_In_North_America_whitepaper

http://pages.enquiroresearch.com/search-engine-usage-in-na.html?source=Search_Engine_Usage_In_North_America_whitepaper

http://www.enquiroresearch.com/eyetracking-report.aspx

http://www.businessweek.com/globalbiz/content/feb2007/gb20070215_649662.htm?chan=globalbiz_asia_technology

http://www.businessweek.com/globalbiz/content/feb2007/gb20070215_649662.htm?chan=globalbiz_asia_technology

access: 23/01/2009

Mar Hauksson, K. (2007): Global search report 2007 URL:

http://www.e3internet.com/downloads/global-search-report-2007.pdf Last access: 23/01/2009

Merriam Webster. (2001): Google - Definition from the Merriam-Webster Online Dictionary URL:

http://www.merriam-webster.com/dictionary/google Last access: 23/01/2009

Millward Brown. (2008): Top 100 most powerful brands 08 URL:

http://www.millwardbrown.com/Sites/optimor/Media/Pdfs/en/BrandZ/BrandZ-2008-Report.pdf


Netcraft (2009): January 2009 Web Server Survey URL:http://news.netcraft.com/archives/2009/01/16/

january_2009_web_server_survey.html Last access 23/01/2009

Ohayon, O. (2008): Google, moteur de recherche ou moteur de navigation? URL:

http://fr.techcrunch.com/2008/10/30/fr-google-moteur-de-recherche-ou-moteur-de-navigation/ Last

access: 23/01/2009

Peterson, Robert A. / Merino, Maria C. (2003): Consumer Information Search Behavior and the

Internet URL: http://www3.interscience.wiley.com/cgi-bin/fulltext/102525298/PDFSTART Last

access: 23/01/2009

Plesu, A. (2005): How Big Is the Internet? URL: http://news.softpedia.com/news/How-Big-Is-the-

Internet-10177.shtml Last access: 23/01/2009

Rafat, A. (2008): Czech Portal Seznam Could Fetch $900 Million; Google, Apax, Warburg and Others

in Fray URL: http://www.washingtonpost.com/wp-

dyn/content/article/2008/08/15/AR2008081502517.html Last access: 23/01/2009

Rosen, J. (2008): Google’s Gatekeepers URL:

http://www.nytimes.com/2008/11/30/magazine/30google-t.html?

_r=2&partner=rss&emc=rss&pagewanted=all Last access: 23/01/2009

Schwartz, B. (2009): Firefox Drops Google For Yandex In Russia, But Big Loser May Be Rambler

URL:http://searchengineland.com/firefox-drops-google-for-yandex-in-russia-but-big-loser-may-

be-rambler-16107 Last access: 23/01/2009

Tobin, R. / Hotchkiss, G. / Lee, P. (2008): Chinese Search Engine Engagement URL:

http://app.marketo.com/lp/enquiro/chinese-search-engine-engagement.html?

source=Chinese_Search_Engine_Engagement_whitepaper Last access: 23/01/2009

Université de Lyon. (2007): De la documentation au plagiat URL:

http://www.compilatio.net/files/sixdegres-univ-lyon_enquete-plagiat_sept07.pdf Last access:

23/01/2009

URFIST de Rennes. (2008): ENQUÊTE SUR LES BESOINS DE FORMATION DES

DOCTORANTS Á LA MAÎTRISE DE L’INFORMATION SCIENTIFIQUE dans les Ecoles

doctorales de Bretagne URL: http://www.uhb.fr/urfist/files/Synthese_Enquete_SCD-URFIST.pdf


Wilsdon, N. (2007): Global Search Report 2007 URL: http://www.e3internet.com/downloads/global-

search-report-2007.pdf Last access: 23/01/2009

Xiti Monitor. (2008): Baromètre des moteurs - Novembre 2008


http://www.e3internet.com/downloads/global-search-report-2007.pdf


http://www.uhb.fr/urfist/files/Synthese_Enquete_SCD-URFIST.pdf

http://www.compilatio.net/files/sixdegres-univ-lyon_enquete-plagiat_sept07.pdf

http://app.marketo.com/lp/enquiro/chinese-search-engine-engagement.html?source=Chinese_Search_Engine_Engagement_whitepaper

http://app.marketo.com/lp/enquiro/chinese-search-engine-engagement.html?source=Chinese_Search_Engine_Engagement_whitepaper

http://searchengineland.com/firefox-drops-google-for-yandex-in-russia-but-big-loser-may-be-rambler-16107

http://searchengineland.com/firefox-drops-google-for-yandex-in-russia-but-big-loser-may-be-rambler-16107

http://www.nytimes.com/2008/11/30/magazine/30google-t.html?_r=2&partner=rss&emc=rss&pagewanted=all

http://www.nytimes.com/2008/11/30/magazine/30google-t.html?_r=2&partner=rss&emc=rss&pagewanted=all



http://news.softpedia.com/news/How-Big-Is-the-Internet-10177.shtml

http://news.softpedia.com/news/How-Big-Is-the-Internet-10177.shtml

http://www3.interscience.wiley.com/cgi-bin/fulltext/102525298/PDFSTART

http://fr.techcrunch.com/2008/10/30/fr-google-moteur-de-recherche-ou-moteur-de-navigation/

http://news.netcraft.com/archives/2009/01/16/january_2009_web_server_survey.html

http://news.netcraft.com/archives/2009/01/16/january_2009_web_server_survey.html

http://www.millwardbrown.com/Sites/optimor/Media/Pdfs/en/BrandZ/BrandZ-2008-Report.pdf

http://dictionary.reference.com/browse/google


Encore un mois porteur pour Google URL: http://www.xitimonitor.com/fr-fr/barometre-des-

moteurs/barometre-des-moteurs-novembre-2008/index-1-1-6-152.html Last access: 23/01/2009


http://www.xitimonitor.com/fr-fr/barometre-des-moteurs/barometre-des-moteurs-novembre-2008/index-1-1-6-152.html



Afterword

The first idea of Larry Page (one of the creator of Google) when making his

PhD thesis was to « explore the mathematical properties of the World Wide Web,

understanding its link structure as a huge graph »41.

Without knowing it I went a bit in the same direction evaluating how to get

the best out of the web.

I hope I will keep moving forward on this topic and maybe one day set up a

correct theory on the subject.

41 Battelle, John. (2005): The birth of Google.


Risks of search engine dependency and its influence on data quality

Technology

Transcript of Risks of search engine dependency and its influence on data quality