ITIS 1210 Introduction to Web-Based Information Systems Internet Research Two How Search Engines...


Page 1: ITIS 1210 Introduction to Web-Based Information Systems Internet Research Two How Search Engines Rank Pages & Constructing Complex Searches.

ITIS 1210
Introduction to Web-Based Information Systems

Internet Research Two
How Search Engines Rank Pages & Constructing Complex Searches

Page 2:

How do Search Engines Crawl?

Gathering data from the Web is like browsing:

1. Visit a page.
2. Record all the words on the page.
3. Choose a link you haven't seen/recorded.
4. Click on the link.

Repeat 8 billion times.
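The four steps above can be sketched in a few lines of Python. This is only an illustration, not how any real search engine is built: the `TOY_WEB` dictionary and its `*.example` addresses are made up, and the breadth-first visiting order is an assumption (the slide does not say which unseen link to choose).

```python
from collections import deque

# Hypothetical miniature "web": each address maps to (page text, outgoing links).
TOY_WEB = {
    "a.example": ("cats and dogs", ["b.example", "c.example"]),
    "b.example": ("dogs only", ["a.example"]),
    "c.example": ("solar energy", ["b.example", "d.example"]),
    "d.example": ("cats", []),
}

def crawl(start, web):
    """Visit pages breadth-first, recording the words found on each page."""
    index = {}                  # address -> list of words recorded on that page
    frontier = deque([start])   # links chosen but not yet visited
    seen = {start}              # links already visited or queued
    while frontier:
        url = frontier.popleft()        # step 4: "click on the link"
        text, links = web[url]          # step 1: visit the page
        index[url] = text.split()       # step 2: record all the words
        for link in links:              # step 3: choose links not seen before
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return index

index = crawl("a.example", TOY_WEB)
for url, words in index.items():
    print(url, words)
```

The `seen` set and the `frontier` queue are the two pieces of bookkeeping a crawler cannot do without: one remembers where it has been, the other where it still has to go.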

Page 3:

Crawling the Web

One person with a Web browser, following one link per second.

How long does it take to browse the surface Web (8 billion pages)?
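The question above is a straight division. A quick sketch, assuming a 365.25-day year and no breaks at all:

```python
PAGES = 8_000_000_000                      # size of the surface Web used on the slide
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60   # assumption: average year length

# One page per second means the number of pages equals the number of seconds.
years = PAGES / SECONDS_PER_YEAR
print(f"{years:.1f} years")
```

Under these assumptions the answer comes out at roughly two and a half centuries for a single person.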

Page 4:

Crawling the Web

How many people would it take to crawl the surface Web in a week? If each person follows one link per second (with no sleep):

One week ≈ six hundred thousand seconds
Eight billion / six hundred thousand ≈ thirteen thousand people
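The same estimate with exact seconds instead of the rounded figure, as a small sketch (the only inputs are the slide's 8 billion pages and one link per second per person):

```python
import math

PAGES = 8_000_000_000
SECONDS_PER_WEEK = 7 * 24 * 60 * 60   # 604,800 seconds

# Each person covers one page per second, so one person covers
# SECONDS_PER_WEEK pages in a week; round up to whole people.
people = math.ceil(PAGES / SECONDS_PER_WEEK)
print(people)
```

The exact figure is a little over thirteen thousand, which matches the rounded estimate on the slide.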

Page 5:

Challenges:

Remembering where you've been
Remembering where you haven't been
Storing all the data

Page 6:

A (small) Server Farm

Page 7:

The Deep Web

Not all pages get crawled:
  Private pages on Intranets (company networks)
  Pages that people don't want crawled
  Dynamic content pages (from databases)

Dynamic content pages make the size of the Web effectively infinite!

Page 8:

Dynamic Content Example

zillow.com
Won't be indexed

Page 9:

Identifying High Quality Web Pages

Google has ranked billions of Web pages by "quality".

You enter your search terms:

UNC Charlotte HCI

Google finds the highest quality page associated with these search terms.

Page 10:

Google Pagerank

Pretend you're surfing the Web randomly.

To move from page to page you could:

1) type in an address (www.sis.uncc.edu)
   includes using a bookmark

OR

2) follow a link.

Pagerank measures how likely you are to reach a particular page through random surfing (either 1 or 2). The main idea is that links to your page from important web pages indicate that your page is important.

Page 11:

Computing Pagerank
(what's the probability of getting to this page?)

Q = Web page
A, B, C, ... = Pages pointing to Q
L(A), L(B), L(C), ... = number of links on each page

Pagerank of Q: R(Q) = (1-d) + d·(R(A)/L(A) + R(B)/L(B) + ...)

d represents the relative chance of following a link to page Q and 1-d represents the relative chance of going directly to page Q (via typing in the address or using a bookmark).

Usually these are: d = 0.9, (1-d) = 0.1
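The formula above translates almost word for word into code. This is a sketch only: the function name `pagerank_of` and the choice to pass each inbound page as a `(rank, link_count)` pair are illustrative assumptions, not part of the slides.

```python
def pagerank_of(inbound, d=0.9):
    """R(Q) = (1 - d) + d * (R(A)/L(A) + R(B)/L(B) + ...).

    inbound: one (R(P), L(P)) pair for every page P that points to Q.
    d:       relative chance of arriving by following a link (0.9 on the slide).
    """
    return (1 - d) + d * sum(rank / links for rank, links in inbound)

# Example: Q is linked from A (rank 1.0, 2 links) and B (rank 1.0, 1 link).
r_q = pagerank_of([(1.0, 2), (1.0, 1)])
print(round(r_q, 5))
```

A page with no inbound links gets only the (1-d) share, because the sum over its inbound pages is empty.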

Page 12:

Computing Pagerank

Pretend the Web has only four pages:

W  X  Y  Z

Links:

W → X
Y → W
Y → Z
Z → W

L(W)=1  L(X)=0  L(Y)=2  L(Z)=1

Which page has the highest "quality"?

Page 13:

Computing Pagerank

Links: W → X, Y → W, Y → Z, Z → W

L(W)=1  L(X)=0  L(Y)=2  L(Z)=1

R(W) = (1-d) + d * (R(Y)/L(Y) + R(Z)/L(Z)) = 0.1 + 0.9 * (R(Y)/2 + R(Z)/1)

R(X) = 0.1 + 0.9 * R(W)

R(Y) = 0.1

R(Z) = 0.1 + 0.9 * (R(Y)/2)

Now, solve for: R(W), R(X), R(Y), R(Z)

Page 14:

Computing Values for R(W), R(X), R(Y) and R(Z)

We could use algebra to find the values, in the same way we could solve for x and y in:

x = 1 + 2x + y
y = 2 + x + 3y
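For reference, the small two-variable system above can be worked through by moving everything to one side and subtracting. This is just one way to do it:

```latex
\begin{aligned}
x = 1 + 2x + y \;&\Longrightarrow\; x + y = -1 \\
y = 2 + x + 3y \;&\Longrightarrow\; x + 2y = -2 \\
(x + 2y) - (x + y) = -2 - (-1) \;&\Longrightarrow\; y = -1, \quad x = -1 - y = 0
\end{aligned}
```

Substituting back confirms the result: 1 + 2(0) + (-1) = 0 and 2 + 0 + 3(-1) = -1.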

Page 15:

Algebraic Solution

w = R(W)  x = R(X)  y = R(Y)  z = R(Z)

w = 0.1 + 0.45y + 0.9z

x = 0.1 + 0.9w

y = 0.1

z = 0.1 + 0.45y

Solving by substitution:

y = 0.1

z = 0.145

w = 0.2755

x = 0.34795

But solving for eight billion variables is hard. Instead, we'll use fixed point iteration.
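The substitution can be checked with a few lines of Python. The sketch relies on the fact that, in this four-page example, each value depends only on values already computed (y first, then z, then w, then x); that ordering is specific to this example and would not exist on a Web with link cycles.

```python
d = 0.9

y = (1 - d)                      # no pages link to Y
z = (1 - d) + d * (y / 2)        # Y links to Z; Y has 2 links
w = (1 - d) + d * (y / 2 + z)    # Y (2 links) and Z (1 link) link to W
x = (1 - d) + d * w              # W links to X; W has 1 link

print(round(y, 5), round(z, 5), round(w, 5), round(x, 5))
```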

Page 16:

Solution by Fixed-Point Iteration

Start with initial estimates of PageRank for each page:
R(W) = 1.0  R(X) = 1.0  R(Y) = 1.0  R(Z) = 1.0

Apply equations to compute new estimates:
new R(W) = 0.1 + 0.9 * (R(Y)/2 + R(Z)) = 0.1 + 0.9 * (1.0/2 + 1.0) = 1.45
new R(X) = 0.1 + 0.9 * R(W) = 0.1 + 0.9 * 1.0 = 1.0
new R(Y) = 0.1
new R(Z) = 0.1 + 0.9 * (R(Y)/2) = 0.1 + 0.9 * (1.0/2) = 0.55

Page 17:

Solution by Fixed-Point Iteration

Start with updated estimates:
R(W) = 1.45  R(X) = 1.0  R(Y) = 0.1  R(Z) = 0.55

Apply equations to compute new estimates:
new R(W) = 0.1 + 0.9 * (R(Y)/2 + R(Z)) = 0.1 + 0.9 * (0.1/2 + 0.55) = 0.64
new R(X) = 0.1 + 0.9 * R(W) = 0.1 + 0.9 * 1.45 = 1.405
new R(Y) = 0.1
new R(Z) = 0.1 + 0.9 * (R(Y)/2) = 0.1 + 0.9 * (0.1/2) = 0.145

Page 18:

Solution by Iteration

iteration  R(W)     R(X)     R(Y)     R(Z)
0          1.00000  1.00000  1.00000  1.00000
1          1.45000  1.00000  0.10000  0.55000
2          0.64000  1.40500  0.10000  0.14500
3          0.27550  0.67600  0.10000  0.14500
4          0.27550  0.34795  0.10000  0.14500
5          0.27550  0.34795  0.10000  0.14500
...        ...      ...      ...      ...

Compute new estimates from the old until the estimates stop changing. Note that this is the same answer as the traditional algebraic approach, but this way scales better.
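The iteration can be reproduced with a short Python sketch. The link table is taken from the four-page example; the function names, the stopping tolerance and the step limit are assumptions added for illustration.

```python
D = 0.9

# Links from the four-page example: W -> X, Y -> W, Y -> Z, Z -> W.
LINKS = {"W": ["X"], "X": [], "Y": ["W", "Z"], "Z": ["W"]}

def step(ranks):
    """Compute new estimates for every page from the old estimates."""
    new = {}
    for page in LINKS:
        # Sum R(P)/L(P) over every page P that links to this page.
        inbound = sum(ranks[src] / len(out)
                      for src, out in LINKS.items() if page in out)
        new[page] = (1 - D) + D * inbound
    return new

def iterate(ranks, tolerance=1e-9, max_steps=100):
    """Repeat step() until the estimates stop changing (within tolerance)."""
    history = [ranks]
    for _ in range(max_steps):
        new = step(ranks)
        history.append(new)
        if all(abs(new[p] - ranks[p]) < tolerance for p in LINKS):
            break
        ranks = new
    return history

history = iterate({page: 1.0 for page in LINKS})
for i, ranks in enumerate(history):
    print(i, " ".join(f"{ranks[p]:.5f}" for p in "WXYZ"))
```

Every new estimate is computed from the previous row only, which is why X lags one step behind W in the table.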

Page 19:

Final Pageranks

highest  page X  R(X) = 0.34795
         page W  R(W) = 0.27550
         page Z  R(Z) = 0.14500
lowest   page Y  R(Y) = 0.10000

Page 20:

Final Pageranks

[Figure: link graph of pages W, X, Y and Z, each labeled with its link count (Y = 2, W = 1, Z = 1, X = 0) and final Pagerank (Y = 0.10000, W = 0.27550, X = 0.34795, Z = 0.14500)]

Page 21:

How does Google Use Pagerank?

You enter search terms, such as "UNC Charlotte HCI"

Google finds all the pages that have all those words on them

Of all those pages, Google will list the ones with the highest page rank first, but…

…other 'magic ingredients' are used by Google: trade secrets of their algorithms.
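The two steps described above (keep only pages containing all the words, then order by Pagerank) can be sketched as follows. The page addresses, texts and rank values are invented for the example, and the sketch deliberately ignores the undisclosed "magic ingredients".

```python
# Hypothetical pages: address -> (page text, Pagerank). All values are made up.
PAGES = {
    "hci.example/lab": ("unc charlotte hci lab", 0.35),
    "uncc.example/home": ("unc charlotte home page", 0.28),
    "hci.example/course": ("hci course at unc charlotte", 0.15),
    "other.example": ("hci conference", 0.10),
}

def search(query, pages):
    """Return addresses of pages containing ALL query words, best rank first."""
    terms = query.lower().split()
    matches = [url for url, (text, _rank) in pages.items()
               if all(term in text.lower().split() for term in terms)]
    return sorted(matches, key=lambda url: pages[url][1], reverse=True)

results = search("UNC Charlotte HCI", PAGES)
print(results)
```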

Page 22:

Introduction

Basic queries are somewhat limited
  One or two keywords
  Simple relationships
  Limited syntax

Complex queries provide more power
  Keywords & phrases can be connected to form more complex relationships
  Search filters can be employed to limit results

Page 23:

Understanding Boolean Operators

Syntax
  Rules for combining simple words to form complex sentences

Search engine syntax implemented by applying Boolean logic

George Boole, 1815-1864

Page 24:

Understanding Boolean Operators

Boolean logic
  Keywords act as nouns
  Boolean operators act as conjunctions
    They define the connections between keywords
  Illustrated with Venn diagrams

John Venn, 1834-1923

Page 25:

Understanding Boolean Operators

[Venn diagram: the set "cats" within the WWW]

All web pages containing the word cats

Page 26:

Understanding Boolean Operators

[Venn diagram: the set "dogs" within the WWW]

All web pages containing the word dogs

Page 27:

Understanding Boolean Operators

[Venn diagram: overlap of the sets "cats" and "dogs" within the WWW]

All web pages containing the words cats and dogs

Intersection of the two sets

Searches containing both words

Page 28:

Understanding Boolean Operators

[Venn diagram: both sets "cats" and "dogs" within the WWW]

All web pages containing the words cats or dogs

Union of the two sets

Searches containing either word

Page 29:

Understanding Boolean Operators

[Venn diagram: the set "cats" minus the set "dogs" within the WWW]

All web pages containing the words cats and not dogs

Exclusion of the dogs set

Searches containing one word but not the other

Page 30:

Understanding Boolean Operators

[Venn diagram: the set "dogs" minus the set "cats" within the WWW]

All web pages containing the words dogs and not cats

Exclusion of the cats set

Searches containing one word but not the other

Page 31:

Understanding Boolean Operators

Boolean operators
  AND
  OR
  NOT

Instruct the engine on how to combine keywords to produce results

Always use capital letters to avoid confusion with and, or, not as keywords

Page 32:

Understanding Boolean Operators

AND
  All these keywords must be on the Web page

OR
  These keywords may or may not be on the Web page
  At least one of them must be

NOT
  None of these keywords can be on the Web page
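The three operators correspond directly to set operations, which is what the Venn diagrams illustrate. A minimal sketch using Python sets, where the page names `p1` to `p4` are placeholders for Web pages:

```python
# Placeholder result sets: which pages contain each keyword.
cats = {"p1", "p2", "p3"}
dogs = {"p2", "p3", "p4"}

cats_and_dogs = cats & dogs   # AND: intersection, pages with both words
cats_or_dogs = cats | dogs    # OR: union, pages with at least one word
cats_not_dogs = cats - dogs   # AND NOT: pages with cats but without dogs
dogs_not_cats = dogs - cats   # AND NOT: pages with dogs but without cats

print(sorted(cats_and_dogs), sorted(cats_or_dogs))
print(sorted(cats_not_dogs), sorted(dogs_not_cats))
```

AND can only shrink a result set and OR can only grow it, which is why the later slides describe them as narrowing and expanding a search.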

Page 33:

Understanding Boolean Operators

Default operator
  Some engines have a default Boolean operator
    Usually AND
    Might be OR

Some engines may search for multiple words as phrases

Page 34:

Understanding Boolean Operators

Boolean operators may be
  Allowed on main page
  Confined to Advanced search pages

Some engines use symbols instead
  + for AND
  - for NOT
  No space between sign and word:
    +solar +energy -windmill
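How an engine might read the symbol form can be sketched as below. This is a toy parser under stated assumptions: the function names are invented, words without a sign are simply collected as "optional", and real engines differ in how they treat them.

```python
def parse_symbol_query(query):
    """Split a query such as '+solar +energy -windmill' into word lists."""
    required, excluded, optional = [], [], []
    for token in query.split():
        if token.startswith("+") and len(token) > 1:
            required.append(token[1:])    # + means the word must appear (AND)
        elif token.startswith("-") and len(token) > 1:
            excluded.append(token[1:])    # - means the word must not appear (NOT)
        else:
            optional.append(token)        # no sign: left to the engine's default
    return required, excluded, optional

def matches(page_words, required, excluded):
    """True if the page has every required word and no excluded word."""
    return (all(word in page_words for word in required)
            and not any(word in page_words for word in excluded))

required, excluded, optional = parse_symbol_query("+solar +energy -windmill")
print(required, excluded, optional)
print(matches({"solar", "energy", "panels"}, required, excluded))
```

The sketch also shows why the sign must touch the word: the parser splits on spaces, so a lone "+" or "-" would be read as a separate token.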

Page 35:

Narrowing Searches with AND

AND
  Limits results
  Forces inclusion of a stop word

Indicates that all keywords must be found on Web page

Adding more ANDed keywords limits search more

Results should be more relevant because the keyword list has expanded

Page 36:

Narrowing Searches with AND

Example:
  "solar energy association" AND Portland

[Venn diagram: overlap of the sets "Solar energy association" and "Portland" within the WWW]

Page 37:

Narrowing Searches with AND

Example:
  Henry +I same as "Henry I"

[Venn diagram: the set "Henry I" within the WWW]

Page 38:

Expanding Searches with OR

OR expands results
  Useful if you didn't get enough returns from your first search
  The more keywords you add, the more results you should get

Every page returned must have at least one of the keywords on it
  Good to use when you have synonyms

Page 39:

Expanding Searches with OR

Example:
  oregon OR northwest

[Venn diagram: both sets "oregon" and "northwest" within the WWW]

Page 40:

Restricting Queries with AND NOT

AND NOT excludes the keyword that follows NOT

Limits your search
  Produces fewer results

Useful if first search returns irrelevant results
  Use AND NOT to get rid of those results

Page 41:

Restricting Queries with AND NOT

Equivalent forms:
  cats AND NOT dogs
  cats AND-NOT dogs
  cats NOT dogs
  cats -dogs

Page 42:

Restricting Queries with AND NOT

Example:
  "solar energy association" AND portland AND NOT maine

[Venn diagram: overlap of the sets "Solar energy association" and "portland", minus the set "maine"]

Page 43:

Multiple Boolean Operators

Boolean operators allow you to focus a search

Any logical combination of operators is allowed

If it makes sense when spoken like a sentence it's probably OK to use

Order of operations is usually left to right
  Use parentheses to organize terms

Page 44:

Multiple Boolean Operators

Bad example:
  constitution +american OR "united states"

[Venn diagram: sets "constitution", "american" and "united states"]

Page 45:

Multiple Boolean Operators

Good example:
  constitution +(american OR "united states")

[Venn diagram: sets "constitution", "american" and "united states"]
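The difference between the bad and the good example shows up clearly with sets. In this sketch the page names are placeholders, and the left-to-right reading of the ungrouped query is the "usual" order of operations mentioned earlier; a particular engine may evaluate it differently.

```python
# Placeholder result sets for each term.
constitution = {"p1", "p2", "p3"}
american = {"p2", "p4"}
united_states = {"p3", "p5"}

# Bad example, read left to right:
#   (constitution AND american) OR "united states"
left_to_right = (constitution & american) | united_states

# Good example, with parentheses:
#   constitution AND (american OR "united states")
grouped = constitution & (american | united_states)

print(sorted(left_to_right))
print(sorted(grouped))
```

In the ungrouped form, page `p5` is returned even though it never mentions the constitution; the parentheses keep every result inside the constitution set.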