ITIS 1210
Introduction to Web-Based Information Systems
Internet Research Two
How Search Engines Rank Pages & Constructing Complex Searches
How do Search Engines Crawl?
Gathering data from the Web is like browsing:
1. Visit a page.
2. Record all the words on the page.
3. Choose a link you haven’t seen/recorded.
4. Click on the link.
Repeat 8 billion times.
Crawling the Web
One person with a Web browser, following one link per second.
How long does it take to browse the surface Web (8 billion pages)?
Crawling the Web
How many people would it take to crawl the surface Web in a week, if each person follows one link per second (with no sleep)?
One week = about six hundred thousand seconds
Eight billion / six hundred thousand = about thirteen thousand people
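The two questions above come down to simple division. A minimal sketch of the arithmetic, assuming the slides' figures of 8 billion pages and one link per second (the variable names are ours):

```python
PAGES = 8_000_000_000                   # surface Web size used on the slides
SECONDS_PER_WEEK = 7 * 24 * 60 * 60     # 604,800, about six hundred thousand
SECONDS_PER_YEAR = 365 * 24 * 60 * 60   # ignoring leap years

# One person, one link per second: total seconds equals total pages.
years_for_one_person = PAGES / SECONDS_PER_YEAR

# Crawl finished in one week: divide the pages among the week's seconds.
people_for_one_week = PAGES / SECONDS_PER_WEEK

print(f"One person needs about {years_for_one_person:.0f} years")
print(f"One week needs about {people_for_one_week:,.0f} people")
```

So a single person would need roughly two and a half centuries, which is why crawling is spread over many machines.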
Challenges:
Remembering where you’ve been
Remembering where you haven’t been
Storing all the data
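The crawl steps and the three challenges can be sketched on a toy example. This is only an illustration, not how a real search engine is built: `TOY_WEB` is a made-up in-memory stand-in for the Web, `seen` remembers where we have been, `frontier` remembers where we have not been yet, and `index` stores the data.

```python
from collections import deque

# Hypothetical stand-in for the Web: page -> (text on the page, outgoing links).
TOY_WEB = {
    "a.html": ("cats and dogs", ["b.html", "c.html"]),
    "b.html": ("solar energy", ["a.html"]),
    "c.html": ("dogs", ["b.html", "d.html"]),
    "d.html": ("cats", []),
}

def crawl(start):
    index = {}                  # storing the data: word -> pages containing it
    seen = {start}              # pages already visited or already queued
    frontier = deque([start])   # pages queued but not yet visited
    while frontier:
        page = frontier.popleft()            # 1. visit a page
        text, links = TOY_WEB[page]
        for word in text.split():            # 2. record all the words
            index.setdefault(word, set()).add(page)
        for link in links:                   # 3. choose links not seen before
            if link not in seen:
                seen.add(link)
                frontier.append(link)        # 4. follow them on a later pass
    return index

index = crawl("a.html")
print(sorted(index["cats"]))
print(sorted(index["dogs"]))
```

Without the `seen` set, the link from b.html back to a.html would send the crawler around in circles forever.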
A (small) Server Farm
The Deep Web
Not all pages get crawled:
  Private pages on intranets (company networks)
  Pages that people don’t want crawled
  Dynamic content pages (from databases)
Dynamic content pages make the size of the Web effectively infinite!
Dynamic Content Example
zillow.com
Won’t be indexed
Identifying High Quality Web Pages
Google has ranked billions of Web pages by "quality".
You enter your search terms:
  UNC Charlotte HCI
Google finds the highest quality page associated with these search terms.
Google Pagerank
Pretend you're surfing the Web randomly.
To move from page to page you could:
1) type in an address (www.sis.uncc.edu); this includes using a bookmark
OR
2) follow a link.
Pagerank measures how likely you are to reach a particular page through random surfing (either 1 or 2).
The main idea is that links to your page from important web pages indicate that your page is important.
Computing Pagerank
(what’s the probability of getting to this page?)
Q = Web page
A, B, C, ... = Pages pointing to Q
L(A), L(B), L(C), ... = number of links on each page
Pagerank of Q: R(Q) = (1-d) + d·(R(A)/L(A) + R(B)/L(B) + ...)
d represents the relative chance of following a link to page Q, and 1-d represents the relative chance of going directly to page Q (via typing in the address or using a bookmark).
The values used in this example are: d = 0.9, (1-d) = 0.1
Computing Pagerank
Pretend the Web has only four pages: W, X, Y, Z
Links: W→X, Y→W, Y→Z, Z→W
L(W)=1 L(X)=0 L(Y)=2 L(Z)=1
Which page has the highest “quality”?
Computing Pagerank
Links: W→X, Y→W, Y→Z, Z→W
L(W)=1 L(X)=0 L(Y)=2 L(Z)=1
R(W) = (1-d) + d * (R(Y)/L(Y) + R(Z)/L(Z)) = 0.1 + 0.9 * (R(Y)/2 + R(Z)/1)
R(X) = 0.1 + 0.9 * R(W)
R(Y) = 0.1
R(Z) = 0.1 + 0.9 * (R(Y)/2)
Now, solve for: R(W), R(X), R(Y), R(Z)
Computing Values for R(W), R(X), R(Y) and R(Z)
We could use algebra to find the values, in the same way we could solve for x and y in:
x = 1 + 2x + y
y = 2 + x + 3y
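Algebraic Solution
w = R(W) x = R(X) y = R(Y) z = R(Z)
Equations:
w = 0.1 + 0.45y + 0.9z
x = 0.1 + 0.9w
y = 0.1
z = 0.1 + 0.45y
Solution, substituting in the order y, z, w, x:
y = 0.1
z = 0.1 + 0.45(0.1) = 0.145
w = 0.1 + 0.45(0.1) + 0.9(0.145) = 0.2755
x = 0.1 + 0.9(0.2755) = 0.34795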
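The substitution can be checked in a few lines. This sketch hard-codes the four equations for this particular link graph (with d = 0.9); it is a check of the arithmetic, not a general solver.

```python
d = 0.9  # chance of following a link, as on the slides

# Substitute in dependency order: y needs nothing, z needs y,
# w needs y and z, x needs w.
y = (1 - d)                        # no pages link to Y
z = (1 - d) + d * (y / 2)          # Y links to Z; L(Y) = 2
w = (1 - d) + d * (y / 2 + z / 1)  # Y and Z link to W; L(Z) = 1
x = (1 - d) + d * (w / 1)          # W links to X; L(W) = 1

print(f"w={w:.5f} x={x:.5f} y={y:.5f} z={z:.5f}")
```

This only works so neatly because the example graph has no cycles among the pages that feed each equation; on the real Web the equations depend on each other in loops.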
But solving for eight billion variables is hard. Instead, we'll use fixed point iteration.
Solution by Fixed-Point Iteration
Start with initial estimates of PageRank for each page:
R(W) = 1.0 R(X) = 1.0 R(Y) = 1.0 R(Z) = 1.0
Apply equations to compute new estimates:
new R(W) = 0.1 + 0.9 * (R(Y)/2 + R(Z)) = 0.1 + 0.9 * (1.0/2 + 1.0) = 1.45
new R(X) = 0.1 + 0.9 * R(W) = 0.1 + 0.9 * 1.0 = 1.0
new R(Y) = 0.1
new R(Z) = 0.1 + 0.9 * (R(Y)/2) = 0.1 + 0.9 * (1.0/2) = 0.55
Solution by Fixed-Point Iteration
Start with updated estimates:
R(W) = 1.45 R(X) = 1.0 R(Y) = 0.1 R(Z) = 0.55
Apply equations to compute new estimates:
new R(W) = 0.1 + 0.9 * (R(Y)/2 + R(Z)) = 0.1 + 0.9 * (0.1/2 + 0.55) = 0.64
new R(X) = 0.1 + 0.9 * R(W) = 0.1 + 0.9 * 1.45 = 1.405
new R(Y) = 0.1
new R(Z) = 0.1 + 0.9 * (R(Y)/2) = 0.1 + 0.9 * (0.1/2) = 0.145
Solution by Iteration
iteration  R(W)     R(X)     R(Y)     R(Z)
0          1.00000  1.00000  1.00000  1.00000
1          1.45000  1.00000  0.10000  0.55000
2          0.64000  1.40500  0.10000  0.14500
3          0.27550  0.67600  0.10000  0.14500
4          0.27550  0.34795  0.10000  0.14500
5          0.27550  0.34795  0.10000  0.14500
...        ...      ...      ...      ...
Compute new estimates from the old until the estimates stop changing. Note that this is the same answer as the traditional algebraic approach, but this way scales better.
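The iteration can be reproduced with a short script. This is a minimal sketch for the four-page example only: the names `LINKS`, `D`, and `step`, the stopping tolerance, and the cap of 100 rounds are our own choices, and a real search engine's computation is far larger and more involved.

```python
# Links from the four-page example: W->X, Y->W, Y->Z, Z->W.
LINKS = {"W": ["X"], "X": [], "Y": ["W", "Z"], "Z": ["W"]}
D = 0.9  # chance of following a link, as on the slides

def step(ranks):
    """Apply R(Q) = (1-d) + d * (R(A)/L(A) + ...) once to every page."""
    new_ranks = {}
    for page in LINKS:
        # Sum R(A)/L(A) over every page A that links to this page.
        incoming = sum(ranks[src] / len(out)
                       for src, out in LINKS.items() if page in out)
        new_ranks[page] = (1 - D) + D * incoming
    return new_ranks

ranks = {page: 1.0 for page in LINKS}   # iteration 0: every estimate is 1.0
history = [ranks]
for _ in range(100):                    # safety cap, an arbitrary choice
    new_ranks = step(ranks)
    history.append(new_ranks)
    converged = all(abs(new_ranks[p] - ranks[p]) < 1e-12 for p in LINKS)
    ranks = new_ranks
    if converged:                       # estimates stopped changing
        break

for i, row in enumerate(history):
    print(i, " ".join(f"{row[p]:.5f}" for p in "WXYZ"))
```

Each round uses only the previous round's estimates, which is why the values settle a row at a time as the corrections flow along the links.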
Final Pageranks
highest  page X  R(X) = 0.34795
         page W  R(W) = 0.27550
         page Z  R(Z) = 0.14500
lowest   page Y  R(Y) = 0.10000
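Final Pageranks
[Figure: link graph of the four pages, each labeled with its number of outgoing links and its final Pagerank: W (1 link, 0.27550), X (0 links, 0.34795), Y (2 links, 0.10000), Z (1 link, 0.14500)]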
How does Google Use Pagerank?
You enter search terms, such as “UNC Charlotte HCI”
Google finds all the pages that have all those words on them
Of all those pages, Google will list the ones with the highest page rank first, but…
…other ‘magic ingredients’ are used by Google: trade secrets of their algorithms.
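The two steps described here (keep pages containing all the words, then order by Pagerank) can be sketched as follows. Everything in `PAGES` is invented for illustration: the ranks are borrowed from the four-page example and the word sets are made up. The 'magic ingredients' are, of course, not modeled.

```python
# Hypothetical index: page name -> its Pagerank and the words on it.
PAGES = {
    "W": {"rank": 0.27550, "words": {"unc", "charlotte", "hci", "lab"}},
    "X": {"rank": 0.34795, "words": {"unc", "charlotte", "hci"}},
    "Y": {"rank": 0.10000, "words": {"unc", "charlotte"}},
    "Z": {"rank": 0.14500, "words": {"hci", "unc", "charlotte", "news"}},
}

def search(query):
    terms = query.lower().split()
    # Step 1: keep only the pages that contain ALL of the search terms.
    hits = [name for name, page in PAGES.items()
            if all(term in page["words"] for term in terms)]
    # Step 2: list the pages with the highest Pagerank first.
    return sorted(hits, key=lambda name: PAGES[name]["rank"], reverse=True)

results = search("UNC Charlotte HCI")
print(results)
```

Page Y is dropped in step 1 because it lacks the word "hci", no matter what its rank is.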
Introduction
Basic queries are somewhat limited
  One or two keywords
  Simple relationships
  Limited syntax
Complex queries provide more power
  Keywords & phrases can be connected to form more complex relationships
  Search filters can be employed to limit results
Understanding Boolean Operators
Syntax
  Rules for combining simple words to form complex sentences
Search engine syntax is implemented by applying Boolean logic
George Boole, 1815-1864
Understanding Boolean Operators
Boolean logic
  Keywords act as nouns
  Boolean operators act as conjunctions
    They define the connections between keywords
  Illustrated with Venn diagrams
John Venn, 1834-1923
Understanding Boolean Operators
[Venn diagram: the set "cats" inside the WWW]
All web pages containing the word cats
Understanding Boolean Operators
[Venn diagram: the set "dogs" inside the WWW]
All web pages containing the word dogs
Understanding Boolean Operators
[Venn diagram: the overlap of "cats" and "dogs" inside the WWW]
All web pages containing the words cats and dogs
Intersection of the two sets
Searches containing both words
Understanding Boolean Operators
[Venn diagram: "cats" and "dogs" together inside the WWW]
All web pages containing the words cats or dogs
Union of the two sets
Searches containing either word
Understanding Boolean Operators
[Venn diagram: "cats" with the "dogs" region removed, inside the WWW]
All web pages containing the words cats and not dogs
Exclusion of the dogs set
Searches containing one word but not the other
Understanding Boolean Operators
[Venn diagram: "dogs" with the "cats" region removed, inside the WWW]
All web pages containing the words dogs and not cats
Exclusion of the cats set
Searches containing one word but not the other
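The Venn diagrams map directly onto set operations. A small sketch using Python's built-in sets, where the page names are made up for illustration:

```python
# Hypothetical result sets: the pages that contain each keyword.
cats = {"page1", "page2", "page3"}
dogs = {"page3", "page4"}

both = cats & dogs            # cats AND dogs: intersection of the two sets
either = cats | dogs          # cats OR dogs: union of the two sets
cats_not_dogs = cats - dogs   # cats AND NOT dogs: exclusion of the dogs set
dogs_not_cats = dogs - cats   # dogs AND NOT cats: exclusion of the cats set

print(sorted(both))
print(sorted(either))
print(sorted(cats_not_dogs))
print(sorted(dogs_not_cats))
```

AND can only shrink a result set and OR can only grow it, which is the idea behind narrowing and expanding searches.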
Understanding Boolean Operators
Boolean operators
  AND
  OR
  NOT
Instruct the engine on how to combine keywords to produce results
Always use capital letters to avoid confusion with and, or, not as keywords
Understanding Boolean Operators
AND
  All these keywords must be on the Web page
OR
  These keywords may or may not be on the Web page
  At least one of them must be
NOT
  None of these keywords can be on the Web page
Understanding Boolean Operators
Default operator
  Some engines have a default Boolean operator
    Usually AND
    Might be OR
Some engines may search for multiple words as phrases
Understanding Boolean Operators
Boolean operators may be
  Allowed on main page
  Confined to Advanced search pages
Some engines use symbols instead
  + for AND
  - for NOT
  No space between sign and word:
    +solar +energy -windmill
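To show why the sign must touch the word, here is a sketch of how such a query might be split into required and excluded keywords. The function `parse_symbols` is a hypothetical illustration, not any real engine's parser; actual engines differ in the syntax they accept.

```python
def parse_symbols(query):
    """Split a +word/-word query into required, excluded, and plain keywords."""
    required, excluded, plain = [], [], []
    for token in query.split():
        if token.startswith("+") and len(token) > 1:
            required.append(token[1:])   # + for AND: the word must appear
        elif token.startswith("-") and len(token) > 1:
            excluded.append(token[1:])   # - for NOT: the word must not appear
        else:
            plain.append(token)          # no sign, or a sign standing alone
    return required, excluded, plain

print(parse_symbols("+solar +energy -windmill"))
print(parse_symbols("+solar - windmill"))
```

In the second call the space after the minus sign leaves "-" and "windmill" as two separate plain tokens, so the exclusion is lost.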
Narrowing Searches with AND
AND
  Limits results
  Forces inclusion of a stop word
Indicates that all keywords must be found on Web page
Adding more ANDed keywords limits search more
Results should be more relevant because the keyword list has expanded
Narrowing Searches with AND
Example:
  “solar energy association” AND Portland
[Venn diagram: the overlap of "Solar energy association" and "Portland" inside the WWW]
Narrowing Searches with AND
Example:
  Henry +I same as “Henry I”
[Venn diagram: "Henry I" inside the WWW]
Expanding Searches with OR
OR expands results
  Useful if you didn’t get enough returns from your first search
  The more keywords you add, the more results you should get
Every page returned must have at least one of the keywords on it
  Good to use when you have synonyms
Expanding Searches with OR
Example:
  oregon OR northwest
[Venn diagram: the union of "oregon" and "northwest" inside the WWW]
Restricting Queries with AND NOT
AND NOT excludes the keyword that follows NOT
  Limits your search
  Produces fewer results
Useful if first search returns irrelevant results
  Use AND NOT to get rid of those results
Restricting Queries with AND NOT
Equivalent forms:
  cats AND NOT dogs
  cats AND-NOT dogs
  cats NOT dogs
  cats -dogs
Restricting Queries with AND NOT
Example:
  “solar energy association” AND portland AND NOT maine
[Venn diagram: the overlap of "Solar energy association" and "portland", with the "maine" region excluded]
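This example query can be sketched as a filter over page text. The pages in `PAGES` are invented, and the matching is deliberately naive (lowercase text, exact words); it only illustrates how a quoted phrase, an AND, and an AND NOT combine.

```python
# Hypothetical pages: name -> lowercase text found on the page.
PAGES = {
    "p1": "solar energy association of portland oregon",
    "p2": "solar energy association of portland maine",
    "p3": "portland wind energy association",
    "p4": "solar energy tips for portland homes",
}

def matches(text):
    words = text.split()
    return ("solar energy association" in text   # quoted phrase, words in order
            and "portland" in words              # AND portland
            and "maine" not in words)            # AND NOT maine

hits = [name for name, text in PAGES.items() if matches(text)]
print(hits)
```

Page p2 satisfies the first two conditions but is removed by AND NOT, which is how the irrelevant Portland, Maine results are filtered out.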
Multiple Boolean Operators
Boolean operators allow you to focus a search
Any logical combination of operators is allowed
  If it makes sense when spoken like a sentence it’s probably OK to use
Order of operations is usually left to right
  Use parentheses to organize terms
Multiple Boolean Operators
Bad example:
  constitution +american OR “united states”
[Venn diagram: "constitution", "american", and “united states”]
Multiple Boolean Operators
Good example:
  constitution +(american OR “united states”)
[Venn diagram: "constitution", "american", and “united states”]
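The difference between the bad and the good example shows up when the queries are written as set operations. In this sketch the page sets are made up, and the bad query is assumed to be evaluated left to right:

```python
# Hypothetical result sets: the pages that contain each keyword or phrase.
constitution = {"p1", "p2", "p3"}
american = {"p1", "p4"}
united_states = {"p2", "p5"}

# Bad: read left to right as (constitution AND american) OR "united states".
left_to_right = (constitution & american) | united_states

# Good: constitution AND (american OR "united states").
grouped = constitution & (american | united_states)

print(sorted(left_to_right))
print(sorted(grouped))
```

Without parentheses, p5 (which mentions the united states but not a constitution) slips into the results; with them, every result contains the word constitution.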