Curtis Spencer Ezra Burgoyne An Internet Forum Index.
Curtis Spencer
Ezra Burgoyne
An Internet Forum Index
The Problem
• Forums provide a wealth of information often overlooked by search engines.
• Semi-structured data provided by forums is not taken advantage of by popular search software (e.g. Google, Yahoo).
• Despite being crawled, many useful, information-rich posts never appear in results because of low page rank.
• Discovering what the best forums are for a given topic is difficult even when the help of a search engine is enlisted.
• Forum users are often unaware of related information found on rival forums.
• A forum’s own search software is often slow and returns poor results.
Quick summary of solution
• Forum detection crawlers continually find new forums with the help of a web search engine (e.g. Dogpile).
• These discovered forums are eventually wrapped in their entirety by a distributed crawler.
• Forum content collected in the database is indexed using MySQL's fulltext natural-language index.
• The search ranking algorithm uses data that traditional search engines ignore, such as the number of replies, the number of views, and the popularity of the poster.
Forums supported by Forum Looter
• phpBB
• vBulletin
Discovering forums
• Using WordNet, a program serves dictionary words and their synonyms to a set of distributed crawlers.
• Every link returned by Dogpile is subjected to a detection algorithm that checks URL formats as well as common patterns in the markup.
• Detects the three most popular forum types used on the internet: vBulletin, phpBB, and Invision.
• To be good netizens, the crawlers accessed the Dogpile website only once every two minutes and respected robots.txt.
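The detection step described above can be sketched as follows. This is only an illustration: the slides do not list the actual URL or markup patterns, so the regular expressions and the `detect_forum` function below are assumptions based on script names and footer text commonly seen in these forum packages.

```python
import re

# Hypothetical signatures; the slides do not give the real patterns.
URL_PATTERNS = {
    "phpBB": re.compile(r"/(viewtopic|viewforum)\.php", re.I),
    "vBulletin": re.compile(r"/(showthread|forumdisplay)\.php", re.I),
    "Invision": re.compile(r"[?&](showtopic|showforum)=\d+", re.I),
}
MARKUP_PATTERNS = {
    "phpBB": re.compile(r"powered by\s+(<[^>]+>\s*)*phpBB", re.I),
    "vBulletin": re.compile(r"powered by\s+(<[^>]+>\s*)*vBulletin", re.I),
    "Invision": re.compile(r"Invision Power Board", re.I),
}


def detect_forum(url, html):
    """Return the forum type with the most matching signals, or None."""
    scores = {}
    for name in URL_PATTERNS:
        score = 0
        if URL_PATTERNS[name].search(url):
            score += 1  # the URL looks like this forum software
        if MARKUP_PATTERNS[name].search(html):
            score += 1  # the markup carries this software's footer text
        if score:
            scores[name] = score
    if not scores:
        return None
    return max(scores, key=scores.get)
```

Combining both signals lets a page be classified even when only the URL or only the markup is recognizable.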
Dogpile example query
Distributed forum wrapping (architecture)
• Synchronized Java RMI server stores a queue of jobs.
• Distributed crawlers retrieve jobs from the central RMI server.
• Distributed crawlers wrap whatever page their fetched job contains and save the results into the database.
• Distributed crawlers can schedule new jobs on the RMI server, too.
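The architecture above can be illustrated with a small in-process sketch. It replaces Java RMI with a lock-protected queue in a single Python process, so it shows only the job flow, not the remote calls. The names `JobServer`, `crawl`, and `wrap_page`, and the de-duplication of already scheduled URLs, are assumptions for illustration.

```python
import threading
from collections import deque


class JobServer:
    """In-process stand-in for the central job queue (the slides use Java RMI)."""

    def __init__(self):
        self._lock = threading.Lock()  # plays the role of synchronized access
        self._jobs = deque()
        self._seen = set()             # assumption: each URL is scheduled once

    def schedule(self, url):
        """Add a job unless the URL was already scheduled. Returns True if added."""
        with self._lock:
            if url in self._seen:
                return False
            self._seen.add(url)
            self._jobs.append(url)
            return True

    def fetch(self):
        """Hand out the next job, or None when the queue is empty."""
        with self._lock:
            return self._jobs.popleft() if self._jobs else None


def crawl(server, wrap_page, results):
    """Crawler loop: fetch a job, wrap the page, store it, schedule new links."""
    while True:
        url = server.fetch()
        if url is None:
            break
        record, links = wrap_page(url)
        results.append(record)  # stands in for the database write
        for link in links:
            server.schedule(link)
```

Here `wrap_page` is any callable that returns the extracted record and the links found on the page. A real deployment with several crawlers would also need a way to tell an empty queue from a finished crawl.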
More on distributed forum wrapping (access)
• To be good netizens, each forum website is accessed only once in any 20-second period.
• At that rate, it would take two months to completely wrap some of the largest forums out there.
• Last request times of forum websites were set by individual client crawlers and tracked in the RMI server.
• An exponential backoff algorithm is used for slow sites.
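A minimal sketch of the access policy above: one table of last request times per host, a 20-second base delay, and a delay that doubles for slow sites. The slides do not say how a site is judged slow or whether the delay is capped, so the `slow` flag, the `max_delay` cap, and the class name are assumptions.

```python
class PolitenessTracker:
    """Tracks the last request time and the current delay for each forum host."""

    def __init__(self, base_delay=20.0, max_delay=640.0):
        self.base_delay = base_delay  # 20 seconds, as stated in the slides
        self.max_delay = max_delay    # hypothetical cap, not from the slides
        self._last_request = {}
        self._delay = {}

    def wait_needed(self, host, now):
        """Seconds a crawler must still wait before contacting host."""
        last = self._last_request.get(host)
        if last is None:
            return 0.0
        delay = self._delay.get(host, self.base_delay)
        return max(0.0, last + delay - now)

    def record_request(self, host, now, slow=False):
        """Store the request time; double the delay for slow sites, else reset it."""
        self._last_request[host] = now
        if slow:
            current = self._delay.get(host, self.base_delay)
            self._delay[host] = min(current * 2, self.max_delay)
        else:
            self._delay[host] = self.base_delay
```

At one request every 20 seconds a forum yields 86,400 / 20 = 4,320 pages per day, so two months (about 60 days) corresponds to roughly 260,000 page fetches.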
More on distributed forum wrapping (performance)
• Java RMI server performed very well, only using 10% of an AMD Athlon CPU.
• Individual crawlers were rather memory intensive.
• Individual crawlers used JTidy for parsing pages and performed DOM manipulation in addition to regular expressions to extract data.
• The memory use is attributed to Hibernate and JTidy.
• Database access was the main bottleneck in the distributed system.
Indexing of data
• Shadow database periodically updates forum data from crawl and creates an incremental MySQL fulltext index.
• The MySQL fulltext index has built-in support for stop words and document frequency scaling.
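To show what stop words and document frequency scaling contribute, here is a small pure-Python sketch. It is not MySQL's actual weighting formula. The stop-word list, the `log((n + 1) / df)` scaling, and the function names are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list; MySQL ships its own, much longer one.
STOP_WORDS = {"the", "a", "an", "is", "to", "of", "and", "in", "my"}


def tokenize(text):
    """Lowercase the text, split it into words, and drop stop words."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOP_WORDS]


def score_posts(posts, query):
    """Return (index, score) pairs for matching posts, best first."""
    docs = [Counter(tokenize(p)) for p in posts]
    n = len(docs)
    scores = []
    for i, doc in enumerate(docs):
        score = 0.0
        for term in tokenize(query):
            df = sum(1 for d in docs if term in d)  # posts containing the term
            if df and term in doc:
                # Terms found in fewer posts get a larger weight.
                score += doc[term] * math.log((n + 1) / df)
        if score > 0:
            scores.append((i, score))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

A query term that appears in few posts separates results more strongly than one that appears in most of them, and stop words are dropped before scoring.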
Ranking algorithm for search results
• Forum software exposes metadata that is rarely exploited for analysis.
• We do a hierarchical weighted value calculation for each post once it matches a natural-language query in MySQL.
• An approximation of this calculation is Value = w(apc)*apc + w(numViews) + w(numReplies) + w(isThread())
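The approximate formula above can be sketched as a weighted sum. The slides give neither weight values nor a definition of `apc`, so the weights below are hypothetical and `apc` is treated as an opaque per-post number. The logarithmic damping of views and replies is also an assumption, and the hierarchical part of the calculation is not modelled.

```python
import math
from dataclasses import dataclass


@dataclass
class Post:
    apc: float        # not defined in the slides; treated as an opaque number
    num_views: int
    num_replies: int
    is_thread: bool   # True if the post starts a thread


# Hypothetical weights; the slides do not give values.
WEIGHTS = {"apc": 0.5, "views": 1.0, "replies": 2.0, "thread": 3.0}


def post_value(post, weights=WEIGHTS):
    """Weighted sum of metadata signals, following the shape of the slide's formula."""
    return (
        weights["apc"] * post.apc
        + weights["views"] * math.log1p(post.num_views)      # damping is an assumption
        + weights["replies"] * math.log1p(post.num_replies)  # damping is an assumption
        + weights["thread"] * (1.0 if post.is_thread else 0.0)
    )


def rank(posts, weights=WEIGHTS):
    """Order posts by descending value."""
    return sorted(posts, key=lambda p: post_value(p, weights), reverse=True)
```

In this sketch the value would reorder only those posts that the fulltext query has already matched.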
Future changes
• Natural language parsing of forum post corpus so ranking may be affected by things such as “thank you” or “great post” replies.
• Collaborative filtering of results per query by tracking user clicks.
• Improve resource usage of crawlers.