A Characterization of the Portuguese Web
Daniel Gomes and Mário J. Silva
University of Lisbon
http://xldb.fc.ul.pt
Presentation
• Introduction
• Setup
• Statistics
• Conclusions
• Future Work
Terminology
• Document: file resulting from a successful HTTP download
• Publisher: entity responsible for publishing the document on the Web
• Web site: collection of documents referenced by URLs that share the same host name
Why Characterize?
• Extraction of cultural, commercial and social aspects:
– Presence of natural languages
– Most popular web servers
• Adequate design and tuning of web applications:
– The web is described through its characterization.
– Parameters of the Web graph
  - How many nodes compose the graph
  - Types of these nodes
Characterizing the WWW vs. Community Webs
WWW:
• Huge
• Sampling is a “must”
• WWW is not uniform
• Small partitions are ignored
Community Webs:
+ Relevant to a certain community
+ Fewer resources
+ A complete scan is possible, no sampling!
– Difficult to establish boundaries
WWW.TUMBA.PT
Publicly available:
• Characterize
• Search
Almost:
• Archive
Target of all three: the Portuguese Web
Main objectives:
• Estimate the resources needed to create a web archive of the Portuguese Web;
• Validate crawls;
• Gather guidelines to improve the systems (crawling, repository, index).
Characterization Setup
• Viúva Negra Crawlers: gather information from the Web and insert it into Versus.
• Versus: keeps documents in files and meta-data in relations.
• Web statistics are produced by issuing SQL queries to the Versus Repository.
[Diagram: several VN Crawlers feed the Versus Repository; Web statistics are derived from it via SQL]
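As a rough illustration of producing statistics by querying a repository, here is a minimal sketch using SQLite. The `documents` table, its columns, and the demo rows are assumptions made for this example only; the actual Versus schema is not described in these slides.

```python
import sqlite3


def build_demo_repository():
    """Create an in-memory stand-in for a meta-data repository (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE documents ("
        "url TEXT PRIMARY KEY, site TEXT, tld TEXT, "
        "status INTEGER, size_bytes INTEGER)"
    )
    conn.executemany(
        "INSERT INTO documents VALUES (?, ?, ?, ?, ?)",
        [
            ("http://a.example.pt/", "a.example.pt", "pt", 200, 1000),
            ("http://a.example.pt/x", "a.example.pt", "pt", 200, 2000),
            ("http://b.example.pt/", "b.example.pt", "pt", 404, 0),
            ("http://c.example.com/", "c.example.com", "com", 200, 500),
        ],
    )
    return conn


def sites_per_tld(conn):
    """Count distinct sites (host names) under each TLD."""
    rows = conn.execute(
        "SELECT tld, COUNT(DISTINCT site) FROM documents GROUP BY tld"
    ).fetchall()
    return dict(rows)


def status_share(conn, status):
    """Fraction of crawled URLs that returned the given HTTP status code."""
    total = conn.execute("SELECT COUNT(*) FROM documents").fetchone()[0]
    hits = conn.execute(
        "SELECT COUNT(*) FROM documents WHERE status = ?", (status,)
    ).fetchone()[0]
    return hits / total
```

Each figure in the statistics slides could, in principle, be one such aggregate query over the crawl meta-data.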
What is the Portuguese Web?
• Set of documents of cultural and sociological interest to the Portuguese people.
• Language
– Brazilian and Portuguese community web sites are both written in Portuguese
• TLDs
– Many sites are hosted under gTLDs.
Crawler configuration
• Influences statistics
– The depth of the crawl influences the number of documents gathered
– Replication
  • Mirrors
  • URL aliases
• Crawl as many documents as possible
• Maintain robustness against pathological situations
VN Configuration Parameters
– Text documents (list selected MIME types)
– Hosted under “.PT”
– Hosted under “.COM”, “.NET”, “.ORG”, “.TV”:
  • Written in Portuguese
  • Host site had at least one incoming link originated under “.PT”
– Download timeout = 60 s
– Max size = 2 MB
– Avoid traps:
  • max docs per site = 8000
  • crawl the same document at most 50 times
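The parameters above amount to a membership test for the Portuguese Web. A minimal sketch of such a test follows. The `Candidate` record and its fields are hypothetical, the language flag is assumed to come from a separate language identifier, and requiring both gTLD conditions at once is an assumption, since the slide does not say whether they are combined with "and" or "or". MIME type filtering, the download timeout, and the per-document revisit limit are left out.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit

GENERIC_TLDS = {"com", "net", "org", "tv"}
MAX_SIZE_BYTES = 2 * 1024 * 1024   # Max size = 2 MB
MAX_DOCS_PER_SITE = 8000           # trap avoidance


@dataclass
class Candidate:
    """Hypothetical record describing a document the crawler is considering."""
    url: str
    is_portuguese: bool           # assumed output of a language identifier
    has_pt_inlink: bool           # host has an incoming link from a .PT site
    size_bytes: int
    docs_already_from_site: int


def tld_of(url):
    """Return the lower-cased top-level domain of the URL's host."""
    host = urlsplit(url).hostname or ""
    return host.rsplit(".", 1)[-1]


def in_portuguese_web(c):
    """Apply the size limit, the per-site cap, and the TLD rules."""
    if c.size_bytes > MAX_SIZE_BYTES:
        return False
    if c.docs_already_from_site >= MAX_DOCS_PER_SITE:
        return False
    tld = tld_of(c.url)
    if tld == "pt":
        return True
    if tld in GENERIC_TLDS:
        # Assumption: both conditions must hold for gTLD hosts.
        return c.is_portuguese and c.has_pt_inlink
    return False
```

Writing the policy as a single predicate makes it explicit that every statistic reported later depends on these choices.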
Collected Statistics
• 4 million URLs and 78 GB.
• 83% successfully downloaded (200)
• 3.4% not found (404)
• 1.2% took more than 1 minute to download
• 0.5% bigger than 2 MB
Site statistics
Sites per TLD
• PT: 85%
• COM: 12%
• NET: 2%
• ORG: 1%
• TV: 0%
Documents per Site
• 1: 38%
• 1-10: 34%
• 10-100: 21%
• 100-1000: 6%
• >1000: 1%
Language Distribution (.pt only)
Portuguese 73%
English 17%
German 3%
Spanish 1%
others 1%
unknown 4%
French 1%
Size Distribution
[Histogram: number of documents (axis from 0 to 1,000,000) by size in KB, with bins 0, 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048]
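The size axis doubles at each step, which suggests power-of-two bins. A minimal sketch of such a binning follows. Assigning each document to the smallest power-of-two bin (in KB) that holds it is an assumption, because the slide does not state the bin boundaries, and the function names are illustrative.

```python
from collections import Counter


def size_bucket_kb(size_bytes):
    """Smallest power-of-two bin (in KB) that holds the document; 0 for empty documents."""
    if size_bytes <= 0:
        return 0
    kb = -(-size_bytes // 1024)   # round up to whole kilobytes
    bucket = 1
    while bucket < kb:
        bucket *= 2
    return bucket


def size_histogram(sizes_in_bytes):
    """Count documents per size bin."""
    return Counter(size_bucket_kb(s) for s in sizes_in_bytes)
```

For example, `size_histogram([0, 500, 1024, 1025, 3000, 5000])` puts one document in bin 0, two in bin 1, and one each in bins 2, 4 and 8.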
Other Statistics
• Average length of a URL is 62 chars
• unknown Last-Modified Date: 53%
• HTML: 95%
• 78 GB of data produced 8.7 GB of text
• Meta-tags are scarce (description 17%, keywords 18%)
• 15.5% Replication
http://wealth.com.sapo.pt/gui/flat.swf?exbackground=993333&makenavfield0=HitHarvester&makenavfield10=ClickSilo&makenavfield11=BraStart&makenavfield12=AskMiky&makenavfield13=TrafficG&makenavfield14=Click4u&makenavfield1=YesMoreHits&makenavfield2=ClickityCash&makenavfield3=StartFrenzy&makenavfield4=NoMoreHits&makenavfield5=ILoveClicks&makenavfield6=ClixSwap&makenavfield7=EZHits4U&makenavfield8=HitSense&makenavfield9=Clickthru&makenavurl0=http://www.hitharvester.com/referral.asp?ref=kurtz53&makenavurl10=http://www.clicksilo.com/referrals/info.asp?Agent=kurtz53&makenavurl11=http://www.brastart.com/cgi-bin/join.cgi?r=kurtz53&makenavurl12=http://www.askmiky.com/home/signup.php?ref=kurtz53&makenavurl13=http://www.trafficg.com/home.php?member=kurtz53&makenavurl14=http://www.clicks4u.com/X92433/&makenavurl1=http://www.yesmorehits.com/cgi-bin/join.cgi?r=kurtz53&makenavurl2=http://www.clickitycash.com/cgi-bin/join.cgi?refer=52786&makenavurl3=http://www.startfrenzy.com/default.asp?userid=kurtz53&makenavurl4=http://www.nomorehits.com/cgi-bin/start.cgi?referrer=kurtz53&makenavurl5=http://www.iloveclicks.com/signup.asp?referrer=22014&makenavurl6=http://www.clixswap.com/?ref=csa12481&makenavurl7=http://www.ezhits4u.com/index.asp?ref=kurtz53&makenavurl8=http://www.hitsense.com/refer.php?ref=kurtz53&makenavurl9=http://www.clickthru.com/referral?ref=280693&tarframe=_blank
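An average URL length can hide extreme outliers such as the URL shown above. A minimal sketch of computing both the mean length and the longest URL from a list of crawled URLs follows. The function name and the sample URLs in the usage note are illustrative only.

```python
def url_length_stats(urls):
    """Return (mean length in characters, longest URL) for a non-empty list of URLs."""
    lengths = [len(u) for u in urls]
    mean_length = sum(lengths) / len(lengths)
    longest = max(urls, key=len)
    return mean_length, longest
```

Called on `["http://a.pt/", "http://b.pt/index.html"]`, it returns a mean of 17.0 characters and the second URL as the longest.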
Conclusions
• Defined the Portuguese Web as a crawling policy.
• Characterization cannot be dissociated from crawling technology.
• A search engine repository is a source of interesting statistics.
• Statistics are an important tool for validating and designing web applications.
Future Work
• Study the linkage structure
• Crawl other document types, such as PostScript
• Improve the algorithm used to find Portuguese web sites outside the .PT domain
• Study the evolution of the Portuguese Web
Thank you for your attention.
http://xldb.fc.ul.pt
http://www.tumba.pt