Wasim Rangoonwala
ID# 00506259
CS-460 Computer Security
“Privacy is the claim of individuals, groups or institutions to determine for themselves when, how, and to what extent information about them is communicated to others”
- Alan Westin: Privacy and Freedom, 1967
What are WWW Robots?
A robot is a program that automatically traverses the Web's hypertext structure by retrieving a document, and recursively retrieving all documents that are referenced.
Web robots are sometimes referred to as Web Wanderers, Web Crawlers, Spiders, or Bots.
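The traversal described above can be sketched in a few lines of Python. This is only an illustration, not how any particular search engine works: the three-page site in FAKE_WEB is made up, the `fetch` argument is a hypothetical stand-in for a real HTTP download, and a queue is used in place of literal recursion so that pages linking to each other do not cause an endless loop.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, limit=100):
    """Visit start_url, then every document it references, breadth-first.

    `fetch` is a hypothetical callable returning the HTML for a URL,
    or None if the document cannot be retrieved.
    """
    seen = {start_url}
    queue = deque([start_url])
    order = []
    while queue and len(order) < limit:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        order.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            target = urljoin(url, href)  # resolve relative links
            if target not in seen:       # never queue a page twice
                seen.add(target)
                queue.append(target)
    return order


# A made-up three-page site held in memory, so no network access is needed.
FAKE_WEB = {
    "http://example.com/": '<a href="/a.html">A</a> <a href="/b.html">B</a>',
    "http://example.com/a.html": '<a href="/b.html">B</a>',
    "http://example.com/b.html": '<a href="/">Home</a>',
}

visited = crawl("http://example.com/", FAKE_WEB.get)
print(visited)
```

A real robot would replace FAKE_WEB.get with a function that downloads pages, and would check robots.txt before each request.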
Web Spiders / Robots Collecting Data
Controlling how search engines access and index your website
Google refers to its spiders as Googlebot and Googlebot-Image.
Google has a set of computers that continually crawl the web. Together these machines are known as the Googlebot. In general you want Googlebot to access your site so your web pages can be found by people searching on Google.
One key question is: how does Google know which parts of a website the site owner wants to show up in search results? Can publishers specify that some parts of the site should be private and non-searchable? The good news is that those who publish on the web have a lot of control over which pages appear in search results and which pages are kept private.
Answer: Robots.txt File
1. Robots.txt has been an industry standard for many years that lets a site owner control how search engines access their web site.
2. The robots.txt file contains a list of the pages that search engines shouldn't access.
3. You can exclude pages from Google's crawler by creating a text file called robots.txt and placing it in the root directory of your site.
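Because the file lives in the root directory, its address can be worked out from any page URL on the same site. A minimal sketch, assuming a hypothetical helper name robots_txt_url and an example.com address chosen only for illustration:

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the robots.txt address for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("http://www.example.com/shop/checkout_payment.html?step=2"))
# prints http://www.example.com/robots.txt
```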
Making Use of Robots.txt File
• Examples of pages you may want to keep private from search engines
1. A directory that contains internal logs.
2. News articles that require payment to access.
3. The administration area of the website, and any pages exposing database configuration strings, stored passwords, or credit card details.
4. Images that you want to keep private.
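A list like the one above can be turned into robots.txt text with a short script. This is a sketch only: the function name build_robots_txt and the four paths are hypothetical, picked to mirror the four examples, and a real site would use its own directory names.

```python
def build_robots_txt(groups):
    """Render {user_agent: [paths]} as robots.txt text, one group per agent."""
    blocks = []
    for agent, paths in groups.items():
        lines = [f"User-agent: {agent}"]
        lines += [f"Disallow: {path}" for path in paths]
        blocks.append("\n".join(lines))
    # Groups are separated by a blank line.
    return "\n\n".join(blocks) + "\n"


# Hypothetical paths standing in for the four kinds of private content.
PRIVATE_AREAS = {
    "*": ["/logs/", "/premium-news/", "/admin/", "/private-images/"],
}

text = build_robots_txt(PRIVATE_AREAS)
print(text)
```

The printed text would then be saved as robots.txt in the site's root directory.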
Achieving Privacy through Robots.txt File
# robots.txt File
# Currently disallow all images to the Google Image bot
User-agent: Googlebot-Image
Disallow: /

# ALL search engine spiders/crawlers (put at end of file)
User-agent: *
Disallow: /admin/
Disallow: /account_password.html
Disallow: /address_book.html
Disallow: /checkout_payment.html
Disallow: /cookie_usage.html
Disallow: /login.html
Example of Robots.txt File
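Rules like these can be checked with Python's standard urllib.robotparser module. The sketch below uses a shortened copy of the example; writing the catch-all group as "User-agent: *" is an assumption about what the "ALL search engine spiders/crawlers" comment intends, and www.example.com is a placeholder host. Keep in mind that robots.txt is advisory: well-behaved crawlers honor it, but it does not by itself block access to a page.

```python
from urllib.robotparser import RobotFileParser

# Shortened rules modeled on the example; "User-agent: *" is an assumption.
RULES = """\
User-agent: Googlebot-Image
Disallow: /

User-agent: *
Disallow: /admin/
Disallow: /login.html
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())


def allowed(agent, path):
    """Report whether `agent` may fetch `path` on the placeholder host."""
    return parser.can_fetch(agent, "http://www.example.com" + path)


print(allowed("Googlebot-Image", "/images/logo.png"))  # False: all paths blocked
print(allowed("Googlebot", "/admin/users.html"))       # False: under /admin/
print(allowed("Googlebot", "/index.html"))             # True: not listed
```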
Privacy through Robots <META> tag
You can use a special HTML <META> tag to tell robots not to index the content of a page, and/or not scan it for links to follow.
Example
<html><head> <title>...</title> <META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"> </head>
• The "NAME" attribute must be "ROBOTS".
• Valid values for the "CONTENT" attribute are: "INDEX", "NOINDEX", "FOLLOW", "NOFOLLOW". Multiple comma-separated values are allowed, but obviously only some combinations make sense. If there is no robots <META> tag, the default is "INDEX,FOLLOW", so there's no need to spell that out.
Example of <META> Tag
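A robot that respects this tag has to read it from each page it fetches. A minimal sketch using Python's standard html.parser module; the class and function names are hypothetical, and the default of INDEX, FOLLOW when no tag is present follows the rule stated above.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Remembers the CONTENT values of a <META NAME="ROBOTS"> tag, if any."""

    def __init__(self):
        super().__init__()
        self.directives = None

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attrs = dict(attrs)
        if (attrs.get("name") or "").lower() == "robots":
            content = attrs.get("content") or ""
            self.directives = {
                value.strip().upper()
                for value in content.split(",")
                if value.strip()
            }


def robots_policy(html):
    """Return whether a page may be indexed and whether its links may be followed."""
    parser = RobotsMetaParser()
    parser.feed(html)
    directives = parser.directives or set()  # no tag means INDEX, FOLLOW
    return {
        "index": "NOINDEX" not in directives,
        "follow": "NOFOLLOW" not in directives,
    }


PAGE = '<html><head> <title>...</title> <META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"> </head>'
print(robots_policy(PAGE))
print(robots_policy("<html><head><title>...</title></head>"))
```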
Search Engine Web Spider Names
• Yahoo! Search - Yahoo Slurp
• AltaVista - Scooter
• AskJeeves - Ask Jeeves/Teoma
• MSN Search - MSNbot
• Visit http://www.robotstxt.org/db.html for more details on search engine web spider names.
Bonus
Google: Anatomy
Google Crawlers (GoogleBot)
• Multiple distributed crawlers
• Own DNS cache
• 300 connections open at once
• Send fetched pages to Store Server
• Originally written in Python
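To illustrate why a crawler might keep its own DNS cache, here is a minimal sketch. It is not Google's implementation: the DnsCache class, the fake resolver, and the address it returns are all made up for the example. The idea is simply that each host name is looked up once and the answer is reused for every later page on that host.

```python
import socket


class DnsCache:
    """Remembers host-to-address answers so each host is resolved only once."""

    def __init__(self, resolver=socket.gethostbyname):
        self._resolver = resolver
        self._cache = {}
        self.lookups = 0  # how many times the real resolver was called

    def resolve(self, host):
        if host not in self._cache:
            self.lookups += 1
            self._cache[host] = self._resolver(host)
        return self._cache[host]


def fake_resolver(host):
    """Stand-in resolver so the example runs without network access."""
    return "192.0.2.10"  # address from a range reserved for documentation


cache = DnsCache(resolver=fake_resolver)
for _ in range(3):
    address = cache.resolve("www.example.com")

print(address, cache.lookups)
```

A production cache would also expire entries after a while, which this sketch leaves out.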
Google: Technology
• PageRank™ Algorithm
• Hypertext-matching Analysis
Google Webmaster Central
Webmaster Central offers these services:
• see which parts of a site Googlebot had problems crawling
• upload an XML Sitemap file
• analyze and generate robots.txt files
• remove URLs already crawled by Googlebot
• specify the preferred domain
• identify issues with title and description meta tags
• understand the top searches used to reach a site
• get a glimpse at how Googlebot sees pages
• remove unwanted site links that Google may use in results
Protect your privacy on the Web
• When surfing the internet, avoid "free" offers and protect your information!
• Chatting: guard your information unless you are 100% sure who you are chatting with.
• Cookies aren't just for eating; they may be sending your personal information to others.
• Protect your passwords like you would your wallet or car keys. Make them complicated!
• E-mail is not secure and should never be thought of as private.
• Don't even open spam; download a spam buster!
• Beware of phishing: fake e-mails sent to try to gain your personal and financial information.
• http://www.google.com/support/webmasters/bin/answer.py?answer=80553
• http://www.google.com/bot.html
• http://www.googleguide.com
• http://www.searchengineposition.com
• http://www.google-watch.org
• http://www.robotstxt.org/db.html
• http://www.googleblog.blogspot.com
• For more details, visit http://techwasim.blogspot.com