Wasim Rangoonwala, ID# 00506259, CS-460 Computer Security


“Privacy is the claim of individuals, groups or institutions to determine for themselves when, how, and to what extent information about them is communicated to others”

- Alan Westin: Privacy and Freedom, 1967

What are WWW Robots?

A robot is a program that automatically traverses the Web's hypertext structure by retrieving a document, and recursively retrieving all documents that are referenced.

Web robots are sometimes referred to as Web Wanderers, Web Crawlers, Spiders, or Bots.
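The recursive traversal described above can be sketched in a few lines of Python. This is only an illustration, not how any particular search engine works: the `PAGES` dictionary is a hypothetical in-memory stand-in for the Web (so no network access is needed), and the names `LinkExtractor` and `crawl` are made up for this example.

```python
from html.parser import HTMLParser

# Hypothetical in-memory "web": path -> HTML body (assumption for illustration).
PAGES = {
    "/index.html": '<a href="/about.html">About</a> <a href="/news.html">News</a>',
    "/about.html": '<a href="/index.html">Home</a>',
    "/news.html": '<a href="/about.html">About</a> <a href="/archive.html">Archive</a>',
    "/archive.html": "No links here.",
}


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(url, pages, visited=None):
    """Retrieve a document, then recursively retrieve every document it references."""
    if visited is None:
        visited = []
    # Skip pages already seen (avoids infinite loops) and unknown pages.
    if url in visited or url not in pages:
        return visited
    visited.append(url)
    extractor = LinkExtractor()
    extractor.feed(pages[url])
    for link in extractor.links:
        crawl(link, pages, visited)
    return visited


print(crawl("/index.html", PAGES))
```

The `visited` list is what keeps the robot from looping forever when pages link back to each other, as `/about.html` and `/index.html` do here.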

Web Spiders / Robots Collecting Data

Controlling how search engines access and index your website

Google refers to its spiders as Googlebot and Googlebot-Image.

Google has a set of computers that continually crawl the web. Together these machines are known as the Googlebot. In general, you want Googlebot to access your site so your web pages can be found by people searching on Google.


One key question is: how does Google know which parts of a website the site owner wants to show up in search results? Can publishers specify that some parts of the site should be private and non-searchable? The good news is that those who publish on the web have a lot of control over which pages should appear in search results and which pages can be kept private.

Answer: Robots.txt File


1. Robots.txt has been an industry standard for many years that lets a site owner control how search engines access their web site.

2. The robots.txt file contains a list of the pages that search engines shouldn't access.

3. You can exclude pages from Google's crawler by creating a text file called robots.txt and placing it in the root directory of your site.

Making Use of Robots.txt File


• Examples of pages you may want to keep private from search engines

1. A directory that contains internal logs.

2. News articles that require payment to access.

3. Administration area of the website: database configuration strings, stored passwords, credit card details.

4. Images that you want to keep private.


Achieving Privacy through Robots.txt File

# robots.txt File

# Currently disallow all images to the Google Image bot

User-agent: Googlebot-Image

Disallow: /

# ALL search engine spiders/crawlers (put at end of file)

User-agent: *

Disallow: /admin/

Disallow: /account_password.html

Disallow: /address_book.html

Disallow: /checkout_payment.html

Disallow: /cookie_usage.html

Disallow: /login.html

Example of Robots.txt File
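To see how a well-behaved crawler might interpret rules like these, here is a small sketch using Python's standard `urllib.robotparser` module. The rules are adapted from the example file, with the catch-all group written as `User-agent: *` (an assumption about what the "ALL search engine spiders" comment intends). The host `example.com` and the crawler name `ExampleBot` are placeholders, not real values from the slides.

```python
from urllib.robotparser import RobotFileParser

# Rules adapted from the example file. Writing the catch-all group as
# "User-agent: *" is an assumption about the intent of its comment.
ROBOTS_TXT = """\
User-agent: Googlebot-Image
Disallow: /

User-agent: *
Disallow: /admin/
Disallow: /account_password.html
Disallow: /login.html
"""

# Parse the rules from memory instead of fetching them over the network.
rules = RobotFileParser()
rules.parse(ROBOTS_TXT.splitlines())


def allowed(agent, path):
    """Return True if the named crawler may fetch the given path."""
    # example.com is a placeholder host used only for illustration.
    return rules.can_fetch(agent, "http://example.com" + path)


for agent, path in [
    ("Googlebot-Image", "/images/logo.png"),
    ("ExampleBot", "/admin/settings.html"),
    ("ExampleBot", "/login.html"),
    ("ExampleBot", "/index.html"),
]:
    print(agent, path, "allowed" if allowed(agent, path) else "blocked")
```

Note that this only models what a cooperating robot would do. The file is a request to crawlers, so truly sensitive pages still need real access control such as a login.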

Privacy through Robots <META> tag

You can use a special HTML <META> tag to tell robots not to index the content of a page, and/or not scan it for links to follow.

Example

<html><head> <title>...</title> <META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"> </head>

• The "NAME" attribute must be "ROBOTS".

• Valid values for the "CONTENT" attribute are: "INDEX", "NOINDEX", "FOLLOW", "NOFOLLOW". Multiple comma-separated values are allowed, but obviously only some combinations make sense. If there is no robots <META> tag, the default is "INDEX,FOLLOW", so there's no need to spell that out.

Example of <META> Tag
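The rules for the robots <META> tag can be illustrated with a short Python sketch that reads a page and reports whether indexing and link-following are permitted. It uses the standard `html.parser` module. The names `RobotsMetaParser` and `robots_policy` are hypothetical, and the sketch assumes the simple behaviour described above: a missing tag means "INDEX,FOLLOW".

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Records the directives of a <META NAME="ROBOTS"> tag, if present."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            self.directives = {
                part.strip().upper() for part in content.split(",") if part.strip()
            }


def robots_policy(html):
    """Return whether a robot may index the page and follow its links."""
    parser = RobotsMetaParser()
    parser.feed(html)
    # With no robots <META> tag the default is INDEX, FOLLOW.
    return {
        "index": "NOINDEX" not in parser.directives,
        "follow": "NOFOLLOW" not in parser.directives,
    }


page = '<html><head> <title>...</title> <META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"> </head>'
print(robots_policy(page))
print(robots_policy("<html><head><title>...</title></head></html>"))
```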

Search Engine Web Spiders Names

• Yahoo! Search - Yahoo Slurp

• AltaVista - Scooter

• AskJeeves - Ask Jeeves/Teoma

• MSN Search - MSNbot

• Visit http://www.robotstxt.org/db.html for more details on search engine web spider names.

Google: Anatomy

Google Crawlers (GoogleBot)

• Multiple distributed crawlers

• Own DNS cache

• 300 connections open at once

• Send fetched pages to Store Server

• Originally written in Python

PageRank™ Algorithm

Hypertext-matching Analysis

Google: Technology

Google Webmaster Central

Webmaster Central offers the following services:

• see which parts of a site Googlebot had problems crawling

• upload an XML Sitemap file

• analyze and generate robots.txt files

• remove URLs already crawled by Googlebot

• specify the preferred domain

• identify issues with title and description meta tags

• understand the top searches used to reach a site

• get a glimpse at how Googlebot sees pages

• remove unwanted site links that Google may use in results

Protect your privacy on the Web

• When surfing the internet, avoid “free” offers and protect your information!

• Chatting: guard your information unless you are 100% sure who you are chatting with.

• Cookies aren’t just for eating; they may be sending your personal information to others.

• Protect your passwords like you would your wallet or car keys. Make them complicated!

• E-mail is not secure and should never be thought of as private.

• Don’t even open spam; download a spam buster!

• Beware of phishing: fake e-mails sent to try to gain your personal and financial information.

• http://www.google.com/support/webmasters/bin/answer.py?answer=80553

• http://www.google.com/bot.html

• http://www.googleguide.com

• http://www.searchengineposition.com

• http://www.google-watch.org

• http://www.robotstxt.org/db.html

• http://www.googleblog.blogspot.com

• For more details, visit http://techwasim.blogspot.com