Google Search Appliance Help Center - Linked by...

Help Center

Software version 4.4.x

Hello and welcome to the Help Center!

Browse the Admin Console Documentation

Most questions can be answered by reading the Admin Console Help Center. This complete version 4.4.x Google Search Appliance Help Center is arranged in the same order as the Admin Console navigation menu. Find additional technical details about the Google Search Appliance under More Information.

Visit the Google Support site

You can find additional help content, including up-to-date FAQs, tips, and release notes at https://support.google.com/enterprise (separate login required). More information is being added all the time, so check back often.

Contents

Crawl and Index: Crawl URLs Databases Feeds Crawler Access Proxy Servers Cookie Sites Forms Authentication HTTP Headers Duplicate Hosts Document Dates Host Load Schedule Index Rollback Freshness Tuning Collections Edit Collections

Serving: Front Ends Output Format KeyMatch Synonyms Filters Remove URLs Authorization Forms Authentication

Status and Reports: Crawl Status Crawl Diagnostics Serving Status System Status Search Reports Search Log Event Log

Administration: Network Settings System Settings User Accounts Change Password SNMP Configuration Certificate Authorities SSL Settings License Import/Export Reset Index Shutdown

More Information: Rules for Valid URL Patterns Crawling and Indexing Spelling & Stop Words Hexadecimal Notation Font Families Security and Error Handling XML Reference Index

Admin Console: Home

When you access the Admin Console for the Google Search Appliance, the Home page gives you instant information about the status of your system.

Test Center

The Test Center link in the horizontal blue bar at the top right of the Admin Console is provided on every page, so you can do a test search on the front end you have selected and on any collection you like. When you click the Test Center link, a new browser window opens.

Enter a search term in the search box and click the Search button to view the results page for the front end and the collection you selected.

System Status

If all is running as it should, you see the System OK green button. A yellow Caution button means that there may be an issue with disk space, with the machine temperature, or with a machine itself. If there is a red Warning button, you should contact Google support. Click the System Status link to get more information.

Crawl Status

The Crawl Status chart shows the number of URLs found and the number of URLs crawled in the last 24 hours. Clicking the click to expand link opens the Crawl Status page, which contains a larger version of the chart, as well as a table showing the total documents being served, the current crawling rate, and number of document bytes indexed.

The Crawl Status page also has a Pause button that you can click to temporarily suspend crawling of your servers. This button becomes a Resume button to use when you want to restart the crawl.

Serving Status

The Serving Status chart shows statistics on users' queries and on search results served to those users. Clicking the click to expand link opens the Serving Status page, which contains a larger version of the chart and the recent number of queries per second that are coming into the Google Search Appliance.

Quick Access Links

Under the charts are shortcut access links to commonly used pages in the Admin Console.

Crawl and Index > Crawl URLs

Before you begin crawling your web content, you must specify one or more starting locations. You can control and refine the breadth of the crawl by specifying URL patterns to follow and others to avoid. For a given URL to be crawled, it must match at least one URL pattern in the Follow and Crawl Only URLs with the Following Patterns box and none of the URL patterns in the Do Not Crawl URLs with the Following Patterns box.

Note: If a URL is matched by patterns from both Follow and Crawl Only URLs with the Following Patterns and Do Not Crawl URLs with the Following Patterns, the URL will not be crawled.
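The crawl decision described above can be sketched in a few lines of Python. This is only an illustration: it assumes every pattern is a plain substring, whereas the appliance applies the full rules for valid URL patterns. The function name should_crawl and the sample "avoid" pattern are hypothetical.

```python
def should_crawl(url, follow_patterns, do_not_crawl_patterns):
    """Return True only if url matches at least one follow pattern
    and none of the do-not-crawl patterns."""
    def matches(pattern):
        # Simplifying assumption: a pattern matches when it occurs
        # anywhere in the URL. Real URL patterns are richer than this.
        return pattern in url

    followed = any(matches(p) for p in follow_patterns)
    excluded = any(matches(p) for p in do_not_crawl_patterns)
    # A URL matched by both lists is not crawled.
    return followed and not excluded


follow = ["www.example.corp.com/example/"]
avoid = ["/example/private/"]  # hypothetical exclusion pattern

print(should_crawl("http://www.example.corp.com/example/a.html", follow, avoid))
print(should_crawl("http://www.example.corp.com/example/private/a.html", follow, avoid))
```

The second URL matches both lists, so the exclusion wins and the function returns False for it.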

The following options let you control and refine your crawls.

Start Crawling from the Following URLs

Starting URLs, entered one per line, control where the crawl begins. All content that you wish to include in all of the collections should be reachable by following links from one or more documents listed in the starting URLs.

These URLs are only the starting point(s) for the crawl. They tell the crawler where to begin crawling. However, links from the start URLs will be followed and indexed only if they match a pattern in Follow and Crawl Only URLs with the Following Patterns. For example, if you specify a starting URL of http://mycompany.com/ in this section and a pattern www.mycompany.com/ in the Follow and Crawl Only URLs with the Following Patterns section, the crawler will discover links in the http://www.mycompany.com/ web page, but will only crawl and index URLs that match the pattern www.mycompany.com/.

All entries in this window must be fully qualified URLs, using the format: <protocol>://<host>[:port]/[path].

The information contained in square brackets [ ] is optional. The forward slash "/" after <host>[:port] is required.

Valid examples: http://www.google.com/ http://www.google.com:80/help/

Invalid examples: Reason:

Google Search Appliance Help Center

Google Inc. 2

http://www/ Invalid because the hostname is incomplete. A fully qualified hostname includes the local hostname and the full domain name. For example: mail.corp.company.com.

www.example.com/ Invalid because the protocol information is missing.

http://www.example.com The "/" after <host>[:port] is required.
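To pre-check a list of start URLs before pasting them into the window, a rough validator along the following lines may help. It is a sketch, not the appliance's own check: the function name is hypothetical, and "fully qualified hostname" is approximated by requiring at least one dot in the host.

```python
from urllib.parse import urlsplit


def is_valid_start_url(url):
    """Approximate the <protocol>://<host>[:port]/[path] rule."""
    parts = urlsplit(url)
    if not parts.scheme or not parts.netloc:
        return False  # protocol information is missing
    try:
        parts.port  # raises ValueError if the port is not a valid number
    except ValueError:
        return False
    host = parts.hostname or ""
    if "." not in host:
        return False  # assumed heuristic for an incomplete hostname
    if not parts.path.startswith("/"):
        return False  # the "/" after <host>[:port] is required
    return True


for candidate in ["http://www.google.com/", "http://www.google.com:80/help/",
                  "http://www/", "www.example.com/", "http://www.example.com"]:
    print(candidate, is_valid_start_url(candidate))
```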

To enter a new URL, type a valid entry into the window. Press Enter to add additional URLs, one per line.

Note: This window must contain at least one start URL. Google Search will attempt to resolve all incomplete path information entered. However, if it cannot be successfully resolved, the following error message displays in red on the page: You have entered one or more invalid start URLs. Please check your edits.

The crawler will retry several times to crawl URLs that are temporarily unreachable.

Follow and Crawl Only URLs with the Following Patterns

Only URLs matching the patterns you specify (one per line) in this window will be followed and crawled. This allows you to control which files will be crawled on your server.

Example:

www.example.corp.com/example/

This entry limits the crawl to URLs containing www.example.corp.com/example/.

The URLs that are discovered are checked against these patterns for inclusion in the index. Only URLs that match these patterns are crawled and indexed. In order for a URL to be crawled and indexed, there must be a sequence of links matching the Follow patterns from one of the Starting URLs. If there is no valid link path, you should add the URL to the Start Crawling from the Following URLs section.

The URL patterns you list in this window must conform to the rules for valid URL patterns. To enter a URL pattern, type a valid pattern into the window. Press Enter to add additional patterns. Empty lines and comments (starting with #) are permitted.

Test These Patterns

To test which URLs will be matched by one of the patterns you have entered in this field, click either of the Test these patterns links to open the Pattern Tester Utility. This Utility lets you specify a list of URLs on the left and a set of patterns on the right. It tells you if each URL is matched by one of the patterns in the set.

When it opens, the Pattern Tester Utility is initialized with your saved entries from the Crawl and Index > Crawl URLs page. You can enter more URLs and patterns into the tester utility to best analyze your pattern sets. However, your modifications will not be saved; you have to explicitly enter and save them in the Crawl and Index > Crawl URLs page.

After you click the Test These Patterns button, the results appear on the same page. The green background indicates that at least one of the patterns does match the URLs you want to crawl. It also shows the first pattern that matched. The red background shows that none of the patterns matched this URL.

Click the Back to Crawl and Index > Crawl URLs link to return to the Crawl and Index > Crawl URLs page.

Do Not Crawl URLs with the Following Patterns

Any pure text in a document is extracted and indexed by Google file type search. Graphics, diagrams, and formatting information are not indexed. You can exclude any particular file format from being crawled and indexed by defining URL pattern exceptions to prevent crawling from occurring on those pages. URLs matching the patterns you specify (one per line) in this window will not be crawled.

This option allows you to prevent specific file types, directories, or other sets of pages from being crawled. For example, entering the pattern contains:? in this box will prevent many Common Gateway Interface (CGI) scripts from being crawled.

The URL patterns you list here must conform to the rules for valid URL patterns. To enter a URL pattern, type a valid pattern into the window. Press Enter to add additional patterns on new lines. Empty lines and comments (starting with #) are permitted.

For your convenience, this box is prepopulated with many URL patterns and file types, some of which you may not want the crawler to index. We do not recommend deleting any of the default patterns unless you detect parts of your site that are currently being excluded by these rules.

To make a pattern or file type unavailable to the crawler, remove the # mark in the line containing the file type. For example, to make Excel files on your servers unavailable to the crawler, change the line

#.xls$ to .xls$
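The effect of removing the # mark can be illustrated with a small Python sketch that reads the contents of the box, skips empty lines and comments, and tests a URL against the remaining patterns. The matching here is an assumption made for illustration only: a trailing $ is treated as "the URL ends with this text", a contains: prefix as "the URL contains this text", and anything else as a plain substring. The helper names are hypothetical.

```python
def active_patterns(box_text):
    """Return the patterns that are in effect: empty lines and
    lines starting with # are ignored."""
    patterns = []
    for line in box_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        patterns.append(line)
    return patterns


def is_excluded(url, patterns):
    """Assumed, simplified matching; not the appliance's full rules."""
    for pattern in patterns:
        if pattern.startswith("contains:"):
            if pattern[len("contains:"):] in url:
                return True
        elif pattern.endswith("$"):
            if url.endswith(pattern[:-1]):
                return True
        elif pattern in url:
            return True
    return False


box = "# file types\n#.doc$\n.xls$\n\ncontains:?\n"
patterns = active_patterns(box)
print(patterns)
print(is_excluded("http://www.example.com/report.xls", patterns))
print(is_excluded("http://www.example.com/report.doc", patterns))
```

In this sample, .xls$ is active and #.doc$ is still commented out, so only the Excel file is excluded.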

Test These Patterns

To test the patterns you have entered, click one of the Test these patterns links. When it opens, the Pattern Tester Utility is initialized with your saved entries from the Crawl and Index > Crawl URLs page. You can enter more URLs and patterns into the tester utility to best analyze your pattern sets. However, your modifications will not be saved; you have to explicitly enter and save them in the Crawl and Index > Crawl URLs page. After you click the Test These Patterns button, the results appear on the same page. The green background indicates that at least one of the patterns does match the URLs you want to crawl. It also shows the first pattern that matched. The red background shows that none of the patterns matched this URL.

Click the Back to Crawl and Index > Crawl URLs link to return to the Crawl and Index > Crawl URLs page.

Note: If the search should never crawl outside of your intranet site, then we recommend that you do one or more of the following:

● Configure your network to disallow Google Search Appliance connectivity outside of your intranet.

If you want to make sure that the Google Search Appliance never crawls outside of your intranet, then a person in your IT/IS group needs to specifically block the Google Search Appliance IP addresses from leaving your intranet. The GB-5005 and GB-8008 use three IP addresses, and these IP addresses are in your DNS entries as: googleswitch, googleweb, and googlecrawl. The GB-1001 uses only googleweb. Your IT/IS group needs to configure either an Access Control List (ACL) on your external routers or a set of rules on your firewall to disallow any communication between these IP addresses and the outside world.

● Make sure all patterns in the field Follow and Crawl Only URLs with the Following Patterns specify yourcompany.com as the domain name.

Crawl and Index > Databases

The Google Search Appliance can crawl your databases and show search results from the databases to users' queries. You need to supply information to allow crawl access to each database. You enter this information on the Crawl and Index > Databases page.

If your data source contains a URL column with URLs that point to your own website, add those URL patterns under Follow and Crawl Only URLs with the Following Patterns on the Crawl and Index > Crawl URLs page.

Here is the information about your database(s) that you need to have ready. The first seven entries are used by the system to talk to the external database server.

● Source Name - a name for the data source, which can contain only alphanumeric characters, underscores, and hyphens. The first character of the data source name must be either alphanumeric or an underscore.
● Database Type - choose from DB2, Oracle, MySQL, MySQL Server, or Sybase.
● Hostname - name of the server where the database resides.
● Port - the port number that is open to the database that JDBC should connect to.
● Database Name - the name given to the database.
● Username - user name to access the database.
● Password - password for the database.

● Crawl Query - a SQL statement accepted by the targeted database software that returns all rows to be indexed. See example.
● Display - choose from a default stylesheet for displaying results or upload a stylesheet from your network. (To view the default stylesheet, log on to the Google Support site. You can download it from there and make changes to it, then upload it on the Crawl and Index > Databases page.)
● Serving Interface - choose either Serve Query or Serve URL Field.
  ❍ Serve Query - a SQL statement that returns a row in a document that matches a search query. See example.
    Primary Key Fields - column heading names (separated by commas), such as Last_Name,First_Name,SSN,Birth_Date, etc.
  ❍ Serve URL Field - If your database records already have URLs that display them, you should specify the database column that contains the URL. For example, in a company directory, if an HTML page exists for each record, and the links are always in the same format (such as http://corp.company.com/hr/Joe_Employee.html), then the appliance displays that link when it serves results. Specify the name of the field that contains the URL, such as "Employee_name".

The Advanced Settings section lets you define additional database information for the appliance to crawl.

● Incremental Crawl Query - a SQL statement that targets insertions, updates, and deletions in the database.
  ❍ Action Field - the name of the column that lists the modification type; valid values for the Action field are "add" or "delete".
● BLOB MIME Type field - the name of the column that contains the standard Internet MIME type values of Binary Large Objects, such as text/plain and text/html.
● BLOB Content field - the name of the column that contains the types of BLOB content, such as documents.

The creation of a database source results in the automatic entry of the source in the Crawl and Index > Feeds page.

Examples of Serve Queries

Note: The primary key as entered in the field called Primary Key Fields needs to be part of the database query (either by "select * ..." or by "select <primary_key> ...").

For an "employee" database with these fields:

employee_id, first_name, last_name, email, dept

here are likely crawl and serve queries.

Crawl query:

SELECT employee_id, first_name, last_name, email, dept FROM employee

Serve query:

SELECT employee_id, first_name, last_name, email, dept FROM employee WHERE employee_id = ?

The Primary Key Field for this example is: employee_id

For a database with multiple column primary keys, if the combination of employee_id, dept is unique, then:

Crawl query can be the same as the one above.

Serve query:

SELECT employee_id, first_name, last_name, email, dept FROM employee WHERE employee_id = ? and dept = ?
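To see how the ? placeholders in a serve query are bound to primary key values, the queries above can be tried against a throwaway database. The sketch below uses Python's built-in sqlite3 module purely as a stand-in (the appliance itself connects through JDBC to one of the supported database types), and the sample rows are invented for illustration.

```python
import sqlite3

# In-memory stand-in for the "employee" database; rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee (employee_id INTEGER, first_name TEXT, "
    "last_name TEXT, email TEXT, dept TEXT)"
)
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Ada", "Lovelace", "ada@example.com", "eng"),
        (2, "Grace", "Hopper", "grace@example.com", "sales"),
    ],
)

CRAWL_QUERY = (
    "SELECT employee_id, first_name, last_name, email, dept FROM employee"
)
# One ? per primary key field, in the same order as Primary Key Fields.
SERVE_QUERY = CRAWL_QUERY + " WHERE employee_id = ? AND dept = ?"

all_rows = conn.execute(CRAWL_QUERY).fetchall()               # every row to index
one_row = conn.execute(SERVE_QUERY, (2, "sales")).fetchone()  # a single record

print(len(all_rows), one_row)
```

The crawl query returns every row, while the serve query returns only the row identified by the combination of employee_id and dept.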

To configure crawling a database:

1. Click Crawl and Index and then click Databases.
2. Enter your database information in the fields. All fields down to Advanced Settings are required. Refer to the section above for definitions.
3. Click the Create Database Data Source button.
4. Click the Sync link.

To edit an existing database configuration:

1. Click the Edit link next to the database you want to edit.
2. Enter your changes in the form.
3. Click the Save Database Configuration button.
4. Click the Sync link.

To delete a database configuration:

1. Select the Delete link to the right of the database name.
2. Click Yes to confirm the deletion.

Crawl and Index > Feeds

You can feed (or push) documents to the Google Search Appliance. You would want to do this if you have internal documents that cannot be found by the crawler or that do not lend themselves to HTTP crawling. The data source feeds are provided through an API and are displayed on the Crawl and Index > Feeds page by the system.

Here is the information about the data source feeds that you can see.

● Source Name - the name of the data source, which can contain only alphanumeric characters, underscores, and hyphens. The first character of the data source name must be either alphanumeric or an underscore.
● Feed Type - Full or Incremental. If the feed has been deleted, the column entry will be "Deleted."
● Time - the system's time-stamp at the start of each stage.
● Status - Accepted (preparing for indexing), in progress (indexing), or completed (queued and now serving).
● Documents included - the number of documents finished indexing.
● Documents with errors - the number of documents that had errors and were not added to the feed.

The List of Trusted IP Addresses section lets you choose either to trust feeds from all IP addresses on your network or to list those IP addresses to trust. If you select the second option, you enter the IP addresses of trusted machines. The system accepts individual IP addresses of the format X.X.X.X, where X represents one octet of a value from 0 to 255. The system also accepts subnet IP addresses in the format X.X.X.X/Y where Y represents the subnet mask in a range of 0 to 32.
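The two accepted formats can be checked with Python's standard ipaddress module. The sketch below is an illustration of the matching logic only, not the appliance's implementation; the function name and the sample addresses are hypothetical, and Y is interpreted as a prefix length.

```python
import ipaddress


def is_trusted(client_ip, trusted_entries):
    """Return True if client_ip matches a single address (X.X.X.X)
    or falls inside a subnet (X.X.X.X/Y) in trusted_entries."""
    addr = ipaddress.ip_address(client_ip)
    for entry in trusted_entries:
        # A bare address is parsed as a /32 network; strict=False also
        # accepts entries whose host bits are set, such as 192.168.0.5/24.
        network = ipaddress.ip_network(entry, strict=False)
        if addr in network:
            return True
    return False


trusted = ["10.1.2.3", "192.168.0.0/24"]
print(is_trusted("192.168.0.17", trusted))
print(is_trusted("10.1.2.4", trusted))
```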

If you need to add more addresses, click the Add More Rows button. When finished, click the Save Settings button.

If you delete a feed data source, all documents associated with the feeds from that source are removed from the index.

To delete a feed configuration:

1. Select the Delete link under the source name.
2. Click Yes to confirm the deletion.

Crawl and Index > Crawler Access

On the Crawl and Index > Crawler Access page, you configure how the crawler accesses web servers that require some kind of authentication for crawling the confidential content.

Crawl and Serve Secure Content

You can index and serve results on your content that is protected by authentication mechanisms (Basic Authorization and NTLM). (Note: Requires a special license. Contact your Google Account Manager.) You enter the URL matching patterns for secure areas and their domains, usernames, and passwords so that the crawler can crawl these locations. Using the Make Public checkbox, you can allow users to get results on both the public content (normally available to everyone) and the secure (confidential) content.

To set options for crawling secure content:

Have ready the URLs (or matching patterns), the domain used by the web server, and the user names and passwords.

1. Click Crawl and Index, and then click the Crawler Access link.
2. Under Users and Passwords for Crawling, enter the URLs Matching Pattern, the username, the domain, and the password and confirmation in the text boxes.
3. If you need more rows for additional patterns, click the Add More Rows button.
4. Click the Save Crawler Access Configuration button.

Important: The entries you make in the Users and Passwords for Crawling section are sequential rules. Always enter more specific rules before general rules. For example, first enter

http://corp.mycompany.com/secure/

followed by

http://corp.mycompany.com/
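Why the order matters can be seen in a short Python sketch. It assumes that "sequential rules" means the first matching rule wins and that a rule matches when the URL starts with its pattern; both are simplifications for illustration, and the usernames are hypothetical.

```python
def credentials_for(url, rules):
    """Return the username of the first rule whose pattern matches url.

    Assumption: rules are tried top to bottom and the first match wins.
    """
    for pattern, username in rules:
        if url.startswith(pattern):
            return username
    return None


good_order = [
    ("http://corp.mycompany.com/secure/", "secure_user"),
    ("http://corp.mycompany.com/", "general_user"),
]
bad_order = list(reversed(good_order))

url = "http://corp.mycompany.com/secure/plan.html"
print(credentials_for(url, good_order))  # specific rule is reached first
print(credentials_for(url, bad_order))   # general rule shadows the specific one
```

With the general rule listed first, the more specific rule for /secure/ is never reached.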

The Google Search Appliance can serve results over both plaintext HTTP and encrypted HTTPS.

When secure content results are displayed, the total number of results and number of pages returned is hidden to prevent exposing information about secure documents to users who do not have access.

Although there is no overload on secure servers at crawl time, a search request will add some load to servers containing secure content.

In the Search Box section of Page Layout, you can add option buttons to your search page that let your users decide to search on public content or on the complete index (both public and secure content) at the time of their search.

A query against public and secure content requires that the user be authenticated by entering the username and password for the secure area. If your servers require a domain name for authentication, users should enter it like this: domain/username. If a user enters an incorrect username or password, no secure results will be included in the search results.

Crawl and Index > Proxy Servers

On the Crawl and Index > Proxy Servers page, you configure a proxy server to crawl outside your internal network and include the crawled data in your index.

To set options for a proxy server:

1. Under Proxy Servers, specify the URL patterns you want crawled through the proxy in the For URLs Matching Pattern text boxes. These patterns must conform to the rules for valid URL patterns.
2. Specify the proxy to use for crawling URLs in the Use This Proxy Server text boxes.
3. Specify the proxy port in the On Port text boxes.
4. If you need more rows for additional patterns or proxy servers, click the Add More Rows button.
5. Click the Save Crawler Proxies Configuration button.

Crawl and Index > Cookie Sites

If your intranet has pages that are behind a login form or that require cookies to return the correct content, you can set up rules to provide the crawler with access to those pages. Then you can test your rules before you perform the crawl. You enter a URL for the login page and then a URL pattern for that area without the page name, but including the final slash to provide correct path information.

For example, the URL for the login page might be

http://mycompany.com/support/login.html

and the URL pattern then would be

http://mycompany.com/support/
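If you have many login pages, the URL pattern can be derived from each login URL mechanically by dropping the page name and keeping the final slash. The following Python sketch shows one way to do that; the function name is hypothetical and the result should still be reviewed against the rules for valid URL patterns.

```python
from urllib.parse import urlsplit, urlunsplit


def pattern_for_login_url(login_url):
    """Strip the page name from a login URL, keeping the final slash."""
    parts = urlsplit(login_url)
    # Everything up to the last "/" in the path is the directory.
    directory = parts.path.rsplit("/", 1)[0] + "/"
    return urlunsplit((parts.scheme, parts.netloc, directory, "", ""))


print(pattern_for_login_url("http://mycompany.com/support/login.html"))
```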

When you create the rule, you see a wizard page that displays your login page. Enter the username and password credentials and submit the login form. The wizard captures that information, as well as its action (POST or GET), and other values, depending on the available form fields. After a rule is set up, you can change the username or password, or change the length of time allowed for authentication to occur before the rule expires. The default is 300 seconds (5 minutes).

To set up a rule for crawling pages behind login pages or pages that require cookies:

1. Click Crawl and Index and then click Cookie Sites.
2. Enter the URL of the login page.

3. Enter the URL pattern of the location of the login page. (Include the final slash.)
4. Click the Create a New Cookie Rule button. A new browser window opens, displaying your login page in the lower half.
5. Type the correct username and password to log in to your site.

Note: If you mistype the username or password, extra actions may be recorded and displayed on the Cookie Sites page. To avoid that, close the Cookie Login Wizard window and restart the process on the Cookie Sites page.

6. Make sure that the page you expect to see appears.
7. Click the Save Cookie Rule and Close Window button. You are returned to the Cookie Sites page where your new rule is listed with its pattern, action, and form fields.
8. Click the Save Cookie Rules Configuration button.

To edit existing cookie rules:

1. Change the username and/or password, if necessary.
2. Change the time to wait for authentication by entering a new number of seconds or minutes, if you wish.
3. Click the Save Cookie Rules Configuration button.

To delete an existing cookie rule:

1. Select the Delete Rule checkbox to the right of the rule.
2. Click the Save Cookie Rules Configuration button.

Crawl and Index > Forms Authentication

The Google Search Appliance works with Single Sign-On (SSO) servers, which are available from a variety of vendors. (As of this writing, the SSO authentication provided by Oblix and Netegrity is supported.)

Using an SSO server has the advantage of requiring credentials from a user only one time, and it unifies the authentication process by first authenticating the user and then by authorizing the user on the web servers to which that user has access.

You provide the crawler with access to the web pages hidden behind a forms-based login control. To do this, you enter a URL pattern for those web pages, which is then treated as a forms authentication rule.

The forms authentication rule governs the action that occurs when a browser visits a particular URL using either the POST or the GET method and submits values into the fields used to authenticate a user's credentials.

When you create the rule, you see a wizard page that displays your login page. Enter the username and password credentials and submit the login form. The wizard captures that information, as well as its action (POST or GET), and other values. The form fields and values that you see in the Admin Console's Forms Authentication page depend on the SSO system your company uses.

After a rule is set up, you can add additional URL patterns that use the same access control. You can also use the Make Public option for each URL pattern to show web page snippets on the search results pages. In addition, you can change the username or password, or change the length of time allowed for authentication to occur before the rule expires. The default is 300 seconds (5 minutes).

Important: If you have set up additional HTTP Headers, they may conflict with this Forms Authentication feature. If that is the case, please contact Google Support by visiting http://support.google.com and clicking Contact Us to submit a request.

Note: To have your protected pages served by the Google Search Appliance, go to Serving > Forms Authentication.

Note: To set the length of time that a user's authorization for secure URLs should be kept in the Google Search Appliance authorization cache, go to Serving > Authorization.

To set up a rule for crawling pages behind a Forms Authentication login page:

1. Click Crawl and Index and then click Forms Authentication.
2. Enter a sample URL behind the forms login page.
3. Enter the URL pattern of the location of the login page. (Include the final slash.)
4. Click the Create a New Forms Authentication Rule button. A new browser window opens, displaying your login page in the lower half.
5. Type the correct username and password to log in to your site.

Note: If you mistype the username or password, extra actions may be recorded and displayed on the forms login page. To avoid that, close the Forms Authentication Wizard window and restart the process on the Forms Authentication page.

6. Make sure that the page you expect to see appears.
7. Click the Save Forms Authentication Rule and Close Window button. You are returned to the Forms Authentication page where your new rule is listed with its pattern, action, and form fields.
8. Click the Save Forms Authentication Rule Configuration button.

To edit existing Forms Authentication rules:

1. Change the username and/or password, if necessary.
2. Change the time to wait for authentication by entering a new number of seconds or minutes, if you wish.
3. Click the Save Forms Authentication Rule Configuration button.

To delete the Forms Authentication rule:

1. Select the Delete Rule checkbox to the right of the rule.
2. Click the Save Forms Authentication Rule Configuration button.

Crawl and Index > HTTP Headers

User Agent Name

The gsa-crawler is the Google Search Appliance robot that performs the crawling on a web site. The crawler identifies itself with every page it downloads from any web server by specifying a user agent that can be stored in a web server log file by webmasters.

The identifier used by the crawler consists of:

● The user agent name, set by default to gsa-crawler. If you change the user agent name, use only alphabetic characters and hyphens; no numerals or other characters are permitted.
● A unique identifier that is assigned for each Google Search Appliance.
● The problem email address you entered in Administration > System Settings.

If you keep the user agent name gsa-crawler, the accessed web servers might see an identifier such as

gsa-crawler (Enterprise; GID01065; [email protected])

The email is a required part of the identification to allow webmasters to contact you if the Google Search Appliance affects them negatively by crawling their sites too rapidly.
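The way the three parts combine can be sketched in Python. The layout is inferred from the single sample identifier above, so treat the exact punctuation as an assumption; the function name and the address admin@example.com are hypothetical.

```python
import re


def crawler_identifier(agent_name, appliance_id, contact_email):
    """Build an identifier string shaped like the sample above."""
    # The user agent name may contain only alphabetic characters and hyphens.
    if not re.fullmatch(r"[A-Za-z-]+", agent_name):
        raise ValueError("user agent name must use only letters and hyphens")
    return f"{agent_name} (Enterprise; {appliance_id}; {contact_email})"


print(crawler_identifier("gsa-crawler", "GID01065", "admin@example.com"))
```

A name such as crawler2 is rejected because it contains a numeral.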

There may be pages or sites in your organization that you do not want the Google Search Appliance to crawl, such as password-protected directories with information that you want to keep private. To prevent the gsa-crawler from accessing the information on these servers, you can either:

● Enter their URL patterns in Do Not Crawl URLs with the Following Patterns.
● Create and put a robots.txt file in the root of the server. A robots.txt file consists of the user-agent name and one or more lines of instruction for the robot.

For example:

# /robots.txt file for gsa-crawler (This is a comment line.)
User-agent: gsa-crawler (This names the user-agent that the file targets.)
Disallow: /*.cgi (The gsa-crawler will not be allowed to crawl any CGI files.)
Disallow: /*.pl (The gsa-crawler will not be allowed to crawl any Perl scripts.)
Allow: /$ (The gsa-crawler is allowed to crawl everything else.)
Disallow: / (This prevents the gsa-crawler from crawling anything on the site.)

For more information, see the resource A Standard for Robot Exclusion, which explains robots.txt files in detail.
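Before deploying a robots.txt file, you can check how a standards-following robot would read it with Python's urllib.robotparser module. This is a rough check only: the standard-library parser does simple path-prefix matching and does not interpret wildcard rules such as /*.cgi, so the sketch uses a plain directory rule. The /private/ path and the host name are hypothetical, and the appliance's own interpretation may differ.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that keeps gsa-crawler out of one directory.
ROBOTS_TXT = """\
User-agent: gsa-crawler
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url):
    """Return True if the gsa-crawler user agent may fetch url."""
    return parser.can_fetch("gsa-crawler", url)


print(allowed("http://corp.example.com/private/salaries.html"))
print(allowed("http://corp.example.com/index.html"))
```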

Additional HTTP Headers for Crawler

Specify HTTP headers that will be included in all HTTP requests made during crawling.

The HTTP headers specified in this window must follow the formats specified by http://www.w3.org/Protocols/rfc2616/rfc2616-sec4.html#sec4.2 and http://www.w3.org/Protocols/rfc2616/rfc2616-sec5.html#sec5.3.

Two examples of valid HTTP headers are Authorization and Proxy-Authorization. Be sure to read about them before using.

● Authorization: http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.8

● Proxy-Authorization: http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.34

Caution: The crawler uses certain HTTP headers for its normal operation (such as Host, Connection, Accept, From, and User-Agent). Any values entered here for these headers will override the crawler's standard headers and may cause undesired operation.

You may use nonstandard headers that enable passing certain information your servers may require, but make sure that all nonstandard headers are valid for your servers. Otherwise, search results may be returned in an unpredictable manner.

To specify additional HTTP headers:

1. Click Crawl and Index and then click HTTP Headers.
2. In the Additional HTTP Headers for Crawler box, enter a new header.
3. To add more headers, press Enter to start a new line.
4. After all the headers are specified, click the Update Header Settings button.

Example header:

Authorization: Basic c29tZXVzZXI6c29tZXBhc3M=
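
The value after "Basic" in the example header is the Base64 encoding of a user name and password joined by a colon, as defined for HTTP Basic authentication. The following Python sketch shows how such a header line could be generated; the function name is hypothetical, and the credentials are the ones the example header decodes to (someuser and somepass).

```python
import base64


def basic_authorization_header(username: str, password: str) -> str:
    """Return an Authorization header line for HTTP Basic authentication."""
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return f"Authorization: Basic {token}"


header_line = basic_authorization_header("someuser", "somepass")
```

Base64 is an encoding, not encryption, so anyone who can read the header can recover the password.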

Crawl and Index > Duplicate Hosts

The Duplicate Hosts page lets you prevent the recrawling of content that resides on mirrored servers. For example, if you have load-balancing servers in your system that serve the same content, you will not want all these servers crawled since they contain only duplicates of what you are already crawling. Entries on this page identify the duplicate hosts so that any links found during the crawl that point to the duplicate host are treated as if they are pointing to the corresponding canonical host.

The following rules also apply to entries on this page:

● Only one <canonical_host> entry is permitted per box in the Canonical Host column.
● The <canonical_host> must be a fully qualified host name.
● Multiple <duplicate_host> entries are permitted in the same box for the corresponding canonical host.
● Each box in the Duplicate Host column must contain at least one entry.

Examples:

Canonical Host     Duplicate Host(s)
www.google.com     www.offsite.com web.offsite.com
www2.google.com    website.example
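
To make the effect of these entries concrete, the following Python sketch rewrites a link that points at a duplicate host so that it points at the corresponding canonical host. It is only an illustration of the mapping described above, not the appliance's implementation; the function and variable names are hypothetical, and the sketch ignores ports and user information in the URL.

```python
from urllib.parse import urlsplit, urlunsplit

# The example table above: canonical host -> duplicate hosts.
DUPLICATE_HOSTS = {
    "www.google.com": ["www.offsite.com", "web.offsite.com"],
    "www2.google.com": ["website.example"],
}

# Inverted lookup: duplicate host -> canonical host.
_CANONICAL_FOR = {
    duplicate: canonical
    for canonical, duplicates in DUPLICATE_HOSTS.items()
    for duplicate in duplicates
}


def canonicalize(url: str) -> str:
    """Return the URL with a duplicate host replaced by its canonical host."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    canonical = _CANONICAL_FOR.get(host)
    if canonical is None:
        return url  # Not a known duplicate host; leave the link unchanged.
    return urlunsplit(parts._replace(netloc=canonical))
```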

Crawl and Index > Document Dates

Using the Document Dates page, you can sort and present search results based on the date in the documents. Here you define rules for the crawler to use as it indexes documents.

Google extracts the date from the title, text, URL, or meta tag of the document or from the last modified date returned by the HTTP server. By default, the last-modified field returned by the HTTP headers for all documents is checked for the date. The Document Dates search also looks in the text of non-HTML files for the date.

For the date extracted from the title, text, URL, or meta tag, the first instance of the most common date format encountered is considered the date of the document. Files that have been moved to a directory and are being sorted by last-modified date may reflect the date the file was copied or moved.

Google recognizes dates in most reasonable formats. However, dates that only mention the year (YY or YYYY), such as 2002, are not used. For dates in the format month year, the date is assumed to be the first of the month. Document Dates currently recognizes most Latin1 month names, but not Chinese, Japanese, or Korean month names.

Date Format Meanings

Format Example

YYYY 2001

YY 99

YR YY, YYYY

M 2 or 02

D 7 or 07

MM 02

DD 07

WK Monday or Mon

MON March or Mar

SEP - or / or . or ,

HYP -

PER .

SLA /

Acceptable Date Formats

Format Separator Example

YYYY_M_D HYP 2001-2-27

YYYY_D_M HYP 2001-27-2

YYYY_M_D PER 2001.2.27

YYYY_D_M PER 2001.27.2

YYYY_M_D SLA 2001/2/27

YYYY_D_M SLA 2001/27/2

D_M_YYYY HYP 20-2-1999

M_D_YYYY HYP 2-23-1999

D_M_YYYY PER 20.2.1999

M_D_YYYY PER 2.23.1999

D_M_YYYY SLA 20/2/1999

M_D_YYYY SLA 2/23/1999

YY_MM_DD HYP 99-04-27

DD_MM_YY HYP 27-04-99

MM_DD_YY HYP 04-27-99

YY_MM_DD PER 99.04.27

DD_MM_YY PER 27.04.99

MM_DD_YY PER 04.27.99

YY_MM_DD SLA 99/04/27

DD_MM_YY SLA 27/04/99

MM_DD_YY SLA 04/27/99

WK_D_MON_YR (comma) Tue, 3 March, 2001

WK_MON_D_YR (comma) Tue, March 3, 2001

D_MON_YR (space, comma) 2 Jan, 99

MON_YYYY (space) March 2001

MON_D_YR (space, comma) Mar 03, 99

MON_YY (space) Mar 99

YYYYMMDD (none) 20010323

DDMMYYYY (none) 23032001

MMDDYYYY (none) 03232001

YYMMDD (none) 990225

DDMMYY (none) 150299

MMDDYY (none) 021599

Use meta tags with dates in the ISO-8601 format (YYYY-MM-DD) to avoid the confusion caused by multiple dates and multiple formats in the title or text of the documents.
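
The following Python sketch illustrates the kind of confusion this recommendation avoids. It uses the standard datetime module, not the appliance's date extraction, and the helper name is hypothetical. The same text parses to two different dates under the DD_MM_YY and MM_DD_YY formats, while an ISO-8601 date has only one reading.

```python
from datetime import datetime


def parse_candidates(text, formats):
    """Return {format: date} for every format that parses the text."""
    results = {}
    for fmt in formats:
        try:
            results[fmt] = datetime.strptime(text, fmt).date()
        except ValueError:
            pass  # The text does not fit this format.
    return results


# DD_MM_YY and MM_DD_YY with the hyphen separator both accept this text.
ambiguous = parse_candidates("04-05-99", ["%d-%m-%y", "%m-%d-%y"])

# An ISO-8601 date matches exactly one interpretation.
iso = parse_candidates("1999-05-04", ["%Y-%m-%d"])
```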

The date of each file is returned in the date field of the results. This cannot be turned off, but you can choose not to display it on the front end to your users.

If no date is found for a file, it is indexed without date data. Results that do not contain date data are displayed at the end of the results with dates, sorted by relevance.

If you have documents that contain exceptions to the default dates rule, enter the specific URL or pattern for the file and place these rules at the top of your list. The rules are handled in the order in which they are specified in the rule list. The first rule containing a valid date for the document determines the date of the document.

To specify rules for dates of documents:

1. Click Crawl and Index and then click Document Dates.
2. In the Host or URL Pattern column, enter the host or pattern to which the rule will apply.
3. Use the drop-down list in the Locate Date In column to select the location of the date for the documents in the specified URL pattern.
4. If you select Meta Tag, specify the name of the meta tag in the Meta Tag Name column.
5. To add more rules, click the Add More Lines button.
6. After all the rules are specified, click the Save Changes button.

Examples of rules:

Rule #   Host or URL Pattern       Date Located In   Meta Tag Name
1        www.foo.com/google/       Title
2        www.foo2.com/archives/    URL
3        www.foo.com/              Meta tag          publication_date
4        www.foo2.com/             Body
5        /                         Last Modified

Because the document http://www.foo.com/google/foo.html matches the URL pattern in rule 1, we first check for the date in the title of the document. The URL doesn't match rule 2, so we check against rule 3, which it does match. If we are unable to find a valid date in the title, we look for the date in the meta tag named publication_date according to rule 3. If we are unable to find a valid date in the meta tag, we default to the last-modified date returned by the HTTP server, according to rule 5.

The date from the URL http://www.foo2.com/archives/20040605/abc.html will be extracted.

Since the document http://www.foo.com/foo.html does not match the URL pattern in rule 1, we look for the date in the meta tag, according to rule 3 and default to rule 5 if we cannot find a valid date in rule 3.

For the document http://www.foo2.com/foo.html, we look for the date in the body and default to the last-modified date.

For the document http://www.foo3.com/foo.html, we look for the date only on the last-modified header as it only matches the URL pattern of rule 5.
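
The rule ordering in these examples can be sketched in Python as follows. The sketch lists, in order, the locations that would be checked for a given URL. It assumes that a pattern matches when the URL (without its scheme) begins with the pattern and that the pattern / matches every URL; the real matching follows the rules for valid URL patterns, and the names used here are hypothetical.

```python
# The example rules above, in list order: (host or URL pattern, date location).
RULES = [
    ("www.foo.com/google/", "Title"),
    ("www.foo2.com/archives/", "URL"),
    ("www.foo.com/", "Meta tag"),
    ("www.foo2.com/", "Body"),
    ("/", "Last Modified"),
]


def strip_scheme(url):
    """Drop a leading scheme such as http:// from the URL."""
    return url.split("://", 1)[-1]


def date_locations(url, rules=RULES):
    """Return the locations to check, in rule order, for this URL.

    Assumption: prefix matching, with "/" treated as matching every URL.
    """
    target = strip_scheme(url)
    return [
        location
        for pattern, location in rules
        if pattern == "/" or target.startswith(pattern)
    ]
```

The first location in the returned list that yields a valid date would determine the date of the document.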

Different Date Formats

Your corpus of documents can contain any number of different date formats. However, you must define a separate rule for each different date format.

For example, foo.html contains a title with the following date format:

June 7, 2004

And bar.html contains a title with the following date format:

6/7/2004

You would need to define two separate rules to match both date formats:

Rule: contains:foo    Location of date: Title
Rule: contains:bar    Location of date: Title

Crawl and Index > Host Load Schedule

Maximum Number of URLs to Crawl

Your license specifies the maximum number of URLs you can crawl. However, you can specify a smaller maximum number of URLs you wish to crawl if you do not yet have as many URLs as your license stipulates. You can improve system performance if you enter a number that is less than the maximum overall pages specified by the license. After you click the Save Schedule and Host Loads button, the system will crawl up to approximately 10% over the number you specified. The system crawls slightly more URLs, so that after it eliminates duplicates, the number of pages closely matches the maximum you specified.

Note: If you leave this box blank, the system continuously crawls URLs to the limit of your license.

Web Server Host Load

The Web Server Host Load value specifies the maximum number of concurrent connections open on every web server for crawling. We recommend you start with a value of 4 connections and then gradually increase the value only when you are confident your web servers can handle the load you specify. Check with the webmaster whose sites you crawl if you are uncertain of a web server's load capacity.

Warning: Some servers may not be able to handle a high load.

If the crawler deems that a server cannot handle the host load defined, it reduces the crawl rate until an acceptable response time is achieved.

Note: The number of concurrent connections may occasionally be lower than the number you specify here, depending on your system activity. The system attempts to maintain this number.

Exceptions to Web Server Host Load

Exceptions to Web Server Host Load lets you specify exceptions for web server host loads by assigning different maximum host loads for specified web servers. For time periods when you do not specify a host load exception, the default web server host load will apply.

For example, you may have three web servers that can handle more crawl load during the night. For these three web servers, you can specify a higher load than the default host load setting of 4 for 12 a.m. to 6 a.m.

To minimize the host load on servers during the day, you might set an exception value of 0 between 9:00 a.m. and 5:00 p.m., when the servers cannot handle the extra load.

The host name you enter must be a fully qualified host name.

When sites are crawled using a proxy, the same host load is used to crawl all sites behind the proxy. The host load used will be the maximum host load specified for any URL pattern crawled using the proxy. You should do one of the following:

● Specify no host load for sites that you wish to crawl using the proxy, in which case the maximum host load is used.
● Specify a host load that is small enough so as not to affect the performance of any proxied sites.

The following rules also apply to entries on this page:

● Only one host name entry is permitted per line.
● A host load of zero (0) means that the crawler will access the server only a few times per hour.
● You may specify the load factor as a decimal value, for example: .5, 1, 2.0

Note: A value of 2 indicates that, on average, only two concurrent connections per host are used. Similarly, a value of .25 indicates that, on average, a connection to the web server is in use only 25% of the time.
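
The way exceptions combine with the default host load can be sketched in Python as follows. This is only an illustration of the behavior described in this section: the host name, the time windows, and the higher nighttime value of 8 are hypothetical, and hours are simplified to whole numbers from 0 to 23.

```python
DEFAULT_HOST_LOAD = 4.0  # The recommended starting value from this section.

# Hypothetical exceptions: (fully qualified host, start hour, end hour, load).
# The end hour is exclusive.
EXCEPTIONS = [
    ("web1.example.com", 0, 6, 8.0),   # Higher load from 12 a.m. to 6 a.m.
    ("web1.example.com", 9, 17, 0.0),  # Minimal load from 9 a.m. to 5 p.m.
]


def host_load(host, hour, exceptions=EXCEPTIONS, default=DEFAULT_HOST_LOAD):
    """Return the host load that applies to a host at a given hour."""
    for exception_host, start, end, load in exceptions:
        if exception_host == host and start <= hour < end:
            return load
    # No exception covers this host and hour, so the default applies.
    return default
```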

Crawl and Index > Index Rollback

The Google Search Appliance tracks the continuous indexing, taking snapshots that are identified by unique date-time stamps. You select the snapshot of the index that you want to serve and Google makes sure that index is valid.

Manual Rollback

The Google Search Appliance takes a snapshot of your index every six hours. You can use

● the Most Recent Valid Snapshot to serve your users (the default and recommended),
● the most up-to-date index as of the current time,
● the index snapshot as of 6 hours before the current time,
● the index snapshot as of 12 hours before the current time, or
● the index snapshot as of 24 hours before the current time.

When Most Recent Valid Snapshot is selected, if the index does not meet the Required URLs in Results (and/or number of URL prerequisites) set on the bottom half of the Crawl and Index > Index Rollback page, it automatically rolls back to the most recent valid snapshot. Other causes for a rollback are a problem with the configuration of the system, with your network or web servers, or other errors.

If there is a problem with the prerequisites for serving an index, a warning displays on the Rollback page and an email notification is sent to the person designated in System Settings to receive such notifications.

On the Crawl and Index > Index Rollback page, if you choose a snapshot from the dropdown list other than the default, the selection you make is shown on the page next to Most Recent Valid Snapshot. If the most recent valid snapshot is the current index, it displays "Current Time." If there is no valid snapshot, it displays "None."

If you select something other than Most Recent Valid Snapshot, and the index encounters a problem, you will see a warning, but the system will not automatically roll back to a previous snapshot. You will need to select another snapshot.

To roll back to an earlier index snapshot:

1. Select an index from the Serve content from index as of dropdown list.
2. Click the Change Serving Snapshot button.

On the same page, you can test any snapshot before you serve it.

To test the search results of an index snapshot:

1. Select a snapshot from the Serve content from index as of dropdown list.
2. Click the Test Center link. A search page displays.
3. Enter a test search term, click Search, and view the results.

Automatic Rollback

The Automatic Rollback settings help you quickly identify potential problems with an index. You specify the particular result that must appear in the first 20 search results for a given search query. If the result you specify does not appear in the first 20 results, the index automatically rolls back to the most recent valid snapshot. To require that several results display for a search term, enter the same search term several times, each with a different URL.

You can also specify the number of search results that must display for a given search term for the index to be valid.
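
The validity check described above can be sketched in Python as follows. This is a simplified illustration, not the appliance's logic: the function and parameter names are hypothetical, the results are assumed to be an ordered list of URLs for one search term, and the required number of results is assumed to refer to the total number of results returned.

```python
def index_is_valid(results, required_urls=(), min_results=0, window=20):
    """Check one search term's results against the rollback prerequisites.

    results: ordered list of result URLs for the search term.
    required_urls: URLs that must all appear within the first `window` results.
    min_results: minimum number of results that must be returned.
    """
    top_results = results[:window]
    if any(url not in top_results for url in required_urls):
        return False
    return len(results) >= min_results
```

If such a check failed for a new index, the previous valid snapshot would continue to be served.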

To specify one or more URLs that must be displayed for a search term before the new index can be served:

1. Enter the search term in the Search box.
2. Enter the corresponding URL in the Required URLs in Results box, or enter a number of results that must display.
3. For more lines after the three given, click the Add More Required URLs button.
4. Click the Save Changes button before leaving the page.

Read how the Google Search Appliance crawls and indexes your intranet and public web sites.

Crawl and Index > Freshness Tuning

The Freshness Tuning page lets you fine-tune the timing of crawls on different URLs. You can control crawling to be more frequent, as for news documents, or less frequent, as for archived documents. You can also recrawl URLs that would not normally be recrawled, if you have documents that are not returning the correct last-modified date.

Crawl More Frequently

You may have content that changes frequently, as often as once an hour or even every few minutes. On the Crawl and Index > Freshness Tuning page, you can specify the URL patterns of pages that change frequently, so that they are crawled often, keeping your serving index fresh.

It is possible to slow the system down by overloading the frequently changing content section. Try to keep the number of URLs fairly small to avoid reduced performance.

To set options for crawling frequently changing content:

1. Under Crawl More Frequently, enter URL patterns for content that changes often.
2. Click the Save Changes button.
3. In the left-side menu, click Crawl and Index, then click the Crawl URLs link.
4. Check the URLs in the Start Crawling from the Following URLs box to make sure the documents can be reached.
5. Check the URLs in the Follow and Crawl Only URLs with the Following Patterns box to make sure the patterns you entered in the Crawl More Frequently section are included.

Crawl Less Frequently

To index documents that are never updated or modified, such as a stable database, or that are only incrementally added to, such as in a mail or a news archive, you can have the crawler reuse URLs that have already been crawled. This reusing of URLs reduces the load on your web servers. Make sure that the archival URL patterns you specify can be reached from the Start URLs and are in the Follow and Crawl Only URLs with the Following Patterns box.

Example:

Using a Lotus Domino database that is never modified, with a URL of http://myhost.com/mydb.nsf, you would add this pattern to the Archives URL Patterns in the Freshness Tuning page:

http://myhost.com/mydb.nsf

After the initial indexing of that URL, the crawler would fetch all pages in mydb.nsf from the local cache.

If the database is append only, that is, new documents are added, but old ones are not modified, then use these patterns:

regexp:http://myhost\\.com/mydb\\.nsf/.*\\?OpenDocument.*
regexp:http://myhost\\.com/mydb\\.nsf/.*\\$FILE.*

The crawler will first try to fetch documents or newly added attachments in mydb.nsf from the local cache when possible. The crawler will still fetch views (?OpenView URLs) from the remote Domino server if the database has actually changed, that is, when new documents have been added.
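
To see which URLs the append-only patterns are meant to cover, the following Python sketch tests a few sample URLs with the standard re module. It assumes that each doubled backslash in the patterns above stands for a single regular-expression escape and that a pattern must match the whole URL; the sample document paths and the function name are hypothetical.

```python
import re

# Assumption: "\\." in the pattern list corresponds to the regex escape "\.".
APPEND_ONLY_PATTERNS = [
    re.compile(r"http://myhost\.com/mydb\.nsf/.*\?OpenDocument.*"),
    re.compile(r"http://myhost\.com/mydb\.nsf/.*\$FILE.*"),
]


def matches_archive_pattern(url):
    """Return True if the URL matches one of the append-only patterns."""
    return any(pattern.fullmatch(url) for pattern in APPEND_ONLY_PATTERNS)


# Hypothetical URLs: a document, an attachment, and a view.
document_url = "http://myhost.com/mydb.nsf/0/ABC123?OpenDocument"
attachment_url = "http://myhost.com/mydb.nsf/0/ABC123/$FILE/report.pdf"
view_url = "http://myhost.com/mydb.nsf/AllDocs?OpenView"
```

Under these assumptions the document and attachment URLs match, while the view URL does not and would still be fetched from the server.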

To set options for crawling archival servers:

1. Under Crawl Less Frequently, enter URL patterns for rarely changing or archived documents.
2. Click the Save Changes button.
3. In the left-side menu, click Crawl and Index, then click the Crawl URLs link.
4. Check the URLs in the Start Crawling from the Following URLs box to make sure the archived documents can be reached.
5. Check the URLs in the Follow and Crawl Only URLs with the Following Patterns box to make sure the patterns you entered in the Crawl Less Frequently section are included.

Always Force Recrawl

The first time URLs are crawled, the data is indexed and stored on disk. Subsequently, to allow for faster crawls and less load on the servers, only files encountered whose last-modified dates have changed will be crawled. These updates are added to the index.

Enter URL patterns in the Always Force Recrawl section only if out-of-date pages are displayed in your index. Although the crawler tries to detect servers that report wrong dates and to adjust automatically, other types of misconfiguration may be present.

Make sure that your servers maintain the correct time. If you think one or more of your web servers does not support the If-Modified-Since request header or is misconfigured, use this section to enter URL patterns to recrawl. Refer problems to the webmaster.

To force recrawling certain URL patterns, regardless of last-modified date:

1. Under Crawl and Index, click the Freshness Tuning link.
2. Under Always Force Recrawl, enter URL patterns for pages to always recrawl regardless of last-modified date.
3. Click the Save Changes button.

Recrawl These URL Patterns

If you discover that a set of URLs you want to have in the search index is not being crawled (usually because of changes made to the web pages or because of a temporary error or misconfiguration present when the crawler last tried to crawl the URL), you can enter the pattern here to inject it quickly into the queue of URLs the Google Search Appliance is crawling.

Enter the URL pattern and click the Recrawl These URL Patterns button. The URL pattern is placed in the queue, where it will be crawled soon, unless there are higher priority URLs in the queue.

Crawl and Index > Collections

The crawler accesses and indexes the URLs and URL patterns that you entered in the Crawl and Index > Crawl URLs page. The resulting index is the default_collection that you see on the Crawl and Index > Collections page.

You, as administrator, can create collections of documents that are subsets of the complete index. Each collection is defined by a group of URL patterns that encompasses the URLs of the documents in the collection. You can also import a collection configuration that was previously exported from the system.

A collection lets your users search over a specific part of the index. For example, you may want to create a products collection or a human_resources collection that support searches only within the products or human resources part of your index.

The number of collections that you can create is unlimited. Collection names can be up to 20 characters long and can contain only alphanumerics, underscores, and dashes, but a collection name cannot begin with a dash.

Important: Avoid naming your collections "search," "images," and "groups." These words have special meanings in the Google Search Appliance.
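
The naming rules above can be checked before you create a collection, for example with a small Python sketch such as the following. The function name is hypothetical, "alphanumerics" is assumed to mean ASCII letters and digits, and the comparison with the reserved words is assumed to ignore case.

```python
import re

# Up to 20 characters: letters, digits, underscores, and dashes,
# with no dash in the first position.
_COLLECTION_NAME = re.compile(r"[A-Za-z0-9_][A-Za-z0-9_-]{0,19}")

# Words with special meanings in the Google Search Appliance.
RESERVED_NAMES = {"search", "images", "groups"}


def check_collection_name(name):
    """Return a list of problems with the name; an empty list means it is fine."""
    problems = []
    if not _COLLECTION_NAME.fullmatch(name):
        problems.append(
            "use 1 to 20 letters, digits, underscores, or dashes, "
            "and do not begin with a dash"
        )
    if name.lower() in RESERVED_NAMES:  # Assumption: case-insensitive check.
        problems.append("name has a special meaning in the appliance")
    return problems
```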

To create a collection:

1. On the Crawl and Index > Collections page, under Create New Collection, enter a name for the collection.
2. Either leave the Use default configuration option selected or click the Import configuration from file option.
3. Click the Create Collection button. The new collection's name appears in the list of collections and is selected.
4. On the Crawl and Index > Collections page, click the Edit link next to the collection name.
5. Enter the URL patterns you want to include in the collection in the upper box. At least one valid URL pattern is required.
6. Enter URL patterns for pages that you do not want to include in the collection, if you wish, in the lower box.
7. In each box, press Enter to add additional URLs or patterns.

Empty lines and comments (starting with #) are permitted.

Note: These are the URLs that will define the contents of your collection. Any URL patterns you provide must conform to the rules for valid URL patterns.

8. Click the Save Collection Definition button.
9. Return to the Crawl and Index > Collections page to create another collection.

Note: You must enter at least one URL pattern to have search results for your collection.

Exporting a Configuration

If you have a collection that is set up in a way that you'd like to reuse, you can export its configuration and import that configuration for a new collection.

The collection configuration file is an XML file that contains:

● entries in Include Content Matching the Following Patterns
● entries in Do Not Include Content Matching the Following Patterns
● required URLs entered in the Automatic Rollback section of the Index Rollback page

To reuse the information in a configuration file:

1. Click the Export Configuration link next to the name of the collection whose configuration you want to reuse.
2. In the Download dialog box, click Save to save the file, noting the location of the file you are saving. (The configuration file's name is collection_name.xml).
3. Under Create a New Collection, enter a name for the new collection.
4. Select the Import configuration from file option and use its text box to enter the configuration's path (or browse for the file). If you browse, find the file, highlight it, and click Open.

Default Collections

In addition to the collections you create, Google Search Appliance, by default, creates collections for:

● Your complete index, which you can expose to your users or not, as you wish
● Language-based pages, enabling support for searches restricted to pages in specific languages
● Meta tags, enabling support for searches restricted to pages with specific meta tag names or name-value pairs

Searching Collections

Individual collection search results have the same relevance ranking as full index searches. Only the content searched differs as it is restricted to the individual collection's content.

The Page Layout Helper lets you automatically modify the search form to include a menu for search by collection.

To search a collection:

To restrict searches to a collection that you have defined, add the following to the URL of your search query:

&site=COLLECTION_NAME

Examples:

A search for "vacation" in the collection "Human_Resources":

http://www.google.com/search?q=vacation&output=xml&client=yoursite&site=human_resources

This search returns vacation results specifically from URLs in the Human_Resources collection.

A search for "product" in the collections "Development" and "Marketing":

http://www.google.com/search?q=product&output=xml&client=yoursite&site=(development)|(marketing)

This search for "product" returns results from either the Development or Marketing collections.
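
Search URLs like the two examples above can be assembled programmatically. The following Python sketch uses the standard urllib.parse module; the function name is hypothetical, and the base URL and client value are the placeholders used in the examples. Parentheses and the | character are left unescaped so that the output matches the form shown above.

```python
from urllib.parse import urlencode


def collection_search_url(base_url, query, client, collections):
    """Build a search URL restricted to one or more collections."""
    if len(collections) == 1:
        site = collections[0]
    else:
        # Several collections are combined as (name1)|(name2).
        site = "|".join(f"({name})" for name in collections)
    params = {"q": query, "output": "xml", "client": client, "site": site}
    query_string = urlencode(params, safe="()|")
    return f"{base_url}?{query_string}"


single = collection_search_url(
    "http://www.google.com/search", "vacation", "yoursite", ["human_resources"]
)
either = collection_search_url(
    "http://www.google.com/search", "product", "yoursite",
    ["development", "marketing"],
)
```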

For more information, see the Filtering section of the Google XML Reference.

Crawl and Index > Collections > <collection_name>

The URL patterns that you enter on the Crawl and Index > Collections > <collection_name> page define the contents of your collection and govern the results that are served to your users when they search this collection. You must enter URL patterns for content you want to include and may also enter URL patterns for content you want to exclude from the collection.

Any URL patterns you provide must conform to the rules for valid URL patterns.

Note: Here are instructions for creating a collection.

To edit a collection:

1. On the Crawl and Index > Collections page, click the Edit link next to the collection name.
2. Enter the URL patterns you want to include in the collection in the upper box. At least one valid URL pattern is required.
3. Enter URL patterns for pages that you do not want to include in the collection, if you wish, in the lower box.
4. In each box, press Enter to add additional URLs or patterns.

Empty lines and comments (starting with #) are permitted.

5. Click the Save Collection Definition button.
6. Return to the Crawl and Index > Collections page to create another collection or to select another collection to edit.

Note: You must enter at least one URL pattern to have search results for your collection.

Status and Reports > Crawl Status

The Status and Reports > Crawl Status page provides information about the current status of a crawl. The chart is on a short time delay. The following information is available:

● Total Documents Being Served - The total number of URLs indexed at the time of viewing this page.
● Current Crawling Rate - The number of pages being crawled per second.
● Document Bytes Indexed - Total size of the stored documents that have been crawled.
● Documents Crawled Since Yesterday
● Document Errors Since Yesterday

This page also reports whether the crawl is paused or is running. If the system is crawling, you'll see "The crawling system is currently running." Next to that status is a Pause Crawl button. Click this button to temporarily suspend crawling. The status then reports: "The crawling system is currently paused." Click the Resume Crawl button to start the crawl again.

Note: You can change the frequency of crawling certain web servers on the Crawl and Index > Freshness Tuning page.

The Crawl Status graph shows the URL Tracker results. The x-axis represents two-hour segments on Universal Military Time (UMT). The y-axis shows the number of URLs crawled. The red line shows the number of URLs successfully crawled. The yellow line shows all found URLs, not including those that had errors, were excluded by follow-patterns, or were excluded by robots.txt. When both lines represent the same number of URLs, the yellow line may cover the red line.

You can test your search index by clicking the Test Center link in the horizontal blue bar at the top right of the page. This link takes you directly to the search page where you can run sample queries to test results. The link appears on all Admin Console pages.

Status and Reports > Crawl Diagnostics

On the Status and Reports > Crawl Diagnostics page, you can see what happened to each URL that has been crawled or that the crawler attempted to access. The URLs are displayed in a table by host name, first by directory, followed by file names. You can click on a directory to see a breakdown of its directories and files or on a file name to see information in addition to its crawl history.

The following information is provided on the Crawl Diagnostics page.

● Page Rank - the relevancy rank in the index that this URL received.
● File/Directory - file names and directories.
● Crawling Status - sort by the links to view all of the URLs, the ones that were crawled successfully, ones that had errors, or ones that were excluded.

❍ Successful The total number of URLs crawled at the time of viewing this page.

❍ Errors The URLs that could not be reached by the crawler because the server (where crawl was attempted) returned an error for them, possibly due to network problems. Depending on the error, the crawler will retry crawling some URLs. When the crawl status reports an error, it displays the error, such as: "Retrying url: Host unreachable while trying to fetch robots.txt."

❍ Excluded The URLs that were discovered, but dropped and not crawled at all. Some reasons for exclusion are the existence of a robots.txt file, an entry in Do Not Crawl URLs with the Following Patterns, or perhaps the URL contained an excluded document type, such as a GIF file.

● Time Crawled - the date and time that the URL was crawled.

Note: The crawler retries unreachable URLs several times.

To view URLs and their crawl status for a particular host/directory:

1. Enter the URL pattern and port for the host you want to see.
2. In the drop-down menu, select the state of the URLs that you want to see, or leave the selection at Any state.
3. Click the Show URLs button.
4. Click directory names to drill down to file level details.

At the file level, URL information includes:

● the link to the page
● a link to the cached version
● the PageRank
● a link to the list of other public pages that link to the page
● a link to the list of all (public and secure) pages that link to the page (login required; available with installed security package)
● the number of links on this page to crawled pages
● a list of crawled pages that link to this page
● the collection that serves the page

The Export all pages to a file button lets you transfer the information to a .csv file that you can open in Microsoft Excel. To export pages to a file:

1. Enter the URL and port for the host you want to see.
2. In the drop-down menu, select the state of the URLs that you want to see, or leave the selection at Any state. (The generated file, however, may be quite large with this setting.)
3. Display the URL information you want to export, by clicking one of the links View All, Successful, Errors, Excluded. (Again, View All might generate a large file to download.)
4. Click the Export All Pages to a File button. The File Download wizard opens.
5. Click Save and browse to a location where you want to save the file. The file name offered describes the collection in this format:

CrawlDiagnostics_<collection_name>_<host_name_port>_.csv

Status and Reports > Serving Status

A search index is created when the Google Search Appliance starts to crawl your URLs. This index immediately starts serving results to your end users. The Serving Status page displays how many queries per second the Google Search Appliance is receiving.

If you have more than one collection, you can select in the drop-down menu the front end for the collection for which to display the serving status. Click Go to see the serving status of the collection you selected.

If you see a problem with your index, you can click the Index Rollback link to select an earlier index to serve results to your users.

The graph shows a global summary of recent queries per second. The x-axis represents two-hour segments on Universal Military Time (UMT); the y-axis represents queries received (per second). If the search query rate is very low, the y axis will be labeled in units "m" for "milliqueries," indicating 1/1000 of a query per second.
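As a rough illustration of the "m" unit, the following Python sketch converts a query rate into the kind of label described above. The function name is hypothetical, and the cutoff of one query per second for switching to milliqueries is an assumption, since the text only says the unit is used when the rate is "very low".

```python
def format_query_rate(qps):
    """Label a queries-per-second value, using 'm' (milliqueries) for low rates.

    Assumption: rates below 1 query per second are shown in milliqueries.
    """
    if qps < 0:
        raise ValueError("query rate cannot be negative")
    if 0 < qps < 1:
        # 1 m = 1/1000 of a query per second
        return f"{qps * 1000:g} m"
    return f"{qps:g}"


label = format_query_rate(0.005)  # 0.005 queries per second is 5 milliqueries
```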

You can test your search index by clicking the Test Center link in the horizontal blue bar at the top right of the page. This link takes you directly to the search page where you can run sample queries to test results. The Test Center link appears on all Admin Console pages.

Status and Reports > System Status

The System Status page is available both from the Home page and through a link under Status and Reports. The page monitors the available disk space, the temperature of the components, and the status of the computers that make up the Google Search Appliance.

Disk Status

Disk refers to the capacity of the Google Search Appliance disk drives. If the drives become full, performance can be affected.

● If the status shows OK along with the green button, everything is running normally.
● If the status shows Caution with the yellow button, the drives are approximately half full. No action is needed.
● However, if the status shows Warning with the red button, you should notify Google support.

Temperature Status

Temperature refers to the CPU temperature and the speed of the cooling fans inside the Google Search Appliance.

● If the status shows OK along with the green button, everything is fine.
● If the status shows Caution with the yellow button, the CPU temperature is higher than normal and/or the speed of the cooling fans is lower than normal. However, the CPU temperature and fan speeds are still within standard operating levels. The appliance will be able to operate at yellow levels without any increased risk of failure. You should check to see if there is a problem in the area where the Google Search Appliance is housed, such as a blockage in airflow. It is not necessary to report this status to Google.
● If the status shows Warning with the red button, you should notify Google support.

Machine Status

Machine refers to the number of machines in the cluster that are having problems. It is not used in GB-1001 appliances.

● If the status shows OK along with the green button, everything is fine.
● If the status shows Caution with the yellow button, at least one machine is experiencing a problem and may be offline. You should check the machine with the listed GID.
● If the status shows Warning with the red button, you should notify Google support.

Status and Reports > Search Reports

Search reports, a summary of each collection, provide a synopsis of your query logs. A search report is automatically generated every day. The latest one is listed in the Report for collection name page. You can also generate a search report at any time. When you generate a summary, it becomes the most up-to-date report and its results are listed, by current date, in the Report for collection name page.

The reports are maintained by the Google Search Appliance for 30 days.

Note: Any queries done through the user interface are counted in the logs. For example, clicking the Test Center link constitutes a query.

The following data is provided in the reports:

● Total Results Pages The number of result pages seen by users for the report period. This includes both search results and non-search results, such as requests of cached pages. This value includes every result page viewed.

● Total Searches The total number of search result pages seen by users. If a user performs a search and then selects "next" to see a second page, that counts as two searches.

● Distinct Searches The number of times users submitted a specific search. Distinct Searches only include the first page where the user typed in a search but not subsequent pages for the same query.

● Number of Searches per Day

● Average Number of Searches Per Hour

● Top 100 Keywords and number of occurrences for each keyword

● Top 100 Queries and number of occurrences for each query

In addition, the following can be derived from the reports:

● Average Result Sets per Query The ratio of Total Searches to Distinct Searches equals your Average Result Sets per Query. This represents, on average, how many pages of search results a user views for each search he or she does.
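The derived value is a simple ratio, sketched here in Python. The function name and the example counts are hypothetical and are not taken from a real report.

```python
def average_result_sets_per_query(total_searches, distinct_searches):
    """Return Total Searches divided by Distinct Searches."""
    if distinct_searches <= 0:
        raise ValueError("distinct_searches must be positive")
    return total_searches / distinct_searches


# Hypothetical report: 300 result pages viewed for 120 submitted searches.
average = average_result_sets_per_query(300, 120)
```

With these example counts, users viewed 2.5 pages of results per search on average.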

Status and Reports > Search Log

Use the Search Log to find statistics on searches that have occurred on your index. The log's format is a simple extension of the Common Log Format (CLF). The file contains a separate line for each request. A line is composed of several tokens separated by spaces:

host -- [date time] request status bytes results time

● Host The IP address of the client.

● Date The date and time of the request, in the following format:

❍ date = [day/month/year:hour:minute:second zone]
❍ day = 2*digit
❍ month = 3*letter
❍ year = 4*digit
❍ hour = 2*digit
❍ minute = 2*digit
❍ second = 2*digit
❍ zone = (`+' | `-') 4*digit

● Request The request line from the client, enclosed in double quotes (").

● Status The three-digit status code returned to the client.

● Bytes The number of bytes returned to the client.

● Results The number of search results returned to the client.

● Time The total time (in seconds) spent fulfilling this request.
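The token layout above lends itself to a small parser. The following Python sketch is one way to split an exported log line into its fields, assuming the tokens appear exactly in the order listed. The sample line is invented for illustration, and the pattern accepts either "--" or "- -" after the host because the exact spacing in a real export is an assumption here.

```python
import re
from datetime import datetime

# host -- [date time] request status bytes results time
LOG_LINE = re.compile(
    r'^(?P<host>\S+)\s+-\s*-\s+'          # client IP, then the two dashes
    r'\[(?P<date>[^\]]+)\]\s+'            # [day/month/year:hour:minute:second zone]
    r'"(?P<request>[^"]*)"\s+'            # request line in double quotes
    r'(?P<status>\d{3})\s+'               # three-digit status code
    r'(?P<bytes>\d+)\s+'                  # bytes returned
    r'(?P<results>\d+)\s+'                # search results returned
    r'(?P<time>\d+(?:\.\d+)?)\s*$'        # seconds spent on the request
)


def parse_search_log_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_LINE.match(line)
    if match is None:
        return None
    return {
        "host": match["host"],
        "when": datetime.strptime(match["date"], "%d/%b/%Y:%H:%M:%S %z"),
        "request": match["request"],
        "status": int(match["status"]),
        "bytes": int(match["bytes"]),
        "results": int(match["results"]),
        "seconds": float(match["time"]),
    }


# Hypothetical line, not copied from a real appliance log.
sample = '10.0.0.5 - - [14/Mar/2005:09:15:02 -0800] "GET /search?q=widgets HTTP/1.1" 200 5120 10 0.04'
record = parse_search_log_line(sample)
```

Returning None for lines that do not match lets a caller skip unexpected lines rather than stop on them.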

You can search through the logs for any lines containing specific strings. You can also export the logs to a file that can be opened with any Apache analysis software.

The logs are maintained by the Google Search Appliance for one year.

To export a log file:

1. Click the Status and Reports link, and then click the Search Logs link.
2. Select the day, month or date range for the log you want to see. After a brief wait, the log appears.
3. Click the Export to File button. A File Download wizard opens. You can click Open to view the file, if you wish.
4. To export the file as a text document, click Save and enter a location in the Save In field.
5. Navigate to the location where you saved the file and open it.

Many administrators want to generate their own reports. The logs can be automatically exported in a standard format to your syslog server. To see the Syslog Protocol Reference, go to Syslog Reports.

You can then run the log analysis software of your choice to customize the reports as you wish.

Status and Reports > Event Log

In addition to the optional syslog, a system event log is running.

The Event Log is an audit trail of all system activity. Here is a partial list of the type of information that is included in the log:

● Logins and logouts of users

● Date and time of crawling (when the crawl was paused and resumed)

● Creations of collections and front ends

● Serving index rollback time, if one occurred

● Date and time of system password change

You can page through a long log, using Previous and Next links. At the top of the log, you can see what line you are on and the total number of lines. You can also enter text in the Show Lines Containing box to view particular events. In the Jump to Line box, enter a line number and click Jump to quickly go to a particular line.

From the log, you can export the contents to a file by clicking the Export to File button. The File Download wizard opens, so you can either open the log file as a text file and then save it to your preferred location, or download it as a text file to a location you select. The log filename defaults to

[y-m-d]-web_log-[machine_name].log

Serving > Front Ends

You use the Serving > Front Ends pages to change the look and feel of the search and search result pages your users access. You can customize these pages to display your organization's colors, fonts, and design. If you have more than one collection, you can make each collection's front end appear in a different format, and have its own keymatches, synonyms, and filters.

To look at the default search and search result pages, click the Test Center link. A new browser window opens and displays the default front end, which has the Google logo, a search box, a Search button, and Advanced Search and Search Tips links. You can use this default front end if you like, or make changes to it.

The Edit link next to the default_frontend (or any front end name) gives you access to these front end options:

● Output Format - add a logo, change fonts, colors, and so on
● KeyMatch - force results to display during selected word and phrase matches
● Synonyms - identify terms that mean the same as likely search terms
● Filters - restrict search results by domain, language, file type, or meta tag values
● Remove URLs - list URLs to ignore for this front end

In the Output Format tab, the Page Layout Helper and the XSLT Stylesheet Editor allow you to do as much or as little as you want to affect the look of your search and search result pages.

If your license permits it, you can create more front ends with different looks, changing the look and feel in small ways or in a major way.

Note: You are not required to display "Powered by Google" unless you want to.

To create a new front end (if your license permits more than one):

1. Click Serving and then Front Ends.
2. In the Front End Name text box, enter a name for the new front end.
3. Click the Create New Front End button.
4. Click the Edit link next to the new front end name in the list of front ends.
5. Make changes to the new front end using the Output Format, KeyMatch, Synonyms, Filters, and Remove URLs tabs.
6. Use the Test Center link to view your changes to the front end.

To edit any front end:

1. Click Serving and then click Front Ends.
2. Click the Edit link next to the front end you want to edit.
3. Make changes to the front end using the Output Format, KeyMatch, Synonyms, Filters, and Remove URLs tabs.
4. Use the Test Center link to view your changes to the front end.

Serving > Front Ends > Output Format

You can change the look and feel of your search and search results pages by using the Page Layout Helper and by editing the stylesheet in the XSLT Stylesheet Editor. The Page Layout Helper lets you easily make changes to global attributes (logo, fonts, header, and footer), and to the look of the Search Box and Search Results.

There are several pages and their components that you can change to show your company's look and feel.

● Starting or "front end" search page
● Search results page
● Advanced Search page
● Cached page (header only)

Stylesheet Changes

You may want to make only simple changes, such as replacing the Google logo with your company logo, changing some global components on your pages, and making the search and result pages uniquely suited to your company. To make this type of change, use the Page Layout Helper.

Then you may want to make more extensive changes, so that the only Google part left is the search technology itself. After you make the changes you want in the Page Layout Helper and save them, you can make further changes, if you wish, in the XSLT Stylesheet Editor. The Stylesheet contains sections for various components, preceded by comments so that you know whether a section can be customized.

Important: We suggest that you preview the results of your changes in small steps and avoid saving changes until you are completely satisfied. As soon as you click the Save XSLT Code button, it updates the corresponding front end. The pages being served immediately display your changes. The preview pages are designed only for showing the output format. The links and buttons in the preview page do not function as real searches. To see the actual search or results pages, use the Test Center link in the blue bar at the top right of the page.

To return the Stylesheet to its original state, click the Restore Default button. However, restoring the defaults removes both changes to the XSLT Stylesheet and the changes you made in the Page Layout Helper.

Here is the recommended sequence:

1. Make changes using the Page Layout Helper.
2. Click Preview to view each change you make. A new browser window opens with each preview, so you may want to close the window each time you return to the Page Layout Helper. The Preview button lets you look at each change you make before you move on. It does not save your changes, however.
3. Continue making changes in the Page Layout Helper and previewing them.
4. When finished, click the Save Page Layout Code button.
5. Click the Export button to save the XSLT Stylesheet as a backup.

If you are satisfied with the page layout of your search and results pages, you can go to step 12 now. If you want to make more changes, go on to step 6. If you edit the XSLT Stylesheet, those edits are made in addition to the Page Layout changes. Be aware that you cannot return to the Page Layout section after editing the Stylesheet itself.

6. Click the Edit underlying XSLT code link. (The code now contains your Page Layout changes from using the Helper.)
7. Following the commented instructions, make the changes you want.
8. Click Preview to see your changes. A new browser window opens with each preview, so you may want to close the window each time you return to the Stylesheet.
9. Continue making changes in the XSLT code and previewing the changes.
10. When finished and ready to serve the changed pages, click the Save XSLT Code button.
11. Click the Export button to save the XSLT Stylesheet as a backup.
12. Test your changed search and results pages by using the Test Center link at the upper right of the page. Click the links and do some searches to make sure the pages look the way you want them to. If you are serving a collection, your changed pages are immediately served to your users.

Note: Later, you can use the Import button to use your edited Stylesheet to make further changes.

Stylesheet Language Options

You can have your users' search page and search results pages in a language other than English, the default. You also can have several languages active for your users and the Google Search Appliance will present search results for an active language based on the settings detected in the end user's computer.

The Google Search Appliance allows multiple stylesheets that present the search page, advanced search, and results pages in different languages, all associated with a single front end. The language-specific stylesheet is selected based on the Accept-language header sent from the user's browser, or the hl= query option. The stylesheet is selected from the set of languages marked "active"; if there is no match, the default language is used. A language-specific stylesheet is created when you make a language active. Each language's stylesheet can be edited and customized independently.
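The selection logic described above can be sketched in Python. This is an illustration of the idea only, not the appliance's implementation. The function names are hypothetical, and two behaviors are assumptions that the text does not state: that the hl= option takes precedence over the Accept-language header, and that a regional tag such as fr-CA falls back to its primary language fr.

```python
def parse_accept_language(header):
    """Return the language tags from an Accept-Language header, most preferred first."""
    ranked = []
    for position, part in enumerate(header.split(",")):
        piece = part.strip()
        if not piece:
            continue
        tag, _, params = piece.partition(";")
        quality = 1.0  # a tag without a q-value has quality 1
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0
        if quality > 0:
            ranked.append((-quality, position, tag.strip().lower()))
    return [tag for _, _, tag in sorted(ranked)]


def choose_stylesheet_language(active, default, hl=None, accept_language=""):
    """Pick an active language for the request, or the default if nothing matches."""
    active_lower = {lang.lower() for lang in active}
    # Assumption: an explicit hl= value is tried before the browser header.
    candidates = ([hl.lower()] if hl else []) + parse_accept_language(accept_language)
    for tag in candidates:
        if tag in active_lower:
            return tag
        primary = tag.split("-")[0]  # assumption: fr-CA may be served by fr
        if primary in active_lower:
            return primary
    return default


chosen = choose_stylesheet_language(
    active={"en", "fr", "de"},
    default="en",
    accept_language="fr-CA,fr;q=0.8,en;q=0.5",
)
```

In this example the browser prefers Canadian French, which is not active, so the sketch settles on the active language fr.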

To make a language active using Page Layout Helper:

1. Choose the language to activate in the Language drop-down menu.
2. In the Page Layout Helper, make any format changes you like. (If you are only making a language active, no changes are necessary.)
3. Click the Save Page Layout Code button. In the Language drop-down menu, the language you activated will have the word "(Active)" next to it.

To make a language the default:

1. Choose an active language in the Language drop-down menu.
2. Click the Make this Language the Default button.

To make a language active using the XSLT Stylesheet Editor:

1. Choose the language to activate in the Language drop-down menu.
2. Click the Edit underlying XSLT code link. The Google text in the default stylesheet converts to that language.
3. In the XSLT Stylesheet Editor, make any format changes you like. (If you are only making a language active, no changes are necessary.)
4. Click Save XSLT Code to save as an active stylesheet.

To make a language active for a custom stylesheet using the XSLT Stylesheet Editor:

1. Choose the language to activate in the Language drop-down menu.
2. Click the Edit underlying XSLT code link. The Google text in the default stylesheet converts to that language.
3. In the XSLT Stylesheet Editor, make any format changes you like. (If you are only making a language active, no changes are necessary.)
4. Click Save XSLT Code to save as an active stylesheet.

To make languages active for a set of custom stylesheets using the XSLT Stylesheet Editor:

1. Customize the default language's stylesheet as desired using the Page Layout Helper or the XSLT Stylesheet Editor.
2. For each additional language, choose the language in the Language drop-down menu.
3. Click the Edit underlying XSLT code link. The Google text in the default stylesheet converts to that language.
4. Click the Make this Language the Default button.
5. Perform any language-specific customizations in the XSLT Stylesheet Editor. (If you are only making a language active, no changes are necessary.)
6. Click Save XSLT Code to save as an active stylesheet.

To remove an active language and disable the stylesheet for that language:

1. Choose the language in the Language drop-down menu.
2. Click the Remove this Language button. The stylesheet is no longer available in that language.

Serving > Front Ends > Output Format - Page Layout Helper

The Page Layout Helper lets you easily make changes to global attributes (logo, fonts, header, and footer), and to the look of the Search Box and Search Results.

The Page Layout Helper has three sections:

● Global Attributes
● Search Box
● Search Results

You can have one section open, two, or all three at the same time. Click the right arrow in front of the section you want to open. When you click the Save Page Layout Code button, the changes you made in any open section are saved to the Stylesheet. All changes are optional.

Global Attributes

In the Global Attributes section, you can quickly put your logo on the pages, specify the fonts to use, and add the HTML header and the HTML footer code used on your web site. The Preview button opens a browser window to let you see the actual look of each page with each change as you make it, but the changes are not saved until you click the Save Page Layout Code button. A new window opens each time you click Preview. You can close these windows as you finish looking at them.

To change the Global attributes:

1. Click the right arrow next to Global Attributes to display the contents.
2. Enter the location and name of your company logo. You may have to type the complete URL.
3. Enter the width and height in pixels of your logo image.
4. Click the Preview button. A browser window opens to show your change. Close the browser window.
5. Enter the name of the font family that your web site uses, such as "Times Roman,serif." The font face is case insensitive. If you enter a font that is not recognized, it defaults to "Times." (Continue previewing each change.)
6. Paste your web site's header code in the Header area.
7. Paste your web site's footer code in the Footer area.
8. When finished, click the Save Page Layout Code button.

Search Box

In the Search Box section, you can make changes to the Search text box and button and to the language and encoding, and you can select which collections are available to your users to search.

To change the Search Box attributes:

1. Click Search Box to display the contents.
2. To lengthen or shorten the Search Box from 32 characters, type another number.
3. Click the Preview button. A browser window opens to show your change. Close the browser window.
4. To replace the phrase Google Search on the button, type another word in the Use Text box. To use another image to replace the gray rectangular button, click the Use Image option and enter the complete URL to the image. (Continue previewing each change.)
5. Click the Collections checkbox to display a menu of your collections so that your users can select which one to search.
6. If you purchased the secure search package, the Secure Search option is enabled, letting your users choose to search over public documents or both public and secure documents. Click the checkbox to disable the display of the Secure Search option.
7. When finished, click the Save Page Layout Code button.

Search Results

As you select check boxes in each Search Results area, the sample page on the right shows your changes dynamically (for some browsers). Other browsers display a Quick Preview button.

To change the Search Results attributes:

1. Click the right arrow next to Search Results to display the contents.
2. Click the check boxes to show or hide the named component.
3. Review the changes in the dynamic window (or by clicking Quick Preview, depending on your browser).
4. Select option buttons to choose the style of the components.
5. When finished, click the Save Page Layout Code button.

Note: After you finish making and saving changes in the Page Layout Helper, you can, if you wish, make further changes in the XSLT Stylesheet Editor. You must make all Page Layout changes in the boxes provided before editing the Stylesheet directly. These changes are saved in the Stylesheet when you click Save Page Layout Code.

You cannot go back to the Page Layout Helper after you manually edit the Stylesheet, unless you start over completely by clicking the Restore Default button. Restore Default does not restore your Page Layout Helper changes, but rather it restores the stylesheet provided with the software.

If you want to do more editing, you can do so in the XSLT Stylesheet Editor.

Caution: Remember that you cannot go back to the Page Layout Helper after you work in the XSLT Stylesheet Editor unless you restore the default stylesheet.

Serving > Front Ends > Output Format - XSLT Stylesheet Editor

After you finish making and saving changes in the Page Layout Helper, you can, if you wish, make further changes in the XSLT Stylesheet Editor. You must make all Page Layout changes in the boxes provided before editing the Stylesheet directly. These changes are saved in the Stylesheet when you click Save Page Layout Code.

You cannot go back to the Page Layout Helper after you manually edit the Stylesheet, unless you start over completely by clicking the Restore Default button.

You can export the Stylesheet to work on in another location, then import it when you are satisfied with the changes. You can then preview the changes you've made.

Making Changes in the XSLT Stylesheet Editor

Here are the sections of the Stylesheet that you can work in to make changes. Click the links for more details on each one.

● Global Style Variables
● Additional Results Page Components
● Result Navigation and Separation Bars
● Result Elements
● Templates
● Other Variables

To view or edit the XSLT stylesheet:

1. Under Serving, click the Output Format link. Scroll down to see the XSLT Stylesheet Editor.
2. Click the Edit Underlying XSLT Code link.
3. Enter any changes to the stylesheet and click the Preview button to review the changes.
4. Click your browser's Back button to return to the Admin Console.
5. Correct errors or make more changes.
6. When finished, click the Save XSLT Code button.

To export the XSLT stylesheet:

1. On the Output Format page, scroll down to see the XSLT Stylesheet Editor.
2. Click the Edit Underlying XSLT Code link.
3. Click the Export button.
4. In the File Download wizard, click OK. Then navigate to a location for the file.

To import an edited XSLT stylesheet:

1. On the Output Format page, scroll down to see the XSLT Stylesheet Editor.
2. Click the Edit Underlying XSLT Code link.
3. Enter the filename of the edited Stylesheet in the Import Stylesheet box, or browse for the file. Click the Import button. A confirmation message warns that this will overwrite your page layout settings.
4. Click OK. The edited Stylesheet displays and is validated. Errors found during validation are displayed in red.
5. Fix errors in the file and repeat these Import steps.
6. When finished, click the Save XSLT Code button.

To restore the default XSLT stylesheet:

1. On the Output Format page, scroll down to see the XSLT Stylesheet Editor.
2. Click the Edit Underlying XSLT Code link.
3. Click the Restore Default button. A confirmation message warns that this will overwrite your page layout settings.
4. Click OK. The original Stylesheet is restored.

The Google XML Reference, Version 4.0, section on HTML via XSLT provides more information.

Serving > Front Ends > Output Format - XSLT Stylesheet Editor

Most of the changes you can make to the XSLT Stylesheet are described in the Global Style Variables and the Additional Results Page Components.

Global Style Variables

In the Global Style Variables section of the Stylesheet, you can change the default font and font size, background color, the color of text used for the regular (default) text, and colors of text links.

Important: Any changes made in the XSLT Stylesheet's raw code will overwrite changes made in the Page Layout Helper.

<!-- **********************************************************************
 Logo setup (can be customized)
   - whether to show logo: 0 for FALSE, 1 (or non-zero) for TRUE
   - logo url
   - logo size: '' for default image size
     ********************************************************************** -->
<xsl:variable name="show_logo">1</xsl:variable>
<xsl:variable name="logo_url">images/Title_Left.gif</xsl:variable>
<xsl:variable name="logo_width">200</xsl:variable>
<xsl:variable name="logo_height">78</xsl:variable>

<!-- **********************************************************************
 Global Style variables (can be customized): '' for using browser's default
     ********************************************************************** -->
<xsl:variable name="global_font">arial,sans-serif</xsl:variable>
<xsl:variable name="global_font_size"></xsl:variable>
<xsl:variable name="global_bg_color">#ffffff</xsl:variable>
<xsl:variable name="global_text_color">#000000</xsl:variable>
<xsl:variable name="global_link_color">#0000cc</xsl:variable>
<xsl:variable name="global_vlink_color">#551a8b</xsl:variable>
<xsl:variable name="global_alink_color">#ff0000</xsl:variable>

Note: Be careful to replace only the text between the tags. Take care not to delete the "<" and the ">" characters that surround the tags.
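For administrators who export the stylesheet and edit it offline, the idea of "replace only the text between the tags" can be sketched in Python with the standard xml.etree.ElementTree module. The function name and the two-variable fragment below are hypothetical stand-ins for a real exported stylesheet, and the sketch assumes the export is well-formed XML that uses the standard XSLT namespace.

```python
import xml.etree.ElementTree as ET

XSL_NS = "http://www.w3.org/1999/XSL/Transform"
ET.register_namespace("xsl", XSL_NS)  # keep the xsl: prefix when writing the XML back out


def set_style_variables(stylesheet_xml, new_values):
    """Replace the text of the named xsl:variable elements, leaving the tags untouched."""
    root = ET.fromstring(stylesheet_xml)
    for variable in root.iter(f"{{{XSL_NS}}}variable"):
        name = variable.get("name")
        if name in new_values:
            variable.text = new_values[name]
    return ET.tostring(root, encoding="unicode")


# Hypothetical fragment standing in for an exported stylesheet.
FRAGMENT = """<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:variable name="global_bg_color">#ffffff</xsl:variable>
  <xsl:variable name="global_text_color">#000000</xsl:variable>
</xsl:stylesheet>"""

updated = set_style_variables(FRAGMENT, {"global_bg_color": "#f0f4f8"})
```

Note that parsing with ElementTree in this way discards XML comments, including the section comments that say which parts can be customized, so treat the sketch as an illustration of the edit rather than a tool for rewriting the full stylesheet.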

You can change colors for five different elements.

● global_bg_color - the background color of the page. The default is white (#ffffff). If your site has a distinctive color, you may want to use it on the Search and Results pages.
● global_text_color - the color of regular text. The default is black (#000000).
● global_link_color - the color of text that serves as a link to another page. In the listing above, the default is blue (#0000cc).
● global_vlink_color - the color of text that indicates a link has been visited. In the listing above, the default is purple (#551a8b).
● global_alink_color - the color that text changes to as it is clicked and before it changes to the visited link color. The default is red (#ff0000). You can keep the vlink and alink colors as they are or change one or both.

Colors are indicated by hexadecimal notation, a set of alphanumeric characters that in varying combinations tell a browser what color to display. The notation always begins with a # sign.

Change the Background Color

To change the background color:

1. Select ffffff in the Stylesheet. The # sign is required, so do not select it.
2. Replace the selection with the hexadecimal notation for the background color you want. You can look at the source code of your web site pages to see the notation used.
3. Click the Preview button.
4. Return to the Stylesheet to continue making changes.
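As a minimal sketch of the steps above, the fragment below shows the background color variable after the edit. The value #f5f5dc is only an illustrative assumption; substitute the color your own site uses.

```xml
<!-- Before: <xsl:variable name="global_bg_color">#ffffff</xsl:variable> -->

<!-- After: only the six hexadecimal digits were replaced. The # sign and the
     surrounding tags are untouched. #f5f5dc is an example value. -->
<xsl:variable name="global_bg_color">#f5f5dc</xsl:variable>
```

This fragment belongs in the Global Style variables section of the existing Stylesheet; it is not a complete stylesheet on its own.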

Change the Text Colors

All of the text colors (text, link, vlink, and alink) are changed in the same way.

To change a text color:

1. Select the hexadecimal code in the Stylesheet for the color you want to change.
2. Replace the selection with the hexadecimal notation for the text color you want. You can look at the source code of your web site pages to see the notation used for various text elements.
3. Click the Preview button.
4. To see a vlink and an alink color, click Advanced Search on the Search page. Click Back to return to the Search page.
5. Return to the Stylesheet to continue making changes.

Note: If you have a problem with a color change, check to make sure the # mark is at the beginning of the hexadecimal notation.
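For illustration, the four text color variables might look like this after editing. All four color values are hypothetical examples, not recommended settings; each keeps its leading # sign.

```xml
<!-- Example values only: choose colors that match your own site -->
<xsl:variable name="global_text_color">#333333</xsl:variable>
<xsl:variable name="global_link_color">#003399</xsl:variable>
<xsl:variable name="global_vlink_color">#663366</xsl:variable>
<xsl:variable name="global_alink_color">#cc0000</xsl:variable>
```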

Additional Results Page Components

If you require more customization than changing only the Global Style Variables, you can make further changes on the Results page.

In the Results Page Components section of the Stylesheet, you can choose whether to display search boxes, separation bars, and navigation bars. You can also choose whether to display spelling suggestions, synonym suggestions, and keymatch suggestions. In some cases, you can also change the size, the background color, the text color, and the anchor text of these components.

For the Result page header, you can select:

● provided - the header that Google provides, as seen on Google's search results pages
● mine - the header that you wish to have on the results pages
● both - your header on the page with the Google header under it

If you choose the "provided" result page header, you can still make changes to the logo, the advanced link and its text, and the search tips (help) link and its text.

Important: Any changes made in the XSLT Stylesheet's raw code will overwrite changes made in the Page Layout Helper.

<!-- ********************************************************************** Results page components (can be customized) - whether to show a component: 0 for FALSE, non-zero (e.g., 1) for TRUE - text and style*************************************************************************** -->

<!-- *** choose result page header: '', 'provided', 'mine', or 'both' *** -->

<xsl:variable name="choose_result_page_header">both</xsl:variable>

<!-- *** customize provided result page header *** -->

<xsl:variable name="show_result_page_adv_link">1</xsl:variable>

<xsl:variable name="adv_search_anchor_text">Advanced Search</xsl:variable>

<xsl:variable name="show_result_page_help_link">1</xsl:variable>

<xsl:variable name="search_help_anchor_text">Search Tips</xsl:variable>

<!-- *** search boxes *** -->

<xsl:variable name="show_top_search_box">1</xsl:variable>

<xsl:variable name="show_bottom_search_box">1</xsl:variable>

<xsl:variable name="search_box_size">32</xsl:variable>

<!-- *** choose search button type: 'text' or 'image' *** -->

<xsl:variable name="choose_search_button">text</xsl:variable>

<xsl:variable name="search_button_text">Google Search</xsl:variable>

<xsl:variable name="search_button_image_url"></xsl:variable>

<!-- *** search info bars *** -->

<xsl:variable name="show_search_info">1</xsl:variable>

<!--*** choose separation bar 'blue', 'line', 'nothing' *** -->

<xsl:variable name="choose_sep_bar">blue</xsl:variable>

<!-- *** navigation bars: '', 'google', 'link', or 'simple'*** -->

<xsl:variable name="show_top_navigation">1</xsl:variable>

<xsl:variable name="choose_bottom_navigation">google</xsl:variable>
<xsl:variable name="my_nav_align">right</xsl:variable>
<xsl:variable name="my_nav_size">-1</xsl:variable>
<xsl:variable name="my_nav_color">#6f6f6f</xsl:variable>

<!-- *** sort by date/relevance *** -->

<xsl:variable name="show_sort_by">1</xsl:variable>

<!-- *** spelling suggestions *** -->
<xsl:variable name="show_spelling">1</xsl:variable>
<xsl:variable name="spelling_text">Did you mean:</xsl:variable>
<xsl:variable name="spelling_text_color">#cc0000</xsl:variable>

<!-- *** synonyms suggestions *** -->
<xsl:variable name="show_synonyms">1</xsl:variable>
<xsl:variable name="synonyms_text">You could also try:</xsl:variable>
<xsl:variable name="synonyms_text_color">#cc0000</xsl:variable>

<!-- *** keymatch suggestions *** -->
<xsl:variable name="show_keymatch">1</xsl:variable>
<xsl:variable name="keymatch_text">KeyMatch</xsl:variable>
<xsl:variable name="keymatch_text_color">#2255aa</xsl:variable>
<xsl:variable name="keymatch_bg_color">#e8e8ff</xsl:variable>

<!-- *** category information *** -->
<xsl:variable name="show_category">1</xsl:variable>
<xsl:variable name="category_text_color">#808080</xsl:variable>

Note: Be careful to replace only the text between the tags. Take care not to delete the "<" and the ">" characters that surround the tags.

Changing a Box, Bar, or Suggestion

On the search result page, you can have the search text boxes on the top of the page or on the bottom of the page. The default is to have them in both places. The same is true of the Separation bars that set off the search results from the top and bottom components on the page.

You can also choose whether and how the Navigation bars display. In the Navigation Bars section of the Stylesheet, set the "show_top_navigation" variable to 1 to show the Navigation bar at the top of the page or to 0 to hide it. For the Navigation bar at the bottom of the page, replace the value of the "choose_bottom_navigation" variable with one of the following:

● 'google' for a navigation bar like Google's
● 'link' to use results page numbers and "<<Previous" and ">>Next"
● 'simple' to use only "<<Previous" and ">>Next"

In addition, you can change the Navigation bar alignment, size, and color.
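A sketch of one possible Navigation bar configuration follows, using the variable names from the Stylesheet listing above. The chosen values are examples; in particular, "center" for my_nav_align is an assumption that the variable accepts the usual HTML alignment values.

```xml
<!-- *** navigation bars: '', 'google', 'link', or 'simple' *** -->
<!-- Hide the top bar; show only Previous/Next links at the bottom -->
<xsl:variable name="show_top_navigation">0</xsl:variable>
<xsl:variable name="choose_bottom_navigation">simple</xsl:variable>
<!-- Alignment value is an assumption; size and color keep their defaults -->
<xsl:variable name="my_nav_align">center</xsl:variable>
<xsl:variable name="my_nav_size">-1</xsl:variable>
<xsl:variable name="my_nav_color">#6f6f6f</xsl:variable>
```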

On the search result page, you can affect the appearance of Spelling suggestions, Synonym suggestions, and KeyMatch suggestions. You can choose whether the suggestions display, change their associated text, and change their text colors.

Not all browsers display text the same way. You will want to test your results on all common browsers.

To remove a box, bar, suggestion, or category information:

1. Find the component you want to remove. It will have a "1" before the text: </xsl:variable>. 2. Select the 1 and type a 0 to replace it. 3. Click the Preview button. 4. Return to the Stylesheet to continue making changes.

To redisplay a box, bar, suggestion, or category information that has been removed:

1. Find the component you want to display. It will have a "0" before the text: </xsl:variable>. 2. Select the 0 and type a 1 to replace it. 3. Click the Preview button. 4. Return to the Stylesheet to continue making changes.
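For example, assuming you want to hide the bottom search box and the synonym suggestions while leaving everything else displayed, the two affected variables would read as follows. Changing each 0 back to 1 redisplays the component.

```xml
<!-- 0 hides the component; 1 (or any non-zero value) shows it -->
<xsl:variable name="show_bottom_search_box">0</xsl:variable>
<xsl:variable name="show_synonyms">0</xsl:variable>
```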

Changing the Names of Buttons

Replace the name of the Search button by selecting "Google Search" (the default value of the "search_button_text" variable) and typing a new name. You can also change the names of the links "Advanced Search" and "Search Tips" in the same way.
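As an illustration, the three variables that hold the button and link names could be edited like this. The replacement names are hypothetical; only the text between the tags changes.

```xml
<!-- Example names only -->
<xsl:variable name="adv_search_anchor_text">More Search Options</xsl:variable>
<xsl:variable name="search_help_anchor_text">Help</xsl:variable>
<xsl:variable name="search_button_text">Search Intranet</xsl:variable>
```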

Serving > Front Ends > Output Format - XSLT Stylesheet Editor

Result Elements

On the search result page, you can affect elements such as the title of the page found, its snippet (a sentence or two from the page), the look of the word you searched for (Keyword Match), the Link URL (of the page found), and some miscellaneous elements (the result description, size, date, and "cached" link). You can choose whether to display these elements and change their text color and size.

Important: Any changes made in the XSLT Stylesheet's raw code will overwrite changes made in the Page Layout Helper.

<!-- ********************************************************************** Result elements (can be customized) - whether to show an element ('1' for yes, '0' for no) - font/size/color ('' for using style of the context)********************************************************************** -->

You change the text color by replacing its hexadecimal notation. The font size of text changes relative to the base font set in the Global Variables fonts section. A value of -1 changes the font to one size smaller, a value of +2 changes the title to two sizes larger. The values use a scale of 1 to 7, but not all browsers support all the sizes.

Result Title and Snippet

The changes you can make to the result title and snippet include whether to display them (or display one and not the other), and changing the color and size of the title. The title variable is blank; you can leave it blank for the title to use the browser's size. To increase or decrease its size, enter a +1 or -1, respectively.

<!-- *** result title and snippet *** -->
<xsl:variable name="show_res_title">1</xsl:variable>
<xsl:variable name="res_title_color">#0000cc</xsl:variable>
<xsl:variable name="res_title_size"></xsl:variable>
<xsl:variable name="show_res_snippet">1</xsl:variable>
<xsl:variable name="res_snippet_size">80%</xsl:variable>
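A short sketch of a customized title and snippet section: the title is made one size larger than the base font and the snippet is hidden. Both choices are examples only.

```xml
<!-- *** result title and snippet *** -->
<xsl:variable name="show_res_title">1</xsl:variable>
<xsl:variable name="res_title_color">#0000cc</xsl:variable>
<!-- +1 makes the title one size larger; leave empty for the browser's size -->
<xsl:variable name="res_title_size">+1</xsl:variable>
<!-- 0 hides the snippet under each result -->
<xsl:variable name="show_res_snippet">0</xsl:variable>
<xsl:variable name="res_snippet_size">80%</xsl:variable>
```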

Keyword Match

The keyword match (the word or phrase that matches the text searched for) can have its color, size, and weight changed. The weight refers to bold or not bold. You can also replace the "b" with an "i" for italic text or a "u" for underlined text.

<!-- *** keyword match (in title or snippet) *** -->
<xsl:variable name="res_keyword_color"></xsl:variable>
<xsl:variable name="res_keyword_size"></xsl:variable>
<xsl:variable name="res_keyword_format">b</xsl:variable> <!-- 'b' for bold -->
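For instance, to show matched keywords in red italics instead of bold, the section might be edited as below. The color value is a hypothetical example.

```xml
<!-- *** keyword match (in title or snippet) *** -->
<xsl:variable name="res_keyword_color">#cc0000</xsl:variable>
<xsl:variable name="res_keyword_size"></xsl:variable>
<xsl:variable name="res_keyword_format">i</xsl:variable> <!-- 'i' for italic -->
```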

Link URL

The Link URL takes the user to the page found by the search. You can affect the look of the Link URL by changing its color and size, or by not displaying it at all.

<!-- *** link URL *** -->
<xsl:variable name="show_res_url">1</xsl:variable>
<xsl:variable name="res_url_color">#008000</xsl:variable>
<xsl:variable name="res_url_size">-1</xsl:variable>

Misc Elements

The Miscellaneous elements include the display of the results description from the web page, the result page size in kilobytes, the result date (when the page was found, if the associated result date is available), and the result cache.

<!-- *** misc elements *** -->
<xsl:variable name="show_res_description">1</xsl:variable>
<xsl:variable name="show_res_size">1</xsl:variable>
<xsl:variable name="show_res_date">1</xsl:variable>
<xsl:variable name="show_res_cache">1</xsl:variable>

Color of Links for Cache, Similar pages, and Description

To display cached pages links, similar pages links, and page descriptions in another color, edit the color tag in this section using hexadecimal notation. Make sure to choose a color that contrasts enough with the page background so that your users can read the text.

<!-- *** used in result cache link, similar pages link, and description *** --> <xsl:variable name="faint_color">#6f6f6f</xsl:variable>

Secure Results Button

If you have secure content, you can use this part of the Stylesheet to have your Results page display a radio button next to the search results that require authentication to view.

<!-- *** show secure results radio button *** -->

<xsl:variable name="show_secure_radio">0</xsl:variable>

Note: Be careful to replace only the text between the tags. Take care not to delete the "<" and the ">" characters that surround the tags.

Serving > Front Ends > Output Format - XSLT Stylesheet Editor

Templates

In several Template sections that follow Other Variables in the Stylesheet, you can supply your own HTML code, provided it is XML-compatible. You can copy your code into any part of the Stylesheet that is marked "can be customized" and that starts with the line

<xsl:template name="[variable]">

The templates that you use pull variables from other sections of the Stylesheet.

Important: Any changes made in the XSLT Stylesheet's raw code will overwrite changes made in the Page Layout Helper.

Logo Template

This section of the Stylesheet contains the text that appears when you move your mouse over the logo at the top of the Search page, the Results page, and the Advanced Search page. The logo is also a link to the Search page.

<!-- **********************************************************************
 Logo template (can be customized)
********************************************************************** -->
<xsl:template name="logo">
  <a href="{$home_url}"><img src="{$logo_url}" width="{$logo_width}" height="{$logo_height}" alt="Go to Search Home" border="0" /></a>
</xsl:template>

To change the text that displays for the logo:

1. Locate the section above in the Stylesheet.
2. Replace the text "Go to Search Home" with the text you want.
3. Click the Preview button.
4. Return to the Stylesheet to continue making changes.

Note: Be careful to replace only the text between the tags. Take care not to delete the "<" and the ">" characters that surround the tags.
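As a sketch, the Logo template with a new mouse-over text could look like this. The replacement text is a hypothetical example; everything else in the template is unchanged from the default.

```xml
<xsl:template name="logo">
  <!-- Only the value of the alt attribute was edited -->
  <a href="{$home_url}"><img src="{$logo_url}"
      width="{$logo_width}" height="{$logo_height}"
      alt="Return to the search home page" border="0" /></a>
</xsl:template>
```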

Templates for Headers and Footers

You can change the look of headers in these pages:

● Global page header/footer ● Search result page header ● Separation bar variables ● Advanced search page header ● Cached page header

You make the change by carefully typing or pasting in your own XML-compatible HTML code between the opening tag

<xsl:template name="my_page_header">

and the closing tag.

</xsl:template>

Here are the template sections that you use to make changes to the pages listed above.

In the Global page header/footer, the XML-compatible HTML code you enter will affect the header and footer of every search and results page.

<xsl:template name="my_page_header">
  <!-- *** replace the following with your own xhtml code or replace the text
   between the xsl:text tags with html escaped html code *** -->
  <xsl:text disable-output-escaping="yes"> <!-- Please enter html code below. --></xsl:text>
</xsl:template>

<xsl:template name="my_page_footer">
  <span class="p">
    <xsl:text disable-output-escaping="yes"> <!-- Please enter html code below. --></xsl:text>
  </span>
</xsl:template>
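The following is a minimal sketch of a customized global header, assuming you replace the xsl:text placeholder with your own markup. The banner text, link target, and class name are hypothetical. The point it illustrates is XML compatibility: every tag is closed (note the br tag) and every attribute value is quoted.

```xml
<xsl:template name="my_page_header">
  <!-- Hypothetical company banner: the link target, text, and class name
       are placeholders. -->
  <div class="banner">
    <a href="http://intranet.example.com/">Example Corp Intranet</a>
    <br/>
  </div>
</xsl:template>
```

HTML that is valid in a browser but not well-formed XML, such as an unclosed br tag or an unquoted attribute value, would prevent the Stylesheet from being parsed.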

In the Search result page header, the XML-compatible HTML code you enter affects the search results page header. You can also change the font size of the header.

The font size of text changes relative to the base font set in the Global Style Variables fonts section. A value of -1 changes the font to one size smaller, a value of +2 changes the title to two sizes larger. The values use a scale of 1 to 7, but not all browsers support all the sizes.

<!-- **********************************************************************
 Search result page header (can be customized): logo and search box
********************************************************************** -->
<xsl:template name="result_page_header">
  <table border="0" cellpadding="2" cellspacing="0">
    <tr>
      <td rowspan="2">
        <xsl:call-template name="logo"/>
        <xsl:call-template name="nbsp3"/>
      </td>
      <td nowrap="1">
        <font size="-1">
          <xsl:call-template name="nbsp3"/>
          <a href="{$adv_search_url}">
            <xsl:value-of select="$adv_search_anchor_text"/>
          </a>
          <xsl:call-template name="nbsp4"/>
          <a href="{$help_url}">
            <xsl:value-of select="$search_help_anchor_text"/>
          </a><br/>
          <xsl:call-template name="nbsp"/>
        </font>
      </td>
    </tr>
    <tr>
      <td valign="middle">
        <xsl:call-template name="search_box"/>
        <br/>
      </td>
    </tr>
  </table>
</xsl:template>

You can change the color and the background color of Separation bars used in advanced search headers and results pages.

<!-- ********************************************************************** Separation bar variables (used in advanced search header and result page) ********************************************************************** -->

<xsl:variable name="sep_bar_bg_color">
  <xsl:choose>
    <xsl:when test="$choose_sep_bar = 'blue'">#3366cc</xsl:when>
    <xsl:otherwise><xsl:value-of select="$global_bg_color"/></xsl:otherwise>
  </xsl:choose>
</xsl:variable>

<xsl:variable name="sep_bar_text_color">
  <xsl:choose>
    <xsl:when test="$choose_sep_bar = 'blue'">#ffffff</xsl:when>
    <xsl:otherwise><xsl:value-of select="$global_text_color"/></xsl:otherwise>
  </xsl:choose>
</xsl:variable>

In the Advanced search page header, you can put whatever you like in the page's header. It will appear under the Global header on your pages.

<!-- **********************************************************************
 Advanced search page header HTML (can be customized)
********************************************************************** -->
<xsl:template name="advanced_search_header">
  <table width="99%" border="0" cellpadding="0" cellspacing="2">
    <tr>
      <xsl:if test="$show_logo != '0'">
        <td rowspan="2" width="1%">
          <table cellpadding="0" cellspacing="0" border="0">
            <tr>
              <td align="right" valign="bottom">
                <xsl:call-template name="logo"/>
              </td>
            </tr>
          </table>
        </td>
      </xsl:if>
      <td valign="bottom" align="right"><font size="-1" class="p"></font></td>
    </tr>
    <tr>
      <td valign="top">
        <table cellspacing="2" cellpadding="2" border="0" width="100%">
          <tr bgcolor="{$sep_bar_bg_color}">
            <td>
              <font face="{$global_font}" color="{$sep_bar_text_color}">
                <b><xsl:call-template name="nbsp"/> <xsl:value-of select="$adv_page_title"/></b>
              </font>
            </td>
          </tr>
        </table>
      </td>
    </tr>
  </table>
</xsl:template>

In the Cached page header, the code you enter affects the header that appears at the top of every cached page.

<!-- **********************************************************************
 Cached page header (can be customized)
********************************************************************** -->
<xsl:template name="cached_page_header">
  <xsl:param name="cached_page_url"/>
  <table border="1" width="100%">
    <tr>
      <td>
        <table border="1" width="100%" cellpadding="10" cellspacing="0" bgcolor="{$global_bg_color}" color="{$global_bg_color}">
          <tr>
            <td>
              <font face="{$global_font}" color="{$global_text_color}" size="-1">
                <xsl:value-of select="$cached_page_header_text"/>
                <a href="{$cached_page_url}"><font color="{$global_link_color}">
                  <xsl:value-of select="$cached_page_url"/></font></a>.<br/>
              </font>
            </td>
          </tr>
        </table>
      </td>
    </tr>
  </table>
  <hr/>
</xsl:template>

Front Door Search Input Page Template

The section labeled "Front door" search input page allows complete customization with your XML-compatible HTML code. Use this section if you want to completely change the look of the search page your users will see.

<!-- **********************************************************************
 "Front door" search input page (can be customized)
********************************************************************** -->
<xsl:template name="front_door">
  <html>
    <xsl:call-template name="langHeadStart"/>
    <title><xsl:value-of select="$front_page_title"/></title>
    <xsl:call-template name="style"/>
    <xsl:call-template name="langHeadEnd"/>
    <body>
      <xsl:call-template name="my_page_header"/>
      <xsl:call-template name="result_page_header"/>
      <hr/>
      <xsl:call-template name="copyright"/>
      <xsl:call-template name="my_page_footer"/>
    </body>
  </html>
</xsl:template>

Empty Result Set Template

The section labeled "Empty Result Set" changes the look of the results page your users will see when there are no results to return. You might want to change or add to the suggestions in the list to make them specific to your intranet.

If you make no changes to this section, the default page is displayed, reflecting only any Global Variables you changed.

<!-- **********************************************************************
 Empty result set (can be customized)
********************************************************************** -->
<xsl:template name="no_RES">
  <xsl:param name="query"/>
  <span class="p">
    <br/>
    Your search - <b><xsl:value-of disable-output-escaping="yes" select="$query"/></b> - did not match any documents.
    <br/>
    No pages were found containing <b>"<xsl:value-of disable-output-escaping="yes" select="$query"/>"</b>.
    <br/>
    <br/>
    Suggestions:
    <ul>
      <li>Make sure all words are spelled correctly.</li>
      <li>Try different keywords.</li>
      <li>Try more general keywords.</li>
    </ul>
  </span>
</xsl:template>
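To make the suggestions specific to your intranet, you might add an item to the list inside the no_RES template. The sketch below shows only the list portion; the added item and its e-mail address are hypothetical placeholders.

```xml
<ul>
  <li>Make sure all words are spelled correctly.</li>
  <li>Try different keywords.</li>
  <li>Try more general keywords.</li>
  <!-- Hypothetical addition: the address is a placeholder -->
  <li>Contact the <a href="mailto:helpdesk@example.com">help desk</a> if you still cannot find the document.</li>
</ul>
```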

Serving > Front Ends > Output Format - XSLT Stylesheet Editor

Other Variables

Other variables also affect the way the search pages look. The cached page header text, for example, appears on a cached page that you see by clicking the "Cached" link. In this section, you can change the page titles, the choice of header for the Advanced Search page, the cached page header text, the error message text, and the background color of the Advanced Search panel.

<!-- **********************************************************************Other variables (can be customized)********************************************************************** -->

<!-- *** page title *** -->
<xsl:variable name="front_page_title">Search Home</xsl:variable>
<xsl:variable name="result_page_title">Search Results</xsl:variable>
<xsl:variable name="adv_page_title">Advanced Search</xsl:variable>
<xsl:variable name="error_page_title">Error</xsl:variable>

<!-- *** choose adv_search page header: '', 'provided', 'mine', or 'both' *** -->
<xsl:variable name="choose_adv_search_page_header">both</xsl:variable>

<!-- *** cached page header text *** -->
<xsl:variable name="cached_page_header_text">This is the cached copy of </xsl:variable>

<!-- *** error message text *** -->
<xsl:variable name="xml_error_msg_text">Unknown XML result type.</xsl:variable>
<xsl:variable name="xml_error_des_text">View page source to see the offending XML.</xsl:variable>

<!-- *** advanced search page panel background color *** -->
<xsl:variable name="adv_search_panel_bgcolor">#cbdced</xsl:variable>

Page Title

The changes you can make to page titles include the title text displayed on the Search page, the Result page, the Advanced Search page, and the Error page.

<!-- *** page title *** -->
<xsl:variable name="front_page_title">Search Home</xsl:variable>
<xsl:variable name="result_page_title">Search Results</xsl:variable>
<xsl:variable name="adv_page_title">Advanced Search</xsl:variable>
<xsl:variable name="error_page_title">Error</xsl:variable>

Choose Adv_Search Page Header

<!-- *** choose adv_search page header: '', 'provided', 'mine', or 'both' *** --> <xsl:variable name="choose_adv_search_page_header">both</xsl:variable>

For the header that displays on the Advanced Search page, choose one of these:

● Provided - the Google header
● Mine - your header
● Both (default) - the Google header and your header

Cached Page Header Text

<!-- *** cached page header text *** --> <xsl:variable name="cached_page_header_text">This is the cached copy of </xsl:variable>

You can change the text "This is the cached copy of" to whatever makes sense for your company.
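For example, a hypothetical replacement text is shown below. Because the Cached page header template prints the page URL directly after this text, the sketch keeps the trailing space that the default value also has.

```xml
<!-- *** cached page header text *** -->
<!-- Example wording only; note the trailing space before the closing tag -->
<xsl:variable name="cached_page_header_text">Archived copy of </xsl:variable>
```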

Error Message Text

<!-- *** error message text *** -->
<xsl:variable name="xml_error_msg_text">Unknown XML result type.</xsl:variable>
<xsl:variable name="xml_error_des_text">View page source to see the offending XML.</xsl:variable>

You can decide to display error message text that is different from the default text in this section.

Advanced Search Panel Background Color

<!-- *** advanced search page panel background color *** --> <xsl:variable name="adv_search_panel_bgcolor">#cbdced</xsl:variable>

You can make the Advanced Search page a different background color by entering another hexadecimal value.

Note: Be careful to replace only the text between the tags. Take care not to delete the "<" and the ">" characters that surround the tags.

Serving > Front Ends > KeyMatch

KeyMatch lets you promote specific web pages on your site. For example, if a department is releasing a new Operations web page that should be returned for certain types of queries, you can direct users to that new web page by associating specific search terms, such as operations, with the new web page. A link for the new web page is returned alongside the search results (as a text-based ad does) for queries containing the term operations. This feature is especially useful in driving traffic directly to web pages that are not yet part of the Production Index or that have very few links to them, causing them to appear further down in the results list than you would like.

To create a KeyMatch, you must provide the word, phrase, or exact match criteria for which a specific result will be returned. The rules for creating a KeyMatch are as follows:

Each KeyMatch type is described below by its criteria (none are case-sensitive), an example for the search query "Abraham Lincoln", and the reason for the KeyMatch.

KeywordMatch
Criteria: A word that must appear anywhere in the query.
If search query is "Abraham Lincoln": KeywordMatches = "Abraham" and "Lincoln"
Reason for KeyMatch: If your KeywordMatch is "Abraham Lincoln", the search query must include both "Abraham" and "Lincoln" to trigger this KeywordMatch. To get a KeywordMatch for either "Abraham" or "Lincoln," enter two KeywordMatches: one for "Abraham" and one for "Lincoln."

PhraseMatch
Criteria: A phrase that appears anywhere in the query. For the phrase to match, all of the words must be present, the order of the words must be the same with no intervening words, and any hyphens in the query must be matched.
If search query is "Abraham Lincoln": PhraseMatch = "Abraham Lincoln," "President Abraham Lincoln," "Abraham Lincoln president," and "young Abraham Lincoln"
Reason for KeyMatch: These are all PhraseMatches because the words appear in the order entered in the search query, "Abraham Lincoln." "Abraham the Tall Lincoln" is not a PhraseMatch because "the Tall" separates the phrase "Abraham Lincoln."

ExactMatch
Criteria: The phrase must exactly match the query.
If search query is "Abraham Lincoln": ExactMatch = "Abraham Lincoln"
Reason for KeyMatch: Only "Abraham Lincoln" is an ExactMatch for the query. "President Abraham Lincoln" and "Abraham Lincoln's" are not ExactMatches.

You can submit up to five designated URLs per word, phrase, or exact match. By default, however, a maximum of three KeyMatches are returned for a search. If you want to increase this number to five, you can use the numgm parameter documented in the XML Reference document.

To use KeyMatches:

1. Click Serving and then click Front Ends. 2. Click the Edit link next to the collection front end you want to edit. 3. Click the KeyMatch tab.

From this page, you can view, edit, create, and import and export matches.

To view existing KeyMatches:

1. Click the View Matches link. If matches have been created, all matches are displayed with a search field and navigation links.

2. To navigate through all of the KeyMatches, use the First, Previous, Next, and Last links.

To edit existing KeyMatches:

1. Click the Edit Matches link. The Edit page is displayed with all matches in editable fields with a search field and navigation links.
2. To navigate through all of the KeyMatches, use the First, Previous, Next, and Last links. KeyMatches are displayed 25 at a time.
3. Make changes and click the Save Changes button.
4. Select the box in the Delete column for any matches you want to delete.
5. Click the Save Changes button.

To add new KeyMatches:

1. Click the Add Matches link (the default). The Add Matches page displays with editable fields.

2. Enter your KeyMatches one per line. 3. To add more lines, click the Add More Lines button. 4. Click the Save New Matches button.

To import or export KeyMatches:

Note: Because KeyMatches are overwritten when a new KeyMatch file is imported, export a current version before making changes. See steps 4 and 5.

1. Click the Import/Export Matches link.
2. To import KeyMatches from a URL, enter the URL in the URL field and click the Import KeyMatches Now button.
3. To import KeyMatches from a file, click the Browse button, select the file, click Open, and click the Import KeyMatches Now button.
4. To export KeyMatches from this system to a local file, click the Export KeyMatches Now button. The File Download wizard is displayed.
5. Save the file with a .csv extension, which can be opened in Microsoft Excel.

Note: Changes to the KeyMatch file may take several minutes to appear in the search results.

Serving > Front Ends > Synonyms

You use Synonyms to suggest alternate words or phrases for search queries. For example, if a user searches for "Mark Twain," you can suggest "Sam Clemens." If there is a search for "File Transfer Protocol," you can suggest "FTP." Synonyms display as "Other suggested searches" on the results page (unless you have altered this text using the XSLT Stylesheet).

Synonyms work in only one direction. That is, if you specify a synonym for a search term, a user would have to enter the search term to get the synonym. Entering the synonym would not display the search term unless you entered both as search terms and as synonyms.

On the Synonyms page, you enter a search term and then the synonym you want to suggest to users.

Examples:

Search Term | Synonym | Other suggested searches on results page
NATO | North Atlantic Treaty Organization | North Atlantic Treaty Organization
North Atlantic Treaty Organization | NATO | NATO as a synonym
Mark Twain | Sam Clemens | Sam Clemens
Sam Clemens | none | no suggested synonym displayed

To use synonyms:

1. Click Serving and then click Front Ends. 2. Click the Edit link next to the collection front end you want to edit. 3. Click the Synonyms tab.

From this page you can view, edit, create, and import and export synonyms.

To view synonyms:

1. Click the View Synonyms link.
2. To search for synonyms on this page, enter a complete or partial synonym in the search field and click the Search button. All synonyms containing the search string are displayed.
3. Click the Show All Synonyms link to display all available synonyms.

To edit synonyms:

1. Click the Edit Synonyms link.
2. To search for synonyms on this page, enter a complete or partial synonym in the search field and click the Search button. All synonyms containing the search string are displayed.
3. Click the Show All Synonyms link to display all available synonyms.
4. To delete a synonym, select the box to the left of the synonym.
5. When all changes have been made, click the Save Changes button.

To add new synonyms:

1. Click the Add Synonyms link (the default). 2. Enter the search term for which you are creating a synonym in the Search Term field. 3. Enter a single synonym that you want displayed for that search term in the Synonym field, one per line. 4. Click the Save New Synonyms button. The View Synonyms page is displayed.

Note: Search terms and synonyms may be words or phrases.

To import or export synonyms:

Note: Because synonyms are overwritten when a new synonym file is imported, export a current version before making changes. See steps 4 and 5.

1. Click the Import/Export Synonyms link. 2. To import synonyms from a URL, enter the URL in the URL field and click the Import Synonyms Now button. 3. To import synonyms from a file, click the Browse button, select the file, click Open, and click the Import Synonyms Now button. 4. To export synonyms from this system to a local file, click the Export Synonyms Now button.

The File Download wizard is displayed.

5. Save the file with a .csv extension, which can be imported to Microsoft Excel.

Serving > Front Ends > Filters

Use the Serving > Front Ends > Filters page to restrict users' searches by front end, using filters for:

● Domain - restrict searches to one or more domain names (not IP addresses) ● Language - restrict searches to either all languages or a selected set of languages ● File type - restrict searches to one or more file types, such as HTML, PDF, and so on ● Meta tags - filter searches by values and value types in meta tags

To use filters:

1. Click Serving and then click Front Ends. 2. Click the Edit link next to the collection front end you want to edit. 3. Click the Filters tab.

From this page you can set up filters for the selected front end.

Domain Filters

When filtering by a domain name, all domains ending with that name are filtered. For example, the domain

yourdomainname.com

returns search results in all domains such as

www.yourdomainname.com
news.yourdomainname.com
ops.yourdomainname.com

and so on.

However, if the domain name is followed by a directory name, then the domain name must be fully qualified. For example,

yourdomainname.com/marketing

will not return any results under the domain name filter alone. The correct filter would be

www.yourdomainname.com/marketing

If the directory marketing is in multiple domains, then each filter must be specified individually -- one per line, one for each domain, such as

www.yourdomainname.com/marketing
news.yourdomainname.com/marketing

and so on.
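The matching rules above can be sketched in a few lines of Python. This is only an illustration of the behavior described in this section, not the appliance's own implementation; the function name matches_domain_filter is hypothetical, and treating the directory as a path prefix is an assumption.

```python
from urllib.parse import urlsplit


def matches_domain_filter(url: str, filter_entry: str) -> bool:
    """Return True if url would pass one Domain Restrict entry (illustrative only)."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    entry_host, slash, entry_dir = filter_entry.strip().partition("/")
    entry_host = entry_host.lower()

    if not slash:
        # Bare domain name: any host ending with that name matches.
        return host == entry_host or host.endswith("." + entry_host)

    # Domain plus directory: the host must be fully qualified, so compare exactly.
    # Assumption: the directory is matched as a prefix of the URL path.
    prefix = "/" + entry_dir.strip("/")
    return host == entry_host and (
        parts.path == prefix or parts.path.startswith(prefix + "/")
    )
```

With this sketch, yourdomainname.com/marketing does not match a page on www.yourdomainname.com, while www.yourdomainname.com/marketing does.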

To set up a domain filter:

1. In the Domain Restrict text box, enter at least one domain to which searches (using this front end) will be restricted. 2. Press Enter to add another domain name so that each domain is on a separate line. 3. To make filter changes to another front end, click the down arrow in the Edit Front End drop-down menu and select the front end you want. 4. When finished, click the Update Filter button.

Language Filters

You can select, for any front end, to search for pages in any language or to search for pages in selected languages.

To set up a language filter:

1. In the Language Restrict area, select one of two options: "Search for pages written in any language" or "Search only for pages written in these language(s)".

2. If you select the second option, select the checkboxes next to the languages you want to restrict search results to. 3. To make filter changes to another front end, click the down arrow in the Edit Front End drop-down menu and select the front end you want. 4. When finished, click the Update Filter button.

File Type Filters

The Google Search Appliance can crawl many file types.

Specify extensions without the leading period. Separate multiple extensions with spaces, NOT with commas. For example,

Correct:
pdf
pdf html xml

Incorrect:
.pdf
.pdf .html
pdf, html
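As a rough illustration of these rules, the following Python sketch checks a File Type Restrict value before you paste it into the Admin Console. The function name parse_file_type_filter is hypothetical, and the appliance may apply checks that are not shown here.

```python
def parse_file_type_filter(text: str) -> list[str]:
    """Split a File Type Restrict value into extensions, rejecting the documented mistakes."""
    if "," in text:
        raise ValueError("separate extensions with spaces, not commas")
    extensions = text.split()
    for ext in extensions:
        if ext.startswith("."):
            raise ValueError(f"remove the leading period from {ext!r}")
    return extensions
```

Calling parse_file_type_filter("pdf html xml") returns the three extensions, while each of the incorrect forms listed above raises ValueError.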

To set up a file type filter:

1. In the File Type Restrict text box, enter the extensions of file types to which you want to restrict the search results. For example, to restrict results to only HTML files, you would enter HTML. 2. To make filter changes to another front end, click the down arrow in the Edit Front End drop-down menu and select the front end you want. 3. When finished, click the Update Filter button.

Meta Tag Filters

You can filter search results by meta tag, restricting by "all" or "any," by value type, and by value.

To set up a meta tag filter:

1. In the Meta Tag Filter area, select one of two options: Match all or Match any

2. Enter the Meta Tag Name, such as "author". 3. Select the Value Type (Exact, Partial, Existence). 4. Enter the Meta Tag Value, such as "Peter Norton". 5. To make filter changes to another front end, click the down arrow in the Edit Front End drop-down menu and select the front end you want. 6. When finished, click the Update Filter button.

Serving > Front Ends > Remove URLs

As results are being served, you may find some results that you do not want to be served to your end users. To prevent particular URLs from being served in results, you can enter URLs or URL patterns on the Remove URLs panel. Even though the URLs exist in the Test Center, they will not be served.

You can add URL patterns to this panel at any time, and they will be removed from the served results. You can also delete the URL patterns in this window at any time to return those patterns to the served results.

To prevent crawling of pages that you add to the Remove URLs list, you can add the same URLs to the Do Not Crawl URLs with the Following Patterns in the Crawl and Index > Crawl URLs page.

The URL patterns you provide must conform to the rules for valid URL patterns.

To enter a URL or URL pattern:

1. Click Serving and then Front Ends. 2. Click the Edit link next to the front end you want to edit. 3. Click the Remove URLs tab. 4. Type a valid URL pattern into the text box. 5. Press Enter to add additional URLs or patterns.

Empty lines and comments starting with # are permitted. 6. Click the Update List of Removed URLs button.
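If you keep the removal list in a text file, a small script can show which lines actually count as patterns. The sketch below only applies the two rules stated above (empty lines and lines starting with # are ignored); it does not validate the patterns themselves, and the function name read_pattern_list is hypothetical.

```python
def read_pattern_list(text: str) -> list[str]:
    """Return the non-empty, non-comment lines of a Remove URLs list."""
    patterns = []
    for line in text.splitlines():
        stripped = line.strip()
        # Empty lines and comments starting with # are permitted and skipped.
        if not stripped or stripped.startswith("#"):
            continue
        patterns.append(stripped)
    return patterns
```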

Serving > Authorization

Authorization API

If you have implemented an authorization service using the Google Search Authorization API, you need to enter the URL of the service so that the system can access the service when authorization is needed. When finished, click the Save Settings button. For more information on the Google Search Authorization API, go to Google Support.

Authorization Cache

The Google Search Appliance caches information regarding user access to confidential documents, so that a security check does not have to be repeated on subsequent searches. This authorization information is used when serving secure documents, which are designated by entries in both Crawler Access and Forms Authentication.

For example, when a user does a search, the Google Search Appliance stores the fact that the user has access to a document that will be returned in the search results. The cache makes subsequent searches even faster.

The Administrator can specify how long (in seconds) authorization information is cached. The default is 3600 seconds (one hour). If you change the default, do not set a value so short that the cache expires while a user's browser session is still searching secure documents.

To change the length of time the authorization cache is kept:

1. Click Serving and then click Authorization. 2. Enter the number of seconds you want to keep the cache. 3. Click the Save Settings button.

The Administrator can access the Authorization page and click the Clear Cache button to immediately empty the contents of the cache. This should be done periodically in case the Administrator changes access to secure documents, and to keep the cache fresh.
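The idea of a time-limited authorization cache can be sketched as follows. This is a simplified model for illustration, not the appliance's internal design: the class name AuthorizationCache, the (user, URL) key, and the injectable clock are all assumptions. Only the 3600-second default and the clear operation come from the text above.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class AuthorizationCache:
    """Remember access decisions for a limited time (illustrative model)."""

    def __init__(self, ttl_seconds: int = 3600,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[Tuple[str, str], Tuple[bool, float]] = {}

    def store(self, user: str, url: str, allowed: bool) -> None:
        """Record a decision together with the time at which it expires."""
        self._entries[(user, url)] = (allowed, self._clock() + self._ttl)

    def lookup(self, user: str, url: str) -> Optional[bool]:
        """Return the cached decision, or None if it is missing or expired."""
        entry = self._entries.get((user, url))
        if entry is None:
            return None
        allowed, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[(user, url)]
            return None
        return allowed

    def clear(self) -> None:
        """Empty the cache, comparable to clicking the Clear Cache button."""
        self._entries.clear()
```

A lookup that returns None means the security check has to be repeated, which is why a very short lifetime reduces the benefit of the cache.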

Serving > Forms Authentication

If you are using single sign-on (SSO) functionality, such as that provided by Oblix and Netegrity, the Google Search Appliance can serve pages that are protected by forms-based authentication when you enter information on this page.

Note: To have your protected pages crawled and indexed by the Google Search Appliance, go to Crawl and Index > Forms Authentication.

When serving SSO documents, if the end user does not have a cookie or if it has expired, the Google Search Appliance first challenges for credentials and then tries to obtain a cookie by submitting credentials to an SSO server.

The system provides two options for authentication. One is to authenticate against an external login server using cookie forwarding. The other is to authenticate against a protected URL, using either cookie forwarding or user impersonation.

If your system uses cookie forwarding, for the appliance to get a cookie and send it to the user’s browser, the hostname of the appliance must domain-match the cookie domain.

For example, if the hostname of the Google Search Appliance is search.corp.yourcompany.com, these cookie domains would support cookie forwarding in your system:

corp.yourcompany.com
yourcompany.com

but these would not:

sso_docs.yourcompany.com
gsa.search.yourcompany.com

That is, if the SSO cookie domain is so specific that it doesn’t cover the appliance’s hostname, then cookie forwarding will not work on your system. You can, however, use User Impersonation.
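The domain-match test described above can be written as a short check. This is a sketch of the general rule (the host equals the cookie domain or ends with a dot followed by it), under the assumption that the appliance applies ordinary cookie domain matching; the function name cookie_domain_covers is hypothetical.

```python
def cookie_domain_covers(appliance_host: str, cookie_domain: str) -> bool:
    """Return True if the cookie domain covers the appliance's hostname."""
    host = appliance_host.lower().rstrip(".")
    # A leading dot on the cookie domain is ignored for the comparison.
    domain = cookie_domain.lower().strip(".")
    return host == domain or host.endswith("." + domain)
```

For the hostname search.corp.yourcompany.com, the check succeeds for corp.yourcompany.com and yourcompany.com and fails for sso_docs.yourcompany.com and gsa.search.yourcompany.com, which agrees with the examples in this section.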

You can use User Impersonation if your system restricts the cookie domain to something other than the search domain used by the appliance. In this case, we suggest using a fully qualified host name for the cookie domain. Also, in Serving > Forms Authentication, select the Only User Impersonation check box.

For example, these domains will allow the Google Search Appliance to impersonate the user and get a cookie on the user’s behalf from your SSO server:

sso_docs.yourcompany.com
gsa.search.yourcompany.com

If you select Only User Impersonation, set the cookie duration to the proper length of time, depending on your company's policy. By default, the duration is set to zero, which means the cookie never expires.

To use forms-based authentication with the appliance:

1. Click Serving and then Forms Authentication. 2. Select either option: (1) Log in against a sample protected URL, or (2) Always redirect to an external login server.

Note: If you selected the second option, skip to step 5. 3. If you selected the first option and wish to use user impersonation only, click the Only User Impersonation check box and then enter the User Impersonated Cookie Name associated with full user impersonation. 4. If you selected Only User Impersonation, the User Impersonated Cookie duration field is enabled. Set the number of hours, minutes, or seconds before the cookie should expire. 5. Enter the URL that is protected by your security policy or the URL to the external login server. 6. Enter the name of the forms authentication cookie. 7. Click the Save Forms Authentication Serving Configuration button.

Administration > Network Settings

The Network Settings page is available through a link under Administration. The page is divided into these sections:

● Network Settings ● Network Diagnostics

To view or edit network settings:

1. Click Administration, and then click the Network Settings link. 2. Enter the required information in the editable fields.

3. Optionally, under Network Diagnostics, you can enter one URL per line in the URLs to Test area to check their accessibility. 4. Click the Update Settings and Perform Diagnostics button.

Network Settings

● DNS Server: This is a comma-separated list of Internet Protocol (IP) addresses for your Domain Name Servers (DNS). These servers translate host names, such as www.google.com, into IP addresses and are a prerequisite for successful crawling. You can and should enter up to three addresses to provide fault tolerance.

● DNS Suffix (DNS Search Path): This is a comma-separated list that gives possible expansions for host names. For example, if your search path is baz.foobar.com, foobar.com, and the crawler encounters a host "test," it will try the hosts test.baz.foobar.com and test.foobar.com.

● SMTP Server: This is the server that delivers email notification from the Google Search Appliance to administrators.

Note: All email sent by the Google Search Appliance will be sent from nobody@localhost or the address you entered during network configuration. The supplied SMTP server must allow mail to be sent from this address.

● NTP Servers: The Network Time Protocol (NTP) servers are a list of comma-separated servers that synchronize the Google Search Appliance time with an outside server.

Note: The Google Search Appliance no longer follows the POSIX standard for specifying time zones. The offset specifies the difference between local time and GMT. Therefore, Japan's time zone would be specified as GMT+9, because it is 9 hours ahead of GMT.

● Syslog Server: (Optional) The Google system's syslog collects the web server logs. The syslog does not collect system events. The Google Search Appliance syslog messages have a Priority of "Informational" and are sent to the syslog server every 10 minutes.

Facility for usage logs: You can set the Facility to any Local Use level. The Facility setting has no effect on what messages are logged. For more details on the syslog protocol, see RFC 3164.

You can find full syslog documentation in Syslog Reports.
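When you configure the receiving syslog server, it can help to know the priority value that arrives with each message. RFC 3164 combines facility and severity into a single PRI number (facility multiplied by 8, plus severity). The sketch below computes it for the Local Use facilities and the "Informational" severity mentioned above; the function names are hypothetical.

```python
INFORMATIONAL = 6  # severity code for "Informational" in RFC 3164


def pri_for_local_facility(local_level: int, severity: int = INFORMATIONAL) -> int:
    """Return the RFC 3164 PRI value for facility local0..local7."""
    if not 0 <= local_level <= 7:
        raise ValueError("Local Use facilities are local0 through local7")
    facility = 16 + local_level  # local0 is facility code 16
    return facility * 8 + severity


def split_pri(pri: int) -> tuple[int, int]:
    """Split a PRI value back into (facility, severity)."""
    return divmod(pri, 8)
```

For example, messages logged with facility local0 at the Informational severity carry a PRI of 134.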

Network Diagnostics

● URLs to Test: This diagnostic test checks URLs, entered one per line, to ensure ❍ that the URL is valid and can be resolved by DNS ❍ that the host is pingable by the appliance ❍ that a head request for the web page can be retrieved without error

This is useful for diagnosing network problems when URLs appear to be uncrawled.

● A table is updated at the bottom of the page when the Update Settings and Perform Diagnostics button is clicked. This table contains a status line for each network parameter. If an item is okay, the third column contains the word "OK" on a green background. If the server is not okay, the last column is red and displays a diagnostic error message, such as "unpingable" or "cannot resolve hostname."

Common problems include:

HTTP Error Code | Description | Who Should Fix It
returncode 404, should be 200 | Authenticated site | See Crawler Access for information on crawling and serving secure information.
401 and 403 | Disallowed access | Web master must resolve.
404 | URL not found | Web master must resolve.
connection timed out | One possibility is that a router between the Google Search Appliance and a URL is blocked by an ACL | Network administrator must resolve.
unpingable | non-existent domain | Check your entry for typos before contacting your web master.

Administration > System Settings

The System Settings page is available through a link under Administration. The page is divided into these sections:

● Email Notification ● Default Search URL ● Remote Support ● Daily Status Report

To view or edit system settings:

1. Click Administration, and then click the System Settings link. 2. Enter the required information in the editable fields. 3. Click the Update System Settings button.

Email Notification

To send automatic reports and problem reports, you can enter email addresses into the appropriate fields. Multiple email addresses must be separated by commas. The system validates and qualifies email addresses. Suggestion: Use an alias for automatic reports.

● Automatic reports include email notifications when crawling begins and ends, indexing begins and ends, and a new index begins serving. You can leave this blank if routine emails on Google Search status are not necessary.

● Problem reports include notifications for hardware failures and for crawling and indexing failures. This email address is also used by the gsa-crawler user agent.

● Sender of outgoing emails is the sender address you want shown to the receiver of automatic and problem reports. The default is nobody@localhost.

Default Search URL

When a user points a browser to the Google Search Appliance, the Default Search URL is the page they see. You can enter the default search URL in this text box. If you do not enter a default search URL, the user is redirected to a search on the default_collection, using the default_frontend.

If you delete the default_collection or the default_frontend and do not define a default search URL, the user will receive an error when they try to do a search. For example, if your appliance is named search.yourdomain.com, then going to http://search.yourdomain.com/ displays the default page URL. Be sure to test the URL you enter to make sure your users are redirected to the page you expect.

Note that after the new search page URL has been accepted, the change will take a short while to appear.

You might want to redirect your users to a web page that contains a search box, whether it's a custom page you have created or the Search Index provided directly from the Google Search Appliance. To redirect your users to an existing collection, provide the URL of the collection's Search Index.

The collection's Search Index URL will be similar to this:

/search?site=my_collection&client=my_frontend&output=xml_no_dtd&proxystylesheet=my_frontend&proxycustom=<HOME/>&ie=&oe=&lr=
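If you prefer to assemble this URL rather than copy it from the Test Center, the parameters shown above can be put together with Python's standard library. This is a sketch: the function name search_index_url is hypothetical, the parameter list is taken only from the sample URL, and urlencode percent-encodes the <HOME/> value, so test the result on your appliance before relying on it.

```python
from urllib.parse import urlencode


def search_index_url(collection: str, front_end: str) -> str:
    """Build a relative Search Index URL from a collection and front end name."""
    params = [
        ("site", collection),
        ("client", front_end),
        ("output", "xml_no_dtd"),
        ("proxystylesheet", front_end),
        ("proxycustom", "<HOME/>"),  # emitted as %3CHOME%2F%3E
        ("ie", ""),
        ("oe", ""),
        ("lr", ""),
    ]
    return "/search?" + urlencode(params)
```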

To obtain the URL of a collection's search index:

1. In the Admin Console, click Serving > Front Ends. 2. Select the collection name and click the Edit link. 3. In the upper right corner, click the Test Center link. A new browser window opens. 4. In Internet Explorer 6.0, view the properties of the IFrame, and highlight the URL and copy it.

In Netscape and Mozilla, right click on the IFrame and select This Frame > Open Frame in New Window. Highlight the URL in the browser and copy it. 5. Close the Test Center browser window. 6. In the Admin Console, click Administration > System Settings. 7. Paste the URL in the Default Search URL field. 8. Click the Update System Settings button.

Remote Support

For security reasons, the SSH port is not enabled by default. If you need Google support to perform remote troubleshooting, and you prefer to have them support you using the SSH port, you can enable the SSH access by checking the Remote Support option, and saving this setting by clicking the Update System Settings button.

In addition, if your network has a firewall set up, you need to inform your system administrator to open port 22. Then, you can provide the IP address of the Google Search Appliance for Google Support to connect to.

When the remote support session is completed, Google Support engineers will remind you to clear the check box, so that no one else can connect to the Google Search Appliance over SSH.

Daily Status Report

You can have a daily status report sent to the email address entered in the Email Notification text box. Check the Enable Daily Status Email Messages box to have a message like this sent daily:

Date: Fri, 25 Jul 2003 00:00:08 -0700
From: nobody@localhost
Subject: GID-[XXXX]: NOTIFICATION: System ***status report at 2003-07-25 00:00:07:

System Status: OK
Machines Status: OK.
Disks Status: OK.
Temperatures Status: OK.

Crawl Summary:
Documents crawled since yesterday:: 1244.0
Documents error since yesterday:: 0.0
[This message was generated automatically @2003/07/25 00:00:08 from Hostname]

Administration > User Accounts

There are two levels of user accounts: administrators and managers. Different permissions are given to each of these levels. One administrator user account ("admin") is created by default. The default account cannot be deleted.

All functions in the Admin Console are available to the administrator, including setting up the other user accounts and their permission levels. Only an administrator can create new users and delete users. As an administrator, you can create, assign, and delete collections and front ends. You can also view and edit user accounts and network and system settings.

The manager has access to assigned collections and front ends. She can view and edit her collections and export collection configurations, but cannot create or delete collections. She has access to KeyMatch, Synonyms, Filters, Remove URLs, Search Logs, and Search Reports within her assigned front ends and collections.

Note: The manager does not have access to the user accounts or network and system settings. Both administrators and managers can change their passwords.

When a new account is set up, the system sends two automated email messages from the address nobody@localhost: one to the newly created user that contains the user's username and password, and the username and email address of the administrator who created the account, and one to the administrator as a confirmation.

To set up a new user account:

1. Click Administration and then click the User Accounts link. 2. Enter the username in the text field. (The username can contain alphanumeric characters, hyphens, and underscores, but cannot begin with a hyphen.) 3. Enter the user's email address. (Enter a complete address, including the domain name.)

The welcome message that includes account information and password will be sent to this email address. 4. Select an account type. 5. Select one or more collections and/or front ends that may be accessed by the new user. (Administrators can access all collections.) 6. Enter the user's password. 7. Confirm the password in the second password field.

If you do not enter a password, a temporary one will be created. All passwords assigned at account creation should be changed by the user. If you successfully create the account, a confirmation message is displayed.

8. Click the Create User Account button before leaving the page.
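If you create accounts in bulk, you may want to check usernames against the rule in step 2 beforehand. The following sketch encodes that rule as a regular expression. It assumes that "alphanumeric" means ASCII letters and digits, which the text does not state, and the function name is_valid_username is hypothetical.

```python
import re

# Letters, digits, hyphens, and underscores; the first character may not be a hyphen.
_USERNAME = re.compile(r"[A-Za-z0-9_][A-Za-z0-9_-]*")


def is_valid_username(name: str) -> bool:
    """Return True if name follows the username rule described in step 2."""
    return _USERNAME.fullmatch(name) is not None
```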

To edit user accounts:

1. Click Administration and then click the User Accounts link. 2. Click the Edit link for the user you wish to edit. 3. If necessary, type a new password and retype to confirm it. 4. If necessary, change the assigned collections and/or front ends. 5. Click the Update User Accounts button before leaving the page.

To delete a user account:

● Click the Delete link on the same row as the username.

Forgot Your Password?

You can request a new password from the Login page.

1. If you forget your password, click the Forgot Your Password? link on the Login page. The Login page is redisplayed without the password field. 2. Enter your user name. 3. Click the Send Me a New Password button.

You will receive a new password at the email address entered when your account was set up. You must change this password on your next login.

Administration > Change Password

Having your own password prevents others from making changes in your name without your authorization. Keep your password in a safe place.

To change your password:

1. Click Administration in the menu, then click the Change Password link. 2. Type your old password that you used to log in. 3. Type your new password. 4. Retype your new password to confirm. 5. Click the Change Password button.

Administration > SNMP Configuration

The SNMP (Simple Network Management Protocol) Configuration page is available through a link under Administration. The page is divided into these sections:

● SNMP Support ● SNMP v1/v2 Configuration ● SNMP v3 Configuration

❍ Add SNMP v3 Users ❍ Delete SNMP v3 Users

To view or edit SNMP configuration:

1. Click Administration, and then click the SNMP Configuration link. 2. Enter the required information in the editable fields. 3. Click the submit button corresponding to information edited.

SNMP Support

The SNMP interface of the Google Search Appliance provides information about a number of system parameters and can be used to monitor system health. To enable or disable SNMP support, select or clear the checkbox and then click the Update System Settings button.

SNMP v1/v2 Configuration

SNMP version 1 and version 2 use community names to authenticate a user before providing system information. To use the SNMP interface of Google Search Appliance in version 1/version 2 mode, provide community names in the text field and click the Edit Communities button.

To disable access using SNMP v1/v2, leave the text field blank and click the Edit Communities button.

Tip: Keep the community names secret to prevent unauthorized access to information about the Google Search Appliance.

SNMP v3 Configuration

SNMP version 3 uses a much stronger authentication mechanism for users. It requires setting up user accounts to access SNMP information. User accounts to access SNMP can be added and deleted as described below. Tip: If your SNMP client supports SNMP v3, prefer it over SNMP v1 and SNMP v2.

Add SNMP v3 User

To add an SNMP v3 user, complete the UserName and other fields and click the Add User button.

Delete SNMP v3 User(s)

To delete users allowed to access SNMP information, check the boxes next to their user names and click the Delete User button.

Administration > Certificate Authorities

The Google Search Appliance can use certificate authorities to authenticate users' credentials before they can view protected search results. Use this page to upload certificate authorities from your network to the Google Search Appliance.

Current Certificate Authorities

The Certificate Authorities that are active in the system are listed in the Current Certificate Authorities area. If the Certificate Authority has an associated Certificate Revocation List (CRL), the system adds a checkmark to the checkbox on the page. (The CRL contains a list of serial numbers of revoked, but unexpired, certificates.)

The Google Search Appliance supports CA certificates and CRL files in PEM format.
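Before uploading, you can check that a file is in PEM form by looking for its BEGIN and END marker lines. The sketch below reads the marker labels only; it does not decode or verify the certificate or CRL, and the function names are hypothetical.

```python
import re

# A PEM block is wrapped in matching BEGIN/END lines that carry the same label.
_PEM_BLOCK = re.compile(
    r"-----BEGIN ([A-Z0-9 ]+)-----.*?-----END \1-----", re.DOTALL
)


def pem_block_labels(text: str) -> list[str]:
    """Return the labels of all PEM blocks found in text, in order."""
    return [match.group(1) for match in _PEM_BLOCK.finditer(text)]


def looks_like_ca_certificate(text: str) -> bool:
    """True if text contains a PEM block labeled CERTIFICATE."""
    return "CERTIFICATE" in pem_block_labels(text)


def looks_like_crl(text: str) -> bool:
    """True if text contains a PEM block labeled X509 CRL."""
    return "X509 CRL" in pem_block_labels(text)
```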

To add a Certificate Authority from your network:

1. Click the Browse button next to Add more Certificate Authorities. 2. Find the Certificate Authority file (in the supported PEM format) on your network and click Open. 3. Click the Save Settings button.

To add or renew a CRL from your network (if the Certificate Authority you added has a CRL):

1. Click the Browse button next to Add Certificate Revocation List. 2. Find the CRL file (in the supported PEM format) on your network and click Open. 3. Click the Save Settings button. If the CRL is issued by a CA known to the system, it will be accepted.

Administration > SSL Settings

The Google Search Appliance ships with a Secure Sockets Layer (SSL) certificate that you use during installation and network configuration. So that your users are not confronted with a security message each time they search, you should request your own SSL certificate with the hostname of the server your users will connect to when they perform a search on your intranet.

SSL is an Internet security protocol used by Internet browsers and web servers to transmit sensitive information. SSL is an implementation of public-key encryption.

A digital certificate is a bit of information saying that the web server is trusted by an independent source known as a certificate authority. The certificate authority acts as a middleman that both computers trust. It confirms that each computer is in fact who it says it is.

SSL uses Certificate Authorities (CAs) to authenticate certificates. Your browser trusts a list of root CAs that can validate the authenticity of SSL certificates. Examples of CAs that common web browsers trust by default are VeriSign, Thawte, E-Certify, and the like. A CA is trusted either because it's a root CA that ships with the browser (such as VeriSign) or it's an intermediate CA that your company has set up, which has been authenticated by a root CA.

To avoid browser security warnings:

● Your fully qualified search machine name must match the certificate name. ● Your search URL must also be the fully qualified machine name as shown on the certificate. ● The certificate must be signed by a CA that your browser trusts.

Step-by-step instructions follow, but generally, to get a certificate, you will:

1. Install a self-signed certificate. This eliminates the "hostname mismatch" warning that displays when users do secure searches. It is also a preliminary step to sending a signing request to a CA as described in step 2. For this, you will enter

❍ The fully-qualified hostname of the Google Search Appliance as it appears to your end users ❍ Your organization's name (your company name or school name) ❍ Your locality, usually your city (such as New York City or Vancouver) ❍ Your state or province, spelled out (such as New Hampshire) ❍ Your country abbreviation (such as US (United States), FR (France), or JP (Japan))

2. After this data is updated, you can export a certificate signing request to get a secure certificate from a commonly trusted CA. The Google Search Appliance generates a certificate request for you in the SSL Settings page. The generated keypair is 1024-bit RSA. Send the request for an SSL server certificate to your organization's preferred CA. If you don't have a preferred CA, a few are listed above.

Note: A new keypair for a certificate signing request will be generated only after changing the certificate data under Create a New SSL Certificate and clicking on the button Create Self-Signed Certificate.

3. a) When you receive the certificate that corresponds to the CSR from step 2, install it using the SSL Settings page.

b) Alternatively, you can also generate a keypair externally. In this case, you need to upload both the certificate and the corresponding private key to the SSL Settings page. Currently, Google accepts only non-encrypted RSA keys. If you have an encrypted private key to upload to the appliance, you need to decrypt it first. Use the freely available openssl software and the following openssl command to decrypt the private key:

openssl rsa -in key-with-passphrase.pem -out key-without-passphrase.pem

Note: To prevent access to the private key by any other user during this process, the output file should be stored on a USB stick or a floppy disk, or on a hardware security module. It should never be stored unencrypted on a network drive. Delete the local copy of the unencrypted private key file after the upload is performed.
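To find out whether a private key file still needs the decryption step, you can inspect its PEM headers. The sketch below looks for the two markers that OpenSSL writes for passphrase-protected keys (the "Proc-Type: 4,ENCRYPTED" header of the traditional format and the ENCRYPTED PRIVATE KEY label of PKCS#8). It reads text only and never decrypts anything; the function name private_key_is_encrypted is hypothetical.

```python
def private_key_is_encrypted(pem_text: str) -> bool:
    """Return True if the PEM text shows the markers of an encrypted private key."""
    # PKCS#8 encrypted keys use a dedicated block label.
    if "-----BEGIN ENCRYPTED PRIVATE KEY-----" in pem_text:
        return True
    # Traditional OpenSSL keys keep the usual label and add this header line.
    return "Proc-Type: 4,ENCRYPTED" in pem_text
```

If the function returns True for your key file, run the openssl command shown above before uploading the key.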

To install a self-signed SSL Certificate:

1. Click Administration, then click SSL Settings. If a certificate has been installed, you see the details of the current certificate. 2. To create a new certificate or to replace the current certificate, enter the fully qualified host name of the Google Search Appliance, the name users will see when they search on your site. 3. Enter the name of your organization, your locality (city), and your state or province (spelled out) and country (two-letter abbreviation, such as US or FR). 4. Enter your email address. 5. Enter your company's SSL non-encrypted Private Key or click the Browse button to locate it. 6. Click the Create Self-Signed Certificate button to display the information you just entered. 7. Click the Install SSL Certificate button. The certificate installs. Notice the message in green at the top: "SSL certificate installed. The appliance console needs to be restarted, please log in again." 8. Either click "log in" or wait 15 seconds for the system to log out and display the login page. 9. Log in and click Administration > SSL Settings. Your new certificate information is listed under "Current SSL Certificate Information."

To export the Certificate Signing Request:

1. With your new certificate information in "Current SSL Certificate Information," click the Export Certificate Signing Request button to start the process of getting a certificate from a signing authority. The Download dialog box opens with a Google Search Appliance Certificate Signing Request file.

2. Save the file to the hard disk. 3. Find the file you saved and send it to a signing authority organization. The root CA will ask for some proof that you are the company you say you are. It may take a few days to hear back from them. When you receive the signed certificate, complete the steps in "To install an SSL Certificate."

To install an SSL Certificate:

a) After submitting a request and receiving a certificate from a CA:

1. On the SSL Settings page, enter the file name of the SSL Certificate or click the Browse button to locate it on your network. 2. Click the View Certificate Information button. 3. Click Install SSL Certificate. 4. When the page refreshes, notice the message in green at the top: "SSL certificate installed. The appliance console needs to be restarted, please log in again." 5. Either click "log in" or wait 15 seconds for the system to log out and display the login page. 6. Log in and click Administration, then click SSL Settings. Your new certificate information is listed under "Current SSL Certificate Information."

b) After generating a key pair and certificate signing request externally and receiving a certificate from a CA:

1. On the SSL Settings page, enter the file name of the SSL Certificate or click the Browse button to locate it on your network.
2. Enter the file name of the unencrypted Private Key file or click the Browse button to locate it on your local computer.

Note the openssl command and notes mentioned above for decrypting this key.

3. Click the View Certificate Information button.
4. Click Install SSL Certificate.
5. When the page refreshes, notice the message in green at the top: "SSL certificate installed. The appliance console needs to be restarted, please log in again."
6. Either click "log in" or wait 15 seconds for the system to log out and display the login page. Log in and click Administration, then click SSL Settings. Your new certificate information is listed under "Current SSL Certificate Information."
7. Enter your company's SSL non-encrypted Private Key or click the Browse button to locate it.

Force secure connections when serving?

To make sure that search results containing confidential documents are served over a secure connection, you can choose among these options. The HTTPS protocol does slow performance somewhat.

● No. No results served over HTTPS (fastest performance).
● Use HTTPS when serving secure results, but not when serving public results. Only documents requiring credential authentication are served over HTTPS.
● Use HTTPS when serving both public and secure results. All documents, both public and secure, served over HTTPS.

To make a selection, click an option button and click the Save Setup button.

Client certificates for User Authentication

To use client certificates as authentication for confidential documents, go to the Administration > Certificate Authorities page and upload a Certificate Authority certificate and its Certificate Revocation List files. Then navigate to the Administration > SSL Settings page and check the Enable Client Certificate Authentication box.

Administration > License

You can see information about your current license, such as how many days remain before it expires, your Google Search Appliance and license ID numbers, the last day the license will be valid, the maximum number of collections and front ends your license allows, and the maximum number of pages permitted in the system.

When you purchase a new or additional license, you will receive a license key through email.

To install the license key:

1. Open the email message and save the attached license key to your local computer.
2. In the Admin Console, click Administration in the menu, then click the License link.
3. Click the Browse button to find and upload the license key file that you saved in step 1.
4. Click View New License to check the new license.
5. Click Accept and Install New License to complete the process.
6. Click Home to return to the Home page.

Administration > Import and Export

The Google Search Appliance provides the ability to export all configuration information in a single file, and import it back into the system. The exported file serves as a reliable backup of the configuration and can be used to replicate the same configuration across multiple appliances. The format of the exported file is suitable for use with a configuration management system and it can help in auditing the changes to the configuration over time. Sensitive information, such as user accounts and the crawl access list, is encrypted before being exported.

Caution: Please make sure to export your current configuration as a backup before you import another configuration file. This is for safety, in case importing the new configuration should fail in the middle, leaving the current configuration inconsistent. If this occurs, use the backup configuration to bring the system back to the previous state.

The configuration files contain the following information:

● Frontends
  ❍ Keymatches
  ❍ Synonyms
  ❍ Filters
  ❍ Remove URLs
  ❍ XSLT Stylesheet settings

● Collections
  ❍ entries in Include Content Matching the Following Patterns
  ❍ entries in Do Not Include Content Matching the Following Patterns
  ❍ URL patterns in the Pattern Tester Utility
  ❍ Required URLs entered in the Automatic Rollback section of the Index Rollback page

● General Parameters
  ❍ Start URLs
  ❍ URL patterns to follow and crawl
  ❍ URL patterns not to follow and crawl
  ❍ Encrypted user account passwords
  ❍ Encrypted Crawler Access usernames and passwords
  ❍ Duplicate hosts
  ❍ Host loads
  ❍ Proxy configuration
  ❍ Other

You can import configuration files for the Google Search Appliance and, of course, you can export configuration files to import later.

The import/export passphrase must be at least eight characters long.

To import a configuration:

1. Click Administration in the menu, then click the Import/Export link.
2. Enter a filename or click the Browse button to find the file on your network.
3. Enter the passphrase used for importing and exporting.
4. Click the Import Configuration button.

To export a configuration:

1. Click Administration in the menu, then click the Import/Export link.
2. Enter the passphrase used for importing and exporting.
3. Retype the passphrase to confirm.
4. Click the Export Configuration button.
5. Browse to a location for the file and click Save.

Administration > Reset Index

You can reset your crawling queues and delete your search index, removing all its contents. This feature is intended for use during initial configuration and experimentation. It is not recommended during production use, as it will eliminate all content and interrupt serving.

Caution: If you remove the index, it is deleted and irretrievable. The Google Search Appliance will need to recrawl your web servers to recreate your search index.

Clicking the Reset the Index Now button stops all crawling and serving processes. The Google Search Appliance becomes unavailable for about 15 minutes. Resetting the index retains the entries you have made in the Admin Console.

When the Google Search Appliance restarts, it will immediately begin crawling. If you do not wish it to crawl until you have changed some settings, before you reset the index, go to the Status and Reports > Crawl Status page, and click the Pause Crawl button. Then return to the Administration > Reset Index page and click the Reset the Index Now button.

Warning: If you are updating your software version and the machine is in Test Mode, do not use Reset Index.

Administration > Shutdown

The Shutdown link displays a page where you can shut the system down by clicking Shut the System Down Now. This page also contains information on powering down the system and restarting it.

Note: Clicking the Shut the System Down Now button will stop all serving and ongoing crawling.

You can restart operations by unplugging the power cable and plugging it back in. The Google Search Appliance resumes serving existing indices and the Admin Console is once again available. The system takes about ten minutes to resume normal operation.

More Information

Rules for Valid URL Patterns Crawling and Indexing Spelling & Stop Words Hexadecimal Notation Font Families Security and Error Handling XML Reference Index

Appendix A: Rules for Valid URL Patterns

When specifying the URLs that should/should not be crawled on your site or when building URL-based collections, your URLs must conform to the valid patterns listed below. For more information, see the URL Pattern How-To documentation on the Google Search Appliance Support site.

Valid URL Patterns Examples Explanation

Any substring of a URL that includes the host/path separating slash

http://www.google.com/ Any page on www.google.com using the HTTP protocol.

www.google.com/ Any page on www.google.com using any supported protocol.

google.com/ Any page in the google.com domain.

Any suffix of a string. You specify the suffix with the $ at the end of the string.

#home$ All pages ending with #home.

.pdf$ All pages with the extension .pdf.

Any prefix of a string. You specify the prefix with the ^ at the beginning of the string. A prefix can be used in combination with the suffix for exact string matches. For example, ^candy cane$ matches the exact string for "candy cane."

^http:// Any page using the HTTP protocol.

^https:// Any page using the HTTPS protocol.

^http://www.google.com/page.html$ Only the specified page.

An arbitrary substring of a URL. These patterns are specified using the prefix "contains".

contains:coffee Any URL that contains "coffee."

contains:beans Any URL that contains "beans."

Exceptions denoted by - (minus) sign.

candy.com/ -www.candy.com/ Means that "www.chocolate.candy.com" is a match, but "www.candy.com" is not a match.

Regular expressions from the GNU Regular Expression library. Regular expressions: (1) are case insensitive (unless you specify "regexpCase:") (2) must use two escape characters (backslashes "\\") when reserved characters are added to the regular expression

regexp:-sid=[0-9A-Z]+/

regexp:http://www\\.example\\.google\\.com/.*/images/

See the GNU Regular Expression library.

Comments

#this is a comment Empty lines and comments starting with # are permissible. These comments are removed from the URL pattern and ignored.
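The prefix, suffix, "contains:", and exception rules in the table can be sketched in a few lines of Python. This is a simplified illustration, not the appliance's matcher: the function names `matches` and `url_allowed` are hypothetical, plain patterns are treated as simple substring tests (an assumption; the appliance may be host-aware), and regular expressions and comment lines are left out.

```python
def matches(pattern: str, url: str) -> bool:
    """Test one pattern against a URL using the simplified rules above."""
    if pattern.startswith("contains:"):
        # Arbitrary substring of the URL.
        return pattern[len("contains:"):] in url

    anchored_start = pattern.startswith("^")
    anchored_end = pattern.endswith("$")
    if anchored_start:
        pattern = pattern[1:]
    if anchored_end:
        pattern = pattern[:-1]

    if anchored_start and anchored_end:
        return url == pattern            # exact string match
    if anchored_start:
        return url.startswith(pattern)   # prefix match
    if anchored_end:
        return url.endswith(pattern)     # suffix match
    return pattern in url                # assumption: plain substring test


def url_allowed(patterns: list[str], url: str) -> bool:
    """A URL is allowed if it matches a pattern and no '-' exception."""
    included = False
    excluded = False
    for raw in patterns:
        pattern = raw.strip()
        if not pattern:
            continue  # empty lines are permissible and ignored
        if pattern.startswith("-"):
            excluded = excluded or matches(pattern[1:], url)
        else:
            included = included or matches(pattern, url)
    return included and not excluded
```

With the patterns `candy.com/` and `-www.candy.com/`, this sketch accepts `http://www.chocolate.candy.com/` and rejects `http://www.candy.com/`, which mirrors the exception example in the table.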

Appendix B: Crawling and Indexing

The gsa-crawler begins crawling from the URLs you specify on the Crawl and Index > Crawl URLs page. When a page is crawled, it is automatically indexed. Periodically, the index is checked against all of your serving prerequisites. If the index doesn't satisfy the prerequisites, the serving index rolls back to a previous snapshot, based on the rollback strategy that you chose. In other words, if the index conforms to the prerequisites, then it is served. If the index does not conform to the prerequisites, then the current index continues serving results.

File Size Limits

Some limits are imposed on crawling large HTML and non-HTML files.

For HTML files, Google crawls up to a size of 2.5 MB, then discards the remainder of the file.

For non-HTML files, Google crawls files up to 30 MB in size. Files larger than 30 MB are discarded and not crawled. Non-HTML files are converted to HTML, and then the first 2 MB of the converted HTML file is indexed. The remainder is discarded.
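As a rough illustration of these limits, the following Python sketch computes how many bytes of a file would be kept. The function names are hypothetical, and the sketch assumes 1 MB means 1024 × 1024 bytes, which the text does not specify.

```python
MB = 1024 * 1024  # assumption: the text does not define MB precisely

HTML_CRAWL_LIMIT = int(2.5 * MB)     # HTML beyond this is discarded
NON_HTML_CRAWL_LIMIT = 30 * MB       # larger non-HTML files are not crawled
CONVERTED_INDEX_LIMIT = 2 * MB       # portion of converted HTML that is indexed


def html_bytes_kept(file_size: int) -> int:
    """Bytes of an HTML file that are crawled before the rest is discarded."""
    return min(file_size, HTML_CRAWL_LIMIT)


def non_html_bytes_indexed(file_size: int, converted_html_size: int) -> int:
    """Bytes of converted HTML indexed for a non-HTML file of the given size."""
    if file_size > NON_HTML_CRAWL_LIMIT:
        return 0  # file is discarded and not crawled
    return min(converted_html_size, CONVERTED_INDEX_LIMIT)
```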

Crawling Frames and Framesets

To crawl framesets and their nested frames, the framesets must be well-formed with the <frame> tags occurring within the <frameset> tag. Anchors (links) must also occur within the frameset. Depending on the particular structure of your site, the search results may point to the frameset page itself or to the individual frame pages. There is currently no way to specify which behavior you prefer.

Types of Files the Google Search Appliance Crawls and Indexes

Note: Encrypted, viewable PDF documents are converted to HTML for indexing; however, the HTML is not displayed.

Word Processing Formats

Adobe FrameMaker mif Version 6.0

ASCII Text (7 & 8 bit) txt All versions

ANSI Text (7 & 8 bit) ans All versions

DEC WPS Plus dx Versions through 4.0

DEC WPS Plus wpl Versions through 4.1

DisplayWrite 2 & 3 txt All versions

DisplayWrite 4 & 5 doc Versions through Release 2.0

Enable wpf Versions 3.0, 4.0 and 4.5

First Choice pfc Versions through 3.0

Framework Version 3.0

HTML html, htm Versions through 3.0 (some limitations)

IBM FFT fft All versions

IBM Revisable Form Text rft All versions

IBM Writing Assistant iwa Version 1.01

JustSystems Ichitaro jaw, jbw Versions 5.0, 6.0, 8.0-13.0, 2004

JustWrite jw Versions through 3.0

Legacy leg Versions through 1.1

Lotus AMI/AMI Professional sam Versions through 3.1

Lotus Manuscript doc Versions through 2.0

Lotus WordPro (Windows only) lwp Versions 96 through Millennium 9.6, text only

Lotus WordPro (Text only on UNIX) lwp SmartSuite 97 and Millennium

MacWrite II mcw, mw, mwii Version 1.1

MASS11 m11 Versions through 8.0

Microsoft Rich Text Format rtf All versions

Microsoft Word for DOS doc Versions through 6.0

Microsoft Word for Macintosh doc Versions 4.0 through 98

Microsoft Word for Windows doc Versions through 2003

Microsoft WordPad rtf, doc All versions

Microsoft Works for DOS wks, wps Versions through 2.0

Microsoft Works for Macintosh wks, wps Versions through 2.0

Microsoft Works for Windows wks, wpf Versions through 4.0

Microsoft Write wri Versions through 3.0

MultiMate dox Versions through 4.0

Navy DIF dif All versions

Nota Bene nb Version 3.0

Novell Perfect Works Version 2.0

Novell WordPerfect for DOS Versions through 6.1

Novell WordPerfect for Mac Versions 1.02 through 3.0

Novell/Corel WordPerfect for Windows Versions through 12.0

Office Writer ow4 Version 4.0 to 6.0

PC-File Letter ltr Versions through 5.0

PC-File+ Letter ltr Versions through 3.0

PFS:Write pfb Versions A, B, and C

Professional Write for DOS pw Versions through 2.1

Professional Write Plus pw, pwp Version 1.0

Q&A for DOS qa, qw, dtf Version 2.0

Q&A Write for Windows dtf Version 3.0

Samna Word sam, sm Versions through Samna Word IV+

SmartWare II smt Version 1.02

Sprint spr Version 1.0

Text Mail (MIME) No specific version

Total Word tw Version 1.2

Unicode Text txt All versions

Volkswriter 3 & 4 vw Versions through 1.0

Wang PC iwp Versions through 2.6

WML Version 5.2

WordMARC wmc Versions through Composer Plus

WordStar 2000 for DOS ws1, ws2, ws3 Versions through 3.0

WordStar for DOS ws Versions through 7.0

WordStar for Windows ws, wst, wsd Version 1.0

XyWrite xy3, xyp, xyw Versions through III Plus

Spreadsheet Formats

Enable 300, wpf, ssf, dbf Versions 3.0, 4.0 and 4.5

First Choice ss, fol Versions through 3.0

Framework fw3 Version 3.0

Lotus 1-2-3 (DOS & Windows) wku, wk1, wk2, wk3, wk4, wk5, wki, wks Versions through 5.0

Lotus 1-2-3 for SmartSuite wku, wk1, wk2, wk3, wk4, wk5, wki, wks SmartSuite 97 and Millennium

Lotus 1-2-3 Charts (DOS & Windows) wku, wk1, wk2, wk3, wk4, wk5, wki, wks Versions through 5.0

Lotus 1-2-3 (OS/2) wku, wk1, wk2 Versions through 2.0

Lotus 1-2-3 Charts (OS/2) wku, wk1, wk2 Versions through 2.0

Lotus Symphony wr1 Versions 1.0, 1.1 and 2.0

Microsoft Excel for Macintosh xls Versions 3.0 through 4.0, 98

Microsoft Excel for Windows xls, xlw Versions 2.2 through 2000

Microsoft Excel Charts xlc Versions 2.x through 7.0

Microsoft Multiplan col, cod, mod Version 4.0

Microsoft Works for Windows wps, wks Versions through 4.0

Microsoft Works (DOS) wps, wks, wdb, wcm Versions through 2.0

Microsoft Works (Macintosh) wps, wks, wdb, wcm Versions through 2.0

Mosaic Twin wku Version 2.5

Novell Perfect Works Version 2.0

PFS:Professional Plan tid Version 1.0

QuattroPro for DOS wkq, wq1 Versions through 5.0

QuattroPro for Windows wb1, wb2, wk3 Versions through 9.0

SmartWare II def, smt Version 1.02

StarOffice Calc for Windows and UNIX text only Version 5.2

SuperCalc 5 cal Version 4.0

VP Planner 3D np Version 1.0

Database Formats

Access mdb Versions through 2.0

DBASE dbf Versions through 5.0

DataEase dba, dbm, dql Version 4.x

dBXL dbf Version 1.3

Enable 300, wpf, ssf, dbf Versions 3.0, 4.0 and 4.5

First Choice pfc Versions through 3.0

FoxBase fmt, dbt, fox, inx, dbf Version 2.1

Framework fwk, fw, fw2, fw3 Version 3.0

Microsoft Works (DOS) wdb, wks Versions through 2.0

Microsoft Works (Macintosh) wdb, wks Versions through 2.0

Microsoft Works for Windows wdb, wks, dbf Versions through 4.0

Paradox (DOS) fsl, db, px Versions through 4.0

Paradox (Windows) fsl, db, px Versions through 1.0

Personal R:BASE rbf Version 1.0

R:BASE 5000 rbf, dbf Versions through 3.1

R:BASE System V rbf Version 1.0

Reflex r2d Version 2.0

Q & A qa, qw, dtf Versions through 2.0

SmartWare II db, def, smt Version 1.02

Graphics Formats

Adobe FrameMaker Graphics fmv Vector/raster through 5.0

Adobe Illustrator File Format ai Versions through 7.0, 9.0

Adobe Photoshop File Format psd Version 4.0

Adobe Portable Document Format pdf Versions 2.1, 3.0 - 6.0, Japanese

Ami Draw Format sdw Ami Draw

AutoCAD Interchange and Native Drawing Format dxf, dwg

AutoCAD Drawing Format dwg Versions 2.5-2.6, 9.0-14.0, 2000i and 2002

AutoShade Rendering Format rnd Version 2

Binary Group 3 Fax All Versions

Bitmap Format bmp, rle, ico, cur, OS/2 dib & warp Windows

CALS Raster Format gp4 Type I and Type II

Corel Clipart Format cmx Versions 5 through 6

Corel Draw cdr Versions 3.x through 8.x

Corel Draw cdr with tiff header Versions 2.x through 9.x

Computer Graphics Metafile cgm ANSI, CALS NIST versions 3.0

Encapsulated PostScript eps tiff header only

GEM Paint img No specific version

Graphics Environment Manager gem Bitmap and vector

Graphics Interface Format gif No specific version

Hewlett Packard Graphics Language hpgl Version 2

IBM Graphics Data Format gdf Version 1.0

IBM Picture Interchange Format pif Version 1.0

Initial Graphics Exchange Specification iges Version 5.1

JFIF (jpeg not in tiff format) jfif All Versions

JPEG (including EXIF) jpeg All versions

Kodak Flash Pix fpx All versions

Kodak Photo CD pcd Version 1.0

Lotus PIC pic All versions

Lotus Snapshot All versions

Macintosh PICT1 and PICT2 pict Bitmap only

MacPaint pntg No specific version

Micrografx Draw drw Versions through 4.0

Micrografx Designer drw Versions through 3.1

Micrografx Designer dsf Windows 95, Version 6.0

Novell PerfectWorks draw Version 2.0

OS/2 PM Metafile met Version 3.0

Paint Shop Pro 6 psp Version 5.0 - 6.0

PC Paintbrush pcx, dcx All versions

Portable Bitmap pbm All versions

Portable Graymap pgm No specific version

Portable Network Graphics png Version 1.0

Portable Pixmap ppm No specific version

PostScript ps Level 2

Progressive JPEG jpeg No specific version

Sun Raster srs No specific version

TIFF tiff Versions through 6

TIFF CCITT Group 3 & 4 tiff Versions through 6

Truevision TGA targa Version 2

Visio (preview) Version 4

Visio Version 5, 2000 - 2003

WBMP No specific version

Windows Enhanced Metafile emf No specific version

Windows Metafile wmf No specific version

WordPerfect Graphics wpg, wpg2 Versions through 2.0, 7 and 10

X-Windows Bitmap xbm x10 compatible

X-Windows Dump xdm x10 compatible

X-Windows Pixmap xpm x10 compatible

Presentation Formats

Corel/Novell Presentations shw Versions through 12.0

Harvard Graphics for DOS hgs, cht, ch3, prs Versions 2.x & 3.x

Harvard Graphics for Windows hgs, cht, ch3, prs Windows versions

Freelance for Windows flw, shw, drw, pre Versions through Millennium

Freelance for OS/2 flw, shw, drw, pre Versions through 2.0

Microsoft PowerPoint for Windows ppt Versions 3.0 through 2003

Microsoft PowerPoint for Macintosh ppt Versions 4.0, 98 through 2004

StarOffice Impress for Windows and UNIX text only Versions 5.2

Compressed Formats

UNIX Compress Z No specific version

UNIX TAR tar No specific version

ZIP zip PKWARE versions through 2.04g

Other Formats

Microsoft Outlook Express eml No specific version

Microsoft Outlook Message msg Text only

Microsoft Project Windows 98, text only

vCard Version 2.1

Appendix C: Spelling & Stop Words

Spell Checker

The spell checker uses data from the documents crawled by your appliance to make spelling suggestions. Periodically, the spell server will explore your index to update its database. The spell server updates are automatic, so no configuration is required from the server administrator.

A single spelling suggestion is returned with the results for queries where the spell checker has detected a possible misspelling. Spelling suggestions are enabled by default. The administrator may choose not to display the suggested spelling field in the XSLT Stylesheet for any front end.

The spell checker is context sensitive. For example, if the query submitted is "gail divers," "gail devers" is suggested as an alternative query. However, "scuba divers" would not return an alternate query suggestion.

Note: The example is specific to google.com's spell checker. Your spell checker might perform differently.

The spell checker is disabled when a query contains special query terms, such as inurl:, allintitle:, and so on.

Note: Currently, the spell checker supports only US English and cannot be manually edited.

Stop Words

Stop words are words or single letters that are usually not searched during a query because they are so common. Most stop words are prepositions, pronouns, and articles. However, common words that are part of a connected phrase, such as "i/o" or "site.com," are not ignored. Also, if a query is in quotations, stop words within the query are included in the search.

Note: A notification that a stop word has been taken out of consideration for a query appears in the top line of the search results.

Appendix D: Hexadecimal Notation

Colors on web pages are specified in HTML using hexadecimal notation or names.

Using hexadecimal notation, the first way to express a color is to specify the amount of each of the three primary colors to mix. By specifying the red, green, and blue components, you can construct any color.

Each pair of digits in the 6-digit hexadecimal code represents one color component of the final color.

● #XXxxxx - Red Color Value
● #xxXXxx - Green Color Value
● #xxxxXX - Blue Color Value

The amount of each color is specified as two hexadecimal digits. That means that none of the color is 00 and all of the color is FF. There are 216 colors that look the same on every system that is displaying at least 256 colors. These colors are made up of 0%, 20%, 40%, 60%, 80%, and 100% of each color combined with all those amounts of each other color. For example, if you combine 20% red and 80% green and 40% blue, you get dark green with some blue and a touch of red.

The hexadecimal values that are equivalent to those percentages are shown in this table.

Percentage of Color Hexadecimal Value
0 00
20 33
40 66
60 99
80 CC
100 FF

To use those hexadecimal values to produce a color, you precede them with the # sign to show that they are in hexadecimal. That means, for example, that to get the color that is 20% red and 80% green and 40% blue you write #33CC66. You can use colors other than the 216 safe ones, but they may not look the same on all browsers.
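The conversion from percentages to a color code can be checked with a short Python sketch. The function names `percent_to_hex` and `color_code` are illustrative only and are not part of the appliance.

```python
def percent_to_hex(percent: int) -> str:
    """Convert a percentage (0-100) of one color into two hexadecimal digits."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    # 100% corresponds to FF (255), so scale the percentage onto 0..255.
    return format(round(percent * 255 / 100), "02X")


def color_code(red: int, green: int, blue: int) -> str:
    """Build a #RRGGBB code from red, green, and blue percentages."""
    return "#" + "".join(percent_to_hex(p) for p in (red, green, blue))


print(color_code(20, 80, 40))  # prints #33CC66
```

The six percentages used for the 216 safe colors (0, 20, 40, 60, 80, 100) map to 00, 33, 66, 99, CC, and FF under this calculation.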

Here are the codes for commonly used colors.

Color Hexadecimal Code
White #FFFFFF
Black #000000
Blue #0000FF
Green #008000
Red #FF0000
Yellow #FFFF00
Aqua #00FFFF
Gray #BEBEBE
Lime #00FF00
Navy #000080
Purple #800080
Silver #C0C0C0
Maroon #800000
Olive #808000
Teal #008080

You can read more about colors on web pages and hexadecimal notation here.

http://www.w3.org/MarkUp/Guide/Style.html

Appendix E: Font Families

Rather than specify fonts for web pages, it is good practice to specify Font Families so that reasonable fonts are presented, no matter what fonts are on a user's computer.

Not all browsers display text the same way. You will want to test your results on all common browsers.

To set the font, enter the name of the font you want, such as Arial, FranklinGothic, or Helvetica. In addition, you can specify a generic font such as serif, sans-serif, cursive, fantasy, or monospace. You can list choices, separated by commas; the browser will attempt to provide the first choice if it exists on the computer, then will try the second choice, and so on. Put the generic font last.

For example, in the Stylesheet:

<xsl:variable name="global_font">verdana,sans-serif</xsl:variable>

You can read more about fonts and Font Families here.

http://www.w3.org/MarkUp/Guide/Style.html
http://www.w3.org/TR/REC-CSS2/fonts.html#font-family-prop

Appendix F: Security and Error Handling

You can permit your users to directly connect to the Google Search Appliance to retrieve search results. In some circumstances, however, you may find advantages to placing a system in front of the Google Search Appliance.

This system can provide additional functions that are not part of search, yet may be considered useful when running a network service. Here are two benefits that the additional system can provide.

Firewall capabilities

If you isolate the Google Search Appliance behind a firewall, you can selectively block access. Here are some reasons you may want to do this:

● Block access to the Admin Console on port 8000, so that users can only get to the Admin Console on port 8443 (which uses HTTPS).
● Restrict access to the Google Search Appliance based on end users' IP addresses.
● Prevent a Denial of Service attack.

Error handling

The Google Search Appliance is designed to correct its own problems. In rare cases, however, users can get an error from a search request. You can control how these errors are presented to the user with a script that runs on your web server. It works like this:

● Users send a search request to the script.
● The script formats the request and sends it to the Google Search Appliance.
● The system sends the response back to the web server, which processes the results before sending them to the user.

Here are some example strategies for handling errors in a script.

● If the HTTP status code of the response is 200, no error has occurred. Send the results back to the user.
● If the HTTP status code is 500, then an unexpected error has occurred. The script can retry the search request or send an error to the user.
● If the HTTP status code is 404, the user has requested a URL that does not exist. Send an appropriate error message to the user.
● Set a timeout in your script. If the system does not respond within the specified time, the script can attempt to ping the Google Search Appliance. If ping fails, then send an error to the user. If ping succeeds, then retry the search request once more. If that fails, send an error to the user.
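The strategies above could be combined in a script along the following lines. This is a minimal sketch rather than a reference implementation: `handle_search`, `fetch`, and `ping` are hypothetical names, `fetch` is assumed to return an HTTP status code and body and to raise `TimeoutError` when the appliance does not answer in time, and the error messages are placeholders.

```python
from typing import Callable, Tuple


def handle_search(
    fetch: Callable[[], Tuple[int, str]],
    ping: Callable[[], bool],
) -> Tuple[bool, str]:
    """Run a search request and return (ok, results_or_error_message)."""
    for _attempt in range(2):  # the original request plus one retry
        try:
            status, body = fetch()
        except TimeoutError:
            if not ping():
                # The appliance is unreachable: report an error immediately.
                return False, "Search is temporarily unavailable."
            continue  # ping succeeded: retry the search request once more
        if status == 200:
            return True, body  # no error: pass the results to the user
        if status == 404:
            return False, "The requested page does not exist."
        if status == 500:
            continue  # unexpected error: retry the search request
        return False, f"Unexpected response status {status}."
    return False, "The search request failed. Please try again later."
```

In a real script, `fetch` would wrap whatever HTTP client the web server uses and translate that client's timeout into `TimeoutError`. Passing the two operations in as callables keeps the retry logic easy to test without a live appliance.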

Google XML Reference

Revised May 2005

Google has developed a simple HTTP-based protocol for serving search results. Search administrators have complete control over how search results are requested and presented to the end user. This document describes the technical details of Google search request and results formats. It assumes that the reader has basic understanding of the HTTP protocol and the HTML document format.

Contents

1. Overview

2. Request Format
2.1 Request Overview
2.2 Search Parameters
2.3 Query Terms
2.4 Filtering
2.5 Internationalization
2.6 Sorting
2.7 Meta Tags
2.8 Limits

3. Results Format
3.1 Custom HTML
3.1.1 Custom HTML Output Overview
3.1.2 Internationalization
3.2 XML
3.2.1 XML Output Overview
3.2.2 Character Encoding Conventions
3.2.3 Google XML Results DTD
3.2.4 Google XML Tag Definitions

Appendices
Appendix A: Estimated vs. Actual Number of Results
Appendix B: URL Escaping
Glossary

1. Overview

A Google search request is a simple HTTP request to the Google search engine. The search request format and options available are detailed in the Request Format section.

The search results are returned in the output format specified in the search request. Currently, Google supports output results in XML and HTML format. XML formatted results give you the power to customize the display of the results through the implementation of a custom XML parser. The HTML results can be customized through the application of an XSL stylesheet to the standard XML results.

2. Request Format

This section is broken into the following categories:

● Request Overview
● Search Parameters
● Query Terms
● Filtering
● Internationalization
● Sorting
● Meta Tags
● Limits

2.1 Request Overview

Using the Google search protocol is as simple as requesting a page from a web server. The Google search request is a standard HTTP GET command, which returns results in either XML or HTML format as specified in the search request. The search request is a URL combining the search engine host name, port and path; as well as a collection of name-value pairs (input parameters) separated by & characters. Some examples are listed below. Explanations of input parameters and output results can be found in the remainder of this document.

Note: Google recommends performing an HTTP version 1.0 (or later) GET command.

Note: To determine which host name and port to send your search requests to, please review your specific configuration documentation. The path to send your search requests to is always "/search".

Examples

The query GET /search?q=bill+material&output=xml&client=test&site=operations would return the first 10 results matching the query "bill material" in the "operations" collection in the Google XML output format.

The query GET /search?q=bill+material&start=10&num=5&output=xml_no_dtd&proxystylesheet=test&client=test&site=operations would return results numbering 11-15 matching the query "bill material" in the "operations" collection, returned in the Google custom HTML output format by applying the XSL stylesheet associated with the "test" front end to the standard XML results.

The query GET /search?q=Star+Wars+Episode+%2BI&output=xml_no_dtd&lr=lang_de&ie=latin1&oe=latin1&client=test&site=movies&proxystylesheet=test would return the first 10 German results matching the query "Star Wars Episode +I" in the "movies" collection returned in the Google custom HTML output format by applying the XSL stylesheet associated with the "test" front end to the standard XML results.
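Request paths like the ones above can be assembled programmatically so that query values are URL-escaped consistently. The following Python sketch uses the standard library's `urllib.parse.urlencode`; the helper name `build_search_path` is hypothetical, and the parameter values are taken from the examples.

```python
from urllib.parse import urlencode


def build_search_path(query: str, client: str, site: str,
                      output: str = "xml", **extra: object) -> str:
    """Build the path and query string for a search request to /search."""
    params = {"q": query, "output": output, "client": client, "site": site}
    params.update(extra)  # optional parameters such as start, num, or lr
    # urlencode escapes each value: spaces become "+", and "+" becomes "%2B".
    return "/search?" + urlencode(params)


print(build_search_path("bill material", client="test", site="operations"))
# prints /search?q=bill+material&output=xml&client=test&site=operations
```

Prefixing the result with the appliance's host name and port, which depend on your configuration, gives the full request URL.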

2.2 Search Parameters

This table lists all the valid name-value pairs that can be used in a search request and descriptions of how these parameters will modify the search results.

Name Description Default Value

access

Defines whether the user is searching public content or all content (i.e. public and secure). This parameter takes effect only if Secured Content Search capability is enabled. The access parameter can have one of these possible values:

p - search public content
s - search secure content
a - search all content, both public and secure

The access parameter defaults to "p" if none is provided. Note: Secured Content Search is automatically enabled for clustered appliances.

p

as_dt

Modifies the as_sitesearch parameter as follows:

Value Modification

i Include only results in the web directory specified by as_sitesearch

e Exclude all results in the web directory specified by as_sitesearch

i

as_epq

Adds an additional search query term to search for the phrase specified. This parameter has the same effect as the phrase special query term. Note: New query terms specified will be combined with q query terms to generate search results. Note: The value specified for this parameter must be URL-escaped.

Empty string

as_eq

Adds additional search query terms to exclude any of the terms specified. This parameter has the same effect as the exclude (-) special query term. Note: New query terms will be combined with q query terms to generate search results. Note: The value specified for this parameter must be URL-escaped.

Empty string

as_lq

Additional search query term to show any pages which link to the specified URL. This parameter has the same effect as the link special query term. Note: No other query terms can be specified when using this special query term. Note: The value specified for this parameter must be URL-escaped.

Empty string

as_occt

Additional search query term to specify where the search terms occur on the page: anywhere on the page, in the title, or in the URL. Note: Query terms specified will be combined with q query terms to generate search results. Note: The value specified for this parameter must be URL-escaped.

Value Meaning

any anywhere on the page

title in the title of the page

URL in the URL for the page

Empty string

as_oq

Adds additional search query terms to find any of the terms specified. This parameter has the same effect as the OR special query term. Note: New query terms will be combined with q query terms to generate search results. Note: The value specified for this parameter must be URL-escaped.

Empty string

as_q

Search query terms as entered by the user. (See Query Terms section for additional query features.) Note: Query terms specified will be combined with q query terms to generate search results. Note: The value specified for this parameter must be URL-escaped.

Empty string

as_sitesearch

Additional search query term to show links in the specified web directory or to exclude those links, depending on the value of as_dt. This parameter has the same effect as the site special query term. When the Google Search Appliance is sent a search request that includes the as_sitesearch parameter, it converts the value of the parameter into an argument to the site special query term and appends it to the value of q in the search results. For example, if your search contains the following parameters:

q=mycompany&as_sitesearch=www.mycompany.com

The raw XML of your search results will contain the following:

<q>mycompany site:www.mycompany.com</q>

The default XSLT stylesheet displays the value of the q tag in the search box on the results page. Consequently, using an as_sitesearch parameter will appear to change the user's search query. If the parameter and value as_dt=e are specified, -site: is appended to the end of the query term. Note: The value specified for this parameter must be URL-escaped.

Empty string

client A string indicating any valid front end REQUIRED

filter Activates or deactivates automatic results filtering performed by Google search. By default, filtering is applied to all Google results returned to improve results quality. (See Automatic Filtering section for more details.)

1

getfields Requests that the names and values of the meta tags specified be returned with each search result, when available. (See Meta Tags section for more details.) Note: All meta tag names or values specified must be double URL-escaped.

Empty string

ie Input Encoding: Sets the character encoding used to interpret the query string. (See Internationalization section for details.)

latin1

lr Language Restrict: Restricts searches to pages in the specified language. (See Language Filters section for more details.)

Empty string

num Number of results desired per request. The maximum allowable value is 100. (The maximum number of results available for a query is 1,000.) See also start. Note: The actual number of results may be smaller than the requested value.

10

numgm Number of KeyMatch results to return with the results. A value between 0 and 5 (inclusive) can be specified for this option. 3

oe Output Encoding: Sets the character encoding used to encode the results returned. (See Internationalization section for details.)

UTF8

output

Select the format of the search results. Valid formats are:

Value Output Format

xml_no_dtd XML results or custom HTML (See proxystylesheet parameter for details.)

xml XML results with Google DTD reference. If using this value, proxystylesheet must be omitted from the parameters or must be set to an empty string.

REQUIRED

partialfields Restricts the search results to documents with meta tags whose values contain the words or phrases specified. (See Meta Tags section for more details.) Note: All meta tag names or values specified must be double URL-escaped.

Empty string

proxycustom

Custom XML tags to be included in the XML results. The only permitted values for this parameter are either <HOME/>, <ADVANCED/>, or <TEST/>. (See the Custom HTML output section for more details.) Note: This parameter is disabled if the search request does not contain the proxystylesheet tag. Note: If custom XML is specified, search results will not be returned with the search request. Note: Custom XML must be URL-escaped.

Empty string

proxyreload A value of 1 indicates that the Google Search Appliance should update the XSL stylesheet cache to refresh the stylesheet currently being requested. This parameter is optional. The XSL stylesheet cache is updated approximately every 15 minutes. (See the Custom HTML section for more details.)

0

proxystylesheet

If the value of the output parameter is xml_no_dtd, then the output format is modified by the proxystylesheet value as follows:

Proxystylesheet Value Output Format

Omitted XML results

Empty XML results have a content-type of text/html (rather than text/xml), because the XML results are not transformed.

Front End Name Custom HTML results through application of the XSL stylesheet associated with the specified front end

(See the Custom HTML section for more details.) Note: This parameter may also specify the identifier of a valid collection. The default XSL stylesheet associated with that collection will then be used for custom HTML output. Note: The value specified for this parameter must be URL-escaped.

NA

q

Search query as entered by the user. (See Query Terms section for additional query features.)

Note: The value specified for this parameter must be URL-escaped. Empty string

requiredfields Restricts the search results to documents that contain exact meta tag names or name-value pairs specified. (See Meta Tags section for more details.) Note: All meta tag names or values specified must be double URL-escaped.

Empty string

site The name of a collection. Note that you can search over multiple collections using the properly escaped OR (pipe character) to separate the collection names. REQUIRED

sitesearch

Additional search query term to show links in the specified web directory. Requires that a value for q (query) be submitted as well. (The value of as_dt does not modify the effect of the sitesearch parameter.)

This parameter has the same effect as the site special query term. Note: The sitesearch and as_sitesearch parameters differ in how they are returned in the XML results. The sitesearch parameter is not appended to the search query in the results. That is, the original query term will not be modified when you use the sitesearch parameter. Note: The value specified for this parameter must be URL-escaped.

Empty string

sort Indicates an alternate sorting method. (See Sorting section for sort parameter format and details.) Note: Only date sort is currently supported.

Empty string

start Use this parameter to support result set page navigation. The maximum number of results available for a query is 1,000; that is, the value of the start parameter added to the value of the num parameter cannot exceed 1,000. See also num. 0
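As a rough illustration of how the parameters in the table above combine, the following Python sketch assembles a search URL and steps through result pages. The host name search.example.com and the function names are assumptions made for this example; only the parameter names and the limits (num at most 100, start plus num at most 1,000) come from the table. Treating num=0 as invalid is also an assumption.

```python
from urllib.parse import urlencode

MAX_NUM = 100        # largest num value allowed per request
MAX_RESULTS = 1000   # start + num cannot exceed this


def build_search_url(host, query, site, client, start=0, num=10,
                     output="xml_no_dtd"):
    """Return a search URL; urlencode() URL-escapes every value."""
    # Assumption: at least one result must be requested.
    if not 1 <= num <= MAX_NUM:
        raise ValueError("num must be between 1 and 100")
    if start < 0 or start + num > MAX_RESULTS:
        raise ValueError("start + num cannot exceed 1,000")
    params = {
        "q": query, "site": site, "client": client,
        "output": output, "start": start, "num": num,
    }
    return "http://" + host + "/search?" + urlencode(params)


def page_starts(num=10):
    """Yield each start offset that keeps start + num within the limit."""
    start = 0
    while start + num <= MAX_RESULTS:
        yield start
        start += num


url = build_search_url("search.example.com", "customer query",
                       "collection", "collection")
print(url)
```

Passing each value of page_starts(num) as the start argument walks the available result set one page at a time.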

Custom Parameters

If any custom parameters that contain spaces are added to the search URL, the space characters will be replaced by an underscore (_). For example:

http://search.customer.com/search?q=customer+query&site=collection&client=collection&output=xml_no_dtd&newparam=test+this

This URL adds the custom parameter newparam with a value of "test+this." For security reasons, all space characters (represented as a "+") in the custom parameter newparam will be replaced by "_" characters, while built-in variables, such as q, will not be affected.

The resulting XML will look like this:

<PARAM name="q" value="customer query" original_value="customer+query"/> <PARAM name="newparam" value="test_this" original_value="test+this"/>

The unmodified value can still be retrieved from the original_value attribute.
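The rewrite described above can be imitated in a few lines of Python. This is only a sketch of the documented behavior, not the appliance's own code: the function name param_tags and the short BUILT_IN set are assumptions, and a real request has more built-in parameters than the four listed here.

```python
from urllib.parse import unquote_plus, urlsplit
from xml.sax.saxutils import quoteattr

# Assumption: a partial list of built-in parameters, for illustration only.
BUILT_IN = {"q", "site", "client", "output"}


def param_tags(url):
    """Build PARAM tags, replacing spaces with '_' in custom values only."""
    tags = []
    for pair in urlsplit(url).query.split("&"):
        name, _, original = pair.partition("=")
        value = unquote_plus(original)       # "+" decodes to a space
        if name not in BUILT_IN:
            value = value.replace(" ", "_")  # custom parameter: rewrite spaces
        tags.append("<PARAM name=%s value=%s original_value=%s/>"
                    % (quoteattr(name), quoteattr(value), quoteattr(original)))
    return tags


tags = param_tags(
    "http://search.customer.com/search?q=customer+query&site=collection"
    "&client=collection&output=xml_no_dtd&newparam=test+this")
print(tags[0])
print(tags[-1])
```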

2.3 Query Terms

Default Search

By default, Google only returns pages that include all of your search terms. There is no need to include "AND" between terms. Keep in mind that the order in which the terms are typed will affect the search results. To restrict a search further, just include more terms.

Google ignores common words and characters such as "where" and "how," as well as certain single digits and single letters, because they tend to slow down your search without improving the results. Google will indicate if a common word has been excluded by including text in the search comments field of the search results returned.

Special Characters

By default, all non-alphanumeric characters that are included in a search query are treated as query term separators (just like space characters).

The only exceptions to this rule are the following characters: double quote mark ("), plus sign (+), minus sign (hyphen) (-) and ampersand (&). The ampersand character (&) is treated as another character in the query term in which it is included, while the remaining exception characters correspond to search features listed in the section below.

Special Query Terms

Google supports the use of several special query terms that allow the user or search administrator to access additional capabilities of the Google search engine. These special query terms are listed below.

Note: All query terms must be correctly URL-escaped in the search request sent to Google search.

Special Query Capability Sample Usage Description

Include Query Term Star Wars Episode +I If a common word is essential to getting the results you want, you can include it by putting a "+" sign in front of it.

Exclude Query Term bass -music

Sometimes what you're searching for has more than one meaning. For example, the term "bass" can refer to either fishing or music. You can exclude a word from your search by putting a minus sign ("-") immediately in front of the term you want to exclude from the search results.

Note: The search request parameter, as_eq, can also be used to submit terms to exclude.

Phrase Search "yellow pages"

Search for complete phrases by enclosing them in quotation marks or connecting them with hyphens. Words marked in this way will appear together in all results exactly as you have entered them. Phrase searches are especially useful when searching for famous sayings or proper names.

Note: The search request parameter, as_epq, can also be used to submit a phrase search.

Boolean OR Search vacation london OR paris

Google search supports the Boolean "OR" operator. To retrieve pages that include either word A or word B, use an uppercase OR between terms.

Note: The search request parameter, as_oq, can also be used to submit a search for any term in a set of terms.

Directory Restricted Search

Domain search examples: site:www.google.com site:google.com site:com

Directory search examples: admission site:www.stanford.edu/group/uga site:www.google.com/about/ site:www.google.com/about

To search a domain, specify a partial string that matches complete name segments from the end of the canonical host name.

To search a particular directory on a web server (including root), you must specify the complete canonical name of the host server followed by the path of the directory. The string must have a "/" character after the host name to limit searches to a single server/directory. The path segments searched must be a complete match, because there is no partial path segment matching. Enter the query followed by "site:" followed by the host name and path of the web directory. If the ("/") character is at the end of the web directory path specified, then only files within that directory will be searched and files in sub-directories will not be considered.

Note: The exclusion operator ("-") can be applied to this query term to remove a web directory from consideration in the search. Note: Only one "site:" search term per search request is supported at this time. Note: The search request parameters, as_sitesearch and as_dt, can also be used to submit "site:" and "-site:" search terms.

Title Search (term) intitle:Google search

If you prepend "intitle:" to a query term, Google search will restrict the results to documents containing that word in the title. The query term must appear in the first 10 words of the title. Note there can be no space between the "intitle:" and the following word.

Note: Putting "intitle:" in front of every word in your query is equivalent to putting "allintitle:" at the front of your query.

Title Search (all) allintitle: Google search

If you start a query with the term "allintitle:", Google search will restrict the results to those with all of the query words in the title. The query terms must appear in the first 10 words of the title.

URL Search (term) inurl:Google search

If you prepend "inurl:" to a query term, Google search will restrict the results to documents containing that word in the result URL. Note there can be no space between the "inurl:" and the following word.

Note: "inurl:" works only on words, not URL components. In particular, it ignores punctuation and will only use the first word following the "inurl:" operator. To find multiple words in a result URL, use the "inurl:" operator for each word.

Note: Putting "inurl:" in front of every word in your query is equivalent to putting "allinurl:" at the front of your query.

URL Search (all) allinurl: Google search

If you start a query with the term "allinurl:", Google search will restrict the results to those with all of the query words in the result URL.

Note: "allinurl:" works only on words, not URL components. In particular, it ignores punctuation. Thus, "allinurl: foo/bar" will restrict the results to pages with the words "foo" and "bar" in the URL, but won't require that they be separated by a slash within that URL, that they be adjacent, or that they be in that particular word order. There is currently no way to enforce these constraints.

File Type Filtering Google filetype:doc OR filetype:pdf

The query prefix, "filetype:", will filter the results returned to only include documents with the extension specified immediately after. Note there can be no space between "filetype:" and the specified extension.

Note: Multiple file types can be included in a filtered search by adding more "filetype:" terms to the search query, when used in conjunction with the Boolean OR.

File Type Exclusion Google -filetype:doc -filetype:pdf

The query prefix, "-filetype:", will filter the results to exclude documents with the extension specified immediately after. Note there can be no space between "-filetype:" and the specified extension.

Note: Multiple file types can be excluded in a filtered search by adding more "-filetype:" terms to the search query.

Web Document Info info:www.google.com

The query prefix, "info:", will return a single result for the specified URL if it exists in the index.

Note: No other query terms can be specified when using this special query term.

Back Links link:www.google.com

The query prefix, "link:", will list web pages that have links to the specified web page. Note there can be no space between "link:" and the web page URL.

Note: No other query terms can be specified when using this special query term. Note: The search request parameter, as_lq, can also be used to submit a link: request.

Cached Results Page cache:www.google.com web

The query prefix, "cache:", will return the cached HTML version of the specified web document that the Google search crawled. Note there can be no space between "cache:" and the web page URL.

If you include other words in the query, Google will highlight those words within the cached document.

Note: To use Google's default cached result display, simply omit the output parameter in the cache request. To customize the display of cached results, simply request XML or Custom HTML output as part of the cache request and ensure your parser or stylesheet will handle the incoming cache data.
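The special query terms in the table above are plain text inside the q value, so they can be composed with small string helpers and then URL-escaped once. The helper names below are assumptions chosen for this sketch; the operators themselves ("-", quotation marks, OR, and the prefix: forms) are the ones described in the table.

```python
from urllib.parse import quote_plus


def phrase(words):
    """Phrase search: wrap the words in double quotation marks."""
    return '"' + words + '"'


def exclude(term):
    """Exclusion: the minus sign goes immediately in front of the term."""
    return "-" + term


def any_of(*terms):
    """Boolean OR: the operator must be uppercase."""
    return " OR ".join(terms)


def prefixed(prefix, value):
    """Prefix terms such as site:, filetype:, intitle: take no space."""
    return prefix + ":" + value


q = "Google " + any_of(prefixed("filetype", "doc"),
                       prefixed("filetype", "pdf"))
print(q)              # Google filetype:doc OR filetype:pdf
print(quote_plus(q))  # URL-escaped form for the q parameter
```

quote_plus() escapes the colon and quotation marks and writes spaces as "+", which is the form used by the example URLs in this section.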

2.4 Filtering

Google search provides many ways for you to filter the results that are returned as part of your query. These filtering options include:

● Automatic Filtering

● Language Filters

❍ Automatic Language Filters

❍ Combining Language Filters

Other filtering options can be applied through special query parameters, query terms and meta tags, which are documented in their respective sections. Please review these sections for more information on other filtering options.

2.4.1 Automatic Filtering

The quality of the results Google returns for searches is extremely important. One method that makes sure the best results are returned for a query is automatic "filtering" of the search results to weed out undesirable results.

Currently, Google search uses two techniques for automatic filtering of results:

● Duplicate Snippet Filter - If multiple documents contain the same information in their snippets in response to a query, then only the most relevant document of that set will be displayed in the results.

● Duplicate Directory Filter - If there are many results in a single web directory, then only the two most relevant results for that directory will be returned in the results. An output flag indicates that more results are available from that directory.

By default, both types of filters are enabled. However, you can disable them with the filter parameter.

Setting filter=1 enables both Duplicate Directory Filtering and Duplicate Snippet Filtering. This is the default setting if no value for the filter parameter is provided.

Setting filter=0 will disable both Duplicate Directory Filtering and Duplicate Snippet Filtering.

Although determining when to use this option is up to each search administrator, Google recommends against setting filter=0 for typical search requests, since Google has found that document filtering significantly enhances the quality of most search results.

Setting filter=p will disable Duplicate Snippet Filtering only.

Setting filter=s will disable Duplicate Directory Filtering only.
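The four filter values can be summarized as a lookup table. The sketch below simply restates the settings listed above in Python; the dictionary keys and the function name active_filters are assumptions made for this example.

```python
# Which automatic filters stay enabled for each value of the filter parameter.
FILTER_SETTINGS = {
    "1": {"duplicate_snippet": True, "duplicate_directory": True},
    "0": {"duplicate_snippet": False, "duplicate_directory": False},
    "p": {"duplicate_snippet": False, "duplicate_directory": True},
    "s": {"duplicate_snippet": True, "duplicate_directory": False},
}


def active_filters(value=None):
    """Return the enabled filters; a missing value behaves like filter=1."""
    key = "1" if value is None else str(value)
    settings = FILTER_SETTINGS[key]
    return sorted(name for name, enabled in settings.items() if enabled)


print(active_filters())     # both filters
print(active_filters("p"))  # directory filter only
```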

When an end user submits a search request in which filtering removes any results, the removal of the results will be noted in the output generated for the search results. See the section on Estimated vs. Actual Number of Results for more information on how a filtered result set is identified and recommendations for results display.

The appliance also will automatically group results from a single directory in the search results.

If you set filter=0, then the order in which results are ranked can change depending on the value of the num parameter.

For example, if you set num=10 and filter=0, you may get two results in a particular directory that are among the 10 most relevant results. If one of these results is the most relevant of all, then directory crowding will cause both to be displayed at the top of the results.

If you now set num=20, you may get a third result in the same directory that would otherwise be ranked between 11 and 20. However, this result will actually be ranked third because of directory crowding.

2.4.2 Language Filters

This section covers:

● Automatic Language Filters

● Combining Language Filters

2.4.2.1 Automatic Language Filters

Language filters limit searches to pages in the specified languages. The algorithm for automatically determining the language of a web document is not customizable. The language determination algorithm is primarily based on the majority language used in the web document body text. Automatic language collections may not be appropriate for all users.

Note: Encoding schemes for input and output of search requests are important when providing international search. Please review the Internationalization section for more details.

The automatic language filters generated are:

Language Automatic Language Filter Name

Arabic lang_ar

Chinese (Simplified) lang_zh-CN

Chinese (Traditional) lang_zh-TW

Czech lang_cs

Danish lang_da

Dutch lang_nl

English lang_en

Estonian lang_et

Finnish lang_fi

French lang_fr

German lang_de

Greek lang_el

Hebrew lang_iw

Hungarian lang_hu

Icelandic lang_is

Italian lang_it

Japanese lang_ja

Korean lang_ko

Latvian lang_lv

Lithuanian lang_lt

Norwegian lang_no

Portuguese lang_pt

Polish lang_pl

Romanian lang_ro

Russian lang_ru

Spanish lang_es

Swedish lang_sv

Turkish lang_tr

2.4.2.2 Combining Language Filters

Search requests that use the lr parameter support the Boolean operators identified in the table below (in order of precedence).

Boolean Operator Sample Usage Description

Boolean NOT [ - ] -lang_fr

Removes all results that are defined as part of the Language Filter immediately following the "-" operator.

The example lr value would remove all results in French.

Boolean AND [ . ] gloves.hats

Returns results that are in the intersection of the results returned by the collection to either side of the "." operator.

The example restrict value would return all results which are in both the "hats" and "gloves" custom collections.

Boolean OR [ | ] lang_en|lang_fr

Returns results that are in either of the results returned by the collection to either side of the "|" operator.

The example lr value would return all results matching the query that are in either French or English.

Parentheses [ ( ) ] (gloves).(-(lang_hu|lang_cs))

All terms within the innermost set of parentheses will be evaluated before terms outside the parentheses are evaluated. Use parentheses to adjust the order of term evaluation.

The example lr value would return all results in the "gloves" custom collection that are not in either the Hungarian or Czech collections.

Note: Spaces are not valid characters in the collection string.
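The operators in the table above can be combined programmatically. The following Python sketch builds the last sample expression from small helpers and URL-escapes it for use as the lr value; the helper names are assumptions, and "gloves" is the custom collection name used in the table.

```python
from urllib.parse import quote


def lr_not(expr):
    return "-" + expr


def lr_and(*exprs):
    return ".".join(exprs)


def lr_or(*exprs):
    return "|".join(exprs)


def group(expr):
    """Parentheses force evaluation order."""
    return "(" + expr + ")"


def lr_value(expr):
    """Reject spaces (not valid in the string), then URL-escape."""
    if " " in expr:
        raise ValueError("spaces are not valid in an lr expression")
    return quote(expr, safe="")


expr = lr_and(group("gloves"),
              group(lr_not(group(lr_or("lang_hu", "lang_cs")))))
print(expr)            # (gloves).(-(lang_hu|lang_cs))
print(lr_value(expr))  # escaped form for the request URL
```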

2.5 Internationalization

In order to support searching documents in multiple languages and character encodings, Google provides the ie parameter to specify how Google search should interpret characters in the search request, and the oe parameter to specify how characters in the search results output should be encoded. To appropriately decode the search query and correctly encode the search results, specify the correct ie and oe parameters, respectively, in the search request.

Note: When providing search for multiple languages, Google recommends the usage of the utf8 encoding value for the ie and oe parameters.

Example

The query GET /search?q=gloves&client=test&site=test&lr=lang_en|lang_fr&ie=latin1&oe=latin1 would interpret the search query "gloves" using the latin1 encoding scheme, search for English or French results, and return results in the latin1 encoding scheme.

The query GET /search?q=gloves&client=test&site=test&lr=(-lang_hu).(-lang_cs)&ie=latin2&oe=latin2 would interpret the search query "gloves" using the latin2 encoding scheme, search for any results which are not in Hungarian or Czech, and return results in the latin2 encoding scheme.

The query GET /search?q=gloves&client=test&site=test&lr=lang_zh-CN|lang_zh-TW&ie=utf8&oe=utf8 would interpret the search query "gloves" using the utf8 encoding scheme, search for any results which are in Simplified or Traditional Chinese, and return results in the utf8 encoding scheme.

Note: See the Language Filters section for details of language specific searches using the lr parameter.

Character Encoding Values

The following table lists all encoding values supported by these parameters:

Language Encoding Value Alternate Encoding Value

Chinese (Simplified) gb GB2312

Chinese (Traditional) big5 Big5

Czech latin2 ISO-8859-2

Danish latin1 ISO-8859-1

Dutch latin1 ISO-8859-1

English latin1 ISO-8859-1

Estonian latin4 ISO-8859-4

Finnish latin1 ISO-8859-1

French latin1 ISO-8859-1

German latin1 ISO-8859-1

Greek greek ISO-8859-7

Hebrew hebrew ISO-8859-8

Hungarian latin2 ISO-8859-2

Icelandic latin1 ISO-8859-1

Italian latin1 ISO-8859-1

Japanese sjis Shift_JIS

Korean euc-kr EUC-KR

Latvian latin4 ISO-8859-4

Lithuanian latin4 ISO-8859-4

Norwegian latin1 ISO-8859-1

Portuguese latin1 ISO-8859-1

Polish latin2 ISO-8859-2

Romanian latin2 ISO-8859-2

Russian cyrillic ISO-8859-5

Spanish latin1 ISO-8859-1

Swedish latin1 ISO-8859-1

latin3 ISO-8859-3

latin5 ISO-8859-9

latin6 ISO-8859-10

euc-jp EUC-JP

Unicode (All Languages) utf8 UTF-8
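On the client side, the query must be escaped in the encoding named by ie, and the response bytes decoded with the encoding named by oe. The Python sketch below shows one way to do this, assuming that Python's codec registry accepts the alternate encoding values from the table (it does for the names listed here). The CODEC_FOR mapping covers only part of the table, and the function names are assumptions.

```python
from urllib.parse import quote_plus

# ie/oe value -> alternate encoding value from the table (partial list).
CODEC_FOR = {
    "latin1": "ISO-8859-1",
    "latin2": "ISO-8859-2",
    "latin4": "ISO-8859-4",
    "greek": "ISO-8859-7",
    "cyrillic": "ISO-8859-5",
    "sjis": "Shift_JIS",
    "euc-kr": "EUC-KR",
    "big5": "Big5",
    "gb": "GB2312",
    "utf8": "UTF-8",
}


def escape_query(text, ie="latin1"):
    """URL-escape the query using the byte encoding named by ie."""
    return quote_plus(text, encoding=CODEC_FOR[ie])


def decode_results(raw_bytes, oe="utf8"):
    """Decode response bytes using the encoding named by oe."""
    return raw_bytes.decode(CODEC_FOR[oe])


print(escape_query("caf\u00e9", ie="latin1"))  # one byte for the accented e
print(escape_query("caf\u00e9", ie="utf8"))    # two bytes for the accented e
```

The same query text produces different escaped bytes under each encoding, which is why the ie value sent with the request has to match the encoding actually used to escape q.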

2.6 Sorting

Google search provides two sorting options for implementing your search solution:

● Sort By Relevance

● Sort By Date

2.6.1 Sort By Relevance (Default)

By default, Google combines hypertext analysis and PageRank technologies to provide users with highly relevant results. Hypertext analysis uses the design of the page, examining over 100 factors to determine the best result for your query term. PageRank considers the link structure of the entire index to understand how each page links to the other pages in the index.

2.6.2 Sort By Date

Google search also supports the ability to order search results by date. The date of a web document is defined by parameters configured by the search administrator. When a search is performed using the sort by date capability, the date associated with each result document will be included with the results.

When using the Sort by Date feature, the automatic quality filter will sometimes re-order results when performing result grouping. This can be disabled by adding the "filter=0" parameter to the search request when sorting by date.

Example

The query GET /search?q=chicken+teriyaki&output=xml&client=test&site=test&sort=date:D:S:d1 would return the top 10 results, sorted by both date and relevancy, that match the query "chicken teriyaki" in the "test" collection.

Details

To sort the results by date, the sort parameter must be formatted as follows:

date:<direction>:<mode>:<format>

where <direction>, <mode> and <format> can have the following values:

<direction> Value Results

A Sort results in ascending date order

D Sort results in descending date order

<mode> Value Results

S Sort relevant results. Google's algorithm will determine a subset of the most relevant results from the set of all results, and then sort that subset by date to return as results for the search request.

R Sort all results. Note: Providing sort by date on queries with large result sets may incur performance penalties.

L Perform a look-up on the date associated with each document and return the date information for each result returned, but perform no sorting.

<format> Value Results

d1 The date returned for each search result is formatted as YYYY-MM-DD
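A small helper can assemble and validate the sort value from the three tables above. The function name date_sort and its defaults (taken from the example request, date:D:S:d1) are assumptions made for this sketch.

```python
DIRECTIONS = {"A", "D"}   # ascending, descending
MODES = {"S", "R", "L"}   # sort relevant subset, sort all, look-up only
FORMATS = {"d1"}          # YYYY-MM-DD


def date_sort(direction="D", mode="S", fmt="d1"):
    """Return a sort value of the form date:<direction>:<mode>:<format>."""
    if direction not in DIRECTIONS:
        raise ValueError("direction must be A or D")
    if mode not in MODES:
        raise ValueError("mode must be S, R, or L")
    if fmt not in FORMATS:
        raise ValueError("format must be d1")
    return "date:%s:%s:%s" % (direction, mode, fmt)


print(date_sort())                         # date:D:S:d1
print(date_sort(direction="A", mode="R"))  # date:A:R:d1
```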

2.7 Meta Tags

Google search provides two options for leveraging the meta tags that are available in your content. Unless one of these parameters is specified, meta tag information will not be considered in your search results, since that information is not visible to the search user. These options are:

● Requesting Meta Tag Values

● Filtering by Meta Tags

2.7.1 Requesting Meta Tag Values

Through the use of the getfields parameter, the Google search engine allows a search request to specify meta tag values to return with the search results. The search engine will only return meta tag information for results which actually contain the meta tags. The search for meta tags is case-insensitive. Use only whole words in the getfields parameter, not partial words or word "stems."

Usage

GET /search?q=[search term]&output=xml&client=test&site=test&getfields=[meta tag name]

Example

The query GET /search?q=books&output=xml&client=test&site=test&getfields=author.title.keywords would return the first 10 results that match the query "books" in the "test" collection. If any of the results contain the author, title, or keywords meta tags, then the values of those meta tags will be returned with the respective results. For example, the following tags could be returned with this search request:

<META NAME="author" CONTENT="Jakob Nielsen">

<META NAME="title" CONTENT="Usability Engineering">

<META NAME="keywords" CONTENT="Usability, User Interface, User Feedback">

Details

To specify multiple meta tag values to be returned, list all meta tag names separated by a period (".") as in the example above. To request all available meta tags for each search result, specify an asterisk ("*") as the value for the getfields parameter.

Note: When meta tag values are requested, they are not displayed in results requested in the default HTML format. Please use the custom HTML or XML output options to take advantage of this feature.

Note: All meta tag names or values specified must be double URL-escaped. See an example in the following section.
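The getfields value can be assembled as follows. This Python sketch is one interpretation of "double URL-escaped": a space is written as "+" and the result is then escaped twice, which is the form that reproduces the %252B sequence used in this document's requiredfields example. That interpretation, and the helper names, are assumptions to verify against your appliance.

```python
from urllib.parse import quote


def double_escape(text):
    """Write spaces as '+', then URL-escape twice (assumed interpretation)."""
    plus_form = text.replace(" ", "+")
    return quote(quote(plus_form, safe=""), safe="")


def getfields_value(names):
    """Join meta tag names with '.', or pass '*' to request all tags."""
    if names == "*":
        return "*"
    return ".".join(double_escape(name) for name in names)


print(getfields_value(["author", "title", "keywords"]))
print(getfields_value("*"))
```

Names made up only of letters and digits pass through unchanged, which is why the example request above needs no visible escaping.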

2.7.2 Filtering by Meta Tags

The Google search engine can filter results by the values of the result meta tags. This section details how to use the requiredfields and partialfields input parameters to filter on meta tag values. The term partialfields refers to part of the meta tag content, rather than part of a word. Other filtering techniques are noted in the Filtering section.

Usage

GET /search?q=[search term]&output=xml&client=test&site=test&requiredfields=[metatag name]:[metatag content]

Examples

The query GET /search?q=checks&output=xml&client=test&site=test&requiredfields=department:Human%252BResources|department:Finance returns the first 10 results that match the query "checks" in the "test" collection and that also contain either of the following meta tags:

<META NAME="department" CONTENT="Human Resources">

<META NAME="department" CONTENT="Finance">

The query GET /search?q=books&output=xml&client=test&site=test&partialfields=author:Scott returns the first 10 results that match the query "books" in the "test" collection and that also contain the word "Scott" somewhere in the "author" meta tag. Some example meta tags satisfying this search request are:

<META NAME="author" CONTENT="Sir Walter Scott">

<META NAME="author" CONTENT="F. Scott Fitzgerald">

Details

Multiple meta tag constraints can be specified using the requiredfields and partialfields parameters. To filter for the presence of a meta tag, indicate the name of the meta tag to be found. To filter on a specific meta tag value, indicate the name of the meta tag followed by the colon ":" character and then the specific value. The partialfields parameter matches complete words, not parts of words. In addition, the match must be within the first 160 characters of the meta tag. See the examples in the table below for sample usage.

To combine multiple name-value pairs, use the following operators:

Boolean Operator Sample Usage Description

Boolean AND [ . ] author:William.keywords Returns results which satisfy both meta tag constraints.

Boolean OR [ | ] department:Sales|department:Finance Returns results which satisfy either meta tag constraint.

As stated in the "Query Terms" section, all non-alphanumeric characters included in a search query are treated as query term separators (just like space characters). Similarly, Google uses these separators to divide metatag content into single entities, or word tokens; that is, a word or a string that may or may not be a real word. The separators, used in both queries and results, and their values are in the table. They are not customizable.

Separator Value

~ ! @ # $ % ^ & * ( ) - + { } | ` [ ] : ; ' < > ? , . / = space character

\ 92

" 34

\t 9

\r 13

\n 10

\v 11

\f 12

\177 177

Note: All meta tag names or values specified must be double URL-escaped. See example above.
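The requiredfields and partialfields values can be built from name-value pairs with the two operators described above. The Python sketch below reproduces the value used in the first example of this section. The helper names are assumptions, and the escaping rule (space written as "+", then escaped twice) is inferred from that example rather than stated by this document.

```python
from urllib.parse import quote


def double_escape(text):
    """Write spaces as '+', then URL-escape twice (inferred from the example)."""
    plus_form = text.replace(" ", "+")
    return quote(quote(plus_form, safe=""), safe="")


def constraint(name, value=None):
    """A bare name tests for presence; name:value tests for a value."""
    if value is None:
        return double_escape(name)
    return double_escape(name) + ":" + double_escape(value)


def any_of(*constraints):
    return "|".join(constraints)   # Boolean OR


def all_of(*constraints):
    return ".".join(constraints)   # Boolean AND


required = any_of(constraint("department", "Human Resources"),
                  constraint("department", "Finance"))
print(required)
print(all_of(constraint("author", "William"), constraint("keywords")))
```

The ":" , "|" and "." operators are left unescaped, matching the example URLs; only the names and values themselves are escaped.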

2.8 Limits

This section lists any limitations on the search requests sent to Google search.

Component Limit

Search request length 2048 bytes

Query Terms (includes terms in parameter q and any parameters starting with as_ ) 20

site: query terms (includes use of as_sitesearch parameter) 1 (per search request)
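
The limits in the table can be checked on the client before a request is sent. The Python sketch below is illustrative only: how the appliance itself counts bytes and terms is not specified here, so the sketch assumes the request length is the byte length of the path plus query string and that query terms are whitespace-separated. The function name is hypothetical.

```python
from urllib.parse import parse_qs, urlsplit

MAX_REQUEST_BYTES = 2048
MAX_QUERY_TERMS = 20
MAX_SITE_TERMS = 1


def check_limits(request):
    """Return a list of limit violations for a request such as '/search?q=...'."""
    problems = []
    if len(request.encode("utf-8")) > MAX_REQUEST_BYTES:
        problems.append("search request longer than 2048 bytes")

    params = parse_qs(urlsplit(request).query)
    terms = []
    for name, values in params.items():
        # Terms in q and in any parameter starting with as_ count together.
        if name == "q" or name.startswith("as_"):
            for value in values:
                terms.extend(value.split())
    if len(terms) > MAX_QUERY_TERMS:
        problems.append("more than 20 query terms")

    site_terms = sum(1 for term in terms if term.startswith("site:"))
    site_terms += len(params.get("as_sitesearch", []))
    if site_terms > MAX_SITE_TERMS:
        problems.append("more than 1 site: query term")
    return problems
```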

3. Results Format

This section is broken into the following categories:

● Custom HTML

● XML

3.1 Custom HTML

The description of the custom HTML results format is broken down into the following sections:

● Custom HTML Output Overview

● Internationalization

3.1.1 Custom HTML Output Overview

Google search provides the ability to generate custom HTML by incorporating an XSLT (eXtensible Stylesheet Language Transformation) server into the search engine infrastructure. Search requests submitted to the Google search engine, with the output input parameter set to xml_no_dtd and a valid proxystylesheet parameter value, will automatically be processed by the XSLT server as requests for custom HTML output.

Using the XSL stylesheet specified by the proxystylesheet parameter, the XSLT server applies the transformation rules found in the stylesheet to the standard Google XML results and returns the resulting output. While this document assumes that the output generated by applying the XSL stylesheet will be HTML, almost any output format can be generated by applying the appropriate XSL stylesheet rules. For any front end, the default XSL stylesheet can be customized or replaced by the search administrator.

To customize the XSL stylesheet used to generate custom HTML output, please review Google's XML output format to determine the XML tags that may be transformed using a customized XSL stylesheet.

Additionally, you can leverage the proxycustom parameter to pass custom XML tags to the XSLT server. Since the inclusion of custom XML does not generate search results, this feature is useful for implementing additional static search pages, such as an advanced search page.

Notes:

● XSL stylesheets used by the XSLT server will be cached for 15 minutes. To force the XSLT server to use the latest version of an XSL stylesheet, set the proxyreload input parameter to a value of 1 in your search request.

● XSL stylesheets which include other files may not be used with the Google search engine. Any XSL stylesheet which contains the following tags / functions will generate an error result: <xsl:import>, <xsl:include>, xmlns: and document()

● When requesting cached results in custom HTML output, the BLOB XML tag and associated value are automatically converted to the original text before the XSL stylesheet rules are applied. When using an XSL stylesheet which customizes cache results, simply use the values of the CACHE_LEGEND_TEXT, CACHE_LEGEND_NOTFOUND and CACHE_LEGEND_HTML XML tags directly instead of applying a rule on the BLOB sub-tag.

● If you use input or output encodings other than latin1, please consult the Internationalization section for more details.

● More information on XSL and XSLT can be found on the W3C web site.
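
A request for custom HTML output could be assembled as in the Python sketch below. The function name and the stylesheet value "my_frontend" are hypothetical; only the output, proxystylesheet and proxyreload parameters come from the text above, and other parameters a real request may need (such as client or site) are omitted.

```python
from urllib.parse import urlencode


def custom_html_request(query, stylesheet, reload_stylesheet=False):
    """Build a search request that is routed through the XSLT server."""
    params = {
        "q": query,
        "output": "xml_no_dtd",         # required for custom HTML output
        "proxystylesheet": stylesheet,  # selects the XSL stylesheet
    }
    if reload_stylesheet:
        # Bypass the 15-minute stylesheet cache for this request.
        params["proxyreload"] = "1"
    return "/search?" + urlencode(params)


print(custom_html_request("books", "my_frontend", reload_stylesheet=True))
```

Setting proxyreload on every request defeats the stylesheet cache, so it is probably best reserved for testing a stylesheet that has just been edited.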

3.1.2 Internationalization

The Google search engine handles over 20 character encoding schemes. This section will discuss any special considerations that must be made when using the custom HTML output format with encoding schemes other than latin1.

In order to support all the encoding schemes supported by Google, the XSLT server follows a process to ensure that the results are returned in the correct encoding scheme. When requesting search results through the XSLT server, the server will translate the results to the UTF8 encoding scheme before applying the selected XSL stylesheet. Once the XSL stylesheet rules are applied to generate the results, then the results will be converted to the encoding scheme specified in the output encoding parameter, oe, of the search request. The one exception to this rule is cached result pages, which get converted to the encoding scheme of the cached document after XSLT processing.
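
The order of the encoding steps can be illustrated with a short Python sketch. It is a conceptual model, not the XSLT server's code: the function name is hypothetical, apply_stylesheet stands in for the XSL transformation, and the encoding names are assumed to be ones Python recognizes. The cached-page exception is not modeled.

```python
def render_custom_html(raw_results, results_encoding, apply_stylesheet, oe):
    """Model the three steps: convert to UTF-8, transform, convert to oe."""
    # 1. Translate the results to UTF-8 before the stylesheet is applied.
    utf8_results = raw_results.decode(results_encoding).encode("utf-8")
    # 2. Apply the stylesheet rules (a placeholder callable: bytes in, bytes out).
    utf8_output = apply_stylesheet(utf8_results)
    # 3. Convert the output to the encoding named by the oe parameter.
    return utf8_output.decode("utf-8").encode(oe)


def wrap(data):
    """Placeholder transformation that wraps the results in a paragraph."""
    return b"<p>" + data + b"</p>"


page = render_custom_html("caf\u00e9".encode("latin-1"), "latin-1", wrap, "latin-1")
```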

Note: XSL stylesheets are associated with a front end. All XSL stylesheets must be in latin1 or UTF8 formats.

3.2 XML

The description of the XML results format is broken down into the following sections:

● XML Output Overview

● Character Encoding Conventions

● Google XML Results DTD

● Google XML Tag Definitions

3.2.1 XML Output Overview

For maximum flexibility, Google provides search results in XML format. Using the Google XML results, you can use your own XML parser to customize the display for your search users. For developers who want to specify an XSL stylesheet for transformation of the XML results, instead of developing their own XML parser, proceed to the Custom HTML section.

Notes:

● All element values will be valid HTML suitable for display, unless otherwise noted in the XML tag definitions. Some values are URLs which will need to be HTML encoded before displaying.

● All XML parsers used to parse Google results should be built to ignore any attributes or tags which are not documented. This will allow custom XML parsers to continue working without modification when Google adds more features to the XML output in the future.

● In any custom parameters that contain spaces, each space will be replaced with "_". You can still retrieve the unmodified value from the original_value attribute. For example:

<PARAM name="temp" value="token_ring" original_value="token+ring" />
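
A parser that tolerates undocumented tags can be sketched in Python with the standard library. The sample document below is hand-written for illustration (it is not real appliance output), FUTURE_TAG stands for any tag added later, and the function name is hypothetical. The parser looks up only the tags it knows and never enumerates or rejects the rest.

```python
import xml.etree.ElementTree as ET

# Hand-written sample using documented tags plus one unknown tag.
SAMPLE = """<GSP VER="3.2">
  <TM>0.05</TM>
  <Q>books</Q>
  <PARAM name="temp" value="token_ring" original_value="token+ring" />
  <FUTURE_TAG>ignored by this parser</FUTURE_TAG>
  <RES SN="1" EN="1">
    <M>1</M>
    <R N="1">
      <U>http://www.example.com/</U>
      <T>Example</T>
      <RK>5</RK>
      <HAS />
    </R>
  </RES>
</GSP>"""


def parse_results(xml_text):
    """Return (params, results), reading only the documented tags."""
    root = ET.fromstring(xml_text)
    params = {p.get("name"): p.get("original_value") for p in root.findall("PARAM")}
    results = [
        {"url": r.findtext("U"), "title": r.findtext("T")}
        for r in root.findall("RES/R")
    ]
    return params, results


params, results = parse_results(SAMPLE)
```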

3.2.2 Character Encoding Conventions

The first line of the Google XML results will indicate which character encoding is used. See the XML Standard for more details.

Additionally, certain characters are required to be escaped when included as values in XML tags. These characters are documented in the XML standard, and are also reproduced in the table below. All other characters in the XML results will be presented without modification.

Character Escaped form

< either &lt; or &#60;

& either &amp; or &#38;

> either &gt; or &#62;

' either &apos; or &#39;

" either &quot; or &#34;

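The escaping rules in the table can be reproduced in Python with the standard library, which may be useful when building test fixtures. The helper names are hypothetical. This sketch covers only the named forms (&lt;, &amp;, and so on); an XML parser resolves both the named and the numeric forms on its own.

```python
from xml.sax.saxutils import escape, unescape

# escape() handles &, < and > by default; the quote characters are added here.
QUOTE_ENTITIES = {"'": "&apos;", '"': "&quot;"}
QUOTE_CHARACTERS = {entity: char for char, entity in QUOTE_ENTITIES.items()}


def escape_xml(text):
    """Replace the five special characters with their named escaped forms."""
    return escape(text, QUOTE_ENTITIES)


def unescape_xml(text):
    """Reverse escape_xml for the named escaped forms."""
    return unescape(text, QUOTE_CHARACTERS)
```
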
3.2.3 Google XML Results DTD

Google XML results can be returned either with or without a reference to the most recent DTD (Document Type Definition) describing Google's XML format. The DTD is a guide to help search administrators and XML parsers understand the XML results output. Since Google's XML grammar may change from time to time, you should not configure your parser to use the DTD to validate the XML results.

Additionally, XML parsers should not be configured to fetch the DTD every time a search request is performed. Since the DTD is updated infrequently, these fetches create unnecessary delay and bandwidth requirements.

Google recommends that you use the xml_no_dtd output format to get XML results. If you specify the xml output format in your search request, then the only difference will be the inclusion of the following line in the XML results.

<!DOCTYPE GSP SYSTEM "google.dtd">

The DTD is available on the Google Search Appliance at

http://<appliance_hostname>/google.dtd

If there are other features you would like to see on the DTD, please consult with your account representative. Not all features in the DTD may be available or supported at this time.
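
As an illustration of parsing without validation, the Python sketch below reads the same hand-written result with and without the DOCTYPE line. Python's xml.etree.ElementTree is a non-validating parser, so the DTD reference is not used to check the document. The sample strings and the function name are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# output=xml adds the DOCTYPE line; output=xml_no_dtd omits it.
WITH_DTD = (
    '<?xml version="1.0"?>\n'
    '<!DOCTYPE GSP SYSTEM "google.dtd">\n'
    '<GSP VER="3.2"><Q>books</Q></GSP>'
)
WITHOUT_DTD = '<?xml version="1.0"?>\n<GSP VER="3.2"><Q>books</Q></GSP>'


def query_of(xml_text):
    """Return the text of the Q tag without validating against the DTD."""
    return ET.fromstring(xml_text).findtext("Q")
```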

3.2.4 Google XML Tag Definitions

This section provides an index and details of Google's XML results.

Sub-Tags Legend

? = optional sub-tag

* = zero or more instances of the sub-tag

+ = one or more instances of the sub-tag

| = Boolean OR

Index

The XML tags are listed in alphabetical order below.

Details

BLOB

Format Text (See Definition)

Sub-Tags

Definition This tag contains HTML data in the encoding format specified in the encoding attribute. Additionally, the data has been BASE64 encoded to preserve the data integrity of cached results encoded in a different encoding scheme than the results requested.

Attributes

Name Format Description

encoding Text (Encoding Scheme) The encoding scheme of the HTML data (See the Internationalization section for a list of common encoding values)
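
Recovering the original text from a BLOB tag takes two steps: BASE64-decode the tag value, then decode the bytes with the scheme named in the encoding attribute. The Python sketch below assumes the attribute value is a name Python's codecs recognize and falls back to UTF-8 when the attribute is missing; both choices, and the function name, are assumptions.

```python
import base64
import xml.etree.ElementTree as ET


def decode_blob(blob):
    """Return the HTML text carried by a BLOB element."""
    raw = base64.b64decode(blob.text or "")
    return raw.decode(blob.get("encoding", "utf-8"))


# Hand-built BLOB element for illustration.
payload = base64.b64encode("<b>caf\u00e9</b>".encode("utf-8")).decode("ascii")
blob = ET.fromstring('<BLOB encoding="UTF-8">' + payload + "</BLOB>")
```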

C

Format

Sub-Tags

Definition Indicates that the "cache:" special query term is supported for this search result URL

Attributes

Name Format Description

SZ Text (Integer + "k") Provides the size of the cached version of the search result in kilobytes ("k").

CID Text Identifier of a document in Google's cache. To fetch the document from the cache, send a search term built like this: "cache:" + CID text + ":" + escaped URL. The escaped URL is available in the UE tag. Send this search term normally, as one would type it into the search form.
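
The construction of the cache search term can be written out as a small Python sketch. The CID and URL values are made up for illustration and the function names are hypothetical. The second helper URL-escapes the whole term for use as the q parameter, on the assumption that the term is sent like any other query.

```python
from urllib.parse import urlencode


def cache_query_term(cid, escaped_url):
    """Build the search term: "cache:" + CID text + ":" + escaped URL."""
    return "cache:" + cid + ":" + escaped_url


def cache_request_query(cid, escaped_url):
    """Return the q parameter for a request that fetches the cached document."""
    return urlencode({"q": cache_query_term(cid, escaped_url)})


term = cache_query_term("abc123", "www.example.com/")
```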

CACHE

Format

Sub-Tags CACHE_URL, CACHE_REDIR_URL, CACHE_LAST_MODIFIED, CACHE_LEGEND_FOUND?, CACHE_LEGEND_NOTFOUND?, CACHE_CONTENT_TYPE, CACHE_LANGUAGE, CACHE_ENCODING, CACHE_HTML

Definition Provides encapsulation for the cached version of a search result

Attributes

CACHE_CONTENT_TYPE

Format Text (MIME type)

Sub-Tags

Definition MIME type of the cached result as specified in the HTTP header returned when the document was crawled

Attributes

CACHE_ENCODING

Format Text

Sub-Tags

Definition The encoding scheme of the cached result as specified in the HTTP header returned when the document was crawled (See the Internationalization section for a list of common values)

Attributes

CACHE_HTML

Format Text (HTML) (Custom HTML output only)

Sub-Tags BLOB? (XML output only)

Definition The cached version of the search result. All search results are stored in HTML format after being translated for indexing.

Attributes

CACHE_LANGUAGE

Format Text (Google language tag)

Sub-Tags

Definition The language of the cached result as determined by Google's automatic language classification algorithm. The value of this tag is the same as the values used for the automatic language collections without the "lang_" prefix.

Attributes

CACHE_LAST_MODIFIED

Format Text

Sub-Tags

Definition Date that the document was crawled, as specified in the Date HTTP header when the document was crawled for this index. The crawler will fetch documents from its cache if the web server responds with a 304 (not modified) status code to an if-modified-since request. In this case, the CACHE_LAST_MODIFIED will be the date the document was originally crawled and not the date of the if-modified-since request.

Attributes

CACHE_LEGEND_FOUND

Format

Sub-Tags CACHE_LEGEND_TEXT*

Definition Provides encapsulation for query terms found in the visible text of the cached result returned

Attributes

CACHE_LEGEND_NOTFOUND

Format Text (Custom HTML output only)

Sub-Tags BLOB? (XML output only)

Definition Details of any query terms not visible in the cached result returned

Attributes

CACHE_LEGEND_TEXT

Format Text (Custom HTML output only)

Sub-Tags BLOB (XML output only)

Definition Details of a query term which is visible in the cached result. Any query terms found in the cached result will automatically be highlighted using the colors described in the attributes of this tag.

Attributes

Name Format Description

fgcolor Color attribute The foreground color of the query term in the cached result. This value can be used directly in a color attribute for HTML tags.

bgcolor Color attribute The background color of the query term in the cached result. This value can be used directly in a color attribute for HTML tags.

CACHE_REDIR_URL

Format Text (Absolute URL)

Sub-Tags

Definition Final URL of cached result after all redirects are resolved

Attributes

CACHE_URL

Format Text (Absolute URL)

Sub-Tags

Definition Initial URL of cached result

Attributes

CRAWLDATE

Format Text

Sub-Tags

Definition This is an optional element that shows the date that the page was crawled. It is shown only for pages crawled within the past two days.

Attributes

CT

Format HTML

Sub-Tags

Definition Search comments Example comment: Sorry, no content found for this URL

Attributes

CUSTOM

Format

Sub-Tags (Any custom XML specified in the search request)

Definition Provides encapsulation for any custom XML tags specified in the proxycustom input parameter

Attributes

FI

Format

Sub-Tags

Definition Indicates that document filtering was performed during this search Note: See the section on Automatic Filtering for more details

Attributes

FS

Format

Sub-Tags

Definition Additional search result details

Attributes

Name Format Description

NAME Text Name of the result descriptor

VALUE Text Value of the result descriptor

GSP

Format

Sub-Tags (TM, Q, PARAM*, CUSTOM?, Spelling?, Synonyms?, CT?, TT?, GM*, RES?) | CACHE

Definition GSP = "Google Search Protocol" Provides an encapsulation for all data returned in the Google XML search results

Attributes

Name Format Description

VER Text Indicates version of the search results output. The current output version is "3.2".

GD

Format Text (HTML)

Sub-Tags

Definition Contains the description of a KeyMatch result

Attributes

GL

Format Text (URL)

Sub-Tags

Definition Contains the URL of a KeyMatch result

Attributes

GM

Format

Sub-Tags GL, GD?

Definition Provides encapsulation for a single KeyMatch result

Attributes

HAS

Format

Sub-Tags L?, C?

Definition Provides encapsulation for any special features supported for this search request

Attributes

HN

Format Text (URL-escaped web directory)

Sub-Tags

Definition Indicates that directory crowding has occurred and that additional results are available from the directory where this search result was found. The value of this tag is ready to be used with the "site:" query term.

Attributes

Name Format Description

U Text HTML version of web directory

L

Format

Sub-Tags

Definition Indicates that the "link:" special query term is supported for this search result URL

Attributes

M

Format Text (Integer)

Sub-Tags

Definition The estimated total number of results for the search. Note: The estimate of the total number of results for a search can be too high or too low. Please review the appendix entitled "Estimated vs. Actual Number of Results".

Attributes

MT

Format

Sub-Tags

Definition Meta tag name and value pairs pulled from the search result Note: Only meta tags which are requested in the search request will be returned

Attributes

Name Format Description

N Text Name of the meta tag

V Text Value of the meta tag

NB

Format

Sub-Tags PU?, NU?

Definition Provides encapsulation for result set navigation information Note: The NB tag will only be present if either previous or additional results are available

Attributes

NU

Format Text (Relative URL)

Sub-Tags

Definition Contains relative URL to the next results page Note: The NU tag will only be present if additional results are available

Attributes

OneSynonym

Format HTML

Sub-Tags

Definition A synonym suggestion for the submitted query in HTML format.

Attributes

Name Format Description

Q Text The URL-escaped version of the synonym suggestion

PARAM

Format

Sub-Tags

Definition The input parameters submitted to the Google search engine to generate these results

Attributes

Name Format Description

name Text Input parameter name

value HTML HTML formatted version of the input parameter value

original_value Text Original URL-escaped version of the input parameter value

PU

Format Text (Relative URL)

Sub-Tags

Definition Contains relative URL to the previous results page Note: The PU tag will only be present if previous results are available

Attributes

Q

Format HTML

Sub-Tags

Definition The search query submitted to the Google search engine to generate these results

Attributes

R

Format

Sub-Tags U, T?, RK, FS?, MT*, S?, HAS, HN?

Definition Provides encapsulation for the details of an individual search result

Attributes

Name Format Description

N Text (Integer) Indicates the index (1-based) of this search result

L Text (Integer) Indicates the recommended indentation level of the results. Note: Currently this value will always be 1 unless directory crowding occurs. In this case, the second directory result will have a value of 2.

MIME Text Indicates the MIME type of the search result

RES

Format

Sub-Tags M, FI?, XT?, NB?, R*

Definition Provides encapsulation for the details of the individual search results

Attributes

Name Format Description

SN Text (Integer) Indicates the index (1-based) of the first search result returned in this result set

EN Text (Integer) Indicates the index (1-based) of the last search result returned in this result set

RK

Format Text (Integer in the range 0-10)

Sub-Tags

Definition Provides a general rating of the relevance of the search result

Attributes

S

Format Text (HTML)

Sub-Tags

Definition Search result snippet for the search result. Note: Query terms will be highlighted in bold in the results, and line breaks will be included for proper text wrapping.

Attributes

Spelling

Format

Sub-Tags Suggestion+

Definition Provides encapsulation for alternate spelling suggestions for the submitted query. Only one spelling suggestion is returned at this time.

Attributes

Suggestion

Format HTML

Sub-Tags

Definition An alternate spelling suggestion for the submitted query in HTML format

Attributes

Name Format Description

Q Text The URL-escaped version of the spelling suggestion

Synonyms

Format

Sub-Tags OneSynonym+

Definition Provides encapsulation for synonym suggestions for the submitted query. Up to 20 synonym suggestions may be returned depending on the synonym list associated with the front end by the search administrator.

Attributes

T

Format Text (HTML)

Sub-Tags

Definition The title of the search result

Attributes

TM

Format Text (Floating-point number)

Sub-Tags

Definition Total server time to return search results, measured in seconds.

Attributes

U

Format Text (Absolute URL)

Sub-Tags

Definition The URL of the search result.

Attributes

XT

Format

Sub-Tags

Definition Indicates that the estimated total number of results specified in this search result is exact. Note: See the section on Automatic Filtering for more details.

Attributes

Appendices

This section contains any appendices relevant to Google search:

● Estimated vs. Actual Number of Results

● URL Escaping

Appendix A: Estimated vs. Actual Number of Results

The Google search engine does not guarantee the ability to return a particular number of results for any given search query. The total number of results provided by Google in the search results is an estimate of the actual number of results for the query. This number can be higher or lower than the actual number of results available. This section covers any issues relating to this topic.

Behavior

When a search request is made to Google, the following behavior occurs:

1. If Google has results to satisfy the search request, then the requested number of results will be returned.

2. If Google has results and the search request is for results beyond what is available, the last page of results will be returned. The last page of results is determined by dividing the total number of results into pages based on the number of results requested.

3. If no results are available for the search request, then an empty result set will be returned.

In order to determine if a particular results page is the last page of available results, check for any of the following conditions:

1. The first result number returned does not match the first result number requested.

2. The number of results returned is less than the number of results requested.

3. The results returned do not contain a link to the next result set.
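
The three conditions above can be checked against the RES tag of the XML results, as in the Python sketch below. It assumes the first result number requested is expressed with the same 1-based index as the SN attribute and that the link to the next result set is the NU tag inside NB. The function name and the sample RES fragments used in the tests are hypothetical.

```python
import xml.etree.ElementTree as ET


def is_last_page(res, requested_first, requested_count):
    """Return True if any last-page condition holds for this RES element."""
    first = int(res.get("SN"))
    last = int(res.get("EN"))
    returned_count = last - first + 1
    has_next_link = res.find("NB/NU") is not None
    return (
        first != requested_first             # condition 1
        or returned_count < requested_count  # condition 2
        or not has_next_link                 # condition 3
    )
```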

Automatic Filtering

Typically, the number of results actually returned is significantly reduced by the automatic filtering that Google performs on all search results to weed out undesirable results. This feature can be disabled per the instructions in the Automatic Filtering section.

Any results which have been filtered will be identified in the results returned. For example, the <FI> XML tag will be present in any XML search results where automatic document filtering has occurred.

Google recommends that the search results page display a message on the last page of the search results similar to the following message when automatic filtering occurs:

In order to show you the most relevant results, we have omitted some entries very similar to the search results already displayed. If you like, you can repeat the search with the omitted results included.

The phrase "repeat the search with the omitted results included" in the message should be a hypertext link that submits the same search again with the filter parameter set to the value 0. Google has found that this method of informing users about automatic document filtering works well, and it is used on the Google Internet search site.
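
Detecting automatic filtering and building the link for the repeated search could look like the Python sketch below. The function names are hypothetical. The link is produced by re-encoding the original request's query string with filter=0, which may normalize the escaping of other parameters.

```python
import xml.etree.ElementTree as ET
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def was_filtered(res):
    """Return True if the FI tag is present in the RES element."""
    return res.find("FI") is not None


def unfiltered_link(request):
    """Return the same request with the filter parameter set to 0."""
    parts = urlsplit(request)
    params = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name != "filter"
    ]
    params.append(("filter", "0"))
    return urlunsplit(parts._replace(query=urlencode(params)))
```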

Navigation

When the total number of results returned is an estimate, the navigation structure for search results can be complicated. Google recommends two approaches for generating a navigation scheme for your search results:

1. Only provide the search user with the ability to navigate to the previous results page and the next results page. Google provides links to the previous and next result set in the results returned when appropriate.

2. Provide the search user with the ability to jump to any search page in the estimated number of results. If the user requests a results page beyond which results are actually available, the last results page will be returned and the navigation structure should be updated at that time. Google uses this approach on our Internet search site.

Appendix B: URL Escaping

In order to make a search request to the Google search engine through an HTTP URL request, there are certain conventions that must be followed in order to allow the search engine to correctly translate your search request.

The HTTP URL schema defines that only alphanumeric characters, the special characters $-_.+!*'(), and the reserved characters ;/?:@=& can be used as values within an HTTP URL request. Since reserved characters are used by the search engine to decode the URL and some special characters are used to request search features, all non-alphanumeric characters used as input parameter values should be URL escaped.

In order to URL escape a string, all space characters should be converted to a "+" character and all other non-alphanumeric characters should be replaced by a "%" character followed by two hexadecimal digits representing the value of that character.

Some input parameters require that the values passed to Google search be double URL escaped. This means that you will need to apply the URL escaping to the string twice in succession to generate the final value. See the input parameter descriptions for more information.

Note: Additional information on URL escaping can be found at W3C and IETF web sites.

Examples

Original String                          URL Escaped String

chicken -teriyaki                        chicken+%2Dteriyaki

admission form site:www.stanford.edu     admission+form+site%3Awww.stanford.edu

Original String                          Doubly URL Escaped String

William Shakespeare                      William%2BShakespeare

admission form site:www.stanford.edu     admission%2Bform%2Bsite%253Awww.stanford.edu
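
The escaping rule can be sketched in Python as follows. This is an illustration under stated assumptions, not a reference implementation: the function names are hypothetical and characters are converted to UTF-8 bytes before escaping. The examples above leave the "." character unescaped, so the sketch takes an optional safe argument listing characters to pass through unchanged.

```python
def url_escape(text, safe=""):
    """Escape text: space becomes '+', other non-alphanumerics become %XX."""
    out = []
    for byte in text.encode("utf-8"):
        ch = chr(byte)
        if ch == " ":
            out.append("+")
        elif (ch.isascii() and ch.isalnum()) or ch in safe:
            out.append(ch)
        else:
            out.append("%{:02X}".format(byte))
    return "".join(out)


def double_url_escape(text, safe=""):
    """Apply url_escape twice in succession."""
    return url_escape(url_escape(text, safe), safe)
```

Under these assumptions, url_escape("chicken -teriyaki") gives chicken+%2Dteriyaki and double_url_escape("William Shakespeare") gives William%2BShakespeare, matching the tables above. Passing safe="." reproduces the site:www.stanford.edu rows.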

Glossary

This glossary contains basic descriptions of acronyms and terms found in this document which may be new to some readers.

Cached result - As part of its core technology, Google indexes all the content on a page, rather than a portion of the content (percentage or meta tags). Each page that is indexed is also available to be served in a cached HTML format (up to 4 million bytes of each document before HTML conversion). When a user views a cached document, each query term is highlighted in a different color, making it easy for the user to find the information sought. Because all pages are cached, the user always has access to content that has been indexed, even if the server where the live content is stored happens to be refusing connections or is slow to return the page.

Collection - A collection is a subset or a view of the document index. Collections are specified by URL patterns; some collections are created automatically by the Google search engine. Collections are useful for allowing refined or advanced searches, for limiting access to classified information, for group-level security, for language-specific queries and for many other applications.

DTD - Document Type Definition. The purpose of a DTD is to define the legal building blocks of an XML document. It defines the XML document structure with a list of legal elements.

Encoding Scheme - Each language has an official encoding scheme which is used to represent all of the language's characters in an 8-bit data stream format. These encoding schemes are used by Google search to determine how to translate incoming and outgoing search requests.

KeyMatch - Because you occasionally may want to return special results for specific queries, Google search may be configured with the KeyMatch feature. Using KeyMatch, the search administrator can designate special results that are returned in addition to the standard results when specific queries are made. Google recommends using KeyMatch carefully, as it can drastically decrease the quality of results if overused.

Meta Tags - HTML tags which can be specified within an HTML document which are not displayed to the end user, but which may contain document meta-data. Google search uses meta tags with the NAME attribute to enhance and filter search results when requested.

MIME - Multipurpose Internet Mail Extensions. The MIME type of a web document (or search result) identifies the format of the document it is associated with. Some sample MIME types include "text/html" for HTML documents, and "application/ms-word" for Microsoft Word documents.

Query - A string of query terms separated by the space character which is submitted to Google search. The results returned for a particular query will satisfy all query terms by default.

Query term - A single term which defines a unit of search for the Google search engine to find in the index. A single query term cannot contain any spaces or punctuation.

UTF-8 - Unicode Transformation Format (8-bit). UTF-8 is a Unicode based encoding scheme for describing language data by representing the data using 8-bit codes. This encoding scheme is used by Google search to support multiple languages simultaneously.

Web Directory - A subset of files on a web server stored under its own directory name.

XML - eXtensible Markup Language. XML is a markup language, similar to HTML, which was designed to describe data. The tags used in XML are not pre-defined, and are described by a DTD or the data provider.

XSL - eXtensible Stylesheet Language. XSL is a language that is designed to describe how an XML document should be displayed. XSL contains commands that can be used to describe the transformation and formatting of an XML document for display. XSL is used in the Google search environment to transform XML results into custom HTML output.

XSLT - XSL Transformation. XSLT describes the process of transforming an XML document into another format. Google search allows search administrators to use our XSLT server to transform our standard XML results into their own custom HTML output.

Help Center Index

A - C - D - E - F - G - H - I - K - L - M - N - O - P - Q - R - S - T - U - V - W - X

A

Access - Crawl and Index > Crawler Access

Access Control List (ACL) - Administration > Network Diagnostics, Crawl and Index > Crawl URLs

Admin Console - Admin Console > Home

Administration - Administration > License, Administration > Network Settings, Administration > Change Password, Administration > Shutdown, Administration > System Settings, Administration > User Accounts

Agent name - Crawl and Index > HTTP Headers

Archives - Crawl and Index > Freshness Tuning

Automatic rollback - Crawl and Index > Index Rollback

Authorization - Crawl and Index > Crawler Access

C

Certificate Authorities - Administration > Certificate Authorities, Administration > SSL Settings

Certificate Revocation List (CRL) - Administration > Certificate Authorities

Change password - Administration > Change Password

Collections - Crawl and Index > Collections

Configuration - Crawl and Index > Collections, Crawl and Index > Cookie Sites, Crawl and Index > Crawler Access, Crawl and Index > Duplicate Hosts, Crawl and Index > Forms Authentication, Crawl and Index > Freshness Tuning, Crawl and Index > Host Load Schedule, Administration > Import and Export, Administration > Network Settings, Administration > System Settings

Cookie forwarding - Serving > Forms Authentication

Cookie rules - Crawl and Index > Cookie Sites

Cookie sites - Crawl and Index > Cookie Sites

Crawl - Crawl and Index > Crawl URLs, Crawl and Index > Host Load Schedule, Crawl and Index > Crawler Access, Crawl and Index > Freshness Tuning, Crawling and Indexing

Crawl diagnostics - Status and Reports > Crawl Diagnostics

Crawl host load scheduling - Crawl and Index > Host Load Schedule

Crawl status - Status and Reports > Crawl Status

Crawled URLs - Status and Reports > Crawl Diagnostics

Crawler access - Crawl and Index > Crawler Access

Crawling files - Crawling and Indexing, Crawl and Index > Crawler Access

Crawling frames - Crawling and Indexing

Crawling framesets - Crawling and Indexing

Create a collection - Crawl and Index > Collections

D

Databases - Crawl and Index > Databases

Date - Crawl and Index > Document Dates

Delete URLs - Serving > Front Ends > Remove URLs

Diagnostics - Status and Reports > Search Reports, Administration > Network Diagnostics, Status and Reports > Crawl Status

DNS entries - Administration > Network Settings, Crawl and Index > Crawl URLs

DNS Search Path - Administration > Network Settings

DNS Servers - Administration > Network Settings

DNS Suffix - Administration > Network Settings

Duplicate hosts - Crawl and Index > Duplicate Hosts

E

Edit XSLT - Serving > Front Ends > Output Format, Serving > Front Ends > XSLT Stylesheet Editor, Serving > Front Ends > Global Style Variables, Serving > Front Ends > Additional Results Page Components, Serving > Front Ends > Additional Results Page Components - Search Boxes, Serving > Front Ends > Suggestion Pages, Serving > Front Ends > Result Navigation and Separation Bars, Serving > Front Ends > Result Elements, Serving > Front Ends > Templates, Serving > Front Ends > Other Variables

Email notification - Administration > System Settings

Error codes - Administration > Network Settings

Errors - Status and Reports > Crawl Diagnostics, Administration > Network Settings

Event Log - Status and Reports > Event Log

ExactMatch - Serving > Front Ends > KeyMatch

Excluded URLs - Crawl and Index > Crawl URLs, Status and Reports > Crawl Diagnostics, Crawl and Index > Collections

Exclusions - Status and Reports > Crawl Diagnostics

Existing collection - Crawl and Index > Collections

Exit - Administration > Shutdown

Export - Crawl and Index > Collections, Status and Reports > Event Log, Serving > Front Ends > Output Format

F

Feeds - Crawl and Index > Feeds

File size limits - Crawling and Indexing

File types - Crawling and Indexing

Font Families - Font Families

Forgotten password - Administration > User Accounts

Forms Authentication - Crawl and Index > Forms Authentication, Serving > Forms Authentication

Frames - Crawling and Indexing

Framesets - Crawling and Indexing

Freshness Tuning - Crawl and Index > Freshness Tuning

G

Global Variables - Global Style Variables, XSLT Stylesheet Editor

Google - Home, Crawling and Indexing

gsa-crawler - Crawling and Indexing, Crawl and Index > HTTP Headers

Googleweb - Crawl and Index > Crawl URLs

H

Hexadecimal Notation - Hexadecimal Notation

Host load configuration - Crawl and Index > Host Load Schedule

Host name - Administration > Network Settings, Crawl and Index > Duplicate Hosts, Crawl and Index > Host Load Schedule

HTML - Serving > Front Ends > Output Format, Crawling and Indexing

HTTP headers - Crawl and Index > HTTP Headers

I

Import/Export Configuration - Administration > Import and Export

Index - Crawling and Indexing, Status and Reports > Search Reports

Index Rollback - Crawl and Index > Index Rollback

IP address - Administration > Network Settings

K

KeyMatch - Serving > Front Ends > KeyMatch

Keypair - Administration > SSL Settings

Keywords - Status and Reports > Search Reports

L

Language to use - Home

License information - Administration > License

Load factor - Crawl and Index > Host Load Schedule

Logs - Status and Reports > Event Log, Status and Reports > Search Log

M

Manual rollback - Crawl and Index > Index Rollback

Maximum host load - Crawl and Index > Host Load Schedule

Maximum URLs - Crawl and Index > Host Load Schedule

Meta tag - Crawl and Index > Document Dates, Crawl and Index > Collections

N

Netegrity - Crawl and Index > Forms Authentication, Serving > Forms Authentication

Network diagnostics - Administration > Network Diagnostics

Network settings - Administration > Network Settings

Network Time Protocol - Administration > Network Settings

New collection - Crawl and Index > Collections

Non-HTML files - Crawling and Indexing

NTP servers - Administration > Network Settings

O

Oblix - Crawl and Index > Forms Authentication, Serving > Forms Authentication

Operator Logs - Status and Reports > Event Log

Output format - Serving > Front Ends > Output Format, Serving > Front Ends > Output Format - XSLT Stylesheet Editor

Overview - Help Center Home, Crawling and Indexing

P

Page Layout - Serving > Front Ends > Output Format - Page Layout Helper

Page Layout Code - Serving > Front Ends > Output Format - Page Layout Helper

Page Rank - Status and Reports > Crawl Diagnostics

Password - Administration > User Accounts, Administration > Change Password, Crawl and Index > Crawler Access

Pattern Tester Utility - Crawl and Index > Crawl URLs

Patterns - Rules for Valid URL Patterns, Crawl and Index > Crawl URLs

Pause crawling - Status and Reports > Crawl Status

Phrase - Serving > Front Ends > KeyMatch

PhraseMatch - Serving > Front Ends > KeyMatch

Prerequisites - Crawl and Index > Index Rollback

Private key - Administration > SSL Settings

Problem reports email - Administration > System Settings

Proxy configuration - Crawl and Index > Proxy Servers

Proxy host load - Crawl and Index > Host Load Schedule

Q

Queries - Status and Reports > Search Reports

Query Logs - Status and Reports > Search Log

Quit - Administration > Shutdown

R

Rank - Status and Reports > Crawl Diagnostics

Recrawl - Crawl and Index > Freshness Tuning

Regex - Rules for Valid URL Patterns

Regular expressions - Rules for Valid URL Patterns

Relevancy rank - Status and Reports > Crawl Diagnostics

Remove URLs - Serving > Front Ends > Remove URLs

Reports - Status and Reports > Search Reports, Status and Reports > Event Log, Status and Reports > Search Log, Status and Reports > Crawl Diagnostics

Reports on Query Logs - Status and Reports > Search Reports

Required URLs - Crawl and Index > Index Rollback

Restrict a Crawl - Crawl and Index > Crawl URLs

Results - Crawl and Index > Index Rollback, Serving > Front Ends > KeyMatch, Crawl and Index > Document Dates, Serving > Front Ends > Output Format, Status and Reports > Serving Status, Status and Reports > Search Reports

Results data - Status and Reports > Search Reports

Results format - Serving > Front Ends > Output Format

Resume crawling - Status and Reports > Crawl Status

Retrieval errors - Status and Reports > Crawl Diagnostics

Return codes - Administration > Network Settings

Rollback - Crawl and Index > Index Rollback

Rules - Rules for Valid URL Patterns, Crawl and Index > Cookie Sites, Crawl and Index > Forms Authentication

S

Schedule crawling - Crawl and Index > Host Load Schedule

Search data - Status and Reports > Search Reports

Search your index - Status and Reports > Crawl Status, Status and Reports > Serving Status

Secure content - Crawl and Index > Crawler Access, Administration > SSL Settings, Crawl and Index > Forms Authentication, Serving > Authorization, Serving > Forms Authentication, Serving > Front Ends > Output Format - XSLT Stylesheet Editor - Result Elements

Secure Sockets Layer (SSL) - Administration > SSL Settings

Serving prerequisites - Crawl and Index > Index Rollback, Crawl and Index > Host Load Schedule

Serving status - Status and Reports > Serving Status

Shut down - Administration > Shutdown

Shutdown Page - Administration > Shutdown

Simple Network Management Protocol (SNMP) - Administration > SNMP Configuration

Single Sign-on - Crawl and Index > Forms Authentication, Serving > Forms Authentication

SMTP server - Administration > Network Settings

SNMP configuration - Administration > SNMP Configuration

Sort by date - Crawl and Index > Document Dates

Spell check - Spell Checker

Status - Status and Reports > Crawl Status, Home > System Status, Home > Serving Status, Status and Reports > System Status, Status and Reports > Serving Status, Status and Reports > Event Log

Stylesheet - Serving > Front Ends > Output Format, Serving > Front Ends > XSLT Stylesheet Editor, Serving > Front Ends > Global Style Variables, Serving > Front Ends > Additional Results Page Components, Serving > Front Ends > Additional Results Page Components - Search Boxes, Serving > Front Ends > Suggestion Pages, Serving > Front Ends > Result Navigation and Separation Bars, Serving > Front Ends > Result Elements, Serving > Front Ends > Templates, Serving > Front Ends > Other Variables

Summaries - Status and Reports > Search Reports

Synonyms - Serving > Front Ends > Synonyms

Syslog server - Administration > Network Settings

System Event Log - Status and Reports > Event Log

System settings - Administration > System Settings

System Shutdown Page - Administration > Shutdown

T

Test Center link - Status and Reports > Crawl Status

Test Pattern Utility - Crawl and Index > Crawl URLs

Time server - Administration > Network Settings

U

URL dates - Crawl and Index > Document Dates

URL errors - Crawl and Index > Crawl URLs

URL exclusions - Crawl and Index > Crawl URLs

URL patterns - Crawl and Index > Crawl URLs, Rules for Valid URL Patterns

URL states - Status and Reports > Crawl Diagnostics

URL tracker - Status and Reports > Crawl Status

URLs - Crawl and Index > Crawl URLs, Status and Reports > Crawl Status, Crawl and Index > Host Load Schedule, Rules for Valid URL Patterns, Status and Reports > Crawl Diagnostics

URLs to test - Administration > Network Settings, Crawl and Index > Crawl URLs

Usage logs - Administration > Network Settings

User accounts - Administration > User Accounts

User agent name - Crawl and Index > HTTP Headers

User Impersonation - Serving > Forms Authentication

Utility, Page Layout - Serving > Front Ends > Output Format > Page Layout Helper

Utility, Pattern Tester - Crawl and Index > Crawl URLs

V

View System Events - Status and Reports > Event Log

View URLs - Status and Reports > Crawl Diagnostics

W

Words to match queries - Serving > Front Ends > KeyMatch

Words to avoid - Crawl and Index > Collections

X

XML - Serving > Front Ends > Output Format

XML Reference - Google XML Reference

XSLT - Serving > Front Ends > Output Format - XSLT Stylesheet Editor

XSLT Stylesheet - Serving > Front Ends > Output Format, Serving > Front Ends > Output Format > XSLT Stylesheet Editor

© 2001-2005 Google Inc. All Rights Reserved.
