COMS E6125 Web-enHanced Information Management (WHIM)

Prof. Gail Kaiser

Spring 2011

25 January 2011

Today’s Topic

• Basic Web Mechanics
  – URI
  – HTTP
  – Client/Server Intermediaries

What is a “URI”?

• Uniform Resource Identifier
• Compact string of characters for identifying an abstract or physical resource
• Conforms to a simple and extensible format
• Example: http://bank.cs.columbia.edu/classes/cs6125

What is a “Resource”?

• Some piece of information that can be identified by a URI
• The most common kind of resource is a file
• But may also be a dynamically-generated query result, the output of a script, a document available in several languages or formats, etc.

Uniform Resource Identifier

• Uniform: aka Universal - same string can be used with same semantic interpretation, even when mechanisms used to access the resource differ
• Resource: Conceptual mapping to an entity or set of entities - not necessarily the entity that corresponds to that mapping at any particular instance in time
• Identifier: An object that can act as a reference to something that has identity

Key Requirement: Transcribability

• May be transcribed from non-network source
• Often needs to be remembered by people
• Should consist of characters that are most likely to be able to be typed into a computer, within the constraints imposed by keyboards (and related input devices) across languages and locales

Why do we usually say URL rather than URI?

• A Uniform Resource Locator (URL) refers to the subset of URIs that identify resources via a representation of their primary access mechanism (i.e., their network “location”)

• Most popular form of URI

What’s a URI that’s not a URL?

• URN = Uniform Resource Name
• Subset of URIs that denote a resource independent of its current location, the name by which it is known, or the mechanism by which it is accessed
• Required to remain globally unique and persistent even when the resource ceases to exist or becomes unavailable
• Thus not necessarily “retrievable”

URN vs. URL Example

• Assume a published book (the resource)
• The ISBN (International Standard Book Number) is a 10-digit (since 2007, 13-digit) number that uniquely identifies books and book-like products published internationally - this is the URN
• The entire contents of the book might be placed on a Web server at http://www.xyz.com/book.gz and an FTP server at ftp://ftp.xyz.com/book.gz - both of these are URLs
• All of these are URIs

URI Syntax

• <scheme>:<scheme-specific-part>
• For a URL, the scheme indicates the protocol employed for retrieval (http, ftp, file, mailto, etc.)
• More generally, a scheme is a specification for defining the syntax and semantics of the rest of the URI
• Extensible because new schemes can be defined, with their own scheme-specific format after the colon (:)
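The `<scheme>:<scheme-specific-part>` split can be sketched in a few lines of Python. This is only an illustration: the helper name `split_uri` and the sample URN and mailto values are assumptions made up for the example, not part of the lecture.

```python
def split_uri(uri: str) -> tuple[str, str]:
    """Split a URI at the first colon into (scheme, scheme-specific part)."""
    scheme, colon, rest = uri.partition(":")
    if not colon or not scheme:
        raise ValueError(f"no scheme found in {uri!r}")
    # Scheme names are case-insensitive, so normalize to lower case.
    return scheme.lower(), rest


for uri in (
    "http://bank.cs.columbia.edu/classes/cs6125",
    "ftp://ftp.xyz.com/book.gz",
    "urn:isbn:0451450523",         # hypothetical URN, for illustration only
    "mailto:someone@example.org",  # hypothetical address
):
    print(split_uri(uri))
```

Everything after the first colon is left untouched, because its format is defined by the scheme itself.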

URL Notation

• <scheme>://<authority><path>?<query>
  – authority: typically, an Internet domain name
  – path: specific to the authority, identifies the resource within the scope of the scheme and authority
  – query: a string of information to be interpreted by the resource
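As a rough illustration of this notation, Python's standard `urllib.parse.urlsplit` breaks a URL into the same components. The URL below is a made-up combination of examples used elsewhere in these slides, so treat it as hypothetical.

```python
from urllib.parse import urlsplit

# Hypothetical URL combining a non-default port with a query string.
parts = urlsplit("http://www.foo.com:2011/run.cgi?name1=val1&name2=val2")

print(parts.scheme)    # the scheme
print(parts.netloc)    # the authority, including the port
print(parts.hostname)  # the domain name inside the authority
print(parts.port)      # the port as an integer, or None if absent
print(parts.path)      # the path
print(parts.query)     # the query string, without the "?"
```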

What’s a “domain name”?

• Domain Name System (DNS)
  – Maps domain names to IP addresses and vice versa
  – Hierarchy of DNS servers for top level domains (.com, .edu, .uk, etc.), second level domains (columbia.edu, ibm.com, etc.), and so on
  – Eventually finds IP address for individual host (e.g., bank.cs.columbia.edu)
  – DNS servers cache responses based on TTL = Time to Live
• Originated ~1982, e.g., for email (gk60@CMUA -> [email protected] -> [email protected])

Example URLs

• http://www.ietf.org/rfc/rfc3986.txt
• gopher://seanm.ca/00/nerd/gopher-manifesto.txt
• mailto:[email protected]
• telnet://bank.cs.columbia.edu/

Relative URLs

• Allows document trees to be independent of their location and scheme
• A single set of hypertext documents can be simultaneously traversable via each of the ftp, http and file schemes
• Such document trees can be moved, as a whole, without changing any of the relative references
• Resolved to full (absolute) URLs using a base URL

Example Relative URLs

• http://somehost/absolute/URL/with/absolute/path/to/resource.txt
• /relative/URI/with/absolute/path/to/resource.txt
• relative/path/to/resource.txt
• ../../../resource.txt
• resource.txt
• /resource.txt#frag01
• #frag01
• [empty string]
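Resolution against a base URL can be tried out with Python's standard `urllib.parse.urljoin`. This is a small sketch; the base URL `http://somehost/a/b/c/d.html` is an assumption chosen for the example.

```python
from urllib.parse import urljoin

BASE = "http://somehost/a/b/c/d.html"  # hypothetical base URL

references = [
    "/x/resource.txt",                # absolute path: replaces the base path
    "relative/path/to/resource.txt",  # resolved against the base directory /a/b/c/
    "../../../resource.txt",          # climbs three directory levels
    "resource.txt",                   # sibling of d.html
    "/resource.txt#frag01",
    "#frag01",                        # same document, different fragment
    "",                               # empty reference: the base URL itself
]
for ref in references:
    print(repr(ref), "->", urljoin(BASE, ref))
```

Moving the whole tree to another host or scheme only changes `BASE`; the relative references stay as they are.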

URI “Standard”

• URI is an Internet protocol element defined currently in RFC 3986 (2005)
• Originally RFC 1630 (1994)

What is an “RFC”?

• Request for Comments
• One of a series, begun in 1969, of numbered informational documents and standards followed by commercial software and freeware in the Internet and Unix communities
• All Internet standards are recorded in RFCs

Who keeps track of RFCs?

• IETF = Internet Engineering Task Force
• Open, all-volunteer organization, with no formal membership or membership requirements
• Organized into a large number of working groups, each dealing with a specific topic
• April 1st RFCs, e.g., http://www.apps.ietf.org/rfc/rfc3514.html

What is “W3C”?

• World Wide Web Consortium defines data formats and usage conventions as well as Internet protocols relevant to the Web
• Members pay fees depending on country, revenues and non-profit/for-profit status
• Otherwise organized similarly to the IETF, but writes “Recommendations” instead of “Requests for Comments”
• http://www.w3.org/

Back to URLs

• Most Web documents use the “http” scheme (or “https” = http over TLS/SSL)
• What is “http” (HyperText Transfer Protocol)?

HTTP = HyperText Transfer Protocol

• Most Web documents use the “http” scheme, the default Internet protocol used to deliver data on WWW
• Usually through TCP/IP sockets on port 80, but can use any port and can be implemented on top of any reliable networking protocol
• A Web browser (HTTP client) sends requests to a Web server (HTTP server), which sends responses back to the client

What’s “TCP/IP”?

• IP = Internet Protocol
  – Delivers individual packets from one host to another, based on their IP address (in IPv4, four 8-bit octets as in 128.59.11.100)
  – Network routers direct traffic of IP packets
• Analogous to telephone numbers (area code plus exchange plus 4 digits plus extension) and postal address (zip code plus street name plus building number plus apartment number)
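To make the "four 8-bit octets" concrete, the standard `ipaddress` module can show the same IPv4 address both as its four bytes and as a single 32-bit integer. A minimal sketch using the address from the slide:

```python
import ipaddress

addr = ipaddress.IPv4Address("128.59.11.100")

octets = list(addr.packed)  # four bytes, one per dotted field
print(octets)
print(int(addr))            # the same address as one 32-bit integer
```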

What’s “TCP/IP”?

• TCP = Transmission Control Protocol
  – Provides an abstraction of reliable, bidirectional connections for the delivery of IP packets to a particular port at a given IP address
  – The so-called well known ports (< 1024) are reserved for specific protocols (telnet, ftp, smtp, pop3, imap, etc.)
  – By default, HTTP uses port 80; this can be changed in the URL
  – http://www.foo.com:2011/doc.html
• Main alternative is UDP = User Datagram Protocol, no connection, no reliable delivery (used by DNS)

HTTP History

• HTTP/0.9 (1990) - simple protocol for raw data transfer
• HTTP/1.0 (1996) - allows MIME-like messages, containing meta-information about the resources transferred and modifiers on the request/response semantics
• HTTP/1.1 (1999) – lots of practical improvements, e.g., caching policies, chunked encoding, persistent connections
• W3C closed activity but IETF still has a working group to revise

What is “MIME”?

• Multipurpose Internet Mail Extensions
• Standard representation for “complex” message bodies (numerous RFCs since 1993)
• Examples include messages with embedded graphics or audio clips, messages with file attachments, messages in Japanese or Russian, signed messages

MIME Header Fields

• Mime-Version, Content-Type, Content-Transfer-Encoding, Content-Description, Content-ID, Content-Location, Content-Disposition, Part Body
• Discrete (text, image, audio) and Multipart (mixed, digest) content types
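A quick way to see these header fields is to build a small multipart message with Python's standard `email` package. This is a sketch only: the subject, file name, and attachment bytes are placeholders invented for the example.

```python
from email.message import EmailMessage

msg = EmailMessage()
msg["Subject"] = "MIME demo"  # hypothetical header value
msg.set_content("Plain-text part of the message.")  # discrete type: text/plain
msg.add_attachment(
    b"\x89PNG placeholder bytes",  # placeholder data, not a real image
    maintype="image",
    subtype="png",
    filename="figure.png",         # hypothetical file name
)

print(msg["MIME-Version"])
print(msg.get_content_type())      # the container becomes a multipart type
for part in msg.iter_parts():
    print(part.get_content_type(), part["Content-Disposition"])
```

Adding the attachment turns the message into a multipart container whose parts each carry their own Content-Type.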

HTTP Properties

• Uses URLs for identifying Web resources
• Request-response – always initiated by client to server (never vice versa), the server responds with results
• Stateless – each request-response pair independent from every other, so any state information (login credentials, shopping carts, etc.) needs to be encoded somehow

HTTP Request/Response

[Diagram labels: HTTP client, HTTP request, Port 80, Processing, Response, Other port]

• Web server processes HTTP requests, generally over TCP Port 80
• The request specifies a resource URL
• The server parses the URL and processes the request:
  – Returns a document with its type information
  – Invokes a program or script, and returns its output
• The output (including metadata) is sent back to the client as a response message

HTTP Requests

• Small number of request types (GET, POST, HEAD, etc.)

• Request may contain additional information, e.g. client info, parameters for forms, cookies, etc.

• Consists of a start-line, zero or more headers (one per line), an empty line (CRLF) indicating the end of the header fields, and possibly a message-body

HTTP Responses

• Larger number of response codes (200 OK, 404 NOT FOUND)
• Message body only allowed with certain response status codes
• Includes MIME metadata as well as “payload” (data)

Start Line

• HTTP Version (0.9, 1.0, 1.1)
• URI
• Method (request) or Status Code (response)

Sample HTTP Exchange

• To retrieve the file at the URL http://bank.cs.columbia.edu
• First open a socket to the host bank.cs.columbia.edu, port 80 (use the default port because none is specified in the URL)

Connect to 128.59.11.100 on port 80 ... ok

Sample

• Then, send something like the following through the socket:

GET / HTTP/1.1[CRLF]
Host: bank.cs.columbia.edu[CRLF]
Connection: close[CRLF]
User-Agent: Web-sniffer/1.0.37 (+http://web-sniffer.net/)[CRLF]
Accept-Encoding: gzip[CRLF]
Accept-Charset: ISO-8859-1,UTF-8;q=0.7,*;q=0.7[CRLF]
Cache-Control: no-cache[CRLF]
Accept-Language: de,en;q=0.7,en-us;q=0.3[CRLF]
Referer: http://web-sniffer.net/[CRLF]
[CRLF]
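The layout of such a request (start-line, one header per line, an empty line, optional body) can be sketched as a small serializer. The function name `build_request` is an assumption for this example, and only a few of the headers from the sample are reproduced.

```python
CRLF = "\r\n"


def build_request(method: str, target: str, headers: dict[str, str],
                  body: bytes = b"") -> bytes:
    """Serialize an HTTP/1.1 request: start-line, headers, empty line, body."""
    lines = [f"{method} {target} HTTP/1.1"]
    lines += [f"{name}: {value}" for name, value in headers.items()]
    head = CRLF.join(lines) + CRLF + CRLF  # the empty line ends the header section
    return head.encode("ascii") + body


request = build_request("GET", "/", {
    "Host": "bank.cs.columbia.edu",
    "Connection": "close",
    "Accept-Encoding": "gzip",
})
print(request.decode("ascii"))
```

The resulting bytes are what a client would write to the socket after connecting to port 80.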

Sample

• The server should respond with something like the following:

HTTP Status Code: HTTP/1.1 403 Forbidden[CRLF]
Content-Length:218[CRLF]
Content-Type:text/html[CRLF]
Server:Microsoft-IIS/6.0[CRLF]
X-Powered-By:ASP.NET[CRLF]
Date: Sat, 22 Jan 2011 14:024:22 GMT[CRLF]
Connection:close[CRLF]
<html><head><title>Error</title></head><body><head><title>Directory Listing Denied</title></head>[LF]
<body><h1>Directory Listing Denied</h1>This Virtual Directory does not allow contents to be listed.</body></body></html>

Some Request Headers

• User-Agent: identifies the program that's making the request, in the form "Program-name/x.xx", where x.xx is the alphanumeric version of the program (e.g., browser)
  – User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9) Gecko/2008052906 Firefox/3.0
• Referer: the URL of the previous webpage from which a link was followed
  – Referer: http://web-sniffer.net/

Some Response Headers

• Server: analogous to User-Agent:, identifies the server software in the form "Program-name/x.xx"
  – Server: Apache/2.2.8 (Ubuntu)
• Last-Modified: gives the modification date of the resource that's being returned, e.g., for use in caching
  – Use Greenwich Mean Time, in the format Last-Modified: Sat, 22 Jan 2011 14:46:32 GMT

HTTP URIs

• A server may accept URIs up to some bounded length (often 255) or of “unbounded” length; if a URI is too long for it to handle, it returns status code 414 (Request-URI Too Long)
• Equivalence comparison: the following three URIs are equivalent
  – http://abc.com:80/~smith/home.html
  – http://ABC.com/%7Esmith/home.html
  – http://ABC.com:/%7esmith/home.html
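One way to check such equivalences is to normalize each URL before comparing. The sketch below is an assumption-laden illustration, not a complete normalizer: it handles only the http scheme's default port, lower-cases the host, and decodes percent-escapes of unreserved characters. The function names are invented for the example.

```python
import re
import string
from urllib.parse import urlsplit, urlunsplit

UNRESERVED = set(string.ascii_letters + string.digits + "-._~")


def _normalize_escape(match: re.Match) -> str:
    """Decode %XX when it encodes an unreserved character, else upper-case it."""
    char = chr(int(match.group(1), 16))
    return char if char in UNRESERVED else "%" + match.group(1).upper()


def normalize_http_url(url: str) -> str:
    parts = urlsplit(url)
    host = parts.hostname or ""  # urlsplit already lower-cases the host name
    port = parts.port            # None when the port is absent or empty
    netloc = host if port in (None, 80) else f"{host}:{port}"
    path = re.sub(r"%([0-9A-Fa-f]{2})", _normalize_escape, parts.path) or "/"
    return urlunsplit((parts.scheme.lower(), netloc, path, parts.query, parts.fragment))


urls = [
    "http://abc.com:80/~smith/home.html",
    "http://ABC.com/%7Esmith/home.html",
    "http://ABC.com:/%7esmith/home.html",
]
print({normalize_http_url(u) for u in urls})  # a single element if all are equivalent
```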

Request Messages

• Method SP Request-URI SP HTTP-Version CRLF
• GET http://www.gailkaiser.org
• Equivalent to client making TCP connection to bank.cs.columbia.edu on port 80, then sending
  GET /
  Host: www.gailkaiser.org
• Host field allows for virtual hosts

What is a “virtual host”?

• Enables the same machine to host multiple domain names, sometimes at the same IP address (name-based virtual hosting)

• Important for website hosting (e.g., www.foo.com maps to /www/foo/site1 and www.bar.com maps to /www/bar/site2), but usually there can be only one secure https website per IP address/port
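Name-based virtual hosting boils down to a lookup keyed on the Host header. The sketch below uses the two sites from the slide; the table name `DOCUMENT_ROOTS` and the function `document_root` are hypothetical, and a real server's configuration would look different.

```python
# Hypothetical mapping from Host header value to document root.
DOCUMENT_ROOTS = {
    "www.foo.com": "/www/foo/site1",
    "www.bar.com": "/www/bar/site2",
}


def document_root(headers: dict[str, str]) -> str:
    """Pick a document root from the Host header (name-based virtual hosting)."""
    host = headers.get("Host", "").split(":")[0].lower()  # drop any :port suffix
    try:
        return DOCUMENT_ROOTS[host]
    except KeyError:
        raise LookupError(f"no virtual host configured for {host!r}") from None


print(document_root({"Host": "www.foo.com"}))
print(document_root({"Host": "WWW.BAR.COM:80"}))
```

Both requests can arrive at the same IP address and port; only the Host header tells them apart.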

GET

• Retrieve whatever information (in the form of an entity) is identified by the URL
• If the URL refers to a data-producing process, it is the produced data (given the input parameters after the “?”, if any) that is returned as the entity in the response - not the source text of the process (unless that text happens to be the output of the process)

http://foo.com/run.cgi?name1=val1&name2=val2
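The parameters after the “?” can be decoded, and built, with the standard `urllib.parse` helpers. A minimal sketch; the second value in the `urlencode` call is made up to show how spaces are escaped.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

url = "http://foo.com/run.cgi?name1=val1&name2=val2"
params = parse_qs(urlsplit(url).query)
print(params)  # each name maps to a list, since a name may repeat

# Going the other way: build a query string from name/value pairs.
query = urlencode({"name1": "val1", "name2": "a value with spaces"})
print(query)
```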

Conditional and Partial GET

• Conditional if the request message includes an If-Modified-Since, If-Unmodified-Since, If-Match, If-None-Match, or If-Range header field

• Partial if the request message includes a Range header field

• Don’t retrieve data the client doesn’t need (e.g., at least the part already up to date in cache)
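The server-side decision for one of these headers, If-Modified-Since, might look roughly like the sketch below. The function name `conditional_get_status` is an assumption, the other conditional headers are not handled, and malformed dates are not caught.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def conditional_get_status(headers: dict[str, str], last_modified: datetime) -> int:
    """Return 304 if the client's copy is still current, otherwise 200."""
    since = headers.get("If-Modified-Since")
    if since is not None and last_modified <= parsedate_to_datetime(since):
        return 304  # Not Modified: headers only, no message body
    return 200      # send the full entity


modified = datetime(2011, 1, 22, 14, 46, 32, tzinfo=timezone.utc)
stamp = format_datetime(modified, usegmt=True)  # HTTP date format

print(stamp)
print(conditional_get_status({"If-Modified-Since": stamp}, modified))
print(conditional_get_status({}, modified))
```

A client that receives 304 keeps using its cached copy, so the entity is not transferred again.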

HEAD

• Identical to GET except that the server must not return a message-body in the response - only returns headers
• Often used for testing hypertext links for validity and modification
• Can mark cache entries as stale if certain header information changes (e.g., length, last-modified)

POST

• Used to request that the origin server accept the entity enclosed in the request as a new subordinate of the resource identified by the Request-URI in the Request-Line
• Actual function performed by the POST method is determined by the server, usually dependent on the Request-URI

POST supports several functions

• Annotation of an existing resource
• Posting a message to a bulletin board, newsgroup, mailing list, or similar group of articles
• Providing a block of data, such as the result of submitting a form, to a data-handling process
• Extending a database through an append operation

POST vs. GET

• GET can only be used to send relatively small amounts of data to a server, with the data following the ? character
• The rest of the request-URI (before the ?) refers to some kind of processing program

GET /run.cgi?name1=val1&name2=val2 HTTP/1.0

PUT and DELETE

• Often unsupported (501 Not Implemented)
• PUT requests that the enclosed entity be stored under the supplied Request-URI
  – May create a new resource at a new URI, or modify an existing resource already at that URI
• DELETE requests that the origin server delete the resource identified by the Request-URI
  – May be overridden, e.g., by human intervention, even if status code indicates successfully completed
• Effectively supplanted by WebDAV

OPTIONS and TRACE

• OPTIONS allows the client to determine the requirements associated with a resource, or the capabilities of a server (OPTIONS *), without implying a resource action or initiating a resource retrieval
• TRACE is used to invoke an application-layer loop-back of the request message, allowing the client to see what is being received at the other end of the request chain, for testing or diagnostic purposes

HTTP Responses

• HTTP-Version SP Status-Code SP Reason-Phrase CRLF
• Example: HTTP/1.0 404 Not Found
• Status code: 3-digit integer result code of the attempt to understand and satisfy the request
• Reason phrase: short textual description of the Status-Code
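Splitting a status line into its three fields takes only a few lines. A minimal sketch, with the function name `parse_status_line` chosen for this example:

```python
def parse_status_line(line: str) -> tuple[str, int, str]:
    """Split 'HTTP-Version SP Status-Code SP Reason-Phrase' into its parts."""
    # Split on the first two spaces only: the reason phrase may contain spaces.
    version, code, reason = line.rstrip("\r\n").split(" ", 2)
    if not (len(code) == 3 and code.isdigit()):
        raise ValueError(f"bad status code: {code!r}")
    return version, int(code), reason


print(parse_status_line("HTTP/1.0 404 Not Found\r\n"))
```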

Response Messages

• Larger number of response codes (200 OK, 404 NOT FOUND)
• Message body only allowed with certain response status codes
• Includes MIME metadata as well as “payload” (data)

Status Codes

• Applications need only understand the first digit, and can treat unrecognized codes as equivalent to x00
• 1xx: Informational - Request received, continuing process ("100" : Continue, relevant to persistent connections in HTTP 1.1)
• 2xx: Success - The action was successfully received, understood and accepted ("200" : OK)
• 3xx: Redirection - Further action must be taken in order to complete the request ("300" : Multiple Choices)
• 4xx: Client Error - The request contains bad syntax or cannot be fulfilled ("400" : Bad Request)
• 5xx: Server Error - The server failed to fulfill an apparently valid request ("500" : Internal Server Error)
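The first-digit rule is easy to express in code. In this sketch the names `STATUS_CLASSES`, `status_class`, and `fallback_code` are assumptions made up for the example, and 499 stands in for a code the client does not recognize.

```python
STATUS_CLASSES = {
    1: "Informational",
    2: "Success",
    3: "Redirection",
    4: "Client Error",
    5: "Server Error",
}


def status_class(code: int) -> str:
    """Classify a status code by its first digit."""
    return STATUS_CLASSES[code // 100]


def fallback_code(code: int) -> int:
    """The x00 code a client can fall back to for an unrecognized status code."""
    return (code // 100) * 100


print(status_class(404), fallback_code(499))
```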

HTTP is “Stateless”

• Server doesn’t remember anything about client between connections
• Not even between requests during the same persistent connection, except TCP data
• So how does HTTP support “remembering” the user during a session or across sessions?
• Some state can be encoded in complex URLs or otherwise in the web page itself (e.g., query strings added to links, hidden form fields)
• Or saved on client in “cookies”

Cookies

• String associated with a name/domain/path, stored at the browser
• Series of name-value pairs, interpreted by the web application
• Create in HTTP response with “Set-Cookie: ” (or “Set-Cookie2: ”)
• In all subsequent requests to this site, until cookie’s expiration, the client sends the HTTP header “Cookie: ” (or “Cookie2: ”)
• Often have an expiration (otherwise expire when browser closed)
• Various technical, privacy and security issues (e.g., inconsistent state after using “back” button, third-party cookies, cross-site scripting)

Cookie Example

• Set-Cookie: name=newvalue; expires=date; path=/; domain=.example.org
• Set-Cookie: RMID=732423sdfs73242; expires=Sat, 31-Dec-2011 23:59:59 GMT; path=/; domain=.example.net
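Python's standard `http.cookies` module can parse a Set-Cookie value like the second example and show its attributes. This is a sketch of the client side only; building the `Cookie:` header by hand at the end is a simplification for illustration.

```python
from http.cookies import SimpleCookie

header_value = ("RMID=732423sdfs73242; expires=Sat, 31-Dec-2011 23:59:59 GMT; "
                "path=/; domain=.example.net")

jar = SimpleCookie()
jar.load(header_value)
morsel = jar["RMID"]
print(morsel.value, morsel["expires"], morsel["path"], morsel["domain"])

# The header a client would send back on later requests to the site.
cookie_header = "Cookie: " + "; ".join(f"{name}={m.value}" for name, m in jar.items())
print(cookie_header)
```

Only the name-value pair goes back to the server; expires, path, and domain tell the browser when and where to send it.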

HTTP Request/Response

• In HTTP 1.0, a connection is established by the client prior to each request and closed by the server after sending the response

• Either party may close the connection prematurely, due to user action, automated time-out, or program failure

• Closing of the connection by either or both parties always terminates the current request, regardless of its status

• But TCP connections are expensive…

HTTP 1.1 “Persistent Connection”

• Many Web pages consist of several files on the same server

• If an HTTP 1.1 client sends multiple (pipelined) requests through a single connection, the server should send responses back in the same order

• Intermediate responses "100" : Continue

How does the connection finally get closed?

• If a request includes the "Connection: close" header, that request is the final one for the connection and the server should close the connection after sending the response

• The server should also close an idle connection after some timeout period
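The close-or-keep decision can be sketched as a small predicate. The function name `keep_alive` is an assumption, idle timeouts are not modeled, and HTTP/1.0 is treated as one request per connection.

```python
def keep_alive(version: str, headers: dict[str, str]) -> bool:
    """Decide whether the connection stays open after this response."""
    tokens = {t.strip().lower() for t in headers.get("Connection", "").split(",")}
    if "close" in tokens:
        return False
    # HTTP/1.1 connections are persistent by default; this sketch treats
    # HTTP/1.0 as one request per connection.
    return version == "HTTP/1.1"


print(keep_alive("HTTP/1.1", {}))
print(keep_alive("HTTP/1.1", {"Connection": "close"}))
print(keep_alive("HTTP/1.0", {}))
```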

Advantages of Persistent Connections

• Requests and responses can be pipelined - a client makes multiple requests without waiting for each response

• Network congestion reduced by fewer packets for TCP opens, and by allowing TCP sufficient time to determine the congestion state of the network

• Latency on subsequent requests is reduced since there is no time spent in the TCP connection’s opening handshake

Basic HTTP Architecture

Intermediary

• Program sitting in the path between HTTP clients and servers

• Acts as a server to clients and as a client to origin servers or other intermediaries

Purposes of Intermediaries

– Reduce communication cost
– Lower the latency perceived by the client
– Reduce the load on the network
– Reduce the load on the Web server
– Implement security for an organization
– Translate requests to various servers

Proxy

• Forwarding agent
• Receives request, rewrites all or parts of the message, and forwards the reformatted request toward the server identified by the URI

Gateway

• Receiving agent
• Acts as a layer above some other server(s) and, if necessary, translates the requests to the underlying server's protocol
• Example: Web mail accessing an IMAP server
  – A URL identifies the mail server, mailbox, password
  – Converts the HTTP request to an IMAP request, gets the IMAP response, converts it to HTTP response

Tunnel

• Relay point between two connections without changing the message
• Looks at the first line of the HTTP message to locate the host to be contacted and accept the request
• Simply relays bits between the two connection points
• Does not parse or interpret messages
• Used when the communication needs to pass through a firewall

Transcoder

• Modifies data as it passes to clients, e.g., to filter ads, reduce image sizes, compress content
• Particularly useful for wireless and/or constrained devices
  – Convert HTML to XHTML MP
  – Modify content to fit small screen
  – Convert modality of interaction, e.g., driving directions from displaying text to playing audio

Caching

• Request/response chain is shortened if one of the participants along the chain has a cached response applicable to request

HTTP 1.1 Caching Support

• Allows a server to determine caching policies in its response
  – Expires xx-xx-xx yy:yy:yy.yy
  – Cache-Control: no-store – don’t cache at all
  – Cache-Control: no-cache – validate every time or don’t cache
  – Cache-Control: private – can’t keep in a public cache

• Secure sessions (https) generally not cached
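
A cache's handling of the three `Cache-Control` directives above might be sketched like this. The function names and the returned strings are assumptions chosen for readability; real caches also weigh `Expires`, `max-age`, and other directives not covered here.

```python
def parse_cache_control(value):
    """Split a Cache-Control header value into {directive: argument or None}."""
    directives = {}
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue
        name, _, arg = part.partition("=")
        directives[name.strip().lower()] = arg.strip().strip('"') or None
    return directives


def cache_decision(value, shared_cache):
    """Decide what a cache may do with a response carrying this header value."""
    d = parse_cache_control(value)
    if "no-store" in d:
        return "do not store"
    if "private" in d and shared_cache:
        return "do not store"  # only the user's own cache may keep it
    if "no-cache" in d:
        return "store, but revalidate before reuse"
    return "store"
```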


HTTP 1.1 Chunked Encoding

• Faster response for dynamically-generated pages or very large pages

• Allows the beginning of a response to be sent before its total length is known

• Each chunk is prefixed by its size in bytes (written as a hexadecimal number)
• A zero size chunk indicates the end of the response message
• If a server is using chunked encoding it must set the Transfer-Encoding header to "chunked"
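
The framing can be sketched as below: each chunk is its size in hexadecimal, CRLF, the data, CRLF, and a zero-size chunk ends the body. The function names are illustrative, and the sketch leaves out trailers; chunk extensions are skipped on decode.

```python
def encode_chunked(pieces):
    """Frame an iterable of byte strings as an HTTP/1.1 chunked body."""
    out = bytearray()
    for piece in pieces:
        if piece:  # an empty piece would look like the terminating chunk
            out += f"{len(piece):X}\r\n".encode("ascii") + piece + b"\r\n"
    out += b"0\r\n\r\n"  # zero-size chunk marks the end
    return bytes(out)


def decode_chunked(data):
    """Reassemble the original body from a chunked byte string."""
    body = bytearray()
    pos = 0
    while True:
        eol = data.index(b"\r\n", pos)
        size = int(data[pos:eol].split(b";")[0].decode("ascii"), 16)
        if size == 0:
            return bytes(body)
        start = eol + 2
        body += data[start:start + size]
        pos = start + size + 2  # skip the CRLF that follows the chunk data


wire = encode_chunked([b"Hello, ", b"world"])
```

Because the sender only needs each piece's own length, it can start transmitting before the total length of the response is known.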


Summary

• Clients (browsers) often implement many schemes
• Technically, only the http scheme is the World Wide Web
• But many of the more recent schemes are also associated with the Web
• Clients do not always talk directly to the origin servers indicated in URLs


First Assignment: Logistics

• Due Tuesday February 1st by 10am
• Two pages (not including optional figures and required reference list)
• Submit by posting in Paper Proposals folder on CourseWorks
• Must be in a format I can read, which means pdf, word, html, plain ascii text (with all figures embedded or viewable in a browser without special “plugins”)


First Assignment: Paper Proposal

• Sketch the topic you have in mind
• Include tentative reference list (specific background reading to learn more about the topic)

• Some general topic areas suggested at http://bank.cs.columbia.edu/classes/cs6125/topics.htm, or invent your own


First Assignment: “Goal” of Paper

• Do not simply survey some topic
• Compare this to that, argue a position in favor or against something, evaluate something according to some meaningful criteria, etc.

• Explain why your topic is relevant to this course


First Assignment: Background Reading

• List some specific materials you intend to read to learn about the topic
  – Scholarly papers from conferences or journals
  – White papers
  – Third-party reviews or commentaries (blogs ok)
  – System documentation
  – Specifications of "standards" (or proposed standards)
  – Not advertising or publicity brochures
  – Not wikipedia

• Should include materials from at least two different points of view (e.g., do not get all your background information from the same website)


Upcoming Assignments: Paper

• Paper outline due Tuesday February 14th
• Full paper due Tuesday March 9th


Student Presentations

• Individual ~10 minute talk in class
• Schedule will be assigned (posted next week)
• One paragraph proposal, due Tuesday February 15th
• May be based on paper, project, or some other topic


Heads Up on Project

• Project Proposal due Tuesday March 9th
• Optionally work in teams (see http://bank.cs.columbia.edu/classes/cs6125/team_advice.htm)

• Build a new system or extend an existing system

• OR evaluate/compare one or more existing system(s)

• You may "continue" your paper topic towards the project, or do something entirely different


COMS E6125 Web-enHanced Information Management (WHIM)

Prof. Gail Kaiser

Spring 2011