1

CIS336 Website design, implementation and management
(also Semester 2 of CIS219, CIS221 and IT226)

Lecture 7
HTTP and Web Programming in Java

(Based on Møller and Schwartzbach, 2006, Chapter 8)

David Meredith
[email protected]

www.titanmusic.com/teaching/cis336-2006-7.html

2

The Internet and HTTP

• HTTP: Hypertext Transfer Protocol
  – a cornerstone of the infrastructure of the Web
  – prescribes how machines on the web exchange
    • HTML and XML documents
    • form field values
    • ...
  – uses a client-server model
    • communication follows a simple request-response pattern
      – client always initiates the interaction
      – client (e.g., browser) requests a resource by sending the URL of the resource (e.g., HTML file) to a server
      – if server accepts request then it returns the resource

3

Network layers

• Internet network protocols organised into a number of layers

• Network Interface Layer is hardware used to communicate bits from one physical location to another (e.g., ethernet)

Protocol stack (top to bottom):

OUR APPLICATIONS
THE APPLICATION LAYER: HTTP, FTP, SMTP, DNS
THE TRANSPORT LAYER: TCP, UDP
THE INTERNET LAYER: IP
THE NETWORK INTERFACE LAYER: Ethernet

4

Internet Layer

• Internet Layer is that of the Internet Protocol (IP)
  – IP addresses
    • used to identify machines on the network
    • e.g., 158.223.1.118 is the IP address of the Department of Computing Web server (www.doc.gold.ac.uk)
    • Internet Assigned Numbers Authority (IANA) manages allocation of IP addresses to organizations
    • 127.0.0.1 always refers to the current machine (also called localhost)
  – Datagram
    • packet of data of limited size
      – up to 65535 bytes, but only 1500 bytes on Ethernet network
  – IP defines how datagrams are sent across the network
    • involves routing through intermediate machines
  – IP is an unreliable protocol
    • datagrams may be lost, arrive out of order or duplicated


5

Transport Layer

• Transport layer contains Transmission Control Protocol (TCP)
  – transmits data in a stream of unbounded size
  – segments stream into IP datagrams and reassembles them at destination
  – Reliable protocol
    • retransmits lost datagrams
    • sorts datagrams into correct order when received
    • discards duplicate datagrams
  – Connection-oriented
    • connection set up between two machines
    • data can be sent in both directions across connection (full-duplex)


6

Sockets and ports

• End points of a TCP connection are called sockets
• Each socket is associated with a particular port on a particular machine
• Port is identified by an integer between 0 and 65535
  – allows single machine to have many simultaneous connections, each to a different port
  – Ports 0-1023: well-known ports
    • assigned to server applications executed by privileged processes (e.g., UNIX root user), e.g.,
      – port 80 reserved for HTTP communication
      – ports 20 and 21 reserved for FTP servers
      – port 25 reserved for SMTP servers
      – port 443 reserved for HTTPS
  – Ports 1024-49151: registered ports
    • allocated by IANA to avoid vendor conflicts
    • e.g., port 8080 reserved as alternative to 80 for running a web server using ordinary user privileges
  – Ports 49152-65535: dynamic or private ports
    • can be freely used by any client or server program

• Browsers obtain ports for their TCP sockets arbitrarily among unused non-well-known ports
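
The three port ranges listed on this slide can be captured in a few lines of Java. This is only an illustrative sketch: the class name PortRanges and the method classify are made up for this example and do not come from the lecture.

```java
class PortRanges {
    // Names the range a port number falls into, using the boundaries from the slide.
    static String classify(int port) {
        if (port < 0 || port > 65535) {
            throw new IllegalArgumentException("not a valid port: " + port);
        }
        if (port <= 1023) return "well-known";      // e.g., 80 (HTTP), 25 (SMTP), 443 (HTTPS)
        if (port <= 49151) return "registered";     // e.g., 8080
        return "dynamic/private";                   // 49152-65535
    }

    public static void main(String[] args) {
        for (int port : new int[] {80, 8080, 50000}) {
            System.out.println(port + " is a " + classify(port) + " port");
        }
    }
}
```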

7

User Datagram Protocol (UDP)

• User datagram protocol (UDP) is an alternative to TCP in the transport layer
  – UDP is unreliable and datagram-oriented
  – faster than TCP
  – can be used for voice and video where speed is important and occasional losses are acceptable

• UDP provides foundation for the domain name system (DNS)

8

IP is getting old

• Specifications for TCP/IP are from 1981
  – original ideas from 1960s developed by DARPA
• Most internet traffic uses IPv4
  – more than 20 years old
  – shortage of IP addresses
    • even though IPv4 allows for about 4 billion addresses

• IPv6 solves IP address shortage

9

Application Layer

• Application layer contains applications of the transport layer, e.g.,
  – HTTP, FTP, SMTP, DNS
• HTTP requests and responses transmitted using TCP
• Two versions of HTTP:
  – HTTP/1.0
  – HTTP/1.1
    • becoming more prevalent
    • provides better support for caching, bandwidth optimization, error notification, security and content negotiation


10

Domain Name System (DNS)

• Defines structure of domain names
• Defines services governing association of IP addresses with domain names
  – e.g., association of 82.165.120.54 with www.titanmusic.com
• Benefits of DNS
  – can move services from one machine to another without changing domain name
  – single domain name can be associated with many IP addresses
    • allows replication of servers
      – decreases workload
      – improves fault tolerance
  – many domain names can be associated with a single IP address
    • virtual hosting
  – domain names are easier to remember than IP addresses

11

URIs

• URI identifies network resource and has the general form
  http://<host>:<port>/<path>?<query>
  – e.g. http://www.google.com/search?q=An+Introduction+to+XML+and+Web+Technologies
    • scheme is http
    • host is www.google.com which is a domain name that has been registered using DNS as being associated with one or more IP addresses
    • no port specified (port 80 is the default for http)
    • host and port identify web server program to be used to process request
    • path is search
      – path typically identifies file in server's file system or program that can generate appropriate response
    • query here is q=An+Introduction+to+XML+and+Web+Technologies
      – contains arguments to program that processes request
• URI may also contain fragment identifier that accesses a particular part (fragment) of a resource
  – prefixed by # symbol
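
As a rough illustration of the decomposition above, the standard class java.net.URI can split a URI into the same parts. The class name UriParts and the method describe are invented for this sketch. Note that getPath returns the path with its leading slash (/search), and getPort returns -1 when no port is given.

```java
import java.net.URI;

class UriParts {
    // Lists the components named on the slide on a single line.
    static String describe(String text) {
        URI uri = URI.create(text);
        int port = uri.getPort();                    // -1 when the URI names no port
        return "scheme=" + uri.getScheme()
             + " host=" + uri.getHost()
             + " port=" + (port == -1 ? "default" : String.valueOf(port))
             + " path=" + uri.getPath()
             + " query=" + uri.getRawQuery()         // null when there is no query
             + " fragment=" + uri.getRawFragment();  // null when there is no fragment
    }

    public static void main(String[] args) {
        System.out.println(describe(
            "http://www.google.com/search?q=An+Introduction+to+XML+and+Web+Technologies"));
    }
}
```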

12

Requests

• HTTP request sent from client to server using TCP
• Entering the address http://www.google.com/search?q=An+Introduction+to+XML+and+Web+Technologies in a web browser causes
  – TCP connection to be established with
    • the IP address associated by DNS with www.google.com
    • port 80 (default value)
  – message such as one above to be sent from browser to server
• Line 1 is a request line
  – here, uses GET method to ask the server to send the resource /search?q=An+Introduction+to+XML+and+Web+Technologies using HTTP/1.1
• Remaining lines are header lines, each with the form field: value
• HTTP/1.1 supports larger set of header fields than HTTP/1.0

13

Request header fields

• Host
  – contains domain name and port number (if not omitted) of server that receives request
  – optional in HTTP/1.0, mandatory in HTTP/1.1
• User-Agent
  – contains information about the user agent (e.g., browser) that sends the request
    • allows response to be tailored for use in the client software
• Referer
  – allows client to specify URI of resource from which URI in request was obtained
    • e.g., if HTML page contains an img link, then request for image will contain Referer field set to URI of HTML page

14

Accept header field

• Specifies media types that are acceptable as a response to the request

– also called MIME types (Multipurpose Internet Mail Extensions)

• now used for much more than e-mail

• Common media types are

– text/plain - plain, unformatted text

– text/html - HTML documents (not XHTML)

– text/xml - XML documents

– application/xml - for XML documents intended for application use, not human-readable XML (not clearly demarcated from text/xml)

– application/xhtml+xml - recommended for use with XHTML

– multipart/form-data - HTML-like form field values

– application/octet-stream - arbitrary binary data and data that doesn't fit into other categories

– image/jpeg - JPEG image

• Long list of media types maintained by Internet Assigned Numbers Authority (IANA)

• */* means all media types

• Quality parameter: mime-type;q=value

– value between 0 and 1 (1 is the default)

– indicates that mime-type is only acceptable if the quality of other mime types with higher q values is less than value times the quality of the mime-type format resource
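
One simplified way to use q values, sketched below, is to read each media type with its q parameter (1 when omitted) and rank the types from most to least preferred. This is a hedged illustration, not a full implementation of HTTP content negotiation. The class AcceptHeader, its methods and the sample header value are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class AcceptHeader {
    // Maps each media type in the header to its q value; a missing q means 1.0.
    static Map<String, Double> parse(String header) {
        Map<String, Double> result = new LinkedHashMap<>();
        for (String item : header.split(",")) {
            String[] parts = item.trim().split(";");
            double q = 1.0;
            for (int i = 1; i < parts.length; i++) {
                String param = parts[i].trim();
                if (param.startsWith("q=")) {
                    q = Double.parseDouble(param.substring(2));
                }
            }
            result.put(parts[0].trim(), q);
        }
        return result;
    }

    // Media types ordered by descending q; ties keep their order in the header.
    static List<String> byPreference(String header) {
        Map<String, Double> q = parse(header);
        List<String> types = new ArrayList<>(q.keySet());
        types.sort(Comparator.comparingDouble((String t) -> q.get(t)).reversed());
        return types;
    }

    public static void main(String[] args) {
        System.out.println(byPreference("text/html;q=0.8, application/xhtml+xml, */*;q=0.1"));
    }
}
```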

15

Other request header fields

• Accept-Language
  – defines acceptability of natural languages
• Accept-Encoding
  – specifies accepted content codings
    • usually compression techniques
• Accept-Charset
  – specifies accepted character sets
• All can use q parameters

16

Responses

• Response from server sent using same TCP connection as request
• Response consists of
  – header (lines 1-10 at left)
    • begins with status line indicating overall result of attempt to satisfy request
    • followed by header lines
  – body (lines 12-24 at left)
    • contains requested resource if request was successful
• Response at left returned when request URI is http://www.brics.dk/index.html

17

Response status line

• Status line (line 1 at left) tells us that
  – response uses HTTP/1.1
  – status code for request is 200 OK
    • means request succeeded and resource follows header
• Five classes of status codes:
  – 1xx indicates provisional, informational response
  – 2xx indicates success
    • e.g., 200 OK
  – 3xx indicates redirection
    • e.g., 301 Moved Permanently
  – 4xx indicates client error
    • e.g., 404 Not Found
  – 5xx indicates server error
    • e.g., 500 Internal Server Error
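
Because the class of a status code is given by its first digit, a client can classify any code with one integer division. The sketch below shows this. The class name StatusClass and the method describe are made up for the example.

```java
class StatusClass {
    // Maps a three-digit HTTP status code to the class named on the slide.
    static String describe(int code) {
        switch (code / 100) {
            case 1: return "informational";
            case 2: return "success";
            case 3: return "redirection";
            case 4: return "client error";
            case 5: return "server error";
            default: throw new IllegalArgumentException("not an HTTP status code: " + code);
        }
    }

    public static void main(String[] args) {
        for (int code : new int[] {200, 301, 404, 500}) {
            System.out.println(code + ": " + describe(code));
        }
    }
}
```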

18

HTTP Response header lines

• Date shows date and time when response sent
• Server contains information about the server software
• ETag used for cache management
  – usually digest of file size and last modification time

• Content-Length gives size of body in bytes

• Content-Type gives mime type of resource in body

• Content-Encoding indicates whether resource has been compressed (e.g., with gzip)

• Transfer-Encoding, if present, usually has value chunked, indicating that resource is being delivered in chunks

• Location used with status codes 301 and 307 to give new location of resource

19

HTML Forms

• When GO! button pressed, form field values sent to server as list of name-value pairs, encoded into a query string according to media type chosen using enctype attribute in form element
• Default media type is application/x-www-form-urlencoded (URL encoding) which would produce following:
  bet=someone+else&email=toot%40pop.com&send=GO%21
• Fields listed in order of appearance in source
  – & separates fields
  – = separates name from value
  – + replaces each space
  – non-alphanumeric characters escaped
  – line breaks encoded as %0d%0a
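
The encoding rules above can be reproduced with the standard class java.net.URLEncoder, as in the sketch below. The field names and values are the ones in the example query string on this slide. The class FormEncoder and its encode method are invented for illustration. URLEncoder leaves a few punctuation characters (".", "-", "*", "_") unescaped, so "non-alphanumeric characters escaped" is an approximation.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class FormEncoder {
    // Takes alternating names and values and builds a URL-encoded query string.
    static String encode(String... namesAndValues) {
        if (namesAndValues.length % 2 != 0) {
            throw new IllegalArgumentException("expected name/value pairs");
        }
        StringBuilder query = new StringBuilder();
        try {
            for (int i = 0; i < namesAndValues.length; i += 2) {
                if (query.length() > 0) query.append('&');          // & separates fields
                query.append(URLEncoder.encode(namesAndValues[i], "UTF-8"))
                     .append('=')                                   // = separates name from value
                     .append(URLEncoder.encode(namesAndValues[i + 1], "UTF-8"));
            }
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);                     // UTF-8 is always supported
        }
        return query.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("bet", "someone else", "email", "toot@pop.com", "send", "GO!"));
    }
}
```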

20

get and post methods in an HTML form

• If form method is get, then query string is appended to action URL:
  – http://www.brics.dk/ixwt/echo?bet=someone+else&email=toot%40pop.com&send=GO%21
  – Request line in HTTP request will therefore be
    GET /ixwt/echo?bet=someone+else&email=toot%40pop.com&send=GO%21 HTTP/1.1
• If form method is post, then query string is placed in body of HTTP request which might then be as above
  – as in response, body of request separated by empty line from header

21

The difference between get and post

• GET requests
  – mainly for retrieving data
  – safe to the client
    • client not responsible for any side-effects on server
  – idempotent - i.e., side effects of two or more identical requests are same as for one
  – generated by clicking on an HTML link
  – limited by maximum URL length imposed by browsers
  – only possible media type is application/x-www-form-urlencoded
• POST requests
  – are for operations that have side-effects on the server
  – user usually responsible for any side effects on server
  – not necessarily idempotent
    • clicking "reload" on a page that results from a POST request causes browser to warn that this might repeat the action the form has carried out
  – not limited by maximum URL length imposed by browsers
  – used for sensitive information (e.g. passwords) because servers usually log request URIs but not request bodies

22

Web programming with Java

• Java highly suitable for web (and XML) programming because
  – it is platform independent
  – it has a safe runtime model
    • array bound checks, automatic garbage collection, bytecode verification, etc.
  – supports multi-threading and concurrency
    • useful for servers and clients
  – supports Unicode
  – comes with a suite of powerful libraries for network programming

• Only other language that competes with it for web programming is C#

23

TCP/IP in Java

• Accessing TCP/IP in Java usually requires
  – java.net.InetAddress
    • represents an IP address
    • can do DNS look-ups
  – java.net.Socket
    • represents a TCP socket
  – java.net.ServerSocket
    • represents a server socket which is capable of waiting for requests from clients

24

Performing DNS look-up

• Above program takes a single argument which should be a domain name

• In line 7, getAllByName method used to produce an array of InetAddresses which contains the IP addresses associated with the domain name

• In line 9, getHostAddress method used to get IP address from each InetAddress object in array a and print it out

• getAllByName method may throw an UnknownHostException
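
The program discussed on this slide is not reproduced in the transcript, so its line numbers cannot be matched here. The sketch below is a reconstruction of the same idea under that caveat. The class name Lookup and the helper addressesOf are assumptions. When given an IP address literal such as 127.0.0.1, getAllByName returns that address without performing a DNS query.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.List;

class Lookup {
    // Returns the textual IP addresses for a name, or an empty list if it is unknown.
    static List<String> addressesOf(String name) {
        List<String> addresses = new ArrayList<>();
        try {
            for (InetAddress a : InetAddress.getAllByName(name)) {
                addresses.add(a.getHostAddress());
            }
        } catch (UnknownHostException e) {
            // The name could not be resolved: report no addresses.
        }
        return addresses;
    }

    public static void main(String[] args) {
        // Takes a single argument, which should be a domain name.
        for (String address : addressesOf(args[0])) {
            System.out.println(address);
        }
    }
}
```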

25

Finding the domain name and IP address of current machine

• Uses getLocalHost method in line 6 to construct an InetAddress object containing information about the name and IP address of the current machine on which the program is being executed

• Use getHostName and getHostAddress in lines 7 and 8 to get the name and IP address of the current machine and print them out

• getLocalHost method may throw an UnknownHostException

26

Making a TCP connection between a server and a client: The server

• New ServerSocket created on line 7

• Starts infinite loop in line 8, on each iteration of which,

– uses accept method in line 9 to get ss to listen for a connection to be made on the port given on the command line, then accepts it and creates a new socket, con, to represent the connection

– constructs an InputStreamReader, in, to read bytes from the input stream of con (line 10) and convert them to characters

– reads input using in, terminated with a 0 byte (lines 11-14) and stores in msg

– attaches PrintWriter object, out, to the output stream of con and prints "Simon says: " plus the message in msg on this stream (lines 15-17)

– closes the connection con (line 18)

– accept method may throw an IOException

27

Making a TCP connection between a server and a client: The client

• Establishes a connection with the SimpleServer by giving its IP address and port as command line arguments

• The third command line argument is a message to send to the server

• Attaches a PrintWriter to the output stream associated with the connection (line 8)

• Prints the message given as an argument to the program to this output stream and terminates the message with a zero byte

• The read method (line 14) returns -1 when end of stream is reached

• Then associates an InputStreamReader with the input stream associated with the connection and receives the message sent by the server

• Finally closes the connection (line 17)

• getOutputStream method may throw an IOException
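
Since the server and client listings are not part of this transcript, the following is a hedged sketch of the exchange the two slides describe: the client sends a message ending in a zero byte, and the server replies with "Simon says: " followed by the message, then closes the connection. The class SimonSays and its methods are invented for this example. The demo method runs both ends in one process on a free local port, which the original programs do not do.

```java
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

class SimonSays {
    // Reads characters until a 0 character or end of stream (read returns -1).
    static String readMessage(Reader in) throws IOException {
        StringBuilder msg = new StringBuilder();
        int c;
        while ((c = in.read()) > 0) msg.append((char) c);
        return msg.toString();
    }

    // Server side: accepts one connection, echoes the message with a prefix, closes.
    static void serveOnce(ServerSocket ss) throws IOException {
        try (Socket con = ss.accept()) {
            String msg = readMessage(
                new InputStreamReader(con.getInputStream(), StandardCharsets.UTF_8));
            PrintWriter out = new PrintWriter(
                new OutputStreamWriter(con.getOutputStream(), StandardCharsets.UTF_8));
            out.print("Simon says: " + msg);
            out.flush();
        }
    }

    // Client side: sends the message plus a terminating zero, then reads the reply.
    static String send(String host, int port, String message) throws IOException {
        try (Socket con = new Socket(host, port)) {
            PrintWriter out = new PrintWriter(
                new OutputStreamWriter(con.getOutputStream(), StandardCharsets.UTF_8));
            out.print(message);
            out.print('\0');
            out.flush();
            return readMessage(
                new InputStreamReader(con.getInputStream(), StandardCharsets.UTF_8));
        }
    }

    // Runs server and client together; port 0 asks the system for any free port.
    static String demo(String message) {
        try (ServerSocket ss = new ServerSocket(0)) {
            Thread server = new Thread(() -> {
                try { serveOnce(ss); } catch (IOException e) { throw new UncheckedIOException(e); }
            });
            server.start();
            String reply = send("127.0.0.1", ss.getLocalPort(), message);
            server.join();
            return reply;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo("hello"));
    }
}
```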

28

HTTP in Java (The hard way)

• Manually implements HTTP support on top of TCP/IP

• Sends request to Google and extracts the result

• Manually constructs an HTTP request (lines 8-11, 15-17) using fact that Google's "I'm Feeling Lucky" feature accepts GET requests of a particular format

• Parses response using fact that response always contains a Location header line

• First constructs a Socket and establishes a connection with Google server on port 80 (line 7)

• Constructs a query string in the right format for the "I'm Feeling Lucky" feature (lines 8-11)

• Writes the request header to an output stream attached to the socket (lines 12-18)

• Reads response a line at a time until finds a header line starting with "Location:" (while loop starting in line 24)

• Prints the URL value of this header line to standard output (line 26)

• Closes connection (line 35)
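
The two manual steps described above, writing the request header by hand and scanning the response for a Location line, can be sketched without any network access. The lecture's program is not in the transcript, so the class RawHttp, its methods and the example.org addresses are assumptions for illustration. The query format of the "I'm Feeling Lucky" feature is deliberately left out.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

class RawHttp {
    // Builds the text of a minimal HTTP/1.1 GET request; each line ends with CRLF.
    static String buildGet(String host, String pathAndQuery) {
        return "GET " + pathAndQuery + " HTTP/1.1\r\n"
             + "Host: " + host + "\r\n"          // mandatory in HTTP/1.1
             + "Connection: close\r\n"
             + "\r\n";                           // empty line ends the header
    }

    // Scans header lines for "Location:"; returns null if the header ends first.
    static String findLocation(BufferedReader response) {
        try {
            String line;
            while ((line = response.readLine()) != null && !line.isEmpty()) {
                if (line.regionMatches(true, 0, "Location:", 0, 9)) {
                    return line.substring(9).trim();
                }
            }
            return null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Convenience overload for a response already held in a string.
    static String findLocation(String responseText) {
        return findLocation(new BufferedReader(new StringReader(responseText)));
    }

    public static void main(String[] args) {
        System.out.print(buildGet("www.example.org", "/search?q=test"));
        String canned = "HTTP/1.1 302 Found\r\nLocation: http://www.example.org/result\r\n\r\n";
        System.out.println(findLocation(canned));
    }
}
```

In a real client the request text would be written to the output stream of a Socket connected to port 80, and the BufferedReader would wrap the socket's input stream.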

29

HTTP in Java (The easier way)

• HttpURLConnection class makes it easier to create HTTP requests and parse responses
• Above program does same as previous one but uses HttpURLConnection object to create a connection
• First construct a URL object (line 13) then use its openConnection method to create a URLConnection
• URLConnection is an abstract class but when URL's scheme is http, openConnection creates an HttpURLConnection
  – return value of openConnection should therefore be cast to the correct class
• Read http://www.google.com/terms_of_service.html before running this program!

30

Methods in HttpURLConnection

• setRequestMethod
  – sets request method (usually GET or POST)
• setRequestProperty
  – sets a field:value pair in a header line in the request
• setDoInput
  – should be set to true (default) if intend to read input from connection
• setDoOutput
  – set to true (false by default) if intend to write output to connection
• connect
  – establishes TCP connection
  – usually not necessary since connection attempted at first write
• getOutputStream
  – gives output stream for request body of POST requests
• getResponseCode
  – returns response code (e.g., 200 for OK)
• getHeaderField
  – returns field from response header
• getInputStream
  – gives input stream for reading response body
• Note that request header lines are called properties in HttpURLConnection
• In HttpURLConnection, http redirects are followed by default
  – can be disabled using setInstanceFollowRedirects(false)
    • see line 15 above
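
To show several of these methods together without depending on an external site, the sketch below points HttpURLConnection at a tiny stand-in server in the same process that always answers with a 301 redirect. Everything except the HttpURLConnection methods themselves is an assumption made for the example: the class RedirectProbe, the stand-in server, the User-Agent value and the example.org target.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.Proxy;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.URL;
import java.nio.charset.StandardCharsets;

class RedirectProbe {
    // Sends a GET without following redirects; returns "<code> <Location value>".
    static String probe(String address) throws IOException {
        HttpURLConnection con =
            (HttpURLConnection) new URL(address).openConnection(Proxy.NO_PROXY);
        con.setRequestMethod("GET");
        con.setRequestProperty("User-Agent", "RedirectProbe-sketch");
        con.setInstanceFollowRedirects(false);       // keep the 3xx response visible
        try {
            return con.getResponseCode() + " " + con.getHeaderField("Location");
        } finally {
            con.disconnect();
        }
    }

    // Stand-in server: reads one request header and answers with a 301.
    static void redirectOnce(ServerSocket ss, String target) throws IOException {
        try (Socket s = ss.accept()) {
            BufferedReader in = new BufferedReader(
                new InputStreamReader(s.getInputStream(), StandardCharsets.ISO_8859_1));
            String line;
            while ((line = in.readLine()) != null && !line.isEmpty()) {
                // skip request line and header lines up to the empty line
            }
            String response = "HTTP/1.1 301 Moved Permanently\r\n"
                + "Location: " + target + "\r\n"
                + "Content-Length: 0\r\n"
                + "Connection: close\r\n\r\n";
            OutputStream out = s.getOutputStream();
            out.write(response.getBytes(StandardCharsets.ISO_8859_1));
            out.flush();
        }
    }

    // Runs the stand-in server and the probe together on a free local port.
    static String demo() {
        try (ServerSocket ss = new ServerSocket(0)) {
            Thread server = new Thread(() -> {
                try { redirectOnce(ss, "http://example.org/new"); }
                catch (IOException e) { throw new UncheckedIOException(e); }
            });
            server.start();
            String result = probe("http://127.0.0.1:" + ss.getLocalPort() + "/old");
            server.join();
            return result;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```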

31

A simple Web server in Java

• Takes two command line arguments
  – a port
  – the root directory for files to be served
• Then instantiates the class FileServer and starts it (lines 26-27)

32

A simple Web server in Java

• run method creates a ServerSocket

• Starts infinite loop of processing requests

• Only reads first line of each request (lines 45-6)

33

A simple Web server in Java

• processRequest parses request line

• First makes sure request is well-formed (lines 63-9)

• Then ensures that URL does not contain "/." or end with a "~" (lines 72-5)

• If the file is a directory but the URL does not end with a '/', sends a "Moved Permanently" message back to the browser, which typically resends the request with the new URL (lines 77-84)

• If requested file is a directory, then path of returned file set to the file index.html in the directory (lines 86-8)

• Attaches input stream to requested file (line 91)

• Guesses content type of file (lines 92-3)

• Prints out the response on the output print stream (lines 94-99)

• Logs interaction (line 100)
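
The FileServer listing is not included in this transcript, so the following is only a guess at the shape of the request-line checks described above, reduced to string handling with no file access. The class RequestLine, the method pathOf and the exact well-formedness test (GET, a path starting with "/", a version starting with "HTTP/") are assumptions. For simplicity, the sketch treats any path ending in "/" as a directory.

```java
class RequestLine {
    // Returns the path to serve, or null if the request line should be rejected.
    static String pathOf(String requestLine) {
        String[] parts = requestLine.split(" ");
        boolean wellFormed = parts.length == 3
            && parts[0].equals("GET")
            && parts[1].startsWith("/")
            && parts[2].startsWith("HTTP/");
        if (!wellFormed) {
            return null;
        }
        String path = parts[1];
        // "/." also catches "/..", so requests cannot climb out of the root directory;
        // a trailing "~" usually marks an editor backup file.
        if (path.contains("/.") || path.endsWith("~")) {
            return null;
        }
        // A directory request is answered with the index.html file inside it.
        return path.endsWith("/") ? path + "index.html" : path;
    }

    public static void main(String[] args) {
        System.out.println(pathOf("GET /docs/ HTTP/1.1"));
        System.out.println(pathOf("GET /../secret.txt HTTP/1.1"));
    }
}
```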

34

A simple Web server in Java

• log method prints out record of each interaction

• errorReport returns an HTML Error page to the client browser

• sendFile sends the file as the body of an HTTP response as a sequence of bytes