05/31/2006 ecs150, spring 2006 1
UCDavis, ecs150 Spring 2006
ecs150 Spring 2006: Operating System
#7: mbuf (Chapter 11)
Dr. S. Felix Wu
Computer Science Department
University of California, Davis
http://www.cs.ucdavis.edu/~wu/
IPC
Uniform communication for distributed processes
– “socket”: network programming
– operating system kernel issues
Semaphores, message queues, and shared memory for local processes
Socket: an IPC Abstraction Layer
Mbufs: Memory Buffers
The main data structure for network processing in the kernel
Why can’t we use “kernel memory management” facilities such as kernel malloc (a power-of-2 style allocator), pages, or VM objects directly?
“Packet”
Ethernet or 802.11 header
IP header
– IPsec header
Transport headers (TCP/UDP/…)
– SSL header
Others???
Properties: Network Packet Processing
Variable sizes
Prepend or remove
Fragment/divide or defragment/combine
Can we avoid COPYING as much as possible???
Queue
Parallel processing for high speed
– E.g., Juniper routers are running FreeBSD
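The "prepend or remove" and "avoid copying" properties can be illustrated with a buffer that keeps free headroom in front of the payload, so that adding or stripping a header only moves a pointer. This is a simplified sketch, not the kernel's mbuf code: struct pbuf, pbuf_init, pbuf_prepend, pbuf_strip, and the 256-byte storage size are all hypothetical and chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PBUF_SIZE 256            /* assumed total storage, for illustration */

/* Hypothetical simplified packet buffer (not the kernel's struct mbuf). */
struct pbuf {
    unsigned char  storage[PBUF_SIZE];
    unsigned char *data;         /* start of valid bytes inside storage */
    size_t         len;          /* number of valid bytes */
};

/* Place the payload at the END of storage so headers can grow in front. */
static int pbuf_init(struct pbuf *p, const void *payload, size_t n)
{
    if (n > PBUF_SIZE)
        return -1;
    p->data = p->storage + PBUF_SIZE - n;
    p->len = n;
    memcpy(p->data, payload, n);
    return 0;
}

/* Prepend a header: move the data pointer back; the payload is not copied. */
static int pbuf_prepend(struct pbuf *p, const void *hdr, size_t n)
{
    if ((size_t)(p->data - p->storage) < n)
        return -1;               /* not enough headroom */
    p->data -= n;
    p->len += n;
    memcpy(p->data, hdr, n);     /* only the new header bytes are written */
    return 0;
}

/* Strip a header: move the data pointer forward; nothing is copied. */
static int pbuf_strip(struct pbuf *p, size_t n)
{
    if (n > p->len)
        return -1;
    p->data += n;
    p->len -= n;
    return 0;
}
```

An output path would call pbuf_prepend once per protocol layer (transport, IP, link), and an input path would call pbuf_strip as each layer consumes its header; in both directions the payload bytes stay where they are.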
sys/mbuf.h
kern/kern_mbuf.c
kern/uipc_mbuf.c
kern/uipc_mbuf2.c
[Figure: mbuf layout, labeled 256 bytes, 24, and 4]
[Figure: mbuf header. Flag labels: M_EXT, M_PKTHDR, M_EOR, M_BCAST, M_MCAST. Link labels: the same packet, next packet]
#define M_EXT           0x0001
#define M_PKTHDR        0x0002
#define M_EOR           0x0004
#define M_RDONLY        0x0008
#define M_PROTO1        0x0010
#define M_PROTO2        0x0020
#define M_PROTO3        0x0040
#define M_PROTO4        0x0080
#define M_PROTO5        0x0100
#define M_SKIP_FIREWALL 0x4000
#define M_FREELIST      0x8000
#define M_BCAST         0x0200
#define M_MCAST         0x0400
#define M_FRAG          0x0800
#define M_FIRSTFRAG     0x1000
#define M_LASTFRAG      0x2000
struct mbuf {
    struct m_hdr m_hdr;
    union {
        struct {
            struct pkthdr MH_pkthdr;
            union {
                struct m_ext MH_ext;
                char MH_databuf[MHLEN];
            } MH_dat;
        } MH;
        char M_databuf[MLEN];
    } M_dat;
};
[Figure: label "24 bytes"]
IPsec_IN_DONE
IPsec_OUT_DONE
IPsec_IN_CRYPTO_DONE
IPsec_OUT_CRYPTO_DONE
mbuf
Current size: 256 bytes
Old size: 128 bytes (shown in the following slides)
A Typical UDP Packet
m_devget: When an IP packet comes in…
mtod & dtom
mbuf pointer and data region (e.g. struct ip)
mtod?
dtom?
#define dtom(x) ((struct mbuf *)((int)(x) & ~(MSIZE - 1)))
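The dtom trick works only because every mbuf is MSIZE bytes long and starts on an MSIZE boundary, so clearing the low address bits of any pointer into the data region gives the start of the enclosing mbuf. The sketch below demonstrates this in user space under stated assumptions: struct toy_mbuf, toy_mtod, toy_dtom, and toy_mget are hypothetical stand-ins, and the cast uses uintptr_t rather than int so that the arithmetic is also valid with 64-bit pointers.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define MSIZE 256   /* mbuf size from the slides; must be a power of two */

/* Toy stand-in for struct mbuf: a data pointer followed by inline storage. */
struct toy_mbuf {
    char *m_data;
    char  m_dat[MSIZE - sizeof(char *)];
};

/* mbuf pointer -> data pointer, cast to the requested type. */
#define toy_mtod(m, t)  ((t)((m)->m_data))

/* data pointer -> mbuf pointer: clear the low log2(MSIZE) address bits. */
#define toy_dtom(x) \
    ((struct toy_mbuf *)((uintptr_t)(x) & ~(uintptr_t)(MSIZE - 1)))

/* Carve one MSIZE-aligned toy mbuf out of a malloc'ed region of 2*MSIZE
   bytes. *raw receives the pointer that must later be passed to free().
   The data region starts `off` bytes into m_dat. */
static struct toy_mbuf *toy_mget(void **raw, size_t off)
{
    *raw = malloc(2 * MSIZE);
    if (*raw == NULL)
        return NULL;
    uintptr_t a = ((uintptr_t)*raw + MSIZE - 1) & ~(uintptr_t)(MSIZE - 1);
    struct toy_mbuf *m = (struct toy_mbuf *)a;
    m->m_data = m->m_dat + off;
    return m;
}
```

Note that the mask cannot recover the mbuf when the data lives in external storage (the M_EXT case), because the data pointer then no longer points inside the mbuf itself.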
netstat -m
Check for mbuf statistics
mbuf
IP input/output/forward
IPsec
IP fragmentation/defragmentation
Device, IP, Socket
Memory Management for IPC
Why do we need something like MBUF?
I/O Architecture
[Figure: CPU, memory, device controller (with internal buffer), and I/O device, connected by a control bus and by data and I/O buses, with an IRQ line. Driver entry points: Initialization, Input, Output, Configuration, Interrupt]
Direct Memory Access
Used to avoid programmed I/O for large data movement
Requires DMA controller
Bypasses CPU to transfer data directly between I/O device and memory
DMA Requests
Disk address to start copying
Destination memory address
Number of bytes to copy
Is DMA a good idea?
CPU is a lot faster
Controllers/Devices have larger internal buffers
DMA might be much slower than CPU
Controllers become more and more intelligent
USB doesn’t have DMA.
Network Processor
File System Mounting
A file system must be mounted before it can be accessed.
An unmounted file system is mounted at a mount point.
Mount Point
logical disks
fs0: /dev/hd0a
    /
    usr sys dev etc bin
fs1: /dev/hd0e
    /
    local adm home lib bin
mount -t ufs /dev/hd0e /usr
mount -t nfs 152.1.23.12:/export/cdrom /mnt/cdrom
Distributed FS
Distributed File System
– NFS (Network File System)
– AFS (Andrew File System)
– CODA
Distributed FS
ftp.cs.ucdavis.edu fs0: /dev/hd0a
    /
    usr sys dev etc bin
Server.yahoo.com fs0: /dev/hd0e
    /
    local adm home lib bin
Distributed File System
Transparency and Location Independence
Reliability and Crash Recovery
Scalability and Efficiency
Correctness and Consistency
Security and Safety
Correctness
One-copy Unix Semantics
– every modification to every byte of a file has to be immediately and permanently visible to every client.
– Conceptually, file system access is sequential
  Makes sense in a local file system
  Single processor versus shared memory
Is this necessary?
DFS Architecture
Server
– storage for the distributed/shared files.
– provides an access interface for the clients.
Client
– consumer of the files.
– runs applications in a distributed environment.
[Figure: applications calling open, close, read, write, opendir, stat, readdir]
NFS (SUN, 1985)
Based on RPC (Remote Procedure Call) and XDR (External Data Representation)
Server maintains no state
– a READ on the server opens, seeks, reads, and closes
– a WRITE is similar, but the buffer is flushed to disk before closing
Server crash: client continues to try until server reboots – no loss
Client crashes: client must rebuild its own state – no effect on server
RPC - XDR
RPC: Standard protocol for calling procedures in another machine
Procedure is packaged with authorization and admin info
XDR: standard format for data, because manufacturers of computers cannot agree on byte ordering.
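The byte-ordering point can be made concrete: XDR encodes a 32-bit integer as four bytes with the most significant byte first, whatever the host's native order. The following is a hand-rolled sketch of that one rule, not the RPC library's own routines; toy_xdr_put_u32 and toy_xdr_get_u32 are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Encode a 32-bit unsigned integer as 4 bytes, most significant byte first.
   Shifts operate on the value, so the result is the same on any host. */
static void toy_xdr_put_u32(unsigned char out[4], uint32_t v)
{
    out[0] = (unsigned char)(v >> 24);
    out[1] = (unsigned char)(v >> 16);
    out[2] = (unsigned char)(v >> 8);
    out[3] = (unsigned char)v;
}

/* Decode 4 big-endian bytes back into a host integer. */
static uint32_t toy_xdr_get_u32(const unsigned char in[4])
{
    return ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16) |
           ((uint32_t)in[2] << 8)  |  (uint32_t)in[3];
}
```

Because both sides agree on the wire format, a little-endian client and a big-endian server exchange the same four bytes for the same value.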
rpcgen
[Figure: an RPC program and its data structures are fed to rpcgen, which generates RPC client.c, RPC server.c, and RPC.h]
NFS Operations
Every operation is independent: server opens file for every operation
File identified by handle -- no state information retained by server
client maintains mount table, v-node, offset in file table etc.
What do these imply???
[Figure: NFS software architecture. On the client computer, application programs make UNIX system calls into the UNIX kernel, where the virtual file system directs operations on local files to the UNIX file system (or another file system) and operations on remote files to the NFS client. The NFS client talks to the NFS server on the server computer using the NFS protocol (remote operations); the NFS server reaches the server's UNIX file system through its virtual file system. A second panel, labeled "Client computer", shows application programs, NFS clients, and the kernel.]
mount -t nfs home.yahoo.com:/pub/linux /mnt/linux
Final – 06/15/2006 8~10 am
1062 Bainer
Midterm plus
– 5.1~5.8, 5.11~5.12
– 6.1, 6.5~6.7
– 8.1~8.9
– 9.1~9.3
– 11.3
Notes/PPT, Homeworks, Brainstorming
State-ful vs. State-less
A server is fully aware of its clients
– does the client have the newest copy?
– what is the offset of an opened file?
– “a session” between a client and a server!
A server is completely unaware of its clients
– memory-less: I do not remember you!!
– Just tell me what you want to get (and where).
– I am not responsible for your offset values (the client needs to maintain the state).
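The state-less case can be sketched as follows: every request names the file, the offset, and the byte count, and the client alone remembers where it is in the file. This is an illustrative sketch with an in-memory "file", not the NFS wire protocol; struct read_req, server_read, struct client_file, and client_read are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Server side: a tiny in-memory "file store". The server keeps no per-client
   state; everything it needs arrives in the request. */
static const char *server_files[] = { "hello, stateless world" };

struct read_req {            /* hypothetical request layout */
    unsigned handle;         /* which file */
    size_t   offset;         /* where to start; supplied by the client */
    size_t   count;          /* how many bytes */
};

/* Serve one read; returns the number of bytes copied into buf. */
static size_t server_read(const struct read_req *rq, char *buf)
{
    if (rq->handle >= sizeof server_files / sizeof server_files[0])
        return 0;
    size_t size = strlen(server_files[rq->handle]);
    if (rq->offset >= size)
        return 0;
    size_t n = size - rq->offset;
    if (n > rq->count)
        n = rq->count;
    memcpy(buf, server_files[rq->handle] + rq->offset, n);
    return n;
}

/* Client side: the open-file state (the offset) lives here. */
struct client_file {
    unsigned handle;
    size_t   offset;
};

static size_t client_read(struct client_file *f, char *buf, size_t count)
{
    struct read_req rq = { f->handle, f->offset, count };
    size_t n = server_read(&rq, buf);
    f->offset += n;          /* the client, not the server, advances it */
    return n;
}
```

Since a request is complete in itself, repeating it after a server reboot returns the same bytes, which is why a client can simply retry until the server comes back.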
The State
[Figure: two applications, each calling open, read, stat, and lseek; label: offset]
Unix file semantics
NFS:
– open a file with read-write mode
– later, the server’s copy becomes read-only mode
– now, the application tries to write it!!
Problems with NFS
Performance not scalable:
– maybe it is OK for a local office.
– will be horrible with large scale systems.
Similar to UNIX file caching for local files:
– pages (blocks) from disk are held in a main memory buffer cache until the space is required for newer pages. Read-ahead and delayed-write optimisations.
– For local files, writes are deferred to next sync event (30 second intervals)
– Works well in local context, where files are always accessed through the local cache, but in the remote case it doesn't offer necessary synchronization guarantees to clients.
NFS v3 servers offer two strategies for updating the disk:
– write-through - altered pages are written to disk as soon as they are received at the server. When a write() RPC returns, the NFS client knows that the page is on the disk.
– delayed commit - pages are held only in the cache until a commit() call is received for the relevant file. This is the default mode used by NFS v3 clients. A commit() is issued by the client whenever a file is closed.
Server caching does nothing to reduce RPC traffic between client and server
– further optimisation is essential to reduce server load in large networks
– NFS client module caches the results of read, write, getattr, lookup and readdir operations
– synchronization of file contents (one-copy semantics) is not guaranteed when two or more clients are sharing the same file.
Timestamp-based validity check
– reduces inconsistency, but doesn't eliminate it
– validity condition for cache entries at the client:
  $(T - T_c < t) \lor (T_{m,\mathrm{client}} = T_{m,\mathrm{server}})$
– t is configurable (per file) but is typically set to 3 seconds for files and 30 seconds for directories
– it remains difficult to write distributed applications that share files with NFS
$t$: freshness guarantee
$T_c$: time when cache entry was last validated
$T_m$: time when block was last updated at server
$T$: current time
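The validity condition translates almost directly into code. In this sketch, struct cache_entry and cache_entry_valid are hypothetical names, and tm_server stands for the modification time that a getattr call to the server would return; here it is simply passed in as a parameter.

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical client cache entry for one file or block. */
struct cache_entry {
    time_t tc;          /* Tc: when the entry was last validated */
    time_t tm_client;   /* Tm as recorded by the client */
};

/* Returns true when (T - Tc < t) OR (Tm_client == Tm_server).
   now is T, t is the freshness interval, tm_server is Tm at the server. */
static bool cache_entry_valid(const struct cache_entry *e, time_t now,
                              time_t t, time_t tm_server)
{
    if (now - e->tc < t)
        return true;                    /* fresh enough: first clause holds */
    return e->tm_client == tm_server;   /* otherwise compare timestamps */
}
```

The first clause needs no network traffic, which is the point of the freshness interval; only when it fails does the client need the server's timestamp. The window of length t is also where inconsistency can slip in: another client may have written the file since the last validation.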
AFS
State-ful clients and servers.
Caching the files to clients.
– File close ==> check-in the changes.
How to maintain consistency?
– Using “Callback” in v2/3 (Valid or Cancelled)
[Figure: applications open and read the cached file; labels: invalidate and re-cache]
Why AFS?
Shared files are infrequently updated
Local cache of a few hundred megabytes
– Now 50~100 gigabytes
Unix workload:
– Files are small, read operations dominate, sequential access is common, files are read/written by one user, references come in bursts.
– Are these still true?
Fault Tolerance in AFS
a server crashes
a client crashes
– check for call-back tokens first.
Problems with AFS
Availability
what happens if call-back itself is lost??
GFS – Google File System
“failures” are the norm
Multiple-GB files are common
Append rather than overwrite
– Random writes are rare
Can we relax the consistency?
CODA
Server Replication:
– if one server goes down, I can get another.
Disconnected Operation:
– if all go down, I will use my own cache.
Consistency
If John updates file X on server A and Mary reads file X on server B….
Read-one & Write-all
Read x & Write (N-x+1)
[Figure: read and write sets]
Example: R3W4 (6+1)
Initial   0 0 0 0 0 0
Alice-W   2 2 0 2 2 0
Bob-W     2 3 3 3 3 0
Alice-R   2 3 3 3 3 0
Chris-W   2 1 1 1 1 0
Dan-R     2 1 1 1 1 0
Emily-W   7 7 1 1 1 7
Frank-R   7 7 1 1 1 7
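Reading x replicas and writing N-x+1 guarantees that every read set shares at least one replica with every write set, since x + (N-x+1) = N+1 > N. The sketch below shows one way a reader could pick the newest value from its x replicas. It assumes each replica stores a version number next to the value; the table above shows only values, so the version field, struct replica, quorum_write, and quorum_read are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define N 6                       /* replicas, as in the R3W4 example */

struct replica {
    unsigned version;             /* assumed: not shown in the slide's table */
    int      value;
};

/* Write `value` to the replicas listed in idx[0..w-1]; w should be N-x+1. */
static void quorum_write(struct replica r[N], const size_t *idx, size_t w,
                         unsigned version, int value)
{
    for (size_t i = 0; i < w; i++) {
        r[idx[i]].version = version;
        r[idx[i]].value = value;
    }
}

/* Read the replicas listed in idx[0..x-1] and return the newest value,
   i.e. the one with the highest version number. */
static int quorum_read(const struct replica r[N], const size_t *idx, size_t x)
{
    size_t best = idx[0];
    for (size_t i = 1; i < x; i++)
        if (r[idx[i]].version > r[best].version)
            best = idx[i];
    return r[best].value;
}
```

A writer needs a version higher than any earlier one; in this sketch the caller supplies it, whereas a real system would have to obtain it, for example by reading a quorum first.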
[Figure: file service architecture. On the client computer, application programs use a client module. On the server computer, a directory service (Lookup, AddName, UnName, GetNames) and a flat file service (Read, Write, Create, Delete, GetAttributes, SetAttributes)]