The Linux Network Subsystem -...
Transcript of The Linux Network Subsystem -...
![Page 1: The Linux Network Subsystem - pudn.comread.pudn.com/downloads128/doc/543932/linuxprotocolstack... · 2008-09-08 · Kernel Mode vs. User Mode All modern CPUs support a dual mode of](https://reader035.fdocuments.us/reader035/viewer/2022062921/5f045ab57e708231d40d90d1/html5/thumbnails/1.jpg)
1© Copyright 20062004, Michael Opdenacker© Copyright 20032006, Oron Peled© Copyright 20042006 Codefidence Ltd.
For full copyright information see last page.Creative Commons AttributionShareAlike 2.0 license
Unable to handle kernel paging request at virtual address 4d1b65e8 Unable to handle kernel paging request at virtual address 4d1b65e8 pgd = c0280000 pgd = c0280000 <1>[4d1b65e8] *pgd=00000000[4d1b65e8] *pgd=00000000 Internal error: Oops: f5 [#1] Internal error: Oops: f5 [#1] Modules linked in:Modules linked in: hx4700_udc hx4700_udc asic3_base asic3_base CPU: 0 CPU: 0 PC is at set_pxa_fb_info+0x2c/0x44 PC is at set_pxa_fb_info+0x2c/0x44 LR is at hx4700_udc_init+0x1c/0x38 [hx4700_udc] LR is at hx4700_udc_init+0x1c/0x38 [hx4700_udc] pc : [<c00116c8>] lr : [<bf00901c>] Not tainted sp : c076df78 ip : 60000093 fp : c076df84 pc : [<c00116c8>] lr : [<bf00901c>] Not tainted
The Linux Network Subsystem
Covers Linux version 2.6.25
Version 1.1
![Page 2: The Linux Network Subsystem - pudn.comread.pudn.com/downloads128/doc/543932/linuxprotocolstack... · 2008-09-08 · Kernel Mode vs. User Mode All modern CPUs support a dual mode of](https://reader035.fdocuments.us/reader035/viewer/2022062921/5f045ab57e708231d40d90d1/html5/thumbnails/2.jpg)
2© Copyright 20062004, Michael Opdenacker© Copyright 20032006, Oron Peled© Copyright 20042006 Codefidence Ltd.
For full copyright information see last page.Creative Commons AttributionShareAlike 2.0 license
Rights to copy
Attribution – ShareAlike 2.0You are free
to copy, distribute, display, and perform the workto make derivative worksto make commercial use of the work
Under the following conditionsAttribution. You must give the original author credit.Share Alike. If you alter, transform, or build upon this work, you may distribute the resulting work only under a license identical to this one.
For any reuse or distribution, you must make clear to others the license terms of this work.Any of these conditions can be waived if you get permission from the copyright holder.
Your fair use and other rights are in no way affected by the above.License text: http://creativecommons.org/licenses/bysa/2.0/legalcode
This kit contains work by the following authors:
© Copyright 20042006Michael Opdenackermichael@freeelectrons.comhttp://www.freeelectrons.com
© Copyright 20032006Oron [email protected]://www.actcom.co.il/~oron
© Copyright 2004 – 2008Codefidence [email protected]://www.codefidence.com
![Page 3: The Linux Network Subsystem - pudn.comread.pudn.com/downloads128/doc/543932/linuxprotocolstack... · 2008-09-08 · Kernel Mode vs. User Mode All modern CPUs support a dual mode of](https://reader035.fdocuments.us/reader035/viewer/2022062921/5f045ab57e708231d40d90d1/html5/thumbnails/3.jpg)
3© Copyright 20062004, Michael Opdenacker© Copyright 20032006, Oron Peled© Copyright 20042006 Codefidence Ltd.
For full copyright information see last page.Creative Commons AttributionShareAlike 2.0 license
What is Linux?
Linux is a kernel that implements the POSIX and Single Unix Specification standards which is developed as an Open Source project.
When one talks of “installing Linux”, one is referring to a Linux Distribution: a combination of Linux and other programs and library that form an operating system.
Linux runs on 24 main platforms and supports applications ranging from ccNUMA super clusters to cellular phones and micro controllers.
Linux is 15 years old, but is based on the 40 years old Unix design philosophy
![Page 4: The Linux Network Subsystem - pudn.comread.pudn.com/downloads128/doc/543932/linuxprotocolstack... · 2008-09-08 · Kernel Mode vs. User Mode All modern CPUs support a dual mode of](https://reader035.fdocuments.us/reader035/viewer/2022062921/5f045ab57e708231d40d90d1/html5/thumbnails/4.jpg)
4© Copyright 20062004, Michael Opdenacker© Copyright 20032006, Oron Peled© Copyright 20042006 Codefidence Ltd.
For full copyright information see last page.Creative Commons AttributionShareAlike 2.0 license
Layers in a Linux system
● Kernel● Kernel Modules● C library● System libraries● Application libraries● User programs
Kernel
C library
User programs
![Page 5: The Linux Network Subsystem - pudn.comread.pudn.com/downloads128/doc/543932/linuxprotocolstack... · 2008-09-08 · Kernel Mode vs. User Mode All modern CPUs support a dual mode of](https://reader035.fdocuments.us/reader035/viewer/2022062921/5f045ab57e708231d40d90d1/html5/thumbnails/5.jpg)
5© Copyright 20062004, Michael Opdenacker© Copyright 20032006, Oron Peled© Copyright 20042006 Codefidence Ltd.
For full copyright information see last page.Creative Commons AttributionShareAlike 2.0 license
Kernel architecture
System call interface
Processmanagement
Memorymanagement
Filesystemsupport
Devicecontrol Networking
CPU supportcode
Filesystemtypes
Storagedrivers
Characterdevice drivers
Networkdevice drivers
CPU / MMU support code
C library
App1 App2 ...Userspace
Kernelspace
Hardware
CPU RAM Storage
![Page 6: The Linux Network Subsystem - pudn.comread.pudn.com/downloads128/doc/543932/linuxprotocolstack... · 2008-09-08 · Kernel Mode vs. User Mode All modern CPUs support a dual mode of](https://reader035.fdocuments.us/reader035/viewer/2022062921/5f045ab57e708231d40d90d1/html5/thumbnails/6.jpg)
6© Copyright 20062004, Michael Opdenacker© Copyright 20032006, Oron Peled© Copyright 20042006 Codefidence Ltd.
For full copyright information see last page.Creative Commons AttributionShareAlike 2.0 license
Kernel Mode vs. User Mode
All modern CPUs support a dual mode of operation:
User mode, for regular tasks.
Supervisor (or privileged) mode, for the kernel.
The mode the CPU is in determines which instructions the CPU is willing to execute:
“Sensitive” instructions will not be executed when the CPU is in user mode.
The CPU mode is determined by one of the CPU registers, which stores the current “Ring Level”
0 for supervisor mode, 3 for user mode, 12 unused by Linux.
![Page 7: The Linux Network Subsystem - pudn.comread.pudn.com/downloads128/doc/543932/linuxprotocolstack... · 2008-09-08 · Kernel Mode vs. User Mode All modern CPUs support a dual mode of](https://reader035.fdocuments.us/reader035/viewer/2022062921/5f045ab57e708231d40d90d1/html5/thumbnails/7.jpg)
7© Copyright 20062004, Michael Opdenacker© Copyright 20032006, Oron Peled© Copyright 20042006 Codefidence Ltd.
For full copyright information see last page.Creative Commons AttributionShareAlike 2.0 license
The System Call Interface
When a user space tasks needs to use a kernel service, it will make a “System Call”.
The C library places parameters and number of system call in registers and then issues a special trap instruction.
The trap atomically changes the ring level to supervisor mode and the sets the instruction pointer to the kernel.
The kernel will find the required system called via the system call table and execute it.
Returning from the system call does not require a special instruction, since in supervisor mode the ring level can be changed directly.
![Page 8: The Linux Network Subsystem - pudn.comread.pudn.com/downloads128/doc/543932/linuxprotocolstack... · 2008-09-08 · Kernel Mode vs. User Mode All modern CPUs support a dual mode of](https://reader035.fdocuments.us/reader035/viewer/2022062921/5f045ab57e708231d40d90d1/html5/thumbnails/8.jpg)
8© Copyright 20062004, Michael Opdenacker© Copyright 20032006, Oron Peled© Copyright 20042006 Codefidence Ltd.
For full copyright information see last page.Creative Commons AttributionShareAlike 2.0 license
Linux System Call Path
entry.S
Task
sys_name()
do_name()
Glibc
Function call
Trap
Kernel
Task
![Page 9: The Linux Network Subsystem - pudn.comread.pudn.com/downloads128/doc/543932/linuxprotocolstack... · 2008-09-08 · Kernel Mode vs. User Mode All modern CPUs support a dual mode of](https://reader035.fdocuments.us/reader035/viewer/2022062921/5f045ab57e708231d40d90d1/html5/thumbnails/9.jpg)
9© Copyright 20062004, Michael Opdenacker© Copyright 20032006, Oron Peled© Copyright 20042006 Codefidence Ltd.
For full copyright information see last page.Creative Commons AttributionShareAlike 2.0 license
Linux networking Subsystem Overview
Driver <> Hardware
Driver <> Stack
Networking Stack
Stack <> App App2 App3App 1
Stack Driver Interface
IP
TCP ICMPUDP
Driver
Socket Layer
Bridge
![Page 10: The Linux Network Subsystem - pudn.comread.pudn.com/downloads128/doc/543932/linuxprotocolstack... · 2008-09-08 · Kernel Mode vs. User Mode All modern CPUs support a dual mode of](https://reader035.fdocuments.us/reader035/viewer/2022062921/5f045ab57e708231d40d90d1/html5/thumbnails/10.jpg)
10© Copyright 20062004, Michael Opdenacker© Copyright 20032006, Oron Peled© Copyright 20042006 Codefidence Ltd.
For full copyright information see last page.Creative Commons AttributionShareAlike 2.0 license
Network Device Driver Hardware Interface
Send
Free
Send
Free
Send
RcvOk
SentOK
RcvErr
SendErr
RecvCRC
Free
RcvOK
DriverTx
Rx
packet packet packet packet packet
packet packet packet packet
Memory Access
DMA●Driver allocates Ring Buffers.●Driver resets descriptors to initial state.●Driver puts packet to be sent in Tx buffers.●Device puts received packet in Rx buffers.●Driver/Device update descriptors to indicate state.●Device indicates Rx and end of Tx with interrupt, unless interrupt mitigation techniques are applied.
Memorymappedregistersaccess
Interrupts
![Page 11: The Linux Network Subsystem - pudn.comread.pudn.com/downloads128/doc/543932/linuxprotocolstack... · 2008-09-08 · Kernel Mode vs. User Mode All modern CPUs support a dual mode of](https://reader035.fdocuments.us/reader035/viewer/2022062921/5f045ab57e708231d40d90d1/html5/thumbnails/11.jpg)
11© Copyright 20062004, Michael Opdenacker© Copyright 20032006, Oron Peled© Copyright 20042006 Codefidence Ltd.
For full copyright information see last page.Creative Commons AttributionShareAlike 2.0 license
Network Device Registration
Each network device is represented by a struct net_device
These are allocated using:struct net_device *alloc_netdev(size, mask, setup_func);
size – size of our priv data part
mask – a naming pattern (e.g. “eth%d”)
setup_func – A functionthat set ups the rest of net_device.
And is registered via a call to:int register_netdev(struct net_device *dev);
![Page 12: The Linux Network Subsystem - pudn.comread.pudn.com/downloads128/doc/543932/linuxprotocolstack... · 2008-09-08 · Kernel Mode vs. User Mode All modern CPUs support a dual mode of](https://reader035.fdocuments.us/reader035/viewer/2022062921/5f045ab57e708231d40d90d1/html5/thumbnails/12.jpg)
12© Copyright 20062004, Michael Opdenacker© Copyright 20032006, Oron Peled© Copyright 20042006 Codefidence Ltd.
For full copyright information see last page.Creative Commons AttributionShareAlike 2.0 license
Network Device Initialization
The net_device structure is initalized with numerous methods and flags by the setup function:
open – request resources, register interrupts, start queues.
stop – deallocates resources, unregister irq, stop queue.
get_stats – report statistics
set_multicast_list – configure device for multicast
hard_start_xmit – called by the stack to initiate Tx.
IFF_MULTICAST – Device support multicast
IFF_NOARP – Device does not support ARP protocol
![Page 13: The Linux Network Subsystem - pudn.comread.pudn.com/downloads128/doc/543932/linuxprotocolstack... · 2008-09-08 · Kernel Mode vs. User Mode All modern CPUs support a dual mode of](https://reader035.fdocuments.us/reader035/viewer/2022062921/5f045ab57e708231d40d90d1/html5/thumbnails/13.jpg)
13© Copyright 20062004, Michael Opdenacker© Copyright 20032006, Oron Peled© Copyright 20042006 Codefidence Ltd.
For full copyright information see last page.Creative Commons AttributionShareAlike 2.0 license
Packet Representation
We need to manipulate packets through the stack
This manipulation involves efficiently:
Adding protocol headers/trailers down the stack.
Removing protocol headers/trailers up the stack.
Packets can be chained together.
Each protocol should have convenient access to header fields.
To do all this the kernel uses the sk_buff structure.
![Page 14: The Linux Network Subsystem - pudn.comread.pudn.com/downloads128/doc/543932/linuxprotocolstack... · 2008-09-08 · Kernel Mode vs. User Mode All modern CPUs support a dual mode of](https://reader035.fdocuments.us/reader035/viewer/2022062921/5f045ab57e708231d40d90d1/html5/thumbnails/14.jpg)
14© Copyright 20062004, Michael Opdenacker© Copyright 20032006, Oron Peled© Copyright 20042006 Codefidence Ltd.
For full copyright information see last page.Creative Commons AttributionShareAlike 2.0 license
Socket Buffers
The sk_buff structure represents a single packet.
This structure is passed through the protocol stack.
It holds pointers to a buffers with the packet data.
It holds many type of other information:
Data size.
Incoming device.
Priority.
Security ...
![Page 15: The Linux Network Subsystem - pudn.comread.pudn.com/downloads128/doc/543932/linuxprotocolstack... · 2008-09-08 · Kernel Mode vs. User Mode All modern CPUs support a dual mode of](https://reader035.fdocuments.us/reader035/viewer/2022062921/5f045ab57e708231d40d90d1/html5/thumbnails/15.jpg)
15© Copyright 20062004, Michael Opdenacker© Copyright 20032006, Oron Peled© Copyright 20042006 Codefidence Ltd.
For full copyright information see last page.Creative Commons AttributionShareAlike 2.0 license
struct sk_buff
next: Next buffer in listprev: Previous buffer in listsk: Socket we are owned bytstamp: Time we arriveddev: Device we arrived on/are leaving byinput_dev: Device we arrived onh: Transport layer headernh: Network layer headermac: Link layer headerdst: Destination route cache entrysp: Security path, used for xfrmcb: Control buffer. Private data.len: Length of actual datadata_len: Data lengthmac_len: Length of link layer headercsum: Checksumlocal_df: Allow local fragmentation flagcloned: Head may be cloned (see refcnt)nohdr: Payload reference only flagpkt_type: Packet classfclone: Clone status
ip_summed: Driver fed us an IP checksumpriority: Packet queuing priorityusers: User count see {datagram,tcp}.cprotocol: Packet protocol from drivertruesize: Buffer sizehead: Head of bufferdata: Data head pointertail: Tail pointerend: End pointerdestructor: Destruct functionnfmark: Netfilter hooks private datanfct: Associated connection, if anyipvs_property: skbuff is owned by ipvsnfctinfo: Connection tracking info.nfct_reasm: Netfilter conntrack reassembly pointernf_bridge: Saved data about a bridged frametc_index: Traffic control indextc_verd: Traffic control verdictdma_cookie: DMA operation cookiesecmark: Security marking for LSM
![Page 16: The Linux Network Subsystem - pudn.comread.pudn.com/downloads128/doc/543932/linuxprotocolstack... · 2008-09-08 · Kernel Mode vs. User Mode All modern CPUs support a dual mode of](https://reader035.fdocuments.us/reader035/viewer/2022062921/5f045ab57e708231d40d90d1/html5/thumbnails/16.jpg)
16© Copyright 20062004, Michael Opdenacker© Copyright 20032006, Oron Peled© Copyright 20042006 Codefidence Ltd.
For full copyright information see last page.Creative Commons AttributionShareAlike 2.0 license
Socket Buffer Diagram
len...
headdatatailend...
dev
Padding
frag1
structsk_shared_info
frag2
frag3
struct sk_buff
EthernetIP
TCPPayload
headroomNote
Network chip must support
Scatter/Gatherto use of frags.
Otherwise kernel must copy buffersbefore send!
![Page 17: The Linux Network Subsystem - pudn.comread.pudn.com/downloads128/doc/543932/linuxprotocolstack... · 2008-09-08 · Kernel Mode vs. User Mode All modern CPUs support a dual mode of](https://reader035.fdocuments.us/reader035/viewer/2022062921/5f045ab57e708231d40d90d1/html5/thumbnails/17.jpg)
17© Copyright 20062004, Michael Opdenacker© Copyright 20032006, Oron Peled© Copyright 20042006 Codefidence Ltd.
For full copyright information see last page.Creative Commons AttributionShareAlike 2.0 license
Socket Buffer Operations
skb_put: add data to a buffer.
skb_push: add data to the start of a buffer.
skb_pull: remove data from the start of a buffer.
skb_headroom: returns free bytes at buffer head.
skb_tailroom: returns free bytes at buffer end.
skb_reserve: adjust headroom.
skb_trim: remove end from a buffer.
![Page 18: The Linux Network Subsystem - pudn.comread.pudn.com/downloads128/doc/543932/linuxprotocolstack... · 2008-09-08 · Kernel Mode vs. User Mode All modern CPUs support a dual mode of](https://reader035.fdocuments.us/reader035/viewer/2022062921/5f045ab57e708231d40d90d1/html5/thumbnails/18.jpg)
18© Copyright 20062004, Michael Opdenacker© Copyright 20032006, Oron Peled© Copyright 20042006 Codefidence Ltd.
For full copyright information see last page.Creative Commons AttributionShareAlike 2.0 license
Operation Example: skb_put
unsigned char *skb_put(struct sk_buff * skb, unsigned int len)
Adds data to a buffer:
skb: buffer to use
len: amount of data to add
This function extends the used data area of the buffer.
If this would exceed the total buffer size the kernel will panic.
A pointer to the first byte of the extra data is returned.
![Page 19: The Linux Network Subsystem - pudn.comread.pudn.com/downloads128/doc/543932/linuxprotocolstack... · 2008-09-08 · Kernel Mode vs. User Mode All modern CPUs support a dual mode of](https://reader035.fdocuments.us/reader035/viewer/2022062921/5f045ab57e708231d40d90d1/html5/thumbnails/19.jpg)
19© Copyright 20062004, Michael Opdenacker© Copyright 20032006, Oron Peled© Copyright 20042006 Codefidence Ltd.
For full copyright information see last page.Creative Commons AttributionShareAlike 2.0 license
Socket Buffer Alignment
CPUs often take a performance hit when accessing unaligned memory locations.
Since an Ethernet header is 14 bytes network drivers often end up with the IP header at an unaligned offset.
The IP header can be aligned by shifting the start of the packet by 2 bytes. Drivers should do this with:skb_reserve(NET_IP_ALIGN);
The downside is that the DMA is now unaligned. On some architectures the cost of an unaligned DMA outweighs the gains so NET_IP_ALIGN is set on a per arch basis.
![Page 20: The Linux Network Subsystem - pudn.comread.pudn.com/downloads128/doc/543932/linuxprotocolstack... · 2008-09-08 · Kernel Mode vs. User Mode All modern CPUs support a dual mode of](https://reader035.fdocuments.us/reader035/viewer/2022062921/5f045ab57e708231d40d90d1/html5/thumbnails/20.jpg)
20© Copyright 20062004, Michael Opdenacker© Copyright 20032006, Oron Peled© Copyright 20042006 Codefidence Ltd.
For full copyright information see last page.Creative Commons AttributionShareAlike 2.0 license
Socket Buffer Padding
The networking layer reserves some headroom in skb data.
This is used to avoid having to reallocate skb data when the header has to grow.
In the default case, if the header has to grow 16 bytes or less we avoid the reallocation.
Unfortunately, this headroom changes the DMA alignment of the resulting network packet. As for NET_IP_ALIGN, this unaligned DMA is expensive on some architectures.
Therefore architecture can override this value, as long as at least 16 bytes of free headroom are there.
![Page 21: The Linux Network Subsystem - pudn.comread.pudn.com/downloads128/doc/543932/linuxprotocolstack... · 2008-09-08 · Kernel Mode vs. User Mode All modern CPUs support a dual mode of](https://reader035.fdocuments.us/reader035/viewer/2022062921/5f045ab57e708231d40d90d1/html5/thumbnails/21.jpg)
21© Copyright 20062004, Michael Opdenacker© Copyright 20032006, Oron Peled© Copyright 20042006 Codefidence Ltd.
For full copyright information see last page.Creative Commons AttributionShareAlike 2.0 license
Socket Buffer Allocations
dev_alloc_skb: allocate an skbuff for Rx
netdev_alloc_skb: allocate an skbuff for Rx, on a specific device.
Allocate a new sk_buff and assign it a usage count of one.
The buffer has unspecified headroom built in. Users should allocate the headroom they think they need without accounting for the built in space. The built in space is used for optimizations
NULL is returned if there is no free memory.
Although these functions allocates memory it can be called from an interrupt.
![Page 22: The Linux Network Subsystem - pudn.comread.pudn.com/downloads128/doc/543932/linuxprotocolstack... · 2008-09-08 · Kernel Mode vs. User Mode All modern CPUs support a dual mode of](https://reader035.fdocuments.us/reader035/viewer/2022062921/5f045ab57e708231d40d90d1/html5/thumbnails/22.jpg)
22© Copyright 20062004, Michael Opdenacker© Copyright 20032006, Oron Peled© Copyright 20042006 Codefidence Ltd.
For full copyright information see last page.Creative Commons AttributionShareAlike 2.0 license
sk_buff Allocation Example
Immediately after allocation, we should reserve the needed headroom:struct sk_buff*skb;
skb = dev_alloc_skb(1500);
if(unlikely(!skb))
break;
/* Mark as being used by this device */
skb>dev = dev;
/* Align IP on 16 byte boundaries */
skb_reserve(skb, NET_IP_ALIGN);
![Page 23: The Linux Network Subsystem - pudn.comread.pudn.com/downloads128/doc/543932/linuxprotocolstack... · 2008-09-08 · Kernel Mode vs. User Mode All modern CPUs support a dual mode of](https://reader035.fdocuments.us/reader035/viewer/2022062921/5f045ab57e708231d40d90d1/html5/thumbnails/23.jpg)
23© Copyright 20062004, Michael Opdenacker© Copyright 20032006, Oron Peled© Copyright 20042006 Codefidence Ltd.
For full copyright information see last page.Creative Commons AttributionShareAlike 2.0 license
Softnet
Network stack is implemented as a pair of softirqs for parallelize packet handling on SMP machines:
NET_TX_SOFTIRQ Feeds packets from network stack to driver.
NET_RX_SOFTIRQ Feeds packets from driver to network stack.
Like any other softirq, these are called on return from interrupt or via the low priority ksoftirqd kernel thread.
Transmit/receive queues are stored in percpu softnet_data.
![Page 24: The Linux Network Subsystem - pudn.comread.pudn.com/downloads128/doc/543932/linuxprotocolstack... · 2008-09-08 · Kernel Mode vs. User Mode All modern CPUs support a dual mode of](https://reader035.fdocuments.us/reader035/viewer/2022062921/5f045ab57e708231d40d90d1/html5/thumbnails/24.jpg)
24© Copyright 20062004, Michael Opdenacker© Copyright 20032006, Oron Peled© Copyright 20042006 Codefidence Ltd.
For full copyright information see last page.Creative Commons AttributionShareAlike 2.0 license
Linux Contexts
User Space
User Context
Interrupt Handlers
KernelSpace
InterruptContext
SoftIRQs
Hi p
riota
skle
ts
Tim
ers
Reg
ula
rta
skle
ts...
Process Thread
Kernel Thread
NetworkInterfaceDeviceDriver
NetStack
![Page 25: The Linux Network Subsystem - pudn.comread.pudn.com/downloads128/doc/543932/linuxprotocolstack... · 2008-09-08 · Kernel Mode vs. User Mode All modern CPUs support a dual mode of](https://reader035.fdocuments.us/reader035/viewer/2022062921/5f045ab57e708231d40d90d1/html5/thumbnails/25.jpg)
25© Copyright 20062004, Michael Opdenacker© Copyright 20032006, Oron Peled© Copyright 20042006 Codefidence Ltd.
For full copyright information see last page.Creative Commons AttributionShareAlike 2.0 license
Packet Reception
The driver allocates an skb and sets up a descriptor in the ring buffers for the hardware.
The driver Rx interrupt handler calls netif_rx(skb).
netif_rx deposits the sk_buff in the percpu input queue. and marks the NET_RX_SOFTIRQ to run.
At SoftIRQ processing time, net_rx_action() is called by NET_RX_SOFTIRQ, which calls the driver poll() method to feed the packet up.
Normally poll() is set to proccess_backlog() by net_dev_init().
![Page 26: The Linux Network Subsystem - pudn.comread.pudn.com/downloads128/doc/543932/linuxprotocolstack... · 2008-09-08 · Kernel Mode vs. User Mode All modern CPUs support a dual mode of](https://reader035.fdocuments.us/reader035/viewer/2022062921/5f045ab57e708231d40d90d1/html5/thumbnails/26.jpg)
26© Copyright 20062004, Michael Opdenacker© Copyright 20032006, Oron Peled© Copyright 20042006 Codefidence Ltd.
For full copyright information see last page.Creative Commons AttributionShareAlike 2.0 license
Packet Rx Overview
![Page 27: The Linux Network Subsystem - pudn.comread.pudn.com/downloads128/doc/543932/linuxprotocolstack... · 2008-09-08 · Kernel Mode vs. User Mode All modern CPUs support a dual mode of](https://reader035.fdocuments.us/reader035/viewer/2022062921/5f045ab57e708231d40d90d1/html5/thumbnails/27.jpg)
27© Copyright 20062004, Michael Opdenacker© Copyright 20032006, Oron Peled© Copyright 20042006 Codefidence Ltd.
For full copyright information see last page.Creative Commons AttributionShareAlike 2.0 license
Packet Transmission
Each network device defines a method:int (*hard_start_xmit) (struct sk_buff *skb, struct net_device *dev);
This function is indirectly called from the NET_TX_SOFTIRQ
Call are serialized via the lock dev>xmit_lock_owner
The driver manages the transmit queue during interface up and downs or to signal back pressure using the following functions:void netif_start_queue(struct net_device *net);
void netif_stop_queue(struct net_device *net);
void netif_wake_queue(struct net_device *net);
![Page 28: The Linux Network Subsystem - pudn.comread.pudn.com/downloads128/doc/543932/linuxprotocolstack... · 2008-09-08 · Kernel Mode vs. User Mode All modern CPUs support a dual mode of](https://reader035.fdocuments.us/reader035/viewer/2022062921/5f045ab57e708231d40d90d1/html5/thumbnails/28.jpg)
28© Copyright 20062004, Michael Opdenacker© Copyright 20032006, Oron Peled© Copyright 20042006 Codefidence Ltd.
For full copyright information see last page.Creative Commons AttributionShareAlike 2.0 license
Packet Tx Overview
![Page 29: The Linux Network Subsystem - pudn.comread.pudn.com/downloads128/doc/543932/linuxprotocolstack... · 2008-09-08 · Kernel Mode vs. User Mode All modern CPUs support a dual mode of](https://reader035.fdocuments.us/reader035/viewer/2022062921/5f045ab57e708231d40d90d1/html5/thumbnails/29.jpg)
29© Copyright 20062004, Michael Opdenacker© Copyright 20032006, Oron Peled© Copyright 20042006 Codefidence Ltd.
For full copyright information see last page.Creative Commons AttributionShareAlike 2.0 license
NAPI
Network “New API”
Provides interrupt mitigation
Requirements:
A DMA ring buffer.
Ability to turn off receive interrupts or events.
It is used by defining a new method:
int (*poll) (struct net_device *dev, int * budget);
which is called by the network stack periodically when signaled by the driver to do so.
![Page 30: The Linux Network Subsystem - pudn.comread.pudn.com/downloads128/doc/543932/linuxprotocolstack... · 2008-09-08 · Kernel Mode vs. User Mode All modern CPUs support a dual mode of](https://reader035.fdocuments.us/reader035/viewer/2022062921/5f045ab57e708231d40d90d1/html5/thumbnails/30.jpg)
30© Copyright 20062004, Michael Opdenacker© Copyright 20032006, Oron Peled© Copyright 20042006 Codefidence Ltd.
For full copyright information see last page.Creative Commons AttributionShareAlike 2.0 license
NAPI (cont.)
When a receive interrupt occurs, driver:
Turns off receive interrupts.
Calls netif_rx_schedule(dev) to get stack to start calling it's poll method.
The Poll method
Scans receive ring buffers, feeding packets to the stack via: netif_receive_skb(skb).
If work finished within budget parameter, reenables interrupts and calls netif_rx_complete(dev)
Else, stack will call poll method again.
![Page 31: The Linux Network Subsystem - pudn.comread.pudn.com/downloads128/doc/543932/linuxprotocolstack... · 2008-09-08 · Kernel Mode vs. User Mode All modern CPUs support a dual mode of](https://reader035.fdocuments.us/reader035/viewer/2022062921/5f045ab57e708231d40d90d1/html5/thumbnails/31.jpg)
31© Copyright 20062004, Michael Opdenacker© Copyright 20032006, Oron Peled© Copyright 20042006 Codefidence Ltd.
For full copyright information see last page.Creative Commons AttributionShareAlike 2.0 license
Routing
After the socket buffer is delivered to a protocol handler the handler may decide to route the packet.
The default routing uses the normal destination based routing with single table and a FIB destination cache.
For each packet the routintg destination is looked up in the FIB cache.
If found, the packet is sent to that interface driver.
Otherwise a more costly routing decision based on rules occurs and the result is stored in the FIB.
![Page 32: The Linux Network Subsystem - pudn.comread.pudn.com/downloads128/doc/543932/linuxprotocolstack... · 2008-09-08 · Kernel Mode vs. User Mode All modern CPUs support a dual mode of](https://reader035.fdocuments.us/reader035/viewer/2022062921/5f045ab57e708231d40d90d1/html5/thumbnails/32.jpg)
32© Copyright 20062004, Michael Opdenacker© Copyright 20032006, Oron Peled© Copyright 20042006 Codefidence Ltd.
For full copyright information see last page.Creative Commons AttributionShareAlike 2.0 license
What is Netfilter?
Netfilter is a framework for packet mangling
Each protocol defines "hooks" (IPv4 defines 5) which are welldefined points in a packet's traversal of that protocol stack.
At each of these points, the protocol will call the netfilter framework with the packet and the hook number.
Parts of the kernel can register to listen to the different hooks for each protocol.
When a packet is passed to the netfilter framework, it will call all registered callbacks for that hook and protocol.
![Page 33: The Linux Network Subsystem - pudn.comread.pudn.com/downloads128/doc/543932/linuxprotocolstack... · 2008-09-08 · Kernel Mode vs. User Mode All modern CPUs support a dual mode of](https://reader035.fdocuments.us/reader035/viewer/2022062921/5f045ab57e708231d40d90d1/html5/thumbnails/33.jpg)
33© Copyright 20062004, Michael Opdenacker© Copyright 20032006, Oron Peled© Copyright 20042006 Codefidence Ltd.
For full copyright information see last page.Creative Commons AttributionShareAlike 2.0 license
Netfilter Architecture
PreRouting Route Forward
PostRouting
Route
LocalOut
LocalIn
Local Sockets
EgresIngres
![Page 34: The Linux Network Subsystem - pudn.comread.pudn.com/downloads128/doc/543932/linuxprotocolstack... · 2008-09-08 · Kernel Mode vs. User Mode All modern CPUs support a dual mode of](https://reader035.fdocuments.us/reader035/viewer/2022062921/5f045ab57e708231d40d90d1/html5/thumbnails/34.jpg)
34© Copyright 20062004, Michael Opdenacker© Copyright 20032006, Oron Peled© Copyright 20042006 Codefidence Ltd.
For full copyright information see last page.Creative Commons AttributionShareAlike 2.0 license
Netfilter Hook
Kernel code can register a call back function to be called when a packet arrives at each hook. and are free to manipulate the packet.
The callback can then tell netfilter to do one of five things:
NF_ACCEPT: continue traversal as normal.
NF_DROP: drop the packet; don't continue traversal.
NF_STOLEN: I've taken over the packet; stop traversal.
NF_QUEUE: queue the packet (usually for userspace handling).
NF_REPEAT: call this hook again.
![Page 35: The Linux Network Subsystem - pudn.comread.pudn.com/downloads128/doc/543932/linuxprotocolstack... · 2008-09-08 · Kernel Mode vs. User Mode All modern CPUs support a dual mode of](https://reader035.fdocuments.us/reader035/viewer/2022062921/5f045ab57e708231d40d90d1/html5/thumbnails/35.jpg)
35© Copyright 20062004, Michael Opdenacker© Copyright 20032006, Oron Peled© Copyright 20042006 Codefidence Ltd.
For full copyright information see last page.Creative Commons AttributionShareAlike 2.0 license
IP Tables
A packet selection system called IP Tables has been built over the netfilter framework.
It is a direct descendant of ipchains (that came from ipfwadm, that came from BSD's ipfw IIRC), with extensibility.
Kernel modules can register a new table, and ask for a packet to traverse a given table.
This packet selection method is used for packet filtering (the `filter' table), Network Address Translation (the `nat' table) and general preroute packet mangling (the `mangle' table).
![Page 36: The Linux Network Subsystem - pudn.comread.pudn.com/downloads128/doc/543932/linuxprotocolstack... · 2008-09-08 · Kernel Mode vs. User Mode All modern CPUs support a dual mode of](https://reader035.fdocuments.us/reader035/viewer/2022062921/5f045ab57e708231d40d90d1/html5/thumbnails/36.jpg)
36© Copyright 20062004, Michael Opdenacker© Copyright 20032006, Oron Peled© Copyright 20042006 Codefidence Ltd.
For full copyright information see last page.Creative Commons AttributionShareAlike 2.0 license
IP Tables and Netfilter Hooks
PreRouting Route
PostRouting
LocalIn
Local Sockets
EgresIngres
ConntrackMangleDestination NAT
FilterConntrackMangle
ConntrackMangleDestination NATFilter
Forward
Route
LocalOut
MangleFilter
ConntrackMangleSource NAT
![Page 37: The Linux Network Subsystem - pudn.comread.pudn.com/downloads128/doc/543932/linuxprotocolstack... · 2008-09-08 · Kernel Mode vs. User Mode All modern CPUs support a dual mode of](https://reader035.fdocuments.us/reader035/viewer/2022062921/5f045ab57e708231d40d90d1/html5/thumbnails/37.jpg)
37© Copyright 20062004, Michael Opdenacker© Copyright 20032006, Oron Peled© Copyright 20042006 Codefidence Ltd.
For full copyright information see last page.Creative Commons AttributionShareAlike 2.0 license
BSD Sockets Interface
User space network interface:
socket() / bind() / accept() / listen()
Initalization, addressing and hand shaking
select() / poll() / epoll()
Waiting for events
send() / recv()
Stream oriented (e.g. TCP) Rx / Tx
sendto() / recvfrom()
Datagram oriented (e.g. UDP) Rx / TX
![Page 38: The Linux Network Subsystem - pudn.comread.pudn.com/downloads128/doc/543932/linuxprotocolstack... · 2008-09-08 · Kernel Mode vs. User Mode All modern CPUs support a dual mode of](https://reader035.fdocuments.us/reader035/viewer/2022062921/5f045ab57e708231d40d90d1/html5/thumbnails/38.jpg)
38© Copyright 20062004, Michael Opdenacker© Copyright 20032006, Oron Peled© Copyright 20042006 Codefidence Ltd.
For full copyright information see last page.Creative Commons AttributionShareAlike 2.0 license
Simple Client/Server
socket s;char buf[256];
s =socket()
connect(s, IP:port)
while(ret !=0) ret = recv(s, buf)
socket s1, s2 ... sn;char buf[256];
s =socket()bind(s1, IP:port)listen(s1)
while {
select(s1,s2 ... sn)
if(s1) sn = accept(s1) else while(ret !=0) ret = send(sn, buf)}
Clients Server
![Page 39: The Linux Network Subsystem - pudn.comread.pudn.com/downloads128/doc/543932/linuxprotocolstack... · 2008-09-08 · Kernel Mode vs. User Mode All modern CPUs support a dual mode of](https://reader035.fdocuments.us/reader035/viewer/2022062921/5f045ab57e708231d40d90d1/html5/thumbnails/39.jpg)
39© Copyright 20062004, Michael Opdenacker© Copyright 20032006, Oron Peled© Copyright 20042006 Codefidence Ltd.
For full copyright information see last page.Creative Commons AttributionShareAlike 2.0 license
Simple Client/Server Copies
... ret = recv(s, buf) ...
... ret = send(s, buf) ...
Kernel
User spaceApplication
Kernel
User spaceApplication
Client Server
TxRx
Copyto user
Copy from user
![Page 40: The Linux Network Subsystem - pudn.comread.pudn.com/downloads128/doc/543932/linuxprotocolstack... · 2008-09-08 · Kernel Mode vs. User Mode All modern CPUs support a dual mode of](https://reader035.fdocuments.us/reader035/viewer/2022062921/5f045ab57e708231d40d90d1/html5/thumbnails/40.jpg)
40© Copyright 20062004, Michael Opdenacker© Copyright 20032006, Oron Peled© Copyright 20042006 Codefidence Ltd.
For full copyright information see last page.Creative Commons AttributionShareAlike 2.0 license
BSD Sockets Interface Properties
Originally developed by UC Berkeley research at the dawn of time
Used by 90% of network oriented programs
Standart interface across operating systems
Simple, well understood by programmers
Context switch for every Rx/Tx
Buffer copied from/to user space to/from kernel
![Page 41: The Linux Network Subsystem - pudn.comread.pudn.com/downloads128/doc/543932/linuxprotocolstack... · 2008-09-08 · Kernel Mode vs. User Mode All modern CPUs support a dual mode of](https://reader035.fdocuments.us/reader035/viewer/2022062921/5f045ab57e708231d40d90d1/html5/thumbnails/41.jpg)
41© Copyright 20062004, Michael Opdenacker© Copyright 20032006, Oron Peled© Copyright 20042006 Codefidence Ltd.
For full copyright information see last page.Creative Commons AttributionShareAlike 2.0 license
Zero Copy
Inkernel buffer that the user has control over.
The buffer is implemented as a set of referencecounted pointers which the kernel copies around without actually copying the data.
splice() moves data to/from the buffer from/to an arbitrary file descriptor
tee() Moves data to/from one buffer to another
vmsplice() does the same than splice(), but instead of splicing from fd to fd as splice() does, it splices from a user address range into a file.
Can be used anywhere where a process needs to send something from one end to another, but it doesn't need to touch or even look at the data, just forward it.
![Page 42: The Linux Network Subsystem - pudn.comread.pudn.com/downloads128/doc/543932/linuxprotocolstack... · 2008-09-08 · Kernel Mode vs. User Mode All modern CPUs support a dual mode of](https://reader035.fdocuments.us/reader035/viewer/2022062921/5f045ab57e708231d40d90d1/html5/thumbnails/42.jpg)
42© Copyright 20062004, Michael Opdenacker© Copyright 20032006, Oron Peled© Copyright 20042006 Codefidence Ltd.
For full copyright information see last page.Creative Commons AttributionShareAlike 2.0 license
Zero Copy of Example 1
Data
File Socket Buf
HD Controller Network Chip
KernelMemory
Hardware
User space
Copy (using DMA)
Pointer to page cachepage
Pointer to pageas part of frag list
Splice() *
Only pointer is copied
* In relaity you have to do two splice calls: one from the file to an intermediate pipe and one from the pipe to the socket buffers.
![Page 43: The Linux Network Subsystem - pudn.comread.pudn.com/downloads128/doc/543932/linuxprotocolstack... · 2008-09-08 · Kernel Mode vs. User Mode All modern CPUs support a dual mode of](https://reader035.fdocuments.us/reader035/viewer/2022062921/5f045ab57e708231d40d90d1/html5/thumbnails/43.jpg)
43© Copyright 20062004, Michael Opdenacker© Copyright 20032006, Oron Peled© Copyright 20042006 Codefidence Ltd.
For full copyright information see last page.Creative Commons AttributionShareAlike 2.0 license
Zero Copy of Example 2
Data
skb
Network Chip
KernelMemory
Hardware
User space
Copy (using DMA)
Pointer to pageas part of frag list
VMSplice() *
Only pointer is copied
Mem write
Proccesspage tables
* In relaity you have to do two vmsplice to an intermediate pipe and one splice from the pipe to the socket buffers.
![Page 44: The Linux Network Subsystem - pudn.comread.pudn.com/downloads128/doc/543932/linuxprotocolstack... · 2008-09-08 · Kernel Mode vs. User Mode All modern CPUs support a dual mode of](https://reader035.fdocuments.us/reader035/viewer/2022062921/5f045ab57e708231d40d90d1/html5/thumbnails/44.jpg)
44© Copyright 20062004, Michael Opdenacker© Copyright 20032006, Oron Peled© Copyright 20042006 Codefidence Ltd.
For full copyright information see last page.Creative Commons AttributionShareAlike 2.0 license
Hardware Offloading
Large receive offload supported (in software)
TCP / Large Segment Offload supported (e.g. e1000 driver)
No TCP Offload Engine support
Security updates
Pointintime solution
Different network behavior
Hardwarespecific limits and resourcebased denialofservice attacks
http://www.linuxfoundation.org/en/Net:TOE
![Page 45: The Linux Network Subsystem - pudn.comread.pudn.com/downloads128/doc/543932/linuxprotocolstack... · 2008-09-08 · Kernel Mode vs. User Mode All modern CPUs support a dual mode of](https://reader035.fdocuments.us/reader035/viewer/2022062921/5f045ab57e708231d40d90d1/html5/thumbnails/45.jpg)
45© Copyright 20062004, Michael Opdenacker© Copyright 20032006, Oron Peled© Copyright 20042006 Codefidence Ltd.
For full copyright information see last page.Creative Commons AttributionShareAlike 2.0 license
More Information
Linux Foundation Net:Kernel Flow
http://www.linuxfoundation.org/en/Net:Kernel_Flow
Zero Copy I: UserMode Perspective
http://www.linuxjournal.com/article/6345
“Understanding Linux Network Internals”, O'Reilly Media
![Page 46: The Linux Network Subsystem - pudn.comread.pudn.com/downloads128/doc/543932/linuxprotocolstack... · 2008-09-08 · Kernel Mode vs. User Mode All modern CPUs support a dual mode of](https://reader035.fdocuments.us/reader035/viewer/2022062921/5f045ab57e708231d40d90d1/html5/thumbnails/46.jpg)
46© Copyright 20062004, Michael Opdenacker© Copyright 20032006, Oron Peled© Copyright 20042006 Codefidence Ltd.
For full copyright information see last page.Creative Commons AttributionShareAlike 2.0 license
Use the Source, Luke!
Many resources and tricks on the Internet find you will, but solutions to all technical issues only in the Source lie.
Thanks to LucasArts
![Page 47: The Linux Network Subsystem - pudn.comread.pudn.com/downloads128/doc/543932/linuxprotocolstack... · 2008-09-08 · Kernel Mode vs. User Mode All modern CPUs support a dual mode of](https://reader035.fdocuments.us/reader035/viewer/2022062921/5f045ab57e708231d40d90d1/html5/thumbnails/47.jpg)
47© Copyright 20062004, Michael Opdenacker© Copyright 20032006, Oron Peled© Copyright 20042006 Codefidence Ltd.
For full copyright information see last page.Creative Commons AttributionShareAlike 2.0 license
Copyrights and Trademarks© Copyright 20062004, Michael Opdenacker© Copyright 20032006, Oron Peled© Copyright 20042008 Codefidence Ltd.Tux Image Copyright: © 1996 Larry EwingLinux is a registered trademark of Linus Torvalds.All other trademarks are property of their respective owners.Used and distributed under a Creative Commons AttributionShareAlike 2.0 license