CS 505: Computer Structures Networks
CS 505: Thu D. Nguyen, Rutgers University, Spring 2005
CS 505: Computer Structures
Networks
Thu D. Nguyen
Spring 2005
Computer Science
Rutgers University
Basic Message Passing
[Diagram: processes P0 and P1 on a single node N0 communicating via Send/Receive; and P0 on node N0 sending to P1 on node N1 across a communication fabric]
Terminology
• Basic Message Passing:
  – Send: analogous to mailing a letter
  – Receive: analogous to picking up a letter from the mailbox
  – Scatter-gather: ability to “scatter” data items in a message into multiple memory locations and “gather” data items from multiple memory locations into one message
• Network performance:
  – Latency: the time from when a Send is initiated until the first byte is received by a Receive
  – Bandwidth: the rate at which a sender is able to send data to a receiver
Scatter-Gather
[Diagram: Gather (Send) collects data items from multiple memory locations into one message; Scatter (Receive) distributes a message’s data items into multiple memory locations]
Network Topologies
Terminology
• Network partition: When a network is broken into two or more components that cannot communicate with each other.
• Diameter: Maximum length of shortest path between any two processors.
• Connectivity: Measure of the multiplicity of paths between any two processors - Minimum number of links that must be removed to partition the network.
• Bisection width: Minimum number of links that must be removed to partition the network into two equal halves.
• Bisection bandwidth: Minimum volume of communication allowed between any two halves of the network with an equal number of processors.
Bisection Bandwidth
Bisection Bandwidth = Bisection Width × Link Bandwidth
Typical Network Diagram
Typical Node
[Diagram: a node consisting of a CPU, memory, a NIC, and a router]
Bus-Based Network
• Advantages
  – Simple
  – Diameter = 1
• Disadvantages
  – Blocking
  – Bandwidth does not scale with p
  – Easy to partition network
Completely-Connected Network
• Advantages
  – Diameter = 1
  – Bandwidth scales with p
  – Non-blocking
  – Difficult to partition network
• Disadvantages
  – Number of links grows O(p²)
  – Fan-in (and fan-out) at each node grows linearly with p
Star Network
• Essentially the same as a bus-based network
Ring Network
Mesh and Torus Network
Multistage Network
Perfect Shuffle
Omega Network - Log(p) Stages
Blocking in Omega Network
Tree Network
Fat Tree Network
Hypercube Network
Hypercube Network
k-ary d-cube Networks
• k: radix of the network, the number of processors in each dimension
• d: dimension of the network
• A k-ary d-cube can be constructed from k k-ary (d-1)-cubes by connecting the nodes occupying identical positions into rings
• Examples:
  – Hypercube: binary d-cube (k = 2)
  – Ring: p-ary 1-cube
Arbitrary Topology Networks
[Diagram: nodes connected through an arbitrary topology of switches]
Network Characteristics
Packet vs. Wormhole Routing
[Diagram: a message split into multiple packets for packet routing vs. a single worm for wormhole routing]
Store-and-Forward vs. Cut-Through Routing
• Store-and-Forward: cannot route/forward a packet until the entire packet has been received
• Cut-Through: can route/forward a packet as soon as the router has received and processed the header
• Wormhole routing is always cut-through because there is not enough buffer space to hold an entire message
• Packet routing is almost always cut-through as well
• Difference: when blocked, a worm can span multiple routers, while a packet fits entirely into the buffer of a single router
Collective Communication Primitives
• Send/Receive: necessary and sufficient
• Broadcast, multicast
  – one-to-all, all-to-all, one-to-all personalized, all-to-all personalized
  – flood
• Reduction
  – all-to-one, all-to-all
• Scatter, gather
• Barrier
Broadcast and Multicast
[Diagram: Broadcast delivers Message from P0 to all of P1–P3; Multicast delivers Message from P0 to a subset of the processors]
All-to-All
[Diagram: All-to-all: each of P0–P3 sends a message to every other processor]
Reduction
sum ← 0
for i ← 1 to p do
    sum ← sum + A[i]
[Diagram: serial reduction: P1–P3 send A[1], A[2], A[3] to P0, which accumulates the sum; tree reduction: P0 computes A[0]+A[1] while P2 computes A[2]+A[3] in parallel, then P0 combines the two partial sums into A[0]+A[1]+A[2]+A[3]]
Ring Broadcast
O(p)
Ring Broadcast
O(log p)
Mesh Broadcast
Broadcast along one row, then down all columns in parallel:
2 × O(log p^(1/2)) = O(log p)
Computation vs. Communication Cost
• 2 GHz clock => 1/2 ns instruction cycle
• Memory access:
  – L1: ~2-4 cycles => 1-2 ns
  – L2: ~5-10 cycles => 2.5-5 ns
  – Memory: ~120-300 cycles => 60-150 ns
• Message roundtrip latency: ~20 μs
  – Suppose 75% hit ratio in L1, no L2, 1 ns L1 access time, 200 ns memory access time => average memory access time ~51 ns
  – 1 message roundtrip latency = ~400 memory accesses
Performance … Always Performance!
• So … obviously, when we talk about message passing, we want to know how to optimize for performance
• But … which aspects of message passing should we optimize?
  – We could try to optimize everything
    » Optimizing the wrong thing wastes precious resources, e.g., optimizing leaving mail for the mail-person does not increase the overall “speed” of mail delivery significantly
Martin et al.: LogP Model
Sensitivity to LogGP Parameters
• LogGP parameters:
  – L = delay incurred in passing a short message from source to destination
  – o = processor overhead involved in sending or receiving a message
  – g = minimum time between message transmissions or receptions (message bandwidth)
  – G = bulk gap = time per byte transferred for long transfers (byte bandwidth)
• Workstations connected by a Myrinet network with the Generic Active Messages layer
• Delay insertion technique
• Applications written in Split-C but perform their own data caching
Sensitivity to Overhead
[Plot: run time vs. overhead o, with P = 16, g = 8.5 μs, and L = 0.5 μs held fixed]
Sensitivity to Gap
[Plot: run time vs. gap g, with P = 16, o = 9.2 μs, and L = 0.5 μs held fixed]
Sensitivity to Latency
[Plot: run time vs. latency L, with P = 16, g = 8.5 μs, and o = 9.2 μs held fixed]
Sensitivity to Bulk Gap
[Plot: run time vs. bulk gap G, with P = 16, g = 8.5 μs, o = 9.2 μs, and L = 0.5 μs held fixed]
Summary
• Runtime strongly dependent on overhead and gap
• Strong dependence on gap because of burstiness of communication
• Not so sensitive to latency => can effectively overlap computation and communication with non-blocking reads (writes usually do not stall the processor)
• Not sensitive to bulk gap => got more bandwidth than we know what to do with
What’s the Point?
• What can we take away from Martin et al.’s study?
– It’s extremely important to reduce overhead because it may affect both “o” and “g”
– All the “action” is currently in the OS and the Network Interface Card (NIC)
• Subject of von Eicken et al., “Active Messages: a Mechanism for Integrated Communication and Computation,” ISCA 1992.
User-Level Access to NIC
• Basic idea: allow protected user access to NIC for implementing comm. protocols at user-level
User-level Communication
• Basic idea: remove the kernel from the critical path of sending and receiving messages
  – user-memory to user-memory: zero copy
  – permission is checked once, when the mapping is established
  – buffer management left to the application
• Advantages
  – low communication latency
  – low processor overhead
  – approach the raw latency and bandwidth provided by the network
• One approach: U-Net
U-Net Abstraction
U-Net Endpoints
U-Net Basics
• Protection provided by endpoints and communication channels
– Endpoints, communication segments, and message queues are only accessible by the owning process (all allocated in user memory)
– Outgoing messages are tagged with the originating endpoint address and incoming messages are demultiplexed and only delivered to the correct endpoints
• For ideal performance, firmware at NIC should implement the actual messaging and NI multiplexing (including tag checking). Protection must be implemented by the OS by validating requests for the creation of endpoints. Channel registration should also be implemented by the OS.
• Message queues can be placed in different memories to optimize polling
  – Receive queue allocated in host memory
  – Send and free queues allocated in NIC memory
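The endpoint layout described above might be sketched as a C structure. This is our illustration, not the actual U-Net API: field names, queue length, and the descriptor format are assumptions.

```c
#include <stddef.h>

#define QUEUE_LEN 64  /* illustrative ring size, not from U-Net */

/* A descriptor points into the communication segment rather than
 * carrying data, so queue entries stay small and fixed-size. */
struct descriptor {
    unsigned offset;  /* buffer location within the comm. segment */
    unsigned length;  /* message length in bytes */
    unsigned tag;     /* channel tag used for demultiplexing */
};

struct endpoint {
    char *comm_segment;                  /* pinned, user-mapped buffer memory */
    struct descriptor send_q[QUEUE_LEN]; /* NIC memory: NIC polls sends there */
    struct descriptor free_q[QUEUE_LEN]; /* NIC memory: free receive buffers */
    struct descriptor recv_q[QUEUE_LEN]; /* host memory: app polls cheaply */
    unsigned send_head, send_tail;
    unsigned recv_head, recv_tail;
};

/* The application's receive poll: a cheap host-memory comparison. */
int recv_q_empty(const struct endpoint *ep) {
    return ep->recv_head == ep->recv_tail;
}
```

Placing the receive queue in host memory is what makes `recv_q_empty` a plain load rather than an I/O-bus read, which is the polling optimization the slide describes.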
U-Net Performance on ATM
U-Net UDP Performance
U-Net TCP Performance
U-Net Latency
Virtual Memory-Mapped Communication
• Receiver exports the receive buffers
• Sender must import a receive buffer before sending
• The permission of the sender to write into the receive buffer is checked once, when the export/import handshake is performed (usually at the beginning of the program)
• Sender can directly communicate with the network interface to send data into imported buffers without kernel intervention
• At the receiver, the network interface stores the received data directly into the exported receive buffer with no kernel intervention
Virtual-to-Physical Address Translation
• In order to store data directly into the application address space (exported buffers), the NI must know the virtual to physical translations
• What to do?
receiver:
    int rec_buffer[1024];
    exp_id = export(rec_buffer, sender);
    recv(exp_id);

sender:
    int send_buffer[1024];
    recv_id = import(receiver, exp_id);
    send(recv_id, send_buffer);
Software TLB in Network Interface
• The network interface must incorporate a TLB (NI-TLB) which is kept consistent with the virtual memory system
• When a message arrives, NI attempts a virtual to physical translation using the NI-TLB
• If a translation is found, NI transfers the data to the physical address in the NI-TLB entry
• If a translation is missing in the NI-TLB, the processor is interrupted to provide the translation. If the page is not currently in memory, the processor will bring the page in. In any case, the kernel increments the reference count for that page to avoid swapping
• When a page entry is evicted from the NI-TLB, the kernel is informed to decrement the reference count
• Swapping prevented while DMA in progress