Visual Flow Analysis: What do real-world problems look like? Brent Draney NERSC Center Division, LBNL 2/07/06

Page 1

Visual Flow Analysis: What do real-world problems look like?

Brent Draney

NERSC Center Division, LBNL

2/07/06

Page 2

What is NERSC?

• DOE scientific computer center
• Supports ~2000 scientists around the world (mainly DOE and universities)
• Supports most major disciplines
• Combined ~20 TFLOPS, 8.8 petabytes
• 10 Gigabit LAN backbone and 10 Gigabit ESnet uplink
• O(100) sockets account for ~95% of bytes transferred
• O(5000) IP addresses in a single building but only 100 desktops

Page 3

Network and Security Team (NAST)

• Enablers and inhibitors of the network in one group
  – All responsibility is here
• Networking is responsible for end-to-end performance
  – Wherever the customer is
  – “Not our problem” is not sufficient or acceptable

Page 4

Performance tools

• Optical taps everywhere
• Mobile crash cart with all types of interfaces
• Tcpdump, Tcptrace and Xplot
• A lot of head scratching

Note: Analyzing a multi-gigabyte flow packet by packet is impossible!
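To illustrate why a plot helps where packet-by-packet reading does not, here is a minimal Python sketch that reduces a trace to the handful of points a time-sequence graph would draw. It is not how Tcptrace or Xplot work internally; the `(timestamp, seq_end)` record format, the function names and the one-second bin are assumptions made for illustration.

```python
def time_sequence_points(packets, bin_seconds=1.0):
    """Reduce packet records to one (time, highest sequence byte) point per bin.

    packets: iterable of (timestamp_seconds, seq_end) tuples, a hypothetical
             simplified form of what a capture tool records per data packet.
    """
    bins = {}
    for ts, seq_end in packets:
        key = int(ts // bin_seconds)
        bins[key] = max(bins.get(key, 0), seq_end)

    points = []
    highest = 0
    for key in sorted(bins):
        # Keep the curve monotonic: old retransmitted data never lowers it.
        highest = max(highest, bins[key])
        points.append((key * bin_seconds, highest))
    return points


def slopes(points):
    """Bytes per second between consecutive points; 0 marks a stalled stretch."""
    return [(s1 - s0) / (t1 - t0)
            for (t0, s0), (t1, s1) in zip(points, points[1:])]


if __name__ == "__main__":
    trace = [(0.1, 1000), (0.5, 2000), (1.2, 3000), (2.7, 3000), (3.1, 5000)]
    pts = time_sequence_points(trace)
    print(pts)
    print(slopes(pts))
```

A steady slope, a flat stretch or a sudden change in slope stands out in the reduced points even when the underlying trace holds millions of packets.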

Page 5

Simple Example

Consistent Slope

No anomalies

Protocol limited
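A "protocol limited" transfer with a consistent slope is what one would expect when the TCP window, not the network, sets the rate: at most one window of data can be in flight per round trip. The sketch below works through that arithmetic. The 64 KiB window and 50 ms round-trip time are assumed example values, not measurements from the talk.

```python
def window_limited_throughput(window_bytes, rtt_seconds):
    """Upper bound on throughput in bits per second when at most one
    window of data can be in flight per round trip."""
    return window_bytes * 8 / rtt_seconds


def window_needed(target_bits_per_second, rtt_seconds):
    """Window in bytes needed to keep a path full (bandwidth-delay product)."""
    return target_bits_per_second * rtt_seconds / 8


if __name__ == "__main__":
    rtt = 0.05  # assumed 50 ms round trip
    limit = window_limited_throughput(64 * 1024, rtt)
    print(f"64 KiB window at 50 ms: about {limit / 1e6:.1f} Mb/s")
    need = window_needed(10e9, rtt)
    print(f"Window to fill 10 Gb/s at 50 ms: about {need / 1e6:.1f} MB")
```

Under these assumptions a default-sized window caps the flow at roughly 10 Mb/s, far below a 10 Gigabit uplink, which is why host buffer tuning matters for long paths.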

Page 6

Simple Example Detail

Packets

ACK’ed data

Sender Advertised Window

Page 7

Brick Wall Example

Few anomalies

Transfer Hangs

Page 8

Brick Wall Detail

One Dropped packet

3 Dupe ACK’s

No Retransmit, Ever
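The pattern on this slide (a dropped packet, three duplicate ACKs, and no retransmit) can also be checked mechanically. The following Python sketch is one way to do so on simplified records; the tuple formats and the function name are hypothetical, and a real capture would first have to be reduced to this form.

```python
def find_unanswered_dup_acks(data_segments, acks, threshold=3):
    """Return ACK numbers that were duplicated `threshold` times but whose
    segment never appeared again at the capture point afterwards.

    data_segments: list of (timestamp, seq_start) for data packets seen
    acks:          list of (timestamp, ack_number) for ACKs seen
    """
    stalled = []
    last_ack = None
    dup_count = 0
    for ts, ack in sorted(acks):
        if ack == last_ack:
            dup_count += 1
        else:
            last_ack, dup_count = ack, 0
        if dup_count == threshold:
            # A retransmit is a data segment starting at the ACKed byte
            # that shows up after the duplicate ACKs.
            retransmitted = any(seq == ack and t > ts
                                for t, seq in data_segments)
            if not retransmitted:
                stalled.append(ack)
    return stalled


if __name__ == "__main__":
    # The segment starting at byte 1000 is missing at this capture point.
    data = [(0.0, 0), (0.2, 2000), (0.3, 3000), (0.4, 4000)]
    acks = [(0.05, 1000), (0.25, 1000), (0.35, 1000), (0.45, 1000)]
    print(find_unanswered_dup_acks(data, acks))
```

Running the same check on captures taken at different points along the path is one way to tell whether a retransmit was never sent or was sent and then lost in between.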

Page 9

Brick Wall Example Troubleshooting and Answer

• Troubleshooting
  – Sender verifies that retransmits are sent
  – “Non-tuned” traffic never fails
• Answer
  – A stateful firewall tracking TCP sequence numbers didn’t believe that the retransmits were legitimate

Page 10

Perverse Example

Holy Mackerel!

Jumbo Packets

Retransmits

Page 11

Perverse Example

Is PMTU working? Yes

[Scratch Head]

Page 12

Perverse Example Troubleshooting and Answer

• Troubleshooting
  – Review sender configuration
  – PMTU installed in routing table correctly? Yes
  – Tcpdump on host shows 64K packets leaving a 9K interface
  – “Large Send” enabled, offloading packet creation to NIC
• Answer
  – NIC doesn’t have access to routing table
    • Route MTU not honored
  – Retransmits handled by kernel
    • Route MTU honored
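The arithmetic behind this answer can be sketched as follows. If the NIC cuts a 64K large-send buffer using only the interface MTU, every resulting packet can exceed what the route allows. The 1500-byte route MTU and the 40-byte header size (IPv4 and TCP without options) are assumptions for illustration, as are the function names; the slides give only the 64K buffer and the 9K interface.

```python
def wire_segments(buffer_bytes, mtu, header_bytes=40):
    """Split one large-send buffer into per-packet payload sizes for an MTU.

    header_bytes=40 assumes IPv4 and TCP headers without options.
    """
    mss = mtu - header_bytes
    full, rest = divmod(buffer_bytes, mss)
    return [mss] * full + ([rest] if rest else [])


def oversized_for_path(buffer_bytes, interface_mtu, route_mtu, header_bytes=40):
    """Count segments cut for the interface MTU that exceed the route MTU."""
    limit = route_mtu - header_bytes
    segments = wire_segments(buffer_bytes, interface_mtu, header_bytes)
    return sum(1 for payload in segments if payload > limit)


if __name__ == "__main__":
    buf = 64 * 1024
    # Segmentation by the NIC, which sees only the 9000-byte interface MTU.
    print(len(wire_segments(buf, 9000)), "segments at interface MTU")
    # Segmentation that honors an assumed 1500-byte route MTU.
    print(len(wire_segments(buf, 1500)), "segments at route MTU")
    print(oversized_for_path(buf, 9000, 1500), "segments too big for the path")
```

Under these assumptions all eight NIC-built segments are too large for the path, while the kernel-built retransmits, cut to the route MTU, fit. That matches the slide's picture of jumbo packets followed by retransmits.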

Page 13

Conclusions

• Diverse problems have the same general feel of poor performance.
• Flow visualization can isolate problems quickly.
• Very large flows require visualization.
• Protocol limits (host buffers, sftp …) are still a major cause but are becoming less so.
• New and “creative” methods to achieve higher performance can create strangeness and are becoming more of a problem.
• Seeing is believing. Pictures are convincing (to users, system admins and network admins).