Re: [SAGE] TCP Window Tuning for High Bandwidth/High Latency WAN links?
Doug Hughes wrote:
>On Thu, 21 Jul 2005, John Stoffel wrote:
>
>
>
>>Hi all,
>>
>>We're getting ready to deploy the next step of a high speed WAN VLAN
>>network between a bunch of remote offices for interoffice networking
>>needs. It's all going to be DS3s of various speeds with the vendor
>>providing us with an ethernet port (100 mbs/full duplex) to plug into
>>a router.
>>
>>It's all up and working in our test setup, except that we're not
>>seeing the speedups we'd expect with this network. After doing some
>>research, I've come to learn more about the Bandwidth Delay Product
>>(BDP) and how it affects TCP/IP sessions, specifically in our case
>>FTP, NFS and SSH/SCP sessions. Since our users send big files back
>>and forth between sites, sometimes via ftp, sometimes via NFS (we
>>automount all the data directories across the WAN), it's imperative
>>that we get the most bandwidth usage on a per-connection basis. I
>>realize that I can do a bunch of connections at once to get the
>>throughput I need (see iperf and the -P option), but that's not
>>acceptable for sending a 300 GB file across the WAN.
>>
>>Some of the web pages I've been looking at include:
>>
>> http://www.psc.edu/networking/projects/tcptune/
>> http://www.psc.edu/networking/projects/hpn-ssh/theory
>> http://dast.nlanr.net/Guides/GettingStarted/TCP_window_size.html
>> http://dast.nlanr.net/Projects/Iperf/iperfdocs_1.7.0.html
>>
>>Which explain the issue pretty clearly, though some of these articles
>>are a bit dated and don't talk about Solaris 8 much, which is our
>>dominant Unix OS, along with a bunch of Linux boxes in LSF queues.
>>
>>
>>
>The calculations are really quite simple and the settings (at least
>under Solaris) are easy to make too. It's essential for any kind
>of WAN network performance. You've got the right idea with using
>iperf (or ttcp) for testing.
>
>first, get the latency between the sites: (ping)
>second, adjust the TCP window according to the following formula:
>bandwidth (bits/sec) * delay (sec) / 8 = bytes for sliding window
>
>(for solaris, ndd -set /dev/tcp tcp_xmit_hiwat <x>;
>ndd -set /dev/tcp tcp_recv_hiwat <x>;
>ndd -set /dev/tcp <x>
>
>I usually set my tcp_max_buf to ndd -set /dev/tcp tcp_max_buf 83886080)
># Max buffer size for application controlled setsockopt = 80 MB
>
>I also usually add:
>ndd -set /dev/tcp tcp_wscale_always 1
>ndd -set /dev/tcp tcp_tstamp_if_wscale 1
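For what it's worth, the reason window scaling matters here: the TCP header's window field is 16 bits, so without scaling a connection can have at most 65535 bytes in flight per round trip. A rough sketch of the resulting ceiling, assuming an 80 ms round trip purely for illustration (max_bits_per_sec is a made-up helper name, not a Solaris tool):

```shell
#!/bin/sh
# max_bits_per_sec WINDOW_BYTES RTT_MS
# Upper bound on single-connection throughput: one full window per round trip.
# Integer arithmetic only; the helper name is hypothetical.
max_bits_per_sec() {
    echo $(( $1 * 8 * 1000 / $2 ))
}

# Largest window expressible without window scaling (16-bit field).
echo "unscaled 64k window: $(max_bits_per_sec 65535 80) bits/sec"
# A 512k window, which requires window scaling.
echo "512k window:         $(max_bits_per_sec 524288 80) bits/sec"
```

Under those assumptions the unscaled case tops out around 6.5 Mbit/s no matter how fast the link is, while the 512k window allows roughly 52 Mbit/s.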
>
>also, as of Solaris 2.6 and above you do not have to tune sliding windows
>globally. You can do it on an endpoint by endpoint basis:
>ndd -set /dev/tcp tcp_host_param 64.215.96.180 sendspace 1048576 recvspace 1048576 timestamp 1
>
>sendspace and recvspace override the defaults for tcp_xmit_hiwat and
>tcp_recv_hiwat respectively.
>
>This is very useful if you are concerned about memory. (But I wouldn't
>be that concerned unless you plan on running dozens of these at once.
>And then, you'd probably use up the pipe before memory anyway, unless
>you are transmitting from NYC to Singapore or something like that)
>
>
>tcp_xmit_hiwat and tcp_recv_hiwat don't have to be symmetrical.
>If you plan on always sending in one direction, just make sure you
>have corresponding xmit on sender and recv on receiver.
>
>
>
>e.g. to fill a DS-3 at 45Mbits/sec across the country with a latency of 80
>msec you'd need a window of
>45000000 bits/sec * .080 sec / 8 bits/byte = 450000 bytes
>
>round this up for the sliding window to the next power of 2 and you
>get a value of 524288 (512k - 2^19)
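The window calculation is easy to script. A minimal sketch in plain sh, using the same 45 Mbit/s and 80 ms figures; bdp_bytes and next_pow2 are hypothetical helper names, and integer math is assumed to be precise enough for sizing a buffer:

```shell
#!/bin/sh
# bdp_bytes BANDWIDTH_BITS_PER_SEC RTT_MS
# Prints the bandwidth-delay product in bytes (bits/sec * sec / 8).
bdp_bytes() {
    echo $(( $1 * $2 / 1000 / 8 ))
}

# next_pow2 N
# Prints the smallest power of two that is >= N.
next_pow2() {
    p=1
    while [ "$p" -lt "$1" ]; do
        p=$(( p * 2 ))
    done
    echo "$p"
}

window=$(bdp_bytes 45000000 80)
echo "BDP:        $window bytes"
echo "rounded up: $(next_pow2 "$window") bytes"
```

With these inputs it reports 450000 bytes, rounded up to 524288. Plug in your own ping time and link speed to get the value for tcp_xmit_hiwat and tcp_recv_hiwat.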
>
>
>
>
>>So the options for fixing or improving performance seem to come down
>>to:
>>
>>1. tune the TCP settings on each host for all connections, which may
>> impact memory usage and won't do much for LAN connections, if anything.
>>
>>
>>
>we do replication across the country on an OC-3 using TCP with
>sliding windows at a rate of 8-15 Mbytes/sec and haven't had problems
>with memory.
>
>
>
>
>>2. tune the TCP window size on a per-server (ftp, ssh) and per-client
>> basis (ncftp, scp, etc). Then train the users how to use it if the
>> tuning isn't automatic...
>>
>>
>>
>ugh.
>
>
>>3. Get a network box which will do this for us (yet more money...)
>> auto-magically at each site. One option, which I have no pricing on
>> at all, is:
>>
>> http://www.internap.com/products/FCP-solution.htm
>>
>> Though we won't be multi-homed in our WAN/VLAN setup at this time,
>> too much money.
>>
>>So it all comes down to what other people have done and/or are doing
>>in this type of situation? What solutions have you deployed? As WAN
>>links get faster, yet the RTT time doesn't shrink, TCP is going to
>>need some interesting hacks to make it work better in this situation.
>>
>>
>
>Save your money. All that most of these boxes do is disable Nagle,
>turn on sliding windows, and enable deferred acks. You can do all
>of this in software for a lot less money.
>
>Also be aware that you will almost never get the full throughput with
>SCP that you would with an unencrypted TCP transfer. It's not nearly as
>efficient because of the encryption. If you do need more throughput,
>check out the faster encryption options like blowfish. Pay careful
>attention to the algorithm used.
>
>
>
>
>This is a good page:
>http://www.sean.de/Solaris/soltune.html
>
>
>Other parameters to check on that page:
>tcp_deferred_acks_max (play with it - I use 16 - max)
>tcp_slow_start_initial
>tcp_deferred_ack_interval (I use 500)
>tcp_cwnd_max (I use 4194304 on endpoints in Rochester and AZ)
>
> Doug
>
>
>
>
>
>
If a large tcp window isn't appropriate for all your destinations, you
might want to enable the larger window based on route. The
"route" command allows you to do this with the -recvpipe and -sendpipe
options. The following works pretty well for a cross country DS3
with ~68 ms latency w/o altering LAN characteristics:
route change dest -recvpipe 400000
where 'dest' is a destination in your routing table, e.g.:
Routing Table: IPv4
Destination          Gateway              Flags   Ref    Use Interface
-------------------- -------------------- ----- ----- ------ ---------
192.168.1.0          192.168.1.81         U         1      3 rtls0
132.239.2.0          169.27.112.50        UG        1     55 elxl0
224.0.0.0            192.168.1.81         U         1      0 rtls0
default              192.168.1.1          UG        1     24
127.0.0.1            127.0.0.1            UH        2      4 lo0
# route get 132.239.2.0
route to: 132.239.2.0
destination: 132.239.2.0
mask: 255.255.255.0
gateway: 169.27.112.50
interface: elxl0
flags: <UP,GATEWAY,DONE,STATIC>
 recvpipe  sendpipe  ssthresh    rtt,ms rttvar,ms  hopcount      mtu     expire
   400000         0         0         0         0         0     1500          0
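As a sanity check on the 400000 figure, here is a rough sketch of the bandwidth-delay product at ~68 ms. It assumes the round 45 Mbit/s DS3 rate used earlier in the thread, and the variable names are just for illustration:

```shell
#!/bin/sh
rtt_ms=68            # measured round-trip latency, in milliseconds
link_bps=45000000    # assumed DS3 rate (round figure, not the exact line rate)
recvpipe=400000      # value passed to route -recvpipe

# Bandwidth-delay product in bytes: bits/sec * sec / 8.
bdp=$(( link_bps * rtt_ms / 1000 / 8 ))
echo "BDP at ${rtt_ms} ms: $bdp bytes"

if [ "$recvpipe" -ge "$bdp" ]; then
    echo "recvpipe of $recvpipe covers the BDP"
else
    echo "recvpipe of $recvpipe is below the BDP"
fi
```

Under those assumptions the product comes to 382500 bytes, so 400000 fills the pipe with a little headroom.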
Bob