Re: [SAGE] TCP Window Tuning for High Bandwidth/High Latency WAN links?
On Thu, 21 Jul 2005, John Stoffel wrote:
>
> Hi all,
>
> We're getting ready to deploy the next step of a high speed WAN VLAN
> network between a bunch of remote offices for interoffice networking
> needs. It's all going to be DS3s of various speeds with the vendor
> providing us with an ethernet port (100 mbs/full duplex) to plug into
> a router.
>
> It's all up and working in our test setup, except that we're not
> seeing the speedups we'd expect with this network. After doing some
> research, I've come to learn more about the Bandwidth Delay Product
> (BDP) and how it affects TCP/IP sessions, specifically in our case
> FTP, NFS and SSH/SCP sessions. Since our users send big files back
> and forth between sites, sometimes via ftp, sometimes via NFS (we
> automount all the data directories across the WAN), it's imperative
> that we get the most bandwidth usage on a per-connection basis. I
> realize that I can do a bunch of connections at once to get the
> throughput I need (see iperf and the -P option), but that's not
> acceptable for sending a 300gb file across the WAN.
>
> Some of the web pages I've been looking at include:
>
> http://www.psc.edu/networking/projects/tcptune/
> http://www.psc.edu/networking/projects/hpn-ssh/theory
> http://dast.nlanr.net/Guides/GettingStarted/TCP_window_size.html
> http://dast.nlanr.net/Projects/Iperf/iperfdocs_1.7.0.html
>
> Which explain the issue pretty clearly, though some of these articles
> are a bit dated and don't talk about Solaris 8 much, which is our
> dominant Unix OS, along with a bunch of Linux boxes in LSF queues.
>
The calculations are really quite simple and the settings (at least
under Solaris) are easy to make too. Getting the window size right is
essential for any kind of WAN network performance. You've got the
right idea with using iperf (or ttcp) for testing.
First, get the latency (round-trip time) between the sites with ping.
Second, adjust the TCP window according to the following formula:
bandwidth (bits/sec) * delay (sec) / 8 = bytes for sliding window
(For Solaris:
ndd -set /dev/tcp tcp_xmit_hiwat <x>
ndd -set /dev/tcp tcp_recv_hiwat <x>
ndd -set /dev/tcp <x>
I usually set tcp_max_buf with: ndd -set /dev/tcp tcp_max_buf 83886080)
# Max buffer size for application-controlled setsockopt = 80 MB
I also usually add:
ndd -set /dev/tcp tcp_wscale_always 1
ndd -set /dev/tcp tcp_tstamp_if_wscale 1
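As a rough sketch, the global settings above can be generated from a single window size. The function name print_ndd_settings is made up for illustration, it assumes a POSIX shell (ksh or bash rather than the legacy Solaris /bin/sh), and it only prints the commands so they can be reviewed before running them as root. Only the parameter names given above are used.

```shell
#!/bin/sh
# Print (do not execute) the global Solaris ndd settings for one window size.
# $1 = window size in bytes
# $2 = tcp_max_buf in bytes (optional; defaults to the value quoted above)
# print_ndd_settings is a hypothetical helper name, not a Solaris tool.
print_ndd_settings() {
    win=$1
    maxbuf=${2:-83886080}
    echo "ndd -set /dev/tcp tcp_xmit_hiwat $win"
    echo "ndd -set /dev/tcp tcp_recv_hiwat $win"
    echo "ndd -set /dev/tcp tcp_max_buf $maxbuf"
    echo "ndd -set /dev/tcp tcp_wscale_always 1"
    echo "ndd -set /dev/tcp tcp_tstamp_if_wscale 1"
}

# Example: a 512k window
print_ndd_settings 524288
```

Values set with ndd do not survive a reboot, so the printed commands would normally end up in a startup script.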
Also, as of Solaris 2.6 you do not have to tune sliding windows
globally. You can do it on an endpoint-by-endpoint basis:
ndd -set /dev/tcp tcp_host_param 64.215.96.180 sendspace 1048576 recvspace 1048576 timestamp 1
sendspace and recvspace override the defaults for tcp_xmit_hiwat and
tcp_recv_hiwat respectively.
This is very useful if you are concerned about memory. (But I wouldn't
be that concerned unless you plan on running dozens of these at once.
And then, you'd probably use up the pipe before memory anyway, unless
you are transmitting from NYC to Singapore or something like that)
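To put a number on the memory concern, here is a quick upper-bound estimate. It assumes every connection fills both its send and receive buffers at the same time, which is the worst case; the connection count of 24 and the helper name worst_case_buffer_bytes are made up for illustration, and the 1048576-byte buffers are the sendspace/recvspace values used above.

```shell
#!/bin/sh
# Upper bound on TCP buffer memory: connections * (send buffer + recv buffer).
# $1 = number of simultaneous connections
# $2 = send buffer size in bytes, $3 = receive buffer size in bytes
# worst_case_buffer_bytes is a hypothetical helper name.
worst_case_buffer_bytes() {
    echo $(( $1 * ($2 + $3) ))
}

total=$(worst_case_buffer_bytes 24 1048576 1048576)
echo "worst case: $total bytes ($(( total / 1048576 )) MB)"
```

Under those assumptions two dozen 1 MB connections top out at 48 MB of buffer space.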
tcp_xmit_hiwat and tcp_recv_hiwat don't have to be symmetrical.
If you plan on always sending in one direction, just make sure you
have corresponding xmit on sender and recv on receiver.
E.g., to fill a DS-3 at 45 Mbits/sec across the country with a latency
of 80 msec you'd need a window of
45000000 bits/sec * .080 sec / 8 bits/byte = 450000 bytes
Round this up to the next power of 2 for the sliding window and you
get a value of 524288 (512k, 2^19).
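The same arithmetic as a small shell sketch, in case you want to redo it for each site pair. The function names bdp_bytes and next_pow2 are made up for illustration, and it assumes a POSIX shell with 64-bit arithmetic (bash or ksh93, not the legacy Solaris /bin/sh).

```shell
#!/bin/sh
# Bandwidth-delay product in bytes.
# $1 = link bandwidth in bits/sec, $2 = round-trip time in milliseconds.
# bits/sec * ms / 1000 gives bits in flight; dividing by 8 gives bytes.
bdp_bytes() {
    echo $(( $1 * $2 / 8000 ))
}

# Smallest power of two that is >= $1.
next_pow2() {
    p=1
    while [ "$p" -lt "$1" ]; do
        p=$(( p * 2 ))
    done
    echo "$p"
}

# DS-3 at 45 Mbits/sec with 80 msec of latency
bdp=$(bdp_bytes 45000000 80)
echo "BDP: $bdp bytes, window: $(next_pow2 "$bdp") bytes"
```

For the DS-3 case this prints a BDP of 450000 bytes and a window of 524288 bytes.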
> So the options for fixing or improving performance seem to come down
> to:
>
> 1. tune the TCP settings on each host for all connections, which may
> impact memory usage and won't do much for LAN connections if at all.
>
we do replication across the country on an OC-3 using TCP with
sliding windows at a rate of 8-15 Mbytes/sec and haven't had problems
with memory.
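For a sense of scale, a single connection can never move more than one window per round trip, so its ceiling is window / RTT. The sketch below assumes an 80 msec round trip (borrowed from the DS-3 example, not a measurement of our OC-3 path) and uses a made-up helper name, max_bytes_per_sec. 65535 bytes is the largest window TCP can advertise without window scaling.

```shell
#!/bin/sh
# Per-connection throughput ceiling in bytes/sec: window / round-trip time.
# $1 = window size in bytes, $2 = round-trip time in milliseconds.
# max_bytes_per_sec is a hypothetical helper name.
max_bytes_per_sec() {
    echo $(( $1 * 1000 / $2 ))
}

echo "64k window:  $(max_bytes_per_sec 65535 80) bytes/sec"
echo "512k window: $(max_bytes_per_sec 524288 80) bytes/sec"
echo "1M window:   $(max_bytes_per_sec 1048576 80) bytes/sec"
```

At 80 msec an unscaled window tops out around 0.8 Mbytes/sec no matter how fast the link is, while a 1M window allows roughly 13 Mbytes/sec.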
> 2. tune the TCP window size on a per-server (ftp, ssh) and per-client
> basis (ncftp, scp, etc). Then, if the tuning isn't automatic,
> train the users how to use it...
>
ugh.
> 3. Get a network box which will do this for us (yet more money...)
> auto-magically at each site. One option, which I have pricing on
> at all is:
>
> http://www.internap.com/products/FCP-solution.htm
>
> Though we won't be multi-homed in our WAN/VLAN setup at this time,
> too much money.
>
> So it all comes down to what other people have done and/or are doing
> in this type of situation? What solutions have you deployed? As WAN
> links get faster, yet the RTT time doesn't shrink, TCP is going to
> need some interesting hacks to make it work better in this situation.
Save your money. All that most of these boxes do is disable Nagle,
turn on sliding windows, and enable deferred acks. You can do all
of this in software for a lot less money.
Also be aware that you will almost never get full throughput with SCP
like you would with an unencrypted TCP transfer. It's not nearly as
efficient because of the encryption. If you do need more throughput,
check out the faster encryption options like blowfish. Pay careful
attention to the algorithm used.
This is a good page:
http://www.sean.de/Solaris/soltune.html
Other parameters to check on that page:
tcp_deferred_acks_max (play with it - I use 16 - max)
tcp_slow_start_initial
tcp_deferred_ack_interval (I use 500)
tcp_cwnd_max (I use 4194304 on endpoints in Rochester and AZ)
Doug