Linux
Kernel Tuning Using System Control
Dustin Puryear
Some of the most notable performance improvements for Linux can
be accomplished via system control (sysctl) in /proc/sys. Unlike
most other areas of /proc under Linux, sysctl variables are typically
writable, and are used to adjust the running kernel rather than
simply monitor currently running processes and system information.
In this article, I'll walk you through several areas of sysctl
that can result in large performance improvements. While certainly
not a definitive work, this article should provide the foundation
needed for further research and experimentation with Linux sysctl.
Note that I wrote this article with the 2.4 kernel in mind. Some
variables may exist in earlier kernels but not in 2.4, or vice versa.
Working with the sysctl Interface
The sysctl interface allows administrators to modify variables
that the kernel uses to determine behavior. There are two ways to
work with sysctl: by directly reading and modifying files in /proc/sys
and by using the sysctl program supplied with most, if not all,
distributions. Most documentation on sysctl accesses variables via
the /proc/sys file system, and does so using cat for viewing
and echo for changing variables, as shown in the following
example where IP forwarding is enabled:
# cat /proc/sys/net/ipv4/ip_forward
0
# echo "1" > /proc/sys/net/ipv4/ip_forward
# cat /proc/sys/net/ipv4/ip_forward
1
This is a simple way to work with sysctl. An alternative is the
sysctl program, which provides a more convenient interface to the
same variables. With the sysctl program, you specify a path to the variable,
with /proc/sys being the base. For example, to view /proc/sys/net/ipv4/ip_forward,
use the following command:
# sysctl net.ipv4.ip_forward
net.ipv4.ip_forward = 1
To then update this variable, use the -w (write) option:
# sysctl -w net.ipv4.ip_forward="0"
net.ipv4.ip_forward = 0
In this example, I have simply undone what was accomplished earlier
when using cat and echo.
Deciding which to use is often a matter of preference, but sysctl
does have the benefit of being supported via the /etc/sysctl.conf
configuration file, which is read during system startup. After experimenting
with variables that increase the performance or reliability of the
system, you should enter and document these variables in /etc/sysctl.conf:
# Controls IP packet forwarding
net.ipv4.ip_forward = 0
# Controls source route verification
net.ipv4.conf.default.rp_filter = 1
You can also specify that the sysctl program reread /etc/sysctl.conf
via the -p option:
# sysctl -p
net.ipv4.ip_forward = 0
net.ipv4.conf.default.rp_filter = 1
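As an aside, the sysctl program can also dump every available variable
and its current value via the -a option, which is handy when you are
hunting for something to tune. For example, to locate the forwarding
variable used above (the grep filter is simply my own habit):
# sysctl -a | grep ip_forward
net.ipv4.ip_forward = 0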
In this article, I will typically be using the sysctl program syntax
for accessing sysctl variables (i.e., I will use net.ipv4.ip_forward
rather than /proc/sys/net/ipv4/ip_forward).
Getting to Work
sysctl exposes several important elements of the kernel beneath
/proc/sys, and I will be focusing on /proc/sys/fs, /proc/sys/vm,
and /proc/sys/net, which are used to tune file system, virtual memory
and disk buffers, and network code, respectively. Of course, there
is a lot more available in sysctl than what can be covered here,
so use this article as a stepping stone toward learning more about
sysctl.
Tweaking /proc/sys/fs: File Systems
The /proc/sys/fs interface exposes several interesting variables,
but only a few will directly affect the performance or utilization
of your system. For most workstations or lightly loaded servers,
you can typically leave everything as is, but as your system offers
more services and opens more files, begin monitoring fs.file-nr:
# sysctl fs.file-nr
fs.file-nr = 7343 2043 8192
The fs.file-nr variable displays three parameters: total allocated
file handles, free (allocated but currently unused) file handles, and the maximum number of file handles
that can be allocated. The Linux kernel dynamically allocates file
handles whenever a file handle is requested by an application, but
it does not free these handles when they are released by the application.
Instead, the file handles are recycled. This means that over time
you will see the total allocated file handles increase as the server
reaches new peaks of file handle use, even though the number of in-use
file handles may be low. If you are running a server that opens a
large number of files, such as a news or file server, then you should
pay close attention to these parameters when tuning the system.
Adjusting the maximum file handles that Linux will allocate is
only a matter of updating fs.file-max:
# sysctl -w fs.file-max="32768"
fs.file-max = 32768
# sysctl fs.file-nr
fs.file-nr = 7343 2043 32768
Here I have quadrupled the maximum number of file handles that may
be allocated, noting that peak allocation is currently topping out
at 7,343 file handles, of which 2,043 are sitting free for reuse.
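If the higher limit proves itself on your workload, carry it across
reboots by adding it to /etc/sysctl.conf as described earlier. A
minimal sketch, using the value chosen above (the comment text is my own):
# Allow more open file handles on this file server
fs.file-max = 32768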
In 2.2 kernels, you would also need to worry about setting a similar
variable for inodes via fs.inode-max, but as of the 2.4 kernel,
this is no longer necessary, and indeed this variable is no longer
available under /proc/sys/fs. You can, however, still view information
on inode usage via fs.inode-state. There are several other variables
under /proc/sys/fs, but their 2.4 kernel defaults are quite sufficient
for all but the most extreme cases.
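For example, on a lightly loaded test system the output might look
something like the following, where the first two numbers are the
allocated and unused inode counts and the remaining fields are unused
placeholders (the values shown here are only illustrative):
# sysctl fs.inode-state
fs.inode-state = 16237 433 0 0 0 0 0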
Learn more about /proc/sys/fs in /usr/src/linux/Documentation/sysctl/fs.txt.
Much of the information dates from the 2.2 kernel, but the document
still contains some excellent nuggets.
Tweaking /proc/sys/vm: Virtual Memory
There are two variables under /proc/sys/vm that you will find
very useful in tweaking how the disk buffers and the Linux VM work
with your disks and file systems. The first, vm.bdflush, allows
you to adjust how the kernel will flush dirty buffers to disk. Disk
buffers are used by the kernel to cache data stored on disks, which
are very slow compared to RAM. A buffer is dirty when its contents
have been changed so that they differ from what is on the disk; when
enough of the buffer cache becomes dirty, the kernel daemon bdflush
writes dirty buffers back to disk.
When viewing vm.bdflush you will see several parameters:
# sysctl vm.bdflush
vm.bdflush = 30 500 0 0 500 3000 60 20 0
Some of the parameters are dummy values. For now, pay attention to
the first, second, and seventh parameters (nfract, ndirty, and nfract_sync,
respectively). nfract specifies the maximum percentage of the buffer
cache that may be dirty before bdflush is woken up to begin queuing
dirty buffers to be written to disk. ndirty specifies the maximum
number of buffers that bdflush will flush in one pass. Finally,
nfract_sync is similar to nfract, but once the percentage of dirty
buffers reaches nfract_sync, the flush is forced synchronously rather
than queued.
Adjusting vm.bdflush is something of an art because you need to
extensively test the effect on your server and target applications.
If the server has an intelligent controller and disk, then decreasing
the total number of flushes (which will in turn cause each flush
that is done to take a bit longer) may increase overall performance.
However, with a slower disk, the system may end up spending more
time waiting for the flush to finish. For this tweak, you need to
test, test, and then test some more.
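One crude but quick comparison is to time a large sequential write,
followed by a sync so that the flush itself is counted. A rough sketch,
in which the test file path and size are arbitrary and no substitute
for testing with your real application load:
# time sh -c "dd if=/dev/zero of=/tmp/bdflush-test bs=1024k count=256; sync"
# rm /tmp/bdflush-test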
The default for nfract is 30%, and it's 60% for nfract_sync.
When increasing nfract, make sure the new value remains below
nfract_sync:
# sysctl -w vm.bdflush="60 500 0 0 500 3000 80 20 0"
vm.bdflush = 60 500 0 0 500 3000 80 20 0
Here, nfract is being set to 60% and nfract_sync to 80%.
The ndirty parameter simply specifies how much bdflush will write
to disk at any one time. The larger this value, the longer it could
potentially take bdflush to complete its updates to disk.
You can also use vm.kswapd to tune how many pages of memory are
paged out by the kernel swap daemon, kswapd, when memory is needed:
# sysctl vm.kswapd
vm.kswapd = 512 32 8
The vm.kswapd variable has three parameters: tries_base, the maximum
number of pages that kswapd tries to free in one round; tries_min,
the minimum pages that kswapd will free when writing to disk (in other
words, kswapd will try to at least get some work done when it wakes
up); and swap_cluster, the number of pages that kswapd will write
in one round of paging.
The performance tweak, which is similar to the adjustment made
to vm.bdflush, is to increase the number of pages that kswapd pages
out at once on systems that page often by modifying the first and
last parameters:
# sysctl -w vm.kswapd="1024 32 64"
vm.kswapd = 1024 32 64
Here I am specifying that kswapd search up to 1024 pages to be paged
out, and that during one round of paging kswapd can write out
64 pages. There is no hard and fast rule on modifying these parameters
as their effect is very much dependent on disk speed. The best bet
is to simply experiment until finding the right value for the server
application.
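While experimenting, it helps to watch paging activity under a
realistic load; the si and so columns of vmstat (swap-in and swap-out
activity) give a quick feel for how hard kswapd is working:
# vmstat 5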
I suggest that you review /usr/src/linux/Documentation/sysctl/vm.txt
for more information. Again, this documentation is generally dated
to the 2.2 kernel, but the information is still mostly relevant.
Tweaking /proc/sys/net: Networking
Unlike the other two areas discussed, /proc/sys/net offers many
more opportunities to tweak and tune your system's performance.
Unfortunately, you can also break your system's compatibility
with other computers on the Internet, so be sure to rigorously test
changes. In this article, however, I will not discuss any changes
that can affect compatibility, so the changes here can be evaluated
simply on the basis of their performance improvements.
When viewing /proc/sys/net, you will see several different directories:
# ls -l /proc/sys/net
total 0
dr-xr-xr-x 2 root root 0 Aug 14 10:55 802
dr-xr-xr-x 2 root root 0 Aug 14 10:55 core
dr-xr-xr-x 2 root root 0 Aug 14 10:55 ethernet
dr-xr-xr-x 5 root root 0 Aug 14 10:55 ipv4
dr-xr-xr-x 2 root root 0 Aug 14 10:55 token-ring
dr-xr-xr-x 2 root root 0 Aug 14 10:55 unix
In this article, I am only going to address net.core and net.ipv4.
net.core typically provides defaults for all networking components,
especially in terms of memory usage and buffer allocation for send
and receive buffers. net.ipv4, on the other hand, only has variables
that affect the IPv4 stack, and many, but not all, of those variables
override the corresponding net.core defaults. When working with net.core and net.ipv4,
you should concern yourself with three areas: new connections, established
connections, and closing connections. When thinking along these
lines, it is usually easy to determine which variables to tune.
An excellent example is how Linux handles half-open connections,
that is, connections that have been initiated to the server but for
which the three-way TCP handshake has not yet completed. You can
see connections that are in this state by looking for SYN_RECV in
the output of netstat:
# netstat -nt
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 0 0 127.0.0.1:389 127.0.0.1:52994 TIME_WAIT
tcp 0 1 10.0.0.23:25 10.0.0.93:3432 SYN_RECV
When dealing with a heavily loaded service or with clients on
high-latency or unreliable connections, the number of half-open
connections is going to increase. Web server administrators are
particularly aware of this issue because many Web site clients are on
dial-up, which tends to have high latency and clients that sometimes
disappear from the Internet entirely.
In Unix, half-open connections are placed in the incomplete (or
backlog) connections queue, and under Linux, the amount of space
available in this queue is specified by net.ipv4.tcp_max_syn_backlog.
It's important to realize that each half-open connection consumes
memory. Also, realize that a common Denial of Service attack, the
syn-flood attack, is based on the knowledge that your server will
no longer be able to serve new connection requests if an attacker
opens enough half-open connections.
If you are running a site that does need to handle a large number
of half-open connections, then consider increasing this value:
# sysctl -w net.ipv4.tcp_max_syn_backlog="1024"
net.ipv4.tcp_max_syn_backlog = 1024
As a side note, many administrators also enable syn-cookies, which
enable a server to handle new connections even when the incomplete
connections queue is full (e.g., during a syn-flood attack):
# sysctl -w net.ipv4.tcp_syncookies="1"
net.ipv4.tcp_syncookies = 1
Unfortunately, for connections that are established via a syn-cookie
(that is, while the queue is overflowing), advanced TCP features such
as window scaling (discussed later) cannot be used.
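If these settings prove useful, they belong in /etc/sysctl.conf along
with your other changes so that they survive a reboot; a sketch using
the values from above:
# Larger incomplete connection queue, with syn-cookies as a safety net
net.ipv4.tcp_max_syn_backlog = 1024
net.ipv4.tcp_syncookies = 1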
Another important consideration when connections are being established
is ensuring your server has enough local ports to allocate to sockets
for outgoing connections. When a server, such as an HTTP proxy, has
a large number of outgoing connections, the server may run out of
local ports. The number of local ports dedicated to outgoing connections
is specified in net.ipv4.ip_local_port_range, and the default is
to allocate ports 1024 to 4999 for this purpose:
# sysctl net.ipv4.ip_local_port_range
net.ipv4.ip_local_port_range = 1024 4999
To adjust these values, simply increase this range. A common change
is to allow outgoing connections on local ports 32768 to 61000:
# sysctl -w net.ipv4.ip_local_port_range="32768 61000"
net.ipv4.ip_local_port_range = 32768 61000
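A rough way to gauge whether a busy proxy is approaching the limit is
to count its TCP sockets with netstat; note that the count below
includes inbound connections as well, so treat it only as an upper bound:
# netstat -nt | grep -c ESTABLISHED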
Once a TCP session has been established, you need to think about how
efficiently TCP/IP uses the available bandwidth. One of the most common
ways to increase the utilization is to adjust the possible size of
the TCP congestion window. The TCP congestion window is simply how
many bytes of data the server will send over a connection before it
requires an acknowledgement by the client on the other end of the
connection. The larger the window, the more data is allowed on the
wire at a time, and vice versa. This is a key point to understand
because if you are serving clients on a high latency network (e.g.,
a WAN or the Internet), then your server is probably wasting a lot
of network capacity while it waits for a client ACK.
The congestion window will start at a small size and increase
over time as the server begins to trust the connection. The maximum
size of the window is limited by the size of the send buffer because
the server must be able to resend any data that is lost, and
unacknowledged data must therefore remain in the send buffer until
the client acknowledges it.
Adjusting the buffer size used by Linux is a matter of adjusting
both net.core.wmem_max and net.ipv4.tcp_wmem. net.core.wmem_max
specifies the maximum buffer size for the send queue for any protocol,
including IPv4. net.ipv4.tcp_wmem, on the other hand, includes three
parameters: the minimum size of a buffer regardless of how much
stress is on the memory system, the default size of a buffer, and
the maximum size of a buffer. The default size specified in net.ipv4.tcp_wmem
will override net.core.wmem_default, so we can simply ignore net.core.wmem_default.
However, net.core.wmem_max overrides the maximum buffer size specified
in net.ipv4.tcp_wmem, so when changing net.ipv4.tcp_wmem, be sure
that the maximum buffer size specified in net.core.wmem_max is as
large or larger than the maximum buffer size specified by net.ipv4.tcp_wmem.
Whew! Let's go over that with an example.
By default, Linux configures the minimum guaranteed buffer size
to be 4K, the default buffer size as 16K, and the maximum buffer
size as 128K:
# sysctl net.ipv4.tcp_wmem
net.ipv4.tcp_wmem = 4096 16384 131072
You can determine your optimal window by using the bandwidth-delay
product, which will help you find a general range where you should
begin experimenting with congestion window sizes:
window-size = bandwidth (bytes/sec) * round-trip time (seconds)
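For example, suppose your typical client sits behind a 1.5-Mbit/sec
link (about 187,500 bytes/sec) with a 250-ms round-trip time; the
numbers are purely illustrative:
window-size = 187500 bytes/sec * 0.25 sec = 46875 bytes
Rounded up to a convenient size, that suggests a window of roughly 48K.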
Let's say that you determine the congestion window size should
be 48K. You should then adjust the parameters to net.ipv4.tcp_wmem
to reflect this size as the default size for the send buffer:
# sysctl -w net.ipv4.tcp_wmem="4096 49152 131072"
net.ipv4.tcp_wmem = 4096 49152 131072
That's all there is to it. Note, however, that historically the
congestion window has been limited to 64K in size. RFC 1323 did away
with this limit by introducing window scaling, which allows for even
larger values. TCP window scaling is enabled by default on the 2.4
kernel, but, just to make sure, enable it explicitly when specifying
a value of 64K or larger:
# sysctl -w net.ipv4.tcp_window_scaling="1"
net.ipv4.tcp_window_scaling = 1
# sysctl -w net.core.wmem_max="262144"
net.core.wmem_max = 262144
# sysctl -w net.ipv4.tcp_wmem="4096 131072 262144"
net.ipv4.tcp_wmem = 4096 131072 262144
Here I have increased the default buffer size to 128KB, and the maximum
buffer size to double that number, or 256KB. Since the new default
and maximum buffer sizes are larger than the original value in net.core.wmem_max,
I must also adjust that value as it will override the maximum specified
by net.ipv4.tcp_wmem.
Also investigate net.core.rmem_default, net.core.rmem_max, and
net.ipv4.tcp_rmem, which are the variables that control the size
of the receive buffer. These can have a large impact, especially on
client systems and on file servers.
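The receive side is adjusted in exactly the same way as the send side.
For example, to mirror the send-buffer sizes used above (again, these
values are only illustrative):
# sysctl -w net.core.rmem_max="262144"
net.core.rmem_max = 262144
# sysctl -w net.ipv4.tcp_rmem="4096 131072 262144"
net.ipv4.tcp_rmem = 4096 131072 262144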
The final consideration is closing connections. One problem
that servers will face, especially if clients may disappear or otherwise
not close connections, is that the server will have a large number
of open but unused connections. TCP has a keepalive function that,
for sockets that have enabled keepalive, will begin probing the
connection after a given amount of inactivity. By default, Linux will
wait for 7200 seconds, or two hours, before sending the first probe:
# sysctl net.ipv4.tcp_keepalive_time
net.ipv4.tcp_keepalive_time = 7200
That's a long time, especially if serving a large number of clients
that only require short-lived connections. Good examples of this are
Web servers. The trick here is to reduce how long a quiet TCP connection
is allowed to live by adjusting net.ipv4.tcp_keepalive_time to something
perhaps along the lines of 30 minutes:
# sysctl -w net.ipv4.tcp_keepalive_time="1800"
net.ipv4.tcp_keepalive_time = 1800
You can also adjust how many times the connection will be probed, and
how long to wait between probes, before the connection is forcefully closed.
But relative to the time specified by net.ipv4.tcp_keepalive_time,
these values are low. If you're interested, review net.ipv4.tcp_keepalive_probes
and net.ipv4.tcp_keepalive_intvl.
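On a stock 2.4 kernel, these typically default to nine probes sent 75
seconds apart, so an idle connection is dropped roughly 11 minutes
after probing begins:
# sysctl net.ipv4.tcp_keepalive_probes
net.ipv4.tcp_keepalive_probes = 9
# sysctl net.ipv4.tcp_keepalive_intvl
net.ipv4.tcp_keepalive_intvl = 75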
Conclusion
You can see dramatic improvements in performance if you know where
to look. One of the most critical elements to consider when tuning
the kernel and overall system performance is sysctl. The variables
mentioned in this article will take you well on your way to understanding
how sysctl affects your system, and I invite you to learn more by
reading the documentation referenced. Keep in mind that there are
many more tweaks available to you via sysctl.
Suggested additional readings are O'Reilly & Associates'
System Performance Tuning, 2nd Edition and Understanding
the Linux Kernel, 2nd Edition.
Dustin Puryear is a consultant providing expertise in managing
and tuning Unix systems, services, and applications, with a strong
focus on open source, and is author of Integrate Linux Solutions
into Your Windows Network. He can be contacted at: dustin@puryear-it.com.