Linux Kernel Tuning Using System Control

Dustin Puryear

Some of the most notable performance improvements for Linux can be accomplished via system control (sysctl) in /proc/sys. Unlike most other areas of /proc under Linux, sysctl variables are typically writable, and are used to adjust the running kernel rather than simply monitor currently running processes and system information. In this article, I'll walk you through several areas of sysctl that can result in large performance improvements. While certainly not a definitive work, this article should provide the foundation needed for further research and experimentation with Linux sysctl.

Note that I wrote this article with the 2.4 kernel in mind. Some variables may exist in earlier kernels but not in 2.4, or vice versa.

Working with the sysctl Interface

The sysctl interface allows administrators to modify variables that the kernel uses to determine behavior. There are two ways to work with sysctl: by directly reading and modifying files in /proc/sys and by using the sysctl program supplied with most, if not all, distributions. Most documentation on sysctl accesses variables via the /proc/sys file system, and does so using cat for viewing and echo for changing variables, as shown in the following example where IP forwarding is enabled:

# cat /proc/sys/net/ipv4/ip_forward
0
# echo "1" > /proc/sys/net/ipv4/ip_forward
# cat /proc/sys/net/ipv4/ip_forward
1
This is an easy way to work with sysctl. An alternative is the sysctl program, which provides a more convenient command-line interface to the same variables. With the sysctl program, you specify a path to the variable, with /proc/sys as the base and dots in place of slashes. For example, to view /proc/sys/net/ipv4/ip_forward, use the following command:

# sysctl net.ipv4.ip_forward
net.ipv4.ip_forward = 1
To then update this variable, use the -w (write) option:

# sysctl -w net.ipv4.ip_forward="0"
net.ipv4.ip_forward = 0
In this example, I have simply undone what was accomplished earlier when using cat and echo.

Deciding which to use is often a matter of preference, but sysctl does have the benefit of being supported via the /etc/sysctl.conf configuration file, which is read during system startup. After experimenting with variables that increase the performance or reliability of the system, you should enter and document these variables in /etc/sysctl.conf:

# Controls IP packet forwarding
net.ipv4.ip_forward = 0
# Controls source route verification
net.ipv4.conf.default.rp_filter = 1
You can also specify that the sysctl program reread /etc/sysctl.conf via the -p option:

# sysctl -p
net.ipv4.ip_forward = 0
net.ipv4.conf.default.rp_filter = 1
In this article, I will typically be using the sysctl program syntax for accessing sysctl variables (i.e., I will use net.ipv4.ip_forward rather than /proc/sys/net/ipv4/ip_forward).

Getting to Work

sysctl exposes several important elements of the kernel beneath /proc/sys, and I will be focusing on /proc/sys/fs, /proc/sys/vm, and /proc/sys/net, which are used to tune file system, virtual memory and disk buffers, and network code, respectively. Of course, there is a lot more available in sysctl than what can be covered here, so use this article as a stepping stone toward learning more about sysctl.

Tweaking /proc/sys/fs: File Systems

The /proc/sys/fs interface exposes several interesting variables, but only a few will directly affect the performance or utilization of your system. For most workstations or lightly loaded servers, you can typically leave everything as is, but as your system offers more services and opens more files, begin monitoring fs.file-nr:

# sysctl fs.file-nr
fs.file-nr = 7343 2043 8192
The fs.file-nr variable displays three parameters: total allocated file handles, currently used file handles, and maximum file handles that can be allocated. The Linux kernel dynamically allocates file handles whenever a file handle is requested by an application, but it does not free these handles when they are released by the application. Instead, the file handles are recycled. This means that over time you will see the total allocated file handles increase as the server reaches new peaks of file handle use, even though the number of in-use file handles may be low. If you are running a server that opens a large number of files, such as a news or file server, then you should pay close attention to these parameters when tuning the system.
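
A quick way to watch these numbers while the system is under load is a simple shell loop (this is just a monitoring convenience, not a sysctl feature; adjust the interval to taste):

# while true; do sysctl fs.file-nr; sleep 60; done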

Adjusting the maximum file handles that Linux will allocate is only a matter of updating fs.file-max:

# sysctl -w fs.file-max="32768"
fs.file-max = 32768
# sysctl fs.file-nr
fs.file-nr = 7343 2043 32768
Here I have quadrupled the maximum number of file handles that may be allocated, noting that the peak usage is currently topping out at 7,343 file handles. The server has only 2,043 file handles currently in use.

In 2.2 kernels, you would also need to worry about setting a similar variable for inodes via fs.inode-max, but as of the 2.4 kernel, this is no longer necessary, and indeed this variable is no longer available under /proc/sys/fs. You can, however, still view information on inode usage via fs.inode-state. There are several other variables that can be used in /proc/sys/fs, but the 2.4 kernel defaults for most other variables are quite sufficient for all but the most extreme cases.
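
For example, you can view the inode counters in the same way as any other variable (the values will, of course, vary from system to system):

# sysctl fs.inode-state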

Learn more about /proc/sys/fs in /usr/src/linux/Documentation/sysctl/fs.txt. The information is generally dated to the 2.2 kernel, but there are some excellent nuggets of information in the document.

Tweaking /proc/sys/vm: Virtual Memory

There are two variables under /proc/sys/vm that you will find very useful in tweaking how the disk buffers and the Linux VM work with your disks and file systems. The first, vm.bdflush, allows you to adjust how the kernel flushes dirty buffers to disk. Disk buffers are used by the kernel to cache data stored on disks, which are very slow compared to RAM. Whenever a buffer becomes dirty (i.e., its contents have been changed so that it differs from what is on the disk), the kernel daemon bdflush will eventually flush it to disk.

When viewing vm.bdflush you will see several parameters:

# sysctl vm.bdflush
vm.bdflush = 30 500 0 0 500 3000 60 20 0
Some of the parameters are dummy values. For now, pay attention to the first, second, and seventh parameters (nfract, ndirty, and nfract_sync, respectively). nfract specifies the maximum percentage of the buffer cache that may be dirty before bdflush activates and begins queuing dirty buffers to be written to disk. ndirty specifies the maximum number of buffers that bdflush will flush at once. Finally, nfract_sync is similar to nfract, but once the percentage of dirty buffers reaches nfract_sync, the flush is forced synchronously rather than queued.

Adjusting vm.bdflush is something of an art because you need to extensively test the effect on your server and target applications. If the server has an intelligent controller and disk, then decreasing the total number of flushes (which will in turn cause each flush that is done to take a bit longer) may increase overall performance. However, with a slower disk, the system may end up spending more time waiting for the flush to finish. For this tweak, you need to test, test, and then test some more.

The default for nfract is 30%, and it's 60% for nfract_sync. When increasing nfract, make sure the new value stays below nfract_sync:

# sysctl -w vm.bdflush="60 500 0 0 500 3000 80 20 0"
vm.bdflush = 60 500 0 0 500 3000 80 20 0
Here, nfract is being set to 60% and nfract_sync to 80%.

The ndirty parameter simply specifies how much bdflush will write to disk at any one time. The larger this value, the longer it could potentially take bdflush to complete its updates to disk.
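
If testing shows that a slower disk spends too long on each flush, one option is to reduce ndirty while leaving the other parameters at their defaults. The value below is only an illustration, not a recommendation:

# sysctl -w vm.bdflush="30 250 0 0 500 3000 60 20 0"
vm.bdflush = 30 250 0 0 500 3000 60 20 0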

You can also tune how many pages of memory are paged out by the kernel swap daemon, kswapd, when memory is needed using vm.kswapd:

# sysctl vm.kswapd
vm.kswapd = 512 32 8
The vm.kswapd variable has three parameters: tries_base, the maximum number of pages that kswapd tries to free in one round; tries_min, the minimum pages that kswapd will free when writing to disk (in other words, kswapd will try to at least get some work done when it wakes up); and swap_cluster, the number of pages that kswapd will write in one round of paging.

The performance tweak, which is similar to the adjustment made to vm.bdflush, is to increase the number of pages that kswapd pages out at once on systems that page often by modifying the first and last parameters:

# sysctl -w vm.kswapd="1024 32 64"
vm.kswapd = 1024 32 64
Here I am specifying that kswapd search up to 1024 pages to be paged out, and that during one round of paging that kswapd can write out 64 pages. There is no hard and fast rule on modifying these parameters as their effect is very much dependent on disk speed. The best bet is to simply experiment until finding the right value for the server application.
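
Once you have settled on values that work for your workload, record them in /etc/sysctl.conf so they survive a reboot. For example, using the values tried above:

# Disk buffer and swap daemon tuning
vm.bdflush = 60 500 0 0 500 3000 80 20 0
vm.kswapd = 1024 32 64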

I suggest that you review /usr/src/linux/Documentation/sysctl/vm.txt for more information. Again, this documentation is generally dated to the 2.2 kernel, but the information is still mostly relevant.

Tweaking /proc/sys/net: Networking

Unlike the other two areas discussed, /proc/sys/net offers many more areas where you can tweak and tune your system's performance. Unfortunately, you can also break your system's compatibility with other computers on the Internet, so be sure to rigorously test changes. In this article, however, I will not discuss any changes that can affect compatibility, so these changes can be tested simply on the basis of their performance improvements.

When viewing /proc/sys/net, you will see several different directories:

# ls -l /proc/sys/net
total 0
dr-xr-xr-x 2 root root 0 Aug 14 10:55 802
dr-xr-xr-x 2 root root 0 Aug 14 10:55 core
dr-xr-xr-x 2 root root 0 Aug 14 10:55 ethernet
dr-xr-xr-x 5 root root 0 Aug 14 10:55 ipv4
dr-xr-xr-x 2 root root 0 Aug 14 10:55 token-ring
dr-xr-xr-x 2 root root 0 Aug 14 10:55 unix
In this article, I am only going to address net.core and net.ipv4.

net.core typically provides defaults for all networking components, especially in terms of memory usage and buffer allocation for send and receive buffers. On the other hand, net.ipv4 only has variables that affect the IPv4 stack, and many of the variables, but not all, will override net.core. When working with net.core and net.ipv4, you should concern yourself with three areas: new connections, established connections, and closing connections. When thinking along these lines, it is usually easy to determine which variables to tune.

An excellent example is how Linux handles half-open connections, that is, connections that have been initiated to the server but for which the three-way TCP handshake has not yet completed. You can see connections in this state by looking for SYN_RECV in the output of netstat:

# netstat -nt
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp        0      0 127.0.0.1:389 127.0.0.1:52994 TIME_WAIT
tcp        0      1 10.0.0.23:25  10.0.0.93:3432  SYN_RECV
When dealing with a heavily loaded service or with clients on high-latency or unreliable connections, the number of half-open connections will increase. Web server administrators are particularly aware of this issue because many Web clients are on dial-up, which tends to have high latency and where clients can sometimes disappear from the Internet entirely.
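
One quick, if crude, way to gauge how many half-open connections a server is carrying at a given moment is to count them:

# netstat -nt | grep -c SYN_RECV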

In Unix, half-open connections are placed in the incomplete (or backlog) connections queue, and under Linux, the size of this queue is specified by net.ipv4.tcp_max_syn_backlog. It's important to realize that each half-open connection consumes memory. Also, realize that a common Denial of Service attack, the syn-flood attack, is based on the knowledge that your server will no longer be able to serve new connection requests once an attacker has opened enough half-open connections.

If you are running a site that does need to handle a large number of half-open connections, then consider increasing this value:

# sysctl -w net.ipv4.tcp_max_syn_backlog="1024"
net.ipv4.tcp_max_syn_backlog = 1024
As a side note, many administrators also enable syn-cookies, which enable a server to handle new connections even when the incomplete connections queue is full (e.g., during a syn-flood attack):

# sysctl -w net.ipv4.tcp_syncookies="1"
net.ipv4.tcp_syncookies = 1
Unfortunately, when using syn-cookies, you will not be able to use advanced TCP features such as window scaling (discussed later).

Another important consideration when connections are being established is ensuring that your server has enough local ports to allocate to sockets for outgoing connections. When a server, such as an HTTP proxy, makes a large number of outgoing connections, it may run out of local ports. The range of local ports dedicated to outgoing connections is specified in net.ipv4.ip_local_port_range, and the default is to allocate ports 1024 to 4999 for this purpose:

# sysctl net.ipv4.ip_local_port_range
net.ipv4.ip_local_port_range = 1024 4999
To adjust these values, simply increase this range. A common change is to allow outgoing connections on local ports 32768 to 61000:

# sysctl -w net.ipv4.ip_local_port_range="32768 61000"
net.ipv4.ip_local_port_range = 32768 61000
Once a TCP session has been established, you need to think about how efficiently TCP/IP uses the available bandwidth. One of the most common ways to increase the utilization is to adjust the possible size of the TCP congestion window. The TCP congestion window is simply how many bytes of data the server will send over a connection before it requires an acknowledgement by the client on the other end of the connection. The larger the window, the more data is allowed on the wire at a time, and vice versa. This is a key point to understand because if you are serving clients on a high latency network (e.g., a WAN or the Internet), then your server is probably wasting a lot of network capacity while it waits for a client ACK.

The congestion window will start at a small size and increase over time as the server begins to trust the connection. The maximum size of the window is limited by the size of the send buffer because the server must be able to resend any data that is lost, and this data must be in the send buffer to be sent.

Adjusting the buffer size used by Linux is a matter of adjusting both net.core.wmem_max and net.ipv4.tcp_wmem. net.core.wmem_max specifies the maximum buffer size for the send queue for any protocol, including IPv4. net.ipv4.tcp_wmem, on the other hand, includes three parameters: the minimum size of a buffer regardless of how much stress is on the memory system, the default size of a buffer, and the maximum size of a buffer. The default size specified in net.ipv4.tcp_wmem will override net.core.wmem_default, so we can simply ignore net.core.wmem_default. However, net.core.wmem_max overrides the maximum buffer size specified in net.ipv4.tcp_wmem, so when changing net.ipv4.tcp_wmem, be sure that the maximum buffer size specified in net.core.wmem_max is as large or larger than the maximum buffer size specified by net.ipv4.tcp_wmem. Whew! Let's go over that with an example.

By default, Linux configures the minimum guaranteed buffer size to be 4K, the default buffer size as 16K, and the maximum buffer size as 128K:

# sysctl net.ipv4.tcp_wmem
net.ipv4.tcp_wmem = 4096 16384 131072
You can determine your optimal window by using the bandwidth-delay product, which will help you find a general range where you should begin experimenting with congestion window sizes:

window-size = bandwidth (bytes/sec) * round-trip time (seconds)
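
For example, on a hypothetical link that sustains roughly 384 KB/sec (about 3 Mbit/sec) with a round-trip time of 128 ms, the product works out to 384 KB/sec * 0.128 sec, or roughly 48K.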
Let's say that you determine the congestion window size should be 48K. You should then adjust the parameters to net.ipv4.tcp_wmem to reflect this size as the default size for the send buffer:

# sysctl -w net.ipv4.tcp_wmem="4096 49152 131072"
net.ipv4.tcp_wmem = 4096 49152 131072
That's all there is to it. Note, however, that historically the congestion window has been limited to 64K in size. RFC 1323 did away with this limit by introducing window scaling, which allows for much larger values. TCP window scaling is enabled by default on the 2.4 kernel, but, just to make sure, enable it when specifying a value of 64K or larger:

# sysctl -w net.ipv4.tcp_window_scaling="1"
net.ipv4.tcp_window_scaling = 1
# sysctl -w net.core.wmem_max="262144"
net.core.wmem_max = 262144
# sysctl -w net.ipv4.tcp_wmem="4096 131072 262144"
net.ipv4.tcp_wmem = 4096 131072 262144
Here I have increased the default buffer size to 128KB, and the maximum buffer size to double that number, or 256KB. Since the new default and maximum buffer sizes are larger than the original value in net.core.wmem_max, I must also adjust that value as it will override the maximum specified by net.ipv4.tcp_wmem.

Also investigate net.core.rmem_default, net.core.rmem_max, and net.ipv4.tcp_rmem, which control the size of the receive buffer. These can have a large impact, especially on client systems and on file servers.
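
For example, the receive-side counterpart of the send-buffer changes above might look like the following; the same caveat applies, in that net.core.rmem_max should be at least as large as the maximum specified by net.ipv4.tcp_rmem:

# sysctl -w net.core.rmem_max="262144"
net.core.rmem_max = 262144
# sysctl -w net.ipv4.tcp_rmem="4096 131072 262144"
net.ipv4.tcp_rmem = 4096 131072 262144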

The final consideration is closing connections. One problem that servers face, especially if clients may disappear or otherwise fail to close their connections, is that the server can end up with a large number of open but unused connections. TCP has a keepalive function that will begin probing the TCP connection after a given amount of inactivity. By default, Linux waits for 7200 seconds, or two hours:

# sysctl net.ipv4.tcp_keepalive_time
net.ipv4.tcp_keepalive_time = 7200
That's a long time, especially if serving a large number of clients that only require short-lived connections. Good examples of this are Web servers. The trick here is to reduce how long a quiet TCP connection is allowed to live by adjusting net.ipv4.tcp_keepalive_time to something perhaps along the lines of 30 minutes:

# sysctl -w net.ipv4.tcp_keepalive_time="1800"
net.ipv4.tcp_keepalive_time = 1800
You can also adjust how often the connection will be probed, and how long between each probe, before a forceful closing of the connection. But relative to the time specified by net.ipv4.tcp_keepalive_time, these values are low. If you're interested, review net.ipv4.tcp_keepalive_probes and net.ipv4.tcp_keepalive_intvl.
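
You can view both with a single sysctl invocation; on a stock 2.4 kernel, the defaults are typically 9 probes sent 75 seconds apart:

# sysctl net.ipv4.tcp_keepalive_probes net.ipv4.tcp_keepalive_intvl
net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_intvl = 75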

Conclusion

You can see dramatic improvements in performance if you know where to look. One of the most critical elements to consider when tuning the kernel and overall system performance is sysctl. The variables mentioned in this article will take you well on your way to understanding how sysctl affects your system, and I invite you to learn more by reading the documentation referenced. Keep in mind that there are many more tweaks available to you via sysctl.

Suggested additional readings are O'Reilly & Associates' System Performance Tuning, 2nd Edition and Understanding the Linux Kernel, 2nd Edition.

Dustin Puryear is a consultant providing expertise in managing and tuning Unix systems, services, and applications, with a strong focus on open source, and is author of Integrate Linux Solutions into Your Windows Network. He can be contacted at: dustin@puryear-it.com.