              Linux Kernel Tuning Using System Control
 Dustin Puryear
              Some of the most notable performance improvements for Linux can 
              be accomplished via system control (sysctl) in /proc/sys. Unlike 
              most other areas of /proc under Linux, sysctl variables are typically 
              writable, and are used to adjust the running kernel rather than 
              simply monitor currently running processes and system information. 
              In this article, I'll walk you through several areas of sysctl 
              that can result in large performance improvements. While certainly 
              not a definitive work, this article should provide the foundation 
              needed for further research and experimentation with Linux sysctl.
              Note that I wrote this article with the 2.4 kernel in mind. Some 
              variables may exist in earlier kernels but not in 2.4, or vice versa.
              Working with the sysctl Interface
              The sysctl interface allows administrators to modify variables 
              that the kernel uses to determine behavior. There are two ways to 
              work with sysctl: by directly reading and modifying files in /proc/sys 
              and by using the sysctl program supplied with most, if not all, 
              distributions. Most documentation on sysctl accesses variables via 
              the /proc/sys file system, and does so using cat for viewing 
              and echo for changing variables, as shown in the following 
              example where IP forwarding is enabled:
              
             
# cat /proc/sys/net/ipv4/ip_forward
0
# echo "1" > /proc/sys/net/ipv4/ip_forward
# cat /proc/sys/net/ipv4/ip_forward
1
This is an easy way to work with sysctl. An alternative is to use 
            the sysctl program, which provides a convenient interface to the same 
            variables. With the sysctl program, you name a variable by its path 
            relative to /proc/sys, with dots in place of slashes. For example, to 
            view /proc/sys/net/ipv4/ip_forward, use the following command:  
             
# sysctl net.ipv4.ip_forward
net.ipv4.ip_forward = 1
To then update this variable, use the -w (write) option:  
             
# sysctl -w net.ipv4.ip_forward="0"
net.ipv4.ip_forward = 0
In this example, I have simply undone what was accomplished earlier 
            when using cat and echo.
              Deciding which to use is often a matter of preference, but sysctl 
              does have the benefit of being supported via the /etc/sysctl.conf 
              configuration file, which is read during system startup. After experimenting 
              with variables that increase the performance or reliability of the 
              system, you should enter and document these variables in /etc/sysctl.conf:
              
             
# Controls IP packet forwarding
net.ipv4.ip_forward = 0
# Controls source route verification
net.ipv4.conf.default.rp_filter = 1
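Before loading a configuration file, it can help to list only the settings it contains. The following is a minimal sketch rather than a feature of the sysctl program: the function name list_sysctl_conf is hypothetical, and the sketch assumes the simple "name = value" layout shown above, with comment lines starting with # or ;.

```shell
#!/bin/sh
# list_sysctl_conf FILE   (hypothetical helper, not part of sysctl)
# Print each setting in a sysctl.conf-style file as "name=value".
# Blank lines and comment lines starting with # or ; are skipped.
list_sysctl_conf() {
    sed -e 's/^[[:space:]]*//' \
        -e 's/[[:space:]]*$//' \
        -e '/^$/d' \
        -e '/^[#;]/d' \
        -e 's/[[:space:]]*=[[:space:]]*/=/' "$1"
}
```

Running list_sysctl_conf /etc/sysctl.conf on the file above would print the two settings without their comments, which makes it easier to review what is about to be applied.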
You can also tell the sysctl program to reread /etc/sysctl.conf 
            with the -p option:  
             
# sysctl -p
net.ipv4.ip_forward = 0
net.ipv4.conf.default.rp_filter = 1
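The two ways of naming a variable map onto each other mechanically. The sketch below illustrates that mapping; the function names sysctl_to_path and path_to_sysctl are hypothetical, and the sketch assumes that no single component of the name contains a dot of its own (an interface name with a dot in it, for example, would be converted incorrectly).

```shell
#!/bin/sh
# sysctl_to_path NAME   (hypothetical helper)
# Print the /proc/sys path for a dotted sysctl variable name.
sysctl_to_path() {
    printf '/proc/sys/%s\n' "$(printf '%s' "$1" | tr . /)"
}

# path_to_sysctl PATH   (hypothetical helper)
# Print the dotted sysctl name for a path beneath /proc/sys.
path_to_sysctl() {
    printf '%s\n' "${1#/proc/sys/}" | tr / .
}
```

For example, sysctl_to_path net.ipv4.ip_forward prints /proc/sys/net/ipv4/ip_forward, and path_to_sysctl reverses the conversion.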
In this article, I will typically be using the sysctl program syntax 
            for accessing sysctl variables (i.e., I will use net.ipv4.ip_forward 
            rather than /proc/sys/net/ipv4/ip_forward).
              Getting to Work
              sysctl exposes several important elements of the kernel beneath 
              /proc/sys, and I will be focusing on /proc/sys/fs, /proc/sys/vm, 
              and /proc/sys/net, which are used to tune file system, virtual memory 
              and disk buffers, and network code, respectively. Of course, there 
              is a lot more available in sysctl than what can be covered here, 
              so use this article as a stepping stone toward learning more about 
              sysctl.
              Tweaking /proc/sys/fs: File Systems
              The /proc/sys/fs interface exposes several interesting variables, 
              but only a few will directly affect the performance or utilization 
              of your system. For most workstations or lightly loaded servers, 
              you can typically leave everything as is, but as your system offers 
              more services and opens more files, begin monitoring fs.file-nr:
              
             
# sysctl fs.file-nr
fs.file-nr = 7343 2043 8192
The fs.file-nr variable displays three parameters: total allocated 
            file handles, currently used file handles, and maximum file handles 
            that can be allocated. The Linux kernel dynamically allocates file 
            handles whenever a file handle is requested by an application, but 
            it does not free these handles when they are released by the application. 
            Instead, the file handles are recycled. This means that over time 
            you will see the total allocated file handles increase as the server 
            reaches new peaks of file handle use, even though the number of in-use 
            file handles may be low. If you are running a server that opens a 
            large number of files, such as a news or file server, then you should 
            pay close attention to these parameters when tuning the system.
              Adjusting the maximum file handles that Linux will allocate is 
              only a matter of updating fs.file-max:
              
             
# sysctl -w fs.file-max="32768"
fs.file-max = 32768
# sysctl fs.file-nr
fs.file-nr = 7343 2043 32768
Here I have quadrupled the maximum number of file handles that may 
            be allocated, noting that the peak usage is currently topping out 
            at 7,343 file handles. The server has only 2,043 file handles currently 
            in use.
              In 2.2 kernels, you would also need to worry about setting a similar 
              variable for inodes via fs.inode-max, but as of the 2.4 kernel, 
              this is no longer necessary, and indeed this variable is no longer 
              available under /proc/sys/fs. You can, however, still view information 
              on inode usage via fs.inode-state. There are several other variables 
              that can be used in /proc/sys/fs, but the 2.4 kernel defaults for 
              most other variables are quite sufficient for all but the most extreme 
              cases.
              Learn more about /proc/sys/fs in /usr/src/linux/Documentation/sysctl/fs.txt. 
              The information is generally dated to the 2.2 kernel, but there 
              are some excellent nuggets of information in the document.
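              One simple way to monitor fs.file-nr is to compare the first value (allocated handles) against the third (the maximum). The following is a minimal sketch of that comparison; the function name file_handle_pct is hypothetical, and it uses integer arithmetic, so the result is rounded down.

```shell
#!/bin/sh
# file_handle_pct "ALLOCATED SECOND MAX"   (hypothetical helper)
# Print allocated file handles as a whole-number percentage of the maximum.
# Only the first and third fields of fs.file-nr are used.
file_handle_pct() {
    # Deliberately unquoted: split the three fields into $1 $2 $3.
    set -- $1
    echo $(( $1 * 100 / $3 ))
}
```

With the values shown earlier, file_handle_pct "7343 2043 8192" prints 89, and after raising fs.file-max to 32768 the same allocation is only 22 percent of the limit. On a live system the argument could be supplied with "$(cat /proc/sys/fs/file-nr)".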
              Tweaking /proc/sys/vm: Virtual Memory
              There are two variables under /proc/sys/vm that you will find 
              very useful in tweaking how the disk buffers and the Linux VM work 
              with your disks and file systems. The first, vm.bdflush, allows 
              you to adjust how the kernel will flush dirty buffers to disk. Disk 
              buffers are used by the kernel to cache data stored on disks, which 
              are very slow compared to RAM. Whenever enough buffers become 
              dirty (i.e., their contents have been changed so that they differ from 
              what is on the disk), the kernel daemon bdflush will flush them to 
              disk.
              When viewing vm.bdflush you will see several parameters:
              
             
# sysctl vm.bdflush
vm.bdflush = 30 500 0 0 500 3000 60 20 0
Some of the parameters are dummy values. For now, pay attention to 
            the first, second, and seventh parameters (nfract, ndirty, and nfract_sync, 
            respectively). nfract specifies the percentage of the buffer cache 
            that may be dirty before bdflush begins queuing buffers to be written to 
            disk. ndirty specifies the maximum number of buffers that bdflush will flush 
            at once. Finally, nfract_sync is similar to nfract, but once the percentage 
            specified by nfract_sync is reached, a write is forced rather than 
            queued.
              Adjusting vm.bdflush is something of an art because you need to 
              extensively test the effect on your server and target applications. 
              If the server has an intelligent controller and disk, then decreasing 
              the total number of flushes (which will in turn cause each flush 
              that is done to take a bit longer) may increase overall performance. 
              However, with a slower disk, the system may end up spending more 
              time waiting for the flush to finish. For this tweak, you need to 
              test, test, and then test some more.
              The default for nfract is 30%, and it's 60% for nfract_sync. 
              When increasing nfract, make sure the new value stays below 
              nfract_sync:
              
             
# sysctl -w vm.bdflush="60 500 0 0 500 3000 80 20 0"
vm.bdflush = 60 500 0 0 500 3000 80 20 0
Here, nfract is being set to 60% and nfract_sync to 80%.
              The ndirty parameter simply specifies how much bdflush will write 
              to disk at any one time. The larger this value, the longer it could 
              potentially take bdflush to complete its updates to disk.
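              Because vm.bdflush is written as one string of nine values, it is easy to disturb a field you did not mean to change. The sketch below builds a new string from the current one, replacing only the first field (nfract) and the seventh (nfract_sync). The function name bdflush_with_nfract is hypothetical, and refusing any nfract that is not lower than nfract_sync is an assumption of this sketch.

```shell
#!/bin/sh
# bdflush_with_nfract "CURRENT" NFRACT NFRACT_SYNC   (hypothetical helper)
# Print CURRENT with field 1 (nfract) and field 7 (nfract_sync) replaced.
# Assumption of this sketch: nfract must be lower than nfract_sync.
bdflush_with_nfract() {
    new_nfract=$2
    new_sync=$3
    if [ "$new_nfract" -ge "$new_sync" ]; then
        echo "nfract ($new_nfract) must be lower than nfract_sync ($new_sync)" >&2
        return 1
    fi
    # Deliberately unquoted: split the current value into $1..$9.
    set -- $1
    if [ "$#" -ne 9 ]; then
        echo "expected 9 bdflush fields, got $#" >&2
        return 1
    fi
    echo "$new_nfract $2 $3 $4 $5 $6 $new_sync $8 $9"
}
```

Given the default string shown earlier, bdflush_with_nfract "30 500 0 0 500 3000 60 20 0" 60 80 prints 60 500 0 0 500 3000 80 20 0, which can then be passed to sysctl -w.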
              You can also tune how many pages of memory are paged out by the 
              kernel swap daemon, kswapd, when memory is needed using vm.kswapd:
              
             
# sysctl vm.kswapd
vm.kswapd = 512 32 8
The vm.kswapd variable has three parameters: tries_base, the maximum 
            number of pages that kswapd tries to free in one round; tries_min, 
            the minimum pages that kswapd will free when writing to disk (in other 
            words, kswapd will try to at least get some work done when it wakes 
            up); and swap_cluster, the number of pages that kswapd will write 
            in one round of paging.
              The performance tweak, which is similar to the adjustment made 
              to vm.bdflush, is to increase the number of pages that kswapd pages 
              out at once on systems that page often by modifying the first and 
              last parameters:
              
             
# sysctl -w vm.kswapd="1024 32 64"
vm.kswapd = 1024 32 64
Here I am specifying that kswapd search up to 1024 pages to be paged 
            out, and that during one round of paging kswapd can write out 
            64 pages. There is no hard and fast rule on modifying these parameters, 
            as their effect is very much dependent on disk speed. The best bet 
            is to simply experiment until you find the right value for the server 
            application.
              I suggest that you review /usr/src/linux/Documentation/sysctl/vm.txt 
              for more information. Again, this documentation is generally dated 
              to the 2.2 kernel, but the information is still mostly relevant.
              Tweaking /proc/sys/net: Networking
              Unlike the other two areas discussed, /proc/sys/net offers many 
              more areas where you can tweak and tune your system's performance. 
              Unfortunately, you can also break your system's compatibility 
              with other computers on the Internet, so be sure to rigorously test 
              changes. In this article, however, I will not discuss any changes that can 
              affect compatibility, so these changes can be tested simply 
              on the basis of their performance improvements.
              When viewing /proc/sys/net, you will see several different directories:
              
             
# ls -l /proc/sys/net
total 0
dr-xr-xr-x 2 root root 0 Aug 14 10:55 802
dr-xr-xr-x 2 root root 0 Aug 14 10:55 core
dr-xr-xr-x 2 root root 0 Aug 14 10:55 ethernet
dr-xr-xr-x 5 root root 0 Aug 14 10:55 ipv4
dr-xr-xr-x 2 root root 0 Aug 14 10:55 token-ring
dr-xr-xr-x 2 root root 0 Aug 14 10:55 unix
In this article, I am only going to address net.core and net.ipv4.
              net.core typically provides defaults for all networking components, 
              especially in terms of memory usage and buffer allocation for send 
              and receive buffers. On the other hand, net.ipv4 only has variables 
              that affect the IPv4 stack, and many of the variables, but not all, 
              will override net.core. When working with net.core and net.ipv4, 
              you should concern yourself with three areas: new connections, established 
              connections, and closing connections. When thinking along these 
              lines, it is usually easy to determine which variables to tune.
              An excellent example is how Linux handles half-open connections, 
              that is, connections that have been initiated to the server 
              but for which the three-way TCP handshake has not completed. You can 
              see connections that are in this state by looking for SYN_RECV in 
              the output of netstat:
              
             
# netstat -nt
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp        0      0 127.0.0.1:389 127.0.0.1:52994 TIME_WAIT
tcp        0      1 10.0.0.23:25  10.0.0.93:3432  SYN_RECV
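To watch how this number changes under load, the matching lines can be counted rather than read by eye. The following is a minimal sketch; the function name count_syn_recv is hypothetical, and it assumes the netstat -nt layout shown above, where the state is the last column of each tcp line.

```shell
#!/bin/sh
# count_syn_recv   (hypothetical helper)
# Read `netstat -nt` output on stdin and print how many TCP connections
# are in the SYN_RECV state. Assumes the state is the last column.
count_syn_recv() {
    awk '$1 ~ /^tcp/ && $NF == "SYN_RECV" { n++ } END { print n + 0 }'
}
```

A typical invocation would be netstat -nt | count_syn_recv. For the sample output above, the result is 1.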
When dealing with a heavily loaded service or with clients on high 
            latency or bad connections, the rate of half-open connections is going 
            to increase. Web server administrators are particularly aware of this 
            issue because a lot of Web site clients are on dial-up. Dial-up tends 
            to have high latency, and clients can sometimes disappear entirely 
            from the Internet.
              In Unix, half-open connections are placed in the incomplete (or 
              backlog) connections queue, and under Linux, the amount of space 
              available in this queue is specified by net.ipv4.tcp_max_syn_backlog. 
              It's important to realize that each half-open connection consumes 
              memory. Also, realize that a common Denial of Service attack, the 
              syn-flood attack, is based on the knowledge that your server will 
              no longer be able to serve new connection requests if an attacker 
              opens enough half-open connections.
              If you are running a site that does need to handle a large number 
              of half-open connections, then consider increasing this value:
              
             
# sysctl -w net.ipv4.tcp_max_syn_backlog="1024"
net.ipv4.tcp_max_syn_backlog = 1024
As a side note, many administrators also enable syn-cookies, which 
            enable a server to handle new connections even when the incomplete 
            connections queue is full (e.g., during a syn-flood attack):  
             
# sysctl -w net.ipv4.tcp_syncookies="1"
net.ipv4.tcp_syncookies = 1
Unfortunately, when using syn-cookies, you will not be able to use 
            advanced TCP features such as window scaling (discussed later).
              Another important consideration when connections are being established 
              is ensuring your server has enough local ports to allocate to sockets 
              for outgoing connections. When a server, such as an HTTP proxy, has 
              a large number of outgoing connections, the server may run out of 
              local ports. The range of local ports dedicated to outgoing connections 
              is specified in net.ipv4.ip_local_port_range, and the default is 
              to allocate ports 1024 to 4999 for this purpose:
              
             
# sysctl net.ipv4.ip_local_port_range
net.ipv4.ip_local_port_range = 1024 4999
To adjust these values, simply increase this range. A common change 
            is to allow outgoing connections on local ports 32768 to 61000:  
             
# sysctl -w net.ipv4.ip_local_port_range="32768 61000"
net.ipv4.ip_local_port_range = 32768 61000
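The effect of this change is easier to judge as a count of available ports. The sketch below computes the size of a range given in the same "low high" form that sysctl prints; the function name port_range_size is hypothetical.

```shell
#!/bin/sh
# port_range_size "LOW HIGH"   (hypothetical helper)
# Print the number of ports in an inclusive local port range.
port_range_size() {
    # Deliberately unquoted: split the range into $1 (low) and $2 (high).
    set -- $1
    echo $(( $2 - $1 + 1 ))
}
```

The default range of 1024 to 4999 yields 3976 ports, while 32768 to 61000 yields 28233, roughly seven times as many.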
Once a TCP session has been established, you need to think about how 
            efficiently TCP/IP uses the available bandwidth. One of the most common 
            ways to increase the utilization is to adjust the possible size of 
            the TCP congestion window. The TCP congestion window is simply how 
            many bytes of data the server will send over a connection before it 
            requires an acknowledgement by the client on the other end of the 
            connection. The larger the window, the more data is allowed on the 
            wire at a time, and vice versa. This is a key point to understand 
            because if you are serving clients on a high latency network (e.g., 
            a WAN or the Internet), then your server is probably wasting a lot 
            of network capacity while it waits for a client ACK.
              The congestion window will start at a small size and increase 
              over time as the server begins to trust the connection. The maximum 
              size of the window is limited by the size of the send buffer because 
              the server must be able to resend any data that is lost, and this 
              data must be in the send buffer to be sent.
              Adjusting the buffer size used by Linux is a matter of adjusting 
              both net.core.wmem_max and net.ipv4.tcp_wmem. net.core.wmem_max 
              specifies the maximum buffer size for the send queue for any protocol, 
              including IPv4. net.ipv4.tcp_wmem, on the other hand, includes three 
              parameters: the minimum size of a buffer regardless of how much 
              stress is on the memory system, the default size of a buffer, and 
              the maximum size of a buffer. The default size specified in net.ipv4.tcp_wmem 
              will override net.core.wmem_default, so we can simply ignore net.core.wmem_default. 
              However, net.core.wmem_max overrides the maximum buffer size specified 
              in net.ipv4.tcp_wmem, so when changing net.ipv4.tcp_wmem, be sure 
              that the maximum buffer size specified in net.core.wmem_max is at 
              least as large as the maximum buffer size specified by net.ipv4.tcp_wmem. 
              Whew! Let's go over that with an example.
              By default, Linux configures the minimum guaranteed buffer size 
              to be 4K, the default buffer size as 16K, and the maximum buffer 
              size as 128K:
              
             
# sysctl net.ipv4.tcp_wmem
net.ipv4.tcp_wmem = 4096 16384 131072
You can determine your optimal window by using the bandwidth-delay 
            product, which will help you find a general range where you should 
            begin experimenting with congestion window sizes:  
             
window-size = bandwidth (bytes/sec) * round-trip time (seconds)
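As a worked example of this formula, the sketch below takes the link speed in bits per second and the round-trip time in milliseconds, since those are the units in which the two figures are usually quoted. The function name bdp_bytes and the link figures in the usage note are hypothetical, and the shell's integer arithmetic rounds the result down.

```shell
#!/bin/sh
# bdp_bytes BITS_PER_SEC RTT_MS   (hypothetical helper)
# Print the bandwidth-delay product in bytes:
#   (bits/sec / 8) * (milliseconds / 1000) = bits/sec * milliseconds / 8000
bdp_bytes() {
    echo $(( $1 * $2 / 8000 ))
}
```

For a 1.544 Mbit/s link with a 250 ms round-trip time, bdp_bytes 1544000 250 prints 48250, or a little over 47K. Very fast links multiplied by long delays can exceed what a shell with 32-bit arithmetic can represent, so treat this as a rough calculator only.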
Let's say that you determine the congestion window size should 
            be 48K. You should then adjust the parameters to net.ipv4.tcp_wmem 
            to reflect this size as the default size for the send buffer:  
             
# sysctl -w net.ipv4.tcp_wmem="4096 49152 131072"
net.ipv4.tcp_wmem = 4096 49152 131072
That's all there is to it. Note, however, that historically the 
            TCP window has been limited to 64K in size because the window field 
            in the TCP header is only 16 bits wide. RFC 1323 did away 
            with this limit by introducing window scaling, which allows for 
            larger values. TCP window scaling is enabled by default on the 2.4 
            kernel, but, just to make sure, enable it when specifying a value 
            at 64K or larger:  
             
# sysctl -w net.ipv4.tcp_window_scaling="1"
net.ipv4.tcp_window_scaling = 1
# sysctl -w net.core.wmem_max="262144"
net.core.wmem_max = 262144
# sysctl -w net.ipv4.tcp_wmem="4096 131072 262144"
net.ipv4.tcp_wmem = 4096 131072 262144
Here I have increased the default buffer size to 128KB, and the maximum 
            buffer size to double that number, or 256KB. Since the new default 
            and maximum buffer sizes are larger than the original value in net.core.wmem_max, 
            I must also adjust that value as it will override the maximum specified 
            by net.ipv4.tcp_wmem.
              Also investigate net.core.rmem_default, net.core.rmem_max, and 
              net.ipv4.tcp_rmem, which are variables used to control the size 
              of the receive buffer. These can have a large impact, especially on 
              client systems and file servers.
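              The ordering rule for the send buffer described above, that net.core.wmem_max should be at least as large as the third value of net.ipv4.tcp_wmem, is easy to check before applying a change. The following is a minimal sketch of such a check; the function name wmem_max_ok is hypothetical.

```shell
#!/bin/sh
# wmem_max_ok "TCP_WMEM" WMEM_MAX   (hypothetical helper)
# Succeed (exit status 0) when WMEM_MAX is at least as large as the third
# (maximum) value of a "min default max" tcp_wmem string; fail otherwise.
wmem_max_ok() {
    core_max=$2
    # Deliberately unquoted: split tcp_wmem into $1 (min) $2 (default) $3 (max).
    set -- $1
    [ "$core_max" -ge "$3" ]
}
```

For instance, wmem_max_ok "4096 131072 262144" 262144 succeeds, while the same tcp_wmem string checked against a smaller net.core.wmem_max fails, signaling that net.core.wmem_max needs to be raised first. The same check could be applied to the receive-side variables.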
              The final consideration is closing connections. One problem 
              that servers will face, especially if clients may disappear or otherwise 
              not close connections, is that the server will have a large number 
              of open but unused connections. TCP has a keepalive function that 
              will begin probing the TCP connection after a given amount of inactivity. 
              By default Linux will wait for 7200 seconds, or two hours:
              
             
# sysctl net.ipv4.tcp_keepalive_time
net.ipv4.tcp_keepalive_time = 7200
That's a long time, especially if serving a large number of clients 
            that only require short-lived connections. Good examples of this are 
            Web servers. The trick here is to reduce how long a quiet TCP connection 
            is allowed to live by adjusting net.ipv4.tcp_keepalive_time to something 
            perhaps along the lines of 30 minutes:  
             
# sysctl -w net.ipv4.tcp_keepalive_time="1800"
net.ipv4.tcp_keepalive_time = 1800
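The idle time is only part of how long a dead connection lingers, because the kernel then sends a number of probes (net.ipv4.tcp_keepalive_probes) spaced a fixed interval apart (net.ipv4.tcp_keepalive_intvl) before giving up. The sketch below approximates the total as the idle time plus probes multiplied by interval. The function name keepalive_detect_secs is hypothetical, and the probe count of 9 and interval of 75 seconds used in the usage note are the commonly documented Linux defaults, so check the values on your own system.

```shell
#!/bin/sh
# keepalive_detect_secs IDLE_SECS PROBES INTERVAL_SECS   (hypothetical helper)
# Print the approximate number of seconds before an unresponsive peer
# is dropped: idle time plus one interval per unanswered probe.
keepalive_detect_secs() {
    echo $(( $1 + $2 * $3 ))
}
```

With a two-hour idle time, keepalive_detect_secs 7200 9 75 prints 7875; after lowering the idle time to 30 minutes, keepalive_detect_secs 1800 9 75 prints 2475, or a little over 41 minutes.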
You can also adjust how many times the connection will be probed, and how 
            long to wait between probes, before the connection is forcibly closed. 
            But relative to the time specified by net.ipv4.tcp_keepalive_time, 
            these values are low. If you're interested, review net.ipv4.tcp_keepalive_probes 
            and net.ipv4.tcp_keepalive_intvl.
              Conclusion
              You can see dramatic improvements in performance if you know where 
              to look. One of the most critical elements to consider when tuning 
              the kernel and overall system performance is sysctl. The variables 
              mentioned in this article will take you well on your way to understanding 
              how sysctl affects your system, and I invite you to learn more by 
              reading the documentation referenced. Keep in mind that there are 
              many more tweaks available to you via sysctl.
              Suggested additional readings are O'Reilly & Associates' 
              System Performance Tuning, 2nd Edition and Understanding 
              the Linux Kernel, 2nd Edition.
              Dustin Puryear is a consultant providing expertise in managing 
              and tuning Unix systems, services, and applications, with a strong 
              focus on open source, and is author of Integrate Linux Solutions 
              into Your Windows Network. He can be contacted at: [email protected].