Measuring and Improving Memory Efficiency of Large Applications

Greg Nakhimovsky

Many of today's computer applications require large amounts of system memory. This is especially true of very large and complex applications that provide hundreds of functions and handle large amounts of data.

At the same time, CPU speeds have increased faster than memory access speeds, so the gap between them is now very wide. This makes memory efficiency increasingly important. This article describes how a large application uses system memory and what you can do to monitor and improve its memory efficiency. It presents and discusses special tools for these tasks. This information and these tools can help systems administrators, software developers, and users who work with large applications, particularly under the Solaris operating system.

In this article, I will consider an example of using PTC Pro/ENGINEER (a major mechanical CAD/CAM system) under SPARC/Solaris from Sun Microsystems. See http://www.ptc.com for more information about PTC and Pro/ENGINEER, and http://www.sun.com/technical-computing/ISV/PTCFaq.html for a technical FAQ regarding PTC applications on Sun. Note that Pro/ENGINEER is just a convenient example of using these techniques. You can use them just as easily with any large application that requires a lot of memory.

Memory efficiency is a very large subject, impossible to address comprehensively in a single article. Because of that, this article only touches on some issues while describing a few specific tools in detail.

Memory Access Speed

Modern computers have a hierarchy of memory types. A very small portion of memory called Level-1 or L1 cache (also called CPU-internal cache) provides very fast data access. A larger portion called L2 cache (also known as external cache) provides somewhat slower memory access but still much faster than that of general RAM without using a cache. The faster the memory type, the more expensive and less practical it is to use. This is one of the fundamental tradeoffs of computer architecture.

Here are typical orders of magnitude of memory access times and sizes of the major memory components. These values are generic, and they change rapidly as computers become faster, but their ratios stay relatively constant:

Component     Latency (nanoseconds)     Size (kilobytes)
L1-cache                          3                   32
L2-cache                         30                4,096
RAM                             300              500,000
Disk                     30,000,000           10,000,000

As you can see, memory efficiency can be hugely different depending on how much the application uses the faster memory components and avoids the slower ones.
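
To get a feel for how strongly cache hit rates affect the average access time, the latencies from the table above can be combined in a back-of-the-envelope calculation. The following Python sketch is only an illustration: the function name and the hit rates are made-up assumptions, and the model deliberately ignores paging to disk and any overlap between memory levels.

```python
# Rough latencies in nanoseconds, taken from the table above.
L1_NS = 3
L2_NS = 30
RAM_NS = 300


def average_access_ns(l1_hit_rate, l2_hit_rate):
    """Estimate the average memory access time in nanoseconds.

    l1_hit_rate is the fraction of accesses served by the L1 cache.
    l2_hit_rate is the fraction of L1 misses served by the L2 cache.
    Both values are hypothetical inputs, not measurements.
    """
    l1_miss_rate = 1.0 - l1_hit_rate
    l2_miss_rate = 1.0 - l2_hit_rate
    return (l1_hit_rate * L1_NS
            + l1_miss_rate * (l2_hit_rate * L2_NS + l2_miss_rate * RAM_NS))


# A cache-friendly workload versus one with nearly random access.
print(average_access_ns(0.95, 0.90))
print(average_access_ns(0.50, 0.50))
```

With these made-up hit rates, the first case averages roughly 6 ns per access and the second 84 ns, a difference of more than an order of magnitude before any disk access is involved.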

The first practical conclusion from this data is that disk access as a substitute for memory access should be avoided in any performance-sensitive situation. High Performance Computing, Second Edition, by Kevin Dowd and Charles Severance (O'Reilly and Associates) is an excellent further resource regarding memory access speed and related issues.

Pro/ENGINEER Use of Memory

Unlike older applications that explicitly use disks for some of their storage requirements, Pro/ENGINEER keeps all model data directly in memory. It assumes that the system has enough memory for all needs. This approach relies on the OS virtual memory (VM) system. All modern operating systems contain demand-paged VM systems. The main advantage of such a system is that memory available to applications is not limited to random access memory (RAM); disk swap space can also be used. Neither the application nor its users have to do anything special to use it: the OS handles it automatically.

The main disadvantage of using the VM system is that, if memory requirements significantly exceed the available RAM, system performance will degrade. As application memory requirements start exceeding the amount of physical memory available in the system, paging to disk will begin. This causes rapid performance degradation because, as you saw in the previous section, disk access is many orders of magnitude slower than memory access of any kind.

Eventually, if the application requires much more memory than the available physical memory, so-called "thrashing" may occur, which makes the system practically impossible to use. For detailed descriptions of the Solaris VM system and other system-related information, see Sun Performance and Tuning: Java and the Internet by Adrian Cockcroft and Richard Pettit (Prentice Hall) and Solaris Internals: Architecture, Tips and Techniques, Volume 1: Core Kernel by Richard McDougall and Jim Mauro (to be published by Prentice Hall). McDougall and Mauro's columns are also available online at SunWorld (http://www.sunworld.com).

Pro/ENGINEER is a large application with the following features (from the memory efficiency perspective):

The use of the malloc() interface to obtain all required memory from the system. Most applications use this method, although other methods exist (mmap-based for example). C++ operator "new" also belongs to this category since most implementations of "new" call malloc() internally.
Highly dynamic memory usage. In other words, the amount of memory a Pro/ENGINEER session requires is highly dependent upon the size of the model and the operations performed on it.
Large amounts of memory required for large models. Pro/ENGINEER memory consumption can vary from about 50 megabytes to a number of gigabytes, up to the total amount of swap space (RAM plus disk swap) available in the system.
Very large and complex dynamically allocated data structures, many of which are scattered over memory. One reason for this is that Pro/ENGINEER data structures represent 3-D objects, while the operating system virtual address space model is linear (1-D). This makes it necessary to map the 3-D data to the one-dimensional virtual address space. Such a mapping introduces gaps between addresses of data items that may logically be closely related.
Very complex memory access patterns, in some cases approaching random access. There are some objective reasons for this. For example, certain algorithms used in mechanical CAD systems (Hidden Line Removal is one such algorithm) require examining a data item (say, Z-coordinate) for all existing entities, such as triangles tessellating the surfaces. The data structures containing the needed data items can be stored in memory far from each other. This can easily cause poor "locality of reference."
Extensive use of function pointers, causing frequent address jumps in referencing the program code. This is typical for modern object-oriented applications. For example, C++ virtual functions are usually implemented with function pointers.

Taken together, these features mean that the CPU caches described in the previous section are not very effective with Pro/ENGINEER. A cache is only useful when many data items or program instructions can be accessed directly from the cache. When data access is almost random with large gaps between addresses, caches cannot help much. In this case, performance will often depend on raw memory latency, that is, the time it takes to access RAM without the benefit of a cache.

SPARC/Solaris allows a full 4-GB virtual address space for 32-bit applications, and practically unlimited virtual address space size for 64-bit applications. See details at:

http://www.sun.com/technical-computing/ISV/PTCFaq.html#MORETHAN2G

Current Sun workstations (e.g., Ultra-80) can hold up to 4 GB of RAM, thus making large memory requirements practical. Future workstation models will be capable of holding even more RAM.

One example is the PTC Division MockUp application, which is supported in the 64-bit mode on Sun. It can handle huge assemblies by taking full advantage of the large amounts of RAM and virtual address space in the system.

Uniprocessor and Multiprocessor Systems

Currently, the most popular multiprocessing model is Symmetric Multi-Processing (SMP). All Sun workstations and servers use it. Briefly, SMP means that CPUs installed in the computer have equal status; all of them can equally execute both application and kernel code. Every CPU has its own hardware cache. Any CPU can access any data that the applications use. Such access can be performed in parallel. When the application modifies data, the MP hardware ensures that all CPUs see the same data values. This feature is called "cache coherency".

To take advantage of the multiple CPUs in the same machine, you can run multiple applications at the same time. In this case, the kernel will automatically distribute the load among CPUs. Alternatively, a single application can create multiple threads running in parallel, thus taking advantage of multiple CPUs.

Pro/ENGINEER is partially multithreaded, which means that certain operations can be performed in parallel when multiple CPUs are available. A brief description of the MP/MT features of Pro/ENGINEER is available at:

http://www.sun.com/technical-computing/ISV/PTCFaq.html#MP

On Sun systems, Pro/ENGINEER is statically linked with a special malloc() package allowing faster memory allocation when multiple threads manage their memory simultaneously. Sun Microsystems owns a patent for this technology ("Memory Allocation in a Multithreaded Environment", by Greg Nakhimovsky, http://www.uspto.gov).

Measuring Memory Used by Application

Knowing how much memory an application session has consumed can be useful in many ways. It can help you determine whether adding more RAM will help performance, or whether the amount of available physical memory can be decreased without a negative effect on performance. It can be useful for workload management tasks required for distributed computing. You can also use this information to detect abnormal situations, for example, when the application is consuming too much memory.

This task is generally not trivial since today's operating systems, including Solaris, are very complex. The naive methods frequently used for this purpose do not work well. Examples include the ps(1), swap(1M), and vmstat(1M) commands. For various reasons, none of those commands reports the total memory consumption of a particular application. For example, the ps(1) SZ (size) field reports the amount of virtual memory, but not the actual memory consumed. The RSS (resident set size) field includes the memory occupied by the shared libraries, which many processes can use simultaneously.

It is not impossible, however. Solaris has a very useful pmap(1) command based on the proc(4) interface. Here is a description of a couple of tools based on pmap(1) technology, which you can use with most applications, not just Pro/ENGINEER.

Readers who are not inclined to programming can skip all the details given here, download the tools, and simply use them. These tools, with the exception of pmap(1), are not officially supported by Sun Microsystems. They are informal example programs, and anyone is welcome to use or modify them. They also demonstrate a few useful programming techniques.

The first tool is a shared library interposer called mem_on_exit.so. Listing 1 contains its source code. To build this interposer, use the following command:

cc -o mem_on_exit.so -G -Kpic mem_on_exit.c

To use it (from a Pro/ENGINEER startup script, for example) do the following (we are using the C-shell syntax in this example):

setenv LD_PRELOAD /full_path/mem_on_exit.so
[ Run Pro/ENGINEER as usual ]
unsetenv LD_PRELOAD

Shared library interposers are programs capable of intercepting calls the application makes to any shared library. Once such a call is intercepted, the interposer can do whatever you need, and then call the real function originally intended by the application.

Library interposers are very useful for all kinds of debugging, testing, and collection of runtime data statistics. They can even be used to fix bugs by modifying the behavior of the interposed function. They can do all this without rebuilding the application in any way.

In this case, the library interposes on the exit(2) call, which most applications (including Pro/ENGINEER) invoke at the end of the run to return to the operating system. First, the library determines the name of the executable that made the call, using the Solaris proc(4) interface. (Note: the version of the proc(4) interface shown here works with Solaris 2.6 and later.) The main Pro/ENGINEER executable is called "pro". If the current executable is called "pro", the library calls system(3S) to invoke a Perl script called measure_proe_mem.pl. After the script finishes, the library calls the real exit(2) routine.

I will assume that the directory in which the measure_proe_mem.pl script (described later) is installed is on the shell's $PATH. Alternatively, you can put the full path to measure_proe_mem.pl in the system() call.

When the application uses malloc() to dynamically allocate memory from the system, the amount of memory that the application consumes cannot decrease while the application is running. The malloc() package contains a memory management system. When the application calls free(), the freed memory is not returned to the operating system but saved for future use by the same process. (Actually, malloc() can be written to return memory to the system, but most malloc() implementations, including Sun's, do not do it that way because it would be hard to do and unnecessary in most cases.)

Therefore, to estimate the amount of memory that the application consumes, it is enough to measure it only once, immediately before the application exits. This will provide the "high-water mark" value. The actual measurement and calculations are performed by the Perl script measure_proe_mem.pl that the library interposer invokes. Listing 2 contains its source.

Perl is a part of Solaris 8 and above, where it is automatically installed in /bin. For the earlier Solaris releases, you can download Perl from a number of locations, including:

http://sunfreeware.com

If Perl is not in /bin on your system, make sure to modify the first line in the script to point to your Perl executable.

The measure_proe_mem.pl script runs the ps(1) command to find all the running processes related to Pro/ENGINEER. It assumes that any process with an executable name containing characters "pro" or "appmgr" qualifies. (These patterns can be easily changed if necessary.)

For each process with a name matching the specified pattern, the script runs the Solaris pmap(1) command with parameter "-x" producing a detailed memory map for the process. After parsing the pmap(1) output, the script adds the amounts of private memory that each process has used, and selects the maximum amount of shared memory among the processes. We do not want to add the shared memory many times since many processes share it. We assume that the Pro/ENGINEER-related processes share most of the same libraries. The resulting value (Max shared + Total private) is a good approximation for the amount of memory this Pro/ENGINEER session has consumed. You can download both source files from the Web:

ftp://ftp.sunmde.com/pub/gregns/mem_on_exit.c
ftp://ftp.sunmde.com/pub/gregns/measure_proe_mem.pl

Typically, these tools are used together. Either individual users of Pro/ENGINEER can run them directly, or a startup script of some kind can do it. In the latter case, a systems administrator can collect various statistics regarding memory consumption. You can also use the measure_proe_mem.pl script on its own. If you execute it at any time while the application is running, it will output the results at that time.

Here is an example output from the measure_proe_mem.pl script executed while Pro/ENGINEER is running:

% measure_proe_mem.pl -v
5045 /export/home/proe2000i2/sun4_solaris/obj/pro:
 virtual_kb = 378512; shared_kb = 4232; private_kb = 179512
5046 /export/home/proe2000i2/sun4_solaris/nms/nmsd:
 virtual_kb = 2832; shared_kb = 1512; private_kb = 1016
5048 /export/home/proe2000i2/sun4_solaris/obj/pro_comm_msg:
 virtual_kb = 4848; shared_kb = 1496; private_kb = 1552
5060 /export/home/proe2000i2/sun4_solaris/obj/pglclock:
 virtual_kb = 157592; shared_kb = 4224; private_kb = 6936
Total memory consumed by all Pro/ENGINEER-related processes:
Total virtual address space = 531 MB
Max shared  = 4 MB
Total private = 185 MB
Max shared + Total private = 189 MB
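
The totals at the bottom of this output follow from the per-process lines: private memory is summed across processes, while shared memory is counted only once by taking the largest value. The Python sketch below reproduces that arithmetic from the numbers shown above. It is not the Perl script from Listing 2; the function names are illustrative, and the rounding convention (divide by 1024 and round to the nearest megabyte) is an assumption that happens to match the totals printed above.

```python
# Per-process values in kilobytes, copied from the sample output above:
# (virtual_kb, shared_kb, private_kb)
processes = {
    "pro": (378512, 4232, 179512),
    "nmsd": (2832, 1512, 1016),
    "pro_comm_msg": (4848, 1496, 1552),
    "pglclock": (157592, 4224, 6936),
}


def summarize(procs):
    """Return (total virtual, max shared, total private, estimate) in KB."""
    total_virtual = sum(v for v, _, _ in procs.values())
    # Shared pages are counted once: take the largest value, not the sum.
    max_shared = max(s for _, s, _ in procs.values())
    total_private = sum(p for _, _, p in procs.values())
    return total_virtual, max_shared, total_private, max_shared + total_private


def to_mb(kb):
    """Convert kilobytes to megabytes, rounded to the nearest integer."""
    return round(kb / 1024)


labels = ("Total virtual address space", "Max shared",
          "Total private", "Max shared + Total private")
for label, kb in zip(labels, summarize(processes)):
    print(f"{label} = {to_mb(kb)} MB")
```

Summing the shared column instead of taking its maximum would count the same library pages once per process and overstate the session's footprint.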

You can easily modify these tools to work with applications other than Pro/ENGINEER. All you will have to do is change the names (or name patterns) of the executables, the name of the Perl script, and the output messages.

Measuring Paging to Disk

You can measure paging to disk with vmstat(1M). This technique (among others) is described in:

http://www.sun.com/technical-computing/ISV/PTCFaq.html#PERFORMANCE

Look at the sr (scan rate) column in the vmstat output. When the numbers in that column are consistently zero or less than 200 pages per second, there is no significant paging to disk occurring and the amount of your physical memory is sufficient for the current session. If the scan rate is consistently high, application performance will improve if you add more RAM.

A Pro/ENGINEER startup script can start vmstat(1M) in the background, capture the sr column output, and calculate some meaningful statistics. The vmstat(1M) process can be terminated, for example, by a signal sent to it when the Pro/ENGINEER session ends. Developing such a script is left as an exercise for the reader.
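
As a starting point for such a script, the statistics step might look like the following Python sketch. It assumes the vmstat output has already been captured as text, and it locates the sr column from the header line rather than hard-coding its position. The sample text, the function names, and the choice of statistics are all made up for illustration; only the 200 pages per second threshold comes from the discussion above.

```python
SCAN_RATE_LIMIT = 200  # pages per second, the threshold used in the text

# Made-up sample of captured vmstat output, for illustration only.
SAMPLE = """\
 procs     memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr s0 s1 s2 s3   in   sy   cs us sy id
 0 0 0 512000 64000   0   5  0  0  0  0   0  0  0  0  0  120  300  150  5  2 93
 0 0 0 498000 12000   0  40 80 60 90  0 350  4  0  0  0  400  900  500 30 20 50
 0 0 0 497000 11000   0  35 70 50 80  0 250  3  0  0  0  380  850  480 28 18 54
 0 0 0 510000 60000   0   4  0  0  0  0   0  0  0  0  0  130  310  160  6  2 92
"""


def scan_rates(vmstat_text):
    """Extract the values of the sr column from captured vmstat output."""
    sr_index = None
    rates = []
    for line in vmstat_text.splitlines():
        fields = line.split()
        if "sr" in fields:
            # Column header line: remember where the sr column is.
            sr_index = fields.index("sr")
        elif (sr_index is not None and len(fields) > sr_index
              and all(f.isdigit() for f in fields)):
            # Data line: every field is a number.
            rates.append(int(fields[sr_index]))
    return rates


def summarize_scan_rates(rates, limit=SCAN_RATE_LIMIT):
    """Return (mean, maximum, fraction of samples at or above the limit)."""
    if not rates:
        return 0.0, 0, 0.0
    high = sum(1 for r in rates if r >= limit)
    return sum(rates) / len(rates), max(rates), high / len(rates)


mean, peak, fraction = summarize_scan_rates(scan_rates(SAMPLE))
print(f"mean sr = {mean:.0f}, max sr = {peak}, "
      f"samples at or above {SCAN_RATE_LIMIT}: {fraction:.0%}")
```

The first data line that vmstat prints summarizes activity since boot, so a real script would probably discard it before computing the statistics.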

As an alternative, you can watch for the disk activity reported for the swap device (assuming you use swap partitions rather than files). One way to do this is to run the iostat(1M) command. Any significant input/output (I/O) in a swap partition is a sure sign of memory shortage.

I also recommend installing the xcpustate utility and using it to graphically monitor what your system is doing. It is a public-domain utility based on the X Window System and available for many UNIX platforms. You can download it from:

ftp://ftp.cs.toronto.edu/pub/jdd/xcpustate

The SPARC/Solaris binary that I use (which is rather old) is available here:

ftp://ftp.sunmde.com/pub/gregns/xcpustate

To use it, simply make sure the xcpustate file is executable:

chmod +x xcpustate

and then start it in the background:

xcpustate &

If you would like to watch the I/O state in addition to the CPU state (which is a good idea, especially for the swap devices), start it with a -disk parameter:

xcpustate -disk &

The resulting display will show the state of each CPU and disk (if -disk is specified). It uses the following colors for the display:

Green       User time 
Yellow      System time 
Blue        Wait/idle

The xcpustate display is updated each second.

There are other graphical utilities to monitor system performance, but I like xcpustate the most for its convenience and light weight.

Measuring CPU Cache Usage

Solaris 8 has the cpustat(1M) and cputrack(1) utilities, which can help you measure various CPU statistics. Specifically, you can measure the number of external cache hits and misses. Here is an example of how you can use cputrack(1):

% cputrack -fev -c EC_ref,EC_hit <command>

EC_ref refers to the total number of external cache references, while EC_hit corresponds to the total number of external cache hits. The difference between the two values will give you the number of external cache misses. The external cache miss rate can be computed as:

(1 - EC_hit/EC_ref)*100%
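
As a small worked example of this formula, the following Python sketch computes the miss rate from a pair of counter totals. The function name and the counter values are hypothetical; real values would come from the cputrack output.

```python
def miss_rate_percent(references, hits):
    """Return the cache miss rate in percent: (1 - hits/references) * 100."""
    if references <= 0:
        raise ValueError("the number of cache references must be positive")
    return (1.0 - hits / references) * 100.0


# Hypothetical EC_ref and EC_hit totals, not measured values.
ec_ref = 2_000_000
ec_hit = 1_700_000
print(f"External cache miss rate: {miss_rate_percent(ec_ref, ec_hit):.1f}%")
```

The same calculation applies to any pair of reference and hit counters.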

Similarly, to watch the instruction cache references and hits, you can use this syntax:

% cputrack -fev -c IC_ref,IC_hit <command>

You can also concatenate multiple -c options to cputrack or cpustat. That will make the tool cycle between the specified events. The above examples are for UltraSPARC-I and UltraSPARC-II. See UltraSPARC User's Manuals (http://www.sun.com/microelectronics/manuals/index.html) for detailed information about the UltraSPARC counters.

Frederic Pariente of Sun Microsystems/France has developed an interesting utility called Hardware Activity Reporter (HAR), which computes many useful UltraSPARC CPU statistics such as L1-cache miss rate and data stall rate. See Hardware Performance Counters, Hardware Statistics Tool: http://www.sunmde.com/perf_tools/har/.

Improving Memory Efficiency

Using the tools described in the previous sections, measure how much memory your application uses and how much paging to disk is occurring.
Collect statistics about memory consumption and paging to disk over a period of time. Configure your system with the amounts of RAM and disk swap space best suited to your needs.
Use the cputrack or HAR tool to measure the CPU cache miss and data stall rates during specific application operations. Provide feedback to application developers, so that they can improve their algorithms and data structures to take better advantage of CPU caches. Improving locality of reference and thus increasing the number of cache hits can dramatically enhance application performance.
Use the application features that generate many cache misses only when necessary.

Summary

Now that you have some memory measurement tools, you can apply them in your own environment to configure the system's hardware and provide useful feedback to the application developers. Since memory efficiency is such a large issue, its further discussion should be useful to everyone involved with large applications.

Acknowledgments

I would like to thank my Sun Microsystems colleagues Tom Gould, Morgan Herrington, Peter Nurkse, and Pramod Rustagi for their advice and encouragement.

Greg Nakhimovsky is a member of the technical staff at Sun Microsystems. He works with independent software vendors making sure that their applications run well on Sun systems. He has 20 years of industry experience developing, performance tuning, and supporting technical computer applications on various computer systems.