Measuring
and Improving Memory Efficiency of Large Applications
Greg Nakhimovsky
Many of today's computer applications require large
amounts of system memory. This is especially true with very large
and complex applications that provide hundreds of functionalities
and handle large amounts of data.
At the same time, computer CPU speeds have increased faster than
memory access speeds, so the gap between them is now very wide.
This makes memory efficiency issues increasingly more important.
This article describes how a large application uses system memory
and what you can do to monitor and improve its memory efficiency.
It presents and discusses special tools for these tasks. This information
and tools can help systems administrators, software developers,
and users who are working with large applications, particularly
under the Solaris operating system. In this article, I will consider
an example of using PTC Pro/ENGINEER (a major Mechanical CAD/CAM
system) under SPARC/Solaris from Sun Microsystems. See http://www.ptc.com
for more information about PTC and Pro/ENGINEER, and http://www.sun.com/
\
tecnical-computing/ISV/PTCFaq.html for a technical FAQ regarding
PTC applications on Sun. Note that Pro/ENGINEER is just a convenient
example of using these techniques. You can use them just as easily
with any large application that requires a lot of memory.
Memory efficiency is a very large subject, impossible to address
comprehensively in a single article. Because of that, this article
only touches on some issues while describing a few specific tools
in detail.
Memory Access Speed
Modern computers have a hierarchy of memory types. A very small
portion of memory called Level-1 or L1 cache (also called CPU-internal
cache) provides very fast data access. A larger portion called L2
cache (also known as external cache) provides somewhat slower memory
access but still much faster than that of general RAM without using
a cache. The faster the memory type, the more expensive and less
practical it is to use. This is one of the fundamental tradeoffs
of computer architecture.
Here are typical orders of magnitude of memory access times and
sizes of the major memory components. These values are generic and
they change rapidly as computers become faster but their ratios
stay relatively constant:
Latency (nanoseconds) Size (kilobytes)
L1-cache 3 32
L2-cache 30 4096
RAM 300 500,000
Disk 30,000,000 10,000,000
As you can see, memory efficiency can be hugely different depending
on how much the application uses the faster memory components and
avoids the slower ones.
The first practical conclusion from this data is that disk access
as a substitute for memory access should be avoided in any performance-sensitive
situation. High Performance Computing, Second Edition, by
Kevin Dowd and Charles Severance (O'Reilly and Associates)
is an excellent further resource regarding memory access speed and
related issues.
Pro/ENGINEER Use of Memory
Unlike older applications that explicitly use disks for some of
their storage requirements, Pro/ENGINEER keeps all model data directly
in memory. It assumes that the system has enough memory for all
needs. This approach relies on the OS virtual memory (VM) system.
All modern operating systems contain page-demand VM systems. The
main advantage of such a system is that memory available to applications
is not limited to random access memory (RAM); disk swap space can
also be used. The application or the users do not have to do anything
special about using it: the OS handles it automatically.
The main disadvantage of using the VM system is that, if memory
requirements significantly exceed the available RAM, system performance
will degrade. As application memory requirements start exceeding
the amount of physical memory available in the system, paging to
disk will begin. This causes rapid performance degradation because,
as you saw in the previous section, disk access is many orders of
magnitude slower than memory access of any kind.
Eventually, if the application requires much more memory than
the available physical memory, so-called "thrashing" may
occur, which makes the system practically impossible to use. For
detailed descriptions of the Solaris VM system and other system-related
information, see Sun Performance and Tuning: Java and the Internet
by Adrian Cockcroft and Richard Pettit (Prentice Hall) and Solaris
Internals: Architecture, Tips and Techniques, Volume 1: Core Kernel
by Richard McDougall and Jim Mauro (to be published by
Prentice Hall). McDougall and Mauro's columns are also available
online at SunWorld (http://www.sunworld.com).
Pro/ENGINEER is a large application with the following features
(from the memory efficiency perspective):
- The use of the malloc() interface to obtain all required
memory from the system. Most applications use this method, although
other methods exist (mmap-based for example). C++ operator
"new" also belongs to this category since most implementations
of "new" call malloc() internally.
- Highly dynamic memory usage. In other words, the amount of
memory a Pro/ENGINEER session requires is highly dependent upon
the size of the model and the operations performed on it.
- Large amounts of memory required for large models. Pro/ENGINEER
memory consumption can vary from about 50 megabytes to a number
of gigabytes, up to the total amount of swap space (RAM plus disk
swap) available in the system.
- Very large and complex dynamically allocated data structures,
many of which are scattered over memory. One reason for this is
that Pro/ENGINEER data structures represent 3-D objects, while
the operating system virtual address space model is linear (1-D).
This makes it necessary to map the 3-D data to the one-dimensional
virtual address space. Such a mapping introduces gaps between
addresses of data items that may logically be closely related.
- Very complex memory access patterns, in some cases approaching
random access. There are some objective reasons for this. For
example, certain algorithms used in mechanical CAD systems (Hidden
Line Removal is one such algorithm) require examining a data item
(say, Z-coordinate) for all existing entities, such as triangles
tessellating the surfaces. The data structures containing the
needed data items can be stored in memory far from each other.
This can easily cause poor "locality of reference."
- Extensive use of function pointers, causing frequent address
jumps in referencing the program code. This is typical for modern
object-oriented applications. For example, C++ virtual functions
are usually implemented with function pointers.
Taken together, these features mean that the CPU caches described
in the previous section are not very effective with Pro/ENGINEER.
A cache is only useful when many data items or program instructions
can be accessed directly from the cache. When data access is almost
random with large gaps between addresses, caches cannot help much.
In this case, performance will often depend on raw memory latency,
that is the time it takes to access RAM without the benefit of a
cache.
SPARC/Solaris allows a full 4-GB virtual address space for 32-bit
applications, and practically unlimited virtual address space size
for 64-bit applications. See details at:
http://www.sun.com/technical-computing/ISV/PTCFaq.html#MORETHAN2G
Current Sun workstations (e.g., Ultra-80) can hold up to 4 GB of RAM,
thus making large memory requirements practical. Future workstation
models will be capable of holding even more RAM.
One example is the PTC Division MockUp application, which is supported
in the 64-bit mode on Sun. It can handle huge assemblies by taking
full advantage of the large amounts of RAM and virtual address space
in the system.
Uniprocessor and Multiprocessor Systems
Currently, the most popular multiprocessing model is Symmetric
Multi-Processing (SMP). All Sun workstations and servers use it.
Briefly, SMP means that CPUs installed in the computer have equal
status; all of them can equally execute both application and kernel
code. Every CPU has its own hardware cache. Any CPU can access any
data that the applications use. Such access can be performed in
parallel. When the application modifies data, the MP hardware ensures
that all CPUs see the same data values. This feature is called "cache
coherency".
To take advantage of the multiple CPUs in the same machine, you
can run multiple applications at the same time. In this case, the
kernel will automatically distribute the load among CPUs. Alternatively,
a single application can create multiple threads running in parallel,
thus taking advantage of multiple CPUs.
Pro/ENGINEER is partially multithreaded, which means that certain
operations can be performed in parallel when multiple CPUs are available.
A brief description of the MP/MT features of Pro/ENGINEER is available
at:
http://www.sun.com/technical-computing/ISV/PTCFaq.html#MP
On Sun systems, Pro/ENGINEER is statically linked with a special malloc()
package allowing faster memory allocation when multiple threads manage
their memory simultaneously. Sun Microsystems owns a patent for this
technology ("Memory Allocation in a Multithreaded Environment",
by Greg Nakhimovsky, http://www.uspto.gov).
Measuring Memory Used by Application
Knowing how much memory an application session has consumed can
be useful in many ways. It can help you determine whether adding
more RAM will help performance, or the amount of available physical
memory can be decreased without a negative effect on performance.
It can be useful for workload management tasks required for distributed
computing. You can also use this information to detect abnormal
situations, for example, when the application is consuming too much
memory.
This task is generally not trivial since today's operating
systems, including Solaris, are very complex. The naive methods
frequently used for this purpose do not work well. Examples include
ps(1), swap(1), and vmstat(1) commands. For
various reasons, none of those commands report the total memory
consumption of a particular application. For example, the ps(1)
SZ (size) field will report the amount of virtual memory,
but not the actual memory consumed. The RSS (resident set size)
field will include the memory occupied by the shared libraries,
which many processes can use simultaneously.
It is not impossible however. Solaris has a very useful pmap(1)
command based on the proc(4) interface. Here is a description
of a couple of tools based on pmap(1) technology, which you
can use with most applications, not just Pro/ENGINEER.
The readers who are not programmatically inclined can skip all
the details given here, download the tools, and simply use them.
These tools, with the exception of pmap(1), are not officially
supported by Sun Microsystems. They are informal example programs
and anyone is welcome to use them or modify them. They also demonstrate
a few useful programming techniques.
The first tool is a shared library interposer called mem_on_exit.so.
Listing 1 contains its source code. To build this interposer, use
the following command:
cc -o mem_on_exit.so -G -Kpic mem_on_exit.c
To use it (from a Pro/ENGINEER startup script, for example) do the
following (we are using the C-shell syntax in this example):
setenv LD_PRELOAD /full_path/mem_on_exit.so
[ Run Pro/ENGINEER as usual ]
unsetenv LD_PRELOAD
Shared library interposers are programs capable of intercepting calls
the application makes to any shared library. Once such a call is intercepted,
the interposer can do whatever you need, and then call the real function
originally intended by the application.
Library interposers are very useful for all kinds of debugging,
testing, and collection of runtime data statistics. They can even
be used to fix bugs by modifying the behavior of the interposed
function. They can do all this without rebuilding the application
in any way.
In this case, the library can interpose on system call exit(2),
which most applications (including Pro/ENGINEER) invoke at the end
of the run to exit to the operating system. First, it can determine
the name of the executable that made the call. The library can use
Solaris proc(4) interface for it. (Note: the version of the
proc(4) interface shown here works with Solaris 2.6 and later.)
Pro/ENGINEER main executable is called "pro". So if the
current executable is called "pro", the library can call
system(3S) invoking a Perl script called measure_proe_mem.pl.
After that is finished, call the real system exit(2) routine.
I will assume that the directory where script measure_proe_mem.pl
(described later) is installed in is on the shell $PATH.
Alternatively, you can put the full path to measure_proe_mem.pl
in the system() call.
When the application uses malloc() to dynamically allocate
memory from the system, the amount of memory that the application
consumes cannot decrease while the application is running. The malloc()
package contains a memory management system. When the application
calls free(), the freed memory is not returned to the operating
system but saved for future use by the same process. (Actually,
malloc() can be written to return memory to the system, but
most malloc() implementations, including Sun's, do not
do it that way because it would be hard to do and unnecessary in
most cases.)
Therefore, to estimate the amount of memory that the application
consumes, it is enough to measure it only once, immediately before
the application exits. This will provide the "high-water mark"
value. The actual measurement and calculations are performed by
the Perl script measure_proe_mem.pl that the library interposer
invokes. Listing 2 contains its source.
Perl is a part of Solaris 8 and above, where it is automatically
installed in /bin. For the earlier Solaris releases, you
can download Perl from a number of locations, including:
http://sunfreeware.com
If Perl is not in /bin on your system, make sure to modify
the first line in the script to point to your Perl executable.
The measure_proe_mem.pl script runs the ps(1) command
to find all the running processes related to Pro/ENGINEER. It assumes
that any process with an executable name containing characters "pro"
or "appmgr" qualifies. (These patterns can be easily changed
if necessary.)
For each process with a name matching the specified pattern, the
script runs the Solaris pmap(1) command with parameter "-x"
producing a detailed memory map for the process. After parsing the
pmap(1) output, the script adds the amounts of private memory
that each process has used, and selects the maximum amount of shared
memory among the processes. We do not want to add the shared memory
many times since many processes share it. We assume that the Pro/ENGINEER-related
processes share most of the same libraries. The resulting value
(Max shared + Total private) is a good approximation for the amount
of memory this Pro/ENGINEER session has consumed. You can download
both source files from the Web:
ftp://ftp.sunmde.com/pub/gregns/mem_on_exit.c
ftp://ftp.sunmde.com/pub/gregns/measure_proe_mem.pl
Typically, these tools are used together. Either individual users
of Pro/ENGINEER can run them directly, or a startup script of some
kind can do it. In the latter case, a systems administrator can collect
various statistics regarding memory consumption. You can also use
the measure_proe_mem.pl script on its own. If you execute it
at any time while the application is running, it will output the results
at that time.
Here is an example output from the measure_proe_mem.pl script
executed while Pro/ENGINEER is running:
% measure_proe_mem.pl -v
5045 /export/home/proe2000i2/sun4_solaris/obj/pro:
virtual_kb = 378512; shared_kb = 4232; private_kb = 179512
5046 /export/home/proe2000i2/sun4_solaris/nms/nmsd:
virtual_kb = 2832; shared_kb = 1512; private_kb = 1016
5048 /export/home/proe2000i2/sun4_solaris/obj/pro_comm_msg:
virtual_kb = 4848; shared_kb = 1496; private_kb = 1552
5060 /export/home/proe2000i2/sun4_solaris/obj/pglclock:
virtual_kb = 157592; shared_kb = 4224; private_kb = 6936
Total memory consumed by all Pro/ENGINEER-related processes:
Total virtual address space = 531 MB
Max shared = 4 MB
Total private = 185 MB
Max shared + Total private = 189 MB
You can easily modify these tools to work with applications other
than Pro/ENGINEER. All you will have to do is change the names (or
name patterns) of the executables, the name of the Perl script, and
the output messages.
Measuring Paging to Disk
You can measure paging to disk with vmstat(1). This technique
(among others) is described in:
http://www.sun.com/technical-computing/ISV/PTCFaq.html#PERFORMANCE
Look at the sr (scan rate) column in the vmstat output.
When the numbers in that column are consistently zero or less than
200 pages per second, there is no significant paging to disk occurring
and the amount of your physical memory is sufficient for the current
session. If the scan rate is consistently high, application performance
will improve if you add more RAM.
A Pro/ENGINEER startup script can start vmstat(1) in the
background, capture the sr column output, and calculate some
meaningful statistics. The vmstat(1) process can be terminated,
for example, by a signal sent to it when the Pro/ENGINEER session
ends. Developing such a script is left as an exercise for the reader.
As an alternative, you can watch for the disk activity reported
for the swap device (assuming you use swap partitions rather than
files). One way to do this is to run the iostat(2) command.
Any significant input/output (I/O) in a swap partition is a sure
sign of memory shortage.
I also recommend installing the xcpustate utility and using
it to graphically monitor what your system is doing. It is a public-domain
X-Windows based utility available for many UNIX platform. You can
download it from:
ftp://ftp.cs.toronto.edu/pub/jdd/xcpustate
The SPARC/Solaris binary that I use (which is rather old) is available
here:
ftp://ftp.sunmde.com/pub/gregns/xcpustate
To use it, simply make sure the xcpustate file is executable:
chmod +x xcpustate
and then start it putting it into background:
xcpustate &
If you would like to watch the I/O state in addition to the CPU state
(which is a good idea, especially for the swap devices), start it
with a -disk parameter:
xcpustate -disk &
The resulting display will show the state of each CPU and disk (if
-disk is specified). It uses the following colors for the display:
Green User time
Yellow System time
Blue Wait/idle
The xcpustate display is updated each second.
There are other graphical utilities to monitor system performance,
but I like xcpustate the most for its convenience and light
weight.
Measuring CPU Cache Usage
Solaris 8 has cpustat(1) and cputrack(1) utilities,
which can help you measure various CPU statistics. Specifically,
you can measure the number of external cache hits and misses. Here
is an example of how you can use cputrack(1):
% cputrack -fev -c EC_ref,EC_hit <command>
EC_ref refers to the total number of external cache references,
while EC_hit corresponds to the total number of external cache
hits. The difference between the two values will give you the number
of external cache misses. The external cache miss rate can be computed
as:
(1 - EC_hit/EC_ref)*100%
Similarly, to watch the instruction cache references and hits, you
can use this syntax:
% cputrack -fev -c IC_ref,IC_hit <command>
You can also concatenate multiple -c options to cputrack
or cpustat. That will make the tool cycle between the specified
events. The above examples are for UltraSPARC-I and UltraSPARC-II.
See UltraSPARC User's Manuals (http://www.sun.com/microelectronics/manuals/index.html)
for detailed information about the UltraSPARC counters.
Frederic Pariente of Sun Microsystems/France has developed an
interesting utility called Hardware Activity Reporter (HAR), which
computes many useful UltraSPARC CPU statistics such as L1-cache
miss rate and data stall rate. See Hardware Performance Counters,
Hardware Statistics Tool: http://www.sunmde.com/perf_tools/har/.
Improving Memory Efficiency
- Using the tools described in the previous sections, measure
how much memory your application uses and how much paging to disk
is occurring.
- Collect statistics about the memory consumption and paging
to disk over a period of time. Configure your system with the
optimal amount of RAM and disk swap amount best suited to your
needs.
- Use the cputrack or HAR tool to measure the CPU cache
miss and data stall rates during specific application operations.
Provide feedback to application developers, such that they can
improve their algorithms and data structures to take better advantage
of CPU caches. Improving locality of reference and thus increasing
the number of cache hits can dramatically enhance application
performance.
- Use the application features that generate many cache misses
as little as necessary.
Summary
Now that you have some memory measurement tools, you can apply
them in your own environment to configure the system's hardware
and provide useful feedback to the application developers. Since
memory efficiency is such a large issue, its further discussion
should be useful to everyone involved with large applications.
Acknowledgments
I would like to thank my Sun Microsystems colleagues Tom Gould,
Morgan Herrington, Peter Nurkse, and Pramod Rustagi for their advice
and encouragement.
Greg Nakhimovsky is a member of the technical staff at Sun
Microsystems. He works with independent software vendors making
sure that their applications run well on Sun systems. He has 20
years of industry experience developing, performance tuning, and
supporting technical computer applications on various computer systems.
|