Tuning
NNM Servers
Thirumalainambi Murugesh
Hewlett-Packard's (HP) OpenView (OV) Network Node Manager
(NNM) is a powerful platform for enterprise-wide IP network management
based on SNMP standards. NNM performs auto-discovery of TCP/IP networks,
displays network topologies, correlates and manages events and SNMP
traps for root-cause analysis, monitors network health and alerts
based on configured thresholds, and collects performance data. NNM
manages heterogeneous switched layer 2 LAN environments as well
as routed layer 3 WAN environments. It always helps to have more
RAM, faster disks, and faster CPUs when facing system bottleneck
issues; however, not everyone is ready to spend money for new hardware.
Instead, administrators want to get more out of what they already
have. In this article, I will share some techniques for getting
more out of NNM.
Analysis Tools
When the system behaves slowly, system performance analysis tools
such as glance, top, and perfview can be used to analyze the performance.
With these tools, you may see that the processes taking most of
the CPU and memory are NNM processes, including ovrequestd, snmpCollect,
ovcoltosql, and ovdbrun. You can then examine how those processes are
behaving in detail by choosing the process ID and following its
threads (see Figure 1).
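If glance or perfview is not available on a given system, a rough
but quick alternative is a ps one-liner such as the following (on
HP-UX the XPG4 -o options require the UNIX95 environment variable,
and the process-name pattern is only an example):
UNIX95=1 ps -e -o pcpu,vsz,pid,args | \
    grep -E 'netmon|snmpCollect|ovcoltosql|ovdbrun|ovrequestd' | \
    sort -rn | head -10
This lists the heaviest CPU consumers among the OpenView-related
processes, with their virtual memory sizes and PIDs.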
Looking at the OV processes in detail with /opt/OV/bin/ovstatus
-v gives the following stats for netmon and snmpCollect. You
can get more detail in netmon.trace by issuing netmon -a12
or netmon -a16 to dump netmon and SNMP polling statistics:
object manager name: netmon
state: RUNNING
PID: 19976
last message: Initialization complete.
exit status: -
additional info:
09:28:30 Polling 2717 interfaces, 14143 polls/hour. 0 overdue
polls, current maximum 0 seconds behind. Worse was 71 polls a
maximum of 33 seconds behind at 02/25/03 02:03:04. 74% interfaces
available. 477 Name Service requests, average 3.2 msec/lookup.
object manager name: snmpCollect
state: RUNNING
PID: 25535
last message: Data Collector has been busy for 10896 \
seconds (Behind on Polling)
exit status: -
additional info:
09:29:17 Collecting on 130 nodes, 24562 total instances.
Stored:1149617, Thresh:308, Rearm:294, Sent SNMP PDUs:127771,
Recvd Instances:4724716, SNMP retries:70870 since Fri Feb 21
01:46:19.2 2003 (4.32 days). Collecting 618 variables/minute via
19 PDUs/minute. Maximum collection delay 30920 seconds at
02/21/03 10:28:06. 81 collection checks in progress
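To refresh the netmon statistics on demand, you can trigger the
trace dump mentioned above; a minimal sketch follows (the location
of netmon.trace is an assumption, so check the log directory on
your installation):
/opt/OV/bin/netmon -a12     # dump netmon polling statistics
/opt/OV/bin/netmon -a16     # dump SNMP polling statistics
tail -50 /var/opt/OV/share/log/netmon.trace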
You can plot netmon and SNMP polling performance from the OpenView Internet
map menu [Performance]->[Network Polling Statistics] and see whether
the graph of seconds until the next status/SNMP poll has gone
negative, which indicates a problem and requires tuning (see
Figure 2).
You should also check the ovcoltosql process. If that process runs
for a long time, it indicates issues with the size of the SNMP data
collection and the Solid database. In this case, check the configured NNM
reports and the solid.db size located under /var/opt/OV/share/databases/analysis/default.
The maximum size of a solid.db is 2 GB. If it is nearing that
maximum, the data needs to be trimmed and then exported and re-imported
to resize the Solid database.
If NNM is struggling with name resolution, it will send two alarms,
"OV_NS_PerformWarn" and "OV_NS_PerformErr", and
if it is struggling with SNMP data collection, it will send an "OV_DataColl_Busy"
alarm. If there are any errors with NNM data warehousing, it will
send an "OV_dataWareHouseMaintError" alarm to the alarm
browser. These alarms give a clear indication of issues with NNM
and mean that tuning is required.
You can run /opt/OV/bin/ovtopodump -l and compare the number
of nodes you need to manage with the number of currently managed nodes.
The list of currently managed nodes can be printed using /opt/OV/bin/ovtopodump
-RISC. If the difference is high, clean up unwanted nodes to
reduce system loading.
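To turn this comparison into two quick numbers, something like the
following works (the grep -v "/" filter, also used later in this
article, drops network and segment entries so that only nodes are
counted):
/opt/OV/bin/ovtopodump -l                          # summary with object counts
/opt/OV/bin/ovtopodump -RISC | grep -v "/" | wc -l # count of managed nodes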
Factors Affecting NNM Performance and Possible Workarounds
1. Physical characteristics of NNM server and its network
2. Nodes management and filters
3. NNM polling
4. Events and Event Correlation System (ECS)
5. Name resolution issues
6. Data collection
7. NNM daemons (ovwdb, ovrequestd, ovtopmd)
8. Number of running ovw sessions
9. NNM data warehouse issues
10. Performance enhancing scripts
Physical Characteristics of NNM Server and Network
Because NNM processes are memory and swap intensive, it is always
better to have more RAM, faster physical disks, faster CPUs, and high-bandwidth
lines. You can monitor the system's performance in terms of
swap, memory usage, CPU usage, disk I/O, and run-queue load using swapinfo,
dmesg, and sar with their various flag options. In cases
where it is not possible to upgrade to better hardware due to
budget limitations, you can try to tune the system kernel parameters
(e.g., maxusers, process priority, MAXTSIZ, maxswapchunks, and max_thread_proc)
and keep swap partition disks separate from file system disks. The tuned
32-bit kernel parameters for better performance should meet or exceed
the values given in Table 1. For a complete overview of HP-UX 11.x
kernel parameters, refer to:
http://docs.hp.com/hpux/onlinedocs/os/KCparams.OverviewAll.html
You might try to adjust the process priority either by using nice
before starting a program or by using renice with the process
IDs of OpenView processes. Note that the process priority is
given through the nice number from 0 to 39, with a default value
of 20 for every process. A nice value of 0 is the highest priority,
and a value of 39 is the lowest. You can check the current nice
number of any program by running the following command and looking
at the value under the NI column:
ps -efl | cut -c 1-37,85-110
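For example, a hedged sketch of lowering the priority of a busy but
less critical OpenView process (the PID and nice value below are
illustrative; check renice(1M) for the exact option syntax on your
release):
ps -efl | cut -c 1-37,85-110 | grep snmpCollect   # note the PID and NI value
renice -n 24 -p 25535                             # lower snmpCollect's priority (PID is illustrative)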
Nodes Management and Filters
If you don't control the number of nodes managed by NNM by
stopping the auto-discovery option, it will pick up all reachable nodes
whether or not you are interested in managing them, and will fill
up the OpenView operational (object, topology, and map) databases.
This leads to wasted system resources (memory, swap/paging,
CPU, network bandwidth, and file system disk space). To effectively manage
the nodes you are interested in, you need to configure discovery,
topology, and map filters, which are defined in the filter configuration
file (/etc/opt/OV/share/conf/C/filters). Before applying the filter,
you should always check the syntax and validity of the filter using
/opt/OV/bin/ovfiltercheck /etc/opt/OV/share/conf/C/filters.
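As an illustration only (the entry names and address ranges are
hypothetical, and the syntax follows the default filters file
shipped with NNM, so run it through ovfiltercheck before use), a
discovery filter limiting netmon to routers plus two management
subnets might look roughly like this:
Filters {
    Routers "Any router" { isRouter }
    MgmtSubnets "Nodes in the management subnets" {
        ("IP Address" ~ 10.1.1.*) || ("IP Address" ~ 10.1.2.*)
    }
}
FilterExpressions {
    DiscoveryFilter "Routers plus management subnets" { Routers || MgmtSubnets }
}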
Once the filters are checked for syntax, they can be applied effectively
for node management. To ensure that the nodes you are interested in
are being managed and that unnecessary nodes are not in the databases,
you can run the following:
/opt/OV/bin/ovtopodump -l
You can then verify the number of managed nodes with the following:
/opt/OV/bin/ovtopodump -RISC |grep -v "/"
This gives the list of nodes managed by the NNM server. From here
you can determine which nodes are missing and which need to be removed
from current management. Once auto-discovery is disabled and filters
are enabled, you can add nodes in a controlled fashion using loadhosts.
First, find the subnet mask for the node's management IP address:
snmpget -c <RO_community_name> <node_name> \
ip.ipAddrTable.ipAddrEntry.ipAdEntNetMask.<mgmt_ip_of_node>
This command returns the subnet mask for the node with the given
management IP address.
Next, add the node entry for polling in the SNMP configuration
file using xnmsnmpconf, and then load the node:
/opt/OV/bin/ovstop netmon
loadhosts -p -v -m <subnet_mask_found from step i> << EOF
IP address node_name
EOF
/opt/OV/bin/ovstart netmon
nmdemandpoll nodename
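For example, adding a single hypothetical node (10.1.2.7, host name
corerouter07, which turns out to have a /24 mask) would look like
this; the community string and names are placeholders:
snmpget -c public 10.1.2.7 \
    ip.ipAddrTable.ipAddrEntry.ipAdEntNetMask.10.1.2.7
/opt/OV/bin/ovstop netmon
loadhosts -p -v -m 255.255.255.0 << EOF
10.1.2.7 corerouter07
EOF
/opt/OV/bin/ovstart netmon
nmdemandpoll corerouter07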
NNM Polling
By default, NNM does status polling of the nodes configured in
the SNMP configuration binary file, maintained with /opt/OV/bin/xnmsnmpconf,
using the parameters set there (status polling interval, timeout, and number
of retries) for the list of nodes to be managed. The default polling
interval is 5 minutes (300 seconds) for each device. If you use
the wildcard character (*) for an entire segment (e.g., 10.1.1.*),
even though you are interested in managing only a few nodes in that
segment, NNM will try to poll every possible device from
10.1.1.0 to 10.1.1.255 within the 5-minute interval. You can
estimate the packets per second generated by status polling by
dividing the number of polled interfaces by the polling interval
in seconds; retries and timeouts add to this.
Using the wildcard character loads down NNM polling and chews up
system resources; NNM then falls behind in its polling cycle
because of the retries for each failure and the round-trip
time of each poll. A high incidence of pairs of node down/node
up events is usually an indication of a busy LAN with overly frequent
polling intervals or timeouts that are too short. To avoid this situation,
don't use the wildcard character. Instead, you can add multiple
devices using range values (e.g., 10.1.1.0-25, 10.1.1.42-59). You
should also group critical devices with shorter polling intervals
and use longer polling intervals for less critical nodes. Unwanted
nodes should be unmanaged and removed entirely from all OpenView
operational databases.
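One way to apply different intervals to different groups of nodes
is through xnmsnmpconf. The sketch below assumes the -export and
-import options behave as they do on NNM 6.x (check the xnmsnmpconf
reference page on your system), and the file name is arbitrary:
/opt/OV/bin/xnmsnmpconf -export > /tmp/snmpconf.txt
# edit /tmp/snmpconf.txt: shorter polling intervals for core devices,
# longer intervals and explicit ranges for edge devices
/opt/OV/bin/xnmsnmpconf -import /tmp/snmpconf.txt
/opt/OV/bin/xnmsnmpconf -clearCache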
Besides status polling, the following other types of NNM polling
also load up the system:
- Device discovery polling (IP discovery polling varies from
every minute to every 24 hours, and IPX discovery polling runs every
6 hours), which discovers new nodes from the information held by any
particular node.
- Topology configuration polling, every 4 hours.
- Device configuration check polling, once a day for every managed
node.
- Collection station status polling from the management NNM server
to each NNM collection station, once every 5 minutes.
To prevent these types of polling from loading up the system, you should
define efficient NNM filters and place them in operation.
Events and Event Correlation Systems
NNM receives events from the managed nodes in the form of SNMP
traps, or creates its own alert messages for missed polls on a managed
node. When a core router goes down, NNM generates a huge number of
alarms for the unreachability of all the devices managed
through the path of that core router. When a flood of alarms
hits the NNM server, it can cause a system resource bottleneck,
buffer overflows, and so on. To reduce this flood of alarms, NNM
comes with the following standard Event Correlation System circuits:
- ConnectorDown
- MgXServerDown
- Pairwise
- RepeatedEvent
- ScheduledMaintenance
These circuits can be modified using the ECS configuration Graphical
User Interface (ECSGUI) to suit your needs and suppress unwanted
alarms. Once modified, it is always best practice to test them with
the verify option for syntax checking before turning them on. You
can simulate the required SNMP traps using snmptrap
for any enterprise-specific event and use the following syntax to
check the ECS circuit in a real-time simulation:
$SNMPPATH/snmptrap -v 2 <NNM_Station> <SNMP_Community_String> \
.1.3.6.1.4.1.11.2.17.1 <Node_Name> 6 58916874 0 .1.1 i 1 .1.2 \
s <Node_Name> .1.3 s "ECS_MSG_CHECK"
Also, it is good practice to run confidence tests on ECS using /opt/OV/bin/ecsconftest
with runtime options and to check the log "/var/opt/OV/tmp/ecsconftest.log"
for any alerts.
By default, the event database size is 16 MB. The event database
is divided into four files, which means that each file has a maximum
size of 4 MB. When all four files are full, the oldest log is truncated.
This sends an alert in the NNM alarm browser, and new events are
written into the reclaimed space. If you want to change the size
of the event database to hold more events, use the "b"
parameter in the /etc/opt/OV/share/lrf/pmd.lrf file. Most NNM administrators
periodically upload the NNM events into the NNM data warehouse using
/opt/OV/bin/ovdwevent -export for various forms of reporting.
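As a hedged illustration of the "b" parameter (the other lrf fields
shown are typical defaults and may differ on your installation, so
edit your existing pmd.lrf rather than copying this verbatim),
growing the event database from 16 MB to 32 MB might look like this
in /etc/opt/OV/share/lrf/pmd.lrf:
pmd:pmd:
OVs_YES_START::-SOV_EVENT;b32:OVs_WELL_BEHAVED:15:PAUSE::
Then register the change and restart the OpenView processes:
/opt/OV/bin/ovaddobj /etc/opt/OV/share/lrf/pmd.lrf
/opt/OV/bin/ovstop
/opt/OV/bin/ovstart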
Name Resolution Issues
NNM performance can be drastically affected or improved by name
resolution. NNM issues simple gethostbyname() and gethostbyaddr()
calls to the configured name resolver (DNS/NIS/hosts). Netmon, trapd,
and SNMP data collection processes will be affected if there are
name resolution issues on the NNM server. Netmon monitors name resolution
performance and generates an alert if it finds poor performance.
In a Unix system, the name resolution order is set in the /etc/nsswitch.conf
file; in Windows, it is set in the registry values DNSPriority,
LocalPriority, HostsPriority, and NetbtPriority under
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\TCPIP\ServiceProvider.
If you use a local hosts file for name resolution and the file is
long, it adds a delay because the search is done in sequential
order. If you use DNS, the name server responses can be affected
by network latencies. On Windows, it is always better to turn off
WINS resolution and NetBIOS over TCP/IP.
NNM 6.2 contains several DNS performance enhancements compared
to previous versions of NNM. NNM 6.2 has another database related
to name resolution called the "no lookup cache", which stores the
names of nodes, segments, and networks whose names should not or
cannot be resolved to an IP address using the system IP name resolution
services. NNM administrators can also create a file with the complementary
function, called "ipNoLookup.conf", under the $OV_CONF directory, listing
IP addresses that should not be resolved to hostnames. The no
lookup cache is populated by the netmon process during discovery,
but NNM administrators can add and delete entries in the cache using
the following commands:
snmplookupconf -add <hostname>
snmplookupconf -load <filename>
snmplookupconf -disable <hostname>
DNS lookups can be traced by setting the following environment variable:
export OV_NS_LOG_TRACE="<Trace directory>;<Log Threshold>;<Trace level>"
export OV_NS_LOG_TRACE="/tmp/dns.trace;1.0;3"
Running the local caching server on the NNM station is recommended
for better name resolution performance.
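A quick sanity check of resolver latency from the NNM station can
be as simple as timing a lookup of one of your managed nodes (the
node name is a placeholder; nsquery, where available on HP-UX, also
shows which nsswitch.conf source answered):
time nslookup corerouter07
nsquery hosts corerouter07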
Data Collection
Misconfigured SNMP data collection always puts a huge load on
the system. Data collection status can be quickly checked by running
ovstatus -v snmpCollect. On NNM, SNMP data is collected either
to display the health status of managed nodes through Service Information
Portal (SIP) health dials or to produce standard NNM reports. You
can work out the system resource requirements based on the calculations
given in "Reporting and Data Analysis with HP OpenView Network
Node Manager". The main factors include the average number
of instances, the number of MIBs collected on, and the number of
nodes collected on. The current list can be displayed by running
the following command:
/opt/OV/bin/request_list schedule
snmpCollect saves the configuration and data for each data collection
in the collection directory /var/opt/OV/share/databases/snmpCollect.
The configuration files end with an exclamation point (!). The data is
stored in binary format and can be read using /opt/OV/bin/snmpColDump.
The directory grows as long as you keep collecting data. To control the
growth of the data collection, consider the following:
- Reduce the rate of disk fill by increasing the SNMP data collection
interval.
- Set up cron to periodically trim the SNMP historical data.
Listing 1 shows a small script that can be put into an hourly
cron job to keep only the last 2000 entries in the collected data
files. The script is smart enough not to alter the snmpCollect
configuration files.
Currently, snmpCollect tries to query sub-interfaces and to
collect from non-existent interfaces. This results in a huge
number of repeated timeouts and access-denied errors for requested variables
on the sub-interfaces, which can be verified in the /var/opt/OV/share/log/snmpCol.trace
file. This loads the system and affects performance. HP
is aware of these issues, and an enhancement request has been logged
with HP Labs. You can view it online at:
http://OpenView.hp.com/sso/ecare/getsupportdoc?docid=8606274911
http://OpenView.hp.com/sso/ecare/getsupportdoc?docid=8606288412
Follow the "email me" link if you want to be notified
by email once the issue has been resolved. To work around this issue,
you may need to specify the list of instances in a file instead
of selecting the default "All instances" when configuring
performance reports (see Figure 3).
NNM Daemons
Most of the NNM daemons have various flag options for running
and troubleshooting. The flags can be passed to a process either
by stopping and restarting it with the flags or by modifying its local
registration file (lrf) under /etc/opt/OV/share/lrf.
For example, if you don't want netmon to poll the nodes on the
HTTP server port, add the flag -H 0 in netmon.lrf, which will reduce
system load by stopping HTTP discovery. After modifying it, you
must register the change with /opt/OV/bin/ovaddobj
/etc/opt/OV/share/lrf/netmon.lrf and restart netmon.
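For illustration, a netmon.lrf with HTTP discovery disabled might
look roughly like the following (the dependency list and other
fields are typical defaults; keep whatever options your existing
file already contains and only append -H 0):
netmon:netmon:
OVs_YES_START:ovtopmd,pmd,ovwdb:-H 0:OVs_WELL_BEHAVED:15:PAUSE::
Then register and restart netmon:
/opt/OV/bin/ovstop netmon
/opt/OV/bin/ovaddobj /etc/opt/OV/share/lrf/netmon.lrf
/opt/OV/bin/ovstart netmon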
If all of the objects in the OpenView database can be
loaded into memory, it will speed up many OpenView operations. This
can be achieved by finding the number of objects in the OpenView
database using /opt/OV/bin/ovtopodump -l and modifying $OV_LRF/ovwdb.lrf
with a value of (number of objects + 10%) using the -n option. If
the value is higher than the available physical memory, it will defeat
the purpose and cause excessive swapping/paging. Similarly, if you
want to enable verbose tracing of snmpCollect, you should
alter the snmpCollect.lrf file as shown and look at the trace file snmpCol.trace
logged under /var/opt/OV/share/log:
OVs_YES_START:pmd,ovwdb,ovtopmd:-d -T -V:OVs_WELL_BEHAVED:20:PAUSE::
You should be very careful in terms of using the various options in
the startup file, because excessive logging and tracing will affect
system performance.
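Returning to the ovwdb cache sizing mentioned above, a hedged
sketch follows (the object count and resulting -n value are
illustrative; as with netmon.lrf, keep the existing fields of your
ovwdb.lrf and only adjust the options):
/opt/OV/bin/ovtopodump -l        # note the object count, say about 45000
In $OV_LRF/ovwdb.lrf, set the cache roughly 10% larger than that:
OVs_YES_START::-n 50000:OVs_WELL_BEHAVED:20:PAUSE::
/opt/OV/bin/ovstop
/opt/OV/bin/ovaddobj /etc/opt/OV/share/lrf/ovwdb.lrf
/opt/OV/bin/ovstart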
Number of Running ovw Sessions
If you have a big OpenView database, a highly customized map with
lots of background graphics, and multiple users running
ovw sessions with different maps at the same time, IPMAP
synchronization puts a huge load on the system in terms of CPU and
memory. Long-running ovw sessions also tend to leak memory. If all
the maps are persistent, this adds further load on system memory.
To save memory, you could enable the on-demand submap feature. This feature
lets you decide which level of submap (all submaps, segment
level, network level, or Internet level) must be loaded into system
memory during the ovw session. The submaps below that level will
then be created from the available topology database on demand.
You can also limit who can access the OpenView database and open an ovw
session by placing the appropriate host names and user names in
the file /etc/opt/OV/share/conf/ovwdb.auth. The format
is:
<hostname> <username> = To give access
<hostname> - <username> = To deny access
If you use + +, it gives access to anyone from any host. The search
is done in sequential order, so if you want to deny some users, their
entries must be at the top. You can also keep people from running too many
maps by granting permission on certain maps to certain people using
ovwperms.
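For example, an ovwdb.auth that denies one user but admits everyone
else might contain the following (host and user names are
hypothetical):
nmsconsole1 - guest
+ +
Because the search stops at the first match, the deny entry for
guest must appear before the catch-all + + line.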
NNM Data Warehouse Issues
If the data warehouse is kept in the embedded Solid database, check
and maintain it properly so that the size of the database doesn't
reach the maximum of 2 GB defined in the /var/opt/OV/share/databases/analysis/default/solid.ini
file. If it exceeds this limit, you risk database corruption, and
running the SQL export (ovcoltosql) to export data and produce reports
will be extremely slow. It is better to trim the data and to export
it in reduced rather than raw form. When you export the data to the
NNM data warehouse, you can export it either as raw data or in a consolidated,
compressed format using the reduced option. It is always better
to export with the reduced option, as it saves space.
Note that the Solid database does not shrink when you
trim it; it only grows. Trimming simply reclaims unused
space within the file but does not reduce its size. If the Solid database has
exceeded its limits, it is better to unload the data using /opt/OV/bin/ovdwunloader,
archive it, and re-create solid.db. To check whether the Solid
database is corrupted, run the following commands in order:
ovstop ovdbrun
/opt/OV/bin/ovdbrun -x testindex
If you get some internal error, it's most likely that the database
is corrupt. If the solid database has become corrupted, it can be
re-initialized using the following procedure:
1. Keep a copy of the existing database and all associated files
under /var/opt/OV/share/databases/analysis/default/ in a different
file system just in case you need to roll back.
2. Change the working directory to /var/opt/OV/share/databases/analysis/default/.
3. Delete solid.db, ./sol*.log, and ./log/*, ./backup/* files.
4. Run /opt/OV/bin/ovdbrun -x exit while in the /var/opt/OV/share/databases/analysis/default
directory. You will be prompted for the default database settings; give the
values system catalog=ovdb, user=ovdb, and password=ovdb. This creates
an empty database (solid.db).
5. Populate the database with the NNM schema:
/opt/OV/bin/ovdwconfig.ovpl -type embedded -load
6. Start NNM with the newly configured database using
ovstart ovdbcheck.
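Collected into a single hedged sketch (run it step by step rather
than blindly; the backup destination is a placeholder, and ovdbrun
prompts interactively in step 4):
ovstop ovdbrun
cd /var/opt/OV/share/databases/analysis/default
cp -rp . /archive/solid_default_backup      # step 1: rollback copy
rm -f solid.db sol*.log log/* backup/*      # step 3: remove old database files
/opt/OV/bin/ovdbrun -x exit                 # step 4: answer ovdb/ovdb/ovdb
/opt/OV/bin/ovdwconfig.ovpl -type embedded -load   # step 5: load the NNM schema
ovstart ovdbcheck                           # step 6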
Performance Enhancing Scripts
It is good practice to do a regular cleanup of the OpenView operational
databases. This can be achieved using the following procedure:
/opt/OV/bin/ovstop netmon
xnmsnmpconf -clearCache
ovw -mapcount -ruvDR
ovtopofix -a
/opt/OV/bin/ovstart netmon
You can perform a regular weekly backup using the following cron job:
15 15 * * 5 /opt/OV/bin/ovbackup.ovpl -d /filesystem > /dev/null 2>&1
The following cron jobs can be used to trim old data as it is exported
to the data warehouse, which alleviates Solid database-related performance
issues:
50 23 * * * /opt/OV/bin/ovdwtopo -export
15 23 * * * /opt/OV/bin/ovdwevent -export -trimdetail 14
15 2,5,8,11,14,17,20 * * * /opt/OV/bin/ovdwevent -export
35 23 * * * /opt/OV/bin/ovdwtrend -export -sum -trim -trimpriorto 236
35 2,5,8,11,14,17,20 * * * /opt/OV/bin/ovdwtrend -export
0 1 * * * ovdwtrend -delpriorto "2002-06-30 00:00:00" -exportto reduced
Further Reading
Performance tuning is not a one-day job; the system has to be analyzed,
tuned, and verified over a period of time to get it right. I hope
I have provided initial tips covering the various factors affecting
NNM performance and possible workarounds to overcome them. For further
information, refer to the following sources:
1. John Blommers, OpenView Network Node Manager: Designing
and Implementing an Enterprise Solution, ISBN 0-13-019849-8
2. Network Node Manager 6.2 Performance and Configuration Guide
3. Reporting and Data Analysis with HP OpenView Network Node Manager
4. Network Node Manager 6.2 DNS Performance Improvements
5. HP OpenView -- A guide to Scalability and Distribution
for Network Node Manager
6. Network Node Manager -- Managing Your Network
Thirumalainambi Murugesh received his Ph.D. degree in Electrical
& Electronic Engineering, specializing in Managing Networks and
Internet Security, from the University of Auckland, New Zealand.
He has been working in the IT industry for the past 10 years
in the design, implementation, and support of high-performance Unix
computing systems and networks, focusing on reliability, redundancy,
and security. He can be reached at: murugesh@cmc.optus.net.au.