Notes on Ruleset Performance
This document brings together several hints on ruleset coding for
users who have performance problems with their rulesets.
The key thing to remember is that the ruleset editor is a
programming tool: with it you can write very powerful programs, but
you can also write resource hogs. And since ruleset processing is
done by just one daemon, nvcorrd, a bad ruleset can quickly bring all
event processing to its knees.
Here are some hints.
1. Limit the input as quickly and dramatically as you can
Typically this is done by using the Trap Settings node (where you
can specify up to 20 specific traps per enterprise) or the Event
Attributes node, so that you can immediately reduce the processing load
to a small subset of all the traps which pass through the system.
Remember that nvcorrd gets a copy of every one of them -- even those
marked "Log Only" or "Don't Log or Display" -- so it is vitally
important to limit the volume of what is processed. Failure to do so
will result in a general slowdown of events coming to the display window,
and you will see messages in nvcorrd.alog and nvcorrd.blog that traps are
being queued. With nvcdebug -d all you can actually see how many are
in the queue. A growing queue means that nvcorrd is falling behind and
may never catch up.
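For example, one quick way to tell whether nvcorrd is already falling
behind is to look for those queueing messages in its logs. This is
only a sketch: the /usr/OV/log directory and the exact wording of the
messages are assumptions here, so adjust the path and the search
string for your installation.

    # Turn on full correlation tracing (as described above), then count
    # mentions of queued traps in the nvcorrd logs.  The log directory
    # is an assumption; use wherever nvcorrd.alog and nvcorrd.blog live
    # on your system.
    nvcdebug -d all
    grep -ci queue /usr/OV/log/nvcorrd.alog /usr/OV/log/nvcorrd.blog

If those counts keep climbing while traps are flowing, the ruleset is
not filtering early enough.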
2. Limit your use of Collection compares and Database field compares
Never start your ruleset with a Query Database node, whether for
collection membership or for a database field value. To do so means
that every trap in the system will have to be checked (see hint #1)
and worse yet, that nvcorrd will have to suspend other processing
while awaiting the response from an external daemon (nvcold for
collections, ovwdb for database fields) to decide what to do next. These
external calls should be kept to a minimum. Likewise, it is not good
to string these calls together (multiple nodes which query collections
or database fields) for the same reason. Every one suspends nvcorrd
while it awaits an answer. Collections can be combined into a
super collection using the Collection editor so that only one call has
to be made. Try not to query a database field if you are also going to
set it, since each is another call to ovwdb.
3. Limit MIB compares and sets
The same kinds of considerations that apply to collections and fields
also apply to getting (and setting) MIB variables. This should be
done sparingly. Now we not only suspend processing for a response from
another process (snmpd), but we also have to go outside our own box,
which may introduce network delays as well. The retry count for a
MIB compare should be cut to one if possible, especially if the node
you are trying to reach may be down, and the timeout value should be
kept low.
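To put numbers on that, a MIB compare against a node which is down
stalls nvcorrd for the full timeout on the initial try and again on
every retry. Here is a minimal sketch of that arithmetic, with purely
hypothetical timeout and retry values:

    # Worst-case stall (in seconds) from one MIB compare against a node
    # that never answers: the initial try plus each retry, all of which
    # block nvcorrd.  The values are hypothetical; plug in the ones
    # coded in your ruleset node.
    TIMEOUT=10
    RETRIES=3
    echo "worst-case stall: $(( TIMEOUT * (RETRIES + 1) )) seconds"

Forty seconds of dead time per matching trap is exactly what makes the
queues described in hint #1 grow, which is why one retry and a low
timeout are recommended.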
4. Use checkroute sparingly
Checkroute is also an external call, so make sure that you use it
only when you have to, and never retry more than once.
5. Use trap variables for information wherever you can
The entire trap is passed to nvcorrd and is immediately parsed into
the trap variables. These should be used, rather than database fields,
collection queries, or MIB compares, to make decisions whenever
possible. The reason should be obvious by now -- nvcorrd does not
have to rely on any external sources for this data and can process it
immediately.
6. Keep in-line actions short
An in-line action is a user-specified command or script for nvcorrd
to execute to decide what to do next. These should be the sorts of
things that complete in less than 10 seconds, because once again, all
other processing is suspended. Activities like sending a page, an
email, or a pop-up message should never be done in an in-line action.
They should be done in an (off-line) action node instead, so that they
are executed by actionsvr and not by nvcorrd. For in-line actions,
the shorter, the better. Don't wait 30 seconds for output that
should come back in two; wait ten seconds at most.
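As a sketch of what "short" means in practice, here is the shape of
an in-line action script that gives a local command at most ten
seconds to answer and then gets out of the way. The command and file
names are placeholders, and no timeout utility is assumed, so the
ten-second bound is enforced by hand:

    #!/bin/ksh
    # Hypothetical in-line action: run a quick local check, but never
    # hold nvcorrd up for more than about ten seconds.
    /usr/local/bin/check_status > /tmp/status.$$ 2>&1 &  # placeholder command
    CMD=$!
    ( sleep 10; kill $CMD 2>/dev/null ) &                # watchdog
    DOG=$!
    wait $CMD                                            # returns when the command
    RC=$?                                                # finishes or is killed
    kill $DOG 2>/dev/null                                # stop the watchdog
    rm -f /tmp/status.$$
    exit $RC                                             # the ruleset branches on this

Anything slower than that -- paging, mail, pop-ups -- belongs in an
(off-line) action node so that actionsvr, not nvcorrd, does the
waiting.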
7. Make hold times for pass-on-match and reset-on-match reasonable
The pass-on-match and reset-on-match functions offer great power in
a ruleset, but there is a trade-off in memory and CPU for this power.
These nodes create a cache in memory and store events in that cache
for comparison with others which arrive at some later time. If you
specify hours of hold time, and the events you are caching are
frequent, then you could see dramatic memory growth in nvcorrd over
time. Nothing can be done about this, if that is what you have coded.
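To make "dramatic" concrete, the cache size is roughly the arrival
rate of the cached events multiplied by the hold time. A minimal
sketch of that arithmetic, with hypothetical figures:

    # Rough number of events sitting in a pass-on-match or
    # reset-on-match cache at any one time: arrival rate times hold
    # time.  The figures are hypothetical.
    RATE_PER_MINUTE=10
    HOLD_MINUTES=120          # a two-hour hold time
    echo "events cached: $(( RATE_PER_MINUTE * HOLD_MINUTES ))"

Twelve hundred cached copies of a frequent trap, each held for two
hours, is how that memory growth happens; shorter hold times and
tighter filtering up front keep the cache small.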
Also, although not a performance consideration, there is a lower
limit to time discrimination here. These match nodes use a "heartbeat"
mechanism to check whether the timeout value of their cache has been
reached for the events in it. This heartbeat is fixed at 15 seconds. No
time value for the cache can be resolved any finer than this. In real
life this is typically not a problem, since matching events seldom
occur any faster than this, but it is a design consideration to be aware
of.
8. Never use "PASS" or "Forward" in a ruleset in ESE.automation
Rulesets whose full path names are placed in the
/usr/OV/conf/ESE.automation file are registered by the actionsvr
daemon when it is started. Thus, if the initial Event Stream node
says "PASS", a complete copy of every event in the system is sent to
actionsvr. But actionsvr does not have a display on which to place
these passed events (it is a daemon, and daemons do not have a
display), so they instead sit on its incoming queue. Actionsvr has no
way to de-queue them because they are not accompanied by actions it
is to perform. Eventually its queue fills up (about 32K events) and
it stops processing. Then events start to back up inside nvcorrd,
and when nvcorrd's outbound queue fills up, it too stops processing,
and all event processing ceases. The same thing will happen, of
course, when a Forward node is used in a ruleset executing in the
background for actionsvr. It will just occur more slowly.
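For illustration, each line of ESE.automation is simply the full path
name of one ruleset. The directory and .rs suffix below are the usual
defaults but may differ on your system, and the ruleset name itself
is made up:

    /usr/OV/conf/rulesets/page_on_node_down.rs

A ruleset registered this way should have the default of its initial
Event Stream node set to Block rather than Pass, and should end in
action nodes rather than Forward. And since actionsvr registers these
rulesets when it starts, changes to the file take effect the next
time the daemon is recycled.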
It is hoped that these hints will help users create more effective
rulesets with minimal adverse impact on their systems.
James Shanks
Tivoli (NetView for AIX) L3 Support
Last updated: July 6, 1998