James, all questions gladly answered :) My slow-witted
filter is on (incoming and outgoing :D)
Let me break down each of the pieces in a little more
detail; I think you pointed out where I was wrong here.
1. "first" ruleset - triggered by a trap settings that
matches the mainframe NetMaster enterprise. Calls a PERL script. The PERL script
parses out the relevant data and logs it. All log entries during the problem
present, 34 downs and 34 corresponding ups.
2. The PERL script uses a TCP socket call to notify the
"listener". Again for performance reasons, the listener reads a hash of customer
data when it starts up, and when the first script sends a node down or node up
event, the hash is interrogated for the name of the customer and the location.
The listener logs the event and generates the SNMP trap (either up or down) as
input to the timer / hold-down ruleset (the second ruleset). The listener
program logged 34 down events and 34 up events and generated 34 down traps and
34 up traps. (A sketch of the small per-trap script follows item 3 below.)
3. The "second" ruleset containing the reset and pass on
match logic fires on the trap settings for our custom trap. It's only function
is to wait the requisite 10 minutes and if the timer is exceeded calls the
notification script (PERL). This is the ambiguous part. Even though the ruleset
fired 34 times for the up messages, the PERL script to do notification only got
called 12 times.
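
For what it's worth, the shape of the small per-trap script (from item 2) is
roughly like this -- a simplified sketch, not the production code; the log
path, port number and message layout are placeholders:

#!/usr/bin/perl
# Illustration only -- not the production script.  Shows the shape of the
# per-trap helper: log the parsed event, then hand it to the listener over
# TCP.  The log path, port and message layout are made-up placeholders.
use strict;
use warnings;
use IO::Socket::INET;

my ($pu_name, $status) = @ARGV;          # e.g. "PU1234" and "down" or "up"

# Log the event so the counts can be compared against trapd.log later
open my $log, '>>', '/usr/OV/log/netmaster_events.log'
    or die "cannot open log: $!";
printf $log "%s %s %s\n", scalar localtime, $pu_name, $status;
close $log;

# Hand the event to the long-running listener over a local TCP socket
my $sock = IO::Socket::INET->new(
    PeerAddr => 'localhost',
    PeerPort => 9999,                    # placeholder port
    Proto    => 'tcp',
) or die "cannot reach listener: $!";
print $sock "$pu_name $status\n";
close $sock;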
So to answer your question - yes, it appears that nvcorrd
failed to kick off the notification script for 22 of the 34 up events. I don't
believe the ruleset is failing, because as long as the events flow in smaller
groups, the correlation appears to work flawlessly. The ruleset is copied from a
standard one we use wherever we need hold-down timers.
The second script does not call postemsg directly. I have
some housework to do to figure out who to notify and what severity, so the
ruleset calls my small PERL script and that in turn calls the notification
script. Either my small script did not get called (nvcorrd issue) or my call of
the notification script failed (nothing to suggest that
happened).
We do record the return code from postemsg, but there were
no non-zero return codes (only 12 occurrences of the notification getting done,
and all 12 had good return codes).
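
In other words the wrapper is shaped something like the sketch below
(simplified, and the postemsg arguments, TEC server name and event class are
placeholders, not our real ones):

#!/usr/bin/perl
# Simplified sketch of the wrapper, not the real script.  It logs that it was
# invoked, would work out who to notify (omitted here), calls postemsg, and
# records the return code.  Server, class, source and severity are placeholders.
use strict;
use warnings;

open my $log, '>>', '/usr/OV/log/notify_wrapper.log' or die "log: $!";
printf $log "%s CALLED args=[%s]\n", scalar localtime, join(' ', @ARGV);

# ... lookup of who to notify and what severity would go here ...

my $rc = system('/usr/local/bin/postemsg',
                '-S', 'tec-server',               # placeholder TEC server
                '-r', 'CRITICAL',                 # placeholder severity
                '-m', "@ARGV",
                'Custom_SNA_Node_Event',          # placeholder event class
                'NetView');                       # event source
printf $log "%s postemsg rc=%d\n", scalar localtime, $rc >> 8;
close $log;
exit($rc >> 8);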
Some good ideas here James (and Stephen). I think the
easiest thing I can do is put a log message in my notification script to
indicate it got called before going to postemsg. Then I will proceed to try and
generate multiple events with snmptrap and see if I can get it to
fail.
More info as I progress. Thanks
everyone.
Scott,
I am
slow-witted and I need lots of time to think about complex scenarios, so
please bear with me if I belabor the obvious or just don't seem to "get it",
especially at first. Your solution is both elaborate and
ingenious, and I am not certain that I comprehend it.
When I asked about code I was more specifically
thinking about how the rulesets are structured, what internal ruleset nodes
are used, because you were saying that you thought something had failed inside
nvcorrd (that's how I interpret what you said about there being an
internal limit of 12). And now you have elaborated on that by
saying, >"A second ruleset is
used to catch the listener-generated node down and up events and trigger the
notification script to TEC (it appears not all resulted in >triggering the
notification script)."
Isn't that the same as saying that
nvcorrd failed to kick off your script when he should have? You say that
you see all the generated events (so the first rs in ESE.automation appears to
be working), so it is this second rs which you think is failing. Am I
correct? If so, then that is the one we should focus on, how it is
structured, what it does, and so on.
If the second rs is launching postemsg in an action node, then it seems
to me that you have several choices on how to debug this. One way is to issue postemsg from a script
which echoes a unique value (random number?) to a file, so you could see how
many times postemsg was actually called. That would show whether it was
driven as many times as you think. Or is this what you think you already
have with your value of 12?
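
For instance, something as trivial as this would do it (the file name, marker
format, and the path to postemsg are just examples):

#!/usr/bin/perl
# Example counting wrapper -- one line per invocation, so the number of lines
# in /tmp/postemsg_calls.log tells you how many times the action really fired.
# It then hands off to the real postemsg unchanged (path is a placeholder).
use strict;
use warnings;

my $marker = join '-', time(), $$, int(rand(100000));   # unique-ish value
open my $fh, '>>', '/tmp/postemsg_calls.log' or die "cannot open: $!";
print $fh "$marker @ARGV\n";
close $fh;

exec '/usr/local/bin/postemsg', @ARGV;   # pass all arguments straight through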
Another way might be as Stephen Hochstetler suggested, trying to trap a
return code from postemsg. If postemsg generates a non-zero return code,
and actionsvr kicked off the script, then actionsvr will capture that message
and log it to the nvaction.log. Every action gets a unique sequential id
number, and if the script or command returns a non-zero return code then any
messages which accompanied that return code are also displayed with that
id number, so you can see what happened to each execution.
Or you can trace the execution of
just this second ruleset in nvcorrd by giving it a special reserved name,
"nvcdebug.rs". Then you can look in the logs and see whether it works
right or not, whether each of the triggering events caused an action to be
queued to actionsvr or not. If you see the action being sent, then you
know the problem is not in nvcorrd. From there you could see the action
command arriving in the nvaction.log and watch what happens after that.
How would you test this without
taking down devices? With snmptrap. Your initial ruleset works off
a trap, so does your second. So you should be able to build an snmptrap
command to mimic them. The code won't know that there wasn't a real
outage if you launch a test script with 34 snmptrap commands in it.
Unsure how to build this? Just ask.
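
As a starting point, a driver script along these lines would simulate the
whole storm. The enterprise OID, specific trap numbers and varbind below are
placeholders -- take the real values from one of your traps in trapd.log, and
check the snmptrap man page for the exact argument order on your system:

#!/usr/bin/perl
# Example test driver: fire 34 simulated "down" traps, wait out the 10-minute
# hold-down, then fire 34 matching "up" traps.  Everything after the
# destination host is a placeholder -- substitute your real enterprise OID,
# generic/specific numbers and varbinds from a captured trap.
use strict;
use warnings;

my @devices = map { sprintf 'SNAPU%02d', $_ } 1 .. 34;

sub send_trap {
    my ($specific, $pu) = @_;
    system('/usr/OV/bin/snmptrap', 'nvserver',   # NetView box receiving it
           '.1.3.6.1.4.1.99999',                 # placeholder enterprise OID
           'nvserver', '6', $specific, '0',      # agent, generic 6, specific, time
           '.1.3.6.1.4.1.99999.1', 'octetstring', $pu);
}

send_trap('1', $_) for @devices;   # 34 "down" traps at once
sleep 11 * 60;                     # outlast the 10-minute hold-down timer
send_trap('2', $_) for @devices;   # 34 corresponding "up" traps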
Complicated? Difficult? Perhaps, but no more so than what you
have already built. That's the problem with disaster recovery tools --
it's always hard to test them.
James Shanks
Level 3 Support for Tivoli NetView for UNIX and Windows
Tivoli Software / IBM Software Group
"Barr, Scott"
<Scott_Barr@csgsystems.com> Sent by: owner-nv-l@lists.us.ibm.com
05/28/2004 11:47 AM
|
To
| <nv-l@lists.us.ibm.com>
|
cc
|
|
Subject
| RE: [nv-l] Ruleset
Correlation |
|
I'll try to keep this simple James, and answer your
questions at the same time. Here is the flow:
- Mainframe NetMaster Enterprise trap when an SNA device fails
- Received by NetView
- Trigger ruleset via ESE.automation that calls a script
- Script parses the event, picking out important data (SNA PU NAME & STATUS)
- Script uses a TCP socket connection to a listening script
- Listening script interrogates its hash table of 1100+ devices for the name
  and location of the client affected
- Listening script issues our own trap (i.e. node down or node up)
The listening script is used because I wanted to avoid having to load a hash of
1100 customers (or do equivalent file I/O) in the event of large-scale
outages. When we IPL the mainframe, we are going to receive events on ALL SNA
PUs, and spawning several hundred copies of the script, each loading the hash
with 1100 customers, would be an incredible resource hog. So I have the
listening script load the hash, run like a daemon, and accept requests from
small individual scripts that have parsed out the relevant data (roughly along
the lines of the sketch further down). The logging shows this:
- Trapd.log shows all 34 down events and all 34 up events from the mainframe
  (duration beyond the timers)
- The small script which parses logged all 34 down and all 34 up events
- The listener program generated all 34 down and all 34 up events (the ones
  the timers care about)
- A second ruleset is used to catch the listener-generated node down and up
  events and trigger the notification script to TEC (it appears not all
  resulted in triggering the notification script)
- Notification to TEC only occurred on 12; the TEC console only shows 12 up
  events and leaves the remainder as open.
So, one of two conditions exists. My listener program did receive all
the events and did generate the traps. Therefore, either ruleset correlation
was only able to correlate a maximum of 12 (and thus did not fire the
notification script), OR the notification script had problems generating 34
calls to TEC (we use postemsg, not TEC forwarding). I would rule out the
listener program having an issue on the basis that it was able to generate all
the down and up traps even during the heaviest of volumes I have observed.
Somewhere, the ruleset correlation failed, or the TEC postemsg failed.
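
For reference, the structure of the listener is roughly this (heavily
simplified, not the production code -- the customer file format, the port and
the snmptrap arguments are placeholders):

#!/usr/bin/perl
# Heavily simplified sketch of the listener idea, not the production code.
# Load the customer hash once, then sit on a socket servicing the small
# per-trap scripts, re-emitting each event as our own up/down trap.
# File format, port and snmptrap arguments are placeholders.
use strict;
use warnings;
use IO::Socket::INET;

my %customer;                                  # PU name -> "customer|location"
open my $fh, '<', '/usr/local/etc/customers.dat' or die "customers: $!";
while (<$fh>) {
    chomp;
    my ($pu, $cust, $loc) = split /\|/;
    $customer{$pu} = "$cust|$loc";
}
close $fh;

my $server = IO::Socket::INET->new(
    LocalPort => 9999, Proto => 'tcp', Listen => 10, Reuse => 1,
) or die "cannot listen: $!";

while (my $conn = $server->accept) {
    my $line = <$conn>;
    close $conn;
    next unless defined $line;
    my ($pu, $status) = split ' ', $line;      # "PUNAME down" or "PUNAME up"
    my ($cust, $loc)  = split /\|/, ($customer{$pu} || 'unknown|unknown');
    print scalar localtime, " $status $pu $cust $loc\n";    # event log
    # Re-emit as our custom trap so the hold-down ruleset sees it
    # (enterprise OID and specific trap numbers below are placeholders)
    system('/usr/OV/bin/snmptrap', 'localhost',
           '.1.3.6.1.4.1.99999', 'localhost',
           '6', ($status eq 'down' ? '1' : '2'), '0',
           '.1.3.6.1.4.1.99999.1', 'octetstring', "$pu $cust $loc");
}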
As far as actionsvr firing up 34/35 processes, that should be okay.
These NetView servers have dual 1.0 GHz processors and 2 GB of memory. We have
other "storm-like" situations in which we handle a volume equal to or larger
than this. In those cases, though, I don't have the hold-down timers and the
second ruleset. Sorry if this is complicated; I was trying to be conservative
with system resources by using this listener program. All code is in PERL btw.
One problem I have is I cannot test this without nuking some large number of
customers, and my management seems to frown on production outages to test
event notification. Go figure.
From: owner-nv-l@lists.us.ibm.com [mailto:owner-nv-l@lists.us.ibm.com] On Behalf Of James Shanks
Sent: Friday, May 28, 2004 10:27 AM
To: nv-l@lists.us.ibm.com
Subject: Re: [nv-l] Ruleset Correlation
Well, it is awfully difficult to try to diagnose your situation
without knowing how the code you have designed actually works.
Did the ruleset fire
correctly on every event? Your best bet is to turn on nvcorrd tracing
(nvcdebug -d all) after nvcorrd starts so you can look at the logs. If
they toggle too quickly, then you'll have to start nvcorrd with the -l
<logfile> parameter so he just writes to one huge log until you stop
him. The logs will show what actually happens inside him and whether the
rulesets worked properly.
Did the scripts get launched? If you think you already know
that they did, and these notifications are sent via scripts run by
actionsvr, then it is time to look at the nvaction logs. Note that the way
actionsvr operates is that he spawns a child for every action he runs, so if
you are expecting 34 concurrent notifications, you'll get up to 35 actionsvr
processes running concurrently, the main one and 34 children. There's
no magic number of actionsvr processes that can run at one time; that's up to
your operating system limits. Actionsvr will, however, cancel his children
if they don't complete in 999 seconds.
Hope this helps.
James Shanks
Level 3 Support for Tivoli NetView for UNIX and Windows
Tivoli Software / IBM Software Group
"Barr, Scott"
<Scott_Barr@csgsystems.com> Sent by:
owner-nv-l@lists.us.ibm.com
05/28/2004 10:08 AM
|
To
| <nv-l@lists.us.ibm.com>
|
cc
|
|
Subject
| [nv-l] Ruleset
Correlation |
|
Greetings -
NetView 7.1.3 & Solaris 2.8 I am working through some automation
performance issues and I observed something disturbing. I have automation that
receives SNA mainframe events, parses and formats the trap and writes it to a
log. It also uses snmptrap to generate a pseudo "node down" trap. When a
corresponding up event is received for the same SNA device I use snmptrap to
send an "up" event. A second ruleset performs correlation on the up and down
events so that if the duration between the up and down events is less than 10
minutes, it gets tossed, otherwise a notification script is called that wakes
up the help desk. What disturbs me is the behavior I see when we have a
significant outage - in my sample case, 34 SNA devices dropped at one time.
When the corresponding up messages occurred, everything worked properly except
the notifications. The duration of the outage exceeded the time in pass on
match/reset on match timers, but only 12 up notifications occurred. According
to my application log and trapd.log, the 34 "up" events got generated
but the notifications did not. What I am wondering is whether there is a limit
to the number of outstanding correlated events, i.e. how many devices can be
waiting for a node up? Is it possible only 12 pairs of node down/ups can be
outstanding? Is there a way to look at what events automation (and I'm not
sure if it's nvcorrd, actionsvr or ovactiond that's involved) still has
outstanding?