I believe I have resolved this problem! And guess what!
NetView ain't broke! (You already know that)
Turns out I'm a smarter programmer than I remember. The key
to solving this was to stumble across the fact that I was writing out an
"errors" file - i.e. messages from the mainframe I had no idea what to do with.
What I failed to remember is that I also throw messages in there when I can't
find the customer name in my 1100+ customer table. There were ultimately two
problems. First, the mainframe dispatch priority had to be modified for
NetMaster (it was set inappropriately low), which resulted in a few messages
never reaching us. Second, the ones that reached NetView but did not generate
notifications turned out to be recent additions to the table, and of course I
had not sync'd up the table with infoman in a month or so.
Everything now makes sense and the world is a beautiful
place.
James, all questions gladly answered :) My slow-witted
filter is on (incoming and outgoing :D)
Let me break down each of the pieces in a little more
detail, I think you pointed out where I was wrong here.
1. "first" ruleset - triggered by a trap settings
that matches the mainframe NetMaster enterprise. Calls a PERL script. The PERL
script parses out the relevant data and logs it. All log entries during the
problem present, 34 downs and 34 corresponding ups.
2. The PERL script uses a TCP socket call to notify the
"listener". Again for performance reasons, the listener reads a hash of
customer data when it starts up, and when the first script sends a node down or
node up trap, the hash is interrogated for the name of the customer and the
location. The listener logs the event and generates the SNMP trap (either up
or down) as input to the timer / hold-down ruleset (the second ruleset). The
listener program logged 34 down events and 34 up events and generated 34 down
traps and 34 up traps.
3. The "second" ruleset containing the reset and pass on
match logic fires on the trap settings for our custom trap. It's only function
is to wait the requisite 10 minutes and if the timer is exceeded calls the
notification script (PERL). This is the ambiguous part. Even though the
ruleset fired 34 times for the up messages, the PERL script to do notification
only got called 12 times.
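The timer behavior in item 3 can be modeled roughly as follows. This is a Python sketch for illustration only - the real logic lives in the NetView ruleset nodes, and the event format, the 600-second window, and the interpretation (an up arriving inside the window is tossed, an up arriving after it triggers notification) are my reading of the description above:

```python
# Rough model of the hold-down correlation, NOT the actual ruleset:
# a DOWN event starts the clock for its PU; a matching UP inside the
# hold-down window cancels the pair (reset on match), while an UP that
# arrives after the window means the outage was real and we notify.
HOLD_DOWN_SECS = 600  # the "requisite 10 minutes"

def correlate(events, hold_down=HOLD_DOWN_SECS):
    """events: list of (timestamp_secs, status, pu_name) tuples, in order.
    Returns the list of PU names whose outage outlasted the timer."""
    pending = {}   # pu_name -> timestamp of its DOWN event
    notify = []
    for ts, status, pu in events:
        if status == "DOWN":
            pending[pu] = ts
        elif status == "UP" and pu in pending:
            if ts - pending.pop(pu) >= hold_down:
                notify.append(pu)   # duration exceeded: call the script
    return notify
```

With 34 downs followed (after the window) by 34 ups, this model would notify on all 34, which is the behavior the logs say the ruleset should have produced.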
So to answer your question - yes, it appears that nvcorrd
failed to kick off the notification script in 22 of the 34 up events. I don't
believe the ruleset is failing, because as long as the events flow in smaller
groups, the correlation appears to work flawlessly. The ruleset is copied from
a standard one we use wherever we need hold-down timers.
The second script does not call postemsg directly. I have
some housework to do to figure out who to notify and what severity, so the
ruleset calls my small PERL script and that in turn calls the notification
script. Either my small script did not get called (an nvcorrd issue) or my call
of the notification script failed (nothing to suggest that
happened).
We do record the return code from postemsg, but there
were no non-zero return codes (only 12 occurrences of the notification getting
done, and all 12 had good return codes).
Some good ideas here, James (and Stephen). I think the
easiest thing I can do is put a log message in my notification script to
indicate it got called before going to postemsg. Then I will proceed to try
to generate multiple events with snmptrap and see if I can get it to
fail.
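That logging change could look something like this - a Python sketch for illustration (the actual scripts are PERL, and the log path and postemsg arguments here are placeholders):

```python
# Hypothetical wrapper: record that the notification script was reached
# BEFORE invoking postemsg, then record postemsg's return code, so the
# log distinguishes "never called" from "called but failed".
import subprocess
import time

def notify(args, logfile="/tmp/notify_calls.log", cmd="postemsg"):
    """Log the call, run the notifier command, log its return code."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    with open(logfile, "a") as log:
        log.write("%s CALLED: %s\n" % (stamp, " ".join(args)))
    rc = subprocess.call([cmd] + list(args))
    with open(logfile, "a") as log:
        log.write("%s RETURNED rc=%d\n" % (stamp, rc))
    return rc
```

If the "CALLED" lines stop at 12, the problem is upstream in nvcorrd/actionsvr; if there are 34 of them, the problem is in the postemsg leg.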
More info as I progress. Thanks
everyone.
Scott,
I
am slow-witted and I need lots of time to think about complex scenarios, so
please bear with me if I belabor the obvious or just don't seem to "get it",
especially at first. Your solution is both elaborate and
ingenious, and I am not certain that I comprehend it.
When I asked about code I was more specifically
thinking about how the rulesets are structured, what internal ruleset nodes
are used, because you were saying that you thought something had failed
inside nvcorrd (that's how I interpret what you said about there being
an internal limit of 12). And now you have elaborated on that by
saying, >"A second ruleset
is used to catch the listener-generated node down and up events and trigger
the notification script to TEC (it appears not all resulted in
>triggering the notification script)." Isn't
that the same as saying that nvcorrd failed to kick off your script when he
should have? You say that you see all the generated events (so the
first rs in ESE.automation appears to be working), so it is this second rs
which you think is failing. Am I correct? If so, then that is
the one we should focus on, how it is structured, what it does, and so
on.
If the second rs is
launching postemsg in an action node, then it seems to me that you have
several choices on how to debug this. One way is that you could issue postemsg in a script which echoes a
unique value (random number?) to a file so you could see how many times
postemsg was actually called. That would show whether it was driven as
many times as you think. Or is this what you think you already have
with your value of 12?
Another
way might be as Stephen Hochstetler suggested, trying to trap a return code
from postemsg. If postemsg generates a non-zero return code, and
actionsvr kicked off the script, then actionsvr will capture that message
and log it to the nvaction.log. Every action gets a unique sequential id
number, and if the script or command returns a non-zero return code then any
messages which accompanied that return code are also displayed with that
id number, so you can see what happened to each execution.
Or you can trace the execution
of just this second ruleset in nvcorrd by giving it the special reserved name
"nvcdebug.rs". Then you can look in the logs and see whether it works
right or not, whether each of the triggering events caused an action to be
queued to actionsvr or not. If you see the action being sent, then you
know the problem is not in nvcorrd. From there you could see the
action command arriving in the nvaction.log and watch what happens after
that.
How would you test this
without taking down devices? With snmptrap. Your initial ruleset
works off a trap, so does your second. So you should be able to build
an snmptrap command to mimic them. The code won't know that there
wasn't a real outage if you launch a test script with 34 snmptrap commands
in it. Unsure how to build this? Just ask.
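Something along these lines could drive the traps in a loop. This is only a sketch (in Python rather than a shell script): the enterprise OID, specific-trap numbers, and varbind are placeholders you would replace with whatever your rulesets actually match on, and the argument order shown is the net-snmp style `snmptrap`, which may differ from other snmptrap binaries:

```python
# Hypothetical flood test: replay N "down" traps and, later, N "up" traps
# with snmptrap so the rulesets see a storm without any real outage.
import subprocess

ENTERPRISE = ".1.3.6.1.4.1.99999"   # stand-in for the custom enterprise OID

def send_traps(count, specific, target="localhost", runner=subprocess.call):
    """Issue `count` SNMPv1 traps with the given specific-trap number
    (e.g. 1 = node down, 2 = node up). Returns how many commands failed."""
    failures = 0
    for i in range(1, count + 1):
        cmd = ["snmptrap", "-v", "1", "-c", "public", target,
               ENTERPRISE, target,      # enterprise OID, agent address
               "6", str(specific), "",  # generic=6, specific, uptime=now
               ENTERPRISE + ".1", "s", "TESTPU%d" % i]
        if runner(cmd) != 0:
            failures += 1
    return failures

# send_traps(34, 1)   # 34 downs
# ...wait out the 10-minute hold-down window...
# send_traps(34, 2)   # 34 ups -- then count the notifications
```

The `runner` parameter just makes the loop testable; in real use the default `subprocess.call` executes snmptrap directly.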
Complicated? Difficult? Perhaps, but no more so
than what you have already built. That's the problem with disaster
recovery tools -- it's always hard to test them.
James Shanks Level 3 Support for Tivoli
NetView for UNIX and Windows Tivoli Software / IBM Software Group
"Barr, Scott"
<Scott_Barr@csgsystems.com> Sent by: owner-nv-l@lists.us.ibm.com
05/28/2004 11:47 AM
|
To
| <nv-l@lists.us.ibm.com>
|
cc
|
|
Subject
| RE: [nv-l] Ruleset
Correlation |
|
I'll try to keep this simple, James, and answer
your questions at the same time. Here is the flow:

1. Mainframe NetMaster Enterprise trap when an SNA device fails
2. Received by NetView
3. Trigger ruleset via ESE.automation that calls a script
4. Script parses the event, picking out the important data (SNA PU NAME & STATUS)
5. Script uses a TCP socket connection to a listening script
6. Listening script interrogates its hash table of 1100+ devices for the name and location of the client affected
7. Listening script issues our own trap (i.e. node down or node up)

The listening script is used because I wanted to avoid
having to load a hash of 1100 customers (or do equivalent file I/O) in the
event of large-scale outages. When we IPL the mainframe, we are going to
receive events on ALL SNA PUs, and spawning several hundred copies of the
script, each loading the hash with 1100 customers, would be an incredible resource
hog. So I have the listening script load the hash and run like a daemon and
accept requests from small individual scripts that have parsed out the
relevant data.

The logging shows this:

- Trapd.log shows all 34 down events and all 34 up events from the mainframe (duration beyond the timers)
- The small script which parses logged all 34 down and all 34 up events
- The listener program generated all 34 down and all 34 up events (the ones the timers care about)
- A second ruleset is used to catch the listener-generated node down and up events and trigger the notification script to TEC (it appears not all resulted in triggering the notification script)
- Notification to TEC only occurred on 12
- TEC console only shows 12 up events and leaves the remainder as open
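As a rough illustration of that daemon pattern, a sketch follows - in Python rather than the actual PERL, with the port, record layout, and request format invented for the example. The point is that the 1100-customer table is loaded once, and each per-trap script only pays for a tiny socket round trip:

```python
# Illustrative sketch of the listener: load the customer table into a
# dict at startup, then answer "PUNAME STATUS" lookups over TCP so the
# small per-trap scripts never load the table themselves.
import socket

def load_customers(path):
    """Parse 'puname|customer|location' lines into a dict keyed by PU name.
    (The pipe-delimited layout is an assumption for this example.)"""
    table = {}
    with open(path) as f:
        for line in f:
            pu, customer, location = line.strip().split("|")
            table[pu] = (customer, location)
    return table

def handle_request(table, line):
    """Resolve one 'PUNAME STATUS' request. An unknown PU is the case
    that would land in the "errors" file in the real system."""
    pu, status = line.split()
    if pu not in table:
        return "ERROR unknown PU %s" % pu
    customer, location = table[pu]
    return "%s %s %s %s" % (status, pu, customer, location)

def serve(table, port=5555):
    """Accept one request per connection, forever (daemon loop)."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("127.0.0.1", port))
    srv.listen(5)
    while True:
        conn, _ = srv.accept()
        data = conn.recv(1024).decode().strip()
        conn.sendall(handle_request(table, data).encode())
        conn.close()
```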
So,
one of two conditions exists. My listener program did receive all the events
and did generate the traps. Therefore, either ruleset correlation was only
able to correlate a maximum of 12 (and thus did not fire the notification
script), OR the notification script had problems generating 34 calls to TEC
(we use postemsg, not TEC forwarding). I would rule out the listener program
having an issue, on the basis that it was able to generate all the down and
up traps even during the heaviest of volumes I have observed. Somewhere, the
ruleset correlation failed, or the TEC postemsg failed. As far as
actionsvr firing up 34/35 processes, that should be okay. These NetView
servers have dual 1.0 GHz processors and 2 GB of memory. We have other
"storm-like" situations in which we handle a volume equal to or larger than
this. In those cases, though, I don't have the hold-down timers and the
second ruleset. Sorry if this is complicated; I was trying to be conservative
with system resources by using this listener program. All code is in PERL,
btw. One problem I have is that I cannot test this without nuking some large
number of customers, and my management seems to frown on production outages
to test event notification. Go figure.
From: owner-nv-l@lists.us.ibm.com [mailto:owner-nv-l@lists.us.ibm.com] On Behalf Of James Shanks
Sent: Friday, May 28, 2004 10:27 AM
To: nv-l@lists.us.ibm.com
Subject: Re: [nv-l] Ruleset Correlation
Well, it is awfully difficult to try to diagnose your situation
without knowing how the code you have designed actually works.
Did the ruleset fire
correctly on every event? Your best bet is to turn on nvcorrd tracing
(nvcdebug -d all) after nvcorrd starts so you can look at the logs. If
they toggle too quickly, then you'll have to start nvcorrd with the -l
<logfile> parameter so he just writes to one huge log until you stop
him. The logs will show what actually happens inside him and whether
the rulesets worked properly.
Did the scripts get launched? If you think
you already know that they did, and these notifications are sent via
scripts run by actionsvr, then it is time to look at the nvaction logs. Note
that the way actionsvr operates is that he spawns a child for every action he
runs, so if you are expecting 34 concurrent notifications, you'll get up to
35 actionsvr processes running concurrently: the main one and 34 children.
There's no magic number of actionsvr processes that can run at one
time; that's up to your operating system limits. But actionsvr will
cancel his children if they don't complete in 999
seconds.
Hope this helps.
James Shanks Level 3 Support for Tivoli NetView for
UNIX and Windows Tivoli Software / IBM Software Group
"Barr, Scott"
<Scott_Barr@csgsystems.com> Sent by:
owner-nv-l@lists.us.ibm.com
05/28/2004 10:08 AM
|
To
| <nv-l@lists.us.ibm.com>
|
cc
|
|
Subject
| [nv-l] Ruleset
Correlation |
|
Greetings -
NetView 7.1.3 & Solaris 2.8

I am working through some automation
performance issues and I observed something disturbing. I have automation
that receives SNA mainframe events, parses and formats the trap, and writes
it to a log. It also uses snmptrap to generate a pseudo "node down" trap.
When a corresponding up event is received for the same SNA device, I use
snmptrap to send an "up" event. A second ruleset performs correlation on the
up and down events so that if the duration between the down and up events is
less than 10 minutes, it gets tossed; otherwise a notification script is
called that wakes up the help desk.
What disturbs me is
the behavior I see when we have a significant outage - in my sample case, 34
SNA devices dropped at one time. When the corresponding up messages occurred,
everything worked properly except the notifications. The duration of the
outage exceeded the time in the pass on match / reset on match timers, but only 12
up notifications occurred. According to my application log and trapd.log, the
34 "up" events got generated but the notifications did not. What I am
wondering is whether there is a limit to the number of outstanding
correlated events, i.e. how many devices can be waiting for a node up? Is it
possible only 12 pairs of node down/ups can be outstanding? Is there a way
to look at what events automation (and I'm not sure if it's nvcorrd,
actionsvr or ovactiond that's involved) still has outstanding?