To: <nv-l@lists.us.ibm.com>
Subject: RE: [nv-l] Ruleset Correlation
From: "Barr, Scott" <Scott_Barr@csgsystems.com>
Date: Fri, 28 May 2004 10:47:56 -0500
Delivery-date: Fri, 28 May 2004 17:02:37 +0100
Envelope-to: nv-l-archive@lists.skills-1st.co.uk
Reply-to: nv-l@lists.us.ibm.com
Sender: owner-nv-l@lists.us.ibm.com
Thread-index: AcREyNJKLcDMmXl0RHKdhhR8QHPINgAAE2wg
Thread-topic: [nv-l] Ruleset Correlation
I'll try to keep this simple, James, and answer your questions at the same time. Here is the flow:

1. The mainframe NetMaster sends an enterprise trap when an SNA device fails.
2. The trap is received by NetView.
3. A ruleset is triggered via ESE.automation, which calls a script.
4. The script parses the event, picking out the important data (SNA PU NAME and STATUS); a rough sketch follows this list.
5. The script uses a TCP socket connection to a listening script.
6. The listening script interrogates its hash table of 1100+ devices for the name and location of the client affected.
7. The listening script issues our own trap (i.e. node down or node up).
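For illustration only, the small per-event script is shaped roughly like the sketch below. The listener port, the "usage" details and the assumption that the ruleset action hands in the PU name and status as the first two arguments are mine, not the production code:

    #!/usr/bin/perl
    # Hypothetical sketch of the small per-event script (not the real code).
    use strict;
    use warnings;
    use IO::Socket::INET;

    # Assume the ruleset action passes the parsed trap data as arguments,
    # with the SNA PU name and its status in the first two positions.
    my ($pu_name, $pu_status) = @ARGV;
    die "usage: $0 <pu_name> <pu_status>\n" unless defined $pu_status;

    # Hand the parsed data to the long-running listener over a TCP socket
    # so this process can exit right away without loading any tables.
    my $sock = IO::Socket::INET->new(
        PeerAddr => 'localhost',
        PeerPort => 9876,            # port number is an assumption
        Proto    => 'tcp',
    ) or die "cannot reach listener: $!\n";

    print $sock "$pu_name $pu_status\n";
    close $sock;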
The listening script is used because I wanted to avoid having to load a hash of 1100 customers (or do the equivalent file I/O) in the event of a large-scale outage. When we IPL the mainframe, we are going to receive events on ALL SNA PUs, and spawning several hundred copies of the script, each loading a hash of 1100 customers, would be an incredible resource hog. So I have the listening script load the hash once, run like a daemon, and accept requests from the small individual scripts that have parsed out the relevant data.
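A minimal sketch of what such a listener can look like; the file path, port, enterprise OID and the snmptrap invocation are invented here for illustration, and the real script builds its hash of 1100+ customers from however the site data is actually stored:

    #!/usr/bin/perl
    # Hypothetical listener daemon sketch: loads the device table once,
    # then answers requests from the small per-event scripts.
    use strict;
    use warnings;
    use IO::Socket::INET;

    # Load the 1100+ device table once at startup instead of per event.
    my %device;    # PU name => "customer location"
    open my $fh, '<', '/usr/OV/conf/pu_table.txt'          # path assumed
        or die "pu_table: $!\n";
    while (<$fh>) {
        chomp;
        my ($pu, $customer, $location) = split /\|/;
        $device{uc $pu} = "$customer $location";
    }
    close $fh;

    my $server = IO::Socket::INET->new(
        LocalPort => 9876,           # must match the client sketch above
        Proto     => 'tcp',
        Listen    => 5,
        Reuse     => 1,
    ) or die "cannot listen: $!\n";

    while (my $client = $server->accept) {
        my $line = <$client>;
        close $client;
        next unless defined $line;
        my ($pu, $status) = split ' ', $line;
        my $info = $device{uc $pu} or next;    # unknown PU: ignore

        # Re-issue the event as our own node down / node up trap so the
        # second ruleset (the one with the hold-down timers) can act on it.
        # Enterprise OID, specific trap numbers and varbind are made up.
        my $specific = ($status =~ /INACT/i) ? 1 : 2;
        system('/usr/OV/bin/snmptrap', 'localhost',
               '.1.3.6.1.4.1.99999', 'localhost', 6, $specific, 0,
               '.1.3.6.1.4.1.99999.1', 'octetstringascii', "$pu $info") == 0
            or warn "snmptrap failed for $pu: $?\n";
    }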
The logging shows this:

- trapd.log shows all 34 down events and all 34 up events from the mainframe (duration beyond the timers).
- The small parsing script logged all 34 down and all 34 up events.
- The listener program generated all 34 down and all 34 up events (the ones the timers care about).
- A second ruleset is used to catch the listener-generated node down and node up events and trigger the notification script to TEC (it appears not all of them resulted in the notification script being triggered).
- Notification to TEC only occurred on 12.
- The TEC console only shows 12 up events and leaves the remainder open.
So, one of two conditions exists. My listener program did receive all the events and did generate the traps. Therefore, either ruleset correlation was only able to correlate a maximum of 12 (and thus did not fire the notification script), OR the notification script had problems generating 34 calls to TEC (we use postemsg, not TEC forwarding). I would rule out the listener program having an issue, on the basis that it was able to generate all the down and up traps even during the heaviest volumes I have observed. Somewhere, either the ruleset correlation failed or the TEC postemsg failed.
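One cheap way to separate those two suspects is to have the notification script log the postemsg exit status, so a postemsg failure under load leaves a trace even when nothing reaches TEC. A sketch of what I mean is below; the event class, slots, server name and paths are assumptions, not our actual BAROC definitions:

    #!/usr/bin/perl
    # Hypothetical notification-script fragment: only the exit-status
    # logging is the point; class, slots and server name are invented.
    use strict;
    use warnings;

    my ($pu_name, $status, $customer) = @ARGV;
    my $class = ($status =~ /INACT/i) ? 'SNA_PU_Down' : 'SNA_PU_Up';   # assumed classes

    my @cmd = ('/usr/local/Tivoli/bin/postemsg',       # path assumed
               '-S', 'tecserver',                      # assumed TEC server name
               '-r', 'CRITICAL',
               '-m', "SNA PU $pu_name $status for $customer",
               "hostname=$pu_name",
               $class, 'NetView');

    if (system(@cmd) != 0) {
        # postemsg did not return success: record it somewhere persistent.
        open my $log, '>>', '/usr/OV/log/postemsg_failures.log'       # path assumed
            or die "cannot open failure log: $!\n";
        print $log scalar(localtime), " postemsg rc=", $? >> 8, " for $pu_name\n";
        close $log;
    }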
As far as actionsvr firing up 34/35 processes, that should be okay. These NetView servers have dual 1.0 GHz processors and 2 GB of memory. We have other "storm-like" situations where we handle a volume equal to or larger than this; in those cases, though, I don't have the hold-down timers and the second ruleset.
Sorry if this is complicated; I was trying to be conservative with system resources by using this listener program. All code is in Perl, by the way. One problem I have is that I cannot test this without nuking some large number of customers, and my management seems to frown on production outages to test event notification. Go figure.