Yeah, you are on the right track. Basically this should work.
A couple of details puzzle me though. The Reset sounds like it is coded OK, but
your explanation of what it is doing makes me wonder whether you are perhaps
misinformed about what it will do. You said:
> This should have the effect of taking the results of two polls and
> comparing them in the event that a node down is received: if the next is a
> node up, then wait for another node down; otherwise forward the event to
> the next node.
Actually, if you get a matching Node Up while a Node Down is in the cache, both
are tossed out and the ruleset is finished at that point, until another trap
comes in. It may just be that I didn't follow what you said, or that I am
being too picky about this, but I don't want you to feel that I have misled you
if I have misinterpreted what you said.
The way a reset works is this. When a Node Down comes through the event stream,
it is held in the Reset cache for further processing and a time value is stored
with it, indicating when it was received. Then when any Node Up is received
(doesn't have to be the next event in the stream) it is compared against any
Node Downs in the cache. If they match, the original Node Down is tossed out
and processing for that event stops. Independently, nvcorrd is checking his
timer values every 15 seconds and seeing if any of his cached events have
expired. If no match has been received once the 10 minutes (plus or minus 15
seconds for timer checking) have passed, the Node Down is passed on to the next
ruleset node for further processing.
Now I mention this because by itself the Reset doesn't know anything about how
frequently netmon is polling these devices and generating these traps, nor does
it care how many other traps separate the Node Down and the Node Up. It is just
monitoring the event stream. It has no idea how the stream got to be the way it
is. It just knows that if it doesn't get a Node Up for the Node Down events
in its cache within ten minutes, it is going to send those events along for
further processing. It is only your knowledge of the polling cycle which allows
you to say under what conditions these traps were generated. Does that make
sense?
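
To make the Reset behaviour described above concrete, here is a minimal Python
sketch. It is only a toy model, not NetView code: the class name ResetOnMatch,
the simulate() helper and the host names are all made up for illustration. The
10-minute window and the 15-second timer check are the values discussed above.

```python
RESET_WINDOW = 10 * 60   # seconds a Node Down is held (the 10 minutes above)
CHECK_INTERVAL = 15      # seconds between timer checks (as described for nvcorrd)


class ResetOnMatch:
    """Toy model of a Reset-On-Match node (hypothetical, not NetView code)."""

    def __init__(self, window=RESET_WINDOW):
        self.window = window
        self.cache = []  # held Node Downs as (hostname, received_at)

    def on_event(self, kind, hostname, received_at):
        """Feed one trap from the event stream into the node."""
        if kind == "Node Down":
            # Hold the event and remember when it was received.
            self.cache.append((hostname, received_at))
        elif kind == "Node Up":
            # A matching Node Up tosses out any held Node Down for that host.
            self.cache = [(h, t) for h, t in self.cache if h != hostname]
        # Any other trap is ignored; it does not affect the cache.

    def on_timer(self, now):
        """Return hostnames whose Node Down has waited out the window."""
        expired = [h for h, t in self.cache if now - t >= self.window]
        self.cache = [(h, t) for h, t in self.cache if now - t < self.window]
        return expired


def simulate(events, end_time):
    """Replay (time, kind, hostname) traps; return (time, hostname) forwards."""
    node = ResetOnMatch()
    pending = sorted(events)
    forwarded = []
    for now in range(0, end_time + 1, CHECK_INTERVAL):
        # Deliver every trap that has arrived by this timer tick.
        while pending and pending[0][0] <= now:
            received_at, kind, hostname = pending.pop(0)
            node.on_event(kind, hostname, received_at)
        # Then let the timer check pass expired Node Downs along.
        for hostname in node.on_timer(now):
            forwarded.append((now, hostname))
    return forwarded


if __name__ == "__main__":
    traps = [
        (0, "Node Down", "router1"),
        (30, "Node Down", "server2"),
        (120, "Interface Down", "router1"),  # unrelated trap in between
        (300, "Node Up", "router1"),         # cancels router1's Node Down
    ]
    print(simulate(traps, 900))
```

In this run only server2 is forwarded, at the first timer check after its ten
minutes are up. router1 never reaches the next node, even though another trap
came between its Node Down and Node Up.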
The other detail that puzzles me is your concern that the trap Origin might be a
better indicator of the actual selection name for purposes of Querying a
Collection. I have never seen a case where NVATTR_2 and NVA were different in
a netmon trap. I think either one should do the job.
James Shanks
Tivoli (NetView for UNIX) L3 Support
Sean Aaron <sean.aaron@UCOP.EDU> on 10/04/99 04:27:15 PM
Please respond to Discussion of IBM NetView and POLYCENTER Manager on NetView
<NV-L@UCSBVM.UCSB.EDU>
To: NV-L@UCSBVM.UCSB.EDU
cc: (bcc: James Shanks/Tivoli Systems)
Subject: Re: using a script in a ruleset.
James Shanks wrote:
>
> I think you might want to consider an alternate design, one which sends a page
> if, after ten minutes no matching Node Up has arrived. You would do that with
> a Reset-On-Match rather than a threshold. The Node Down is Input one, the Node
> Up is Input 2, and the match is on attribute two, the hostname. You could of
> course extend the reset period to wait longer, say 30 minutes, so that people
> only get paged when there is no chance of things working out for themselves.
>
Okay, check my logic here, please.
I'm going to make three rulesets, one for each collection that contains
things we need to notify operators of through email to a helpdesk alias
and page certain oncall personnel through email to their pagers.
I've created a prototype with modifications based on your suggestion
above. First, our polling is as follows: every 5 minutes with a 30-second
timeout and no retries. This is to minimize network traffic and account
for possible network slowdown...basically we want to know that the
device is actually up--we don't care about speed of connectivity because
this is for off-hours purposes.
I've got the pizza as first node, then splitting to two trap nodes: one
for nv6000 Node Down and one for nv6000 Node Up. These both funnel into
a Reset on Match node which has one comparison: Attribute 1 (2, as you
said the hostname in a netview trap) equal to Attribute 2 (2, again, the
hostname) during a 10-minute interval. This should have the effect of
taking the results of two polls and comparing them in the event that a
node down is received: if the next is a node up, then wait for another
node down; otherwise forward the event to the next node.
The next node is a Query Database Collection which will check for a
match on Origin with the relevant collection for this ruleset (network
devices, unix/nt servers, mvs servers). This should forward only those
events coming from machines which are a part of the given collection. I
use Origin as the ID source in case $NVATTR_2, the hostname, doesn't
match the selection name (I'm assuming Origin will contain the selection
name from the database as the identifier for the trap originator). This
will flow into two Actions, one a mailx command sending a simple subject
line to the pager alias stating that $NVATTR_2 is down (unless there's a
nice variable for the Origin, in which case I'll use that), and the
other one a mailx command sending a simple subject line to our helpdesk
alias to let the operators know...just in case they haven't been doing
their scheduled netview monitor check.
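
As a rough sketch of the last two steps above (the collection check and the
two mail actions), here is some Python. It is only an illustration under
stated assumptions: a plain set stands in for the Query Database Collection
node, build_notifications() is a made-up helper, and the addresses are
placeholders. The only real command used is mailx with its -s (subject)
option, and the commands are built but not run.

```python
def build_notifications(hostname, collection, pager_alias, helpdesk_alias):
    """Return the mailx command lines for one forwarded Node Down.

    `collection` is a plain set of selection names standing in for the
    Query Database Collection node (an assumption for this sketch).
    """
    if hostname not in collection:
        # Not part of this ruleset's collection: nothing is sent.
        return []
    subject = f"{hostname} is down"
    # One subject-only message per recipient; the message body would be
    # supplied on standard input (for example from /dev/null) when run.
    return [
        ["mailx", "-s", subject, pager_alias],
        ["mailx", "-s", subject, helpdesk_alias],
    ]


if __name__ == "__main__":
    network_devices = {"router1", "switch7"}  # hypothetical collection members
    for host in ("router1", "server2"):
        commands = build_notifications(
            host, network_devices, "pager@example.org", "helpdesk@example.org"
        )
        print(host, commands)
```

Only hosts that are in the collection produce any commands; everything else
drops out at the collection check, before either Action is reached.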
Am I on the right track here?
--
Sean Aaron
UNIX System Administrator
University of California
Office of the President