Re: Sending Interface Down & Interface up pages and correlation

To: nv-l@lists.tivoli.com
Subject: Re: Sending Interface Down & Interface up pages and correlation
From: Steve Francis <steve.francis@COMMSERV.UCSB.EDU>
Date: Tue, 18 Jan 2000 10:17:54 -0800
Well, the CPU load should not be that bad, as the cache will only be holding
interfaces that have gone down, and have not yet come back up.  (I suppose it
depends on how large the network is, and if many interfaces tend to go down and
for how long.)

Like Ray, I also have my ruleset ping the interface that has reported down, to
see if it's really down.  (I just have an inline action launch a ping for 30
seconds to the interface: if the ping is successful, netmon is clever enough to
notice.)
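
For illustration only, the 30-second ping window could be sketched in Python as
below.  This is not the inline action itself: the function names, the one-second
retry interval, and the Unix-style "ping -c 1" invocation are all assumptions.
In the setup described above the result is not consumed directly, since netmon
notices the successful ping on its own; the sketch returns a boolean only to
make the timing logic visible.

```python
import subprocess
import time


def ping_once(address):
    """Send one ICMP echo request with the system ping command.

    Assumption: a Unix-like ping that accepts "-c <count>"; flags vary by platform.
    """
    result = subprocess.run(
        ["ping", "-c", "1", address],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def responds_within(address, window=30.0, interval=1.0,
                    probe=ping_once, clock=time.monotonic, sleep=time.sleep):
    """Probe `address` repeatedly for up to `window` seconds.

    Returns True as soon as one probe succeeds, False once the window expires.
    `probe`, `clock` and `sleep` are injectable so the loop can be tested
    without touching the network.
    """
    deadline = clock() + window
    while clock() < deadline:
        if probe(address):
            return True
        sleep(interval)
    return False
```

Calling responds_within("10.1.1.1") would keep pinging for up to 30 seconds and
stop early on the first reply.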

I prefer having one ruleset to Joe's approach, but that's mainly because I
didn't want to add extra objects into the DB.  I page if the interface is down
for 30 seconds (15 minutes would be a tad long in my environment).

Nothing gets processed by the ruleset unless it's in the MajorNodes collection
(i.e. things I care about).
One other test I do to minimize paging: after the 30 seconds is up, I check the
node status to see if the node is critical (i.e. all its interfaces are down).
If it is, I exit the Interfacepaging rule without sending a page (but still
start the correlation of down interfaces).  I have a second rule activated on
NodeDown events, which just sends a NodeDown page if the node is still critical
after 30 seconds.
This way you get one page when a node goes down, instead of lots for a
many-interfaced router.  You still get all the interface up pages when the
router comes up (just in case not all the interfaces do come up).  (I guess
Matt's approach of repeating a page if the interface is still down after x
minutes could be useful here, if you don't remember what's in a router.)
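
The decision logic of the two rules described above can be sketched in Python.
This is a model of the described behavior, not the ruleset itself: the function
names, the Decision type, and the message wording are hypothetical, and the
three boolean inputs stand in for the MajorNodes membership test, the state of
the interface after the 30-second hold, and the node status query.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Decision:
    page: Optional[str]   # text to page, or None to stay silent
    correlate: bool       # track this down event so a later "up" can be paged


def interface_down_rule(node, interface, *, in_major_nodes, still_down, node_critical):
    """Decide what to do once the hold time for an interface-down event is up."""
    if not in_major_nodes:
        return Decision(None, False)   # not a device we care about
    if not still_down:
        return Decision(None, False)   # came back within the hold time
    if node_critical:
        # Whole node is down: the NodeDown rule pages instead, but the
        # down interface is still correlated for the later "up" page.
        return Decision(None, True)
    return Decision(f"Node {node}, interface {interface} is down.", True)


def node_down_rule(node, *, in_major_nodes, node_critical):
    """Send a single NodeDown page if the node is still critical after the hold."""
    if in_major_nodes and node_critical:
        return f"Node {node} is down."
    return None
```

With a many-interfaced router that goes down entirely, every call to
interface_down_rule returns no page, and node_down_rule produces the single
page.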

And finally, the script that is fired to page does some processing:
it checks to see if the node is a Cisco router, switch, or 3Com router, or one
of a subset of other things we have on the network that support labels.
If it is, it does the appropriate SNMP query to get the interface descriptive
label (the description line in a Cisco config), and includes that in the page
message.  If the node is down, so SNMP won't work, or the device is one it
doesn't know how to ask for the label, it just says the label is "none."  (A
page saying 'Node router1, Interface 10.1.1.1 is down.  Interface label is T3
connection to ocean physics lab' means a lot more to me than just the IP
address in figuring out how urgent it is.)
It also deals with who to page for what device.
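
As a rough sketch of the label handling (not the actual paging script), the
fallback to "none" could look like this in Python.  The device-type strings, the
function names, and the per-device lookup callables are hypothetical; the real
SNMP query is deliberately left behind the callable, since the right OID depends
on the device.

```python
def interface_label(node, interface, device_type, lookups, node_reachable):
    """Return the interface's descriptive label, or "none" if it cannot be had.

    `lookups` maps a device type to a callable (node, interface) -> label that
    performs the device-specific SNMP query (assumed, not shown here).
    """
    if not node_reachable:
        return "none"            # node is down, so SNMP will not answer
    lookup = lookups.get(device_type)
    if lookup is None:
        return "none"            # device type we do not know how to ask
    try:
        label = lookup(node, interface)
    except Exception:
        label = None             # query failed or timed out
    return label or "none"


def page_message(node, interface, label):
    """Format the page text in the style quoted above."""
    return f"Node {node}, Interface {interface} is down.  Interface label is {label}"
```

A caller would register one lookup per supported device family and pass the
result of interface_label straight into page_message.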

If anyone wants the actual rulesets/scripts, let me know.


James Shanks wrote:

> The longer you make the hold time in the ruleset the more you will increase
> the memory requirements for nvcorrd, since the cache will grow with every
> held event and not flush until the time limit expires.  In addition you will
> gradually increase the processing time (and cpu used) for nvcorrd since he
> will have to check an ever-increasing, and seldom decreasing, cache for
> expired events.  He does that every 15 seconds.  But it will work if your
> system can stand it.
>
> James Shanks
> Tivoli (NetView for UNIX) L3 Support
>
> Leslie Clark <lclark@US.IBM.COM> on 01/18/2000 02:18:46 AM
>
> Please respond to Discussion of IBM NetView and POLYCENTER Manager on NetView
>       <NV-L@UCSBVM.UCSB.EDU>
>
> To:   NV-L@UCSBVM.UCSB.EDU
> cc:    (bcc: James Shanks/Tivoli Systems)
> Subject:  Re: Sending Interface Down & Interface up pages and correlation
>
> I think I understand what Patrick is looking for, since I have just started
> to look at the same question. If a down event comes in, and no up event
> within the specified time, you want to send a page (for instance).  That is
> the part everyone seems to agree on.
> A little later, the up event does come in, and you want to send the
> all-clear page.
> But only if the down page was sent in the first place. It seems like it
> ought to work, but I worry about the long caching.  What do you think about
> that, James?
>
> This is how I understand Steve's suggestion:
>
> Node down is input 1 for reset-on-match (5 min)
> Node up is input 2 for same.
> Outputs of the reset-on-match  go to:
>     1) Send the down page
>     2) and also input 1 for a pass-on-match (long time)
> The same Node up is also input 2 for the pass on match
> Output for the pass-on-match is send the up page. The trap
> info available would be from the down event, not the up event,
> but you would know that and could act accordingly.
>
> Patrick, I vote that you verify this for us. Steve, is this something that
> you are actually running?
>
> By the way, my current customer tells me that there are real dollars to be
> saved by preventing unnecessary pages...
>
> Cordially,
>
> Leslie A. Clark
> IBM Global Services - Systems Mgmt & Networking
> Detroit
>
> ---------------------- Forwarded by Leslie Clark/Southfield/IBM on
> 01/18/2000 01:40 AM ---------------------------
>
> James Shanks <James_Shanks@TIVOLI.COM>@UCSBVM.UCSB.EDU> on 01/17/2000
> 08:40:00 PM
>
> Please respond to Discussion of IBM NetView and POLYCENTER Manager on
>       NetView <NV-L@UCSBVM.UCSB.EDU>
>
> Sent by:  Discussion of IBM NetView and POLYCENTER Manager on NetView
>       <NV-L@UCSBVM.UCSB.EDU>
>
> To:   NV-L@UCSBVM.UCSB.EDU
> cc:
> Subject:  Re: Sending Interface Down & Interface up pages and correlation
>
> Patrick -
>
> I am not certain that I understand what your second case is for, and Steve
> Francis has given you a suggestion which may work, in any case, but I
> thought I would comment on your questions.
>
> You guessed correctly about how the Reset-on-Match and Pass-on-Match
> functions work with incoming events.  I tried to clarify that in my second
> append last week.  Only events of the type connected to Slot 1 are held in
> cache.  The Slot 2 event is used to evaluate the events in the cache as soon
> as it is received.  The Slot 2 events are not cached at all, and once used,
> they are discarded unless you added additional processing for them, which is
> why your ruleset doesn't handle your second case.  If no matches are
> received during the time interval, the cache is flushed, and the appropriate
> action taken for the Slot 1 event -- for Reset, it is passed along to the
> next ruleset node, for Pass, it is dropped.
>
> James Shanks
> Tivoli (NetView for UNIX) L3 Support
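
The slot behavior James describes, and the chained wiring Leslie outlines, can
be modeled with a small Python simulation.  This is a toy model built only from
the descriptions quoted above, not nvcorrd's implementation: the class, the
event dictionaries, the matching on a "source" field, and the 24-hour value
used for the "long time" hold are all assumptions.  The 5-minute hold and the
15-second sweep come from the quoted messages.

```python
class CorrelationNode:
    """Toy model of a Reset-on-Match ("reset") or Pass-on-Match ("pass") node."""

    def __init__(self, mode, hold):
        if mode not in ("reset", "pass"):
            raise ValueError(mode)
        self.mode = mode
        self.hold = hold
        self.cache = []                      # (expiry_time, slot-1 event)

    def slot1(self, event, now):
        """Hold a slot-1 event until it is matched or its hold time expires."""
        self.cache.append((now + self.hold, event))

    def slot2(self, event):
        """Match cached events against a slot-2 event, which is never cached."""
        matched = [ev for _, ev in self.cache if ev["source"] == event["source"]]
        self.cache = [(exp, ev) for exp, ev in self.cache
                      if ev["source"] != event["source"]]
        return matched if self.mode == "pass" else []   # reset drops on match

    def sweep(self, now):
        """Flush expired events; reset passes them along, pass drops them."""
        expired = [ev for exp, ev in self.cache if exp <= now]
        self.cache = [(exp, ev) for exp, ev in self.cache if exp > now]
        return expired if self.mode == "reset" else []


def simulate(events, end, short_hold=300.0, long_hold=86400.0, sweep_every=15.0):
    """Run the chained wiring: reset-on-match feeding a pass-on-match."""
    reset = CorrelationNode("reset", short_hold)
    passer = CorrelationNode("pass", long_hold)
    pages = []
    pending = sorted(events, key=lambda e: e["time"])
    now = 0.0
    while now <= end:
        while pending and pending[0]["time"] <= now:
            ev = pending.pop(0)
            if ev["kind"] == "down":
                reset.slot1(ev, ev["time"])
            else:
                reset.slot2(ev)                      # cancels a pending down page
                for held in passer.slot2(ev):        # `held` is the down event
                    pages.append(("up", held["source"]))
        for ev in reset.sweep(now):                  # hold expired with no up event
            pages.append(("down", ev["source"]))
            passer.slot1(ev, now)
        passer.sweep(now)
        now += sweep_every
    return pages
```

In this model an up event that arrives inside the short hold produces no pages
at all, while one that arrives later produces a down page followed by an up
page.  As Leslie notes, the up page is built from the cached down event.  The
long-hold cache is also where the memory and sweep cost James warns about would
accumulate.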


Archive operated by Skills 1st Ltd