Larry, there are a few nuances to consider when talking about "false down" alerts. Here are a couple of items we talked about in the early days of NetView education, pre-Tivoli NetView Version 3, back in the mid-nineties. The comments are current but the concepts haven't changed.
1. NetView does not really tell you that a device is down, only that NetView can't talk to it. Luckily this is almost the same thing. At my location the network engineers distinguish between real failures and false alarms by checking the system uptime. Router Fault Isolation and Switch Analyzer activity usually clears up false alarms within a minute. On the wide area network our false alarms are usually circuit problems.
2. When NetView indicates that the device is down and it really isn't, there is still a problem. It might be a slow circuit to a remote device or an overloaded switch or router. ICMP and SNMP packets can be thrown away if a device is too busy, so the "false down" alert may be an early warning of an overloaded device or circuit. I've seen a slow response result in NetView timing out one SNMP response in a chain and throwing away the "extra" one at the end of the sequence, so that it appeared we had one interface on the device down. In that case our NetView server was too slow and overloaded to keep up. We could either lengthen the timeout or get a new server. The new server is great.
3. Setting your parameters for NetView is a trade-off between rapid response, system overhead and false alarms. Whoever set the ten minute delay in your TEC rules wanted to reduce false alarms. A similar delay can be set in the NetView rulesets. Our installation would rather see the false alarms at the moment. Your mileage may vary, but it sounds like your consumers would rather have timely alerts. Raising your simultaneous polling value increases overhead but completes the polling cycle faster. Increasing your timeouts and wait time reduces false alarms at the cost of timeliness. With too many timeouts and long wait times, a full polling pass can take longer than the polling interval itself. Finding the right balance is the art in NetView.
4. An "industry standard" for polling nodes doesn't exist. The NetView for Unix default is five minutes. The NetView for Windows default is twenty minutes. At my installation we use the Unix default but have a couple of special devices that we poll every minute; those are the gateways to the internet. We also notify our hands-on personnel immediately on every Router Up/Down/Marginal event, and on Switch Down/Up events for large switches in the infrastructure. These are e-mails to Blackberry devices. Our peripheral devices (switches in hall closets) currently do not have automatic notifications. If a closet switch fails, our operations center passes on the request by e-mail.
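As a rough illustration of the balance described in item 3, the sketch below estimates how long one polling pass might take and compares it with the polling interval. It is back-of-the-envelope arithmetic, not a model of how netmon actually schedules its work: it assumes every retry waits the full timeout and that the work spreads evenly over the concurrent polls. The figures for concurrent polls (32), unresponsive share (2%) and typical reply time (50 ms) are invented for the example; only the 17,000 interfaces, 5 second timeout, 4 tries and 3 minute interval come from this thread.

```python
def polling_pass_seconds(interfaces, concurrent_polls, timeout_s, tries,
                         unresponsive_share, typical_reply_s=0.05):
    """Rough estimate of the time needed to poll every interface once.

    Assumptions (not taken from NetView documentation): an unresponsive
    interface costs `tries` full timeouts, a responsive one costs
    `typical_reply_s`, and the load divides evenly across the pollers.
    """
    unresponsive = interfaces * unresponsive_share
    responsive = interfaces - unresponsive
    busy_seconds = (unresponsive * timeout_s * tries
                    + responsive * typical_reply_s)
    return busy_seconds / concurrent_polls


POLL_INTERVAL_S = 3 * 60  # polling=3m from the thread

# 32 concurrent polls and a 2% unresponsive share are hypothetical inputs.
estimate = polling_pass_seconds(17_000, 32, timeout_s=5, tries=4,
                                unresponsive_share=0.02)
print(f"estimated pass: {estimate:.0f} s, interval: {POLL_INTERVAL_S} s")
if estimate > POLL_INTERVAL_S:
    print("the pass would overrun the polling interval")
```

With these made-up inputs the pass comes to roughly four minutes, longer than the three minute interval. Plugging in your own numbers shows quickly whether a longer timeout or more retries would push the pass past the interval.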
Bill Evans
-----Original Message-----
From: owner-nv-l@lists.us.ibm.com [mailto:owner-nv-l@lists.us.ibm.com] On Behalf Of Larry Fagan
Sent: Tuesday, October 04, 2005 11:14 AM
To: nv-l@lists.us.ibm.com
Subject: RE: [nv-l] Polling Intervals Info!
Colin,
Thanks for the info..
Yep.. I know the rule where the down event is held for 10
minutes.. my issue is, if I reduce this time to 5
minutes, will there be false down alerts? (I still
don't find a reason to hold 10 minutes in TEC)
Well.. looks like your idea of pinging the interface
using a script seems good.. Can I borrow your script?
You know the whole idea is to reduce the server down
alert from 13 minutes to as little as possible..
Many thanks Colin and Paul..
Larry
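
Colin's actual script is not shown in this thread. As a stand-in, here is a minimal Python sketch of the approach he describes in the quoted message below: re-poll the device up to 3 times with a 20 second sleep between polls, and alert only if it never answers. The function names, the use of the system `ping` command, and its Linux-style `-c`/`-W` flags are assumptions for illustration; other platforms use different ping options.

```python
import subprocess
import time


def ping_once(host, timeout_s=5):
    """Send one ICMP echo with the system ping; True if a reply came back.

    Assumes Linux iputils flags: -c (count) and -W (reply timeout, seconds).
    """
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def confirm_down(host, probe=ping_once, polls=3, sleep_s=20, sleep=time.sleep):
    """Return True only if every one of `polls` probes fails."""
    for attempt in range(polls):
        if probe(host):
            return False  # it answered, so treat the down event as a false alarm
        if attempt < polls - 1:
            sleep(sleep_s)  # wait before the next poll
    return True
```

A trap handler could call `confirm_down(hostname)` when an interface down event arrives and raise the alert only when it returns True. In the worst case this adds roughly 55 seconds (three 5 second timeouts plus two 20 second sleeps) before the alert goes out.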
--- Colin Mulkerrins <Colin.Mulkerrins@anpost.ie>
wrote:
> Larry,
>
> Is it a TEC rule that is holding the event? In my shop we run a script
> against the trap on the netview server whenever an interface down event
> comes in - the script polls the device 3 times with a 20 second sleep
> between each poll - if the interface is still down after the 3rd poll we
> then send an alert. When we started out we saw a very large number of
> interface down alerts (essentially whenever a router hiccuped - which
> happens regularly) hence the delays and polls.
>
> The timeouts were agreed with our network team (not sure whether it's
> industry standard or not) and they seem to be happy that the alerts they
> get are proper.
>
> Regarding the 10 minute hold in TEC - have a look at your rulebase,
> there is probably a rule there which is holding the event in cache
> waiting for a clearing (interface_up) event for 10 minutes - it should
> be easy enough to reduce the wait time.
>
> Regs
>
> Colin M.
>
> -----Original Message-----
> From: owner-nv-l@lists.us.ibm.com
> [mailto:owner-nv-l@lists.us.ibm.com]
> On Behalf Of Larry Fagan
> Sent: 04 October 2005 15:20
> To: nv-l@lists.us.ibm.com
> Subject: Re: [nv-l] Polling Intervals Info!
>
>
> Paul,
> Thanks for looking into this.. My situation is this..
> I have inherited this environment.. Netview poll
> settings are as below but in TEC these interface down
> events are held for 10 minutes.. If no Up is received
> from the same hostname, then a down alert goes out to the
> team.. Now the issue is why is this 10 minute hold in
> TEC. Can't I lower this to a lesser value.. Are there any
> issues or false events generated if I lower the 10 minute
> hold since we have about 17,000 interfaces to poll..
> What is the industry standard in polling the nodes?
> My customers are saying "It's really NOT good to know that the server
> is down ONLY after about (10+3) 13 minutes".. Again many thanks as
> usual.. Larry
>
>
> --- Paul Stroud <nvladmin@gmail.com> wrote:
>
> > Larry,
> > This is actually pretty simple;-)
> >
> > With the polling set as you have it:
> >
> > netmon will poll each device once every three minutes, it will
> > wait 5 seconds for a reply and will try a total of 4 times per
> > device if there are any failures.
> >
> > For the question about how netmon polls (sequential vs simultaneous),
> > the answer is yes. Netmon has a fixed (but configurable) number of
> > threads available for polling, for each of ICMP and SNMP. So netmon will
> > poll X number of devices at the same time and then all devices as it gets
> > around to them.
> >
> > To see the list of devices being polled you can run:
> >
> > netmon -a 12 (to see what netmon is polling via ICMP)
> > netmon -a 16 (to see what netmon is polling via SNMP)
> >
> > This information will be stored in the /usr/OV/log/netmon.trace file.
> > That being said, you might find looking through 17,000 IP addresses
> > a bit tedious;-)
> >
> > Paul
> >
> >
> > Larry Fagan wrote:
> >
> > >Gentlemen,
> > >I have gone through tons of info in list on this
> > >topic. But i just need a simple answer for my
> > >configuration. I have set timeout=5 retries=3
> > >polling=3m. Now can someone tell how the polling
> > >works in this case? how does netmon poll for ups and
> > >downs?
> > >My second question is, I have about 17,000 interfaces
> > >to be polled in my DB. Does the polling of all these
> > >interfaces happen sequentially or simultaneously?
> > >How can i check if all my nodes are being polled? Is
> > >there a way to see that all of my nodes are polled?
> > >I'm really exhausted looking into this.
> > >Please could someone help me?
> > >Many Thanks as usual.
> > >Larry
> > >
> > >
> > >
> > >
> > >__________________________________
> > >Yahoo! Mail - PC Magazine Editors' Choice 2005
> > >http://mail.yahoo.com
> > >
> > >
> > >
> >
> >
>
>
>
>