Re: Netview5.1.1/Optivity8.1.1 Avoid blind alarms

To: nv-l@lists.tivoli.com
Subject: Re: Netview5.1.1/Optivity8.1.1 Avoid blind alarms
From: Leslie Clark <lclark@US.IBM.COM>
Date: Sat, 9 Oct 1999 12:14:05 -0400
This is the usual 'false alarm' problem, I think. I am a bit worried about your
-q and -Q settings of 120, and mystified by the 20-second recovery. That
20-second recovery makes me suspect that your polling cycle is actually
20 seconds, and not the 10 minutes you quote. Are you sure about what you have
set? If the icon turns green on its own within 20 seconds and the polling cycle
really is 10 minutes, then it is turning green because netmon is getting a ping
response from the node that it had given up waiting for, probably a late reply
to one of the pings it had already sent.

That -q says netmon can have pings outstanding to 120 nodes at once.
Mark S. in Tivoli has told us recently that they have only tested up to 64, and
he recommends using a value that high only if you have a very high-speed
adapter on the NetView server. This could be part of your problem.

The 20-second timeout means that NetView will send a ping and wait 20
seconds for a response before sending another one, and repeat that six
times (with some automatic increases in timeout along the way) before
generating that node down event. That is a very long time. So my first
suspicion would be that AIX itself is having trouble receiving all of the
ping replies that are coming back to it, maybe because the -q (for pings)
and -Q (for SNMP configuration polling) settings are keeping the adapter
tied up.
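
To put a number on "a very long time", here is a minimal Python sketch of the
arithmetic. The function name and the growth parameter are hypothetical. It
assumes six attempts in total, and since the actual back-off schedule netmon
uses is not given here, the constant-timeout figure is only a lower bound.

```python
def time_to_declare_down(timeout_s, attempts, growth=1.0):
    """Seconds spent waiting if every ping attempt times out.

    growth is a hypothetical per-attempt multiplier standing in for the
    "automatic increases in timeout"; 1.0 means the timeout never grows.
    """
    total = 0.0
    wait = float(timeout_s)
    for _ in range(attempts):
        total += wait
        wait *= growth
    return total


# Lower bound with the settings from this thread: 20 s timeout, 6 attempts.
print(time_to_declare_down(20, 6))  # 120.0
```

Any growth factor above 1.0 only pushes the total higher, so a node that is
really down goes unreported for at least two minutes with these settings.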

Have you looked at what netmon has outstanding to see if it is getting
behind? Use netmon -a 3 (for status polling) and netmon -a 4 (for configuration
polling) and check the output in /usr/OV/log/netmon.trace. It's a rather
mysterious file, but you may get some idea of what is actually going on.
Maybe, with that long timeout/retry, it cannot get through all of the 2000 nodes
and their interfaces in the 10-minute polling cycle.
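
A back-of-envelope way to reason about that, sketched in Python. This is a
crude model and not how netmon schedules its work: it treats -q as a number of
parallel "slots", and the function names, the 0.1-second reply time, and the 5%
unresponsive share are all made-up illustration values.

```python
def slot_seconds_needed(targets, unresponsive_fraction, reply_s, give_up_s):
    """Slot time consumed by one status-poll pass over all targets.

    targets should count nodes plus their polled interfaces.
    Responsive targets occupy a slot for reply_s seconds; unresponsive
    ones occupy it until netmon gives up (give_up_s).
    """
    unresponsive = targets * unresponsive_fraction
    responsive = targets - unresponsive
    return responsive * reply_s + unresponsive * give_up_s


def fits_in_cycle(needed_slot_s, queue_size, cycle_s):
    """True if queue_size parallel slots can absorb the work in one cycle."""
    return needed_slot_s <= queue_size * cycle_s


# Illustration only: 2000 targets, 5% not answering, 120 s to give up.
needed = slot_seconds_needed(2000, 0.05, 0.1, 120)
for q in (120, 64):
    print(q, fits_in_cycle(needed, q, 600))
```

The point of the exercise is that the long timeout/retry dominates the budget:
almost all of the slot time goes to the few targets that do not answer.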

Now I am going beyond what I really understand and into the arena of
voodoo, but you might also take a look at some of the options of the AIX
no (network options) command, and investigate the settings of
tcp_sendspace, tcp_recvspace, and ipqmaxlen.

And if you are running Netview AND Optivity on a C10 for 2000 nodes, I do
congratulate you on a nice tuning job! Which is why I hesitated to offer any
suggestions at all....:)

Cordially,

Leslie A. Clark
IBM Global Services - Systems Mgmt & Networking



Hello Netview/Optivity-experts around the globe,

I'm suffering from blind alarms on our well-tuned IBM C10 with
Netview5.1.1/Optivity8.1.1 on AIX4.3.2.

The SNMP-values are: SNMP-Timeout 20sec    Retry 6  Polling 10 min

Netmon-lrf-file-parameters:
OVs_YES_START:nvsecd,ovtopmd,trapd,ovwdb:-P, -q 120, -Q 120,-S,-s/usr/OV/conf/s:

I'm watching around 2000 objects and around 20 times per day I receive blind
alarms like this:

Fri Oct 08 07:40:07 1999  BAY-HUB-123   node down
Specific: 58916865  Generic: 6  Category: Status Events  Enterprise: netView6000 1.3.6.1.4.1.2.6.3.1
Source: Netmon (N)  Hostname: BAY-HUB-123  Severity: Critical

The corresponding icon turns red and our operator is alerted, but the box
itself has no problem.
In some cases this situation lasts several minutes; in most cases it takes
around 20 seconds until the icon turns green again.

If I want the icon back to green immediately, that is no problem, though it is
not the best solution:
a manual ping from the command line wakes up the box and
the object is green again.

To avoid these blind alarms and manual intervention in future, I would like to
automate this with a little script.

As soon as a "node down" event occurs, the management station should try to
reach the object again by automatically pinging its IP address, maybe 3 times
with 5 seconds between pings.
If this does not wake up the box, a trap should be generated saying "Hello
Operator, this box is really dead !!"
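
The re-check described above could be sketched roughly as follows in Python.
This is only an outline under stated assumptions: the function names are
hypothetical, it relies on the system ping command accepting "-c 1", and it
deliberately stops short of generating the trap or hooking into NetView event
handling, since those commands are not shown in this thread.

```python
import subprocess
import time


def ping_once(host, timeout_s=5):
    """Send one ICMP echo with the system ping command; True on a reply."""
    try:
        result = subprocess.run(
            ["ping", "-c", "1", host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0


def confirm_down(host, attempts=3, interval_s=5, ping=ping_once, sleep=time.sleep):
    """Re-ping host up to `attempts` times, pausing between tries.

    Returns False as soon as one ping is answered (the alarm was blind),
    True if every attempt fails (the box looks really dead).
    """
    for attempt in range(attempts):
        if ping(host):
            return False
        if attempt < attempts - 1:
            sleep(interval_s)
    return True
```

A wrapper triggered by the node down event would call
confirm_down("BAY-HUB-123") and only raise the "really dead" trap when it
returns True. The ping and sleep parameters exist so the logic can be tried
out without touching the network.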

One question about the SNMP values:
The SNMP timeout is set to 20 seconds. Does that mean that NetView waits 20
seconds for the answer to the first ping? Would it be better to use a smaller
value to reduce the time a blind alarm is shown?
As I already mentioned, the red icon is often, but not always, shown for
around 20 seconds.

Any hints, tips and tricks are welcome.

                      ```
                     (o o)
------------------oOO-(_)-OOo------------------
Winfried Gehrig         mailto:Winfried.Gehrig@skf.com
SKF GmbH                FON  ++49(0)9721 56 3077
Schweinfurt     Virtual FAX  ++49(0)9721 5663266
(Germany)
Our bearings turn the planet
http://www.skf.com
-----------------------------------------------

