Winfried, you might want to have a look at the sysUpTime on the Bay
hub(s). I had similar problems with my Bay kit, and found that the
hub's agent was spontaneously resetting every so often, with NO loss of
connectivity to any user. But the agent's IP stack 'went away' during
the reset, and the hubs were failing to respond to a ping in the time
allotted. You might need to get this week's version of code from Bay.
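
If it helps, here is a minimal Python sketch of how such resets could be
spotted from periodic sysUpTime polls. It assumes you already collect
(poll time, sysUpTime) pairs somehow; the function name and the sample
format are made up for illustration. sysUpTime counts hundredths of a
second since the agent last initialised, so a value lower than the
previous one suggests the agent restarted in between (or, after roughly
497 days of uptime, that the 32-bit counter wrapped).

```python
def find_agent_resets(samples):
    """Return (poll_time, estimated_restart_time) for each suspected reset.

    samples: list of (poll_time_seconds, sysuptime_ticks) in poll order.
    sysUpTime is in hundredths of a second, hence the division by 100.
    """
    resets = []
    for (prev_time, prev_ticks), (cur_time, cur_ticks) in zip(samples, samples[1:]):
        if cur_ticks < prev_ticks:
            # Uptime went backwards: the agent most likely re-initialised.
            estimated_restart = cur_time - cur_ticks / 100.0
            resets.append((cur_time, estimated_restart))
    return resets


if __name__ == "__main__":
    # Hypothetical polls taken every 10 minutes.
    polls = [(0, 500000), (600, 560000), (1200, 3000), (1800, 63000)]
    for seen_at, restarted_at in find_agent_resets(polls):
        print("reset noticed at t=%ds, agent restarted near t=%.0fs"
              % (seen_at, restarted_at))
```

Comparing the estimated restart times with the timestamps of the node
down events should show whether the two line up.
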
Leslie Clark wrote:
>
> This is the usual 'false alarm' problem, I think. I am a bit worried about
> your -q and -Q options of 120, and mystified at the 20-second recovery. That
> 20-second recovery would make me suspicious that your polling cycle is
> actually 20 seconds, and not 20 minutes. Are you sure about what you have set?
> If it turns green on its own within 20 seconds and the polling cycle is
> actually 20 minutes, then it is turning green because it is getting a ping
> response from the node that it had given up waiting for - probably to one of
> the pings it sent earlier.
>
> That -q says netmon can have pings outstanding to 120 nodes at once.
> Mark S. in Tivoli has told us recently that they have only tested up to 64,
> and he recommends that you only use that if you have a very high-speed
> adapter on the netview server. This could be part of your problem.
>
> The 20-second timeout means that netview will send a ping and wait 20
> seconds for a response before sending another one, and repeat that six
> times (with some automatic increases in timeout along the way) before
> generating that node down event. That is a very long time. So my first
> suspicion would be that AIX itself is having trouble receiving all of the
> ping replies that are coming back to it, maybe because the -q (for pings)
> and -Q (for SNMP configuration polling) are keeping the adapter tied up.
>
> Have you looked at what netmon has outstanding to see if it is getting
> behind? Use netmon -a 3 (for status polling) and netmon -a 4 (for
> configuration polling) and check the output in /usr/OV/log/netmon.trace.
> It's a rather mysterious file, but you may get some idea of what is actually
> going on. Maybe, with that long timeout/retry, it cannot get through all of
> the 2000 nodes and their interfaces in the 10-minute polling cycle.
>
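
To put rough numbers on that, here is a back-of-the-envelope sketch in
Python. The model is a deliberately crude assumption of mine, not
documented netmon behaviour: every unresponsive interface holds one of
the -q slots for the full timeout on every attempt, responsive ones cost
nothing, and the automatic timeout increases are ignored, so the results
are lower bounds. The function names are invented for the example, and
"retry 6" is read as six attempts, following Leslie's description.

```python
def min_wait_before_node_down(timeout_s, attempts):
    """Lower bound on seconds spent waiting before a node is declared down."""
    return timeout_s * attempts


def min_cycle_seconds(unresponsive, max_outstanding, timeout_s, attempts):
    """Lower bound on seconds needed to work through the unresponsive targets.

    Targets are handled in batches of at most max_outstanding, and each
    batch blocks for the full timeout/retry sequence.
    """
    batches = -(-unresponsive // max_outstanding)  # ceiling division
    return batches * min_wait_before_node_down(timeout_s, attempts)


if __name__ == "__main__":
    print(min_wait_before_node_down(20, 6))    # one silent node
    print(min_cycle_seconds(600, 120, 20, 6))  # 600 silent interfaces, -q 120
```

Under these assumptions a single silent node ties up a slot for at least
120 seconds, and 600 silent interfaces would already use up the whole
600-second polling cycle.
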
> Now I am going beyond what I really understand and into the arena of
> voodoo, but you might also take a look at some of the options of the AIX
> 'no' command, and investigate the settings of tcp_sendspace, tcp_recvspace,
> and ipqmaxlen.
>
> And if you are running Netview AND Optivity on a C10 for 2000 nodes, I do
> congratulate you on a nice tuning job! Which is why I hesitated to offer any
> suggestions at all....:)
>
> Cordially,
>
> Leslie A. Clark
> IBM Global Services - Systems Mgmt & Networking
>
> Hello Netview/Optivity-experts around the globe,
>
> I'm suffering from blind alarms on our well-tuned IBM C10 with
> Netview5.1.1/Optivity8.1.1 on AIX4.3.2.
>
> The SNMP values are: timeout 20 sec, retry 6, polling 10 min
>
> Netmon-lrf-file-parameters:
> OVs_YES_START:nvsecd,ovtopmd,trapd,ovwdb:-P, -q 120, -Q 120,-S,-s/usr/OV/conf/s:
>
> I'm watching around 2000 objects and around 20 times per day I receive blind
> alarms like this:
>
> Fri Oct 08 07:40:07 1999 BAY-HUB-123 node down
> Specific: 58916865 Generic: 6 Category: Status Events Enterprise: netView6000 1.3.6.1.4.1.2.6.3.1
> Source: Netmon (N) Hostname: BAY-HUB-123 Severity: Critical
>
> The corresponding icon turns red and our operator is alarmed, but the box
> outside has no problem.
> In some cases this situation lasts several minutes; in most cases it takes
> around 20 seconds until the icon turns green again.
>
> If I want the icon back to green immediately, there is a way, though it is
> not the best solution:
> a manual ping from the command line will wake up the soapbox outside and
> the object is green again.
>
> To avoid these blind alarms and manual intervention in future, I would like to
> automate this with a little script.
>
> As soon as a "Node-down" event occurs, the management station should try to
> reach the object again by automatically pinging the IP address, maybe 3 times
> with 5 seconds between each ping.
> If this won't wake up the box, a trap should be generated saying "Hello
> Operator, this box is really dead !!"
>
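
The re-check described here could look roughly like the Python sketch
below. It is only an outline under stated assumptions: the function
names are invented, it shells out to the system ping with "-c 1" (flags
differ between platforms, so check yours), and it says nothing about how
NetView would start the script or how the final trap gets sent; it just
returns True when the node still looks dead.

```python
import subprocess
import time


def ping_once(host):
    """Send one echo request with the system ping; True if it exits with 0."""
    result = subprocess.run(
        ["ping", "-c", "1", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def confirm_node_down(host, attempts=3, interval_s=5,
                      ping=ping_once, sleep=time.sleep):
    """Re-ping host up to `attempts` times, pausing interval_s between tries.

    Returns False as soon as one ping is answered (false alarm) and True
    only if every attempt went unanswered.
    """
    for attempt in range(attempts):
        if ping(host):
            return False
        if attempt < attempts - 1:
            sleep(interval_s)  # no pause after the last attempt
    return True


if __name__ == "__main__":
    if confirm_node_down("BAY-HUB-123"):
        # Placeholder: raise the "really dead" notification here.
        print("BAY-HUB-123 did not answer any ping")
```

The ping and sleep functions are passed in as parameters so the retry
logic can be tried out without touching the network.
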
> One question about the SNMP values:
> The SNMP timeout is set to 20 seconds. Does that mean that Netview waits 20
> seconds for the answer to the first ping? Would it be better to use a smaller
> value to reduce the time a blind alarm is shown?
> As I already mentioned, the red icon is often, but not always, shown for
> around 20 seconds.
>
> Any hints, tips and tricks are welcome
>
> (o o)
> ------------------oOO-(_)-OOo------------------
> Winfried Gehrig mailto:Winfried.Gehrig@skf.com
> SKF GmbH FON ++49(0)9721 56 3077
> Schweinfurt Virtual FAX ++49(0)9721 5663266
> (Germany)
> Our bearings turn the planet
> http://www.skf.com
> -----------------------------------------------