Thank you Mark and Gareth. That helps.

During the preparation of this reply I noticed that NVCOLD had terminated somewhere along the way. I have to reconstruct but I think it was about the time of the failure. When I restarted NVCOLD the devices came back on the next polling cycle. We'll have to see whether it has anything to do with the problem.
------------------------------------------------------------------
The net-SNMP code is net-snmp-5.1-80.3. As long as 5.1 is greater than 5.0.9 I think it's current.

Tracing the activity shows "expired_SNMP" entries. No matter what I do, once the device goes down the following queries fail until I do a demand poll. The following was done after successful snmpwalks on system, ip and interfaces.
09:49:02 ***** Starting Quick Test of node EML-doenet *****
09:49:02 Interface 10.0.8.8(Loopback0) (down since 03/17/05 06:59:19)
09:49:02 Interface 132.172.66.54(ATM0.3-aal5) (down since 03/17/05 06:59:19)
09:49:02 Interface 132.172.177.1(Ethernet0) (down since 03/17/05 06:59:19)
09:49:02 Interface 132.172.192.26(ATM0.1-aal5) (down since 03/17/05 06:59:19)
09:49:02 Current Polling parameters
09:49:02 The next SNMP Status Poll is scheduled for 03/17/05 09:47:18.
09:49:02 Get number of interfaces
09:49:10 SNMP request timed out (10.0.8.8)
09:49:10 ***** End of Quick Test for node EML-doenet *****
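A quick way to sanity-check the snmpwalk results themselves is to count how many interfaces the walk reports down. A minimal sketch; the sample output below is illustrative net-snmp style output, and in practice you would pipe in something like `snmpwalk -c <community> <host> ifOperStatus` (host and community are placeholders, not values from this thread):

```shell
#!/bin/sh
# Count interfaces reported down(2) in snmpwalk ifOperStatus output.
count_down() {
    grep -c 'down(2)'
}

# Illustrative sample output; replace with a real walk in practice.
sample='IF-MIB::ifOperStatus.1 = INTEGER: up(1)
IF-MIB::ifOperStatus.2 = INTEGER: down(2)
IF-MIB::ifOperStatus.3 = INTEGER: down(2)'

printf '%s\n' "$sample" | count_down
```

If the count from a walk disagrees with what NetView shows (four interfaces down above), that points at NetView's bookkeeping rather than the device.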
The one difference I notice on the demand poll relative to the quick test is the FORCED requests:

09:40:45 : ./nl_snmper.c[307] : sending SNMP to 10.0.8.19 op = FORCED req = Objid reqid = 761814

versus

09:41:33 : ./nl_snmper.c[307] : sending SNMP to 10.0.8.19 op = STATUS req = SNMPStatus reqid = 762232
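The two request types can be tallied straight from the trace. A sketch using the two lines quoted above as sample input; against the live system you would read /usr/OV/log/netmon.trace instead:

```shell
#!/bin/sh
# Tally netmon trace lines by SNMP op type (FORCED vs STATUS).
# Sample lines are the two quoted from the trace; in practice use
# /usr/OV/log/netmon.trace as input.
trace='09:40:45 : ./nl_snmper.c[307] : sending SNMP to 10.0.8.19 op = FORCED req = Objid reqid = 761814
09:41:33 : ./nl_snmper.c[307] : sending SNMP to 10.0.8.19 op = STATUS req = SNMPStatus reqid = 762232'

printf '%s\n' "$trace" | awk '/op = FORCED/ {f++} /op = STATUS/ {s++} END {printf "FORCED=%d STATUS=%d\n", f, s}'
```

A run of this over a full trace would show whether STATUS polls simply stop being answered while FORCED ones keep succeeding.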
The SNMP requests were made using NetView's module. The problem with letting them run to the end is the size of the ipRouteTable :-) and the long wait.
Bill Evans
Tivoli NetView Support for DOE
-----Original Message-----
From: owner-nv-l@lists.us.ibm.com [mailto:owner-nv-l@lists.us.ibm.com] On Behalf Of Mark Sklenarik
Sent: Wednesday, March 16, 2005 11:32 AM
To: nv-l@lists.us.ibm.com
Subject: Re: [nv-l] Problems with SNMP monitoring
Bill,
This may or may not be related, but we had a problem here with an earlier version of the net-SNMP code. Are you at net-snmp-5.0.9-2-2.3E.6 or later on the NetView machine? (It's in the Release Notes for FP2.)
It sounds like you may have a time-out of SNMP communication to devices from the NetView SLES machine; this could be either NetView or the device causing the time-out.
You may want to turn on netmon tracing ("netmon -M 63") and tail -f /usr/OV/log/netmon.trace to determine if netmon is still polling the devices in question. You may want to try this both before the problem happens and while the problem is occurring, to see if you can determine why the time-outs are occurring.
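The watch-and-filter step above can be sketched as follows. The live commands (shown in comments, to be run on the NetView box) are from the advice above; the sample line fed to the filter is illustrative only, since the exact trace format is an assumption here:

```shell
#!/bin/sh
# Live commands, for reference only (run on the NetView machine):
#   netmon -M 63
#   tail -f /usr/OV/log/netmon.trace | grep --line-buffered expired_SNMP
#
# The filter itself, demonstrated on illustrative lines (trace format
# assumed, not copied from a real trace):
printf '09:49:10 expired_SNMP 10.0.8.8\n09:49:10 some other line\n' | grep -c 'expired_SNMP'
```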
The difference between Quick Test and Demand Poll is that Quick Test only goes after interface status, whereas Demand Poll goes after a large set of data.
Check that the IP and interface tables from the device are responding fully: snmpwalk the device looking for the ip and interface tables only, and make sure the walks complete.
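One way to make "make sure they complete" mechanical is a small wrapper that checks the walk's exit status and scans its output for net-snmp's "Timeout" message. A sketch; the stub invocations at the bottom stand in for the real walks, and community/host are placeholders:

```shell
#!/bin/sh
# Report whether a walk command completed cleanly: exit status 0 and no
# "Timeout" message in the output. Real usage would look like:
#   walk_complete snmpwalk -c <community> <host> ip
#   walk_complete snmpwalk -c <community> <host> interfaces
walk_complete() {
    out=$("$@" 2>&1)
    rc=$?
    if [ "$rc" -eq 0 ] && ! printf '%s\n' "$out" | grep -q 'Timeout'; then
        echo complete
    else
        echo incomplete
    fi
}

# Illustrative stubs so the wrapper can be exercised without a device.
walk_complete echo 'IP-MIB::ipForwarding.0 = INTEGER: forwarding(1)'
walk_complete sh -c 'echo "Timeout: No Response from host"; exit 1'
```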
Also be aware that net-snmp provides an snmpwalk command that is different from NetView's; NetView will use the one in /usr/OV/bin. Which did you use? I have found that by using both, I can sometimes locate a problem that one or the other would not find.
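Using both walkers can be reduced to walking the same subtree with each and diffing the results. A sketch; the real invocations are shown only in comments (check your version's usage for NetView's walker, and host/community are placeholders), with stand-in files so the comparison itself can be demonstrated:

```shell
#!/bin/sh
# Real commands, for reference only (argument forms may vary by version):
#   /usr/OV/bin/snmpwalk <host> system        > /tmp/walk.netview
#   snmpwalk -v1 -c <community> <host> system > /tmp/walk.netsnmp
#
# Illustrative stand-in files:
printf 'sysDescr.0: Cisco IOS\n' > /tmp/walk.netview
printf 'SNMPv2-MIB::sysDescr.0 = STRING: Cisco IOS\n' > /tmp/walk.netsnmp

if diff -q /tmp/walk.netview /tmp/walk.netsnmp >/dev/null; then
    echo 'walkers agree'
else
    echo 'walkers differ'
fi
```

The two walkers format output differently (as the stand-ins show), so in practice one compares the set of OIDs and values reached, not the raw text.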
Also, at the time of the pause in the demand poll, what is the state of the device? High CPU usage?
Hope this helps.
Mark F Sklenarik
IBM SWG Tivoli Solutions, Business Impact Management and Event Correlation
Software Quality Engineer, IBM Corporation
"Evans, Bill" <Bill.Evans@hq.doe.gov>
Sent by: owner-nv-l@lists.us.ibm.com
03/16/2005 10:49 AM
To: nv-l@lists.us.ibm.com
cc:
Subject: [nv-l] Problems with SNMP monitoring
I'm having a problem with the migration of NetView to a new machine. This is a new SUSE SLES 9 installation of NetView 7.1.4 FP 2 on a Dell 1750, with manual transfer of seed, community strings, hosts, location.conf and other configuration data. We are in a "test" mode. It is using net-SNMP. Our old system is a SUN with NV 7.1.3 and current fixpacks; it uses the SUN SNMP. We staged the bring-up of the new machine to verify its capacity and clean up the messy existing configuration. Our first pass was to bring across the routers, then the switches, then the servers we monitor, and finally any local extensions. We're there with the full NetView device load.
The area which is giving us problems is the SNMP management of routers. This includes 15 core network routers, 15 MAN routers and 37 Wide Area Network routers. Core routers are Cisco 6000 and 7000 models. WAN routers are Cisco 3800 series. MAN routers are all over the place, from Cisco 2500 through 7500 models.
The OLD machine is giving us fits with what appears to be dropped SNMP responses. The particular ones giving trouble are the WAN devices, although the loss of responses also hits the core routers on occasion. It would appear that the SUN SNMP subsystem is swallowing some responses (randomly, but tending toward the last ones received for the devices affected). This began after we added a hundred or so HSRP interfaces to our core configuration. These false alarms upset our management team and we're trying to address it by moving to a new box.
The new box works well (most of the time) for these devices. When it is working it gives a reliable view of the state of the WAN routers. The "lost responses" are not a problem on the new machine. Occasionally (about every 32 hours for the past couple of days) a portion of the WAN, if not all of it, goes critical with SNMP polling timeouts. When it happens, all the affected routers fail at the same time. Until reset manually they will not recover. One or more core routers may also be hit.
· PING will work to the devices on either loopback or active port address, but the device state will return to Critical because the next SNMP poll will fail.
· SNMP polling is in use because the router configuration has a delay defined on one port (backup circuit) which prevents successful ICMP polling.
· QuickTest and QuickTest Critical will NOT work after the initial failure. The result is an SNMP timeout.
· Demand Poll will work. This resets whatever is ailing and all works well for another day.
· During the Demand Poll there is often a significant pause (up to one minute) after we see the "Get CDP Cache entry" line, and sometimes another when we see the "Get MPLS MIB" line.
· The other machine is having no problems with its SNMP polling except for the continuing false alarms.
As you can guess, this 32 hour cycle slows debugging. A couple of days ago I did an SNMP walk on the devices but I'm not sure whether it worked or not. Next time I get a failure I plan to dig into that issue.

Meanwhile I haven't been able to find anything in the archives or in the knowledge base which appears to be similar.
I don't feel I have enough to go on to open an incident yet, and hope the "communal wisdom" may point me in the right direction. My current hypothesis:

· The problem has to be in the NetView installation on the new machine.
Suggestions and comments are solicited.
Bill Evans