
RE: [nv-l] Problems with SNMP monitoring

To: "'nv-l@lists.us.ibm.com'" <nv-l@lists.us.ibm.com>
Subject: RE: [nv-l] Problems with SNMP monitoring
From: "Evans, Bill" <Bill.Evans@hq.doe.gov>
Date: Thu, 17 Mar 2005 13:26:08 -0500
Delivery-date: Thu, 17 Mar 2005 18:27:36 +0000
Envelope-to: nv-l-archive@lists.skills-1st.co.uk
Reply-to: nv-l@lists.us.ibm.com
Sender: owner-nv-l@lists.us.ibm.com

Thank you Mark and Gareth.  That helps. 

 

During the preparation of this reply I noticed that NVCOLD had terminated somewhere along the way.  I still have to reconstruct the timeline, but I think it died at about the time of the failure.  When I restarted NVCOLD the devices came back on the next polling cycle. 

 

We’ll have to see whether it has anything to do with the problem.   
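
In case it matters later, the restart itself was nothing exotic; from memory it was just the standard daemon commands:

   ovstatus nvcold     # confirm whether the daemon is (still) running
   ovstart nvcold      # restart it; the devices came back on the next polling cycle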

------------------------------------------------------------------

 

The net-SNMP code is net-snmp-5.1-80.3.  Since 5.1 is later than 5.0.x, I think it's current.
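
For reference, that is simply the package version as the RPM database on the new SLES box reports it:

   rpm -q net-snmp     # reports net-snmp-5.1-80.3 here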

 

Tracing the activity shows "expired_SNMP" entries.  No matter what I do, once the device goes down, subsequent queries fail until I do a demand poll.  The quick test below was run after successful snmpwalks on system, ip and interfaces. 

 

09:49:02 ***** Starting Quick Test of node EML-doenet *****

09:49:02 Interface 10.0.8.8(Loopback0) (down since 03/17/05 06:59:19)

09:49:02 Interface 132.172.66.54(ATM0.3-aal5) (down since 03/17/05 06:59:19)

09:49:02 Interface 132.172.177.1(Ethernet0) (down since 03/17/05 06:59:19)

09:49:02 Interface 132.172.192.26(ATM0.1-aal5) (down since 03/17/05 06:59:19)

09:49:02 Current Polling parameters

09:49:02 The next SNMP Status Poll is scheduled for 03/17/05 09:47:18.

09:49:02   Get number of interfaces

09:49:10 SNMP request timed out (10.0.8.8)

09:49:10 ***** End of Quick Test for node EML-doenet *****

 

The one difference I notice on the demand poll relative to the quicktest is the FORCED requests:

 

09:40:45 : ./nl_snmper.c[307] : sending SNMP to 10.0.8.19 op = FORCED req = Objid reqid = 761814

versus

09:41:33 : ./nl_snmper.c[307] : sending SNMP to 10.0.8.19 op = STATUS req = SNMPStatus reqid = 762232

The SNMP requests were made using NetView's snmpwalk.  The problem with letting them run to the end is the size of the ipRouteTable :-) and the long wait.   
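
Next time, rather than letting a full walk of the ip group grind through the route table, I'll probably limit the walks to the address and interface tables, along these lines (net-snmp syntax; "public" is standing in for our community string):

   snmpwalk -v 1 -c public EML-doenet ipAddrTable   # .1.3.6.1.2.1.4.20, skips the huge ipRouteTable
   snmpwalk -v 1 -c public EML-doenet ifTable       # .1.3.6.1.2.1.2.2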

Bill Evans
Tivoli NetView Support for DOE
 

-----Original Message-----
From: owner-nv-l@lists.us.ibm.com [mailto:owner-nv-l@lists.us.ibm.com] On Behalf Of Mark Sklenarik
Sent: Wednesday, March 16, 2005 11:32 AM
To: nv-l@lists.us.ibm.com
Subject: Re: [nv-l] Problems with SNMP monitoring

 


Bill,
This may or may not be related, but we had a problem here with an earlier version of the net-SNMP code.  Are you at net-snmp-5.0.9-2-2.3E.6 or later on the NetView machine?  (This is noted in the Release Notes for FP2.)

It sounds like you may have a timeout of SNMP communication from the NetView SLES machine to the devices; either NetView or the device could be causing the timeout.

You may want to turn on netmon tracing with "netmon -M 63" and run tail -f /usr/OV/log/netmon.trace to determine whether netmon is still polling the devices in question.  Try this both before the problem happens and while it is occurring, to see if you can determine why the timeouts occur.
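
In other words, something along these lines, with "routername" standing in for one of the affected routers:

   netmon -M 63                                            # turn netmon tracing on
   tail -f /usr/OV/log/netmon.trace | grep -i routername   # watch for polls of that device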

The difference between Quick Test and Demand Poll is that Quick Test only goes after interface status, whereas Demand Poll goes after a much larger set of data.
Check that the IP and interface tables from the device are responding fully: snmpwalk the device for the ip and interfaces tables only, and make sure the walks complete.
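
For example (net-snmp syntax shown; the community string is a placeholder), both of these should run all the way to the end of the table on a healthy device, while a failing one stops partway with a timeout error:

   snmpwalk -v 1 -c public routername ip
   snmpwalk -v 1 -c public routername interfaces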

Also be aware that net-snmp provides an snmpwalk command that is different from NetView's; NetView will use the one in /usr/OV/bin.  Which did you use?  I have found that by using both, I can sometimes locate problems that one or the other would not find.
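
A quick way to see which copy you are actually picking up (paths assume a default install):

   which snmpwalk               # whichever comes first in your PATH
   ls -l /usr/OV/bin/snmpwalk   # NetView's own copy
   /usr/bin/snmpwalk -V         # the net-snmp one prints its package version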

Also, at the time of the pause in the demand poll, what is the state of the device?  High CPU usage?

Hope this helps.

Mark F Sklenarik
IBM SWG Tivoli Solutions, Business Impact Management and Event Correlation
Software Quality Engineer, IBM Corporation

"Evans, Bill" <Bill.Evans@hq.doe.gov>
Sent by: owner-nv-l@lists.us.ibm.com
03/16/2005 10:49 AM
Please respond to: nv-l@lists.us.ibm.com
To: nv-l@lists.us.ibm.com
Subject: [nv-l] Problems with SNMP monitoring

I'm having a problem with the migration of NetView to a new machine.  

This is a new SUSE SLES 9 installation of NetView 7.1.4 FP 2 on a Dell 1750, with manual transfer of the seed file, community strings, hosts, location.conf and other configuration data.  We are in a "test" mode.  It is using net-SNMP.  Our old system is a SUN with NV 7.1.3 and current fixpacks; it uses the SUN SNMP.  We staged the bring-up of the new machine to verify its capacity and to clean up the messy existing configuration.  Our first pass was to bring across the routers, then the switches, then the servers we monitor, and finally any local extensions.  We're now there with the full NetView device load.  

The area which is giving us problems is the SNMP management of Routers.  This includes 15 core network routers, 15 MAN routers and 37 Wide Area Network routers.  Core Routers are Cisco 6000 and 7000 models. WAN routers are  Cisco 3800 series.  MAN routers are all over the place from Cisco 2500 through 7500 models.      

The OLD machine is giving us fits with what appears to be dropped SNMP responses.  The particular ones giving trouble are the WAN devices, although the loss of responses also hits the core routers on occasion.  It would appear that the SUN SNMP subsystem is swallowing some responses (randomly, but tending toward the last ones received for the devices affected).  This began after we added a hundred or so HSRP interfaces to our core configuration.  These false alarms upset our management team, and we're trying to address them by moving to a new box.

The new box works well (most of the time) for these devices.  When it is working it gives a reliable view of the state of the WAN routers.  The "lost responses" are not a problem on the new machine.  Occasionally (about every 32 hours for the past couple of days) a portion of the WAN, if not all of it, goes critical with SNMP polling timeouts.  When it happens, all the affected routers fail at the same time.  Until reset manually they will not recover.  One or more core routers may also be hit.

·       PING will work to the devices on either loopback or active port address but the device state will return to Critical because the next SNMP poll will fail.

·       SNMP polling is in use because the router configuration has a delay defined on one port (backup circuit) which prevents successful ICMP polling.  

·       QuickTest and QuickTest Critical will NOT work after the initial failure.  The result is an SNMP timeout.  
·       Demand Poll will work.  This resets whatever is ailing and all works well for another day.  

·       During the Demand Poll there is often a significant pause (up to one minute) after we see the "Get CDP Cache entry" line and sometimes another when we see the "Get MPLS MIB" line.  

·       The other machine is having no problems with its SNMP polling except for the continuing false alarms.

As you can guess, this 32-hour cycle slows debugging.  A couple of days ago I did an SNMP walk on the devices, but I'm not sure whether it worked.  Next time I get a failure I plan to dig into that issue.  Meanwhile I haven't been able to find anything in the archives or in the knowledge base that appears to be similar.

 

I don't feel I have enough to go on to open an incident yet and hope the "communal wisdom" may point me in the right direction.   My current hypothesis:

·       The problem has to be in NetView on the new machine.  

Suggestions and comments are solicited.  

Bill Evans
