I think I may know what's going on here. When netmon first comes up, it must
verify SNMP configurations and create a database in memory.
During this initialization phase, it reads the topology database and queries
the DNS to resolve names. If the DNS
is slow, or has problems resolving names, this initialization process is very
slow. Netmon will not ping, discover, answer demand polls,
or handle any other requests until initialization is complete.
In ALL the cases of this problem that I have seen, the problem turned out to be
related to name resolution.
Check out the NetView Version 5 Diagnosis Guide, Chapter 5, "Diagnosing
Performance Problems", "Resolving Domain Nameserver Problems".
It describes how to set res_timeout and res_retry variables to control DNS
timeouts. This has helped in some cases. If not, I would suggest
a careful performance analysis of your name resolution process by a competent
DNS person.
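As a first rough check before calling in the DNS person, something like the
following sketch can time lookups for a sample of node names from the NetView
box. It is only an illustration: the function names time_lookups and
slow_lookups and the 1-second threshold are my own assumptions, not anything
shipped with NetView, and it uses the system resolver through Python's
standard socket module rather than anything netmon does internally.

```python
import socket
import time


def time_lookups(names, resolver=socket.gethostbyname):
    """Resolve each name once and return (name, seconds, answer) tuples.

    'resolver' defaults to the system resolver; it is a parameter only so
    the function can be exercised without a live DNS.
    """
    results = []
    for name in names:
        start = time.perf_counter()
        try:
            answer = resolver(name)
        except OSError as exc:  # socket.gaierror and socket.herror derive from OSError
            answer = "FAILED: %s" % exc
        elapsed = time.perf_counter() - start
        results.append((name, elapsed, answer))
    return results


def slow_lookups(results, threshold=1.0):
    """Return only the lookups that took at least 'threshold' seconds.

    The 1.0 second default is an arbitrary illustrative cut-off.
    """
    return [entry for entry in results if entry[1] >= threshold]
```

Feed time_lookups() a list of hostnames taken from your own topology and
print the entries returned by slow_lookups(). Names that fail or take whole
seconds to resolve are the ones worth showing to whoever runs the nameserver.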
To verify this problem:
Configure netmon to begin full tracing (set the trace mask to -1 in the netmon
configuration) when it starts up. You will see this
initialization occurring. Do a tail -f on the netmon.trace. If initialization is
proceeding normally, the trace entries will scroll by too fast to read.
If it is running slowly, you will be able to read the trace as it goes by. This
process in the netmon trace looks something like this:
It starts out...
08:19:56 : nl_main.c[963] : STARTING NETMON TRACE ---- TRACEMASK=0xffffffff
08:19:59 : nl_fixup.c[674] : Fixup network 224.0.0.0
08:19:59 : nl_fixup.c[674] : Fixup network 69.200.128.0
08:19:59 : nl_fixup.c[674] : Fixup network 69.200.0.0
08:19:59 : nl_fixup.c[674] : Fixup network 69.200.8.0
08:19:59 : nl_fixup.c[674] : Fixup network 192.168.1.0
08:19:59 : nl_fixup.c[674] : Fixup network 69.200.9.0
08:19:59 : nl_fixup.c[674] : Fixup network 69.200.10.0
08:19:59 : nl_fixup.c[674] : Fixup network 69.200.13.0
08:19:59 : nl_fixup.c[674] : Fixup network 69.200.32.0
08:19:59 : nl_fixup.c[674] : Fixup network 69.200.200.0
08:19:59 : nl_fixup.c[674] : Fixup network 69.203.128.0
08:19:59 : nl_fixup.c[674] : Fixup network 69.1.1.0
Then you should see entries like....
08:19:59 : nl_fixup.c[279] : fixupNmNodeSnmpConf() for nvnt12.rtp.lab.tivoli.com
08:19:59 : nl_fixup.c[362] : ... interval = 1200, timeout = 2, retries = 3
... community = 'public', setcommunity = 'public',proxyName = '<none>'
... proxyAddr = 0.0.0.0, checked proxy address = FALSE
... Node Down = 604800, Discovery Poll? = YES, Auto Adjust? = YES, Fixed
Interval = 900, Daily Config = 1200, Route Entries = 500
08:19:59 : nl_fixup.c[162] : DB IF load: 69.200.3.200
08:19:59 : nl_fixup.c[279] : fixupNmNodeSnmpConf() for nvnt12.rtp.lab.tivoli.com
08:19:59 : nl_fixup.c[362] : ... interval = 1200, timeout = 2, retries = 3
... community = 'public', setcommunity = 'public',proxyName = '<none>'
... proxyAddr = 0.0.0.0, checked proxy address = FALSE
... Node Down = 604800, Discovery Poll? = YES, Auto Adjust? = YES, Fixed
Interval = 900, Daily Config = 1200, Route Entries = 500
08:19:59 : nl_fixup.c[63] : fixupIfaceSnmpConf() for 69.200.3.200
08:19:59 : nl_fixup.c[78] : ... interval = 1200, timeout = 2, retries = 3
08:19:59 : nl_fixup.c[420] : DB Node load: nvnt12.rtp.lab.tivoli.com
08:19:59 : nl_fixup.c[279] : fixupNmNodeSnmpConf() for nvnt12.rtp.lab.tivoli.com
08:19:59 : nl_fixup.c[362] : ... interval = 1200, timeout = 2, retries = 3
... community = 'public', setcommunity = 'public',proxyName = '<none>'
... proxyAddr = 0.0.0.0, checked proxy address = FALSE
... Node Down = 604800, Discovery Poll? = YES, Auto Adjust? = YES, Fixed
Interval = 900, Daily Config = 1200, Route Entries = 500
08:19:59 : nl_fixup.c[653] : Adding node nvnt12.rtp.lab.tivoli.com address
69.200.3.200 time 36
08:19:59 : nl_fixup.c[279] : fixupNmNodeSnmpConf() for
nvaix02.rtp.lab.tivoli.com
08:19:59 : nl_fixup.c[362] : ... interval = 1200, timeout = 2, retries = 3
... community = 'jeff', setcommunity = 'jeff',proxyName = '<none>'
... proxyAddr = 0.0.0.0, checked proxy address = FALSE
... Node Down = 604800, Discovery Poll? = YES, Auto Adjust? = YES, Fixed
Interval = 900, Daily Config = 1200, Route Entries = 500
08:19:59 : nl_fixup.c[162] : DB IF load: 69.200.5.26
08:19:59 : nl_fixup.c[279] : fixupNmNodeSnmpConf() for
nvaix02.rtp.lab.tivoli.com
08:19:59 : nl_fixup.c[362] : ... interval = 1200, timeout = 2, retries = 3
... community = 'jeff', setcommunity = 'jeff',proxyName = '<none>'
... proxyAddr = 0.0.0.0, checked proxy address = FALSE
... Node Down = 604800, Discovery Poll? = YES, Auto Adjust? = YES, Fixed
Interval = 900, Daily Config = 1200, Route Entries = 500
08:19:59 : nl_fixup.c[63] : fixupIfaceSnmpConf() for 69.200.5.26
08:19:59 : nl_fixup.c[78] : ... interval = 1200, timeout = 2, retries = 3
08:19:59 : nl_fixup.c[420] : DB Node load: nvaix02.rtp.lab.tivoli.com
08:19:59 : nl_fixup.c[279] : fixupNmNodeSnmpConf() for
nvaix02.rtp.lab.tivoli.com
08:19:59 : nl_fixup.c[362] : ... interval = 1200, timeout = 2, retries = 3
... community = 'jeff', setcommunity = 'jeff',proxyName = '<none>'
... proxyAddr = 0.0.0.0, checked proxy address = FALSE
... Node Down = 604800, Discovery Poll? = YES, Auto Adjust? = YES, Fixed
Interval = 900,
And when it is finished it will dump the seedfile entries and start pinging ...
*** dump of interface 69.200.3.132 [0x20395de8] ***
objid = 1163
node = dec40a2.rtp.lab.tivoli.com
if_mask = 255.255.248.0
if_status = 2
if_type = 6
ifNumber = 1
physaddrlen = 6
physaddr = 0x0000F82363D5
snmp_state = 0x2
if_bad_polls = 0
if_lla_from = 0x0
if_mask_from = 0x1
if_timeout = 2
if_lastpoll = Wed Sep 1 08:04:39 1999
onlist = 0x2027b3b8
listelem = 0x204c7b68
[0] lasttraffic, lastval = 0, 0
[1] lasttraffic, lastval = 0, 0
[2] lasttraffic, lastval = 0, 0
[3] lasttraffic, lastval = 0, 0
[4] lasttraffic, lastval = 0, 0
--- No PingState---
*** dump of interface 69.200.32.68 [0x20405008] ***
objid = 3642
node = NLS68.rtp.lab.tivoli.com
if_mask = 255.255.255.0
if_status = 2
if_type = 1
ifNumber = 0
physaddrlen = 6
physaddr = 0x005004266B82
snmp_state = 0x2
if_bad_polls = 0
if_lla_from = 0x45c80001
if_mask_from = 0x2
if_timeout = 2
if_lastpoll = Wed Sep 1 08:05:54 1999
onlist = 0x2027b3b8
listelem = 0x20505a98
[0] lasttraffic, lastval = 0, 0
[1] lasttraffic, lastval = 0, 0
[2] lasttraffic, lastval = 0, 0
[3] lasttraffic, lastval = 0, 0
[4] lasttraffic, lastval = 0, 0
--- No PingState---
08:20:08 : nl_main.c[1105] : main() collectd_sock_fd = 20
08:20:08 : nl_main.c[485] : icmpbit = 0x40, snmpbit = 0x80, eventbit = 0x100
08:20:08 : nl_pinger.c[216] : sending ping to 69.201.4.240 seqnum = 34564 ident
= 34564 timeout = 2
08:20:08 : nl_pinger.c[216] : sending ping to 69.201.5.10 seqnum = 34565 ident =
34564 timeout = 2
08:20:08 : nl_pinger.c[216] : sending ping to 69.201.5.9 seqnum = 34566 ident =
34564 timeout = 2
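If you would rather measure than eyeball the tail -f output, a small script
along these lines can report how long initialization took, from the
"STARTING NETMON TRACE" entry to the first "sending ping" entry. This is a
sketch only: it assumes the trace lines look like the samples above
(HH:MM:SS : file.c[line] : message), and init_duration is a name I made up
for illustration.

```python
import re
from datetime import datetime, timedelta

# Matches entries such as:
#   08:19:56 : nl_main.c[963] : STARTING NETMON TRACE ---- TRACEMASK=0xffffffff
LINE_RE = re.compile(r"^(\d{2}:\d{2}:\d{2}) : (\S+)\[\d+\] : (.*)$")


def init_duration(lines):
    """Return the time between trace start and the first ping, or None.

    Continuation lines without a timestamp (the '... community = ...'
    entries and the interface dumps) are skipped.
    """
    start = None
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        stamp = datetime.strptime(match.group(1), "%H:%M:%S")
        message = match.group(3)
        if "STARTING NETMON TRACE" in message:
            start = stamp  # a later restart resets the clock
        elif start is not None and message.startswith("sending ping"):
            delta = stamp - start
            if delta < timedelta(0):
                # The trace has no date, so assume a single midnight rollover.
                delta += timedelta(days=1)
            return delta
    return None
```

Run against the sample above (08:19:56 start, 08:20:08 first ping) this gives
12 seconds. A result measured in many minutes is the slow initialization
described earlier.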
I hope this helps.
Regards,
Walt Ostack Tivoli NetView Product Integrity
Schiffinger Ralph 2100 <Ralph.Schiffinger@ERSTEBANK.AT> on 08/26/99 12:18:41 PM
Please respond to Discussion of IBM NetView and POLYCENTER Manager on NetView
<NV-L@UCSBVM.UCSB.EDU>
To: NV-L@UCSBVM.UCSB.EDU
cc: (bcc: Walt Ostack/Tivoli Systems)
Subject: AW: ping response
Hi, group.
Running AIX 4.2.1.0, NV5.1.1, Framework 3.6.
Box is F50 / 1GB RAM / ca. 18500 objects in database.
Continuously monitoring CPU load and network traffic on my box.
Every time I stop and restart netmon (or change my SNMP config),
I get the same picture:
CPU load goes up and stays up (near 100%).
Network traffic goes down and stays down (ca. 1-5 packets/sec).
After approx. 20 minutes netmon suddenly wakes up, and
network traffic peaks (from an average of 60 packets/sec
up to 500 packets/sec and more).
Afterwards things settle down...
Makes a rather interesting graph ;-)
I think (and watch). Waiting for clues...
Regards, Ralph.
iT-AUSTRIA / OE2100 / TK
Tel (mobil) 0664 1908469
Tel (iT-A) 21717 58948
FAX (iT-A) 21717 58979
Mailto:Ralph.Schiffinger@erstebank.at
> -----Original Message-----
> From: Leslie Clark [SMTP:lclark@US.IBM.COM]
> Sent: Thursday, 26 August 1999 16:10
> To: NV-L@UCSBVM.ucsb.edu
> Subject: Re: ping response
>
> Well, yes, I hear it too. Somebody out there must know the answer. This
> customer has lots and lots of 3Com Superstacks, and if I take netmon down
> for a while and bring it back up, about one third of those devices are
> unpingable for a half hour to an hour. Little by little they come back.
> And it is not false alarms; no amount of pinging fixes it until some time
> has elapsed.
> Maybe the same sort of thing you guys are seeing? It feels like the
> devices don't know the way back to the server for a while, then they do.
> Could it be their configuration, routing tables, arp cache? Far different
> subnets in most cases, I think. Waiting for clues...
>
> Cordially,
>
> Leslie A. Clark
> IBM Global Services - Systems Mgmt & Networking
>
>
>
> In my case the device was on a different subnet. Now the puzzlement. Pings
> originating from other devices on the same subnet as my NetView server
> worked. The route tables of these other devices are the same as my NetView
> server therefore the same routers were coming into play. Pings from my
> NetView server to all other devices in this other subnet worked. It would
> also eliminate the arp cache issue. Anyone besides me hear the theme song
> of "The Twilight Zone"?
>
> Blaine Owens
> Eastman Chemical Company
> Phone - (423)-229-3579
> Fax - (423)-229-1188
> bowens@eastman.com
>
> > -----Original Message-----
> > From: Chris J. Garlick [SMTP:chris.garlick@eu.effem.com]
> > Sent: Wednesday, August 25, 1999 11:44 AM
> > To: NV-L@UCSBVM.ucsb.edu
> > Subject: Re: ping response
> >
> > .... is it possible that the web server and Netview box are on different
> > IP segments..? I have observed cases of messed up routers passing some
> > sub-protocols (SNMP, HTTP etc) but not others (e.g. ICMP (ping)).....?
> > The router may have been bounced/reset later, accounting for the
> > mysterious 'fix' ...?
> >
> > Kind Regards
> >
> >
> >
> >
> > Please respond to Discussion of IBM NetView and POLYCENTER Manager on
> > NetView <NV-L@UCSBVM.ucsb.edu>
> >
> > To: NV-L@UCSBVM.ucsb.edu
> > cc:
> > Subject: Re: ping response
> >
> > Just curious - AIX 4.3.2? We had a similar problem a while back. The
> > problem just "went away" - I am still puzzled over it. We could not ping
> > a certain web server from NetView but could ping the server from other
> > hosts. Now to add to my confusion - all the while that pings were failing
> > I could do SNMP gets and HTTP requests just fine - only pings were
> > failing. I might add that two different ping programs were tried (the AIX
> > supplied one and the public utility called "fping") - both failed. It was
> > not an arp cache problem. As mysteriously as the problem appeared it just
> > went away.
> >
> > Blaine Owens
> > Eastman Chemical Company
> > Phone - (423)-229-3579
> > Fax - (423)-229-1188
> > bowens@eastman.com
> >
> > > -----Original Message-----
> > > From: John Creasey [SMTP:creasey@OZEMAIL.COM.AU]
> > > Sent: Wednesday, August 25, 1999 9:01 AM
> > > To: NV-L@UCSBVM.ucsb.edu
> > > Subject: Re: ping response
> > >
> > > Recently our network operators reported a very similar thing
> > > happening. They were certain an interface was up but NetView
> > > couldn't ping it. I wasn't present so I was very much baffled
> > > by their claims.
> > >
> > > Have you tried to reproduce the problem? If it is reproducible
> > > it could be a bona fide bug.
> > >
> > >
> > > > -----Original Message-----
> > > > From: Discussion of IBM NetView and POLYCENTER Manager on NetView
> > > > [mailto:NV-L@UCSBVM.UCSB.EDU]On Behalf Of Frantsen Christian
> > > > Sent: Wednesday, 25 August 1999 19:26
> > > > To: NV-L@UCSBVM.UCSB.EDU
> > > > Subject: ping response
> > > >
> > > >
> > > > I noticed something strange the other day when trying some things
> > > > in our lab.
> > > >
> > > > I unplugged a switch by removing the TP cord and I got a node down
> > > > in NetView (obviously =)).
> > > > Then I put the cord back in and I couldn't ping the switch from the
> > > > NetView map; there was no response.
> > > > I got a response from it if I pinged it from a machine right next to
> > > > me. I tried again from NetView: no response.
> > > > Then I pinged it from the command line on the NetView machine. Now I
> > > > got a response, and after that it worked from the map again.
> > > >
> > > > Anyone got some feedback on this?
> > > >
> > > > -----------------------------------------
> > > > Christian Frantsen
> > > > Technical Operations
> > > >
> > > > Internoc Scandinavia AB
> > > > Tel: +46-36-194843
> > > > Fax: +46-36-194651
> > > > http://www.internoc.se
> > > >