Hi all,
from my work with HP OpenView I can offer a hint:
With OpenView you can start an xnmgraph showing the scheduled polling time and
the actual polling time. When the actual polling time falls below the scheduled
polling time (or goes negative in the worst case), either netmon is too busy to
poll all devices within the scheduled polling interval, or the netmon
configuration is inappropriate.
One possible misconfiguration, for example, is defining polling every minute
with retries=5 and a retry timeout of 10s. If a device is not responding to a
ping, the first polling cycle is still inside its retry time when the next
polling cycle begins. If you have a lot of devices that are not responding,
netmon accumulates more and more polling cycles running in parallel and has no
chance to catch up, because the configured timeout parameters never give it
that chance.
Maybe you started with NV V5.x and everything was okay, and now, with the same
netmon configuration, you have problems. That's because in NV V5.x the timeout
algorithm was changed from what it was before, and now, with 6.x, it is again
like in V4.x. In V5.x the configured timeout was used for the first poll, but
for every successful poll NetView reduced the timeout until it reached 1
second. The wait time for 5 retries was then 5 seconds (correct me if I am
wrong), and I am pretty sure that no one has configured a polling interval of
5 seconds or less, so netmon never ran into the problem of "overrunning"
itself. Now that you have switched to V6.x, the configured timeout is used,
and for every retry this value is doubled, as written in the documentation.
So if you have configured, for example, polling=60s, 5 retries and timeout=10s,
you get the following behaviour:
Poll#   Timeout   Elapsed time since start of first polling cycle
  1       10s        0s   first scheduled polling cycle starts, device is
                          not responding
  2       20s       10s
  3       40s       30s
                    60s   a new scheduled polling cycle starts
  4       80s       70s
                   120s   a new scheduled polling cycle starts
  5      160s      150s
                   180s   a new scheduled polling cycle starts
                   240s   a new scheduled polling cycle starts
                   300s   a new scheduled polling cycle starts
                   310s   netmon reports "unreachable"
As you can see, you now have 6 (!) polling cycles running at the same time. It
is easy to see that this will cause problems if you have a lot of failing
devices.
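To replay that arithmetic, here is a minimal Python sketch (mine, not an
official tool) that reproduces the table above, assuming the timeout doubles
on every retry as described and counting 5 attempts per cycle as in the table:

poll_interval = 60   # configured polling interval in seconds
first_timeout = 10   # configured timeout in seconds
attempts = 5         # attempts per cycle, matching the 5 polls in the table

elapsed = 0
timeout = first_timeout
for attempt in range(1, attempts + 1):
    print(f"poll {attempt}: starts at t={elapsed}s, waits {timeout}s")
    elapsed += timeout
    timeout *= 2     # V6.x behaviour: the timeout doubles on every retry

print(f"device declared unreachable at t={elapsed}s")
print(f"{elapsed // poll_interval} new cycles started before the first one finished")

For the values above it reports "unreachable" at t=310s, with 5 new cycles
started in the meantime, i.e. 6 running at once.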
So the solution is to
1. determine the polling interval from the customer's requirements or your own
needs, and
2. configure timeout and retries so that the problem described above is
avoided (see the sketch below for a simple check).
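As a rough rule of thumb (my reading of the doubling behaviour above, not an
official formula), the worst-case time netmon can spend on one non-responding
interface should stay below the polling interval. A small sketch, with
illustrative parameter names rather than actual netmon options:

def worst_case_poll_time(first_timeout, attempts):
    # with doubling: t + 2t + 4t + ... = t * (2**attempts - 1)
    return first_timeout * (2 ** attempts - 1)

def config_is_safe(poll_interval, first_timeout, attempts):
    return worst_case_poll_time(first_timeout, attempts) < poll_interval

print(config_is_safe(60, 10, 5))    # False: 310s worst case vs. a 60s interval
print(config_is_safe(300, 10, 4))   # True: 150s worst case vs. a 300s interval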
Hope this helps
Michael Seibold
Gmünder Ersatzkasse
>>> lclark@us.ibm.com 21.11. 4:41 >>>
The meaning of these entries is not documented. I'm going by observed
behavior. When I see no negatives, all future times, I know netmon is caught
up. When I see negative times, I know that it was scheduled to poll that
particular interface some time ago. Take a look at the node with the really
old time. Maybe it's not up? Maybe it has missed its polling time for many
polls? That's a guess. I've observed a reduction in the number of negative
entries when I reduced the number of managed, down interfaces, and/or
increased the number of pingers (-q), and/or increased the polling cycle. I
may be completely wrong about this, but that's what it looks like and that's
how I use it.
I would not say that -59 is 'in good order'. That is a minute behind. Aside
from right after a netmon startup, your goal is all positive times. Otherwise
you probably have your polling cycle set to something shorter than your system
can handle, or your timeouts are too long. The same goes for the SNMP polling
(netmon -a 16). Again, I'm guessing. You would have to go to Support for a
real explanation of the contents of those records, and they may have to go
even further to get the answer, since it is not documented.
Anybody else?
Cordially,
Leslie A. Clark
IBM Global Services - Systems Mgmt & Networking
Detroit
Stephen Elliott <selliott@epicrealm.com>@tkg.com on 11/20/2000 05:11:00 PM
Please respond to IBM NetView Discussion <nv-l@tkg.com>
Sent by: owner-nv-l@tkg.com
To: "'IBM NetView Discussion'" <nv-l@tkg.com>
cc:
Subject: RE: [NV-L] netmon -a12
Leslie,
Thanks for the reply; I understand that part of it. The part I don't
understand is why in one minute the queue is essentially caught up, in the
next it is several thousand seconds behind, and then it is caught up again.
For example, here are three consecutive entries:
-40: 10.40.0.11 (VPN1X1.HKG1C) list = 0x565358
-8745: 100.129.76.46 (SRV3X4.LON1B) list = 0x565358
-59: 10.5.3.38 (SRV2X11.CHI4C) list = 0x565358
I'm assuming that the system is pushing garbage onto the stack for the larger
time entry, but the system is obviously processing the queue in good order, as
the 3rd entry shows. If one were going to cron a script to track the queue for
lengthy delays, this 'anomaly' would cause a considerable number of false
alarms. It could easily be handled by requiring three or more consecutive
entries greater than X seconds before alarming, or by just taking your
approach to tracking queue length. The whole point here is: are we looking at
a problem or not? Is this an indicator of some kind?
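For what it's worth, a rough sketch of that kind of filter; the threshold, the
consecutive-entry count, and the trace file name are made up for the example:

import re

THRESHOLD = 300   # seconds behind schedule before we care
CONSECUTIVE = 3   # entries in a row that must exceed the threshold

def queue_is_backed_up(lines):
    run = 0
    for line in lines:
        m = re.match(r"\s*(-?\d+):", line)
        if not m:
            continue
        behind = -int(m.group(1))   # "-40: ..." means 40 seconds behind
        run = run + 1 if behind > THRESHOLD else 0
        if run >= CONSECUTIVE:
            return True
    return False

with open("netmon.trace") as f:     # path is an assumption
    if queue_is_backed_up(f):
        print("netmon polling queue looks backed up")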
Regards,
Steve Elliott
Sr. Network Mgmt. Engineer
epicRealm, Inc.
214-570-4560
-----Original Message-----
From: Leslie Clark [mailto:lclark@US.IBM.COM]
Sent: Sunday, November 19, 2000 10:41 AM
To: IBM NetView Discussion
Subject: Re: [NV-L] netmon -a12
The 'behind' is for that interface only. After it does that interface, it is
rescheduled with a future time. What I do is count the number of records with
negative numbers with a grep. That's the number of interfaces it is behind by.
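In script form, counting those negative entries might look something like this
(the trace file path is an assumption; adjust it for your installation):

import re

behind = 0
with open("/usr/OV/log/netmon.trace") as f:   # assumed location of the dump
    for line in f:
        if re.match(r"\s*-\d+:", line):       # negative entry = overdue poll
            behind += 1

print(f"{behind} interfaces behind schedule")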
Cordially,
Leslie A. Clark
IBM Global Services - Systems Mgmt & Networking
Detroit
Stephen Elliott <selliott@epicrealm.com>@tkg.com on 11/17/2000 03:17:19 PM
Please respond to IBM NetView Discussion <nv-l@tkg.com>
Sent by: owner-nv-l@tkg.com
To: "'nv-l@tkg.com'" <nv-l@tkg.com>
cc:
Subject: [NV-L] netmon -a12
Happy Friday, Y'all,
Here's a weekend puzzler. I am monitoring the netmon polling queue on my NV
6.0.1, Solaris 2.6 system to see how often and for how long the queue might
get backed up over the course of a day. There are 3181 interfaces in the
netmon -a12 output. The polling rates are a mixture of 1 min, 1 hour and 5 min
(default) intervals. I have a simple script that deletes the netmon.trace
file, runs a new netmon -a12 and then appends the first line of that output to
a file. The script runs every minute; a rough sketch of that kind of collector
follows.
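A rough sketch of such a collector (the paths, the wait, and invoking netmon
from the PATH are assumptions, not the exact script I run):

import os
import subprocess
import time

TRACE = "/usr/OV/log/netmon.trace"   # assumed location of the trace file
LOG = "/tmp/netmon_queue.log"        # where the samples are collected

if os.path.exists(TRACE):
    os.remove(TRACE)                 # start with a fresh trace file

subprocess.run(["netmon", "-a", "12"], check=True)   # ask netmon to dump its ping queue
time.sleep(5)                        # the dump is written asynchronously, so wait a bit

with open(TRACE) as trace, open(LOG, "a") as log:
    first = trace.readline().rstrip()
    log.write(time.strftime("%Y-%m-%d %H:%M ") + first + "\n")

And here is a sample of that output: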
0: 88.88.99.99 (SWI2X1.AMS1B) list = 0x565358
-10804: 180.174.76.48 (SRV3X6.FRA1B) list = 0x565358
-5: 10.30.60.15 (TRM2X15.MAD1C) list = 0x565358
1: 10.0.0.237 (TRM2X17.SJC1B) list = 0x565358
-2: 165.130.105.8 (SWI2X16.TYO2C) list = 0x565358
-10: 10.0.10.11 (VPN1X1.SAN1C) list = 0x565358
1: 10.30.90.15 (TRM2X15.STO1C) list = 0x565358
-14: 244.76.88.73 (SWI2X16.HKG1C) list = 0x565358
-13: 188.174.76.1 (RTR1X20.FRA1B) list = 0x565358
0: 168.5.137.250 (SWI5X1.ATL1A) list = 0x565358
-30: 126.52.166.8 (SWI2X16.MIA1C) list = 0x565358
0: 10.0.10.15 (TRM2X15.SAN1C) list = 0x565358
-4: 200.174.77.139 (VPN1X1.GVA1C) list = 0x565358
-2: 200.52.99.253 (RTR2X14.LAX1C) list = 0x565358
-4: 200.224.34.21 (SVI1X3.LON2C) list = 0x565358
-8: 200.224.206.1 (RTR1X18.LON3T) list = 0x565358
-28: 10.40.10.15 (TRM2X15.SEL1C) list = 0x565358
-39: 10.30.1.21 (SVI1X3.LON2C) list = 0x565358
-48: 120.41.19.133 (SVI1X2.ANR1C) list = 0x565358
-61: 120.0.16.62 (VPN1X20.SJC4T) list = 0x565358
-60: 120.41.19.35 (SRV2X5.AMS1B) list = 0x565358
-50: 10.5.3.15 (TRM2X15.CHI4C) list = 0x565358
-30: 10.0.5.15 (TRM2X15.LAX1C) list = 0x565358
-29: 10.42.0.31 (SRV2X4.SYD1C) list = 0x565358
-24: 10.40.10.11 (VPN1X1.SEL1C) list = 0x565358
-45: 150.186.221.174 (SRV2X14.GRU1C) list = 0x565358
-50: 10.0.4.11 (VPN1X12.SJC5C) list = 0x565358
-40: 10.40.0.11 (VPN1X1.HKG1C) list = 0x565358
-8745: 100.129.76.46 (SRV3X4.LON1B) list = 0x565358
-59: 10.5.3.38 (SRV2X11.CHI4C) list = 0x565358
-71: 10.30.50.33 (SRV2X6.GVA1C) list = 0x565358
-70: 200.174.77.135 (SWI2X2.GVA1C) list = 0x565358
-84: 111.186.221.155 (SRV1X9.GRU1C) list = 0x565358
-4890: 211.76.12.97 (SRV2X6.SYD1C) list = 0x565358
-12: 164.0.16.62 (VPN1X20.SJC4T) list = 0x565358
-4: 10.1.0.15 (TRM2X15.SEA1C) list = 0x565358
Note the entries that indicate the queue is behind by several thousand
seconds; then the next minute the queue is essentially caught up. Anyone have
an idea what this means or, if it's a known 'anomaly', why the system does
this?
Regards,
Steve Elliott
Sr. Network Mgmt. Engineer
epicRealm, Inc.
214-570-4560
_________________________________________________________________________
NV-L List information and Archives: http://www.tkg.com/nv-l