You got it. Add to what you have demonstrated the fact that netmon can
have only about 10 outstanding queries at a time. With too short a polling
cycle and too high a timeout/retry setting, only a few down interfaces
render the system useless for status.
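
A rough sketch of the arithmetic behind that warning, in Python. It assumes
each down interface holds one query slot until all attempts have timed out,
and that "retries" means total attempts. Whether netmon keeps the timeout
fixed or lengthens it on each retry is not stated in this thread, so the
backoff parameter is purely an assumption for illustration (1.0 = fixed
timeout). All function and parameter names are hypothetical, not part of
NetView.

```python
def worst_case_wait(timeout_s, attempts, backoff=1.0):
    """Seconds one unreachable interface can hold a query slot.

    backoff is an assumed multiplier applied to the timeout on each
    successive attempt; 1.0 means the timeout never changes.
    """
    return sum(timeout_s * backoff ** i for i in range(attempts))


def slot_load(down_interfaces, timeout_s, attempts, poll_interval_s,
              slots=10, backoff=1.0):
    """Fraction of the available slot time in one polling cycle that is
    spent waiting on down interfaces. A value near or above 1.0 means
    polling of healthy interfaces is being starved."""
    busy = down_interfaces * worst_case_wait(timeout_s, attempts, backoff)
    capacity = slots * poll_interval_s
    return busy / capacity


if __name__ == "__main__":
    # Figures from the question below: 2 sec timeout, 10 retries,
    # 2 minute polling interval, 10 down interfaces.
    print(slot_load(10, 2, 10, 120))
    # Same figures under an assumed doubling of the timeout per retry.
    print(slot_load(10, 2, 10, 120, backoff=2.0))
```

Under the fixed-timeout assumption the load stays modest; if the timeout
grows on each retry, the same settings can exceed the available slot time,
which would match the slow updates described in the question.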
1) Ask yourself how soon you need to know about an outage. It may differ
for different devices or sections of the network. Adjust the polling cycle
accordingly. Some customers tell me they don't need to know about remote
sites for 30 minutes. For other devices you need to know within a minute
or so.
2) Then identify those devices or sections of the network that habitually
give false alarms, and carefully increase the timeout or retries until they
don't do it any more. But it is a very fine knob, so turn it slowly, and
only set it for devices or wildcards that need it.
3) Remove from polling all interfaces that you know to be down. Unmanage
them or deconfigure them on the devices, or don't discover things that go
up and down and are not important to you.
Every network is different. Your goal is a green map, unless something
needs attention.
Cordially,
Leslie A. Clark
IBM Global Services - Systems Mgmt & Networking
(248) 552-4968 Voicemail, Fax, Pager
---------------------- Forwarded by Leslie Clark/Southfield/IBM on 02-11-99 06:56 AM ---------------------------
erik nilsson <erik@NETMAN.SE> on 02-11-99 05:41:59 AM
Please respond to Discussion of IBM NetView and POLYCENTER Manager on NetView <NV-L@UCSBVM.UCSB.EDU>
To: NV-L@UCSBVM.UCSB.EDU
cc: (bcc: Leslie Clark/Southfield/IBM)
Subject: The polling process
When configuring the SNMP polling parameters we decreased the polling
interval to 2 minutes, keeping the default timeout (2 sec) and retry
count (3). Our network includes approx. 500 interfaces on different
routers.
On some links we have found that link-down events appear although
the link/interface is actually up (verified by a manual ping test). This
can occur when there is a timeout in the polling cycle because no ICMP
reply arrives within the time limit (slow link, highly utilized router,
low ICMP priority, routing tables being recalculated, etc.).
Now we have increased the number of retries to 10 for some of
our routers to be really sure that the link is down when the event
is triggered (we actually start other processes to create enterprise
error messages to the helpdesk, etc.).
That seems to result in a very slow update time (10-15 min) for
links/routers that come up after a down state.
My question is about the polling process.
When the retry count is increased, the time to flag the interface
'down' will of course increase. Does that affect (delay) the polling
frequency of the other nodes in the polling list?
(Is every poll a separate process that does not depend on the previous
one?)
If the answer is yes, that would seriously affect the polling cycle
and the time at which a new state of an interface is detected.
If, for example, we have 10 down interfaces, that would result in a
10*10*2 sec = 200 sec delay, which would hold back the polling cycle
for every other node/interface.
Is this correct?
In that case one should really keep the retry count low and the polling
interval at more than 2 min so that every interface can be checked
within the polling interval.
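
The worst case described above can be checked with a few lines of Python.
This is only a sketch of the hypothesis in the question: it assumes polls
are strictly serial (one outstanding query at a time) and that the timeout
is the same on every attempt. Neither is confirmed netmon behavior, and the
reply at the top of this message says netmon can have about 10 queries
outstanding. The function names are hypothetical.

```python
def serial_delay_s(down_interfaces, attempts, timeout_s):
    """Seconds spent waiting on down interfaces if polls run one at a time."""
    return down_interfaces * attempts * timeout_s


def fits_in_interval(down_interfaces, attempts, timeout_s, poll_interval_s):
    """True if the waiting time alone still fits inside one polling interval."""
    return serial_delay_s(down_interfaces, attempts, timeout_s) <= poll_interval_s


if __name__ == "__main__":
    # 10 down interfaces, 2 sec timeout, 2 min (120 sec) polling interval
    print(serial_delay_s(10, 10, 2))         # 10 retries
    print(fits_in_interval(10, 10, 2, 120))
    print(fits_in_interval(10, 3, 2, 120))   # default 3 retries
```

With 10 retries the waiting time (200 sec) exceeds the 120 sec interval;
with the default 3 retries (60 sec) it fits.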
Have I got this wrong or right?
Any recommendations?
(BTW, AIX 4.2.1, NetView 5.1)
Erik Nilsson (erik@netman.se)
Network Management tcpip AB
Stockholm