A couple of things to think about:
Is that script reporting correctly? Look at the netmon.trace that it is
evaluating and see if it is reasonable. Does it look like there are that
many entries starting with negative numbers?
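To check the count independently of the script, something along these lines would tally the records by sign. This is only a sketch: it assumes every record starts with a signed integer offset followed by a colon, as in the two sample lines quoted in this thread, and the function name summarize_trace is made up for illustration.

```shell
#!/bin/sh
# Summarize the scheduling offsets in a netmon.trace dump.
# Assumption: each record begins with a signed integer offset and a colon,
# e.g. "-21138: 162.131.115.1 (...)" or "1043: 162.131.203.57 () ...".
# usage: summarize_trace <trace-file>
summarize_trace() {
    awk '
        $1 ~ /^-?[0-9]+:/ {
            offset = $1
            sub(/:.*/, "", offset)      # strip the colon and anything after it
            offset += 0                 # force a numeric value
            if (offset < 0) {
                overdue++               # scheduled time has already passed
                if (offset < worst) worst = offset
            } else {
                pending++               # still scheduled for the future
            }
        }
        END {
            printf "overdue=%d pending=%d most_overdue=%d\n", overdue, pending, worst
        }
    ' "$1"
}
```

Run it as summarize_trace /usr/OV/log/netmon.trace right after a 'netmon -a 12' dump. If the overdue figure is far lower than what the script reported, suspect the script's grep pattern rather than netmon.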
You have not said what the scheduled polling cycle is at the moment.
The default is 5 minutes. You expressed concern that it would not
do better than 14 minutes. What is it set for now?
In your original append you stated that netmon could only ping 2
interfaces/sec. I assumed that was some sort of calculation you did.
Now I am wondering if it is something you observed it doing. Is that
right? How did you observe that?
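For comparison, here is a rough sketch of the arithmetic under the queue model described earlier in this thread: with q pings outstanding and each answered in about 40 ms, netmon should get through roughly q / 0.040 nodes per second. This is a back-of-the-envelope estimate only. It assumes every node answers, it ignores timeouts, retries, name resolution and SNMP polling, and cycle_seconds is a made-up helper name.

```shell
#!/bin/sh
# Estimate the length of one status-poll pass, in seconds, when every
# node answers. Assumption: netmon keeps <queue_depth> pings in flight and
# sends a new one as soon as a reply arrives.
# usage: cycle_seconds <nodes> <avg_response_ms> <queue_depth>
cycle_seconds() {
    awk -v n="$1" -v ms="$2" -v q="$3" \
        'BEGIN { printf "%.2f\n", n * ms / 1000 / q }'
}
```

cycle_seconds 1700 40 16 prints 4.25, that is, a few seconds per pass when everything is up. A rate of 2 per second would instead mean 1700 / 2 = 850 seconds, which is where the 14 minutes comes from.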
You probably cannot get netmon to turn on tracing with
'netmon -M -1' because it is too busy to answer, so you would have
to turn it on at startup using the configuration menus. Watching the
output of the trace might give you a clue as to what it is doing while it
is not getting around to pinging.
When you do 'ovobjprint -S', how many objects are in the database?
When you do 'ovtopodump -l', how many real interfaces, how many nodes
does it say you have? Where does your 1700 fit? What is the cache
setting for ovwdb? Higher than the object count?
How is the box performing otherwise? Does vmstat 5 show idle, or
busy? Has it ever performed well? What kind of box is this anyway,
cpu and memory?
When was the last time you did an ovtopofix -a (or -A)? Since you are
using oids to exclude, you will have stub objects for every other thing
in the network. Netmon will check these, via snmp, at startup and
at daily config poll time, to see if they have turned into something that
it is allowed to discover. These things will show up in the total object
count but not in the ovtopodump counts. After an ovtopofix, they will be
gone but will start coming back as soon as you restart netmon, so I'm not
suggesting that you do this, only explaining what will happen if you do.
If the number of objects is a problem, the solution is to exclude as
much as possible by actual address range, for instance your DHCP
ranges.
I'm about out of ideas. It sounds like your netmon is not keeping up,
but I am not positive that is the case. Do you think it is keeping up, or
not?
Cordially,
Leslie A. Clark
IBM Global Services - Systems Mgmt & Networking
Detroit
"Treptow, Craig" <Treptow.Craig@principal.com>@tkg.com on 06/04/2001
11:58:10 AM
Please respond to IBM NetView Discussion <nv-l@tkg.com>
Sent by: owner-nv-l@tkg.com
To: "'IBM NetView Discussion'" <nv-l@tkg.com>
cc:
Subject: RE: [NV-L] Confirmation of Netview pinging
Thanks for the extra info Leslie. Regarding DNS performance, I run a
secondary DNS on the Netview machine for the reverse address space only.
These consistently respond in 3-4 ms, while the forward entries take 10-20 ms
in general.
I let this run all weekend and tried the script again this morning. I
didn't get any output from it until I increased the sleep to 60 seconds.
When I did this I got:
Netmon is 4335 behind in status pinging
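One way to avoid hand-tuning the sleep is to poll for the trace file until it appears or a limit is reached. The following is a sketch under assumptions, not a tested replacement for pingstatus.sh: wait_for_file and ping_backlog are hypothetical names, the 120-second limit is arbitrary, and it assumes netmon has finished writing by the time the file is non-empty, which may not hold on a busy system.

```shell
#!/bin/sh
# Wait for a non-empty file to appear, checking once a second, instead of
# guessing a fixed sleep. Returns 0 if the file shows up, 1 on timeout.
# usage: wait_for_file <path> <max_seconds>
wait_for_file() {
    waited=0
    while [ ! -s "$1" ]; do
        [ "$waited" -ge "$2" ] && return 1
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}

# Hypothetical wrapper around the pingstatus.sh steps from this thread.
# The trace path and 'netmon -a 12' come from that script; the limit of
# 120 seconds is an assumption.
ping_backlog() {
    trace=/usr/OV/log/netmon.trace
    rm -f "$trace"
    netmon -a 12
    if wait_for_file "$trace" 120; then
        # Count records whose leading offset is negative (already overdue).
        echo "Netmon is $(grep -c '^[[:space:]]*-[0-9][0-9]*:' "$trace") behind in status pinging"
    else
        echo "Netmon is too busy to report now. Try later."
    fi
}
```

If the count still varies wildly from run to run, the file was probably read while netmon was still writing it, and a short extra sleep after wait_for_file returns would be the next thing to try.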
I have configuration polling set to 1 day, with 11 OIDs to include and 6
OIDs to exclude in the seed file.
Thanks.
Craig
> -----Original Message-----
> From: Leslie Clark [mailto:lclark@US.IBM.COM]
> Sent: Saturday, June 02, 2001 3:56 PM
> To: IBM NetView Discussion
> Subject: RE: [NV-L] Confirmation of Netview pinging
>
>
> The 20 seconds is how long it takes netmon to get around to responding
> to your request that he dump the report. The script counts the number of
> records, such as the one below, that were scheduled for times that have
> already passed. Note that near the top of the netmon.trace they have
> negatives, and hopefully, at the bottom, there will be some with positive
> numbers. The positive numbers tell you that this node is scheduled to be
> pinged that many seconds in the future. Or some such time unit. That looks
> like you have some nodes that have not been pinged in hours, so maybe
> the units are not seconds.
>
> -21138: 162.131.115.1 (tower1-feth-5-0.net.principal.com) ...
>
> When you first start netmon up, it is always behind. It has a lot to do at
> startup. Try letting it run for half an hour or so, and see if it catches
> up. What it says at startup is not something I worry about. It is the
> steady-state behavior that you want to tune for. I suspect that you do have
> some performance issues to deal with, though, because it sort of looks from
> this as if it is never catching up. Which is why you are writing, I guess.
> What does it say after it has been running for an hour? How often are you
> sending it off to do the configuration poll (which defaults to once a day)?
> How good is your name resolution? Oh - and how many names are in your
> seedfile? All of them? I have seen that delay netmon startup by half an
> hour unnecessarily.
>
> Cordially,
>
> Leslie A. Clark
> IBM Global Services - Systems Mgmt & Networking
> Detroit
>
> "Treptow, Craig" <Treptow.Craig@principal.com>@tkg.com on 06/01/2001
> 08:58:30 AM
>
> Please respond to IBM NetView Discussion <nv-l@tkg.com>
>
> Sent by: owner-nv-l@tkg.com
>
>
> To: "'IBM NetView Discussion'" <nv-l@tkg.com>
> cc:
> Subject: RE: [NV-L] Confirmation of Netview pinging
>
>
>
> Thanks Leslie!
>
> Just to clarify, 1700 was not the number of interfaces, it was the number
> of hubs/routers/switches/servers. I used the script you provided and had
> to increase the sleep time to 20 seconds. Anything less just resulted in
> the "Netmon is too busy" message. At 20 seconds, I typically get:
>
> Netmon is 3427 behind in status pinging
>
> I don't really understand what I'm looking at, though.
>
> The box itself is currently located on one of the backbone switches. I
> haven't taken any other action, because I wanted to understand what I was
> looking at. In netmon.trace I see things such as:
>
> 1043: 162.131.203.57 () list = 0x202aa858
> or
> -21138: 162.131.115.1 (tower1-feth-5-0.net.principal.com) list = 0x202aa7b8
>
> Can you explain these any more?
>
> Thanks!
>
> Craig
>
> > -----Original Message-----
> > From: Leslie Clark [mailto:lclark@us.ibm.com]
> > Sent: Wednesday, May 30, 2001 8:04 PM
> > To: IBM NetView Discussion
> > Subject: Re: [NV-L] Confirmation of Netview pinging
> >
> >
> > A couple of things.
> >
> > Remember your normal response is 40ms, not 1 sec. Yes, it will take a while
> > to make the rounds if everything is down. But I hope your normal state is
> > that everything is up.
> >
> > The number of outstanding pings is configurable. I think the current
> > default is 16 (it was 10, years ago). It has been tested at up to 64.
> > That means it can send off pings to 64 nodes at once, and as they respond,
> > send out more. That number is the number of nodes it can be waiting on at
> > one time (waiting an average of 40ms, you say). Set it in
> > /usr/OV/lrf/netmon.lrf, adding the '-q' parameter. Use -q 32 to set the
> > ping queue, and -Q 32 to set the snmp request queue. Experiment to see if
> > you have the CPU and interface speed to back it up. I have never seen it
> > overrun the adapter, but I have seen it use up all of the cpu.
> >
> > 1700 interfaces is not a lot. You should be able to handle that in 5
> > minutes easily on just about any box, using the default timeout/retry
> > of 2 and 3. Some caveats: If you have a lot of unpingable interfaces in
> > your map, clear them up. They clog up the ping queue (or increase the ping
> > queue). Acknowledged counts, too, since they are still pinged. Make sure
> > your name resolution method is really fast. Slow name resolution slows
> > everything down more than you would expect. If you are having problems
> > with false alarms, make note of them and tune them individually to
> > accommodate normal variations in the network, rather than increase the
> > timeout across the board. Make sure your box is centrally located in the
> > network, with the most reliable connection available, and make sure that
> > connection is running at full-duplex if the connection supports it.
> >
> > Here's a little script to help you monitor how well netmon is keeping
> > up with the status polling. See how fast it catches up when it gets behind.
> >
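> > #!/bin/ksh
> > #
> > # pingstatus.sh
> > #
> > # A script to check whether netmon can keep up with the polling
> > # frequency scheduled. Can be called from the Reports menu.
> > # Output: a message to stdout
> > # Note: not reliable if netmon tracing is going on!
> > #
> > #set -x
> > # Remove any old dump; -f keeps rm quiet if the file is not there.
> > rm -f /usr/OV/log/netmon.trace
> > # Ask netmon to dump its ping schedule report to netmon.trace.
> > netmon -a 12
> > # Give netmon time to write the dump; increase this on a busy system.
> > sleep 3
> > if [ -f /usr/OV/log/netmon.trace ]; then
> >     # Count records whose leading offset is negative (already overdue).
> >     echo "Netmon is" `grep -c '^[[:space:]]*-[0-9][0-9]*:' /usr/OV/log/netmon.trace` \
> >         "behind in status pinging"
> > else
> >     echo "Netmon is too busy to report now. Try later."
> > fi
> > exit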
> >
> >
> > Cordially,
> >
> > Leslie A. Clark
> > IBM Global Services - Systems Mgmt & Networking
> > Detroit
> >
> >
> > "Treptow, Craig" <Treptow.Craig@principal.com>@tkg.com on 05/30/2001
> > 04:39:22 PM
> >
> > Please respond to IBM NetView Discussion <nv-l@tkg.com>
> >
> > Sent by: owner-nv-l@tkg.com
> >
> >
> > To: "NetView List (E-mail)" <nv-l@tkg.com>
> > cc:
> > Subject: [NV-L] Confirmation of Netview pinging
> >
> >
> >
> > Hi. We are running Netview 6.0.2 on AIX 4.3. We want to move to a
> > more proactive approach to problem notifications. Our hope is to ping
> > servers/hubs/switches/routers and generate events when they aren't
> > reachable. This would make use of the Netview features to reduce the
> > "noisy" pages, etc. In preparation for this, I was running some numbers
> > and would like some input to see if my reasoning is flawed somewhere:
> >
> > Average response time for pings = 40ms (includes LAN and WAN)
> > Total devices to ping 1700. (and growing at about 30 per month)
> > # outstanding pings = 10 (Is this true? Does it affect my numbers? If so, how?)
> > Retries = 0
> > Timeout = 1 sec
> > One Netview machine.
> >
> > Netview could only ping 2 devices per second for a total of 120 per minute.
> > 1700 / 120 = 14 minutes to complete one ping cycle.
> >
> > So this would mean that using this method, we would only find out about a
> > down device after 14 minutes at best? I don't think anybody would accept
> > this long of a window.
> >
> > Assuming the above is true, it appears that it is time for us to look into
> > a different Netview architecture that could achieve our goals?
> >
> > I'm just looking for some insight into how Netview pings and if my numbers
> > are even reasonable, etc. Thanks for any help you can provide.
> >
> > Craig
> >
> > P.S. I have searched the archives, but there appear to be many open
> > questions on this topic. Also, no form of netmon -a ?, or any other flag
> > produced output in the netmon.trace file.
> > _________________________________________________________________________
> > NV-L List information and Archives: http://www.tkg.com/nv-l
_________________________________________________________________________
NV-L List information and Archives: http://www.tkg.com/nv-l