Folks,
this problem is resolved. Many
thanks to everyone involved. The NV-TEC troubleshooting guide hit the
proverbial nail on the head. Now we are cooking.
But
another question (NetView) looms. :-)
JT
Thank you for a very thoughtful and detailed reply, James. Hopefully
we'll get this one figured out.
Drew
Drew,
Performance problems are notoriously
difficult to diagnose, especially remotely. Remember too that the
benchmarks you are thinking of are for optimally configured systems running
in the lab, not real-world results. But here are a couple of points you
might investigate.
(1) What you
see in trapd.log is not necessarily what is coming in. It's what
trapd processed and logged. Logging is the last thing trapd does with
the trap, after he's processed it in every other way. It does record
that a particular trap was received and processed at a particular time, but
that's about all. So seeing Cisco traps in trapd.log 5 seconds apart
means that's how fast trapd is processing them, not how fast they are
arriving. What might you not see in the log? Any traps
configured to "Don't Log or Display" in xnmtrap. That action puts the
trap category in trapd.conf to "Ignore". So you could go to
/usr/OV/conf/C (don't forget the "C" here) and do "grep Ignore
trapd.conf " and see whether you have any of those. If you do, then
you are not seeing those in the log. For diagnostic purposes you
should alter those entries to "Log Only" so you can get a better idea of the
work trapd is actually doing.
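As a minimal sketch of that check (the paths are the ones above; make the
actual change in xnmtrap):

    cd /usr/OV/conf/C
    grep Ignore trapd.conf
    # every line returned is a category set to "Don't Log or Display";
    # after switching them to "Log Only" in xnmtrap, confirm none remain:
    grep -c Ignore trapd.conf    # expect 0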
(2) To get closer to what is coming in, you could turn on the
trapd.trace. You'll see a message about each trap being received
from address so-and-so every time one is pulled off the queue for
processing. If you want to see the contents of those incoming traps,
then you also need to have trapd running with the -x option to hex dump
incoming packets. Now I said closer to what is coming in, because
obviously trapd cannot trace a trap until he has started to read it.
When won't he read immediately? When there is no break between
incoming traps. If traps arrive too quickly, rather than pull
them off one at a time and process them, trapd queues them so that he
doesn't lose any. He won't start processing them again until there's a
break in the incoming flow. In that case you should see a bunch of
trap-queued messages but no intervening processing in the trace.
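Something like this will show you, assuming the trace lands in /usr/OV/log
next to trapd.log (the exact wording of the queued message may differ):

    # watch the trace as traps arrive
    tail -f /usr/OV/log/trapd.trace
    # after a burst, count queued-trap messages vs. processing messages:
    grep -ci queued /usr/OV/log/trapd.trace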
I suspect that this is really what's going on. You get a big burst of
traps, so all trap processing slows while we queue them, and then once the
burst subsides, processing starts up again. But now the bottleneck is
going to be in nvcorrd and nvserverd, who have been idle for a while, and now
have a lot to do. It's like a snake swallowing an egg; you see a
big lump moving along until it is totally digested. You have to turn
on the nvcorrd trace (nvcdebug -d all) to see what nvcorrd's doing, and one
benefit of that is that you can see how long it takes him to process just
one trap, given the rulesets and event windows you have going at the time.
Look for the eye-catchers "Received a trap" and "Finished with the
trap". From the one to the other is the transit time through nvcorrd.
Not much you can do if you don't like it, other than to reduce the
load.
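A sketch of pulling those eye-catchers out, assuming the nvcorrd trace goes
to /usr/OV/log/nvcorrd.alog (check where nvcdebug writes on your box):

    nvcdebug -d all      # turn the nvcorrd trace on
    # pair up the two eye-catchers; the gap between a "Received" and the
    # matching "Finished" is the transit time for that one trap
    egrep "Received a trap|Finished with the trap" /usr/OV/log/nvcorrd.alog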
(3) Obviously if you want
to assess what the real incoming trap rate is, you need an outside analysis
tool, such as an iptrace for port 162. Then you can run ipreport against
the data and see. Those are AIX commands, by the way -- there are similar
tools on Solaris and Linux but I haven't used them much.
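On AIX, something along these lines (flags from memory, so check the man
pages; run as root, and <pid> is a placeholder):

    iptrace -a -P udp -p 162 /tmp/traps.bin   # capture trap traffic
    # let it run through a burst, then stop the daemon:
    ps -ef | grep iptrace                     # find the pid
    kill <pid>
    ipreport -rns /tmp/traps.bin > /tmp/traps.txt
    # the per-packet timestamps in the report give the true arrival rate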
(4) If you cannot reduce the
incoming rates to keep processing from being overloaded, then you might
consider installing an MLM and using it as a trap filter, tossing out
duplicates and only passing on to trapd what you really want to see.
HTH
James Shanks
Level 3 Support for Tivoli NetView for UNIX and Windows
Tivoli Software / IBM Software Group
"Van Order, Drew \(US
- Hermitage\)" <dvanorder@deloitte.com> Sent by: owner-nv-l@lists.us.ibm.com
09/17/2004 10:12 AM
|
To
| <nv-l@lists.us.ibm.com>
|
cc
|
|
Subject
| RE: [nv-l] nvtecia
still hanging or falling behind processing
TEC_ITS.rs |
|
James,
We finally had missed heartbeats to
track. We can see the heartbeat trap in trapd.log, but no corresponding
entry in nvserverd. This appears to confirm the holdup is on the NV side,
and again, we had an increase in Cisco traps (one every 5 seconds for about
2 hours prior to missing the first heartbeat), but nothing near NV's limit.
Trapd.log shows it is starting to fall behind as well during this period--as
an example, the missed heartbeat TEC event for 6 PM last night did not show
in trapd until 6:48 PM. The 7 PM heartbeat shows in trapd at 7:21 PM and is
in nvserverd at 7:45, so it had almost caught up by then. So the TEC
adapter never stopped, but we've got to figure out why trapd and the
processes in between seem to stumble under load, but not a heavy one. We
know Cisco devices can send some traps at rates faster than one per second.
Is it possible devices are machine-gunning traps even though NV shows one
every 5 seconds or so? That's the only thing I can think of that could put
trapd behind based on what we are seeing. Thanks
everyone--Drew

-----Original Message-----
From: owner-nv-l@lists.us.ibm.com [mailto:owner-nv-l@lists.us.ibm.com] On Behalf Of James Shanks
Sent: Thursday, September 16, 2004 3:40 PM
To: nv-l@lists.us.ibm.com
Subject: RE: [nv-l] nvtecia still hanging or falling behind processing TEC_ITS.rs
So what's different? Is your wpostemsg to @EventServer like
your tecint.conf file? We are back to this being a TEC issue and not a
NetView one. So unless you want to open a problem with TEC support,
you'll have to do some more detective work yourself.
If both the wpostemsg and the
tecint.conf have @EventServer, then I don't know what to tell you. If
not, then reconfigure your tecint.conf using serversetup to use the non-TME
method (which requires that a different daemon be started than when you use
the TME method). For non-TME forwarding, /usr/OV/bin/nvserverd is
started. For TME forwarding, it is /usr/OV/bin/spmsur, who then starts
/usr/OV/bin/tme_nvserverd. To switch from one to the other requires
that you go through serversetup, which will reconfigure this
automatically, or that you manually alter the /usr/OV/conf/ovsuf file
to start the correct daemons. But note that when you go through
serversetup, your special customization to the nvserverd entries is lost.
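A quick way to check which method you are currently configured for, using
the ovsuf file and daemon names just mentioned:

    grep nvserverd /usr/OV/conf/ovsuf   # matches tme_nvserverd too
    ovstatus nvserverd                  # is the non-TME daemon running?
    ovstatus tme_nvserverd              # or the TME one?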
The fact that events are going to the cache means that nvserverd
got the event, formatted it, did his tec_put_event( ) and all went fine, but
then the TEC library code, in trying to send to the TEC server, found that it
could not -- that it had lost connection to the TEC server, for some reason
known only to those internal routines. And without a diag (as in
"diagnosis") file configured there so that the internal TEC library code
will trace itself, no one can tell you what it's doing or why. And you
have to get that diag file, called ".ed_diag_config", from TEC Support, and
they are the ones who have to look at the traces. No one on the
NetView side can assist at this point.
James Shanks
Level 3 Support for Tivoli NetView for UNIX and Windows
Tivoli Software / IBM Software Group
"Edwards, JT - ESM"
<JEdwards3@wm.com> Sent by:
owner-nv-l@lists.us.ibm.com
09/16/2004 04:00 PM
|
To
| "'nv-l@lists.us.ibm.com'"
<nv-l@lists.us.ibm.com>
|
cc
|
|
Subject
| RE: [nv-l] nvtecia
still hanging or falling behind processing TEC
_ITS.rs |
|
Yes it
does.

-----Original Message-----
From: owner-nv-l@lists.us.ibm.com [mailto:owner-nv-l@lists.us.ibm.com] On Behalf Of James Shanks
Sent: Thursday, September 16, 2004 2:32 PM
To: nv-l@lists.us.ibm.com
Subject: RE: [nv-l] nvtecia still hanging or falling behind processing TEC_ITS.rs
Wpostemsg does not go through the internal
adapter. Does that get to the TEC server?
James Shanks
Level 3 Support for Tivoli NetView for UNIX and Windows
Tivoli Software / IBM Software Group
"Edwards, JT - ESM"
<JEdwards3@wm.com> Sent by:
owner-nv-l@lists.us.ibm.com
09/16/2004 03:17 PM
|
To
| "'nv-l@lists.us.ibm.com'"
<nv-l@lists.us.ibm.com>
|
cc
|
|
Subject
| RE: [nv-l] nvtecia
still hanging or falling behind processing TEC
_ITS.rs |
|
Well, at
this point we are now getting events caching. From there what can we
do?
A
wpostemsg does not clear the cache.
-----Original Message-----
From: owner-nv-l@lists.us.ibm.com [mailto:owner-nv-l@lists.us.ibm.com] On Behalf Of James Shanks
Sent: Wednesday, September 15, 2004 10:16 PM
To: nv-l@lists.us.ibm.com
Subject: RE: [nv-l] nvtecia still hanging or falling behind processing TEC_ITS.rs
No. The errno 827 indicates that there is a problem
initializing the JVM -- the Java Virtual Machine. In almost every case I have
seen, this indicates that the nvserverd daemon does not have the correct
library path for Java, or the ZCE_CLASSPATH variable is not set. Since it is
only set in /etc/netnmrc, if you ovstop all the daemons and restart them
with just ovstart, you will lose it. So Mike is right. The usual fix is to
ovstop nvsecd and then restart with /etc/netnmrc (/etc/init.d/netnmrc on
Solaris or Linux). This issue has been fixed in the upcoming FixPack 2
(FP02) by updating the NVenvironment script so that if you run that before
you do ovstart, it will source the correct environment for you, and then the
daemons will inherit it when you do the ovstart.
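In other words, on AIX:

    ovstop nvsecd    # stopping nvsecd takes the other daemons down too
    /etc/netnmrc     # restarts them; netnmrc sets ZCE_CLASSPATH first,
                     # so nvserverd inherits the Java environment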
But I still don't
know why you are not getting an nvserverd.log which shows the same
tec_create_handle failure that you see in the formatted nettl. We do get
that here.
James Shanks
Level 3 Support for Tivoli NetView for UNIX and Windows
Tivoli Software / IBM Software Group