To: | <nv-l@lists.us.ibm.com> |
---|---|
Subject: | RE: [nv-l] Stress Testing NV, looking for opinions |
From: | "Barr, Scott" <Scott_Barr@csgsystems.com> |
Date: | Thu, 3 Jun 2004 12:13:13 -0500 |
Delivery-date: | Thu, 03 Jun 2004 18:27:48 +0100 |
Envelope-to: | nv-l-archive@lists.skills-1st.co.uk |
Reply-to: | nv-l@lists.us.ibm.com |
Sender: | owner-nv-l@lists.us.ibm.com |
Thread-index: | AcRJgkIKp/dmArsmSJaASO+DjVPEZwAAdcYQAABXbXAAAca9YA== |
Thread-topic: | [nv-l] Stress Testing NV, looking for opinions |
You and I are on exactly the same page. I have a PMR open
for exactly the same situation.
In our situation, we have an automation testing routine:
basically a cron job that submits a trap, which drives a ruleset that touches a
file. If the file has not been touched within 30 seconds of the time the trap
was submitted, we declare that automation is taking too long, then stop nvcorrd
and restart it. On occasion, we see an indication that nothing can talk to
nvcold. This sounds a lot like what you are seeing: non-responsive nvcold
behavior.
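For anyone who wants to build a similar check, here is a minimal Python sketch of the timing test only. It is not our actual script. The function names, the marker-file idea as a path argument, and the decision to leave trap submission and the nvcorrd restart to the caller are all assumptions made for illustration; only the 30-second threshold comes from the routine described above.

```python
import os
import time

MAX_DELAY_SECONDS = 30  # threshold described in the post


def marker_mtime(marker_path):
    """Return the marker file's modification time, or None if it is missing.

    marker_path is a hypothetical path to the file the ruleset touches.
    """
    try:
        return os.path.getmtime(marker_path)
    except OSError:
        return None


def automation_is_late(touched_at, trap_sent_at, now=None,
                       max_delay=MAX_DELAY_SECONDS):
    """Decide whether automation has taken too long to react to the test trap.

    touched_at   -- mtime of the marker file (None if it does not exist)
    trap_sent_at -- time the test trap was submitted (epoch seconds)
    now          -- current time; defaults to time.time()
    """
    if now is None:
        now = time.time()
    # The ruleset already touched the file after the trap went out: healthy.
    if touched_at is not None and touched_at >= trap_sent_at:
        return False
    # Not touched yet: only a problem once the grace period has expired.
    return (now - trap_sent_at) > max_delay
```

A cron-driven wrapper would record the submit time, send the trap, wait, and then call `automation_is_late(marker_mtime(path), sent_at)`. If it returns True, the wrapper would restart nvcorrd by whatever means your site uses.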
I know for a fact that our automation runs around 40-60
traps a minute. Rates beyond that may exist, but I haven't measured. In one
recent situation, a misbehaving trap agent (Oracle 9i Intelligent Agent) began
spewing malformed traps. NetView automation hung in there even though trapd was
being hit with 227 traps a second. The traps were malformed, and the enterprise
ID was not present, so automation was not invoked. Normal traps flowed through
the system during this incident, albeit a little slowly.
And IBM, if you are listening, MLM will do me no darn good
for trap storm protection. The basic problem is that the need to predefine
filter criteria essentially means that I must experience a trap storm from a
device once, then put a filter in, and then if the same trap storm occurs again,
the filter will choke it. Advocates of MLM will point out that I can configure
it based on host address (or even *.*.*.*) but that in essence will shut NetView
automation down (as MLM won't be sending traps beyond the threshold rate) so it
serves as no valid protection (since it breaks the automation just as if I had
sent the traps through).
I strongly believe that nvcold has a problem - even with
the test fix I received for the memory leak someone else mentioned in the
forum.
One other suggestion: while you are cranking up your trap
rate, if you are using query smartset, take a snapshot of netstat -a and see
whether you have a ton of TIME_WAIT, CLOSE_WAIT and FIN_WAIT_2 sockets. It seems
to me that this is somehow related. There seem to be a ton of nvcold sockets in
use. I do not believe the trap rate itself is the problem; I think it is the
number of simultaneous query smartset operations.
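To make that snapshot easier to compare between runs, here is a rough Python sketch that tallies TCP socket states from saved netstat -a output. It assumes the state is the last column on each tcp line, which is true for plain netstat -a on the systems I know but may not hold with extra flags. The function name is made up for illustration. It also counts every TCP socket on the box, not just nvcold's, since narrowing it down would require knowing which port nvcold is using.

```python
from collections import Counter

# State names as printed by common netstat implementations
# (some print FIN_WAIT_2, others FIN_WAIT2).
TCP_STATES = {
    "LISTEN", "ESTABLISHED", "SYN_SENT", "SYN_RCVD", "SYN_RECV",
    "FIN_WAIT_1", "FIN_WAIT_2", "FIN_WAIT1", "FIN_WAIT2",
    "TIME_WAIT", "CLOSE_WAIT", "CLOSING", "LAST_ACK", "CLOSED",
}


def count_tcp_states(netstat_text):
    """Count TCP sockets per state in the text output of `netstat -a`.

    Assumes the connection state is the last whitespace-separated
    field on each line whose first field starts with "tcp".
    """
    counts = Counter()
    for line in netstat_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0].lower().startswith("tcp") and fields[-1] in TCP_STATES:
            counts[fields[-1]] += 1
    return counts
```

Feed it the saved output of netstat -a (for example, read from a file written at each trap-rate step) and compare the TIME_WAIT, CLOSE_WAIT and FIN_WAIT_2 counts as the rate goes up.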
I'll be interested in any further results you are willing
to share.