James, it's refreshing to read a post that acknowledges limitations of an
application. I don't think Scott could have said it any better from the customer
perspective.
FYI--for those using TEC_ITS, we stressed the ruleset overnight by
sending 300 interface downs every 5 minutes. I don't have the luxury of spending
2 weeks just to test every nook and cranny, but this is very close to the upper
limit for our configuration, as TEC had to queue events after each storm, and
barely caught up before the next one hit. However, nothing was missed, and
correlation was accurate. I should also note that the server we used for testing has
about 5,000 objects, and is actively managing most of the production network as
a warm failover.
James - GREAT information. Let me digest. I don't
disagree with anything here, but want to reread for detail after I have had
some coffee. I think most of us looking at this type of situation agree that
it's not a "bug" but an under-performance issue. Essentially, if all
components of NetView can handle 8-20 traps a second, then nvcold has to be
able to perform somewhere near that level. It shouldn't be an arbitrary
governor. Only testing, tracing, and benchmarking will really answer the
questions, so let me finish my research and I'll post the response
here.
I feel a lot of the problem has to do with the fact that
in 1986 or whenever NetView first came out, we just didn't have networks of
the size we have today and we certainly didn't have the connection speeds we
have today. Management functions on devices were very limited, now they are
very robust (some might say TOO robust). The combination of all these factors
means that some basic inherent architectures may be stretched to their
limits - i.e., it may be time to re-invent the wheel. Even Microsoft
recognized that.
I understand the implications of all this. No firestorm
here. If it turns out IBM has some tough coding work to do, then so be it. If
they choose not to address it, the marketplace will respond accordingly. I
think we all know that some of the "legacy" aspects of NetView on AIX/UNIX
leave something to be desired. We'll see where it goes. Meanwhile, as a group,
let's continue to try and iron out all of the possible refinements that can be
made in terms of O/S tuning, NetView configuration, and network design.
I still think nvcold has a performance issue that keeps
its capability out of step with the rest of the NetView components.
At the risk of starting a
firestorm, I feel I must respond to some of Scott's questions and
issues.
Scott, I just want
to prepare you for what you may find.
What you may find is that, despite the speed of your
processor(s), you are up against both system and old design limitations,
which are not easily remedied, rather than proof of some bug in
nvcold.
Well, perhaps,
you'll find that, yes. But perhaps also the end result may be that you
will simply find the upper limit of what NetView event processing can
handle, given the way it is written today and the amount of work that can be
done on your box in that period of time. As far as I know there are no
benchmarks for nvcold performance. And I know there are none for
nvcorrd performance either. So with this course of action you may be
the one determining those benchmarks.
You are correct that socket stats and performance are tied together,
but perhaps not in the way that you think. Those states may not
represent errors at all. Sockets left in states like TIME_WAIT, CLOSE_WAIT, and FIN_WAIT_2
are the result of heavy usage and the way the operating system manages its
resources. Some systems can be tuned to reduce the
amount of time spent in these states, which occur at the end of the
communications cycle, when at least one end of the communications pipe has
been closed, though I am not enough of an OS guy to tell you exactly
what they mean nor how to tune to reduce them. But periodically
the OS checks all open sockets and changes the states so that the ones that
should be closed go to "CLOSED" over time. So if you are using nvcold
heavily, that's just what I would expect to see if he's opening and closing
a lot of sockets. And he would be doing just that if you have a lot of
traps running through rulesets with Query Smartset nodes in them.
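If you want to see those states for yourself, you can tally sockets by state from the shell. A minimal sketch (the awk field position is an assumption - netstat output columns vary by platform, so check yours):

#!/bin/ksh
# Tally TCP sockets by state from netstat; on the systems I have
# seen, the state is the last field of each "tcp" line.
netstat -an | awk '$1 ~ /^tcp/ { state[$NF]++ }
    END { for (s in state) print s, state[s] }'

Run that before, during, and after a trap storm and you will likely see the TIME_WAIT and CLOSE_WAIT counts rise and then drain away on their own.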
Every Query Smartset in a ruleset
is just that, a new call to nvcold by nvcorrd. For each new call,
nvcold must then query the object database to determine what smartsets a
particular node is in, and return all those in a list. So performance
is going to be determined by both the size of the database and the
number of smartsets to be included. I'm not savvy about the internals
of nvcold, but that's real work, and I suspect all this means sockets to be
opened and closed between him and nvcorrd as well as between him and ovwdb.
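If you want a rough feel for the cost of one such lookup, you can time a query against the object database from the shell. A minimal sketch, assuming your release ships ovobjprint in /usr/OV/bin and that it accepts -s <selection name> on your system (check your man page - the flag here is an assumption, and "router1" is a placeholder node name):

#!/bin/ksh
# Time one ovwdb lookup for a single node -- roughly the kind of
# work nvcold repeats, per smartset, for every Query Smartset call.
time /usr/OV/bin/ovobjprint -s router1 > /dev/null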
So for some trap rates, no matter how fast your box is, it may not be
fast enough to keep up with the demand being placed on the NetView daemons
by your automation. Let's remember that nvcold, like all the other
NetView for UNIX daemons, except the new java ones, is
single-threaded. That's one operation at a time. So if every trap
goes through a Query Smartset, it is easy to see how you could overwhelm the
available resources at some point. The same would be true, of course, if
they were multi-threaded; it would just take longer. But that's
one of the reasons why you want to make calls outside of
nvcorrd, like Query Smartset, Query Database, and Query MIB, sparingly
when you write a ruleset, as the performance guidelines I posted some time
ago emphasize.
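To put rough, purely illustrative numbers on that (these are assumptions, not measurements): if one Query Smartset round trip through nvcorrd, nvcold, and ovwdb costs, say, 50 milliseconds, then a single-threaded nvcold tops out near 1 / 0.05 = 20 queries per second. A ruleset that issues one Query Smartset per trap could then never sustain more than about 20 traps per second, no matter how fast the box is; everything arriving above that rate simply queues.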
As for MLM
and trap storms, most of those we see are indeed repetitive. In the
seven years I have looked at customer logs and traces, they usually come
from the same devices over and over again. They usually come from
routers which are overworked and not well-configured, and in many NetView
environments, the NetView folks have no control over either one of those
things. But they can configure MLM to do thresholding. That's
not breaking your automation but protecting it: firing your automation only for the
first of every ten identical traps rather than for every one still works, provided that
you know, when you get the end result, that there could be nine more identical
triggers behind it. So MLM is not a panacea, and it does require that
you analyze storms which have already happened in order to be effective.
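To make the first-of-ten idea concrete, here is a toy filter of my own (an illustration only - this is not MLM's configuration or syntax): given one trap per line, keyed on source and trap ID, it passes the first of every ten identical traps and drops the rest.

#!/bin/ksh
# Toy thresholding filter -- illustration only, not MLM syntax.
# Reads lines of the form "source trapid ...", passes the first of
# every 10 identical (source, trapid) pairs, drops the other nine.
awk '{ if (seen[$1, $2]++ % 10 == 0) print }'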
But what other choice is there? Without MLM thresholding, trapd
will just queue the traps until he runs out of storage to hold them; but
assuming that doesn't happen, he'll start processing them like mad when the
storm stops, and simply pass the bottleneck along to the connected
applications. What will they do? nvcorrd, nvserverd, and actionsvr
will then begin processing like mad themselves, but probably not fast enough
to stay current. Your one-trap-at-a-time automation may still work
but it'll be so slow that it might as well not work. Your pop-ups or
pages or whatever will be many minutes if not hours behind. What good
is being hours behind in processing traps?
I'm afraid I don't see any alternatives. For
every system there are limits, and limits imply trade-offs, and trade-offs
imply that you have to find a way to live with what you have. That's
the fundamental law of system performance. If you cannot find a way
to produce more resources to handle the load when it occurs, then you have
to reduce the load. And that's what MLM does. Even if we
multi-threaded trapd to take over the thresholding job, at some point he too
would have to make a decision about what to do when the load was too high.
And I'll bet the decision would be to stop processing duplicate
traps in order to protect every process that comes down the line
afterward.
In short, I think
what you want to test is a good idea. Just don't be surprised if you
don't find broken code at the end of it, but rather system and design
limitations.
Want a script to test with? Here's one of mine which uses snmptrap, and sends any number of simulated Cisco LinkDown traps with the variable content modified so that for any given one, I can tell where it falls in the batch sent. I call it "EventBlast" and you invoke it like this:

    EventBlast <number of traps to send> <target NetView>
#!/bin/ksh
max=$1
target=$2
src=""
event=ciscoLinkDown
# set -x
let count=0
while (($count < $max)) ; do
    /usr/OV/bin/snmptrap $target .1.3.6.1.4.1.9 \
        $src 2 0 1 \
        .1.3.0 Integer $count \
        .1.4.0 OctetStringascii "`date`" \
        .1.5.0 Integer $max \
        .1.6.0 OctetStringascii "blast test mode"
    # sleep 1
    let count=$count+1
    echo "sent $event EventBlast$count to $target"
done

Of course, you can modify this to send any other trap, with any other variables you need to test your rulesets.
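To drive repeated storms with it - for example, 300 traps every 5 minutes - you could wrap it in a loop like this (a sketch only, not part of the original script; nvserver1 is a placeholder hostname):

#!/bin/ksh
# Repeat an EventBlast storm every 5 minutes, e.g. overnight.
# Substitute your own NetView host for the placeholder nvserver1.
while true ; do
    ./EventBlast 300 nvserver1
    sleep 300
done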
I sincerely hope this helps.

James Shanks
Level 3 Support for Tivoli NetView for UNIX and Windows
Tivoli Software / IBM Software Group