James - GREAT information. Let me digest. I don't disagree
with anything here, but want to reread for detail after I have had some coffee.
I think most of us looking at this type of situation agree that it's not a "bug"
but an under-performance issue. Essentially, if all components of NetView can
handle 8-20 traps a second, then nvcold has to be able to perform somewhere near
that level. It shouldn't be an arbitrary governor. Only testing, tracing,
benchmarking will really answer the questions, so let me finish my research and
I'll post the response here.
I feel a lot of the problem has to do with the fact that in
1986 or whenever NetView first came out, we just didn't have networks of the
size we have today and we certainly didn't have the connection speeds we have
today. Management functions on devices were very limited, now they are very
robust (some might say TOO robust). The combination of all these factors means
that some basic inherent architectures may be stretched to their limits -
i.e., it may be time to re-invent the wheel. Even Microsoft recognized
that.
I understand the implications of all this. No firestorm
here. If it turns out IBM has some tough coding work to do, then so be it. If
they choose not to address it, the marketplace will respond accordingly. I think
we all know that some of the "legacy" aspects of NetView on AIX/UNIX leave
something to be desired. We'll see where it goes. Meanwhile, as a group, let's
continue to try and iron out all of the possible refinements that can be made in
terms of O/S tuning, NetView configuration, and network design.
I still think nvcold has a performance issue that keeps
its capability out of step with the rest of the NetView components.
At the risk of starting a
firestorm, I feel I must respond to some of Scott's questions and
issues.
Scott, I just want
to prepare you for what you may find.
What you may find, is that despite the speed of your
processor(s), you are up against both system and old design limitations, which
are not easily remedied, rather than proof of some bug in nvcold.
Well, perhaps, you'll find that,
yes. But perhaps also the end result may be that you will simply find
the upper limit of what NetView event processing can handle, given the way it
is written today and the amount of work that can be done on your box in that
period of time. As far as I know there are no benchmarks for nvcold
performance. And I know there are none for nvcorrd performance either.
So with this course of action you may be the one determining those
benchmarks.
You are correct that
socket stats and performance are tied together, but perhaps not in the way
that you think. Those states may not represent errors at all.
Sockets left in states like TIME_WAIT, CLOSE_WAIT, and FIN_WAIT_2 are a
by-product of heavy usage, not errors. They occur at the end of the
communications cycle, when at least one end of the communications pipe has
been closed, and some systems can be tuned to reduce the time spent in
those states, though I am not enough of an OS guy to tell you exactly what
each one means nor how to tune to reduce them. But periodically the OS
checks all open sockets and updates their states, so that the ones which
should be closed go to "CLOSED" over time.
So if you are using nvcold heavily, that's just what I would expect to
see if he's opening and closing a lot of sockets. And he would be doing
just that if you have a lot of traps running through rulesets with Query
Smartset nodes in them.
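If you want to see whether those transitional states are piling up, a quick way is to count sockets by state. The sketch below runs against canned sample lines so it is self-contained; on a live system you would feed it from "netstat -an" instead (the column layout varies by OS, so treat the awk field choice as an assumption to adjust):

```shell
#!/bin/sh
# Count TCP sockets by state. The sample text stands in for real
# `netstat -an` output; the state is assumed to be the last column.
sample='tcp4  0  0  127.0.0.1.7777  127.0.0.1.32771  TIME_WAIT
tcp4  0  0  127.0.0.1.7777  127.0.0.1.32772  TIME_WAIT
tcp4  0  0  127.0.0.1.7777  127.0.0.1.32773  CLOSE_WAIT
tcp4  0  0  127.0.0.1.7777  127.0.0.1.32774  ESTABLISHED'

# Tally the last field of each line and print one "STATE count" per line.
echo "$sample" | awk '{count[$NF]++} END {for (s in count) print s, count[s]}' | sort
```

Running it repeatedly while your rulesets are busy will show whether the TIME_WAIT and CLOSE_WAIT counts grow with the trap rate, which is what I would expect from heavy nvcold socket traffic.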
Every Query
Smartset in a ruleset is just that, a new call to nvcold by nvcorrd. For
each new call, nvcold must then query the object database to determine what
smartsets a particular node is in, and return all those in a list. So
performance is going to be determined by both the size of the database
and the number of smartsets to be included. I'm not savvy about the
internals of nvcold, but that's real work, and I suspect all this means
sockets to be opened and closed between him and nvcorrd as well as between him
and ovwdb. So for some trap rates, no matter how fast your box is, it
may not be fast enough to keep up with the demand being placed on the NetView
daemons by your automation. Let's remember that nvcold, like all the
other NetView for UNIX daemons, except the new java ones, is
single-threaded. That's one operation at a time. So if every trap
goes through a Query Smartset, it is easy to see how you could overwhelm the
available resources at some point. The same would be true, of course, if
they were multi-threaded; it would just take longer to reach that point.
But that's one of the reasons why you want to make calls outside of
nvcorrd, like Query Smartset, Query Database, and Query MIB, sparingly
when you write a ruleset, as the performance guidelines I posted some time
ago emphasize.
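The arithmetic behind that advice is simple. Here is a back-of-envelope sketch with purely hypothetical numbers (the service time and rates below are assumptions for illustration, not measured NetView figures): a single-threaded daemon that needs a fixed number of milliseconds per call has a hard ceiling on calls per second, and every Query Smartset node multiplies the calls each trap generates.

```shell
#!/bin/sh
# Hypothetical numbers, for illustration only.
SERVICE_MS=75          # assumed time for one nvcold round trip, in ms
QUERIES_PER_TRAP=2     # Query Smartset nodes each trap passes through
TRAP_RATE=10           # traps arriving per second

# A single-threaded daemon can do at most 1000/SERVICE_MS calls per second.
max_calls_per_sec=$((1000 / SERVICE_MS))
demand=$((TRAP_RATE * QUERIES_PER_TRAP))

echo "capacity: $max_calls_per_sec calls/sec, demand: $demand calls/sec"
if [ "$demand" -gt "$max_calls_per_sec" ]; then
    echo "queue grows without bound; automation falls behind"
fi
```

With these made-up numbers the demand (20 calls/sec) exceeds the capacity (13 calls/sec), so the backlog grows no matter how long the storm lasts. Only real tracing will tell you what the true service time is on your box.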
As for MLM and trap storms,
most of those we see are indeed repetitive. In the seven years I have
looked at customer logs and traces, they usually come from the same devices
over and over again. They usually come from routers which are overworked
and not well-configured, and in many NetView environments, the NetView folks
have no control over either one of those things. But they can configure
MLM to do thresholding. That's not breaking your automation but
protecting it: firing only for the first of every ten identical traps,
rather than for every one, still gets you the alert, provided you remember
that behind the trap you do see there could be nine more identical
triggers. So MLM is
not a panacea, and it does require that you analyze storms which have already
happened in order to be effective. But what other choice is there?
Without MLM thresholding, trapd will just queue the traps until he runs
out of storage to hold them; but assuming that doesn't happen, he'll start
processing them like mad when the storm stops, and simply pass the bottleneck
along to the connected applications. What will they do? nvcorrd,
nvserverd, and actionsvr will then begin processing like mad themselves, but
probably not fast enough to stay current. Your one-trap-at-a-time
automation may still work but it'll be so slow that it might as well not work.
Your pop-ups or pages or whatever will be many minutes if not hours
behind. What good is being hours behind in processing traps?
I'm afraid I don't see any
alternatives. For every system there are limits, and limits imply
trade-offs, and trade-offs imply that you have to find a way to live with what
you have. That's the fundamental law of system performance. If
you cannot find a way to produce more resources to handle the load when it
occurs, then you have to reduce the load. And that's what MLM does.
Even if we multi-threaded trapd to take over the thresholding job, at
some point he too would have to make a decision about what to do when the load
was too high. And I'll bet the decision would be to stop
processing duplicate traps in order to protect every process that comes
down the line afterward.
In short,
I think what you want to test is a good idea. Just don't be surprised if
you don't find broken code at the end of it, but rather system and design
limitations.
Want a script to test
with? Here's one of mine which uses snmptrap, and sends any number of
simulated Cisco LinkDown traps with the variable content modified so that for
any given one, I can tell where it falls in the batch sent. I call it
"EventBlast" and you invoke it like this: EventBlast <number
of traps to send> <target NetView>
#!/bin/ksh
max=$1
target=$2
src=""
event=ciscoLinkDown
# set -x
let count=0
while (($count < $max)) ; do
   /usr/OV/bin/snmptrap $target .1.3.6.1.4.1.9 \
      $src 2 0 1 \
      .1.3.0 Integer $count \
      .1.4.0 OctetStringascii "`date`" \
      .1.5.0 Integer $max \
      .1.6.0 OctetStringascii "blast test mode"
   # sleep 1
   let count=$count+1
   echo "sent $event EventBlast$count to $target"
done
Of course, you can modify this to
send any other trap, with any other variables you need to test your
rulesets.

I sincerely hope this helps.

James Shanks
Level 3 Support for Tivoli NetView for UNIX and Windows
Tivoli Software / IBM Software Group