The question you asked does not have a simple answer because of the way you
asked it.
When trapd gets more traps per second than he can process, he begins to queue
them. When traps are queued, they are not processed, only stored in memory
until such time as cycles are available to process them. How many traps per
second can trapd handle under these circumstances? Well, the performance lab
tested with rates as high as a hundred per second, and verified that none are
lost.
But the assumption behind trap processing is that trap flow is bursty. That
means that the high rates of reception will not last, and within a short time,
some cycles will become available to process as well as queue. If that doesn't
happen, or if the rate goes high and stays high, then there will be problems.
The most obvious is that, without periodic breaks, trapd will use all the
available storage on the box, and then he will die. So the rate you can actually
"handle" (that is, "display and process in a timely fashion") is highly dependent
on how big your box is: how much storage and how much CPU you have.
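To make the bursty-versus-sustained point concrete, here is a minimal sketch of a queue that fills faster than it drains. It is not how trapd is implemented. The function name, the service rate of 20 traps per second, and the capacity of 1000 queued traps are all hypothetical values chosen only for illustration.

```python
def simulate_queue(arrivals, service_rate, capacity):
    """Simulate a trap queue one second at a time.

    arrivals     -- list of traps received in each second (hypothetical data)
    service_rate -- traps processed per second (hypothetical figure)
    capacity     -- queued traps that fit in storage (hypothetical figure)

    Returns (peak_depth, second), where second is the index at which the
    queue outgrew capacity, or None if it never did.
    """
    depth = 0
    peak = 0
    for second, arrived in enumerate(arrivals):
        depth += arrived                      # new traps are queued
        depth = max(0, depth - service_rate)  # spare cycles drain the queue
        peak = max(peak, depth)
        if depth > capacity:
            return peak, second               # storage exhausted
    return peak, None


# A five-second burst followed by a quiet minute, versus a storm that never lets up.
bursty = [100] * 5 + [2] * 60
sustained = [100] * 65

print(simulate_queue(bursty, 20, 1000))
print(simulate_queue(sustained, 20, 1000))
```

With these made-up numbers the burst peaks and then drains during the quiet period, while the sustained storm outgrows the storage within seconds. A bigger box raises the capacity and the service rate, but it does not change the shape of the problem.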
Even with breaks, after trapd processes a trap, formats it into an event
structure and logs it to trapd.log, other daemons must process it before you see
it. Nvcorrd runs the rulesets and nvserverd handles passing the forwarded
events out to the event windows. So cycles must be available for these guys to
run before you can see anything in the event window. And even after that, you
have X Windows limitations. At some number of updates per second (I think it is
about 8 or 10), your X Windows display of events will begin to flicker and become
hard to read, and there is nothing NetView can do about that.
So you are correct that a long running trap storm will be difficult for NetView
to handle effectively, and even if trapd does not run out of storage and
recovers, it will take the other daemons more time to process everything that
has come in and display it. I have heard of cases where it has taken as long as
8 to 10 hours for the event windows to get current again, because the trap storm
put things so far behind, and when it let up, the rate did not fall to a one or
two a second level but stayed moderately high for a few hours.
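The arithmetic behind a long catch-up is simple: the backlog only shrinks at the difference between what the daemons can process and what is still arriving. The sketch below shows that calculation. The function name and every number in it are hypothetical, not measurements from any NetView system.

```python
def catch_up_seconds(backlog, process_rate, arrival_rate):
    """Seconds needed to clear a backlog of queued events.

    backlog      -- events waiting after the storm (hypothetical)
    process_rate -- events the daemons can handle per second (hypothetical)
    arrival_rate -- events still arriving per second (hypothetical)
    """
    spare = process_rate - arrival_rate
    if spare <= 0:
        return float("inf")   # no spare cycles, so the backlog never shrinks
    return backlog / spare


# Same backlog and processing rate; only the rate after the storm differs.
quiet = catch_up_seconds(90000, 10, 1)     # rate falls to about one a second
elevated = catch_up_seconds(90000, 10, 5)  # rate stays moderately high

print(quiet / 3600, "hours")
print(elevated / 3600, "hours")
```

The point of the comparison is that a moderately high rate after the storm eats most of the spare capacity, so the same backlog takes much longer to clear than it would if traffic dropped back to normal.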
It is also the case that there is no "pre-filtering" built into trapd. He
processes everything that is sent to him. He has to. Unless he processes it,
it will never get to a ruleset or display filter to be eliminated from the
operator's display. There could be filtering built into trapd as well, to throw
away traps from certain sources, but that would not eliminate the problem. You
still have to receive every trap in order to decide what to do with it, and it
is the sheer volume of the reception which is the problem.
So the only real solution is to be smarter about what you send to trapd in the
first place. One way to do that is to deploy MLMs and have them act as trap
filters. Different devices in the network send their raw traps to the MLMs
which use thresholding and filtering to reduce what gets sent to trapd.
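As an illustration of the kind of thresholding a filter in front of trapd could apply, here is a minimal per-source throttle. It is a generic sketch and not the MLM's actual configuration or API. The class name, the limit of 5 traps per second, and the source names are assumptions made up for the example.

```python
from collections import defaultdict


class SourceThrottle:
    """Forward at most max_per_window traps per source in each time window."""

    def __init__(self, max_per_window, window_seconds):
        self.max_per_window = max_per_window
        self.window_seconds = window_seconds
        self._window_start = {}          # source -> start time of current window
        self._count = defaultdict(int)   # source -> traps seen in current window

    def allow(self, source, now):
        """Return True if a trap from source at time now should be forwarded."""
        start = self._window_start.get(source)
        if start is None or now - start >= self.window_seconds:
            # First trap from this source, or the old window expired: start over.
            self._window_start[source] = now
            self._count[source] = 0
        self._count[source] += 1
        return self._count[source] <= self.max_per_window


# One noisy device sends 60 traps inside a single second (hypothetical storm).
throttle = SourceThrottle(max_per_window=5, window_seconds=1.0)
forwarded = sum(throttle.allow("recorder-1", i / 60.0) for i in range(60))
other_ok = throttle.allow("router-1", 0.5)   # a quiet device is unaffected

print(forwarded, other_ok)
```

The design choice that matters is where this runs: on a separate box in front of trapd, so that the manager never has to receive the discarded traps at all.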
Another way to do that is to not send so much stuff and reduce the likelihood of
a trap storm. One customer I heard of instituted a policy that no traps could
be sent to trapd unless there was a form on file stating what to do with them,
and a person on staff to do it. The result was that they reduced their trap
flow by half and reclaimed a lot of bandwidth.
When you investigate OpenView just be sure you ask the right questions. They
also filter "after-the-fact" (after all we shared the same design once) and I
want to make certain that if they tell you they have a trap reception filter,
they tell you how it works, and who actually does it.
James Shanks
Team Leader, Level 3 Support
Tivoli NetView for UNIX and NT
"Tremblay, David A." <dtremblay@JHANCOCK.COM> on 06/15/2000 10:00:56 AM
Please respond to IBM NetView Discussion <nv-l@tkg.com>
To: "'IBM NetView Discussion'" <nv-l@tkg.com>
cc: "Lemire, Mark" <mlemire@JHANCOCK.COM>, "Hughes, Tom P."
<thughes@JHANCOCK.COM> (bcc: James Shanks/Tivoli Systems)
Subject: [NV-L] Netview and its handling of traps
To the group:
I am under the impression that the maximum number of traps that NetView can
handle per second is 16. Does this number sound correct?
I know that we have run tests here at John Hancock where we have forced 20
SNMP traps per second to be generated, performed a snmpwalk of a known
discovered device and watched it hang until we stopped forwarding the traps.
At that time NetView played catch-up with those traps, snmpwalk was
performed successfully and NetView returned to normal.
I just want to confirm this with the group: NetView DOES NOT have ANY
mechanism to filter out the number of traps coming in from a particular
configured device during a trap storm? We recently experienced a trap
storm in which 3 monitored devices went down and we received over
60 traps generated per second from 3 identical devices (digital recorders).
During this trap storm, NetView came to a grinding halt because it was
overwhelmed by the number of traps trying to be processed. We managed to
get around this by switching from our Production NetView server to our
Development NetView server since it wasn't configured to receive traps from
those particular devices.
When I talked with support, I was told the only way to fix the problem was
to go to the device and turn off the traps or configure the number of traps
forwarded to be a lesser amount than currently configured. I also went down
the path of talking about the ESE.automation file and using rulesets, but
that will not affect anything except for filtering what is displayed to
the event viewer or filtering forwarded events to TEC.
I know I can remedy this problem outside of NetView by using a product such
as Veritas Nerve Center to pre-filter traps/alarms and forward them to trapd
after the fact. But I was looking for ANY other solution native to NetView
in dealing with this problem. Are MLMs a way to remedy this situation?
Lastly, I have been told that HP OpenView can handle a much higher number of
traps and has filtering built into the product. I am still gathering
information to see if these claims are true and will post what I find when
the information has been confirmed.
Any responses will be both welcomed and appreciated,
Dave
David A. Tremblay
Lead Systems Analyst
Enterprise Management Tools
John Hancock Financial Services
Information Technology Services
Technology Shared Services
E-Mail: dtremblay@jhancock.com
_________________________________________________________________________
NV-L List information and Archives: http://www.tkg.com/nv-l