Our trapd queue size is set at 10000. We
have over 1000 Cisco routers in Florida alone. They were sending traps as
fast as their DLSW peer could lose or establish a connection with them, all in
order. I will look into the possibility of disabling DLSW traps on
the branch routers and enabling them only at the peer. Since this network is
growing (and is currently at roughly 25 percent of its eventual size), that
would seem to be our only hope.
Thanks
Art DeBuigny
Bank of America Network Operations
Hmmm. What exactly is your application queue size? 10,000? 25,000?
If trapd goes down, all those others will go too. Is that what happened?
Did trapd core or what? Who died first?
Basically the application queue size is a mechanism for people to use when
they have configured their agents to send more traps, more frequently, than
the daemons can usually handle. Raising it lets the daemons keep up, at the
cost of a lot more storage and slower performance. The boys and girls on the
Tivoli performance team were able to handle 100 traps/sec for a few hours,
but they had to boost the application queue size to 35,000, and it took
NetView many more hours to recover and process all those traps. But they
didn't lose any daemons.
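As a rough illustration of why a larger queue only buys time during a
sustained flood, here is a minimal Python sketch. The 100 traps/sec rate and
the 10,000 and 35,000 queue sizes come from this thread; the drain rate of
98 traps/sec is purely an assumed figure for illustration, not a measured
NetView number, and the function name is hypothetical.

```python
def seconds_until_full(queue_size, arrival_rate, drain_rate):
    """Seconds until a bounded queue overflows, or None if it never does.

    Simplified model: traps arrive and are drained at constant rates.
    """
    backlog_growth = arrival_rate - drain_rate  # traps added to backlog per second
    if backlog_growth <= 0:
        return None  # the consumer keeps up, so the queue never fills
    return queue_size / backlog_growth


# Assumed drain rate of 98 traps/sec against a 100 traps/sec flood.
for size in (10_000, 35_000):
    print(size, seconds_until_full(size, arrival_rate=100, drain_rate=98))
# prints: 10000 5000.0
#         35000 17500.0
```

Under these assumed rates the larger queue lasts 3.5 times longer, but any
flood that outlasts it still overflows, and the backlog still has to be
worked off afterwards.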
So I have to ask. What exactly is the point of getting so many traps?
Can't these Cisco agents be configured to send one or two instead of
dozens per minute? Or is that what they did, but you have 40,000
Cisco devices sending them at one time? Why be so verbose?
You cannot be helping your outage by flooding what is left of the network
with traps.
Personally, in my view (of course I'm the management
vendor) the only traps that should be sent to NetView are ones you intend
to do something about. And one is enough. Couldn't you get one trap
from the FEP or a few from key routers and stifle the rest? Lots of
folks implement a tiered solution, where routers in one tier send one
kind of trap and others do not.
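The "one is enough" idea can be sketched as a simple time-window filter.
This is only an illustration in Python, not a NetView or Cisco feature: the
class name, the key format, and the 60-second window are all hypothetical,
and something like it would have to run in whatever relay or tier sits in
front of the management station.

```python
class TrapSuppressor:
    """Forward the first trap for a key, then drop repeats inside a time window."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self._last_forwarded = {}  # key -> timestamp of the last forwarded trap

    def should_forward(self, key, now):
        last = self._last_forwarded.get(key)
        if last is not None and now - last < self.window_seconds:
            return False  # a trap for this key was already forwarded recently
        self._last_forwarded[key] = now
        return True


# Hypothetical usage: key on the trap type only, so a network-wide DLSW
# reset yields one forwarded trap per minute instead of one per router.
suppressor = TrapSuppressor(window_seconds=60)
print(suppressor.should_forward("dlsw-peer-down", now=0.0))   # True
print(suppressor.should_forward("dlsw-peer-down", now=10.0))  # False
print(suppressor.should_forward("dlsw-peer-down", now=60.0))  # True
```

Keying on the trap type alone is the most aggressive choice; keying on
(router, trap type) would keep one trap per device at the cost of letting
more through.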
After all, it's just one UNIX box
receiving all that stuff.
Just my two cents.
James Shanks
Tivoli (NetView for UNIX) L3 Support
Art DeBuigny <debuigny@DALLAS.NET> on 05/18/99 11:09:59 AM
Please respond to Discussion of IBM NetView and POLYCENTER Manager on
NetView <NV-L@UCSBVM.UCSB.EDU>
To: NV-L@UCSBVM.UCSB.EDU
cc: (bcc: James Shanks/Tivoli Systems)
Subject: Trapd questions
On occasion, we have been getting traps
from Cisco routers when the state of the DLSW connection resets, in this
case due to a reset at the FEP.
Recently, due to a major outage, we
started getting these traps from every single router on the
network. It crashed netmon, ovtopmd, trapd, and
even ovactiond.
I've tried setting the event customization to 'Do
not log or display' but that didn't seem to help. The situation
only stabilizes once all the routers' DLSW connections have been restored
and traps are no longer flooding into the NetView machine.
Since this can always happen again in the event of an outage, can anyone
think of a way to 'protect' NetView's daemons from such a flood
without actually stopping the trap at the source? I've tried
adjusting the connected applications queue size, but that apparently
wasn't enough.
Thanks
Art DeBuigny debuigny@dallas.net
Bank of America Network Operations