
[nv-l] Loss of traps with MLM

To: NetView Discussion <nv-l@lists.tivoli.com>
Subject: [nv-l] Loss of traps with MLM
From: Robin James <robin.james@thalesatm.com>
Date: Fri, 03 May 2002 16:20:52 +0100
We eventually found a better filter for our situation. Two of our MLM
filter settings were:

smMlmFilterThrottleArmTrapCount =  20
smMlmFilterThrottleDisarmTimer =  "1s"

We improved the filter by increasing the disarm timer to 1 minute. In
our trap storm we would receive the first 20 traps in the first second
and the increase in the timer means the filter blocks the forwarding of
traps for the rest of the minute. The impact is that MLM only lets
through 20 traps per minute per node instead of the previous 20 per
second.
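
The effect of the longer disarm timer can be checked with a rough model. The sketch below is only an illustration of the behaviour as described in this message (pass traps until the arm count is reached, then block until the disarm timer expires). It is not MLM's actual algorithm, and the function name and the 10 ms storm interval are assumptions made for the example.

```python
def forwarded_traps(arrival_ms, arm_count, disarm_ms):
    """Count traps forwarded under a simplified arm/disarm throttle model."""
    forwarded = 0
    passed_since_disarm = 0
    armed_at = None  # time (ms) at which the throttle armed, None while disarmed

    for t in arrival_ms:
        # Disarm once the timer has run out since the throttle armed.
        if armed_at is not None and t - armed_at >= disarm_ms:
            armed_at = None
            passed_since_disarm = 0
        if armed_at is not None:
            continue  # throttle is armed: this trap is blocked
        forwarded += 1
        passed_since_disarm += 1
        if passed_since_disarm >= arm_count:
            armed_at = t  # arm count reached: start blocking
    return forwarded


# Hypothetical storm: one node sending a trap every 10 ms for one minute.
storm = range(0, 60_000, 10)
short_timer = forwarded_traps(storm, arm_count=20, disarm_ms=1_000)
long_timer = forwarded_traps(storm, arm_count=20, disarm_ms=60_000)
print(short_timer, long_timer)
```

Under these assumptions the one-second timer lets through about a thousand traps in the minute, while the one-minute timer lets through only the first 20.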

We have also taken action to fix the software that was generating the
trap storm.

Thanks for your help, James.

Robin James
Thales ATM
--- Begin Message ---
To: nv-l@lists.tivoli.com
Subject: Re: [nv-l] Loss of traps with MLM
From: "James Shanks" <jshanks@us.ibm.com>
Date: Wed, 3 Apr 2002 08:54:27 -0500
Hmmm.
(1) I have no answer about your test and send_event, since it is not
generally available.  From your description it should send traps via UDP,
and from that it would seem that there is a problem.  But if you want to
pursue that with Support, you will have to go to Version 6 at least, and
soon.  Version 6 will be out of Support at the end of October of this year.
MLM already has two newer versions out than what you are using, though I
have no idea how they would fare in your test.

(2) NetView V6 has the same GUI as V5.  It is not NetView which has the
problem with over 5 traps a second, it is X.  The NetView GUI can process
the updates faster than X can refresh the screen, hence the flicker.  This
is an architectural issue.  Event processing is designed to be "bursty".
You can have short bursts which send a lot of stuff to the screens and
elsewhere.  What you cannot do is maintain that level without a pause.
 
(3) The feature we call trap pruning is just a short set of matches that
trapd now does (V6 and above) on certain OIDs and specific trap ids to
help prevent the queues of various connected daemons from being overloaded
with traps they neither need nor want.  For example, ovtopmd used to get
all traps.  This was used as a heartbeat mechanism to see whether trapd
was still there.  This was removed and a new method substituted in V6.
Also netmon now only gets "Link up" and "Link down" traps from external
sources.  All other external traps, and a subset of NetView ones, are
suppressed to keep his queues clear and his trap processing to a minimum.
This gives him more time for other things.  Other affected processes are
ipmap and snmpCollect.  They now only get subsets of traps, not all of
them.  The idea was to not put the trap on the daemon's queue if all he
was going to do was throw it away.  This helps keep the daemons connected
during a trap storm.
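
The idea of not queueing a trap for a daemon that would only discard it can be sketched as follows. This is a minimal illustration, not trapd's implementation. The daemon table, the predicate functions and the "event_display" client are hypothetical, and only the SNMPv1 generic-trap values for linkDown (2) and linkUp (3) are taken as given.

```python
from collections import deque

LINK_DOWN, LINK_UP = 2, 3  # SNMPv1 generic-trap values


def netmon_wants(trap):
    # External traps: link up/down only. The message does not list which
    # internal NetView traps are suppressed, so internal ones pass here.
    return not trap["external"] or trap["generic"] in (LINK_DOWN, LINK_UP)


def wants_everything(trap):
    return True


# Hypothetical daemon table: name -> (predicate, queue).
daemons = {
    "netmon": (netmon_wants, deque()),
    "event_display": (wants_everything, deque()),
}


def dispatch(trap):
    """Queue the trap only for daemons whose predicate accepts it."""
    for wants, queue in daemons.values():
        if wants(trap):
            queue.append(trap)


# Five external traps with different generic-trap codes.
for generic in (0, 2, 3, 4, 6):
    dispatch({"generic": generic, "external": True})

print({name: len(queue) for name, (_, queue) in daemons.items()})
```

In this sketch only the two link traps reach the netmon queue, so a storm of other external traps never has to be read and thrown away by that daemon.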

But trap storms are still a problem, and always will be.  They should be 
stopped at the source, because your event processing, even if it does not 
lose anything, will quickly fall behind, and it will always take longer to 
catch up than it did to create the storm.    You can wind up with the 
event subsystem being hours behind, and NetView slowed to a crawl as a 
result.    You can take steps to help alleviate the problem, but 
ultimately any solution can be overrun if you have enough out-of-control 
devices sending hundreds or thousands of traps per second to your trap 
receiver.  And besides it is killing your bandwidth and throughput, so it 
needs to be stopped at the devices which are doing it.   And I don't mean 
by temporarily shutting down their SNMP daemon, unless that is only a 
prelude to more definitive action.  I mean by configuring them so that 
they don't send the same trap over and over again in a very short time. 
It's pointless and a waste of time and resources.
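
The point about catch-up time can be put into rough numbers. The sketch below uses a simple fluid model with made-up rates. None of the figures come from NetView measurements, and the function name is an assumption for the example.

```python
def catch_up_seconds(storm_rate, normal_rate, service_rate, storm_seconds):
    """Time to drain the backlog left by a storm, in a simple fluid model.

    All rates are in traps per second.
    """
    if service_rate <= normal_rate:
        raise ValueError("receiver can never catch up at this load")
    # Traps that arrived during the storm but could not be processed in time.
    backlog = max(0.0, (storm_rate - service_rate) * storm_seconds)
    # After the storm, only the spare capacity works on the backlog.
    return backlog / (service_rate - normal_rate)


# Hypothetical figures: a 500 traps/s storm lasting 60 s, a receiver that
# processes 20 traps/s, and a normal background load of 5 traps/s.
print(catch_up_seconds(500, 5, 20, 60))
```

In this model the catch-up time exceeds the storm duration whenever the excess arrival rate during the storm is larger than the receiver's spare capacity afterwards, which is the usual case in a storm.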


James Shanks
Level 3 Support  for Tivoli NetView for UNIX and NT
Tivoli Software / IBM Software Group
 





Robin James <robin.james@thalesatm.com>
04/03/2002 04:42 AM

 
        To:     NetView Discussion <nv-l@lists.tivoli.com>
        cc: 
        Subject:        Re: [nv-l] Loss of traps with MLM

 

James, thanks for the reply.

"send_event" is a simple programme written by us to send an event. It
uses the extensible SNMP protocol - I don't know if this protocol is
Compaq/Digital specific but it's available on TRU64 UNIX. It uses the
esnmp_init(), esnmp_poll() and esnmp_trap() functions provided by this
protocol and therefore uses UDP. It was written for sending a trap from
a node where Netview is not installed so snmptrap can't be used. In the
case of our test we could have used snmptrap because we were sending
local traps on the same node as Netview. 

We will try raising the filter limits to see what the effect is. 

You said the GUI begins to flicker at about 5 traps per second. Does
Netview 6 cope better with display of traps at faster rates? 

I am interested in the trap pruning facility in Netview 6. How does this
work? 

We already have one problem which is not fixed in Netview 5.1.3 so if
the trap pruning facility helps us it might be another lever to persuade
our programme to upgrade.

-- 
Robin
email: robin.james@thalesatm.com
tel:   +44 (0) 1633-862020
fax:   +44 (0) 1633-868313
----- Message from "James Shanks" <jshanks@us.ibm.com> on Tue, 2 Apr 2002 13:23:26 -0500 -----
To: nv-l@lists.tivoli.com
Subject: Re: [nv-l] Loss of traps with MLM

I am not the MLM guy, but I do know that there is no way, documented or
undocumented, to alter any buffer sizes it uses without a code change.  So
what you are looking for doesn't exist.  Yet I am also not sure what to
say about your problem, because this is the first time I have ever heard
of MLM being accused of losing traps.  Perhaps I should also point out
that NetView and MLM share no code whatsoever.  If they did, then we could
not have an MLM on HP/UX.  That would be prohibited by our original
purchase agreement with HP, just as a NetView for HP is.  So MLM, while
it is shipped with NetView these days, remains a completely separate
product, code-wise.  You cannot assume that a feature on one is the same
as on the other, nor that you can willy-nilly substitute one for the other
and achieve the same result.

I am rather curious about your test procedure, since the command
"send_event" is not shipped by either NetView or MLM itself.  What does
it do?  Is it a command to MLM or to NetView?  Did you write it yourself?
Does it have tracing or error logging associated with it?  The reason I
ask is that it seems to me that if it opens a TCP connection to MLM to
cause the event to be sent, it may very well be that under the conditions
of your test, MLM was often too busy to open that connection, and thus it
may be that he did not lose any traps, but rather failed to send them in
the first place.  There is a BIG difference between the two.  If he
failed to send them, then perhaps you just need better error checking in
your command.

Also I am curious about your 1 second disarm timer.  Since neither MLM nor
NetView for UNIX is multi-threaded, each can only do one thing at a time.
If you raise your limits, does the problem disappear?  Even allowing one
node to send 20 traps per second is a sure way to bring your NetView
processing to a crawl, so this is not an unreasonable thing to do.  Your
NetView events GUI will begin to flicker at about 5 events per second if
you display them, and without the trap pruning added in NetView Version 6
(not sending unnecessary traps to the daemons who don't need them) your
netmon and ovtopmd will start falling far behind and may never catch up
unless re-cycled.  They may just disconnect from trapd.  And when that
happens ovtopmd will stop and wait for you to re-connect with ovstart.

I am not certain about what anyone can do for you under the circumstances
you describe.  The code you have is out of support and a performance issue
involving it cannot be officially pursued.  And it seems clear to me that
unless it is so pursued, with other people trying to duplicate your
results, there is very little that can be done, except to tell you that
you will have to live within the limits of the code you have.  Sorry, but
I see no alternatives.

James Shanks
Level 3 Support  for Tivoli NetView for UNIX and NT
Tivoli Software / IBM Software Group
 





Robin James <robin.james@thalesatm.com>
04/02/2002 09:58 AM

 
        To:     NetView Discussion <nv-l@lists.tivoli.com>
        cc: 
        Subject:        [nv-l] Loss of traps with MLM

 

We have been performing an experiment to determine if it is possible for
our Netview computer to lose locally generated traps. 

We use Netview 5.1.3 on Compaq TRU64 UNIX and we also run MLM on the
same machine to use its filtering capability. We have set up a filter to
throttle traps with the following settings:

smMlmFilterName[BlockTrapFlooding] =  "BlockTrapFlooding"
smMlmFilterState =  enabled
smMlmFilterDescription =  "Blocks traps when too many traps come from the same host in a short time"
smMlmFilterAction =  throttleTraps
smMlmFilterAgentAddrExpression =  "cwps"
smMlmFilterThrottleType =  sendAfterN
smMlmFilterThrottleArmTrapCount =  20
smMlmFilterThrottleArmedCommand =  "/usr/sbin/Mlm_stop_snmpd.sh $SM6K_TRAP_AGENT_ADDRESS"
smMlmFilterThrottleDisarmTimer =  "1s"
smMlmFilterThrottleDisarmTrapCount =  0
smMlmFilterThrottleDisarmedCommand =  "snmptrap -p 1675 localhost omc .1.3.6.1.4.1.1254.1 `hostname` 6 104 1 .1.3.6.1.2.1.1.5 OctetStringASCII $SM6K_TRAP_AGENT_ADDRESS"
smMlmFilterThrottleCriteria =  byNode
smMlmAliasName[cwps] =  "cwps"
smMlmAliasList =  "w1161,
w1162,
w2142"

As you can see from the settings, an alias is also set up so that the
traps generated on the Netview node are not subject to the filter.

We set up 3 nodes to send traps repeatedly using the following script:

while (1)
   send_event 803 "swamp test"
end

This gave approximately 2200 traps in the trapd log in one minute. Using
vmstat it could be seen that the Netview node had very little idle time.
We then used send_event on the Netview node to send single traps. We
observed that 1 out of 4 events was not present in the trapd or midmand
logs.

This seems to confirm that a trap can be lost when the Netview node is
receiving a very heavy load of traps.

We also performed the same test by removing the use of MLM so traps go
directly to trapd and not via midmand. We performed the same test and
found no traps were lost.

It appears to me that the buffering between UNIX and trapd does not lose
the locally generated events but when MLM is filtering it is possible to
lose traps. 

Is it possible to find out the UDP receive buffer size with each
configuration?

I realise that the source of the problem is the node flooding our
Netview node with traps. We must stop this node from sending so many
events. We are trying to put a two part solution in place to ensure we
do not lose locally generated traps. The two parts are:

1. When MLM detects a node is flooding Netview with traps we will freeze
snmpd on that node so traps do not get sent.
2. Increase the buffer size between MLM and UNIX.

We think we know what to do for part 1 of the solution but is it
possible to increase the buffer size between midmand and UNIX for
receipt of traps? I know trapd provides an option to specify a UDP
receive buffer size but I can't see a similar option for midmand.
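
For reference, the UDP receive buffer is a per-socket setting that the receiving process requests for itself through the SO_RCVBUF socket option. The sketch below shows that generic mechanism on a fresh socket. It does not inspect or change the buffer used by midmand or trapd, and the function name and the 256 KiB request are assumptions made for the example.

```python
import socket


def udp_receive_buffer(requested_bytes=None):
    """Return the kernel's receive buffer size for a fresh UDP socket."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        if requested_bytes is not None:
            # The kernel may round, double or cap the requested value.
            sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested_bytes)
        return sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    finally:
        sock.close()


print("default:", udp_receive_buffer())
print("after requesting 256 KiB:", udp_receive_buffer(256 * 1024))
```

Because the option belongs to the socket's owner, a daemon's buffer can only be enlarged if the daemon itself exposes a setting for it or its code is changed.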

I would appreciate any comments or help on this problem.
Thanks

-- 
Robin
email: robin.james@thalesatm.com
tel:   +44 (0) 1633-862020
fax:   +44 (0) 1633-868313

---------------------------------------------------------------------
To unsubscribe, e-mail: nv-l-unsubscribe@lists.tivoli.com
For additional commands, e-mail: nv-l-help@lists.tivoli.com

*NOTE*
This is not an Official Tivoli Support forum. If you need immediate
assistance from Tivoli please call the IBM Tivoli Software Group
help line at 1-800-TIVOLI8(848-6548)






--- End Message ---
Archive operated by Skills 1st Ltd

See also: The NetView Web