You're correct. I was mistaken about what this variable controls. It is
not the reception buffer. It is the size of the buffer in which MLM stores
traps to be forwarded to NetView.
When that buffer reaches its limit, traps are lost: the oldest ones are
discarded as newer ones arrive. From your original
problem description, it seemed to me that you sent the same traps over and
over again, with no way to pick out an individual one, noting only that
not all of them arrived in NetView. So it was not clear to me that MLM
was losing traps sent to him, only that they ultimately did not make it to
NetView. So they could be lost "on the back end" because the outgoing
trap queue was too small, as well as on the front end because the MLM
reception buffer was too small and midmand was not fast enough to empty
it. The fact that MLM did not core dump or fall over, and managed to send
quite a few of the traps to NetView makes me think that the "back end"
case is worth investigating if you are trying to establish its processing
limits. But that's your call. So far as I know, no performance
benchmarks have ever been established by IBM/Tivoli. And now that MLM is
distributed free with NetView, it is unlikely that any ever will be.
James Shanks
Level 3 Support for Tivoli NetView for UNIX and NT
Tivoli Software / IBM Software Group
Robin James <robin.james@thalesatm.com>
05/08/2002 03:21 AM
To: NetView Discussion <nv-l@lists.tivoli.com>
cc:
Subject: Re: [nv-l] Loss of traps with MLM -- controlling the trap reception buffer size
What does "Max TCP Trap Queue" affect?
Does it affect the connection between MLM and trapd for traps received
by UDP or does it affect the queue between MLM and the port on which TCP
traps are received?
Robin James
Thales ATM
----- Message from "James Shanks" <jshanks@us.ibm.com> on Mon, 6 May 2002 10:05:31 -0400 -----
To: nv-l@lists.tivoli.com
Subject: Re: [nv-l] Loss of traps with MLM -- controlling the trap reception buffer size
I am glad you were able to find an acceptable compromise.
On the other hand, I found out that I was mistaken when I said that there
was no way to alter the size of the trap reception buffer in MLM. In fact
there is, but it is rather hidden. The MLM guy here showed it to me after
he found it. Even he wasn't sure it was there. It's contained in a MIB
variable (not a startup variable as it is for trapd). That MLM MIB
variable is:
smMlmProgramControlMaxTcpTrapQueue (.1.3.6.1.4.1.2.6.12.1.1.2.4.0)
You can set this thing from smconfig by pulling down the File menu and
clicking "Save MLM Configuration/Reinitialize". Then click "Modify" and
edit the box for "Max TCP Trap Queue". I don't know how big it can go.
The default is 2048. If you are still interested in playing around with
this, I'd go up by a factor of ten and see if it makes any difference.
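For those who prefer the command line to smconfig, a settable MIB variable like this one can in principle also be changed with snmpset. The sketch below is my assumption, not a documented procedure: it uses net-snmp style syntax, a placeholder host name, and a hypothetical write community, so adjust all three for your site.

```shell
# Sketch only: raise smMlmProgramControlMaxTcpTrapQueue by a factor of ten
# (default 2048 -> 20480). "mlmhost.example.com" and the "private" community
# are placeholders, and net-snmp syntax is assumed; smconfig remains the
# documented route.
snmpset -v1 -c private mlmhost.example.com \
    .1.3.6.1.4.1.2.6.12.1.1.2.4.0 i 20480
```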
James Shanks
Level 3 Support for Tivoli NetView for UNIX and NT
Tivoli Software / IBM Software Group
Robin James <robin.james@thalesatm.com>
05/03/2002 11:20 AM
To: NetView Discussion <nv-l@lists.tivoli.com>
cc:
Subject: [nv-l] Loss of traps with MLM
We eventually found a better filter for our situation. Two of our MLM
filter settings were:
smMlmFilterThrottleArmTrapCount = 20
smMlmFilterThrottleDisarmTimer = "1s"
We improved the filter by increasing the disarm timer to 1 minute. In
our trap storm we would receive the first 20 traps in the first second
and the longer timer means the filter blocks the forwarding of
traps for the rest of the minute. The impact is that MLM only lets
through 20 traps per minute per node instead of the previous 20 per
second.
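Spelled out in the same form as the filter settings quoted further down, the improved throttle is just:

```
smMlmFilterThrottleArmTrapCount = 20
smMlmFilterThrottleDisarmTimer  = "1m"
```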
We have also taken action to fix the software that was generating the
flood storm.
Thanks for your help James.
Robin James
Thales ATM
----- Message from "James Shanks" <jshanks@us.ibm.com> on Wed, 3 Apr 2002 08:54:27 -0500 -----
To: nv-l@lists.tivoli.com
Subject: Re: [nv-l] Loss of traps with MLM
Hmmm.
(1) I have no answer about your test and send-event, since it is not
generally available. From your description it should send traps via udp
and from that it would seem that there is a problem. But if you want to
pursue that with Support, you will have to go to Version 6 at least, and
soon: Version 6 will be out of support at the end of October of this year.
MLM already has two newer versions out than what you are using, though I
have no idea how they would fare in your test.
(2) NetView V6 has the same GUI as V5. It is not NetView which has the
problem with over 5 traps a second, it is X. The NetView GUI can process
the updates faster than X can refresh the screen, hence the flicker. This
is an architectural issue. Event processing is designed to be "bursty".
You can have short bursts which send a lot of stuff to the screens and
elsewhere. What you cannot do is maintain that level without a pause.
(3) The feature we call trap pruning is just a short set of matches that
trapd now does (V6 and above) on certain OIDs and specific trap IDs to
help prevent the queues of various connected daemons from being overloaded
with traps they neither need nor want. For example, ovtopmd used to get
all traps. This was used as a heartbeat mechanism to see whether trapd
was still there. This was removed and a new method substituted in V6.
Also netmon now only gets "Link up" and "Link down" traps from external
sources. All other external traps, and a subset of NetView ones, are
suppressed to keep his queues clear and his trap processing to a minimum.
This gives him more time for other things. Other affected processes are
ipmap and snmpCollect. They now only get subsets of traps, not all of
them. The idea was to not put the trap on the daemon's queue if all he
was going to do was throw it away. This helps keep the daemons connected
during a trap storm.
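The pruning idea -- do not enqueue a trap for a daemon that will only throw it away -- can be sketched in a few lines of shell. This is an illustration of the concept only, not NetView code; it models the netmon rule above by keeping just SNMPv1 generic traps 2 (linkDown) and 3 (linkUp) from a stream of "agent generic-trap detail" records:

```shell
#!/bin/sh
# Concept sketch (not NetView code): filter a trap stream so that only
# link up/down traps reach a netmon-like consumer's queue.
# Input lines: "<agent-address> <generic-trap-number> <detail...>"
prune_for_netmon() {
    while read agent generic rest; do
        case "$generic" in
            2|3) echo "$agent $generic $rest" ;;  # linkDown / linkUp: keep
            *)   ;;                               # everything else: drop
        esac
    done
}
```

Feeding it a mixed stream leaves only the link traps, which is the spirit of what trapd does for netmon's queue.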
But trap storms are still a problem, and always will be. They should be
stopped at the source, because your event processing, even if it does not
lose anything, will quickly fall behind, and it will always take longer to
catch up than it did to create the storm. You can wind up with the
event subsystem being hours behind, and NetView slowed to a crawl as a
result. You can take steps to help alleviate the problem, but
ultimately any solution can be overrun if you have enough out-of-control
devices sending hundreds or thousands of traps per second to your trap
receiver. And besides it is killing your bandwidth and throughput, so it
needs to be stopped at the devices which are doing it. And I don't mean
temporarily shutting down their SNMP daemon, unless that is only a
prelude to more definitive action. I mean configuring them so that
they don't send the same trap over and over again in a very short time.
It's pointless and a waste of time and resources.
James Shanks
Level 3 Support for Tivoli NetView for UNIX and NT
Tivoli Software / IBM Software Group
Robin James <robin.james@thalesatm.com>
04/03/2002 04:42 AM
To: NetView Discussion <nv-l@lists.tivoli.com>
cc:
Subject: Re: [nv-l] Loss of traps with MLM
James, thanks for the reply.
"send_event" is a simple programme written by us to send an event. It
uses the extensible SNMP protocol - I don't know if this protocol is
Compaq/Digital specific but it's available on TRU64 UNIX. It uses the
esnmp_init(), esnmp_poll() and esnmp_trap() functions provided by this
protocol and therefore uses UDP. It was written for sending a trap from
a node where Netview is not installed so snmptrap can't be used. In the
case of our test we could have used snmptrap because we were sending
local traps on the same node as Netview.
We will try raising the filter limits to see what the effect is.
You said the GUI begins to flicker at about 5 traps per second. Does
Netview 6 cope better with the display of traps at faster rates?
I am interested in the trap pruning facility in Netview 6. How does this
work?
We already have one problem which is not fixed in Netview 5.1.3 so if
the trap pruning facility helps us it might be another lever to persuade
our programme to upgrade.
--
Robin
email: robin.james@thalesatm.com
tel: +44 (0) 1633-862020
fax: +44 (0) 1633-868313
----- Message from "James Shanks" <jshanks@us.ibm.com> on Tue, 2 Apr 2002 13:23:26 -0500 -----
To: nv-l@lists.tivoli.com
Subject: Re: [nv-l] Loss of traps with MLM
I am not the MLM guy, but I do know that there is no way, documented or
undocumented, to alter any buffer sizes it uses without a code change. So
what you are looking for doesn't exist. Yet I am also not sure what to
say about your problem, because this is the first time I have ever heard
of MLM being accused of losing traps. Perhaps I should also point out
that NetView and MLM share no code whatsoever. If they did, then we could
not have an MLM on HP-UX. That would be prohibited by our original
purchase agreement with HP, just as a NetView for HP is. So MLM, while
it is shipped with NetView these days, remains a completely separate
product, code-wise. You cannot assume that a feature on one is the same
as on the other nor that you can willy-nilly substitute one for the other
and achieve the same result.
I am rather curious about your test procedure, since the command
"send_event" is not shipped by either NetView or MLM itself. What does
it do? Is it a command to MLM or to NetView? Did you write it yourself?
Does it have tracing or error logging associated with it? The reason I
ask is that it seems to me that if it opens a TCP connection to MLM to
cause the event to be sent, it may very well be that under the conditions
of your test, MLM was often too busy to open that connection, and thus it
may be that he did not lose any traps, but rather failed to send them in
the first place. There is a BIG difference between the two. If he
failed to send them, then perhaps you just need better error checking in
your command.
Also I am curious about your 1 second disarm timer. Since neither MLM nor
NetView for UNIX is multi-threaded, it can only do one thing at a time. If
you raise your limits, does the problem disappear? Even allowing one node
to send 20 traps per second is a sure way to bring your NetView processing
to a crawl, so this is not an unreasonable thing to do. Your NetView
events GUI will begin to flicker at about 5 events per second if you
display them, and without the trap pruning added in NetView Version 6 (not
sending unnecessary traps to the daemons who don't need them) your netmon
and ovtopmd will start falling far behind and may never catch up unless
re-cycled. They may just disconnect from trapd. And when that happens
ovtopmd will stop and wait for you to re-connect with ovstart.
I am not certain about what anyone can do for you under the circumstances
you describe. The code you have is out of support and a performance issue
involving it cannot be officially pursued. And it seems clear to me that
unless it is so pursued, with other people trying to duplicate your
results, there is very little that can be done, except to tell you that
you will have to live within the limits of the code you have. Sorry, but
I see no alternatives.
James Shanks
Level 3 Support for Tivoli NetView for UNIX and NT
Tivoli Software / IBM Software Group
Robin James <robin.james@thalesatm.com>
04/02/2002 09:58 AM
To: NetView Discussion <nv-l@lists.tivoli.com>
cc:
Subject: [nv-l] Loss of traps with MLM
We have been performing an experiment to determine if it is possible for
our Netview computer to lose locally generated traps.
We use Netview 5.1.3 on Compaq TRU64 UNIX and we also run MLM on the
same machine to use its filtering capability. We have set up a filter to
throttle traps with the following settings:
smMlmFilterName[BlockTrapFlooding] = "BlockTrapFlooding"
smMlmFilterState = enabled
smMlmFilterDescription = "Blocks traps when too many traps come from
the same host in a short time"
smMlmFilterAction = throttleTraps
smMlmFilterAgentAddrExpression = "cwps"
smMlmFilterThrottleType = sendAfterN
smMlmFilterThrottleArmTrapCount = 20
smMlmFilterThrottleArmedCommand = "/usr/sbin/Mlm_stop_snmpd.sh
$SM6K_TRAP_AGENT_ADDRESS"
smMlmFilterThrottleDisarmTimer = "1s"
smMlmFilterThrottleDisarmTrapCount = 0
smMlmFilterThrottleDisarmedCommand = "snmptrap -p 1675 localhost omc
.1.3.6.1.4.1.1254.1 `hostname` 6 104 1 .1.3.6.1.2.1.1.5 OctetStringASCII
$SM6K_TRAP_AGENT_ADDRESS"
smMlmFilterThrottleCriteria = byNode
smMlmAliasName[cwps] = "cwps"
smMlmAliasList = "w1161,
w1162,
w2142"
As you can see from the settings, an alias is also set up so that the
traps generated on the Netview node are not subject to the filter.
We set up 3 nodes to send traps repeatedly using the following script:
while true
do
    send_event 803 "swamp test"
done
This gave approximately 2200 traps in the trapd log in one minute. Using
vmstat it could be seen that the Netview node had very little idle time.
We then used send_event on the Netview node to send single traps. We
observed that 1 out of 4 events was not present in the trapd or midmand
logs.
This seems to confirm that a trap can be lost when the Netview node is
receiving a very heavy load of traps.
We also performed the same test with the use of MLM removed, so traps go
directly to trapd and not via midmand, and found no traps were lost.
It appears to me that the buffering between UNIX and trapd does not lose
the locally generated events but when MLM is filtering it is possible to
lose traps.
Is it possible to find out the UDP receive buffer size with each
configuration?
I realise that the source of the problem is the node flooding our
Netview node with traps. We must stop this node from sending so many
events. We are trying to put a two-part solution in place to ensure we
do not lose locally generated traps. The two parts are:
1. When MLM detects a node is flooding Netview with traps we will freeze
snmpd on that node so traps do not get sent.
2. Increase the buffer size between MLM and UNIX.
We think we know what to do for part 1 of the solution but is it
possible to increase the buffer size between midmand and UNIX for
receipt of traps? I know trapd provides an option to specify a UDP
receive buffer size but I can't see a similar option for midmand.
I would appreciate any comments or help on this problem.
Thanks
--
Robin
email: robin.james@thalesatm.com
tel: +44 (0) 1633-862020
fax: +44 (0) 1633-868313
---------------------------------------------------------------------
To unsubscribe, e-mail: nv-l-unsubscribe@lists.tivoli.com
For additional commands, e-mail: nv-l-help@lists.tivoli.com
*NOTE*
This is not an Official Tivoli Support forum. If you need immediate
assistance from Tivoli please call the IBM Tivoli Software Group
help line at 1-800-TIVOLI8(848-6548)