
Re: [nv-l] appl queue size question

To: nv-l@lists.us.ibm.com
Subject: Re: [nv-l] appl queue size question
From: James Shanks <jshanks@us.ibm.com>
Date: Wed, 25 May 2005 11:33:11 -0400
Delivery-date: Wed, 25 May 2005 16:34:05 +0100
Envelope-to: nv-l-archive@lists.skills-1st.co.uk
In-reply-to: <OFA67F3850.53E6D2B6-ON8525700C.0050BFDE-8525700C.0053FB87@ca.ibm.com>
Reply-to: nv-l@lists.us.ibm.com
Sender: owner-nv-l@lists.us.ibm.com
The "ungraceful" messages,
      servmon probably died_ ungracefully disconnected from trapd
      netmon probably died_ ungracefully disconnected from trapd
are not a result of trapd forcing the application off.  Rather, they are the
result of him trying to queue an event to that process and finding that
its end of the pipe, its socket, is gone.  This can be the result
of a core dump, but not necessarily.  In this case you would have to consult the
trace logs for that daemon, or nettl perhaps, to find out what led up to
this issue.  I don't know about servmon, but I do believe, though I am not
certain, that netmon will re-try his trapd connection if he loses it.  The
netmon trace would be the place to look for problems like that.

The "max number" message
      netmon\-related Application reached maximum number of outstanding events_ disconnecting from trapd\.
is the one that the appl queue size is about.

This message indicates that trapd ended his side of the connection because
it did not appear that the process on the other side was still active.   If
you are seeing this with netmon, then once again I would look at the netmon
trace and nettl.  There is already code in trapd to forward only those
events to netmon which he really needs to see, so if his queue got really
big, then most likely he had a real problem of some sort.   The only events
which might exceed netmon's ability to handle them, that I can think of,
would be Cisco Link Up and Link Down.  You could conceivably get enough of
those in a short time to overload the queue.

But in any case, raising the appl queue size is just a band-aid.  IF (big
IF) there is no problem with the daemons, and the cause is just a simple
trap storm of moderate duration, then raising the queue size is the right
thing to do.   But if you raise the queue size significantly, and it
doesn't help, that means there is another problem here that needs to be
addressed.


James Shanks
Level 3 Support  for Tivoli NetView for UNIX and Windows
Tivoli Software / IBM Software Group


                                                                           
Francois Le Hir <flehir@ca.ibm.com>
Sent by: owner-nv-l@lists.us.ibm.com
05/25/2005 11:14 AM
Please respond to: nv-l

To: nv-l@lists.us.ibm.com
cc:
Subject: Re: [nv-l] appl queue size question

James,

I have seen the question on the list recently but didn't see any answer and
I think it is related to this thread.
Every day I get several netview traps IBM_NVFERR_EV (specific 58851330)
with messages like this:

servmon probably died_ ungracefully disconnected from trapd
netmon probably died_ ungracefully disconnected from trapd
netmon\-related Application reached maximum number of outstanding events_ disconnecting from trapd\.

However, no daemon seems to fail, or they restart by themselves, as I don't
have to restart anything.

A while ago I tried to address this issue with support (to at least
understand the meaning of these traps) and what I was told is to increase
the "appl queue size". It is now set to 25000 on my system, and even if it
(probably) reduced the number of traps I am getting, I still get some
almost every day.
Does this high number of 25000 do any good?
Running the same (or similar) script as Scott never shows any high value for
the queue usage.

Thanks
Salutations, / Regards,

Francois Le Hir
Network Projects & Consulting Services
IBM Global Services
Phone: (514) 964 2145



James Shanks <jshanks@us.ibm.com>
Sent by: owner-nv-l@lists.us.ibm.com
05/25/2005 10:23 AM
Please respond to: nv-l

To: nv-l@lists.us.ibm.com
cc:
Subject: Re: [nv-l] appl queue size question

Scott,

I hesitate to say it, but the phrase, "Luke, you are messing with powers
you cannot possibly understand," comes to mind.
(Guess which movie we saw recently?)   And I'll apologize now for that
feeble attempt at humor, while I attempt to answer your question.  Of
course, you can understand, once someone explains what you are actually
looking at.  So here goes.

Basically, you have an application queue size of 5000 events, period.  The
55042 is a process id, and is irrelevant.  That's all the trace tells you
at this time, except that the queues are not backed up, since you are
seeing  one event being added, and then immediately deleted.  Running this
script when you actually have a problem with events being behind might tell
you how close the appl queues are to being full, but running it now when
you don't have a problem, tells you nothing much.  By itself,  this script
is not a performance  analysis tool, but only a diagnostic aid.
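
If the trace is collected while events really are backed up, the "appl queue size" lines can be summarized rather than eyeballed. What follows is only a minimal sketch, assuming each trace entry sits on one line in the format quoted in this thread ("... [id] appl queue size N of maximum M events"). The function name queue_peaks is made up for illustration and is not part of NetView.

```shell
#!/bin/sh
# Hypothetical helper (not a NetView tool): report the highest queue depth
# seen for each bracketed id in a trapd trace.
# Assumes one entry per line, in the format quoted in this thread:
#   <timestamp> send_to_all_appls: [<id>] appl queue size <N> of maximum <M> events
queue_peaks() {
    awk '
    /appl queue size/ {
        id = ""; size = ""; max = ""
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^\[[0-9]+\]$/) id = $i          # e.g. [55042]
            if ($i == "size")        size = $(i + 1)  # current depth
            if ($i == "maximum")     max = $(i + 1)   # configured limit
        }
        if (id == "" || size == "") next              # skip malformed lines
        if (!(id in peak) || size + 0 > peak[id]) peak[id] = size + 0
        limit[id] = max
    }
    END {
        for (id in peak)
            printf "%s peak %d of %s\n", id, peak[id], limit[id]
    }' "$@" | sort
}
```

It could be run as `queue_peaks /usr/OV/log/trapd.trace` after tracing has been toggled off. A peak that stays near 1 says only that nothing was backed up during the trace window.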

Now, since the default application queue size in trapd is 2000 events,
yours has already been changed at least once and is more than double the
usual amount.  Apparently someone has been tuning this before.   So what
problem are you trying to solve, what symptoms are you seeing?

This queue size determines how many events trapd will pass to a connected
application which is not responding (or responding too slowly) before he
closes its socket connection to him.  He does so in order to avoid his
own demise from lack of storage.   Usually, the only reason to alter this
size is that you have periodic trap storms, so the connected applications
get a whole bunch of traps all at once, and after the storm initially
subsides they have a lot to do to catch up.  So you raise the size of the
queues to hold more events so they can do that.  Otherwise, they get forced
off and all the events in the queue for them are discarded.  Sometimes that
really is the best thing to do, let them get forced off, and sometimes not.
It's a trade-off.  If they don't get forced off, then they get backed up,
and it may take a while for them to catch up.
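
The trade-off can be made concrete with a toy model. This is a sketch only and not trapd's actual logic: the function name simulate_queue, the idea of a "tick", and all the rates are invented for illustration. Events arrive at a fixed rate during a storm, the application drains at a slower rate, and the connection is dropped as soon as the backlog exceeds the queue size.

```shell
#!/bin/sh
# Toy model only; NOT trapd's real code. All names and numbers are illustrative.
# Usage: simulate_queue MAX RATE_IN STORM_TICKS RATE_OUT
#   MAX          appl queue size (events)
#   RATE_IN      events arriving per tick while the storm lasts
#   STORM_TICKS  how many ticks the storm lasts
#   RATE_OUT     events the application consumes per tick (must be > 0)
simulate_queue() {
    max=$1 rate_in=$2 storm_ticks=$3 rate_out=$4
    if [ "$rate_out" -le 0 ]; then
        echo "RATE_OUT must be greater than zero" >&2
        return 2
    fi
    depth=0 tick=0
    while [ "$tick" -lt "$storm_ticks" ] || [ "$depth" -gt 0 ]; do
        tick=$((tick + 1))
        # New events are queued only while the storm is still going.
        if [ "$tick" -le "$storm_ticks" ]; then
            depth=$((depth + rate_in))
        fi
        # Backlog over the limit: drop the connection, discard the queue.
        if [ "$depth" -gt "$max" ]; then
            echo "tick $tick: forced off, $depth events discarded"
            return 1
        fi
        # The application works off part of its backlog.
        depth=$((depth - rate_out))
        if [ "$depth" -lt 0 ]; then
            depth=0
        fi
    done
    echo "caught up after $tick ticks"
}
```

With these made-up numbers, `simulate_queue 2000 500 10 100` prints "tick 5: forced off, 2100 events discarded", while `simulate_queue 5000 500 10 100` prints "caught up after 50 ticks". The larger queue survives the storm, at the price of the application running 40 ticks behind after the storm ends.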

Unfortunately, there is no tool I know of which can tell you how big you
should make the application queue size if you don't want the appls forced
off.   And I should know.  I'm responsible for trapd maintenance.  Like
most tuning issues, picking an application queue size other than the
default is a trial-and-error business.

James Shanks
Level 3 Support  for Tivoli NetView for UNIX and Windows
Tivoli Software / IBM Software Group



"Bursik, Scott {PBSG}" <Scott.Bursik@pbsg.com>
Sent by: owner-nv-l@lists.us.ibm.com
05/25/2005 09:15 AM
Please respond to: nv-l

To: "'Nv-L (nv-l@lists.us.ibm.com)'" <nv-l@lists.us.ibm.com>
cc:
Subject: [nv-l] appl queue size question

All,

I am having some performance issues with my production NetView server, and
in an effort to diagnose the issue I ran a script that someone from the
forum contributed a while back. It checks the appl queue.

When I run the script I get the following output:

Turning on trapd tracing
Starting tracing now....
Toggling trace mode of SNMP trap daemon
Waiting for one minute---------------------------|
.................................................
Stopping Tracing....
Toggling trace mode of SNMP trap daemon
Getting trapd status from /usr/OV/log/trapd.trace
Wed May 25 08:04:17 2005 send_to_all_appls: [0] appl queue size 1 of maximum 5000 events
Wed May 25 08:04:17 2005 send_to_all_appls: [55042] appl queue size 1 of maximum 5000 events

Should I be concerned with the last line? If I am reading this correctly I
am configured for a max of 5000 events and I have a queue size of 55042. I
would say that the appl queue size needs to be changed. We have a very
large environment.


Here is the script so you can see what it is doing:

#!/usr/bin/ksh
clear
echo > /usr/OV/log/trapd.trace
echo "Turning on trapd tracing"
        echo ""
        echo ""
echo "Starting tracing now...."
/usr/OV/bin/trapd -T
        echo ""
        echo ""
# Progress indicator
while :; do
        sleep 1
        echo ".\c"
done &
Progress=$!
# Stop the progress indicator if interrupted (signal 9 cannot be trapped)
trap "kill $Progress; exit 1" 2 15
echo "Waiting for one minute---------------------------|"
sleep 50
kill $Progress
        echo ""
        echo ""
echo "Stopping Tracing...."
/usr/OV/bin/trapd -T
        echo ""
        echo ""
echo "Getting trapd status from /usr/OV/log/trapd.trace"
tail /usr/OV/log/trapd.trace | grep "appl queue size"
Thank You!

Scott Bursik







