Hmmm. This is a complex failure scenario, and it is quite difficult to
determine everything that is involved.
Remember that nvpagerd manages a queue of pages to send and spawns a child
process, nvpager, to do the actual dialing and sending. That's the process
that is getting too many consecutive errors and being cancelled. Why? Well,
it appears that the errors we are talking about are these:
Error: Cannot initialize modem after 5 retries.
The problem here is that this is all we know. Apparently, there were pages
in the queue to send, nvpagerd spawned an nvpager process to send them, but
for some reason it could not initialize the modem. Why not? Well, that's not
clear, but it may be that the previous call, which had problems, did not
terminate properly, and so the modem was still in a busy state.
Apparently this problem sometimes clears itself, because the log shows that
paging resumed without intervention after some of these errors, and
sometimes it does not.
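To see which runs of modem errors cleared on their own, one could tally them
from the log. Here is a minimal Python sketch, not a NetView tool: it assumes
the nvpagerd.alog layout in the excerpt quoted below, where each failed modem
init is logged as "Error: Cannot initialize modem after 5 retries." and a
recovery shows up as a timestamped SUCCEEDED line. The function name
modem_error_runs is made up for illustration.

```python
import re

# Substrings/patterns taken from the nvpagerd.alog excerpt in this thread.
MODEM_INIT_ERROR = "Cannot initialize modem"
SUCCESS_RE = re.compile(r"^\d\d/\d\d/\d\d \d\d:\d\d:\d\d SUCCEEDED ")


def modem_error_runs(lines):
    """Return the lengths of consecutive runs of modem-init errors.

    A run ends when a SUCCEEDED entry appears, i.e. paging recovered on
    its own. A run still open at the end of the input is reported last,
    since it may be the one that preceded the daemon stopping.
    """
    runs = []
    current = 0
    for raw in lines:
        line = raw.strip()
        if MODEM_INIT_ERROR in line:
            current += 1
        elif SUCCESS_RE.match(line) and current:
            runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs
```

You would call it on an open file object for your copy of nvpagerd.alog; a
final run noticeably longer than the earlier ones would likely be the one
that led to "Too many consecutive child errors."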
I'm not sure what could be done to get out of a situation like this, but
perhaps one could make configuration changes to avoid getting into it in
the first place. What kind of problems did the previous call(s)
experience? Here are a few clues.
Error: Carrier MCLocal rejected message or pager ID.
Error: Carrier MCNational did not respond to end of transactions.
INFO: ready for next transaction on this call to carrier
Error: Carrier verizon rejected message or pager ID.
First, it would appear that some of the pager PINs are not correct, or that
the maximum number of transactions per call specified in the nv.carriers
file is too high. You might profitably try to verify both. The "did not
respond" error indicates that the carrier had already hung up before we
terminated our end of the call, so they probably aren't expecting more than
one page per call. The INFO statement indicates that the nv.carriers file
allows more than one transaction per call. I realize that I have taken
these lines out of context; I just wanted to explain what they might tell
you about what is going on.
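To verify the PINs, it may help to list which pager ID and carrier each
FAILED page was sent to, along with the error logged just before it. The
sketch below is hypothetical Python, not part of NetView; it assumes the
three-line pattern seen in the nvpagerd.alog excerpt quoted below (an
"Error:" line, then the FAILED status line, then a "[user@host]
PIN@carrier: text" line) and simply skips anything that does not fit.

```python
import re

# Patterns modeled on the alog excerpt in this thread (an assumption).
ERROR_RE = re.compile(r"^Error: (?P<reason>.+)$")
STATUS_RE = re.compile(
    r"^\d\d/\d\d/\d\d \d\d:\d\d:\d\d (?P<status>SUCCEEDED|FAILED) \(ID (?P<id>\d+)"
)
TARGET_RE = re.compile(r"^\[[^\]]+\] (?P<pin>[^@\s]+)@(?P<carrier>[^:\s]+):")


def failed_pages(lines):
    """Yield (page_id, pin, carrier, reason) for each FAILED entry."""
    last_error = None   # most recent "Error:" reason seen
    pending_id = None   # ID of a FAILED page whose target line comes next
    for raw in lines:
        line = raw.strip()
        m = ERROR_RE.match(line)
        if m:
            last_error = m.group("reason")
            continue
        m = STATUS_RE.match(line)
        if m:
            if m.group("status") == "FAILED":
                pending_id = int(m.group("id"))
            else:
                # A success means earlier errors don't belong to later pages.
                last_error = None
                pending_id = None
            continue
        m = TARGET_RE.match(line)
        if m and pending_id is not None:
            yield pending_id, m.group("pin"), m.group("carrier"), last_error
            pending_id = None
            last_error = None
```

Counting the (pin, carrier) pairs from this output with collections.Counter
would show whether the rejections cluster on a few PINs or on one carrier.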
Whether this will help or not, I'm not sure, but it might bear looking
into.
James Shanks
Level 3 Support for Tivoli NetView for UNIX and Windows
Tivoli Software / IBM Software Group
"Bursik, Scott {PBSG}"
<Scott.Bursik@pbsg.com>
Sent by: owner-nv-l-digest@lists.us.ibm.com
To: "'nv-l@lists.us.ibm.com'" <nv-l@lists.us.ibm.com>
cc:
Subject: RE: [nv-l] nvpagerd dies
10/21/03 02:30 PM
Please respond to nv-l
Errpt doesn't give any clues. There are no failures listed there anywhere
near that time.
Most of the time nvpagerd works just fine, and then for some reason the
pages stop going out to the carrier. Here is a sample of the nvpagerd.alog
from when the process dies:
do_alpha[1649] INFO: ready for next transaction on this call to carrier
10/21/03 11:03:22 SUCCEEDED (ID 4600 Pri 3 Secs 199 Tries 1)
[applmgr@pbsxap00076.fritolay.pvt] 1474379@mcnational: Concurrent manager problem in ITNHD
'[p
' 4
Error: Carrier MCLocal rejected message or pager ID.
10/21/03 11:04:17 FAILED (ID 4595 Pri 3 Secs 601 Tries 1)
[root@pbsxis00010.fritolay.pvt] 3123931877@mclocal: Taking hmc02 down shortly to add 128 port adapter if no objections. DH 3368
do_alpha[1649] INFO: ready for next transaction on this call to carrier
10/21/03 11:04:21 SUCCEEDED (ID 4601 Pri 3 Secs 258 Tries 1)
[applmgr@pbsxap00076.fritolay.pvt] 9726090270@mclocal: Concurrent manager problem in ITNHD
'[p
' 4
Error: Carrier verizon rejected message or pager ID.
10/21/03 11:05:06 FAILED (ID 4596 Pri 3 Secs 650 Tries 1)
[root@pbsxis00010.fritolay.pvt] 3123931890@verizon: Taking hmc02 down shortly to add 128 port adapter if no objections. DH 3368
Error: Cannot initialize modem after 5 retries.
'[p
' 4
do_alpha[1649] INFO: ready for next transaction on this call to carrier
10/21/03 11:11:58 SUCCEEDED (ID 4603 Pri 3 Secs 86 Tries 1)
[oracle@pbsxdr00034.fritolay.pvt] 1407346@mclocal: dr34 dbClone: sqlplus unable to retrieve file names Tue Oct 21 11:08:12 2003
Error: Carrier MCNational did not respond to end of transactions.
'[p
' 4
do_alpha[1649] INFO: ready for next transaction on this call to carrier
10/21/03 11:13:15 SUCCEEDED (ID 4604 Pri 3 Secs 153 Tries 1)
[bfloyd@pbsxis00001.fritolay.pvt] 1090913@mcnational: Just got your vm. call me. bf x6341
Error: Cannot initialize modem after 5 retries.
Error: Cannot initialize modem after 5 retries.
'[p
' 4
do_alpha[1649] INFO: ready for next transaction on this call to carrier
10/21/03 11:25:41 SUCCEEDED (ID 4607 Pri 3 Secs 50 Tries 1)
[root@pbsxcm00017.fritolay.pvt] 1126045@mcnational: please give me a call at 6338 Nate
Error: Cannot initialize modem after 5 retries.
Error: Cannot initialize modem after 5 retries.
Error: Cannot initialize modem after 5 retries.
oracle, pbsxzz00002.fritolay.pvt, $mesgError: Cannot initialize modem after 5 retries.
Error: Cannot initialize modem after 5 retries.
Error: Cannot initialize modem after 5 retries.
oracle, pbsxdr00016.pepsi.com, $mesgError: Too many consecutive child errors.
Problem with child. 4
10/21/03 12:01:16 - Error: select() failed. : A file descriptor does not refer to an open file.
10/21/03 12:01:21 - STOP /usr/OV/bin/nvpagerd as PID 54128
Scott Bursik
Enterprise Systems Management
PepsiCo Business Solutions Group
scott.bursik@pbsg.com
(972) 963-1400
________________________________________
From: James Shanks [mailto:jshanks@us.ibm.com]
Sent: Tuesday, October 21, 2003 12:08 PM
To: nv-l@lists.us.ibm.com
Subject: Re: [nv-l] nvpagerd dies
Look at errpt (run errpt -a > myfile) and see what else is happening
around that time.
There is no error level in nvpagerd. That's an OS-level message complaining
about the errors.
I have only two guesses.
Is it possible that you have paging enabled but your modem or tty device is
inactive? That's the only "file" I can think of that might be at issue here
(all UNIX I/O is to a "file").
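For what it's worth, the select() text quoted in this thread is, as far as I
know, how AIX words EBADF, the error select() returns when one of the
descriptors it is asked to watch is not an open file. A small Python sketch
reproduces that condition, using a pipe as a stand-in for the modem or tty
device (an assumption for illustration only; this is not how nvpagerd opens
its device, and the function name is hypothetical):

```python
import errno
import os
import select


def select_on_closed_descriptor():
    """Ask select() to watch a descriptor that has already been closed.

    Returns (errno value, OS error text) if select() rejects it,
    or (None, None) if it unexpectedly succeeds.
    """
    read_fd, write_fd = os.pipe()   # stand-in for the modem/tty "file"
    os.close(read_fd)               # the descriptor number is now stale
    try:
        select.select([read_fd], [], [], 0)
    except OSError as exc:
        return exc.errno, os.strerror(exc.errno)
    finally:
        os.close(write_fd)          # clean up the end we still hold
    return None, None
```

On Linux the text comes back as "Bad file descriptor" rather than the AIX
wording; the errno value, EBADF, is the part that matters.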
If there is no nvpager.warm file, it should work fine, unless perhaps you
deleted it while the daemon was active and expecting to write to it.
James Shanks
Level 3 Support for Tivoli NetView for UNIX and Windows
Tivoli Software / IBM Software Group
"Bursik, Scott {PBSG}" <Scott.Bursik@pbsg.com>
Sent by: owner-nv-l-digest@lists.us.ibm.com
10/21/2003 12:16 PM
Please respond to nv-l
(nv-l@lists.us.ibm.com)'" <nv-l@lists.us.ibm.com>
cc:
NetView 7.1.3 AIX 4.3.3
I have had several instances where nvpagerd dies, and when I look at the
nvpagerd.errlog, I see the following message every time:
Error(/usr/OV/bin/nvpagerd): Too many consecutive child errors.
10/20/03 16:18:39 - Error(/usr/OV/bin/nvpagerd): select() failed. : A file descriptor does not refer to an open file.
I don't see anything in the nvpagerd.alog or nvpagerd.blog that is
obviously an issue. Has anyone seen this? Is there a way of bumping up the
error level in nvpagerd?
Thanks,
Scott