Re: Availability

To:	nv-l@lists.tivoli.com
Subject:	Re: Availability
From:	Ray Schafer <schafer@tkg.com>
Date:	Thu, 2 Sep 1999 00:48:48 -0400

Hey Rob!

Using NetView Node Up/Node Down traps may not give you what you want.  The
issues are:

   * Not all Node Up traps will have corresponding Node Down traps - especially
     for routers.  For example, an interface down trap on the router will cause
     a Node Marginal trap, and when the interface comes up again you'll get a
     Node Up trap - without a Node Down.
   * For routers that have administratively down interfaces (when an
     administrator manually brings the interface down on the router), NetView
     will never mark the router as down.  Even if the router is under water!
   * Network problems with the NetView server, the MLM's default router, or any
     router in between you and the endpoint will cause NetView to mark the node
     or interface down (if it is polling it at the time) even though it is not
     really down, just unpingable from the NetView server or MLM.

Now for the good news:  This may be addressed in the next version of NetView if
"snmp" polling is engaged.  This will actually look at the uptime from the
system tree of the device's MIB.  You could probably write a script to do the
same for now:  for every Node Up trap you get, fire off an snmpget of
system.sysUpTime (I think that's it - do "snmpwalk <node> system" to see!).  If
the uptime is just a few minutes then it is really an outage; if it is more
than your polling cycle, then it is bogus.  Be careful though: if you fire off a
bunch of these snmpgets when you are flooded with up traps you could exhaust
system resources!
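
The uptime check described above could look something like the sketch below.
It is only an illustration: the function names are made up, the threshold is
simply your polling cycle, and the parsing assumes snmpget output in the
Net-SNMP style ("Timeticks: (123456) 0:20:34.56"), which may not match what
your snmpget prints.  sysUpTime is reported in hundredths of a second.

```python
import re

def parse_ticks(snmpget_output):
    """Pull the raw TimeTicks value out of one line of snmpget output.

    Assumption: the raw tick count appears in parentheses, as in
    'Timeticks: (123456) 0:20:34.56'.  Adjust for your tool's format.
    """
    match = re.search(r"\((\d+)\)", snmpget_output)
    if match is None:
        raise ValueError("no TimeTicks value found in: %r" % snmpget_output)
    return int(match.group(1))

def classify_node_up(sysuptime_ticks, polling_cycle_seconds):
    """Decide whether a Node Up trap follows a real restart of the device.

    If the device has been up for less than one polling cycle it really
    restarted ('outage'); if it has been up longer, the device itself never
    went away and the earlier down status was bogus.
    """
    uptime_seconds = sysuptime_ticks / 100.0  # TimeTicks are 1/100 second
    if uptime_seconds < polling_cycle_seconds:
        return "outage"
    return "bogus"
```

To avoid the resource problem, you would want to queue the Node Up traps and
run a limited number of these checks at a time rather than one per trap.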

Christian,

Maybe you could use the snmpCollect facilities to attack the problem in a more
efficient way:  Set up a collection for your servers and another for your
routers.  Create a MIB Expression to store the value "0 - system.sysUpTime.0"
for each member of the collection.  I think that this is collected by
snmpCollect as a counter - which means that it will report the difference
between the last sample and this one.  The reason for the "0 - value"
expression is that snmpCollect only takes action when the variable or
expression is greater than some number (in our case we are looking for this
expression to be greater than 0, which happens only when the uptime has gone
backwards since the last sample).  Create a specific trap for this threshold
event, and as an action of this trap, run a command that will parse the
trapd.log file looking for the NetView events (up/down/marginal) to get a
closer approximation of when the node went down and came back up!  Collecting
this once a day won't be overkill, and unless your node goes down every day,
this should work fairly well.
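
To show why the "0 - value" trick works, here is a small sketch of the
arithmetic, assuming the expression really is evaluated on the difference
between two successive sysUpTime samples.  The function names are invented for
the example; this is not snmpCollect's own logic.

```python
def threshold_expression(previous_ticks, current_ticks):
    """Value of '0 - delta(sysUpTime)' between two successive samples.

    While the device stays up, sysUpTime grows, the delta is positive and
    the expression is negative.  After a restart the counter starts again
    from zero, the delta is negative and the expression becomes positive.
    """
    delta = current_ticks - previous_ticks
    return 0 - delta

def restarted_between_samples(previous_ticks, current_ticks):
    """True when the threshold 'expression > 0' would be crossed."""
    return threshold_expression(previous_ticks, current_ticks) > 0
```

Two caveats with this approach: a restart is missed if the device has already
been up longer at the new sample than it had been at the previous one, and
sysUpTime wraps back to zero after roughly 497 days, which would look like a
restart.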

Rob Napholz wrote:

> Pham could you post your perl script to the group
> and save us all some time.
>
> thanks Rob
>
> Pham Isaak V wrote:
> >
> > First create a ruleset to detect Node Up/Down traps, then compare the traps
> > to a collection of servers or routers.
> >
> > If device is a router, log the event to a router logfile.  If device is a
> > server, log the event to a server logfile.  The logs should contain the
> > following fields:
> >
> >         device name
> >         status of device (up/down)
> >         time of status change (day, hours, & minutes)
> >
> > At the end of the month, run a script or program against the logfiles.  The
> > program or script (Perl in my case) matches each device down with its
> > corresponding device up.  Now subtract the time of the device down trap
> > from the time of the device up trap.  This will give you the length of time
> > the device was down.  Convert the days and hours to minutes.  Match up all
> > the other down/up traps associated with the same device.  Add them all
> > together and you should have the total number of minutes the device was
> > down for the month.
> >
> > Now, subtract the total number of minutes the device was down from
> > 43200 [(24 hours * 60 minutes) * 30 days = # of minutes in a month].  Take
> > that value and divide by 43200.  Multiplied by 100, this will give you the
> > percentage of availability for the device.
> >
> > This method is not 100% accurate, but it has to do for now.  I hope someone
> > else has a better way of doing this.
> >
> > Hint:  This would be a great addition to the next release of NetView.
> >
> > -----Original Message-----
> > From: Frantsen Christian [mailto:cf@INTERNOC.SE]
> > Sent: Wednesday, September 01, 1999 6:00 AM
> > To: NV-L@UCSBVM.ucsb.edu
> > Subject: Availability
> >
> > Hi!
> >
> > I would like to (with help from sysUptime) gather information and then
> > present this to a customer as a single number, e.g.
> >
> > Your availability this month on these routers/servers/etc has been 99.7%
> >
> > Has anyone made something like this? Perhaps someone could give me a few
> > pointers on how to do this as easily as possible.
> >
> > -----------------------------------------
> > Christian Frantsen
> > Technical Operations
> >
> > Internoc Scandinavia AB
> > Tel: +46-36-194843
> > Fax: +46-36-194651
> > http://www.internoc.se
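
For what it's worth, the monthly arithmetic Pham describes above can be
sketched like this.  It is not Pham's Perl script: the event format (a list of
timestamp/status pairs for one device, already sorted by time) and the function
names are assumptions made for the example, and it keeps the same 30-day month.

```python
from datetime import datetime

MINUTES_PER_MONTH = 24 * 60 * 30  # 43200, the 30-day month used above

def downtime_minutes(events):
    """Total minutes down for one device.

    events: list of (datetime, status) pairs, status 'up' or 'down',
    sorted by time.  Each 'down' is matched with the next 'up'; an 'up'
    with no preceding 'down' is ignored.
    """
    total = 0.0
    down_since = None
    for timestamp, status in events:
        if status == "down" and down_since is None:
            down_since = timestamp
        elif status == "up" and down_since is not None:
            total += (timestamp - down_since).total_seconds() / 60.0
            down_since = None
    return total

def availability_percent(down_minutes, period_minutes=MINUTES_PER_MONTH):
    """(period - downtime) / period, expressed as a percentage."""
    return (period_minutes - down_minutes) / period_minutes * 100.0

events = [
    (datetime(1999, 8, 3, 10, 0), "down"),
    (datetime(1999, 8, 3, 12, 10), "up"),
]
print(round(availability_percent(downtime_minutes(events)), 1))  # 99.7
```

Given the issues listed at the top of this note, the up/down pairs fed into
this are only as good as the traps themselves.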

--
Ray Schafer                   | schafer@tkg.com
The Kernel Group              | Distributed Systems Management
http://www.tkg.com
