Availability is hard to manage. I would like to know more about how
availability is managed in the next version of NetView, and whether other
tools are required to manage and report on the data.
Does anyone know when the next version of NetView will be released and what
other features it may have?
-----Original Message-----
From: Ray Schafer [mailto:schafer@TKG.COM]
Sent: Thursday, September 02, 1999 4:49 PM
To: NV-L@UCSBVM.UCSB.EDU
Subject: Re: Availability
Hey Rob!
Using NetView Node Up/Node Down traps may not give you what you want. The
issues are:
* Not all Node Up traps will have corresponding Node Down traps -
  especially for routers. For example, an interface down trap on the router
  will cause a Node Marginal trap, and when the interface comes up again
  you'll get a Node Up trap - without a Node Down.
* For routers that have administratively down interfaces (when an
  administrator manually brings the interface down on the router), NetView
  will never ever mark the router as down. Even if the router is under
  water!
* Network problems with the NetView server, the MLM's default router, or
  any router in between you and the endpoint will cause NetView to mark the
  node or interface down (if it is polling it at the time) even though it is
  not really down, just unpingable from the NetView server or MLM.
Now for the good news: This may be addressed in the next version of NetView
if "snmp" polling is engaged. This will actually look at the uptime from the
system tree of the device's MIB. You could probably write a script to do the
same for now: for every Node Up trap you get, fire off an snmpget of
system.sysUpTime (I think that's it - do "snmpwalk <node> system" to see!).
If the uptime is just a few minutes, then it is really an outage; if it is
more than your polling cycle, then it is bogus. Be careful though: if you
fire off a bunch of these snmpgets when you are flooded with up traps, you
could exhaust system resources!
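
A rough sketch of that check, in Python rather than anything NetView-specific.
The names parse_timeticks and is_real_outage are made up for illustration, and
I am assuming your snmpget prints the raw tick count in parentheses (as the
Net-SNMP tools do); adjust the parsing if yours prints something else.

```python
import re

TICKS_PER_SECOND = 100  # sysUpTime is counted in hundredths of a second


def parse_timeticks(snmpget_output):
    """Extract the raw tick count from a line such as
    'system.sysUpTime.0 = Timeticks: (123456) 0:20:34.56' (assumed format)."""
    match = re.search(r"\((\d+)\)", snmpget_output)
    if match is None:
        raise ValueError("no Timeticks value in: %r" % snmpget_output)
    return int(match.group(1))


def is_real_outage(uptime_ticks, polling_cycle_seconds):
    """True if the node has been up for less than one polling cycle,
    i.e. it really restarted; False if the Node Up looks bogus."""
    return uptime_ticks / TICKS_PER_SECOND < polling_cycle_seconds
```

You would call is_real_outage(parse_timeticks(line), 300) for a five-minute
polling cycle. To stay clear of the resource problem, queue the Node Up events
and work through them a few at a time instead of spawning one snmpget per trap.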
Christian,
Maybe you could use the snmpCollect facilities to attack the problem in a
more efficient way: Set up a collection for your servers and another for
your routers. Create a MIB Expression to store the value
"0 - system.sysUpTime.0" for each member of the collection. I think that
this is collected by snmpCollect as a counter - which means that it will
report the difference between the last sample and this one. The reason for
the "0 - value" expression is that snmpCollect only takes action when the
variable or expression is greater than some number (in our case we are
looking for this expression to be greater than 0!). Create a specific trap
for this threshold event, and as an action of this trap, run a command that
will parse the trapd.log file looking for the NetView events
(up/down/marginal) to get a closer approximation of when the node went down
and came back up! Collecting this once a day won't be overkill, and unless
your node goes down every day, this should work fairly well.
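
If you would rather post-process the collected values yourself, the same idea
can be sketched outside snmpCollect. This is not how snmpCollect evaluates
its thresholds; it is just a plain Python illustration, and reboots_detected
and the (time, ticks) sample layout are assumptions of mine. A node must have
restarted if its current uptime is shorter than the time since the last sample.

```python
TICKS_PER_SECOND = 100  # sysUpTime is counted in hundredths of a second


def reboots_detected(samples):
    """samples: list of (unix_time_seconds, sysuptime_ticks), oldest first.
    Returns the sample times whose uptime is shorter than the gap since the
    previous sample, meaning the node restarted somewhere in that gap."""
    flagged = []
    for (prev_time, _), (cur_time, cur_ticks) in zip(samples, samples[1:]):
        uptime_seconds = cur_ticks / TICKS_PER_SECOND
        if uptime_seconds < cur_time - prev_time:
            flagged.append(cur_time)
    return flagged
```

Note that sysUpTime is a 32-bit value and wraps after roughly 497 days, so a
wrap will look like a restart here. Each flagged time only tells you which
interval to search in trapd.log for the actual down and up events.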
Rob Napholz wrote:
> Pham, could you post your Perl script to the group
> and save us all some time?
>
> thanks Rob
>
> Pham Isaak V wrote:
> >
> > First create a ruleset to detect Node Up/Down traps, then compare the
> > traps to a collection of servers or routers.
> >
> > If the device is a router, log the event to a router logfile. If the
> > device is a server, log the event to a server logfile. The logs should
> > contain the following fields:
> >
> > device name
> > status of device (up/down)
> > time of status change (day, hours, & minutes)
> >
> > At the end of the month, run a script or program (Perl in my case)
> > against the logfiles to match each device down with its corresponding
> > device up. Now subtract the time of the device down trap from the time
> > of the device up trap. This will give you the length of time the device
> > was down. Convert the days and hours to minutes. Match up all the other
> > down/up traps associated with the same device. Add them all together and
> > you should have the total number of minutes the device was down for the
> > month.
> >
> > Now, take 43200 [(24 hours * 60 minutes) * 30 days = # of minutes in a
> > month] and subtract the total number of minutes the device was down.
> > Take that value, divide by 43200, and multiply by 100. This will give
> > you the percentage of availability for the device.
> >
> > This method is not 100% accurate, but it has to do for now. I hope
> > someone else has a better way of doing this.
> >
> > Hint: This would be a great addition to the next release of NetView.
> >
> > -----Original Message-----
> > From: Frantsen Christian [mailto:cf@INTERNOC.SE]
> > Sent: Wednesday, September 01, 1999 6:00 AM
> > To: NV-L@UCSBVM.ucsb.edu
> > Subject: Availability
> >
> > Hi!
> >
> > I would like to (with help from sysUpTime) gather information and then
> > present this to a customer as a single number, i.e.
> >
> > Your availability this month on these routers/servers/etc has been 99.7%
> >
> > Has anyone made something like this? Perhaps someone could give me a few
> > pointers on how to do this as easily as possible.
> >
> > -----------------------------------------
> > Christian Frantsen
> > Technical Operations
> >
> > Internoc Scandinavia AB
> > Tel: +46-36-194843
> > Fax: +46-36-194651
> > http://www.internoc.se
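
The log-matching steps Pham describes above could look roughly like this.
This is not Pham's Perl script, only a small Python sketch of the same
arithmetic; downtime_minutes, availability_percent, and the
(device, status, minute) event layout are assumptions for illustration.

```python
MINUTES_PER_MONTH = 24 * 60 * 30  # 43200, the 30-day month used in the post


def downtime_minutes(events):
    """events: time-ordered list of (device, status, minute_of_month), where
    status is 'up' or 'down'. Pairs each down with the next up for the same
    device and returns total minutes down per device."""
    down_since = {}
    totals = {}
    for device, status, minute in events:
        if status == "down":
            # Keep the earliest unmatched down if several arrive in a row.
            down_since.setdefault(device, minute)
        elif status == "up" and device in down_since:
            outage = minute - down_since.pop(device)
            totals[device] = totals.get(device, 0) + outage
        # An up with no preceding down is ignored.
    return totals


def availability_percent(down_minutes, total_minutes=MINUTES_PER_MONTH):
    """Share of the month the device was up, as a percentage."""
    return (total_minutes - down_minutes) / total_minutes * 100
```

A device that is still down when the log ends contributes nothing here, so
you may want to close such outages at the end of the month before reporting.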
--
Ray Schafer | schafer@tkg.com
The Kernel Group | Distributed Systems Management
http://www.tkg.com