
Re: Availability

To: nv-l@lists.tivoli.com
Subject: Re: Availability
From: "Douglas W. Stevenson" <dsteven@INOTECH.COM>
Date: Thu, 4 Nov 1999 14:27:16 -0500
Availability can take on a multitude of meanings, such as:

Is the node up and running?
Are the node's applications up and running?
Can users connect to the application?

There is a short-term and a long-term aspect to availability.  You want to
monitor and measure availability for reports and SLA metrics, but you also
need to monitor availability in near real time so that you can manage to the
availability goals at hand.

If you want reporting that integrates pretty tightly with NV, check out
Tavve at:

http://www.tavve.com

HTH,

Doug...
 Douglas W. Stevenson
Senior Principal Engineer
InoTech
3701 Pender Drive
Fairfax, VA 22030
1-800-INOTECH
1-703-995-1738 (Direct)
1-703-995-1725 (Fax)
http://www.inotech.com

> -----Original Message-----
> From: Discussion of IBM NetView and POLYCENTER Manager on NetView
> [mailto:NV-L@UCSBVM.UCSB.EDU]On Behalf Of Boulieris, Arthur
> Sent: Thursday, September 02, 1999 3:42 AM
> To: NV-L@UCSBVM.UCSB.EDU
> Subject: Re: Availability
>
>
> Availability is hard to manage. I would like to know more about the
> management of availability in the next version of NetView, and whether or
> not other tools are required to manage and report on the data.
> Does anyone know when the next version of NetView will be released and
> what other features it may have?
>
>
> -----Original Message-----
> From: Ray Schafer [mailto:schafer@TKG.COM]
> Sent: Thursday, September 02, 1999 4:49 PM
> To: NV-L@UCSBVM.UCSB.EDU
> Subject: Re: Availability
>
>
> Hey Rob!
>
> Using NetView Node Up/Node Down traps may not give you what you want.  The
> issues are:
>
>    * Not all Node Up traps will have corresponding Node Down traps,
>      especially for routers.  For example, an interface down trap on the
>      router will cause a Node Marginal trap, and when the interface comes
>      up again you'll get a Node Up trap without a Node Down.
>    * For routers that have administratively down interfaces (when an
>      administrator manually brings the interface down on the router),
>      NetView will never mark the router as down, even if the router is
>      under water!
>    * Network problems with the NetView server, the MLM's default router,
>      or any router in between you and the endpoint will cause NetView to
>      mark the node or interface down (if it is polling it at the time)
>      even though it is not really down, just unpingable from the NetView
>      server or MLM.
>
> Now for the good news:  This may be addressed in the next version of
> NetView if "snmp" polling is engaged.  This will actually look at the
> uptime from the system tree of the device's MIB.  You could probably write
> a script to do the same for now:  for every Node Up trap you get, fire off
> an snmpget of system.sysuptime (I think that's it - do "snmpwalk <node>
> system" to see!).  If the uptime is just a few minutes, then it is really
> an outage; if it is more than your polling cycle, then it is bogus.  Be
> careful though: if you fire off a bunch of these snmpgets when you are
> flooded with up traps, you could exhaust system resources!
>
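A minimal sketch of the uptime check described above, in Python rather than
a NetView script.  It assumes the uptime has already been fetched as SNMP
TimeTicks (hundredths of a second).  The names is_real_outage,
classify_node_ups and fetch_ticks are hypothetical; fetch_ticks stands in
for whatever wrapper you put around snmpget.  The small worker pool is one
way to avoid launching a flood of queries at once; it is an illustration,
not something NetView provides.

```python
from concurrent.futures import ThreadPoolExecutor

# SNMP TimeTicks count hundredths of a second.
TICKS_PER_SECOND = 100


def is_real_outage(sysuptime_ticks, polling_cycle_seconds):
    """True if the device has been up for less than one polling cycle,
    i.e. it restarted recently and the Node Up reflects a real outage."""
    uptime_seconds = sysuptime_ticks / TICKS_PER_SECOND
    return uptime_seconds < polling_cycle_seconds


def classify_node_ups(nodes, fetch_ticks, polling_cycle_seconds, max_workers=4):
    """Check a batch of nodes that sent Node Up traps.

    fetch_ticks is a caller-supplied (hypothetical) function that returns
    the uptime in TimeTicks for one node.  max_workers caps how many
    queries run at the same time.
    """
    nodes = list(nodes)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        ticks = list(pool.map(fetch_ticks, nodes))
    return {
        node: is_real_outage(t, polling_cycle_seconds)
        for node, t in zip(nodes, ticks)
    }
```

Note that this only distinguishes restarts from everything else: a node
that was cut off by a link failure without rebooting will show a long
uptime and be classed as "bogus", as in the description above.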
> Christian,
>
> Maybe you could use the snmpCollect facilities to attack the problem in a
> more efficient way:  Set up a collection for your servers and another for
> your routers.  Create a MIB Expression to store the value
> "0 - system.sysuptime.0" for each member of the collection.  I think that
> this is collected by snmpCollect as a counter, which means that it will
> report the difference between the last sample and this one.  The reason
> for the "0 - value" expression is that snmpCollect only takes action when
> the variable or expression is greater than some number (in our case we
> are looking for this expression to be greater than 0, i.e. for the uptime
> to have gone backwards since the last sample!).  Create a specific trap
> for this threshold event and, as an action of this trap, run a command
> that will parse the trapd.log file looking for the NetView events
> (up/down/marginal) to get a closer approximation of when the node went
> down and came back up!  Collecting this once a day won't be overkill, and
> unless your node goes down every day, this should work fairly well.
>
>
>
> Rob Napholz wrote:
>
> > Pham, could you post your Perl script to the group
> > and save us all some time?
> >
> > thanks Rob
> >
> > Pham Isaak V wrote:
> > >
> > > First create a ruleset to detect Node Up/Down traps, then compare the
> > > traps to a collection of servers or routers.
> > >
> > > If the device is a router, log the event to a router logfile.  If the
> > > device is a server, log the event to a server logfile.  The logs
> > > should contain the following fields:
> > >
> > >         device name
> > >         status of device (up/down)
> > >         time of status change (day, hours, & minutes)
> > >
> > > At the end of the month, run a script or program (Perl in my case)
> > > against the logfiles to match each device down with its corresponding
> > > device up.  Now subtract the time of the device down trap from the
> > > time of the device up trap.  This will give you the length of time the
> > > device was down.  Convert the days and hours to minutes.  Match up all
> > > the other down/up traps associated with the same device.  Add them all
> > > together and you should have the total number of minutes the device
> > > was down for the month.
> > >
> > > Now, subtract the total number of minutes the device was down from
> > > 43200 [(24 hours * 60 minutes) * 30 days = # of minutes in a month].
> > > Take that value, divide by 43200, and multiply by 100.  This will give
> > > you the percentage of availability for the device.
> > >
> > > This method is not 100% accurate, but it has had to do for now.  I hope
> > > someone else has a better way of doing this.
> > >
> > > Hint:  This would be a great addition to the next release of NetView.
> > >
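The monthly calculation described above can be sketched as follows.  This
is in Python, not the Perl script mentioned in the thread (which was not
posted), and the record layout (device, status, timestamp) and the function
names are assumptions made for illustration.  It uses the same 30-day,
43200-minute month as the description.

```python
from datetime import datetime

# (24 hours * 60 minutes) * 30 days, the month length used in the post.
MINUTES_PER_MONTH = 24 * 60 * 30


def downtime_minutes(events):
    """Pair each 'down' event with the next 'up' for the same device.

    events: iterable of (device, status, when) tuples, where status is
    'up' or 'down' and when is a datetime.  Returns {device: minutes down}.
    An 'up' with no preceding 'down' is ignored.
    """
    down_since = {}
    totals = {}
    for device, status, when in sorted(events, key=lambda e: e[2]):
        totals.setdefault(device, 0.0)
        if status == "down":
            # Keep the earliest unmatched 'down' for this device.
            down_since.setdefault(device, when)
        elif status == "up" and device in down_since:
            outage = when - down_since.pop(device)
            totals[device] += outage.total_seconds() / 60
    return totals


def availability_percent(minutes_down, period_minutes=MINUTES_PER_MONTH):
    """(period - downtime) / period, expressed as a percentage."""
    return (period_minutes - minutes_down) / period_minutes * 100


log = [
    ("rtr1", "down", datetime(1999, 9, 1, 10, 0)),
    ("rtr1", "up", datetime(1999, 9, 1, 10, 30)),
    ("rtr1", "down", datetime(1999, 9, 5, 0, 0)),
    ("rtr1", "up", datetime(1999, 9, 5, 1, 0)),
    ("srv1", "up", datetime(1999, 9, 2, 8, 0)),  # Node Up with no Node Down
]
totals = downtime_minutes(log)
report = {dev: round(availability_percent(m), 2) for dev, m in totals.items()}
```

With the sample log, rtr1 accumulates 90 minutes of downtime, which works
out to roughly 99.79% for the month.  The unmatched Node Up for srv1 counts
as no downtime, which is one of the reasons the method is only approximate.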
> > > -----Original Message-----
> > > From: Frantsen Christian [mailto:cf@INTERNOC.SE]
> > > Sent: Wednesday, September 01, 1999 6:00 AM
> > > To: NV-L@UCSBVM.ucsb.edu
> > > Subject: Availability
> > >
> > > Hi!
> > >
> > > I would like to (with help from sysUptime) gather information and then
> > > present this to a customer as a single number, i.e.
> > >
> > > Your availability this month on these routers/servers/etc. has been
> > > 99.7%
> > >
> > > Has anyone made something like this? Perhaps someone could give me a
> > > few pointers on how to do this as easily as possible.
> > >
> > > -----------------------------------------
> > > Christian Frantsen
> > > Technical Operations
> > >
> > > Internoc Scandinavia AB
> > > Tel: +46-36-194843
> > > Fax: +46-36-194651
> > > http://www.internoc.se
>
> --
> Ray Schafer                   | schafer@tkg.com
> The Kernel Group              | Distributed Systems Management
> http://www.tkg.com
>


Archive operated by Skills 1st Ltd

See also: The NetView Web