Re: [SAGE] Availability Metric Reporting



On Mon, 12 Apr 2004, Shane B. Milburn wrote:

>
> I'm trying to come up with a better reporting method for management on our
> weekly/monthly/yearly System and Network availability. I haven't
> been able to come up with any really good formulas or ways to
> track how close to five 9s we get for servers and for networks.
>
> Cisco has a book 'High Availability Network Fundamentals' which is
> good but it really only helps calculate the hardware availability.
> I also need to track server and network availability from a "service"
> perspective. We have all kinds of redundancy, so even if I lose a switch
> in my network, the network is still functioning and the only affected
> people are the clients directly attached to that failed component. Same
> goes for my servers, since most are clustered.
>
> How are the rest of you tracking and reporting on server and network
> availability metrics?
>

Well, it's complicated, and it depends on the service.

For instance, suppose you offer a DNS caching service to customers.
Every customer gets at least 2 different IPs. Now, if one of those
servers is unavailable, is it the first one in the list or the second?
Is the DNS timeout considered OK per your agreements with your customer?

For NTP: you offer 3 NTP servers and one fails. By NTP design, that should
be a non-outage because there's built-in failover, right?

For a single machine running a single service, it's relatively easy.
Downtime is downtime.
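
For that simple case the arithmetic is just uptime over total time. A
minimal sketch in Python, assuming you already have the total downtime
for the reporting period in seconds (the function names here are
made up for illustration, not part of any monitoring product):

```python
WEEK = 7 * 24 * 3600      # seconds in a week
YEAR = 365 * 24 * 3600    # seconds in a non-leap year


def availability(period_seconds, downtime_seconds):
    """Fraction of the reporting period during which the service was up."""
    if period_seconds <= 0:
        raise ValueError("period must be positive")
    return 1.0 - downtime_seconds / period_seconds


def downtime_budget(period_seconds, target):
    """Seconds of downtime allowed in the period for a target such as 0.99999."""
    return period_seconds * (1.0 - target)


if __name__ == "__main__":
    # One 10-minute outage in a week.
    print(f"weekly availability: {availability(WEEK, 600):.5%}")
    # How much downtime five 9s leaves you per year.
    print(f"five-9s yearly budget: {downtime_budget(YEAR, 0.99999):.1f} s")
```

Five 9s over a year works out to a bit over five minutes of allowed
downtime, which is why the definition of "down" matters so much.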

We use a combination of Netcool Internet Service Monitors (ISMs) to monitor
existing services (common ones like HTTP, HTTPS, DNS, POP, IMAP, NTP,
LDAP, RADIUS, etc.) and some aggregation to define what really constitutes
an outage. Sometimes one server down is an outage, sometimes it isn't.
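
One way to picture that aggregation, as a rough Python sketch and not
what Netcool actually does internally: say a service is "down" only
while fewer than some minimum number of its servers are up, then sum
that time. The function name, the min_up threshold, and the
(start, end) outage tuples are all assumptions for illustration, and
it assumes outages for the same server never overlap each other.

```python
def service_downtime(outages, total_servers, min_up):
    """Seconds during which fewer than min_up servers were up.

    outages: list of (start, end) times in seconds, one tuple per
    individual server outage, across all servers in the service.
    """
    events = []
    for start, end in outages:
        events.append((start, 1))    # a server went down
        events.append((end, -1))     # a server came back
    events.sort()

    down = 0        # servers currently down
    total = 0       # accumulated service-level downtime
    prev = None     # time of the previous event
    for t, delta in events:
        # Charge the interval since the last event if the service
        # was below its threshold during that interval.
        if prev is not None and total_servers - down < min_up:
            total += t - prev
        down += delta
        prev = t
    return total


if __name__ == "__main__":
    # 3 NTP servers, 1 needed: a single failure is not a service outage.
    print(service_downtime([(0, 600)], total_servers=3, min_up=1))
    # 2 DNS caches, 1 needed: only the overlap of both failures counts.
    print(service_downtime([(0, 600), (300, 900)], total_servers=2, min_up=1))
```

Setting min_up equal to total_servers gives the strict "any server
down is an outage" view, so the same code covers both cases.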

We supplement ISMs with custom probes that inject into Netcool and
the obligatory ping probe (more for severity than service purposes).
	Doug