Re: [SAGE] Availability Metric Reporting
Quoting Doug Hughes (doug@eng.auburn.edu):
> On Mon, 12 Apr 2004, Shane B. Milburn wrote:
> > How are the rest of you tracking and reporting on server and network
> > availability metrics?
>
> Well, it's complicated, and it depends on the service.
I'll toss in also time of day. When I worked on a trading
floor it was completely acceptable to have a server outage
(ideally planned, but not fatal if not) at 6:30PM. We cared,
but not as much. Deadly at 3:30PM.
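One way to make the time-of-day point show up in the numbers is to count
only the outage minutes that overlap the hours you care about. A minimal
sketch follows. The 9:00 to 16:00 window and the function name
minutes_in_window are assumptions for illustration, not anything from
that trading floor.

```python
from datetime import datetime


def minutes_in_window(start, end, win_start, win_end):
    """Return how many minutes of [start, end) fall inside [win_start, win_end)."""
    latest_start = max(start, win_start)
    earliest_end = min(end, win_end)
    overlap = (earliest_end - latest_start).total_seconds() / 60
    # A negative overlap means the outage missed the window entirely.
    return max(0.0, overlap)


# Hypothetical critical window for one day (assumed values).
win_start = datetime(2004, 4, 12, 9, 0)
win_end = datetime(2004, 4, 12, 16, 0)

# Outage at 3:30PM: every minute counts against the metric.
afternoon = minutes_in_window(
    datetime(2004, 4, 12, 15, 30), datetime(2004, 4, 12, 15, 45),
    win_start, win_end)

# Outage at 6:30PM: outside the window, so it counts for nothing here.
evening = minutes_in_window(
    datetime(2004, 4, 12, 18, 30), datetime(2004, 4, 12, 19, 0),
    win_start, win_end)
```

You would still want the raw, unweighted number somewhere; this only
separates "we cared" from "deadly".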
At my current client, I'm still getting the NOC to understand
that with email, if one of three boxes isn't responding on port 25,
it's NOT a Priority 1 problem.
Hell, it's often a goal: that box refuses connections at a lower load
average (LA) than the others precisely because I have others. With
filtering, it's easier to handle a flood of spam by getting other
machines involved sooner, and the first machine recovers quicker when
it's not LOADED with connections.
So we check on services with end-to-end tests:
- If I can send and receive a mail in N seconds, we're fine. No matter
if 2/3 boxes are down.
- If I can get the time, LDAP, DNS or web page, I'm ok.
- If packets get from here to there, the network is functioning.
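The checks above all have the same shape: run one real transaction and
pass only if it works within N seconds. A rough sketch of that shape is
below. The names tcp_probe and end_to_end_ok are made up for this
example, and a plain TCP connect is only a stand-in for a real
send-and-receive mail test.

```python
import socket
import time
from typing import Callable


def tcp_probe(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def end_to_end_ok(probe: Callable[[], bool], max_seconds: float) -> bool:
    """Run the probe once; pass only if it succeeds within max_seconds."""
    start = time.monotonic()
    try:
        succeeded = probe()
    except Exception:
        # A probe that blows up is a failed check, not a crashed monitor.
        succeeded = False
    elapsed = time.monotonic() - start
    return succeeded and elapsed <= max_seconds
```

Usage would look something like
end_to_end_ok(lambda: tcp_probe("mail.example.com", 25), 10.0), where the
hostname is a placeholder for the service address, not an individual
box behind it.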
On TOP of those, we get to address failures on individual machines
and track per-machine issues (gee, NameRUs wasn't responding 5% of
the time last month...). But that's a separate discussion.
I tire of people panicking when an HA group fails over. That's why
we spend all that money. We're REDUCED, not down; it's a concern, not
a fire. Can I go back to bed?
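The REDUCED versus down distinction is easy to encode so the monitoring
stops paging for it. A minimal sketch, where the state names and the
priority numbers are assumptions for illustration and not any NOC's
actual scheme:

```python
def pool_status(members_up: int, members_total: int) -> str:
    """Classify an HA group by how many of its members are answering."""
    if members_total <= 0:
        raise ValueError("members_total must be positive")
    if members_up <= 0:
        return "DOWN"
    if members_up < members_total:
        return "REDUCED"
    return "UP"


# Hypothetical mapping from group state to ticket priority.
# None means no ticket at all.
PRIORITY = {"UP": None, "REDUCED": 3, "DOWN": 1}

# One of three mail boxes not answering on port 25: a concern, not a fire.
one_box_out = pool_status(2, 3)
```

Only DOWN maps to Priority 1 here; a failover lands in REDUCED and can
wait until morning.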
We also track resolution times of tickets but have issues. If I
get passed a trouble ticket that's past its time or just has a
few minutes left, it's like passing a grenade and we get dinged.
(yes, my 2hr response window was missed when I got it 1:55 hrs
after it was created). Fixing that means changing deeply embedded
tools. Unfortunately, management sees numbers, and bad numbers mean
more to them than no numbers.
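To illustrate the grenade problem, here is a sketch that scores the same
ticket two ways: from creation time (what gets us dinged) and from the
time it actually reached the person who responded. The function name
and the timestamps are invented for the example; this is not how any
particular ticketing tool computes it.

```python
from datetime import datetime, timedelta


def response_met(created, assigned, responded, window):
    """Return (met_from_creation, met_from_assignment) for one ticket."""
    met_from_creation = (responded - created) <= window
    met_from_assignment = (responded - assigned) <= window
    return met_from_creation, met_from_assignment


# A ticket handed over 1:55 after creation, answered 15 minutes later.
created = datetime(2004, 4, 12, 9, 0)
assigned = created + timedelta(hours=1, minutes=55)
responded = assigned + timedelta(minutes=15)

by_creation, by_assignment = response_met(
    created, assigned, responded, timedelta(hours=2))
```

Reporting both numbers side by side at least shows where the two hours
went.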