Re: [SAGE] zenoss versus nagios
We use Nagios, but for different reasons... well, really for what Nagios
is. Nagios isn't a monitoring system as much as it is an event reporting
system. Nagios is our central notification platform, used both for
sending out alerts and for displaying status graphically on flat panels
around the office.
Now, if Nagios is an event reporting system, something has to create
the events. We use very few canned Nagios plugins; instead, we have
written our own that are much more scalable. We also rely heavily on
vendor-supplied "free" monitoring software packages since, in many
cases, no one is better suited to report on their hardware/software than
the people who wrote it. So, a brief description of what we have is as
follows:
Nagios server receives alerts from:
* Sun Management Center
* Sun StorADE
* HP Insight Manager
* NetApp Operations Manager (formerly DFM)
* Veritas Storage Foundation Management Server
* SolarWinds Orion
* Oracle OEM (in progress)
* EMC Networker (in progress)
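
One common way for outside tools like the ones listed above to hand an
event to Nagios is a passive check result written to the external
command file. The post does not say which mechanism is actually used
here, so this is only a minimal sketch under that assumption; the host
name, service name, and command file path are hypothetical.

```python
import time

# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def passive_result(host, service, code, output, now=None):
    """Format one PROCESS_SERVICE_CHECK_RESULT external command line."""
    ts = int(time.time() if now is None else now)
    # A newline ends the command, so flatten multi-line output.
    clean = output.replace("\n", " ")
    return f"[{ts}] PROCESS_SERVICE_CHECK_RESULT;{host};{service};{code};{clean}\n"


def submit(line, command_file):
    """Append the command to the Nagios external command file.

    The path is site-specific (hypothetical here) and must be passed in.
    """
    with open(command_file, "a") as fh:
        fh.write(line)
```

A vendor tool's alert hook would call passive_result() with its own
host and service names and pass the result to submit().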
For performance monitoring/trending/etc we use RTG to collect our data
with the targetmaker package (shameless plug) to automatically discover
all SNMP nodes on our network and add the proper monitoring for each
device based on features supported by the device's MIBs. For instance, it
will poll a device, determine what version of SNMP to use (1 or 2) and
then check a number of "modules" to see what it supports. For our HP
boxes (both Linux and NT) it will find all CPUs in the system, all
disks, memory, processes, users, network interfaces, threads,
temperature sensors, and a few other items. RTG then sticks all that
data into a SQL (MySQL in our case) database. We then have several
daemons that run and continually query the SQL database and report to
Nagios on the status of things like CPU usage, disk usage, network
bandwidth, network errors, and temperature; we alert on just about
anything that is in the database. With RTG we're able to collect at
60-second intervals. In our environment we're currently collecting and
storing around 60,000 OIDs (a polling pass takes about 33 seconds). We're
at about 2TB of data in the MySQL database...
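
As a rough illustration of the "daemon queries the database and reports
to Nagios" step, here is a minimal sketch. It is not the actual daemon:
the samples table and its columns are a hypothetical schema (not RTG's
real one), the fixed warn/crit thresholds are placeholders, and sqlite3
stands in for MySQL so the example is self-contained.

```python
import sqlite3

# Standard Nagios state codes.
OK, WARNING, CRITICAL = 0, 1, 2


def latest_values(conn, metric):
    """Return {host: most recent value} for one metric.

    Assumes a hypothetical table samples(host, metric, dtime, value).
    """
    rows = conn.execute(
        """
        SELECT s.host, s.value
        FROM samples AS s
        JOIN (SELECT host, MAX(dtime) AS dtime
              FROM samples WHERE metric = ? GROUP BY host) AS m
          ON s.host = m.host AND s.dtime = m.dtime
        WHERE s.metric = ?
        """,
        (metric, metric),
    )
    return dict(rows)


def classify(value, warn, crit):
    """Map a value to a Nagios state using fixed thresholds."""
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def check_metric(conn, metric, warn, crit):
    """One polling pass: map each host to a (state, message) pair."""
    results = {}
    for host, value in latest_values(conn, metric).items():
        results[host] = (classify(value, warn, crit), f"{metric}={value:.1f}")
    return results
```

A real daemon would run check_metric() in a loop and forward each
(state, message) pair to Nagios.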
Now, for the really fun part... As we have all this data on performance
of servers, what do you do with it? Well, the answer is easy: dynamic
thresholds for monitoring based on historical data. We're about 50% of
the way through implementing the first stage of this. Basically, as we're
a trading firm, we have two parts of the day we care about: trading
hours and non-trading hours. We have calculated "peaks" and "averages"
based on our RTG data for the different periods of the day and days of
the week. We now have the ability to say "oh, this server is at 94% CPU
usage, is this bad? nope, it has hit 95% CPU every day for the last 2
months, so this is normal, don't alert anyone". But, on the other hand,
for another server hitting 94% CPU we may say "uh oh, this box has only
been hitting 70% max over the last 2 months, 94% is BAD, alert someone!".
We
currently have the "peaks" database working and are about to start
implementing the thresholds from it.
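
The peak-based threshold idea can be sketched roughly as follows. This
is an illustration only, not the implementation described above: the
trading window, the 5-point margin, and the choice to alert when a host
has no history are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime, time

# Hypothetical trading window; the actual hours are not given in the post.
TRADING_START, TRADING_END = time(9, 30), time(16, 0)


def period_of(ts):
    """Bucket a timestamp as (weekday, 'trading' or 'off')."""
    in_hours = TRADING_START <= ts.time() < TRADING_END
    return ts.weekday(), "trading" if in_hours else "off"


def build_peaks(samples):
    """samples: iterable of (host, datetime, cpu_pct).

    Returns {(host, period): highest value seen in that period}.
    """
    peaks = defaultdict(float)
    for host, ts, value in samples:
        key = (host, period_of(ts))
        peaks[key] = max(peaks[key], value)
    return dict(peaks)


def should_alert(peaks, host, ts, value, margin=5.0):
    """Alert only if value exceeds the historical peak by more than margin."""
    peak = peaks.get((host, period_of(ts)))
    if peak is None:
        return True  # assumption: with no history, err towards alerting
    return value > peak + margin
```

With history where one host regularly peaks at 95% and another at 70%
during Monday trading hours, a reading of 94% is quiet for the first
host and alerts for the second.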
And, well, what else can you do with 2TB of performance data? Well, if
you have metrics on each CPU individually (which HP Insight gives us,
yay HP!) you can do fun things like create a vector for each server
based on how evenly it uses all 8 cores in your server. The steeper the
vector, the less evenly your app can use the 8 cores, and it may be time
to consolidate. The list goes on and on :)
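
The post does not say how the per-server vector is computed. One
possible reading, offered purely as an assumption, is to sort the
per-core utilisation from busiest to idlest and take the least-squares
slope across the cores:

```python
def core_slope(core_utils):
    """Least-squares slope of per-core utilisation, sorted busiest first.

    0.0 means all cores are used equally; a more negative value means
    the load is concentrated on fewer cores. This is one interpretation
    of a "steep vector", not a method confirmed by the post.
    """
    ys = sorted(core_utils, reverse=True)
    n = len(ys)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    # Covariance of (core rank, utilisation) over variance of core rank.
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den
```

Feeding it the average per-core CPU figures for a server over some
window gives a single number that can be compared across servers.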
As for management of this all, we have an in house GUI to do all the
configuring of Nagios. Basically it's a SQL database that generates
Nagios configs with a web frontend. We have a number of scripts that
automatically populate this SQL database from other applications. So,
for instance, if we add a new server, group X updates their list of
servers they support, that is automatically pulled into our Nagios
config application, which is then automatically pulled in by RTG, HP IM,
Sun Management Center, or whatever other app we want, to create its own
configs. Basically, we put a host in one place once, and all the
monitoring apps pick it up from the single location.
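
Generating Nagios object definitions from host records might look
something like the sketch below. The in-house tool is not shown in the
post, so the record fields and the generic-host template name are
hypothetical; only the define host{ ... } object syntax is Nagios's own.

```python
def render_host(host):
    """Render one Nagios host object definition from a dict of fields."""
    lines = ["define host{"]
    for key in ("use", "host_name", "alias", "address"):
        if host.get(key):
            # Pad the directive name so the values line up in a column.
            lines.append(f"    {key:<12}{host[key]}")
    lines.append("}")
    return "\n".join(lines)


def render_config(hosts):
    """Render all hosts, sorted by name so regenerated files diff cleanly."""
    ordered = sorted(hosts, key=lambda h: h["host_name"])
    return "\n\n".join(render_host(h) for h in ordered) + "\n"
```

In practice the list of dicts would come from rows in the central SQL
database, and the output would be written to a file that Nagios
includes.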
Granted, this is all a high-level overview, and a lot more than you were
probably looking for... But, it just goes to show, Nagios rocks if you
use it for what it is: a reporting system for events.
Brian
Quoting Neil Watson (sage@xxxxxxxxxxxxxxxx) from :
> I'm beginning to plan a migration from an old Nagios 1 server to perhaps
> Nagios 3. It appears that much has changed from version 1 to 3 meaning
> that at least some of the configurations will have to be altered or even
> created anew. Last summer I helped to write a comparison on monitoring
> systems. In that paper Nagios was a front running but Zenoss came out
> on top. Now I'm considering migrating to Zenoss instead of Nagios 3.
>
> Does anyone here have practical experience with Zenoss? How does it
> compare with Nagios? Is it worth switching to?
>
> --
> Neil Watson | Debian Linux
> System Administrator | Uptime 4 days
> http://watson-wilson.ca
--
btoneill@xxxxxxxxxxxxx
****************************************************************************
"UNIX is simple and coherent, but it takes a genius (or at any rate a
programmer) to understand and appreciate the simplicity." - Dennis Ritchie
****************************************************************************