Hot Diggety! Scott W was rumored to have written:
>
> Some of the higher end IBM x86 systems are supposed to be able to do
> this, although note that they are all systems equipped with integrated
> (or additional) service processors (AKA Remote Supervisor Adapters).
> Some of the service processor setups can be accessed via serial or rs485
> management ports, and their monitors(CPU, Mem, disk status, temps, fans,
> voltages) are monitored as well via IBM Director (software).

I should point out that you can gather that data today with a utility called 'xmbmon' (or a few other similar tools). It reads the information over the SMBus, provided the utility knows how to talk to the motherboard chipset and the board has sensors (LM75, LM78, etc). Modern motherboards -- things made in the past 2-3 years at least -- tend to work with xmbmon.

I've got it working great -- I wrote a Nagios plug-in that reads the data (temp, fan, power, etc) from the motherboard, and if the Nagios server detects that a parameter is outside an acceptable range, it raises an alarm. (How you want the system to respond to an alarm is customizable, too... we like ours to shut down on a high temp or voltage out of range alarm.)

The typical situations that would cause an alarm are: a fan fails, causing internal temp to climb... OR... the HVAC system loses power, causing room temp to spike through the roof. Either way, we want the system to shut down quickly and sanely to keep things like silicon from melting down, which is much more time consuming to recover from. (Imagine -- it dies on a Sunday night; you don't have a spare CPU/RAM/motherboard/HD/etc... what do you do?)

This is what happened during the large blackout on the U.S. East Coast last year... the systems stayed on because the room was on an industrial UPS... BUT... the HVAC system was not, so the room temp went to 125 degrees Fahrenheit or hotter.
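For illustration, here is a minimal sketch of how a Nagios check along these lines could look in Python. It is not the actual plug-in described above. It assumes `mbmon -c1` prints its two-line "Label = v1, v2; ..." format (as in the sample output later in this message), and the threshold values and the function names `parse_mbmon`, `check`, and `main` are made up for the example. Only the exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) follow the usual Nagios plug-in convention.

```python
import re
import subprocess

# Conventional Nagios plug-in exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Hypothetical thresholds, for illustration only; tune them per board and room.
TEMP_WARN_C = 45.0
TEMP_CRIT_C = 55.0


def parse_mbmon(text):
    """Turn 'Label = v1, v2; Label2 = ...' lines into {label: [floats]}."""
    readings = {}
    # Fields are separated by ';' within a line and by newlines between lines.
    for field in re.split(r"[;\n]", text):
        if "=" not in field:
            continue
        label, values = field.split("=", 1)
        # Normalize 'Temp.' / ' Volt. ' to 'Temp' / 'Volt'.
        label = label.strip().rstrip(".").strip()
        try:
            readings[label] = [float(v) for v in values.split(",")]
        except ValueError:
            continue  # Skip anything that does not look like numeric readings.
    return readings


def check(readings):
    """Return (exit_code, message) based on the hottest temperature sensor."""
    temps = readings.get("Temp")
    if not temps:
        return UNKNOWN, "UNKNOWN - no temperature readings found"
    hottest = max(temps)
    if hottest >= TEMP_CRIT_C:
        return CRITICAL, "CRITICAL - temp %.1f C" % hottest
    if hottest >= TEMP_WARN_C:
        return WARNING, "WARNING - temp %.1f C" % hottest
    return OK, "OK - temp %.1f C" % hottest


def main():
    """Run mbmon once, print a one-line status, and return the exit code."""
    try:
        result = subprocess.run(["mbmon", "-c1"], capture_output=True,
                                text=True, timeout=10)
    except (OSError, subprocess.SubprocessError):
        print("UNKNOWN - could not run mbmon")
        return UNKNOWN
    code, message = check(parse_mbmon(result.stdout))
    print(message)
    return code
```

To use it as a plug-in, the script would end with `raise SystemExit(main())` so that Nagios sees the exit code. Fan speed and voltage checks could follow the same pattern using the `Rot`, `Vcore`, and `Volt` entries; what to do on CRITICAL (for example, triggering a shutdown) is left to the Nagios event handler configuration.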
http://www.nt.phys.kyushu-u.ac.jp/shimizu/download/xmbmon200.tar.gz
(2.03 is also in /usr/ports/sysutils/xmbmon)

For our 4.9-RELEASE production boxes, we compiled mbmon from that package (not the xmbmon portion; we don't need the X interface) and added to the kernel config file:

    device smbus
    device iicbus
    device iicbb
    device viapm
    device smb

(Note: smb above relates to SMBus, not SMB like the Windows file stuff :) SMBus = System Management Bus.)

Built kernel, rebooted, mbmon worked great out of the box. Eg:

    # mbmon -c1
    Temp.= 31.2, 33.0, 24.2; Rot.= 3590, 0, 0
    Vcore = 1.75, 1.18; Volt. = 3.33, 5.20, 11.95, 0.00, 0.00

(Not all motherboards will report all parameters; we have other servers with different motherboards that report different numbers...)

I just wanted to mention all of the above because some folks asked about temp/fan/voltage monitoring -- existing tools can already do that.

However, monitoring the *failure* of a CPU (for something other than a temp or voltage issue) is a much more interesting problem, and I don't think mbmon is really geared to deal with it. IBM POWER4 servers do so through a service processor that actively snoops all CPU and memory transactions to determine when a CPU has died, then takes the failed CPU out of service and generates error reports for immediate notification. Doing it that way lets the system stay up, which is the whole point.

I don't know how the high end x86 servers handle that, but the expensive servers from Compaq, years ago, had some sort of similar feature for the CPU. I just don't know how one would interface with the hardware to obtain the information... it's likely to be proprietary or hardware-specific since I don't think there's a standard across vendors for this.

-Dan

Received on Fri Jan 02 2004 - 16:04:37 UTC