Re: Hot Swapping CPUs?

From: Dan Foster <dsf_at_globalcrossing.net>
Date: Sat, 3 Jan 2004 01:03:07 +0000
Hot Diggety! Scott W was rumored to have written:
>
> Some of the higher end IBM x86 systems are supposed to be able to do 
> this, although note that they are all systems equipped with integrated 
> (or additional) service processors (AKA Remote Supervisor Adapters).  
> Some of the service processor setups can be accessed via serial or rs485 
> management ports, and their monitors(CPU, Mem, disk status, temps, fans, 
> voltages) are monitored as well via IBM Director (software).

I should point out that you can gather that data today with a utility
called 'xmbmon' (or a few other similar tools), which reads it over the
SMBus, provided the utility knows how to talk to the motherboard chipset
and the board actually has sensor chips (LM75, LM78, etc).

Modern motherboards -- anything made in the past 2-3 years or so -- tend to
work with xmbmon. I've got it working great: I wrote a Nagios plug-in that
reads the data (temp, fan, power, etc) from the motherboard, and if the
Nagios server sees that any parameter is outside an acceptable range, it
raises an alarm. (How the system responds to an alarm is customizable,
too... we like ours to shut down on a high temp or out-of-range voltage
alarm.)
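
For anyone wanting to roll their own, here is a minimal Python sketch of the
threshold logic such a plug-in might use. It is not the plug-in described
above. The function name evaluate() and the 50/60 degree limits are made up
for illustration, and it assumes the temperatures are in Celsius. The only
fixed part is the standard Nagios plug-in exit code convention (0 = OK,
1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN).

```python
# Standard Nagios plug-in exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate(temps_c, warn_c=50.0, crit_c=60.0):
    """Map a list of temperature readings to (exit_code, status_line).

    warn_c and crit_c are hypothetical limits; pick values that suit
    your own hardware and machine room.
    """
    if not temps_c:
        # No data at all is reported as UNKNOWN rather than OK.
        return UNKNOWN, "UNKNOWN - no temperature readings"

    hottest = max(temps_c)
    if hottest >= crit_c:
        return CRITICAL, "CRITICAL - hottest sensor at %.1fC" % hottest
    if hottest >= warn_c:
        return WARNING, "WARNING - hottest sensor at %.1fC" % hottest
    return OK, "OK - hottest sensor at %.1fC" % hottest
```

A real plug-in would print the status line and then call sys.exit() with the
code, which is how Nagios decides whether to raise the alarm.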

The typical situations that cause an alarm are: a fan fails, causing
internal temp to climb... OR... the HVAC system loses power, causing room
temp to spike through the roof. Either way, we want the system to shut down
quickly and sanely before the silicon cooks, since fried hardware is much
more time consuming to recover from. (Imagine -- it dies on a Sunday night;
you don't have a spare CPU/RAM/motherboard/HD/etc... what do you do?)

This is what happened during the large blackout on the U.S. East Coast last
year... the systems stayed on because the room was on an industrial UPS...
BUT... the HVAC system was not, so the room temp went to 125 degrees
Fahrenheit or hotter.

http://www.nt.phys.kyushu-u.ac.jp/shimizu/download/xmbmon200.tar.gz

(2.03 is also in /usr/ports/sysutils/xmbmon)

For our 4.9-RELEASE production boxes, we compiled mbmon from that package
(not the xmbmon portion; we don't need the X interface) and added to the
kernel config file:

device smbus
device iicbus
device iicbb
device viapm
device smb

(Note: smb above relates to SMBus, not SMB like the Windows file stuff :)
SMBus = System Management Bus.)

Built the kernel, rebooted, and mbmon worked great out of the box. E.g.:

# mbmon -c1 

Temp.= 31.2, 33.0, 24.2; Rot.= 3590,    0,    0
Vcore = 1.75, 1.18; Volt. = 3.33, 5.20, 11.95,   0.00,  0.00

(Not all motherboards will report all parameters; we have other servers
with different motherboards that report different numbers...)
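
If you want to feed those numbers into a script, here is a rough Python
sketch that parses output in the format shown above. The function name
parse_mbmon() is invented for this example, and it assumes the output looks
exactly like that sample ("label = v1, v2, ..." groups separated by
semicolons). Other mbmon versions or options may print something different.

```python
SAMPLE = """\
Temp.= 31.2, 33.0, 24.2; Rot.= 3590,    0,    0
Vcore = 1.75, 1.18; Volt. = 3.33, 5.20, 11.95,   0.00,  0.00
"""


def parse_mbmon(text):
    """Turn 'Label = v1, v2; Label2 = ...' lines into {label: [floats]}."""
    readings = {}
    for line in text.splitlines():
        for segment in line.split(";"):
            if "=" not in segment:
                continue  # skip blank lines and anything unexpected
            label, _, values = segment.partition("=")
            # "Temp." -> "Temp", "Volt. " -> "Volt"
            key = label.strip().rstrip(".")
            readings[key] = [float(v) for v in values.split(",")]
    return readings


readings = parse_mbmon(SAMPLE)
print(readings["Temp"])   # the three temperature sensors
print(readings["Rot"])    # fan speeds; 0 usually means no fan attached
```

In practice the text would come from running mbmon -c1 rather than from a
hard-coded string.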

I just wanted to mention all of the above stuff because some folks asked
about temp/fan/voltage monitoring -- existing tools can already do that.

However, monitoring the *failure* of a CPU (for something other than a temp
or voltage issue) is a much more interesting problem, and I don't think
mbmon is really geared to deal with it.

IBM POWER4 servers do so through a service processor that actively snoops
all CPU and memory transactions to determine when a CPU has died, and then
takes the failed CPU out of service along with generating error reports for
immediate notification. Doing it this way keeps the system up, which is the
whole point. I don't know how the high end x86 servers handle that, but the
expensive servers from Compaq, years ago, had some sort of similar features
for the CPU.

I just don't know how one would interface with the hardware to obtain
information... likely to be proprietary or hardware-specific since I don't
think there's a standard across vendors for this.

-Dan
Received on Fri Jan 02 2004 - 16:04:37 UTC
