Re: Several issues on Dell 1950/2950 servers (6-STABLE and 7-CURRENT)

From: Matthew Jacob <lydianconcepts_at_gmail.com>
Date: Sat, 2 Sep 2006 12:20:49 -0700

>
> The OS booted up and the SAS controller was now detected and supported by
> the mpt(4) driver:
> ---
> mpt0: <LSILogic SAS Adapter> port 0xec00-0xecff mem 0xfc4fc000-0xfc4fffff,
> 0xfc4e0000-0xfc4effff irq 64 at device 8.0 on pci2
> mpt0: Reserved 0x100 bytes for rid 0x10 type 4 at 0xec00
> mpt0: Reserved 0x4000 bytes for rid 0x14 type 3 at 0xfc4fc000
> mpt0: [GIANT-LOCKED]
> mpt0: MPI Version=1.5.12.0
> ---
>
> And the related errors showed up immediately, for the first time:
> ---
> mpt0: mpt_cam_event: 0x16
> mpt0: Unhandled Event Notify Frame. Event 0x16 (ACK not required).
> mpt0: mpt_cam_event: 0x12
> mpt0: Unhandled Event Notify Frame. Event 0x12 (ACK not required).
> mpt0: mpt_cam_event: MPI_EVENT_SAS_DEVICE_STATUS_CHANGE
> mpt0: mpt_cam_event: MPI_EVENT_SAS_DEVICE_STATUS_CHANGE
> mpt0: mpt_cam_event: 0x16
> mpt0: Unhandled Event Notify Frame. Event 0x16 (ACK not required).
> --

These are device arrival events.

>
> When the bootstrap process reached the SCSI probe, there was
> no activity on the screen for about five minutes, so I was forced to use
> the power off button, and after rebooting, the same symptoms were evident,
> so I rebooted the machine once again, this time in verbose mode.
>
> This debug information was being printed on the screen, one character at a time,
> at about 1 char/sec:
>
> (probe8:mpt0:0:8:0): error 22

What's at target 8? It isn't happy for a variety of reasons. Oh, I see
from below: it's an SES instance that drops dead if given something at
lun 0.

> (probe8:mpt0:0:8:0): Unretryable Error
> ---
> pass0 at mpt0 bus 0 target 0 lun 0
> pass0: <MAXTOR ATLAS15K2_073SAS BP00> Fixed Direct Access SCSI-5 device
>
> As a workaround, I disabled the APICs (hint.apic.0.disabled),
> and that ~15 minute delay at boot up was now gone. Fine.
>
> (BTW, 7-CURRENT has the same problem, but without that huge delay)

Do you have APIC disabled for 7-CURRENT also?

>
> Once I was logged in to the server, I proceeded to populate my ports tree
> using portsnap(8). When I extracted the tarball (portsnap extract),
> the following error message appeared repeatedly, at about 1 message per second:
>
> mpt0: Unhandled Event Notify Frame. Event 0xe (ACK not required).

Queue Full events from the SAS firmware.
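To see how often each of these shows up, the unhandled events can be tallied from a saved dmesg. This is only a rough sketch: the labels in `EVENT_NOTES` are just the readings given in this reply (0xe as Queue Full, 0x12 and 0x16 as device arrival), not names taken from the MPI headers, and `tally_unhandled_events` and `describe` are hypothetical helper names.

```python
import re
from collections import Counter

# Labels come only from the replies in this thread; they are not the
# official MPI event names. Any other code is reported as "unknown".
EVENT_NOTES = {
    0x0E: "Queue Full (SAS firmware)",
    0x12: "device arrival",
    0x16: "device arrival",
}

# Matches e.g. "mpt0: Unhandled Event Notify Frame. Event 0x16 (ACK not required)."
# and tolerates leading "> " quote markers from a mail archive.
UNHANDLED_RE = re.compile(
    r"^(?:>\s*)*(mpt\d+): Unhandled Event Notify Frame\. Event (0x[0-9a-fA-F]+)"
)


def tally_unhandled_events(log_text):
    """Count (controller, event code) pairs among 'Unhandled Event' lines."""
    counts = Counter()
    for line in log_text.splitlines():
        m = UNHANDLED_RE.match(line.strip())
        if m:
            counts[(m.group(1), int(m.group(2), 16))] += 1
    return counts


def describe(counts):
    """Render one summary line per (controller, event code), sorted."""
    rows = []
    for (ctrl, code), n in sorted(counts.items()):
        note = EVENT_NOTES.get(code, "unknown")
        rows.append(f"{ctrl} event {code:#04x}: {n} ({note})")
    return rows


# Sample lines copied from the messages quoted in this thread.
SAMPLE = """\
mpt0: mpt_cam_event: 0x16
mpt0: Unhandled Event Notify Frame. Event 0x16 (ACK not required).
mpt0: mpt_cam_event: 0x12
mpt0: Unhandled Event Notify Frame. Event 0x12 (ACK not required).
mpt0: Unhandled Event Notify Frame. Event 0x16 (ACK not required).
mpt0: Unhandled Event Notify Frame. Event 0xe (ACK not required).
"""

if __name__ == "__main__":
    for row in describe(tally_unhandled_events(SAMPLE)):
        print(row)
```

A steady stream of 0xe lines under write load, as opposed to a handful of 0x12/0x16 lines at boot, would be consistent with the two readings above.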

>
> Once in a while, an error message like the one below showed up:
> --
> (da0:mpt0:0:0:0): WRITE(10). CDB: 2a 0 1 55 6f 5f 0 0 20 0
> (da0:mpt0:0:0:0): CAM Status: SCSI Status Error
> (da0:mpt0:0:0:0): SCSI Status: Check Condition
> (da0:mpt0:0:0:0): UNIT ATTENTION asc:29,2
> (da0:mpt0:0:0:0): Scsi bus reset occurred

Somebody is resetting the bus periodically. We (FreeBSD) aren't
volitionally doing this that I'm aware of here.
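For reference, the failing command in that trace can be picked apart by hand. The sketch below assumes the standard 10-byte WRITE(10) layout (opcode in byte 0, big-endian LBA in bytes 2-5, big-endian transfer length in bytes 7-8); `decode_write10` is a hypothetical helper name, and the block size is not assumed.

```python
def decode_write10(cdb_text):
    """Decode a WRITE(10) CDB given as space-separated hex bytes."""
    cdb = [int(tok, 16) for tok in cdb_text.split()]
    if len(cdb) != 10:
        raise ValueError("WRITE(10) CDB must be exactly 10 bytes")
    if cdb[0] != 0x2A:
        raise ValueError("opcode is not WRITE(10) (0x2a)")
    return {
        "opcode": cdb[0],
        # Bytes 2..5: logical block address, most significant byte first.
        "lba": int.from_bytes(bytes(cdb[2:6]), "big"),
        # Bytes 7..8: number of blocks to transfer.
        "blocks": int.from_bytes(bytes(cdb[7:9]), "big"),
        "control": cdb[9],
    }


if __name__ == "__main__":
    # CDB copied from the quoted da0 error above.
    info = decode_write10("2a 0 1 55 6f 5f 0 0 20 0")
    print(info["lba"], info["blocks"])  # prints: 22376287 32
```

So this was an ordinary 32-block write that happened to be the first command to hit the unit attention (asc:29,2, which the log itself reports as a bus reset) after the reset; nothing about the command itself looks unusual.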

> (In order to perform those diagnostics, I had to install a SuSe Linux
> Enterprise Server 9, which was also shipped with this machine)

Which is a good way of saying that LSI-Logic support isn't very
evident on FreeBSD.

>
> After reinstalling FreeBSD, I logged remotely into the server, via ssh,
> and fetched the ports snapshot again and extracted once more.
>
> Suddenly, the screen activity ceased and the network connection timed out.
>
> Locally, on the server, there were a lot of mpt(4) errors and warnings.
> ---
> (da0:mpt0:0:0:0): CAM Status 0x18
> (da0:mpt0:0:0:0): Retrying Command
> (... and about 500 more lines like those...)

Hmm.

> ---
>
> And finally, those errors from mpt(4):
>
> ---
> request 0xc4c4a080:44717 timed out for ccb 0xc4e41400 (req->ccb 0xc4e41400)
> request 0xc4c4b430:44718 timed out for ccb 0xc4ca5800 (req->ccb 0xc4ca5800)
> request 0xc4c4cd80:44719 timed out for ccb 0xc4c52800 (req->ccb 0xc4c52800)
> (... and about 300 more lines like those ...)
> ---
>
> which were followed by the same number of lines like these:
> ---
> mpt0: completing timedout/aborted req 0xc4c4a080:44717
> mpt0: completing timedout/aborted req 0xc4c4b430:44718
> mpt0: completing timedout/aborted req 0xc4c4cd80:44719
> ---
>
> and finishing with this line:
> ---
> mpt0: Timedout requests already complete. Interrupts may not be functioning.
> ---
>

I've seen this on Supermicro EM64T in the past on 7-CURRENT, but that
went away about 3-4 weeks ago. It really seemed to me that this was
indeed an interrupt-related problem.
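One quick check on a captured console log is whether every request that timed out also appears in a "completing timedout/aborted req" line, which is the pattern the final message describes. A minimal sketch, assuming the line formats are exactly as quoted above; `correlate` is a hypothetical helper name.

```python
import re

# Line formats as they appear in the quoted console output.
TIMED_OUT_RE = re.compile(
    r"request (0x[0-9a-f]+):(\d+) timed out for ccb (0x[0-9a-f]+)"
)
COMPLETED_RE = re.compile(
    r"completing timedout/aborted req (0x[0-9a-f]+):(\d+)"
)


def correlate(log_text):
    """Return (timed_out, completed, never_completed).

    timed_out maps (request address, serial) to the ccb address;
    completed is the set of (request address, serial) later completed;
    never_completed lists timed-out requests with no completion line.
    """
    timed_out = {}
    completed = set()
    for line in log_text.splitlines():
        m = TIMED_OUT_RE.search(line)
        if m:
            timed_out[(m.group(1), int(m.group(2)))] = m.group(3)
            continue
        m = COMPLETED_RE.search(line)
        if m:
            completed.add((m.group(1), int(m.group(2))))
    never_completed = sorted(set(timed_out) - completed)
    return timed_out, completed, never_completed


# Sample lines copied from the quoted output above.
SAMPLE = """\
request 0xc4c4a080:44717 timed out for ccb 0xc4e41400 (req->ccb 0xc4e41400)
request 0xc4c4b430:44718 timed out for ccb 0xc4ca5800 (req->ccb 0xc4ca5800)
request 0xc4c4cd80:44719 timed out for ccb 0xc4c52800 (req->ccb 0xc4c52800)
mpt0: completing timedout/aborted req 0xc4c4a080:44717
mpt0: completing timedout/aborted req 0xc4c4b430:44718
mpt0: completing timedout/aborted req 0xc4c4cd80:44719
"""

if __name__ == "__main__":
    timed_out, completed, missing = correlate(SAMPLE)
    print(len(timed_out), len(completed), missing)
```

If the never-completed list is empty across the whole log, the firmware had finished the work and the host simply was not told, which points at interrupt delivery rather than at the disk.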

Yup, sounds like a mess here.