Re: Second SATA device lost after ZFS root is mount

From: Alexander Motin <mav_at_FreeBSD.org>
Date: Tue, 15 Nov 2011 01:43:21 +0200
On 15.11.2011 01:00, Sebastian Chmielewski wrote:
> On Tue, 15 Nov 2011 00:39:52 +0200
> Alexander Motin<mav_at_FreeBSD.org>  wrote:
>
>> SATA device can be dropped because of error during reset/ probe/
>> initialization sequence or because controller reported disconnection.
>> Verbose boot messages (boot -v from loader prompt) should give more
>> information about what happened there. Show please full verbose dmesg.
> Using rc_debug="YES" in rc.conf I've found that my device is dropped during
> sysctl_start. With empty sysctl.conf my device is not lost. The contents of
> file seems quite innocent:
>
> # Uncomment this to prevent users from seeing information about processes that
> # are being run under another UID.
> security.bsd.see_other_uids=1
>
> # Enable/disable coredump
> kern.coredump=1
>
> # Up the maxfiles to 4x default
> kern.maxfiles=49312
>
> kern.ipc.shmmax=67108864
> kern.ipc.shmall=32768
>
> # Allow users to mount CD's
> vfs.usermount=1
> vfs.hirunningspace=8388608
> vfs.lorunningspace=1048576
>
> kern.corefile="/var/coredumps/%U/%N.core"
>
> # Do not truncate command line arguments in ps(1) listing
> kern.ps_arg_cache_limit=10000
>
> # Tune for desktop usage
> kern.sched.preempt_thresh=224
>
> # Increase default setting - recommended for 2 GB of RAM
> kern.maxvnodes=400000
>
> dev.acpi_ibm.0.lcd_brightness=6
> dev.acpi_ibm.0.lcd_brightness=3
> net.link.tap.user_open=1
> net.link.tap.up_on_open=1
>
> The device is lost even when sysctl is started with new file when booting finishes (I did service sysctl restart from X session).
> # sysctl debug.bootverbose=1
> # service sysctl restart
> # dmesg
>
> ahcich1: DISCONNECT requested
> ahcich1: AHCI reset...
> ahcich1: SATA connect timeout time=10000us status=00000000
> ahcich1: AHCI reset: device not found
> (ada1:ahcich1:0:0:0): lost device
> (pass1:ahcich1:0:0:0): lost device
> (pass1:ahcich1:0:0:0): removing device entry
>
> Crazy, isn't it?

It is. I've never heard about such things.

The reset status looks as if the device was indeed disconnected or
powered down. I don't even know how to make that happen, at least on
Intel chipsets. My laptop's BIOS has a bug that disables the SATA port
after suspend/resume, but there the reset status shows that the port was
explicitly disabled. I have only one crazy idea: when setting the screen
brightness you are calling ACPI code, which is a black box by definition
and can do whatever it wants with the hardware, including using any
custom power control interfaces.

Was the second disk originally intended for this laptop? Laptop vendors,
more than desktop ones, tend to hardcode things.

I would try two things:
  - bisect the list of sysctls to find the one that causes this;
  - try enabling SATA interface power management for the device. If
power management was somehow enabled on the device behind the OS's back,
it may cause false DISCONNECT messages, although it still should not
cause such a reset status. Setting hint.ahcich.1.pm_level=1 in
loader.conf will make the ahci(4) driver ignore link loss events. If the
device is indeed lost, you should see command timeouts and only then the
device loss.
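
For the bisection, a minimal sketch of one way to do it by hand. It only
splits the active settings of a sysctl.conf-style file into two halves and
touches nothing else. The function name split_conf and the /tmp paths are
made up for this example and are not part of any FreeBSD tool.

```sh
#!/bin/sh
# split_conf SRC PREFIX   (hypothetical helper, for illustration only)
# Writes the first half of the active settings in SRC to PREFIX.a and
# the second half to PREFIX.b. Comments and blank lines are dropped;
# the order of the remaining lines is preserved.
split_conf() {
    src=$1
    prefix=$2

    # Keep only real assignments: no comment lines, no blank lines.
    grep -v -e '^[[:space:]]*#' -e '^[[:space:]]*$' "$src" > "$prefix.all"

    total=$(wc -l < "$prefix.all" | tr -d '[:space:]')
    if [ "$total" -eq 0 ]; then
        rm -f "$prefix.all"
        return 1            # nothing to split
    fi

    # The first half gets the extra line when the count is odd.
    mid=$(( (total + 1) / 2 ))
    head -n "$mid" "$prefix.all" > "$prefix.a"
    tail -n +"$((mid + 1))" "$prefix.all" > "$prefix.b"
    rm -f "$prefix.all"
}

# Possible use (run by hand, after backing up the original file):
#   split_conf /etc/sysctl.conf /tmp/sysctl
#   cp /tmp/sysctl.a /etc/sysctl.conf
#   service sysctl restart
#   dmesg | tail
```

If the disk disappears with one half installed, split that half again and
repeat until a single line is left. Note that settings are only applied,
never reverted, so it is safest to reboot between rounds.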

-- 
Alexander Motin
Received on Mon Nov 14 2011 - 22:43:23 UTC
