Re: Any objections/comments on axing out old ATA stack?

From: Jeremy Chadwick <jdc_at_koitsu.org>
Date: Sat, 20 Apr 2013 14:29:58 -0700

On Thu, Apr 04, 2013 at 10:00:18AM +0200, Matthias Andree wrote:
> On 04.04.2013 03:05, Jeremy Chadwick wrote:
> 
> { snipping stuff I have no comment on.  reference thread: }
> { > http://lists.freebsd.org/pipermail/freebsd-stable/2013-April/073036.html }
> 
> > One piece of evidence that refutes my theory is that if Windows and/or
> > Linux partition are something you boot into and use often, I would
> > imagine NCQ would be used in both of those environments and would suffer
> > from the same issue.  Although Windows tends to hide all sorts of
> > transient errors from the user (sigh), Linux tends to be like FreeBSD
> > with regards to such issues (on the console anyway; you wouldn't see
> > such messages normally inside of X).
> 
> Now, the FreeBSD slice is the only partition on that disk that would
> likely see concurrent write accesses (think "make -j8" on a quadcore
> computer) which is more prone to ferret out such alignment contention.
> 
> The NTFS partition is aligned on a multi-MB boundary, so wouldn't hit
> the problem anyways.
> 
> The Linux partition is in ext4 format for mostly sequential access to
> files usually in excess of 10 MB each.
> 
> Linux's ext4 jumps through several hoops to end up with bulk writes,
> like extents, delayed allocations (to avoid fragmentation), reordering
> of data and metadata writes, serialized log writes and all that stuff,
> and it would appear I am permitting it to cache writes -- Linux uses
> write barriers to enforce proper ordering of journal/meta-data writes.
> 
> It would be rather hard to hit ATA taskfile timeouts, the expected rate
> with which the drive needs to do a partial write is orders of magnitude
> lower.
> 
> Any good "concurrent write" exercise tools for Unix that I could run on
> the Linux ext4 partition that you would propose?

The only tool I'm familiar with is bonnie++.

But I don't think this (partition alignment) is what matters now.  Your
smartctl output has shed some light on your situation.

> >> - I am running with kern.cam.ada.default_timeout=5 which makes the
> >> computer recover faster
> > 
> > I can definitely imagine cases where a drive using NCQ but doing writes
> > to a non-aligned partition could take longer than 5 seconds to respond
> > to an ATA CDB (this is different than a SATA or AHCI layer timeout).  I am
> > not telling you "change this back to 30", but it might not be helping
> > your situation at all given my above theory.
> 
> My feeling is that the stalls are mostly from the error handler and the
> overall time the drive is "frozen" gets shorter. If it had not _felt_
> faster, I'd not have left that in sysctl.conf in the first place.

Your understanding of what that sysctl does is wrong, or I'm
misunderstanding what you're saying (very possible!).

How I interpret what you're saying: that the sysctl somehow "decreases
stall times" during I/O operations that fail.  This is incorrect.

What that sysctl does is define the number of seconds that transpire
***before*** the CAM layer says "Okay, I didn't get a response to the
ATA CDB I sent the disk", and then re-submits the same CDB to the disk.

Rephrased: in the case of a disk stalling on an I/O request, you will
experience the effects of that stall no matter what that sysctl is set
to.  A lower value in that sysctl will result in CAM spitting out
nasties on the console + hitting the CDB retry submission scenario
sooner, which if the drive is awake/responsive by that time will go
smoothly.

That's all it does.
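
To make that concrete, here is a toy model in Python.  It is only a
sketch of the behaviour described above, not how CAM is implemented:
the function name and the 8-second stall are made up for illustration,
and I am assuming that a resubmitted command gets answered as soon as
the drive wakes up (resets and retry limits are ignored).

```python
def simulate_stall(stall_s, timeout_s):
    """Toy model: one I/O request sent to a drive that is silent for stall_s seconds.

    Returns (timeouts_logged, completed_at_s).
    Assumption: a resubmitted command is answered once the drive wakes up.
    """
    submitted_at = 0.0
    timeouts_logged = 0
    # Every (re)submission starts a fresh timer of timeout_s seconds.
    while stall_s - submitted_at > timeout_s:
        submitted_at += timeout_s   # timer fires: complain on the console, resubmit
        timeouts_logged += 1
    return timeouts_logged, stall_s  # the answer arrives when the stall ends


for timeout in (5, 30):
    count, done = simulate_stall(stall_s=8.0, timeout_s=timeout)
    print("timeout=%2ds: %d timeout message(s), I/O completes at %.1fs"
          % (timeout, count, done))
```

In this model both settings finish the request at 8.0 seconds.  The
only difference is that the 5-second setting logs one timeout and
resubmits along the way, while the 30-second setting stays quiet.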

Thus a value of 5 indicates a device/drive did not respond to a CDB
within 5 seconds, and a value of 30 indicates a device/drive did not
respond to a CDB within 30 seconds.  Regardless, those lengths of time
are VERY long for an I/O operation on a mechanical HDD.

When you get to the bottom of my Email, you'll understand why I screamed
at you about adjusting that sysctl.

> > Finally: could you please provide output from "smartctl -x /dev/ada1"?
> > I would like to rule out any possibility of your drive having some other
> > kind of issue that might cause it to go catatonic.  Thanks.
> 
> I have fetched the data with Linux this time (should not make a
> difference as it's all drive internal data, not host OS stuff).
> 
> Looks sane to me, <http://people.freebsd.org/~mandree/smartctl.log>.
> I'll be happy to refetch this data with a more current smartctl version
> under FreeBSD if required.

Oh look, it's the Samsung SpinPoint series, specifically the EcoGreen
("EG") series.  No joke: ~60% of the "problem reports" I deal with when
it comes to "weird wonky problems" stem from this drive series.  I have
no idea why, but they're a common pain point for me.

First, about the shown sector size: smartmontools 5.41 was the first
release to show the sector sizes per ATA IDENTIFY.  I assume they got
this right from the get-go.  So as of this moment I'm going to assume
that this drive really is a 512-byte sector drive.

Politely, your analysis of the drive ("looks sane to me") is an
indicator of why SMART output needs to be interpreted by a person who is
familiar with the information.  That drive *does not* look sane to me.
:-)

The first thing that comes to my attention is attribute 199, indicating
that the drive has experienced a total of 14 CRC errors during its
lifetime (10779 hours as of that moment).  Usually this attribute is
zeroed at the factory (other attributes are often not).  Just yesterday
I wrote a very long/detailed analysis about what this attribute means,
so I'll just link you to that post.  Please focus on just the part about
CRC errors:

http://www.dslreports.com/forum/r28219261-

The next thing I see are 14 errors in your SMART error log.  It's worth
noting that this number correlates with the CRC error count above
(though depending on drive firmware they may not have a symbiotic
relationship).
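
As a rough illustration of the kind of check I do by eye, here is a
small Python sketch that flags watched attributes with a non-zero raw
value.  The attribute IDs 5 and 197 are the usual IDs for remapped and
pending sectors (they are my addition, not quoted from your log), the
function and dictionary names are made up, and the values are the ones
described in this mail rather than a verbatim copy of smartctl output.

```python
# SMART attribute IDs I look at first (descriptions are informal).
WATCHED = {
    5: "reallocated (remapped) sectors",
    197: "pending/suspect sectors",
    199: "interface CRC errors",
}


def nonzero_watched(raw_values):
    """Return [(id, description, raw)] for watched attributes whose raw value is non-zero."""
    return [(attr_id, WATCHED[attr_id], raw_values[attr_id])
            for attr_id in sorted(WATCHED)
            if raw_values.get(attr_id, 0) != 0]


# Raw values as described in this mail; not a verbatim smartctl dump.
drive = {5: 0, 197: 0, 199: 14}
for attr_id, description, raw in nonzero_watched(drive):
    print("attribute %d (%s): raw value %d" % (attr_id, description, raw))
```

With those values only attribute 199 is reported, which is the point:
the platters look clean while the interface counter does not.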

Your SMART error log consists of entries indicating the drive itself
sent back error conditions to the controller/OS (which FreeBSD or Linux
would show on the console).  The timestamps of these events are based on
power-on hour count, so the most recent event was at 7747 hours, but
there are others going back all the way to 6528 hours.  Sadly, the SMART
error log is very small (2 sectors / 1024 bytes), so only the last 8
errors can be shown.
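
The arithmetic on those timestamps is simple enough to sketch in
Python.  The three hour counts are the ones quoted in this mail; the
variable names are mine, and keep in mind that these are power-on
hours, not calendar time.

```python
POWER_ON_HOURS_NOW = 10779      # lifetime hours when the smartctl output was taken
NEWEST_ERROR_AT = 7747          # most recent entry in the error log
OLDEST_SHOWN_ERROR_AT = 6528    # oldest entry still visible in the log


def hours_to_days(hours):
    """Convert power-on hours to days of powered-on time."""
    return hours / 24.0


since_newest = POWER_ON_HOURS_NOW - NEWEST_ERROR_AT
span = NEWEST_ERROR_AT - OLDEST_SHOWN_ERROR_AT

print("newest logged error: %d power-on hours ago (%.1f days)"
      % (since_newest, hours_to_days(since_newest)))
print("visible errors span: %d power-on hours (%.1f days)"
      % (span, hours_to_days(span)))
```

That works out to 3032 power-on hours (about 126 days of uptime) since
the newest logged entry, and 1219 hours (about 51 days of uptime)
between the oldest and newest entries that are still visible.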

Key points about these errors:

- The LBAs being accessed vary and are all over the board, indicating that
  it's very unlikely this anomaly is being caused by physical defects
  on the platters (the drive also shows no remapped LBAs or
  pending/suspect LBAs, which further supports that theory),

- The ATA commands which lead up to the error also vary.  Many are for
  write requests, and from some entries I can see that the OS was doing
  NCQ writes (WRITE FPDMA QUEUED) and then suddenly decided to do a
  classic 28-bit LBA write (WRITE DMA).  I'm not sure why an OS would do
  this (there's nothing optimal about it) unless there were conditions
  occurring where the OS/ATA driver said "this NCQ write isn't working
  (timeout, etc.), let me retry with a classic 28-bit LBA write".

  There is one entry (the last) which shows a similar situation
  happening but with NCQ reads.

- These are conditions that short, long, select (LBA range scan), and
  conveyance SMART tests would probably not detect.  Like I said: it
  seems to be all over the board.
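
The NCQ-then-classic pattern from the second point is easy to look for
mechanically.  Below is a minimal Python sketch, assuming you have
already pulled the command names out of the error log into a list.  The
function names are made up and the sample sequence is invented for
illustration; it is NOT copied from your log.

```python
NCQ_COMMANDS = {"READ FPDMA QUEUED", "WRITE FPDMA QUEUED"}


def direction(command):
    """First word of an ATA command name, e.g. "READ" or "WRITE"."""
    return command.split(" ", 1)[0]


def find_ncq_fallbacks(commands):
    """Indexes of non-NCQ commands that directly follow an NCQ command
    going in the same direction (read after read, write after write)."""
    hits = []
    for i in range(len(commands) - 1):
        cur, nxt = commands[i], commands[i + 1]
        if (cur in NCQ_COMMANDS and nxt not in NCQ_COMMANDS
                and direction(cur) == direction(nxt)):
            hits.append(i + 1)
    return hits


# Invented command sequence, for illustration only.
sample = [
    "WRITE FPDMA QUEUED",
    "WRITE FPDMA QUEUED",
    "WRITE DMA",
    "READ FPDMA QUEUED",
    "READ DMA",
]
print(find_ncq_fallbacks(sample))
```

For the invented sample this prints [2, 4]: the WRITE DMA and the READ
DMA that each directly follow a queued command of the same direction.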

This is not the first time I have seen this behaviour with SpinPoint
drives.

Bernd Walter responded that, in his experience, the issue was related
to NCQ compatibility.  This would not surprise me.

NCQ incompatibilities have happened in the past; the most notable (to
me) was between Maxtor drives and nVidia SATA controllers.  Both
companies blamed the other, yet both came out with "fixes" (Maxtor with
a firmware update, nVidia with a driver update).  Neither company stated
anything concrete/useful publicly (oh America, so stock-focused you
are).  My personal opinion is that the bug was in Maxtor's firmware, and
nVidia ceased use of NCQ requests to drives matching specific model
numbers (similar to what we do in FreeBSD, re: 4KB quirks).

What doesn't help is that SpinPoint drives have a history of pretty
awful firmware bugs, such as this one, which still blows my mind to this
day:

http://sourceforge.net/apps/trac/smartmontools/wiki/SamsungF4EGBadBlocks

Your drive is using firmware version 1AG01118, but I can't easily find a
newer firmware because of the whole Seagate/Samsung buyout (Seagate
buying out Samsung's MHDD division).

Because of the "random" nature of this issue, my opinion is that what
you're experiencing is caused by one of the following:

- The "EG" series are known to park their heads excessively, and much to
  my annoyance, do not track this behaviour in SMART (normally it's
  tracked in attribute 193, which the drive lacks (probably
  intentionally)).  This head-parking nonsense is known to cause
  problems in certain situations, reported by the OS as timeouts and
  I/O errors as the drive is trying to wake up and respond to the CDB.
  There are many drives on the market that do this now, and I
  generally boycott them all (it's only useful for laptops).  I can
  talk at length about that some other time, or you can find/read my
  blog (I wrote an article about the WD30EFRX doing this -- at least
  on WD drives you can inhibit the behaviour, while on Seagate you
  can't).

  I noticed that SMART attribute 3 on your drive indicates it takes
  roughly 6.2 seconds to spin up.  This may change over time as well
  (often getting worse as the drive gets older (spindle motors do
  wear down over time)).

  Now take into consideration the sysctl you changed, and what I
  said earlier about me knowing some conditions where a drive may
  take >5 seconds to handle certain I/O ops.

- NCQ bugs in the drive's firmware.  You can try to talk to Samsung
  about this, but you'll probably get nowhere due to how deep within
  companies actual engineers live.

My suggestions to you at this point in time:

- Remove the sysctl and leave it at its default (30 seconds).  Or if
  you really must adjust it, set it to 15.  YMMV with this.

- Replace the drive and/or choose another drive vendor.

My suggestions for FreeBSD at this time:

- Regardless of what the root cause of the above is, we really do need a
  no-NCQ quirk, and we also need to print the quirks used (in a similar
  fashion to how CPU features are shown) during boot.

  I can try to write the code for this, but I am going to need help.
  Kernel land is not something I'm generally good at.
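
To show what I have in mind, here is a user-land mock-up in Python, not
kernel code.  Everything in it is hypothetical: the NO_NCQ flag, the
flag values, the model patterns, and the exact output format are
placeholders I picked for illustration, loosely modelled on how CPU
features are printed at boot.

```python
from fnmatch import fnmatchcase

# Hypothetical quirk bits; the real driver's names and values may differ.
QUIRK_4K = 0x01       # treat the drive as having 4KB physical sectors
QUIRK_NO_NCQ = 0x02   # proposed: never send queued (NCQ) commands
QUIRK_NAMES = {QUIRK_4K: "4K", QUIRK_NO_NCQ: "NO_NCQ"}

# Hypothetical model patterns; first match wins.
QUIRK_TABLE = [
    ("EXAMPLE ECO*", QUIRK_4K | QUIRK_NO_NCQ),
    ("EXAMPLE PRO 1TB", QUIRK_4K),
]


def lookup_quirks(model):
    """Return the quirk bits for a drive model string, or 0 if none match."""
    for pattern, quirks in QUIRK_TABLE:
        if fnmatchcase(model, pattern):
            return quirks
    return 0


def format_quirks(quirks):
    """Render quirk bits in the style of the CPU feature lines at boot."""
    names = [name for bit, name in sorted(QUIRK_NAMES.items()) if quirks & bit]
    return "quirks=0x%x<%s>" % (quirks, ",".join(names))


print("ada1: " + format_quirks(lookup_quirks("EXAMPLE ECO 2TB")))
```

For the made-up model above this prints "ada1: quirks=0x3<4K,NO_NCQ>",
which is the kind of one-line summary I would like to see in dmesg.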

-- 
| Jeremy Chadwick                                   jdc_at_koitsu.org |
| UNIX Systems Administrator                http://jdc.koitsu.org/ |
| Mountain View, CA, US                                            |
| Making life hard for others since 1977.             PGP 4BD6C0CB |