Re: Unexpected SU+J inconsistency AGAIN -- please, don't shift topic to ZFS!

From: Lev Serebryakov <lev_at_FreeBSD.org> Date: Thu, 28 Feb 2013 18:56:47 +0400

Hello, Ivan.
You wrote on 28 February 2013, 18:19:38:

>>    Maybe it is a subtle interference between the raid5 implementation
>>   and SU+J, but in that case I want to understand what raid5 does
>>   wrong.
IV> You guessed correctly, I was going to blame geom_raid5 :)
  It is not the first time :( But every time, such a discussion ends
 without any practical results.

  Kirk once said that delayed writes are OK for SU as long as the bottom
 layer doesn't lie about operation completion. geom_raid5 may delay
 writes (in the hope that subsequent writes will combine nicely and let
 it avoid a read-calculate-write cycle), but it never marks a BIO
 complete until it really is completed (that is, until the layers below
 geom_raid5 report completion). So every BIO in the wait queue is "in
 flight" from the GEOM/VFS point of view. Maybe that is fatal for the
 journal :(
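
  The behavior described above can be illustrated with a toy userspace
 model. This is only a sketch, not geom_raid5 code: the names Bio,
 LowerLayer and DelayingLayer, and the max_queued threshold, are
 hypothetical, and the lower layer is assumed to complete every write
 immediately.

```python
from collections import deque


class Bio:
    """Hypothetical stand-in for a GEOM bio; it only tracks completion."""

    def __init__(self, offset, data):
        self.offset = offset
        self.data = data
        self.done = False  # set only after the lower layer has the data


class LowerLayer:
    """Hypothetical disk that completes each write as soon as it is issued."""

    def __init__(self):
        self.blocks = {}

    def write(self, offset, data):
        self.blocks[offset] = data


class DelayingLayer:
    """Queues writes hoping to combine them; never reports early completion."""

    def __init__(self, lower, max_queued=4):
        self.lower = lower
        self.queue = deque()
        self.max_queued = max_queued  # assumed flush threshold

    def submit(self, bio):
        # The bio is only queued here, so it stays "in flight" for the caller.
        self.queue.append(bio)
        if len(self.queue) >= self.max_queued:
            self.flush()

    def flush(self):
        # Push queued writes down, and only then mark each bio as done.
        while self.queue:
            bio = self.queue.popleft()
            self.lower.write(bio.offset, bio.data)
            bio.done = True

    def in_flight(self):
        return len(self.queue)
```

  In this model the upper layer is never told that a write finished
 before the data reached the lower layer; it just waits longer for the
 completion.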

  And what I really want to see is a "SYNC" flag for BIO, with all
 journal-related writes marked with it. Also, all commits originating
 from fsync() MUST be marked the same way, really. Alexander Motin
 (the ahci driver author) assured me that he'll add support for such
 a flag to the driver, to flush the drive cache too, if it is
 introduced.
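
  A minimal sketch of how such a flag could behave, again as a toy
 userspace model. Everything here is an assumption made for
 illustration: BIO_SYNC, Disk and SyncAwareLayer are hypothetical names,
 no such flag exists in GEOM as described in this message, and pushing
 out all earlier queued writes together with the marked one is just one
 possible design.

```python
BIO_SYNC = 0x1  # hypothetical flag value, not an existing GEOM flag


class Disk:
    """Hypothetical drive with a volatile write cache in front of the media."""

    def __init__(self):
        self.cache = {}  # lost on power failure
        self.media = {}  # stable storage

    def write(self, offset, data):
        self.cache[offset] = data

    def flush_cache(self):
        self.media.update(self.cache)
        self.cache.clear()


class SyncAwareLayer:
    """Delays ordinary writes, but honors the hypothetical BIO_SYNC flag."""

    def __init__(self, disk):
        self.disk = disk
        self.pending = []  # delayed writes, in submission order

    def submit(self, offset, data, flags=0):
        self.pending.append((offset, data))
        if flags & BIO_SYNC:
            # Issue everything queued so far, in order, then flush the
            # drive cache so the data is on stable storage before returning.
            for off, buf in self.pending:
                self.disk.write(off, buf)
            self.pending.clear()
            self.disk.flush_cache()
```

  With this model, a journal write or an fsync()-originated write would
 be submitted with flags=BIO_SYNC, while ordinary writes could still be
 delayed and combined.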

  IMHO, the lack of this (or a similar) flag is a bad thing even without
 geom_raid5 and its optimistic behavior.

  There was commit r246876, but I don't understand exactly what it
 means, as no real FS or driver code was touched.

  But I'm writing about this idea for the 3rd or 4th time without any
 results :( And I don't mean that it should be implemented ASAP by
 someone; I mean I didn't see any support from the FS guys (Kirk and
 somebody else; I don't remember exactly who took part in those old
 threads, but it was not you) along the lines of "go ahead and send
 your patch". All those threads were very defensive on the FS gurus'
 side, like "we don't need it, fix the hardware, disable caches".

IV> Is this a production setup you have? Can you afford to destroy it and
IV> re-create it for the purpose of testing, this time with geom_raid3
IV> (which should be synchronous with respect to writes)?
  Unfortunately, it is a production setup and I don't have any spare
 hardware for a second one :(

  I've posted the panic stack trace -- and it is FFS-related too -- and
 am now preparing a setup with only one HDD and the same high load, to
 try to reproduce it without geom_raid5. But I don't have enough
 hardware (at least 3 spare HDDs!) to reproduce it with geom_raid3 or
 another copy of geom_raid5.

-- 
// Black Lion AKA Lev Serebryakov <lev_at_FreeBSD.org>