Re: Possible bug in or around posix_fadvise after r292326

From: Konstantin Belousov <kostikbel_at_gmail.com>
Date: Tue, 5 Jan 2016 13:24:53 +0200
On Tue, Jan 05, 2016 at 01:07:40PM +0200, Konstantin Belousov wrote:
> On Mon, Jan 04, 2016 at 10:05:21PM -0800, Benno Rice wrote:
> > Hi Konstantin,
> > 
> > I recently updated my dev box to r292962. After doing this I attempted to set up PostgreSQL 9.4. When I ran initdb the last phase hung. Using procstat -kk I found it appeared to be stuck in a loop inside a posix_fadvise syscall. I could not ^C or ^Z the initdb process. I could kill it but a subsequent attempt to rm -rf the /usr/local/pgsql/data directory also got stuck and was unkillable by any means. Rebooting allowed me to remove the directory but the initdb process still hung when I re-ran it.
> > 
> > I tried PostgreSQL 9.3 with similar results.
> > 
> > Looking at the source code for initdb I found that it calls posix_fadvise like so[1]:
> > 
> >      /*
> >       * We do what pg_flush_data() would do in the backend: prefer to use
> >       * sync_file_range, but fall back to posix_fadvise.  We ignore errors
> >       * because this is only a hint.
> >       */
> >  #if defined(HAVE_SYNC_FILE_RANGE)
> >      (void) sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE);
> >  #elif defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED)
> >      (void) posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
> >  #else
> >  #error PG_FLUSH_DATA_WORKS should not have been defined
> >  #endif
> > 
> > Looking for recent commits involving POSIX_FADV_DONTNEED I found r292326:
> > 
> > https://svnweb.freebsd.org/changeset/base/292326 <https://svnweb.freebsd.org/changeset/base/292326>
> > 
> > Backing this revision out allowed the initdb process to complete.
> > 
> > My current theory is that some how we???re getting ENOLCK or EAGAIN from the BUF_TIMELOCK call in bnoreuselist:
> > 
> > https://svnweb.freebsd.org/base/head/sys/kern/vfs_subr.c?view=annotate#l1676 <https://svnweb.freebsd.org/base/head/sys/kern/vfs_subr.c?view=annotate#l1676>
> > 
> > Leading to an infinite loop in vop_stdadvise:
> > 
> > https://svnweb.freebsd.org/base/head/sys/kern/vfs_default.c?annotate=292373#l1083 <https://svnweb.freebsd.org/base/head/sys/kern/vfs_default.c?annotate=292373#l1083>
> > 
> > I haven???t managed to dig any deeper than that yet.
> > 
> > Is there any other information I could give you to help narrow this down?
> 
> I do not see this issue locally.
> 
> When the hang in initdb occur, what is the state of the initdb thread
> which performs advise() ?  Is it "brlsfl" sleep, or is the thread running ?
> 
> If buffer lock is not available, and this is the cause of the ENOLCK/EAGAIN,
> then the question is who is the owner of the corresponding buffer lock.
> You could overview the state of the system with 'ps' command in ddb, and
> 'show alllocks' would list owner, unless buffer was async.
> 
> Also, I do not quite understand the behaviour of SIGINT/SIGKILL.  Could
> it be that the process was not killed by SIGKILL as well ?  It would be
> consistent with the vnode lock still owned and preventing the accesses.

Just in case, if this is due to the quadratic loop behaviour, there
is no need to restart from the very start in the new stdadvise(DONTNEED)
implementation.  So regardless of the answers to the questions above,
you might also try this patch.

diff --git a/sys/kern/vfs_default.c b/sys/kern/vfs_default.c
index fd83f87..3da8618 100644
--- a/sys/kern/vfs_default.c
+++ b/sys/kern/vfs_default.c
_at__at_ -1080,15 +1080,9 _at__at_ vop_stdadvise(struct vop_advise_args *ap)
 		bsize = vp->v_bufobj.bo_bsize;
 		startn = ap->a_start / bsize;
 		endn = ap->a_end / bsize;
-		for (;;) {
-			error = bnoreuselist(&bo->bo_clean, bo, startn, endn);
-			if (error == EAGAIN)
-				continue;
+		error = bnoreuselist(&bo->bo_clean, bo, startn, endn);
+		if (error == 0)
 			error = bnoreuselist(&bo->bo_dirty, bo, startn, endn);
-			if (error == EAGAIN)
-				continue;
-			break;
-		}
 		BO_RUNLOCK(bo);
 		VOP_UNLOCK(vp, 0);
 		break;
diff --git a/sys/kern/vfs_subr.c b/sys/kern/vfs_subr.c
index ace97e8..8cac32f 100644
--- a/sys/kern/vfs_subr.c
+++ b/sys/kern/vfs_subr.c
_at__at_ -1670,6 +1670,7 _at__at_ bnoreuselist(struct bufv *bufv, struct bufobj *bo, daddr_t startn, daddr_t endn)
 	ASSERT_BO_LOCKED(bo);
 
 	for (lblkno = startn;; lblkno++) {
+again:
 		bp = BUF_PCTRIE_LOOKUP_GE(&bufv->bv_root, lblkno);
 		if (bp == NULL || bp->b_lblkno >= endn)
 			break;
_at__at_ -1677,7 +1678,9 _at__at_ bnoreuselist(struct bufv *bufv, struct bufobj *bo, daddr_t startn, daddr_t endn)
 		    LK_INTERLOCK, BO_LOCKPTR(bo), "brlsfl", 0, 0);
 		if (error != 0) {
 			BO_RLOCK(bo);
-			return (error != ENOLCK ? error : EAGAIN);
+			if (error == ENOLCK)
+				goto again;
+			return (error);
 		}
 		KASSERT(bp->b_bufobj == bo,
 		    ("bp %p wrong b_bufobj %p should be %p",
Received on Tue Jan 05 2016 - 10:24:59 UTC

This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:41:02 UTC