Re: Bad link elm in vm_object_terminate [Was: crash on process exit.. current at about r332467]

From: Mark Johnston <markj_at_FreeBSD.org>
Date: Tue, 29 May 2018 12:22:17 -0400

On Tue, May 29, 2018 at 04:50:14PM +0300, Andriy Gapon wrote:
> On 23/04/2018 17:50, Julian Elischer wrote:
> > back trace at:  http://www.freebsd.org/~julian/bob-crash.png
> > 
> > If anyone wants to take a look..
> > 
> > In the exit syscall, while deallocating a vm object.
> > 
> > I haven't seen references to a similar crash in the last 10 days or so.. But if
> > it rings any bells...
> 
> We have just got another one:
> panic: Bad link elm 0xfffff80cc3938360 prev->next != elm
> 
> Matching disassembled code to C code, it seems that the crash is somewhere in
> vm_object_terminate_pages (inlined into vm_object_terminate), probably in one of
> TAILQ_REMOVE-s there:
> 		if (p->queue != PQ_NONE) {
> 			KASSERT(p->queue < PQ_COUNT, ("vm_object_terminate: "
> 			    "page %p is not queued", p));
> 			pq1 = vm_page_pagequeue(p);
> 			if (pq != pq1) {
> 				if (pq != NULL) {
> 					vm_pagequeue_cnt_add(pq, dequeued);
> 					vm_pagequeue_unlock(pq);
> 				}
> 				pq = pq1;
> 				vm_pagequeue_lock(pq);
> 				dequeued = 0;
> 			}
> 			p->queue = PQ_NONE;
> 			TAILQ_REMOVE(&pq->pq_pl, p, plinks.q);
> 			dequeued--;
> 		}
> 		if (vm_page_free_prep(p, true))
> 			continue;
> unlist:
> 		TAILQ_REMOVE(&object->memq, p, listq);
> 	}
> 
> 
> Please note that this is the code before r332974 Improve VM page queue scalability.
> I am not sure if r332974 + r333256 would fix the problem or if it just would get
> moved to a different place.
> 
> Does this ring a bell to anyone who tinkered with that part of the VM code recently?

This doesn't look familiar to me and I doubt that r332974 fixed the
underlying problem, whatever it is.

> Looking a little bit further, I think that object->memq somehow got corrupted.
> memq contains just two elements and the reported element is not there.

Based on the debugging session, it would be interesting to know if there
were any other threads somehow manipulating the (dead) object at the
time of the panic.
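
For reference, here is a minimal userspace sketch of the kind of
consistency check behind the "prev->next != elm" message, and of one way
it can fire: removing an element that is no longer on the list. The
names (struct page, struct pagelist, page_prev_link_ok, and so on) are
made up for illustration and are not the kernel's; this is not a claim
about what actually happened to the object in this panic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a vm_page with a single tail-queue linkage. */
struct page {
	struct page *next;	/* next element, or NULL at the tail */
	struct page **prevp;	/* address of the predecessor's next pointer */
};

/* Hypothetical stand-in for a list head such as object->memq. */
struct pagelist {
	struct page *first;
	struct page **lastp;	/* address of the last element's next pointer */
};

static void
pagelist_init(struct pagelist *l)
{
	l->first = NULL;
	l->lastp = &l->first;
}

static void
pagelist_insert_tail(struct pagelist *l, struct page *p)
{
	p->next = NULL;
	p->prevp = l->lastp;
	*l->lastp = p;
	l->lastp = &p->next;
}

/* The "prev->next != elm" test: does our predecessor still point at us? */
static bool
page_prev_link_ok(const struct page *p)
{
	return (*p->prevp == p);
}

static void
pagelist_remove(struct pagelist *l, struct page *p)
{
	if (p->next != NULL)
		p->next->prevp = p->prevp;
	else
		l->lastp = p->prevp;
	*p->prevp = p->next;
	/* p->next and p->prevp are deliberately left stale here. */
}

/* Returns true if a second removal of the same page would be caught. */
static bool
double_remove_detected(void)
{
	struct pagelist l;
	struct page a, b, c;

	pagelist_init(&l);
	pagelist_insert_tail(&l, &a);
	pagelist_insert_tail(&l, &b);
	pagelist_insert_tail(&l, &c);

	assert(page_prev_link_ok(&b));
	pagelist_remove(&l, &b);	/* list now holds only a and c */

	/* b's stale prevp points at a.next, which now refers to c. */
	return (!page_prev_link_ok(&b));
}
```

In this sketch, after the first removal the list holds two elements and
the removed one still carries stale link pointers, so a second removal
attempt fails the check. A concurrent modification by another thread
would be one way to end up in a similar state.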

Among the panics that you observed, is it the same application that is
causing the crash in each case?