Re: Bad link elm in vm_object_terminate [Was: crash on process exit.. current at about r332467]

From: Andriy Gapon <avg_at_FreeBSD.org> Date: Tue, 29 May 2018 19:38:19 +0300

On 29/05/2018 19:22, Mark Johnston wrote:
> On Tue, May 29, 2018 at 04:50:14PM +0300, Andriy Gapon wrote:
>> On 23/04/2018 17:50, Julian Elischer wrote:
>>> back trace at:  http://www.freebsd.org/~julian/bob-crash.png
>>>
>>> If anyone wants to take a look..
>>>
>>> In the exit syscall, while deallocating a vm object.
>>>
>>> I haven't seen references to a similar crash in the last 10 days or so.. But if
>>> it rings any bells...
>>
>> We have just got another one:
>> panic: Bad link elm 0xfffff80cc3938360 prev->next != elm
>>
>> Matching disassembled code to C code, it seems that the crash is somewhere in
>> vm_object_terminate_pages (inlined into vm_object_terminate), probably in one of
>> TAILQ_REMOVE-s there:
>> 		if (p->queue != PQ_NONE) {
>> 			KASSERT(p->queue < PQ_COUNT, ("vm_object_terminate: "
>> 			    "page %p is not queued", p));
>> 			pq1 = vm_page_pagequeue(p);
>> 			if (pq != pq1) {
>> 				if (pq != NULL) {
>> 					vm_pagequeue_cnt_add(pq, dequeued);
>> 					vm_pagequeue_unlock(pq);
>> 				}
>> 				pq = pq1;
>> 				vm_pagequeue_lock(pq);
>> 				dequeued = 0;
>> 			}
>> 			p->queue = PQ_NONE;
>> 			TAILQ_REMOVE(&pq->pq_pl, p, plinks.q);
>> 			dequeued--;
>> 		}
>> 		if (vm_page_free_prep(p, true))
>> 			continue;
>> unlist:
>> 		TAILQ_REMOVE(&object->memq, p, listq);
>> 	}
>>
>>
>> Please note that this is the code before r332974 ("Improve VM page queue scalability").
>> I am not sure whether r332974 + r333256 would fix the problem or would just move
>> it to a different place.
>>
>> Does this ring a bell to anyone who tinkered with that part of the VM code recently?
> 
> This doesn't look familiar to me and I doubt that r332974 fixed the
> underlying problem, whatever it is.

I see.

>> Looking a little bit further, I think that object->memq somehow got corrupted.
>> memq contains just two elements and the reported element is not there.
> 
> Based on the debugging session, it would be interesting to know if there
> were any other threads somehow manipulating the (dead) object at the
> time of the panic.

I will check for this.

> Among the panics that you observed, is it the same application that is
> causing the crash in each case?

I have two crash dumps right now and in both cases it's sh exec-ing grep.
But I cannot imagine what could be so special about that.
Actually, I see that the shell ran a long pipeline with many grep-s in it, so
there were many exec-s and exits of grep, perhaps some of them concurrent.
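
For anyone following along: the "Bad link elm ... prev->next != elm" message
comes from the sanity check that TAILQ_REMOVE performs in INVARIANTS kernels,
which verifies that the slot the element's tqe_prev points at still points back
at the element. Below is a minimal userland sketch of that condition. It only
illustrates how the check trips on an element that is no longer linked; it is
not a claim about the root cause here. The names struct page, prev_link_ok and
stale_link_demo are made up for the illustration, and restoring the saved
linkage is just a way to simulate a second code path that still believes the
page is on the list.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical stand-in for struct vm_page; only the list linkage matters. */
struct page {
	int id;
	TAILQ_ENTRY(page) listq;
};
TAILQ_HEAD(pagelist, page);

/*
 * Same condition as the "prev->next != elm" check: the pointer slot that
 * tqe_prev refers to must still point at elm.
 */
static bool
prev_link_ok(struct page *elm)
{
	return (*elm->listq.tqe_prev == elm);
}

/* Count the elements currently reachable from the list head. */
static int
list_length(struct pagelist *head)
{
	struct page *p;
	int n = 0;

	TAILQ_FOREACH(p, head, listq)
		n++;
	return (n);
}

/*
 * Build a three-element list, remove the middle element, then pretend that
 * another path still holds the element with its old linkage (assumption:
 * this is only one of several ways the list could end up inconsistent).
 */
static void
stale_link_demo(bool *ok_before, bool *ok_after, int *remaining)
{
	struct pagelist memq;
	struct page a = { .id = 1 }, b = { .id = 2 }, c = { .id = 3 };
	struct page *saved_next, **saved_prev;

	TAILQ_INIT(&memq);
	TAILQ_INSERT_TAIL(&memq, &a, listq);
	TAILQ_INSERT_TAIL(&memq, &b, listq);
	TAILQ_INSERT_TAIL(&memq, &c, listq);

	*ok_before = prev_link_ok(&b);

	/* Remember b's linkage, unlink it, then restore the stale linkage. */
	saved_next = b.listq.tqe_next;
	saved_prev = b.listq.tqe_prev;
	TAILQ_REMOVE(&memq, &b, listq);
	b.listq.tqe_next = saved_next;
	b.listq.tqe_prev = saved_prev;

	/* a's next pointer now refers to c, so the check on b fails. */
	*ok_after = prev_link_ok(&b);
	*remaining = list_length(&memq);
}
```

Calling stale_link_demo() leaves the list with two elements and the check on
the removed one failing, which has the same shape as what the dump shows: a
second TAILQ_REMOVE on such an element is what would panic in an INVARIANTS
kernel.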

-- 
Andriy Gapon