On Tue, Oct 22, 2019 at 02:48:56PM +0300, Andriy Gapon wrote:
> On 22/10/2019 13:44, Konstantin Belousov wrote:
> > On Tue, Oct 22, 2019 at 01:08:59PM +0300, Andriy Gapon wrote:
> >>
> >> We observe a problem that happens very rarely (about once a month across many
> >> test machines). The problem is that a thread remain in sleepq_timedwait() even
> >> after its timeout expires. The thread's td_slpcallout looks like the callout
> >> has fired. But the thread's state looks like it was never notified.
> >> E.g.:
> >> (kgdb) p td->td_slpcallout
> >> $1 = {c_links = {le = {le_next = 0xfffff800108e6470, le_prev =
> >> 0xfffffe0000be6ea8}, sle = {sle_next = 0xfffff800108e6470}, tqe = {tqe_next =
> >> 0xfffff800108e6470, tqe_prev = 0xfffffe0000be6ea8}}, c_time = 160957479343159,
> >> c_precision = 268435450, c_arg = 0xfffff80184602000, c_func =
> >> 0xffffffff807481d0 <sleepq_timeout>, c_lock = 0x0, c_flags = 2, c_iflags = 272,
> >> c_cpu = 6, c_exec_time = 160957506517070} [*]
> >> (kgdb) p/x td->td_flags
> >> $5 = 0x80000004
> > What is the bit 31 in your flags ? FreeBSD does not use the bit.
>
> It's TDF_NOSWAP, a local addition.
> We use it to prohibit full process swapout (I guess that means kernel stacks).
>
> >> (kgdb) p td->td_sqqueue
> >> $8 = 0
> >> (kgdb) p td->td_sleepqueue
> >> $9 = (struct sleepqueue *) 0x0
> >> (kgdb) p td->td_wchan
> >> $10 = (void *) 0xfffff802b990df38
> >>
> >>
> >> Has anyone seen anything like this problem?
> > Yes, but it was very long time ago. See r303426.
>
> Yeah, we are based off r329000 plus a bunch of merges for various fixes.
> One thing I forgot to mention is that it seems to happen only on VMware guests,
> but maybe it's only because we have many more virtual test boxes than we have
> physical ones.
> One thing I suspected was that binuptime() could somehow jump backwards...

Do you use any of suspend/migration?
Perhaps record sbinuptime() in the struct thread in sleepq_timeout() and
keep the original value of td_sleeptimo around to see what happened.

>
> >> Any advice on how to diagnose it?
> >>
> >> Thanks!
> >>
> >> P.S.
> >> c_exec_time is our addition, we set this field right before firing a callback
> >> and we reset it to zero when a callout is (re-)scheduled.
> >
>
>
> --
> Andriy Gapon

Received on Tue Oct 22 2019 - 11:16:41 UTC
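
The diagnostic suggested above (record sbinuptime() in sleepq_timeout() and keep the original td_sleeptimo) could be prototyped roughly as follows. This is only a user-space sketch, not kernel code: struct slp_debug, its fields, and both helper functions are hypothetical names invented for illustration. It also assumes that the recorded values are sbintime_t uptimes in 32.32 fixed point, which is what sbinuptime() returns.

```c
#include <stdbool.h>
#include <stdint.h>

/* 32.32 fixed-point time, mirroring FreeBSD's sbintime_t. */
typedef int64_t sbintime_t;

/*
 * Hypothetical per-thread debug record. In a real patch these would be
 * two extra fields in struct thread; the names here are made up.
 */
struct slp_debug {
	sbintime_t deadline;	/* copy of td_sleeptimo taken when the sleep is armed */
	sbintime_t fired_at;	/* sbinuptime() sampled on entry to sleepq_timeout() */
};

/*
 * Convert a short sbintime_t interval to whole microseconds.
 * Only meant for small intervals: the multiplication overflows for
 * values beyond roughly 2^43.
 */
static int64_t
sbt_to_us(sbintime_t sbt)
{
	return (sbt * 1000000 / ((int64_t)1 << 32));
}

/*
 * True when the handler observed an uptime earlier than the deadline it
 * was armed for, i.e. the clock appeared to step backwards or the
 * callout fired early.
 */
static bool
slp_fired_before_deadline(const struct slp_debug *d)
{
	return (d->deadline != 0 && d->fired_at < d->deadline);
}
```

Fed with the values from the kgdb dump (c_time = 160957479343159 as the deadline and c_exec_time = 160957506517070 as the firing time, assuming the local c_exec_time field is also an sbintime_t uptime), the check returns false and the gap works out to about 6.3 ms. That would say the callout itself fired after its deadline, so a recorded sbinuptime() value from inside sleepq_timeout() is what would show whether the handler saw an earlier time.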