On 11/17/11 11:06, Kostik Belousov wrote: > On Thu, Nov 17, 2011 at 10:40:58AM +0200, Alexander Motin wrote: >> On 11/17/11 10:15, Kostik Belousov wrote: >>> On Thu, Nov 17, 2011 at 01:07:38AM +0200, Alexander Motin wrote: >>>> On 17.11.2011 00:21, Andriy Gapon wrote: >>>>> on 16/11/2011 21:27 Fabian Keil said the following: >>>>>> Kostik Belousov<kostikbel_at_gmail.com> wrote: >>>>>> >>>>>>> I was tricked into finishing the work by Andrey Gapon, who developed >>>>>>> the patch to reliably stop other processors on panic. The patch >>>>>>> greatly improves the chances of getting dump on panic on SMP host. >>>>>> >>>>>> I tested the patch trying to get a dump (from the debugger) for >>>>>> kern/162036, which currently results in the double fault reported in: >>>>>> http://lists.freebsd.org/pipermail/freebsd-current/2011-September/027766.html >>>>>> >>>>>> It didn't help, but also didn't make anything worse. >>>>>> >>>>>> Fabian >>>>> >>>>> The mi_switch recursion looks very familiar to me: >>>>> mi_switch() at mi_switch+0x270 >>>>> critical_exit() at critical_exit+0x9b >>>>> spinlock_exit() at spinlock_exit+0x17 >>>>> mi_switch() at mi_switch+0x275 >>>>> critical_exit() at critical_exit+0x9b >>>>> spinlock_exit() at spinlock_exit+0x17 >>>>> [several pages of the previous three lines skipped] >>>>> mi_switch() at mi_switch+0x275 >>>>> critical_exit() at critical_exit+0x9b >>>>> spinlock_exit() at spinlock_exit+0x17 >>>>> intr_even_schedule_thread() at intr_event_schedule_thread+0xbb >>>>> ahci_end_transaction() at ahci_end_transaction+0x398 >>>>> ahci_ch_intr() at ahci_ch_intr+0x2b5 >>>>> ahcipoll() at ahcipoll+0x15 >>>>> xpt_polled_action() at xpt_polled_action+0xf7 >>>>> >>>>> In fact I once discussed with jhb this recursion triggered from a different >>>>> place. To quote myself: >>>>> <avg> spinlock_exit -> critical_exit -> mi_switch -> kdb_switch -> >>>>> thread_unlock -> spinlock_exit -> critical_exit -> mi_switch -> ... >>>>> <avg> in the kdb context >>>>> <avg> this issue seems to be triggered by td_owepreempt being true at >>>>> the time >>>>> kdb is entered >>>>> <avg> and there of course has to be an initial spinlock_exit call >>>>> somewhere >>>>> <avg> in my case it's because of usb keyboard >>>>> <avg> I wonder if it would make sense to clear td_owepreempt right >>>>> before >>>>> calling kdb_switch in mi_switch >>>>> <avg> instead of in sched_switch() >>>>> <avg> clearing td_owepreempt seems like a scheduler-independent >>>>> operation to me >>>>> <avg> or is it better to just skip locking in usb when kdb_active is set >>>>> <avg> ? >>>>> >>>>> The workaround described above should work in this case. >>>>> Another possibility is to pessimize mtx_unlock_spin() implementations to >>>>> check >>>>> SCHEDULER_STOPPED() and to bypass any further actions in that case. But >>>>> that >>>>> would add unnecessary overhead to the sunny day code paths. >>>>> >>>>> Going further up the stack one can come up with the following proposals: >>>>> - check SCHEDULER_STOPPED() swi_sched() and return early >>>>> - do not call swi_sched() from xpt_done() if we somehow know that we are >>>>> in a >>>>> polling mode >>>> >>>> There is no flag in CAM now to indicate polling mode, but if needed, it >>>> should not be difficult to add one and not call swi_sched(). >>> >>> I have the following change for eons on my test boxes. Without it, >>> I simply cannot get _any_ dump. >>> >>> diff --git a/sys/cam/cam_xpt.c b/sys/cam/cam_xpt.c >>> index 10b89c7..a38e42f 100644 >>> --- a/sys/cam/cam_xpt.c >>> +++ b/sys/cam/cam_xpt.c >>> _at__at_ -4230,7 +4230,7 _at__at_ xpt_done(union ccb *done_ccb) >>> TAILQ_INSERT_TAIL(&cam_simq, sim, links); >>> mtx_unlock(&cam_simq_lock); >>> sim->flags |= CAM_SIM_ON_DONEQ; >>> - if (first) >>> + if (first && panicstr == NULL) >>> swi_sched(cambio_ih, 0); >>> } >>> } >> >> That should be OK for kernel dumping. I was thinking about CAM abusing >> polling not only for dumping. But looking on cases where it does it now, >> may be it is better to rewrite them instead. > > So, should I interpret your response as 'Reviewed by" ? It feels somehow dirty to me. I don't like these global variables. If you consider it is fine, proceed, I see no much harm. But if not, I can add polling flag to the CAM. Flip a coin for me. :) -- Alexander MotinReceived on Thu Nov 17 2011 - 08:29:08 UTC
This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:40:20 UTC