Re: [RFC] kern/kern_timeout.c rewrite in progress

From: Bryan Drewery <bdrewery_at_FreeBSD.org> Date: Thu, 22 Jan 2015 16:41:41 -0600

On 1/22/2015 4:27 AM, Konstantin Belousov wrote:
> On Thu, Jan 22, 2015 at 11:16:41AM +0100, Hans Petter Selasky wrote:
>> On 01/20/15 11:47, Slawa Olhovchenkov wrote:
>>> On Tue, Jan 20, 2015 at 08:29:47AM +0100, Hans Petter Selasky wrote:
>>>
>>>> On 01/17/15 23:18, Hans Petter Selasky wrote:
>>>>> On 01/17/15 20:11, Jason Wolfe wrote:
>>>>>>
>>>>>> HPS,
>>>>>>
>>>>>> Just to give a quick status update, this patch has most certainly
>>>>>> resolved our spin lock held too long panics on stable/10.
>>>>>>
>>>>>> Thank you to JHB for spending some time digging into the issue and
>>>>>> leading us to td_slpcallout as the culprit, and HPS for your rewrite.
>>>>>> I had heard rumors of others being affected by similar issues, so this
>>>>>> seems like a fine candidate for an MFC if possible.
>>>>>>
>>>>>> Jason
>>>>>>
>>>>>
>>>>> Hi Jason,
>>>>>
>>>>> I'm glad to hear that my patch has resolved your issue and I'm happy we
>>>>> now have a more stable system.
>>>>>
>>>>> It was actually a co-worker who wrote some bad code, which I
>>>>> started debugging, which then led me to look at the callout subsystem.
>>>>> One bug kills the other ;-)
>>>>>
>>>>> I'm planning an MFC to 10-stable - yes, and will possibly add the
>>>>> _callout_stop_safe() function to not break binary compatibility with
>>>>> existing drivers as part of the MFC.
>>>>>
>>>>> --HPS
>>>>
>>>> Hi,
>>>>
>>>> Here is a followup patch for the TCP stack like I mentioned in the
>>>> beginning of the work done on the callout subsystem:
>>>>
>>>> https://reviews.freebsd.org/D1563
>>>>
>>>> If someone has a setup for massive TCP testing please give it a spin.
>>>
>>> I have such a setup on 10.1 (with r261906 applied).
>>
>> FYI:
>>
>> r277213 is going to be pulled out of -current within a few hours at
>> most, because developers need more time to review patches in
>> surrounding areas, such as the TCP stack, that restore the distribution
>> of callouts across multiple CPUs when using MPSAFE callouts and so avoid
>> congestion in the TCP stack.
> 
> No, r277213 was requested to be reverted not due to TCP issues.
> 
> The main complaint is that you left an indefinite number of cases degraded,
> and there is no analysis of each such case, nor even a list of the cases
> that need to be fixed (or an argument for why a consumer of the callout KPI
> could be left as is).
> 
> Just providing a fix for one place is not enough.

I have a similar concern about out-of-tree work. It would be surprising
for a vendor or module developer to find their performance degraded if
they missed accounting for this change. At a minimum, an UPDATING entry
should be added explaining the change and what consumers must do.

-- 
Regards,
Bryan Drewery