Re: deadlock between g_event and a thread on removing a device.

From: Konstantin Belousov <kostikbel_at_gmail.com> Date: Wed, 23 Jan 2013 11:54:14 +0200 · This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:40:34 UTC

On Fri, Jan 18, 2013 at 02:45:38PM +0900, Kohji Okuno wrote:
> Hi,
> 
> When I removed a device (ex. /dev/da0), I have encounterd a
> dead-lock between ``g_event'' thread and a thread that is opening
> device file (I call this thread as A).
> 
> Would you refer the following?
> 
> When the device is removed between dev_refthread() and g_dev_open(),
> thread A incremented dev->si_threadcount, but can't acquire
> topology_lock.
> 
> On the other hand, g_event is waiting to set dev->si_threadcount to 0
> with topology_lock.
> 
> Regards,
>  Kohji Okuno
> 
> 
> <<< Thread A >>>
> ...
> devfs_open()
> {
>   ...
>   dsw = dev_refthread(dev, &ref); <= increment dev->si_threadcount
>   ...
>   error = dsw->d_open(...);       <= call g_dev_open()
>   ...
>   dev_relthread(dev, ref);        <= decrement dev->si_threadcount
> }
> 
> g_dev_open()
> {
>   ...
>   g_topology_lock();              <= Thread A couldn't acquire 
>   ...                                topology_lock.
> }
> 
> <<< g_event >>>
> g_run_events()
> {
>    ...
>    g_topology_lock();             <= g_event acuired topology_lock here.
>    ...
>    one_event()
>    ...
> }
> 
> one_event()
> g_orphan_register()
> g_dev_orphan()
> destroy_dev()
> destroy_dev()
> destroy_devl()
> {
>   ...
>   while (dev->si_threadcount != 0) { <= this count was incremented by Thread A
>     /* Use unique dummy wait ident */
>     msleep(&csw, &devmtx, PRIBIO, "devdrn", hz / 10);
>   }
>   ...
> }

Yes, you are absolutely right.

I believe there were some patches floating around which changed the
destroy_dev() call in the g_dev_orphan() to destroy_dev_sched(). I do
not remember who was the author.

My reply was that naive substitution of the destroy_dev() to
destroy_dev_sched() is racy, because some requests might still come
in after the call to destroy_dev_sched(). Despite destroy_dev_sched()
setting the CDP_SCHED_DTR flag on the devfs node, some thread might
already entered the cdevsw method. I do not believe that there was
further progress there.