Re: Any success stories for HAST + ZFS?

From: Freddie Cash <fjwcash_at_gmail.com> Date: Mon, 28 Mar 2011 13:06:46 -0700

On Sun, Mar 27, 2011 at 5:16 AM, Mikolaj Golub <trociny_at_freebsd.org> wrote:
> On Sat, 26 Mar 2011 10:52:08 -0700 Freddie Cash wrote:
>
>  FC> hastd backtrace is here:
>  FC> http://www.sd73.bc.ca/downloads/crash/hast-backtrace.png
>
> It is not a hastd crash, but a kernel crash triggered by hastd process.

Ah, interesting.

> I am not sure I got the same crash as you but apparently the race is possible
> in g_gate on device creation.

About 95% of the crashes happened while creating the /dev/hast/*
devices (switching to the primary role).  Most of them occurred when
running "hastctl role primary all", but they would occasionally
happen when switching each resource manually.  Creating the
resources by hand, one every 2 seconds or so, would usually create
them all without crashing.

The other 5% of the crashes occurred either when importing the ZFS
pool, or when running multiple parallel rsyncs to the pool.  hastd
was always shown as the last running process in the backtrace
onscreen.

> I got the following crash starting many hast providers simultaneously:
>
> fault virtual address   = 0x0
>
> #8  0xc0c11adc in calltrap () at /usr/src/sys/i386/i386/exception.s:168
> #9  0xc086ac6b in g_gate_ioctl (dev=0xc6a24300, cmd=3374345472,
>    addr=0xc9fec000 "\002", flags=3, td=0xc7ff0b80)
>    at /usr/src/sys/geom/gate/g_gate.c:410
> #10 0xc0853c5b in devfs_ioctl_f (fp=0xc9b9e310, com=3374345472,
>    data=0xc9fec000, cred=0xc8c9c200, td=0xc7ff0b80)
>    at /usr/src/sys/fs/devfs/devfs_vnops.c:678
> #11 0xc09210cd in kern_ioctl (td=0xc7ff0b80, fd=3, com=3374345472,
>    data=0xc9fec000 "\002") at file.h:262
> #12 0xc0921254 in ioctl (td=0xc7ff0b80, uap=0xf5edbcec)
>    at /usr/src/sys/kern/sys_generic.c:679
> #13 0xc0916616 in syscallenter (td=0xc7ff0b80, sa=0xf5edbce4)
>    at /usr/src/sys/kern/subr_trap.c:315
> #14 0xc0c2b9ff in syscall (frame=0xf5edbd28)
>    at /usr/src/sys/i386/i386/trap.c:1086
> #15 0xc0c11b71 in Xint0x80_syscall ()
>    at /usr/src/sys/i386/i386/exception.s:266
>
> Or just creating many ggate devices simultaneously:
>
> for i in `jot 100`; do
>    ./ggiocreate $i&
> done
>
> ggiocreate.c is attached.
>
> In my case the kernel crashes in g_gate_create() when checking for name
> collisions in strcmp():
>
>        /* Check for name collision. */
>        for (unit = 0; unit < g_gate_maxunits; unit++) {
>                if (g_gate_units[unit] == NULL)
>                        continue;
>                if (strcmp(name, g_gate_units[unit]->sc_provider->name) != 0)
>                        continue;
>                mtx_unlock(&g_gate_units_lock);
>                mtx_destroy(&sc->sc_queue_mtx);
>                free(sc, M_GATE);
>                return (EEXIST);
>        }
>
> I think the issue is the following. When preparing sc we take
> g_gate_units_lock, check for name collision, fill sc fields except
> sc->sc_provider, and register sc in g_gate_units[unit]. sc_provider is
> filled later, after g_gate_units_lock is released. So the following
> scenario is possible:
>
> 1) Thread A registers sc in g_gate_units[unit] with
> g_gate_units[unit]->sc_provider still null and releases g_gate_units_lock.
>
> 2) Thread B traverses g_gate_units[] when checking for name collision and
> crashes accessing g_gate_units[unit]->sc_provider->name.
>
> The attached patch fixes the issue in my case.
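
For readers without the attachment, the race described above can be
modelled in userspace.  This is only a sketch of one possible fix and is
not necessarily what the attached patch does: it assumes the name is
copied into the softc itself before the softc is published, so the
collision check never dereferences sc_provider.  MAXUNITS, sc_name,
gate_register() and gate_attach_provider() are hypothetical names, and a
pthread mutex stands in for g_gate_units_lock.

```c
/* Userspace model of the collision check; NOT the FreeBSD kernel code. */
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

#define MAXUNITS 8			/* hypothetical stand-in for g_gate_maxunits */

struct provider {
	char name[32];
};

struct softc {
	char sc_name[32];		/* hypothetical: set before publication */
	struct provider *sc_provider;	/* filled in later, outside the lock */
};

static struct softc *units[MAXUNITS];
static pthread_mutex_t units_lock = PTHREAD_MUTEX_INITIALIZER;

/* Step 1: check for a collision and publish sc, all under the lock. */
int
gate_register(const char *name, struct softc **out)
{
	struct softc *sc;
	int unit, free_unit = -1;

	assert(name != NULL && out != NULL);
	sc = calloc(1, sizeof(*sc));
	if (sc == NULL)
		return (ENOMEM);
	/* calloc() zeroed the buffer, so the copy stays NUL-terminated. */
	strncpy(sc->sc_name, name, sizeof(sc->sc_name) - 1);

	pthread_mutex_lock(&units_lock);
	for (unit = 0; unit < MAXUNITS; unit++) {
		if (units[unit] == NULL) {
			if (free_unit == -1)
				free_unit = unit;
			continue;
		}
		/* Compare against sc_name, never sc_provider->name. */
		if (strcmp(name, units[unit]->sc_name) == 0) {
			pthread_mutex_unlock(&units_lock);
			free(sc);
			return (EEXIST);
		}
	}
	if (free_unit == -1) {
		pthread_mutex_unlock(&units_lock);
		free(sc);
		return (EBUSY);
	}
	units[free_unit] = sc;		/* sc_provider is still NULL here */
	pthread_mutex_unlock(&units_lock);
	*out = sc;
	return (0);
}

/* Step 2: attach the provider later, after the lock has been dropped. */
int
gate_attach_provider(struct softc *sc)
{
	struct provider *pp;

	pp = calloc(1, sizeof(*pp));
	if (pp == NULL)
		return (ENOMEM);
	strncpy(pp->name, sc->sc_name, sizeof(pp->name) - 1);
	sc->sc_provider = pp;
	return (0);
}
```

With this layout, a second gate_register() call for the same name that
runs between step 1 and step 2 of the first caller returns EEXIST
instead of dereferencing a null sc_provider.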

Patch applied cleanly to 8-STABLE with ZFSv28 patch also applied.
Just to be safe, did a full buildworld/kernel cycle, running GENERIC
kernel.

So far, I have not been able to produce a crash in hastd, through
several reboots, switching from primary to secondary and back, and
just switching from primary to init and back.

So far, so good.

Now to see if I can reproduce any of the ZFS crashes I had earlier.

-- 
Freddie Cash
fjwcash_at_gmail.com