Re: Fix for sys_munlock(2) with racct

From: Alan Cox <alc_at_rice.edu>
Date: Tue, 30 Jul 2013 10:40:45 -0700
On Jul 21, 2013, at 2:50 PM, Jeremie Le Hen wrote:

> On Sat, Jul 20, 2013 at 08:33:35PM -0700, Alan Cox wrote:
>> 
>> On Jul 20, 2013, at 4:22 AM, Jeremie Le Hen wrote:
>> 
>>> Hi Edward, Alan,
>>> 
>>> I plan to commit the following patch:
>>> http://people.freebsd.org/~jlh/racct_munlock.diff
>>> 
>>> This solves the following panic:
>>> 
>>> panic: racct_sub: freeing 301989888 of resource 5, which is more than allocated 73728 for pwsafe (pid 4177)
>>> 
>>> The problem is that the racct code in sys_munlock() trusts too much the
>>> user's input.  vm_map_unwire_count() now returns how much memory has
>>> really been unwired.
>>> 
>>> Any objection?
>> 
>> 
>> Can you elaborate on what you mean by "sys_munlock() trusts too much
>> the user's input."   munlock(2) is supposed to return ENOMEM if any
>> addresses within the specified range are not backed by valid mappings.
>> (Valid mappings with PROT_NONE access are something of a gray area
>> here.)  However, it is not error for a to call munlock() on a range
>> that isn't locked, or to call it a second, third, etc. time on the
>> same range.  Is that what is provoking your panic?
> 
> I'm using sysutils/pwsafe and it seems to mlock(2) 18 pages (end - start
> = 73728 bytes) but then munlock(2) 73728 pages :-). vm_map_unwire()
> seems to do the right thing with those buggy boundaries, but then the
> racct code assumes that those boundaries were correct as long as
> vm_map_unwire() returned successfully.  This is what causes the
> assertion failure.
> 

It sounds like the application is calling munlock(2) on valid but unwired memory.  That's not considered an error under the various Unix standards that specify the behavior of munlock(2).  It's sloppy programming but not buggy programming. 

> 
>> By the way, sys_mlock() uses a simpler approach to deal with a similar
>> situation:
>> 
>>        error = vm_map_wire(map, start, end, 
>>            VM_MAP_WIRE_USER | VM_MAP_WIRE_NOHOLES);
>> #ifdef RACCT
>>        if (error != KERN_SUCCESS) {
>>                PROC_LOCK(proc);
>>                racct_set(proc, RACCT_MEMLOCK,
>>                    ptoa(pmap_wired_count(map->pmap)));
>>                PROC_UNLOCK(proc);
>>        }
>> #endif
>> 
>> However, the code in sys_mlock() for maintaining RACCT_MEMLOCK,
>> including the above snippet, is race-y.  Two simultaneous callers to
>> sys_mlock() will produce incorrect results.  (I haven't looked at
>> sys_mlockall() or vm_map_growstack().)
> 
> Ok, I will commit that as the first bandaid then.
> 
> 
>> Also, a wired mapping can be destroyed by calling munmap(2) without
>> first calling munlock(2), in which case, RACCT_MEMLOCK will be
>> incorrect.
> 
> So I think the right way to tackle this is to handle racct in the vm
> layer rather than at the syscall layer.
> 

The VM system already maintains counters equivalent to RACCT_VMEM and RACCT_MEMLOCK.  They are "map->size" and "pmap_wired_count(map->pmap)".  Instead of maintaining duplicate counters, could the resource accounting framework be extended to support callbacks to obtain a value when it's actually needed?

> vm_fault_{wire,unwire}() look like natural places for this, but racct
> functions require the struct proc to be locked.  Any idea how to handle
> this?
> 
> -- 
> Jeremie Le Hen
> 
> Scientists say the world is made up of Protons, Neutrons and Electrons.
> They forgot to mention Morons.
> 
Received on Tue Jul 30 2013 - 15:41:07 UTC

This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:40:39 UTC