Re: expanding past 1 TB on amd64

From: Kurt Lidl <lidl_at_pix.net>
Date: Tue, 16 Jul 2013 10:08:04 -0400
> On Wed, Jun 19, 2013 at 1:32 AM, Chris Torek <chris.torek at gmail.com> wrote:
>
>> In src/sys/amd64/include/vmparam.h is this handy map:
>>
>>  * 0x0000000000000000 - 0x00007fffffffffff   user map
>>  * 0x0000800000000000 - 0xffff7fffffffffff   does not exist (hole)
>>  * 0xffff800000000000 - 0xffff804020100fff   recursive page table (512GB slot)
>>  * 0xffff804020101000 - 0xfffffdffffffffff   unused
>>  * 0xfffffe0000000000 - 0xfffffeffffffffff   1TB direct map
>>  * 0xffffff0000000000 - 0xffffff7fffffffff   unused
>>  * 0xffffff8000000000 - 0xffffffffffffffff   512GB kernel map
>>
>> showing that the system can deal with at most 1 TB of address space
>> (because of the direct map), using at most half of that for kernel
>> memory (less, really, due to the inevitable VM fragmentation).
>>
>> New boards are coming soonish that will have the ability to go
>> past that (24 DIMMs of 64 GB each = 1.5 TB).  Or, if some crazy
>> people :-) might want to use most of a 768 GB board (24 DIMMs of
>> 32 GB each, possible today although the price is kind of
>> staggering) as wired-down kernel memory, the 512 GB VM area is
>> already a problem.
>>
>> I have not wrapped my head around the amd64 pmap code but figured
>> I'd ask: what might need to change to support larger spaces?
>> Obviously NKPML4E in amd64/include/pmap.h, for the kernel start
>> address; and NDMPML4E for the direct map.  It looks like this
>> would adjust KERNBASE and the direct map appropriately.  But would
>> that suffice, or have I missed something?
>>
>> For that matter, if these are changed to make space for future
>> expansion, what would be a good expansion size?  Perhaps multiply
>> the sizes by 16?  (If memory doubles roughly every 18 months,
>> that should give room for at least 5 years.)
>>
>>
> Chris, Neel,
>
> The actual data that I've seen shows that DIMMs are doubling in size at
> about half that pace, about every three years.  For example, see
> http://users.ece.cmu.edu/~omutlu/pub/mutlu_memory-scaling_imw13_invited-talk.pdf,
> slide #8.  So, I think that a factor of 16 is a lot more than we'll need in
> the next five years.  I would suggest configuring the kernel virtual
> address space for 4 TB.  Once you go beyond 512 GB, 4 TB is the next
> "plateau" in terms of address translation cost.  At 4 TB all of the PML4
> entries for the kernel virtual address space will reside in the same L2
> cache line, so a page table walk on a TLB miss for an instruction fetch
> will effectively prefetch the PML4 entry for the kernel heap and vice versa.
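
To make the arithmetic behind that concrete, here is a rough sketch (the
program and its names are mine, not anything from pmap.h), assuming the
standard 4-level paging figures: one PML4 entry is 8 bytes and maps 512 GB,
and a 64-byte cache line:

#include <stdio.h>

#define PML4E_BYTES     8ULL                    /* one PML4 entry is 8 bytes   */
#define CACHELINE_BYTES 64ULL                   /* typical cache line size     */
#define PML4_SLOT_BYTES (1ULL << 39)            /* one PML4 entry maps 512 GB  */

static void
show(const char *what, unsigned long long bytes)
{
        unsigned long long slots = bytes / PML4_SLOT_BYTES;

        printf("%-20s %4llu GB -> %llu PML4 entries (%llu bytes of PML4)\n",
            what, bytes >> 30, slots, slots * PML4E_BYTES);
}

int
main(void)
{
        show("current direct map", 1ULL << 40);         /* 1 TB    */
        show("current kernel map", 1ULL << 39);         /* 512 GB  */
        show("4 TB direct map", 4ULL << 40);

        /* 64 / 8 = 8 entries per cache line, and 8 * 512 GB = 4 TB. */
        printf("PML4 entries per cache line: %llu (covers %llu GB)\n",
            CACHELINE_BYTES / PML4E_BYTES,
            CACHELINE_BYTES / PML4E_BYTES * (PML4_SLOT_BYTES >> 30));
        return (0);
}

With those assumptions, today's map spends 2 PML4 slots on the direct map
and 1 on the kernel map, and a 4 TB layout fills exactly one cache line of
PML4 entries, which is the plateau described above.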

The largest commodity motherboards that are shipping today support
24 DIMMs, at a max size of 32GB per DIMM.  That's 768GB, right now.
(So FreeBSD is already "out of bits" in terms of supporting current
shipping hardware.) The Haswell line of CPUs is widely reported to
support DIMMs twice as large, and it's due in September.  That would
let the systems of late 2013 hold up to 1536GB of memory.

Using your figure of doubling in 3 years, we'll see 3072GB systems by
~2016.  And in ~2019, we'll see 6TB systems and will finally need more
than a single cache line to hold all the PML4 entries.
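
Put differently (my back-of-the-envelope numbers, again assuming 8-byte
PML4 entries, a 64-byte cache line, and a direct map sized to physical
memory): 1536GB and 3072GB still fit within a single cache line of PML4
entries, while 6TB takes 12 entries, i.e. two lines.  A quick sketch:

#include <stdio.h>

int
main(void)
{
        /* Projected memory sizes in GB, per the doubling estimate above. */
        const unsigned long long sizes_gb[] = { 1536, 3072, 6144 };
        const unsigned long long slot_gb = 512;       /* one PML4 entry maps 512 GB   */
        const unsigned long long per_line = 64 / 8;   /* 8-byte entries, 64-byte line */
        int i;

        for (i = 0; i < 3; i++) {
                unsigned long long entries = (sizes_gb[i] + slot_gb - 1) / slot_gb;

                printf("%4llu GB -> %2llu PML4 entries -> %llu cache line(s)\n",
                    sizes_gb[i], entries, (entries + per_line - 1) / per_line);
        }
        return (0);
}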

Of course, that's speculating furiously about two generations out, and
assumes keeping the current memory architecture / board design
constraints.

-Kurt