Re: [PATCH 0/3] mm: use memmap_on_memory semantics for dax/kmem

linux-mm.kvack.org archive mirror

From: Jeff Moyer <jmoyer@redhat.com>
To: David Hildenbrand <david@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>,
	 Vishal Verma <vishal.l.verma@intel.com>,
	 "Rafael J. Wysocki" <rafael@kernel.org>,
	 Len Brown <lenb@kernel.org>,
	 Andrew Morton <akpm@linux-foundation.org>,
	 Oscar Salvador <osalvador@suse.de>,
	 Dave Jiang <dave.jiang@intel.com>,
	 linux-acpi@vger.kernel.org,  linux-kernel@vger.kernel.org,
	 linux-mm@kvack.org,  nvdimm@lists.linux.dev,
	 linux-cxl@vger.kernel.org,  Huang Ying <ying.huang@intel.com>,
	 Dave Hansen <dave.hansen@linux.intel.com>
Subject: Re: [PATCH 0/3] mm: use memmap_on_memory semantics for dax/kmem
Date: Fri, 14 Jul 2023 09:54:02 -0400
Message-ID: <x491qha7g5h.fsf@segfault.boston.devel.redhat.com>
In-Reply-To: <cfeecd92-3aa4-a07d-b71a-793531785692@redhat.com> (David Hildenbrand's message of "Fri, 14 Jul 2023 10:35:47 +0200")

David Hildenbrand <david@redhat.com> writes:

> On 13.07.23 21:12, Jeff Moyer wrote:
>> David Hildenbrand <david@redhat.com> writes:
>>
>>> On 16.06.23 00:00, Vishal Verma wrote:
>>>> The dax/kmem driver can potentially hot-add large amounts of memory
>>>> originating from CXL memory expanders, or NVDIMMs, or other 'device
>>>> memories'. There is a chance there isn't enough regular system memory
>>>> available to fit the memmap for this new memory. It's therefore
>>>> desirable, if all other conditions are met, for the kmem managed memory
>>>> to place its memmap on the newly added memory itself.
>>>>
>>>> Arrange for this by first allowing for a module parameter override for
>>>> the mhp_supports_memmap_on_memory() test using a flag, adjusting the
>>>> only other caller of this interface in drivers/acpi/acpi_memoryhotplug.c,
>>>> exporting the symbol so it can be called by kmem.c, and finally changing
>>>> the kmem driver to add_memory() in chunks of memory_block_size_bytes().
>>>
>>> 1) Why is the override a requirement here? Just let the admin
>>> configure it and then add conditional support for kmem.
>>>
>>> 2) I recall that there are cases where we don't want the memmap to
>>> land on slow memory (which online_movable would achieve). Just imagine
>>> the slow PMEM case. So this might need another configuration knob on
>>> the kmem side.
>>
>> From my memory, the case where you don't want the memmap to land on
>> *persistent memory* is when the device is small (such as NVDIMM-N), and
>> you want to reserve as much space as possible for the application data.
>> This has nothing to do with the speed of access.
>
> Now that you mention it, I also do remember the origin of the altmap --
> to achieve exactly that: place the memmap on the device.
>
> commit 4b94ffdc4163bae1ec73b6e977ffb7a7da3d06d3
> Author: Dan Williams <dan.j.williams@intel.com>
> Date:   Fri Jan 15 16:56:22 2016 -0800
>
>     x86, mm: introduce vmem_altmap to augment vmemmap_populate()
>
>     In support of providing struct page for large persistent memory
>     capacities, use struct vmem_altmap to change the default policy for
>     allocating memory for the memmap array.  The default vmemmap_populate()
>     allocates page table storage area from the page allocator.  Given
>     persistent memory capacities relative to DRAM it may not be feasible to
>     store the memmap in 'System Memory'.  Instead vmem_altmap represents
>     pre-allocated "device pages" to satisfy vmemmap_alloc_block_buf()
>     requests.
>
> In PFN_MODE_PMEM (and only then), we use the altmap (don't see a way to
> configure it).

Configuration is done at pmem namespace creation time.  The metadata for
the namespace indicates where the memmap resides.  See the
ndctl-create-namespace man page:

       -M, --map=
           A pmem namespace in "fsdax" or "devdax" mode requires allocation of
           per-page metadata. The allocation can be drawn from either:

           ·   "mem": typical system memory

           ·   "dev": persistent memory reserved from the namespace

                   Given relative capacities of "Persistent Memory" to "System
                   RAM" the allocation defaults to reserving space out of the
                   namespace directly ("--map=dev"). The overhead is 64-bytes per
                   4K (16GB per 1TB) on x86.
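
As a quick sanity check on the "16GB per 1TB" figure, here is a minimal sketch of the arithmetic. It assumes only the numbers quoted in the man page (64 bytes of per-page metadata for each 4K page on x86); the function and constant names are made up for illustration and are not kernel or ndctl interfaces.

```python
# Assumptions taken from the man page excerpt: 64 bytes of per-page
# metadata for every 4 KiB page on x86. All names here are illustrative.
STRUCT_PAGE_BYTES = 64
PAGE_BYTES = 4 * 1024

GIB = 1024 ** 3
TIB = 1024 ** 4


def memmap_overhead_bytes(capacity_bytes: int) -> int:
    """Return the bytes of per-page metadata needed to describe capacity_bytes."""
    pages = capacity_bytes // PAGE_BYTES
    return pages * STRUCT_PAGE_BYTES


overhead = memmap_overhead_bytes(1 * TIB)
print(overhead // GIB, "GiB per TiB")        # 16 GiB per TiB
print(f"{overhead / TIB:.4%} of capacity")   # 1.5625% of capacity
```

The ratio is simply 64/4096, so the memmap costs 1/64 of whatever capacity it describes, wherever it ends up being stored.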

> BUT that case is completely different from the "System RAM" mode. The memmap
> of an NVDIMM in pmem mode is barely used by core-mm (i.e., not the buddy).

Right.  (btw, I don't think system ram mode existed back then.)

> In comparison, if the buddy and everybody else works on the memmap in
> "System RAM", it's much more significant if that resides on slow memory.

Agreed.

> Looking at
>
> commit 9b6e63cbf85b89b2dbffa4955dbf2df8250e5375
> Author: Michal Hocko <mhocko@suse.com>
> Date:   Tue Oct 3 16:16:19 2017 -0700
>
>     mm, page_alloc: add scheduling point to memmap_init_zone
>
>     memmap_init_zone gets a pfn range to initialize and it can be really
>     large resulting in a soft lockup on non-preemptible kernels
>
>       NMI watchdog: BUG: soft lockup - CPU#31 stuck for 23s! [kworker/u642:5:1720]
>       [...]
>       task: ffff88ecd7e902c0 ti: ffff88eca4e50000 task.ti: ffff88eca4e50000
>       RIP: move_pfn_range_to_zone+0x185/0x1d0
>       [...]
>       Call Trace:
>         devm_memremap_pages+0x2c7/0x430
>         pmem_attach_disk+0x2fd/0x3f0 [nd_pmem]
>         nvdimm_bus_probe+0x64/0x110 [libnvdimm]
>
>
> It's hard to tell if that was only required due to the memmap for these devices
> being that large, or also partially because access to the memmap is slow
> enough that it makes a real difference.

I believe the main driver was the size.  At the time, Intel was
advertising 3TiB/socket for pmem.  I can't remember the exact DRAM
configuration sizes from the time.

> I recall that we're also often using ZONE_MOVABLE on such slow memory
> to not end up placing other kernel data structures on there: especially,
> user space page tables as I've been told.

Part of the issue was preserving the media.  The page structure gets
lots of updates, and that could cause premature wear.

> @Dan, any insight on the performance aspects when placing the memmap on
> (slow) memory and having that memory be consumed by the buddy where we frequently
> operate on the memmap?

I'm glad you're asking these questions.  We definitely want to make sure
we don't conflate requirements based on some particular
technology/implementation.  Also, I wouldn't make any assumptions about
the performance of CXL devices.  As I understand it, there could be a
broad spectrum of performance profiles.

And now Dan can correct anything I got wrong.  ;-)

Cheers,
Jeff


Thread overview: 27+ messages
2023-06-15 22:00 Vishal Verma
2023-06-15 22:00 ` [PATCH 1/3] mm/memory_hotplug: Allow an override for the memmap_on_memory param Vishal Verma
2023-06-16  6:35   ` Huang, Ying
2023-06-16  7:46   ` David Hildenbrand
2023-06-22 13:37     ` Jonathan Cameron
2023-06-23  8:40   ` Aneesh Kumar K.V
2023-06-23 12:35     ` David Hildenbrand
2023-06-15 22:00 ` [PATCH 2/3] mm/memory_hotplug: Export symbol mhp_supports_memmap_on_memory() Vishal Verma
2023-06-16  7:47   ` David Hildenbrand
2023-06-15 22:00 ` [PATCH 3/3] dax/kmem: Always enroll hotplugged memory for memmap_on_memory Vishal Verma
2023-06-16  6:42   ` Huang, Ying
2023-06-16  7:54   ` David Hildenbrand
2023-07-11 14:30     ` Aneesh Kumar K.V
2023-07-11 15:21       ` David Hildenbrand
2023-07-13  6:45         ` Verma, Vishal L
2023-07-13  7:23           ` David Hildenbrand
2023-07-13 15:15             ` Verma, Vishal L
2023-07-13 15:23               ` David Hildenbrand
2023-07-13 15:40                 ` Verma, Vishal L
2023-07-13 15:43                   ` David Hildenbrand
2023-06-20 13:14   ` Tarun Sahu
2023-06-16  7:44 ` [PATCH 0/3] mm: use memmap_on_memory semantics for dax/kmem David Hildenbrand
2023-06-21 19:32   ` Verma, Vishal L
2023-06-22 13:55     ` David Hildenbrand
2023-07-13 19:12   ` Jeff Moyer
2023-07-14  8:35     ` David Hildenbrand
2023-07-14 13:54       ` Jeff Moyer [this message]
