[LSF/MM/BPF TOPIC] page table reclaim


From: David Hildenbrand <david@redhat.com>
To: lsf-pc@lists.linux-foundation.org,
	"linux-mm@kvack.org" <linux-mm@kvack.org>
Subject: [LSF/MM/BPF TOPIC] page table reclaim
Date: Tue, 22 Feb 2022 09:56:20 +0100
Message-ID: <7b908208-02f8-6fde-4dfc-13d5e00310a6@redhat.com>

Hi all,

We are aware of workloads that can trigger allocation of a lot of page
tables that are essentially unnecessary. The obvious candidates are
processes that dynamically manage memory consumption in large, sparse
memory mappings, e.g., via madvise(MADV_DONTNEED): hypervisors that
implement memory ballooning or virtio-mem, and memory allocators.

In fact, it's easy to have a process that consumes almost nothing but
page tables, and it's hard to distinguish between "malicious" and
"sane" workloads when just looking at the page table consumption. I have
quite a few neat examples that I can present.
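
To make the effect concrete, here is a minimal userspace sketch (not one
of the examples mentioned above). It touches one byte per `stride` in a
sparse anonymous mapping, drops the pages again via
madvise(MADV_DONTNEED), and reports how much the VmPTE field of
/proc/self/status grew. The helper names are made up for illustration,
and MADV_NOHUGEPAGE is only there as an assumption to keep THP from
hiding the PTE tables.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Parse one line of /proc/<pid>/status. Returns the VmPTE value in kB,
 * or -1 if this is not the VmPTE line. (Hypothetical helper.)
 */
long parse_vmpte_kb(const char *line)
{
	long kb;

	return sscanf(line, "VmPTE: %ld kB", &kb) == 1 ? kb : -1;
}

/* Current page table consumption of this process in kB, or -1. */
long read_vmpte_kb(void)
{
	char line[256];
	long kb = -1;
	FILE *f = fopen("/proc/self/status", "r");

	if (!f)
		return -1;
	while (kb < 0 && fgets(line, sizeof(line), f))
		kb = parse_vmpte_kb(line);
	fclose(f);
	return kb;
}

/*
 * Touch one byte every `stride` bytes of a sparse anonymous mapping,
 * zap the pages with MADV_DONTNEED, and return the VmPTE growth in kB
 * measured while the mapping still exists. Returns -1 on error.
 */
long pte_growth_after_dontneed_kb(size_t size, size_t stride)
{
	long before, after;
	char *p;

	if (stride == 0)
		return -1;
	p = mmap(NULL, size, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
	if (p == MAP_FAILED)
		return -1;

	/* Assumption: avoid THP so that each touch populates a PTE table. */
	madvise(p, size, MADV_NOHUGEPAGE);

	before = read_vmpte_kb();
	for (size_t off = 0; off < size; off += stride)
		p[off] = 1;

	/* The pages go away; the page tables usually stay behind. */
	madvise(p, size, MADV_DONTNEED);
	after = read_vmpte_kb();

	munmap(p, size);
	if (before < 0 || after < 0)
		return -1;
	return after - before;
}
```

On x86-64 with 4 KiB pages, one PTE table is 4 KiB and covers 2 MiB, so
calling pte_growth_after_dontneed_kb(1UL << 30, 2UL << 20) would be
expected to leave roughly 512 * 4 KiB = 2 MiB of page tables behind while
the process holds almost no other memory. A kernel that frees empty PTE
tables as part of the zap would report little or no growth.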

Page tables are unmovable in memory and cannot get swapped out. So heavy
page table consumption isn't only problematic because we end up wasting
system RAM and fragmenting it with unmovable allocations; it's also a
problem when big portions of system RAM are managed by CMA/ZONE_MOVABLE,
where we can simply run out of system RAM available for unmovable
allocations and eventually harm the system / other workloads on the same
machine.

One approach I'd like to discuss is page table reclaim: reclaiming
unnecessary page tables, which involves a lot of challenges.

1. Efficient page table reclaim

"Ripping out" a page table is an expensive and highly complicated
operation: just take a look at khugepaged. We have to block all page
table walkers, which requires the mmap_lock in write mode, the rmap
lock, and proper synchronization with GUP-fast.

In the simplest approach, we'd scan for candidate page tables to then
rip them out. But:
* How to scan for candidate page tables efficiently?
* How to avoid the mmap_lock in write mode when removing a page table?
* How to avoid the rmap lock (just imagine a page table spanning
  multiple rmaps)?

But also: how to make the implementation simple and appealing enough to
get merged upstream? For example, the last attempt [1] to reclaim empty
PTE page tables automatically once the last PTE was zapped has not been
merged yet because it certainly adds complexity. How to avoid that
complexity?

2. Who triggers reclaim and when?

Letting an application trigger reclaim of page tables is the "easiest
solution": let's imagine madvise(MADV_RECLAIM_PGTABLES). However, this
doesn't take care of malicious workloads and is more problematic when
having sparse files mapped into multiple processes. Further, there is no
need to reclaim if we're not under memory pressure.

Letting the system do this automatically looks "cleaner". But when to
start reclaiming? How to detect and handle malicious processes (do we
care?)? How to set an adequate soft/hard limit?

3. Which page tables to reclaim?

While the obvious candidates are empty page tables, we can easily have
page tables all filled with the shared zeropage instead. Once again,
there are sane and malicious use cases. A sane use case is a simple VM
having a balloon inflated and triggering a memory dump like kdump: we'll
populate the shared zeropage everywhere and have plenty of page tables
we don't even care about.
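
The zeropage case can be sketched from userspace as well. The following
is a rough illustration only, with made-up helper names: it reads one
byte per page of a fresh anonymous mapping, which on most configurations
maps the shared zeropage into every PTE, and reports how VmPTE and VmRSS
from /proc/self/status changed. It assumes the architecture allows the
shared zeropage and uses MADV_NOHUGEPAGE so that THP does not map the
huge zeropage instead.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Parse "<key>: <n> kB" from one /proc/<pid>/status line.
 * Returns n, or -1 if the line does not start with "<key>:".
 */
long parse_status_kb(const char *line, const char *key)
{
	size_t len = strlen(key);
	long kb;

	if (strncmp(line, key, len) != 0 || line[len] != ':')
		return -1;
	return sscanf(line + len + 1, "%ld", &kb) == 1 ? kb : -1;
}

/* Value of a kB field in /proc/self/status, or -1 if unavailable. */
long read_status_kb(const char *key)
{
	char line[256];
	long kb = -1;
	FILE *f = fopen("/proc/self/status", "r");

	if (!f)
		return -1;
	while (kb < 0 && fgets(line, sizeof(line), f))
		kb = parse_status_kb(line, key);
	fclose(f);
	return kb;
}

/*
 * Read one byte per page of a fresh anonymous mapping. Returns the
 * number of nonzero bytes seen (always 0 for anonymous memory), or -1
 * on error. The growth values are only meaningful where
 * /proc/self/status reports both VmPTE and VmRSS.
 */
long zeropage_fill(size_t size, long *pte_growth_kb, long *rss_growth_kb)
{
	long page = sysconf(_SC_PAGESIZE);
	long pte0, rss0, nonzero = 0;
	volatile char *p;

	if (page <= 0)
		return -1;
	p = mmap(NULL, size, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
	if (p == MAP_FAILED)
		return -1;

	/* Assumption: keep THP from installing the huge zeropage instead. */
	madvise((void *)p, size, MADV_NOHUGEPAGE);

	pte0 = read_status_kb("VmPTE");
	rss0 = read_status_kb("VmRSS");

	/* Read faults only: no anonymous pages get allocated. */
	for (size_t off = 0; off < size; off += (size_t)page)
		nonzero += p[off] != 0;

	*pte_growth_kb = read_status_kb("VmPTE") - pte0;
	*rss_growth_kb = read_status_kb("VmRSS") - rss0;

	munmap((void *)p, size);
	return nonzero;
}
```

With a large enough mapping, the expectation is that the VmPTE growth
tracks the mapping size while the VmRSS growth stays small, because the
shared zeropage is not accounted as resident memory of the process.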

But once we talk about reclaiming page tables that are still populated
with the shared zeropage, why not reclaim page tables that are
"reconstructable", for example, because they don't map anonymous pages
and don't require special fault handling (userfaultfd?)?

While I do have answers to some of the questions and various ideas, it's
certainly an interesting topic to discuss and brainstorm.

[1] https://lkml.kernel.org/r/20211110105428.32458-1-zhengqi.arch@bytedance.com

-- 
Thanks,

David / dhildenb
