From: Qi Zheng <zhengqi.arch@bytedance.com>
To: David Hildenbrand <david@redhat.com>
Cc: hughd@google.com, willy@infradead.org, mgorman@suse.de,
muchun.song@linux.dev, akpm@linux-foundation.org,
linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH 0/3] asynchronously scan and free empty user PTE pages
Date: Thu, 13 Jun 2024 17:32:30 +0800 [thread overview]
Message-ID: <efac94f6-2fb3-4682-a894-7c8ffac18d20@bytedance.com> (raw)
In-Reply-To: <02f8cbd0-8b2b-4c2d-ad96-f854d25bf3c2@redhat.com>
Hi David,
Thanks for such a quick reply!
On 2024/6/13 17:04, David Hildenbrand wrote:
> On 13.06.24 10:38, Qi Zheng wrote:
>> Hi all,
[...]
>
>
>> 3. Implementation
>> =================
>>
>> For empty user PTE pages, we don't actually need to free it
>> immediately, nor do
>> we need to free all of it.
>>
>> Therefore, in this patchset, we register a task_work for the user
>> tasks to
>> asyncronously scan and free empty PTE pages when they return to user
>> space.
>> (The scanning time interval and address space size can be adjusted.)
>
> The question is, if we really have to scan asynchronously, or if would
> be reasonable for most use cases to trigger a madvise(MADV_PT_RECLAIM)
> every now and then. For virtio-mem, and likely most memory allocators,
> that might be feasible, and valuable independent of system-wide
> automatic scanning.
Agree, I also think it is possible to add always && madvise modes
simliar to THP.
>
>>
>> When scanning, we can filter out some unsuitable vmas:
>>
>> - VM_HUGETLB vma
>> - VM_UFFD_WP vma
>
> Why is UFFD_WP unsuitable? It should be suitable as long as you make
> sure to really only remove page tables that are all pte_none().
Got it, I mistakenly thought pte_none() covered pte marker case until
I saw pte_none_mostly().
>
>> - etc
>> And for some PTE pages that spans multiple vmas, we can also skip.
>>
>> For locking:
>>
>> - use the mmap read lock to traverse the vma tree and pgtable
>> - use pmd lock for clearing pmd entry
>> - use pte lock for checking empty PTE page, and release it after
>> clearing
>> pmd entry, then we can capture the changed pmd in
>> pte_offset_map_lock()
>> etc after holding this pte lock. Thanks to this, we don't need
>> to hold the
>> rmap-related locks.
>> - users of pte_offset_map_lock() etc all expect the PTE page to
>> be stable by
>> using rcu lock, so use pte_free_defer() to free PTE pages.
>
> I once had a protoype that would scan similar to GUP-fast, using the
> mmap lock in read mode and disabling local IRQs and then walking the
> page table locklessly (no PTLs). Only when identifying an empty page and
> ripping out the page table, it would have to do more heavy locking (back
> when we required the mmap lock in write mode and other things).
Maybe mmap write lock is not necessary, we can protect it using pmd lock
&& pte lock as above.
>
> I can try digging up that patch if you're interested.
Yes, that would be better, maybe it can provide more inspiration!
>
> We'll have to double check whether all anon memory cases can *properly*
> handle pte_offset_map_lock() failing (not just handling it, but doing
> the right thing; most of that anon-only code didn't ever run into that
> issue so far, so these code paths were likely never triggered).
Yeah, I'll keep checking this out too.
>
>
>> For the path that will also free PTE pages in THP, we need to recheck
>> whether the
>> content of pmd entry is valid after holding pmd lock or pte lock.
>>
>> 4. TODO
>> =======
>>
>> Some applications may be concerned about the overhead of scanning and
>> rebuilding
>> page tables, so the following features are considered for
>> implementation in the
>> future:
>>
>> - add per-process switch (via prctl)
>> - add a madvise option (like THP)
>> - add MM_PGTABLE_SCAN_DELAY/MM_PGTABLE_SCAN_SIZE control (via
>> procfs file)
>> Perhaps we can add the refcount to PTE pages in the future as well,
>> which would
>> help improve the scanning speed.
>
> I didn't like the added complexity last time, and the problem of
> handling situations where we squeeze multiple page tables into a single
> "struct page".
OK, except for refcount, do you think the other three todos above are
still worth doing?
Thanks,
Qi
>
next prev parent reply other threads:[~2024-06-13 9:32 UTC|newest]
Thread overview: 17+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-06-13 8:38 Qi Zheng
2024-06-13 8:38 ` [RFC PATCH 1/3] mm: pgtable: move pte_free_defer() out of CONFIG_TRANSPARENT_HUGEPAGE Qi Zheng
2024-06-13 8:38 ` [RFC PATCH 2/3] mm: pgtable: make pte_offset_map_nolock() return pmdval Qi Zheng
2024-06-13 8:38 ` [RFC PATCH 3/3] mm: free empty user PTE pages Qi Zheng
2024-06-13 9:04 ` [RFC PATCH 0/3] asynchronously scan and " David Hildenbrand
2024-06-13 9:32 ` Qi Zheng [this message]
2024-06-13 10:25 ` David Hildenbrand
2024-06-13 11:59 ` Qi Zheng
2024-06-14 3:32 ` Qi Zheng
2024-06-17 17:51 ` David Hildenbrand
2024-06-18 7:52 ` Qi Zheng
2024-06-14 7:53 ` David Hildenbrand
2024-06-14 10:49 ` Qi Zheng
2024-06-17 17:49 ` David Hildenbrand
2024-06-18 7:51 ` Qi Zheng
2024-06-18 9:40 ` David Hildenbrand
2024-06-18 9:55 ` Qi Zheng
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=efac94f6-2fb3-4682-a894-7c8ffac18d20@bytedance.com \
--to=zhengqi.arch@bytedance.com \
--cc=akpm@linux-foundation.org \
--cc=david@redhat.com \
--cc=hughd@google.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=mgorman@suse.de \
--cc=muchun.song@linux.dev \
--cc=willy@infradead.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox