linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Qi Zheng <zhengqi.arch@bytedance.com>
To: David Hildenbrand <david@redhat.com>
Cc: hughd@google.com, willy@infradead.org, mgorman@suse.de,
	muchun.song@linux.dev, akpm@linux-foundation.org,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH 0/3] asynchronously scan and free empty user PTE pages
Date: Thu, 13 Jun 2024 17:32:30 +0800	[thread overview]
Message-ID: <efac94f6-2fb3-4682-a894-7c8ffac18d20@bytedance.com> (raw)
In-Reply-To: <02f8cbd0-8b2b-4c2d-ad96-f854d25bf3c2@redhat.com>

Hi David,

Thanks for such a quick reply!

On 2024/6/13 17:04, David Hildenbrand wrote:
> On 13.06.24 10:38, Qi Zheng wrote:
>> Hi all,

[...]

> 
> 
>> 3. Implementation
>> =================
>>
>> For empty user PTE pages, we don't actually need to free it 
>> immediately, nor do
>> we need to free all of it.
>>
>> Therefore, in this patchset, we register a task_work for the user 
>> tasks to
>> asyncronously scan and free empty PTE pages when they return to user 
>> space.
>> (The scanning time interval and address space size can be adjusted.)
> 
> The question is, if we really have to scan asynchronously, or if would 
> be reasonable for most use cases to trigger a madvise(MADV_PT_RECLAIM) 
> every now and then. For virtio-mem, and likely most memory allocators, 
> that might be feasible, and valuable independent of system-wide 
> automatic scanning.

Agree, I also think it is possible to add always && madvise modes
simliar to THP.

> 
>>
>> When scanning, we can filter out some unsuitable vmas:
>>
>>      - VM_HUGETLB vma
>>      - VM_UFFD_WP vma
> 
> Why is UFFD_WP unsuitable? It should be suitable as long as you make 
> sure to really only remove page tables that are all pte_none().

Got it, I mistakenly thought pte_none() covered pte marker case until
I saw pte_none_mostly().

> 
>>      - etc
>> And for some PTE pages that spans multiple vmas, we can also skip.
>>
>> For locking:
>>
>>      - use the mmap read lock to traverse the vma tree and pgtable
>>      - use pmd lock for clearing pmd entry
>>      - use pte lock for checking empty PTE page, and release it after 
>> clearing
>>        pmd entry, then we can capture the changed pmd in 
>> pte_offset_map_lock()
>>        etc after holding this pte lock. Thanks to this, we don't need 
>> to hold the
>>        rmap-related locks.
>>      - users of pte_offset_map_lock() etc all expect the PTE page to 
>> be stable by
>>        using rcu lock, so use pte_free_defer() to free PTE pages.
> 
> I once had a protoype that would scan similar to GUP-fast, using the 
> mmap lock in read mode and disabling local IRQs and then walking the 
> page table locklessly (no PTLs). Only when identifying an empty page and 
> ripping out the page table, it would have to do more heavy locking (back 
> when we required the mmap lock in write mode and other things).

Maybe mmap write lock is not necessary, we can protect it using pmd lock
&& pte lock as above.

> 
> I can try digging up that patch if you're interested.

Yes, that would be better, maybe it can provide more inspiration!

> 
> We'll have to double check whether all anon memory cases can *properly* 
> handle pte_offset_map_lock() failing (not just handling it, but doing 
> the right thing; most of that anon-only code didn't ever run into that 
> issue so far, so these code paths were likely never triggered).

Yeah, I'll keep checking this out too.

> 
> 
>> For the path that will also free PTE pages in THP, we need to recheck 
>> whether the
>> content of pmd entry is valid after holding pmd lock or pte lock.
>>
>> 4. TODO
>> =======
>>
>> Some applications may be concerned about the overhead of scanning and 
>> rebuilding
>> page tables, so the following features are considered for 
>> implementation in the
>> future:
>>
>>      - add per-process switch (via prctl)
>>      - add a madvise option (like THP)
>>      - add MM_PGTABLE_SCAN_DELAY/MM_PGTABLE_SCAN_SIZE control (via 
>> procfs file)
>> Perhaps we can add the refcount to PTE pages in the future as well, 
>> which would
>> help improve the scanning speed.
> 
> I didn't like the added complexity last time, and the problem of 
> handling situations where we squeeze multiple page tables into a single 
> "struct page".

OK, except for refcount, do you think the other three todos above are 
still worth doing?

Thanks,
Qi

> 


  reply	other threads:[~2024-06-13  9:32 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-06-13  8:38 Qi Zheng
2024-06-13  8:38 ` [RFC PATCH 1/3] mm: pgtable: move pte_free_defer() out of CONFIG_TRANSPARENT_HUGEPAGE Qi Zheng
2024-06-13  8:38 ` [RFC PATCH 2/3] mm: pgtable: make pte_offset_map_nolock() return pmdval Qi Zheng
2024-06-13  8:38 ` [RFC PATCH 3/3] mm: free empty user PTE pages Qi Zheng
2024-06-13  9:04 ` [RFC PATCH 0/3] asynchronously scan and " David Hildenbrand
2024-06-13  9:32   ` Qi Zheng [this message]
2024-06-13 10:25     ` David Hildenbrand
2024-06-13 11:59       ` Qi Zheng
2024-06-14  3:32         ` Qi Zheng
2024-06-17 17:51           ` David Hildenbrand
2024-06-18  7:52             ` Qi Zheng
2024-06-14  7:53         ` David Hildenbrand
2024-06-14 10:49           ` Qi Zheng
2024-06-17 17:49             ` David Hildenbrand
2024-06-18  7:51               ` Qi Zheng
2024-06-18  9:40                 ` David Hildenbrand
2024-06-18  9:55                   ` Qi Zheng

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=efac94f6-2fb3-4682-a894-7c8ffac18d20@bytedance.com \
    --to=zhengqi.arch@bytedance.com \
    --cc=akpm@linux-foundation.org \
    --cc=david@redhat.com \
    --cc=hughd@google.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mgorman@suse.de \
    --cc=muchun.song@linux.dev \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox