From: David Hildenbrand <david@redhat.com>
To: Qi Zheng <zhengqi.arch@bytedance.com>,
hughd@google.com, willy@infradead.org, mgorman@suse.de,
muchun.song@linux.dev, akpm@linux-foundation.org
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH 0/3] asynchronously scan and free empty user PTE pages
Date: Thu, 13 Jun 2024 11:04:57 +0200
Message-ID: <02f8cbd0-8b2b-4c2d-ad96-f854d25bf3c2@redhat.com>
In-Reply-To: <cover.1718267194.git.zhengqi.arch@bytedance.com>
On 13.06.24 10:38, Qi Zheng wrote:
> Hi all,
>
> This series aims to asynchronously scan and free empty user PTE pages.
>
> 1. Background
> =============
>
> We often find huge user PTE memory usage on our servers, such as the following:
>
> VIRT: 55t
> RES: 590g
> VmPTE: 110g
>
> The root cause is that these processes use some high-performance memory
> allocators (such as jemalloc, tcmalloc, etc.). These memory allocators use
> madvise(MADV_DONTNEED or MADV_FREE) to release physical memory, but neither
> MADV_DONTNEED nor MADV_FREE releases the page table memory, which may lead to
> huge page table memory usage.
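
For illustration, a minimal userspace sketch of that pattern (the mapping size
is made up): memset() faults everything in and thereby allocates PTE pages,
madvise(MADV_DONTNEED) frees the physical pages again, but the now-empty PTE
pages stay allocated and keep counting toward VmPTE:

#include <string.h>
#include <sys/mman.h>

int main(void)
{
        size_t len = 1UL << 30;   /* 1 GiB of anonymous memory (made-up size) */
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (p == MAP_FAILED)
                return 1;

        memset(p, 1, len);               /* fault in pages and the PTE pages
                                          * that map them */
        madvise(p, len, MADV_DONTNEED);  /* frees the physical pages ...
                                          * ... but the PTE pages remain; watch
                                          * VmPTE in /proc/self/status */
        return 0;
}
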
>
> This issue was discussed at LSFMM 2022 (in a topic led by David Hildenbrand):
>
> topic link: https://lore.kernel.org/linux-mm/7b908208-02f8-6fde-4dfc-13d5e00310a6@redhat.com/
> youtube link: https://www.youtube.com/watch?v=naO_BRhcU68
>
> In the past, I tried to introduce a refcount for PTE pages to solve this
> problem, but those approaches [1][2][3] introduced too much complexity.
>
> [1]. https://lore.kernel.org/lkml/20211110105428.32458-1-zhengqi.arch@bytedance.com/
> [2]. https://lore.kernel.org/lkml/20220429133552.33768-1-zhengqi.arch@bytedance.com/
> [3]. https://lore.kernel.org/lkml/20220825101037.96517-1-zhengqi.arch@bytedance.com/
>
> 2. Infrastructure
> =================
>
> Later, in order to free retracted page tables, Hugh Dickins added a lot of
> PTE-related infrastructure[4][5][6]:
>
> - allow pte_offset_map_lock() etc. to fail
> - allow PTE pages to be removed without mmap or rmap locks
>   (see collapse_pte_mapped_thp() and retract_page_tables())
> - allow PTE pages to be freed by RCU (via pte_free_defer())
> - etc
>
> These are all beneficial to freeing empty PTE pages.
>
> [4]. https://lore.kernel.org/all/a4963be9-7aa6-350-66d0-2ba843e1af44@google.com/
> [5]. https://lore.kernel.org/all/c1c9a74a-bc5b-15ea-e5d2-8ec34bc921d@google.com/
> [6]. https://lore.kernel.org/all/7cd843a9-aa80-14f-5eb2-33427363c20@google.com/
>
I'm a big fan of this for virtio-mem.
> 3. Implementation
> =================
>
> For empty user PTE pages, we don't actually need to free them immediately, nor
> do we need to free all of them.
>
> Therefore, in this patchset, we register a task_work for the user tasks to
> asynchronously scan and free empty PTE pages when they return to user space.
> (The scanning time interval and address space size can be adjusted.)
The question is whether we really have to scan asynchronously, or whether it
would be reasonable for most use cases to trigger a madvise(MADV_PT_RECLAIM)
every now and then. For virtio-mem, and likely most memory allocators,
that might be feasible, and valuable independent of system-wide
automatic scanning.
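
To be clear, MADV_PT_RECLAIM is only a suggested interface here, not an
existing madvise flag. A rough sketch of how an allocator could use it,
assuming such a flag existed (the value below is made up):

#include <sys/mman.h>

#define MADV_PT_RECLAIM 99   /* hypothetical flag, made-up value */

static void arena_purge(void *base, size_t len)
{
        /* Release the physical memory, as allocators already do today ... */
        madvise(base, len, MADV_DONTNEED);
        /* ... and additionally ask the kernel to reclaim any PTE pages that
         * became empty in this range (hypothetical). */
        madvise(base, len, MADV_PT_RECLAIM);
}
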
>
> When scanning, we can filter out some unsuitable vmas:
>
> - VM_HUGETLB vma
> - VM_UFFD_WP vma
Why is UFFD_WP unsuitable? It should be suitable as long as you make
sure to really only remove page tables that are all pte_none().
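
"All pte_none()" is cheap to check; roughly the following (a simplified
sketch, assuming the PTE page is mapped and its lock is held; the helper name
is made up):

static bool pte_range_is_empty(pte_t *start_pte)
{
        int i;

        for (i = 0; i < PTRS_PER_PTE; i++) {
                /* Any present entry, swap entry or PTE marker (e.g. a
                 * uffd-wp marker) makes the page table non-empty. */
                if (!pte_none(ptep_get(start_pte + i)))
                        return false;
        }
        return true;
}
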
> - etc
>
> And we can also skip PTE pages that span multiple vmas.
>
> For locking:
>
> - use the mmap read lock to traverse the vma tree and pgtable
> - use pmd lock for clearing pmd entry
> - use pte lock for checking empty PTE page, and release it after clearing
> pmd entry, then we can capture the changed pmd in pte_offset_map_lock()
> etc after holding this pte lock. Thanks to this, we don't need to hold the
> rmap-related locks.
> - users of pte_offset_map_lock() etc all expect the PTE page to be stable by
> using rcu lock, so use pte_free_defer() to free PTE pages.
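
A rough sketch of that sequence (heavily simplified: mmu notifiers, TLB and
arch details, page_table_check and accounting are left out, and a full version
would also have to verify that the rechecked pmd still points at the PTE page
mapped earlier -- this is not the actual patch):

static void try_free_empty_pte_page(struct vm_area_struct *vma, pmd_t *pmd,
                                    unsigned long addr /* PMD-aligned */)
{
        struct mm_struct *mm = vma->vm_mm;
        spinlock_t *pml, *ptl;
        pmd_t pmdval;
        pte_t *pte;
        int i;

        /* Map the PTE page and look up its lock without taking it yet. */
        pte = pte_offset_map_nolock(mm, pmd, addr, &ptl);
        if (!pte)
                return;                 /* page table already gone */

        pml = pmd_lock(mm, pmd);        /* pmd lock first ... */
        if (ptl != pml)                 /* ... then the pte lock, nested */
                spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);

        /* Recheck under the locks: the pmd may have changed meanwhile. */
        pmdval = pmdp_get(pmd);
        if (!pmd_present(pmdval) || pmd_trans_huge(pmdval))
                goto out;

        /* Only rip out the page table if every entry is pte_none(). */
        for (i = 0; i < PTRS_PER_PTE; i++)
                if (!pte_none(ptep_get(pte + i)))
                        goto out;

        /* Clear the pmd and flush; pte_offset_map_lock() users will now
         * observe the changed pmd and fail (or retry). */
        pmdval = pmdp_collapse_flush(vma, addr, pmd);
        mm_dec_nr_ptes(mm);

        if (ptl != pml)
                spin_unlock(ptl);
        spin_unlock(pml);
        pte_unmap(pte);

        pte_free_defer(mm, pmd_pgtable(pmdval));   /* RCU-deferred free */
        return;
out:
        if (ptl != pml)
                spin_unlock(ptl);
        spin_unlock(pml);
        pte_unmap(pte);
}
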
>
I once had a prototype that would scan similar to GUP-fast, using the
mmap lock in read mode and disabling local IRQs, and then walking the
page table locklessly (no PTLs). Only when identifying an empty page
table and ripping it out would it have to do heavier locking (back
when we required the mmap lock in write mode and other things).
I can try digging up that patch if you're interested.
We'll have to double check whether all anon memory cases can *properly*
handle pte_offset_map_lock() failing (not just handling it, but doing
the right thing; most of that anon-only code didn't ever run into that
issue so far, so these code paths were likely never triggered).
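
The pattern in question is roughly the following (a generic sketch, not taken
from any particular call site); the interesting part is what "return 0" has to
mean for each caller once the page table can vanish underneath it:

static int walk_one_pte_range(struct mm_struct *mm, pmd_t *pmd,
                              unsigned long addr, unsigned long end)
{
        spinlock_t *ptl;
        pte_t *start_pte, *pte;

        start_pte = pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
        if (!pte)
                return 0;       /* page table gone: nothing mapped here,
                                 * or the caller has to retry the pmd */

        for (; addr < end; addr += PAGE_SIZE, pte++) {
                /* ... process ptep_get(pte) ... */
        }

        pte_unmap_unlock(start_pte, ptl);
        return 0;
}
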
> For the paths that can also free PTE pages in THP, we need to recheck whether
> the content of the pmd entry is still valid after taking the pmd lock or pte lock.
>
> 4. TODO
> =======
>
> Some applications may be concerned about the overhead of scanning and rebuilding
> page tables, so the following features are considered for implementation in the
> future:
>
> - add per-process switch (via prctl)
> - add a madvise option (like THP)
> - add MM_PGTABLE_SCAN_DELAY/MM_PGTABLE_SCAN_SIZE control (via procfs file)
>
> Perhaps we can also add a refcount to PTE pages in the future, which would
> help improve the scanning speed.
I didn't like the added complexity last time, nor the problem of
handling situations where we squeeze multiple page tables into a single
"struct page".
--
Cheers,
David / dhildenb