From: David Hildenbrand <david@redhat.com>
To: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: hughd@google.com, willy@infradead.org, mgorman@suse.de,
	muchun.song@linux.dev, vbabka@kernel.org,
	akpm@linux-foundation.org, zokeefe@google.com,
	rientjes@google.com, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH v2 4/7] mm: pgtable: try to reclaim empty PTE pages in zap_page_range_single()
Date: Fri, 16 Aug 2024 11:22:15 +0200	[thread overview]
Message-ID: <b0b39543-498d-4b08-a864-b05be45f617d@redhat.com> (raw)
In-Reply-To: <42942b4d-153e-43e2-bfb1-43db49f87e50@bytedance.com>

On 07.08.24 05:58, Qi Zheng wrote:
> Hi David,
> 

Really sorry for the slow replies, I'm struggling with a mixture of
public holidays, vacation, and too many different discussions (well,
and some stuff I have to finish myself).

> On 2024/8/6 22:40, David Hildenbrand wrote:
>> On 05.08.24 14:55, Qi Zheng wrote:
>>> In order to pursue high performance, applications mostly use
>>> high-performance user-mode memory allocators such as jemalloc or
>>> tcmalloc. These memory allocators use madvise(MADV_DONTNEED or
>>> MADV_FREE) to release physical memory, but neither MADV_DONTNEED nor
>>> MADV_FREE releases page table memory, which can result in huge page
>>> table memory usage.
>>>
>>> The following is a memory usage snapshot of one process, as actually
>>> observed on our server:
>>>
>>>           VIRT:  55t
>>>           RES:   590g
>>>           VmPTE: 110g
>>>
>>> In this case, most of the page table entries are empty. For such a PTE
>>> page where all entries are empty, we can actually free it back to the
>>> system for others to use.
>>>
>>> As a first step, this commit attempts to synchronously free empty PTE
>>> pages in zap_page_range_single() (which MADV_DONTNEED etc. invoke). To
>>> reduce overhead, we only handle cases with a high probability of
>>> generating empty PTE pages; other cases are filtered out, such as:
>>
>> It doesn't make particular sense during munmap(), where we will just
>> remove the page tables manually right afterwards. We should limit it
>> to the !munmap case -- in particular MADV_DONTNEED.
> 
> munmap directly calls unmap_single_vma() instead of
> zap_page_range_single(), so the munmap case has already been excluded
> here. On the other hand, if we try to reclaim in zap_pte_range(), we
> need to identify the munmap case.
> 
> Of course, we could just modify the MADV_DONTNEED case instead of all
> the callers of zap_page_range_single(). Perhaps we could add a new
> parameter to identify the MADV_DONTNEED case?

See below, zap_details might come in handy.

> 
>>
>> To minimize the added overhead, I further suggest to only try reclaim
>> asynchronously if we know that likely all ptes will be none, that is,
> 
> asynchronously? What you probably mean to say is synchronously, right?
> 
>> when we just zapped *all* ptes of a PTE page table -- our range spans
>> the complete PTE page table.
>>
>> Just imagine someone zaps a single PTE: we really don't want to start
>> scanning page tables and involve a (rather expensive) walk_page_range
>> just to find out that there is still something mapped.
> 
> In the munmap path, we first execute unmap and then reclaim the page
> tables:
> 
> unmap_vmas
> free_pgtables
> 
> Therefore, I think doing something similar in zap_page_range_single()
> would be more consistent:
> 
> unmap_single_vma
> try_to_reclaim_pgtables
> 
> And I think that the main overhead should be in flushing TLB and freeing
> the pages. Of course, I will do some performance testing to see the
> actual impact.
> 
>>
>> Last but not least, would there be a way to avoid the walk_page_range()
>> and simply trigger it from zap_pte_range(), possibly still while holding
>> the PTE table lock?
> 
> I've tried doing it that way before, but ultimately I did not choose to
> do it that way because of the following reasons:

I think we really should avoid another page table walk if possible.

> 
> 1. need to identify the munmap case

We already have "struct zap_details". Maybe we can extend that to
specify what our intentions are (either where we come from or whether
we want to try ripping out page tables directly).
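
Just as a rough sketch of what that could look like (the "reclaim_pt"
name is made up here; the other members are abbreviated from the
current struct):

struct zap_details {
	struct folio *single_folio;	/* only zap this folio */
	bool even_cows;			/* zap COWed private pages too? */
	zap_flags_t zap_flags;		/* extra flags for zapping */
	bool reclaim_pt;		/* try to free empty PTE pages? */
};

Then only the madvise(MADV_DONTNEED) path would opt in:

static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
					unsigned long start, unsigned long end)
{
	struct zap_details details = { .reclaim_pt = true, };

	zap_page_range_single(vma, start, end - start, &details);
	return 0;
}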

> 2. trying to record the count of pte_none() within the original
>      zap_pte_range() loop is not very convenient. The most convenient
>      approach is still to loop 512 times to scan the PTE page.

Right, the code might need some reshuffling. As we might temporarily
drop the PTL (break case), fully relying on everything being pte_none()
doesn't always work.

We could either handle it in zap_pmd_range(), after we have processed a
full PMD range. zap_pmd_range() knows for sure whether the full PMD
range was covered, even if multiple zap_pte_range() calls were required.
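
For that variant, roughly (the "reclaim_pt" flag and "try_to_free_pte()"
are made-up names; the THP and pmd_none() handling is elided):

	unsigned long pmd_start = addr;

	do {
		next = pmd_addr_end(addr, end);
		addr = zap_pte_range(tlb, vma, pmd, addr, next, details);
		if (addr != next) {
			/* Had to break; retry the same PMD. */
			pmd--;
		} else {
			/*
			 * We reached the end of this PMD range. If it also
			 * started on a PMD boundary, the zap covered the
			 * full range and the PTE page may now be empty.
			 */
			if (details && details->reclaim_pt &&
			    IS_ALIGNED(pmd_start, PMD_SIZE) &&
			    addr - pmd_start == PMD_SIZE)
				try_to_free_pte(tlb->mm, pmd, pmd_start,
						details);
			pmd_start = addr;
		}
	} while (pmd++, addr != end);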

Or we could indicate to zap_pte_range() the original range. Or we could 
make zap_pte_range() simply handle the retrying itself, and not get 
called multiple times for a single PMD range.

So the key points are:

(a) zap_pmd_range() should know for sure whether the full range is
     covered by the zap.
(b) zap_pte_range() knows whether it left any entries behind (IOW, it
     never ran into the "!should_zap_folio" case).
(c) we know whether we temporarily had to drop the PTL and someone might
     have converted pte_none() to something else.

Teaching zap_pte_range() to handle a full within-PMD range itself sounds 
cleanest.

Then we can handle it fully in zap_pte_range():

(a) if we had to leave entries behind (!pte_none()), no need to try
     ripping out the page table.

(b) if we didn't have to drop the PTL, we can remove the page table
     without even re-verifying whether the entries are pte_none(). We
     know they are. If we had to drop the PTL, we have to re-verify at
     least the PTEs that were not zapped in the last iteration.


So there is the chance to avoid pte_none() checks completely, or to
minimize them if we had to drop the PTL.
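
Roughly, the tail of such a reworked zap_pte_range() could look like
this ("full_pmd_range", "any_skipped", "dropped_ptl" and
"try_to_free_pte()" are made-up names):

static void maybe_free_pte_page(struct mm_struct *mm, pmd_t *pmd,
				unsigned long addr,
				struct zap_details *details,
				bool full_pmd_range, bool any_skipped,
				bool dropped_ptl)
{
	/* (a) partial zap, or we knowingly left !pte_none() entries */
	if (!full_pmd_range || any_skipped)
		return;

	/*
	 * (b) if the PTL was held throughout, all entries are known to
	 * be pte_none() -- no re-scan needed. Otherwise, only the PTEs
	 * that were not zapped in the last iteration must be
	 * re-verified, as someone might have refilled them while the
	 * PTL was dropped.
	 */
	try_to_free_pte(mm, pmd, addr, details, /* verify */ dropped_ptl);
}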

Anything I am missing? Please let me know if anything is unclear.

Reworking the retry logic so that zap_pte_range() is called only once
for a single PMD range is likely the first step.

> 3. still need to release the pte lock, and then re-acquire the pmd lock
>      and pte lock.

Yes, if try-locking the PMD fails.
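
Something like the following, inside zap_pte_range() with the PTL still
held (sketch only; the actual freeing and TLB flushing are omitted):

	spinlock_t *pml = pmd_lockptr(mm, pmd);

	if (spin_trylock(pml)) {
		/*
		 * We hold both the PTE lock and the PMD lock, so the
		 * empty PTE page can be disconnected right away.
		 */
		pmd_clear(pmd);
		spin_unlock(pml);
	} else {
		/*
		 * Lock ordering is pmd lock -> pte lock, so we must not
		 * block on the PMD lock here. Fall back to dropping the
		 * PTE lock, taking both locks in order, and re-verifying
		 * the entries.
		 */
	}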

-- 
Cheers,

David / dhildenb



