From: Qi Zheng <zhengqi.arch@bytedance.com>
To: David Hildenbrand <david@redhat.com>
Cc: jannh@google.com, hughd@google.com, willy@infradead.org,
muchun.song@linux.dev, vbabka@kernel.org,
akpm@linux-foundation.org, peterx@redhat.com, mgorman@suse.de,
catalin.marinas@arm.com, will@kernel.org,
dave.hansen@linux.intel.com, luto@kernel.org,
peterz@infradead.org, x86@kernel.org, lorenzo.stoakes@oracle.com,
linux-mm@kvack.org, linux-kernel@vger.kernel.org,
zokeefe@google.com, rientjes@google.com
Subject: Re: [PATCH v3 4/9] mm: introduce skip_none_ptes()
Date: Mon, 18 Nov 2024 19:13:53 +0800
Message-ID: <d897a1d3-bf72-48f2-b4df-1f7acb3ac311@bytedance.com>
In-Reply-To: <332cbacb-cad3-4522-a74b-b5ad5efee4af@redhat.com>
On 2024/11/18 18:59, David Hildenbrand wrote:
> On 18.11.24 11:56, Qi Zheng wrote:
>>
>>
>> On 2024/11/18 18:41, David Hildenbrand wrote:
>>> On 18.11.24 11:34, Qi Zheng wrote:
>>>>
>>>>
>>>> On 2024/11/18 17:29, David Hildenbrand wrote:
>>>>> On 18.11.24 04:35, Qi Zheng wrote:
>>>>>>
>>>>>>
>>>>>> On 2024/11/15 22:59, David Hildenbrand wrote:
>>>>>>> On 15.11.24 15:41, Qi Zheng wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 2024/11/15 18:22, David Hildenbrand wrote:
>>>>>>>>>>>> *nr_skip = nr;
>>>>>>>>>>>>
>>>>>>>>>>>> and then:
>>>>>>>>>>>>
>>>>>>>>>>>> zap_pte_range
>>>>>>>>>>>> --> nr = do_zap_pte_range(tlb, vma, pte, addr, end, details,
>>>>>>>>>>>> &skip_nr,
>>>>>>>>>>>> rss, &force_flush, &force_break);
>>>>>>>>>>>> if (can_reclaim_pt) {
>>>>>>>>>>>> none_nr += count_pte_none(pte, nr);
>>>>>>>>>>>> none_nr += nr_skip;
>>>>>>>>>>>> }
>>>>>>>>>>>>
>>>>>>>>>>>> Right?
>>>>>>>>>>>
>>>>>>>>>>> Yes. I did not look closely at the patch that adds the
>>>>>>>>>>> counting of
>>>>>>>>>>
>>>>>>>>>> Got it.
>>>>>>>>>>
>>>>>>>>>>> pte_none though (to digest why it is required :) ).
>>>>>>>>>>
>>>>>>>>>> Because 'none_nr == PTRS_PER_PTE' is used in patch #7 to detect
>>>>>>>>>> empty PTE page.
>>>>>>>>>
>>>>>>>>> Okay, so the problem is that "nr" would be "all processed
>>>>>>>>> entries" but
>>>>>>>>> there are cases where we "process an entry but not zap it".
>>>>>>>>>
>>>>>>>>> What you really only want to know is "was any entry not zapped",
>>>>>>>>> which
>>>>>>>>> could be a simple input boolean variable passed into
>>>>>>>>> do_zap_pte_range?
>>>>>>>>>
>>>>>>>>> Because as soon as any entry was processed but not zapped, you can
>>>>>>>>> immediately give up on reclaiming that table.
>>>>>>>>
>>>>>>>> Yes, we can set can_reclaim_pt to false when a !pte_none() entry is
>>>>>>>> found in count_pte_none().
>>>>>>>
>>>>>>> I'm not sure if we'll need count_pte_none(), but I'll have to take a
>>>>>>> look
>>>>>>> at your new patch to see how this fits together with doing the
>>>>>>> pte_none
>>>>>>> detection+skipping in do_zap_pte_range().
>>>>>>>
>>>>>>> I was wondering if you cannot simply avoid the additional
>>>>>>> scanning and
>>>>>>> simply set "can_reclaim_pt" if you skip a zap.
>>>>>>
>>>>>> Maybe we can return the information whether the zap was skipped from
>>>>>> zap_present_ptes() and zap_nonpresent_ptes() through parameters
>>>>>> like I
>>>>>> did in [PATCH v1 3/7] and [PATCH v1 4/7].
>>>>>>
>>>>>> In theory, we can detect empty PTE pages in the following two ways:
>>>>>>
>>>>>> 1) If no zap is skipped, it means that all pte entries have been
>>>>>> zapped, and the PTE page must be empty.
>>>>>> 2) If all pte entries are detected to be none, then the PTE page is
>>>>>> empty.
>>>>>>
>>>>>> In the error case, 1) may cause non-empty PTE pages to be reclaimed
>>>>>> (which is unacceptable), while the 2) will at most cause empty PTE
>>>>>> pages
>>>>>> to not be reclaimed.
>>>>>>
>>>>>> So the most reliable and efficient method may be:
>>>>>>
>>>>>> a. If there is a zap that is skipped, stop scanning and do not
>>>>>> reclaim
>>>>>> the PTE page;
>>>>>> b. Otherwise, as now, detect the empty PTE page through
>>>>>> count_pte_none()
>>>>>
>>>>> Is there a need for count_pte_none() that I am missing?
>>>>
>>>> When any_skipped == false, at least add VM_BUG_ON() to recheck none
>>>> ptes.
>>>>
>>>>>
>>>>> Assume we have
>>>>>
>>>>> nr = do_zap_pte_range(&any_skipped)
>>>>>
>>>>>
>>>>> If "nr" is the number of processed entries (including pte_none()), and
>>>>> "any_skipped" is set whenever we skipped to zap a !pte_none entry, we
>>>>> can detect what we need, no?
>>>>>
>>>>> If any_skipped == false after the call, we now have "nr" pte_none()
>>>>> entries. -> We can continue trying to reclaim
>>>>
>>>> I prefer that "nr" should not include pte_none().
>>>>
>>>
>>> Why? do_zap_pte_range() should tell you how far to advance, nothing
>>> less, nothing more.
>>>
>>> Let's just keep it simple and avoid count_pte_none().
>>>
>>> I'm probably missing something important?
>>
>> As we discussed before, we should skip all consecutive none ptes,
>> pte and addr are already incremented before returning.
>
> It's probably best to send the resulting patch so I can either
> understand why count_pte_none() is required or comment on how to get rid
> of it.
Something like this:
diff --git a/mm/memory.c b/mm/memory.c
index bd9ebe0f4471f..e9bec3cd49d44 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1657,6 +1657,66 @@ static inline int zap_nonpresent_ptes(struct mmu_gather *tlb,
return nr;
}
+static inline int do_zap_pte_range(struct mmu_gather *tlb,
+ struct vm_area_struct *vma, pte_t *pte,
+ unsigned long addr, unsigned long end,
+ struct zap_details *details, int *rss,
+ bool *force_flush, bool *force_break,
+ bool *any_skipped)
+{
+ pte_t ptent = ptep_get(pte);
+ int max_nr = (end - addr) / PAGE_SIZE;
+
+ /* Skip all consecutive pte_none(). */
+ if (pte_none(ptent)) {
+ int nr;
+
+ for (nr = 1; nr < max_nr; nr++) {
+ ptent = ptep_get(pte + nr);
+ if (!pte_none(ptent))
+ break;
+ }
+ max_nr -= nr;
+ if (!max_nr)
+ return 0;
+ pte += nr;
+ addr += nr * PAGE_SIZE;
+ }
+
+ if (pte_present(ptent))
+ return zap_present_ptes(tlb, vma, pte, ptent, max_nr,
+ addr, details, rss, force_flush,
+ force_break, any_skipped);
+
+ return zap_nonpresent_ptes(tlb, vma, pte, ptent, max_nr, addr,
+ details, rss, any_skipped);
+}
+
+static inline int count_pte_none(pte_t *pte, int nr)
+{
+ int none_nr = 0;
+
+ /*
+ * If PTE_MARKER_UFFD_WP is enabled, the uffd-wp PTEs may be
+ * re-installed, so we need to check pte_none() one by one.
+ * Otherwise, checking a single PTE in a batch is sufficient.
+ */
+#ifdef CONFIG_PTE_MARKER_UFFD_WP
+ for (;;) {
+ if (pte_none(ptep_get(pte)))
+ none_nr++;
+ if (--nr == 0)
+ break;
+ pte++;
+ }
+#else
+ if (pte_none(ptep_get(pte)))
+ none_nr = nr;
+#endif
+ return none_nr;
+}
+
+
static unsigned long zap_pte_range(struct mmu_gather *tlb,
struct vm_area_struct *vma, pmd_t *pmd,
unsigned long addr, unsigned long end,
@@ -1667,6 +1727,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
int rss[NR_MM_COUNTERS];
spinlock_t *ptl;
pte_t *start_pte;
+ bool can_reclaim_pt;
pte_t *pte;
int nr;
@@ -1679,28 +1740,22 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
flush_tlb_batched_pending(mm);
arch_enter_lazy_mmu_mode();
do {
- pte_t ptent = ptep_get(pte);
- int max_nr;
-
- nr = 1;
- if (pte_none(ptent))
- continue;
+ bool any_skipped = false;
if (need_resched())
break;
- max_nr = (end - addr) / PAGE_SIZE;
- if (pte_present(ptent)) {
- nr = zap_present_ptes(tlb, vma, pte, ptent, max_nr,
- addr, details, rss, &force_flush,
- &force_break);
- if (unlikely(force_break)) {
- addr += nr * PAGE_SIZE;
- break;
- }
- } else {
- nr = zap_nonpresent_ptes(tlb, vma, pte, ptent, max_nr,
- addr, details, rss);
+ nr = do_zap_pte_range(tlb, vma, pte, addr, end, details,
+ rss, &force_flush, &force_break,
+ &any_skipped);
+ if (can_reclaim_pt) {
+ VM_BUG_ON(!any_skipped && count_pte_none(pte, nr) != nr);
+ if (any_skipped)
+ can_reclaim_pt = false;
+ }
+ if (unlikely(force_break)) {
+ addr += nr * PAGE_SIZE;
+ break;
}
} while (pte += nr, addr += PAGE_SIZE * nr, addr != end);
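
For comparison, the alternative contract raised earlier in the thread (the return value counts every processed entry, including pte_none() ones, and a boolean records whether any non-none entry was left in place) can be modelled in user space. This is only a toy sketch and not kernel code: the names toy_pte, toy_zap_batch(), toy_zap_table() and TOY_PTRS_PER_TABLE are hypothetical, each call handles at most one non-none entry, and it ignores locking, batching, and the uffd-wp marker re-installation mentioned in the patch comment.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_PTRS_PER_TABLE 8	/* hypothetical stand-in for PTRS_PER_PTE */

/* Toy entry states; real PTEs carry far more information. */
enum toy_pte { TOY_NONE, TOY_ZAPPABLE, TOY_KEPT };

/*
 * Process one batch starting at table[idx]. The return value counts
 * every processed entry, including skipped none entries, so the caller
 * advances by exactly that amount. *any_skipped is set when a non-none
 * entry is left in place.
 */
static int toy_zap_batch(enum toy_pte *table, int idx, int end,
			 bool *any_skipped)
{
	int nr = 0;

	/* Skip all consecutive none entries. */
	while (idx + nr < end && table[idx + nr] == TOY_NONE)
		nr++;
	if (idx + nr == end)
		return nr;

	/* Handle exactly one non-none entry per call to keep the model small. */
	if (table[idx + nr] == TOY_ZAPPABLE)
		table[idx + nr] = TOY_NONE;
	else
		*any_skipped = true;

	return nr + 1;
}

/* Walk the whole table; returns true if it could be reclaimed afterwards. */
static bool toy_zap_table(enum toy_pte *table)
{
	bool can_reclaim = true;
	int idx = 0;

	while (idx < TOY_PTRS_PER_TABLE) {
		bool any_skipped = false;
		int nr = toy_zap_batch(table, idx, TOY_PTRS_PER_TABLE,
				       &any_skipped);

		/* Every batch must make progress, or the walk never ends. */
		assert(nr > 0);
		if (any_skipped)
			can_reclaim = false;
		idx += nr;
	}
	return can_reclaim;
}
```

In this single-threaded model, a walk that never sets any_skipped leaves every entry none, so no second scan is needed to decide on reclaim. Whether that also holds in the kernel depends on the cases discussed above, which the model deliberately leaves out.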