linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Qi Zheng <zhengqi.arch@bytedance.com>
To: Jann Horn <jannh@google.com>
Cc: david@redhat.com, hughd@google.com, willy@infradead.org,
	mgorman@suse.de, muchun.song@linux.dev, vbabka@kernel.org,
	akpm@linux-foundation.org, zokeefe@google.com,
	rientjes@google.com, peterx@redhat.com, catalin.marinas@arm.com,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org, x86@kernel.org
Subject: Re: [PATCH v2 1/7] mm: khugepaged: retract_page_tables() use pte_offset_map_rw_nolock()
Date: Thu, 7 Nov 2024 15:54:47 +0800	[thread overview]
Message-ID: <8c27c1f8-9573-4777-8397-929a15e67f60@bytedance.com> (raw)
In-Reply-To: <CAG48ez1S3nU0qzf6zZYAOTGO=RmK_2z+ZvHLzrpfamQ4uGK4hg@mail.gmail.com>

Hi Jann,

On 2024/11/7 05:48, Jann Horn wrote:
> On Thu, Oct 31, 2024 at 9:14 AM Qi Zheng <zhengqi.arch@bytedance.com> wrote:
>> In retract_page_tables(), we may modify the pmd entry after acquiring the
>> pml and ptl, so we should also check whether the pmd entry is stable.
> 
> Why does taking the PMD lock not guarantee that the PMD entry is stable?

Because the pmd entry may have changed before taking the pmd lock, so we
need to recheck it after taking the pmd or pte lock.

> 
>> Using pte_offset_map_rw_nolock() + pmd_same() to do it, and then we can
>> also remove the calling of the pte_lockptr().
>>
>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>> ---
>>   mm/khugepaged.c | 17 ++++++++++++++++-
>>   1 file changed, 16 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index 6f8d46d107b4b..6d76dde64f5fb 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -1721,6 +1721,7 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
>>                  spinlock_t *pml;
>>                  spinlock_t *ptl;
>>                  bool skipped_uffd = false;
>> +               pte_t *pte;
>>
>>                  /*
>>                   * Check vma->anon_vma to exclude MAP_PRIVATE mappings that
>> @@ -1756,11 +1757,25 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
>>                                          addr, addr + HPAGE_PMD_SIZE);
>>                  mmu_notifier_invalidate_range_start(&range);
>>
>> +               pte = pte_offset_map_rw_nolock(mm, pmd, addr, &pgt_pmd, &ptl);
>> +               if (!pte) {
>> +                       mmu_notifier_invalidate_range_end(&range);
>> +                       continue;
>> +               }
>> +
>>                  pml = pmd_lock(mm, pmd);
> 
> I don't understand why you're mapping the page table before locking
> the PMD. Doesn't that just mean we need more error checking
> afterwards?

The main purpose is to obtain the pmdval. If we don't use
pte_offset_map_rw_nolock, we should pay attention to recheck pmd entry
before pte_lockptr(), like this:

pmdval = pmdp_get_lockless(pmd);
pmd_lock
recheck pmdval
pte_lockptr(mm, pmd)

Otherwise, it may cause the system to crash. Consider the following
situation:

     CPU 0              CPU 1

zap_pte_range
--> clear pmd entry
     free pte page (by RCU)

                       retract_page_tables
                       --> pmd_lock
                           pte_lockptr(mm, pmd)  <-- BOOM!!

So maybe calling pte_offset_map_rw_nolock() is more convenient.

Thanks,
Qi


> 
> 
>> -               ptl = pte_lockptr(mm, pmd);
>>                  if (ptl != pml)
>>                          spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
>>
>> +               if (unlikely(!pmd_same(pgt_pmd, pmdp_get_lockless(pmd)))) {
>> +                       pte_unmap_unlock(pte, ptl);
>> +                       if (ptl != pml)
>> +                               spin_unlock(pml);
>> +                       mmu_notifier_invalidate_range_end(&range);
>> +                       continue;
>> +               }
>> +               pte_unmap(pte);
>> +
>>                  /*
>>                   * Huge page lock is still held, so normally the page table
>>                   * must remain empty; and we have already skipped anon_vma
>> --
>> 2.20.1
>>


  reply	other threads:[~2024-11-07  7:55 UTC|newest]

Thread overview: 31+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-10-31  8:13 [PATCH v2 0/7] synchronously scan and reclaim empty user PTE pages Qi Zheng
2024-10-31  8:13 ` [PATCH v2 1/7] mm: khugepaged: retract_page_tables() use pte_offset_map_rw_nolock() Qi Zheng
2024-11-06 21:48   ` Jann Horn
2024-11-07  7:54     ` Qi Zheng [this message]
2024-11-07 17:57       ` Jann Horn
2024-11-08  6:31         ` Qi Zheng
2024-10-31  8:13 ` [PATCH v2 2/7] mm: introduce zap_nonpresent_ptes() Qi Zheng
2024-11-06 21:48   ` Jann Horn
2024-11-12 16:58   ` David Hildenbrand
2024-10-31  8:13 ` [PATCH v2 3/7] mm: introduce do_zap_pte_range() Qi Zheng
2024-11-07 21:50   ` Jann Horn
2024-11-12 17:00   ` David Hildenbrand
2024-11-13  2:40     ` Qi Zheng
2024-11-13 11:43       ` David Hildenbrand
2024-11-13 12:19         ` Qi Zheng
2024-11-14  3:09         ` Qi Zheng
2024-11-14  4:12           ` Qi Zheng
2024-10-31  8:13 ` [PATCH v2 4/7] mm: make zap_pte_range() handle full within-PMD range Qi Zheng
2024-11-07 21:46   ` Jann Horn
2024-10-31  8:13 ` [PATCH v2 5/7] mm: pgtable: try to reclaim empty PTE page in madvise(MADV_DONTNEED) Qi Zheng
2024-11-07 23:35   ` Jann Horn
2024-11-08  7:13     ` Qi Zheng
2024-11-08 18:04       ` Jann Horn
2024-11-09  3:07         ` Qi Zheng
2024-10-31  8:13 ` [PATCH v2 6/7] x86: mm: free page table pages by RCU instead of semi RCU Qi Zheng
2024-11-07 22:39   ` Jann Horn
2024-11-08  7:38     ` Qi Zheng
2024-11-08 20:09       ` Jann Horn
2024-11-09  3:14         ` Qi Zheng
2024-11-13 11:26       ` Qi Zheng
2024-10-31  8:13 ` [PATCH v2 7/7] x86: select ARCH_SUPPORTS_PT_RECLAIM if X86_64 Qi Zheng

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=8c27c1f8-9573-4777-8397-929a15e67f60@bytedance.com \
    --to=zhengqi.arch@bytedance.com \
    --cc=akpm@linux-foundation.org \
    --cc=catalin.marinas@arm.com \
    --cc=david@redhat.com \
    --cc=hughd@google.com \
    --cc=jannh@google.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mgorman@suse.de \
    --cc=muchun.song@linux.dev \
    --cc=peterx@redhat.com \
    --cc=rientjes@google.com \
    --cc=vbabka@kernel.org \
    --cc=willy@infradead.org \
    --cc=x86@kernel.org \
    --cc=zokeefe@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox