From: Jann Horn <jannh@google.com>
To: Hugh Dickins <hughd@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Mike Kravetz <mike.kravetz@oracle.com>,
	 Mike Rapoport <rppt@kernel.org>,
	"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
	 Matthew Wilcox <willy@infradead.org>,
	David Hildenbrand <david@redhat.com>,
	 Suren Baghdasaryan <surenb@google.com>,
	Qi Zheng <zhengqi.arch@bytedance.com>,
	 Yang Shi <shy828301@gmail.com>,
	Mel Gorman <mgorman@techsingularity.net>,
	 Peter Xu <peterx@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Will Deacon <will@kernel.org>,  Yu Zhao <yuzhao@google.com>,
	Alistair Popple <apopple@nvidia.com>,
	 Ralph Campbell <rcampbell@nvidia.com>,
	Ira Weiny <ira.weiny@intel.com>,
	 Steven Price <steven.price@arm.com>,
	SeongJae Park <sj@kernel.org>,
	 Naoya Horiguchi <naoya.horiguchi@nec.com>,
	Christophe Leroy <christophe.leroy@csgroup.eu>,
	 Zack Rusin <zackr@vmware.com>, Jason Gunthorpe <jgg@ziepe.ca>,
	 Axel Rasmussen <axelrasmussen@google.com>,
	Anshuman Khandual <anshuman.khandual@arm.com>,
	 Pasha Tatashin <pasha.tatashin@soleen.com>,
	Miaohe Lin <linmiaohe@huawei.com>,
	 Minchan Kim <minchan@kernel.org>,
	Christoph Hellwig <hch@infradead.org>, Song Liu <song@kernel.org>,
	 Thomas Hellstrom <thomas.hellstrom@linux.intel.com>,
	Russell King <linux@armlinux.org.uk>,
	 "David S. Miller" <davem@davemloft.net>,
	Michael Ellerman <mpe@ellerman.id.au>,
	 "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>,
	Heiko Carstens <hca@linux.ibm.com>,
	 Christian Borntraeger <borntraeger@linux.ibm.com>,
	Claudio Imbrenda <imbrenda@linux.ibm.com>,
	 Alexander Gordeev <agordeev@linux.ibm.com>,
	linux-arm-kernel@lists.infradead.org,
	 sparclinux@vger.kernel.org, linuxppc-dev@lists.ozlabs.org,
	 linux-s390@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org
Subject: Re: [PATCH 09/12] mm/khugepaged: retract_page_tables() without mmap or vma lock
Date: Wed, 31 May 2023 17:34:58 +0200
Message-ID: <CAG48ez0aF1Rf1apSjn9YcnfyFQ4YqSd4GqB6f2wfhF7jMdi5Hg@mail.gmail.com>
In-Reply-To: <2e9996fa-d238-e7c-1194-834a2bd1f60@google.com>

On Mon, May 29, 2023 at 8:25 AM Hugh Dickins <hughd@google.com> wrote:
> -static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff,
> -                              struct mm_struct *target_mm,
> -                              unsigned long target_addr, struct page *hpage,
> -                              struct collapse_control *cc)
> +static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
>  {
>         struct vm_area_struct *vma;
> -       int target_result = SCAN_FAIL;
>
> -       i_mmap_lock_write(mapping);
> +       i_mmap_lock_read(mapping);
>         vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
> -               int result = SCAN_FAIL;
> -               struct mm_struct *mm = NULL;
> -               unsigned long addr = 0;
> -               pmd_t *pmd;
> -               bool is_target = false;
> +               struct mm_struct *mm;
> +               unsigned long addr;
> +               pmd_t *pmd, pgt_pmd;
> +               spinlock_t *pml;
> +               spinlock_t *ptl;
>
>                 /*
>                  * Check vma->anon_vma to exclude MAP_PRIVATE mappings that
> -                * got written to. These VMAs are likely not worth investing
> -                * mmap_write_lock(mm) as PMD-mapping is likely to be split
> -                * later.
> +                * got written to. These VMAs are likely not worth removing
> +                * page tables from, as PMD-mapping is likely to be split later.
>                  *
> -                * Note that vma->anon_vma check is racy: it can be set up after
> -                * the check but before we took mmap_lock by the fault path.
> -                * But page lock would prevent establishing any new ptes of the
> -                * page, so we are safe.
> -                *
> -                * An alternative would be drop the check, but check that page
> -                * table is clear before calling pmdp_collapse_flush() under
> -                * ptl. It has higher chance to recover THP for the VMA, but
> -                * has higher cost too. It would also probably require locking
> -                * the anon_vma.
> +                * Note that vma->anon_vma check is racy: it can be set after
> +                * the check, but page locks (with XA_RETRY_ENTRYs in holes)
> +                * prevented establishing new ptes of the page. So we are safe
> +                * to remove page table below, without even checking it's empty.

This "we are safe to remove page table below, without even checking
it's empty" assumes that the only way to create new anonymous PTEs is
to use existing file PTEs, right? What about private shmem VMAs that
are registered with userfaultfd as VM_UFFD_MISSING? I think for those,
the UFFDIO_COPY ioctl lets you directly insert anonymous PTEs without
looking at the mapping and its pages (except for checking that the
insertion point is before end-of-file), protected only by mmap_lock
(shared) and pte_offset_map_lock().
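
To make that concrete, here is roughly the userspace sequence I have in
mind (untested sketch, error checking omitted; the 2MB size and the 0xaa
fill are just for illustration):

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#define LEN (2UL * 1024 * 1024)

int main(void)
{
	long page = sysconf(_SC_PAGESIZE);

	/* MAP_PRIVATE mapping of a shmem file */
	int memfd = memfd_create("shmem", 0);
	ftruncate(memfd, LEN);
	char *map = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE, memfd, 0);

	/* register the range with userfaultfd in MISSING mode */
	int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
	struct uffdio_api api = { .api = UFFD_API };
	ioctl(uffd, UFFDIO_API, &api);
	struct uffdio_register reg = {
		.range = { .start = (unsigned long)map, .len = LEN },
		.mode = UFFDIO_REGISTER_MODE_MISSING,
	};
	ioctl(uffd, UFFDIO_REGISTER, &reg);

	/*
	 * UFFDIO_COPY resolves the "missing" page from a source buffer;
	 * for a private shmem VMA that should install an anonymous PTE
	 * without consulting the shmem page cache, under mmap_lock
	 * (shared) and the PTE lock only.
	 */
	char *src = mmap(NULL, page, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	memset(src, 0xaa, page);
	struct uffdio_copy copy = {
		.dst = (unsigned long)map,
		.src = (unsigned long)src,
		.len = page,
	};
	ioctl(uffd, UFFDIO_COPY, &copy);
	return 0;
}

Whether that lines up with the collapse in practice would need a
reproducer with proper timing control; the point is just that this path
can populate the page table without ever taking the hugepage's page lock.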


>                  */
> -               if (READ_ONCE(vma->anon_vma)) {
> -                       result = SCAN_PAGE_ANON;
> -                       goto next;
> -               }
> +               if (READ_ONCE(vma->anon_vma))
> +                       continue;
> +
>                 addr = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
>                 if (addr & ~HPAGE_PMD_MASK ||
> -                   vma->vm_end < addr + HPAGE_PMD_SIZE) {
> -                       result = SCAN_VMA_CHECK;
> -                       goto next;
> -               }
> -               mm = vma->vm_mm;
> -               is_target = mm == target_mm && addr == target_addr;
> -               result = find_pmd_or_thp_or_none(mm, addr, &pmd);
> -               if (result != SCAN_SUCCEED)
> -                       goto next;
> -               /*
> -                * We need exclusive mmap_lock to retract page table.
> -                *
> -                * We use trylock due to lock inversion: we need to acquire
> -                * mmap_lock while holding page lock. Fault path does it in
> -                * reverse order. Trylock is a way to avoid deadlock.
> -                *
> -                * Also, it's not MADV_COLLAPSE's job to collapse other
> -                * mappings - let khugepaged take care of them later.
> -                */
> -               result = SCAN_PTE_MAPPED_HUGEPAGE;
> -               if ((cc->is_khugepaged || is_target) &&
> -                   mmap_write_trylock(mm)) {
> -                       /* trylock for the same lock inversion as above */
> -                       if (!vma_try_start_write(vma))
> -                               goto unlock_next;
> -
> -                       /*
> -                        * Re-check whether we have an ->anon_vma, because
> -                        * collapse_and_free_pmd() requires that either no
> -                        * ->anon_vma exists or the anon_vma is locked.
> -                        * We already checked ->anon_vma above, but that check
> -                        * is racy because ->anon_vma can be populated under the
> -                        * mmap lock in read mode.
> -                        */
> -                       if (vma->anon_vma) {
> -                               result = SCAN_PAGE_ANON;
> -                               goto unlock_next;
> -                       }
> -                       /*
> -                        * When a vma is registered with uffd-wp, we can't
> -                        * recycle the pmd pgtable because there can be pte
> -                        * markers installed.  Skip it only, so the rest mm/vma
> -                        * can still have the same file mapped hugely, however
> -                        * it'll always mapped in small page size for uffd-wp
> -                        * registered ranges.
> -                        */
> -                       if (hpage_collapse_test_exit(mm)) {
> -                               result = SCAN_ANY_PROCESS;
> -                               goto unlock_next;
> -                       }
> -                       if (userfaultfd_wp(vma)) {
> -                               result = SCAN_PTE_UFFD_WP;
> -                               goto unlock_next;
> -                       }
> -                       collapse_and_free_pmd(mm, vma, addr, pmd);

The old code called collapse_and_free_pmd(), which involved an MMU
notifier invalidation (invalidate_range_start()/end()) around the PMD
clear...

> -                       if (!cc->is_khugepaged && is_target)
> -                               result = set_huge_pmd(vma, addr, pmd, hpage);
> -                       else
> -                               result = SCAN_SUCCEED;
> -
> -unlock_next:
> -                       mmap_write_unlock(mm);
> -                       goto next;
> -               }
> -               /*
> -                * Calling context will handle target mm/addr. Otherwise, let
> -                * khugepaged try again later.
> -                */
> -               if (!is_target) {
> -                       khugepaged_add_pte_mapped_thp(mm, addr);
> +                   vma->vm_end < addr + HPAGE_PMD_SIZE)
>                         continue;
> -               }
> -next:
> -               if (is_target)
> -                       target_result = result;
> +
> +               mm = vma->vm_mm;
> +               if (find_pmd_or_thp_or_none(mm, addr, &pmd) != SCAN_SUCCEED)
> +                       continue;
> +
> +               if (hpage_collapse_test_exit(mm))
> +                       continue;
> +               /*
> +                * When a vma is registered with uffd-wp, we cannot recycle
> +                * the page table because there may be pte markers installed.
> +                * Other vmas can still have the same file mapped hugely, but
> +                * skip this one: it will always be mapped in small page size
> +                * for uffd-wp registered ranges.
> +                *
> +                * What if VM_UFFD_WP is set a moment after this check?  No
> +                * problem, huge page lock is still held, stopping new mappings
> +                * of page which might then get replaced by pte markers: only
> +                * existing markers need to be protected here.  (We could check
> +                * after getting ptl below, but this comment distracting there!)
> +                */
> +               if (userfaultfd_wp(vma))
> +                       continue;
> +
> +               /* Huge page lock is still held, so page table must be empty */
> +               pml = pmd_lock(mm, pmd);
> +               ptl = pte_lockptr(mm, pmd);
> +               if (ptl != pml)
> +                       spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
> +               pgt_pmd = pmdp_collapse_flush(vma, addr, pmd);

... while the new code only does pmdp_collapse_flush(), which clears
the pmd entry and does a TLB flush, but AFAICS doesn't invoke MMU
notifiers at all. My understanding is that that's problematic - maybe
(?) it is sort of okay with regard to classic MMU notifier users like
KVM, but it's probably wrong for IOMMUv2 users, where an IOMMU directly
consumes the normal page tables?

(FWIW, last I looked, there also seemed to be some other issues with
MMU notifier usage wrt IOMMUv2, see the thread
<https://lore.kernel.org/linux-mm/Yzbaf9HW1%2FreKqR8@nvidia.com/>.)
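
For reference, the bracket I'm thinking of is roughly what
collapse_and_free_pmd() did around the flush (typed from memory, using
the locals from your new retract_page_tables(), so details may be off):

	struct mmu_notifier_range range;

	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm,
				addr, addr + HPAGE_PMD_SIZE);
	mmu_notifier_invalidate_range_start(&range);
	pgt_pmd = pmdp_collapse_flush(vma, addr, pmd);
	tlb_remove_table_sync_one();
	mmu_notifier_invalidate_range_end(&range);

Without the invalidate_range_start()/end() pair, users that mirror these
page tables via mmu_notifier_ops (IOMMU SVA and the like) don't get told
that the range went away, as far as I can see.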


> +               if (ptl != pml)
> +                       spin_unlock(ptl);
> +               spin_unlock(pml);
> +
> +               mm_dec_nr_ptes(mm);
> +               page_table_check_pte_clear_range(mm, addr, pgt_pmd);
> +               pte_free_defer(mm, pmd_pgtable(pgt_pmd));
>         }
> -       i_mmap_unlock_write(mapping);
> -       return target_result;
> +       i_mmap_unlock_read(mapping);
>  }
>
>  /**
> @@ -2261,9 +2210,11 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
>
>         /*
>          * Remove pte page tables, so we can re-fault the page as huge.
> +        * If MADV_COLLAPSE, adjust result to call collapse_pte_mapped_thp().
>          */
> -       result = retract_page_tables(mapping, start, mm, addr, hpage,
> -                                    cc);
> +       retract_page_tables(mapping, start);
> +       if (cc && !cc->is_khugepaged)
> +               result = SCAN_PTE_MAPPED_HUGEPAGE;
>         unlock_page(hpage);
>
>         /*
> --
> 2.35.3
>

