From: Peter Xu <peterx@redhat.com>
To: linux-kernel@vger.kernel.org, linux-mm@kvack.org
Cc: Andrew Morton <akpm@linux-foundation.org>,
	James Houghton <jthoughton@google.com>,
	Miaohe Lin <linmiaohe@huawei.com>,
	David Hildenbrand <david@redhat.com>,
	Muchun Song <songmuchun@bytedance.com>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Nadav Amit <nadav.amit@gmail.com>,
	Mike Kravetz <mike.kravetz@oracle.com>,
	peterx@redhat.com, Rik van Riel <riel@surriel.com>
Subject: [PATCH RFC 00/10] mm/hugetlb: Make huge_pte_offset() thread-safe for pmd unshare
Date: Sun, 30 Oct 2022 17:29:19 -0400
Message-ID: <20221030212929.335473-1-peterx@redhat.com>

This can be seen as a follow-up to Mike's recent hugetlb vma lock series
for pmd unsharing, so this series also depends on that one.  However,
there are still some huge_pte_offset() paths that look racy against pmd
unsharing because they don't take the vma lock; more on that below.

Hopefully this series makes the resolution for pmd unsharing more
complete.

Problem
=======

huge_pte_offset() is a major helper used by hugetlb code paths to walk a
hugetlb pgtable.  It's used almost everywhere, since the walk is needed
even before taking the pgtable lock.
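
For reference, the helper's prototype (as declared in
include/linux/hugetlb.h at the time of writing) is:

	pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr,
			       unsigned long sz);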

huge_pte_offset() is always called with the mmap lock held, either for
read or for write.

For normal memory types that is sufficient, since any pgtable removal
requires the mmap write lock (e.g. munmap or mm destruction).  However,
hugetlb has the pmd unshare feature, which means the pgtable page we're
walking can be freed from under us even while we hold the mmap lock (after
an unshare, the freed page can only be the huge PUD page that contains 512
huge pmd entries, for a VM_SHARED vma).  This is possible because even
though freeing the pgtable page requires the mmap write lock, that lock
belongs to the mm that owns the pgtable; it doesn't help when we're
walking another mm's pgtable, so we're still at risk even while holding
current->mm's mmap lock.
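
To illustrate the window, a simplified, hypothetical sketch ("mm", "addr"
and "h" stand for the mm being walked, the address and its hstate):

	pte_t *ptep;
	pte_t entry;

	ptep = huge_pte_offset(mm, addr, huge_page_size(h));
	/*
	 * Without further protection, a concurrent pmd unshare (e.g. via
	 * huge_pmd_unshare() on a path that does not need our mmap lock)
	 * can drop the last reference to the shared pmd page here and
	 * free it...
	 */
	entry = huge_ptep_get(ptep);	/* ...making this a use-after-free */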

Mike's recent work on the vma lock already resolves most of this: pmd
unsharing is forbidden while the lock is held, so there is no further risk
of the pgtable page being freed.
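
For reference, a minimal sketch of that pattern using the helpers from the
vma lock series (not the exact code of any particular caller; error
handling omitted):

	hugetlb_vma_lock_read(vma);
	ptep = huge_pte_offset(vma->vm_mm, addr,
			       huge_page_size(hstate_vma(vma)));
	if (ptep) {
		/* pmd unshare is blocked here, so ptep stays valid */
		...
	}
	hugetlb_vma_unlock_read(vma);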

However, that only works if we take the vma lock in all the places around
huge_pte_offset().  As of the latest mm-unstable a bunch of callers
already do, but a lot of them still don't, for various reasons.  E.g., it
may not be applicable in contexts that are not allowed to sleep, like
FOLL_NOWAIT.

I have no report showing that such a race can actually be triggered, but
from the code I don't see anything that stops it from happening either.
This series tries to resolve that problem.

Resolution
==========

What this series proposes is that, besides using the vma lock, we can also
use RCU to protect the pgtable page from being freed from under us while
huge_pte_offset() is used.  The idea is similar to RCU fast-gup.  Note
that fast-gup has been safe against pmd unsharing even before the vma lock
existed, because fast-gup relies on RCU to protect walking any pgtable
page, including another mm's.

Applying the same idea to huge_pte_offset() means that, with proper RCU
protection, the pte_t* pointer it returns is always safe to access and
dereference, along with the pgtable lock that is bound to the pgtable
page.
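
A rough sketch of the intended calling convention (this is only the
assumed shape, not the exact code of any patch in the series; error
handling omitted):

	struct hstate *h = hstate_vma(vma);
	spinlock_t *ptl;
	pte_t *ptep;
	pte_t entry;

	rcu_read_lock();
	ptep = huge_pte_offset(mm, address, huge_page_size(h));
	if (ptep) {
		ptl = huge_pte_lock(h, mm, ptep);
		entry = huge_ptep_get(ptep);
		/* ... act on entry while both RCU and ptl are held ... */
		spin_unlock(ptl);
	}
	rcu_read_unlock();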

Patch Layout
============

Patch 1 is a trivial cleanup that I noticed while working on this.  Please
shout if anyone thinks I should post it separately; otherwise I hope I can
just carry it within this series.

Patch 2 is the core of the patchset, describing how the helper
huge_pte_offset() should be used correctly.  It is only a comment patch,
but it should be the most important one, as the follow-up patches simply
follow the rule it sets up.

The rest of the patches cover all the call sites of huge_pte_offset() so
that each of them is safe in one of two ways: either it already runs with
the vma lock held (which is perfectly good enough for safety here; the
last patch adds comments to all those callers, so we won't miss a single
case and it's clear why each is safe), or one of the intermediate patches
adds RCU protection to that caller.

Tests
=====

Only lightly tested with the hugetlb kselftests, including uffd.  No more
errors are triggered than with current mm-unstable (hugetlb-madvise fails
both before and after this series with the error "Unexpected number of
free huge pages line 207"; I haven't had time to look into it yet).

Since this has so far only been discussed briefly with Mike in the other
thread, I'm marking it as RFC for now, as I could have missed something.

Comments welcome, thanks.

Peter Xu (10):
  mm/hugetlb: Let vma_offset_start() to return start
  mm/hugetlb: Comment huge_pte_offset() for its locking requirements
  mm/hugetlb: Make hugetlb_vma_maps_page() RCU-safe
  mm/hugetlb: Make userfaultfd_huge_must_wait() RCU-safe
  mm/hugetlb: Make walk_hugetlb_range() RCU-safe
  mm/hugetlb: Make page_vma_mapped_walk() RCU-safe
  mm/hugetlb: Make hugetlb_follow_page_mask() RCU-safe
  mm/hugetlb: Make follow_hugetlb_page RCU-safe
  mm/hugetlb: Make hugetlb_fault() RCU-safe
  mm/hugetlb: Comment at rest huge_pte_offset() places

 arch/arm64/mm/hugetlbpage.c | 32 ++++++++++++++++++++++++++
 fs/hugetlbfs/inode.c        | 39 ++++++++++++++++++--------------
 fs/userfaultfd.c            |  4 ++++
 include/linux/rmap.h        |  3 +++
 mm/hugetlb.c                | 45 +++++++++++++++++++++++++++++++++++--
 mm/page_vma_mapped.c        |  7 +++++-
 mm/pagewalk.c               |  5 +++++
 7 files changed, 115 insertions(+), 20 deletions(-)

-- 
2.37.3



Thread overview: 30+ messages
2022-10-30 21:29 Peter Xu [this message]
2022-10-30 21:29 ` [PATCH RFC 01/10] mm/hugetlb: Let vma_offset_start() to return start Peter Xu
2022-11-03 15:25   ` Mike Kravetz
2022-10-30 21:29 ` [PATCH RFC 02/10] mm/hugetlb: Comment huge_pte_offset() for its locking requirements Peter Xu
2022-11-01  5:46   ` Nadav Amit
2022-11-02 20:51     ` Peter Xu
2022-11-03 15:42   ` Mike Kravetz
2022-11-03 18:11     ` Peter Xu
2022-11-03 18:38       ` Mike Kravetz
2022-10-30 21:29 ` [PATCH RFC 03/10] mm/hugetlb: Make hugetlb_vma_maps_page() RCU-safe Peter Xu
2022-10-30 21:29 ` [PATCH RFC 04/10] mm/hugetlb: Make userfaultfd_huge_must_wait() RCU-safe Peter Xu
2022-11-02 18:06   ` James Houghton
2022-11-02 21:17     ` Peter Xu
2022-10-30 21:29 ` [PATCH RFC 05/10] mm/hugetlb: Make walk_hugetlb_range() RCU-safe Peter Xu
2022-11-06  8:14   ` kernel test robot
2022-11-06 16:41     ` Peter Xu
2022-10-30 21:29 ` [PATCH RFC 06/10] mm/hugetlb: Make page_vma_mapped_walk() RCU-safe Peter Xu
2022-10-30 21:29 ` [PATCH RFC 07/10] mm/hugetlb: Make hugetlb_follow_page_mask() RCU-safe Peter Xu
2022-11-02 18:24   ` James Houghton
2022-11-03 15:50     ` Peter Xu
2022-10-30 21:30 ` [PATCH RFC 08/10] mm/hugetlb: Make follow_hugetlb_page RCU-safe Peter Xu
2022-10-30 21:30 ` [PATCH RFC 09/10] mm/hugetlb: Make hugetlb_fault() RCU-safe Peter Xu
2022-11-02 18:04   ` James Houghton
2022-11-03 15:39     ` Peter Xu
2022-10-30 21:30 ` [PATCH RFC 10/10] mm/hugetlb: Comment at rest huge_pte_offset() places Peter Xu
2022-11-01  5:39   ` Nadav Amit
2022-11-02 21:21     ` Peter Xu
2022-11-04  0:21 ` [PATCH RFC 00/10] mm/hugetlb: Make huge_pte_offset() thread-safe for pmd unshare Mike Kravetz
2022-11-04 15:02   ` Peter Xu
2022-11-04 15:44     ` Mike Kravetz
