From: Mike Kravetz <mike.kravetz@oracle.com>
To: linux-mm@kvack.org, linux-kernel@vger.kernel.org
Cc: Michal Hocko <mhocko@suse.com>, Peter Xu <peterx@redhat.com>,
Naoya Horiguchi <naoya.horiguchi@linux.dev>,
David Hildenbrand <david@redhat.com>,
"Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>,
Andrea Arcangeli <aarcange@redhat.com>,
"Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>,
Davidlohr Bueso <dave@stgolabs.net>,
Prakash Sangappa <prakash.sangappa@oracle.com>,
James Houghton <jthoughton@google.com>,
Mina Almasry <almasrymina@google.com>,
Pasha Tatashin <pasha.tatashin@soleen.com>,
Axel Rasmussen <axelrasmussen@google.com>,
Ray Fucillo <Ray.Fucillo@intersystems.com>,
Andrew Morton <akpm@linux-foundation.org>,
Mike Kravetz <mike.kravetz@oracle.com>
Subject: [RFC PATCH v3 0/8] hugetlb: Change huge pmd sharing synchronization again
Date: Sun, 8 May 2022 11:34:12 -0700
Message-ID: <20220508183420.18488-1-mike.kravetz@oracle.com>
I am sending this as an RFC again in the hope of getting some agreement
on the way to move forward.  This series uses a new hugetlb-specific
per-vma rw semaphore to synchronize pmd sharing.  Code is based on
linux-next next-20220506.
hugetlb fault scalability regressions have recently been reported [1].
This is not the first such report, as regressions were also noted when
commit c0d0381ade79 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing
synchronization") was added [2] in v5.7. At that time, a proposal to
address the regression was suggested [3] but went nowhere.
To illustrate the regression, I created a simple program that does the
following in an infinite loop (a sketch is included below):
- mmap a 250GB hugetlb file (size ensures pmd sharing)
- fault in all pages
- unmap the hugetlb file
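A rough sketch of the program follows.  The hugetlbfs path and the 2MB
huge page size are assumptions for illustration; all instances map the
same file with MAP_SHARED so that suitably aligned ranges are eligible
for pmd sharing.

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#define MAP_SIZE	(250UL << 30)	/* 250GB, large enough for pmd sharing */
#define HPAGE_SIZE	(2UL << 20)	/* assumed 2MB huge page size */

int main(void)
{
	const char *path = "/dev/hugepages/bench";	/* assumed hugetlbfs mount */

	for (;;) {
		int fd = open(path, O_CREAT | O_RDWR, 0644);

		if (fd < 0) {
			perror("open");
			exit(1);
		}
		/* size the file so faults within the mapping succeed */
		if (ftruncate(fd, MAP_SIZE)) {
			perror("ftruncate");
			exit(1);
		}

		char *addr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
				  MAP_SHARED, fd, 0);
		if (addr == MAP_FAILED) {
			perror("mmap");
			exit(1);
		}

		/* fault in all pages: touch one byte per huge page */
		for (unsigned long off = 0; off < MAP_SIZE; off += HPAGE_SIZE)
			addr[off] = 1;

		/* unmap; pages remain in the file for the next iteration */
		munmap(addr, MAP_SIZE);
		close(fd);
	}
}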
The hugetlb fault code was then instrumented to collect the number of
times the mutex was locked and the wait time.  Samples are from 10
second intervals on a 48 CPU VM with 320GB of memory, with 48 instances
of the map/fault/unmap program running.
next-20220506
-------------
faults per sec (10 sec intvl):         1,183,577
num faults (10 sec intvl):            11,835,773
num waits (10 sec intvl):                 28,129
wait time (10 sec intvl):          233,225 msecs
max fault wait time (2 hour run):  2,010,000 usecs
avg faults per sec (entire run):       1,161,818

next-20220506 + this series
---------------------------
faults per sec (10 sec intvl):         1,409,520
num faults (10 sec intvl):            14,095,201
num waits (10 sec intvl):                 14,078
wait time (10 sec intvl):           18,144 msecs
max fault wait time (2 hour run):    115,000 usecs
avg faults per sec (entire run):       1,476,668
Patches 1 and 2 of this series revert c0d0381ade79 and 87bf91d39bb5
(which depends on c0d0381ade79).  Acquisition of i_mmap_rwsem is still
required in the fault path to establish pmd sharing, so this is moved
back to huge_pmd_share.  With c0d0381ade79 reverted, the following race
is exposed:
Faulting thread                                 Unsharing thread
...                                                  ...
ptep = huge_pte_offset()
      or
ptep = huge_pte_alloc()
...
                                                i_mmap_lock_write
                                                lock page table
ptep invalid   <------------------------        huge_pmd_unshare()
Could be in a previously                        unlock_page_table
sharing process or worse                        i_mmap_unlock_write
...
ptl = huge_pte_lock(ptep)
get/update pte
set_pte_at(pte, ptep)
Reverting 87bf91d39bb5 exposes races between page faults and file
truncation.  Patches 3 and 4 of this series address those races.  This
requires using the hugetlb fault mutexes for more coordination between
the fault code and file page removal; a rough sketch of the pattern
follows.
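hugetlb_fault_mutex_hash() and hugetlb_fault_mutex_table are existing
kernel symbols; their use below during page removal only illustrates
the approach and is not the exact code in patches 3 and 4:

	u32 hash;

	/* serialize with any fault on the same file index */
	hash = hugetlb_fault_mutex_hash(mapping, index);
	mutex_lock(&hugetlb_fault_mutex_table[hash]);
	/* ... remove the page from the file and free it ... */
	mutex_unlock(&hugetlb_fault_mutex_table[hash]);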
Patches 5 - 7 add infrastructure for a new vma based rw semaphore that
will be used for pmd sharing synchronization.  The idea is that this
semaphore will be held in read mode for the duration of fault processing,
and held in write mode for unmap operations which may call
huge_pmd_unshare.  Acquiring i_mmap_rwsem is also still required to
synchronize huge pmd sharing.  However, it is now only required in the
fault path when setting up sharing, and will be acquired in
huge_pmd_share().  A sketch of the intended lock follows.
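To make the intent concrete, here is a rough sketch of such a per-vma
lock built on a standard rw_semaphore.  The structure and helper names
are illustrative assumptions, not necessarily what the series uses:

struct hugetlb_vma_lock {		/* hung off the hugetlb vma */
	struct rw_semaphore rw_sema;
};

/* fault path: held in read mode for the duration of fault processing */
static void hugetlb_vma_lock_read(struct hugetlb_vma_lock *vl)
{
	down_read(&vl->rw_sema);
}

static void hugetlb_vma_unlock_read(struct hugetlb_vma_lock *vl)
{
	up_read(&vl->rw_sema);
}

/* unmap path: write mode excludes faults while huge_pmd_unshare runs */
static void hugetlb_vma_lock_write(struct hugetlb_vma_lock *vl)
{
	down_write(&vl->rw_sema);
}

static void hugetlb_vma_unlock_write(struct hugetlb_vma_lock *vl)
{
	up_write(&vl->rw_sema);
}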
Patch 8 makes use of this new vma lock.  Unfortunately, the fault code
and truncate/hole punch code would naturally take locks in the opposite
order, which could lead to deadlock.  Since the performance of page
faults is more important, the truncation/hole punch code is modified to
back out and take locks in the correct order if necessary, as sketched
below.
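A rough sketch of the back-out, again with illustrative names.  This
assumes the fault path takes the vma lock before the fault mutex, so
the truncation path, which already holds the fault mutex, must not
block on the vma lock:

	/* truncate/hole punch path: fault mutex already held */
	if (!down_write_trylock(&vma_lock->rw_sema)) {
		/* back out and reacquire in the fault path's order */
		mutex_unlock(&hugetlb_fault_mutex_table[hash]);
		down_write(&vma_lock->rw_sema);
		mutex_lock(&hugetlb_fault_mutex_table[hash]);
	}
	/* safe to unmap; may call huge_pmd_unshare() */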
[1] https://lore.kernel.org/linux-mm/43faf292-245b-5db5-cce9-369d8fb6bd21@infradead.org/
[2] https://lore.kernel.org/lkml/20200622005551.GK5535@shao2-debian/
[3] https://lore.kernel.org/linux-mm/20200706202615.32111-1-mike.kravetz@oracle.com/
Mike Kravetz (8):
hugetlbfs: revert use i_mmap_rwsem to address page fault/truncate race
hugetlbfs: revert use i_mmap_rwsem for more pmd sharing
synchronization
hugetlbfs: move routine remove_huge_page to hugetlb.c
hugetlbfs: catch and handle truncate racing with page faults
hugetlb: rename vma_shareable() and refactor code
hugetlb: add vma based lock for pmd sharing synchronization
hugetlb: create hugetlb_unmap_file_page to unmap single file page
hugetlb: use new vma_lock for pmd sharing synchronization
fs/hugetlbfs/inode.c | 297 +++++++++++++++++++++++++++-----------
include/linux/hugetlb.h | 37 ++++-
kernel/fork.c | 6 +-
mm/hugetlb.c | 310 ++++++++++++++++++++++++++++++----------
mm/memory.c | 2 +
mm/rmap.c | 38 ++++-
mm/userfaultfd.c | 14 +-
7 files changed, 525 insertions(+), 179 deletions(-)
--
2.35.3