Re: [RFC PATCH v4 4/8] hugetlbfs: catch and handle truncate racing with page faults

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Mike Kravetz <mike.kravetz@oracle.com>
To: David Hildenbrand <david@redhat.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	Muchun Song <songmuchun@bytedance.com>,
	Michal Hocko <mhocko@suse.com>, Peter Xu <peterx@redhat.com>,
	Naoya Horiguchi <naoya.horiguchi@linux.dev>,
	"Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>,
	Andrea Arcangeli <aarcange@redhat.com>,
	"Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>,
	Davidlohr Bueso <dave@stgolabs.net>,
	Prakash Sangappa <prakash.sangappa@oracle.com>,
	James Houghton <jthoughton@google.com>,
	Mina Almasry <almasrymina@google.com>,
	Pasha Tatashin <pasha.tatashin@soleen.com>,
	Axel Rasmussen <axelrasmussen@google.com>,
	Ray Fucillo <Ray.Fucillo@intersystems.com>,
	Andrew Morton <akpm@linux-foundation.org>
Subject: Re: [RFC PATCH v4 4/8] hugetlbfs: catch and handle truncate racing with page faults
Date: Fri, 5 Aug 2022 15:41:32 -0700	[thread overview]
Message-ID: <Yu2cnC+A+zPvlA1G@monkey> (raw)
In-Reply-To: <e0d9b4c4-1195-730a-5838-08c10905adaf@redhat.com>

On 08/05/22 18:28, David Hildenbrand wrote:
> On 28.07.22 18:45, Mike Kravetz wrote:
> > On 07/28/22 10:02, Miaohe Lin wrote:
> >> On 2022/7/28 3:00, Mike Kravetz wrote:
> >>> On 07/27/22 17:20, Miaohe Lin wrote:
> >>>> On 2022/7/7 4:23, Mike Kravetz wrote:
> >>>>> Most hugetlb fault handling code checks for faults beyond i_size.
> >>>>> While there are early checks in the code paths, the most difficult
> >>>>> to handle are those discovered after taking the page table lock.
> >>>>> At this point, we have possibly allocated a page and consumed
> >>>>> associated reservations and possibly added the page to the page cache.
> >>>>>
> >>>>> When discovering a fault beyond i_size, be sure to:
> >>>>> - Remove the page from page cache, else it will sit there until the
> >>>>>   file is removed.
> >>>>> - Do not restore any reservation for the page consumed.  Otherwise
> >>>>>   there will be an outstanding reservation for an offset beyond the
> >>>>>   end of file.
> >>>>>
> >>>>> The 'truncation' code in remove_inode_hugepages must deal with fault
> >>>>> code potentially removing a page/folio from the cache after the page was
> >>>>> returned by filemap_get_folios and before locking the page.  This can be
> >>>>> discovered by a change in folio_mapping() after taking folio lock.  In
> >>>>> addition, this code must deal with fault code potentially consuming
> >>>>> and returning reservations.  To synchronize this, remove_inode_hugepages
> >>>>> will now take the fault mutex for ALL indices in the hole or truncated
> >>>>> range.  In this way, it KNOWS fault code has finished with the page/index
> >>>>> OR fault code will see the updated file size.
> >>>>>
> >>>>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> >>>>> ---
> >>>>
> >>>> <snip>
> >>>>
> >>>>> @@ -5606,8 +5610,10 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
> >>>>>  
> >>>>>  	ptl = huge_pte_lock(h, mm, ptep);
> >>>>>  	size = i_size_read(mapping->host) >> huge_page_shift(h);
> >>>>> -	if (idx >= size)
> >>>>> +	if (idx >= size) {
> >>>>> +		beyond_i_size = true;
> >>>>
> >>>> Thanks for your patch. There is one question:
> >>>>
> >>>> Since races between hugetlb pagefault and truncate is guarded by hugetlb_fault_mutex,
> >>>> do we really need to check it again after taking the page table lock?
> >>>>
> >>>
> >>> Well, the fault mutex can only guard a single hugetlb page.  The fault mutex
> >>> is actually an array/table of mutexes hashed by mapping address and file index.
> >>> So, during truncation we take take the mutex for each page as they are
> >>> unmapped and removed.  So, the fault mutex only synchronizes operations
> >>> on one specific page.  The idea with this patch is to coordinate the fault
> >>> code and truncate code when operating on the same page.
> >>>
> >>> In addition, changing the file size happens early in the truncate process
> >>> before taking any locks/mutexes.
> >>
> >> I wonder whether we can somewhat live with it to make code simpler. When changing the file size happens
> >> after checking i_size but before taking the page table lock in hugetlb_fault, the truncate code would
> >> remove the hugetlb page from the page cache for us after hugetlb_fault finishes if we don't roll back
> >> when checking i_size again under the page table lock?
> >>
> >> In a word, if hugetlb_fault see a truncated inode, back out early. If not, let truncate code does its
> >> work. So we don't need to complicate the already complicated error path. Or am I miss something?
> >>
> > 
> > Thank you! I believe your observations and suggestions are correct.
> > 
> > We can just let the fault code proceed after the early "idx >= size",
> > and let the truncation code remove the page.  This also eliminates the
> > need for patch 3 (hugetlbfs: move routine remove_huge_page to hugetlb.c).
> 
> At least remaining the functions would be very welcome nonetheless :)

Agree.

> 
> > 
> > I will make these changes in the next version.
> 
> Just so I understand correctly, we want to let fault handling code back
> out early if we find any incompatible change, and simply retry the
> fault? I'm thinking about some kind of a high-level seqcount.
> 

Not exactly.

In the routine hugetlb_no_page, there are two (no three) places where we
check for races with truncation to see if the fault is beyond the end of
the file.  The first two are before adding a newly allocated page to the
page cache.  The third check is after taking the page table lock to
update the pte.

The idea is to eliminate this third check that requires backing out the
page from the cache.  So, it is 'possible' that the fault code could add
a page beyond i_size.  With the updates to the truncation code (actually
remove_inode_hugepages), we know that this page beyond i_size will be
removed by the racing truncation code.

Hope that makes sense.
-- 
Mike Kravetz

next prev parent reply	other threads:[~2022-08-05 22:41 UTC|newest]

Thread overview: 28+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-07-06 20:23 [RFC PATCH v4 0/8] hugetlb: Change huge pmd sharing synchronization again Mike Kravetz
2022-07-06 20:23 ` [RFC PATCH v4 1/8] hugetlbfs: revert use i_mmap_rwsem to address page fault/truncate race Mike Kravetz
2022-07-06 20:23 ` [RFC PATCH v4 2/8] hugetlbfs: revert use i_mmap_rwsem for more pmd sharing synchronization Mike Kravetz
2022-07-06 20:23 ` [RFC PATCH v4 3/8] hugetlbfs: move routine remove_huge_page to hugetlb.c Mike Kravetz
2022-08-05 16:11   ` David Hildenbrand
2022-07-06 20:23 ` [RFC PATCH v4 4/8] hugetlbfs: catch and handle truncate racing with page faults Mike Kravetz
2022-07-27  9:20   ` Miaohe Lin
2022-07-27 19:00     ` Mike Kravetz
2022-07-28  2:02       ` Miaohe Lin
2022-07-28 16:45         ` Mike Kravetz
2022-08-05 16:28           ` David Hildenbrand
2022-08-05 22:41             ` Mike Kravetz [this message]
2022-07-06 20:23 ` [RFC PATCH v4 5/8] hugetlb: rename vma_shareable() and refactor code Mike Kravetz
2022-07-06 20:23 ` [RFC PATCH v4 6/8] hugetlb: add vma based lock for pmd sharing synchronization Mike Kravetz
2022-07-29  2:55   ` Miaohe Lin
2022-07-29 18:00     ` Mike Kravetz
2022-07-30  2:12       ` Miaohe Lin
2022-07-06 20:23 ` [RFC PATCH v4 7/8] hugetlb: create hugetlb_unmap_file_folio to unmap single file folio Mike Kravetz
2022-07-29  2:02   ` Miaohe Lin
2022-07-29 18:11     ` Mike Kravetz
2022-07-30  2:15       ` Miaohe Lin
2022-07-06 20:23 ` [RFC PATCH v4 8/8] hugetlb: use new vma_lock for pmd sharing synchronization Mike Kravetz
2022-07-28  6:51   ` Miaohe Lin
2022-07-28 17:47     ` Mike Kravetz
2022-07-29  1:41       ` Miaohe Lin
2022-07-29 17:41         ` Mike Kravetz
2022-07-30  1:57           ` Miaohe Lin
2022-07-20 14:16 ` [RFC PATCH v4 0/8] hugetlb: Change huge pmd sharing synchronization again Ray Fucillo

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Yu2cnC+A+zPvlA1G@monkey \
    --to=mike.kravetz@oracle.com \
    --cc=Ray.Fucillo@intersystems.com \
    --cc=aarcange@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=almasrymina@google.com \
    --cc=aneesh.kumar@linux.vnet.ibm.com \
    --cc=axelrasmussen@google.com \
    --cc=dave@stgolabs.net \
    --cc=david@redhat.com \
    --cc=jthoughton@google.com \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=linmiaohe@huawei.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mhocko@suse.com \
    --cc=naoya.horiguchi@linux.dev \
    --cc=pasha.tatashin@soleen.com \
    --cc=peterx@redhat.com \
    --cc=prakash.sangappa@oracle.com \
    --cc=songmuchun@bytedance.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox