Date: Mon, 2 Jun 2025 22:47:04 +0200
From: Oscar Salvador <osalvador@suse.de>
To: Peter Xu
Cc: Andrew Morton, Muchun Song, David Hildenbrand, James Houghton,
    Gavin Guo, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH 1/3] mm, hugetlb: Clean up locking in hugetlb_fault and hugetlb_wp
References: <20250602141610.173698-1-osalvador@suse.de> <20250602141610.173698-2-osalvador@suse.de>
On Mon, Jun 02, 2025 at 11:14:26AM -0400, Peter Xu wrote:
> It makes me feel nervous when we move more things over to the fault mutex - I
> had a feeling it's abused.
>
> IIRC the fault mutex was initially introduced only to solve one problem on
> concurrent allocations.  I confess I didn't follow or yet dig into all the
> history, though.  From that POV, mfill_atomic_hugetlb() is indeed
> reasonable to still take it because it's user-guided fault injections.  I'm
> not yet sure about other places, e.g., for truncations.

Hi Peter,

Taking the mutex for truncation/punching a hole was added back in:

commit b5cec28d36f5ee6b4e6f68a0a40aa1e4045d6d99
Author: Mike Kravetz
Date:   Tue Sep 8 15:01:41 2015 -0700

    hugetlbfs: truncate_hugepages() takes a range of pages

when the fallocate functionality and the punch-hole mode were introduced. E.g.:

commit c672c7f29f2fdb73e1f72911bf499675c81fcdbb
Author: Mike Kravetz
Date:   Tue Sep 8 15:01:35 2015 -0700

    mm/hugetlb: expose hugetlb fault mutex for use by fallocate

    hugetlb page faults are currently synchronized by the table of mutexes
    (htlb_fault_mutex_table).  fallocate code will need to synchronize with
    the page fault code when it allocates or deletes pages.  Expose
    interfaces so that fallocate operations can be synchronized with page
    faults.  Minor name changes to be more consistent with other global
    hugetlb symbols.

Sadly, I cannot really see why it was added.  It was introduced from RFC v2 -> RFC v3:

RFC v3:
  Folded in patch for alloc_huge_page/hugetlb_reserve_pages race
    in current code
  fallocate allocation and hole punch is synchronized with page
    faults via existing mutex table
  hole punch uses existing hugetlb_vmtruncate_list instead of more
    generic unmap_mapping_range for unmapping
  Error handling for the case when region_del() fails

But RFC v2 shows no discussion of that matter whatsoever, so go figure [1].
I read that some tests were written, so I am not sure whether any of those
tests caught a race.
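Just to make it concrete what that synchronization buys us, here is a minimal
userspace sketch (plain C with pthreads, not kernel code) of the pattern the
commit describes: both the fault path and the fallocate/hole-punch path take
the same mutex out of a hashed table keyed by (mapping, index), so they cannot
interleave for the same index.  fault_mutex_table/fault_mutex_hash are only
illustrative stand-ins for hugetlb_fault_mutex_table/hugetlb_fault_mutex_hash.

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

#define NUM_FAULT_MUTEXES 8

static pthread_mutex_t fault_mutex_table[NUM_FAULT_MUTEXES];

/* Pick one mutex of the table for a given (mapping, index) pair. */
static unsigned int fault_mutex_hash(uintptr_t mapping, unsigned long idx)
{
	return (unsigned int)((mapping ^ idx) % NUM_FAULT_MUTEXES);
}

/* Fault path: the page is looked up/allocated under the per-index mutex. */
static void fault_in_page(uintptr_t mapping, unsigned long idx)
{
	unsigned int hash = fault_mutex_hash(mapping, idx);

	pthread_mutex_lock(&fault_mutex_table[hash]);
	printf("fault: map index %lu\n", idx);
	pthread_mutex_unlock(&fault_mutex_table[hash]);
}

/*
 * Hole-punch path: removal is done under the same mutex, so a concurrent
 * fault on the same index cannot map the page halfway through the removal.
 */
static void punch_hole(uintptr_t mapping, unsigned long idx)
{
	unsigned int hash = fault_mutex_hash(mapping, idx);

	pthread_mutex_lock(&fault_mutex_table[hash]);
	printf("punch: unmap and remove index %lu\n", idx);
	pthread_mutex_unlock(&fault_mutex_table[hash]);
}

int main(void)
{
	for (int i = 0; i < NUM_FAULT_MUTEXES; i++)
		pthread_mutex_init(&fault_mutex_table[i], NULL);

	fault_in_page(0x1000, 42);
	punch_hole(0x1000, 42);
	return 0;
}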
Looking at remove_inode_single_folio(), there is a case though:

        /*
         * If folio is mapped, it was faulted in after being
         * unmapped in caller.  Unmap (again) while holding
         * the fault mutex.  The mutex will prevent faults
         * until we finish removing the folio.
         */
        if (unlikely(folio_mapped(folio)))
                hugetlb_unmap_file_folio(h, mapping, folio, index);

So, while the folio_lock that is taken further down guards us against
removing the folio from the pagecache, I guess that we need to take the
mutex to guard us against parallel faults for the folio.

Looking back at earlier code, we did not handle that race.  It was reworked in

commit 4aae8d1c051ea00b456da6811bc36d1f69de5445
Author: Mike Kravetz
Date:   Fri Jan 15 16:57:40 2016 -0800

    mm/hugetlbfs: unmap pages if page fault raced with hole punch

and the comment looked like:

        /*
         * If page is mapped, it was faulted in after being
         * unmapped in caller.  Unmap (again) now after taking
         * the fault mutex.  The mutex will prevent faults
         * until we finish removing the page.
         *
         * This race can only happen in the hole punch case.
         * Getting here in a truncate operation is a bug.
         */

So, it was placed to close the race.  Now, I do not know whether we could
close that race in some other way, but it seems risky, and having the mutex
here looks like a good trade-off.

> Meanwhile, IIUC this patch can at least be split into more than one, in
> which case the 1st patch should only change the behavior of
> pagecache_folio, rather than the rest - if we really want to move more
> things over to fault mutex, we can do that in the 2nd+ patches.

Fair enough.  This RFC was mainly for kicking off a discussion and deciding
the direction we want to take.

> I'd suggest we stick with fixing pagecache_folio issue first, which can be
> much easier and looks like a lock ordering issue only (while we can still
> resolve it with removing on lock, but only on pagecache_folio not rest yet).

Yes, I think that should be enough to determine whether we already mapped
and CoWed the folio privately.

> > +
> >  /*
> > - * hugetlb_wp() should be called with page lock of the original hugepage held.
> >   * Called with hugetlb_fault_mutex_table held and pte_page locked so we
> >   * cannot race with other handlers or page migration.
> >   * Keep the pte_same checks anyway to make transition from the mutex easier.
> >   */
> > -static vm_fault_t hugetlb_wp(struct folio *pagecache_folio,
> > -			     struct vm_fault *vmf)
> > +static vm_fault_t hugetlb_wp(struct vm_fault *vmf, enum cow_context context)
> >  {
> >  	struct vm_area_struct *vma = vmf->vma;
> >  	struct mm_struct *mm = vma->vm_mm;
> >  	const bool unshare = vmf->flags & FAULT_FLAG_UNSHARE;
> >  	pte_t pte = huge_ptep_get(mm, vmf->address, vmf->pte);
> >  	struct hstate *h = hstate_vma(vma);
> > -	struct folio *old_folio;
> > -	struct folio *new_folio;
> >  	bool cow_from_owner = 0;
> >  	vm_fault_t ret = 0;
> >  	struct mmu_notifier_range range;
> > +	struct folio *old_folio, *new_folio, *pagecache_folio;
> > +	struct address_space *mapping = vma->vm_file->f_mapping;
> >
> >  	/*
> >  	 * Never handle CoW for uffd-wp protected pages. It should be only
> > @@ -6198,7 +6201,11 @@ static vm_fault_t hugetlb_wp(struct folio *pagecache_folio,
> >  	 * we run out of free hugetlb folios: we would have to kill processes
> >  	 * in scenarios that used to work. As a side effect, there can still
> >  	 * be leaks between processes, for example, with FOLL_GET users.
> > +	 *
> > +	 * We need to take the lock here because the folio might be undergoing a
> > +	 * migration. e.g: see folio_try_share_anon_rmap_pte.
> > 	 */
>
> I agree we'd better keep the old_folio locked as of now to be further
> discussed, but I'm not 100% sure about the comment - doesn't migration path
> still need the pgtable lock to modify mapcounts?  I think we hold pgtable
> lock here.

Uhm, yes, looking closer I think you are right.  In hugetlb_fault(), prior to
calling into hugetlb_wp(), we take the ptl spinlock, and the migration path
also takes it (via page_vma_mapped_walk), so we should not be seeing any
spurious data from hugetlb_remove_rmap, as we are serialized.

> > +	/*
> > +	 * We can be called from two different contexts: hugetlb_no_page or
> > +	 * hugetlb_fault.
> > +	 * When called from the latter, we try to find the original folio in
> > +	 * the pagecache and compare it against the current folio we have mapped
> > +	 * in the pagetables. If it differs, it means that this process already
> > +	 * CoWed and mapped the folio privately, so we know that a reservation
> > +	 * was already consumed.
> > +	 * This will be latter used to determine whether we need to unmap the folio
> > +	 * from other processes in case we fail to allocate a new folio.
> > +	 */
> > +	if (context == HUGETLB_FAULT_CONTEXT) {
> > +		pagecache_folio = filemap_get_hugetlb_folio(h, mapping, vmf->pgoff);
> > +		if (IS_ERR(pagecache_folio))
> > +			pagecache_folio = NULL;
> > +		else
> > +			folio_put(pagecache_folio);
>
> So here we released the refcount but then we're referencing the pointer
> below.. I don't know whether this is wise, looks like it's prone to
> races.. If we want, we can compare the folios before releasing the
> refcount. Said that,...

We are holding the mutex, so the folio cannot really go anywhere, but we
could do the folio_put after the check as well.

> > -	if (is_vma_resv_set(vma, HPAGE_RESV_OWNER) &&
> > -			old_folio != pagecache_folio)
> > +	if (context == HUGETLB_FAULT_CONTEXT &&
> > +	    is_vma_resv_set(vma, HPAGE_RESV_OWNER) &&
> > +	    old_folio != pagecache_folio)
>
> .. here I am actually thinking whether we can use folio_test_anon() and
> completely avoid looking up the page cache. Here, the ultimate goal is
> trying to know whether this is a CoW on top of a private page. Then IIUC
> folio_test_anon(old_folio) is enough.

I have yet to fully check, but this sounds reasonable to me.

> >  	new_folio = false;
> > -	folio = filemap_lock_hugetlb_folio(h, mapping, vmf->pgoff);
> > +	folio = filemap_get_hugetlb_folio(h, mapping, vmf->pgoff);
>
> So this is part of the changes that I mentioned previously, that we should
> avoid doing in one patch; actually I really think we should think twice on
> using fault mutex explicitly for more things. Actually I tend to go the
> other way round: I used to think whether we can avoid some fault mutex in
> some paths. E.g. for UFFDIO_COPY it still makes sense to take fault mutex
> because it faces similar challenge of concurrent allocations. However I'm
> not sure about truncate/hole-punch lines. IIUC most file folios use
> invalidate_lock for that, and I'd think going the other way to use the same
> lock in hugetlb code, then keep fault mutex as minimum used as possible,
> then the semantics of the lock and what it protects are much clearer.

I hear you, but as explained above, I am not entirely sure we can get rid of
the mutex in the truncation/punch-hole path, and if that is the case, having
both the lock and the mutex is definitely overkill.

I can split up the changes, I have no problem with that.  Even more so given
that we might be able to get away with checking for an anon folio instead of
the pagecache_folio dance we have now; roughly something like the sketch
below.
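Untested, just to illustrate your folio_test_anon() idea (vma, old_folio and
cow_from_owner being the locals of hugetlb_wp()):

	/*
	 * Untested sketch: a CoW on top of a private (anonymous) folio means
	 * the reservation for this address was already consumed, so a failed
	 * allocation may force us to unmap the folio from other processes.
	 * No pagecache lookup needed.
	 */
	if (is_vma_resv_set(vma, HPAGE_RESV_OWNER) &&
	    folio_test_anon(old_folio))
		cow_from_owner = true;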
Thanks a lot for the feedback, Peter!

-- 
Oscar Salvador
SUSE Labs