linux-mm.kvack.org archive mirror
From: David Hildenbrand <david@redhat.com>
To: Deepanshu Kartikey <kartikey406@gmail.com>,
	muchun.song@linux.dev, osalvador@suse.de,
	akpm@linux-foundation.org, lorenzo.stoakes@oracle.com,
	Liam.Howlett@oracle.com, vbabka@suse.cz, rppt@kernel.org,
	surenb@google.com, mhocko@suse.com, broonie@kernel.org
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	syzbot+f26d7c75c26ec19790e7@syzkaller.appspotmail.com
Subject: Re: [PATCH v3] hugetlbfs: skip PMD unsharing when shareable lock unavailable
Date: Mon, 6 Oct 2025 09:37:34 +0200
Message-ID: <6db79bb0-382e-4c3c-89e0-4c7822d4dfca@redhat.com>
In-Reply-To: <20251003174553.3078839-1-kartikey406@gmail.com>

On 03.10.25 19:45, Deepanshu Kartikey wrote:
> When hugetlb_vmdelete_list() cannot acquire the shareable lock for a VMA,
> the previous fix (dd83609b8898) skipped the entire VMA to avoid lock

The proper way to mention a commit here is:

"... fix in commit dd83609b8898 ("hugetlbfs: skip VMAs without shareable 
locks in hugetlb_vmdelete_list") skipped ..."

> assertions in huge_pmd_unshare(). However, this prevented pages from being
> unmapped and freed, causing a regression in fallocate(PUNCH_HOLE) operations
> where pages were not freed immediately, as reported by Mark Brown.
> 
> The issue occurs because:
> 1. hugetlb_vmdelete_list() calls hugetlb_vma_trylock_write()
> 2. For shareable VMAs, this attempts to acquire the shareable lock
> 3. If successful, huge_pmd_unshare() expects the lock to be held
> 4. huge_pmd_unshare() asserts the lock via hugetlb_vma_assert_locked()
> 
> The v2 fix avoided calling code that requires locks, but this prevented
> page unmapping entirely, breaking the expected behavior where pages are
> freed during punch hole operations.
> 
> This v3 fix takes a different approach: instead of skipping the entire VMA,
> we skip only the PMD unsharing operation when we don't have the required
> lock, while still proceeding with page unmapping. This is safe because:

It's confusing to talk about fix versions. If you want to reference 
previous discussions, link to them instead.

> 
> - PMD unsharing is an optimization to reduce shared page table overhead
> - Page unmapping can proceed safely with just the VMA write lock
> - Pages get freed immediately as expected by PUNCH_HOLE operations
> - The PMD metadata will be cleaned up when the VMA is destroyed
> 
> We introduce a new ZAP_FLAG_NO_UNSHARE flag that communicates to
> __unmap_hugepage_range() that it should skip huge_pmd_unshare() while
> still clearing page table entries and freeing pages.
> 
> Reported-by: syzbot+f26d7c75c26ec19790e7@syzkaller.appspotmail.com
> Reported-by: Mark Brown <broonie@kernel.org>
> Fixes: dd83609b8898 ("hugetlbfs: skip VMAs without shareable locks in hugetlb_vmdelete_list")
> Tested-by: syzbot+f26d7c75c26ec19790e7@syzkaller.appspotmail.com
> Signed-off-by: Deepanshu Kartikey <kartikey406@gmail.com>
> 
> ---
> Changes in v3:
> - Instead of skipping entire VMAs, skip only PMD unsharing operation
> - Add ZAP_FLAG_NO_UNSHARE flag to communicate lock status
> - Ensure pages are still unmapped and freed immediately
> - Fixes regression in fallocate PUNCH_HOLE reported by Mark Brown
> 
> Changes in v2:
> - Check for shareable lock before trylock to avoid lock leaks
> - Add comment explaining why non-shareable VMAs are skipped
> ---
>   fs/hugetlbfs/inode.c | 22 ++++++++++++----------
>   include/linux/mm.h   |  2 ++
>   mm/hugetlb.c         |  3 ++-
>   3 files changed, 16 insertions(+), 11 deletions(-)
> 
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index 9c94ed8c3ab0..519497bc1045 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -474,29 +474,31 @@ hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end,
>   	vma_interval_tree_foreach(vma, root, start, end ? end - 1 : ULONG_MAX) {
>   		unsigned long v_start;
>   		unsigned long v_end;
> +		bool have_shareable_lock;
> +		zap_flags_t local_flags = zap_flags;
>   
>   		if (!hugetlb_vma_trylock_write(vma))
>   			continue;
> -
> +
> +		have_shareable_lock = __vma_shareable_lock(vma);
> +
>   		/*
> -		 * Skip VMAs without shareable locks. Per the design in commit
> -		 * 40549ba8f8e0, these will be handled by remove_inode_hugepages()
> -		 * called after this function with proper locking.
> +		 * If we can't get the shareable lock, set ZAP_FLAG_NO_UNSHARE

What do you mean by "If we can't get the shareable lock"? 
__vma_shareable_lock() doesn't tell us whether we grabbed the lock, but 
whether we have to grab the lock?
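
To illustrate that distinction, here is a minimal userspace sketch. It
assumes that hugetlb_vma_trylock_write() only takes the per-VMA lock when
__vma_shareable_lock() reports that one exists, and reports success
otherwise. All model_* names are hypothetical stand-ins, not kernel code,
and the lock used for private mappings is ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-VMA hugetlb lock. */
struct model_lock {
	bool held;
};

/* Hypothetical stand-in for a VMA; not the kernel's vm_area_struct. */
struct model_vma {
	bool may_share;           /* models VM_MAYSHARE */
	struct model_lock *lock;  /* NULL: no shareable lock was allocated */
};

/* Models __vma_shareable_lock(): a property of the VMA, not lock state. */
static bool model_vma_shareable_lock(const struct model_vma *vma)
{
	return vma->may_share && vma->lock != NULL;
}

/* Models hugetlb_vma_trylock_write() under the assumption stated above. */
static bool model_vma_trylock_write(struct model_vma *vma)
{
	if (!model_vma_shareable_lock(vma))
		return true;	/* nothing to take, still reports success */
	if (vma->lock->held)
		return false;	/* contended: the caller skips this VMA */
	vma->lock->held = true;
	return true;
}
```

In this model, a false result from model_vma_shareable_lock() after a
successful trylock means "there was no lock to take", not "the lock
attempt failed". A failed attempt never reaches that check because the
caller has already moved on to the next VMA.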

I see no R-b/Ack from hugetlb maintainers, and this seems to be getting 
rather complicated, so I cannot easily judge what's right or wrong 
here.

@Muchun, Oscar, can you take a look?


> +		 * to skip PMD unsharing. We still proceed with unmapping to
> +		 * ensure pages are properly freed, which is critical for punch
> +		 * hole operations that expect immediate page freeing.
>   		 */
> -		if (!__vma_shareable_lock(vma))
> -			goto skip;
> -
> +		if (!have_shareable_lock)
> +			local_flags |= ZAP_FLAG_NO_UNSHARE;
>   		v_start = vma_offset_start(vma, start);
>   		v_end = vma_offset_end(vma, end);
>   
> -		unmap_hugepage_range(vma, v_start, v_end, NULL, zap_flags);
> -
> +		unmap_hugepage_range(vma, v_start, v_end, NULL, local_flags);
>   		/*
>   		 * Note that vma lock only exists for shared/non-private
>   		 * vmas.  Therefore, lock is not held when calling
>   		 * unmap_hugepage_range for private vmas.
>   		 */
> -skip:
>   		hugetlb_vma_unlock_write(vma);
>   	}
>   }
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 06978b4dbeb8..9126ab44320d 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2395,6 +2395,8 @@ struct zap_details {
>   #define  ZAP_FLAG_DROP_MARKER        ((__force zap_flags_t) BIT(0))
>   /* Set in unmap_vmas() to indicate a final unmap call.  Only used by hugetlb */
>   #define  ZAP_FLAG_UNMAP              ((__force zap_flags_t) BIT(1))
> +/* Skip PMD unsharing when unmapping hugetlb ranges without shareable lock */
> +#define  ZAP_FLAG_NO_UNSHARE         ((__force zap_flags_t) BIT(2))

That's nasty: this is hugetlb-specific stuff in a generic mm.h header 
using generic mm flags.

I'm sure we can find a way to communicate that differently within 
hugetlb code and leave the generic ZAP_* flags alone?
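
One possible shape for that, as a userspace sketch only: the unmap path
consults the VMA itself right before the unshare call, so no new flag
crosses the generic interface. All sketch_* names are hypothetical, the
per-address loop of the real unmap code is collapsed into a single step,
and whether skipping the unshare is actually safe remains a question for
the hugetlb maintainers.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a hugetlb VMA; not kernel code. */
struct sketch_vma {
	bool has_shareable_lock; /* models __vma_shareable_lock(vma) */
	bool lock_held;          /* set by the caller's successful trylock */
	int mapped_pages;
	int unshare_calls;
};

/* Models huge_pmd_unshare(), which asserts that the lock is held. */
static void sketch_pmd_unshare(struct sketch_vma *vma)
{
	assert(vma->lock_held);	/* models hugetlb_vma_assert_locked() */
	vma->unshare_calls++;
}

/* Models the unmap path: no extra flag, the check sits next to the call. */
static int sketch_unmap_range(struct sketch_vma *vma)
{
	int freed = vma->mapped_pages;

	if (vma->has_shareable_lock)
		sketch_pmd_unshare(vma);
	vma->mapped_pages = 0;	/* pages are unmapped either way */
	return freed;
}
```

The caller in this sketch passes its zap flags through unchanged; the
decision to unshare stays local to the hugetlb-specific function.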

-- 
Cheers

David / dhildenb


