From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 000DFC71157 for ; Tue, 17 Jun 2025 10:03:19 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 97F036B00AA; Tue, 17 Jun 2025 06:03:19 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 92F386B00AB; Tue, 17 Jun 2025 06:03:19 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 7F7656B00AC; Tue, 17 Jun 2025 06:03:19 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0012.hostedemail.com [216.40.44.12]) by kanga.kvack.org (Postfix) with ESMTP id 6D2AA6B00AA for ; Tue, 17 Jun 2025 06:03:19 -0400 (EDT) Received: from smtpin04.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay02.hostedemail.com (Postfix) with ESMTP id 1399A120484 for ; Tue, 17 Jun 2025 10:03:19 +0000 (UTC) X-FDA: 83564454918.04.AE0118D Received: from smtp-out1.suse.de (smtp-out1.suse.de [195.135.223.130]) by imf22.hostedemail.com (Postfix) with ESMTP id CFDF2C000D for ; Tue, 17 Jun 2025 10:03:16 +0000 (UTC) Authentication-Results: imf22.hostedemail.com; dkim=pass header.d=suse.de header.s=susede2_rsa header.b=abbLPbCF; dkim=pass header.d=suse.de header.s=susede2_ed25519 header.b=+4+4PvYT; dkim=pass header.d=suse.de header.s=susede2_rsa header.b=abbLPbCF; dkim=pass header.d=suse.de header.s=susede2_ed25519 header.b=+4+4PvYT; dmarc=pass (policy=none) header.from=suse.de; spf=pass (imf22.hostedemail.com: domain of osalvador@suse.de designates 195.135.223.130 as permitted sender) smtp.mailfrom=osalvador@suse.de ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1750154597; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=lkNuGFHnLWcnxmlORCCDKyxXf53aZscV6U0/WkfcPWo=; b=7ZF2gfmHk6D9NLJXn7+u4GjIY6nFR1nqWe5hNtvz4fPG+C3GS7bQEEBhgyMlu+9rOB+lub U3BtDcVyYQLnNzn33nPhlHjhdLO4C8MsjJO5xNwekyqEYKHQ5s7gdFGpzffOIZTEQmmiii DYXtZ7SnNcajBumbTr3Mg+n28RWHQx8= ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1750154597; a=rsa-sha256; cv=none; b=k1juVcs8ZV3gTL2AtKBOmjZgz29olXsz2aFAf9M2WpM46kFjHSJq8ZwtV7fpa7caG32/ta p3d+IcSRK7Uz+BdezVFan4OFDx/UdX7PKe0QPamr8L5ruKETZwHIaIBcd73IwOXdT51Iuc PqNsSqQ6H5afixsHgAmhYsaoxWruj9g= ARC-Authentication-Results: i=1; imf22.hostedemail.com; dkim=pass header.d=suse.de header.s=susede2_rsa header.b=abbLPbCF; dkim=pass header.d=suse.de header.s=susede2_ed25519 header.b=+4+4PvYT; dkim=pass header.d=suse.de header.s=susede2_rsa header.b=abbLPbCF; dkim=pass header.d=suse.de header.s=susede2_ed25519 header.b=+4+4PvYT; dmarc=pass (policy=none) header.from=suse.de; spf=pass (imf22.hostedemail.com: domain of osalvador@suse.de designates 195.135.223.130 as permitted sender) smtp.mailfrom=osalvador@suse.de Received: from imap1.dmz-prg2.suse.org (imap1.dmz-prg2.suse.org [IPv6:2a07:de40:b281:104:10:150:64:97]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256) (No client certificate requested) by smtp-out1.suse.de (Postfix) with ESMTPS id 657D021190; Tue, 17 Jun 2025 10:03:15 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=suse.de; s=susede2_rsa; t=1750154595; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc: mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=lkNuGFHnLWcnxmlORCCDKyxXf53aZscV6U0/WkfcPWo=; b=abbLPbCF0mQbynj4JRIDX4N1v1ml2XXAGepHiUo+Bi1uAaXbeK8xjHw3Zfl51/DW60qKhO ERYt3F3iBlmZpBPXx0GyvgFSbEfFnPSrw8T9hM4NVtZYPMwv8NYUSp4nWuTMEbVgEensPZ DbSsc3bjGy/NEZpxir9EGu0OQJNtSZQ= DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=suse.de; s=susede2_ed25519; t=1750154595; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc: mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=lkNuGFHnLWcnxmlORCCDKyxXf53aZscV6U0/WkfcPWo=; b=+4+4PvYT4QpEoAXU/fGikyT2jcyOTXZly0hpO0zg6KqC6MjqPV2L3pcYAK4A1+fLOj90Dc +VsZjnJM4K+uMgBg== DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=suse.de; s=susede2_rsa; t=1750154595; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc: mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=lkNuGFHnLWcnxmlORCCDKyxXf53aZscV6U0/WkfcPWo=; b=abbLPbCF0mQbynj4JRIDX4N1v1ml2XXAGepHiUo+Bi1uAaXbeK8xjHw3Zfl51/DW60qKhO ERYt3F3iBlmZpBPXx0GyvgFSbEfFnPSrw8T9hM4NVtZYPMwv8NYUSp4nWuTMEbVgEensPZ DbSsc3bjGy/NEZpxir9EGu0OQJNtSZQ= DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=suse.de; s=susede2_ed25519; t=1750154595; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc: mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=lkNuGFHnLWcnxmlORCCDKyxXf53aZscV6U0/WkfcPWo=; b=+4+4PvYT4QpEoAXU/fGikyT2jcyOTXZly0hpO0zg6KqC6MjqPV2L3pcYAK4A1+fLOj90Dc +VsZjnJM4K+uMgBg== Received: from imap1.dmz-prg2.suse.org (localhost [127.0.0.1]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256) (No client certificate requested) by imap1.dmz-prg2.suse.org (Postfix) with ESMTPS id DEBC913A69; Tue, 17 Jun 2025 10:03:14 +0000 (UTC) Received: from dovecot-director2.suse.de ([2a07:de40:b281:106:10:150:64:167]) by imap1.dmz-prg2.suse.org with ESMTPSA id FR6GM2I9UWgFFgAAD6G6ig (envelope-from ); Tue, 17 Jun 2025 10:03:14 +0000 Date: Tue, 17 Jun 2025 12:03:13 +0200 From: Oscar Salvador To: David Hildenbrand Cc: Andrew Morton , Muchun Song , James Houghton , Peter Xu , Gavin Guo , linux-mm@kvack.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH 2/5] mm,hugetlb: Document the reason to lock the folio in the faulting path Message-ID: References: <20250612134701.377855-1-osalvador@suse.de> <20250612134701.377855-3-osalvador@suse.de> <44f0f1cc-307a-46e3-9e73-8b2061e4e938@redhat.com> <1297fdd5-3de2-45bc-b146-e14061643fee@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1297fdd5-3de2-45bc-b146-e14061643fee@redhat.com> X-Rspamd-Action: no action X-Stat-Signature: i6rgp5tk7fswuswntdgfu58kjksst8k7 X-Rspamd-Queue-Id: CFDF2C000D X-Rspam-User: X-Rspamd-Server: rspam06 X-HE-Tag: 1750154596-673986 X-HE-Meta: U2FsdGVkX1+JW6fuADdW8a1vUELVqUo6LxPJ/52Z2PiQob67BUficoCQq1jC7xggdlTnVPPphyYxLUU+FBUqWtj6dGY2KScr4UHR/SDllf2R9Amdesx83BLY0gFYdUhb5TEXVCsO8Lx9/s6J2VpnlA8ZzO4dRYuNbW8dlLHsRP3E3Z5ERGZijCblO8X8ssqxOuskSDA4uSXdgr4rNwx3r4voA7Epjh7esc11xWjdlVxSHvMVq1i6AhKdjF2WrqvAFYs0FoycQGf1qQXRs7qNwZatwHtGl9c8hyD2WARLPqWanANmDNIfFXOTMo67NPJCVTUrUYjN99OAqZYd7MkmtBsA2L8Difd/v9mjdqMioNR5Gb2m6uZfrGG2L3nk7vN8g+4MaapRjJhavc6zNEn4IzycXkQWVct8+2fpyf6Va/F1iQTVCub4CtkUK611AD97atHW2+myh0SW9tMLUodhbPWU1wDA5wrlZKnkFDWiSO6sYSD24vA0/oNy/bVfJ28zUPBlMT20VMhDx+TldGLONiUCGadqkajRjWUGQNDiOFQxMsEtRxB2diWIsEiKSHWJJyEY8ajjAZdJr3kkL/HH1n2eco6oVqhNbxRVRSrrRZaidg8mqQMmwW/QPNCiQY9dqkHmGvWR6lFfRRd3Hb2T1wFTZyZZzQFMGS/Z9/tmuCp7rEr0VwBwIcHsXYj2yRtpA89NlKOcN3ZnrrGp5RsTZOynYFxgbIX/4jvcIxwFzmZues1UhYhB61STnA0Ir6naXpI2Ip6ZugYDlqM3XoavA7GbHRs7yZRFjSUlcZ9E/zV7Er9PwXYwthLHtukoy480HuV6Mv8D1DPGU4rZrNSU/UJ9mM7GxA5bCheCu6gV2jSNKrGkzYjfqf+A15FcCx//DzZoxRKhJcG0v2KU2VCAj/ZrllwSfdsJyXzZTFPCJ8G/pTR/LyFtyt2HKYwjiuj2GFnG/3chUtjc2XlxYXI 8h8dxJDQ AiLZgbDL7xzpiAlzgFZcwiYWq2OADYSpIKSlol4Ecbs/GTXZwgxGDAHqnI/6eEyIRgMFZ7SUHhYuxIF47KxV5fqqGz+zlS08o+okhpD/ViK2S/1bvLBo1fACdqZzkN2l87V9P7MSGZp9uVxNqg07Swnsn+opgq2Tiua1L+eNZPswCrfduv7zvMTC6KykDlBd/g2i2SIw0LhB1rhvHElCw/R6LpquP25+0hyUvptmKDJWRgw0RKTaVLtwWZOGKfFIHh8qb71Dl9IqMrz0dBTZd2CZygzL85znMFOAh X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Mon, Jun 16, 2025 at 04:41:20PM +0200, David Hildenbrand wrote: > On 16.06.25 16:10, Oscar Salvador wrote: > > What do you mean by stable? > > The same "stable" you used in the doc, that I complained about ;) Touche :-D > > In the generic faulting path, we're not worried about the page going away > > because we hold a reference, so I guess the lock must be to keep content stable? > > What you want to avoid is IIRC, is someone doing a truncation/reclaim on the > folio while you are mapping it. Ok, I see. I thought it was more about holding writes, but this makes sense. > Take a look at truncate_inode_pages_range() where we do a folio_lock() > around truncate_inode_folio(). > > In other words, while you hold the folio lock (and verified that the folio > was not truncated yet: for example, that folio->mapping is still set), you > know that it cannot get truncated concurrently -- without holding other > expensive locks. > > Observe how truncate_cleanup_folio() calls > > if (folio_mapped(folio)) > unmap_mapping_folio(folio); > > To remove all page table mappings. > > So while holding the folio lock, new page table mappings are not expected to > appear (IIRC). Ah ok, so it's more that we don't end up mapping something that's not there anymore (or something completely different). > > I mean, yes, after we have mapped the page privately into the pagetables, > > we don't have business about content-integrity anymore, so given this rule, yes, > > I guess hugetlb_wp() wouldn't need the lock (for !anonymous) because we already > > have mapped it privately at that point. > > That's my understanding. And while holding the PTL it cannot get unmapped. > Whenever you temporarily drop the PTL, you have to do a pte_same() check to > make sure concurrent truncation didn't happen. Yap, hugetlb_wp() drops the locks temporarily when it needs to unmap the private page from other processes, but then does the pte_same() check. > So far my understanding at least of common filemap code. > > > > > But there's something I don't fully understand and makes me feel uneasy. > > If the lock in the generic faultin path is to keep content stable till we > > have mapped it privately, wouldn't be more correct to also hold it > > during the copy in hugetlb_wp, to kinda emulate that? > As long there us a page table mapping, it cannot get truncated. So if you > find a PTE under PTL that maps that folio, truncation could not have > happened. I see, this makes a lot of sense, thanks for walking me through David! Alright, then, with all this clear now we should: - Not take any locks on hugetlb_fault()->hugetlb_wp(), hugetlb_wp() will take it if it's an anonymous folio (re-use check) - Drop the lock in hugetlb_no_page() after we have mapped the page in the pagetables - hugetlb_wp() will take the lock IFF the folio is anonymous This will lead to something like the following: diff --git a/mm/hugetlb.c b/mm/hugetlb.c index dfa09fc3b2c6..4d48cda8a56d 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -6198,6 +6198,8 @@ static vm_fault_t hugetlb_wp(struct vm_fault *vmf) * in scenarios that used to work. As a side effect, there can still * be leaks between processes, for example, with FOLL_GET users. */ + if (folio_test_anon(old_folio)) + folio_lock(old_folio); if (folio_mapcount(old_folio) == 1 && folio_test_anon(old_folio)) { if (!PageAnonExclusive(&old_folio->page)) { folio_move_anon_rmap(old_folio, vma); @@ -6212,6 +6214,8 @@ static vm_fault_t hugetlb_wp(struct vm_fault *vmf) } VM_BUG_ON_PAGE(folio_test_anon(old_folio) && PageAnonExclusive(&old_folio->page), &old_folio->page); + if (folio_test_anon(old_folio)) + folio_unlock(old_folio); /* * If the process that created a MAP_PRIVATE mapping is about to perform @@ -6537,11 +6541,6 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping, } new_pagecache_folio = true; } else { - /* - * hugetlb_wp() expects the folio to be locked in order to - * check whether we can re-use this page exclusively for us. - */ - folio_lock(folio); anon_rmap = 1; } } else { @@ -6558,7 +6557,8 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping, /* Check for page in userfault range. */ if (userfaultfd_minor(vma)) { - folio_unlock(folio); + if (!anon_rmap) + folio_unlock(folio); folio_put(folio); /* See comment in userfaultfd_missing() block above */ if (!hugetlb_pte_stable(h, mm, vmf->address, vmf->pte, vmf->orig_pte)) { @@ -6604,6 +6604,13 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping, new_pte = huge_pte_mkuffd_wp(new_pte); set_huge_pte_at(mm, vmf->address, vmf->pte, new_pte, huge_page_size(h)); + /* + * This folio cannot have been truncated since we were holding the lock, + * and we just mapped it into the pagetables. Drop the lock now. + */ + if (!anon_rmap) + folio_unlock(folio); + hugetlb_count_add(pages_per_huge_page(h), mm); if ((vmf->flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) { /* Optimization, do the COW without a second fault */ @@ -6619,8 +6626,6 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping, */ if (new_folio) folio_set_hugetlb_migratable(folio); - - folio_unlock(folio); out: hugetlb_vma_unlock_read(vma); @@ -6639,8 +6644,8 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping, backout_unlocked: if (new_folio && !new_pagecache_folio) restore_reserve_on_error(h, vma, vmf->address, folio); - - folio_unlock(folio); + if (!anon_rmap) + folio_unlock(folio); folio_put(folio); goto out; } @@ -6805,21 +6810,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma, /* Fallthrough to CoW */ } - /* - * We need to lock the folio before calling hugetlb_wp(). - * Either the folio is in the pagecache and we need to copy it over - * to another file, so it must remain stable throughout the operation, - * or the folio is anonymous and we need to lock it in order to check - * whether we can re-use it and mark it exclusive for this process. - * The timespan for the lock differs depending on the type, since - * anonymous folios only need to hold the lock while checking whether we - * can re-use it, while we need to hold it throughout the copy in case - * we are dealing with a folio from a pagecache. - * Representing this difference would be tricky with the current code, - * so just hold the lock for the duration of hugetlb_wp(). - */ folio = page_folio(pte_page(vmf.orig_pte)); - folio_lock(folio); folio_get(folio); if (flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) { @@ -6835,7 +6826,6 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma, flags & FAULT_FLAG_WRITE)) update_mmu_cache(vma, vmf.address, vmf.pte); out_put_page: - folio_unlock(folio); folio_put(folio); out_ptl: spin_unlock(vmf.ptl); This should be patch#2 with something like "Sorting out locking" per title, and maybe explaining a bit more why the lock in hugelb_wp for anonymous folios. What do you think? -- Oscar Salvador SUSE Labs