From: Oscar Salvador
To: Andrew Morton
Cc: David Hildenbrand, Muchun Song, Peter Xu, Gavin Guo, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Oscar Salvador
Subject: [PATCH v4 1/5] mm,hugetlb: change mechanism to detect a COW on private mapping
Date: Mon, 30 Jun 2025 16:42:08 +0200
Message-ID: <20250630144212.156938-2-osalvador@suse.de>
X-Mailer: git-send-email 2.49.0
In-Reply-To: <20250630144212.156938-1-osalvador@suse.de>
References: <20250630144212.156938-1-osalvador@suse.de>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
hugetlb_wp() checks whether the process is trying to COW on a private
mapping in order to know whether the reservation for that address was
already consumed. If it was consumed and we are the owner of the mapping,
the folio will have to be unmapped from the other processes.

Currently, that check is done by looking up the folio in the pagecache and
comparing it to the folio which is mapped in our pagetables. If it differs,
it means we already mapped it privately before, consuming a reservation on
the way. All we are interested in is whether the mapped folio is anonymous,
so we can simplify and check for that instead.
Link: https://lkml.kernel.org/r/20250627102904.107202-1-osalvador@suse.de
Link: https://lkml.kernel.org/r/20250627102904.107202-2-osalvador@suse.de
Link: https://lore.kernel.org/lkml/20250513093448.592150-1-gavinguo@igalia.com/ [1]
Fixes: 40549ba8f8e0 ("hugetlb: use new vma_lock for pmd sharing synchronization")
Signed-off-by: Oscar Salvador
Reported-by: Gavin Guo
Closes: https://lore.kernel.org/lkml/20250513093448.592150-1-gavinguo@igalia.com/
Suggested-by: Peter Xu
Acked-by: David Hildenbrand
Cc: Muchun Song
Signed-off-by: Andrew Morton
---
 mm/hugetlb.c | 88 ++++++++++++++++++++--------------------------------
 1 file changed, 34 insertions(+), 54 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index fa7faf38c99e..14274a02dd14 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6149,8 +6149,7 @@ static void unmap_ref_private(struct mm_struct *mm, struct vm_area_struct *vma,
  * cannot race with other handlers or page migration.
  * Keep the pte_same checks anyway to make transition from the mutex easier.
  */
-static vm_fault_t hugetlb_wp(struct folio *pagecache_folio,
-		struct vm_fault *vmf)
+static vm_fault_t hugetlb_wp(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
 	struct mm_struct *mm = vma->vm_mm;
@@ -6212,16 +6211,17 @@ static vm_fault_t hugetlb_wp(struct folio *pagecache_folio,
 			       PageAnonExclusive(&old_folio->page), &old_folio->page);
 
 	/*
-	 * If the process that created a MAP_PRIVATE mapping is about to
-	 * perform a COW due to a shared page count, attempt to satisfy
-	 * the allocation without using the existing reserves. The pagecache
-	 * page is used to determine if the reserve at this address was
-	 * consumed or not. If reserves were used, a partial faulted mapping
-	 * at the time of fork() could consume its reserves on COW instead
-	 * of the full address range.
+	 * If the process that created a MAP_PRIVATE mapping is about to perform
+	 * a COW due to a shared page count, attempt to satisfy the allocation
+	 * without using the existing reserves.
+	 * In order to determine whether this is a COW on a MAP_PRIVATE mapping
+	 * it is enough to check whether the old_folio is anonymous. This means
+	 * that the reserve for this address was consumed. If reserves were
+	 * used, a partial faulted mapping at the time of fork() could consume
+	 * its reserves on COW instead of the full address range.
 	 */
 	if (is_vma_resv_set(vma, HPAGE_RESV_OWNER) &&
-	    old_folio != pagecache_folio)
+	    folio_test_anon(old_folio))
 		cow_from_owner = true;
 
 	folio_get(old_folio);
@@ -6600,7 +6600,7 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping,
 	hugetlb_count_add(pages_per_huge_page(h), mm);
 	if ((vmf->flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) {
 		/* Optimization, do the COW without a second fault */
-		ret = hugetlb_wp(folio, vmf);
+		ret = hugetlb_wp(vmf);
 	}
 
 	spin_unlock(vmf->ptl);
@@ -6668,10 +6668,9 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	vm_fault_t ret;
 	u32 hash;
 	struct folio *folio = NULL;
-	struct folio *pagecache_folio = NULL;
 	struct hstate *h = hstate_vma(vma);
 	struct address_space *mapping;
-	int need_wait_lock = 0;
+	bool need_wait_lock = false;
 	struct vm_fault vmf = {
 		.vma = vma,
 		.address = address & huge_page_mask(h),
@@ -6766,8 +6765,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	 * If we are going to COW/unshare the mapping later, we examine the
 	 * pending reservations for this page now. This will ensure that any
 	 * allocations necessary to record that reservation occur outside the
-	 * spinlock. Also lookup the pagecache page now as it is used to
-	 * determine if a reservation has been consumed.
+	 * spinlock.
 	 */
 	if ((flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) &&
 	    !(vma->vm_flags & VM_MAYSHARE) && !huge_pte_write(vmf.orig_pte)) {
@@ -6777,11 +6775,6 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		}
 		/* Just decrements count, does not deallocate */
 		vma_end_reservation(h, vma, vmf.address);
-
-		pagecache_folio = filemap_lock_hugetlb_folio(h, mapping,
-							     vmf.pgoff);
-		if (IS_ERR(pagecache_folio))
-			pagecache_folio = NULL;
 	}
 
 	vmf.ptl = huge_pte_lock(h, mm, vmf.pte);
@@ -6795,10 +6788,6 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	    (flags & FAULT_FLAG_WRITE) && !huge_pte_write(vmf.orig_pte)) {
 		if (!userfaultfd_wp_async(vma)) {
 			spin_unlock(vmf.ptl);
-			if (pagecache_folio) {
-				folio_unlock(pagecache_folio);
-				folio_put(pagecache_folio);
-			}
 			hugetlb_vma_unlock_read(vma);
 			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
 			return handle_userfault(&vmf, VM_UFFD_WP);
@@ -6810,24 +6799,19 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		/* Fallthrough to CoW */
 	}
 
-	/*
-	 * hugetlb_wp() requires page locks of pte_page(vmf.orig_pte) and
-	 * pagecache_folio, so here we need take the former one
-	 * when folio != pagecache_folio or !pagecache_folio.
-	 */
-	folio = page_folio(pte_page(vmf.orig_pte));
-	if (folio != pagecache_folio)
-		if (!folio_trylock(folio)) {
-			need_wait_lock = 1;
-			goto out_ptl;
-		}
-
-	folio_get(folio);
-
 	if (flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) {
 		if (!huge_pte_write(vmf.orig_pte)) {
-			ret = hugetlb_wp(pagecache_folio, &vmf);
-			goto out_put_page;
+			/* hugetlb_wp() requires page locks of pte_page(vmf.orig_pte) */
+			folio = page_folio(pte_page(vmf.orig_pte));
+			if (!folio_trylock(folio)) {
+				need_wait_lock = true;
+				goto out_ptl;
+			}
+			folio_get(folio);
+			ret = hugetlb_wp(&vmf);
+			folio_unlock(folio);
+			folio_put(folio);
+			goto out_ptl;
 		} else if (likely(flags & FAULT_FLAG_WRITE)) {
 			vmf.orig_pte = huge_pte_mkdirty(vmf.orig_pte);
 		}
@@ -6836,17 +6820,8 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (huge_ptep_set_access_flags(vma, vmf.address, vmf.pte,
 				       vmf.orig_pte, flags & FAULT_FLAG_WRITE))
 		update_mmu_cache(vma, vmf.address, vmf.pte);
-out_put_page:
-	if (folio != pagecache_folio)
-		folio_unlock(folio);
-	folio_put(folio);
 out_ptl:
 	spin_unlock(vmf.ptl);
-
-	if (pagecache_folio) {
-		folio_unlock(pagecache_folio);
-		folio_put(pagecache_folio);
-	}
 out_mutex:
 	hugetlb_vma_unlock_read(vma);
@@ -6859,11 +6834,16 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	mutex_unlock(&hugetlb_fault_mutex_table[hash]);
 
 	/*
-	 * Generally it's safe to hold refcount during waiting page lock. But
-	 * here we just wait to defer the next page fault to avoid busy loop and
-	 * the page is not used after unlocked before returning from the current
-	 * page fault. So we are safe from accessing freed page, even if we wait
-	 * here without taking refcount.
+	 * hugetlb_wp drops all the locks but the folio lock before trying to
+	 * unmap the folio from other processes. During that window, if another
+	 * process mapping that folio faults in, it will take the mutex and
+	 * then it will wait on folio_lock, causing an ABBA deadlock.
+	 * Use trylock instead and bail out if we fail.
+	 *
+	 * Ideally, we should hold a refcount on the folio we wait for, but we
+	 * do not want to use the folio after it becomes unlocked, but rather
+	 * just wait for it to become unlocked, so hopefully the next fault
+	 * succeeds on the trylock.
 	 */
 	if (need_wait_lock)
 		folio_wait_locked(folio);
-- 
2.50.0