From: Baolin Wang
To: akpm@linux-foundation.org, hughd@google.com, david@redhat.com
Cc: ziy@nvidia.com, lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com, npache@redhat.com, ryan.roberts@arm.com, dev.jain@arm.com, baohua@kernel.org, baolin.wang@linux.alibaba.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: [PATCH v3 1/2] mm: huge_memory: disallow hugepages if the system-wide THP sysfs settings are disabled
Date: Mon, 23 Jun 2025 16:28:08 +0800

When invoking thp_vma_allowable_orders(), if the TVA_ENFORCE_SYSFS flag is
not specified, we will ignore the THP sysfs settings.
Whilst it makes sense for the callers who do not specify this flag, it
creates an odd and surprising situation where a sysadmin specifying 'never'
for all THP sizes still observes THP pages being allocated and used on the
system.

The motivating case for this is MADV_COLLAPSE. MADV_COLLAPSE ignores the
system-wide Anon THP sysfs settings, which means that even though we have
disabled the Anon THP configuration, MADV_COLLAPSE will still attempt to
collapse into an Anon THP. This violates the rule we have agreed upon:
never means never.

Currently, besides MADV_COLLAPSE not setting TVA_ENFORCE_SYSFS, there is
only one other instance where TVA_ENFORCE_SYSFS is not set, which is in the
collapse_pte_mapped_thp() function, but I believe this is reasonable from
its comments:

"
/*
 * If we are here, we've succeeded in replacing all the native pages
 * in the page cache with a single hugepage. If a mm were to fault-in
 * this memory (mapped by a suitably aligned VMA), we'd get the hugepage
 * and map it by a PMD, regardless of sysfs THP settings. As such, let's
 * analogously elide sysfs THP settings here.
 */
if (!thp_vma_allowable_order(vma, vma->vm_flags, 0, PMD_ORDER))
"

Another rule for madvise, referring to David's suggestion: "allowing for
collapsing in a VM without VM_HUGEPAGE in the "madvise" mode would be fine".

To address this issue, the current strategy should be:

If no hugepage modes are enabled for the desired orders, nor can we enable
them by inheriting from a 'global' enabled setting - then it must be the
case that all desired orders either specify or inherit 'NEVER' - and we
must abort.

Meanwhile, we should fix the khugepaged selftest for MADV_COLLAPSE by
enabling THP.
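For illustration only, the strategy above can be sketched as a small
user-space model. This is not kernel code: struct thp_sysfs_model and
model_mask_anon_orders() are hypothetical names, and the booleans stand in
for the kernel's hugepage_global_always()/hugepage_global_enabled() checks
and the VM_HUGEPAGE and TVA_ENFORCE_SYSFS flags.

```c
#include <stdbool.h>

/* Hypothetical user-space stand-in for the anon THP sysfs state. */
struct thp_sysfs_model {
	unsigned long always;	/* bitmask of orders set to "always" */
	unsigned long madvise;	/* bitmask of orders set to "madvise" */
	unsigned long inherit;	/* bitmask of orders set to "inherit" */
	bool global_always;	/* top-level "enabled" is "always" */
	bool global_madvise;	/* top-level "enabled" is "madvise" */
};

/* Return the subset of @orders still allowed under the modelled settings. */
unsigned long model_mask_anon_orders(const struct thp_sysfs_model *s,
				     bool vma_hugepage, bool enforce_sysfs,
				     unsigned long orders)
{
	const unsigned long never = ~(s->always | s->madvise | s->inherit);
	const bool global_enabled = s->global_always || s->global_madvise;

	/* Orders that are "never" directly are always dropped. */
	orders &= ~never;

	/* "inherit" orders are dropped too when the global setting is never. */
	if (!global_enabled)
		orders &= ~s->inherit;

	/* Without enforcement (the MADV_COLLAPSE case) stop here. */
	if (!enforce_sysfs)
		return orders;

	/* With enforcement, "madvise" orders need the VMA to be marked. */
	if (vma_hugepage)
		return orders & (s->always | s->madvise | s->inherit);
	if (s->global_always)
		return orders & (s->always | s->inherit);
	return orders & s->always;
}
```

In this model, a request for order 9 with every size set to never yields
0 even without enforcement, whereas an order set to madvise is still
returned for an unenforced caller on a VMA without VM_HUGEPAGE.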

Suggested-by: Lorenzo Stoakes
Signed-off-by: Baolin Wang
---
 include/linux/huge_mm.h                 | 51 ++++++++++++++++++-------
 tools/testing/selftests/mm/khugepaged.c |  6 +--
 2 files changed, 39 insertions(+), 18 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 4d5bb67dc4ec..ab70ca4e704b 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -267,6 +267,42 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
 					 unsigned long tva_flags,
 					 unsigned long orders);
 
+/* Strictly mask requested anonymous orders according to sysfs settings. */
+static inline unsigned long __thp_mask_anon_orders(unsigned long vm_flags,
+		unsigned long tva_flags, unsigned long orders)
+{
+	const unsigned long always = READ_ONCE(huge_anon_orders_always);
+	const unsigned long madvise = READ_ONCE(huge_anon_orders_madvise);
+	const unsigned long inherit = READ_ONCE(huge_anon_orders_inherit);
+	const unsigned long never = ~(always | madvise | inherit);
+	const bool inherit_never = !hugepage_global_enabled();
+
+	/* Disallow orders that are set to NEVER directly ... */
+	orders &= ~never;
+
+	/* ... or through inheritance (global == NEVER). */
+	if (inherit_never)
+		orders &= ~inherit;
+
+	/*
+	 * Otherwise, we only enforce sysfs settings if asked. In addition,
+	 * if the user sets a sysfs mode of madvise and if TVA_ENFORCE_SYSFS
+	 * is not set, we don't bother checking whether the VMA has VM_HUGEPAGE
+	 * set.
+	 */
+	if (!(tva_flags & TVA_ENFORCE_SYSFS))
+		return orders;
+
+	/* We already excluded never inherit above. */
+	if (vm_flags & VM_HUGEPAGE)
+		return orders & (always | madvise | inherit);
+
+	if (hugepage_global_always())
+		return orders & (always | inherit);
+
+	return orders & always;
+}
+
 /**
  * thp_vma_allowable_orders - determine hugepage orders that are allowed for vma
  * @vma:  the vm area to check
@@ -289,19 +325,8 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
 				       unsigned long orders)
 {
 	/* Optimization to check if required orders are enabled early. */
-	if ((tva_flags & TVA_ENFORCE_SYSFS) && vma_is_anonymous(vma)) {
-		unsigned long mask = READ_ONCE(huge_anon_orders_always);
-
-		if (vm_flags & VM_HUGEPAGE)
-			mask |= READ_ONCE(huge_anon_orders_madvise);
-		if (hugepage_global_always() ||
-		    ((vm_flags & VM_HUGEPAGE) && hugepage_global_enabled()))
-			mask |= READ_ONCE(huge_anon_orders_inherit);
-
-		orders &= mask;
-		if (!orders)
-			return 0;
-	}
+	if (vma_is_anonymous(vma))
+		orders = __thp_mask_anon_orders(vm_flags, tva_flags, orders);
 
 	return __thp_vma_allowable_orders(vma, vm_flags, tva_flags, orders);
 }
diff --git a/tools/testing/selftests/mm/khugepaged.c b/tools/testing/selftests/mm/khugepaged.c
index 4341ce6b3b38..85bfff53dba6 100644
--- a/tools/testing/selftests/mm/khugepaged.c
+++ b/tools/testing/selftests/mm/khugepaged.c
@@ -501,11 +501,7 @@ static void __madvise_collapse(const char *msg, char *p, int nr_hpages,
 
 	printf("%s...", msg);
 
-	/*
-	 * Prevent khugepaged interference and tests that MADV_COLLAPSE
-	 * ignores /sys/kernel/mm/transparent_hugepage/enabled
-	 */
-	settings.thp_enabled = THP_NEVER;
+	settings.thp_enabled = THP_ALWAYS;
 	settings.shmem_enabled = SHMEM_NEVER;
 	thp_push_settings(&settings);
 
-- 
2.43.5