From: Baolin Wang
To: akpm@linux-foundation.org, hughd@google.com, david@redhat.com
Cc: ziy@nvidia.com, lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com, npache@redhat.com, ryan.roberts@arm.com, dev.jain@arm.com, baohua@kernel.org, baolin.wang@linux.alibaba.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: [PATCH v4 1/2] mm: huge_memory: disallow hugepages if the system-wide THP sysfs settings are disabled
Date: Wed, 25 Jun 2025 09:40:09 +0800
Message-ID: <25252834a20a2e8f611ba572d9fae98fb8d67982.1750815384.git.baolin.wang@linux.alibaba.com>
X-Mailer: git-send-email 2.43.5
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

When thp_vma_allowable_orders() is invoked without the TVA_ENFORCE_SYSFS
flag, the THP sysfs settings are ignored. Whilst this makes sense for the
callers that do not specify the flag, it creates an odd and surprising
situation where a sysadmin who specifies 'never' for all THP sizes still
observes THP pages being allocated and used on the system.

The motivating case for this is MADV_COLLAPSE. MADV_COLLAPSE ignores the
system-wide Anon THP sysfs settings, which means that even when the Anon
THP configuration is disabled, MADV_COLLAPSE will still attempt to
collapse into an Anon THP. This violates the rule we have agreed upon:
never means never.

Currently, besides MADV_COLLAPSE, the only other place where
TVA_ENFORCE_SYSFS is not set is the collapse_pte_mapped_thp() function,
but I believe this is reasonable given its comment:

"
/*
 * If we are here, we've succeeded in replacing all the native pages
 * in the page cache with a single hugepage. If a mm were to fault-in
 * this memory (mapped by a suitably aligned VMA), we'd get the hugepage
 * and map it by a PMD, regardless of sysfs THP settings. As such, let's
 * analogously elide sysfs THP settings here.
 */
if (!thp_vma_allowable_order(vma, vma->vm_flags, 0, PMD_ORDER))
"

Another rule for madvise, referring to David's suggestion: "allowing for
collapsing in a VM without VM_HUGEPAGE in the "madvise" mode would be
fine".

To address this issue, the current strategy should be:

If no hugepage modes are enabled for the desired orders, nor can we
enable them by inheriting from a 'global' enabled setting - then it must
be the case that all desired orders either specify or inherit 'NEVER' -
and we must abort.

Meanwhile, we should fix the khugepaged selftest for MADV_COLLAPSE.
Originally, we could prevent khugepaged by setting THP_MADVISE and
removing the MADV_HUGEPAGE setting, while madvise_collapse() could still
perform THP collapse.
However, this would cause some test cases to fail because some tests
previously set MADV_NOHUGEPAGE, and now there is no other way to clear
the MADV_NOHUGEPAGE flag except for setting MADV_HUGEPAGE. Therefore, it
should be changed to THP_ALWAYS here to allow madvise_collapse() to
perform THP collapse.

Suggested-by: David Hildenbrand
Suggested-by: Lorenzo Stoakes
Reviewed-by: Zi Yan
Reviewed-by: Lorenzo Stoakes
Signed-off-by: Baolin Wang
---
 include/linux/huge_mm.h                 | 51 ++++++++++++++++++-------
 tools/testing/selftests/mm/khugepaged.c |  6 +--
 2 files changed, 39 insertions(+), 18 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 4d5bb67dc4ec..ab70ca4e704b 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -267,6 +267,42 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
 					 unsigned long tva_flags,
 					 unsigned long orders);
 
+/* Strictly mask requested anonymous orders according to sysfs settings. */
+static inline unsigned long __thp_mask_anon_orders(unsigned long vm_flags,
+		unsigned long tva_flags, unsigned long orders)
+{
+	const unsigned long always = READ_ONCE(huge_anon_orders_always);
+	const unsigned long madvise = READ_ONCE(huge_anon_orders_madvise);
+	const unsigned long inherit = READ_ONCE(huge_anon_orders_inherit);
+	const unsigned long never = ~(always | madvise | inherit);
+	const bool inherit_never = !hugepage_global_enabled();
+
+	/* Disallow orders that are set to NEVER directly ... */
+	orders &= ~never;
+
+	/* ... or through inheritance (global == NEVER). */
+	if (inherit_never)
+		orders &= ~inherit;
+
+	/*
+	 * Otherwise, we only enforce sysfs settings if asked. In addition,
+	 * if the user sets a sysfs mode of madvise and if TVA_ENFORCE_SYSFS
+	 * is not set, we don't bother checking whether the VMA has VM_HUGEPAGE
+	 * set.
+	 */
+	if (!(tva_flags & TVA_ENFORCE_SYSFS))
+		return orders;
+
+	/* We already excluded never inherit above. */
+	if (vm_flags & VM_HUGEPAGE)
+		return orders & (always | madvise | inherit);
+
+	if (hugepage_global_always())
+		return orders & (always | inherit);
+
+	return orders & always;
+}
+
 /**
  * thp_vma_allowable_orders - determine hugepage orders that are allowed for vma
  * @vma:  the vm area to check
@@ -289,19 +325,8 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
 				       unsigned long orders)
 {
 	/* Optimization to check if required orders are enabled early. */
-	if ((tva_flags & TVA_ENFORCE_SYSFS) && vma_is_anonymous(vma)) {
-		unsigned long mask = READ_ONCE(huge_anon_orders_always);
-
-		if (vm_flags & VM_HUGEPAGE)
-			mask |= READ_ONCE(huge_anon_orders_madvise);
-		if (hugepage_global_always() ||
-		    ((vm_flags & VM_HUGEPAGE) && hugepage_global_enabled()))
-			mask |= READ_ONCE(huge_anon_orders_inherit);
-
-		orders &= mask;
-		if (!orders)
-			return 0;
-	}
+	if (vma_is_anonymous(vma))
+		orders = __thp_mask_anon_orders(vm_flags, tva_flags, orders);
 
 	return __thp_vma_allowable_orders(vma, vm_flags, tva_flags, orders);
 }
diff --git a/tools/testing/selftests/mm/khugepaged.c b/tools/testing/selftests/mm/khugepaged.c
index 4341ce6b3b38..85bfff53dba6 100644
--- a/tools/testing/selftests/mm/khugepaged.c
+++ b/tools/testing/selftests/mm/khugepaged.c
@@ -501,11 +501,7 @@ static void __madvise_collapse(const char *msg, char *p, int nr_hpages,
 
 	printf("%s...", msg);
 
-	/*
-	 * Prevent khugepaged interference and tests that MADV_COLLAPSE
-	 * ignores /sys/kernel/mm/transparent_hugepage/enabled
-	 */
-	settings.thp_enabled = THP_NEVER;
+	settings.thp_enabled = THP_ALWAYS;
 	settings.shmem_enabled = SHMEM_NEVER;
 	thp_push_settings(&settings);
 
-- 
2.43.5
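
For readers who want to poke at the masking rules outside the kernel, here
is a rough userspace sketch of the logic in __thp_mask_anon_orders(). It
is not kernel code: struct thp_sysfs, mask_anon_orders() and the two
global_* booleans are hypothetical stand-ins for the sysfs bitmaps and
for hugepage_global_always()/hugepage_global_enabled(). The sketch
assumes the global setting counts as "enabled" when it is either always
or madvise.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the per-order anon THP sysfs state. */
struct thp_sysfs {
	unsigned long always;	/* orders whose sysfs mode is "always" */
	unsigned long madvise;	/* orders whose sysfs mode is "madvise" */
	unsigned long inherit;	/* orders whose sysfs mode is "inherit" */
	bool global_always;	/* assumed: top-level enabled == always */
	bool global_madvise;	/* assumed: top-level enabled == madvise */
};

/*
 * Userspace model of the masking in the patch. vm_hugepage models
 * (vm_flags & VM_HUGEPAGE); enforce_sysfs models
 * (tva_flags & TVA_ENFORCE_SYSFS).
 */
static unsigned long mask_anon_orders(const struct thp_sysfs *s,
				      bool vm_hugepage, bool enforce_sysfs,
				      unsigned long orders)
{
	const unsigned long never = ~(s->always | s->madvise | s->inherit);
	/* Assumption: "global enabled" means always or madvise. */
	const bool inherit_never = !(s->global_always || s->global_madvise);

	/* Orders set to never directly are always dropped ... */
	orders &= ~never;

	/* ... as are orders that inherit a global never. */
	if (inherit_never)
		orders &= ~s->inherit;

	/* Callers such as MADV_COLLAPSE stop here: never is the only veto. */
	if (!enforce_sysfs)
		return orders;

	if (vm_hugepage)
		return orders & (s->always | s->madvise | s->inherit);

	if (s->global_always)
		return orders & (s->always | s->inherit);

	return orders & s->always;
}
```

With every bitmap zero (all sizes set to never), the function returns 0
for any requested order even when enforce_sysfs is false, which is the
"never means never" case. With only the madvise bit set for an order and
enforce_sysfs false, that order survives without VM_HUGEPAGE, matching
the madvise rule quoted in the commit message.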