From: Dev Jain <dev.jain@arm.com>
Date: Wed, 18 Dec 2024 15:04:37 +0530
Subject: Re: [RFC PATCH 10/12] khugepaged: Skip PTE range if a larger mTHP is already mapped
To: Ryan Roberts, akpm@linux-foundation.org, david@redhat.com, willy@infradead.org, kirill.shutemov@linux.intel.com
Cc: anshuman.khandual@arm.com, catalin.marinas@arm.com, cl@gentwo.org, vbabka@suse.cz, mhocko@suse.com, apopple@nvidia.com, dave.hansen@linux.intel.com, will@kernel.org, baohua@kernel.org, jack@suse.cz, srivatsa@csail.mit.edu, haowenchao22@gmail.com,
 hughd@google.com, aneesh.kumar@kernel.org, yang@os.amperecomputing.com, peterx@redhat.com, ioworker0@gmail.com, wangkefeng.wang@huawei.com, ziy@nvidia.com, jglisse@google.com, surenb@google.com, vishal.moola@gmail.com, zokeefe@google.com, zhengqi.arch@bytedance.com, jhubbard@nvidia.com, 21cnbao@gmail.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org
References: <20241216165105.56185-1-dev.jain@arm.com> <20241216165105.56185-11-dev.jain@arm.com> <7cc1840b-6f6c-4f82-86b8-41bb6fbc1b81@arm.com>
In-Reply-To: <7cc1840b-6f6c-4f82-86b8-41bb6fbc1b81@arm.com>

On 18/12/24 1:06 pm, Ryan Roberts wrote:
> On 16/12/2024 16:51, Dev Jain wrote:
>> We may hit a situation wherein a larger folio is already mapped. It is
>> incorrect to go ahead with the collapse, since some of its pages would be
>> unmapped, leading to the entire folio getting unmapped. Therefore, skip the
>> corresponding range.
>>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>> In the future, if it is ever required that all the folios in the system be
>> of a specific order, we may split these larger folios.
>>
>>  mm/khugepaged.c | 31 +++++++++++++++++++++++++++++++
>>  1 file changed, 31 insertions(+)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index 8040b130e677..47e7c476b893 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -33,6 +33,7 @@ enum scan_result {
>>  	SCAN_PMD_NULL,
>>  	SCAN_PMD_NONE,
>>  	SCAN_PMD_MAPPED,
>> +	SCAN_PTE_MAPPED,
>>  	SCAN_EXCEED_NONE_PTE,
>>  	SCAN_EXCEED_SWAP_PTE,
>>  	SCAN_EXCEED_SHARED_PTE,
>> @@ -609,6 +610,11 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>>  		folio = page_folio(page);
>>  		VM_BUG_ON_FOLIO(!folio_test_anon(folio), folio);
>>
>> +		if (order != HPAGE_PMD_ORDER && folio_order(folio) >= order) {
>> +			result = SCAN_PTE_MAPPED;
>> +			goto out;
>> +		}
>> +
>>  		/* See hpage_collapse_scan_ptes(). */
>>  		if (folio_likely_mapped_shared(folio)) {
>>  			++shared;
>> @@ -1369,6 +1375,7 @@ static int hpage_collapse_scan_ptes(struct mm_struct *mm,
>>  	unsigned long orders;
>>  	pte_t *pte, *_pte;
>>  	spinlock_t *ptl;
>> +	int found_order;
>>  	pmd_t *pmd;
>>  	int order;
>>
>> @@ -1467,6 +1474,24 @@ static int hpage_collapse_scan_ptes(struct mm_struct *mm,
>>  			goto out_unmap;
>>  		}
>>
>> +		found_order = folio_order(folio);
>> +
>> +		/*
>> +		 * No point in scanning further. Two options: if this folio
>> +		 * was hit somewhere in the middle of the scan, then drop down
>> +		 * the order. Or, completely skip till the end of this folio.
>> +		 * The latter gives us a higher order to start with, with at
>> +		 * most 1 << order PTEs not collapsed; the former may force us
>> +		 * to end up going below order 2 and exiting.
>> +		 */
>> +		if (order != HPAGE_PMD_ORDER && found_order >= order) {
>> +			result = SCAN_PTE_MAPPED;
>> +			_address += (PAGE_SIZE << found_order);
>> +			_pte += (1UL << found_order);
>> +			pte_unmap_unlock(pte, ptl);
>> +			goto decide_order;
>> +		}
>
> It would be good if you could spell out the desired policy when khugepaged
> hits partially unmapped large folios and unaligned large folios. I think the
> simple approach is to always collapse them to fully mapped, aligned folios,
> even if the resulting order is smaller than the original. But I'm not sure
> that's definitely going to always be the best thing.
>
> Regardless, I'm struggling to understand the logic in this patch. Taking the
> order of a folio based on having hit one of its pages says nothing about
> whether the whole of that folio is mapped, nor about its alignment. And it's
> not clear to me how we would get to a situation where we are scanning for a
> lower order and find a (fully mapped, aligned) folio of higher order in the
> first place.
>
> Let's assume the desired policy is that khugepaged should always collapse to
> naturally aligned large folios. If there happens to be an existing aligned
> order-4 folio that is fully mapped, we will identify that for collapse as
> part of the scan for order-4. At that point, we should just notice that it
> is already an aligned order-4 folio and bypass collapse. Of course we may
> have already chosen to collapse it into a higher order, but we should
> definitely not get to a lower order before we notice it.
>
> Hmm... I guess if the sysfs thp settings have been changed then things could
> get spicy... if order-8 was previously enabled and we have an order-8 folio,
> then it gets disabled and khugepaged is scanning for order-4 (which is still
> enabled) then hits the order-8; what's the expected policy? Rework it into 16
> order-4 folios or leave it as a single order-8?
Exactly; sorry, I should have made it clear in the patch description that I am
handling the following scenario: there is a long-running system on which we
are using order-8 folios, and now we decide to downgrade to order-4. Is it a
good idea to take the pain of splitting each order-8 folio into 16 order-4
folios? This should be a rare situation in the first place, so I have
currently decided to ignore the folios set up by the previous sysfs setting
and only focus on collapsing fresh memory. Thinking again, a sysadmin deciding
to downgrade the folio order would presumably do so in the hope of reducing
internal fragmentation, increasing swap speed, etc., so it does make sense to
shatter large folios... maybe we can have a sysfs tunable for this?

>
>> +
>>  		/*
>>  		 * We treat a single page as shared if any part of the THP
>>  		 * is shared. "False negatives" from
>> @@ -1550,6 +1575,10 @@ static int hpage_collapse_scan_ptes(struct mm_struct *mm,
>>  			if (_address == org_address + (PAGE_SIZE << HPAGE_PMD_ORDER))
>>  				goto out;
>>  		}
>> +		/* A larger folio was mapped; it will be skipped in the next iteration */
>> +		if (result == SCAN_PTE_MAPPED)
>> +			goto decide_order;
>> +
>>  		if (result != SCAN_SUCCEED) {
>>
>>  			/* Go to the next order. */
>>  			order = next_order(&orders, order);
>>  			if (!order)
>>  				goto out;
>>  			goto maybe_mmap_lock;
>>  		} else {
>> +
>> +decide_order:
>>  			address = _address;
>>  			pte = _pte;
>>
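
To make the shattering idea a bit more concrete, here is a rough, untested
sketch of what such a path could look like if a tunable existed. The knob
(khugepaged_shatter_large_folios) and the helper name are hypothetical and
not part of this series; only split_folio() and the folio helpers are real:

/*
 * Hypothetical sketch only: if a "shatter" tunable existed, the scan
 * could split an over-sized folio back to order 0 and let subsequent
 * scans re-collapse the pieces to the currently enabled order.
 */
static bool khugepaged_shatter_large_folios;	/* hypothetical sysfs knob */

static int maybe_shatter_folio(struct folio *folio, int order)
{
	int ret;

	/* Already small enough for the target order: nothing to do. */
	if (folio_order(folio) <= order)
		return 0;

	if (!khugepaged_shatter_large_folios)
		return -EBUSY;	/* caller falls back to skipping the range */

	/*
	 * split_folio() wants a locked folio with a reference held; on
	 * success the remaining head folio is order 0 and still locked.
	 * This would have to run without the PTL held, since locking a
	 * folio can sleep.
	 */
	folio_get(folio);
	folio_lock(folio);
	ret = split_folio(folio);
	folio_unlock(folio);
	folio_put(folio);
	return ret;
}

Whether to split or skip by default is exactly the policy question you
raised; a tunable defaulting to "skip" would preserve the behaviour of this
patch on systems that never change their sysfs settings.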