Message-ID: <5bd47006-a38f-4451-8a74-467ddc5f61e1@gmail.com>
Date: Mon, 9 Jun 2025 15:11:27 +0100
Subject: Re: [RFC] mm: khugepaged: use largest enabled hugepage order for min_free_kbytes
From: Usama Arif
To: Zi Yan, lorenzo.stoakes@oracle.com, david@redhat.com
Cc: Andrew Morton, linux-mm@kvack.org, hannes@cmpxchg.org, shakeel.butt@linux.dev, riel@surriel.com, baolin.wang@linux.alibaba.com, Liam.Howlett@oracle.com, npache@redhat.com, ryan.roberts@arm.com, dev.jain@arm.com, hughd@google.com, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, kernel-team@meta.com, Juan Yescas, Breno Leitao
References: <20250606143700.3256414-1-usamaarif642@gmail.com> <35A3819F-C8EE-48DB-8EB4-093C04DEF504@nvidia.com> <18BEDC9A-77D2-4E9B-BF5A-90F7C789D535@nvidia.com>
In-Reply-To: <18BEDC9A-77D2-4E9B-BF5A-90F7C789D535@nvidia.com>

On 09/06/2025 14:19, Zi Yan wrote:
> On 9 Jun 2025, at 7:13, Usama Arif wrote:
>
>> On 06/06/2025 17:10, Zi Yan wrote:
>>> On 6 Jun 2025, at 11:38, Usama Arif wrote:
>>>
>>>> On 06/06/2025 16:18, Zi Yan wrote:
>>>>> On 6 Jun 2025, at 10:37, Usama Arif
>>>>> wrote:
>>>>>
>>>>>> On arm64 machines with 64K PAGE_SIZE, the min_free_kbytes and hence the
>>>>>> watermarks are evaluated to extremely high values, e.g. for a server with
>>>>>> 480G of memory, only 2M mTHP hugepage size set to madvise, with the rest
>>>>>> of the sizes set to never, the min, low and high watermarks evaluate to
>>>>>> 11.2G, 14G and 16.8G respectively.
>>>>>> In contrast for 4K PAGE_SIZE of the same machine, with only 2M THP hugepage
>>>>>> size set to madvise, the min, low and high watermarks evaluate to 86M, 566M
>>>>>> and 1G respectively.
>>>>>> This is because set_recommended_min_free_kbytes is designed for PMD
>>>>>> hugepages (pageblock_order = min(HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER)).
>>>>>> Such high watermark values can cause performance and latency issues in
>>>>>> memory bound applications on arm servers that use 64K PAGE_SIZE, even though
>>>>>> most of them would never actually use a 512M PMD THP.
>>>>>>
>>>>>> Instead of using HPAGE_PMD_ORDER for pageblock_order use the highest large
>>>>>> folio order enabled in set_recommended_min_free_kbytes.
>>>>>> With this patch, when only 2M THP hugepage size is set to madvise for the
>>>>>> same machine with 64K page size, with the rest of the sizes set to never,
>>>>>> the min, low and high watermarks evaluate to 2.08G, 2.6G and 3.1G
>>>>>> respectively. When 512M THP hugepage size is set to madvise for the same
>>>>>> machine with 64K page size, the min, low and high watermarks evaluate to
>>>>>> 11.2G, 14G and 16.8G respectively, the same as without this patch.
>>>>>
>>>>> Getting pageblock_order involved here might be confusing. I think you just
>>>>> want to adjust min, low and high watermarks to reasonable values.
>>>>> Is it OK to rename min_thp_pageblock_nr_pages to min_nr_free_pages_per_zone
>>>>> and move MIGRATE_PCPTYPES * MIGRATE_PCPTYPES inside? Otherwise, the changes
>>>>> look reasonable to me.
>>>>
>>>> Hi Zi,
>>>>
>>>> Thanks for the review!
>>>>
>>>> I forgot to change it in another place, sorry about that! So can't move
>>>> MIGRATE_PCPTYPES * MIGRATE_PCPTYPES into the combined function.
>>>> Have added the additional place where min_thp_pageblock_nr_pages() is called
>>>> as a fixlet here:
>>>> https://lore.kernel.org/all/a179fd65-dc3f-4769-9916-3033497188ba@gmail.com/
>>>>
>>>> I think at least in this context the original name pageblock_nr_pages isn't
>>>> correct as it's min(HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER).
>>>> The new name min_thp_pageblock_nr_pages is also not really good, so happy
>>>> to change it to something appropriate.
>>>
>>> Got it. pageblock is the defragmentation granularity. If user only wants
>>> 2MB mTHP, maybe pageblock order should be adjusted. Otherwise,
>>> kernel will defragment at 512MB granularity, which might not be efficient.
>>> Maybe make pageblock_order a boot time parameter?
>>>
>>> In addition, we are mixing two things together:
>>> 1. min, low, and high watermarks: they affect when memory reclaim and compaction
>>>    will be triggered;
>>> 2. pageblock order: it is the granularity of defragmentation for creating
>>>    mTHP/THP.
>>>
>>> In your use case, you want to lower watermarks, right? Considering what you
>>> said below, I wonder if we want a way of enforcing vm.min_free_kbytes,
>>> like a new sysctl knob, vm.force_min_free_kbytes (yeah the suggestion
>>> is lame, sorry).
>>>
>>> I think for 2, we might want to decouple pageblock order from defragmentation
>>> granularity.
>>>
>>
>> This is a good point. I only did it for the watermarks in the RFC, but there
>> is no reason for the defrag granularity to be 512M chunks, and doing so is
>> probably very inefficient?
>>
>> Instead of replacing the pageblock_nr_pages for just set_recommended_min_free_kbytes,
>> maybe we just need to change the definition of pageblock_order in [1] to take into
>> account the highest large folio order enabled instead of HPAGE_PMD_ORDER?
>
> Ideally, yes.
> But pageblock migratetypes are stored in a fixed size array
> determined by pageblock_order at boot time (see usemap_size() in mm/mm_init.c).
> Changing pageblock_order at runtime means we will need to resize pageblock
> migratetypes array, which is a little unrealistic. In a system with GBs or TBs
> memory, reducing pageblock_order by 1 means doubling pageblock migratetypes
> array and replicating one pageblock migratetypes to two; increasing pageblock
> order by 1 means halving the array and splitting a pageblock into two.
> The former, if memory is enough, might be easy, but the latter is a little
> involved, since for a pageblock with both movable and unmovable pages,
> you will need to check all pages to decide the migratetypes of the after-split
> pageblocks to make sure pageblock migratetype matches the pages inside that
> pageblock.
>

Thanks for explaining this so well, and for the code pointer!

Yeah, it doesn't seem reasonable to change the size of pageblock_flags at runtime.

>
>>
>> [1] https://elixir.bootlin.com/linux/v6.15.1/source/include/linux/pageblock-flags.h#L50
>>
>> I really want to avoid coming up with a solution that requires changing a Kconfig or needs
>> kernel commandline to change. It would mean a reboot whenever a different workload
>> runs on a server that works optimally with a different THP size, and that would make
>> workload orchestration a nightmare.
>>
>
> As I said above, changing pageblock order at runtime might not be easy. But
> changing defragmentation granularity should be fine, since it just changes
> the range of memory compaction. That is the reason of my proposal,
> decoupling pageblock order from defragmentation granularity. We probably
> need to do some experiments to see the impact of the decoupling, as I
> imagine defragmenting a range smaller than pageblock order is fine, but
> defragmenting a range larger than pageblock order might cause issues
> if there is any unmovable pageblock within that range.
> Since it is very likely
> unmovable pages reside in an unmovable pageblock and lead to a defragmentation
> failure.
>
>

I saw you mention a proposal to decouple pageblock order from defrag granularity
in one of the other replies as well. I just wanted to check whether you had sent
anything to lore, as a proposal or RFC, that I could look at.

So I guess the question is what the next step should be. The following has been
discussed:

- Changing pageblock_order at runtime: this seems unreasonable after Zi's
  explanation above, and might have unintended consequences if done at runtime,
  so a no go?
- Decoupling only the watermark calculation and defrag granularity from pageblock
  order (also from Zi). The two can be decoupled separately. Watermark calculation
  can be decoupled using the approach taken in this RFC, although the max order
  used by pagecache still needs to be addressed.

> --
> Best Regards,
> Yan, Zi