Message-ID: <35c82f60-0090-4d23-bb83-9898393cf928@gmail.com>
Date: Mon, 9 Jun 2025 21:03:25 +0100
Subject: Re: [RFC] mm: khugepaged: use largest enabled hugepage order for min_free_kbytes
From: Usama Arif <usamaarif642@gmail.com>
To: Zi Yan, Lorenzo Stoakes
Cc: david@redhat.com, Andrew Morton, linux-mm@kvack.org, hannes@cmpxchg.org, shakeel.butt@linux.dev, riel@surriel.com, baolin.wang@linux.alibaba.com, Liam.Howlett@oracle.com, npache@redhat.com, ryan.roberts@arm.com, dev.jain@arm.com, hughd@google.com, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, kernel-team@meta.com, Juan Yescas, Breno Leitao
In-Reply-To: <2338896F-7F86-4F5A-A3CC-D14459B8F227@nvidia.com>
References: <35A3819F-C8EE-48DB-8EB4-093C04DEF504@nvidia.com> <18BEDC9A-77D2-4E9B-BF5A-90F7C789D535@nvidia.com> <5bd47006-a38f-4451-8a74-467ddc5f61e1@gmail.com> <0a746461-16f3-4cfb-b1a0-5146c808e354@lucifer.local> <61da7d25-f115-4be3-a09f-7696efe7f0ec@lucifer.local> <2338896F-7F86-4F5A-A3CC-D14459B8F227@nvidia.com>
On 09/06/2025 20:49, Zi Yan wrote:
> On 9 Jun 2025, at 15:40, Lorenzo Stoakes wrote:
>
>> On Mon, Jun 09, 2025 at 11:20:04AM -0400, Zi Yan wrote:
>>> On 9 Jun 2025, at 10:50, Lorenzo Stoakes wrote:
>>>
>>>> On Mon, Jun 09, 2025 at 10:37:26AM -0400, Zi Yan wrote:
>>>>> On 9 Jun 2025, at 10:16, Lorenzo Stoakes wrote:
>>>>>
>>>>>> On Mon, Jun 09, 2025 at 03:11:27PM +0100, Usama Arif wrote:
>>>>
>>>> [snip]
>>>>
>>>>>>> So I guess the question is what should be the next step? The following has been discussed:
>>>>>>>
>>>>>>> - Changing pageblock_order at runtime: This seems unreasonable after Zi's explanation above
>>>>>>> and might have unintended consequences if done at runtime, so a no go?
>>>>>>> - Decouple only watermark calculation and defrag granularity from pageblock order (also from Zi).
>>>>>>> The decoupling can be done separately. Watermark calculation can be decoupled using the
>>>>>>> approach taken in this RFC. Although max order used by pagecache needs to be addressed.
>>>>>>>
>>>>>>
>>>>>> I need to catch up with the thread (workload crazy atm), but why isn't it
>>>>>> feasible to simply statically adjust the pageblock size?
>>>>>>
>>>>>> The whole point of 'defragmentation' is to _heuristically_ make it less
>>>>>> likely there'll be fragmentation when requesting page blocks.
>>>>>>
>>>>>> And the watermark code is explicitly about providing reserves at a
>>>>>> _pageblock granularity_.
>>>>>>
>>>>>> Why would we want to 'defragment' to 512MB physically contiguous chunks
>>>>>> that we rarely use?
>>>>>>
>>>>>> Since it's all heuristic, it seems reasonable to me to cap it at a sensible
>>>>>> level no?
>>>>>
>>>>> What is a sensible level? 2MB is a good starting point. If we cap pageblock
>>>>> at 2MB, everyone should be happy at the moment. But if one user wants to
>>>>> allocate 4MB mTHP, they will most likely fail miserably, because pageblock
>>>>> is 2MB, and the kernel is OK with having a 2MB MIGRATE_MOVABLE pageblock next
>>>>> to a 2MB MIGRATE_UNMOVABLE one, making defragmenting 4MB an impossible job.
>>>>>
>>>>> Defragmentation has two components: 1) pageblock, which has migratetypes
>>>>> to prevent mixing movable and unmovable pages, as a single unmovable page
>>>>> blocks large free pages from being created; 2) memory compaction granularity,
>>>>> which is the actual work to move pages around and form large free pages.
>>>>> Currently, the kernel assumes pageblock size = defragmentation granularity,
>>>>> but in reality, as long as pageblock size >= defragmentation granularity,
>>>>> memory compaction would still work, but not the other way around. So we
>>>>> need to choose pageblock size carefully to not break memory compaction.
>>>>
>>>> OK I get it - the issue is that compaction itself operates at a pageblock
>>>> granularity, and once you get so fragmented that compaction is critical to
>>>> defragmentation, you are stuck if the pageblock is not big enough.
>>>
>>> Right.
>>>
>>>>
>>>> Thing is, 512MB pageblock size for compaction seems insanely inefficient in
>>>> itself, and if we're complaining about issues with unavailable reserved
>>>> memory due to crazy PMD size, surely one will encounter the compaction
>>>> process simply failing to succeed/taking forever/causing issues with
>>>> reclaim/higher order folio allocation.
>>>
>>> Yep. Initially, we probably never thought PMD THP would be as large as
>>> 512MB.
>>
>> Of course, such is the 'organic' nature of kernel development :)
>>
>>>
>>>> I mean, I don't really know the compaction code _at all_ (ran out of time
>>>> to cover in book ;), but is it all-or-nothing? Does it grab a pageblock or
>>>> give up?
>>>
>>> compaction works on one pageblock at a time, trying to migrate in-use pages
>>> within the pageblock away to create a free page for THP allocation.
>>> It assumes PMD THP size is equal to pageblock size. It will keep working
>>> until a PMD THP size free page is created. This is a very high level
>>> description, omitting a lot of details like how to avoid excessive compaction
>>> work, how to reduce compaction latency.
>>
>> Yeah this matches my assumptions.
>>
>>>
>>>> Because it strikes me that a crazy pageblock size would cause really
>>>> serious system issues on that basis alone if that's the case.
>>>>
>>>> And again this leads me back to thinking it should just be the page block
>>>> size _as a whole_ that should be adjusted.
>>>>
>>>> Keep in mind a user can literally reduce the page block size already via
>>>> CONFIG_PAGE_BLOCK_MAX_ORDER.
>>>>
>>>> To me it seems that we should cap it at the highest _reasonable_ mTHP size
>>>> you can get on a 64KB (i.e. maximum right? RIGHT? :P) base page size
>>>> system.
>>>>
>>>> That way, people _can still get_ super huge PMD sized huge folios up to the
>>>> point of fragmentation.
>>>>
>>>> If we do reduce things this way we should give a config option to allow
>>>> users who truly want colossal PMD sizes with associated
>>>> watermarks/compaction to be able to still have it.
>>>>
>>>> CONFIG_PAGE_BLOCK_HARD_LIMIT_MB or something?
>>>
>>> I agree with capping pageblock size at the highest reasonable mTHP size.
>>> In case there is some user relying on this huge PMD THP, making
>>> pageblock a boot time variable might be a little better, since
>>> they do not need to recompile the kernel for their need, assuming
>>> distros will pick something like 2MB as the default pageblock size.
>>
>> Right, this seems sensible, as long as we set a _default_ that limits to
>> whatever it would be, 2MB or such.
>>
>> I don't think it's unreasonable to make that change since this 512 MB thing
>> is so entirely unexpected and unusual.
>>
>> I think Usama said it would be a pain working this way if it had to be
>> explicitly set as a boot time variable without defaulting like this.
>>
>>>
>>>> I also question this de-coupling in general (I may be missing something
>>>> however!) - the watermark code _very explicitly_ refers to providing
>>>> _pageblocks_ in order to ensure _defragmentation_ right?
>>>
>>> Yes. Since without enough free memory (bigger than a PMD THP),
>>> memory compaction will just do useless work.
>>
>> Yeah right, so this is a key thing and why we need to rework the current
>> state of the patch.
>>
>>>
>>>> We would need to absolutely justify why it's suddenly ok to not provide
>>>> page blocks here.
>>>>
>>>> This is very very delicate code we have to be SO careful about.
>>>>
>>>> This is why I am being cautious here :)
>>>
>>> Understood.
>>> In theory, we can associate watermarks with THP allowed orders
>>> the other way around too, meaning if the user lowers vm.min_free_kbytes,
>>> all THP/mTHP sizes bigger than the watermark threshold are disabled
>>> automatically. This could fix the memory compaction issues, but
>>> that might also drive users crazy as they cannot use the THP sizes
>>> they want.
>>
>> Yeah that's interesting but I think that's just far too subtle and people will
>> have no idea what's going on.
>>
>> I really think a hard cap, expressed in KB/MB, on pageblock size is the way to
>> go (but overrideable for people crazy enough to truly want 512 MB pages - and
>> who cannot then complain about watermarks).
>
> I agree. Basically, I am thinking:
> 1) use something like 2MB as the default pageblock size for all arches (the value
> can be set differently if some arch wants a different pageblock size due to other
> reasons); this can be done by modifying PAGE_BLOCK_MAX_ORDER's default
> value;
>
> 2) make pageblock_order a boot time parameter, so that users who want
> 512MB pages can still get it by changing pageblock order at boot time.
>
> WDYT?
>

I was really hoping we would come up with a dynamic way of doing this,
especially one that doesn't require any more input from the user apart from
just setting the mTHP size via sysfs.

1) in a way is already done. We can set it to 2M by setting
ARCH_FORCE_MAX_ORDER to 5. In arch/arm64/Kconfig we already have:

config ARCH_FORCE_MAX_ORDER
	int
	default "13" if ARM64_64K_PAGES
	default "11" if ARM64_16K_PAGES
	default "10"

Doing 2) would require a reboot, and doing this just to change the mTHP size
will probably be a nightmare for workload orchestration.