From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Wed, 29 Oct 2025 10:09:43 +0800
Subject: Re: [PATCH v12 mm-new 06/15] khugepaged: introduce collapse_max_ptes_none helper function
From: Baolin Wang <baolin.wang@linux.alibaba.com>
To: Lorenzo Stoakes, David Hildenbrand
Cc: Nico Pache, linux-kernel@vger.kernel.org, linux-trace-kernel@vger.kernel.org,
 linux-mm@kvack.org, linux-doc@vger.kernel.org, ziy@nvidia.com,
 Liam.Howlett@oracle.com, ryan.roberts@arm.com, dev.jain@arm.com,
 corbet@lwn.net, rostedt@goodmis.org, mhiramat@kernel.org,
 mathieu.desnoyers@efficios.com, akpm@linux-foundation.org, baohua@kernel.org,
 willy@infradead.org, peterx@redhat.com, wangkefeng.wang@huawei.com,
 usamaarif642@gmail.com, sunnanyong@huawei.com, vishal.moola@gmail.com,
 thomas.hellstrom@linux.intel.com, yang@os.amperecomputing.com,
 kas@kernel.org, aarcange@redhat.com, raquini@redhat.com,
 anshuman.khandual@arm.com, catalin.marinas@arm.com, tiwai@suse.de,
 will@kernel.org, dave.hansen@linux.intel.com, jack@suse.cz, cl@gentwo.org,
 jglisse@google.com, surenb@google.com, zokeefe@google.com,
 hannes@cmpxchg.org, rientjes@google.com, mhocko@suse.com,
 rdunlap@infradead.org, hughd@google.com, richard.weiyang@gmail.com,
 lance.yang@linux.dev, vbabka@suse.cz, rppt@kernel.org, jannh@google.com,
 pfalcato@suse.de
References: <20251022183717.70829-1-npache@redhat.com>
 <20251022183717.70829-7-npache@redhat.com>
 <5f8c69c1-d07b-4957-b671-b37fccf729f1@lucifer.local>
 <74583699-bd9e-496c-904c-ce6a8e1b42d9@redhat.com>
 <3dc6b17f-a3e0-4b2c-9348-c75257b0e7f6@lucifer.local>
In-Reply-To: <3dc6b17f-a3e0-4b2c-9348-c75257b0e7f6@lucifer.local>
Content-Type: text/plain; charset=UTF-8; format=flowed
On 2025/10/29 02:59, Lorenzo Stoakes wrote:
> On Tue, Oct 28, 2025 at 07:08:38PM +0100, David Hildenbrand wrote:
>>
>>>>> Hey Lorenzo,
>>>>>
>>>>>> I mean not to beat a dead horse re: v11 commentary, but I thought we
>>>>>> were going to implement David's idea re: the new 'eagerness' tunable,
>>>>>> and again we're now just implementing the capping at
>>>>>> HPAGE_PMD_NR/2 - 1 thing again?
>>>>>
>>>>> I spoke to David and he said to continue forward with this series; the
>>>>> "eagerness" tunable will take some time, and may require further
>>>>> considerations/discussion.
>>>>
>>>> Right, after talking to Johannes it got clearer that what we envisioned with
>>>
>>> I'm not sure that you meant to say go ahead with the series as-is with
>>> this silent capping?
>>
>> No, "go ahead" as in "let's find some way forward that works for all and
>> is not too crazy".
>
> Right, we clearly needed to discuss that further at the time but that's
> moot now, we're figuring it out now :)
>
>>
>> [...]
>>
>>>> "eagerness" would not be like swappiness, and we will really have to be
>>>> careful here. I don't know yet when I will have time to look into that.
>>>
>>> I guess I missed this part of the conversation, what do you mean?
>>
>> Johannes raised issues with that on the list and afterwards we had an
>> offline discussion about some of the details and why something
>> unpredictable is not good.
>
> Could we get these details on-list so we can discuss them? This doesn't
> have to be urgent, but I would like to have a say in this or at least be
> part of the conversation please.
>
>>
>>>
>>> The whole concept is that we have a parameter whose value is _abstracted_
>>> and which we control what it means.
>>>
>>> I'm not sure exactly why that would now be problematic? The fundamental
>>> concept seems sound, no? Last I remember of the conversation this was
>>> the case.
>>
>> The basic idea was to do something abstracted like swappiness. Turns out
>> "swappiness" is really something predictable, not something we can
>> randomly change how it behaves under the hood.
>>
>> So we'd have to find something similar for "eagerness", and that's where
>> it stops being easy.
>
> I think we shouldn't be too stuck on
>
>>
>>>
>>>>
>>>> If we want to avoid the implicit capping, I think there are the
>>>> following possible approaches
>>>>
>>>> (1) Tolerate creep for now, maybe warning if the user configures it.
>>>
>>> I mean this seems a viable option if there is pressure to land this
>>> series before we have a viable uAPI for configuring this.
>>>
>>> A part of me thinks we shouldn't rush series in for that reason though
>>> and should require that we have a proper control here.
>>>
>>> But I guess this approach is the least-worst as it leaves us with the
>>> most options moving forwards.
>>
>> Yes. There is also the alternative of respecting only 0 / 511 for mTHP
>> collapse for now as discussed in the other thread.
>
> Yes, I guess let's carry that on over there.
>
> I mean this is why I said it's better to try to keep things in one
> thread :) but anyway, we've forked and it can't be helped now.
>
> To be clear that was a criticism of - email development - not you.
>
> It's _extremely easy_ to have this happen because one thread naturally
> leads to a broader discussion of a given topic, whereas another has
> questions from somebody else about the same topic, to which people reply
> and then... you have a fork and it can't be helped.
>
> I guess I'm saying it'd be good if we could say 'ok let's move this to X'.
>
> But that's also broken in its own way, you can't stop people from replying
> in the other thread still and yeah. It's a limitation of this model :)
>
>>
>>>
>>>> (2) Avoid creep by counting zero-filled pages towards none_or_zero.
>>>
>>> Would this really make all that much difference?
>>
>> It solves the creep problem I think, but it's a bit nasty IMHO.
>
> Ah, because you'd end up with a bunch of zeroed pages from the prior mTHP
> collapses, interesting...
>
> Scanning for that does seem a bit nasty though, yes...
>
>>
>>>
>>>> (3) Have separate toggles for each THP size. Doesn't quite solve the
>>>> problem, only shifts it.
>>>
>>> Yeah, I did wonder about this as an alternative solution. But of course
>>> it then makes it vague what the parent value means in respect of the
>>> individual levels, unless we have an 'inherit' mode there too (possible).
>>>
>>> It's going to be confusing though as max_ptes_none sits at the root
>>> khugepaged/ level and I don't think any other parameter from khugepaged/
>>> is exposed at individual page size levels.
>>>
>>> And of course doing this means we
>>>
>>>>
>>>> Anything else?
>>>
>>> Err... I mean I'm not sure if you missed it but I suggested an approach
>>> in the sub-thread - exposing mthp_max_ptes_none as a _READ-ONLY_ field at:
>>>
>>> /sys/kernel/mm/transparent_hugepage/khugepaged/max_mthp_ptes_none
>>>
>>> Then we allow the capping, but simply document that we specify what the
>>> capped value will be here for mTHP.
>>
>> I did not have time to read the details on that so far.
>
> OK. It is a bit nasty, yes. The idea is to find something that allows the
> capping to work.
>
>>
>> It would be one solution forward. I dislike it because I think the whole
>> capping is an intermediate thing that can be (and likely must be, when
>> considering mTHP underused shrinking I think) solved in the future
>> differently. That's why I would prefer adding this only if there is no
>> other, simpler, way forward.
>
> Yes, I agree that if we could avoid it it'd be great.
>
> Really I proposed this solution on the basis that we were somehow OK with
> the capping.
>
> If we can avoid it, that'd be ideal as it reduces complexity and
> 'unexpected' behaviour.
>
> We'll clarify on the other thread, but the 511/0 was compelling to me
> before as a simplification, and if we can have a straightforward model of
> how mTHP collapse across none/zero page PTEs behaves, this is ideal.
>
> The only question is w.r.t. warnings etc. but we can handle details there.
>
>>
>>>
>>> That struck me as the simplest way of getting this series landed without
>>> necessarily violating any future eagerness which:
>>>
>>> a. Must still support khugepaged/max_ptes_none - we aren't getting away
>>> from this, it's uAPI.
>>>
>>> b. Surely must want to do different things for mTHP in eagerness, so if
>>> we're exposing some PTE value in max_ptes_none doing so in
>>> khugepaged/mthp_max_ptes_none wouldn't be problematic (note again - it's
>>> read-only so unlike max_ptes_none we don't have to worry about the other
>>> direction).
>>>
>>> HOWEVER, eagerness might want to change this behaviour per-mTHP size, in
>>> which case perhaps mthp_max_ptes_none would be problematic in that it is
>>> some kind of average.
>>>
>>> Then again we could always revert to putting this parameter as in (3) in
>>> that case, ugly but kinda viable.
>>>
>>>>
>>>> IIUC, creep is less of a problem when we have the underused shrinker
>>>> enabled: whatever we over-allocated can (unless long-term pinned etc.)
>>>> get reclaimed again.
>>>>
>>>> So maybe having underused-shrinker support for mTHP as well would be a
>>>> solution to tackle (1) later?
>>>
>>> How viable is this in the short term?
>>
>> I once started looking into it, but it will require quite some work,
>> because the lists will essentially include each and every (m)THP in the
>> system ... so I think we will need some redesign.
>
> Ack.
>
> This aligns with non-0/511 settings being non-functional for mTHP atm
> anyway.
>
>>
>>>
>>> Another possible solution:
>>>
>>> If mthp_max_ptes_none is not workable, we could have a toggle at, e.g.:
>>>
>>> /sys/kernel/mm/transparent_hugepage/khugepaged/mthp_cap_collapse_none
>>>
>>> As a simple boolean. If switched on then we document that it caps mTHP
>>> as per Nico's suggestion.
>>>
>>> That way we avoid the 'silent' issue I have with all this and it's an
>>> explicit setting.
>>
>> Right, but it's another toggle I wish we wouldn't need.
>> We could of course also make it some compile-time option, but not sure if
>> that's really any better.
>>
>> I'd hope we find an easy way forward that doesn't require new toggles, at
>> least for now ...
>
> Right, well I agree: if we can make this 0/511 thing work, let's do that.
>
> Toggles are just 'least worst' workarounds on the assumption of the need
> for capping.

I finally finished reading through the discussions across multiple
threads :), and it looks like we've reached a preliminary consensus (make
0/511 work). Great, and thanks!

IIUC, the strategy is: configuring it to 511 means mTHP collapse is always
enabled, configuring it to 0 means mTHP is collapsed only if all PTEs are
non-none/zero, and for any other value we issue a warning and prohibit mTHP
collapse (avoiding Lorenzo's concern about silently changing max_ptes_none).

Then the implementation for collapse_max_ptes_none() should be as follows:

static int collapse_max_ptes_none(unsigned int order, bool full_scan)
{
	/* Ignore max_ptes_none limits */
	if (full_scan)
		return HPAGE_PMD_NR - 1;

	if (order == HPAGE_PMD_ORDER)
		return khugepaged_max_ptes_none;

	/*
	 * To prevent creeping towards larger-order collapses for mTHP
	 * collapse, we restrict khugepaged_max_ptes_none to only 511 or 0,
	 * simplifying the logic. This means:
	 * max_ptes_none == 511 -> collapse mTHP always
	 * max_ptes_none == 0   -> collapse mTHP only if all PTEs are
	 *                         non-none/zero
	 */
	if (!khugepaged_max_ptes_none ||
	    khugepaged_max_ptes_none == HPAGE_PMD_NR - 1)
		return khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);

	pr_warn_once("mTHP collapse only supports khugepaged_max_ptes_none configured as 0 or %d\n",
		     HPAGE_PMD_NR - 1);
	return -EINVAL;
}

So what do you think?