From mboxrd@z Thu Jan 1 00:00:00 1970
From: Yafang Shao <laoar.shao@gmail.com>
Date: Sun, 20 Jul 2025 10:32:50 +0800
Subject: Re: [RFC PATCH v3 0/5] mm, bpf: BPF based THP adjustment
To: David Hildenbrand, Alexei Starovoitov
Cc: Matthew Wilcox, akpm@linux-foundation.org, ziy@nvidia.com, baolin.wang@linux.alibaba.com, lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com, npache@redhat.com, ryan.roberts@arm.com, dev.jain@arm.com, hannes@cmpxchg.org, usamaarif642@gmail.com, gutierrez.asier@huawei-partners.com, ast@kernel.org, daniel@iogearbox.net, andrii@kernel.org, bpf@vger.kernel.org, linux-mm@kvack.org
In-Reply-To: <9bc57721-5287-416c-aa30-46932d605f63@redhat.com>
References: <20250608073516.22415-1-laoar.shao@gmail.com> <9bc57721-5287-416c-aa30-46932d605f63@redhat.com>
Content-Type: text/plain; charset="UTF-8"
On Thu, Jul 17, 2025 at 4:52 PM David Hildenbrand wrote:
>
> On 17.07.25 05:09, Yafang Shao wrote:
> > On Wed, Jul 16, 2025 at 6:42 AM David
> > Hildenbrand wrote:
> >>
> >> On 08.06.25 09:35, Yafang Shao wrote:
> >>
> >> Sorry for not replying earlier, I was caught up with all other stuff.
> >>
> >> I still consider this a very interesting approach, although I think we
> >> should think more about what a reasonable policy would look like
> >> medium-term (in particular, multiple THP sizes, not always falling back
> >> to small pages if it means splitting excessively in the buddy etc.)
> >
> > I find it difficult to understand why we introduced the mTHP sysfs
> > knobs instead of implementing automatic THP size switching within the
> > kernel. I'm skeptical about their practical utility in real-world
> > workloads.
> >
> > In contrast, XFS large folios (a.k.a. file THP) can automatically select
> > orders between 0 and 9. Based on our verification, this feature has
> > proven genuinely useful for certain specific workloads, though it's not
> > yet perfect.
>
> I suggest you do some digging about the history of these toggles and the
> plans for the future (automatic); there has been plenty of talk about
> all that.
>
> [...]
>
> >>>
> >>> - THP allocator
> >>>
> >>>   int (*allocator)(unsigned long vm_flags, unsigned long tva_flags);
> >>>
> >>>   The BPF program returns either THP_ALLOC_CURRENT or THP_ALLOC_KHUGEPAGED,
> >>>   indicating whether THP allocation should be performed synchronously
> >>>   (current task) or asynchronously (khugepaged).
> >>>
> >>>   The decision is based on the current task context, VMA flags, and TVA
> >>>   flags.
> >>
> >> I think we should go one step further and actually get advice about the
> >> orders (THP sizes) to use. It might be helpful if the program would have
> >> access to system stats, to make an educated decision.
> >>
> >> Given page fault information and system information, the program could
> >> then decide which orders to try to allocate.
> >
> > Yes, that aligns with my thoughts as well.
> > For instance, we could
> > automate the decision-making process based on factors like PSI, memory
> > fragmentation, and other metrics. However, this logic could be
> > implemented within BPF programs; all we'd need is to extend the feature
> > by introducing a few kfuncs (also known as BPF helpers).
>
> We discussed this yesterday at a THP upstream meeting, and what we
> should look into is:
>
> (1) Having a callback like
>
>     unsigned int (*get_suggested_order)(.., bool in_pagefault);

This interface meets our needs precisely, enabling allocation orders of
either 0 or 9 as required by our workloads.

>
> Where we can provide some information about the fault (vma
> size/flags/anon_name), and whether we are in the page fault (or in
> khugepaged).
>
> Maybe we want a bitmap of orders to try (fallback), not sure yet.
>
> (2) Having some way to tag these callbacks as "this is absolutely
> unstable for now and can be changed as we please."

BPF already handles this for us, so we don't need to implement this
restriction separately: all BPF kfuncs (including struct_ops) are
currently unstable and may change in the future. Alexei, could you
confirm this understanding?

>
> One idea would be to use this mechanism as a way to easily prototype
> policies, and once we know that a policy works, start moving it into the
> core.
>
> In general, the core, without a BPF program, should be able to continue
> providing a sane default behavior.

Makes sense.

>
> >>
> >> That means, one would query during page faults and during khugepaged
> >> which order one should try -- compared to our current approach of "start
> >> with the largest order that is enabled and fits".
> >>
> >>>
> >>> - THP reclaimer
> >>>
> >>>   int (*reclaimer)(bool vma_madvised);
> >>>
> >>>   The BPF program returns either RECLAIMER_CURRENT or RECLAIMER_KSWAPD,
> >>>   determining whether memory reclamation is handled by the current task or
> >>>   kswapd.
> >>
> >> Not sure about that, will have to look into the details.
> >
> > Some workloads allocate all their memory during initialization and do
> > not require THP at runtime. For such cases, aggressively attempting
> > THP allocation is beneficial. However, other workloads may dynamically
> > allocate THP during execution; if these are latency-sensitive, we must
> > avoid introducing long allocation delays.
> >
> > Given these differing requirements, the global
> > /sys/kernel/mm/transparent_hugepage/defrag setting is insufficient.
> > Instead, we should implement per-workload defrag policies to better
> > optimize performance based on individual application behavior.
>
> We'll be very careful about the callbacks we will offer. Maybe the
> get_suggested_order() callback could itself make a decision and not
> suggest a high order if allocation would require compaction.
>
> Initially, we should keep it simple and see what other callbacks to add
> / how to extend get_suggested_order(), to cover these cases.

Yes, we can proceed by adding a simple get_suggested_order() and address
any remaining details in follow-up work.

--
Regards
Yafang