From mboxrd@z Thu Jan 1 00:00:00 1970
From: Yafang Shao <laoar.shao@gmail.com>
Date: Sun, 20 Jul 2025 10:32:50 +0800
Subject: Re: [RFC PATCH v3 0/5] mm, bpf: BPF based THP adjustment
To: David Hildenbrand, Alexei Starovoitov
Cc: Matthew Wilcox, akpm@linux-foundation.org, ziy@nvidia.com, baolin.wang@linux.alibaba.com, lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com, npache@redhat.com, ryan.roberts@arm.com, dev.jain@arm.com, hannes@cmpxchg.org, usamaarif642@gmail.com, gutierrez.asier@huawei-partners.com, ast@kernel.org, daniel@iogearbox.net, andrii@kernel.org, bpf@vger.kernel.org, linux-mm@kvack.org
In-Reply-To: <9bc57721-5287-416c-aa30-46932d605f63@redhat.com>
References: <20250608073516.22415-1-laoar.shao@gmail.com> <9bc57721-5287-416c-aa30-46932d605f63@redhat.com>
Content-Type: text/plain; charset="UTF-8"
On Thu, Jul 17, 2025 at 4:52 PM David Hildenbrand wrote:
>
> On 17.07.25 05:09, Yafang Shao wrote:
> > On Wed, Jul 16, 2025 at 6:42 AM David
> > Hildenbrand wrote:
> >>
> >> On 08.06.25 09:35, Yafang Shao wrote:
> >>
> >> Sorry for not replying earlier, I was caught up with all other stuff.
> >>
> >> I still consider this a very interesting approach, although I think we
> >> should think more about what a reasonable policy would look like
> >> medium-term (in particular, multiple THP sizes, not always falling back
> >> to small pages if it means splitting excessively in the buddy etc.)
> >
> > I find it difficult to understand why we introduced the mTHP sysfs
> > knobs instead of implementing automatic THP size switching within the
> > kernel. I'm skeptical about their practical utility in real-world
> > workloads.
> >
> > In contrast, XFS large folios (a.k.a. file THP) can automatically select
> > orders between 0 and 9. Based on our verification, this feature has
> > proven genuinely useful for certain specific workloads, though it's not
> > yet perfect.
>
> I suggest you do some digging about the history of these toggles and the
> plans for the future (automatic); there has been plenty of talk about
> all that.
>
> [...]
>
> >>>
> >>> - THP allocator
> >>>
> >>>   int (*allocator)(unsigned long vm_flags, unsigned long tva_flags);
> >>>
> >>>   The BPF program returns either THP_ALLOC_CURRENT or THP_ALLOC_KHUGEPAGED,
> >>>   indicating whether THP allocation should be performed synchronously
> >>>   (current task) or asynchronously (khugepaged).
> >>>
> >>>   The decision is based on the current task context, VMA flags, and TVA
> >>>   flags.
> >>
> >> I think we should go one step further and actually get advice about the
> >> orders (THP sizes) to use. It might be helpful if the program would have
> >> access to system stats, to make an educated decision.
> >>
> >> Given page fault information and system information, the program could
> >> then decide which orders to try to allocate.
> >
> > Yes, that aligns with my thoughts as well.
> > For instance, we could
> > automate the decision-making process based on factors like PSI, memory
> > fragmentation, and other metrics. However, this logic could be
> > implemented within BPF programs; all we'd need is to extend the feature
> > by introducing a few kfuncs (also known as BPF helpers).
>
> We discussed this yesterday at a THP upstream meeting, and what we
> should look into is:
>
> (1) Having a callback like
>
>     unsigned int (*get_suggested_order)(.., bool in_pagefault);

This interface meets our needs precisely, enabling allocation orders of
either 0 or 9 as required by our workloads.

>
> Where we can provide some information about the fault (vma
> size/flags/anon_name), and whether we are in the page fault (or in
> khugepaged).
>
> Maybe we want a bitmap of orders to try (fallback), not sure yet.
>
> (2) Having some way to tag these callbacks as "this is absolutely
> unstable for now and can be changed as we please."

BPF already handles this for us, so we don't need to implement this
restriction separately: all BPF kfuncs (including struct_ops) are
currently unstable and may change in the future. Alexei, could you
confirm this understanding?

>
> One idea would be to use this mechanism as a way to easily prototype
> policies, and once we know that a policy works, start moving it into the
> core.
>
> In general, the core, without a BPF program, should be able to continue
> providing a sane default behavior.

Makes sense.

>
> >>
> >> That means, one would query during page faults and during khugepaged
> >> which order one should try -- compared to our current approach of "start
> >> with the largest order that is enabled and fits".
> >>
> >>>
> >>> - THP reclaimer
> >>>
> >>>   int (*reclaimer)(bool vma_madvised);
> >>>
> >>>   The BPF program returns either RECLAIMER_CURRENT or RECLAIMER_KSWAPD,
> >>>   determining whether memory reclamation is handled by the current task or
> >>>   kswapd.
> >>
> >> Not sure about that, will have to look into the details.
> >
> > Some workloads allocate all their memory during initialization and do
> > not require THP at runtime. For such cases, aggressively attempting
> > THP allocation is beneficial. However, other workloads may dynamically
> > allocate THP during execution; if these are latency-sensitive, we must
> > avoid introducing long allocation delays.
> >
> > Given these differing requirements, the global
> > /sys/kernel/mm/transparent_hugepage/defrag setting is insufficient.
> > Instead, we should implement per-workload defrag policies to better
> > optimize performance based on individual application behavior.
>
> We'll be very careful about the callbacks we will offer. Maybe the
> get_suggested_order() callback could itself make a decision and not
> suggest a high order if allocation would require compaction.
>
> Initially, we should keep it simple and see what other callbacks to add
> / how to extend get_suggested_order(), to cover these cases.

Yes, we can proceed by adding a simple get_suggested_order() and address
any remaining details in follow-up work.

--
Regards
Yafang