From mboxrd@z Thu Jan 1 00:00:00 1970
From: Yafang Shao <laoar.shao@gmail.com>
Date: Thu, 17 Jul 2025 11:09:56 +0800
Subject: Re: [RFC PATCH v3 0/5] mm, bpf: BPF based THP adjustment
To: David Hildenbrand, Matthew Wilcox
Cc: akpm@linux-foundation.org, ziy@nvidia.com, baolin.wang@linux.alibaba.com,
 lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com, npache@redhat.com,
 ryan.roberts@arm.com, dev.jain@arm.com, hannes@cmpxchg.org,
 usamaarif642@gmail.com, gutierrez.asier@huawei-partners.com, ast@kernel.org,
 daniel@iogearbox.net, andrii@kernel.org, bpf@vger.kernel.org,
 linux-mm@kvack.org
References: <20250608073516.22415-1-laoar.shao@gmail.com>
Content-Type: text/plain; charset="UTF-8"
On Wed, Jul 16, 2025 at 6:42 AM David Hildenbrand wrote:
>
> On 08.06.25 09:35, Yafang Shao wrote:
>
> Sorry for not replying earlier, I was caught up with all other stuff.
>
> I still consider this a very interesting approach, although I think we
> should think more about what a reasonable policy would look like
> medium-term (in particular, multiple THP sizes, not always falling back
> to small pages if it means splitting excessively in the buddy etc.)

I find it difficult to understand why we introduced the mTHP sysfs knobs
instead of implementing automatic THP size switching within the kernel.
I'm skeptical about their practical utility in real-world workloads.

In contrast, XFS large folios (a.k.a. file THP) can automatically select
orders between 0 and 9. Based on our verification, this feature has
proven genuinely useful for certain specific workloads -- though it's
not yet perfect.

> > Background
> > ----------
> >
> > We have consistently configured THP to "never" on our production servers
> > due to past incidents caused by its behavior:
> >
> > - Increased memory consumption
> >   THP significantly raises overall memory usage.
> >
> > - Latency spikes
> >   Random latency spikes occur due to more frequent memory compaction
> >   activity triggered by THP.
> >
> > - Lack of Fine-Grained Control
> >   THP tuning knobs are globally configured, making them unsuitable for
> >   containerized environments. When different workloads run on the same
> >   host, enabling THP globally (without per-workload control) can cause
> >   unpredictable behavior.
> >
> > Due to these issues, system administrators remain hesitant to switch to
> > "madvise" or "always" modes -- unless finer-grained control over THP
> > behavior is implemented.
> >
> > New Motivation
> > --------------
> >
> > We have now identified that certain AI workloads achieve substantial
> > performance gains with THP enabled. However, we've also verified that
> > some workloads see little to no benefit -- or are even negatively
> > impacted -- by THP.
> >
> > In our Kubernetes environment, we deploy mixed workloads on a single
> > server to maximize resource utilization. Our goal is to selectively
> > enable THP for services that benefit from it while keeping it disabled
> > for others. This approach allows us to incrementally enable THP for
> > additional services and assess how to make it more viable in production.
> >
> > Proposed Solution
> > -----------------
> >
> > To enable fine-grained control over THP behavior, we propose dynamically
> > adjusting THP policies using BPF. This approach allows per-workload THP
> > tuning, providing greater flexibility and precision.
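
(To make this more concrete: such a policy would be loaded per workload as
a BPF program implementing the two callbacks quoted below. A minimal
sketch follows, assuming the struct_ops type is exposed to BPF as
"thp_adjust_ops" and that the THP_ALLOC_*/RECLAIMER_* constants are
visible via vmlinux.h -- the type name, section names and attach details
here are illustrative only; the authoritative interface is the one in the
patches.)

  /* Illustrative sketch of a per-workload THP policy program.
   * The struct_ops type name and section names are assumptions; the
   * constants are the ones described in the cover letter, assumed to
   * be visible through vmlinux.h once the series is applied.
   */
  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>
  #include <bpf/bpf_tracing.h>

  SEC("struct_ops/allocator")
  int BPF_PROG(thp_allocator, unsigned long vm_flags, unsigned long tva_flags)
  {
          /* Trivial policy: always defer THP allocation to khugepaged so
           * the faulting task never stalls. A real policy would inspect
           * vm_flags/tva_flags (and, with additional kfuncs, PSI or
           * fragmentation data) before returning THP_ALLOC_CURRENT.
           */
          return THP_ALLOC_KHUGEPAGED;
  }

  SEC("struct_ops/reclaimer")
  int BPF_PROG(thp_reclaimer, bool vma_madvised)
  {
          /* Direct reclaim only for madvised VMAs; otherwise leave the
           * work to kswapd in the background.
           */
          return vma_madvised ? RECLAIMER_CURRENT : RECLAIMER_KSWAPD;
  }

  SEC(".struct_ops.link")
  struct thp_adjust_ops thp_policy = {
          .allocator = (void *)thp_allocator,
          .reclaimer = (void *)thp_reclaimer,
  };

  char LICENSE[] SEC("license") = "GPL";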
> >
> > The BPF-based THP adjustment mechanism introduces two new APIs for
> > granular policy control:
> >
> > - THP allocator
> >
> >   int (*allocator)(unsigned long vm_flags, unsigned long tva_flags);
> >
> >   The BPF program returns either THP_ALLOC_CURRENT or THP_ALLOC_KHUGEPAGED,
> >   indicating whether THP allocation should be performed synchronously
> >   (current task) or asynchronously (khugepaged).
> >
> >   The decision is based on the current task context, VMA flags, and TVA
> >   flags.
>
> I think we should go one step further and actually get advice about the
> orders (THP sizes) to use. It might be helpful if the program would have
> access to system stats, to make an educated decision.
>
> Given page fault information and system information, the program could
> then decide which orders to try to allocate.

Yes, that aligns with my thoughts as well. For instance, we could
automate the decision-making process based on factors like PSI, memory
fragmentation, and other metrics. However, this logic could be
implemented within BPF programs -- all we'd need is to extend the
feature by introducing a few kfuncs (also known as BPF helpers).

> That means, one would query during page faults and during khugepaged,
> which order one should try -- compared to our current approach of "start
> with the largest order that is enabled and fits".
>
> >
> > - THP reclaimer
> >
> >   int (*reclaimer)(bool vma_madvised);
> >
> >   The BPF program returns either RECLAIMER_CURRENT or RECLAIMER_KSWAPD,
> >   determining whether memory reclamation is handled by the current task
> >   or kswapd.
>
> Not sure about that, will have to look into the details.

Some workloads allocate all their memory during initialization and do
not require THP at runtime. For such cases, aggressively attempting THP
allocation is beneficial. However, other workloads may dynamically
allocate THP during execution -- if these are latency-sensitive, we must
avoid introducing long allocation delays.

Given these differing requirements, the global
/sys/kernel/mm/transparent_hugepage/defrag setting is insufficient.
Instead, we should implement per-workload defrag policies to better
optimize performance based on individual application behavior.

> But what could be interesting is deciding how to deal with underutilized
> THPs: for now we will try replacing zero-filled pages by the shared
> zeropage during a split. *maybe* some workloads could benefit from ...
> not doing that, and instead optimize the split.

I believe a per-workload THP shrinker (e.g.,
/sys/kernel/mm/transparent_hugepage/shrink_underused) would also be
valuable. Thank you for the suggestion.

-- 
Regards
Yafang