From: Yafang Shao
Date: Fri, 2 May 2025 20:18:09 +0800
Subject: Re: [RFC PATCH 0/4] mm, bpf: BPF based THP adjustment
To: Zi Yan
Cc: Gutierrez Asier, Johannes Weiner, "Liam R. Howlett", akpm@linux-foundation.org, ast@kernel.org, daniel@iogearbox.net, andrii@kernel.org, David Hildenbrand, Baolin Wang, Lorenzo Stoakes, Nico Pache, Ryan Roberts, Dev Jain, bpf@vger.kernel.org, linux-mm@kvack.org, Michal Hocko
In-Reply-To: <3006EA5B-3E02-4D82-8626-90735FE8F656@nvidia.com>

On Fri, May 2, 2025 at 8:00 PM Zi Yan wrote:
>
> On 2 May 2025, at 1:48, Yafang Shao wrote:
>
> > On Fri, May 2, 2025 at 3:36 AM Gutierrez Asier wrote:
> >>
> >>
> >> On 4/30/2025 8:53 PM, Zi Yan wrote:
> >>> On 30 Apr 2025, at 13:45, Johannes Weiner wrote:
> >>>
> >>>> On Thu, May 01, 2025 at 12:06:31AM +0800, Yafang Shao wrote:
> >>>>>>>> If it isn't, can you state why?
> >>>>>>>>
> >>>>>>>> The main difference is that you are saying it's in a container that you
> >>>>>>>> don't control. Your plan is to violate the control the internal
> >>>>>>>> applications have over THP because you know better. I'm not sure how
> >>>>>>>> people might feel about you messing with workloads,
> >>>>>>>
> >>>>>>> It’s not a mess. They have the option to deploy their services on
> >>>>>>> dedicated servers, but they would need to pay more for that choice.
> >>>>>>> This is a two-way decision.
> >>>>>>
> >>>>>> This implies you want a container-level way of controlling the setting
> >>>>>> and not a system service-level?
> >>>>>
> >>>>> Right. We want to control the THP per container.
> >>>>
> >>>> This does strike me as a reasonable usecase.
> >>>>
> >>>> I think there is consensus that in the long-term we want this stuff to
> >>>> just work and truly be transparent to userspace.
> >>>>
> >>>> In the short-to-medium term, however, there are still quite a few
> >>>> caveats. thp=always can significantly increase the memory footprint of
> >>>> sparse virtual regions. Huge allocations are not as cheap and reliable
> >>>> as we would like them to be, which for real production systems means
> >>>> having to make workload-specific choices and tradeoffs.
> >>>>
> >>>> There is ongoing work in these areas, but we do have a bit of a
> >>>> chicken-and-egg problem: on the one hand, huge page adoption is slow
> >>>> due to limitations in how they can be deployed. For example, we can't
> >>>> do thp=always on a DC node that runs arbitrary combinations of jobs
> >>>> from a wide array of services. Some might benefit, some might hurt.
> >>>>
> >>>> Yet, it's much easier to improve the kernel based on exactly such
> >>>> production experience and data from real-world usecases. We can't
> >>>> improve the THP shrinker if we can't run THP.
> >>>>
> >>>> So I don't see it as overriding whoever wrote the software running
> >>>> inside the container. They don't know, and they shouldn't have to care
> >>>> about page sizes. It's about letting admins and kernel teams get
> >>>> started on using and experimenting with this stuff, given the very
> >>>> real constraints right now, so we can get the feedback necessary to
> >>>> improve the situation.
> >>>
> >>> Since you think it is reasonable to control THP at container level,
> >>> namely per cgroup, should we reconsider cgroup-based THP control[1]?
> >>> (Asier cc'd)
> >>>
> >>> In this patchset, Yafang uses BPF to adjust THP global configs based
> >>> on VMA, which does not look like a good approach to me. WDYT?
> >>>
> >>>
> >>> [1] https://lore.kernel.org/linux-mm/20241030083311.965933-1-gutierrez.asier@huawei-partners.com/
> >>>
> >>> --
> >>> Best Regards,
> >>> Yan, Zi
> >>
> >> Hi,
> >>
> >> I believe cgroup is a better approach for containers, since this
> >> approach can be easily integrated with the user space stack like
> >> containerd and Kubernetes, which use cgroup to control system resources.
> >
> > The integration of BPF with containerd and Kubernetes is emerging as a
> > clear trend.
> >
> >>
> >> However, as I pointed out earlier, the approach I suggested has some
> >> flaws:
> >> 1. Potential pollution of cgroup with a large number of knobs
> >
> > Right, the memcg maintainers once told me that introducing a new
> > cgroup file means committing to maintaining it indefinitely, as these
> > interface files are treated as part of the ABI.
> > In contrast, BPF kfuncs are considered an unstable API, giving you the
> > flexibility to modify them later if needed.
> >
> >> 2. Requires configuration by the admin
> >>
> >> Ideally, as Matthew W. mentioned, there should be an automatic system.
> >
> > Take Matthew’s XFS large folio feature as an example: it was enabled
> > automatically. A few years ago, when we upgraded to the 6.1.y stable
> > kernel, we noticed this new feature. Since it was enabled by default,
> > we assumed the author was confident in its stability. Unfortunately,
> > it led to severe issues in our production environment: servers crashed
> > randomly, and in some cases, we experienced data loss without
> > understanding the root cause.
> >
> > We began disabling various kernel configurations in an attempt to
> > isolate the issue, and eventually, the problem disappeared after
> > disabling CONFIG_TRANSPARENT_HUGEPAGE. As a result, we released a new
> > kernel version with THP disabled and had to restart hundreds of
> > thousands of production servers. It was a nightmare for both us and
> > our sysadmins.
> >
> > Last year, we discovered that the initial issue had been resolved by this patch:
> > https://lore.kernel.org/stable/20241001210625.95825-1-ryncsn@gmail.com/.
> > We backported the fix and re-enabled XFS large folios, only to face a
> > new nightmare. One of our services began crashing sporadically with
> > core dumps. It took us several months to trace the issue back to the
> > re-enabled XFS large folio feature. Fortunately, we were able to
> > disable it using livepatch, avoiding another round of mass server
> > restarts. To this day, the root cause remains unknown. The good news
> > is that the issue appears to be resolved in the 6.12.y stable kernel.
> > We're still trying to bisect which commit fixed it, though progress is
> > slow because the issue is not reliably reproducible.
>
> This is a very wrong attitude towards open source projects. You sound,
> whether intended or not, as if the Linux community should provide
> issue-free kernels and is responsible for fixing all issues. But that
> is wrong. Since you are using the kernel, you could help improve it like
> Kairong is doing instead of waiting for others to fix the issue.
>
> >
> > In theory, new features should be enabled automatically. But in
> > practice, every new feature should come with a tunable knob. That’s a
> > lesson we learned the hard way from this experience, and perhaps
> > Matthew did too.
>
> That means new features will not get enough testing. People like you
> will simply disable all new features and wait until they are stable.
> That stability will never come without testing and bug fixes.

Pardon me?
This discussion has taken such an unexpected turn that I don’t feel
the need to explain what I’ve contributed to the Linux community over
the past few years.

That said, you're free to express yourself as you wish, even if it
comes across as unnecessarily rude toward someone who has been
participating in the community voluntarily for many years.

Best of luck in your first maintainer role.

-- 
Regards
Yafang