From: Yafang Shao <laoar.shao@gmail.com>
Date: Mon, 5 May 2025 10:35:04 +0800
Subject: Re: [RFC PATCH 0/4] mm, bpf: BPF based THP adjustment
To: David Hildenbrand
Cc: Zi Yan, Gutierrez Asier, Johannes Weiner, "Liam R. Howlett",
 akpm@linux-foundation.org, ast@kernel.org, daniel@iogearbox.net,
 andrii@kernel.org, Baolin Wang, Lorenzo Stoakes, Nico Pache,
 Ryan Roberts, Dev Jain, bpf@vger.kernel.org, linux-mm@kvack.org,
 Michal Hocko
In-Reply-To: <4883bdec-f7f2-4350-bf72-f0fa75c9ddd5@redhat.com>
References: <20250429024139.34365-1-laoar.shao@gmail.com>
 <42ECBC51-E695-4480-A055-36D08FE61C12@nvidia.com>
 <8F000270-A724-4536-B69E-C22701522B89@nvidia.com>
 <20250430174521.GC2020@cmpxchg.org>
 <84DE7C0C-DA49-4E4F-9F66-E07567665A53@nvidia.com>
 <6850ac3f-af96-4cc6-9dd0-926dd3a022c9@huawei-partners.com>
 <3006EA5B-3E02-4D82-8626-90735FE8F656@nvidia.com>
 <4883bdec-f7f2-4350-bf72-f0fa75c9ddd5@redhat.com>
On Fri, May 2, 2025 at 9:04 PM David Hildenbrand wrote:
>
> On 02.05.25 14:18, Yafang Shao wrote:
> > On Fri, May 2, 2025 at 8:00 PM Zi Yan wrote:
> >>
> >> On 2 May 2025, at 1:48, Yafang Shao wrote:
> >>
> >>> On Fri, May 2, 2025 at 3:36 AM Gutierrez Asier
> >>> wrote:
> >>>>
> >>>>
> >>>> On 4/30/2025 8:53 PM, Zi Yan wrote:
> >>>>> On 30 Apr 2025, at 13:45, Johannes Weiner wrote:
> >>>>>
> >>>>>> On Thu, May 01, 2025 at 12:06:31AM +0800, Yafang Shao wrote:
> >>>>>>>>>> If it isn't, can you state why?
> >>>>>>>>>>
> >>>>>>>>>> The main difference is that you are saying it's in a container that you
> >>>>>>>>>> don't control. Your plan is to violate the control the internal
> >>>>>>>>>> applications have over THP because you know better. I'm not sure how
> >>>>>>>>>> people might feel about you messing with workloads,
> >>>>>>>>>
> >>>>>>>>> It's not a mess. They have the option to deploy their services on
> >>>>>>>>> dedicated servers, but they would need to pay more for that choice.
> >>>>>>>>> This is a two-way decision.
> >>>>>>>>
> >>>>>>>> This implies you want a container-level way of controlling the setting
> >>>>>>>> and not a system service-level?
> >>>>>>>
> >>>>>>> Right. We want to control the THP per container.
> >>>>>>
> >>>>>> This does strike me as a reasonable usecase.
> >>>>>>
> >>>>>> I think there is consensus that in the long-term we want this stuff to
> >>>>>> just work and truly be transparent to userspace.
> >>>>>>
> >>>>>> In the short-to-medium term, however, there are still quite a few
> >>>>>> caveats. thp=always can significantly increase the memory footprint of
> >>>>>> sparse virtual regions. Huge allocations are not as cheap and reliable
> >>>>>> as we would like them to be, which for real production systems means
> >>>>>> having to make workload-specific choices and tradeoffs.
> >>>>>>
> >>>>>> There is ongoing work in these areas, but we do have a bit of a
> >>>>>> chicken-and-egg problem: on the one hand, huge page adoption is slow
> >>>>>> due to limitations in how they can be deployed. For example, we can't
> >>>>>> do thp=always on a DC node that runs arbitrary combinations of jobs
> >>>>>> from a wide array of services. Some might benefit, some might hurt.
> >>>>>>
> >>>>>> Yet, it's much easier to improve the kernel based on exactly such
> >>>>>> production experience and data from real-world usecases. We can't
> >>>>>> improve the THP shrinker if we can't run THP.
> >>>>>>
> >>>>>> So I don't see it as overriding whoever wrote the software running
> >>>>>> inside the container. They don't know, and they shouldn't have to care
> >>>>>> about page sizes. It's about letting admins and kernel teams get
> >>>>>> started on using and experimenting with this stuff, given the very
> >>>>>> real constraints right now, so we can get the feedback necessary to
> >>>>>> improve the situation.
> >>>>>
> >>>>> Since you think it is reasonable to control THP at container-level,
> >>>>> namely per-cgroup, should we reconsider cgroup-based THP control[1]?
> >>>>> (Asier cc'd)
> >>>>>
> >>>>> In this patchset, Yafang uses BPF to adjust THP global configs based
> >>>>> on VMA, which does not look a good approach to me. WDYT?
> >>>>>
> >>>>> [1] https://lore.kernel.org/linux-mm/20241030083311.965933-1-gutierrez.asier@huawei-partners.com/
> >>>>>
> >>>>> --
> >>>>> Best Regards,
> >>>>> Yan, Zi
> >>>>
> >>>> Hi,
> >>>>
> >>>> I believe cgroup is a better approach for containers, since this
> >>>> approach can be easily integrated with the user space stack like
> >>>> containerd and Kubernetes, which use cgroup to control system resources.
> >>>
> >>> The integration of BPF with containerd and Kubernetes is emerging as a
> >>> clear trend.
> >>>
> >>>>
> >>>> However, as I pointed out earlier, the approach I suggested has some
> >>>> flaws:
> >>>> 1. Potential pollution of cgroup with a big number of knobs
> >>>
> >>> Right, the memcg maintainers once told me that introducing a new
> >>> cgroup file means committing to maintaining it indefinitely, as these
> >>> interface files are treated as part of the ABI.
> >>> In contrast, BPF kfuncs are considered an unstable API, giving you the
> >>> flexibility to modify them later if needed.
> >>>
> >>>> 2. Requires configuration by the admin
> >>>>
> >>>> Ideally, as Matthew W. mentioned, there should be an automatic system.
> >>>
> >>> Take Matthew's XFS large folio feature as an example: it was enabled
> >>> automatically. A few years ago, when we upgraded to the 6.1.y stable
> >>> kernel, we noticed this new feature. Since it was enabled by default,
> >>> we assumed the author was confident in its stability. Unfortunately,
> >>> it led to severe issues in our production environment: servers crashed
> >>> randomly, and in some cases, we experienced data loss without
> >>> understanding the root cause.
> >>>
> >>> We began disabling various kernel configurations in an attempt to
> >>> isolate the issue, and eventually, the problem disappeared after
> >>> disabling CONFIG_TRANSPARENT_HUGEPAGE. As a result, we released a new
> >>> kernel version with THP disabled and had to restart hundreds of
> >>> thousands of production servers. It was a nightmare for both us and
> >>> our sysadmins.
> >>>
> >>> Last year, we discovered that the initial issue had been resolved by this patch:
> >>> https://lore.kernel.org/stable/20241001210625.95825-1-ryncsn@gmail.com/
> >>> We backported the fix and re-enabled XFS large folios, only to face a
> >>> new nightmare. One of our services began crashing sporadically with
> >>> core dumps. It took us several months to trace the issue back to the
> >>> re-enabled XFS large folio feature. Fortunately, we were able to
> >>> disable it using livepatch, avoiding another round of mass server
> >>> restarts. To this day, the root cause remains unknown. The good news
> >>> is that the issue appears to be resolved in the 6.12.y stable kernel.
> >>> We're still trying to bisect which commit fixed it, though progress is
> >>> slow because the issue is not reliably reproducible.
> >>
> >> This is a very wrong attitude towards open source projects. You sounded
> >> like, whether intended or not, the Linux community should provide issue-free
> >> kernels and is responsible for fixing all issues. But that is wrong.
> >> Since you are using the kernel, you could help improve it like Kairong
> >> is doing instead of waiting for others to fix the issue.
> >>
> >>>
> >>> In theory, new features should be enabled automatically. But in
> >>> practice, every new feature should come with a tunable knob. That's a
> >>> lesson we learned the hard way from this experience, and perhaps
> >>> Matthew did too.
> >>
> >> That means new features will not get enough testing. People like you
> >> will simply disable all new features and wait until they are stable.
> >> Stability would never come without testing and bug fixes.

Hello David,

Thanks for your reply.
> We do have the concept of EXPERIMENTAL kernel configs, that are either
> expected to get removed completely ("always enabled") or get turned into
> actual long-term kernel options. But yeah, it's always tricky what we
> actually want to put behind such options.
>
> I mean, READ_ONLY_THP_FOR_FS is still around and still EXPERIMENTAL ...

READ_ONLY_THP_FOR_FS is not enabled in our 6.1 kernel, as we are
cautious about enabling any EXPERIMENTAL feature.

XFS large folio support operates independently of READ_ONLY_THP_FOR_FS.
However, it is automatically enabled when CONFIG_TRANSPARENT_HUGEPAGE is
set; see mapping_large_folio_support() in the 6.1.y stable kernel.

> Distro kernels are usually very careful about what to backport and what
> to support. Once we (working for a distro) do backport + test, we
> usually find some additional things that upstream hasn't spotted yet: in
> particular, because some workloads are only run in that form on distro
> kernels. We also ran into some issues with large folios (e.g., me
> personally with s390x KVM guests) and are trying our best to fix them.

We also worked on this. As you may recall, I previously fixed a large
folio bug, which was merged into the 6.1.y stable kernel [0].

[0] https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-6.1.y&id=a3f8ee15228c89ce3713ee7e8e82f6d8a13fdb4b

> It can be quite time consuming, so I can understand that not everybody
> has the time to invest into heavy debugging, especially if it's
> extremely hard to reproduce (or even corrupts data :( ).

Correct. If the vmcore is incomplete, it is nearly impossible to
reliably determine the root cause. The best approach is to isolate the
issue as quickly as possible.

> I agree that adding a toggle after the fact to work around issues is
> not the right approach. Introducing an EXPERIMENTAL toggle early because
> one suspects complicated interactions is a different story. It's
> absolutely not trivial to make that decision.

In this patchset, we are not introducing a toggle as a workaround.
Rather, the change reflects the fact that some workloads benefit from
THP while others are negatively impacted. Therefore, it makes sense to
enable THP selectively based on workload characteristics.

> > Pardon me?
> > This discussion has taken such an unexpected turn that I don't feel
> > the need to explain what I've contributed to the Linux community over
> > the past few years.
>
> I'm sure Zi Yan didn't mean to insult you. I would have phrased it as:
>
> "It's difficult to decide which toggles make sense. There is a fine line
> between adding a toggle and not getting people actually testing it to
> stabilize it vs. not adding a toggle and forcing people to test it and
> fix it/report issues."
>
> Ideally, we'd find most issues in the RC phase or at least shortly after.
>
> You've been active in the kernel for a long time, please don't feel like
> the community is not appreciating that.

Thank you for the clarification. I truly appreciate your patience and
thoroughness.

--
Regards
Yafang