From: Yafang Shao
Date: Thu, 1 May 2025 00:06:31 +0800
Subject: Re: [RFC PATCH 0/4] mm, bpf: BPF based THP adjustment
To: "Liam R. Howlett", Yafang Shao, Zi Yan, akpm@linux-foundation.org,
 ast@kernel.org, daniel@iogearbox.net, andrii@kernel.org, David Hildenbrand,
 Baolin Wang, Lorenzo Stoakes, Nico Pache, Ryan Roberts, Dev Jain,
 bpf@vger.kernel.org, linux-mm@kvack.org, Johannes Weiner, Michal Hocko
References: <20250429024139.34365-1-laoar.shao@gmail.com>
 <42ECBC51-E695-4480-A055-36D08FE61C12@nvidia.com>
 <8F000270-A724-4536-B69E-C22701522B89@nvidia.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"

On Wed, Apr 30, 2025 at 11:53 PM Liam R. Howlett wrote:
>
> * Yafang Shao [250430 11:37]:
> > On Wed, Apr 30, 2025 at 11:21 PM Liam R. Howlett wrote:
> > >
> > > * Zi Yan [250430 11:01]:
> > >
> > > ...
> > >
> > > > >>>>> Since multiple services run on a single host in a containerized environment,
> > > > >>>>> enabling THP globally is not ideal. Previously, we set THP to madvise,
> > > > >>>>> allowing selected services to opt in via MADV_HUGEPAGE. However, this
> > > > >>>>> approach had a limitation:
> > > > >>>>>
> > > > >>>>> - Some services inadvertently used madvise(MADV_HUGEPAGE) through
> > > > >>>>>   third-party libraries, bypassing our restrictions.
> > > > >>>>
> > > > >>>> Basically, you want more precise control of THP enablement and the
> > > > >>>> ability to override madvise() from userspace.
> > > > >>>>
> > > > >>>> In terms of overriding madvise(), do you have any concrete example of
> > > > >>>> these third-party libraries? madvise() users are supposed to know what
> > > > >>>> they are doing, so I wonder why they are causing trouble in your
> > > > >>>> environment.
> > > > >>>
> > > > >>> To my knowledge, jemalloc [0] supports THP.
> > > > >>> Applications using jemalloc typically rely on its default
> > > > >>> configurations rather than explicitly enabling or disabling THP. If
> > > > >>> the system is configured with THP=madvise, these applications may
> > > > >>> automatically leverage THP where appropriate.
> > > > >>>
> > > > >>> [0]. https://github.com/jemalloc/jemalloc
> > > > >>
> > > > >> It sounds like a userspace issue. For jemalloc, if applications require
> > > > >> it, can't you replace jemalloc with one compiled with --disable-thp
> > > > >> to work around the issue?
> > > > >
> > > > > That's not the issue this patchset is trying to address or work
> > > > > around. I believe we should focus on the actual problem it's meant to
> > > > > solve.
> > > > >
> > > > > By the way, you might not raise this question if you were managing a
> > > > > large fleet of servers.
> > > > > We're a platform provider, but we don't
> > > > > maintain all the packages ourselves. Users make their own choices
> > > > > based on their specific requirements. It's not a feasible solution for
> > > > > us to develop and maintain every package.
> > > >
> > > > Basically, the user wants to use THP but, as a service provider, you think
> > > > differently, so you want to override the userspace choice. Am I getting it right?
> > >
> > > Who is the platform provider in question? It makes me uneasy to have
> > > such claims from an @gmail account with current world events.
> >
> > It's a small company based in China, called PDD, if that information is helpful.
>
> Thanks.
>
> >
> > >
> > > ...
> > >
> > > > >>>
> > > > >>> I chose not to include this in the self-tests to avoid the complexity
> > > > >>> of setting up cgroups for testing purposes. However, in patch #4 of
> > > > >>> this series, I've included a simpler example demonstrating task-level
> > > > >>> control.
> > > > >>
> > > > >> For task-level control, why not use prctl(PR_SET_THP_DISABLE)?
> > > > >
> > > > > You'll need to modify the user-space code, and again, this likely
> > > > > wouldn't be a concern if you were managing a large fleet of servers.
> > > > >
> > > > >>
> > > > >>> For service-level control, we could potentially utilize BPF task local
> > > > >>> storage as an alternative approach.
> > > > >>
> > > > >> +cgroup people
> > > > >>
> > > > >> For service-level control, there was a proposal to add cgroup-based
> > > > >> THP control[1]. You might need a strong use case to convince people.
> > > > >>
> > > > >> [1] https://lore.kernel.org/linux-mm/20241030083311.965933-1-gutierrez.asier@huawei-partners.com/
> > > > >
> > > > > Thanks for the reference. I've reviewed the related discussion, and if
> > > > > I understand correctly, the proposal was rejected by the maintainers.
> > >
> > > More to the point is why it was rejected. Why is your motive different?
> > >
> > > >
> > > > I wonder why your approach is better than the cgroup-based THP control proposal.
> > >
> > > I think Matthew's response in that thread is pretty clear and still
> > > relevant.
> >
> > Are you referring to
> > https://lore.kernel.org/linux-mm/ZyT7QebITxOKNi_c@casper.infradead.org/
> > or https://lore.kernel.org/linux-mm/ZyIxRExcJvKKv4JW@casper.infradead.org/
> > ?
> >
> > If it's the latter, then this patchset aims to make sysadmins' lives easier.
>
> Both, really. Your patch gives the sysadmin another knob to turn, and they
> have to know when to turn it. Matthew is suggesting we should know when to
> do the right thing and avoid a knob in the first place.

The problem is that there's no proper mechanism to control THP at the
container level. From the moment we introduced containers and cgroups,
the goal has been to manage all resources through cgroups. Of course,
implementing everything at once wasn't feasible, so we added
controllers incrementally, and we're still introducing new ones even
today, aren't we?

Now, with BPF, we have a more flexible way to achieve this, so why not
use it? I believe we should focus on making life easier for users, not
just sysadmins. That philosophy has been a driving force behind the
continued development of the Linux kernel.

> > >
> > > If it isn't, can you state why?
> > >
> > > The main difference is that you are saying it's in a container that you
> > > don't control. Your plan is to violate the control the internal
> > > applications have over THP because you know better. I'm not sure how
> > > people might feel about you messing with workloads,
> >
> > It's not a mess. They have the option to deploy their services on
> > dedicated servers, but they would need to pay more for that choice.
> > This is a two-way decision.
>
> This implies you want a container-level way of controlling the setting
> and not a system service-level one?

Right. We want to control THP per container.

> I guess I find the wording of the
> problem statement unclear.
>
> >
> > > but beyond that, you
> > > are fundamentally fixing things at a sysadmin level because programmers
> > > have made errors.
> >
> > No, they're not making mistakes; they simply focus on the
> > implementation details of their own services and don't find it
> > worthwhile to dive into kernel internals. Their services run perfectly
> > well with or without THP.
> >
> > > You state as much in the cover letter, yes?
> >
> > I'll try to explain it in more detail in the next version if that
> > would be helpful.
>
> Yes, I think so.
>
> Thanks,
> Liam

-- 
Regards
Yafang