From: Yafang Shao <laoar.shao@gmail.com>
Date: Wed, 30 Apr 2025 23:16:28 +0800
Subject: Re: [RFC PATCH 0/4] mm, bpf: BPF based THP adjustment
To: Zi Yan
Cc: akpm@linux-foundation.org, ast@kernel.org, daniel@iogearbox.net,
    andrii@kernel.org, David Hildenbrand, Baolin Wang, Lorenzo Stoakes,
    "Liam R. Howlett", Nico Pache, Ryan Roberts, Dev Jain,
    bpf@vger.kernel.org, linux-mm@kvack.org, Johannes Weiner, Michal Hocko
In-Reply-To: <8F000270-A724-4536-B69E-C22701522B89@nvidia.com>
References: <20250429024139.34365-1-laoar.shao@gmail.com>
 <42ECBC51-E695-4480-A055-36D08FE61C12@nvidia.com>
 <8F000270-A724-4536-B69E-C22701522B89@nvidia.com>

On Wed, Apr 30, 2025 at 11:00 PM Zi Yan wrote:
>
> On 30 Apr 2025, at 10:38, Yafang Shao wrote:
>
> > On Wed, Apr 30, 2025 at 9:19 PM Zi Yan wrote:
> >>
> >> On 29 Apr 2025, at 22:33, Yafang Shao wrote:
> >>
> >>> On Tue, Apr 29, 2025 at 11:09 PM Zi Yan wrote:
> >>>>
> >>>> Hi Yafang,
> >>>>
> >>>> We recently added a new THP entry in the MAINTAINERS file [1], do you mind
> >>>> ccing the people there in your next version? (I added them here)
> >>>>
> >>>> [1] https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/tree/MAINTAINERS?h=mm-everything#n15589
> >>>
> >>> Thanks for your reminder.
> >>> I will add the maintainers and reviewers in the next version.
> >>>
> >>>>
> >>>> On Mon Apr 28, 2025 at 10:41 PM EDT, Yafang Shao wrote:
> >>>>> In our container environment, we aim to enable THP selectively -- allowing
> >>>>> specific services to use it while restricting others. This approach is
> >>>>> driven by the following considerations:
> >>>>>
> >>>>> 1. Memory Fragmentation
> >>>>>    THP can lead to increased memory fragmentation, so we want to limit its
> >>>>>    use across services.
> >>>>> 2. Performance Impact
> >>>>>    Some services see no benefit from THP, making its usage unnecessary.
> >>>>> 3. Performance Gains
> >>>>>    Certain workloads, such as machine learning services, experience
> >>>>>    significant performance improvements with THP, so we enable it for them
> >>>>>    specifically.
> >>>>>
> >>>>> Since multiple services run on a single host in a containerized environment,
> >>>>> enabling THP globally is not ideal. Previously, we set THP to madvise,
> >>>>> allowing selected services to opt in via MADV_HUGEPAGE. However, this
> >>>>> approach had a limitation:
> >>>>>
> >>>>> - Some services inadvertently used madvise(MADV_HUGEPAGE) through
> >>>>>   third-party libraries, bypassing our restrictions.
> >>>>
> >>>> Basically, you want more precise control of THP enablement and the
> >>>> ability to override madvise() from userspace.
> >>>>
> >>>> In terms of overriding madvise(), do you have any concrete example of
> >>>> these third-party libraries? madvise() users are supposed to know what
> >>>> they are doing, so I wonder why they are causing trouble in your
> >>>> environment.
> >>>
> >>> To my knowledge, jemalloc [0] supports THP.
> >>> Applications using jemalloc typically rely on its default
> >>> configuration rather than explicitly enabling or disabling THP. If
> >>> the system is configured with THP=madvise, these applications may
> >>> automatically leverage THP where appropriate.
> >>>
> >>> [0] https://github.com/jemalloc/jemalloc
> >>
> >> It sounds like a userspace issue. For jemalloc, if applications require
> >> it, can't you replace jemalloc with one compiled with --disable-thp
> >> to work around the issue?
> >
> > That's not the issue this patchset is trying to address or work
> > around. I believe we should focus on the actual problem it's meant to
> > solve.
> >
> > By the way, you might not raise this question if you were managing a
> > large fleet of servers. We're a platform provider, but we don't
> > maintain all the packages ourselves. Users make their own choices
> > based on their specific requirements. It's not a feasible solution for
> > us to develop and maintain every package.
>
> Basically, the user wants to use THP, but as a service provider, you think
> differently, so you want to override the userspace choice. Am I getting it right?

No -- the users aren't specifically concerned with THP. They just copied a
configuration from the internet and deployed it in the production
environment.

>
> >
> >>
> >>>
> >>>>
> >>>>>
> >>>>> To address this issue, we initially hooked the __x64_sys_madvise() syscall,
> >>>>> which is error-injectable, to blacklist unwanted services. While this
> >>>>> worked, it was error-prone and ineffective for services needing always mode,
> >>>>> as modifying their code to use madvise was impractical.
> >>>>>
> >>>>> To achieve finer-grained control, we introduced an fmod_ret-based solution.
> >>>>> Now, we dynamically adjust THP settings per service by hooking
> >>>>> hugepage_global_{enabled,always}() via BPF. This allows us to set THP to
> >>>>> enabled or disabled on a per-service basis without global impact.
> >>>>
> >>>> hugepage_global_*() are whole-system knobs. How did you use them to
> >>>> achieve per-service control? In terms of per-service, does it mean
> >>>> you need a per-memcg (I assume each service has its own memcg) THP
> >>>> configuration?
> >>>
> >>> With this new BPF hook, we can manage THP behavior either per-service
> >>> or per-memcg.
> >>> In our use case, we've chosen memcg-based control for finer-grained
> >>> management. Below is a simplified example of our implementation:
> >>>
> >>> struct {
> >>>         __uint(type, BPF_MAP_TYPE_HASH);
> >>>         __uint(max_entries, 4096); /* usually there won't be too many cgroups */
> >>>         __type(key, u64);
> >>>         __type(value, u32);
> >>>         __uint(map_flags, BPF_F_NO_PREALLOC);
> >>> } thp_whitelist SEC(".maps");
> >>>
> >>> SEC("fmod_ret/mm_bpf_thp_vma_allowable")
> >>> int BPF_PROG(thp_vma_allowable, struct vm_area_struct *vma)
> >>> {
> >>>         struct cgroup_subsys_state *css;
> >>>         struct css_set *cgroups;
> >>>         struct mm_struct *mm;
> >>>         struct cgroup *cgroup;
> >>>         struct cgroup *parent;
> >>>         struct task_struct *p;
> >>>         u64 cgrp_id;
> >>>
> >>>         if (!vma)
> >>>                 return 0;
> >>>
> >>>         mm = vma->vm_mm;
> >>>         if (!mm)
> >>>                 return 0;
> >>>
> >>>         p = mm->owner;
> >>>         cgroups = p->cgroups;
> >>>         cgroup = cgroups->subsys[memory_cgrp_id]->cgroup;
> >>>         cgrp_id = cgroup->kn->id;
> >>>
> >>>         /* Allow the tasks in the thp_whitelist to use THP. */
> >>>         if (bpf_map_lookup_elem(&thp_whitelist, &cgrp_id))
> >>>                 return 1;
> >>>         return 0;
> >>> }
> >>>
> >>> I chose not to include this in the self-tests to avoid the complexity
> >>> of setting up cgroups for testing purposes. However, in patch #4 of
> >>> this series, I've included a simpler example demonstrating task-level
> >>> control.
> >>
> >> For task-level control, why not use prctl(PR_SET_THP_DISABLE)?
> >
> > You'll need to modify the user-space code -- and again, this likely
> > wouldn't be a concern if you were managing a large fleet of servers.
> >
> >>
> >>> For service-level control, we could potentially utilize BPF task local
> >>> storage as an alternative approach.
> >>
> >> +cgroup people
> >>
> >> For service-level control, there was a proposal of adding cgroup-based
> >> THP control [1]. You might need a strong use case to convince people.
> >>
> >> [1] https://lore.kernel.org/linux-mm/20241030083311.965933-1-gutierrez.asier@huawei-partners.com/
> >
> > Thanks for the reference. I've reviewed the related discussion, and if
> > I understand correctly, the proposal was rejected by the maintainers.
>
> I wonder why your approach is better than the cgroup-based THP control proposal.

It's more flexible, and you can still use it even without cgroups.

One limitation is that CONFIG_MEMCG must be enabled due to the use of
mm_struct::owner. I'm wondering if it would be feasible to decouple
mm_struct::owner from CONFIG_MEMCG. Alternatively, if there's another
reliable way to retrieve the task_struct without relying on
mm_struct::owner, we could consider adding BPF kfuncs to support it.

-- 
Regards
Yafang
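For illustration, a minimal sketch of the user-space side of the example
above: it whitelists one cgroup in thp_whitelist and attaches the fmod_ret
program. The skeleton header thp_adjust.skel.h and the struct thp_adjust_bpf
type are hypothetical names (the posted series does not include this loader);
the sketch relies on the fact that, on cgroup v2, the inode number of a
cgroup directory equals the kernfs node id (cgroup->kn->id) used as the map
key in the BPF program.

/* Hypothetical loader sketch -- not part of the posted patchset.
 * Assumes a libbpf skeleton generated from the BPF object above, e.g.:
 *   bpftool gen skeleton thp_adjust.bpf.o > thp_adjust.skel.h
 */
#include <stdio.h>
#include <unistd.h>
#include <sys/stat.h>
#include <bpf/libbpf.h>
#include "thp_adjust.skel.h"            /* assumed skeleton header name */

int main(int argc, char **argv)
{
	struct thp_adjust_bpf *skel;    /* assumed skeleton type name */
	struct stat st;
	__u64 cgrp_id;
	__u32 val = 1;
	int err;

	if (argc != 2) {
		fprintf(stderr, "usage: %s /sys/fs/cgroup/<service>\n", argv[0]);
		return 1;
	}

	/* On cgroup v2 the inode number of the cgroup directory equals the
	 * kernfs node id (cgroup->kn->id) looked up by the BPF program. */
	if (stat(argv[1], &st)) {
		perror("stat");
		return 1;
	}
	cgrp_id = st.st_ino;

	skel = thp_adjust_bpf__open_and_load();
	if (!skel)
		return 1;

	/* Whitelist this cgroup, then attach the fmod_ret program. */
	err = bpf_map__update_elem(skel->maps.thp_whitelist, &cgrp_id,
				   sizeof(cgrp_id), &val, sizeof(val), BPF_ANY);
	if (!err)
		err = thp_adjust_bpf__attach(skel);
	if (!err) {
		printf("THP allowed for cgroup id %llu; Ctrl-C to detach\n",
		       (unsigned long long)cgrp_id);
		pause();
	}

	thp_adjust_bpf__destroy(skel);
	return err ? 1 : 0;
}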