From: Yafang Shao <laoar.shao@gmail.com>
Date: Wed, 30 Apr 2025 22:38:10 +0800
Subject: Re: [RFC PATCH 0/4] mm, bpf: BPF based THP adjustment
To: Zi Yan
Cc: akpm@linux-foundation.org, ast@kernel.org, daniel@iogearbox.net,
 andrii@kernel.org, David Hildenbrand, Baolin Wang, Lorenzo Stoakes,
 "Liam R. Howlett", Nico Pache, Ryan Roberts, Dev Jain,
 bpf@vger.kernel.org, linux-mm@kvack.org, Johannes Weiner, Michal Hocko
In-Reply-To: <42ECBC51-E695-4480-A055-36D08FE61C12@nvidia.com>
References: <20250429024139.34365-1-laoar.shao@gmail.com>
 <42ECBC51-E695-4480-A055-36D08FE61C12@nvidia.com>
List-ID: linux-mm@kvack.org
On Wed, Apr 30, 2025 at 9:19 PM Zi Yan wrote:
>
> On 29 Apr 2025, at 22:33, Yafang Shao wrote:
>
> > On Tue, Apr 29, 2025 at 11:09 PM Zi Yan wrote:
> >>
> >> Hi Yafang,
> >>
> >> We recently added a new THP entry in MAINTAINERS file[1], do you mind ccing
> >> people there in your next version? (I added them here)
> >>
> >> [1] https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/tree/MAINTAINERS?h=mm-everything#n15589
> >
> > Thanks for your reminder.
> > I will add the maintainers and reviewers in the next version.
> >
> >>
> >> On Mon Apr 28, 2025 at 10:41 PM EDT, Yafang Shao wrote:
> >>> In our container environment, we aim to enable THP selectively, allowing
> >>> specific services to use it while restricting others. This approach is
> >>> driven by the following considerations:
> >>>
> >>> 1. Memory Fragmentation
> >>>    THP can lead to increased memory fragmentation, so we want to limit its
> >>>    use across services.
> >>> 2. Performance Impact
> >>>    Some services see no benefit from THP, making its usage unnecessary.
> >>> 3. Performance Gains
> >>>    Certain workloads, such as machine learning services, experience
> >>>    significant performance improvements with THP, so we enable it for them
> >>>    specifically.
> >>>
> >>> Since multiple services run on a single host in a containerized environment,
> >>> enabling THP globally is not ideal. Previously, we set THP to madvise,
> >>> allowing selected services to opt in via MADV_HUGEPAGE. However, this
> >>> approach had a limitation:
> >>>
> >>> - Some services inadvertently used madvise(MADV_HUGEPAGE) through
> >>>   third-party libraries, bypassing our restrictions.
> >>
> >> Basically, you want more precise control of THP enablement and the
> >> ability to override madvise() from userspace.
> >>
> >> In terms of overriding madvise(), do you have any concrete example of
> >> these third-party libraries?
> >> madvise() users are supposed to know what
> >> they are doing, so I wonder why they are causing trouble in your
> >> environment.
> >
> > To my knowledge, jemalloc [0] supports THP.
> > Applications using jemalloc typically rely on its default
> > configurations rather than explicitly enabling or disabling THP. If
> > the system is configured with THP=madvise, these applications may
> > automatically leverage THP where appropriate.
> >
> > [0] https://github.com/jemalloc/jemalloc
>
> It sounds like a userspace issue. For jemalloc, if applications require
> it, can't you replace the jemalloc with one compiled with --disable-thp
> to work around the issue?

That’s not the issue this patchset is trying to address or work
around. I believe we should focus on the actual problem it's meant to
solve.

By the way, you might not raise this question if you were managing a
large fleet of servers. We're a platform provider, but we don’t
maintain all the packages ourselves. Users make their own choices
based on their specific requirements. It's not a feasible solution for
us to develop and maintain every package.

> >
> >>
> >>>
> >>> To address this issue, we initially hooked the __x64_sys_madvise() syscall,
> >>> which is error-injectable, to blacklist unwanted services. While this
> >>> worked, it was error-prone and ineffective for services needing always mode,
> >>> as modifying their code to use madvise was impractical.
> >>>
> >>> To achieve finer-grained control, we introduced an fmod_ret-based solution.
> >>> Now, we dynamically adjust THP settings per service by hooking
> >>> hugepage_global_{enabled,always}() via BPF. This allows us to set THP to
> >>> enable or disable on a per-service basis without global impact.
> >>
> >> hugepage_global_*() are whole system knobs. How did you use it to
> >> achieve per-service control?
> >> In terms of per-service, does it mean
> >> you need per-memcg group (I assume each service has its own memcg) THP
> >> configuration?
> >
> > With this new BPF hook, we can manage THP behavior either per-service
> > or per-memcg.
> > In our use case, we’ve chosen memcg-based control for finer-grained
> > management. Below is a simplified example of our implementation:
> >
> > struct {
> >         __uint(type, BPF_MAP_TYPE_HASH);
> >         __uint(max_entries, 4096); /* usually there won't be too many cgroups */
> >         __type(key, u64);
> >         __type(value, u32);
> >         __uint(map_flags, BPF_F_NO_PREALLOC);
> > } thp_whitelist SEC(".maps");
> >
> > SEC("fmod_ret/mm_bpf_thp_vma_allowable")
> > int BPF_PROG(thp_vma_allowable, struct vm_area_struct *vma)
> > {
> >         struct css_set *cgroups;
> >         struct mm_struct *mm;
> >         struct cgroup *cgroup;
> >         struct task_struct *p;
> >         u64 cgrp_id;
> >
> >         if (!vma)
> >                 return 0;
> >
> >         mm = vma->vm_mm;
> >         if (!mm)
> >                 return 0;
> >
> >         p = mm->owner;
> >         cgroups = p->cgroups;
> >         cgroup = cgroups->subsys[memory_cgrp_id]->cgroup;
> >         cgrp_id = cgroup->kn->id;
> >
> >         /* Allow the tasks in the thp_whitelist to use THP. */
> >         if (bpf_map_lookup_elem(&thp_whitelist, &cgrp_id))
> >                 return 1;
> >         return 0;
> > }
> >
> > I chose not to include this in the self-tests to avoid the complexity
> > of setting up cgroups for testing purposes. However, in patch #4 of
> > this series, I've included a simpler example demonstrating task-level
> > control.
>
> For task-level control, why not use prctl(PR_SET_THP_DISABLE)?

You’ll need to modify the user-space code, and again, this likely
wouldn’t be a concern if you were managing a large fleet of servers.

>
> > For service-level control, we could potentially utilize BPF task local
> > storage as an alternative approach.
>
> +cgroup people
>
> For service-level control, there was a proposal of adding cgroup based
> THP control[1]. You might need a strong use case to convince people.
>
> [1] https://lore.kernel.org/linux-mm/20241030083311.965933-1-gutierrez.asier@huawei-partners.com/

Thanks for the reference. I've reviewed the related discussion, and if
I understand correctly, the proposal was rejected by the maintainers.

-- 
Regards
Yafang