References: <20250520060504.20251-1-laoar.shao@gmail.com>
In-Reply-To: <20250520060504.20251-1-laoar.shao@gmail.com>
From: Nico Pache
Date: Tue, 20 May 2025 00:52:05 -0600
Subject: Re: [RFC PATCH v2 0/5] mm, bpf: BPF based THP adjustment
To: Yafang Shao
Cc: akpm@linux-foundation.org, david@redhat.com, ziy@nvidia.com,
 baolin.wang@linux.alibaba.com, lorenzo.stoakes@oracle.com,
 Liam.Howlett@oracle.com, ryan.roberts@arm.com, dev.jain@arm.com,
 hannes@cmpxchg.org, usamaarif642@gmail.com,
 gutierrez.asier@huawei-partners.com, willy@infradead.org, ast@kernel.org,
 daniel@iogearbox.net, andrii@kernel.org, bpf@vger.kernel.org,
 linux-mm@kvack.org

On Tue, May 20, 2025 at 12:06 AM Yafang Shao wrote:
>
> Background
> ----------
>
> At my current employer, PDD, we have consistently configured THP to "never"
> on our production servers due to past incidents caused by its behavior:
>
> - Increased memory consumption
>   THP significantly raises overall memory usage.
>
> - Latency spikes
>   Random latency spikes occur due to more frequent memory compaction
>   activity triggered by THP.
>
> These issues have made sysadmins hesitant to switch to "madvise" or
> "always" modes.
>
> New Motivation
> --------------
>
> We have now identified that certain AI workloads achieve substantial
> performance gains with THP enabled. However, we’ve also verified that some
> workloads see little to no benefit--or are even negatively impacted--by THP.
>
> In our Kubernetes environment, we deploy mixed workloads on a single server
> to maximize resource utilization. Our goal is to selectively enable THP for
> services that benefit from it while keeping it disabled for others. This
> approach allows us to incrementally enable THP for additional services and
> assess how to make it more viable in production.
>
> Proposed Solution
> -----------------
>
> For this use case, Johannes suggested introducing a dedicated mode [0]. In
> this new mode, we could implement BPF-based THP adjustment for fine-grained
> control over tasks or cgroups. If no BPF program is attached, THP remains
> in "never" mode. This solution elegantly meets our needs while avoiding the
> complexity of managing BPF alongside other THP modes.
>
> A selftest example demonstrates how to enable THP for the current task
> while keeping it disabled for others.
>
> Alternative Proposals
> ---------------------
>
> - Gutierrez’s cgroup-based approach [1]
>   - Proposed adding a new cgroup file to control THP policy.
>   - However, as Johannes noted, cgroups are designed for hierarchical
>     resource allocation, not arbitrary policy settings [2].
>
> - Usama’s per-task THP proposal based on prctl() [3]:
>   - Enabling THP per task via prctl().
>   - As David pointed out, neither madvise() nor prctl() works in "never"
>     mode [4], making this solution insufficient for our needs.

Hi Yafang Shao,

I believe you would have to invert your logic: set THP to "madvise" or
"always", and disable THP for the processes you don't want using it.
I have yet to look over Usama's solution in detail, but I believe this
is possible based on his cover letter.

I also have an alternative solution proposed here:
https://lore.kernel.org/lkml/20250515033857.132535-1-npache@redhat.com/

It's different in the sense that it doesn't give you granular per-process
or per-cgroup control, or BPF programmability, but it "may" suit your
needs by taming the THP waste and removing the latency spikes caused by
page-fault (PF) time THP compactions/allocations.

Cheers,
-- Nico

>
> Conclusion
> ----------
>
> Introducing a new "bpf" mode for BPF-based per-task THP adjustments is the
> most effective solution for our requirements. This approach represents a
> small but meaningful step toward making THP truly usable--and manageable--in
> production environments.
>
> This is currently a PoC implementation. Feedback of any kind is welcome.
>
> Link: https://lore.kernel.org/linux-mm/20250509164654.GA608090@cmpxchg.org/ [0]
> Link: https://lore.kernel.org/linux-mm/20241030083311.965933-1-gutierrez.asier@huawei-partners.com/ [1]
> Link: https://lore.kernel.org/linux-mm/20250430175954.GD2020@cmpxchg.org/ [2]
> Link: https://lore.kernel.org/linux-mm/20250519223307.3601786-1-usamaarif642@gmail.com/ [3]
> Link: https://lore.kernel.org/linux-mm/41e60fa0-2943-4b3f-ba92-9f02838c881b@redhat.com/ [4]
>
> RFC v1->v2:
> The main changes are as follows,
> - Use struct_ops instead of fmod_ret (Alexei)
> - Introduce a new THP mode (Johannes)
> - Introduce new helpers for BPF hook (Zi)
> - Refine the commit log
>
> RFC v1: https://lwn.net/Articles/1019290/
>
> Yafang Shao (5):
>   mm: thp: Add a new mode "bpf"
>   mm: thp: Add hook for BPF based THP adjustment
>   mm: thp: add struct ops for BPF based THP adjustment
>   bpf: Add get_current_comm to bpf_base_func_proto
>   selftests/bpf: Add selftest for THP adjustment
>
>  include/linux/huge_mm.h                       |  15 +-
>  kernel/bpf/cgroup.c                           |   2 -
>  kernel/bpf/helpers.c                          |   2 +
>  mm/Makefile                                   |   3 +
>  mm/bpf_thp.c                                  | 120 ++++++++++++
>  mm/huge_memory.c                              |  65 ++++++-
>  mm/khugepaged.c                               |   3 +
>  tools/testing/selftests/bpf/config            |   1 +
>  .../selftests/bpf/prog_tests/thp_adjust.c     | 175 ++++++++++++++++++
>  .../selftests/bpf/progs/test_thp_adjust.c     |  39 ++++
>  10 files changed, 414 insertions(+), 11 deletions(-)
>  create mode 100644 mm/bpf_thp.c
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/thp_adjust.c
>  create mode 100644 tools/testing/selftests/bpf/progs/test_thp_adjust.c
>
> --
> 2.43.5
>
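
For illustration, the inverted approach suggested in the reply above (global
mode "madvise" or "always", with unwanted processes opting out) could look
roughly like the sketch below. It assumes the existing
prctl(PR_SET_THP_DISABLE) flag is an acceptable opt-out mechanism; it is not
Usama's proposal and not part of this series. The launcher name no_thp.py and
the helper function names are hypothetical.

```python
import ctypes
import os
import sys

# Constants from <linux/prctl.h>.
PR_SET_THP_DISABLE = 41

# Global THP mode knob; the active mode is shown in brackets,
# e.g. "always [madvise] never".
THP_ENABLED_PATH = "/sys/kernel/mm/transparent_hugepage/enabled"


def active_thp_mode(sysfs_text):
    """Return the bracketed (active) mode, or None if none is marked."""
    for token in sysfs_text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


def disable_thp_for_current_process():
    """Set the "THP disable" flag for the caller; raise OSError on failure."""
    libc = ctypes.CDLL(None, use_errno=True)
    # prctl() takes unsigned long arguments; the unused ones must be 0.
    ret = libc.prctl(
        PR_SET_THP_DISABLE,
        ctypes.c_ulong(1),
        ctypes.c_ulong(0),
        ctypes.c_ulong(0),
        ctypes.c_ulong(0),
    )
    if ret != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def main(argv):
    # Hypothetical launcher: opt out of THP, then exec the real workload.
    if len(argv) < 2:
        print("usage: no_thp.py COMMAND [ARGS...]", file=sys.stderr)
        return 2
    with open(THP_ENABLED_PATH) as f:
        mode = active_thp_mode(f.read())
    if mode == "never":
        print("THP is globally 'never'; nothing to opt out of",
              file=sys.stderr)
    disable_thp_for_current_process()
    os.execvp(argv[1], argv[1:])


if __name__ == "__main__":
    sys.exit(main(sys.argv))
```

Run as "python3 no_thp.py ./workload --args". The flag is inherited across
fork() and preserved across execve(), so the workload and its children start
with THP disabled while other services on the host keep using the global mode.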