From: Yafang Shao
To: akpm@linux-foundation.org, david@redhat.com, ziy@nvidia.com,
    baolin.wang@linux.alibaba.com, lorenzo.stoakes@oracle.com,
    Liam.Howlett@oracle.com, npache@redhat.com, ryan.roberts@arm.com,
    dev.jain@arm.com, hannes@cmpxchg.org, usamaarif642@gmail.com,
    gutierrez.asier@huawei-partners.com, willy@infradead.org,
    ast@kernel.org, daniel@iogearbox.net, andrii@kernel.org
Cc: bpf@vger.kernel.org, linux-mm@kvack.org, Yafang Shao
Subject: [RFC PATCH v3 0/5] mm, bpf: BPF based THP adjustment
Date: Sun, 8 Jun 2025 15:35:11 +0800
Message-Id: <20250608073516.22415-1-laoar.shao@gmail.com>

Background
----------

We have consistently configured THP to "never" on our production servers
due to past incidents caused by its behavior:

- Increased memory consumption
  THP significantly raises overall memory usage.

- Latency spikes
  Random latency spikes occur due to more frequent memory compaction
  activity triggered by THP.

- Lack of fine-grained control
  THP tuning knobs are globally configured, making them unsuitable for
  containerized environments. When different workloads run on the same
  host, enabling THP globally (without per-workload control) can cause
  unpredictable behavior.

Due to these issues, system administrators remain hesitant to switch to
"madvise" or "always" modes unless finer-grained control over THP
behavior is implemented.

New Motivation
--------------

We have now identified that certain AI workloads achieve substantial
performance gains with THP enabled. However, we have also verified that
some workloads see little to no benefit from THP, or are even negatively
impacted by it.

In our Kubernetes environment, we deploy mixed workloads on a single
server to maximize resource utilization. Our goal is to selectively
enable THP for services that benefit from it while keeping it disabled
for others. This approach allows us to incrementally enable THP for
additional services and assess how to make it more viable in production.

Proposed Solution
-----------------

To enable fine-grained control over THP behavior, we propose dynamically
adjusting THP policies using BPF. This approach allows per-workload THP
tuning, providing greater flexibility and precision.

The BPF-based THP adjustment mechanism introduces two new APIs for
granular policy control:

- THP allocator

  int (*allocator)(unsigned long vm_flags, unsigned long tva_flags);

  The BPF program returns either THP_ALLOC_CURRENT or
  THP_ALLOC_KHUGEPAGED, indicating whether THP allocation should be
  performed synchronously (current task) or asynchronously (khugepaged).

  The decision is based on the current task context, VMA flags, and TVA
  flags.

- THP reclaimer

  int (*reclaimer)(bool vma_madvised);

  The BPF program returns either RECLAIMER_CURRENT or RECLAIMER_KSWAPD,
  determining whether memory reclamation is handled by the current task
  or kswapd.

We may explore implementing fine-grained tuning for khugepaged in future
iterations.

Alternative Proposals
---------------------

- Gutierrez's cgroup-based approach [1]
  - Proposed adding a new cgroup file to control THP policy.
  - However, as Johannes noted, cgroups are designed for hierarchical
    resource allocation, not arbitrary policy settings [2].

- Usama's per-task THP proposal based on prctl() [3]
  - Enables THP per task via prctl().
  - This provides an alternative approach for per-workload THP tuning,
    though it lacks dynamic policy adjustment capabilities and thus
    offers limited flexibility.

This is currently a PoC implementation with limited testing. Feedback of
any kind is welcome.
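
As an illustration only, the two callbacks described under "Proposed
Solution" can be modeled in plain userspace C. This is not the code from
the series. The struct name thp_policy_ops, the SKETCH_VM_HUGEPAGE bit,
the numeric values of the return codes, and the policy itself (act
synchronously only for madvised memory) are all assumptions made for
this sketch. Only the two function signatures and the four return-code
names come from the text above.

```c
#include <stdbool.h>

/* Return codes named in the cover letter; the numeric values are assumed. */
enum thp_alloc_mode { THP_ALLOC_CURRENT = 0, THP_ALLOC_KHUGEPAGED = 1 };
enum thp_reclaim_mode { RECLAIMER_CURRENT = 0, RECLAIMER_KSWAPD = 1 };

/* Hypothetical stand-in for a "this VMA was madvised for THP" flag bit. */
#define SKETCH_VM_HUGEPAGE (1UL << 0)

/* Hypothetical container; only the callback signatures follow the text. */
struct thp_policy_ops {
	int (*allocator)(unsigned long vm_flags, unsigned long tva_flags);
	int (*reclaimer)(bool vma_madvised);
};

/* Allocate in the faulting task only for madvised VMAs, else defer. */
static int sketch_allocator(unsigned long vm_flags, unsigned long tva_flags)
{
	(void)tva_flags; /* unused by this simple policy */
	if (vm_flags & SKETCH_VM_HUGEPAGE)
		return THP_ALLOC_CURRENT;
	return THP_ALLOC_KHUGEPAGED;
}

/* Let the current task reclaim only for madvised VMAs, else use kswapd. */
static int sketch_reclaimer(bool vma_madvised)
{
	return vma_madvised ? RECLAIMER_CURRENT : RECLAIMER_KSWAPD;
}

static const struct thp_policy_ops sketch_ops = {
	.allocator = sketch_allocator,
	.reclaimer = sketch_reclaimer,
};
```

With this example policy, tasks that did not opt in never pay the
allocation or reclaim cost in their own context. A real policy would
presumably also consult the current task context, which a userspace
model cannot represent.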

Link: https://lore.kernel.org/linux-mm/20241030083311.965933-1-gutierrez.asier@huawei-partners.com/ [1]
Link: https://lore.kernel.org/linux-mm/20250430175954.GD2020@cmpxchg.org/ [2]
Link: https://lore.kernel.org/linux-mm/20250519223307.3601786-1-usamaarif642@gmail.com/ [3]

RFC v2->v3:
Thanks to the valuable input from David and Lorenzo:
- Finer-grained tuning based on madvise or always mode
- Use BPF to write more advanced policies / allocation logic

RFC v1->v2: https://lwn.net/Articles/1021783/
The main changes are as follows:
- Use struct_ops instead of fmod_ret (Alexei)
- Introduce a new THP mode (Johannes)
- Introduce new helpers for BPF hook (Zi)
- Refine the commit log

RFC v1: https://lwn.net/Articles/1019290/

Yafang Shao (5):
  mm, thp: use __thp_vma_allowable_orders() in khugepaged_enter_vma()
  mm, thp: add bpf thp hook to determine thp allocator
  mm, thp: add bpf thp hook to determine thp reclaimer
  mm: thp: add bpf thp struct ops
  selftests/bpf: Add selftest for THP adjustment

 include/linux/huge_mm.h                       |   8 +
 mm/Makefile                                   |   3 +
 mm/bpf_thp.c                                  | 184 ++++++++++++++++++
 mm/huge_memory.c                              |   5 +
 mm/khugepaged.c                               |   6 +-
 tools/testing/selftests/bpf/config            |   1 +
 .../selftests/bpf/prog_tests/thp_adjust.c     | 158 +++++++++++++++
 .../selftests/bpf/progs/test_thp_adjust.c     |  38 ++++
 8 files changed, 401 insertions(+), 2 deletions(-)
 create mode 100644 mm/bpf_thp.c
 create mode 100644 tools/testing/selftests/bpf/prog_tests/thp_adjust.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_thp_adjust.c

-- 
2.43.5