From mboxrd@z Thu Jan 1 00:00:00 1970
From: Yafang Shao <laoar.shao@gmail.com>
Date: Thu, 17 Jul 2025 11:09:56 +0800
Subject: Re: [RFC PATCH v3 0/5] mm, bpf: BPF based THP adjustment
To: David Hildenbrand, Matthew Wilcox
Cc: akpm@linux-foundation.org, ziy@nvidia.com, baolin.wang@linux.alibaba.com,
 lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com, npache@redhat.com,
 ryan.roberts@arm.com, dev.jain@arm.com, hannes@cmpxchg.org,
 usamaarif642@gmail.com, gutierrez.asier@huawei-partners.com, ast@kernel.org,
 daniel@iogearbox.net, andrii@kernel.org, bpf@vger.kernel.org,
 linux-mm@kvack.org
References: <20250608073516.22415-1-laoar.shao@gmail.com>
Content-Type: text/plain; charset="UTF-8"
On Wed, Jul 16, 2025 at 6:42 AM David Hildenbrand wrote:
>
> On 08.06.25 09:35, Yafang Shao wrote:
>
> Sorry for not replying earlier, I was caught up with all other stuff.
>
> I still consider this a very interesting approach, although I think we
> should think more about what a reasonable policy would look like
> medium-term (in particular, multiple THP sizes, not always falling back
> to small pages if it means splitting excessively in the buddy etc.)

I find it difficult to understand why we introduced the mTHP sysfs knobs
instead of implementing automatic THP size switching within the kernel.
I'm skeptical about their practical utility in real-world workloads.

In contrast, XFS large folios (a.k.a. file THP) can automatically select
orders between 0 and 9. Based on our verification, this feature has
proven genuinely useful for certain specific workloads -- though it's
not yet perfect.

> > Background
> > ----------
> >
> > We have consistently configured THP to "never" on our production servers
> > due to past incidents caused by its behavior:
> >
> > - Increased memory consumption
> >   THP significantly raises overall memory usage.
> >
> > - Latency spikes
> >   Random latency spikes occur due to more frequent memory compaction
> >   activity triggered by THP.
> >
> > - Lack of Fine-Grained Control
> >   THP tuning knobs are globally configured, making them unsuitable for
> >   containerized environments. When different workloads run on the same
> >   host, enabling THP globally (without per-workload control) can cause
> >   unpredictable behavior.
> >
> > Due to these issues, system administrators remain hesitant to switch to
> > "madvise" or "always" modes -- unless finer-grained control over THP
> > behavior is implemented.
> >
> > New Motivation
> > --------------
> >
> > We have now identified that certain AI workloads achieve substantial
> > performance gains with THP enabled. However, we've also verified that
> > some workloads see little to no benefit -- or are even negatively
> > impacted -- by THP.
> >
> > In our Kubernetes environment, we deploy mixed workloads on a single
> > server to maximize resource utilization. Our goal is to selectively
> > enable THP for services that benefit from it while keeping it disabled
> > for others. This approach allows us to incrementally enable THP for
> > additional services and assess how to make it more viable in production.
> >
> > Proposed Solution
> > -----------------
> >
> > To enable fine-grained control over THP behavior, we propose dynamically
> > adjusting THP policies using BPF. This approach allows per-workload THP
> > tuning, providing greater flexibility and precision.
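
(To make this more concrete: such a policy would be loaded per workload as
a BPF program implementing the two callbacks quoted below. A minimal
sketch follows, assuming the struct_ops type is exposed to BPF as
"thp_adjust_ops" and that the THP_ALLOC_*/RECLAIMER_* constants are
visible via vmlinux.h -- the type name, section names and attach details
here are illustrative only; the authoritative interface is the one in the
patches.)

  /* Illustrative sketch of a per-workload THP policy program.
   * The struct_ops type name and section names are assumptions; the
   * constants are the ones described in the cover letter, assumed to
   * be visible through vmlinux.h once the series is applied.
   */
  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>
  #include <bpf/bpf_tracing.h>

  SEC("struct_ops/allocator")
  int BPF_PROG(thp_allocator, unsigned long vm_flags, unsigned long tva_flags)
  {
          /* Trivial policy: always defer THP allocation to khugepaged so
           * the faulting task never stalls. A real policy would inspect
           * vm_flags/tva_flags (and, with additional kfuncs, PSI or
           * fragmentation data) before returning THP_ALLOC_CURRENT.
           */
          return THP_ALLOC_KHUGEPAGED;
  }

  SEC("struct_ops/reclaimer")
  int BPF_PROG(thp_reclaimer, bool vma_madvised)
  {
          /* Direct reclaim only for madvised VMAs; otherwise leave the
           * work to kswapd in the background.
           */
          return vma_madvised ? RECLAIMER_CURRENT : RECLAIMER_KSWAPD;
  }

  SEC(".struct_ops.link")
  struct thp_adjust_ops thp_policy = {
          .allocator = (void *)thp_allocator,
          .reclaimer = (void *)thp_reclaimer,
  };

  char LICENSE[] SEC("license") = "GPL";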
> >
> > The BPF-based THP adjustment mechanism introduces two new APIs for
> > granular policy control:
> >
> > - THP allocator
> >
> >   int (*allocator)(unsigned long vm_flags, unsigned long tva_flags);
> >
> >   The BPF program returns either THP_ALLOC_CURRENT or THP_ALLOC_KHUGEPAGED,
> >   indicating whether THP allocation should be performed synchronously
> >   (current task) or asynchronously (khugepaged).
> >
> >   The decision is based on the current task context, VMA flags, and TVA
> >   flags.
>
> I think we should go one step further and actually get advice about the
> orders (THP sizes) to use. It might be helpful if the program would have
> access to system stats, to make an educated decision.
>
> Given page fault information and system information, the program could
> then decide which orders to try to allocate.

Yes, that aligns with my thoughts as well. For instance, we could
automate the decision-making process based on factors like PSI, memory
fragmentation, and other metrics. However, this logic could be
implemented within BPF programs -- all we'd need is to extend the
feature by introducing a few kfuncs (also known as BPF helpers).

> That means, one would query during page faults and during khugepaged,
> which order one should try -- compared to our current approach of "start
> with the largest order that is enabled and fits".
>
> >
> > - THP reclaimer
> >
> >   int (*reclaimer)(bool vma_madvised);
> >
> >   The BPF program returns either RECLAIMER_CURRENT or RECLAIMER_KSWAPD,
> >   determining whether memory reclamation is handled by the current task
> >   or kswapd.
>
> Not sure about that, will have to look into the details.

Some workloads allocate all their memory during initialization and do
not require THP at runtime. For such cases, aggressively attempting THP
allocation is beneficial. However, other workloads may dynamically
allocate THP during execution -- if these are latency-sensitive, we must
avoid introducing long allocation delays.

Given these differing requirements, the global
/sys/kernel/mm/transparent_hugepage/defrag setting is insufficient.
Instead, we should implement per-workload defrag policies to better
optimize performance based on individual application behavior.

> But what could be interesting is deciding how to deal with underutilized
> THPs: for now we will try replacing zero-filled pages by the shared
> zeropage during a split. *maybe* some workloads could benefit from ...
> not doing that, and instead optimize the split.

I believe a per-workload THP shrinker (e.g.,
/sys/kernel/mm/transparent_hugepage/shrink_underused) would also be
valuable. Thank you for the suggestion.

-- 
Regards
Yafang