Date: Wed, 16 Jul 2025 00:42:22 +0200
Subject: Re: [RFC PATCH v3 0/5] mm, bpf: BPF based THP adjustment
From: David Hildenbrand <david@redhat.com>
Organization: Red Hat
To: Yafang Shao <laoar.shao@gmail.com>, akpm@linux-foundation.org, ziy@nvidia.com, baolin.wang@linux.alibaba.com, lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com, npache@redhat.com, ryan.roberts@arm.com, dev.jain@arm.com, hannes@cmpxchg.org, usamaarif642@gmail.com, gutierrez.asier@huawei-partners.com, willy@infradead.org, ast@kernel.org, daniel@iogearbox.net, andrii@kernel.org
Cc: bpf@vger.kernel.org, linux-mm@kvack.org
References: <20250608073516.22415-1-laoar.shao@gmail.com>
In-Reply-To: <20250608073516.22415-1-laoar.shao@gmail.com>
On 08.06.25 09:35, Yafang Shao wrote:

Sorry for not replying earlier, I was caught up with all the other stuff.
I still consider this a very interesting approach, although I think we
should think more about what a reasonable policy would look like
medium-term (in particular, multiple THP sizes, and not always falling back
to small pages if it means splitting excessively in the buddy, etc.).

> Background
> ----------
>
> We have consistently configured THP to "never" on our production servers
> due to past incidents caused by its behavior:
>
> - Increased memory consumption
>   THP significantly raises overall memory usage.
>
> - Latency spikes
>   Random latency spikes occur due to more frequent memory compaction
>   activity triggered by THP.
>
> - Lack of Fine-Grained Control
>   THP tuning knobs are globally configured, making them unsuitable for
>   containerized environments. When different workloads run on the same
>   host, enabling THP globally (without per-workload control) can cause
>   unpredictable behavior.
>
> Due to these issues, system administrators remain hesitant to switch to
> "madvise" or "always" modes—unless finer-grained control over THP
> behavior is implemented.
>
> New Motivation
> --------------
>
> We have now identified that certain AI workloads achieve substantial
> performance gains with THP enabled. However, we’ve also verified that some
> workloads see little to no benefit—or are even negatively impacted—by THP.
>
> In our Kubernetes environment, we deploy mixed workloads on a single
> server to maximize resource utilization. Our goal is to selectively enable
> THP for services that benefit from it while keeping it disabled for
> others. This approach allows us to incrementally enable THP for additional
> services and assess how to make it more viable in production.
>
> Proposed Solution
> -----------------
>
> To enable fine-grained control over THP behavior, we propose dynamically
> adjusting THP policies using BPF. This approach allows per-workload THP
> tuning, providing greater flexibility and precision.
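To make the per-workload idea concrete, the kind of decision such a BPF
hook would make can be sketched in plain C. This is purely illustrative:
the two return values come from the proposal below, but the flag names
(VM_HUGEPAGE_HINT, TVA_IN_PF) and the helper name are made up here and are
not the series' actual API.

```c
#include <assert.h>

/* Illustrative flag bits -- NOT the kernel's actual definitions. */
#define VM_HUGEPAGE_HINT 0x1UL /* VMA was madvise(MADV_HUGEPAGE)'d */
#define TVA_IN_PF        0x2UL /* query happens during a page fault */

/* Return values as described in the proposal. */
enum thp_alloc_hint {
	THP_ALLOC_KHUGEPAGED = 0, /* defer collapse to khugepaged */
	THP_ALLOC_CURRENT    = 1, /* allocate the THP synchronously */
};

/*
 * Sketch of one possible per-workload policy: only pay the synchronous
 * allocation (and potential compaction) cost at fault time when the
 * workload explicitly opted in via madvise; defer everything else to
 * khugepaged.
 */
static inline int thp_alloc_policy(unsigned long vm_flags,
				   unsigned long tva_flags)
{
	if ((tva_flags & TVA_IN_PF) && (vm_flags & VM_HUGEPAGE_HINT))
		return THP_ALLOC_CURRENT;
	return THP_ALLOC_KHUGEPAGED;
}
```

A real BPF program attached at this hook could of course consult much more
state (cgroup, task, system-wide stats) than this two-flag sketch does.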
>
> The BPF-based THP adjustment mechanism introduces two new APIs for
> granular policy control:
>
> - THP allocator
>
>   int (*allocator)(unsigned long vm_flags, unsigned long tva_flags);
>
>   The BPF program returns either THP_ALLOC_CURRENT or
>   THP_ALLOC_KHUGEPAGED, indicating whether THP allocation should be
>   performed synchronously (current task) or asynchronously (khugepaged).
>
>   The decision is based on the current task context, VMA flags, and TVA
>   flags.

I think we should go one step further and actually get advice about the
orders (THP sizes) to use.

It might be helpful if the program had access to system stats, to make an
educated decision. Given page-fault information and system information, the
program could then decide which orders to try to allocate.

That means one would query during page faults and during khugepaged which
order one should try -- compared to our current approach of "start with the
largest order that is enabled and fits".

>
> - THP reclaimer
>
>   int (*reclaimer)(bool vma_madvised);
>
>   The BPF program returns either RECLAIMER_CURRENT or RECLAIMER_KSWAPD,
>   determining whether memory reclamation is handled by the current task
>   or kswapd.

Not sure about that, will have to look into the details.

But what could be interesting is deciding how to deal with underutilized
THPs: for now we will try replacing zero-filled pages by the shared
zeropage during a split. *Maybe* some workloads could benefit from ... not
doing that, and instead optimizing the split. Will maybe be a bit more
tricky, though.

-- 
Cheers,

David / dhildenb