Message-ID: <61411216-d196-42de-aa64-12bd28aef44f@gmail.com>
Date: Mon, 26 Aug 2024 17:47:24 +0100
From: Usama Arif
Subject: Re: [RFC 0/2] mm: introduce THP deferred setting
To: Nico Pache, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Cc: linux-doc@vger.kernel.org, Andrew Morton, David Hildenbrand,
 Matthew Wilcox, Barry Song, Ryan Roberts, Baolin Wang, Lance Yang,
 Peter Xu, Rafael Aquini, Andrea Arcangeli, Jonathan Corbet,
 "Kirill A . Shutemov", Zi Yan, Johannes Weiner
References: <20240729222727.64319-1-npache@redhat.com>
 <72320F9D-9B6A-4ABA-9B18-E59B8382A262@nvidia.com>

On 26/08/2024 11:40, Nico Pache wrote:
> On Tue, Jul 30, 2024 at 4:37 PM Nico Pache wrote:
>>
>> Hi Zi Yan,
>> On Mon, Jul 29, 2024 at 7:26 PM Zi Yan wrote:
>>>
>>> +Kirill
>>>
>>> On 29 Jul 2024, at 18:27, Nico Pache wrote:
>>>
>>>> We've seen cases where customers switching from RHEL7 to RHEL8 see a
>>>> significant increase in the memory footprint for the same workloads.
>>>>
>>>> Through our investigations we found that a large contributing factor to
>>>> the increase in RSS was an increase in THP usage.
>>>
>>> Was any knob changed from RHEL7 to RHEL8 to cause more THP usage?
>> IIRC, most of the system tuning is the same. We attributed the
>> increase in THP usage to a combination of improvements in the kernel,
>> and improvements in the libraries (better alignments). That allowed
>> THP allocations to succeed at a higher rate. I can go back and confirm
>> this tomorrow though.
>>>
>>>>
>>>> For workloads like MySQL, or when using allocators like jemalloc, it is
>>>> often recommended to set /transparent_hugepages/enabled=never. This is
>>>> in part due to performance degradations and increased memory waste.
>>>>
>>>> This series introduces enabled=defer. This setting acts as a middle
>>>> ground between always and madvise. If the mapping is MADV_HUGEPAGE, the
>>>> page fault handler will act normally, making a hugepage if possible. If
>>>> the allocation is not MADV_HUGEPAGE, then the page fault handler will
>>>> default to the base size allocation. The caveat is that khugepaged can
>>>> still operate on pages that are not MADV_HUGEPAGE.
>>>
>>> Why? If the user does not explicitly want huge pages, why bother providing
>>> huge pages? Wouldn't it increase memory footprint?
>>
>> So we have "always", which will always try to allocate a THP when it
>> can. This setting gives good performance in a lot of conditions, but
>> tends to waste memory. Additionally, applications DON'T need to be
>> modified to take advantage of THPs.
>>
>> We have "madvise", which will only satisfy allocations that are
>> MADV_HUGEPAGE. This gives you granular control, and a lot of times
>> these madvises come from libraries. Unlike "always", you DO need to
>> modify your application if you want to use THPs.
>>
>> Then we have "never", which of course never allocates THPs.
>>
>> OK, back to your question: like "madvise", "defer" gives you the
>> benefits of THPs when you specifically know you want them
>> (MADV_HUGEPAGE), but also benefits applications that don't specifically
>> ask for them (or can't be modified to ask for them), like "always"
>> does. The applications that don't ask for THPs must wait for khugepaged
>> to get them (avoiding insertions at PF time); this curbs a lot of memory
>> waste, and gives increased tunability over "always". Another added
>> benefit is that khugepaged will most likely not operate on short-lived
>> allocations, meaning that only longstanding memory will be collapsed
>> to THPs.
>>
>> The memory waste can be tuned with max_ptes_none. Let's say you want
>> ~90% of your PMD to be full before collapsing into a huge page: simply
>> set max_ptes_none=64. Or, for no waste, set max_ptes_none=0, requiring
>> all 512 pages to be present before being collapsed.
>>
>>>
>>>>
>>>> This allows for two things... one, applications specifically designed to
>>>> use hugepages will get them, and two, applications that don't use
>>>> hugepages can still benefit from them without aggressively inserting
>>>> THPs at every possible chance. This curbs the memory waste, and defers
>>>> the use of hugepages to khugepaged. Khugepaged can then scan the memory
>>>> for ranges eligible for collapsing.
>>>
>>> khugepaged would replace application memory with huge pages without a
>>> specific goal. Why not use a user space agent with process_madvise() to
>>> collapse huge pages? Admin might have more knobs to tweak than khugepaged.
>>
>> The benefits of "always" are that no userspace agent is needed, and
>> applications don't have to be modified to use madvise(MADV_HUGEPAGE) to
>> benefit from THPs. This setting hopes to gain some of the same
>> benefits without the significant waste of memory, and with increased
>> tunability.
>>
>> Future changes I have in the works are to make khugepaged more
>> "smart": moving it away from the round-robin fashion it currently
>> operates in, to instead make smart and informed decisions about what
>> memory to collapse (and potentially split).
>>
>> Hopefully that helped explain the motivation for this new setting!
>
> Any last comments before I resend this?
>
> I've been made aware of
> https://lore.kernel.org/all/20240730125346.1580150-1-usamaarif642@gmail.com/T/#u
> which introduces THP splitting. These are both trying to achieve the
> same thing through different means. Our approach leverages khugepaged
> to promote pages, while Usama's uses the reclaim path to demote
> hugepages and shrink the underlying memory.
>
> I will leave it up to reviewers to determine which is better; however,
> we can't have both, as we'd be introducing thrashing conditions.
>

Hi,

Just inserting this here from my cover letter:

Waiting for khugepaged to scan memory and collapse pages into THP can be
slow and unpredictable in terms of performance (i.e. you don't know when
the collapse will happen), while production environments require
predictable performance. If there is enough memory available, it's better
for both performance and predictability to have a THP from fault time,
i.e. THP=always, rather than wait for khugepaged to collapse it, and deal
with sparsely populated THPs when the system is running out of memory.

I just went through your patches, and am not sure why we can't have both?
Both use max_ptes_none as the tunable. If the number of zero-filled pages
is above max_ptes_none, the shrinker will split them, and khugepaged will
not collapse them (SCAN_EXCEED_NONE_PTE), so I don't see how it causes
thrashing?

> Cheers,
> -- Nico
>
>>
>> Cheers!
>> -- Nico
>>>
>>>>
>>>> Admins may want to lower max_ptes_none; if not, khugepaged may
>>>> aggressively collapse single allocations into hugepages.
>>>>
>>>> RFC note
>>>> ==========
>>>> I'm not sure if I'm missing anything related to the mTHP
>>>> changes. I think now that we have hugepage_pmd_enabled in
>>>> commit 00f58104202c ("mm: fix khugepaged activation policy") everything
>>>> should work as expected.
>>>>
>>>> Nico Pache (2):
>>>>   mm: defer THP insertion to khugepaged
>>>>   mm: document transparent_hugepage=defer usage
>>>>
>>>>  Documentation/admin-guide/mm/transhuge.rst | 18 ++++++++++---
>>>>  include/linux/huge_mm.h                    | 15 +++++++++--
>>>>  mm/huge_memory.c                           | 31 +++++++++++++++++++---
>>>>  3 files changed, 55 insertions(+), 9 deletions(-)
>>>>
>>>> Cc: Andrew Morton
>>>> Cc: David Hildenbrand
>>>> Cc: Matthew Wilcox
>>>> Cc: Barry Song
>>>> Cc: Ryan Roberts
>>>> Cc: Baolin Wang
>>>> Cc: Lance Yang
>>>> Cc: Peter Xu
>>>> Cc: Zi Yan
>>>> Cc: Rafael Aquini
>>>> Cc: Andrea Arcangeli
>>>> Cc: Jonathan Corbet
>>>> --
>>>> 2.45.2
>>>
>>> --
>>> Best Regards,
>>> Yan, Zi
>
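
For readers following the max_ptes_none discussion in this thread, the
threshold argument can be sketched as a small model. This is only an
illustrative Python sketch, not kernel code. It assumes 512 base pages per
PMD-sized THP (the "512 pages" figure quoted in the thread), treats empty
and zero-filled PTEs as a single count, and uses hypothetical function
names that do not exist in the kernel. The collapse and split rules are
the ones stated in the reply: khugepaged refuses to collapse once the
count exceeds max_ptes_none, and the shrinker splits only when the count
exceeds max_ptes_none.

```python
# Illustrative model only; names below are hypothetical, not kernel symbols.
HPAGE_PMD_NR = 512  # assumed: base pages per PMD-sized THP (4 KiB pages, 2 MiB THP)


def min_fill_fraction(max_ptes_none: int) -> float:
    """Smallest populated fraction of a PMD range that may still be collapsed."""
    return (HPAGE_PMD_NR - max_ptes_none) / HPAGE_PMD_NR


def khugepaged_may_collapse(none_or_zero: int, max_ptes_none: int) -> bool:
    """Collapse is refused once empty/zero PTEs exceed the threshold
    (the SCAN_EXCEED_NONE_PTE case mentioned in the reply)."""
    return none_or_zero <= max_ptes_none


def shrinker_would_split(zero_filled: int, max_ptes_none: int) -> bool:
    """Per the reply: the shrinker splits a THP only when its zero-filled
    pages are above the same threshold."""
    return zero_filled > max_ptes_none


def both_can_fire(max_ptes_none: int) -> bool:
    """True if some page count is both collapsible and splittable,
    which is the condition that would allow collapse/split thrashing."""
    return any(
        khugepaged_may_collapse(n, max_ptes_none)
        and shrinker_would_split(n, max_ptes_none)
        for n in range(HPAGE_PMD_NR + 1)
    )


if __name__ == "__main__":
    for threshold in (0, 64, 511):
        print(
            f"max_ptes_none={threshold}: "
            f"min fill {min_fill_fraction(threshold):.1%}, "
            f"overlap possible: {both_can_fire(threshold)}"
        )
```

Under these assumptions, max_ptes_none=64 corresponds to 448 of 512 pages
present (87.5%, the "~90%" figure above), and because the two predicates
are exact complements on the same counter, no single threshold lets the
same range be collapsed and then immediately split. Whether that holds in
practice depends on details the model leaves out, such as pages being
zeroed or freed between a collapse and the next shrinker pass.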