From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Tue, 27 Aug 2024 13:09:59 +0200
From: Johannes Weiner
To: Usama Arif
Cc: Nico Pache, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	linux-doc@vger.kernel.org, Andrew Morton, David Hildenbrand,
	Matthew Wilcox, Barry Song, Ryan Roberts, Baolin Wang, Lance Yang,
	Peter Xu, Rafael Aquini, Andrea Arcangeli, Jonathan Corbet,
	"Kirill A . Shutemov", Zi Yan
Subject: Re: [RFC 0/2] mm: introduce THP deferred setting
Message-ID: <20240827110959.GA438928@cmpxchg.org>
References: <20240729222727.64319-1-npache@redhat.com>
	<72320F9D-9B6A-4ABA-9B18-E59B8382A262@nvidia.com>
	<61411216-d196-42de-aa64-12bd28aef44f@gmail.com>
	<698ea52e-db99-4d21-9984-ad07038d4068@gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <698ea52e-db99-4d21-9984-ad07038d4068@gmail.com>

On Tue, Aug 27, 2024 at 11:37:14AM +0100, Usama Arif wrote:
>
>
> On 26/08/2024 17:14, Nico Pache wrote:
> > On Mon, Aug 26, 2024 at 10:47 AM Usama Arif wrote:
> >>
> >>
> >>
> >> On 26/08/2024 11:40, Nico Pache wrote:
> >>> On Tue, Jul 30, 2024 at 4:37 PM Nico Pache wrote:
> >>>>
> >>>> Hi Zi Yan,
> >>>> On Mon, Jul 29, 2024 at 7:26 PM Zi Yan wrote:
> >>>>>
> >>>>> +Kirill
> >>>>>
> >>>>> On 29 Jul 2024, at 18:27, Nico Pache wrote:
> >>>>>
> >>>>>> We've seen cases where customers switching from RHEL7 to RHEL8 see a
> >>>>>> significant increase in the memory footprint for the same workloads.
> >>>>>>
> >>>>>> Through our investigations we found that a large contributing factor to
> >>>>>> the increase in RSS was an increase in THP usage.
> >>>>>
> >>>>> Was any knob changed from RHEL7 to RHEL8 to cause more THP usage?
> >>>> IIRC, most of the system tuning is the same. We attributed the
> >>>> increase in THP usage to a combination of improvements in the kernel,
> >>>> and improvements in the libraries (better alignments). That allowed
> >>>> THP allocations to succeed at a higher rate. I can go back and confirm
> >>>> this tomorrow though.
> >>>>>
> >>>>>>
> >>>>>> For workloads like MySQL, or when using allocators like jemalloc, it is
> >>>>>> often recommended to set /transparent_hugepages/enabled=never. This is
> >>>>>> in part due to performance degradations and increased memory waste.
> >>>>>>
> >>>>>> This series introduces enabled=defer. This setting acts as a middle
> >>>>>> ground between always and madvise. If the mapping is MADV_HUGEPAGE, the
> >>>>>> page fault handler will act normally, making a hugepage if possible. If
> >>>>>> the allocation is not MADV_HUGEPAGE, then the page fault handler will
> >>>>>> default to the base size allocation. The caveat is that khugepaged can
> >>>>>> still operate on pages that are not MADV_HUGEPAGE.
> >>>>>
> >>>>> Why? If the user does not explicitly want huge pages, why bother providing
> >>>>> huge pages? Wouldn't it increase memory footprint?
> >>>>
> >>>> So we have "always", which will always try to allocate a THP when it
> >>>> can. This setting gives good performance in a lot of conditions, but
> >>>> tends to waste memory. Additionally applications DON'T need to be
> >>>> modified to take advantage of THPs.
> >>>>
> >>>> We have "madvise" which will only satisfy allocations that are
> >>>> MADV_HUGEPAGE. This gives you granular control, and a lot of times
> >>>> these madvises come from libraries. Unlike "always" you DO need to
> >>>> modify your application if you want to use THPs.
> >>>>
> >>>> Then we have "never", which of course, never allocates THPs.
> >>>>
> >>>> Ok. Back to your question: like "madvise", "defer" gives you the
> >>>> benefits of THPs when you specifically know you want them
> >>>> (MADV_HUGEPAGE), but also benefits applications that don't specifically
> >>>> ask for them (or can't be modified to ask for them), like "always"
> >>>> does.
> >>>> The applications that don't ask for THPs must wait for khugepaged
> >>>> to get them (avoid insertions at PF time) -- this curbs a lot of memory
> >>>> waste, and gives an increased tunability over "always". Another added
> >>>> benefit is that khugepaged will most likely not operate on short lived
> >>>> allocations, meaning that only longstanding memory will be collapsed
> >>>> to THPs.
> >>>>
> >>>> The memory waste can be tuned with max_ptes_none... let's say you want
> >>>> ~90% of your PMD to be full before collapsing into a huge page. Simply
> >>>> set max_ptes_none=64. Or for no waste, set max_ptes_none=0, requiring
> >>>> all 512 pages to be present before being collapsed.
> >>>>
> >>>>>
> >>>>>>
> >>>>>> This allows for two things... one, applications specifically designed to
> >>>>>> use hugepages will get them, and two, applications that don't use
> >>>>>> hugepages can still benefit from them without aggressively inserting
> >>>>>> THPs at every possible chance. This curbs the memory waste, and defers
> >>>>>> the use of hugepages to khugepaged. Khugepaged can then scan the memory
> >>>>>> for eligible collapsing.
> >>>>>
> >>>>> khugepaged would replace application memory with huge pages without a
> >>>>> specific goal. Why not use a user space agent with process_madvise() to
> >>>>> collapse huge pages? Admin might have more knobs to tweak than khugepaged.
> >>>>
> >>>> The benefits of "always" are that no userspace agent is needed, and
> >>>> applications don't have to be modified to use madvise(MADV_HUGEPAGE) to
> >>>> benefit from THPs. This setting hopes to gain some of the same
> >>>> benefits without the significant waste of memory and with increased
> >>>> tunability.
> >>>>
> >>>> Future changes I have in the works are to make khugepaged more
> >>>> "smart", moving it away from the round robin fashion it currently
> >>>> operates in, to instead make smart and informed decisions of what
> >>>> memory to collapse (and potentially split).
> >>>>
> >>>> Hopefully that helped explain the motivation for this new setting!
> >>>
> >>> Any last comments before I resend this?
> >>>
> >>> I've been made aware of
> >>> https://lore.kernel.org/all/20240730125346.1580150-1-usamaarif642@gmail.com/T/#u
> >>> which introduces THP splitting. These are both trying to achieve the
> >>> same thing through different means. Our approach leverages khugepaged
> >>> to promote pages, while Usama's uses the reclaim path to demote
> >>> hugepages and shrink the underlying memory.
> >>>
> >>> I will leave it up to reviewers to determine which is better; however,
> >>> we can't have both, as we'd be introducing thrashing conditions.
> >>>
> >>
> >> Hi,
> >>
> >> Just inserting this here from my cover letter:
> >>
> >> Waiting for khugepaged to scan memory and
> >> collapse pages into THP can be slow and unpredictable in terms of performance
> > Obviously not part of my patchset here, but I have been testing some
> > changes to khugepaged to make it more aware of what processes are hot.
> > Ideally then it can make better choices of what to operate on.
> >> (i.e. you don't know when the collapse will happen), while production
> >> environments require predictable performance. If there is enough memory
> >> available, it's better for both performance and predictability to have
> >> a THP from fault time, i.e. THP=always rather than wait for khugepaged
> >> to collapse it, and deal with sparsely populated THPs when the system is
> >> running out of memory.
> >>
> >> I just went through your patches, and am not sure why we can't have both?
> > Fair point, we can. I've been playing around with splitting hugepages
> > via khugepaged and was thinking of the thrashing conditions there --
> > but your implementation takes a different approach.
> > I've been working on performance testing my "defer" changes; once I
> > find the appropriate workloads I'll try adding your changes to the
> > mix.
> > I have a feeling my approach is better for latency sensitive
> > workloads, while yours is better for throughput, but let me find a way
> > to confirm that.
> >
> >
> Hmm, I am not sure if it's latency vs throughput.
>
> There are 2 things we probably want to consider, short lived and long lived mappings, and
> in each of these situations, having enough memory and running out of memory.
>
> For short lived mappings, I believe reducing page faults is a bigger factor in
> improving performance. In that case, khugepaged won't have enough time to work,
> so THP=always will perform better than THP=defer. THP=defer in this case will perform
> the same as THP=madvise?
> If there is enough memory, then the changes I introduced in the shrinker won't cost anything
> as the shrinker won't run, and the system performance will be the same as THP=always.
> If there is low memory and the shrinker runs, it will only split THPs that have more
> zero-filled pages than max_ptes_none, and map the zero-filled pages to shared zero-pages, saving memory.
> There is of course a cost to splitting and running the shrinker, but hopefully it only splits
> underused THPs.
>
> For long lived mappings, reduced TLB misses would be the bigger factor in improving performance.
> For the initial run of the application THP=always will perform better wrt TLB misses as
> the page fault handler will give THPs from the start.
> Later on in the run, the memory might look similar between THP=always with shrinker and
> max_ptes_none < HPAGE_PMD_NR vs THP=defer and max_ptes_none < HPAGE_PMD_NR?
> This is because khugepaged will have collapsed pages that might have initially been faulted in.
> And collapsing has a cost, which would not have been incurred if the THPs were present from fault.
> If there is low memory, then the shrinker would split memory (which has a cost as well) and the system
> memory would look similar or better than THP=defer, as the shrinker would split THPs that initially
> might not have been underused, but are underused at the time of memory pressure.
>
> With THP=always + underused shrinker, the cost (splitting) is incurred only if needed and when it's needed.
> While with THP=defer the cost (higher page faults, higher TLB misses + khugepaged collapse) is incurred all the time,
> even if the system might have plenty of memory available and there is no need to take a performance hit.

I agree with this. The defer mode is an improvement over the upstream
status quo, no doubt. However, both defer mode and the shrinker solve
the issue of memory waste under pressure, while the shrinker permits
more desirable behavior when memory is abundant.

So my take is that the shrinker is the way to go, and I don't see a
bona fide use case for defer mode that the shrinker couldn't cover.
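
For reference, the max_ptes_none arithmetic quoted earlier in the thread
(64 empty PTEs tolerated out of 512) can be checked with a few lines. This
is only an illustrative sketch, not kernel code: the helper names are
made up, and it assumes 4 KiB base pages with a 2 MiB PMD-sized THP, i.e.
the 512 pages per PMD mentioned above.

```python
# Illustrative sketch only; the names below are hypothetical, not kernel APIs.
# Assumption: 4 KiB base pages and a 2 MiB PMD-sized THP, so 512 PTEs per PMD.
HPAGE_PMD_NR = 512


def min_populated_ptes(max_ptes_none: int, nr_ptes: int = HPAGE_PMD_NR) -> int:
    """Fewest populated PTEs a PMD range needs before a collapse is allowed."""
    # This sketch only accepts values that leave at least one populated PTE.
    if not 0 <= max_ptes_none < nr_ptes:
        raise ValueError("max_ptes_none must be in [0, nr_ptes)")
    return nr_ptes - max_ptes_none


def min_populated_fraction(max_ptes_none: int, nr_ptes: int = HPAGE_PMD_NR) -> float:
    """Same threshold expressed as a fraction of the PMD range."""
    return min_populated_ptes(max_ptes_none, nr_ptes) / nr_ptes


if __name__ == "__main__":
    for value in (0, 64):
        print(value, min_populated_ptes(value), min_populated_fraction(value))
```

With max_ptes_none=64 this gives 448 populated PTEs, or 87.5% of the
range, which is the "~90%" figure quoted above; max_ptes_none=0 requires
all 512.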