From: David Hildenbrand <david@redhat.com>
To: Johannes Weiner <hannes@cmpxchg.org>,
Usama Arif <usamaarif642@gmail.com>
Cc: Nico Pache <npache@redhat.com>,
linux-mm@kvack.org, linux-kernel@vger.kernel.org,
linux-doc@vger.kernel.org,
Andrew Morton <akpm@linux-foundation.org>,
Matthew Wilcox <willy@infradead.org>,
Barry Song <baohua@kernel.org>,
Ryan Roberts <ryan.roberts@arm.com>,
Baolin Wang <baolin.wang@linux.alibaba.com>,
Lance Yang <ioworker0@gmail.com>, Peter Xu <peterx@redhat.com>,
Rafael Aquini <aquini@redhat.com>,
Andrea Arcangeli <aarcange@redhat.com>,
Jonathan Corbet <corbet@lwn.net>,
"Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>,
Zi Yan <ziy@nvidia.com>
Subject: Re: [RFC 0/2] mm: introduce THP deferred setting
Date: Tue, 27 Aug 2024 13:46:26 +0200
Message-ID: <b73961a2-87ec-45a5-b6fb-83d3505a0f39@redhat.com>
In-Reply-To: <20240827110959.GA438928@cmpxchg.org>

On 27.08.24 13:09, Johannes Weiner wrote:
> On Tue, Aug 27, 2024 at 11:37:14AM +0100, Usama Arif wrote:
>>
>>
>> On 26/08/2024 17:14, Nico Pache wrote:
>>> On Mon, Aug 26, 2024 at 10:47 AM Usama Arif <usamaarif642@gmail.com> wrote:
>>>>
>>>>
>>>>
>>>> On 26/08/2024 11:40, Nico Pache wrote:
>>>>> On Tue, Jul 30, 2024 at 4:37 PM Nico Pache <npache@redhat.com> wrote:
>>>>>>
>>>>>> Hi Zi Yan,
>>>>>> On Mon, Jul 29, 2024 at 7:26 PM Zi Yan <ziy@nvidia.com> wrote:
>>>>>>>
>>>>>>> +Kirill
>>>>>>>
>>>>>>> On 29 Jul 2024, at 18:27, Nico Pache wrote:
>>>>>>>
>>>>>>>> We've seen cases where customers switching from RHEL7 to RHEL8 see a
>>>>>>>> significant increase in the memory footprint for the same workloads.
>>>>>>>>
>>>>>>>> Through our investigations we found that a large contributing factor to
>>>>>>>> the increase in RSS was an increase in THP usage.
>>>>>>>
>>>>>>> Was any knob changed from RHEL7 to RHEL8 to cause more THP usage?
>>>>>> IIRC, most of the system tuning is the same. We attributed the
>>>>>> increase in THP usage to a combination of improvements in the kernel
>>>>>> and improvements in the libraries (better alignments), which allowed
>>>>>> THP allocations to succeed at a higher rate. I can go back and confirm
>>>>>> this tomorrow though.
>>>>>>>
>>>>>>>>
>>>>>>>> For workloads like MySQL, or when using allocators like jemalloc, it is
>>>>>>>> often recommended to set transparent_hugepage/enabled=never. This is
>>>>>>>> in part due to performance degradations and increased memory waste.
>>>>>>>>
>>>>>>>> This series introduces enabled=defer; this setting acts as a middle
>>>>>>>> ground between always and madvise. If the mapping is MADV_HUGEPAGE, the
>>>>>>>> page fault handler will act normally, making a hugepage if possible. If
>>>>>>>> the mapping is not MADV_HUGEPAGE, then the page fault handler will
>>>>>>>> default to the base size allocation. The caveat is that khugepaged can
>>>>>>>> still operate on pages that are not MADV_HUGEPAGE.
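
For illustration, a minimal sketch of the explicit opt-in path described
above: under "defer", only a mapping like the one below would be eligible
for a THP at fault time. The 2 MiB size and the minimal error handling are
illustrative only.

  #include <stdio.h>
  #include <sys/mman.h>

  int main(void)
  {
          size_t len = 2UL << 20; /* one PMD-sized region on x86-64 */

          /* Plain anonymous mapping; under "defer" this would fault in
           * base pages only. */
          void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
          if (buf == MAP_FAILED)
                  return 1;

          /* Explicit opt-in: the fault handler may now install a THP. */
          if (madvise(buf, len, MADV_HUGEPAGE))
                  perror("madvise");

          ((char *)buf)[0] = 1; /* first touch */
          return 0;
  }
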
>>>>>>>
>>>>>>> Why? If the user does not explicitly want huge pages, why bother providing
>>>>>>> them? Wouldn't it increase the memory footprint?
>>>>>>
>>>>>> So we have "always", which will always try to allocate a THP when it
>>>>>> can. This setting gives good performance under a lot of conditions, but
>>>>>> tends to waste memory. Additionally, applications DON'T need to be
>>>>>> modified to take advantage of THPs.
>>>>>>
>>>>>> We have "madvise" which will only satisfy allocations that are
>>>>>> MADV_HUGEPAGE, this gives you granular control, and a lot of times
>>>>>> these madvises come from libraries. Unlike "always" you DO need to
>>>>>> modify your application if you want to use THPs.
>>>>>>
>>>>>> Then we have "never", which, of course, never allocates THPs.
>>>>>>
>>>>>> OK, back to your question: like "madvise", "defer" gives you the
>>>>>> benefits of THPs when you specifically know you want them
>>>>>> (MADV_HUGEPAGE), but it also benefits applications that don't
>>>>>> specifically ask for them (or can't be modified to ask for them), like
>>>>>> "always" does. The applications that don't ask for THPs must wait for
>>>>>> khugepaged to get them (avoiding insertions at page-fault time) -- this
>>>>>> curbs a lot of memory waste and gives increased tunability over
>>>>>> "always". Another added benefit is that khugepaged will most likely not
>>>>>> operate on short-lived allocations, meaning that only long-standing
>>>>>> memory will be collapsed to THPs.
>>>>>>
>>>>>> The memory waste can be tuned with max_ptes_none... let's say you want
>>>>>> ~90% of your PMD range to be populated before collapsing into a huge
>>>>>> page: simply set max_ptes_none=64 (at least 448 of 512 PTEs present,
>>>>>> ~87.5%). For no waste, set max_ptes_none=0, requiring all 512 pages to
>>>>>> be present before collapsing.
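
For reference, a minimal sketch of that tuning from userspace; the "defer"
value only exists with this series applied, the knob paths are the usual
sysfs ones, and this must run as root:

  #include <fcntl.h>
  #include <unistd.h>

  static int write_knob(const char *path, const char *val, size_t len)
  {
          int fd = open(path, O_WRONLY);
          ssize_t ret;

          if (fd < 0)
                  return -1;
          ret = write(fd, val, len);
          close(fd);
          return ret == (ssize_t)len ? 0 : -1;
  }

  int main(void)
  {
          /* Proposed mode from this series. */
          write_knob("/sys/kernel/mm/transparent_hugepage/enabled",
                     "defer", 5);
          /* Collapse only once at least 448 of 512 PTEs (~87.5%) are
           * present. */
          write_knob("/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none",
                     "64", 2);
          return 0;
  }
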
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> This allows for two things... one, applications specifically designed to
>>>>>>>> use hugepages will get them, and two, applications that don't use
>>>>>>>> hugepages can still benefit from them without aggressively inserting
>>>>>>>> THPs at every possible chance. This curbs the memory waste and defers
>>>>>>>> the use of hugepages to khugepaged. Khugepaged can then scan the memory
>>>>>>>> for regions eligible for collapse.
>>>>>>>
>>>>>>> khugepaged would replace application memory with huge pages without a
>>>>>>> specific goal. Why not use a userspace agent with process_madvise() to
>>>>>>> collapse huge pages? An admin might have more knobs to tweak than khugepaged.
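
A minimal sketch of what such an agent could look like, assuming a kernel
with MADV_COLLAPSE (6.1+), which process_madvise() accepts; the choice of
target PID and address range -- the part an admin would actually tune --
is taken from the command line here purely for illustration:

  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/syscall.h>
  #include <sys/types.h>
  #include <sys/uio.h>
  #include <unistd.h>

  #ifndef MADV_COLLAPSE
  #define MADV_COLLAPSE 25 /* uapi value, Linux 6.1+ */
  #endif

  int main(int argc, char **argv)
  {
          if (argc != 3) {
                  fprintf(stderr, "usage: %s <pid> <hex-addr>\n", argv[0]);
                  return 1;
          }

          /* Not all glibc versions wrap these, so use syscall(2). */
          int pidfd = syscall(SYS_pidfd_open, atoi(argv[1]), 0);
          if (pidfd < 0) {
                  perror("pidfd_open");
                  return 1;
          }

          struct iovec iov = {
                  .iov_base = (void *)strtoul(argv[2], NULL, 0),
                  .iov_len  = 2UL << 20, /* one PMD-sized range */
          };

          /* Ask the kernel to collapse that range of the target process
           * into a THP. */
          if (syscall(SYS_process_madvise, pidfd, &iov, 1UL,
                      MADV_COLLAPSE, 0UL) < 0)
                  perror("process_madvise");
          return 0;
  }
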
>>>>>>
>>>>>> The benefits of "always" are that no userspace agent is needed, and
>>>>>> applications don't have to be modified to use madvise(MADV_HUGEPAGE) to
>>>>>> benefit from THPs. This setting hopes to gain some of the same
>>>>>> benefits without the significant waste of memory, while adding
>>>>>> tunability.
>>>>>>
>>>>>> Future changes I have in the works will make khugepaged more
>>>>>> "smart", moving it away from the round-robin fashion it currently
>>>>>> operates in and toward informed decisions about which memory to
>>>>>> collapse (and potentially split).
>>>>>>
>>>>>> Hopefully that helped explain the motivation for this new setting!
>>>>>
>>>>> Any last comments before I resend this?
>>>>>
>>>>> I've been made aware of
>>>>> https://lore.kernel.org/all/20240730125346.1580150-1-usamaarif642@gmail.com/T/#u
>>>>> which introduces THP splitting. These are both trying to achieve the
>>>>> same thing through different means. Our approach leverages khugepaged
>>>>> to promote pages, while Usama's uses the reclaim path to demote
>>>>> hugepages and shrink the underlying memory.
>>>>>
>>>>> I will leave it up to reviewers to determine which is better; however,
>>>>> we can't have both, as we'd be introducing thrashing conditions.
>>>>>
>>>>
>>>> Hi,
>>>>
>>>> Just inserting this here from my cover letter:
>>>>
>>>> Waiting for khugepaged to scan memory and
>>>> collapse pages into THP can be slow and unpredictable in terms of performance
>>> Obviously not part of my patchset here, but I have been testing some
>>> changes to khugepaged to make it more aware of which processes are hot.
>>> Ideally it can then make better choices about what to operate on.
>>>> (i.e. you don't know when the collapse will happen), while production
>>>> environments require predictable performance. If there is enough memory
>>>> available, it's better for both performance and predictability to have
>>>> a THP from fault time, i.e. THP=always, rather than to wait for khugepaged
>>>> to collapse it and deal with sparsely populated THPs when the system is
>>>> running out of memory.
>>>>
>>>> I just went through your patches, and am not sure why we can't have both?
>>> Fair point, we can. I've been playing around with splitting hugepages
>>> via khugepaged and was thinking of the thrashing conditions there --
>>> but your implementation takes a different approach.
>>> I've been working on performance testing my "defer" changes; once I
>>> find the appropriate workloads I'll try adding your changes to the
>>> mix. I have a feeling my approach is better for latency-sensitive
>>> workloads, while yours is better for throughput, but let me find a way
>>> to confirm that.
>>>
>>>
>> Hmm, I am not sure if it's latency vs throughput.
>>
>> There are two things we probably want to consider, short-lived and long-lived mappings, and
>> in each of these situations, having enough memory and running out of memory.
>>
>> For short-lived mappings, I believe reducing page faults is a bigger factor in
>> improving performance. In that case, khugepaged won't have enough time to work,
>> so THP=always will perform better than THP=defer. THP=defer in this case will perform
>> the same as THP=madvise?
>> If there is enough memory, then the changes I introduced in the shrinker won't cost anything,
>> as the shrinker won't run, and the system performance will be the same as THP=always.
>> If there is low memory and the shrinker runs, it will only split THPs that have more zero-filled
>> pages than max_ptes_none, and map the zero-filled pages to the shared zero page, saving memory.
>> There is of course a cost to splitting and running the shrinker, but hopefully it only splits
>> underused THPs.
>>
>> For long-lived mappings, reduced TLB misses would be the bigger factor in improving performance.
>> For the initial run of the application, THP=always will perform better wrt TLB misses, as the
>> page fault handler will give THPs from the start.
>> Later on in the run, the memory might look similar between THP=always with the shrinker and
>> max_ptes_none < HPAGE_PMD_NR vs THP=defer and max_ptes_none < HPAGE_PMD_NR?
>> This is because khugepaged will have collapsed pages that were initially faulted in as base pages.
>> And collapsing has a cost, which would not have been incurred if the THPs were present from fault.
>> If there is low memory, then the shrinker would split memory (which has a cost as well) and the system
>> memory would look similar to or better than THP=defer, as the shrinker would split THPs that initially
>> might not have been underused, but are underused at the time of memory pressure.
>>
>> With THP=always + the underused shrinker, the cost (splitting) is incurred only if and when it's needed,
>> while with THP=defer the cost (higher page faults, higher TLB misses + khugepaged collapse) is incurred all the time,
>> even if the system has plenty of memory available and there is no need to take a performance hit.
>
> I agree with this. The defer mode is an improvement over the upstream
> status quo, no doubt. However, both defer mode and the shrinker solve
> the issue of memory waste under pressure, while the shrinker permits
> more desirable behavior when memory is abundant.
>
> So my take is that the shrinker is the way to go, and I don't see a
> bona fide use case for defer mode that the shrinker couldn't cover.
Page fault latency? IOW, zeroing a complete THP, which might be up to
512 MiB on arm64. This is one of the things people bring up, where
FreeBSD is different because it will zero fragments on demand (but this
also results in more page faults).
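
A crude way to see that cost, as a sketch only (a real comparison needs
CPU pinning, warmup and many iterations): time the first touch of a fresh
mapping under enabled=always vs. enabled=never.

  #include <stdio.h>
  #include <sys/mman.h>
  #include <time.h>

  int main(void)
  {
          size_t len = 2UL << 20; /* PMD size on x86-64; much larger on
                                   * arm64 with 64k base pages */
          void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
          struct timespec a, b;

          if (buf == MAP_FAILED)
                  return 1;

          clock_gettime(CLOCK_MONOTONIC, &a);
          ((volatile char *)buf)[0] = 1; /* may zero a whole THP */
          clock_gettime(CLOCK_MONOTONIC, &b);

          printf("first touch: %ld ns\n",
                 (b.tv_sec - a.tv_sec) * 1000000000L +
                 (b.tv_nsec - a.tv_nsec));
          return 0;
  }
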
On the downside, in the past we could easily and repeatedly fail to
collapse THPs in busy environments. With per-VMA locks this might have
improved in the meantime.
--
Cheers,
David / dhildenb