From: David Hildenbrand <david@redhat.com>
To: Daniel Gomez <da.gomez@samsung.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>,
"akpm@linux-foundation.org" <akpm@linux-foundation.org>,
"hughd@google.com" <hughd@google.com>,
"willy@infradead.org" <willy@infradead.org>,
"wangkefeng.wang@huawei.com" <wangkefeng.wang@huawei.com>,
"ying.huang@intel.com" <ying.huang@intel.com>,
"21cnbao@gmail.com" <21cnbao@gmail.com>,
"ryan.roberts@arm.com" <ryan.roberts@arm.com>,
"shy828301@gmail.com" <shy828301@gmail.com>,
"ziy@nvidia.com" <ziy@nvidia.com>,
"ioworker0@gmail.com" <ioworker0@gmail.com>,
Pankaj Raghav <p.raghav@samsung.com>,
"linux-mm@kvack.org" <linux-mm@kvack.org>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH v3 0/6] add mTHP support for anonymous shmem
Date: Tue, 4 Jun 2024 11:59:51 +0200 [thread overview]
Message-ID: <8e10beae-b3cd-4f5a-8e50-3f6dfe75ba9f@redhat.com> (raw)
In-Reply-To: <zz23hukm6kpguehfsxmzdtam34bj2opt63oesspwsikw57bpqy@eftgscqy3bly>
[...]
>>> How can we use per-block tracking for reclaiming memory and what changes would
>>> be needed? Or is per-block really a non-viable option?
>>
>> The interesting thing is: with mTHP toggles it is opt-in -- like for
>> PMD-sized THP with shmem_enabled -- and we don't have to be that concerned
>> about this problem right now.
>
> Without respecting the size when allocating large folios, mTHP toggles would
> over allocate. My proposal added earlier to this thread is to combine the 2
> to avoid that case. Otherwise, shouldn't we try to find a solution for the
> reclaiming path?
I think at some point we'll really have to do a better job at reclaiming
(either memory-overallocation, PUNCHHOLE that couldn't split, but maybe
also pages that are now all-zero again and could be reclaimed again).
>
>>
>>>
>>> Clearly, if per-block is viable option, shmem_fault() bug would required to be
>>> fixed first. Any ideas on how to make it reproducible?
>>>
>>> The alternatives discussed where sub-page refcounting and zeropage scanning.
>>
>> Yeah, I don't think sub-page refcounting is a feasible (and certainly not
>> desired ;) ) option in the folio world.
>>
>>> The first one is not possible (IIUC) because of a refactor years back that
>>> simplified the code and also requires extra complexity. The second option would
>>> require additional overhead as we would involve scanning.
>>
>> We'll likely need something similar (scanning, tracking?) for anonymous
>> memory as well. There was a proposal for a THP shrinker some time ago, that
>> would solve part of the problem.
>
> It's good to know we have the same problem in different places. I'm more
> inclined to keep the information rather than adding an extra overhead. Unless
> the complexity is really overwhelming. Considering the concerns here, not sure
> how much should we try merging with iomap as per Ritesh's suggestion.
As raised in the meeting, I do see value in maintaining the information;
but I also see why Hugh and Kirill think this might be unwarranted
complexity in shmem.c. Meaning: we might be able to achieve something
similar without it, and we don't have to solve the problem right now to
benefit from large folios.
>
> The THP shrinker, could you please confirm if it is this following thread?
>
> https://lore.kernel.org/all/cover.1667454613.git.alexlzhu@fb.com/
Yes, although there was no follow up so far. Possibly, because in the
current khugepaged approach, there will be a constant back-and-forth
between khugepaged collapsing memory (and wasting memory in the default
setting), and the THP shrinker reclaiming memory; doesn't sound quite
effective :) That needs more thought IMHO.
[...]
>>>> To optimize FALLOC_FL_PUNCH_HOLE for the cases where splitting+freeing
>>>> is not possible at fallcoate() time, detecting zeropages later and
>>>> retrying to split+free might be an option, without per-block tracking.
>>>
>>>>
>>>> (2) mTHP controls
>>>>
>>>> As a default, we should not be using large folios / mTHP for any shmem, just
>>>> like we did with THP via shmem_enabled. This is what this series currently
>>>> does, and is aprt of the whole mTHP user-space interface design.
>>>
>>> That was clear for me too. But what is the reason we want to boot in 'safe
>>> mode'? What are the implications of not respecing that?
>>
>> [...]
>>
>>>
>>> As I understood from the call, mTHP with sysctl knobs is preferred over
>>> optimistic falloc/write allocation? But is still unclear to me why the former
>>> is preferred.
>>
>> I think Hugh's point was that this should be an opt-in feature, just like
>> PMD-sized THP started out, and still is, an opt-in feature.
>
> I'd be keen to understand the use case for this. Even the current THP controls
> we have in tmpfs. I guess these are just scenarios with no swap involved?
> Are these use cases the same for both tmpfs and shmem anon mm?
Systems without swap are one case I think. The other reason for a toggle
in the past was to be able to disable it to troubleshoot issues (Hugh
mentioned in the meeting about unlocking new code paths in shmem.c only
with a toggle).
Staring at my Fedora system:
$ cat /sys/kernel/mm/transparent_hugepage/shmem_enabled
always within_size advise [never] deny force
Maybe because it uses tmpfs to mount /tmp (interesting article on
lwn.net about that) and people are concerned about the side-effects
(that can currently waste memory, or result in more reclaim work being
required when exceeding file sizes).
For VMs, I know that shmem+THP with memory ballooning is problematic and
not really recommended.
[...]
>>>
>>>>
>>>> Also, we should properly fallback within the configured sizes, and not jump
>>>> "over" configured sizes. Unless there is a good reason.
>>>>
>>>> (3) khugepaged
>>>>
>>>> khugepaged needs to handle larger folios properly as well. Until fixed,
>>>> using smaller THP sizes as fallback might prohibit collapsing a PMD-sized
>>>> THP later. But really, khugepaged needs to be fixed to handle that.
>>>>
>>>> (4) force/disable
>>>>
>>>> These settings are rather testing artifacts from the old ages. We should not
>>>> add them to the per-size toggles. We might "inherit" it from the global one,
>>>> though.
>>>>
>>>> "within_size" might have value, and especially for consistency, we should
>>>> have them per size.
>>>>
>>>>
>>>>
>>>> So, this series only tackles anonymous shmem, which is a good starting
>>>> point. Ideally, we'd get support for other shmem (especially during fault
>>>> time) soon afterwards, because we won't be adding separate toggles for that
>>>> from the interface POV, and having inconsistent behavior between kernel
>>>> versions would be a bit unfortunate.
>>>>
>>>>
>>>> @Baolin, this series likely does not consider (4) yet. And I suggest we have
>>>> to take a lot of the "anonymous thp" terminology out of this series,
>>>> especially when it comes to documentation.
>>>>
>>>> @Daniel, Pankaj, what are your plans regarding that? It would be great if we
>>>> could get an understanding on the next steps on !anon shmem.
>>>
>>> I realize I've raised so many questions, but it's essential for us to grasp the
>>> mm concerns and usage scenarios. This understanding will provide clarity on the
>>> direction regarding folios for !anon shmem.
>>
>> If I understood correctly, Hugh had strong feelings against not respecting
>> mTHP toggles for shmem. Without per-block tracking, I agree with that.
>
> My understanding was the same. But I have this follow-up question: should
> we respect mTHP toggles without considering mapping constraints (size and
> index)? Or perhaps we should use within_size where we can fit this intermediate
> approach, as long as mTHP granularity is respected?
Likely we should do exactly the same as we would do with the existing
PMD-sized THPs.
I recall in the meeting that we discussed that always vs. within_size
seems to have some value, and that we should respect that setting like
we did for PMD-sized THP.
Which other mapping constraints could we have?
--
Cheers,
David / dhildenb
next prev parent reply other threads:[~2024-06-04 10:00 UTC|newest]
Thread overview: 38+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-05-30 2:04 Baolin Wang
2024-05-30 2:04 ` [PATCH v3 1/6] mm: memory: extend finish_fault() to support large folio Baolin Wang
2024-06-03 4:44 ` Lance Yang
2024-06-03 8:04 ` Baolin Wang
2024-06-03 5:28 ` Barry Song
2024-06-03 8:29 ` Baolin Wang
2024-06-03 8:58 ` Barry Song
2024-06-03 9:01 ` Barry Song
2024-06-03 9:37 ` Baolin Wang
2024-05-30 2:04 ` [PATCH v3 2/6] mm: shmem: add THP validation for PMD-mapped THP related statistics Baolin Wang
2024-05-30 2:04 ` [PATCH v3 3/6] mm: shmem: add multi-size THP sysfs interface for anonymous shmem Baolin Wang
2024-06-01 3:29 ` wang wei
2024-06-02 4:36 ` [PATCH " Baolin Wang
2024-05-30 2:04 ` [PATCH v3 4/6] mm: shmem: add mTHP support " Baolin Wang
2024-05-30 6:36 ` kernel test robot
2024-06-02 4:16 ` Baolin Wang
2024-06-04 9:23 ` Dan Carpenter
2024-06-04 9:46 ` Baolin Wang
2024-05-30 2:04 ` [PATCH v3 5/6] mm: shmem: add mTHP size alignment in shmem_get_unmapped_area Baolin Wang
2024-05-30 2:04 ` [PATCH v3 6/6] mm: shmem: add mTHP counters for anonymous shmem Baolin Wang
2024-05-31 9:35 ` [PATCH v3 0/6] add mTHP support " David Hildenbrand
2024-05-31 10:13 ` Baolin Wang
2024-05-31 11:13 ` David Hildenbrand
2024-06-02 4:15 ` Baolin Wang
2024-06-04 8:18 ` Daniel Gomez
2024-06-04 9:45 ` Baolin Wang
2024-06-04 12:05 ` Daniel Gomez
2024-06-06 3:31 ` Baolin Wang
2024-06-06 8:38 ` David Hildenbrand
2024-06-06 9:31 ` Baolin Wang
2024-06-07 9:05 ` Daniel Gomez
2024-06-07 10:39 ` David Hildenbrand
2024-06-01 3:54 ` wang wei
2024-05-31 13:19 ` Daniel Gomez
2024-05-31 14:43 ` David Hildenbrand
2024-06-04 9:29 ` Daniel Gomez
2024-06-04 9:59 ` David Hildenbrand [this message]
2024-06-04 12:30 ` Daniel Gomez
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=8e10beae-b3cd-4f5a-8e50-3f6dfe75ba9f@redhat.com \
--to=david@redhat.com \
--cc=21cnbao@gmail.com \
--cc=akpm@linux-foundation.org \
--cc=baolin.wang@linux.alibaba.com \
--cc=da.gomez@samsung.com \
--cc=hughd@google.com \
--cc=ioworker0@gmail.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=p.raghav@samsung.com \
--cc=ryan.roberts@arm.com \
--cc=shy828301@gmail.com \
--cc=wangkefeng.wang@huawei.com \
--cc=willy@infradead.org \
--cc=ying.huang@intel.com \
--cc=ziy@nvidia.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox