From: "Huang, Ying" <ying.huang@intel.com>
To: Barry Song <21cnbao@gmail.com>
Cc: Usama Arif <usamaarif642@gmail.com>,
Johannes Weiner <hannes@cmpxchg.org>,
Yosry Ahmed <yosryahmed@google.com>,
akpm@linux-foundation.org, linux-mm@kvack.org,
linux-kernel@vger.kernel.org, Barry Song <v-songbaohua@oppo.com>,
Kanchana P Sridhar <kanchana.p.sridhar@intel.com>,
David Hildenbrand <david@redhat.com>,
Baolin Wang <baolin.wang@linux.alibaba.com>,
Chris Li <chrisl@kernel.org>, Kairui Song <kasong@tencent.com>,
Ryan Roberts <ryan.roberts@arm.com>,
Michal Hocko <mhocko@kernel.org>,
Roman Gushchin <roman.gushchin@linux.dev>,
Shakeel Butt <shakeel.butt@linux.dev>,
Muchun Song <muchun.song@linux.dev>
Subject: Re: [PATCH RFC] mm: mitigate large folios usage and swap thrashing for nearly full memcg
Date: Tue, 05 Nov 2024 09:41:18 +0800 [thread overview]
Message-ID: <87ldxy78zl.fsf@yhuang6-desk2.ccr.corp.intel.com> (raw)
In-Reply-To: <CAGsJ_4z-zFESVpK2hDSs3EwHa2Ra3fYJFeQwH74LMHw3wVmB0g@mail.gmail.com> (Barry Song's message of "Tue, 5 Nov 2024 14:13:35 +1300")
Barry Song <21cnbao@gmail.com> writes:
> On Tue, Nov 5, 2024 at 2:01 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Usama Arif <usamaarif642@gmail.com> writes:
>>
>> > On 04/11/2024 06:42, Huang, Ying wrote:
>> >> Johannes Weiner <hannes@cmpxchg.org> writes:
>> >>
>> >>> On Wed, Oct 30, 2024 at 02:18:09PM -0700, Yosry Ahmed wrote:
>> >>>> On Wed, Oct 30, 2024 at 2:13 PM Usama Arif <usamaarif642@gmail.com> wrote:
>> >>>>> On 30/10/2024 21:01, Yosry Ahmed wrote:
>> >>>>>> On Wed, Oct 30, 2024 at 1:25 PM Usama Arif <usamaarif642@gmail.com> wrote:
>> >>>>>>>>> I am not sure that the approach we are trying in this patch is the right way:
>> >>>>>>>>> - This patch makes it a memcg issue, but you could have memcg disabled and
>> >>>>>>>>> then the mitigation being tried here wont apply.
>> >>>>>>>>
>> >>>>>>>> Is the problem reproducible without memcg? I imagine only if the
>> >>>>>>>> entire system is under memory pressure. I guess we would want the same
>> >>>>>>>> "mitigation" either way.
>> >>>>>>>>
>> >>>>>>> What would be a good open source benchmark/workload to test without limiting memory
>> >>>>>>> in memcg?
>> >>>>>>> For the kernel build test, I can only get zswap activity to happen if I build
>> >>>>>>> in cgroup and limit memory.max.
>> >>>>>>
>> >>>>>> You mean a benchmark that puts the entire system under memory
>> >>>>>> pressure? I am not sure, it ultimately depends on the size of memory
>> >>>>>> you have, among other factors.
>> >>>>>>
>> >>>>>> What if you run the kernel build test in a VM? Then you can limit is
>> >>>>>> size like a memcg, although you'd probably need to leave more room
>> >>>>>> because the entire guest OS will also subject to the same limit.
>> >>>>>>
>> >>>>>
>> >>>>> I had tried this, but the variance in time/zswap numbers was very high.
>> >>>>> Much higher than the AMD numbers I posted in reply to Barry. So found
>> >>>>> it very difficult to make comparison.
>> >>>>
>> >>>> Hmm yeah maybe more factors come into play with global memory
>> >>>> pressure. I am honestly not sure how to test this scenario, and I
>> >>>> suspect variance will be high anyway.
>> >>>>
>> >>>> We can just try to use whatever technique we use for the memcg limit
>> >>>> though, if possible, right?
>> >>>
>> >>> You can boot a physical machine with mem=1G on the commandline, which
>> >>> restricts the physical range of memory that will be initialized.
>> >>> Double check /proc/meminfo after boot, because part of that physical
>> >>> range might not be usable RAM.
>> >>>
>> >>> I do this quite often to test physical memory pressure with workloads
>> >>> that don't scale up easily, like kernel builds.
>> >>>
>> >>>>>>>>> - Instead of this being a large folio swapin issue, is it more of a readahead
>> >>>>>>>>> issue? If we zswap (without the large folio swapin series) and change the window
>> >>>>>>>>> to 1 in swap_vma_readahead, we might see an improvement in linux kernel build time
>> >>>>>>>>> when cgroup memory is limited as readahead would probably cause swap thrashing as
>> >>>>>>>>> well.
>> >>>
>> >>> +1
>> >>>
>> >>> I also think there is too much focus on cgroup alone. The bigger issue
>> >>> seems to be how much optimistic volume we swap in when we're under
>> >>> pressure already. This applies to large folios and readahead; global
>> >>> memory availability and cgroup limits.
>> >>
>> >> The current swap readahead logic is something like,
>> >>
>> >> 1. try readahead some pages for sequential access pattern, mark them as
>> >> readahead
>> >>
>> >> 2. if these readahead pages get accessed before swapped out again,
>> >> increase 'hits' counter
>> >>
>> >> 3. for next swap in, try readahead 'hits' pages and clear 'hits'.
>> >>
>> >> So, if there's heavy memory pressure, the readaheaded pages will not be
>> >> accessed before being swapped out again (in 2 above), the readahead
>> >> pages will be minimal.
>> >>
>> >> IMHO, mTHP swap-in is kind of swap readahead in effect. That is, in
>> >> addition to the pages accessed are swapped in, the adjacent pages are
>> >> swapped in (swap readahead) too. If these readahead pages are not
>> >> accessed before swapped out again, system runs into more severe
>> >> thrashing. This is because we lack the swap readahead window scaling
>> >> mechanism as above. And, this is why I suggested to combine the swap
>> >> readahead mechanism and mTHP swap-in by default before. That is, when
>> >> kernel swaps in a page, it checks current swap readahead window, and
>> >> decides mTHP order according to window size. So, if there are heavy
>> >> memory pressure, so that the nearby pages will not be accessed before
>> >> being swapped out again, the mTHP swap-in order can be adjusted
>> >> automatically.
>> >
>> > This is a good idea to do, but I think the issue is that readahead
>> > is a folio flag and not a page flag, so only works when folio size is 1.
>> >
>> > In the swapin_readahead swapcache path, the current implementation decides
>> > the ra_window based on hits, which is incremented in swap_cache_get_folio
>> > if it has not been gotten from swapcache before.
>> > The problem would be that we need information on how many distinct pages in
>> > a large folio that has been swapped in have been accessed to decide the
>> > hits/window size, which I don't think is possible. As once the entire large
>> > folio has been swapped in, we won't get a fault.
>> >
>>
>> To do that, we need to move readahead flag to per-page from per-folio.
>> And we need to map only the accessed page of the folio in page fault
>> handler. This may impact performance. So, we may only do that for
>> sampled folios only, for example, every 100 folios.
>
> I'm not entirely sure there's a chance to gain traction on this, as the current
> trend clearly leans toward moving flags from page to folio, not from folio to
> page :-)
This may be a problem. However, I think we can try to find a solution
for this. Anyway, we need some way to track per-page status in a folio,
regardless how to implement it.
>>
>> >>
>> >>> It happens to manifest with THP in cgroups because that's what you
>> >>> guys are testing. But IMO, any solution to this problem should
>> >>> consider the wider scope.
>> >>>
>> >>>>>>>> I think large folio swapin would make the problem worse anyway. I am
>> >>>>>>>> also not sure if the readahead window adjusts on memory pressure or
>> >>>>>>>> not.
>> >>>>>>>>
>> >>>>>>> readahead window doesnt look at memory pressure. So maybe the same thing is being
>> >>>>>>> seen here as there would be in swapin_readahead?
>> >>>>>>
>> >>>>>> Maybe readahead is not as aggressive in general as large folio
>> >>>>>> swapins? Looking at swap_vma_ra_win(), it seems like the maximum order
>> >>>>>> of the window is the smaller of page_cluster (2 or 3) and
>> >>>>>> SWAP_RA_ORDER_CEILING (5).
>> >>>>> Yes, I was seeing 8 pages swapin (order 3) when testing. So might
>> >>>>> be similar to enabling 32K mTHP?
>> >>>>
>> >>>> Not quite.
>> >>>
>> >>> Actually, I would expect it to be...
>> >>
>> >> Me too.
>> >>
>> >>>>>> Also readahead will swapin 4k folios AFAICT, so we don't need a
>> >>>>>> contiguous allocation like large folio swapin. So that could be
>> >>>>>> another factor why readahead may not reproduce the problem.
>> >>>>
>> >>>> Because of this ^.
>> >>>
>> >>> ...this matters for the physical allocation, which might require more
>> >>> reclaim and compaction to produce the 32k. But an earlier version of
>> >>> Barry's patch did the cgroup margin fallback after the THP was already
>> >>> physically allocated, and it still helped.
>> >>>
>> >>> So the issue in this test scenario seems to be mostly about cgroup
>> >>> volume. And then 8 4k charges should be equivalent to a singular 32k
>> >>> charge when it comes to cgroup pressure.
--
Best Regards,
Huang, Ying
next prev parent reply other threads:[~2024-11-05 1:45 UTC|newest]
Thread overview: 28+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-10-27 0:14 Barry Song
2024-10-28 12:07 ` Usama Arif
2024-10-28 22:03 ` Barry Song
2024-10-30 14:51 ` Usama Arif
2024-10-30 19:51 ` Yosry Ahmed
2024-10-30 20:25 ` Usama Arif
2024-10-30 21:01 ` Yosry Ahmed
2024-10-30 21:13 ` Usama Arif
2024-10-30 21:18 ` Yosry Ahmed
2024-10-31 15:38 ` Johannes Weiner
2024-10-31 15:59 ` Yosry Ahmed
2024-10-31 20:59 ` Barry Song
2024-11-01 16:19 ` Yosry Ahmed
2024-11-04 6:42 ` Huang, Ying
2024-11-04 8:06 ` Barry Song
2024-11-04 13:09 ` Huang, Ying
2024-11-04 12:13 ` Usama Arif
2024-11-05 0:57 ` Huang, Ying
2024-11-05 1:13 ` Barry Song
2024-11-05 1:41 ` Huang, Ying [this message]
2024-10-30 20:27 ` Barry Song
2024-10-30 20:41 ` Usama Arif
2024-10-30 20:48 ` Barry Song
2024-10-30 21:00 ` Usama Arif
2024-10-30 21:08 ` Barry Song
2024-10-30 21:10 ` Yosry Ahmed
2024-10-30 21:21 ` Barry Song
2024-10-30 21:31 ` Yosry Ahmed
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=87ldxy78zl.fsf@yhuang6-desk2.ccr.corp.intel.com \
--to=ying.huang@intel.com \
--cc=21cnbao@gmail.com \
--cc=akpm@linux-foundation.org \
--cc=baolin.wang@linux.alibaba.com \
--cc=chrisl@kernel.org \
--cc=david@redhat.com \
--cc=hannes@cmpxchg.org \
--cc=kanchana.p.sridhar@intel.com \
--cc=kasong@tencent.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=mhocko@kernel.org \
--cc=muchun.song@linux.dev \
--cc=roman.gushchin@linux.dev \
--cc=ryan.roberts@arm.com \
--cc=shakeel.butt@linux.dev \
--cc=usamaarif642@gmail.com \
--cc=v-songbaohua@oppo.com \
--cc=yosryahmed@google.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox