Re: [PATCH RFC] mm: mitigate large folios usage and swap thrashing for nearly full memcg

linux-mm.kvack.org archive mirror

From: Usama Arif <usamaarif642@gmail.com>
To: "Huang, Ying" <ying.huang@intel.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Barry Song <21cnbao@gmail.com>
Cc: Yosry Ahmed <yosryahmed@google.com>,
	akpm@linux-foundation.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, Barry Song <v-songbaohua@oppo.com>,
	Kanchana P Sridhar <kanchana.p.sridhar@intel.com>,
	David Hildenbrand <david@redhat.com>,
	Baolin Wang <baolin.wang@linux.alibaba.com>,
	Chris Li <chrisl@kernel.org>, Kairui Song <kasong@tencent.com>,
	Ryan Roberts <ryan.roberts@arm.com>,
	Michal Hocko <mhocko@kernel.org>,
	Roman Gushchin <roman.gushchin@linux.dev>,
	Shakeel Butt <shakeel.butt@linux.dev>,
	Muchun Song <muchun.song@linux.dev>
Subject: Re: [PATCH RFC] mm: mitigate large folios usage and swap thrashing for nearly full memcg
Date: Mon, 4 Nov 2024 12:13:22 +0000
Message-ID: <3f684183-c6df-4f2f-9e33-91ce43c791eb@gmail.com>
In-Reply-To: <87a5ef8ppq.fsf@yhuang6-desk2.ccr.corp.intel.com>



On 04/11/2024 06:42, Huang, Ying wrote:
> Johannes Weiner <hannes@cmpxchg.org> writes:
> 
>> On Wed, Oct 30, 2024 at 02:18:09PM -0700, Yosry Ahmed wrote:
>>> On Wed, Oct 30, 2024 at 2:13 PM Usama Arif <usamaarif642@gmail.com> wrote:
>>>> On 30/10/2024 21:01, Yosry Ahmed wrote:
>>>>> On Wed, Oct 30, 2024 at 1:25 PM Usama Arif <usamaarif642@gmail.com> wrote:
>>>>>>>> I am not sure that the approach we are trying in this patch is the right way:
>>>>>>>> - This patch makes it a memcg issue, but you could have memcg disabled and
>>>>>>>> then the mitigation being tried here won't apply.
>>>>>>>
>>>>>>> Is the problem reproducible without memcg? I imagine only if the
>>>>>>> entire system is under memory pressure. I guess we would want the same
>>>>>>> "mitigation" either way.
>>>>>>>
>>>>>> What would be a good open source benchmark/workload to test without limiting memory
>>>>>> in memcg?
>>>>>> For the kernel build test, I can only get zswap activity to happen if I build
>>>>>> in cgroup and limit memory.max.
>>>>>
>>>>> You mean a benchmark that puts the entire system under memory
>>>>> pressure? I am not sure, it ultimately depends on the size of memory
>>>>> you have, among other factors.
>>>>>
>>>>> What if you run the kernel build test in a VM? Then you can limit its
>>>>> size like a memcg, although you'd probably need to leave more room
>>>>> because the entire guest OS will also be subject to the same limit.
>>>>>
>>>>
>>>> I had tried this, but the variance in time/zswap numbers was very high.
>>>> Much higher than the AMD numbers I posted in reply to Barry. So I found
>>>> it very difficult to make a comparison.
>>>
>>> Hmm yeah maybe more factors come into play with global memory
>>> pressure. I am honestly not sure how to test this scenario, and I
>>> suspect variance will be high anyway.
>>>
>>> We can just try to use whatever technique we use for the memcg limit
>>> though, if possible, right?
>>
>> You can boot a physical machine with mem=1G on the commandline, which
>> restricts the physical range of memory that will be initialized.
>> Double check /proc/meminfo after boot, because part of that physical
>> range might not be usable RAM.
>>
>> I do this quite often to test physical memory pressure with workloads
>> that don't scale up easily, like kernel builds.
>>
>>>>>>>> - Instead of this being a large folio swapin issue, is it more of a readahead
>>>>>>>> issue? If we zswap (without the large folio swapin series) and change the window
>>>>>>>> to 1 in swap_vma_readahead, we might see an improvement in linux kernel build time
>>>>>>>> when cgroup memory is limited as readahead would probably cause swap thrashing as
>>>>>>>> well.
>>
>> +1
>>
>> I also think there is too much focus on cgroup alone. The bigger issue
>> seems to be how much optimistic volume we swap in when we're under
>> pressure already. This applies to large folios and readahead; global
>> memory availability and cgroup limits.
> 
> The current swap readahead logic is something like,
> 
> 1. try readahead some pages for sequential access pattern, mark them as
>    readahead
> 
> 2. if these readahead pages get accessed before swapped out again,
>    increase 'hits' counter
> 
> 3. for next swap in, try readahead 'hits' pages and clear 'hits'.
> 
> So, if there's heavy memory pressure, the readaheaded pages will not be
> accessed before being swapped out again (in 2 above), the readahead
> pages will be minimal.
> 
> IMHO, mTHP swap-in is kind of swap readahead in effect.  That is, in
> addition to the pages accessed are swapped in, the adjacent pages are
> swapped in (swap readahead) too.  If these readahead pages are not
> accessed before swapped out again, system runs into more severe
> thrashing.  This is because we lack the swap readahead window scaling
> mechanism as above.  And, this is why I suggested to combine the swap
> readahead mechanism and mTHP swap-in by default before.  That is, when
> kernel swaps in a page, it checks current swap readahead window, and
> decides mTHP order according to window size.  So, if there is heavy
> memory pressure, so that the nearby pages will not be accessed before
> being swapped out again, the mTHP swap-in order can be adjusted
> automatically.

This is a good idea, but I think the issue is that readahead is a
folio flag and not a page flag, so it only works when the folio is a
single page (order 0).

In the swapin_readahead swapcache path, the current implementation decides
the ra_window based on hits, which is incremented in swap_cache_get_folio()
if the folio has not been fetched from the swapcache before.
The problem is that, to decide the hits/window size, we would need to know
how many distinct pages in a swapped-in large folio have been accessed,
which I don't think is possible: once the entire large folio has been
swapped in, we won't get any further faults on it.
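
A toy model of the point above, not kernel code. It assumes a window
that grows with the hit count and shrinks by at most half per step,
loosely following the three steps quoted earlier; the function names,
the rounding and the max_window=8 default are all made up for
illustration. Page 0 is the faulting page, pages 1..window-1 are the
speculative ones.

```python
def next_window(hits, prev_window, max_window=8):
    """Toy window scaling: grow with hits, shrink by at most half per step."""
    pages = hits + 2
    if pages == 2:
        # No hits observed: read only the faulting page.
        pages = 1
    else:
        # Round up to a power of two (assumed rounding rule).
        rounded = 4
        while rounded < pages:
            rounded *= 2
        pages = rounded
    pages = min(pages, max_window)
    # Do not shrink faster than half of the previous window.
    return max(pages, prev_window // 2)


def hits_small_folios(touched, window):
    """Order-0 readahead: each speculative page is its own folio with its
    own readahead flag, so the first access to each one can be counted."""
    return sum(1 for page in range(1, window) if page in touched)


def hits_large_folio(touched, window):
    """One large folio: a single flag covers every page and nothing faults
    after swap-in, so at best we learn "touched or not" (optimistic case)."""
    return 1 if any(page in touched for page in range(1, window)) else 0


touched = {1, 2, 3, 4, 5}  # 5 of the 7 speculative pages turned out useful
print(next_window(hits_small_folios(touched, 8), prev_window=8))
print(next_window(hits_large_folio(touched, 8), prev_window=8))
```

In this model the order-0 case keeps the window at 8, while the large
folio case drops it to 4 even though most of the speculative pages were
used, because the per-page signal is lost.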


> 
>> It happens to manifest with THP in cgroups because that's what you
>> guys are testing. But IMO, any solution to this problem should
>> consider the wider scope.
>>
>>>>>>> I think large folio swapin would make the problem worse anyway. I am
>>>>>>> also not sure if the readahead window adjusts on memory pressure or
>>>>>>> not.
>>>>>>>
>>>>>> readahead window doesn't look at memory pressure. So maybe the same thing is being
>>>>>> seen here as there would be in swapin_readahead?
>>>>>
>>>>> Maybe readahead is not as aggressive in general as large folio
>>>>> swapins? Looking at swap_vma_ra_win(), it seems like the maximum order
>>>>> of the window is the smaller of page_cluster (2 or 3) and
>>>>> SWAP_RA_ORDER_CEILING (5).
>>>> Yes, I was seeing 8 pages swapin (order 3) when testing. So might
>>>> be similar to enabling 32K mTHP?
>>>
>>> Not quite.
>>
>> Actually, I would expect it to be...
> 
> Me too.
> 
>>>>> Also readahead will swapin 4k folios AFAICT, so we don't need a
>>>>> contiguous allocation like large folio swapin. So that could be
>>>>> another factor why readahead may not reproduce the problem.
>>>
>>> Because of this ^.
>>
>> ...this matters for the physical allocation, which might require more
>> reclaim and compaction to produce the 32k. But an earlier version of
>> Barry's patch did the cgroup margin fallback after the THP was already
>> physically allocated, and it still helped.
>>
>> So the issue in this test scenario seems to be mostly about cgroup
>> volume. And then 8 4k charges should be equivalent to a singular 32k
>> charge when it comes to cgroup pressure.
> 
> --
> Best Regards,
> Huang, Ying
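
For reference, the size arithmetic from the quoted discussion (window
order capped by page_cluster and SWAP_RA_ORDER_CEILING, and eight 4k
charges versus one 32k charge) can be worked through as below. This is
only a sketch assuming 4 KiB base pages; the helper names are
hypothetical, and SWAP_RA_ORDER_CEILING = 5 is simply the value quoted
above.

```python
PAGE_SIZE_KIB = 4          # assumption: 4 KiB base pages, as in the thread
SWAP_RA_ORDER_CEILING = 5  # value quoted in the thread


def max_readahead_pages(page_cluster):
    """Largest readahead window in pages: the order is the smaller of
    page_cluster and SWAP_RA_ORDER_CEILING."""
    order = min(page_cluster, SWAP_RA_ORDER_CEILING)
    return 1 << order


def charge_kib(nr_folios, folio_order):
    """Total memory charged, in KiB, for nr_folios folios of a given order."""
    return nr_folios * (1 << folio_order) * PAGE_SIZE_KIB


# page_cluster=3 gives an 8-page window, the same volume as one order-3 folio.
print(max_readahead_pages(3))
print(charge_kib(8, 0), charge_kib(1, 3))
```

With page_cluster=3 the window is 8 pages, and eight order-0 charges add
up to the same 32 KiB as a single order-3 charge, which is the
equivalence Johannes describes for cgroup pressure. The physical
allocation cost is a separate matter and is not modelled here.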
