Re: [PATCH RFC] mm: mitigate large folios usage and swap thrashing for nearly full memcg

From: Usama Arif <usamaarif642@gmail.com>
To: Yosry Ahmed <yosryahmed@google.com>
Cc: Barry Song <21cnbao@gmail.com>,
	akpm@linux-foundation.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, Barry Song <v-songbaohua@oppo.com>,
	Kanchana P Sridhar <kanchana.p.sridhar@intel.com>,
	David Hildenbrand <david@redhat.com>,
	Baolin Wang <baolin.wang@linux.alibaba.com>,
	Chris Li <chrisl@kernel.org>,
	"Huang, Ying" <ying.huang@intel.com>,
	Kairui Song <kasong@tencent.com>,
	Ryan Roberts <ryan.roberts@arm.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Michal Hocko <mhocko@kernel.org>,
	Roman Gushchin <roman.gushchin@linux.dev>,
	Shakeel Butt <shakeel.butt@linux.dev>,
	Muchun Song <muchun.song@linux.dev>
Subject: Re: [PATCH RFC] mm: mitigate large folios usage and swap thrashing for nearly full memcg
Date: Wed, 30 Oct 2024 21:13:16 +0000
Message-ID: <c76635d7-f382-433a-8900-72bca644cdaa@gmail.com>
In-Reply-To: <CAJD7tkaXL_vMsgYET9yjYQW5pM2c60fD_7r_z4vkMPcqferS8A@mail.gmail.com>



On 30/10/2024 21:01, Yosry Ahmed wrote:
> On Wed, Oct 30, 2024 at 1:25 PM Usama Arif <usamaarif642@gmail.com> wrote:
>>
>>
>>
>> On 30/10/2024 19:51, Yosry Ahmed wrote:
>>> [..]
>>>>> My second point about the mitigation is as follows: For a system (or
>>>>> memcg) under severe memory pressure, especially one without hardware TLB
>>>>> optimization, is enabling mTHP always the right choice? Since mTHP operates at
>>>>> a larger granularity, some internal fragmentation is unavoidable, regardless
>>>>> of optimization. Could the mitigation code help in automatically tuning
>>>>> this fragmentation?
>>>>>
>>>>
>>>> I agree with the point that enabling mTHP always is not the right thing to do
>>>> on all platforms. I also think it might be the case that enabling mTHP
>>>> might be a good thing for some workloads, but enabling mTHP swapin along with
>>>> it might not.
>>>>
>>>> As you said, when you have apps switching between foreground and background
>>>> on Android, it probably makes sense to have large folio swapping, as you
>>>> want to bring in all the pages of the background app as quickly as possible.
>>>> You also get all the TLB optimizations and the smaller LRU overhead once
>>>> you have brought in all the pages.
>>>> The Linux kernel build test doesn't really benefit from the TLB optimizations
>>>> and smaller LRU overhead, probably because the pages are very short-lived. So I
>>>> think it doesn't show the benefit of large folio swapin properly, and
>>>> large folio swapin should probably be disabled for this kind of workload,
>>>> even though mTHP should be enabled.
>>>>
>>>> I am not sure that the approach we are trying in this patch is the right way:
>>>> - This patch makes it a memcg issue, but you could have memcg disabled, and
>>>> then the mitigation being tried here won't apply.
>>>
>>> Is the problem reproducible without memcg? I imagine only if the
>>> entire system is under memory pressure. I guess we would want the same
>>> "mitigation" either way.
>>>
>> What would be a good open source benchmark/workload to test without limiting memory
>> in memcg?
>> For the kernel build test, I can only get zswap activity to happen if I build
>> in cgroup and limit memory.max.
> 
> You mean a benchmark that puts the entire system under memory
> pressure? I am not sure, it ultimately depends on the size of memory
> you have, among other factors.
> 
> What if you run the kernel build test in a VM? Then you can limit its
> size like a memcg, although you'd probably need to leave more room
> because the entire guest OS will also be subject to the same limit.
> 

I tried this, but the variance in the time and zswap numbers was very high,
much higher than in the AMD numbers I posted in reply to Barry, so I found
it very difficult to make a comparison.

>>
>> I can just run zswap large folio zswapin in production and see, but that will take me a few
>> days. tbh, running in prod is a much better test, and if there isn't any sort of thrashing,
>> then maybe it's not really an issue? I believe Barry doesn't see an issue on Android
>> phones (but please correct me if I am wrong), and if there isn't an issue in Meta
>> production either, it's a good data point for servers as well. And maybe
>> kernel build in a 4G memcg is not a good test.
> 
> If there is a regression in the kernel build, this means some
> workloads may be affected, even if Meta's prod isn't. I understand
> that the benchmark is not very representative of real world workloads,
> but in this instance I think the thrashing problem surfaced by the
> benchmark is real.
> 
>>
>>>> - Instead of this being a large folio swapin issue, is it more of a readahead
>>>> issue? If we zswap (without the large folio swapin series) and change the window
>>>> to 1 in swap_vma_readahead, we might see an improvement in linux kernel build time
>>>> when cgroup memory is limited as readahead would probably cause swap thrashing as
>>>> well.
>>>
>>> I think large folio swapin would make the problem worse anyway. I am
>>> also not sure if the readahead window adjusts on memory pressure or
>>> not.
>>>
>> The readahead window doesn't look at memory pressure. So maybe the same thing is being
>> seen here as there would be in swapin_readahead?
> 
> Maybe readahead is not as aggressive in general as large folio
> swapins? Looking at swap_vma_ra_win(), it seems like the maximum order
> of the window is the smaller of page_cluster (2 or 3) and
> SWAP_RA_ORDER_CEILING (5).

Yes, I was seeing 8-page swapins (order 3) when testing, so it might
be similar to enabling 32K mTHP?
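
A quick check of the arithmetic behind that comparison, as a sketch only.
It assumes 4 KiB base pages and takes the two bounds quoted above
(page_cluster and SWAP_RA_ORDER_CEILING = 5) at face value. The helper
names are made up for illustration and are not kernel functions.

```c
/* Ceiling quoted in the thread; treated here as a plain constant. */
#define SWAP_RA_ORDER_CEILING 5
/* Assumption: 4 KiB base pages. */
#define BASE_PAGE_SIZE 4096UL

/*
 * Hypothetical helper: upper bound on the readahead window in pages,
 * i.e. 2^min(page_cluster, SWAP_RA_ORDER_CEILING).
 */
static unsigned int max_ra_win_pages(unsigned int page_cluster)
{
	unsigned int order = page_cluster < SWAP_RA_ORDER_CEILING ?
			     page_cluster : SWAP_RA_ORDER_CEILING;

	return 1U << order;
}

/* Hypothetical helper: the same bound expressed in bytes. */
static unsigned long max_ra_win_bytes(unsigned int page_cluster)
{
	return max_ra_win_pages(page_cluster) * BASE_PAGE_SIZE;
}
```

With page_cluster = 3 this gives 8 pages, i.e. 32 KiB, which is where the
32K mTHP comparison comes from. With page_cluster = 0 the bound is a
single page.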

> 
> Also readahead will swapin 4k folios AFAICT, so we don't need a
> contiguous allocation like large folio swapin. So that could be
> another factor why readahead may not reproduce the problem.
> 
>> Maybe if we check kernel build test
>> performance in a 4G memcg with the diff below, it might get better?
> 
> I think you can use the page_cluster tunable to do this at runtime.
> 
>>
>> diff --git a/mm/swap_state.c b/mm/swap_state.c
>> index 4669f29cf555..9e196e1e6885 100644
>> --- a/mm/swap_state.c
>> +++ b/mm/swap_state.c
>> @@ -809,7 +809,7 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
>>         pgoff_t ilx;
>>         bool page_allocated;
>>
>> -       win = swap_vma_ra_win(vmf, &start, &end);
>> +       win = 1;
>>         if (win == 1)
>>                 goto skip;
>>
