From: "Huang, Ying" <ying.huang@intel.com>
To: Barry Song <21cnbao@gmail.com>
Cc: Usama Arif, Johannes Weiner, Yosry Ahmed, akpm@linux-foundation.org,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org, Barry Song,
	Kanchana P Sridhar, David Hildenbrand, Baolin Wang, Chris Li,
	Kairui Song, Ryan Roberts, Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song
Subject: Re: [PATCH RFC] mm: mitigate large folios usage and swap thrashing for nearly full memcg
In-Reply-To: (Barry Song's message of "Tue, 5 Nov 2024 14:13:35 +1300")
References: <20241027001444.3233-1-21cnbao@gmail.com>
	<33c5d5ca-7bc4-49dc-b1c7-39f814962ae0@gmail.com>
	<852211c6-0b55-4bdd-8799-90e1f0c002c1@gmail.com>
	<20241031153830.GA799903@cmpxchg.org>
	<87a5ef8ppq.fsf@yhuang6-desk2.ccr.corp.intel.com>
	<3f684183-c6df-4f2f-9e33-91ce43c791eb@gmail.com>
	<87ses67b0b.fsf@yhuang6-desk2.ccr.corp.intel.com>
Date: Tue, 05 Nov 2024 09:41:18 +0800
Message-ID: <87ldxy78zl.fsf@yhuang6-desk2.ccr.corp.intel.com>
Barry Song <21cnbao@gmail.com> writes:

> On Tue, Nov 5, 2024 at 2:01 PM Huang, Ying wrote:
>>
>> Usama Arif writes:
>>
>> > On 04/11/2024 06:42, Huang, Ying wrote:
>> >> Johannes Weiner writes:
>> >>
>> >>> On Wed, Oct 30, 2024 at 02:18:09PM -0700, Yosry Ahmed wrote:
>> >>>> On Wed, Oct 30, 2024 at 2:13 PM Usama Arif wrote:
>> >>>>> On 30/10/2024 21:01, Yosry Ahmed wrote:
>> >>>>>> On Wed, Oct 30, 2024 at 1:25 PM Usama Arif wrote:
>> >>>>>>>>> I am not sure that the approach we are trying in this patch is
>> >>>>>>>>> the right way:
>> >>>>>>>>> - This patch makes it a memcg issue, but you could have memcg
>> >>>>>>>>>   disabled, and then the mitigation being tried here won't
>> >>>>>>>>>   apply.
>> >>>>>>>>
>> >>>>>>>> Is the problem reproducible without memcg? I imagine only if the
>> >>>>>>>> entire system is under memory pressure. I guess we would want
>> >>>>>>>> the same "mitigation" either way.
>> >>>>>>>>
>> >>>>>>> What would be a good open source benchmark/workload to test
>> >>>>>>> without limiting memory in memcg?
>> >>>>>>> For the kernel build test, I can only get zswap activity to
>> >>>>>>> happen if I build in a cgroup and limit memory.max.
>> >>>>>>
>> >>>>>> You mean a benchmark that puts the entire system under memory
>> >>>>>> pressure? I am not sure; it ultimately depends on the size of
>> >>>>>> memory you have, among other factors.
>> >>>>>>
>> >>>>>> What if you run the kernel build test in a VM? Then you can limit
>> >>>>>> its size like a memcg, although you'd probably need to leave more
>> >>>>>> room because the entire guest OS will also be subject to the same
>> >>>>>> limit.
>> >>>>>
>> >>>>> I had tried this, but the variance in time/zswap numbers was very
>> >>>>> high. Much higher than the AMD numbers I posted in reply to Barry.
>> >>>>> So I found it very difficult to make comparisons.
>> >>>>
>> >>>> Hmm, yeah, maybe more factors come into play with global memory
>> >>>> pressure. I am honestly not sure how to test this scenario, and I
>> >>>> suspect variance will be high anyway.
>> >>>>
>> >>>> We can just try to use whatever technique we use for the memcg
>> >>>> limit, though, if possible, right?
>> >>>
>> >>> You can boot a physical machine with mem=1G on the command line,
>> >>> which restricts the physical range of memory that will be
>> >>> initialized. Double-check /proc/meminfo after boot, because part of
>> >>> that physical range might not be usable RAM.
>> >>>
>> >>> I do this quite often to test physical memory pressure with
>> >>> workloads that don't scale up easily, like kernel builds.
>> >>>
>> >>>>>>>>> - Instead of this being a large folio swapin issue, is it more
>> >>>>>>>>>   of a readahead issue? If we zswap (without the large folio
>> >>>>>>>>>   swapin series) and change the window to 1 in
>> >>>>>>>>>   swap_vma_readahead, we might see an improvement in Linux
>> >>>>>>>>>   kernel build time when cgroup memory is limited, as readahead
>> >>>>>>>>>   would probably cause swap thrashing as well.
>> >>>
>> >>> +1
>> >>>
>> >>> I also think there is too much focus on cgroup alone. The bigger
>> >>> issue seems to be how much optimistic volume we swap in when we're
>> >>> already under pressure. This applies to large folios and readahead;
>> >>> to global memory availability and cgroup limits.
>> >>
>> >> The current swap readahead logic is something like:
>> >>
>> >> 1. try to readahead some pages for a sequential access pattern, and
>> >>    mark them as readahead
>> >>
>> >> 2. if these readahead pages get accessed before being swapped out
>> >>    again, increase the 'hits' counter
>> >>
>> >> 3. on the next swap-in, try to readahead 'hits' pages and clear
>> >>    'hits'
>> >>
>> >> So, under heavy memory pressure, the readahead pages will not be
>> >> accessed before being swapped out again (step 2 above), and readahead
>> >> will be minimal.
>> >>
>> >> IMHO, mTHP swap-in is a kind of swap readahead in effect. That is, in
>> >> addition to the accessed pages being swapped in, the adjacent pages
>> >> are swapped in (read ahead) too. If these readahead pages are not
>> >> accessed before being swapped out again, the system runs into more
>> >> severe thrashing, because we lack the swap readahead window scaling
>> >> mechanism above. And this is why I previously suggested combining the
>> >> swap readahead mechanism and mTHP swap-in by default.
>> >> That is, when the kernel swaps in a page, it checks the current swap
>> >> readahead window and decides the mTHP order according to the window
>> >> size. So, if memory pressure is heavy enough that the nearby pages
>> >> will not be accessed before being swapped out again, the mTHP swap-in
>> >> order can be adjusted down automatically.
>> >
>> > This is a good idea, but I think the issue is that readahead is a
>> > folio flag and not a page flag, so it only works when the folio size
>> > is 1.
>> >
>> > In the swapin_readahead swapcache path, the current implementation
>> > decides the ra_window based on hits, which is incremented in
>> > swap_cache_get_folio if the folio has not been gotten from the
>> > swapcache before.
>> > The problem would be that we need information on how many distinct
>> > pages in a swapped-in large folio have been accessed in order to
>> > decide the hits/window size, which I don't think is possible: once the
>> > entire large folio has been swapped in, we won't get a fault.
>>
>> To do that, we need to move the readahead flag from per-folio to
>> per-page. And we need to map only the accessed page of the folio in the
>> page fault handler. This may impact performance, so we may do it only
>> for sampled folios, for example, every 100th folio.
>
> I'm not entirely sure there's a chance to gain traction on this, as the
> current trend clearly leans toward moving flags from page to folio, not
> from folio to page :-)

This may be a problem. However, I think we can try to find a solution for
it. Anyway, we need some way to track per-page status within a folio,
regardless of how it is implemented.

>>
>> >>
>> >>> It happens to manifest with THP in cgroups because that's what you
>> >>> guys are testing. But IMO, any solution to this problem should
>> >>> consider the wider scope.
>> >>>
>> >>>>>>>> I think large folio swapin would make the problem worse anyway.
>> >>>>>>>> I am also not sure if the readahead window adjusts under memory
>> >>>>>>>> pressure or not.
>> >>>>>>>
>> >>>>>>> The readahead window doesn't look at memory pressure. So maybe
>> >>>>>>> the same thing is being seen here as there would be in
>> >>>>>>> swapin_readahead?
>> >>>>>>
>> >>>>>> Maybe readahead is not as aggressive in general as large folio
>> >>>>>> swapins? Looking at swap_vma_ra_win(), it seems like the maximum
>> >>>>>> order of the window is the smaller of page_cluster (2 or 3) and
>> >>>>>> SWAP_RA_ORDER_CEILING (5).
>> >>>>>
>> >>>>> Yes, I was seeing 8-page swapins (order 3) when testing. So it
>> >>>>> might be similar to enabling 32K mTHP?
>> >>>>
>> >>>> Not quite.
>> >>>
>> >>> Actually, I would expect it to be...
>> >>
>> >> Me too.
>> >>
>> >>>>>> Also, readahead will swap in 4K folios AFAICT, so we don't need a
>> >>>>>> contiguous allocation like large folio swapin does. So that could
>> >>>>>> be another factor in why readahead may not reproduce the problem.
>> >>>>
>> >>>> Because of this ^.
>> >>>
>> >>> ...this matters for the physical allocation, which might require
>> >>> more reclaim and compaction to produce the 32K. But an earlier
>> >>> version of Barry's patch did the cgroup margin fallback after the
>> >>> THP was already physically allocated, and it still helped.
>> >>>
>> >>> So the issue in this test scenario seems to be mostly about cgroup
>> >>> volume. And then eight 4K charges should be equivalent to a single
>> >>> 32K charge when it comes to cgroup pressure.

--
Best Regards,
Huang, Ying