From: "Huang, Ying" <ying.huang@intel.com>
To: Usama Arif
Cc: Johannes Weiner, Barry Song <21cnbao@gmail.com>, Yosry Ahmed,
 akpm@linux-foundation.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
 Barry Song, Kanchana P Sridhar, David Hildenbrand, Baolin Wang, Chris Li,
 Kairui Song, Ryan Roberts, Michal Hocko, Roman Gushchin, Shakeel Butt,
 Muchun Song
Subject: Re: [PATCH RFC] mm: mitigate large folios usage and swap thrashing for nearly full memcg
In-Reply-To: <3f684183-c6df-4f2f-9e33-91ce43c791eb@gmail.com> (Usama Arif's message of "Mon, 4 Nov 2024 12:13:22 +0000")
References: <20241027001444.3233-1-21cnbao@gmail.com>
 <33c5d5ca-7bc4-49dc-b1c7-39f814962ae0@gmail.com>
 <852211c6-0b55-4bdd-8799-90e1f0c002c1@gmail.com>
 <20241031153830.GA799903@cmpxchg.org>
 <87a5ef8ppq.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <3f684183-c6df-4f2f-9e33-91ce43c791eb@gmail.com>
Date: Tue, 05 Nov 2024 08:57:40 +0800
Message-ID: <87ses67b0b.fsf@yhuang6-desk2.ccr.corp.intel.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Usama Arif writes:

> On 04/11/2024 06:42, Huang, Ying wrote:
>> Johannes Weiner writes:
>>
>>> On Wed, Oct 30, 2024 at 02:18:09PM -0700, Yosry Ahmed wrote:
>>>> On Wed, Oct 30, 2024 at 2:13 PM Usama Arif wrote:
>>>>> On 30/10/2024 21:01, Yosry Ahmed wrote:
>>>>>> On Wed, Oct 30, 2024 at 1:25 PM Usama Arif wrote:
>>>>>>>>> I am not sure that the approach we are trying in this patch is the
>>>>>>>>> right way:
>>>>>>>>> - This patch makes it a memcg issue, but you could have memcg
>>>>>>>>> disabled and then the mitigation being tried here won't apply.
>>>>>>>>
>>>>>>>> Is the problem reproducible without memcg? I imagine only if the
>>>>>>>> entire system is under memory pressure. I guess we would want the same
>>>>>>>> "mitigation" either way.
>>>>>>>>
>>>>>>> What would be a good open source benchmark/workload to test without
>>>>>>> limiting memory in memcg?
>>>>>>> For the kernel build test, I can only get zswap activity to happen if I
>>>>>>> build in a cgroup and limit memory.max.
>>>>>>
>>>>>> You mean a benchmark that puts the entire system under memory
>>>>>> pressure? I am not sure, it ultimately depends on the size of memory
>>>>>> you have, among other factors.
>>>>>>
>>>>>> What if you run the kernel build test in a VM? Then you can limit its
>>>>>> size like a memcg, although you'd probably need to leave more room
>>>>>> because the entire guest OS will also be subject to the same limit.
>>>>>>
>>>>> I had tried this, but the variance in time/zswap numbers was very high,
>>>>> much higher than the AMD numbers I posted in reply to Barry, so I found
>>>>> it very difficult to make comparisons.
>>>>
>>>> Hmm, yeah, maybe more factors come into play with global memory
>>>> pressure. I am honestly not sure how to test this scenario, and I
>>>> suspect variance will be high anyway.
>>>>
>>>> We can just try to use whatever technique we use for the memcg limit
>>>> though, if possible, right?
>>>
>>> You can boot a physical machine with mem=1G on the command line, which
>>> restricts the physical range of memory that will be initialized.
>>> Double-check /proc/meminfo after boot, because part of that physical
>>> range might not be usable RAM.
>>>
>>> I do this quite often to test physical memory pressure with workloads
>>> that don't scale up easily, like kernel builds.
>>>
>>>>>>>>> - Instead of this being a large folio swapin issue, is it more of a
>>>>>>>>> readahead issue? If we zswap (without the large folio swapin series)
>>>>>>>>> and change the window to 1 in swap_vma_readahead, we might see an
>>>>>>>>> improvement in Linux kernel build time when cgroup memory is limited,
>>>>>>>>> as readahead would probably cause swap thrashing as well.
>>>
>>> +1
>>>
>>> I also think there is too much focus on cgroup alone. The bigger issue
>>> seems to be how much optimistic volume we swap in when we're under
>>> pressure already. This applies to large folios and readahead; global
>>> memory availability and cgroup limits.
>>
>> The current swap readahead logic is something like,
>>
>> 1. try to read ahead some pages for a sequential access pattern, and
>>    mark them as readahead
>>
>> 2. if these readahead pages get accessed before being swapped out again,
>>    increase the 'hits' counter
>>
>> 3. for the next swap-in, try to read ahead 'hits' pages and clear 'hits'.
>>
>> So, if there's heavy memory pressure, the readahead pages will not be
>> accessed before being swapped out again (in 2 above), and readahead
>> will be minimal.
>>
>> IMHO, mTHP swap-in is a kind of swap readahead in effect. That is, in
>> addition to the pages actually accessed, the adjacent pages are swapped
>> in (swap readahead) too. If these readahead pages are not accessed
>> before being swapped out again, the system runs into more severe
>> thrashing, because mTHP swap-in lacks the swap readahead window scaling
>> mechanism above. And this is why I previously suggested combining the
>> swap readahead mechanism and mTHP swap-in by default. That is, when the
>> kernel swaps in a page, it checks the current swap readahead window and
>> decides the mTHP order according to the window size. So, if memory
>> pressure is heavy enough that the nearby pages will not be accessed
>> before being swapped out again, the mTHP swap-in order can be adjusted
>> automatically.
> This is a good idea to do, but I think the issue is that readahead
> is a folio flag and not a page flag, so it only works when the folio
> size is 1.
>
> In the swapin_readahead swapcache path, the current implementation decides
> the ra_window based on hits, which is incremented in swap_cache_get_folio
> if the folio has not been gotten from the swapcache before.
> The problem would be that we need information on how many distinct pages in
> a large folio that has been swapped in have been accessed, to decide the
> hits/window size, which I don't think is possible. Once the entire large
> folio has been swapped in, we won't get a fault.

To do that, we would need to move the readahead flag from per-folio to
per-page, and we would need to map only the accessed page of the folio in
the page fault handler. This may impact performance, so we may do that
for sampled folios only, for example, every 100 folios.

>>
>>> It happens to manifest with THP in cgroups because that's what you
>>> guys are testing. But IMO, any solution to this problem should
>>> consider the wider scope.
>>>
>>>>>>>> I think large folio swapin would make the problem worse anyway. I am
>>>>>>>> also not sure if the readahead window adjusts on memory pressure or
>>>>>>>> not.
>>>>>>>>
>>>>>>> The readahead window doesn't look at memory pressure. So maybe the
>>>>>>> same thing is being seen here as there would be in swapin_readahead?
>>>>>>
>>>>>> Maybe readahead is not as aggressive in general as large folio
>>>>>> swapins? Looking at swap_vma_ra_win(), it seems like the maximum order
>>>>>> of the window is the smaller of page_cluster (2 or 3) and
>>>>>> SWAP_RA_ORDER_CEILING (5).
>>>>>
>>>>> Yes, I was seeing 8-page swap-ins (order 3) when testing. So it might
>>>>> be similar to enabling 32K mTHP?
>>>>
>>>> Not quite.
>>>
>>> Actually, I would expect it to be...
>>
>> Me too.
>>
>>>>>> Also, readahead will swap in 4k folios AFAICT, so we don't need a
>>>>>> contiguous allocation like large folio swapin. So that could be
>>>>>> another factor why readahead may not reproduce the problem.
>>>>
>>>> Because of this ^.
>>>
>>> ...this matters for the physical allocation, which might require more
>>> reclaim and compaction to produce the 32k. But an earlier version of
>>> Barry's patch did the cgroup margin fallback after the THP was already
>>> physically allocated, and it still helped.
>>>
>>> So the issue in this test scenario seems to be mostly about cgroup
>>> volume. And then 8 4k charges should be equivalent to a single 32k
>>> charge when it comes to cgroup pressure.

--
Best Regards,
Huang, Ying