From: Barry Song <21cnbao@gmail.com>
Date: Thu, 24 Oct 2024 07:52:01 +1300
Subject: Re: [RFC 0/4] mm: zswap: add support for zswapin of large folios
To: Usama Arif
Cc: Yosry Ahmed, senozhatsky@chromium.org, minchan@kernel.org,
 hanchuanhua@oppo.com, v-songbaohua@oppo.com, akpm@linux-foundation.org,
 linux-mm@kvack.org, hannes@cmpxchg.org, david@redhat.com,
 willy@infradead.org, kanchana.p.sridhar@intel.com, nphamcs@gmail.com,
 chengming.zhou@linux.dev, ryan.roberts@arm.com, ying.huang@intel.com,
 riel@surriel.com, shakeel.butt@linux.dev, kernel-team@meta.com,
 linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org
In-Reply-To: <3dca2498-363c-4ba5-a7e6-80c5e5532db5@gmail.com>
References: <20241018105026.2521366-1-usamaarif642@gmail.com>
 <5313c721-9cf1-4ecd-ac23-1eeddabd691f@gmail.com>
 <4c30cc30-0f7c-4ca7-a933-c8edfadaee5c@gmail.com>
 <7a14c332-3001-4b9a-ada3-f4d6799be555@gmail.com>
 <3dca2498-363c-4ba5-a7e6-80c5e5532db5@gmail.com>

On Thu, Oct 24, 2024 at 7:31 AM Usama Arif wrote:
>
>
>
> On 23/10/2024 19:02, Yosry Ahmed wrote:
> > [..]
> >>>> I suspect the regression occurs because you're running an edge case
> >>>> where the memory cgroup stays nearly full most of the time (this isn't
> >>>> an inherent issue with large folio swap-in).
> >>>> As a result, swapping in
> >>>> mTHP quickly triggers a memcg overflow, causing a swap-out. The
> >>>> next swap-in then recreates the overflow, leading to a repeating
> >>>> cycle.
> >>>>
> >>>
> >>> Yes, agreed! Looking at the swap counters, I think this is what is going
> >>> on as well.
> >>>
> >>>> We need a way to stop the cup from repeatedly filling to the brim and
> >>>> overflowing. While not a definitive fix, the following change might help
> >>>> improve the situation:
> >>>>
> >>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> >>>>
> >>>> index 17af08367c68..f2fa0eeb2d9a 100644
> >>>> --- a/mm/memcontrol.c
> >>>> +++ b/mm/memcontrol.c
> >>>>
> >>>> @@ -4559,7 +4559,10 @@ int mem_cgroup_swapin_charge_folio(struct folio
> >>>> *folio, struct mm_struct *mm,
> >>>>         memcg = get_mem_cgroup_from_mm(mm);
> >>>>         rcu_read_unlock();
> >>>>
> >>>> -       ret = charge_memcg(folio, memcg, gfp);
> >>>> +       if (folio_test_large(folio) && mem_cgroup_margin(memcg) <
> >>>> MEMCG_CHARGE_BATCH)
> >>>> +               ret = -ENOMEM;
> >>>> +       else
> >>>> +               ret = charge_memcg(folio, memcg, gfp);
> >>>>
> >>>>         css_put(&memcg->css);
> >>>>         return ret;
> >>>> }
> >>>>
> >>>
> >>> The diff makes sense to me. Let me test later today and get back to you.
> >>>
> >>> Thanks!
> >>>
> >>>> Please confirm if it makes the kernel build with memcg limitation
> >>>> faster. If so, let's
> >>>> work together to figure out an official patch :-) The above code hasn't consider
> >>>> the parent memcg's overflow, so not an ideal fix.
> >>>>
> >>
> >> Thanks Barry, I think this fixes the regression, and even gives an improvement!
> >> I think the below might be better to do:
> >>
> >> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> >> index c098fd7f5c5e..0a1ec55cc079 100644
> >> --- a/mm/memcontrol.c
> >> +++ b/mm/memcontrol.c
> >> @@ -4550,7 +4550,11 @@ int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm,
> >>         memcg = get_mem_cgroup_from_mm(mm);
> >>         rcu_read_unlock();
> >>
> >> -       ret = charge_memcg(folio, memcg, gfp);
> >> +       if (folio_test_large(folio) &&
> >> +           mem_cgroup_margin(memcg) < max(MEMCG_CHARGE_BATCH, folio_nr_pages(folio)))
> >> +               ret = -ENOMEM;
> >> +       else
> >> +               ret = charge_memcg(folio, memcg, gfp);
> >>
> >>         css_put(&memcg->css);
> >>         return ret;
> >>
> >>
> >> AMD 16K+32K THP=always
> >> metric       mm-unstable   mm-unstable + large folio zswapin series   mm-unstable + large folio zswapin + no swap thrashing fix
> >> real         1m23.038s     1m23.050s                                  1m22.704s
> >> user         53m57.210s    53m53.437s                                 53m52.577s
> >> sys          7m24.592s     7m48.843s                                  7m22.519s
> >> zswpin       612070        999244                                     815934
> >> zswpout      2226403       2347979                                    2054980
> >> pgfault      20667366      20481728                                   20478690
> >> pgmajfault   385887        269117                                     309702
> >>
> >> AMD 16K+32K+64K THP=always
> >> metric       mm-unstable   mm-unstable + large folio zswapin series   mm-unstable + large folio zswapin + no swap thrashing fix
> >> real         1m22.975s     1m23.266s                                  1m22.549s
> >> user         53m51.302s    53m51.069s                                 53m46.471s
> >> sys          7m40.168s     7m57.104s                                  7m25.012s
> >> zswpin       676492        1258573                                    1225703
> >> zswpout      2449839       2714767                                    2899178
> >> pgfault      17540746      17296555                                   17234663
> >> pgmajfault   429629        307495                                     287859
> >>
> >
> > Thanks Usama and Barry for looking into this. It seems like this would
> > fix a regression with large folio swapin regardless of zswap. Can the
> > same result be reproduced on zram without this series?
>
>
> Yes, its a regression in large folio swapin support regardless of zswap/zram.
>
> Need to do 3 tests, one with probably the below diff to remove large folio support,
> one with current upstream and one with upstream + swap thrashing fix.
>
> We only use zswap and dont have a zram setup (and I am a bit lazy to create one :)).
> Any zram volunteers to try this?

Hi Usama,

I tried a quick experiment:

echo 1 > /sys/module/zswap/parameters/enabled
echo 0 > /sys/module/zswap/parameters/enabled

This was to test the zRAM scenario. Enabling zswap even
once disables mTHP swap-in. :)

I noticed a similar regression with zRAM alone, but the change resolved
the issue and even sped up the kernel build compared to the setup without
mTHP swap-in.

However, I'm still working on a proper patch to address this. The current
approach:

mem_cgroup_margin(memcg) < max(MEMCG_CHARGE_BATCH, folio_nr_pages(folio))

isn't sufficient, as it doesn't cover cases where group A contains group B, and
we're operating within group B. The problem occurs not at the boundary of
group B but at the boundary of group A.

I believe there's still room for improvement. For example, if a 64KB charge
attempt fails, there's no need to waste time trying 32KB or 16KB. We can
directly fall back to 4KB, as 32KB and 16KB will also fail based on our
margin detection logic.

>
> diff --git a/mm/memory.c b/mm/memory.c
> index fecdd044bc0b..62f6b087beb3 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4124,6 +4124,8 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
>         gfp_t gfp;
>         int order;
>
> +       goto fallback;
> +
>         /*
>          * If uffd is active for the vma we need per-page fault fidelity to
>          * maintain the uffd semantics.

Thanks
Barry
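
For anyone following along, the two points above (the group A / group B
gap, and skipping 32KB and 16KB once 64KB is refused) can be modelled in
userspace roughly as below. This is only a sketch under stated
assumptions, not kernel code and not the eventual patch: struct toy_memcg
and every toy_* helper are made up for illustration, and CHARGE_BATCH = 64
pages is an assumed stand-in for MEMCG_CHARGE_BATCH.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed stand-in for MEMCG_CHARGE_BATCH, in pages. */
#define CHARGE_BATCH 64UL

/* Hypothetical toy cgroup: a limit, a usage counter and a parent. */
struct toy_memcg {
	unsigned long max;        /* limit, in pages */
	unsigned long usage;      /* currently charged pages */
	struct toy_memcg *parent; /* NULL for the root */
};

/* Free room in this group only, i.e. what the current check looks at. */
static unsigned long toy_margin(const struct toy_memcg *cg)
{
	return cg->usage < cg->max ? cg->max - cg->usage : 0;
}

/* Smallest free room on the path from cg up to the root. */
static unsigned long toy_margin_hier(const struct toy_memcg *cg)
{
	unsigned long margin = toy_margin(cg);

	for (cg = cg->parent; cg; cg = cg->parent) {
		unsigned long m = toy_margin(cg);

		if (m < margin)
			margin = m;
	}
	return margin;
}

static unsigned long toy_max(unsigned long a, unsigned long b)
{
	return a > b ? a : b;
}

/* Same shape as the check in the diff: refuse a large folio near the limit. */
static int toy_large_refused(unsigned long margin, unsigned long nr_pages)
{
	return margin < toy_max(CHARGE_BATCH, nr_pages);
}

/*
 * Pick the order to swap in with. Once an order of at most CHARGE_BATCH
 * pages is refused, every smaller large order compares against the same
 * threshold (CHARGE_BATCH) and would be refused too, so jump to order 0.
 */
static int toy_pick_order(const struct toy_memcg *cg, int highest_order)
{
	unsigned long margin = toy_margin_hier(cg);
	int order;

	assert(highest_order >= 0);

	for (order = highest_order; order > 0; order--) {
		unsigned long nr_pages = 1UL << order;

		if (!toy_large_refused(margin, nr_pages))
			return order;
		if (nr_pages <= CHARGE_BATCH)
			break; /* smaller large orders cannot pass either */
	}
	return 0;
}
```

With group A at 990 of 1000 pages containing group B at 100 of 500 pages,
toy_margin(&b) reports 400 pages of room while toy_margin_hier(&b) reports
only 10, so toy_pick_order(&b, 4) (a 64KB folio with 4KB pages) goes
straight to order 0 without trying order 3 or 2.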