From: Barry Song
Date: Mon, 21 Oct 2024 17:44:03 +1300
Subject: Re: [PATCH v3] mm/vmscan: stop the loop if enough pages have been page_out
To: chenridong
Cc: Kefeng Wang, Chen Ridong, akpm@linux-foundation.org, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org, wangweiyang2@huawei.com, Michal Hocko,
 Johannes Weiner, Yosry Ahmed, Yu Zhao, David Hildenbrand, Matthew Wilcox,
 Ryan Roberts

On Fri, Oct 11, 2024 at 7:49 PM chenridong wrote:
>
>
>
> On 2024/10/11 0:17, Barry Song wrote:
> > On Thu, Oct 10, 2024 at 4:59 PM Kefeng Wang wrote:
> >>
> >> Hi Ridong,
> >>
> >> This should be the first version for upstream, and the issue only
> >> occurs when a large folio is split.
> >>
> >> Adding more CCs to see if there's more feedback.
> >>
> >>
> >> On 2024/10/10 16:18, Chen Ridong wrote:
> >>> From: Chen Ridong
> >>>
> >>> An issue was found with the following testing steps:
> >>> 1. Compile with CONFIG_TRANSPARENT_HUGEPAGE=y.
> >>> 2. Mount memcg v1, create a memcg named test_memcg, and set
> >>>    usage_in_bytes=2.1G, memsw.usage_in_bytes=3G.
> >>> 3. Create a 1G swap file, and allocate 2.2G of anon memory in test_memcg.
> >>>
> >>> It was found that:
> >>>
> >>>   cat memory.usage_in_bytes
> >>>   2144940032
> >>>   cat memory.memsw.usage_in_bytes
> >>>   2255056896
> >>>
> >>>   free -h
> >>>          total    used    free
> >>>   Mem:    31Gi   2.1Gi    27Gi
> >>>   Swap:  1.0Gi   618Mi   405Mi
> >>>
> >>> As shown above, test_memcg used about 100M of swap, but 600M+ of swap
> >>> was in use, which means about 500M may be wasted because other memcgs
> >>> cannot use this swap memory.
> >>>
> >>> It can be explained as follows:
> >>> 1. When entering shrink_inactive_list, it isolates folios from the lru,
> >>>    from tail to head. Suppose it just takes folioN from the lru (to keep
> >>>    it simple).
> >>>
> >>>    inactive lru: folio1<->folio2<->folio3...<->folioN-1
> >>>    isolated list: folioN
> >>>
> >>> 2. In shrink_page_list, if folioN is a THP, it may be split and added to
> >>>    the swap cache folio by folio. After a folio is added to the swap
> >>>    cache, io is submitted to write it back to swap, which is asynchronous.
> >>>    When shrink_page_list finishes, the isolated folio list is moved back
> >>>    to the head of the inactive lru. The inactive lru may then look like
> >>>    this, with 512 folios having been moved to the head of the inactive lru:
> >>>
> >>>    folioN512<->folioN511<->...folioN1<->folio1<->folio2...<->folioN-1
> >>>
> >>> 3. When a folio's writeback io completes, the folio may be rotated to the
> >>>    tail of the lru. The following lru list is expected, with the folios
> >>>    that were added to the swap cache rotated to the tail of the lru, so
> >>>    they can be reclaimed as soon as possible:
> >>>
> >>>    folio1<->folio2<->...<->folioN-1<->folioN1<->...folioN511<->folioN512
> >>>
> >>> 4. However, shrink_page_list and folio writeback are asynchronous. If a
> >>>    THP is split, shrink_page_list loops at least 512 times, which means
> >>>    that shrink_page_list has not finished while some folio writebacks
> >>>    have already completed, and this may lead to failure to rotate those
> >>>    folios to the tail of the lru. The lru may look as below:
> >
> > I assume you're referring to PMD-mapped THP, but your code also modifies
> > mTHP, which might not be that large. For instance, it could be a 16KB mTHP.
> >
> >>>
> >>>    folioN50<->folioN49<->...folioN1<->folio1<->folio2...<->folioN-1<->
> >>>    folioN51<->folioN52<->...folioN511<->folioN512
> >>>
> >>>    Although those folios (N1-N50) have finished writing back, they are
> >>>    still at the head of the lru. When isolating folios from the lru, it
> >>>    scans from tail to head, so it is difficult to scan those folios again.
> >>>
> >>> What is described above may lead to a large number of folios being added
> >>> to the swap cache but not reclaimed in time, which reduces reclaim
> >>> efficiency and prevents other memcgs from using this swap memory even if
> >>> they trigger OOM.
> >>>
> >>> To fix this issue, it's better to stop the loop if a THP has been split
> >>> and nr_pageout is greater than nr_to_reclaim.
> >>>
> >>> Signed-off-by: Chen Ridong
> >>> ---
> >>>  mm/vmscan.c | 16 +++++++++++++++-
> >>>  1 file changed, 15 insertions(+), 1 deletion(-)
> >>>
> >>> diff --git a/mm/vmscan.c b/mm/vmscan.c
> >>> index 749cdc110c74..fd8ad251eda2 100644
> >>> --- a/mm/vmscan.c
> >>> +++ b/mm/vmscan.c
> >>> @@ -1047,7 +1047,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >>>       LIST_HEAD(demote_folios);
> >>>       unsigned int nr_reclaimed = 0;
> >>>       unsigned int pgactivate = 0;
> >>> -     bool do_demote_pass;
> >>> +     bool do_demote_pass, splited = false;
> >>>       struct swap_iocb *plug = NULL;
> >>>
> >>>       folio_batch_init(&free_folios);
> >>> @@ -1065,6 +1065,16 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >>>
> >>>               cond_resched();
> >>>
> >>> +             /*
> >>> +              * If a large folio has been split, many folios are added
> >>> +              * to folio_list. Looping through the entire list takes
> >>> +              * too much time, which may prevent folios that have
> >>> +              * completed writeback from rotating to the tail of the
> >>> +              * lru. Just stop looping if nr_pageout is greater than
> >>> +              * nr_to_reclaim.
> >>> +              */
> >>> +             if (unlikely(splited && stat->nr_pageout > sc->nr_to_reclaim))
> >>> +                     break;
> >
> > I'm not entirely sure about the theory behind comparing stat->nr_pageout
> > with sc->nr_to_reclaim. However, the condition might still hold true even
> > if you've split a relatively small "large folio", such as 16kB?
> >
>
> Why compare stat->nr_pageout with sc->nr_to_reclaim? Because if all the
> pages that have been paged out can be reclaimed, then enough pages can be
> reclaimed once they have all finished writeback. Thus, it may not be
> necessary to page out more.
>
> If a small large folio (16 kB) has been split, it may return early without
> all the pages in the folio_list being paged out, but I think that is fine.
> It can page out more pages the next time it enters shrink_folio_list if
> there are not enough pages to reclaim.
>
> However, if pages that have been paged out are still at the head of the
> LRU, it is difficult to scan these pages again. In this case, not only
> might it "waste" some swap memory, but it also has to page out more pages.
>
> Considering the above, I sent this patch. It may not be a perfect
> solution, but I think it's a good option to consider. And I am wondering
> if anyone has a better solution.

Hi Ridong,

My overall understanding is that you have failed to describe your problem
clearly; in particular, I don't understand what your points 3 and 4 mean:

> 3. When a folio's writeback io completes, the folio may be rotated to the
>    tail of the lru. The following lru list is expected, with the folios
>    that were added to the swap cache rotated to the tail of the lru, so
>    they can be reclaimed as soon as possible:
>
>    folio1<->folio2<->...<->folioN-1<->folioN1<->...folioN511<->folioN512

> 4. However, shrink_page_list and folio writeback are asynchronous. If a
>    THP is split, shrink_page_list loops at least 512 times, which means
>    that shrink_page_list has not finished while some folio writebacks
>    have already completed, and this may lead to failure to rotate those
>    folios to the tail of the lru. The lru may look as below:

Can you please describe it in a more readable way? I feel your diagram below
is somehow wrong:

folio1<->folio2<->...<->folioN-1<->folioN1<->...folioN511<->folioN512

You mentioned "rotate"; how could "rotate" make

folioN512<->folioN511<->...folioN1

in (2) become

folioN1<->...folioN511<->folioN512

in (3)?
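To make the rotation constraint concrete, here is a rough, illustrative
sketch of what happens when writeback completes. It is loosely modelled on
the folio_end_writeback() -> folio_rotate_reclaimable() path; the function
itself and the tail-move helper at the end are hypothetical simplifications,
and the real mm/filemap.c / mm/swap.c code differs in detail:

```c
/*
 * Illustrative sketch only -- not the actual kernel code.  It compresses
 * the writeback-completion path into one hypothetical helper to show why
 * an isolated folio cannot be rotated to the LRU tail.
 */
static void writeback_done_rotate_sketch(struct folio *folio)
{
	/* shrink_folio_list() set PG_reclaim before calling pageout() */
	if (!folio_test_reclaim(folio))
		return;

	folio_clear_reclaim(folio);

	/*
	 * The folio is only moved to the LRU tail if it is actually on an
	 * LRU list.  A folio still sitting on the isolated folio_list
	 * inside shrink_folio_list() has the LRU flag cleared, so it is
	 * skipped here; move_folios_to_lru() later puts it back at the
	 * head of the inactive LRU, and since PG_reclaim has already been
	 * cleared, nothing rotates it afterwards.
	 */
	if (folio_test_locked(folio) || folio_test_dirty(folio) ||
	    folio_test_unevictable(folio) || !folio_test_lru(folio))
		return;

	lru_move_folio_to_tail(folio);	/* hypothetical tail-move helper */
}
```

So a folio whose swap write finishes while shrink_folio_list() is still
walking a split-inflated folio_list misses its chance to be rotated, which
is the window the early break above is trying to shrink.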
BTW, writeback isn't always async. It can be sync for zram and sync-io swap
devices. In that case, your patch might change the LRU order; I mean, for
example, when an mTHP becomes cold, we always reclaim all of it, rather than
reclaiming part of it and putting some of its small folios back at the head
of the lru.

>
> Best regards,
> Ridong
>
> >>> +
> >>>               folio = lru_to_folio(folio_list);
> >>>               list_del(&folio->lru);
> >>>
> >>> @@ -1273,6 +1283,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >>>               if ((nr_pages > 1) && !folio_test_large(folio)) {
> >>>                       sc->nr_scanned -= (nr_pages - 1);
> >>>                       nr_pages = 1;
> >>> +                     splited = true;
> >>>               }
> >>>
> >>>       /*
> >>> @@ -1375,12 +1386,14 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >>>                       if (nr_pages > 1 && !folio_test_large(folio)) {
> >>>                               sc->nr_scanned -= (nr_pages - 1);
> >>>                               nr_pages = 1;
> >>> +                             splited = true;
> >>>                       }
> >>>                       goto activate_locked;
> >>>               case PAGE_SUCCESS:
> >>>                       if (nr_pages > 1 && !folio_test_large(folio)) {
> >>>                               sc->nr_scanned -= (nr_pages - 1);
> >>>                               nr_pages = 1;
> >>> +                             splited = true;
> >>>                       }
> >>>                       stat->nr_pageout += nr_pages;
> >>>
> >>> @@ -1491,6 +1504,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >>>               if (nr_pages > 1) {
> >>>                       sc->nr_scanned -= (nr_pages - 1);
> >>>                       nr_pages = 1;
> >>> +                     splited = true;
> >>>               }
> >>> activate_locked:
> >>>               /* Not a candidate for swapping, so reclaim swap space. */
> >>
> Thanks

barry