From: Barry Song <21cnbao@gmail.com>
Date: Mon, 21 Oct 2024 23:09:21 +1300
Subject: Re: [PATCH v3] mm/vmscan: stop the loop if enough pages have been page_out
To: chenridong
Cc: Chen Ridong, Kefeng Wang, akpm@linux-foundation.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, wangweiyang2@huawei.com, Michal Hocko, Johannes Weiner, Yosry Ahmed, Yu Zhao, David Hildenbrand, Matthew Wilcox, Ryan Roberts
References: <20241010081802.290893-1-chenridong@huaweicloud.com> <62bd2564-76fa-4cb0-9c08-9eb2f96771b6@huawei.com> <28b3eae5-92e7-471f-9883-d03684e06d1b@huaweicloud.com>

On Mon, Oct 21, 2024 at 10:56 PM chenridong wrote:
>
>
>
> On 2024/10/21 17:42, Barry Song wrote:
> > On Mon, Oct 21, 2024 at 9:14 PM Chen Ridong wrote:
> >>
> >>
> >>
> >> On 2024/10/21 12:44, Barry Song wrote:
> >>> On Fri, Oct 11, 2024 at 7:49 PM chenridong wrote:
> >>>>
> >>>>
> >>>>
> >>>> On 2024/10/11 0:17, Barry Song wrote:
> >>>>> On Thu, Oct 10, 2024 at
> >>>>> 4:59 PM Kefeng Wang wrote:
> >>>>>>
> >>>>>> Hi Ridong,
> >>>>>>
> >>>>>> This should be the first version for upstream, and the issue only
> >>>>>> occurs when a large folio is split.
> >>>>>>
> >>>>>> Adding more CCs to see if there's more feedback.
> >>>>>>
> >>>>>>
> >>>>>> On 2024/10/10 16:18, Chen Ridong wrote:
> >>>>>>> From: Chen Ridong
> >>>>>>>
> >>>>>>> An issue was found with the following testing steps:
> >>>>>>> 1. Compile with CONFIG_TRANSPARENT_HUGEPAGE=y.
> >>>>>>> 2. Mount memcg v1, create a memcg named test_memcg, and set
> >>>>>>>    usage_in_bytes=2.1G, memsw.usage_in_bytes=3G.
> >>>>>>> 3. Create a 1G swap file, and allocate 2.2G anon memory in test_memcg.
> >>>>>>>
> >>>>>>> It was found that:
> >>>>>>>
> >>>>>>>   cat memory.usage_in_bytes
> >>>>>>>   2144940032
> >>>>>>>   cat memory.memsw.usage_in_bytes
> >>>>>>>   2255056896
> >>>>>>>
> >>>>>>>   free -h
> >>>>>>>          total   used   free
> >>>>>>>   Mem:    31Gi  2.1Gi   27Gi
> >>>>>>>   Swap:  1.0Gi  618Mi  405Mi
> >>>>>>>
> >>>>>>> As shown above, test_memcg used about 100M of swap, but 600M+ of
> >>>>>>> swap was in use, which means about 500M may be wasted because
> >>>>>>> other memcgs cannot use this swap memory.
> >>>>>>>
> >>>>>>> It can be explained as follows:
> >>>>>>> 1. When entering shrink_inactive_list, it isolates folios from the
> >>>>>>>    lru from tail to head. Suppose it just takes folioN from the
> >>>>>>>    lru (to keep it simple).
> >>>>>>>
> >>>>>>>    inactive lru: folio1<->folio2<->folio3...<->folioN-1
> >>>>>>>    isolated list: folioN
> >>>>>>>
> >>>>>>> 2. In the shrink_page_list function, if folioN is a THP, it may be
> >>>>>>>    split and added to the swap cache folio by folio. After being
> >>>>>>>    added to the swap cache, io is submitted to write each folio
> >>>>>>>    back to swap, which is asynchronous. When shrink_page_list is
> >>>>>>>    finished, the isolated folios list will be moved back to the
> >>>>>>>    head of the inactive lru.
> >>>>>>>    The inactive lru may just look like this, with 512 folios
> >>>>>>>    having been moved to the head of the inactive lru:
> >>>>>>>
> >>>>>>>    folioN512<->folioN511<->...folioN1<->folio1<->folio2...<->folioN-1
> >>>>>>>
> >>>>>>> 3. When a folio's writeback io is completed, the folio may be
> >>>>>>>    rotated to the tail of the lru. The following lru list is
> >>>>>>>    expected, with those folios that have been added to the swap
> >>>>>>>    cache rotated to the tail of the lru, so those folios can be
> >>>>>>>    reclaimed as soon as possible:
> >>>>>>>
> >>>>>>>    folio1<->folio2<->...<->folioN-1<->folioN1<->...folioN511<->folioN512
> >>>>>>>
> >>>>>>> 4. However, shrink_page_list and folio writeback are asynchronous.
> >>>>>>>    If the THP is split, shrink_page_list loops at least 512 times,
> >>>>>>>    which means shrink_page_list may not be finished while some
> >>>>>>>    folios' writeback has already completed, and this may lead to
> >>>>>>>    failure to rotate those folios to the tail of the lru. The lru
> >>>>>>>    may look as below:
> >>>>>
> >>>>> I assume you're referring to PMD-mapped THP, but your code also modifies
> >>>>> mTHP, which might not be that large. For instance, it could be a 16KB mTHP.
> >>>>>
> >>>>>>>
> >>>>>>>    folioN50<->folioN49<->...folioN1<->folio1<->folio2...<->folioN-1<->
> >>>>>>>    folioN51<->folioN52<->...folioN511<->folioN512
> >>>>>>>
> >>>>>>>    Although those folios (N1-N50) have finished writing back, they
> >>>>>>>    are still at the head of the lru. When isolating folios from
> >>>>>>>    the lru, it scans from tail to head, so it is difficult to scan
> >>>>>>>    those folios again.
> >>>>>>>
> >>>>>>> What is mentioned above may lead to a large number of folios having
> >>>>>>> been added to the swap cache but not reclaimed in time, which may
> >>>>>>> reduce reclaim efficiency and prevent other memcgs from using this
> >>>>>>> swap memory even if they trigger OOM.
> >>>>>>>
> >>>>>>> To fix this issue, it's better to stop looping if the THP has been
> >>>>>>> split and nr_pageout is greater than nr_to_reclaim.
> >>>>>>>
> >>>>>>> Signed-off-by: Chen Ridong
> >>>>>>> ---
> >>>>>>>  mm/vmscan.c | 16 +++++++++++++++-
> >>>>>>>  1 file changed, 15 insertions(+), 1 deletion(-)
> >>>>>>>
> >>>>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
> >>>>>>> index 749cdc110c74..fd8ad251eda2 100644
> >>>>>>> --- a/mm/vmscan.c
> >>>>>>> +++ b/mm/vmscan.c
> >>>>>>> @@ -1047,7 +1047,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >>>>>>>       LIST_HEAD(demote_folios);
> >>>>>>>       unsigned int nr_reclaimed = 0;
> >>>>>>>       unsigned int pgactivate = 0;
> >>>>>>> -     bool do_demote_pass;
> >>>>>>> +     bool do_demote_pass, splited = false;
> >>>>>>>       struct swap_iocb *plug = NULL;
> >>>>>>>
> >>>>>>>       folio_batch_init(&free_folios);
> >>>>>>> @@ -1065,6 +1065,16 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >>>>>>>
> >>>>>>>               cond_resched();
> >>>>>>>
> >>>>>>> +             /*
> >>>>>>> +              * If a large folio has been split, many folios are added
> >>>>>>> +              * to folio_list. Looping through the entire list takes
> >>>>>>> +              * too much time, which may prevent folios that have
> >>>>>>> +              * completed writeback from rotating to the tail of the
> >>>>>>> +              * lru. Just stop looping if nr_pageout is greater than
> >>>>>>> +              * nr_to_reclaim.
> >>>>>>> +              */
> >>>>>>> +             if (unlikely(splited && stat->nr_pageout > sc->nr_to_reclaim))
> >>>>>>> +                     break;
> >>>>>
> >>>>> I'm not entirely sure about the theory behind comparing stat->nr_pageout
> >>>>> with sc->nr_to_reclaim. However, the condition might still hold true even
> >>>>> if you've split a relatively small "large folio," such as 16kB?
> >>>>>
> >>>>
> >>>> Why compare stat->nr_pageout with sc->nr_to_reclaim?
> >>>> It's because if all
> >>>> pages that have been paged out can be reclaimed, then enough pages
> >>>> can be reclaimed once all pages have finished writeback. Thus, it may
> >>>> not have to page out more.
> >>>>
> >>>> If a small large folio (16kB) has been split, it may return early
> >>>> without all the pages in the folio_list being paged out, but I think
> >>>> that is fine. It can page out more pages the next time it enters
> >>>> shrink_folio_list if not enough pages have been reclaimed.
> >>>>
> >>>> However, if pages that have been paged out are still at the head of
> >>>> the LRU, it is difficult to scan these pages again. In this case, not
> >>>> only might it "waste" some swap memory but it also has to page out
> >>>> more pages.
> >>>>
> >>>> Considering the above, I sent this patch. It may not be a perfect
> >>>> solution, but I think it's a good option to consider. And I am
> >>>> wondering if anyone has a better solution.
> >>>
> >>> Hi Ridong,
> >>> My overall understanding is that you have failed to describe your
> >>> problem clearly; in particular I don't understand what your 3 and 4
> >>> mean:
> >>>
> >>>> 3. When a folio's writeback io is completed, the folio may be rotated
> >>>>    to the tail of the lru. The following lru list is expected, with
> >>>>    those folios that have been added to the swap cache rotated to the
> >>>>    tail of the lru, so those folios can be reclaimed as soon as
> >>>>    possible.
> >>>>
> >>>>    folio1<->folio2<->...<->folioN-1<->folioN1<->...folioN511<->folioN512
> >>>
> >>> > 4. However, shrink_page_list and folio writeback are asynchronous.
> >>> >    If the THP is split, shrink_page_list loops at least 512 times,
> >>> >    which means shrink_page_list may not be finished while some
> >>> >    folios' writeback has already completed, and this may lead to
> >>> >    failure to rotate those folios to the tail of the lru. The lru
> >>> >    may look as below:
> >>>
> >>> can you please describe it in a readable way?
> >>>
> >>> i feel your below diagram is somehow wrong:
> >>> folio1<->folio2<->...<->folioN-1<->folioN1<->...folioN511<->folioN512
> >>>
> >>> You mentioned "rotate". How could "rotate" make:
> >>> folioN512<->folioN511<->...folioN1 in (2)
> >>> become
> >>> folioN1<->...folioN511<->folioN512 in (3)?
> >>>
> >>
> >> I am sorry for any confusion.
> >>
> >> If the THP is split, folioN1, folioN2, folioN3, ..., folioN512 are
> >> committed to writeback one by one. It is assumed that folioN1, folioN2,
> >> folioN3, ..., folioN512 complete in order.
> >>
> >> Original:
> >> folioN512<->folioN511<->...folioN1<->folio1<->folio2...<->folioN-1
> >>
> >> folioN1 is finished, so folioN1 is rotated to the tail of the LRU:
> >> folioN512<->folioN511<->...folioN2<->folio1<->folio2...<->folioN-1<->folioN1
> >>
> >> folioN2 is finished:
> >> folioN512<->folioN511<->...folioN3<->folio1<->folio2...<->folioN-1<->folioN1<->folioN2
> >>
> >> folioN3 is finished:
> >> folioN512<->folioN511<->...folioN4<->folio1<->folio2...<->folioN-1<->folioN1<->folioN2<->folioN3
> >>
> >> ...
> >>
> >> folioN512 is finished:
> >> folio1<->folio2<->...<->folioN-1<->folioN1<->...folioN511<->folioN512
> >>
> >> When all the folios are finished, the LRU might look just like this:
> >> folio1<->folio2<->...<->folioN-1<->folioN1<->...folioN511<->folioN512
> >
> > understood, thanks!
> >
> > Let me try to understand the following part:
> >
> >> 4:
> >> folioN50<->folioN49<->...folioN1<->folio1<->folio2...<->folioN-1<->
> >> folioN51<->folioN52<->...folioN511<->folioN512
> >
> > > Although those folios (N1-N50) have finished writing back, they
> > > are still at the head of the lru. When isolating folios from the lru,
> > > it scans from tail to head, so it is difficult to scan those folios
> > > again.
> >
> > What is the reason that "those folios (N1-N50) have finished writing
> > back, yet they remain at the head of the LRU"? Is it because their
> > writeback end occurred while we were still looping in
> > shrink_folio_list(), causing folio_end_writeback()'s
> > folio_rotate_reclaimable() to fail in moving these folios, which were
> > still on the "folio_list", to the tail of the LRU?
> >
>
> Yes, you are right.
>
> >>
> >>> btw, writeback isn't always async. it could be sync for zram and
> >>> sync_io swap. in that case, your patch might change the order of the
> >>> LRU. i mean, for example, when an mTHP becomes cold, we always reclaim
> >>> all of it, not just part of it while putting the remaining small
> >>> folios back at the head of the lru.
> >>>
> >>
> >> Yes, this can be changed.
> >> Although it may put part of the small folios back at the head of the
> >> lru, it can return in time from shrink_folio_list without causing much
> >> additional I/O.
> >>
> >> If you have understood this issue, do you have any suggestions to fix
> >> it? My patch may not be a perfect way to fix this issue.
> >>
> >
> > My point is that synchronous I/O, like zRAM, doesn't have this issue and
> > doesn't require this fix, as writeback always completes without
> > asynchronous latency.
> >
>
> I have tested zRAM and found no issues.
> This is version 1, and I don't know whether this fix will be accepted.
> If it is accepted, perhaps this patch could be modified to apply only to
> asynchronous io.

Consider a 2MB THP: when it becomes cold, we detect that it is cold and
decide to page it out. Even if we split it into 512 * 4KiB folios, the
entire 2MB is still cold, so we want pageout() to be called for the entire
2MB. With your current approach, some parts of the 2MB are moved to the
LRU head while we're still paging out other parts, which seems problematic.

Could we address this in move_folios_to_lru()? Perhaps we could find a way
to detect folios whose writeback has completed and move them to the tail
instead of always placing them at the head.

>
> Best regards,
> Ridong
>
>
> >> Best regards,
> >> Ridong
> >>

Thanks
Barry