From: Barry Song <21cnbao@gmail.com>
Date: Mon, 21 Oct 2024 22:42:13 +1300
Subject: Re: [PATCH v3] mm/vmscan: stop the loop if enough pages have been page_out
To: Chen Ridong
Cc: chenridong, Kefeng Wang, akpm@linux-foundation.org, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org, wangweiyang2@huawei.com, Michal Hocko,
 Johannes Weiner, Yosry Ahmed, Yu Zhao, David Hildenbrand, Matthew Wilcox,
 Ryan Roberts
In-Reply-To: <28b3eae5-92e7-471f-9883-d03684e06d1b@huaweicloud.com>
References: <20241010081802.290893-1-chenridong@huaweicloud.com>
 <62bd2564-76fa-4cb0-9c08-9eb2f96771b6@huawei.com>
 <28b3eae5-92e7-471f-9883-d03684e06d1b@huaweicloud.com>

On Mon, Oct 21, 2024 at 9:14 PM Chen Ridong wrote:
>
> On 2024/10/21 12:44, Barry Song wrote:
> > On Fri, Oct 11, 2024 at 7:49 PM chenridong wrote:
> >>
> >> On 2024/10/11 0:17, Barry Song wrote:
> >>> On Thu, Oct 10, 2024 at 4:59 PM Kefeng Wang wrote:
> >>>>
> >>>> Hi Ridong,
> >>>>
> >>>> This should be the first version for upstream, and the issue only
> >>>> occurs when a large folio is split.
> >>>>
> >>>> Adding more CCs to see if there's more feedback.
> >>>>
> >>>> On 2024/10/10 16:18, Chen Ridong wrote:
> >>>>> From: Chen Ridong
> >>>>>
> >>>>> An issue was found with the following testing steps:
> >>>>> 1. Compile with CONFIG_TRANSPARENT_HUGEPAGE=y
> >>>>> 2. Mount memcg v1, create a memcg named test_memcg and set
> >>>>>    usage_in_bytes=2.1G, memsw.usage_in_bytes=3G.
> >>>>> 3. Create a 1G swap file, and allocate 2.2G anon memory in test_memcg.
> >>>>>
> >>>>> It was found that:
> >>>>>
> >>>>> cat memory.usage_in_bytes
> >>>>> 2144940032
> >>>>> cat memory.memsw.usage_in_bytes
> >>>>> 2255056896
> >>>>>
> >>>>> free -h
> >>>>>        total   used   free
> >>>>> Mem:    31Gi   2.1Gi   27Gi
> >>>>> Swap:  1.0Gi  618Mi  405Mi
> >>>>>
> >>>>> As shown above, test_memcg used only about 100M of swap, but 600M+ of
> >>>>> swap was consumed, which means about 500M may be wasted because other
> >>>>> memcgs cannot use that swap memory.
> >>>>>
> >>>>> It can be explained as follows:
> >>>>> 1. When entering shrink_inactive_list, it isolates folios from the lru
> >>>>>    from tail to head. Assume, for simplicity, that it takes only folioN
> >>>>>    from the lru.
> >>>>>
> >>>>>    inactive lru: folio1<->folio2<->folio3...<->folioN-1
> >>>>>    isolated list: folioN
> >>>>>
> >>>>> 2. In shrink_page_list, if folioN is a THP, it may be split and added
> >>>>>    to the swap cache folio by folio. After being added to the swap
> >>>>>    cache, io is submitted to write each folio back to swap, which is
> >>>>>    asynchronous. When shrink_page_list finishes, the isolated folio
> >>>>>    list is moved back to the head of the inactive lru, which may then
> >>>>>    look like this, with 512 folios moved to the head of the inactive
> >>>>>    lru:
> >>>>>
> >>>>>    folioN512<->folioN511<->...folioN1<->folio1<->folio2...<->folioN-1
> >>>>>
> >>>>> 3. When a folio's writeback io completes, the folio may be rotated to
> >>>>>    the tail of the lru. The following lru list is expected, with the
> >>>>>    folios that were added to the swap cache rotated to the tail of the
> >>>>>    lru so that they can be reclaimed as soon as possible:
> >>>>>
> >>>>>    folio1<->folio2<->...<->folioN-1<->folioN1<->...folioN511<->folioN512
> >>>>>
> >>>>> 4. However, shrink_page_list and folio writeback are asynchronous.
> >>>>>    If a THP is split, shrink_page_list loops at least 512 times, which
> >>>>>    means that shrink_page_list has not completed while some folios have
> >>>>>    already finished writeback, and this may lead to a failure to rotate
> >>>>>    those folios to the tail of the lru. The lru may look like this:
> >>>
> >>> I assume you're referring to PMD-mapped THP, but your code also modifies
> >>> mTHP, which might not be that large. For instance, it could be a 16KB mTHP.
> >>>
> >>>>>    folioN50<->folioN49<->...folioN1<->folio1<->folio2...<->folioN-1<->
> >>>>>    folioN51<->folioN52<->...folioN511<->folioN512
> >>>>>
> >>>>>    Although those folios (N1-N50) have finished writing back, they are
> >>>>>    still at the head of the lru. When isolating folios from the lru, the
> >>>>>    scan goes from tail to head, so it is difficult to scan those folios
> >>>>>    again.
> >>>>>
> >>>>> What is described above may lead to a large number of folios that have
> >>>>> been added to the swap cache but cannot be reclaimed in time, which may
> >>>>> reduce reclaim efficiency and prevent other memcgs from using this swap
> >>>>> memory even if they trigger OOM.
> >>>>>
> >>>>> To fix this issue, it's better to stop looping if a THP has been split
> >>>>> and nr_pageout is greater than nr_to_reclaim.
> >>>>>
> >>>>> Signed-off-by: Chen Ridong
> >>>>> ---
> >>>>>   mm/vmscan.c | 16 +++++++++++++++-
> >>>>>   1 file changed, 15 insertions(+), 1 deletion(-)
> >>>>>
> >>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
> >>>>> index 749cdc110c74..fd8ad251eda2 100644
> >>>>> --- a/mm/vmscan.c
> >>>>> +++ b/mm/vmscan.c
> >>>>> @@ -1047,7 +1047,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >>>>>        LIST_HEAD(demote_folios);
> >>>>>        unsigned int nr_reclaimed = 0;
> >>>>>        unsigned int pgactivate = 0;
> >>>>> -      bool do_demote_pass;
> >>>>> +      bool do_demote_pass, splited = false;
> >>>>>        struct swap_iocb *plug = NULL;
> >>>>>
> >>>>>        folio_batch_init(&free_folios);
> >>>>> @@ -1065,6 +1065,16 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >>>>>
> >>>>>                cond_resched();
> >>>>>
> >>>>> +              /*
> >>>>> +               * If a large folio has been split, many folios are added
> >>>>> +               * to folio_list. Looping through the entire list takes
> >>>>> +               * too much time, which may prevent folios that have completed
> >>>>> +               * writeback from rotateing to the tail of the lru. Just
> >>>>> +               * stop looping if nr_pageout is greater than nr_to_reclaim.
> >>>>> +               */
> >>>>> +              if (unlikely(splited && stat->nr_pageout > sc->nr_to_reclaim))
> >>>>> +                      break;
> >>>
> >>> I'm not entirely sure about the theory behind comparing stat->nr_pageout
> >>> with sc->nr_to_reclaim. However, the condition might still hold true even
> >>> if you've split a relatively small "large folio," such as 16kB?
> >>>
> >>
> >> Why compare stat->nr_pageout with sc->nr_to_reclaim? It's because if all
> >> pages that have been paged out can be reclaimed, then enough pages can be
> >> reclaimed once all of them have finished writeback. Thus, it may not have
> >> to page out more.
> >>
> >> If a small large folio (16 kB) has been split, it may return early
> >> without all the pages in the folio_list being paged out, but I think
> >> that is fine. It can page out more pages the next time it enters
> >> shrink_folio_list if not enough pages have been reclaimed.
> >>
> >> However, if pages that have been paged out are still at the head of the
> >> LRU, it is difficult to scan these pages again. In this case, not only
> >> might it "waste" some swap memory, but it also has to page out more pages.
> >>
> >> Considering the above, I sent this patch. It may not be a perfect
> >> solution, but I think it's a good option to consider. And I am wondering
> >> if anyone has a better solution.
> >
> > Hi Ridong,
> > My overall understanding is that you haven't fully described your problem;
> > in particular, I don't understand what your 3 and 4 mean:
> >
> >> 3. When a folio's writeback io completes, the folio may be rotated to
> >>    the tail of the lru. The following lru list is expected, with the
> >>    folios that were added to the swap cache rotated to the tail of the
> >>    lru so that they can be reclaimed as soon as possible:
> >>
> >>    folio1<->folio2<->...<->folioN-1<->folioN1<->...folioN511<->folioN512
> >
> > > 4. However, shrink_page_list and folio writeback are asynchronous. If a
> > >    THP is split, shrink_page_list loops at least 512 times, which means
> > >    that shrink_page_list has not completed while some folios have already
> > >    finished writeback, and this may lead to a failure to rotate those
> > >    folios to the tail of the lru. The lru may look like this:
> >
> > can you please describe it in a more readable way?
> >
> > i feel your diagram below is somehow wrong:
> > folio1<->folio2<->...<->folioN-1<->folioN1<->...folioN511<->folioN512
> >
> > You mentioned "rotate"; how could "rotate" make:
> > folioN512<->folioN511<->...folioN1 in (2)
> > become
> > folioN1<->...folioN511<->folioN512 in (3)?
> >
>
> I am sorry for any confusion.
>
> If the THP is split, folioN1, folioN2, folioN3, ... folioN512 are committed
> to writeback one by one. It is assumed that folioN1, folioN2, folioN3, ...
> folioN512 complete in order.
>
> Original:
> folioN512<->folioN511<->...folioN1<->folio1<->folio2...<->folioN-1
>
> folioN1 is finished and is rotated to the tail of the LRU:
> folioN512<->folioN511<->...folioN2<->folio1<->folio2...<->folioN-1<->folioN1
>
> folioN2 is finished:
> folioN512<->folioN511<->...folioN3<->folio1<->folio2...<->folioN-1<->folioN1<->folioN2
>
> folioN3 is finished:
> folioN512<->folioN511<->...folioN4<->folio1<->folio2...<->folioN-1<->folioN1<->folioN2<->folioN3
>
> ...
>
> folioN512 is finished:
> folio1<->folio2<->...<->folioN-1<->folioN1<->...folioN511<->folioN512
>
> When all the folios are finished, the LRU might look just like this:
> folio1<->folio2<->...<->folioN-1<->folioN1<->...folioN511<->folioN512

understood, thanks!

Let me try to understand the following part:

> 4:
> folioN50<->folioN49<->...folioN1<->folio1<->folio2...<->folioN-1<->
> folioN51<->folioN52<->...folioN511<->folioN512
> Although those folios (N1-N50) have finished writing back, they are
> still at the head of the lru. When isolating folios from the lru, the
> scan goes from tail to head, so it is difficult to scan those folios
> again.

What is the reason that "those folios (N1-N50) have finished writing
back, yet they remain at the head of the LRU"?

Is it because their writeback ended while we were still looping in
shrink_folio_list(), causing folio_end_writeback()'s
folio_rotate_reclaimable() to fail to move these folios, which are
still on the private "folio_list", to the tail of the LRU?
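
If so, my mental model is roughly the sketch below (simplified and written
from memory, not the exact upstream code; the early return for folios that
are no longer on an LRU is the part I'm assuming matters here):

/*
 * Rough sketch of folio_rotate_reclaimable(), simplified - not the exact
 * upstream implementation.
 *
 * Folios still sitting on shrink_folio_list()'s private, isolated
 * folio_list are not on any LRU list, so the rotation below is a no-op
 * for them; they only go back to the *head* of the inactive LRU once
 * shrink_folio_list() returns and the caller splices the list back.
 */
void folio_rotate_reclaimable(struct folio *folio)
{
	if (folio_test_locked(folio) || folio_test_dirty(folio) ||
	    folio_test_unevictable(folio))
		return;

	if (!folio_test_lru(folio))
		return;		/* isolated folios bail out here */

	/* ... otherwise queue the folio to be moved to the tail of its LRU ... */
}
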
> > btw, writeback isn't always async. it could be sync for zram and sync_io
> > swap. in that case, your patch might change the order of the LRU. i mean,
> > for example, when an mTHP becomes cold, we always reclaim all of it, rather
> > than reclaiming only part of it and putting the remaining small folios back
> > at the head of the lru.
> >
>
> Yes, the LRU order can change.
> Although it may put part of the small folios back at the head of the lru,
> it can return from shrink_folio_list in time without causing much
> additional I/O.
>
> If you have understood this issue, do you have any suggestions to fix
> it? My patch may not be a perfect way to fix this issue.
>

My point is that synchronous I/O, like zRAM, doesn't have this issue and
doesn't require this fix, as writeback is always completed without
asynchronous latency.

> Best regards,
> Ridong
>

Thanks
Barry