From: Barry Song
Date: Mon, 21 Oct 2024 17:44:03 +1300
Subject: Re: [PATCH v3] mm/vmscan: stop the loop if enough pages have been page_out
To: chenridong
Cc: Kefeng Wang, Chen Ridong, akpm@linux-foundation.org, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org, wangweiyang2@huawei.com, Michal Hocko,
 Johannes Weiner, Yosry Ahmed, Yu Zhao, David Hildenbrand, Matthew Wilcox,
 Ryan Roberts

On Fri, Oct 11, 2024 at 7:49 PM chenridong wrote:
>
>
>
> On 2024/10/11 0:17, Barry Song wrote:
> > On Thu, Oct 10, 2024 at 4:59 PM Kefeng Wang wrote:
> >>
> >> Hi Ridong,
> >>
> >> This should be the first version for upstream, and the issue only
> >> occurs when a large folio is split.
> >>
> >> Adding more CCs to see if there's more feedback.
> >>
> >>
> >> On 2024/10/10 16:18, Chen Ridong wrote:
> >>> From: Chen Ridong
> >>>
> >>> An issue was found with the following testing steps:
> >>> 1. Compile with CONFIG_TRANSPARENT_HUGEPAGE=y.
> >>> 2. Mount memcg v1, create a memcg named test_memcg, and set
> >>>    usage_in_bytes=2.1G, memsw.usage_in_bytes=3G.
> >>> 3. Create a 1G swap file, and allocate 2.2G of anon memory in test_memcg.
> >>>
> >>> It was found that:
> >>>
> >>>   cat memory.usage_in_bytes
> >>>   2144940032
> >>>   cat memory.memsw.usage_in_bytes
> >>>   2255056896
> >>>
> >>>   free -h
> >>>          total    used    free
> >>>   Mem:    31Gi   2.1Gi    27Gi
> >>>   Swap:  1.0Gi   618Mi   405Mi
> >>>
> >>> As shown above, test_memcg used about 100M of swap, but 600M+ of swap
> >>> was in use, which means about 500M may be wasted because other memcgs
> >>> cannot use this swap memory.
> >>>
> >>> It can be explained as follows:
> >>> 1. When entering shrink_inactive_list, it isolates folios from the lru,
> >>>    from tail to head. Suppose it just takes folioN from the lru (to keep
> >>>    it simple).
> >>>
> >>>    inactive lru: folio1<->folio2<->folio3...<->folioN-1
> >>>    isolated list: folioN
> >>>
> >>> 2. In shrink_page_list, if folioN is a THP, it may be split and added to
> >>>    the swap cache folio by folio. After a folio is added to the swap
> >>>    cache, io is submitted to write it back to swap, which is asynchronous.
> >>>    When shrink_page_list finishes, the isolated folio list is moved back
> >>>    to the head of the inactive lru. The inactive lru may then look like
> >>>    this, with 512 folios having been moved to the head of the inactive lru:
> >>>
> >>>    folioN512<->folioN511<->...folioN1<->folio1<->folio2...<->folioN-1
> >>>
> >>> 3. When a folio's writeback io completes, the folio may be rotated to the
> >>>    tail of the lru. The following lru list is expected, with the folios
> >>>    that were added to the swap cache rotated to the tail of the lru, so
> >>>    they can be reclaimed as soon as possible:
> >>>
> >>>    folio1<->folio2<->...<->folioN-1<->folioN1<->...folioN511<->folioN512
> >>>
> >>> 4. However, shrink_page_list and folio writeback are asynchronous. If a
> >>>    THP is split, shrink_page_list loops at least 512 times, which means
> >>>    that shrink_page_list has not finished while some folio writebacks
> >>>    have already completed, and this may lead to failure to rotate those
> >>>    folios to the tail of the lru. The lru may look as below:
> >
> > I assume you're referring to PMD-mapped THP, but your code also modifies
> > mTHP, which might not be that large. For instance, it could be a 16KB mTHP.
> >
> >>>
> >>>    folioN50<->folioN49<->...folioN1<->folio1<->folio2...<->folioN-1<->
> >>>    folioN51<->folioN52<->...folioN511<->folioN512
> >>>
> >>>    Although those folios (N1-N50) have finished writing back, they are
> >>>    still at the head of the lru. When isolating folios from the lru, it
> >>>    scans from tail to head, so it is difficult to scan those folios again.
> >>>
> >>> What is described above may lead to a large number of folios being added
> >>> to the swap cache but not reclaimed in time, which reduces reclaim
> >>> efficiency and prevents other memcgs from using this swap memory even if
> >>> they trigger OOM.
> >>>
> >>> To fix this issue, it's better to stop the loop if a THP has been split
> >>> and nr_pageout is greater than nr_to_reclaim.
> >>>
> >>> Signed-off-by: Chen Ridong
> >>> ---
> >>>  mm/vmscan.c | 16 +++++++++++++++-
> >>>  1 file changed, 15 insertions(+), 1 deletion(-)
> >>>
> >>> diff --git a/mm/vmscan.c b/mm/vmscan.c
> >>> index 749cdc110c74..fd8ad251eda2 100644
> >>> --- a/mm/vmscan.c
> >>> +++ b/mm/vmscan.c
> >>> @@ -1047,7 +1047,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >>>       LIST_HEAD(demote_folios);
> >>>       unsigned int nr_reclaimed = 0;
> >>>       unsigned int pgactivate = 0;
> >>> -     bool do_demote_pass;
> >>> +     bool do_demote_pass, splited = false;
> >>>       struct swap_iocb *plug = NULL;
> >>>
> >>>       folio_batch_init(&free_folios);
> >>> @@ -1065,6 +1065,16 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >>>
> >>>               cond_resched();
> >>>
> >>> +             /*
> >>> +              * If a large folio has been split, many folios are added
> >>> +              * to folio_list. Looping through the entire list takes
> >>> +              * too much time, which may prevent folios that have
> >>> +              * completed writeback from rotating to the tail of the
> >>> +              * lru. Just stop looping if nr_pageout is greater than
> >>> +              * nr_to_reclaim.
> >>> +              */
> >>> +             if (unlikely(splited && stat->nr_pageout > sc->nr_to_reclaim))
> >>> +                     break;
> >
> > I'm not entirely sure about the theory behind comparing stat->nr_pageout
> > with sc->nr_to_reclaim. However, the condition might still hold true even
> > if you've split a relatively small "large folio", such as 16kB?
> >
>
> Why compare stat->nr_pageout with sc->nr_to_reclaim? Because if all the
> pages that have been paged out can be reclaimed, then enough pages can be
> reclaimed once they have all finished writeback. Thus, it may not be
> necessary to page out more.
>
> If a small large folio (16 kB) has been split, it may return early without
> all the pages in the folio_list being paged out, but I think that is fine.
> It can page out more pages the next time it enters shrink_folio_list if
> there are not enough pages to reclaim.
>
> However, if pages that have been paged out are still at the head of the
> LRU, it is difficult to scan these pages again. In this case, not only
> might it "waste" some swap memory, but it also has to page out more pages.
>
> Considering the above, I sent this patch. It may not be a perfect
> solution, but I think it's a good option to consider. And I am wondering
> if anyone has a better solution.

Hi Ridong,

My overall understanding is that you have failed to describe your problem
clearly; in particular, I don't understand what your points 3 and 4 mean:

> 3. When a folio's writeback io completes, the folio may be rotated to the
>    tail of the lru. The following lru list is expected, with the folios
>    that were added to the swap cache rotated to the tail of the lru, so
>    they can be reclaimed as soon as possible:
>
>    folio1<->folio2<->...<->folioN-1<->folioN1<->...folioN511<->folioN512

> 4. However, shrink_page_list and folio writeback are asynchronous. If a
>    THP is split, shrink_page_list loops at least 512 times, which means
>    that shrink_page_list has not finished while some folio writebacks
>    have already completed, and this may lead to failure to rotate those
>    folios to the tail of the lru. The lru may look as below:

Can you please describe it in a more readable way? I feel your diagram below
is somehow wrong:

folio1<->folio2<->...<->folioN-1<->folioN1<->...folioN511<->folioN512

You mentioned "rotate"; how could "rotate" make

folioN512<->folioN511<->...folioN1

in (2) become

folioN1<->...folioN511<->folioN512

in (3)?
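To make the rotation constraint concrete, here is a rough, illustrative
sketch of what happens when writeback completes. It is loosely modelled on
the folio_end_writeback() -> folio_rotate_reclaimable() path; the function
itself and the tail-move helper at the end are hypothetical simplifications,
and the real mm/filemap.c / mm/swap.c code differs in detail:

```c
/*
 * Illustrative sketch only -- not the actual kernel code.  It compresses
 * the writeback-completion path into one hypothetical helper to show why
 * an isolated folio cannot be rotated to the LRU tail.
 */
static void writeback_done_rotate_sketch(struct folio *folio)
{
	/* shrink_folio_list() set PG_reclaim before calling pageout() */
	if (!folio_test_reclaim(folio))
		return;

	folio_clear_reclaim(folio);

	/*
	 * The folio is only moved to the LRU tail if it is actually on an
	 * LRU list.  A folio still sitting on the isolated folio_list
	 * inside shrink_folio_list() has the LRU flag cleared, so it is
	 * skipped here; move_folios_to_lru() later puts it back at the
	 * head of the inactive LRU, and since PG_reclaim has already been
	 * cleared, nothing rotates it afterwards.
	 */
	if (folio_test_locked(folio) || folio_test_dirty(folio) ||
	    folio_test_unevictable(folio) || !folio_test_lru(folio))
		return;

	lru_move_folio_to_tail(folio);	/* hypothetical tail-move helper */
}
```

So a folio whose swap write finishes while shrink_folio_list() is still
walking a split-inflated folio_list misses its chance to be rotated, which
is the window the early break above is trying to shrink.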
BTW, writeback isn't always async. It can be sync for zram and sync-io swap
devices. In that case, your patch might change the LRU order; I mean, for
example, when an mTHP becomes cold, we always reclaim all of it, rather than
reclaiming part of it and putting some of its small folios back at the head
of the lru.

>
> Best regards,
> Ridong
>
> >>> +
> >>>               folio = lru_to_folio(folio_list);
> >>>               list_del(&folio->lru);
> >>>
> >>> @@ -1273,6 +1283,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >>>               if ((nr_pages > 1) && !folio_test_large(folio)) {
> >>>                       sc->nr_scanned -= (nr_pages - 1);
> >>>                       nr_pages = 1;
> >>> +                     splited = true;
> >>>               }
> >>>
> >>>       /*
> >>> @@ -1375,12 +1386,14 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >>>                       if (nr_pages > 1 && !folio_test_large(folio)) {
> >>>                               sc->nr_scanned -= (nr_pages - 1);
> >>>                               nr_pages = 1;
> >>> +                             splited = true;
> >>>                       }
> >>>                       goto activate_locked;
> >>>               case PAGE_SUCCESS:
> >>>                       if (nr_pages > 1 && !folio_test_large(folio)) {
> >>>                               sc->nr_scanned -= (nr_pages - 1);
> >>>                               nr_pages = 1;
> >>> +                             splited = true;
> >>>                       }
> >>>                       stat->nr_pageout += nr_pages;
> >>>
> >>> @@ -1491,6 +1504,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >>>               if (nr_pages > 1) {
> >>>                       sc->nr_scanned -= (nr_pages - 1);
> >>>                       nr_pages = 1;
> >>> +                     splited = true;
> >>>               }
> >>> activate_locked:
> >>>               /* Not a candidate for swapping, so reclaim swap space. */
> >>
> Thanks

barry