From: Kefeng Wang <wangkefeng.wang@huawei.com>
To: Chen Ridong <chenridong@huaweicloud.com>, <akpm@linux-foundation.org>
Cc: <linux-mm@kvack.org>, <linux-kernel@vger.kernel.org>,
<chenridong@huawei.com>, <wangweiyang2@huawei.com>,
Michal Hocko <mhocko@suse.com>,
Johannes Weiner <hannes@cmpxchg.org>,
Yosry Ahmed <yosryahmed@google.com>, Yu Zhao <yuzhao@google.com>,
David Hildenbrand <david@redhat.com>,
Matthew Wilcox <willy@infradead.org>,
Ryan Roberts <ryan.roberts@arm.com>,
Barry Song <baohua@kernel.org>
Subject: Re: [PATCH v3] mm/vmscan: stop the loop if enough pages have been page_out
Date: Thu, 10 Oct 2024 16:59:10 +0800 [thread overview]
Message-ID: <c3f2c5e2-4804-46e8-86ff-1f6a79ea9a7c@huawei.com> (raw)
In-Reply-To: <20241010081802.290893-1-chenridong@huaweicloud.com>
Hi Ridong,
This seems to be the first version posted upstream, and the issue only
occurs when a large folio is split.
Adding more Ccs to see if there is more feedback.
On 2024/10/10 16:18, Chen Ridong wrote:
> From: Chen Ridong <chenridong@huawei.com>
>
> An issue was found with the following testing steps:
> 1. Compile with CONFIG_TRANSPARENT_HUGEPAGE=y.
> 2. Mount memcg v1, create a memcg named test_memcg, and set
> limit_in_bytes=2.1G and memsw.limit_in_bytes=3G.
> 3. Create a 1G swap file and allocate 2.2G of anon memory in test_memcg.
>
> It was found that:
>
> cat memory.usage_in_bytes
> 2144940032
> cat memory.memsw.usage_in_bytes
> 2255056896
>
> free -h
> total used free
> Mem: 31Gi 2.1Gi 27Gi
> Swap: 1.0Gi 618Mi 405Mi
>
> As shown above, test_memcg charged only about 100M of swap, but 600M+
> of swap space was in use, which means about 500M may be wasted because
> other memcgs cannot use that swap space.
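
Interleaving a quick check of the numbers above (618 MiB from free, the
usage values from the memcg files; the "stranded" figure is just the
difference, not a measured value):

```
memsw.usage_in_bytes - memory.usage_in_bytes
    = 2255056896 - 2144940032
    = 110116864 bytes ~= 105 MiB of swap charged to test_memcg
swap in use system-wide       ~= 618 MiB
swap used but not charged     ~= 618 - 105 ~= 513 MiB
```

which matches the ~500M described as wasted.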
>
> It can be explained as follows:
> 1. When entering shrink_inactive_list, folios are isolated from the lru
> from tail to head. Suppose, for simplicity, that only folioN is taken
> from the lru:
>
> inactive lru: folio1<->folio2<->folio3...<->folioN-1
> isolated list: folioN
>
> 2. In shrink_page_list, if folioN is a THP, it may be split and added
> to the swap cache folio by folio. After a folio is added to the swap
> cache, writeback io to swap is submitted, which is asynchronous.
> When shrink_page_list finishes, the isolated folio list is moved back
> to the head of the inactive lru. The inactive lru may then look like
> this, with 512 folios moved to its head:
>
> folioN512<->folioN511<->...folioN1<->folio1<->folio2...<->folioN-1
>
> 3. When a folio's writeback io completes, the folio may be rotated to
> the tail of the lru. The following lru list is expected, with the
> folios that were added to the swap cache rotated to the tail so that
> they can be reclaimed as soon as possible:
>
> folio1<->folio2<->...<->folioN-1<->folioN1<->...folioN511<->folioN512
>
> 4. However, shrink_page_list and folio writeback are asynchronous. If a
> THP is split, shrink_page_list loops at least 512 times, which means
> that writeback for some folios may complete before shrink_page_list
> does, and those folios then fail to rotate to the tail of the lru.
> The lru may look like this:
>
> folioN50<->folioN49<->...folioN1<->folio1<->folio2...<->folioN-1<->
> folioN51<->folioN52<->...folioN511<->folioN512
>
> Although folios N1-N50 have finished writeback, they are still at the
> head of the lru. Since isolation scans the lru from tail to head, it
> is hard to reach those folios again soon.
>
> As a result, a large number of folios may sit in the swap cache without
> being reclaimed in time, which reduces reclaim efficiency and prevents
> other memcgs from using this swap space even if they trigger OOM.
>
> To fix this issue, stop the loop if a THP has been split and
> nr_pageout is greater than nr_to_reclaim.
>
> Signed-off-by: Chen Ridong <chenridong@huawei.com>
> ---
> mm/vmscan.c | 16 +++++++++++++++-
> 1 file changed, 15 insertions(+), 1 deletion(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 749cdc110c74..fd8ad251eda2 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1047,7 +1047,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> LIST_HEAD(demote_folios);
> unsigned int nr_reclaimed = 0;
> unsigned int pgactivate = 0;
> - bool do_demote_pass;
> + bool do_demote_pass, splited = false;
> struct swap_iocb *plug = NULL;
>
> folio_batch_init(&free_folios);
> @@ -1065,6 +1065,16 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>
> cond_resched();
>
> + /*
> + * If a large folio has been split, many folios are added
> + * to folio_list. Looping through the entire list takes
> + * too much time, which may prevent folios that have completed
> + * writeback from rotating to the tail of the lru. Just
> + * stop looping if nr_pageout is greater than nr_to_reclaim.
> + */
> + if (unlikely(splited && stat->nr_pageout > sc->nr_to_reclaim))
> + break;
> +
> folio = lru_to_folio(folio_list);
> list_del(&folio->lru);
>
> @@ -1273,6 +1283,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> if ((nr_pages > 1) && !folio_test_large(folio)) {
> sc->nr_scanned -= (nr_pages - 1);
> nr_pages = 1;
> + splited = true;
> }
>
> /*
> @@ -1375,12 +1386,14 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> if (nr_pages > 1 && !folio_test_large(folio)) {
> sc->nr_scanned -= (nr_pages - 1);
> nr_pages = 1;
> + splited = true;
> }
> goto activate_locked;
> case PAGE_SUCCESS:
> if (nr_pages > 1 && !folio_test_large(folio)) {
> sc->nr_scanned -= (nr_pages - 1);
> nr_pages = 1;
> + splited = true;
> }
> stat->nr_pageout += nr_pages;
>
> @@ -1491,6 +1504,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> if (nr_pages > 1) {
> sc->nr_scanned -= (nr_pages - 1);
> nr_pages = 1;
> + splited = true;
> }
> activate_locked:
> /* Not a candidate for swapping, so reclaim swap space. */