linux-mm.kvack.org archive mirror
From: Chen Ridong <chenridong@huaweicloud.com>
To: akpm@linux-foundation.org
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	chenridong@huawei.com, wangweiyang2@huawei.com
Subject: [PATCH v3] mm/vmscan: stop the loop if enough pages have been page_out
Date: Thu, 10 Oct 2024 08:18:02 +0000
Message-ID: <20241010081802.290893-1-chenridong@huaweicloud.com>

From: Chen Ridong <chenridong@huawei.com>

An issue was found with the following testing steps:
1. Compile with CONFIG_TRANSPARENT_HUGEPAGE=y.
2. Mount memcg v1, create a memcg named test_memcg, and set its
   memory.limit_in_bytes to 2.1G and memory.memsw.limit_in_bytes to 3G.
3. Create a 1G swap file and allocate 2.2G of anonymous memory in
   test_memcg.

It was found that:

cat memory.usage_in_bytes
2144940032
cat memory.memsw.usage_in_bytes
2255056896

free -h
              total        used        free
Mem:           31Gi       2.1Gi        27Gi
Swap:         1.0Gi       618Mi       405Mi

As shown above, only about 105M of swap (memsw.usage_in_bytes minus
usage_in_bytes) is charged to test_memcg, yet the system reports 618M of
swap in use. Roughly 500M of swap may therefore be wasted, because other
memcgs cannot use that swap space.
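
The gap can be double-checked with a small userspace helper. The program
below is only a sketch: it assumes the memcg v1 hierarchy is mounted at
/sys/fs/cgroup/memory and that the cgroup is named test_memcg, as in the
setup above; adjust the paths for other systems. It prints the swap
charged to the memcg (memsw.usage_in_bytes - usage_in_bytes) next to the
system-wide swap usage taken from /proc/meminfo.

/* swapgap.c - compare the swap charged to a memcg v1 cgroup with the
 * swap the whole system reports as used.  A sketch: the mount point
 * /sys/fs/cgroup/memory and the cgroup name test_memcg are assumptions
 * taken from the setup above.
 * Build and run: cc -o swapgap swapgap.c && ./swapgap
 */
#include <stdio.h>
#include <string.h>

static long long read_ll(const char *path)
{
	long long val = -1;
	FILE *f = fopen(path, "r");

	if (f) {
		if (fscanf(f, "%lld", &val) != 1)
			val = -1;
		fclose(f);
	}
	return val;
}

static long long meminfo_kb(const char *key)
{
	char line[256];
	long long kb = -1;
	FILE *f = fopen("/proc/meminfo", "r");

	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f)) {
		if (!strncmp(line, key, strlen(key))) {
			sscanf(line + strlen(key), " %lld", &kb);
			break;
		}
	}
	fclose(f);
	return kb;
}

int main(void)
{
	const char *base = "/sys/fs/cgroup/memory/test_memcg";
	char path[128];
	long long mem, memsw;

	snprintf(path, sizeof(path), "%s/memory.usage_in_bytes", base);
	mem = read_ll(path);
	snprintf(path, sizeof(path), "%s/memory.memsw.usage_in_bytes", base);
	memsw = read_ll(path);
	if (mem < 0 || memsw < 0) {
		fprintf(stderr, "cannot read memcg v1 usage files\n");
		return 1;
	}

	printf("swap charged to memcg : %lld MiB\n", (memsw - mem) >> 20);
	printf("swap used system-wide : %lld MiB\n",
	       (meminfo_kb("SwapTotal:") - meminfo_kb("SwapFree:")) >> 10);
	return 0;
}

With the numbers above, this prints roughly 105 MiB charged to test_memcg
versus about 618 MiB of swap used system-wide.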

It can be explained as follows:
1. When entering shrink_inactive_list, folios are isolated from the lru
   from tail to head. For simplicity, suppose only folioN is taken from
   the lru:

   inactive lru: folio1<->folio2<->folio3...<->folioN-1
   isolated list: folioN

2. In shrink_folio_list, if folioN is a THP, it may be split and added
   to the swap cache folio by folio. After a folio is added to the swap
   cache, writeback io to swap is submitted, which is asynchronous.
   When shrink_folio_list finishes, the isolated folio list is moved
   back to the head of the inactive lru. The inactive lru may then look
   like this, with 512 folios moved to the head of the inactive lru:

   folioN512<->folioN511<->...folioN1<->folio1<->folio2...<->folioN-1

3. When a folio's writeback io completes, the folio may be rotated to the
   tail of the lru. The expected lru list is shown below, with the folios
   that were added to the swap cache rotated to the tail of the lru so
   that they can be reclaimed as soon as possible:

   folio1<->folio2<->...<->folioN-1<->folioN1<->...folioN511<->folioN512

4. However, shrink_folio_list and folio writeback are asynchronous. If the
   THP is split, shrink_folio_list loops at least 512 times, which means
   that some folios can finish writeback before shrink_folio_list
   completes. Because those folios are still isolated (not yet back on
   the lru) at that point, rotating them to the tail of the lru fails.
   The lru may then look like this:

   folioN50<->folioN49<->...folioN1<->folio1<->folio2...<->folioN-1<->
   folioN51<->folioN52<->...folioN511<->folioN512

   Although folios N1-N50 have finished writing back, they are still at
   the head of the lru. Since isolation scans the lru from tail to head,
   it is difficult to scan those folios again soon.

As described above, a large number of folios may end up in the swap cache
without being reclaimed in time, which reduces reclaim efficiency and
prevents other memcgs from using this swap space even if they trigger
OOM. A toy simulation of this ordering is sketched below.
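
The ordering is easier to see with a toy model. The standalone program
below only simulates the list manipulation described in steps 1-4; the
folio struct, the list helpers and the constants are invented for the
illustration (8 subpages and 3 early writeback completions stand in for
the real 512 and 50) and do not correspond to any kernel API.

/* lru_order.c - toy simulation of the ordering described in steps 1-4.
 * Nothing here is kernel code; everything is invented for illustration.
 * Build and run: cc -o lru_order lru_order.c && ./lru_order
 */
#include <stdio.h>
#include <stdlib.h>

#define NR_LRU       4	/* folio1..folio4 initially on the inactive lru   */
#define SUBPAGES     8	/* subpages created by splitting the tail THP     */
#define DONE_IN_LOOP 3	/* writebacks that finish during the reclaim loop */

struct folio {
	int id;            /* > 0: original folio, < 0: -n is THP subpage n */
	int wb_done_early; /* writeback finished while still isolated       */
	struct folio *prev, *next;
};

/* circular list with a dummy node: lru.next is the head, lru.prev the tail */
static struct folio lru = { .prev = &lru, .next = &lru };

static void list_del(struct folio *f)
{
	f->prev->next = f->next;
	f->next->prev = f->prev;
}

static void list_add_head(struct folio *f)
{
	f->prev = &lru;
	f->next = lru.next;
	lru.next->prev = f;
	lru.next = f;
}

static void list_add_tail(struct folio *f)
{
	f->next = &lru;
	f->prev = lru.prev;
	lru.prev->next = f;
	lru.prev = f;
}

int main(void)
{
	struct folio *isolated[SUBPAGES], *f;
	int i;

	/* 1. populate the inactive lru; the tail folio is the THP */
	for (i = 1; i <= NR_LRU; i++) {
		f = calloc(1, sizeof(*f));
		f->id = i;
		list_add_tail(f);
	}

	/* 2. isolate the THP from the tail and split it; writeback for the
	 *    first DONE_IN_LOOP subpages completes while the loop is still
	 *    running, i.e. while those folios are not on the lru, so the
	 *    rotate-to-tail step is skipped for them */
	list_del(lru.prev);
	for (i = 0; i < SUBPAGES; i++) {
		isolated[i] = calloc(1, sizeof(*f));
		isolated[i]->id = -(i + 1);
		isolated[i]->wb_done_early = (i < DONE_IN_LOOP);
	}

	/* 3. the loop ends; putback moves every subpage to the lru head */
	for (i = 0; i < SUBPAGES; i++)
		list_add_head(isolated[i]);

	/* 4. writebacks that complete after putback find the folio on the
	 *    lru and rotate it to the tail; the early ones stay at the head */
	for (i = 0; i < SUBPAGES; i++) {
		if (!isolated[i]->wb_done_early) {
			list_del(isolated[i]);
			list_add_tail(isolated[i]);
		}
	}

	printf("inactive lru (head -> tail):");
	for (f = lru.next; f != &lru; f = f->next) {
		if (f->id > 0)
			printf(" folio%d", f->id);
		else
			printf(" folioN%d", -f->id);
	}
	printf("\n");
	return 0;
}

It prints an lru whose head still holds the subpages that finished
writeback during the loop (folioN3, folioN2, folioN1 in the toy model),
which is exactly the layout shown in step 4.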

To fix this issue, stop the loop once the THP has been split and
nr_pageout is greater than nr_to_reclaim.

Signed-off-by: Chen Ridong <chenridong@huawei.com>
---
 mm/vmscan.c | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 749cdc110c74..fd8ad251eda2 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1047,7 +1047,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 	LIST_HEAD(demote_folios);
 	unsigned int nr_reclaimed = 0;
 	unsigned int pgactivate = 0;
-	bool do_demote_pass;
+	bool do_demote_pass, splited = false;
 	struct swap_iocb *plug = NULL;
 
 	folio_batch_init(&free_folios);
@@ -1065,6 +1065,16 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 
 		cond_resched();
 
+		/*
+		 * If a large folio has been split, many folios are added
+		 * to folio_list. Looping through the entire list takes
+		 * too much time, which may prevent folios that have completed
+		 * writeback from rotating to the tail of the lru. Just
+		 * stop looping if nr_pageout is greater than nr_to_reclaim.
+		 */
+		if (unlikely(splited && stat->nr_pageout > sc->nr_to_reclaim))
+			break;
+
 		folio = lru_to_folio(folio_list);
 		list_del(&folio->lru);
 
@@ -1273,6 +1283,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 		if ((nr_pages > 1) && !folio_test_large(folio)) {
 			sc->nr_scanned -= (nr_pages - 1);
 			nr_pages = 1;
+			splited = true;
 		}
 
 		/*
@@ -1375,12 +1386,14 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 				if (nr_pages > 1 && !folio_test_large(folio)) {
 					sc->nr_scanned -= (nr_pages - 1);
 					nr_pages = 1;
+					splited = true;
 				}
 				goto activate_locked;
 			case PAGE_SUCCESS:
 				if (nr_pages > 1 && !folio_test_large(folio)) {
 					sc->nr_scanned -= (nr_pages - 1);
 					nr_pages = 1;
+					splited = true;
 				}
 				stat->nr_pageout += nr_pages;
 
@@ -1491,6 +1504,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 		if (nr_pages > 1) {
 			sc->nr_scanned -= (nr_pages - 1);
 			nr_pages = 1;
+			splited = true;
 		}
 activate_locked:
 		/* Not a candidate for swapping, so reclaim swap space. */
-- 
2.34.1




Thread overview: 14+ messages
2024-10-10  8:18 Chen Ridong [this message]
2024-10-10  8:59 ` Kefeng Wang
2024-10-10  9:28   ` chenridong
2024-10-10 16:17   ` Barry Song
2024-10-11  6:49     ` chenridong
2024-10-21  4:44       ` Barry Song
2024-10-21  8:14         ` Chen Ridong
2024-10-21  9:42           ` Barry Song
2024-10-21  9:56             ` chenridong
2024-10-21 10:09               ` Barry Song
2024-10-21 10:45                 ` Barry Song
2024-11-01  8:49                   ` Re: " 解 咏梅
2024-11-14 12:56                     ` chenridong
2024-10-21 12:15                 ` chenridong
