linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Barry Song <baohua@kernel.org>
To: chenridong <chenridong@huawei.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>,
	Chen Ridong <chenridong@huaweicloud.com>,
	 akpm@linux-foundation.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org,  wangweiyang2@huawei.com,
	Michal Hocko <mhocko@suse.com>,
	 Johannes Weiner <hannes@cmpxchg.org>,
	Yosry Ahmed <yosryahmed@google.com>, Yu Zhao <yuzhao@google.com>,
	 David Hildenbrand <david@redhat.com>,
	Matthew Wilcox <willy@infradead.org>,
	 Ryan Roberts <ryan.roberts@arm.com>
Subject: Re: [PATCH v3] mm/vmscan: stop the loop if enough pages have been page_out
Date: Mon, 21 Oct 2024 17:44:03 +1300	[thread overview]
Message-ID: <CAGsJ_4x=nqKFMqDmfmvXVAhQNTo1Fx-aQ2MoSKSGQrSCccqr3Q@mail.gmail.com> (raw)
In-Reply-To: <62bd2564-76fa-4cb0-9c08-9eb2f96771b6@huawei.com>

On Fri, Oct 11, 2024 at 7:49 PM chenridong <chenridong@huawei.com> wrote:
>
>
>
> On 2024/10/11 0:17, Barry Song wrote:
> > On Thu, Oct 10, 2024 at 4:59 PM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
> >>
> >> Hi Ridong,
> >>
> >> This should be the first version for upstream, and the issue only
> >> occurred when large folio is spited.
> >>
> >> Adding more CCs to see if there's more feedback.
> >>
> >>
> >> On 2024/10/10 16:18, Chen Ridong wrote:
> >>> From: Chen Ridong <chenridong@huawei.com>
> >>>
> >>> An issue was found with the following testing step:
> >>> 1. Compile with CONFIG_TRANSPARENT_HUGEPAGE=y
> >>> 2. Mount memcg v1, and create memcg named test_memcg and set
> >>>      usage_in_bytes=2.1G, memsw.usage_in_bytes=3G.
> >>> 3. Create a 1G swap file, and allocate 2.2G anon memory in test_memcg.
> >>>
> >>> It was found that:
> >>>
> >>> cat memory.usage_in_bytes
> >>> 2144940032
> >>> cat memory.memsw.usage_in_bytes
> >>> 2255056896
> >>>
> >>> free -h
> >>>                 total        used        free
> >>> Mem:           31Gi       2.1Gi        27Gi
> >>> Swap:         1.0Gi       618Mi       405Mi
> >>>
> >>> As shown above, the test_memcg used about 100M swap, but 600M+ swap memory
> >>> was used, which means that 500M may be wasted because other memcgs can not
> >>> use these swap memory.
> >>>
> >>> It can be explained as follows:
> >>> 1. When entering shrink_inactive_list, it isolates folios from lru from
> >>>      tail to head. If it just takes folioN from lru(make it simple).
> >>>
> >>>      inactive lru: folio1<->folio2<->folio3...<->folioN-1
> >>>      isolated list: folioN
> >>>
> >>> 2. In shrink_page_list function, if folioN is THP, it may be splited and
> >>>      added to swap cache folio by folio. After adding to swap cache, it will
> >>>      submit io to writeback folio to swap, which is asynchronous.
> >>>      When shrink_page_list is finished, the isolated folios list will be
> >>>      moved back to the head of inactive lru. The inactive lru may just look
> >>>      like this, with 512 filioes have been move to the head of inactive lru.
> >>>
> >>>      folioN512<->folioN511<->...filioN1<->folio1<->folio2...<->folioN-1
> >>>
> >>> 3. When folio writeback io is completed, the folio may be rotated to tail
> >>>      of lru. The following lru list is expected, with those filioes that have
> >>>      been added to swap cache are rotated to tail of lru. So those folios
> >>>      can be reclaimed as soon as possible.
> >>>
> >>>      folio1<->folio2<->...<->folioN-1<->filioN1<->...folioN511<->folioN512
> >>>
> >>> 4. However, shrink_page_list and folio writeback are asynchronous. If THP
> >>>      is splited, shrink_page_list loops at least 512 times, which means that
> >>>      shrink_page_list is not completed but some folios writeback have been
> >>>      completed, and this may lead to failure to rotate these folios to the
> >>>      tail of lru. The lru may look likes as below:
> >
> > I assume you’re referring to PMD-mapped THP, but your code also modifies
> > mTHP, which might not be that large. For instance, it could be a 16KB mTHP.
> >
> >>>
> >>>      folioN50<->folioN49<->...filioN1<->folio1<->folio2...<->folioN-1<->
> >>>      folioN51<->folioN52<->...folioN511<->folioN512
> >>>
> >>>      Although those folios (N1-N50) have been finished writing back, they
> >>>      are still at the head of lru. When isolating folios from lru, it scans
> >>>      from tail to head, so it is difficult to scan those folios again.
> >>>
> >>> What mentioned above may lead to a large number of folios have been added
> >>> to swap cache but can not be reclaimed in time, which may reduce reclaim
> >>> efficiency and prevent other memcgs from using this swap memory even if
> >>> they trigger OOM.
> >>>
> >>> To fix this issue, it's better to stop looping if THP has been splited and
> >>> nr_pageout is greater than nr_to_reclaim.
> >>>
> >>> Signed-off-by: Chen Ridong <chenridong@huawei.com>
> >>> ---
> >>>    mm/vmscan.c | 16 +++++++++++++++-
> >>>    1 file changed, 15 insertions(+), 1 deletion(-)
> >>>
> >>> diff --git a/mm/vmscan.c b/mm/vmscan.c
> >>> index 749cdc110c74..fd8ad251eda2 100644
> >>> --- a/mm/vmscan.c
> >>> +++ b/mm/vmscan.c
> >>> @@ -1047,7 +1047,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >>>        LIST_HEAD(demote_folios);
> >>>        unsigned int nr_reclaimed = 0;
> >>>        unsigned int pgactivate = 0;
> >>> -     bool do_demote_pass;
> >>> +     bool do_demote_pass, splited = false;
> >>>        struct swap_iocb *plug = NULL;
> >>>
> >>>        folio_batch_init(&free_folios);
> >>> @@ -1065,6 +1065,16 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >>>
> >>>                cond_resched();
> >>>
> >>> +             /*
> >>> +              * If a large folio has been split, many folios are added
> >>> +              * to folio_list. Looping through the entire list takes
> >>> +              * too much time, which may prevent folios that have completed
> >>> +              * writeback from rotateing to the tail of the lru. Just
> >>> +              * stop looping if nr_pageout is greater than nr_to_reclaim.
> >>> +              */
> >>> +             if (unlikely(splited && stat->nr_pageout > sc->nr_to_reclaim))
> >>> +                     break;
> >
> > I’m not entirely sure about the theory behind comparing stat->nr_pageout
> > with sc->nr_to_reclaim. However, the condition might still hold true even
> > if you’ve split a relatively small “large folio,” such as 16kB?
> >
>
> Why compare stat->nr_pageout with sc->nr_to_reclaim? It's because if all
> pages that have been pageout can be reclaimed, then enough pages can be
> reclaimed when all pages have finished writeback. Thus, it may not have
> to pageout more.
>
> If a small large folio(16 kB) has been split, it may return early
> without the entire pages in the folio_list being pageout, but I think
> that is fine. It can pageout more pages the next time it enters
> shrink_folio_list if there are not enough pages to reclaimed.
>
> However, if pages that have been pageout are still at the head of the
> LRU, it is difficult to scan these pages again. In this case, not only
> might it "waste" some swap memory but it also has to pageout more pages.
>
> Considering the above, I sent this patch. It may not be a perfect
> solution, but i think it's a good option to consider. And I am wondering
> if anyone has a better solution.

Hi Ridong,
My overall understanding is that you have failed to describe your problem
particularly I don't understand what your 3 and 4 mean:

> 3. When folio writeback io is completed, the folio may be rotated to tail
>    of lru. The following lru list is expected, with those filioes that have
>    been added to swap cache are rotated to tail of lru. So those folios
>  can be reclaimed as soon as possible.
>
>  folio1<->folio2<->...<->folioN-1<->filioN1<->...folioN511<->folioN512

 > 4. However, shrink_page_list and folio writeback are asynchronous. If THP
 >    is splited, shrink_page_list loops at least 512 times, which means that
 >    shrink_page_list is not completed but some folios writeback have been
 >    completed, and this may lead to failure to rotate these folios to the
  >  tail of lru. The lru may look likes as below:

can you please describe it in a readable approach?

i feel your below diagram is somehow wrong:
folio1<->folio2<->...<->folioN-1<->filioN1<->...folioN511<->folioN512

You mentioned "rotate', how could "rotate" makes:
folioN512<->folioN511<->...filioN1 in (2)
become
filioN1<->...folioN511<->folioN512 in (3).

btw, writeback isn't always async. it could be sync for zram and sync_io
swap. in that case, your patch might change the order of LRU. i mean,
for example, while a mTHP becomes cold, we always reclaim all of them,
but not part of them and put back part of small folios to the head of lru.

>
> Best regards,
> Ridong
>
> >>> +
> >>>                folio = lru_to_folio(folio_list);
> >>>                list_del(&folio->lru);
> >>>
> >>> @@ -1273,6 +1283,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >>>                if ((nr_pages > 1) && !folio_test_large(folio)) {
> >>>                        sc->nr_scanned -= (nr_pages - 1);
> >>>                        nr_pages = 1;
> >>> +                     splited = true;
> >>>                }
> >>>
> >>>                /*
> >>> @@ -1375,12 +1386,14 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >>>                                if (nr_pages > 1 && !folio_test_large(folio)) {
> >>>                                        sc->nr_scanned -= (nr_pages - 1);
> >>>                                        nr_pages = 1;
> >>> +                                     splited = true;
> >>>                                }
> >>>                                goto activate_locked;
> >>>                        case PAGE_SUCCESS:
> >>>                                if (nr_pages > 1 && !folio_test_large(folio)) {
> >>>                                        sc->nr_scanned -= (nr_pages - 1);
> >>>                                        nr_pages = 1;
> >>> +                                     splited = true;
> >>>                                }
> >>>                                stat->nr_pageout += nr_pages;
> >>>
> >>> @@ -1491,6 +1504,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >>>                if (nr_pages > 1) {
> >>>                        sc->nr_scanned -= (nr_pages - 1);
> >>>                        nr_pages = 1;
> >>> +                     splited = true;
> >>>                }
> >>>    activate_locked:
> >>>                /* Not a candidate for swapping, so reclaim swap space. */
> >>
>

Thanks
barry


  reply	other threads:[~2024-12-05 15:21 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-10-10  8:18 Chen Ridong
2024-10-10  8:59 ` Kefeng Wang
2024-10-10  9:28   ` chenridong
2024-10-10 16:17   ` Barry Song
2024-10-11  6:49     ` chenridong
2024-10-21  4:44       ` Barry Song [this message]
2024-10-21  8:14         ` Chen Ridong
2024-10-21  9:42           ` Barry Song
2024-10-21  9:56             ` chenridong
2024-10-21 10:09               ` Barry Song
2024-10-21 10:45                 ` Barry Song
2024-11-01  8:49                   ` 回复: " 解 咏梅
2024-11-14 12:56                     ` chenridong
2024-10-21 12:15                 ` chenridong

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAGsJ_4x=nqKFMqDmfmvXVAhQNTo1Fx-aQ2MoSKSGQrSCccqr3Q@mail.gmail.com' \
    --to=baohua@kernel.org \
    --cc=akpm@linux-foundation.org \
    --cc=chenridong@huawei.com \
    --cc=chenridong@huaweicloud.com \
    --cc=david@redhat.com \
    --cc=hannes@cmpxchg.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mhocko@suse.com \
    --cc=ryan.roberts@arm.com \
    --cc=wangkefeng.wang@huawei.com \
    --cc=wangweiyang2@huawei.com \
    --cc=willy@infradead.org \
    --cc=yosryahmed@google.com \
    --cc=yuzhao@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox