From: Chen Ridong <chenridong@huaweicloud.com>
To: Barry Song <baohua@kernel.org>, chenridong <chenridong@huawei.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>,
akpm@linux-foundation.org, linux-mm@kvack.org,
linux-kernel@vger.kernel.org, wangweiyang2@huawei.com,
Michal Hocko <mhocko@suse.com>,
Johannes Weiner <hannes@cmpxchg.org>,
Yosry Ahmed <yosryahmed@google.com>, Yu Zhao <yuzhao@google.com>,
David Hildenbrand <david@redhat.com>,
Matthew Wilcox <willy@infradead.org>,
Ryan Roberts <ryan.roberts@arm.com>
Subject: Re: [PATCH v3] mm/vmscan: stop the loop if enough pages have been page_out
Date: Mon, 21 Oct 2024 16:14:39 +0800 [thread overview]
Message-ID: <28b3eae5-92e7-471f-9883-d03684e06d1b@huaweicloud.com> (raw)
In-Reply-To: <CAGsJ_4x=nqKFMqDmfmvXVAhQNTo1Fx-aQ2MoSKSGQrSCccqr3Q@mail.gmail.com>
On 2024/10/21 12:44, Barry Song wrote:
> On Fri, Oct 11, 2024 at 7:49 PM chenridong <chenridong@huawei.com> wrote:
>>
>>
>>
>> On 2024/10/11 0:17, Barry Song wrote:
>>> On Thu, Oct 10, 2024 at 4:59 PM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>>>>
>>>> Hi Ridong,
>>>>
>>>> This should be the first version for upstream, and the issue only
>>>> occurred when large folio is spited.
>>>>
>>>> Adding more CCs to see if there's more feedback.
>>>>
>>>>
>>>> On 2024/10/10 16:18, Chen Ridong wrote:
>>>>> From: Chen Ridong <chenridong@huawei.com>
>>>>>
>>>>> An issue was found with the following testing step:
>>>>> 1. Compile with CONFIG_TRANSPARENT_HUGEPAGE=y
>>>>> 2. Mount memcg v1, and create memcg named test_memcg and set
>>>>> usage_in_bytes=2.1G, memsw.usage_in_bytes=3G.
>>>>> 3. Create a 1G swap file, and allocate 2.2G anon memory in test_memcg.
>>>>>
>>>>> It was found that:
>>>>>
>>>>> cat memory.usage_in_bytes
>>>>> 2144940032
>>>>> cat memory.memsw.usage_in_bytes
>>>>> 2255056896
>>>>>
>>>>> free -h
>>>>> total used free
>>>>> Mem: 31Gi 2.1Gi 27Gi
>>>>> Swap: 1.0Gi 618Mi 405Mi
>>>>>
>>>>> As shown above, the test_memcg used about 100M swap, but 600M+ swap memory
>>>>> was used, which means that 500M may be wasted because other memcgs can not
>>>>> use these swap memory.
>>>>>
>>>>> It can be explained as follows:
>>>>> 1. When entering shrink_inactive_list, it isolates folios from lru from
>>>>> tail to head. If it just takes folioN from lru(make it simple).
>>>>>
>>>>> inactive lru: folio1<->folio2<->folio3...<->folioN-1
>>>>> isolated list: folioN
>>>>>
>>>>> 2. In shrink_page_list function, if folioN is THP, it may be splited and
>>>>> added to swap cache folio by folio. After adding to swap cache, it will
>>>>> submit io to writeback folio to swap, which is asynchronous.
>>>>> When shrink_page_list is finished, the isolated folios list will be
>>>>> moved back to the head of inactive lru. The inactive lru may just look
>>>>> like this, with 512 filioes have been move to the head of inactive lru.
>>>>>
>>>>> folioN512<->folioN511<->...filioN1<->folio1<->folio2...<->folioN-1
>>>>>
>>>>> 3. When folio writeback io is completed, the folio may be rotated to tail
>>>>> of lru. The following lru list is expected, with those filioes that have
>>>>> been added to swap cache are rotated to tail of lru. So those folios
>>>>> can be reclaimed as soon as possible.
>>>>>
>>>>> folio1<->folio2<->...<->folioN-1<->filioN1<->...folioN511<->folioN512
>>>>>
>>>>> 4. However, shrink_page_list and folio writeback are asynchronous. If THP
>>>>> is splited, shrink_page_list loops at least 512 times, which means that
>>>>> shrink_page_list is not completed but some folios writeback have been
>>>>> completed, and this may lead to failure to rotate these folios to the
>>>>> tail of lru. The lru may look likes as below:
>>>
>>> I assume you’re referring to PMD-mapped THP, but your code also modifies
>>> mTHP, which might not be that large. For instance, it could be a 16KB mTHP.
>>>
>>>>>
>>>>> folioN50<->folioN49<->...filioN1<->folio1<->folio2...<->folioN-1<->
>>>>> folioN51<->folioN52<->...folioN511<->folioN512
>>>>>
>>>>> Although those folios (N1-N50) have been finished writing back, they
>>>>> are still at the head of lru. When isolating folios from lru, it scans
>>>>> from tail to head, so it is difficult to scan those folios again.
>>>>>
>>>>> What mentioned above may lead to a large number of folios have been added
>>>>> to swap cache but can not be reclaimed in time, which may reduce reclaim
>>>>> efficiency and prevent other memcgs from using this swap memory even if
>>>>> they trigger OOM.
>>>>>
>>>>> To fix this issue, it's better to stop looping if THP has been splited and
>>>>> nr_pageout is greater than nr_to_reclaim.
>>>>>
>>>>> Signed-off-by: Chen Ridong <chenridong@huawei.com>
>>>>> ---
>>>>> mm/vmscan.c | 16 +++++++++++++++-
>>>>> 1 file changed, 15 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>>> index 749cdc110c74..fd8ad251eda2 100644
>>>>> --- a/mm/vmscan.c
>>>>> +++ b/mm/vmscan.c
>>>>> @@ -1047,7 +1047,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>>>>> LIST_HEAD(demote_folios);
>>>>> unsigned int nr_reclaimed = 0;
>>>>> unsigned int pgactivate = 0;
>>>>> - bool do_demote_pass;
>>>>> + bool do_demote_pass, splited = false;
>>>>> struct swap_iocb *plug = NULL;
>>>>>
>>>>> folio_batch_init(&free_folios);
>>>>> @@ -1065,6 +1065,16 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>>>>>
>>>>> cond_resched();
>>>>>
>>>>> + /*
>>>>> + * If a large folio has been split, many folios are added
>>>>> + * to folio_list. Looping through the entire list takes
>>>>> + * too much time, which may prevent folios that have completed
>>>>> + * writeback from rotateing to the tail of the lru. Just
>>>>> + * stop looping if nr_pageout is greater than nr_to_reclaim.
>>>>> + */
>>>>> + if (unlikely(splited && stat->nr_pageout > sc->nr_to_reclaim))
>>>>> + break;
>>>
>>> I’m not entirely sure about the theory behind comparing stat->nr_pageout
>>> with sc->nr_to_reclaim. However, the condition might still hold true even
>>> if you’ve split a relatively small “large folio,” such as 16kB?
>>>
>>
>> Why compare stat->nr_pageout with sc->nr_to_reclaim? It's because if all
>> pages that have been pageout can be reclaimed, then enough pages can be
>> reclaimed when all pages have finished writeback. Thus, it may not have
>> to pageout more.
>>
>> If a small large folio(16 kB) has been split, it may return early
>> without the entire pages in the folio_list being pageout, but I think
>> that is fine. It can pageout more pages the next time it enters
>> shrink_folio_list if there are not enough pages to reclaimed.
>>
>> However, if pages that have been pageout are still at the head of the
>> LRU, it is difficult to scan these pages again. In this case, not only
>> might it "waste" some swap memory but it also has to pageout more pages.
>>
>> Considering the above, I sent this patch. It may not be a perfect
>> solution, but i think it's a good option to consider. And I am wondering
>> if anyone has a better solution.
>
> Hi Ridong,
> My overall understanding is that you have failed to describe your problem
> particularly I don't understand what your 3 and 4 mean:
>
>> 3. When folio writeback io is completed, the folio may be rotated to tail
>> of lru. The following lru list is expected, with those filioes that have
>> been added to swap cache are rotated to tail of lru. So those folios
>> can be reclaimed as soon as possible.
>>
>> folio1<->folio2<->...<->folioN-1<->filioN1<->...folioN511<->folioN512
>
> > 4. However, shrink_page_list and folio writeback are asynchronous. If THP
> > is splited, shrink_page_list loops at least 512 times, which means that
> > shrink_page_list is not completed but some folios writeback have been
> > completed, and this may lead to failure to rotate these folios to the
> > tail of lru. The lru may look likes as below:
>
> can you please describe it in a readable approach?
>
> i feel your below diagram is somehow wrong:
> folio1<->folio2<->...<->folioN-1<->filioN1<->...folioN511<->folioN512
>
> You mentioned "rotate', how could "rotate" makes:
> folioN512<->folioN511<->...filioN1 in (2)
> become
> filioN1<->...folioN511<->folioN512 in (3).
>
I am sorry for any confusion.
If THP is split, filioN1, filioN2, filioN3, ...filioN512 are committed
to writeback one by one. it assumed that filioN1,
filioN2,filioN3,...filioN512 are completed in order.
Orignal:
folioN512<->folioN511<->...filioN1<->folio1<->folio2...<->folioN-1
filioN1 is finished, filioN1 is rotated to the tail of LRU:
folioN512<->folioN511<->...filioN2<->folio1<->folio2...<->folioN-1<->folioN1
filioN2 is finished:
folioN512<->folioN511<->...filioN3<->folio1<->folio2...<->folioN-1<->folioN1<->folioN2
filioN3 is finished:
folioN512<->folioN511<->...filioN4<->folio1<->folio2...<->folioN-1<->folioN1<->folioN2<->filioN3
...
filioN512 is finished:
folio1<->folio2<->...<->folioN-1<->filioN1<->...folioN511<->folioN512
When the filios are finished, the LRU might just like this:
folio1<->folio2<->...<->folioN-1<->filioN1<->...folioN511<->folioN512
> btw, writeback isn't always async. it could be sync for zram and sync_io
> swap. in that case, your patch might change the order of LRU. i mean,
> for example, while a mTHP becomes cold, we always reclaim all of them,
> but not part of them and put back part of small folios to the head of lru.
>
Yes, This can be changed.
Although it may put back part of small folios to the head of lru, it can
return in time from shrink_folio_list without causing much additional I/O.
If you have understood this issue, do you have any suggestions to fix
it? My patch may not be a perfect way to fix this issue.
Best regards,
Ridong
next prev parent reply other threads:[~2024-10-21 8:14 UTC|newest]
Thread overview: 14+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-10-10 8:18 Chen Ridong
2024-10-10 8:59 ` Kefeng Wang
2024-10-10 9:28 ` chenridong
2024-10-10 16:17 ` Barry Song
2024-10-11 6:49 ` chenridong
2024-10-21 4:44 ` Barry Song
2024-10-21 8:14 ` Chen Ridong [this message]
2024-10-21 9:42 ` Barry Song
2024-10-21 9:56 ` chenridong
2024-10-21 10:09 ` Barry Song
2024-10-21 10:45 ` Barry Song
2024-11-01 8:49 ` 回复: " 解 咏梅
2024-11-14 12:56 ` chenridong
2024-10-21 12:15 ` chenridong
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=28b3eae5-92e7-471f-9883-d03684e06d1b@huaweicloud.com \
--to=chenridong@huaweicloud.com \
--cc=akpm@linux-foundation.org \
--cc=baohua@kernel.org \
--cc=chenridong@huawei.com \
--cc=david@redhat.com \
--cc=hannes@cmpxchg.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=mhocko@suse.com \
--cc=ryan.roberts@arm.com \
--cc=wangkefeng.wang@huawei.com \
--cc=wangweiyang2@huawei.com \
--cc=willy@infradead.org \
--cc=yosryahmed@google.com \
--cc=yuzhao@google.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox