Re: [PATCH] mm/vmscan: fix hwpoisoned large folio handling in shrink_folio_list

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Jinjiang Tu <tujinjiang@huawei.com>
To: David Hildenbrand <david@redhat.com>, Zi Yan <ziy@nvidia.com>
Cc: <akpm@linux-foundation.org>, <linmiaohe@huawei.com>,
	<linux-mm@kvack.org>, <wangkefeng.wang@huawei.com>
Subject: Re: [PATCH] mm/vmscan: fix hwpoisoned large folio handling in shrink_folio_list
Date: Tue, 17 Jun 2025 14:43:20 +0800	[thread overview]
Message-ID: <b5d6ae46-0f2d-4c46-d563-45caee188d27@huawei.com> (raw)
In-Reply-To: <52d16469-8df6-4ee5-bc6f-97c5557f7aa1@redhat.com>


在 2025/6/17 3:27, David Hildenbrand 写道:
> On 16.06.25 13:33, Jinjiang Tu wrote:
>>
>> 在 2025/6/12 23:50, David Hildenbrand 写道:
>>> On 12.06.25 17:35, Zi Yan wrote:
>>>> On 12 Jun 2025, at 3:53, David Hildenbrand wrote:
>>>>
>>>>> On 11.06.25 19:52, Zi Yan wrote:
>>>>>> On 11 Jun 2025, at 13:34, David Hildenbrand wrote:
>>>>>>
>>>>>>>> So __folio_split() has an implicit rule that:
>>>>>>>> 1. if the given list is not NULL, the folio cannot be on LRU;
>>>>>>>> 2. if the given list is NULL, the folio is on LRU.
>>>>>>>>
>>>>>>>> And the rule is buried deeply in lru_add_split_folio().
>>>>>>>>
>>>>>>>> Should we add some checks in __folio_split()?
>>>>>>>>
>>>>>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>>>>>> index d3e66136e41a..8ce2734c9ca0 100644
>>>>>>>> --- a/mm/huge_memory.c
>>>>>>>> +++ b/mm/huge_memory.c
>>>>>>>> @@ -3732,6 +3732,11 @@ static int __folio_split(struct folio
>>>>>>>> *folio, unsigned int new_order,
>>>>>>>>          VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
>>>>>>>>          VM_BUG_ON_FOLIO(!folio_test_large(folio), folio);
>>>>>>>>
>>>>>>>> +    if (list && folio_test_lru(folio))
>>>>>>>> +        return -EINVAL;
>>>>>>>> +    if (!list && !folio_test_lru(folio))
>>>>>>>> +        return -EINVAL;
>>>>>>>> +
>>>>>>>
>>>>>>> I guess we currently don't run into that, because whenever a folio
>>>>>>> is otherwise isolated, there is an additional reference or a page
>>>>>>> table mapping, so it cannot get split either way (e.g., freezing
>>>>>>> the refcount fails).
>>>>>>>
>>>>>>> So maybe these checks would be too early and they should happen
>>>>>>> after we froze the refcount?
>>>>>>
>>>>>> But if the caller does the isolation, the additional refcount is OK
>>>>>> and
>>>>>> can_split_folio() will return true. In addition, __folio_split()
>>>>>> does not
>>>>>> change folio LRU state, so these two checks are orthogonal to 
>>>>>> refcount
>>>>>> check, right? The placement of them does not matter, but earlier
>>>>>> the better
>>>>>> to avoid unnecessary work. I see these are sanity checks for 
>>>>>> callers.
>>>>>
>>>>> In light of the discussion in this thread, if you have someone that
>>>>> takes the folio off the LRU concurrently, I think we could still run
>>>>> into a race here. Because that could happen just after we passed the
>>>>> test in __folio_split().
>>>>>
>>>>> That's why I think the test would have to happen when there are no
>>>>> such races possible anymore.
>>>>
>>>> Make sense. Thanks for the explanation.
>>>>
>>>>>
>>>>> But the real question is if it is okay to remove the folio from the
>>>>> LRU as done in the patch discussed here ...
>>>>
>>>> I just read through the email thread. IIUC, when
>>>> deferred_split_scan() split
>>>> a THP, it expects the THP is on LRU list. I think it makes sense since
>>>> all these THPs are in both the deferred_split_queue and LRU list.
>>>> And deferred_split_scan() uses split_folio() without providing a list
>>>> to store the after-split folios.
>>>>
>>>> In terms of the patch, since unmap_poisoned_folio() does not handle
>>>> large
>>>> folios, why not just split the large folios and add the after-split
>>>> folios
>>>> to folio_list?
>>>
>>> That's what I raised, but apparently it might not be worth it for that
>>> corner case (splitting might fail).
>>>
>>> Then, the while loop will go over all the after-split folios
>>>> one by one.
>>>>
>>>> BTW, unmap_poisoned_folio() is also used in do_migrate_range() from
>>>> memory_hotplug.c and there is no guard for large folios either. That
>>>> also needs a fix?
>>>
>>> Yes, that was mentioned, and I was hoping we could let
>>> unmap_poisoned_folio() check+fail in that case.
>>
>> Maybe we could fix do_migrate_range() like below:
>>
>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>> index 8305483de38b..5a6d869e6b56 100644
>> --- a/mm/memory_hotplug.c
>> +++ b/mm/memory_hotplug.c
>> @@ -1823,7 +1823,10 @@ static void do_migrate_range(unsigned long
>> start_pfn, unsigned long end_pfn)
>>                           pfn = folio_pfn(folio) + 
>> folio_nr_pages(folio) - 1;
>>
>>                   if (folio_contain_hwpoisoned_page(folio)) {
>> -                       if (WARN_ON(folio_test_lru(folio)))
>> +                       if (folio_test_large(folio))
>> +                               goto put_folio;
>> +
>> +                       if (folio_test_lru(folio))
> >                                   folio_isolate_lru(folio);
>
> Hm, what is supposed to happen if we fail folio_isolate_lru()?

unmap_poisoned_folio() seems not to be assumed with lru flag cleared.

But other calls of unmap_poisoned_folio() guarantee the folio is 
ioslated, maybe we should be consistent with them?

>
>>                           if (folio_mapped(folio)) {
>>                                   folio_lock(folio);
>>
>> The folio may be on lru, if folio_test_lru() check happens between
>> setting hwposion flag and isolating from lru in memory_failure().
>> So, I remove the WARN_ON.
>
> I guess this would work, although this special-casing on large folios 
> in the caller of unmap_poisoned_folio() is rather weird.
>
> What is supposed to happen if unmap_poisoned_folio() failed for small 
> folios? Why are we okay with having the LRU flag cleared and the folio 
> isolated?

In hwpoison_user_mappings(), if unmap_poisoned_folio() fails, the small 
folio is kept with isolated too.

IIUC, after the folio is kept with ioslated, other subsystems could not 
operate on this folio, to avoid introduing issues.

     prev parent reply	other threads:[~2025-06-17  6:43 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-06-11  7:46 Jinjiang Tu
2025-06-11  7:59 ` David Hildenbrand
2025-06-11  8:29   ` Jinjiang Tu
2025-06-11  8:35     ` David Hildenbrand
2025-06-11  9:00       ` Jinjiang Tu
2025-06-11  9:20         ` David Hildenbrand
2025-06-11  9:24           ` David Hildenbrand
2025-06-11 14:30             ` Zi Yan
2025-06-11 17:34               ` David Hildenbrand
2025-06-11 17:52                 ` Zi Yan
2025-06-12  7:53                   ` David Hildenbrand
2025-06-12 15:35                     ` Zi Yan
2025-06-12 15:50                       ` David Hildenbrand
2025-06-12 16:48                         ` Zi Yan
2025-06-16 11:34                           ` Jinjiang Tu
2025-06-16 11:33                         ` Jinjiang Tu
2025-06-16 19:27                           ` David Hildenbrand
2025-06-17  6:43                             ` Jinjiang Tu [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=b5d6ae46-0f2d-4c46-d563-45caee188d27@huawei.com \
    --to=tujinjiang@huawei.com \
    --cc=akpm@linux-foundation.org \
    --cc=david@redhat.com \
    --cc=linmiaohe@huawei.com \
    --cc=linux-mm@kvack.org \
    --cc=wangkefeng.wang@huawei.com \
    --cc=ziy@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox