* Re: [RFC PATCH] mm: hold PTL from the first PTE while reclaiming a large folio
2024-03-04 12:41 ` David Hildenbrand
@ 2024-03-04 13:03 ` Ryan Roberts
2024-03-04 14:27 ` David Hildenbrand
2024-03-04 21:04 ` Barry Song
2024-03-05 7:50 ` Huang, Ying
2 siblings, 1 reply; 28+ messages in thread
From: Ryan Roberts @ 2024-03-04 13:03 UTC (permalink / raw)
To: David Hildenbrand, Barry Song, akpm, linux-mm
Cc: chrisl, yuzhao, hanchuanhua, linux-kernel, willy, ying.huang,
xiang, mhocko, shy828301, wangkefeng.wang, Barry Song,
Hugh Dickins
On 04/03/2024 12:41, David Hildenbrand wrote:
> On 04.03.24 13:20, Ryan Roberts wrote:
>> Hi Barry,
>>
>> On 04/03/2024 10:37, Barry Song wrote:
>>> From: Barry Song <v-songbaohua@oppo.com>
>>>
>>> page_vma_mapped_walk() within try_to_unmap_one() races with other
>>> PTE modifications such as break-before-make: while iterating the
>>> PTEs of a large folio, it only begins to acquire the PTL after it
>>> finds a valid (present) PTE. Break-before-make transiently sets
>>> PTEs to pte_none, so a large folio's PTEs might be partially
>>> skipped in try_to_unmap_one().
>>
>> I just want to check my understanding here - I think the problem occurs for
>> PTE-mapped, PMD-sized folios as well as smaller-than-PMD-size large folios? Now
>> that I've had a look at the code and have a better understanding, I think that
>> must be the case? And therefore this problem exists independently of my work to
>> support swap-out of mTHP? (From your previous report I was under the impression
>> that it only affected mTHP).
>>
>> It's just that the problem is becoming more pronounced because with mTHP,
>> PTE-mapped large folios are much more common?
>
> That is my understanding.
>
>>
>>> For example, for an anon folio, after try_to_unmap_one() we may
>>> have PTE0 present while PTE1 ~ PTE(nr_pages - 1) are swap entries,
>>> so the folio is still mapped and fails to be reclaimed. What's
>>> even more worrying is that its PTEs are no longer in a unified
>>> state, which might lead to an accidental folio_split() afterwards.
>>> And since some of the PTEs are now swap entries, accessing them
>>> will incur page faults - do_swap_page().
>>> This creates both anxiety and extra expense. While we can't
>>> prevent userspace's unmap from breaking up unified PTEs such as
>>> CONT-PTE for a large folio, we can at least keep the kernel's code
>>> design from breaking them up.
>>> This patch holds the PTL from PTE0, so the folio will either be
>>> entirely reclaimed or entirely kept. On the other hand, this
>>> approach doesn't increase PTL contention: even w/o the patch,
>>> page_vma_mapped_walk() always gets the PTL after it sometimes
>>> skips one or two PTEs, because the intermediate break-before-makes
>>> are short, according to testing. Of course, even w/o this patch,
>>> the vast majority of try_to_unmap_one() calls still get the PTL
>>> from PTE0; this patch makes that number 100%.
>>> The other option is to give up in try_to_unmap_one() once we find
>>> that PTE0 is not the first entry for which we get the PTL, calling
>>> page_vma_mapped_walk_done() to end the iteration in that case.
>>> This keeps the PTEs unified while the folio isn't reclaimed. The
>>> result is quite similar to small folios with one PTE - either
>>> entirely reclaimed or entirely kept.
>>> Reclaiming large folios by holding the PTL from PTE0 seems a
>>> better option than giving up after detecting that the PTL walk
>>> began at a non-PTE0 entry.
>>>
>>>
>
> I'm sure that wall of text can be formatted in a better way :) . Also, I think
> we can drop some of the details,
>
> If you need some inspiration, I can give it a shot.
>
>>> Cc: Hugh Dickins <hughd@google.com>
>>> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
>>
>> Do we need a Fixes tag?
>>
>
> What would be the description of the problem we are fixing?
>
> 1) failing to unmap?
>
> That can happen with small folios as well IIUC.
>
> 2) Putting the large folio on the deferred split queue?
>
> That sounds more reasonable.
Isn't the real problem today that we can end up writing a THP to the swap file
(so 2M more IO and space used) but we can't remove it from memory, so no actual
reclaim happens? Although I guess your (2) is really just another way of saying
that.
>
>>> ---
>>> mm/vmscan.c | 11 +++++++++++
>>> 1 file changed, 11 insertions(+)
>>>
>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>> index 0b888a2afa58..e4722fbbcd0c 100644
>>> --- a/mm/vmscan.c
>>> +++ b/mm/vmscan.c
>>> @@ -1270,6 +1270,17 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>>> if (folio_test_pmd_mappable(folio))
>>> flags |= TTU_SPLIT_HUGE_PMD;
>>> + /*
>>> + * If the page table lock is not held from the first PTE of
>>> + * a large folio, some PTEs might be skipped because of
>>> + * races with break-before-make: for example, PTEs can
>>> + * transiently be pte_none, so one or more PTEs might be
>>> + * skipped in try_to_unmap_one(), and we might end up with
>>> + * a large folio that is partially mapped and partially
>>> + * unmapped after try_to_unmap()
>>> + */
>>> + if (folio_test_large(folio))
>>> + flags |= TTU_SYNC;
>>
>> This looks sensible to me after thinking about it for a while. But I also have a
>> gut feeling that there might be some more subtleties that are going over my
>> head, since I'm not expert in this area. So will leave others to provide R-b :)
>>
>
> As we are seeing more such problems with lockless PT walks, maybe we really want
> some other special value (non-swap entry?) to indicate that a PTE is
> currently undergoing protection changes. So we'd avoid the pte_none()
> temporarily, if possible.
>
> Without that, TTU_SYNC feels like the right thing to do.
>
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC PATCH] mm: hold PTL from the first PTE while reclaiming a large folio
2024-03-04 13:03 ` Ryan Roberts
@ 2024-03-04 14:27 ` David Hildenbrand
2024-03-04 20:42 ` Barry Song
0 siblings, 1 reply; 28+ messages in thread
From: David Hildenbrand @ 2024-03-04 14:27 UTC (permalink / raw)
To: Ryan Roberts, Barry Song, akpm, linux-mm
Cc: chrisl, yuzhao, hanchuanhua, linux-kernel, willy, ying.huang,
xiang, mhocko, shy828301, wangkefeng.wang, Barry Song,
Hugh Dickins
On 04.03.24 14:03, Ryan Roberts wrote:
> On 04/03/2024 12:41, David Hildenbrand wrote:
>> On 04.03.24 13:20, Ryan Roberts wrote:
>>> Hi Barry,
>>>
>>> On 04/03/2024 10:37, Barry Song wrote:
>>>> From: Barry Song <v-songbaohua@oppo.com>
>>>>
>>>> page_vma_mapped_walk() within try_to_unmap_one() races with other
>>>> PTEs modification such as break-before-make, while iterating PTEs
>>>> of a large folio, it will only begin to acquire PTL after it gets
>>>> a valid(present) PTE. break-before-make intermediately sets PTEs
>>>> to pte_none. Thus, a large folio's PTEs might be partially skipped
>>>> in try_to_unmap_one().
>>>
>>> I just want to check my understanding here - I think the problem occurs for
>>> PTE-mapped, PMD-sized folios as well as smaller-than-PMD-size large folios? Now
>>> that I've had a look at the code and have a better understanding, I think that
>>> must be the case? And therefore this problem exists independently of my work to
>>> support swap-out of mTHP? (From your previous report I was under the impression
>>> that it only affected mTHP).
>>>
>>> It's just that the problem is becoming more pronounced because with mTHP,
>>> PTE-mapped large folios are much more common?
>>
>> That is my understanding.
>>
>>>
>>>> For example, for an anon folio, after try_to_unmap_one(), we may
>>>> have PTE0 present, while PTE1 ~ PTE(nr_pages - 1) are swap entries.
>>>> So folio will be still mapped, the folio fails to be reclaimed.
>>>> What’s even more worrying is, its PTEs are no longer in a unified
>>>> state. This might lead to accident folio_split() afterwards. And
>>>> since a part of PTEs are now swap entries, accessing them will
>>>> incur page fault - do_swap_page.
>>>> It creates both anxiety and more expense. While we can't avoid
>>>> userspace's unmap to break up unified PTEs such as CONT-PTE for
>>>> a large folio, we can indeed keep away from kernel's breaking up
>>>> them due to its code design.
>>>> This patch is holding PTL from PTE0, thus, the folio will either
>>>> be entirely reclaimed or entirely kept. On the other hand, this
>>>> approach doesn't increase PTL contention. Even w/o the patch,
>>>> page_vma_mapped_walk() will always get PTL after it sometimes
>>>> skips one or two PTEs because intermediate break-before-makes
>>>> are short, according to test. Of course, even w/o this patch,
>>>> the vast majority of try_to_unmap_one still can get PTL from
>>>> PTE0. This patch makes the number 100%.
>>>> The other option is that we can give up in try_to_unmap_one
>>>> once we find PTE0 is not the first entry we get PTL, we call
>>>> page_vma_mapped_walk_done() to end the iteration at this case.
>>>> This will keep the unified PTEs while the folio isn't reclaimed.
>>>> The result is quite similar with small folios with one PTE -
>>>> either entirely reclaimed or entirely kept.
>>>> Reclaiming large folios by holding PTL from PTE0 seems a better
>>>> option comparing to giving up after detecting PTL begins from
>>>> non-PTE0.
>>>>
>>
>> I'm sure that wall of text can be formatted in a better way :) . Also, I think
>> we can drop some of the details,
>>
>> If you need some inspiration, I can give it a shot.
>>
>>>> Cc: Hugh Dickins <hughd@google.com>
>>>> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
>>>
>>> Do we need a Fixes tag?
>>>
>>
>> What would be the description of the problem we are fixing?
>>
>> 1) failing to unmap?
>>
>> That can happen with small folios as well IIUC.
>>
>> 2) Putting the large folio on the deferred split queue?
>>
>> That sounds more reasonable.
>
> Isn't the real problem today that we can end up writing a THP to the swap file
> (so 2M more IO and space used) but we can't remove it from memory, so no actual
> reclaim happens? Although I guess your (2) is really just another way of saying
> that.
The same could happen with small folios, I believe? We might end up
running into the folio_mapped() after the try_to_unmap().
Note that the actual I/O does not happen during add_to_swap(), but
during the pageout() call, when we find the folio to be dirty.
So there would not actually be more I/O. Only swap space would be
reserved, and that would be used later when not running into the race.
--
Cheers,
David / dhildenb
* Re: [RFC PATCH] mm: hold PTL from the first PTE while reclaiming a large folio
2024-03-04 14:27 ` David Hildenbrand
@ 2024-03-04 20:42 ` Barry Song
2024-03-04 21:02 ` David Hildenbrand
0 siblings, 1 reply; 28+ messages in thread
From: Barry Song @ 2024-03-04 20:42 UTC (permalink / raw)
To: David Hildenbrand
Cc: Ryan Roberts, akpm, linux-mm, chrisl, yuzhao, hanchuanhua,
linux-kernel, willy, ying.huang, xiang, mhocko, shy828301,
wangkefeng.wang, Barry Song, Hugh Dickins
On Tue, Mar 5, 2024 at 3:27 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 04.03.24 14:03, Ryan Roberts wrote:
> > On 04/03/2024 12:41, David Hildenbrand wrote:
> >> On 04.03.24 13:20, Ryan Roberts wrote:
> >>> Hi Barry,
> >>>
> >>> On 04/03/2024 10:37, Barry Song wrote:
> >>>> From: Barry Song <v-songbaohua@oppo.com>
> >>>>
> >>>> page_vma_mapped_walk() within try_to_unmap_one() races with other
> >>>> PTEs modification such as break-before-make, while iterating PTEs
> >>>> of a large folio, it will only begin to acquire PTL after it gets
> >>>> a valid(present) PTE. break-before-make intermediately sets PTEs
> >>>> to pte_none. Thus, a large folio's PTEs might be partially skipped
> >>>> in try_to_unmap_one().
> >>>
> >>> I just want to check my understanding here - I think the problem occurs for
> >>> PTE-mapped, PMD-sized folios as well as smaller-than-PMD-size large folios? Now
> >>> that I've had a look at the code and have a better understanding, I think that
> >>> must be the case? And therefore this problem exists independently of my work to
> >>> support swap-out of mTHP? (From your previous report I was under the impression
> >>> that it only affected mTHP).
> >>>
> >>> It's just that the problem is becoming more pronounced because with mTHP,
> >>> PTE-mapped large folios are much more common?
> >>
> >> That is my understanding.
> >>
> >>>
> >>>> For example, for an anon folio, after try_to_unmap_one(), we may
> >>>> have PTE0 present, while PTE1 ~ PTE(nr_pages - 1) are swap entries.
> >>>> So folio will be still mapped, the folio fails to be reclaimed.
> >>>> What’s even more worrying is, its PTEs are no longer in a unified
> >>>> state. This might lead to accident folio_split() afterwards. And
> >>>> since a part of PTEs are now swap entries, accessing them will
> >>>> incur page fault - do_swap_page.
> >>>> It creates both anxiety and more expense. While we can't avoid
> >>>> userspace's unmap to break up unified PTEs such as CONT-PTE for
> >>>> a large folio, we can indeed keep away from kernel's breaking up
> >>>> them due to its code design.
> >>>> This patch is holding PTL from PTE0, thus, the folio will either
> >>>> be entirely reclaimed or entirely kept. On the other hand, this
> >>>> approach doesn't increase PTL contention. Even w/o the patch,
> >>>> page_vma_mapped_walk() will always get PTL after it sometimes
> >>>> skips one or two PTEs because intermediate break-before-makes
> >>>> are short, according to test. Of course, even w/o this patch,
> >>>> the vast majority of try_to_unmap_one still can get PTL from
> >>>> PTE0. This patch makes the number 100%.
> >>>> The other option is that we can give up in try_to_unmap_one
> >>>> once we find PTE0 is not the first entry we get PTL, we call
> >>>> page_vma_mapped_walk_done() to end the iteration at this case.
> >>>> This will keep the unified PTEs while the folio isn't reclaimed.
> >>>> The result is quite similar with small folios with one PTE -
> >>>> either entirely reclaimed or entirely kept.
> >>>> Reclaiming large folios by holding PTL from PTE0 seems a better
> >>>> option comparing to giving up after detecting PTL begins from
> >>>> non-PTE0.
> >>>>
> >>
> >> I'm sure that wall of text can be formatted in a better way :) . Also, I think
> >> we can drop some of the details,
> >>
> >> If you need some inspiration, I can give it a shot.
> >>
> >>>> Cc: Hugh Dickins <hughd@google.com>
> >>>> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> >>>
> >>> Do we need a Fixes tag?
> >>>
> >>
> >> What would be the description of the problem we are fixing?
> >>
> >> 1) failing to unmap?
> >>
> >> That can happen with small folios as well IIUC.
> >>
> >> 2) Putting the large folio on the deferred split queue?
> >>
> >> That sounds more reasonable.
> >
> >>> Isn't the real problem today that we can end up writing a THP to the swap file
> > (so 2M more IO and space used) but we can't remove it from memory, so no actual
> > reclaim happens? Although I guess your (2) is really just another way of saying
> > that.
>
> The same could happen with small folios I believe? We might end up
> running into the
>
> folio_mapped()
>
> after the try_to_unmap().
>
> Note that the actual I/O does not happen during add_to_swap(), but
> during the pageout() call when we find the folio to be dirty.
>
> So there would not actually be more I/O. Only swap space would be
> reserved, that would be used later when not running into the race.
I am not worried about small folios at all, as they have only one PTE,
so the PTE is either completely unmapped or completely mapped.
For large folios it is a different story. For example, with a large
folio of 16 PTEs mapped by CONT-PTE, we will have:
1. an unfolded CONT-PTE, e.g. PTE0 present, PTE1-PTE15 swap entries;
2. page faults on PTE1-PTE15 after try_to_unmap if we access them.
These are totally useless PFs, and they can be avoided if we
try_to_unmap properly at the beginning;
3. a potential need to split the large folio afterwards. For example,
MADV_PAGEOUT and MADV_FREE might split it after finding it is not
completely mapped.
For small folios, we don't have any of the above concerns.
>
> --
> Cheers,
>
> David / dhildenb
>
Thanks
Barry
* Re: [RFC PATCH] mm: hold PTL from the first PTE while reclaiming a large folio
2024-03-04 20:42 ` Barry Song
@ 2024-03-04 21:02 ` David Hildenbrand
2024-03-04 21:41 ` Barry Song
0 siblings, 1 reply; 28+ messages in thread
From: David Hildenbrand @ 2024-03-04 21:02 UTC (permalink / raw)
To: Barry Song
Cc: Ryan Roberts, akpm, linux-mm, chrisl, yuzhao, hanchuanhua,
linux-kernel, willy, ying.huang, xiang, mhocko, shy828301,
wangkefeng.wang, Barry Song, Hugh Dickins
On 04.03.24 21:42, Barry Song wrote:
> On Tue, Mar 5, 2024 at 3:27 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 04.03.24 14:03, Ryan Roberts wrote:
>>> On 04/03/2024 12:41, David Hildenbrand wrote:
>>>> On 04.03.24 13:20, Ryan Roberts wrote:
>>>>> Hi Barry,
>>>>>
>>>>> On 04/03/2024 10:37, Barry Song wrote:
>>>>>> From: Barry Song <v-songbaohua@oppo.com>
>>>>>>
>>>>>> page_vma_mapped_walk() within try_to_unmap_one() races with other
>>>>>> PTEs modification such as break-before-make, while iterating PTEs
>>>>>> of a large folio, it will only begin to acquire PTL after it gets
>>>>>> a valid(present) PTE. break-before-make intermediately sets PTEs
>>>>>> to pte_none. Thus, a large folio's PTEs might be partially skipped
>>>>>> in try_to_unmap_one().
>>>>>
>>>>> I just want to check my understanding here - I think the problem occurs for
>>>>> PTE-mapped, PMD-sized folios as well as smaller-than-PMD-size large folios? Now
>>>>> that I've had a look at the code and have a better understanding, I think that
>>>>> must be the case? And therefore this problem exists independently of my work to
>>>>> support swap-out of mTHP? (From your previous report I was under the impression
>>>>> that it only affected mTHP).
>>>>>
> >>>>> It's just that the problem is becoming more pronounced because with mTHP,
>>>>> PTE-mapped large folios are much more common?
>>>>
>>>> That is my understanding.
>>>>
>>>>>
>>>>>> For example, for an anon folio, after try_to_unmap_one(), we may
>>>>>> have PTE0 present, while PTE1 ~ PTE(nr_pages - 1) are swap entries.
>>>>>> So folio will be still mapped, the folio fails to be reclaimed.
>>>>>> What’s even more worrying is, its PTEs are no longer in a unified
>>>>>> state. This might lead to accident folio_split() afterwards. And
>>>>>> since a part of PTEs are now swap entries, accessing them will
>>>>>> incur page fault - do_swap_page.
>>>>>> It creates both anxiety and more expense. While we can't avoid
>>>>>> userspace's unmap to break up unified PTEs such as CONT-PTE for
>>>>>> a large folio, we can indeed keep away from kernel's breaking up
>>>>>> them due to its code design.
>>>>>> This patch is holding PTL from PTE0, thus, the folio will either
>>>>>> be entirely reclaimed or entirely kept. On the other hand, this
>>>>>> approach doesn't increase PTL contention. Even w/o the patch,
>>>>>> page_vma_mapped_walk() will always get PTL after it sometimes
>>>>>> skips one or two PTEs because intermediate break-before-makes
>>>>>> are short, according to test. Of course, even w/o this patch,
>>>>>> the vast majority of try_to_unmap_one still can get PTL from
>>>>>> PTE0. This patch makes the number 100%.
>>>>>> The other option is that we can give up in try_to_unmap_one
>>>>>> once we find PTE0 is not the first entry we get PTL, we call
>>>>>> page_vma_mapped_walk_done() to end the iteration at this case.
>>>>>> This will keep the unified PTEs while the folio isn't reclaimed.
>>>>>> The result is quite similar with small folios with one PTE -
>>>>>> either entirely reclaimed or entirely kept.
>>>>>> Reclaiming large folios by holding PTL from PTE0 seems a better
>>>>>> option comparing to giving up after detecting PTL begins from
>>>>>> non-PTE0.
>>>>>>
>>>>
>>>> I'm sure that wall of text can be formatted in a better way :) . Also, I think
>>>> we can drop some of the details,
>>>>
>>>> If you need some inspiration, I can give it a shot.
>>>>
>>>>>> Cc: Hugh Dickins <hughd@google.com>
>>>>>> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
>>>>>
>>>>> Do we need a Fixes tag?
>>>>>
>>>>
>>>> What would be the description of the problem we are fixing?
>>>>
>>>> 1) failing to unmap?
>>>>
>>>> That can happen with small folios as well IIUC.
>>>>
>>>> 2) Putting the large folio on the deferred split queue?
>>>>
>>>> That sounds more reasonable.
>>>
>>> Isn't the real problem today that we can end up writing a THP to the swap file
>>> (so 2M more IO and space used) but we can't remove it from memory, so no actual
>>> reclaim happens? Although I guess your (2) is really just another way of saying
>>> that.
>>
>> The same could happen with small folios I believe? We might end up
>> running into the
>>
>> folio_mapped()
>>
>> after the try_to_unmap().
>>
>> Note that the actual I/O does not happen during add_to_swap(), but
>> during the pageout() call when we find the folio to be dirty.
>>
>> So there would not actually be more I/O. Only swap space would be
>> reserved, that would be used later when not running into the race.
>
> I am not worried about small folios at all as they have only one PTE.
> so the PTE is either completely unmapped or completely mapped.
>
> In terms of large folios, it is a different story. for example, a large
> folio with 16 PTEs with CONT-PTE, we will have
>
> 1. unfolded CONT-PTE, eg. PTE0 present, PTE1-PTE15 swap entries
>
> 2. page faults on PTE1-PTE15 after try_to_unmap if we access them.
>
> This is totally useless PF and can be avoided if we can try_to_unmap
> properly at the beginning.
>
> 3. potential need to split a large folio afterwards. for example, MADV_PAGEOUT,
> MADV_FREE might split it after finding it is not completely mapped.
>
> For small folios, we don't have any concern on the above issues.
Right, but when we talk about "Fixes:", what exactly do we consider
"really broken" above and what is "undesired"?
(a) is there a correctness issue? I don't think so.
(b) is there a real performance issue? I'd like to understand.
After all, we've been living with that ever since we supported THP_SWAP,
correct? "something does not work ideally in some corner cases" might be
reasonable to handle here (and I really think we should), but might not
be worth a "Fixes:".
So if we could clarify that, it would be great.
--
Cheers,
David / dhildenb
* Re: [RFC PATCH] mm: hold PTL from the first PTE while reclaiming a large folio
2024-03-04 21:02 ` David Hildenbrand
@ 2024-03-04 21:41 ` Barry Song
0 siblings, 0 replies; 28+ messages in thread
From: Barry Song @ 2024-03-04 21:41 UTC (permalink / raw)
To: David Hildenbrand
Cc: Ryan Roberts, akpm, linux-mm, chrisl, yuzhao, hanchuanhua,
linux-kernel, willy, ying.huang, xiang, mhocko, shy828301,
wangkefeng.wang, Barry Song, Hugh Dickins
On Tue, Mar 5, 2024 at 10:02 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 04.03.24 21:42, Barry Song wrote:
> > On Tue, Mar 5, 2024 at 3:27 AM David Hildenbrand <david@redhat.com> wrote:
> >>
> >> On 04.03.24 14:03, Ryan Roberts wrote:
> >>> On 04/03/2024 12:41, David Hildenbrand wrote:
> >>>> On 04.03.24 13:20, Ryan Roberts wrote:
> >>>>> Hi Barry,
> >>>>>
> >>>>> On 04/03/2024 10:37, Barry Song wrote:
> >>>>>> From: Barry Song <v-songbaohua@oppo.com>
> >>>>>>
> >>>>>> page_vma_mapped_walk() within try_to_unmap_one() races with other
> >>>>>> PTEs modification such as break-before-make, while iterating PTEs
> >>>>>> of a large folio, it will only begin to acquire PTL after it gets
> >>>>>> a valid(present) PTE. break-before-make intermediately sets PTEs
> >>>>>> to pte_none. Thus, a large folio's PTEs might be partially skipped
> >>>>>> in try_to_unmap_one().
> >>>>>
> >>>>> I just want to check my understanding here - I think the problem occurs for
> >>>>> PTE-mapped, PMD-sized folios as well as smaller-than-PMD-size large folios? Now
> >>>>> that I've had a look at the code and have a better understanding, I think that
> >>>>> must be the case? And therefore this problem exists independently of my work to
> >>>>> support swap-out of mTHP? (From your previous report I was under the impression
> >>>>> that it only affected mTHP).
> >>>>>
> >>>>> It's just that the problem is becoming more pronounced because with mTHP,
> >>>>> PTE-mapped large folios are much more common?
> >>>>
> >>>> That is my understanding.
> >>>>
> >>>>>
> >>>>>> For example, for an anon folio, after try_to_unmap_one(), we may
> >>>>>> have PTE0 present, while PTE1 ~ PTE(nr_pages - 1) are swap entries.
> >>>>>> So folio will be still mapped, the folio fails to be reclaimed.
> >>>>>> What’s even more worrying is, its PTEs are no longer in a unified
> >>>>>> state. This might lead to accident folio_split() afterwards. And
> >>>>>> since a part of PTEs are now swap entries, accessing them will
> >>>>>> incur page fault - do_swap_page.
> >>>>>> It creates both anxiety and more expense. While we can't avoid
> >>>>>> userspace's unmap to break up unified PTEs such as CONT-PTE for
> >>>>>> a large folio, we can indeed keep away from kernel's breaking up
> >>>>>> them due to its code design.
> >>>>>> This patch is holding PTL from PTE0, thus, the folio will either
> >>>>>> be entirely reclaimed or entirely kept. On the other hand, this
> >>>>>> approach doesn't increase PTL contention. Even w/o the patch,
> >>>>>> page_vma_mapped_walk() will always get PTL after it sometimes
> >>>>>> skips one or two PTEs because intermediate break-before-makes
> >>>>>> are short, according to test. Of course, even w/o this patch,
> >>>>>> the vast majority of try_to_unmap_one still can get PTL from
> >>>>>> PTE0. This patch makes the number 100%.
> >>>>>> The other option is that we can give up in try_to_unmap_one
> >>>>>> once we find PTE0 is not the first entry we get PTL, we call
> >>>>>> page_vma_mapped_walk_done() to end the iteration at this case.
> >>>>>> This will keep the unified PTEs while the folio isn't reclaimed.
> >>>>>> The result is quite similar with small folios with one PTE -
> >>>>>> either entirely reclaimed or entirely kept.
> >>>>>> Reclaiming large folios by holding PTL from PTE0 seems a better
> >>>>>> option comparing to giving up after detecting PTL begins from
> >>>>>> non-PTE0.
> >>>>>>
> >>>>
> >>>> I'm sure that wall of text can be formatted in a better way :) . Also, I think
> >>>> we can drop some of the details,
> >>>>
> >>>> If you need some inspiration, I can give it a shot.
> >>>>
> >>>>>> Cc: Hugh Dickins <hughd@google.com>
> >>>>>> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> >>>>>
> >>>>> Do we need a Fixes tag?
> >>>>>
> >>>>
> >>>> What would be the description of the problem we are fixing?
> >>>>
> >>>> 1) failing to unmap?
> >>>>
> >>>> That can happen with small folios as well IIUC.
> >>>>
> >>>> 2) Putting the large folio on the deferred split queue?
> >>>>
> >>>> That sounds more reasonable.
> >>>
> >>> Isn't the real problem today that we can end up writing a THP to the swap file
> >>> (so 2M more IO and space used) but we can't remove it from memory, so no actual
> >>> reclaim happens? Although I guess your (2) is really just another way of saying
> >>> that.
> >>
> >> The same could happen with small folios I believe? We might end up
> >> running into the
> >>
> >> folio_mapped()
> >>
> >> after the try_to_unmap().
> >>
> >> Note that the actual I/O does not happen during add_to_swap(), but
> >> during the pageout() call when we find the folio to be dirty.
> >>
> >> So there would not actually be more I/O. Only swap space would be
> >> reserved, that would be used later when not running into the race.
> >
> > I am not worried about small folios at all as they have only one PTE.
> > so the PTE is either completely unmapped or completely mapped.
> >
> > In terms of large folios, it is a different story. for example, a large
> > folio with 16 PTEs with CONT-PTE, we will have
> >
> > 1. unfolded CONT-PTE, eg. PTE0 present, PTE1-PTE15 swap entries
> >
> > 2. page faults on PTE1-PTE15 after try_to_unmap if we access them.
> >
> > This is totally useless PF and can be avoided if we can try_to_unmap
> > properly at the beginning.
> >
> > 3. potential need to split a large folio afterwards. for example, MADV_PAGEOUT,
> > MADV_FREE might split it after finding it is not completely mapped.
> >
> > For small folios, we don't have any concern on the above issues.
>
> Right, but when we talk about "Fixes:", what exactly do we consider
> "really broken" above and what is "undesired"?
>
> (a) is there a correctness issue? I don't think so.
>
> (b) is there a real performance issue? I'd like to understand.
>
> After all, we've been living with that ever since we supported THP_SWAP,
> correct? "something does not work ideally in some corner cases" might be
> reasonable to handle here (and I really think we should), but might not
> be worth a "Fixes:".
>
> So if we could clarify that, it would be great.
I don't think this needs a Fixes tag; I would consider it an optimization
for corner cases. And I don't think we will see a noticeable performance
improvement, as this doesn't happen very often, though I believe it does
improve the corner cases themselves. But if the corner case only accounts
for 0.1% of all try_to_unmap_one() calls, no noticeable overall
improvement will be seen.
I feel it is more like kicking flies out: flies don't kill people, but
they can sometimes be quite annoying :-) I ran into another bug in the
large-folio swap-in series due to this problem: 16 PTEs had contiguous
swap entries from Ryan's add_to_swap(), but they were split in
MADV_PAGEOUT because the folio was not completely mapped after
try_to_unmap_one.
Then after some time, some of the 16 pages were in the swapcache while
others were not. In that case I couldn't swap them in together, and had
to handle the page faults one by one, but I was incorrectly handling it
by trying to swap them in together by reading the swapfile. If they were
atomically handled in try_to_unmap_one, we could avoid this.
We may have to cope with this kind of problem from time to time in
future work.
>
> --
> Cheers,
>
> David / dhildenb
Thanks
Barry
* Re: [RFC PATCH] mm: hold PTL from the first PTE while reclaiming a large folio
2024-03-04 12:41 ` David Hildenbrand
2024-03-04 13:03 ` Ryan Roberts
@ 2024-03-04 21:04 ` Barry Song
2024-03-04 21:15 ` David Hildenbrand
2024-03-04 22:02 ` Ryan Roberts
2024-03-05 7:50 ` Huang, Ying
2 siblings, 2 replies; 28+ messages in thread
From: Barry Song @ 2024-03-04 21:04 UTC (permalink / raw)
To: David Hildenbrand
Cc: Ryan Roberts, akpm, linux-mm, chrisl, yuzhao, hanchuanhua,
linux-kernel, willy, ying.huang, xiang, mhocko, shy828301,
wangkefeng.wang, Barry Song, Hugh Dickins
On Tue, Mar 5, 2024 at 1:41 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 04.03.24 13:20, Ryan Roberts wrote:
> > Hi Barry,
> >
> > On 04/03/2024 10:37, Barry Song wrote:
> >> From: Barry Song <v-songbaohua@oppo.com>
> >>
> >> page_vma_mapped_walk() within try_to_unmap_one() races with other
> >> PTEs modification such as break-before-make, while iterating PTEs
> >> of a large folio, it will only begin to acquire PTL after it gets
> >> a valid(present) PTE. break-before-make intermediately sets PTEs
> >> to pte_none. Thus, a large folio's PTEs might be partially skipped
> >> in try_to_unmap_one().
> >
> > I just want to check my understanding here - I think the problem occurs for
> > PTE-mapped, PMD-sized folios as well as smaller-than-PMD-size large folios? Now
> > that I've had a look at the code and have a better understanding, I think that
> > must be the case? And therefore this problem exists independently of my work to
> > support swap-out of mTHP? (From your previous report I was under the impression
> > that it only affected mTHP).
> >
> > Its just that the problem is becoming more pronounced because with mTHP,
> > PTE-mapped large folios are much more common?
>
> That is my understanding.
>
> >
> >> For example, for an anon folio, after try_to_unmap_one(), we may
> >> have PTE0 present, while PTE1 ~ PTE(nr_pages - 1) are swap entries.
> >> So folio will be still mapped, the folio fails to be reclaimed.
> >> What’s even more worrying is, its PTEs are no longer in a unified
> >> state. This might lead to accident folio_split() afterwards. And
> >> since a part of PTEs are now swap entries, accessing them will
> >> incur page fault - do_swap_page.
> >> It creates both anxiety and more expense. While we can't avoid
> >> userspace's unmap to break up unified PTEs such as CONT-PTE for
> >> a large folio, we can indeed keep away from kernel's breaking up
> >> them due to its code design.
> >> This patch is holding PTL from PTE0, thus, the folio will either
> >> be entirely reclaimed or entirely kept. On the other hand, this
> >> approach doesn't increase PTL contention. Even w/o the patch,
> >> page_vma_mapped_walk() will always get PTL after it sometimes
> >> skips one or two PTEs because intermediate break-before-makes
> >> are short, according to test. Of course, even w/o this patch,
> >> the vast majority of try_to_unmap_one still can get PTL from
> >> PTE0. This patch makes the number 100%.
> >> The other option is that we can give up in try_to_unmap_one
> >> once we find PTE0 is not the first entry we get PTL, we call
> >> page_vma_mapped_walk_done() to end the iteration at this case.
> >> This will keep the unified PTEs while the folio isn't reclaimed.
> >> The result is quite similar with small folios with one PTE -
> >> either entirely reclaimed or entirely kept.
> >> Reclaiming large folios by holding PTL from PTE0 seems a better
> >> option comparing to giving up after detecting PTL begins from
> >> non-PTE0.
> >>
>
> I'm sure that wall of text can be formatted in a better way :) . Also, I
> think we can drop some of the details,
>
> If you need some inspiration, I can give it a shot.
>
> >> Cc: Hugh Dickins <hughd@google.com>
> >> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> >
> > Do we need a Fixes tag?
I am not quite sure which commit a Fixes tag should point to here.
I think it's more of an optimization.
> >
>
> What would be the description of the problem we are fixing?
>
> 1) failing to unmap?
>
> That can happen with small folios as well IIUC.
>
> 2) Putting the large folio on the deferred split queue?
>
> That sounds more reasonable.
I don't feel it is reasonable. Avoiding this kind of accidental splitting
caused by the kernel's own code design is the more reasonable approach,
as there is always a price to pay for splitting, unfolding PTEs,
etc.
While we can't avoid splitting coming from userspace's
MADV_DONTNEED, munmap and mprotect, we do have a way
to ensure the kernel itself doesn't accidentally break up a
large folio.
In OPPO's phones, we ran into some weird bugs due to skipped PTEs
in try_to_unmap_one. We could hardly fix them from the root cause; with
various races involved, figuring out their timings was really a big pain :-)
But we did "resolve" those bugs by leaving all PTEs entirely untouched if we
found some PTEs had been skipped in try_to_unmap_one [1].
When we find we only get the PTL from the 2nd or 3rd, but not the
1st PTE, we entirely give up on try_to_unmap_one and leave
all PTEs untouched.
	/* we are not starting from head */
	if (!IS_ALIGNED((unsigned long)pvmw.pte, CONT_PTES * sizeof(*pvmw.pte))) {
		ret = false;
		atomic64_inc(&perf_stat.mapped_walk_start_from_non_head);
		set_pte_at(mm, address, pvmw.pte, pteval);
		page_vma_mapped_walk_done(&pvmw);
		break;
	}
This will ensure all PTEs still have a unified state, such as CONT-PTE,
after try_to_unmap fails.
I feel this could have some false positives, because when racing
with unmap, the 1st PTE might really become pte_none. So explicitly
holding the PTL from the 1st PTE seems a better way.
[1] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/rmap.c#L1730
>
> >> ---
> >> mm/vmscan.c | 11 +++++++++++
> >> 1 file changed, 11 insertions(+)
> >>
> >> diff --git a/mm/vmscan.c b/mm/vmscan.c
> >> index 0b888a2afa58..e4722fbbcd0c 100644
> >> --- a/mm/vmscan.c
> >> +++ b/mm/vmscan.c
> >> @@ -1270,6 +1270,17 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >>
> >> if (folio_test_pmd_mappable(folio))
> >> flags |= TTU_SPLIT_HUGE_PMD;
> >> + /*
> >> + * if page table lock is not held from the first PTE of
> >> + * a large folio, some PTEs might be skipped because of
> >> + * races with break-before-make, for example, PTEs can
> >> + * be pte_none intermediately, thus one or more PTEs
> >> + * might be skipped in try_to_unmap_one, we might result
> >> + * in a large folio is partially mapped and partially
> >> + * unmapped after try_to_unmap
> >> + */
> >> + if (folio_test_large(folio))
> >> + flags |= TTU_SYNC;
> >
> > This looks sensible to me after thinking about it for a while. But I also have a
> > gut feeling that there might be some more subtleties that are going over my
> > head, since I'm not expert in this area. So will leave others to provide R-b :)
> >
>
> As we are seeing more such problems with lockless PT walks, maybe we
> really want some other special value (nonswap entry?) to indicate that a
> PTE this is currently ondergoing protection changes. So we'd avoid the
> pte_none() temporarily, if possible.
>
> Without that, TTU_SYNC feels like the right thing to do.
>
> --
> Cheers,
>
> David / dhildenb
>
Thanks
Barry
* Re: [RFC PATCH] mm: hold PTL from the first PTE while reclaiming a large folio
2024-03-04 21:04 ` Barry Song
@ 2024-03-04 21:15 ` David Hildenbrand
2024-03-04 22:29 ` Barry Song
2024-03-04 22:02 ` Ryan Roberts
1 sibling, 1 reply; 28+ messages in thread
From: David Hildenbrand @ 2024-03-04 21:15 UTC (permalink / raw)
To: Barry Song
Cc: Ryan Roberts, akpm, linux-mm, chrisl, yuzhao, hanchuanhua,
linux-kernel, willy, ying.huang, xiang, mhocko, shy828301,
wangkefeng.wang, Barry Song, Hugh Dickins
>>> Do we need a Fixes tag?
>
> I am not quite sure which commit should be here for a fixes tag.
> I think it's more of an optimization.
Good, that helps!
>
>>>
>>
>> What would be the description of the problem we are fixing?
>>
>> 1) failing to unmap?
>>
>> That can happen with small folios as well IIUC.
>>
>> 2) Putting the large folio on the deferred split queue?
>>
>> That sounds more reasonable.
>
> I don't feel it is reasonable. Avoiding this kind of accident splitting
> from the kernel's improper code is a more reasonable approach
> as there is always a price to pay for splitting and unfolding PTEs
> etc.
>
> While we can't avoid splitting coming from userspace's
> MADV_DONTNEED, munmap, mprotect, we have a way
> to ensure the kernel itself doesn't accidently break up a
> large folio.
Note that on the next vmscan we would retry, find the remaining present
entries and swapout that thing completely :)
>
> In OPPO's phones, we ran into some weird bugs due to skipped PTEs
> in try_to_unmap_one. hardly could we fix it from the root cause. with
> various races, figuring out their timings was really a big pain :-)
>
I can imagine. I assume, though, that it might be related to the way the
cont-pte bit was handled. Ryan's implementation should be able to cope
with that.
> But we did "resolve" those bugs by entirely untouching all PTEs if we
> found some PTEs were skipped in try_to_unmap_one [1].
>
> While we find we only get the PTL from 2nd, 3rd but not
> 1st PTE, we entirely give up on try_to_unmap_one, and leave
> all PTEs untouched.
>
> /* we are not starting from head */
> if (!IS_ALIGNED((unsigned long)pvmw.pte, CONT_PTES * sizeof(*pvmw.pte))) {
> ret = false;
> atomic64_inc(&perf_stat.mapped_walk_start_from_non_head);
> set_pte_at(mm, address, pvmw.pte, pteval);
> page_vma_mapped_walk_done(&pvmw);
> break;
> }
> This will ensure all PTEs still have a unified state such as CONT-PTE
> after try_to_unmap fails.
> I feel this could have some false postive because when racing
> with unmap, 1st PTE might really become pte_none. So explicitly
> holding PTL from 1st PTE seems a better way.
Can we estimate the "cost" of holding the PTL?
--
Cheers,
David / dhildenb
* Re: [RFC PATCH] mm: hold PTL from the first PTE while reclaiming a large folio
2024-03-04 21:15 ` David Hildenbrand
@ 2024-03-04 22:29 ` Barry Song
2024-03-05 7:53 ` Huang, Ying
0 siblings, 1 reply; 28+ messages in thread
From: Barry Song @ 2024-03-04 22:29 UTC (permalink / raw)
To: David Hildenbrand
Cc: Ryan Roberts, akpm, linux-mm, chrisl, yuzhao, hanchuanhua,
linux-kernel, willy, ying.huang, xiang, mhocko, shy828301,
wangkefeng.wang, Barry Song, Hugh Dickins
On Tue, Mar 5, 2024 at 10:15 AM David Hildenbrand <david@redhat.com> wrote:
>
>
> >>> Do we need a Fixes tag?
> >
> > I am not quite sure which commit should be here for a fixes tag.
> > I think it's more of an optimization.
>
> Good, that helps!
>
> >
> >>>
> >>
> >> What would be the description of the problem we are fixing?
> >>
> >> 1) failing to unmap?
> >>
> >> That can happen with small folios as well IIUC.
> >>
> >> 2) Putting the large folio on the deferred split queue?
> >>
> >> That sounds more reasonable.
> >
> > I don't feel it is reasonable. Avoiding this kind of accident splitting
> > from the kernel's improper code is a more reasonable approach
> > as there is always a price to pay for splitting and unfolding PTEs
> > etc.
> >
> > While we can't avoid splitting coming from userspace's
> > MADV_DONTNEED, munmap, mprotect, we have a way
> > to ensure the kernel itself doesn't accidently break up a
> > large folio.
>
> Note that on the next vmscan we would retry, find the remaining present
> entries and swapout that thing completely :)
This is true, but since we could have finished the job the first time, a
second retry seems like an extra cost :-)
>
> >
> > In OPPO's phones, we ran into some weird bugs due to skipped PTEs
> > in try_to_unmap_one. hardly could we fix it from the root cause. with
> > various races, figuring out their timings was really a big pain :-)
> >
>
> I can imagine. I assume, though, that it might be related to the way the
> cont-pte bit was handled. Ryan's implementation should be able to cope
> with that.
I guess you are probably right. Ryan's implementation decouples CONT-PTE
from the mm core; it's nice to have it.
>
> > But we did "resolve" those bugs by entirely untouching all PTEs if we
> > found some PTEs were skipped in try_to_unmap_one [1].
> >
> > While we find we only get the PTL from 2nd, 3rd but not
> > 1st PTE, we entirely give up on try_to_unmap_one, and leave
> > all PTEs untouched.
> >
> > /* we are not starting from head */
> > if (!IS_ALIGNED((unsigned long)pvmw.pte, CONT_PTES * sizeof(*pvmw.pte))) {
> > ret = false;
> > atomic64_inc(&perf_stat.mapped_walk_start_from_non_head);
> > set_pte_at(mm, address, pvmw.pte, pteval);
> > page_vma_mapped_walk_done(&pvmw);
> > break;
> > }
> > This will ensure all PTEs still have a unified state such as CONT-PTE
> > after try_to_unmap fails.
> > I feel this could have some false postive because when racing
> > with unmap, 1st PTE might really become pte_none. So explicitly
> > holding PTL from 1st PTE seems a better way.
>
> Can we estimate the "cost" of holding the PTL?
>
This is just moving PTL acquisition one or two PTEs earlier in those corner
cases. In normal cases, it doesn't affect when the PTL is held.
In normal cases, page_vma_mapped_walk will find PTE0 is present and thus hold
the PTL immediately. In corner cases, page_vma_mapped_walk races with break-
before-make; after skipping one or two PTEs whose states are transferring,
it will find a present PTE and then acquire the lock.
> --
> Cheers,
>
> David / dhildenb
Thanks
Barry
* Re: [RFC PATCH] mm: hold PTL from the first PTE while reclaiming a large folio
2024-03-04 22:29 ` Barry Song
@ 2024-03-05 7:53 ` Huang, Ying
2024-03-05 9:02 ` Barry Song
0 siblings, 1 reply; 28+ messages in thread
From: Huang, Ying @ 2024-03-05 7:53 UTC (permalink / raw)
To: Barry Song
Cc: David Hildenbrand, Ryan Roberts, akpm, linux-mm, chrisl, yuzhao,
hanchuanhua, linux-kernel, willy, xiang, mhocko, shy828301,
wangkefeng.wang, Barry Song, Hugh Dickins
Barry Song <21cnbao@gmail.com> writes:
> On Tue, Mar 5, 2024 at 10:15 AM David Hildenbrand <david@redhat.com> wrote:
>> > But we did "resolve" those bugs by entirely untouching all PTEs if we
>> > found some PTEs were skipped in try_to_unmap_one [1].
>> >
>> > While we find we only get the PTL from 2nd, 3rd but not
>> > 1st PTE, we entirely give up on try_to_unmap_one, and leave
>> > all PTEs untouched.
>> >
>> > /* we are not starting from head */
>> > if (!IS_ALIGNED((unsigned long)pvmw.pte, CONT_PTES * sizeof(*pvmw.pte))) {
>> > ret = false;
>> > atomic64_inc(&perf_stat.mapped_walk_start_from_non_head);
>> > set_pte_at(mm, address, pvmw.pte, pteval);
>> > page_vma_mapped_walk_done(&pvmw);
>> > break;
>> > }
>> > This will ensure all PTEs still have a unified state such as CONT-PTE
>> > after try_to_unmap fails.
>> > I feel this could have some false postive because when racing
>> > with unmap, 1st PTE might really become pte_none. So explicitly
>> > holding PTL from 1st PTE seems a better way.
>>
>> Can we estimate the "cost" of holding the PTL?
>>
>
> This is just moving PTL acquisition one or two PTE earlier in those corner
> cases. In normal cases, it doesn't affect when PTL is held.
The mTHP may be mapped at the end of page table. In that case, the PTL
will be held longer. Or am I missing something?
--
Best Regards,
Huang, Ying
> In normal cases, page_vma_mapped_walk will find PTE0 is present, thus hold
> PTL immediately. in corner cases, page_vma_mapped_walk races with break-
> before-make, after skipping one or two PTEs whose states are transferring,
> it will find a present pte then acquire lock.
>
>> --
>> Cheers,
>>
>> David / dhildenb
>
> Thanks
> Barry
* Re: [RFC PATCH] mm: hold PTL from the first PTE while reclaiming a large folio
2024-03-05 7:53 ` Huang, Ying
@ 2024-03-05 9:02 ` Barry Song
2024-03-05 9:10 ` Huang, Ying
0 siblings, 1 reply; 28+ messages in thread
From: Barry Song @ 2024-03-05 9:02 UTC (permalink / raw)
To: Huang, Ying
Cc: David Hildenbrand, Ryan Roberts, akpm, linux-mm, chrisl, yuzhao,
hanchuanhua, linux-kernel, willy, xiang, mhocko, shy828301,
wangkefeng.wang, Barry Song, Hugh Dickins
On Tue, Mar 5, 2024 at 8:55 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Barry Song <21cnbao@gmail.com> writes:
>
> > On Tue, Mar 5, 2024 at 10:15 AM David Hildenbrand <david@redhat.com> wrote:
> >> > But we did "resolve" those bugs by entirely untouching all PTEs if we
> >> > found some PTEs were skipped in try_to_unmap_one [1].
> >> >
> >> > While we find we only get the PTL from 2nd, 3rd but not
> >> > 1st PTE, we entirely give up on try_to_unmap_one, and leave
> >> > all PTEs untouched.
> >> >
> >> > /* we are not starting from head */
> >> > if (!IS_ALIGNED((unsigned long)pvmw.pte, CONT_PTES * sizeof(*pvmw.pte))) {
> >> > ret = false;
> >> > atomic64_inc(&perf_stat.mapped_walk_start_from_non_head);
> >> > set_pte_at(mm, address, pvmw.pte, pteval);
> >> > page_vma_mapped_walk_done(&pvmw);
> >> > break;
> >> > }
> >> > This will ensure all PTEs still have a unified state such as CONT-PTE
> >> > after try_to_unmap fails.
> >> > I feel this could have some false postive because when racing
> >> > with unmap, 1st PTE might really become pte_none. So explicitly
> >> > holding PTL from 1st PTE seems a better way.
> >>
> >> Can we estimate the "cost" of holding the PTL?
> >>
> >
> > This is just moving PTL acquisition one or two PTE earlier in those corner
> > cases. In normal cases, it doesn't affect when PTL is held.
>
> The mTHP may be mapped at the end of page table. In that case, the PTL
> will be held longer. Or am I missing something?
No, this patch doesn't change when we release the PTL, only when we
acquire it.
When the original code iterates over the nr_pages PTEs of a large folio, it
skips invalid PTEs; when it meets a valid one, it acquires the PTL. So if it
reads intermediate PTE values that other threads are modifying, it might
skip PTE0, or sometimes PTE0 and PTE1 according to my test. But on
arriving at PTE2, other threads have likely written a new value, so we
will begin to hold the PTL and iterate till the end of the large folio.
The proposal is that we directly take the PTL from PTE0, so we don't read
intermediate values for the head of the nr_pages PTEs. This will ensure
a large folio is either completely unmapped or completely mapped,
but not partially mapped and partially unmapped.
>
> --
> Best Regards,
> Huang, Ying
>
>
> > In normal cases, page_vma_mapped_walk will find PTE0 is present, thus hold
> > PTL immediately. in corner cases, page_vma_mapped_walk races with break-
> > before-make, after skipping one or two PTEs whose states are transferring,
> > it will find a present pte then acquire lock.
> >
> >> --
> >> Cheers,
> >>
> >> David / dhildenb
> >
Thanks
Barry
* Re: [RFC PATCH] mm: hold PTL from the first PTE while reclaiming a large folio
2024-03-05 9:02 ` Barry Song
@ 2024-03-05 9:10 ` Huang, Ying
2024-03-05 9:21 ` Barry Song
0 siblings, 1 reply; 28+ messages in thread
From: Huang, Ying @ 2024-03-05 9:10 UTC (permalink / raw)
To: Barry Song
Cc: David Hildenbrand, Ryan Roberts, akpm, linux-mm, chrisl, yuzhao,
hanchuanhua, linux-kernel, willy, xiang, mhocko, shy828301,
wangkefeng.wang, Barry Song, Hugh Dickins
Barry Song <21cnbao@gmail.com> writes:
> On Tue, Mar 5, 2024 at 8:55 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Barry Song <21cnbao@gmail.com> writes:
>>
>> > On Tue, Mar 5, 2024 at 10:15 AM David Hildenbrand <david@redhat.com> wrote:
>> >> > But we did "resolve" those bugs by entirely untouching all PTEs if we
>> >> > found some PTEs were skipped in try_to_unmap_one [1].
>> >> >
>> >> > While we find we only get the PTL from 2nd, 3rd but not
>> >> > 1st PTE, we entirely give up on try_to_unmap_one, and leave
>> >> > all PTEs untouched.
>> >> >
>> >> > /* we are not starting from head */
>> >> > if (!IS_ALIGNED((unsigned long)pvmw.pte, CONT_PTES * sizeof(*pvmw.pte))) {
>> >> > ret = false;
>> >> > atomic64_inc(&perf_stat.mapped_walk_start_from_non_head);
>> >> > set_pte_at(mm, address, pvmw.pte, pteval);
>> >> > page_vma_mapped_walk_done(&pvmw);
>> >> > break;
>> >> > }
>> >> > This will ensure all PTEs still have a unified state such as CONT-PTE
>> >> > after try_to_unmap fails.
>> >> > I feel this could have some false postive because when racing
>> >> > with unmap, 1st PTE might really become pte_none. So explicitly
>> >> > holding PTL from 1st PTE seems a better way.
>> >>
>> >> Can we estimate the "cost" of holding the PTL?
>> >>
>> >
>> > This is just moving PTL acquisition one or two PTE earlier in those corner
>> > cases. In normal cases, it doesn't affect when PTL is held.
>>
>> The mTHP may be mapped at the end of page table. In that case, the PTL
>> will be held longer. Or am I missing something?
>
> no. this patch doesn't change when we release PTL but change when we
> get PTL.
>
> when the original code iterates nr_pages PTEs in a large folio, it will skip
> invalid PTEs, when it meets a valid one, it will acquire PTL. so if it gets
> intermediate PTE values some other threads are modifying, it might
> skip PTE0, or sometimes PTE0 and PTE1 according to my test. but
> arriving at PTE2, likely other threads have written a new value, so we
> will begin to hold PTL and iterate till the end of the large folio.
Is there any guarantee that the mTHP will always be mapped at the
beginning of the page table (PTE0)? IIUC, mTHP can be mapped at PTE496.
If so, with your patch, PTL will be held from PTE0 instead of PTE496 in
some cases.
--
Best Regards,
Huang, Ying
> The proposal is that we directly get PTL from PTE0, thus we don't get
> intermediate values for the head of nr_pages PTEs. this will ensure
> a large folio is either completely unmapped or completely mapped.
> but not partially mapped and partially unmapped.
>
>>
>> --
>> Best Regards,
>> Huang, Ying
>>
>>
>> > In normal cases, page_vma_mapped_walk will find PTE0 is present, thus hold
>> > PTL immediately. in corner cases, page_vma_mapped_walk races with break-
>> > before-make, after skipping one or two PTEs whose states are transferring,
>> > it will find a present pte then acquire lock.
>> >
>> >> --
>> >> Cheers,
>> >>
>> >> David / dhildenb
>> >
> Thanks
> Barry
* Re: [RFC PATCH] mm: hold PTL from the first PTE while reclaiming a large folio
2024-03-05 9:10 ` Huang, Ying
@ 2024-03-05 9:21 ` Barry Song
2024-03-05 10:28 ` Barry Song
0 siblings, 1 reply; 28+ messages in thread
From: Barry Song @ 2024-03-05 9:21 UTC (permalink / raw)
To: Huang, Ying
Cc: David Hildenbrand, Ryan Roberts, akpm, linux-mm, chrisl, yuzhao,
hanchuanhua, linux-kernel, willy, xiang, mhocko, shy828301,
wangkefeng.wang, Barry Song, Hugh Dickins
On Tue, Mar 5, 2024 at 10:12 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Barry Song <21cnbao@gmail.com> writes:
>
> > On Tue, Mar 5, 2024 at 8:55 PM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Barry Song <21cnbao@gmail.com> writes:
> >>
> >> > On Tue, Mar 5, 2024 at 10:15 AM David Hildenbrand <david@redhat.com> wrote:
> >> >> > But we did "resolve" those bugs by entirely untouching all PTEs if we
> >> >> > found some PTEs were skipped in try_to_unmap_one [1].
> >> >> >
> >> >> > While we find we only get the PTL from 2nd, 3rd but not
> >> >> > 1st PTE, we entirely give up on try_to_unmap_one, and leave
> >> >> > all PTEs untouched.
> >> >> >
> >> >> > /* we are not starting from head */
> >> >> > if (!IS_ALIGNED((unsigned long)pvmw.pte, CONT_PTES * sizeof(*pvmw.pte))) {
> >> >> > ret = false;
> >> >> > atomic64_inc(&perf_stat.mapped_walk_start_from_non_head);
> >> >> > set_pte_at(mm, address, pvmw.pte, pteval);
> >> >> > page_vma_mapped_walk_done(&pvmw);
> >> >> > break;
> >> >> > }
> >> >> > This will ensure all PTEs still have a unified state such as CONT-PTE
> >> >> > after try_to_unmap fails.
> >> >> > I feel this could have some false postive because when racing
> >> >> > with unmap, 1st PTE might really become pte_none. So explicitly
> >> >> > holding PTL from 1st PTE seems a better way.
> >> >>
> >> >> Can we estimate the "cost" of holding the PTL?
> >> >>
> >> >
> >> > This is just moving PTL acquisition one or two PTE earlier in those corner
> >> > cases. In normal cases, it doesn't affect when PTL is held.
> >>
> >> The mTHP may be mapped at the end of page table. In that case, the PTL
> >> will be held longer. Or am I missing something?
> >
> > no. this patch doesn't change when we release PTL but change when we
> > get PTL.
> >
> > when the original code iterates nr_pages PTEs in a large folio, it will skip
> > invalid PTEs, when it meets a valid one, it will acquire PTL. so if it gets
> > intermediate PTE values some other threads are modifying, it might
> > skip PTE0, or sometimes PTE0 and PTE1 according to my test. but
> > arriving at PTE2, likely other threads have written a new value, so we
> > will begin to hold PTL and iterate till the end of the large folio.
>
> Is there any guarantee that the mTHP will always be mapped at the
> beginning of the page table (PTE0)? IIUC, mTHP can be mapped at PTE496.
> If so, with your patch, PTL will be held from PTE0 instead of PTE496 in
> some cases.
I agree. But in another discussion[1], the plan is that if we find a large folio
has been deferred split, we split it before try_to_unmap and pageout; otherwise,
we may end up with lots of redundant I/O, because PTE0-495 will still be
pageout()-ed.
[1] https://lore.kernel.org/linux-mm/a4a9054f-2040-4f70-8d10-a5af4972e5aa@arm.com/
>
> --
> Best Regards,
> Huang, Ying
>
> > The proposal is that we directly get PTL from PTE0, thus we don't get
> > intermediate values for the head of nr_pages PTEs. this will ensure
> > a large folio is either completely unmapped or completely mapped.
> > but not partially mapped and partially unmapped.
> >
> >>
> >> --
> >> Best Regards,
> >> Huang, Ying
> >>
> >>
> >> > In normal cases, page_vma_mapped_walk will find PTE0 is present, thus hold
> >> > PTL immediately. in corner cases, page_vma_mapped_walk races with break-
> >> > before-make, after skipping one or two PTEs whose states are transferring,
> >> > it will find a present pte then acquire lock.
> >> >
> >> >> --
> >> >> Cheers,
> >> >>
> >> >> David / dhildenb
> >> >
Thanks
Barry
* Re: [RFC PATCH] mm: hold PTL from the first PTE while reclaiming a large folio
2024-03-05 9:21 ` Barry Song
@ 2024-03-05 10:28 ` Barry Song
0 siblings, 0 replies; 28+ messages in thread
From: Barry Song @ 2024-03-05 10:28 UTC (permalink / raw)
To: Huang, Ying
Cc: David Hildenbrand, Ryan Roberts, akpm, linux-mm, chrisl, yuzhao,
hanchuanhua, linux-kernel, willy, xiang, mhocko, shy828301,
wangkefeng.wang, Barry Song, Hugh Dickins
On Tue, Mar 5, 2024 at 10:21 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Tue, Mar 5, 2024 at 10:12 PM Huang, Ying <ying.huang@intel.com> wrote:
> >
> > Barry Song <21cnbao@gmail.com> writes:
> >
> > > On Tue, Mar 5, 2024 at 8:55 PM Huang, Ying <ying.huang@intel.com> wrote:
> > >>
> > >> Barry Song <21cnbao@gmail.com> writes:
> > >>
> > >> > On Tue, Mar 5, 2024 at 10:15 AM David Hildenbrand <david@redhat.com> wrote:
> > >> >> > But we did "resolve" those bugs by entirely untouching all PTEs if we
> > >> >> > found some PTEs were skipped in try_to_unmap_one [1].
> > >> >> >
> > >> >> > While we find we only get the PTL from 2nd, 3rd but not
> > >> >> > 1st PTE, we entirely give up on try_to_unmap_one, and leave
> > >> >> > all PTEs untouched.
> > >> >> >
> > >> >> > /* we are not starting from head */
> > >> >> > if (!IS_ALIGNED((unsigned long)pvmw.pte, CONT_PTES * sizeof(*pvmw.pte))) {
> > >> >> > ret = false;
> > >> >> > atomic64_inc(&perf_stat.mapped_walk_start_from_non_head);
> > >> >> > set_pte_at(mm, address, pvmw.pte, pteval);
> > >> >> > page_vma_mapped_walk_done(&pvmw);
> > >> >> > break;
> > >> >> > }
> > >> >> > This will ensure all PTEs still have a unified state such as CONT-PTE
> > >> >> > after try_to_unmap fails.
> > >> >> > I feel this could have some false postive because when racing
> > >> >> > with unmap, 1st PTE might really become pte_none. So explicitly
> > >> >> > holding PTL from 1st PTE seems a better way.
> > >> >>
> > >> >> Can we estimate the "cost" of holding the PTL?
> > >> >>
> > >> >
> > >> > This is just moving PTL acquisition one or two PTE earlier in those corner
> > >> > cases. In normal cases, it doesn't affect when PTL is held.
> > >>
> > >> The mTHP may be mapped at the end of page table. In that case, the PTL
> > >> will be held longer. Or am I missing something?
> > >
> > > no. this patch doesn't change when we release PTL but change when we
> > > get PTL.
> > >
> > > when the original code iterates nr_pages PTEs in a large folio, it will skip
> > > invalid PTEs, when it meets a valid one, it will acquire PTL. so if it gets
> > > intermediate PTE values some other threads are modifying, it might
> > > skip PTE0, or sometimes PTE0 and PTE1 according to my test. but
> > > arriving at PTE2, likely other threads have written a new value, so we
> > > will begin to hold PTL and iterate till the end of the large folio.
> >
> > Is there any guarantee that the mTHP will always be mapped at the
> > beginning of the page table (PTE0)? IIUC, mTHP can be mapped at PTE496.
> > If so, with your patch, PTL will be held from PTE0 instead of PTE496 in
> > some cases.
>
> I agree. but in another discussion[1], the plan is if we find a large folio has
> been deferred split, we split it before try_to_unmap and pageout. otherwise,
> we may result in lots of redundant I/O, because PTE0-495 will still be
> pageout()-ed.
>
> [1] https://lore.kernel.org/linux-mm/a4a9054f-2040-4f70-8d10-a5af4972e5aa@arm.com/
I thought about this again; it seems we can cope with it even w/o the above
plan by:
+ if (folio_test_large(folio) && list_empty(&folio->_deferred_list))
+	flags |= TTU_SYNC;
If a folio has been deferred split, it makes no sense to apply the optimization
for the corner cases this patch wants to address. Only when we know the
folio is still entirely mapped do we apply this optimization. This should
reduce the chance to quite small, though a bit still remains.
>
> >
> > --
> > Best Regards,
> > Huang, Ying
> >
> > > The proposal is that we directly get PTL from PTE0, thus we don't get
> > > intermediate values for the head of nr_pages PTEs. this will ensure
> > > a large folio is either completely unmapped or completely mapped.
> > > but not partially mapped and partially unmapped.
> > >
> > >>
Thanks
Barry
* Re: [RFC PATCH] mm: hold PTL from the first PTE while reclaiming a large folio
2024-03-04 21:04 ` Barry Song
2024-03-04 21:15 ` David Hildenbrand
@ 2024-03-04 22:02 ` Ryan Roberts
1 sibling, 0 replies; 28+ messages in thread
From: Ryan Roberts @ 2024-03-04 22:02 UTC (permalink / raw)
To: Barry Song, David Hildenbrand
Cc: akpm, linux-mm, chrisl, yuzhao, hanchuanhua, linux-kernel, willy,
ying.huang, xiang, mhocko, shy828301, wangkefeng.wang,
Barry Song, Hugh Dickins
On 04/03/2024 21:04, Barry Song wrote:
> On Tue, Mar 5, 2024 at 1:41 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 04.03.24 13:20, Ryan Roberts wrote:
>>> Hi Barry,
>>>
>>> On 04/03/2024 10:37, Barry Song wrote:
>>>> From: Barry Song <v-songbaohua@oppo.com>
>>>>
>>>> page_vma_mapped_walk() within try_to_unmap_one() races with other
>>>> PTEs modification such as break-before-make, while iterating PTEs
>>>> of a large folio, it will only begin to acquire PTL after it gets
>>>> a valid(present) PTE. break-before-make intermediately sets PTEs
>>>> to pte_none. Thus, a large folio's PTEs might be partially skipped
>>>> in try_to_unmap_one().
>>>
>>> I just want to check my understanding here - I think the problem occurs for
>>> PTE-mapped, PMD-sized folios as well as smaller-than-PMD-size large folios? Now
>>> that I've had a look at the code and have a better understanding, I think that
>>> must be the case? And therefore this problem exists independently of my work to
>>> support swap-out of mTHP? (From your previous report I was under the impression
>>> that it only affected mTHP).
>>>
>>> Its just that the problem is becoming more pronounced because with mTHP,
>>> PTE-mapped large folios are much more common?
>>
>> That is my understanding.
>>
>>>
>>>> For example, for an anon folio, after try_to_unmap_one(), we may
>>>> have PTE0 present, while PTE1 ~ PTE(nr_pages - 1) are swap entries.
>>>> So folio will be still mapped, the folio fails to be reclaimed.
>>>> What’s even more worrying is, its PTEs are no longer in a unified
>>>> state. This might lead to accident folio_split() afterwards. And
>>>> since a part of PTEs are now swap entries, accessing them will
>>>> incur page fault - do_swap_page.
>>>> It creates both anxiety and more expense. While we can't avoid
>>>> userspace's unmap to break up unified PTEs such as CONT-PTE for
>>>> a large folio, we can indeed keep away from kernel's breaking up
>>>> them due to its code design.
>>>> This patch holds the PTL from PTE0, so the folio is either
>>>> entirely reclaimed or entirely kept. On the other hand, this
>>>> approach doesn't increase PTL contention: even without the patch,
>>>> page_vma_mapped_walk() always acquires the PTL shortly after
>>>> skipping one or two PTEs, because the intermediate
>>>> break-before-make windows are short, according to testing. And
>>>> even without this patch, the vast majority of try_to_unmap_one()
>>>> calls already get the PTL from PTE0; this patch makes that 100%.
>>>> The other option is to give up in try_to_unmap_one() once we find
>>>> that PTE0 is not the first entry for which we hold the PTL,
>>>> calling page_vma_mapped_walk_done() to end the iteration in that
>>>> case. This keeps the PTEs unified while the folio isn't reclaimed.
>>>> The result is quite similar to a small folio with one PTE -
>>>> either entirely reclaimed or entirely kept.
>>>> Reclaiming large folios by holding the PTL from PTE0 seems a
>>>> better option than giving up after detecting that the PTL was
>>>> first taken at a non-PTE0 entry.
>>>>
>>
>> I'm sure that wall of text can be formatted in a better way :). Also, I
>> think we can drop some of the details.
>>
>> If you need some inspiration, I can give it a shot.
>>
>>>> Cc: Hugh Dickins <hughd@google.com>
>>>> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
>>>
>>> Do we need a Fixes tag?
It seems my original question has snowballed a bit. I was conflating this change
with other reports Barry has made where the kernel was panicking (I think?).
Given we are not seeing any incorrect functional behaviour that this change
fixes, I agree we don't need a Fixes tag here.
* Re: [RFC PATCH] mm: hold PTL from the first PTE while reclaiming a large folio
2024-03-04 12:41 ` David Hildenbrand
2024-03-04 13:03 ` Ryan Roberts
2024-03-04 21:04 ` Barry Song
@ 2024-03-05 7:50 ` Huang, Ying
2 siblings, 0 replies; 28+ messages in thread
From: Huang, Ying @ 2024-03-05 7:50 UTC (permalink / raw)
To: David Hildenbrand
Cc: Ryan Roberts, Barry Song, akpm, linux-mm, chrisl, yuzhao,
hanchuanhua, linux-kernel, willy, xiang, mhocko, shy828301,
wangkefeng.wang, Barry Song, Hugh Dickins
David Hildenbrand <david@redhat.com> writes:
>
> As we are seeing more such problems with lockless PT walks, maybe we
> really want some other special value (non-swap entry?) to indicate that
> a PTE is currently undergoing protection changes. That way we'd avoid
> the transient pte_none(), if possible.
This sounds like a good idea. It could also solve other issues caused
by a transient pte_none(), like the following:
https://lore.kernel.org/linux-mm/20240229060907.836589-1-zhangpeng362@huawei.com/
--
Best Regards,
Huang, Ying