* All MADV_FREE mTHPs are fully subjected to deferred_split_folio()
@ 2024-12-29 21:12 Barry Song
2024-12-30 2:14 ` Lance Yang
0 siblings, 1 reply; 10+ messages in thread
From: Barry Song @ 2024-12-29 21:12 UTC (permalink / raw)
To: Linux-MM, Lance Yang, Ryan Roberts, David Hildenbrand,
Baolin Wang, Andrew Morton
Hi Lance,
Along with Ryan, David, Baolin, and anyone else who might be interested,
We’ve noticed an unexpectedly high number of deferred splits. The root
cause appears to be the changes introduced in commit dce7d10be4bbd3
("mm/madvise: optimize lazyfreeing with mTHP in madvise_free"). Since
that commit, split_folio is no longer called in mm/madvise.c.
However, we still perform deferred_split_folio() for all MADV_FREE
mTHPs, even those whose MADV_FREE range is fully aligned with and fully
covers the mTHP. This happens because try_to_unmap_one() takes the
"goto discard" path, which ends up in folio_remove_rmap_pte() adding the
folio to the deferred_split queue as soon as its first PTE is unmapped.
discard:
                if (unlikely(folio_test_hugetlb(folio)))
                        hugetlb_remove_rmap(folio);
                else
                        folio_remove_rmap_pte(folio, subpage, vma);
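For reference, the rmap side behaves roughly as follows (condensed from
memory from __folio_remove_rmap() at the PTE level, so details may
differ slightly between trees): once we drop the first PTE of a large
anon folio whose other PTEs are still mapped, the folio is considered
partially mapped and gets queued for deferred split:

        /* Condensed sketch of __folio_remove_rmap(), RMAP_LEVEL_PTE case */
        ...
        /* some PTEs were just unmapped, but others remain mapped */
        partially_mapped = nr && atomic_read(mapped);
        ...
        if (partially_mapped && folio_test_anon(folio) &&
            !folio_test_partially_mapped(folio))
                deferred_split_folio(folio, true);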
This can also race with the shrinker, deferred_split_scan(): the
shrinker may call folio_try_get(folio), and if that happens while we are
still scanning the second PTE of this folio in try_to_unmap_one(), the
incremented reference count causes the entire mTHP to be transitioned
back to swap-backed.
                        /*
                         * The only page refs must be one from isolation
                         * plus the rmap(s) (dropped by discard:).
                         */
                        if (ref_count == 1 + map_count &&
                            (!folio_test_dirty(folio) ||
                             ...
                             (vma->vm_flags & VM_DROPPABLE))) {
                                dec_mm_counter(mm, MM_ANONPAGES);
                                goto discard;
                        }
It also significantly increases contention on ds_queue->split_queue_lock
during memory reclaim and could introduce further races with the
shrinker as well.
I’m curious whether anyone has suggestions for resolving this issue. My
idea is to use folio_remove_rmap_ptes() to drop all PTEs at once, rather
than folio_remove_rmap_pte(), which processes the PTEs of an mTHP one by
one. This approach would require some changes in try_to_unmap_one(),
such as checking the dirty state of all PTEs and performing a TLB flush
for the entire mTHP as a whole.
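To make the direction a bit more concrete, below is a rough, untested
sketch (the helper names and the exact folio_pte_batch() signature are
from memory and would need double-checking) of what a batched path in
try_to_unmap_one() could look like for a fully-mapped lazyfree mTHP:

        /* Untested sketch only -- not a real patch. */
        if (folio_test_large(folio) && folio_test_anon(folio) &&
            !folio_test_swapbacked(folio)) {
                int max_nr = folio_nr_pages(folio);
                bool any_dirty = false;
                int nr;

                /* How many consecutive PTEs of this folio are mapped here? */
                nr = folio_pte_batch(folio, address, pvmw.pte,
                                     ptep_get(pvmw.pte), max_nr,
                                     FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY,
                                     NULL, NULL, &any_dirty);

                if (nr == max_nr && !any_dirty) {
                        pte_t pteval;

                        /* Clear all PTEs, then flush the TLB for the whole range. */
                        pteval = get_and_clear_full_ptes(mm, address,
                                                         pvmw.pte, nr, 0);
                        flush_tlb_range(vma, address,
                                        address + nr * PAGE_SIZE);

                        if (pte_dirty(pteval)) {
                                /*
                                 * Redirtied under us: restore the PTEs and
                                 * keep the folio, as on the existing abort
                                 * path.
                                 */
                                set_ptes(mm, address, pvmw.pte, pteval, nr);
                                folio_set_swapbacked(folio);
                                /* ... goto walk_abort; */
                        } else {
                                /*
                                 * Dropping all mappings in one call means the
                                 * folio is never seen as partially mapped, so
                                 * it is not added to the deferred_split queue.
                                 */
                                folio_remove_rmap_ptes(folio, &folio->page,
                                                       nr, vma);
                                add_mm_counter(mm, MM_ANONPAGES, -nr);
                                /* ... skip the per-PTE discard path. */
                        }
                }
        }

The existing "ref_count == 1 + map_count" check would of course still
need to be applied to the folio as a whole before discarding it.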
Please let me know if you have any objections or alternative suggestions.
Thanks
Barry
^ permalink raw reply [flat|nested] 10+ messages in thread* Re: All MADV_FREE mTHPs are fully subjected to deferred_split_folio() 2024-12-29 21:12 All MADV_FREE mTHPs are fully subjected to deferred_split_folio() Barry Song @ 2024-12-30 2:14 ` Lance Yang 2024-12-30 9:48 ` David Hildenbrand 0 siblings, 1 reply; 10+ messages in thread From: Lance Yang @ 2024-12-30 2:14 UTC (permalink / raw) To: Barry Song Cc: Linux-MM, Ryan Roberts, David Hildenbrand, Baolin Wang, Andrew Morton Hi Barry, On Mon, Dec 30, 2024 at 5:13 AM Barry Song <21cnbao@gmail.com> wrote: > > Hi Lance, > > Along with Ryan, David, Baolin, and anyone else who might be interested, > > We’ve noticed an unexpectedly high number of deferred splits. The root > cause appears to be the changes introduced in commit dce7d10be4bbd3 > ("mm/madvise: optimize lazyfreeing with mTHP in madvise_free"). Since > that commit, split_folio is no longer called in mm/madvise.c. > > However, we are still performing deferred_split_folio for all > MADV_FREE mTHPs, even for those that are fully aligned with mTHP. > This happens because we execute a goto discard in > try_to_unmap_one(), which eventually leads to > folio_remove_rmap_pte() adding all folios to deferred_split when we > scan the 1st pte in try_to_unmap_one(). > > discard: > if (unlikely(folio_test_hugetlb(folio))) > hugetlb_remove_rmap(folio); > else > folio_remove_rmap_pte(folio, subpage, vma); > > This could lead to a race condition with shrinker - deferred_split_scan(). > The shrinker might call folio_try_get(folio), and while we are scanning > the second PTE of this folio in try_to_unmap_one(), the entire mTHP > could be transitioned back to swap-backed because the reference count > is incremented. > > /* > * The only page refs must be one from isolation > * plus the rmap(s) (dropped by discard:). > */ > if (ref_count == 1 + map_count && > (!folio_test_dirty(folio) || > ... > (vma->vm_flags & VM_DROPPABLE))) { > dec_mm_counter(mm, MM_ANONPAGES); > goto discard; > } > > It also significantly increases contention on ds_queue->split_queue_lock during > memory reclamation and could potentially introduce other race conditions with > shrinker as well. Good catch! > > I’m curious if anyone has suggestions for resolving this issue. My > idea is to use > folio_remove_rmap_ptes to drop all PTEs at once, rather than > folio_remove_rmap_pte, > which processes PTEs one by one for an mTHP. This approach would require some > changes, such as checking the dirty state of PTEs and performing a TLB > flush for the > entire mTHP as a whole in try_to_unmap_one(). Yeah, IHMO, it would also be beneficial to reclaim entire mTHPs as a whole in real-world scenarios where MADV_FREE mTHPs are typically no longer written ;) > > Please let me know if you have any objections or alternative suggestions. Let's hear suggestions from other folks as well ~ Thanks, Lance > > Thanks > Barry ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: All MADV_FREE mTHPs are fully subjected to deferred_split_folio() 2024-12-30 2:14 ` Lance Yang @ 2024-12-30 9:48 ` David Hildenbrand 2024-12-30 11:54 ` Barry Song 0 siblings, 1 reply; 10+ messages in thread From: David Hildenbrand @ 2024-12-30 9:48 UTC (permalink / raw) To: Lance Yang, Barry Song; +Cc: Linux-MM, Ryan Roberts, Baolin Wang, Andrew Morton On 30.12.24 03:14, Lance Yang wrote: > Hi Barry, > > On Mon, Dec 30, 2024 at 5:13 AM Barry Song <21cnbao@gmail.com> wrote: >> >> Hi Lance, >> >> Along with Ryan, David, Baolin, and anyone else who might be interested, >> >> We’ve noticed an unexpectedly high number of deferred splits. The root >> cause appears to be the changes introduced in commit dce7d10be4bbd3 >> ("mm/madvise: optimize lazyfreeing with mTHP in madvise_free"). Since >> that commit, split_folio is no longer called in mm/madvise.c. Hi, I assume you don't see "deferred splits" at all. You see that a folio was added to the deferred split queue to immediately be removed again as it gets freed. Correct? >> >> However, we are still performing deferred_split_folio for all >> MADV_FREE mTHPs, even for those that are fully aligned with mTHP. >> This happens because we execute a goto discard in >> try_to_unmap_one(), which eventually leads to >> folio_remove_rmap_pte() adding all folios to deferred_split when we >> scan the 1st pte in try_to_unmap_one(). >> >> discard: >> if (unlikely(folio_test_hugetlb(folio))) >> hugetlb_remove_rmap(folio); >> else >> folio_remove_rmap_pte(folio, subpage, vma); Yes, that's kind-of know: we neither do PTE batching during unmap for reclaim nor during unmap for migration. We should add that support. But note, just like I raised earlier in the context of similar to "improved partial-mapped logic in rmap code when batching", we are primarily only pleasing counters here. See below on concurrent shrinker. >> >> This could lead to a race condition with shrinker - deferred_split_scan(). >> The shrinker might call folio_try_get(folio), and while we are scanning >> the second PTE of this folio in try_to_unmap_one(), the entire mTHP >> could be transitioned back to swap-backed because the reference count >> is incremented. >> >> /* >> * The only page refs must be one from isolation >> * plus the rmap(s) (dropped by discard:). >> */ >> if (ref_count == 1 + map_count && >> (!folio_test_dirty(folio) || >> ... >> (vma->vm_flags & VM_DROPPABLE))) { >> dec_mm_counter(mm, MM_ANONPAGES); >> goto discard; >> } Reclaim code holds an additional folio reference and has the folio locked. So I don't think this race can really happen in the way you think it could? Please feel free to correct me if I am wrong. >> >> It also significantly increases contention on ds_queue->split_queue_lock during >> memory reclamation and could potentially introduce other race conditions with >> shrinker as well. > > Good catch! > Call me "skeptical" that this is a big issue, at least regarding the shrinker, but also regarding actual lock contention. :) The issue might be less severe than you think: mostly pleasing counters. But yes, there is room for improvement. >> >> I’m curious if anyone has suggestions for resolving this issue. My >> idea is to use >> folio_remove_rmap_ptes to drop all PTEs at once, rather than >> folio_remove_rmap_pte, >> which processes PTEs one by one for an mTHP. This approach would require some >> changes, such as checking the dirty state of PTEs and performing a TLB >> flush for the >> entire mTHP as a whole in try_to_unmap_one(). 
>
> Yeah, IMHO, it would also be beneficial to reclaim entire mTHPs as a whole
> in real-world scenarios where MADV_FREE mTHPs are typically no longer
> written ;)

We should be implementing folio batching. But it won't be able to cover
all cases.

In the future, I envision that during reclaim/access bit scanning, we
determine whether a folio is partially mapped and add it to the deferred
split queue. That's one requirement for getting rid of
folio->_nr_pages_mapped and reliably detecting all partial mappings, but
it also avoids having to mess with this information whenever we
(temporarily) unmap only parts of a folio, like we have here.

Thanks!

-- 
Cheers,

David / dhildenb

^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: All MADV_FREE mTHPs are fully subjected to deferred_split_folio() 2024-12-30 9:48 ` David Hildenbrand @ 2024-12-30 11:54 ` Barry Song 2024-12-30 12:52 ` David Hildenbrand 0 siblings, 1 reply; 10+ messages in thread From: Barry Song @ 2024-12-30 11:54 UTC (permalink / raw) To: David Hildenbrand Cc: Lance Yang, Linux-MM, Ryan Roberts, Baolin Wang, Andrew Morton On Mon, Dec 30, 2024 at 10:48 PM David Hildenbrand <david@redhat.com> wrote: > > On 30.12.24 03:14, Lance Yang wrote: > > Hi Barry, > > > > On Mon, Dec 30, 2024 at 5:13 AM Barry Song <21cnbao@gmail.com> wrote: > >> > >> Hi Lance, > >> > >> Along with Ryan, David, Baolin, and anyone else who might be interested, > >> > >> We’ve noticed an unexpectedly high number of deferred splits. The root > >> cause appears to be the changes introduced in commit dce7d10be4bbd3 > >> ("mm/madvise: optimize lazyfreeing with mTHP in madvise_free"). Since > >> that commit, split_folio is no longer called in mm/madvise.c. > > Hi, > > I assume you don't see "deferred splits" at all. You see that a folio > was added to the deferred split queue to immediately be removed again as > it gets freed. Correct? > > >> > >> However, we are still performing deferred_split_folio for all > >> MADV_FREE mTHPs, even for those that are fully aligned with mTHP. > >> This happens because we execute a goto discard in > >> try_to_unmap_one(), which eventually leads to > >> folio_remove_rmap_pte() adding all folios to deferred_split when we > >> scan the 1st pte in try_to_unmap_one(). > >> > >> discard: > >> if (unlikely(folio_test_hugetlb(folio))) > >> hugetlb_remove_rmap(folio); > >> else > >> folio_remove_rmap_pte(folio, subpage, vma); > > Yes, that's kind-of know: we neither do PTE batching during unmap for > reclaim nor during unmap for migration. We should add that support. > > But note, just like I raised earlier in the context of similar to > "improved partial-mapped logic in rmap code when batching", we are > primarily only pleasing counters here. > > See below on concurrent shrinker. > > >> > >> This could lead to a race condition with shrinker - deferred_split_scan(). > >> The shrinker might call folio_try_get(folio), and while we are scanning > >> the second PTE of this folio in try_to_unmap_one(), the entire mTHP > >> could be transitioned back to swap-backed because the reference count > >> is incremented. > >> > >> /* > >> * The only page refs must be one from isolation > >> * plus the rmap(s) (dropped by discard:). > >> */ > >> if (ref_count == 1 + map_count && > >> (!folio_test_dirty(folio) || > >> ... > >> (vma->vm_flags & VM_DROPPABLE))) { > >> dec_mm_counter(mm, MM_ANONPAGES); > >> goto discard; > >> } > > > Reclaim code holds an additional folio reference and has the folio > locked. So I don't think this race can really happen in the way you > think it could? Please feel free to correct me if I am wrong. try_to_unmap_one will only execute "goto discard" and remove the rmap if ref_count == 1 + map_count. An additional ref_count + 1 from the shrinker can invalidate this condition, leading to the restoration of the PTE and setting the folio as swap-backed. /* * The only page refs must be one from isolation * plus the rmap(s) (dropped by discard:). */ if (ref_count == 1 + map_count && (!folio_test_dirty(folio) || /* * Unlike MADV_FREE mappings, VM_DROPPABLE * ones can be dropped even if they've * been dirtied. */ (vma->vm_flags & VM_DROPPABLE))) { dec_mm_counter(mm, MM_ANONPAGES); goto discard; } /* * If the folio was redirtied, it cannot be * discarded. 
Remap the page to page table. */ set_pte_at(mm, address, pvmw.pte, pteval); /* * Unlike MADV_FREE mappings, VM_DROPPABLE ones * never get swap backed on failure to drop. */ if (!(vma->vm_flags & VM_DROPPABLE)) folio_set_swapbacked(folio); goto walk_abort; > > >> > >> It also significantly increases contention on ds_queue->split_queue_lock during > >> memory reclamation and could potentially introduce other race conditions with > >> shrinker as well. > > > > Good catch! > > > > Call me "skeptical" that this is a big issue, at least regarding the > shrinker, but also regarding actual lock contention. :) > > The issue might be less severe than you think: mostly pleasing counters. > But yes, there is room for improvement. > > >> > >> I’m curious if anyone has suggestions for resolving this issue. My > >> idea is to use > >> folio_remove_rmap_ptes to drop all PTEs at once, rather than > >> folio_remove_rmap_pte, > >> which processes PTEs one by one for an mTHP. This approach would require some > >> changes, such as checking the dirty state of PTEs and performing a TLB > >> flush for the > >> entire mTHP as a whole in try_to_unmap_one(). > > > > Yeah, IHMO, it would also be beneficial to reclaim entire mTHPs as a whole > > in real-world scenarios where MADV_FREE mTHPs are typically no longer > > written ;) > > > We should be implementing folio batching. But it won't be able to cover > all cases. > > In the future, I envision that during reclaim/access bit scanning, we > determine whether a folio is partially mapped and add it to the deferred > split queue. That's one requirement for getting rid of > folio->_nr_page_mapped and reliably detecting all partial mappings, but > it also avoids having to messing with this information whenever we > (temporarily) unmap only parts of a folio, like we have here. > > Thanks! > > -- > Cheers, > > David / dhildenb > ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: All MADV_FREE mTHPs are fully subjected to deferred_split_folio() 2024-12-30 11:54 ` Barry Song @ 2024-12-30 12:52 ` David Hildenbrand 2024-12-30 16:02 ` Lance Yang 2024-12-30 19:19 ` Barry Song 0 siblings, 2 replies; 10+ messages in thread From: David Hildenbrand @ 2024-12-30 12:52 UTC (permalink / raw) To: Barry Song; +Cc: Lance Yang, Linux-MM, Ryan Roberts, Baolin Wang, Andrew Morton On 30.12.24 12:54, Barry Song wrote: > On Mon, Dec 30, 2024 at 10:48 PM David Hildenbrand <david@redhat.com> wrote: >> >> On 30.12.24 03:14, Lance Yang wrote: >>> Hi Barry, >>> >>> On Mon, Dec 30, 2024 at 5:13 AM Barry Song <21cnbao@gmail.com> wrote: >>>> >>>> Hi Lance, >>>> >>>> Along with Ryan, David, Baolin, and anyone else who might be interested, >>>> >>>> We’ve noticed an unexpectedly high number of deferred splits. The root >>>> cause appears to be the changes introduced in commit dce7d10be4bbd3 >>>> ("mm/madvise: optimize lazyfreeing with mTHP in madvise_free"). Since >>>> that commit, split_folio is no longer called in mm/madvise.c. >> >> Hi, >> >> I assume you don't see "deferred splits" at all. You see that a folio >> was added to the deferred split queue to immediately be removed again as >> it gets freed. Correct? >> >>>> >>>> However, we are still performing deferred_split_folio for all >>>> MADV_FREE mTHPs, even for those that are fully aligned with mTHP. >>>> This happens because we execute a goto discard in >>>> try_to_unmap_one(), which eventually leads to >>>> folio_remove_rmap_pte() adding all folios to deferred_split when we >>>> scan the 1st pte in try_to_unmap_one(). >>>> >>>> discard: >>>> if (unlikely(folio_test_hugetlb(folio))) >>>> hugetlb_remove_rmap(folio); >>>> else >>>> folio_remove_rmap_pte(folio, subpage, vma); >> >> Yes, that's kind-of know: we neither do PTE batching during unmap for >> reclaim nor during unmap for migration. We should add that support. >> >> But note, just like I raised earlier in the context of similar to >> "improved partial-mapped logic in rmap code when batching", we are >> primarily only pleasing counters here. >> >> See below on concurrent shrinker. >> >>>> >>>> This could lead to a race condition with shrinker - deferred_split_scan(). >>>> The shrinker might call folio_try_get(folio), and while we are scanning >>>> the second PTE of this folio in try_to_unmap_one(), the entire mTHP >>>> could be transitioned back to swap-backed because the reference count >>>> is incremented. >>>> >>>> /* >>>> * The only page refs must be one from isolation >>>> * plus the rmap(s) (dropped by discard:). >>>> */ >>>> if (ref_count == 1 + map_count && >>>> (!folio_test_dirty(folio) || >>>> ... >>>> (vma->vm_flags & VM_DROPPABLE))) { >>>> dec_mm_counter(mm, MM_ANONPAGES); >>>> goto discard; >>>> } >> >> >> Reclaim code holds an additional folio reference and has the folio >> locked. So I don't think this race can really happen in the way you >> think it could? Please feel free to correct me if I am wrong. > > try_to_unmap_one will only execute "goto discard" and remove the rmap if > ref_count == 1 + map_count. An additional ref_count + 1 from the shrinker > can invalidate this condition, leading to the restoration of the PTE and setting > the folio as swap-backed. > > /* > * The only page refs must be one from isolation > * plus the rmap(s) (dropped by discard:). > */ > if (ref_count == 1 + map_count && > (!folio_test_dirty(folio) || > /* > * Unlike MADV_FREE mappings, VM_DROPPABLE > * ones can be dropped even if they've > * been dirtied. 
> */ > (vma->vm_flags & VM_DROPPABLE))) { > dec_mm_counter(mm, MM_ANONPAGES); > goto discard; > } > > /* > * If the folio was redirtied, it cannot be > * discarded. Remap the page to page table. > */ > set_pte_at(mm, address, pvmw.pte, pteval); > /* > * Unlike MADV_FREE mappings, VM_DROPPABLE ones > * never get swap backed on failure to drop. > */ > if (!(vma->vm_flags & VM_DROPPABLE)) > folio_set_swapbacked(folio); > goto walk_abort; Ah, that's what you mean. Yes, but the shrinker behaves mostly like just any other speculative reference. So we're not actually handling speculative references here correctly, so this issue is not completely shrinker-specific. Maybe, we should be doing something like this? /* * Unlike MADV_FREE mappings, VM_DROPPABLE ones can be dropped even if * they've been dirtied. */ if (folio_test_dirty(folio) && !(vma->vm_flags & VM_DROPPABLE)) { /* * redirtied either using the page table or a previously * obtained GUP reference. */ set_pte_at(mm, address, pvmw.pte, pteval); folio_set_swapbacked(folio); goto walk_abort; } else if (ref_count != 1 + map_count) { /* * Additional reference. Could be a GUP reference or any * speculative reference. GUP users must mark the folio dirty if * there was a modification. This folio cannot be reclaimed * right now either way, so act just like nothing happened. * We'll come back here later and detect if the folio was * dirtied when the additional reference is gone. */ set_pte_at(mm, address, pvmw.pte, pteval); goto walk_abort; } goto discard; Probably cleaning up goto labels. -- Cheers, David / dhildenb ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: All MADV_FREE mTHPs are fully subjected to deferred_split_folio() 2024-12-30 12:52 ` David Hildenbrand @ 2024-12-30 16:02 ` Lance Yang 2024-12-30 19:19 ` Barry Song 1 sibling, 0 replies; 10+ messages in thread From: Lance Yang @ 2024-12-30 16:02 UTC (permalink / raw) To: David Hildenbrand Cc: Barry Song, Linux-MM, Ryan Roberts, Baolin Wang, Andrew Morton On Mon, Dec 30, 2024 at 8:52 PM David Hildenbrand <david@redhat.com> wrote: > > On 30.12.24 12:54, Barry Song wrote: > > On Mon, Dec 30, 2024 at 10:48 PM David Hildenbrand <david@redhat.com> wrote: > >> > >> On 30.12.24 03:14, Lance Yang wrote: > >>> Hi Barry, > >>> > >>> On Mon, Dec 30, 2024 at 5:13 AM Barry Song <21cnbao@gmail.com> wrote: > >>>> > >>>> Hi Lance, > >>>> > >>>> Along with Ryan, David, Baolin, and anyone else who might be interested, > >>>> > >>>> We’ve noticed an unexpectedly high number of deferred splits. The root > >>>> cause appears to be the changes introduced in commit dce7d10be4bbd3 > >>>> ("mm/madvise: optimize lazyfreeing with mTHP in madvise_free"). Since > >>>> that commit, split_folio is no longer called in mm/madvise.c. > >> > >> Hi, > >> > >> I assume you don't see "deferred splits" at all. You see that a folio > >> was added to the deferred split queue to immediately be removed again as > >> it gets freed. Correct? > >> > >>>> > >>>> However, we are still performing deferred_split_folio for all > >>>> MADV_FREE mTHPs, even for those that are fully aligned with mTHP. > >>>> This happens because we execute a goto discard in > >>>> try_to_unmap_one(), which eventually leads to > >>>> folio_remove_rmap_pte() adding all folios to deferred_split when we > >>>> scan the 1st pte in try_to_unmap_one(). > >>>> > >>>> discard: > >>>> if (unlikely(folio_test_hugetlb(folio))) > >>>> hugetlb_remove_rmap(folio); > >>>> else > >>>> folio_remove_rmap_pte(folio, subpage, vma); > >> > >> Yes, that's kind-of know: we neither do PTE batching during unmap for > >> reclaim nor during unmap for migration. We should add that support. > >> > >> But note, just like I raised earlier in the context of similar to > >> "improved partial-mapped logic in rmap code when batching", we are > >> primarily only pleasing counters here. > >> > >> See below on concurrent shrinker. > >> > >>>> > >>>> This could lead to a race condition with shrinker - deferred_split_scan(). > >>>> The shrinker might call folio_try_get(folio), and while we are scanning > >>>> the second PTE of this folio in try_to_unmap_one(), the entire mTHP > >>>> could be transitioned back to swap-backed because the reference count > >>>> is incremented. > >>>> > >>>> /* > >>>> * The only page refs must be one from isolation > >>>> * plus the rmap(s) (dropped by discard:). > >>>> */ > >>>> if (ref_count == 1 + map_count && > >>>> (!folio_test_dirty(folio) || > >>>> ... > >>>> (vma->vm_flags & VM_DROPPABLE))) { > >>>> dec_mm_counter(mm, MM_ANONPAGES); > >>>> goto discard; > >>>> } > >> > >> > >> Reclaim code holds an additional folio reference and has the folio > >> locked. So I don't think this race can really happen in the way you > >> think it could? Please feel free to correct me if I am wrong. > > > > try_to_unmap_one will only execute "goto discard" and remove the rmap if > > ref_count == 1 + map_count. An additional ref_count + 1 from the shrinker > > can invalidate this condition, leading to the restoration of the PTE and setting > > the folio as swap-backed. > > > > /* > > * The only page refs must be one from isolation > > * plus the rmap(s) (dropped by discard:). 
> > */ > > if (ref_count == 1 + map_count && > > (!folio_test_dirty(folio) || > > /* > > * Unlike MADV_FREE mappings, VM_DROPPABLE > > * ones can be dropped even if they've > > * been dirtied. > > */ > > (vma->vm_flags & VM_DROPPABLE))) { > > dec_mm_counter(mm, MM_ANONPAGES); > > goto discard; > > } > > > > /* > > * If the folio was redirtied, it cannot be > > * discarded. Remap the page to page table. > > */ > > set_pte_at(mm, address, pvmw.pte, pteval); > > /* > > * Unlike MADV_FREE mappings, VM_DROPPABLE ones > > * never get swap backed on failure to drop. > > */ > > if (!(vma->vm_flags & VM_DROPPABLE)) > > folio_set_swapbacked(folio); > > goto walk_abort; > > Ah, that's what you mean. Yes, but the shrinker behaves mostly like just > any other speculative reference. > > So we're not actually handling speculative references here correctly, so > this issue is not completely shrinker-specific. > > Maybe, we should be doing something like this? > > /* > * Unlike MADV_FREE mappings, VM_DROPPABLE ones can be dropped even if > * they've been dirtied. > */ > if (folio_test_dirty(folio) && !(vma->vm_flags & VM_DROPPABLE)) { > /* > * redirtied either using the page table or a previously > * obtained GUP reference. > */ > set_pte_at(mm, address, pvmw.pte, pteval); > folio_set_swapbacked(folio); > goto walk_abort; > } else if (ref_count != 1 + map_count) { > /* > * Additional reference. Could be a GUP reference or any > * speculative reference. GUP users must mark the folio dirty if > * there was a modification. This folio cannot be reclaimed > * right now either way, so act just like nothing happened. > * We'll come back here later and detect if the folio was > * dirtied when the additional reference is gone. > */ > set_pte_at(mm, address, pvmw.pte, pteval); > goto walk_abort; > } > goto discard; > > > Probably cleaning up goto labels. Nice! It looks simple and easy to understand ;) It might be reasonable to temporarily ignore the folio and check later whether it was marked dirty after the reference is released, as the additional reference could be a GUP or another speculative reference. Personally, I also prefer this approach. Thanks, Lance > > > -- > Cheers, > > David / dhildenb > ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: All MADV_FREE mTHPs are fully subjected to deferred_split_folio() 2024-12-30 12:52 ` David Hildenbrand 2024-12-30 16:02 ` Lance Yang @ 2024-12-30 19:19 ` Barry Song 2024-12-30 19:32 ` David Hildenbrand 1 sibling, 1 reply; 10+ messages in thread From: Barry Song @ 2024-12-30 19:19 UTC (permalink / raw) To: David Hildenbrand Cc: Lance Yang, Linux-MM, Ryan Roberts, Baolin Wang, Andrew Morton On Tue, Dec 31, 2024 at 1:52 AM David Hildenbrand <david@redhat.com> wrote: > > On 30.12.24 12:54, Barry Song wrote: > > On Mon, Dec 30, 2024 at 10:48 PM David Hildenbrand <david@redhat.com> wrote: > >> > >> On 30.12.24 03:14, Lance Yang wrote: > >>> Hi Barry, > >>> > >>> On Mon, Dec 30, 2024 at 5:13 AM Barry Song <21cnbao@gmail.com> wrote: > >>>> > >>>> Hi Lance, > >>>> > >>>> Along with Ryan, David, Baolin, and anyone else who might be interested, > >>>> > >>>> We’ve noticed an unexpectedly high number of deferred splits. The root > >>>> cause appears to be the changes introduced in commit dce7d10be4bbd3 > >>>> ("mm/madvise: optimize lazyfreeing with mTHP in madvise_free"). Since > >>>> that commit, split_folio is no longer called in mm/madvise.c. > >> > >> Hi, > >> > >> I assume you don't see "deferred splits" at all. You see that a folio > >> was added to the deferred split queue to immediately be removed again as > >> it gets freed. Correct? > >> > >>>> > >>>> However, we are still performing deferred_split_folio for all > >>>> MADV_FREE mTHPs, even for those that are fully aligned with mTHP. > >>>> This happens because we execute a goto discard in > >>>> try_to_unmap_one(), which eventually leads to > >>>> folio_remove_rmap_pte() adding all folios to deferred_split when we > >>>> scan the 1st pte in try_to_unmap_one(). > >>>> > >>>> discard: > >>>> if (unlikely(folio_test_hugetlb(folio))) > >>>> hugetlb_remove_rmap(folio); > >>>> else > >>>> folio_remove_rmap_pte(folio, subpage, vma); > >> > >> Yes, that's kind-of know: we neither do PTE batching during unmap for > >> reclaim nor during unmap for migration. We should add that support. > >> > >> But note, just like I raised earlier in the context of similar to > >> "improved partial-mapped logic in rmap code when batching", we are > >> primarily only pleasing counters here. > >> > >> See below on concurrent shrinker. > >> > >>>> > >>>> This could lead to a race condition with shrinker - deferred_split_scan(). > >>>> The shrinker might call folio_try_get(folio), and while we are scanning > >>>> the second PTE of this folio in try_to_unmap_one(), the entire mTHP > >>>> could be transitioned back to swap-backed because the reference count > >>>> is incremented. > >>>> > >>>> /* > >>>> * The only page refs must be one from isolation > >>>> * plus the rmap(s) (dropped by discard:). > >>>> */ > >>>> if (ref_count == 1 + map_count && > >>>> (!folio_test_dirty(folio) || > >>>> ... > >>>> (vma->vm_flags & VM_DROPPABLE))) { > >>>> dec_mm_counter(mm, MM_ANONPAGES); > >>>> goto discard; > >>>> } > >> > >> > >> Reclaim code holds an additional folio reference and has the folio > >> locked. So I don't think this race can really happen in the way you > >> think it could? Please feel free to correct me if I am wrong. > > > > try_to_unmap_one will only execute "goto discard" and remove the rmap if > > ref_count == 1 + map_count. An additional ref_count + 1 from the shrinker > > can invalidate this condition, leading to the restoration of the PTE and setting > > the folio as swap-backed. 
> > > > /* > > * The only page refs must be one from isolation > > * plus the rmap(s) (dropped by discard:). > > */ > > if (ref_count == 1 + map_count && > > (!folio_test_dirty(folio) || > > /* > > * Unlike MADV_FREE mappings, VM_DROPPABLE > > * ones can be dropped even if they've > > * been dirtied. > > */ > > (vma->vm_flags & VM_DROPPABLE))) { > > dec_mm_counter(mm, MM_ANONPAGES); > > goto discard; > > } > > > > /* > > * If the folio was redirtied, it cannot be > > * discarded. Remap the page to page table. > > */ > > set_pte_at(mm, address, pvmw.pte, pteval); > > /* > > * Unlike MADV_FREE mappings, VM_DROPPABLE ones > > * never get swap backed on failure to drop. > > */ > > if (!(vma->vm_flags & VM_DROPPABLE)) > > folio_set_swapbacked(folio); > > goto walk_abort; > > Ah, that's what you mean. Yes, but the shrinker behaves mostly like just > any other speculative reference. > > So we're not actually handling speculative references here correctly, so > this issue is not completely shrinker-specific. > > Maybe, we should be doing something like this? > > /* > * Unlike MADV_FREE mappings, VM_DROPPABLE ones can be dropped even if > * they've been dirtied. > */ > if (folio_test_dirty(folio) && !(vma->vm_flags & VM_DROPPABLE)) { > /* > * redirtied either using the page table or a previously > * obtained GUP reference. > */ > set_pte_at(mm, address, pvmw.pte, pteval); > folio_set_swapbacked(folio); > goto walk_abort; > } else if (ref_count != 1 + map_count) { > /* > * Additional reference. Could be a GUP reference or any > * speculative reference. GUP users must mark the folio dirty if > * there was a modification. This folio cannot be reclaimed > * right now either way, so act just like nothing happened. > * We'll come back here later and detect if the folio was > * dirtied when the additional reference is gone. > */ > set_pte_at(mm, address, pvmw.pte, pteval); > goto walk_abort; > } > goto discard; > I agree that this is necessary, but I'm not sure it addresses my concerns. MADV_FREE'ed mTHPs are still being added to `deferred_split`, and this does not resolve the issue of them being partially unmapped though it is definitely better than the existing code, at least folios are not moved back to swap-backed. On the other hand, users might rely on the `deferred_split` counter to assess how aggressively userspace is performing address/size unaligned operations like MADV_DONTNEED or unmapped behavior. However, our debugging shows that the majority of `deferred_split` counter increments result from aligned MADV_FREE operations. This diminishes the counter's usefulness in reflecting unaligned userspace behavior. If possible, I am still looking for some approach to entirely avoid adding the folio to deferred_split and partially being unmapped. Could the concept be something like this? if (folio_test_dirty(folio) && !(vma->vm_flags & VM_DROPPABLE)) { /* * redirtied either using the page table or a previously * obtained GUP reference. */ set_pte_at(mm, address, pvmw.pte, pteval); folio_set_swapbacked(folio); /* remove the rmap for the subpages before the current subpage */ folio_remove_rmap_ptes(folio, head_page, (address - start_address)/PAGE_SIZE, vma); goto walk_abort; } else if (ref_count != 1 + map_count) { /* * Additional reference. Could be a GUP reference or any * speculative reference. GUP users must mark the folio dirty if * there was a modification. This folio cannot be reclaimed * right now either way, so act just like nothing happened. 
* We'll come back here later and detect if the folio was * dirtied when the additional reference is gone. */ set_pte_at(mm, address, pvmw.pte, pteval); /* remove the rmap for the subpages before the current subpage */ folio_remove_rmap_ptes(folio, head_page, (address - start_address)/PAGE_SIZE, vma); goto walk_abort; } continue; /* don't remove rmap one by one */ > > Probably cleaning up goto labels. > > > -- > Cheers, > > David / dhildenb > Thanks Barry ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: All MADV_FREE mTHPs are fully subjected to deferred_split_folio()
  2024-12-30 19:19 ` Barry Song
@ 2024-12-30 19:32 ` David Hildenbrand
  2024-12-30 20:22 ` Barry Song
  0 siblings, 1 reply; 10+ messages in thread
From: David Hildenbrand @ 2024-12-30 19:32 UTC (permalink / raw)
To: Barry Song; +Cc: Lance Yang, Linux-MM, Ryan Roberts, Baolin Wang, Andrew Morton

>> goto discard;
>>
>
> I agree that this is necessary, but I'm not sure it addresses my
> concerns. MADV_FREE'ed mTHPs are still being added to `deferred_split`,
> and this does not resolve the issue of them being partially unmapped
> though it is definitely better than the existing code, at least folios are
> not moved back to swap-backed.
>
> On the other hand, users might rely on the `deferred_split` counter to
> assess how aggressively userspace is performing address/size unaligned
> operations like MADV_DONTNEED or unmapped behavior. However, our debugging
> shows that the majority of `deferred_split` counter increments result from
> aligned MADV_FREE operations. This diminishes the counter's usefulness
> in reflecting unaligned userspace behavior.

Optimizing that is certainly something to look into, but the bigger
issue you describe arises from bad handling of speculative references.

Just imagine you indeed have a partially-mapped anon folio and the
remaining pages are MADV_FREE'ed. The problem with the speculative
reference would still apply.

>
> If possible, I am still looking for some approach to entirely avoid
> adding the folio to deferred_split and partially being unmapped.
>
> Could the concept be something like this?

Very likely it's wrong, because you really have to ensure that the
folio range is mapped here.

Proper folio PTE batching should be applied here -- folio_pte_batch() etc.

That can please the counters in many, but not all cases. Again, maybe
the deferred-split handling should be handled differently, and not
synchronously from rmap code.

I see 3 different work items:

1) Fix mis-handling of speculative references

2) Perform proper PTE batching during unmap/migration. Will improve
   performance in any case.

3) Try moving deferred-split handling out of rmap code into reclaim/
   access-bit handling.

-- 
Cheers,

David / dhildenb

^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: All MADV_FREE mTHPs are fully subjected to deferred_split_folio() 2024-12-30 19:32 ` David Hildenbrand @ 2024-12-30 20:22 ` Barry Song 2024-12-30 20:31 ` David Hildenbrand 0 siblings, 1 reply; 10+ messages in thread From: Barry Song @ 2024-12-30 20:22 UTC (permalink / raw) To: David Hildenbrand Cc: Lance Yang, Linux-MM, Ryan Roberts, Baolin Wang, Andrew Morton On Tue, Dec 31, 2024 at 8:32 AM David Hildenbrand <david@redhat.com> wrote: > > >> goto discard; > >> > > > > I agree that this is necessary, but I'm not sure it addresses my > > concerns. MADV_FREE'ed mTHPs are still being added to `deferred_split`, > > and this does not resolve the issue of them being partially unmapped > > though it is definitely better than the existing code, at least folios are > > not moved back to swap-backed. > > > On the other hand, users might rely on the `deferred_split` counter to > > assess how aggressively userspace is performing address/size unaligned > > operations > > like MADV_DONTNEED or unmapped behavior. However, our debugging shows > > that the majority of `deferred_split` counter increments result from > > aligned MADV_FREE operations. This diminishes the counter's usefulness > > in reflecting unaligned userspace behavior. > > Optimizing that is certainly something to look into, but the bigger > issue you describe arises from bad handling of speculative references. > > Just imagine you indeed have a partially-mapped anon folio and the > remaining pages are MADV_FREE'ed. The problem with the speculative > reference would still apply. > > > > > If possible, I am still looking for some approach to entirely avoid > > adding the folio to deferred_split and partially being unmapped. > > > > Could the concept be something like this? > > Very likely it's wrong, because you really have to assure that that > folio range is mapped here. > > Proper folio PTE batching should be applied here -- folio_pte_batch() etc. > I agree that using `folio_pte_batch()` to check if all PTEs are mapped and determining `any_dirty` for setting swap-backed is the right direction. I'm just curious if `(!list_empty(&folio->_deferred_list))` or `folio_test_partially_mapped(folio)` could replace it if we're aiming for a smaller change :-) > That can please the counters in many, but not all cases. Again, maybe > the deferred-split handling should be handled differently, and not > synchronously from rmap code. > > I see 3 different work items > > 1) Fix mis-handling of speculative references > Agreed, the patch you're sending is absolutely necessary. I'd prefer it lands sooner in some way. Would you like to post it? > 2) Perform proper PTE batching during unmap/migration. Will improve > performance in any case. Agreed. I remember discussing this with Ryan in an email thread about a year ago, even for normal (non-MADV_FREE'ed) folios, but it seems everyone has been busy with other priorities. This seems like a good time to start exploring the idea. We could begin with MADV_FREE'ed folios and later extend it to normal folios—for instance, by implementing batched setting of swap entries. > > 3) Try moving deferred-split handling out of rmap code into reclaim/ > access-bit handling. I'm not quite sure we still need this after having 1 and 2. With those, we've been able to operate on the mTHP as a whole. Do we still need to move deferred_split out of rmap? > > -- > Cheers, > > David / dhildenb > Thanks Barry ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: All MADV_FREE mTHPs are fully subjected to deferred_split_folio() 2024-12-30 20:22 ` Barry Song @ 2024-12-30 20:31 ` David Hildenbrand 0 siblings, 0 replies; 10+ messages in thread From: David Hildenbrand @ 2024-12-30 20:31 UTC (permalink / raw) To: Barry Song; +Cc: Lance Yang, Linux-MM, Ryan Roberts, Baolin Wang, Andrew Morton On 30.12.24 21:22, Barry Song wrote: > On Tue, Dec 31, 2024 at 8:32 AM David Hildenbrand <david@redhat.com> wrote: >> >>>> goto discard; >>>> >>> >>> I agree that this is necessary, but I'm not sure it addresses my >>> concerns. MADV_FREE'ed mTHPs are still being added to `deferred_split`, >>> and this does not resolve the issue of them being partially unmapped >>> though it is definitely better than the existing code, at least folios are >>> not moved back to swap-backed. >> > > On the other hand, users might rely on the `deferred_split` counter to >>> assess how aggressively userspace is performing address/size unaligned >>> operations >>> like MADV_DONTNEED or unmapped behavior. However, our debugging shows >>> that the majority of `deferred_split` counter increments result from >>> aligned MADV_FREE operations. This diminishes the counter's usefulness >>> in reflecting unaligned userspace behavior. >> >> Optimizing that is certainly something to look into, but the bigger >> issue you describe arises from bad handling of speculative references. >> >> Just imagine you indeed have a partially-mapped anon folio and the >> remaining pages are MADV_FREE'ed. The problem with the speculative >> reference would still apply. >> >>> >>> If possible, I am still looking for some approach to entirely avoid >>> adding the folio to deferred_split and partially being unmapped. >>> >>> Could the concept be something like this? >> >> Very likely it's wrong, because you really have to assure that that >> folio range is mapped here. >> >> Proper folio PTE batching should be applied here -- folio_pte_batch() etc. >> > > I agree that using `folio_pte_batch()` to check if all PTEs are mapped and > determining `any_dirty` for setting swap-backed is the right direction. I'm > just curious if `(!list_empty(&folio->_deferred_list))` or > `folio_test_partially_mapped(folio)` could replace it if we're aiming for > a smaller change :-) Likely we should just do it cleanly using PTE batching without any such optimizations. All unmapping (not just MADV_FREE) will benefit from PTE batching :) > >> That can please the counters in many, but not all cases. Again, maybe >> the deferred-split handling should be handled differently, and not >> synchronously from rmap code. >> >> I see 3 different work items >> >> 1) Fix mis-handling of speculative references >> > > Agreed, the patch you're sending is absolutely necessary. I'd prefer > it lands sooner in some way. Would you like to post it? I'm out until the 7th. I'll be happy if you could clean it up, test it and send it out (Suggested-by: is sufficient). > >> 2) Perform proper PTE batching during unmap/migration. Will improve >> performance in any case. > > Agreed. I remember discussing this with Ryan in an email thread about > a year ago, even for normal (non-MADV_FREE'ed) folios, but it seems > everyone has been busy with other priorities. Unfortunately, yes. > > This seems like a good time to start exploring the idea. We could begin > with MADV_FREE'ed folios and later extend it to normal folios—for > instance, by implementing batched setting of swap entries. 
> >> >> 3) Try moving deferred-split handling out of rmap code into reclaim/ >> access-bit handling. > > I'm not quite sure we still need this after having 1 and 2. With those, > we've been able to operate on the mTHP as a whole. Do we still > need to move deferred_split out of rmap? I have various things in mind, like having mremap'ed parts of the folio (cannot batch), user space zapping the folio in smaller chunks (cannot batch), and ... removing folio->_nr_pages_mapped + page->_mapcount, where we might not always detect "partially mapped" from rmap code and want to detect that separately either way. Ideally, we'd just move deferred-split handling out of rmap code, and trigger that detection+handling from reclaim code. After all, the only purpose of deferred-split is .. memory reclaim. Anyhow, 1) and 2) are beneficial independent of 3). -- Cheers, David / dhildenb ^ permalink raw reply [flat|nested] 10+ messages in thread