From: Barry Song <21cnbao@gmail.com>
To: David Hildenbrand <david@redhat.com>
Cc: Lance Yang <ioworker0@gmail.com>, Linux-MM <linux-mm@kvack.org>,
Ryan Roberts <ryan.roberts@arm.com>,
Baolin Wang <baolin.wang@linux.alibaba.com>,
Andrew Morton <akpm@linux-foundation.org>
Subject: Re: All MADV_FREE mTHPs are fully subjected to deferred_split_folio()
Date: Tue, 31 Dec 2024 08:19:54 +1300 [thread overview]
Message-ID: <CAGsJ_4z94HqGt8mVMYABnMQ5jOhNyztmqB5bOqqE6MSNx6vgAA@mail.gmail.com> (raw)
In-Reply-To: <8690de27-a1be-4440-a2d6-1a5cc56dcceb@redhat.com>
On Tue, Dec 31, 2024 at 1:52 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 30.12.24 12:54, Barry Song wrote:
> > On Mon, Dec 30, 2024 at 10:48 PM David Hildenbrand <david@redhat.com> wrote:
> >>
> >> On 30.12.24 03:14, Lance Yang wrote:
> >>> Hi Barry,
> >>>
> >>> On Mon, Dec 30, 2024 at 5:13 AM Barry Song <21cnbao@gmail.com> wrote:
> >>>>
> >>>> Hi Lance,
> >>>>
> >>>> Along with Ryan, David, Baolin, and anyone else who might be interested,
> >>>>
> >>>> We’ve noticed an unexpectedly high number of deferred splits. The root
> >>>> cause appears to be the changes introduced in commit dce7d10be4bbd3
> >>>> ("mm/madvise: optimize lazyfreeing with mTHP in madvise_free"). Since
> >>>> that commit, split_folio is no longer called in mm/madvise.c.
> >>
> >> Hi,
> >>
> >> I assume you don't see "deferred splits" at all. You see that a folio
> >> was added to the deferred split queue to immediately be removed again as
> >> it gets freed. Correct?
> >>
> >>>>
> >>>> However, we are still performing deferred_split_folio for all
> >>>> MADV_FREE mTHPs, even for those that are fully aligned with mTHP.
> >>>> This happens because we execute a goto discard in
> >>>> try_to_unmap_one(), which eventually leads to
> >>>> folio_remove_rmap_pte() adding all folios to deferred_split when we
> >>>> scan the 1st pte in try_to_unmap_one().
> >>>>
> >>>> discard:
> >>>> 	if (unlikely(folio_test_hugetlb(folio)))
> >>>> 		hugetlb_remove_rmap(folio);
> >>>> 	else
> >>>> 		folio_remove_rmap_pte(folio, subpage, vma);
> >>
> >> Yes, that's kind-of known: we neither do PTE batching during unmap for
> >> reclaim nor during unmap for migration. We should add that support.
> >>
> >> But note, just like I raised earlier in the context of similar to
> >> "improved partial-mapped logic in rmap code when batching", we are
> >> primarily only pleasing counters here.
> >>
> >> See below on concurrent shrinker.
> >>
> >>>>
> >>>> This could lead to a race condition with shrinker - deferred_split_scan().
> >>>> The shrinker might call folio_try_get(folio), and while we are scanning
> >>>> the second PTE of this folio in try_to_unmap_one(), the entire mTHP
> >>>> could be transitioned back to swap-backed because the reference count
> >>>> is incremented.
> >>>>
> >>>> 	/*
> >>>> 	 * The only page refs must be one from isolation
> >>>> 	 * plus the rmap(s) (dropped by discard:).
> >>>> 	 */
> >>>> 	if (ref_count == 1 + map_count &&
> >>>> 	    (!folio_test_dirty(folio) ||
> >>>> 	     ...
> >>>> 	     (vma->vm_flags & VM_DROPPABLE))) {
> >>>> 		dec_mm_counter(mm, MM_ANONPAGES);
> >>>> 		goto discard;
> >>>> 	}
> >>
> >>
> >> Reclaim code holds an additional folio reference and has the folio
> >> locked. So I don't think this race can really happen in the way you
> >> think it could? Please feel free to correct me if I am wrong.
> >
> > try_to_unmap_one will only execute "goto discard" and remove the rmap if
> > ref_count == 1 + map_count. An additional ref_count + 1 from the shrinker
> > can invalidate this condition, leading to the restoration of the PTE and setting
> > the folio as swap-backed.
> >
> > 	/*
> > 	 * The only page refs must be one from isolation
> > 	 * plus the rmap(s) (dropped by discard:).
> > 	 */
> > 	if (ref_count == 1 + map_count &&
> > 	    (!folio_test_dirty(folio) ||
> > 	     /*
> > 	      * Unlike MADV_FREE mappings, VM_DROPPABLE
> > 	      * ones can be dropped even if they've
> > 	      * been dirtied.
> > 	      */
> > 	     (vma->vm_flags & VM_DROPPABLE))) {
> > 		dec_mm_counter(mm, MM_ANONPAGES);
> > 		goto discard;
> > 	}
> >
> > 	/*
> > 	 * If the folio was redirtied, it cannot be
> > 	 * discarded. Remap the page to page table.
> > 	 */
> > 	set_pte_at(mm, address, pvmw.pte, pteval);
> > 	/*
> > 	 * Unlike MADV_FREE mappings, VM_DROPPABLE ones
> > 	 * never get swap backed on failure to drop.
> > 	 */
> > 	if (!(vma->vm_flags & VM_DROPPABLE))
> > 		folio_set_swapbacked(folio);
> > 	goto walk_abort;
>
> Ah, that's what you mean. Yes, but the shrinker behaves mostly like just
> any other speculative reference.
>
> So we're not actually handling speculative references here correctly, so
> this issue is not completely shrinker-specific.
>
> Maybe, we should be doing something like this?
>
> 	/*
> 	 * Unlike MADV_FREE mappings, VM_DROPPABLE ones can be dropped even if
> 	 * they've been dirtied.
> 	 */
> 	if (folio_test_dirty(folio) && !(vma->vm_flags & VM_DROPPABLE)) {
> 		/*
> 		 * redirtied either using the page table or a previously
> 		 * obtained GUP reference.
> 		 */
> 		set_pte_at(mm, address, pvmw.pte, pteval);
> 		folio_set_swapbacked(folio);
> 		goto walk_abort;
> 	} else if (ref_count != 1 + map_count) {
> 		/*
> 		 * Additional reference. Could be a GUP reference or any
> 		 * speculative reference. GUP users must mark the folio dirty if
> 		 * there was a modification. This folio cannot be reclaimed
> 		 * right now either way, so act just like nothing happened.
> 		 * We'll come back here later and detect if the folio was
> 		 * dirtied when the additional reference is gone.
> 		 */
> 		set_pte_at(mm, address, pvmw.pte, pteval);
> 		goto walk_abort;
> 	}
> 	goto discard;
>
I agree that this is necessary, but I'm not sure it addresses my
concerns. MADV_FREE'ed mTHPs are still being added to `deferred_split`,
and this does not resolve the issue of them being partially unmapped,
though it is definitely better than the existing code: at least folios
are no longer moved back to swap-backed.
On the other hand, users might rely on the `deferred_split` counter to
assess how aggressively userspace performs address/size-unaligned
operations such as MADV_DONTNEED or partial unmaps. However, our
debugging shows that the majority of `deferred_split` counter increments
result from fully aligned MADV_FREE operations, which diminishes the
counter's usefulness in reflecting unaligned userspace behavior.
If possible, I am still looking for an approach that entirely avoids
adding the folio to deferred_split and having it partially unmapped.
Could the concept be something like this?
	if (folio_test_dirty(folio) && !(vma->vm_flags & VM_DROPPABLE)) {
		/*
		 * redirtied either using the page table or a previously
		 * obtained GUP reference.
		 */
		set_pte_at(mm, address, pvmw.pte, pteval);
		folio_set_swapbacked(folio);
		/* remove the rmap for the subpages before the current subpage */
		folio_remove_rmap_ptes(folio, head_page,
				       (address - start_address) / PAGE_SIZE, vma);
		goto walk_abort;
	} else if (ref_count != 1 + map_count) {
		/*
		 * Additional reference. Could be a GUP reference or any
		 * speculative reference. GUP users must mark the folio dirty if
		 * there was a modification. This folio cannot be reclaimed
		 * right now either way, so act just like nothing happened.
		 * We'll come back here later and detect if the folio was
		 * dirtied when the additional reference is gone.
		 */
		set_pte_at(mm, address, pvmw.pte, pteval);
		/* remove the rmap for the subpages before the current subpage */
		folio_remove_rmap_ptes(folio, head_page,
				       (address - start_address) / PAGE_SIZE, vma);
		goto walk_abort;
	}
	continue;	/* don't remove rmap one by one */
>
> Probably cleaning up goto labels.
>
>
> --
> Cheers,
>
> David / dhildenb
>
Thanks
Barry