From: Barry Song <21cnbao@gmail.com>
To: David Hildenbrand <david@redhat.com>
Cc: Lance Yang <ioworker0@gmail.com>, Linux-MM <linux-mm@kvack.org>,
Ryan Roberts <ryan.roberts@arm.com>,
Baolin Wang <baolin.wang@linux.alibaba.com>,
Andrew Morton <akpm@linux-foundation.org>
Subject: Re: All MADV_FREE mTHPs are fully subjected to deferred_split_folio()
Date: Tue, 31 Dec 2024 00:54:55 +1300 [thread overview]
Message-ID: <CAGsJ_4yVc7_Jm2er-_SWm57A_TH2zgCGiydZ3g6N2XS8tbcWOg@mail.gmail.com> (raw)
In-Reply-To: <142a47b6-ac31-465c-917e-7b2e98fddb2f@redhat.com>
On Mon, Dec 30, 2024 at 10:48 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 30.12.24 03:14, Lance Yang wrote:
> > Hi Barry,
> >
> > On Mon, Dec 30, 2024 at 5:13 AM Barry Song <21cnbao@gmail.com> wrote:
> >>
> >> Hi Lance,
> >>
> >> Along with Ryan, David, Baolin, and anyone else who might be interested,
> >>
> >> We’ve noticed an unexpectedly high number of deferred splits. The root
> >> cause appears to be the changes introduced in commit dce7d10be4bbd3
> >> ("mm/madvise: optimize lazyfreeing with mTHP in madvise_free"). Since
> >> that commit, split_folio is no longer called in mm/madvise.c.
>
> Hi,
>
> I assume you don't see "deferred splits" at all. You see that a folio
> was added to the deferred split queue to immediately be removed again as
> it gets freed. Correct?
>
> >>
> >> However, we are still performing deferred_split_folio for all
> >> MADV_FREE mTHPs, even for those that are fully aligned with mTHP.
> >> This happens because we execute a goto discard in
> >> try_to_unmap_one(), which eventually leads to
> >> folio_remove_rmap_pte() adding all folios to deferred_split when we
> >> scan the 1st pte in try_to_unmap_one().
> >>
> >> discard:
> >>         if (unlikely(folio_test_hugetlb(folio)))
> >>                 hugetlb_remove_rmap(folio);
> >>         else
> >>                 folio_remove_rmap_pte(folio, subpage, vma);
>
> Yes, that's kind-of known: we do PTE batching neither during unmap for
> reclaim nor during unmap for migration. We should add that support.
>
> But note, just as I raised earlier in the similar context of "improved
> partial-mapped logic in rmap code when batching", we are primarily
> only pleasing counters here.
>
> See below on concurrent shrinker.
>
> >>
> >> This could lead to a race condition with shrinker - deferred_split_scan().
> >> The shrinker might call folio_try_get(folio), and while we are scanning
> >> the second PTE of this folio in try_to_unmap_one(), the entire mTHP
> >> could be transitioned back to swap-backed because the reference count
> >> is incremented.
> >>
> >> /*
> >>  * The only page refs must be one from isolation
> >>  * plus the rmap(s) (dropped by discard:).
> >>  */
> >> if (ref_count == 1 + map_count &&
> >>     (!folio_test_dirty(folio) ||
> >>     ...
> >>      (vma->vm_flags & VM_DROPPABLE))) {
> >>         dec_mm_counter(mm, MM_ANONPAGES);
> >>         goto discard;
> >> }
>
>
> Reclaim code holds an additional folio reference and has the folio
> locked. So I don't think this race can really happen in the way you
> think it could? Please feel free to correct me if I am wrong.
try_to_unmap_one() will only execute "goto discard" and remove the rmap
when ref_count == 1 + map_count. An additional reference taken by the
shrinker (ref_count + 1) can invalidate this condition, causing the PTE
to be restored and the folio to be set swap-backed again.
                /*
                 * The only page refs must be one from isolation
                 * plus the rmap(s) (dropped by discard:).
                 */
                if (ref_count == 1 + map_count &&
                    (!folio_test_dirty(folio) ||
                     /*
                      * Unlike MADV_FREE mappings, VM_DROPPABLE
                      * ones can be dropped even if they've
                      * been dirtied.
                      */
                     (vma->vm_flags & VM_DROPPABLE))) {
                        dec_mm_counter(mm, MM_ANONPAGES);
                        goto discard;
                }

                /*
                 * If the folio was redirtied, it cannot be
                 * discarded. Remap the page to page table.
                 */
                set_pte_at(mm, address, pvmw.pte, pteval);
                /*
                 * Unlike MADV_FREE mappings, VM_DROPPABLE ones
                 * never get swap backed on failure to drop.
                 */
                if (!(vma->vm_flags & VM_DROPPABLE))
                        folio_set_swapbacked(folio);
                goto walk_abort;
>
> >>
> >> It also significantly increases contention on ds_queue->split_queue_lock during
> >> memory reclamation and could potentially introduce other race conditions with
> >> shrinker as well.
> >
> > Good catch!
> >
>
> Call me "skeptical" that this is a big issue, at least regarding the
> shrinker, but also regarding actual lock contention. :)
>
> The issue might be less severe than you think: mostly pleasing counters.
> But yes, there is room for improvement.
>
> >>
> >> I’m curious if anyone has suggestions for resolving this issue. My
> >> idea is to use folio_remove_rmap_ptes() to drop all PTEs at once,
> >> rather than folio_remove_rmap_pte(), which processes PTEs one by one
> >> for an mTHP. This approach would require some changes, such as
> >> checking the dirty state of PTEs and performing a TLB flush for the
> >> entire mTHP as a whole in try_to_unmap_one().
> >
> > Yeah, IMHO, it would also be beneficial to reclaim entire mTHPs as a whole
> > in real-world scenarios where MADV_FREE mTHPs are typically no longer
> > written ;)
>
>
> We should be implementing folio batching. But it won't be able to cover
> all cases.
>
> In the future, I envision that during reclaim/access bit scanning, we
> determine whether a folio is partially mapped and add it to the deferred
> split queue. That's one requirement for getting rid of
> folio->_nr_pages_mapped and reliably detecting all partial mappings, but
> it also avoids having to mess with this information whenever we
> (temporarily) unmap only parts of a folio, like we have here.
>
> Thanks!
>
> --
> Cheers,
>
> David / dhildenb
>