Re: [RFC PATCH v2 3/3] mm: mlock: update mlock_pte_range to handle large folio

From: Yosry Ahmed <yosryahmed@google.com>
To: "Yin, Fengwei" <fengwei.yin@intel.com>
Cc: Hugh Dickins <hughd@google.com>, Yu Zhao <yuzhao@google.com>,
	linux-mm@kvack.org,  linux-kernel@vger.kernel.org,
	akpm@linux-foundation.org, willy@infradead.org,
	 david@redhat.com, ryan.roberts@arm.com, shy828301@gmail.com
Subject: Re: [RFC PATCH v2 3/3] mm: mlock: update mlock_pte_range to handle large folio
Date: Thu, 20 Jul 2023 20:39:42 -0700
Message-ID: <CAJD7tkZJFG=7xs=9otc5CKs6odWu48daUuZP9Wd9Z-sZF07hXg@mail.gmail.com>
In-Reply-To: <c9b53e12-80bc-7447-af2e-71920e4179d0@intel.com>

On Thu, Jul 20, 2023 at 8:19 PM Yin, Fengwei <fengwei.yin@intel.com> wrote:
>
>
>
> On 7/21/2023 9:35 AM, Yosry Ahmed wrote:
> > On Thu, Jul 20, 2023 at 6:12 PM Yin, Fengwei <fengwei.yin@intel.com> wrote:
> >>
> >>
> >>
> >> On 7/21/2023 4:51 AM, Yosry Ahmed wrote:
> >>> On Thu, Jul 20, 2023 at 5:03 AM Yin, Fengwei <fengwei.yin@intel.com> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On 7/19/2023 11:44 PM, Yosry Ahmed wrote:
> >>>>> On Wed, Jul 19, 2023 at 7:26 AM Hugh Dickins <hughd@google.com> wrote:
> >>>>>>
> >>>>>> On Wed, 19 Jul 2023, Yin Fengwei wrote:
> >>>>>>>>>>>>>>>> Could this also happen with a normal 4K page? I mean when the user tries to munlock
> >>>>>>>>>>>>>>>> a normal 4K page and this 4K page is isolated. So it becomes an unevictable page?
> >>>>>>>>>>>>>>> Looks like it can be possible. If cpu 1 is in __munlock_folio() and
> >>>>>>>>>>>>>>> cpu 2 is isolating the folio for any purpose:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> cpu1                        cpu2
> >>>>>>>>>>>>>>>                             isolate folio
> >>>>>>>>>>>>>>> folio_test_clear_lru() // 0
> >>>>>>>>>>>>>>>                             putback folio // add to unevictable list
> >>>>>>>>>>>>>>> folio_test_clear_mlocked()
> >>>>>>>>>>>>                                folio_set_lru()
> >>>>>>> Let's wait for the response from Hugh and Yu. :)
> >>>>>>
> >>>>>> I haven't been able to give it enough thought, but I suspect you are right:
> >>>>>> that the current __munlock_folio() is deficient when folio_test_clear_lru()
> >>>>>> fails.
> >>>>>>
> >>>>>> (Though it has not been reported as a problem in practice: perhaps because
> >>>>>> so few places try to isolate from the unevictable "list".)
> >>>>>>
> >>>>>> I forget what my order of development was, but it's likely that I first
> >>>>>> wrote the version for our own internal kernel - which used our original
> >>>>>> lruvec locking, which did not depend on getting PG_lru first (having got
> >>>>>> lru_lock, it checked memcg, then tried again if that had changed).
> >>>>>
> >>>>> Right. Just holding the lruvec lock without clearing PG_lru would not
> >>>>> protect against memcg movement in this case.
> >>>>>
> >>>>>>
> >>>>>> I was uneasy with the PG_lru aspect of upstream lru_lock implementation,
> >>>>>> but it turned out to work okay - elsewhere; but it looks as if I missed
> >>>>>> its implication when adapting __munlock_page() for upstream.
> >>>>>>
> >>>>>> If I were trying to fix this __munlock_folio() race myself (sorry, I'm
> >>>>>> not), I would first look at that aspect: instead of folio_test_clear_lru()
> >>>>>> behaving always like a trylock, could "folio_wait_clear_lru()" or whatever
> >>>>>> spin waiting for PG_lru here?
> >>>>>
> >>>>> +Matthew Wilcox
> >>>>>
> >>>>> It seems to me that before 70dea5346ea3 ("mm/swap: convert lru_add to
> >>>>> a folio_batch"), __pagevec_lru_add_fn() (aka lru_add_fn()) used to do
> >>>>> folio_set_lru() before checking folio_evictable(). While this is
> >>>>> probably extraneous since folio_batch_move_lru() will set it again
> >>>>> afterwards, it's probably harmless given that the lruvec lock is held
> >>>>> throughout (so no one can complete the folio isolation anyway), and
> >>>>> given that there were no problems introduced by this extra
> >>>>> folio_set_lru() as far as I can tell.
> >>>> After checking the related code: yes. It looks fine to move folio_set_lru()
> >>>> before the if (folio_evictable(folio)) check in lru_add_fn(), because the
> >>>> lru lock is held.
> >>>>
> >>>>>
> >>>>> If we restore folio_set_lru() to lru_add_fn(), and revert 2262ace60713
> >>>>> ("mm/munlock: delete smp_mb() from __pagevec_lru_add_fn()") to restore the strict
> >>>>> ordering between manipulating PG_lru and PG_mlocked, I suppose we can
> >>>>> get away without having to spin. Again, that would only be possible if
> >>>>> reworking mlock_count [1] is acceptable. Otherwise, we can't clear
> >>>>> PG_mlocked before PG_lru in __munlock_folio().
> >>>> What about the following change, which moves the mlocked operation before
> >>>> the lru check in __munlock_folio()?
> >>>
> >>> It seems correct to me on a high level, but I think there is a subtle problem:
> >>>
> >>> We clear PG_mlocked before trying to isolate to make sure that if
> >>> someone already has the folio isolated they will put it back on an
> >>> evictable list, then if we are able to isolate the folio ourselves and
> >>> find that the mlock_count is > 0, we set PG_mlocked again.
> >>>
> >>> There is a small window where PG_mlocked might be temporarily cleared
> >>> but the folio is not actually munlocked (i.e. we don't update the
> >>> NR_MLOCK stat). In that window, a racing reclaimer on a different cpu
> >>> may find VM_LOCKED in a different vma, and call mlock_folio(). In
> >>> mlock_folio(), we will call folio_test_set_mlocked(folio) and see that
> >>> PG_mlocked is clear, so we will increment the MLOCK stats, even though
> >>> the folio was already mlocked. This can cause MLOCK stats to be
> >>> unbalanced (increments more than decrements), no?
> >> Looks like NR_MLOCK is always connected to PG_mlocked bit. Not possible
> >> to be unbalanced.
> >>
> >> Let's say:
> >>   mlock_folio()  NR_MLOCK increase and set mlocked
> >>   mlock_folio()  NR_MLOCK NO change as folio is already mlocked
> >>
> >>   __munlock_folio() with isolated folio. NR_MLOCK decrease (0) and
> >>                                          clear mlocked
> >>
> >>   folio_putback_lru()
> >>   reclaimed mlock_folio()  NR_MLOCK increase and set mlocked
> >>
> >>   munlock_folio()  NR_MLOCK decrease (0) and clear mlocked
> >>   munlock_folio()  NR_MLOCK NO change as folio has no mlocked set
> >
> > Right. The problem with the diff is that we temporarily clear
> > PG_mlocked *without* updating NR_MLOCK.
> >
> > Consider a folio that is mlocked by two vmas. NR_MLOCK = folio_nr_pages.
> >
> > Assume cpu 1 is doing __munlock_folio from one of the vmas, while cpu
> > 2 is doing reclaim.
> >
> > cpu 1                                        cpu2
> > clear PG_mlocked
> >                                                  folio_referenced()
> >                                                    mlock_folio()
> >                                                      set PG_mlocked
> >                                                        add to NR_MLOCK
> > mlock_count > 0
> > set PG_mlocked
> > goto out
> >
> > Result: NR_MLOCK = folio_nr_pages * 2.
> >
> > When the folio is munlock()'d later from the second vma, NR_MLOCK will
> > be reduced to folio_nr_pages, but there are no mlocked folios.
> >
> > This is the scenario that I have in mind. Please correct me if I am wrong.
> Yes. Looks possible, even though it may be difficult to hit.
>
> My first thought was that it's not possible, because an unevictable folio will
> not be picked by the reclaimer. But it is possible if the following happens
> between clear_mlock and test_and_clear_lru:
>     folio_putback_lru() by other isolation user like migration
>     reclaimer pick the folio and call mlock_folio()
>     reclaimer call folio
>
> The fix can be to follow the rule (tie NR_MLOCK to the PG_mlocked bit)
> strictly.

Yeah probably. I believe restoring the old ordering of manipulating
PG_lru and PG_mlocked with the memory barrier would be a simpler fix,
but this is only possible if the mlock_count rework gets merged.
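
For anyone who wants to poke at the accounting, here is a rough userspace
sketch of the interleaving quoted above (two vmas, cpu 1 in the proposed
__munlock_folio(), cpu 2 in reclaim). It is only a toy model under
assumptions: every toy_* name is made up for illustration and is not a
kernel API, the two cpus are replayed sequentially in the order from the
diagram, LRU isolation is assumed to succeed, and the batched mlock_count
update that a real mlock_folio() would queue is ignored.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins; none of these are real kernel types or helpers. */
struct toy_folio {
	bool mlocked;		/* models PG_mlocked */
	int mlock_count;	/* models folio->mlock_count */
	int nr_pages;		/* models folio_nr_pages() */
};

static long toy_nr_mlock;	/* models the NR_MLOCK counter */

/* Models mlock_folio(): the stat is bumped only if the flag was clear. */
static void toy_mlock_folio(struct toy_folio *f)
{
	if (!f->mlocked) {
		f->mlocked = true;
		toy_nr_mlock += f->nr_pages;
	}
}

/* First step of the proposed diff: clear the flag, leave the stat alone. */
static bool toy_munlock_begin(struct toy_folio *f)
{
	bool was_mlocked = f->mlocked;

	f->mlocked = false;
	return was_mlocked;
}

/* Remainder of the proposed diff, assuming the folio was isolated. */
static void toy_munlock_finish(struct toy_folio *f, bool was_mlocked)
{
	if (f->mlock_count)
		f->mlock_count--;
	if (f->mlock_count) {
		if (was_mlocked)
			f->mlocked = true;	/* flag restored, stat untouched */
		return;				/* "goto out" */
	}
	if (was_mlocked)
		toy_nr_mlock -= f->nr_pages;
}

/*
 * Replays the race and returns what is left in the counter once no vma
 * mlocks the folio any more. A balanced scheme would return 0.
 */
static long toy_replay_race(int nr_pages)
{
	struct toy_folio f = {
		.mlocked = true, .mlock_count = 2, .nr_pages = nr_pages,
	};
	bool was_mlocked;

	toy_nr_mlock = nr_pages;		/* mlocked by two vmas, counted once */

	was_mlocked = toy_munlock_begin(&f);	/* cpu 1: clear PG_mlocked */
	toy_mlock_folio(&f);			/* cpu 2: reclaim re-mlocks, adds stat */
	toy_munlock_finish(&f, was_mlocked);	/* cpu 1: mlock_count > 0, out */
	assert(toy_nr_mlock == 2L * nr_pages);	/* counted twice at this point */

	was_mlocked = toy_munlock_begin(&f);	/* later munlock from second vma */
	toy_munlock_finish(&f, was_mlocked);
	assert(!f.mlocked);			/* nothing is mlocked any more */

	return toy_nr_mlock;
}
```

Calling toy_replay_race(4) leaves 4 in the counter with no mlocked folio,
which is the leftover described in the quoted scenario. If the early clear
also subtracted from the counter (and the restore added it back), the same
replay would end at 0, which I think is what following the
NR_MLOCK/PG_mlocked rule strictly amounts to in this model.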

>
>
> Regards
> Yin, Fengwei
>
> >
> >>
> >>
> >> Regards
> >> Yin, Fengwei
> >>
> >>>
> >>>>
> >>>> diff --git a/mm/mlock.c b/mm/mlock.c
> >>>> index 0a0c996c5c21..514f0d5bfbfd 100644
> >>>> --- a/mm/mlock.c
> >>>> +++ b/mm/mlock.c
> >>>> @@ -122,7 +122,9 @@ static struct lruvec *__mlock_new_folio(struct folio *folio, struct lruvec *lruv
> >>>>  static struct lruvec *__munlock_folio(struct folio *folio, struct lruvec *lruvec)
> >>>>  {
> >>>>         int nr_pages = folio_nr_pages(folio);
> >>>> -       bool isolated = false;
> >>>> +       bool isolated = false, mlocked = true;
> >>>> +
> >>>> +       mlocked = folio_test_clear_mlocked(folio);
> >>>>
> >>>>         if (!folio_test_clear_lru(folio))
> >>>>                 goto munlock;
> >>>> @@ -134,13 +136,17 @@ static struct lruvec *__munlock_folio(struct folio *folio, struct lruvec *lruvec
> >>>>                 /* Then mlock_count is maintained, but might undercount */
> >>>>                 if (folio->mlock_count)
> >>>>                         folio->mlock_count--;
> >>>> -               if (folio->mlock_count)
> >>>> +               if (folio->mlock_count) {
> >>>> +                       if (mlocked)
> >>>> +                               folio_set_mlocked(folio);
> >>>>                         goto out;
> >>>> +               }
> >>>>         }
> >>>>         /* else assume that was the last mlock: reclaim will fix it if not */
> >>>>
> >>>>  munlock:
> >>>> -       if (folio_test_clear_mlocked(folio)) {
> >>>> +       if (mlocked) {
> >>>>                 __zone_stat_mod_folio(folio, NR_MLOCK, -nr_pages);
> >>>>                 if (isolated || !folio_test_unevictable(folio))
> >>>>                         __count_vm_events(UNEVICTABLE_PGMUNLOCKED, nr_pages);
> >>>>
> >>>>
> >>>>>
> >>>>> I am not saying this is necessarily better than spinning, just a note
> >>>>> (and perhaps selfishly making [1] more appealing ;)).
> >>>>>
> >>>>> [1]https://lore.kernel.org/lkml/20230618065719.1363271-1-yosryahmed@google.com/
> >>>>>
> >>>>>>
> >>>>>> Hugh
