Re: [PATCH RFC] mm: Fix kernel BUG when userfaultfd_move encounters swapcache

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Peter Xu <peterx@redhat.com>
To: Barry Song <21cnbao@gmail.com>
Cc: david@redhat.com, Liam.Howlett@oracle.com, aarcange@redhat.com,
	akpm@linux-foundation.org, axelrasmussen@google.com,
	bgeffon@google.com, brauner@kernel.org, hughd@google.com,
	jannh@google.com, kaleshsingh@google.com,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	lokeshgidra@google.com, mhocko@suse.com, ngeoffray@google.com,
	rppt@kernel.org, ryan.roberts@arm.com, shuah@kernel.org,
	surenb@google.com, v-songbaohua@oppo.com,
	viro@zeniv.linux.org.uk, willy@infradead.org,
	zhangpeng362@huawei.com, zhengtangquan@oppo.com,
	yuzhao@google.com, stable@vger.kernel.org
Subject: Re: [PATCH RFC] mm: Fix kernel BUG when userfaultfd_move encounters swapcache
Date: Thu, 20 Feb 2025 20:49:22 -0500	[thread overview]
Message-ID: <Z7fbom4rxRu-NX81@x1.local> (raw)
In-Reply-To: <CAGsJ_4zXMj3hxazV1R-e9kCi_q-UDyYDhU6onWQRtRNgEEV3rw@mail.gmail.com>

On Fri, Feb 21, 2025 at 01:07:24PM +1300, Barry Song wrote:
> On Fri, Feb 21, 2025 at 12:32 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Thu, Feb 20, 2025 at 10:21:01PM +1300, Barry Song wrote:
> > > 2. src_anon_vma and its lock – swapcache doesn’t require it（folio is not mapped）
> >
> > Could you help explain what guarantees the rmap walk not happen on a
> > swapcache page?
> >
> > I'm not familiar with this path, though at least I see damon can start a
> > rmap walk on PageAnon almost with no locking..  some explanations would be
> > appreciated.
> 
> I am observing the following in folio_referenced(), which the anon_vma lock
> was originally intended to protect.
> 
>         if (!pra.mapcount)
>                 return 0;
> 
> I assume all other rmap walks should do the same?

Yes normally there'll be a folio_mapcount() check, however..

> 
> int folio_referenced(struct folio *folio, int is_locked,
>                      struct mem_cgroup *memcg, unsigned long *vm_flags)
> {
> 
>         bool we_locked = false;
>         struct folio_referenced_arg pra = {
>                 .mapcount = folio_mapcount(folio),
>                 .memcg = memcg,
>         };
> 
>         struct rmap_walk_control rwc = {
>                 .rmap_one = folio_referenced_one,
>                 .arg = (void *)&pra,
>                 .anon_lock = folio_lock_anon_vma_read,
>                 .try_lock = true,
>                 .invalid_vma = invalid_folio_referenced_vma,
>         };
> 
>         *vm_flags = 0;
>         if (!pra.mapcount)
>                 return 0;
>         ...
> }
> 
> By the way, since the folio has been under reclamation in this case and
> isn't in the lru, this should also prevent the rmap walk, right?

.. I'm not sure whether it's always working.

The thing is anon doesn't even require folio lock held during (1) checking
mapcount and (2) doing the rmap walk, in all similar cases as above.  I see
nothing blocks it from a concurrent thread zapping that last mapcount:

               thread 1                         thread 2
               --------                         --------
        [whatever scanner] 
           check folio_mapcount(), non-zero
                                                zap the last map.. then mapcount==0
           rmap_walk()

Not sure if I missed something.

The other thing is IIUC swapcache page can also have chance to be faulted
in but only if a read not write.  I actually had a feeling that your
reproducer triggered that exact path, causing a read swap in, reusing the
swapcache page, and hit the sanity check there somehow (even as mentioned
in the other reply, I don't yet know why the 1st check didn't seem to
work.. as we do check folio->index twice..).

Said that, I'm not sure if above concern will happen in this specific case,
as UIFFDIO_MOVE is pretty special, that we check exclusive bit first in swp
entry so we know it's definitely not mapped elsewhere, meanwhile if we hold
pgtable lock so maybe it can't get mapped back.. it is just still tricky,
at least we do some dances all over releasing and retaking locks.

We could either justify that's safe, or maybe still ok and simpler if we
could take anon_vma write lock, making sure nobody will be able to read the
folio->index when it's prone to an update.

Thanks,

-- 
Peter Xu

next prev parent reply	other threads:[~2025-02-21  1:49 UTC|newest]

Thread overview: 47+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-02-19 11:25 Barry Song
2025-02-19 18:26 ` Suren Baghdasaryan
2025-02-19 18:30   ` David Hildenbrand
2025-02-19 18:58     ` Suren Baghdasaryan
2025-02-20  8:40       ` David Hildenbrand
2025-02-20  9:21         ` Barry Song
2025-02-20 10:24           ` David Hildenbrand
2025-02-26  5:37             ` Barry Song
2025-02-26  8:03               ` David Hildenbrand
2025-02-20 23:32           ` Peter Xu
2025-02-21  0:07             ` Barry Song
2025-02-21  1:49               ` Peter Xu [this message]
2025-02-22 21:31                 ` Barry Song
2025-02-24 17:50                   ` Peter Xu
2025-02-24 18:03                     ` David Hildenbrand
2025-02-19 20:37   ` Barry Song
2025-02-19 20:57     ` Matthew Wilcox
2025-02-19 21:05       ` Barry Song
2025-02-19 21:02     ` Lokesh Gidra
2025-02-19 21:26       ` Barry Song
2025-02-19 21:32         ` Lokesh Gidra
2025-02-19 22:14     ` Peter Xu
2025-02-19 23:04       ` Barry Song
2025-02-19 23:19         ` Lokesh Gidra
2025-02-20  0:49           ` Barry Song
2025-02-20 22:59         ` Peter Xu
2025-02-20 23:47           ` Suren Baghdasaryan
2025-02-20 23:52             ` Suren Baghdasaryan
2025-02-21  0:36               ` Suren Baghdasaryan
2025-02-25 11:05                 ` Barry Song
2025-02-25 15:34                   ` Peter Xu
2025-02-25 17:02                     ` Suren Baghdasaryan
2025-02-21  1:36           ` Barry Song
2025-02-21  1:54             ` Peter Xu
2025-02-20  8:51     ` David Hildenbrand
2025-02-20  9:31       ` Barry Song
2025-02-20  9:36         ` David Hildenbrand
2025-02-20 21:45           ` Barry Song
2025-02-20 22:19             ` Lokesh Gidra
2025-02-20 22:26               ` Barry Song
2025-02-20 22:31                 ` David Hildenbrand
2025-02-20 22:33                 ` Lokesh Gidra
2025-02-19 18:40 ` Lokesh Gidra
2025-02-19 20:45   ` Barry Song
2025-02-19 20:53     ` Lokesh Gidra
2025-02-19 22:31 ` Peter Xu
2025-02-20  0:50   ` Barry Song

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Z7fbom4rxRu-NX81@x1.local \
    --to=peterx@redhat.com \
    --cc=21cnbao@gmail.com \
    --cc=Liam.Howlett@oracle.com \
    --cc=aarcange@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=axelrasmussen@google.com \
    --cc=bgeffon@google.com \
    --cc=brauner@kernel.org \
    --cc=david@redhat.com \
    --cc=hughd@google.com \
    --cc=jannh@google.com \
    --cc=kaleshsingh@google.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=lokeshgidra@google.com \
    --cc=mhocko@suse.com \
    --cc=ngeoffray@google.com \
    --cc=rppt@kernel.org \
    --cc=ryan.roberts@arm.com \
    --cc=shuah@kernel.org \
    --cc=stable@vger.kernel.org \
    --cc=surenb@google.com \
    --cc=v-songbaohua@oppo.com \
    --cc=viro@zeniv.linux.org.uk \
    --cc=willy@infradead.org \
    --cc=yuzhao@google.com \
    --cc=zhangpeng362@huawei.com \
    --cc=zhengtangquan@oppo.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox