linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: David Stevens <stevensd@chromium.org>
To: Peter Xu <peterx@redhat.com>
Cc: Yang Shi <shy828301@gmail.com>,
	David Hildenbrand <david@redhat.com>,
	 "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
	linux-mm@kvack.org,  Andrew Morton <akpm@linux-foundation.org>,
	linux-kernel@vger.kernel.org,  Hugh Dickins <hughd@google.com>
Subject: Re: [PATCH] mm/khugepaged: skip shmem with armed userfaultfd
Date: Thu, 2 Feb 2023 18:56:04 +0900	[thread overview]
Message-ID: <CAD=HUj4FjuLpihQLGLzUu82vr5fdFFxfnyKNhApC6L67F5iV4g@mail.gmail.com> (raw)
In-Reply-To: <Y9rRCN9EfqzwYnDG@x1n>

On Thu, Feb 2, 2023 at 5:52 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Wed, Feb 01, 2023 at 09:36:37AM -0800, Yang Shi wrote:
> > On Tue, Jan 31, 2023 at 7:42 PM David Stevens <stevensd@chromium.org> wrote:
> > >
> > > From: David Stevens <stevensd@chromium.org>
> > >
> > > Collapsing memory in a vma that has an armed userfaultfd results in
> > > zero-filling any missing pages, which breaks user-space paging for those
> > > filled pages. Avoid khugepage bypassing userfaultfd by not collapsing
> > > pages in shmem reached via scanning a vma with an armed userfaultfd if
> > > doing so would zero-fill any pages.
> > >
> > > Signed-off-by: David Stevens <stevensd@chromium.org>
> > > ---
> > >  mm/khugepaged.c | 35 ++++++++++++++++++++++++-----------
> > >  1 file changed, 24 insertions(+), 11 deletions(-)
> > >
> > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > index 79be13133322..48e944fb8972 100644
> > > --- a/mm/khugepaged.c
> > > +++ b/mm/khugepaged.c
> > > @@ -1736,8 +1736,8 @@ static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff,
> > >   *    + restore gaps in the page cache;
> > >   *    + unlock and free huge page;
> > >   */
> > > -static int collapse_file(struct mm_struct *mm, unsigned long addr,
> > > -                        struct file *file, pgoff_t start,
> > > +static int collapse_file(struct mm_struct *mm, struct vm_area_struct *vma,
> > > +                        unsigned long addr, struct file *file, pgoff_t start,
> > >                          struct collapse_control *cc)
> > >  {
> > >         struct address_space *mapping = file->f_mapping;
> > > @@ -1784,6 +1784,9 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
> > >          * be able to map it or use it in another way until we unlock it.
> > >          */
> > >
> > > +       if (is_shmem)
> > > +               mmap_read_lock(mm);
> >
> > If you release mmap_lock before then reacquire it here, the vma is not
> > trusted anymore. It is not safe to use the vma anymore.
> >
> > Since you already read uffd_was_armed before releasing mmap_lock, so
> > you could pass it directly to collapse_file w/o dereferencing vma
> > again. The problem may be false positive (not userfaultfd armed
> > anymore), but it should be fine. Khugepaged could collapse this area
> > in the next round.

I didn't notice this race condition. It should be possible to adapt
hugepage_vma_revalidate for this situation, or at least to create an
analogous situation.

> Unfortunately that may not be enough.. because it's also possible that it
> reads uffd_armed==false, released mmap_sem, passed it over to the scanner,
> but then when scanning the file uffd got armed in parallel.
>
> There's another problem where the current vma may not have uffd armed,
> khugepaged may think it has nothing to do with uffd and moved on with
> collapsing, but actually it's armed in another vma of either the current mm
> or just another mm's.
>
> It seems non-trivial too to safely check this across all the vmas, let's
> say, by a reverse walk - the only safe way is to walk all the vmas and take
> the write lock for every mm, but that's not only too heavy but also merely
> impossible to always make it right because of deadlock issues and on the
> order of mmap write lock to take..
>
> So far what I can still think of is, if we can extend shmem_inode_info and
> have a counter showing how many uffd has been armed.  It can be a generic
> counter too (e.g. shmem_inode_info.collapse_guard_counter) just to avoid
> the page cache being collapsed under the hood, but I am also not aware of
> whether it can be reused by other things besides uffd.
>
> Then when we do the real collapsing, say, when:
>
>                 xas_set_order(&xas, start, HPAGE_PMD_ORDER);
>                 xas_store(&xas, hpage);
>                 xas_unlock_irq(&xas);
>
> We may need to make sure that counter keeps static (probably by holding
> some locks during the process) and we only do that last phase collapse if
> counter==0.
>
> Similar checks in this patch can still be done, but that'll only service as
> a role of failing faster before the ultimate check on the uffd_armed
> counter.  Otherwise I just don't quickly see how to avoid race conditions.

I don't know if it's necessary to go that far. Userfaultfd plus shmem
is inherently brittle. It's possible for userspace to bypass
userfaultfd on a shmem mapping by accessing the shmem through a
different mapping or simply by using the write syscall. It might be
sufficient to say that the kernel won't directly bypass a VMA's
userfaultfd to collapse the underlying shmem's pages. Although on the
other hand, I guess it's not great for the presence of an unused shmem
mapping lying around to cause khugepaged to have user-visible side
effects.

-David

> It'll be great if someone can come up with something better than above..
> Copy Hugh too.
>
> Thanks,
>
> --
> Peter Xu
>


  parent reply	other threads:[~2023-02-02  9:56 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-02-01  3:41 David Stevens
2023-02-01 17:36 ` Yang Shi
2023-02-01 20:52   ` Peter Xu
2023-02-01 23:57     ` Yang Shi
2023-02-02 20:04       ` Peter Xu
2023-02-02 21:11         ` Yang Shi
2023-02-02  9:56     ` David Stevens [this message]
2023-02-02 17:40       ` Yang Shi
2023-02-02 20:22         ` Peter Xu
2023-02-03  6:09           ` David Stevens
2023-02-03 14:56             ` Peter Xu
2023-02-01 23:09 ` Kirill A. Shutemov
2023-02-02  9:30   ` David Stevens

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAD=HUj4FjuLpihQLGLzUu82vr5fdFFxfnyKNhApC6L67F5iV4g@mail.gmail.com' \
    --to=stevensd@chromium.org \
    --cc=akpm@linux-foundation.org \
    --cc=david@redhat.com \
    --cc=hughd@google.com \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=peterx@redhat.com \
    --cc=shy828301@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox