From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <owner-linux-mm@kvack.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by smtp.lore.kernel.org (Postfix) with ESMTP id C9E85C43334
	for <linux-mm@archiver.kernel.org>; Fri, 15 Jul 2022 17:20:41 +0000 (UTC)
Received: by kanga.kvack.org (Postfix)
	id 4DA44940207; Fri, 15 Jul 2022 13:20:41 -0400 (EDT)
Received: by kanga.kvack.org (Postfix, from userid 40)
	id 489C59401FB; Fri, 15 Jul 2022 13:20:41 -0400 (EDT)
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id 35198940207; Fri, 15 Jul 2022 13:20:41 -0400 (EDT)
X-Delivered-To: linux-mm@kvack.org
Received: from relay.hostedemail.com (smtprelay0016.hostedemail.com [216.40.44.16])
	by kanga.kvack.org (Postfix) with ESMTP id 23FEC9401FB
	for <linux-mm@kvack.org>; Fri, 15 Jul 2022 13:20:41 -0400 (EDT)
Received: from smtpin01.hostedemail.com (a10.router.float.18 [10.200.18.1])
	by unirelay12.hostedemail.com (Postfix) with ESMTP id E9D671208D5
	for <linux-mm@kvack.org>; Fri, 15 Jul 2022 17:20:40 +0000 (UTC)
X-FDA: 79689998640.01.48DA619
Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124])
	by imf26.hostedemail.com (Postfix) with ESMTP id 8A2BF14006F
	for <linux-mm@kvack.org>; Fri, 15 Jul 2022 17:20:40 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com;
	s=mimecast20190719; t=1657905640;
	h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
	 to:to:cc:cc:mime-version:mime-version:content-type:content-type:
	 in-reply-to:in-reply-to:references:references;
	bh=PTK4goQcbNMUtE6aJiq+7EgmXb2x5IDyZNvU93q4GVI=;
	b=M2x9QjAyZKMg89JiseSXkWRd3v7P72xsb1v/Du5wvN6ujXzyLUyfnjWSXoCvqkYj+ltEPo
	vH2ClQ+SorO36nwfdU6ln5PXmFxBaPkO+gyX//LzW4S6F4gNZEH0Liv4/7GhGcMao37Ou4
	e1yc8nx3nwM+cm7cSVDBAqN3YXcI3VA=
Received: from mail-qk1-f200.google.com (mail-qk1-f200.google.com
 [209.85.222.200]) by relay.mimecast.com with ESMTP with STARTTLS
 (version=TLSv1.2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id
 us-mta-180-mvauElPOPxyfELCDPUwVxg-1; Fri, 15 Jul 2022 13:20:38 -0400
X-MC-Unique: mvauElPOPxyfELCDPUwVxg-1
Received: by mail-qk1-f200.google.com with SMTP id s9-20020a05620a254900b006b54dd4d6deso3876186qko.3
        for <linux-mm@kvack.org>; Fri, 15 Jul 2022 10:20:37 -0700 (PDT)
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20210112;
        h=x-gm-message-state:date:from:to:cc:subject:message-id:references
         :mime-version:content-disposition:in-reply-to;
        bh=PTK4goQcbNMUtE6aJiq+7EgmXb2x5IDyZNvU93q4GVI=;
        b=ZyOjeupGQoPlQyKJ6dYXR4ekkr7cSy1HosuPCxL2HezOu/a4PtdKq+7tfuW/u3mpF6
         ZtM7/SvpNr7vb21/6+dKm+5cF7MBmkA6M2j1RfYHQ2w52Nc3F/oIfA9CQFvxvhcc8key
         d9IZ8t6S1OJSfwzlx3QFoj2FanHxNpxFI6A6gxsA9QIa/GU3aP7apNsyXXAzi72sda/b
         zIRi53plne+dAaMNOdeVyeEOw/Hy5qTa5oKM6MqPfqDg4EZ5AW27MsuBEElZXNh9qCKa
         JBe6ePkQp2tTey2RZ6UJAO+H0BKQPu8o0tVnz01hj11rTPEaq0DWbCJ+oPiPp4drA6f0
         VP8A==
X-Gm-Message-State: AJIora/QXqGpYCU4P/iyyRGzH+GgMt4K/5I1dVKpD4AOgkWAZQzQdC41
	Jo55MChlwABJywA0bDvHwOic6sgre9LTvSTUbaRxngOTLP5m6x/xO7G01SqevHqOshlEfRm27NM
	V6n5ZG8ZwxMY=
X-Received: by 2002:a05:622a:313:b0:31e:bb0b:4748 with SMTP id q19-20020a05622a031300b0031ebb0b4748mr13265067qtw.439.1657905637172;
        Fri, 15 Jul 2022 10:20:37 -0700 (PDT)
X-Google-Smtp-Source: AGRyM1uOHJxrwYjG93/qRUTPsx6ZXOpHvxvgIjaBzPJoHTHPeCWsXmeiQ/5HBEzj58cZRgTurllH0g==
X-Received: by 2002:a05:622a:313:b0:31e:bb0b:4748 with SMTP id q19-20020a05622a031300b0031ebb0b4748mr13265042qtw.439.1657905636896;
        Fri, 15 Jul 2022 10:20:36 -0700 (PDT)
Received: from xz-m1.local (bras-base-aurron9127w-grc-37-74-12-30-48.dsl.bell.ca. [74.12.30.48])
        by smtp.gmail.com with ESMTPSA id bp13-20020a05622a1b8d00b0031e9fa40c2esm4208168qtb.27.2022.07.15.10.20.35
        (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
        Fri, 15 Jul 2022 10:20:36 -0700 (PDT)
Date: Fri, 15 Jul 2022 13:20:35 -0400
From: Peter Xu <peterx@redhat.com>
To: James Houghton <jthoughton@google.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>,
	Muchun Song <songmuchun@bytedance.com>,
	David Hildenbrand <david@redhat.com>,
	David Rientjes <rientjes@google.com>,
	Axel Rasmussen <axelrasmussen@google.com>,
	Mina Almasry <almasrymina@google.com>, Jue Wang <juew@google.com>,
	Manish Mishra <manish.mishra@nutanix.com>,
	"Dr . David Alan Gilbert" <dgilbert@redhat.com>, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH 20/26] hugetlb: add support for high-granularity
 UFFDIO_CONTINUE
Message-ID: <YtGh46Jr0EGpqW7s@xz-m1.local>
References: <20220624173656.2033256-1-jthoughton@google.com>
 <20220624173656.2033256-21-jthoughton@google.com>
 <YtGUARcBHxLU0axU@xz-m1.local>
 <CADrL8HXYab_VJS=Y0h2OSiCrj2pYbDJME2P=Tsn9jcDRbcqR1g@mail.gmail.com>
MIME-Version: 1.0
In-Reply-To: <CADrL8HXYab_VJS=Y0h2OSiCrj2pYbDJME2P=Tsn9jcDRbcqR1g@mail.gmail.com>
X-Mimecast-Spam-Score: 0
X-Mimecast-Originator: redhat.com
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1657905640; a=rsa-sha256;
	cv=none;
	b=nxKy4DhLkKhsYT+KHjza6YS+T6jp4JKXpt6qHf4Ru73lzdb4mwV3fha3dCXzJobWyLO+7S
	4zzqqjuReF4TaWxTiZQ3ZV1cO/NTyr6rRDq8f9zGnACgiyLvpeN/buwq3W1k+HfKKUCOVx
	LUoYcx14ukgvMm6ikAuL87HAhF/eNi0=
ARC-Authentication-Results: i=1;
	imf26.hostedemail.com;
	dkim=pass header.d=redhat.com header.s=mimecast20190719 header.b=M2x9QjAy;
	dmarc=pass (policy=none) header.from=redhat.com;
	spf=none (imf26.hostedemail.com: domain of peterx@redhat.com has no SPF policy when checking 170.10.133.124) smtp.mailfrom=peterx@redhat.com
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com;
	s=arc-20220608; t=1657905640;
	h=from:from:sender:reply-to:subject:subject:date:date:
	 message-id:message-id:to:to:cc:cc:mime-version:mime-version:
	 content-type:content-type:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references:dkim-signature;
	bh=PTK4goQcbNMUtE6aJiq+7EgmXb2x5IDyZNvU93q4GVI=;
	b=YzkhdPmbyDDltM+1wsgI1BxW4Lx/P9uInrNnk/i9HKab8IJf7ivMzjf/gQXG7/fOA941uQ
	fpzrMjjcyy/32zNQiyp27XkHN2Esam9EgNfUQZp0E9VxSatV7jQ9YNPPkBogkvMoNGZr9G
	uOwYtE271bkCdcGg0K2IqG4yZe3qnr8=
X-Stat-Signature: kgfh3a5q9jedeism1b5orqo1k8pky8k8
X-Rspamd-Queue-Id: 8A2BF14006F
Authentication-Results: imf26.hostedemail.com;
	dkim=pass header.d=redhat.com header.s=mimecast20190719 header.b=M2x9QjAy;
	dmarc=pass (policy=none) header.from=redhat.com;
	spf=none (imf26.hostedemail.com: domain of peterx@redhat.com has no SPF policy when checking 170.10.133.124) smtp.mailfrom=peterx@redhat.com
X-Rspamd-Server: rspam09
X-Rspam-User: 
X-HE-Tag: 1657905640-87200
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>

On Fri, Jul 15, 2022 at 09:58:10AM -0700, James Houghton wrote:
> On Fri, Jul 15, 2022 at 9:21 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Fri, Jun 24, 2022 at 05:36:50PM +0000, James Houghton wrote:
> > > The changes here are very similar to the changes made to
> > > hugetlb_no_page, where we do a high-granularity page table walk and
> > > do accounting slightly differently because we are mapping only a piece
> > > of a page.
> > >
> > > Signed-off-by: James Houghton <jthoughton@google.com>
> > > ---
> > >  fs/userfaultfd.c        |  3 +++
> > >  include/linux/hugetlb.h |  6 +++--
> > >  mm/hugetlb.c            | 54 +++++++++++++++++++++-----------------
> > >  mm/userfaultfd.c        | 57 +++++++++++++++++++++++++++++++----------
> > >  4 files changed, 82 insertions(+), 38 deletions(-)
> > >
> > > diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> > > index e943370107d0..77c1b8a7d0b9 100644
> > > --- a/fs/userfaultfd.c
> > > +++ b/fs/userfaultfd.c
> > > @@ -245,6 +245,9 @@ static inline bool userfaultfd_huge_must_wait(struct userfaultfd_ctx *ctx,
> > >       if (!ptep)
> > >               goto out;
> > >
> > > +     if (hugetlb_hgm_enabled(vma))
> > > +             goto out;
> > > +
> >
> > This is weird.  It means we'll never wait for sub-page mapping enabled
> > vmas.  Why?
> >
> 
> `ret` is true in this case, so we're actually *always* waiting.

Aha!  Then I think that's another problem, sorry. :) See below.

> 
> > Not to mention hugetlb_hgm_enabled() currently is simply VM_SHARED, so it
> > means we'll stop waiting for all shared hugetlbfs uffd page faults..
> >
> > I'd expect in the in-house postcopy tests you should see vcpu threads
> > spinning on the page faults until it's serviced.
> >
> > IMO we still need to properly wait when the pgtable doesn't have the
> > faulted address covered.  For sub-page mapping it'll probably need to walk
> > into sub-page levels.
> 
> Ok, SGTM. I'll do that for the next version. I'm not sure of the
> consequences of returning `true` here when we should be returning
> `false`.

We've already put ourselves onto the wait queue.  If a concurrent
UFFDIO_CONTINUE has happened and the pte is already installed, I think this
thread could end up waiting forever at the next schedule().

The solution should be the same - walking the sub-page pgtable would work,
afaict.
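
A rough userspace model of what I mean (not kernel code; toy_pgtable,
toy_must_wait() and the 2M/4K sizes are all made up for illustration).  The
point is only that the wait decision looks at the entry that actually covers
the faulting address, at whatever level it is mapped:

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_PAGE_SIZE   4096UL
#define TOY_HPAGE_SIZE  (2UL << 20)
#define TOY_NR_SUBPAGES (TOY_HPAGE_SIZE / TOY_PAGE_SIZE)

/* Hypothetical model of the mapping state of a single hugepage. */
struct toy_pgtable {
	bool huge_present;	/* mapped by one hstate-level entry */
	bool split;		/* hstate-level entry points to a sub-page table */
	bool sub_present[TOY_NR_SUBPAGES];	/* per-4K entries, used when split */
};

/*
 * Return true if a thread faulting at 'offset' (bytes from the start of
 * the hugepage) still has to wait for userspace to resolve the fault.
 */
static bool toy_must_wait(const struct toy_pgtable *pt, unsigned long offset)
{
	assert(offset < TOY_HPAGE_SIZE);

	if (pt->huge_present)
		return false;	/* the whole hugepage is already mapped */
	if (!pt->split)
		return true;	/* nothing is mapped at any level */

	/* Walk into the sub-page level and check the covering entry. */
	return !pt->sub_present[offset / TOY_PAGE_SIZE];
}
```

With only sub-page 3 marked present in a split table, a fault at offset
3 * 4096 + 8 would not wait, while a fault anywhere in sub-page 4 would.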

[...]

> > > @@ -6239,14 +6241,16 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
> > >        * registered, we firstly wr-protect a none pte which has no page cache
> > >        * page backing it, then access the page.
> > >        */
> > > -     if (!huge_pte_none_mostly(huge_ptep_get(dst_pte)))
> > > +     if (!hugetlb_pte_none_mostly(dst_hpte))
> > >               goto out_release_unlock;
> > >
> > > -     if (vm_shared) {
> > > -             page_dup_file_rmap(page, true);
> > > -     } else {
> > > -             ClearHPageRestoreReserve(page);
> > > -             hugepage_add_new_anon_rmap(page, dst_vma, dst_addr);
> > > +     if (new_mapping) {
> >
> > IIUC you wanted to avoid the mapcount accountings when it's the sub-page
> > that was going to be mapped.
> >
> > Is it a must we get this only from the caller?  Can we know we're doing
> > sub-page mapping already here and make a decision with e.g. dst_hpte?
> >
> > It looks weird to me to pass this explicitly from the caller, especially
> > that's when we don't really have the pgtable lock so I'm wondering about
> > possible race conditions too on having stale new_mapping values.
> 
> The only way to know what the correct value for `new_mapping` should
> be is to know if we had to change the hstate-level P*D to non-none to
> service this UFFDIO_CONTINUE request. I'll see if there is a nice way
> to do that check in `hugetlb_mcopy_atomic_pte`.
> Right now there is no

Would "new_mapping = dst_hpte->shift != huge_page_shift(hstate)" work (or
something similar)?

> race, because we synchronize on the per-hpage mutex.

Yeah, I'm not familiar enough with that mutex to tell.  As long as it
guarantees no pgtable update (hmm, then why do we need the pgtable lock
here?), it looks fine.

[...]

> > > @@ -335,12 +337,16 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
> > >       copied = 0;
> > >       page = NULL;
> > >       vma_hpagesize = vma_kernel_pagesize(dst_vma);
> > > +     if (use_hgm)
> > > +             vma_altpagesize = PAGE_SIZE;
> >
> > Do we need to check the "len" to know whether we should use sub-page
> > mapping or original hpage size?  E.g. any old UFFDIO_CONTINUE code will
> > still want the old behavior I think.
> 
> I think that's a fair point; however, if we enable HGM and the address
> and len happen to be hstate-aligned

The address can be, but len (note: not "end" here) cannot?

> , we basically do the same thing as
> if HGM wasn't enabled. It could be a minor performance optimization to
> do `vma_altpagesize=vma_hpagesize` in that case, but in terms of how
> the page tables are set up, the end result would be the same.
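
For reference, the optimization mentioned above could look roughly like the
sketch below.  It is a standalone userspace illustration only;
toy_is_aligned() and toy_pick_mapsize() are made-up names, not anything in
the patch, and the sizes are passed in rather than read from a vma:

```c
#include <assert.h>
#include <stdbool.h>

/* True if x is a multiple of align; align must be a power of two. */
static bool toy_is_aligned(unsigned long x, unsigned long align)
{
	assert(align && !(align & (align - 1)));
	return (x & (align - 1)) == 0;
}

/*
 * Hypothetical helper: pick the mapping size for one request.
 * Use the hugepage size only when both the start address and the
 * (non-zero) length are multiples of it; otherwise fall back to the
 * base page size.
 */
static unsigned long toy_pick_mapsize(unsigned long addr, unsigned long len,
				      unsigned long hpagesize,
				      unsigned long pagesize)
{
	if (len && toy_is_aligned(addr, hpagesize) &&
	    toy_is_aligned(len, hpagesize))
		return hpagesize;
	return pagesize;
}
```

A 2M-aligned address with a 2M length would pick the hugepage size here; the
same address with a 4K length would pick the base page size.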

Thanks,

-- 
Peter Xu