Date: Fri, 15 Jul 2022 13:20:35 -0400
From: Peter Xu
To: James Houghton
Cc: Mike Kravetz, Muchun Song, David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry, Jue Wang, Manish Mishra, "Dr . David Alan Gilbert", linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH 20/26] hugetlb: add support for high-granularity UFFDIO_CONTINUE
References: <20220624173656.2033256-1-jthoughton@google.com> <20220624173656.2033256-21-jthoughton@google.com>

On Fri, Jul 15, 2022 at 09:58:10AM -0700, James Houghton wrote:
> On Fri, Jul 15, 2022 at 9:21 AM Peter Xu wrote:
> >
> > On Fri, Jun 24, 2022 at 05:36:50PM +0000, James Houghton wrote:
> > > The changes here are very similar to the changes made to
> > > hugetlb_no_page, where we do a high-granularity page table walk and
> > > do accounting slightly differently because we are mapping only a piece
> > > of a page.
> > >
> > > Signed-off-by: James Houghton
> > > ---
> > >  fs/userfaultfd.c        |  3 +++
> > >  include/linux/hugetlb.h |  6 +++--
> > >  mm/hugetlb.c            | 54 +++++++++++++++++++++-----------------
> > >  mm/userfaultfd.c        | 57 +++++++++++++++++++++++++++++++----------
> > >  4 files changed, 82 insertions(+), 38 deletions(-)
> > >
> > > diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> > > index e943370107d0..77c1b8a7d0b9 100644
> > > --- a/fs/userfaultfd.c
> > > +++ b/fs/userfaultfd.c
> > > @@ -245,6 +245,9 @@ static inline bool userfaultfd_huge_must_wait(struct userfaultfd_ctx *ctx,
> > >       if (!ptep)
> > >               goto out;
> > >
> > > +     if (hugetlb_hgm_enabled(vma))
> > > +             goto out;
> > > +
> >
> > This is weird.  It means we'll never wait for sub-page mapping enabled
> > vmas.  Why?
> >
>
> `ret` is true in this case, so we're actually *always* waiting.

Aha!  Then I think that's another problem, sorry. :) See Below.

>
> > Not to mention hugetlb_hgm_enabled() currently is simply VM_SHARED, so it
> > means we'll stop waiting for all shared hugetlbfs uffd page faults..
> >
> > I'd expect in the in-house postcopy tests you should see vcpu threads
> > spinning on the page faults until it's serviced.
> >
> > IMO we still need to properly wait when the pgtable doesn't have the
> > faulted address covered.  For sub-page mapping it'll probably need to walk
> > into sub-page levels.
>
> Ok, SGTM. I'll do that for the next version. I'm not sure of the
> consequences of returning `true` here when we should be returning
> `false`.

We've already put ourselves onto the wait queue; if a concurrent
UFFDIO_CONTINUE has happened and the pte is already installed, I think this
thread could end up waiting forever at the next schedule().

The solution should be the same: walking the sub-page pgtable would work,
afaict.

[...]

> > > @@ -6239,14 +6241,16 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
> > >        * registered, we firstly wr-protect a none pte which has no page cache
> > >        * page backing it, then access the page.
> > >        */
> > > -     if (!huge_pte_none_mostly(huge_ptep_get(dst_pte)))
> > > +     if (!hugetlb_pte_none_mostly(dst_hpte))
> > >               goto out_release_unlock;
> > >
> > > -     if (vm_shared) {
> > > -             page_dup_file_rmap(page, true);
> > > -     } else {
> > > -             ClearHPageRestoreReserve(page);
> > > -             hugepage_add_new_anon_rmap(page, dst_vma, dst_addr);
> > > +     if (new_mapping) {
> >
> > IIUC you wanted to avoid the mapcount accountings when it's the sub-page
> > that was going to be mapped.
> >
> > Is it a must we get this only from the caller?  Can we know we're doing
> > sub-page mapping already here and make a decision with e.g. dst_hpte?
> >
> > It looks weird to me to pass this explicitly from the caller, especially
> > that's when we don't really have the pgtable lock so I'm wondering about
> > possible race conditions too on having stale new_mapping values.
>
> The only way to know what the correct value for `new_mapping` should
> be is to know if we had to change the hstate-level P*D to non-none to
> service this UFFDIO_CONTINUE request. I'll see if there is a nice way
> to do that check in `hugetlb_mcopy_atomic_pte`.
> Right now there is no

Would "new_mapping = dst_hpte->shift != huge_page_shift(hstate)" work (or
something similar)?

> race, because we synchronize on the per-hpage mutex.

Yeah, I'm not familiar enough with that mutex to tell. As long as it
guarantees no pgtable update (hmm, then why do we need the pgtable lock
here???), it looks fine.

[...]

> > > @@ -335,12 +337,16 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
> > >       copied = 0;
> > >       page = NULL;
> > >       vma_hpagesize = vma_kernel_pagesize(dst_vma);
> > > +     if (use_hgm)
> > > +             vma_altpagesize = PAGE_SIZE;
> >
> > Do we need to check the "len" to know whether we should use sub-page
> > mapping or original hpage size?  E.g. any old UFFDIO_CONTINUE code will
> > still want the old behavior I think.
>
> I think that's a fair point; however, if we enable HGM and the address
> and len happen to be hstate-aligned

The address can, but len (note! not "end" here) cannot?

> , we basically do the same thing as
> if HGM wasn't enabled. It could be a minor performance optimization to
> do `vma_altpagesize=vma_hpagesize` in that case, but in terms of how
> the page tables are set up, the end result would be the same.

Thanks,

-- 
Peter Xu
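
Illustration only (not code from the patch): the "minor performance
optimization" quoted above, i.e. falling back to the hstate page size when
both the address and len are hstate-aligned, could look roughly like the
userspace sketch below. pick_map_size() and its parameters are hypothetical
names made up for this example; they are not kernel functions, and both
sizes are assumed to be powers of two.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper, not a kernel API: choose the granularity at which
 * one UFFDIO_CONTINUE request would be mapped.
 *
 * hpage_size and base_size are assumed to be powers of two.
 */
size_t pick_map_size(unsigned long addr, unsigned long len,
                     size_t hpage_size, size_t base_size, bool use_hgm)
{
    /* Both start address and length must be multiples of the huge page size. */
    bool hstate_aligned = ((addr | len) & (hpage_size - 1)) == 0;

    /* Without HGM, or for a fully aligned request, keep the old behavior. */
    if (!use_hgm || hstate_aligned)
        return hpage_size;

    /* Otherwise map at base-page granularity. */
    return base_size;
}
```

With a 2 MiB huge page and a 4 KiB base page, a request at 0x40000000 with
len = 2 MiB would keep the huge page size, while the same address with
len = 4 KiB would drop to the base page size.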