From: James Houghton
Date: Wed, 20 Jul 2022 13:58:06 -0700
Subject: Re: [RFC PATCH 20/26] hugetlb: add support for high-granularity UFFDIO_CONTINUE
To: Peter Xu
Cc: Mike Kravetz, Muchun Song, David Hildenbrand, David Rientjes,
 Axel Rasmussen, Mina Almasry, Jue Wang, Manish Mishra,
 "Dr. David Alan Gilbert", linux-mm@kvack.org, linux-kernel@vger.kernel.org
References: <20220624173656.2033256-1-jthoughton@google.com> <20220624173656.2033256-21-jthoughton@google.com>

On Fri, Jul 15, 2022 at 10:20 AM Peter Xu wrote:
>
> On Fri, Jul 15, 2022 at 09:58:10AM -0700, James
Houghton wrote:
> > On Fri, Jul 15, 2022 at 9:21 AM Peter Xu wrote:
> > >
> > > On Fri, Jun 24, 2022 at 05:36:50PM +0000, James Houghton wrote:
> > > > The changes here are very similar to the changes made to
> > > > hugetlb_no_page, where we do a high-granularity page table walk and
> > > > do accounting slightly differently because we are mapping only a piece
> > > > of a page.
> > > >
> > > > Signed-off-by: James Houghton
> > > > ---
> > > >  fs/userfaultfd.c        |  3 +++
> > > >  include/linux/hugetlb.h |  6 +++--
> > > >  mm/hugetlb.c            | 54 +++++++++++++++++++++-----------------
> > > >  mm/userfaultfd.c        | 57 +++++++++++++++++++++++++++++++----------
> > > >  4 files changed, 82 insertions(+), 38 deletions(-)
> > > >
> > > > diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> > > > index e943370107d0..77c1b8a7d0b9 100644
> > > > --- a/fs/userfaultfd.c
> > > > +++ b/fs/userfaultfd.c
> > > > @@ -245,6 +245,9 @@ static inline bool userfaultfd_huge_must_wait(struct userfaultfd_ctx *ctx,
> > > >         if (!ptep)
> > > >                 goto out;
> > > >
> > > > +       if (hugetlb_hgm_enabled(vma))
> > > > +               goto out;
> > > > +
> > >
> > > This is weird. It means we'll never wait for sub-page mapping enabled
> > > vmas. Why?
> > >
> >
> > `ret` is true in this case, so we're actually *always* waiting.
>
> Aha! Then I think that's another problem, sorry. :) See below.
>
> > > Not to mention hugetlb_hgm_enabled() currently is simply VM_SHARED, so it
> > > means we'll stop waiting for all shared hugetlbfs uffd page faults..
> > >
> > > I'd expect in the in-house postcopy tests you should see vcpu threads
> > > spinning on the page faults until they're serviced.
> > >
> > > IMO we still need to properly wait when the pgtable doesn't have the
> > > faulted address covered. For sub-page mapping it'll probably need to walk
> > > into sub-page levels.
> >
> > Ok, SGTM. I'll do that for the next version. I'm not sure of the
> > consequences of returning `true` here when we should be returning
> > `false`.
>
> We've put ourselves onto the wait queue; if another concurrent
> UFFDIO_CONTINUE happened and the pte is already installed, I think this
> thread could be waiting forever on the next schedule().
>
> The solution should be the same - walking the sub-page pgtable would work,
> afaict.
>
> [...]
>
> > > > @@ -6239,14 +6241,16 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
> > > >          * registered, we firstly wr-protect a none pte which has no page cache
> > > >          * page backing it, then access the page.
> > > >          */
> > > > -       if (!huge_pte_none_mostly(huge_ptep_get(dst_pte)))
> > > > +       if (!hugetlb_pte_none_mostly(dst_hpte))
> > > >                 goto out_release_unlock;
> > > >
> > > > -       if (vm_shared) {
> > > > -               page_dup_file_rmap(page, true);
> > > > -       } else {
> > > > -               ClearHPageRestoreReserve(page);
> > > > -               hugepage_add_new_anon_rmap(page, dst_vma, dst_addr);
> > > > +       if (new_mapping) {
> > >
> > > IIUC you wanted to avoid the mapcount accounting when it's the sub-page
> > > that was going to be mapped.
> > >
> > > Is it a must we get this only from the caller? Can we know we're doing
> > > sub-page mapping already here and make a decision with e.g. dst_hpte?
> > >
> > > It looks weird to me to pass this explicitly from the caller, especially
> > > since we don't really have the pgtable lock there, so I'm wondering about
> > > possible race conditions too on having stale new_mapping values.
> >
> > The only way to know what the correct value for `new_mapping` should
> > be is to know if we had to change the hstate-level P*D to non-none to
> > service this UFFDIO_CONTINUE request. I'll see if there is a nice way
> > to do that check in `hugetlb_mcopy_atomic_pte`. Right now there is no
>
> Would "new_mapping = dest_hpte->shift != huge_page_shift(hstate)" work (or
> something alike)?

This works in the hugetlb_fault case, because in the hugetlb_fault
case, we install the largest PTE possible. If we are mapping a page
for the first time, we will use an hstate-sized PTE.
But for UFFDIO_CONTINUE, we may be installing a 4K PTE as the first
PTE for the whole hpage, so the shift comparison doesn't tell us
whether this is a new mapping.

> > race, because we synchronize on the per-hpage mutex.
>
> Yeah, not familiar with that mutex enough to tell; as long as that mutex
> guarantees no pgtable update (hmm, then why do we need the pgtable lock
> here???) then it looks fine.

Let me take a closer look at this. I'll have a more detailed
explanation for the next version of the RFC.

> [...]
>
> > > > @@ -335,12 +337,16 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
> > > >         copied = 0;
> > > >         page = NULL;
> > > >         vma_hpagesize = vma_kernel_pagesize(dst_vma);
> > > > +       if (use_hgm)
> > > > +               vma_altpagesize = PAGE_SIZE;
> > >
> > > Do we need to check the "len" to know whether we should use sub-page
> > > mapping or original hpage size? E.g. any old UFFDIO_CONTINUE code will
> > > still want the old behavior I think.
> >
> > I think that's a fair point; however, if we enable HGM and the address
> > and len happen to be hstate-aligned
>
> The address can, but len (note! not "end" here) cannot?

They both (dst_start and len) need to be hpage-aligned; otherwise we
won't be able to install hstate-sized PTEs. For example, if we're
installing 4K at the beginning of a 1G hpage, we can't install a PUD,
because we only want to install that 4K.

> > , we basically do the same thing as
> > if HGM wasn't enabled. It could be a minor performance optimization to
> > do `vma_altpagesize = vma_hpagesize` in that case, but in terms of how
> > the page tables are set up, the end result would be the same.
>
> Thanks,

Thanks!

> --
> Peter Xu