Date: Mon, 18 Jul 2022 17:21:47 -0400
From: Peter Xu
To: Nadav Amit
Cc: Linux MM, Andrew Morton, Mike Kravetz, Hugh Dickins, Axel Rasmussen,
    David Hildenbrand, Mike Rapoport
Subject: Re: [PATCH v2 2/5] userfaultfd: introduce access-likely mode for common operations
References: <20220718114748.2623-1-namit@vmware.com> <20220718114748.2623-3-namit@vmware.com> <8F18AE8D-2496-4F8C-90C2-D537E88F7137@vmware.com>
In-Reply-To: <8F18AE8D-2496-4F8C-90C2-D537E88F7137@vmware.com>

On Mon, Jul 18, 2022 at 08:59:37PM +0000, Nadav Amit wrote:
> On Jul 18, 2022, at 1:05 PM, Peter Xu wrote:
>
> > On Mon, Jul 18, 2022 at 04:47:45AM -0700, Nadav Amit wrote:
> >> @@ -261,6 +272,7 @@ struct uffdio_copy {
> >>  struct uffdio_zeropage {
> >>  	struct uffdio_range range;
> >>  #define UFFDIO_ZEROPAGE_MODE_DONTWAKE ((__u64)1<<0)
> >> +#define UFFDIO_ZEROPAGE_MODE_ACCESS_LIKELY ((__u64)1<<1)
> >
> > Would access hint help zeropage use case?  I remembered you used to comment
> > around and said it won't help since we won't reclaim zero page anyway.
>
> I agree that there is no meaning for access bit on zero page. I just think
> that it is best to have the flags for consistency. If you ask me, I would
> prefer to have all the flags in a fixed place (highest bits?). Anyhow, if we
> expose the hints as a feature, I do not think we would later want to say
> “here is another feature that enables another hint that we thought is not
> needed before”. Userfaultfd’s feature bits are already nuts, IMHO.
>
> > It won't help either even if this flag is only used for the follow up
> > WRITE_HINT (since then there'll be a CoW) because when WRITE_HINT is
> > attached it doesn't make sense to not have ACCESS_HINT, then it seems the
> > WRITE_HINT itself would be enough for ZEROPAGE to me.
>
> Agreed. Again, I think it is worth it for consistency.

I'd be fine if it's kernel internal flags only.  But this is solid kernel
ABI.  Are you.. sure?  We're literally trying to introduce some flags just
for "consistency" even if we know nobody will be using them.  It really
doesn't sound very right for designing good interfaces..

>
> > [...]
> >
> >> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> >> index 421784d26651..c15679f3eb6a 100644
> >> --- a/mm/userfaultfd.c
> >> +++ b/mm/userfaultfd.c
> >> @@ -65,6 +65,7 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
> >>  	bool writable = dst_vma->vm_flags & VM_WRITE;
> >>  	bool vm_shared = dst_vma->vm_flags & VM_SHARED;
> >>  	bool page_in_cache = page->mapping;
> >> +	bool prefault = !(uffd_flags & UFFD_FLAGS_ACCESS_LIKELY);
> >
> > I think it's okay to name it "prefault" as a temp var, but ideally IMHO we
> > shouldn't assume what the user app is doing - it is only installing some
> > uffd pgtables with !ACCESS_LIKELY and it does not necessarily need to be a
> > prefault process..
> >
> >>  	spinlock_t *ptl;
> >>  	struct inode *inode;
> >>  	pgoff_t offset, max_off;
> >> @@ -92,6 +93,11 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
> >>  		 */
> >>  		_dst_pte = pte_wrprotect(_dst_pte);
> >>
> >> +	if (prefault && arch_wants_old_prefaulted_pte())
> >> +		_dst_pte = pte_mkold(_dst_pte);
> >> +	else
> >> +		_dst_pte = pte_sw_mkyoung(_dst_pte);
> >
> > Could you explain why we couldn't unconditionally mkold here even for x86?
>
> To answer this question and the previous one, please note that the logic is
> “borrowed” from do_set_pte(). If you want me to refactor and extract a
> function, please let me know.
>
> Here is the deal: for x86, we don’t do pte_mkold() because setting the
> access bit is expensive (>500 cycles). For arm64 CPUs that have a hardware
> access bit we do mark the PTE old, since (according to the arm64 code or
> commit log) the cost of setting the access bit on arm is low.
>
> > It'll be a pity if this feature bit will only be useful on arm64 but not
> > covering x86 (which is so far still the majority I think).
> >
> > IMHO it's slightly different here comparing to kernel prefaults - the user
> > app may not be aware of kernel prefaults, but here !ACCESS_HINT it's
> > user-aware, and it's what user app explicitly provided.  IMO it's a
> > stronger proof of a cold page already.
>
> I’m ok with that if that is your choice. I actually prefer to give userspace
> more control, but I tried to be consistent with other parts of the kernel.

Ah good to know, then if there's a vote I'll go for your proposal.  I'd
suggest we make it a strong semantics.  We used to have similar discussions
around the MADV_COLLAPSE on whether it should be restricted to khugepaged
limitations.  I think it's similar here.

> Having said that, it’s really hard for me to see why young bit would be clear,
> but dirty bit would be set...

Assume one page has both young/dirty set, the reclaim code decides to age
this page, then.. young=0 && dirty=1?

>
> > The other thing I got confused here is arch_wants_old_prefaulted_pte()
> > returns true if arm64 supports hardware AF.  However for all the other archs
> > (including x86_64 which, afaict, supports AF too in most models) it'll
> > constantly return false.  Do you know what's the rationale behind?
>
> All x86 (32/64) since 386 support access-bit in the page-tables (IIRC, 286
> had access bit in the segments).
>
> I thought we discussed it before: if you access an old PTE on x86, you
> pay >500 cycles; this actually affected UnixBench when people tried to
> change this behavior [1].
> In contrast, on arm64, which I have never profiled, you
> probably saw the comment saying: "Experimentally, it's cheap to set the
> access flag in hardware and we benefit from prefaulting mappings as 'old' to
> start with.".

Thanks.  I'm really curious how fast would aarch64 be on setting
hardware-assist young bit and why now.

>
> I do not know what happens on other architectures.
>
> ( sorry if I have some repetitions in this email )
>
> [1] https://marc.info/?l=linux-kernel&m=146582237922378&w=2
>

-- 
Peter Xu
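
For readers following the thread, the two PTE-aging policies discussed above
can be compared in a small userspace sketch.  None of this is kernel code: the
bit layout, the SKETCH_* names and both install_*() helpers are hypothetical
and exist only for illustration.  The "young" path is also simplified to set
the bit explicitly, which is not necessarily what pte_sw_mkyoung() does on a
given architecture.  install_as_posted() mirrors the condition in the quoted
hunk; install_strong() models the "strong semantics" alternative, where a
missing ACCESS_LIKELY hint always yields an old PTE.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit layout for this sketch; real PTE layouts are per-arch. */
#define SKETCH_PTE_YOUNG	(UINT64_C(1) << 0)
#define SKETCH_PTE_DIRTY	(UINT64_C(1) << 1)

/* Hypothetical stand-in for UFFD_FLAGS_ACCESS_LIKELY from the patch. */
#define SKETCH_ACCESS_LIKELY	(UINT64_C(1) << 0)

/* Clear the young (accessed) bit, leaving all other bits untouched. */
static uint64_t sketch_mkold(uint64_t pte)
{
	return pte & ~SKETCH_PTE_YOUNG;
}

/* Set the young (accessed) bit, leaving all other bits untouched. */
static uint64_t sketch_mkyoung(uint64_t pte)
{
	return pte | SKETCH_PTE_YOUNG;
}

/*
 * Policy as in the quoted hunk: install an old PTE only when the caller
 * did not pass ACCESS_LIKELY *and* the architecture prefers old PTEs
 * (arch_wants_old is a stand-in for arch_wants_old_prefaulted_pte()).
 */
static uint64_t install_as_posted(uint64_t pte, uint64_t flags,
				  bool arch_wants_old)
{
	bool prefault = !(flags & SKETCH_ACCESS_LIKELY);

	if (prefault && arch_wants_old)
		return sketch_mkold(pte);
	return sketch_mkyoung(pte);
}

/*
 * "Strong semantics" alternative: the userspace hint alone decides,
 * regardless of the architecture preference.
 */
static uint64_t install_strong(uint64_t pte, uint64_t flags)
{
	if (!(flags & SKETCH_ACCESS_LIKELY))
		return sketch_mkold(pte);
	return sketch_mkyoung(pte);
}
```

With arch_wants_old == false (the x86 case described in the thread),
install_as_posted() returns a young PTE whether or not the hint is given, so
the hint has no effect there; install_strong() honors the hint on every
architecture.  Separately, sketch_mkold() applied to a PTE with both bits set
leaves only the dirty bit, which is the young=0 && dirty=1 state mentioned
above.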