Date: Fri, 24 Jun 2022 17:58:17 -0400
From: Peter Xu
To: Nadav Amit
Cc: Linux MM, Mike Kravetz, Hugh Dickins, Andrew Morton, Axel Rasmussen, David Hildenbrand, Mike Rapoport
Subject: Re: [PATCH v1 2/5] userfaultfd: introduce access-likely mode for common operations
References: <20220622185038.71740-1-namit@vmware.com> <20220622185038.71740-3-namit@vmware.com> <18BCC23E-B344-41A8-926D-A49D768485AF@vmware.com> <6EF7D3B4-CF17-407B-A50F-B14D595E99A5@vmware.com>
In-Reply-To: <6EF7D3B4-CF17-407B-A50F-B14D595E99A5@vmware.com>

[Sorry for replying late]

On Fri, Jun 24, 2022 at 02:42:21AM +0000, Nadav Amit wrote:
>
>
> > On Jun 23, 2022, at 7:05 PM, Peter Xu wrote:
> >
> > On Fri, Jun 24, 2022 at 12:03:38AM +0000, Nadav Amit wrote:
> >> My take is that hints are hints. Following David’s (or was it yours?)
> >> feedback, I fixed the description to indicate that this is merely a hint and
> >> removed all references to dirty/access bits. The kernel therefore can ignore
> >> the hint when it wants to or use it in any other way. I fully agree that
> >> this gives the kernel the ability to change the behavior as needed.
> >>
> >> Note that for write-protected 4KB zero-page (where we share the zero-page)
> >> we always set the access-bit, regardless of the hint, because it makes
> >> sense: the zero-page is not swappable and therefore the access-bit is set.
> >
> > The zero-page example makes sense, and yeah that makes the hugetlb behavior
> > making more sense too.
> >
> >>
> >> I think that the lesser user-facing documentation there is on how the
> >> feature is *exactly* used by the kernel - is better from an API point of
> >> view.
> >>
> >> So I see no reason to fail or be forced not to set a page as young, just
> >> because a hint was *not* provided. This would even be a regression in the
> >> behavior. The hint is actually always respected right now, it is just that
> >> even if you do not provide the hint, the access/dirty is set.
> >>
> >> The only consistency I think worth thinking about is with the dirty-bit, and
> >> I can add it if you want. Note that the access-bit (in x86) might be set
> >> speculatively in contrast to the dirty-bit is only set atomically with a
> >> real access. That’s the reason I think it may make sense not to set the
> >> dirty without a hint.
> >
> > Sorry to ask if this is (another) naive question: any link/help to explain
> > the speculative behavior on access bit? Is it part of speculative
> > execution (which, iiuc, would it be reverted if the speculation failed)?
>
> Oh man, it is hard to find a reference. I made this claim it based on my
> recollection (and logic).
>
> The access-bit on Intel is set when the PTE is loaded into the TLB, so if you
> allow speculative loading of the TLB, that’s what you get.
>
> Googling shows Yu Zhao saying: "IIRC, there are also false positives, i.e.,
> the accessed bit is set on entries used by speculative execution only.” [1]
>
> Intel SDM says: "Whenever the processor uses a paging-structure entry as part
> of linear-address translation, it sets the accessed flag in that entry...
> Whenever there is a write to a linear address, the processor sets the dirty
> flag (if it is not already set) in the paging-structure entry..."
>
> You can argue that this indicates that the access-bit is updated
> speculatively (translations can be speculative) and dirty-bit is on actual
> write. But it is somewhat of a creative reading.
>
> Googling further did not help much, but I found a relevant discussion on
> RISC-V, in which they actually consider a similar behavior. [2]
>
> If you want (and care), we can cc Dave Hansen to get a clear answer.
>
> [1] https://lore.kernel.org/lkml/YE7Rk%2FYA1Uj7yFn2@google.com/
> [2] https://lists.riscv.org/g/tech-virt-mem/topic/accessed_bit/77699883?p=,,,20,0,0,0::recentpostdate%2Fsticky,,,20,1,80,77699883

I thought writes could be speculatively executed too?  Though I think that
when the speculation is proven wrong, the write needs to be reverted, along
with making sure the D bit is cleared if it was clear before the
speculative operation.

So I think I get what you mean if it's that the access bit may not be
reverted even when the speculation fails (though without solid proof,
afaict..).  IOW we could have false-positive access bits set even if the
page was not accessed, but not D bits, which should be accurate.

>
> >
> >>
> >> Is that acceptable? Access-bit always set, dirty-bit according to hint?
> >
> > I'm still trying to digest what you said above, sorry.
> >
> > Aren't both access and dirty bits need an atomic op to be set anyway? Then
> > from perf pov should we simply keep setting them both too like what you did
> > with this version? because it seems that'll always avoid an extra pgtable
> > update access?
>
> I guess by atomic-op you mean atomic-update by the hardware AD-assist.

Yes.  Btw, having looked at the SDM section you quoted, I think that may
not strictly behave like an atomic op from the processor's pov, since
there's a NOTE:

  The accesses used by the processor to set these flags may or may not be
  exposed to the processor’s self-modifying code detection logic. If the
  processor is executing code from the same memory area that is being used
  for the paging structures, the setting of these flags may or may not
  result in an immediate change to the executing code stream.

So I read it as: even if the update is atomic, the op can be postponed.

>
> I agree that if a page is written, the bits would need to be updated and
> these would introduce an overhead. However, if the page cannot be written,
> well, the dirty bit would never be set.

OK, I see what you mean now.  But honestly, I don't think it's related to
the speculative access-bit behavior described above.. or is it?

>
> hugetlb_mcopy_atomic_pte() currently does the following:
>
>     _dst_pte = huge_pte_mkdirty(_dst_pte);
>     _dst_pte = pte_mkyoung(_dst_pte);
>
>     if (wp_copy)
>         _dst_pte = huge_pte_mkuffd_wp(_dst_pte);
>
> Since you asked to update hugetlb_mcopy_atomic_pte(), I can offer three
> options:
>
> 1. Do not set dirty if (wp_copy).
> 2. Do not set dirty if (wp_copy || !write_hint)
> 3. Keep it as is.

AFAICT the current series already does something other than (3) for
non-hugetlb pages, because the dirty bit is no longer always set for them,
so I'd say we should make them match?  Hugetlbfs shouldn't be special in
this respect, IMHO.

That said, I don't think it necessarily needs to be that complex, since
make_huge_pte() already sets the dirty bit when "writable=1", so IIUC all
you need to do is make sure the dirty bit is set when write_hint=1.

Does that sound correct to you?

-- 
Peter Xu