Date: Fri, 24 Jun 2022 18:17:04 -0400
From: Peter Xu
To: Nadav Amit
Cc: Linux MM, Mike Kravetz, Hugh Dickins, Andrew Morton, Axel Rasmussen,
    David Hildenbrand, Mike Rapoport
Subject: Re: [PATCH v1 2/5] userfaultfd: introduce access-likely mode for common operations
References: <20220622185038.71740-1-namit@vmware.com>
    <20220622185038.71740-3-namit@vmware.com>
    <18BCC23E-B344-41A8-926D-A49D768485AF@vmware.com>
    <6EF7D3B4-CF17-407B-A50F-B14D595E99A5@vmware.com>

On Fri, Jun 24, 2022 at 05:58:17PM -0400, Peter Xu wrote:
> [Sorry for replying late]
>
> On Fri, Jun 24, 2022 at 02:42:21AM +0000, Nadav Amit wrote:
> >
> >
> > > On Jun 23, 2022, at 7:05 PM, Peter Xu wrote:
> > >
> > > On Fri, Jun 24, 2022 at 12:03:38AM +0000, Nadav Amit wrote:
> > >> My take is that hints are hints. Following David’s (or was it yours?)
> > >> feedback, I fixed the description to indicate that this is merely a hint and
> > >> removed all references to dirty/access bits. The kernel therefore can ignore
> > >> the hint when it wants to or use it in any other way. I fully agree that
> > >> this gives the kernel the ability to change the behavior as needed.
> > >>
> > >> Note that for write-protected 4KB zero-page (where we share the zero-page)
> > >> we always set the access-bit, regardless of the hint, because it makes
> > >> sense: the zero-page is not swappable and therefore the access-bit is set.
> > >
> > > The zero-page example makes sense, and yeah that makes the hugetlb behavior
> > > making more sense too.
> > >
> > >>
> > >> I think that the lesser user-facing documentation there is on how the
> > >> feature is *exactly* used by the kernel - is better from an API point of
> > >> view.
> > >>
> > >> So I see no reason to fail or be forced not to set a page as young, just
> > >> because a hint was *not* provided. This would even be a regression in the
> > >> behavior. The hint is actually always respected right now, it is just that
> > >> even if you do not provide the hint, the access/dirty is set.
> > >>
> > >> The only consistency I think worth thinking about is with the dirty-bit, and
> > >> I can add it if you want. Note that the access-bit (in x86) might be set
> > >> speculatively in contrast to the dirty-bit is only set atomically with a
> > >> real access. That’s the reason I think it may make sense not to set the
> > >> dirty without a hint.
> > >
> > > Sorry to ask if this is (another) naive question: any link/help to explain
> > > the speculative behavior on access bit? Is it part of speculative
> > > execution (which, iiuc, would it be reverted if the speculation failed)?
> >
> > Oh man, it is hard to find a reference. I made this claim it based on my
> > recollection (and logic).
> >
> > The access-bit on Intel is set when the PTE is loaded into the TLB, so if you
> > allow speculative loading of the TLB, that’s what you get.
> >
> > Googling shows Yu Zhao saying: "IIRC, there are also false positives, i.e.,
> > the accessed bit is set on entries used by speculative execution only.” [1]
> >
> > Intel SDM says: "Whenever the processor uses a paging-structure entry as part
> > of linear-address translation, it sets the accessed flag in that entry...
> > Whenever there is a write to a linear address, the processor sets the dirty
> > flag (if it is not already set) in the paging-structure entry..."
> >
> > You can argue that this indicates that the access-bit is updated
> > speculatively (translations can be speculative) and dirty-bit is on actual
> > write. But it is somewhat of a creative reading.
> >
> > Googling further did not help much, but I found a relevant discussion on
> > RISC-V, in which they actually consider a similar behavior. [2]
> >
> > If you want (and care), we can cc Dave Hansen to get a clear answer.
> >
> > [1] https://lore.kernel.org/lkml/YE7Rk%2FYA1Uj7yFn2@google.com/
> > [2] https://lists.riscv.org/g/tech-virt-mem/topic/accessed_bit/77699883?p=,,,20,0,0,0::recentpostdate%2Fsticky,,,20,1,80,77699883
>
> I thought even writes can be speculatively executed too?  Though I think
> when the speculation was proved wrong the write needs to be reverted along
> with making sure D bit cleared if it was cleared before the speculative
> operation.
>
> So I think I get you if you meant the access bit may not be reverted even
> if we hit a speculative failure (though without solid proofs, afaict..).
> IOW we could have false positive access bits set even if not accessed, but
> not to D bits which should be accurate.
>
> >
> > >
> > >>
> > >> Is that acceptable? Access-bit always set, dirty-bit according to hint?
> > >
> > > I'm still trying to digest what you said above, sorry.
> > >
> > > Aren't both access and dirty bits need an atomic op to be set anyway?  Then
> > > from perf pov should we simply keep setting them both too like what you did
> > > with this version?  because it seems that'll always avoid an extra pgtable
> > > update access?
> >
> > I guess by atomic-op you mean atomic-update by the hardware AD-assist.
>
> Yes.
>
> Btw, since I looked at the SDM as you quoted I think that may not strictly
> be like an atomic op from processor pov, I guess, since there's a NOTE:
>
> The accesses used by the processor to set these flags may or may not be
> exposed to the processor’s self-modifying code detection logic. If the
> processor is executing code from the same memory area that is being used
> for the paging structures, the setting of these flags may or may not
> result in an immediate change to the executing code stream.
>
> So I read it as: even if it'll be an atomic, the op can be postponed.
>
> >
> > I agree that if a page is written, the bits would need to be updated and
> > these would introduce an overhead. However, if the page cannot be written,
> > well, the dirty bit would never be set.
>
> Ok I see what you mean now.  But honestly, I don't think it's anything
> related to the speculative access bit behavior described above.. or is it?
>
> >
> > hugetlb_mcopy_atomic_pte() currently does the following:
> >
> >     _dst_pte = huge_pte_mkdirty(_dst_pte);
> >     _dst_pte = pte_mkyoung(_dst_pte);
> >
> >     if (wp_copy)
> >         _dst_pte = huge_pte_mkuffd_wp(_dst_pte);
> >
> > Since you asked to update hugetlb_mcopy_atomic_pte(), I can offer three
> > options:
> >
> > 1. Do not set dirty if (wp_copy).
> > 2. Do not set dirty if (wp_copy || !write_hint)
> > 3. Keep it as is.
>
> AFAICT you already go somewhere at least not (3) with non-hugetlb pages in
> current series.. because dirty bit is not always set already for them, so
> I'd say we'd make them match?  Hugetlbfs shouldn't be special in this
> aspect, IMHO.
>
> Said that, I think it doesn't really necessary need to be that complex,
> since make_huge_pte() already sets dirty bit when "writable=1", so IIUC
> what you need to do is simply make sure dirty bit set when write_hint=1.
>
> Does it sounds correct to you?

Hmm, hold on...

I failed to figure out how that write-likely hint could help us for either
huge or non-huge pages, since:

  (1) The old code always sets dirty, so there is no perf degradation
      with or without the hint.

  (2) If we want to rework the dirty bit (which I'm totally fine with..),
      then we don't set it when we shouldn't, and afaict we should set the
      D bit whenever we should...  If the user assumes this page is likely
      to be written but makes it read-only, say, with
      UFFDIO_COPY(wp_mode=1), setting the D bit will not help.  Instead,
      the user should simply use UFFDIO_COPY(wp_mode=0), and then dirty
      will be set with write=1..

The hint would be helpful, but only for UFFDIO_ZEROPAGE, because it avoids
one COW.  But that seems to be it.

In short: I'm wondering whether we really only need the ACCESS_LIKELY hint
as you proposed earlier.  We may want UFFDIO_ZEROPAGE_MODE_ALLOCATE
separately, but keep that only for the zeropage op (and it shouldn't really
be called WRITE_LIKELY)?

Or did I miss something?

-- 
Peter Xu
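
P.S. To make the comparison concrete, the three options quoted above can be written as a tiny user-space model, together with a check of when a write hint can change the dirty decision at all.  This is only a sketch and not kernel code: enum dirty_policy, model_set_dirty(), hint_changes_outcome() and check_model() are hypothetical names made up for illustration; only wp_copy and write_hint mirror the names used in the thread.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the three options from the thread (not kernel code). */
enum dirty_policy {
    DIRTY_UNLESS_WP,            /* option 1: skip dirty if wp_copy */
    DIRTY_UNLESS_WP_OR_NO_HINT, /* option 2: skip dirty if wp_copy || !write_hint */
    DIRTY_ALWAYS,               /* option 3: keep current behavior */
};

/* Returns true if the model would set the dirty bit on the new PTE. */
bool model_set_dirty(enum dirty_policy policy, bool wp_copy, bool write_hint)
{
    switch (policy) {
    case DIRTY_UNLESS_WP:
        return !wp_copy;
    case DIRTY_UNLESS_WP_OR_NO_HINT:
        return !(wp_copy || !write_hint);
    case DIRTY_ALWAYS:
        return true;
    }
    return true;
}

/* Returns true if passing the write hint changes the dirty decision. */
bool hint_changes_outcome(enum dirty_policy policy, bool wp_copy)
{
    return model_set_dirty(policy, wp_copy, true) !=
           model_set_dirty(policy, wp_copy, false);
}

/* Walks the cases discussed above; aborts if the model disagrees. */
void check_model(void)
{
    /* Options 1 and 3 never look at the hint. */
    assert(!hint_changes_outcome(DIRTY_UNLESS_WP, false));
    assert(!hint_changes_outcome(DIRTY_UNLESS_WP, true));
    assert(!hint_changes_outcome(DIRTY_ALWAYS, false));
    assert(!hint_changes_outcome(DIRTY_ALWAYS, true));

    /* Option 2: the hint only matters for a non-write-protected copy. */
    assert(hint_changes_outcome(DIRTY_UNLESS_WP_OR_NO_HINT, false));
    assert(!hint_changes_outcome(DIRTY_UNLESS_WP_OR_NO_HINT, true));
}
```

In this model the hint affects the result only under option 2 with wp_copy=0, which is the case where the page is writable anyway.  Whether the real code paths behave the same way would need to be checked against the series itself.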