Date: Tue, 28 Jun 2022 12:55:07 +0200
Subject: Re: [PATCH v1 2/5] userfaultfd: introduce access-likely mode for common operations
To: Nadav Amit, Peter Xu
Cc: Linux MM, Mike Kravetz, Hugh Dickins, Andrew Morton, Axel Rasmussen, Mike Rapoport, Dave Hansen
References: <20220622185038.71740-3-namit@vmware.com> <18BCC23E-B344-41A8-926D-A49D768485AF@vmware.com> <6EF7D3B4-CF17-407B-A50F-B14D595E99A5@vmware.com> <07B65135-CA6D-4839-BAC0-6D63A94F50C2@vmware.com>
From: David Hildenbrand
Organization: Red Hat

On 28.06.22 01:37, Nadav Amit wrote:
> [ +Dave Hansen to say how wrong I am ]
>
>> On Jun 27, 2022, at 6:12 AM, Peter Xu wrote:
>>
>> On Sat, Jun 25, 2022 at 07:49:54AM +0000, Nadav Amit wrote:
>>>
>>>
>>>> On Jun 24, 2022, at 3:17 PM, Peter Xu wrote:
>>>>
>>>> On Fri, Jun 24, 2022 at 05:58:17PM -0400, Peter Xu wrote:
>>>>> [Sorry for replying late]
>>>>>
>>>>> Said that, I think it doesn't really necessarily need to be that complex,
>>>>> since make_huge_pte() already sets dirty bit when "writable=1", so IIUC
>>>>> what you need to do is simply make sure dirty bit set when write_hint=1.
>>>>>
>>>>> Does it sound correct to you?
>>>>
>>>> Hmm, hold on... I failed to figure out how that write-likely hint could
>>>> help us for either huge or non-huge pages, since:
>>>>
>>>> (1) Old code always set dirty, so no perf degrade anyway with/without the
>>>>     hint
>>>>
>>>> (2) If we want to rework dirty bit (which I'm totally fine with..), then
>>>>     we don't apply it when we shouldn't, and afaict we should set D bit
>>>>     whenever we should... if the user assumes this page is likely to be
>>>>     written but made it read-only, say, with UFFDIO_COPY(wp_mode=1),
>>>>     setting D bit will not help, instead, the user should simply use an
>>>>     UFFDIO_COPY(wp_mode=0) then the dirty will be set with write=1..
>>>>
>>>> It'll be helpful but only helpful for UFFDIO_ZEROPAGE because it avoids one
>>>> COW. But that seems to be it.
>>>>
>>>> In short: I'm wondering whether we only really need the ACCESS_LIKELY hint
>>>> as you proposed earlier. We may want UFFDIO_ZEROPAGE_MODE_ALLOCATE
>>>> separately, but keep that only for zeropage op (and it shouldn't really be
>>>> called WRITE_LIKELY)? Or did I miss something?
>>>
>>> Let’s see if I get you correctly. I am not sure whether we had this
>>> discussion before.
>>>
>>> We are talking about a scenario in which WP=0. You argue that if the page
>>> is already set as dirty, what is the benefit of not setting the dirty-bit,
>>> right?
>>>
>>> So first, IIUC, there are cases in which the page would not be set as
>>> dirty, e.g., UFFDIO_CONTINUE. [ I am admittedly not too familiar with this
>>> use-case, so I say it based on the comments. ]
>>>
>>> Second, even if the page is dirty (e.g., following UFFDIO_COPY), but it
>>> is not written by the user after UFFDIO_COPY, marking the PTE as dirty
>>> when it is mapped would induce overhead, as we discussed before, since
>>> if/when the PTE is unmapped, TLB flush batching might not be possible.
>>
>> I'd hope we don't make an interface design just to service that purpose of
>> when write=0 and dirty=1 use case that is internal to the kernel so far,
>> and I still think it's the tlb flush code to change.. or do we have other
>> use case for this WRITE_LIKELY hint?
>>
>> For UFFDIO_CONTINUE, if we want to make things clear on dirty bit, then
>> IMHO for UFFDIO_CONTINUE the right place for the dirty process is where the
>> user writes to the page in the other mapping, where PageDirty() will start
>> to be true already even if the pte that is to be CONTINUEd will have dirty=0
>> in the pte entry. From that pov I still don't see why we need to grant the
>> user control of the dirty bit, whether via a hint only or explicitly.
>>
>>>
>>> So I don’t think there is a problem in having WRITE_LIKELY hint. Moreover,
>>> I would reiterate my position (which you guys convinced me of!)
>>
>> David convinced you I think :)
>>
>>> that having hints that indicate what the user does (WRITE_LIKELY) is a
>>> better API than something that indicates directly what the kernel should
>>> do (e.g., UFFDIO_ZEROPAGE_MODE_ALLOCATE).
>>
>> The hint idea sounds good to me, it's just that we actually have two steps
>> here:
>>
>> (1) We think providing user the control of dirty bit makes sense, then,
>> (2) We think the flag should be a hint not explicit "set dirty bit"
>>
>> I agree with (2) in this case if (1) is applicable. And now I think I'm
>> questioning myself on (1).
>>
>> Fundamentally, access bit has more meaningful context (0 means cold, 1
>> means hot), for dirty it's really more a perf thing to me (when clear,
>> it'll take extra cycles to set it when memory write happens to it; being
>> clear _may_ help only for the tlb flush example you mentioned but I'm not
>> fully convinced that's correct).
>
> I am not sure we understand each other. I think the benefit of not setting
> a dirty-bit when a page is not actually written is fundamental, and has
> an inherent performance benefit.
>
> When I did x86’s pte_flags_need_flush(), I was defensive, but there is a
> basic optimization that is possible to avoid a TLB flush on non-dirty
> writable PTEs.
>
> In x86, consider a situation in which you use ptep_modify_prot_start()
> to remove a PTE and load its old value using xchg. (A similar case happens
> on reclaim). Assume you want to write-protect the entry.
>
> If the PTE is non-dirty then you should be able to avoid a flush, even if
> the PTE is writable. In x86, a write and the change of the dirty-bit are
> performed both atomically. Therefore, if the dirty-bit on the old PTE was
> clear, you can avoid a TLB flush.
>
> Besides the benefit of avoiding a TLB flush, there is also the benefit
> of having more precise dirty tracking. You assume UFFDIO_CONTINUE will be
> preceded by memory write to the shared memory, but that does not have to
> be the case. Similarly, if in the future userfaultfd would also support
> memory-backed private mappings, that does not have to be the case either.
>
> Putting all of the above aside, there is a bug in my code, but this
> bug also points to why dirty should not be set unconditionally. If someone
> uses SOFT_DIRTY with userfaultfd, then marking the PTE as dirty (and
> soft-dirty) might be misleading, causing unnecessary userspace writeback
> of memory.
>
> So I do need to fix my code so it would not write-unprotect memory if
> soft-dirty is enabled and UFFD_FLAGS_WRITE_LIKELY is not provided. But
> I think it emphasizes the benefit of having UFFD_FLAGS_WRITE_LIKELY.
>
>>
>> Maybe with the to-be-proposed RFC patch for tlb flush we can know whether
>> that should be something we can rely on. It'll add more dependency on this
>> work which I'm sorry to say. It's just that IMHO we should think carefully
>> for the write-hint because this is a solid new uABI we're talking about.
>>
>> The other option is we can introduce the access hint first and think more
>> on the dirty one (we can always add it when proper). What do you think?
>> Also, David please chime in anytime if I missed the whole point when you
>> proposed the idea.
>>
>>>
>>> But this discussion made me think that there are two somewhat related
>>> matters that we may want to address:
>>>
>>> 1. mwriteprotect_range() should use MM_CP_TRY_CHANGE_WRITABLE when !wp
>>>    to proactively make entries writable and save .
>>
>> I'm not sure I'm right here, but I think David's patch should have covered
>> that case? The new helper only checks pte_uffd_wp() based on my memory,
>> and when resolving page faults uffd-wp bit should have been gone, so it
>> should be treated the same as normal ptes.
>
> Let’s see if we get to the same page:
>
> mwriteprotect_range() does:
>
>     change_protection(&tlb, dst_vma, start, start + len, newprot,
>                       enable_wp ? MM_CP_UFFD_WP : MM_CP_UFFD_WP_RESOLVE)
>
> As you see no use of MM_CP_TRY_CHANGE_WRITABLE.
>
> And then change_pte_range() does:
>
>     if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
>         !pte_write(ptent) &&
>         can_change_pte_writable(vma, addr, ptent))
>             ptent = pte_mkwrite(ptent);

Right, I think in a previous version of my patch (before you guys
convinced me to introduce MM_CP_TRY_CHANGE_WRITABLE :P ) it would have
done it automatically (for private mappings).

We might have to add it to some callers now manually to not only
consider mprotect.

-- 
Thanks,

David / dhildenb
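
To make the "add it to some callers manually" point concrete, here is a small user-space model of the flag handling discussed above. It is only a sketch, not kernel code: the `TOY_CP_*` constants, `struct toy_pte`, `toy_change_pte()` and `toy_uffd_wp_flags()` are hypothetical names invented for illustration, the `may_upgrade` field merely stands in for whatever can_change_pte_writable() would decide, and the `try_writable` opt-in is an assumption about how a caller might request the upgrade.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's MM_CP_* bits; values are made up. */
enum {
    TOY_CP_UFFD_WP             = 1u << 0,
    TOY_CP_UFFD_WP_RESOLVE     = 1u << 1,
    TOY_CP_TRY_CHANGE_WRITABLE = 1u << 2,
};

/* Toy PTE: only the bits this discussion cares about. */
struct toy_pte {
    bool write;
    bool uffd_wp;
    bool may_upgrade; /* assumed result of can_change_pte_writable() */
};

/* Rough model of the per-PTE step in change_pte_range(). */
static struct toy_pte toy_change_pte(struct toy_pte pte, unsigned int cp_flags)
{
    /* Setting and resolving uffd-wp in the same call makes no sense. */
    assert(!((cp_flags & TOY_CP_UFFD_WP) && (cp_flags & TOY_CP_UFFD_WP_RESOLVE)));

    if (cp_flags & TOY_CP_UFFD_WP) {
        pte.write = false;
        pte.uffd_wp = true;
    } else if (cp_flags & TOY_CP_UFFD_WP_RESOLVE) {
        pte.uffd_wp = false;
    }

    /* Same shape as the quoted check: upgrade only when the caller asks. */
    if ((cp_flags & TOY_CP_TRY_CHANGE_WRITABLE) && !pte.write && pte.may_upgrade)
        pte.write = true;

    return pte;
}

/* Flags a caller like mwriteprotect_range() could pass (try_writable is hypothetical). */
static unsigned int toy_uffd_wp_flags(bool enable_wp, bool try_writable)
{
    unsigned int flags = enable_wp ? TOY_CP_UFFD_WP : TOY_CP_UFFD_WP_RESOLVE;

    if (!enable_wp && try_writable)
        flags |= TOY_CP_TRY_CHANGE_WRITABLE;
    return flags;
}
```

In this model, resolving uffd-wp without the extra flag clears the uffd-wp bit but leaves the entry read-only, so the next write would still fault; passing the flag lets the same pass make the entry writable when that is permitted.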