From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 5DF4FC433EF for ; Tue, 21 Jun 2022 18:31:04 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 824978E0003; Tue, 21 Jun 2022 14:31:03 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 7D40B6B008C; Tue, 21 Jun 2022 14:31:03 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 64E758E0003; Tue, 21 Jun 2022 14:31:03 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0010.hostedemail.com [216.40.44.10]) by kanga.kvack.org (Postfix) with ESMTP id 4EABB6B008A for ; Tue, 21 Jun 2022 14:31:03 -0400 (EDT) Received: from smtpin03.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay07.hostedemail.com (Postfix) with ESMTP id 19502212ED for ; Tue, 21 Jun 2022 18:31:03 +0000 (UTC) X-FDA: 79603084806.03.8EB4410 Received: from mail-pf1-f174.google.com (mail-pf1-f174.google.com [209.85.210.174]) by imf15.hostedemail.com (Postfix) with ESMTP id 43EE0A00B1 for ; Tue, 21 Jun 2022 18:30:59 +0000 (UTC) Received: by mail-pf1-f174.google.com with SMTP id n12so6916855pfq.0 for ; Tue, 21 Jun 2022 11:30:59 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=mime-version:subject:from:in-reply-to:date:cc :content-transfer-encoding:message-id:references:to; bh=Onhx09YN+NherA9QusBR4MRt86+aKP3EWz/So4j65UU=; b=FW9v0ngrmFlSqfBlhouIDX0y0t0xtsLy7EDQjjbJ3XC1jOcBA6aEYSzagsU3sa5WM+ Hpd4XGqAI/oDbmiRTg+P5lHzrYVvw9WfzWRJWCzL5S2RwSril9KoMGHVk5vuS6SG5yqo pX/9xLveeC97NnKfO6ZLPSKaJ40KuBGvgQmyvx+YDPk7pDcliegQv1KvTs1K2MBBgDBG 9IUY1HPb9g0gTDrUHDUL4c+4lWS6Oq3VWjfOUziOovEhDkazPM6V3vpUvL6iQq5h/SG6 dBrrtn3YFDmvY0eFPXi80F18q13KUq9BVH1+dAkb8Z3//zFoxuw+UrLzGSRc6sEOEmap S07g== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:mime-version:subject:from:in-reply-to:date:cc :content-transfer-encoding:message-id:references:to; bh=Onhx09YN+NherA9QusBR4MRt86+aKP3EWz/So4j65UU=; b=2qFo6A2hNdle4S4gCNuzVtriACsQfdaGEchcKCCZ7cDgee38kSsNmKFXBJIKcL0ybk cPW+4ExxE9cyZijG+xfWRfifrVcQqPG54O25IZ98fPu8VV/z6KArugxbbLWrMUc4zvRm ShH8sLt+9Uflqc7mP6hy/FphgiW7NBi2Ac9yrXkxUCPBqxJa4vM07w/P0G/nw1p6WT/0 7ml+rDL64lqYxs3PU9tB6QLCZwgxEjqOl1JMtvKyt9dEDyCfXVSWUeZfOnT3q3Pihr78 OPLLLtJS3tQBxwyx15ySUaQqltmEHFgziZWFRDRF7lAvBF5Q7UsNsfMpn0MXa5OGfGuB g73g== X-Gm-Message-State: AJIora/5Sqh61upLvckoF9N1cZfG8BprgwOntfknfa5Hdayt/YQ52lvg oLAnHSXOXNKPgo77rAmwY5w= X-Google-Smtp-Source: AGRyM1s2RciqJ4InYZSgOsu97hpWg6OFRagBhTRI0G43DElo9xApYEFFoWZG4II0VYwkGNZjNFJFQQ== X-Received: by 2002:a05:6a00:1992:b0:51e:6d8c:3b0 with SMTP id d18-20020a056a00199200b0051e6d8c03b0mr31700668pfl.26.1655836257888; Tue, 21 Jun 2022 11:30:57 -0700 (PDT) Received: from smtpclient.apple (c-24-6-216-183.hsd1.ca.comcast.net. [24.6.216.183]) by smtp.gmail.com with ESMTPSA id b19-20020a62a113000000b0051b9ac5a377sm11736482pff.213.2022.06.21.11.30.56 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128); Tue, 21 Jun 2022 11:30:57 -0700 (PDT) Content-Type: text/plain; charset=utf-8 Mime-Version: 1.0 (Mac OS X Mail 16.0 \(3696.100.31\)) Subject: Re: [RFC PATCH v2 3/5] userfaultfd: introduce write-likely mode for copy/wp operations From: Nadav Amit In-Reply-To: Date: Tue, 21 Jun 2022 11:30:52 -0700 Cc: Linux MM , Mike Kravetz , Hugh Dickins , Andrew Morton , Axel Rasmussen , David Hildenbrand , Mike Rapoport Content-Transfer-Encoding: quoted-printable Message-Id: <58E53A1D-466C-4A2B-8E10-2993EFF60856@gmail.com> References: <20220619233449.181323-1-namit@vmware.com> <20220619233449.181323-4-namit@vmware.com> To: Peter Xu X-Mailer: Apple Mail (2.3696.100.31) ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1655836259; a=rsa-sha256; cv=none; b=dKFNvbni2DbrGofk7Hrwtx13VSnGQwQOIs5ekC7JPMwC/F2YBYzWAjAQivxWpGJHUnrjxT iJZyChvekoUh1XpwvSuxc4vSNn58lLlwW1tEEE/Lx2XZDimzkXH7w1vXtX71Ef0xqZcsig W1CNNAxEwI711Z6dGO6Ft9a2SDRX6oM= ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1655836259; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=Onhx09YN+NherA9QusBR4MRt86+aKP3EWz/So4j65UU=; b=SiO670nynFy0dRqvueY8RQom0BmTg1El0Wi9996e+8mymXGfLAuP5ZD0NNJ2NWJEZRc6UP F/vytwuIKaFahgUazitbKGA4GNc1mIKSbzJqxT0sDT7cjkLCpHB8N5ePU/7O/RSXNoYNun +zxviBn+WpIkH0tGOYed5HWBV6atIGU= ARC-Authentication-Results: i=1; imf15.hostedemail.com; dkim=pass header.d=gmail.com header.s=20210112 header.b=FW9v0ngr; dmarc=pass (policy=none) header.from=gmail.com; spf=pass (imf15.hostedemail.com: domain of nadav.amit@gmail.com designates 209.85.210.174 as permitted sender) smtp.mailfrom=nadav.amit@gmail.com Authentication-Results: imf15.hostedemail.com; dkim=pass header.d=gmail.com header.s=20210112 header.b=FW9v0ngr; dmarc=pass (policy=none) header.from=gmail.com; spf=pass (imf15.hostedemail.com: domain of nadav.amit@gmail.com designates 209.85.210.174 as permitted sender) smtp.mailfrom=nadav.amit@gmail.com X-Rspam-User: X-Rspamd-Server: rspam09 X-Rspamd-Queue-Id: 43EE0A00B1 X-Stat-Signature: 4hs6pjhrhoxcah8eciiexocyfach5gs3 X-HE-Tag: 1655836259-647017 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On Jun 21, 2022, at 11:10 AM, Peter Xu wrote: > On Tue, Jun 21, 2022 at 10:14:00AM -0700, Nadav Amit wrote: >> On Jun 21, 2022, at 9:38 AM, Peter Xu wrote: >>=20 >>> Hi, Nadav, >>>=20 >>> On Sun, Jun 19, 2022 at 04:34:47PM -0700, Nadav Amit wrote: >>>> From: Nadav Amit >>>>=20 >>>> Commit 9ae0f87d009ca ("mm/shmem: unconditionally set pte dirty in >>>> mfill_atomic_install_pte") has set PTEs as dirty as its title = indicates. >>>> However, setting read-only PTEs as dirty can have several undesired >>>> implications. >>>>=20 >>>> First, setting read-only PTEs as dirty, can cause these PTEs to = become >>>> writable during mprotect() syscall. See in change_pte_range(): >>>>=20 >>>> /* Avoid taking write faults for known dirty pages */ >>>> if (dirty_accountable && pte_dirty(ptent) && >>>> (pte_soft_dirty(ptent) || >>>> !(vma->vm_flags & VM_SOFTDIRTY))) { >>>> ptent =3D pte_mkwrite(ptent); >>>> } >>>=20 >>> IMHO this is not really the direct reason to add the feature to = "allow user >>> to specify whether dirty bit be set for UFFDIO_COPY/... ioctl", = because >>> IIUC what's missing is the pte_uffd_wp() check that I should have = added in >>> the shmem+hugetlb uffd-wp series in change_pte_range() but I = missed.. >>>=20 >>> But since this is fixed by David's patch to optimize mprotect() = altogether >>> which checks pte_uffd_wp() (and afaict that's only needed after the >>> shmem+hugetlb patchset, not before), so I think we're safe now with = all the >>> branches. >>>=20 >>> So IMHO we don't need to mention this as it's kind of misleading. = It'll be >>> welcomed if you want to recover the pte_dirty behavior in >>> mfill_atomic_install_pte() but probably this is not the right patch = for it? >>=20 >> Sorry, I will remove it from the commit log. >>=20 >> I am not sure whether there should be an additional patch to revert = (or >> partially revert) 9ae0f87d009ca and to be backported. What do you = say? >=20 > I think we discussed it offlist, I'd not bother but I don't have a = strong > opinion. Please feel free to go with your preference. >=20 > Just to mention that it might not be a clean revert since at that time = we > don't have continue and uffd-wp on files, so we may need to add the > complete check back if we're going to make it. Please also do proper = swap > tests for both anon and shmem with things like memcg, and when you = posted > the patches I can do some tests too. That=E2=80=99s what I was afraid of. It moves it down in my priority = list... >>>> Second, unmapping read-only dirty PTEs often prevents TLB flush = batching. >>>> See try_to_unmap_one(): >>>>=20 >>>> /* >>>> * Page is dirty. Flush the TLB if a writable entry >>>> * potentially exists to avoid CPU writes after IO >>>> * starts and then write it out here. >>>> */ >>>> try_to_unmap_flush_dirty(); >>>>=20 >>>> Similarly batching TLB flushed might be prevented in = zap_pte_range(): >>>>=20 >>>> if (!PageAnon(page)) { >>>> if (pte_dirty(ptent)) { >>>> force_flush =3D 1; >>>> set_page_dirty(page); >>>> } >>>> ... >>>=20 >>> I still keep the pure question here (which I asked in the private = reply to >>> you) on why we'd like only pte_dirty() but not pte_write() && = pte_dirty() >>> here. I'll rewrite what I have in the private email to you here = again that >>> I think should be the write thing to do here.. >>>=20 >>> if (!PageAnon(page)) { >>> if (pte_dirty(ptent)) { >>> set_page_dirty(page); >>> if (pte_write(ptent)) >>> force_flush =3D 1; >>> } >>> } >>=20 >> The =E2=80=9CSecond=E2=80=9D part regards a perf issue, not a = correctness issue. As I told >> you offline, I think if makes sense to look both on pte_dirty() and >> pte_write(), but I am afraid of other architectures than x86, which I = know. >=20 > I don't really see why this logic is arch-specific. I meant, as long = as > pte_write() returns a value that means "this page is writable", I = still > think it's making sense. >=20 >> After our discussion, I looked at other architectures, and it does = look >> safe. So I will add an additional patch for that (with your = suggested-by). >=20 > Only if you agree with what I thought, thanks. And that could be a > separate patch for sure out of this one even if to come. I do agree. I did not quite understand your second sentence. I guess you meant not to combine it into this patchset (or at least that=E2=80=99s = what I thought and I attribute it to you). >>=20 >>=20 >> I think that perhaps I did not communicate well enough the reason it = makes >> sense to set the dirty-bit. >>=20 >> Setting the access-bit and dirty-bit takes quite some time. So = regardless of >> whether you set the PageDirty, if you didn=E2=80=99t see access-bit = and dirty-bit >> and later you access the page - you are going to pay ~550 cycles for >> nothing. >>=20 >> For reference: = https://marc.info/?l=3Dlinux-kernel&m=3D146582237922378&w=3D2 >>=20 >> If you want me to allow userspace to provide a hint, let me know. >=20 > You did communicate well, so probably it's me that didn't. :) >=20 > Yes I think it'll be nicer to allow userspace provide the same hint = for > UFFDIO_CONTINUE, the reason as I explained in the other thread on the = young > bit (or say ACCESS_LIKELY), that UFFDIO_CONTINUE can (at least in my > current qemu impl [1]) proactively be used to install pgtables even if > they're code pages and they may not be accessed in the near future. So > IMHO we should treat UFFDIO_CONTINUE the same as UFFDIO_COPY, IMHO, = from > this point of view. >=20 > There's just still some differences on young/dirty here so I'm not = sure > whether the idea applies to both, but I think at least it applies to = young > bit (ACCESS_LIKELY). >=20 > [1] = https://github.com/xzpeter/qemu/commit/b7758ad55af42b5364796362e96f4a06b51= d9582 We have a saying in Hebrew: =E2=80=9CWhen you explain slowly, I = understand quickly=E2=80=9D. Thanks for explaining. I will add hints for UFFDIO_CONTINUE as well.