Subject: Re: [RFC PATCH v2 3/5] userfaultfd: introduce write-likely mode for copy/wp operations
From: Nadav Amit <nadav.amit@gmail.com>
To: Peter Xu
Cc: Linux MM <linux-mm@kvack.org>, Mike Kravetz, Hugh Dickins, Andrew Morton,
    Axel Rasmussen, David Hildenbrand, Mike Rapoport
Date: Tue, 21 Jun 2022 10:14:00 -0700
References: <20220619233449.181323-1-namit@vmware.com> <20220619233449.181323-4-namit@vmware.com>

On Jun 21, 2022, at 9:38 AM, Peter Xu wrote:

> Hi, Nadav,
>
> On Sun, Jun 19, 2022 at 04:34:47PM -0700, Nadav Amit wrote:
>> From: Nadav Amit
>>
>> Commit 9ae0f87d009ca ("mm/shmem: unconditionally set pte dirty in
>> mfill_atomic_install_pte") has set PTEs as dirty as its title indicates.
>> However, setting read-only PTEs as dirty can have several undesired
>> implications.
>>
>> First, setting read-only PTEs as dirty can cause these PTEs to become
>> writable during the mprotect() syscall. See change_pte_range():
>>
>>	/* Avoid taking write faults for known dirty pages */
>>	if (dirty_accountable && pte_dirty(ptent) &&
>>			(pte_soft_dirty(ptent) ||
>>			 !(vma->vm_flags & VM_SOFTDIRTY))) {
>>		ptent = pte_mkwrite(ptent);
>>	}
>
> IMHO this is not really the direct reason to add the feature to "allow the
> user to specify whether the dirty bit is set for the UFFDIO_COPY/... ioctl",
> because IIUC what's missing is the pte_uffd_wp() check that I should have
> added in the shmem+hugetlb uffd-wp series in change_pte_range() but I
> missed..
>
> But since this is fixed by David's patch to optimize mprotect() altogether,
> which checks pte_uffd_wp() (and afaict that's only needed after the
> shmem+hugetlb patchset, not before), I think we're safe now with all the
> branches.
>
> So IMHO we don't need to mention this as it's kind of misleading. It'll be
> welcomed if you want to recover the pte_dirty behavior in
> mfill_atomic_install_pte(), but probably this is not the right patch for it?

Sorry, I will remove it from the commit log.

I am not sure whether there should be an additional patch to revert (or
partially revert) 9ae0f87d009ca and to be backported. What do you say?

>> Second, unmapping read-only dirty PTEs often prevents TLB flush batching.
>> See try_to_unmap_one():
>>
>>	/*
>>	 * Page is dirty. Flush the TLB if a writable entry
>>	 * potentially exists to avoid CPU writes after IO
>>	 * starts and then write it out here.
>>	 */
>>	try_to_unmap_flush_dirty();
>>
>> Similarly, batching of TLB flushes might be prevented in zap_pte_range():
>>
>>	if (!PageAnon(page)) {
>>		if (pte_dirty(ptent)) {
>>			force_flush = 1;
>>			set_page_dirty(page);
>>		}
>>		...
>
> I still keep the pure question here (which I asked in the private reply to
> you) on why we'd like only pte_dirty() but not pte_write() && pte_dirty()
> here. I'll rewrite what I have in the private email to you here again, which
> I think should be the right thing to do here:
>
>	if (!PageAnon(page)) {
>		if (pte_dirty(ptent)) {
>			set_page_dirty(page);
>			if (pte_write(ptent))
>				force_flush = 1;
>		}
>	}

The "Second" part regards a perf issue, not a correctness issue. As I told
you offline, I think it makes sense to look at both pte_dirty() and
pte_write(), but I was afraid of architectures other than x86, which is the
one I know.

After our discussion, I looked at the other architectures, and it does look
safe. So I will add an additional patch for that (with your Suggested-by).
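Just so we are looking at the same thing, the follow-up patch I have in mind
would change zap_pte_range() roughly along the lines of your snippet above,
i.e. only force an immediate TLB flush for dirty entries that are also
writable (untested sketch, variable names as in the current code):

	if (!PageAnon(page)) {
		/*
		 * The dirty bit is still transferred to the page, but a
		 * read-only mapping cannot have CPU writes racing with
		 * the unmap, so only a writable+dirty entry needs to
		 * force the flush before IO starts.
		 */
		if (pte_dirty(ptent)) {
			set_page_dirty(page);
			if (pte_write(ptent))
				force_flush = 1;
		}
	}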
> I also mentioned the other example: doing mprotect(PROT_READ) upon a dirty
> pte can also create a pte with dirty=1 and write=0, which should be the
> same condition we have here with uffd, afaict. So it's at least not a solo
> problem for uffd, and I still think if we treat this tlb flush issue as a
> perf bug we should consider fixing up the tlb flush code instead, because
> otherwise the "mprotect(PROT_READ) upon dirty pte" case will also be able
> to hit it.
>
> Meanwhile, I'll have the same comment that this is not helping any reviewer
> to understand why we need to grant the user the ability to conditionally
> set the dirty bit from uABI, so I think it's better to drop this paragraph
> too for this patch alone.

Ok. Point taken.

>> In general, setting a PTE as dirty for read-only entries might be
>> dangerous. It should be reminded that the dirty-COW vulnerability
>> mitigation also relies on the dirty bit being set only after COW
>> (although it does not appear to apply to userfaultfd).
>>
>> To summarize, setting the dirty bit for read-only PTEs is dangerous. But
>> even if we only consider writable pages, always setting the dirty bit or
>> always leaving it clear does not seem to be the best policy. Leaving the
>> bit clear introduces overhead on the first write-access to set the bit.
>> Setting the bit for pages that are eventually not written to can require
>> more TLB flushes.
>
> And IMHO only this paragraph is the real and proper reasoning for this
> patch..

Will do.

>> @@ -1902,10 +1908,10 @@ static int userfaultfd_continue(struct userfaultfd_ctx *ctx, unsigned long arg)
>>			     uffdio_continue.range.start) {
>>		goto out;
>>	}
>> -	if (uffdio_continue.mode & ~UFFDIO_CONTINUE_MODE_DONTWAKE)
>> +	if (uffdio_continue.mode & ~(UFFDIO_CONTINUE_MODE_DONTWAKE))
>>		goto out;
>>
>> -	uffd_flags = UFFD_FLAGS_ACCESS_LIKELY;
>> +	uffd_flags = UFFD_FLAGS_ACCESS_LIKELY | UFFD_FLAGS_WRITE_LIKELY;
>
> Setting dirty by default for CONTINUE may make some sense (unlike young),
> at least to keep the old behavior of the code where the pte was
> unconditionally set.
>
> But there's another thought: a real CONTINUE user modifies the page cache
> elsewhere, so logically PageDirty should be set elsewhere already anyway,
> in most cases when the userapp is updating the page cache with another
> mapping before installing pgtables for this mapping. Then this dirty bit
> is not required, it seems.
>
> If we're going to export this to uABI, I'm wondering whether it'll be nicer
> to not apply either the young or the dirty bit for CONTINUE, because
> fundamentally losing the dirty bit doesn't sound risky, meanwhile the user
> app should have the best knowledge of what to do (whether the page was
> requested or it's just being pre-faulted in the background).
>
> Axel may have some better thoughts.

I think that perhaps I did not communicate well enough the reason it makes
sense to set the dirty-bit.

Setting the access-bit and dirty-bit takes quite some time. So regardless of
whether you set PageDirty, if you did not set the access-bit and dirty-bit
and you later access the page - you are going to pay ~550 cycles for
nothing. For reference:

  https://marc.info/?l=linux-kernel&m=146582237922378&w=2

If you want me to allow userspace to provide a hint, let me know.

>> @@ -85,13 +84,18 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
>>
>>	if (writable)
>>		_dst_pte = pte_mkwrite(_dst_pte);
>> -	else
>> +	else {
>>		/*
>>		 * We need this to make sure write bit removed; as mk_pte()
>>		 * could return a pte with write bit set.
>>		 */
>>		_dst_pte = pte_wrprotect(_dst_pte);
>>
>> +		/* Marking RO entries as dirty can mess with other code */
>> +		if (uffd_flags & UFFD_FLAGS_WRITE_LIKELY)
>> +			_dst_pte = pte_mkdirty(_dst_pte);
>
> Hmm.. what about "writable=true" ones?

Oh boy, I reverted the condition… :(

Thanks for noticing. I'll fix it.
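For v3 I intend the hunk to look roughly like the following (untested
sketch of the corrected logic, using the same UFFD_FLAGS_WRITE_LIKELY flag
as above), so the dirty bit is only set for writable entries, and only when
the caller hinted that a write is likely:

	if (writable) {
		_dst_pte = pte_mkwrite(_dst_pte);

		/*
		 * Only mark writable entries as dirty, and only when the
		 * caller hinted that a write is likely; read-only entries
		 * stay clean so they cannot become writable via mprotect()
		 * or force extra TLB flushes (see discussion above).
		 */
		if (uffd_flags & UFFD_FLAGS_WRITE_LIKELY)
			_dst_pte = pte_mkdirty(_dst_pte);
	} else {
		/*
		 * We need this to make sure the write bit is removed; as
		 * mk_pte() could return a pte with the write bit set.
		 */
		_dst_pte = pte_wrprotect(_dst_pte);
	}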