From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 379D8C43334 for ; Wed, 20 Jul 2022 20:56:42 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 571106B0072; Wed, 20 Jul 2022 16:56:41 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 5203A6B0073; Wed, 20 Jul 2022 16:56:41 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 40F5E6B0074; Wed, 20 Jul 2022 16:56:41 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0012.hostedemail.com [216.40.44.12]) by kanga.kvack.org (Postfix) with ESMTP id 3072C6B0072 for ; Wed, 20 Jul 2022 16:56:41 -0400 (EDT) Received: from smtpin28.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay10.hostedemail.com (Postfix) with ESMTP id DF8F9C073B for ; Wed, 20 Jul 2022 20:56:40 +0000 (UTC) X-FDA: 79708686960.28.69476B5 Received: from mail-pl1-f175.google.com (mail-pl1-f175.google.com [209.85.214.175]) by imf15.hostedemail.com (Postfix) with ESMTP id 558F2A0083 for ; Wed, 20 Jul 2022 20:56:40 +0000 (UTC) Received: by mail-pl1-f175.google.com with SMTP id w7so16077275ply.12 for ; Wed, 20 Jul 2022 13:56:40 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=mime-version:subject:from:in-reply-to:date:cc :content-transfer-encoding:message-id:references:to; bh=mHaGFbnLm/HHSB9m9KzlAPd69oojkOy770c7wzvC1Fw=; b=P9USMlKdzsb6njhud8gzcwJukaFuFSGmGAYxzQU/GCU008LOZ63VA9e0MkfMPtgKec qeaK4ORAkYC0bKMqWCKiKS3Nm/PkyrAEkClB0AEDZbX1M/QWBmxl7yGfsfgMvQtzIbEr /a8XKGlQp5F3LZzA6DZYH6i8umf/xfdWPFgOVTIjliC6AiAdx2bFnMdE+T6e1U86tk41 OKrqhQTjAujSOe5ccqPuga+Nfn65vzYXhf3n6+R6y8Lki1+aKx6atpeGNulYldt0NDv0 BndVaLXlEe0s2UHHf6qOsoaOVQZscy8FavFcA6Bt+djdXmcCdfej6hsYvGT6Ck6NdlrH Ixhg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:mime-version:subject:from:in-reply-to:date:cc :content-transfer-encoding:message-id:references:to; bh=mHaGFbnLm/HHSB9m9KzlAPd69oojkOy770c7wzvC1Fw=; b=TX25Sb3LG8SUaYYQaSx0uPTveR2FgZecu8U3qePtGV6tVAb0L8laeK+bH4Dim2Tilg M8SfSq2RAlMdtvSupscfm/KS+DUFN3/uAV+WWFKuVej7+gYUO7eZ3hfmfgjnrAakrb+a D1vMaZeFfK4/rQLtVSyvFStSv+r2F2985DsPotgVwBeGrDVc29ZbMijehvq2mdK0EE7f 7Zvc5yWQ9rbgk5ZRQNIvyy3KZ3bMCc0oBTEiJQpjmwWm3DnYqXnn6g1XZ2Mb85ERrFiN ydFo/cDZWJcfRLCj3vP/VaKYNMm2yd8Ex3jXsCW/lWhkciReM4cJ2kdzCVFQlmHjXEpi Pobg== X-Gm-Message-State: AJIora+zkSlfI4aB3Td+XJdP8UiZnQfX2TwsB2K6HFTH1COScJEMqa0H MH2IMlMDYGSfd80m64VfHpg= X-Google-Smtp-Source: AGRyM1s4n0/ZegBQV/OS265a+KNNVpjLj6yr1o0p2xY9RssykRbBmwqz6QKEv5S208C6V/fggJ6pqg== X-Received: by 2002:a17:90a:e68c:b0:1f2:124d:1143 with SMTP id s12-20020a17090ae68c00b001f2124d1143mr7404851pjy.22.1658350598955; Wed, 20 Jul 2022 13:56:38 -0700 (PDT) Received: from smtpclient.apple ([66.170.99.113]) by smtp.gmail.com with ESMTPSA id b15-20020a170902d50f00b0016bc947c5b7sm18458plg.38.2022.07.20.13.56.37 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128); Wed, 20 Jul 2022 13:56:38 -0700 (PDT) Content-Type: text/plain; charset=utf-8 Mime-Version: 1.0 (Mac OS X Mail 16.0 \(3696.100.31\)) Subject: Re: [RFC PATCH 01/14] userfaultfd: set dirty and young on writeprotect From: Nadav Amit In-Reply-To: <468a7114-7541-0d5e-c1fc-083bbb95e78d@redhat.com> Date: Wed, 20 Jul 2022 13:56:35 -0700 Cc: Peter Xu , Linux MM , LKML , Andrew Morton , Mike Rapoport , Axel Rasmussen , Andrea Arcangeli , Andrew Cooper , Andy Lutomirski , Dave Hansen , Peter Zijlstra , Thomas Gleixner , Will Deacon , Yu Zhao , Nick Piggin Content-Transfer-Encoding: quoted-printable Message-Id: References: <20220718120212.3180-1-namit@vmware.com> <20220718120212.3180-2-namit@vmware.com> <017facf0-7ef8-3faf-138d-3013a20b37db@redhat.com> <2b4393ce-95c9-dd3e-8495-058a139e771e@redhat.com> <69022bad-d6f1-d830-224d-eb8e5c90d5c7@redhat.com> <4ad140b5-1d5b-2486-0893-7886a9cdfd76@redhat.com> <95320077-52CF-4CB0-92F9-523E1AE74A3D@gmail.com> <468a7114-7541-0d5e-c1fc-083bbb95e78d@redhat.com> To: David Hildenbrand X-Mailer: Apple Mail (2.3696.100.31) ARC-Authentication-Results: i=1; imf15.hostedemail.com; dkim=pass header.d=gmail.com header.s=20210112 header.b=P9USMlKd; dmarc=pass (policy=none) header.from=gmail.com; spf=pass (imf15.hostedemail.com: domain of nadav.amit@gmail.com designates 209.85.214.175 as permitted sender) smtp.mailfrom=nadav.amit@gmail.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1658350600; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=mHaGFbnLm/HHSB9m9KzlAPd69oojkOy770c7wzvC1Fw=; b=5bRvkSNTYWAhcssKSiizrv/0lWZTUEKDabVqyCD3CtRWkXzEbLYoI4OtGgNn/n9Xd1VUsr KLBY2IMJlmg0GC5yVX185F4F94neepgtCrKX4LYnnQhZwVexxTX/R+bMb2Y7DRUB/a0ABg PWMKgmqLXLUmvyJNGaybcLl9t4mh0Qk= ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1658350600; a=rsa-sha256; cv=none; b=ftzd310gXqGOclznQeWzgV4oTA1jAk96vQPfvfLWx7ppLJy6pUm5SOsq6Ymkj8Odcvxns7 T9uQHOSQGhUxvo2b02nWOl/jHi5qnLmARrNQYTe2dzuaRY7x+UxNVuuzsSTSnANKYsddj5 b0A677v884Q+so2ITsxUl4CcK1mRYyg= X-Stat-Signature: sne3ojeot6yzsynp7x6fo4kt18yko7sx X-Rspamd-Queue-Id: 558F2A0083 X-Rspamd-Server: rspam08 Authentication-Results: imf15.hostedemail.com; dkim=pass header.d=gmail.com header.s=20210112 header.b=P9USMlKd; dmarc=pass (policy=none) header.from=gmail.com; spf=pass (imf15.hostedemail.com: domain of nadav.amit@gmail.com designates 209.85.214.175 as permitted sender) smtp.mailfrom=nadav.amit@gmail.com X-Rspam-User: X-HE-Tag: 1658350600-807270 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On Jul 20, 2022, at 1:38 PM, David Hildenbrand wrote: > On 20.07.22 22:22, Nadav Amit wrote: >> On Jul 20, 2022, at 12:55 PM, David Hildenbrand = wrote: >>=20 >>> On 20.07.22 21:48, Peter Xu wrote: >>>> On Wed, Jul 20, 2022 at 09:33:35PM +0200, David Hildenbrand wrote: >>>>> On 20.07.22 21:15, Peter Xu wrote: >>>>>> On Wed, Jul 20, 2022 at 05:10:37PM +0200, David Hildenbrand = wrote: >>>>>>> For pagecache pages it may as well be *plain wrong* to bypass = the write >>>>>>> fault handler and simply mark pages dirty+map them writable. >>>>>>=20 >>>>>> Could you elaborate? >>>>>=20 >>>>> Write-fault handling for some filesystems (that even require this >>>>> "slow path") is a bit special. >>>>>=20 >>>>> For example, do_shared_fault() might have to call page_mkwrite(). >>>>>=20 >>>>> AFAIK file systems use that for lazy allocation of disk blocks. >>>>> If you simply go ahead and map a !dirty pagecache page writable >>>>> and mark it dirty, it will not trigger page_mkwrite() and you = might >>>>> end up corrupting data. >>>>>=20 >>>>> That's why we the old change_pte_range() code never touched >>>>> anything if the pte wasn't already dirty. >>>>=20 >>>> I don't think that pte_dirty() check was for the pagecache code. = For any fs >>>> that has page_mkwrite() defined, it'll already have = vma_wants_writenotify() >>>> return 1, so we'll never try to add write bit, hence we'll never = even try >>>> to check pte_dirty(). >>>=20 >>> I might be too tired, but the whole reason we had this magic before = my >>> commit in place was only for the pagecache. >>>=20 >>> With vma_wants_writenotify()=3D0 you can directly map the pages = writable >>> and don't have to do these advanced checks here. In a writable >>> MAP_SHARED VMA you'll already have pte_write(). >>>=20 >>> We only get !pte_write() in case we have vma_wants_writenotify()=3D1 = ... >>>=20 >>> try_change_writable =3D vma_wants_writenotify(vma, = vma->vm_page_prot); >>>=20 >>> and that's the code that checked the dirty bit after all to decide = -- >>> amongst other things -- if we can simply map it writable without = going >>> via the write fault handler and triggering do_shared_fault() . >>>=20 >>> See crazy/ugly FOLL_FORCE code in GUP that similarly checks the = dirty bit. >>=20 >> I thought you want to get rid of it at least for anonymous pages. No? >=20 > Yes. Especially for any MAP_PRIVATE mappings. >=20 > If you want to write to something that's not mapped writable in a > MAP_PRIVATE mapping it > a) Has to be an exclusive anonymous page > b) The pte has to be dirty Do you need both conditions to be true? I thought (a) is sufficient (if the soft-dirty and related checks succeed). >=20 > In any other case, you clearly missed to COW or the modifications = might > get lost if the PTE is not dirty. >=20 > MAP_SHARED is a bit more involved. But whether the pte is dirty might = be > good enough ... but this needs a lot more care. >=20 >>> But yeah, it's all confusing so I might just be wrong regarding >>> pagecache pages. >>=20 >> Just to note: I am not very courageous and I did not intend to change >> condition for when non-anonymous pages are set as writable. That=E2=80=99= s the >> reason I did not change the dirty for non-writable non-anonymous = entries (as >> Peter said). And that=E2=80=99s the reason that setting the dirty bit = (at least as I >> should have done it) is only performed after we made the decision on = the >> write-bit. >=20 > Good. As long as we stick to anonymous pages I roughly know what we we > can and cannot do at this point :) >=20 >=20 > The problem I see is that detection whether we can write requires the > dirty bit ... and whether to set the dirty bit requires the = information > whether we can write. >=20 > Again, for anonymous pages we should be able to relax the "dirty" > requirement when detecting whether we can write. That=E2=80=99s all I wanted to do there. >=20 >> IOW, after you made your decision about the write-bit, then and only = then >> you may be able to set the dirty bit for writable entries. Since the = entry >> is already writeable (i.e., can be written without a fault later = directly >> from userspace), there should be no concern of correctness when you = set it. >=20 > That makes sense to me. What keeps confusing me are architectures with > and without a hw-managed dirty bit ... :) I don=E2=80=99t know which arch you have in your mind. But the moment a = PTE is writable, then marking it logically/architecturally as dirty should be fine. But=E2=80=A6 if the Exclusive check is not good enough for private+anon = without the =E2=80=9Clogical=E2=80=9D dirty bit, then there would be a problem.=20=