Date: Thu, 16 Feb 2023 11:29:47 -0500
From: Peter Xu
To: David Hildenbrand
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Axel Rasmussen,
    Mike Rapoport, Andrew Morton, Andrea Arcangeli, Nadav Amit,
    Muhammad Usama Anjum
Subject: Re: [PATCH] mm/uffd: UFFD_FEATURE_WP_ZEROPAGE
References: <20230215210257.224243-1-peterx@redhat.com>
    <7eb2bce9-d0b1-a0e3-8be3-f28d858a61a0@redhat.com>
In-Reply-To: <7eb2bce9-d0b1-a0e3-8be3-f28d858a61a0@redhat.com>

On Thu, Feb 16, 2023 at 11:47:23AM +0100, David Hildenbrand wrote:
> On 15.02.23 22:02, Peter Xu wrote:
> > This is a new feature that controls how uffd-wp handles zero pages (aka,
> > empty ptes), majorly for anonymous pages only.
> >
> > Note, here we used "zeropage" as a replacement of "empty pte" just to avoid
> > introducing the pte idea into uapi, since "zero page" is more well known to
> > an user app developer.
> >
> > File memories handles none ptes consistently by allowing wr-protecting of
> > none ptes because of the unawareness of page cache being exist or not.  For
> > anonymous it was not as persistent because we used to assume that we don't
> > need protections on none ptes or known zero pages.
> >
> > But it's actually not true.
> >
> > One use case was VM live snapshot, where if without wr-protecting empty
> > ptes the snapshot can contain random rubbish in the holes of the anonymous
> > memory, which can cause misbehave of the guest when the guest assumes the
> > pages should (and were) all zeros.
> >
> > QEMU worked it around by pre-populate the section with reads to fill in
> > zero page entries before starting the whole snapshot process [1].
> >
> > Recently there's another need that raised on using userfaultfd wr-protect
> > for detecting dirty pages (to replace soft-dirty) [2].
> > In that case if without being able to wr-protect zero pages by default,
> > the dirty info can get lost as long as a zero page is written, even after
> > the tracking was started.
> >
> > In general, we want to be able to wr-protect empty ptes too even for
> > anonymous.
> >
> > This patch implements UFFD_FEATURE_WP_ZEROPAGE so that it'll make uffd-wp
> > handling on zeropage being consistent no matter what the memory type is
> > underneath.  It doesn't have any impact on file memories so far because we
> > already have pte markers taking care of that.  So it only affects
> > anonymous.
> >
> > One way to implement this is to also install pte markers for anonymous
> > memories.  However here we can actually do better (than i.e. shmem) because
> > we know there's no page that is backing the pte, so the better solution is
> > to directly install a zeropage read-only pte, so that if there'll be a
> > upcoming read it'll not trigger a fault at all.  It will also reduce the
> > changeset to implement this feature too.
> >
>
> There are various reasons why I think a UFFD_FEATURE_WP_UNPOPULATED, using
> PTE markers, would be more beneficial:
>
> 1) It would be applicable to anon hugetlb

Anon hugetlb should already work on none ptes with the markers?

> 2) It would be applicable even when the zeropage is disallowed
>    (mm_forbids_zeropage())

Do you mean that s390 can disable the zeropage with mm_uses_skeys()?  So far
uffd-wp doesn't support s390 yet, so I'm not sure whether we're over-worrying
about this effect.

Or are there any other projects / ideas that could potentially extend
forbidding zero pages to more contexts?

> 3) It would be possible to optimize even without the huge zeropage, by
>    using a PMD marker.

This patch doesn't need the huge zeropage to exist.

> 4) It would be possible to optimize even on the PUD level using a PMD
>    marker.

I think 3+4 is in general an interesting idea of using pte markers at levels
higher than pte, but that needs more changes.

Firstly, continuing to use pte markers is in a way preallocating the
pgtables, so a side effect could be speeding up future faults, because
they'll all be split across pmd locks and reads don't need to fault at all,
only writes.

Imagine you hit a page fault on a pmd marker: it means you'll need to spread
that "marker" information to the child ptes, and you must.  In some way it
moves the slow operation of WP into future page faults.  In some cases (I'd
say, most cases..) that's not wanted.  The same goes for PUDs.

>
> Especially when uffd-wp'ing large ranges that are possibly all unpopulated
> (thinking about the existing VM background snapshot use case either with
> untouched memory or with things like free page reporting), we might neither
> be reading or writing that memory any time soon.

Right, I think that's a trade-off.  But I still think a large portion of
totally unpopulated memory should be the rare case rather than the majority,
or am I wrong?  Not to mention that it requires a more involved changeset to
the kernel.

So what I proposed here is the (AFAIU) simplest solution towards providing
such a feature in a complete form.  I think we have a chance to implement it
in other ways like pte markers, but that's something we can work on, and so
far I'm not sure how much benefit we would get out of it.

Thanks,

-- 
Peter Xu
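
For reference, here is a minimal userspace sketch of the read pre-population
workaround that the quoted commit message attributes to QEMU: touch every
page of an anonymous range with a read, so that empty ptes are filled in
before write-protection is armed.  This is only an illustration under stated
assumptions, not QEMU's code and not part of the patch.  The helper names
prefault_by_read() and demo_prefault() are hypothetical, and the sketch
assumes a Linux system with private anonymous memory and the base page size.

```c
#define _DEFAULT_SOURCE         /* for MAP_ANONYMOUS on strict -std= builds */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the patch or from QEMU): read one byte from
 * every page in [addr, addr + len) so that each page of the range has been
 * touched by a read before uffd-wp is armed on it.
 * Returns the number of pages whose first byte was non-zero.
 */
static size_t prefault_by_read(const void *addr, size_t len)
{
    long page_size = sysconf(_SC_PAGESIZE);
    const volatile unsigned char *p = addr;
    size_t nonzero = 0;

    assert(page_size > 0);
    for (size_t off = 0; off < len; off += (size_t)page_size) {
        /* volatile access, so the compiler cannot drop the read */
        if (p[off] != 0)
            nonzero++;
    }
    return nonzero;
}

/*
 * Hypothetical demo: map a fresh private anonymous region of npages pages,
 * pre-fault it by reading, and return how many pages looked non-zero.
 * Returns (size_t)-1 if the mapping could not be created.
 */
static size_t demo_prefault(size_t npages)
{
    size_t len = npages * (size_t)sysconf(_SC_PAGESIZE);
    void *mem = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    size_t nonzero;

    if (mem == MAP_FAILED)
        return (size_t)-1;

    nonzero = prefault_by_read(mem, len);
    munmap(mem, len);
    return nonzero;
}
```

The cost of this workaround is one read fault per page over the whole range
before tracking can start, which is the step the feature discussed in this
thread is meant to make unnecessary.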