From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 610BBC05027 for ; Fri, 17 Feb 2023 12:31:32 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 95A0A6B0074; Fri, 17 Feb 2023 07:31:31 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 909F76B0075; Fri, 17 Feb 2023 07:31:31 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 7F9086B0078; Fri, 17 Feb 2023 07:31:31 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0010.hostedemail.com [216.40.44.10]) by kanga.kvack.org (Postfix) with ESMTP id 6F1846B0074 for ; Fri, 17 Feb 2023 07:31:31 -0500 (EST) Received: from smtpin07.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay01.hostedemail.com (Postfix) with ESMTP id 2BEEF1C5FF4 for ; Fri, 17 Feb 2023 12:31:31 +0000 (UTC) X-FDA: 80476719582.07.2141C5B Received: from madras.collabora.co.uk (madras.collabora.co.uk [46.235.227.172]) by imf24.hostedemail.com (Postfix) with ESMTP id 01FC3180023 for ; Fri, 17 Feb 2023 12:31:27 +0000 (UTC) Authentication-Results: imf24.hostedemail.com; dkim=pass header.d=collabora.com header.s=mail header.b=S4yqOWik; dmarc=pass (policy=reject) header.from=collabora.com; spf=pass (imf24.hostedemail.com: domain of usama.anjum@collabora.com designates 46.235.227.172 as permitted sender) smtp.mailfrom=usama.anjum@collabora.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1676637088; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=73kwZFuvEyNucTG+kS78k5Hsk9KxzsVpuT3GDwh+GRI=; b=KZhjZnFlSVMu5BuQ1XiknshyWGmTCo/mLpyD5Vw03jP/Uhva2r8WpYXQf0rMBGqVO4dftR t40aFlOl3ceygHLsF2OzovltvGd01Oy4q98st8vDQFoCJ4NUK/bxbnR5eVd1Brfx8pLeLt GKHU3ndfGfqT6ymo8eOH7KGt+T9WCks= ARC-Authentication-Results: i=1; imf24.hostedemail.com; dkim=pass header.d=collabora.com header.s=mail header.b=S4yqOWik; dmarc=pass (policy=reject) header.from=collabora.com; spf=pass (imf24.hostedemail.com: domain of usama.anjum@collabora.com designates 46.235.227.172 as permitted sender) smtp.mailfrom=usama.anjum@collabora.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1676637088; a=rsa-sha256; cv=none; b=o8gkkgmwaoSpd++HIvtpY/KMigzdwF0cSyaTtvAA7OvBbQai3cESMKkca7w8AXg94hUP3c EwGMHUmX0dwGockwDHFGd/PDJH/+ooMQZ7+OgJ5b+cwAN+r+6M+I9++Cnwsz/BYqrBzmSZ jQ8YRmfltLaKMoF49ljBP3p3Cliw05g= Received: from [192.168.10.39] (unknown [119.155.16.218]) (using TLSv1.3 with cipher TLS_AES_128_GCM_SHA256 (128/128 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256) (No client certificate requested) (Authenticated sender: usama.anjum) by madras.collabora.co.uk (Postfix) with ESMTPSA id DA3996600864; Fri, 17 Feb 2023 12:31:23 +0000 (GMT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=collabora.com; s=mail; t=1676637086; bh=w7MN6GDlicUbqaG24KivscjACBzk6TIVqvyY24/C824=; h=Date:Cc:Subject:To:References:From:In-Reply-To:From; b=S4yqOWikkyf4bMJgNY/8+LQuSXF+F1luRq0Lfy4zBX0JGUI9cblFmfBFvaTdgEhva 1JvsW6SM550FA1mgG7lEaVBucwKX9NiW45Hho5VM5UJH0zvwebiL37zs3N4QLpGjzw B7o09wrnmg3Cs1LJJUce/cFzltuWtiUOfzRYSyE1plHT8wWyeIxaMQIKpNbCTGfs07 ClwrMspMk/ptlpCGzSsnHT+YXzaYZDdZ8gKZxzoe0ufn7QgoMMIPI0e6xgjOE389oc ipK9x9KboORqBsY+n3EIEFtsmre0oCYkK8X9HIz4QmCrH9rHBLplgQ7iQza4E445Gg 13b9XTjw5IW5A== Message-ID: <427298c4-0da9-059f-02ff-c5147d317c87@collabora.com> Date: Fri, 17 Feb 2023 17:31:19 +0500 MIME-Version: 1.0 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Thunderbird/102.7.1 Cc: Muhammad Usama Anjum , linux-kernel@vger.kernel.org, linux-mm@kvack.org, Axel Rasmussen , Mike Rapoport , Andrew Morton , Andrea Arcangeli , Nadav Amit Subject: Re: [PATCH] mm/uffd: UFFD_FEATURE_WP_ZEROPAGE Content-Language: en-US To: Peter Xu , David Hildenbrand References: <20230215210257.224243-1-peterx@redhat.com> <7eb2bce9-d0b1-a0e3-8be3-f28d858a61a0@redhat.com> <4f64d62f-c21d-b7c8-640e-d41742bbbe7b@redhat.com> <456f8e2e-9554-73a3-4fdb-be21f9cc54b6@redhat.com> From: Muhammad Usama Anjum In-Reply-To: Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit X-Rspamd-Queue-Id: 01FC3180023 X-Rspamd-Server: rspam09 X-Rspam-User: X-Stat-Signature: pqy9zw5f63yx9yq9nppdkkbfe3wuisqp X-HE-Tag: 1676637087-24094 X-HE-Meta: U2FsdGVkX19HgXytwIH954s71ZjlaZBJEA9flDcWii/mZ7Vl4d4yLX0RJLgKVCvgxkBbEZFh18hBaXR5sZUfjh8cR6g7kLA/s254M3y2ANlbipXn9cAfmmeblUXqqS1EcO5xeHrOye3nRpbIcIQtak625VFqABkxcEj/SyviRPuUJblYrOzdjofJRb7l2d9/XY1akB2fOWSBwZdZmkTQCFSIdmxaWEvLj7BNQqcDEeGn6T+It0lY9j+9pFtRlYjkhTXvqvR617uDzGIJmyAdZ1Nylb6eAJlsBQ8kZSJ/AC/r7ztrwjfFSBw/jDrSgEH9lUrImnD6260XKEOsq6WjV5m4v91unKNSJsZXbuHCe9iZZrk9z7r2RTpmKPEC//vf9jeurCVPH7fxTmUcxiAP8Db/36bzVlgHo44mgkyKGOKpD2G44Fr89y+M95irTe/n/T5wCClw+7YE6kPblInIDtSPp1lhKWI8lcVRFZjjV8za8WKxKB/ZNZXa/Wt2DSZPaUyTBLK887wW6qC3Q6qg2XVmR9Wo9wMnSVw8Pc7oTq7n+IqLlDrBt9t2HgDKAeWu6Eqq8mbGd6zS/3VC+jgn68Ds/c/3JB2OcRQOqKThUxUqiGwYqdcHejBYkmJwTFd04ykN3OmEh/xCT4RdICMq8obfDimtYwh86DKsiYP56cSIiIte92sYHZ1cxqn3O7hfJXbckN0IZuNdqhszV7Zto/8dfRcVdyftWXb5gxmw27k1nfaMMs4E3tXzhvHw5Nq7Gq6D7YDy43pOcEBrhCu4f1pxNTcLWIjXy7zCYJ6w2cxnyQIth+jIN+Oyg/mwCvzfRTLVExcNKBa++qVgVi7tT55RhrbvpFvWMg/48MEP1Jgg35aVeIpCzClK1zIPcZlkZR3MM3h7YJT6bdUSFFc5EW1xTR9JGrmXbahRnRa2Lo5nKeQe18nn0/cHNDxdT0kE17j2aEvVFZmZjqfZz50 rTC3Vkwn cCmx0+YSix1z6Cqf+v97u5dlDp7QcCnC8WrkSWi7RM9NX4snTTx5HQVpg/0AaW+7NyWWyArhPmfNQ19mYIzHE+ldlV99OqkerRyRK3H1xaSGqnMCKNSfktz9t3Tw8QIDJEW/jRBh6GnQcgF8b2BgwAFvX+CQ0002+F5u1f4nZoASVrctovWOwXKiwvSd7zCTxQ91li32q2bm2DKdCEfxsjjydFy3dDkm+Je9MaP31HKvI9yDeHLHabhWTmgzd8IxyPnOlMr057HVMI3f7Rs+CtR0G09ynY5D0D0A0b4U/A1yzn1GsOhNdlg81aedFZ47xINsbEuhbX4S98C2JwGYajilp29Nlg/yRi0cWoB8LVrVlHFSGwg0HpYVJaPbvVtxmI00FiBrl/uyE4YoA+Ho0y/RqU7E9/R4JBjG2Uu9YvhWliCJQGsZdxX0mycpv5CPpmSgVSJRycBHV5M8NULe625FOVg+xrnmKDeSuDPbKFp+Gh8LqjiDA2izscAKd7ZXvCHAtbnAQ/hmHBCV1JFXJsI70OF/3o/L9w1qC X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On 2/17/23 1:08 AM, Peter Xu wrote: > On Thu, Feb 16, 2023 at 07:23:17PM +0100, David Hildenbrand wrote: >> On 16.02.23 18:55, Peter Xu wrote: >>> On Thu, Feb 16, 2023 at 06:00:51PM +0100, David Hildenbrand wrote: >>>>>> >>>>>> There are various reasons why I think a UFFD_FEATURE_WP_UNPOPULATED, using >>>>>> PTE markers, would be more benficial: >>>>>> >>>>>> 1) It would be applicable to anon hugetlb >>>>> >>>>> Anon hugetlb should already work with non ptes with the markers? >>>>> >>>> >>>> ... really? I thought we'd do the whole pte marker handling only when >>>> dealing with hugetlb/shmem. Interesting, thanks. (we could skip population >>>> in QEMU in that case as well -- we always do it for now) >>> >>> Hmm, you're talking about "anon hugetlb", so it's still hugetlb, right? :) >> >> I mean especially MAP_PRIVATE|MAP_HUGETLB|MAP_ANONYMOUS, so "in theory" >> without any fd and thus pagecache. ... but anon hugetlb keeps confusing me >> with pagecache handling. > > IIUC when mmap(fd==-1) it's the same as MAP_PRIVATE|MAP_HUGETLB. > >> >>>> >>>>>> 2) It would be applicable even when the zeropage is disallowed >>>>>> (mm_forbids_zeropage()) >>>>> >>>>> Do you mean s390 can disable zeropage with mm_uses_skeys()? So far uffd-wp >>>>> doesn't support s390 yet, I'm not sure whether we over worried on this >>>>> effect. >>>>> >>>>> Or is there any other projects / ideas that potentially can enlarge forbid >>>>> zero pages to more contexts? >>>> >>>> I think it was shown that zeropages can be used to build covert channels >>>> (similar to memory deduplciation, because it effectively is memory >>>> deduplication). It's mentioned as a note in [1] under VII. A. ("Only >>>> Deduplicate Zero Pages.") >>>> >>>> >>>> [1] https://www.ndss-symposium.org/wp-content/uploads/2022-81-paper.pdf >>> >>> Thanks for the link. I'm slightly confused how dedup of zero pages is a >>> concern here, though. >>> >>> IIUC the security risk is when the dedup-ed pages contain valid information >>> so the attacker can measure latency of requests when the attemped malicious >>> page contains exactly the same content of the data page, by trying to >>> detect the CoW from happening. > >>> Here it's the zero page, even if there's CoW difference the data being >>> exposed can only be all zeros? Then what's the risk? >> >> The focus of that paper is on CoW latency yes (and deduplication >> instantiating shared zeropages -- but building a covert channel using CoW >> latency might be rather tricky I think, because they will get deduplciated >> independently of a sender action ...). >> >> However, in theory, one could build a covert channel between two VMs simply >> by using cache flushes and reading from the shared zeropage. Measuring >> access time can reveal if the sender read the page (L3 filled) or not (L3 >> not filled). > > So the attacker will know when someone reads a zeropage, but I still don't > get how that can leads to data leak.. > >> >> Having that said, I don't think that we are going to disable the shared >> zeropage because of that for some workloads, I assume in most cases it will >> simply be way too noisy to transmit any kind of data and we have more >> critical covert channels to sort out if we want to. >> >> Just wanted to raise it because you asked :) >> >>> >>> Another note for s390: when it comes we can consider moving to pte markers >>> conditionally when !zeropage. But we can leave that for later. >> >> Sure, we could always have another feature flag. > > I think that doesn't need to be another feature flag. If someone will port > uffd-wp to s390 we can implement pte markers for WP_ZEROPAGE, then we > either use it when zeropage not exist, or we can switch to pte markers > completely too without changing the interface if we want, depending on > whether we think replacing zeropages with pte markers will be a major issue > with existing apps. I don't worry too much on that part. > >> >>> >>>> >>>>> >>>>>> 3) It would be possible to optimize even without the huge zeropage, by >>>>>> using a PMD marker. >>>>> >>>>> This patch doesn't need huge zeropage being exist. >>>> >>>> Yes, and for that reason I think it may perform worse than what we already >>>> have in some cases. Instead of populating a single PMD you'll have to fill a >>>> full PTE table. >>> >>> Yes. If you think that'll worth it, I can conditionally do pmd zero thp in >>> a new version. Maybe it will be a good intermediate step between >>> introducing pte markers to pmd/pud/etc, so at least we don't need other >>> changes to coordinate pte markers to higher levels. >> >> [...] >> >>>>>> Especially when uffd-wp'ing large ranges that are possibly all unpopulated >>>>>> (thinking about the existing VM background snapshot use case either with >>>>>> untouched memory or with things like free page reporting), we might neither >>>>>> be reading or writing that memory any time soon. >>>>> >>>>> Right, I think that's a trade-off. But I still think large portion of >>>>> totally unpopulated memory should be rare case rather than majority, or am >>>>> I wrong? Not to mention that requires a more involved changeset to the >>>>> kernel. >>>>> >>>>> So what I proposed here is the (AFAIU) simplest solution towards providing >>>>> such a feature in a complete form. I think we have chance to implement it >>>>> in other ways like pte markers, but that's something we can work upon, and >>>>> so far I'm not sure how much benefit we can get out of it yet. >>>>> >>>> >>>> What you propose here can already be achieved by user space fairly easily >>>> (in fact, QEMU implementation could be further sped up using >>>> MADV_POPULATE_READ). Usually, we only do that when there are very good >>>> reasons to (performance). >>> >>> Yes POPULATE_READ will be faster. This patch should make it even faster, >>> because it merges the two walks. >>> >> >> Getting some performance numbers would be nice. > > Sure, I'll collect some data and post a new version. > >> >>>> >>>> Using PTE markers would provide a real advantage IMHO for some users (IMHO >>>> background snapshots), where we might want to avoid populating >>>> zeropages/page tables as best as we can completely if the VM memory is >>>> mostly untouched. >>>> >>>> Naturally, I wonder if UFFD_FEATURE_WP_ZEROPAGE is really worth it. Is there >>>> is another good reason to combine the populate zeropage+wp that I am missing >>>> (e.g., atomicity by doing both in one operation)? >>> >>> It also makes the new WP_ASYNC and pagemap interface clean: we don't want >>> to have user pre-fault it every time too as a common tactic.. It's hard to >>> use, and the user doesn't need to know the internals of why it is needed, >>> either. >> >> I feel like we're building a lot of infrastructure on uffd-wp instead of >> having an alternative softdirty mode (using a world switch?) that works as >> expected and doesn't require that many uffd-wp extensions. ;) > > We used to discuss this WP_ZEROPAGE before, and I thought we were all happy > to have that. Obviously you changed your mind. :) > > I wasn't really eager on this before because the workaround of pre-read > works good already (I assume slightly slower but it's fine; not until > someone starts to worry). But if we want to extend soft-dirty that's not > good at all to have any new user being requested to prefault memory and > figuring out why it's needed. > >> >> Having that said, I have the feeling that you and Muhammad have a plan to >> make it work using uffd-wp and I won't interfere. It would be nicer to use >> softdirty infrastructure IMHO, though. > > Thanks. If you have any good idea on reusing soft-dirty, please shoot. > I'll be perfectly happy with it as long as it resolves the issue for > Muhammad. Trust me - I wished the soft dirty thing worked out, but > unfortunately it didn't.. Because at least so far uffd-wp has two major > issues as I can see: > > (1) Memory type limitations (e.g. general fs memories stop working) > (2) Tracing uffd application is, afaict, impossible > > So if there's better way to do with soft-dirty or anything else (and I > assume it'll not be limited to any of above) it's time to say.. I'm on the same boat with you guys about soft-dirty feature. But I'm very happy after shifting to UFFD WP. We are closer than ever to complete the feature. > >> >>> >>> The other thing is it provides a way to make anon and !anon behave the same >>> on empty ptes; it's a pity that it was not already like that. As you must have guessed, we are looking for same behavior on anon and !anon. >> >> In an ideal world, we'd simply be using PTE markers unconditionally I think >> and avoid this zeropage feature :/ >> >> Is there any particular reason to have UFFD_FEATURE_WP_ZEROPAGE and not >> simply always do that unconditionally? (sure, we have to indicate to user >> space that it now works as expected) Are we really expecting to break user >> space by protecting what was asked for to protect? > > I suspect so. > > From high level, the major functional changes will be: > > (1) The user will start to receive more WP message with zero page being > reported, > > (2) Wr-protecting a very sparse memory can be much slower > > I would expect there're cases where the app just works as usual. > > However in some other cases the user may really not care about zero pages > at all, and I had a feeling that's actually the majority. > > Live snapshot is actually special because IIUC the old semantics should > work perfectly if the guest OS won't try to sanity check freed pages being > all zeros.. IOW that's some corner case, and if we can control that we may > not even need WP_ZEROPAGE too for QEMU, iiuc. For many other apps people > may leverage this (ignoring mem holes) and make the app faster. > > Normally when I'm not confident of any functional change, I'd rather use a > flag. Luckily uffd is very friendly to that, so the user can have better > control of what to expect. Some future app may explicitly want to always > ignore zero pages when on extremely sparse mem, and without the flag it > can't choose. > >> >>> >>> We can always optimize this behavior in the future with either >>> PMD/PUD/.. pte markers as you said, but IMHO that just needs further >>> justification on the complexity, and also on whether that's beneficial to >>> the majority to become the default behavior. >> >> As I said, usually any new features require good justification. Maybe there >> really is a measurable performance gain (less syscalls, less pgtable walks). > > Muhammad may have a word to say here; let's see whether he has any comment. > > Besides that, as I replied above I'll collect some data in my next post > regardless, with an attempt to optimize with huge zeropages on top. I've just ran my single threaded selftest [1] over an over again to get some numbers. Without zeropage qemu has 6 cores: 26.0355 With zeropage qemu has 6 cores: 39.203 33% worse performance with zero pages Definitely, there can be better benchmark application. Please let me know if I should write better benchmarks on my end. [1] https://lore.kernel.org/all/20230202112915.867409-7-usama.anjum@collabora.com > > Thanks, > -- BR, Muhammad Usama Anjum