Date: Thu, 5 Jan 2023 14:51:13 -0500
From: Peter Xu
To: Nadav Amit
Cc: David Hildenbrand, Linux-MM, kernel list, Mike Kravetz, Muchun Song, Andrea Arcangeli, James Houghton, Axel Rasmussen, Andrew Morton
Subject: Re: [PATCH 3/3] mm/uffd: Detect pgtable allocation failures
References: <20230104225207.1066932-1-peterx@redhat.com> <20230104225207.1066932-4-peterx@redhat.com>

On Thu, Jan 05, 2023 at 10:01:46AM -0800, Nadav Amit wrote:
>
>
> > On Jan 5, 2023, at 12:59 AM, David Hildenbrand wrote:
> >
> > On 05.01.23 04:10, Nadav Amit wrote:
> >>> On Jan 4, 2023, at 2:52 PM, Peter Xu wrote:
> >>>
> >>> Before this patch, when there's any pgtable allocation issues happened
> >>> during change_protection(), the error will be ignored from the syscall.
> >>> For shmem, there will be an error dumped into the host dmesg.  Two issues
> >>> with that:
> >>>
> >>>  (1) Doing a trace dump when allocation fails is not anything close to
> >>>      grace..
> >>>
> >>>  (2) The user should be notified with any kind of such error, so the user
> >>>      can trap it and decide what to do next, either by retrying, or stop
> >>>      the process properly, or anything else.
> >>>
> >>> For userfault users, this will change the API of UFFDIO_WRITEPROTECT when
> >>> pgtable allocation failure happened.  It should not normally break anyone,
> >>> though.  If it breaks, then in good ways.
> >>>
> >>> One man-page update will be on the way to introduce the new -ENOMEM for
> >>> UFFDIO_WRITEPROTECT.  Not marking stable so we keep the old behavior on the
> >>> 5.19-till-now kernels.
> >> I understand that the current assumption is that change_protection() should
> >> fully succeed or fail, and I guess this is the current behavior.
> >> However, to be more “future-proof” perhaps this needs to be revisited.
> >> For instance, UFFDIO_WRITEPROTECT can benefit from the ability to (based on
> >> userspace request) prevent write-protection of pages that are pinned.
> >> This is necessary to allow userspace uffd monitor to avoid
> >> write-protection of O_DIRECT’d memory, for instance, that might change
> >> even if a uffd monitor considers it write-protected.
> >
> > Just a note that this is pretty tricky IMHO, because:
> >
> > a) We cannot distinguish "pinned readable" from "pinned writable"
> > b) We can have false positives ("pinned") even for compound pages due to
> >    concurrent GUP-fast.
> > c) Synchronizing against GUP-fast is pretty tricky ... as we learned.
> >    Concurrent pinning is usually problematic.
> > d) O_DIRECT still uses FOLL_GET and we cannot identify that. (at least
> >    that should be figured out at one point)
>
> My prototype used the page-count IIRC, so it had false-positives (but
> addressed O_DIRECT).

So the app would have to be fine with some of the requested pages not
being write-protected?  For a swap framework I think that's fine, but
maybe not for taking a snapshot, so I agree it should be an optional flag
as you mentioned below.

> And yes, precise refinement is complicated. However,
> if you need to uffd-wp memory, then without such a mechanism you need to
> ensure no kernel/DMA write to these pages is possible. The only other
> option I can think of is interposing/seccomp on a variety of syscalls,
> to prevent uffd-wp of such memory.
>
> >
> > I have a patch lying around for a very long time that removes that special-pinned handling from softdirty code, because of the above reasons (and because it forgets THP). For now I didn't send it because for softdirty, it's acceptable to over-indicate and it hasn't been reported to be an actual problem so far.
> >
> > For existing UFFDIO_WRITEPROTECT users, however, it might be very harmful (especially for existing users) to get false protection errors. Failing due to ENOMEM is different to failing due to some temporary concurrency issues.
>
> Yes, I propose it as an optional flag for UFFD-WP.
> Anyhow, I believe the UFFD-WP as implemented now is not efficient and
> should’ve been vectored to allow one TLB shootdown for many
> non-consecutive pages.

Agreed.  Would providing a vector of ranges help too for a few uffd
ioctls?

I'm also curious whether you're still actively developing (or running)
your iouring series.

>
> >
> > Having that said, I started thinking about alternative ways of detecting that in the past, without much outcome so far: the latest idea was indicating "this MM has had pinned pages at one point, be careful because any techniques that use write-protection (softdirty, mprotect, uffd-wp) won't be able to catch writes via pinned pages reliably".
>
> I am not sure what the best way to detect that a page is write-pinned
> reliably. My point was that if a change is already carried to
> write-protect mechanisms, then this issue should be considered. Because
> otherwise, many use-cases of uffd-wp would encounter implementation
> issues.
>
> I will not “kill” myself over it now, but I think it worth consideration.

The current interface change is small and limited to the extra -ENOMEM
retval under memory pressure (personally I don't really know how to
trigger it; I never managed to, even under memory pressure).  What you
said does sound like a new area to explore, and I think it's fine to
change the interface again.

Thanks,

-- 
Peter Xu
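
To illustrate point (2) of the quoted commit message, here is a minimal
userspace sketch of how a caller might react once UFFDIO_WRITEPROTECT can
fail with -ENOMEM.  It is not part of the patch.  The helper names
(uffd_wp_range, uffd_wp_range_retry) and the bounded-retry policy are
assumptions made for illustration only; a real monitor may prefer to back
off, free memory, or stop the process instead.

```c
#include <errno.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/*
 * Write-protect [start, start + len) through an already-registered
 * userfaultfd.  Returns 0 on success or a negative errno value.
 * (Hypothetical helper, not taken from the patch.)
 */
static int uffd_wp_range(int uffd, uint64_t start, uint64_t len)
{
	struct uffdio_writeprotect wp = {
		.range = { .start = start, .len = len },
		.mode = UFFDIO_WRITEPROTECT_MODE_WP,
	};

	if (ioctl(uffd, UFFDIO_WRITEPROTECT, &wp) == -1)
		return -errno;
	return 0;
}

/*
 * Assumed policy: retry only on -ENOMEM, at most max_retries extra
 * times.  Any other error (or success) is returned immediately so the
 * caller can decide what to do next.
 */
static int uffd_wp_range_retry(int uffd, uint64_t start, uint64_t len,
			       int max_retries)
{
	int ret;

	do {
		ret = uffd_wp_range(uffd, start, len);
	} while (ret == -ENOMEM && max_retries-- > 0);

	return ret;
}
```

A caller would pass the uffd descriptor plus a page-aligned range and
treat a final -ENOMEM as "protection not applied", rather than assuming
the range is write-protected.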