Subject: Re: [PATCH 3/3] mm/uffd: Detect pgtable allocation failures
From: Nadav Amit
Date: Thu, 5 Jan 2023 10:01:46 -0800
Cc: Peter Xu, Linux-MM, kernel list, Mike Kravetz, Muchun Song, Andrea Arcangeli, James Houghton, Axel Rasmussen, Andrew Morton
References: <20230104225207.1066932-1-peterx@redhat.com> <20230104225207.1066932-4-peterx@redhat.com>
To: David Hildenbrand

> On Jan 5, 2023, at 12:59 AM, David Hildenbrand wrote:
>
> On 05.01.23 04:10, Nadav Amit wrote:
>>> On Jan 4, 2023, at 2:52 PM, Peter Xu wrote:
>>>
>>> Before this patch, when there's any pgtable allocation issues happened
>>> during change_protection(), the error will be ignored from the syscall.
>>> For shmem, there will be an error dumped into the host dmesg.  Two issues
>>> with that:
>>>
>>>  (1) Doing a trace dump when allocation fails is not anything close to
>>>      grace..
>>>
>>>  (2) The user should be notified with any kind of such error, so the user
>>>      can trap it and decide what to do next, either by retrying, or stop
>>>      the process properly, or anything else.
>>>
>>> For userfault users, this will change the API of UFFDIO_WRITEPROTECT when
>>> pgtable allocation failure happened.  It should not normally break anyone,
>>> though.  If it breaks, then in good ways.
>>>
>>> One man-page update will be on the way to introduce the new -ENOMEM for
>>> UFFDIO_WRITEPROTECT.  Not marking stable so we keep the old behavior on the
>>> 5.19-till-now kernels.
>> I understand that the current assumption is that change_protection() should
>> fully succeed or fail, and I guess this is the current behavior.
>> However, to be more “future-proof” perhaps this needs to be revisited.
>> For instance, UFFDIO_WRITEPROTECT can benefit from the ability to (based on
>> userspace request) prevent write-protection of pages that are pinned. This is
>> necessary to allow userspace uffd monitor to avoid write-protection of
>> O_DIRECT’d memory, for instance, that might change even if a uffd monitor
>> considers it write-protected.
>
> Just a note that this is pretty tricky IMHO, because:
>
> a) We cannot distinguished "pinned readable" from "pinned writable"
> b) We can have false positives ("pinned") even for compound pages due to
>    concurrent GUP-fast.
> c) Synchronizing against GUP-fast is pretty tricky ... as we learned.
>    Concurrent pinning is usually problematic.
> d) O_DIRECT still uses FOLL_GET and we cannot identify that. (at least
>    that should be figured out at one point)

My prototype used the page count IIRC, so it had false positives (but
addressed O_DIRECT). And yes, precise refinement is complicated. However,
if you need to uffd-wp memory, then without such a mechanism you need to
ensure no kernel/DMA write to these pages is possible.
The only other option I can think of is interposing/seccomp on a variety
of syscalls, to prevent uffd-wp of such memory.

>
> I have a patch lying around for a very long time that removes that
> special-pinned handling from softdirty code, because of the above
> reasons (and because it forgets THP). For now I didn't send it because
> for softdirty, it's acceptable to over-indicate and it hasn't been
> reported to be an actual problem so far.
>
> For existing UFFDIO_WRITEPROTECT users, however, it might be very
> harmful (especially for existing users) to get false protection errors.
> Failing due to ENOMEM is different to failing due to some temporary
> concurrency issues.

Yes, I propose it as an optional flag for UFFD-WP. Anyhow, I believe
UFFD-WP as implemented now is not efficient and should’ve been vectored
to allow one TLB shootdown for many non-consecutive pages.

>
> Having that said, I started thinking about alternative ways of
> detecting that in that past, without much outcome so far: that latest
> idea was indicating "this MM has had pinned pages at one point, be
> careful because any techniques that use write-protection (softdirty,
> mprotect, uffd-wp) won't be able to catch writes via pinned pages
> reliably".

I am not sure what the best way is to reliably detect that a page is
write-pinned. My point was that if a change is already being made to the
write-protect mechanisms, then this issue should be considered, since
otherwise many use-cases of uffd-wp would run into implementation issues.

I will not “kill” myself over it now, but I think it is worth
consideration.
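
For reference, here is a minimal userspace sketch of how a uffd monitor
might react to the ENOMEM that the quoted patch makes
UFFDIO_WRITEPROTECT return. The names wp_classify(), wp_range() and the
wp_action enum are hypothetical, and the policy of retrying on ENOMEM and
EAGAIN is only an assumption about what a monitor could choose to do, not
something the patch mandates.

```c
#include <errno.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/* Hypothetical policy names; not part of any kernel or libc API. */
enum wp_action { WP_DONE, WP_RETRY, WP_GIVE_UP };

/* Map the errno of a failed UFFDIO_WRITEPROTECT to an assumed policy. */
enum wp_action wp_classify(int err)
{
	switch (err) {
	case 0:
		return WP_DONE;
	case EAGAIN:	/* transient: the call asked to be retried */
	case ENOMEM:	/* pgtable allocation failed (reported with this patch) */
		return WP_RETRY;
	default:
		return WP_GIVE_UP;
	}
}

/*
 * Write-protect [start, start + len) on an already registered uffd.
 * Retries at most max_retries times on retryable errors.
 * Returns 0 on success, otherwise the errno of the last attempt.
 * A real monitor would probably back off or release memory before
 * retrying after ENOMEM; this sketch retries immediately.
 */
int wp_range(int uffd, uint64_t start, uint64_t len, int max_retries)
{
	struct uffdio_writeprotect wp = {
		.range = { .start = start, .len = len },
		.mode = UFFDIO_WRITEPROTECT_MODE_WP,
	};

	for (int attempt = 0; ; attempt++) {
		if (ioctl(uffd, UFFDIO_WRITEPROTECT, &wp) == 0)
			return 0;

		int err = errno;

		if (wp_classify(err) != WP_RETRY || attempt >= max_retries)
			return err;
	}
}
```

A caller would treat a non-zero return as "the range may not be
write-protected" and decide whether to stop tracking it or abort, which
is the choice the patch description wants to hand to userspace.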