From mboxrd@z Thu Jan  1 00:00:00 1970
From: Jann Horn <jannh@google.com>
Date: Tue, 22 Oct 2024 21:08:53 +0200
Subject: Re: [PATCH v2 3/5] mm: madvise: implement lightweight guard page mechanism
To: Vlastimil Babka
Cc: Lorenzo Stoakes, Andrew Morton, Suren Baghdasaryan, "Liam R. Howlett",
 Matthew Wilcox, "Paul E. McKenney", David Hildenbrand, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org, Muchun Song, Richard Henderson,
 Ivan Kokshaysky, Matt Turner, Thomas Bogendoerfer, "James E. J. Bottomley",
Bottomley" , Helge Deller , Chris Zankel , Max Filippov , Arnd Bergmann , linux-alpha@vger.kernel.org, linux-mips@vger.kernel.org, linux-parisc@vger.kernel.org, linux-arch@vger.kernel.org, Shuah Khan , Christian Brauner , linux-kselftest@vger.kernel.org, Sidhartha Kumar , Jeff Xu , Christoph Hellwig , linux-api@vger.kernel.org, John Hubbard Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Rspamd-Queue-Id: 3F097A002A X-Stat-Signature: ir1mzgfejc67baiiijbgtnumsy6txk39 X-Rspamd-Server: rspam09 X-Rspam-User: X-HE-Tag: 1729624161-984593 X-HE-Meta: U2FsdGVkX1+ODkZ8xuBFVN4WdWvKG1rttY9xbDxsx9KqzxZCBJeyo5BX9hb7Eb3aoaDINN5xg6xs0wokHSxyWjlrfxuE0JrM7/EH2YuZgyaTpyfQv+jcTSnWVc7Vd5My7sZ5zQHLTmh69I6yVjbhssb/gvAVmzZ8oDH3rhH6rr1e9ujDuwiPKjetMf5iWdowFhKB27al88KLHFrNGW+ZTPlg+XT50RlcdThS1QElSzA3F0LviM7JMvBQEQRFTakyAbjeamaYqfWaG7PXJq7HZXy1su446SUwTy5qwqV2Ur8HVvkigEMjHqmLfC0kvXvI5YdC+p59SrLdMTdRQkJL966RsKbYTMkWkUVjgL6xTlOxfKeAEXEYVt5KVujuzcKID9J8YjylsWDb2FFbhmFRi4DXFtLyipVN22Zf59tEY6Qcii+Ce8C7+eGv2ER4CTvywlWlV54Qelj/EXfArLomZdYLfQKQLgwA19nyiylcQxvAC0mqzlEezobQNI9/P/0nz02O7E75q/sdRdvd0b7AkrEsNEkXbMtHLqyCyjQMZjzgX2z56RQaVMu5DIOnixdcXq1i+Nz4px3GwCeZvXGMVlillkqDpSKj3f72B1Hu0oBnOIw4icMMTCUhE1jQYFiOHXPM3pFAc2x/XyO1YSZ3otsy4HBpz6bX8N4yU5fVHx1M9zA48hrxUOJ6hfBELT8knT4EfdYvQqnfikGFRtC6zzOzLtTiVesrAlMo/8YI4Gv+z09cd9du0sGktSlN29z60VfEjTKfZSAuYjf8K3Z0ftGZMIxXIm2vw4uzN5K7nt+UKdI0uCCwOLLkPAtKUK6/3xdPz2i2oVzsv1a3a29LpM1UMXzqv+EzxsHjw32/sJnHptinimmOIfUEBZ8i1Pg8IFL6Pnahu8gEKPnnTP5Ka4SZeIA8abTwAkoInSd/h/ahwONyN5iXRZMosVUFJ+nsQRjIEdWnL40ofQhdiFx oiGw6OaM ulry4lvS/5xLy/BNlegGamFtbMGRDG7bM00jY9CdFsyGyuj8ccXM3/v7bm3QuQ1erSElyNpJCuUAqgIXUyo5h+jt1k0k/C+OcoLXdySL8xKuouURU+qlclbjdyF10NWzsEPjIL80p2Cu6PZtTPD9w6QL0TavzC6OOigdkJoOJ9W3VF5JgOq7cBAjeJrkpYBGBMEtN X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Mon, Oct 21, 2024 at 10:46=E2=80=AFPM Vlastimil Babka w= rote: > On 10/21/24 22:27, Lorenzo Stoakes wrote: > > On Mon, Oct 21, 2024 at 10:11:29PM +0200, Vlastimil Babka wrote: > >> On 10/20/24 18:20, Lorenzo Stoakes wrote: > >> > + while (true) { > >> > + /* Returns < 0 on error, =3D=3D 0 if success, > 0 if zap = needed. */ > >> > + err =3D walk_page_range_mm(vma->vm_mm, start, end, > >> > + &guard_poison_walk_ops, NULL); > >> > + if (err <=3D 0) > >> > + return err; > >> > + > >> > + /* > >> > + * OK some of the range have non-guard pages mapped, zap > >> > + * them. This leaves existing guard pages in place. > >> > + */ > >> > + zap_page_range_single(vma, start, end - start, NULL); > >> > >> ... however the potentially endless loop doesn't seem great. Could a > >> malicious program keep refaulting the range (ignoring any segfaults if= it > >> loses a race) with one thread while failing to make progress here with > >> another thread? Is that ok because it would only punish itself? > > > > Sigh. Again, I don't think you've read the previous series have you? Or > > even the changelog... I added this as Jann asked for it. Originally we'= d > > -EAGAIN if we got raced. See the discussion over in v1 for details. > > > > I did it that way specifically to avoid such things, but Jann didn't ap= pear > > to think it was a problem. > > If Jann is fine with this then it must be secure enough. 
My thinking there was: We can legitimately race with adjacent faults
populating the area we're operating on with THP pages; as long as the
zapping and poison-marker-setting are separate, *someone* will have to
do the retry. Either we do it in the kernel, or we tell userspace to
handle it, but having the kernel take care of it is preferable because
it makes the stable UAPI less messy.

One easy way to do it in the kernel would be to return -ERESTARTNOINTR
after the zap_page_range_single() instead of jumping back up, which in
terms of locking and signal handling and such would be equivalent to
looping in userspace (because really that's what -ERESTARTNOINTR does -
it returns out to userspace and moves the instruction pointer back to
restart the syscall). Though if we do that immediately, it might make
MADV_POISON unnecessarily slow, so we should probably retry once before
doing that.

The other easy way is to just loop here. The cond_resched() and pending
fatal signal check mean that (except on CONFIG_PREEMPT_NONE) the only
differences between the current implementation and looping in userspace
are that we don't handle non-fatal signals in between iterations and
that we keep hogging the mmap_lock in read mode.

We do already have a bunch of codepaths that retry on concurrent page
table changes, like when zap_pte_range() encounters a
pte_offset_map_lock() failure; though I guess the difference is that
the retry on those is just a couple instructions, which would be harder
to race consistently, while here we redo walks across the entire range,
which should be fairly easy to race repeatedly.

So I guess you have a point that this might be the easiest way to stall
other tasks that are trying to take mmap_lock for an extended amount of
time; I did not fully consider that... and then I guess you could use
that to slow down usercopy fault handling (once the lock switches to
handoff mode because of a stalled writer?) or slow down other processes
trying to read /proc/$pid/cmdline?

You can already indefinitely hog the mmap_lock with FUSE, though that
requires that you can mount a FUSE filesystem (which you wouldn't be
able to do in reasonably sandboxed code) and that you can find
something like a pin_user_pages() call that can't drop the mmap lock in
between, and there aren't actually that many of those...

So I guess you have a point and the -ERESTARTNOINTR approach would be a
little bit nicer, as long as it's easy to implement.
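
For illustration, a rough sketch of what that -ERESTARTNOINTR variant
could look like on top of the quoted v2 code (the function name, the
"retried" flag and the exact structure are my assumptions here, not the
actual patch):

	static long madvise_guard_poison(struct vm_area_struct *vma,
					 unsigned long start, unsigned long end)
	{
		bool retried = false;
		long err;

		while (true) {
			/* Returns < 0 on error, == 0 if success, > 0 if zap needed. */
			err = walk_page_range_mm(vma->vm_mm, start, end,
						 &guard_poison_walk_ops, NULL);
			if (err <= 0)
				return err;

			/*
			 * Some of the range has non-guard pages mapped; zap
			 * them, leaving existing guard pages in place.
			 */
			zap_page_range_single(vma, start, end - start, NULL);

			/*
			 * Retry once in the kernel; if we lose the race again,
			 * -ERESTARTNOINTR transparently restarts the syscall,
			 * which drops mmap_lock and handles pending signals
			 * much like a userspace retry loop would.
			 */
			if (retried)
				return -ERESTARTNOINTR;
			retried = true;
		}
	}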