Date: Tue, 3 Jan 2023 19:15:50 +0100
From: Ingo Molnar
To: "Jason A. Donenfeld"
Cc: linux-kernel@vger.kernel.org, patches@lists.linux.dev, tglx@linutronix.de, linux-crypto@vger.kernel.org, linux-api@vger.kernel.org, x86@kernel.org, Greg Kroah-Hartman, Adhemerval Zanella Netto, Carlos O'Donell, Florian Weimer, Arnd Bergmann, Jann Horn, Christian Brauner, linux-mm@kvack.org, Linus Torvalds
Subject: Re: [PATCH v14 2/7] mm: add VM_DROPPABLE for designating always lazily freeable mappings
References: <20230101162910.710293-1-Jason@zx2c4.com> <20230101162910.710293-3-Jason@zx2c4.com>

* Jason A. Donenfeld wrote:

> On Tue, Jan 03, 2023 at 11:50:43AM +0100, Ingo Molnar wrote:
> >
> > * Jason A. Donenfeld wrote:
> >
> > > The vDSO getrandom() implementation works with a buffer allocated with a
> > > new system call that has certain requirements:
> > >
> > > - It shouldn't be written to core dumps.
> > >   * Easy: VM_DONTDUMP.
> > > - It should be zeroed on fork.
> > >   * Easy: VM_WIPEONFORK.
> > >
> > > - It shouldn't be written to swap.
> > >   * Uh-oh: mlock is rlimited.
> > >   * Uh-oh: mlock isn't inherited by forks.
> > >
> > > - It shouldn't reserve actual memory, but it also shouldn't crash when
> > >   page faulting in memory if none is available
> > >   * Uh-oh: MAP_NORESERVE respects vm.overcommit_memory=2.
> > >   * Uh-oh: VM_NORESERVE means segfaults.
> > >
> > > It turns out that the vDSO getrandom() function has three really nice
> > > characteristics that we can exploit to solve this problem:
> > >
> > > 1) Due to being wiped during fork(), the vDSO code is already robust to
> > >    having the contents of the pages it reads zeroed out midway through
> > >    the function's execution.
> > >
> > > 2) In the absolute worst case of whatever contingency we're coding for,
> > >    we have the option to fallback to the getrandom() syscall, and
> > >    everything is fine.
> > >
> > > 3) The buffers the function uses are only ever useful for a maximum of
> > >    60 seconds -- a sort of cache, rather than a long term allocation.
> > >
> > > These characteristics mean that we can introduce VM_DROPPABLE, which
> > > has the following semantics:
> > >
> > > a) It never is written out to swap.
> > > b) Under memory pressure, mm can just drop the pages (so that they're
> > >    zero when read back again).
> > > c) If there's not enough memory to service a page fault, it's not fatal,
> > >    and no signal is sent. Instead, writes are simply lost.
> > > d) It is inherited by fork.
> > > e) It doesn't count against the mlock budget, since nothing is locked.
> > >
> > > This is fairly simple to implement, with the one snag that we have to
> > > use 64-bit VM_* flags, but this shouldn't be a problem, since the only
> > > consumers will probably be 64-bit anyway.
> > >
> > > This way, allocations used by vDSO getrandom() can use:
> > >
> > >     VM_DROPPABLE | VM_DONTDUMP | VM_WIPEONFORK | VM_NORESERVE
> > >
> > > And there will be no problem with OOMing, crashing on overcommitment,
> > > using memory when not in use, not wiping on fork(), coredumps, or
> > > writing out to swap.
> > >
> > > At the moment, rather than skipping writes on OOM, the fault handler
> > > just returns to userspace, and the instruction is retried. This isn't
> > > terrible, but it's not quite what is intended. The actual instruction
> > > skipping has to be implemented arch-by-arch, but so does this whole
> > > vDSO series, so that's fine. The following commit addresses it for x86.
> >
> > Yeah, so VM_DROPPABLE adds a whole lot of complexity, corner cases, per
> > arch low level work and rarely tested functionality (seriously, whose
> > desktop system touches swap these days?), just so we can add a few pages of
> > per thread vDSO data of a quirky type that in 99.999% of cases won't ever
> > be 'dropped' from under the functionality that is using it and will thus
> > bitrot fast?
>
> It sounds like you've misunderstood the issue.
>
> Firstly, the arch work is +19 lines (in the vdso branch of random.git).

For a single architecture: x86.

And it's only 19 lines because x86 already happens to have a bunch of
complexity implemented, such as a safe instruction decoder that allows the
skipping of an instruction - which itself relies on thousands of lines of
complexity.

On an architecture where this isn't present, it would have to be
implemented to support the instruction-skipping aspect of VM_DROPPABLE.

Even on x86, it's not common today for the software decoder to be used in
unprivileged code - its primary use has been in debugging & instrumentation
code. So your patches bring this piece of complexity to a much larger scope
of untrusted user-space functionality.

> That's very small and basic. Don't misrepresent it just to make a point.

I'm not misrepresenting anything.

> Secondly, and more importantly, swapping this data is *not* permissible.

I did not suggest swapping it: my suggestion is to just pin these vDSO data
pages. The per-thread memory overhead is infinitesimal on the vast majority
of the target systems, and the complexity trade-off you are proposing is
poorly reasoned IMO.

Anyway:

> Don't misrepresent it just to make a point.
...
> That seems like a ridiculous rhetorical leap.
...
> Did you actually read the commit message?

Frankly, I don't appreciate your condescending discussion style that
borders on the toxic, and to save time I'm nacking this technical approach
until both the patchset and your reaction to constructive review feedback
improve:

  NAcked-by: Ingo Molnar

I think my core point, that it would be much simpler to just pin those
pages and not introduce rarely exercised 'discardable memory' semantics in
Linux, is a fair one - so it's straightforward to lift this NAK.

I'll re-evaluate the NACK on every new iteration of this patchset I see.

Thanks,

	Ingo
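
For readers weighing the pinning alternative argued for above, here is a minimal userspace sketch of what "pinned, not dumped, wiped on fork" looks like with interfaces that already exist. It is only an analogue: the thread is about in-kernel handling of vDSO data pages, and nothing below is taken from the patch series. The names `struct pinned_state`, `alloc_pinned_state()` and `free_pinned_state()` are hypothetical. The sketch assumes Linux, where `MADV_DONTDUMP` and `MADV_WIPEONFORK` are the userspace requests corresponding to VM_DONTDUMP and VM_WIPEONFORK, and it records rather than hides the `mlock()` failure case, since the quoted commit message points out that mlock is rlimited and not inherited across fork.

```c
#define _DEFAULT_SOURCE   /* for MAP_ANONYMOUS and madvise() under strict -std= */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical helper type, for illustration only. */
struct pinned_state {
    void  *addr;        /* NULL if the mapping itself failed */
    size_t len;         /* length passed to mmap() */
    int    locked;      /* 1 if mlock() succeeded (pages cannot be swapped) */
    int    dontdump;    /* 1 if MADV_DONTDUMP was accepted */
    int    wipeonfork;  /* 1 if MADV_WIPEONFORK was accepted */
};

/* Map anonymous memory and try to pin it and mark it as sensitive.
 * Each property is attempted independently and its outcome recorded,
 * so the caller can decide whether a partial result is acceptable. */
struct pinned_state alloc_pinned_state(size_t len)
{
    struct pinned_state s = { NULL, 0, 0, 0, 0 };
    void *p;

    assert(len > 0);

    p = mmap(NULL, len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return s;

    s.addr = p;
    s.len = len;

    /* Pinning: counts against RLIMIT_MEMLOCK and is not inherited by fork(). */
    s.locked = (mlock(p, len) == 0);

    /* Keep the contents out of core dumps. */
    s.dontdump = (madvise(p, len, MADV_DONTDUMP) == 0);

#ifdef MADV_WIPEONFORK
    /* The child sees zero-filled pages after fork(). */
    s.wipeonfork = (madvise(p, len, MADV_WIPEONFORK) == 0);
#endif

    return s;
}

/* Unmap the region; munmap() also releases any lock held on it. */
void free_pinned_state(struct pinned_state *s)
{
    if (s->addr != NULL) {
        munmap(s->addr, s->len);
        s->addr = NULL;
        s->len = 0;
    }
}
```

A caller would check `s.locked` after allocation: if it is 0, the pages can still reach swap. That is the practical limit of doing the pinning from unprivileged userspace, as opposed to the kernel pinning the pages itself.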