Message-Id: <11e673cf-8bfa-4493-ab86-2f1f97ddd732@app.fastmail.com>
In-Reply-To: <08EFDEDB-7BBB-4D9C-B7E5-D7370EC609BE@gmail.com>
References: <20240311164638.2015063-1-pasha.tatashin@soleen.com>
 <20240311164638.2015063-12-pasha.tatashin@soleen.com>
 <3e180c07-53db-4acb-a75c-1a33447d81af@app.fastmail.com>
 <08EFDEDB-7BBB-4D9C-B7E5-D7370EC609BE@gmail.com>
Date: Mon, 11 Mar 2024 17:02:40 -0700
From: "Andy Lutomirski"
To: "Nadav Amit"
Cc: "Dave Hansen", "Pasha Tatashin", "Linux Kernel Mailing List",
 linux-mm, "Andrew Morton", "the arch/x86 maintainers",
 "Borislav Petkov", "Christian Brauner", bristot@redhat.com,
 "Ben Segall", "Dave Hansen", dianders@chromium.org,
 dietmar.eggemann@arm.com, eric.devolder@oracle.com, hca@linux.ibm.com,
 "hch@infradead.org", "H. Peter Anvin", "Jacob Pan", "Jason Gunthorpe",
 jpoimboe@kernel.org, "Joerg Roedel", juri.lelli@redhat.com,
 "Kent Overstreet", kinseyho@google.com, "Kirill A. Shutemov",
 lstoakes@gmail.com, mgorman@suse.de, mic@digikod.net,
 michael.christie@oracle.com, "Ingo Molnar", mjguzik@gmail.com,
 "Michael S. Tsirkin", "Nicholas Piggin", "Peter Zijlstra (Intel)",
 "Petr Mladek", "Rick P Edgecombe", "Steven Rostedt",
 "Suren Baghdasaryan", "Thomas Gleixner", "Uladzislau Rezki",
 vincent.guittot@linaro.org, vschneid@redhat.com
Subject: Re: [RFC 11/14] x86: add support for Dynamic Kernel Stacks

On Mon, Mar 11, 2024, at 4:56 PM, Nadav Amit wrote:
>> On 12 Mar 2024, at 1:41, Andy Lutomirski wrote:
>>
>> On Mon, Mar 11, 2024, at 4:34 PM, Dave Hansen wrote:
>>> On 3/11/24 15:17, Andy Lutomirski wrote:
>>>> I *think* that all x86 implementations won't fill the TLB for a
>>>> non-accessed page without also setting the accessed bit,
>>>
>>> That's my understanding as well.  The SDM is a little more obtuse about it:
>>>
>>>> Whenever the processor uses a paging-structure entry as part of
>>>> linear-address translation, it sets the accessed flag in that entry
>>>> (if it is not already set).
>>>
>>> but it's there.
>>>
>>> But if we start needing Accessed=1 to be accurate, clearing those PTEs
>>> gets more expensive because it needs to be atomic to lock out the page
>>> walker.  It basically needs to start getting treated similarly to what
>>> is done for Dirty=1 on userspace PTEs.  Not the end of the world, of
>>> course, but one more source of overhead.
>>
>> In my fantasy land where I understand the x86 paging machinery,
>> suppose we're in finish_task_switch(), and suppose prev is Not
>> Horribly Buggy (TM).  In particular, suppose that no other CPU is
>> concurrently (non-speculatively!) accessing prev's stack.  Prev can't
>> be running, because whatever magic lock prevents it from being
>> migrated hasn't been released yet.  (I have no idea what lock this
>> is, but it had darned well better exist so prev isn't migrated before
>> switch_to() even returns.)
>>
>> So the current CPU is not accessing the memory, and no other CPU is
>> accessing the memory, and BPF doesn't exist, so no one is being
>> utterly daft and a kernel read probe, and perf isn't up to any funny
>> business, etc.  And a CPU will never *speculatively* set the accessed
>> bit (I told you it's fantasy land), so we just do it unlocked:
>>
>> if (!pte->accessed) {
>>   *pte = 0;
>>   reuse the memory;
>> }
>>
>> What could possibly go wrong?
>>
>> I admit this is not the best idea I've ever had, and I will not waste
>> anyone's time by trying very hard to defend it :)
>>
>
> Just a thought: you don’t care if someone only reads from the stack's
> page (you can just install another page later). IOW: you only care if
> someone writes.
>
> So you can look on the dirty-bit, which is not being set speculatively
> and save yourself one problem.

Doesn't this buy a new problem?  Install a page, run the thread without
using the page but speculatively load the PTE as read-only into the TLB,
context-switch out the thread, (entirely safely and correctly) determine
that the page wasn't used, remove it from the PTE, use it for something
else and fill it with things that aren't zero, run the thread again, and
read from it.  Now it has some other thread's data!

One might slightly credibly argue that this isn't a problem -- a read
from the region between RSP and the bottom of the area that one
nominally considers to be the stack is allowed to return arbitrary
garbage, especially in the kernel where there's no red zone (until
someone builds a kernel with a red zone on a FRED system, hmm), but this
is still really weird.  If you *write* in that area, the CPU hopefully
puts the *correct* value in the TLB and life goes on, but how much do
you trust anyone to have validated what happens when a PTE is present,
writable and clean but the TLB contains a stale entry pointing somewhere
else?  And is it really okay to do this to the poor kernel?

If we're going to add a TLB flush on context switch, then (a) we are
being rather silly and (b) we might as well just use atomics to play
with the accessed bit instead, I think.
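
The stale-TLB sequence described above (install, speculative fill,
switch out, reclaim on Dirty=0, reuse, switch back in, read) can be
sketched as a toy user-space model.  This is only an illustration under
loud assumptions: it is not kernel code, it models a one-entry TLB and
two tiny "frames", and every name in it (toy_pte, toy_tlb,
toy_stale_read, ...) is hypothetical and made up for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/*
 * Toy user-space model, NOT kernel code.  Every name here is hypothetical.
 * One virtual stack page, one PTE, a one-entry TLB, two tiny "frames".
 */
enum { NO_FRAME = -1, FRAME_BYTES = 16, NR_FRAMES = 2 };

struct toy_pte { int frame; bool present; bool dirty; };
struct toy_tlb { int frame; bool valid; };

static uint8_t frames[NR_FRAMES][FRAME_BYTES];

/* Model of a speculative TLB fill: caches the translation, leaves Dirty clear. */
static void toy_tlb_fill(struct toy_tlb *tlb, const struct toy_pte *pte)
{
	if (pte->present) {
		tlb->frame = pte->frame;
		tlb->valid = true;
	}
}

/* A load uses the cached translation if there is one; returns -1 on "fault". */
static int toy_load(struct toy_tlb *tlb, const struct toy_pte *pte, int off)
{
	assert(off >= 0 && off < FRAME_BYTES);
	if (!tlb->valid) {
		if (!pte->present)
			return -1;
		toy_tlb_fill(tlb, pte);
	}
	return frames[tlb->frame][off];
}

/* Reclaim the page if Dirty is clear; optionally drop the cached translation. */
static bool toy_reclaim_if_clean(struct toy_pte *pte, struct toy_tlb *tlb,
				 bool flush)
{
	if (pte->dirty)
		return false;
	pte->present = false;
	pte->frame = NO_FRAME;
	if (flush)
		tlb->valid = false;
	return true;
}

/*
 * Walk through the sequence.  Returns the byte the thread reads after
 * being switched back in, or -1 if the read faults (or nothing was
 * reclaimed, which cannot happen in this walk-through).
 */
int toy_stale_read(bool flush)
{
	struct toy_pte pte = { .frame = 0, .present = true, .dirty = false };
	struct toy_tlb tlb = { .frame = NO_FRAME, .valid = false };

	memset(frames, 0, sizeof(frames));
	toy_tlb_fill(&tlb, &pte);	/* thread runs, never touches the page */
	if (!toy_reclaim_if_clean(&pte, &tlb, flush))	/* switched out */
		return -1;
	frames[0][0] = 0xAA;		/* frame reused for someone else's data */
	return toy_load(&tlb, &pte, 0);	/* thread runs again and reads */
}
```

In this model, toy_stale_read(false) hands back the byte written by the
frame's new owner, while toy_stale_read(true) faults instead, which is
the flush-on-context-switch cost that the last paragraph argues against
paying.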