From: Pasha Tatashin
Date: Mon, 11 Mar 2024 19:10:23 -0400
Subject: Re: [RFC 11/14] x86: add support for Dynamic Kernel Stacks
To: Andy Lutomirski
Cc: Linux Kernel Mailing List, linux-mm@kvack.org, Andrew Morton,
 "the arch/x86 maintainers", Borislav Petkov, Christian Brauner,
 bristot@redhat.com, Ben Segall, Dave Hansen, dianders@chromium.org,
 dietmar.eggemann@arm.com, eric.devolder@oracle.com, hca@linux.ibm.com,
 "hch@infradead.org", "H. Peter Anvin", Jacob Pan, Jason Gunthorpe,
 jpoimboe@kernel.org, Joerg Roedel, juri.lelli@redhat.com,
 Kent Overstreet, kinseyho@google.com, "Kirill A. Shutemov",
 lstoakes@gmail.com, mgorman@suse.de, mic@digikod.net,
 michael.christie@oracle.com, Ingo Molnar, mjguzik@gmail.com,
 "Michael S. Tsirkin", Nicholas Piggin, "Peter Zijlstra (Intel)",
 Petr Mladek, Rick P Edgecombe, Steven Rostedt, Suren Baghdasaryan,
 Thomas Gleixner, Uladzislau Rezki, vincent.guittot@linaro.org,
 vschneid@redhat.com
References: <20240311164638.2015063-1-pasha.tatashin@soleen.com>
 <20240311164638.2015063-12-pasha.tatashin@soleen.com>
 <3e180c07-53db-4acb-a75c-1a33447d81af@app.fastmail.com>
In-Reply-To: <3e180c07-53db-4acb-a75c-1a33447d81af@app.fastmail.com>

On Mon, Mar 11, 2024 at 6:17 PM Andy Lutomirski wrote:
>
> On Mon, Mar 11, 2024, at 9:46 AM, Pasha Tatashin wrote:
> > Add dynamic_stack_fault() calls to the kernel faults, and also declare
> > HAVE_ARCH_DYNAMIC_STACK = y, so that dynamic kernel stacks can be
> > enabled on x86 architecture.
> >
> > Signed-off-by: Pasha Tatashin
> > ---
> >  arch/x86/Kconfig        | 1 +
> >  arch/x86/kernel/traps.c | 3 +++
> >  arch/x86/mm/fault.c     | 3 +++
> >  3 files changed, 7 insertions(+)
> >
> > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > index 5edec175b9bf..9bb0da3110fa 100644
> > --- a/arch/x86/Kconfig
> > +++ b/arch/x86/Kconfig
> > @@ -197,6 +197,7 @@ config X86
> >      select HAVE_ARCH_USERFAULTFD_WP if X86_64 && USERFAULTFD
> >      select HAVE_ARCH_USERFAULTFD_MINOR if X86_64 && USERFAULTFD
> >      select HAVE_ARCH_VMAP_STACK if X86_64
> > +    select HAVE_ARCH_DYNAMIC_STACK if X86_64
> >      select HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET
> >      select HAVE_ARCH_WITHIN_STACK_FRAMES
> >      select HAVE_ASM_MODVERSIONS
> > diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
> > index c3b2f863acf0..cc05401e729f 100644
> > --- a/arch/x86/kernel/traps.c
> > +++ b/arch/x86/kernel/traps.c
> > @@ -413,6 +413,9 @@ DEFINE_IDTENTRY_DF(exc_double_fault)
> >      }
> >  #endif
> >
> > +    if (dynamic_stack_fault(current, address))
> > +        return;
> > +
>
> Sorry, but no, you can't necessarily do this. I say this as the person
> who wrote this code, and I justified my code on the basis that we are
> not recovering -- we're jumping out to a different context, and we
> won't crash if the origin context for the fault is corrupt. The SDM is
> really quite unambiguous about it: we're in an "abort" context, and
> returning is not allowed. And I think this may well be the real deal --
> the microcode does not promise to have the return frame and the actual
> faulting context matched up here, and there is no architectural
> guarantee that returning will do the right thing.
>
> Now we do have some history of getting a special exception, e.g. for
> espfix64.
> But espfix64 is a very special case, and the situation you're looking
> at is very general. So unless Intel and AMD are both willing to
> publicly document that it's okay to handle stack overflow, where any
> instruction in the ISA may have caused the overflow, like this, then
> we're not going to do it.

Hi Andy,

Thank you for the insightful feedback.

I'm somewhat confused about why we end up in exc_double_fault() in the
first place. My initial assumption was that dynamic_stack_fault() would
only be needed within do_kern_addr_fault(). However, while testing in
QEMU, I found that when using memset() on a stack variable, code like
this:

    rep stos %rax,%es:(%rdi)

causes a double fault instead of a regular fault. I added it to
exc_double_fault() as a result, but I'm curious if you have any
insights into why this behavior occurs.

> There are some other options: you could pre-map

Pre-mapping would be expensive. It would mean pre-mapping the dynamic
pages for every scheduled thread, and we'd still need to check the
access bit every time a thread leaves the CPU. Dynamic thread faults
should be considered rare events and thus shouldn't significantly
affect the performance of normal context switch operations. With 8K
stacks, we might encounter only 0.00001% of stacks requiring an extra
page, and even fewer needing 16K.

> Also, I think the whole memory allocation concept in this whole series
> is a bit odd. Fundamentally, we *can't* block on these stack faults --
> we may be in a context where blocking will deadlock. We may be in the
> page allocator. Panicking due to kernel stack allocation would be very
> unpleasant.

We never block while handling stack faults. There's a per-CPU page
pool, guaranteeing availability for the faulting thread. The thread
simply takes pages from this per-CPU data structure and refills the
pool when leaving the CPU. The faulting routine is efficient, requiring
a fixed number of loads without any locks, stalling, or even cmpxchg
operations.
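
To make the pool scheme above concrete, here is a minimal user-space
sketch. It is not the code from this series: the names
(stack_page_pool, pool_take, pool_refill), the CPU count, and the use
of malloc() in place of the kernel page allocator are all assumptions
made for illustration. The pool size of 3 follows the "up to 3 pages"
figure mentioned later in this message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define SKETCH_NR_CPUS    4     /* hypothetical CPU count */
#define SKETCH_POOL_PAGES 3     /* assumed pool depth: "up to 3 pages" */
#define SKETCH_PAGE_SIZE  4096

/* Hypothetical per-CPU pool; only the owning CPU touches its slot,
 * which is why no lock or atomic operation is needed. */
struct stack_page_pool {
    void *pages[SKETCH_POOL_PAGES];
    int nr;                     /* number of valid entries in pages[] */
};

static struct stack_page_pool pools[SKETCH_NR_CPUS];

/* Fault path: pop a pre-allocated page. A fixed number of loads and
 * stores, no allocation, no blocking. */
static void *pool_take(int cpu)
{
    struct stack_page_pool *p = &pools[cpu];

    if (p->nr == 0)
        return NULL;            /* pool empty: the fault cannot be served */
    return p->pages[--p->nr];
}

/* Schedule-out path: top the pool back up in a context where
 * allocating is allowed. malloc() stands in for a page allocator. */
static bool pool_refill(int cpu)
{
    struct stack_page_pool *p = &pools[cpu];

    while (p->nr < SKETCH_POOL_PAGES) {
        void *page = malloc(SKETCH_PAGE_SIZE);

        if (!page)
            return false;       /* caller decides what to do when short */
        p->pages[p->nr++] = page;
    }
    return true;
}
```

The point of the split is that pool_take() is the only part that runs
in the fault path, while pool_refill() runs later, when the thread
leaves the CPU.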

> But perhaps we could have a rule that a task can only be scheduled in
> if there is sufficient memory available for its stack.

Yes, I've considered this as well. We might implement this to avoid
crashes due to page faults. Basically, if the per-CPU pool cannot be
refilled, we'd prevent task scheduling until it is. At that point,
we're already so short on memory that the kernel can't allocate up to
3 pages.

Thank you,
Pasha

> And perhaps we could avoid ever page-faulting by filling in the PTEs
> for the potential stack pages but leaving them un-accessed. I *think*
> that all x86 implementations won't fill the TLB for a non-accessed
> page without also setting the accessed bit, so the performance hit of
> filling the PTEs, running the task, and then doing the appropriate
> synchronization to clear the PTEs and read the accessed bit on
> schedule-out to release the pages may not be too bad. But you would
> need to do this cautiously in the scheduler, possibly in the *next*
> task but before the prev task is actually released enough to be run
> on a different CPU. It's going to be messy.
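
The accessed-bit idea quoted above can be illustrated with a small
user-space model. This is only a sketch under stated assumptions: the
PTEs are plain integers, reclaim_untouched() and SKETCH_DYN_PAGES are
hypothetical names, and the TLB synchronization and scheduler ordering
that the quoted text calls messy are not modelled at all. The bit
positions are the x86 Present (bit 0) and Accessed (bit 5) bits.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PTE_PRESENT  (1ull << 0)   /* x86 PTE Present bit */
#define SKETCH_PTE_ACCESSED (1ull << 5)   /* x86 PTE Accessed bit */
#define SKETCH_DYN_PAGES    3             /* hypothetical dynamic pages per stack */

/*
 * Schedule-out step of the pre-mapping scheme: any page that was
 * mapped but never touched (Accessed clear) is unmapped so it can be
 * handed to the next task. Touched pages stay mapped. A real
 * implementation would also need the synchronization described in the
 * quoted text before trusting the Accessed bit.
 *
 * Returns the number of pages reclaimed.
 */
static int reclaim_untouched(uint64_t ptes[SKETCH_DYN_PAGES])
{
    int reclaimed = 0;

    for (int i = 0; i < SKETCH_DYN_PAGES; i++) {
        if ((ptes[i] & SKETCH_PTE_PRESENT) &&
            !(ptes[i] & SKETCH_PTE_ACCESSED)) {
            ptes[i] = 0;        /* clear the mapping */
            reclaimed++;
        }
    }
    return reclaimed;
}
```

The model only captures the bookkeeping (which pages come back); the
cost trade-off discussed in the thread depends on the parts it leaves
out.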