References: <20240311164638.2015063-1-pasha.tatashin@soleen.com> <2cb8f02d-f21e-45d2-afe2-d1c6225240f3@zytor.com> <2qp4uegb4kqkryihqyo6v3fzoc2nysuhltc535kxnh6ozpo5ni@isilzw7nth42> <39F17EC4-7844-4111-BF7D-FFC97B05D9FA@zytor.com>
In-Reply-To: <39F17EC4-7844-4111-BF7D-FFC97B05D9FA@zytor.com>
From: Pasha Tatashin
Date: Sat, 16 Mar 2024 15:17:57 -0400
Subject: Re: [RFC 00/14] Dynamic Kernel Stacks
To: "H. Peter Anvin"
Cc: Matthew Wilcox, Kent Overstreet, linux-kernel@vger.kernel.org, linux-mm@kvack.org, akpm@linux-foundation.org, x86@kernel.org, bp@alien8.de, brauner@kernel.org, bristot@redhat.com, bsegall@google.com, dave.hansen@linux.intel.com, dianders@chromium.org, dietmar.eggemann@arm.com, eric.devolder@oracle.com, hca@linux.ibm.com, hch@infradead.org, jacob.jun.pan@linux.intel.com, jgg@ziepe.ca, jpoimboe@kernel.org, jroedel@suse.de, juri.lelli@redhat.com, kinseyho@google.com, kirill.shutemov@linux.intel.com, lstoakes@gmail.com, luto@kernel.org, mgorman@suse.de, mic@digikod.net, michael.christie@oracle.com, mingo@redhat.com, mjguzik@gmail.com, mst@redhat.com, npiggin@gmail.com, peterz@infradead.org, pmladek@suse.com, rick.p.edgecombe@intel.com, rostedt@goodmis.org, surenb@google.com, tglx@linutronix.de, urezki@gmail.com, vincent.guittot@linaro.org, vschneid@redhat.com

On Thu, Mar 14, 2024 at 11:40 PM H.
Peter Anvin wrote:
>
> On March 14, 2024 8:13:56 PM PDT, Pasha Tatashin wrote:
> >On Thu, Mar 14, 2024 at 3:57 PM Matthew Wilcox wrote:
> >>
> >> On Thu, Mar 14, 2024 at 03:53:39PM -0400, Kent Overstreet wrote:
> >> > On Thu, Mar 14, 2024 at 07:43:06PM +0000, Matthew Wilcox wrote:
> >> > > On Tue, Mar 12, 2024 at 10:18:10AM -0700, H. Peter Anvin wrote:
> >> > > > Second, non-dynamic kernel memory is one of the core design decisions in
> >> > > > Linux from early on. This means there are lot of deeply embedded assumptions
> >> > > > which would have to be untangled.
> >> > >
> >> > > I think there are other ways of getting the benefit that Pasha is seeking
> >> > > without moving to dynamically allocated kernel memory. One icky thing
> >> > > that XFS does is punt work over to a kernel thread in order to use more
> >> > > stack! That breaks a number of things including lockdep (because the
> >> > > kernel thread doesn't own the lock, the thread waiting for the kernel
> >> > > thread owns the lock).
> >> > >
> >> > > If we had segmented stacks, XFS could say "I need at least 6kB of stack",
> >> > > and if less than that was available, we could allocate a temporary
> >> > > stack and switch to it. I suspect Google would also be able to use this
> >> > > API for their rare cases when they need more than 8kB of kernel stack.
> >> > > Who knows, we might all be able to use such a thing.
> >> > >
> >> > > I'd been thinking about this from the point of view of allocating more
> >> > > stack elsewhere in kernel space, but combining what Pasha has done here
> >> > > with this idea might lead to a hybrid approach that works better; allocate
> >> > > 32kB of vmap space per kernel thread, put 12kB of memory at the top of it,
> >> > > rely on people using this "I need more stack" API correctly, and free the
> >> > > excess pages on return to userspace.
> >> > > No complicated "switch stacks" API
> >> > > needed, just an "ensure we have at least N bytes of stack remaining" API.
> >
> >I like this approach! I think we could also consider having permanent
> >big stacks for some kernel only threads like kvm-vcpu. A cooperative
> >stack increase framework could work well and wouldn't negatively
> >impact the performance of context switching. However, thorough
> >analysis would be necessary to proactively identify potential stack
> >overflow situations.
> >
> >> > Why would we need an "I need more stack" API? Pasha's approach seems
> >> > like everything we need for what you're talking about.
> >>
> >> Because double faults are hard, possibly impossible, and the FRED approach
> >> Peter described has extra overhead? This was all described up-thread.
> >
> >Handling faults in #DF is possible. It requires code inspection to
> >handle race conditions such as what was shown by tglx. However, as
> >Andy pointed out, this is not supported by SDM as it is an abort
> >context (yet we return from it because of ESPFIX64, so return is
> >possible).
> >
> >My question, however, if we ignore memory savings and only consider
> >reliability aspect of this feature. What is better unconditionally
> >crashing the machine because a guard page was reached, or printing a
> >huge warning with a backtracing information about the offending stack,
> >handling the fault, and survive? I know that historically Linus
> >preferred WARN() to BUG() [1]. But, this is a somewhat different
> >scenario compared to simple BUG vs WARN.
> >
> >Pasha
> >
> >[1] https://lore.kernel.org/all/Pine.LNX.4.44.0209091832160.1714-100000@home.transmeta.com
> >
>
> The real issue with using #DF is that if the event that caused it was asynchronous, you could lose the event.

Got it. So, using a #DF handler for stack page faults isn't feasible.
I suppose the only way for this to work would be to use a dedicated
Interrupt Stack Table (IST) entry for page faults (#PF), but I suspect
that might introduce other complications.

Expanding on Matthew's idea of an interface for dynamic kernel stack
sizes, here's what I'm thinking:

- Kernel Threads: Create all kernel threads with a fully populated
  THREAD_SIZE stack (i.e. 16K).
- User Threads: Create all user threads with a THREAD_SIZE kernel stack
  but only the top page mapped (i.e. 4K).
- enter_from_user_mode(): Expand the thread stack to 16K by mapping
  three additional pages from the per-CPU stack cache. This function is
  called early in kernel entry points.
- exit_to_user_mode(): Unmap the extra three pages and return them to
  the per-CPU cache. This function is called late in the kernel exit
  path.

Both of the above hooks are called with IRQs disabled on all kernel
entries, whether through interrupts or syscalls, and they are called
early/late enough that 4K is enough to handle the rest of entry/exit.

Pasha
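
For illustration only, the entry/exit scheme listed above can be modelled in plain user-space C. This is a rough sketch, not kernel code and not taken from the patch series: struct page, struct stack_cache, struct kstack and the sketch_*() helpers are hypothetical names, the per-CPU cache is assumed to hold exactly three spare pages, and "mapping" is reduced to storing a pointer. PAGE_SIZE is assumed to be 4K, so THREAD_SIZE is 16K as in the mail.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE      4096u
#define THREAD_SIZE    (4 * PAGE_SIZE)            /* 16K, as in the mail */
#define STACK_PAGES    (THREAD_SIZE / PAGE_SIZE)
#define EXTRA_PAGES    (STACK_PAGES - 1)          /* three pages mapped on entry */
#define CACHE_CAPACITY EXTRA_PAGES                /* assumption: cache holds three pages */

/* Hypothetical stand-in for a physical page; only its identity matters. */
struct page { int id; };

/* Hypothetical per-CPU cache of pre-allocated stack pages. */
struct stack_cache {
    struct page *slot[CACHE_CAPACITY];
    size_t count;
};

/* Hypothetical thread stack; index STACK_PAGES - 1 is the always-mapped top page. */
struct kstack {
    struct page *mapped[STACK_PAGES];
};

/* Number of stack bytes currently backed by a page. */
static size_t kstack_mapped_bytes(const struct kstack *s)
{
    size_t bytes = 0;

    for (size_t i = 0; i < STACK_PAGES; i++)
        if (s->mapped[i] != NULL)
            bytes += PAGE_SIZE;
    return bytes;
}

/* Models the entry hook: back the three lower pages from the per-CPU cache. */
static void sketch_enter_from_user_mode(struct kstack *s, struct stack_cache *c)
{
    for (size_t i = 0; i < EXTRA_PAGES; i++) {
        assert(s->mapped[i] == NULL);   /* only the top page is mapped in user mode */
        assert(c->count > 0);           /* the sketch assumes the cache never runs dry */
        s->mapped[i] = c->slot[--c->count];
    }
}

/* Models the exit hook: unmap the three lower pages and refill the cache. */
static void sketch_exit_to_user_mode(struct kstack *s, struct stack_cache *c)
{
    for (size_t i = 0; i < EXTRA_PAGES; i++) {
        assert(s->mapped[i] != NULL);
        assert(c->count < CACHE_CAPACITY);
        c->slot[c->count++] = s->mapped[i];
        s->mapped[i] = NULL;
    }
}
```

A user thread in this model starts with only mapped[3] set (4K backed), reaches 16K after sketch_enter_from_user_mode(), and drops back to 4K after sketch_exit_to_user_mode(), with the cache count returning to three. The model deliberately leaves out what makes the real problem hard: page-table updates, TLB flushing, refilling the cache, and what happens if a thread is scheduled out between entry and exit.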