From: Brian Gerst <brgerst@gmail.com>
Date: Sun, 17 Mar 2024 10:43:07 -0400
Subject: Re: [RFC 00/14] Dynamic Kernel Stacks
To: Pasha Tatashin
Cc: "H. Peter Anvin", Matthew Wilcox, Kent Overstreet,
 linux-kernel@vger.kernel.org, linux-mm@kvack.org,
 akpm@linux-foundation.org, x86@kernel.org, bp@alien8.de,
 brauner@kernel.org, bristot@redhat.com, bsegall@google.com,
 dave.hansen@linux.intel.com, dianders@chromium.org,
 dietmar.eggemann@arm.com, eric.devolder@oracle.com, hca@linux.ibm.com,
 hch@infradead.org, jacob.jun.pan@linux.intel.com, jgg@ziepe.ca,
 jpoimboe@kernel.org, jroedel@suse.de, juri.lelli@redhat.com,
 kinseyho@google.com, kirill.shutemov@linux.intel.com,
 lstoakes@gmail.com, luto@kernel.org, mgorman@suse.de, mic@digikod.net,
 michael.christie@oracle.com, mingo@redhat.com, mjguzik@gmail.com,
 mst@redhat.com, npiggin@gmail.com, peterz@infradead.org,
 pmladek@suse.com, rick.p.edgecombe@intel.com, rostedt@goodmis.org,
 surenb@google.com, tglx@linutronix.de, urezki@gmail.com,
 vincent.guittot@linaro.org, vschneid@redhat.com

On Sat, Mar 16, 2024 at 3:18 PM Pasha Tatashin wrote:
>
> On Thu, Mar 14, 2024 at 11:40 PM H.
Peter Anvin wrote:
> >
> > On March 14, 2024 8:13:56 PM PDT, Pasha Tatashin wrote:
> > >On Thu, Mar 14, 2024 at 3:57 PM Matthew Wilcox wrote:
> > >>
> > >> On Thu, Mar 14, 2024 at 03:53:39PM -0400, Kent Overstreet wrote:
> > >> > On Thu, Mar 14, 2024 at 07:43:06PM +0000, Matthew Wilcox wrote:
> > >> > > On Tue, Mar 12, 2024 at 10:18:10AM -0700, H. Peter Anvin wrote:
> > >> > > > Second, non-dynamic kernel memory is one of the core design
> > >> > > > decisions in Linux from early on. This means there are a lot
> > >> > > > of deeply embedded assumptions which would have to be
> > >> > > > untangled.
> > >> > >
> > >> > > I think there are other ways of getting the benefit that Pasha
> > >> > > is seeking without moving to dynamically allocated kernel
> > >> > > memory. One icky thing that XFS does is punt work over to a
> > >> > > kernel thread in order to use more stack! That breaks a number
> > >> > > of things including lockdep (because the kernel thread doesn't
> > >> > > own the lock, the thread waiting for the kernel thread owns the
> > >> > > lock).
> > >> > >
> > >> > > If we had segmented stacks, XFS could say "I need at least 6kB
> > >> > > of stack", and if less than that was available, we could
> > >> > > allocate a temporary stack and switch to it. I suspect Google
> > >> > > would also be able to use this API for their rare cases when
> > >> > > they need more than 8kB of kernel stack. Who knows, we might
> > >> > > all be able to use such a thing.
> > >> > >
> > >> > > I'd been thinking about this from the point of view of
> > >> > > allocating more stack elsewhere in kernel space, but combining
> > >> > > what Pasha has done here with this idea might lead to a hybrid
> > >> > > approach that works better; allocate 32kB of vmap space per
> > >> > > kernel thread, put 12kB of memory at the top of it, rely on
> > >> > > people using this "I need more stack" API correctly, and free
> > >> > > the excess pages on return to userspace. No complicated
> > >> > > "switch stacks" API needed, just an "ensure we have at least N
> > >> > > bytes of stack remaining" API.
> > >
> > >I like this approach! I think we could also consider having permanent
> > >big stacks for some kernel-only threads like kvm-vcpu. A cooperative
> > >stack increase framework could work well and wouldn't negatively
> > >impact the performance of context switching. However, thorough
> > >analysis would be necessary to proactively identify potential stack
> > >overflow situations.
> > >
> > >> > Why would we need an "I need more stack" API? Pasha's approach
> > >> > seems like everything we need for what you're talking about.
> > >>
> > >> Because double faults are hard, possibly impossible, and the FRED
> > >> approach Peter described has extra overhead? This was all described
> > >> up-thread.
> > >
> > >Handling faults in #DF is possible. It requires code inspection to
> > >handle race conditions such as what was shown by tglx. However, as
> > >Andy pointed out, this is not supported by the SDM, as it is an abort
> > >context (yet we return from it because of ESPFIX64, so return is
> > >possible).
> > >
> > >My question, however, is about the reliability aspect of this feature
> > >if we ignore the memory savings. Which is better: unconditionally
> > >crashing the machine because a guard page was reached, or printing a
> > >huge warning with backtrace information about the offending stack,
> > >handling the fault, and surviving? I know that historically Linus
> > >preferred WARN() to BUG() [1]. But this is a somewhat different
> > >scenario compared to simple BUG vs WARN.
> > >
> > >Pasha
> > >
> > >[1] https://lore.kernel.org/all/Pine.LNX.4.44.0209091832160.1714-100000@home.transmeta.com
> >
> > The real issue with using #DF is that if the event that caused it was
> > asynchronous, you could lose the event.
>
> Got it. So, using a #DF handler for stack page faults isn't feasible.
> I suppose the only way for this to work would be to use a dedicated
> Interrupt Stack Table (IST) entry for page faults (#PF), but I suspect
> that might introduce other complications.
>
> Expanding on Matthew's idea of an interface for dynamic kernel stack
> sizes, here's what I'm thinking:
>
> - Kernel threads: create all kernel threads with a fully populated
>   THREAD_SIZE stack (i.e. 16K).
> - User threads: create all user threads with a THREAD_SIZE kernel
>   stack but only the top page mapped (i.e. 4K).
> - In enter_from_user_mode(): expand the thread stack to 16K by mapping
>   three additional pages from the per-CPU stack cache. This function
>   is called early in kernel entry points.
> - In exit_to_user_mode(): unmap the extra three pages and return them
>   to the per-CPU cache. This function is called late in the kernel
>   exit path.
>
> Both of the above hooks are called with IRQs disabled on all kernel
> entries, whether through interrupts or syscalls, and they are called
> early/late enough that 4K is enough to handle the rest of entry/exit.

This proposal will not have the memory savings that you are looking
for, since sleeping tasks would still have a fully allocated stack.

It would also add extra overhead to each entry and exit (including
syscalls), which can happen multiple times before a context switch. It
also doesn't make much sense, because a task running in user mode will
quickly need those stack pages back when it returns to kernel mode.
Even if it doesn't make a syscall, the timer interrupt will kick it
out of user mode.

What should happen is that the unused stack is reclaimed when a task
goes to sleep. The kernel does not use a red zone, so any stack pages
below the saved stack pointer of a sleeping task (task->thread.sp) can
be safely discarded. Before context switching to a task, fully
populate its task stack. After context switching from a task, reclaim
its unused stack.
This way, the task stack in use is always fully allocated, and we
don't have to deal with page faults.

To make this happen, __switch_to() would have to be split into two
parts, to cleanly separate what happens before and after the stack
switch. The first part saves processor context for the previous task
and prepares the next task; populating the next task's stack would
happen here. Then it would return to the assembly code to do the
actual stack switch. The second part loads the context of the next
task and finalizes any work for the previous task; reclaiming the
unused stack pages of the previous task would happen here.

Brian Gerst