From: Pasha Tatashin <pasha.tatashin@soleen.com>
Date: Sun, 17 Mar 2024 12:15:21 -0400
Subject: Re: [RFC 00/14] Dynamic Kernel Stacks
To: Brian Gerst
Cc: "H. Peter Anvin", Matthew Wilcox, Kent Overstreet,
 linux-kernel@vger.kernel.org, linux-mm@kvack.org,
 akpm@linux-foundation.org, x86@kernel.org, bp@alien8.de,
 brauner@kernel.org, bristot@redhat.com, bsegall@google.com,
 dave.hansen@linux.intel.com, dianders@chromium.org,
 dietmar.eggemann@arm.com, eric.devolder@oracle.com, hca@linux.ibm.com,
 hch@infradead.org, jacob.jun.pan@linux.intel.com, jgg@ziepe.ca,
 jpoimboe@kernel.org, jroedel@suse.de, juri.lelli@redhat.com,
 kinseyho@google.com, kirill.shutemov@linux.intel.com,
 lstoakes@gmail.com, luto@kernel.org, mgorman@suse.de, mic@digikod.net,
 michael.christie@oracle.com, mingo@redhat.com, mjguzik@gmail.com,
 mst@redhat.com, npiggin@gmail.com, peterz@infradead.org,
 pmladek@suse.com, rick.p.edgecombe@intel.com, rostedt@goodmis.org,
 surenb@google.com, tglx@linutronix.de, urezki@gmail.com,
 vincent.guittot@linaro.org, vschneid@redhat.com

On Sun, Mar 17, 2024 at 10:43 AM Brian Gerst wrote:
>
> On Sat, Mar 16, 2024 at 3:18 PM Pasha Tatashin wrote:
> >
> > On Thu, Mar 14, 2024 at 11:40 PM H. Peter Anvin wrote:
> > >
> > > On March 14, 2024 8:13:56 PM PDT, Pasha Tatashin wrote:
> > > >On Thu, Mar 14, 2024 at 3:57 PM Matthew Wilcox wrote:
> > > >>
> > > >> On Thu, Mar 14, 2024 at 03:53:39PM -0400, Kent Overstreet wrote:
> > > >> > On Thu, Mar 14, 2024 at 07:43:06PM +0000, Matthew Wilcox wrote:
> > > >> > > On Tue, Mar 12, 2024 at 10:18:10AM -0700, H. Peter Anvin wrote:
> > > >> > > > Second, non-dynamic kernel memory is one of the core design
> > > >> > > > decisions in Linux from early on. This means there are a lot of
> > > >> > > > deeply embedded assumptions which would have to be untangled.
> > > >> > >
> > > >> > > I think there are other ways of getting the benefit that Pasha is
> > > >> > > seeking without moving to dynamically allocated kernel memory.
> > > >> > > One icky thing that XFS does is punt work over to a kernel thread
> > > >> > > in order to use more stack! That breaks a number of things
> > > >> > > including lockdep (because the kernel thread doesn't own the lock;
> > > >> > > the thread waiting for the kernel thread owns the lock).
> > > >> > >
> > > >> > > If we had segmented stacks, XFS could say "I need at least 6kB of
> > > >> > > stack", and if less than that was available, we could allocate a
> > > >> > > temporary stack and switch to it. I suspect Google would also be
> > > >> > > able to use this API for their rare cases when they need more than
> > > >> > > 8kB of kernel stack. Who knows, we might all be able to use such
> > > >> > > a thing.
> > > >> > >
> > > >> > > I'd been thinking about this from the point of view of allocating
> > > >> > > more stack elsewhere in kernel space, but combining what Pasha has
> > > >> > > done here with this idea might lead to a hybrid approach that
> > > >> > > works better: allocate 32kB of vmap space per kernel thread, put
> > > >> > > 12kB of memory at the top of it, rely on people using this "I need
> > > >> > > more stack" API correctly, and free the excess pages on return to
> > > >> > > userspace. No complicated "switch stacks" API needed, just an
> > > >> > > "ensure we have at least N bytes of stack remaining" API.
> > > >
> > > >I like this approach! I think we could also consider having permanent
> > > >big stacks for some kernel-only threads like kvm-vcpu. A cooperative
> > > >stack increase framework could work well and wouldn't negatively
> > > >impact the performance of context switching. However, thorough
> > > >analysis would be necessary to proactively identify potential stack
> > > >overflow situations.
> > > >
> > > >> > Why would we need an "I need more stack" API? Pasha's approach seems
> > > >> > like everything we need for what you're talking about.
> > > >>
> > > >> Because double faults are hard, possibly impossible, and the FRED
> > > >> approach Peter described has extra overhead? This was all described
> > > >> up-thread.
> > > >
> > > >Handling faults in #DF is possible. It requires code inspection to
> > > >handle race conditions such as what was shown by tglx. However, as
> > > >Andy pointed out, this is not supported by the SDM, as it is an abort
> > > >context (yet we return from it because of ESPFIX64, so return is
> > > >possible).
> > > >
> > > >My question, however, is this: if we ignore the memory savings and
> > > >only consider the reliability aspect of this feature, which is better?
> > > >Unconditionally crashing the machine because a guard page was reached,
> > > >or printing a huge warning with backtrace information about the
> > > >offending stack, handling the fault, and surviving? I know that
> > > >historically Linus preferred WARN() to BUG() [1]. But this is a
> > > >somewhat different scenario compared to simple BUG vs WARN.
> > > >
> > > >Pasha
> > > >
> > > >[1] https://lore.kernel.org/all/Pine.LNX.4.44.0209091832160.1714-100000@home.transmeta.com
> > >
> > > The real issue with using #DF is that if the event that caused it was
> > > asynchronous, you could lose the event.
> >
> > Got it. So, using a #DF handler for stack page faults isn't feasible.
> > I suppose the only way for this to work would be to use a dedicated
> > Interrupt Stack Table (IST) entry for page faults (#PF), but I suspect
> > that might introduce other complications.
> >
> > Expanding on Matthew's idea of an interface for dynamic kernel stack
> > sizes, here's what I'm thinking:
> >
> > - Kernel Threads: Create all kernel threads with a fully populated
> > THREAD_SIZE stack (i.e. 16K).
> > - User Threads: Create all user threads with a THREAD_SIZE kernel stack
> > but only the top page mapped (i.e. 4K).
> > - In enter_from_user_mode(): Expand the thread stack to 16K by mapping
> > three additional pages from the per-CPU stack cache. This function is
> > called early in kernel entry points.
> > - In exit_to_user_mode(): Unmap the extra three pages and return them to
> > the per-CPU cache. This function is called late in the kernel exit
> > path.
> >
> > Both of the above hooks are called with IRQs disabled on all kernel
> > entries, whether through interrupts or syscalls, and they are called
> > early/late enough that 4K is enough to handle the rest of entry/exit.

Hi Brian,

> This proposal will not have the memory savings that you are looking
> for, since sleeping tasks would still have a fully allocated stack.

The tasks that were descheduled while running in user mode should not
increase their stack. The potential saving is greater than in the
original proposal, because in the original proposal we never shrink
stacks after faults.
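For concreteness, the two hooks from the proposal quoted above would
look roughly like this. This is only a sketch: the dyn_stack_* helpers
and the per-CPU page cache are hypothetical, nothing with these names
exists today.

#include <linux/percpu.h>
#include <linux/sched.h>
#include <linux/sched/task_stack.h>

#define DYN_STACK_EXTRA_PAGES	3	/* 16K THREAD_SIZE, top 4K always mapped */

struct dyn_stack_cache {
	struct page *pages[2 * DYN_STACK_EXTRA_PAGES];	/* small per-CPU reserve */
	unsigned int nr;				/* pages currently cached */
};
static DEFINE_PER_CPU(struct dyn_stack_cache, dyn_stack_cache);

/* Hypothetical low-level helpers: (un)map pages in the vmapped stack area. */
void dyn_stack_map(struct task_struct *tsk, struct page **pages, int nr);
void dyn_stack_unmap(struct task_struct *tsk, struct page **pages, int nr);

/* Early in enter_from_user_mode(), IRQs disabled: grow the stack 4K -> 16K. */
static void dyn_stack_expand(struct task_struct *tsk)
{
	struct dyn_stack_cache *c = this_cpu_ptr(&dyn_stack_cache);

	/* The cache is refilled elsewhere; assume it holds enough pages here. */
	c->nr -= DYN_STACK_EXTRA_PAGES;
	dyn_stack_map(tsk, &c->pages[c->nr], DYN_STACK_EXTRA_PAGES);
}

/* Late in exit_to_user_mode(), IRQs disabled: shrink the stack 16K -> 4K. */
static void dyn_stack_shrink(struct task_struct *tsk)
{
	struct dyn_stack_cache *c = this_cpu_ptr(&dyn_stack_cache);

	dyn_stack_unmap(tsk, &c->pages[c->nr], DYN_STACK_EXTRA_PAGES);
	c->nr += DYN_STACK_EXTRA_PAGES;
}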
> This also would add extra overhead to each entry and exit (including
> syscalls) that can happen multiple times before a context switch. It
> also doesn't make much sense because a task running in user mode will
> quickly need those stack pages back when it returns to kernel mode.
> Even if it doesn't make a syscall, the timer interrupt will kick it
> out of user mode.
>
> What should happen is that the unused stack is reclaimed when a task
> goes to sleep. The kernel does not use a red zone, so any stack pages
> below the saved stack pointer of a sleeping task (task->thread.sp) can
> be safely discarded. Before context switching to a task, fully

Excellent observation, this makes Andy Lutomirski's per-map proposal [1]
usable without tracking dirty/accessed bits: more reliable, and also
platform independent.
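That reclaim could be as simple as the following sketch.
task_stack_page() and thread.sp exist today (thread.sp is the x86
saved stack pointer); dyn_stack_unmap_page() is a made-up stand-in for
returning one stack page to the per-CPU cache.

#include <linux/mm.h>
#include <linux/sched.h>
#include <linux/sched/task_stack.h>

/* Hypothetical: unmap one stack page and return it to the per-CPU cache. */
void dyn_stack_unmap_page(struct task_struct *tsk, unsigned long addr);

/* Discard the unused part of a sleeping (or about-to-sleep) task's stack. */
static void dyn_stack_reclaim(struct task_struct *prev)
{
	unsigned long stack = (unsigned long)task_stack_page(prev);
	/*
	 * Stacks grow down and the kernel uses no red zone, so every page
	 * entirely below the saved stack pointer is dead.
	 */
	unsigned long first_used = round_down(prev->thread.sp, PAGE_SIZE);
	unsigned long addr;

	for (addr = stack; addr < first_used; addr += PAGE_SIZE)
		dyn_stack_unmap_page(prev, addr);
}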
> populate its task stack. After context switching from a task, reclaim
> its unused stack. This way, the task stack in use is always fully
> allocated and we don't have to deal with page faults.
>
> To make this happen, __switch_to() would have to be split into two
> parts, to cleanly separate what happens before and after the stack
> switch. The first part saves processor context for the previous task,
> and prepares the next task.

By knowing the stack requirements of __switch_to(), can't we actually
do all that in the common code in context_switch(), right before
__switch_to()? We would do an arch-specific call to get the
__switch_to() stack requirement, and use that to change the value of
task->thread.sp so we know where the stack pointer is going to be while
the task sleeps. At that point we can unmap the stack pages of the
previous task and map the pages for the next task.
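I.e., something along these lines in the common code. Again only a
sketch: arch_switch_to_stack_usage() is an assumed new arch hook,
dyn_stack_reclaim()/dyn_stack_populate() are the hypothetical helpers
from the sketches above, and current_stack_pointer is x86-only.

#include <linux/sched.h>
#include <asm/asm.h>	/* current_stack_pointer on x86 */

unsigned long arch_switch_to_stack_usage(void);  /* hypothetical arch hook */
void dyn_stack_reclaim(struct task_struct *prev);  /* see sketch above */
void dyn_stack_populate(struct task_struct *next); /* fully map the stack */

/* Called from context_switch() right before switch_to(). */
static void dyn_stack_prepare_switch(struct task_struct *prev,
				     struct task_struct *next)
{
	/*
	 * Predict where prev's stack pointer will be saved: the current
	 * stack pointer minus whatever __switch_to() itself consumes.
	 */
	prev->thread.sp = current_stack_pointer - arch_switch_to_stack_usage();

	dyn_stack_reclaim(prev);	/* unmap prev's pages below thread.sp */
	dyn_stack_populate(next);	/* next's stack must be fully mapped */
}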
> Populating the next task's stack would
> happen here. Then it would return to the assembly code to do the
> stack switch. The second part then loads the context of the next
> task, and finalizes any work for the previous task. Reclaiming the
> unused stack pages of the previous task would happen here.

The problem with this (and with Andy's original approach) is that we
cannot sleep here. What happens if the per-CPU stack cache gets
exhausted because several threads sleep while having deep stacks? How
can we schedule the next task? This is probably a corner case, but it
needs proper handling. One solution is to pre-allocate three pages in
the per-CPU cache while we are still in schedule() with interrupts
enabled, before going into switch_to(). However, what if the
pre-allocation itself calls cond_resched() because it enters the page
allocator slowpath?

Other than the above concern, I concur: this approach looks to be the
best so far. I will think more about it.

Thank you,
Pasha

[1] https://lore.kernel.org/all/3e180c07-53db-4acb-a75c-1a33447d81af@app.fastmail.com