From: Pasha Tatashin <pasha.tatashin@soleen.com>
Date: Tue, 19 Mar 2024 10:56:16 -0400
Subject: Re: [RFC 00/14] Dynamic Kernel Stacks
To: Brian Gerst
Cc: "H.
Peter Anvin" , Matthew Wilcox , Kent Overstreet , linux-kernel@vger.kernel.org, linux-mm@kvack.org, akpm@linux-foundation.org, x86@kernel.org, bp@alien8.de, brauner@kernel.org, bristot@redhat.com, bsegall@google.com, dave.hansen@linux.intel.com, dianders@chromium.org, dietmar.eggemann@arm.com, eric.devolder@oracle.com, hca@linux.ibm.com, hch@infradead.org, jacob.jun.pan@linux.intel.com, jgg@ziepe.ca, jpoimboe@kernel.org, jroedel@suse.de, juri.lelli@redhat.com, kinseyho@google.com, kirill.shutemov@linux.intel.com, lstoakes@gmail.com, luto@kernel.org, mgorman@suse.de, mic@digikod.net, michael.christie@oracle.com, mingo@redhat.com, mjguzik@gmail.com, mst@redhat.com, npiggin@gmail.com, peterz@infradead.org, pmladek@suse.com, rick.p.edgecombe@intel.com, rostedt@goodmis.org, surenb@google.com, tglx@linutronix.de, urezki@gmail.com, vincent.guittot@linaro.org, vschneid@redhat.com Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Rspamd-Queue-Id: ECEB518001E X-Rspam-User: X-Rspamd-Server: rspam02 X-Stat-Signature: pf6wb8gaopagkdisokqfa87wrwj7jcxi X-HE-Tag: 1710860214-985518 X-HE-Meta: U2FsdGVkX18KHFEHpnVOOlhW6yL5tn7Nc95wU+Ci5gLbASGUJRKoTb5X+KcfAxGDTbk5Kj6zm2zjQ++OY+Q9Xx6cX64ZBEpyKO31jdS85wZClwTHNhYPtvztxzLQJMO3g0Sjte4Y4h9Ykqvo1anRcFpas/r0S1EJd6v0RwCXQm/x80ycEmMRST3K5E8XZl13z9hXXE4+v/GsRpyE3COQoc7uNdhoo4X5/BdKew+a0F3sVE0HTXt6Llyzht42n/8gY9gT7dmqC+YPywtkTig5qysZfO1dovDT9z1FWbXL0wS7cCRsEgqh3ML3wVE9IIf691/GH8IPFNoYuOkiXZaurwsyRm1Cla6tLFew0l+o5nj2uWyaSTjGRzpvBZmFs8PLNxkSXlyUnj9frNxoToyZzVt38G/vAd+lb1FyHUlUNgTmJfJoKYjsOvwNVIkh0h+A5Hk4SCa63pznDQSmg4375buQBE2k1lWa2pyY6WkHKprx3C/JBKllDaKW8lxmAnnLmWCswGvMDVn3Llg1PvSU0sITGh6k52uIF6/0I93HwA0FIh+NMWwQ4mhKJGjG87dHbxCbEd/awy5rHUjRWopZynshDqZyuWkHksG1p+G+r2t8qP+4WqeLa9lwp+2LnCxv+o3pjOx2B+EaPHYUBc3klD7lfvtZRXNwRuo9Q+pdl3EB+WTdR9kdGWhb4mBAW7ftj19pRaZMCNCDXomFYZb4tX2B48TVzPWKwNBUkNf0Fs42lkO9031zcvY96gxwAxaV21Hnt6xO+1zjLVPhv4IiAasCCNg80k1R76hRlQPtAVQFatVvdOMBJWyrMDMDn69s8Ns0/boclwDugZL0B1MXDJzczDNWfvcdhTqOftzgleqg4YCbo2SphVczxkV2YqD4CAx6EQ83rZFibzJEPGracCMq40osVoJOJeICbcVrFbssXtafus7uS+clW6X4lbftZnvvXYA2ZVijeICBa4G 0OqwEW1V akSSXvsMQCWA5hTY8PIDIPVWHrPW13pwjMTOggMQ6b9ULiPFzQsTzA/AqCMMM4NnvoCSHbLhQMwAT7+bqphbNNzLhJ6aJbdbAG/CAjjOcZDgaBrd/qHOgvpQ4qgZCxkh5pIbVqrd39SSXsH0Hl/onETx3rkxJ3xBejgJbYycFLxk5Uu4dlmYyNUkCx7hAUtGl0GU/Zt9fEpMgLScAmx7LEg7o3BnbfEhcYLOlAhyV54JnF6agTDOKf0wXz5ru6dlUNB1ntOMGuoyvFUUCnOyom/Do+aHVw6exwievputpHqV9XNdLANWoJnUwfPtQHwoWdCUZACpUpFupr2VsWMq7V5KRGh6YF+wvxbbGQnQCA5TIprXSRE+1zNOIsAB8mdxRjGwEl1cDRI5xMZar3aCz3d4hDD2TF1E/xrVDhoNqQ+OyAxP4EYxjNUqF+S9rHYeLafcUtQbetCGA9XibynYwG83PuPupIQT6MGd73fjLmhJsKmgfKhV9jsVfSxqS/a4fIiR1UWbHpPUQb9hQ1yHiNBGwZvrbbW2B0TLB5tlWaU8nEmw6blHbrlDgTwDbteY026yMY1GWMiH9AE1YwqUdD4fk+rokkD2TuRlE X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Mon, Mar 18, 2024 at 5:02=E2=80=AFPM Brian Gerst wro= te: > > On Mon, Mar 18, 2024 at 11:00=E2=80=AFAM Pasha Tatashin > wrote: > > > > On Sun, Mar 17, 2024 at 5:30=E2=80=AFPM Brian Gerst = wrote: > > > > > > On Sun, Mar 17, 2024 at 12:15=E2=80=AFPM Pasha Tatashin > > > wrote: > > > > > > > > On Sun, Mar 17, 2024 at 10:43=E2=80=AFAM Brian Gerst wrote: > > > > > > > > > > On Sat, Mar 16, 2024 at 3:18=E2=80=AFPM Pasha Tatashin > > > > > wrote: > > > > > > > > > > > > On Thu, Mar 14, 2024 at 11:40=E2=80=AFPM H. 
> > > > > > > On March 14, 2024 8:13:56 PM PDT, Pasha Tatashin wrote:
> > > > > > > >On Thu, Mar 14, 2024 at 3:57 PM Matthew Wilcox wrote:
> > > > > > > >>
> > > > > > > >> On Thu, Mar 14, 2024 at 03:53:39PM -0400, Kent Overstreet wrote:
> > > > > > > >> > On Thu, Mar 14, 2024 at 07:43:06PM +0000, Matthew Wilcox wrote:
> > > > > > > >> > > On Tue, Mar 12, 2024 at 10:18:10AM -0700, H. Peter Anvin wrote:
> > > > > > > >> > > > Second, non-dynamic kernel memory is one of the core design decisions in
> > > > > > > >> > > > Linux from early on. This means there are a lot of deeply embedded
> > > > > > > >> > > > assumptions which would have to be untangled.
> > > > > > > >> > >
> > > > > > > >> > > I think there are other ways of getting the benefit that Pasha is seeking
> > > > > > > >> > > without moving to dynamically allocated kernel memory. One icky thing
> > > > > > > >> > > that XFS does is punt work over to a kernel thread in order to use more
> > > > > > > >> > > stack! That breaks a number of things including lockdep (because the
> > > > > > > >> > > kernel thread doesn't own the lock, the thread waiting for the kernel
> > > > > > > >> > > thread owns the lock).
> > > > > > > >> > >
> > > > > > > >> > > If we had segmented stacks, XFS could say "I need at least 6kB of stack",
> > > > > > > >> > > and if less than that was available, we could allocate a temporary
> > > > > > > >> > > stack and switch to it. I suspect Google would also be able to use this
> > > > > > > >> > > API for their rare cases when they need more than 8kB of kernel stack.
> > > > > > > >> > > Who knows, we might all be able to use such a thing.
> > > > > > > >> > >
> > > > > > > >> > > I'd been thinking about this from the point of view of allocating more
> > > > > > > >> > > stack elsewhere in kernel space, but combining what Pasha has done here
> > > > > > > >> > > with this idea might lead to a hybrid approach that works better; allocate
> > > > > > > >> > > 32kB of vmap space per kernel thread, put 12kB of memory at the top of it,
> > > > > > > >> > > rely on people using this "I need more stack" API correctly, and free the
> > > > > > > >> > > excess pages on return to userspace. No complicated "switch stacks" API
> > > > > > > >> > > needed, just an "ensure we have at least N bytes of stack remaining" API.
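To make that idea concrete, here is a rough sketch of what such an
"ensure we have at least N bytes of stack remaining" API could look
like. Everything in it is hypothetical: kstack_grow() does not exist,
and this is only an illustration of the shape of the interface, not a
proposal of actual code:

/*
 * Sketch only: guarantee at least @bytes of stack headroom for the
 * current task, mapping more pages of the vmapped stack area if
 * needed. kstack_grow() is a hypothetical helper that would pull
 * pages from a reserve pool.
 */
static inline int ensure_stack(unsigned long bytes)
{
        unsigned long sp = current_stack_pointer;
        unsigned long low = (unsigned long)task_stack_page(current);

        if (likely(sp - low >= bytes))
                return 0;       /* enough headroom already mapped */

        return kstack_grow(current, bytes);
}

A caller like XFS would then do ensure_stack(6 * 1024) before entering
a deep call path, instead of punting the work to a kernel thread.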
> > > > > > > >
> > > > > > > >I like this approach! I think we could also consider having permanent
> > > > > > > >big stacks for some kernel-only threads like kvm-vcpu. A cooperative
> > > > > > > >stack increase framework could work well and wouldn't negatively
> > > > > > > >impact the performance of context switching. However, thorough
> > > > > > > >analysis would be necessary to proactively identify potential stack
> > > > > > > >overflow situations.
> > > > > > > >
> > > > > > > >> > Why would we need an "I need more stack" API? Pasha's approach seems
> > > > > > > >> > like everything we need for what you're talking about.
> > > > > > > >>
> > > > > > > >> Because double faults are hard, possibly impossible, and the FRED approach
> > > > > > > >> Peter described has extra overhead? This was all described up-thread.
> > > > > > > >
> > > > > > > >Handling faults in #DF is possible. It requires code inspection to
> > > > > > > >handle race conditions such as what was shown by tglx. However, as
> > > > > > > >Andy pointed out, this is not supported by the SDM, as it is an abort
> > > > > > > >context (yet we return from it because of ESPFIX64, so return is
> > > > > > > >possible).
> > > > > > > >
> > > > > > > >My question, however, is if we ignore the memory savings and only
> > > > > > > >consider the reliability aspect of this feature: what is better,
> > > > > > > >unconditionally crashing the machine because a guard page was reached,
> > > > > > > >or printing a huge warning with backtrace information about the
> > > > > > > >offending stack, handling the fault, and surviving? I know that
> > > > > > > >historically Linus preferred WARN() to BUG() [1]. But, this is a
> > > > > > > >somewhat different scenario compared to simple BUG vs WARN.
> > > > > > > >
> > > > > > > >Pasha
> > > > > > > >
> > > > > > > >[1] https://lore.kernel.org/all/Pine.LNX.4.44.0209091832160.1714-100000@home.transmeta.com
> > > > > > >
> > > > > > > The real issue with using #DF is that if the event that caused it was asynchronous, you could lose the event.
> > > > > >
> > > > > > Got it. So, using a #DF handler for stack page faults isn't feasible.
> > > > > > I suppose the only way for this to work would be to use a dedicated
> > > > > > Interrupt Stack Table (IST) entry for page faults (#PF), but I suspect
> > > > > > that might introduce other complications.
> > > > > >
> > > > > > Expanding on Matthew's idea of an interface for dynamic kernel stack
> > > > > > sizes, here's what I'm thinking:
> > > > > >
> > > > > > - Kernel threads: create all kernel threads with a fully populated
> > > > > >   THREAD_SIZE stack (i.e. 16K).
> > > > > > - User threads: create all user threads with a THREAD_SIZE kernel
> > > > > >   stack but only the top page mapped (i.e. 4K).
> > > > > > - In enter_from_user_mode(): expand the thread stack to 16K by mapping
> > > > > >   three additional pages from the per-CPU stack cache. This function is
> > > > > >   called early in kernel entry points.
> > > > > > - In exit_to_user_mode(): unmap the extra three pages and return them
> > > > > >   to the per-CPU cache. This function is called late in the kernel exit
> > > > > >   path.
> > > > > >
> > > > > > Both of the above hooks are called with IRQs disabled on all kernel
> > > > > > entries, whether through interrupts or syscalls, and they are called
> > > > > > early/late enough that 4K is enough to handle the rest of entry/exit.
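A rough sketch of those two hooks, to illustrate; the per-CPU cache
layout and the kstack_map_pages()/kstack_unmap_pages() helpers are
hypothetical, not existing kernel code:

#define DYNAMIC_STACK_PAGES     3       /* 16K stack, top 4K always mapped */

static DEFINE_PER_CPU(struct page *, stack_cache[DYNAMIC_STACK_PAGES]);

/* Early in kernel entry, IRQs disabled: map the three extra pages. */
static void dynamic_stack_charge(struct task_struct *tsk)
{
        kstack_map_pages(tsk, this_cpu_ptr(stack_cache),
                         DYNAMIC_STACK_PAGES);          /* hypothetical */
}

/* Late in kernel exit, IRQs disabled: return the pages to the cache. */
static void dynamic_stack_uncharge(struct task_struct *tsk)
{
        kstack_unmap_pages(tsk, this_cpu_ptr(stack_cache),
                           DYNAMIC_STACK_PAGES);        /* hypothetical */
}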
> > > > Hi Brian,
> > > >
> > > > > This proposal will not have the memory savings that you are looking
> > > > > for, since sleeping tasks would still have a fully allocated stack.
> > > >
> > > > The tasks that were descheduled while running in user mode should not
> > > > increase their stack. The potential saving is greater than the
> > > > original proposal, because in the original proposal we never shrink
> > > > stacks after faults.
> > >
> > > A task has to enter kernel mode in order to be rescheduled. If it
> > > doesn't make a syscall or hit an exception, then the timer interrupt
> > > will eventually kick it out of user mode. At some point schedule() is
> > > called, the task is put to sleep, and context is switched to the next
> > > task. A sleeping task will always be using some amount of kernel
> > > stack. How much depends a lot on what caused the task to sleep. If
> > > the timeslice expired, it could switch right before the return to user
> > > mode. A page fault could go deep into filesystem and device code,
> > > waiting on an I/O operation.
> > >
> > > > > This also would add extra overhead to each entry and exit (including
> > > > > syscalls) that can happen multiple times before a context switch. It
> > > > > also doesn't make much sense because a task running in user mode will
> > > > > quickly need those stack pages back when it returns to kernel mode.
> > > > > Even if it doesn't make a syscall, the timer interrupt will kick it
> > > > > out of user mode.
> > > > >
> > > > > What should happen is that the unused stack is reclaimed when a task
> > > > > goes to sleep. The kernel does not use a red zone, so any stack pages
> > > > > below the saved stack pointer of a sleeping task (task->thread.sp) can
> > > > > be safely discarded. Before context switching to a task, fully
> > > >
> > > > Excellent observation, this makes Andy Lutomirski's per-map proposal [1]
> > > > usable without tracking dirty/accessed bits. More reliable, and also
> > > > platform independent.
> > >
> > > This is x86-specific. Other architectures will likely have differences.
> > >
> > > > > populate its task stack. After context switching from a task, reclaim
> > > > > its unused stack. This way, the task stack in use is always fully
> > > > > allocated and we don't have to deal with page faults.
> > > > >
> > > > > To make this happen, __switch_to() would have to be split into two
> > > > > parts, to cleanly separate what happens before and after the stack
> > > > > switch. The first part saves processor context for the previous task,
> > > > > and prepares the next task.
> > > >
> > > > By knowing the stack requirements of __switch_to(), can't we actually
> > > > do all that in the common code in context_switch() right before
> > > > __switch_to()? We would do an arch-specific call to get the
> > > > __switch_to() stack requirement, and use that to change the value of
> > > > task->thread.sp to know where the stack is going to be while sleeping.
> > > > At this time we can do the unmapping of the stack pages from the
> > > > previous task, and the mapping of the pages to the next task.
> > >
> > > task->thread.sp is set in __switch_to_asm(), and is pretty much the
> > > last thing done in the context of the previous task. Trying to
> > > predict that value ahead of time is way too fragile.
> >
> > We don't require an exact value, but rather an approximate upper
> > limit. To illustrate: subtract 1K from the current .sp, then determine
> > the corresponding page to decide the number of pages needing
> > unmapping. The primary advantage is that we can avoid
> > platform-specific ifdefs for DYNAMIC_STACKS within the arch-specific
> > switch_to() function. Instead, each platform can provide an
> > appropriate upper bound for switch_to() operations. We know the amount
> > of information that is going to be stored on the stack by these
> > routines, and also, since interrupts are disabled, the stacks are not
> > used for anything else there, so I do not see a problem with
> > determining a reasonable upper bound.
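Concretely, the arithmetic I have in mind is something like the sketch
below; SWITCH_STACK_BOUND stands in for the hypothetical per-arch upper
bound, and the helper itself is illustrative, not actual code:

/* Per-arch upper bound on stack needed from here through __switch_to(). */
#define SWITCH_STACK_BOUND      SZ_1K

static unsigned long stack_pages_to_unmap(struct task_struct *prev)
{
        /* Runs before __switch_to(), so current == prev here. */
        unsigned long sp = current_stack_pointer - SWITCH_STACK_BOUND;
        unsigned long lowest_used = round_down(sp, PAGE_SIZE);
        unsigned long base = (unsigned long)task_stack_page(prev);

        /* Whole pages strictly below the estimated low-water mark. */
        return (lowest_used - base) / PAGE_SIZE;
}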
> The stack usage will vary depending on compiler version and
> optimization settings. Making an educated guess is possible, but may
> not be enough in the future.
>
> What would be nice is to get some actual data on stack usage under
> various workloads, both maximum depth and depth at context switch.
>
> > > Also, the key point I was trying to make is that you cannot safely
> > > shrink the active stack. It can only be done after the stack switch
> > > to the new task.
> >
> > Can you please elaborate why this is so? If the lowest pages are not
> > used, and interrupts are disabled, what is not safe about removing
> > them from the page table?
> >
> > I am not against the idea of unmapping in __switch_to(); I just want
> > to understand the reasons why a more generic, but perhaps not as
> > precise, approach would not work.
>
> As long as a wide buffer is given, it would probably be safe. But it
> would still be safer and more precise if done after the switch.

Makes sense. It looks like using task->thread.sp during the context
switch is not possible, because the pages might have been shared with
another CPU, so we would need to do an IPI TLB invalidation, which
would be too expensive for the context switch. Therefore, using the
PTE accessed bit is more reliable for determining which pages can be
unmapped. However, we could still use task->thread.sp in a garbage
collector.
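Roughly, such a garbage collector could look like the sketch below.
All of it is hypothetical: stack_release_unused() stands in for the
code that would unmap the pages and batch the TLB shootdown outside
the context-switch path:

static void kstack_gc(void)
{
        struct task_struct *g, *t;

        rcu_read_lock();
        for_each_process_thread(g, t) {
                unsigned long base = (unsigned long)task_stack_page(t);
                unsigned long low;

                /* Only parked stacks are safe to trim. */
                if (task_is_running(t))
                        continue;

                /* thread.sp was saved at the last context switch. */
                low = round_down(t->thread.sp - SZ_1K, PAGE_SIZE);
                if (low > base)
                        stack_release_unused(t, base, low); /* hypothetical */
        }
        rcu_read_unlock();
}

Pasha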