From: Brian Gerst
Date: Sun, 17 Mar 2024 17:30:30 -0400
Subject: Re: [RFC 00/14] Dynamic Kernel Stacks
To: Pasha Tatashin
Cc: "H. Peter Anvin", Matthew Wilcox, Kent Overstreet,
 linux-kernel@vger.kernel.org, linux-mm@kvack.org, akpm@linux-foundation.org,
 x86@kernel.org, bp@alien8.de, brauner@kernel.org, bristot@redhat.com,
 bsegall@google.com, dave.hansen@linux.intel.com, dianders@chromium.org,
 dietmar.eggemann@arm.com, eric.devolder@oracle.com, hca@linux.ibm.com,
 hch@infradead.org, jacob.jun.pan@linux.intel.com, jgg@ziepe.ca,
 jpoimboe@kernel.org, jroedel@suse.de, juri.lelli@redhat.com,
 kinseyho@google.com, kirill.shutemov@linux.intel.com, lstoakes@gmail.com,
 luto@kernel.org, mgorman@suse.de, mic@digikod.net,
 michael.christie@oracle.com, mingo@redhat.com, mjguzik@gmail.com,
 mst@redhat.com, npiggin@gmail.com, peterz@infradead.org, pmladek@suse.com,
 rick.p.edgecombe@intel.com, rostedt@goodmis.org, surenb@google.com,
 tglx@linutronix.de, urezki@gmail.com, vincent.guittot@linaro.org,
 vschneid@redhat.com
References: <20240311164638.2015063-1-pasha.tatashin@soleen.com>
 <2cb8f02d-f21e-45d2-afe2-d1c6225240f3@zytor.com>
 <2qp4uegb4kqkryihqyo6v3fzoc2nysuhltc535kxnh6ozpo5ni@isilzw7nth42>
 <39F17EC4-7844-4111-BF7D-FFC97B05D9FA@zytor.com>

On Sun, Mar 17, 2024 at 12:15 PM Pasha Tatashin wrote:
>
> On Sun, Mar 17, 2024 at 10:43 AM Brian Gerst
> wrote:
> >
> > On Sat, Mar 16, 2024 at 3:18 PM Pasha Tatashin wrote:
> > >
> > > On Thu, Mar 14, 2024 at 11:40 PM H. Peter Anvin wrote:
> > > >
> > > > On March 14, 2024 8:13:56 PM PDT, Pasha Tatashin wrote:
> > > > >On Thu, Mar 14, 2024 at 3:57 PM Matthew Wilcox wrote:
> > > > >>
> > > > >> On Thu, Mar 14, 2024 at 03:53:39PM -0400, Kent Overstreet wrote:
> > > > >> > On Thu, Mar 14, 2024 at 07:43:06PM +0000, Matthew Wilcox wrote:
> > > > >> > > On Tue, Mar 12, 2024 at 10:18:10AM -0700, H. Peter Anvin wrote:
> > > > >> > > > Second, non-dynamic kernel memory is one of the core design decisions in
> > > > >> > > > Linux from early on. This means there are a lot of deeply embedded assumptions
> > > > >> > > > which would have to be untangled.
> > > > >> > >
> > > > >> > > I think there are other ways of getting the benefit that Pasha is seeking
> > > > >> > > without moving to dynamically allocated kernel memory. One icky thing
> > > > >> > > that XFS does is punt work over to a kernel thread in order to use more
> > > > >> > > stack! That breaks a number of things including lockdep (because the
> > > > >> > > kernel thread doesn't own the lock, the thread waiting for the kernel
> > > > >> > > thread owns the lock).
> > > > >> > >
> > > > >> > > If we had segmented stacks, XFS could say "I need at least 6kB of stack",
> > > > >> > > and if less than that was available, we could allocate a temporary
> > > > >> > > stack and switch to it. I suspect Google would also be able to use this
> > > > >> > > API for their rare cases when they need more than 8kB of kernel stack.
> > > > >> > > Who knows, we might all be able to use such a thing.
> > > > >> > >
> > > > >> > > I'd been thinking about this from the point of view of allocating more
> > > > >> > > stack elsewhere in kernel space, but combining what Pasha has done here
> > > > >> > > with this idea might lead to a hybrid approach that works better; allocate
> > > > >> > > 32kB of vmap space per kernel thread, put 12kB of memory at the top of it,
> > > > >> > > rely on people using this "I need more stack" API correctly, and free the
> > > > >> > > excess pages on return to userspace. No complicated "switch stacks" API
> > > > >> > > needed, just an "ensure we have at least N bytes of stack remaining" API.
> > > > >
> > > > >I like this approach! I think we could also consider having permanent
> > > > >big stacks for some kernel-only threads like kvm-vcpu. A cooperative
> > > > >stack increase framework could work well and wouldn't negatively
> > > > >impact the performance of context switching. However, thorough
> > > > >analysis would be necessary to proactively identify potential stack
> > > > >overflow situations.
> > > > >
> > > > >> > Why would we need an "I need more stack" API? Pasha's approach seems
> > > > >> > like everything we need for what you're talking about.
> > > > >>
> > > > >> Because double faults are hard, possibly impossible, and the FRED approach
> > > > >> Peter described has extra overhead? This was all described up-thread.
> > > > >
> > > > >Handling faults in #DF is possible. It requires code inspection to
> > > > >handle race conditions such as what was shown by tglx. However, as
> > > > >Andy pointed out, this is not supported by the SDM as it is an abort
> > > > >context (yet we return from it because of ESPFIX64, so return is
> > > > >possible).
> > > > >
> > > > >My question, however, if we ignore memory savings and only consider
> > > > >the reliability aspect of this feature.
> > > > >What is better: unconditionally
> > > > >crashing the machine because a guard page was reached, or printing a
> > > > >huge warning with backtrace information about the offending stack,
> > > > >handling the fault, and surviving? I know that historically Linus
> > > > >preferred WARN() to BUG() [1]. But, this is a somewhat different
> > > > >scenario compared to simple BUG vs WARN.
> > > > >
> > > > >Pasha
> > > > >
> > > > >[1] https://lore.kernel.org/all/Pine.LNX.4.44.0209091832160.1714-100000@home.transmeta.com
> > > > >
> > > >
> > > > The real issue with using #DF is that if the event that caused it was asynchronous, you could lose the event.
> > >
> > > Got it. So, using a #DF handler for stack page faults isn't feasible.
> > > I suppose the only way for this to work would be to use a dedicated
> > > Interrupt Stack Table (IST) entry for page faults (#PF), but I suspect
> > > that might introduce other complications.
> > >
> > > Expanding on Matthew's idea of an interface for dynamic kernel stack
> > > sizes, here's what I'm thinking:
> > >
> > > - Kernel Threads: Create all kernel threads with a fully populated
> > > THREAD_SIZE stack. (i.e. 16K)
> > > - User Threads: Create all user threads with THREAD_SIZE kernel stack
> > > but only the top page mapped. (i.e. 4K)
> > > - In enter_from_user_mode(): Expand the thread stack to 16K by mapping
> > > three additional pages from the per-CPU stack cache. This function is
> > > called early in kernel entry points.
> > > - exit_to_user_mode(): Unmap the extra three pages and return them to
> > > the per-CPU cache. This function is called late in the kernel exit
> > > path.
> > >
> > > Both of the above hooks are called with IRQ disabled on all kernel
> > > entries whether through interrupts or syscalls, and they are called
> > > early/late enough that 4K is enough to handle the rest of entry/exit.
>
> Hi Brian,
>
> > This proposal will not have the memory savings that you are looking
> > for, since sleeping tasks would still have a fully allocated stack.
>
> The tasks that were descheduled while running in user mode should not
> increase their stack. The potential saving is greater than the
> original proposal, because in the original proposal we never shrink
> stacks after faults.

A task has to enter kernel mode in order to be rescheduled. If it
doesn't make a syscall or hit an exception, then the timer interrupt
will eventually kick it out of user mode. At some point schedule() is
called, the task is put to sleep and context is switched to the next
task. A sleeping task will always be using some amount of kernel
stack. How much depends a lot on what caused the task to sleep. If
the timeslice expired it could switch right before the return to user
mode. A page fault could go deep into filesystem and device code
waiting on an I/O operation.

> > This also would add extra overhead to each entry and exit (including
> > syscalls) that can happen multiple times before a context switch. It
> > also doesn't make much sense because a task running in user mode will
> > quickly need those stack pages back when it returns to kernel mode.
> > Even if it doesn't make a syscall, the timer interrupt will kick it
> > out of user mode.
> >
> > What should happen is that the unused stack is reclaimed when a task
> > goes to sleep. The kernel does not use a red zone, so any stack pages
> > below the saved stack pointer of a sleeping task (task->thread.sp) can
> > be safely discarded. Before context switching to a task, fully
>
> Excellent observation, this makes Andy Lutomirski's per-map proposal [1]
> usable without tracking dirty/accessed bits. More reliable, and also
> platform independent.

This is x86-specific. Other architectures will likely have differences.

> > populate its task stack. After context switching from a task, reclaim
> > its unused stack.
> > This way, the task stack in use is always fully
> > allocated and we don't have to deal with page faults.
> >
> > To make this happen, __switch_to() would have to be split into two
> > parts, to cleanly separate what happens before and after the stack
> > switch. The first part saves processor context for the previous task,
> > and prepares the next task.
>
> By knowing the stack requirements of __switch_to(), can't we actually
> do all that in the common code in context_switch() right before
> __switch_to()? We would do an arch-specific call to get the
> __switch_to() stack requirement, and use that to change the value of
> task->thread.sp to know where the stack is going to be while sleeping.
> At this time we can do the unmapping of the stack pages from the
> previous task, and mapping the pages to the next task.

task->thread.sp is set in __switch_to_asm(), and is pretty much the
last thing done in the context of the previous task. Trying to
predict that value ahead of time is way too fragile. Also, the key
point I was trying to make is that you cannot safely shrink the active
stack. It can only be done after the stack switch to the new task.

> > Populating the next task's stack would
> > happen here. Then it would return to the assembly code to do the
> > stack switch. The second part then loads the context of the next
> > task, and finalizes any work for the previous task. Reclaiming the
> > unused stack pages of the previous task would happen here.
>
> The problem with this (and Andy's original approach) is that we
> cannot sleep here. What happens if the per-cpu stack cache gets
> exhausted because several threads sleep while having deep stacks? How
> can we schedule the next task? This is probably a corner case, but it
> needs to have a proper handling solution. One solution is that, while in
> schedule() and while interrupts are still enabled before going to
> switch_to(), we must pre-allocate three pages in the per-cpu cache.
However, what > if the pre-allocation itself calls cond_resched() because it enters > page allocator slowpath? You would have to keep extra pages in reserve for allocation failures. mempool could probably help with that. Brian Gerst