From: Pasha Tatashin
Date: Thu, 14 Mar 2024 23:13:56 -0400
Subject: Re: [RFC 00/14] Dynamic Kernel Stacks
To: Matthew Wilcox
Cc: Kent Overstreet, "H. Peter Anvin", linux-kernel@vger.kernel.org,
 linux-mm@kvack.org, akpm@linux-foundation.org, x86@kernel.org, bp@alien8.de,
 brauner@kernel.org, bristot@redhat.com, bsegall@google.com,
 dave.hansen@linux.intel.com, dianders@chromium.org, dietmar.eggemann@arm.com,
 eric.devolder@oracle.com, hca@linux.ibm.com, hch@infradead.org,
 jacob.jun.pan@linux.intel.com, jgg@ziepe.ca, jpoimboe@kernel.org,
 jroedel@suse.de, juri.lelli@redhat.com, kinseyho@google.com,
 kirill.shutemov@linux.intel.com, lstoakes@gmail.com, luto@kernel.org,
 mgorman@suse.de, mic@digikod.net, michael.christie@oracle.com,
 mingo@redhat.com, mjguzik@gmail.com, mst@redhat.com, npiggin@gmail.com,
 peterz@infradead.org, pmladek@suse.com, rick.p.edgecombe@intel.com,
 rostedt@goodmis.org, surenb@google.com, tglx@linutronix.de, urezki@gmail.com,
 vincent.guittot@linaro.org, vschneid@redhat.com

On Thu, Mar 14, 2024 at 3:57 PM Matthew Wilcox wrote:
>
> On Thu, Mar 14, 2024 at 03:53:39PM -0400, Kent Overstreet wrote:
> > On Thu, Mar 14, 2024 at 07:43:06PM +0000, Matthew Wilcox wrote:
> > > On Tue, Mar 12, 2024 at 10:18:10AM -0700, H. Peter Anvin wrote:
> > > > Second, non-dynamic kernel memory is one of the core design decisions in
> > > > Linux from early on. This means there are lot of deeply embedded assumptions
> > > > which would have to be untangled.
> > >
> > > I think there are other ways of getting the benefit that Pasha is seeking
> > > without moving to dynamically allocated kernel memory.  One icky thing
> > > that XFS does is punt work over to a kernel thread in order to use more
> > > stack!  That breaks a number of things including lockdep (because the
> > > kernel thread doesn't own the lock, the thread waiting for the kernel
> > > thread owns the lock).
> > >
> > > If we had segmented stacks, XFS could say "I need at least 6kB of stack",
> > > and if less than that was available, we could allocate a temporary
> > > stack and switch to it.  I suspect Google would also be able to use this
> > > API for their rare cases when they need more than 8kB of kernel stack.
> > > Who knows, we might all be able to use such a thing.
> > >
> > > I'd been thinking about this from the point of view of allocating more
> > > stack elsewhere in kernel space, but combining what Pasha has done here
> > > with this idea might lead to a hybrid approach that works better; allocate
> > > 32kB of vmap space per kernel thread, put 12kB of memory at the top of it,
> > > rely on people using this "I need more stack" API correctly, and free the
> > > excess pages on return to userspace.  No complicated "switch stacks" API
> > > needed, just an "ensure we have at least N bytes of stack remaining" API.

I like this approach! I think we could also consider having permanent
big stacks for some kernel-only threads like kvm-vcpu. A cooperative
stack increase framework could work well and wouldn't negatively
impact the performance of context switching. However, thorough
analysis would be necessary to proactively identify potential stack
overflow situations.

> > Why would we need an "I need more stack" API? Pasha's approach seems
> > like everything we need for what you're talking about.
>
> Because double faults are hard, possibly impossible, and the FRED approach
> Peter described has extra overhead?  This was all described up-thread.

Handling faults in #DF is possible. It requires code inspection to
handle race conditions such as what was shown by tglx.
However, as Andy pointed out, this is not supported by the SDM, as #DF
is an abort context (yet we already return from it because of ESPFIX64,
so returning is possible).

My question, however, is this: if we ignore the memory savings and
consider only the reliability aspect of this feature, which is better:
unconditionally crashing the machine because a guard page was reached,
or printing a huge warning with backtrace information about the
offending stack, handling the fault, and surviving? I know that
historically Linus preferred WARN() to BUG() [1]. But this is a
somewhat different scenario compared to a simple BUG vs. WARN choice.

Pasha

[1] https://lore.kernel.org/all/Pine.LNX.4.44.0209091832160.1714-100000@home.transmeta.com
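
P.S. To make the cooperative "ensure we have at least N bytes of stack
remaining" idea quoted above a bit more concrete, here is a minimal
userspace model of the bookkeeping only. It is a sketch, not kernel code
and not part of the RFC: struct stack_model, stack_ensure() and
stack_trim() are hypothetical names invented for illustration, the 32kB
and 12kB figures are the ones from the quoted proposal, and the 4kB page
size is an assumption. No real pages are mapped or freed.

```c
#include <stdbool.h>
#include <stddef.h>

#define PAGE_SIZE   4096u           /* assumed page size */
#define VMAP_SIZE   (32u * 1024u)   /* virtual range reserved per thread */
#define INIT_BACKED (12u * 1024u)   /* memory initially placed at the top */

/* Hypothetical model of one thread's stack, growing down from the top. */
struct stack_model {
	size_t backed;  /* bytes of real memory at the top of the range */
	size_t used;    /* bytes currently consumed by the call chain */
};

/*
 * Hypothetical "I need more stack" call: make sure at least `need` bytes
 * are backed below the current stack position. Returns false if the
 * request would run past the reserved virtual range.
 */
bool stack_ensure(struct stack_model *s, size_t need)
{
	size_t want = s->used + need;

	if (want > VMAP_SIZE)
		return false;
	if (want > s->backed)   /* back more memory, rounded up to whole pages */
		s->backed = (want + PAGE_SIZE - 1) / PAGE_SIZE * PAGE_SIZE;
	return true;
}

/* Hypothetical hook for return to userspace: drop the excess pages. */
void stack_trim(struct stack_model *s)
{
	s->used = 0;
	s->backed = INIT_BACKED;
}
```

In this model a caller that is 10kB deep and asks for 6kB more ends up
with 16kB backed, a request that would exceed the 32kB range is refused,
and stack_trim() returns the thread to the initial 12kB. The hard part
discussed in this thread, finding every call site that has to make the
request, is exactly what the model does not capture.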