Message-ID: <54945e03-c437-48b4-b739-4e8ac822c1fc@amazon.com>
Date: Sun, 26 Jan 2025 16:21:05 -0800
Subject: Re: [LSF/MM/BPF TOPIC] memory persistence over kexec
From: Alexander Graf
To: Pasha Tatashin, Jason Gunthorpe
CC: Mike Rapoport, David Rientjes, "Gowans, James"
References: <20250120141427.GK674319@ziepe.ca> <20250126200404.GA1103620@ziepe.ca>

On 26.01.25 12:41, Pasha Tatashin wrote:
> On Sun, Jan 26, 2025 at 3:04 PM Jason Gunthorpe wrote:
>> On Sat, Jan 25, 2025 at 10:19:51AM -0500, Pasha Tatashin wrote:
>>
>>> One way to solve that is pre-reserving space for the KHO tree -
>>> ideally a reasonable amount, perhaps 32-64 MB and allocating it at
>>> kexec load time.
>> Why is there any weird limit?
> Setting a limit for KHO trees is similar to the limit we set for the
> scratch area; we can overrun both. It is just one simple way to ensure
> serialization is possible after kexec load, but there are obviously
> other ways to solve this problem.

The problem is not only with allocation. Kexec has two schemes: user
space loading and kernel-based file loading. In the latter, we can do
whatever we like. In the former, the flow expects user space to have
ultimate control over the placement of the future data blobs and their
contents.

I like the flexibility this allows for. It means that user space can
inject its own KHO data if it wants to, for example. Or modify it. It
will come in very handy for debugging and testing later.

>> We are preserving hundreds of GB of pages
>> backing the VM and more. There is endless memory being preserved across?
> There are other ways to do that, but even with this limit, I do not
> see this as an issue. The gigabytes of pages backing VMs would not be
> scattered as individual 4K pages; that's simply inefficient. The
> number of physical ranges is going to be small. If the preserved data
> is so large that it cannot fit into a reasonably sized tree, then I
> claim that the data should not be saved directly in the tree. Instead,
> it should have its own metadata that is pointed to from the tree.

Correct :). The way I think of the KHO DT is as a uniform way to
implement setup_data across kexec that is identical across all
architectures, enforces review and structure to ensure we keep
compatibility, and generalizes memory reservation.
The alternatives we have today are hacks like IMA:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/include/uapi/asm/setup_data.h#n73

> Alternatively, we could allow allocating the FDT tree during kernel
> shutdown time. At that time there should be plenty of free memory as
> we already finished with userland. However, we have to be careful to
> allocate from memory that does not overlap the area where kernel
> segments and initramfs are going to be relocated.

Yes, this is easier said than done. In the user space driven kexec
path, user space is in control of memory locations. At least after the
first kexec iteration, these locations will overlap with the existing
Linux runtime environment, because both lie in the scratch region. Only
the purgatory moves everything to where it should be.

Maybe we could create a special kexec memory type that means "KHO DT"?

Alex