Re: [LSF/MM/BPF TOPIC] memory persistence over kexec

From: Alexander Graf <graf@amazon.com>
To: Pasha Tatashin <pasha.tatashin@soleen.com>,
	Jason Gunthorpe <jgg@ziepe.ca>
Cc: Mike Rapoport <rppt@kernel.org>,
	David Rientjes <rientjes@google.com>,
	<lsf-pc@lists.linux-foundation.org>,
	"Gowans, James" <jgowans@amazon.com>, <linux-mm@kvack.org>
Subject: Re: [LSF/MM/BPF TOPIC] memory persistence over kexec
Date: Sun, 26 Jan 2025 16:21:05 -0800
Message-ID: <54945e03-c437-48b4-b739-4e8ac822c1fc@amazon.com>
In-Reply-To: <CA+CK2bACZeQgsfH1Rbs9o53svTZCDe20rH_p=3A5eQQKgb0zrw@mail.gmail.com>

On 26.01.25 12:41, Pasha Tatashin wrote:
> On Sun, Jan 26, 2025 at 3:04 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>> On Sat, Jan 25, 2025 at 10:19:51AM -0500, Pasha Tatashin wrote:
>>
>>> One way to solve that is pre-reserving space for the KHO tree -
>>> ideally a reasonable amount, perhaps 32-64 MB and allocating it at
>>> kexec load time.
>> Why is there any weird limit?
> Setting a limit for KHO trees is similar to the limit we set for the
> scratch area; we can overrun both. It is just one simple way to ensure
> serialization is possible after kexec load, but there are obviously
> other ways to solve this problem.

The problem is not only with allocation. Kexec has two loading schemes:
user space based (kexec_load) and kernel based file loading
(kexec_file_load). In the latter, we can do whatever we like. In the
former, the flow expects user space to have ultimate control over
placement of the future data blobs and their contents.

I like the flexibility this allows for. It means that user space can 
inject its own KHO data for example if it wants to. Or modify it. It 
will come in very handy for debugging and testing later.
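
As a rough illustration of what "user space controls placement" means
(not code from this thread): in the user space scheme, the caller
describes every blob as a source buffer plus a destination it picked
itself. The sketch below assumes a simplified descriptor, demo_segment,
and a helper, demo_place_blob; both names are hypothetical, and the real
descriptor is defined in <linux/kexec.h>. No syscall is made here.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical mirror of the segment descriptor that user space hands to
 * kexec_load(2): a source buffer plus the physical destination chosen by
 * the caller. The real definition lives in <linux/kexec.h>.
 */
struct demo_segment {
	const void *buf;   /* source data in user space */
	size_t bufsz;      /* bytes to copy from buf */
	uint64_t mem;      /* physical destination, picked by user space */
	size_t memsz;      /* bytes reserved at mem (>= bufsz) */
};

/*
 * Fill one segment for a user-supplied blob (e.g. a hand-edited KHO DT).
 * memsz is rounded up to page_size, which must be a power of two, and
 * dest must be page aligned.
 * Returns 0 on success, -1 if the arguments are unusable.
 */
int demo_place_blob(struct demo_segment *seg, const void *blob, size_t len,
		    uint64_t dest, size_t page_size)
{
	if (!seg || !blob || !len || !page_size ||
	    (page_size & (page_size - 1)) || (dest & (page_size - 1)))
		return -1;

	seg->buf = blob;
	seg->bufsz = len;
	seg->mem = dest;
	seg->memsz = (len + page_size - 1) & ~(page_size - 1);
	return 0;
}
```

Because both the contents (buf) and the destination (mem) come from the
caller, a test harness could swap in a modified blob without any kernel
change, which is the debugging flexibility described above.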

>> We are preserving hundreds of GB of pages
>> backing the VM and more. There is endless memory being preserved across?
> There are other ways to do that, but even with this limit, I do not
> see this as an issue. The gigabytes of pages backing VMs would not be
> scattered as individual 4K pages; that's simply inefficient. The
> number of physical ranges is going to be small. If the preserved data
> is so large that it cannot fit into a reasonably sized tree, then I
> claim that the data should not be saved directly in the tree. Instead,
> it should have its own metadata that is pointed to from the tree.

Correct :). The way I think of the KHO DT is as a uniform way to 
implement setup_data across kexec that is identical across all 
architectures, enforces review and structure to ensure we keep 
compatibility and generalizes memory reservation.

The alternatives we have today are hacks like IMA:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/include/uapi/asm/setup_data.h#n73
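
To make the quoted point about physical ranges concrete, here is a
minimal sketch (an illustration, not anything proposed in this thread)
of why contiguous memory costs little metadata: a sorted list of page
frame numbers collapses into a handful of (start, length) ranges. The
names demo_range and demo_coalesce are hypothetical, and the input is
assumed to be strictly ascending.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical descriptor for one physically contiguous preserved range. */
struct demo_range {
	uint64_t start_pfn; /* first page frame number in the range */
	uint64_t nr_pages;  /* number of contiguous pages */
};

/*
 * Collapse a strictly ascending list of page frame numbers into
 * contiguous ranges.
 * Returns the number of ranges written, or (size_t)-1 if out[] is too small.
 */
size_t demo_coalesce(const uint64_t *pfns, size_t n,
		     struct demo_range *out, size_t max_out)
{
	size_t used = 0;

	for (size_t i = 0; i < n; i++) {
		/* Extend the current range when this pfn directly follows it. */
		if (used &&
		    out[used - 1].start_pfn + out[used - 1].nr_pages == pfns[i]) {
			out[used - 1].nr_pages++;
			continue;
		}
		if (used == max_out)
			return (size_t)-1;
		out[used].start_pfn = pfns[i];
		out[used].nr_pages = 1;
		used++;
	}
	return used;
}
```

The number of descriptors grows with fragmentation rather than with the
amount of memory, so a large array of such ranges would be a natural
candidate for a separate metadata blob that the tree merely points to.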

> Alternatively, we could allow allocating the FDT tree during kernel shutdown
> time. At that time there should be plenty of free memory as we already
> finished with userland. However, we have to be careful to allocate
> from memory that does not overlap the area where kernel segments and
> initramfs are going to be relocated.

Yes, this is easier said than done. In the user space driven kexec path, 
user space is in control of memory locations. At least after the first 
kexec iteration, these locations will overlap with the existing Linux 
runtime environment, because both lie in the scratch region. Only the 
purgatory moves everything to where it should be.
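
The overlap constraint itself is simple to state; the hard part is
knowing the destination ranges at all when user space chose them. As a
sketch under that assumption (the destinations are known and given as
half-open physical ranges), a late allocation would have to pass a check
like the one below. demo_phys_range, demo_overlaps and
demo_alloc_is_safe are hypothetical names, not existing kernel
interfaces.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical half-open physical range [start, end). */
struct demo_phys_range {
	uint64_t start;
	uint64_t end;
};

/* Two half-open ranges overlap iff each starts before the other ends. */
static bool demo_overlaps(struct demo_phys_range a, struct demo_phys_range b)
{
	return a.start < b.end && b.start < a.end;
}

/*
 * Return true if cand is safe, i.e. it intersects none of the nr
 * destination ranges that the loaded segments will be relocated to.
 */
bool demo_alloc_is_safe(struct demo_phys_range cand,
			const struct demo_phys_range *segs, size_t nr)
{
	for (size_t i = 0; i < nr; i++)
		if (demo_overlaps(cand, segs[i]))
			return false;
	return true;
}
```

A range that merely touches the end of a segment counts as safe here
because the ranges are half-open.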

Maybe we could create a special kexec memory type that means "KHO DT"?

Alex
