Re: [RFC PATCH v2 0/7] Introduce persistent memory pool

From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
To: Dave Hansen <dave.hansen@intel.com>
Cc: Baoquan He <bhe@redhat.com>,
	tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com,
	ebiederm@xmission.com, akpm@linux-foundation.org,
	stanislav.kinsburskii@gmail.com, corbet@lwn.net,
	linux-kernel@vger.kernel.org, kexec@lists.infradead.org,
	linux-mm@kvack.org, kys@microsoft.com, jgowans@amazon.com,
	wei.liu@kernel.org, arnd@arndb.de, gregkh@linuxfoundation.org,
	graf@amazon.de, pbonzini@redhat.com, "Shutemov,
	Kirill" <kirill.shutemov@intel.com>
Subject: Re: [RFC PATCH v2 0/7] Introduce persistent memory pool
Date: Wed, 27 Sep 2023 17:38:31 -0700
Message-ID: <20230928003831.GA20366@skinsburskii.>
In-Reply-To: <760bbb08-83b4-7bb1-822f-2ceba26278a6@intel.com>

On Thu, Sep 28, 2023 at 11:00:12AM -0700, Dave Hansen wrote:
> On 9/27/23 17:02, Stanislav Kinsburskii wrote:
> > On Thu, Sep 28, 2023 at 10:29:32AM -0700, Dave Hansen wrote:
> ...
> > Well, not exactly. That's something I'd like to have indeed, but from my
> > POV this goal is out of scope of discussion at the moment.
> > Let me try to express it the same way you did above:
> > 
> > 1. Boot some kernel
> > 2. Grow the deposited memory a bunch
> > 3. Kexec
> > 4. Kernel panic due to GPF upon accessing the memory deposited to
> > hypervisor.
> 
> I basically consider this a bug in the first kernel.  It *can't* kexec
> when it's left RAM in shambles.  It doesn't know what features the new
> kernel has and whether this is even safe.
> 

Could you elaborate more on why this is a bug in the first kernel?
Say, kernel memory can be allocated in big physically contiguous
chunks by the first kernel for depositing. The information about these
chunks is then passed to the second kernel via FDT or even the command
line, so the second kernel can reserve these regions during boot.
What's wrong with this approach?
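
To make the idea concrete, here is a rough user-space model in Python
(not kernel code, and not taken from the series). It assumes a made-up
"pmpool=" command-line parameter holding size@base entries; the
parameter name, its syntax and the helper names are all hypothetical.
The first kernel encodes the chunks it deposited, and the second kernel
parses them early in boot so it can reserve each one:

```python
# User-space model only. "pmpool=" and its size@base syntax are invented
# for illustration; they are not an existing kernel parameter.
from typing import List, Tuple

Region = Tuple[int, int]  # (base, size) in bytes

_SUFFIXES = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}


def parse_size(text: str) -> int:
    """Parse a number with an optional K/M/G suffix, e.g. '2M' or '0x1000'."""
    mult = 1
    if text and text[-1].upper() in _SUFFIXES:
        mult = _SUFFIXES[text[-1].upper()]
        text = text[:-1]
    # Base 0 lets int() accept both decimal and 0x-prefixed hex.
    return int(text, 0) * mult


def format_regions(regions: List[Region]) -> str:
    """First kernel side: encode the deposited chunks for the next kernel."""
    return "pmpool=" + ",".join(f"{size:#x}@{base:#x}" for base, size in regions)


def parse_regions(cmdline: str) -> List[Region]:
    """Second kernel side: recover the chunks to reserve during early boot."""
    regions = []
    for arg in cmdline.split():
        if not arg.startswith("pmpool="):
            continue
        for item in arg[len("pmpool="):].split(","):
            size, base = item.split("@")
            regions.append((parse_size(base), parse_size(size)))
    return regions
```

With FDT the transport would differ, but the contract would be the same:
a list of (base, size) pairs that the second kernel must not hand to its
page allocator.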

> Can the new kernel even read the new device tree data?
> 

I'm not sure I understand the question, to be honest.
Why can't it? This series contains code for both the first and the
second kernel.

> >> Can't the deposited memory just be shrunk before kexec?  Surely there
> >> aren't a bunch of pathological things consuming that memory right before
> >> kexec, which is basically a reboot.
> > 
> > In general it can. But for this to happen hypervisor needs to release
> > this memory. And it can release the memory iff the guests are stopped.
> > And stopping the guests during kexec isn't something we want to have in the
> > long run.
> > Also, even if we stop the guests before kexec, we need to restart them
> > after boot meaning we have to deposit the pages once again.
> > All this: stopping the guests, withdrawing the pages upon kexec,
> > allocating after boot and depositing them again significantly affect
> > guests downtime.
> 
> Ahh, and you're presumably kexec'ing in the first place because you've
> got a bug in the first kernel and you want a second kernel with fewer bugs.
> 

Right. All this is for "kernel servicing" purposes, when kexec is used
to update the kernel across a fleet in an attempt to reduce user
downtime as much as possible.
I'm sorry for keeping this bit of context to myself instead of
stating it explicitly in the series description: it wasn't intentional.

> I still think the only way this will possibly work when kexec'ing both
> old and new kernels is to do it with the memory maps that *all* kernels
> can read.
> 

Could you elaborate more on this?
The available memory map actually stays the same for both kernels. The
difference is in the list of memory regions to reserve: when the first
kernel has allocated and deposited another chunk, the second kernel
needs to reserve this memory as a new region upon booting.
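
As a rough illustration (a user-space Python model with hypothetical
names, not code from the series): both kernels start from the same
memory map and only the reserve list differs. Subtracting the reserve
list from the map gives what the second kernel may hand to its
allocator:

```python
# User-space model only; the function and type names are hypothetical.
from typing import List, Tuple

Range = Tuple[int, int]  # (start, end), end exclusive


def subtract_reserved(memory: List[Range], reserved: List[Range]) -> List[Range]:
    """Return the parts of `memory` not covered by any range in `reserved`."""
    free = []
    for start, end in sorted(memory):
        cursor = start
        for r_start, r_end in sorted(reserved):
            if r_end <= cursor or r_start >= end:
                continue  # no overlap with what is left of this range
            if r_start > cursor:
                free.append((cursor, r_start))
            cursor = max(cursor, r_end)
        if cursor < end:
            free.append((cursor, end))
    return free


# Same memory map for both kernels; the second one sees one extra chunk
# that the first kernel deposited before kexec.
memory_map = [(0x0, 0x1000), (0x2000, 0x8000)]
reserved_before = [(0x2000, 0x3000)]
reserved_after = reserved_before + [(0x5000, 0x6000)]
```

Calling subtract_reserved(memory_map, reserved_after) leaves a hole where
the newly deposited chunk sits, while memory_map itself is unchanged.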

Can all this be considered as, say, the first kernel using the device
tree to inform the second kernel about the memory regions to reserve?
In this case the first kernel behaves a bit like a piece of firmware
for the second one.

> Can the hypervisor be improved to make this release operation faster?

I guess it can, but shutting down guests contributes to downtime the
most. And without shutting down the guests the deposited memory can't be
withdrawn.

Thanks,
Stanislav
