* Re: [PATCH v3 00/14] KVM: s390: pv: implement lazy destroy
       [not found] <20210804154046.88552-1-imbrenda@linux.ibm.com>
@ 2021-08-06  7:10 ` David Hildenbrand
  2021-08-06  9:30   ` Claudio Imbrenda
  0 siblings, 1 reply; 5+ messages in thread
From: David Hildenbrand @ 2021-08-06  7:10 UTC (permalink / raw)
  To: Claudio Imbrenda, kvm
  Cc: cohuck, borntraeger, frankja, thuth, pasic, linux-s390,
	linux-kernel, Ulrich.Weigand, linux-mm, Michal Hocko

On 04.08.21 17:40, Claudio Imbrenda wrote:
> Previously, when a protected VM was rebooted or when it was shut down,
> its memory was made unprotected, and then the protected VM itself was
> destroyed. Looping over the whole address space can take some time,
> considering the overhead of the various Ultravisor Calls (UVCs). This
> means that a reboot or a shutdown would take a potentially long amount
> of time, depending on the amount of used memory.
> 
> This patch series implements a deferred destroy mechanism for protected
> guests. When a protected guest is destroyed, its memory is cleared in
> the background, allowing the guest to restart or terminate significantly
> faster than before.
> 
> There are 2 possibilities when a protected VM is torn down:
> * it still has an address space associated (reboot case)
> * it does not have an address space anymore (shutdown case)
> 
> For the reboot case, the reference count of the mm is increased, and
> then a background thread is started to clean up. Once the thread went
> through the whole address space, the protected VM is actually
> destroyed.

That doesn't sound too hacky to me; it actually sounds like a good
idea, doing what the guest would do anyway but speeding it up
asynchronously. But ...

> 
> For the shutdown case, a list of pages to be destroyed is formed when
> the mm is torn down. Instead of just unmapping the pages when the
> address space is being torn down, they are also set aside. Later when
> KVM cleans up the VM, a thread is started to clean up the pages from
> the list.

... this ...

> 
> This means that the same address space can have memory belonging to
> more than one protected guest, although only one will be running; the
> others will in fact not even have any CPUs.

... this ...
> 
> When a guest is destroyed, its memory still counts towards its memory
> control group until it's actually freed (I tested this experimentally)
> 
> When the system runs out of memory, if a guest has terminated and its
> memory is being cleaned asynchronously, the OOM killer will wait a
> little and then see if memory has been freed. This has the practical
> effect of slowing down memory allocations when the system is out of
> memory to give the cleanup thread time to cleanup and free memory, and
> avoid an actual OOM situation.

... and this sounds like the kind of arch MM hacks that will bite us in
the long run. Of course, I might be wrong, but already doing excessive
GFP_ATOMIC allocations or messing with the OOM killer that way for a
pure (shutdown) optimization is an alarm signal.

You should at least CC linux-mm. I'll do that right now and also CC
Michal. He might have time to take a quick look at patches #11 and #13.

https://lkml.kernel.org/r/20210804154046.88552-12-imbrenda@linux.ibm.com
https://lkml.kernel.org/r/20210804154046.88552-14-imbrenda@linux.ibm.com

IMHO, we should proceed with patches 1-10, as they solve a really
important problem ("slow reboots") in a nice way, whereas patch 11
handles a case that can be worked around comparatively easily by
management tools -- my 2 cents.
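
For readers skimming the archive, the reboot flow described in the cover
letter boils down to roughly the sketch below. Only mmget()/mmput() and
kthread_run() are real kernel APIs here; the pv_* structure and helpers
are hypothetical placeholders, not the code from the series:

#include <linux/err.h>
#include <linux/kthread.h>
#include <linux/sched/mm.h>
#include <linux/slab.h>

struct pv_deferred_destroy {            /* hypothetical tracking structure */
        struct mm_struct *mm;
        unsigned long handle;           /* secure-configuration handle */
};

/* Hypothetical helpers standing in for the UVC-based teardown: */
void pv_destroy_address_space(struct mm_struct *mm);
void pv_destroy_secure_config(unsigned long handle);

static int pv_destroy_worker(void *data)
{
        struct pv_deferred_destroy *d = data;

        /* The slow part: destroy every secure page in the address space. */
        pv_destroy_address_space(d->mm);
        /* Only now can the protected VM itself actually be destroyed. */
        pv_destroy_secure_config(d->handle);

        mmput(d->mm);   /* drop the reference taken in pv_defer_destroy() */
        kfree(d);
        return 0;
}

static int pv_defer_destroy(struct mm_struct *mm, unsigned long handle)
{
        struct pv_deferred_destroy *d = kzalloc(sizeof(*d), GFP_KERNEL);
        struct task_struct *thread;

        if (!d)
                return -ENOMEM;
        d->mm = mm;
        d->handle = handle;
        mmget(mm);      /* keep the address space alive for the worker */

        thread = kthread_run(pv_destroy_worker, d, "pv_destroy");
        if (IS_ERR(thread)) {
                mmput(mm);
                kfree(d);
                return PTR_ERR(thread);
        }
        return 0;
}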

-- 
Thanks,

David / dhildenb




* Re: [PATCH v3 00/14] KVM: s390: pv: implement lazy destroy
  2021-08-06  7:10 ` [PATCH v3 00/14] KVM: s390: pv: implement lazy destroy David Hildenbrand
@ 2021-08-06  9:30   ` Claudio Imbrenda
  2021-08-06 11:30     ` David Hildenbrand
  0 siblings, 1 reply; 5+ messages in thread
From: Claudio Imbrenda @ 2021-08-06  9:30 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: kvm, cohuck, borntraeger, frankja, thuth, pasic, linux-s390,
	linux-kernel, Ulrich.Weigand, linux-mm, Michal Hocko

On Fri, 6 Aug 2021 09:10:28 +0200
David Hildenbrand <david@redhat.com> wrote:

> On 04.08.21 17:40, Claudio Imbrenda wrote:
> > Previously, when a protected VM was rebooted or when it was shut
> > down, its memory was made unprotected, and then the protected VM
> > itself was destroyed. Looping over the whole address space can take
> > some time, considering the overhead of the various Ultravisor Calls
> > (UVCs). This means that a reboot or a shutdown would take a
> > potentially long amount of time, depending on the amount of used
> > memory.
> > 
> > This patchseries implements a deferred destroy mechanism for
> > protected guests. When a protected guest is destroyed, its memory
> > is cleared in background, allowing the guest to restart or
> > terminate significantly faster than before.
> > 
> > There are 2 possibilities when a protected VM is torn down:
> > * it still has an address space associated (reboot case)
> > * it does not have an address space anymore (shutdown case)
> > 
> > For the reboot case, the reference count of the mm is increased, and
> > then a background thread is started to clean up. Once the thread
> > went through the whole address space, the protected VM is actually
> > destroyed.  
> 
> That doesn't sound too hacky to me, and actually sounds like a good 
> idea, doing what the guest would do either way but speeding it up 
> asynchronously, but ...
> 
> > 
> > For the shutdown case, a list of pages to be destroyed is formed
> > when the mm is torn down. Instead of just unmapping the pages when
> > the address space is being torn down, they are also set aside.
> > Later when KVM cleans up the VM, a thread is started to clean up
> > the pages from the list.  
> 
> ... this ...
> 
> > 
> > This means that the same address space can have memory belonging to
> > more than one protected guest, although only one will be running,
> > the others will in fact not even have any CPUs.  
> 
> ... this ...

This ^ is exactly the reboot case.

> > When a guest is destroyed, its memory still counts towards its
> > memory control group until it's actually freed (I tested this
> > experimentally)
> > 
> > When the system runs out of memory, if a guest has terminated and
> > its memory is being cleaned asynchronously, the OOM killer will
> > wait a little and then see if memory has been freed. This has the
> > practical effect of slowing down memory allocations when the system
> > is out of memory to give the cleanup thread time to cleanup and
> > free memory, and avoid an actual OOM situation.  
> 
> ... and this sound like the kind of arch MM hacks that will bite us
> in the long run. Of course, I might be wrong, but already doing
> excessive GFP_ATOMIC allocations or messing with the OOM killer that

They are GFP_ATOMIC, but they should not put too much pressure on
memory and they can also fail without consequences; I used:

GFP_ATOMIC | __GFP_NOMEMALLOC | __GFP_NOWARN

Also note that after every page allocation a page gets freed, so this
is only temporary.
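
Spelled out, that allocation policy looks like this (a sketch; the
function name is hypothetical):

#include <linux/gfp.h>

static struct page *pv_alloc_tracking_page(void)
{
        /*
         * GFP_ATOMIC:       the caller cannot sleep (presumably because
         *                   page-table locks are held on the teardown path)
         * __GFP_NOMEMALLOC: never dip into the emergency reserves; if
         *                   memory is that tight, just fail
         * __GFP_NOWARN:     failure is expected and handled, don't log it
         */
        return alloc_page(GFP_ATOMIC | __GFP_NOMEMALLOC | __GFP_NOWARN);
}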

I would not call it "messing with the OOM killer"; I'm using the same
interface used by virtio-balloon.
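
For readers unfamiliar with that interface: the hookup virtio-balloon
uses looks roughly like the sketch below. register_oom_notifier() and
the notifier callback signature are the real API; the callback body and
the pv_* helper are hypothetical. Before the OOM killer shoots anything,
it calls the chain, and if the callbacks report freed pages via *freed,
the kill is skipped and the allocation is retried.

#include <linux/init.h>
#include <linux/notifier.h>
#include <linux/oom.h>

/* Hypothetical: how many pages the background cleanup freed meanwhile. */
unsigned long pv_cleanup_wait_for_progress(void);

static int pv_cleanup_oom_notify(struct notifier_block *nb,
                                 unsigned long unused, void *parm)
{
        unsigned long *freed = parm;

        /*
         * Give the cleanup thread a moment and report how many pages it
         * released; if *freed ends up non-zero, the OOM killer backs off
         * and the allocation is retried instead.
         */
        *freed += pv_cleanup_wait_for_progress();
        return NOTIFY_OK;
}

static struct notifier_block pv_cleanup_oom_nb = {
        .notifier_call = pv_cleanup_oom_notify,
};

static int __init pv_cleanup_oom_init(void)
{
        return register_oom_notifier(&pv_cleanup_oom_nb);
}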

> way for a pure (shutdown) optimization is an alarm signal. Of course,
> I might be wrong.
> 
> You should at least CC linux-mm. I'll do that right now and also CC 
> Michal. He might have time to have a quick glimpse at patch #11 and
> #13.
> 
> https://lkml.kernel.org/r/20210804154046.88552-12-imbrenda@linux.ibm.com
> https://lkml.kernel.org/r/20210804154046.88552-14-imbrenda@linux.ibm.com
> 
> IMHO, we should proceed with patch 1-10, as they solve a really 
> important problem ("slow reboots") in a nice way, whereby patch 11 
> handles a case that can be worked around comparatively easily by 
> management tools -- my 2 cents.

How would management tools work around the issue that a shutdown can
take a very long time?

Also, without my patches, the shutdown case would use export instead of
destroy, making it even slower.




* Re: [PATCH v3 00/14] KVM: s390: pv: implement lazy destroy
  2021-08-06  9:30   ` Claudio Imbrenda
@ 2021-08-06 11:30     ` David Hildenbrand
  2021-08-06 13:44       ` Claudio Imbrenda
  0 siblings, 1 reply; 5+ messages in thread
From: David Hildenbrand @ 2021-08-06 11:30 UTC (permalink / raw)
  To: Claudio Imbrenda
  Cc: kvm, cohuck, borntraeger, frankja, thuth, pasic, linux-s390,
	linux-kernel, Ulrich.Weigand, linux-mm, Michal Hocko

>>> This means that the same address space can have memory belonging to
>>> more than one protected guest, although only one will be running,
>>> the others will in fact not even have any CPUs.
>>
>> ... this ...
> 
> this ^ is exactly the reboot case.

Ah, right, we can have more than one protected guest per process, so
it's all handled within the same process.

> 
>>> When a guest is destroyed, its memory still counts towards its
>>> memory control group until it's actually freed (I tested this
>>> experimentally)
>>>
>>> When the system runs out of memory, if a guest has terminated and
>>> its memory is being cleaned asynchronously, the OOM killer will
>>> wait a little and then see if memory has been freed. This has the
>>> practical effect of slowing down memory allocations when the system
>>> is out of memory to give the cleanup thread time to cleanup and
>>> free memory, and avoid an actual OOM situation.
>>
>> ... and this sound like the kind of arch MM hacks that will bite us
>> in the long run. Of course, I might be wrong, but already doing
>> excessive GFP_ATOMIC allocations or messing with the OOM killer that
> 
> they are GFP_ATOMIC but they should not put too much weight on the
> memory and can also fail without consequences, I used:
> 
> GFP_ATOMIC | __GFP_NOMEMALLOC | __GFP_NOWARN
> 
> also notice that after every page allocation a page gets freed, so this
> is only temporary.

Correct me if I'm wrong: you're allocating unmovable pages for tracking 
(e.g., ZONE_DMA, ZONE_NORMAL) from atomic reserves and will free a 
movable process page, correct? Or which page will you be freeing?

> 
> I would not call it "messing with the OOM killer", I'm using the same
> interface used by virtio-baloon

Right, and for virtio-balloon it's actually a workaround to restore the 
original behavior of a rarely used feature: deflate-on-oom. Commit 
da10329cb057 ("virtio-balloon: switch back to OOM handler for 
VIRTIO_BALLOON_F_DEFLATE_ON_OOM") tried to document why we switched back 
from a shrinker to VIRTIO_BALLOON_F_DEFLATE_ON_OOM:

"The name "deflate on OOM" makes it pretty clear when deflation should
  happen - after other approaches to reclaim memory failed, not while
  reclaiming. This allows to minimize the footprint of a guest - memory
  will only be taken out of the balloon when really needed."

Note some subtle differences:

a) IIRC, before running into the OOM killer, the kernel will try
    reclaiming anything else first. This is what we want for
    deflate-on-oom; it might not be what you want for your feature
    (e.g., flushing other processes/VMs to disk/swap instead of waiting
    for a single process to stop).

b) Migration of movable balloon inflated pages continues working because
    we are dealing with non-lru page migration.

Will page reclaim, page migration, compaction, ... of these movable LRU
pages still continue working while they are sitting around waiting to be
cleaned up? I can see that we're grabbing an extra reference when we put
them onto the list, which might be a problem: for example, we most
certainly cannot swap out these pages or write them back to disk under
memory pressure.

> 
>> way for a pure (shutdown) optimization is an alarm signal. Of course,
>> I might be wrong.
>>
>> You should at least CC linux-mm. I'll do that right now and also CC
>> Michal. He might have time to have a quick glimpse at patch #11 and
>> #13.
>>
>> https://lkml.kernel.org/r/20210804154046.88552-12-imbrenda@linux.ibm.com
>> https://lkml.kernel.org/r/20210804154046.88552-14-imbrenda@linux.ibm.com
>>
>> IMHO, we should proceed with patch 1-10, as they solve a really
>> important problem ("slow reboots") in a nice way, whereby patch 11
>> handles a case that can be worked around comparatively easily by
>> management tools -- my 2 cents.
> 
> how would management tools work around the issue that a shutdown can
> take very long?

The traditional approach is to wait before starting a new VM until
memory has been freed up, or to start it on another hypervisor instead.
That raises the question of the target use case.

What I don't get is why we have to pay the price for freeing up that
memory. Why isn't it sufficient to keep the process running and let
ordinary MM do its thing?

Maybe you should clearly spell out what the target use case for the fast
shutdown (fast quitting of the process?) is. I assume it is starting a
new VM / process / whatsoever on the same host immediately, and then

a) Eventually slowing down other processes due to heavy reclaim.
b) Slowing down the new process because you have to pay the price of
cleaning up memory.

I think I am missing why we need the lazy destroy at all when killing a
process. Couldn't you instead teach the OOM killer "hey, we're currently
quitting a heavy process that is just *very* slow to free up memory,
please wait for that before you start shooting around"?

-- 
Thanks,

David / dhildenb




* Re: [PATCH v3 00/14] KVM: s390: pv: implement lazy destroy
  2021-08-06 11:30     ` David Hildenbrand
@ 2021-08-06 13:44       ` Claudio Imbrenda
  2021-08-09  8:50         ` David Hildenbrand
  0 siblings, 1 reply; 5+ messages in thread
From: Claudio Imbrenda @ 2021-08-06 13:44 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: kvm, cohuck, borntraeger, frankja, thuth, pasic, linux-s390,
	linux-kernel, Ulrich.Weigand, linux-mm, Michal Hocko

On Fri, 6 Aug 2021 13:30:21 +0200
David Hildenbrand <david@redhat.com> wrote:

[...]

> >>> When the system runs out of memory, if a guest has terminated and
> >>> its memory is being cleaned asynchronously, the OOM killer will
> >>> wait a little and then see if memory has been freed. This has the
> >>> practical effect of slowing down memory allocations when the
> >>> system is out of memory to give the cleanup thread time to
> >>> cleanup and free memory, and avoid an actual OOM situation.  
> >>
> >> ... and this sound like the kind of arch MM hacks that will bite us
> >> in the long run. Of course, I might be wrong, but already doing
> >> excessive GFP_ATOMIC allocations or messing with the OOM killer
> >> that  
> > 
> > they are GFP_ATOMIC but they should not put too much weight on the
> > memory and can also fail without consequences, I used:
> > 
> > GFP_ATOMIC | __GFP_NOMEMALLOC | __GFP_NOWARN
> > 
> > also notice that after every page allocation a page gets freed, so
> > this is only temporary.  
> 
> Correct me if I'm wrong: you're allocating unmovable pages for
> tracking (e.g., ZONE_DMA, ZONE_NORMAL) from atomic reserves and will
> free a movable process page, correct? Or which page will you be
> freeing?

We are transforming ALL moveable pages belonging to userspace into
unmoveable pages. Every ~500 pages, one page actually gets
allocated (unmoveable), and another (moveable) one gets freed.
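
To make the arithmetic concrete: one 4 KiB tracking page holds a small
header plus roughly 500 page pointers ((4096 - header) / 8), and every
page that is set aside keeps an extra reference so it stays allocated
until the cleanup thread gets to it (which is also why it can no longer
be swapped out, as noted above). A hedged sketch of that mechanism, with
hypothetical names, not the actual patch code:

#include <linux/list.h>
#include <linux/mm.h>

/* One tracking page: a small header followed by ~500 page pointers. */
struct pv_leftover_pages {
        struct list_head list;          /* chain of tracking pages */
        unsigned int count;             /* used slots in pages[] */
        struct page *pages[];
};

#define PV_LEFTOVERS_PER_PAGE \
        ((PAGE_SIZE - sizeof(struct pv_leftover_pages)) / sizeof(struct page *))

static bool pv_set_page_aside(struct page *page, struct pv_leftover_pages *lp)
{
        if (lp->count >= PV_LEFTOVERS_PER_PAGE)
                return false;   /* caller allocates a fresh tracking page */

        /*
         * The extra reference keeps the page allocated (and effectively
         * unmovable) until the cleanup thread destroys and releases it.
         */
        get_page(page);
        lp->pages[lp->count++] = page;
        return true;
}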

> > 
> > I would not call it "messing with the OOM killer", I'm using the
> > same interface used by virtio-baloon  
> 
> Right, and for virtio-balloon it's actually a workaround to restore
> the original behavior of a rarely used feature: deflate-on-oom.
> Commit da10329cb057 ("virtio-balloon: switch back to OOM handler for 
> VIRTIO_BALLOON_F_DEFLATE_ON_OOM") tried to document why we switched
> back from a shrinker to VIRTIO_BALLOON_F_DEFLATE_ON_OOM:
> 
> "The name "deflate on OOM" makes it pretty clear when deflation should
>   happen - after other approaches to reclaim memory failed, not while
>   reclaiming. This allows to minimize the footprint of a guest -
> memory will only be taken out of the balloon when really needed."
> 
> Note some subtle differences:
> 
> a) IIRC, before running into the OOM killer, will try reclaiming
>     anything  else. This is what we want for deflate-on-oom, it might
> not be what you want for your feature (e.g., flushing other
> processes/VMs to disk/swap instead of waiting for a single process to
> stop).

We are already reclaiming the memory of the dead secure guest.

> b) Migration of movable balloon inflated pages continues working
> because we are dealing with non-lru page migration.
> 
> Will page reclaim, page migration, compaction, ... of these movable
> LRU pages still continue working while they are sitting around
> waiting to be cleaned up? I can see that we're grabbing an extra
> reference when we put them onto the list, that might be a problem:
> for example, we can most certainly not swap out these pages or write
> them back to disk on memory pressure.

This is true. On the other hand, swapping out a moveable page would be
even slower, because those pages would need to be exported rather than
destroyed.

> >   
> >> way for a pure (shutdown) optimization is an alarm signal. Of
> >> course, I might be wrong.
> >>
> >> You should at least CC linux-mm. I'll do that right now and also CC
> >> Michal. He might have time to have a quick glimpse at patch #11 and
> >> #13.
> >>
> >> https://lkml.kernel.org/r/20210804154046.88552-12-imbrenda@linux.ibm.com
> >> https://lkml.kernel.org/r/20210804154046.88552-14-imbrenda@linux.ibm.com
> >>
> >> IMHO, we should proceed with patch 1-10, as they solve a really
> >> important problem ("slow reboots") in a nice way, whereby patch 11
> >> handles a case that can be worked around comparatively easily by
> >> management tools -- my 2 cents.  
> > 
> > how would management tools work around the issue that a shutdown can
> > take very long?  
> 
> The traditional approach is to wait starting a new VM on another 
> hypervisor instead until memory has been freed up, or start it on 
> another hypervisor. That raises the question about the target use
> case.
> 
> What I don't get is that we have to pay the price for freeing up that 
> memory. Why isn't it sufficient to keep the process running and let 
> ordinary MM do it's thing?

What price?

You mean let the mm do the slowest possible thing when tearing down a
dead guest?

Without this, the dying guest would still take up all the memory, and
swapping it out would not be any faster (it would be slower, in fact).
The system would OOM anyway.

> Maybe you should clearly spell out what the target use case for the
> fast shutdown (fast quitting of the process?) is?. I assume it is,
> starting a new VM / process / whatsoever on the same host
> immediately, and then
> 
> a) Eventually slowing down other processes due heavy reclaim.

For each dying guest, only one CPU is used by the reclaim; depending on
the total load of the system, this might not even be noticeable.

> b) Slowing down the new process because you have to pay the price of 
> cleaning up memory.

Do you prefer to OOM because the dying guest will need ages to clean up
its memory?

> I think I am missing why we need the lazy destroy at all when killing
> a process. Couldn't you instead teach the OOM killer "hey, we're
> currently quitting a heavy process that is just *very* slow to free
> up memory, please wait for that before starting shooting around" ?

Isn't this ^ exactly what the OOM notifier does?


Another note here:

When the process quits, the mm starts the teardown. At this point, the
mm has no idea that this is a dying KVM guest, so the best it can do is
export the pages (which is significantly slower than destroying them).

KVM comes into play long after the mm is gone, and at that point it
can't do anything anymore; the memory is already gone (freed very
slowly).

If I kill -9 QEMU (or if QEMU segfaults), KVM will never notice until
the mm is gone.




* Re: [PATCH v3 00/14] KVM: s390: pv: implement lazy destroy
  2021-08-06 13:44       ` Claudio Imbrenda
@ 2021-08-09  8:50         ` David Hildenbrand
  0 siblings, 0 replies; 5+ messages in thread
From: David Hildenbrand @ 2021-08-09  8:50 UTC (permalink / raw)
  To: Claudio Imbrenda
  Cc: kvm, cohuck, borntraeger, frankja, thuth, pasic, linux-s390,
	linux-kernel, Ulrich.Weigand, linux-mm, Michal Hocko

On 06.08.21 15:44, Claudio Imbrenda wrote:
> On Fri, 6 Aug 2021 13:30:21 +0200
> David Hildenbrand <david@redhat.com> wrote:
> 
> [...]
> 
>>>>> When the system runs out of memory, if a guest has terminated and
>>>>> its memory is being cleaned asynchronously, the OOM killer will
>>>>> wait a little and then see if memory has been freed. This has the
>>>>> practical effect of slowing down memory allocations when the
>>>>> system is out of memory to give the cleanup thread time to
>>>>> cleanup and free memory, and avoid an actual OOM situation.
>>>>
>>>> ... and this sound like the kind of arch MM hacks that will bite us
>>>> in the long run. Of course, I might be wrong, but already doing
>>>> excessive GFP_ATOMIC allocations or messing with the OOM killer
>>>> that
>>>
>>> they are GFP_ATOMIC but they should not put too much weight on the
>>> memory and can also fail without consequences, I used:
>>>
>>> GFP_ATOMIC | __GFP_NOMEMALLOC | __GFP_NOWARN
>>>
>>> also notice that after every page allocation a page gets freed, so
>>> this is only temporary.
>>
>> Correct me if I'm wrong: you're allocating unmovable pages for
>> tracking (e.g., ZONE_DMA, ZONE_NORMAL) from atomic reserves and will
>> free a movable process page, correct? Or which page will you be
>> freeing?
> 
> we are transforming ALL moveable pages belonging to userspace into
> unmoveable pages. every ~500 pages one page gets actually
> allocated (unmoveable), and another (moveable) one gets freed.
> 
>>>
>>> I would not call it "messing with the OOM killer", I'm using the
>>> same interface used by virtio-baloon
>>
>> Right, and for virtio-balloon it's actually a workaround to restore
>> the original behavior of a rarely used feature: deflate-on-oom.
>> Commit da10329cb057 ("virtio-balloon: switch back to OOM handler for
>> VIRTIO_BALLOON_F_DEFLATE_ON_OOM") tried to document why we switched
>> back from a shrinker to VIRTIO_BALLOON_F_DEFLATE_ON_OOM:
>>
>> "The name "deflate on OOM" makes it pretty clear when deflation should
>>    happen - after other approaches to reclaim memory failed, not while
>>    reclaiming. This allows to minimize the footprint of a guest -
>> memory will only be taken out of the balloon when really needed."
>>
>> Note some subtle differences:
>>
>> a) IIRC, before running into the OOM killer, will try reclaiming
>>      anything  else. This is what we want for deflate-on-oom, it might
>> not be what you want for your feature (e.g., flushing other
>> processes/VMs to disk/swap instead of waiting for a single process to
>> stop).
> 
> we are already reclaiming the memory of the dead secure guest.
> 
>> b) Migration of movable balloon inflated pages continues working
>> because we are dealing with non-lru page migration.
>>
>> Will page reclaim, page migration, compaction, ... of these movable
>> LRU pages still continue working while they are sitting around
>> waiting to be cleaned up? I can see that we're grabbing an extra
>> reference when we put them onto the list, that might be a problem:
>> for example, we can most certainly not swap out these pages or write
>> them back to disk on memory pressure.
> 
> this is true. on the other hand, swapping a moveable page would be even
> slower, because those pages would need to be exported and not destroyed.
> 
>>>    
>>>> way for a pure (shutdown) optimization is an alarm signal. Of
>>>> course, I might be wrong.
>>>>
>>>> You should at least CC linux-mm. I'll do that right now and also CC
>>>> Michal. He might have time to have a quick glimpse at patch #11 and
>>>> #13.
>>>>
>>>> https://lkml.kernel.org/r/20210804154046.88552-12-imbrenda@linux.ibm.com
>>>> https://lkml.kernel.org/r/20210804154046.88552-14-imbrenda@linux.ibm.com
>>>>
>>>> IMHO, we should proceed with patch 1-10, as they solve a really
>>>> important problem ("slow reboots") in a nice way, whereby patch 11
>>>> handles a case that can be worked around comparatively easily by
>>>> management tools -- my 2 cents.
>>>
>>> how would management tools work around the issue that a shutdown can
>>> take very long?
>>
>> The traditional approach is to wait starting a new VM on another
>> hypervisor instead until memory has been freed up, or start it on
>> another hypervisor. That raises the question about the target use
>> case.
>>
>> What I don't get is that we have to pay the price for freeing up that
>> memory. Why isn't it sufficient to keep the process running and let
>> ordinary MM do it's thing?
> 
> what price?
> 
> you mean let mm do the slowest possible thing when tearing down a dead
> guest?
> 
> without this, the dying guest would still take up all the memory. and
> swapping it would not be any faster (it would be slower, in fact). the
> system would OOM anyway.
> 
>> Maybe you should clearly spell out what the target use case for the
>> fast shutdown (fast quitting of the process?) is?. I assume it is,
>> starting a new VM / process / whatsoever on the same host
>> immediately, and then
>>
>> a) Eventually slowing down other processes due heavy reclaim.
> 
> for each dying guest, only one CPU is used by the reclaim; depending on
> the total load of the system, this might not even be noticeable
> 
>> b) Slowing down the new process because you have to pay the price of
>> cleaning up memory.
> 
> do you prefer to OOM because the dying guest will need ages to clean up
> its memory?
> 
>> I think I am missing why we need the lazy destroy at all when killing
>> a process. Couldn't you instead teach the OOM killer "hey, we're
>> currently quitting a heavy process that is just *very* slow to free
>> up memory, please wait for that before starting shooting around" ?
> 
> isn't this ^ exactly what the OOM notifier does?
> 
> 
> another note here:
> 
> when the process quits, the mm starts the tear down. at this point, the
> mm has no idea that this is a dying KVM guest, so the best it can do is
> exporting (which is significantly slower than destroy page)
> 
> kvm comes into play long after the mm is gone, and at this point it
> can't do anything anymore. the memory is already gone (very slowly).
> 
> if I kill -9 qemu (or if qemu segfaults), KVM will never notice until
> the mm is gone.
> 

Summarizing what we discussed offline:

1. We should optimize for proper shutdowns first; this is the most
important use case. We should look into letting QEMU tear down the KVM
secure context such that we can just let MM teardown do its thing ->
destroy instead of export secure pages (see the rough userspace sketch
after this list). If no kernel changes are required to get that
implemented, even better.

2. If we want to optimize for "there is a big process dying horribly
slowly, please OOM killer, wait a bit instead of starting to kill other
processes", we might want to do that in a more generic way (if it's not
already in place; I'm no expert).

3. If we really want to go down the path of optimizing "kill -9" and
friends to, e.g., take 40min instead of 20min on a huge VM (who cares?
Especially since the OOM handler will already struggle if memory is
getting freed that slowly, no matter whether it is 40 or 20 minutes),
we should look into being able to release the relevant KVM secure
context before tearing down the MM. We should avoid any arch-specific
hacks.
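
As a very rough illustration of point 1, and under the assumption that
the existing KVM_S390_PV_COMMAND / KVM_PV_DISABLE ioctl is sufficient
for this (to be verified against the kernel in use), the userspace side
could be as simple as:

#include <linux/kvm.h>
#include <string.h>
#include <sys/ioctl.h>

/*
 * Sketch only: explicitly tear down the protected configuration from
 * userspace before the process exits, so that the mm teardown afterwards
 * only has to deal with ordinary, non-secure pages.
 */
static int pv_disable_before_exit(int vm_fd)
{
        struct kvm_pv_cmd cmd;

        memset(&cmd, 0, sizeof(cmd));
        cmd.cmd = KVM_PV_DISABLE;

        return ioctl(vm_fd, KVM_S390_PV_COMMAND, &cmd);
}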

-- 
Thanks,

David / dhildenb



