To: Claudio Imbrenda
Cc: kvm@vger.kernel.org, cohuck@redhat.com, borntraeger@de.ibm.com,
 frankja@linux.ibm.com, thuth@redhat.com, pasic@linux.ibm.com,
 linux-s390@vger.kernel.org, linux-kernel@vger.kernel.org,
 Ulrich.Weigand@de.ibm.com, linux-mm@kvack.org, Michal Hocko
References: <20210804154046.88552-1-imbrenda@linux.ibm.com>
 <86b114ef-41ea-04b6-327c-4a036f784fad@redhat.com>
 <20210806113005.0259d53c@p-imbrenda>
 <20210806154400.2ca55563@p-imbrenda>
From: David Hildenbrand
Organization: Red Hat
Subject: Re: [PATCH v3 00/14] KVM: s390: pv: implement lazy destroy
Message-ID: <8f1502a4-8ee3-f70f-ca04-4a13d44368fb@redhat.com>
Date: Mon, 9 Aug 2021 10:50:57 +0200
In-Reply-To: <20210806154400.2ca55563@p-imbrenda>

On 06.08.21 15:44, Claudio Imbrenda wrote:
> On Fri, 6 Aug 2021 13:30:21 +0200
> David Hildenbrand wrote:
>
> [...]
>
>>>>> When the system runs out of memory, if a guest has terminated and
>>>>> its memory is being cleaned asynchronously, the OOM killer will
>>>>> wait a little and then see if memory has been freed. This has the
>>>>> practical effect of slowing down memory allocations when the
>>>>> system is out of memory, to give the cleanup thread time to clean
>>>>> up and free memory, and to avoid an actual OOM situation.
>>>>
>>>> ... and this sounds like the kind of arch MM hacks that will bite us
>>>> in the long run. Of course, I might be wrong, but already doing
>>>> excessive GFP_ATOMIC allocations or messing with the OOM killer
>>>> that
>>>
>>> they are GFP_ATOMIC, but they should not put too much weight on the
>>> memory and can also fail without consequences. I used:
>>>
>>> GFP_ATOMIC | __GFP_NOMEMALLOC | __GFP_NOWARN
>>>
>>> also notice that after every page allocation a page gets freed, so
>>> this is only temporary.
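
Just so the linux-mm folks on CC don't have to dig through the series:
my understanding of the allocation you describe is roughly the
following. This is only a sketch, the helper name is made up and it is
not the actual patch code:

#include <linux/gfp.h>
#include <linux/mm.h>

/*
 * Allocate one tracking page without sleeping (GFP_ATOMIC), without
 * dipping into the emergency reserves (__GFP_NOMEMALLOC) and without
 * warning on failure (__GFP_NOWARN). The caller must tolerate NULL.
 */
static struct page *pv_alloc_tracking_page(void)
{
	return alloc_page(GFP_ATOMIC | __GFP_NOMEMALLOC | __GFP_NOWARN);
}
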
>>
>> Correct me if I'm wrong: you're allocating unmovable pages for
>> tracking (e.g., ZONE_DMA, ZONE_NORMAL) from atomic reserves and will
>> free a movable process page, correct? Or which page will you be
>> freeing?
>
> we are transforming ALL movable pages belonging to userspace into
> unmovable pages. every ~500 pages one page gets actually
> allocated (unmovable), and another (movable) one gets freed.
>
>>>
>>> I would not call it "messing with the OOM killer", I'm using the
>>> same interface used by virtio-balloon
>>
>> Right, and for virtio-balloon it's actually a workaround to restore
>> the original behavior of a rarely used feature: deflate-on-oom.
>> Commit da10329cb057 ("virtio-balloon: switch back to OOM handler for
>> VIRTIO_BALLOON_F_DEFLATE_ON_OOM") tried to document why we switched
>> back from a shrinker to VIRTIO_BALLOON_F_DEFLATE_ON_OOM:
>>
>> "The name "deflate on OOM" makes it pretty clear when deflation should
>> happen - after other approaches to reclaim memory failed, not while
>> reclaiming. This allows to minimize the footprint of a guest -
>> memory will only be taken out of the balloon when really needed."
>>
>> Note some subtle differences:
>>
>> a) IIRC, before running into the OOM killer, the kernel will try
>> reclaiming anything else. This is what we want for deflate-on-oom; it
>> might not be what you want for your feature (e.g., flushing other
>> processes/VMs to disk/swap instead of waiting for a single process to
>> stop).
>
> we are already reclaiming the memory of the dead secure guest.
>
>> b) Migration of movable balloon-inflated pages continues working
>> because we are dealing with non-lru page migration.
>>
>> Will page reclaim, page migration, compaction, ... of these movable
>> LRU pages still continue working while they are sitting around
>> waiting to be cleaned up? I can see that we're grabbing an extra
>> reference when we put them onto the list, and that might be a problem:
>> for example, we can most certainly not swap out these pages or write
>> them back to disk on memory pressure.
>
> this is true. on the other hand, swapping a movable page would be even
> slower, because those pages would need to be exported and not destroyed.
>
>>>
>>>> way for a pure (shutdown) optimization is an alarm signal. Of
>>>> course, I might be wrong.
>>>>
>>>> You should at least CC linux-mm. I'll do that right now and also CC
>>>> Michal. He might have time to have a quick glimpse at patch #11 and
>>>> #13.
>>>>
>>>> https://lkml.kernel.org/r/20210804154046.88552-12-imbrenda@linux.ibm.com
>>>> https://lkml.kernel.org/r/20210804154046.88552-14-imbrenda@linux.ibm.com
>>>>
>>>> IMHO, we should proceed with patches 1-10, as they solve a really
>>>> important problem ("slow reboots") in a nice way, whereas patch 11
>>>> handles a case that can be worked around comparatively easily by
>>>> management tools -- my 2 cents.
>>>
>>> how would management tools work around the issue that a shutdown can
>>> take very long?
>>
>> The traditional approach is to wait until memory has been freed up
>> before starting a new VM, or to start it on another hypervisor
>> instead. That raises the question about the target use case.
>>
>> What I don't get is why we have to pay the price for freeing up that
>> memory. Why isn't it sufficient to keep the process running and let
>> ordinary MM do its thing?
>
> what price?
>
> you mean let mm do the slowest possible thing when tearing down a dead
> guest?
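
As an aside, for the linux-mm folks now on CC: the interface we keep
referring to is the OOM notifier chain that virtio-balloon's
deflate-on-oom also hooks into. A rough sketch of what such a hook
looks like; the names below are illustrative, not the code from this
series:

#include <linux/atomic.h>
#include <linux/notifier.h>
#include <linux/oom.h>

/* Hypothetical counter maintained by the background teardown. */
static atomic_long_t pv_pages_released;

/*
 * Called right before the OOM killer starts shooting: report how many
 * pages the asynchronous cleanup has released so far, so the OOM
 * killer can retry the allocation instead of killing something.
 */
static int pv_cleanup_oom_notify(struct notifier_block *nb,
				 unsigned long dummy, void *parm)
{
	unsigned long *freed = parm;

	*freed += atomic_long_xchg(&pv_pages_released, 0);
	return NOTIFY_OK;
}

static struct notifier_block pv_cleanup_oom_nb = {
	.notifier_call = pv_cleanup_oom_notify,
};

/* At setup time: register_oom_notifier(&pv_cleanup_oom_nb); */
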
>
> without this, the dying guest would still take up all the memory. and
> swapping it would not be any faster (it would be slower, in fact). the
> system would OOM anyway.
>
>> Maybe you should clearly spell out what the target use case for the
>> fast shutdown (fast quitting of the process?) is? I assume it is
>> starting a new VM / process / whatsoever on the same host
>> immediately, and then
>>
>> a) Eventually slowing down other processes due to heavy reclaim.
>
> for each dying guest, only one CPU is used by the reclaim; depending on
> the total load of the system, this might not even be noticeable.
>
>> b) Slowing down the new process because you have to pay the price of
>> cleaning up memory.
>
> do you prefer to OOM because the dying guest will need ages to clean up
> its memory?
>
>> I think I am missing why we need the lazy destroy at all when killing
>> a process. Couldn't you instead teach the OOM killer "hey, we're
>> currently quitting a heavy process that is just *very* slow to free
>> up memory, please wait for that before starting to shoot around"?
>
> isn't this ^ exactly what the OOM notifier does?
>
>
> another note here:
>
> when the process quits, the mm starts the teardown. at this point, the
> mm has no idea that this is a dying KVM guest, so the best it can do is
> exporting (which is significantly slower than destroy page).
>
> kvm comes into play long after the mm is gone, and at this point it
> can't do anything anymore. the memory is already gone (very slowly).
>
> if I kill -9 qemu (or if qemu segfaults), KVM will never notice until
> the mm is gone.
>

Summarizing what we discussed offline:

1. We should optimize for proper shutdowns first; this is the most
important use case. We should look into letting QEMU tear down the KVM
secure context such that we can just let MM teardown do its thing ->
destroy instead of export secure pages. If no kernel changes are
required to get that implemented, even better.

2. If we want to optimize "there is a big process dying horribly slow;
OOM killer, please wait a bit instead of starting to kill other
processes", we might want to do that in a more generic way (if not
already in place, no expert).

3. If we really want to go down the path of optimizing "kill -9" and
friends to, e.g., take 40 min instead of 20 min on a huge VM (who
cares? especially as the OOM handler will already struggle if memory is
getting freed that slowly, no matter if 40 or 20 minutes), we should
look into being able to release the relevant KVM secure context before
tearing down MM. We should avoid any arch-specific hacks.

-- 
Thanks,

David / dhildenb