To: Claudio Imbrenda
Cc: kvm@vger.kernel.org, cohuck@redhat.com, borntraeger@de.ibm.com,
 frankja@linux.ibm.com, thuth@redhat.com, pasic@linux.ibm.com,
 linux-s390@vger.kernel.org, linux-kernel@vger.kernel.org,
 Ulrich.Weigand@de.ibm.com, linux-mm@kvack.org, Michal Hocko
References: <20210804154046.88552-1-imbrenda@linux.ibm.com>
 <86b114ef-41ea-04b6-327c-4a036f784fad@redhat.com>
 <20210806113005.0259d53c@p-imbrenda>
 <20210806154400.2ca55563@p-imbrenda>
From: David Hildenbrand
Organization: Red Hat
Subject: Re: [PATCH v3 00/14] KVM: s390: pv: implement lazy destroy
Message-ID: <8f1502a4-8ee3-f70f-ca04-4a13d44368fb@redhat.com>
Date: Mon, 9 Aug 2021 10:50:57 +0200
In-Reply-To: <20210806154400.2ca55563@p-imbrenda>

On 06.08.21 15:44, Claudio Imbrenda wrote:
> On Fri, 6 Aug 2021 13:30:21 +0200
> David Hildenbrand wrote:
>
> [...]
>
>>>>> When the system runs out of memory, if a guest has terminated and
>>>>> its memory is being cleaned asynchronously, the OOM killer will
>>>>> wait a little and then see if memory has been freed. This has the
>>>>> practical effect of slowing down memory allocations when the
>>>>> system is out of memory, to give the cleanup thread time to clean
>>>>> up and free memory, and to avoid an actual OOM situation.
>>>>
>>>> ... and this sounds like the kind of arch MM hacks that will bite us
>>>> in the long run. Of course, I might be wrong, but already doing
>>>> excessive GFP_ATOMIC allocations or messing with the OOM killer
>>>> that
>>>
>>> they are GFP_ATOMIC, but they should not put too much weight on the
>>> memory and can also fail without consequences. I used:
>>>
>>> GFP_ATOMIC | __GFP_NOMEMALLOC | __GFP_NOWARN
>>>
>>> also notice that after every page allocation a page gets freed, so
>>> this is only temporary.
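
Just so the linux-mm folks on CC don't have to dig through the series:
my understanding of the allocation you describe is roughly the
following. This is only a sketch, the helper name is made up and it is
not the actual patch code:

#include <linux/gfp.h>
#include <linux/mm.h>

/*
 * Allocate one tracking page without sleeping (GFP_ATOMIC), without
 * dipping into the emergency reserves (__GFP_NOMEMALLOC) and without
 * warning on failure (__GFP_NOWARN). The caller must tolerate NULL.
 */
static struct page *pv_alloc_tracking_page(void)
{
	return alloc_page(GFP_ATOMIC | __GFP_NOMEMALLOC | __GFP_NOWARN);
}
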
>>
>> Correct me if I'm wrong: you're allocating unmovable pages for
>> tracking (e.g., ZONE_DMA, ZONE_NORMAL) from atomic reserves and will
>> free a movable process page, correct? Or which page will you be
>> freeing?
>
> we are transforming ALL movable pages belonging to userspace into
> unmovable pages. every ~500 pages one page gets actually
> allocated (unmovable), and another (movable) one gets freed.
>
>>>
>>> I would not call it "messing with the OOM killer", I'm using the
>>> same interface used by virtio-balloon
>>
>> Right, and for virtio-balloon it's actually a workaround to restore
>> the original behavior of a rarely used feature: deflate-on-oom.
>> Commit da10329cb057 ("virtio-balloon: switch back to OOM handler for
>> VIRTIO_BALLOON_F_DEFLATE_ON_OOM") tried to document why we switched
>> back from a shrinker to VIRTIO_BALLOON_F_DEFLATE_ON_OOM:
>>
>> "The name "deflate on OOM" makes it pretty clear when deflation should
>> happen - after other approaches to reclaim memory failed, not while
>> reclaiming. This allows to minimize the footprint of a guest -
>> memory will only be taken out of the balloon when really needed."
>>
>> Note some subtle differences:
>>
>> a) IIRC, before running into the OOM killer, the kernel will try
>> reclaiming anything else. This is what we want for deflate-on-oom; it
>> might not be what you want for your feature (e.g., flushing other
>> processes/VMs to disk/swap instead of waiting for a single process to
>> stop).
>
> we are already reclaiming the memory of the dead secure guest.
>
>> b) Migration of movable balloon-inflated pages continues working
>> because we are dealing with non-lru page migration.
>>
>> Will page reclaim, page migration, compaction, ... of these movable
>> LRU pages still continue working while they are sitting around
>> waiting to be cleaned up? I can see that we're grabbing an extra
>> reference when we put them onto the list, and that might be a problem:
>> for example, we can most certainly not swap out these pages or write
>> them back to disk on memory pressure.
>
> this is true. on the other hand, swapping a movable page would be even
> slower, because those pages would need to be exported and not destroyed.
>
>>>
>>>> way for a pure (shutdown) optimization is an alarm signal. Of
>>>> course, I might be wrong.
>>>>
>>>> You should at least CC linux-mm. I'll do that right now and also CC
>>>> Michal. He might have time to have a quick glimpse at patch #11 and
>>>> #13.
>>>>
>>>> https://lkml.kernel.org/r/20210804154046.88552-12-imbrenda@linux.ibm.com
>>>> https://lkml.kernel.org/r/20210804154046.88552-14-imbrenda@linux.ibm.com
>>>>
>>>> IMHO, we should proceed with patches 1-10, as they solve a really
>>>> important problem ("slow reboots") in a nice way, whereas patch 11
>>>> handles a case that can be worked around comparatively easily by
>>>> management tools -- my 2 cents.
>>>
>>> how would management tools work around the issue that a shutdown can
>>> take very long?
>>
>> The traditional approach is to wait until memory has been freed up
>> before starting a new VM, or to start it on another hypervisor
>> instead. That raises the question about the target use case.
>>
>> What I don't get is why we have to pay the price for freeing up that
>> memory. Why isn't it sufficient to keep the process running and let
>> ordinary MM do its thing?
>
> what price?
>
> you mean let mm do the slowest possible thing when tearing down a dead
> guest?
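
As an aside, for the linux-mm folks now on CC: the interface we keep
referring to is the OOM notifier chain that virtio-balloon's
deflate-on-oom also hooks into. A rough sketch of what such a hook
looks like; the names below are illustrative, not the code from this
series:

#include <linux/atomic.h>
#include <linux/notifier.h>
#include <linux/oom.h>

/* Hypothetical counter maintained by the background teardown. */
static atomic_long_t pv_pages_released;

/*
 * Called right before the OOM killer starts shooting: report how many
 * pages the asynchronous cleanup has released so far, so the OOM
 * killer can retry the allocation instead of killing something.
 */
static int pv_cleanup_oom_notify(struct notifier_block *nb,
				 unsigned long dummy, void *parm)
{
	unsigned long *freed = parm;

	*freed += atomic_long_xchg(&pv_pages_released, 0);
	return NOTIFY_OK;
}

static struct notifier_block pv_cleanup_oom_nb = {
	.notifier_call = pv_cleanup_oom_notify,
};

/* At setup time: register_oom_notifier(&pv_cleanup_oom_nb); */
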
>
> without this, the dying guest would still take up all the memory. and
> swapping it would not be any faster (it would be slower, in fact). the
> system would OOM anyway.
>
>> Maybe you should clearly spell out what the target use case for the
>> fast shutdown (fast quitting of the process?) is? I assume it is
>> starting a new VM / process / whatsoever on the same host
>> immediately, and then
>>
>> a) Eventually slowing down other processes due to heavy reclaim.
>
> for each dying guest, only one CPU is used by the reclaim; depending on
> the total load of the system, this might not even be noticeable.
>
>> b) Slowing down the new process because you have to pay the price of
>> cleaning up memory.
>
> do you prefer to OOM because the dying guest will need ages to clean up
> its memory?
>
>> I think I am missing why we need the lazy destroy at all when killing
>> a process. Couldn't you instead teach the OOM killer "hey, we're
>> currently quitting a heavy process that is just *very* slow to free
>> up memory, please wait for that before starting to shoot around"?
>
> isn't this ^ exactly what the OOM notifier does?
>
>
> another note here:
>
> when the process quits, the mm starts the teardown. at this point, the
> mm has no idea that this is a dying KVM guest, so the best it can do is
> exporting (which is significantly slower than destroy page).
>
> kvm comes into play long after the mm is gone, and at this point it
> can't do anything anymore. the memory is already gone (very slowly).
>
> if I kill -9 qemu (or if qemu segfaults), KVM will never notice until
> the mm is gone.
>

Summarizing what we discussed offline:

1. We should optimize for proper shutdowns first; this is the most
important use case. We should look into letting QEMU tear down the KVM
secure context such that we can just let MM teardown do its thing ->
destroy instead of export secure pages. If no kernel changes are
required to get that implemented, even better.

2. If we want to optimize "there is a big process dying horribly slow;
OOM killer, please wait a bit instead of starting to kill other
processes", we might want to do that in a more generic way (if not
already in place, no expert).

3. If we really want to go down the path of optimizing "kill -9" and
friends to, e.g., take 40 min instead of 20 min on a huge VM (who
cares? especially as the OOM handler will already struggle if memory is
getting freed that slowly, no matter if 40 or 20 minutes), we should
look into being able to release the relevant KVM secure context before
tearing down MM. We should avoid any arch-specific hacks.

-- 
Thanks,

David / dhildenb