Date: Thu, 27 Jul 2023 10:18:08 +0200
From: Michal Hocko
To: David Hildenbrand
Cc: Ross Zwisler, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
    Mike Rapoport, Andrew Morton, Matthew Wilcox, Mel Gorman,
    Vlastimil Babka
Subject: Re: collision between ZONE_MOVABLE and memblock allocations
References: <20230718220106.GA3117638@google.com>
    <20230719224821.GC3528218@google.com>
    <9ef757dc-da4b-9fa1-de84-1328a74f18a7@redhat.com>
In-Reply-To: <9ef757dc-da4b-9fa1-de84-1328a74f18a7@redhat.com>

On Wed 26-07-23 10:44:21, David Hildenbrand wrote:
> On 20.07.23 00:48, Ross Zwisler wrote:
> > On Wed, Jul 19, 2023 at 08:14:48AM +0200, Michal Hocko wrote:
> > > On Tue 18-07-23 16:01:06, Ross Zwisler wrote:
> > > [...]
> > > > I do think that we need to fix this collision between ZONE_MOVABLE and memmap
> > > > allocations, because this issue essentially makes the movablecore= kernel
> > > > command line parameter useless in many cases, as the ZONE_MOVABLE region it
> > > > creates will often actually be unmovable.
> > >
> > > movablecore is kinda hack and I would be more inclined to get rid of it
> > > rather than build more into it. Could you be more specific about your
> > > use case?
> >
> > The problem that I'm trying to solve is that I'd like to be able to get kernel
> > core dumps off machines (chromebooks) so that we can debug crashes. Because
> > the memory used by the crash kernel ("crashkernel=" kernel command line
> > option) is consumed the entire time the machine is booted, there is a strong
> > motivation to keep the crash kernel as small and as simple as possible. To
> > this end I'm trying to get away without SSD drivers, not having to worry about
> > encryption on the SSDs, etc.
>
> Okay, so you intend to keep the crashkernel area as small as possible.
>
> >
> > So, the rough plan right now is:
> >
> > 1) During boot set aside some memory that won't contain kernel allocations.
> > I'm trying to do this now with ZONE_MOVABLE, but I'm open to better ways.
> >
> > We set aside memory for a crash kernel & arm it so that the ZONE_MOVABLE
> > region (or whatever non-kernel region) will be set aside as PMEM in the crash
> > kernel. This is done with the memmap=nn[KMG]!ss[KMG] kernel command line
> > parameter passed to the crash kernel.
> >
> > So, in my sample 4G VM system, I see:
> >
> >   # lsmem --split ZONES --output-all
> >   RANGE                                  SIZE  STATE REMOVABLE BLOCK NODE ZONES
> >   0x0000000000000000-0x0000000007ffffff  128M online       yes     0    0 None
> >   0x0000000008000000-0x00000000bfffffff  2.9G online       yes  1-23    0 DMA32
> >   0x0000000100000000-0x000000012fffffff  768M online       yes 32-37    0 Normal
> >   0x0000000130000000-0x000000013fffffff  256M online       yes 38-39    0 Movable
> >
> >   Memory block size: 128M
> >   Total online memory: 4G
> >   Total offline memory: 0B
> >
> > so I'll pass "memmap=256M!0x130000000" to the crash kernel.
> >
> > 2) When we hit a kernel crash, we know (hope?) that the PMEM region we've set
> > aside only contains user data, which we don't want to store anyway.
>
> I raised that in different context already, but such assumptions are not
> 100% future proof IMHO. For example, we might at one point be able to make
> user page tables movable and place them on there.
>
> But yes, most kernel data structures (which you care about) will probably
> never be movable and never end up on these regions.
>
> > We make a
> > filesystem in there, and create a kernel crash dump using 'makedumpfile':
> >
> >   mkfs.ext4 /dev/pmem0
> >   mount /dev/pmem0 /mnt
> >   makedumpfile -c -d 31 /proc/vmcore /mnt/kdump
> >
> > We then set up the next full kernel boot to also have this same PMEM region,
> > using the same memmap kernel parameter. We reboot back into a full kernel.
> >
> > 3) The next full kernel will be a normal boot with a full networking stack,
> > SSD drivers, disk encryption, etc. We mount up our PMEM filesystem, pull out
> > the kdump and either store it somewhere persistent or upload it somewhere.
> > We can then unmount the PMEM and reconfigure it back to system ram so that the
> > live system isn't missing memory.
> >
> >   ndctl create-namespace --reconfig=namespace0.0 -m devdax -f
> >   daxctl reconfigure-device --mode=system-ram dax0.0
> >
> > This is the flow I'm trying to support, and have mostly working in a VM,
> > except up until now makedumpfile would crash because all the memblock
> > structures it needed were in the PMEM area that I had just wiped out by making
> > a new filesystem. :)
>
>
> Thinking out loud (and remembering that some architectures relocate the
> crashkernel during kexec, if I am not wrong), maybe the following would also
> work and make your setup eventually easier:
>
> 1) Don't reserve a crashkernel area in the traditional way, instead reserve
> that area using CMA. It can be used for MOVABLE allocations.
>
> 2) Let kexec load the crashkernel+initrd into ordinary memory only
> (consuming as much as you would need there).
>
> 3) On kexec, relocate the crashkernel+initrd into the CMA area (overwriting
> any movable data in there)
>
> 4) In makedumpfile, don't dump any memory that falls into the crashkernel
> area. It might already have been overwritten by the second kernel

This is more or less what Jiri is looking into.

-- 
Michal Hocko
SUSE Labs