Date: Thu, 20 Jul 2023 09:49:35 +0200
From: Michal Hocko <mhocko@suse.com>
To: Ross Zwisler
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Mike Rapoport,
	Andrew Morton, Matthew Wilcox, Mel Gorman, Vlastimil Babka,
	David Hildenbrand, Jiri Bohac
Subject: Re: collision between ZONE_MOVABLE and memblock allocations
In-Reply-To: <20230719224821.GC3528218@google.com>
References: <20230718220106.GA3117638@google.com> <20230719224821.GC3528218@google.com>

[CC Jiri Bohac]

On Wed 19-07-23 16:48:21, Ross Zwisler wrote:
> On Wed, Jul 19, 2023 at 08:14:48AM +0200, Michal Hocko wrote:
> > On Tue 18-07-23 16:01:06, Ross Zwisler wrote:
> > [...]
> > > I do think that we need to fix this collision between ZONE_MOVABLE and memmap
> > > allocations, because this issue essentially makes the movablecore= kernel
> > > command line parameter useless in many cases, as the ZONE_MOVABLE region it
> > > creates will often actually be unmovable.
> >
> > movablecore is kind of a hack and I would be more inclined to get rid of it
> > rather than build more on top of it. Could you be more specific about your
> > use case?
>
> The problem that I'm trying to solve is that I'd like to be able to get kernel
> core dumps off machines (chromebooks) so that we can debug crashes.
> Because the memory used by the crash kernel ("crashkernel=" kernel command
> line option) is consumed the entire time the machine is booted, there is a
> strong motivation to keep the crash kernel as small and as simple as
> possible. To this end I'm trying to get away without SSD drivers, without
> having to worry about encryption on the SSDs, etc. This is something Jiri is
> also looking into.
>
> So, the rough plan right now is:
>
> 1) During boot, set aside some memory that won't contain kernel allocations.
> I'm trying to do this now with ZONE_MOVABLE, but I'm open to better ways.
>
> We set aside memory for a crash kernel & arm it so that the ZONE_MOVABLE
> region (or whatever non-kernel region) will show up as PMEM in the crash
> kernel. This is done with the memmap=nn[KMG]!ss[KMG] kernel command line
> parameter passed to the crash kernel.
>
> So, in my sample 4G VM system, I see:
>
> # lsmem --split ZONES --output-all
> RANGE                                  SIZE  STATE REMOVABLE BLOCK NODE ZONES
> 0x0000000000000000-0x0000000007ffffff  128M online       yes     0    0 None
> 0x0000000008000000-0x00000000bfffffff  2.9G online       yes  1-23    0 DMA32
> 0x0000000100000000-0x000000012fffffff  768M online       yes 32-37    0 Normal
> 0x0000000130000000-0x000000013fffffff  256M online       yes 38-39    0 Movable
>
> Memory block size:       128M
> Total online memory:       4G
> Total offline memory:      0B
>
> so I'll pass "memmap=256M!0x130000000" to the crash kernel.
>
> 2) When we hit a kernel crash, we know (hope?) that the PMEM region we've set
> aside only contains user data, which we don't want to store anyway. We make a
> filesystem in there and create a kernel crash dump using 'makedumpfile':
>
> mkfs.ext4 /dev/pmem0
> mount /dev/pmem0 /mnt
> makedumpfile -c -d 31 /proc/vmcore /mnt/kdump
>
> We then set up the next full kernel boot to also have this same PMEM region,
> using the same memmap kernel parameter, and reboot back into a full kernel.
>
> 3) The next full kernel will be a normal boot with a full networking stack,
> SSD drivers, disk encryption, etc. We mount up our PMEM filesystem, pull out
> the kdump and either store it somewhere persistent or upload it somewhere. We
> can then unmount the PMEM and reconfigure it back to system RAM so that the
> live system isn't missing memory:
>
> ndctl create-namespace --reconfig=namespace0.0 -m devdax -f
> daxctl reconfigure-device --mode=system-ram dax0.0
>
> This is the flow I'm trying to support, and I have it mostly working in a VM,
> except that up until now makedumpfile would crash because all the memblock
> structures it needed were in the PMEM area that I had just wiped out by making
> a new filesystem. :)
>
> Do you see any blockers that would make this infeasible?
>
> For the non-kernel memory, is the ZONE_MOVABLE path that I'm currently
> pursuing the best option, or would we be better off with your suggestion
> elsewhere in this thread:

The main problem I would see with this approach is that the small Movable
zone you set aside would be easily consumed and reclaimed. That could
generate some unexpected performance artifacts; we used to see those with
small zones, or with large differences in zone sizes, in the past. But
functionally this should work, or at least I do not see any fundamental
problems with it.

Jiri is looking at this from a slightly different angle. Very broadly, he
would like to have a dedicated CMA pool and reuse it for the kernel memory
(dropping anything sitting there) when crashing. __GFP_MOVABLE allocations
can use CMA pools.
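As an aside, it is easy to keep an eye on how much pressure that small
Movable zone ends up under at runtime. An illustrative one-liner (nothing
more than a quick look at the zone's free pages and watermarks):

  # illustrative: show the Movable zone counters from /proc/zoneinfo
  grep -A 8 "zone *Movable" /proc/zoneinfo

If "pages free" keeps sitting near the min/low watermarks there, the zone is
too small for the user space footprint and the reclaim artifacts mentioned
above are likely to show up.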
> > If the specific placement of the movable memory is not important, and the
> > only thing that matters is the size and NUMA locality, then an easier to
> > maintain solution would be to simply offline enough memory blocks very
> > early in the userspace bring-up and online them back as movable. If
> > offlining fails, just try another memory block. This doesn't require any
> > kernel code change.
>
> If this 2nd way is preferred, can you point me to how I can offline the
> memory blocks & then get them back later in boot?

/bin/echo offline > /sys/devices/system/memory/memory$NUM/state && \
	echo online_movable > /sys/devices/system/memory/memory$NUM/state

More in Documentation/admin-guide/mm/memory-hotplug.rst.

--
Michal Hocko
SUSE Labs
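For completeness, a rough and untested sketch of how that early-boot step
could be scripted over the sysfs memory blocks; the 2G target is purely
illustrative and error handling is minimal:

  #!/bin/bash
  # Untested sketch: early in userspace bring-up, offline memory blocks until
  # ~2G (an arbitrary example target) has been collected, onlining each one
  # back as movable. Blocks that refuse to offline are simply skipped.
  want=$((2 << 30))
  block_bytes=$((0x$(cat /sys/devices/system/memory/block_size_bytes)))
  got=0
  for state in /sys/devices/system/memory/memory*/state; do
          [ "$got" -ge "$want" ] && break
          [ "$(cat "$state")" = online ] || continue
          # offlining fails if the block contains unmovable kernel data;
          # in that case just move on to the next block
          echo offline > "$state" 2>/dev/null || continue
          echo online_movable > "$state"
          got=$((got + block_bytes))
  done

Skipping blocks that fail to offline mirrors the "just try another memory
block" suggestion above: blocks already holding unmovable kernel data simply
refuse to go offline, and the loop moves on.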