Message-ID: <9ef757dc-da4b-9fa1-de84-1328a74f18a7@redhat.com>
Date: Wed, 26 Jul 2023 10:44:21 +0200
To: Ross Zwisler, Michal Hocko
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Mike Rapoport,
 Andrew Morton, Matthew Wilcox, Mel Gorman, Vlastimil Babka
References: <20230718220106.GA3117638@google.com>
 <20230719224821.GC3528218@google.com>
From: David Hildenbrand
Organization: Red Hat
Subject: Re: collision between ZONE_MOVABLE and memblock allocations
In-Reply-To: <20230719224821.GC3528218@google.com>

On 20.07.23 00:48, Ross Zwisler wrote:
> On Wed, Jul 19, 2023 at 08:14:48AM +0200, Michal Hocko wrote:
>> On Tue 18-07-23 16:01:06, Ross Zwisler wrote:
>> [...]
>>> I do think that we need to fix this collision between ZONE_MOVABLE and memmap
>>> allocations, because this issue essentially makes the movablecore= kernel
>>> command line parameter useless in many cases, as the ZONE_MOVABLE region it
>>> creates will often actually be unmovable.
>>
>> movablecore is kinda hack and I would be more inclined to get rid of it
>> rather than build more into it. Could you be more specific about your
>> use case?
>
> The problem that I'm trying to solve is that I'd like to be able to get kernel
> core dumps off machines (chromebooks) so that we can debug crashes. Because
> the memory used by the crash kernel ("crashkernel=" kernel command line
> option) is consumed the entire time the machine is booted, there is a strong
> motivation to keep the crash kernel as small and as simple as possible. To
> this end I'm trying to get away without SSD drivers, not having to worry about
> encryption on the SSDs, etc.

Okay, so you intend to keep the crashkernel area as small as possible.

>
> So, the rough plan right now is:
>
> 1) During boot set aside some memory that won't contain kernel allocations.
> I'm trying to do this now with ZONE_MOVABLE, but I'm open to better ways.
>
> We set aside memory for a crash kernel & arm it so that the ZONE_MOVABLE
> region (or whatever non-kernel region) will be set aside as PMEM in the crash
> kernel. This is done with the memmap=nn[KMG]!ss[KMG] kernel command line
> parameter passed to the crash kernel.
>
> So, in my sample 4G VM system, I see:
>
>   # lsmem --split ZONES --output-all
>   RANGE                                  SIZE  STATE REMOVABLE BLOCK NODE   ZONES
>   0x0000000000000000-0x0000000007ffffff  128M online       yes     0    0    None
>   0x0000000008000000-0x00000000bfffffff  2.9G online       yes  1-23    0   DMA32
>   0x0000000100000000-0x000000012fffffff  768M online       yes 32-37    0  Normal
>   0x0000000130000000-0x000000013fffffff  256M online       yes 38-39    0 Movable
>
>   Memory block size:       128M
>   Total online memory:       4G
>   Total offline memory:      0B
>
> so I'll pass "memmap=256M!0x130000000" to the crash kernel.
>
> 2) When we hit a kernel crash, we know (hope?) that the PMEM region we've set
> aside only contains user data, which we don't want to store anyway.

I raised that in a different context already, but such assumptions are
not 100% future-proof IMHO. For example, we might at one point be able
to make user page tables movable and place them there.

But yes, most kernel data structures (which you care about) will
probably never be movable and never end up on these regions.

> We make a
> filesystem in there, and create a kernel crash dump using 'makedumpfile':
>
>   mkfs.ext4 /dev/pmem0
>   mount /dev/pmem0 /mnt
>   makedumpfile -c -d 31 /proc/vmcore /mnt/kdump
>
> We then set up the next full kernel boot to also have this same PMEM region,
> using the same memmap kernel parameter. We reboot back into a full kernel.
>
> 3) The next full kernel will be a normal boot with a full networking stack,
> SSD drivers, disk encryption, etc. We mount up our PMEM filesystem, pull out
> the kdump and either store it somewhere persistent or upload it somewhere. We
> can then unmount the PMEM and reconfigure it back to system ram so that the
> live system isn't missing memory.
>
>   ndctl create-namespace --reconfig=namespace0.0 -m devdax -f
>   daxctl reconfigure-device --mode=system-ram dax0.0
>
> This is the flow I'm trying to support, and have mostly working in a VM,
> except up until now makedumpfile would crash because all the memblock
> structures it needed were in the PMEM area that I had just wiped out by making
> a new filesystem. :)

Thinking out loud (and remembering that some architectures relocate the
crashkernel during kexec, if I am not wrong), maybe the following would
also work and eventually make your setup easier:

1) Don't reserve a crashkernel area in the traditional way; instead,
   reserve that area using CMA. It can be used for MOVABLE allocations.

2) Let kexec load the crashkernel+initrd into ordinary memory only
   (consuming as much as you would need there).

3) On kexec, relocate the crashkernel+initrd into the CMA area
   (overwriting any movable data in there).

4) In makedumpfile, don't dump any memory that falls into the
   crashkernel area. It might already have been overwritten by the
   second kernel.

Maybe that would allow you to make the crashkernel+initrd slightly
bigger (to include SSD drivers etc.) and have a bigger crashkernel area,
because while the crashkernel is armed it will only consume the
crashkernel+initrd size and not the overall crashkernel area size.

If that makes any sense :)

-- 
Cheers,

David / dhildenb
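
Step 4 of the proposal above boils down to subtracting one physical
address range from the set of ranges that would otherwise be dumped. A
minimal Python sketch of that interval arithmetic follows. Everything in
it is an assumption made for illustration: the helper name
exclude_range, the choice of a 256M crashkernel/CMA area at the top of
the Normal zone, and the reuse of the lsmem ranges quoted earlier. It
does not reflect how makedumpfile actually filters memory.

```python
from typing import List, Tuple

# (start, end) physical addresses, both inclusive, like lsmem's RANGE column.
Range = Tuple[int, int]


def exclude_range(ranges: List[Range], hole: Range) -> List[Range]:
    """Return `ranges` with every address inside `hole` removed.

    Hypothetical helper: a range that overlaps the hole is trimmed or
    split; ranges that do not overlap are kept unchanged.
    """
    hole_start, hole_end = hole
    kept: List[Range] = []
    for start, end in ranges:
        if end < hole_start or start > hole_end:
            kept.append((start, end))  # no overlap with the hole
            continue
        if start < hole_start:
            kept.append((start, hole_start - 1))  # part below the hole
        if end > hole_end:
            kept.append((hole_end + 1, end))  # part above the hole
    return kept


# Ranges taken from the lsmem output quoted in this thread.
system_ram: List[Range] = [
    (0x000000000, 0x007FFFFFF),  # None
    (0x008000000, 0x0BFFFFFFF),  # DMA32
    (0x100000000, 0x12FFFFFFF),  # Normal
    (0x130000000, 0x13FFFFFFF),  # Movable
]

# Assumed placement of the crashkernel/CMA area (illustrative only).
crash_area: Range = (0x120000000, 0x12FFFFFFF)

dumpable = exclude_range(system_ram, crash_area)
for start, end in dumpable:
    print(f"{start:#018x}-{end:#018x}")
```

Inclusive end addresses are used only so the values line up with the
lsmem output; a half-open representation would work equally well.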