Message-ID: <73db2622-4985-2f93-a118-d7d249094239@redhat.com>
Date: Thu, 27 Jul 2023 11:41:54 +0200
Subject: Re: collision between ZONE_MOVABLE and memblock allocations
To: Michal Hocko
Cc: Ross Zwisler, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
 Mike Rapoport, Andrew Morton, Matthew Wilcox, Mel Gorman, Vlastimil Babka
References: <20230718220106.GA3117638@google.com>
 <20230719224821.GC3528218@google.com>
 <9ef757dc-da4b-9fa1-de84-1328a74f18a7@redhat.com>
From: David Hildenbrand
Organization: Red Hat

On 27.07.23 10:18, Michal Hocko wrote:
> On Wed 26-07-23 10:44:21, David Hildenbrand wrote:
>> On 20.07.23 00:48, Ross Zwisler wrote:
>>> On Wed, Jul 19, 2023 at 08:14:48AM +0200, Michal Hocko wrote:
>>>> On Tue 18-07-23 16:01:06, Ross Zwisler wrote:
>>>> [...]
>>>>> I do think that we need to fix this collision between ZONE_MOVABLE and memmap
>>>>> allocations, because this issue essentially makes the movablecore= kernel
>>>>> command line parameter useless in many cases, as the ZONE_MOVABLE region it
>>>>> creates will often actually be unmovable.
>>>>
>>>> movablecore is kinda hack and I would be more inclined to get rid of it
>>>> rather than build more into it. Could you be more specific about your
>>>> use case?
>>>
>>> The problem that I'm trying to solve is that I'd like to be able to get kernel
>>> core dumps off machines (chromebooks) so that we can debug crashes. Because
>>> the memory used by the crash kernel ("crashkernel=" kernel command line
>>> option) is consumed the entire time the machine is booted, there is a strong
>>> motivation to keep the crash kernel as small and as simple as possible. To
>>> this end I'm trying to get away without SSD drivers, not having to worry about
>>> encryption on the SSDs, etc.
>>
>> Okay, so you intend to keep the crashkernel area as small as possible.
>>
>>>
>>> So, the rough plan right now is:
>>>
>>> 1) During boot set aside some memory that won't contain kernel allocations.
>>> I'm trying to do this now with ZONE_MOVABLE, but I'm open to better ways.
>>>
>>> We set aside memory for a crash kernel & arm it so that the ZONE_MOVABLE
>>> region (or whatever non-kernel region) will be set aside as PMEM in the crash
>>> kernel. This is done with the memmap=nn[KMG]!ss[KMG] kernel command line
>>> parameter passed to the crash kernel.
>>>
>>> So, in my sample 4G VM system, I see:
>>>
>>>   # lsmem --split ZONES --output-all
>>>   RANGE                                  SIZE  STATE REMOVABLE BLOCK NODE ZONES
>>>   0x0000000000000000-0x0000000007ffffff  128M online       yes     0    0 None
>>>   0x0000000008000000-0x00000000bfffffff  2.9G online       yes  1-23    0 DMA32
>>>   0x0000000100000000-0x000000012fffffff  768M online       yes 32-37    0 Normal
>>>   0x0000000130000000-0x000000013fffffff  256M online       yes 38-39    0 Movable
>>>
>>>   Memory block size: 128M
>>>   Total online memory: 4G
>>>   Total offline memory: 0B
>>>
>>> so I'll pass "memmap=256M!0x130000000" to the crash kernel.
>>>
>>> 2) When we hit a kernel crash, we know (hope?) that the PMEM region we've set
>>> aside only contains user data, which we don't want to store anyway.
>>
>> I raised that in different context already, but such assumptions are not
>> 100% future proof IMHO. For example, we might at one point be able to make
>> user page tables movable and place them on there.
>>
>> But yes, most kernel data structures (which you care about) will probably
>> never be movable and never end up on these regions.
>>
>>> We make a
>>> filesystem in there, and create a kernel crash dump using 'makedumpfile':
>>>
>>>   mkfs.ext4 /dev/pmem0
>>>   mount /dev/pmem0 /mnt
>>>   makedumpfile -c -d 31 /proc/vmcore /mnt/kdump
>>>
>>> We then set up the next full kernel boot to also have this same PMEM region,
>>> using the same memmap kernel parameter. We reboot back into a full kernel.
>>>
>>> 3) The next full kernel will be a normal boot with a full networking stack,
>>> SSD drivers, disk encryption, etc. We mount up our PMEM filesystem, pull out
>>> the kdump and either store it somewhere persistent or upload it somewhere. We
>>> can then unmount the PMEM and reconfigure it back to system ram so that the
>>> live system isn't missing memory.
>>>
>>>   ndctl create-namespace --reconfig=namespace0.0 -m devdax -f
>>>   daxctl reconfigure-device --mode=system-ram dax0.0
>>>
>>> This is the flow I'm trying to support, and have mostly working in a VM,
>>> except up until now makedumpfile would crash because all the memblock
>>> structures it needed were in the PMEM area that I had just wiped out by making
>>> a new filesystem. :)
>>
>>
>> Thinking out loud (and remembering that some architectures relocate the
>> crashkernel during kexec, if I am not wrong), maybe the following would also
>> work and make your setup eventually easier:
>>
>> 1) Don't reserve a crashkernel area in the traditional way, instead reserve
>> that area using CMA. It can be used for MOVABLE allocations.
>>
>> 2) Let kexec load the crashkernel+initrd into ordinary memory only
>> (consuming as much as you would need there).
>>
>> 3) On kexec, relocate the crashkernel+initrd into the CMA area (overwriting
>> any movable data in there)
>>
>> 4) In makedumpfile, don't dump any memory that falls into the crashkernel
>> area. It might already have been overwritten by the second kernel
>
> This is more or less what Jiri is looking into.
>

Ah, very nice.

-- 
Cheers,

David / dhildenb
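
For reference, a minimal sketch of the arithmetic behind the
memmap=nn[KMG]!ss[KMG] argument in the quoted plan. It assumes the range is
copied from the RANGE column of lsmem (start and inclusive end) and covers a
whole number of MiB. memmap_arg is a hypothetical helper name used only for
illustration; it is not part of lsmem, ndctl or any other tool mentioned in
this thread.

```shell
#!/bin/sh
# Hypothetical helper (an assumption for illustration, not an existing tool):
# build a memmap=nn[KMG]!ss[KMG] argument from an inclusive physical range,
# given as the two hex addresses of one lsmem RANGE entry.
memmap_arg() (
    start=$1    # first byte of the range, e.g. 0x0000000130000000
    end=$2      # last byte of the range (inclusive)
    mib=$(( 1024 * 1024 ))
    size=$(( $end - $start + 1 ))
    # This sketch only handles ranges that are a whole number of MiB.
    if [ $(( size % mib )) -ne 0 ]; then
        echo "range is not a multiple of 1M" >&2
        exit 1
    fi
    # Size in MiB, then '!' and the start address in hex.
    printf 'memmap=%dM!0x%x\n' $(( size / mib )) $(( $start ))
)

# Movable row from the sample lsmem output quoted above.
memmap_arg 0x0000000130000000 0x000000013fffffff
```

With the Movable row from the sample output this prints
memmap=256M!0x130000000, the value passed to the crash kernel in the quoted
plan. The function body runs in a subshell, so its variables do not leak into
the calling script.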