From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Tue, 18 Jul 2023 16:01:06 -0600
From: Ross Zwisler <zwisler@google.com>
To: linux-kernel@vger.kernel.org, linux-mm@kvack.org
Cc: Mike Rapoport, Andrew Morton, Matthew Wilcox, Mel Gorman, Michal Hocko, Vlastimil Babka, David Hildenbrand
Subject: collision between ZONE_MOVABLE and memblock allocations
Message-ID: <20230718220106.GA3117638@google.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

Hello,

I've been trying to use the 'movablecore=' kernel command line option to
create a ZONE_MOVABLE memory zone on my x86_64 systems, and have noticed
that offlining the resulting ZONE_MOVABLE area consistently fails in my
setups because that zone contains unmovable pages.
My testing has been in an x86_64 QEMU VM with a single NUMA node and 4G, 8G
or 16G of memory, all of which fail 100% of the time.

Digging into it a bit, these unmovable pages are Reserved pages which were
allocated in early boot by the memblock allocator.  Many of these
allocations are for data structures for the SPARSEMEM memory model,
including 'struct mem_section' objects.

These memblock allocations can be tracked by setting the 'memblock=debug'
kernel command line parameter, and are marked as reserved in:

  memmap_init_reserved_pages()
    reserve_bootmem_region()

With the command line params 'movablecore=256M memblock=debug' and a
v6.5.0-rc2 kernel I get the following on my 4G system:

  # lsmem --split ZONES --output-all
  RANGE                                  SIZE  STATE REMOVABLE BLOCK NODE   ZONES
  0x0000000000000000-0x0000000007ffffff  128M online       yes     0    0    None
  0x0000000008000000-0x00000000bfffffff  2.9G online       yes  1-23    0   DMA32
  0x0000000100000000-0x000000012fffffff  768M online       yes 32-37    0  Normal
  0x0000000130000000-0x000000013fffffff  256M online       yes 38-39    0 Movable

  Memory block size:       128M
  Total online memory:       4G
  Total offline memory:      0B

And when I try to offline memory block 39, I get:

  # echo 0 > /sys/devices/system/memory/memory39/online
  bash: echo: write error: Device or resource busy

with dmesg saying:

  [   57.439849] page:0000000076a3e320 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x13ff00
  [   57.444073] flags: 0x1fffffc0001000(reserved|node=0|zone=3|lastcpupid=0x1fffff)
  [   57.447301] page_type: 0xffffffff()
  [   57.448754] raw: 001fffffc0001000 ffffdd6384ffc008 ffffdd6384ffc008 0000000000000000
  [   57.450383] raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
  [   57.452011] page dumped because: unmovable page

Looking back at the memblock allocations, I can see that the physical
address for pfn:0x13ff00 was used in a memblock allocation:

  [    0.395180] memblock_reserve: [0x000000013ff00000-0x000000013ffbffff] memblock_alloc_range_nid+0xe0/0x150

The full dmesg output can
be found here: https://pastebin.com/cNztqa4u

The 'movablecore=' command line parameter is handled in
'find_zone_movable_pfns_for_nodes()', which decides where ZONE_MOVABLE
should start and end.  Currently ZONE_MOVABLE is always located at the end
of a NUMA node.

The issue is that the memblock allocator and the processing of the
movablecore= command line parameter don't know about one another, and in my
x86_64 testing they both always use memory at the end of the NUMA node and
have collisions.

From several comments in the code I believe that this is a known issue:

https://elixir.bootlin.com/linux/v6.5-rc2/source/mm/page_isolation.c#L59

        /*
         * Both, bootmem allocations and memory holes are marked
         * PG_reserved and are unmovable. We can even have unmovable
         * allocations inside ZONE_MOVABLE, for example when
         * specifying "movablecore".
         */

https://elixir.bootlin.com/linux/v6.5-rc2/source/include/linux/mmzone.h#L765

 * 2. memblock allocations: kernelcore/movablecore setups might create
 *    situations where ZONE_MOVABLE contains unmovable allocations
 *    after boot. Memory offlining and allocations fail early.

We check for these unmovable pages by scanning for 'PageReserved()' in the
area we are trying to offline, which happens in has_unmovable_pages().

Interestingly, the boot timing works out like this:

1. Allocate memblock areas to set up the SPARSEMEM model.

  [    0.369990] Call Trace:
  [    0.370404]  <TASK>
  [    0.370759]  ? dump_stack_lvl+0x43/0x60
  [    0.371410]  ? sparse_init_nid+0x2dc/0x560
  [    0.372116]  ? sparse_init+0x346/0x450
  [    0.372755]  ? paging_init+0xa/0x20
  [    0.373349]  ? setup_arch+0xa6a/0xfc0
  [    0.373970]  ? slab_is_available+0x5/0x20
  [    0.374651]  ? start_kernel+0x5e/0x770
  [    0.375290]  ? x86_64_start_reservations+0x14/0x30
  [    0.376109]  ? x86_64_start_kernel+0x71/0x80
  [    0.376835]  ? secondary_startup_64_no_verify+0x167/0x16b
  [    0.377755]  </TASK>

2. Process the movablecore= kernel command line parameter and set up
   memory zones.

  [    0.489382] Call Trace:
  [    0.489818]  <TASK>
  [    0.490187]  ?
 dump_stack_lvl+0x43/0x60
  [    0.490873]  ? free_area_init+0x115/0xc80
  [    0.491588]  ? __printk_cpu_sync_put+0x5/0x30
  [    0.492354]  ? dump_stack_lvl+0x48/0x60
  [    0.493002]  ? sparse_init_nid+0x2dc/0x560
  [    0.493697]  ? zone_sizes_init+0x60/0x80
  [    0.494361]  ? setup_arch+0xa6a/0xfc0
  [    0.494981]  ? slab_is_available+0x5/0x20
  [    0.495674]  ? start_kernel+0x5e/0x770
  [    0.496312]  ? x86_64_start_reservations+0x14/0x30
  [    0.497123]  ? x86_64_start_kernel+0x71/0x80
  [    0.497847]  ? secondary_startup_64_no_verify+0x167/0x16b
  [    0.498768]  </TASK>

3. Mark memblock areas as Reserved.

  [    0.761136] Call Trace:
  [    0.761534]  <TASK>
  [    0.761876]  dump_stack_lvl+0x43/0x60
  reserve_bootmem_region+0x1e/0x170
  memblock_free_all+0xe3/0x250
  ? swiotlb_init_io_tlb_mem.constprop.0+0x11a/0x130
  ? swiotlb_init_remap+0x195/0x2c0
  mem_init+0x19/0x1b0
  mm_core_init+0x9c/0x3d0
  start_kernel+0x264/0x770
  x86_64_start_reservations+0x14/0x30
  x86_64_start_kernel+0x71/0x80
  secondary_startup_64_no_verify+0x167/0x16b
  [    0.769534]  </TASK>

So, during ZONE_MOVABLE setup we currently can't do the same
has_unmovable_pages() scan looking for PageReserved() to check for overlap,
because at that point the pages have not yet been marked as Reserved.

I do think that we need to fix this collision between ZONE_MOVABLE and
memblock allocations, because this issue essentially makes the movablecore=
kernel command line parameter useless in many cases: the ZONE_MOVABLE
region it creates will often actually be unmovable.

Here are the options I currently see for resolution:

1. Change the way ZONE_MOVABLE memory is allocated so that it is allocated
   from the beginning of the NUMA node instead of the end.  This should fix
   my use case, but is still prone to breakage in other configurations
   (# of NUMA nodes, other architectures) where ZONE_MOVABLE and memblock
   allocations might overlap.  I think that this should be relatively
   straightforward and low risk, though.

2.
 Make the code which processes the movablecore= command line option aware
   of the memblock allocations, and have it choose a region for
   ZONE_MOVABLE which does not have these allocations.  This might be done
   by checking for PageReserved() as we do with offlining memory, though
   that will take some boot time reordering, or we'll have to figure out
   the overlap in another way.  This may also result in us having two
   ZONE_NORMAL zones for a given NUMA node, with a ZONE_MOVABLE section in
   between them.  I'm not sure if this is allowed?  If we can get it
   working, this seems like the most correct solution to me, but also the
   most difficult and risky because it involves significant changes in the
   code for memory setup at early boot.

Am I missing anything, are there other solutions we should consider, or do
you have an opinion on which solution we should pursue?

Thanks,
- Ross