Date: Wed, 4 Jan 2023 17:43:51 -0800 (PST)
From: David Rientjes
To: Aaron Thompson
cc: linux-mm@kvack.org, Mike Rapoport, "H. Peter Anvin", Alexander Potapenko, Andrew Morton, Andy Shevchenko, Ard Biesheuvel, Borislav Petkov, Darren Hart, Dave Hansen, Dmitry Vyukov, Ingo Molnar, Marco Elver, Thomas Gleixner, kasan-dev@googlegroups.com, linux-efi@vger.kernel.org, linux-kernel@vger.kernel.org, platform-driver-x86@vger.kernel.org, x86@kernel.org
Subject: Re: [PATCH 0/1] Pages not released from memblock to the buddy allocator
In-Reply-To: <010101857bbc3a41-173240b3-9064-42ef-93f3-482081126ec2-000000@us-west-2.amazonses.com>
Message-ID: <30478b4a-870b-bf48-76d0-a236a40e7674@google.com>

On Wed, 4 Jan 2023, Aaron Thompson wrote:

> Hi all,
>
> (I've CC'ed the KMSAN and x86 EFI maintainers as an FYI; the only code change
> I'm proposing is in memblock.)
>
> I've run into a case where pages are not released from memblock to the buddy
> allocator. If deferred struct page init is enabled, and memblock_free_late() is
> called before page_alloc_init_late() has run, and the pages being freed are in
> the deferred init range, then the pages are never released. memblock_free_late()
> calls memblock_free_pages() which only releases the pages if they are not in the
> deferred range.
> That is correct for free pages because they will be initialized
> and released by page_alloc_init_late(), but memblock_free_late() is dealing with
> reserved pages. If memblock_free_late() doesn't release those pages, they will
> forever be reserved. All reserved pages were initialized by memblock_free_all(),
> so I believe the fix is to simply have memblock_free_late() call
> __free_pages_core() directly instead of memblock_free_pages().
>
> In addition, there was a recent change (3c20650982609 "init: kmsan: call KMSAN
> initialization routines") that added a call to kmsan_memblock_free_pages() in
> memblock_free_pages(). It looks to me like it would also be incorrect to make
> that call in the memblock_free_late() case, because the KMSAN metadata was
> already initialized for all reserved pages by kmsan_init_shadow(), which runs
> before memblock_free_all(). Having memblock_free_late() call __free_pages_core()
> directly also fixes this issue.
>
> I encountered this issue when I tried to switch some x86_64 VMs I was running
> from BIOS boot to EFI boot. The x86 EFI code reserves all EFI boot services
> ranges via memblock_reserve() (part of setup_arch()), and it frees them later
> via memblock_free_late() (part of efi_enter_virtual_mode()). The EFI
> implementation of the VM I was attempting this on, an Amazon EC2 t3.micro
> instance, maps north of 170 MB in boot services ranges that happen to fall in
> the deferred init range. I certainly noticed when that much memory went missing
> on a 1 GB VM.
>
> I've tested the patch on EC2 instances, qemu/KVM VMs with OVMF, and some real
> x86_64 EFI systems, and they all look good to me. However, the physical systems
> that I have don't actually trigger this issue because they all have more than 4
> GB of RAM, so their deferred init range starts above 4 GB (it's always in the
> highest zone and ZONE_DMA32 ends at 4 GB) while their EFI boot services mappings
> are below 4 GB.
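
As a rough illustration of the release path described in the report, here is a small userspace model in C. It is only a sketch: every `toy_*` name, the `struct toy_zone` layout, and the single `first_deferred_pfn` threshold are hypothetical stand-ins chosen for this example, not kernel code. It assumes the deferred check can be reduced to a simple PFN comparison.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins for kernel state; every name here is hypothetical. */
struct toy_zone {
	unsigned long first_deferred_pfn; /* PFNs at or above this are "deferred" */
	unsigned long managed_pages;      /* pages handed to the toy buddy allocator */
};

/* Models __free_pages_core(): unconditionally release one page. */
void toy_free_pages_core(struct toy_zone *z, unsigned long pfn)
{
	(void)pfn; /* the model only counts pages, it does not track them */
	z->managed_pages++;
}

/* Models memblock_free_pages(): skip pages that fall in the deferred range. */
void toy_memblock_free_pages(struct toy_zone *z, unsigned long pfn)
{
	if (pfn >= z->first_deferred_pfn)
		return; /* assumed to be picked up later by deferred init */
	toy_free_pages_core(z, pfn);
}

/*
 * Models memblock_free_late() over [start_pfn, end_pfn).
 * direct == false: go through the deferred-range check (reported behaviour).
 * direct == true:  call the core release directly (proposed behaviour).
 * Returns how many pages were actually released.
 */
unsigned long toy_free_late(struct toy_zone *z, unsigned long start_pfn,
			    unsigned long end_pfn, bool direct)
{
	unsigned long before = z->managed_pages;
	unsigned long pfn;

	assert(start_pfn <= end_pfn);
	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
		if (direct)
			toy_free_pages_core(z, pfn);
		else
			toy_memblock_free_pages(z, pfn);
	}
	return z->managed_pages - before;
}
```

In this model, with `first_deferred_pfn` set to 100, freeing the range [90, 110) through the checked path releases only the 10 pages below the threshold, while the direct path releases all 20. Reserved pages are never revisited by deferred init, so the 10 skipped pages stay lost.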
>
> Deferred struct page init can't be enabled on x86_32 so those systems are
> unaffected. I haven't found any other code paths that would trigger this issue,
> though I can't promise that there aren't any. I did run with this patch on an
> arm64 VM as a sanity check, but memblock=debug didn't show any calls to
> memblock_free_late() so that system was unaffected as well.
>
> I am guessing that this change should also go to the stable kernels but it may not
> apply cleanly (__free_pages_core() was __free_pages_boot_core() and
> memblock_free_pages() was __free_pages_bootmem() when this issue was first
> introduced). I haven't gone through that process before so please let me know if
> I can help with that.
>
> This is the end result on an EC2 t3.micro instance booting via EFI:
>
> v6.2-rc2:
>   # grep -E 'Node|spanned|present|managed' /proc/zoneinfo
>   Node 0, zone      DMA
>           spanned  4095
>           present  3999
>           managed  3840
>   Node 0, zone    DMA32
>           spanned  246652
>           present  245868
>           managed  178867
>
> v6.2-rc2 + patch:
>   # grep -E 'Node|spanned|present|managed' /proc/zoneinfo
>   Node 0, zone      DMA
>           spanned  4095
>           present  3999
>           managed  3840
>   Node 0, zone    DMA32
>           spanned  246652
>           present  245868
>           managed  222816
>

The above before + after seems like useful information to include in the commit
description of the change.
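
To put a number on that before/after difference, the arithmetic can be sketched in a few lines of C. This is only an illustration: the helper names are made up for this example, and the 4 KiB page size is an assumption that matches x86_64 base pages.

```c
#include <assert.h>
#include <stdio.h>

/* Assumption: 4 KiB base pages, as on x86_64. */
#define PAGE_SIZE_BYTES 4096UL

/* Pages gained in a zone between two "managed" readings from /proc/zoneinfo. */
unsigned long managed_gain_pages(unsigned long before, unsigned long after)
{
	assert(after >= before);
	return after - before;
}

/* Convert a page count to KiB under the page-size assumption above. */
unsigned long pages_to_kib(unsigned long pages)
{
	return pages * (PAGE_SIZE_BYTES / 1024UL);
}

/* Print a one-line summary that could be pasted into a changelog. */
void print_managed_gain(const char *zone, unsigned long before,
			unsigned long after)
{
	unsigned long pages = managed_gain_pages(before, after);

	printf("zone %s: +%lu managed pages (%lu KiB)\n",
	       zone, pages, pages_to_kib(pages));
}
```

With the DMA32 figures quoted above, `managed_gain_pages(178867, 222816)` gives 43949 pages and `pages_to_kib(43949)` gives 175796 KiB, roughly 171.7 MiB. That lines up with the "north of 170 MB" of boot services ranges mentioned in the report.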