From: Sudarshan Rajagopalan <quic_sudaraja@quicinc.com>
To: Mike Rapoport
Cc: Andrew Morton, linux-mm@kvack.org, Anshuman Khandual, Suren Baghdasaryan
Subject: Re: [PATCH] mm, page_alloc: check pfn is valid before moving to freelist
Date: Fri, 15 Apr 2022 02:30:52 +0530
On 4/14/2022 2:18 AM, Mike Rapoport wrote:
> On Tue, Apr 12, 2022 at 01:16:23PM -0700, Sudarshan Rajagopalan wrote:
>> Check if pfn is valid or not before moving it to the freelist.
>>
>> There are possible scenarios where a pageblock can have a partial physical
>> hole and a partial part of System RAM. This happens when the base address in
>> the RAM partition table is not aligned to the pageblock size.
>>
>> Example:
>>
>> Say we have these first two entries in the RAM partition table -
>>
>> Base Addr: 0x0000000080000000 Length: 0x0000000058000000
>> Base Addr: 0x00000000E3930000 Length: 0x0000000020000000
>
> I wonder what was done to memory DIMMs to get such an interesting
> physical memory layout...

We have a feature where we carve out some portion of memory in the RAM
partition table, hence we see such base addresses here.

>
>> ...
>>
>> Physical hole: 0xD8000000 - 0xE3930000
>>
>> On a system having 4K as the page size, and hence a pageblock size of 4MB,
>> the base address 0xE3930000 is not aligned to the 4MB pageblock size.
>>
>> Now we will have a pageblock which has a partial physical hole and a partial
>> part of System RAM -
>>
>> Pageblock [0xE3800000 - 0xE3C00000] -
>> 0xE3800000 - 0xE3930000 -- physical hole
>> 0xE3930000 - 0xE3C00000 -- System RAM
>>
>> Now doing __alloc_pages, say we get a valid page with PFN 0xE3B00 from
>> __rmqueue_fallback; we then try to put the other pages from the same
>> pageblock into the freelist as well by calling steal_suitable_fallback().
>>
>> We then search for freepages from the start of the pageblock due to the
>> below code -
>>
>> move_freepages_block(zone, page, migratetype, ...)
>> {
>>     pfn = page_to_pfn(page);
>>     start_pfn = pfn & ~(pageblock_nr_pages - 1);
>>     end_pfn = start_pfn + pageblock_nr_pages - 1;
>>     ...
>> }
>>
>> With a pageblock which has a partial physical hole at the beginning, we will
>> run into PFNs from the physical hole whose struct page is not initialized
>> and is invalid, and the system would crash as we operate on the invalid
>> struct page to find out if the page is in Buddy or LRU or not.
>
> struct page must be initialized and valid even for holes in the physical
> memory. When a pageblock spans both existing memory and a hole, the struct
> pages for the "hole" part should be marked as PG_Reserved.
>
> If you see that struct pages for memory holes exist but are invalid, we
> should solve the underlying issue that causes the wrong struct page contents.

We are using the 5.15 kernel on an arm64 platform. For the pages belonging to
the physical hole, I don't see the pages being initialized.

Looking into the memmap_init code, we call init_unavailable_range() to
initialize the pages that belong to holes in the zone. But again, we only do
this for PFNs that are valid, according to the below code snippet -

init_unavailable_range()
{
6667     for (pfn = spfn; pfn < epfn; pfn++) {
6668         if (!pfn_valid(ALIGN_DOWN(pfn, pageblock_nr_pages))) {
6669             pfn = ALIGN_DOWN(pfn, pageblock_nr_pages)
6670                 + pageblock_nr_pages - 1;
6671             continue;
6672         }

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/mm/page_alloc.c?h=v5.15.34#n6668

With the arm64-specific definition of pfn_valid(), a PFN which isn't present
in the RAM partition table (i.e. belongs to a physical hole) makes pfn_valid()
return FALSE. Hence we don't initialize any pages that belong to the physical
hole here.

Or am I missing anything in the kernel that initializes the pages belonging to
physical holes too? If so, could you point me to that?
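To make the arithmetic concrete, here is a small user-space sketch (not kernel
code; it just mimics the pageblock math above and assumes 4K pages, so
pageblock_nr_pages is 1024). It shows that the pageblock start PFN computed by
move_freepages_block() for PFN 0xE3B00 lands inside the hole, and that this is
the same pageblock-aligned PFN that init_unavailable_range() feeds to
pfn_valid() via ALIGN_DOWN() -

#include <stdio.h>
#include <stdbool.h>

#define PAGE_SHIFT          12
#define PAGEBLOCK_NR_PAGES  1024UL  /* 4MB pageblock with 4K pages */

/* The hole from the example RAM partition table: 0xD8000000 - 0xE3930000 */
static bool pfn_in_hole(unsigned long pfn)
{
    return pfn >= (0xD8000000UL >> PAGE_SHIFT) &&
           pfn <  (0xE3930000UL >> PAGE_SHIFT);
}

int main(void)
{
    unsigned long pfn = 0xE3B00;  /* page handed out via __rmqueue_fallback */
    unsigned long start_pfn = pfn & ~(PAGEBLOCK_NR_PAGES - 1);
    unsigned long end_pfn = start_pfn + PAGEBLOCK_NR_PAGES - 1;
    unsigned long p;

    /* Prints "pageblock: 0xE3800 - 0xE3BFF" */
    printf("pageblock: 0x%lX - 0x%lX\n", start_pfn, end_pfn);

    for (p = start_pfn; p <= end_pfn; p++) {
        if (pfn_in_hole(p)) {
            /*
             * The first hole PFN scanned is 0xE3800, i.e. the
             * pageblock-aligned start itself. On 5.15 arm64,
             * pfn_valid(0xE3800) is FALSE, so the
             * init_unavailable_range() loop above skips this whole
             * block and its struct pages stay uninitialized.
             */
            printf("first hole pfn scanned: 0x%lX\n", p);
            break;
        }
    }
    return 0;
}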
I see that in later kernel versions the arm64-specific definition of
pfn_valid() is removed by Anshuman. With that change, PFNs in the hole would
have pfn_valid() return TRUE and we would then initialize the pages in the
holes as well. But this patch was reverted by Will Deacon in the 5.15 kernel.

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/arch/arm64/mm?h=v5.17.3&id=3de360c3fdb34fbdbaf6da3af94367d3fded95d3

>> [ 107.629453][ T9688] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
>> [ 107.639214][ T9688] Mem abort info:
>> [ 107.642829][ T9688] ESR = 0x96000006
>> [ 107.646696][ T9688] EC = 0x25: DABT (current EL), IL = 32 bits
>> [ 107.652878][ T9688] SET = 0, FnV = 0
>> [ 107.656751][ T9688] EA = 0, S1PTW = 0
>> [ 107.660705][ T9688] FSC = 0x06: level 2 translation fault
>> [ 107.666455][ T9688] Data abort info:
>> [ 107.670151][ T9688] ISV = 0, ISS = 0x00000006
>> [ 107.674827][ T9688] CM = 0, WnR = 0
>> [ 107.678615][ T9688] user pgtable: 4k pages, 39-bit VAs, pgdp=000000098a237000
>> [ 107.685970][ T9688] [0000000000000000] pgd=0800000987170003, p4d=0800000987170003, pud=0800000987170003, pmd=0000000000000000
>> [ 107.697582][ T9688] Internal error: Oops: 96000006 [#1] PREEMPT SMP
>>
>> [ 108.209839][ T9688] pc : move_freepages_block+0x174/0x27c
>
> can you post fadd2line for this address?

faddr2line didn't work quite well. I used aarch64-linux-android-addr2line on
the address (move_freepages_block+0x174) and it points to arch_test_bit() at
include/asm-generic/bitops/non-atomic.h:118.

On T32, using the stacktrace, it points to PageLRU() in the below code under
move_freepages() -

move_freepages()
{
2520             if (num_movable &&
2521                     (PageLRU(page) || __PageMovable(page)))

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/mm/page_alloc.c?h=v5.15.34#n2521

The struct page* contents of the page are invalid, including page->lru, where
the system crashes doing PageLRU(page).
(struct page)0xFFFFFFFE018E0000 = (
    flags = 0,
    lru = (next = 0x0010000000000021, prev = 0x0042000000000004),
    mapping = 0x0,
    index = 549755813904,
    private = 0,
    pp_magic = 4503599627370529,
    pp = 0x0042000000000004,
    _pp_mapping_pad = 0,
    dma_addr = 549755813904,
    dma_addr_upper = 0,
    pp_frag_count = (counter = 0),
    slab_list = (next = 0x0010000000000021, prev = 0x0042000000000004),
    next = 0x0010000000000021,

>
>> [ 108.215407][ T9688] lr : steal_suitable_fallback+0x20c/0x398
>>
>> [ 108.305908][ T9688] Call trace:
>> [ 108.309151][ T9688] move_freepages_block+0x174/0x27c [PageLRU]
>> [ 108.314359][ T9688] steal_suitable_fallback+0x20c/0x398
>> [ 108.319826][ T9688] rmqueue_bulk+0x250/0x934
>> [ 108.324325][ T9688] rmqueue_pcplist+0x178/0x2ac
>> [ 108.329086][ T9688] rmqueue+0x5c/0xc10
>> [ 108.333048][ T9688] get_page_from_freelist+0x19c/0x430
>> [ 108.338430][ T9688] __alloc_pages+0x134/0x424
>> [ 108.343017][ T9688] page_cache_ra_unbounded+0x120/0x324
>> [ 108.348494][ T9688] do_sync_mmap_readahead+0x1b0/0x234
>> [ 108.353878][ T9688] filemap_fault+0xe0/0x4c8
>> [ 108.358375][ T9688] do_fault+0x168/0x6cc
>> [ 108.362518][ T9688] handle_mm_fault+0x5c4/0x848
>> [ 108.367280][ T9688] do_page_fault+0x3fc/0x5d0
>> [ 108.371867][ T9688] do_translation_fault+0x6c/0x1b0
>> [ 108.376985][ T9688] do_mem_abort+0x68/0x10c
>> [ 108.381389][ T9688] el0_ia+0x50/0xbc
>> [ 108.385175][ T9688] el0t_32_sync_handler+0x88/0xbc
>> [ 108.390208][ T9688] el0t_32_sync+0x1b8/0x1bc
>>
>> Hence, avoid operating on invalid pages within the same pageblock by
>> checking if the pfn is valid or not.
>>
>> Signed-off-by: Sudarshan Rajagopalan
>> Fixes: 4c7b9896621be ("mm: remove pfn_valid_within() and CONFIG_HOLES_IN_ZONE")
>> Cc: Mike Rapoport
>
> For now the patch looks like a band-aid for a more fundamental bug, so
>
> NAKED-by: Mike Rapoport
>

This patch may look like a workaround, but yes, I think there is a fundamental
problem where the kernel treats a pageblock which has partial holes and
partial System RAM as a valid pageblock, and this occurs when a base address
in the RAM partition table is not aligned to the pageblock size (see the small
alignment sketch after the quoted patch below). This fundamental problem needs
to be fixed, and I'm looking for your suggestions.

>> Cc: Anshuman Khandual
>> Cc: Suren Baghdasaryan
>> ---
>>  mm/page_alloc.c | 5 +++++
>>  1 file changed, 5 insertions(+)
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 6e5b448..e87aa053 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -2521,6 +2521,11 @@ static int move_freepages(struct zone *zone,
>>  	int pages_moved = 0;
>>
>>  	for (pfn = start_pfn; pfn <= end_pfn;) {
>> +		if (!pfn_valid(pfn)) {
>> +			pfn++;
>> +			continue;
>> +		}
>> +
>>  		page = pfn_to_page(pfn);
>>  		if (!PageBuddy(page)) {
>>  			/*
>> --
>> 2.7.4
>>
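For reference, the misalignment itself is easy to see. A trivial sketch (again
user-space only, not kernel code, using the 4MB pageblock size that comes with
4K pages) shows that the second base address in the example RAM partition
table starts 0x130000 bytes into its pageblock, and everything in the block
below that offset is the physical hole -

#include <stdio.h>

#define PAGEBLOCK_SIZE  (4UL << 20)  /* 4MB pageblock with 4K pages */

int main(void)
{
    unsigned long base = 0xE3930000UL;  /* second RAM partition table entry */

    /*
     * 0xE3930000 % 4MB = 0x130000, so this entry begins partway into
     * pageblock [0xE3800000 - 0xE3C00000); the range below the printed
     * offset is the physical hole described above.
     */
    printf("offset into pageblock: 0x%lX\n", base % PAGEBLOCK_SIZE);
    return 0;
}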