From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-6.9 required=3.0 tests=DKIMWL_WL_HIGH,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_PATCH, MAILING_LIST_MULTI,SIGNED_OFF_BY,SPF_HELO_NONE,SPF_PASS autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 55819C43603 for ; Wed, 11 Dec 2019 16:32:16 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 07C3E2077B for ; Wed, 11 Dec 2019 16:32:16 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="Vyk5RWos" DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 07C3E2077B Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=redhat.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id ACC9D6B32DD; Wed, 11 Dec 2019 11:32:15 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id A7DD56B32DE; Wed, 11 Dec 2019 11:32:15 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 9BBE96B32DF; Wed, 11 Dec 2019 11:32:15 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0251.hostedemail.com [216.40.44.251]) by kanga.kvack.org (Postfix) with ESMTP id 8930C6B32DD for ; Wed, 11 Dec 2019 11:32:15 -0500 (EST) Received: from smtpin09.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay02.hostedemail.com (Postfix) with SMTP id 3DF912C89 for ; Wed, 11 Dec 2019 16:32:15 +0000 (UTC) X-FDA: 76253403030.09.maid55_8807567c19958 X-HE-Tag: maid55_8807567c19958 X-Filterd-Recvd-Size: 7872 Received: from us-smtp-1.mimecast.com (us-smtp-delivery-1.mimecast.com [205.139.110.120]) by imf01.hostedemail.com (Postfix) with ESMTP for ; Wed, 11 Dec 2019 16:32:14 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1576081934; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=0gJwfxwyql/ZBro6j3PHhc8Sr1IUHu9YKmw5lChLB3M=; b=Vyk5RWosPVx2GohOeuC6hibMvwNlRh5JllYUbdNQW2WE43b9iLkIfuwn7zeLi5fDT24VQ7 lYZUU6GqRgK4p0++kUPs9a+RPrvVntT4+VbywbVX8Zy2RxllD3mCY8AMr5cCaPk7KGkYGu FyvsmfJ6K2bXyPxLGOFKmA68YCZo2NA= Received: from mimecast-mx01.redhat.com (mimecast-mx01.redhat.com [209.132.183.4]) (Using TLS) by relay.mimecast.com with ESMTP id us-mta-188-wmcTqCOIMYGjeEvJ5IZrFg-1; Wed, 11 Dec 2019 11:32:12 -0500 X-MC-Unique: wmcTqCOIMYGjeEvJ5IZrFg-1 Received: from smtp.corp.redhat.com (int-mx01.intmail.prod.int.phx2.redhat.com [10.5.11.11]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx01.redhat.com (Postfix) with ESMTPS id 2568E8C2B48; Wed, 11 Dec 2019 16:32:11 +0000 (UTC) Received: from t480s.redhat.com (ovpn-117-148.ams2.redhat.com [10.36.117.148]) by smtp.corp.redhat.com (Postfix) with ESMTP id 89B7F605FF; Wed, 11 Dec 2019 16:32:08 +0000 (UTC) From: David Hildenbrand To: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org, David Hildenbrand , Daniel Jordan , stable@vger.kernel.org, Naoya Horiguchi , Pavel Tatashin , Andrew Morton , Steven Sistare , Michal Hocko , Bob Picco , Oscar Salvador Subject: [PATCH v2 1/3] mm: fix uninitialized memmaps on a partially populated last section Date: Wed, 11 Dec 2019 17:31:59 +0100 Message-Id: <20191211163201.17179-2-david@redhat.com> In-Reply-To: <20191211163201.17179-1-david@redhat.com> References: <20191211163201.17179-1-david@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 2.79 on 10.5.11.11 Content-Transfer-Encoding: quoted-printable X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: If max_pfn is not aligned to a section boundary, we can easily run into BUGs. This can e.g., be triggered on x86-64 under QEMU by specifying a memory size that is not a multiple of 128MB (e.g., 4097MB, but also 4160MB). I was told that on real HW, we can easily have this scenario (esp., one of the main reasons sub-section hotadd of devmem was added). The issue is, that we have a valid memmap (pfn_valid()) for the whole section, and the whole section will be marked "online". pfn_to_online_page() will succeed, but the memmap contains garbage. E.g., doing a "./page-types -r -a 0x144001" when QEMU was started with "-m 4160M" - (see tools/vm/page-types.c): [ 200.476376] BUG: unable to handle page fault for address: ffffffffffff= fffe [ 200.477500] #PF: supervisor read access in kernel mode [ 200.478334] #PF: error_code(0x0000) - not-present page [ 200.479076] PGD 59614067 P4D 59614067 PUD 59616067 PMD 0 [ 200.479557] Oops: 0000 [#4] SMP NOPTI [ 200.479875] CPU: 0 PID: 603 Comm: page-types Tainted: G D W = 5.5.0-rc1-next-20191209 #93 [ 200.480646] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIO= S rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu4 [ 200.481648] RIP: 0010:stable_page_flags+0x4d/0x410 [ 200.482061] Code: f3 ff 41 89 c0 48 b8 00 00 00 00 01 00 00 00 45 84 c= 0 0f 85 cd 02 00 00 48 8b 53 08 48 8b 2b 48f [ 200.483644] RSP: 0018:ffffb139401cbe60 EFLAGS: 00010202 [ 200.484091] RAX: fffffffffffffffe RBX: fffffbeec5100040 RCX: 000000000= 0000000 [ 200.484697] RDX: 0000000000000001 RSI: ffffffff9535c7cd RDI: 000000000= 0000246 [ 200.485313] RBP: ffffffffffffffff R08: 0000000000000000 R09: 000000000= 0000000 [ 200.485917] R10: 0000000000000000 R11: 0000000000000000 R12: 000000000= 0144001 [ 200.486523] R13: 00007ffd6ba55f48 R14: 00007ffd6ba55f40 R15: ffffb1394= 01cbf08 [ 200.487130] FS: 00007f68df717580(0000) GS:ffff9ec77fa00000(0000) knlG= S:0000000000000000 [ 200.487804] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 200.488295] CR2: fffffffffffffffe CR3: 0000000135d48000 CR4: 000000000= 00006f0 [ 200.488897] Call Trace: [ 200.489115] kpageflags_read+0xe9/0x140 [ 200.489447] proc_reg_read+0x3c/0x60 [ 200.489755] vfs_read+0xc2/0x170 [ 200.490037] ksys_pread64+0x65/0xa0 [ 200.490352] do_syscall_64+0x5c/0xa0 [ 200.490665] entry_SYSCALL_64_after_hwframe+0x49/0xbe But it can be triggered much easier via "cat /proc/kpageflags > /dev/null= " after cold/hot plugging a DIMM to such a system: [root@localhost ~]# cat /proc/kpageflags > /dev/null [ 111.517275] BUG: unable to handle page fault for address: ffffffffffff= fffe [ 111.517907] #PF: supervisor read access in kernel mode [ 111.518333] #PF: error_code(0x0000) - not-present page [ 111.518771] PGD a240e067 P4D a240e067 PUD a2410067 PMD 0 This patch fixes that by at least zero-ing out that memmap (so e.g., page_to_pfn() will not crash). Commit 907ec5fca3dc ("mm: zero remaining unavailable struct pages") tried to fix a similar issue, but forgot to consider this special case. After this patch, there are still problems to solve. E.g., not all of the= se pages falling into a memory hole will actually get initialized later and set PageReserved - they are only zeroed out - but at least the immediate crashes are gone. A follow-up patch will take care of this. Fixes: f7f99100d8d9 ("mm: stop zeroing memory during allocation in vmemma= p") Tested-by: Daniel Jordan Cc: # v4.15+ Cc: Naoya Horiguchi Cc: Pavel Tatashin Cc: Andrew Morton Cc: Steven Sistare Cc: Michal Hocko Cc: Daniel Jordan Cc: Bob Picco Cc: Oscar Salvador Signed-off-by: David Hildenbrand --- mm/page_alloc.c | 14 ++++++++++++-- 1 file changed, 12 insertions(+), 2 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 62dcd6b76c80..1eb2ce7c79e4 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -6932,7 +6932,8 @@ static u64 zero_pfn_range(unsigned long spfn, unsig= ned long epfn) * This function also addresses a similar issue where struct pages are l= eft * uninitialized because the physical address range is not covered by * memblock.memory or memblock.reserved. That could happen when memblock - * layout is manually configured via memmap=3D. + * layout is manually configured via memmap=3D, or when the highest phys= ical + * address (max_pfn) does not end on a section boundary. */ void __init zero_resv_unavail(void) { @@ -6950,7 +6951,16 @@ void __init zero_resv_unavail(void) pgcnt +=3D zero_pfn_range(PFN_DOWN(next), PFN_UP(start)); next =3D end; } - pgcnt +=3D zero_pfn_range(PFN_DOWN(next), max_pfn); + + /* + * Early sections always have a fully populated memmap for the whole + * section - see pfn_valid(). If the last section has holes at the + * end and that section is marked "online", the memmap will be + * considered initialized. Make sure that memmap has a well defined + * state. + */ + pgcnt +=3D zero_pfn_range(PFN_DOWN(next), + round_up(max_pfn, PAGES_PER_SECTION)); =20 /* * Struct pages that do not have backing memory. This could be because --=20 2.23.0