From: Naoya Horiguchi
To: David Hildenbrand
CC: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Qian Cai, Alexey Dobriyan, Andrew Morton, Stephen Rothwell, Michal Hocko, Toshiki Fukasawa, Konstantin Khlebnikov, Mike Rapoport, Anthony Yznaga, Jason Gunthorpe, Dan Williams, Logan Gunthorpe, Ira Weiny, Aneesh Kumar K.V, linux-fsdevel@vger.kernel.org
Subject: Re: [PATCH v1] mm: Fix access of uninitialized memmaps in fs/proc/page.c
Date: Fri, 11 Oct 2019 00:11:25 +0000
Message-ID: <20191011001124.GA17127@hori.linux.bs1.fc.nec.co.jp>
References: <20191009091205.11753-1-david@redhat.com> <20191009095721.GC20971@hori.linux.bs1.fc.nec.co.jp>

On Thu, Oct 10, 2019 at 09:30:01AM +0200, David Hildenbrand wrote:
> On 09.10.19 11:57, Naoya Horiguchi wrote:
> > Hi David,
> >
> > On Wed, Oct 09, 2019 at 11:12:04AM +0200, David Hildenbrand wrote:
> >> There are various places where we access uninitialized memmaps, namely:
> >> - /proc/kpagecount
> >> - /proc/kpageflags
> >> - /proc/kpagecgroup
> >> - memory_failure() - which reuses stable_page_flags() from fs/proc/page.c
> >
> > Ah right, memory_failure is another victim of this bug.
> >
> >>
> >> We have initialized memmaps either when the section is online or when
> >> the page was initialized to the ZONE_DEVICE. Uninitialized memmaps contain
> >> garbage and in the worst case trigger kernel BUGs, especially with
> >> CONFIG_PAGE_POISONING.
> >>
> >> For example, not onlining a DIMM during boot and calling /proc/kpagecount
> >> with CONFIG_PAGE_POISONING:
> >> :/# cat /proc/kpagecount > tmp.test
> >> [ 95.600592] BUG: unable to handle page fault for address: fffffffffffffffe
> >> [ 95.601238] #PF: supervisor read access in kernel mode
> >> [ 95.601675] #PF: error_code(0x0000) - not-present page
> >> [ 95.602116] PGD 114616067 P4D 114616067 PUD 114618067 PMD 0
> >> [ 95.602596] Oops: 0000 [#1] SMP NOPTI
> >> [ 95.602920] CPU: 0 PID: 469 Comm: cat Not tainted 5.4.0-rc1-next-20191004+ #11
> >> [ 95.603547] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.4
> >> [ 95.604521] RIP: 0010:kpagecount_read+0xce/0x1e0
> >> [ 95.604917] Code: e8 09 83 e0 3f 48 0f a3 02 73 2d 4c 89 e7 48 c1 e7 06 48 03 3d ab 51 01 01 74 1d 48 8b 57 08 480
> >> [ 95.606450] RSP: 0018:ffffa14e409b7e78 EFLAGS: 00010202
> >> [ 95.606904] RAX: fffffffffffffffe RBX: 0000000000020000 RCX: 0000000000000000
> >> [ 95.607519] RDX: 0000000000000001 RSI: 00007f76b5595000 RDI: fffff35645000000
> >> [ 95.608128] RBP: 00007f76b5595000 R08: 0000000000000001 R09: 0000000000000000
> >> [ 95.608731] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000140000
> >> [ 95.609327] R13: 0000000000020000 R14: 00007f76b5595000 R15: ffffa14e409b7f08
> >> [ 95.609924] FS: 00007f76b577d580(0000) GS:ffff8f41bd400000(0000) knlGS:0000000000000000
> >> [ 95.610599] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >> [ 95.611083] CR2: fffffffffffffffe CR3: 0000000078960000 CR4: 00000000000006f0
> >> [ 95.611686] Call Trace:
> >> [ 95.611906] proc_reg_read+0x3c/0x60
> >> [ 95.612228] vfs_read+0xc5/0x180
> >> [ 95.612505] ksys_read+0x68/0xe0
> >> [ 95.612785] do_syscall_64+0x5c/0xa0
> >> [ 95.613092] entry_SYSCALL_64_after_hwframe+0x49/0xbe
> >>
> >> Note that there are still two possible races as far as I can see:
> >> - pfn_to_online_page() succeeding but the memory getting offlined and
> >>   removed. get_online_mems() could help once we run into this.
> >> - pfn_zone_device() succeeding but the memmap not being fully
> >>   initialized yet. As the memmap is initialized outside of the memory
> >>   hoptlug lock, get_online_mems() can't help.
> >>
> >> Let's keep the existing interfaces working with ZONE_DEVICE memory. We
> >> can later come back and fix these rare races and eventually speed-up the
> >> ZONE_DEVICE detection.
> >
> > Actually, Toshiki is writing code to refactor and optimize the pfn walking
> > part, where we find the pfn ranges covered by zone devices by running over
> > xarray pgmap_array and use the range info to reduce pointer dereferences
> > to speed up pfn walk. I hope he will share it soon.
>
> AFAIKT, Michal is not a friend of special-casing PFN walkers in that
> way. We should have a mechanism to detect if a memmap was initialized
> without having to go via pgmap, special-casing. See my other mail where
> I draft one basic approach.
OK, so considering your v2 approach, we could have another pfn_to_page()
variant like pfn_to_zone_device_page(), which checks that a given pfn
belongs to a memory section backed by zone device memory, then checks
whether the pfn has an initialized memmap, and returns NULL if the memmap
is not initialized.

We'll try this approach then, but if you find any problems/concerns,
please let me know.

Thanks,
Naoya Horiguchi
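
For illustration only, the two checks described above (section backed by
zone device memory, then memmap initialized) could be modeled in userspace C
roughly as follows. This is a sketch under assumptions, not kernel code: the
struct layouts, the toy section size, and the model_*() helpers are all
hypothetical, and pfn_to_zone_device_page() here is only a stand-in for the
proposed variant, which is not an existing kernel function as far as this
thread shows.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy sizes, chosen only to keep the model small (hypothetical values). */
#define PAGES_PER_SECTION 4UL
#define NR_SECTIONS       3UL

/* Stand-in for struct page: one flag meaning "memmap entry initialized". */
struct page {
	bool initialized;
};

/* Stand-in for a memory section and its memmap (hypothetical layout). */
struct mem_section {
	bool zone_device;	/* section is backed by ZONE_DEVICE memory */
	struct page memmap[PAGES_PER_SECTION];
};

static struct mem_section sections[NR_SECTIONS];

/* Model helper: mark a whole section as backed by zone device memory. */
static void model_add_zone_device_section(unsigned long nr)
{
	assert(nr < NR_SECTIONS);
	sections[nr].zone_device = true;
}

/* Model helper: mark the memmap entry of one pfn as initialized. */
static void model_init_page(unsigned long pfn)
{
	assert(pfn / PAGES_PER_SECTION < NR_SECTIONS);
	sections[pfn / PAGES_PER_SECTION]
		.memmap[pfn % PAGES_PER_SECTION].initialized = true;
}

/*
 * Return the page for @pfn only if (1) the pfn lies in a section backed by
 * zone device memory and (2) its memmap entry was initialized. Return NULL
 * otherwise, so callers never touch an uninitialized memmap.
 */
static struct page *pfn_to_zone_device_page(unsigned long pfn)
{
	unsigned long nr = pfn / PAGES_PER_SECTION;
	struct page *page;

	if (nr >= NR_SECTIONS)
		return NULL;		/* no such section in this model */
	if (!sections[nr].zone_device)
		return NULL;		/* check 1: not zone device memory */

	page = &sections[nr].memmap[pfn % PAGES_PER_SECTION];
	return page->initialized ? page : NULL;	/* check 2 */
}
```

In this model, after marking section 1 as zone device and initializing only
pfn 4, a lookup of pfn 4 returns a page, while pfn 5 (same section, memmap
not initialized) and pfn 0 (section not zone device) both return NULL. The
model does not address the race mentioned earlier in the thread, where the
memmap is still being initialized while a walker runs.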