From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Tue, 28 Feb 2023 22:03:49 +0800
From: Baoquan He <bhe@redhat.com>
To: "lizhijian@fujitsu.com"
Cc: "kexec@lists.infradead.org", "nvdimm@lists.linux.dev",
    "linux-mm@kvack.org", "vgoyal@redhat.com", "dyoung@redhat.com",
    "vishal.l.verma@intel.com", "dan.j.williams@intel.com",
    "dave.jiang@intel.com", "horms@verge.net.au", "k-hagio-ab@nec.com",
    "akpm@linux-foundation.org", "Yasunori Gotou (Fujitsu)",
    "yangx.jy@fujitsu.com", "ruansy.fnst@fujitsu.com"
Subject: Re: [RFC][nvdimm][crash] pmem memmap dump support
Message-ID:
References: <3c752fc2-b6a0-2975-ffec-dba3edcf4155@fujitsu.com>
In-Reply-To: <3c752fc2-b6a0-2975-ffec-dba3edcf4155@fujitsu.com>
On 02/23/23 at 06:24am, lizhijian@fujitsu.com wrote:
> Hello folks,
>
> This mail raises a pmem memmap dump requirement and possible
> solutions, though they are all still premature. I would really
> appreciate your feedback.
>
> "pmem memmap" can also be called "pmem metadata" here.
>
> ### Background and motivation overview ###
> ---
> Crash dump is an important feature for troubleshooting the kernel. It
> is the last resort for chasing down what happened at a kernel panic,
> slowdown, and so on, and it is the most important tool for customer
> support.
> However, part of the data on pmem is not included in the crash dump,
> which makes it difficult to analyze problems involving pmem
> (especially Filesystem-DAX).
>
> A pmem namespace in "fsdax" or "devdax" mode requires allocation of
> per-page metadata[1]. The allocation can be drawn from either mem
> (system memory) or dev (the pmem device itself); see
> `ndctl help create-namespace` for more details. In fsdax, the struct
> page array becomes very important: it is one of the key pieces of
> data for finding the status of the reverse map.
>
> So, when the metadata is stored on pmem, even pmem's own per-page
> metadata will not be dumped. That means troubleshooters are unable to
> check further details about the pmem from the dumpfile.
>
> ### Making pmem memmap dump supported ###
> ---
> Our goal is that, whether the metadata is stored in mem or on pmem,
> the metadata can be dumped, so that the crash utilities can read more
> details about the pmem. Of course, this feature can be
> enabled/disabled.
>
> First, based on our previous investigation, according to the location
> of the metadata and the scope of the dump, we can divide the problem
> into the following four cases: A, B, C and D.
> It should be noted that although cases A & B are mentioned below, we
> do not want them to be part of this feature, because dumping the
> entire pmem would consume a lot of space and, more importantly, it
> may contain sensitive user data.
>
> +-------------+----------------------+
> |             |  metadata location   |
> | dump scope  +----------+-----------+
> |             |   mem    |   PMEM    |
> +-------------+----------+-----------+
> | entire pmem |    A     |     B     |
> +-------------+----------+-----------+
> | metadata    |    C     |     D     |
> +-------------+----------+-----------+
>
> Cases A & B: unsupported
> - Only the regions listed in the PT_LOAD entries of the vmcore are
>   dumpable. This can be resolved by adding the pmem region to the
>   vmcore's PT_LOADs in kexec-tools.
> - makedumpfile assumes that all page objects of the entire region
>   described by the PT_LOADs are readable, and then skips/excludes
>   specific pages according to their attributes. But in the case of
>   pmem, the 1st kernel only allocates page objects for the namespaces
>   of the pmem, so makedumpfile will throw errors[2] when certain -d
>   options are specified. Accordingly, we should make makedumpfile
>   ignore these errors for pmem regions.
>
> Because the above cases are not our goal, we must consider how to
> prevent the data part of the pmem from being read by the dump
> application (makedumpfile).
>
> Case C: natively supported
> The metadata is stored in mem, and the entire mem/ram is dumpable.
>
> Case D: unsupported, and we need your input
> To support this case, makedumpfile needs to know the location of the
> metadata for each pmem namespace, i.e. the address and size of the
> metadata on the pmem: [start, end).
>
> We have thought of a few possible options:
>
> 1) In the 2nd kernel, with the help of the information in
>    /sys/bus/nd/devices/{namespaceX.Y, daxX.Y, pfnX.Y} exported by the
>    pmem drivers, makedumpfile is able to calculate the address and
>    size of the metadata.
> 2) In the 1st kernel, add a new symbol to the vmcore. The symbol is
>    associated with the layout of each namespace. makedumpfile reads
>    the symbol and figures out the address and size of the metadata.
> 3) Others?
>
> But then we found that we have always ignored one use case: the user
> could save the dumpfile to the pmem itself. Neither of the two
> options above solves this, because the pmem drivers re-initialize the
> metadata while loading, which makes the metadata we dump inconsistent
> with the metadata at the moment the crash happened.
> Simply put, can we just disable the pmem in the 2nd kernel so that
> the previous metadata will not be destroyed?
> But this operation brings the inconvenience that the 2nd kernel does
> not allow the user to store the dumpfile on a filesystem/partition
> based on pmem.

1) On the kernel side, export the info about the pmem metadata;
2) On the makedumpfile side, add an option to specify whether we want
   to dump the pmem metadata. An option, or in the dump level?
3) In the glue script, detect and warn if the pmem metadata is on pmem
   and wanted, and the dump target is the same pmem.

Does this work for you? Not sure if the above items are all doable. As
for parking the pmem device until we are in the kdump kernel, I believe
the Intel pmem experts know how to achieve that. If there's no way to
park pmem during the kdump jump, case D) is a daydream.

Thanks
Baoquan
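
P.S. For what it's worth, option 1) above (computing the metadata range
in the 2nd kernel from what the pmem drivers expose under sysfs) can be
sketched roughly as below. This is only an illustration: the helper
names, the attribute names passed to read_attr(), and the assumption
that the struct page array sits at the front of the namespace are my
guesses, not a documented driver contract — whether the needed
attributes even survive into the 2nd kernel is exactly the open
question in this thread.

```python
# Illustrative sketch of option 1): derive the pmem per-page metadata
# range from values assumed to be exported under /sys/bus/nd/devices/.
# Attribute names and layout assumptions are hypothetical.

def read_attr(dev, name):
    """Read one sysfs attribute of an nvdimm device as an integer
    (hypothetical helper; real attribute names/formats may differ)."""
    with open(f"/sys/bus/nd/devices/{dev}/{name}") as f:
        return int(f.read().strip(), 0)  # handles both "0x..." and decimal

def metadata_range(ns_start, ns_size, data_start):
    """Assume the metadata (struct page array plus padding) occupies
    [ns_start, data_start), i.e. the front of the namespace, with the
    data area starting at data_start.  Return (start, size)."""
    assert ns_start <= data_start <= ns_start + ns_size
    return ns_start, data_start - ns_start

def estimated_memmap_bytes(ns_size, page_size=4096, sizeof_page=64):
    """Sanity check: ~64 bytes of struct page per 4 KiB page on x86_64,
    i.e. roughly 16 GiB of metadata per 1 TiB of namespace."""
    return (ns_size // page_size) * sizeof_page
```

The size estimate also shows why cases A & B are unattractive: the
metadata for a 1 TiB namespace is only ~16 GiB, while the full data
dump would be the whole terabyte.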