From: "HAGIO KAZUHITO(萩尾 一仁)" <k-hagio-ab@nec.com>
To: "lizhijian@fujitsu.com" <lizhijian@fujitsu.com>,
	"kexec@lists.infradead.org" <kexec@lists.infradead.org>,
	"nvdimm@lists.linux.dev" <nvdimm@lists.linux.dev>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>
Cc: Baoquan He <bhe@redhat.com>,
	"vgoyal@redhat.com" <vgoyal@redhat.com>,
	"dyoung@redhat.com" <dyoung@redhat.com>,
	"vishal.l.verma@intel.com" <vishal.l.verma@intel.com>,
	"dan.j.williams@intel.com" <dan.j.williams@intel.com>,
	"dave.jiang@intel.com" <dave.jiang@intel.com>,
	"horms@verge.net.au" <horms@verge.net.au>,
	"akpm@linux-foundation.org" <akpm@linux-foundation.org>,
	"Yasunori Gotou (Fujitsu)" <y-goto@fujitsu.com>,
	"yangx.jy@fujitsu.com" <yangx.jy@fujitsu.com>,
	"ruansy.fnst@fujitsu.com" <ruansy.fnst@fujitsu.com>
Subject: Re: [RFC][nvdimm][crash] pmem memmap dump support
Date: Tue, 7 Mar 2023 02:05:04 +0000
Message-ID: <1fecbb60-d9f1-908c-31c9-16a3c890cf3f@nec.com>
In-Reply-To: <3c752fc2-b6a0-2975-ffec-dba3edcf4155@fujitsu.com>

On 2023/02/23 15:24, lizhijian@fujitsu.com wrote:
> Hello folks,
> 
> This mail raises a requirement for dumping the pmem memmap, together with some possible solutions, but
> they are all still premature. I really hope you can provide some feedback.
> 
> In this mail, the pmem memmap is also referred to as pmem metadata.
> 
> ### Background and motivation overview ###
> ---
> Crash dump is an important feature for troubleshooting the kernel. It is the last resort for finding out
> what happened at a kernel panic, slowdown, and so on, and it is the most important tool for customer support.
> However, part of the data on pmem is not included in the crash dump, which makes it difficult to analyze
> problems around pmem (especially Filesystem-DAX).
> 
> 
> A pmem namespace in "fsdax" or "devdax" mode requires allocation of per-page metadata[1]. The allocation
> can be drawn from either mem (system memory) or dev (the pmem device itself); see `ndctl help create-namespace`
> for more details. In fsdax, the struct page array becomes very important: it is one of the key data structures
> for finding the status of the reverse map.
> 
> So when the metadata is stored in pmem, even the pmem's per-page metadata will not be dumped. That means
> troubleshooters are unable to check further details about the pmem from the dumpfile.
> 
> ### Make pmem memmap dump support ###
> ---
> Our goal is that, whether the metadata is stored in mem or in pmem, the metadata can be dumped so that the
> crash utilities can read more details about the pmem. Of course, this feature can be enabled/disabled.
> 
> First, based on our previous investigation, according to the location of the metadata and the scope of the
> dump, we can divide the problem into the following four cases: A, B, C and D.
> It should be noted that although cases A and B are mentioned below, we do not want these two cases to be
> part of this feature, because dumping the entire pmem would consume a lot of space and, more importantly,
> it may contain sensitive user data.
> 
> +-------------+-----------------------+
> |             |   metadata location   |
> | dump scope  +----------+------------+
> |             |   mem    |    pmem    |
> +-------------+----------+------------+
> | entire pmem |     A    |     B      |
> +-------------+----------+------------+
> | metadata    |     C    |     D      |
> +-------------+----------+------------+
> 
> Case A & B: unsupported
> - Only the regions listed in the PT_LOADs of the vmcore are dumpable. This can be resolved by adding the
> pmem regions to the vmcore's PT_LOADs in kexec-tools (a sketch of such a program header follows after
> this list).
> - makedumpfile assumes that all page objects of the entire regions described in the PT_LOADs are readable,
> and then skips/excludes specific pages according to their attributes. But in the case of pmem, the 1st
> kernel only allocates page objects for the namespaces of the pmem, so makedumpfile will throw errors[2]
> when certain -d options are specified. Accordingly, we would have to make makedumpfile ignore these errors
> for pmem regions.
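> 
> As a minimal sketch (in C), here is the kind of program header kexec-tools would have to append for a
> pmem region; the addresses are made-up placeholders, not from a real layout:
> 
>     #include <elf.h>
> 
>     /* Hypothetical pmem region; the real values would come from the
>        iomem resource tree ("Persistent Memory" entries). */
>     #define PMEM_START 0x240000000UL  /* placeholder physical start */
>     #define PMEM_SIZE  0x40000000UL   /* placeholder region size    */
> 
>     static const Elf64_Phdr pmem_phdr = {
>         .p_type   = PT_LOAD,
>         .p_flags  = PF_R,
>         .p_offset = 0,            /* filled in when the ELF is laid out */
>         .p_vaddr  = 0,            /* no virtual mapping for the region  */
>         .p_paddr  = PMEM_START,   /* physical address in the 1st kernel */
>         .p_filesz = PMEM_SIZE,
>         .p_memsz  = PMEM_SIZE,
>         .p_align  = 0,
>     };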
> 
> Because the above cases are not within our goal, we must consider how to prevent the data part of the pmem
> from being read by the dump application (makedumpfile).
> 
> Case C: natively supported
> The metadata is stored in mem, and the entire mem/RAM is dumpable.
> 
> Case D: unsupported, and we need your input
> To support this case, makedumpfile needs to know the location of the metadata for each pmem namespace,
> i.e. the address and size of the metadata in the pmem [start, end).
> 
> We have thought of a few possible options:
> 
> 1) In the 2nd kernel, with the help of the information in /sys/bus/nd/devices/{namespaceX.Y, daxX.Y, pfnX.Y}
> exported by the pmem drivers, makedumpfile is able to calculate the address and size of the metadata
> (see the sketch after this list).
> 2) In the 1st kernel, add a new symbol to the vmcore. The symbol is associated with the layout of each
> namespace. makedumpfile reads the symbol and figures out the address and size of the metadata.

Hi Zhijian,

sorry, probably I don't understand this enough, but do these mean that
  1. /proc/vmcore exports pmem regions with PT_LOADs, which contain
     unreadable ones, and
  2. makedumpfile somehow gets to know which regions are readable?

If so, a /proc/vmcore with pmem could not be captured by other commands,
e.g. the cp command?

Thanks,
Kazu

> 3) Others?
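> 
> As a rough sketch of option 1), something along these lines could be read in the 2nd kernel (the device
> names are examples, and the exact semantics of the pfn attributes would have to be double-checked against
> the nvdimm drivers):
> 
>     # physical start of the namespace
>     cat /sys/bus/nd/devices/namespace0.0/resource
>     # physical start and size of the data area of the pfn instance,
>     # i.e. the part that comes after the per-page metadata
>     cat /sys/bus/nd/devices/pfn0.0/resource
>     cat /sys/bus/nd/devices/pfn0.0/size
> 
> If that reading is correct, the metadata would occupy
> [namespace0.0/resource, pfn0.0/resource).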
> 
> But then we found that we had overlooked a use case all along: the user could save the dumpfile to the
> pmem itself. Neither of these two options can solve this problem, because the pmem drivers will
> re-initialize the metadata while they are loading, which means the metadata we dump would be inconsistent
> with the metadata at the moment the crash happened.
> Simply put, could we just disable the pmem in the 2nd kernel so that the previous metadata is not
> destroyed? But this would bring the inconvenience that the 2nd kernel no longer allows the user to store
> the dumpfile on a filesystem/partition based on pmem.
> 
> So here I hope you can provide some ideas about this feature/requirement and about the possible solutions
> for the cases A, B and D mentioned above; it would be greatly appreciated.
> 
> If I'm missing something, feel free to let me know. Any feedback and comments are very welcome.
> 
> 
> [1] Pmem region layout:
>     ^<--namespace0.0---->^<--namespace0.1------>^
>     |                    |                      |
>     +--+m----------------+--+m------------------+---------------------+-+a
>     |++|e                |++|e                  |                     |+|l
>     |++|t                |++|t                  |                     |+|i
>     |++|a                |++|a                  |                     |+|g
>     |++|d  namespace0.0  |++|d  namespace0.1    |     un-allocated    |+|n
>     |++|a    fsdax       |++|a     devdax       |                     |+|m
>     |++|t                |++|t                  |                     |+|e
>     +--+a----------------+--+a------------------+---------------------+-+n
>     |                                                                   |t
>     v<-----------------------pmem region------------------------------->v
> 
> [2] https://lore.kernel.org/linux-mm/70F971CF-1A96-4D87-B70C-B971C2A1747C@roc.cs.umass.edu/T/
> 
> 
> Thanks
> Zhijian

Thread overview: 12+ messages
2023-02-23  6:24 lizhijian
2023-02-28 14:03 ` Baoquan He
2023-03-01  6:27   ` lizhijian
2023-03-01  8:17     ` Baoquan He
2023-03-03  2:27       ` lizhijian
2023-03-03  9:21         ` Baoquan He
2023-03-07  2:05 ` HAGIO KAZUHITO(萩尾 一仁) [this message]
2023-03-07  2:49   ` lizhijian
2023-03-07  8:31     ` HAGIO KAZUHITO(萩尾 一仁)
2023-03-17  6:12 ` Dan Williams
2023-03-17  7:30   ` lizhijian
2023-03-17 15:19     ` Dan Williams
