From: David Hildenbrand <david@redhat.com>
To: Michal Hocko <mhocko@kernel.org>
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	"Alexander Potapenko" <glider@google.com>,
	"Andrew Morton" <akpm@linux-foundation.org>,
	"Andrey Ryabinin" <aryabinin@virtuozzo.com>,
	"Balbir Singh" <bsingharora@gmail.com>,
	"Baoquan He" <bhe@redhat.com>,
	"Benjamin Herrenschmidt" <benh@kernel.crashing.org>,
	"Boris Ostrovsky" <boris.ostrovsky@oracle.com>,
	"Dan Williams" <dan.j.williams@intel.com>,
	"Dave Young" <dyoung@redhat.com>,
	"Dmitry Vyukov" <dvyukov@google.com>,
	"Greg Kroah-Hartman" <gregkh@linuxfoundation.org>,
	"Hari Bathini" <hbathini@linux.vnet.ibm.com>,
	"Huang Ying" <ying.huang@intel.com>,
	"Hugh Dickins" <hughd@google.com>,
	"Ingo Molnar" <mingo@kernel.org>,
	"Jaewon Kim" <jaewon31.kim@samsung.com>,
	"Jan Kara" <jack@suse.cz>, "Jérôme Glisse" <jglisse@redhat.com>,
	"Joonsoo Kim" <iamjoonsoo.kim@lge.com>,
	"Juergen Gross" <jgross@suse.com>,
	"Kate Stewart" <kstewart@linuxfoundation.org>,
	"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
	"Matthew Wilcox" <mawilcox@microsoft.com>,
	"Mel Gorman" <mgorman@suse.de>,
	"Michael Ellerman" <mpe@ellerman.id.au>,
	"Miles Chen" <miles.chen@mediatek.com>,
	"Oscar Salvador" <osalvador@techadventures.net>,
	"Paul Mackerras" <paulus@samba.org>,
	"Pavel Tatashin" <pasha.tatashin@oracle.com>,
	"Philippe Ombredanne" <pombredanne@nexb.com>,
	"Rashmica Gupta" <rashmica.g@gmail.com>,
	"Reza Arbab" <arbab@linux.vnet.ibm.com>,
	"Souptick Joarder" <jrdr.linux@gmail.com>,
	"Tetsuo Handa" <penguin-kernel@I-love.SAKURA.ne.jp>,
	"Thomas Gleixner" <tglx@linutronix.de>,
	"Vlastimil Babka" <vbabka@suse.cz>,
	kexec@lists.infradead.org
Subject: Re: [PATCH v1 00/10] mm: online/offline 4MB chunks controlled by device driver
Date: Fri, 25 May 2018 17:08:37 +0200
Message-ID: <216ca71b-9880-f013-878b-ae39e865b94b@redhat.com>
In-Reply-To: <20180524120341.GF20441@dhcp22.suse.cz>

>> So, no, virtio-mem is not a balloon driver :)
> [...]
>>>> 1. "hotplug should simply not depend on kdump at all"
>>>>
>>>> In theory yes. In the current state we already have to trigger kdump to
>>>> reload whenever we add/remove a memory block.
>>>
>>> More details please.
>>
>> I just had another look at the whole complexity of
>> makedumpfile/kdump/uevents and I'll follow up with a detailed description.
>>
>> kdump.service is definitely reloaded when setting a memory block
>> online/offline (not when adding/removing as I wrongly claimed before).
>>
>> I'll follow up with a more detailed description and all the pointers.
> 
> Please make sure to describe what is the architecture then. I have no
> idea what kdump.service is supposed to do for example.

Here is a high-level description, going into the applicable details:


Dump tools always generate the dump file from /proc/vmcore inside the
kexec environment. This is a vmcore dump in ELF format, with required
and optional headers and notes.


1. Core collectors

The tool that writes /proc/vmcore into a file is called "core collector".

"This allows you to specify the command to copy the vmcore. You could
use the dump filtering program makedumpfile, the default one, to
retrieve your core, which on some arches can drastically reduce core
file size." [1]

Under RHEL, for example, the only supported core collector is in fact
makedumpfile [2][3], which among other things is able to exclude
hwpoison pages; simply copying /proc/vmcore into a file on disk could
otherwise result in a crash.
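
For illustration, the core collector is typically wired up via a config
entry along these lines (just a sketch; exact options vary per
distribution):

# /etc/kdump.conf (illustrative)
# -l: lzo compression, -d 31: dump level (see makedumpfile below)
core_collector makedumpfile -l --message-level 1 -d 31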


2. vmcoreinfo

/proc/vmcore can optionally contain a vmcoreinfo note, which exposes
some magic variables necessary to e.g. find and interpret segments, but
also struct pages. It is generated in "kernel/crash_core.c" in the
crashed Linux kernel.

...
VMCOREINFO_SYMBOL_ARRAY(mem_section);
VMCOREINFO_LENGTH(mem_section, NR_SECTION_ROOTS);
...
VMCOREINFO_NUMBER(PG_hwpoison);
...
VMCOREINFO_NUMBER(PAGE_BUDDY_MAPCOUNT_VALUE);
...

If it is not available, dump tools try to extract the relevant
symbols/variables/pointers from vmlinux instead (similar to what e.g.
GDB does).
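
Decoded, the vmcoreinfo note is plain KEY=VALUE text, roughly like the
following (the values here are made up for illustration):

OSRELEASE=4.17.0-rc6
PAGESIZE=4096
SYMBOL(mem_section)=ffffffff9a8f1000
LENGTH(mem_section)=2048
NUMBER(PG_hwpoison)=23
NUMBER(PAGE_BUDDY_MAPCOUNT_VALUE)=-128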


3. PT_LOAD / Memory holes

Each vmcore contains "PT_LOAD" segments. These segments define which
physical memory areas are available in the vmcore (and to which virtual
addresses they translate). They are generated e.g. in
"kernel/kexec_file.c" - and in some other places (see "git grep
Elf64_Phdr").

This information is generated in the crashed kernel.

arch/x86/kernel/crash.c:
 walk_system_ram_res() is effectively used to generate PT_LOAD segments

arch/s390/kernel/crash_dump.c:
 for_each_mem_range() is effectively used to generate PT_LOAD
 information

At this point, I don't see how offline sections are treated. I assume
they are always included as well. So PT_LOAD will include all memory,
no matter if online or offline.
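
To illustrate how little magic is involved: a user space tool inside
the kexec environment could enumerate those ranges straight from the
ELF header, roughly like this (a sketch assuming an ELF64 vmcore; error
handling trimmed):

/* print the PT_LOAD ranges the crashed kernel advertised */
#include <elf.h>
#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/vmcore", "r");
	Elf64_Ehdr ehdr;
	Elf64_Phdr phdr;
	int i;

	if (!f || fread(&ehdr, sizeof(ehdr), 1, f) != 1)
		return 1;
	for (i = 0; i < ehdr.e_phnum; i++) {
		fseek(f, ehdr.e_phoff + i * sizeof(phdr), SEEK_SET);
		if (fread(&phdr, sizeof(phdr), 1, f) != 1)
			return 1;
		if (phdr.p_type == PT_LOAD)
			printf("paddr=0x%llx size=0x%llx\n",
			       (unsigned long long)phdr.p_paddr,
			       (unsigned long long)phdr.p_memsz);
	}
	return 0;
}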


4. Reloading kexec/kdump.service

The important thing is that the vmcore, *excluding* the actual memory,
has to be prepared by the *old* kernel. The kexec kernel then allows
one to
- Read the prepared vmcore (contained in the kexec kernel)
- Read the memory

So dump tools only have the vmcore (esp. PT_LOAD) to figure out which
physical memory was available in the *old* system. The kexec kernel
neither reads nor interprets segments/struct pages from the old kernel
(and there would be no way to really do it). All it does is allow
reading old memory as defined in the prepared vmcore. If that memory is
not accessible or broken (hwpoison), we will crash the system.

So what does this imply? The vmcore (including the PT_LOAD segments)
has to be regenerated every time memory is added to/removed from the
system; otherwise, the data contained in the prepared vmcore is stale.
As far as I understand, this cannot be done by the running kernel
itself when adding/removing memory but has to be done from user space.

The same is e.g. also true when hot(un)plugging CPUs.

This is done by reloading kexec, resulting in a regeneration of the
vmcore. udev events are used to restart kdump.service and therefore
regenerate the vmcore. These events are triggered when
onlining/offlining a memory block:

...
SUBSYSTEM=="memory", ACTION=="online", PROGRAM="/bin/systemctl
try-restart kdump.service"
SUBSYSTEM=="memory", ACTION=="offline", PROGRAM="/bin/systemctl
try-restart kdump.service"
...
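
So, hypothetically, one can watch the mechanism in action by offlining
a block by hand ("memory42" is a made-up example):

# udevadm monitor --kernel --subsystem-match=memory &
# echo offline > /sys/devices/system/memory/memory42/state
-> the "offline" uevent fires, udev runs "systemctl try-restart
   kdump.service", kexec is reloaded and the vmcore is regenerated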

For "online", this is the right thing to do.

I am right now not 100% sure whether that is the right thing to do for
"offline". I guess we should actually regenerate after "remove" events,
but I didn't follow the details. Otherwise it could happen that the
vmcore is regenerated before the actual removal of the memory blocks,
so the applicable memory blocks would still be included as PT_LOAD in
the vmcore. If we then remove the actual DIMM, trying to dump the
vmcore will result in reading invalid memory. But maybe I am missing
something there.


5. Access to vmcore / memory in the kexec environment

fs/proc/vmcore.c contains the code for parsing the vmcore in the kexec
kernel, as prepared by the crashed kernel. The kexec kernel provides
read access to /proc/vmcore on this basis.

All PT_LOAD segments will be converted and stored in "vmcore_list".

When reading the vmcore, this list will be used to actually provide
access to the original crash memory (__read_vmcore()).

So only memory that was originally in a vmcore PT_LOAD segment will be
allowed to be read.

read_from_oldmem() will perform the actual read. At that point we have
no control over old page flags or segments. Just a straight memory read.

There is special handling for e.g. Xen in there: pfn_is_ram() can be
used to prevent reading inflated memory (see
register_oldmem_pfn_is_ram()).

However, reusing that for virtio-mem, with multiple devices and queues
and such, might not be possible. It is the last resort :)
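
For reference, hooking in there boils down to something like the
following sketch (my_pfn_backed() is a hypothetical driver-specific
lookup; only a single callback can be registered system-wide):

/* sketch: make /proc/vmcore reads skip pages we know are not backed */
#include <linux/crash_dump.h>
#include <linux/module.h>

static bool my_pfn_backed(unsigned long pfn); /* hypothetical helper */

static int my_oldmem_pfn_is_ram(unsigned long pfn)
{
	/* returning 0 makes read_from_oldmem() zero-fill the buffer */
	return my_pfn_backed(pfn) ? 1 : 0;
}

static int __init my_init(void)
{
	/* fails with -EBUSY if somebody (e.g. Xen) already registered */
	return register_oldmem_pfn_is_ram(my_oldmem_pfn_is_ram);
}
module_init(my_init);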


6. makedumpfile

makedumpfile can exclude free (buddy) pages, hwpoison pages and some
more. It will *not* exclude reserved pages or balloon-inflated (e.g.
virtio-balloon) pages. So it will read inflated pages and, if they are
zero, save a compressed zero page. Either way, it will (read) access
that memory.

While makedumpfile was adapted to the new SECTION_IS_ONLINE bit (to
mask off the right section address), offline sections will *not* be
excluded. So all memory in offline sections will also be accessed and
dumped - unless pages don't fall into any PT_LOAD segment ("memory
hole"), in which case they are not accessed.
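
For completeness, what gets excluded is controlled by the dump level
bitmask, e.g. (illustrative invocation):

# bits: 1 zero pages, 2 non-private cache, 4 private cache,
# 8 user pages, 16 free (buddy) pages -> -d 31 excludes all of them
makedumpfile -c -d 31 /proc/vmcore dumpfile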


7. Further information

Some more details can be found in "Documentation/kdump/kdump.txt".


"All of the necessary information about the system kernel's core image
is encoded in the ELF format, and stored in a reserved area of memory
before a crash. The physical address of the start of the ELF header is
passed to the dump-capture kernel through the elfcorehdr= boot
parameter."
-> I am pretty sure this is why the kexec reload from user space is
   necessary

"For s390x there are two kdump modes: If a ELF header is specified with
 the elfcorehdr= kernel parameter, it is used by the kdump kernel as it
 is done on all other architectures. If no elfcorehdr= kernel parameter
 is specified, the s390x kdump kernel dynamically creates the header.
 The second mode has the advantage that for CPU and memory hotplug,
 kdump has not to be reloaded with kexec_load()."


Any experts, please jump in :)



[1] https://www.systutorials.com/docs/linux/man/5-kdump/
[2] https://sourceforge.net/projects/makedumpfile/
[3] git://git.code.sf.net/p/makedumpfile/code

-- 

Thanks,

David / dhildenb
