* Re: [PATCH v1 4/4] powernv/memtrace: don't abuse memory hot(un)plug infrastructure for memory allocations [not found] <87o8kcttjp.fsf@mpe.ellerman.id.au> @ 2020-11-05 8:29 ` David Hildenbrand 2020-11-05 10:47 ` Michael Ellerman 0 siblings, 1 reply; 5+ messages in thread From: David Hildenbrand @ 2020-11-05 8:29 UTC (permalink / raw) To: Michael Ellerman Cc: David Hildenbrand, linux-kernel, linux-mm, linuxppc-dev, Michal Hocko, Benjamin Herrenschmidt, Paul Mackerras, Rashmica Gupta, Andrew Morton, Mike Rapoport, Michal Hocko, Oscar Salvador, Wei Yang > Am 05.11.2020 um 03:53 schrieb Michael Ellerman <mpe@ellerman.id.au>: > > David Hildenbrand <david@redhat.com> writes: >> Let's use alloc_contig_pages() for allocating memory and remove the >> linear mapping manually via arch_remove_linear_mapping(). Mark all pages >> PG_offline, such that they will definitely not get touched - e.g., >> when hibernating. When freeing memory, try to revert what we did. >> The original idea was discussed in: >> https://lkml.kernel.org/r/48340e96-7e6b-736f-9e23-d3111b915b6e@redhat.com >> This is similar to CONFIG_DEBUG_PAGEALLOC handling on other >> architectures, whereby only single pages are unmapped from the linear >> mapping. Let's mimic what memory hot(un)plug would do with the linear >> mapping. >> We now need MEMORY_HOTPLUG and CONTIG_ALLOC as dependencies. >> Simple test under QEMU TCG (10GB RAM, single NUMA node): >> sh-5.0# mount -t debugfs none /sys/kernel/debug/ >> sh-5.0# cat /sys/devices/system/memory/block_size_bytes >> 40000000 >> sh-5.0# echo 0x40000000 > /sys/kernel/debug/powerpc/memtrace/enable >> [ 71.052836][ T356] memtrace: Allocated trace memory on node 0 at 0x0000000080000000 >> sh-5.0# echo 0x80000000 > /sys/kernel/debug/powerpc/memtrace/enable >> [ 75.424302][ T356] radix-mmu: Mapped 0x0000000080000000-0x00000000c0000000 with 64.0 KiB pages >> [ 75.430549][ T356] memtrace: Freed trace memory back on node 0 >> [ 75.604520][ T356] memtrace: Allocated trace memory on node 0 at 0x0000000080000000 >> sh-5.0# echo 0x100000000 > /sys/kernel/debug/powerpc/memtrace/enable >> [ 80.418835][ T356] radix-mmu: Mapped 0x0000000080000000-0x0000000100000000 with 64.0 KiB pages >> [ 80.430493][ T356] memtrace: Freed trace memory back on node 0 >> [ 80.433882][ T356] memtrace: Failed to allocate trace memory on node 0 >> sh-5.0# echo 0x40000000 > /sys/kernel/debug/powerpc/memtrace/enable >> [ 91.920158][ T356] memtrace: Allocated trace memory on node 0 at 0x0000000080000000 > > I gave this a quick spin on a real machine, seems to work OK. > > I don't have the actual memtrace tools setup to do an actual trace, will > try and get someone to test that also. > > One observation is that previously the memory was zeroed when enabling > the memtrace, whereas now it's not. > > eg, before: > > # hexdump -C /sys/kernel/debug/powerpc/memtrace/00000000/trace > 00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| > * > 10000000 > > whereas after: > > # hexdump -C /sys/kernel/debug/powerpc/memtrace/00000000/trace > 00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| > * > 00000080 e0 fd 43 00 00 00 00 00 e0 fd 43 00 00 00 00 00 |..C.......C.....| > 00000090 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| > * > 00000830 98 bf 39 00 00 00 00 00 98 bf 39 00 00 00 00 00 |..9.......9.....| > 00000840 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| > * > 000008a0 b0 c8 47 00 00 00 00 00 b0 c8 47 00 00 00 00 00 |..G.......G.....| > 000008b0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| > ... > 0fffff70 78 53 49 7d 00 00 29 2e 88 00 92 41 01 00 49 39 |xSI}..)....A..I9| > 0fffff80 b4 07 4a 7d 28 f8 00 7d 00 48 08 7c 0c 00 c2 40 |..J}(..}.H.|...@| > 0fffff90 2d f9 40 7d f0 ff c2 40 b4 07 0a 7d 00 48 8a 7f |-.@}...@...}.H..| > 0fffffa0 70 fe 9e 41 cc ff ff 4b 00 00 00 60 00 00 00 60 |p..A...K...`...`| > 0fffffb0 01 00 00 48 00 00 00 60 00 00 a3 2f 0c fd 9e 40 |...H...`.../...@| > 0fffffc0 00 00 a2 3c 00 00 a5 e8 00 00 62 3c 00 00 63 e8 |...<......b<..c.| > 0fffffd0 01 00 20 39 83 02 80 38 00 00 3c 99 01 00 00 48 |.. 9...8..<....H| > 0fffffe0 00 00 00 60 e4 fc ff 4b 00 00 80 38 78 fb e3 7f |...`...K...8x...| > 0ffffff0 01 00 00 48 00 00 00 60 2c fe ff 4b 00 00 00 60 |...H...`,..K...`| > 10000000 > > > That's a nice way for root to read kernel memory, so we should probably > add a __GFP_ZERO or memset in there somewhere. Thanks for catching that! Will have a look on Monday if alloc_contig_pages() already properly handled __GFP_ZERO so we can use it, otherwise I‘ll fix that. I don‘t recall that memory hotunplug does any zeroing - that‘s why I didn‘t add any explicit zeroing. Could be you were just lucky in your experiment - I assume we‘ll leak kernel memory already. Thank! > cheers ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH v1 4/4] powernv/memtrace: don't abuse memory hot(un)plug infrastructure for memory allocations 2020-11-05 8:29 ` [PATCH v1 4/4] powernv/memtrace: don't abuse memory hot(un)plug infrastructure for memory allocations David Hildenbrand @ 2020-11-05 10:47 ` Michael Ellerman 0 siblings, 0 replies; 5+ messages in thread From: Michael Ellerman @ 2020-11-05 10:47 UTC (permalink / raw) To: David Hildenbrand Cc: David Hildenbrand, linux-kernel, linux-mm, linuxppc-dev, Michal Hocko, Benjamin Herrenschmidt, Paul Mackerras, Rashmica Gupta, Andrew Morton, Mike Rapoport, Michal Hocko, Oscar Salvador, Wei Yang David Hildenbrand <david@redhat.com> writes: >> Am 05.11.2020 um 03:53 schrieb Michael Ellerman <mpe@ellerman.id.au>: >> >> David Hildenbrand <david@redhat.com> writes: >>> Let's use alloc_contig_pages() for allocating memory and remove the >>> linear mapping manually via arch_remove_linear_mapping(). Mark all pages >>> PG_offline, such that they will definitely not get touched - e.g., >>> when hibernating. When freeing memory, try to revert what we did. >>> The original idea was discussed in: >>> https://lkml.kernel.org/r/48340e96-7e6b-736f-9e23-d3111b915b6e@redhat.com >>> This is similar to CONFIG_DEBUG_PAGEALLOC handling on other >>> architectures, whereby only single pages are unmapped from the linear >>> mapping. Let's mimic what memory hot(un)plug would do with the linear >>> mapping. >>> We now need MEMORY_HOTPLUG and CONTIG_ALLOC as dependencies. >>> Simple test under QEMU TCG (10GB RAM, single NUMA node): >>> sh-5.0# mount -t debugfs none /sys/kernel/debug/ >>> sh-5.0# cat /sys/devices/system/memory/block_size_bytes >>> 40000000 >>> sh-5.0# echo 0x40000000 > /sys/kernel/debug/powerpc/memtrace/enable >>> [ 71.052836][ T356] memtrace: Allocated trace memory on node 0 at 0x0000000080000000 >>> sh-5.0# echo 0x80000000 > /sys/kernel/debug/powerpc/memtrace/enable >>> [ 75.424302][ T356] radix-mmu: Mapped 0x0000000080000000-0x00000000c0000000 with 64.0 KiB pages >>> [ 75.430549][ T356] memtrace: Freed trace memory back on node 0 >>> [ 75.604520][ T356] memtrace: Allocated trace memory on node 0 at 0x0000000080000000 >>> sh-5.0# echo 0x100000000 > /sys/kernel/debug/powerpc/memtrace/enable >>> [ 80.418835][ T356] radix-mmu: Mapped 0x0000000080000000-0x0000000100000000 with 64.0 KiB pages >>> [ 80.430493][ T356] memtrace: Freed trace memory back on node 0 >>> [ 80.433882][ T356] memtrace: Failed to allocate trace memory on node 0 >>> sh-5.0# echo 0x40000000 > /sys/kernel/debug/powerpc/memtrace/enable >>> [ 91.920158][ T356] memtrace: Allocated trace memory on node 0 at 0x0000000080000000 >> >> I gave this a quick spin on a real machine, seems to work OK. >> >> I don't have the actual memtrace tools setup to do an actual trace, will >> try and get someone to test that also. >> >> One observation is that previously the memory was zeroed when enabling >> the memtrace, whereas now it's not. >> >> eg, before: >> >> # hexdump -C /sys/kernel/debug/powerpc/memtrace/00000000/trace >> 00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| >> * >> 10000000 >> >> whereas after: >> >> # hexdump -C /sys/kernel/debug/powerpc/memtrace/00000000/trace >> 00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| >> * >> 00000080 e0 fd 43 00 00 00 00 00 e0 fd 43 00 00 00 00 00 |..C.......C.....| >> 00000090 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| >> * >> 00000830 98 bf 39 00 00 00 00 00 98 bf 39 00 00 00 00 00 |..9.......9.....| >> 00000840 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| >> * >> 000008a0 b0 c8 47 00 00 00 00 00 b0 c8 47 00 00 00 00 00 |..G.......G.....| >> 000008b0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| >> ... >> 0fffff70 78 53 49 7d 00 00 29 2e 88 00 92 41 01 00 49 39 |xSI}..)....A..I9| >> 0fffff80 b4 07 4a 7d 28 f8 00 7d 00 48 08 7c 0c 00 c2 40 |..J}(..}.H.|...@| >> 0fffff90 2d f9 40 7d f0 ff c2 40 b4 07 0a 7d 00 48 8a 7f |-.@}...@...}.H..| >> 0fffffa0 70 fe 9e 41 cc ff ff 4b 00 00 00 60 00 00 00 60 |p..A...K...`...`| >> 0fffffb0 01 00 00 48 00 00 00 60 00 00 a3 2f 0c fd 9e 40 |...H...`.../...@| >> 0fffffc0 00 00 a2 3c 00 00 a5 e8 00 00 62 3c 00 00 63 e8 |...<......b<..c.| >> 0fffffd0 01 00 20 39 83 02 80 38 00 00 3c 99 01 00 00 48 |.. 9...8..<....H| >> 0fffffe0 00 00 00 60 e4 fc ff 4b 00 00 80 38 78 fb e3 7f |...`...K...8x...| >> 0ffffff0 01 00 00 48 00 00 00 60 2c fe ff 4b 00 00 00 60 |...H...`,..K...`| >> 10000000 >> >> >> That's a nice way for root to read kernel memory, so we should probably >> add a __GFP_ZERO or memset in there somewhere. > > Thanks for catching that! Will have a look on Monday if > alloc_contig_pages() already properly handled __GFP_ZERO so we can use > it, otherwise I‘ll fix that. I had a quick look and didn't see it, but maybe it is in there somewhere. > I don‘t recall that memory hotunplug does any zeroing - that‘s why I > didn‘t add any explicit zeroing. Could be you were just lucky in your > experiment - I assume we‘ll leak kernel memory already. Hmm yeah good point. I did try it multiple times, and I never get anything non-zero with the existing code. I guess it's just that the new method is more likely to give us memory that's already been used for something. So I guess that's not actually a problem with your patch, it's just exposing an existing issue. cheers ^ permalink raw reply [flat|nested] 5+ messages in thread
* [PATCH v1 0/4] powernv/memtrace: don't abuse memory hot(un)plug infrastructure for memory allocations
@ 2020-10-29 16:27 David Hildenbrand
2020-10-29 16:27 ` [PATCH v1 4/4] " David Hildenbrand
0 siblings, 1 reply; 5+ messages in thread
From: David Hildenbrand @ 2020-10-29 16:27 UTC (permalink / raw)
To: linux-kernel
Cc: linux-mm, linuxppc-dev, David Hildenbrand, Andrew Morton,
Benjamin Herrenschmidt, Michael Ellerman, Michal Hocko,
Michal Hocko, Mike Rapoport, Oscar Salvador, Paul Mackerras,
Rashmica Gupta, Wei Yang
powernv/memtrace is the only in-kernel user that rips out random memory
it never added (doesn't own) in order to allocate memory without a
linear mapping. Let's stop abusing memory hot(un)plug infrastructure for
that - use alloc_contig_pages() for allocating memory and remove the
linear mapping manually.
The original idea was discussed in:
https://lkml.kernel.org/r/48340e96-7e6b-736f-9e23-d3111b915b6e@redhat.com
I only tested allocations briefly via QEMU TCG - see patch #4 for more
details.
David Hildenbrand (4):
powerpc/mm: factor out creating/removing linear mapping
powerpc/mm: print warning in arch_remove_linear_mapping()
powerpc/mm: remove linear mapping if __add_pages() fails in
arch_add_memory()
powernv/memtrace: don't abuse memory hot(un)plug infrastructure for
memory allocations
arch/powerpc/mm/mem.c | 48 +++++---
arch/powerpc/platforms/powernv/Kconfig | 8 +-
arch/powerpc/platforms/powernv/memtrace.c | 134 ++++++++--------------
include/linux/memory_hotplug.h | 3 +
4 files changed, 86 insertions(+), 107 deletions(-)
--
2.26.2
^ permalink raw reply [flat|nested] 5+ messages in thread* [PATCH v1 4/4] powernv/memtrace: don't abuse memory hot(un)plug infrastructure for memory allocations 2020-10-29 16:27 [PATCH v1 0/4] " David Hildenbrand @ 2020-10-29 16:27 ` David Hildenbrand 2020-11-03 9:23 ` Michal Hocko 0 siblings, 1 reply; 5+ messages in thread From: David Hildenbrand @ 2020-10-29 16:27 UTC (permalink / raw) To: linux-kernel Cc: linux-mm, linuxppc-dev, David Hildenbrand, Michal Hocko, Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras, Rashmica Gupta, Andrew Morton, Mike Rapoport, Michal Hocko, Oscar Salvador, Wei Yang Let's use alloc_contig_pages() for allocating memory and remove the linear mapping manually via arch_remove_linear_mapping(). Mark all pages PG_offline, such that they will definitely not get touched - e.g., when hibernating. When freeing memory, try to revert what we did. The original idea was discussed in: https://lkml.kernel.org/r/48340e96-7e6b-736f-9e23-d3111b915b6e@redhat.com This is similar to CONFIG_DEBUG_PAGEALLOC handling on other architectures, whereby only single pages are unmapped from the linear mapping. Let's mimic what memory hot(un)plug would do with the linear mapping. We now need MEMORY_HOTPLUG and CONTIG_ALLOC as dependencies. Simple test under QEMU TCG (10GB RAM, single NUMA node): sh-5.0# mount -t debugfs none /sys/kernel/debug/ sh-5.0# cat /sys/devices/system/memory/block_size_bytes 40000000 sh-5.0# echo 0x40000000 > /sys/kernel/debug/powerpc/memtrace/enable [ 71.052836][ T356] memtrace: Allocated trace memory on node 0 at 0x0000000080000000 sh-5.0# echo 0x80000000 > /sys/kernel/debug/powerpc/memtrace/enable [ 75.424302][ T356] radix-mmu: Mapped 0x0000000080000000-0x00000000c0000000 with 64.0 KiB pages [ 75.430549][ T356] memtrace: Freed trace memory back on node 0 [ 75.604520][ T356] memtrace: Allocated trace memory on node 0 at 0x0000000080000000 sh-5.0# echo 0x100000000 > /sys/kernel/debug/powerpc/memtrace/enable [ 80.418835][ T356] radix-mmu: Mapped 0x0000000080000000-0x0000000100000000 with 64.0 KiB pages [ 80.430493][ T356] memtrace: Freed trace memory back on node 0 [ 80.433882][ T356] memtrace: Failed to allocate trace memory on node 0 sh-5.0# echo 0x40000000 > /sys/kernel/debug/powerpc/memtrace/enable [ 91.920158][ T356] memtrace: Allocated trace memory on node 0 at 0x0000000080000000 Note 1: We currently won't be allocating from ZONE_MOVABLE - because our pages are not movable. However, as we don't run with any memory hot(un)plug mechanism around, we could make an exception to increase the chance of allocations succeeding. Note 2: PG_reserved isn't sufficient. E.g., kernel_page_present() used along PG_reserved in hibernation code will always return "true" on powerpc, resulting in the pages getting touched. It's too generic - e.g., indicates boot allocations. Note 3: For now, we keep using memory_block_size_bytes() as minimum granularity. I'm not able to come up with a better guess (most probably, doing it on a section basis could be possible). Suggested-by: Michal Hocko <mhocko@kernel.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Rashmica Gupta <rashmica.g@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Wei Yang <richard.weiyang@linux.alibaba.com> Signed-off-by: David Hildenbrand <david@redhat.com> --- arch/powerpc/platforms/powernv/Kconfig | 8 +- arch/powerpc/platforms/powernv/memtrace.c | 134 ++++++++-------------- 2 files changed, 49 insertions(+), 93 deletions(-) diff --git a/arch/powerpc/platforms/powernv/Kconfig b/arch/powerpc/platforms/powernv/Kconfig index 938803eab0ad..619b093a0657 100644 --- a/arch/powerpc/platforms/powernv/Kconfig +++ b/arch/powerpc/platforms/powernv/Kconfig @@ -27,11 +27,11 @@ config OPAL_PRD recovery diagnostics on OpenPower machines config PPC_MEMTRACE - bool "Enable removal of RAM from kernel mappings for tracing" - depends on PPC_POWERNV && MEMORY_HOTREMOVE + bool "Enable runtime allocation of RAM for tracing" + depends on PPC_POWERNV && MEMORY_HOTPLUG && CONTIG_ALLOC help - Enabling this option allows for the removal of memory (RAM) - from the kernel mappings to be used for hardware tracing. + Enabling this option allows for runtime allocation of memory (RAM) + for hardware tracing. config PPC_VAS bool "IBM Virtual Accelerator Switchboard (VAS)" diff --git a/arch/powerpc/platforms/powernv/memtrace.c b/arch/powerpc/platforms/powernv/memtrace.c index 6828108486f8..8f47797a78c2 100644 --- a/arch/powerpc/platforms/powernv/memtrace.c +++ b/arch/powerpc/platforms/powernv/memtrace.c @@ -50,83 +50,50 @@ static const struct file_operations memtrace_fops = { .open = simple_open, }; -static int check_memblock_online(struct memory_block *mem, void *arg) -{ - if (mem->state != MEM_ONLINE) - return -1; - - return 0; -} - -static int change_memblock_state(struct memory_block *mem, void *arg) -{ - unsigned long state = (unsigned long)arg; - - mem->state = state; - - return 0; -} - -/* called with device_hotplug_lock held */ -static bool memtrace_offline_pages(u32 nid, u64 start_pfn, u64 nr_pages) +static u64 memtrace_alloc_node(u32 nid, u64 size) { - const unsigned long start = PFN_PHYS(start_pfn); - const unsigned long size = PFN_PHYS(nr_pages); + const unsigned long nr_pages = PHYS_PFN(size); + unsigned long pfn, start_pfn; + struct page *page; - if (walk_memory_blocks(start, size, NULL, check_memblock_online)) - return false; - - walk_memory_blocks(start, size, (void *)MEM_GOING_OFFLINE, - change_memblock_state); - - if (offline_pages(start_pfn, nr_pages)) { - walk_memory_blocks(start, size, (void *)MEM_ONLINE, - change_memblock_state); - return false; - } + /* + * Trace memory needs to be aligned to the size, which is guaranteed + * by alloc_contig_pages(). + */ + page = alloc_contig_pages(nr_pages, __GFP_THISNODE | __GFP_NOWARN, + nid, NULL); + if (!page) + return 0; - walk_memory_blocks(start, size, (void *)MEM_OFFLINE, - change_memblock_state); + /* + * Set pages PageOffline(), to indicate that nobody (e.g., hibernation, + * dumping, ...) should be touching these pages. + */ + start_pfn = page_to_pfn(page); + for (pfn = start_pfn; pfn < start_pfn + nr_pages; pfn++) + __SetPageOffline(pfn_to_page(pfn)); + arch_remove_linear_mapping(PFN_PHYS(start_pfn), size); - return true; + return PFN_PHYS(start_pfn); } -static u64 memtrace_alloc_node(u32 nid, u64 size) +static int memtrace_free(int nid, u64 start, u64 size) { - u64 start_pfn, end_pfn, nr_pages, pfn; - u64 base_pfn; - u64 bytes = memory_block_size_bytes(); + struct mhp_params params = { .pgprot = PAGE_KERNEL }; + const unsigned long nr_pages = PHYS_PFN(size); + const unsigned long start_pfn = PHYS_PFN(start); + unsigned long pfn; + int ret; - if (!node_spanned_pages(nid)) - return 0; + ret = arch_create_linear_mapping(nid, start, size, ¶ms); + if (ret) + return ret; - start_pfn = node_start_pfn(nid); - end_pfn = node_end_pfn(nid); - nr_pages = size >> PAGE_SHIFT; - - /* Trace memory needs to be aligned to the size */ - end_pfn = round_down(end_pfn - nr_pages, nr_pages); - - lock_device_hotplug(); - for (base_pfn = end_pfn; base_pfn > start_pfn; base_pfn -= nr_pages) { - if (memtrace_offline_pages(nid, base_pfn, nr_pages) == true) { - /* - * Remove memory in memory block size chunks so that - * iomem resources are always split to the same size and - * we never try to remove memory that spans two iomem - * resources. - */ - end_pfn = base_pfn + nr_pages; - for (pfn = base_pfn; pfn < end_pfn; pfn += bytes>> PAGE_SHIFT) { - __remove_memory(nid, pfn << PAGE_SHIFT, bytes); - } - unlock_device_hotplug(); - return base_pfn << PAGE_SHIFT; - } - } - unlock_device_hotplug(); + for (pfn = start_pfn; pfn < start_pfn + nr_pages; pfn++) + __ClearPageOffline(pfn_to_page(pfn)); + free_contig_range(start_pfn, nr_pages); return 0; } @@ -197,16 +164,11 @@ static int memtrace_init_debugfs(void) return ret; } -static int online_mem_block(struct memory_block *mem, void *arg) -{ - return device_online(&mem->dev); -} - /* - * Iterate through the chunks of memory we have removed from the kernel - * and attempt to add them back to the kernel. + * Iterate through the chunks of memory we allocated and attempt to expose + * them back to the kernel. */ -static int memtrace_online(void) +static int memtrace_free_regions(void) { int i, ret = 0; struct memtrace_entry *ent; @@ -214,7 +176,7 @@ static int memtrace_online(void) for (i = memtrace_array_nr - 1; i >= 0; i--) { ent = &memtrace_array[i]; - /* We have onlined this chunk previously */ + /* We have freed this chunk previously */ if (ent->nid == NUMA_NO_NODE) continue; @@ -224,30 +186,25 @@ static int memtrace_online(void) ent->mem = 0; } - if (add_memory(ent->nid, ent->start, ent->size, MHP_NONE)) { - pr_err("Failed to add trace memory to node %d\n", + if (memtrace_free(ent->nid, ent->start, ent->size)) { + pr_err("Failed to free trace memory on node %d\n", ent->nid); ret += 1; continue; } - lock_device_hotplug(); - walk_memory_blocks(ent->start, ent->size, NULL, - online_mem_block); - unlock_device_hotplug(); - /* - * Memory was added successfully so clean up references to it - * so on reentry we can tell that this chunk was added. + * Memory was freed successfully so clean up references to it + * so on reentry we can tell that this chunk was freed. */ debugfs_remove_recursive(ent->dir); - pr_info("Added trace memory back to node %d\n", ent->nid); + pr_info("Freed trace memory back on node %d\n", ent->nid); ent->size = ent->start = ent->nid = NUMA_NO_NODE; } if (ret) return ret; - /* If all chunks of memory were added successfully, reset globals */ + /* If all chunks of memory were freed successfully, reset globals */ kfree(memtrace_array); memtrace_array = NULL; memtrace_size = 0; @@ -269,16 +226,15 @@ static int memtrace_enable_set(void *data, u64 val) return -EINVAL; } - /* Re-add/online previously removed/offlined memory */ + /* Free all previously allocated memory. */ if (memtrace_size) { - if (memtrace_online()) + if (memtrace_free_regions()) return -EAGAIN; } if (!val) return 0; - /* Offline and remove memory */ if (memtrace_init_regions_runtime(val)) return -EINVAL; -- 2.26.2 ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH v1 4/4] powernv/memtrace: don't abuse memory hot(un)plug infrastructure for memory allocations 2020-10-29 16:27 ` [PATCH v1 4/4] " David Hildenbrand @ 2020-11-03 9:23 ` Michal Hocko 2020-11-03 9:29 ` David Hildenbrand 0 siblings, 1 reply; 5+ messages in thread From: Michal Hocko @ 2020-11-03 9:23 UTC (permalink / raw) To: David Hildenbrand Cc: linux-kernel, linux-mm, linuxppc-dev, Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras, Rashmica Gupta, Andrew Morton, Mike Rapoport, Oscar Salvador, Wei Yang On Thu 29-10-20 17:27:18, David Hildenbrand wrote: > Let's use alloc_contig_pages() for allocating memory and remove the > linear mapping manually via arch_remove_linear_mapping(). Mark all pages > PG_offline, such that they will definitely not get touched - e.g., > when hibernating. When freeing memory, try to revert what we did. > > The original idea was discussed in: > https://lkml.kernel.org/r/48340e96-7e6b-736f-9e23-d3111b915b6e@redhat.com > > This is similar to CONFIG_DEBUG_PAGEALLOC handling on other > architectures, whereby only single pages are unmapped from the linear > mapping. Let's mimic what memory hot(un)plug would do with the linear > mapping. > > We now need MEMORY_HOTPLUG and CONTIG_ALLOC as dependencies. > > Simple test under QEMU TCG (10GB RAM, single NUMA node): > > sh-5.0# mount -t debugfs none /sys/kernel/debug/ > sh-5.0# cat /sys/devices/system/memory/block_size_bytes > 40000000 > sh-5.0# echo 0x40000000 > /sys/kernel/debug/powerpc/memtrace/enable > [ 71.052836][ T356] memtrace: Allocated trace memory on node 0 at 0x0000000080000000 > sh-5.0# echo 0x80000000 > /sys/kernel/debug/powerpc/memtrace/enable > [ 75.424302][ T356] radix-mmu: Mapped 0x0000000080000000-0x00000000c0000000 with 64.0 KiB pages > [ 75.430549][ T356] memtrace: Freed trace memory back on node 0 > [ 75.604520][ T356] memtrace: Allocated trace memory on node 0 at 0x0000000080000000 > sh-5.0# echo 0x100000000 > /sys/kernel/debug/powerpc/memtrace/enable > [ 80.418835][ T356] radix-mmu: Mapped 0x0000000080000000-0x0000000100000000 with 64.0 KiB pages > [ 80.430493][ T356] memtrace: Freed trace memory back on node 0 > [ 80.433882][ T356] memtrace: Failed to allocate trace memory on node 0 > sh-5.0# echo 0x40000000 > /sys/kernel/debug/powerpc/memtrace/enable > [ 91.920158][ T356] memtrace: Allocated trace memory on node 0 at 0x0000000080000000 > > Note 1: We currently won't be allocating from ZONE_MOVABLE - because our > pages are not movable. However, as we don't run with any memory > hot(un)plug mechanism around, we could make an exception to > increase the chance of allocations succeeding. > > Note 2: PG_reserved isn't sufficient. E.g., kernel_page_present() used > along PG_reserved in hibernation code will always return "true" > on powerpc, resulting in the pages getting touched. It's too > generic - e.g., indicates boot allocations. > > Note 3: For now, we keep using memory_block_size_bytes() as minimum > granularity. I'm not able to come up with a better guess (most > probably, doing it on a section basis could be possible). > > Suggested-by: Michal Hocko <mhocko@kernel.org> > Cc: Michael Ellerman <mpe@ellerman.id.au> > Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> > Cc: Paul Mackerras <paulus@samba.org> > Cc: Rashmica Gupta <rashmica.g@gmail.com> > Cc: Andrew Morton <akpm@linux-foundation.org> > Cc: Mike Rapoport <rppt@kernel.org> > Cc: Michal Hocko <mhocko@suse.com> > Cc: Oscar Salvador <osalvador@suse.de> > Cc: Wei Yang <richard.weiyang@linux.alibaba.com> > Signed-off-by: David Hildenbrand <david@redhat.com> Thanks! This looks like a move into the right direction. I cannot really judge implementation details because I am not familiar with the code. I have only one tiny concern: [...] > -/* called with device_hotplug_lock held */ > -static bool memtrace_offline_pages(u32 nid, u64 start_pfn, u64 nr_pages) > +static u64 memtrace_alloc_node(u32 nid, u64 size) > { > - const unsigned long start = PFN_PHYS(start_pfn); > - const unsigned long size = PFN_PHYS(nr_pages); > + const unsigned long nr_pages = PHYS_PFN(size); > + unsigned long pfn, start_pfn; > + struct page *page; > > - if (walk_memory_blocks(start, size, NULL, check_memblock_online)) > - return false; > - > - walk_memory_blocks(start, size, (void *)MEM_GOING_OFFLINE, > - change_memblock_state); > - > - if (offline_pages(start_pfn, nr_pages)) { > - walk_memory_blocks(start, size, (void *)MEM_ONLINE, > - change_memblock_state); > - return false; > - } > + /* > + * Trace memory needs to be aligned to the size, which is guaranteed > + * by alloc_contig_pages(). > + */ > + page = alloc_contig_pages(nr_pages, __GFP_THISNODE | __GFP_NOWARN, > + nid, NULL); __GFP_THISNODE without other modifiers looks suspicious. I suspect you want to enfore node locality and exclude movable zones by this. While this works it is an antipattern. I would rather use GFP_KERNEL | __GFP_THISNODE | __GFP_NOWARN to be more in line with other gfp usage. If for no other reasons we want to be able to work inside a normal compaction context (comparing to effectively GFP_NOIO which the above implies). Also this looks like a sleepable context. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH v1 4/4] powernv/memtrace: don't abuse memory hot(un)plug infrastructure for memory allocations 2020-11-03 9:23 ` Michal Hocko @ 2020-11-03 9:29 ` David Hildenbrand 0 siblings, 0 replies; 5+ messages in thread From: David Hildenbrand @ 2020-11-03 9:29 UTC (permalink / raw) To: Michal Hocko Cc: linux-kernel, linux-mm, linuxppc-dev, Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras, Rashmica Gupta, Andrew Morton, Mike Rapoport, Oscar Salvador, Wei Yang On 03.11.20 10:23, Michal Hocko wrote: > On Thu 29-10-20 17:27:18, David Hildenbrand wrote: >> Let's use alloc_contig_pages() for allocating memory and remove the >> linear mapping manually via arch_remove_linear_mapping(). Mark all pages >> PG_offline, such that they will definitely not get touched - e.g., >> when hibernating. When freeing memory, try to revert what we did. >> >> The original idea was discussed in: >> https://lkml.kernel.org/r/48340e96-7e6b-736f-9e23-d3111b915b6e@redhat.com >> >> This is similar to CONFIG_DEBUG_PAGEALLOC handling on other >> architectures, whereby only single pages are unmapped from the linear >> mapping. Let's mimic what memory hot(un)plug would do with the linear >> mapping. >> >> We now need MEMORY_HOTPLUG and CONTIG_ALLOC as dependencies. >> >> Simple test under QEMU TCG (10GB RAM, single NUMA node): >> >> sh-5.0# mount -t debugfs none /sys/kernel/debug/ >> sh-5.0# cat /sys/devices/system/memory/block_size_bytes >> 40000000 >> sh-5.0# echo 0x40000000 > /sys/kernel/debug/powerpc/memtrace/enable >> [ 71.052836][ T356] memtrace: Allocated trace memory on node 0 at 0x0000000080000000 >> sh-5.0# echo 0x80000000 > /sys/kernel/debug/powerpc/memtrace/enable >> [ 75.424302][ T356] radix-mmu: Mapped 0x0000000080000000-0x00000000c0000000 with 64.0 KiB pages >> [ 75.430549][ T356] memtrace: Freed trace memory back on node 0 >> [ 75.604520][ T356] memtrace: Allocated trace memory on node 0 at 0x0000000080000000 >> sh-5.0# echo 0x100000000 > /sys/kernel/debug/powerpc/memtrace/enable >> [ 80.418835][ T356] radix-mmu: Mapped 0x0000000080000000-0x0000000100000000 with 64.0 KiB pages >> [ 80.430493][ T356] memtrace: Freed trace memory back on node 0 >> [ 80.433882][ T356] memtrace: Failed to allocate trace memory on node 0 >> sh-5.0# echo 0x40000000 > /sys/kernel/debug/powerpc/memtrace/enable >> [ 91.920158][ T356] memtrace: Allocated trace memory on node 0 at 0x0000000080000000 >> >> Note 1: We currently won't be allocating from ZONE_MOVABLE - because our >> pages are not movable. However, as we don't run with any memory >> hot(un)plug mechanism around, we could make an exception to >> increase the chance of allocations succeeding. >> >> Note 2: PG_reserved isn't sufficient. E.g., kernel_page_present() used >> along PG_reserved in hibernation code will always return "true" >> on powerpc, resulting in the pages getting touched. It's too >> generic - e.g., indicates boot allocations. >> >> Note 3: For now, we keep using memory_block_size_bytes() as minimum >> granularity. I'm not able to come up with a better guess (most >> probably, doing it on a section basis could be possible). >> >> Suggested-by: Michal Hocko <mhocko@kernel.org> >> Cc: Michael Ellerman <mpe@ellerman.id.au> >> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> >> Cc: Paul Mackerras <paulus@samba.org> >> Cc: Rashmica Gupta <rashmica.g@gmail.com> >> Cc: Andrew Morton <akpm@linux-foundation.org> >> Cc: Mike Rapoport <rppt@kernel.org> >> Cc: Michal Hocko <mhocko@suse.com> >> Cc: Oscar Salvador <osalvador@suse.de> >> Cc: Wei Yang <richard.weiyang@linux.alibaba.com> >> Signed-off-by: David Hildenbrand <david@redhat.com> > > Thanks! This looks like a move into the right direction. I cannot really > judge implementation details because I am not familiar with the code. > I have only one tiny concern: > [...] >> -/* called with device_hotplug_lock held */ >> -static bool memtrace_offline_pages(u32 nid, u64 start_pfn, u64 nr_pages) >> +static u64 memtrace_alloc_node(u32 nid, u64 size) >> { >> - const unsigned long start = PFN_PHYS(start_pfn); >> - const unsigned long size = PFN_PHYS(nr_pages); >> + const unsigned long nr_pages = PHYS_PFN(size); >> + unsigned long pfn, start_pfn; >> + struct page *page; >> >> - if (walk_memory_blocks(start, size, NULL, check_memblock_online)) >> - return false; >> - >> - walk_memory_blocks(start, size, (void *)MEM_GOING_OFFLINE, >> - change_memblock_state); >> - >> - if (offline_pages(start_pfn, nr_pages)) { >> - walk_memory_blocks(start, size, (void *)MEM_ONLINE, >> - change_memblock_state); >> - return false; >> - } >> + /* >> + * Trace memory needs to be aligned to the size, which is guaranteed >> + * by alloc_contig_pages(). >> + */ >> + page = alloc_contig_pages(nr_pages, __GFP_THISNODE | __GFP_NOWARN, >> + nid, NULL); > > __GFP_THISNODE without other modifiers looks suspicious. I suspect you > want to enfore node locality and exclude movable zones by this. While > this works it is an antipattern. I would rather use GFP_KERNEL | > __GFP_THISNODE | __GFP_NOWARN to be more in line with other gfp usage. Agreed GFP_KERNEL should be the right thing to do here. > > If for no other reasons we want to be able to work inside a normal > compaction context (comparing to effectively GFP_NOIO which the above > implies). Also this looks like a sleepable context. > Yes it is. Thanks! -- Thanks, David / dhildenb ^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2020-11-05 10:48 UTC | newest]
Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <87o8kcttjp.fsf@mpe.ellerman.id.au>
2020-11-05 8:29 ` [PATCH v1 4/4] powernv/memtrace: don't abuse memory hot(un)plug infrastructure for memory allocations David Hildenbrand
2020-11-05 10:47 ` Michael Ellerman
2020-10-29 16:27 [PATCH v1 0/4] " David Hildenbrand
2020-10-29 16:27 ` [PATCH v1 4/4] " David Hildenbrand
2020-11-03 9:23 ` Michal Hocko
2020-11-03 9:29 ` David Hildenbrand
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox