* Re: [PATCH v1 4/4] powernv/memtrace: don't abuse memory hot(un)plug infrastructure for memory allocations
[not found] <87o8kcttjp.fsf@mpe.ellerman.id.au>
@ 2020-11-05 8:29 ` David Hildenbrand
2020-11-05 10:47 ` Michael Ellerman
0 siblings, 1 reply; 5+ messages in thread
From: David Hildenbrand @ 2020-11-05 8:29 UTC (permalink / raw)
To: Michael Ellerman
Cc: David Hildenbrand, linux-kernel, linux-mm, linuxppc-dev,
Michal Hocko, Benjamin Herrenschmidt, Paul Mackerras,
Rashmica Gupta, Andrew Morton, Mike Rapoport, Michal Hocko,
Oscar Salvador, Wei Yang
> Am 05.11.2020 um 03:53 schrieb Michael Ellerman <mpe@ellerman.id.au>:
>
> David Hildenbrand <david@redhat.com> writes:
>> Let's use alloc_contig_pages() for allocating memory and remove the
>> linear mapping manually via arch_remove_linear_mapping(). Mark all pages
>> PG_offline, such that they will definitely not get touched - e.g.,
>> when hibernating. When freeing memory, try to revert what we did.
>> The original idea was discussed in:
>> https://lkml.kernel.org/r/48340e96-7e6b-736f-9e23-d3111b915b6e@redhat.com
>> This is similar to CONFIG_DEBUG_PAGEALLOC handling on other
>> architectures, whereby only single pages are unmapped from the linear
>> mapping. Let's mimic what memory hot(un)plug would do with the linear
>> mapping.
>> We now need MEMORY_HOTPLUG and CONTIG_ALLOC as dependencies.
>> Simple test under QEMU TCG (10GB RAM, single NUMA node):
>> sh-5.0# mount -t debugfs none /sys/kernel/debug/
>> sh-5.0# cat /sys/devices/system/memory/block_size_bytes
>> 40000000
>> sh-5.0# echo 0x40000000 > /sys/kernel/debug/powerpc/memtrace/enable
>> [ 71.052836][ T356] memtrace: Allocated trace memory on node 0 at 0x0000000080000000
>> sh-5.0# echo 0x80000000 > /sys/kernel/debug/powerpc/memtrace/enable
>> [ 75.424302][ T356] radix-mmu: Mapped 0x0000000080000000-0x00000000c0000000 with 64.0 KiB pages
>> [ 75.430549][ T356] memtrace: Freed trace memory back on node 0
>> [ 75.604520][ T356] memtrace: Allocated trace memory on node 0 at 0x0000000080000000
>> sh-5.0# echo 0x100000000 > /sys/kernel/debug/powerpc/memtrace/enable
>> [ 80.418835][ T356] radix-mmu: Mapped 0x0000000080000000-0x0000000100000000 with 64.0 KiB pages
>> [ 80.430493][ T356] memtrace: Freed trace memory back on node 0
>> [ 80.433882][ T356] memtrace: Failed to allocate trace memory on node 0
>> sh-5.0# echo 0x40000000 > /sys/kernel/debug/powerpc/memtrace/enable
>> [ 91.920158][ T356] memtrace: Allocated trace memory on node 0 at 0x0000000080000000
>
> I gave this a quick spin on a real machine, seems to work OK.
>
> I don't have the actual memtrace tools setup to do an actual trace, will
> try and get someone to test that also.
>
> One observation is that previously the memory was zeroed when enabling
> the memtrace, whereas now it's not.
>
> eg, before:
>
> # hexdump -C /sys/kernel/debug/powerpc/memtrace/00000000/trace
> 00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
> *
> 10000000
>
> whereas after:
>
> # hexdump -C /sys/kernel/debug/powerpc/memtrace/00000000/trace
> 00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
> *
> 00000080 e0 fd 43 00 00 00 00 00 e0 fd 43 00 00 00 00 00 |..C.......C.....|
> 00000090 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
> *
> 00000830 98 bf 39 00 00 00 00 00 98 bf 39 00 00 00 00 00 |..9.......9.....|
> 00000840 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
> *
> 000008a0 b0 c8 47 00 00 00 00 00 b0 c8 47 00 00 00 00 00 |..G.......G.....|
> 000008b0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
> ...
> 0fffff70 78 53 49 7d 00 00 29 2e 88 00 92 41 01 00 49 39 |xSI}..)....A..I9|
> 0fffff80 b4 07 4a 7d 28 f8 00 7d 00 48 08 7c 0c 00 c2 40 |..J}(..}.H.|...@|
> 0fffff90 2d f9 40 7d f0 ff c2 40 b4 07 0a 7d 00 48 8a 7f |-.@}...@...}.H..|
> 0fffffa0 70 fe 9e 41 cc ff ff 4b 00 00 00 60 00 00 00 60 |p..A...K...`...`|
> 0fffffb0 01 00 00 48 00 00 00 60 00 00 a3 2f 0c fd 9e 40 |...H...`.../...@|
> 0fffffc0 00 00 a2 3c 00 00 a5 e8 00 00 62 3c 00 00 63 e8 |...<......b<..c.|
> 0fffffd0 01 00 20 39 83 02 80 38 00 00 3c 99 01 00 00 48 |.. 9...8..<....H|
> 0fffffe0 00 00 00 60 e4 fc ff 4b 00 00 80 38 78 fb e3 7f |...`...K...8x...|
> 0ffffff0 01 00 00 48 00 00 00 60 2c fe ff 4b 00 00 00 60 |...H...`,..K...`|
> 10000000
>
>
> That's a nice way for root to read kernel memory, so we should probably
> add a __GFP_ZERO or memset in there somewhere.
Thanks for catching that! Will have a look on Monday if alloc_contig_pages() already properly handled __GFP_ZERO so we can use it, otherwise I‘ll fix that.
I don‘t recall that memory hotunplug does any zeroing - that‘s why I didn‘t add any explicit zeroing. Could be you were just lucky in your experiment - I assume we‘ll leak kernel memory already.
Thank!
> cheers
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH v1 4/4] powernv/memtrace: don't abuse memory hot(un)plug infrastructure for memory allocations
2020-11-05 8:29 ` [PATCH v1 4/4] powernv/memtrace: don't abuse memory hot(un)plug infrastructure for memory allocations David Hildenbrand
@ 2020-11-05 10:47 ` Michael Ellerman
0 siblings, 0 replies; 5+ messages in thread
From: Michael Ellerman @ 2020-11-05 10:47 UTC (permalink / raw)
To: David Hildenbrand
Cc: David Hildenbrand, linux-kernel, linux-mm, linuxppc-dev,
Michal Hocko, Benjamin Herrenschmidt, Paul Mackerras,
Rashmica Gupta, Andrew Morton, Mike Rapoport, Michal Hocko,
Oscar Salvador, Wei Yang
David Hildenbrand <david@redhat.com> writes:
>> Am 05.11.2020 um 03:53 schrieb Michael Ellerman <mpe@ellerman.id.au>:
>>
>> David Hildenbrand <david@redhat.com> writes:
>>> Let's use alloc_contig_pages() for allocating memory and remove the
>>> linear mapping manually via arch_remove_linear_mapping(). Mark all pages
>>> PG_offline, such that they will definitely not get touched - e.g.,
>>> when hibernating. When freeing memory, try to revert what we did.
>>> The original idea was discussed in:
>>> https://lkml.kernel.org/r/48340e96-7e6b-736f-9e23-d3111b915b6e@redhat.com
>>> This is similar to CONFIG_DEBUG_PAGEALLOC handling on other
>>> architectures, whereby only single pages are unmapped from the linear
>>> mapping. Let's mimic what memory hot(un)plug would do with the linear
>>> mapping.
>>> We now need MEMORY_HOTPLUG and CONTIG_ALLOC as dependencies.
>>> Simple test under QEMU TCG (10GB RAM, single NUMA node):
>>> sh-5.0# mount -t debugfs none /sys/kernel/debug/
>>> sh-5.0# cat /sys/devices/system/memory/block_size_bytes
>>> 40000000
>>> sh-5.0# echo 0x40000000 > /sys/kernel/debug/powerpc/memtrace/enable
>>> [ 71.052836][ T356] memtrace: Allocated trace memory on node 0 at 0x0000000080000000
>>> sh-5.0# echo 0x80000000 > /sys/kernel/debug/powerpc/memtrace/enable
>>> [ 75.424302][ T356] radix-mmu: Mapped 0x0000000080000000-0x00000000c0000000 with 64.0 KiB pages
>>> [ 75.430549][ T356] memtrace: Freed trace memory back on node 0
>>> [ 75.604520][ T356] memtrace: Allocated trace memory on node 0 at 0x0000000080000000
>>> sh-5.0# echo 0x100000000 > /sys/kernel/debug/powerpc/memtrace/enable
>>> [ 80.418835][ T356] radix-mmu: Mapped 0x0000000080000000-0x0000000100000000 with 64.0 KiB pages
>>> [ 80.430493][ T356] memtrace: Freed trace memory back on node 0
>>> [ 80.433882][ T356] memtrace: Failed to allocate trace memory on node 0
>>> sh-5.0# echo 0x40000000 > /sys/kernel/debug/powerpc/memtrace/enable
>>> [ 91.920158][ T356] memtrace: Allocated trace memory on node 0 at 0x0000000080000000
>>
>> I gave this a quick spin on a real machine, seems to work OK.
>>
>> I don't have the actual memtrace tools setup to do an actual trace, will
>> try and get someone to test that also.
>>
>> One observation is that previously the memory was zeroed when enabling
>> the memtrace, whereas now it's not.
>>
>> eg, before:
>>
>> # hexdump -C /sys/kernel/debug/powerpc/memtrace/00000000/trace
>> 00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
>> *
>> 10000000
>>
>> whereas after:
>>
>> # hexdump -C /sys/kernel/debug/powerpc/memtrace/00000000/trace
>> 00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
>> *
>> 00000080 e0 fd 43 00 00 00 00 00 e0 fd 43 00 00 00 00 00 |..C.......C.....|
>> 00000090 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
>> *
>> 00000830 98 bf 39 00 00 00 00 00 98 bf 39 00 00 00 00 00 |..9.......9.....|
>> 00000840 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
>> *
>> 000008a0 b0 c8 47 00 00 00 00 00 b0 c8 47 00 00 00 00 00 |..G.......G.....|
>> 000008b0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
>> ...
>> 0fffff70 78 53 49 7d 00 00 29 2e 88 00 92 41 01 00 49 39 |xSI}..)....A..I9|
>> 0fffff80 b4 07 4a 7d 28 f8 00 7d 00 48 08 7c 0c 00 c2 40 |..J}(..}.H.|...@|
>> 0fffff90 2d f9 40 7d f0 ff c2 40 b4 07 0a 7d 00 48 8a 7f |-.@}...@...}.H..|
>> 0fffffa0 70 fe 9e 41 cc ff ff 4b 00 00 00 60 00 00 00 60 |p..A...K...`...`|
>> 0fffffb0 01 00 00 48 00 00 00 60 00 00 a3 2f 0c fd 9e 40 |...H...`.../...@|
>> 0fffffc0 00 00 a2 3c 00 00 a5 e8 00 00 62 3c 00 00 63 e8 |...<......b<..c.|
>> 0fffffd0 01 00 20 39 83 02 80 38 00 00 3c 99 01 00 00 48 |.. 9...8..<....H|
>> 0fffffe0 00 00 00 60 e4 fc ff 4b 00 00 80 38 78 fb e3 7f |...`...K...8x...|
>> 0ffffff0 01 00 00 48 00 00 00 60 2c fe ff 4b 00 00 00 60 |...H...`,..K...`|
>> 10000000
>>
>>
>> That's a nice way for root to read kernel memory, so we should probably
>> add a __GFP_ZERO or memset in there somewhere.
>
> Thanks for catching that! Will have a look on Monday if
> alloc_contig_pages() already properly handled __GFP_ZERO so we can use
> it, otherwise I‘ll fix that.
I had a quick look and didn't see it, but maybe it is in there somewhere.
> I don‘t recall that memory hotunplug does any zeroing - that‘s why I
> didn‘t add any explicit zeroing. Could be you were just lucky in your
> experiment - I assume we‘ll leak kernel memory already.
Hmm yeah good point. I did try it multiple times, and I never get
anything non-zero with the existing code.
I guess it's just that the new method is more likely to give us memory
that's already been used for something.
So I guess that's not actually a problem with your patch, it's just
exposing an existing issue.
cheers
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH v1 4/4] powernv/memtrace: don't abuse memory hot(un)plug infrastructure for memory allocations
2020-11-03 9:23 ` Michal Hocko
@ 2020-11-03 9:29 ` David Hildenbrand
0 siblings, 0 replies; 5+ messages in thread
From: David Hildenbrand @ 2020-11-03 9:29 UTC (permalink / raw)
To: Michal Hocko
Cc: linux-kernel, linux-mm, linuxppc-dev, Michael Ellerman,
Benjamin Herrenschmidt, Paul Mackerras, Rashmica Gupta,
Andrew Morton, Mike Rapoport, Oscar Salvador, Wei Yang
On 03.11.20 10:23, Michal Hocko wrote:
> On Thu 29-10-20 17:27:18, David Hildenbrand wrote:
>> Let's use alloc_contig_pages() for allocating memory and remove the
>> linear mapping manually via arch_remove_linear_mapping(). Mark all pages
>> PG_offline, such that they will definitely not get touched - e.g.,
>> when hibernating. When freeing memory, try to revert what we did.
>>
>> The original idea was discussed in:
>> https://lkml.kernel.org/r/48340e96-7e6b-736f-9e23-d3111b915b6e@redhat.com
>>
>> This is similar to CONFIG_DEBUG_PAGEALLOC handling on other
>> architectures, whereby only single pages are unmapped from the linear
>> mapping. Let's mimic what memory hot(un)plug would do with the linear
>> mapping.
>>
>> We now need MEMORY_HOTPLUG and CONTIG_ALLOC as dependencies.
>>
>> Simple test under QEMU TCG (10GB RAM, single NUMA node):
>>
>> sh-5.0# mount -t debugfs none /sys/kernel/debug/
>> sh-5.0# cat /sys/devices/system/memory/block_size_bytes
>> 40000000
>> sh-5.0# echo 0x40000000 > /sys/kernel/debug/powerpc/memtrace/enable
>> [ 71.052836][ T356] memtrace: Allocated trace memory on node 0 at 0x0000000080000000
>> sh-5.0# echo 0x80000000 > /sys/kernel/debug/powerpc/memtrace/enable
>> [ 75.424302][ T356] radix-mmu: Mapped 0x0000000080000000-0x00000000c0000000 with 64.0 KiB pages
>> [ 75.430549][ T356] memtrace: Freed trace memory back on node 0
>> [ 75.604520][ T356] memtrace: Allocated trace memory on node 0 at 0x0000000080000000
>> sh-5.0# echo 0x100000000 > /sys/kernel/debug/powerpc/memtrace/enable
>> [ 80.418835][ T356] radix-mmu: Mapped 0x0000000080000000-0x0000000100000000 with 64.0 KiB pages
>> [ 80.430493][ T356] memtrace: Freed trace memory back on node 0
>> [ 80.433882][ T356] memtrace: Failed to allocate trace memory on node 0
>> sh-5.0# echo 0x40000000 > /sys/kernel/debug/powerpc/memtrace/enable
>> [ 91.920158][ T356] memtrace: Allocated trace memory on node 0 at 0x0000000080000000
>>
>> Note 1: We currently won't be allocating from ZONE_MOVABLE - because our
>> pages are not movable. However, as we don't run with any memory
>> hot(un)plug mechanism around, we could make an exception to
>> increase the chance of allocations succeeding.
>>
>> Note 2: PG_reserved isn't sufficient. E.g., kernel_page_present() used
>> along PG_reserved in hibernation code will always return "true"
>> on powerpc, resulting in the pages getting touched. It's too
>> generic - e.g., indicates boot allocations.
>>
>> Note 3: For now, we keep using memory_block_size_bytes() as minimum
>> granularity. I'm not able to come up with a better guess (most
>> probably, doing it on a section basis could be possible).
>>
>> Suggested-by: Michal Hocko <mhocko@kernel.org>
>> Cc: Michael Ellerman <mpe@ellerman.id.au>
>> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
>> Cc: Paul Mackerras <paulus@samba.org>
>> Cc: Rashmica Gupta <rashmica.g@gmail.com>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Mike Rapoport <rppt@kernel.org>
>> Cc: Michal Hocko <mhocko@suse.com>
>> Cc: Oscar Salvador <osalvador@suse.de>
>> Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>
> Thanks! This looks like a move into the right direction. I cannot really
> judge implementation details because I am not familiar with the code.
> I have only one tiny concern:
> [...]
>> -/* called with device_hotplug_lock held */
>> -static bool memtrace_offline_pages(u32 nid, u64 start_pfn, u64 nr_pages)
>> +static u64 memtrace_alloc_node(u32 nid, u64 size)
>> {
>> - const unsigned long start = PFN_PHYS(start_pfn);
>> - const unsigned long size = PFN_PHYS(nr_pages);
>> + const unsigned long nr_pages = PHYS_PFN(size);
>> + unsigned long pfn, start_pfn;
>> + struct page *page;
>>
>> - if (walk_memory_blocks(start, size, NULL, check_memblock_online))
>> - return false;
>> -
>> - walk_memory_blocks(start, size, (void *)MEM_GOING_OFFLINE,
>> - change_memblock_state);
>> -
>> - if (offline_pages(start_pfn, nr_pages)) {
>> - walk_memory_blocks(start, size, (void *)MEM_ONLINE,
>> - change_memblock_state);
>> - return false;
>> - }
>> + /*
>> + * Trace memory needs to be aligned to the size, which is guaranteed
>> + * by alloc_contig_pages().
>> + */
>> + page = alloc_contig_pages(nr_pages, __GFP_THISNODE | __GFP_NOWARN,
>> + nid, NULL);
>
> __GFP_THISNODE without other modifiers looks suspicious. I suspect you
> want to enfore node locality and exclude movable zones by this. While
> this works it is an antipattern. I would rather use GFP_KERNEL |
> __GFP_THISNODE | __GFP_NOWARN to be more in line with other gfp usage.
Agreed GFP_KERNEL should be the right thing to do here.
>
> If for no other reasons we want to be able to work inside a normal
> compaction context (comparing to effectively GFP_NOIO which the above
> implies). Also this looks like a sleepable context.
>
Yes it is. Thanks!
--
Thanks,
David / dhildenb
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH v1 4/4] powernv/memtrace: don't abuse memory hot(un)plug infrastructure for memory allocations
2020-10-29 16:27 ` [PATCH v1 4/4] " David Hildenbrand
@ 2020-11-03 9:23 ` Michal Hocko
2020-11-03 9:29 ` David Hildenbrand
0 siblings, 1 reply; 5+ messages in thread
From: Michal Hocko @ 2020-11-03 9:23 UTC (permalink / raw)
To: David Hildenbrand
Cc: linux-kernel, linux-mm, linuxppc-dev, Michael Ellerman,
Benjamin Herrenschmidt, Paul Mackerras, Rashmica Gupta,
Andrew Morton, Mike Rapoport, Oscar Salvador, Wei Yang
On Thu 29-10-20 17:27:18, David Hildenbrand wrote:
> Let's use alloc_contig_pages() for allocating memory and remove the
> linear mapping manually via arch_remove_linear_mapping(). Mark all pages
> PG_offline, such that they will definitely not get touched - e.g.,
> when hibernating. When freeing memory, try to revert what we did.
>
> The original idea was discussed in:
> https://lkml.kernel.org/r/48340e96-7e6b-736f-9e23-d3111b915b6e@redhat.com
>
> This is similar to CONFIG_DEBUG_PAGEALLOC handling on other
> architectures, whereby only single pages are unmapped from the linear
> mapping. Let's mimic what memory hot(un)plug would do with the linear
> mapping.
>
> We now need MEMORY_HOTPLUG and CONTIG_ALLOC as dependencies.
>
> Simple test under QEMU TCG (10GB RAM, single NUMA node):
>
> sh-5.0# mount -t debugfs none /sys/kernel/debug/
> sh-5.0# cat /sys/devices/system/memory/block_size_bytes
> 40000000
> sh-5.0# echo 0x40000000 > /sys/kernel/debug/powerpc/memtrace/enable
> [ 71.052836][ T356] memtrace: Allocated trace memory on node 0 at 0x0000000080000000
> sh-5.0# echo 0x80000000 > /sys/kernel/debug/powerpc/memtrace/enable
> [ 75.424302][ T356] radix-mmu: Mapped 0x0000000080000000-0x00000000c0000000 with 64.0 KiB pages
> [ 75.430549][ T356] memtrace: Freed trace memory back on node 0
> [ 75.604520][ T356] memtrace: Allocated trace memory on node 0 at 0x0000000080000000
> sh-5.0# echo 0x100000000 > /sys/kernel/debug/powerpc/memtrace/enable
> [ 80.418835][ T356] radix-mmu: Mapped 0x0000000080000000-0x0000000100000000 with 64.0 KiB pages
> [ 80.430493][ T356] memtrace: Freed trace memory back on node 0
> [ 80.433882][ T356] memtrace: Failed to allocate trace memory on node 0
> sh-5.0# echo 0x40000000 > /sys/kernel/debug/powerpc/memtrace/enable
> [ 91.920158][ T356] memtrace: Allocated trace memory on node 0 at 0x0000000080000000
>
> Note 1: We currently won't be allocating from ZONE_MOVABLE - because our
> pages are not movable. However, as we don't run with any memory
> hot(un)plug mechanism around, we could make an exception to
> increase the chance of allocations succeeding.
>
> Note 2: PG_reserved isn't sufficient. E.g., kernel_page_present() used
> along PG_reserved in hibernation code will always return "true"
> on powerpc, resulting in the pages getting touched. It's too
> generic - e.g., indicates boot allocations.
>
> Note 3: For now, we keep using memory_block_size_bytes() as minimum
> granularity. I'm not able to come up with a better guess (most
> probably, doing it on a section basis could be possible).
>
> Suggested-by: Michal Hocko <mhocko@kernel.org>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Cc: Paul Mackerras <paulus@samba.org>
> Cc: Rashmica Gupta <rashmica.g@gmail.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Mike Rapoport <rppt@kernel.org>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Oscar Salvador <osalvador@suse.de>
> Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
Thanks! This looks like a move into the right direction. I cannot really
judge implementation details because I am not familiar with the code.
I have only one tiny concern:
[...]
> -/* called with device_hotplug_lock held */
> -static bool memtrace_offline_pages(u32 nid, u64 start_pfn, u64 nr_pages)
> +static u64 memtrace_alloc_node(u32 nid, u64 size)
> {
> - const unsigned long start = PFN_PHYS(start_pfn);
> - const unsigned long size = PFN_PHYS(nr_pages);
> + const unsigned long nr_pages = PHYS_PFN(size);
> + unsigned long pfn, start_pfn;
> + struct page *page;
>
> - if (walk_memory_blocks(start, size, NULL, check_memblock_online))
> - return false;
> -
> - walk_memory_blocks(start, size, (void *)MEM_GOING_OFFLINE,
> - change_memblock_state);
> -
> - if (offline_pages(start_pfn, nr_pages)) {
> - walk_memory_blocks(start, size, (void *)MEM_ONLINE,
> - change_memblock_state);
> - return false;
> - }
> + /*
> + * Trace memory needs to be aligned to the size, which is guaranteed
> + * by alloc_contig_pages().
> + */
> + page = alloc_contig_pages(nr_pages, __GFP_THISNODE | __GFP_NOWARN,
> + nid, NULL);
__GFP_THISNODE without other modifiers looks suspicious. I suspect you
want to enfore node locality and exclude movable zones by this. While
this works it is an antipattern. I would rather use GFP_KERNEL |
__GFP_THISNODE | __GFP_NOWARN to be more in line with other gfp usage.
If for no other reasons we want to be able to work inside a normal
compaction context (comparing to effectively GFP_NOIO which the above
implies). Also this looks like a sleepable context.
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 5+ messages in thread
* [PATCH v1 4/4] powernv/memtrace: don't abuse memory hot(un)plug infrastructure for memory allocations
2020-10-29 16:27 [PATCH v1 0/4] " David Hildenbrand
@ 2020-10-29 16:27 ` David Hildenbrand
2020-11-03 9:23 ` Michal Hocko
0 siblings, 1 reply; 5+ messages in thread
From: David Hildenbrand @ 2020-10-29 16:27 UTC (permalink / raw)
To: linux-kernel
Cc: linux-mm, linuxppc-dev, David Hildenbrand, Michal Hocko,
Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras,
Rashmica Gupta, Andrew Morton, Mike Rapoport, Michal Hocko,
Oscar Salvador, Wei Yang
Let's use alloc_contig_pages() for allocating memory and remove the
linear mapping manually via arch_remove_linear_mapping(). Mark all pages
PG_offline, such that they will definitely not get touched - e.g.,
when hibernating. When freeing memory, try to revert what we did.
The original idea was discussed in:
https://lkml.kernel.org/r/48340e96-7e6b-736f-9e23-d3111b915b6e@redhat.com
This is similar to CONFIG_DEBUG_PAGEALLOC handling on other
architectures, whereby only single pages are unmapped from the linear
mapping. Let's mimic what memory hot(un)plug would do with the linear
mapping.
We now need MEMORY_HOTPLUG and CONTIG_ALLOC as dependencies.
Simple test under QEMU TCG (10GB RAM, single NUMA node):
sh-5.0# mount -t debugfs none /sys/kernel/debug/
sh-5.0# cat /sys/devices/system/memory/block_size_bytes
40000000
sh-5.0# echo 0x40000000 > /sys/kernel/debug/powerpc/memtrace/enable
[ 71.052836][ T356] memtrace: Allocated trace memory on node 0 at 0x0000000080000000
sh-5.0# echo 0x80000000 > /sys/kernel/debug/powerpc/memtrace/enable
[ 75.424302][ T356] radix-mmu: Mapped 0x0000000080000000-0x00000000c0000000 with 64.0 KiB pages
[ 75.430549][ T356] memtrace: Freed trace memory back on node 0
[ 75.604520][ T356] memtrace: Allocated trace memory on node 0 at 0x0000000080000000
sh-5.0# echo 0x100000000 > /sys/kernel/debug/powerpc/memtrace/enable
[ 80.418835][ T356] radix-mmu: Mapped 0x0000000080000000-0x0000000100000000 with 64.0 KiB pages
[ 80.430493][ T356] memtrace: Freed trace memory back on node 0
[ 80.433882][ T356] memtrace: Failed to allocate trace memory on node 0
sh-5.0# echo 0x40000000 > /sys/kernel/debug/powerpc/memtrace/enable
[ 91.920158][ T356] memtrace: Allocated trace memory on node 0 at 0x0000000080000000
Note 1: We currently won't be allocating from ZONE_MOVABLE - because our
pages are not movable. However, as we don't run with any memory
hot(un)plug mechanism around, we could make an exception to
increase the chance of allocations succeeding.
Note 2: PG_reserved isn't sufficient. E.g., kernel_page_present() used
along PG_reserved in hibernation code will always return "true"
on powerpc, resulting in the pages getting touched. It's too
generic - e.g., indicates boot allocations.
Note 3: For now, we keep using memory_block_size_bytes() as minimum
granularity. I'm not able to come up with a better guess (most
probably, doing it on a section basis could be possible).
Suggested-by: Michal Hocko <mhocko@kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Rashmica Gupta <rashmica.g@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
arch/powerpc/platforms/powernv/Kconfig | 8 +-
arch/powerpc/platforms/powernv/memtrace.c | 134 ++++++++--------------
2 files changed, 49 insertions(+), 93 deletions(-)
diff --git a/arch/powerpc/platforms/powernv/Kconfig b/arch/powerpc/platforms/powernv/Kconfig
index 938803eab0ad..619b093a0657 100644
--- a/arch/powerpc/platforms/powernv/Kconfig
+++ b/arch/powerpc/platforms/powernv/Kconfig
@@ -27,11 +27,11 @@ config OPAL_PRD
recovery diagnostics on OpenPower machines
config PPC_MEMTRACE
- bool "Enable removal of RAM from kernel mappings for tracing"
- depends on PPC_POWERNV && MEMORY_HOTREMOVE
+ bool "Enable runtime allocation of RAM for tracing"
+ depends on PPC_POWERNV && MEMORY_HOTPLUG && CONTIG_ALLOC
help
- Enabling this option allows for the removal of memory (RAM)
- from the kernel mappings to be used for hardware tracing.
+ Enabling this option allows for runtime allocation of memory (RAM)
+ for hardware tracing.
config PPC_VAS
bool "IBM Virtual Accelerator Switchboard (VAS)"
diff --git a/arch/powerpc/platforms/powernv/memtrace.c b/arch/powerpc/platforms/powernv/memtrace.c
index 6828108486f8..8f47797a78c2 100644
--- a/arch/powerpc/platforms/powernv/memtrace.c
+++ b/arch/powerpc/platforms/powernv/memtrace.c
@@ -50,83 +50,50 @@ static const struct file_operations memtrace_fops = {
.open = simple_open,
};
-static int check_memblock_online(struct memory_block *mem, void *arg)
-{
- if (mem->state != MEM_ONLINE)
- return -1;
-
- return 0;
-}
-
-static int change_memblock_state(struct memory_block *mem, void *arg)
-{
- unsigned long state = (unsigned long)arg;
-
- mem->state = state;
-
- return 0;
-}
-
-/* called with device_hotplug_lock held */
-static bool memtrace_offline_pages(u32 nid, u64 start_pfn, u64 nr_pages)
+static u64 memtrace_alloc_node(u32 nid, u64 size)
{
- const unsigned long start = PFN_PHYS(start_pfn);
- const unsigned long size = PFN_PHYS(nr_pages);
+ const unsigned long nr_pages = PHYS_PFN(size);
+ unsigned long pfn, start_pfn;
+ struct page *page;
- if (walk_memory_blocks(start, size, NULL, check_memblock_online))
- return false;
-
- walk_memory_blocks(start, size, (void *)MEM_GOING_OFFLINE,
- change_memblock_state);
-
- if (offline_pages(start_pfn, nr_pages)) {
- walk_memory_blocks(start, size, (void *)MEM_ONLINE,
- change_memblock_state);
- return false;
- }
+ /*
+ * Trace memory needs to be aligned to the size, which is guaranteed
+ * by alloc_contig_pages().
+ */
+ page = alloc_contig_pages(nr_pages, __GFP_THISNODE | __GFP_NOWARN,
+ nid, NULL);
+ if (!page)
+ return 0;
- walk_memory_blocks(start, size, (void *)MEM_OFFLINE,
- change_memblock_state);
+ /*
+ * Set pages PageOffline(), to indicate that nobody (e.g., hibernation,
+ * dumping, ...) should be touching these pages.
+ */
+ start_pfn = page_to_pfn(page);
+ for (pfn = start_pfn; pfn < start_pfn + nr_pages; pfn++)
+ __SetPageOffline(pfn_to_page(pfn));
+ arch_remove_linear_mapping(PFN_PHYS(start_pfn), size);
- return true;
+ return PFN_PHYS(start_pfn);
}
-static u64 memtrace_alloc_node(u32 nid, u64 size)
+static int memtrace_free(int nid, u64 start, u64 size)
{
- u64 start_pfn, end_pfn, nr_pages, pfn;
- u64 base_pfn;
- u64 bytes = memory_block_size_bytes();
+ struct mhp_params params = { .pgprot = PAGE_KERNEL };
+ const unsigned long nr_pages = PHYS_PFN(size);
+ const unsigned long start_pfn = PHYS_PFN(start);
+ unsigned long pfn;
+ int ret;
- if (!node_spanned_pages(nid))
- return 0;
+ ret = arch_create_linear_mapping(nid, start, size, ¶ms);
+ if (ret)
+ return ret;
- start_pfn = node_start_pfn(nid);
- end_pfn = node_end_pfn(nid);
- nr_pages = size >> PAGE_SHIFT;
-
- /* Trace memory needs to be aligned to the size */
- end_pfn = round_down(end_pfn - nr_pages, nr_pages);
-
- lock_device_hotplug();
- for (base_pfn = end_pfn; base_pfn > start_pfn; base_pfn -= nr_pages) {
- if (memtrace_offline_pages(nid, base_pfn, nr_pages) == true) {
- /*
- * Remove memory in memory block size chunks so that
- * iomem resources are always split to the same size and
- * we never try to remove memory that spans two iomem
- * resources.
- */
- end_pfn = base_pfn + nr_pages;
- for (pfn = base_pfn; pfn < end_pfn; pfn += bytes>> PAGE_SHIFT) {
- __remove_memory(nid, pfn << PAGE_SHIFT, bytes);
- }
- unlock_device_hotplug();
- return base_pfn << PAGE_SHIFT;
- }
- }
- unlock_device_hotplug();
+ for (pfn = start_pfn; pfn < start_pfn + nr_pages; pfn++)
+ __ClearPageOffline(pfn_to_page(pfn));
+ free_contig_range(start_pfn, nr_pages);
return 0;
}
@@ -197,16 +164,11 @@ static int memtrace_init_debugfs(void)
return ret;
}
-static int online_mem_block(struct memory_block *mem, void *arg)
-{
- return device_online(&mem->dev);
-}
-
/*
- * Iterate through the chunks of memory we have removed from the kernel
- * and attempt to add them back to the kernel.
+ * Iterate through the chunks of memory we allocated and attempt to expose
+ * them back to the kernel.
*/
-static int memtrace_online(void)
+static int memtrace_free_regions(void)
{
int i, ret = 0;
struct memtrace_entry *ent;
@@ -214,7 +176,7 @@ static int memtrace_online(void)
for (i = memtrace_array_nr - 1; i >= 0; i--) {
ent = &memtrace_array[i];
- /* We have onlined this chunk previously */
+ /* We have freed this chunk previously */
if (ent->nid == NUMA_NO_NODE)
continue;
@@ -224,30 +186,25 @@ static int memtrace_online(void)
ent->mem = 0;
}
- if (add_memory(ent->nid, ent->start, ent->size, MHP_NONE)) {
- pr_err("Failed to add trace memory to node %d\n",
+ if (memtrace_free(ent->nid, ent->start, ent->size)) {
+ pr_err("Failed to free trace memory on node %d\n",
ent->nid);
ret += 1;
continue;
}
- lock_device_hotplug();
- walk_memory_blocks(ent->start, ent->size, NULL,
- online_mem_block);
- unlock_device_hotplug();
-
/*
- * Memory was added successfully so clean up references to it
- * so on reentry we can tell that this chunk was added.
+ * Memory was freed successfully so clean up references to it
+ * so on reentry we can tell that this chunk was freed.
*/
debugfs_remove_recursive(ent->dir);
- pr_info("Added trace memory back to node %d\n", ent->nid);
+ pr_info("Freed trace memory back on node %d\n", ent->nid);
ent->size = ent->start = ent->nid = NUMA_NO_NODE;
}
if (ret)
return ret;
- /* If all chunks of memory were added successfully, reset globals */
+ /* If all chunks of memory were freed successfully, reset globals */
kfree(memtrace_array);
memtrace_array = NULL;
memtrace_size = 0;
@@ -269,16 +226,15 @@ static int memtrace_enable_set(void *data, u64 val)
return -EINVAL;
}
- /* Re-add/online previously removed/offlined memory */
+ /* Free all previously allocated memory. */
if (memtrace_size) {
- if (memtrace_online())
+ if (memtrace_free_regions())
return -EAGAIN;
}
if (!val)
return 0;
- /* Offline and remove memory */
if (memtrace_init_regions_runtime(val))
return -EINVAL;
--
2.26.2
^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2020-11-05 10:48 UTC | newest]
Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <87o8kcttjp.fsf@mpe.ellerman.id.au>
2020-11-05 8:29 ` [PATCH v1 4/4] powernv/memtrace: don't abuse memory hot(un)plug infrastructure for memory allocations David Hildenbrand
2020-11-05 10:47 ` Michael Ellerman
2020-10-29 16:27 [PATCH v1 0/4] " David Hildenbrand
2020-10-29 16:27 ` [PATCH v1 4/4] " David Hildenbrand
2020-11-03 9:23 ` Michal Hocko
2020-11-03 9:29 ` David Hildenbrand
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox