linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* Re: [PATCH v1 4/4] powernv/memtrace: don't abuse memory hot(un)plug infrastructure for memory allocations
       [not found] <87o8kcttjp.fsf@mpe.ellerman.id.au>
@ 2020-11-05  8:29 ` David Hildenbrand
  2020-11-05 10:47   ` Michael Ellerman
  0 siblings, 1 reply; 5+ messages in thread
From: David Hildenbrand @ 2020-11-05  8:29 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: David Hildenbrand, linux-kernel, linux-mm, linuxppc-dev,
	Michal Hocko, Benjamin Herrenschmidt, Paul Mackerras,
	Rashmica Gupta, Andrew Morton, Mike Rapoport, Michal Hocko,
	Oscar Salvador, Wei Yang


> Am 05.11.2020 um 03:53 schrieb Michael Ellerman <mpe@ellerman.id.au>:
> 
> David Hildenbrand <david@redhat.com> writes:
>> Let's use alloc_contig_pages() for allocating memory and remove the
>> linear mapping manually via arch_remove_linear_mapping(). Mark all pages
>> PG_offline, such that they will definitely not get touched - e.g.,
>> when hibernating. When freeing memory, try to revert what we did.
>> The original idea was discussed in:
>> https://lkml.kernel.org/r/48340e96-7e6b-736f-9e23-d3111b915b6e@redhat.com
>> This is similar to CONFIG_DEBUG_PAGEALLOC handling on other
>> architectures, whereby only single pages are unmapped from the linear
>> mapping. Let's mimic what memory hot(un)plug would do with the linear
>> mapping.
>> We now need MEMORY_HOTPLUG and CONTIG_ALLOC as dependencies.
>> Simple test under QEMU TCG (10GB RAM, single NUMA node):
>> sh-5.0# mount -t debugfs none /sys/kernel/debug/
>> sh-5.0# cat /sys/devices/system/memory/block_size_bytes
>> 40000000
>> sh-5.0# echo 0x40000000 > /sys/kernel/debug/powerpc/memtrace/enable
>> [   71.052836][  T356] memtrace: Allocated trace memory on node 0 at 0x0000000080000000
>> sh-5.0# echo 0x80000000 > /sys/kernel/debug/powerpc/memtrace/enable
>> [   75.424302][  T356] radix-mmu: Mapped 0x0000000080000000-0x00000000c0000000 with 64.0 KiB pages
>> [   75.430549][  T356] memtrace: Freed trace memory back on node 0
>> [   75.604520][  T356] memtrace: Allocated trace memory on node 0 at 0x0000000080000000
>> sh-5.0# echo 0x100000000 > /sys/kernel/debug/powerpc/memtrace/enable
>> [   80.418835][  T356] radix-mmu: Mapped 0x0000000080000000-0x0000000100000000 with 64.0 KiB pages
>> [   80.430493][  T356] memtrace: Freed trace memory back on node 0
>> [   80.433882][  T356] memtrace: Failed to allocate trace memory on node 0
>> sh-5.0# echo 0x40000000 > /sys/kernel/debug/powerpc/memtrace/enable
>> [   91.920158][  T356] memtrace: Allocated trace memory on node 0 at 0x0000000080000000
> 
> I gave this a quick spin on a real machine, seems to work OK.
> 
> I don't have the actual memtrace tools setup to do an actual trace, will
> try and get someone to test that also.
> 
> One observation is that previously the memory was zeroed when enabling
> the memtrace, whereas now it's not.
> 
> eg, before:
> 
> # hexdump -C /sys/kernel/debug/powerpc/memtrace/00000000/trace 
> 00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
> *
> 10000000
> 
> whereas after:
> 
> # hexdump -C /sys/kernel/debug/powerpc/memtrace/00000000/trace
> 00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
> *
> 00000080  e0 fd 43 00 00 00 00 00  e0 fd 43 00 00 00 00 00  |..C.......C.....|
> 00000090  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
> *
> 00000830  98 bf 39 00 00 00 00 00  98 bf 39 00 00 00 00 00  |..9.......9.....|
> 00000840  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
> *
> 000008a0  b0 c8 47 00 00 00 00 00  b0 c8 47 00 00 00 00 00  |..G.......G.....|
> 000008b0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
> ...
> 0fffff70  78 53 49 7d 00 00 29 2e  88 00 92 41 01 00 49 39  |xSI}..)....A..I9|
> 0fffff80  b4 07 4a 7d 28 f8 00 7d  00 48 08 7c 0c 00 c2 40  |..J}(..}.H.|...@|
> 0fffff90  2d f9 40 7d f0 ff c2 40  b4 07 0a 7d 00 48 8a 7f  |-.@}...@...}.H..|
> 0fffffa0  70 fe 9e 41 cc ff ff 4b  00 00 00 60 00 00 00 60  |p..A...K...`...`|
> 0fffffb0  01 00 00 48 00 00 00 60  00 00 a3 2f 0c fd 9e 40  |...H...`.../...@|
> 0fffffc0  00 00 a2 3c 00 00 a5 e8  00 00 62 3c 00 00 63 e8  |...<......b<..c.|
> 0fffffd0  01 00 20 39 83 02 80 38  00 00 3c 99 01 00 00 48  |.. 9...8..<....H|
> 0fffffe0  00 00 00 60 e4 fc ff 4b  00 00 80 38 78 fb e3 7f  |...`...K...8x...|
> 0ffffff0  01 00 00 48 00 00 00 60  2c fe ff 4b 00 00 00 60  |...H...`,..K...`|
> 10000000
> 
> 
> That's a nice way for root to read kernel memory, so we should probably
> add a __GFP_ZERO or memset in there somewhere.

Thanks for catching that! Will have a look on Monday if alloc_contig_pages() already properly handled __GFP_ZERO so we can use it, otherwise I‘ll fix that.

I don‘t recall that memory hotunplug does any zeroing - that‘s why I didn‘t add any explicit zeroing. Could be you were just lucky in your experiment - I assume we‘ll leak kernel memory already.

Thank!

> cheers



^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH v1 4/4] powernv/memtrace: don't abuse memory hot(un)plug infrastructure for memory allocations
  2020-11-05  8:29 ` [PATCH v1 4/4] powernv/memtrace: don't abuse memory hot(un)plug infrastructure for memory allocations David Hildenbrand
@ 2020-11-05 10:47   ` Michael Ellerman
  0 siblings, 0 replies; 5+ messages in thread
From: Michael Ellerman @ 2020-11-05 10:47 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: David Hildenbrand, linux-kernel, linux-mm, linuxppc-dev,
	Michal Hocko, Benjamin Herrenschmidt, Paul Mackerras,
	Rashmica Gupta, Andrew Morton, Mike Rapoport, Michal Hocko,
	Oscar Salvador, Wei Yang

David Hildenbrand <david@redhat.com> writes:
>> Am 05.11.2020 um 03:53 schrieb Michael Ellerman <mpe@ellerman.id.au>:
>> 
>> David Hildenbrand <david@redhat.com> writes:
>>> Let's use alloc_contig_pages() for allocating memory and remove the
>>> linear mapping manually via arch_remove_linear_mapping(). Mark all pages
>>> PG_offline, such that they will definitely not get touched - e.g.,
>>> when hibernating. When freeing memory, try to revert what we did.
>>> The original idea was discussed in:
>>> https://lkml.kernel.org/r/48340e96-7e6b-736f-9e23-d3111b915b6e@redhat.com
>>> This is similar to CONFIG_DEBUG_PAGEALLOC handling on other
>>> architectures, whereby only single pages are unmapped from the linear
>>> mapping. Let's mimic what memory hot(un)plug would do with the linear
>>> mapping.
>>> We now need MEMORY_HOTPLUG and CONTIG_ALLOC as dependencies.
>>> Simple test under QEMU TCG (10GB RAM, single NUMA node):
>>> sh-5.0# mount -t debugfs none /sys/kernel/debug/
>>> sh-5.0# cat /sys/devices/system/memory/block_size_bytes
>>> 40000000
>>> sh-5.0# echo 0x40000000 > /sys/kernel/debug/powerpc/memtrace/enable
>>> [   71.052836][  T356] memtrace: Allocated trace memory on node 0 at 0x0000000080000000
>>> sh-5.0# echo 0x80000000 > /sys/kernel/debug/powerpc/memtrace/enable
>>> [   75.424302][  T356] radix-mmu: Mapped 0x0000000080000000-0x00000000c0000000 with 64.0 KiB pages
>>> [   75.430549][  T356] memtrace: Freed trace memory back on node 0
>>> [   75.604520][  T356] memtrace: Allocated trace memory on node 0 at 0x0000000080000000
>>> sh-5.0# echo 0x100000000 > /sys/kernel/debug/powerpc/memtrace/enable
>>> [   80.418835][  T356] radix-mmu: Mapped 0x0000000080000000-0x0000000100000000 with 64.0 KiB pages
>>> [   80.430493][  T356] memtrace: Freed trace memory back on node 0
>>> [   80.433882][  T356] memtrace: Failed to allocate trace memory on node 0
>>> sh-5.0# echo 0x40000000 > /sys/kernel/debug/powerpc/memtrace/enable
>>> [   91.920158][  T356] memtrace: Allocated trace memory on node 0 at 0x0000000080000000
>> 
>> I gave this a quick spin on a real machine, seems to work OK.
>> 
>> I don't have the actual memtrace tools setup to do an actual trace, will
>> try and get someone to test that also.
>> 
>> One observation is that previously the memory was zeroed when enabling
>> the memtrace, whereas now it's not.
>> 
>> eg, before:
>> 
>> # hexdump -C /sys/kernel/debug/powerpc/memtrace/00000000/trace 
>> 00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
>> *
>> 10000000
>> 
>> whereas after:
>> 
>> # hexdump -C /sys/kernel/debug/powerpc/memtrace/00000000/trace
>> 00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
>> *
>> 00000080  e0 fd 43 00 00 00 00 00  e0 fd 43 00 00 00 00 00  |..C.......C.....|
>> 00000090  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
>> *
>> 00000830  98 bf 39 00 00 00 00 00  98 bf 39 00 00 00 00 00  |..9.......9.....|
>> 00000840  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
>> *
>> 000008a0  b0 c8 47 00 00 00 00 00  b0 c8 47 00 00 00 00 00  |..G.......G.....|
>> 000008b0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
>> ...
>> 0fffff70  78 53 49 7d 00 00 29 2e  88 00 92 41 01 00 49 39  |xSI}..)....A..I9|
>> 0fffff80  b4 07 4a 7d 28 f8 00 7d  00 48 08 7c 0c 00 c2 40  |..J}(..}.H.|...@|
>> 0fffff90  2d f9 40 7d f0 ff c2 40  b4 07 0a 7d 00 48 8a 7f  |-.@}...@...}.H..|
>> 0fffffa0  70 fe 9e 41 cc ff ff 4b  00 00 00 60 00 00 00 60  |p..A...K...`...`|
>> 0fffffb0  01 00 00 48 00 00 00 60  00 00 a3 2f 0c fd 9e 40  |...H...`.../...@|
>> 0fffffc0  00 00 a2 3c 00 00 a5 e8  00 00 62 3c 00 00 63 e8  |...<......b<..c.|
>> 0fffffd0  01 00 20 39 83 02 80 38  00 00 3c 99 01 00 00 48  |.. 9...8..<....H|
>> 0fffffe0  00 00 00 60 e4 fc ff 4b  00 00 80 38 78 fb e3 7f  |...`...K...8x...|
>> 0ffffff0  01 00 00 48 00 00 00 60  2c fe ff 4b 00 00 00 60  |...H...`,..K...`|
>> 10000000
>> 
>> 
>> That's a nice way for root to read kernel memory, so we should probably
>> add a __GFP_ZERO or memset in there somewhere.
>
> Thanks for catching that! Will have a look on Monday if
> alloc_contig_pages() already properly handled __GFP_ZERO so we can use
> it, otherwise I‘ll fix that.

I had a quick look and didn't see it, but maybe it is in there somewhere.

> I don‘t recall that memory hotunplug does any zeroing - that‘s why I
> didn‘t add any explicit zeroing. Could be you were just lucky in your
> experiment - I assume we‘ll leak kernel memory already.

Hmm yeah good point. I did try it multiple times, and I never get
anything non-zero with the existing code.

I guess it's just that the new method is more likely to give us memory
that's already been used for something.

So I guess that's not actually a problem with your patch, it's just
exposing an existing issue.

cheers


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH v1 4/4] powernv/memtrace: don't abuse memory hot(un)plug infrastructure for memory allocations
  2020-11-03  9:23   ` Michal Hocko
@ 2020-11-03  9:29     ` David Hildenbrand
  0 siblings, 0 replies; 5+ messages in thread
From: David Hildenbrand @ 2020-11-03  9:29 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, linux-mm, linuxppc-dev, Michael Ellerman,
	Benjamin Herrenschmidt, Paul Mackerras, Rashmica Gupta,
	Andrew Morton, Mike Rapoport, Oscar Salvador, Wei Yang

On 03.11.20 10:23, Michal Hocko wrote:
> On Thu 29-10-20 17:27:18, David Hildenbrand wrote:
>> Let's use alloc_contig_pages() for allocating memory and remove the
>> linear mapping manually via arch_remove_linear_mapping(). Mark all pages
>> PG_offline, such that they will definitely not get touched - e.g.,
>> when hibernating. When freeing memory, try to revert what we did.
>>
>> The original idea was discussed in:
>>   https://lkml.kernel.org/r/48340e96-7e6b-736f-9e23-d3111b915b6e@redhat.com
>>
>> This is similar to CONFIG_DEBUG_PAGEALLOC handling on other
>> architectures, whereby only single pages are unmapped from the linear
>> mapping. Let's mimic what memory hot(un)plug would do with the linear
>> mapping.
>>
>> We now need MEMORY_HOTPLUG and CONTIG_ALLOC as dependencies.
>>
>> Simple test under QEMU TCG (10GB RAM, single NUMA node):
>>
>> sh-5.0# mount -t debugfs none /sys/kernel/debug/
>> sh-5.0# cat /sys/devices/system/memory/block_size_bytes
>> 40000000
>> sh-5.0# echo 0x40000000 > /sys/kernel/debug/powerpc/memtrace/enable
>> [   71.052836][  T356] memtrace: Allocated trace memory on node 0 at 0x0000000080000000
>> sh-5.0# echo 0x80000000 > /sys/kernel/debug/powerpc/memtrace/enable
>> [   75.424302][  T356] radix-mmu: Mapped 0x0000000080000000-0x00000000c0000000 with 64.0 KiB pages
>> [   75.430549][  T356] memtrace: Freed trace memory back on node 0
>> [   75.604520][  T356] memtrace: Allocated trace memory on node 0 at 0x0000000080000000
>> sh-5.0# echo 0x100000000 > /sys/kernel/debug/powerpc/memtrace/enable
>> [   80.418835][  T356] radix-mmu: Mapped 0x0000000080000000-0x0000000100000000 with 64.0 KiB pages
>> [   80.430493][  T356] memtrace: Freed trace memory back on node 0
>> [   80.433882][  T356] memtrace: Failed to allocate trace memory on node 0
>> sh-5.0# echo 0x40000000 > /sys/kernel/debug/powerpc/memtrace/enable
>> [   91.920158][  T356] memtrace: Allocated trace memory on node 0 at 0x0000000080000000
>>
>> Note 1: We currently won't be allocating from ZONE_MOVABLE - because our
>> 	pages are not movable. However, as we don't run with any memory
>> 	hot(un)plug mechanism around, we could make an exception to
>> 	increase the chance of allocations succeeding.
>>
>> Note 2: PG_reserved isn't sufficient. E.g., kernel_page_present() used
>> 	along PG_reserved in hibernation code will always return "true"
>> 	on powerpc, resulting in the pages getting touched. It's too
>> 	generic - e.g., indicates boot allocations.
>>
>> Note 3: For now, we keep using memory_block_size_bytes() as minimum
>> 	granularity. I'm not able to come up with a better guess (most
>> 	probably, doing it on a section basis could be possible).
>>
>> Suggested-by: Michal Hocko <mhocko@kernel.org>
>> Cc: Michael Ellerman <mpe@ellerman.id.au>
>> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
>> Cc: Paul Mackerras <paulus@samba.org>
>> Cc: Rashmica Gupta <rashmica.g@gmail.com>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Mike Rapoport <rppt@kernel.org>
>> Cc: Michal Hocko <mhocko@suse.com>
>> Cc: Oscar Salvador <osalvador@suse.de>
>> Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
> 
> Thanks! This looks like a move into the right direction. I cannot really
> judge implementation details because I am not familiar with the code.
> I have only one tiny concern:
> [...]
>> -/* called with device_hotplug_lock held */
>> -static bool memtrace_offline_pages(u32 nid, u64 start_pfn, u64 nr_pages)
>> +static u64 memtrace_alloc_node(u32 nid, u64 size)
>>   {
>> -	const unsigned long start = PFN_PHYS(start_pfn);
>> -	const unsigned long size = PFN_PHYS(nr_pages);
>> +	const unsigned long nr_pages = PHYS_PFN(size);
>> +	unsigned long pfn, start_pfn;
>> +	struct page *page;
>>   
>> -	if (walk_memory_blocks(start, size, NULL, check_memblock_online))
>> -		return false;
>> -
>> -	walk_memory_blocks(start, size, (void *)MEM_GOING_OFFLINE,
>> -			   change_memblock_state);
>> -
>> -	if (offline_pages(start_pfn, nr_pages)) {
>> -		walk_memory_blocks(start, size, (void *)MEM_ONLINE,
>> -				   change_memblock_state);
>> -		return false;
>> -	}
>> +	/*
>> +	 * Trace memory needs to be aligned to the size, which is guaranteed
>> +	 * by alloc_contig_pages().
>> +	 */
>> +	page = alloc_contig_pages(nr_pages, __GFP_THISNODE | __GFP_NOWARN,
>> +				  nid, NULL);
> 
> __GFP_THISNODE without other modifiers looks suspicious. I suspect you
> want to enfore node locality and exclude movable zones by this. While
> this works it is an antipattern. I would rather use GFP_KERNEL |
> __GFP_THISNODE | __GFP_NOWARN to be more in line with other gfp usage.

Agreed GFP_KERNEL should be the right thing to do here.

> 
> If for no other reasons we want to be able to work inside a normal
> compaction context (comparing to effectively GFP_NOIO which the above
> implies). Also this looks like a sleepable context.
> 

Yes it is. Thanks!

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH v1 4/4] powernv/memtrace: don't abuse memory hot(un)plug infrastructure for memory allocations
  2020-10-29 16:27 ` [PATCH v1 4/4] " David Hildenbrand
@ 2020-11-03  9:23   ` Michal Hocko
  2020-11-03  9:29     ` David Hildenbrand
  0 siblings, 1 reply; 5+ messages in thread
From: Michal Hocko @ 2020-11-03  9:23 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, linuxppc-dev, Michael Ellerman,
	Benjamin Herrenschmidt, Paul Mackerras, Rashmica Gupta,
	Andrew Morton, Mike Rapoport, Oscar Salvador, Wei Yang

On Thu 29-10-20 17:27:18, David Hildenbrand wrote:
> Let's use alloc_contig_pages() for allocating memory and remove the
> linear mapping manually via arch_remove_linear_mapping(). Mark all pages
> PG_offline, such that they will definitely not get touched - e.g.,
> when hibernating. When freeing memory, try to revert what we did.
> 
> The original idea was discussed in:
>  https://lkml.kernel.org/r/48340e96-7e6b-736f-9e23-d3111b915b6e@redhat.com
> 
> This is similar to CONFIG_DEBUG_PAGEALLOC handling on other
> architectures, whereby only single pages are unmapped from the linear
> mapping. Let's mimic what memory hot(un)plug would do with the linear
> mapping.
> 
> We now need MEMORY_HOTPLUG and CONTIG_ALLOC as dependencies.
> 
> Simple test under QEMU TCG (10GB RAM, single NUMA node):
> 
> sh-5.0# mount -t debugfs none /sys/kernel/debug/
> sh-5.0# cat /sys/devices/system/memory/block_size_bytes
> 40000000
> sh-5.0# echo 0x40000000 > /sys/kernel/debug/powerpc/memtrace/enable
> [   71.052836][  T356] memtrace: Allocated trace memory on node 0 at 0x0000000080000000
> sh-5.0# echo 0x80000000 > /sys/kernel/debug/powerpc/memtrace/enable
> [   75.424302][  T356] radix-mmu: Mapped 0x0000000080000000-0x00000000c0000000 with 64.0 KiB pages
> [   75.430549][  T356] memtrace: Freed trace memory back on node 0
> [   75.604520][  T356] memtrace: Allocated trace memory on node 0 at 0x0000000080000000
> sh-5.0# echo 0x100000000 > /sys/kernel/debug/powerpc/memtrace/enable
> [   80.418835][  T356] radix-mmu: Mapped 0x0000000080000000-0x0000000100000000 with 64.0 KiB pages
> [   80.430493][  T356] memtrace: Freed trace memory back on node 0
> [   80.433882][  T356] memtrace: Failed to allocate trace memory on node 0
> sh-5.0# echo 0x40000000 > /sys/kernel/debug/powerpc/memtrace/enable
> [   91.920158][  T356] memtrace: Allocated trace memory on node 0 at 0x0000000080000000
> 
> Note 1: We currently won't be allocating from ZONE_MOVABLE - because our
> 	pages are not movable. However, as we don't run with any memory
> 	hot(un)plug mechanism around, we could make an exception to
> 	increase the chance of allocations succeeding.
> 
> Note 2: PG_reserved isn't sufficient. E.g., kernel_page_present() used
> 	along PG_reserved in hibernation code will always return "true"
> 	on powerpc, resulting in the pages getting touched. It's too
> 	generic - e.g., indicates boot allocations.
> 
> Note 3: For now, we keep using memory_block_size_bytes() as minimum
> 	granularity. I'm not able to come up with a better guess (most
> 	probably, doing it on a section basis could be possible).
> 
> Suggested-by: Michal Hocko <mhocko@kernel.org>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Cc: Paul Mackerras <paulus@samba.org>
> Cc: Rashmica Gupta <rashmica.g@gmail.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Mike Rapoport <rppt@kernel.org>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Oscar Salvador <osalvador@suse.de>
> Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>

Thanks! This looks like a move into the right direction. I cannot really
judge implementation details because I am not familiar with the code.
I have only one tiny concern:
[...]
> -/* called with device_hotplug_lock held */
> -static bool memtrace_offline_pages(u32 nid, u64 start_pfn, u64 nr_pages)
> +static u64 memtrace_alloc_node(u32 nid, u64 size)
>  {
> -	const unsigned long start = PFN_PHYS(start_pfn);
> -	const unsigned long size = PFN_PHYS(nr_pages);
> +	const unsigned long nr_pages = PHYS_PFN(size);
> +	unsigned long pfn, start_pfn;
> +	struct page *page;
>  
> -	if (walk_memory_blocks(start, size, NULL, check_memblock_online))
> -		return false;
> -
> -	walk_memory_blocks(start, size, (void *)MEM_GOING_OFFLINE,
> -			   change_memblock_state);
> -
> -	if (offline_pages(start_pfn, nr_pages)) {
> -		walk_memory_blocks(start, size, (void *)MEM_ONLINE,
> -				   change_memblock_state);
> -		return false;
> -	}
> +	/*
> +	 * Trace memory needs to be aligned to the size, which is guaranteed
> +	 * by alloc_contig_pages().
> +	 */
> +	page = alloc_contig_pages(nr_pages, __GFP_THISNODE | __GFP_NOWARN,
> +				  nid, NULL);

__GFP_THISNODE without other modifiers looks suspicious. I suspect you
want to enfore node locality and exclude movable zones by this. While
this works it is an antipattern. I would rather use GFP_KERNEL |
__GFP_THISNODE | __GFP_NOWARN to be more in line with other gfp usage.

If for no other reasons we want to be able to work inside a normal
compaction context (comparing to effectively GFP_NOIO which the above
implies). Also this looks like a sleepable context.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 5+ messages in thread

* [PATCH v1 4/4] powernv/memtrace: don't abuse memory hot(un)plug infrastructure for memory allocations
  2020-10-29 16:27 [PATCH v1 0/4] " David Hildenbrand
@ 2020-10-29 16:27 ` David Hildenbrand
  2020-11-03  9:23   ` Michal Hocko
  0 siblings, 1 reply; 5+ messages in thread
From: David Hildenbrand @ 2020-10-29 16:27 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, linuxppc-dev, David Hildenbrand, Michal Hocko,
	Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras,
	Rashmica Gupta, Andrew Morton, Mike Rapoport, Michal Hocko,
	Oscar Salvador, Wei Yang

Let's use alloc_contig_pages() for allocating memory and remove the
linear mapping manually via arch_remove_linear_mapping(). Mark all pages
PG_offline, such that they will definitely not get touched - e.g.,
when hibernating. When freeing memory, try to revert what we did.

The original idea was discussed in:
 https://lkml.kernel.org/r/48340e96-7e6b-736f-9e23-d3111b915b6e@redhat.com

This is similar to CONFIG_DEBUG_PAGEALLOC handling on other
architectures, whereby only single pages are unmapped from the linear
mapping. Let's mimic what memory hot(un)plug would do with the linear
mapping.

We now need MEMORY_HOTPLUG and CONTIG_ALLOC as dependencies.

Simple test under QEMU TCG (10GB RAM, single NUMA node):

sh-5.0# mount -t debugfs none /sys/kernel/debug/
sh-5.0# cat /sys/devices/system/memory/block_size_bytes
40000000
sh-5.0# echo 0x40000000 > /sys/kernel/debug/powerpc/memtrace/enable
[   71.052836][  T356] memtrace: Allocated trace memory on node 0 at 0x0000000080000000
sh-5.0# echo 0x80000000 > /sys/kernel/debug/powerpc/memtrace/enable
[   75.424302][  T356] radix-mmu: Mapped 0x0000000080000000-0x00000000c0000000 with 64.0 KiB pages
[   75.430549][  T356] memtrace: Freed trace memory back on node 0
[   75.604520][  T356] memtrace: Allocated trace memory on node 0 at 0x0000000080000000
sh-5.0# echo 0x100000000 > /sys/kernel/debug/powerpc/memtrace/enable
[   80.418835][  T356] radix-mmu: Mapped 0x0000000080000000-0x0000000100000000 with 64.0 KiB pages
[   80.430493][  T356] memtrace: Freed trace memory back on node 0
[   80.433882][  T356] memtrace: Failed to allocate trace memory on node 0
sh-5.0# echo 0x40000000 > /sys/kernel/debug/powerpc/memtrace/enable
[   91.920158][  T356] memtrace: Allocated trace memory on node 0 at 0x0000000080000000

Note 1: We currently won't be allocating from ZONE_MOVABLE - because our
	pages are not movable. However, as we don't run with any memory
	hot(un)plug mechanism around, we could make an exception to
	increase the chance of allocations succeeding.

Note 2: PG_reserved isn't sufficient. E.g., kernel_page_present() used
	along PG_reserved in hibernation code will always return "true"
	on powerpc, resulting in the pages getting touched. It's too
	generic - e.g., indicates boot allocations.

Note 3: For now, we keep using memory_block_size_bytes() as minimum
	granularity. I'm not able to come up with a better guess (most
	probably, doing it on a section basis could be possible).

Suggested-by: Michal Hocko <mhocko@kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Rashmica Gupta <rashmica.g@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 arch/powerpc/platforms/powernv/Kconfig    |   8 +-
 arch/powerpc/platforms/powernv/memtrace.c | 134 ++++++++--------------
 2 files changed, 49 insertions(+), 93 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/Kconfig b/arch/powerpc/platforms/powernv/Kconfig
index 938803eab0ad..619b093a0657 100644
--- a/arch/powerpc/platforms/powernv/Kconfig
+++ b/arch/powerpc/platforms/powernv/Kconfig
@@ -27,11 +27,11 @@ config OPAL_PRD
 	  recovery diagnostics on OpenPower machines
 
 config PPC_MEMTRACE
-	bool "Enable removal of RAM from kernel mappings for tracing"
-	depends on PPC_POWERNV && MEMORY_HOTREMOVE
+	bool "Enable runtime allocation of RAM for tracing"
+	depends on PPC_POWERNV && MEMORY_HOTPLUG && CONTIG_ALLOC
 	help
-	  Enabling this option allows for the removal of memory (RAM)
-	  from the kernel mappings to be used for hardware tracing.
+	  Enabling this option allows for runtime allocation of memory (RAM)
+	  for hardware tracing.
 
 config PPC_VAS
 	bool "IBM Virtual Accelerator Switchboard (VAS)"
diff --git a/arch/powerpc/platforms/powernv/memtrace.c b/arch/powerpc/platforms/powernv/memtrace.c
index 6828108486f8..8f47797a78c2 100644
--- a/arch/powerpc/platforms/powernv/memtrace.c
+++ b/arch/powerpc/platforms/powernv/memtrace.c
@@ -50,83 +50,50 @@ static const struct file_operations memtrace_fops = {
 	.open	= simple_open,
 };
 
-static int check_memblock_online(struct memory_block *mem, void *arg)
-{
-	if (mem->state != MEM_ONLINE)
-		return -1;
-
-	return 0;
-}
-
-static int change_memblock_state(struct memory_block *mem, void *arg)
-{
-	unsigned long state = (unsigned long)arg;
-
-	mem->state = state;
-
-	return 0;
-}
-
-/* called with device_hotplug_lock held */
-static bool memtrace_offline_pages(u32 nid, u64 start_pfn, u64 nr_pages)
+static u64 memtrace_alloc_node(u32 nid, u64 size)
 {
-	const unsigned long start = PFN_PHYS(start_pfn);
-	const unsigned long size = PFN_PHYS(nr_pages);
+	const unsigned long nr_pages = PHYS_PFN(size);
+	unsigned long pfn, start_pfn;
+	struct page *page;
 
-	if (walk_memory_blocks(start, size, NULL, check_memblock_online))
-		return false;
-
-	walk_memory_blocks(start, size, (void *)MEM_GOING_OFFLINE,
-			   change_memblock_state);
-
-	if (offline_pages(start_pfn, nr_pages)) {
-		walk_memory_blocks(start, size, (void *)MEM_ONLINE,
-				   change_memblock_state);
-		return false;
-	}
+	/*
+	 * Trace memory needs to be aligned to the size, which is guaranteed
+	 * by alloc_contig_pages().
+	 */
+	page = alloc_contig_pages(nr_pages, __GFP_THISNODE | __GFP_NOWARN,
+				  nid, NULL);
+	if (!page)
+		return 0;
 
-	walk_memory_blocks(start, size, (void *)MEM_OFFLINE,
-			   change_memblock_state);
+	/*
+	 * Set pages PageOffline(), to indicate that nobody (e.g., hibernation,
+	 * dumping, ...) should be touching these pages.
+	 */
+	start_pfn = page_to_pfn(page);
+	for (pfn = start_pfn; pfn < start_pfn + nr_pages; pfn++)
+		__SetPageOffline(pfn_to_page(pfn));
 
+	arch_remove_linear_mapping(PFN_PHYS(start_pfn), size);
 
-	return true;
+	return PFN_PHYS(start_pfn);
 }
 
-static u64 memtrace_alloc_node(u32 nid, u64 size)
+static int memtrace_free(int nid, u64 start, u64 size)
 {
-	u64 start_pfn, end_pfn, nr_pages, pfn;
-	u64 base_pfn;
-	u64 bytes = memory_block_size_bytes();
+	struct mhp_params params = { .pgprot = PAGE_KERNEL };
+	const unsigned long nr_pages = PHYS_PFN(size);
+	const unsigned long start_pfn = PHYS_PFN(start);
+	unsigned long pfn;
+	int ret;
 
-	if (!node_spanned_pages(nid))
-		return 0;
+	ret = arch_create_linear_mapping(nid, start, size, &params);
+	if (ret)
+		return ret;
 
-	start_pfn = node_start_pfn(nid);
-	end_pfn = node_end_pfn(nid);
-	nr_pages = size >> PAGE_SHIFT;
-
-	/* Trace memory needs to be aligned to the size */
-	end_pfn = round_down(end_pfn - nr_pages, nr_pages);
-
-	lock_device_hotplug();
-	for (base_pfn = end_pfn; base_pfn > start_pfn; base_pfn -= nr_pages) {
-		if (memtrace_offline_pages(nid, base_pfn, nr_pages) == true) {
-			/*
-			 * Remove memory in memory block size chunks so that
-			 * iomem resources are always split to the same size and
-			 * we never try to remove memory that spans two iomem
-			 * resources.
-			 */
-			end_pfn = base_pfn + nr_pages;
-			for (pfn = base_pfn; pfn < end_pfn; pfn += bytes>> PAGE_SHIFT) {
-				__remove_memory(nid, pfn << PAGE_SHIFT, bytes);
-			}
-			unlock_device_hotplug();
-			return base_pfn << PAGE_SHIFT;
-		}
-	}
-	unlock_device_hotplug();
+	for (pfn = start_pfn; pfn < start_pfn + nr_pages; pfn++)
+		__ClearPageOffline(pfn_to_page(pfn));
 
+	free_contig_range(start_pfn, nr_pages);
 	return 0;
 }
 
@@ -197,16 +164,11 @@ static int memtrace_init_debugfs(void)
 	return ret;
 }
 
-static int online_mem_block(struct memory_block *mem, void *arg)
-{
-	return device_online(&mem->dev);
-}
-
 /*
- * Iterate through the chunks of memory we have removed from the kernel
- * and attempt to add them back to the kernel.
+ * Iterate through the chunks of memory we allocated and attempt to expose
+ * them back to the kernel.
  */
-static int memtrace_online(void)
+static int memtrace_free_regions(void)
 {
 	int i, ret = 0;
 	struct memtrace_entry *ent;
@@ -214,7 +176,7 @@ static int memtrace_online(void)
 	for (i = memtrace_array_nr - 1; i >= 0; i--) {
 		ent = &memtrace_array[i];
 
-		/* We have onlined this chunk previously */
+		/* We have freed this chunk previously */
 		if (ent->nid == NUMA_NO_NODE)
 			continue;
 
@@ -224,30 +186,25 @@ static int memtrace_online(void)
 			ent->mem = 0;
 		}
 
-		if (add_memory(ent->nid, ent->start, ent->size, MHP_NONE)) {
-			pr_err("Failed to add trace memory to node %d\n",
+		if (memtrace_free(ent->nid, ent->start, ent->size)) {
+			pr_err("Failed to free trace memory on node %d\n",
 				ent->nid);
 			ret += 1;
 			continue;
 		}
 
-		lock_device_hotplug();
-		walk_memory_blocks(ent->start, ent->size, NULL,
-				   online_mem_block);
-		unlock_device_hotplug();
-
 		/*
-		 * Memory was added successfully so clean up references to it
-		 * so on reentry we can tell that this chunk was added.
+		 * Memory was freed successfully so clean up references to it
+		 * so on reentry we can tell that this chunk was freed.
 		 */
 		debugfs_remove_recursive(ent->dir);
-		pr_info("Added trace memory back to node %d\n", ent->nid);
+		pr_info("Freed trace memory back on node %d\n", ent->nid);
 		ent->size = ent->start = ent->nid = NUMA_NO_NODE;
 	}
 	if (ret)
 		return ret;
 
-	/* If all chunks of memory were added successfully, reset globals */
+	/* If all chunks of memory were freed successfully, reset globals */
 	kfree(memtrace_array);
 	memtrace_array = NULL;
 	memtrace_size = 0;
@@ -269,16 +226,15 @@ static int memtrace_enable_set(void *data, u64 val)
 		return -EINVAL;
 	}
 
-	/* Re-add/online previously removed/offlined memory */
+	/* Free all previously allocated memory. */
 	if (memtrace_size) {
-		if (memtrace_online())
+		if (memtrace_free_regions())
 			return -EAGAIN;
 	}
 
 	if (!val)
 		return 0;
 
-	/* Offline and remove memory */
 	if (memtrace_init_regions_runtime(val))
 		return -EINVAL;
 
-- 
2.26.2



^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2020-11-05 10:48 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <87o8kcttjp.fsf@mpe.ellerman.id.au>
2020-11-05  8:29 ` [PATCH v1 4/4] powernv/memtrace: don't abuse memory hot(un)plug infrastructure for memory allocations David Hildenbrand
2020-11-05 10:47   ` Michael Ellerman
2020-10-29 16:27 [PATCH v1 0/4] " David Hildenbrand
2020-10-29 16:27 ` [PATCH v1 4/4] " David Hildenbrand
2020-11-03  9:23   ` Michal Hocko
2020-11-03  9:29     ` David Hildenbrand

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox