From: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org, Ben Widawsky <ben.widawsky@intel.com>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	Feng Tang <feng.tang@intel.com>, Michal Hocko <mhocko@kernel.org>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Mel Gorman <mgorman@techsingularity.net>,
	Mike Kravetz <mike.kravetz@oracle.com>,
	Randy Dunlap <rdunlap@infradead.org>,
	Vlastimil Babka <vbabka@suse.cz>, Andi Kleen <ak@linux.intel.com>,
	Dan Williams <dan.j.williams@intel.com>,
	Huang Ying <ying.huang@intel.com>,
	linux-api@vger.kernel.org
Subject: Re: [PATCH v5 2/3] mm/mempolicy: add set_mempolicy_home_node syscall
Date: Tue, 30 Nov 2021 14:29:02 +0530	[thread overview]
Message-ID: <87wnkqaujt.fsf@linux.ibm.com> (raw)
In-Reply-To: <20211129140215.11b7cf9f1034a7fe7017768c@linux-foundation.org>

Andrew Morton <akpm@linux-foundation.org> writes:

> On Tue, 16 Nov 2021 12:12:37 +0530 "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> wrote:
>
>> This syscall can be used to set a home node for the MPOL_BIND
>> and MPOL_PREFERRED_MANY memory policies. Users should use this
>> syscall after setting up a memory policy for the specified range,
>> as shown below.
>> 
>> mbind(p, nr_pages * page_size, MPOL_BIND, new_nodes->maskp,
>> 	    new_nodes->size + 1, 0);
>> sys_set_mempolicy_home_node((unsigned long)p, nr_pages * page_size,
>> 				  home_node, 0);
>> 
>> The syscall allows specifying a home node/preferred node from which the
>> kernel will fulfill memory allocation requests first.
>> 
>> For an address range with the MPOL_BIND memory policy, if the nodemask
>> specifies more than one node, page allocations will come from the node in
>> the nodemask with sufficient free memory that is closest to the home
>> node/preferred node.
>> 
>> For MPOL_PREFERRED_MANY, if the nodemask specifies more than one node,
>> page allocations will come from the node in the nodemask with sufficient
>> free memory that is closest to the home node/preferred node. If there is
>> not enough memory on any of the nodes specified in the nodemask, the
>> allocation will be attempted from the NUMA node closest to the home node.
>> 
>> This lets applications hint at a preferred node for memory allocation
>> while falling back to _only_ a set of nodes if memory is not available
>> on the preferred node. Fallback allocation is attempted from the node
>> nearest to the preferred node.
>> 
>> This gives applications control over the NUMA nodes used for memory
>> allocation and avoids the default fallback to slow-memory NUMA nodes.
>> For example, consider a system with DRAM on NUMA nodes 1, 2 and 3 and
>> slow memory on nodes 10, 11 and 12:
>> 
>>  new_nodes = numa_bitmask_alloc(nr_nodes);
>> 
>>  numa_bitmask_setbit(new_nodes, 1);
>>  numa_bitmask_setbit(new_nodes, 2);
>>  numa_bitmask_setbit(new_nodes, 3);
>> 
>>  p = mmap(NULL, nr_pages * page_size, protflag, mapflag, -1, 0);
>>  mbind(p, nr_pages * page_size, MPOL_BIND, new_nodes->maskp,  new_nodes->size + 1, 0);
>> 
>>  sys_set_mempolicy_home_node(p, nr_pages * page_size, 2, 0);
>> 
>> This will allocate from nodes closer to node 2 and ensure that the
>> kernel only allocates from nodes 1, 2 and 3. Memory will not be
>> allocated from the slow-memory nodes 10, 11 and 12.
>> 
>> With MPOL_PREFERRED_MANY, on the other hand, the kernel will first try
>> to allocate from the node closest to node 2 among nodes 1, 2 and 3. If
>> those nodes don't have enough memory, the kernel will allocate from
>> whichever of the slow-memory nodes 10, 11 and 12 is closest to node 2.
>> 
>> ...
>>
>> @@ -1477,6 +1478,60 @@ static long kernel_mbind(unsigned long start, unsigned long len,
>>  	return do_mbind(start, len, lmode, mode_flags, &nodes, flags);
>>  }
>>  
>> +SYSCALL_DEFINE4(set_mempolicy_home_node, unsigned long, start, unsigned long, len,
>> +		unsigned long, home_node, unsigned long, flags)
>> +{
>> +	struct mm_struct *mm = current->mm;
>> +	struct vm_area_struct *vma;
>> +	struct mempolicy *new;
>> +	unsigned long vmstart;
>> +	unsigned long vmend;
>> +	unsigned long end;
>> +	int err = -ENOENT;
>> +
>> +	if (start & ~PAGE_MASK)
>> +		return -EINVAL;
>> +	/*
>> +	 * flags is used for future extension if any.
>> +	 */
>> +	if (flags != 0)
>> +		return -EINVAL;
>> +
>> +	if (!node_online(home_node))
>> +		return -EINVAL;
>
> What's the thinking here?  The node can later be offlined and the
> kernel takes no action to reset home nodes, so why not permit setting a
> presently-offline node as the home node?  Checking here seems rather
> arbitrary?

The node online check is needed to avoid accessing an uninitialised
pgdat structure. Such an access can result in the crash below:
cpu 0x0: Vector: 300 (Data Access) at [c00000000a693840]
    pc: c0000000004e9bac: __next_zones_zonelist+0xc/0xa0
    lr: c000000000558d54: __alloc_pages+0x474/0x540
    sp: c00000000a693ae0
   msr: 8000000000009033
   dar: 1508
 dsisr: 40000000
  current = 0xc0000000087f8380
  paca    = 0xc000000003130000   irqmask: 0x03   irq_happened: 0x01
    pid   = 1161, comm = test_mpol_prefe
Linux version 5.16.0-rc3-14872-gd6ef4ee28b4f-dirty (kvaneesh@ltc-boston8) (gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0, GNU ld (GNU Binutils for Ubuntu) 2.34) #505 SMP Mon Nov 29 22:16:49 CST 2021
enter ? for help
[link register   ] c000000000558d54 __alloc_pages+0x474/0x540
[c00000000a693ae0] c000000000558c68 __alloc_pages+0x388/0x540 (unreliable)
[c00000000a693b60] c00000000059299c alloc_pages_vma+0xcc/0x380
[c00000000a693bd0] c00000000052129c __handle_mm_fault+0xcec/0x1900
[c00000000a693cc0] c000000000522094 handle_mm_fault+0x1e4/0x4f0
[c00000000a693d20] c000000000087288 ___do_page_fault+0x2f8/0xc20
[c00000000a693de0] c000000000087e50 do_page_fault+0x60/0x130
[c00000000a693e10] c00000000000891c data_access_common_virt+0x19c/0x1f0
--- Exception: 300 (Data Access) at 000074931e429160
SP (7fffe8116a50) is in userspace
0:mon>

Now, IIUC, even after a node is marked offline via try_offline_node() we
should still be able to access the zonelist details through the pgdat
struct. I was not able to force a NUMA node offline in my test, even
after removing the memory assigned to it.

root@ubuntu-guest:/sys/devices/system/node/node2# numactl -H
available: 4 nodes (0-3)
node 0 cpus: 0 1
node 0 size: 4046 MB
node 0 free: 3362 MB
node 1 cpus: 2 3
node 1 size: 4090 MB
node 1 free: 3788 MB
node 2 cpus:
node 2 size: 0 MB
node 2 free: 0 MB
node 3 cpus: 6 7
node 3 size: 4063 MB
node 3 free: 3832 MB
node distances:
node   0   1   2   3 
  0:  10  11  222  33 
  1:  44  10  55  66 
  2:  77  88  10  99 
  3:  101  121  132  10 
