Date: Thu, 21 Oct 2021 15:32:06 +0800
From: Feng Tang
To: "Aneesh Kumar K.V"
Cc: linux-mm@kvack.org, akpm@linux-foundation.org, Ben Widawsky,
 Dave Hansen, Michal Hocko, Andrea Arcangeli, Mel Gorman, Mike Kravetz,
 Randy Dunlap, Vlastimil Babka, Andi Kleen, Dan Williams, Huang Ying,
 linux-api@vger.kernel.org
Subject: Re: [RFC PATCH v2 2/3] mm/mempolicy: add set_mempolicy_home_node syscall
Message-ID: <20211021073206.GA20861@shbuild999.sh.intel.com>
References: <20211020092453.179929-1-aneesh.kumar@linux.ibm.com>
 <20211020092453.179929-2-aneesh.kumar@linux.ibm.com>
In-Reply-To: <20211020092453.179929-2-aneesh.kumar@linux.ibm.com>

Hi Aneesh,

On Wed, Oct 20, 2021 at 02:54:52PM +0530, Aneesh Kumar K.V wrote:
> This syscall can be used to set a home node for the MPOL_BIND
> and MPOL_PREFERRED_MANY memory policy. Users should use this
> syscall after setting up a memory policy for the specified range
> as shown below.
>
>   mbind(p, nr_pages * page_size, MPOL_BIND, new_nodes->maskp,
>         new_nodes->size + 1, 0);
>   sys_set_mempolicy_home_node((unsigned long)p, nr_pages * page_size,
>                               home_node, 0);
>
> The syscall allows specifying a home node/preferred node from which kernel
> will fulfill memory allocation requests first.
>
> For address range with MPOL_BIND memory policy, if nodemask specifies more
> than one node, page allocations will come from the node in the nodemask
> with sufficient free memory that is closest to the home node/preferred node.
>
> For MPOL_PREFERRED_MANY if the nodemask specifies more than one node,
> page allocation will come from the node in the nodemask with sufficient
> free memory that is closest to the home node/preferred node. If there is
> not enough memory in all the nodes specified in the nodemask, the allocation
> will be attempted from the closest numa node to the home node in the system.

I can understand the requirement for MPOL_BIND. For MPOL_PREFERRED_MANY,
it provides 3 levels of preference:

	home node --> preferred nodes --> all nodes

Are there any real use cases for this? For a platform which may have
3 types of memory (HBM, DRAM, PMEM), this may be useful.

> This helps applications to hint at a memory allocation preference node
> and fallback to _only_ a set of nodes if the memory is not available
> on the preferred node. Fallback allocation is attempted from the node which is
> nearest to the preferred node.
>
> This helps applications to have control on memory allocation numa nodes and
> avoids default fallback to slow memory NUMA nodes.
> For example a system with
> NUMA nodes 1,2 and 3 with DRAM memory and 10, 11 and 12 of slow memory
>
>  new_nodes = numa_bitmask_alloc(nr_nodes);
>
>  numa_bitmask_setbit(new_nodes, 1);
>  numa_bitmask_setbit(new_nodes, 2);
>  numa_bitmask_setbit(new_nodes, 3);
>
>  p = mmap(NULL, nr_pages * page_size, protflag, mapflag, -1, 0);
>  mbind(p, nr_pages * page_size, MPOL_BIND, new_nodes->maskp, new_nodes->size + 1, 0);
>
>  sys_set_mempolicy_home_node(p, nr_pages * page_size, 2, 0);

This example uses 'mbind + sys_set_mempolicy_home_node'. Will the
'set_mempolicy + sys_set_mempolicy_home_node' case also be supported?

Thanks,
Feng

> This will allocate from nodes closer to node 2 and will make sure kernel will
> only allocate from nodes 1, 2 and 3. Memory will not be allocated from slow memory
> nodes 10, 11 and 12
>
> With MPOL_PREFERRED_MANY on the other hand will first try to allocate from the
> closest node to node 2 from the node list 1, 2 and 3. If those nodes don't have
> enough memory, kernel will allocate from slow memory node 10, 11 and 12 which
> ever is closer to node 2.

[SNIP]
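
To make the selection order discussed in this thread concrete, here is a
small user-space model in C. It is only a sketch and not kernel code: the
six-node topology (nodes 0-2 as DRAM, nodes 3-5 as slow memory), the
distance table, and the names pick_node(), closest_node(), POLICY_BIND and
POLICY_PREFERRED_MANY are all hypothetical and chosen for illustration. It
assumes the behaviour described in the patch text: pick the node in the
nodemask with enough free memory that is closest to the home node, and for
MPOL_PREFERRED_MANY fall back to the closest node in the whole system.

```c
#include <assert.h>
#include <limits.h>

#define NR_NODES  6                       /* hypothetical: 0-2 DRAM, 3-5 slow memory */
#define ALL_NODES ((1u << NR_NODES) - 1)  /* mask covering every node in the model */

enum policy { POLICY_BIND, POLICY_PREFERRED_MANY };

/* Hypothetical SLIT-style distance table; smaller means closer. */
static const int node_distance[NR_NODES][NR_NODES] = {
    { 10, 20, 30, 40, 50, 60 },
    { 20, 10, 25, 50, 40, 50 },
    { 30, 25, 10, 60, 50, 40 },
    { 40, 50, 60, 10, 20, 30 },
    { 50, 40, 50, 20, 10, 20 },
    { 60, 50, 40, 30, 20, 10 },
};

/* Closest node to 'home' that is in 'mask' and has at least 'need' free pages. */
static int closest_node(int home, unsigned mask, const long *free_pages, long need)
{
    int best = -1;
    int best_dist = INT_MAX;

    for (int n = 0; n < NR_NODES; n++) {
        if (!(mask & (1u << n)) || free_pages[n] < need)
            continue;
        if (node_distance[home][n] < best_dist) {
            best = n;
            best_dist = node_distance[home][n];
        }
    }
    return best;
}

/*
 * Model of the preference order: home node, then the rest of the nodemask,
 * then (POLICY_PREFERRED_MANY only) every node in the system.
 * Returns the chosen node, or -1 if the allocation cannot be satisfied.
 */
int pick_node(enum policy pol, int home, unsigned mask,
              const long *free_pages, long need)
{
    assert(home >= 0 && home < NR_NODES);

    int n = closest_node(home, mask, free_pages, need);
    if (n < 0 && pol == POLICY_PREFERRED_MANY)
        n = closest_node(home, ALL_NODES, free_pages, need);
    return n;
}
```

In this model, with home node 1 and a nodemask of nodes 0-2, exhausting all
three DRAM nodes makes the POLICY_BIND case fail (-1), while the
POLICY_PREFERRED_MANY case falls through to slow node 4, the closest
remaining node to the home node in the assumed distance table.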