From: "Huang, Ying" <ying.huang@intel.com>
To: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: akpm@linux-foundation.org, linux-mm@kvack.org, <21cnbao@gmail.com>, ...
Subject: Re: [RFC PATCH v2] mm: support multi-size THP numa balancing
In-Reply-To: <903bf13fc3e68b8dc1f256570d78b55b2dd9c96f.1710493587.git.baolin.wang@linux.alibaba.com>
	(Baolin Wang's message of "Fri, 15 Mar 2024 17:18:14 +0800")
References: <903bf13fc3e68b8dc1f256570d78b55b2dd9c96f.1710493587.git.baolin.wang@linux.alibaba.com>
Date: Mon, 18 Mar 2024 14:16:59 +0800
Message-ID: <871q88vzc4.fsf@yhuang6-desk2.ccr.corp.intel.com>
User-Agent: Gnus/5.13 (Gnus v5.13)
MIME-Version: 1.0
Content-Type: text/plain; charset=ascii

Baolin Wang <baolin.wang@linux.alibaba.com> writes:

> Now the anonymous page allocation already supports multi-size THP
> (mTHP), but numa balancing still prohibits mTHP migration even for an
> exclusive mapping, which is unreasonable. Thus let's support numa
> balancing for exclusive mTHP first.
>
> Allow scanning mTHP:
> Commit 859d4adc3415 ("mm: numa: do not trap faults on shared data
> section pages") skips NUMA migration of shared CoW pages to avoid
> migrating shared data segments. In addition, commit 80d47f5de5e3 ("mm:
> don't try to NUMA-migrate COW pages that have other uses") changed the
> check to page_count() to avoid migrating GUP pages, which also skips
> mTHP numa scanning. Theoretically, we can use folio_maybe_dma_pinned()
> to detect the GUP issue; although there is still a GUP race, that
> issue seems to have been resolved by commit 80d47f5de5e3. Meanwhile,
> use folio_estimated_sharers() to skip shared CoW pages, though it is
> not a precise sharer count. To check whether a folio is shared, we
> would ideally want to make sure every page of it is mapped by the same
> process, but doing that seems expensive; using the estimated mapcount
> appears to work when running the autonuma benchmark.
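For context, folio_estimated_sharers() only samples the mapcount of the
folio's first page, roughly the following paraphrase of the helper in
include/linux/mm.h, which is why it is only an estimate and can misjudge
sharing when only some pages of a large folio are mapped more than once:

static inline int folio_estimated_sharers(struct folio *folio)
{
	/* Only the mapcount of the first (head) page is consulted. */
	return page_mapcount(folio_page(folio, 0));
}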
>
> Allow migrating mTHP:
> As mentioned in the previous thread[1], large folios are more
> susceptible to false sharing issues, leading to pages ping-ponging
> back and forth during numa balancing, which is currently hard to
> resolve. Therefore, as a start for mTHP numa balancing, only exclusive
> mappings are allowed to perform numa migration, to avoid the false
> sharing issues with large folios. Similarly, use the estimated
> mapcount to skip shared mappings, which seems to work in most cases
> (?); we have already used folio_estimated_sharers() to skip shared
> mappings in migrate_misplaced_folio() for numa balancing, with no real
> complaints so far.

IIUC, folio_estimated_sharers() cannot identify multi-threaded
applications. If some mTHP is shared by multiple threads in one
process, how do we deal with that? For example, I think that we should
avoid migrating on the first fault for mTHP in
should_numa_migrate_memory().

More thoughts? Can we add a field in struct folio for mTHP to count
hint page faults from the same node?
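Something like the completely untested sketch below is what I have in
mind. Note that _numa_fault_nid and _numa_fault_count are invented
names, not existing struct folio fields, and locking is hand-waved:

/*
 * Hypothetical helper: defer mTHP migration until repeated NUMA hint
 * faults have come from the same node, so a large folio is not
 * migrated on its very first hint fault.
 */
static bool mthp_numa_migrate_allowed(struct folio *folio, int fault_nid)
{
	/* Small folios keep the existing behavior. */
	if (!folio_test_large(folio))
		return true;

	if (folio->_numa_fault_nid != fault_nid) {
		/* First fault, or the faulting node changed: restart. */
		folio->_numa_fault_nid = fault_nid;
		folio->_numa_fault_count = 1;
		return false;
	}

	/* Migrate only after repeated hint faults from the same node. */
	return ++folio->_numa_fault_count >= 2;
}

should_numa_migrate_memory() could then consult such a helper before
the existing cpupid heuristics.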
--
Best Regards,
Huang, Ying

> Performance data:
> Machine environment: 2 nodes, 128 cores Intel(R) Xeon(R) Platinum
> Base: 2024-3-15 mm-unstable branch
> Enable mTHP=64K to run autonuma-benchmark
>
> Base without the patch:
> numa01
> 222.97
> numa01_THREAD_ALLOC
> 115.78
> numa02
> 13.04
> numa02_SMT
> 14.69
>
> Base with the patch:
> numa01
> 125.36
> numa01_THREAD_ALLOC
> 44.58
> numa02
> 9.22
> numa02_SMT
> 7.46
>
> [1] https://lore.kernel.org/all/20231117100745.fnpijbk4xgmals3k@techsingularity.net/
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> ---
> Changes from RFC v1:
>  - Add some performance data per Huang, Ying.
>  - Allow mTHP scanning per David Hildenbrand.
>  - Avoid shared mappings for numa balancing to avoid false sharing.
>  - Expand the commit message.
> ---
>  mm/memory.c   | 9 +++++----
>  mm/mprotect.c | 3 ++-
>  2 files changed, 7 insertions(+), 5 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index f2bc6dd15eb8..b9d5d88c5a76 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -5059,7 +5059,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
>  	int last_cpupid;
>  	int target_nid;
>  	pte_t pte, old_pte;
> -	int flags = 0;
> +	int flags = 0, nr_pages = 0;
>
>  	/*
>  	 * The pte cannot be used safely until we verify, while holding the page
> @@ -5089,8 +5089,8 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
>  	if (!folio || folio_is_zone_device(folio))
>  		goto out_map;
>
> -	/* TODO: handle PTE-mapped THP */
> -	if (folio_test_large(folio))
> +	/* Avoid large folio false sharing */
> +	if (folio_test_large(folio) && folio_estimated_sharers(folio) > 1)
>  		goto out_map;
>
>  	/*
> @@ -5112,6 +5112,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
>  		flags |= TNF_SHARED;
>
>  	nid = folio_nid(folio);
> +	nr_pages = folio_nr_pages(folio);
>  	/*
>  	 * For memory tiering mode, cpupid of slow memory page is used
>  	 * to record page access time. So use default value.
> @@ -5148,7 +5149,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
>
>  out:
>  	if (nid != NUMA_NO_NODE)
> -		task_numa_fault(last_cpupid, nid, 1, flags);
> +		task_numa_fault(last_cpupid, nid, nr_pages, flags);
>  	return 0;
>  out_map:
>  	/*
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index f8a4544b4601..f0b9c974aaae 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -129,7 +129,8 @@ static long change_pte_range(struct mmu_gather *tlb,
>
>  		/* Also skip shared copy-on-write pages */
>  		if (is_cow_mapping(vma->vm_flags) &&
> -		    folio_ref_count(folio) != 1)
> +		    (folio_maybe_dma_pinned(folio) ||
> +		     folio_estimated_sharers(folio) > 1))
>  			continue;
>
>  		/*