From: "Huang, Ying" <ying.huang@intel.com>
To: Zi Yan
Cc: David Hildenbrand, Andrew Morton, Baolin Wang, Mel Gorman, linux-mm@kvack.org
Subject: Re: [RFC PATCH 3/3] mm/migrate: move common code to numa_migrate_check (was numa_migrate_prep)
In-Reply-To: (Zi Yan's message of "Mon, 22 Jul 2024 11:21:27 -0400")
References: <20240712024455.163543-1-zi.yan@sent.com> <20240712024455.163543-4-zi.yan@sent.com> <87zfqfw0yw.fsf@yhuang6-desk2.ccr.corp.intel.com> <735B3DEE-5C4A-43BD-B003-17F4B1F0DC98@nvidia.com> <87sew2jiyr.fsf@yhuang6-desk2.ccr.corp.intel.com> <1844A1CA-7120-4DB2-8F67-8F91BA5B18C6@nvidia.com>
Date: Tue, 23 Jul 2024 09:16:01 +0800
Message-ID: <87frs0kiwe.fsf@yhuang6-desk2.ccr.corp.intel.com>
Zi Yan writes:

> On 22 Jul 2024, at 10:01, Zi Yan wrote:
>
>> On 21 Jul 2024, at 21:47, Huang, Ying wrote:
>>
>>> Zi Yan writes:
>>>
>>>> On 18 Jul 2024, at 4:36, Huang, Ying wrote:
>>>>
>>>>> Zi Yan writes:
>>>>>
>>>>>> From: Zi Yan
>>>>>>
>>>>>> do_numa_page() and do_huge_pmd_numa_page() share a lot of common code.
>>>>>> To reduce redundancy, move common code to numa_migrate_prep() and rename
>>>>>> the function to numa_migrate_check() to reflect its functionality.
>>>>>>
>>>>>> There is some code difference between do_numa_page() and
>>>>>> do_huge_pmd_numa_page() before the code move:
>>>>>>
>>>>>> 1. do_huge_pmd_numa_page() did not check shared folios to set TNF_SHARED.
>>>>>> 2. do_huge_pmd_numa_page() did not check and skip zone device folios.
>>>>>>
>>>>>> Signed-off-by: Zi Yan
>>>>>> ---
>>>>>>  mm/huge_memory.c | 28 ++++++-----------
>>>>>>  mm/internal.h    |  5 +--
>>>>>>  mm/memory.c      | 81 +++++++++++++++++++++++-------------------
>>>>>>  3 files changed, 52 insertions(+), 62 deletions(-)
>>>>>>
>>>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>>>> index 8c11d6da4b36..66d67d13e0dc 100644
>>>>>> --- a/mm/huge_memory.c
>>>>>> +++ b/mm/huge_memory.c
>>>>>> @@ -1670,10 +1670,10 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>>>>>>  	pmd_t pmd;
>>>>>>  	struct folio *folio;
>>>>>>  	unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
>>>>>> -	int nid = NUMA_NO_NODE;
>>>>>> -	int target_nid, last_cpupid = (-1 & LAST_CPUPID_MASK);
>>>>>> +	int target_nid = NUMA_NO_NODE;
>>>>>> +	int last_cpupid = (-1 & LAST_CPUPID_MASK);
>>>>>>  	bool writable = false;
>>>>>> -	int flags = 0;
>>>>>> +	int flags = 0, nr_pages;
>>>>>>
>>>>>>  	vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
>>>>>>  	if (unlikely(!pmd_same(oldpmd, *vmf->pmd))) {
>>>>>> @@ -1693,21 +1693,13 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>>>>>>  	writable = true;
>>>>>>
>>>>>>  	folio = vm_normal_folio_pmd(vma, haddr, pmd);
>>>>>> -	if (!folio)
>>>>>> +	if (!folio || folio_is_zone_device(folio))
>>>>>
>>>>> This change appears unrelated. Can we put it in a separate patch?
>>>>>
>>>>> IIUC, this isn't necessary even in do_numa_page(), because
>>>>> change_pte_range() has already checked folio_is_zone_device(). But it
>>>>> doesn't hurt, either.
>>>>>
>>>>>>  		goto out_map;
>>>>>>
>>>>>> -	/* See similar comment in do_numa_page for explanation */
>>>>>> -	if (!writable)
>>>>>> -		flags |= TNF_NO_GROUP;
>>>>>> +	nr_pages = folio_nr_pages(folio);
>>>>>>
>>>>>> -	nid = folio_nid(folio);
>>>>>> -	/*
>>>>>> -	 * For memory tiering mode, cpupid of slow memory page is used
>>>>>> -	 * to record page access time. So use default value.
>>>>>> -	 */
>>>>>> -	if (folio_has_cpupid(folio))
>>>>>> -		last_cpupid = folio_last_cpupid(folio);
>>>>>> -	target_nid = numa_migrate_prep(folio, vmf, haddr, nid, &flags);
>>>>>> +	target_nid = numa_migrate_check(folio, vmf, haddr, writable,
>>>>>> +					&flags, &last_cpupid);
>>>>>>  	if (target_nid == NUMA_NO_NODE)
>>>>>>  		goto out_map;
>>>>>>  	if (migrate_misplaced_folio_prepare(folio, vma, target_nid)) {
>>>>>> @@ -1720,8 +1712,8 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>>>>>>
>>>>>>  	if (!migrate_misplaced_folio(folio, vma, target_nid)) {
>>>>>>  		flags |= TNF_MIGRATED;
>>>>>> -		nid = target_nid;
>>>>>>  	} else {
>>>>>> +		target_nid = NUMA_NO_NODE;
>>>>>>  		flags |= TNF_MIGRATE_FAIL;
>>>>>>  		vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
>>>>>>  		if (unlikely(!pmd_same(oldpmd, *vmf->pmd))) {
>>>>>> @@ -1732,8 +1724,8 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>>>>>>  	}
>>>>>>
>>>>>>  out:
>>>>>> -	if (nid != NUMA_NO_NODE)
>>>>>> -		task_numa_fault(last_cpupid, nid, HPAGE_PMD_NR, flags);
>>>>>> +	if (target_nid != NUMA_NO_NODE)
>>>>>> +		task_numa_fault(last_cpupid, target_nid, nr_pages, flags);
>>>>>
>>>>> This appears to be a behavior change. IIUC, there are 2 possible issues.
>>>>>
>>>>> 1) if migrate_misplaced_folio() fails, folio_nid() should be used as
>>>>> the nid. "target_nid" as a variable name here is confusing, because
>>>>> folio_nid() is what is needed in fact.
>>>>>
>>>>> 2) if !pmd_same(), task_numa_fault() should be skipped. The original
>>>>> code is buggy.
>>>>>
>>>>> Similar issues exist for do_numa_page().
>>>>>
>>>>> If my understanding is correct, we should implement a separate patch
>>>>> to fix 2) above. And that fix may need to be backported.
>>>>
>>>> Hmm, the original code seems OK after I checked the implementation.
>>>> There are two possible !pte_same()/!pmd_same() locations:
>>>>
>>>> 1) at the beginning of do_numa_page() and do_huge_pmd_numa_page(): the
>>>> faulted PTE/PMD changed before the folio could be checked, so
>>>> task_numa_fault() should not be called.
>>>
>>> Yes.
>>>
>>>> 2) when migrate_misplaced_folio() failed and the PTE/PMD changed, but
>>>> the folio has already been determined and checked: task_numa_fault()
>>>> should be called even if !pte_same()/!pmd_same().
>>>
>>> IIUC, if !pte_same()/!pmd_same(), the fault has been processed on
>>> another CPU. For example, do_numa_page()/do_huge_pmd_numa_page() has
>>> been called on another CPU and task_numa_fault() has been called for the
>>> PTE/PMD already.
>>
>> Hmm, this behavior at least dates back to 2015, to
>> commit 074c238177a7 ("mm: numa: slow PTE scan rate if migration failures occur").
>> So cc Mel.
>>
>> The code is
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/memory.c?id=074c238177a75f5e79af3b2cb6a84e54823ef950#n3102.
>> I have not checked older commits.
>>
>> I wonder how far back we should trace.
>
> OK, I found the commit where the task_numa_fault() policy settled:
> 8191acbd30c7 ("mm: numa: Sanitize task_numa_fault() callsites").
>
> It says:
>
> "So modify all three sites to always account; we did after all receive
> the fault; and always account to where the page is after migration,
> regardless of success."
>
> where the three call sites were: do_huge_pmd_numa_page(), do_numa_page(),
> and do_pmd_numa_page().
>
> The current code still follows what that commit log describes.

Per my understanding, the issue was introduced by commit b99a342d4f11
("NUMA balancing: reduce TLB flush via delaying mapping on hint page
fault").
Before that commit, the PTE was restored before migration, so
task_numa_fault() should be called on migration failure too. After it,
the PTE is restored only after a migration failure; if the PTE has been
changed by someone else in the meantime, that someone should already have
called task_numa_fault() if necessary, so we shouldn't call it again.

--
Best Regards,
Huang, Ying