From: "Huang, Ying" <ying.huang@intel.com>
To: Zi Yan
Cc: Dave Hansen, Yang Shi, Michal Hocko, Wei Xu, David Rientjes,
 Dan Williams, David Hildenbrand, osalvador
Subject: Re: [PATCH -V8 02/10] mm/numa: automatically generate node migration order
References: <20210618061537.434999-1-ying.huang@intel.com>
 <20210618061537.434999-3-ying.huang@intel.com>
 <79397FE3-4B08-4DE5-8468-C5CAE36A3E39@nvidia.com>
 <87v96anu6o.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <2AA3D792-7F14-4297-8EDD-3B5A7B31AECA@nvidia.com>
Date: Tue, 22 Jun 2021 09:14:29 +0800
In-Reply-To: <2AA3D792-7F14-4297-8EDD-3B5A7B31AECA@nvidia.com> (Zi Yan's message of "Mon, 21 Jun 2021 10:50:14 -0400")
Message-ID: <87sg1an1je.fsf@yhuang6-desk2.ccr.corp.intel.com>
Zi Yan writes:

> On 19 Jun 2021, at 4:18, Huang, Ying wrote:
>
>> Zi Yan writes:
>>
>>> On 18 Jun 2021, at 2:15, Huang Ying wrote:

[snip]

>>>> +/*
>>>> + * When memory fills up on a node, memory contents can be
>>>> + * automatically migrated to another node instead of
>>>> + * discarded at reclaim.
>>>> + *
>>>> + * Establish a "migration path" which will start at nodes
>>>> + * with CPUs and will follow the priorities used to build the
>>>> + * page allocator zonelists.
>>>> + *
>>>> + * The difference here is that cycles must be avoided.  If
>>>> + * node0 migrates to node1, then neither node1, nor anything
>>>> + * node1 migrates to can migrate to node0.
>>>> + *
>>>> + * This function can run simultaneously with readers of
>>>> + * node_demotion[].  However, it can not run simultaneously
>>>> + * with itself.  Exclusion is provided by memory hotplug events
>>>> + * being single-threaded.
>>>> + */
>>>> +static void __set_migration_target_nodes(void)
>>>> +{
>>>> +	nodemask_t next_pass = NODE_MASK_NONE;
>>>> +	nodemask_t this_pass = NODE_MASK_NONE;
>>>> +	nodemask_t used_targets = NODE_MASK_NONE;
>>>> +	int node;
>>>> +
>>>> +	/*
>>>> +	 * Avoid any oddities like cycles that could occur
>>>> +	 * from changes in the topology.  This will leave
>>>> +	 * a momentary gap when migration is disabled.
>>>> +	 */
>>>> +	disable_all_migrate_targets();
>>>> +
>>>> +	/*
>>>> +	 * Ensure that the "disable" is visible across the system.
>>>> +	 * Readers will see either a combination of before+disable
>>>> +	 * state or disable+after.  They will never see before and
>>>> +	 * after state together.
>>>> +	 *
>>>> +	 * The before+after state together might have cycles and
>>>> +	 * could cause readers to do things like loop until this
>>>> +	 * function finishes.  This ensures they can only see a
>>>> +	 * single "bad" read and would, for instance, only loop
>>>> +	 * once.
>>>> +	 */
>>>> +	smp_wmb();
>>>> +
>>>> +	/*
>>>> +	 * Allocations go close to CPUs, first.  Assume that
>>>> +	 * the migration path starts at the nodes with CPUs.
>>>> +	 */
>>>> +	next_pass = node_states[N_CPU];
>>>
>>> Is there a plan to allow users to change where the migration
>>> path starts? Or, going one step further, to provide an interface
>>> that lets users specify the demotion path, something like
>>> /sys/devices/system/node/node*/node_demotion?
>>
>> I don't think that's necessary, at least for now. Do you know of
>> any real-world use case for this?
>
> In our P9+Volta system, GPU memory is exposed as a NUMA node.
> For GPU workloads with data sizes greater than the GPU memory size,
> it would be very helpful to allow pages in GPU memory to be
> migrated/demoted to CPU memory. With your current assumption,
> GPU memory -> CPU memory demotion seems not to be possible, right?
> This should also apply to any system with device memory exposed as
> a NUMA node, where workloads run on the device and use CPU memory
> as a lower memory tier than the device memory.

Thanks a lot for your use case!
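To make the current assumption concrete, below is a minimal
user-space simulation of the pass-based construction described in the
quoted comment.  The three-node topology and the fallback[] table are
made up for illustration; the kernel derives both from
node_states[N_CPU] and the zonelist build order instead.

/*
 * Illustrative simulation only -- not the patch code.  Node 0 has
 * CPUs; nodes 1 and 2 are memory-only, with node 1 nearer to node 0.
 */
#include <stdio.h>

#define NR_NODES	3
#define NO_TARGET	-1

/* fallback[i][j]: node j is an allocation fallback of node i */
static const int fallback[NR_NODES][NR_NODES] = {
	{ 0, 1, 1 },	/* node 0 (CPUs) falls back to 1, then 2 */
	{ 0, 0, 1 },	/* node 1 falls back to 2 */
	{ 0, 0, 0 },	/* node 2 is the last tier */
};

int main(void)
{
	int demotion_target[NR_NODES];
	int used[NR_NODES] = { 0 };
	int this_pass[NR_NODES] = { 1, 0, 0 };	/* start at CPU nodes */
	int node, t;

	for (node = 0; node < NR_NODES; node++)
		demotion_target[node] = NO_TARGET;

	for (;;) {
		int next_pass[NR_NODES] = { 0 };
		int progress = 0;

		/*
		 * Nodes in the current pass can never become targets
		 * themselves: this is what breaks cycles.
		 */
		for (node = 0; node < NR_NODES; node++)
			if (this_pass[node])
				used[node] = 1;

		for (node = 0; node < NR_NODES; node++) {
			if (!this_pass[node])
				continue;
			/* Pick the nearest not-yet-used fallback. */
			for (t = 0; t < NR_NODES; t++) {
				if (fallback[node][t] && !used[t]) {
					demotion_target[node] = t;
					used[t] = 1;
					next_pass[t] = 1;
					progress = 1;
					break;
				}
			}
		}
		if (!progress)
			break;
		for (node = 0; node < NR_NODES; node++)
			this_pass[node] = next_pass[node];
	}

	for (node = 0; node < NR_NODES; node++)
		printf("node %d demotes to node %d\n",
		       node, demotion_target[node]);
	return 0;
}

This prints 0 -> 1, 1 -> 2, and 2 -> none.  Because every pass marks
its own nodes as used targets before picking, and the first pass
consists of the nodes with CPUs, a CPU node can never be chosen as a
demotion target.  That is exactly why GPU memory -> CPU memory
demotion is not possible with the current assumption.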
A user-specified demotion path appears to be one possible way to
satisfy your requirement, and I think it would be possible to enable
that on top of this patchset, although we still have no specific plan
to work on it for now.
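For illustration only, such an interface might look roughly like the
sketch below.  This is hypothetical, not part of this series: it
assumes the node_demotion[] array from this patch were made
non-static, and it omits the cycle rejection and the writer-vs-reader
synchronization that the comment quoted above says node_demotion[]
readers depend on.

/* Hypothetical sketch -- nothing below exists in this patchset. */
#include <linux/device.h>
#include <linux/kernel.h>
#include <linux/node.h>
#include <linux/numa.h>

extern int node_demotion[MAX_NUMNODES];

/* cat /sys/devices/system/node/nodeN/demotion_target */
static ssize_t demotion_target_show(struct device *dev,
				    struct device_attribute *attr, char *buf)
{
	return sysfs_emit(buf, "%d\n", node_demotion[dev->id]);
}

/* echo M > /sys/devices/system/node/nodeN/demotion_target */
static ssize_t demotion_target_store(struct device *dev,
				     struct device_attribute *attr,
				     const char *buf, size_t count)
{
	int target;

	if (kstrtoint(buf, 0, &target))
		return -EINVAL;
	if (target != NUMA_NO_NODE && !node_online(target))
		return -EINVAL;
	/* A real version must also reject cycles, e.g. 0 -> 1 -> 0. */
	node_demotion[dev->id] = target;
	return count;
}
static DEVICE_ATTR_RW(demotion_target);

The hard part would not be the attribute itself but keeping the
lock-free readers of node_demotion[] safe while user space rewrites
the path, similar to what __set_migration_target_nodes() does with
disable_all_migrate_targets() and smp_wmb().

Best Regards,
Huang, Ying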