From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <owner-linux-mm@kvack.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 0F3ADC67871
	for <linux-mm@archiver.kernel.org>; Thu, 27 Oct 2022 09:32:22 +0000 (UTC)
Received: by kanga.kvack.org (Postfix)
	id 30E138E0002; Thu, 27 Oct 2022 05:32:22 -0400 (EDT)
Received: by kanga.kvack.org (Postfix, from userid 40)
	id 2BDE08E0001; Thu, 27 Oct 2022 05:32:22 -0400 (EDT)
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id 15E9B8E0002; Thu, 27 Oct 2022 05:32:22 -0400 (EDT)
X-Delivered-To: linux-mm@kvack.org
Received: from relay.hostedemail.com (smtprelay0015.hostedemail.com [216.40.44.15])
	by kanga.kvack.org (Postfix) with ESMTP id 037228E0001
	for <linux-mm@kvack.org>; Thu, 27 Oct 2022 05:32:22 -0400 (EDT)
Received: from smtpin30.hostedemail.com (a10.router.float.18 [10.200.18.1])
	by unirelay05.hostedemail.com (Postfix) with ESMTP id BA73D406F6
	for <linux-mm@kvack.org>; Thu, 27 Oct 2022 09:32:21 +0000 (UTC)
X-FDA: 80066213682.30.5546B79
Received: from mga02.intel.com (mga02.intel.com [134.134.136.20])
	by imf19.hostedemail.com (Postfix) with ESMTP id 49CCC1A0006
	for <linux-mm@kvack.org>; Thu, 27 Oct 2022 09:32:20 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple;
  d=intel.com; i=@intel.com; q=dns/txt; s=Intel;
  t=1666863140; x=1698399140;
  h=from:to:cc:subject:references:date:in-reply-to:
   message-id:mime-version;
  bh=o45ZowjyZfFdbT4eNKTF4OINZYO4WatGCLArdaxRHBY=;
  b=IQyTqm82MpVmOOMCuqLDbxc6o5EoaAJuDPrCbM4sT0sEhGQob2RMm7Q4
   Hj7ZtMdKYnz/au3kw4S9uwyCgwx4mrYXc1CFO47NorOTcYYj0lIqbSUVj
   TobnXJwNfNRO3MBu1l9oXOdtHWpE1zcIvjw8I7Kznf8wivdixim9ctbx2
   Vm37iKJe2bdn/2li9T28nsyYFGxVG215R4Czy153UCkAMKzLqzrBsvuoM
   /ho/q54qcwxJsgWZk1Fkkapb5MNkXuk5X0LqdnYaDUataxzYA4DHO0DIk
   6Yi4fPdio79E7EULQv0PsR58f/GcMKoPbOJzpUFMdtuQP9Nrsu46nHTv2
   Q==;
X-IronPort-AV: E=McAfee;i="6500,9779,10512"; a="295586799"
X-IronPort-AV: E=Sophos;i="5.95,217,1661842800"; 
   d="scan'208";a="295586799"
Received: from fmsmga006.fm.intel.com ([10.253.24.20])
  by orsmga101.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 27 Oct 2022 02:32:18 -0700
X-IronPort-AV: E=McAfee;i="6500,9779,10512"; a="877518330"
X-IronPort-AV: E=Sophos;i="5.95,217,1661842800"; 
   d="scan'208";a="877518330"
Received: from yhuang6-desk2.sh.intel.com (HELO yhuang6-desk2.ccr.corp.intel.com) ([10.238.208.55])
  by fmsmga006-auth.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 27 Oct 2022 02:32:15 -0700
From: "Huang, Ying" <ying.huang@intel.com>
To: Michal Hocko <mhocko@suse.com>
Cc: Feng Tang <feng.tang@intel.com>,  Aneesh Kumar K V
 <aneesh.kumar@linux.ibm.com>,  Andrew Morton <akpm@linux-foundation.org>,
  Johannes Weiner <hannes@cmpxchg.org>,  Tejun Heo <tj@kernel.org>,  Zefan
 Li <lizefan.x@bytedance.com>,  Waiman Long <longman@redhat.com>,
  "linux-mm@kvack.org" <linux-mm@kvack.org>,  "cgroups@vger.kernel.org"
 <cgroups@vger.kernel.org>,  "linux-kernel@vger.kernel.org"
 <linux-kernel@vger.kernel.org>,  "Hansen, Dave" <dave.hansen@intel.com>,
  "Chen, Tim C" <tim.c.chen@intel.com>,  "Yin, Fengwei"
 <fengwei.yin@intel.com>
Subject: Re: [PATCH] mm/vmscan: respect cpuset policy during page demotion
References: <20221026074343.6517-1-feng.tang@intel.com>
	<dc453287-015d-fd1c-fe7f-6ee45772d6aa@linux.ibm.com>
	<Y1jpDfwBQId3GkJC@feng-clx> <Y1j7tsj5M0Md/+Er@dhcp22.suse.cz>
	<Y1kl8VbPE0RYdyEB@feng-clx> <Y1lZV6qHp3gIINGc@dhcp22.suse.cz>
	<87wn8lkbk5.fsf@yhuang6-desk2.ccr.corp.intel.com>
	<Y1ou5DGHrEsKnhri@dhcp22.suse.cz>
	<87o7txk963.fsf@yhuang6-desk2.ccr.corp.intel.com>
	<Y1o63SWD2KmQkT3v@dhcp22.suse.cz>
Date: Thu, 27 Oct 2022 17:31:35 +0800
In-Reply-To: <Y1o63SWD2KmQkT3v@dhcp22.suse.cz> (Michal Hocko's message of
	"Thu, 27 Oct 2022 10:01:33 +0200")
Message-ID: <87fsf9k3yg.fsf@yhuang6-desk2.ccr.corp.intel.com>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/27.1 (gnu/linux)
MIME-Version: 1.0
Content-Type: text/plain; charset=ascii
ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1666863141; a=rsa-sha256;
	cv=none;
	b=bIrPQfFwx9j+zUL/LVkTRXwyTRsBe9TYL8qYA9D0P+0VYH6HkvvEISxlyt+6xbFIAiDkId
	OdazpJswb5M0qwx5Cf5Ez6frxY+2rwxZoRXpuM32NBtmdS/Q8i0jAKk2gShI2YiY7M9+J2
	cTOT3kzcuoWngxbwUB5vjsEdspMOsk0=
ARC-Authentication-Results: i=1;
	imf19.hostedemail.com;
	dkim=none ("invalid DKIM record") header.d=intel.com header.s=Intel header.b=IQyTqm82;
	spf=pass (imf19.hostedemail.com: domain of ying.huang@intel.com designates 134.134.136.20 as permitted sender) smtp.mailfrom=ying.huang@intel.com;
	dmarc=pass (policy=none) header.from=intel.com
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com;
	s=arc-20220608; t=1666863141;
	h=from:from:sender:reply-to:subject:subject:date:date:
	 message-id:message-id:to:to:cc:cc:mime-version:mime-version:
	 content-type:content-type:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references:dkim-signature;
	bh=25+ZoINXyvJY1Cjp9nexcrz3ByO0jVyCVMucU8UpLCE=;
	b=Xn4YzAk2puvVNij6P2+z1sNCIASEjr1AgJh4rNg11tyyS2jQmLjFImeRssIncdyhDb/IXx
	YgtCZl/sRnCdJpYx4jjgSW5Ep3MPkdJZabZYL+NBzt/ONoCoctAVk1eAFrJYmcpYidyT0U
	mCpwSbGlGZojH2xePGxXuCBOUskiEfA=
X-Rspamd-Queue-Id: 49CCC1A0006
X-Rspam-User: 
Authentication-Results: imf19.hostedemail.com;
	dkim=none ("invalid DKIM record") header.d=intel.com header.s=Intel header.b=IQyTqm82;
	spf=pass (imf19.hostedemail.com: domain of ying.huang@intel.com designates 134.134.136.20 as permitted sender) smtp.mailfrom=ying.huang@intel.com;
	dmarc=pass (policy=none) header.from=intel.com
X-Rspamd-Server: rspam04
X-Stat-Signature: 76hgza8kowys7j1peoynctwo5zebbxui
X-HE-Tag: 1666863140-610448
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>

Michal Hocko <mhocko@suse.com> writes:

> On Thu 27-10-22 15:39:00, Huang, Ying wrote:
>> Michal Hocko <mhocko@suse.com> writes:
>> 
>> > On Thu 27-10-22 14:47:22, Huang, Ying wrote:
>> >> Michal Hocko <mhocko@suse.com> writes:
>> > [...]
>> >> > I can imagine workloads which wouldn't like to get their memory demoted
>> >> > for some reason but wouldn't it be more practical to tell that
>> >> > explicitly (e.g. via prctl) rather than configuring cpusets/memory
>> >> > policies explicitly?
>> >> 
>> >> If my understanding were correct, prctl() configures the process or
>> >> thread.
>> >
>> > Not necessarily. There are properties which are per address space like
>> > PR_[GS]ET_THP_DISABLE. This could be very similar.
>> >
>> >> How can we get process/thread configuration at demotion time?
>> >
>> > As already pointed out in previous emails. You could hook into
>> > folio_check_references path, more specifically folio_referenced_one
>> > where you have all that you need already - all vmas mapping the page and
>> > then it is trivial to get the corresponding vm_mm. If at least one of
>> > them has the flag set then the demotion is not allowed (essentially the
>> > same model as VM_LOCKED).
>> 
>> Got it!  Thanks for detailed explanation.
>> 
>> One bit may be not sufficient.  For example, if we want to avoid or
>> control cross-socket demotion and still allow demoting to slow memory
>> nodes in local socket, we need to specify a node mask to exclude some
>> NUMA nodes from demotion targets.
>
> Isn't this something to be configured on the demotion topology side? Or
> do you expect there will be per process/address space usecases? I mean
> different processes running on the same topology, one requesting local
> demotion while other ok with the whole demotion topology?

I think it's possible for different processes to have different
requirements.

- Some processes don't care where the memory is placed: they prefer
  local, then fall back to remote if there is no free space.

- Some processes want to avoid cross-socket traffic, so they bind to
  nodes of the local socket.

- Some processes want to avoid using slow memory, so they bind to fast
  memory nodes only.
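
To make the three cases concrete, here is a rough userspace sketch (not
kernel code).  It assumes a hypothetical 2-socket machine with one fast
and one slow node per socket; the node numbering, `nodemask_bits`,
`demotion_targets()` and `allowed_demotion_targets()` are all made up
for illustration.  It only shows that each requirement can be expressed
as a per-process allowed-node mask intersected with the demotion
targets defined by the topology.

```c
#include <stdint.h>

/* Hypothetical topology: socket 0 = {FAST0, SLOW0}, socket 1 = {FAST1, SLOW1}. */
enum { NODE_FAST0 = 0, NODE_FAST1 = 1, NODE_SLOW0 = 2, NODE_SLOW1 = 3 };

typedef uint32_t nodemask_bits;	/* one bit per NUMA node (illustrative type) */

#define NODE_BIT(n) ((nodemask_bits)1u << (n))

/*
 * Demotion targets as the topology alone would define them.  Assumption:
 * a fast node may demote to any slow node, local or remote; slow nodes
 * are the lowest tier and have no targets.
 */
static nodemask_bits demotion_targets(int node)
{
	if (node == NODE_FAST0 || node == NODE_FAST1)
		return NODE_BIT(NODE_SLOW0) | NODE_BIT(NODE_SLOW1);
	return 0;
}

/* Targets left for one process: topology targets filtered by its allowed nodes. */
static nodemask_bits allowed_demotion_targets(int node, nodemask_bits allowed)
{
	return demotion_targets(node) & allowed;
}
```

With this model, a process allowed on all nodes keeps both slow nodes
as targets for NODE_FAST0, a process bound to socket 0 keeps only
NODE_SLOW0, and a process bound to the fast nodes gets an empty target
mask, i.e. no demotion at all.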

>> From overhead point of view, this appears similar as that of VMA/task
>> memory policy?  We can make mm->owner available for memory tiers
>> (CONFIG_NUMA && CONFIG_MIGRATION).  The advantage is that we don't need
>> to introduce new ABI.  I guess users may prefer to use `numactl` than a
>> new ABI?
>
> mm->owner is a wrong direction. It doesn't have a strong meaning because
> there is no one task explicitly responsible for the mm so there is no
> real owner (our clone() semantic is just too permissive for that). The
> memcg::owner is a crude and ugly hack and it should go away over time
> rather than build new uses.
>
> Besides that, and as I have already tried to explain, per task demotion
> policy is what makes this whole thing expensive. So this better be a per
> mm or per vma property. Whether it is a on/off knob like PR_[GS]ET_THP_DISABLE
> or there are explicit requirements for fine grain control on the vma
> level I dunno. I haven't seen those usecases yet and it is really easy
> to overengineer this.
>
> To be completely honest I would much rather wait for those usecases
> before adding a more complex APIs.  PR_[GS]_DEMOTION_DISABLED sounds
> like a reasonable first step. Should we have more fine grained
> requirements wrt address space I would follow the MADV_{NO}HUGEPAGE
> lead.
>
> If we really need/want to give a fine grained control over demotion
> nodemask then we would have to go with vma->mempolicy interface. In
> any case a per process on/off knob sounds like a reasonable first step
> before we learn more about real usecases.

Yes.  A per-mm or per-vma property is much better than a per-task
property.  Another possibility: how about adding a new flag to the
set_mempolicy() system call to set the per-mm mempolicy?  `numactl`
could use that by default.
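
For reference, the per-mm on/off model described earlier in this thread
(demotion is refused if at least one address space mapping the page has
opted out, the same model as VM_LOCKED) could look roughly like the
userspace sketch below.  This is only a model, not kernel code:
`struct mm_model`, `struct vma_model`, the `demotion_disabled` flag and
`demotion_allowed()` are hypothetical names, and a real implementation
would do the check while walking the rmap rather than over an array.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for mm_struct carrying a per-mm opt-out flag. */
struct mm_model {
	bool demotion_disabled;
};

/* Hypothetical stand-in for vm_area_struct: only the back pointer matters here. */
struct vma_model {
	const struct mm_model *vm_mm;
};

/*
 * Walk all VMAs mapping one page.  The page may be demoted only if no
 * mapping address space has set the opt-out flag.
 */
static bool demotion_allowed(const struct vma_model *vmas, size_t nr)
{
	for (size_t i = 0; i < nr; i++) {
		if (vmas[i].vm_mm->demotion_disabled)
			return false;
	}
	return true;
}
```

The check needs nothing per task, only the mm behind each vma, which is
why it stays cheap compared with a per-task policy lookup.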

Best Regards,
Huang, Ying