From: "Huang, Ying" <ying.huang@intel.com>
To: Johannes Weiner
Cc: Michal Hocko, Dave Hansen, Yang Shi, Wei Xu, Andrew Morton, linux-mm@kvack.org, LKML
Subject: Re: memcg reclaim demotion wrt. isolation
Date: Fri, 16 Dec 2022 11:16:26 +0800
Message-ID: <877cys9gxh.fsf@yhuang6-desk2.ccr.corp.intel.com>
In-Reply-To: (Johannes Weiner's message of "Thu, 15 Dec 2022 09:22:33 +0100")
References: <87edt1dwd2.fsf@yhuang6-desk2.ccr.corp.intel.com>

Johannes Weiner writes:

> On Thu, Dec 15, 2022 at 02:17:13PM +0800, Huang, Ying wrote:
>> Michal Hocko writes:
>>
>> > On Tue 13-12-22 17:14:48, Johannes Weiner wrote:
>> >> On Tue, Dec 13, 2022 at 04:41:10PM +0100, Michal Hocko wrote:
>> >> > Hi,
>> >> > I have just
>> >> > noticed that pages allocated for demotion targets
>> >> > include __GFP_KSWAPD_RECLAIM (through GFP_NOWAIT). This has been the
>> >> > case since the code was introduced by 26aa2d199d6f ("mm/migrate: demote
>> >> > pages during reclaim"). I suspect the intention is to trigger aging
>> >> > on the fallback node and either drop or further demote the oldest pages.
>> >> >
>> >> > This makes sense, but I suspect that this wasn't intended also for
>> >> > memcg-triggered reclaim. That would mean that memory pressure in one
>> >> > hierarchy could trigger paging out pages of a different hierarchy if the
>> >> > demotion target is close to full.
>> >>
>> >> This is also true if you don't do demotion. If a cgroup tries to
>> >> allocate memory on a full node (i.e. mbind()), it may wake kswapd or
>> >> enter global reclaim directly, which may push out the memory of other
>> >> cgroups, regardless of the respective cgroup limits.
>> >
>> > You are right on this. But this is describing a slightly different
>> > situation IMO.
>> >
>> >> The demotion allocations don't strike me as any different. They're
>> >> just allocations on behalf of a cgroup. I would expect them to wake
>> >> kswapd and reclaim physical memory as needed.
>> >
>> > I am not sure this is the expected behavior. Consider the currently
>> > discussed memory.demote interface, where userspace can trigger
>> > (almost) arbitrary demotions. This can deplete fallback nodes without
>> > over-committing memory overall, yet push out demoted memory from
>> > other workloads. From the user's POV it would look like reclaim while
>> > overall memory is far from depleted, so it would be considered
>> > premature and warrant a bug report.
>> >
>> > The reclaim behavior would make more sense to me if it were constrained
>> > to the allocating memcg hierarchy so that unrelated lruvecs wouldn't be
>> > disrupted.
>>
>> When we reclaim/demote some pages from a memcg proactively, what is our
>> goal? To free up some memory in this memcg for other memcgs to use? If
>> so, it sounds reasonable to keep as many pages of other memcgs as
>> possible.
>
> The goal of proactive aging is to free up any resources that aren't
> needed to meet the SLAs (e.g. end-to-end response time of a webserver).
> Meaning, to run things as leanly as possible within spec. Into that
> freed space, another container can then be co-located.
>
> This means that the goal is to free up as many resources as possible,
> starting with the coveted high tier. If a container has been using
> all-hightier memory but is able to demote to the low tier, there are
> 3 options for existing memory in the lower tier:
>
> 1) Colder/stale memory - should be displaced
>
> 2) Memory that can be promoted once the high tier is free -
>    reclaim/demotion of the coldest pages needs to happen at least
>    temporarily, or the tier swap is in a stalemate.
>
> 3) Equally hot memory - if this exceeds the capacity of the lower tier,
>    the hottest overall pages should stay and the excess be
>    demoted/reclaimed.
>
> You can't know which scenario you're in until you put the demoted pages
> in direct LRU competition with what's already there. And in all three
> scenarios, direct LRU competition also produces the optimal outcome.

If my understanding is correct, your preferred semantics are
memcg-specific in the higher tier and global in the lower tier.

Another choice is to add another global "memory.reclaim" knob, for
example, as

  /sys/devices/virtual/memory_tiering/memory_tier/memory.reclaim

Then we could first trigger global memory reclaim in the lower tiers,
and then trigger memcg-specific memory reclaim in the higher tier for
the specified memcg. The con of this choice is that you need 2 steps to
finish the work. The pro is that you don't need to combine
memcg-specific and global behavior in one interface.

Best Regards,
Huang, Ying