From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <owner-linux-mm@kvack.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 4F040C4332F
	for <linux-mm@archiver.kernel.org>; Fri,  2 Dec 2022 02:02:52 +0000 (UTC)
Received: by kanga.kvack.org (Postfix)
	id 8B54F6B0071; Thu,  1 Dec 2022 21:02:51 -0500 (EST)
Received: by kanga.kvack.org (Postfix, from userid 40)
	id 866046B0073; Thu,  1 Dec 2022 21:02:51 -0500 (EST)
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id 72D316B0074; Thu,  1 Dec 2022 21:02:51 -0500 (EST)
X-Delivered-To: linux-mm@kvack.org
Received: from relay.hostedemail.com (smtprelay0014.hostedemail.com [216.40.44.14])
	by kanga.kvack.org (Postfix) with ESMTP id 637D76B0071
	for <linux-mm@kvack.org>; Thu,  1 Dec 2022 21:02:51 -0500 (EST)
Received: from smtpin28.hostedemail.com (a10.router.float.18 [10.200.18.1])
	by unirelay06.hostedemail.com (Postfix) with ESMTP id 32C80AB149
	for <linux-mm@kvack.org>; Fri,  2 Dec 2022 02:02:51 +0000 (UTC)
X-FDA: 80195717742.28.AD4706A
Received: from mga06.intel.com (mga06b.intel.com [134.134.136.31])
	by imf28.hostedemail.com (Postfix) with ESMTP id 96BE3C0009
	for <linux-mm@kvack.org>; Fri,  2 Dec 2022 02:02:49 +0000 (UTC)
Authentication-Results: imf28.hostedemail.com;
	dkim=pass header.d=intel.com header.s=Intel header.b=j6dshGYf;
	spf=pass (imf28.hostedemail.com: domain of ying.huang@intel.com designates 134.134.136.31 as permitted sender) smtp.mailfrom=ying.huang@intel.com;
	dmarc=pass (policy=none) header.from=intel.com
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com;
	s=arc-20220608; t=1669946569;
	h=from:from:sender:reply-to:subject:subject:date:date:
	 message-id:message-id:to:to:cc:cc:mime-version:mime-version:
	 content-type:content-type:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references:dkim-signature;
	bh=Lm7Q226Br34gSao+O/ovVufKczPe7yyzD5m+NFNN8Dw=;
	b=5qHGEVgxU1GmpEMbSmNgwMnK6tTl7nBQ88bUdOLoQTwQMNOnDhz+yp6d72TMNt/AQ5iyGC
	10+Z9njpc4/p4oUhjW7OSRX6WSVtCYukFLcifSdyhoGghFky136okdUdxqxivGt6mBetpm
	2+3l7/8ZM9C9hmWjwPYt86FYqpOOuvU=
ARC-Authentication-Results: i=1;
	imf28.hostedemail.com;
	dkim=pass header.d=intel.com header.s=Intel header.b=j6dshGYf;
	spf=pass (imf28.hostedemail.com: domain of ying.huang@intel.com designates 134.134.136.31 as permitted sender) smtp.mailfrom=ying.huang@intel.com;
	dmarc=pass (policy=none) header.from=intel.com
ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1669946569; a=rsa-sha256;
	cv=none;
	b=EuDJcr/+GUsoOwTwygLQNEMKV1tOohV1NicW5oA8AJ36w/8ia20eX956yrBYqxmKEC6mZ7
	h2TzYk5mOB9mP9IhevS0XvFPy0/0tRnbMAxAwoS3IPTfQ3ZFmBEW7wM2Hxp/lNw4MKHJzM
	Ib6jhG361AnhXjODYuwjo0Duw2X2mMs=
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple;
  d=intel.com; i=@intel.com; q=dns/txt; s=Intel;
  t=1669946569; x=1701482569;
  h=from:to:cc:subject:references:date:in-reply-to:
   message-id:mime-version;
  bh=DUxVs+XVQBl0UL/Tbqjptes7IbkNXXTyQuoYBy96rpk=;
  b=j6dshGYfJbKwqWnN4MuAxALAQwRM98LrhOmPN9c8yLkOYVX/DQLEEUMN
   j9UWv1+dgZLaRYH3laGgTbIRk9YkL6KoEOZkajW2fJzyjjIQPx9cM7TzG
   eRxvHSPzih+0Wm6wKYZOtgPbVeQl7S48Z3qjwmQr//fvxWl86/hAXOqhu
   KHZldeZ+I7A1o+F/9cL3iEJ6/89SffI534jrm8hado448uYtUTeMVOqzI
   zHRKkRem4TpZ3XIrzjKXQj4j38dJHBLcGBcoPzfPYLeSqAAWxdxfloEJ7
   EZfhK3VwhV0nXDFTNWBUBXua/y5AumMxMWuOF1vKQe8DuRPKxT1ikYx8G
   A==;
X-IronPort-AV: E=McAfee;i="6500,9779,10548"; a="378006746"
X-IronPort-AV: E=Sophos;i="5.96,210,1665471600"; 
   d="scan'208";a="378006746"
Received: from fmsmga006.fm.intel.com ([10.253.24.20])
  by orsmga104.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 01 Dec 2022 18:02:32 -0800
X-IronPort-AV: E=McAfee;i="6500,9779,10548"; a="889970343"
X-IronPort-AV: E=Sophos;i="5.96,210,1665471600"; 
   d="scan'208";a="889970343"
Received: from yhuang6-desk2.sh.intel.com (HELO yhuang6-desk2.ccr.corp.intel.com) ([10.238.208.55])
  by fmsmga006-auth.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 01 Dec 2022 18:02:29 -0800
From: "Huang, Ying" <ying.huang@intel.com>
To: Mina Almasry <almasrymina@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>,  Yang Shi
 <yang.shi@linux.alibaba.com>,  Yosry Ahmed <yosryahmed@google.com>,  Tim
 Chen <tim.c.chen@linux.intel.com>,  weixugc@google.com,
  shakeelb@google.com,  gthelen@google.com,  fvdl@google.com,  Michal Hocko
 <mhocko@kernel.org>,  Roman Gushchin <roman.gushchin@linux.dev>,  Muchun
 Song <songmuchun@bytedance.com>,  Andrew Morton
 <akpm@linux-foundation.org>,  linux-kernel@vger.kernel.org,
  cgroups@vger.kernel.org,  linux-mm@kvack.org
Subject: Re: [RFC PATCH V1] mm: Disable demotion from proactive reclaim
References: <20221122203850.2765015-1-almasrymina@google.com>
	<Y35fw2JSAeAddONg@cmpxchg.org>
	<CAHS8izN+xqM67XLT4y5qyYnGQMUWRQCJrdvf2gjTHd8nZ_=0sw@mail.gmail.com>
	<Y36XchdgTCsMP4jT@cmpxchg.org>
	<874juonbmv.fsf@yhuang6-desk2.ccr.corp.intel.com>
	<Y4ZKgXdUXZ3ycMp4@cmpxchg.org>
	<87wn7dayfz.fsf@yhuang6-desk2.ccr.corp.intel.com>
	<CAHS8izNc=NeMcNudOB0r4aYbZURpU+Y9xpek_KpO=55R+7ec5g@mail.gmail.com>
Date: Fri, 02 Dec 2022 10:01:39 +0800
In-Reply-To: <CAHS8izNc=NeMcNudOB0r4aYbZURpU+Y9xpek_KpO=55R+7ec5g@mail.gmail.com>
	(Mina Almasry's message of "Thu, 1 Dec 2022 12:40:16 -0800")
Message-ID: <87tu2e36nw.fsf@yhuang6-desk2.ccr.corp.intel.com>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/27.1 (gnu/linux)
MIME-Version: 1.0
Content-Type: text/plain; charset=ascii
X-Spamd-Result: default: False [-3.90 / 9.00];
	BAYES_HAM(-3.00)[100.00%];
	DMARC_POLICY_ALLOW(-0.50)[intel.com,none];
	R_SPF_ALLOW(-0.20)[+ip4:134.134.136.31/32];
	R_DKIM_ALLOW(-0.20)[intel.com:s=Intel];
	MIME_GOOD(-0.10)[text/plain];
	RCVD_NO_TLS_LAST(0.10)[];
	FROM_EQ_ENVFROM(0.00)[];
	RCPT_COUNT_TWELVE(0.00)[16];
	DKIM_TRACE(0.00)[intel.com:+];
	MID_RHS_MATCH_FROMTLD(0.00)[];
	TO_MATCH_ENVRCPT_SOME(0.00)[];
	RCVD_COUNT_THREE(0.00)[3];
	FROM_HAS_DN(0.00)[];
	MIME_TRACE(0.00)[0:+];
	TO_DN_SOME(0.00)[];
	ARC_SIGNED(0.00)[hostedemail.com:s=arc-20220608:i=1];
	ARC_NA(0.00)[]
X-Rspamd-Queue-Id: 96BE3C0009
X-Rspam-User: 
X-Rspamd-Server: rspam07
X-Stat-Signature: cambxrgcb6sd35kd43p1onrhhqzwyeiu
X-HE-Tag: 1669946569-232788
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>

Mina Almasry <almasrymina@google.com> writes:

> On Tue, Nov 29, 2022 at 7:56 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Johannes Weiner <hannes@cmpxchg.org> writes:
>>
>> > Hello Ying,
>> >
>> > On Thu, Nov 24, 2022 at 01:51:20PM +0800, Huang, Ying wrote:
>> >> Johannes Weiner <hannes@cmpxchg.org> writes:
>> >> > The fallback to reclaim actually strikes me as wrong.
>> >> >
>> >> > Think of reclaim as 'demoting' the pages to the storage tier. If we
>> >> > have a RAM -> CXL -> storage hierarchy, we should demote from RAM to
>> >> > CXL and from CXL to storage. If we reclaim a page from RAM, it means
>> >> > we 'demote' it directly from RAM to storage, bypassing potentially a
>> >> > huge amount of pages colder than it in CXL. That doesn't seem right.
>> >> >
>> >> > If demotion fails, IMO it shouldn't satisfy the reclaim request by
>> >> > breaking the layering. Rather it should deflect that pressure to the
>> >> > lower layers to make room. This makes sure we maintain an aging
>> >> > pipeline that honors the memory tier hierarchy.
>> >>
>> >> Yes.  I think that we should avoid to fall back to reclaim as much as
>> >> possible too.  Now, when we allocate memory for demotion
>> >> (alloc_demote_page()), __GFP_KSWAPD_RECLAIM is used.  So, we will trigger
>> >> kswapd reclaim on lower tier node to free some memory to avoid fall back
>> >> to reclaim on current (higher tier) node.  This may be not good enough,
>> >> for example, the following patch from Hasan may help via waking up
>> >> kswapd earlier.
>> >>
>> >> https://lore.kernel.org/linux-mm/b45b9bf7cd3e21bca61d82dcd1eb692cd32c122c.1637778851.git.hasanalmaruf@fb.com/
>> >>
>> >> Do you know what is the next step plan for this patch?
>> >>
>> >> Should we do even more?
>> >>
>> >> From another point of view, I still think that we can use falling back
>> >> to reclaim as the last resort to avoid OOM in some special situations,
>> >> for example, most pages in the lowest tier node are mlock() or too hot
>> >> to be reclaimed.
>> >
>> > If they're hotter than reclaim candidates on the toptier, shouldn't
>> > they get promoted instead and make room that way? We may have to tweak
>> > the watermark logic a bit to facilitate that (allow promotions where
>> > regular allocations already fail?). But this sort of resorting would
>> > be preferable to age inversions.
>>
>> Now it's legal to enable demotion and disable promotion.  Yes, this is
>> wrong configuration in general.  But should we trigger OOM for these
>> users?
>>
>> And now promotion only works for default NUMA policy (and MPOL_BIND to
>> both promotion source and target nodes with MPOL_F_NUMA_BALANCING).  If
>> we use some other NUMA policy, the pages cannot be promoted too.
>>
>> > The mlock scenario sounds possible. In that case, it wouldn't be an
>> > aging inversion, since there is nothing colder on the CXL node.
>> >
>> > Maybe a bypass check should explicitly consult the demotion target
>> > watermarks against its evictable pages (similar to the file_is_tiny
>> > check in prepare_scan_count)?
>>
>> Yes.  This sounds doable.
>>
>> > Because in any other scenario, if there is a bug in the promo/demo
>> > coordination, I think we'd rather have the OOM than deal with age
>> > inversions causing intermittent performance issues that are incredibly
>> > hard to track down.
>>
>> Previously, I thought that people will always prefer performance
>> regression than OOM.  Apparently, I am wrong.
>>
>> Anyway, I think that we need to reduce the possibility of OOM or falling
>> back to reclaim as much as possible firstly.  Do you agree?
>>
>
> I've been discussing this with a few folks here. I think FWIW general
> feeling here is that demoting from top tier nodes is preferred, except
> in extreme circumstances we would indeed like to run with a
> performance issue rather than OOM a customer VM. I wonder if there is
> another way to debug mis-tiered pages rather than trigger an oom to
> debug.
>
> One thing I think/hope we can trivially agree on is that proactive
> reclaim/demotion is _not_ an extreme circumstance. I would like me or
> someone from the team to follow up with a patch that disables fallback
> to reclaim on proactive reclaim/demotion (sc->proactive).

Yes.  This makes sense to me.

Best Regards,
Huang, Ying

>> One possibility, can we fall back to reclaim only if the sc->priority is
>> small enough (even 0)?
>>
>
> This makes sense to me.
>
>> Best Regards,
>> Huang, Ying
>>