From: "Huang, Ying" <ying.huang@intel.com>
To: Gregory Price
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, akpm@linux-foundation.org, david@redhat.com, nphamcs@gmail.com, nehagholkar@meta.com, abhishekd@meta.com, Johannes Weiner, Feng Tang
Subject: Re: [PATCH 0/3] mm,TPP: Enable promotion of unmapped pagecache
In-Reply-To: (Gregory Price's message of "Mon, 4 Nov 2024 13:12:57 -0500")
References: <20240803094715.23900-1-gourry@gourry.net> <875xrxhs5j.fsf@yhuang6-desk2.ccr.corp.intel.com> <87ikvefswp.fsf@yhuang6-desk2.ccr.corp.intel.com>
Date: Tue, 05 Nov 2024 10:00:59 +0800
Message-ID: <87jzdi782s.fsf@yhuang6-desk2.ccr.corp.intel.com>
Hi, Gregory,

Gregory Price writes:

> On Mon, Sep 02, 2024 at 02:53:26PM +0800, Huang, Ying wrote:
>> Gregory Price writes:
>>
>> > On Mon, Aug 19, 2024 at 03:46:00PM +0800, Huang, Ying wrote:
>> >> Gregory Price writes:
>> >>
>> >> > Unmapped pagecache pages can be demoted to low-tier memory, but
>> >> > they can only be promoted if a process maps the pages into its
>> >> > memory space (so that NUMA hint faults can be caught). This can
>> >> > cause significant performance degradation as the pagecache ages
>> >> > and unmapped, cached files are accessed.
>> >> >
>> >> > This patch series enables the pagecache to request promotion of
>> >> > a folio when it is accessed via the pagecache.
>> >> >
>> >> > We add a new `numa_hint_page_cache` counter in vmstat to capture
>> >> > information on when these migrations occur.
>> >>
>> >> It appears that you will promote a page cache page on its second
>> >> access. Do you have a better way to identify the hot pages among
>> >> the not-so-hot pages? How do you balance between unmapped and
>> >> mapped pages? We have hot page selection for mapped pages.
>> >>
>> >> [snip]
>> >>
>> >
>> > I've since explored moving this down under a (referenced && active)
>> > check.
>> >
>> > This would be more like promotion on third access within an LRU
>> > shrink round (the LRU should, in theory, knock off the active bits
>> > on some decent time interval when the system is pressured).
>> >
>> > Barring adding new counters to folios to track hits, I don't see a
>> > clear and obvious way to track hotness. The primary observation
>> > here is that pagecache is unmapped, and so cannot use NUMA fault
>> > hints.
>> >
>> > This is more complicated with MGLRU, but I'm saving that for after
>> > I figure out the plan for plain old LRU.
>>
>> Several years ago, we tried to use the access time tracking mechanism
>> of NUMA balancing to track the access latency of unmapped file cache
>> folios. The original implementation is as follows,
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/vishal/tiering.git/commit/?h=tiering-0.8&id=5f2e64ce75c0322602c2ec8c70b64bb69b1f1329
>>
>> What do you think about this?
>>
>
> Coming back around to explore this topic a bit more, I dug into this
> old patch and the LRU patch by Keith - I'm struggling to find a good
> option that doesn't over-complicate things or propose something
> contentious.
>
> I did a browse through lore and did not see any discussion on this
> patch or on Keith's LRU patch, so I presume discussion on this
> happened largely off-list. If you have any context as to why this
> wasn't RFC'd officially, I would like more information.

Thanks for doing this. There was not much discussion offline. We just
haven't had enough time to work on the solution.

> My observations between these 3 proposals:
>
> - The page-lock state is complex when trying to interpose in
>   mark_folio_accessed, meaning inline promotion inside that interface
>   is a non-starter.
>
>   We found one deadlock during task exit due to the PTL being held.
>
>   This worries me more generally, but we did find some success
>   changing certain calls to mark_folio_accessed to
>   mark_folio_accessed_and_promote - rather than modifying
>   mark_folio_accessed itself. This ends up changing code in similar
>   places to your hook - but catches more of the conditions that mark
>   a page accessed.
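[Editor's sketch] The wrapper idea above - leaving mark_folio_accessed alone and adding a promoting variant at selected call sites, gated on (referenced && active) so promotion fires roughly on the third access - can be modeled in a few lines of userspace C. All names, the two-flag heuristic, and the node fields below are illustrative stand-ins, not kernel code; the real path would involve folio flags, the LRU, and a migration call such as migrate_misplaced_folio() under the appropriate locks.

```c
#include <stdbool.h>

/* Illustrative stand-in for struct folio; names are hypothetical. */
struct folio {
    bool referenced; /* set on first access */
    bool active;     /* set on a subsequent access */
    int  nid;        /* NUMA node currently backing the folio */
};

enum { FAST_NID = 0, SLOW_NID = 1 };

/* Models the existing two-step referenced -> active aging. */
static void mark_folio_accessed(struct folio *f)
{
    if (!f->referenced)
        f->referenced = true;
    else
        f->active = true;
}

/*
 * Wrapper variant: check hotness *before* marking, so a slow-tier
 * folio is promoted on its third access (referenced and active are
 * both already set by then).  The nid assignment stands in for an
 * actual migration.
 */
static void mark_folio_accessed_and_promote(struct folio *f)
{
    if (f->referenced && f->active && f->nid == SLOW_NID)
        f->nid = FAST_NID;
    mark_folio_accessed(f);
}
```

Under this model the first two accesses only age the folio; promotion happens on the third, which matches the "third access within an LRU shrink round" description above.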
>
> - For Keith's proposal, promotion via LRU requires memory pressure on
>   the lower tier to cause a shrink and therefore promotions. I'm not
>   well versed in LRU semantics, but it seems we could try proactive
>   reclaim here.
>
>   Doing promote-reclaim and demote/swap/evict reclaim on the same
>   triggers seems counter-intuitive.

IIUC, a similar method for page promotion is proposed in the TPP paper
(https://arxiv.org/abs/2206.02878). I guess that it works together with
proactive reclaim.

> - Doing promotions inline with access creates overhead. I've seen
>   some research suggesting 60us+ per migration - so aggressiveness
>   could harm performance.
>
>   Doing it async would alleviate the inline access overhead - but it
>   could also make promotion pointless if the time to promote is too
>   far from the liveness of the pages.

Async promotion needs to deal with resource (CPU/memory) charging too.
If you do some work on behalf of a task, you need to charge the
consumed resources to that task.

> - Doing async promotion may also require something like PG_PROMOTABLE
>   (as proposed by Keith's patch), which will obviously be a very
>   contentious topic.

Some additional data structure could be used to record the pages
instead.

> tl;dr: I'm leaning towards a solution like you have here, but we may
> need a sysfs switch similar to demotion_enabled in case of poor
> performance due to heuristically degenerate access patterns, and we
> may need to expose some form of adjustable aggressiveness value to
> make it tunable.

Yes. We may need that, because the performance benefit may be lower
than the overhead introduced.

> Reading more into the code surrounding this and other migration
> logic, I also think we should explore an optimization to mempolicy
> that tries to aggressively keep certain classes of memory on the
> local node (RX memory and stack, for example).
>
> Other areas of reclaim already try to avoid demoting this type of
> memory, so we should try not to allocate it there in the first place.
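[Editor's sketch] The tunable-aggressiveness idea discussed above - an on/off switch like demotion_enabled plus an adjustable threshold - can be modeled as a simple userspace policy check. Every name here (promotion_enabled, promote_threshold, the per-folio access counter) is hypothetical; no such sysfs knobs exist in the series under discussion.

```c
#include <stdbool.h>

/* Models the proposed sysfs knobs; both names are hypothetical. */
static bool promotion_enabled = true;      /* cf. demotion_enabled */
static unsigned int promote_threshold = 3; /* adjustable aggressiveness */

struct folio {
    unsigned int accesses; /* hypothetical per-folio access counter */
    bool on_slow_tier;
};

/*
 * Called on each access; returns true when this access should trigger
 * a promotion under the current policy settings.
 */
static bool folio_accessed_should_promote(struct folio *f)
{
    if (++f->accesses < promote_threshold)
        return false;
    return promotion_enabled && f->on_slow_tier;
}
```

Lowering promote_threshold makes promotion more aggressive (more migrations, more inline overhead); disabling promotion_enabled falls back to the current behavior, which is the escape hatch the tl;dr above asks for.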
We already use a DRAM-first allocation policy. So, we need to measure
its effect first.

--
Best Regards,
Huang, Ying