From: "Huang, Ying" <ying.huang@intel.com>
To: Hasan Al Maruf <hasan3050@gmail.com>
Cc: dave.hansen@linux.intel.com, yang.shi@linux.alibaba.com,
	mgorman@techsingularity.net, riel@surriel.com, hannes@cmpxchg.org,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 0/5] Transparent Page Placement for Tiered-Memory
Date: Thu, 25 Nov 2021 09:23:31 +0800
Message-ID: <874k812fl8.fsf@yhuang6-desk2.ccr.corp.intel.com>
In-Reply-To: (Hasan Al Maruf's message of "Wed, 24 Nov 2021 13:58:25 -0500")

Hasan Al Maruf writes:

> [resend in proper format]
>
> With the advent of new memory types and technologies, we can see systems
> with different types of memory together, e.g. DRAM, PMEM, CXL-enabled
> memory, etc. In the near future, CXL memory may be available in the
> physical address space as a CPU-less NUMA node alongside the native DDR
> memory channels. As different types of memory have different levels of
> performance, how we manage pages across the NUMA nodes becomes a matter
> of concern.
>
> Dave Hansen's patchset "Migrate Pages in lieu of discard" demotes toptier
> pages to a slow-tier node during the reclamation process:
>
> https://lwn.net/Articles/860215/
>
> However, that patchset does not include a mechanism to promote pages on a
> slow-tier memory node back to the toptier one. As a result, pages demoted
> to or newly allocated on the slow-tier node experience NUMA latency and
> hurt application performance. In this patch set, we augment the existing
> AutoNUMA mechanism to promote pages from slow-tier nodes to toptier nodes.
>
> We decouple the reclamation and allocation logic for the toptier node so
> that reclamation gets triggered at a higher watermark and demotes colder
> pages to slow-tier memory. As a result, toptier nodes can maintain some
> free space to accept both new allocations and promotions from slow-tier
> nodes. During promotion, we add hysteresis and only promote pages that
> are less likely to be demoted again within a short period of time. This
> reduces the chance of a page ping-ponging across NUMA nodes due to
> frequent demotion and promotion.
>
> We tested this patchset on systems with CXL-enabled DRAM and PMEM tiers.
> We find this patchset can bring hotter pages to the toptier node while
> moving the colder pages to the slow-tier nodes for a good range of Meta
> production workloads with live traffic. As a result, toptier nodes serve
> more hot pages and application performance improves.
>
> Case Study of a Meta cache application with two NUMA nodes
> ==========================================================
> Toptier node: DRAM directly attached to the CPU
> Slowtier node: DRAM attached through CXL
>
> Toptier vs Slowtier memory capacity ratio is 1:4
>
> With the default page placement policy, file caches fill up the toptier
> node and anons get trapped in the slowtier node. Only 14% of the total
> anons reside in the toptier node. Remote NUMA read bandwidth is 80%.
> Throughput regression is 18% compared to all memory being served from
> the toptier node.
>
> This patchset brings 80% of the anons to the toptier node. The anons
> remaining on slowtier memory are mostly cold. As the toptier node cannot
> host all of the hot memory, some hot files still remain on the slowtier
> node. Even so, remote NUMA read bandwidth drops from 80% to 40%. With
> this patchset, throughput regression is only 5% compared to the baseline
> of the toptier node serving the whole working set.

Hi, Hasan,

I found that quite some code in your patchset is exactly the same as that
in my patchset:

https://lore.kernel.org/lkml/20211116013522.140575-1-ying.huang@intel.com/

and in the patches in the following repo, which we used to publish some
patchsets that haven't been sent to the community for review yet:

https://git.kernel.org/pub/scm/linux/kernel/git/vishal/tiering.git/log/?h=tiering-0.72

I am glad that more people are interested in and working on optimizing
page placement for tiered-memory systems. How about we merge our efforts
instead of duplicating them? Because I have tried to make the patches
above as simple as possible (at least the first 3), could you comment on
the most basic patches there to help improve them? Then we can build our
more complex/advanced patches on top of that.
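For illustration only (this is not code from either patchset), the
"reclaim at a higher watermark" idea from the quoted cover letter above
could be sketched roughly as follows; WMARK_DEMOTE, set_demote_watermark()
and demote_scale_factor are hypothetical names:

/*
 * Sketch: give each zone on a toptier node a demotion watermark that sits
 * above the normal high watermark, so kswapd starts demoting cold pages
 * early enough to keep headroom for new allocations and promotions.
 * WMARK_DEMOTE and demote_scale_factor are assumed, not existing names.
 */
#include <linux/kernel.h>
#include <linux/mmzone.h>

static void set_demote_watermark(struct zone *zone,
				 unsigned int demote_scale_factor)
{
	/* Extra headroom, in 1/10000ths of the zone's managed pages. */
	unsigned long headroom = mult_frac(zone_managed_pages(zone),
					   demote_scale_factor, 10000);

	zone->_watermark[WMARK_DEMOTE] = high_wmark_pages(zone) + headroom;
}

With such a mark in place, kswapd on a toptier node would keep demoting
until free pages rise above it, rather than stopping at the usual high
watermark.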
Best Regards,
Huang, Ying


From: Hasan Al Maruf <hasan3050@gmail.com>
To: ying.huang@intel.com
Cc: dave.hansen@linux.intel.com, hannes@cmpxchg.org, hasan3050@gmail.com,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	mgorman@techsingularity.net, riel@surriel.com,
	yang.shi@linux.alibaba.com
Subject: Re: [PATCH 0/5] Transparent Page Placement for Tiered-Memory
Date: Mon, 29 Nov 2021 19:28:16 -0500
In-Reply-To: <874k812fl8.fsf@yhuang6-desk2.ccr.corp.intel.com>

Hi Huang,

We find the patches in the tiering series well thought out and helpful.
For our workloads, we initially started with that series, but we found the
whole series too complex, and some features did not deliver the expected
benefit. Therefore, we have come up with the current basic patches, which
are essential and achieve most of the intended behaviors while reducing
complexity as much as possible.

As we started with your tiering series (with 72 patches), there are
overlaps between our patches and the tiering series. We adopt the
functionality from the tiering series, then modify and extend it to make
the page placement mechanism simpler but workable.

Here are the key points for each of the patches in our Transparent Page
Placement series.

Patch #1:
We combine all the promotion- and demotion-related statistics in this
patch. Having statistics on promotions, demotions, and failures helps
observe the system's behavior and reason about performance. Besides, the
anon vs. file breakdown in both the promotion and demotion paths helps us
understand application behavior on a tiered-memory system. As applications
may have different sensitivity toward anon and file placement, this
breakdown in the migration path is often helpful to assess the
effectiveness of the page placement policy.

Patch #2:
This patch largely overlaps with your current series on NUMA Balancing:

https://lore.kernel.org/lkml/20211116013522.140575-1-ying.huang@intel.com/

This patch is a combination of your Patch #2 and Patch #3, except for the
static 10MB of free space on the top-tier node that maintains free
headroom for new allocations and promotions. Instead, we find that a
user-defined demote watermark is more generic, and we include that in our
Patch #3.

Patch #3:
This patch has the logic for a separate demote watermark per node. In the
tiering series, that demote watermark is somewhat bound to the cgroup and
triggered on a per-application basis. Besides, it only supports cgroup-v1.
However, we think that instead of cgroup-based soft reclamation, a global
per-node demote watermark is more meaningful and should be the basic one
to start with. In that case, the user does not have to think about
per-application setup.

Patch #4:
This patch includes the code for kswapd-based reclamation. As I mentioned
earlier, instead of cgroup-based reclamation, here we check whether a node
is balanced during each kswapd invocation. For a top-tier node, we check
whether kswapd has reclaimed until the DEMOTE_WMARK is satisfied; for
other nodes, the default mechanism continues. The difference between the
tiering series and this patch is cgroup-based reclamation vs. per-node
reclamation.
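As a rough sketch of this per-node check (an illustration of the idea, not
the actual patch), kswapd's balance test could look like the following,
where node_is_toptier() and demote_wmark_pages() are assumed helpers:

/*
 * Sketch: a pgdat_balanced()-style test in which a top-tier node counts
 * as balanced only once some eligible zone clears the higher demote
 * watermark, while other nodes keep using the regular high watermark.
 * node_is_toptier() and demote_wmark_pages() are assumed helpers here.
 */
static bool pgdat_balanced_tiered(pg_data_t *pgdat, int order,
				  int highest_zoneidx)
{
	bool toptier = node_is_toptier(pgdat->node_id);
	unsigned long mark;
	int i;

	for (i = 0; i <= highest_zoneidx; i++) {
		struct zone *zone = pgdat->node_zones + i;

		if (!populated_zone(zone))
			continue;

		mark = toptier ? demote_wmark_pages(zone)
			       : high_wmark_pages(zone);

		if (zone_watermark_ok_safe(zone, order, mark, highest_zoneidx))
			return true;
	}

	return false;
}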
Patch #5:
In your patches for promotion, you consider re-fault time for promotion
candidate selection. Although the hot threshold is tunable, from our
experiments we find this is not always helpful. For example, if different
subsets of pages have different re-access times, time-based promotion may
not be able to distinguish between them. If you make the time window long
enough, then even infrequently accessed pages become promotion candidates,
and later candidates for demotion.

In this patch, we propose LRU-based promotion, which gives anon and file
pages different promotion paths. If pages are used sporadically, even at a
high frequency, such irregular pages will eventually be moved off the
active LRU list. We find that our LRU-based approach can reduce promotion
traffic by up to 11x while retaining the same application throughput for
multiple workloads.

Besides, with a promotion rate limit, if files largely get promoted to the
top tier, the anon promotion rate often gets hampered because files take a
large portion of the total rate (which often happens for applications that
generate huge caches). In our LRU-based approach, each type has its own
separate LRU to check. So for workloads with small anon and large file
usage, the LRU-based approach lets more anons be promoted rather than
files.

I don't mind this patchset being merged into your current patchset under
discussion or any later ones. But I think this series contains the very
basic functionality needed for a workable page placement mechanism for
tiered memory. It can obviously be augmented by the other features in your
future tiering series.

Best,
Hasan
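A minimal sketch of the LRU-gated promotion check described under Patch #5
above, assuming hypothetical helper names (should_promote_page(),
numa_promotion_rate_limited()); this illustrates the idea rather than the
actual patch:

/*
 * Sketch of the LRU-gated promotion idea: in the NUMA hint fault path, a
 * slow-tier page is promoted only if it has already aged into the active
 * LRU, so sporadically touched pages are filtered out before they
 * generate migration traffic.  numa_promotion_rate_limited() is a
 * hypothetical stand-in for the per-type (anon vs. file) rate limit
 * mentioned above.
 */
static bool should_promote_page(struct page *page, int target_nid)
{
	/* Only pages that survived LRU aging into the active list qualify. */
	if (!PageActive(page))
		return false;

	/* Separate anon/file budgets so file promotions cannot starve anons. */
	if (numa_promotion_rate_limited(target_nid, page_is_file_lru(page)))
		return false;

	return true;
}

Such a check would sit in the NUMA hint fault path (do_numa_page()) before
migrate_misplaced_page() is called for a slow-tier page.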