From: "Huang, Ying" <ying.huang@intel.com>
To: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: linux-mm@kvack.org, akpm@linux-foundation.org,
Wei Xu <weixugc@google.com>, Yang Shi <shy828301@gmail.com>,
Davidlohr Bueso <dave@stgolabs.net>,
Tim C Chen <tim.c.chen@intel.com>,
Michal Hocko <mhocko@kernel.org>,
Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
Hesham Almatary <hesham.almatary@huawei.com>,
Dave Hansen <dave.hansen@intel.com>,
Jonathan Cameron <Jonathan.Cameron@huawei.com>,
Alistair Popple <apopple@nvidia.com>,
Dan Williams <dan.j.williams@intel.com>,
Johannes Weiner <hannes@cmpxchg.org>,
jvgediya.oss@gmail.com, Jagdish Gediya <jvgediya@linux.ibm.com>
Subject: Re: [PATCH v9 1/8] mm/demotion: Add support for explicit memory tiers
Date: Mon, 18 Jul 2022 14:08:11 +0800
Message-ID: <87y1wr2bsk.fsf@yhuang6-desk2.ccr.corp.intel.com>
In-Reply-To: <87sfn2u0vy.fsf@linux.ibm.com> (Aneesh Kumar K. V.'s message of "Fri, 15 Jul 2022 15:57:13 +0530")

"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
>
> ....
>
>>
>>> You dropped the original sysfs interface patches from the series, but
>>> the kernel internal implementation is still for the original sysfs
>>> interface. For example, memory tier ID is for the original sysfs
>>> interface, not for the new proposed sysfs interface. So I suggest you
>>> implement with the new interface in mind. What do you think about
>>> the following design?
>>>
>>
>> Sorry I am not able to follow you here. This patchset completely drops
>> exposing memory tiers to userspace via sysfs. Instead it allows
>> creation of memory tiers with a specific tierID from within the
>> kernel/device driver. The default tierID is 200 and dax kmem creates
>> a memory tier with tierID 100.
>>
>>
>>> - Each NUMA node belongs to a memory type, and each memory type
>>>   corresponds to an "abstract distance", so each NUMA node corresponds to
>>>   a "distance". For simplicity, we can start with static distances, for
>>>   example, DRAM (default): 150, PMEM: 250. The distance of each NUMA
>>>   node can be recorded in a global array,
>>>
>>>     int node_distances[MAX_NUMNODES];
>>>
>>>   or, just
>>>
>>>     pgdat->distance
>>>
>>
>> I don't follow this. I guess you are trying to have a different design.
>> Would it be much easier if you can write this in the form of a patch?
>>
>>
>>> - Each memory tier corresponds to a range of distances, for example,
>>>   0-100, 100-200, 200-300, >300. We can start with static ranges too.
>>>
>>> - The core API of memory tier could be
>>>
>>>     struct memory_tier *find_create_memory_tier(int distance);
>>>
>>>   it will find the memory tier which covers "distance" in the memory
>>>   tier list, or create a new memory tier if not found.
>>>
>>
>> I was expecting this, i.e. how dax kmem maps "abstract distance" to a
>> memory tier, to be internal to dax kmem. At this point this patchset is
>> keeping all that for a future patchset.
>>
>
> This shows how I was expecting "abstract distance" to be integrated.
>
Thanks!

To make the first version as simple as possible, I think we can just use
some static "abstract distance" for dax_kmem, e.g., 250, because we
use it for PMEM only now. We can enhance dax_kmem later.

IMHO, we should get the core framework right first.

- A device driver should report the capability (or performance level) of
  the hardware to the memory tier core via abstract distance. This can
  be done via some global data structure (e.g. node_distances[]), at
  least in the first version.

- The memory tier core determines the mapping from the abstract distance
  to the memory tier via abstract distance ranges, and allocates the
  struct memory_tier when necessary. That is, the memory tier core, not
  the device driver, decides whether to allocate a new memory tier or
  reuse an existing one for a NUMA node.

- It's better to place the NUMA node in the correct memory tier in the
  first place. We should avoid placing the PMEM node in the default
  tier and then moving it to the correct memory tier. That is, device
  drivers should report the abstract distance before onlining NUMA
  nodes.
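
To make the three points above concrete, here is a rough userspace model
(not kernel code, and not a patch). Only node_distances[] and
find_create_memory_tier() come from this thread; set_node_distance(),
node_online(), the struct layout, the 0-as-unreported sentinel, and the
half-open range boundaries are assumptions made up for illustration.

```c
/* Userspace model only, not kernel code. */
#include <assert.h>
#include <stdlib.h>

#define MAX_NUMNODES 8
#define TIER_RANGE   100  /* assumed static width of each distance range */
#define TIER_LAST    300  /* everything >= 300 shares one tier */
#define DIST_DRAM    150  /* default distance from the example above */
#define DIST_PMEM    250

struct memory_tier {
	int range_start;          /* first distance covered by this tier */
	unsigned long nodemask;   /* one bit per NUMA node in this tier */
	struct memory_tier *next; /* list kept sorted by range_start */
};

/* 0 means "no driver has reported a distance" (assumption). */
static int node_distances[MAX_NUMNODES];
static struct memory_tier *memory_tiers;

/* Driver side (hypothetical helper): report before onlining the node. */
static void set_node_distance(int node, int distance)
{
	assert(node >= 0 && node < MAX_NUMNODES);
	node_distances[node] = distance;
}

/* Core side: find the tier covering "distance", or create it. */
static struct memory_tier *find_create_memory_tier(int distance)
{
	int start = (distance / TIER_RANGE) * TIER_RANGE;
	struct memory_tier **pos = &memory_tiers, *tier;

	if (start > TIER_LAST)
		start = TIER_LAST;

	while (*pos && (*pos)->range_start < start)
		pos = &(*pos)->next;
	if (*pos && (*pos)->range_start == start)
		return *pos;	/* reuse the existing tier */

	tier = calloc(1, sizeof(*tier));
	if (!tier)
		return NULL;
	tier->range_start = start;
	tier->next = *pos;
	*pos = tier;
	return tier;
}

/* Core side (hypothetical hook): called when a node comes online. */
static struct memory_tier *node_online(int node)
{
	struct memory_tier *tier;

	assert(node >= 0 && node < MAX_NUMNODES);
	if (!node_distances[node])
		node_distances[node] = DIST_DRAM;

	tier = find_create_memory_tier(node_distances[node]);
	if (tier)
		tier->nodemask |= 1UL << node;
	return tier;
}
```

In this model, a driver calling set_node_distance(1, DIST_PMEM) before
node_online(1) puts node 1 directly into the 200-300 tier, while a node
with no reported distance lands in the default 100-200 tier. The driver
never touches struct memory_tier; only the core decides whether to
allocate or reuse one.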

Please check my reply to Wei too about my other suggestions for the
first version.

Best Regards,
Huang, Ying