From: "Huang, Ying"
To: Aneesh Kumar K V
Cc: linux-mm@kvack.org, akpm@linux-foundation.org, Wei Xu, Yang Shi,
	Davidlohr Bueso, Tim C Chen, Michal Hocko, Linux Kernel Mailing List,
	Hesham Almatary, Dave Hansen, Jonathan Cameron, Alistair Popple,
	Dan Williams, Johannes Weiner, jvgediya.oss@gmail.com, Jagdish Gediya
Subject: Re: [PATCH v9 1/8] mm/demotion: Add support for explicit memory tiers
References: <20220714045351.434957-1-aneesh.kumar@linux.ibm.com>
	<20220714045351.434957-2-aneesh.kumar@linux.ibm.com>
	<87bktq4xs7.fsf@yhuang6-desk2.ccr.corp.intel.com>
	<3659f1bb-a82e-1aad-f297-808a2c17687d@linux.ibm.com>
	<87tu7e3o2h.fsf@yhuang6-desk2.ccr.corp.intel.com>
Date: Mon, 18 Jul 2022 16:55:36 +0800
In-Reply-To: (Aneesh Kumar K. V.'s message of "Mon, 18 Jul 2022 13:30:55 +0530")
Message-ID: <87r12i3ilz.fsf@yhuang6-desk2.ccr.corp.intel.com>

Aneesh Kumar K V writes:

> On 7/18/22 12:27 PM, Huang, Ying wrote:
>> Aneesh Kumar K V writes:
>>
>>> On 7/15/22 1:23 PM, Huang, Ying wrote:
>>
>> [snip]
>>
>>>>
>>>> You dropped the original sysfs interface patches from the series, but
>>>> the kernel internal implementation is still for the original sysfs
>>>> interface.  For example, memory tier ID is for the original sysfs
>>>> interface, not for the new proposed sysfs interface.  So I suggest you
>>>> implement with the new interface in mind.  What do you think about
>>>> the following design?
>>>>
>>>
>>> Sorry I am not able to follow you here. This patchset completely drops
>>> exposing memory tiers to userspace via sysfs. Instead it allows
>>> creation of memory tiers with specific tierID from within the kernel/device driver.
>>> Default tierID is 200 and dax kmem creates memory tier with tierID 100.
>>>
>>>
>>>> - Each NUMA node belongs to a memory type, and each memory type
>>>>   corresponds to an "abstract distance", so each NUMA node corresponds to
>>>>   a "distance".  For simplicity, we can start with static distances, for
>>>>   example, DRAM (default): 150, PMEM: 250.  The distance of each NUMA
>>>>   node can be recorded in a global array,
>>>>
>>>>     int node_distances[MAX_NUMNODES];
>>>>
>>>>   or, just
>>>>
>>>>     pgdat->distance
>>>>
>>>
>>> I don't follow this. I guess you are trying to have a different design.
>>> Would it be much easier if you can write this in the form of a patch?
>>
>> I wrote some pseudo code as follows to show my basic idea.
>>
>> #define MEMORY_TIER_ADISTANCE_DRAM	150
>> #define MEMORY_TIER_ADISTANCE_PMEM	250
>>
>> struct memory_tier {
>> 	/* abstract distance range covered by the memory tier */
>> 	int adistance_start;
>> 	int adistance_len;
>> 	struct list_head list;
>> 	nodemask_t nodemask;
>> };
>>
>> /* RCU list of memory tiers */
>> static LIST_HEAD(memory_tiers);
>>
>> /* abstract distance of each NUMA node */
>> int node_adistances[MAX_NUMNODES];
>>
>> struct memory_tier *find_create_memory_tier(int adistance)
>> {
>> 	struct memory_tier *tier;
>>
>> 	list_for_each_entry(tier, &memory_tiers, list) {
>> 		if (adistance >= tier->adistance_start &&
>> 		    adistance < tier->adistance_start + tier->adistance_len)
>> 			return tier;
>> 	}
>> 	/* allocate a new memory tier and return */
>> }
>>
>> void memory_tier_add_node(int nid)
>> {
>> 	int adistance;
>> 	struct memory_tier *tier;
>>
>> 	/* "?:" keeps the stored value; "||" would only yield 0 or 1 */
>> 	adistance = node_adistances[nid] ?: MEMORY_TIER_ADISTANCE_DRAM;
>> 	tier = find_create_memory_tier(adistance);
>> 	/* node_set() takes the nodemask itself, not a pointer to it */
>> 	node_set(nid, tier->nodemask);
>> 	/* setup demotion data structure, etc */
>> }
>>
>> static int __meminit migrate_on_reclaim_callback(struct notifier_block *self,
>> 						 unsigned long action, void *_arg)
>> {
>> 	struct memory_notify *arg = _arg;
>> 	int nid;
>>
>> 	nid = arg->status_change_nid;
>> 	if (nid < 0)
>> 		return notifier_from_errno(0);
>>
>> 	switch (action) {
>> 	case MEM_ONLINE:
>> 		memory_tier_add_node(nid);
>> 		break;
>> 	}
>>
>> 	return notifier_from_errno(0);
>> }
>>
>> /* kmem.c */
>> static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>> {
>> 	node_adistances[dev_dax->target_node] = MEMORY_TIER_ADISTANCE_PMEM;
>> 	/* add_memory_driver_managed() */
>> }
>>
>> [snip]
>>
>> Best Regards,
>> Huang, Ying
>
>
> Implementing that I ended up with the below. The difference is that
> adistance_len is not a memory tier property; instead it is a kernel
> parameter, like memory_tier_chunk_size, which can be tuned to create
> more memory tiers.

It's not yet decided how to represent the abstract distance range of a
memory tier.  perf_level_chunk_size or perf_level_granularity is another
possible solution.  But I don't think it should be a kernel parameter
for the first step.

> How about this? Not yet tested.
>
> struct memory_tier {
> 	struct list_head list;
> 	int id;

We don't need "id" for now, in fact.  So I suggest removing it.  We can
add it when we really need it.

> 	int perf_level;
> 	nodemask_t nodelist;
> };
>
> static LIST_HEAD(memory_tiers);
> static DEFINE_MUTEX(memory_tier_lock);
> static unsigned int default_memtier_perf_level = DEFAULT_MEMORY_TYPE_PERF;
> core_param(default_memory_tier_perf_level, default_memtier_perf_level, uint, 0644);
> static unsigned int memtier_perf_chunk_size = 150;
> core_param(memory_tier_perf_chunk, memtier_perf_chunk_size, uint, 0644);
>
> /*
>  * performance levels are grouped into memtiers each of chunk size
>  * memtier_perf_chunk
>  */
> static struct memory_tier *find_create_memory_tier(unsigned int perf_level)
> {
> 	bool found_slot = false;
> 	struct list_head *ent;
> 	struct memory_tier *memtier, *new_memtier;
> 	static int next_memtier_id = 0;
> 	/*
> 	 * zero is special in that it indicates uninitialized
> 	 * perf level by respective driver. Pick default memory
> 	 * tier perf level for that.
> 	 */
> 	if (!perf_level)
> 		perf_level = default_memtier_perf_level;
>
> 	lockdep_assert_held_once(&memory_tier_lock);
>
> 	list_for_each(ent, &memory_tiers) {
> 		memtier = list_entry(ent, struct memory_tier, list);
> 		if (perf_level >= memtier->perf_level &&
> 		    perf_level < memtier->perf_level + memtier_perf_chunk_size)
> 			return memtier;
> 		else if (perf_level < memtier->perf_level) {
> 			found_slot = true;
> 			break;
> 		}
> 	}
>
> 	new_memtier = kzalloc(sizeof(struct memory_tier), GFP_KERNEL);
> 	if (!new_memtier)
> 		return ERR_PTR(-ENOMEM);
>
> 	new_memtier->id = next_memtier_id++;
> 	/*
> 	 * rounddown() rather than ALIGN_DOWN(): ALIGN_DOWN() is only
> 	 * correct for power-of-two alignments and the chunk size (150
> 	 * by default) is not one.
> 	 */
> 	new_memtier->perf_level = rounddown(perf_level, memtier_perf_chunk_size);
> 	if (found_slot)
> 		list_add_tail(&new_memtier->list, ent);
> 	else
> 		list_add_tail(&new_memtier->list, &memory_tiers);
> 	return new_memtier;
> }
>
> static int __init memory_tier_init(void)
> {
> 	int node;
> 	struct memory_tier *memtier;
>
> 	/*
> 	 * Since this is early during boot, we could avoid
> 	 * holding memory_tier_lock. But keep it simple by
> 	 * holding locks. So we can add lock held debug checks
> 	 * in other functions.
> 	 */
> 	mutex_lock(&memory_tier_lock);
> 	memtier = find_create_memory_tier(default_memtier_perf_level);
> 	if (IS_ERR(memtier))
> 		panic("%s() failed to register memory tier: %ld\n",
> 		      __func__, PTR_ERR(memtier));
>
> 	/* CPU only nodes are not part of memory tiers. */
> 	memtier->nodelist = node_states[N_MEMORY];
>
> 	/*
> 	 * Nodes that are already online and don't have a
> 	 * perf level assigned get the default perf level.
> 	 */
> 	for_each_node_state(node, N_MEMORY) {
> 		struct node *node_property = node_devices[node];
>
> 		if (!node_property->perf_level)
> 			node_property->perf_level = default_memtier_perf_level;
> 	}
> 	mutex_unlock(&memory_tier_lock);
> 	return 0;
> }
> subsys_initcall(memory_tier_init);

I think that this can be a starting point for our future discussion and
review.  Thanks!

Best Regards,
Huang, Ying
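
For reference, the chunk-based find-or-create lookup discussed in this
thread can be modeled in user space.  The following is only a sketch
under stated assumptions, not kernel code: struct tier,
find_create_tier() and CHUNK_SIZE are hypothetical names, a plain
singly linked list stands in for list_head, and the chunk size of 150
is borrowed from the memtier_perf_chunk_size default quoted above.  It
shows a perf level being rounded down to its chunk start and a tier
being created, in sorted position, only when no tier covers that chunk.

```c
#include <stddef.h>
#include <stdlib.h>

/* Assumed granularity; mirrors the quoted memtier_perf_chunk_size default. */
#define CHUNK_SIZE 150u

/* Hypothetical user-space stand-in for struct memory_tier. */
struct tier {
	unsigned int start;	/* chunk-aligned lower bound of the covered range */
	struct tier *next;	/* list is kept sorted by ascending start */
};

/*
 * Return the tier covering perf_level, creating and linking a new one
 * in sorted position if none exists.  Returns NULL on allocation failure.
 */
static struct tier *find_create_tier(struct tier **head, unsigned int perf_level)
{
	/* Round down with a modulo so that non-power-of-two chunks work. */
	unsigned int start = perf_level - perf_level % CHUNK_SIZE;
	struct tier **link = head;
	struct tier *t;

	/* Skip every tier whose range lies entirely below the target chunk. */
	while (*link && (*link)->start < start)
		link = &(*link)->next;

	/* An existing tier already covers [start, start + CHUNK_SIZE). */
	if (*link && (*link)->start == start)
		return *link;

	t = malloc(sizeof(*t));
	if (!t)
		return NULL;
	t->start = start;
	t->next = *link;	/* insert before the first higher tier */
	*link = t;
	return t;
}
```

With this chunk size, perf levels 200 and 299 both round down to 150
and share one tier, while 100 rounds down to 0 and gets its own tier,
linked ahead of the first.  Locking and freeing are left out of the
sketch.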