From: "Huang, Ying" <ying.huang@intel.com>
To: Alistair Popple
Cc: Andrew Morton, "Aneesh Kumar K.V", Wei Xu, Dan Williams, Dave Hansen,
 Davidlohr Bueso, Johannes Weiner, Jonathan Cameron, Michal Hocko,
 Yang Shi, Rafael J Wysocki, Dave Jiang, linux-mm@kvack.org
Subject: Re: [PATCH RESEND 1/4] memory tiering: add abstract distance calculation algorithms management
Date: Wed, 26 Jul 2023 15:33:26 +0800
Message-ID: <878rb3xh2x.fsf@yhuang6-desk2.ccr.corp.intel.com>
In-Reply-To: <87sf9cxupz.fsf@nvdebian.thelocal> (Alistair Popple's message of "Tue, 25 Jul 2023 18:26:15 +1000")
References: <20230721012932.190742-1-ying.huang@intel.com>
 <20230721012932.190742-2-ying.huang@intel.com>
 <87r0owzqdc.fsf@nvdebian.thelocal>
 <87r0owy95t.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87sf9cxupz.fsf@nvdebian.thelocal>
Alistair Popple writes:

> "Huang, Ying" writes:
>
>> Hi, Alistair,
>>
>> Thanks a lot for the comments!
>>
>> Alistair Popple writes:
>>
>>> Huang Ying writes:
>>>
>>>> The abstract distance may be calculated by various drivers, such as
>>>> ACPI HMAT, CXL CDAT, etc., while it may be used by various code that
>>>> hot-adds memory nodes, such as dax/kmem. To decouple the algorithm
>>>> users from the providers, this patch implements a management
>>>> mechanism for abstract distance calculation algorithms. It provides
>>>> an interface for the providers to register their implementations,
>>>> and an interface for the users.
>>>
>>> I wonder if we need this level of decoupling though? It seems to me
>>> that it would be simpler and better for drivers to calculate the
>>> abstract distance directly themselves by calling the desired
>>> algorithm (e.g. ACPI HMAT) and pass this when creating the nodes,
>>> rather than having a notifier chain.
>>
>> Per my understanding, ACPI HMAT and memory device drivers (such as
>> dax/kmem) may belong to different subsystems (ACPI vs. dax). It's not
>> good to call functions across subsystems directly. So, I think it's
>> better to use a general subsystem, memory-tier.c, to decouple them.
>> If it turns out that a notifier chain is unnecessary, we can use
>> function pointers instead.
>>
>>> At the moment it seems we've only identified two possible algorithms
>>> (ACPI HMAT and CXL CDAT), and I don't think it would make sense for
>>> one of those to fall back to the other based on priority, so why not
>>> just have drivers call the correct algorithm directly?
>>
>> For example, we may have a system with PMEM (persistent memory:
>> Optane DCPMM, or AEP, or something else) in DIMM slots and CXL.mem
>> connected via a CXL link to a remote memory pool. We will need ACPI
>> HMAT for the PMEM and CXL CDAT for the CXL.mem. One way is to make
>> dax/kmem identify the type of each device and call the corresponding
>> algorithm.
>
> Yes, that is what I was thinking.
>
>> The other way (suggested by this series) is to make dax/kmem call a
>> notifier chain; then CXL CDAT or ACPI HMAT can identify the type of
>> device and calculate the distance if the type is correct for them. I
>> don't think it's good to make dax/kmem know every possible type of
>> memory device.
>
> Do we expect there to be lots of different types of memory devices
> sharing a common dax/kmem driver though?
> I must admit I'm coming from
> a GPU background, where we'd expect each type of device to have its
> own driver anyway, so I wasn't expecting different types of memory
> devices to be handled by the same driver.

Now, dax/kmem.c is used for

- PMEM (Optane DCPMM, or AEP)
- CXL.mem
- HBM (attached to the CPU)

I understand that for a CXL GPU driver it's OK to call some CXL CDAT
helper to identify the abstract distance of memory attached to the GPU,
because there are no cross-subsystem function calls. But it doesn't
look very clean to call from dax/kmem.c into CXL CDAT, because that is
a cross-subsystem function call.

>>>> Multiple algorithm implementations can cooperate via calculating
>>>> abstract distance for different memory nodes. The preference of
>>>> algorithm implementations can be specified via
>>>> priority (notifier_block.priority).
>>>
>>> How/what decides the priority though? That seems like something
>>> better decided by a device driver than the algorithm driver IMHO.
>>
>> Do we need a memory-device-driver-specific priority? Or do we just
>> share a common priority? For example, the priority of CXL CDAT is
>> always higher than that of ACPI HMAT? Or is it architecture specific?
>
> Ok, thanks. Having read the above I think the priority is
> unimportant. Algorithms can either return a distance along with
> NOTIFY_STOP_MASK if they can calculate a distance, or NOTIFY_DONE if
> they can't for a specific device.

Yes. In most cases, there are no overlaps among the algorithms.

>> And, I don't think that we are forced to use the general notifier
>> chain interface in all memory device drivers. If a memory device
>> driver has a better understanding of its memory device, it can use
>> another way to determine the abstract distance. For example, a CXL
>> memory device driver can identify the abstract distance by itself,
>> while other memory device drivers use the general notifier chain
>> interface at the same time.
>
> Whilst I personally would find that flexibility useful, I am
> concerned it means every driver will just end up divining its own
> distance rather than ensuring the data in HMAT/CDAT/etc. is correct.
> That would kind of defeat the purpose of it all then.

But we have no way to enforce that either.

--
Best Regards,
Huang, Ying