From: "Huang, Ying"
To: Alistair Popple
Cc: Andrew Morton, "Aneesh Kumar K . V", Wei Xu, Dan Williams,
 Dave Hansen, "Davidlohr Bueso", Johannes Weiner, "Jonathan Cameron",
 Michal Hocko, Yang Shi, Rafael J Wysocki, Dave Jiang
Subject: Re: [PATCH RESEND 1/4] memory tiering: add abstract distance calculation algorithms management
References: <20230721012932.190742-1-ying.huang@intel.com>
 <20230721012932.190742-2-ying.huang@intel.com>
 <87r0owzqdc.fsf@nvdebian.thelocal>
 <87r0owy95t.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87sf9cxupz.fsf@nvdebian.thelocal>
 <878rb3xh2x.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87351axbk6.fsf@nvdebian.thelocal>
 <87edkuvw6m.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87y1j2vvqw.fsf@nvdebian.thelocal>
 <87a5vhx664.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87lef0x23q.fsf@nvdebian.thelocal>
 <87r0oack40.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87cyzgwrys.fsf@nvdebian.thelocal>
 <87il98c8ms.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87edjwlzn7.fsf@nvdebian.thelocal>
 <875y57dhar.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87wmxnlfer.fsf@nvdebian.thelocal>
Date: Wed, 23 Aug 2023 13:56:03 +0800
In-Reply-To: <87wmxnlfer.fsf@nvdebian.thelocal> (Alistair Popple's message of
 "Tue, 22 Aug 2023 17:11:34 +1000")
Message-ID: <878ra2b8uk.fsf@yhuang6-desk2.ccr.corp.intel.com>

Alistair Popple writes:

> "Huang, Ying" writes:
>
>> Alistair Popple writes:
>>
>>> "Huang, Ying" writes:
>>>
>>>> Alistair Popple writes:
>>>>
>>>>> "Huang, Ying" writes:
>>>>>
>>>>>> Hi, Alistair,
>>>>>>
>>>>>> Sorry for the late response. Just came back from vacation.
>>>>>
>>>>> Ditto for this response :-)
>>>>>
>>>>> I see Andrew has taken this into mm-unstable though, so my bad for not
>>>>> getting around to following all this up sooner.
>>>>>
>>>>>> Alistair Popple writes:
>>>>>>
>>>>>>> "Huang, Ying" writes:
>>>>>>>
>>>>>>>> Alistair Popple writes:
>>>>>>>>
>>>>>>>>> "Huang, Ying" writes:
>>>>>>>>>
>>>>>>>>>> Alistair Popple writes:
>>>>>>>>>>
>>>>>>>>>>>>>> While other memory device drivers can use the general notifier chain
>>>>>>>>>>>>>> interface at the same time.
>>>>>>>>>>>
>>>>>>>>>>> How would that work in practice though? The abstract distance, as far as
>>>>>>>>>>> I can tell, doesn't have any meaning other than establishing preferences
>>>>>>>>>>> for memory demotion order. Therefore all calculations are relative to
>>>>>>>>>>> the rest of the calculations on the system. So if a driver does its own
>>>>>>>>>>> thing, how does it choose a sensible distance? IMHO the value here is in
>>>>>>>>>>> coordinating all that through a standard interface, whether that is HMAT
>>>>>>>>>>> or something else.
>>>>>>>>>>
>>>>>>>>>> Only if different algorithms follow the same basic principle. For
>>>>>>>>>> example, the abstract distance of the default DRAM nodes is fixed
>>>>>>>>>> (MEMTIER_ADISTANCE_DRAM). The abstract distance of a memory device is
>>>>>>>>>> in linear direct proportion to the memory latency and inversely
>>>>>>>>>> proportional to the memory bandwidth, using the memory latency and
>>>>>>>>>> bandwidth of the default DRAM nodes as the base.
>>>>>>>>>>
>>>>>>>>>> HMAT and CDAT report the raw memory latency and bandwidth. If there are
>>>>>>>>>> some other methods to report the raw memory latency and bandwidth, we
>>>>>>>>>> can use them too.
>>>>>>>>>
>>>>>>>>> Argh! So we could address my concerns by having drivers feed
>>>>>>>>> latency/bandwidth numbers into a standard calculation algorithm, right?
>>>>>>>>> I.e. rather than having drivers calculate abstract distance themselves, we
>>>>>>>>> have the notifier chains return the raw performance data from which the
>>>>>>>>> abstract distance is derived.
>>>>>>>>
>>>>>>>> Now, memory device drivers only need a general interface to get the
>>>>>>>> abstract distance from the NUMA node ID. In the future, if they need
>>>>>>>> more interfaces, we can add them. For example, the interface you
>>>>>>>> suggested above.
>>>>>>>
>>>>>>> Huh? Memory device drivers (i.e. dax/kmem.c) don't care about abstract
>>>>>>> distance; it's a meaningless number. The only reason they care about it
>>>>>>> is so they can pass it to alloc_memory_type():
>>>>>>>
>>>>>>> struct memory_dev_type *alloc_memory_type(int adistance)
>>>>>>>
>>>>>>> Instead, alloc_memory_type() should be taking bandwidth/latency numbers,
>>>>>>> and the calculation of abstract distance should be done there. That
>>>>>>> resolves the issue of how drivers are supposed to divine adistance,
>>>>>>> and also means that when CDAT is added we don't have to duplicate the
>>>>>>> calculation code.
>>>>>>
>>>>>> In the current design, the abstract distance is the key concept of
>>>>>> memory types and memory tiers. And it is used as the interface to
>>>>>> allocate memory types. This provides more flexibility than some other
>>>>>> interfaces (e.g. read/write bandwidth/latency).
>>>>>> For example, in the current
>>>>>> dax/kmem.c, if HMAT isn't available in the system, the default abstract
>>>>>> distance MEMTIER_DEFAULT_DAX_ADISTANCE is used. This is still useful
>>>>>> to support some systems now. On a system without HMAT/CDAT, it's
>>>>>> possible to calculate abstract distance from ACPI SLIT, although this is
>>>>>> quite limited. I'm not sure whether all systems will provide read/write
>>>>>> bandwidth/latency data for all memory devices.
>>>>>>
>>>>>> HMAT and CDAT or some other mechanisms may provide the read/write
>>>>>> bandwidth/latency data to be used to calculate abstract distance. For
>>>>>> them, we can provide a shared implementation in mm/memory-tiers.c to map
>>>>>> from read/write bandwidth/latency to the abstract distance. Can this
>>>>>> solve your concerns about the consistency among algorithms? If so, we
>>>>>> can do that when we add the second algorithm that needs it.
>>>>>
>>>>> I guess it would address my concerns if we did that now. I don't see why
>>>>> we need to wait for a second implementation for that though - the whole
>>>>> series seems to be built around adding a framework for supporting
>>>>> multiple algorithms even though only one exists. So I think we should
>>>>> support that fully, or simplify the whole thing, just assume the only
>>>>> thing that exists is HMAT, and get rid of the general interface until a
>>>>> second algorithm comes along.
>>>>
>>>> We will need a general interface even for one algorithm implementation,
>>>> because it's not good to make a dax subsystem driver (dax/kmem)
>>>> depend on an ACPI subsystem driver (acpi/hmat). We need some general
>>>> interface at the subsystem level (memory tier here) between them.
>>>
>>> I don't understand this argument. For a single algorithm it would be
>>> simpler to just define acpi_hmat_calculate_adistance() and a static
>>> inline version of it that returns -ENOENT when !CONFIG_ACPI than to add
>>> a layer of indirection through notifier blocks. That breaks any
>>> dependency on ACPI, and there's plenty of precedent for this approach in
>>> the kernel already.
>>
>> ACPI is a subsystem, so it's OK for dax/kmem to depend on CONFIG_ACPI.
>> But HMAT is a driver of the ACPI subsystem (controlled via
>> CONFIG_ACPI_HMAT). It's not good for a driver of the DAX subsystem
>> (dax/kmem) to depend on a *driver* of the ACPI subsystem.
>>
>> Yes, technically there's no hard wall to prevent this. But I think
>> that a good design should make drivers depend on subsystems or on drivers
>> of the same subsystem, NOT on drivers of other subsystems.
>
> Thanks, I wasn't really thinking of HMAT as an ACPI driver. I understand
> where you're coming from, but I really don't see the problem with using a
> static inline. It doesn't create dependencies (you could still use
> dax/kmem without ACPI) and results in smaller and easier to follow code.
>
> IMHO it's far more obvious that a call to acpi_hmat_calculate_adist()
> returns either a default if ACPI HMAT isn't configured or a calculated
> value than it is to figure out what notifiers may or may not be
> registered at runtime and what priority they may be called in from
> mt_calc_adistance().
>
> It appears you think that is a bad design, but I don't understand
> why. What does this approach give us that a simpler approach wouldn't?

Thinking about all this again, I finally admit that you are right. The
general interface is better mainly if there are multiple implementations
of the interface. In this series, we provide just one implementation:
HMAT.
And the second one, CDAT, will be implemented soon; it will use the same
method to translate from read/write bandwidth/latency to adistance. So, I
suggest that we:

- Keep the general interface (and notifier chain), for HMAT and the soon
  available CDAT.

- Move the code that translates from read/write bandwidth/latency to
  adistance into memory-tiers.c. It is used by HMAT now, will be used by
  CDAT soon, and can be used by other drivers as well.

What do you think about that?
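
To make the second point concrete, here is a rough sketch of what such a
shared helper in memory-tiers.c could look like. This is only my
illustration of the idea, not code from this series: the helper name, the
struct, and the baseline numbers are placeholders, and the kernel-specific
details are omitted.

/*
 * Illustration only: the names and constants below are made up.
 */
#include <errno.h>

/* Placeholder abstract distance assigned to default DRAM nodes. */
#define MEMTIER_ADISTANCE_DRAM	512

struct node_perf {
	unsigned int read_latency;	/* ns */
	unsigned int write_latency;	/* ns */
	unsigned int read_bandwidth;	/* MB/s */
	unsigned int write_bandwidth;	/* MB/s */
};

/* Performance of the default DRAM nodes, used as the reference point. */
static struct node_perf default_dram_perf = {
	.read_latency	 = 100,
	.write_latency	 = 100,
	.read_bandwidth	 = 20000,
	.write_bandwidth = 20000,
};

/*
 * Map raw read/write latency/bandwidth to an abstract distance:
 * proportional to latency and inversely proportional to bandwidth,
 * relative to the default DRAM nodes. Integer rounding and overflow
 * handling are ignored for brevity.
 */
static int mt_perf_to_adistance(struct node_perf *perf, int *adist)
{
	if (!perf->read_bandwidth || !perf->write_bandwidth)
		return -ENOENT;	/* no usable data, e.g. no HMAT/CDAT */

	*adist = MEMTIER_ADISTANCE_DRAM *
		(perf->read_latency + perf->write_latency) /
		(default_dram_perf.read_latency +
		 default_dram_perf.write_latency) *
		(default_dram_perf.read_bandwidth +
		 default_dram_perf.write_bandwidth) /
		(perf->read_bandwidth + perf->write_bandwidth);

	return 0;
}

HMAT (and later CDAT) would then only feed the raw numbers into such a
helper instead of computing adistance itself, which keeps the calculation
consistent across the different sources.

--
Best Regards,
Huang, Ying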