From: "Huang, Ying"
To: Alistair Popple
Cc: Andrew Morton, "Aneesh Kumar K . V", Wei Xu, Dan Williams, Dave Hansen,
 "Davidlohr Bueso", Johannes Weiner, "Jonathan Cameron", Michal Hocko,
 Yang Shi, Rafael J Wysocki, Dave Jiang
Subject: Re: [PATCH RESEND 1/4] memory tiering: add abstract distance calculation algorithms management
References: <20230721012932.190742-1-ying.huang@intel.com>
 <20230721012932.190742-2-ying.huang@intel.com>
 <87r0owzqdc.fsf@nvdebian.thelocal>
 <87r0owy95t.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87sf9cxupz.fsf@nvdebian.thelocal>
 <878rb3xh2x.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87351axbk6.fsf@nvdebian.thelocal>
 <87edkuvw6m.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87y1j2vvqw.fsf@nvdebian.thelocal>
 <87a5vhx664.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87lef0x23q.fsf@nvdebian.thelocal>
 <87r0oack40.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87cyzgwrys.fsf@nvdebian.thelocal>
 <87il98c8ms.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87edjwlzn7.fsf@nvdebian.thelocal>
Date: Tue, 22 Aug 2023 08:58:20 +0800
In-Reply-To: <87edjwlzn7.fsf@nvdebian.thelocal> (Alistair Popple's message of "Tue, 22 Aug 2023 09:52:43 +1000")
Message-ID: <875y57dhar.fsf@yhuang6-desk2.ccr.corp.intel.com>

Alistair Popple writes:

> "Huang, Ying" writes:
>
>> Alistair Popple writes:
>>
>>> "Huang, Ying" writes:
>>>
>>>> Hi, Alistair,
>>>>
>>>> Sorry for the late response.  I just came back from vacation.
>>>
>>> Ditto for this response :-)
>>>
>>> I see Andrew has taken this into mm-unstable though, so my bad for not
>>> getting around to following all this up sooner.
>>>
>>>> Alistair Popple writes:
>>>>
>>>>> "Huang, Ying" writes:
>>>>>
>>>>>> Alistair Popple writes:
>>>>>>
>>>>>>> "Huang, Ying" writes:
>>>>>>>
>>>>>>>> Alistair Popple writes:
>>>>>>>>
>>>>>>>>>>>> While other memory device drivers can use the general notifier chain
>>>>>>>>>>>> interface at the same time.
>>>>>>>>>
>>>>>>>>> How would that work in practice though? The abstract distance as far as
>>>>>>>>> I can tell doesn't have any meaning other than establishing preferences
>>>>>>>>> for memory demotion order. Therefore all calculations are relative to
>>>>>>>>> the rest of the calculations on the system. So if a driver does its own
>>>>>>>>> thing how does it choose a sensible distance? IMHO the value here is in
>>>>>>>>> coordinating all that through a standard interface, whether that is HMAT
>>>>>>>>> or something else.
>>>>>>>>
>>>>>>>> Only if different algorithms follow the same basic principle.  For
>>>>>>>> example, the abstract distance of default DRAM nodes is fixed
>>>>>>>> (MEMTIER_ADISTANCE_DRAM).  The abstract distance of a memory device is
>>>>>>>> directly proportional to the memory latency and inversely
>>>>>>>> proportional to the memory bandwidth, with the memory latency and
>>>>>>>> bandwidth of default DRAM nodes used as the base.
>>>>>>>>
>>>>>>>> HMAT and CDAT report the raw memory latency and bandwidth.  If there are
>>>>>>>> some other methods to report the raw memory latency and bandwidth, we
>>>>>>>> can use them too.
>>>>>>>
>>>>>>> Argh! So we could address my concerns by having drivers feed
>>>>>>> latency/bandwidth numbers into a standard calculation algorithm right?
>>>>>>> Ie. Rather than having drivers calculate abstract distance themselves we
>>>>>>> have the notifier chains return the raw performance data from which the
>>>>>>> abstract distance is derived.
>>>>>>
>>>>>> Now, memory device drivers only need a general interface to get the
>>>>>> abstract distance from the NUMA node ID.
>>>>>> In the future, if they need more interfaces, we can add them.  For
>>>>>> example, the interface you suggested above.
>>>>>
>>>>> Huh? Memory device drivers (ie. dax/kmem.c) don't care about abstract
>>>>> distance, it's a meaningless number. The only reason they care about it
>>>>> is so they can pass it to alloc_memory_type():
>>>>>
>>>>> struct memory_dev_type *alloc_memory_type(int adistance)
>>>>>
>>>>> Instead alloc_memory_type() should be taking bandwidth/latency numbers
>>>>> and the calculation of abstract distance should be done there. That
>>>>> resolves the issues about how drivers are supposed to divine adistance
>>>>> and also means that when CDAT is added we don't have to duplicate the
>>>>> calculation code.
>>>>
>>>> In the current design, the abstract distance is the key concept of
>>>> memory types and memory tiers.  And it is used as the interface to
>>>> allocate memory types.  This provides more flexibility than some other
>>>> interfaces (e.g. read/write bandwidth/latency).  For example, in the
>>>> current dax/kmem.c, if HMAT isn't available in the system, the default
>>>> abstract distance, MEMTIER_DEFAULT_DAX_ADISTANCE, is used.  This is
>>>> still useful to support some systems now.  On a system without
>>>> HMAT/CDAT, it's possible to calculate abstract distance from ACPI SLIT,
>>>> although this is quite limited.  I'm not sure whether all systems will
>>>> provide read/write bandwidth/latency data for all memory devices.
>>>>
>>>> HMAT and CDAT or some other mechanisms may provide the read/write
>>>> bandwidth/latency data to be used to calculate abstract distance.  For
>>>> them, we can provide a shared implementation in mm/memory-tiers.c to map
>>>> from read/write bandwidth/latency to the abstract distance.  Can this
>>>> solve your concerns about the consistency among algorithms?  If so, we
>>>> can do that when we add the second algorithm that needs that.
>>>
>>> I guess it would address my concerns if we did that now.
>>> I don't see why we need to wait for a second implementation for that
>>> though - the whole series seems to be built around adding a framework
>>> for supporting multiple algorithms even though only one exists. So I
>>> think we should support that fully, or simplify the whole thing and just
>>> assume the only thing that exists is HMAT and get rid of the general
>>> interface until a second algorithm comes along.
>>
>> We will need a general interface even for one algorithm implementation.
>> Because it's not good to make a dax subsystem driver (dax/kmem) depend
>> on an ACPI subsystem driver (acpi/hmat).  We need some general
>> interface at subsystem level (memory tier here) between them.
>
> I don't understand this argument. For a single algorithm it would be
> simpler to just define acpi_hmat_calculate_adistance() and a static
> inline version of it that returns -ENOENT when !CONFIG_ACPI than adding
> a layer of indirection through notifier blocks. That breaks any
> dependency on ACPI and there's plenty of precedent for this approach in
> the kernel already.

ACPI is a subsystem, so it's OK for dax/kmem to depend on CONFIG_ACPI.
But HMAT is a driver of the ACPI subsystem (controlled via
CONFIG_ACPI_HMAT).  It's not good for a driver of the DAX subsystem
(dax/kmem) to depend on a *driver* of the ACPI subsystem.

Yes.  Technically, there's no hard wall to prevent this.  But I think
that a good design should make drivers depend on subsystems or on
drivers of the same subsystem, NOT on drivers of other subsystems.

--
Best Regards,
Huang, Ying
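
To make the layering argument above concrete, here is a minimal userspace
sketch in C.  It is not kernel code and not the interface from the patch
series: every sketch_* name and both numeric values are made up for
illustration.  The point is only that the consumer (playing the role of
dax/kmem) calls into a subsystem-level registry (playing the role of the
memory tier layer) and never references the provider (playing the role of
an HMAT-like driver) directly.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical fallback value; stands in for a default DAX abstract distance. */
#define SKETCH_DEFAULT_DAX_ADISTANCE 1000

/* Callback type: fill *adist for the node and return 0, or a negative errno. */
typedef int (*sketch_adist_fn)(int node, int *adist);

/* ---- subsystem level: a tiny registry of calculation algorithms ---- */
#define SKETCH_MAX_ALGS 4
static sketch_adist_fn sketch_algs[SKETCH_MAX_ALGS];
static size_t sketch_nr_algs;

int sketch_register_adist_alg(sketch_adist_fn fn)
{
	if (fn == NULL)
		return -EINVAL;
	if (sketch_nr_algs == SKETCH_MAX_ALGS)
		return -ENOSPC;
	sketch_algs[sketch_nr_algs++] = fn;
	return 0;
}

/* Ask each registered algorithm in turn; the first one that succeeds wins. */
int sketch_calc_adistance(int node, int *adist)
{
	for (size_t i = 0; i < sketch_nr_algs; i++) {
		if (sketch_algs[i](node, adist) == 0)
			return 0;
	}
	return -ENOENT;
}

/* ---- provider: pretends to have performance data for node 1 only ---- */
int sketch_hmat_adistance(int node, int *adist)
{
	if (node != 1)
		return -ENOENT;
	*adist = 2000;	/* made-up value, not derived from real data */
	return 0;
}

/* ---- consumer: depends only on the subsystem-level function ---- */
int sketch_kmem_pick_adistance(int node)
{
	int adist;

	/* Fall back to the default when no algorithm knows this node. */
	if (sketch_calc_adistance(node, &adist) != 0)
		adist = SKETCH_DEFAULT_DAX_ADISTANCE;
	return adist;
}
```

With nothing registered, sketch_kmem_pick_adistance() returns the default
for every node.  After sketch_register_adist_alg(sketch_hmat_adistance),
node 1 gets the provider's value and other nodes still fall back.  The
alternative Alistair describes would instead have the consumer call the
provider's function by name, with a static inline stub returning -ENOENT
when the provider is not built.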