From: "Huang, Ying"
To: "Aneesh Kumar K.V"
Cc: linux-mm@kvack.org, akpm@linux-foundation.org, Wei Xu, Yang Shi, Davidlohr Bueso, Tim C Chen, Michal Hocko, Linux Kernel Mailing List, Hesham Almatary, Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams, Johannes Weiner, jvgediya.oss@gmail.com, Jagdish Gediya
Subject: Re: [PATCH v9 1/8] mm/demotion: Add support for explicit memory tiers
References: <20220714045351.434957-1-aneesh.kumar@linux.ibm.com> <20220714045351.434957-2-aneesh.kumar@linux.ibm.com> <87bktq4xs7.fsf@yhuang6-desk2.ccr.corp.intel.com> <3659f1bb-a82e-1aad-f297-808a2c17687d@linux.ibm.com> <87sfn2u0vy.fsf@linux.ibm.com>
Date: Mon, 18 Jul 2022 14:08:11 +0800
In-Reply-To: <87sfn2u0vy.fsf@linux.ibm.com> (Aneesh Kumar K.V.'s message of "Fri, 15 Jul 2022 15:57:13 +0530")
Message-ID: <87y1wr2bsk.fsf@yhuang6-desk2.ccr.corp.intel.com>

"Aneesh Kumar K.V" writes: > Aneesh Kumar K V writes: > > .... > >> >>> You dropped the original sysfs interface patches from the series, but >>> the kernel internal implementation is still for the original sysfs >>> interface. For example, memory tier ID is for the original sysfs >>> interface, not for the new proposed sysfs interface. So I suggest you >>> to implement with the new interface in mind. What do you think about >>> the following design? >>> >> >> Sorry I am not able to follow you here. This patchset completely drops >> exposing memory tiers to userspace via sysfs. Instead it allow >> creation of memory tiers with specific tierID from within the kernel/device driver. >> Default tierID is 200 and dax kmem creates memory tier with tierID 100. >> >> >>> - Each NUMA node belongs to a memory type, and each memory type >>> corresponds to a "abstract distance", so each NUMA node corresonds to >>> a "distance". For simplicity, we can start with static distances, for >>> example, DRAM (default): 150, PMEM: 250. The distance of each NUMA >>> node can be recorded in a global array, >>> >>> int node_distances[MAX_NUMNODES]; >>> >>> or, just >>> >>> pgdat->distance >>> >> >> I don't follow this. I guess you are trying to have a different design. >> Would it be much easier if you can write this in the form of a patch? >> >> >>> - Each memory tier corresponds to a range of distance, for example, >>> 0-100, 100-200, 200-300, >300, we can start with static ranges too. >>> >>> - The core API of memory tier could be >>> >>> struct memory_tier *find_create_memory_tier(int distance); >>> >>> it will find the memory tier which covers "distance" in the memory >>> tier list, or create a new memory tier if not found. >>> >> >> I was expecting this to be internal to dax kmem. How dax kmem maps >> "abstract distance" to a memory tier. At this point this patchset is >> keeping all that for a future patchset. 
>>
>
> This shows how I was expecting "abstract distance" to be integrated.
>

Thanks!

To make the first version as simple as possible, I think we can just use
a static "abstract distance" for dax_kmem, e.g., 250, because we use it
for PMEM only now.  We can enhance dax_kmem later.

IMHO, we should get the core framework right first.

- A device driver should report the capability (or performance level) of
  the hardware to the memory tier core via abstract distance.  This can
  be done via some global data structure (e.g. node_distances[]), at
  least in the first version.

- The memory tier core determines the mapping from abstract distance to
  memory tier via abstract distance ranges, and allocates the struct
  memory_tier when necessary.  That is, the memory tier core, not the
  device drivers, decides whether to allocate a new memory tier or reuse
  an existing one for a NUMA node.

- It's better to place the NUMA node in the correct memory tier in the
  first place.  We should avoid placing the PMEM node in the default
  tier and then moving it to the correct memory tier.  That is, device
  drivers should report the abstract distance before onlining NUMA
  nodes.

Please check my reply to Wei too for my other suggestions for the first
version.

Best Regards,
Huang, Ying
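
P.S. To make the three points above concrete, here is a rough userspace
sketch, not kernel code and not a patch.  Only node_distances[],
find_create_memory_tier(), and the 150/250 distances come from this
thread.  Everything else is an assumption made for illustration: the
helper names set_node_distance() and online_node(), the struct layout,
the half-open reading of the ranges ([0,100), [100,200), [200,300),
and 300 and above), and treating an unreported distance as DRAM.

```c
/* Userspace sketch only; names other than node_distances[] and
 * find_create_memory_tier() are hypothetical. */
#include <assert.h>
#include <stdlib.h>

#define MAX_NUMNODES    8    /* small value for illustration */
#define DISTANCE_DRAM   150  /* static distances suggested in the thread */
#define DISTANCE_PMEM   250
#define TIER_RANGE_SIZE 100  /* assumed width of each static range */
#define TIER_RANGE_MAX  300  /* assumed: 300 and above share one tier */

struct memory_tier {
	int start;                 /* lowest distance covered by this tier */
	unsigned int nodes;        /* bitmask of node IDs placed in the tier */
	struct memory_tier *next;  /* list is kept sorted by start */
};

static int node_distances[MAX_NUMNODES];  /* 0 means "not reported" */
static struct memory_tier *memory_tiers;

/* Hypothetical driver hook: report a distance before onlining a node. */
static void set_node_distance(int nid, int distance)
{
	assert(nid >= 0 && nid < MAX_NUMNODES);
	assert(distance > 0);
	node_distances[nid] = distance;
}

/* Map a distance to the start of the static range that covers it. */
static int distance_to_range_start(int distance)
{
	int start = distance / TIER_RANGE_SIZE * TIER_RANGE_SIZE;

	return start > TIER_RANGE_MAX ? TIER_RANGE_MAX : start;
}

/* Find the tier covering "distance", or create it if it is missing. */
struct memory_tier *find_create_memory_tier(int distance)
{
	int start = distance_to_range_start(distance);
	struct memory_tier **link = &memory_tiers;
	struct memory_tier *tier;

	while (*link && (*link)->start < start)
		link = &(*link)->next;
	if (*link && (*link)->start == start)
		return *link;              /* reuse the existing tier */

	tier = calloc(1, sizeof(*tier));
	if (!tier)
		return NULL;
	tier->start = start;
	tier->next = *link;                /* insert in sorted position */
	*link = tier;
	return tier;
}

/* Hypothetical onlining path: the core, not the driver, picks the tier. */
static struct memory_tier *online_node(int nid)
{
	struct memory_tier *tier;
	int distance;

	assert(nid >= 0 && nid < MAX_NUMNODES);
	/* Assumption: nodes with no reported distance are treated as DRAM. */
	distance = node_distances[nid] ? node_distances[nid] : DISTANCE_DRAM;
	tier = find_create_memory_tier(distance);
	if (tier)
		tier->nodes |= 1u << nid;
	return tier;
}
```

With this shape, a driver such as dax_kmem would only call something
like set_node_distance(nid, DISTANCE_PMEM) before the node is onlined,
and the PMEM node would land in the 200-300 tier directly instead of
being moved out of the default tier afterwards.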