From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 21B17C636D7 for ; Fri, 10 Feb 2023 22:19:21 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 9B2356B017B; Fri, 10 Feb 2023 17:19:20 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 9623A6B0199; Fri, 10 Feb 2023 17:19:20 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 829EB6B019B; Fri, 10 Feb 2023 17:19:20 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0012.hostedemail.com [216.40.44.12]) by kanga.kvack.org (Postfix) with ESMTP id 725686B017B for ; Fri, 10 Feb 2023 17:19:20 -0500 (EST) Received: from smtpin03.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay07.hostedemail.com (Postfix) with ESMTP id 336771601EF for ; Fri, 10 Feb 2023 22:19:20 +0000 (UTC) X-FDA: 80452799280.03.6C509DF Received: from mga02.intel.com (mga02.intel.com [134.134.136.20]) by imf09.hostedemail.com (Postfix) with ESMTP id CC982140012 for ; Fri, 10 Feb 2023 22:19:17 +0000 (UTC) Authentication-Results: imf09.hostedemail.com; dkim=pass header.d=intel.com header.s=Intel header.b=dZMVu8wR; spf=pass (imf09.hostedemail.com: domain of dave.jiang@intel.com designates 134.134.136.20 as permitted sender) smtp.mailfrom=dave.jiang@intel.com; dmarc=pass (policy=none) header.from=intel.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1676067558; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=T1Xyoys5IBINbXyAYHGshdLNgjI4ywqhZzOcm0Bu+90=; b=joQmv1Rr4s03ta5zTEe7a+bm0BPH9Iz7z4plfN1OAs79w+L4IZYIit5uYwuI9yVhYVMnz+ Fmilg3FK4j6LWKsa/qPZX8kYE02H4suRKyow3I6h72DOTCrgsXR/lfVg/4eII1rY66DXR4 VH+5a4iecY0yWfbL6XJF3UK7lS/bU5g= ARC-Authentication-Results: i=1; imf09.hostedemail.com; dkim=pass header.d=intel.com header.s=Intel header.b=dZMVu8wR; spf=pass (imf09.hostedemail.com: domain of dave.jiang@intel.com designates 134.134.136.20 as permitted sender) smtp.mailfrom=dave.jiang@intel.com; dmarc=pass (policy=none) header.from=intel.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1676067558; a=rsa-sha256; cv=none; b=E7gAIEmaEPsoVS5uhvmDatmu2F8We82ftoNNmIj0al7Xd0LJnZQ2bvdjrcq4ppLoCPbIFf WABr4W/NYeontWuZweC6edxfT4ccAP6SXRpfDHU8T7QUT3tpjJiy7E+Uc0QPc6BVImZB4i edSeISqPzcmD7BirOWoVZTxG5aabXmw= DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1676067557; x=1707603557; h=message-id:date:mime-version:subject:to:cc:references: from:in-reply-to:content-transfer-encoding; bh=5QEIEhNAp3nmGGpMMbKD0XHNWNNpBBL7AkmI44tqVdc=; b=dZMVu8wRzkguHueJuaw0xvB4f527cgTH+5/1FFEXokmHGs3rbjdArOwA iU9uc44wrvPeYeBL9gq7Qq6XhnDuyjNenEiHW3sfZtUBI9Nhjj/yM6LvY 6/1xjcIQHdYejNr8fBHaWCS5Nx5ezG1/lWlfVFmEAJVBMZchzuB+Yyb0c Veg+f4R9KdSxWyvJWdlVm3eAYzBnX4gdexHpi2e843NMhNpUrFA3ea+pj UjUQPu9ARgnG/lpZfKklzxAEOTPJpxTUetLGR1I+PY9WXI4DnNSlGciXJ aPTEF50jfcLGQkWqef2nKN9hzeZTnSPtZBDTiAu7Y0ymLkeEmebCjHxxG A==; X-IronPort-AV: E=McAfee;i="6500,9779,10617"; a="318558880" X-IronPort-AV: E=Sophos;i="5.97,287,1669104000"; d="scan'208";a="318558880" Received: from fmsmga004.fm.intel.com ([10.253.24.48]) by orsmga101.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 10 Feb 2023 14:19:16 -0800 X-IronPort-AV: E=McAfee;i="6500,9779,10617"; a="736875933" X-IronPort-AV: E=Sophos;i="5.97,287,1669104000"; d="scan'208";a="736875933" Received: from djiang5-mobl3.amr.corp.intel.com (HELO [10.213.190.133]) ([10.213.190.133]) by fmsmga004-auth.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 10 Feb 2023 14:19:12 -0800 Message-ID: <065ecb4a-c71e-b293-cc01-599e5ded50da@intel.com> Date: Fri, 10 Feb 2023 15:19:11 -0700 MIME-Version: 1.0 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Firefox/102.0 Thunderbird/102.6.0 Subject: Re: [PATCH v2 19/20] dax: Assign RAM regions to memory-hotplug by default Content-Language: en-US To: Dan Williams , linux-cxl@vger.kernel.org Cc: Michal Hocko , David Hildenbrand , Dave Hansen , Gregory Price , Fan Ni , vishal.l.verma@intel.com, linux-mm@kvack.org, linux-acpi@vger.kernel.org References: <167601992097.1924368.18291887895351917895.stgit@dwillia2-xfh.jf.intel.com> <167602003336.1924368.6809503401422267885.stgit@dwillia2-xfh.jf.intel.com> From: Dave Jiang In-Reply-To: <167602003336.1924368.6809503401422267885.stgit@dwillia2-xfh.jf.intel.com> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit X-Rspamd-Server: rspam05 X-Rspamd-Queue-Id: CC982140012 X-Stat-Signature: tifr1bqcy876mefh91giumn9hzwo7ug7 X-Rspam-User: X-HE-Tag: 1676067557-934895 X-HE-Meta: U2FsdGVkX19we4ex+ktP41kKAhreLsiUvq21cgxg4ZmyIKXWXHhC2uiB6FzzYRsxkisfopXP69D70UTciwcl842HT1In9WJPVgIqtHBrZp7N9rXaVt7/qgIxzFuUXdfdf+jZWX4D5DdlAU8qEvG2TwyqcRf/rGYrK3yn3LksyRVbsKvJFwomHtJHeBJf1kNTE1OGc7XmWBMfjKqc+iMR5dDE9e5c/2TGJtmGqioPFDnxLQ0ah/X5xABTDuAJHB03nnB8NsZTSNsqnlWR1FzcGNdIRPbiJa91xXtNHXvYk1ZL5B2tk7XfXBxcV5pdDcUXfj7yAru/hThF9m3IRfX5hIrNRJ2QUfDM/nkgGZ2JjqN2yBh52aU+7aHgIKP4q1LdZunIOOM8rVCJZ7kTaPEw07wP9mi1R6uHdPM71vPhPcUo3IPgqgD9l+hF3rYiYrDpn22ZDeG/vG18J+D8L3F58kcp0kRZQewezn/O5/YdeTUdEMeTw+lsfcYUz/EVmR/TEakNGAfgbWJpbJ2iMVOhwvHYZJEPkvrSjFjoDw/uIx3xbSAYURtt0197ckYCjJjremIpIvUqRTqFj3ZwVhlEYocLTPUGLl+lpL1vzAETVX7EaVU3JgZgmkUp70UcNk0XV+3R8FuSKKRJpysc7ZrX17AngBBWLO382jRz6HcIW+H57bfagqEYc32WU+UTcSJlK1f7o3PGCCZAoVKvCJoNgk+MxpR4yl74K4gR8x4Vb8PUtvqfRA+KrepTLyxY9aJGJs5Oa1t5kFT4upALhN4lAJM9+G08R3YljT/lN5azBC9eAw9TvZgJOM6tw/Qqkc9JBF23aacXG8e+rkS+Pq2nh2g4d29h4rlSXpb/X8sOck+ZYSkV5I0KZ/ROitafRjvQ9tvnBXj45dfI3OWzJm0x81Gb6fzDnlAROgn3MVmTZn1a5TKmKXIjGa/ZiAhZZ2Q/pT2NvOHDQejE+7hwA/k srQas74U /6eejQSmJlkDG0AY8jPA1AZaaWGc4fiND9pOzy1kXpVY2s2vF/cyCBWZNAxkxOtmcMGnCqp8k3s8cj4lmkESgArHcIv2jwVrfx50Vi4jXFcIUZL3knaX1kVFVvt41euSJ4UQarurknmI7QrACI3jqG9bkl/Fnr4rT3loXBWEMwSfihxi/tXGh19WVGJUL2H/tEiug+nHLm+4lTj9DQQitFOOxg/8W7yfUm6/iGCQ7MIUbrUOIPn+9LNmAaP8SuqNVAv9s5dz1GoCY3Xo/5J/A+K1pXoxoth7M7/sLhJwtk2mQrwzzfKCxtbdl+cvxLsGdBKCr8HjftPoNxdECUp0c+nehGwViM3BKAHoGf0zvEG73tAhrozcAn5zj4+RKR1EmHw9kBnbA0f+bLdZeM2y3QmX/ERi1hJjAAw+a X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On 2/10/23 2:07 AM, Dan Williams wrote: > The default mode for device-dax instances is backwards for RAM-regions > as evidenced by the fact that it tends to catch end users by surprise. > "Where is my memory?". Recall that platforms are increasingly shipping > with performance-differentiated memory pools beyond typical DRAM and > NUMA effects. This includes HBM (high-bandwidth-memory) and CXL (dynamic > interleave, varied media types, and future fabric attached > possibilities). > > For this reason the EFI_MEMORY_SP (EFI Special Purpose Memory => Linux > 'Soft Reserved') attribute is expected to be applied to all memory-pools > that are not the general purpose pool. This designation gives an > Operating System a chance to defer usage of a memory pool until later in > the boot process where its performance properties can be interrogated > and administrator policy can be applied. > > 'Soft Reserved' memory can be anything from too limited and precious to > be part of the general purpose pool (HBM), too slow to host hot kernel > data structures (some PMEM media), or anything in between. However, in > the absence of an explicit policy, the memory should at least be made > usable by default. The current device-dax default hides all > non-general-purpose memory behind a device interface. > > The expectation is that the distribution of users that want the memory > online by default vs device-dedicated-access by default follows the > Pareto principle. A small number of enlightened users may want to do > userspace memory management through a device, but general users just > want the kernel to make the memory available with an option to get more > advanced later. > > Arrange for all device-dax instances not backed by PMEM to default to > attaching to the dax_kmem driver. From there the baseline memory hotplug > policy (CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE / memhp_default_state=) > gates whether the memory comes online or stays offline. Where, if it > stays offline, it can be reliably converted back to device-mode where it > can be partitioned, or fronted by a userspace allocator. > > So, if someone wants device-dax instances for their 'Soft Reserved' > memory: > > 1/ Build a kernel with CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=n or boot > with memhp_default_state=offline, or roll the dice and hope that the > kernel has not pinned a page in that memory before step 2. > > 2/ Write a udev rule to convert the target dax device(s) from > 'system-ram' mode to 'devdax' mode: > > daxctl reconfigure-device $dax -m devdax -f > > Cc: Michal Hocko > Cc: David Hildenbrand > Cc: Dave Hansen > Reviewed-by: Gregory Price > Tested-by: Fan Ni > Link: https://lore.kernel.org/r/167564544513.847146.4645646177864365755.stgit@dwillia2-xfh.jf.intel.com > Signed-off-by: Dan Williams Reviewed-by: Dave Jiang > --- > drivers/dax/Kconfig | 2 +- > drivers/dax/bus.c | 53 ++++++++++++++++++++--------------------------- > drivers/dax/bus.h | 12 +++++++++-- > drivers/dax/device.c | 3 +-- > drivers/dax/hmem/hmem.c | 12 ++++++++++- > drivers/dax/kmem.c | 1 + > 6 files changed, 46 insertions(+), 37 deletions(-) > > diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig > index d13c889c2a64..1163eb62e5f6 100644 > --- a/drivers/dax/Kconfig > +++ b/drivers/dax/Kconfig > @@ -50,7 +50,7 @@ config DEV_DAX_HMEM_DEVICES > def_bool y > > config DEV_DAX_KMEM > - tristate "KMEM DAX: volatile-use of persistent memory" > + tristate "KMEM DAX: map dax-devices as System-RAM" > default DEV_DAX > depends on DEV_DAX > depends on MEMORY_HOTPLUG # for add_memory() and friends > diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c > index 1dad813ee4a6..012d576004e9 100644 > --- a/drivers/dax/bus.c > +++ b/drivers/dax/bus.c > @@ -56,6 +56,25 @@ static int dax_match_id(struct dax_device_driver *dax_drv, struct device *dev) > return match; > } > > +static int dax_match_type(struct dax_device_driver *dax_drv, struct device *dev) > +{ > + enum dax_driver_type type = DAXDRV_DEVICE_TYPE; > + struct dev_dax *dev_dax = to_dev_dax(dev); > + > + if (dev_dax->region->res.flags & IORESOURCE_DAX_KMEM) > + type = DAXDRV_KMEM_TYPE; > + > + if (dax_drv->type == type) > + return 1; > + > + /* default to device mode if dax_kmem is disabled */ > + if (dax_drv->type == DAXDRV_DEVICE_TYPE && > + !IS_ENABLED(CONFIG_DEV_DAX_KMEM)) > + return 1; > + > + return 0; > +} > + > enum id_action { > ID_REMOVE, > ID_ADD, > @@ -216,14 +235,9 @@ static int dax_bus_match(struct device *dev, struct device_driver *drv) > { > struct dax_device_driver *dax_drv = to_dax_drv(drv); > > - /* > - * All but the 'device-dax' driver, which has 'match_always' > - * set, requires an exact id match. > - */ > - if (dax_drv->match_always) > + if (dax_match_id(dax_drv, dev)) > return 1; > - > - return dax_match_id(dax_drv, dev); > + return dax_match_type(dax_drv, dev); > } > > /* > @@ -1413,13 +1427,10 @@ struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data) > } > EXPORT_SYMBOL_GPL(devm_create_dev_dax); > > -static int match_always_count; > - > int __dax_driver_register(struct dax_device_driver *dax_drv, > struct module *module, const char *mod_name) > { > struct device_driver *drv = &dax_drv->drv; > - int rc = 0; > > /* > * dax_bus_probe() calls dax_drv->probe() unconditionally. > @@ -1434,26 +1445,7 @@ int __dax_driver_register(struct dax_device_driver *dax_drv, > drv->mod_name = mod_name; > drv->bus = &dax_bus_type; > > - /* there can only be one default driver */ > - mutex_lock(&dax_bus_lock); > - match_always_count += dax_drv->match_always; > - if (match_always_count > 1) { > - match_always_count--; > - WARN_ON(1); > - rc = -EINVAL; > - } > - mutex_unlock(&dax_bus_lock); > - if (rc) > - return rc; > - > - rc = driver_register(drv); > - if (rc && dax_drv->match_always) { > - mutex_lock(&dax_bus_lock); > - match_always_count -= dax_drv->match_always; > - mutex_unlock(&dax_bus_lock); > - } > - > - return rc; > + return driver_register(drv); > } > EXPORT_SYMBOL_GPL(__dax_driver_register); > > @@ -1463,7 +1455,6 @@ void dax_driver_unregister(struct dax_device_driver *dax_drv) > struct dax_id *dax_id, *_id; > > mutex_lock(&dax_bus_lock); > - match_always_count -= dax_drv->match_always; > list_for_each_entry_safe(dax_id, _id, &dax_drv->ids, list) { > list_del(&dax_id->list); > kfree(dax_id); > diff --git a/drivers/dax/bus.h b/drivers/dax/bus.h > index fbb940293d6d..8cd79ab34292 100644 > --- a/drivers/dax/bus.h > +++ b/drivers/dax/bus.h > @@ -11,7 +11,10 @@ struct dax_device; > struct dax_region; > void dax_region_put(struct dax_region *dax_region); > > -#define IORESOURCE_DAX_STATIC (1UL << 0) > +/* dax bus specific ioresource flags */ > +#define IORESOURCE_DAX_STATIC BIT(0) > +#define IORESOURCE_DAX_KMEM BIT(1) > + > struct dax_region *alloc_dax_region(struct device *parent, int region_id, > struct range *range, int target_node, unsigned int align, > unsigned long flags); > @@ -25,10 +28,15 @@ struct dev_dax_data { > > struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data); > > +enum dax_driver_type { > + DAXDRV_KMEM_TYPE, > + DAXDRV_DEVICE_TYPE, > +}; > + > struct dax_device_driver { > struct device_driver drv; > struct list_head ids; > - int match_always; > + enum dax_driver_type type; > int (*probe)(struct dev_dax *dev); > void (*remove)(struct dev_dax *dev); > }; > diff --git a/drivers/dax/device.c b/drivers/dax/device.c > index 5494d745ced5..ecdff79e31f2 100644 > --- a/drivers/dax/device.c > +++ b/drivers/dax/device.c > @@ -475,8 +475,7 @@ EXPORT_SYMBOL_GPL(dev_dax_probe); > > static struct dax_device_driver device_dax_driver = { > .probe = dev_dax_probe, > - /* all probe actions are unwound by devm, so .remove isn't necessary */ > - .match_always = 1, > + .type = DAXDRV_DEVICE_TYPE, > }; > > static int __init dax_init(void) > diff --git a/drivers/dax/hmem/hmem.c b/drivers/dax/hmem/hmem.c > index e7bdff3132fa..5ec08f9f8a57 100644 > --- a/drivers/dax/hmem/hmem.c > +++ b/drivers/dax/hmem/hmem.c > @@ -11,15 +11,25 @@ module_param_named(region_idle, region_idle, bool, 0644); > > static int dax_hmem_probe(struct platform_device *pdev) > { > + unsigned long flags = IORESOURCE_DAX_KMEM; > struct device *dev = &pdev->dev; > struct dax_region *dax_region; > struct memregion_info *mri; > struct dev_dax_data data; > struct dev_dax *dev_dax; > > + /* > + * @region_idle == true indicates that an administrative agent > + * wants to manipulate the range partitioning before the devices > + * are created, so do not send them to the dax_kmem driver by > + * default. > + */ > + if (region_idle) > + flags = 0; > + > mri = dev->platform_data; > dax_region = alloc_dax_region(dev, pdev->id, &mri->range, > - mri->target_node, PMD_SIZE, 0); > + mri->target_node, PMD_SIZE, flags); > if (!dax_region) > return -ENOMEM; > > diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c > index 4852a2dbdb27..918d01d3fbaa 100644 > --- a/drivers/dax/kmem.c > +++ b/drivers/dax/kmem.c > @@ -239,6 +239,7 @@ static void dev_dax_kmem_remove(struct dev_dax *dev_dax) > static struct dax_device_driver device_dax_kmem_driver = { > .probe = dev_dax_kmem_probe, > .remove = dev_dax_kmem_remove, > + .type = DAXDRV_KMEM_TYPE, > }; > > static int __init dax_kmem_init(void) >