From: "Huang, Ying"
To: Aneesh Kumar K V
Cc: linux-mm@kvack.org, akpm@linux-foundation.org, Wei Xu, Yang Shi,
 Davidlohr Bueso, Tim C Chen, Michal Hocko, Linux Kernel Mailing List,
 Hesham Almatary, Dave Hansen, Jonathan Cameron, Alistair Popple,
 Dan Williams, Johannes Weiner, jvgediya.oss@gmail.com
Subject: Re: [PATCH v11 4/8] mm/demotion/dax/kmem: Set node's abstract distance to MEMTIER_ADISTANCE_PMEM
References: <20220728190436.858458-1-aneesh.kumar@linux.ibm.com>
 <20220728190436.858458-5-aneesh.kumar@linux.ibm.com>
 <875yjgmocg.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87bkt8s7w9.fsf@linux.ibm.com>
 <87k07slnt7.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87tu6wk0q5.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <826fbdbc-219f-8f4a-7373-41c718287533@linux.ibm.com>
 <87les8jwpu.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <1aba0c44-b096-8c75-8086-62d3cffc08b3@linux.ibm.com>
 <87h72wjv27.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <394c0599-2dc0-0303-cd86-bdd2d265d1ee@linux.ibm.com>
Date: Tue, 02 Aug 2022 09:58:28 +0800
In-Reply-To: <394c0599-2dc0-0303-cd86-bdd2d265d1ee@linux.ibm.com> (Aneesh Kumar K. V.'s message of "Mon, 1 Aug 2022 13:11:11 +0530")
Message-ID: <878ro7jtiz.fsf@yhuang6-desk2.ccr.corp.intel.com>

Aneesh Kumar K V writes:

> On 8/1/22 12:43 PM, Huang, Ying wrote:
>> Aneesh Kumar K V writes:
>>
>>> On 8/1/22 12:07 PM, Huang, Ying wrote:
>>>> Aneesh Kumar K V writes:
>>>>
>>>>> On 8/1/22 10:40 AM, Huang, Ying wrote:
>>>>>> Aneesh Kumar K V writes:
>>>>>>
>>>>>>> On 8/1/22 7:36 AM, Huang, Ying wrote:
>>>>>>>> "Aneesh Kumar K.V" writes:
>>>>>>>>
>>>>>>>>> "Huang, Ying" writes:
>>>>>>>>>
>>>>>>>>>> "Aneesh Kumar K.V" writes:
>>>
>>> ....
>>>
>>>>>>>
>>>>>>> With the module unload, it is kind of force removing the usage of the specific memtype.
>>>>>>> Considering module unload will remove the usage of specific memtype from other parts
>>>>>>> of the kernel and we already do all the required reset in memory hot unplug, do we
>>>>>>> need to do the clear_node_memory_type above?
>>>>>>
>>>>>> Per my understanding, we need to call clear_node_memory_type() in
>>>>>> dev_dax_kmem_remove().  After that, we have nothing to do in
>>>>>> dax_kmem_exit().
>>>>>>
>>>>>
>>>>> Ok, I guess you are suggesting to do the clear_node_memory_type even if we fail the memory remove.
>>>>
>>>> Can we use node_memory_types[] to indicate whether a node is managed by
>>>> a driver?
>>>>
>>>> Regardless being succeeded or failed, dev_dax_kmem_remove() will set
>>>> node_memory_types[] = NULL.  But until node is offlined, we will still
>>>> keep the node in the memory_dev_type (dax_pmem_type).
>>>>
>>>> And we will prevent dax/kmem from unloading via try_module_get() and add
>>>> "struct module *" to struct memory_dev_type.
>>>>
>>>
>>> Current dax/kmem driver is not holding any module reference and allows the module to be unloaded
>>> anytime. Even if the memory onlined by the driver fails to be unplugged. Addition of memory_dev_type
>>> as suggested by you will be different than that. Page demotion can continue to work without the
>>> support of dax_pmem_type as long as we keep the older demotion order. Any new demotion order
>>> rebuild will remove the memory node which was not hotunplugged from the demotion order. Isn't that
>>> a much simpler implementation?
>>
>> Per my understanding, unbinding/binding the dax/kmem driver means
>> changing the memory type of a memory device.  For example, unbinding
>> dax/kmem driver may mean changing the memory type from dax_pmem_type to
>> default_memory_type (or default_dram_type).  That appears strange.  But
>> if we force the NUMA node to be offlined for unbinding, we can avoid to
>> change the memory type to default_memory_type.
>>
>
> If we are able to unplug all the memory, we do remove the node from N_MEMORY.
> If we fail to unplug the memory, we have two options.
>
> 1) Keep the same demotion order
> 2) Rebuild the demotion order which results in memory NUMA node not participating
>    in demotion.
>
> I agree with you that we should not switch to default memory type.
>
> The below code demonstrate how it can be done. If we want to keep
> the same demotion order, we can remove establish_demotion_target() from
> the below code.
>
> void clear_node_memory_type(int node, struct memory_dev_type *memtype)
> {
> 	struct memory_tier *memtier;
> 	pg_data_t *pgdat = NODE_DATA(node);
>
> 	mutex_lock(&memory_tier_lock);
> 	/*
> 	 * Even if we fail to unplug memory, clear the association of
> 	 * this node to this specific memory type.
> 	 */
> 	if (node_isset(node, memtype->nodes) && node_memory_types[node] == memtype) {
>
> 		memtier = __node_get_memory_tier(node);
> 		if (memtier) {
> 			rcu_assign_pointer(pgdat->memtier, NULL);
> 			synchronize_rcu();
> 		}
> 		node_clear(node, memtype->nodes);
> 		if (nodes_empty(memtype->nodes)) {
> 			list_del(&memtype->tier_sibiling);
> 			memtype->memtier = NULL;
> 			if (memtier && list_empty(&memtier->memory_types))
> 				destroy_memory_tier(memtier);
>
> 		}
> 		establish_demotion_targets();
> 	}
> 	node_memory_types[node] = NULL;
> 	mutex_unlock(&memory_tier_lock);
> }
>
>
> If we agree that we want to keep the current behavior (that is to allow kmem driver unload
> even on memory unplug failure) we can go with the above change. If we are suggesting we
> should prevent a driver unload, then IMHO it should be independent of memory_dev_type
> (or this patch series). We should make sure we take a module reference on successful
> memory online and drop it only on successful offline.

I suggest keeping a NUMA node in the memory_dev_type (dax_pmem_type)
until the node is offlined.

Yes.  The dax/kmem driver may be unbound from the dax device even if
memory offlining fails.  But we can still find some way to keep the
memory_dev_type of the NUMA node unchanged.

Solution 1 is to prevent the dax/kmem driver from unloading via a module
reference.  I think we do that in this series.

Solution 2 is to allocate dax_pmem_type dynamically, and manage it like
"kmem_name".  We may need some kind of reference counting to manage it.

Best Regards,
Huang, Ying
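
To make the reference counting mentioned for Solution 2 more concrete, here is a
minimal userspace sketch.  It is not kernel code and not what the series
implements.  All names (memtype_sketch, memtype_alloc, memtype_put,
node_set_type, node_offlined) are hypothetical, a plain int stands in for
whatever counter the kernel would use, and the adistance value is an arbitrary
placeholder.  The idea it illustrates is only that the driver and each node
hold their own reference, so the type outlives a driver unbind when memory
offlining fails.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_NODES 8

/* Hypothetical stand-in for a dynamically allocated memory type. */
struct memtype_sketch {
	int adistance;	/* abstract distance of the type (placeholder) */
	int refs;	/* one reference per holder: driver or node */
};

static struct memtype_sketch *node_types[MAX_NODES];
static int live_types;	/* number of allocated, not yet freed types */

/* Allocate a type; the caller (the driver) owns the initial reference. */
static struct memtype_sketch *memtype_alloc(int adistance)
{
	struct memtype_sketch *t = calloc(1, sizeof(*t));

	if (!t)
		return NULL;
	t->adistance = adistance;
	t->refs = 1;
	live_types++;
	return t;
}

/* Drop one reference; free the type when the last holder is gone. */
static void memtype_put(struct memtype_sketch *t)
{
	assert(t->refs > 0);
	if (--t->refs == 0) {
		live_types--;
		free(t);
	}
}

/* Associate a node with a type; the node takes its own reference. */
static void node_set_type(int node, struct memtype_sketch *t)
{
	assert(node >= 0 && node < MAX_NODES && !node_types[node]);
	t->refs++;
	node_types[node] = t;
}

/* Called only once the node is really offlined: drop the node's reference. */
static void node_offlined(int node)
{
	struct memtype_sketch *t = node_types[node];

	if (!t)
		return;
	node_types[node] = NULL;
	memtype_put(t);
}

/*
 * Walk through the failed-offline case.  Returns 1 if the type survives
 * the driver unbind and is freed only after the node goes offline.
 */
int demo_unbind_with_failed_offline(void)
{
	struct memtype_sketch *t = memtype_alloc(1000);
	int survived;

	if (!t)
		return 0;
	node_set_type(1, t);
	memtype_put(t);		/* driver unbinds; offlining failed */
	survived = node_types[1] && node_types[1]->adistance == 1000 &&
		   live_types == 1;
	node_offlined(1);	/* later, the node is offlined */
	return survived && !node_types[1] && live_types == 0;
}
```

In this sketch the node keeps its memory type unchanged after the driver is
gone, which is the property both solutions aim for.  Solution 1 would get the
same effect by pinning the module instead of the type.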