From: "Huang, Ying" <ying.huang@intel.com>
To: Zi Yan
Cc: Dave Hansen, Yang Shi, Michal Hocko, Wei Xu, David Rientjes,
 Dan Williams, David Hildenbrand, osalvador
Subject: Re: [PATCH -V8 02/10] mm/numa: automatically generate node migration order
References: <20210618061537.434999-1-ying.huang@intel.com>
 <20210618061537.434999-3-ying.huang@intel.com>
 <79397FE3-4B08-4DE5-8468-C5CAE36A3E39@nvidia.com>
 <87v96anu6o.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <2AA3D792-7F14-4297-8EDD-3B5A7B31AECA@nvidia.com>
Date: Tue, 22 Jun 2021 09:14:29 +0800
In-Reply-To: <2AA3D792-7F14-4297-8EDD-3B5A7B31AECA@nvidia.com> (Zi Yan's message of "Mon, 21 Jun 2021 10:50:14 -0400")
Message-ID: <87sg1an1je.fsf@yhuang6-desk2.ccr.corp.intel.com>
Zi Yan writes:

> On 19 Jun 2021, at 4:18, Huang, Ying wrote:
>
>> Zi Yan writes:
>>
>>> On 18 Jun 2021, at 2:15, Huang Ying wrote:

[snip]

>>>> +/*
>>>> + * When memory fills up on a node, memory contents can be
>>>> + * automatically migrated to another node instead of
>>>> + * discarded at reclaim.
>>>> + *
>>>> + * Establish a "migration path" which will start at nodes
>>>> + * with CPUs and will follow the priorities used to build the
>>>> + * page allocator zonelists.
>>>> + *
>>>> + * The difference here is that cycles must be avoided.  If
>>>> + * node0 migrates to node1, then neither node1, nor anything
>>>> + * node1 migrates to can migrate to node0.
>>>> + *
>>>> + * This function can run simultaneously with readers of
>>>> + * node_demotion[].  However, it can not run simultaneously
>>>> + * with itself.  Exclusion is provided by memory hotplug events
>>>> + * being single-threaded.
>>>> + */
>>>> +static void __set_migration_target_nodes(void)
>>>> +{
>>>> +	nodemask_t next_pass = NODE_MASK_NONE;
>>>> +	nodemask_t this_pass = NODE_MASK_NONE;
>>>> +	nodemask_t used_targets = NODE_MASK_NONE;
>>>> +	int node;
>>>> +
>>>> +	/*
>>>> +	 * Avoid any oddities like cycles that could occur
>>>> +	 * from changes in the topology.  This will leave
>>>> +	 * a momentary gap when migration is disabled.
>>>> +	 */
>>>> +	disable_all_migrate_targets();
>>>> +
>>>> +	/*
>>>> +	 * Ensure that the "disable" is visible across the system.
>>>> +	 * Readers will see either a combination of before+disable
>>>> +	 * state or disable+after.  They will never see before and
>>>> +	 * after state together.
>>>> +	 *
>>>> +	 * The before+after state together might have cycles and
>>>> +	 * could cause readers to do things like loop until this
>>>> +	 * function finishes.  This ensures they can only see a
>>>> +	 * single "bad" read and would, for instance, only loop
>>>> +	 * once.
>>>> +	 */
>>>> +	smp_wmb();
>>>> +
>>>> +	/*
>>>> +	 * Allocations go close to CPUs, first.  Assume that
>>>> +	 * the migration path starts at the nodes with CPUs.
>>>> +	 */
>>>> +	next_pass = node_states[N_CPU];
>>>
>>> Is there a plan to allow users to change where the migration
>>> path starts? Or, going one step further, to provide an interface
>>> that lets users specify the demotion path, something like
>>> /sys/devices/system/node/node*/node_demotion?
>>
>> I don't think that's necessary, at least for now. Do you know of
>> any real-world use case for this?
>
> In our P9+Volta system, GPU memory is exposed as a NUMA node.
> For GPU workloads with data sizes greater than the GPU memory size,
> it would be very helpful to allow pages in GPU memory to be
> migrated/demoted to CPU memory. With your current assumption,
> GPU memory -> CPU memory demotion seems not to be possible, right?
> This should also apply to any system with device memory exposed as
> a NUMA node, where workloads run on the device and use CPU memory
> as a lower memory tier than the device memory.

Thanks a lot for your use case!
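To make the current assumption concrete, below is a minimal
user-space simulation of the pass-based construction described in the
quoted comment.  The three-node topology and the fallback[] table are
made up for illustration; the kernel derives both from
node_states[N_CPU] and the zonelist build order instead.

/*
 * Illustrative simulation only -- not the patch code.  Node 0 has
 * CPUs; nodes 1 and 2 are memory-only, with node 1 nearer to node 0.
 */
#include <stdio.h>

#define NR_NODES	3
#define NO_TARGET	-1

/* fallback[i][j]: node j is an allocation fallback of node i */
static const int fallback[NR_NODES][NR_NODES] = {
	{ 0, 1, 1 },	/* node 0 (CPUs) falls back to 1, then 2 */
	{ 0, 0, 1 },	/* node 1 falls back to 2 */
	{ 0, 0, 0 },	/* node 2 is the last tier */
};

int main(void)
{
	int demotion_target[NR_NODES];
	int used[NR_NODES] = { 0 };
	int this_pass[NR_NODES] = { 1, 0, 0 };	/* start at CPU nodes */
	int node, t;

	for (node = 0; node < NR_NODES; node++)
		demotion_target[node] = NO_TARGET;

	for (;;) {
		int next_pass[NR_NODES] = { 0 };
		int progress = 0;

		/*
		 * Nodes in the current pass can never become targets
		 * themselves: this is what breaks cycles.
		 */
		for (node = 0; node < NR_NODES; node++)
			if (this_pass[node])
				used[node] = 1;

		for (node = 0; node < NR_NODES; node++) {
			if (!this_pass[node])
				continue;
			/* Pick the nearest not-yet-used fallback. */
			for (t = 0; t < NR_NODES; t++) {
				if (fallback[node][t] && !used[t]) {
					demotion_target[node] = t;
					used[t] = 1;
					next_pass[t] = 1;
					progress = 1;
					break;
				}
			}
		}
		if (!progress)
			break;
		for (node = 0; node < NR_NODES; node++)
			this_pass[node] = next_pass[node];
	}

	for (node = 0; node < NR_NODES; node++)
		printf("node %d demotes to node %d\n",
		       node, demotion_target[node]);
	return 0;
}

This prints 0 -> 1, 1 -> 2, and 2 -> none.  Because every pass marks
its own nodes as used targets before picking, and the first pass
consists of the nodes with CPUs, a CPU node can never be chosen as a
demotion target.  That is exactly why GPU memory -> CPU memory
demotion is not possible with the current assumption.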
A user-specified demotion path appears to be one possible way to
satisfy your requirement, and I think it would be possible to enable
that on top of this patchset, although we still have no specific plan
to work on it for now.
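For illustration only, such an interface might look roughly like the
sketch below.  This is hypothetical, not part of this series: it
assumes the node_demotion[] array from this patch were made
non-static, and it omits the cycle rejection and the writer-vs-reader
synchronization that the comment quoted above says node_demotion[]
readers depend on.

/* Hypothetical sketch -- nothing below exists in this patchset. */
#include <linux/device.h>
#include <linux/kernel.h>
#include <linux/node.h>
#include <linux/numa.h>

extern int node_demotion[MAX_NUMNODES];

/* cat /sys/devices/system/node/nodeN/demotion_target */
static ssize_t demotion_target_show(struct device *dev,
				    struct device_attribute *attr, char *buf)
{
	return sysfs_emit(buf, "%d\n", node_demotion[dev->id]);
}

/* echo M > /sys/devices/system/node/nodeN/demotion_target */
static ssize_t demotion_target_store(struct device *dev,
				     struct device_attribute *attr,
				     const char *buf, size_t count)
{
	int target;

	if (kstrtoint(buf, 0, &target))
		return -EINVAL;
	if (target != NUMA_NO_NODE && !node_online(target))
		return -EINVAL;
	/* A real version must also reject cycles, e.g. 0 -> 1 -> 0. */
	node_demotion[dev->id] = target;
	return count;
}
static DEVICE_ATTR_RW(demotion_target);

The hard part would not be the attribute itself but keeping the
lock-free readers of node_demotion[] safe while user space rewrites
the path, similar to what __set_migration_target_nodes() does with
disable_all_migrate_targets() and smp_wmb().

Best Regards,
Huang, Ying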