Date: Tue, 15 Aug 2023 10:36:18 +0800
From: Yuan Yao <yuan.yao@linux.intel.com>
To: Yan Zhao
Cc: John Hubbard, David Hildenbrand, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, kvm@vger.kernel.org, pbonzini@redhat.com,
	seanjc@google.com, mike.kravetz@oracle.com, apopple@nvidia.com,
	jgg@nvidia.com, rppt@kernel.org, akpm@linux-foundation.org,
	kevin.tian@intel.com, Mel Gorman
Subject: Re: [RFC PATCH v2 0/5] Reduce NUMA balance caused TLB-shootdowns in a VM
Message-ID: <20230815023618.uvefne3af7fn5msn@yy-desk-7060>
References: <20230810085636.25914-1-yan.y.zhao@intel.com>
 <41a893e1-f2e7-23f4-cad2-d5c353a336a3@redhat.com>
 <6b48a161-257b-a02b-c483-87c04b655635@redhat.com>
 <1ad2c33d-95e1-49ec-acd2-ac02b506974e@nvidia.com>
 <846e9117-1f79-a5e0-1b14-3dba91ab8033@redhat.com>

On Mon, Aug 14, 2023 at 05:09:18PM +0800, Yan Zhao wrote:
> On Fri, Aug 11, 2023 at 12:35:27PM -0700, John Hubbard wrote:
> > On 8/11/23 11:39, David Hildenbrand wrote:
> > ...
> > > > > Should we want to disable NUMA hinting for such VMAs instead
> > > > > (for example, by the QEMU/hypervisor, which knows that any NUMA
> > > > > hinting activity on these ranges would be a complete waste of
> > > > > time)? I recall that John H. once mentioned that there are
> > > > > similar issues with GPU memory: NUMA hinting is actually
> > > > > counter-productive and they end up disabling it.
> > > >
> > > > Yes, NUMA balancing is incredibly harmful to performance, for GPU
> > > > and accelerators that map memory... and VMs as well, it seems.
> > > > Basically, anything that has its own processors and page tables
> > > > needs to be left strictly alone by NUMA balancing, because the
> > > > kernel is (still, even today) unaware of what those processors
> > > > are doing, and so it has no way to do productive NUMA balancing.
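(For context: the only stock knob today is global. Automatic NUMA
balancing can be switched off system-wide via the kernel.numa_balancing
sysctl, which is exactly the big hammer the thread below is trying to
replace with something per-VMA. A minimal userspace sketch, assuming
the standard /proc/sys file, where 0 disables and 1 enables balancing:

/* Sketch: disable automatic NUMA balancing system-wide.
 * Assumes a kernel built with CONFIG_NUMA_BALANCING and the usual
 * /proc/sys/kernel/numa_balancing sysctl file (0 = off, 1 = on).
 * Needs root.
 */
#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/sys/kernel/numa_balancing", "w");

	if (!f) {
		perror("fopen(/proc/sys/kernel/numa_balancing)");
		return 1;
	}
	if (fputs("0\n", f) == EOF)
		perror("fputs");
	fclose(f);
	return 0;
}
)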
> > >
> > > Is there any existing way we could handle that better on a per-VMA
> > > level, or on the process level? Any magic toggles?
> > >
> > > MMF_HAS_PINNED might be too restrictive. MMF_HAS_PINNED_LONGTERM
> > > might be better, but with things like io_uring it is still too
> > > restrictive eventually.
> > >
> > > I recall that setting a mempolicy could prevent auto-numa from
> > > getting active, but that might be undesired.
> > >
> > > CCing Mel.
> >
> > Let's discern between page pinning situations and HMM-style
> > situations. Page pinning of CPU memory is unnecessary when setting
> > up for using that memory by modern GPUs or accelerators, because the
> > latter can handle replayable page faults. So for such cases, the
> > pages are in use by a GPU or accelerator, but unpinned.
> >
> > The performance problem occurs because, for those pages, NUMA
> > balancing causes unmapping, which generates callbacks to the device
> > driver, which dutifully unmaps the pages from the GPU or
> > accelerator, even if the GPU might be busy using those pages. The
> > device promptly takes a device page fault, and the driver then
> > re-establishes the device page table mapping, which is good until
> > the next round of unmapping from the NUMA balancer.
> >
> > hmm_range_fault()-based memory management in particular might
> > benefit from having NUMA balancing disabled entirely for the
> > memremap_pages() region, come to think of it. That seems relatively
> > easy and clean at first glance, anyway.
> >
> > For other regions (allocated by the device driver), a per-VMA flag
> > seems about right: VM_NO_NUMA_BALANCING?
>
> Thanks a lot for those good suggestions!
> For VMs, when could a per-VMA flag be set?
> It might be hard to do in mmap() in QEMU, because a VMA may not be
> used for DMA until after it's mapped into VFIO.
> Then, should VFIO set this flag after it maps a range?
> And could this flag be unset after device hot-unplug?

Emm... the madvise() syscall comes to mind: it already does things like
change flags on a VMA, e.g. madvise(MADV_DONTFORK) adds VM_DONTCOPY to
the VMA.
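(To make that last idea concrete, a minimal userspace sketch follows.
madvise(MADV_DONTFORK) and its VM_DONTCOPY effect are real; the
MADV_NO_NUMA_BALANCE value below is purely hypothetical, invented only
to illustrate the shape a per-VMA opt-out such as VM_NO_NUMA_BALANCING
could take, e.g. issued by VFIO or QEMU once a range is DMA-mapped:

/* Sketch only. MADV_DONTFORK is real and sets VM_DONTCOPY on the VMA;
 * MADV_NO_NUMA_BALANCE is a hypothetical advice value standing in for
 * the per-VMA NUMA-balancing opt-out discussed above.
 */
#include <stdio.h>
#include <sys/mman.h>

#define MADV_NO_NUMA_BALANCE 240	/* hypothetical, not in any kernel */

int main(void)
{
	size_t len = 2UL * 1024 * 1024;
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* Existing precedent: flip a VMA flag via madvise().
	 * After this, fork() will not duplicate the mapping. */
	if (madvise(buf, len, MADV_DONTFORK))
		perror("madvise(MADV_DONTFORK)");

	/* Hypothetical per-VMA NUMA-balancing opt-out; a real kernel
	 * will reject the unknown advice value with EINVAL. */
	if (madvise(buf, len, MADV_NO_NUMA_BALANCE))
		perror("madvise(MADV_NO_NUMA_BALANCE)");

	munmap(buf, len);
	return 0;
}

Whether the right trigger is a userspace madvise(), VFIO setting the
flag at DMA-map time, or both, is exactly the open question raised
above.)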