Date: Tue, 15 Aug 2023 10:36:18 +0800
From: Yuan Yao <yuan.yao@linux.intel.com>
To: Yan Zhao
Cc: John Hubbard, David Hildenbrand, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, kvm@vger.kernel.org, pbonzini@redhat.com,
	seanjc@google.com, mike.kravetz@oracle.com, apopple@nvidia.com,
	jgg@nvidia.com, rppt@kernel.org, akpm@linux-foundation.org,
	kevin.tian@intel.com, Mel Gorman
Subject: Re: [RFC PATCH v2 0/5] Reduce NUMA balance caused TLB-shootdowns in a VM
Message-ID: <20230815023618.uvefne3af7fn5msn@yy-desk-7060>
References: <20230810085636.25914-1-yan.y.zhao@intel.com>
 <41a893e1-f2e7-23f4-cad2-d5c353a336a3@redhat.com>
 <6b48a161-257b-a02b-c483-87c04b655635@redhat.com>
 <1ad2c33d-95e1-49ec-acd2-ac02b506974e@nvidia.com>
 <846e9117-1f79-a5e0-1b14-3dba91ab8033@redhat.com>

On Mon, Aug 14, 2023 at 05:09:18PM +0800, Yan Zhao wrote:
> On Fri, Aug 11, 2023 at 12:35:27PM -0700, John Hubbard wrote:
> > On 8/11/23 11:39, David Hildenbrand wrote:
> > ...
> > > > > Should we want to disable NUMA hinting for such VMAs instead
> > > > > (for example, by the QEMU/hypervisor, which knows that any NUMA
> > > > > hinting activity on these ranges would be a complete waste of
> > > > > time)? I recall that John H. once mentioned that there are
> > > > > similar issues with GPU memory: NUMA hinting is actually
> > > > > counter-productive and they end up disabling it.
> > > >
> > > > Yes, NUMA balancing is incredibly harmful to performance, for GPU
> > > > and accelerators that map memory... and VMs as well, it seems.
> > > > Basically, anything that has its own processors and page tables
> > > > needs to be left strictly alone by NUMA balancing, because the
> > > > kernel is (still, even today) unaware of what those processors
> > > > are doing, and so it has no way to do productive NUMA balancing.
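(For context: the only stock knob today is global. Automatic NUMA
balancing can be switched off system-wide via the kernel.numa_balancing
sysctl, which is exactly the big hammer the thread below is trying to
replace with something per-VMA. A minimal userspace sketch, assuming
the standard /proc/sys file, where 0 disables and 1 enables balancing:

/* Sketch: disable automatic NUMA balancing system-wide.
 * Assumes a kernel built with CONFIG_NUMA_BALANCING and the usual
 * /proc/sys/kernel/numa_balancing sysctl file (0 = off, 1 = on).
 * Needs root.
 */
#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/sys/kernel/numa_balancing", "w");

	if (!f) {
		perror("fopen(/proc/sys/kernel/numa_balancing)");
		return 1;
	}
	if (fputs("0\n", f) == EOF)
		perror("fputs");
	fclose(f);
	return 0;
}
)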
> > >
> > > Is there any existing way we could handle that better on a per-VMA
> > > level, or on the process level? Any magic toggles?
> > >
> > > MMF_HAS_PINNED might be too restrictive. MMF_HAS_PINNED_LONGTERM
> > > might be better, but with things like io_uring it is still too
> > > restrictive eventually.
> > >
> > > I recall that setting a mempolicy could prevent auto-numa from
> > > getting active, but that might be undesired.
> > >
> > > CCing Mel.
> >
> > Let's discern between page pinning situations and HMM-style
> > situations. Page pinning of CPU memory is unnecessary when setting
> > up for using that memory by modern GPUs or accelerators, because the
> > latter can handle replayable page faults. So for such cases, the
> > pages are in use by a GPU or accelerator, but unpinned.
> >
> > The performance problem occurs because, for those pages, NUMA
> > balancing causes unmapping, which generates callbacks to the device
> > driver, which dutifully unmaps the pages from the GPU or
> > accelerator, even if the GPU might be busy using those pages. The
> > device promptly takes a device page fault, and the driver then
> > re-establishes the device page table mapping, which is good until
> > the next round of unmapping from the NUMA balancer.
> >
> > hmm_range_fault()-based memory management in particular might
> > benefit from having NUMA balancing disabled entirely for the
> > memremap_pages() region, come to think of it. That seems relatively
> > easy and clean at first glance, anyway.
> >
> > For other regions (allocated by the device driver), a per-VMA flag
> > seems about right: VM_NO_NUMA_BALANCING?
>
> Thanks a lot for those good suggestions!
> For VMs, when could a per-VMA flag be set?
> It might be hard to do in mmap() in QEMU, because a VMA may not be
> used for DMA until after it's mapped into VFIO.
> Then, should VFIO set this flag after it maps a range?
> And could this flag be unset after device hot-unplug?

Emm... the madvise() syscall comes to mind: it already does things like
change flags on a VMA, e.g. madvise(MADV_DONTFORK) adds VM_DONTCOPY to
the VMA.
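(To make that last idea concrete, a minimal userspace sketch follows.
madvise(MADV_DONTFORK) and its VM_DONTCOPY effect are real; the
MADV_NO_NUMA_BALANCE value below is purely hypothetical, invented only
to illustrate the shape a per-VMA opt-out such as VM_NO_NUMA_BALANCING
could take, e.g. issued by VFIO or QEMU once a range is DMA-mapped:

/* Sketch only. MADV_DONTFORK is real and sets VM_DONTCOPY on the VMA;
 * MADV_NO_NUMA_BALANCE is a hypothetical advice value standing in for
 * the per-VMA NUMA-balancing opt-out discussed above.
 */
#include <stdio.h>
#include <sys/mman.h>

#define MADV_NO_NUMA_BALANCE 240	/* hypothetical, not in any kernel */

int main(void)
{
	size_t len = 2UL * 1024 * 1024;
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* Existing precedent: flip a VMA flag via madvise().
	 * After this, fork() will not duplicate the mapping. */
	if (madvise(buf, len, MADV_DONTFORK))
		perror("madvise(MADV_DONTFORK)");

	/* Hypothetical per-VMA NUMA-balancing opt-out; a real kernel
	 * will reject the unknown advice value with EINVAL. */
	if (madvise(buf, len, MADV_NO_NUMA_BALANCE))
		perror("madvise(MADV_NO_NUMA_BALANCE)");

	munmap(buf, len);
	return 0;
}

Whether the right trigger is a userspace madvise(), VFIO setting the
flag at DMA-map time, or both, is exactly the open question raised
above.)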