From: Yan Zhao <yan.y.zhao@intel.com>
To: Yuan Yao <yuan.yao@linux.intel.com>
Cc: John Hubbard <jhubbard@nvidia.com>,
	David Hildenbrand <david@redhat.com>, <linux-mm@kvack.org>,
	<linux-kernel@vger.kernel.org>, <kvm@vger.kernel.org>,
	<pbonzini@redhat.com>, <seanjc@google.com>,
	<mike.kravetz@oracle.com>, <apopple@nvidia.com>, <jgg@nvidia.com>,
	<rppt@kernel.org>, <akpm@linux-foundation.org>,
	<kevin.tian@intel.com>, Mel Gorman <mgorman@techsingularity.net>
Subject: Re: [RFC PATCH v2 0/5] Reduce NUMA balance caused TLB-shootdowns in a VM
Date: Tue, 15 Aug 2023 10:37:55 +0800
Message-ID: <ZNrlAyzo93oGngM2@yzhao56-desk.sh.intel.com>
In-Reply-To: <20230815023618.uvefne3af7fn5msn@yy-desk-7060>

On Tue, Aug 15, 2023 at 10:36:18AM +0800, Yuan Yao wrote:
> On Mon, Aug 14, 2023 at 05:09:18PM +0800, Yan Zhao wrote:
> > On Fri, Aug 11, 2023 at 12:35:27PM -0700, John Hubbard wrote:
> > > On 8/11/23 11:39, David Hildenbrand wrote:
> > > ...
> > > > > > Should we want to disable NUMA hinting for such VMAs instead (for example,
> > > > > > by QEMU/the hypervisor, which knows that any NUMA hinting activity on these
> > > > > > ranges would be a complete waste of time)? I recall that John H. once
> > > > > > mentioned that there are similar issues with GPU memory: NUMA hinting is
> > > > > > actually counter-productive and they end up disabling it.
> > > > > >
> > > > >
> > > > > Yes, NUMA balancing is incredibly harmful to performance, for GPU and
> > > > > accelerators that map memory...and VMs as well, it seems. Basically,
> > > > > anything that has its own processors and page tables needs to be left
> > > > > strictly alone by NUMA balancing, because the kernel is (still, even
> > > > > today) unaware of what those processors are doing and so has no way to
> > > > > do productive NUMA balancing.
> > > >
> > > > Is there any existing way we could handle that better on a per-VMA level, or on the process level? Any magic toggles?
> > > >
> > > > MMF_HAS_PINNED might be too restrictive. MMF_HAS_PINNED_LONGTERM might be better, but with things like io_uring it would eventually still be too restrictive.
> > > >
> > > > I recall that setting a mempolicy could prevent auto-numa from getting active, but that might be undesired.
> > > >
> > > > CCing Mel.
> > > >
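
To make the "MMF_HAS_PINNED might be too restrictive" point above concrete:
MMF_HAS_PINNED is an existing per-mm flag, set once any page in the mm has
been pinned via FOLL_PIN. A whole-mm gate like the hypothetical sketch below
(the helper name is made up) would therefore switch off NUMA hinting for the
entire process as soon as a single page is pinned, e.g. an io_uring buffer:

#include <linux/mm.h>
#include <linux/sched/coredump.h>	/* MMF_HAS_PINNED */

/*
 * Hypothetical helper, sketch only.  Gating NUMA hinting on MMF_HAS_PINNED
 * would disable it for the whole mm even if only a handful of pages were
 * ever pinned (io_uring buffers, etc.), hence "too restrictive".
 */
static bool mm_skip_numa_hinting(struct mm_struct *mm)
{
	return test_bit(MMF_HAS_PINNED, &mm->flags);
}
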
> > >
> > > Let's discern between page pinning situations, and HMM-style situations.
> > > Page pinning of CPU memory is unnecessary when setting up for using that
> > > memory by modern GPUs or accelerators, because the latter can handle
> > > replayable page faults. So for such cases, the pages are in use by a GPU
> > > or accelerator, but unpinned.
> > >
> > > The performance problem occurs because for those pages, the NUMA
> > > balancing causes unmapping, which generates callbacks to the device
> > > driver, which dutifully unmaps the pages from the GPU or accelerator,
> > > even if the GPU might be busy using those pages. The device promptly
> > > causes a device page fault, and the driver then re-establishes the
> > > device page table mapping, which is good until the next round of
> > > unmapping from the NUMA balancer.
> > >
> > > hmm_range_fault()-based memory management in particular might benefit
> > > from having NUMA balancing disabled entirely for the memremap_pages()
> > > region, come to think of it. That seems relatively easy and clean at
> > > first glance anyway.
> > >
> > > For other regions (allocated by the device driver), a per-VMA flag
> > > seems about right: VM_NO_NUMA_BALANCING ?
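
To sketch the per-VMA idea: VM_NO_NUMA_BALANCING does not exist today, so the
snippet below is illustration only (VM_ARCH_1 is reused purely as a placeholder
bit, and the helper name is made up). The intent is that the NUMA-hinting scan
path, task_numa_work() -> change_prot_numa(), would skip any VMA carrying the
flag, so its pages are never made PROT_NONE and no mmu_notifier invalidation
ever reaches the device driver or KVM:

#include <linux/mm.h>

/* Hypothetical flag; VM_ARCH_1 reused as a placeholder bit for illustration. */
#define VM_NO_NUMA_BALANCING	VM_ARCH_1

/*
 * Sketch: consulted before a VMA is protected for NUMA migration.  Skipping
 * the VMA means no PROT_NONE, no mmu_notifier invalidation, and therefore no
 * device/KVM page-fault-and-remap cycle.
 */
static bool vma_skip_numa_balancing(struct vm_area_struct *vma)
{
	return !!(vma->vm_flags & VM_NO_NUMA_BALANCING);
}
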
> > >
> > Thanks a lot for those good suggestions!
> > For VMs, when could a per-VMA flag be set?
> > It might be hard to do in mmap() in QEMU, because a VMA may not be used for
> > DMA until after it's mapped into VFIO.
> > Then, should VFIO set this flag after it maps a range?
> > And could this flag be unset after device hot-unplug?
> 
> Emm... the madvise() syscall comes to mind; it does things like changing flags
> on a VMA, e.g. madvise(MADV_DONTFORK) adds VM_DONTCOPY to the VMA.
Yes, madvise() might work.
And setting this flag might be an easy decision, while unsetting it might be
hard unless some counters are introduced.
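
For illustration, a rough userspace sketch of the setting side, assuming a
hypothetical MADV_NO_NUMA_BALANCING advice value (the numeric value below is a
made-up placeholder; MADV_DONTFORK is real and only shown as the analogy).
It would be applied to guest RAM after the range is mapped into VFIO, i.e.
once it is known to be a DMA target; how to undo it on device hot-unplug is
exactly the open question above:

#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

#define MADV_NO_NUMA_BALANCING	240	/* hypothetical advice value, not in the UAPI */

static void *map_guest_ram(size_t len)
{
	void *ram = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (ram == MAP_FAILED) {
		perror("mmap");
		exit(1);
	}

	/* Existing practice for VFIO-backed guest RAM: keep it stable across fork(). */
	madvise(ram, len, MADV_DONTFORK);

	/* ... register the range with VFIO for DMA here ... */

	/*
	 * Hypothetical: once the range is a DMA target, tell the kernel that
	 * NUMA hinting on it would only cause pointless TLB shootdowns and
	 * device/KVM re-faults.
	 */
	madvise(ram, len, MADV_NO_NUMA_BALANCING);

	return ram;
}

int main(void)
{
	map_guest_ram(2UL << 20);	/* 2 MiB, just to exercise the sketch */
	return 0;
}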



Thread overview: 47+ messages
2023-08-10  8:56 Yan Zhao
2023-08-10  8:57 ` [RFC PATCH v2 1/5] mm/mmu_notifier: introduce a new mmu notifier flag MMU_NOTIFIER_RANGE_NUMA Yan Zhao
2023-08-10  8:58 ` [RFC PATCH v2 2/5] mm: don't set PROT_NONE to maybe-dma-pinned pages for NUMA-migrate purpose Yan Zhao
2023-08-10  9:00 ` [RFC PATCH v2 3/5] mm/mmu_notifier: introduce a new callback .numa_protect Yan Zhao
2023-08-10  9:00 ` [RFC PATCH v2 4/5] mm/autonuma: call .numa_protect() when page is protected for NUMA migrate Yan Zhao
2023-08-11 18:52   ` Nadav Amit
2023-08-14  7:52     ` Yan Zhao
2023-08-10  9:02 ` [RFC PATCH v2 5/5] KVM: Unmap pages only when it's indeed protected for NUMA migration Yan Zhao
2023-08-10 13:16   ` bibo mao
2023-08-11  3:45     ` Yan Zhao
2023-08-11  7:40       ` bibo mao
2023-08-11  8:01         ` Yan Zhao
2023-08-11 17:14           ` Sean Christopherson
2023-08-11 17:18             ` Jason Gunthorpe
2023-08-14  6:52             ` Yan Zhao
2023-08-14  7:44               ` Yan Zhao
2023-08-14 16:40               ` Sean Christopherson
2023-08-15  1:54                 ` Yan Zhao
2023-08-15 14:50                   ` Sean Christopherson
2023-08-16  2:43                     ` bibo mao
2023-08-16  3:44                       ` bibo mao
2023-08-16  5:14                         ` Yan Zhao
2023-08-16  7:29                           ` bibo mao
2023-08-16  7:18                             ` Yan Zhao
2023-08-16  7:53                               ` bibo mao
2023-08-16 13:39                                 ` Sean Christopherson
2023-08-10  9:34 ` [RFC PATCH v2 0/5] Reduce NUMA balance caused TLB-shootdowns in a VM David Hildenbrand
2023-08-10  9:50   ` Yan Zhao
2023-08-11 17:25     ` David Hildenbrand
2023-08-11 18:20       ` John Hubbard
2023-08-11 18:39         ` David Hildenbrand
2023-08-11 19:35           ` John Hubbard
2023-08-14  9:09             ` Yan Zhao
2023-08-15  2:34               ` John Hubbard
2023-08-16  7:43                 ` David Hildenbrand
2023-08-16  9:06                   ` Yan Zhao
2023-08-16  9:49                     ` David Hildenbrand
2023-08-16 18:00                       ` John Hubbard
2023-08-17  5:05                         ` Yan Zhao
2023-08-17  7:38                           ` David Hildenbrand
2023-08-18  0:13                             ` Yan Zhao
2023-08-18  2:29                               ` John Hubbard
2023-09-04  9:18                                 ` Yan Zhao
2023-08-15  2:36               ` Yuan Yao
2023-08-15  2:37                 ` Yan Zhao [this message]
2023-08-10 13:58 ` Chao Gao
2023-08-11  5:22   ` Yan Zhao
