Re: [PATCH v3] mm/khugepaged: sched to numa node when collapse huge page

linux-mm.kvack.org archive mirror

From: Peter Xu <peterx@redhat.com>
To: David Hildenbrand <david@redhat.com>
Cc: Bibo Mao <maobibo@loongson.cn>,
	Andrew Morton <akpm@linux-foundation.org>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	Yang Shi <shy828301@gmail.com>
Subject: Re: [PATCH v3] mm/khugepaged: sched to numa node when collapse huge page
Date: Thu, 28 Apr 2022 12:34:07 -0400
Message-ID: <YmrB/7ehG2kj2RMn@xz-m1.local>
In-Reply-To: <3a441789-b3e4-236e-2e44-e7a1c7258a94@redhat.com>

On Thu, Apr 28, 2022 at 05:17:07PM +0200, David Hildenbrand wrote:
> On 17.03.22 07:50, Bibo Mao wrote:
> > collapse huge page will copy huge page from general small pages,
> > dest node is calculated from most one of source pages, however
> > THP daemon is not scheduled on dest node. The performance may be
> > poor since huge page copying across nodes, also cache is not used
> > for target node. With this patch, khugepaged daemon switches to
> > the same numa node with huge page. It saves copying time and makes
> > use of local cache better.
> > 
> > With this patch, specint 2006 base performance is improved with 6%
> > on Loongson 3C5000L platform with 32 cores and 8 numa nodes.
> 
> If it helps, that's nice as long as it doesn't hurt other cases.
> 
> > 
> > Signed-off-by: Bibo Mao <maobibo@loongson.cn>
> > ---
> > changelog:
> > V2: remove node record for thp daemon
> > V3: remove unlikely statement
> > ---
> >  mm/khugepaged.c | 8 ++++++++
> >  1 file changed, 8 insertions(+)
> > 
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 131492fd1148..b3cf0885f5a2 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -1066,6 +1066,7 @@ static void collapse_huge_page(struct mm_struct *mm,
> >  	struct vm_area_struct *vma;
> >  	struct mmu_notifier_range range;
> >  	gfp_t gfp;
> > +	const struct cpumask *cpumask;
> >  
> >  	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> >  
> > @@ -1079,6 +1080,13 @@ static void collapse_huge_page(struct mm_struct *mm,
> >  	 * that. We will recheck the vma after taking it again in write mode.
> >  	 */
> >  	mmap_read_unlock(mm);
> > +
> > +	/* sched to specified node before huage page memory copy */
> 
> huage? I assume "huge"
> 
> > +	if (task_node(current) != node) {
> > +		cpumask = cpumask_of_node(node);
> > +		if (!cpumask_empty(cpumask))
> > +			set_cpus_allowed_ptr(current, cpumask);
> > +	}
> 
> I wonder if that will always be optimized out without NUMA and if we
> want to check for IS_ENABLED(CONFIG_NUMA).
> 
> 
> Regarding comments from others, I agree: I think what we'd actually want
> is something like "try to reschedule to one of these CPUs immediately.
> If they are all busy, just stay here."
> 
> 
> Also, I do wonder if there could already be scenarios where someone
> wants to let khugepaged run only on selected housekeeping CPUs (e.g.,
> when pinning VCPUs in a VM to physical CPUs). It might even degrade the
> VM performance in that case if we schedule something unrelated on these
> CPUs. (I don't know which interfaces we might already have to configure
> housekeeping CPUs for kthreads).
> 
> I can spot in kernel/kthread.c:kthread()
> 
> set_cpus_allowed_ptr(current, housekeeping_cpumask(HK_TYPE_KTHREAD));
> 
> Hmmmmm ...

Yes, that's a valid point.  For RT, afaik many users tune the kernel threads
specifically on demand by pinning them.  So I'm not sure whether this new
algorithm could already break some users, by either (1) trying to pin
khugepaged onto some isolated cores (which can cause spikes?), or (2)
messing up the admin's previous pin settings on the khugepaged kthread.
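
To illustrate (2), here is a rough userspace analogy, not kernel code and
not a proposal for the patch itself.  It uses sched_getaffinity(2) and
sched_setaffinity(2) instead of set_cpus_allowed_ptr(), and the helper
names narrow_to_node() and restore_affinity() are made up for this sketch.
The idea is only that the node's cpumask gets intersected with whatever
mask is already in effect, and that the old mask is saved so it can be put
back after the copy:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>

/*
 * Hypothetical helper (userspace analogy only).
 *
 * Narrow the calling thread's affinity to the CPUs that are in BOTH its
 * current allowed mask and node_mask.  The previous mask is stored in
 * *saved so the caller can restore it later.
 *
 * Returns 1 if the affinity was narrowed, 0 if the intersection is empty
 * and nothing was changed, -1 on error.
 */
static int narrow_to_node(const cpu_set_t *node_mask, cpu_set_t *saved)
{
	cpu_set_t want;

	assert(node_mask && saved);

	/* Remember what the admin (or anyone else) configured before us. */
	if (sched_getaffinity(0, sizeof(*saved), saved) != 0)
		return -1;

	/* Never widen the mask: only keep CPUs we were already allowed on. */
	CPU_AND(&want, saved, node_mask);
	if (CPU_COUNT(&want) == 0)
		return 0;

	if (sched_setaffinity(0, sizeof(want), &want) != 0)
		return -1;

	return 1;
}

/* Hypothetical helper: put back the mask saved by narrow_to_node(). */
static int restore_affinity(const cpu_set_t *saved)
{
	return sched_setaffinity(0, sizeof(*saved), saved);
}
```

A caller would build node_mask for the destination node, call
narrow_to_node() before the copy, and call restore_affinity() afterwards
if it returned 1.  This does not address (1) by itself: if the admin's
mask still contains isolated cores, the intersection can contain them too.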

The other thing is that khugepaged's movement across cores seems to be quite
random, because the pages it scans can be unpredictably spread across
different NUMA nodes, so logically it can easily start bouncing on some hosts,
and that does sound questionable.  I raised this as a (pure) question
previously, on the 2nd point, irrespective of the benchmark result.

-- 
Peter Xu


Thread overview: 13+ messages
2022-03-17  6:50 Bibo Mao
2022-04-27 20:48 ` Andrew Morton
2022-04-27 22:29   ` Yang Shi
2022-04-28 10:07   ` David Hildenbrand
2022-04-28 13:50 ` Peter Xu
2022-04-28 15:17 ` David Hildenbrand
2022-04-28 16:34   ` Peter Xu [this message]
2022-05-13  0:36     ` Andrew Morton
2022-05-13  1:19       ` maobibo
2022-05-13  1:29         ` maobibo
2022-05-13  1:49           ` Andrew Morton
2022-05-13  1:59             ` maobibo
2022-05-13  2:40             ` Yang Shi
