From: Zhongkun He <hezhongkun.hzk@bytedance.com>
To: Abel Wu <wuyun.abel@bytedance.com>
Cc: peterz@infradead.org, mgorman@suse.de, ying.huang@intel.com,
	 linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v1] mm/numa_balancing: Fix the memory thrashing problem in the single-threaded process
Date: Wed, 24 Jul 2024 11:55:30 +0800	[thread overview]
Message-ID: <CACSyD1M_nqrOZh3CDqydaasX3_9JdsqDFQTqOZ+q-xkvNMY1Kg@mail.gmail.com> (raw)
In-Reply-To: <e3a75483-d3f7-4963-9332-4893d22463ad@bytedance.com>

On Tue, Jul 23, 2024 at 9:39 PM Abel Wu <wuyun.abel@bytedance.com> wrote:
>
> Hi Zhongkun,
>
> On 7/23/24 1:32 PM, Zhongkun He wrote:
> > I found a problem on my test machine: the memory of a process is
> > repeatedly migrated back and forth between two nodes and never stops.
> >
> > 1. Test steps and the machine.
> > ------------
> > VM: 4 NUMA nodes, 10GB per node.
> >
> > stress --vm 1 --vm-bytes 12g --vm-keep
> >
> > The numa_stat info:
> > while :;do cat memory.numa_stat | grep -w anon;sleep 5;done
> > anon N0=98304 N1=0 N2=10250747904 N3=2634334208
>
> I am curious what exactly caused the worker's memory to be migrated
> to N3? And later...

The capacity of each node is 10GB, but the process needs 12GB, so
there are always ~2GB on another node. With the commit mentioned
below, page faults are generated only on the remote node, never the
local one, so we keep migrating pages towards the remote node because
p->numa_preferred_nid always points to it.
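
For reference, the check that commit added in change_pte_range() looks
roughly like this (paraphrased from mm/mprotect.c, details vary by
kernel version):

	/* Get target node for single threaded private VMAs */
	if (prot_numa && !(vma->vm_flags & VM_SHARED) &&
	    atomic_read(&vma->vm_mm->mm_users) == 1)
		target_node = numa_node_id();

	...

	/*
	 * Skip the PTE if the page is already on the node the
	 * single-threaded process is running on: it never becomes
	 * protnone and therefore never takes a hinting fault.
	 */
	if (target_node == page_to_nid(page))
		continue;

So the local pages are invisible to the NUMA fault statistics, and
only the remote node ever accumulates faults.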

>
> > anon N0=98304 N1=0 N2=10250747904 N3=2634334208
> > anon N0=98304 N1=0 N2=9937256448 N3=2947825664
> > anon N0=98304 N1=0 N2=8863514624 N3=4021567488
> > anon N0=98304 N1=0 N2=7789772800 N3=5095309312
> > anon N0=98304 N1=0 N2=6716030976 N3=6169051136
> > anon N0=98304 N1=0 N2=5642289152 N3=7242792960
> > anon N0=98304 N1=0 N2=5105442816 N3=7779639296
> > anon N0=98304 N1=0 N2=5105442816 N3=7779639296
> > anon N0=98304 N1=0 N2=4837007360 N3=8048074752
> > anon N0=98304 N1=0 N2=3763265536 N3=9121816576
> > anon N0=98304 N1=0 N2=2689523712 N3=10195558400
> > anon N0=98304 N1=0 N2=2515148800 N3=10369933312
> > anon N0=98304 N1=0 N2=2515148800 N3=10369933312
> > anon N0=98304 N1=0 N2=2515148800 N3=10369933312
>
> ... why was it moved back to N2?

Because the private page faults recorded on N2 outnumber those on N3.
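
For context, task_numa_placement() just picks the node with the most
recorded hinting faults, roughly like this (heavily simplified from
kernel/sched/fair.c, not the exact code):

	unsigned long max_faults = 0;
	int nid, max_nid = NUMA_NO_NODE;

	for_each_online_node(nid) {
		/* private + shared faults accounted to this node */
		unsigned long faults = task_faults(p, nid);

		if (faults > max_faults) {
			max_faults = faults;
			max_nid = nid;
		}
	}

Whichever node has accumulated more faults becomes
p->numa_preferred_nid, so the preference keeps flipping between N2
and N3 and the pages follow it back and forth.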

>
> > anon N0=98304 N1=0 N2=3320455168 N3=9564626944
> > anon N0=98304 N1=0 N2=4394196992 N3=8490885120
> > anon N0=98304 N1=0 N2=5105442816 N3=7779639296
> > anon N0=98304 N1=0 N2=6174195712 N3=6710886400
> > anon N0=98304 N1=0 N2=7247937536 N3=5637144576
> > anon N0=98304 N1=0 N2=8321679360 N3=4563402752
> > anon N0=98304 N1=0 N2=9395421184 N3=3489660928
> > anon N0=98304 N1=0 N2=10247872512 N3=2637209600
> > anon N0=98304 N1=0 N2=10247872512 N3=2637209600
> >
> > 2. Root cause:
> > Since commit 3e32158767b0 ("mm/mprotect.c: don't touch single threaded
> > PTEs which are on the right node"), the PTEs of local pages are not
> > changed in change_pte_range() for a single-threaded process, so no
> > page fault information is generated for them in do_numa_page(). If a
> > single-threaded process has memory on another node, it will
> > unconditionally migrate all of its local memory to that node,
> > even if the remote node holds only one page.
>
> IIUC the remote pages will be moved to the node where the worker
> is running since local (private) PTEs are not set to protnone and
> won't be faulted on.
>

Yes.
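
And on the fault side, do_numa_page() then tries to pull the faulting
remote page over to the node the task is running on, along these lines
(simplified; recent kernels do the same thing with folios):

	/*
	 * NUMA hinting fault on a remote page: ask the policy code
	 * for a target node, then migrate the page towards it.
	 */
	target_nid = numa_migrate_prep(page, vma, vmf->address,
				       page_nid, &flags);
	if (target_nid != NUMA_NO_NODE &&
	    migrate_misplaced_page(page, vma, target_nid))
		page_nid = target_nid;

So the two mechanisms together keep pulling everything towards
whichever node currently looks preferred.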

> >
> > So let's fix it: the memory of a single-threaded process should follow
> > the CPU, not the NUMA faults info, in order to avoid memory thrashing.
>
> Don't forget the 'Fixes' tag for bugfix patches :)

OK, thanks.

>
> >
> > [...]
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 24dda708b699..d7cbbda568fb 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -2898,6 +2898,12 @@ static void task_numa_placement(struct task_struct *p)
> >               numa_group_count_active_nodes(ng);
> >               spin_unlock_irq(group_lock);
> >               max_nid = preferred_group_nid(p, max_nid);
> > +     } else if (atomic_read(&p->mm->mm_users) == 1) {
> > +             /*
> > +              * The memory of a single-threaded process should
> > +              * follow the CPU in order to avoid memory thrashing.
> > +              */
> > +             max_nid = numa_node_id();
> >       }
> >
> >       if (max_faults) {
>
> Since you don't want to respect the faults info, can we simply
> skip task placement?

That's a good suggestion. It would be even better to get some feedback
from others first.
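
For the record, a rough and untested sketch of that idea, bailing out
of placement for single-threaded tasks instead of overriding max_nid
(deref_curr_numa_group() is the existing helper in fair.c):

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ static void task_numa_placement(struct task_struct *p)
+	/*
+	 * The memory of a single-threaded process follows the CPU
+	 * anyway, so the faults info is meaningless for placement.
+	 */
+	if (!deref_curr_numa_group(p) &&
+	    atomic_read(&p->mm->mm_users) == 1)
+		return;

One thing to check with this variant is whether a stale
p->numa_preferred_nid could still steer load balancing, so it needs
testing either way.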

