From: Gabriele Monaco <gmonaco@redhat.com>
To: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>,
linux-kernel@vger.kernel.org
Cc: Ingo Molnar <mingo@kernel.org>, Shuah Khan <shuah@kernel.org>,
Andrew Morton <akpm@linux-foundation.org>,
Ingo Molnar <mingo@redhat.com>,
Peter Zijlstra <peterz@infradead.org>,
"Paul E. McKenney" <paulmck@kernel.org>,
linux-mm@kvack.org
Subject: Re: [PATCH v7 1/2] sched: Move task_mm_cid_work to mm work_struct
Date: Thu, 20 Feb 2025 09:00:25 +0100 [thread overview]
Message-ID: <a6065d54390a8358cd4ed68acb803414e27a1ded.camel@redhat.com> (raw)
In-Reply-To: <0493b3c4-c37f-4ddd-93ee-6d7946e42846@efficios.com>
On Wed, 2025-02-19 at 12:08 -0500, Mathieu Desnoyers wrote:
> On 2025-02-19 11:32, Gabriele Monaco wrote:
> >
> >
> > On Wed, 2025-02-19 at 10:13 -0500, Mathieu Desnoyers wrote:
> > > > On 2025-02-19 06:31, Gabriele Monaco wrote:
> > > > > > Currently, the task_mm_cid_work function is called in a
> > > > > > task work triggered by a scheduler tick to periodically
> > > > > > compact the mm_cids of each process. This can delay the
> > > > > > execution of the corresponding thread for the entire
> > > > > > duration of the function, negatively affecting the
> > > > > > response time of real-time tasks. In practice, we observe
> > > > > > task_mm_cid_work increasing latency by 30-35us on a
> > > > > > 128-core system; this order of magnitude is meaningful
> > > > > > under PREEMPT_RT.
> > > > > >
> > > > > > Run the task_mm_cid_work in a new work_struct connected
> > > > > > to the mm_struct rather than in the task context before
> > > > > > returning to userspace.
> > > > > >
> > > > > > This work_struct is initialised with the mm and disabled
> > > > > > before freeing it. Its execution is no longer triggered
> > > > > > by scheduler ticks: the work is queued while returning to
> > > > > > userspace in __rseq_handle_notify_resume, maintaining the
> > > > > > checks to avoid running more frequently than
> > > > > > MM_CID_SCAN_DELAY.
> > > > > >
> > > > > > The main advantage of this change is that the function
> > > > > > can be offloaded to a different CPU and even preempted by
> > > > > > RT tasks.
> > > > > >
> > > > > > Moreover, this new behaviour is more predictable for
> > > > > > periodic tasks with short runtimes, which may rarely run
> > > > > > during a scheduler tick.
> > > > > > Now, the work is always scheduled when the task returns
> > > > > > to userspace.
> > > > > >
> > > > > > The work is disabled during mmdrop: since the function
> > > > > > cannot sleep in all kernel configurations, we cannot wait
> > > > > > for possibly running work items to terminate. We make
> > > > > > sure the mm is valid in case the task is terminating by
> > > > > > reserving it with mmgrab/mmdrop, returning prematurely if
> > > > > > we are really the last user before mmgrab.
> > > > > > This situation is unlikely since we don't schedule the
> > > > > > work for exiting tasks, but we cannot rule it out.
> > > > > >
> > > > > > Fixes: 223baf9d17f2 ("sched: Fix performance regression introduced by mm_cid")
> > > > > > Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
> > > > > > ---
> > > > > > diff --git a/kernel/rseq.c b/kernel/rseq.c
> > > > > > index 442aba29bc4cf..f8394ebbb6f4d 100644
> > > > > > --- a/kernel/rseq.c
> > > > > > +++ b/kernel/rseq.c
> > > > > > @@ -419,6 +419,7 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
> > > > > >  	}
> > > > > >  	if (unlikely(rseq_update_cpu_node_id(t)))
> > > > > >  		goto error;
> > > > > > +	task_queue_mm_cid(t);
> > > > > >  	return;
> > > > > >
> > > > > >  error:
> > > > > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > > > > > index 9aecd914ac691..ee35f9962444b 100644
> > > > > > --- a/kernel/sched/core.c
> > > > > > +++ b/kernel/sched/core.c
> > > > > > @@ -5663,7 +5663,6 @@ void sched_tick(void)
> > > > > >  	resched_latency = cpu_resched_latency(rq);
> > > > > >  	calc_global_load_tick(rq);
> > > > > >  	sched_core_tick(rq);
> > > > > > -	task_tick_mm_cid(rq, donor);
> > > >
> > > > I agree that this approach is promising, however I am
> > > > concerned about the fact that a task running alone on its
> > > > runqueue (thus without preemption) for a long time will
> > > > never recompact mm_cid, and also will never update its
> > > > mm_cid field.
> > > >
> > > > So I am tempted to insert this in the sched_tick to cover
> > > > that scenario:
> > > >
> > > > 	rseq_preempt(current);
> > > >
> > > > This would ensure that the task runs
> > > > __rseq_handle_notify_resume() at least once per tick.
> > > >
> >
> > Right, I thought about this scenario but forgot to address it in
> > the final patch.
> > We could have a test covering that: instead of sleeping, the task
> > busy-waits.
> >
> > Does __rseq_handle_notify_resume need to run in this case,
> > besides for the cid compaction, I mean? Otherwise we could again
> > just enqueue the work from there.
>
> Yes we need to do both:
>
> - compact cid,
> - run __rseq_handle_notify_resume to update the mm_cid.
>
> We don't care much if compacting the cid is done at some point
> and __rseq_handle_notify_resume only updates the mm_cid field on
> the following tick.
>
> So enqueuing the work is not sufficient there, I would really
> issue rseq_preempt(current) which makes sure a busy thread both
> triggers cid compaction *and* gets its mm_cid updated.
>
Sure, will do.
I've been trying to test this scenario, but it's quite hard to
trigger: I set all threads to FIFO at the highest priority, removed
all system calls from the leader thread (even the ones waiting for
the other threads) and replaced the usleep with a busy wait. This
runs on a VM, so I'm not sure whether interrupts interfere. The test
still passes.
Anyway, it seems a reasonable situation to happen, and I guess it
won't hurt to just run an rseq_preempt in the tick.
Testing and sending v8 without touching the selftest.
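For reference, the tick-side change discussed above would amount to something like the following sketch (illustrative only; whether the argument is current or the runqueue's donor task depends on the tree):

```diff
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ void sched_tick(void)
 	calc_global_load_tick(rq);
 	sched_core_tick(rq);
+	/*
+	 * Ensure a task running without preemption still executes
+	 * __rseq_handle_notify_resume() at least once per tick,
+	 * updating its mm_cid and queuing the compaction work.
+	 */
+	rseq_preempt(current);
```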
Thanks,
Gabriele