Re: [PATCH v6 2/3] sched: Move task_mm_cid_work to mm delayed work

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Gabriele Monaco <gmonaco@redhat.com>
To: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>,
	 linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org, aubrey.li@linux.intel.com,
	yu.c.chen@intel.com, Andrew Morton <akpm@linux-foundation.org>,
	Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar	 <mingo@kernel.org>,
	"Paul E. McKenney" <paulmck@kernel.org>,
	Shuah Khan	 <shuah@kernel.org>
Subject: Re: [PATCH v6 2/3] sched: Move task_mm_cid_work to mm delayed work
Date: Fri, 14 Feb 2025 07:44:13 +0100	[thread overview]
Message-ID: <ed5ff8e242c7abb760f408b4fea9701d9b39d08d.camel@redhat.com> (raw)
In-Reply-To: <0888d6a3-8dea-455b-893f-d8d929e827e2@efficios.com>

On Thu, 2025-02-13 at 12:31 -0500, Mathieu Desnoyers wrote:
> On 2025-02-13 08:25, Gabriele Monaco wrote:
> > On Thu, 2025-02-13 at 14:52 +0800, kernel test robot wrote:
> > > kernel test robot noticed
> > > "WARNING:at_kernel/workqueue.c:#__queue_delayed_work" on:
> > > 
> > > [    2.640924][    T0] ------------[ cut here ]------------
> > > [ 2.641646][ T0] WARNING: CPU: 0 PID: 0 at
> > > kernel/workqueue.c:2495
> > > __queue_delayed_work (kernel/workqueue.c:2495 (discriminator 9))
> > > [    2.642874][    T0] Modules linked in:
> > > [    2.643381][    T0] CPU: 0 UID: 0 PID: 0 Comm: swapper Not
> > > tainted
> > > 6.14.0-rc2-00002-g287adf9e9c1f #1
> > > [    2.644582][    T0] Hardware name: QEMU Standard PC (i440FX +
> > > PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
> > > [ 2.645943][ T0] RIP: 0010:__queue_delayed_work
> > > (kernel/workqueue.c:2495 (discriminator 9))
> > 
> > There seem to be major problems with this configuration, I'm trying
> > to
> > understand what's wrong but, for the time being, this patchset is
> > not
> > ready for inclusion.
> 
> I'm staring at this now, and I'm thinking we could do a simpler
> change
> that would solve your RT issues without having to introduce a
> dependency
> on workqueue.c.
> 
> So if the culprit is that task_mm_cid_work() runs for too long on
> large
> many-cpus systems, why not break it up into smaller iterations ?
> 
> So rather than iterating on "for_each_possible_cpu", we could simply
> break this down into iteration on at most N cpus, so:
> 
> tick #1: iteration on CPUs 0 ..   N - 1
> tick #2: iteration on CPUs N .. 2*N - 1
> ...
> circling back to 0 when it reaches the number of possible cpus.
> 
> This N value could be configurable, e.g. CONFIG_RSEQ_CID_SCAN_BATCH,
> with a sane default. An RT system could decide to make that value
> lower.
> 
> Then all we need to do is remember which was that last observed cpu
> number in the mm struct, so the next tick picks up from there.
> 
> The main downside of this approach compared to scheduling delayed
> work in a workqueue is that it depends on having the mm be current
> when
> the scheduler tick happens. But perhaps this is something we could
> fix
> in a different way that does not add a dependency on workqueue. I'm
> not
> sure how though.
> 
> Thoughts ?

Mmh, that's indeed neat, what is not so good about this type of task
work is that it's a pure latency, it will happen before scheduling the
task and can't be interrupted.
The only acceptable latency is a bounded one and your idea is going in
that direction.

As you mentioned, this will make the compaction of mm_cid even more
rare and will likely have the test in 3/3 fail even more often, I'm not
sure if this is necessarily a bad thing though, since mm_cid compaction
is mainly aesthetic, so we could just increase the duration of the test
or even add a busy loop inside to make the task more likely to run this
compaction.

I gave a thought about this whole thing, don't take this too seriously,
but what I see essentially flawed in this approach is:
1. task_works are set on tick
2. task_works are run returning to userspace

1. is the issue with frequency and 2. can be mitigated by your idea,
but not essentially solved. What if we (also) did:
1+. set this task_work while switching in
2+. run this task_work while switching out to sleep (i.e. no
preemption)

1+. would make sure all threads have this task_work scheduled at a
certain point (perhaps a bit too much, but we have a periodicity check
in place). 2+. can effectively run the task in a moment when it is not
problematic for the real time response: on a preemptible kernel, as
soon as a task with higher priority is ready to run, it will preempt
the currently running one, the fact current is going to sleep willingly
implies there's no higher priority task ready, so likely no task really
caring about RT response is going to run after.
Not all tasks are ever going to sleep, so we must keep the original
TWA_RESUME in the task_work, especially for those long-running or low-
priority, both unlikely to be RT tasks.

I'm going to try a patch with this CONFIG_RSEQ_CID_SCAN_BATCH and
tuning the test to pass. In the future we can see if those ideas make
sense and perhaps bring them in.

Thanks,
Gabriele

     prev parent reply	other threads:[~2025-02-14  6:44 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <20250210153253.460471-1-gmonaco@redhat.com>
2025-02-10 15:32 ` [PATCH v6 1/3] sched: Compact RSEQ concurrency IDs with reduced threads and affinity Gabriele Monaco
2025-02-13 14:56   ` Mathieu Desnoyers
2025-02-18  7:53     ` Peter Zijlstra
2025-02-10 15:32 ` [PATCH v6 2/3] sched: Move task_mm_cid_work to mm delayed work Gabriele Monaco
2025-02-13  6:52   ` kernel test robot
2025-02-13 13:25     ` Gabriele Monaco
2025-02-13 13:55       ` Mathieu Desnoyers
2025-02-13 14:37         ` Gabriele Monaco
2025-02-13 14:52       ` Mathieu Desnoyers
2025-02-13 14:54         ` Gabriele Monaco
2025-02-13 17:31       ` Mathieu Desnoyers
2025-02-14  6:44         ` Gabriele Monaco [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ed5ff8e242c7abb760f408b4fea9701d9b39d08d.camel@redhat.com \
    --to=gmonaco@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=aubrey.li@linux.intel.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mathieu.desnoyers@efficios.com \
    --cc=mingo@kernel.org \
    --cc=paulmck@kernel.org \
    --cc=peterz@infradead.org \
    --cc=shuah@kernel.org \
    --cc=yu.c.chen@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox