From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Thu, 20 Feb 2025 17:31:00 +0000 (UTC)
From: Gabriele Monaco <gmonaco@redhat.com>
To: Mathieu Desnoyers
Cc: linux-kernel@vger.kernel.org, Andrew Morton, Ingo Molnar,
 Peter Zijlstra, "Paul E. McKenney", linux-mm@kvack.org, Shuah Khan
Subject: Re: [PATCH v8 1/2] sched: Move task_mm_cid_work to mm work_struct
In-Reply-To: <6b542d40-8163-4156-93af-b3f26c397010@efficios.com>
References: <20250220102639.141314-1-gmonaco@redhat.com>
 <20250220102639.141314-2-gmonaco@redhat.com>
 <6b542d40-8163-4156-93af-b3f26c397010@efficios.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8

On 2025-02-20 15:47 UTC, Mathieu Desnoyers wrote:
> On 2025-02-20 10:30, Gabriele Monaco wrote:
>>
>> On Thu, 2025-02-20 at 09:42 -0500, Mathieu Desnoyers wrote:
>>> On 2025-02-20 05:26, Gabriele Monaco wrote:
>>>> Currently, the task_mm_cid_work function is called in a task work
>>>> triggered by a scheduler tick to frequently compact the mm_cids of
>>>> each process. This can delay the execution of the corresponding
>>>> thread for the entire duration of the function, negatively
>>>> affecting the response time of real-time tasks. In practice, we
>>>> observe task_mm_cid_work increasing latency by 30-35us on a
>>>> 128-core system; this order of magnitude is meaningful under
>>>> PREEMPT_RT.
>>>>
>>>> Run task_mm_cid_work in a new work_struct connected to the
>>>> mm_struct rather than in the task context before returning to
>>>> userspace.
>>>>
>>>> This work_struct is initialised with the mm and disabled before
>>>> freeing it. The queuing of the work happens while returning to
>>>> userspace in __rseq_handle_notify_resume, maintaining the checks
>>>> to avoid running more frequently than MM_CID_SCAN_DELAY.
>>>> To make sure this happens predictably also on long-running tasks,
>>>> we trigger a call to __rseq_handle_notify_resume also from the
>>>> scheduler tick (which in turn will also schedule the work item).
>>>>
>>>> The main advantage of this change is that the function can be
>>>> offloaded to a different CPU and even preempted by RT tasks.
>>>>
>>>> Moreover, this new behaviour is more predictable with periodic
>>>> tasks with short runtime, which may rarely run during a scheduler
>>>> tick. Now, the work is always scheduled when the task returns to
>>>> userspace.
>>>>
>>>> The work is disabled during mmdrop: since the function cannot
>>>> sleep in all kernel configurations, we cannot wait for possibly
>>>> running work items to terminate. We make sure the mm is valid in
>>>> case the task is terminating by reserving it with mmgrab/mmdrop,
>>>> returning prematurely if we are really the last user while the
>>>> work gets to run. This situation is unlikely since we don't
>>>> schedule the work for exiting tasks, but we cannot rule it out.
>>>>
>>>> Fixes: 223baf9d17f2 ("sched: Fix performance regression introduced by mm_cid")
>>>> Signed-off-by: Gabriele Monaco
>>>> ---
>>> [...]
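(As a side note for reviewers, the lifecycle described in the changelog
above — a work item initialised together with the mm and disabled before
the mm is freed — boils down to something like the sketch below. This is
illustrative only, not the actual patch; the field name cid_work and the
helper names are made up here:)

```c
/* Sketch only: a work_struct embedded in mm_struct, set up when the mm
 * is allocated and torn down from mmdrop(). */
static void init_mm_cid_work(struct mm_struct *mm)
{
	INIT_WORK(&mm->cid_work, task_mm_cid_work);
}

static void destroy_mm_cid_work(struct mm_struct *mm)
{
	/*
	 * mmdrop() may run in contexts that cannot sleep, so we cannot
	 * flush or cancel synchronously; disable_work() only prevents
	 * the work from being (re)queued from this point on.
	 */
	disable_work(&mm->cid_work);
}
```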
>>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>>> index 9aecd914ac691..363e51dd25175 100644
>>>> --- a/kernel/sched/core.c
>>>> +++ b/kernel/sched/core.c
>>>> @@ -5663,7 +5663,7 @@ void sched_tick(void)
>>>>  		resched_latency = cpu_resched_latency(rq);
>>>>  	calc_global_load_tick(rq);
>>>>  	sched_core_tick(rq);
>>>> -	task_tick_mm_cid(rq, donor);
>>>> +	rseq_preempt(donor);
>>>>  	scx_tick(rq);
>>>>
>>>>  	rq_unlock(rq, &rf);
>>>
>>> There is one tiny important detail worth discussing here: I wonder
>>> if executing a __rseq_handle_notify_resume() on return to userspace
>>> on every scheduler tick will cause noticeable performance
>>> degradation?
>>>
>>> I think we can mitigate the impact if we can quickly compute the
>>> amount of contiguous unpreempted runtime since the last preemption;
>>> then we could use this as a way to only issue rseq_preempt() when
>>> there has been a minimum amount of contiguous unpreempted
>>> execution. Otherwise the rseq_preempt() already issued by
>>> preemption is enough.
>>>
>>> I'm not entirely sure how to compute this "unpreempted contiguous
>>> runtime" value within sched_tick() though, any ideas?
>>
>> I was a bit concerned but, at least from the latency perspective, I
>> didn't see any noticeable difference. This may also depend on the
>> system under test, though.
>
> I see this as an issue for performance-related workloads, not
> specifically for latency: we'd be adding additional rseq notifiers
> triggered by the tick in workloads that are CPU-heavy and would
> otherwise not run it after the tick. And we'd be adding this overhead
> even in scenarios where there are relatively frequent preemptions
> happening, because every tick would end up issuing rseq_preempt().
>
>> We may not need to do that; what we are doing here is improperly
>> calling rseq_preempt.
>> What if we call an rseq_tick which sets a different bit in
>> rseq_event_mask and take that into consideration while running
>> __rseq_handle_notify_resume?
>
> I'm not sure how much it would help. It may reduce the amount of
> work to do, but we'd still be doing additional work at every tick.
>
> See my other email about using
>
>     se->sum_exec_runtime - se->prev_sum_exec_runtime
>
> to only do rseq_preempt() when the last preemption was a certain
> amount of consecutive runtime ago. This is a better alternative I
> think.
>
>> We could follow the periodicity of the mm_cid compaction and, if the
>> rseq event is a tick, only continue if it is time to compact (and we
>> can return this value from task_queue_mm_cid to avoid checking
>> twice).
>
> Note that the mm_cid compaction delay is per-mm, and the fact that we
> want to run __rseq_handle_notify_resume periodically to update the
> mm_cid fields applies to all threads. Therefore, I don't think we can
> use the mm_cid compaction delay (per-mm) for this.

Alright, I didn't think of that; I can explore your suggestion. It
looks like most of it is already implemented.
What would be a good value to consider that the notify has waited long
enough? 100ms or even less? I don't think this would deserve a config.

>> We would be off by one period (committing the rseq happens before we
>> schedule the next compaction), but it should be acceptable:
>>     __rseq_handle_notify_resume()
>>     {
>>         should_queue = task_queue_mm_cid();

>> Another doubt about this case: here we are worrying about this
>> hypothetical long-running task, and I'm assuming this can happen
>> only for:
>> 1. isolated cpus with nohz_full and 1 task (the approach wouldn't
>>    work)
>
> The prev_sum_exec_runtime approach would work for this case.
I mean, in that case nohz_full and isolation would ensure nothing else
runs on the core, not even the tick (or perhaps that's also nohz=on).
I don't think there's much we can do in such a case, is there? (On that
core/context at least.)

>> or
>> 2. tasks with RT priority mostly starving the cpu
>
> Likewise.
>
>> In 1. I'm not sure the user would really need rseq in the first
>> place,
>
> Not sure, but I'd prefer to keep this option available unless we have
> a strong reason for not being able to support this.
>
>> in 2., assuming nothing like stalld/sched rt throttling is in place,
>> we will probably also never run the kworker doing mm_cid compaction
>> (I'm using the system_wq); for this reason it's probably wiser to
>> use the system_unbound_wq, which as far as I could understand is the
>> only one that would allow the work to run on any other CPU.
>> I might be missing something trivial here, what do you think though?
>
> Good point. I suspect using the system_unbound_wq would be preferable
> here, especially given that we're iterating over possible CPUs
> anyway, so I don't expect much gain from running in a system_wq over
> system_unbound_wq. Or am I missing something?

I don't think so, I just picked it as it was easier, but it's probably
best to switch.

Thanks,
Gabriele
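P.S. For concreteness, here is roughly what the queueing side would
look like after switching to system_unbound_wq, including the
mmgrab/mmdrop lifetime guard from the changelog. Sketch only, not
compile-tested; the cid_work field name is illustrative:

```c
/* Sketch, not the actual patch: queue the per-mm compaction work on
 * the unbound workqueue so it can run on any housekeeping CPU, and pin
 * the mm across the asynchronous execution. */
static void task_queue_mm_cid(struct task_struct *t)
{
	struct mm_struct *mm = t->mm;

	mmgrab(mm);	/* keep the mm alive until the work has run */
	if (!queue_work(system_unbound_wq, &mm->cid_work))
		mmdrop(mm);	/* already pending, drop the extra ref */
}
```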