From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Thu, 20 Feb 2025 17:31:00 +0000 (UTC)
From: Gabriele Monaco <gmonaco@redhat.com>
To: Mathieu Desnoyers
Cc: linux-kernel@vger.kernel.org, Andrew Morton, Ingo Molnar,
 Peter Zijlstra, "Paul E. McKenney", linux-mm@kvack.org, Shuah Khan
Subject: Re: [PATCH v8 1/2] sched: Move task_mm_cid_work to mm work_struct
In-Reply-To: <6b542d40-8163-4156-93af-b3f26c397010@efficios.com>
References: <20250220102639.141314-1-gmonaco@redhat.com>
 <20250220102639.141314-2-gmonaco@redhat.com>
 <6b542d40-8163-4156-93af-b3f26c397010@efficios.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8

On 2025-02-20 15:47 UTC, Mathieu Desnoyers wrote:
> On 2025-02-20 10:30, Gabriele Monaco wrote:
>>
>> On Thu, 2025-02-20 at 09:42 -0500, Mathieu Desnoyers wrote:
>>> On 2025-02-20 05:26, Gabriele Monaco wrote:
>>>> Currently, the task_mm_cid_work function is called in a task work
>>>> triggered by a scheduler tick to frequently compact the mm_cids of
>>>> each process. This can delay the execution of the corresponding
>>>> thread for the entire duration of the function, negatively
>>>> affecting the response time of real-time tasks. In practice, we
>>>> observe task_mm_cid_work increasing latency by 30-35us on a
>>>> 128-core system; this order of magnitude is meaningful under
>>>> PREEMPT_RT.
>>>>
>>>> Run task_mm_cid_work in a new work_struct connected to the
>>>> mm_struct rather than in the task context before returning to
>>>> userspace.
>>>>
>>>> This work_struct is initialised with the mm and disabled before
>>>> freeing it. The queuing of the work happens while returning to
>>>> userspace in __rseq_handle_notify_resume, maintaining the checks
>>>> to avoid running more frequently than MM_CID_SCAN_DELAY.
>>>> To make sure this happens predictably also on long-running tasks,
>>>> we trigger a call to __rseq_handle_notify_resume also from the
>>>> scheduler tick (which in turn will also schedule the work item).
>>>>
>>>> The main advantage of this change is that the function can be
>>>> offloaded to a different CPU and even preempted by RT tasks.
>>>>
>>>> Moreover, this new behaviour is more predictable with periodic
>>>> tasks with short runtime, which may rarely run during a scheduler
>>>> tick. Now, the work is always scheduled when the task returns to
>>>> userspace.
>>>>
>>>> The work is disabled during mmdrop: since the function cannot
>>>> sleep in all kernel configurations, we cannot wait for possibly
>>>> running work items to terminate. We make sure the mm is valid in
>>>> case the task is terminating by reserving it with mmgrab/mmdrop,
>>>> returning prematurely if we are really the last user while the
>>>> work gets to run. This situation is unlikely since we don't
>>>> schedule the work for exiting tasks, but we cannot rule it out.
>>>>
>>>> Fixes: 223baf9d17f2 ("sched: Fix performance regression introduced by mm_cid")
>>>> Signed-off-by: Gabriele Monaco
>>>> ---
>>> [...]
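(As a side note for reviewers, the lifecycle described in the changelog
above — a work item initialised together with the mm and disabled before
the mm is freed — boils down to something like the sketch below. This is
illustrative only, not the actual patch; the field name cid_work and the
helper names are made up here:)

```c
/* Sketch only: a work_struct embedded in mm_struct, set up when the mm
 * is allocated and torn down from mmdrop(). */
static void init_mm_cid_work(struct mm_struct *mm)
{
	INIT_WORK(&mm->cid_work, task_mm_cid_work);
}

static void destroy_mm_cid_work(struct mm_struct *mm)
{
	/*
	 * mmdrop() may run in contexts that cannot sleep, so we cannot
	 * flush or cancel synchronously; disable_work() only prevents
	 * the work from being (re)queued from this point on.
	 */
	disable_work(&mm->cid_work);
}
```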
>>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>>> index 9aecd914ac691..363e51dd25175 100644
>>>> --- a/kernel/sched/core.c
>>>> +++ b/kernel/sched/core.c
>>>> @@ -5663,7 +5663,7 @@ void sched_tick(void)
>>>>  		resched_latency = cpu_resched_latency(rq);
>>>>  	calc_global_load_tick(rq);
>>>>  	sched_core_tick(rq);
>>>> -	task_tick_mm_cid(rq, donor);
>>>> +	rseq_preempt(donor);
>>>>  	scx_tick(rq);
>>>>
>>>>  	rq_unlock(rq, &rf);
>>>
>>> There is one tiny important detail worth discussing here: I wonder
>>> if executing a __rseq_handle_notify_resume() on return to userspace
>>> on every scheduler tick will cause noticeable performance
>>> degradation?
>>>
>>> I think we can mitigate the impact if we can quickly compute the
>>> amount of contiguous unpreempted runtime since the last preemption;
>>> then we could use this as a way to only issue rseq_preempt() when
>>> there has been a minimum amount of contiguous unpreempted
>>> execution. Otherwise the rseq_preempt() already issued by
>>> preemption is enough.
>>>
>>> I'm not entirely sure how to compute this "unpreempted contiguous
>>> runtime" value within sched_tick() though, any ideas?
>>
>> I was a bit concerned but, at least from the latency perspective, I
>> didn't see any noticeable difference. This may also depend on the
>> system under test, though.
>
> I see this as an issue for performance-related workloads, not
> specifically for latency: we'd be adding additional rseq notifiers
> triggered by the tick in workloads that are CPU-heavy and would
> otherwise not run it after the tick. And we'd be adding this overhead
> even in scenarios where there are relatively frequent preemptions
> happening, because every tick would end up issuing rseq_preempt().
>
>> We may not need to do that; what we are doing here is improperly
>> calling rseq_preempt.
>> What if we call an rseq_tick which sets a different bit in
>> rseq_event_mask and take that into consideration while running
>> __rseq_handle_notify_resume?
>
> I'm not sure how much it would help. It may reduce the amount of
> work to do, but we'd still be doing additional work at every tick.
>
> See my other email about using
>
>     se->sum_exec_runtime - se->prev_sum_exec_runtime
>
> to only do rseq_preempt() when the last preemption was a certain
> amount of consecutive runtime ago. This is a better alternative I
> think.
>
>> We could follow the periodicity of the mm_cid compaction and, if the
>> rseq event is a tick, only continue if it is time to compact (and we
>> can return this value from task_queue_mm_cid to avoid checking
>> twice).
>
> Note that the mm_cid compaction delay is per-mm, and the fact that we
> want to run __rseq_handle_notify_resume periodically to update the
> mm_cid fields applies to all threads. Therefore, I don't think we can
> use the mm_cid compaction delay (per-mm) for this.

Alright, I didn't think of that; I can explore your suggestion. It
looks like most of it is already implemented.
What would be a good value to consider that the notify has waited long
enough? 100ms or even less? I don't think this would deserve a config.

>> We would be off by one period (committing the rseq happens before we
>> schedule the next compaction), but it should be acceptable:
>>     __rseq_handle_notify_resume()
>>     {
>>         should_queue = task_queue_mm_cid();

>> Another doubt about this case: here we are worrying about this
>> hypothetical long-running task, and I'm assuming this can happen
>> only for:
>> 1. isolated cpus with nohz_full and 1 task (the approach wouldn't
>>    work)
>
> The prev_sum_exec_runtime approach would work for this case.
I mean, in that case nohz_full and isolation would ensure nothing else
runs on the core, not even the tick (or perhaps that's also nohz=on).
I don't think there's much we can do in such a case, is there? (On that
core/context at least.)

>> or
>> 2. tasks with RT priority mostly starving the cpu
>
> Likewise.
>
>> In 1. I'm not sure the user would really need rseq in the first
>> place,
>
> Not sure, but I'd prefer to keep this option available unless we have
> a strong reason for not being able to support this.
>
>> in 2., assuming nothing like stalld/sched rt throttling is in place,
>> we will probably also never run the kworker doing mm_cid compaction
>> (I'm using the system_wq); for this reason it's probably wiser to
>> use the system_unbound_wq, which as far as I could understand is the
>> only one that would allow the work to run on any other CPU.
>> I might be missing something trivial here, what do you think though?
>
> Good point. I suspect using the system_unbound_wq would be preferable
> here, especially given that we're iterating over possible CPUs
> anyway, so I don't expect much gain from running in a system_wq over
> system_unbound_wq. Or am I missing something?

I don't think so, I just picked it as it was easier, but it's probably
best to switch.

Thanks,
Gabriele
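P.S. For concreteness, here is roughly what the queueing side would
look like after switching to system_unbound_wq, including the
mmgrab/mmdrop lifetime guard from the changelog. Sketch only, not
compile-tested; the cid_work field name is illustrative:

```c
/* Sketch, not the actual patch: queue the per-mm compaction work on
 * the unbound workqueue so it can run on any housekeeping CPU, and pin
 * the mm across the asynchronous execution. */
static void task_queue_mm_cid(struct task_struct *t)
{
	struct mm_struct *mm = t->mm;

	mmgrab(mm);	/* keep the mm alive until the work has run */
	if (!queue_work(system_unbound_wq, &mm->cid_work))
		mmdrop(mm);	/* already pending, drop the extra ref */
}
```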