From: Gabriele Monaco
To: linux-kernel@vger.kernel.org, Andrew Morton, David Hildenbrand,
    Ingo Molnar, Peter Zijlstra, Mathieu Desnoyers, linux-mm@kvack.org
Cc: Gabriele Monaco, Ingo Molnar
Subject: [PATCH v2 3/4] sched: Compact RSEQ concurrency IDs in batches
Date: Wed, 16 Jul 2025 18:06:07 +0200
Message-ID: <20250716160603.138385-9-gmonaco@redhat.com>
In-Reply-To: <20250716160603.138385-6-gmonaco@redhat.com>
References: <20250716160603.138385-6-gmonaco@redhat.com>

Currently, task_mm_cid_work() is called from resume_user_mode_work().
This can delay the execution of the corresponding thread for the entire
duration of the function, negatively affecting the response time of
real-time tasks. In practice, we observe task_mm_cid_work() adding
30-35us of latency on a 128-core system; this order of magnitude is
meaningful under PREEMPT_RT.

Run task_mm_cid_work() in batches of up to CONFIG_RSEQ_CID_SCAN_BATCH
CPUs; this reduces the duration of the delay for each scan.

task_mm_cid_work() contains a mechanism to avoid running more
frequently than every 100ms. Keep this pseudo-periodicity only for
complete scans. This means each call to task_mm_cid_work() returns
early if the period has not elapsed and no scan is in progress (a scan
is in progress when the next batch to scan is not the first). This way
full scans are not excessively delayed, while each run, and the latency
it introduces, stays short.

Fixes: 223baf9d17f2 ("sched: Fix performance regression introduced by mm_cid")
Signed-off-by: Gabriele Monaco
---
 include/linux/mm_types.h | 15 +++++++++++++++
 init/Kconfig             | 12 ++++++++++++
 kernel/sched/core.c      | 37 ++++++++++++++++++++++++++++++++++---
 3 files changed, 61 insertions(+), 3 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index e6d6e468e64b4..a822966a584f3 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -995,6 +995,13 @@ struct mm_struct {
 		 * When the next mm_cid scan is due (in jiffies).
 		 */
 		unsigned long mm_cid_next_scan;
+		/*
+		 * @mm_cid_scan_batch: Counter for batch used in the next scan.
+		 *
+		 * Scan in batches of CONFIG_RSEQ_CID_SCAN_BATCH. This field
+		 * increments at each scan and reset when all batches are done.
+		 */
+		unsigned int mm_cid_scan_batch;
 		/**
 		 * @nr_cpus_allowed: Number of CPUs allowed for mm.
 		 *
@@ -1385,6 +1392,7 @@ static inline void mm_init_cid(struct mm_struct *mm, struct task_struct *p)
 	raw_spin_lock_init(&mm->cpus_allowed_lock);
 	cpumask_copy(mm_cpus_allowed(mm), &p->cpus_mask);
 	cpumask_clear(mm_cidmask(mm));
+	mm->mm_cid_scan_batch = 0;
 }
 
 static inline int mm_alloc_cid_noprof(struct mm_struct *mm, struct task_struct *p)
@@ -1423,8 +1431,15 @@ static inline void mm_set_cpus_allowed(struct mm_struct *mm, const struct cpumas
 
 static inline bool mm_cid_needs_scan(struct mm_struct *mm)
 {
+	unsigned int next_batch;
+
 	if (!mm)
 		return false;
+	next_batch = READ_ONCE(mm->mm_cid_scan_batch);
+	/* Always needs scan unless it's the first batch. */
+	if (CONFIG_RSEQ_CID_SCAN_BATCH * next_batch < num_possible_cpus() &&
+	    next_batch)
+		return true;
 	return time_after(jiffies, READ_ONCE(mm->mm_cid_next_scan));
 }
 #else /* CONFIG_SCHED_MM_CID */
diff --git a/init/Kconfig b/init/Kconfig
index 666783eb50abd..98d7f078cd6df 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1860,6 +1860,18 @@ config DEBUG_RSEQ
 
 	  If unsure, say N.
 
+config RSEQ_CID_SCAN_BATCH
+	int "Number of CPUs to scan at every mm_cid compaction attempt"
+	range 1 NR_CPUS
+	default 8
+	depends on SCHED_MM_CID
+	help
+	  CPUs are scanned pseudo-periodically to compact the CID of each task,
+	  this operation can take a longer amount of time on systems with many
+	  CPUs, resulting in higher scheduling latency for the current task.
+	  A higher value means the CID is compacted faster, but results in
+	  higher scheduling latency.
+
 config CACHESTAT_SYSCALL
 	bool "Enable cachestat() system call" if EXPERT
 	default y
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 27b856a1cb0a9..eae4c8faf980b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -10591,11 +10591,26 @@ static void sched_mm_cid_remote_clear_weight(struct mm_struct *mm, int cpu,
 
 void task_mm_cid_work(struct task_struct *t)
 {
+	int weight, cpu, from_cpu, this_batch, next_batch, idx;
 	unsigned long now = jiffies, old_scan, next_scan;
 	struct cpumask *cidmask;
-	int weight, cpu;
 	struct mm_struct *mm = t->mm;
 
+	/*
+	 * This function is called from __rseq_handle_notify_resume, which
+	 * makes sure t is a user thread and is not exiting.
+	 */
+	this_batch = READ_ONCE(mm->mm_cid_scan_batch);
+	next_batch = this_batch + 1;
+	from_cpu = cpumask_nth(this_batch * CONFIG_RSEQ_CID_SCAN_BATCH,
+			       cpu_possible_mask);
+	if (from_cpu >= nr_cpu_ids) {
+		from_cpu = 0;
+		next_batch = 1;
+	}
+	/* Delay scan only if we are done with all cpus. */
+	if (from_cpu != 0)
+		goto cid_compact;
 	old_scan = READ_ONCE(mm->mm_cid_next_scan);
 	next_scan = now + msecs_to_jiffies(MM_CID_SCAN_DELAY);
 	if (!old_scan) {
@@ -10611,17 +10626,33 @@ void task_mm_cid_work(struct task_struct *t)
 		return;
 	if (!try_cmpxchg(&mm->mm_cid_next_scan, &old_scan, next_scan))
 		return;
+
+cid_compact:
+	if (!try_cmpxchg(&mm->mm_cid_scan_batch, &this_batch, next_batch))
+		return;
 	cidmask = mm_cidmask(mm);
 	/* Clear cids that were not recently used. */
-	for_each_possible_cpu(cpu)
+	idx = 0;
+	cpu = from_cpu;
+	for_each_cpu_from(cpu, cpu_possible_mask) {
+		if (idx == CONFIG_RSEQ_CID_SCAN_BATCH)
+			break;
 		sched_mm_cid_remote_clear_old(mm, cpu);
+		++idx;
+	}
 	weight = cpumask_weight(cidmask);
 	/*
 	 * Clear cids that are greater or equal to the cidmask weight to
 	 * recompact it.
 	 */
-	for_each_possible_cpu(cpu)
+	idx = 0;
+	cpu = from_cpu;
+	for_each_cpu_from(cpu, cpu_possible_mask) {
+		if (idx == CONFIG_RSEQ_CID_SCAN_BATCH)
+			break;
 		sched_mm_cid_remote_clear_weight(mm, cpu, weight);
+		++idx;
+	}
 }
 
 void init_sched_mm_cid(struct task_struct *t)
-- 
2.50.1
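
For illustration only, the batching and period gating described in the
commit message can be modelled in user space roughly as follows. This is
a simplified sketch, not the kernel code: struct scan_state, scan_step()
and SCAN_DELAY are hypothetical names, CPUs are assumed to be numbered
contiguously from 0, time is an abstract tick counter with no wraparound
handling, and the try_cmpxchg() based synchronisation is left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for MM_CID_SCAN_DELAY, in abstract ticks. */
#define SCAN_DELAY 100UL

/* Hypothetical model of the per-mm scan state; not the kernel struct. */
struct scan_state {
	unsigned long next_scan;	/* earliest tick a new full scan may start */
	unsigned int batch;		/* index of the next batch to scan */
};

/*
 * Model one invocation of the scan work.
 * Returns the number of CPUs visited, or 0 if the call was skipped
 * because the period has not elapsed and no scan is in progress.
 */
static unsigned int scan_step(struct scan_state *s, unsigned long now,
			      unsigned int nr_cpus, unsigned int batch_size)
{
	unsigned int from, to;

	assert(batch_size > 0);

	from = s->batch * batch_size;
	if (from >= nr_cpus) {
		/* The previous full scan is complete: wrap to the first batch. */
		from = 0;
		s->batch = 0;
	}

	if (from == 0) {
		/* Only the first batch of a full scan honours the period. */
		if (now < s->next_scan)
			return 0;
		s->next_scan = now + SCAN_DELAY;
	}

	/* Visit at most batch_size CPUs, clamped to the number of CPUs. */
	to = from + batch_size;
	if (to > nr_cpus)
		to = nr_cpus;

	s->batch++;
	return to - from;
}
```

In this model, with 20 CPUs and a batch size of 8, a full scan takes
three calls visiting 8, 8 and 4 CPUs. A fourth call made before the
period elapses returns without scanning, and the next full scan starts
once the period is over.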