From: Bo Li <libo.gcs85@bytedance.com>
To: tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	dave.hansen@linux.intel.com, x86@kernel.org, luto@kernel.org,
	kees@kernel.org, akpm@linux-foundation.org, david@redhat.com,
	juri.lelli@redhat.com, vincent.guittot@linaro.org, peterz@infradead.org
Cc: dietmar.eggemann@arm.com, hpa@zytor.com, acme@kernel.org,
	namhyung@kernel.org, mark.rutland@arm.com,
	alexander.shishkin@linux.intel.com, jolsa@kernel.org,
	irogers@google.com, adrian.hunter@intel.com,
	kan.liang@linux.intel.com, viro@zeniv.linux.org.uk,
	brauner@kernel.org, jack@suse.cz, lorenzo.stoakes@oracle.com,
	Liam.Howlett@oracle.com, vbabka@suse.cz, rppt@kernel.org,
	surenb@google.com, mhocko@suse.com, rostedt@goodmis.org,
	bsegall@google.com, mgorman@suse.de, vschneid@redhat.com,
	jannh@google.com, pfalcato@suse.de, riel@surriel.com,
	harry.yoo@oracle.com, linux-kernel@vger.kernel.org,
	linux-perf-users@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org, duanxiongchun@bytedance.com,
	yinhongbo@bytedance.com, dengliang.1214@bytedance.com,
	xieyongji@bytedance.com, chaiwen.cc@bytedance.com,
	songmuchun@bytedance.com, yuanzhu@bytedance.com,
	chengguozhu@bytedance.com, sunjiadong.lff@bytedance.com,
	Bo Li <libo.gcs85@bytedance.com>
Subject: [RFC v2 34/35] RPAL: enable fast epoll wait
Date: Fri, 30 May 2025 17:28:02 +0800
X-Mailer: git-send-email 2.39.5 (Apple Git-154)
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit

When a kernel event occurs during an RPAL call and triggers a lazy switch,
the kernel context switches from the sender to the receiver. When the
receiver later returns from user space to the sender, a second lazy switch
is required to switch the kernel context back to the sender. In the current
implementation, after the second lazy switch, the receiver returns to user
space via rpal_kernel_ret() and then calls epoll_wait() from user space to
re-enter the kernel.
This causes the receiver to be unable to process epoll events for a long
period, degrading performance.

This patch introduces a fast epoll wait feature. During the second lazy
switch, the kernel configures epoll-related data structures so that the
receiver can enter the epoll wait state directly, without first returning
to user space and then calling epoll_wait(). The patch adds a new state,
RPAL_RECEIVER_STATE_READY_LS, which marks that the receiver can transition
to RPAL_RECEIVER_STATE_WAIT during the second lazy switch. The kernel then
performs this state transition in rpal_lazy_switch_tail().

Signed-off-by: Bo Li <libo.gcs85@bytedance.com>
---
 arch/x86/rpal/core.c |  29 ++++++++++++-
 fs/eventpoll.c       | 101 +++++++++++++++++++++++++++++++++++++++++++
 include/linux/rpal.h |   3 ++
 kernel/sched/core.c  |  13 +++++-
 4 files changed, 143 insertions(+), 3 deletions(-)

diff --git a/arch/x86/rpal/core.c b/arch/x86/rpal/core.c
index 2ac5d932f69c..7b6efde23e48 100644
--- a/arch/x86/rpal/core.c
+++ b/arch/x86/rpal/core.c
@@ -51,7 +51,25 @@ void rpal_lazy_switch_tail(struct task_struct *tsk)
 		atomic_cmpxchg(&rcc->receiver_state,
 			       rpal_build_call_state(tsk->rpal_sd),
 			       RPAL_RECEIVER_STATE_LAZY_SWITCH);
 	} else {
+		/* tsk is receiver */
+		int state;
+
+		rcc = tsk->rpal_rd->rcc;
+		state = atomic_read(&rcc->receiver_state);
+		/* receiver may be scheduled on another cpu after unlock. */
 		rpal_unlock_cpu(tsk);
+		/*
+		 * We must not use RPAL_RECEIVER_STATE_READY instead of
+		 * RPAL_RECEIVER_STATE_READY_LS. As the receiver may be in
+		 * TASK_RUNNING state and then call epoll_wait() again, the
+		 * state may become RPAL_RECEIVER_STATE_READY; we should not
+		 * change that state to RPAL_RECEIVER_STATE_WAIT, since it
+		 * was set by another RPAL call.
+		 */
+		if (state == RPAL_RECEIVER_STATE_READY_LS)
+			atomic_cmpxchg(&rcc->receiver_state,
+				       RPAL_RECEIVER_STATE_READY_LS,
+				       RPAL_RECEIVER_STATE_WAIT);
 		rpal_unlock_cpu(current);
 	}
 }
@@ -63,8 +81,14 @@ void rpal_kernel_ret(struct pt_regs *regs)
 	int state;
 
 	if (rpal_test_current_thread_flag(RPAL_RECEIVER_BIT)) {
-		rcc = current->rpal_rd->rcc;
-		regs->ax = rpal_try_send_events(current->rpal_rd->ep, rcc);
+		struct rpal_receiver_data *rrd = current->rpal_rd;
+
+		rcc = rrd->rcc;
+		if (rcc->timeout > 0)
+			hrtimer_cancel(&rrd->ep_sleeper.timer);
+		rpal_remove_ep_wait_list(rrd);
+		regs->ax = rpal_try_send_events(rrd->ep, rcc);
+		fdput(rrd->f);
 		atomic_xchg(&rcc->receiver_state, RPAL_RECEIVER_STATE_KERNEL_RET);
 	} else {
 		tsk = current->rpal_sd->receiver;
@@ -173,6 +197,7 @@ rpal_do_kernel_context_switch(struct task_struct *next, struct pt_regs *regs)
 	 * Otherwise, sender's user context will be corrupted.
 	 */
 	rebuild_receiver_stack(current->rpal_rd, regs);
+	rpal_fast_ep_poll(current->rpal_rd, regs);
 	rpal_schedule(next);
 	rpal_clear_task_thread_flag(prev, RPAL_LAZY_SWITCHED_BIT);
 	prev->rpal_rd->sender = NULL;
diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 791321639561..b70c1cd82335 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -2143,6 +2143,107 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 }
 
 #ifdef CONFIG_RPAL
+static void *rpal_get_eventpoll(struct rpal_receiver_data *rrd, struct pt_regs *regs)
+{
+	struct rpal_receiver_call_context *rcc = rrd->rcc;
+	int epfd = rcc->epfd;
+	struct epoll_event __user *events = rcc->events;
+	int maxevents = rcc->maxevents;
+	struct file *file;
+
+	if (maxevents <= 0 || maxevents > EP_MAX_EVENTS) {
+		regs->ax = -EINVAL;
+		return NULL;
+	}
+
+	if (!access_ok(events, maxevents * sizeof(struct epoll_event))) {
+		regs->ax = -EFAULT;
+		return NULL;
+	}
+
+	rrd->f = fdget(epfd);
+	file = fd_file(rrd->f);
+	if (!file) {
+		regs->ax = -EBADF;
+		return NULL;
+	}
+
+	if (!is_file_epoll(file)) {
+		regs->ax = -EINVAL;
+		fdput(rrd->f);
+		return NULL;
+	}
+
+	rrd->ep = file->private_data;
+	return rrd->ep;
+}
+
+void rpal_fast_ep_poll(struct rpal_receiver_data *rrd, struct pt_regs *regs)
+{
+	struct eventpoll *ep;
+	struct rpal_receiver_call_context *rcc = rrd->rcc;
+	ktime_t ts = 0;
+	struct hrtimer *ht = &rrd->ep_sleeper.timer;
+	int state;
+	int avail;
+
+	regs->orig_ax = __NR_epoll_wait;
+	ep = rpal_get_eventpoll(rrd, regs);
+
+	if (!ep || signal_pending(current) ||
+	    unlikely(ep_events_available(ep)) ||
+	    atomic_read(&rcc->ep_pending) || unlikely(rcc->timeout == 0)) {
+		INIT_LIST_HEAD(&rrd->ep_wait.entry);
+	} else {
+		/*
+		 * Here we use RPAL_RECEIVER_STATE_READY_LS to avoid conflict
+		 * with RPAL_RECEIVER_STATE_READY. As
+		 * RPAL_RECEIVER_STATE_READY_LS is converted to
+		 * RPAL_RECEIVER_STATE_WAIT in rpal_lazy_switch_tail(), it is
+		 * possible the receiver is woken at that time. Thus,
+		 * rpal_lazy_switch_tail() should figure out whether the
+		 * receiver state was set by the lazy switch or not. See
+		 * rpal_lazy_switch_tail() for details.
+		 */
+		state = atomic_xchg(&rcc->receiver_state, RPAL_RECEIVER_STATE_READY_LS);
+		if (unlikely(state != RPAL_RECEIVER_STATE_LAZY_SWITCH))
+			rpal_err("%s: unexpected state: %d\n", __func__, state);
+		init_waitqueue_func_entry(&rrd->ep_wait, rpal_ep_autoremove_wake_function);
+		rrd->ep_wait.private = rrd;
+		INIT_LIST_HEAD(&rrd->ep_wait.entry);
+		write_lock(&ep->lock);
+		set_current_state(TASK_INTERRUPTIBLE);
+		avail = ep_events_available(ep);
+		if (!avail)
+			__add_wait_queue_exclusive(&ep->wq, &rrd->ep_wait);
+		write_unlock(&ep->lock);
+		if (avail) {
+			/* keep state consistent when we enter rpal_kernel_ret() */
+			atomic_set(&rcc->receiver_state,
+				   RPAL_RECEIVER_STATE_LAZY_SWITCH);
+			set_current_state(TASK_RUNNING);
+			return;
+		}
+
+		if (rcc->timeout > 0) {
+			rrd->ep_sleeper.task = rrd->rcd.bp_task;
+			ts = ms_to_ktime(rcc->timeout);
+			hrtimer_start(ht, ts, HRTIMER_MODE_REL);
+		}
+	}
+}
+
+void rpal_remove_ep_wait_list(struct rpal_receiver_data *rrd)
+{
+	struct eventpoll *ep = (struct eventpoll *)rrd->ep;
+	wait_queue_entry_t *wait = &rrd->ep_wait;
+
+	if (!list_empty_careful(&wait->entry)) {
+		write_lock_irq(&ep->lock);
+		__remove_wait_queue(&ep->wq, wait);
+		write_unlock_irq(&ep->lock);
+	}
+}
+
 void *rpal_get_epitemep(wait_queue_entry_t *wait)
 {
 	struct epitem *epi = ep_item_from_wait(wait);
diff --git a/include/linux/rpal.h b/include/linux/rpal.h
index f5f4da63f28c..676113f0ba1f 100644
--- a/include/linux/rpal.h
+++ b/include/linux/rpal.h
@@ -126,6 +126,7 @@ enum rpal_receiver_state {
 	RPAL_RECEIVER_STATE_WAIT,
 	RPAL_RECEIVER_STATE_CALL,
 	RPAL_RECEIVER_STATE_LAZY_SWITCH,
+	RPAL_RECEIVER_STATE_READY_LS,
 	RPAL_RECEIVER_STATE_MAX,
 };
 
@@ -627,4 +628,6 @@ void rpal_resume_ep(struct task_struct *tsk);
 void *rpal_get_epitemep(wait_queue_entry_t *wait);
 int rpal_get_epitemfd(wait_queue_entry_t *wait);
 int rpal_try_send_events(void *ep, struct rpal_receiver_call_context *rcc);
+void rpal_remove_ep_wait_list(struct rpal_receiver_data *rrd);
+void rpal_fast_ep_poll(struct rpal_receiver_data *rrd, struct pt_regs *regs);
 #endif
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d6f8e0d76fc0..1728b04d1387 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3965,6 +3965,11 @@ static bool rpal_check_state(struct task_struct *p)
 	case RPAL_RECEIVER_STATE_LAZY_SWITCH:
 	case RPAL_RECEIVER_STATE_RUNNING:
 		break;
+	/*
+	 * Allowing RPAL_RECEIVER_STATE_READY_LS to be woken here would cause
+	 * IRQs to be enabled in rpal_unlock_cpu().
+	 */
+	case RPAL_RECEIVER_STATE_READY_LS:
 	case RPAL_RECEIVER_STATE_CALL:
 		rpal_set_task_thread_flag(p, RPAL_WAKE_BIT);
 		ret = false;
@@ -11403,7 +11408,13 @@ void __sched notrace rpal_schedule(struct task_struct *next)
 	prev_state = READ_ONCE(prev->__state);
 	if (prev_state) {
-		try_to_block_task(rq, prev, &prev_state);
+		if (!try_to_block_task(rq, prev, &prev_state)) {
+			/*
+			 * As the task enters the TASK_RUNNING state, we should
+			 * clean up the RPAL_RECEIVER_STATE_READY_LS status.
+			 */
+			rpal_check_ready_state(prev, RPAL_RECEIVER_STATE_READY_LS);
+		}
 		switch_count = &prev->nvcsw;
 	}
-- 
2.20.1