From: Zhiwei Jiang <qq282012236@gmail.com>
To: viro@zeniv.linux.org.uk
Cc: brauner@kernel.org, jack@suse.cz, akpm@linux-foundation.org,
	peterx@redhat.com, axboe@kernel.dk, asml.silence@gmail.com,
	linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, io-uring@vger.kernel.org,
	Zhiwei Jiang <qq282012236@gmail.com>
Subject: [PATCH 1/2] io_uring: Add new functions to handle user fault scenarios
Date: Tue, 22 Apr 2025 10:45:44 +0000
Message-Id: <20250422104545.1199433-2-qq282012236@gmail.com>
In-Reply-To: <20250422104545.1199433-1-qq282012236@gmail.com>
References: <20250422104545.1199433-1-qq282012236@gmail.com>
MIME-Version: 1.0
In the Firecracker VM scenario, we sporadically encountered threads stuck in the UN (uninterruptible) state with the following call stack:

[<0>] io_wq_put_and_exit+0xa1/0x210
[<0>] io_uring_clean_tctx+0x8e/0xd0
[<0>] io_uring_cancel_generic+0x19f/0x370
[<0>] __io_uring_cancel+0x14/0x20
[<0>] do_exit+0x17f/0x510
[<0>]
do_group_exit+0x35/0x90
[<0>] get_signal+0x963/0x970
[<0>] arch_do_signal_or_restart+0x39/0x120
[<0>] syscall_exit_to_user_mode+0x206/0x260
[<0>] do_syscall_64+0x8d/0x170
[<0>] entry_SYSCALL_64_after_hwframe+0x78/0x80

The cause is a large number of iou-wrk kernel threads saturating the CPU and never exiting. When the issue occurs, CPU usage is 100% and the machine can only be recovered by rebooting. Each thread's profile appears as follows:

iou-wrk-44588 [kernel.kallsyms] [k] ret_from_fork_asm
iou-wrk-44588 [kernel.kallsyms] [k] ret_from_fork
iou-wrk-44588 [kernel.kallsyms] [k] io_wq_worker
iou-wrk-44588 [kernel.kallsyms] [k] io_worker_handle_work
iou-wrk-44588 [kernel.kallsyms] [k] io_wq_submit_work
iou-wrk-44588 [kernel.kallsyms] [k] io_issue_sqe
iou-wrk-44588 [kernel.kallsyms] [k] io_write
iou-wrk-44588 [kernel.kallsyms] [k] blkdev_write_iter
iou-wrk-44588 [kernel.kallsyms] [k] iomap_file_buffered_write
iou-wrk-44588 [kernel.kallsyms] [k] iomap_write_iter
iou-wrk-44588 [kernel.kallsyms] [k] fault_in_iov_iter_readable
iou-wrk-44588 [kernel.kallsyms] [k] fault_in_readable
iou-wrk-44588 [kernel.kallsyms] [k] asm_exc_page_fault
iou-wrk-44588 [kernel.kallsyms] [k] exc_page_fault
iou-wrk-44588 [kernel.kallsyms] [k] do_user_addr_fault
iou-wrk-44588 [kernel.kallsyms] [k] handle_mm_fault
iou-wrk-44588 [kernel.kallsyms] [k] hugetlb_fault
iou-wrk-44588 [kernel.kallsyms] [k] hugetlb_no_page
iou-wrk-44588 [kernel.kallsyms] [k] hugetlb_handle_userfault
iou-wrk-44588 [kernel.kallsyms] [k] handle_userfault
iou-wrk-44588 [kernel.kallsyms] [k] schedule
iou-wrk-44588 [kernel.kallsyms] [k] __schedule
iou-wrk-44588 [kernel.kallsyms] [k] __raw_spin_unlock_irq
iou-wrk-44588 [kernel.kallsyms] [k] io_wq_worker_sleeping

I tracked the address that triggered the fault and the related function graph, as well as the wake-up side of the userfaultfd, and discovered the following: when an IOU worker faults in a user-space page, the page is associated with a userfault, but the worker does not sleep.
This is because, during scheduling, a check in the IOU worker context leads to an early return. Meanwhile, the userfaultfd listener on the user side never performs a COPY to respond, so the page table entry remains empty. Due to the early return, the worker does not sleep and wait to be woken as a normal user fault would; it keeps faulting at the same address, and the CPU loops.

Therefore, I believe it is necessary to handle user faults specifically by setting a new flag that allows the schedule function to continue in this case, making sure the thread sleeps. Export the relevant functions and the struct needed for the user fault path.

Signed-off-by: Zhiwei Jiang <qq282012236@gmail.com>
---
 io_uring/io-wq.c | 57 +++++++++++++++---------------------------------
 io_uring/io-wq.h | 45 ++++++++++++++++++++++++++++++++++++--
 2 files changed, 61 insertions(+), 41 deletions(-)

diff --git a/io_uring/io-wq.c b/io_uring/io-wq.c
index 04a75d666195..8faad766d565 100644
--- a/io_uring/io-wq.c
+++ b/io_uring/io-wq.c
@@ -26,12 +26,6 @@
 #define WORKER_IDLE_TIMEOUT	(5 * HZ)
 #define WORKER_INIT_LIMIT	3

-enum {
-	IO_WORKER_F_UP		= 0,	/* up and active */
-	IO_WORKER_F_RUNNING	= 1,	/* account as running */
-	IO_WORKER_F_FREE	= 2,	/* worker on free list */
-};
-
 enum {
 	IO_WQ_BIT_EXIT		= 0,	/* wq exiting */
 };
@@ -40,33 +34,6 @@ enum {
 	IO_ACCT_STALLED_BIT	= 0,	/* stalled on hash */
 };

-/*
- * One for each thread in a wq pool
- */
-struct io_worker {
-	refcount_t ref;
-	unsigned long flags;
-	struct hlist_nulls_node nulls_node;
-	struct list_head all_list;
-	struct task_struct *task;
-	struct io_wq *wq;
-	struct io_wq_acct *acct;
-
-	struct io_wq_work *cur_work;
-	raw_spinlock_t lock;
-
-	struct completion ref_done;
-
-	unsigned long create_state;
-	struct callback_head create_work;
-	int init_retries;
-
-	union {
-		struct rcu_head rcu;
-		struct delayed_work work;
-	};
-};
-
 #if BITS_PER_LONG == 64
 #define IO_WQ_HASH_ORDER	6
 #else
@@ -706,6 +673,16 @@ static int io_wq_worker(void *data)
 	return 0;
 }

+void set_userfault_flag_for_ioworker(struct io_worker *worker)
+{
+	set_bit(IO_WORKER_F_FAULT, &worker->flags);
+}
+
+void clear_userfault_flag_for_ioworker(struct io_worker *worker)
+{
+	clear_bit(IO_WORKER_F_FAULT, &worker->flags);
+}
+
 /*
  * Called when a worker is scheduled in. Mark us as currently running.
  */
@@ -715,12 +692,14 @@ void io_wq_worker_running(struct task_struct *tsk)
 	if (!worker)
 		return;
-	if (!test_bit(IO_WORKER_F_UP, &worker->flags))
-		return;
-	if (test_bit(IO_WORKER_F_RUNNING, &worker->flags))
-		return;
-	set_bit(IO_WORKER_F_RUNNING, &worker->flags);
-	io_wq_inc_running(worker);
+	if (!test_bit(IO_WORKER_F_FAULT, &worker->flags)) {
+		if (!test_bit(IO_WORKER_F_UP, &worker->flags))
+			return;
+		if (test_bit(IO_WORKER_F_RUNNING, &worker->flags))
+			return;
+		set_bit(IO_WORKER_F_RUNNING, &worker->flags);
+		io_wq_inc_running(worker);
+	}
 }

 /*
diff --git a/io_uring/io-wq.h b/io_uring/io-wq.h
index d4fb2940e435..9444912d038d 100644
--- a/io_uring/io-wq.h
+++ b/io_uring/io-wq.h
@@ -15,6 +15,13 @@ enum {
 	IO_WQ_HASH_SHIFT	= 24,	/* upper 8 bits are used for hash key */
 };

+enum {
+	IO_WORKER_F_UP		= 0,	/* up and active */
+	IO_WORKER_F_RUNNING	= 1,	/* account as running */
+	IO_WORKER_F_FREE	= 2,	/* worker on free list */
+	IO_WORKER_F_FAULT	= 3,	/* used for userfault */
+};
+
 enum io_wq_cancel {
 	IO_WQ_CANCEL_OK,	/* cancelled before started */
 	IO_WQ_CANCEL_RUNNING,	/* found, running, and attempted cancelled */
@@ -24,6 +31,32 @@ enum io_wq_cancel {
 typedef struct io_wq_work *(free_work_fn)(struct io_wq_work *);
 typedef void (io_wq_work_fn)(struct io_wq_work *);

+/*
+ * One for each thread in a wq pool
+ */
+struct io_worker {
+	refcount_t ref;
+	unsigned long flags;
+	struct hlist_nulls_node nulls_node;
+	struct list_head all_list;
+	struct task_struct *task;
+	struct io_wq *wq;
+	struct io_wq_acct *acct;
+
+	struct io_wq_work *cur_work;
+	raw_spinlock_t lock;
+	struct completion ref_done;
+
+	unsigned long create_state;
+	struct callback_head create_work;
+	int init_retries;
+
+	union {
+		struct rcu_head rcu;
+		struct delayed_work work;
+	};
+};
+
 struct io_wq_hash {
 	refcount_t refs;
 	unsigned long map;
@@ -70,8 +103,10 @@ enum io_wq_cancel io_wq_cancel_cb(struct io_wq *wq, work_cancel_fn *cancel,
 					void *data, bool cancel_all);

 #if defined(CONFIG_IO_WQ)
-extern void io_wq_worker_sleeping(struct task_struct *);
-extern void io_wq_worker_running(struct task_struct *);
+extern void io_wq_worker_sleeping(struct task_struct *tsk);
+extern void io_wq_worker_running(struct task_struct *tsk);
+extern void set_userfault_flag_for_ioworker(struct io_worker *worker);
+extern void clear_userfault_flag_for_ioworker(struct io_worker *worker);
 #else
 static inline void io_wq_worker_sleeping(struct task_struct *tsk)
 {
@@ -79,6 +114,12 @@ static inline void io_wq_worker_sleeping(struct task_struct *tsk)
 static inline void io_wq_worker_running(struct task_struct *tsk)
 {
 }
+static inline void set_userfault_flag_for_ioworker(struct io_worker *worker)
+{
+}
+static inline void clear_userfault_flag_for_ioworker(struct io_worker *worker)
+{
+}
 #endif

 static inline bool io_wq_current_is_worker(void)
-- 
2.34.1