From: Zhiwei Jiang <qq282012236@gmail.com>
Date: Wed, 23 Apr 2025 10:49:46 +0800
Subject: Re: [PATCH v2 1/2] io_uring: Add new functions to handle user fault scenarios
To: Jens Axboe
Cc: viro@zeniv.linux.org.uk, brauner@kernel.org, jack@suse.cz,
    akpm@linux-foundation.org, peterx@redhat.com, asml.silence@gmail.com,
    linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
    linux-kernel@vger.kernel.org, io-uring@vger.kernel.org
On Wed, Apr 23, 2025 at 1:33 AM Jens Axboe wrote:
>
> On 4/22/25 11:04 AM, Zhiwei Jiang wrote:
> > On Wed, Apr 23, 2025 at 12:32 AM Jens Axboe wrote:
> >>
> >> On 4/22/25 10:29 AM, Zhiwei Jiang wrote:
> >>> diff --git a/io_uring/io-wq.h b/io_uring/io-wq.h
> >>> index d4fb2940e435..8567a9c819db 100644
> >>> --- a/io_uring/io-wq.h
> >>> +++ b/io_uring/io-wq.h
> >>> @@ -70,8 +70,10 @@ enum io_wq_cancel io_wq_cancel_cb(struct io_wq *wq, work_cancel_fn *cancel,
> >>>                                     void *data, bool cancel_all);
> >>>
> >>>  #if defined(CONFIG_IO_WQ)
> >>> -extern void io_wq_worker_sleeping(struct task_struct *);
> >>> -extern void io_wq_worker_running(struct task_struct *);
> >>> +extern void io_wq_worker_sleeping(struct task_struct *tsk);
> >>> +extern void io_wq_worker_running(struct task_struct *tsk);
> >>> +extern void set_userfault_flag_for_ioworker(void);
> >>> +extern void clear_userfault_flag_for_ioworker(void);
> >>>  #else
> >>>  static inline void io_wq_worker_sleeping(struct task_struct *tsk)
> >>>  {
> >>> @@ -79,6 +81,12 @@ static inline void io_wq_worker_sleeping(struct task_struct *tsk)
> >>>  static inline void io_wq_worker_running(struct task_struct *tsk)
> >>>  {
> >>>  }
> >>> +static inline void set_userfault_flag_for_ioworker(void)
> >>> +{
> >>> +}
> >>> +static inline void clear_userfault_flag_for_ioworker(void)
> >>> +{
> >>> +}
> >>>  #endif
> >>>
> >>>  static inline bool io_wq_current_is_worker(void)
> >>
> >> This should go in include/linux/io_uring.h and then userfaultfd would
> >> not have to include io_uring private headers.
> >>
> >> But that's beside the point, like I said we still need to get to the
> >> bottom of what is going on here first, rather than try and paper around
> >> it. So please don't post more versions of this before we have that
> >> understanding.
> >>
> >> See previous emails on 6.8 and other kernel versions.
> >>
> >> --
> >> Jens Axboe
> > The issue did not involve creating new worker processes. Instead, the
> > existing IOU worker kernel threads (about a dozen) associated with the
> > VM process were fully utilizing the CPU without writing any data,
> > caused by a fault taken while reading user data pages in
> > fault_in_iov_iter_readable() when pulling user memory into kernel
> > space.
>
> OK that makes more sense, I can certainly reproduce a loop in this path:
>
> iou-wrk-726   729 36.910071:   9737 cycles:P:
>         ffff800080456c44 handle_userfault+0x47c
>         ffff800080381fc0 hugetlb_fault+0xb68
>         ffff80008031fee4 handle_mm_fault+0x2fc
>         ffff8000812ada6c do_page_fault+0x1e4
>         ffff8000812ae024 do_translation_fault+0x9c
>         ffff800080049a9c do_mem_abort+0x44
>         ffff80008129bd78 el1_abort+0x38
>         ffff80008129ceb4 el1h_64_sync_handler+0xd4
>         ffff8000800112b4 el1h_64_sync+0x6c
>         ffff80008030984c fault_in_readable+0x74
>         ffff800080476f3c iomap_file_buffered_write+0x14c
>         ffff8000809b1230 blkdev_write_iter+0x1a8
>         ffff800080a1f378 io_write+0x188
>         ffff800080a14f30 io_issue_sqe+0x68
>         ffff800080a155d0 io_wq_submit_work+0xa8
>         ffff800080a32afc io_worker_handle_work+0x1f4
>         ffff800080a332b8 io_wq_worker+0x110
>         ffff80008002dd38 ret_from_fork+0x10
>
> which seems to be expected, we'd continually try and fault in the
> ranges, if the userfaultfd handler isn't filling them.
>
> I guess this is where I'm still confused, because I don't see how this
> is different from if you have a normal write(2) syscall doing the same
> thing - you'd get the same looping.
>
> ??
>
> > This issue occurs during VM snapshot loading (which uses userfaultfd
> > for on-demand memory loading), while a task in the guest is writing
> > data to disk.
> >
> > Normally, the VM first triggers a user fault to fill the page table.
> > So by the time the IOU worker thread runs, the page tables are already
> > populated, and no fault happens when faulting in memory pages in
> > fault_in_iov_iter_readable().
> >
> > I suspect that during snapshot loading, a memory access in the VM
> > triggers an async page fault handled by a kernel thread, while the
> > IOU worker's async kernel thread is also running. Maybe the IOU
> > worker's thread is scheduled first. I'm going to bed now.
>
> Ah ok, so what you're saying is that because we end up not sleeping
> (because a signal is pending, it seems), then the fault will never get
> filled and hence progress not made? And the signal is pending because
> someone tried to create a new worker, and this work is not getting
> processed.
>
> --
> Jens Axboe

handle_userfault() {
  hugetlb_vma_lock_read();
  _raw_spin_lock_irq() {
    __pv_queued_spin_lock_slowpath();
  }
  vma_mmu_pagesize() {
    hugetlb_vm_op_pagesize();
  }
  huge_pte_offset();
  hugetlb_vma_unlock_read();
  up_read();
  __wake_up() {
    _raw_spin_lock_irqsave() {
      __pv_queued_spin_lock_slowpath();
    }
    __wake_up_common();
    _raw_spin_unlock_irqrestore();
  }
  schedule() {
    io_wq_worker_sleeping() {
      io_wq_dec_running();
    }
    rcu_note_context_switch();
    raw_spin_rq_lock_nested() {
      _raw_spin_lock();
    }
    update_rq_clock();
    pick_next_task() {
      pick_next_task_fair() {
        update_curr() {
          update_curr_se();
          __calc_delta.constprop.0();
          update_min_vruntime();
        }
        check_cfs_rq_runtime();
        pick_next_entity() {
          pick_eevdf();
        }
        update_curr() {
          update_curr_se();
          __calc_delta.constprop.0();
          update_min_vruntime();
        }
        check_cfs_rq_runtime();
        pick_next_entity() {
          pick_eevdf();
        }
        update_curr() {
          update_curr_se();
          update_min_vruntime();
          cpuacct_charge();
          __cgroup_account_cputime() {
            cgroup_rstat_updated();
          }
        }
        check_cfs_rq_runtime();
        pick_next_entity() {
          pick_eevdf();
        }
      }
    }
    raw_spin_rq_unlock();
    io_wq_worker_running();
  }
  _raw_spin_lock_irq() {
    __pv_queued_spin_lock_slowpath();
  }
  userfaultfd_ctx_put();
}

The execution flow above is the one that kept faulting repeatedly in the
IOU worker during the issue.
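For what it's worth, here is my reading of why schedule() comes straight
back in that trace. This is only a simplified sketch to illustrate the
sequence, not the actual kernel source; the function names follow the
real ones, but the waitqueue handling is elided:

        /*
         * Simplified sketch of the wait in handle_userfault(): the
         * faulting task queues itself on the userfaultfd waitqueue and
         * tries to sleep until the handler resolves the fault.
         */
        static vm_fault_t handle_userfault_sketch(struct vm_fault *vmf)
        {
                /* ... add this task to the userfaultfd wait queue ... */

                set_current_state(TASK_INTERRUPTIBLE);

                /*
                 * If a signal (or TIF_NOTIFY_SIGNAL, e.g. from a request
                 * to create a new io-wq worker) is already pending,
                 * schedule() returns immediately instead of blocking,
                 * and io_wq_worker_sleeping()/io_wq_worker_running()
                 * fire back to back, exactly as in the trace above.
                 */
                schedule();
                __set_current_state(TASK_RUNNING);

                /* ... dequeue; the page was never filled in ... */

                return VM_FAULT_RETRY; /* caller re-faults -> busy loop */
        }

If that reading is right, the cycle is: fault, fail to sleep, return
VM_FAULT_RETRY, re-fault, forever, because the pending notification is
never processed inside this kernel loop.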
The entire fault path, including the final userfault handling code you
are seeing, was being triggered in an infinite loop. That's why I traced
it and found that io_wq_worker_running() returns early, causing the flow
to differ from a normal user fault, where the worker should be sleeping.
However, your call stack appears to behave normally, which makes me
curious about what's different in the execution flow. Would you be able
to share your test case code so I can study it and try to reproduce the
behavior on my side?
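For context, my understanding of how the io-wq hooks wrap schedule() is
roughly the following (a simplified sketch of what I believe
sched_submit_work()/sched_update_worker() do in kernel/sched/core.c, not
the exact code):

        /*
         * Before the context switch: an io-wq worker announces it is
         * about to sleep, which may request creation of a new worker so
         * queued io-wq work can still make progress.
         */
        static void sched_submit_work_sketch(struct task_struct *tsk)
        {
                if (tsk->flags & PF_IO_WORKER)
                        io_wq_worker_sleeping(tsk);
        }

        /* After schedule() returns: the worker is running again. */
        static void sched_update_worker_sketch(struct task_struct *tsk)
        {
                if (tsk->flags & PF_IO_WORKER)
                        io_wq_worker_running(tsk);
        }

        void schedule_sketch(void)
        {
                sched_submit_work_sketch(current);

                /*
                 * __schedule() runs here; if the task marked itself
                 * TASK_INTERRUPTIBLE but a signal is already pending,
                 * it returns without ever context-switching.
                 */

                sched_update_worker_sketch(current);
        }

If the pending-signal case short-circuits the sleep, the sleeping/running
pair fires without the worker ever blocking, which is what I see in the
function_graph trace above.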