From mboxrd@z Thu Jan  1 00:00:00 1970
From: Zhiwei Jiang <qq282012236@gmail.com>
Date: Wed, 23 Apr 2025 22:29:39 +0800
Subject: Re: [PATCH v2 1/2] io_uring: Add new functions to handle user fault scenarios
To: Jens Axboe
Cc: viro@zeniv.linux.org.uk, brauner@kernel.org, jack@suse.cz,
	akpm@linux-foundation.org, peterx@redhat.com, asml.silence@gmail.com,
	linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, io-uring@vger.kernel.org
In-Reply-To: <7bea9c74-7551-4312-bece-86c4ad5c982f@kernel.dk>
References: <20250422162913.1242057-1-qq282012236@gmail.com>
	<20250422162913.1242057-2-qq282012236@gmail.com>
	<14195206-47b1-4483-996d-3315aa7c33aa@kernel.dk>
	<7bea9c74-7551-4312-bece-86c4ad5c982f@kernel.dk>
Content-Type: text/plain; charset="UTF-8"

Jens Axboe wrote on Wed, 23 Apr 2025 at 21:34:
>
> On 4/22/25 8:49 PM, Zhiwei Jiang wrote:
> > On Wed, Apr 23, 2025 at 1:33 AM Jens Axboe wrote:
> >>
> >> On 4/22/25 11:04 AM, Zhiwei Jiang wrote:
> >>> On Wed, Apr 23, 2025 at 12:32 AM Jens Axboe wrote:
> >>>>
> >>>> On 4/22/25 10:29 AM, Zhiwei Jiang wrote:
> >>>>> diff --git a/io_uring/io-wq.h b/io_uring/io-wq.h
> >>>>> index d4fb2940e435..8567a9c819db 100644
> >>>>> --- a/io_uring/io-wq.h
> >>>>> +++ b/io_uring/io-wq.h
> >>>>> @@ -70,8 +70,10 @@ enum io_wq_cancel io_wq_cancel_cb(struct io_wq *wq, work_cancel_fn *cancel,
> >>>>> 					void *data, bool cancel_all);
> >>>>>
> >>>>> #if defined(CONFIG_IO_WQ)
> >>>>> -extern void io_wq_worker_sleeping(struct task_struct *);
> >>>>> -extern void io_wq_worker_running(struct task_struct *);
> >>>>> +extern void io_wq_worker_sleeping(struct task_struct *tsk);
> >>>>> +extern void io_wq_worker_running(struct task_struct *tsk);
> >>>>> +extern void set_userfault_flag_for_ioworker(void);
> >>>>> +extern void clear_userfault_flag_for_ioworker(void);
> >>>>> #else
> >>>>> static inline void io_wq_worker_sleeping(struct task_struct *tsk)
> >>>>> {
> >>>>> @@ -79,6 +81,12 @@ static inline void io_wq_worker_sleeping(struct task_struct *tsk)
> >>>>> static inline void io_wq_worker_running(struct task_struct *tsk)
> >>>>> {
> >>>>> }
> >>>>> +static inline void set_userfault_flag_for_ioworker(void)
> >>>>> +{
> >>>>> +}
> >>>>> +static inline void clear_userfault_flag_for_ioworker(void)
> >>>>> +{
> >>>>> +}
> >>>>> #endif
> >>>>>
> >>>>> static inline bool io_wq_current_is_worker(void)
> >>>>
> >>>> This should go in include/linux/io_uring.h and then userfaultfd would
> >>>> not have to include io_uring private headers.
> >>>>
> >>>> But that's beside the point, like I said we still need to get to the
> >>>> bottom of what is going on here first, rather than try and paper around
> >>>> it. So please don't post more versions of this before we have that
> >>>> understanding.
> >>>>
> >>>> See previous emails on 6.8 and other kernel versions.
> >>>>
> >>>> --
> >>>> Jens Axboe
> >>>
> >>> The issue did not involve creating new worker processes. Instead, the
> >>> existing IOU worker kernel threads (about a dozen) associated with the VM
> >>> process were fully utilizing CPU without writing data, caused by a fault
> >>> while reading user data pages in the fault_in_iov_iter_readable function
> >>> when pulling user memory into kernel space.
> >>
> >> OK, that makes more sense. I can certainly reproduce a loop in this path:
> >>
> >> iou-wrk-726 729 36.910071: 9737 cycles:P:
> >>   ffff800080456c44 handle_userfault+0x47c
> >>   ffff800080381fc0 hugetlb_fault+0xb68
> >>   ffff80008031fee4 handle_mm_fault+0x2fc
> >>   ffff8000812ada6c do_page_fault+0x1e4
> >>   ffff8000812ae024 do_translation_fault+0x9c
> >>   ffff800080049a9c do_mem_abort+0x44
> >>   ffff80008129bd78 el1_abort+0x38
> >>   ffff80008129ceb4 el1h_64_sync_handler+0xd4
> >>   ffff8000800112b4 el1h_64_sync+0x6c
> >>   ffff80008030984c fault_in_readable+0x74
> >>   ffff800080476f3c iomap_file_buffered_write+0x14c
> >>   ffff8000809b1230 blkdev_write_iter+0x1a8
> >>   ffff800080a1f378 io_write+0x188
> >>   ffff800080a14f30 io_issue_sqe+0x68
> >>   ffff800080a155d0 io_wq_submit_work+0xa8
> >>   ffff800080a32afc io_worker_handle_work+0x1f4
> >>   ffff800080a332b8 io_wq_worker+0x110
> >>   ffff80008002dd38 ret_from_fork+0x10
> >>
> >> which seems to be expected - we'd continually try and fault in the
> >> ranges if the userfaultfd handler isn't filling them.
> >>
> >> I guess this is where I'm still confused, because I don't see how this
> >> is different from if you have a normal write(2) syscall doing the same
> >> thing - you'd get the same looping.
> >>
> >> ??
> >>
> >>> This issue occurs during VM snapshot loading (which uses
> >>> userfaultfd for on-demand memory loading), while a task in the guest is
> >>> writing data to disk.
> >>>
> >>> Normally, the VM first triggers a user fault to fill the page table.
> >>> So in the IOU worker thread the page tables are already filled, and no
> >>> fault has a chance to happen when faulting in memory pages in
> >>> fault_in_iov_iter_readable.
> >>>
> >>> I suspect that during snapshot loading, a memory access in the
> >>> VM triggers an async page fault handled by a kernel thread,
> >>> while the IOU worker's async kernel thread is also running.
> >>> Maybe the problem shows up if the IOU worker's thread is scheduled first.
> >>> I'm going to bed now.
> >>
> >> Ah ok, so what you're saying is that because we end up not sleeping
> >> (because a signal is pending, it seems), then the fault will never get
> >> filled and hence progress not made? And the signal is pending because
> >> someone tried to create a new worker, and this work is not getting
> >> processed.
> >>
> >> --
> >> Jens Axboe
> >
> > handle_userfault() {
> >   hugetlb_vma_lock_read();
> >   _raw_spin_lock_irq() {
> >     __pv_queued_spin_lock_slowpath();
> >   }
> >   vma_mmu_pagesize() {
> >     hugetlb_vm_op_pagesize();
> >   }
> >   huge_pte_offset();
> >   hugetlb_vma_unlock_read();
> >   up_read();
> >   __wake_up() {
> >     _raw_spin_lock_irqsave() {
> >       __pv_queued_spin_lock_slowpath();
> >     }
> >     __wake_up_common();
> >     _raw_spin_unlock_irqrestore();
> >   }
> >   schedule() {
> >     io_wq_worker_sleeping() {
> >       io_wq_dec_running();
> >     }
> >     rcu_note_context_switch();
> >     raw_spin_rq_lock_nested() {
> >       _raw_spin_lock();
> >     }
> >     update_rq_clock();
> >     pick_next_task() {
> >       pick_next_task_fair() {
> >         update_curr() {
> >           update_curr_se();
> >           __calc_delta.constprop.0();
> >           update_min_vruntime();
> >         }
> >         check_cfs_rq_runtime();
> >         pick_next_entity() {
> >           pick_eevdf();
> >         }
> >         update_curr() {
> >           update_curr_se();
> >           __calc_delta.constprop.0();
> >           update_min_vruntime();
> >         }
> >         check_cfs_rq_runtime();
> >         pick_next_entity() {
> >           pick_eevdf();
> >         }
> >         update_curr() {
> >           update_curr_se();
> >           update_min_vruntime();
> >           cpuacct_charge();
> >           __cgroup_account_cputime() {
> >             cgroup_rstat_updated();
> >           }
> >         }
> >         check_cfs_rq_runtime();
> >         pick_next_entity() {
> >           pick_eevdf();
> >         }
> >       }
> >     }
> >     raw_spin_rq_unlock();
> >     io_wq_worker_running();
> >   }
> >   _raw_spin_lock_irq() {
> >     __pv_queued_spin_lock_slowpath();
> >   }
> >   userfaultfd_ctx_put();
> > }
> > }
> >
> > The execution flow above is the one that kept faulting
> > repeatedly in the IOU worker during the issue. The entire fault path,
> > including the final userfault handling code you're seeing, would be
> > triggered in an infinite loop. That's why I traced it and found that the
> > io_wq_worker_running() function returns early, causing the flow to
> > differ from a normal user fault, where it should be sleeping.
>
> io_wq_worker_running() is called when the task is scheduled back in.
> There's no "returning early" here, it simply updates the accounting.
> Which is part of why your patch makes very little sense to me, we
> would've called both io_wq_worker_sleeping() and _running() from the
> userfaultfd path. The latter doesn't really do much, it simply just
> increments the running worker count, if the worker was previously marked
> as sleeping.
>
> And I strongly suspect that the latter is the issue, not the marking of
> running. The above loop is fine if we do go to sleep in schedule().
> However, if there's task_work (either TWA_SIGNAL or TWA_NOTIFY_SIGNAL
> based) pending, then schedule() will be a no-op and we're going to
> repeatedly go through that loop. This is because the expectation here is
> that the loop will be aborted if either of those is true, so that
> task_work can get run (or a signal handled, whatever), and then the
> operation retried.
>
> > However, your call stack appears to behave normally,
> > which makes me curious about what's different about the execution flow.
> > Would you be able to share your test case code so I can study it
> > and try to reproduce the behavior on my side?
>
> It behaves normally for the initial attempt - we end up sleeping in
> schedule(). However, then a new worker gets created, or the ring
> shuts down, in which case schedule() ends up being a no-op because
> TWA_NOTIFY_SIGNAL is set, and then we just sit there in a loop running
> the same code again and again to no avail. So I do think my test case
> and your issue are the same; I just reproduce it by calling
> io_uring_queue_exit(), but the exact same thing would happen if worker
> creation is attempted while an io-wq worker is blocked in
> handle_userfault().
>
> This is why I want to fully understand the issue rather than paper
> around it, as I don't think the fix is correct as-is. We really want to
> abort the loop and allow the task to handle whatever signaling is
> currently preventing proper sleeps.
>
> I'll dabble a bit more and send out the test case too, in case it'll
> help on your end.
>
> --
> Jens Axboe

I'm really looking forward to your test case. Also, I'd like to emphasize
one more point: the handle_userfault graph path I sent you, including the
schedule function, is complete and unmodified. You can see that the
schedule function is very, very short. I understand your point about
signal handling, but in this very brief function graph I haven't yet seen
any functions related to signal handling. Additionally, there is no
context switch here, nor is this the situation where the thread is being
scheduled back in. Perhaps the scenario you've reproduced is still
different from the one I've encountered in some subtle way?
void io_wq_worker_running(struct task_struct *tsk)
{
	struct io_worker *worker = tsk->worker_private;

	if (!worker)
		return;
	if (!test_bit(IO_WORKER_F_FAULT, &worker->flags)) {
		if (!test_bit(IO_WORKER_F_UP, &worker->flags))
			return;
		if (test_bit(IO_WORKER_F_RUNNING, &worker->flags))
			return;
		set_bit(IO_WORKER_F_RUNNING, &worker->flags);
		io_wq_inc_running(worker);
	}
}

However, from my observation during crash live-memory analysis, when this
happens in the IOU worker thread the IO_WORKER_F_RUNNING flag is already
set. That is what I meant by "early return": the function bails out before
io_wq_inc_running(), rather than acting as just a simple accounting
function. I look forward to your deeper analysis and any corrections you
may have.