Content-Type: multipart/mixed; boundary="------------Rtfq7maOjRQ8KE7c6kd7Vtad"
Message-ID: <3d4c5ca8-12f2-4525-8503-f34839ac7099@kernel.dk>
Date: Wed, 23 Apr 2025 12:55:10 -0600
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Subject: Re: [PATCH v2 1/2] io_uring: Add new functions to handle user fault scenarios
From: Jens Axboe
To: Zhiwei Jiang
Cc: viro@zeniv.linux.org.uk, brauner@kernel.org, jack@suse.cz,
 akpm@linux-foundation.org, peterx@redhat.com, asml.silence@gmail.com,
 linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org, io-uring@vger.kernel.org
References: <20250422162913.1242057-1-qq282012236@gmail.com>
 <20250422162913.1242057-2-qq282012236@gmail.com>
 <14195206-47b1-4483-996d-3315aa7c33aa@kernel.dk>
 <7bea9c74-7551-4312-bece-86c4ad5c982f@kernel.dk>
Content-Language: en-US
In-Reply-To:

This is a multi-part message in MIME format.
--------------Rtfq7maOjRQ8KE7c6kd7Vtad
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit

On 4/23/25 9:10 AM, Jens Axboe wrote:
> On 4/23/25 8:29 AM, Zhiwei Jiang wrote:
>> Jens Axboe wrote on Wed, Apr 23, 2025 at 21:34:
>>>
>>> On 4/22/25 8:49 PM, Zhiwei Jiang wrote:
>>>> On Wed, Apr 23, 2025 at 1:33 AM Jens Axboe wrote:
>>>>>
>>>>> On 4/22/25 11:04 AM, Zhiwei Jiang wrote:
>>>>>> On Wed, Apr 23, 2025 at 12:32 AM Jens Axboe wrote:
>>>>>>>
>>>>>>> On 4/22/25 10:29 AM, Zhiwei Jiang wrote:
>>>>>>>> diff --git a/io_uring/io-wq.h b/io_uring/io-wq.h
>>>>>>>> index d4fb2940e435..8567a9c819db 100644
>>>>>>>> --- a/io_uring/io-wq.h
>>>>>>>> +++ b/io_uring/io-wq.h
>>>>>>>> @@ -70,8 +70,10 @@ enum io_wq_cancel io_wq_cancel_cb(struct io_wq *wq, work_cancel_fn *cancel,
>>>>>>>>                                          void *data, bool cancel_all);
>>>>>>>>
>>>>>>>>  #if defined(CONFIG_IO_WQ)
>>>>>>>> -extern void io_wq_worker_sleeping(struct task_struct *);
>>>>>>>> -extern void io_wq_worker_running(struct task_struct *);
>>>>>>>> +extern void io_wq_worker_sleeping(struct task_struct *tsk);
>>>>>>>> +extern void io_wq_worker_running(struct task_struct *tsk);
>>>>>>>> +extern void set_userfault_flag_for_ioworker(void);
>>>>>>>> +extern void clear_userfault_flag_for_ioworker(void);
>>>>>>>>  #else
>>>>>>>>  static inline void io_wq_worker_sleeping(struct task_struct *tsk)
>>>>>>>>  {
>>>>>>>> @@ -79,6 +81,12 @@ static inline void io_wq_worker_sleeping(struct task_struct *tsk)
>>>>>>>>  static inline void io_wq_worker_running(struct task_struct *tsk)
>>>>>>>>  {
>>>>>>>>  }
>>>>>>>> +static inline void set_userfault_flag_for_ioworker(void)
>>>>>>>> +{
>>>>>>>> +}
>>>>>>>> +static inline void clear_userfault_flag_for_ioworker(void)
>>>>>>>> +{
>>>>>>>> +}
>>>>>>>>  #endif
>>>>>>>>
>>>>>>>>  static inline bool io_wq_current_is_worker(void)
>>>>>>>
>>>>>>> This should go in include/linux/io_uring.h and then userfaultfd would
>>>>>>> not have to include io_uring private headers.
>>>>>>>
>>>>>>> But that's beside the point, like I said we still need to get to the
>>>>>>> bottom of what is going on here first, rather than try and paper around
>>>>>>> it. So please don't post more versions of this before we have that
>>>>>>> understanding.
>>>>>>>
>>>>>>> See previous emails on 6.8 and other kernel versions.
>>>>>>>
>>>>>>> --
>>>>>>> Jens Axboe
>>>>>> The issue did not involve creating new worker processes. Instead, the
>>>>>> existing IOU worker kernel threads (about a dozen) associated with the VM
>>>>>> process were fully utilizing CPU without writing data, caused by a fault
>>>>>> while reading user data pages in the fault_in_iov_iter_readable function
>>>>>> when pulling user memory into kernel space.
>>>>>
>>>>> OK that makes more sense, I can certainly reproduce a loop in this path:
>>>>>
>>>>> iou-wrk-726   729 36.910071:       9737 cycles:P:
>>>>>         ffff800080456c44 handle_userfault+0x47c
>>>>>         ffff800080381fc0 hugetlb_fault+0xb68
>>>>>         ffff80008031fee4 handle_mm_fault+0x2fc
>>>>>         ffff8000812ada6c do_page_fault+0x1e4
>>>>>         ffff8000812ae024 do_translation_fault+0x9c
>>>>>         ffff800080049a9c do_mem_abort+0x44
>>>>>         ffff80008129bd78 el1_abort+0x38
>>>>>         ffff80008129ceb4 el1h_64_sync_handler+0xd4
>>>>>         ffff8000800112b4 el1h_64_sync+0x6c
>>>>>         ffff80008030984c fault_in_readable+0x74
>>>>>         ffff800080476f3c iomap_file_buffered_write+0x14c
>>>>>         ffff8000809b1230 blkdev_write_iter+0x1a8
>>>>>         ffff800080a1f378 io_write+0x188
>>>>>         ffff800080a14f30 io_issue_sqe+0x68
>>>>>         ffff800080a155d0 io_wq_submit_work+0xa8
>>>>>         ffff800080a32afc io_worker_handle_work+0x1f4
>>>>>         ffff800080a332b8 io_wq_worker+0x110
>>>>>         ffff80008002dd38 ret_from_fork+0x10
>>>>>
>>>>> which seems to be expected, we'd continually try and fault in the
>>>>> ranges, if the userfaultfd handler isn't filling them.
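For reference, the write side that keeps taking this fault is just the usual
buffered-write retry pattern. A rough sketch follows - illustrative only,
loosely modelled on generic_perform_write()/iomap_write_iter(), with names
abridged and do_copy_to_pagecache() as a stand-in for the actual copy step:

        /*
         * Illustrative sketch of the buffered-write retry pattern; not the
         * literal kernel code.
         */
        static ssize_t sketch_perform_write(struct iov_iter *iter, size_t chunk)
        {
                ssize_t written = 0;

                while (iov_iter_count(iter)) {
                        size_t bytes = min(chunk, iov_iter_count(iter));
                        size_t copied;

                        /*
                         * Make the source user pages resident before copying.
                         * With a userfaultfd-registered range whose handler
                         * never resolves the fault, this is where the worker
                         * keeps ending up, as in the profile above.
                         */
                        if (fault_in_iov_iter_readable(iter, bytes) == bytes)
                                return written ? written : -EFAULT;

                        /* stand-in for the real copy into the page cache */
                        copied = do_copy_to_pagecache(iter, bytes);
                        if (!copied)
                                continue;       /* page went missing again: fault in and retry */

                        written += copied;
                }
                return written;
        }

The point being: as long as the userfaultfd-registered source page is never
filled, nothing in this pattern can make forward progress.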
>>>>>
>>>>> I guess this is where I'm still confused, because I don't see how this
>>>>> is different from if you have a normal write(2) syscall doing the same
>>>>> thing - you'd get the same looping.
>>>>>
>>>>> ??
>>>>>
>>>>>> This issue occurs, for example, during VM snapshot loading (which uses
>>>>>> userfaultfd for on-demand memory loading), while the task in the guest is
>>>>>> writing data to disk.
>>>>>>
>>>>>> Normally, the VM first triggers a user fault to fill the page table.
>>>>>> So in the IOU worker thread, the page tables are already filled, and
>>>>>> no fault happens when faulting in memory pages
>>>>>> in fault_in_iov_iter_readable.
>>>>>>
>>>>>> I suspect that during snapshot loading, a memory access in the
>>>>>> VM triggers an async page fault handled by the kernel thread,
>>>>>> while the IOU worker's async kernel thread is also running.
>>>>>> Maybe the IOU worker's thread is scheduled first.
>>>>>> I'm going to bed now.
>>>>>
>>>>> Ah ok, so what you're saying is that because we end up not sleeping
>>>>> (because a signal is pending, it seems), then the fault will never get
>>>>> filled and hence progress not made? And the signal is pending because
>>>>> someone tried to create a new worker, and this work is not getting
>>>>> processed.
>>>>>
>>>>> --
>>>>> Jens Axboe
>>>> handle_userfault() {
>>>>   hugetlb_vma_lock_read();
>>>>   _raw_spin_lock_irq() {
>>>>     __pv_queued_spin_lock_slowpath();
>>>>   }
>>>>   vma_mmu_pagesize() {
>>>>     hugetlb_vm_op_pagesize();
>>>>   }
>>>>   huge_pte_offset();
>>>>   hugetlb_vma_unlock_read();
>>>>   up_read();
>>>>   __wake_up() {
>>>>     _raw_spin_lock_irqsave() {
>>>>       __pv_queued_spin_lock_slowpath();
>>>>     }
>>>>     __wake_up_common();
>>>>     _raw_spin_unlock_irqrestore();
>>>>   }
>>>>   schedule() {
>>>>     io_wq_worker_sleeping() {
>>>>       io_wq_dec_running();
>>>>     }
>>>>     rcu_note_context_switch();
>>>>     raw_spin_rq_lock_nested() {
>>>>       _raw_spin_lock();
>>>>     }
>>>>     update_rq_clock();
>>>>     pick_next_task() {
>>>>       pick_next_task_fair() {
>>>>         update_curr() {
>>>>           update_curr_se();
>>>>           __calc_delta.constprop.0();
>>>>           update_min_vruntime();
>>>>         }
>>>>         check_cfs_rq_runtime();
>>>>         pick_next_entity() {
>>>>           pick_eevdf();
>>>>         }
>>>>         update_curr() {
>>>>           update_curr_se();
>>>>           __calc_delta.constprop.0();
>>>>           update_min_vruntime();
>>>>         }
>>>>         check_cfs_rq_runtime();
>>>>         pick_next_entity() {
>>>>           pick_eevdf();
>>>>         }
>>>>         update_curr() {
>>>>           update_curr_se();
>>>>           update_min_vruntime();
>>>>           cpuacct_charge();
>>>>           __cgroup_account_cputime() {
>>>>             cgroup_rstat_updated();
>>>>           }
>>>>         }
>>>>         check_cfs_rq_runtime();
>>>>         pick_next_entity() {
>>>>           pick_eevdf();
>>>>         }
>>>>       }
>>>>     }
>>>>     raw_spin_rq_unlock();
>>>>     io_wq_worker_running();
>>>>   }
>>>>   _raw_spin_lock_irq() {
>>>>     __pv_queued_spin_lock_slowpath();
>>>>   }
>>>>   userfaultfd_ctx_put();
>>>> }
>>>> }
>>>> The execution flow above is the one that kept faulting
>>>> repeatedly in the IOU worker during the issue. The entire fault path,
>>>> including this final userfault handling code you're seeing, would be
>>>> triggered in an infinite loop. That's why I traced and found that the
>>>> io_wq_worker_running() function returns early, causing the flow to
>>>> differ from a normal user fault, where it should be sleeping.
>>>
>>> io_wq_worker_running() is called when the task is scheduled back in.
>>> There's no "returning early" here, it simply updates the accounting.
>>> Which is part of why your patch makes very little sense to me, we
>>> would've called both io_wq_worker_sleeping() and _running() from the
>>> userfaultfd path. The latter doesn't really do much, it simply just
>>> increments the running worker count, if the worker was previously marked
>>> as sleeping.
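To be concrete about what "updates the accounting" means, the stock hooks are
roughly the below (paraphrased and abridged from io_uring/io-wq.c; exact
details vary by kernel version):

        void io_wq_worker_sleeping(struct task_struct *tsk)
        {
                struct io_worker *worker = tsk->worker_private;

                if (!worker)
                        return;
                if (!test_bit(IO_WORKER_F_UP, &worker->flags))
                        return;
                if (!test_bit(IO_WORKER_F_RUNNING, &worker->flags))
                        return;
                clear_bit(IO_WORKER_F_RUNNING, &worker->flags);
                io_wq_dec_running(worker);
        }

        void io_wq_worker_running(struct task_struct *tsk)
        {
                struct io_worker *worker = tsk->worker_private;

                if (!worker)
                        return;
                if (!test_bit(IO_WORKER_F_UP, &worker->flags))
                        return;
                if (test_bit(IO_WORKER_F_RUNNING, &worker->flags))
                        return;
                set_bit(IO_WORKER_F_RUNNING, &worker->flags);
                io_wq_inc_running(worker);
        }

If IO_WORKER_F_RUNNING is still set because we never actually slept, the
_running() call is a pure no-op - which is exactly the "early return" being
observed.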
>>>
>>> And I strongly suspect that the latter is the issue, not the marking of
>>> running. The above loop is fine if we do go to sleep in schedule.
>>> However, if there's task_work (either TWA_SIGNAL or TWA_NOTIFY_SIGNAL
>>> based) pending, then schedule() will be a no-op and we're going to
>>> repeatedly go through that loop. This is because the expectation here is
>>> that the loop will be aborted if either of those is true, so that
>>> task_work can get run (or a signal handled, whatever), and then the
>>> operation retried.
>>>
>>>> However, your call stack appears to behave normally,
>>>> which makes me curious about what's different about the execution flow.
>>>> Would you be able to share your test case code so I can study it
>>>> and try to reproduce the behavior on my side?
>>>
>>> It behaves normally for the initial attempt - we end up sleeping in
>>> schedule(). However, then a new worker gets created, or the ring gets
>>> shut down, in which case schedule() ends up being a no-op because
>>> TWA_NOTIFY_SIGNAL is set, and then we just sit there in a loop running
>>> the same code again and again to no avail. So I do think my test case
>>> and your issue are the same, I just reproduce it by calling
>>> io_uring_queue_exit(), but the exact same thing would happen if worker
>>> creation is attempted while an io-wq worker is blocked in
>>> handle_userfault().
>>>
>>> This is why I want to fully understand the issue rather than paper
>>> around it, as I don't think the fix is correct as-is. We really want to
>>> abort the loop and allow the task to handle whatever signaling is
>>> currently preventing proper sleeps.
>>>
>>> I'll dabble a bit more and send out the test case too, in case it'll
>>> help on your end.
>>>
>>> --
>>> Jens Axboe
>> I'm really looking forward to your test case. Also, I'd like to
>> emphasize one more point: the handle_userfault graph path I sent you,
>> including the schedule function, is complete and unmodified. You can
>> see that the schedule function is very, very short. I understand your
>> point about signal handling, but in this very brief function graph, I
>> haven't yet seen any functions related to signal handling.
>> Additionally, there is no context switch here, nor is it the situation
>> where the thread is being scheduled back in. Perhaps the scenario
>> you've reproduced is still different from the one I've encountered in
>> some subtle way?
>
> Ask yourself, why would schedule() return immediately rather than
> actually block? There's a few cases here:
>
> 1) The task state is set to TASK_RUNNING - either because it was never
>    set to TASK_INTERRUPTIBLE/TASK_UNINTERRUPTIBLE, or because someone
>    raced and woke up the task after the initial check on whether it
>    should be sleeping or not.
>
> 2) Some kind of notification or signal is pending. This is usually when
>    task_sigpending() returns true, or if TIF_NOTIFY_SIGNAL is set. Those
>    need clearing, and that's generally done on return to userspace.
>
> #1 isn't the case here, but #2 looks highly plausible. The io-wq workers
> rely on running this kind of work manually, and retrying. If we loop
> further down with these conditions being true, then we're just busy
> looping at that point and will NEVER sleep. You don't see any functions
> related to signal handling etc EXACTLY because of that, there's nowhere
> it gets run.
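Schematically, a wait loop like this is expected to follow the pattern below.
This is an illustrative sketch, not literal kernel code, and
fault_has_been_resolved() is a stand-in for the real wake condition:

        for (;;) {
                set_current_state(TASK_INTERRUPTIBLE);
                if (fault_has_been_resolved())  /* stand-in for the real condition */
                        break;

                /*
                 * With TIF_NOTIFY_SIGNAL or a pending signal, schedule()
                 * will not block: the task stays runnable and returns right
                 * away. The loop has to bail out here so the caller can run
                 * task_work (or handle the signal) and then retry the
                 * operation - otherwise it just spins through this loop.
                 */
                if (task_sigpending(current) || test_thread_flag(TIF_NOTIFY_SIGNAL))
                        break;  /* caller runs task_work / handles the signal, then retries */

                schedule();
        }
        __set_current_state(TASK_RUNNING);

Skip that bail-out and schedule() simply keeps returning immediately, which is
the busy loop in the trace.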
>
>> void io_wq_worker_running(struct task_struct *tsk)
>> {
>>         struct io_worker *worker = tsk->worker_private;
>>
>>         if (!worker)
>>                 return;
>>         if (!test_bit(IO_WORKER_F_FAULT, &worker->flags)) {
>>                 if (!test_bit(IO_WORKER_F_UP, &worker->flags))
>>                         return;
>>                 if (test_bit(IO_WORKER_F_RUNNING, &worker->flags))
>>                         return;
>>                 set_bit(IO_WORKER_F_RUNNING, &worker->flags);
>>                 io_wq_inc_running(worker);
>>         }
>> }
>> However, from my observation during the crash live memory analysis,
>> when this happens in the IOU worker thread, the
>> IO_WORKER_F_RUNNING flag is set. This is what I meant by "early return",
>> rather than just a simple accounting function. I look forward to your
>> deeper analysis and any corrections you may have.
>
> It's set because of what I outlined above. If schedule() would actually
> sleep, then io_wq_worker_sleeping() would've been called. The fact that
> you're getting io_wq_worker_running() called without WORKER_F_RUNNING
> cleared is because of that.
>
> But you're too focused on the symptom here, not the underlying issue. It
> doesn't matter at all that io_wq_worker_running() is called when the
> task is already running, it'll just ignore that. It's explicitly tested.
> Your patch won't make a single difference for this case because of that,
> you're just wrapping what's essentially a no-op call with another no-op
> call, as you've now nested RUNNING inside the FAULT flag. It won't
> change your outcome at all.

BTW, the same thing can be observed without using io_uring at all - just
have a normal task do a write(2) as in my test case, and have the parent
send it a signal. We'll loop in page fault handling if userfaultfd is used
and it doesn't fill the fault. Example attached.

IOW, this is a generic "problem". I use quotes here as it's not _really_ a
problem, it'll just loop excessively if a signal is pending. It can still
very much get killed or terminated, but it's not going to make progress as
the page fault isn't filled.
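In outline, the attached test case boils down to the following - a condensed
sketch of the same idea, not the attached tufd.c verbatim: error handling is
omitted, the target file and buffer size are illustrative, and the hugetlb
setup is left out:

        #include <fcntl.h>
        #include <signal.h>
        #include <stdio.h>
        #include <string.h>
        #include <sys/ioctl.h>
        #include <sys/mman.h>
        #include <sys/syscall.h>
        #include <sys/types.h>
        #include <sys/wait.h>
        #include <unistd.h>
        #include <linux/userfaultfd.h>

        #define LEN     (2 * 1024 * 1024UL)

        static void sig_usr1(int sig)
        {
        }

        int main(void)
        {
                pid_t pid = fork();

                if (pid) {
                        int wstat;

                        sleep(1);
                        kill(pid, SIGUSR1);     /* child is stuck faulting in write(2) */
                        waitpid(pid, &wstat, 0);
                        return 0;
                }

                /* child: register a buffer with userfaultfd, never resolve its faults */
                struct uffdio_api api = { .api = UFFD_API };
                struct uffdio_register reg = { };
                struct sigaction act = { };
                void *buf;
                int uffd, fd;

                act.sa_handler = sig_usr1;
                sigaction(SIGUSR1, &act, NULL);

                buf = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
                ioctl(uffd, UFFDIO_API, &api);
                reg.range.start = (unsigned long) buf;
                reg.range.len = LEN;
                reg.mode = UFFDIO_REGISTER_MODE_MISSING;
                ioctl(uffd, UFFDIO_REGISTER, &reg);

                /*
                 * write(2) has to fault the source buffer in; nobody ever does a
                 * UFFDIO_COPY, so once the parent's signal is pending the fault
                 * path keeps retrying instead of sleeping.
                 */
                fd = open("testfile", O_RDWR | O_CREAT, 0644);
                write(fd, buf, LEN);
                return 0;
        }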
-- Jens Axboe --------------Rtfq7maOjRQ8KE7c6kd7Vtad Content-Type: text/x-csrc; charset=UTF-8; name="tufd.c" Content-Disposition: attachment; filename="tufd.c" Content-Transfer-Encoding: base64 I2luY2x1ZGUgPHN0ZGlvLmg+CiNpbmNsdWRlIDxzdGRsaWIuaD4KI2luY2x1ZGUgPHVuaXN0 ZC5oPgojaW5jbHVkZSA8ZmNudGwuaD4KI2luY2x1ZGUgPHN0cmluZy5oPgojaW5jbHVkZSA8 cG9sbC5oPgojaW5jbHVkZSA8c3lzL21tYW4uaD4KI2luY2x1ZGUgPHN5cy9pb2N0bC5oPgoj aW5jbHVkZSA8c3lzL3dhaXQuaD4KI2luY2x1ZGUgPGxpbnV4L21tYW4uaD4KI2luY2x1ZGUg PHN5cy91aW8uaD4KI2luY2x1ZGUgPGVycm5vLmg+CiNpbmNsdWRlIDxwdGhyZWFkLmg+CiNp bmNsdWRlIDxzaWduYWwuaD4KI2luY2x1ZGUgPGxpbnV4L3VzZXJmYXVsdGZkLmg+CgojZGVm aW5lIEhQX1NJWkUJCSgyICogMTAyNCAqIDEwMjRVTEwpCgojaWZuZGVmIE5SX3VzZXJmYXVs dGZkCiNkZWZpbmUgTlJfdXNlcmZhdWx0ZmQJMjgyCiNlbmRpZgoKc3RydWN0IHRocmVhZF9k YXRhIHsKCXB0aHJlYWRfdCB0aHJlYWQ7CglwdGhyZWFkX2JhcnJpZXJfdCBiYXJyaWVyOwoJ aW50IHVmZmQ7Cn07CgpzdGF0aWMgdm9pZCAqZmF1bHRfaGFuZGxlcih2b2lkICpkYXRhKQp7 CglzdHJ1Y3QgdGhyZWFkX2RhdGEgKnRkID0gZGF0YTsKCXN0cnVjdCB1ZmZkX21zZyBtc2c7 CglzdHJ1Y3QgcG9sbGZkIHBmZDsKCWludCByZXQsIG5yZWFkeTsKCglwdGhyZWFkX2JhcnJp ZXJfd2FpdCgmdGQtPmJhcnJpZXIpOwoKCWRvIHsKCQlwZmQuZmQgPSB0ZC0+dWZmZDsKCQlw ZmQuZXZlbnRzID0gUE9MTElOOwoJCW5yZWFkeSA9IHBvbGwoJnBmZCwgMSwgLTEpOwoJCWlm IChucmVhZHkgPCAwKSB7CgkJCXBlcnJvcigicG9sbCIpOwoJCQlleGl0KDEpOwoJCX0KCgkJ cmV0ID0gcmVhZCh0ZC0+dWZmZCwgJm1zZywgc2l6ZW9mKG1zZykpOwoJCWlmIChyZXQgPCAw KSB7CgkJCWlmIChlcnJubyA9PSBFQUdBSU4pCgkJCQljb250aW51ZTsKCQkJcGVycm9yKCJy ZWFkIik7CgkJCWV4aXQoMSk7CgkJfQoKCQlpZiAobXNnLmV2ZW50ICE9IFVGRkRfRVZFTlRf UEFHRUZBVUxUKSB7CgkJCXByaW50ZigidW5zcGVjdGVkIGV2ZW50OiAleFxuIiwgbXNnLmV2 ZW50KTsKCQkJZXhpdCgxKTsKCQl9CgoJCXByaW50ZigiUGFnZSBmYXVsdFxuIik7CgkJcHJp bnRmKCJmbGFncyA9ICVseDsgIiwgKGxvbmcpIG1zZy5hcmcucGFnZWZhdWx0LmZsYWdzKTsK CQlwcmludGYoImFkZHJlc3MgPSAlbHhcbiIsIChsb25nKW1zZy5hcmcucGFnZWZhdWx0LmFk ZHJlc3MpOwoJfSB3aGlsZSAoMSk7CgoJcmV0dXJuIE5VTEw7Cn0KCnN0YXRpYyB2b2lkICph cm1fZmF1bHRfaGFuZGxlcihzdHJ1Y3QgdGhyZWFkX2RhdGEgKnRkLCBzaXplX3QgbGVuKQp7 CglzdHJ1Y3QgdWZmZGlvX2FwaSBhcGkgPSB7IH07CglzdHJ1Y3QgdWZmZGlvX3JlZ2lzdGVy IHJlZyA9IHsgfTsKCXZvaWQgKmJ1ZjsKCglidWYgPSBtbWFwKE5VTEwsIEhQX1NJWkUsIFBS T1RfUkVBRHxQUk9UX1dSSVRFLAoJCQlNQVBfUFJJVkFURSB8IE1BUF9IVUdFVExCIHwgTUFQ X0hVR0VfMk1CIHwgTUFQX0FOT05ZTU9VUywKCQkJLTEsIDApOwoJaWYgKGJ1ZiA9PSBNQVBf RkFJTEVEKSB7CgkJcGVycm9yKCJtbWFwIik7CgkJcmV0dXJuIE5VTEw7Cgl9CglwcmludGYo ImdvdCBidWYgJXBcbiIsIGJ1Zik7CgoJdGQtPnVmZmQgPSBzeXNjYWxsKE5SX3VzZXJmYXVs dGZkLCBPX0NMT0VYRUMgfCBPX05PTkJMT0NLKTsKCWlmICh0ZC0+dWZmZCA8IDApIHsKCQlw ZXJyb3IoInVzZXJmYXVsdGZkIik7CgkJcmV0dXJuIE5VTEw7Cgl9CgoJYXBpLmFwaSA9IFVG RkRfQVBJOwoJaWYgKGlvY3RsKHRkLT51ZmZkLCBVRkZESU9fQVBJLCAmYXBpKSA8IDApIHsK CQlwZXJyb3IoImlvY3RsIFVGRkRJT19BUEkiKTsKCQlyZXR1cm4gTlVMTDsKCX0KCglyZWcu cmFuZ2Uuc3RhcnQgPSAodW5zaWduZWQgbG9uZykgYnVmOwoJcmVnLnJhbmdlLmxlbiA9IEhQ X1NJWkU7CglyZWcubW9kZSA9IFVGRkRJT19SRUdJU1RFUl9NT0RFX01JU1NJTkc7CglpZiAo aW9jdGwodGQtPnVmZmQsIFVGRkRJT19SRUdJU1RFUiwgJnJlZykgPCAwKSB7CgkJcGVycm9y KCJpb2N0bCBVRkZESU9fUkVHSVNURVIiKTsKCQlyZXR1cm4gTlVMTDsKCX0KCglyZXR1cm4g YnVmOwp9CgpzdGF0aWMgdm9pZCBzaWdfdXNyMShpbnQgc2lnKQp7Cn0KCnN0YXRpYyB2b2lk IF9fZG9faW8oaW50IGZkLCB2b2lkICpidWYsIHNpemVfdCBsZW4pCnsKCXN0cnVjdCBzaWdh Y3Rpb24gYWN0ID0geyB9OwoJaW50IHJldDsKCglhY3Quc2FfaGFuZGxlciA9IHNpZ191c3Ix OwoJc2lnYWN0aW9uKFNJR1VTUjEsICZhY3QsIE5VTEwpOwoKCXByaW50ZigiY2hpbGQgd2ls bCB3cml0ZVxuIik7CglyZXQgPSB3cml0ZShmZCwgYnVmLCBsZW4pOwoJcHJpbnRmKCJyZXQ9 JWRcbiIsIHJldCk7Cn0KCnN0YXRpYyB2b2lkIGRvX2lvKHN0cnVjdCB0aHJlYWRfZGF0YSAq dGQsIHNpemVfdCBsZW4pCnsKCXZvaWQgKmJ1ZjsKCXBpZF90IHBpZDsKCWludCBmZDsKCglm ZCA9IG9wZW4oIi9kZXYvbnZtZTBuMSIsIE9fUkRXUik7CglpZiAoZmQgPCAwKSB7CgkJcGVy 
cm9yKCJvcGVuIGNyZWF0ZSIpOwoJCXJldHVybjsKCX0KCglwaWQgPSBmb3JrKCk7CglpZiAo cGlkKSB7CgkJaW50IHdzdGF0OwoKCQlzbGVlcCgxKTsKCQlraWxsKHBpZCwgU0lHVVNSMSk7 CgkJcHJpbnRmKCJ3YWl0IG9uIGNoaWxkXG4iKTsKCQl3YWl0cGlkKHBpZCwgJndzdGF0LCAw KTsKCX0gZWxzZSB7CgkJYnVmID0gYXJtX2ZhdWx0X2hhbmRsZXIodGQsIGxlbik7CgkJcHRo cmVhZF9iYXJyaWVyX3dhaXQoJnRkLT5iYXJyaWVyKTsKCQlfX2RvX2lvKGZkLCBidWYsIGxl bik7CgkJZXhpdCgwKTsKCX0KfQoKaW50IG1haW4oaW50IGFyZ2MsIGNoYXIgKmFyZ3ZbXSkK ewoJc3RydWN0IHRocmVhZF9kYXRhIHRkID0geyB9OwoKCXB0aHJlYWRfYmFycmllcl9pbml0 KCZ0ZC5iYXJyaWVyLCBOVUxMLCAyKTsKCXB0aHJlYWRfY3JlYXRlKCZ0ZC50aHJlYWQsIE5V TEwsIGZhdWx0X2hhbmRsZXIsICZ0ZCk7CgoJZG9faW8oJnRkLCBIUF9TSVpFKTsKCXJldHVy biAwOwp9Cg== --------------Rtfq7maOjRQ8KE7c6kd7Vtad--