Content-Type: multipart/mixed; boundary="------------Rtfq7maOjRQ8KE7c6kd7Vtad"
Message-ID: <3d4c5ca8-12f2-4525-8503-f34839ac7099@kernel.dk>
Date: Wed, 23 Apr 2025 12:55:10 -0600
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Subject: Re: [PATCH v2 1/2] io_uring: Add new functions to handle user fault scenarios
From: Jens Axboe
To: Zhiwei Jiang
Cc: viro@zeniv.linux.org.uk, brauner@kernel.org, jack@suse.cz,
 akpm@linux-foundation.org, peterx@redhat.com, asml.silence@gmail.com,
 linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org, io-uring@vger.kernel.org
References: <20250422162913.1242057-1-qq282012236@gmail.com>
 <20250422162913.1242057-2-qq282012236@gmail.com>
 <14195206-47b1-4483-996d-3315aa7c33aa@kernel.dk>
 <7bea9c74-7551-4312-bece-86c4ad5c982f@kernel.dk>
Content-Language: en-US
In-Reply-To:

This is a multi-part message in MIME format.
--------------Rtfq7maOjRQ8KE7c6kd7Vtad
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit

On 4/23/25 9:10 AM, Jens Axboe wrote:
> On 4/23/25 8:29 AM, Zhiwei Jiang wrote:
>> Jens Axboe wrote on Wed, Apr 23, 2025 at 21:34:
>>>
>>> On 4/22/25 8:49 PM, Zhiwei Jiang wrote:
>>>> On Wed, Apr 23, 2025 at 1:33 AM Jens Axboe wrote:
>>>>>
>>>>> On 4/22/25 11:04 AM, Zhiwei Jiang wrote:
>>>>>> On Wed, Apr 23, 2025 at 12:32 AM Jens Axboe wrote:
>>>>>>>
>>>>>>> On 4/22/25 10:29 AM, Zhiwei Jiang wrote:
>>>>>>>> diff --git a/io_uring/io-wq.h b/io_uring/io-wq.h
>>>>>>>> index d4fb2940e435..8567a9c819db 100644
>>>>>>>> --- a/io_uring/io-wq.h
>>>>>>>> +++ b/io_uring/io-wq.h
>>>>>>>> @@ -70,8 +70,10 @@ enum io_wq_cancel io_wq_cancel_cb(struct io_wq *wq, work_cancel_fn *cancel,
>>>>>>>>                                          void *data, bool cancel_all);
>>>>>>>>
>>>>>>>>  #if defined(CONFIG_IO_WQ)
>>>>>>>> -extern void io_wq_worker_sleeping(struct task_struct *);
>>>>>>>> -extern void io_wq_worker_running(struct task_struct *);
>>>>>>>> +extern void io_wq_worker_sleeping(struct task_struct *tsk);
>>>>>>>> +extern void io_wq_worker_running(struct task_struct *tsk);
>>>>>>>> +extern void set_userfault_flag_for_ioworker(void);
>>>>>>>> +extern void clear_userfault_flag_for_ioworker(void);
>>>>>>>>  #else
>>>>>>>>  static inline void io_wq_worker_sleeping(struct task_struct *tsk)
>>>>>>>>  {
>>>>>>>> @@ -79,6 +81,12 @@ static inline void io_wq_worker_sleeping(struct task_struct *tsk)
>>>>>>>>  static inline void io_wq_worker_running(struct task_struct *tsk)
>>>>>>>>  {
>>>>>>>>  }
>>>>>>>> +static inline void set_userfault_flag_for_ioworker(void)
>>>>>>>> +{
>>>>>>>> +}
>>>>>>>> +static inline void clear_userfault_flag_for_ioworker(void)
>>>>>>>> +{
>>>>>>>> +}
>>>>>>>>  #endif
>>>>>>>>
>>>>>>>>  static inline bool io_wq_current_is_worker(void)
>>>>>>>
>>>>>>> This should go in include/linux/io_uring.h and then userfaultfd would
>>>>>>> not have to include io_uring private headers.
>>>>>>>
>>>>>>> But that's beside the point, like I said we still need to get to the
>>>>>>> bottom of what is going on here first, rather than try and paper around
>>>>>>> it. So please don't post more versions of this before we have that
>>>>>>> understanding.
>>>>>>>
>>>>>>> See previous emails on 6.8 and other kernel versions.
>>>>>>>
>>>>>>> --
>>>>>>> Jens Axboe
>>>>>> The issue did not involve creating new worker processes. Instead, the
>>>>>> existing IOU worker kernel threads (about a dozen) associated with the VM
>>>>>> process were fully utilizing CPU without writing data, caused by a fault
>>>>>> while reading user data pages in the fault_in_iov_iter_readable function
>>>>>> when pulling user memory into kernel space.
>>>>>
>>>>> OK that makes more sense, I can certainly reproduce a loop in this path:
>>>>>
>>>>> iou-wrk-726   729 36.910071:       9737 cycles:P:
>>>>>         ffff800080456c44 handle_userfault+0x47c
>>>>>         ffff800080381fc0 hugetlb_fault+0xb68
>>>>>         ffff80008031fee4 handle_mm_fault+0x2fc
>>>>>         ffff8000812ada6c do_page_fault+0x1e4
>>>>>         ffff8000812ae024 do_translation_fault+0x9c
>>>>>         ffff800080049a9c do_mem_abort+0x44
>>>>>         ffff80008129bd78 el1_abort+0x38
>>>>>         ffff80008129ceb4 el1h_64_sync_handler+0xd4
>>>>>         ffff8000800112b4 el1h_64_sync+0x6c
>>>>>         ffff80008030984c fault_in_readable+0x74
>>>>>         ffff800080476f3c iomap_file_buffered_write+0x14c
>>>>>         ffff8000809b1230 blkdev_write_iter+0x1a8
>>>>>         ffff800080a1f378 io_write+0x188
>>>>>         ffff800080a14f30 io_issue_sqe+0x68
>>>>>         ffff800080a155d0 io_wq_submit_work+0xa8
>>>>>         ffff800080a32afc io_worker_handle_work+0x1f4
>>>>>         ffff800080a332b8 io_wq_worker+0x110
>>>>>         ffff80008002dd38 ret_from_fork+0x10
>>>>>
>>>>> which seems to be expected, we'd continually try and fault in the
>>>>> ranges, if the userfaultfd handler isn't filling them.
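For reference, the write side that keeps taking this fault is just the usual
buffered-write retry pattern. A rough sketch follows - illustrative only,
loosely modelled on generic_perform_write()/iomap_write_iter(), with names
abridged and do_copy_to_pagecache() as a stand-in for the actual copy step:

        /*
         * Illustrative sketch of the buffered-write retry pattern; not the
         * literal kernel code.
         */
        static ssize_t sketch_perform_write(struct iov_iter *iter, size_t chunk)
        {
                ssize_t written = 0;

                while (iov_iter_count(iter)) {
                        size_t bytes = min(chunk, iov_iter_count(iter));
                        size_t copied;

                        /*
                         * Make the source user pages resident before copying.
                         * With a userfaultfd-registered range whose handler
                         * never resolves the fault, this is where the worker
                         * keeps ending up, as in the profile above.
                         */
                        if (fault_in_iov_iter_readable(iter, bytes) == bytes)
                                return written ? written : -EFAULT;

                        /* stand-in for the real copy into the page cache */
                        copied = do_copy_to_pagecache(iter, bytes);
                        if (!copied)
                                continue;       /* page went missing again: fault in and retry */

                        written += copied;
                }
                return written;
        }

The point being: as long as the userfaultfd-registered source page is never
filled, nothing in this pattern can make forward progress.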
>>>>>
>>>>> I guess this is where I'm still confused, because I don't see how this
>>>>> is different from if you have a normal write(2) syscall doing the same
>>>>> thing - you'd get the same looping.
>>>>>
>>>>> ??
>>>>>
>>>>>> This issue occurs, for example, during VM snapshot loading (which uses
>>>>>> userfaultfd for on-demand memory loading), while the task in the guest is
>>>>>> writing data to disk.
>>>>>>
>>>>>> Normally, the VM first triggers a user fault to fill the page table.
>>>>>> So in the IOU worker thread, the page tables are already filled, and
>>>>>> no fault happens when faulting in memory pages
>>>>>> in fault_in_iov_iter_readable.
>>>>>>
>>>>>> I suspect that during snapshot loading, a memory access in the
>>>>>> VM triggers an async page fault handled by the kernel thread,
>>>>>> while the IOU worker's async kernel thread is also running.
>>>>>> Maybe the IOU worker's thread is scheduled first.
>>>>>> I'm going to bed now.
>>>>>
>>>>> Ah ok, so what you're saying is that because we end up not sleeping
>>>>> (because a signal is pending, it seems), then the fault will never get
>>>>> filled and hence progress not made? And the signal is pending because
>>>>> someone tried to create a new worker, and this work is not getting
>>>>> processed.
>>>>>
>>>>> --
>>>>> Jens Axboe
>>>> handle_userfault() {
>>>>   hugetlb_vma_lock_read();
>>>>   _raw_spin_lock_irq() {
>>>>     __pv_queued_spin_lock_slowpath();
>>>>   }
>>>>   vma_mmu_pagesize() {
>>>>     hugetlb_vm_op_pagesize();
>>>>   }
>>>>   huge_pte_offset();
>>>>   hugetlb_vma_unlock_read();
>>>>   up_read();
>>>>   __wake_up() {
>>>>     _raw_spin_lock_irqsave() {
>>>>       __pv_queued_spin_lock_slowpath();
>>>>     }
>>>>     __wake_up_common();
>>>>     _raw_spin_unlock_irqrestore();
>>>>   }
>>>>   schedule() {
>>>>     io_wq_worker_sleeping() {
>>>>       io_wq_dec_running();
>>>>     }
>>>>     rcu_note_context_switch();
>>>>     raw_spin_rq_lock_nested() {
>>>>       _raw_spin_lock();
>>>>     }
>>>>     update_rq_clock();
>>>>     pick_next_task() {
>>>>       pick_next_task_fair() {
>>>>         update_curr() {
>>>>           update_curr_se();
>>>>           __calc_delta.constprop.0();
>>>>           update_min_vruntime();
>>>>         }
>>>>         check_cfs_rq_runtime();
>>>>         pick_next_entity() {
>>>>           pick_eevdf();
>>>>         }
>>>>         update_curr() {
>>>>           update_curr_se();
>>>>           __calc_delta.constprop.0();
>>>>           update_min_vruntime();
>>>>         }
>>>>         check_cfs_rq_runtime();
>>>>         pick_next_entity() {
>>>>           pick_eevdf();
>>>>         }
>>>>         update_curr() {
>>>>           update_curr_se();
>>>>           update_min_vruntime();
>>>>           cpuacct_charge();
>>>>           __cgroup_account_cputime() {
>>>>             cgroup_rstat_updated();
>>>>           }
>>>>         }
>>>>         check_cfs_rq_runtime();
>>>>         pick_next_entity() {
>>>>           pick_eevdf();
>>>>         }
>>>>       }
>>>>     }
>>>>     raw_spin_rq_unlock();
>>>>     io_wq_worker_running();
>>>>   }
>>>>   _raw_spin_lock_irq() {
>>>>     __pv_queued_spin_lock_slowpath();
>>>>   }
>>>>   userfaultfd_ctx_put();
>>>> }
>>>> }
>>>> The execution flow above is the one that kept faulting
>>>> repeatedly in the IOU worker during the issue. The entire fault path,
>>>> including this final userfault handling code you're seeing, would be
>>>> triggered in an infinite loop. That's why I traced and found that the
>>>> io_wq_worker_running() function returns early, causing the flow to
>>>> differ from a normal user fault, where it should be sleeping.
>>>
>>> io_wq_worker_running() is called when the task is scheduled back in.
>>> There's no "returning early" here, it simply updates the accounting.
>>> Which is part of why your patch makes very little sense to me, we
>>> would've called both io_wq_worker_sleeping() and _running() from the
>>> userfaultfd path. The latter doesn't really do much, it simply just
>>> increments the running worker count, if the worker was previously marked
>>> as sleeping.
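To be concrete about what "updates the accounting" means, the stock hooks are
roughly the below (paraphrased and abridged from io_uring/io-wq.c; exact
details vary by kernel version):

        void io_wq_worker_sleeping(struct task_struct *tsk)
        {
                struct io_worker *worker = tsk->worker_private;

                if (!worker)
                        return;
                if (!test_bit(IO_WORKER_F_UP, &worker->flags))
                        return;
                if (!test_bit(IO_WORKER_F_RUNNING, &worker->flags))
                        return;
                clear_bit(IO_WORKER_F_RUNNING, &worker->flags);
                io_wq_dec_running(worker);
        }

        void io_wq_worker_running(struct task_struct *tsk)
        {
                struct io_worker *worker = tsk->worker_private;

                if (!worker)
                        return;
                if (!test_bit(IO_WORKER_F_UP, &worker->flags))
                        return;
                if (test_bit(IO_WORKER_F_RUNNING, &worker->flags))
                        return;
                set_bit(IO_WORKER_F_RUNNING, &worker->flags);
                io_wq_inc_running(worker);
        }

If IO_WORKER_F_RUNNING is still set because we never actually slept, the
_running() call is a pure no-op - which is exactly the "early return" being
observed.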
>>>
>>> And I strongly suspect that the latter is the issue, not the marking of
>>> running. The above loop is fine if we do go to sleep in schedule.
>>> However, if there's task_work (either TWA_SIGNAL or TWA_NOTIFY_SIGNAL
>>> based) pending, then schedule() will be a no-op and we're going to
>>> repeatedly go through that loop. This is because the expectation here is
>>> that the loop will be aborted if either of those is true, so that
>>> task_work can get run (or a signal handled, whatever), and then the
>>> operation retried.
>>>
>>>> However, your call stack appears to behave normally,
>>>> which makes me curious about what's different about the execution flow.
>>>> Would you be able to share your test case code so I can study it
>>>> and try to reproduce the behavior on my side?
>>>
>>> It behaves normally for the initial attempt - we end up sleeping in
>>> schedule(). However, then a new worker gets created, or the ring gets
>>> shut down, in which case schedule() ends up being a no-op because
>>> TWA_NOTIFY_SIGNAL is set, and then we just sit there in a loop running
>>> the same code again and again to no avail. So I do think my test case
>>> and your issue are the same, I just reproduce it by calling
>>> io_uring_queue_exit(), but the exact same thing would happen if worker
>>> creation is attempted while an io-wq worker is blocked in
>>> handle_userfault().
>>>
>>> This is why I want to fully understand the issue rather than paper
>>> around it, as I don't think the fix is correct as-is. We really want to
>>> abort the loop and allow the task to handle whatever signaling is
>>> currently preventing proper sleeps.
>>>
>>> I'll dabble a bit more and send out the test case too, in case it'll
>>> help on your end.
>>>
>>> --
>>> Jens Axboe
>> I'm really looking forward to your test case. Also, I'd like to
>> emphasize one more point: the handle_userfault graph path I sent you,
>> including the schedule function, is complete and unmodified. You can
>> see that the schedule function is very, very short. I understand your
>> point about signal handling, but in this very brief function graph, I
>> haven't yet seen any functions related to signal handling.
>> Additionally, there is no context switch here, nor is it the situation
>> where the thread is being scheduled back in. Perhaps the scenario
>> you've reproduced is still different from the one I've encountered in
>> some subtle way?
>
> Ask yourself, why would schedule() return immediately rather than
> actually block? There's a few cases here:
>
> 1) The task state is set to TASK_RUNNING - either because it was never
>    set to TASK_INTERRUPTIBLE/TASK_UNINTERRUPTIBLE, or because someone
>    raced and woke up the task after the initial check on whether it
>    should be sleeping or not.
>
> 2) Some kind of notification or signal is pending. This is usually when
>    task_sigpending() returns true, or if TIF_NOTIFY_SIGNAL is set. Those
>    need clearing, and that's generally done on return to userspace.
>
> #1 isn't the case here, but #2 looks highly plausible. The io-wq workers
> rely on running this kind of work manually, and retrying. If we loop
> further down with these conditions being true, then we're just busy
> looping at that point and will NEVER sleep. You don't see any functions
> related to signal handling etc EXACTLY because of that, there's nowhere
> it gets run.
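Schematically, a wait loop like this is expected to follow the pattern below.
This is an illustrative sketch, not literal kernel code, and
fault_has_been_resolved() is a stand-in for the real wake condition:

        for (;;) {
                set_current_state(TASK_INTERRUPTIBLE);
                if (fault_has_been_resolved())  /* stand-in for the real condition */
                        break;

                /*
                 * With TIF_NOTIFY_SIGNAL or a pending signal, schedule()
                 * will not block: the task stays runnable and returns right
                 * away. The loop has to bail out here so the caller can run
                 * task_work (or handle the signal) and then retry the
                 * operation - otherwise it just spins through this loop.
                 */
                if (task_sigpending(current) || test_thread_flag(TIF_NOTIFY_SIGNAL))
                        break;  /* caller runs task_work / handles the signal, then retries */

                schedule();
        }
        __set_current_state(TASK_RUNNING);

Skip that bail-out and schedule() simply keeps returning immediately, which is
the busy loop in the trace.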
>
>> void io_wq_worker_running(struct task_struct *tsk)
>> {
>>         struct io_worker *worker = tsk->worker_private;
>>
>>         if (!worker)
>>                 return;
>>         if (!test_bit(IO_WORKER_F_FAULT, &worker->flags)) {
>>                 if (!test_bit(IO_WORKER_F_UP, &worker->flags))
>>                         return;
>>                 if (test_bit(IO_WORKER_F_RUNNING, &worker->flags))
>>                         return;
>>                 set_bit(IO_WORKER_F_RUNNING, &worker->flags);
>>                 io_wq_inc_running(worker);
>>         }
>> }
>> However, from my observation during the crash live memory analysis,
>> when this happens in the IOU worker thread, the
>> IO_WORKER_F_RUNNING flag is set. This is what I meant by "early return",
>> rather than just a simple accounting function. I look forward to your
>> deeper analysis and any corrections you may have.
>
> It's set because of what I outlined above. If schedule() would actually
> sleep, then io_wq_worker_sleeping() would've been called. The fact that
> you're getting io_wq_worker_running() called without WORKER_F_RUNNING
> cleared is because of that.
>
> But you're too focused on the symptom here, not the underlying issue. It
> doesn't matter at all that io_wq_worker_running() is called when the
> task is already running, it'll just ignore that. It's explicitly tested.
> Your patch won't make a single difference for this case because of that,
> you're just wrapping what's essentially a no-op call with another no-op
> call, as you've now nested RUNNING inside the FAULT flag. It won't
> change your outcome at all.

BTW, the same thing can be observed without using io_uring at all - just
have a normal task do a write(2) as in my test case, and have the parent
send it a signal. We'll loop in page fault handling if userfaultfd is used
and it doesn't fill the fault. Example attached.

IOW, this is a generic "problem". I use quotes here as it's not _really_ a
problem, it'll just loop excessively if a signal is pending. It can still
very much get killed or terminated, but it's not going to make progress as
the page fault isn't filled.
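In outline, the attached test case boils down to the following - a condensed
sketch of the same idea, not the attached tufd.c verbatim: error handling is
omitted, the target file and buffer size are illustrative, and the hugetlb
setup is left out:

        #include <fcntl.h>
        #include <signal.h>
        #include <stdio.h>
        #include <string.h>
        #include <sys/ioctl.h>
        #include <sys/mman.h>
        #include <sys/syscall.h>
        #include <sys/types.h>
        #include <sys/wait.h>
        #include <unistd.h>
        #include <linux/userfaultfd.h>

        #define LEN     (2 * 1024 * 1024UL)

        static void sig_usr1(int sig)
        {
        }

        int main(void)
        {
                pid_t pid = fork();

                if (pid) {
                        int wstat;

                        sleep(1);
                        kill(pid, SIGUSR1);     /* child is stuck faulting in write(2) */
                        waitpid(pid, &wstat, 0);
                        return 0;
                }

                /* child: register a buffer with userfaultfd, never resolve its faults */
                struct uffdio_api api = { .api = UFFD_API };
                struct uffdio_register reg = { };
                struct sigaction act = { };
                void *buf;
                int uffd, fd;

                act.sa_handler = sig_usr1;
                sigaction(SIGUSR1, &act, NULL);

                buf = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
                ioctl(uffd, UFFDIO_API, &api);
                reg.range.start = (unsigned long) buf;
                reg.range.len = LEN;
                reg.mode = UFFDIO_REGISTER_MODE_MISSING;
                ioctl(uffd, UFFDIO_REGISTER, &reg);

                /*
                 * write(2) has to fault the source buffer in; nobody ever does a
                 * UFFDIO_COPY, so once the parent's signal is pending the fault
                 * path keeps retrying instead of sleeping.
                 */
                fd = open("testfile", O_RDWR | O_CREAT, 0644);
                write(fd, buf, LEN);
                return 0;
        }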
-- Jens Axboe --------------Rtfq7maOjRQ8KE7c6kd7Vtad Content-Type: text/x-csrc; charset=UTF-8; name="tufd.c" Content-Disposition: attachment; filename="tufd.c" Content-Transfer-Encoding: base64 I2luY2x1ZGUgPHN0ZGlvLmg+CiNpbmNsdWRlIDxzdGRsaWIuaD4KI2luY2x1ZGUgPHVuaXN0 ZC5oPgojaW5jbHVkZSA8ZmNudGwuaD4KI2luY2x1ZGUgPHN0cmluZy5oPgojaW5jbHVkZSA8 cG9sbC5oPgojaW5jbHVkZSA8c3lzL21tYW4uaD4KI2luY2x1ZGUgPHN5cy9pb2N0bC5oPgoj aW5jbHVkZSA8c3lzL3dhaXQuaD4KI2luY2x1ZGUgPGxpbnV4L21tYW4uaD4KI2luY2x1ZGUg PHN5cy91aW8uaD4KI2luY2x1ZGUgPGVycm5vLmg+CiNpbmNsdWRlIDxwdGhyZWFkLmg+CiNp bmNsdWRlIDxzaWduYWwuaD4KI2luY2x1ZGUgPGxpbnV4L3VzZXJmYXVsdGZkLmg+CgojZGVm aW5lIEhQX1NJWkUJCSgyICogMTAyNCAqIDEwMjRVTEwpCgojaWZuZGVmIE5SX3VzZXJmYXVs dGZkCiNkZWZpbmUgTlJfdXNlcmZhdWx0ZmQJMjgyCiNlbmRpZgoKc3RydWN0IHRocmVhZF9k YXRhIHsKCXB0aHJlYWRfdCB0aHJlYWQ7CglwdGhyZWFkX2JhcnJpZXJfdCBiYXJyaWVyOwoJ aW50IHVmZmQ7Cn07CgpzdGF0aWMgdm9pZCAqZmF1bHRfaGFuZGxlcih2b2lkICpkYXRhKQp7 CglzdHJ1Y3QgdGhyZWFkX2RhdGEgKnRkID0gZGF0YTsKCXN0cnVjdCB1ZmZkX21zZyBtc2c7 CglzdHJ1Y3QgcG9sbGZkIHBmZDsKCWludCByZXQsIG5yZWFkeTsKCglwdGhyZWFkX2JhcnJp ZXJfd2FpdCgmdGQtPmJhcnJpZXIpOwoKCWRvIHsKCQlwZmQuZmQgPSB0ZC0+dWZmZDsKCQlw ZmQuZXZlbnRzID0gUE9MTElOOwoJCW5yZWFkeSA9IHBvbGwoJnBmZCwgMSwgLTEpOwoJCWlm IChucmVhZHkgPCAwKSB7CgkJCXBlcnJvcigicG9sbCIpOwoJCQlleGl0KDEpOwoJCX0KCgkJ cmV0ID0gcmVhZCh0ZC0+dWZmZCwgJm1zZywgc2l6ZW9mKG1zZykpOwoJCWlmIChyZXQgPCAw KSB7CgkJCWlmIChlcnJubyA9PSBFQUdBSU4pCgkJCQljb250aW51ZTsKCQkJcGVycm9yKCJy ZWFkIik7CgkJCWV4aXQoMSk7CgkJfQoKCQlpZiAobXNnLmV2ZW50ICE9IFVGRkRfRVZFTlRf UEFHRUZBVUxUKSB7CgkJCXByaW50ZigidW5zcGVjdGVkIGV2ZW50OiAleFxuIiwgbXNnLmV2 ZW50KTsKCQkJZXhpdCgxKTsKCQl9CgoJCXByaW50ZigiUGFnZSBmYXVsdFxuIik7CgkJcHJp bnRmKCJmbGFncyA9ICVseDsgIiwgKGxvbmcpIG1zZy5hcmcucGFnZWZhdWx0LmZsYWdzKTsK CQlwcmludGYoImFkZHJlc3MgPSAlbHhcbiIsIChsb25nKW1zZy5hcmcucGFnZWZhdWx0LmFk ZHJlc3MpOwoJfSB3aGlsZSAoMSk7CgoJcmV0dXJuIE5VTEw7Cn0KCnN0YXRpYyB2b2lkICph cm1fZmF1bHRfaGFuZGxlcihzdHJ1Y3QgdGhyZWFkX2RhdGEgKnRkLCBzaXplX3QgbGVuKQp7 CglzdHJ1Y3QgdWZmZGlvX2FwaSBhcGkgPSB7IH07CglzdHJ1Y3QgdWZmZGlvX3JlZ2lzdGVy IHJlZyA9IHsgfTsKCXZvaWQgKmJ1ZjsKCglidWYgPSBtbWFwKE5VTEwsIEhQX1NJWkUsIFBS T1RfUkVBRHxQUk9UX1dSSVRFLAoJCQlNQVBfUFJJVkFURSB8IE1BUF9IVUdFVExCIHwgTUFQ X0hVR0VfMk1CIHwgTUFQX0FOT05ZTU9VUywKCQkJLTEsIDApOwoJaWYgKGJ1ZiA9PSBNQVBf RkFJTEVEKSB7CgkJcGVycm9yKCJtbWFwIik7CgkJcmV0dXJuIE5VTEw7Cgl9CglwcmludGYo ImdvdCBidWYgJXBcbiIsIGJ1Zik7CgoJdGQtPnVmZmQgPSBzeXNjYWxsKE5SX3VzZXJmYXVs dGZkLCBPX0NMT0VYRUMgfCBPX05PTkJMT0NLKTsKCWlmICh0ZC0+dWZmZCA8IDApIHsKCQlw ZXJyb3IoInVzZXJmYXVsdGZkIik7CgkJcmV0dXJuIE5VTEw7Cgl9CgoJYXBpLmFwaSA9IFVG RkRfQVBJOwoJaWYgKGlvY3RsKHRkLT51ZmZkLCBVRkZESU9fQVBJLCAmYXBpKSA8IDApIHsK CQlwZXJyb3IoImlvY3RsIFVGRkRJT19BUEkiKTsKCQlyZXR1cm4gTlVMTDsKCX0KCglyZWcu cmFuZ2Uuc3RhcnQgPSAodW5zaWduZWQgbG9uZykgYnVmOwoJcmVnLnJhbmdlLmxlbiA9IEhQ X1NJWkU7CglyZWcubW9kZSA9IFVGRkRJT19SRUdJU1RFUl9NT0RFX01JU1NJTkc7CglpZiAo aW9jdGwodGQtPnVmZmQsIFVGRkRJT19SRUdJU1RFUiwgJnJlZykgPCAwKSB7CgkJcGVycm9y KCJpb2N0bCBVRkZESU9fUkVHSVNURVIiKTsKCQlyZXR1cm4gTlVMTDsKCX0KCglyZXR1cm4g YnVmOwp9CgpzdGF0aWMgdm9pZCBzaWdfdXNyMShpbnQgc2lnKQp7Cn0KCnN0YXRpYyB2b2lk IF9fZG9faW8oaW50IGZkLCB2b2lkICpidWYsIHNpemVfdCBsZW4pCnsKCXN0cnVjdCBzaWdh Y3Rpb24gYWN0ID0geyB9OwoJaW50IHJldDsKCglhY3Quc2FfaGFuZGxlciA9IHNpZ191c3Ix OwoJc2lnYWN0aW9uKFNJR1VTUjEsICZhY3QsIE5VTEwpOwoKCXByaW50ZigiY2hpbGQgd2ls bCB3cml0ZVxuIik7CglyZXQgPSB3cml0ZShmZCwgYnVmLCBsZW4pOwoJcHJpbnRmKCJyZXQ9 JWRcbiIsIHJldCk7Cn0KCnN0YXRpYyB2b2lkIGRvX2lvKHN0cnVjdCB0aHJlYWRfZGF0YSAq dGQsIHNpemVfdCBsZW4pCnsKCXZvaWQgKmJ1ZjsKCXBpZF90IHBpZDsKCWludCBmZDsKCglm ZCA9IG9wZW4oIi9kZXYvbnZtZTBuMSIsIE9fUkRXUik7CglpZiAoZmQgPCAwKSB7CgkJcGVy 
cm9yKCJvcGVuIGNyZWF0ZSIpOwoJCXJldHVybjsKCX0KCglwaWQgPSBmb3JrKCk7CglpZiAo cGlkKSB7CgkJaW50IHdzdGF0OwoKCQlzbGVlcCgxKTsKCQlraWxsKHBpZCwgU0lHVVNSMSk7 CgkJcHJpbnRmKCJ3YWl0IG9uIGNoaWxkXG4iKTsKCQl3YWl0cGlkKHBpZCwgJndzdGF0LCAw KTsKCX0gZWxzZSB7CgkJYnVmID0gYXJtX2ZhdWx0X2hhbmRsZXIodGQsIGxlbik7CgkJcHRo cmVhZF9iYXJyaWVyX3dhaXQoJnRkLT5iYXJyaWVyKTsKCQlfX2RvX2lvKGZkLCBidWYsIGxl bik7CgkJZXhpdCgwKTsKCX0KfQoKaW50IG1haW4oaW50IGFyZ2MsIGNoYXIgKmFyZ3ZbXSkK ewoJc3RydWN0IHRocmVhZF9kYXRhIHRkID0geyB9OwoKCXB0aHJlYWRfYmFycmllcl9pbml0 KCZ0ZC5iYXJyaWVyLCBOVUxMLCAyKTsKCXB0aHJlYWRfY3JlYXRlKCZ0ZC50aHJlYWQsIE5V TEwsIGZhdWx0X2hhbmRsZXIsICZ0ZCk7CgoJZG9faW8oJnRkLCBIUF9TSVpFKTsKCXJldHVy biAwOwp9Cg== --------------Rtfq7maOjRQ8KE7c6kd7Vtad--