From mboxrd@z Thu Jan 1 00:00:00 1970
From: Jens Axboe <axboe@kernel.dk>
Date: Wed, 23 Apr 2025 07:34:13 -0600
Subject: Re: [PATCH v2 1/2] io_uring: Add new functions to handle user fault scenarios
To: Zhiwei Jiang (姜智伟)
Cc: viro@zeniv.linux.org.uk, brauner@kernel.org, jack@suse.cz,
 akpm@linux-foundation.org, peterx@redhat.com, asml.silence@gmail.com,
 linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org, io-uring@vger.kernel.org
Message-ID: <7bea9c74-7551-4312-bece-86c4ad5c982f@kernel.dk>
References: <20250422162913.1242057-1-qq282012236@gmail.com>
 <20250422162913.1242057-2-qq282012236@gmail.com>
 <14195206-47b1-4483-996d-3315aa7c33aa@kernel.dk>

On 4/22/25 8:49 PM, Zhiwei Jiang wrote:
> On Wed, Apr 23, 2025 at 1:33 AM Jens Axboe wrote:
>>
>> On 4/22/25 11:04 AM, Zhiwei Jiang wrote:
>>> On Wed, Apr 23, 2025 at 12:32 AM Jens Axboe wrote:
>>>>
>>>> On 4/22/25 10:29 AM, Zhiwei Jiang wrote:
>>>>> diff --git a/io_uring/io-wq.h b/io_uring/io-wq.h
>>>>> index d4fb2940e435..8567a9c819db 100644
>>>>> --- a/io_uring/io-wq.h
>>>>> +++ b/io_uring/io-wq.h
>>>>> @@ -70,8 +70,10 @@ enum io_wq_cancel io_wq_cancel_cb(struct io_wq *wq, work_cancel_fn *cancel,
>>>>>  					void *data, bool cancel_all);
>>>>>
>>>>>  #if defined(CONFIG_IO_WQ)
>>>>> -extern void io_wq_worker_sleeping(struct task_struct *);
>>>>> -extern void io_wq_worker_running(struct task_struct *);
>>>>> +extern void io_wq_worker_sleeping(struct task_struct *tsk);
>>>>> +extern void io_wq_worker_running(struct task_struct *tsk);
>>>>> +extern void set_userfault_flag_for_ioworker(void);
>>>>> +extern void clear_userfault_flag_for_ioworker(void);
>>>>>  #else
>>>>>  static inline void io_wq_worker_sleeping(struct task_struct *tsk)
>>>>>  {
>>>>> @@ -79,6 +81,12 @@ static inline void io_wq_worker_sleeping(struct task_struct *tsk)
>>>>>  static inline void io_wq_worker_running(struct task_struct *tsk)
>>>>>  {
>>>>>  }
>>>>> +static inline void set_userfault_flag_for_ioworker(void)
>>>>> +{
>>>>> +}
>>>>> +static inline void clear_userfault_flag_for_ioworker(void)
>>>>> +{
>>>>> +}
>>>>>  #endif
>>>>>
>>>>>  static inline bool io_wq_current_is_worker(void)
>>>>
>>>> This should go in include/linux/io_uring.h and then userfaultfd would
>>>> not have to include io_uring private headers.
>>>>
>>>> But that's beside the point, like I said we still need to get to the
>>>> bottom of what is going on here first, rather than try and paper around
>>>> it. So please don't post more versions of this before we have that
>>>> understanding.
>>>>
>>>> See previous emails on 6.8 and other kernel versions.
>>>>
>>>> --
>>>> Jens Axboe
>>> The issue did not involve creating new worker processes. Instead, the
>>> existing IOU worker kernel threads (about a dozen) associated with the
>>> VM process were fully utilizing CPU without writing any data, caused
>>> by a fault while reading user data pages in
>>> fault_in_iov_iter_readable() when pulling user memory into kernel
>>> space.
>>
>> OK that makes more sense, I can certainly reproduce a loop in this path:
>>
>> iou-wrk-726   729 36.910071:       9737 cycles:P:
>>        ffff800080456c44 handle_userfault+0x47c
>>        ffff800080381fc0 hugetlb_fault+0xb68
>>        ffff80008031fee4 handle_mm_fault+0x2fc
>>        ffff8000812ada6c do_page_fault+0x1e4
>>        ffff8000812ae024 do_translation_fault+0x9c
>>        ffff800080049a9c do_mem_abort+0x44
>>        ffff80008129bd78 el1_abort+0x38
>>        ffff80008129ceb4 el1h_64_sync_handler+0xd4
>>        ffff8000800112b4 el1h_64_sync+0x6c
>>        ffff80008030984c fault_in_readable+0x74
>>        ffff800080476f3c iomap_file_buffered_write+0x14c
>>        ffff8000809b1230 blkdev_write_iter+0x1a8
>>        ffff800080a1f378 io_write+0x188
>>        ffff800080a14f30 io_issue_sqe+0x68
>>        ffff800080a155d0 io_wq_submit_work+0xa8
>>        ffff800080a32afc io_worker_handle_work+0x1f4
>>        ffff800080a332b8 io_wq_worker+0x110
>>        ffff80008002dd38 ret_from_fork+0x10
>>
>> which seems to be expected, we'd continually try and fault in the
>> ranges, if the userfaultfd handler isn't filling them.
>>
>> I guess this is where I'm still confused, because I don't see how this
>> is different from if you have a normal write(2) syscall doing the same
>> thing - you'd get the same looping.
>>
>> ??
>>
>>> This issue occurs during VM snapshot loading (which uses userfaultfd
>>> for on-demand memory loading), while the task in the guest is writing
>>> data to disk.
>>>
>>> Normally, the VM first triggers a user fault to fill the page table.
>>> So in the IOU worker thread the page tables are already filled, and
>>> no fault happens when faulting in memory pages in
>>> fault_in_iov_iter_readable().
>>>
>>> I suspect that during snapshot loading, a memory access in the VM
>>> triggers an async page fault handled by the kernel thread, while the
>>> IOU worker's async kernel thread is also running. Maybe it happens if
>>> the IOU worker's thread is scheduled first. I'm going to bed now.
>>
>> Ah ok, so what you're saying is that because we end up not sleeping
>> (because a signal is pending, it seems), then the fault will never get
>> filled and hence progress not made? And the signal is pending because
>> someone tried to create a new worker, and this work is not getting
>> processed.
>>
>> --
>> Jens Axboe
> handle_userfault() {
>   hugetlb_vma_lock_read();
>   _raw_spin_lock_irq() {
>     __pv_queued_spin_lock_slowpath();
>   }
>   vma_mmu_pagesize() {
>     hugetlb_vm_op_pagesize();
>   }
>   huge_pte_offset();
>   hugetlb_vma_unlock_read();
>   up_read();
>   __wake_up() {
>     _raw_spin_lock_irqsave() {
>       __pv_queued_spin_lock_slowpath();
>     }
>     __wake_up_common();
>     _raw_spin_unlock_irqrestore();
>   }
>   schedule() {
>     io_wq_worker_sleeping() {
>       io_wq_dec_running();
>     }
>     rcu_note_context_switch();
>     raw_spin_rq_lock_nested() {
>       _raw_spin_lock();
>     }
>     update_rq_clock();
>     pick_next_task() {
>       pick_next_task_fair() {
>         update_curr() {
>           update_curr_se();
>           __calc_delta.constprop.0();
>           update_min_vruntime();
>         }
>         check_cfs_rq_runtime();
>         pick_next_entity() {
>           pick_eevdf();
>         }
>         update_curr() {
>           update_curr_se();
>           __calc_delta.constprop.0();
>           update_min_vruntime();
>         }
>         check_cfs_rq_runtime();
>         pick_next_entity() {
>           pick_eevdf();
>         }
>         update_curr() {
>           update_curr_se();
>           update_min_vruntime();
>           cpuacct_charge();
>           __cgroup_account_cputime() {
>             cgroup_rstat_updated();
>           }
>         }
>         check_cfs_rq_runtime();
>         pick_next_entity() {
>           pick_eevdf();
>         }
>       }
>     }
>     raw_spin_rq_unlock();
>     io_wq_worker_running();
>   }
>   _raw_spin_lock_irq() {
>     __pv_queued_spin_lock_slowpath();
>   }
>   userfaultfd_ctx_put();
> }
>
> The execution flow above is the one that kept faulting repeatedly in
> the IOU worker during the issue. The entire fault path, including this
> final userfault handling code you're seeing, would be triggered in an
> infinite loop. That's why I traced and found that the
> io_wq_worker_running() function returns early, causing the flow to
> differ from a normal user fault, where it should be sleeping.

io_wq_worker_running() is called when the task is scheduled back in.
There's no "returning early" here, it simply updates the accounting.
Which is part of why your patch makes very little sense to me, we
would've called both io_wq_worker_sleeping() and _running() from the
userfaultfd path. The latter doesn't really do much, it simply
increments the running worker count, if the worker was previously
marked as sleeping.

And I strongly suspect that the latter is the issue, not the marking of
running. The above loop is fine if we do go to sleep in schedule().
However, if there's task_work (either TWA_SIGNAL or TWA_NOTIFY_SIGNAL
based) pending, then schedule() will be a no-op and we're going to
repeatedly go through that loop. This is because the expectation here
is that the loop will be aborted if either of those is true, so that
task_work can get run (or a signal handled, whatever), and then the
operation retried.

> However, your call stack appears to behave normally, which makes me
> curious about what's different about the execution flow. Would you be
> able to share your test case code so I can study it and try to
> reproduce the behavior on my side?

It behaves normally for the initial attempt - we end up sleeping in
schedule(). However, then a new worker gets created, or the ring is
shut down, in which case schedule() ends up being a no-op because
TWA_NOTIFY_SIGNAL is set, and then we just sit there in a loop running
the same code again and again to no avail.

So I do think my test case and your issue are the same, I just
reproduce it by calling io_uring_queue_exit(), but the exact same thing
would happen if worker creation is attempted while an io-wq worker is
blocked in handle_userfault().
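For illustration only, a reproducer along those lines might look
roughly like the sketch below. This is a hypothetical sketch, not the
actual test case discussed here: it assumes liburing is installed, uses
a plain anonymous mapping rather than hugetlb, and needs root (or
vm.unprivileged_userfaultfd=1) for the userfaultfd registration. The
idea is to park an io-wq worker in handle_userfault() on a fault that
is never serviced, then tear the ring down so the worker gets signaled:

#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#include <liburing.h>

int main(void)
{
	struct uffdio_api api = { .api = UFFD_API };
	struct uffdio_register reg = { };
	struct io_uring_sqe *sqe;
	struct io_uring ring;
	size_t len = 2 * 1024 * 1024;
	void *buf;
	int uffd, fd;

	/* buffer whose missing pages are never filled in by anyone */
	uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
	if (uffd < 0 || ioctl(uffd, UFFDIO_API, &api) < 0) {
		perror("userfaultfd");
		return 1;
	}
	buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	reg.range.start = (unsigned long) buf;
	reg.range.len = len;
	reg.mode = UFFDIO_REGISTER_MODE_MISSING;
	if (ioctl(uffd, UFFDIO_REGISTER, &reg) < 0) {
		perror("UFFDIO_REGISTER");
		return 1;
	}

	fd = open("uffd-testfile", O_WRONLY | O_CREAT | O_TRUNC, 0644);
	io_uring_queue_init(8, &ring, 0);

	/*
	 * IOSQE_ASYNC pushes the write to io-wq; the worker then blocks
	 * in handle_userfault() while faulting in buf.
	 */
	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_write(sqe, fd, buf, len, 0);
	io_uring_sqe_set_flags(sqe, IOSQE_ASYNC);
	io_uring_submit(&ring);

	sleep(1);	/* let the worker reach handle_userfault() */

	/*
	 * Ring teardown signals the worker (TWA_NOTIFY_SIGNAL); per the
	 * discussion above, schedule() in the fault loop then becomes a
	 * no-op and the worker spins instead of sleeping or aborting.
	 */
	io_uring_queue_exit(&ring);
	sleep(10);	/* observe the iou-wrk thread with top/perf */
	return 0;
}

If the behavior described above reproduces, the iou-wrk thread is left
burning CPU after io_uring_queue_exit() instead of going away.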
This is why I want to fully understand the issue rather than paper
around it, as I don't think the fix is correct as-is. We really want to
abort the loop and allow the task to handle whatever signaling is
currently preventing proper sleeps.

I'll dabble a bit more and send out the test case too, in case it'll
help on your end.

--
Jens Axboe