From mboxrd@z Thu Jan 1 00:00:00 1970
From: Zhiwei Jiang <qq282012236@gmail.com>
Date: Wed, 23 Apr 2025 14:22:32 +0800
Subject: Re: [PATCH v2 1/2] io_uring: Add new functions to handle user fault scenarios
To: Jens Axboe
Cc: viro@zeniv.linux.org.uk, brauner@kernel.org, jack@suse.cz,
	akpm@linux-foundation.org, peterx@redhat.com, asml.silence@gmail.com,
	linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, io-uring@vger.kernel.org
References: <20250422162913.1242057-1-qq282012236@gmail.com>
	<20250422162913.1242057-2-qq282012236@gmail.com>
	<14195206-47b1-4483-996d-3315aa7c33aa@kernel.dk>
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"

On Wed, Apr 23, 2025 at 11:11 AM 姜智伟 wrote:
>
> Sorry, I may have misunderstood. I thought your test case
> was working correctly. In io_wq_worker_running() it will return
> if in io worker context, which is different from a normal process
> context. I hope the graph above can help you understand.
>
> On Wed, Apr 23, 2025 at 10:49 AM 姜智伟 wrote:
> >
> > On Wed, Apr 23, 2025 at 1:33 AM Jens Axboe wrote:
> > >
> > > On 4/22/25 11:04 AM, ??? wrote:
> > > > On Wed, Apr 23, 2025 at 12:32 AM Jens Axboe wrote:
> > > >>
> > > >> On 4/22/25 10:29 AM, Zhiwei Jiang wrote:
> > > >>> diff --git a/io_uring/io-wq.h b/io_uring/io-wq.h
> > > >>> index d4fb2940e435..8567a9c819db 100644
> > > >>> --- a/io_uring/io-wq.h
> > > >>> +++ b/io_uring/io-wq.h
> > > >>> @@ -70,8 +70,10 @@ enum io_wq_cancel io_wq_cancel_cb(struct io_wq *wq, work_cancel_fn *cancel,
> > > >>>                                          void *data, bool cancel_all);
> > > >>>
> > > >>>  #if defined(CONFIG_IO_WQ)
> > > >>> -extern void io_wq_worker_sleeping(struct task_struct *);
> > > >>> -extern void io_wq_worker_running(struct task_struct *);
> > > >>> +extern void io_wq_worker_sleeping(struct task_struct *tsk);
> > > >>> +extern void io_wq_worker_running(struct task_struct *tsk);
> > > >>> +extern void set_userfault_flag_for_ioworker(void);
> > > >>> +extern void clear_userfault_flag_for_ioworker(void);
> > > >>>  #else
> > > >>>  static inline void io_wq_worker_sleeping(struct task_struct *tsk)
> > > >>>  {
> > > >>> @@ -79,6 +81,12 @@ static inline void io_wq_worker_sleeping(struct task_struct *tsk)
> > > >>>  static inline void io_wq_worker_running(struct task_struct *tsk)
> > > >>>  {
> > > >>>  }
> > > >>> +static inline void set_userfault_flag_for_ioworker(void)
> > > >>> +{
> > > >>> +}
> > > >>> +static inline void clear_userfault_flag_for_ioworker(void)
> > > >>> +{
> > > >>> +}
> > > >>>  #endif
> > > >>>
> > > >>>  static inline bool io_wq_current_is_worker(void)
> > > >>
> > > >> This should go in include/linux/io_uring.h and then userfaultfd would
> > > >> not have to include io_uring private headers.
> > > >>
> > > >> But that's beside the point, like I said we still need to get to the
> > > >> bottom of what is going on here first, rather than try and paper around
> > > >> it. So please don't post more versions of this before we have that
> > > >> understanding.
> > > >>
> > > >> See previous emails on 6.8 and other kernel versions.
> > > >>
> > > >> --
> > > >> Jens Axboe
> > > > The issue did not involve creating new worker processes.
> > > > Instead, the existing IOU worker kernel threads (about a dozen) associated
> > > > with the VM process were fully utilizing CPU without writing data, caused
> > > > by a fault while reading user data pages in the fault_in_iov_iter_readable
> > > > function when pulling user memory into kernel space.
> > >
> > > OK that makes more sense, I can certainly reproduce a loop in this path:
> > >
> > > iou-wrk-726   729 36.910071:       9737 cycles:P:
> > >         ffff800080456c44 handle_userfault+0x47c
> > >         ffff800080381fc0 hugetlb_fault+0xb68
> > >         ffff80008031fee4 handle_mm_fault+0x2fc
> > >         ffff8000812ada6c do_page_fault+0x1e4
> > >         ffff8000812ae024 do_translation_fault+0x9c
> > >         ffff800080049a9c do_mem_abort+0x44
> > >         ffff80008129bd78 el1_abort+0x38
> > >         ffff80008129ceb4 el1h_64_sync_handler+0xd4
> > >         ffff8000800112b4 el1h_64_sync+0x6c
> > >         ffff80008030984c fault_in_readable+0x74
> > >         ffff800080476f3c iomap_file_buffered_write+0x14c
> > >         ffff8000809b1230 blkdev_write_iter+0x1a8
> > >         ffff800080a1f378 io_write+0x188
> > >         ffff800080a14f30 io_issue_sqe+0x68
> > >         ffff800080a155d0 io_wq_submit_work+0xa8
> > >         ffff800080a32afc io_worker_handle_work+0x1f4
> > >         ffff800080a332b8 io_wq_worker+0x110
> > >         ffff80008002dd38 ret_from_fork+0x10
> > >
> > > which seems to be expected, we'd continually try and fault in the
> > > ranges, if the userfaultfd handler isn't filling them.
> > >
> > > I guess this is where I'm still confused, because I don't see how this
> > > is different from if you have a normal write(2) syscall doing the same
> > > thing - you'd get the same looping.
> > >
> > > ??
> > >
> > > > This issue occurs during VM snapshot loading (which uses
> > > > userfaultfd for on-demand memory loading), while the task in the guest is
> > > > writing data to disk.
> > > >
> > > > Normally, the VM first triggers a user fault to fill the page table.
> > > > So in the IOU worker thread, the page tables are already filled, and
> > > > no fault happens when faulting in memory pages
> > > > in fault_in_iov_iter_readable.
> > > >
> > > > I suspect that during snapshot loading, a memory access in the
> > > > VM triggers an async page fault handled by the kernel thread,
> > > > while the IOU worker's async kernel thread is also running.
> > > > Maybe the problem occurs if the IOU worker's thread is scheduled first.
> > > > I'm going to bed now.
> > >
> > > Ah ok, so what you're saying is that because we end up not sleeping
> > > (because a signal is pending, it seems), then the fault will never get
> > > filled and hence progress not made? And the signal is pending because
> > > someone tried to create a new worker, and this work is not getting
> > > processed.
> > >
> > > --
> > > Jens Axboe
> > handle_userfault() {
> >   hugetlb_vma_lock_read();
> >   _raw_spin_lock_irq() {
> >     __pv_queued_spin_lock_slowpath();
> >   }
> >   vma_mmu_pagesize() {
> >     hugetlb_vm_op_pagesize();
> >   }
> >   huge_pte_offset();
> >   hugetlb_vma_unlock_read();
> >   up_read();
> >   __wake_up() {
> >     _raw_spin_lock_irqsave() {
> >       __pv_queued_spin_lock_slowpath();
> >     }
> >     __wake_up_common();
> >     _raw_spin_unlock_irqrestore();
> >   }
> >   schedule() {
> >     io_wq_worker_sleeping() {
> >       io_wq_dec_running();
> >     }
> >     rcu_note_context_switch();
> >     raw_spin_rq_lock_nested() {
> >       _raw_spin_lock();
> >     }
> >     update_rq_clock();
> >     pick_next_task() {
> >       pick_next_task_fair() {
> >         update_curr() {
> >           update_curr_se();
> >           __calc_delta.constprop.0();
> >           update_min_vruntime();
> >         }
> >         check_cfs_rq_runtime();
> >         pick_next_entity() {
> >           pick_eevdf();
> >         }
> >         update_curr() {
> >           update_curr_se();
> >           __calc_delta.constprop.0();
> >           update_min_vruntime();
> >         }
> >         check_cfs_rq_runtime();
> >         pick_next_entity() {
> >           pick_eevdf();
> >         }
> >         update_curr() {
> >           update_curr_se();
> >           update_min_vruntime();
> >           cpuacct_charge();
> >           __cgroup_account_cputime() {
> >             cgroup_rstat_updated();
> >           }
> >         }
> >         check_cfs_rq_runtime();
> >         pick_next_entity() {
> >           pick_eevdf();
> >         }
> >       }
> >     }
> >     raw_spin_rq_unlock();
> >     io_wq_worker_running();
> >   }
> >   _raw_spin_lock_irq() {
> >     __pv_queued_spin_lock_slowpath();
> >   }
> >   userfaultfd_ctx_put();
> > }
> > }
> > The execution flow above is the one that kept faulting
> > repeatedly in the IOU worker during the issue. The entire fault path,
> > including this final userfault handling code you're seeing, would be
> > triggered in an infinite loop. That's why I traced and found that the
> > io_wq_worker_running() function returns early, causing the flow to
> > differ from a normal user fault, where it should be sleeping.
> >
> > However, your call stack appears to behave normally,
> > which makes me curious about what's different in the execution flow.
> > Would you be able to share your test case code so I can study it
> > and try to reproduce the behavior on my side?

Sorry, I may have misunderstood. I thought your test case
was working correctly. In io_wq_worker_running() it will return
if in io worker context, which is different from a normal process
context. I hope the graph above can help you understand.

Also, regarding your initial suggestion to move the function into
include/linux/io_uring.h, I'm not sure that's the best fit, since the
problematic context (io_wq_worker) and the function needing changes
(io_wq_worker_running) are both heavily tied to the internals of io-wq.
I'm wondering if doing something like #include "../../io_uring/io-wq.h"
as in kernel/sched/core.c:96 might actually be a better choice here?

And I'd still really appreciate it if you could share your test case
code; it would help a lot. Thanks!
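
P.S. To make the io-wq.h suggestion a bit more concrete, here is a rough
sketch of the shape I have in mind on the userfaultfd side. It only
illustrates where the two new helpers would sit relative to the userfault
wait and how the relative include would look; it is not the actual hunk
from patch 2/2, and the wrapper function name is made up for the example:

/* Sketch only: the userfaultfd side pulls in the io-wq declarations
 * directly, the same way kernel/sched/core.c does, instead of exporting
 * them through include/linux/io_uring.h.  The relative path depends on
 * the location of the including file. */
#include <linux/mm.h>
#include "../io_uring/io-wq.h"

static vm_fault_t do_userfault_from_worker(struct vm_fault *vmf,
					   unsigned long reason)
{
	vm_fault_t ret;

	/* Mark the io worker as blocked on a userfault so the
	 * io_wq_worker_sleeping()/io_wq_worker_running() hooks can treat
	 * it like a normally sleeping task instead of letting it spin. */
	if (io_wq_current_is_worker())
		set_userfault_flag_for_ioworker();

	ret = handle_userfault(vmf, reason);	/* the wait happens in here */

	if (io_wq_current_is_worker())
		clear_userfault_flag_for_ioworker();

	return ret;
}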
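
P.P.S. In case it helps to compare against your test, below is roughly the
shape of the scenario I keep describing, reduced to a stand-alone user-space
sketch. It is untested as written, uses plain anonymous memory instead of
the hugetlb-backed guest memory, and leaves out all error handling; the file
name and sizes are arbitrary:

/* A buffer is registered with userfaultfd but nothing ever resolves the
 * faults, and an io_uring write sourced from that buffer is forced onto
 * an io-wq worker, which then faults in fault_in_iov_iter_readable(). */
#include <fcntl.h>
#include <liburing.h>
#include <linux/userfaultfd.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
	size_t len = 1UL << 20;
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	/* Register the buffer for missing-page faults, but never run a
	 * handler thread, so faults on it are never resolved.
	 * (Kernel-originated faults may need vm.unprivileged_userfaultfd=1
	 * or CAP_SYS_PTRACE.) */
	int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
	struct uffdio_api api = { .api = UFFD_API };
	ioctl(uffd, UFFDIO_API, &api);
	struct uffdio_register reg = {
		.range = { .start = (unsigned long)buf, .len = len },
		.mode  = UFFDIO_REGISTER_MODE_MISSING,
	};
	ioctl(uffd, UFFDIO_REGISTER, &reg);

	int fd = open("uffd-io-wq-test", O_WRONLY | O_CREAT | O_TRUNC, 0644);

	struct io_uring ring;
	io_uring_queue_init(8, &ring, 0);

	struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
	io_uring_prep_write(sqe, fd, buf, len, 0);
	sqe->flags |= IOSQE_ASYNC;	/* punt the request to an io-wq worker */
	io_uring_submit(&ring);

	/* The io-wq worker now faults on buf while writing; watch its CPU
	 * usage while this wait never completes. */
	struct io_uring_cqe *cqe;
	io_uring_wait_cqe(&ring, &cqe);
	return 0;
}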