From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <owner-linux-mm@kvack.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by smtp.lore.kernel.org (Postfix) with ESMTP id C7084C369AB
	for <linux-mm@archiver.kernel.org>; Thu, 24 Apr 2025 14:53:03 +0000 (UTC)
Received: by kanga.kvack.org (Postfix)
	id C2C026B00B3; Thu, 24 Apr 2025 10:53:00 -0400 (EDT)
Received: by kanga.kvack.org (Postfix, from userid 40)
	id BDDEE6B00B4; Thu, 24 Apr 2025 10:53:00 -0400 (EDT)
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id A2C7B6B00B5; Thu, 24 Apr 2025 10:53:00 -0400 (EDT)
X-Delivered-To: linux-mm@kvack.org
Received: from relay.hostedemail.com (smtprelay0016.hostedemail.com [216.40.44.16])
	by kanga.kvack.org (Postfix) with ESMTP id 839CC6B00B3
	for <linux-mm@kvack.org>; Thu, 24 Apr 2025 10:53:00 -0400 (EDT)
Received: from smtpin17.hostedemail.com (a10.router.float.18 [10.200.18.1])
	by unirelay06.hostedemail.com (Postfix) with ESMTP id A9C5EBF627
	for <linux-mm@kvack.org>; Thu, 24 Apr 2025 14:53:02 +0000 (UTC)
X-FDA: 83369229804.17.BD9FC0E
Received: from mail-io1-f41.google.com (mail-io1-f41.google.com [209.85.166.41])
	by imf07.hostedemail.com (Postfix) with ESMTP id 9185840005
	for <linux-mm@kvack.org>; Thu, 24 Apr 2025 14:53:00 +0000 (UTC)
Authentication-Results: imf07.hostedemail.com;
	dkim=pass header.d=kernel-dk.20230601.gappssmtp.com header.s=20230601 header.b=y6zPEEo3;
	dmarc=none;
	spf=pass (imf07.hostedemail.com: domain of axboe@kernel.dk designates 209.85.166.41 as permitted sender) smtp.mailfrom=axboe@kernel.dk
ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1745506380; a=rsa-sha256;
	cv=none;
	b=yYCxNscuoTSMnp5gRgAAVSE32jMBOyDIs/DOefJUtKMhLIUYOlsMKY/9igBRivvGy3p+Hl
	bxlzSjfWrJZw2wha9wkRQyM2YjJB1u5rUcmzXYn5jbl0D5bXZ/lP1U2w3jDwG8ENX9+iUQ
	Bq0/EhoyUy36fieCIeYmeiTKTuBmYzc=
ARC-Authentication-Results: i=1;
	imf07.hostedemail.com;
	dkim=pass header.d=kernel-dk.20230601.gappssmtp.com header.s=20230601 header.b=y6zPEEo3;
	dmarc=none;
	spf=pass (imf07.hostedemail.com: domain of axboe@kernel.dk designates 209.85.166.41 as permitted sender) smtp.mailfrom=axboe@kernel.dk
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com;
	s=arc-20220608; t=1745506380;
	h=from:from:sender:reply-to:subject:subject:date:date:
	 message-id:message-id:to:to:cc:cc:mime-version:mime-version:
	 content-type:content-type:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references:dkim-signature;
	bh=3yE0uV5DbxwJ1m2x+ijJ+zl26X5qcx9ZqXHltCrR4LE=;
	b=I69CLEgb64mE6aW0IDtjAOCdSlMVvMfpXi77zUO4BBjs2rPSSrSdx8hz51W5h8Gspg/xwz
	KwU73MicEcihNJ/PGh7R0pLK/5zO5SS2P1/jhbsv+ZzV409rSfljdroZ/fNV6GQ7c6m6hf
	DaR7y2t77pRwuSzlYKU94EddsdEnE5A=
Received: by mail-io1-f41.google.com with SMTP id ca18e2360f4ac-86192b6946eso34005239f.1
        for <linux-mm@kvack.org>; Thu, 24 Apr 2025 07:53:00 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=kernel-dk.20230601.gappssmtp.com; s=20230601; t=1745506379; x=1746111179; darn=kvack.org;
        h=content-transfer-encoding:in-reply-to:content-language:from
         :references:cc:to:subject:user-agent:mime-version:date:message-id
         :from:to:cc:subject:date:message-id:reply-to;
        bh=3yE0uV5DbxwJ1m2x+ijJ+zl26X5qcx9ZqXHltCrR4LE=;
        b=y6zPEEo3mdhiKpWGzqMUojrcdro44/KpYYXXqqqrKHouVGmVRopJErmiHgeYFupf23
         aA+iIzri8vCUocnI4JMHbSVhAUhvwb75KemSLKVdSKHIszu5kfOiCXfHzC/gF+t/g3bV
         MISkb7WhhIgQsvNnmjo1lOg4XthqfMzyE+RXUE7+N/hxYyztiW+IybZhUu/siwb1baYT
         pJTYu2YjHT+vX7dZU+ma49mekYY6TRJlasr0WuRBS+6S9TVBXK93hLRaERCRUPxe3Rk9
         wbQ5ze+TrwyTyKOcMwoJ0hbgs6eM+8MYh57nZgdWKWe/h1RIldUmv7LfDnbxatGvIv38
         IhdA==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20230601; t=1745506379; x=1746111179;
        h=content-transfer-encoding:in-reply-to:content-language:from
         :references:cc:to:subject:user-agent:mime-version:date:message-id
         :x-gm-message-state:from:to:cc:subject:date:message-id:reply-to;
        bh=3yE0uV5DbxwJ1m2x+ijJ+zl26X5qcx9ZqXHltCrR4LE=;
        b=F+jHBL4BFKuRfp608IjUU24UIuMIf4g7OFQGexU96S/6jpPn+kf+CHzcasueBBfSLA
         IxPXUpta76xPbHWTn3XcyZOQPe21jXksGuK4mUqpUrvd6aLjy+jlZFx38O7GWbjeYRCW
         AqjjCU72/iSUb7YAFg2zLy7z7qfo6zRoQB9kDDUTJ8n8bbSvAC68qJtbFYFYPCCabtQ4
         hxn3gnUabRxYdSCCTku+43Am8EaBGfTWPD/HZAynAH6kVkyHgzPLfMwFRuyDbxahmDBE
         udoOZtPawJ9EhKVIzIoLocF69Ogs69dEg2zxjCkI6e1grvoboXh6Q3SwfN7a0AfLYI5V
         K56A==
X-Forwarded-Encrypted: i=1; AJvYcCUQQliN4CCbf6gEwZ8dv027cuSDzouZnpLBj/hcxIKk1o/D/MH+acOBD8FF0Dbm/+QaLTWHiWFLZw==@kvack.org
X-Gm-Message-State: AOJu0YxtsJpsw1gC0PTY/k1hmjBMwEh0puHF/VP5Gbn7F/GmzxUCsc4v
	YBDOOuLkl6E1zLGQpihYoTXKiuQKstzThTLd42usGY0dYX1ve1CNdm2H+aPUa28=
X-Gm-Gg: ASbGncsu3nGDjLF1/fA+ZoRW8/yKbF6qrUXQMFTJ05Nt2nCvhztnGDoZeKL1HaKkxcf
	uswoKBmtUopwKWRamBhxenVl/kiL7b14aomuj731gb2jc97Hq5xG/TKkHF4vuFMGtb+q5r3dK6G
	eB1EFKVx8jic6PUhvboeG9MB7OZ3GAQe/HinKTdiyyBdd3pvUdyJrlgpqPph9jH1Hc8r6po1Zmc
	fNyUTJHEShkQlZE+6RPk8JFdDIhKGigw/5sNTqmxXdBamEDZz3Hrel0172kU94nQ907klj4ydJ5
	Uvsp1CZZeYv8gGXbMW07gW8D3g/hGjzjfLs5
X-Google-Smtp-Source: AGHT+IHmib9ahOC1dv2zniq5nwg6FnMJtZiIc3s8imDTqkDCv/YYE4iVYUjfL3mDW5W99/bqB+ghoQ==
X-Received: by 2002:a05:6602:7517:b0:85b:423a:1c20 with SMTP id ca18e2360f4ac-8644fa7e980mr381604139f.13.1745506379519;
        Thu, 24 Apr 2025 07:52:59 -0700 (PDT)
Received: from [192.168.1.116] ([96.43.243.2])
        by smtp.gmail.com with ESMTPSA id ca18e2360f4ac-864519ca71dsm21162439f.32.2025.04.24.07.52.58
        (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128);
        Thu, 24 Apr 2025 07:52:58 -0700 (PDT)
Message-ID: <1ed67bb5-5d3d-4af8-b5a7-4f644186708b@kernel.dk>
Date: Thu, 24 Apr 2025 08:52:57 -0600
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Subject: Re: [PATCH v2 1/2] io_uring: Add new functions to handle user fault
 scenarios
To: =?UTF-8?B?5aec5pm65Lyf?= <qq282012236@gmail.com>
Cc: viro@zeniv.linux.org.uk, brauner@kernel.org, jack@suse.cz,
 akpm@linux-foundation.org, peterx@redhat.com, asml.silence@gmail.com,
 linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org, io-uring@vger.kernel.org
References: <20250422162913.1242057-1-qq282012236@gmail.com>
 <20250422162913.1242057-2-qq282012236@gmail.com>
 <14195206-47b1-4483-996d-3315aa7c33aa@kernel.dk>
 <CANHzP_uW4+-M1yTg-GPdPzYWAmvqP5vh6+s1uBhrMZ3eBusLug@mail.gmail.com>
 <b61ac651-fafe-449a-82ed-7239123844e1@kernel.dk>
 <CANHzP_tLV29_uk2gcRAjT9sJNVPH3rMyVuQP07q+c_TWWgsfDg@mail.gmail.com>
 <7bea9c74-7551-4312-bece-86c4ad5c982f@kernel.dk>
 <52d55891-36e3-43e7-9726-a2cd113f5327@kernel.dk>
 <cac3a5c9-e798-47f2-81ff-3c6003c6d8bb@kernel.dk>
 <CANHzP_uJft1FPJ0W++0Zp5rUjayaULEdpAQRn1VuuqDVq3DmJA@mail.gmail.com>
 <5c20b5ca-ce41-43c4-870a-c50206ab058d@kernel.dk>
 <CANHzP_u2SA3uSoG-4LQ-e9BvW6t-Zo1wn8qnKM0xYGoekL74bA@mail.gmail.com>
From: Jens Axboe <axboe@kernel.dk>
Content-Language: en-US
In-Reply-To: <CANHzP_u2SA3uSoG-4LQ-e9BvW6t-Zo1wn8qnKM0xYGoekL74bA@mail.gmail.com>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
X-Rspamd-Server: rspam10
X-Rspamd-Queue-Id: 9185840005
X-Stat-Signature: wwiydfo76pycbqbugfu3ms8gjf8ktmq7
X-Rspam-User: 
X-HE-Tag: 1745506380-582375
X-HE-Meta: U2FsdGVkX19jIh+6mZssVR46o7XeYE9UrfiNNhXX/MieLEq8TbvLq44MZ9j49O0+aE64IEKaJH8r0xz2qfIoK44jDr6YR8PPLTREwcc4QVhmrFYnI4cV1ILz1wv9ycykSRlQO2O7lYwT/j7Q644NOBluMTuM89M19UTx4K4c85EdkxvmaqU5jG8IXwRy+atfeRxv128UbeNVoxFuJpjyymmrbmBHdnTU7Rru7/p4ajoXQScy7t7V1++lINcsNFHJ/+re96aNeLm6HDyUHnY0jUsx6SuKz4lV1MzrfR5R7GvdmunLSJSciTY0g4SsVWA0kTmCb23a8dlgBY6bHzyM14CpVaV5rRc7UICLRxveP/PVwmpYpnMB3+o43Z0VUKbSitnSUv4KJarpil1uF+N9hAZUsqwAmcl4mKJY0NiZgUZSF7GidF40dMQdfKs+v9EMymlVdDuxb1ViLXyhR0VPb/XCw41LQVeNtZXqK8ldL3QEVYz3o3wKACAfknIsQrIuPQiWopi9Q9ueyXpXoNvTwO1T43PkUy+0CcDL9MaOz41URAli1Jt0y1fKnEQgzcMcStqPlJ89kb51Ik9pUwkAWyjNE84sSxJF4LIBb0PHiaeg7OJitm4u9Ho10S6CcN3q8Hkvd7zQqa5nLdK/0H5IIk/M9BTePS7dkXjc/P6wzmuBOV7KToR2WS/9makaH0inyZjHZSqdxvfDRhSVcspbuyMJf5LvlAc7rqz5tMC4FjC/M7DoLfBsESVVnzlfnU8wFeMsr/V3BqiP+koFXpp0+qhTkRclDfJjfnx4uMyFDdgHNf8HNrl+0b4EOPusIRn3sR7E8dcoRMuQRCjDkDD6K7E21U2fT8v8E7Ydr7pVsW43Yc+Zxy2XwjymQZkXnBjVEB72H32tszzqDkIgo8Ra5TbDhgvdqUQd/Buo9Y3tfFqvbstJglUYJ2B7u//0RSAC8QDxeZUde4QoI13Tcmo
 dJmEM0FL
 ML0jtm+IawEiGpCgzAgjk7OnVtfMZdE9nRs6KZnnGzPWFSdfjiVg1QP/ngCJX60cJ1i/CHBvORjEvaGa7zdPRqfofm61CB1z8Io6Nx9VWNvr4qZd6F35f0Dqj3z8cbSxK0njOXe8o3Brc4eGdayYqsUbHRjgebaRunpFUGaz1A3nU7IXXOSs3SQ6OAVZH5b46IOcMLeDOBwyhNs2k7eLMrEaGgIPTCufmF5oOJivQH73cU12vMU3zC33xdb23I7flsjjx/h8bUHJn6X5afH4E3t+2A1CnW+wvzc2gkhjgHP/4ETVcRdlTF6JQ17oi5Tf1n7pCMGLKOk+k+jIHNoN9gXvAwCTdKdWmjS6p3GNiEi98d8aVifEa/zMkgMgczbJvpwOoe7j+eNEsOOCzgCLDkej2PVxCMOpLmcTXgAlANnQhUrWie+mBS9ZkFgGRBGstEwOvNKy+loscyniCEUIhyKpkJqqpqPK3Uj5L
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>
List-Subscribe: <mailto:majordomo@kvack.org>
List-Unsubscribe: <mailto:majordomo@kvack.org>

On 4/24/25 8:45 AM, ??? wrote:
> Jens Axboe <axboe@kernel.dk> ?2025?4?24??? 22:13???
>>
>> On 4/24/25 8:08 AM, ??? wrote:
>>> Jens Axboe <axboe@kernel.dk> ?2025?4?24??? 06:58???
>>>>
>>>> On 4/23/25 9:55 AM, Jens Axboe wrote:
>>>>> Something like this, perhaps - it'll ensure that io-wq workers get a
>>>>> chance to flush out pending work, which should prevent the looping. I've
>>>>> attached a basic test case. It'll issue a write that will fault, and
>>>>> then try and cancel that as a way to trigger the TIF_NOTIFY_SIGNAL based
>>>>> looping.
>>>>
>>>> Something that may actually work - use TASK_UNINTERRUPTIBLE IFF
>>>> signal_pending() is true AND the fault has already been tried once
>>>> before. If that's the case, rather than just call schedule() with
>>>> TASK_INTERRUPTIBLE, use TASK_UNINTERRUPTIBLE and schedule_timeout() with
>>>> a suitable timeout length that prevents the annoying parts busy looping.
>>>> I used HZ / 10.
>>>>
>>>> I don't see how to fix userfaultfd for this case, either using io_uring
>>>> or normal write(2). Normal syscalls can pass back -ERESTARTSYS and get
>>>> it retried, but there's no way to do that from inside fault handling. So
>>>> I think we just have to be nicer about it.
>>>>
>>>> Andrew, as the userfaultfd maintainer, what do you think?
>>>>
>>>> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
>>>> index d80f94346199..1016268c7b51 100644
>>>> --- a/fs/userfaultfd.c
>>>> +++ b/fs/userfaultfd.c
>>>> @@ -334,15 +334,29 @@ static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx,
>>>>         return ret;
>>>>  }
>>>>
>>>> -static inline unsigned int userfaultfd_get_blocking_state(unsigned int flags)
>>>> +struct userfault_wait {
>>>> +       unsigned int task_state;
>>>> +       bool timeout;
>>>> +};
>>>> +
>>>> +static struct userfault_wait userfaultfd_get_blocking_state(unsigned int flags)
>>>>  {
>>>> +       /*
>>>> +        * If the fault has already been tried AND there's a signal pending
>>>> +        * for this task, use TASK_UNINTERRUPTIBLE with a small timeout.
>>>> +        * This prevents busy looping where schedule() otherwise does nothing
>>>> +        * for TASK_INTERRUPTIBLE when the task has a signal pending.
>>>> +        */
>>>> +       if ((flags & FAULT_FLAG_TRIED) && signal_pending(current))
>>>> +               return (struct userfault_wait) { TASK_UNINTERRUPTIBLE, true };
>>>> +
>>>>         if (flags & FAULT_FLAG_INTERRUPTIBLE)
>>>> -               return TASK_INTERRUPTIBLE;
>>>> +               return (struct userfault_wait) { TASK_INTERRUPTIBLE, false };
>>>>
>>>>         if (flags & FAULT_FLAG_KILLABLE)
>>>> -               return TASK_KILLABLE;
>>>> +               return (struct userfault_wait) { TASK_KILLABLE, false };
>>>>
>>>> -       return TASK_UNINTERRUPTIBLE;
>>>> +       return (struct userfault_wait) { TASK_UNINTERRUPTIBLE, false };
>>>>  }
>>>>
>>>>  /*
>>>> @@ -368,7 +382,7 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
>>>>         struct userfaultfd_wait_queue uwq;
>>>>         vm_fault_t ret = VM_FAULT_SIGBUS;
>>>>         bool must_wait;
>>>> -       unsigned int blocking_state;
>>>> +       struct userfault_wait wait_mode;
>>>>
>>>>         /*
>>>>          * We don't do userfault handling for the final child pid update
>>>> @@ -466,7 +480,7 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
>>>>         uwq.ctx = ctx;
>>>>         uwq.waken = false;
>>>>
>>>> -       blocking_state = userfaultfd_get_blocking_state(vmf->flags);
>>>> +       wait_mode = userfaultfd_get_blocking_state(vmf->flags);
>>>>
>>>>          /*
>>>>           * Take the vma lock now, in order to safely call
>>>> @@ -488,7 +502,7 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
>>>>          * following the spin_unlock to happen before the list_add in
>>>>          * __add_wait_queue.
>>>>          */
>>>> -       set_current_state(blocking_state);
>>>> +       set_current_state(wait_mode.task_state);
>>>>         spin_unlock_irq(&ctx->fault_pending_wqh.lock);
>>>>
>>>>         if (!is_vm_hugetlb_page(vma))
>>>> @@ -501,7 +515,11 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
>>>>
>>>>         if (likely(must_wait && !READ_ONCE(ctx->released))) {
>>>>                 wake_up_poll(&ctx->fd_wqh, EPOLLIN);
>>>> -               schedule();
>>>> +               /* See comment in userfaultfd_get_blocking_state() */
>>>> +               if (!wait_mode.timeout)
>>>> +                       schedule();
>>>> +               else
>>>> +                       schedule_timeout(HZ / 10);
>>>>         }
>>>>
>>>>         __set_current_state(TASK_RUNNING);
>>>>
>>>> --
>>>> Jens Axboe
>>> I guess the previous io_work_fault patch might have already addressed
>>> the issue sufficiently. The later patch that adds a timeout for
>>> userfaultfd might
>>
>> That one isn't guaranteed to be safe, as it's not necessarily a safe
>> context to prune the conditions that lead to a busy loop rather than the
>> normal "schedule until the condition is resolved". Running task_work
>> should only be done at the outermost point in the kernel, where the task
>> state is known sane in terms of what locks etc are being held. For some
>> conditions the patch will work just fine, but it's not guaranteed to be
>> the case.
>>
>>> not be necessary  wouldn?t returning after a timeout just cause the
>>> same fault to repeat indefinitely again? Regardless of whether the
>>> thread is in UN or IN state, the expected behavior should be to wait
>>> until the page is filled or the uffd resource is released to be woken
>>> up, which seems like the correct logic.
>>
>> Right, it'll just sleep timeout for a bit as not to be a 100% busy loop.
>> That's unfortunately the best we can do for this case... The expected
>> behavior is indeed to schedule until we get woken, however that just
>> doesn't work if there are signals pending, or other conditions that lead
>> to TASK_INTERRUPTIBLE + schedule() being a no-op.
>>
>> --
>> Jens Axboe
> In my testing, clearing the NOTIFY flag in the original io_work_fault
> ensures that the next schedule correctly waits. However, adding a

That's symptom fixing again - the NOTIFY flag is the thing that triggers
for io_uring, but any legitimate signal (or task_work added with
signaling) will cause the same issue.

> timeout causes the issue to return to multiple faults again.
> Also, after clearing the NOTIFY flag in handle_userfault,
> it?s possible that some task work hasn?t been executed.
> But if task_work_run isn?t added back, tasks might get lost?
> It seems like there isn?t an appropriate place to add it back.
> So, do you suggest adjusting the fault frequency in userfaultfd
> to make it more rhythmic to alleviate the issue?

The task_work is still there, you just removed the notification
mechanism that tells the kernel that there's task_work there. For this
particular case, you could re-set TIF_NOTIFY_SIGNAL at the end after
schedule(), but again it'd only fix that specific one case, not the
generic issue.

What's the objection to the sleep approach? If the task is woken by the
fault being filled, it'll still wake on time, no delay. If not, then it
prevents a busy loop, which is counterproductive.

-- 
Jens Axboe