From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <owner-linux-mm@kvack.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 1B5E7C369C2
	for <linux-mm@archiver.kernel.org>; Thu, 24 Apr 2025 18:26:47 +0000 (UTC)
Received: by kanga.kvack.org (Postfix)
	id 2D9E46B0096; Thu, 24 Apr 2025 14:26:46 -0400 (EDT)
Received: by kanga.kvack.org (Postfix, from userid 40)
	id 287406B00AF; Thu, 24 Apr 2025 14:26:46 -0400 (EDT)
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id 0DA056B00D4; Thu, 24 Apr 2025 14:26:46 -0400 (EDT)
X-Delivered-To: linux-mm@kvack.org
Received: from relay.hostedemail.com (smtprelay0011.hostedemail.com [216.40.44.11])
	by kanga.kvack.org (Postfix) with ESMTP id DFA276B00AF
	for <linux-mm@kvack.org>; Thu, 24 Apr 2025 14:26:45 -0400 (EDT)
Received: from smtpin24.hostedemail.com (a10.router.float.18 [10.200.18.1])
	by unirelay03.hostedemail.com (Postfix) with ESMTP id 891D0BA40E
	for <linux-mm@kvack.org>; Thu, 24 Apr 2025 18:26:46 +0000 (UTC)
X-FDA: 83369768412.24.4D39939
Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124])
	by imf28.hostedemail.com (Postfix) with ESMTP id 31BBEC0005
	for <linux-mm@kvack.org>; Thu, 24 Apr 2025 18:26:44 +0000 (UTC)
Authentication-Results: imf28.hostedemail.com;
	dkim=pass header.d=redhat.com header.s=mimecast20190719 header.b=fo56xeXN;
	dmarc=pass (policy=quarantine) header.from=redhat.com;
	spf=pass (imf28.hostedemail.com: domain of peterx@redhat.com designates 170.10.133.124 as permitted sender) smtp.mailfrom=peterx@redhat.com
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com;
	s=arc-20220608; t=1745519204;
	h=from:from:sender:reply-to:subject:subject:date:date:
	 message-id:message-id:to:to:cc:cc:mime-version:mime-version:
	 content-type:content-type:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references:dkim-signature;
	bh=zf9fN4uPxYqilQqe6FukB7BMsnH9qJym/Yt+HGTXGJ8=;
	b=cwfKRWW/w8HAr/Ap8xaRj9ycJTj+Wl8NkI7FC2cBYFl2TQOQhMFIAY3zbUrvFEj22Mb4hY
	8Q27eq1aXCDoSBxZjAcB0uPpJiInREqMLhq8UUcp0ibNVGsAE0NsGANODuVoqQaHGxeAkj
	0sn2BrJwKZID0SJ4B6/Pl3lOUjIM4gQ=
ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1745519204; a=rsa-sha256;
	cv=none;
	b=qNEqxUVwY+AX1QqmCeErxZUUD8j0AaiA36M4Tu8d+LSbxp15kz/Lax605A4KtAP5LOFK4E
	BeSPN7voB/PRBZkKH5+sFTRN5Ua0I5RYT7eIXUfY7fOYaqMOkxYEcl+/5+NQYJKKJy6OH/
	4C38R47qu/YzxVjrbJpRdGtVM0ln3Nk=
ARC-Authentication-Results: i=1;
	imf28.hostedemail.com;
	dkim=pass header.d=redhat.com header.s=mimecast20190719 header.b=fo56xeXN;
	dmarc=pass (policy=quarantine) header.from=redhat.com;
	spf=pass (imf28.hostedemail.com: domain of peterx@redhat.com designates 170.10.133.124 as permitted sender) smtp.mailfrom=peterx@redhat.com
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com;
	s=mimecast20190719; t=1745519203;
	h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
	 to:to:cc:cc:mime-version:mime-version:content-type:content-type:
	 in-reply-to:in-reply-to:references:references;
	bh=zf9fN4uPxYqilQqe6FukB7BMsnH9qJym/Yt+HGTXGJ8=;
	b=fo56xeXNaZxy7Ibn52daX6N7bQsZnCKw6W2wIqH3zoL9mKtlComgT5mrGk/0Tmfj9cRStm
	iym5kxB98JhyHfba9AjofNZjk7TJW6BHKvSsM1XTLja0GAwbdyWVqT7zPvd0KKh7qeflvD
	xbZGwVIP8yGMcjZdp8PDDb9ld2Zpmlg=
Received: from mail-qk1-f198.google.com (mail-qk1-f198.google.com
 [209.85.222.198]) by relay.mimecast.com with ESMTP with STARTTLS
 (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id
 us-mta-588-6SnSH3dOPbycT94HhQnoPQ-1; Thu, 24 Apr 2025 14:26:42 -0400
X-MC-Unique: 6SnSH3dOPbycT94HhQnoPQ-1
X-Mimecast-MFC-AGG-ID: 6SnSH3dOPbycT94HhQnoPQ_1745519202
Received: by mail-qk1-f198.google.com with SMTP id af79cd13be357-7c791987cf6so276826785a.0
        for <linux-mm@kvack.org>; Thu, 24 Apr 2025 11:26:42 -0700 (PDT)
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20230601; t=1745519202; x=1746124002;
        h=in-reply-to:content-disposition:mime-version:references:message-id
         :subject:cc:to:from:date:x-gm-message-state:from:to:cc:subject:date
         :message-id:reply-to;
        bh=zf9fN4uPxYqilQqe6FukB7BMsnH9qJym/Yt+HGTXGJ8=;
        b=B07MrJlvzpLQG50yTNog5gvtF59oy6XXQFD2wSgiKhIi3Hnd6ghnh5YEOW2sGHcaQz
         j11+4UJI1WlRAQp1q7JtyBLSlvHsJLHZg2+zvt2plYXn+aJ5I0aHQpeegABZ2LOlG+Hc
         GkEIMZ+Yt/lGjknKHDutGxvFxPmGkZAvI9eFLGCVqQ+FwyyZPwS/HkKOHm1Fx9mUka7/
         rF2A6rA8TH7x/Obsa5Hx32c4whNgGSSrqIHZyDUjKfq6sC4YNldvpPeRwVYtv2nxIjK7
         QQFiWULRPtgmwfvHTs/6cVSmc41hD19wvNqoTgDuzvt3OH8UY5Xe7mGo3yXLSaZekLQX
         SwRg==
X-Forwarded-Encrypted: i=1; AJvYcCV/GEnjIoj8FvcRdetr5toECrc1fX6VNP1aakIkMzgR828xqb60zGoMp8TB70cWf41pCvjqSDMGrQ==@kvack.org
X-Gm-Message-State: AOJu0YwBSAD+PFYanIZK9t5eyX9/XAIrTBBGEsMDs7PNh1UVRG7uny6r
	W8+hQPIflGECh5t1G16GyVBm/w3yac4s7Zwyi1zhyjFBQP3VlydoNFLi54XPWreikgUxsnthscg
	9bVPS3dYiHO3dNohvTeB9i7Cc60hdqh1wc8rmUW6aG2cvB+7O
X-Gm-Gg: ASbGnct52fvC9nGkozPCQliSLfSpxPwdax1eMhA0ObaF5i6QiMT8cWJAKGxdZzoMPYx
	M58JKtbmQNq2pL8gYomK/BWK6AWSyhhR7Dlt4F0rY3HxiZZRZse7m5jlKvlTFi2CNbF9m4NnMft
	RtaHeayBd799yFPHlxBfIJ7VxqMSh/JyQI5KKHnwkGbgMGVAhsAziuNCXD8AUfdB75E9arP7Sku
	eju3K9Cstw0D1J5t753RLEVPoeLO15eBtKbTNB73VeJ+eP3JX11NZ1ko8/X1QwS87p3HXwS6cZT
	2ao=
X-Received: by 2002:a05:620a:1b99:b0:7c5:43b4:fa97 with SMTP id af79cd13be357-7c95ef62ed8mr65536085a.53.1745519201605;
        Thu, 24 Apr 2025 11:26:41 -0700 (PDT)
X-Google-Smtp-Source: AGHT+IHWadddF4tTVGtsR9D4nXlp5wgIoy+YSZhoYxEZ8+RIfT9JL70XbLnOSO0r756wTUMaiFpfxA==
X-Received: by 2002:a05:620a:1b99:b0:7c5:43b4:fa97 with SMTP id af79cd13be357-7c95ef62ed8mr65533285a.53.1745519201263;
        Thu, 24 Apr 2025 11:26:41 -0700 (PDT)
Received: from x1.local ([85.131.185.92])
        by smtp.gmail.com with ESMTPSA id af79cd13be357-7c958cdc62bsm117826985a.55.2025.04.24.11.26.40
        (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
        Thu, 24 Apr 2025 11:26:40 -0700 (PDT)
Date: Thu, 24 Apr 2025 14:26:37 -0400
From: Peter Xu <peterx@redhat.com>
To: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jens Axboe <axboe@kernel.dk>, Andrew Morton <akpm@linux-foundation.org>,
	"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
	Linux-MM <linux-mm@kvack.org>
Subject: Re: [PATCH] mm/userfaultfd: prevent busy looping for tasks with
 signals pending
Message-ID: <aAqCXfPirHqWMlb4@x1.local>
References: <27c3a7f5-aad8-4f2a-a66e-ff5ae98f31eb@kernel.dk>
 <20250424140344.GA840@cmpxchg.org>
MIME-Version: 1.0
In-Reply-To: <20250424140344.GA840@cmpxchg.org>
X-Mimecast-Spam-Score: 0
X-Mimecast-MFC-PROC-ID: fjd3F2Kh5M4xufsHiErVqwyCqvKHAHWpchFMij4BTjU_1745519202
X-Mimecast-Originator: redhat.com
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
X-Rspamd-Server: rspam12
X-Rspamd-Queue-Id: 31BBEC0005
X-Rspam-User: 
X-Stat-Signature: xpuqnczzrwfsqz8j1xfcpsqxjqxpim9p
X-HE-Tag: 1745519204-44117
X-HE-Meta: U2FsdGVkX1+99LGGy+afpFK1eN4mI1beR5zYnnNVwtMUHoasOEL2rBXjkTO8kd0P3I1klZoG3ed+l0RGl584MpCWBSAU3k0MY7GQcd7gjlLwxcoTkGIXiJsSinWkkb9eLajiVsx0hMBu6m5rFtZnqM+hGXCBztGE5S87iiCJItZHnOfuIpd8AptLEFzAO1AmP7QoC6jwqJbmNG5SafsP+S0sF0FFi37dtQTEHz+hChykyOL3OWZytKntYjodGGYPnpAkK7JYFtMln7MXgmWmw25B3mv53pXXRyzlflFoLX6HOgncYT39CH8F7DT0nbsH8dXn6VOL89Zec7TKhpd9ZjRtEL1uvKUmv0h9OW6gqDa5teuWZZZHAEwuHsHtlRQa04pVVoT18vJBVrqlCnrikj32uuGT0N4ooFsj+MBE5Tavo5p/SNf0ewxy8BsLplHa4tZ0IDd4ZZEa7SOhP+udSipMuf5jqvGGdUToairaHpKYhTgi3prGw9Z0B2poead5Zq2YU8SydmhgRL16kpp9QdsGI3jFG6ZUQU70MYV9YzjTI3YgJSPv2xEPntYKmBTDSsfGpAF0btQU+nvH40xJvz7TO+pAlSGkBLBEWFgU85biogYhKw8jCBQAtfEBcE1aH2SsRk2weG91cVxSdd1JyYOYUyC8ot6frL0YJFuraegfUeip2jUiDj/Y1uN24Sis+DDSvxzyz/5BBICSmCRsV/cT1qaV8io9PQg9PW8JKnDCyNcqh7o6KzHjVIc+VPqvk14W7COyhIdNZZjaue34R7u04uLnLKvl6p+/pE02XLpeWhUBtA+3ZUgek3gIE8nxNlf/wFYJpW5xpAasOThRYEy4W105By6whemeSLNSynuHvsDnaVmBgcpPVmDBhuBEpLv4/mzYyGLVUDW94dtDsl+SjEFJU7CDRprtrIuR/MhDEVJR26r5BaPDvuY0ddVXfeFG9E17MjDz27LCE3k
 HadLaYnY
 asTUfkeTMT2/WJBQW9G8hR+7FKCUMGj5k8RUvIxFIV9q+0CzDigwDh+uceGlMpfEaHjUtkW8rO+vzeyi5+HM+bQZjxJuqbhbyil2KYt9IzC3pPifX6yCKzJmVpHSNGIVLcqlGPBBFA9NF6zKLQWMchrEuc1BPbn2hHov7TuTp7HwpYln/lZbBBs0WmPjvVzePeldgxUsDiPhNuKIm+mRe78OswSJtL+8cGQrgmX5YrLgFLdd4IvGCgvoz0hCyXSoeJ4NzdeM5LvLjlImvzxlkH9r6Lw8y3qIf+Ban5aSRKZqEbyAM5aHlk6PM5zO+8bV/BFwT4IiXluPDx8ALYxnybNddrtv0+giRj9pWUpn/98iUm0DbnREBQsfvBL7UzwI/R7oLvigdn2D324zpMcMqJ5lxPaPCUoRKZ4QDB4+N7NJgNqo=
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>
List-Subscribe: <mailto:majordomo@kvack.org>
List-Unsubscribe: <mailto:majordomo@kvack.org>

On Thu, Apr 24, 2025 at 10:03:44AM -0400, Johannes Weiner wrote:
> On Wed, Apr 23, 2025 at 05:37:06PM -0600, Jens Axboe wrote:
> > userfaultfd may use interruptible sleeps to wait on userspace filling
> > a page fault, which works fine if the task can be reliably put to
> > sleeping waiting for that. However, if the task has a normal (ie
> > non-fatal) signal pending, then TASK_INTERRUPTIBLE sleep will simply
> > cause schedule() to be a no-op.
> > 
> > For a task that registers a page with userfaultfd and then proceeds
> > to do a write from it, if that task also has a signal pending then
> > it'll essentially busy loop from do_page_fault() -> handle_userfault()
> > until that fault has been filled. Normally it'd be expected that the
> > task would sleep until that happens. Here's a trace from an application
> > doing just that:
> > 
> > handle_userfault+0x4b8/0xa00 (P)
> > hugetlb_fault+0xe24/0x1060
> > handle_mm_fault+0x2bc/0x318
> > do_page_fault+0x1e8/0x6f0
> 
> Makes sense. There is a fault_signal_pending() check before retrying:
> 
> static inline bool fault_signal_pending(vm_fault_t fault_flags,
>                                         struct pt_regs *regs)
> {
>         return unlikely((fault_flags & VM_FAULT_RETRY) &&
>                         (fatal_signal_pending(current) ||
>                          (user_mode(regs) && signal_pending(current))));
> }
> 
> Since it's an in-kernel fault, and the signal is non-fatal, it won't
> stop looping until the fault is handled.
> 
> This in itself seems a bit sketchy. You have to hope there is no
> dependency between handling the signal -> handling the fault inside
> the userspace components.

True. So far, my understanding is e.g. in an userfaultfd context the signal
handler is responsible for not touching any possible trapped pages, or the
sighandler needs fixing on its own.

> 
> > do_translation_fault+0x9c/0xd0
> > do_mem_abort+0x44/0xa0
> > el1_abort+0x3c/0x68
> > el1h_64_sync_handler+0xd4/0x100
> > el1h_64_sync+0x6c/0x70
> > fault_in_readable+0x74/0x108 (P)
> > iomap_file_buffered_write+0x14c/0x438
> > blkdev_write_iter+0x1a8/0x340
> > vfs_write+0x20c/0x348
> > ksys_write+0x64/0x108
> > __arm64_sys_write+0x1c/0x38
> >
> > where the task is looping with 100% CPU time in the above mentioned
> > fault path.
> > 
> > Since it's impossible to handle signals, or other conditions like
> > TIF_NOTIFY_SIGNAL that also prevents interruptible sleeping, from the
> > fault path, use TASK_UNINTERRUPTIBLE with a short timeout even for vmf
> > modes that would normally ask for INTERRUPTIBLE or KILLABLE sleep. Fatal
> > signals will still be handled by the caller, and the timeout is short
> > enough to hopefully not cause any issues. If this is the first invocation
> > of this fault, eg FAULT_FLAG_TRIED isn't set, then the normal sleep mode
> > is used.
> >
> > Cc: stable@vger.kernel.org
> > Fixes: 86039bd3b4e6 ("userfaultfd: add new syscall to provide memory externalization")
> 
> When this patch was first introduced, VM_FAULT_RETRY would work only
> once. The second try would have FAULT_FLAG_ALLOW_RETRY cleared,
> causing handle_userfault() to return VM_SIGBUS, which would bubble
> through the fixup table (kernel fault), -EFAULT from
> iomap_file_buffered_write() and unwind the kernel stack this way.

AFAIU we can't rely on the exception fixups because when reaching there it
means the user access is going to get a -EFAULT, but here the right
behavior is we keep waiting, aka, UNINTERRUPTIBLE wait until it's done.

> 
> So I'm thinking this is the more likely commit for Fixes: and stable:
> 
> commit 4064b982706375025628094e51d11cf1a958a5d3
> Author: Peter Xu <peterx@redhat.com>
> Date:   Wed Apr 1 21:08:45 2020 -0700
> 
>     mm: allow VM_FAULT_RETRY for multiple times

IMHO the multiple attempts are still fine, instead it's problematic if we
wait in INTERRUPTIBLE mode even in !user mode..  so maybe it's slightly
more suitable to use this as Fixes:

commit c270a7eedcf278304e05ebd2c96807487c97db61
Author: Peter Xu <peterx@redhat.com>
Date:   Wed Apr 1 21:08:41 2020 -0700

    mm: introduce FAULT_FLAG_INTERRUPTIBLE

The important change there is:

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 888272621f38..c076d3295958 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -462,9 +462,7 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
        uwq.ctx = ctx;
        uwq.waken = false;
 
-       return_to_userland =
-               (vmf->flags & (FAULT_FLAG_USER|FAULT_FLAG_KILLABLE)) ==
-               (FAULT_FLAG_USER|FAULT_FLAG_KILLABLE);
+       return_to_userland = vmf->flags & FAULT_FLAG_INTERRUPTIBLE;
        blocking_state = return_to_userland ? TASK_INTERRUPTIBLE :
                         TASK_KILLABLE;

I think we still need to avoid checking FAULT_FLAG_USER, because e.g. in
some other use cases like GUP we'd still want the threads (KVM does GUP and
it's a heavy user of userfaultfd) to respond to non-fatals.

However maybe we shouldn't really set INTERRUPTIBLE at all if it's non-GUP
and if it's non-user either.

So in general, some trivial concerns here on the patch..

Firstly, waiting UNINTERRUPTIBLE (even if with a small timeout) if
FAULT_FLAG_INTERRUPTIBLE is set is a slight ABI violation to me - after
all, FAULT_FLAG_INTERRUPTIBLE says "please respond to non-fatal signals
too!".

Secondly, userfaultfd is indeed the only consumer of
FAULT_FLAG_INTERRUPTIBLE but not necessary always in the future.  While
this patch resolves it for userfaultfd, it might get caught again later if
something else in the kernel starts to respects the _INTERRUPTIBLE flag
request.  For example, __folio_lock_or_retry() ignores that flag so far,
but logically it should obey too (with a folio_wait_locked_interruptible)..

I also think it's not as elegant to have the magic HZ/10, and it's also
destined even the loop is less frequent that's a waste of time (as if the
user page access comes from kernel context, we must wait... until the page
fault is resolved..).

Is it possible we simply unset the request from the top?  As discussed
above, I think we still need to make sure GUP at least works for
non-fatals, however I think it might be more reasonable we never set
_INTERRUPTIBLE for !gup, then this problem might go away too with all above
concerns addressed.

A not-even-compiled patch just to clarify what I meant (and it won't work
unless it makes sense to both of you and we'll need to touch all archs when
changing the default flags):

===8<===
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 296d294142c8..fa721525d93a 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1300,9 +1300,14 @@ void do_user_addr_fault(struct pt_regs *regs,
         * We set FAULT_FLAG_USER based on the register state, not
         * based on X86_PF_USER. User space accesses that cause
         * system page faults are still user accesses.
+        *
+        * When we're in user mode, allow fast response on non-fatal
+        * signals.  Do not set this in kernel mode faults because normally
+        * a kernel fault means the fault must be resolved anyway before
+        * going back to userspace.
         */
        if (user_mode(regs))
-               flags |= FAULT_FLAG_USER;
+               flags |= FAULT_FLAG_USER | FAULT_FLAG_INTERRUPTIBLE;
 
 #ifdef CONFIG_X86_64
        /*
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9b701cfbef22..a80f3f609b37 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -487,8 +487,7 @@ extern unsigned int kobjsize(const void *objp);
  * arch-specific page fault handlers.
  */
 #define FAULT_FLAG_DEFAULT  (FAULT_FLAG_ALLOW_RETRY | \
-                            FAULT_FLAG_KILLABLE | \
-                            FAULT_FLAG_INTERRUPTIBLE)
+                            FAULT_FLAG_KILLABLE)
===8<===

That also kind of matches with what we do with fault_signal_pending().
Would it make sense?

Thanks,

-- 
Peter Xu