From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 1B5E7C369C2 for ; Thu, 24 Apr 2025 18:26:47 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 2D9E46B0096; Thu, 24 Apr 2025 14:26:46 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 287406B00AF; Thu, 24 Apr 2025 14:26:46 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 0DA056B00D4; Thu, 24 Apr 2025 14:26:46 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0011.hostedemail.com [216.40.44.11]) by kanga.kvack.org (Postfix) with ESMTP id DFA276B00AF for ; Thu, 24 Apr 2025 14:26:45 -0400 (EDT) Received: from smtpin24.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay03.hostedemail.com (Postfix) with ESMTP id 891D0BA40E for ; Thu, 24 Apr 2025 18:26:46 +0000 (UTC) X-FDA: 83369768412.24.4D39939 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) by imf28.hostedemail.com (Postfix) with ESMTP id 31BBEC0005 for ; Thu, 24 Apr 2025 18:26:44 +0000 (UTC) Authentication-Results: imf28.hostedemail.com; dkim=pass header.d=redhat.com header.s=mimecast20190719 header.b=fo56xeXN; dmarc=pass (policy=quarantine) header.from=redhat.com; spf=pass (imf28.hostedemail.com: domain of peterx@redhat.com designates 170.10.133.124 as permitted sender) smtp.mailfrom=peterx@redhat.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1745519204; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=zf9fN4uPxYqilQqe6FukB7BMsnH9qJym/Yt+HGTXGJ8=; b=cwfKRWW/w8HAr/Ap8xaRj9ycJTj+Wl8NkI7FC2cBYFl2TQOQhMFIAY3zbUrvFEj22Mb4hY 8Q27eq1aXCDoSBxZjAcB0uPpJiInREqMLhq8UUcp0ibNVGsAE0NsGANODuVoqQaHGxeAkj 0sn2BrJwKZID0SJ4B6/Pl3lOUjIM4gQ= ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1745519204; a=rsa-sha256; cv=none; b=qNEqxUVwY+AX1QqmCeErxZUUD8j0AaiA36M4Tu8d+LSbxp15kz/Lax605A4KtAP5LOFK4E BeSPN7voB/PRBZkKH5+sFTRN5Ua0I5RYT7eIXUfY7fOYaqMOkxYEcl+/5+NQYJKKJy6OH/ 4C38R47qu/YzxVjrbJpRdGtVM0ln3Nk= ARC-Authentication-Results: i=1; imf28.hostedemail.com; dkim=pass header.d=redhat.com header.s=mimecast20190719 header.b=fo56xeXN; dmarc=pass (policy=quarantine) header.from=redhat.com; spf=pass (imf28.hostedemail.com: domain of peterx@redhat.com designates 170.10.133.124 as permitted sender) smtp.mailfrom=peterx@redhat.com DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1745519203; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=zf9fN4uPxYqilQqe6FukB7BMsnH9qJym/Yt+HGTXGJ8=; b=fo56xeXNaZxy7Ibn52daX6N7bQsZnCKw6W2wIqH3zoL9mKtlComgT5mrGk/0Tmfj9cRStm iym5kxB98JhyHfba9AjofNZjk7TJW6BHKvSsM1XTLja0GAwbdyWVqT7zPvd0KKh7qeflvD xbZGwVIP8yGMcjZdp8PDDb9ld2Zpmlg= Received: from mail-qk1-f198.google.com (mail-qk1-f198.google.com [209.85.222.198]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-588-6SnSH3dOPbycT94HhQnoPQ-1; Thu, 24 Apr 2025 14:26:42 -0400 X-MC-Unique: 6SnSH3dOPbycT94HhQnoPQ-1 X-Mimecast-MFC-AGG-ID: 6SnSH3dOPbycT94HhQnoPQ_1745519202 Received: by mail-qk1-f198.google.com with SMTP id af79cd13be357-7c791987cf6so276826785a.0 for ; Thu, 24 Apr 2025 11:26:42 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1745519202; x=1746124002; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:from:date:x-gm-message-state:from:to:cc:subject:date :message-id:reply-to; bh=zf9fN4uPxYqilQqe6FukB7BMsnH9qJym/Yt+HGTXGJ8=; b=B07MrJlvzpLQG50yTNog5gvtF59oy6XXQFD2wSgiKhIi3Hnd6ghnh5YEOW2sGHcaQz j11+4UJI1WlRAQp1q7JtyBLSlvHsJLHZg2+zvt2plYXn+aJ5I0aHQpeegABZ2LOlG+Hc GkEIMZ+Yt/lGjknKHDutGxvFxPmGkZAvI9eFLGCVqQ+FwyyZPwS/HkKOHm1Fx9mUka7/ rF2A6rA8TH7x/Obsa5Hx32c4whNgGSSrqIHZyDUjKfq6sC4YNldvpPeRwVYtv2nxIjK7 QQFiWULRPtgmwfvHTs/6cVSmc41hD19wvNqoTgDuzvt3OH8UY5Xe7mGo3yXLSaZekLQX SwRg== X-Forwarded-Encrypted: i=1; AJvYcCV/GEnjIoj8FvcRdetr5toECrc1fX6VNP1aakIkMzgR828xqb60zGoMp8TB70cWf41pCvjqSDMGrQ==@kvack.org X-Gm-Message-State: AOJu0YwBSAD+PFYanIZK9t5eyX9/XAIrTBBGEsMDs7PNh1UVRG7uny6r W8+hQPIflGECh5t1G16GyVBm/w3yac4s7Zwyi1zhyjFBQP3VlydoNFLi54XPWreikgUxsnthscg 9bVPS3dYiHO3dNohvTeB9i7Cc60hdqh1wc8rmUW6aG2cvB+7O X-Gm-Gg: ASbGnct52fvC9nGkozPCQliSLfSpxPwdax1eMhA0ObaF5i6QiMT8cWJAKGxdZzoMPYx M58JKtbmQNq2pL8gYomK/BWK6AWSyhhR7Dlt4F0rY3HxiZZRZse7m5jlKvlTFi2CNbF9m4NnMft RtaHeayBd799yFPHlxBfIJ7VxqMSh/JyQI5KKHnwkGbgMGVAhsAziuNCXD8AUfdB75E9arP7Sku eju3K9Cstw0D1J5t753RLEVPoeLO15eBtKbTNB73VeJ+eP3JX11NZ1ko8/X1QwS87p3HXwS6cZT 2ao= X-Received: by 2002:a05:620a:1b99:b0:7c5:43b4:fa97 with SMTP id af79cd13be357-7c95ef62ed8mr65536085a.53.1745519201605; Thu, 24 Apr 2025 11:26:41 -0700 (PDT) X-Google-Smtp-Source: AGHT+IHWadddF4tTVGtsR9D4nXlp5wgIoy+YSZhoYxEZ8+RIfT9JL70XbLnOSO0r756wTUMaiFpfxA== X-Received: by 2002:a05:620a:1b99:b0:7c5:43b4:fa97 with SMTP id af79cd13be357-7c95ef62ed8mr65533285a.53.1745519201263; Thu, 24 Apr 2025 11:26:41 -0700 (PDT) Received: from x1.local ([85.131.185.92]) by smtp.gmail.com with ESMTPSA id af79cd13be357-7c958cdc62bsm117826985a.55.2025.04.24.11.26.40 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 24 Apr 2025 11:26:40 -0700 (PDT) Date: Thu, 24 Apr 2025 14:26:37 -0400 From: Peter Xu To: Johannes Weiner Cc: Jens Axboe , Andrew Morton , "linux-fsdevel@vger.kernel.org" , Linux-MM Subject: Re: [PATCH] mm/userfaultfd: prevent busy looping for tasks with signals pending Message-ID: References: <27c3a7f5-aad8-4f2a-a66e-ff5ae98f31eb@kernel.dk> <20250424140344.GA840@cmpxchg.org> MIME-Version: 1.0 In-Reply-To: <20250424140344.GA840@cmpxchg.org> X-Mimecast-Spam-Score: 0 X-Mimecast-MFC-PROC-ID: fjd3F2Kh5M4xufsHiErVqwyCqvKHAHWpchFMij4BTjU_1745519202 X-Mimecast-Originator: redhat.com Content-Type: text/plain; charset=utf-8 Content-Disposition: inline X-Rspamd-Server: rspam12 X-Rspamd-Queue-Id: 31BBEC0005 X-Rspam-User: X-Stat-Signature: xpuqnczzrwfsqz8j1xfcpsqxjqxpim9p X-HE-Tag: 1745519204-44117 X-HE-Meta: U2FsdGVkX1+99LGGy+afpFK1eN4mI1beR5zYnnNVwtMUHoasOEL2rBXjkTO8kd0P3I1klZoG3ed+l0RGl584MpCWBSAU3k0MY7GQcd7gjlLwxcoTkGIXiJsSinWkkb9eLajiVsx0hMBu6m5rFtZnqM+hGXCBztGE5S87iiCJItZHnOfuIpd8AptLEFzAO1AmP7QoC6jwqJbmNG5SafsP+S0sF0FFi37dtQTEHz+hChykyOL3OWZytKntYjodGGYPnpAkK7JYFtMln7MXgmWmw25B3mv53pXXRyzlflFoLX6HOgncYT39CH8F7DT0nbsH8dXn6VOL89Zec7TKhpd9ZjRtEL1uvKUmv0h9OW6gqDa5teuWZZZHAEwuHsHtlRQa04pVVoT18vJBVrqlCnrikj32uuGT0N4ooFsj+MBE5Tavo5p/SNf0ewxy8BsLplHa4tZ0IDd4ZZEa7SOhP+udSipMuf5jqvGGdUToairaHpKYhTgi3prGw9Z0B2poead5Zq2YU8SydmhgRL16kpp9QdsGI3jFG6ZUQU70MYV9YzjTI3YgJSPv2xEPntYKmBTDSsfGpAF0btQU+nvH40xJvz7TO+pAlSGkBLBEWFgU85biogYhKw8jCBQAtfEBcE1aH2SsRk2weG91cVxSdd1JyYOYUyC8ot6frL0YJFuraegfUeip2jUiDj/Y1uN24Sis+DDSvxzyz/5BBICSmCRsV/cT1qaV8io9PQg9PW8JKnDCyNcqh7o6KzHjVIc+VPqvk14W7COyhIdNZZjaue34R7u04uLnLKvl6p+/pE02XLpeWhUBtA+3ZUgek3gIE8nxNlf/wFYJpW5xpAasOThRYEy4W105By6whemeSLNSynuHvsDnaVmBgcpPVmDBhuBEpLv4/mzYyGLVUDW94dtDsl+SjEFJU7CDRprtrIuR/MhDEVJR26r5BaPDvuY0ddVXfeFG9E17MjDz27LCE3k HadLaYnY asTUfkeTMT2/WJBQW9G8hR+7FKCUMGj5k8RUvIxFIV9q+0CzDigwDh+uceGlMpfEaHjUtkW8rO+vzeyi5+HM+bQZjxJuqbhbyil2KYt9IzC3pPifX6yCKzJmVpHSNGIVLcqlGPBBFA9NF6zKLQWMchrEuc1BPbn2hHov7TuTp7HwpYln/lZbBBs0WmPjvVzePeldgxUsDiPhNuKIm+mRe78OswSJtL+8cGQrgmX5YrLgFLdd4IvGCgvoz0hCyXSoeJ4NzdeM5LvLjlImvzxlkH9r6Lw8y3qIf+Ban5aSRKZqEbyAM5aHlk6PM5zO+8bV/BFwT4IiXluPDx8ALYxnybNddrtv0+giRj9pWUpn/98iUm0DbnREBQsfvBL7UzwI/R7oLvigdn2D324zpMcMqJ5lxPaPCUoRKZ4QDB4+N7NJgNqo= X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Thu, Apr 24, 2025 at 10:03:44AM -0400, Johannes Weiner wrote: > On Wed, Apr 23, 2025 at 05:37:06PM -0600, Jens Axboe wrote: > > userfaultfd may use interruptible sleeps to wait on userspace filling > > a page fault, which works fine if the task can be reliably put to > > sleeping waiting for that. However, if the task has a normal (ie > > non-fatal) signal pending, then TASK_INTERRUPTIBLE sleep will simply > > cause schedule() to be a no-op. > > > > For a task that registers a page with userfaultfd and then proceeds > > to do a write from it, if that task also has a signal pending then > > it'll essentially busy loop from do_page_fault() -> handle_userfault() > > until that fault has been filled. Normally it'd be expected that the > > task would sleep until that happens. Here's a trace from an application > > doing just that: > > > > handle_userfault+0x4b8/0xa00 (P) > > hugetlb_fault+0xe24/0x1060 > > handle_mm_fault+0x2bc/0x318 > > do_page_fault+0x1e8/0x6f0 > > Makes sense. There is a fault_signal_pending() check before retrying: > > static inline bool fault_signal_pending(vm_fault_t fault_flags, > struct pt_regs *regs) > { > return unlikely((fault_flags & VM_FAULT_RETRY) && > (fatal_signal_pending(current) || > (user_mode(regs) && signal_pending(current)))); > } > > Since it's an in-kernel fault, and the signal is non-fatal, it won't > stop looping until the fault is handled. > > This in itself seems a bit sketchy. You have to hope there is no > dependency between handling the signal -> handling the fault inside > the userspace components. True. So far, my understanding is e.g. in an userfaultfd context the signal handler is responsible for not touching any possible trapped pages, or the sighandler needs fixing on its own. > > > do_translation_fault+0x9c/0xd0 > > do_mem_abort+0x44/0xa0 > > el1_abort+0x3c/0x68 > > el1h_64_sync_handler+0xd4/0x100 > > el1h_64_sync+0x6c/0x70 > > fault_in_readable+0x74/0x108 (P) > > iomap_file_buffered_write+0x14c/0x438 > > blkdev_write_iter+0x1a8/0x340 > > vfs_write+0x20c/0x348 > > ksys_write+0x64/0x108 > > __arm64_sys_write+0x1c/0x38 > > > > where the task is looping with 100% CPU time in the above mentioned > > fault path. > > > > Since it's impossible to handle signals, or other conditions like > > TIF_NOTIFY_SIGNAL that also prevents interruptible sleeping, from the > > fault path, use TASK_UNINTERRUPTIBLE with a short timeout even for vmf > > modes that would normally ask for INTERRUPTIBLE or KILLABLE sleep. Fatal > > signals will still be handled by the caller, and the timeout is short > > enough to hopefully not cause any issues. If this is the first invocation > > of this fault, eg FAULT_FLAG_TRIED isn't set, then the normal sleep mode > > is used. > > > > Cc: stable@vger.kernel.org > > Fixes: 86039bd3b4e6 ("userfaultfd: add new syscall to provide memory externalization") > > When this patch was first introduced, VM_FAULT_RETRY would work only > once. The second try would have FAULT_FLAG_ALLOW_RETRY cleared, > causing handle_userfault() to return VM_SIGBUS, which would bubble > through the fixup table (kernel fault), -EFAULT from > iomap_file_buffered_write() and unwind the kernel stack this way. AFAIU we can't rely on the exception fixups because when reaching there it means the user access is going to get a -EFAULT, but here the right behavior is we keep waiting, aka, UNINTERRUPTIBLE wait until it's done. > > So I'm thinking this is the more likely commit for Fixes: and stable: > > commit 4064b982706375025628094e51d11cf1a958a5d3 > Author: Peter Xu > Date: Wed Apr 1 21:08:45 2020 -0700 > > mm: allow VM_FAULT_RETRY for multiple times IMHO the multiple attempts are still fine, instead it's problematic if we wait in INTERRUPTIBLE mode even in !user mode.. so maybe it's slightly more suitable to use this as Fixes: commit c270a7eedcf278304e05ebd2c96807487c97db61 Author: Peter Xu Date: Wed Apr 1 21:08:41 2020 -0700 mm: introduce FAULT_FLAG_INTERRUPTIBLE The important change there is: diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c index 888272621f38..c076d3295958 100644 --- a/fs/userfaultfd.c +++ b/fs/userfaultfd.c @@ -462,9 +462,7 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason) uwq.ctx = ctx; uwq.waken = false; - return_to_userland = - (vmf->flags & (FAULT_FLAG_USER|FAULT_FLAG_KILLABLE)) == - (FAULT_FLAG_USER|FAULT_FLAG_KILLABLE); + return_to_userland = vmf->flags & FAULT_FLAG_INTERRUPTIBLE; blocking_state = return_to_userland ? TASK_INTERRUPTIBLE : TASK_KILLABLE; I think we still need to avoid checking FAULT_FLAG_USER, because e.g. in some other use cases like GUP we'd still want the threads (KVM does GUP and it's a heavy user of userfaultfd) to respond to non-fatals. However maybe we shouldn't really set INTERRUPTIBLE at all if it's non-GUP and if it's non-user either. So in general, some trivial concerns here on the patch.. Firstly, waiting UNINTERRUPTIBLE (even if with a small timeout) if FAULT_FLAG_INTERRUPTIBLE is set is a slight ABI violation to me - after all, FAULT_FLAG_INTERRUPTIBLE says "please respond to non-fatal signals too!". Secondly, userfaultfd is indeed the only consumer of FAULT_FLAG_INTERRUPTIBLE but not necessary always in the future. While this patch resolves it for userfaultfd, it might get caught again later if something else in the kernel starts to respects the _INTERRUPTIBLE flag request. For example, __folio_lock_or_retry() ignores that flag so far, but logically it should obey too (with a folio_wait_locked_interruptible).. I also think it's not as elegant to have the magic HZ/10, and it's also destined even the loop is less frequent that's a waste of time (as if the user page access comes from kernel context, we must wait... until the page fault is resolved..). Is it possible we simply unset the request from the top? As discussed above, I think we still need to make sure GUP at least works for non-fatals, however I think it might be more reasonable we never set _INTERRUPTIBLE for !gup, then this problem might go away too with all above concerns addressed. A not-even-compiled patch just to clarify what I meant (and it won't work unless it makes sense to both of you and we'll need to touch all archs when changing the default flags): ===8<=== diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index 296d294142c8..fa721525d93a 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -1300,9 +1300,14 @@ void do_user_addr_fault(struct pt_regs *regs, * We set FAULT_FLAG_USER based on the register state, not * based on X86_PF_USER. User space accesses that cause * system page faults are still user accesses. + * + * When we're in user mode, allow fast response on non-fatal + * signals. Do not set this in kernel mode faults because normally + * a kernel fault means the fault must be resolved anyway before + * going back to userspace. */ if (user_mode(regs)) - flags |= FAULT_FLAG_USER; + flags |= FAULT_FLAG_USER | FAULT_FLAG_INTERRUPTIBLE; #ifdef CONFIG_X86_64 /* diff --git a/include/linux/mm.h b/include/linux/mm.h index 9b701cfbef22..a80f3f609b37 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -487,8 +487,7 @@ extern unsigned int kobjsize(const void *objp); * arch-specific page fault handlers. */ #define FAULT_FLAG_DEFAULT (FAULT_FLAG_ALLOW_RETRY | \ - FAULT_FLAG_KILLABLE | \ - FAULT_FLAG_INTERRUPTIBLE) + FAULT_FLAG_KILLABLE) ===8<=== That also kind of matches with what we do with fault_signal_pending(). Would it make sense? Thanks, -- Peter Xu