From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Sat, 1 Feb 2025 15:38:20 +0100
From: Christian Brauner
To: Peter Xu
Cc: Linus Torvalds, Alex Williamson, Josef Bacik, kernel-team@fb.com,
	linux-fsdevel@vger.kernel.org, jack@suse.cz, amir73il@gmail.com,
	viro@zeniv.linux.org.uk, linux-xfs@vger.kernel.org,
	linux-btrfs@vger.kernel.org, linux-mm@kvack.org,
	linux-ext4@vger.kernel.org, "linux-kernel@vger.kernel.org",
	"kvm@vger.kernel.org"
Subject: Re: [REGRESSION] Re: [PATCH v8 15/19] mm: don't allow huge faults for files with pre content watches
Message-ID: <20250201-legehennen-klopfen-2ab140dc0422@brauner>
References: <9035b82cff08a3801cef3d06bbf2778b2e5a4dba.1731684329.git.josef@toxicpanda.com> <20250131121703.1e4d00a7.alex.williamson@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline

On Fri, Jan 31, 2025 at 08:19:22PM -0500, Peter Xu wrote:
> On Fri, Jan 31, 2025 at 11:59:56AM -0800, Linus Torvalds wrote:
> > On Fri, 31 Jan 2025 at 11:17, Alex Williamson wrote:
> > >
> > > 20bf82a898b6 ("mm: don't allow huge faults for files with pre content watches")
> > >
> > > This breaks huge_fault support for PFNMAPs that was recently added in
> > > v6.12 and is used by vfio-pci to fault device memory using PMD and PUD
> > > order mappings.
> >
> > Surely only for content watches?
> >
> > Which shouldn't be a valid situation *anyway*.
> >
> > IOW, there must be some unrelated bug somewhere: either somebody is
> > allowed to set a pre-content match on a special device.
> >
> > That should be disabled by the whole
> >
> >         /*
> >          * If there are permission event watchers but no pre-content event
> >          * watchers, set FMODE_NONOTIFY | FMODE_NONOTIFY_PERM to indicate that.
> >          */
> >
> > thing in file_set_fsnotify_mode() which only allows regular files and
> > directories to be notified on.
> >
> > Or, alternatively, that check for huge-fault disabling is just
> > checking the wrong bits.
> >
> > Or - quite possibly - I am missing something obvious?
>
> Is it possible that we have some paths got overlooked in setting up the
> fsnotify bits in f_mode?  Meanwhile since the default is "no bit set" on
> those bits, I think it means FMODE_FSNOTIFY_HSM() can always return true on
> those if overlooked..
>
> One thing to mention is, /dev/vfio/* are chardevs, however the PCI bars are
> not mmap()ed from these fds - whatever under /dev/vfio/* represents IOMMU
> groups rather than the device fd itself.
>
> The app normally needs to first open the IOMMU group fd under /dev/vfio/*,
> then using VFIO ioctl(VFIO_GROUP_GET_DEVICE_FD) to get the device fd, which
> will be the mmap() target, instead of the ones under /dev.

Ok, but those "device fds" aren't really device fds in the sense of
being character device fds. They are regular files afaict from:

	vfio_device_open_file(struct vfio_device *device)

(Well, it's actually worse, as anon_inode_getfile() files don't have any
mode at all, but that's beside the point.)

In any case, I think you're right that such files would (accidentally?)
qualify for pre-content watches afaict. So at least those should probably
get FMODE_NONOTIFY.

>
> I checked, those device fds were allocated from vfio_device_open_file()
> within the ioctl, which internally uses anon_inode_getfile().  I don't see
> anywhere in that path that will set the fanotify bits..
>
> Further, I'm not sure whether some callers of alloc_file() can also suffer

Sidenote: mm/memfd.c should pretty please rename its alloc_file() to
memfd_alloc_file() or something. That would be great because
alloc_file() is also a local fs/file_table.c helper and grepping for it
is confusing. I first thought someone had made alloc_file() available
outside of fs/file_table.c.

> from similar issue, because at least memfd_create() syscall also uses the
> API, which (hopefully?) would used to allow THPs for shmem backed memfds on
> aligned mmap()s, but not sure whether it'll also wrongly trigger the
> FALLBACK path similarly in create_huge_pmd() just like vfio's VMAs.  I
> didn't verify it though, nor did I yet check more users.
>
> So I wonder whether we should setup the fanotify bits in at least
> alloc_file() too (to FMODE_NONOTIFY?).
>
> I'm totally not familiar with fanotify, and it's a bit late to try verify
> anything (I cannot quickly find my previous huge pfnmap setup, so setup
> those will also take time..).  but maybe above can provide some clues for
> others..
>
> Thanks,
>
> --
> Peter Xu
>
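
To make the suspected failure mode concrete, here is a minimal userspace
model of it. This is a sketch, not kernel code: every MODEL_* and model_*
name is hypothetical, the bit values are purely illustrative, and the only
things taken from the thread are the ideas that "no bit set" is the default,
that such a default satisfies the FMODE_FSNOTIFY_HSM()-style check, and that
huge faults then take the fallback path.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit values only; the real kernel definitions may differ. */
#define MODEL_NONOTIFY       (1u << 0)
#define MODEL_NONOTIFY_PERM  (1u << 1)
#define MODEL_FSNOTIFY_MASK  (MODEL_NONOTIFY | MODEL_NONOTIFY_PERM)

/* Hypothetical stand-in for the f_mode field of a struct file. */
struct model_file {
	uint32_t f_mode;
};

/*
 * Models the check described in the thread: when neither notify bit is
 * set, the file is treated as one that may have pre-content watches.
 */
bool model_fsnotify_hsm(uint32_t f_mode)
{
	return (f_mode & MODEL_FSNOTIFY_MASK) == 0;
}

/* Hypothetical allocation path that never touches the fsnotify bits. */
struct model_file model_alloc_plain(void)
{
	struct model_file f = { .f_mode = 0 };
	return f;
}

/* Same path, but opting out of notifications as suggested in the thread. */
struct model_file model_alloc_nonotify(void)
{
	struct model_file f = { .f_mode = MODEL_NONOTIFY };
	return f;
}

/* Huge faults are refused whenever pre-content watches look possible. */
bool model_huge_fault_allowed(const struct model_file *f)
{
	return !model_fsnotify_hsm(f->f_mode);
}
```

In this model a file from model_alloc_plain() is refused huge faults even
though nothing ever watches it, while one from model_alloc_nonotify() is
not. That is the argument for setting FMODE_NONOTIFY on such files.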