Date: Fri, 3 Feb 2023 09:56:21 -0500
From: Peter Xu <peterx@redhat.com>
To: David Stevens
Cc: Yang Shi, David Hildenbrand, "Kirill A. Shutemov", linux-mm@kvack.org,
 Andrew Morton, linux-kernel@vger.kernel.org, Hugh Dickins
Subject: Re: [PATCH] mm/khugepaged: skip shmem with armed userfaultfd
References: <20230201034137.2463113-1-stevensd@google.com>

On Fri, Feb 03, 2023 at 03:09:55PM +0900, David Stevens wrote:
> > > > I don't know if it's necessary to go that far. Userfaultfd plus shmem
> > > > is inherently brittle. It's possible for userspace to bypass
> > > > userfaultfd on a shmem mapping by accessing the shmem through a
> > > > different mapping or simply by using the write syscall.
> >
> > Yes this is possible, but this is user-visible operation - no matter it was
> > a read()/write() from another process, or mmap()ed memory accesses.
> > Khugepaged merges ptes in a way that is out of control of users. That's
> > something the user can hardly control.
> >
> > AFAICT currently file-based uffd missing mode all works in that way. IOW
> > the user should have full control of the file/inode under the hood to make
> > sure there will be nothing surprising. Otherwise I don't really see how
> > the missing mode can work solidly since it's page cache based.
> >
> > > > It might be sufficient to say that the kernel won't directly bypass a
> > > > VMA's userfaultfd to collapse the underlying shmem's pages. Although on
> > > > the other hand, I guess it's not great for the presence of an unused
> > > > shmem mapping lying around to cause khugepaged to have user-visible
> > > > side effects.
> >
> > Maybe it works for your use case already, for example, if in your app the
> > shmem is only and always be mapped once? However that doesn't seem like a
> > complete solution to me.
>
> We're using userfaultfd for guest memory for a VM. We do have
> sandboxed device processes. However, thinking about it a bit more,
> this approach would probably cause issues with device hotplug.
>
> > There's nothing that will prevent another mapping being established, and
> > right after that happens it'll stop working, because khugepaged can notice
> > that new mm/vma which doesn't register with uffd at all, and thinks it a
> > good idea to collapse the shmem page cache again. Uffd will silently fail
> > in another case even if not immediately in your current app/reproducer.
> >
> > Again, I don't think what I propose above is anything close to good.. It'll
> > literally disable any collapsing possibility for a shmem node as long as
> > any small portion of the inode mapping address space got registered by any
> > process with uffd. I just don't see any easier approach so far.
>
> Maybe we can make things easier by being more precise about what bug
> we're trying to fix. Strictly speaking, I don't think what we're
> concerned about is whether or not userfaultfd is registered on a
> particular VMA at a particular point in time. I think what we're actually
> concerned about is that when userspace has a page with an armed
> userfaultfd that it knows is missing, that page should not be filled by
> khugepaged. If userspace doesn't know that a userfaultfd armed page is
> missing, then even if khugepaged fills that page, as far as userspace is
> concerned, the page was filled by khugepaged before userfaultfd was
> armed.

IMHO that's a common issue to solve for uffd missing-mode registrations,
and the common solution I'm aware of, for shmem, is punching holes over
the ranges where we hope to receive a message. That should only happen
_after_ the missing-mode registration has succeeded. At least that's what
we do with QEMU's postcopy.

>
> If that's a valid way to look at it, then I think the fact that
> collapse_file locks hpage provides most of the necessary locking.

The hpage is still not visible to the page cache at all, not until it is
used to replace the small pages, right? Do you mean "when holding the page
lock of the existing small pages"?
> From there, we need to check whether there are any VMAs with armed
> userfaultfds that might have observed a missing page. I think that can be
> done while iterating over VMAs in retract_page_tables without acquiring

I am not 100% sure how this works. AFAICT we retract pgtables only if we
have already collapsed the page. I don't see how it can be undone?

> any mmap_lock by adding some memory barriers to userfaultfd_set_vm_flags
> and userfaultfd_armed. It is possible that a userfaultfd gets registered
> on a particular VMA after we check its flags but before the collapse
> finishes. I think the only observability hole left would be operations on
> the shmem file descriptor that don't actually lock pages
> (e.g. SEEK_DATA/SEEK_HOLE), which are hopefully solvable with some more
> thought.

So it's possible that I just didn't grasp the fundamental idea of the new
proposal at all. If you're confident in the idea, I'd also be glad to read
the patch directly; maybe that will help explain things.

Thanks,

-- 
Peter Xu