From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 88D58C83F03 for ; Thu, 3 Jul 2025 16:26:29 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id E9F0A280004; Thu, 3 Jul 2025 12:26:28 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id E754A94000D; Thu, 3 Jul 2025 12:26:28 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id D8C03280004; Thu, 3 Jul 2025 12:26:28 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0013.hostedemail.com [216.40.44.13]) by kanga.kvack.org (Postfix) with ESMTP id C4D4B94000D for ; Thu, 3 Jul 2025 12:26:28 -0400 (EDT) Received: from smtpin22.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay10.hostedemail.com (Postfix) with ESMTP id 27161C029A for ; Thu, 3 Jul 2025 16:26:28 +0000 (UTC) X-FDA: 83623481256.22.E6325D1 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) by imf29.hostedemail.com (Postfix) with ESMTP id 6B88012000E for ; Thu, 3 Jul 2025 16:26:25 +0000 (UTC) Authentication-Results: imf29.hostedemail.com; dkim=pass header.d=redhat.com header.s=mimecast20190719 header.b=hK+sI4UD; dmarc=pass (policy=quarantine) header.from=redhat.com; spf=pass (imf29.hostedemail.com: domain of peterx@redhat.com designates 170.10.129.124 as permitted sender) smtp.mailfrom=peterx@redhat.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1751559985; a=rsa-sha256; cv=none; b=u0Jc6jTOJdahjz79X6aT27dslzH4rWSDgYJdjpoJlBJdQErxNpZjeH9eO9qMNXrNNOSQe3 B7vt5vKaYQvYuI4ZAcgTJNOWpJCKJARt0fU3kafztfUhfuddDKSsq32DV3cIjwmxdwdaag RvNeyc2KmVx5xk7aKzGrq+8szA28y04= ARC-Authentication-Results: i=1; imf29.hostedemail.com; dkim=pass header.d=redhat.com header.s=mimecast20190719 header.b=hK+sI4UD; dmarc=pass (policy=quarantine) header.from=redhat.com; spf=pass (imf29.hostedemail.com: domain of peterx@redhat.com designates 170.10.129.124 as permitted sender) smtp.mailfrom=peterx@redhat.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1751559985; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=4Y6zCExeDq8zceYvmEFOtzzhRpN64YrjMCwXYdkn74M=; b=DqZMlGGmpdrzPA6f/aWXmA4IrSwo5uJJdsus9d+zzT1mR/LZOycNEQU9KpttBJQGM2Wj1Z o6Z9jdfqgUOoGRR+c5kSw6rG+A8eDLsBzcRtzLCHlppuiNb5MPiNQmFdDRyghU4x4uU+Xz gk5moAFJI3JGG3oJex3gwdfZrVdfeWw= DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1751559984; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=4Y6zCExeDq8zceYvmEFOtzzhRpN64YrjMCwXYdkn74M=; b=hK+sI4UDqX0X3cBusKeWDCb/RH4zeHUtjrpZeDEePfUAqKqyY5/ZpIPv55nY1ObrN30wfx YQY1ij6/EQZfbHWu7OME6wKhDDjydaYobyqmOkUSCrdMu75Hz/YLKIl/K4I7fo9J4cdsYC VSgfaDBkm7HJ7/Uk7ftNBgS+Zl/WrrI= Received: from mail-qk1-f198.google.com (mail-qk1-f198.google.com [209.85.222.198]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-257-qwagbqR8PAGcDAqlEbPaFg-1; Thu, 03 Jul 2025 12:26:23 -0400 X-MC-Unique: qwagbqR8PAGcDAqlEbPaFg-1 X-Mimecast-MFC-AGG-ID: qwagbqR8PAGcDAqlEbPaFg_1751559983 Received: by mail-qk1-f198.google.com with SMTP id af79cd13be357-7cf6c53390eso657485a.2 for ; Thu, 03 Jul 2025 09:26:23 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1751559983; x=1752164783; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:from:date:x-gm-message-state:from:to:cc:subject:date :message-id:reply-to; bh=4Y6zCExeDq8zceYvmEFOtzzhRpN64YrjMCwXYdkn74M=; b=kkPRjGLssd8rpTcQ/dbSZOOdySZVa9jhHWWG18YM53nafx8LZmRrAQ5nzSLvGOSl6o ji6k9kDWq9YCMZqRj4afB/jsLk+Ijb6Z3n0K7ZuIIMPV7r+qT78cjcFou4TYPKjR7wkq qocmPaPksDJ+5o7Ay6eIaNeGiWqrYU7L8Q1MAVTQttZqmwxLHfBQ3LFvZ3eNEhOYDbbF bNODVZ5j5+Wi3EdO1hrXnSik8FnVk4gy4HtNj/RR9sM6bpraXk6SYrN09rtnYs9ADHZZ RLG8/f3ide3zT5LeXjOXowDmfRC7WOcQ9m34hHR1wqgn5/v8Dr/isfJQ0pNWaKUJO0wg yHqA== X-Gm-Message-State: AOJu0YyqOMaA1ra0icGYsgorii6df2GIjNvObkRsltALs7bPhBcjD7jk 9X9Hx5W+5LeuCI36fYzq8OCyFVz5lVhiqlP8BRKNI9jRejXT2urUcmvq0b6LC7f9VadYOn5wUhJ WlgkBhW5gJrKMLg4u1CiZzmn1/1bJ2OezL6t045yUJuE/Gy7rQVeb X-Gm-Gg: ASbGncvnEFmVZoC/mu/fJnuFAfL7vRitBS1JjGylWqBx/PMUATl0KF2DgNJAKmMr19y bSmokXEjAa1y2IbhVWD0uKvb5thCwetu6kBVWqnuYl6xaAnNxxOuNGkCkRxXGJB0iDYnONCGSzk /Ns62RkePe79pdxbFzTUV8ZuoQXq2CCpyKoUd4Ev5A6yNr8N9fp9VL5ApcYQ5ZvfiewIXEyYw8F LdYFob0tiHyGIsXBAELY19j8QlFyHlKEiZ74jqDZqJDOK1gKdhohHSITliUC+qByytm6Y52j1HR C6/NnTfWZEAqgw== X-Received: by 2002:a05:620a:a10c:b0:7d5:dbc3:4184 with SMTP id af79cd13be357-7d5dbc3419dmr64220085a.1.1751559982796; Thu, 03 Jul 2025 09:26:22 -0700 (PDT) X-Google-Smtp-Source: AGHT+IF7sc61w0XpTLo0eopdVSxu3NVsb/iPtSzOZNv6mB6attAfczYLdg4MkHjnjmMcH1lA6umW4g== X-Received: by 2002:a05:620a:a10c:b0:7d5:dbc3:4184 with SMTP id af79cd13be357-7d5dbc3419dmr64213485a.1.1751559982118; Thu, 03 Jul 2025 09:26:22 -0700 (PDT) Received: from x1.local ([85.131.185.92]) by smtp.gmail.com with ESMTPSA id af79cd13be357-7d5dbe7baedsm12772085a.57.2025.07.03.09.26.20 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 03 Jul 2025 09:26:21 -0700 (PDT) Date: Thu, 3 Jul 2025 12:26:17 -0400 From: Peter Xu To: Lorenzo Stoakes Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Vlastimil Babka , Suren Baghdasaryan , Muchun Song , Mike Rapoport , Hugh Dickins , Andrew Morton , James Houghton , "Liam R . Howlett" , Nikita Kalyazin , Michal Hocko , David Hildenbrand , Andrea Arcangeli , Oscar Salvador , Axel Rasmussen , Ujwal Kundur Subject: Re: [PATCH v2 0/4] mm/userfaultfd: modulize memory types Message-ID: References: <20250627154655.2085903-1-peterx@redhat.com> <92265a41-7e32-430c-8ab2-4e7680609624@lucifer.local> <54bb09fc-144b-4d61-8bc2-1eca4d278382@lucifer.local> MIME-Version: 1.0 In-Reply-To: <54bb09fc-144b-4d61-8bc2-1eca4d278382@lucifer.local> X-Mimecast-Spam-Score: 0 X-Mimecast-MFC-PROC-ID: kDejIbiASXXneyoEADz5p-t5qcXgdHvQSGO3VAxZXZc_1751559983 X-Mimecast-Originator: redhat.com Content-Type: text/plain; charset=utf-8 Content-Disposition: inline X-Stat-Signature: 8ybutooux7awmb8sxdga1cdpr19sb1db X-Rspamd-Server: rspam01 X-Rspamd-Queue-Id: 6B88012000E X-Rspam-User: X-HE-Tag: 1751559985-942001 X-HE-Meta: U2FsdGVkX1/EEiUtQbZBSfYQ1ZTQOyopPVeqrQLrHM4CU9QfTMF8ZHO3Hgk2WkaX52PZN9vR4OYF03XRZs2SOG9oXFUKb6bB9bK4+hAwfYpRGs0VEHUxgIl7+QjKU3aHkzAjaIcyhmB/iSdg10Lcv7zK7Dqu7YQHbuOrDBK64qPCWl+TBZP/glmeF3oRCizCpNzgKrW8JMYkOcSgg6QihWy6FNNaGw7AHk62a+82ZtXjPuoGn1mTrY1BdW3PCtqSTu83/gDNCe0hUMQ1hwH7sOolCfkJQ9LJIzrZ8SNkJkMeN5KtvVAMubpPfS8aqrachyHlw8WjQupXnmk/J8aji/Q7MF5WMWZsfE1277tvLJpjooDzf1UtheJ9Wh2lGKnZfgkUVdpB74jJjYP96Po9r3p9mTvs5SNFrfQH5Gf2kM9a7XeneILwJok2cyAfDzHVhXMJQgTx+5wJqCQnexXrnO2M4/cRKTcvZWg03U5/q8MqxDh9J8ZFtThVf/GgoA0GU6wFIb4yy8EQLrka7iC1U+5X7q/42UdwIoNBuKdbUOU5gzZeKbszmqdHrjxjKXakbfMzlN3p4MT5qPTGZN2yLaOpz1+ImU5M5G82PaqDSHjOkXQm0Abxyw4XlB9AIwgRd6OR/V6HlWEL0Etyq0GINnupllrJiyhUmUHclduJ9KiYHFocAtx9flJ6cHedNa9HXq3ykQgmXSoh3zf4FOzuz/YDdhqGHCt1VhFw4iTYrk8mAKKsJjg1S7N+YHTzastIAuMQv7Qvh0g/oii71i6PV0nEshFTUuzxIHyawkNaIGy0kgTqHtY9mif6w9W4/bJfBJ6uZD4UbGctMggXiIK/EEic8qdItbZ2mfg0019/q5h8lkhTV/639RjZD+aA5zTbaKKaj1z0sKv8D+AmR+e6SdZUI9bZ7fw5WA+QD+Y6Mzy55VXUz2adT+uvEEXZqexyS6K3EA7tIWpQuoSupPB PqZlREyU gBv9+nw4uPlmwru6tLrgTllz8udPBHsDFYulHql0bmPNtUTLIhrUsDCdkhvI8D+YTyn4y7/7k2VxTRfiGQvTKR/WNoAJbBPP+Q61s1jeE38HnbgpWuSeso9Zmy16yT26lv+nOBAj6y/KLkBqAfjIbANDJmLKJAT7O/dVWqMllvtdvbno1WahcAJY+8yEoRmPHGxdOW4htG4l42JkxyJb9jHp1ZwLqKsPS2mjJcUydC1gihRHlrWroqQskPkln55Ql9A31/uSVSz8qfBeBkmUjvDm53kEqNUYkPGI+TGAw1XxWPlHgarrLXF5goBFsjJq/fvia6Pgxt0CrBgsGwK1TjK1bzy5DYbmz3U9Nm1UwJBSH7mboXMXbQeaYklX3B+KOwS1nYW7wCJl71mrdQD6BrVJU4Dk1uihKUOq5EHyCJCkNRAM+Y+qfrWN6rE0Mg4OY5Fb1AUjdkZ5e/cMKcdlJG/cL03BQBhPKq/xRITP+RszP/fUXw/2XB2cIMWbO68lctsU9ZVDxv6RaeMkMnzuMB6hvYPzbPvdjO8fsIlwBSfMDJua+Wk21cnc1RU8LTFySXz54 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Thu, Jul 03, 2025 at 04:55:01PM +0100, Lorenzo Stoakes wrote: > Peter you've ignored a large part of what I've said here, I'm not sure this is > productive. > > On Wed, Jul 02, 2025 at 04:36:31PM -0400, Peter Xu wrote: > > On Mon, Jun 30, 2025 at 11:29:30AM +0100, Lorenzo Stoakes wrote: > > > On Fri, Jun 27, 2025 at 11:46:51AM -0400, Peter Xu wrote: > > > > [based on latest akpm/mm-new of June 27th, commit 9be7387ae43f] > > > > > > > > v2 changelog: > > > > - Patch 1 > > > > - update English in commit log [David] > > > > - move vm_uffd_ops definition to userfaultfd_k.h [Mike] > > > > - Patch 4 > > > > - fix sparse warning on bitwise type conversions [syzbot] > > > > - Commit message updates on explanation of vma_can_userfault check [James] > > > > > > > > v1: https://lore.kernel.org/r/20250620190342.1780170-1-peterx@redhat.com > > > > > > > > This series is an alternative proposal of what Nikita proposed here on the > > > > initial three patches: > > > > > > > > https://lore.kernel.org/r/20250404154352.23078-1-kalyazin@amazon.com > > > > > > > > This is not yet relevant to any guest-memfd support, but paving way for it. > > > > Here, the major goal is to make kernel modules be able to opt-in with any > > > > form of userfaultfd supports, like guest-memfd. This alternative option > > > > should hopefully be cleaner, and avoid leaking userfault details into > > > > vm_ops.fault(). > > > > > > > > It also means this series does not depend on anything. It's a pure > > > > refactoring of userfaultfd internals to provide a generic API, so that > > > > other types of files, especially RAM based, can support userfaultfd without > > > > touching mm/ at all. > > > > > > I'm very concerned that this change will simply move core mm functionality out > > > of mm and into drivers where it can bitrot and cause subtle bugs? > > > > > > You're proposing providing stuff like page table state and asking for a folio > > > back from a driver etc. > > > > > > I absolutely am not in favour of us providing core mm internals like this to > > > drivers, and I don't want to see us having to EXPORT() mm internals just to make > > > module-ised uffd code work (I mean I just will flat out refuse to do that). > > > > > > I think we need to think _very_ carefully about how we do this. > > > > > > I also feel like this series is at a really basic level and you've not fully > > > determined what API calls you need. > > > > See: > > > > https://lore.kernel.org/all/aGWVIjmmsmskA4bp@x1.local/#t > > OK so it seems the point you're making there is - there's a lot of complexity - > and you'd rather not think about it for now? This is not a fair judgement. I think I have provided my thought process and how I made the decision to come up with this series. > > My concern is that in _actuality_ you'll need to do a _lot_ more and expose a > _lot_ more mm internals to achieve what you need to. > > I am happy for the API to be internal-to-mm-only. > > My concerns are really simple: > > - exposing page table manipulation outside of mm is something I just cannot > accept us doing for reasons I already mentioned and also Liam > > - we should absolutely minimise how much we do expose > > - we mustn't have hooks that don't allow us to make sensible decisions in core > mm code. > > I think overall I'm just not in favour of us having uffd functionality in > modules, I think it's an abstraction mismatch. > > Now if you instead had something like: > > enum uffd_minor_fault_handler_type { > UFFD_MINOR_FAULT_DEFAULT_HANDLER, > UFFD_MINOR_FAULT_FOO_HANDLER, > UFFD_MINOR_FAULT_BAR_HANDLER, > etc. > }; > > And the module could _choose_ which should be used then that's far far better. > > Really - hermetically seal the sensitive parts but allow module choice. > > You could find ways to do this in a more sophisticated way too by e.g. having a > driver callback that provides a config struct that sets things up. > > If we're going 'simple first' and can 'change what we want' why not do it like > this to start? Could you help to define UFFD_MINOR_FAULT_FOO_HANDLER for guest-memfd, and how that handler would be able to hook to the guest-memfd driver? The kernel needs to build with/without CONFIG_KVM. What about MISSING? Do you plan to support missing in the proposal you mentioned? > > > > > > > > > I agree that it's sensible to be incremental, but I feel like you sort of need > > > to somewhat prove the case that you can jump from 'incremental version where we > > > only support code in mm/' to supporting arbitrary file system code that might be > > > modules. > > > > > > Because otherwise you're basically _guessing_ that you can do this, possibly, in > > > the future and maybe it's just not the right approach but that's not clear yet? > > > > Did you follow up with the discussions in v1? I copied you too. > > > > https://lore.kernel.org/r/114133f5-0282-463d-9d65-3143aa658806@amazon.com > > > > Would Nikita's work help here? Could you explain what are you asking for > > to prove that this works for us? > > Yeah thanks this seems like it actually implements useful functionality. The > point is the API seems currently to be a stub so doesn't really give us much to > go on as to what might ultimately be required. > > You say yourself in the series that you'll likely need to expand things and you > already have todos to this effect. > > > > > > > > > > > > > > To achieve it, this series introduced a file operation called vm_uffd_ops. > > > > The ops needs to be provided when a file type supports any of userfaultfd. > > > > > > > > With that, I moved both hugetlbfs and shmem over. > > > > > > Well as you say below hugetlbfs is sort of a stub implementation, I wonder > > > whether we'd need quite a bit more to make that work. > > > > > > One thing I'd _really_ like to avoid is us having to add a bunch of hook points > > > into core mm code just for uffd that then call out to some driver. > > > > > > We've encountered such a total nightmare with .mmap() for instance in the past > > > (including stuff that resulted in security issues) because we - simply cannot > > > assume anything - about what the hook implementor might do with the passed > > > parameters. > > > > > > This is really really problematic. > > > > > > I also absolutely hate the: > > > > > > if (uffd) > > > do_something_weird(); > > > > > > Pattern, so hopefully this won't proliferate that. > > ^ you didn't respond to this. Can you elaborate? I don't understand how this is attached to this series. AFAIU, the series is trying to remove some "if()s", not adding. > > > > > > > > > > > > Hugetlbfs is still very special that it will only use partial of the > > > > vm_uffd_ops API, due to similar reason why hugetlb_vm_op_fault() has a > > > > BUG() and so far hard-coded into core mm. But this should still be better, > > > > because at least hugetlbfs is still always involved in feature probing > > > > (e.g. where it used to not support ZEROPAGE and we have a hard-coded line > > > > to fail that, and some more). Meanwhile after this series, shmem will be > > > > completely converted to the new vm_uffd_ops API; the final vm_uffd_ops for > > > > shmem looks like this: > > > > > > > > static const vm_uffd_ops shmem_uffd_ops = { > > > > .uffd_features = __VM_UFFD_FLAGS, > > > > .uffd_ioctls = BIT(_UFFDIO_COPY) | > > > > BIT(_UFFDIO_ZEROPAGE) | > > > > BIT(_UFFDIO_WRITEPROTECT) | > > > > BIT(_UFFDIO_CONTINUE) | > > > > BIT(_UFFDIO_POISON), > > > > .uffd_get_folio = shmem_uffd_get_folio, > > > > .uffd_copy = shmem_mfill_atomic_pte, > > > > }; > > > > > > > > As I mentioned in one of my reply to Nikita, I don't like the current > > > > interface of uffd_copy(), but this will be the minimum change version of > > > > such API to support complete extrenal-module-ready userfaultfd. Here, very > > > > minimal change will be needed from shmem side to support that. > > > > > > Right, maybe a better version of this interface might address some of my > > > concerns... :) > > > > > > > > > > > Meanwhile, the vm_uffd_ops is also not the only place one will need to > > > > provide to support userfaultfd. Normally vm_ops.fault() will also need to > > > > be updated, but that's a generic function and it'll play together with the > > > > new vm_uffd_ops to make everything fly. > > > > > > > > No functional change expected at all after the whole series applied. There > > > > might be some slightly stricter check on uffd ops here and there in the > > > > last patch, but that really shouldn't stand out anywhere to anyone. > > > > > > > > For testing: besides the cross-compilation tests, I did also try with > > > > uffd-stress in a VM to measure any perf difference before/after the change; > > > > The static call becomes a pointer now. I really cannot measure anything > > > > different, which is more or less expected. > > > > > > > > Comments welcomed, thanks. > > > > > > > > Peter Xu (4): > > > > mm: Introduce vm_uffd_ops API > > > > mm/shmem: Support vm_uffd_ops API > > > > mm/hugetlb: Support vm_uffd_ops API > > > > mm: Apply vm_uffd_ops API to core mm > > > > > > > > include/linux/mm.h | 9 +++ > > > > include/linux/shmem_fs.h | 14 ----- > > > > include/linux/userfaultfd_k.h | 98 +++++++++++++++++++---------- > > > > mm/hugetlb.c | 19 ++++++ > > > > mm/shmem.c | 28 ++++++++- > > > > mm/userfaultfd.c | 115 +++++++++++++++++++++++++--------- > > > > 6 files changed, 207 insertions(+), 76 deletions(-) > > > > > > > > -- > > > > 2.49.0 > > > > > > > > > > Sorry to be critical, I just want to make sure we're not setting ourselves up > > > for trouble here. > > > > > > I _very much_ support efforts to make uffd more generalised, and ideally to find > > > a way to separate out shmem and hugetlbfs implementation bits, so I support the > > > intent _fully_. > > > > > > I just want to make sure we do it in a safe way :) > > > > Any explicit suggestions (besides objections)? > > I gave some above. > > I'm fundamentally opposed to us exposing page table manipulation or really any > state subject to sensitive locks to modules. > > Cheers, Lorenzo > -- Peter Xu