From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Thu, 3 Jul 2025 11:45:40 -0400
From: Peter Xu
To: Suren Baghdasaryan
Cc: Mike Rapoport, Lorenzo Stoakes, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, Vlastimil Babka, Muchun Song,
	Hugh Dickins, Andrew Morton, James Houghton, "Liam R. Howlett",
Howlett" , Nikita Kalyazin , Michal Hocko , David Hildenbrand , Andrea Arcangeli , Oscar Salvador , Axel Rasmussen , Ujwal Kundur Subject: Re: [PATCH v2 1/4] mm: Introduce vm_uffd_ops API Message-ID: References: <20250627154655.2085903-1-peterx@redhat.com> <20250627154655.2085903-2-peterx@redhat.com> MIME-Version: 1.0 In-Reply-To: X-Mimecast-Spam-Score: 0 X-Mimecast-MFC-PROC-ID: hgM00g8V28C8vpkggPRqyhhp2-WOPJlhVOk2y9HObII_1751557545 X-Mimecast-Originator: redhat.com Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit X-Rspamd-Server: rspam09 X-Rspamd-Queue-Id: 815E0140004 X-Stat-Signature: 8cecuikdrm8eqtcty8mthnz4s5dbuwnw X-Rspam-User: X-HE-Tag: 1751557547-749718 X-HE-Meta: U2FsdGVkX19sc6L4VV3RrZouTjGVw39EaTgy1WAFl/Y5f6SogkQwWMr7uhfRQ24NI2XkepKdCywWxg4HKiXGUAFt0Cx+5sSo2WEqhPwXuV/+su7VMS3+dEWKVNl1r/oqGKXd3jw5sFwqGw4w1JgkkHrcB7CxYIaa3WMPrvC7QA/LvHK3LA2cxuQqTntfFk9ISVmgVfG6RBQmN+hY7fE6hbPVVHubMg3xRYBmJaTzbBxYelifWYQ2GCWSVK0PRW58ctX1DQCqFEJsUkT3J2g6woHAKOb4q97K7Hx8ktAGZAkQ1Iq2QTWMMteH31oNDYSneiN4ehiUZknCcB82hYAW6Bs9BApqRtluxqmH1KfQZPWJ+sC1vuVkkNC1gr7B0p43oKe1jARTtIkqD0UJUKUnNTEOswukXofY4ApOmh319ueC1gNVqw/q6IQQqE7iO5t4fUiZVwrpD8c0JuLza0PLL/ldi/KVKp25kvUiMXj/UvdYhCl57LhvQkxFjt4O38BQFfG0XOwLa3XvHDRfiVxc/8+6LZUVUaieOkT4fp15Y+bshREtOwcvglsi7R4CGEEI8uY/6o97myuE5OfexjScoBGTi/9TolmDfTKc9aHtSUISwJInDRx4vrIHWaGFs5QQL6h521NTxkl5SBxtcgLgNpMicbR/Vrq/HuzskiOwM4AYxYMrDIJ1yUJWuU0xd6QnunU1hPrki8QhFd7IxC7oMyX3cAKGSYJPuSiB/S9+8rAPQtQhjJ8oSB1oIHCGN9AHRk8cteGTgP3rLZrv4GdaM7pblDnmpPJjQwg/SSy1ny1ocduKQ7XVubT/GnB6WVTlEhymgt4H7pujkJTvniUZPEx/59QXQflsJm8wcKoNuVHOmhBpisbB9EffvaJUoDSj0enCBgDfSrGC6iNJq6ri8WoWRcBQae/XA+f4q3sddvE4Ka/S4RFdyv2+iBxTO2QEoicKdMVKLWD7+mSJLA5 ZqG2yBhB 8z5C5dG5BRpQbxWg+cvKA0wzWd1NbHW5NAnJtxgHRlJ+SdXCvzGO+YxIcSdLEGldMABCkxJ412TFlgjGj3rufV72RBoam4rMqLHLpWCJ4xwiKDu/Bv5u3mLkcUIjT5Cni0oOnTJXY1qvacvRmKNOtQQMPy3mLY76EuQX0nK5SgyHPkFdpbmZQRBh1XxR1fTFvq7u5gdiBlrarYKEfMXVK/Is0X7Yc0mpByfGR+dyh5UlmMxbYlM34iWAkP9xrLG5icFfJcYP0LjXick71XhWgNPSz/q2/VC2hFrmHFIV64MQ7Ce1Uk6yWdOwsIbRqq0+wYieHP/uHphQ2GrlL7WoHkgEWtj9lsxu5Uyzaq4TLWtw9QQxc/sLtII10476zZNgZa1gT3y3o7IQHIrMEkbu0SAadLPXgzlJoZF8FS8/zNi6IJ8vmhYkbrzYVAFfJCWkOUHYzs+dLkyh5DJ5qvuyVR8eCyi6BeXv3pFPjIWQxGwTEnihFEwIWgNUogewL4qtebyYiLh4ra1KpSyGRhAvjC9BzOnJX2eYZNvG5+FPmJAb50g2KFAVQU0xzixtuKsrNSkPX+bP1EG1np64INnWQjSuOaLTCpH83IHlnNvLpaMZhF/ts/U8UoXYvSzbUcSM9pvtz7tcBZq/95E4an/uZAl4MC8J3pTx+vptjqUXJBUrPuzY= X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Thu, Jul 03, 2025 at 08:01:12AM -0700, Suren Baghdasaryan wrote: > On Wed, Jul 2, 2025 at 1:23 PM Peter Xu wrote: > > > > On Wed, Jul 02, 2025 at 09:16:31PM +0300, Mike Rapoport wrote: > > > On Tue, Jul 01, 2025 at 10:04:28AM -0700, Suren Baghdasaryan wrote: > > > > On Mon, Jun 30, 2025 at 3:16 AM Lorenzo Stoakes > > > > wrote: > > > > > > > > > > It seems like we're assuming a _lot_ of mm understanding in the underlying > > > > > driver here. > > > > > > > > > > I'm not sure it's really normal to be handing around page table state and > > > > > folios etc. to a driver like this, this is really... worrying to me. > > > > > > > > > > This feels like you're trying to put mm functionality outside of mm? > > > > > > > > To second that, two things stick out for me here: > > > > 1. uffd_copy and uffd_get_folio seem to be at different abstraction > > > > levels. 
> > > >    uffd_copy is almost the entire copy operation for VM_SHARED
> > > >    VMAs, while uffd_get_folio is a small part of the continue operation.
> > > > 2. shmem_mfill_atomic_pte(), which becomes uffd_copy for shmem in the
> > > >    last patch, is quite a complex function which itself calls some IMO
> > > >    pretty internal functions like mfill_atomic_install_pte().  Expecting
> > > >    modules to implement such functionality seems like a stretch to me,
> > > >    but maybe this is for some specialized modules which are written by
> > > >    mm experts only?
> > >
> > > Largely shmem_mfill_atomic_pte() differs from the anonymous memory version
> > > (mfill_atomic_pte_copy()) by the way the allocated folio is accounted and
> > > whether it's added to the page cache.  So instead of uffd_copy(...) we
> > > might add
> > >
> > > int (*folio_alloc)(struct vm_area_struct *vma, unsigned long dst_addr);
> > > void (*folio_release)(struct vm_area_struct *vma, struct folio *folio);
> >
> > Thanks for digging into this, Mike.  It's just that IMHO it may not be
> > enough..
> >
> > I actually tried to think about a more complicated API, but the more I
> > thought about it, the more it looked like overkill.  I can list a few
> > things here to show why I chose the simplest solution, with uffd_copy(),
> > as of now.
>
> TBH the below does not sound like overkill to me for keeping mm parts
> to itself without over-exposing them to modules.
>
> > Firstly, see shmem_inode_acct_blocks() at the entrance: that's shmem
> > accounting we need to do before allocations, whether or not an allocation
> > actually happens.
>
> Ok, this results in an additional folio_prealloc() hook.
>
> > That accounting can't be put into folio_alloc() even if we'll have one,
> > because we could already have the folio allocated when reaching here (that
> > is, when *foliop != NULL).  That was a very deliberate decision of ours to
> > do shmem accounting separately in 2016:
> >
> > https://lore.kernel.org/all/20161216144821.5183-37-aarcange@redhat.com/
> >
> > Then, there's also the complexity of how the page cache should be managed
> > for any mem type.  For shmem, the folio is only injected right before the
> > pgtable installation.  We can't inject it in folio_alloc(), because then
> > others could fault it in before the data is populated.  It means we need
> > at least one more API to do the page cache injection for the folio just
> > allocated from folio_alloc().
>
> folio_add_to_cache() hook?
>
> > We also may want different treatment of how the folio flags are set up.
> > It may not always happen in folio_alloc().  E.g. for shmem right now we do
> > this right before the page cache injection:
> >
> >         VM_BUG_ON(folio_test_locked(folio));
> >         VM_BUG_ON(folio_test_swapbacked(folio));
> >         __folio_set_locked(folio);
> >         __folio_set_swapbacked(folio);
> >         __folio_mark_uptodate(folio);
> >
> > We may not want to do exactly the same for all the other mem types.  E.g.
> > we likely don't want to set swapbacked for guest-memfd folios.  We may
> > need one more API for that.
>
> Can we do that inside the folio_add_to_cache() hook before doing the injection?

The folio can be freed outside this function by the userfaultfd callers, so I
wonder if it'll crash the kernel if mm sees the folio locked while it is being
freed.

> > Then, thinking about the failure path: given the question above on the
> > shmem acct issue, we may want yet another post_failure hook doing
> > shmem_inode_unacct_blocks() properly for shmem..  maybe we don't need that
> > for guest-memfd, but we still need it for shmem to properly unacct when
> > the folio allocation succeeded but copy_from_user failed somehow.
>
> Failure handling hook.
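
To make the comparison concrete, the finer-grained split discussed above would
roughly end up as the sketch below.  This is illustrative only -- the hook
names and signatures are made up for the example, not a concrete proposal:

	/*
	 * Illustrative sketch only: roughly the set of hooks the MISSING
	 * path would need if uffd_copy() were split up.  The types come
	 * from <linux/mm.h>; the names are invented for this example.
	 */
	struct vm_uffd_missing_ops {
		/* pre-allocation accounting, e.g. shmem_inode_acct_blocks() */
		int (*folio_prealloc)(struct vm_area_struct *dst_vma,
				      unsigned long dst_addr);
		/* allocate the destination folio (skipped if one already exists) */
		struct folio *(*folio_alloc)(struct vm_area_struct *dst_vma,
					     unsigned long dst_addr);
		/* set folio flags and inject into the page cache, right
		 * before the pgtable installation */
		int (*folio_add_to_cache)(struct vm_area_struct *dst_vma,
					  struct folio *folio,
					  unsigned long dst_addr);
		/* undo accounting etc. when the copy or install fails */
		void (*copy_failure)(struct vm_area_struct *dst_vma,
				     struct folio *folio);
		/* release a folio that was allocated but never installed */
		void (*folio_release)(struct vm_area_struct *dst_vma,
				      struct folio *folio);
	};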
> >
> > Then the question is, do we really need all this fuss for almost nothing?
>
> If that helps to keep modules simpler and mm details contained inside
> the core kernel, I think it is worth doing.  I imagine if shmem were
> a module, then implementing the current API would require exporting
> functions like mfill_atomic_install_pte().  That seems like
> over-exposure to me.  And if we can avoid all the locking and
> refcounting that Liam mentioned, that would definitely be worth it
> IMHO.
>
> > Note that if we want, IIUC we can still change this in the future.  The
> > initial isolation in this series would still be good to land earlier; we
> > don't even plan to support MISSING for guest-memfd in the near future,
> > only MINOR mode for now.  We don't necessarily need to work out MISSING
> > immediately to move useful features into Linux.  Even if we'll have it
> > for guest-memfd, it shouldn't be a challenge to maintain both.  This is
> > just not a contract we need to sign forever yet. [1]
> >
> > Hope the above clarifies a bit why I chose the simplest solution as of
> > now.  I also don't like this API, as I mentioned in the cover letter.
> > It's really a trade-off; at least for now it's the best I can come up
> > with.
> >
> > That said, if any of us has a better solution, please shoot.  I'm always
> > open to better alternatives.
>
> I didn't explore this code as deeply as you have and therefore might not
> see all the pitfalls, but it looks like you already considered an
> alternative which does sound better to me.  What are the drawbacks that
> I might be missing?

It'll be a complex API from another angle: we'd need 5-6 APIs just to
implement uffd MISSING mode.  Meanwhile it would also start to introduce
those function jumps into shmem, even though one would be enough to decouple
it.

As I mentioned above, I'm not against doing it, but IMHO it does not need to
be done right now [1]: currently it looks to me like guest-memfd may not
really need MISSING mode traps, and I'm not sure whether it ever will.  We
may not want to introduce 5-6 APIs with no further user.  We can always
discuss a proper API when the demand arrives.

If uffd_copy is so concerning to people, there's also another alternative,
which is to make vm_uffd_ops support all traps except MISSING (a rough
sketch of that is appended below).  uffd_copy is only about MISSING traps,
so we could drop the uffd_copy API, but then the API is incomplete on its
own.

I still feel like we're over-worrying about this.  For OOT drivers I never
cared about breaking anything.  For in-tree ones, this discussion can happen
when there's a real need, and at that point we can also look at an
implementation using uffd_copy(); maybe it won't be that bad either.  It
seems to me we don't need to figure this out right now; in the worst case we
can do it in two steps.

Thanks,

-- 
Peter Xu
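
Appended sketch referenced above, illustrative only.  The field names and
signatures are assumptions for the sake of the example, not the actual
vm_uffd_ops layout from the patch; they only show what the ops table could
look like if the MISSING path (uffd_copy) were left out and only
MINOR/CONTINUE were supported:

	/*
	 * Illustrative sketch: a vm_uffd_ops variant that omits the MISSING
	 * path entirely, so no uffd_copy() and no folio allocation or page
	 * cache manipulation is ever delegated to the module.  Only the
	 * MINOR/CONTINUE lookup remains.  Names and signatures are assumed.
	 */
	struct vm_uffd_ops_minor_only {
		/* which userfaultfd modes this memory type supports (no MISSING) */
		unsigned long (*uffd_features)(struct vm_area_struct *vma);
		/* look up the already-present folio backing the faulting
		 * offset, for UFFDIO_CONTINUE */
		int (*uffd_get_folio)(struct inode *inode, pgoff_t pgoff,
				      struct folio **foliop);
	};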