Date: Wed, 12 Mar 2025 11:45:36 -0400
From: Peter Xu <peterx@redhat.com>
To: Nikita Kalyazin
Cc: James Houghton, akpm@linux-foundation.org, pbonzini@redhat.com,
	shuah@kernel.org, kvm@vger.kernel.org,
	linux-kselftest@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, lorenzo.stoakes@oracle.com, david@redhat.com,
	ryan.roberts@arm.com, quic_eberman@quicinc.com, graf@amazon.de,
	jgowans@amazon.com, roypat@amazon.co.uk, derekmn@amazon.com,
	nsaenz@amazon.es, xmarcalx@amazon.com
Subject: Re: [RFC PATCH 0/5] KVM: guest_memfd: support for uffd missing
Message-ID:
References: <20250303133011.44095-1-kalyazin@amazon.com>
 <9e7536cc-211d-40ca-b458-66d3d8b94b4d@amazon.com>
In-Reply-To: <9e7536cc-211d-40ca-b458-66d3d8b94b4d@amazon.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
On Tue, Mar 11, 2025 at 04:56:47PM +0000, Nikita Kalyazin wrote:
>
>
> On 10/03/2025 19:57, Peter Xu wrote:
> > On Mon, Mar 10, 2025 at 06:12:22PM +0000, Nikita Kalyazin wrote:
> > >
> > >
> > > On 05/03/2025 20:29, Peter Xu wrote:
> > > > On Wed, Mar 05, 2025 at 11:35:27AM -0800, James Houghton wrote:
> > > > > I think it might be useful to implement an fs-generic MINOR mode.  The
> > > > > fault handler is already easy enough to do generically (though it
> > > > > would become more difficult to determine if the "MINOR" fault is
> > > > > actually a MISSING fault, but at least for my userspace, the
> > > > > distinction isn't important. :))  So the question becomes: what should
> > > > > UFFDIO_CONTINUE look like?
> > > > >
> > > > > And I think it would be nice if UFFDIO_CONTINUE just called
> > > > > vm_ops->fault() to get the page we want to map and then mapped it,
> > > > > instead of having shmem-specific and hugetlb-specific versions (though
> > > > > maybe we need to keep the hugetlb specialization...).  That would avoid
> > > > > putting kvm/gmem/etc. symbols in mm/userfaultfd code.
> > > > >
> > > > > I've actually wanted to do this for a while but haven't had a good
> > > > > reason to pursue it.  I wonder if it can be done in a
> > > > > backwards-compatible fashion...
> > > >
> > > > Yes I also thought about that.
> > > > :)
> > >
> > > Hi Peter, hi James.  Thanks for pointing at the race condition!
> > >
> > > I did some experimentation and it indeed looks possible to call
> > > vm_ops->fault() from userfault_continue() to make it generic and decouple
> > > from KVM, at least for non-hugetlb cases.  One thing is we'd need to prevent
> > > a recursive handle_userfault() invocation, which I believe can be solved by
> > > adding a new VMF flag to ignore the userfault path when the fault handler is
> > > called from userfault_continue().  I'm open to a more elegant solution
> > > though.
> >
> > It sounds working to me.  Adding fault flag can also be seen as part of
> > extension of vm_operations_struct ops.  So we could consider reusing
> > fault() API indeed.
>
> Great!
>
> >
> > > Regarding usage of the MINOR notification, in what case do you recommend
> > > sending it?  If following the logic implemented in shmem and hugetlb, ie if
> > > the page is _present_ in the pagecache, I can't see how it is going to work
> >
> > It could be confusing when reading that chunk of code, because it looks
> > like it notifies minor fault when cache hit.  But the critical part here is
> > that we rely on the pgtable missing causing the fault() to trigger first.
> > So it's more like "cache hit && pgtable missing" for minor fault.
>
> Right, but the cache hit still looks like a precondition for the minor fault
> event?

Yes.

>
> > > with the write syscall, as we'd like to know when the page is _missing_ in
> > > order to respond with the population via the write.  If going against
> > > shmem/hugetlb logic, and sending the MINOR event when the page is missing
> > > from the pagecache, how would it solve the race condition problem?
> >
> > Should be easier we stick with mmap() rather than write().  E.g.
> > for shmem
> > case of current code base:
> >
> >         if (folio && vma && userfaultfd_minor(vma)) {
> >                 if (!xa_is_value(folio))
> >                         folio_put(folio);
> >                 *fault_type = handle_userfault(vmf, VM_UFFD_MINOR);
> >                 return 0;
> >         }
> >
> > vma is only available if vmf!=NULL, aka in fault context.  With that, in
> > write() to shmem inodes, nothing will generate a message, because minor
> > fault so far is only about pgtable missing.  It needs to be mmap()ed first,
> > and has nothing yet to do with write() syscalls.
>
> Yes, that's true that write() itself isn't going to generate a message.  My
> idea was to _respond_ to a message generated by the fault handler (vmf !=
> NULL) with a write().  I didn't mean to generate it from write().
>
> What I wanted to achieve was send a message on fault + cache miss and
> respond to the message with a write() to fill the cache followed by a
> UFFDIO_CONTINUE to set up pagetables.  I understand that a MINOR trap (MINOR
> + UFFDIO_CONTINUE) is preferable, but how does it fit into this model?
> What/how will guarantee a cache hit that would trigger the MINOR message?
>
> To clarify, I would like to be able to populate pages _on-demand_, not only
> proactively (like in the original UFFDIO_CONTINUE cover letter [1]).  Do you
> think the MINOR trap could still be applicable or would it necessarily
> require the MISSING trap?

I think MINOR can also achieve similar things.  MINOR traps the pgtable
missing event (let's imagine page cache is already populated, or at least
when MISSING mode not registered, it'll auto-populate on 1st access).  So as
long as the content can only be accessed from the pgtable (either via mmap()
or GUP on top of it), then afaiu it could work similarly like MISSING faults,
because anything trying to access it will be trapped.

Said that, we can also choose to implement MISSING first.
In that case write() is definitely not enough, because MISSING is at least so
far based on top of whether the page cache present, and write() won't be
atomic on update a page.  We need to implement UFFDIO_COPY for gmemfd
MISSING.

Either way looks ok to me.

> [1] https://lore.kernel.org/linux-fsdevel/20210301222728.176417-1-axelrasmussen@google.com/T/
>
> >
> > > Also, where would the check for the folio_test_uptodate() mentioned by James
> > > fit into here?  Would it only be used for fortifying the MINOR (present)
> > > against the race?
> > >
> > > > When Axel added minor fault, it's not a major concern as it's the only fs
> > > > that will consume the feature anyway in the do_fault() path - hugetlbfs has
> > > > its own path to take care of.. even until now.
> > > >
> > > > And there's some valid points too if someone would argue to put it there
> > > > especially on folio lock - do that in shmem.c can avoid taking folio lock
> > > > when generating minor fault message.  It might make some difference when
> > > > the faults are heavy and when folio lock is frequently taken elsewhere too.
> > > >
> > > Peter, could you expand on this?  Are you referring to the following
> > > (shmem_get_folio_gfp)?
> > >
> > >         if (folio) {
> > >                 folio_lock(folio);
> > >
> > >                 /* Has the folio been truncated or swapped out? */
> > >                 if (unlikely(folio->mapping != inode->i_mapping)) {
> > >                         folio_unlock(folio);
> > >                         folio_put(folio);
> > >                         goto repeat;
> > >                 }
> > >                 if (sgp == SGP_WRITE)
> > >                         folio_mark_accessed(folio);
> > >                 if (folio_test_uptodate(folio))
> > >                         goto out;
> > >                 /* fallocated folio */
> > >                 if (sgp != SGP_READ)
> > >                         goto clear;
> > >                 folio_unlock(folio);
> > >                 folio_put(folio);
> > >         } [1]
> > >
> > > Could you explain in what case the lock can be avoided?  AFAIC, the function
> > > is called by both the shmem fault handler and userfault_continue().
> >
> > I think you meant the UFFDIO_CONTINUE side of things.  I agree with you, we
> > always need the folio lock.
> >
> > What I was saying is the trapping side, where the minor fault message can
> > be generated without the folio lock now in case of shmem.  It's about
> > whether we could generalize the trapping side, so handle_mm_fault() can
> > generate the minor fault message instead of by shmem.c.
> >
> > If the only concern is "referring to a module symbol from core mm", then
> > indeed the trapping side should be less of a concern anyway, because the
> > trapping side (when in the module codes) should always be able to reference
> > mm functions.
> >
> > Actually.. if we have a fault() flag introduced above, maybe we can
> > generalize the trap side altogether without the folio lock overhead.  When
> > the flag set, if we can always return the folio unlocked (as long as
> > refcount held), then in UFFDIO_CONTINUE ioctl we can lock it.
>
> Where does this locking happen exactly during trapping?  I was thinking it
> was only done when the page was allocated.  The trapping part (quoted by you
> above) only looks up the page in the cache and calls handle_userfault().  Am
> I missing something?

That's only what I worry if we want to reuse fault() to generalize the trap
code in core mm, because fault() by default takes the folio lock at least for
shmem.  I agree the folio doesn't need locking when trapping the fault and
sending the message.

Thanks,

> > > >
> > > > It might boil down to how many more FSes would support minor fault, and
> > > > whether we would care about such difference at last to shmem users.  If gmem
> > > > is the only one after existing ones, IIUC there's still option we implement
> > > > it in gmem code.  After all, I expect the change should be very under
> > > > control (<20 LOCs?)..
> > > >
> > > > --
> > > > Peter Xu
> > >
> >
> > --
> > Peter Xu
>

--
Peter Xu