From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <owner-linux-mm@kvack.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 23467C001DE
	for <linux-mm@archiver.kernel.org>; Thu, 20 Jul 2023 20:07:02 +0000 (UTC)
Received: by kanga.kvack.org (Postfix)
	id 8B94328015A; Thu, 20 Jul 2023 16:07:01 -0400 (EDT)
Received: by kanga.kvack.org (Postfix, from userid 40)
	id 8698828004C; Thu, 20 Jul 2023 16:07:01 -0400 (EDT)
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id 70A2D28015A; Thu, 20 Jul 2023 16:07:01 -0400 (EDT)
X-Delivered-To: linux-mm@kvack.org
Received: from relay.hostedemail.com (smtprelay0010.hostedemail.com [216.40.44.10])
	by kanga.kvack.org (Postfix) with ESMTP id 6182D28004C
	for <linux-mm@kvack.org>; Thu, 20 Jul 2023 16:07:01 -0400 (EDT)
Received: from smtpin25.hostedemail.com (a10.router.float.18 [10.200.18.1])
	by unirelay06.hostedemail.com (Postfix) with ESMTP id 25428B0FFD
	for <linux-mm@kvack.org>; Thu, 20 Jul 2023 20:07:01 +0000 (UTC)
X-FDA: 81033073842.25.6CDA548
Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124])
	by imf01.hostedemail.com (Postfix) with ESMTP id DA9FC4001A
	for <linux-mm@kvack.org>; Thu, 20 Jul 2023 20:06:58 +0000 (UTC)
Authentication-Results: imf01.hostedemail.com;
	dkim=pass header.d=redhat.com header.s=mimecast20190719 header.b=Vzv2yIg3;
	spf=pass (imf01.hostedemail.com: domain of peterx@redhat.com designates 170.10.133.124 as permitted sender) smtp.mailfrom=peterx@redhat.com;
	dmarc=pass (policy=none) header.from=redhat.com
ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1689883619; a=rsa-sha256;
	cv=none;
	b=eiKwxfK4p42vD51lHBJoyCcxCWgxrxIiiRfzeYDJeHYIJnnwXd2TurSmkDNocLmJZrDBwf
	hWcYkcEPyok1Fmn96z6/2Oc1g251o0mLCUE4llqhJWVLe+Wrp/vM4t+m4LL5YJdImr99ni
	w3s8N1yUXHGKmeK6YOJT8tMwa19za3g=
ARC-Authentication-Results: i=1;
	imf01.hostedemail.com;
	dkim=pass header.d=redhat.com header.s=mimecast20190719 header.b=Vzv2yIg3;
	spf=pass (imf01.hostedemail.com: domain of peterx@redhat.com designates 170.10.133.124 as permitted sender) smtp.mailfrom=peterx@redhat.com;
	dmarc=pass (policy=none) header.from=redhat.com
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com;
	s=arc-20220608; t=1689883619;
	h=from:from:sender:reply-to:subject:subject:date:date:
	 message-id:message-id:to:to:cc:cc:mime-version:mime-version:
	 content-type:content-type:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references:dkim-signature;
	bh=ucxp4CiclbQdAFxLcc4GXsXkfqukOJSf3s5CGFH45eo=;
	b=CDyxomwI98NNz95IQnSj8efaIOv8A3P9xnA5FoJfzVzpONNe2Y2fIYG5FlVnQ1mlBSH1kD
	wm87gmfAE8eaY8oEZ8GTvtPTgmrn8zHUbxRH1p8HkMxh3iLHHmzZS+Uz940zeXefEHLFAA
	0frAtlQzRZbIOWT6Y/yukF1onHLfY+c=
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com;
	s=mimecast20190719; t=1689883618;
	h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
	 to:to:cc:cc:mime-version:mime-version:content-type:content-type:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references;
	bh=ucxp4CiclbQdAFxLcc4GXsXkfqukOJSf3s5CGFH45eo=;
	b=Vzv2yIg3wtOio0u+j2mTcZvpc/ct0bp6i9H+qTZ3fi8dccRarEHaCPq9Mv1XcuoO/rfS6a
	FkKEig0iBALQ7hQkc5Dw8ChMt6pUJvi2ACPKEQBJ6m4ggBaeJ9HD2IGrbG8swnXpg4K3zq
	dEIP6lfD1rzA+KGj13wMmWXfE7pKM9Q=
Received: from mail-qv1-f70.google.com (mail-qv1-f70.google.com
 [209.85.219.70]) by relay.mimecast.com with ESMTP with STARTTLS
 (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id
 us-mta-126-gF0-rN7BPtmTw8WUs2tqoQ-1; Thu, 20 Jul 2023 16:06:54 -0400
X-MC-Unique: gF0-rN7BPtmTw8WUs2tqoQ-1
Received: by mail-qv1-f70.google.com with SMTP id 6a1803df08f44-635de6f75e0so2946496d6.1
        for <linux-mm@kvack.org>; Thu, 20 Jul 2023 13:06:54 -0700 (PDT)
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20221208; t=1689883614; x=1690488414;
        h=in-reply-to:content-transfer-encoding:content-disposition
         :mime-version:references:message-id:subject:cc:to:from:date
         :x-gm-message-state:from:to:cc:subject:date:message-id:reply-to;
        bh=ucxp4CiclbQdAFxLcc4GXsXkfqukOJSf3s5CGFH45eo=;
        b=LooiVJgy/W3H9SwoKmeb1cXyNr5nIMwh7pnUgq5rRrbtqKrMIFDEzvULnwjJOBezh4
         KR88SjN+Czo3d7nOlmMGYpbJaPHbIcCecGgsV6tfu8tjJGQf3OPuFCB1Zp5QGk3lUEsc
         P2JTuz+6fdlZ2faCKBwiNQWgMT8MkgxEIPCViimNag5ErncLe333iynGZTjdUAmRRBCE
         4oREeJylFOQ1B3+VNUIJVxZiwSzV9YOIFdqF9azcAksOBt2YQdL3xojTaEn3t5fkAp2y
         nmpB/DQrlFLOXvHk5mCOWQB/s3huTWc0PUkBpZQd5dA01oHVbjYHpVonzEhn8UVgwHrd
         TKNQ==
X-Gm-Message-State: ABy/qLbIUMyK8Mj112iSBSp7McIPTA3uKJAt1Eaje23uvZJEmm0c+uIv
	rS2KpWu+M1QfGyU7qxWJocfcdGsFgOY8cU0sx0Dl+dPyHHqg/zvlkPHTcRliru9cihJL6dTK7ce
	tYkwA0hfPbp8=
X-Received: by 2002:a05:6214:21c4:b0:639:d239:b4fd with SMTP id d4-20020a05621421c400b00639d239b4fdmr138691qvh.1.1689883614126;
        Thu, 20 Jul 2023 13:06:54 -0700 (PDT)
X-Google-Smtp-Source: APBJJlFfvvqT4xuU20f9j7ej5nd1HPvCjQU0/3eyMCHp2vS99rh43b9nNbOoZuAeDgyFr5glw3iNzQ==
X-Received: by 2002:a05:6214:21c4:b0:639:d239:b4fd with SMTP id d4-20020a05621421c400b00639d239b4fdmr138665qvh.1.1689883613762;
        Thu, 20 Jul 2023 13:06:53 -0700 (PDT)
Received: from x1n (cpe5c7695f3aee0-cm5c7695f3aede.cpe.net.cable.rogers.com. [99.254.144.39])
        by smtp.gmail.com with ESMTPSA id k17-20020a0c9711000000b00635e68d3170sm721052qvd.31.2023.07.20.13.06.52
        (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
        Thu, 20 Jul 2023 13:06:53 -0700 (PDT)
Date: Thu, 20 Jul 2023 16:06:52 -0400
From: Peter Xu <peterx@redhat.com>
To: Axel Rasmussen <axelrasmussen@google.com>
Cc: Dimitris Siakavaras <jimsiak@cslab.ece.ntua.gr>,
	viro@zeniv.linux.org.uk, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: Using userfaultfd with KVM's async page fault handling causes
 processes to hung waiting for mmap_lock to be released
Message-ID: <ZLmT3BfcmltfFvbq@x1n>
References: <79375b71-db2e-3e66-346b-254c90d915e2@cslab.ece.ntua.gr>
 <20230719211631.890995-1-axelrasmussen@google.com>
 <CAJHvVcj+Sc41mfercqxBii5cqRBEgZxNix2R1YMi04K-5nBh8w@mail.gmail.com>
MIME-Version: 1.0
In-Reply-To: <CAJHvVcj+Sc41mfercqxBii5cqRBEgZxNix2R1YMi04K-5nBh8w@mail.gmail.com>
X-Mimecast-Spam-Score: 0
X-Mimecast-Originator: redhat.com
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
X-Rspamd-Server: rspam08
X-Rspamd-Queue-Id: DA9FC4001A
X-Stat-Signature: cfw3mw1i6p33gezhthf481gh4r386m7j
X-Rspam-User: 
X-HE-Tag: 1689883618-887989
X-HE-Meta: U2FsdGVkX18r/LDvjxgIMhsxP+THj8r4vAEWos5hGMFpMeGgcxdTTUeb1fC/gYMXgb5AcBtlf369udxpcOhGkwfDjB6bU9+JSKty+exdPLKcX5g58MDBgQ3jr/uNIcHyZJPyamOAoxYOjwOU1pPJzaQ1ceYSjirWbjWcDAvogx4ssAU1NM98mFRU5G82s2uBiUxfzad98+q1LOQcTW0aHdOiEGk6TNPQqaVIDmqTOHCPUKfQnHqaScOt8JzeT04iapLN312urgliK0iK5We7sJcOIFYAkB7zhCKGeOgPauqI+2GH3FOSSFTFrL+3GYHTTNOxjAFe9GppoTTnBG/vrUHonIyQ9q2rIQteqwMFmMJywdWhtunY6NKhkYwpOdBT45GHQZydPfFBz/LLCtcvePqJ/p1fiU2h4kjQOnwNUoVXRzYLCXEyKZXwr2TPHp80De+jdM4Hni4GwmFyEcq4bF7vWyDtYC7sD9uvLCY7sj3YSrEVaTTiAFVhVOZ2i5u18eXg0nGh70KABCs9+4DhMN2Il/h6b92No+hX5noekbejb+4/+bWuTx6vdRR6oZ258uQGAbxDXbcv2MlyJov/7SCir7+5mjJgyxZqxI/TSwMYxDkruVjkPO9Rli3ogA0mIvaU8/cN+bD3mPLJgohxi/IYhmDZbag7gPXjEzH6HxZMh/YfGZg+RKfuPSohf3zI91y9IJzhlT1c+736BeUIqDDIKCKZzT55KgkCJhrCLnnYYDoVYYV9UTNWgLMCZspfBXpMvBQntG9mNELSlumfUcsf5GAbl+nGbZiHqeD29plauFN5ncuPqnSIB/YcA83qYyrZfMsGUSg6t4FwgQSUHlEK0tfD+UG+hbX8ZkGXAHk1x0AqpmD+WAMNZkAiBLYP3D7YXVH+Xq41E/FeLGTfp+ISmV7arYOpt9DeJNCCAdkmofAm+YOOHe06p6kx59EYh/zFYQ6jGme11Cpmkks
 bAaLxDkY
 LcDnf0VdF9C9oaSJnlCqdyh431F8RmVDVuZ4eH5OQfdGqJS0GyYW8MV/+0niChT66Fkek4jbRZy2Cm52Wdu+NMylg9cNDs8YrzOb6qGD/rWax1oL7PX1cd1ICx6BNCDgs7v4RMG46KdoGSZKiVqMBpiD1eXVQsTiAzoPjywTZpBe6FT8bHCpQTRDjOrhawwc66kb1kuNXL6bBOgXiHmB1LIJp35fX5LjVcRyRa1k/Ce2odjwTJx+eJTD1FGAlbE1mntC8kPYxOeQfALRntPO3U1SxZ13PozU+5dIfHs8f7xMHjRWJ1YCpFIcPJPMqMM4yv1PwYR4BO92kx6YaNnDTaoSCCuQZOlHIvuT/aE5ZiY5JNHc6/vjHr2L7Xs2fIlmiWNjHdaFLUxyc6/fdbQpd6IvtAVvHUhIuIYtBm2Na/ogrVARPsKxw2BueBDungCvYoaJcrvZ+C+BA3pXbGFbnskPR7GLuyhsgO1M5vgM1x58yGP4ciQrgzdmevELVUYKVflBwzwsA8L9+hC4=
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>

Hello, Axel, Dimitris,

On Wed, Jul 19, 2023 at 02:54:21PM -0700, Axel Rasmussen wrote:
> On Wed, Jul 19, 2023 at 2:16 PM Axel Rasmussen <axelrasmussen@google.com> wrote:
> >
> > Thanks for the detailed report Dimitris! I've CCed the MM mailing list and some
> > folks who work on userfaultfd.
> 
> Apologies, I should have quoted the original message for the others I
> added to CC: https://lore.kernel.org/lkml/79375b71-db2e-3e66-346b-254c90d915e2@cslab.ece.ntua.gr/T/#u
> 
> >
> > I took a look at this today, but I haven't quite come up with a solution.
> >
> > I thought it might be as easy as changing userfaultfd_release() to set released
> > *after* taking the lock. But no such luck, the ordering is what it is to deal
> > with another subtle case:
> >
> >
> >         WRITE_ONCE(ctx->released, true);
> >
> >         if (!mmget_not_zero(mm))
> >                 goto wakeup;
> >
> >         /*
> >          * Flush page faults out of all CPUs. NOTE: all page faults
> >          * must be retried without returning VM_FAULT_SIGBUS if
> >          * userfaultfd_ctx_get() succeeds but vma->vma_userfault_ctx
> >          * changes while handle_userfault released the mmap_lock. So
> >          * it's critical that released is set to true (above), before
> >          * taking the mmap_lock for writing.
> >          */
> >         mmap_write_lock(mm);
> >
> > I think perhaps the right thing to do is to have handle_userfault() release
> > mmap_lock when it returns VM_FAULT_NOPAGE, and to have GUP deal with that
> > appropriately? But, some investigation is required to be sure that's okay to do
> > in the other non-GUP ways we can end up in handle_userfault().

Heh, this is also what I thought after reading. :)

If we see in the very early commit from Andrea it seems that would not hang
gup but just sigbus-ing it (see the comment that's mostly exactly the thing
Dimitris hit here):

commit 86039bd3b4e6a1129318cbfed4e0a6e001656635
Author: Andrea Arcangeli <aarcange@redhat.com>
Date:   Fri Sep 4 15:46:31 2015 -0700

    userfaultfd: add new syscall to provide memory externalization

+	/*
+	 * If it's already released don't get it. This avoids to loop
+	 * in __get_user_pages if userfaultfd_release waits on the
+	 * caller of handle_userfault to release the mmap_sem.
+	 */
+	if (unlikely(ACCESS_ONCE(ctx->released)))
+		return VM_FAULT_SIGBUS;
+

Then we switched over to the friendly way, assuming CRIU could close() the
uffd during the monitee process running, in:

commit 656710a60e3693911bee3a355d2f2bbae3faba33
Author: Andrea Arcangeli <aarcange@redhat.com>
Date:   Fri Sep 8 16:12:42 2017 -0700

    userfaultfd: non-cooperative: closing the uffd without triggering SIGBUS

I had a feeling that after that we didn't test gup (I assume normal page
fault path will still work).  Let me copy Mike too for that just in case he
has anything to say.  Paste thread again:

https://lore.kernel.org/lkml/79375b71-db2e-3e66-346b-254c90d915e2@cslab.ece.ntua.gr/T/#u

My understanding is that releasing mmap lock here should work, but we need
to move the code a bit.  Dimitris, please feel free to try the patch
attached here if you want.  It's probably not a major use case of uffd over
kvm (IIUC unregister before close() will also work?), but if it's trivial
to fix we should proably fix it.

Thanks,

===8<===
>From 7e9ef050b487220463fa77a7aa97259ffe9bb15e Mon Sep 17 00:00:00 2001
From: Peter Xu <peterx@redhat.com>
Date: Thu, 20 Jul 2023 15:33:55 -0400
Subject: [PATCH] mm/uffd: Fix release hang over GUP

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 fs/userfaultfd.c | 57 ++++++++++++++++++++++++++----------------------
 1 file changed, 31 insertions(+), 26 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index bbfaf5837a08..2358e6b00315 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -455,32 +455,6 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
 	if (!(vmf->flags & FAULT_FLAG_USER) && (ctx->flags & UFFD_USER_MODE_ONLY))
 		goto out;
 
-	/*
-	 * If it's already released don't get it. This avoids to loop
-	 * in __get_user_pages if userfaultfd_release waits on the
-	 * caller of handle_userfault to release the mmap_lock.
-	 */
-	if (unlikely(READ_ONCE(ctx->released))) {
-		/*
-		 * Don't return VM_FAULT_SIGBUS in this case, so a non
-		 * cooperative manager can close the uffd after the
-		 * last UFFDIO_COPY, without risking to trigger an
-		 * involuntary SIGBUS if the process was starting the
-		 * userfaultfd while the userfaultfd was still armed
-		 * (but after the last UFFDIO_COPY). If the uffd
-		 * wasn't already closed when the userfault reached
-		 * this point, that would normally be solved by
-		 * userfaultfd_must_wait returning 'false'.
-		 *
-		 * If we were to return VM_FAULT_SIGBUS here, the non
-		 * cooperative manager would be instead forced to
-		 * always call UFFDIO_UNREGISTER before it can safely
-		 * close the uffd.
-		 */
-		ret = VM_FAULT_NOPAGE;
-		goto out;
-	}
-
 	/*
 	 * Check that we can return VM_FAULT_RETRY.
 	 *
@@ -517,6 +491,37 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
 	if (vmf->flags & FAULT_FLAG_RETRY_NOWAIT)
 		goto out;
 
+	/*
+	 * If it's already released don't get it. This avoids to loop
+	 * in __get_user_pages if userfaultfd_release waits on the
+	 * caller of handle_userfault to release the mmap_lock.
+	 */
+	if (unlikely(READ_ONCE(ctx->released))) {
+		/*
+		 * Don't return VM_FAULT_SIGBUS in this case, so a non
+		 * cooperative manager can close the uffd after the
+		 * last UFFDIO_COPY, without risking to trigger an
+		 * involuntary SIGBUS if the process was starting the
+		 * userfaultfd while the userfaultfd was still armed
+		 * (but after the last UFFDIO_COPY). If the uffd
+		 * wasn't already closed when the userfault reached
+		 * this point, that would normally be solved by
+		 * userfaultfd_must_wait returning 'false'.
+		 *
+		 * If we were to return VM_FAULT_SIGBUS here, the non
+		 * cooperative manager would be instead forced to
+		 * always call UFFDIO_UNREGISTER before it can safely
+		 * close the uffd.
+		 *
+		 * We release the mmap lock in this special case, just in
+		 * case we're in a gup to not dead loop, so the other uffd
+		 * handler thread/process can have a chance to take the
+		 * write lock and do the unregistration.
+		 */
+		release_fault_lock(vmf);
+		goto out;
+	}
+
 	/* take the reference before dropping the mmap_lock */
 	userfaultfd_ctx_get(ctx);
 
-- 
2.41.0
===8<===

-- 
Peter Xu