From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 0059DC46CD2 for ; Wed, 31 Jan 2024 02:24:42 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 83B6F6B007D; Tue, 30 Jan 2024 21:24:42 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 7EB7C6B0081; Tue, 30 Jan 2024 21:24:42 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 68BEF6B0082; Tue, 30 Jan 2024 21:24:42 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0011.hostedemail.com [216.40.44.11]) by kanga.kvack.org (Postfix) with ESMTP id 557C56B007D for ; Tue, 30 Jan 2024 21:24:42 -0500 (EST) Received: from smtpin26.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay06.hostedemail.com (Postfix) with ESMTP id 0DF5DA211A for ; Wed, 31 Jan 2024 02:24:42 +0000 (UTC) X-FDA: 81738012804.26.8889BF0 Received: from mail-wm1-f46.google.com (mail-wm1-f46.google.com [209.85.128.46]) by imf28.hostedemail.com (Postfix) with ESMTP id 4FC8BC0013 for ; Wed, 31 Jan 2024 02:24:39 +0000 (UTC) Authentication-Results: imf28.hostedemail.com; dkim=pass header.d=google.com header.s=20230601 header.b=nvKTyJCy; dmarc=pass (policy=reject) header.from=google.com; spf=pass (imf28.hostedemail.com: domain of lokeshgidra@google.com designates 209.85.128.46 as permitted sender) smtp.mailfrom=lokeshgidra@google.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1706667879; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=LNDBDr2HIUePrwDqJX7AXKQLPV8jSOXnSTo+JHeA8P8=; b=6lXigydr+6r43tq6AomVLP0eKpKavYaKDI4LOWEhdh2WLj9O/jMGVGP2+ZQ8u+fw02tc5M H6f0vbS2Mnab/hrApICReEDBjOLtp+M5d2NDWeGoijLHFZvjXEm1AEKeEkZZ5AjbwOPHJ0 +ldBltUDf+gn1BjM3MhCl4IkqQJRtO0= ARC-Authentication-Results: i=1; imf28.hostedemail.com; dkim=pass header.d=google.com header.s=20230601 header.b=nvKTyJCy; dmarc=pass (policy=reject) header.from=google.com; spf=pass (imf28.hostedemail.com: domain of lokeshgidra@google.com designates 209.85.128.46 as permitted sender) smtp.mailfrom=lokeshgidra@google.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1706667879; a=rsa-sha256; cv=none; b=pbVyciqPlQhITRSDht6OJ8AgWLuKdR7Bfxk3mJ1bV7l6s8SD8sBqc5TKog18C+8qbICC1g Ml1LQ+qZ0yA9E9Ii5yulrpuU+mga8wkPVrU/gxikUrnbTXnu80cBc9v7izuIaKO+GZjhfQ zDI9nJTGdeBwtptvH61C97Iro+QAIHA= Received: by mail-wm1-f46.google.com with SMTP id 5b1f17b1804b1-40e8fec0968so52409805e9.1 for ; Tue, 30 Jan 2024 18:24:38 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1706667878; x=1707272678; darn=kvack.org; h=content-transfer-encoding:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=LNDBDr2HIUePrwDqJX7AXKQLPV8jSOXnSTo+JHeA8P8=; b=nvKTyJCy7DT+jYaPIVMQepwSkNkrGll3H321TUJueQjFJc+ZJ42EU/PiqrAcSUYCV6 CPgmirn7c5WHYJ4q7JACG5RdN/GX7twWaFGENSPvE7+JXgYL/ISNC2U0Ud9KTjVeTMV7 14o9+BlRxkG/QWItUSBNZGAQYibjabHSUjoXcY+iiL5kMfWRY8ejASU40lP9MehxOkVL D4w51I18cg7aUJzSYnkpq8n80pQ1uBE2N/vtlRPv+jXBb0GQJZSOZEJvNVRwJVZVLOtk NVRYp4rQVh0br5lZgmR5b/evKWUogUhG+QvQZoTlNxpnt5Wqdgb0hkDVvhb7KvHa4blp sM7Q== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1706667878; x=1707272678; h=content-transfer-encoding:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=LNDBDr2HIUePrwDqJX7AXKQLPV8jSOXnSTo+JHeA8P8=; b=D5jfwAsb4EXcZqNGNl8FowA7fB/TNXBiN1lRopnwpJlaT9qwz6+VIwifSt5emmZ7xm 9LGZaoJdAUeLt6ucoT81ITfF6qxSqROPm625ZqEhdlE21lIF8YzaGy/Syrtcaskwe220 eLUfPeknxm8Hcqau8BItph4cdql2T8CAtTjl5EQHWSmkfD1e92+oCnXF1x5dwtz0KwaN K4kC6h9rRe/MpGWPO/PFmRtxYx5HZRr/ZWjrCBJpPa7JPxu2Sheb7EDShhoE0hQn6Jva uaVf4e7blqpI2rzyPQE0AeplIOKGeCL/NGuOR7COzq+6vjz1OqGkH7aAYC/pbubAU6GF fwjQ== X-Gm-Message-State: AOJu0Yyz/kbr0hEb6zRXhqCGIWBBJTd+JcGx2wMQ9nNU1idpbsD5Nb5Y d9yjqGERjYDHd6ZGAQSmus00WhsBHCF+tD0OETcwhrqR1449430HW9qJXT1GJRarOOIy8Jg0G1V 780Vz9DFp/3NPiljKMw5SPBiG6ovAGWBG4Uji X-Google-Smtp-Source: AGHT+IHmasuDEqZccuPfAUzwk2QyR9mhEieYSp+I9wbmzSc2V2fKLkiiZu/JCqxutAC4P8kMWrX8V3bPUn7hqIZEZHY= X-Received: by 2002:a05:6000:1753:b0:33a:f4e4:107f with SMTP id m19-20020a056000175300b0033af4e4107fmr163045wrf.38.1706667877436; Tue, 30 Jan 2024 18:24:37 -0800 (PST) MIME-Version: 1.0 References: <20240129193512.123145-1-lokeshgidra@google.com> <20240129193512.123145-3-lokeshgidra@google.com> <20240129210014.troxejbr3mzorcvx@revolver> <20240130034627.4aupq27mksswisqg@revolver> <20240130172831.hv5z7a7bhh4enoye@revolver> In-Reply-To: <20240130172831.hv5z7a7bhh4enoye@revolver> From: Lokesh Gidra Date: Tue, 30 Jan 2024 18:24:24 -0800 Message-ID: Subject: Re: [PATCH v2 2/3] userfaultfd: protect mmap_changing with rw_sem in userfaulfd_ctx To: "Liam R. Howlett" , Mike Rapoport , Lokesh Gidra , akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, selinux@vger.kernel.org, surenb@google.com, kernel-team@android.com, aarcange@redhat.com, peterx@redhat.com, david@redhat.com, axelrasmussen@google.com, bgeffon@google.com, willy@infradead.org, jannh@google.com, kaleshsingh@google.com, ngeoffray@google.com, timmurray@google.com Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Rspam-User: X-Stat-Signature: oc5md95czfhk15amq79fxomfjhohyiki X-Rspamd-Server: rspam07 X-Rspamd-Queue-Id: 4FC8BC0013 X-HE-Tag: 1706667879-317629 X-HE-Meta: U2FsdGVkX18S7ZXzm0nq0zf/gEtq1ptEhe4a9O6lgu6Eeg2SAhxdsDQPubFNJKxooCGvGvxjI7d6A8IbEJrFzBt/EZarG93rBVQ2VUa1I2J/eXgnIzQspVOX1FteGqS6fDTki6ERcFa9/3cm+tVR10ozozDYawaExxAwzuot25zxkhcXmAgtQX/+OF/5k2cqb7TL7Rxozavuv1fJI49CKzKuByA4oP8fjFvAV3tw3jT9k+yD8sYx9GgMWLCZng00ppODq/qCzGvqUhd90FgsqkOiY1q0VI4Tch8PySvBzXUg1BcWDpHCcGTPYdifKGvIITItIa4nf/7L51N0ddE5IyK1vEOW5dcPUm1MVVFnPh9C9A3aV76I3D6YpSBUVPCY8WlwgeZ5asQsYe2V2FDS8eA93MF4FOMRQM/y6lyJAmUNb6UPHbBRSJ6QF+8YQpJicPHjFLe7DOKJloyxMUx15jcBuifVVxSrsFPMhYZjKYU40EB7GaelsAKKyHWAEydJQNdSsyJLy0nNF/YJJw8hA3QZRCILfwpeeIMZfLkZuoVP2+ZfdF9jIMExVCwIj1t8L5WWyXECAbuYn8xIFBjvbX7Aj0Ld7l5Fz82X7xcfLcUV/quqJr7oZnRSImj9/Y9MidwvvJkjEPHw+pPC0J15C4neTR+cZC1gbbS9wuCUWLEtkXibkbKvNDvkrbKBzICxkMGbV9VzRQev4omAVW5nwCogUcWDSdDexWK8PpwLB88El/wlmsjEojDBdq4RNo/NjrY/TuL0S1iNXN5Cuo6CDwsE1gJzpZgDo1Ebg51PZvIFfDQAA+XAAEOlm7niL4KUqgeUjTiW995DTMS6qTDjhKFiAU5yIhc4NSxyZprYPqv1y8WjLq5YfFdrQfDFOIumzP9njhYXafRE8xjVdko3FrG9WHFOKSOxDwXqOGyswI+oekL3QDllx4fVVl785DkwU+MSiPj76D/yvChUu3P 1nHDsmEs 7pv+wsHyTa2JAThQYwlB3BvTnUJWDVErYveyQgysvgvSgZsHT1EbQeAo1jUBgeReCnywmYgqiKSKrT639iERVLkv1/fjT2yoHbPhh9t6mF37JtOSAlYejkKN7iQEB1uYoKZxIgH2yxAJ0nZC/VWzzf5X09M92R/UIzBfQn6jzowkdHWLIhMXPxsJvVczK1imDjOdVlKFYUENWmXFjbdr3YXoOje1J3bdzr+Hg0BmCUpYNs+som03xsM+d1OYYC68uM+UAR2Napb4x4Y2BL18QYJoGisAJDuc+wSK3wH04vz6Xmw8NsCJcbBW6BjO5Sjh/JvuwWmRpuBwmNBHTZ2iv5wIVHN1bEER8fzjhPRRFactt1M4ihF+sHxJfCmlnEEmvUw6fKCn6Vn/hNu/TKLuiwj8JEg== X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Tue, Jan 30, 2024 at 9:28=E2=80=AFAM Liam R. Howlett wrote: > > * Mike Rapoport [240130 03:55]: > > On Mon, Jan 29, 2024 at 10:46:27PM -0500, Liam R. Howlett wrote: > > > * Lokesh Gidra [240129 17:35]: > > > > On Mon, Jan 29, 2024 at 1:00=E2=80=AFPM Liam R. Howlett wrote: > > > > > > > > > > * Lokesh Gidra [240129 14:35]: > > > > > > Increments and loads to mmap_changing are always in mmap_lock > > > > > > critical section. > > > > > > > > > > Read or write? > > > > > > > > > It's write-mode when incrementing (except in case of > > > > userfaultfd_remove() where it's done in read-mode) and loads are in > > > > mmap_lock (read-mode). I'll clarify this in the next version. > > > > > > > > > > > This ensures that if userspace requests event > > > > > > notification for non-cooperative operations (e.g. mremap), user= faultfd > > > > > > operations don't occur concurrently. > > > > > > > > > > > > This can be achieved by using a separate read-write semaphore i= n > > > > > > userfaultfd_ctx such that increments are done in write-mode and= loads > > > > > > in read-mode, thereby eliminating the dependency on mmap_lock f= or this > > > > > > purpose. > > > > > > > > > > > > This is a preparatory step before we replace mmap_lock usage wi= th > > > > > > per-vma locks in fill/move ioctls. > > > > > > > > > > > > Signed-off-by: Lokesh Gidra > > > > > > --- > > > > > > fs/userfaultfd.c | 40 ++++++++++++---------- > > > > > > include/linux/userfaultfd_k.h | 31 ++++++++++-------- > > > > > > mm/userfaultfd.c | 62 ++++++++++++++++++++-------= -------- > > > > > > 3 files changed, 75 insertions(+), 58 deletions(-) > > > > > > > > > > > > diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c > > > > > > index 58331b83d648..c00a021bcce4 100644 > > > > > > --- a/fs/userfaultfd.c > > > > > > +++ b/fs/userfaultfd.c > > > > > > @@ -685,12 +685,15 @@ int dup_userfaultfd(struct vm_area_struct= *vma, struct list_head *fcs) > > > > > > ctx->flags =3D octx->flags; > > > > > > ctx->features =3D octx->features; > > > > > > ctx->released =3D false; > > > > > > + init_rwsem(&ctx->map_changing_lock); > > > > > > atomic_set(&ctx->mmap_changing, 0); > > > > > > ctx->mm =3D vma->vm_mm; > > > > > > mmgrab(ctx->mm); > > > > > > > > > > > > userfaultfd_ctx_get(octx); > > > > > > + down_write(&octx->map_changing_lock); > > > > > > atomic_inc(&octx->mmap_changing); > > > > > > + up_write(&octx->map_changing_lock); > > > > > > On init, I don't think taking the lock is strictly necessary - unless > > > there is a way to access it before this increment? Not that it would > > > cost much. > > > > It's fork, the lock is for the context of the parent process and there > > could be uffdio ops running in parallel on its VM. > > Is this necessary then? We are getting the octx from another mm but the > mm is locked for forking. Why does it matter if there are readers of > the octx? > > I assume, currently, there is no way the userfaultfd ctx can > be altered under mmap_lock held for writing. I would think it matters if > there are writers (which, I presume are blocked by the mmap_lock for > now?) Shouldn't we hold the write lock for the entire dup process, I > mean, if we remove the userfaultfd from the mmap_lock, we cannot let the > structure being duplicated change half way through the dup process? > > I must be missing something with where this is headed? > AFAIU, the purpose of mmap_changing is to serialize uffdio operations with non-cooperative events if and when such events are being monitored by userspace (in case you missed, in all the cases of writes to mmap_changing, we only do it if that non-cooperative event has been requested by the user). As you pointed out there are no correctness concerns as far as userfaultfd operations are concerned. But these events are essential for the uffd monitor's functioning. For example: say the uffd monitor wants to be notified for REMAP operations while doing uffdio_copy operations. When COPY ioctls start failing with -EAGAIN and uffdio_copy.copy =3D=3D 0, then it knows it must be due to mremap(), in which case it waits for the REMAP event notification before attempting COPY again. But there are few things that I didn't get after going through the history of non-cooperative events. Hopefully Mike (or someone else familiar) can clarify: IIUC, the idea behind non-cooperative events was to block uffdio operations from happening *before* the page tables are manipulated by the event (like mremap), and that the uffdio ops are resumed after the event notification is received by the monitor. If so then: 1) Why in the case of REMAP prep() is done after page-tables are moved? Shouldn't it be done before? All other non-cooperative operations do the prep() before. 2) UFFD_FEATURE_EVENT_REMOVE only notifies user space. It is not consistently blocking uffdio operations (as both sides are acquiring mmap_lock in read-mode) when remove operation is taking place. I can understand this was intentionally left as is in the interest of not acquiring mmap_lock in write-mode during madvise. But is only getting the notification any useful? Can we say this patch fixes it? And in that case shouldn't I split userfaultfd_remove() into two functions (like other non-cooperative operations)? 3) Based on [1] I see how mmap_changing helps in eliminating duplicate work (background copy) by uffd monitor, but didn't get if there is a correctness aspect too that I'm missing? I concur with Amit's point in [1] that getting -EEXIST when setting up the pte will avoid memory corruption, no? [1] https://lore.kernel.org/lkml/20201206093703.GY123287@linux.ibm.com/ > > > > > > > You could use the first bit of the atomic_inc as indication of a = write. > > > > > So if the mmap_changing is even, then there are no writers. If i= t > > > > > didn't change and it's even then you know no modification has hap= pened > > > > > (or it overflowed and hit the same number which would be rare, bu= t > > > > > maybe okay?). > > > > > > > > This is already achievable, right? If mmap_changing is >0 then we k= now > > > > there are writers. The problem is that we want writers (like mremap > > > > operations) to block as long as there is a userfaultfd operation (a= lso > > > > reader of mmap_changing) going on. Please note that I'm inferring t= his > > > > from current implementation. > > > > > > > > AFAIU, mmap_changing isn't required for correctness, because all > > > > operations are happening under the right mode of mmap_lock. It's us= ed > > > > to ensure that while a non-cooperative operations is happening, if = the > > > > user has asked it to be notified, then no other userfaultfd operati= ons > > > > should take place until the user gets the event notification. > > > > > > I think it is needed, mmap_changing is read before the mmap_lock is > > > taken, then compared after the mmap_lock is taken (both read mode) to > > > ensure nothing has changed. > > > > mmap_changing is required to ensure that no uffdio operation runs in > > parallel with operations that modify the memory map, like fork, mremap, > > munmap and some of madvise calls. > > And we do need the writers to block if there is an uffdio operation goi= ng > > on, so I think an rwsem is the right way to protect mmap_chaniging. > > > > > > > > @@ -783,7 +788,9 @@ bool userfaultfd_remove(struct vm_area_stru= ct *vma, > > > > > > return true; > > > > > > > > > > > > userfaultfd_ctx_get(ctx); > > > > > > + down_write(&ctx->map_changing_lock); > > > > > > atomic_inc(&ctx->mmap_changing); > > > > > > + up_write(&ctx->map_changing_lock); > > > > > > mmap_read_unlock(mm); > > > > > > > > > > > > msg_init(&ewq.msg); > > > > > > If this happens in read mode, then why are you waiting for the reader= s > > > to leave? Can't you just increment the atomic? It's fine happening = in > > > read mode today, so it should be fine with this new rwsem. > > > > It's been a while and the details are blurred now, but if I remember > > correctly, having this in read mode forced non-cooperative uffd monitor= to > > be single threaded. If a monitor runs, say uffdio_copy, and in parallel= a > > thread in the monitored process does MADV_DONTNEED, the latter will wai= t > > for userfaultfd_remove notification to be processed in the monitor and = drop > > the VMA contents only afterwards. If a non-cooperative monitor would > > process notification in parallel with uffdio ops, MADV_DONTNEED could > > continue and race with uffdio_copy, so read mode wouldn't be enough. > > > > Right now this function won't stop to wait for readers to exit the > critical section, but with this change there will be a pause (since the > down_write() will need to wait for the readers with the read lock). So > this is adding a delay in this call path that isn't necessary (?) nor > existed before. If you have non-cooperative uffd monitors, then you > will have to wait for them to finish to mark the uffd as being removed, > where as before it was a fire & forget, this is now a wait to tell. > I think a lot will be clearer once we get a response to my questions above. IMHO not only this write-lock is needed here, we need to fix userfaultfd_remove() by splitting it into userfaultfd_remove_prep() and userfaultfd_remove_complete() (like all other non-cooperative operations) as well. This patch enables us to do that as we remove mmap_changing's dependency on mmap_lock for synchronization. > > > There was no much sense to make MADV_DONTNEED take mmap_lock in write m= ode > > just for this, but now taking the rwsem in write mode here sounds > > reasonable. > > > > I see why there was no need for a mmap_lock in write mode, but I think > taking the new rwsem in write mode is unnecessary. > > Basically, I see this as a signal to new readers to abort, but we don't > need to wait for current readers to finish before this one increments > the atomic. > > Unless I missed something, I don't think you want to take the write lock > here. What I understood from the history of mmap_changing is that the intention was to enable informing the uffd monitor about the correct state of which pages are filled and which aren't. Going through this thread was very helpful [2] [2] https://lore.kernel.org/lkml/1527061324-19949-1-git-send-email-rppt@lin= ux.vnet.ibm.com/ > > Thanks, > Liam