From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 9033BC4828D for ; Mon, 5 Feb 2024 20:53:51 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 20DB46B0072; Mon, 5 Feb 2024 15:53:51 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 1975A6B0074; Mon, 5 Feb 2024 15:53:51 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 037FA6B0075; Mon, 5 Feb 2024 15:53:50 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0013.hostedemail.com [216.40.44.13]) by kanga.kvack.org (Postfix) with ESMTP id E01A46B0072 for ; Mon, 5 Feb 2024 15:53:50 -0500 (EST) Received: from smtpin04.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay10.hostedemail.com (Postfix) with ESMTP id 78007C084A for ; Mon, 5 Feb 2024 20:53:50 +0000 (UTC) X-FDA: 81758951820.04.70D9473 Received: from mail-ua1-f46.google.com (mail-ua1-f46.google.com [209.85.222.46]) by imf01.hostedemail.com (Postfix) with ESMTP id B784F40009 for ; Mon, 5 Feb 2024 20:53:48 +0000 (UTC) Authentication-Results: imf01.hostedemail.com; dkim=pass header.d=google.com header.s=20230601 header.b=gBAAedhN; spf=pass (imf01.hostedemail.com: domain of lokeshgidra@google.com designates 209.85.222.46 as permitted sender) smtp.mailfrom=lokeshgidra@google.com; dmarc=pass (policy=reject) header.from=google.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1707166428; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=Tl1CJdjpwoXP7Z0xr0OAs8ppfZXWEt2koixyK2TNq5g=; b=vafMnIEIP5xDEPbvdOGAct30SZmLBhb/BP5EjwUAKLfrEKvWJFWA2LYTzkea/90afjkhZR qWubADUv9psoOi/xREqPmMx0fO7ooINYDrYL9VfhUWn6YESV6luTIuBzXqklTZ5Dc72j35 bS5koJiW44Lvo+BIZv5VPFGYy4m+9f8= ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1707166428; a=rsa-sha256; cv=none; b=4hQbE690zIOIs7d7JxRbHYZWdY4E2EQdwTzzcolToH/a3DF3GCaadLP6ewPMCuvQJJUHD9 37iAnr/yVlV2dPjhYlzdmydk/VnO9G7Q0/1N0PsEJ4C9AunG3V6d9ifVTav12OSyFUHfGe w5jJPglIXUDty1qGLV0L9T5wKbMAEU4= ARC-Authentication-Results: i=1; imf01.hostedemail.com; dkim=pass header.d=google.com header.s=20230601 header.b=gBAAedhN; spf=pass (imf01.hostedemail.com: domain of lokeshgidra@google.com designates 209.85.222.46 as permitted sender) smtp.mailfrom=lokeshgidra@google.com; dmarc=pass (policy=reject) header.from=google.com Received: by mail-ua1-f46.google.com with SMTP id a1e0cc1a2514c-7d2e15193bbso2473762241.0 for ; Mon, 05 Feb 2024 12:53:48 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1707166428; x=1707771228; darn=kvack.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=Tl1CJdjpwoXP7Z0xr0OAs8ppfZXWEt2koixyK2TNq5g=; b=gBAAedhNXP/d8n3HLFFD5Ay5v7myba3LqTbrjW5ZsRdqYif/6Zz0EbaO8FohYh1yrQ B38pm8hK1haj7GicZu/WkRQFdtYHsd2RxQask5mGJgBViIbS50s3exfoYr103lnBBSb0 gin1446K4ARA75uXOW8RW0RN6aK3j9YWfBaU/HFVZ44F3DR7ORSh2w68wwEmuPZnQtUI I7yS4h1VcYy1BtO/BxWAdjvXd7m3CvA7hXdprDNbngMx2V8PjcMFJTj3CC3v3FflpRQk HaYy4wgoQYmQvq1IEEhLMZjBRC61AmNPGJixyXwCQgK1iwZ+pio/tI3voVLYkhhjsO9a fTLA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1707166428; x=1707771228; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=Tl1CJdjpwoXP7Z0xr0OAs8ppfZXWEt2koixyK2TNq5g=; b=KCWwB+ufwn3YV6uTezo00ZCfxZcPvFttQjox+lynw0g3TrhlZNLxGeQb5zNGSLbCHp LY/KXoVnE1ovG/I6PQhPq5N/J7OnUYEnrgujOp/ORoyTCWF5y/YlrwOnFKNkFTrF0LOQ JiJGUTNlN6V822dU2YSeKv5YQ3s17k9XTeYWzq2SXQX4WnPlQg+DmhkoTqtLIvzVW2Gq mpBLy532SkaLh1yWWu7cl5HTaVBDx94jfzkZvf5ffJe0Wgot4ntC5FZ/dQKd64Roz2q6 KYQ1JEBUI8w4DlwESRzzjHC+jl7g07y9+D0htetQSYsebeI7ZC1AnkREgyhpg7QotMQp ZSaQ== X-Gm-Message-State: AOJu0YyseCq9u1c53HxkSC4ukxiZ4ru7HTn47RYv1LjByGI4+aTbKeMM cFM7Qq3JqUalAs0/rxy/9r8FKIlS1c47HGmJKFN+kdUixTV5y9qkuBCuvGkOvPgvdU1LX3h1TBY 8kvY6LJeZte3X7zYuTvFA4FA45G4r/Q2EtqD1 X-Google-Smtp-Source: AGHT+IFCXUOVu6/hpvyRU/iQ4e43WNb9ZedK84zFmBD+7O9uNYQqMMYiJBNa33Ur49qMxAvc60kncHLb6fW1riBaSK8= X-Received: by 2002:a67:e984:0:b0:46d:28b4:95a2 with SMTP id b4-20020a67e984000000b0046d28b495a2mr1013963vso.32.1707166427565; Mon, 05 Feb 2024 12:53:47 -0800 (PST) MIME-Version: 1.0 References: <20240129193512.123145-1-lokeshgidra@google.com> <20240129193512.123145-3-lokeshgidra@google.com> <20240129210014.troxejbr3mzorcvx@revolver> <20240130034627.4aupq27mksswisqg@revolver> <20240130172831.hv5z7a7bhh4enoye@revolver> In-Reply-To: From: Lokesh Gidra Date: Mon, 5 Feb 2024 12:53:33 -0800 Message-ID: Subject: Re: [PATCH v2 2/3] userfaultfd: protect mmap_changing with rw_sem in userfaulfd_ctx To: Mike Rapoport Cc: "Liam R. Howlett" , akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, selinux@vger.kernel.org, surenb@google.com, kernel-team@android.com, aarcange@redhat.com, peterx@redhat.com, david@redhat.com, axelrasmussen@google.com, bgeffon@google.com, willy@infradead.org, jannh@google.com, kaleshsingh@google.com, ngeoffray@google.com, timmurray@google.com Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Stat-Signature: 4z6p5h4fok5sirp65ousncbddya9xr5e X-Rspamd-Server: rspam10 X-Rspamd-Queue-Id: B784F40009 X-Rspam-User: X-HE-Tag: 1707166428-767675 X-HE-Meta: U2FsdGVkX1+kgmWYfl0RmCL0fMbG7iPv2oIHMA3ZasuhqweYwCgJg2DoPIPQX44DKjNuezAYoXcVOQKQlBchpFWxf1bvUhelLt/MH4Io5nAXx21gUXT+8Gx0PAANOpmQKjl1vadZ9vdAOseAWqZOZ+GgrHrkrerX7tcC3yao+FeuNDK4xuG2RWAdg/wml6UZeBvpcXLbJPj1CI8EonpkWpNLDINOK0cagZSsjLc3Fv+vP8Fypug637rHOhatP5LKuBEFzstmWjQmfZS7ZkRMCev0Om73QXJRyH2qBoaqUSIoSEvdrtaKmBeTPxLEwuSgsMZPCDFj4Wz4WZ6E7RsU4XvuVF3dQ0jJ0MFp6q3wTntSNsswxlrZK1jRS34OVpC1UiivFMti97uXWFtumxfQ/5Lbb9d8gsdv15YQL6MNF9FYEyX41wHd2RVJT5/uwIz7UByLV5L1MmTq3M5PY1tFonAm4j/6cH2aNiRTpNcszMUYI6dwQrFHSVM6sgRO67RFZQPPP4KADaS209QvwntxIGGNeh+Uz3963DywPB6/CL9JHu5LnPtpfQymPzVvOovo2uwhnXqnrtaWff0sIVlO7gKv0x6jyXFcy8Bx/MnACRGQF6w0kaDNZZI0kpOHkyY7K9lE+mGXE6XAPI7rayJTYSGLpGjd8u9aM1DvjoVrYa9c13zRK3xDSn0OmRyvD0cCA76ddKzZBRzToCRT30byYXSwLF9yMhCZOAAcBrvXi34e9UmWhq/EktNwFT5QmBDW/nuKzUmOIVHYREBrhtgFXKy3EpSR25xbZC/GgLNOjBIqrJlNttpDbgGS1pMC8l+rSZQoUFcbPv/HgadbtMW8LryXA2/MoVUTRvV7EFSTeS5wzT8JQ2jxQYjX6g8a9IxYPQwT4DNRwgGCfzk50xKaEotaIloJHuMbTuo7bdmId3XnqAAIYt6nBytd6WLi8a825v2Q62cNrC6aPeLTqOg INJ4Pq5b wDri8JAjC/eU5cZS5yvyfHx1P2h5ssc99CS2f40FQrTXo2S4clB0TuD7kbvpfuxWX3EziaIBE9gmqNK7yZ9AVAHQz8qMVvCvgbVM1X6GMrlnrgEF0GlfnJbxpdivyi/xfSxE6c/wJCH0d0ervUX/Xu5xR1LpWQ2Dyf6XZjkhqXNbFArSDvQZFgdi8w/g1TrezTyG9ZxJT+R8fb1OlKIXgwo6OvxSQ62cjmlZTguzpZDlCpa5aTDD2Dm2ZBPJNiSsjiy6cJOId7hUE1fqBsMcTBn9su7r29/jLg1KUgHAgC0eaRFtWxvVmIBfbesl3ClA9o9RSOubdMGmxRyuCyrsyqc/YJYxTM+OrVHmRPu5mmGucfgxaVSm7NNDKZgVZzEemX1kr7rUz6soxU0AvpsLJU/QnYQ== X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Sun, Feb 4, 2024 at 2:27=E2=80=AFAM Mike Rapoport wrot= e: > > On Tue, Jan 30, 2024 at 06:24:24PM -0800, Lokesh Gidra wrote: > > On Tue, Jan 30, 2024 at 9:28=E2=80=AFAM Liam R. Howlett wrote: > > > > > > * Mike Rapoport [240130 03:55]: > > > > On Mon, Jan 29, 2024 at 10:46:27PM -0500, Liam R. Howlett wrote: > > > > > * Lokesh Gidra [240129 17:35]: > > > > > > > > > > > > > diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c > > > > > > > > index 58331b83d648..c00a021bcce4 100644 > > > > > > > > --- a/fs/userfaultfd.c > > > > > > > > +++ b/fs/userfaultfd.c > > > > > > > > @@ -685,12 +685,15 @@ int dup_userfaultfd(struct vm_area_st= ruct *vma, struct list_head *fcs) > > > > > > > > ctx->flags =3D octx->flags; > > > > > > > > ctx->features =3D octx->features; > > > > > > > > ctx->released =3D false; > > > > > > > > + init_rwsem(&ctx->map_changing_lock); > > > > > > > > atomic_set(&ctx->mmap_changing, 0); > > > > > > > > ctx->mm =3D vma->vm_mm; > > > > > > > > mmgrab(ctx->mm); > > > > > > > > > > > > > > > > userfaultfd_ctx_get(octx); > > > > > > > > + down_write(&octx->map_changing_lock); > > > > > > > > atomic_inc(&octx->mmap_changing); > > > > > > > > + up_write(&octx->map_changing_lock); > > > > > > > > > > On init, I don't think taking the lock is strictly necessary - un= less > > > > > there is a way to access it before this increment? Not that it w= ould > > > > > cost much. > > > > > > > > It's fork, the lock is for the context of the parent process and th= ere > > > > could be uffdio ops running in parallel on its VM. > > > > > > Is this necessary then? We are getting the octx from another mm but = the > > > mm is locked for forking. Why does it matter if there are readers of > > > the octx? > > > > > > I assume, currently, there is no way the userfaultfd ctx can > > > be altered under mmap_lock held for writing. I would think it matters= if > > > there are writers (which, I presume are blocked by the mmap_lock for > > > now?) Shouldn't we hold the write lock for the entire dup process, I > > > mean, if we remove the userfaultfd from the mmap_lock, we cannot let = the > > > structure being duplicated change half way through the dup process? > > > > > > I must be missing something with where this is headed? > > > > > AFAIU, the purpose of mmap_changing is to serialize uffdio operations > > with non-cooperative events if and when such events are being > > monitored by userspace (in case you missed, in all the cases of writes > > to mmap_changing, we only do it if that non-cooperative event has been > > requested by the user). As you pointed out there are no correctness > > concerns as far as userfaultfd operations are concerned. But these > > events are essential for the uffd monitor's functioning. > > > > For example: say the uffd monitor wants to be notified for REMAP > > operations while doing uffdio_copy operations. When COPY ioctls start > > failing with -EAGAIN and uffdio_copy.copy =3D=3D 0, then it knows it mu= st > > be due to mremap(), in which case it waits for the REMAP event > > notification before attempting COPY again. > > > > But there are few things that I didn't get after going through the > > history of non-cooperative events. Hopefully Mike (or someone else > > familiar) can clarify: > > > > IIUC, the idea behind non-cooperative events was to block uffdio > > operations from happening *before* the page tables are manipulated by > > the event (like mremap), and that the uffdio ops are resumed after the > > event notification is received by the monitor. > > The idea was to give userspace some way to serialize processing of > non-cooperative event notifications and uffdio operations running in > parallel. It's not necessary to block uffdio operations from happening > before changes to the memory map, but with the mmap_lock synchronization > that already was there adding mmap_chaning that will prevent uffdio > operations when mmap_lock is taken for write was the simplest thing to do= . > > When CRIU does post-copy restore of a process, its uffd monitor reacts to > page fault and non-cooperative notifications and also performs a backgrou= nd > copy of the memory contents from the saved state to the address space of > the process being restored. > > Since non-cooperative events may happen completely independent from the > uffd monitor, there are cases when the uffd monitor couldn't identify the > order of events, like e.g. what won the race on mmap_lock, the process > thread doing fork or the uffd monitor's uffdio_copy. > > In the fork vs uffdio_copy example, without mmap_changing, if the > uffdio_copy takes the mmap_lock first, the new page will be present in th= e > parent by the time copy_page_range() is called and the page will appear i= n > the child's memory mappings by the time uffd monitor gets notification > about the fork event. However, if the fork() is the first to take the > mmap_lock, the new page will appear in the parent address space after > copy_page_range() and it won't be mapped in the child's address space. > > With mmap_changing and current locking with mmap_lock, we have a guarante= e > that uffdio_copy will bail out if fork already took mmap_lock and the > monitor can act appropriately. > Thanks for the explanation. Really helpful! > > 1) Why in the case of REMAP prep() is done after page-tables are > > moved? Shouldn't it be done before? All other non-cooperative > > operations do the prep() before. > > mremap_userfaultfd_prep() is done after page tables are moved because it > initializes uffd context on the new_vma and if the actual remap fails, > there's no point of doing it. > Since mrpemap holds mmap_lock for write it does not matter if mmap_change= d > is updated before or after page tables are moved. In the time between > mmap_lock is released and the UFFD_EVENT_REMAP is delivered to the uffd > monitor, mmap_chaging will remain >0 and uffdio operations will bail out. > Yes this makes sense. Even with per-vma locks, I see that the new_vma is write-locked (vma_start_write()) in vma_link() guaranteeing the same. > > 2) UFFD_FEATURE_EVENT_REMOVE only notifies user space. It is not > > consistently blocking uffdio operations (as both sides are acquiring > > mmap_lock in read-mode) when remove operation is taking place. I can > > understand this was intentionally left as is in the interest of not > > acquiring mmap_lock in write-mode during madvise. But is only getting > > the notification any useful? Can we say this patch fixes it? And in > > that case shouldn't I split userfaultfd_remove() into two functions > > (like other non-cooperative operations)? > > The notifications are useful because uffd monitor knows what memory shoul= d > not be filled with uffdio_copy. Indeed there was no interest in taking > mmap_lock for write in madvise, so there could be race between madvise an= d > uffdio operations. This race essentially prevents uffd monitor from runni= ng > the background copy in a separate thread, and with your change this shoul= d > be possible. > Makes sense. Thanks! > > 3) Based on [1] I see how mmap_changing helps in eliminating duplicate > > work (background copy) by uffd monitor, but didn't get if there is a > > correctness aspect too that I'm missing? I concur with Amit's point in > > [1] that getting -EEXIST when setting up the pte will avoid memory > > corruption, no? > > In the fork case without mmap_changing the child process may be get data = or > zeroes depending on the race for mmap_lock between the fork and > uffdio_copy and -EEXIST is not enough for monitor to detect what was the > ordering between fork and uffdio_copy. This is extremely helpful. IIUC, there is a window after mmap_lock (write-mode) is released and before the uffd monitor thread is notified of fork. In that window, the monitor doesn't know that fork has already happened. So, without mmap_changing it would have done background copy only in the parent, thereby causing data inconsistency between parent and child processes. It seems to me that the correctness argument for mmap_changing is there in case of FORK event and REMAP when mremap is called with MREMAP_DONTUNMAP. In all other cases its only benefit is by avoiding unnecessary background copies, right? > > > > > > > > > @@ -783,7 +788,9 @@ bool userfaultfd_remove(struct vm_area_= struct *vma, > > > > > > > > return true; > > > > > > > > > > > > > > > > userfaultfd_ctx_get(ctx); > > > > > > > > + down_write(&ctx->map_changing_lock); > > > > > > > > atomic_inc(&ctx->mmap_changing); > > > > > > > > + up_write(&ctx->map_changing_lock); > > > > > > > > mmap_read_unlock(mm); > > > > > > > > > > > > > > > > msg_init(&ewq.msg); > > > > > > > > > > If this happens in read mode, then why are you waiting for the re= aders > > > > > to leave? Can't you just increment the atomic? It's fine happen= ing in > > > > > read mode today, so it should be fine with this new rwsem. > > > > > > > > It's been a while and the details are blurred now, but if I remembe= r > > > > correctly, having this in read mode forced non-cooperative uffd mon= itor to > > > > be single threaded. If a monitor runs, say uffdio_copy, and in para= llel a > > > > thread in the monitored process does MADV_DONTNEED, the latter will= wait > > > > for userfaultfd_remove notification to be processed in the monitor = and drop > > > > the VMA contents only afterwards. If a non-cooperative monitor woul= d > > > > process notification in parallel with uffdio ops, MADV_DONTNEED cou= ld > > > > continue and race with uffdio_copy, so read mode wouldn't be enough= . > > > > > > > > > > Right now this function won't stop to wait for readers to exit the > > > critical section, but with this change there will be a pause (since t= he > > > down_write() will need to wait for the readers with the read lock). = So > > > this is adding a delay in this call path that isn't necessary (?) nor > > > existed before. If you have non-cooperative uffd monitors, then you > > > will have to wait for them to finish to mark the uffd as being remove= d, > > > where as before it was a fire & forget, this is now a wait to tell. > > > > > I think a lot will be clearer once we get a response to my questions > > above. IMHO not only this write-lock is needed here, we need to fix > > userfaultfd_remove() by splitting it into userfaultfd_remove_prep() > > and userfaultfd_remove_complete() (like all other non-cooperative > > operations) as well. This patch enables us to do that as we remove > > mmap_changing's dependency on mmap_lock for synchronization. > > The write-lock is not a requirement here for correctness and I don't see > why we would need userfaultfd_remove_prep(). > > As I've said earlier, having a write-lock here will let CRIU to run > background copy in parallel with processing of uffd events, but I don't > feel strongly about doing it. > Got it. Anyways, such a change needn't be part of this patch, so I'm going to keep it unchanged. > > > > There was no much sense to make MADV_DONTNEED take mmap_lock in wri= te mode > > > > just for this, but now taking the rwsem in write mode here sounds > > > > reasonable. > > > > > > > > > > I see why there was no need for a mmap_lock in write mode, but I thin= k > > > taking the new rwsem in write mode is unnecessary. > > > > > > Basically, I see this as a signal to new readers to abort, but we don= 't > > > need to wait for current readers to finish before this one increments > > > the atomic. > > > > > > Unless I missed something, I don't think you want to take the write l= ock > > > here. > > What I understood from the history of mmap_changing is that the > > intention was to enable informing the uffd monitor about the correct > > state of which pages are filled and which aren't. Going through this > > thread was very helpful [2] > > > > [2] https://lore.kernel.org/lkml/1527061324-19949-1-git-send-email-rppt= @linux.vnet.ibm.com/ > > -- > Sincerely yours, > Mike.