From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id C84F5C7EE25 for ; Mon, 15 May 2023 22:04:34 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id F2A46280001; Mon, 15 May 2023 18:04:33 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id EB301900002; Mon, 15 May 2023 18:04:33 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id D2CB8280001; Mon, 15 May 2023 18:04:33 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0014.hostedemail.com [216.40.44.14]) by kanga.kvack.org (Postfix) with ESMTP id BC8AE900002 for ; Mon, 15 May 2023 18:04:33 -0400 (EDT) Received: from smtpin15.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay10.hostedemail.com (Postfix) with ESMTP id 8086EC1422 for ; Mon, 15 May 2023 22:04:33 +0000 (UTC) X-FDA: 80793869226.15.7B2366C Received: from mail-wm1-f49.google.com (mail-wm1-f49.google.com [209.85.128.49]) by imf01.hostedemail.com (Postfix) with ESMTP id 5E1FB40010 for ; Mon, 15 May 2023 22:04:31 +0000 (UTC) Authentication-Results: imf01.hostedemail.com; dkim=pass header.d=gmail.com header.s=20221208 header.b=EcIdARgD; spf=pass (imf01.hostedemail.com: domain of lstoakes@gmail.com designates 209.85.128.49 as permitted sender) smtp.mailfrom=lstoakes@gmail.com; dmarc=pass (policy=none) header.from=gmail.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1684188271; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=cSvPVoiCn57nMsLWzPvcPZUA4BS16HWMDqe9J36ZX7E=; b=sDlLpqDkM7ZbifANQSQY5ijmqRSIzP72CfXnbHuK3dKjkrue4aBwQqWFrkh6Kcest4lQU0 bPAUJk/QgaCrmiVTr9VCtVT0ggqGVjFCqGGbNanVPrISnVKsZ5iE0/SR2k2zONHxtTp5GG LrikosNZ0KUi+t0wUHvyMsd8Hvv31Ss= ARC-Authentication-Results: i=1; imf01.hostedemail.com; dkim=pass header.d=gmail.com header.s=20221208 header.b=EcIdARgD; spf=pass (imf01.hostedemail.com: domain of lstoakes@gmail.com designates 209.85.128.49 as permitted sender) smtp.mailfrom=lstoakes@gmail.com; dmarc=pass (policy=none) header.from=gmail.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1684188271; a=rsa-sha256; cv=none; b=jhDAUCczrHF4QME65CsKVORLwjlNU0ismw0qTIaozb4NKrozLzW63zId5Kc+XzLiH3s4ef GtiBQeNUtHGdrjoA4vYP59PmItmTw/9QDPbttYBbjyvpI/xz7d0wcb6zi1/dMj3/frQqmS eohqlt6ksobP0Q7JJU+gUH5V2UgI6WQ= Received: by mail-wm1-f49.google.com with SMTP id 5b1f17b1804b1-3f509ec3196so4022845e9.1 for ; Mon, 15 May 2023 15:04:31 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20221208; t=1684188270; x=1686780270; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:from:date:from:to:cc:subject:date:message-id:reply-to; bh=cSvPVoiCn57nMsLWzPvcPZUA4BS16HWMDqe9J36ZX7E=; b=EcIdARgDpRT19hPj2649Ozz+EIo53L3HBq475wpPPO0sAqX6K7urmGJ7Vddaw8loQZ tdk2xFKp/+mZ4fOqxLD3i8rsxRqrbZYPmM9yDZgy/wQ9V5igAUCqVK6pELt1HJi+kPGZ Wc/M7np7cF3qAPg0kpLDMt2Y/1fZGj/MLR8gVW0Fs+KuHxf25PWZWP02EBrKquIOt4ub q/GExDLqFFKRbORDh/Bzfh01L+br8CTlJRi+Bw4cHfHotX9OGADSGh3AEMUxzbGpAVHe s9y/6xKsRQgGLKDAdD6ywTlGt2tFALBbR36UGd61l/VDtH0+3ZepH/jYpCS8B8s+UC3I eEsA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1684188270; x=1686780270; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:from:date:x-gm-message-state:from:to:cc:subject:date :message-id:reply-to; bh=cSvPVoiCn57nMsLWzPvcPZUA4BS16HWMDqe9J36ZX7E=; b=F6QDSD50LJVVX+43Z7ZYYthcgYqor6nwRvjXxEkW5vXEp9JjV+JhxZManABhpbGJRp NIdTIhJ+61ixGGmnqR1mHOWx0PPq7YuaRboQEAbcMRWi82bnrEvT8rSauB25VJkcanmV omjlgjJKoFqF1YLj6lfj8Xqj31J1gtObEe/Z9bDf4xCmBtLKtSqF5i6vB1FLPpXM+aad E9/EQL39+1qLRqRPpT6y4glbMQmOJj1wDpjJFxZgSpdcIfrJR9XyKM5XRdoKEJDgFmNO 1vOS44I5U6fCHloR5IAMONrNKsVxq3tG1EWj3wNlA8Ep33RiYt8aWhhEOCehXy8+iEAr Gehg== X-Gm-Message-State: AC+VfDxQfI5rpycuc9ijWisRsQrnA92c49i04pGZUCqQStUNEWdhGKb0 vMmcjDro8dxOJlC3CbO7sk0= X-Google-Smtp-Source: ACHHUZ4HQVJ7rpyJAny7K16ChneTORoAUS58g1bFlXJC/ClNMdCF1AlbsQl0SIKy49NG2tlUVOKRnA== X-Received: by 2002:a5d:4085:0:b0:307:f75:f581 with SMTP id o5-20020a5d4085000000b003070f75f581mr27321983wrp.18.1684188269337; Mon, 15 May 2023 15:04:29 -0700 (PDT) Received: from localhost ([2a00:23c5:dc8c:8701:1663:9a35:5a7b:1d76]) by smtp.gmail.com with ESMTPSA id c9-20020a056000104900b00306c5900c10sm428193wrx.9.2023.05.15.15.04.27 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 15 May 2023 15:04:28 -0700 (PDT) Date: Mon, 15 May 2023 23:04:27 +0100 From: Lorenzo Stoakes To: Peter Xu Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Andrew Morton , Mike Rapoport , Mark Rutland , "Liam R . Howlett" , Alexander Viro , Christian Brauner Subject: Re: [PATCH v2] mm: userfaultfd: avoid passing an invalid range to vma_merge() Message-ID: References: <20230515193232.67552-1-lstoakes@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: X-Stat-Signature: bgwnex31keagu7w58p639szb18fsjs41 X-Rspamd-Server: rspam05 X-Rspamd-Queue-Id: 5E1FB40010 X-Rspam-User: X-HE-Tag: 1684188271-771566 X-HE-Meta: U2FsdGVkX1/MHCciuBvAKZV1D+JyU2PpUtJdEUI5XufzeG1Y87AbSliNnqsO1uX+9+97dzhX5/M/kiYlQC8NnBN66M9pjZVMy+R6FKnrEsQh44jAZ6Dh1VtyrGUkurR+CjHOQEX85d3cm97UBAJghr2/cKvcvdH6d7WddENCY3sIOrlQhJxKBuXQmGgAemlgHLHp2T5spCgls55CQOMoP+opW9JqraTWmUBnSWneKJQM4kk6FBB8YjqlhjgvuWob2U6jLJkdLMPftohUfTYFMZm5CILIs0appMXDVvKo6aCuuBYM25JZvu1n45LbNYd9ZjugG8ignptg8lFfaVoncJZHdQ3eb4XZSEWwamS28QMD/dSK2rMjQd9+2ichgjPjm3yR6d9IGv1H+7TGXPhKIPmUxuRoc5dmvlhxITYzN8oipVwy75xOAVgIjxRDxkjTZKvMgnoFZEZLDsL7VG9+QgoBx2BgiyIyULZ0AZ20D/dFCF4EZ0TRtawkIEIp4ClVL/+zO6MMeyudZIJY80gFPeZgZvJDZtBavbTSyBC+e0t+v2H/h3jh0X+JT8idbLqR3kl0oEBiL5s1EE0z6QjiQ9rIZjoP6Yonp9k8dCPq/Sr+cdX8wN0xNYMO50cI+Hdoa6gyt95naqQC2OmSIq7sRE+gXyGUWCGOUCEHru2wTz6ppRR8OmKyeAHGicJkQg/ZWnvxGNBnKafvxYjwD6aJhPQ29/V243Z4W6FJEQlABLWOtMDOVYxkUfHw/pxSzwpAKnPd0qvGDovrb1XP11FRKlfGOkSfztsrXhrtICjaAcFvVOaZDq5nWtKtujawGhfbjaG08+7xEQHBLGSlYwhS+6Yb8DNSWD9YyOH9MZaV9Fx1pkyspFd3Q9SHr8xmQOF8ZhCpiijZeeICtdNPccy21A3fAEKDIfBXVHgsdtfEaIXrqWZTFNkKraK5KaG1kAa/Dz6S1Db8UayMwWd7Sf5 Od3HGcU1 qBIIWtzDktx0jKn0QkMs9crYZnuJ59gKvxEvBtfLOvjds4dVY/+G9xkdwWylfyjn9mcKcgkTF97no2Z2BmpN2sEFVifq3rOdatZhhgGyycU0VwEQ/6EkFpO3A5PiuAWI7isBvSJzm0KekbSkDpFiKmBSv7DO+PBOLTuVmuvpTjuEyr+VQczw2bPwIddpwU945lHznZOwxw8lh7+eOd/UyRrpKJHGO0bB9loETY74Q7jxeQlfvcQPM44m6eWsQJOkpQFJMmn/rSEWomSpe4Aa1G3iGV5Uthw15dcMs7hANuiwV9bz2eORZz7VpOSM8gA9WfUGqbKdufljD/MWrg9axW7Mte9kN6dmpnTxCyziIbcujOgeVisRu3HGV/P9hTO190wF9prXTp7YfuiZIPxgypxDbLWXJyQIB8XtgDYB25qh7FHeQfKWSeC/YwlxzHKPiyjcwXzzjGa991M930/1lCjAphDImjZJDHPXlEXN3rC+P5OlIaMJcHu+EWlfsFDWA9T+DirLgIlva+b7kgvQ0hx/LNuJ86N23eshJJCOnXEBqPbo= X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On Mon, May 15, 2023 at 05:27:25PM -0400, Peter Xu wrote: > On Mon, May 15, 2023 at 08:32:32PM +0100, Lorenzo Stoakes wrote: > > The userfaultfd_[un]register() functions will knowingly pass an invalid > > address range to vma_merge(), then rely on it failing to merge to indicate > > that the VMA should be split into a valid one. > > > > This is not something that should be relied upon, > > Is there any real issue pop up to show that it's something that we cannot > rely upon before b0729ae0ae67? Because normally if something broke with a > commit I'll say "commit xxx breaks yyy" unless there's more valid reason.. :) I mean firstly this is triggering warnings in the kernel as referenced in thread [1], this is reason enough :) However, vma_merge() assumes that you won't give it an invalid input range (in this instance, because you provide start that occurs prior to the beginning of the VMA). Giving an invalid input range like this is disasterous, as it could result in a situation such as e.g.:- - addr == prev->vm_end - addr != vma->vm_start (i.e. there is a gap) - prev and vma are otherwise compatible This would then trigger the merge_prev cases and you'll end up with something very broken. I expect this is simply not possible because the input will never be like this (it's clamped to be _within_ a VMA anyway and if the prev was compatible it'd already have been merged), but that is relying on an implementation detail. > > > as vma_merge() implicitly assumes in cases 5-8 that curr->vm_start == > > addr. This is now enforced since commit b0729ae0ae67 ("mm/mmap/vma_merge: > > explicitly assign res, vma, extend invariants") with an explicit > > VM_WARN_ON() check. > > Doing vma_merge() before vma_split() makes some sense to me (and I noticed > that mostly all vma_merge() paths are doing so), because if a merge > happened, we 100% doesn't need a split (vma_merge took care of everything). > It's not the other way round, since if we split, we could still need a > merge. > You only split if start/end are not clamped to the VMA, and in cases where you'd need to split, you could not merge. So I don't think this is true. > So it fundamentally avoids unnecessary calls to split, it seems, if we > always call merge before splits. No, because for the case of splitting the beginning of the VMA, you would have given invalid input to vma_merge() and returned NULL, then had to split anyway. For the case of splitting the end of the VMA, that part would have to be split in any case, this just changes the ordering. So I disagree that this adds work, it does the same thing only this time using vma_merge() correctly. (update after seeing below bit) - vma_merge() is not meant to do 'partial' merges over a non-prev VMA, and relying on it to do so is broken. > > > > > Since commit 29417d292bd0 ("mm/mmap/vma_merge: always check invariants") > > this check is performed unconditionally, which caused this assert to arise > > in tests performed by Mark [1]. > > > > This patch fixes the issue by performing the split operations before > > attempting to merge VMAs in both instances. The problematic operation is > > splitting the start of the VMA since we were clamping to the end of the VMA > > in any case, however it is useful to group both of the split operations > > together to avoid egregious goto's and to abstract the code between the > > functions. > > > > As well as fixing the repro described in [1] this also continues to pass > > uffd unit tests. > > > > [1]:https://lore.kernel.org/all/ZFunF7DmMdK05MoF@FVFF77S0Q05N.cambridge.arm.com > > > > Reported-by: Mark Rutland > > Closes: https://lore.kernel.org/all/ZFunF7DmMdK05MoF@FVFF77S0Q05N.cambridge.arm.com/ > > Signed-off-by: Lorenzo Stoakes > > --- > > fs/userfaultfd.c | 108 ++++++++++++++++++++++++++--------------------- > > 1 file changed, 60 insertions(+), 48 deletions(-) > > > > diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c > > index 0fd96d6e39ce..ef5d667ea804 100644 > > --- a/fs/userfaultfd.c > > +++ b/fs/userfaultfd.c > > @@ -1319,6 +1319,32 @@ static __always_inline int validate_range(struct mm_struct *mm, > > return 0; > > } > > > > +static int clamp_range(struct vma_iterator *vmi, struct vm_area_struct *vma, > > + unsigned long start, unsigned long end, bool *can_merge) > > +{ > > + int ret; > > + bool merge = true; > > + > > + /* The range must always be clamped to the start of a VMA. */ > > + if (vma->vm_start < start) { > > + ret = split_vma(vmi, vma, start, 1); > > + if (ret) > > + return ret; > > + > > + merge = false; > > Could you explain a bit why we don't need to merge in this case? > > I'm considering, for example, when we have: > > vma1(range 0-9, with uffd), vma2(range 10-19, no uffd) > > Then someone unregisters uffd on range (5-9), iiuc it should become: > > vma1(range 0-4, with uffd), vma2(range 5-19, no uffd) > > But if no merge here it's: > > vma1(range 0-4, with uffd), vma3(range 5-9, no uffd), vma2(range 10-19, no uffd) > > Maybe I missed something? > There's something really, really wrong with this. It simply isn't valid to invoke vma_merge() over an existing VMA that != prev where you're not specifying addr = vma->vm_start, end == vma->vm_end. This seems like you're relying on:- *** CCCCCNNNNN -> CCNNNNNNNN By specifying parameters that are compatible with N even though you're only partially spanning C? This is crazy, and isn't how this should be used. vma_merge() is not supposed to do partial merges. If it works (presumably it does) this is not by design unless I've lost my mind and I (and others) have somehow not noticed this?? I think you're right that now we'll end up with more fragmentation, but what you're suggesting is not how vma_merge() is supposed to work. As I said above, giving vma_merge() invalid parameters is very dangerous as you could end up merging over empty ranges in theory (and could otherwise have corruption). I guess we should probably be passing 0 to the last parameter in split_vma() here then to ensure we do a merge pass too. Will experiment with this. I'm confused as to how the remove from case 8 is not proceeding. I'll look into this some more... Happy to be corrected if I'm misconstruing this! > > + } > > + > > + /* It must also be clamped to the end of a VMA. */ > > + if (vma->vm_end > end) { > > + ret = split_vma(vmi, vma, end, 0); > > + if (ret) > > + return ret; > > + } > > + > > + *can_merge = merge; > > + return 0; > > +} > > + > > static int userfaultfd_register(struct userfaultfd_ctx *ctx, > > unsigned long arg) > > { > > @@ -1330,7 +1356,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx, > > unsigned long vm_flags, new_flags; > > bool found; > > bool basic_ioctls; > > - unsigned long start, end, vma_end; > > + unsigned long start, end; > > struct vma_iterator vmi; > > > > user_uffdio_register = (struct uffdio_register __user *) arg; > > @@ -1462,6 +1488,8 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx, > > > > ret = 0; > > for_each_vma_range(vmi, vma, end) { > > + bool can_merge; > > + > > cond_resched(); > > > > BUG_ON(!vma_can_userfault(vma, vm_flags)); > > @@ -1477,32 +1505,22 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx, > > (vma->vm_flags & vm_flags) == vm_flags) > > goto skip; > > > > - if (vma->vm_start > start) > > - start = vma->vm_start; > > - vma_end = min(end, vma->vm_end); > > + ret = clamp_range(&vmi, vma, start, end, &can_merge); > > + if (ret) > > + break; > > > > new_flags = (vma->vm_flags & ~__VM_UFFD_FLAGS) | vm_flags; > > - prev = vma_merge(&vmi, mm, prev, start, vma_end, new_flags, > > - vma->anon_vma, vma->vm_file, vma->vm_pgoff, > > - vma_policy(vma), > > - ((struct vm_userfaultfd_ctx){ ctx }), > > - anon_vma_name(vma)); > > - if (prev) { > > + if (can_merge) { > > + prev = vma_merge(&vmi, mm, prev, vma->vm_start, vma->vm_end, new_flags, > > + vma->anon_vma, vma->vm_file, vma->vm_pgoff, > > + vma_policy(vma), > > + ((struct vm_userfaultfd_ctx){ ctx }), > > + anon_vma_name(vma)); > > + > > /* vma_merge() invalidated the mas */ > > - vma = prev; > > - goto next; > > - } > > - if (vma->vm_start < start) { > > - ret = split_vma(&vmi, vma, start, 1); > > - if (ret) > > - break; > > - } > > - if (vma->vm_end > end) { > > - ret = split_vma(&vmi, vma, end, 0); > > - if (ret) > > - break; > > + if (prev) > > + vma = prev; > > } > > - next: > > /* > > * In the vma_merge() successful mprotect-like case 8: > > * the next vma was merged into the current one and > > @@ -1560,7 +1578,7 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx, > > struct uffdio_range uffdio_unregister; > > unsigned long new_flags; > > bool found; > > - unsigned long start, end, vma_end; > > + unsigned long start, end; > > const void __user *buf = (void __user *)arg; > > struct vma_iterator vmi; > > > > @@ -1627,6 +1645,8 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx, > > prev = vma_prev(&vmi); > > ret = 0; > > for_each_vma_range(vmi, vma, end) { > > + bool can_merge; > > + > > cond_resched(); > > > > BUG_ON(!vma_can_userfault(vma, vma->vm_flags)); > > @@ -1640,9 +1660,9 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx, > > > > WARN_ON(!(vma->vm_flags & VM_MAYWRITE)); > > > > - if (vma->vm_start > start) > > - start = vma->vm_start; > > - vma_end = min(end, vma->vm_end); > > + ret = clamp_range(&vmi, vma, start, end, &can_merge); > > + if (ret) > > + break; > > > > if (userfaultfd_missing(vma)) { > > /* > > @@ -1652,35 +1672,27 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx, > > * UFFDIO_WAKE explicitly. > > */ > > struct userfaultfd_wake_range range; > > - range.start = start; > > - range.len = vma_end - start; > > + range.start = vma->vm_start; > > + range.len = vma->vm_end - vma->vm_start; > > wake_userfault(vma->vm_userfaultfd_ctx.ctx, &range); > > } > > > > /* Reset ptes for the whole vma range if wr-protected */ > > if (userfaultfd_wp(vma)) > > - uffd_wp_range(vma, start, vma_end - start, false); > > + uffd_wp_range(vma, vma->vm_start, > > + vma->vm_end - vma->vm_start, false); > > > > new_flags = vma->vm_flags & ~__VM_UFFD_FLAGS; > > - prev = vma_merge(&vmi, mm, prev, start, vma_end, new_flags, > > - vma->anon_vma, vma->vm_file, vma->vm_pgoff, > > - vma_policy(vma), > > - NULL_VM_UFFD_CTX, anon_vma_name(vma)); > > - if (prev) { > > - vma = prev; > > - goto next; > > - } > > - if (vma->vm_start < start) { > > - ret = split_vma(&vmi, vma, start, 1); > > - if (ret) > > - break; > > - } > > - if (vma->vm_end > end) { > > - ret = split_vma(&vmi, vma, end, 0); > > - if (ret) > > - break; > > + if (can_merge) { > > + prev = vma_merge(&vmi, mm, prev, vma->vm_start, > > + vma->vm_end, new_flags, vma->anon_vma, > > + vma->vm_file, vma->vm_pgoff, > > + vma_policy(vma), > > + NULL_VM_UFFD_CTX, anon_vma_name(vma)); > > + /* vma_merge() invalidated the mas */ > > + if (prev) > > + vma = prev; > > } > > - next: > > /* > > * In the vma_merge() successful mprotect-like case 8: > > * the next vma was merged into the current one and > > -- > > 2.40.1 > > > > -- > Peter Xu >