From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 5AE41C77B7F for ; Fri, 28 Apr 2023 11:51:53 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 9D0546B0071; Fri, 28 Apr 2023 07:51:52 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 9806F6B0072; Fri, 28 Apr 2023 07:51:52 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 7FA596B0074; Fri, 28 Apr 2023 07:51:52 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0013.hostedemail.com [216.40.44.13]) by kanga.kvack.org (Postfix) with ESMTP id 6E2FE6B0071 for ; Fri, 28 Apr 2023 07:51:52 -0400 (EDT) Received: from smtpin27.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay02.hostedemail.com (Postfix) with ESMTP id 2F8121203EC for ; Fri, 28 Apr 2023 11:51:52 +0000 (UTC) X-FDA: 80730635664.27.C5F8C47 Received: from smtp-out2.suse.de (smtp-out2.suse.de [195.135.220.29]) by imf11.hostedemail.com (Postfix) with ESMTP id 009A740026 for ; Fri, 28 Apr 2023 11:51:48 +0000 (UTC) Authentication-Results: imf11.hostedemail.com; dkim=pass header.d=suse.cz header.s=susede2_rsa header.b=Sa3E6gUq; dkim=pass header.d=suse.cz header.s=susede2_ed25519 header.b=yjYNDzkS; spf=pass (imf11.hostedemail.com: domain of jack@suse.cz designates 195.135.220.29 as permitted sender) smtp.mailfrom=jack@suse.cz; dmarc=none ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1682682709; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=nFDPhGIisW69QBasl94gV2/Ayumrmm28urODFved4oI=; b=ylV2RRfrb0msY/q30T+CZUZlCH3HX2UgFOL9/C1cOOMrJ+7NphHJdYqL/8Xb3ZL2RNui7p DTwPOXCmfb6NhA+J6ttOE9FAMedkRebZ80F6newp8v0W+1JmmVoAsBUZ5IO48yKM81PYF9 /4BGx+PmqhjLLBtq6MFPLSm4h5mOnrg= ARC-Authentication-Results: i=1; imf11.hostedemail.com; dkim=pass header.d=suse.cz header.s=susede2_rsa header.b=Sa3E6gUq; dkim=pass header.d=suse.cz header.s=susede2_ed25519 header.b=yjYNDzkS; spf=pass (imf11.hostedemail.com: domain of jack@suse.cz designates 195.135.220.29 as permitted sender) smtp.mailfrom=jack@suse.cz; dmarc=none ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1682682709; a=rsa-sha256; cv=none; b=ePJtCiViZdHDoyZR/Keohkth3Hch6lNiBRN0HHF/Z9YW/TIRN8VOxEmyVZV2OYhx5PcIpw uJ0JwVlVecie8/q40DxIXscILTWRqnB1VQ5tmR/xf96xnb64i5Vh3kSV+NzEt3Q9AjsQyl eHHa4yBq+3LV+327lFatwAAm4Y8nE4o= Received: from imap2.suse-dmz.suse.de (imap2.suse-dmz.suse.de [192.168.254.74]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA (P-521) server-digest SHA512) (No client certificate requested) by smtp-out2.suse.de (Postfix) with ESMTPS id 755321FFAB; Fri, 28 Apr 2023 11:51:47 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=suse.cz; s=susede2_rsa; t=1682682707; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc: mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=nFDPhGIisW69QBasl94gV2/Ayumrmm28urODFved4oI=; b=Sa3E6gUq+4z9R90iGteHnUZCCx2Q23sMJdLZSv5gqmEvMESyCG6TN3uAra+BjaGXV0tW3e Yt6MnRb6mNOludkM7IDHqOQM2MWvKarV93AZQogc7QUomrCgNIO4NbEIJxkb269/5mfK6e 6bZok1BCCOSHrKcLcP6Uu3kHmkdeTeU= DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=suse.cz; s=susede2_ed25519; t=1682682707; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc: mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=nFDPhGIisW69QBasl94gV2/Ayumrmm28urODFved4oI=; b=yjYNDzkSl5dR+sByN8H5V+6lisW5bpGbbSHnle5h72aNyO3PHHJhFGvWHNdGgPKP9yDNP4 3yPSHZZ/y8zUYmBA== Received: from imap2.suse-dmz.suse.de (imap2.suse-dmz.suse.de [192.168.254.74]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA (P-521) server-digest SHA512) (No client certificate requested) by imap2.suse-dmz.suse.de (Postfix) with ESMTPS id 550811390E; Fri, 28 Apr 2023 11:51:47 +0000 (UTC) Received: from dovecot-director2.suse.de ([192.168.254.65]) by imap2.suse-dmz.suse.de with ESMTPSA id U7ddFFOzS2RxZgAAMHmgww (envelope-from ); Fri, 28 Apr 2023 11:51:47 +0000 Received: by quack3.suse.cz (Postfix, from userid 1000) id AD5C9A0729; Fri, 28 Apr 2023 13:51:46 +0200 (CEST) Date: Fri, 28 Apr 2023 13:51:46 +0200 From: Jan Kara To: Lorenzo Stoakes Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Andrew Morton , Jason Gunthorpe , Jens Axboe , Matthew Wilcox , Dennis Dalessandro , Leon Romanovsky , Christian Benvenuti , Nelson Escobar , Bernard Metzler , Peter Zijlstra , Ingo Molnar , Arnaldo Carvalho de Melo , Mark Rutland , Alexander Shishkin , Jiri Olsa , Namhyung Kim , Ian Rogers , Adrian Hunter , Bjorn Topel , Magnus Karlsson , Maciej Fijalkowski , Jonathan Lemon , "David S . Miller" , Eric Dumazet , Jakub Kicinski , Paolo Abeni , Christian Brauner , Richard Cochran , Alexei Starovoitov , Daniel Borkmann , Jesper Dangaard Brouer , John Fastabend , linux-fsdevel@vger.kernel.org, linux-perf-users@vger.kernel.org, netdev@vger.kernel.org, bpf@vger.kernel.org, Oleg Nesterov , Jason Gunthorpe , John Hubbard , Jan Kara , "Kirill A . Shutemov" , Pavel Begunkov , Mika Penttila Subject: Re: [PATCH v5] mm/gup: disallow GUP writing to file-backed mappings by default Message-ID: <20230428115146.7ec6sc4oylglchx7@quack3> References: <6b73e692c2929dc4613af711bdf92e2ec1956a66.1682638385.git.lstoakes@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <6b73e692c2929dc4613af711bdf92e2ec1956a66.1682638385.git.lstoakes@gmail.com> X-Stat-Signature: 8t5fno5aypxqwyc96x6chiwsdwt6g485 X-Rspam-User: X-Rspamd-Queue-Id: 009A740026 X-Rspamd-Server: rspam06 X-HE-Tag: 1682682708-911567 X-HE-Meta: U2FsdGVkX19aXHr/xtibfRx4mdvtGSvnwyhh9JTQ0AoI2pAn799xLjH/+RXp301ncebndrUuwGe6SsSVFe3rD9rS/9HmingEGASrp9jLX4BqNX/f9N7rGn6VfZSZRnl7DHwcx3CleuNSPb1tRLpnZNOU/4hkicQzycDACMZ6700+9b6qFSWSS5BaTRqMDx/ElVqz/qHzYkFqMtTZxCzoKqK6eZ3WgrYmjHiNfFDx0MoLv73TYVld919cH/JZ8ZalqYRPOWOmfhEtLqKpeByaDuwTcN3o6cdPTUmhHhrKJABNOnRVWZOBuZVQdI4HJjAh8WM7P8ZrvdsRddM72tQPpSFxZQC9nOW3q1wrvs2YH6mtmtdzXUPQOtghgyZbSQkcIqMOJvESZ5Zw1BytBuIXSY6Z7KWrl51c0aeZjSBjaXcn1hbssxiN8ZhZA8528cGEAWx/676FUeB7UkscHgRlIEVy3NIwuAiU3fqyCces6w9oZH+RES8X3glVSQ+JzZvlCF6wkgdTM6x7h1ITXi1uXho1gaUJnPus7U3P14Nyng7bG+zdD1cHQZE5zRZUZEQdRwo7bnqrZKi2fLqKs56n/yH5uSxrOgRIZAhJQNa9//Gg5u6mNDuTGHvIU/50D9MbF9sF8L9nddbX8MYFT6q/E/48ETR2OjerraApw7FxomXMNz/6WlZ6sxHDw0G3cfHPFIB4bb2HaheDUluXnxo15GcBInavh1SrHD9cKcftnkY5G7zGa8MY3zkXH0epk2CtVGiKuT2imTvLmoBrSQhM9xeDxnjNvcsL/HNqSDkcKhob4+Riszn95VH6pFkapd/Fr4fYQfQfpUfJ2gikqc9jpt/0wSrwnhVvos2xi1tpFx/0NcL+P7aMaEfS+/s/uKaSzUrNAS9E0gHk3F7JbvSo7wAFicsBlrKvkGf4mq8GlWFhwYyY8lWCZwrbFBT+1ujRX49al/8hjE6NHeaCX9B D/m6e0Nn 4MUN9mbtc0sJ8/tOHkGWsD7x56rXqnA0XcKPP3s4rDeEVx/PsBqNRXkdbgW65E6B9856NdHX07QMDoYSCRfVfsboxMbi/yo02YMlbucPH2MRv5DDzwpm5RJ60+2xhFlJcqMWIK08vhfkUBjBcJsI7Uo4pST9tzGtfifMQfj/j41wzbsWIE+d/Env1z7al1F2jhk36N3XtA21rpmpFLfwdf1DTyR455BEmVcVx3DEvioXu369mljkcnzqcqG9lhL4D5esxmInn97RMQs+z2I61pWBHO8dAgHHl65yb5AX4Xe3U4pZjo+32zUPO6XX7vOu1qUfEd7rZ2dtOHB0TsyXzF7PsATTQXHC61TFVIrQ5I3BiVYvud1qwDP36ptTFvjRTl+CYlgesV+uH/dHZzCvPsDhQU5XsDzT0iuHl4gsmfTxw9/ZLzprauw3OIQ== X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On Fri 28-04-23 00:42:32, Lorenzo Stoakes wrote: > Writing to file-backed mappings which require folio dirty tracking using > GUP is a fundamentally broken operation, as kernel write access to GUP > mappings do not adhere to the semantics expected by a file system. > > A GUP caller uses the direct mapping to access the folio, which does not > cause write notify to trigger, nor does it enforce that the caller marks > the folio dirty. > > The problem arises when, after an initial write to the folio, writeback > results in the folio being cleaned and then the caller, via the GUP > interface, writes to the folio again. > > As a result of the use of this secondary, direct, mapping to the folio no > write notify will occur, and if the caller does mark the folio dirty, this > will be done so unexpectedly. > > For example, consider the following scenario:- > > 1. A folio is written to via GUP which write-faults the memory, notifying > the file system and dirtying the folio. > 2. Later, writeback is triggered, resulting in the folio being cleaned and > the PTE being marked read-only. > 3. The GUP caller writes to the folio, as it is mapped read/write via the > direct mapping. > 4. The GUP caller, now done with the page, unpins it and sets it dirty > (though it does not have to). > > This results in both data being written to a folio without writenotify, and > the folio being dirtied unexpectedly (if the caller decides to do so). > > This issue was first reported by Jan Kara [1] in 2018, where the problem > resulted in file system crashes. > > This is only relevant when the mappings are file-backed and the underlying > file system requires folio dirty tracking. File systems which do not, such > as shmem or hugetlb, are not at risk and therefore can be written to > without issue. > > Unfortunately this limitation of GUP has been present for some time and > requires future rework of the GUP API in order to provide correct write > access to such mappings. > > However, for the time being we introduce this check to prevent the most > egregious case of this occurring, use of the FOLL_LONGTERM pin. > > These mappings are considerably more likely to be written to after > folios are cleaned and thus simply must not be permitted to do so. > > As part of this change we separate out vma_needs_dirty_tracking() as a > helper function to determine this which is distinct from > vma_wants_writenotify() which is specific to determining which PTE flags to > set. > > [1]:https://lore.kernel.org/linux-mm/20180103100430.GE4911@quack2.suse.cz/ > > Suggested-by: Jason Gunthorpe > Signed-off-by: Lorenzo Stoakes I'm for trying this out and let's see who complains ;) The patch looks good to me from the implementation point of view. Feel free to add: Reviewed-by: Jan Kara Honza > --- > include/linux/mm.h | 1 + > mm/gup.c | 41 ++++++++++++++++++++++++++++++++++++++++- > mm/mmap.c | 36 +++++++++++++++++++++++++++--------- > 3 files changed, 68 insertions(+), 10 deletions(-) > > diff --git a/include/linux/mm.h b/include/linux/mm.h > index 37554b08bb28..f7da02fc89c6 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -2433,6 +2433,7 @@ extern unsigned long move_page_tables(struct vm_area_struct *vma, > #define MM_CP_UFFD_WP_ALL (MM_CP_UFFD_WP | \ > MM_CP_UFFD_WP_RESOLVE) > > +bool vma_needs_dirty_tracking(struct vm_area_struct *vma); > int vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot); > static inline bool vma_wants_manual_pte_write_upgrade(struct vm_area_struct *vma) > { > diff --git a/mm/gup.c b/mm/gup.c > index 1f72a717232b..d36a5db9feb1 100644 > --- a/mm/gup.c > +++ b/mm/gup.c > @@ -959,16 +959,51 @@ static int faultin_page(struct vm_area_struct *vma, > return 0; > } > > +/* > + * Writing to file-backed mappings which require folio dirty tracking using GUP > + * is a fundamentally broken operation, as kernel write access to GUP mappings > + * do not adhere to the semantics expected by a file system. > + * > + * Consider the following scenario:- > + * > + * 1. A folio is written to via GUP which write-faults the memory, notifying > + * the file system and dirtying the folio. > + * 2. Later, writeback is triggered, resulting in the folio being cleaned and > + * the PTE being marked read-only. > + * 3. The GUP caller writes to the folio, as it is mapped read/write via the > + * direct mapping. > + * 4. The GUP caller, now done with the page, unpins it and sets it dirty > + * (though it does not have to). > + * > + * This results in both data being written to a folio without writenotify, and > + * the folio being dirtied unexpectedly (if the caller decides to do so). > + */ > +static bool writeable_file_mapping_allowed(struct vm_area_struct *vma, > + unsigned long gup_flags) > +{ > + /* If we aren't pinning then no problematic write can occur. */ > + if (!(gup_flags & (FOLL_GET | FOLL_PIN))) > + return true; > + > + /* We limit this check to the most egregious case - a long term pin. */ > + if (!(gup_flags & FOLL_LONGTERM)) > + return true; > + > + /* If the VMA requires dirty tracking then GUP will be problematic. */ > + return vma_needs_dirty_tracking(vma); > +} > + > static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags) > { > vm_flags_t vm_flags = vma->vm_flags; > int write = (gup_flags & FOLL_WRITE); > int foreign = (gup_flags & FOLL_REMOTE); > + bool vma_anon = vma_is_anonymous(vma); > > if (vm_flags & (VM_IO | VM_PFNMAP)) > return -EFAULT; > > - if (gup_flags & FOLL_ANON && !vma_is_anonymous(vma)) > + if ((gup_flags & FOLL_ANON) && !vma_anon) > return -EFAULT; > > if ((gup_flags & FOLL_LONGTERM) && vma_is_fsdax(vma)) > @@ -978,6 +1013,10 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags) > return -EFAULT; > > if (write) { > + if (!vma_anon && > + !writeable_file_mapping_allowed(vma, gup_flags)) > + return -EFAULT; > + > if (!(vm_flags & VM_WRITE)) { > if (!(gup_flags & FOLL_FORCE)) > return -EFAULT; > diff --git a/mm/mmap.c b/mm/mmap.c > index 536bbb8fa0ae..7b6344d1832a 100644 > --- a/mm/mmap.c > +++ b/mm/mmap.c > @@ -1475,6 +1475,31 @@ SYSCALL_DEFINE1(old_mmap, struct mmap_arg_struct __user *, arg) > } > #endif /* __ARCH_WANT_SYS_OLD_MMAP */ > > +/* Do VMA operations imply write notify is required? */ > +static bool vm_ops_needs_writenotify(const struct vm_operations_struct *vm_ops) > +{ > + return vm_ops && (vm_ops->page_mkwrite || vm_ops->pfn_mkwrite); > +} > + > +/* > + * Does this VMA require the underlying folios to have their dirty state > + * tracked? > + */ > +bool vma_needs_dirty_tracking(struct vm_area_struct *vma) > +{ > + /* Does the filesystem need to be notified? */ > + if (vm_ops_needs_writenotify(vma->vm_ops)) > + return true; > + > + /* Specialty mapping? */ > + if (vma->vm_flags & VM_PFNMAP) > + return false; > + > + /* Can the mapping track the dirty pages? */ > + return vma->vm_file && vma->vm_file->f_mapping && > + mapping_can_writeback(vma->vm_file->f_mapping); > +} > + > /* > * Some shared mappings will want the pages marked read-only > * to track write events. If so, we'll downgrade vm_page_prot > @@ -1484,14 +1509,13 @@ SYSCALL_DEFINE1(old_mmap, struct mmap_arg_struct __user *, arg) > int vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot) > { > vm_flags_t vm_flags = vma->vm_flags; > - const struct vm_operations_struct *vm_ops = vma->vm_ops; > > /* If it was private or non-writable, the write bit is already clear */ > if ((vm_flags & (VM_WRITE|VM_SHARED)) != ((VM_WRITE|VM_SHARED))) > return 0; > > /* The backer wishes to know when pages are first written to? */ > - if (vm_ops && (vm_ops->page_mkwrite || vm_ops->pfn_mkwrite)) > + if (vm_ops_needs_writenotify(vma->vm_ops)) > return 1; > > /* The open routine did something to the protections that pgprot_modify > @@ -1511,13 +1535,7 @@ int vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot) > if (userfaultfd_wp(vma)) > return 1; > > - /* Specialty mapping? */ > - if (vm_flags & VM_PFNMAP) > - return 0; > - > - /* Can the mapping track the dirty pages? */ > - return vma->vm_file && vma->vm_file->f_mapping && > - mapping_can_writeback(vma->vm_file->f_mapping); > + return vma_needs_dirty_tracking(vma); > } > > /* > -- > 2.40.0 -- Jan Kara SUSE Labs, CR