From: Hillf Danton
To: "Theodore Ts'o"
Cc: John Hubbard, linux-ext4@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH -v3] ext4: don't BUG if kernel subsystems dirty pages without asking ext4 first
Date: Sat, 26 Feb 2022 08:18:43 +0800
Message-Id: <20220226001843.2520-1-hdanton@sina.com>
References: <2f9933b3-a574-23e1-e632-72fc29e582cf@nvidia.com>

On Fri, 25 Feb 2022 18:21:21 -0500 Theodore Ts'o wrote:
> On Fri, Feb 25, 2022 at 01:33:33PM -0800, John Hubbard wrote:
> > On 2/25/22 13:23, Theodore Ts'o wrote:
> > > [un]pin_user_pages_remote is dirtying pages without properly warning
> > > the file system in advance.  This was noted by Jan Kara in 2018[1] and
> >
> > In 2018, [un]pin_user_pages_remote did not exist. And so what Jan reported
> > was actually that dio_bio_complete() was calling set_page_dirty_lock()
> > on pages that were not (any longer) set up for that.
>
> Fair enough, there are two problems that are getting conflated here,
> and that's my bad.  The problem which Jan pointed out is one where the
> Direct I/O read path triggered a page fault, so page_mkwrite() was
> actually called.  So in this case, the file system was actually
> notified, and the page was marked dirty after the file system was
> notified.  But then the DIO read was racing with the page cleaner,
> which would call writepage(), and then clear the page, and then remove
> the buffer_heads.  Then dio_bio_complete() would call set_page_dirty()
> a second time, and that's what would trigger the BUG.
>
> But in the syzbot reproducer, it's a different problem.  In this case,
> process_vm_writev() is calling [un]pin_user_pages_remote(), and
> page_mkwrite() is never getting called.  So there is no need to race
> with the page cleaner, and so the BUG triggers much more reliably.
>
> > > more recently has resulted in bug reports by Syzbot in various Android
> > > kernels[2].
> > >
> > > This is technically a bug in mm/gup.c, but arguably ext4 is fragile in
> >
> > Is it, really? unpin_user_pages_dirty_lock() moved the set_page_dirty_lock()
> > call into mm/gup.c, but that merely refactored things. The callers are
> > all over the kernel, and those callers are what need changing in order
> > to fix this.
>
> From my perspective, the bug is calling set_page_dirty() without first
> calling the file system's page_mkwrite().  This is necessary since the
> file system needs to allocate file system data blocks in preparation
> for a future writeback.
>
> Now, calling page_mkwrite() by itself is not enough, since the moment
> you make the page dirty, the page cleaner could go ahead and call
> writepage() behind your back and clean it.  In actual practice, with a
> Direct I/O read request racing with writeback, this race was quite
> hard to hit, because that would imply that the background
> writepage() call would have to complete ahead of the synchronous read
> request, and the block layer generally prioritizes synchronous reads
> ahead of background write requests.  So in practice, this race was
> ***very*** hard to hit.  Jan may have reported it in 2018, but I don't
> think I've ever seen it happen myself.
>
> For process_vm_writev() this is a case where user pages are pinned and
> then released in short order, so I suspect that race with the page
> cleaner would also be very hard to hit.  But we could completely
> remove the potential for the race, and also make things kinder for
> f2fs and btrfs's compressed file write support, by making things work
> much like the write(2) system call.
> Imagine if we had a "pin_user_pages_local()" which calls
> write_begin(), and an "unpin_user_pages_local()" which calls
> write_end(), and the presumption with the "[un]pin_user_pages_local"
> API is that you don't hold the pinned pages for very long --- say, not
> across a system call boundary, and then it would work the same way the
> write(2) system call works, except that in the case of
> process_vm_writev(2) the pages are identified by another process's
> address space where they happen to be mapped.
>
> This obviously doesn't work when pinning pages for remote DMA, because
> in that case the time between pin_user_pages_remote() and
> unpin_user_pages_remote() could be a long, long time, so that means we
> can't use write_begin/write_end; we'd need to call page_mkwrite()
> when the pages are first pinned and then somehow prevent the page
> cleaner from touching a dirty page which is pinned for use by the
> remote DMA.

Sad to see this come up here, given the earlier attempt to ensure that
no gup-pinned page is put under writeback. [05]

Hillf
>
> Does that make sense?
>
> - Ted
>

[05] https://lore.kernel.org/linux-mm/20191103112113.8256-1-hdanton@sina.com/