From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id EA4C3C433FE for ; Wed, 16 Nov 2022 13:57:19 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 5522B6B0071; Wed, 16 Nov 2022 08:57:19 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 501976B0072; Wed, 16 Nov 2022 08:57:19 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 3A2E98E0001; Wed, 16 Nov 2022 08:57:19 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0014.hostedemail.com [216.40.44.14]) by kanga.kvack.org (Postfix) with ESMTP id 2B3B06B0071 for ; Wed, 16 Nov 2022 08:57:19 -0500 (EST) Received: from smtpin05.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay04.hostedemail.com (Postfix) with ESMTP id CE3DD1A0C6C for ; Wed, 16 Nov 2022 13:57:18 +0000 (UTC) X-FDA: 80139457356.05.4E9FA01 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) by imf22.hostedemail.com (Postfix) with ESMTP id 15FC2C0011 for ; Wed, 16 Nov 2022 13:57:17 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1668607037; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=U4homV+fVhvRQr3MTMK6djx4iwohawMzl3H/+thnj+U=; b=WK0uoAzO6iu3sExrT7Yu9di522TAx4NduVQqto0VXS9bgmE/6UuXp1eg371lW6ey+vBUP6 qfiHZjQdAOZhnAhclpQlYEMytdxFMJ0b7hrD+1hQ5yjx30GiLwfBxDy6FlQFrvKKgxNhoi UrQIHQBJs95zlTTgH+4miNUuoW4k8OI= Received: from mail-qv1-f71.google.com (mail-qv1-f71.google.com [209.85.219.71]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_128_GCM_SHA256) id us-mta-659-i0LRFKrtPvGZu82QlhTIUg-1; Wed, 16 Nov 2022 08:57:16 -0500 X-MC-Unique: i0LRFKrtPvGZu82QlhTIUg-1 Received: by mail-qv1-f71.google.com with SMTP id do17-20020a056214097100b004c6183e6ce4so11574847qvb.11 for ; Wed, 16 Nov 2022 05:57:16 -0800 (PST) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:from:date:x-gm-message-state:from:to:cc:subject:date :message-id:reply-to; bh=U4homV+fVhvRQr3MTMK6djx4iwohawMzl3H/+thnj+U=; b=zjS+onLtc7fPj84ZfDlytvO8lWWhnit4pCkRH4VuxDR/0Q11PgLazxeBZv4StE5eEm 69FNels/09P/YWXZMeGByJJ90wc/CYyhMofulHXzIoJGNmfwh9n2CARz0+HS74hJMb2N rE9X3FDLej03RnVBazwLSWbpG3wn71Oa2B62w6mdD0DGVU356lH8dWYyMXDlEZ/vZdSP cWwg/JU/G5FyHh84HeQnTyIN2MaLjw7M9dcxoh7qNzgThUCB8LNsq59a+e9G0CrzG9y6 G4m2l8+GydSdzY+X21j/huzWtWRanoUgXfMMqE4oCouZx4NBLubjau2pHNjSPq0n8GD0 nL8A== X-Gm-Message-State: ANoB5plKW0gMC0otXEzYCyvzAkmu/wrETDq2hWCH+22z4vsYjBYjyFkz wvF85A21YeBdC05ip1zkfyr1mz5JuQMXhB5do2Gsi+qdZJG03xqN9WGMTvk/wzC4UEp/Y+m+kxE VqjQFr1Z4Y8A= X-Received: by 2002:a05:6214:3185:b0:4c6:5682:8878 with SMTP id lb5-20020a056214318500b004c656828878mr7571439qvb.5.1668607035632; Wed, 16 Nov 2022 05:57:15 -0800 (PST) X-Google-Smtp-Source: AA0mqf5886Oi0yVJL/ZsJCfXFajAX1bbbEKJNG9bcaFh8odkbdwQtNJ2DtnN4JzEWzF2Op9k157ZWw== X-Received: by 2002:a05:6214:3185:b0:4c6:5682:8878 with SMTP id lb5-20020a056214318500b004c656828878mr7571423qvb.5.1668607035332; Wed, 16 Nov 2022 05:57:15 -0800 (PST) Received: from bfoster (c-24-61-119-116.hsd1.ma.comcast.net. [24.61.119.116]) by smtp.gmail.com with ESMTPSA id he31-20020a05622a601f00b00397b1c60780sm8787436qtb.61.2022.11.16.05.57.14 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 16 Nov 2022 05:57:14 -0800 (PST) Date: Wed, 16 Nov 2022 08:57:19 -0500 From: Brian Foster To: Dave Chinner Cc: linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org Subject: Re: [PATCH 5/9] xfs: buffered write failure should not truncate the page cache Message-ID: References: <20221115013043.360610-1-david@fromorbit.com> <20221115013043.360610-6-david@fromorbit.com> MIME-Version: 1.0 In-Reply-To: <20221115013043.360610-6-david@fromorbit.com> X-Mimecast-Spam-Score: 0 X-Mimecast-Originator: redhat.com Content-Type: text/plain; charset=us-ascii Content-Disposition: inline ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1668607038; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=U4homV+fVhvRQr3MTMK6djx4iwohawMzl3H/+thnj+U=; b=cbWqs+Fm6Ay4rX4czZqG3Nez+f1ViL9KNAvw6rQG2Nahi2UrIxqCMRyY2igb9JyST1PAbT 9CEAU+V/Xjk3HAU6vbfEXxe45Mkp1w0apS6W0ceROxY0XkSEhgYQroogx0wkbQZG8/mLZT rm7W7wxuvczn7eAV9T+jyxXtMUe3MFY= ARC-Authentication-Results: i=1; imf22.hostedemail.com; dkim=pass header.d=redhat.com header.s=mimecast20190719 header.b=WK0uoAzO; dmarc=pass (policy=none) header.from=redhat.com; spf=pass (imf22.hostedemail.com: domain of bfoster@redhat.com designates 170.10.129.124 as permitted sender) smtp.mailfrom=bfoster@redhat.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1668607038; a=rsa-sha256; cv=none; b=c7pvKSRzwQMwpQWIeDtKoMiU5kP7qMjDRHxofsVNnNYfOkfPznBjEgNjVSVBLaoHkVRAbR qWScIeiVqFUw5PeGugni0Bw40hzNIGhyAmhscoo90QBen1ePbUpDBKf3XGXrPHHGB9moOf 2ZMxBi0GJ6b1gnWQ9qbEe5f9gE/qvpw= X-Rspam-User: X-Stat-Signature: mapfi3naxeozk815adzjzwry1x3e49rc X-Rspamd-Queue-Id: 15FC2C0011 Authentication-Results: imf22.hostedemail.com; dkim=pass header.d=redhat.com header.s=mimecast20190719 header.b=WK0uoAzO; dmarc=pass (policy=none) header.from=redhat.com; spf=pass (imf22.hostedemail.com: domain of bfoster@redhat.com designates 170.10.129.124 as permitted sender) smtp.mailfrom=bfoster@redhat.com X-Rspamd-Server: rspam07 X-HE-Tag: 1668607037-773064 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On Tue, Nov 15, 2022 at 12:30:39PM +1100, Dave Chinner wrote: > From: Dave Chinner > ... > > Signed-off-by: Dave Chinner > --- > fs/xfs/xfs_iomap.c | 151 ++++++++++++++++++++++++++++++++++++++++++--- > 1 file changed, 141 insertions(+), 10 deletions(-) > > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c > index 7bb55dbc19d3..2d48fcc7bd6f 100644 > --- a/fs/xfs/xfs_iomap.c > +++ b/fs/xfs/xfs_iomap.c > @@ -1134,6 +1134,146 @@ xfs_buffered_write_delalloc_punch( > end_fsb - start_fsb); > } > ... > +/* > + * Punch out all the delalloc blocks in the range given except for those that > + * have dirty data still pending in the page cache - those are going to be > + * written and so must still retain the delalloc backing for writeback. > + * > + * As we are scanning the page cache for data, we don't need to reimplement the > + * wheel - mapping_seek_hole_data() does exactly what we need to identify the > + * start and end of data ranges correctly even for sub-folio block sizes. This > + * byte range based iteration is especially convenient because it means we don't > + * have to care about variable size folios, nor where the start or end of the > + * data range lies within a folio, if they lie within the same folio or even if > + * there are multiple discontiguous data ranges within the folio. > + */ > +static int > +xfs_buffered_write_delalloc_release( > + struct inode *inode, > + loff_t start_byte, > + loff_t end_byte) > +{ > + loff_t punch_start_byte = start_byte; > + int error = 0; > + > + /* > + * Lock the mapping to avoid races with page faults re-instantiating > + * folios and dirtying them via ->page_mkwrite whilst we walk the > + * cache and perform delalloc extent removal. Failing to do this can > + * leave dirty pages with no space reservation in the cache. > + */ > + filemap_invalidate_lock(inode->i_mapping); > + while (start_byte < end_byte) { > + loff_t data_end; > + > + start_byte = mapping_seek_hole_data(inode->i_mapping, > + start_byte, end_byte, SEEK_DATA); FWIW, the fact that mapping seek data is based on uptodate status means that seek behavior can change based on prior reads. For example, see how seek hole/data presents reads of unwritten ranges as data [1]. The same thing isn't observable for holes because iomap doesn't check the mapping in that case, but underlying iop state is the same and that is what this code is looking at. The filtering being done here means we essentially only care about dirty pages backed by delalloc blocks. That means if you get here with a dirty page and the portion of the page affected by this failed write is uptodate, this won't punch an underlying delalloc block even though nothing else may have written to it in the meantime. That sort of state can be created by a prior read of the range on a sub-page block size fs, or perhaps a racing async readahead (via read fault of a lower offset..?), etc. I suspect this is not a serious error because the page is dirty and writeback will thus convert the block. The only exception to that I can see is if the block is beyond EOF (consider a mapped read to a page that straddles EOF, followed by a post-eof write that fails), writeback won't actually map the block directly. It may convert if contiguous with delalloc blocks inside EOF (and sufficiently sized physical extents exist), or even if not, should still otherwise be cleaned up by the various other means we already have to manage post-eof blocks. So IOW there's a tradeoff being made here for possible spurious allocation and I/O and a subtle dependency on writeback that should probably be documented somewhere. The larger concern is that if writeback eventually changes based on dirty range tracking in a way that breaks this dependency, that introduces yet another stale delalloc block landmine associated with this error handling code (regardless of whether you want to call that a bug in this code, seek data, whatever), and those problems are difficult enough to root cause as it is. Brian [1] # xfs_io -fc "falloc 0 4k" -c "seek -a 0" -c "pread 0 4k" -c "seek -a 0" Whence Result HOLE 0 read 4096/4096 bytes at offset 0 4 KiB, 4 ops; 0.0000 sec (156 MiB/sec and 160000.0000 ops/sec) Whence Result DATA 0 HOLE 4096 > + /* > + * If there is no more data to scan, all that is left is to > + * punch out the remaining range. > + */ > + if (start_byte == -ENXIO || start_byte == end_byte) > + break; > + if (start_byte < 0) { > + error = start_byte; > + goto out_unlock; > + } > + ASSERT(start_byte >= punch_start_byte); > + ASSERT(start_byte < end_byte); > + > + /* > + * We find the end of this contiguous cached data range by > + * seeking from start_byte to the beginning of the next hole. > + */ > + data_end = mapping_seek_hole_data(inode->i_mapping, start_byte, > + end_byte, SEEK_HOLE); > + if (data_end < 0) { > + error = data_end; > + goto out_unlock; > + } > + ASSERT(data_end > start_byte); > + ASSERT(data_end <= end_byte); > + > + error = xfs_buffered_write_delalloc_scan(inode, > + &punch_start_byte, start_byte, data_end); > + if (error) > + goto out_unlock; > + > + /* The next data search starts at the end of this one. */ > + start_byte = data_end; > + } > + > + if (punch_start_byte < end_byte) > + error = xfs_buffered_write_delalloc_punch(inode, > + punch_start_byte, end_byte); > +out_unlock: > + filemap_invalidate_unlock(inode->i_mapping); > + return error; > +} > + > static int > xfs_buffered_write_iomap_end( > struct inode *inode, > @@ -1179,16 +1319,7 @@ xfs_buffered_write_iomap_end( > if (start_byte >= end_byte) > return 0; > > - /* > - * Lock the mapping to avoid races with page faults re-instantiating > - * folios and dirtying them via ->page_mkwrite between the page cache > - * truncation and the delalloc extent removal. Failing to do this can > - * leave dirty pages with no space reservation in the cache. > - */ > - filemap_invalidate_lock(inode->i_mapping); > - truncate_pagecache_range(inode, start_byte, end_byte - 1); > - error = xfs_buffered_write_delalloc_punch(inode, start_byte, end_byte); > - filemap_invalidate_unlock(inode->i_mapping); > + error = xfs_buffered_write_delalloc_release(inode, start_byte, end_byte); > if (error && !xfs_is_shutdown(mp)) { > xfs_alert(mp, "%s: unable to clean up ino 0x%llx", > __func__, XFS_I(inode)->i_ino); > -- > 2.37.2 > >