From: Dave Chinner <david@fromorbit.com>
To: linux-xfs@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org
Subject: [PATCH 2/9] xfs: write page faults in iomap are not buffered writes
Date: Tue, 15 Nov 2022 12:30:36 +1100
Message-Id: <20221115013043.360610-3-david@fromorbit.com>
In-Reply-To: <20221115013043.360610-1-david@fromorbit.com>
References: <20221115013043.360610-1-david@fromorbit.com>

From: Dave Chinner

When we reserve a delalloc region in xfs_buffered_write_iomap_begin,
we mark the iomap as IOMAP_F_NEW so that the write context
understands that it allocated the delalloc region.

If we then fail that buffered write, xfs_buffered_write_iomap_end()
checks for the IOMAP_F_NEW flag and, if it is set, punches out the
unused delalloc region that was allocated for the write.

The assumption this code makes is that all buffered write operations
that can allocate space are run under an exclusive lock (i_rwsem).
This is an invalid assumption: page faults in mmap()d regions call
through this same function pair to map the file range being faulted,
and they run holding only the inode->i_mapping->invalidate_lock in
shared mode.
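
[ For context, the cleanup path being described follows roughly the
pattern below. This is a simplified sketch of the
xfs_buffered_write_iomap_end() logic, with the actual punching
elided; it is not the verbatim kernel code. ]

/*
 * Simplified sketch: only delalloc extents that this call
 * instantiated (IOMAP_F_NEW) are revoked when the write fails to
 * consume them.
 */
static int
xfs_buffered_write_iomap_end(
	struct inode		*inode,
	loff_t			offset,
	loff_t			length,
	ssize_t			written,
	unsigned		flags,
	struct iomap		*iomap)
{
	/* Nothing to undo unless we allocated delalloc blocks here. */
	if (iomap->type != IOMAP_DELALLOC || !(iomap->flags & IOMAP_F_NEW))
		return 0;

	/*
	 * Punch out the delalloc blocks the write did not consume. A
	 * racing page fault may have instantiated data in this same
	 * range through the same iomap ops - that is the race
	 * described next.
	 */
	/* ... compute the unused range and punch it out ... */
	return 0;
}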
IOWs, races between page faults and write() calls that fail the
nested page cache write operation can result in data loss. That is,
the failing iomap_end call will punch out the data that the other
racing iomap iteration brought into the page cache. This can be
reproduced with generic/34[46] if we arbitrarily fail page cache
copy-in operations from write() syscalls.

Code analysis tells us that the iomap_page_mkwrite() function holds
the already instantiated and uptodate folio locked across the iomap
mapping iterations. Hence the folio cannot be removed from memory
whilst we are mapping the range it covers, and as such we do not care
if the mapping changes state underneath the iomap iteration loop:

1. if the folio is not already dirty, there are no writeback races
possible.

2. if we allocated the mapping (delalloc or unwritten), the folio
cannot already be dirty. See #1.

3. If the folio is already dirty, it must be up to date. As we hold
it locked, it cannot be reclaimed from memory. Hence we always have
valid data in the page cache while iterating the mapping.

4. Valid data in the page cache can exist when the underlying mapping
is DELALLOC, UNWRITTEN or WRITTEN. Having the mapping change from
DELALLOC->UNWRITTEN or UNWRITTEN->WRITTEN does not change the data in
the page - it only affects actions if we are initialising a new page.
Hence #3 applies and we don't care about these extent map transitions
racing with iomap_page_mkwrite().

5. iomap_page_mkwrite() checks for page invalidation races (truncate,
hole punch, etc.) after it locks the folio. We also hold the
mapping->invalidate_lock here, and hence the mapping cannot change
due to extent removal operations while we are iterating the folio.

As such, filesystems that don't use bufferheads will never fail the
iomap_folio_mkwrite_iter() operation on the current mapping,
regardless of whether the iomap should be considered stale.

Further, the range we are asked to iterate is limited to the range
inside EOF that the folio spans. Hence, for XFS, we will only map the
exact range we are asked for, and we will only do speculative
preallocation with delalloc if we are mapping a hole at the EOF page.
The iterator will consume the entire range of the folio that is
within EOF, and anything beyond the EOF block cannot be accessed. We
never need to truncate this post-EOF speculative prealloc away in the
context of the iomap_page_mkwrite() iterator because if it remains
unused we'll remove it when the last reference to the inode goes
away.

Hence we don't actually need an .iomap_end() cleanup/error handling
path at all for iomap_page_mkwrite() for XFS. This means we can
separate the page fault processing from the complexity of the
.iomap_end() processing in the buffered write path. It also means
that the buffered write path will be able to take the
mapping->invalidate_lock as necessary.

Signed-off-by: Dave Chinner
Reviewed-by: Christoph Hellwig
Reviewed-by: Darrick J. Wong
---
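[ For reviewers, the fault-side loop the analysis above relies on
looks roughly like the following. This is a simplified paraphrase of
iomap_page_mkwrite() from fs/iomap/buffered-io.c, not the verbatim
code; the key property is that the folio stays locked across every
mapping iteration. ]

vm_fault_t
iomap_page_mkwrite(struct vm_fault *vmf, const struct iomap_ops *ops)
{
	struct iomap_iter iter = {
		.inode		= file_inode(vmf->vma->vm_file),
		.flags		= IOMAP_WRITE | IOMAP_FAULT,
	};
	struct folio *folio = page_folio(vmf->page);
	ssize_t ret;

	folio_lock(folio);
	/* Invalidation races (truncate, hole punch) are checked under
	 * the folio lock - point 5 above. */
	ret = folio_mkwrite_check_truncate(folio, iter.inode);
	if (ret < 0)
		goto out_unlock;
	iter.pos = folio_pos(folio);
	iter.len = ret;		/* only the part of the folio inside EOF */

	/* The folio remains locked for every iteration of the loop,
	 * so points 1-4 above hold across mapping state changes. */
	while ((ret = iomap_iter(&iter, ops)) > 0)
		iter.processed = iomap_folio_mkwrite_iter(&iter, folio);

	if (ret < 0)
		goto out_unlock;
	folio_wait_stable(folio);
	return VM_FAULT_LOCKED;
out_unlock:
	folio_unlock(folio);
	return block_page_mkwrite_return(ret);
}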
 fs/xfs/xfs_file.c  | 2 +-
 fs/xfs/xfs_iomap.c | 9 +++++++++
 fs/xfs/xfs_iomap.h | 1 +
 3 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index e462d39c840e..595a5bcf46b9 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1325,7 +1325,7 @@ __xfs_filemap_fault(
 	if (write_fault) {
 		xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
 		ret = iomap_page_mkwrite(vmf,
-				&xfs_buffered_write_iomap_ops);
+				&xfs_page_mkwrite_iomap_ops);
 		xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
 	} else {
 		ret = filemap_fault(vmf);
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 07da03976ec1..5cea069a38b4 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1187,6 +1187,15 @@ const struct iomap_ops xfs_buffered_write_iomap_ops = {
 	.iomap_end		= xfs_buffered_write_iomap_end,
 };
 
+/*
+ * iomap_page_mkwrite() will never fail in a way that requires delalloc extents
+ * that it allocated to be revoked. Hence we do not need an .iomap_end method
+ * for this operation.
+ */
+const struct iomap_ops xfs_page_mkwrite_iomap_ops = {
+	.iomap_begin		= xfs_buffered_write_iomap_begin,
+};
+
 static int
 xfs_read_iomap_begin(
 	struct inode		*inode,
diff --git a/fs/xfs/xfs_iomap.h b/fs/xfs/xfs_iomap.h
index c782e8c0479c..0f62ab633040 100644
--- a/fs/xfs/xfs_iomap.h
+++ b/fs/xfs/xfs_iomap.h
@@ -47,6 +47,7 @@ xfs_aligned_fsb_count(
 }
 
 extern const struct iomap_ops xfs_buffered_write_iomap_ops;
+extern const struct iomap_ops xfs_page_mkwrite_iomap_ops;
 extern const struct iomap_ops xfs_direct_write_iomap_ops;
 extern const struct iomap_ops xfs_read_iomap_ops;
 extern const struct iomap_ops xfs_seek_iomap_ops;
-- 
2.37.2