From mboxrd@z Thu Jan  1 00:00:00 1970
From: Jan Kara <jack@suse.cz>
To: 
Cc: Christian Brauner, Al Viro, Ted Tso, "Tigran A. Aivazian",
 David Sterba, OGAWA Hirofumi, Muchun Song, Oscar Salvador,
 David Hildenbrand, linux-mm@kvack.org, linux-aio@kvack.org,
 Benjamin LaHaise, Jan Kara
Subject: [PATCH 17/32] fs: Move metadata bhs tracking to a separate struct
Date: Tue, 3 Mar 2026 11:34:06 +0100
Message-ID: <20260303103406.4355-49-jack@suse.cz>
In-Reply-To: <20260303101717.27224-1-jack@suse.cz>
References: <20260303101717.27224-1-jack@suse.cz>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Instead of tracking metadata bhs for a mapping using i_private_list and
i_private_lock, create a dedicated mapping_metadata_bhs struct for it. For
now this struct is embedded in address_space, but it will be switched to
per-fs private inode parts later in the series.

This also changes the locking from the bdev mapping's i_private_lock to a
lock embedded in mapping_metadata_bhs, to untangle the i_private_lock
locking used for maintaining lists of metadata bhs from the locking used
for looking up / reclaiming the bdev's buffer heads. The locking in
remove_assoc_queue() gets more complex due to this, but overall this looks
like a reasonable tradeoff.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/buffer.c        | 138 +++++++++++++++++++++------------------
 fs/inode.c         |   2 +
 include/linux/fs.h |   7 +++
 3 files changed, 74 insertions(+), 73 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 18012afb8289..d39ae6581c26 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -469,30 +469,13 @@ EXPORT_SYMBOL(mark_buffer_async_write);
  *
  * The functions mark_buffer_dirty_inode(), fsync_inode_buffers(),
  * inode_has_buffers() and invalidate_inode_buffers() are provided for the
- * management of a list of dependent buffers at ->i_mapping->i_private_list.
- *
- * Locking is a little subtle: try_to_free_buffers() will remove buffers
- * from their controlling inode's queue when they are being freed.  But
- * try_to_free_buffers() will be operating against the *blockdev* mapping
- * at the time, not against the S_ISREG file which depends on those buffers.
- * So the locking for i_private_list is via the i_private_lock in the address_space
- * which backs the buffers.  Which is different from the address_space
- * against which the buffers are listed.  So for a particular address_space,
- * mapping->i_private_lock does *not* protect mapping->i_private_list!  In fact,
- * mapping->i_private_list will always be protected by the backing blockdev's
- * ->i_private_lock.
- *
- * Which introduces a requirement: all buffers on an address_space's
- * ->i_private_list must be from the same address_space: the blockdev's.
- *
- * address_spaces which do not place buffers at ->i_private_list via these
- * utility functions are free to use i_private_lock and i_private_list for
- * whatever they want.  The only requirement is that list_empty(i_private_list)
- * be true at clear_inode() time.
- *
- * FIXME: clear_inode should not call invalidate_inode_buffers().  The
- * filesystems should do that.  invalidate_inode_buffers() should just go
- * BUG_ON(!list_empty).
+ * management of a list of dependent buffers in mapping_metadata_bhs struct.
+ *
+ * The locking is a little subtle: The list of buffer heads is protected by
+ * the lock in mapping_metadata_bhs so functions coming from bdev mapping
+ * (such as try_to_free_buffers()) need to safely get to mapping_metadata_bhs
+ * using RCU, grab the lock, verify we didn't race with somebody detaching the
+ * bh / moving it to different inode and only then proceeding.
  *
  * FIXME: mark_buffer_dirty_inode() is a data-plane operation.  It should
  * take an address_space, not an inode.  And it should be called
@@ -509,19 +492,45 @@ EXPORT_SYMBOL(mark_buffer_async_write);
  * b_inode back.
  */

-/*
- * The buffer's backing address_space's i_private_lock must be held
- */
-static void __remove_assoc_queue(struct buffer_head *bh)
+static void __remove_assoc_queue(struct mapping_metadata_bhs *mmb,
+				 struct buffer_head *bh)
 {
+	lockdep_assert_held(&mmb->lock);
 	list_del_init(&bh->b_assoc_buffers);
 	WARN_ON(!bh->b_assoc_map);
 	bh->b_assoc_map = NULL;
 }

+static void remove_assoc_queue(struct buffer_head *bh)
+{
+	struct address_space *mapping;
+	struct mapping_metadata_bhs *mmb;
+
+	/*
+	 * The locking dance is ugly here. We need to acquire lock
+	 * protecting metadata bh list while possibly racing with bh
+	 * being removed from the list or moved to a different one. We
+	 * use RCU to pin mapping_metadata_bhs in memory to
+	 * opportunistically acquire the lock and then recheck the bh
+	 * didn't move under us.
+	 */
+	while (bh->b_assoc_map) {
+		rcu_read_lock();
+		mapping = READ_ONCE(bh->b_assoc_map);
+		if (mapping) {
+			mmb = &mapping->i_metadata_bhs;
+			spin_lock(&mmb->lock);
+			if (bh->b_assoc_map == mapping)
+				__remove_assoc_queue(mmb, bh);
+			spin_unlock(&mmb->lock);
+		}
+		rcu_read_unlock();
+	}
+}
+
 int inode_has_buffers(struct inode *inode)
 {
-	return !list_empty(&inode->i_data.i_private_list);
+	return !list_empty(&inode->i_data.i_metadata_bhs.list);
 }
 EXPORT_SYMBOL_GPL(inode_has_buffers);

@@ -529,7 +538,7 @@ EXPORT_SYMBOL_GPL(inode_has_buffers);
  * sync_mapping_buffers - write out & wait upon a mapping's "associated" buffers
  * @mapping: the mapping which wants those buffers written
  *
- * Starts I/O against the buffers at mapping->i_private_list, and waits upon
+ * Starts I/O against the buffers at mapping->i_metadata_bhs and waits upon
  * that I/O.  Basically, this is a convenience function for fsync().  @mapping
  * is a file or directory which needs those buffers to be written for a
  * successful fsync().
@@ -548,23 +557,22 @@ EXPORT_SYMBOL_GPL(inode_has_buffers);
  */
 int sync_mapping_buffers(struct address_space *mapping)
 {
-	struct address_space *buffer_mapping =
-		mapping->host->i_sb->s_bdev->bd_mapping;
+	struct mapping_metadata_bhs *mmb = &mapping->i_metadata_bhs;
 	struct buffer_head *bh;
 	int err = 0;
 	struct blk_plug plug;
 	LIST_HEAD(tmp);

-	if (list_empty(&mapping->i_private_list))
+	if (list_empty(&mmb->list))
 		return 0;

 	blk_start_plug(&plug);
-	spin_lock(&buffer_mapping->i_private_lock);
-	while (!list_empty(&mapping->i_private_list)) {
-		bh = BH_ENTRY(list->next);
+	spin_lock(&mmb->lock);
+	while (!list_empty(&mmb->list)) {
+		bh = BH_ENTRY(mmb->list.next);
 		WARN_ON_ONCE(bh->b_assoc_map != mapping);
-		__remove_assoc_queue(bh);
+		__remove_assoc_queue(mmb, bh);
 		/* Avoid race with mark_buffer_dirty_inode() which does
 		 * a lockless check and we rely on seeing the dirty bit */
 		smp_mb();
@@ -573,7 +581,7 @@ int sync_mapping_buffers(struct address_space *mapping)
 			bh->b_assoc_map = mapping;
 			if (buffer_dirty(bh)) {
 				get_bh(bh);
-				spin_unlock(&buffer_mapping->i_private_lock);
+				spin_unlock(&mmb->lock);
 				/*
 				 * Ensure any pending I/O completes so that
 				 * write_dirty_buffer() actually writes the
@@ -590,35 +598,34 @@ int sync_mapping_buffers(struct address_space *mapping)
 				 * through sync_buffer().
 				 */
 				brelse(bh);
-				spin_lock(&buffer_mapping->i_private_lock);
+				spin_lock(&mmb->lock);
 			}
 		}
 	}
-	spin_unlock(&buffer_mapping->i_private_lock);
+	spin_unlock(&mmb->lock);
 	blk_finish_plug(&plug);

-	spin_lock(&buffer_mapping->i_private_lock);
+	spin_lock(&mmb->lock);
 	while (!list_empty(&tmp)) {
 		bh = BH_ENTRY(tmp.prev);
 		get_bh(bh);
-		__remove_assoc_queue(bh);
+		__remove_assoc_queue(mmb, bh);
 		/* Avoid race with mark_buffer_dirty_inode() which does
 		 * a lockless check and we rely on seeing the dirty bit */
 		smp_mb();
 		if (buffer_dirty(bh)) {
-			list_add(&bh->b_assoc_buffers,
-				 &mapping->i_private_list);
+			list_add(&bh->b_assoc_buffers, &mmb->list);
 			bh->b_assoc_map = mapping;
 		}
-		spin_unlock(&buffer_mapping->i_private_lock);
+		spin_unlock(&mmb->lock);
 		wait_on_buffer(bh);
 		if (!buffer_uptodate(bh))
 			err = -EIO;
 		brelse(bh);
-		spin_lock(&buffer_mapping->i_private_lock);
+		spin_lock(&mmb->lock);
 	}
-	spin_unlock(&buffer_mapping->i_private_lock);
+	spin_unlock(&mmb->lock);
 	return err;
 }
 EXPORT_SYMBOL(sync_mapping_buffers);

@@ -715,15 +722,14 @@ void write_boundary_block(struct block_device *bdev,
 void mark_buffer_dirty_inode(struct buffer_head *bh, struct inode *inode)
 {
 	struct address_space *mapping = inode->i_mapping;
-	struct address_space *buffer_mapping = bh->b_folio->mapping;

 	mark_buffer_dirty(bh);
 	if (!bh->b_assoc_map) {
-		spin_lock(&buffer_mapping->i_private_lock);
+		spin_lock(&mapping->i_metadata_bhs.lock);
 		list_move_tail(&bh->b_assoc_buffers,
-			       &mapping->i_private_list);
+			       &mapping->i_metadata_bhs.list);
 		bh->b_assoc_map = mapping;
-		spin_unlock(&buffer_mapping->i_private_lock);
+		spin_unlock(&mapping->i_metadata_bhs.lock);
 	}
 }
 EXPORT_SYMBOL(mark_buffer_dirty_inode);

@@ -796,22 +802,16 @@ EXPORT_SYMBOL(block_dirty_folio);
  * Invalidate any and all dirty buffers on a given inode.  We are
  * probably unmounting the fs, but that doesn't mean we have already
  * done a sync().  Just drop the buffers from the inode list.
- *
- * NOTE: we take the inode's blockdev's mapping's i_private_lock.  Which
- * assumes that all the buffers are against the blockdev.
  */
 void invalidate_inode_buffers(struct inode *inode)
 {
 	if (inode_has_buffers(inode)) {
-		struct address_space *mapping = &inode->i_data;
-		struct list_head *list = &mapping->i_private_list;
-		struct address_space *buffer_mapping =
-			mapping->host->i_sb->s_bdev->bd_mapping;
-
-		spin_lock(&buffer_mapping->i_private_lock);
-		while (!list_empty(list))
-			__remove_assoc_queue(BH_ENTRY(list->next));
-		spin_unlock(&buffer_mapping->i_private_lock);
+		struct mapping_metadata_bhs *mmb = &inode->i_data.i_metadata_bhs;
+
+		spin_lock(&mmb->lock);
+		while (!list_empty(&mmb->list))
+			__remove_assoc_queue(mmb, BH_ENTRY(mmb->list.next));
+		spin_unlock(&mmb->lock);
 	}
 }
 EXPORT_SYMBOL(invalidate_inode_buffers);

@@ -1155,14 +1155,7 @@ EXPORT_SYMBOL(__brelse);
 void __bforget(struct buffer_head *bh)
 {
 	clear_buffer_dirty(bh);
-	if (bh->b_assoc_map) {
-		struct address_space *buffer_mapping = bh->b_folio->mapping;
-
-		spin_lock(&buffer_mapping->i_private_lock);
-		list_del_init(&bh->b_assoc_buffers);
-		bh->b_assoc_map = NULL;
-		spin_unlock(&buffer_mapping->i_private_lock);
-	}
+	remove_assoc_queue(bh);
 	__brelse(bh);
 }
 EXPORT_SYMBOL(__bforget);

@@ -2810,8 +2803,7 @@ drop_buffers(struct folio *folio, struct buffer_head **buffers_to_free)
 	do {
 		struct buffer_head *next = bh->b_this_page;

-		if (bh->b_assoc_map)
-			__remove_assoc_queue(bh);
+		remove_assoc_queue(bh);
 		bh = next;
 	} while (bh != head);
 	*buffers_to_free = head;
diff --git a/fs/inode.c b/fs/inode.c
index d5774e627a9c..393f586d050a 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -483,6 +483,8 @@ static void __address_space_init_once(struct address_space *mapping)
 	init_rwsem(&mapping->i_mmap_rwsem);
 	INIT_LIST_HEAD(&mapping->i_private_list);
 	spin_lock_init(&mapping->i_private_lock);
+	spin_lock_init(&mapping->i_metadata_bhs.lock);
+	INIT_LIST_HEAD(&mapping->i_metadata_bhs.list);
 	mapping->i_mmap = RB_ROOT_CACHED;
 }

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 10b96eb5391d..64771a55adc5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -445,6 +445,12 @@ struct address_space_operations {

 extern const struct address_space_operations empty_aops;

+/* Structure for tracking metadata buffer heads associated with the mapping */
+struct mapping_metadata_bhs {
+	spinlock_t lock;	/* Lock protecting bh list */
+	struct list_head list;	/* The list of bhs (b_assoc_buffers) */
+};
+
 /**
  * struct address_space - Contents of a cacheable, mappable object.
  * @host: Owner, either the inode or the block_device.
@@ -484,6 +490,7 @@ struct address_space {
 	errseq_t		wb_err;
 	spinlock_t		i_private_lock;
 	struct list_head	i_private_list;
+	struct mapping_metadata_bhs i_metadata_bhs;
 	struct rw_semaphore	i_mmap_rwsem;
 } __attribute__((aligned(sizeof(long)))) __randomize_layout;
 	/*
-- 
2.51.0