Date: Mon, 21 Apr 2025 10:18:51 -0700
Subject: [PATCH 1/3] block: fix race between set_blocksize and read paths
From: "Darrick J. Wong" <djwong@kernel.org>
To: djwong@kernel.org, cem@kernel.org, axboe@kernel.dk
Cc: hch@lst.de, shinichiro.kawasaki@wdc.com, linux-mm@kvack.org,
 mcgrof@kernel.org, linux-fsdevel@vger.kernel.org, linux-xfs@vger.kernel.org,
 willy@infradead.org, hch@infradead.org, linux-block@vger.kernel.org
Message-ID: <174525589048.2138337.8655735382810222791.stgit@frogsfrogsfrogs>
In-Reply-To: <174525589013.2138337.16473045486118778580.stgit@frogsfrogsfrogs>
References: <174525589013.2138337.16473045486118778580.stgit@frogsfrogsfrogs>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 7bit

From: Darrick J. Wong <djwong@kernel.org>

With the new large sector size support, set_blocksize can now change
i_blkbits and the minimum folio order in a manner that conflicts with a
concurrent reader and causes a kernel crash.

Specifically, let's say that udev-worker calls libblkid to detect the
labels on a block device.  The read call can create an order-0 folio to
read the first 4096 bytes from the disk.  But then udev is preempted.

Next, someone tries to mount an 8k-sectorsize filesystem from the same
block device.  The filesystem calls set_blocksize, which sets i_blkbits
for an 8192-byte block size and raises the minimum folio order to 1.

Now udev resumes, still holding the order-0 folio it allocated.  It
then tries to schedule a read bio, and do_mpage_readahead tries to
create bufferheads for the folio.  Unfortunately, blocks_per_folio == 0
because the page size is 4096 but the blocksize is 8192, so no
bufferheads are attached and the bh walk never sets bdev.  We then
submit the bio with a NULL block device and crash.

Therefore, truncate the page cache after flushing but before updating
i_blkbits.  However, that's not enough -- we also need to lock out file
IO and page faults during the update.  Take both the i_rwsem and the
invalidate_lock in exclusive mode for invalidations, and in shared mode
for read/write operations.

I don't know if this is the correct fix, but xfs/259 found it.
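
To make the arithmetic concrete: the bufferhead walk derives the number
of blocks in a folio by shifting the folio size down by i_blkbits, and
that shift truncates to zero in the scenario above.  A minimal
userspace sketch of the failure mode (illustrative values standing in
for the kernel's folio and inode fields, not kernel code; a second
sketch of the locking discipline follows the diff):

#include <stdio.h>

int main(void)
{
	unsigned long folio_size = 4096;	/* order-0 folio from the reader */
	unsigned int i_blkbits = 13;		/* after set_blocksize(..., 8192) */

	/* the per-folio block count truncates to zero: 4096 >> 13 == 0 */
	unsigned long blocks_per_folio = folio_size >> i_blkbits;

	printf("blocks_per_folio = %lu\n", blocks_per_folio);	/* prints 0 */
	return 0;
}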
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 block/bdev.c      |   17 +++++++++++++++++
 block/blk-zoned.c |    5 ++++-
 block/fops.c      |   16 ++++++++++++++++
 block/ioctl.c     |    6 ++++++
 4 files changed, 43 insertions(+), 1 deletion(-)

diff --git a/block/bdev.c b/block/bdev.c
index 6a2d08166e50c7..24984ec13e7cb2 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -169,11 +169,28 @@ int set_blocksize(struct file *file, int size)
 
 	/* Don't change the size if it is same as current */
 	if (inode->i_blkbits != blksize_bits(size)) {
+		/*
+		 * Flush and truncate the pagecache before we reconfigure the
+		 * mapping geometry because folio sizes are variable now.  If a
+		 * reader has already allocated a folio whose size is smaller
+		 * than the new min_order but invokes readahead after the new
+		 * min_order becomes visible, readahead will think there are
+		 * "zero" blocks per folio and crash.  Take the inode and
+		 * invalidation locks to avoid racing with
+		 * read/write/fallocate.
+		 */
+		inode_lock(inode);
+		filemap_invalidate_lock(inode->i_mapping);
+
 		sync_blockdev(bdev);
 		kill_bdev(bdev);
+
 		inode->i_blkbits = blksize_bits(size);
 		mapping_set_folio_order_range(inode->i_mapping, get_order(size),
 					      get_order(size));
 		kill_bdev(bdev);
+
+		filemap_invalidate_unlock(inode->i_mapping);
+		inode_unlock(inode);
 	}
 	return 0;
 }

diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 0c77244a35c92e..8f15d1aa6eb89a 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -343,6 +343,7 @@ int blkdev_zone_mgmt_ioctl(struct block_device *bdev, blk_mode_t mode,
 		op = REQ_OP_ZONE_RESET;
 
 		/* Invalidate the page cache, including dirty pages. */
+		inode_lock(bdev->bd_mapping->host);
 		filemap_invalidate_lock(bdev->bd_mapping);
 		ret = blkdev_truncate_zone_range(bdev, mode, &zrange);
 		if (ret)
@@ -364,8 +365,10 @@ int blkdev_zone_mgmt_ioctl(struct block_device *bdev, blk_mode_t mode,
 	ret = blkdev_zone_mgmt(bdev, op, zrange.sector, zrange.nr_sectors);
 
 fail:
-	if (cmd == BLKRESETZONE)
+	if (cmd == BLKRESETZONE) {
 		filemap_invalidate_unlock(bdev->bd_mapping);
+		inode_unlock(bdev->bd_mapping->host);
+	}
 
 	return ret;
 }

diff --git a/block/fops.c b/block/fops.c
index be9f1dbea9ce0a..e221fdcaa8aaf8 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -746,7 +746,14 @@ static ssize_t blkdev_write_iter(struct kiocb *iocb, struct iov_iter *from)
 		ret = direct_write_fallback(iocb, from, ret,
 				blkdev_buffered_write(iocb, from));
 	} else {
+		/*
+		 * Take i_rwsem and invalidate_lock to avoid racing with
+		 * set_blocksize changing i_blkbits/folio order and punching
+		 * out the pagecache.
+		 */
+		inode_lock_shared(bd_inode);
 		ret = blkdev_buffered_write(iocb, from);
+		inode_unlock_shared(bd_inode);
 	}
 
 	if (ret > 0)
@@ -757,6 +764,7 @@ static ssize_t blkdev_write_iter(struct kiocb *iocb, struct iov_iter *from)
 
 static ssize_t blkdev_read_iter(struct kiocb *iocb, struct iov_iter *to)
 {
+	struct inode *bd_inode = bdev_file_inode(iocb->ki_filp);
 	struct block_device *bdev = I_BDEV(iocb->ki_filp->f_mapping->host);
 	loff_t size = bdev_nr_bytes(bdev);
 	loff_t pos = iocb->ki_pos;
@@ -793,7 +801,13 @@ static ssize_t blkdev_read_iter(struct kiocb *iocb, struct iov_iter *to)
 		goto reexpand;
 	}
 
+	/*
+	 * Take i_rwsem and invalidate_lock to avoid racing with set_blocksize
+	 * changing i_blkbits/folio order and punching out the pagecache.
+	 */
+	inode_lock_shared(bd_inode);
 	ret = filemap_read(iocb, to, ret);
+	inode_unlock_shared(bd_inode);
 
 reexpand:
 	if (unlikely(shorted))
@@ -836,6 +850,7 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
 	if ((start | len) & (bdev_logical_block_size(bdev) - 1))
 		return -EINVAL;
 
+	inode_lock(inode);
 	filemap_invalidate_lock(inode->i_mapping);
 
 	/*
@@ -868,6 +883,7 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
 
 fail:
 	filemap_invalidate_unlock(inode->i_mapping);
+	inode_unlock(inode);
 	return error;
 }

diff --git a/block/ioctl.c b/block/ioctl.c
index faa40f383e2736..e472cc1030c60c 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -142,6 +142,7 @@ static int blk_ioctl_discard(struct block_device *bdev, blk_mode_t mode,
 	if (err)
 		return err;
 
+	inode_lock(bdev->bd_mapping->host);
 	filemap_invalidate_lock(bdev->bd_mapping);
 	err = truncate_bdev_range(bdev, mode, start, start + len - 1);
 	if (err)
@@ -174,6 +175,7 @@ static int blk_ioctl_discard(struct block_device *bdev, blk_mode_t mode,
 	blk_finish_plug(&plug);
 fail:
 	filemap_invalidate_unlock(bdev->bd_mapping);
+	inode_unlock(bdev->bd_mapping->host);
 	return err;
 }
 
@@ -199,12 +201,14 @@ static int blk_ioctl_secure_erase(struct block_device *bdev, blk_mode_t mode,
 	    end > bdev_nr_bytes(bdev))
 		return -EINVAL;
 
+	inode_lock(bdev->bd_mapping->host);
 	filemap_invalidate_lock(bdev->bd_mapping);
 	err = truncate_bdev_range(bdev, mode, start, end - 1);
 	if (!err)
 		err = blkdev_issue_secure_erase(bdev, start >> 9, len >> 9,
 						GFP_KERNEL);
 	filemap_invalidate_unlock(bdev->bd_mapping);
+	inode_unlock(bdev->bd_mapping->host);
 	return err;
 }
 
@@ -236,6 +240,7 @@ static int blk_ioctl_zeroout(struct block_device *bdev, blk_mode_t mode,
 		return -EINVAL;
 
 	/* Invalidate the page cache, including dirty pages */
+	inode_lock(bdev->bd_mapping->host);
 	filemap_invalidate_lock(bdev->bd_mapping);
 	err = truncate_bdev_range(bdev, mode, start, end);
 	if (err)
@@ -246,6 +251,7 @@ static int blk_ioctl_zeroout(struct block_device *bdev, blk_mode_t mode,
 
 fail:
 	filemap_invalidate_unlock(bdev->bd_mapping);
+	inode_unlock(bdev->bd_mapping->host);
 	return err;
 }
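
As a footnote for reviewers, here is the promised sketch of the locking
discipline the patch adopts, with pthread rwlocks standing in for the
kernel's i_rwsem and invalidate_lock (names and values are illustrative,
not kernel API): geometry changes take both locks exclusively, i_rwsem
first; the read and write paths take i_rwsem shared, which is enough to
keep i_blkbits and the folio order stable for the duration of the I/O.

#include <pthread.h>
#include <stdio.h>

static pthread_rwlock_t i_rwsem = PTHREAD_RWLOCK_INITIALIZER;
static pthread_rwlock_t invalidate_lock = PTHREAD_RWLOCK_INITIALIZER;
static unsigned int blkbits = 12;			/* 4096-byte blocks */

/* Geometry change: take both locks exclusively, i_rwsem first. */
static void *change_blocksize(void *arg)
{
	(void)arg;
	pthread_rwlock_wrlock(&i_rwsem);		/* inode_lock() */
	pthread_rwlock_wrlock(&invalidate_lock);	/* filemap_invalidate_lock() */
	/* flush and truncate the cache, then publish the new geometry */
	blkbits = 13;					/* 8192-byte blocks */
	pthread_rwlock_unlock(&invalidate_lock);
	pthread_rwlock_unlock(&i_rwsem);
	return NULL;
}

/* Read path: shared i_rwsem keeps blkbits stable for the whole I/O. */
static void *read_path(void *arg)
{
	(void)arg;
	pthread_rwlock_rdlock(&i_rwsem);		/* inode_lock_shared() */
	printf("reading with blkbits=%u\n", blkbits);
	pthread_rwlock_unlock(&i_rwsem);
	return NULL;
}

int main(void)
{
	pthread_t r, w;

	pthread_create(&r, NULL, read_path, NULL);
	pthread_create(&w, NULL, change_blocksize, NULL);
	pthread_join(r, NULL);
	pthread_join(w, NULL);
	return 0;			/* build with: cc sketch.c -lpthread */
}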