From: Dave Chinner <david@fromorbit.com>
To: linux-xfs@vger.kernel.org
Cc: willy@infradead.org, linux-mm@kvack.org
Subject: [PATCH 3/3] xfs: convert buffer cache to use high order folios
Date: Fri, 19 Jan 2024 09:19:41 +1100
Message-ID: <20240118222216.4131379-4-david@fromorbit.com>
X-Mailer: git-send-email 2.43.0
In-Reply-To: <20240118222216.4131379-1-david@fromorbit.com>
References: <20240118222216.4131379-1-david@fromorbit.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit

From: Dave Chinner

Now that we have the buffer cache using the folio API, we can extend the
use of folios to allocate high order folios for multi-page buffers rather
than an array of single pages that are then vmapped into a contiguous
range.

This creates two types of buffers: single folio buffers that can have
arbitrary order, and multi-folio buffers made up of many single page
folios that get vmapped. The latter is essentially the existing code, so
there are no logic changes to handle this case.

There are a few places where we iterate the folios on a buffer. These
need to be converted to handle the high order folio case. Luckily, this
only occurs when bp->b_folio_count == 1, and the code for handling this
case is just a simple application of the folio API to the operations that
need to be performed.

The code that allocates buffers will optimistically attempt a high order
folio allocation as a fast path. If this high order allocation fails,
then we fall back to the existing multi-folio allocation code. This now
forms the slow allocation path, and hopefully will be largely unused in
normal conditions.

This should improve performance of large buffer operations (e.g. large
directory block sizes) as we should now mostly avoid the expense of
vmapping large buffers (and the vmap lock contention that can occur) as
well as avoid the runtime pressure that frequently accessing kernel
vmapped pages puts on the TLBs.

Signed-off-by: Dave Chinner
---
 fs/xfs/xfs_buf.c | 150 +++++++++++++++++++++++++++++++++++++----------
 1 file changed, 119 insertions(+), 31 deletions(-)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 15907e92d0d3..df363f17ea1a 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -74,6 +74,10 @@ xfs_buf_is_vmapped(
 	return bp->b_addr && bp->b_folio_count > 1;
 }
 
+/*
+ * See comment above xfs_buf_alloc_folios() about the constraints placed on
+ * allocating vmapped buffers.
+ */
 static inline int
 xfs_buf_vmap_len(
 	struct xfs_buf	*bp)
@@ -344,14 +348,72 @@ xfs_buf_alloc_kmem(
 		bp->b_addr = NULL;
 		return -ENOMEM;
 	}
-	bp->b_offset = offset_in_page(bp->b_addr);
 	bp->b_folios = bp->b_folio_array;
 	bp->b_folios[0] = kmem_to_folio(bp->b_addr);
+	bp->b_offset = offset_in_folio(bp->b_folios[0], bp->b_addr);
 	bp->b_folio_count = 1;
 	bp->b_flags |= _XBF_KMEM;
 	return 0;
 }
 
+/*
+ * Allocating a high order folio makes the assumption that buffers are a
+ * power-of-2 size so that ilog2() returns the exact order needed to fit
+ * the contents of the buffer. Buffer lengths are mostly a power of two,
+ * so this is not an unreasonable approach to take by default.
+ *
+ * The exception here is user xattr data buffers, which can be arbitrarily
+ * sized up to 64kB plus structure metadata. In that case, round up the order.
+ */
+static bool
+xfs_buf_alloc_folio(
+	struct xfs_buf	*bp,
+	gfp_t		gfp_mask)
+{
+	int		length = BBTOB(bp->b_length);
+	int		order;
+
+	order = ilog2(length);
+	if ((1 << order) < length)
+		order = ilog2(length - 1) + 1;
+
+	if (order <= PAGE_SHIFT)
+		order = 0;
+	else
+		order -= PAGE_SHIFT;
+
+	bp->b_folio_array[0] = folio_alloc(gfp_mask, order);
+	if (!bp->b_folio_array[0])
+		return false;
+
+	bp->b_folios = bp->b_folio_array;
+	bp->b_folio_count = 1;
+	bp->b_flags |= _XBF_FOLIOS;
+	return true;
+}
+
+/*
+ * When we allocate folios for a buffer, we end up with one of two types of
+ * buffer.
+ *
+ * The first type is a single folio buffer - this may be a high order
+ * folio or just a single page sized folio, but either way they get treated the
+ * same way by the rest of the code - the buffer memory spans a single
+ * contiguous memory region that we don't have to map and unmap to access the
+ * data directly.
+ *
+ * The second type of buffer is the multi-folio buffer. These are *always* made
+ * up of single page folios so that they can be fed to vm_map_ram() to return a
+ * contiguous memory region we can access the data through, or mark it as
+ * XBF_UNMAPPED and access the data directly through individual folio_address()
+ * calls.
+ *
+ * We don't use high order folios for this second type of buffer (yet) because
+ * having variable size folios makes offset-to-folio indexing and iteration of
+ * the data range more complex than if they are fixed size. This case should now
+ * be the slow path, though, so unless we regularly fail to allocate high order
+ * folios, there should be little need to optimise this path.
+ */
 static int
 xfs_buf_alloc_folios(
 	struct xfs_buf	*bp,
@@ -363,7 +425,15 @@ xfs_buf_alloc_folios(
 	if (flags & XBF_READ_AHEAD)
 		gfp_mask |= __GFP_NORETRY;
 
-	/* Make sure that we have a page list */
+	/* Assure zeroed buffer for non-read cases. */
+	if (!(flags & XBF_READ))
+		gfp_mask |= __GFP_ZERO;
+
+	/* Optimistically attempt a single high order folio allocation. */
+	if (xfs_buf_alloc_folio(bp, gfp_mask))
+		return 0;
+
+	/* Fall back to allocating an array of single page folios. */
 	bp->b_folio_count = DIV_ROUND_UP(BBTOB(bp->b_length), PAGE_SIZE);
 	if (bp->b_folio_count <= XB_FOLIOS) {
 		bp->b_folios = bp->b_folio_array;
@@ -375,9 +445,6 @@ xfs_buf_alloc_folios(
 	}
 	bp->b_flags |= _XBF_FOLIOS;
 
-	/* Assure zeroed buffer for non-read cases. */
-	if (!(flags & XBF_READ))
-		gfp_mask |= __GFP_ZERO;
 
 	/*
 	 * Bulk filling of pages can take multiple calls. Not filling the entire
@@ -418,7 +485,7 @@ _xfs_buf_map_folios(
 {
 	ASSERT(bp->b_flags & _XBF_FOLIOS);
 	if (bp->b_folio_count == 1) {
-		/* A single page buffer is always mappable */
+		/* A single folio buffer is always mappable */
 		bp->b_addr = folio_address(bp->b_folios[0]);
 	} else if (flags & XBF_UNMAPPED) {
 		bp->b_addr = NULL;
@@ -1465,20 +1532,28 @@ xfs_buf_ioapply_map(
 	int		*count,
 	blk_opf_t	op)
 {
-	int		page_index;
-	unsigned int	total_nr_pages = bp->b_folio_count;
-	int		nr_pages;
+	int		folio_index;
+	unsigned int	total_nr_folios = bp->b_folio_count;
+	int		nr_folios;
 	struct bio	*bio;
 	sector_t	sector = bp->b_maps[map].bm_bn;
 	int		size;
 	int		offset;
 
-	/* skip the pages in the buffer before the start offset */
-	page_index = 0;
+	/*
+	 * If the start offset is larger than a single page, we need to be
+	 * careful. We might have a high order folio, in which case the indexing
+	 * is from the start of the buffer. However, if we have more than one
+	 * single page folio in the buffer, we need to skip the folios in
+	 * the buffer before the start offset.
+	 */
+	folio_index = 0;
 	offset = *buf_offset;
-	while (offset >= PAGE_SIZE) {
-		page_index++;
-		offset -= PAGE_SIZE;
+	if (bp->b_folio_count > 1) {
+		while (offset >= PAGE_SIZE) {
+			folio_index++;
+			offset -= PAGE_SIZE;
+		}
 	}
 
 	/*
@@ -1491,28 +1566,28 @@ xfs_buf_ioapply_map(
 next_chunk:
 	atomic_inc(&bp->b_io_remaining);
-	nr_pages = bio_max_segs(total_nr_pages);
+	nr_folios = bio_max_segs(total_nr_folios);
 
-	bio = bio_alloc(bp->b_target->bt_bdev, nr_pages, op, GFP_NOIO);
+	bio = bio_alloc(bp->b_target->bt_bdev, nr_folios, op, GFP_NOIO);
 	bio->bi_iter.bi_sector = sector;
 	bio->bi_end_io = xfs_buf_bio_end_io;
 	bio->bi_private = bp;
 
-	for (; size && nr_pages; nr_pages--, page_index++) {
-		int	rbytes, nbytes = PAGE_SIZE - offset;
+	for (; size && nr_folios; nr_folios--, folio_index++) {
+		struct folio	*folio = bp->b_folios[folio_index];
+		int		nbytes = folio_size(folio) - offset;
 
 		if (nbytes > size)
 			nbytes = size;
 
-		rbytes = bio_add_folio(bio, bp->b_folios[page_index], nbytes,
-				      offset);
-		if (rbytes < nbytes)
+		if (!bio_add_folio(bio, folio, nbytes,
+				offset_in_folio(folio, offset)))
 			break;
 
 		offset = 0;
 
 		sector += BTOBB(nbytes);
 		size -= nbytes;
-		total_nr_pages--;
+		total_nr_folios--;
 	}
 
 	if (likely(bio->bi_iter.bi_size)) {
@@ -1722,6 +1797,13 @@ xfs_buf_offset(
 	if (bp->b_addr)
 		return bp->b_addr + offset;
 
+	/* Single folio buffers may use large folios. */
+	if (bp->b_folio_count == 1) {
+		folio = bp->b_folios[0];
+		return folio_address(folio) + offset_in_folio(folio, offset);
+	}
+
+	/* Multi-folio buffers always use PAGE_SIZE folios */
 	folio = bp->b_folios[offset >> PAGE_SHIFT];
 	return folio_address(folio) + (offset & (PAGE_SIZE-1));
 }
@@ -1737,18 +1819,24 @@ xfs_buf_zero(
 	bend = boff + bsize;
 	while (boff < bend) {
 		struct folio	*folio;
-		int		page_index, page_offset, csize;
+		int		folio_index, folio_offset, csize;
 
-		page_index = (boff + bp->b_offset) >> PAGE_SHIFT;
-		page_offset = (boff + bp->b_offset) & ~PAGE_MASK;
-		folio = bp->b_folios[page_index];
-		csize = min_t(size_t, PAGE_SIZE - page_offset,
+		/* Single folio buffers may use large folios. */
+		if (bp->b_folio_count == 1) {
+			folio = bp->b_folios[0];
+			folio_offset = offset_in_folio(folio,
+						bp->b_offset + boff);
+		} else {
+			folio_index = (boff + bp->b_offset) >> PAGE_SHIFT;
+			folio_offset = (boff + bp->b_offset) & ~PAGE_MASK;
+			folio = bp->b_folios[folio_index];
+		}
+
+		csize = min_t(size_t, folio_size(folio) - folio_offset,
 				      BBTOB(bp->b_length) - boff);
+		ASSERT((csize + folio_offset) <= folio_size(folio));
 
-		ASSERT((csize + page_offset) <= PAGE_SIZE);
-
-		memset(folio_address(folio) + page_offset, 0, csize);
-
+		memset(folio_address(folio) + folio_offset, 0, csize);
 		boff += csize;
 	}
 }
-- 
2.43.0
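
As a rough illustration of the length-to-order rounding that
xfs_buf_alloc_folio() performs, the following standalone userspace sketch
(not part of the patch; the helper names ilog2_int() and buf_len_to_order()
and the 4kB PAGE_SHIFT are assumptions made for the example) maps a few
buffer lengths to the folio order that would be requested from folio_alloc():

#include <stdio.h>

#define PAGE_SHIFT	12
#define PAGE_SIZE	(1 << PAGE_SHIFT)

/* minimal stand-in for the kernel's ilog2(): floor(log2(v)) for v > 0 */
static int ilog2_int(unsigned int v)
{
	int l = -1;

	while (v) {
		v >>= 1;
		l++;
	}
	return l;
}

/* mirrors the order calculation in xfs_buf_alloc_folio() above */
static int buf_len_to_order(int length)
{
	int order = ilog2_int(length);

	/* round up when the length is not a power of two (e.g. xattr buffers) */
	if ((1 << order) < length)
		order = ilog2_int(length - 1) + 1;

	if (order <= PAGE_SHIFT)
		return 0;
	return order - PAGE_SHIFT;
}

int main(void)
{
	int lengths[] = { 4096, 8192, 65536, 65536 + 280 };
	unsigned int i;

	for (i = 0; i < sizeof(lengths) / sizeof(lengths[0]); i++) {
		int order = buf_len_to_order(lengths[i]);

		printf("length %6d -> order %d (%d byte folio)\n",
		       lengths[i], order, PAGE_SIZE << order);
	}
	return 0;
}

Power-of-2 lengths map exactly (4kB and 8kB to order 0 and 1, 64kB to
order 4), while the oversized xattr-style length rounds up to the next
order, so a ~64kB+ buffer ends up in a 128kB folio.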
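In the same spirit, here is a minimal sketch of the two addressing cases
that xfs_buf_offset() and xfs_buf_zero() now distinguish: a single
(possibly high order) folio indexed from the start of the buffer, versus
an array of PAGE_SIZE folios indexed by page number. This is standalone
illustrative code with made-up names (struct fake_buf, locate()), not
kernel code:

#include <stdio.h>

#define PAGE_SHIFT	12
#define PAGE_SIZE	(1 << PAGE_SHIFT)
#define PAGE_MASK	(~(PAGE_SIZE - 1))

/* hypothetical stand-in for the bits of struct xfs_buf used here */
struct fake_buf {
	int	b_folio_count;
};

/* print which folio and offset-in-folio a buffer offset lands in */
static void locate(const struct fake_buf *bp, int offset)
{
	int folio_index, folio_offset;

	if (bp->b_folio_count == 1) {
		/* single folio: index from the start of the buffer */
		folio_index = 0;
		folio_offset = offset;
	} else {
		/* multi-folio buffer: every folio is exactly PAGE_SIZE */
		folio_index = offset >> PAGE_SHIFT;
		folio_offset = offset & ~PAGE_MASK;
	}
	printf("  offset %6d -> folio %d, offset in folio %d\n",
	       offset, folio_index, folio_offset);
}

int main(void)
{
	struct fake_buf single = { .b_folio_count = 1 };	/* one 64kB folio */
	struct fake_buf multi = { .b_folio_count = 16 };	/* 16 x 4kB folios */

	printf("single folio buffer:\n");
	locate(&single, 5000);
	locate(&single, 20000);

	printf("multi-folio buffer:\n");
	locate(&multi, 5000);
	locate(&multi, 20000);
	return 0;
}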