* [RFC 0/3] add large zero page for zeroing out larger segments
@ 2025-05-16 10:10 Pankaj Raghav
2025-05-16 10:10 ` [RFC 1/3] mm: add large zero page for efficient zeroing of " Pankaj Raghav
` (2 more replies)
0 siblings, 3 replies; 7+ messages in thread
From: Pankaj Raghav @ 2025-05-16 10:10 UTC (permalink / raw)
To: Darrick J . Wong, hch, willy
Cc: linux-kernel, linux-mm, David Hildenbrand, linux-fsdevel, mcgrof,
gost.dev, Andrew Morton, kernel, Pankaj Raghav
Introduce LARGE_ZERO_PAGE of size 2M as an alternative to ZERO_PAGE.
Similar to ZERO_PAGE, LARGE_ZERO_PAGE is also a global shared page.
2M seems to be a decent compromise between memory usage and performance.
This idea (but not the implementation) was suggested during the review of
adding LBS support to XFS[1][2].
NOTE:
===
This implementation probably has a lot of holes, and it is not complete.
For example, it currently only works on x86.
The intent of the RFC is:
- To understand if this is something we still need in the kernel.
- Whether this is the approach we want to take to implement such a
feature, or whether we should explore other alternatives.
I have excluded a lot of maintainers/mailing lists and only included
relevant folks in this RFC, to understand the direction we want to take
if this feature is needed.
===
There are many places in the kernel where we need to zero out larger
chunks, but the maximum segment we can zero out at a time is limited by
PAGE_SIZE.
This is especially annoying in block devices and filesystems, where we
attach multiple ZERO_PAGEs to the bio in different bvecs. With multipage
bvec support in the block layer, it is much more efficient to send out
larger zero pages as part of a single bvec.
Some examples of places in the kernel where this could be useful:
- blkdev_issue_zero_pages()
- iomap_dio_zero()
- vmalloc.c:zero_iter()
- rxperf_process_call()
- fscrypt_zeroout_range_inline_crypt()
- bch2_checksum_update()
...
I have converted blkdev_issue_zero_pages() and iomap_dio_zero() as
examples in this series.
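To make the bvec saving concrete, here is a hypothetical userspace sketch (not kernel code) of the segment arithmetic behind the series; `SZ_4K`/`SZ_2M` and `zero_segments()` are illustrative names only:

```c
#include <stddef.h>

/* Each bvec references at most one zero page, so the number of segments
 * needed to describe `len` bytes of zeroes is len divided by the
 * zero-page size, rounded up. */
#define SZ_4K  (4UL * 1024)
#define SZ_2M  (2UL * 1024 * 1024)

/* Number of bio segments needed to zero out `len` bytes. */
static unsigned long zero_segments(unsigned long len, unsigned long zp_size)
{
	return (len + zp_size - 1) / zp_size;	/* DIV_ROUND_UP() */
}
```

For example, zeroing 64 MiB takes 16384 bvecs with 4 KiB ZERO_PAGEs, but only 32 with a 2 MiB zero page.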
While there are other options such as huge_zero_page, its allocation
can fail depending on system conditions, requiring a fallback to
ZERO_PAGE[3].
LARGE_ZERO_PAGE is added behind a config option so that systems that are
constrained by memory are not forced to use it.
Looking forward to some feedback.
[1] https://lore.kernel.org/linux-xfs/20231027051847.GA7885@lst.de/
[2] https://lore.kernel.org/linux-xfs/ZitIK5OnR7ZNY0IG@infradead.org/
[3] https://lore.kernel.org/linux-xfs/3pqmgrlewo6ctcwakdvbvjqixac5en6irlipe5aiz6vkylfyni@2luhrs36ke5r/
Pankaj Raghav (3):
mm: add large zero page for efficient zeroing of larger segments
block: use LARGE_ZERO_PAGE in __blkdev_issue_zero_pages()
iomap: use LARGE_ZERO_PAGE in iomap_dio_zero()
arch/Kconfig | 8 ++++++++
arch/x86/include/asm/pgtable.h | 20 +++++++++++++++++++-
arch/x86/kernel/head_64.S | 9 ++++++++-
block/blk-lib.c | 4 ++--
fs/iomap/direct-io.c | 31 +++++++++----------------------
5 files changed, 46 insertions(+), 26 deletions(-)
base-commit: 9e619cd4fefd19cdce16e169d5827bc64ae01aa1
--
2.47.2
^ permalink raw reply [flat|nested] 7+ messages in thread
* [RFC 1/3] mm: add large zero page for efficient zeroing of larger segments
2025-05-16 10:10 [RFC 0/3] add large zero page for zeroing out larger segments Pankaj Raghav
@ 2025-05-16 10:10 ` Pankaj Raghav
2025-05-16 12:21 ` David Hildenbrand
2025-05-16 10:10 ` [RFC 2/3] block: use LARGE_ZERO_PAGE in __blkdev_issue_zero_pages() Pankaj Raghav
2025-05-16 10:10 ` [RFC 3/3] iomap: use LARGE_ZERO_PAGE in iomap_dio_zero() Pankaj Raghav
2 siblings, 1 reply; 7+ messages in thread
From: Pankaj Raghav @ 2025-05-16 10:10 UTC (permalink / raw)
To: Darrick J . Wong, hch, willy
Cc: linux-kernel, linux-mm, David Hildenbrand, linux-fsdevel, mcgrof,
gost.dev, Andrew Morton, kernel, Pankaj Raghav
Introduce LARGE_ZERO_PAGE of size 2M as an alternative to ZERO_PAGE of
size PAGE_SIZE.
There are many places in the kernel where we need to zero out larger
chunks, but the maximum segment we can zero out at a time is limited by
PAGE_SIZE.
This is especially annoying in block devices and filesystems, where we
attach multiple ZERO_PAGEs to the bio in different bvecs. With multipage
bvec support in the block layer, it is much more efficient to send out
larger zero pages as part of a single bvec.
While there are other options such as huge_zero_page, its allocation
can fail under system memory pressure, requiring a fallback to
ZERO_PAGE[3].
This idea (but not the implementation) was suggested during the review of
adding LBS support to XFS[1][2].
LARGE_ZERO_PAGE is added behind a config option so that systems that are
constrained by memory are not forced to use it.
[1] https://lore.kernel.org/linux-xfs/20231027051847.GA7885@lst.de/
[2] https://lore.kernel.org/linux-xfs/ZitIK5OnR7ZNY0IG@infradead.org/
[3] https://lore.kernel.org/linux-xfs/3pqmgrlewo6ctcwakdvbvjqixac5en6irlipe5aiz6vkylfyni@2luhrs36ke5r/
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
arch/Kconfig | 8 ++++++++
arch/x86/include/asm/pgtable.h | 20 +++++++++++++++++++-
arch/x86/kernel/head_64.S | 9 ++++++++-
3 files changed, 35 insertions(+), 2 deletions(-)
diff --git a/arch/Kconfig b/arch/Kconfig
index b0adb665041f..aefa519cb211 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -218,6 +218,14 @@ config USER_RETURN_NOTIFIER
Provide a kernel-internal notification when a cpu is about to
switch to user mode.
+config LARGE_ZERO_PAGE
+	bool "Large zero page"
+	default n
+	help
+	  Reserve a 2M zero page for zeroing out larger segments. This
+	  reserves 2M of physical memory, so it is not suitable for
+	  memory-constrained systems.
+
config HAVE_IOREMAP_PROT
bool
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 3f59d7a16010..78eb83f2da34 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -17,6 +17,7 @@
#ifndef __ASSEMBLER__
#include <linux/spinlock.h>
+#include <linux/sizes.h>
#include <asm/x86_init.h>
#include <asm/pkru.h>
#include <asm/fpu/api.h>
@@ -47,14 +48,31 @@ void ptdump_walk_user_pgd_level_checkwx(void);
#define debug_checkwx_user() do { } while (0)
#endif
+#ifdef CONFIG_LARGE_ZERO_PAGE
+/*
+ * LARGE_ZERO_PAGE is a global shared page that is always zero: used
+ * for zero-mapped memory areas etc..
+ */
+extern unsigned long empty_large_zero_page[(SZ_2M) / sizeof(unsigned long)]
+ __visible;
+#define ZERO_LARGE_PAGE(vaddr) ((void)(vaddr),virt_to_page(empty_large_zero_page))
+
+#define ZERO_PAGE(vaddr) ZERO_LARGE_PAGE(vaddr)
+#define ZERO_LARGE_PAGE_SIZE SZ_2M
+#else
/*
* ZERO_PAGE is a global shared page that is always zero: used
* for zero-mapped memory areas etc..
*/
-extern unsigned long empty_zero_page[PAGE_SIZE / sizeof(unsigned long)]
+extern unsigned long empty_zero_page[(PAGE_SIZE) / sizeof(unsigned long)]
__visible;
#define ZERO_PAGE(vaddr) ((void)(vaddr),virt_to_page(empty_zero_page))
+#define ZERO_LARGE_PAGE(vaddr) ZERO_PAGE(vaddr)
+
+#define ZERO_LARGE_PAGE_SIZE PAGE_SIZE
+#endif
+
extern spinlock_t pgd_lock;
extern struct list_head pgd_list;
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index fefe2a25cf02..ebcd12f72966 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -14,6 +14,7 @@
#include <linux/threads.h>
#include <linux/init.h>
#include <linux/pgtable.h>
+#include <linux/sizes.h>
#include <asm/segment.h>
#include <asm/page.h>
#include <asm/msr.h>
@@ -708,8 +709,14 @@ EXPORT_SYMBOL(phys_base)
#include "../xen/xen-head.S"
__PAGE_ALIGNED_BSS
+#ifdef CONFIG_LARGE_ZERO_PAGE
+SYM_DATA_START_PAGE_ALIGNED(empty_large_zero_page)
+ .skip SZ_2M
+SYM_DATA_END(empty_large_zero_page)
+EXPORT_SYMBOL(empty_large_zero_page)
+#else
SYM_DATA_START_PAGE_ALIGNED(empty_zero_page)
.skip PAGE_SIZE
SYM_DATA_END(empty_zero_page)
EXPORT_SYMBOL(empty_zero_page)
-
+#endif
--
2.47.2
* [RFC 2/3] block: use LARGE_ZERO_PAGE in __blkdev_issue_zero_pages()
2025-05-16 10:10 [RFC 0/3] add large zero page for zeroing out larger segments Pankaj Raghav
2025-05-16 10:10 ` [RFC 1/3] mm: add large zero page for efficient zeroing of " Pankaj Raghav
@ 2025-05-16 10:10 ` Pankaj Raghav
2025-05-16 10:10 ` [RFC 3/3] iomap: use LARGE_ZERO_PAGE in iomap_dio_zero() Pankaj Raghav
2 siblings, 0 replies; 7+ messages in thread
From: Pankaj Raghav @ 2025-05-16 10:10 UTC (permalink / raw)
To: Darrick J . Wong, hch, willy
Cc: linux-kernel, linux-mm, David Hildenbrand, linux-fsdevel, mcgrof,
gost.dev, Andrew Morton, kernel, Pankaj Raghav
Use LARGE_ZERO_PAGE in __blkdev_issue_zero_pages() instead of ZERO_PAGE.
On systems that support LARGE_ZERO_PAGE, we will end up sending larger
bvecs instead of multiple small ones.
Noticed a 4% performance increase on a commercial NVMe SSD that does
not support REQ_OP_WRITE_ZEROES. The gains might be bigger if the
device supports a larger MDTS.
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
block/blk-lib.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/block/blk-lib.c b/block/blk-lib.c
index 4c9f20a689f7..80dfc737d1f6 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -211,8 +211,8 @@ static void __blkdev_issue_zero_pages(struct block_device *bdev,
unsigned int len, added;
len = min_t(sector_t,
- PAGE_SIZE, nr_sects << SECTOR_SHIFT);
- added = bio_add_page(bio, ZERO_PAGE(0), len, 0);
+ ZERO_LARGE_PAGE_SIZE, nr_sects << SECTOR_SHIFT);
+ added = bio_add_page(bio, ZERO_LARGE_PAGE(0), len, 0);
if (added < len)
break;
nr_sects -= added >> SECTOR_SHIFT;
--
2.47.2
* [RFC 3/3] iomap: use LARGE_ZERO_PAGE in iomap_dio_zero()
2025-05-16 10:10 [RFC 0/3] add large zero page for zeroing out larger segments Pankaj Raghav
2025-05-16 10:10 ` [RFC 1/3] mm: add large zero page for efficient zeroing of " Pankaj Raghav
2025-05-16 10:10 ` [RFC 2/3] block: use LARGE_ZERO_PAGE in __blkdev_issue_zero_pages() Pankaj Raghav
@ 2025-05-16 10:10 ` Pankaj Raghav
2 siblings, 0 replies; 7+ messages in thread
From: Pankaj Raghav @ 2025-05-16 10:10 UTC (permalink / raw)
To: Darrick J . Wong, hch, willy
Cc: linux-kernel, linux-mm, David Hildenbrand, linux-fsdevel, mcgrof,
gost.dev, Andrew Morton, kernel, Pankaj Raghav
Use LARGE_ZERO_PAGE instead of the custom-allocated 64k zero page. The
downside is that we might end up using ZERO_PAGE on systems that do not
enable the LARGE_ZERO_PAGE feature.
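The chunking this patch introduces can be sketched in a hypothetical userspace form (not the kernel code itself; `ZP_SIZE` and `add_zero_bvecs()` are illustrative stand-ins for ZERO_LARGE_PAGE_SIZE and the in-kernel loop):

```c
#include <stddef.h>

/* Split `len` into zero-page-sized pieces, one bvec per piece; in the
 * real patch each iteration corresponds to one call of
 * __bio_add_page(bio, ZERO_LARGE_PAGE(0), io_len, 0). Returns the
 * number of bvecs that would be added. */
#define ZP_SIZE (2UL * 1024 * 1024)

static size_t add_zero_bvecs(size_t len)
{
	size_t nr = 0;

	while (len) {
		/* io_len = min(len, ZP_SIZE): the last piece may be short. */
		size_t io_len = len < ZP_SIZE ? len : ZP_SIZE;

		len -= io_len;
		nr++;
	}
	return nr;
}
```

With a 2M zero page a 64k sub-block zeroing fits in a single bvec; only lengths larger than the zero page need multiple iterations.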
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
fs/iomap/direct-io.c | 31 +++++++++----------------------
1 file changed, 9 insertions(+), 22 deletions(-)
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 844261a31156..6a2b6726a156 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -29,13 +29,6 @@
#define IOMAP_DIO_WRITE (1U << 30)
#define IOMAP_DIO_DIRTY (1U << 31)
-/*
- * Used for sub block zeroing in iomap_dio_zero()
- */
-#define IOMAP_ZERO_PAGE_SIZE (SZ_64K)
-#define IOMAP_ZERO_PAGE_ORDER (get_order(IOMAP_ZERO_PAGE_SIZE))
-static struct page *zero_page;
-
struct iomap_dio {
struct kiocb *iocb;
const struct iomap_dio_ops *dops;
@@ -290,23 +283,29 @@ static int iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
{
struct inode *inode = file_inode(dio->iocb->ki_filp);
struct bio *bio;
+ int nr_vecs = max(1, i_blocksize(inode) / ZERO_LARGE_PAGE_SIZE);
if (!len)
return 0;
/*
* Max block size supported is 64k
*/
- if (WARN_ON_ONCE(len > IOMAP_ZERO_PAGE_SIZE))
+ if (WARN_ON_ONCE(len > SZ_64K))
return -EINVAL;
- bio = iomap_dio_alloc_bio(iter, dio, 1, REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
+ bio = iomap_dio_alloc_bio(iter, dio, nr_vecs, REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
fscrypt_set_bio_crypt_ctx(bio, inode, pos >> inode->i_blkbits,
GFP_KERNEL);
bio->bi_iter.bi_sector = iomap_sector(&iter->iomap, pos);
bio->bi_private = dio;
bio->bi_end_io = iomap_dio_bio_end_io;
- __bio_add_page(bio, zero_page, len, 0);
+ while (len) {
+ unsigned int io_len = min_t(unsigned int, len, ZERO_LARGE_PAGE_SIZE);
+
+ __bio_add_page(bio, ZERO_LARGE_PAGE(0), io_len, 0);
+ len -= io_len;
+ }
iomap_dio_submit_bio(iter, dio, bio, pos);
return 0;
}
@@ -827,15 +826,3 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
return iomap_dio_complete(dio);
}
EXPORT_SYMBOL_GPL(iomap_dio_rw);
-
-static int __init iomap_dio_init(void)
-{
- zero_page = alloc_pages(GFP_KERNEL | __GFP_ZERO,
- IOMAP_ZERO_PAGE_ORDER);
-
- if (!zero_page)
- return -ENOMEM;
-
- return 0;
-}
-fs_initcall(iomap_dio_init);
--
2.47.2
* Re: [RFC 1/3] mm: add large zero page for efficient zeroing of larger segments
2025-05-16 10:10 ` [RFC 1/3] mm: add large zero page for efficient zeroing of " Pankaj Raghav
@ 2025-05-16 12:21 ` David Hildenbrand
2025-05-16 13:03 ` Pankaj Raghav (Samsung)
0 siblings, 1 reply; 7+ messages in thread
From: David Hildenbrand @ 2025-05-16 12:21 UTC (permalink / raw)
To: Pankaj Raghav, Darrick J . Wong, hch, willy
Cc: linux-kernel, linux-mm, linux-fsdevel, mcgrof, gost.dev,
Andrew Morton, kernel
On 16.05.25 12:10, Pankaj Raghav wrote:
> Introduce LARGE_ZERO_PAGE of size 2M as an alternative to ZERO_PAGE of
> size PAGE_SIZE.
>
> There are many places in the kernel where we need to zeroout larger
> chunks but the maximum segment we can zeroout at a time is limited by
> PAGE_SIZE.
>
> This is especially annoying in block devices and filesystems where we
> attach multiple ZERO_PAGEs to the bio in different bvecs. With multipage
> bvec support in block layer, it is much more efficient to send out
> larger zero pages as a part of single bvec.
>
> While there are other options such as huge_zero_page, they can fail
> based on the system memory pressure requiring a fallback to ZERO_PAGE[3].
Instead of adding another one, why not have a config option that will
always allocate the huge zeropage, and never free it?
I mean, the whole thing about dynamically allocating/freeing it was for
memory-constrained systems. For large systems, we just don't care.
--
Cheers,
David / dhildenb
* Re: [RFC 1/3] mm: add large zero page for efficient zeroing of larger segments
2025-05-16 12:21 ` David Hildenbrand
@ 2025-05-16 13:03 ` Pankaj Raghav (Samsung)
2025-05-16 14:54 ` David Hildenbrand
0 siblings, 1 reply; 7+ messages in thread
From: Pankaj Raghav (Samsung) @ 2025-05-16 13:03 UTC (permalink / raw)
To: David Hildenbrand
Cc: Pankaj Raghav, Darrick J . Wong, hch, willy, linux-kernel,
linux-mm, linux-fsdevel, mcgrof, gost.dev, Andrew Morton
On Fri, May 16, 2025 at 02:21:04PM +0200, David Hildenbrand wrote:
> On 16.05.25 12:10, Pankaj Raghav wrote:
> > Introduce LARGE_ZERO_PAGE of size 2M as an alternative to ZERO_PAGE of
> > size PAGE_SIZE.
> >
> > There are many places in the kernel where we need to zeroout larger
> > chunks but the maximum segment we can zeroout at a time is limited by
> > PAGE_SIZE.
> >
> > This is especially annoying in block devices and filesystems where we
> > attach multiple ZERO_PAGEs to the bio in different bvecs. With multipage
> > bvec support in block layer, it is much more efficient to send out
> > larger zero pages as a part of single bvec.
> >
> > While there are other options such as huge_zero_page, they can fail
> > based on the system memory pressure requiring a fallback to ZERO_PAGE[3].
>
> Instead of adding another one, why not have a config option that will always
> allocate the huge zeropage, and never free it?
>
> I mean, the whole thing about dynamically allocating/freeing it was for
> memory-constrained systems. For large systems, we just don't care.
That sounds like a good idea. I was just worried about wasting too much
memory with a huge page on systems with a 64k page size. But it can
always be disabled by putting it behind a config option.
Thanks, David. I will wait to see what others think, but your
suggestion sounds like a good way to proceed.
--
Pankaj Raghav
* Re: [RFC 1/3] mm: add large zero page for efficient zeroing of larger segments
2025-05-16 13:03 ` Pankaj Raghav (Samsung)
@ 2025-05-16 14:54 ` David Hildenbrand
0 siblings, 0 replies; 7+ messages in thread
From: David Hildenbrand @ 2025-05-16 14:54 UTC (permalink / raw)
To: Pankaj Raghav (Samsung)
Cc: Pankaj Raghav, Darrick J . Wong, hch, willy, linux-kernel,
linux-mm, linux-fsdevel, mcgrof, gost.dev, Andrew Morton
On 16.05.25 15:03, Pankaj Raghav (Samsung) wrote:
> On Fri, May 16, 2025 at 02:21:04PM +0200, David Hildenbrand wrote:
>> On 16.05.25 12:10, Pankaj Raghav wrote:
>>> Introduce LARGE_ZERO_PAGE of size 2M as an alternative to ZERO_PAGE of
>>> size PAGE_SIZE.
>>>
>>> There are many places in the kernel where we need to zeroout larger
>>> chunks but the maximum segment we can zeroout at a time is limited by
>>> PAGE_SIZE.
>>>
>>> This is especially annoying in block devices and filesystems where we
>>> attach multiple ZERO_PAGEs to the bio in different bvecs. With multipage
>>> bvec support in block layer, it is much more efficient to send out
>>> larger zero pages as a part of single bvec.
>>>
>>> While there are other options such as huge_zero_page, they can fail
>>> based on the system memory pressure requiring a fallback to ZERO_PAGE[3].
>>
>> Instead of adding another one, why not have a config option that will always
>> allocate the huge zeropage, and never free it?
>>
>> I mean, the whole thing about dynamically allocating/freeing it was for
>> memory-constrained systems. For large systems, we just don't care.
>
> That sounds like a good idea. I was just worried about wasting too much
> memory with a huge page in systems with 64k page size. But it can always be
> disabled by putting it behind a config.
Exactly. If the huge zero page is larger than 2M, we probably don't want
it in any case.
On arm64 with a 64k page size, it could be 512 MiB. Full of zeroes.
I'm wondering why nobody ever complained about that before, and I don't
see anything immediate that would disable the huge zero page in such
environments. Well, we can just leave that as it is.
In any case, the idea would be to have a Kconfig where we statically
allocate the huge zero page and disable all the refcounting / shrinking.
Then, we can make this Kconfig specific to sane environments (e.g., 4
KiB page size).
From other MM code, we can then simply reuse that single huge zero page.
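[Editorial note: a rough userspace sketch of the allocate-once, never-free direction described above; every name here is hypothetical, not an existing kernel symbol, and calloc() merely stands in for a static boot-time reservation.]

```c
#include <stdlib.h>

/* Reserve the huge zero page a single time at init and never free it,
 * so callers need no refcounting, shrinker, or fallback path. */
#define HUGE_ZERO_SIZE (2UL * 1024 * 1024)

static void *huge_zero_page;

static int huge_zero_init(void)
{
	/* One-time zeroed allocation; with this Kconfig the dynamic
	 * alloc/free and refcount paths would be compiled out. */
	huge_zero_page = calloc(1, HUGE_ZERO_SIZE);
	return huge_zero_page ? 0 : -1;
}

static const void *get_huge_zero_page(void)
{
	/* Never fails after init, and has no matching put()/free. */
	return huge_zero_page;
}
```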
>
> Thanks, David. I will wait to see what others think but what you
> suggested sounds like a good idea on how to proceed.
In particular, it wouldn't be arch-specific, and on x86 we wouldn't
waste 2x 2MB for storing zeroes ...
--
Cheers,
David / dhildenb