* Efficient mapping of sparse file holes to zero-pages
@ 2025-02-20 12:48 David Frank
2025-02-20 13:47 ` Matthew Wilcox
0 siblings, 1 reply; 5+ messages in thread
From: David Frank @ 2025-02-20 12:48 UTC (permalink / raw)
To: linux-mm, linux-kernel
Hi all,
I'd like to efficiently mmap a large sparse file (ext4), 95% of which
is holes. I was unsatisfied with the performance and after profiling,
I found that most of the time is spent in filemap_add_folio and
filemap_alloc_folio - much more than in my algorithm:
- 97.87% filemap_fault
   - 97.57% do_sync_mmap_readahead
      - page_cache_ra_order
         - 97.28% page_cache_ra_unbounded
            - 40.80% filemap_add_folio
               + 21.93% __filemap_add_folio
               + 8.88% folio_add_lru
               + 7.56% workingset_refault
            + 28.73% filemap_alloc_folio
            + 22.34% read_pages
            + 3.29% xa_load
As a workaround, I started using lseek and SEEK_HOLE+SEEK_DATA and
changed the algorithm to use a static array filled with zeros instead
of reading from the holes. This works ~30x faster, however, it
introduces substantial complexity in the implementation. I was
wondering if mapping holes to zero pages with COW in the kernel is
being considered.
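A minimal sketch of the SEEK_HOLE/SEEK_DATA walk described above, not the code actually used in the original algorithm. The names `walk_data_extents`, `add_extent`, `data_bytes` and the `extent_fn` callback type are hypothetical; only `lseek()` with `SEEK_DATA`/`SEEK_HOLE` comes from the message. It assumes Linux and glibc:

```c
#define _GNU_SOURCE            /* exposes SEEK_DATA / SEEK_HOLE in glibc */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef SEEK_DATA              /* Linux values, in case the macros are hidden */
#define SEEK_DATA 3
#define SEEK_HOLE 4
#endif

/* Hypothetical callback type: called once per data extent [start, end). */
typedef void (*extent_fn)(off_t start, off_t end, void *ctx);

/*
 * Walk the data extents of fd. Everything between two reported extents
 * is a hole and can be treated as zeros without touching the mapping.
 * Returns 0 on success, -1 on error (errno set by lseek).
 */
static int walk_data_extents(int fd, extent_fn fn, void *ctx)
{
    off_t pos = 0;

    for (;;) {
        off_t data = lseek(fd, pos, SEEK_DATA);
        if (data < 0)
            return errno == ENXIO ? 0 : -1;  /* ENXIO: no data at or after pos */

        off_t hole = lseek(fd, data, SEEK_HOLE);
        if (hole < 0)
            return -1;
        assert(hole > data);   /* there is always an implicit hole at EOF */

        fn(data, hole, ctx);
        pos = hole;
    }
}

/* Example callback: sum the number of bytes that belong to data extents. */
static void add_extent(off_t start, off_t end, void *ctx)
{
    *(off_t *)ctx += end - start;
}

/* Convenience wrapper: bytes in data extents of the file at path, or -1. */
static off_t data_bytes(const char *path)
{
    off_t total = 0;
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    int rc = walk_data_extents(fd, add_extent, &total);
    close(fd);
    return rc == 0 ? total : -1;
}
```

On a filesystem without hole support, `lseek()` reports the whole file as one data extent, so the walk still terminates but finds no holes to skip. The result is only a snapshot: a concurrent writer can fill a hole after it has been reported.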
I found [a related thread][1] from early 2022 which mentions mapping
to zero pages for shared memory objects. There seemed to be some
concerns about the complexity; I wonder whether the situation is
different for (even just private/read-only) mmap.
[1]: https://lore.kernel.org/lkml/4b1885b8-eb95-c50-2965-11e7c8efbf36@google.com/T/
Thanks,
David
* Re: Efficient mapping of sparse file holes to zero-pages
2025-02-20 12:48 Efficient mapping of sparse file holes to zero-pages David Frank
@ 2025-02-20 13:47 ` Matthew Wilcox
2025-02-20 20:46 ` David Frank
0 siblings, 1 reply; 5+ messages in thread
From: Matthew Wilcox @ 2025-02-20 13:47 UTC (permalink / raw)
To: David Frank; +Cc: linux-mm, linux-kernel
On Thu, Feb 20, 2025 at 01:48:18PM +0100, David Frank wrote:
> I'd like to efficiently mmap a large sparse file (ext4), 95% of which
> is holes. I was unsatisfied with the performance and after profiling,
> I found that most of the time is spent in filemap_add_folio and
> filemap_alloc_folio - much more than in my algorithm:
>
> - 97.87% filemap_fault
>    - 97.57% do_sync_mmap_readahead
>       - page_cache_ra_order
>          - 97.28% page_cache_ra_unbounded
>             - 40.80% filemap_add_folio
>                + 21.93% __filemap_add_folio
>                + 8.88% folio_add_lru
>                + 7.56% workingset_refault
>             + 28.73% filemap_alloc_folio
>             + 22.34% read_pages
>             + 3.29% xa_load

Yes, this is expected.

The fundamental problem is that we don't have the sparseness information
at the right point. So the read request (or pagefault) comes in, the
VFS allocates a page, puts it in the pagecache, then asks the filesystem
to fill it. The filesystem knows, so could theoretically tell the VFS
"Oh, this is a hole", but by this point the "damage" is done -- the page
has been allocated and added to the page cache.

Of course, this is a soluble problem. The VFS could ask the filesystem
for its sparseness information (as you do in userspace), but unlike your
particular usecase, the kernel must handle attackers who are trying to
make it do the wrong thing as well as ill-timed writes. So the VFS has
to ensure it does not use stale data from the filesystem.

This is a problem I'm somewhat interested in solving, but I'm a bit
busy with folios right now. And once that project is done, improving
the page cache for reflinked files is next on my list, so I'm not likely
to get to this problem for a few years.
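The ordering described above (allocate, insert into the page cache, then ask the filesystem) can be observed from userspace. The following is a rough sketch, assuming Linux and assuming that `mincore()` residency is an acceptable proxy for "a page was allocated and added to the page cache". The function name and the path passed to it are hypothetical:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Create a file at path that consists of a single page-sized hole,
 * read-fault it through a private read-only mapping, and report whether
 * the page is resident afterwards.
 * Returns 1 if resident, 0 if not, -1 on error. The file is removed again.
 */
static int hole_page_resident_after_read(const char *path)
{
    long page = sysconf(_SC_PAGESIZE);
    unsigned char vec = 0;
    int result = -1;

    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    /* Extending an empty file with ftruncate() leaves the whole page a hole. */
    if (ftruncate(fd, page) == 0) {
        void *map = mmap(NULL, page, PROT_READ, MAP_PRIVATE, fd, 0);
        if (map != MAP_FAILED) {
            /* Read fault: a byte read from a hole is always zero. */
            unsigned char byte = *(volatile unsigned char *)map;
            if (byte == 0 && mincore(map, page, &vec) == 0)
                result = vec & 1;   /* bit 0: page is resident in memory */
            munmap(map, page);
        }
    }
    close(fd);
    unlink(path);
    return result;
}
```

On current kernels a call such as `hole_page_resident_after_read("/tmp/hole_resident_demo")` would be expected to report the page as resident even though no data was ever written to it, which is the cost visible in the profile. The exact result can depend on the filesystem and kernel version.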
* Re: Efficient mapping of sparse file holes to zero-pages
2025-02-20 13:47 ` Matthew Wilcox
@ 2025-02-20 20:46 ` David Frank
2025-02-23 1:47 ` Matthew Wilcox
0 siblings, 1 reply; 5+ messages in thread
From: David Frank @ 2025-02-20 20:46 UTC (permalink / raw)
To: Matthew Wilcox; +Cc: linux-mm, linux-kernel
Thank you, Matthew, for your reply.

What do you think about the complexity of this task? I'd be interested
in taking a look but I don't have kernel development experience so I
would need guidance.
* Re: Efficient mapping of sparse file holes to zero-pages
2025-02-20 20:46 ` David Frank
@ 2025-02-23 1:47 ` Matthew Wilcox
2025-02-24 16:17 ` Christoph Hellwig
0 siblings, 1 reply; 5+ messages in thread
From: Matthew Wilcox @ 2025-02-23 1:47 UTC (permalink / raw)
To: David Frank; +Cc: linux-mm, linux-kernel
On Thu, Feb 20, 2025 at 09:46:08PM +0100, David Frank wrote:
> Thank you, Matthew, for your reply.
>
> What do you think about the complexity of this task? I'd be interested
> in taking a look but I don't have kernel development experience so I
> would need guidance.

Unfortunately, I would say this is a high complexity task. At a high
level, I think we'd need:

- Choose a data structure in the VFS to store this range information
  (a tree of some kind)
- Design a protocol such that the VFS can query this information about
  a range of a particular file, and the filesystem can invalidate the
  VFS's knowledge
- Use that range information when performing readahead [1]
- Put zero entries into the page cache
- Handle retrieving zero entries appropriately at all the points which
  currently retrieve folios from the page cache
- Handle tearing down mmaps of zero entries when written to

Probably a few other things, but that's about the size of it.

I started hinting at a way to do the second point, and it was not
well-received.

https://lore.kernel.org/linux-fsdevel/Ytcd2a0RVCccWOmC@casper.infradead.org/
got no responses

https://lore.kernel.org/linux-fsdevel/Zs97qHI-wA1a53Mm@casper.infradead.org/
got a lot of push-back. I consider most of the responses on that thread
to be from people who understand the problems far better than I do, so
I'd need to learn a lot more before making another proposal.

[1] Little secret, almost all reads / page faults are handled by readahead
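The "query plus invalidate" item in the list above can be pictured with a toy userspace model. This is not kernel code and not the protocol proposed in the linked threads. Every name here (`toy_file`, `hole_hint`, `toy_query`, `toy_write`, `toy_is_known_hole`) is hypothetical, and the generation counter is just one assumed way to detect a stale answer:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical cached answer: bytes [start, end) were a hole at generation gen. */
struct hole_hint {
    uint64_t start, end;
    uint64_t gen;
    bool valid;
};

/* Hypothetical per-file state. The "filesystem" owns gen and the single hole. */
struct toy_file {
    uint64_t gen;                   /* bumped on every layout change */
    uint64_t hole_start, hole_end;  /* the one hole this toy file has */
    struct hole_hint hint;          /* the cache on the "VFS" side */
};

/* "VFS queries the filesystem": remember the hole and when it was seen. */
static void toy_query(struct toy_file *f)
{
    f->hint = (struct hole_hint){ f->hole_start, f->hole_end, f->gen, true };
}

/* "Filesystem invalidates": a write changes the layout and bumps gen. */
static void toy_write(struct toy_file *f, uint64_t off)
{
    if (off >= f->hole_start && off < f->hole_end)
        f->hole_end = off;          /* simplification: keep only the part before off */
    f->gen++;
}

/* Trust the hint only if the layout has not changed since it was recorded. */
static bool toy_is_known_hole(const struct toy_file *f, uint64_t off)
{
    const struct hole_hint *h = &f->hint;
    return h->valid && h->gen == f->gen && off >= h->start && off < h->end;
}
```

The model is single-threaded, so it sidesteps the hard part: in the kernel the check in `toy_is_known_hole()` and the update in `toy_write()` could race, and deciding what locking or ordering makes that safe is the open design question.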
* Re: Efficient mapping of sparse file holes to zero-pages
2025-02-23 1:47 ` Matthew Wilcox
@ 2025-02-24 16:17 ` Christoph Hellwig
0 siblings, 0 replies; 5+ messages in thread
From: Christoph Hellwig @ 2025-02-24 16:17 UTC (permalink / raw)
To: Matthew Wilcox; +Cc: David Frank, linux-mm, linux-kernel
On Sun, Feb 23, 2025 at 01:47:52AM +0000, Matthew Wilcox wrote:
> - Choose a data structure in the VFS to store this range information
>   (a tree of some kind)
> - Design a protocol such that the VFS can query this information about
>   a range of a particular file, and the filesystem can invalidate the
>   VFS's knowledge

That information is always going to be incoherent in some way. Reads
are already done without i_rwsem for most file systems and there is
further work on reducing locking. So anything needs to come from inside
the file system itself. That probably means you can't reduce the folio
allocation overhead, but at least you don't have to persistently use
the memory.