Efficient mapping of sparse file holes to zero-pages

* Efficient mapping of sparse file holes to zero-pages
@ 2025-02-20 12:48 David Frank
  2025-02-20 13:47 ` Matthew Wilcox
  0 siblings, 1 reply; 5+ messages in thread
From: David Frank @ 2025-02-20 12:48 UTC (permalink / raw)
  To: linux-mm, linux-kernel

Hi all,

I'd like to efficiently mmap a large sparse file (ext4), 95% of which
is holes. I was unsatisfied with the performance and after profiling,
I found that most of the time is spent in filemap_add_folio and
filemap_alloc_folio - much more than in my algorithm:

 - 97.87% filemap_fault
    - 97.57% do_sync_mmap_readahead
       - page_cache_ra_order
          - 97.28% page_cache_ra_unbounded
             - 40.80% filemap_add_folio
                + 21.93% __filemap_add_folio
                + 8.88% folio_add_lru
                + 7.56% workingset_refault
             + 28.73% filemap_alloc_folio
             + 22.34% read_pages
             + 3.29% xa_load

As a workaround, I started using lseek with SEEK_HOLE and SEEK_DATA and
changed the algorithm to use a static array filled with zeros instead
of reading from the holes. This is ~30x faster; however, it
introduces substantial complexity in the implementation. I was
wondering whether mapping holes to zero pages with COW in the kernel is
being considered.
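
A minimal sketch of the userspace workaround described above, assuming
Linux and a filesystem that supports SEEK_DATA/SEEK_HOLE. The helper
names data_extents and read_range are hypothetical and are not taken
from the original program; the zero-filled bytearray stands in for the
static array of zeros.

```python
import errno
import os


def data_extents(fd, size):
    """Yield (start, end) byte ranges that the filesystem reports as data."""
    offset = 0
    while offset < size:
        try:
            start = os.lseek(fd, offset, os.SEEK_DATA)
        except OSError as exc:
            if exc.errno == errno.ENXIO:  # no data at or after offset
                return
            raise
        # The end of the file counts as a hole, so this always succeeds.
        end = os.lseek(fd, start, os.SEEK_HOLE)
        yield start, end
        offset = end


def read_range(fd, extents, offset, length):
    """Return `length` bytes at `offset`, reading only from data extents."""
    out = bytearray(length)  # zero-filled: stands in for the holes
    for start, end in extents:
        lo = max(start, offset)
        hi = min(end, offset + length)
        if lo < hi:
            out[lo - offset:hi - offset] = os.pread(fd, hi - lo, lo)
    return bytes(out)
```

The extent list is only valid while nobody else modifies the file; a
concurrent write or hole punch makes it stale.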

I found [a related thread][1] from early 2022 which mentions mapping
to zero pages for shared memory objects. There seemed to be some
concerns about the complexity; I wonder whether it's different for
(even just private/readonly) mmap.

[1]: https://lore.kernel.org/lkml/4b1885b8-eb95-c50-2965-11e7c8efbf36@google.com/T/

Thanks,
David


* Re: Efficient mapping of sparse file holes to zero-pages
  2025-02-20 12:48 Efficient mapping of sparse file holes to zero-pages David Frank
@ 2025-02-20 13:47 ` Matthew Wilcox
  2025-02-20 20:46   ` David Frank
  0 siblings, 1 reply; 5+ messages in thread
From: Matthew Wilcox @ 2025-02-20 13:47 UTC (permalink / raw)
  To: David Frank; +Cc: linux-mm, linux-kernel

On Thu, Feb 20, 2025 at 01:48:18PM +0100, David Frank wrote:
> I'd like to efficiently mmap a large sparse file (ext4), 95% of which
> is holes. I was unsatisfied with the performance and after profiling,
> I found that most of the time is spent in filemap_add_folio and
> filemap_alloc_folio - much more than in my algorithm:
> 
>  - 97.87% filemap_fault
>     - 97.57% do_sync_mmap_readahead
>        - page_cache_ra_order
>           - 97.28% page_cache_ra_unbounded
>              - 40.80% filemap_add_folio
>                 + 21.93% __filemap_add_folio
>                 + 8.88% folio_add_lru
>                 + 7.56% workingset_refault
>              + 28.73% filemap_alloc_folio
>              + 22.34% read_pages
>              + 3.29% xa_load

Yes, this is expected.

The fundamental problem is that we don't have the sparseness information
at the right point.  So the read request (or pagefault) comes in, the
VFS allocates a page, puts it in the pagecache, then asks the filesystem
to fill it.  The filesystem knows, so could theoretically tell the VFS
"Oh, this is a hole", but by this point the "damage" is done -- the page
has been allocated and added to the page cache.
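
The ordering described above can be illustrated with a toy model in
Python. This is not kernel code: ToyFilesystem, ToyPageCache and their
methods are hypothetical stand-ins, and the only point is that the
allocation and the cache insertion happen before the filesystem gets a
chance to say "this is a hole".

```python
PAGE_SIZE = 4096
ZERO_PAGE = bytes(PAGE_SIZE)


class ToyFilesystem:
    """Knows which page indices hold data; every other index is a hole."""

    def __init__(self, blocks):
        self.blocks = blocks  # dict: page index -> PAGE_SIZE bytes

    def read_folio(self, index, folio):
        data = self.blocks.get(index)
        if data is None:
            folio[:] = ZERO_PAGE  # hole: zero-fill the already-allocated page
            return "hole"
        folio[:] = data
        return "data"


class ToyPageCache:
    def __init__(self, fs):
        self.fs = fs
        self.folios = {}
        self.allocations = 0

    def fault(self, index):
        folio = self.folios.get(index)
        if folio is None:
            folio = bytearray(PAGE_SIZE)      # 1. allocate a page
            self.allocations += 1
            self.folios[index] = folio        # 2. add it to the page cache
            self.fs.read_folio(index, folio)  # 3. only now ask the filesystem
        return folio


# One data page out of 20: all 20 faults still allocate a page.
cache = ToyPageCache(ToyFilesystem({0: b"\x01" * PAGE_SIZE}))
for i in range(20):
    cache.fault(i)
print(cache.allocations)
```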

Of course, this is a soluble problem.  The VFS could ask the filesystem
for its sparseness information (as you do in userspace), but unlike your
particular usecase, the kernel must handle attackers who are trying to
make it do the wrong thing as well as ill-timed writes.  So the VFS has
to ensure it does not use stale data from the filesystem.

This is a problem I'm somewhat interested in solving, but I'm a bit
busy with folios right now.  And once that project is done, improving
the page cache for reflinked files is next on my list, so I'm not likely
to get to this problem for a few years.


* Re: Efficient mapping of sparse file holes to zero-pages
  2025-02-20 13:47 ` Matthew Wilcox
@ 2025-02-20 20:46   ` David Frank
  2025-02-23  1:47     ` Matthew Wilcox
  0 siblings, 1 reply; 5+ messages in thread
From: David Frank @ 2025-02-20 20:46 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-mm, linux-kernel

Thank you, Matthew, for your reply.

What do you think about the complexity of this task? I'd be interested
in taking a look but I don't have kernel development experience so I
would need guidance.



* Re: Efficient mapping of sparse file holes to zero-pages
  2025-02-20 20:46   ` David Frank
@ 2025-02-23  1:47     ` Matthew Wilcox
  2025-02-24 16:17       ` Christoph Hellwig
  0 siblings, 1 reply; 5+ messages in thread
From: Matthew Wilcox @ 2025-02-23  1:47 UTC (permalink / raw)
  To: David Frank; +Cc: linux-mm, linux-kernel

On Thu, Feb 20, 2025 at 09:46:08PM +0100, David Frank wrote:
> Thank you, Matthew, for your reply.
> 
> What do you think about the complexity of this task? I'd be interested
> in taking a look but I don't have kernel development experience so I
> would need guidance.

Unfortunately, I would say this is a high complexity task.  At a high
level, I think we'd need:

 - Choose a data structure in the VFS to store this range information
   (a tree of some kind)
 - Design a protocol such that the VFS can query this information about
   a range of a particular file, and the filesystem can invalidate the
   VFS's knowledge
 - Use that range information when performing readahead [1]
 - Put zero entries into the page cache
 - Handle retrieving zero entries appropriately at all the points which
   currently retrieve folios from the page cache
 - Handle tearing down mmaps of zero entries when written to

Probably a few other things, but that's about the size of it.
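
To make the first two points concrete, here is a toy userspace sketch
in Python, not a proposal for the kernel. HoleRanges is a hypothetical
name, a sorted list stands in for "a tree of some kind", and the
locking and coherency problems that make the real task hard are not
modelled at all.

```python
import bisect


class HoleRanges:
    """Sorted, non-overlapping [start, end) page ranges known to be holes."""

    def __init__(self):
        self.starts = []
        self.ends = []

    def add(self, start, end):
        """Record that the filesystem reported pages [start, end) as a hole."""
        self.invalidate(start, end)  # drop overlaps so ranges stay disjoint
        i = bisect.bisect_left(self.starts, start)
        self.starts.insert(i, start)
        self.ends.insert(i, end)

    def is_hole(self, index):
        """Query: is page `index` inside a range known to be a hole?"""
        i = bisect.bisect_right(self.starts, index) - 1
        return i >= 0 and index < self.ends[i]

    def invalidate(self, start, end):
        """Forget everything known about pages [start, end), e.g. on a write."""
        kept = []
        for s, e in zip(self.starts, self.ends):
            if e <= start or s >= end:
                kept.append((s, e))  # no overlap: keep unchanged
                continue
            if s < start:
                kept.append((s, start))  # keep the part left of the range
            if e > end:
                kept.append((end, e))    # keep the part right of the range
        self.starts = [s for s, _ in kept]
        self.ends = [e for _, e in kept]
```

In this sketch a reader would call is_hole() before allocating a page,
and the filesystem would call invalidate() whenever a write or
allocation lands in a range. Making those two calls safe against each
other without extra locking is the part the sketch leaves out.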

I started hinting at a way to do the second point, and it was not
well-received.

https://lore.kernel.org/linux-fsdevel/Ytcd2a0RVCccWOmC@casper.infradead.org/
got no responses
https://lore.kernel.org/linux-fsdevel/Zs97qHI-wA1a53Mm@casper.infradead.org/
got a lot of push-back.

I consider most of the responses on that thread to be from people who
understand the problems far better than I do, so I'd need to learn a
lot more before making another proposal.

[1] Little secret, almost all reads / page faults are handled by
readahead



* Re: Efficient mapping of sparse file holes to zero-pages
  2025-02-23  1:47     ` Matthew Wilcox
@ 2025-02-24 16:17       ` Christoph Hellwig
  0 siblings, 0 replies; 5+ messages in thread
From: Christoph Hellwig @ 2025-02-24 16:17 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: David Frank, linux-mm, linux-kernel

On Sun, Feb 23, 2025 at 01:47:52AM +0000, Matthew Wilcox wrote:
>  - Choose a data structure in the VFS to store this range information
>    (a tree of some kind)
>  - Design a protocol such that the VFS can query this information about
>    a range of a particular file, and the filesystem can invalidate the
>    VFS's knowledge

That information is always going to be incoherent in some way.  Reads
are already done without i_rwsem for most file systems and there is
further work on reducing locking.  So anything needs to come from inside
the file system itself.  That probably means you can't reduce the folio
allocation overhead, but at least you don't have to persistently use the
memory.

