linux-mm.kvack.org archive mirror
* Re: Issue with 8K folio size in __filemap_get_folio()
       [not found] <B467D07C-00D2-47C6-A034-2D88FE88A092@dubeyko.com>
@ 2023-12-03 21:27 ` Matthew Wilcox
  2023-12-03 23:12   ` Matthew Wilcox
  0 siblings, 1 reply; 7+ messages in thread
From: Matthew Wilcox @ 2023-12-03 21:27 UTC (permalink / raw)
  To: Viacheslav Dubeyko; +Cc: Linux FS Devel, linux-mm

On Sun, Dec 03, 2023 at 11:11:14PM +0300, Viacheslav Dubeyko wrote:
> So, why do we always correct the order to zero if the order is equal to one?
> It sounds to me like incorrect logic. Even if we consider the trouble
> with memory allocation, we would then try to allocate, for example, 16K,
> skip 8K, and finally try to allocate 4K. This logic puzzles me anyway.
> Am I missing something here?

The problem I'm trying to avoid is that when we allocate an order-1
folio, we only have two struct pages for the folio.  At the moment, we
have the deferred_list stored in the third struct page of the folio.
There isn't quite space to squeeze it into the second struct page on
32-bit systems -- we have

_flags_1		flags
_head_1			lru.head
_folio_avail		lru.tail
_entire_mapcount	mapping
_nr_pages_mapped	index
_pincount		private

after that we'd start to collide with _mapcount (which is still used
per-page) and _refcount (which must be 0 on tail pages).  We have one
word available (_folio_avail), but we'd need two to fit a list_head.
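
To make the collision concrete, here is a simplified sketch of that
overlay, assuming the field names above and atomic_t for the
mapcount/pincount fields; the real struct folio in
include/linux/mm_types.h has more fields, unions and CONFIG_64BIT
parts than shown:

struct folio_second_page_sketch {
	unsigned long _flags_1;		/* overlays page[1].flags */
	unsigned long _head_1;		/* overlays the first lru word */
	unsigned long _folio_avail;	/* overlays the second lru word: the one spare word */
	atomic_t _entire_mapcount;	/* overlays page[1].mapping */
	atomic_t _nr_pages_mapped;	/* overlays page[1].index */
	atomic_t _pincount;		/* overlays page[1].private */
	/*
	 * The next two words overlay page[1]._mapcount and
	 * page[1]._refcount, so the two-word struct list_head
	 * _deferred_list (which today lives in the third struct page)
	 * cannot be squeezed in here on 32-bit.
	 */
};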

We might be able to reengineer things so that we don't need a list_head
for deferred_list.  eg we could store deferred-split folios in an
xarray and only store the index in the folio.  Or we could decline to
deferred-split order-1 folios at all.  I haven't looked into it in
detail, but I feel like this might be an acceptable approach.
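
For the xarray idea, a minimal sketch of the tracking side might look
like the below (purely illustrative: hypothetical helper names, a
single global xarray and GFP_ATOMIC for brevity, whereas the real
deferred split queues are per-node/per-memcg and take split_queue_lock):

static DEFINE_XARRAY_ALLOC(deferred_split_xa);

/* Hypothetical: remember a deferred-split folio by xarray index only. */
static int deferred_split_track(struct folio *folio)
{
	u32 index;
	int err;

	err = xa_alloc(&deferred_split_xa, &index, folio,
		       xa_limit_32b, GFP_ATOMIC);
	if (!err)
		folio->_folio_avail = index;	/* the one spare word */
	return err;
}

/* Hypothetical: drop the folio from the xarray using the stored index. */
static void deferred_split_untrack(struct folio *folio)
{
	xa_erase(&deferred_split_xa, folio->_folio_avail);
}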

I was talking with Darrick on Friday and he convinced me that this is
something we're going to need to fix sooner rather than later for the
benefit of devices with block size 8kB.  So it's definitely on my todo
list, but I haven't investigated in any detail yet.

Thanks for raising the issue; it gave me a good opportunity to explain
my current thinking on the problem.  Adding linux-mm for insights from
the MM audience.



* Re: Issue with 8K folio size in __filemap_get_folio()
  2023-12-03 21:27 ` Issue with 8K folio size in __filemap_get_folio() Matthew Wilcox
@ 2023-12-03 23:12   ` Matthew Wilcox
  2023-12-04  5:57     ` Hugh Dickins
  2023-12-04 15:09     ` David Hildenbrand
  0 siblings, 2 replies; 7+ messages in thread
From: Matthew Wilcox @ 2023-12-03 23:12 UTC (permalink / raw)
  To: Viacheslav Dubeyko
  Cc: Linux FS Devel, linux-mm, Hugh Dickins, Kirill A. Shutemov

On Sun, Dec 03, 2023 at 09:27:57PM +0000, Matthew Wilcox wrote:
> I was talking with Darrick on Friday and he convinced me that this is
> something we're going to need to fix sooner rather than later for the
> benefit of devices with block size 8kB.  So it's definitely on my todo
> list, but I haven't investigated in any detail yet.

OK, here's my initial analysis of just not putting order-1 folios
on the deferred split list.  folio->_deferred_list is only used in
mm/huge_memory.c, which makes this a nice simple analysis.

 - folio_prep_large_rmappable() initialises the list_head.  No problem,
   just don't do that for order-1 folios.
 - split_huge_page_to_list() will remove the folio from the split queue.
   No problem, just don't do that.
 - folio_undo_large_rmappable() removes it from the list if it's
   on the list.  Again, no problem, don't do that for order-1 folios.
 - deferred_split_scan() walks the list, it won't find any order-1
   folios.

 - deferred_split_folio() will add the folio to the list.  Returning
   here will avoid adding the folio to the list.  But what consequences
   will that have?  Ah.  There's only one caller of
   deferred_split_folio() and it's in page_remove_rmap() ... and it's
   only called for anon folios anyway.

So it looks like we can support order-1 folios in the page cache without
any change in behaviour since file-backed folios were never added to
the deferred split list.
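
A minimal sketch of those guards, assuming the approach above
(illustrative only, not a tested patch; the bodies of the existing
functions are elided):

/* Don't initialise a _deferred_list the order-1 folio has no room for. */
void folio_prep_large_rmappable(struct folio *folio)
{
	if (folio_order(folio) > 1)
		INIT_LIST_HEAD(&folio->_deferred_list);
	folio_set_large_rmappable(folio);
}

/* Never queue an order-1 folio for deferred split ... */
void deferred_split_folio(struct folio *folio)
{
	if (folio_order(folio) <= 1)
		return;

	/* ... existing queueing logic unchanged ... */
}

/* ... and never try to unlink one when freeing or splitting. */
void folio_undo_large_rmappable(struct folio *folio)
{
	if (folio_order(folio) <= 1)
		return;

	/* ... existing list removal unchanged ... */
}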

Now, is this the right direction?  Is it a bug that we never called
deferred_split_folio() for pagecache folios?  I would defer to Hugh
or Kirill on this.  Ccs added.



* Re: Issue with 8K folio size in __filemap_get_folio()
  2023-12-03 23:12   ` Matthew Wilcox
@ 2023-12-04  5:57     ` Hugh Dickins
  2023-12-04 15:09     ` David Hildenbrand
  1 sibling, 0 replies; 7+ messages in thread
From: Hugh Dickins @ 2023-12-04  5:57 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Viacheslav Dubeyko, Linux FS Devel, linux-mm, Hugh Dickins,
	Kirill A. Shutemov

On Sun, 3 Dec 2023, Matthew Wilcox wrote:
> On Sun, Dec 03, 2023 at 09:27:57PM +0000, Matthew Wilcox wrote:
> > I was talking with Darrick on Friday and he convinced me that this is
> > something we're going to need to fix sooner rather than later for the
> > benefit of devices with block size 8kB.  So it's definitely on my todo
> > list, but I haven't investigated in any detail yet.
> 
> OK, here's my initial analysis of just not putting order-1 folios
> on the deferred split list.  folio->_deferred_list is only used in
> mm/huge_memory.c, which makes this a nice simple analysis.
> 
>  - folio_prep_large_rmappable() initialises the list_head.  No problem,
>    just don't do that for order-1 folios.
>  - split_huge_page_to_list() will remove the folio from the split queue.
>    No problem, just don't do that.
>  - folio_undo_large_rmappable() removes it from the list if it's
>    on the list.  Again, no problem, don't do that for order-1 folios.
>  - deferred_split_scan() walks the list, it won't find any order-1
>    folios.
> 
>  - deferred_split_folio() will add the folio to the list.  Returning
>    here will avoid adding the folio to the list.  But what consequences
>    will that have?  Ah.  There's only one caller of
>    deferred_split_folio() and it's in page_remove_rmap() ... and it's
>    only called for anon folios anyway.

Yes, the deferred split business is quite nicely localized,
which makes it easy to avoid.

> 
> So it looks like we can support order-1 folios in the page cache without
> any change in behaviour since file-backed folios were never added to
> the deferred split list.

Yes, you just have to code in a few "don't do that"s as above.

And I think it would be reasonable to allow even anon order-1 folios,
but they'd just be prevented from participating in the deferred split
business.

Allowing a PMD to be split at any time is necessary for correctness;
then allowing the underlying folio to be split is a nice-to-have when
trying to meet competition for memory, but it's not a correctness issue,
and the smaller the folio to be split, the less the saving - a lot of
unneeded order-1s could still waste a lot of memory, but it's not so
serious as with the order-9s.

> 
> Now, is this the right direction?  Is it a bug that we never called
> deferred_split_folio() for pagecache folios?  I would defer to Hugh
> or Kirill on this.  Ccs added.

No, not a bug.

The thing with anon folios is that they only have value while mapped
(let's ignore swap and GUP for simplicity), so when page_remove_rmap()
is called in the unmapping, that's a good hint that the hugeness of
the folio is no longer worthwhile; but it's a bad moment for splitting
because of locks held, and quite possibly a stupid time for splitting
too (because it may be just a part of unmapping a large range, when splitting
would merely slow it all down).  Hence anon's deferred split queue.

But for file pages, the file contents must retain their value whether
mapped or not.  The moment to consider splitting the folio itself is
when truncating or punching a hole; and it happens that there is not
a locking problem then, nor any overhead added to eviction.  Hence no
deferred split queue for file.

Shmem does have its per-superblock shmem_unused_huge_shrink(), for
splitting huge folios at EOF when under memory pressure (and it would
be better if it were not restricted to EOF, but could retry folios
which had been EBUSY or EAGAIN at the time of hole-punch).  Maybe
other large folio filesystems should also have such a shrinker.

Hugh



* Re: Issue with 8K folio size in __filemap_get_folio()
  2023-12-03 23:12   ` Matthew Wilcox
  2023-12-04  5:57     ` Hugh Dickins
@ 2023-12-04 15:09     ` David Hildenbrand
  2023-12-04 16:51       ` David Hildenbrand
  2023-12-04 17:17       ` Matthew Wilcox
  1 sibling, 2 replies; 7+ messages in thread
From: David Hildenbrand @ 2023-12-04 15:09 UTC (permalink / raw)
  To: Matthew Wilcox, Viacheslav Dubeyko
  Cc: Linux FS Devel, linux-mm, Hugh Dickins, Kirill A. Shutemov

On 04.12.23 00:12, Matthew Wilcox wrote:
> On Sun, Dec 03, 2023 at 09:27:57PM +0000, Matthew Wilcox wrote:
>> I was talking with Darrick on Friday and he convinced me that this is
>> something we're going to need to fix sooner rather than later for the
>> benefit of devices with block size 8kB.  So it's definitely on my todo
>> list, but I haven't investigated in any detail yet.
> 
> OK, here's my initial analysis of just not putting order-1 folios
> on the deferred split list.  folio->_deferred_list is only used in
> mm/huge_memory.c, which makes this a nice simple analysis.
> 
>   - folio_prep_large_rmappable() initialises the list_head.  No problem,
>     just don't do that for order-1 folios.
>   - split_huge_page_to_list() will remove the folio from the split queue.
>     No problem, just don't do that.
>   - folio_undo_large_rmappable() removes it from the list if it's
>     on the list.  Again, no problem, don't do that for order-1 folios.
>   - deferred_split_scan() walks the list, it won't find any order-1
>     folios.
> 
>   - deferred_split_folio() will add the folio to the list.  Returning
>     here will avoid adding the folio to the list.  But what consequences
>     will that have?  Ah.  There's only one caller of
>     deferred_split_folio() and it's in page_remove_rmap() ... and it's
>     only called for anon folios anyway.
> 
> So it looks like we can support order-1 folios in the page cache without
> any change in behaviour since file-backed folios were never added to
> the deferred split list.

I think for the pagecache it should work. In the context of [1], a total 
mapcount would likely still be possible. Anything beyond that likely 
not, if we ever care.

[1] https://lkml.kernel.org/r/20231124132626.235350-1-david@redhat.com

-- 
Cheers,

David / dhildenb




* Re: Issue with 8K folio size in __filemap_get_folio()
  2023-12-04 15:09     ` David Hildenbrand
@ 2023-12-04 16:51       ` David Hildenbrand
  2023-12-04 17:17       ` Matthew Wilcox
  1 sibling, 0 replies; 7+ messages in thread
From: David Hildenbrand @ 2023-12-04 16:51 UTC (permalink / raw)
  To: Matthew Wilcox, Viacheslav Dubeyko
  Cc: Linux FS Devel, linux-mm, Hugh Dickins, Kirill A. Shutemov

On 04.12.23 16:09, David Hildenbrand wrote:
> On 04.12.23 00:12, Matthew Wilcox wrote:
>> On Sun, Dec 03, 2023 at 09:27:57PM +0000, Matthew Wilcox wrote:
>>> I was talking with Darrick on Friday and he convinced me that this is
>>> something we're going to need to fix sooner rather than later for the
>>> benefit of devices with block size 8kB.  So it's definitely on my todo
>>> list, but I haven't investigated in any detail yet.
>>
>> OK, here's my initial analysis of just not putting order-1 folios
>> on the deferred split list.  folio->_deferred_list is only used in
>> mm/huge_memory.c, which makes this a nice simple analysis.
>>
>>    - folio_prep_large_rmappable() initialises the list_head.  No problem,
>>      just don't do that for order-1 folios.
>>    - split_huge_page_to_list() will remove the folio from the split queue.
>>      No problem, just don't do that.
>>    - folio_undo_large_rmappable() removes it from the list if it's
>>      on the list.  Again, no problem, don't do that for order-1 folios.
>>    - deferred_split_scan() walks the list, it won't find any order-1
>>      folios.
>>
>>    - deferred_split_folio() will add the folio to the list.  Returning
>>      here will avoid adding the folio to the list.  But what consequences
>>      will that have?  Ah.  There's only one caller of
>>      deferred_split_folio() and it's in page_remove_rmap() ... and it's
>>      only called for anon folios anyway.
>>
>> So it looks like we can support order-1 folios in the page cache without
>> any change in behaviour since file-backed folios were never added to
>> the deferred split list.
> 
> I think for the pagecache it should work. In the context of [1], a total
> mapcount would likely still be possible. Anything beyond that likely
> not, if we ever care.

Thinking about it, maybe 64bit could be made to work. Anyhow, it's just
something to keep in mind once we want to support order-1 folios: memmap
space can start becoming a problem again.

-- 
Cheers,

David / dhildenb




* Re: Issue with 8K folio size in __filemap_get_folio()
  2023-12-04 15:09     ` David Hildenbrand
  2023-12-04 16:51       ` David Hildenbrand
@ 2023-12-04 17:17       ` Matthew Wilcox
  2023-12-04 17:22         ` David Hildenbrand
  1 sibling, 1 reply; 7+ messages in thread
From: Matthew Wilcox @ 2023-12-04 17:17 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Viacheslav Dubeyko, Linux FS Devel, linux-mm, Hugh Dickins,
	Kirill A. Shutemov

On Mon, Dec 04, 2023 at 04:09:36PM +0100, David Hildenbrand wrote:
> I think for the pagecache it should work. In the context of [1], a total
> mapcount would likely still be possible. Anything beyond that likely not, if
> we ever care.

I confess I hadn't gone through your patches.

https://lore.kernel.org/all/20231124132626.235350-8-david@redhat.com/

is the critical one.  It's actually going to walk off the end of order-2
folios today (which we'll create, eg with XFS).

You can put _rmap_val0 and _rmap_val1 in page2 and _rmap_val2-5 in page3
to fix this.  Once we're allocating order-1 folios, I think you can
avoid this scheme and just check page0 and page1 mapcounts independently.
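
As a purely hypothetical illustration of that last point (not code from
your series; it assumes today's per-page _mapcount semantics and the
folio_page() helper), a total mapcount for an order-1 folio could be
read straight from its two pages, with no extra tracking fields:

/*
 * Hypothetical helper, illustrative only: an order-1 folio cannot be
 * PMD-mapped, so its total mapcount is just the sum of the two per-page
 * mapcounts (_mapcount is -1 when a page is unmapped, hence the +1s).
 */
static int order1_total_mapcount(struct folio *folio)
{
	return (atomic_read(&folio_page(folio, 0)->_mapcount) + 1) +
	       (atomic_read(&folio_page(folio, 1)->_mapcount) + 1);
}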



* Re: Issue with 8K folio size in __filemap_get_folio()
  2023-12-04 17:17       ` Matthew Wilcox
@ 2023-12-04 17:22         ` David Hildenbrand
  0 siblings, 0 replies; 7+ messages in thread
From: David Hildenbrand @ 2023-12-04 17:22 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Viacheslav Dubeyko, Linux FS Devel, linux-mm, Hugh Dickins,
	Kirill A. Shutemov

On 04.12.23 18:17, Matthew Wilcox wrote:
> On Mon, Dec 04, 2023 at 04:09:36PM +0100, David Hildenbrand wrote:
>> I think for the pagecache it should work. In the context of [1], a total
>> mapcount would likely still be possible. Anything beyond that likely not, if
>> we ever care.
> 
> I confess I hadn't gone through your patches.
> 
> https://lore.kernel.org/all/20231124132626.235350-8-david@redhat.com/
> 
> is the critical one.  It's actually going to walk off the end of order-2
> folios today (which we'll create, eg with XFS).

Note that order-2 only uses _rmap_val0. So that *should* work as
expected (and that layout resulted in the best cache behavior, weirdly 
enough).

> 
> You can put _rmap_val0 and _rmap_val1 in page2 and _rmap_val2-5 in page3
> to fix this.  Once we're allocating order-1 folios, I think you can
> avoid this scheme and just check page0 and page1 mapcounts independently.

I could put _rmap_val0 in page1 after I get rid of _folio_nr_pages.
That would make it work on 64bit.

But yeah, it feels kind of stupid to do elaborate tracking for "it's 2 
pages". OTOH, less special-casing can be preferable.

-- 
Cheers,

David / dhildenb



