linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* PageOffline: refcount, flags and memdesc
@ 2024-11-14 11:18 David Hildenbrand
  2024-11-14 15:23 ` Matthew Wilcox
  0 siblings, 1 reply; 5+ messages in thread
From: David Hildenbrand @ 2024-11-14 11:18 UTC (permalink / raw)
  To: linux-mm; +Cc: Matthew Wilcox

Hi,

I'm currently staring again at PageOffline and wonder how we could 
prepare it for the memdesc future, and if we can remove refcount handling.


Currently, we set PageOffline in the following cases (one nast exception 
below):

(a) Memory blocks gets onlined, whereby we initialize all "struct pages
     to PageOffline + refcount of 1: memmap_init_range(). These pages are
     expected to get onlined via generic_online_page() later. Drivers
     might decide to leave some offline, because they are not backed by
     actual memory in the hypervisor. Some drivers still use free_page()
     instead of generic_online_page().

(b) We allocated pages (alloc_page(), alloc_contig_pages() ...) to
     logically offline them, whereby the refcount is set to 1 by the
     buddy and to PageOffline is set manually be the driver afterwards.

We clear PageOffline in the following cases (one nasty exception below):

(a) We want to return a page to the buddy (free_page/
     free_contig_page_range).
     PageOffline is cleared by the driver and freeing the page will
     decrement the refcount to 0.
(b) We want to expose it to the buddy the first time
     (generic_online_page). We will force the refcount to 0.

There are still subtle differences between onlining a page the first 
time to the buddy, such as debug_pagealloc_map_pages() in 
__free_pages_core(). I'm hoping we can get rid of them long-term, or 
just abstract it internally.


I'd like to stop using the refcount for PageOffline pages, and keep the 
refcount always at 0.

But the refcount, it is currently used to detect whether we are allowed 
to offline memory blocks that contain PageOffline pages, because only 
selected drivers support re-onlining. Well, and it is used when 
returning the pages to the buddy where 
free_page()/free_contig_range().... expect a refcount of 1.

Further, virtio-mem currently uses the PageDirty() bit to remember if a 
PageOffline page was already exposed to the buddy before, or if we must 
use generic_online_page().

For now we would need the following information, that could be stored in 
2 flags, leaving the refcount at 0:

(1) Was it obtained from the buddy or never exposed it to the buddy

PageOffline() && PageOfflineNeverOnlined()

(2) The driver does support actual memory offlining+reonlining, they can
     be skipped when offlining.

PageOffline() && PageOfflineSkippable


But when allocating/freeing pages we would still mess with the refcount, 
which is bad.

We could have a dedicated interface for freeing them, where we abstract 
the generic_online_page() bits, and leave the refcount at 0:

free_offline_page()
free_offline_page_range()

And

alloc_offline_page()
alloc_offline_page_range()
alloc_offline_pages

I'm not super happy about the "alloc/free" terminology, but nothing 
better came to mind.


There is one complication to sort out: balloon_compaction.h supports 
moving PageOffline pages, and seems to use the page lock, page refcount, 
page lru, page private... which is all rather nasty. I wonder if these 
should get their own page type, like PageMovableOffline, and we'd mostly 
leave them alone for now. This would mean that virtio-balloon, 
vmware-balloon and ppc CMM would keep doing the old refcount-based thing 
but with a new page type.


I assume this all goes into the direction of getting pages from the 
buddy and returning them without refcounts  ... thoughts?

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: PageOffline: refcount, flags and memdesc
  2024-11-14 11:18 PageOffline: refcount, flags and memdesc David Hildenbrand
@ 2024-11-14 15:23 ` Matthew Wilcox
  2024-11-14 15:55   ` David Hildenbrand
  0 siblings, 1 reply; 5+ messages in thread
From: Matthew Wilcox @ 2024-11-14 15:23 UTC (permalink / raw)
  To: David Hildenbrand; +Cc: linux-mm

On Thu, Nov 14, 2024 at 12:18:15PM +0100, David Hildenbrand wrote:
> I'm currently staring again at PageOffline and wonder how we could prepare
> it for the memdesc future, and if we can remove refcount handling.

Thanks for bringing it up.  As a memdesc, I currently have PageOffline
as being type 0 (Misc), subtype 5 (Offline).  That's bits 0-10 and then
bit 11 is for "may be mapped to userspace".  Bits 12-17 are the order.
With the top bits being used for section/node/zone, that could be 25 +
12 + 3 = 40 bits, so we'd have 7 bits remaining for use as flags.

> I'd like to stop using the refcount for PageOffline pages, and keep the
> refcount always at 0.

I think this makes sense.

> But the refcount, it is currently used to detect whether we are allowed to
> offline memory blocks that contain PageOffline pages, because only selected
> drivers support re-onlining. Well, and it is used when returning the pages
> to the buddy where free_page()/free_contig_range().... expect a refcount of
> 1.
> 
> Further, virtio-mem currently uses the PageDirty() bit to remember if a
> PageOffline page was already exposed to the buddy before, or if we must use
> generic_online_page().
> 
> For now we would need the following information, that could be stored in 2
> flags, leaving the refcount at 0:
> 
> (1) Was it obtained from the buddy or never exposed it to the buddy
> 
> PageOffline() && PageOfflineNeverOnlined()
> 
> (2) The driver does support actual memory offlining+reonlining, they can
>     be skipped when offlining.
> 
> PageOffline() && PageOfflineSkippable
> 
> 
> But when allocating/freeing pages we would still mess with the refcount,
> which is bad.
> 
> We could have a dedicated interface for freeing them, where we abstract the
> generic_online_page() bits, and leave the refcount at 0:
> 
> free_offline_page()
> free_offline_page_range()
> 
> And
> 
> alloc_offline_page()
> alloc_offline_page_range()
> alloc_offline_pages
> 
> I'm not super happy about the "alloc/free" terminology, but nothing better
> came to mind.

If I resurrect
https://lore.kernel.org/linux-mm/20220809171854.3725722-1-willy@infradead.org/
would the frozen terminology work for you here?

> There is one complication to sort out: balloon_compaction.h supports moving
> PageOffline pages, and seems to use the page lock, page refcount, page lru,
> page private... which is all rather nasty. I wonder if these should get
> their own page type, like PageMovableOffline, and we'd mostly leave them
> alone for now. This would mean that virtio-balloon, vmware-balloon and ppc
> CMM would keep doing the old refcount-based thing but with a new page type.

It's fairly clear to me now that we have a sane story for moving
file/anon folios.  The current way we handle movable pages looks mostly
insane because it's hammered into that framework,  I think we need
something entirely different to handle movable non-folio pages, but I
don't know what that story is yet.


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: PageOffline: refcount, flags and memdesc
  2024-11-14 15:23 ` Matthew Wilcox
@ 2024-11-14 15:55   ` David Hildenbrand
  2024-11-14 20:29     ` Matthew Wilcox
  0 siblings, 1 reply; 5+ messages in thread
From: David Hildenbrand @ 2024-11-14 15:55 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-mm

On 14.11.24 16:23, Matthew Wilcox wrote:
> On Thu, Nov 14, 2024 at 12:18:15PM +0100, David Hildenbrand wrote:
>> I'm currently staring again at PageOffline and wonder how we could prepare
>> it for the memdesc future, and if we can remove refcount handling.
> 

Hi!

> Thanks for bringing it up.  As a memdesc, I currently have PageOffline
> as being type 0 (Misc), subtype 5 (Offline).  That's bits 0-10 and then
> bit 11 is for "may be mapped to userspace".  Bits 12-17 are the order.
> With the top bits being used for section/node/zone, that could be 25 +
> 12 + 3 = 40 bits, so we'd have 7 bits remaining for use as flags.

"may be mapped to userspace" will always be 0. 1 / 2 flags initially 
would do.

> 
>> I'd like to stop using the refcount for PageOffline pages, and keep the
>> refcount always at 0.
> 
> I think this makes sense.
> 
>> But the refcount, it is currently used to detect whether we are allowed to
>> offline memory blocks that contain PageOffline pages, because only selected
>> drivers support re-onlining. Well, and it is used when returning the pages
>> to the buddy where free_page()/free_contig_range().... expect a refcount of
>> 1.
>>
>> Further, virtio-mem currently uses the PageDirty() bit to remember if a
>> PageOffline page was already exposed to the buddy before, or if we must use
>> generic_online_page().
>>
>> For now we would need the following information, that could be stored in 2
>> flags, leaving the refcount at 0:
>>
>> (1) Was it obtained from the buddy or never exposed it to the buddy
>>
>> PageOffline() && PageOfflineNeverOnlined()
>>
>> (2) The driver does support actual memory offlining+reonlining, they can
>>      be skipped when offlining.
>>
>> PageOffline() && PageOfflineSkippable
>>
>>
>> But when allocating/freeing pages we would still mess with the refcount,
>> which is bad.
>>
>> We could have a dedicated interface for freeing them, where we abstract the
>> generic_online_page() bits, and leave the refcount at 0:
>>
>> free_offline_page()
>> free_offline_page_range()
>>
>> And
>>
>> alloc_offline_page()
>> alloc_offline_page_range()
>> alloc_offline_pages
>>
>> I'm not super happy about the "alloc/free" terminology, but nothing better
>> came to mind.
> 
> If I resurrect
> https://lore.kernel.org/linux-mm/20220809171854.3725722-1-willy@infradead.org/
> would the frozen terminology work for you here?

Ah, I remember that.

I was more concerned about alloc/free terminology, because 
"free_offline_page" could simply be "online_page" :) But the 
"allocation" part is trickier. Maybe it's simply alloc/free of frozen 
pages for the time being.

But yes, the "allocate/free pages without involving refcounts" will be a 
crucial thing to get the PageOffline conversion flying.

Instead of alloc_frozen_pages(), I was wondering if we should have 
something like GFP_FROZEN. For example, for two PG_offline users
I'd currently also need alloc_contig_frozen_range() and 
alloc_contig_frozen_pages(). Using alloc_contig_range(GFP_FROZEN) 
alloc_contig_pages(GFP_PROZEN) would make that easier.

Did you consider that already?

> 
>> There is one complication to sort out: balloon_compaction.h supports moving
>> PageOffline pages, and seems to use the page lock, page refcount, page lru,
>> page private... which is all rather nasty. I wonder if these should get
>> their own page type, like PageMovableOffline, and we'd mostly leave them
>> alone for now. This would mean that virtio-balloon, vmware-balloon and ppc
>> CMM would keep doing the old refcount-based thing but with a new page type.
> 
> It's fairly clear to me now that we have a sane story for moving
> file/anon folios.  The current way we handle movable pages looks mostly
> insane because it's hammered into that framework,  I think we need
> something entirely different to handle movable non-folio pages, but I
> don't know what that story is yet.

Okay, so the first step would be to leave that part alone and convert 
the other (sane :) ) users of PageOffline to not refcount.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: PageOffline: refcount, flags and memdesc
  2024-11-14 15:55   ` David Hildenbrand
@ 2024-11-14 20:29     ` Matthew Wilcox
  2024-11-14 20:45       ` David Hildenbrand
  0 siblings, 1 reply; 5+ messages in thread
From: Matthew Wilcox @ 2024-11-14 20:29 UTC (permalink / raw)
  To: David Hildenbrand; +Cc: linux-mm, Yu Zhao

On Thu, Nov 14, 2024 at 04:55:33PM +0100, David Hildenbrand wrote:
> > Thanks for bringing it up.  As a memdesc, I currently have PageOffline
> > as being type 0 (Misc), subtype 5 (Offline).  That's bits 0-10 and then
> > bit 11 is for "may be mapped to userspace".  Bits 12-17 are the order.
> > With the top bits being used for section/node/zone, that could be 25 +
> > 12 + 3 = 40 bits, so we'd have 7 bits remaining for use as flags.
> 
> "may be mapped to userspace" will always be 0. 1 / 2 flags initially would
> do.

Yes, indeed, would always be 0 for Offline pages.  Probably not for
other pages though, and I'd like to have that property as a standard
flag.

> > > free_offline_page()
> > > free_offline_page_range()
> > > 
> > > And
> > > 
> > > alloc_offline_page()
> > > alloc_offline_page_range()
> > > alloc_offline_pages
> > > 
> > > I'm not super happy about the "alloc/free" terminology, but nothing better
> > > came to mind.
> > 
> > If I resurrect
> > https://lore.kernel.org/linux-mm/20220809171854.3725722-1-willy@infradead.org/
> > would the frozen terminology work for you here?
> 
> Ah, I remember that.
> 
> I was more concerned about alloc/free terminology, because
> "free_offline_page" could simply be "online_page" :) But the "allocation"
> part is trickier. Maybe it's simply alloc/free of frozen pages for the time
> being.
> 
> But yes, the "allocate/free pages without involving refcounts" will be a
> crucial thing to get the PageOffline conversion flying.
> 
> Instead of alloc_frozen_pages(), I was wondering if we should have something
> like GFP_FROZEN. For example, for two PG_offline users
> I'd currently also need alloc_contig_frozen_range() and
> alloc_contig_frozen_pages(). Using alloc_contig_range(GFP_FROZEN)
> alloc_contig_pages(GFP_PROZEN) would make that easier.
> 
> Did you consider that already?

I think Yu Zhao's recent patches:

e98337d11bbd ("mm/contig_alloc: support __GFP_COMP")
463586e9ff39 ("mm/cma: add cma_{alloc,free}_folio()")
cf54f310d0d3 ("mm/hugetlb: use __GFP_COMP for gigantic folios")

get us most of the way there.

What I really want to do is make contig allocation not do anything with
the refcounts.  Hoist that into the callers which actually want it
(probably not many), and then we can actually drop the __GFP_COMP
support to alloc_contig_range_noprof().



^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: PageOffline: refcount, flags and memdesc
  2024-11-14 20:29     ` Matthew Wilcox
@ 2024-11-14 20:45       ` David Hildenbrand
  0 siblings, 0 replies; 5+ messages in thread
From: David Hildenbrand @ 2024-11-14 20:45 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-mm, Yu Zhao

>>
>> Instead of alloc_frozen_pages(), I was wondering if we should have something
>> like GFP_FROZEN. For example, for two PG_offline users
>> I'd currently also need alloc_contig_frozen_range() and
>> alloc_contig_frozen_pages(). Using alloc_contig_range(GFP_FROZEN)
>> alloc_contig_pages(GFP_PROZEN) would make that easier.
>>
>> Did you consider that already?
> 
> I think Yu Zhao's recent patches:
> 
> e98337d11bbd ("mm/contig_alloc: support __GFP_COMP")
> 463586e9ff39 ("mm/cma: add cma_{alloc,free}_folio()")
> cf54f310d0d3 ("mm/hugetlb: use __GFP_COMP for gigantic folios")
> 
> get us most of the way there.
> 
> What I really want to do is make contig allocation not do anything with
> the refcounts.  Hoist that into the callers which actually want it
> (probably not many), and then we can actually drop the __GFP_COMP
> support to alloc_contig_range_noprof().

Right, at least virtio-mem cannot easily support 
alloc_contig_pages(__GFP_COMP; we'd need splitting support, and it 
should be a separate effort than reworking PageOffline; and it would all 
be in vain if we drop __GFP_COMP later again :

So regarding the PageOffline refcount rework it would be easier to do it 
stepwise

1) add GFP_FROZEN support to alloc_contig_*
2) convert the two alloc_contig_*+PageOffline users to use that
3) convert the remaining alloc_contig_* users to GFP_FROZEN
4) make GFP_FROZEN implicit (drop it?) and drop __GFP_COMP

3) is a bigger task, likely too big to do in one shot.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2024-11-14 20:45 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-11-14 11:18 PageOffline: refcount, flags and memdesc David Hildenbrand
2024-11-14 15:23 ` Matthew Wilcox
2024-11-14 15:55   ` David Hildenbrand
2024-11-14 20:29     ` Matthew Wilcox
2024-11-14 20:45       ` David Hildenbrand

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox