* [LSF/MM/BPF TOPIC] Reclamation interactions with RCU @ 2024-02-27 18:56 Paul E. McKenney 2024-02-27 19:19 ` [Lsf-pc] " Amir Goldstein 0 siblings, 1 reply; 52+ messages in thread From: Paul E. McKenney @ 2024-02-27 18:56 UTC (permalink / raw) To: lsf-pc; +Cc: linux-mm Hello! Recent discussions [1] suggest that greater mutual understanding between memory reclaim on the one hand and RCU on the other might be in order. One possibility would be an open discussion. If it would help, I would be happy to describe how RCU reacts and responds to heavy load, along with some ways that RCU's reactions and responses could be enhanced if needed. Thanx, Paul [1] https://lore.kernel.org/all/fb4d944e-fde7-423b-a376-25db0b317398@paulmck-laptop/ ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-02-27 18:56 [LSF/MM/BPF TOPIC] Reclamation interactions with RCU Paul E. McKenney @ 2024-02-27 19:19 ` Amir Goldstein 2024-02-27 22:59 ` Paul E. McKenney 2024-02-28 19:37 ` Matthew Wilcox 0 siblings, 2 replies; 52+ messages in thread From: Amir Goldstein @ 2024-02-27 19:19 UTC (permalink / raw) To: paulmck Cc: lsf-pc, linux-mm, linux-fsdevel, Matthew Wilcox, Kent Overstreet, Jan Kara On Tue, Feb 27, 2024 at 8:56 PM Paul E. McKenney <paulmck@kernel.org> wrote: > > Hello! > > Recent discussions [1] suggest that greater mutual understanding between > memory reclaim on the one hand and RCU on the other might be in order. > > One possibility would be an open discussion. If it would help, I would > be happy to describe how RCU reacts and responds to heavy load, along with > some ways that RCU's reactions and responses could be enhanced if needed. > Adding fsdevel as this should probably be a cross track session. Thanks, Amir. ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-02-27 19:19 ` [Lsf-pc] " Amir Goldstein @ 2024-02-27 22:59 ` Paul E. McKenney 2024-03-01 3:28 ` Kent Overstreet 2024-02-28 19:37 ` Matthew Wilcox 1 sibling, 1 reply; 52+ messages in thread From: Paul E. McKenney @ 2024-02-27 22:59 UTC (permalink / raw) To: Amir Goldstein Cc: lsf-pc, linux-mm, linux-fsdevel, Matthew Wilcox, Kent Overstreet, Jan Kara On Tue, Feb 27, 2024 at 09:19:47PM +0200, Amir Goldstein wrote: > On Tue, Feb 27, 2024 at 8:56 PM Paul E. McKenney <paulmck@kernel.org> wrote: > > > > Hello! > > > > Recent discussions [1] suggest that greater mutual understanding between > > memory reclaim on the one hand and RCU on the other might be in order. > > > > One possibility would be an open discussion. If it would help, I would > > be happy to describe how RCU reacts and responds to heavy load, along with > > some ways that RCU's reactions and responses could be enhanced if needed. > > Adding fsdevel as this should probably be a cross track session. The more, the merrier! ;-) Thanx, Paul ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-02-27 22:59 ` Paul E. McKenney @ 2024-03-01 3:28 ` Kent Overstreet 2024-03-05 2:43 ` Paul E. McKenney 2024-03-05 2:56 ` Yosry Ahmed 0 siblings, 2 replies; 52+ messages in thread From: Kent Overstreet @ 2024-03-01 3:28 UTC (permalink / raw) To: Paul E. McKenney Cc: Amir Goldstein, lsf-pc, linux-mm, linux-fsdevel, Matthew Wilcox, Jan Kara On Tue, Feb 27, 2024 at 02:59:33PM -0800, Paul E. McKenney wrote: > On Tue, Feb 27, 2024 at 09:19:47PM +0200, Amir Goldstein wrote: > > On Tue, Feb 27, 2024 at 8:56 PM Paul E. McKenney <paulmck@kernel.org> wrote: > > > > > > Hello! > > > > > > Recent discussions [1] suggest that greater mutual understanding between > > > memory reclaim on the one hand and RCU on the other might be in order. > > > > > > One possibility would be an open discussion. If it would help, I would > > > be happy to describe how RCU reacts and responds to heavy load, along with > > > some ways that RCU's reactions and responses could be enhanced if needed. > > > > Adding fsdevel as this should probably be a cross track session. > > The more, the merrier! ;-) Starting to look at the practicalities of freeing pages via rcu. Where to stick the rcu_head might be a problem. I think by the time we're in free_unref_page() enough of struct page isn't in use that we can just... throw it in the struct page union somewhere (ok, actually there's one already for ZONE_DEVICE pages; we just have to make sure it doesn't collide with anything - probably just compound page metadata). But it'd be more efficient if we weren't using linked lists for this. It's premature to ask for this before I've tested or profiled anything, but I just wanted to throw it out there that we might want to look at using a deque for pending frees (and perhaps other things?). Willy and I were just chatting about this and I think the conclusion we're coming to is that we'll probably want to try just RCU freeing _everything_ at first, the reason being that anonymous pages are probably going to get included and that may end up being most of the allocations and frees - so why not just do everything. Meaning - _lots_ of memory cycling through RCU, not bounded by device bandwidth like I initially thought. Another relevant topic: https://kernelnewbies.org/MatthewWilcox/Memdescs The plan has been to eventually get an actual enum/type tag for our different page types, and that would be really helpful for system visibility/introspection in general, and in particular for this project where we're going to want to be able to put numbers on different things for performance analysis and then later probably excluding things from RCU freeing in a sane standard way (slab, at least). I wonder if anyone wants to solve this. The problem with doing it now is that we've got no more page flags on 32 bit, but it'd be _very_ helpful in future cleanups if we could figure out a way to make it happen. That'd help with another wrinkle, which is non-compound higher-order pages. Those are tricky to handle, because how do we know the order from the RCU callback? If we just don't RCU free them, that's sketchy without the page type enum because we have no way to have a master definition/list of "these types of pages are RCU freed, these aren't". Possibly we can just fake-compound them - store the order like it is in compound pages but don't initialize every page like happens for compound pages. ^ permalink raw reply [flat|nested] 52+ messages in thread
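Kent's "throw it in the struct page union" idea maps onto the rcu_head member that struct page already carries in its union today. A minimal sketch of the order-0 case, assuming that union slot really is unused by the time the page reaches the free path; free_page_rcu() and page_free_rcu_cb() are hypothetical names, not anything in the tree:

#include <linux/mm.h>
#include <linux/rcupdate.h>

/* RCU callback: a grace period has elapsed, really free the page. */
static void page_free_rcu_cb(struct rcu_head *head)
{
        struct page *page = container_of(head, struct page, rcu_head);

        __free_page(page);      /* order-0 only; higher orders are the open problem */
}

/* Defer freeing of an order-0 page until after a grace period. */
static void free_page_rcu(struct page *page)
{
        call_rcu(&page->rcu_head, page_free_rcu_cb);
}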
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-03-01 3:28 ` Kent Overstreet @ 2024-03-05 2:43 ` Paul E. McKenney 2024-03-05 2:56 ` Yosry Ahmed 1 sibling, 0 replies; 52+ messages in thread From: Paul E. McKenney @ 2024-03-05 2:43 UTC (permalink / raw) To: Kent Overstreet Cc: Amir Goldstein, lsf-pc, linux-mm, linux-fsdevel, Matthew Wilcox, Jan Kara On Thu, Feb 29, 2024 at 10:28:33PM -0500, Kent Overstreet wrote: > On Tue, Feb 27, 2024 at 02:59:33PM -0800, Paul E. McKenney wrote: > > On Tue, Feb 27, 2024 at 09:19:47PM +0200, Amir Goldstein wrote: > > > On Tue, Feb 27, 2024 at 8:56 PM Paul E. McKenney <paulmck@kernel.org> wrote: > > > > > > > > Hello! > > > > > > > > Recent discussions [1] suggest that greater mutual understanding between > > > > memory reclaim on the one hand and RCU on the other might be in order. > > > > > > > > One possibility would be an open discussion. If it would help, I would > > > > be happy to describe how RCU reacts and responds to heavy load, along with > > > > some ways that RCU's reactions and responses could be enhanced if needed. > > > > > > Adding fsdevel as this should probably be a cross track session. > > > > The more, the merrier! ;-) > > Starting to look at the practicalities of freeing pages via rcu. > > Where to stick the rcu_head might be a problem. I think by the time > we're in free_unref_page() enough of struct page isn't in use that we > can just... throw it in the struct page union somewhere (ok, actually > there's one already for ZONE_DEVICE pages; we just have to make sure it > doesn't collide with anything - probably just compound page metadata). > > But it'd be more efficient if we weren't using linked lists for this. > It's premature to ask for this before I've tested or profiled anything, > but I just wanted to throw it out there that we might want to look using > a dequeue for pending frees (and perhaps other things?). You could certainly use pages of pointers like kfree_rcu() does, at least when there is ample memory. You could go further and allocate those pages as the structures are being allocated in order to guarantee that you never run short. The accounting is likely to be a bit tricky, but definitely doable. You could also use polled grace periods: o start_poll_synchronize_rcu_full() to get a "cookie". If you already have a batch for that cookie, add the new block to that batch. The cookie is then associated with the batch, not the individual block. The batch could be tied together by a non-linked-list data structure, if desired, and need not have any storage overhead at all within the blocks themselves. o poll_state_synchronize_rcu_full() to check if a batch's cookie corresponds to a completed grace period. Free up the batch if so. o cond_synchronize_rcu_full() to block until the specified cookie's grace period ends. (I suspect you won't want to do this, but just in case.) o cond_synchronize_rcu_expedited_full() is as above, but it uses an expedited grace period to do the waiting. o NUM_ACTIVE_RCU_POLL_FULL_OLDSTATE gives the maximum number of cookies that can correspond to a grace period that has not yet completed. This is a numerical constant that can be used to size an array. If things are piling up, you can call synchronize_rcu_expedited(), and that will cause all cookies collected before that call to suddenly correspond to completed grace periods. Where "suddenly" is give or take long RCU readers elsewhere in the kernel. Does this help? 
> Willy and I were just chatting about this and I think the conclusion > we're coming to is that we'll probably want to try just RCU freeing > _everything_ at first, the reason being that anonymous pages are > probably going to get included and that may end up being most of the > allocations and frees - so why not just do everything. > > Meaning - _lots _of memory cycling through RCU, not bounded by device > bandwidth like I initially thought. There already is the possibility of users doing close(open(...)) in a tight loop, which is why rcutorture does callback flooding. But each new use case brings a parcel of risk, so I do not claim that either of these mean that further adjustments, up to and including inventing a new synchronization primitive, won't be needed. > Another relevant topic: https://kernelnewbies.org/MatthewWilcox/Memdescs Not sure how this relates, but good to know. > The plan has been to eventually get an actual enum/type tag for our > different page types, and that would be really helpful for system > visibility/introspection in general, and in particular for this project > where we're going to want to be able to put numbers on different things > for performance analysis and then later probably excluding things from > RCU freeing in a sane standard way (slab, at least). > > I wonder if anyone wants to solve this. The problem with doing it now is > that we've got no more page flags on 32 bit, but it'd be _very_ helpful > in future cleanups if we could figure out a way to make it hapen. > > That'd help with another wrinkle, which is non compound higher order > pages. Those are tricky to handle, because how do we know the order from > the RCU callback? If we just don't RCU free them, that's sketchy without > the page type enum because we have no way to have a master > definition/list of "these types of pages are RCU freed, these aren't". If I understand correctly, this might be another reason for using pages of pointers rather than rcu_head structures. > Possibly we can just fake-compound them - store the order like it is in > compound pages but don't initialize every page like happens for compound > pages. I must defer to you guys on this one. Thanx, Paul ^ permalink raw reply [flat|nested] 52+ messages in thread
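To make the batching scheme Paul describes above concrete, here is a hypothetical, single-threaded sketch built on the polled grace-period API he lists. struct pending_batch and defer_free_page() are invented names; pages are linked through page->lru on the assumption that the field is free at this point; locking, slot initialisation (INIT_LIST_HEAD() plus get_completed_synchronize_rcu_full()), and the expedited fallback are all omitted.

#include <linux/list.h>
#include <linux/mm.h>
#include <linux/rcupdate.h>

struct pending_batch {
        struct rcu_gp_oldstate  gp;     /* cookie shared by every page in the batch */
        struct list_head        pages;
};

/* At most this many distinct not-yet-completed cookies can exist at once. */
static struct pending_batch batches[NUM_ACTIVE_RCU_POLL_FULL_OLDSTATE];

static void defer_free_page(struct page *page)
{
        struct rcu_gp_oldstate gp;
        struct page *p, *n;
        int i;

        start_poll_synchronize_rcu_full(&gp);

        /* Add to an existing batch with the same cookie, if there is one. */
        for (i = 0; i < ARRAY_SIZE(batches); i++) {
                if (same_state_synchronize_rcu_full(&batches[i].gp, &gp)) {
                        list_add(&page->lru, &batches[i].pages);
                        return;
                }
        }

        /* Otherwise recycle a batch whose grace period has already ended. */
        for (i = 0; i < ARRAY_SIZE(batches); i++) {
                if (!poll_state_synchronize_rcu_full(&batches[i].gp))
                        continue;
                list_for_each_entry_safe(p, n, &batches[i].pages, lru)
                        __free_page(p);
                INIT_LIST_HEAD(&batches[i].pages);
                batches[i].gp = gp;
                list_add(&page->lru, &batches[i].pages);
                return;
        }

        /* All slots still active: synchronize_rcu_expedited() would retire them. */
}

No rcu_head or per-page storage is consumed here, which is Paul's point: the cookie belongs to the batch, not to the individual page.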
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-03-01 3:28 ` Kent Overstreet 2024-03-05 2:43 ` Paul E. McKenney @ 2024-03-05 2:56 ` Yosry Ahmed 1 sibling, 0 replies; 52+ messages in thread From: Yosry Ahmed @ 2024-03-05 2:56 UTC (permalink / raw) To: Kent Overstreet Cc: Paul E. McKenney, Amir Goldstein, lsf-pc, linux-mm, linux-fsdevel, Matthew Wilcox, Jan Kara [..] > Willy and I were just chatting about this and I think the conclusion > we're coming to is that we'll probably want to try just RCU freeing > _everything_ at first, the reason being that anonymous pages are > probably going to get included and that may end up being most of the > allocations and frees - so why not just do everything. > > Meaning - _lots _of memory cycling through RCU, not bounded by device > bandwidth like I initially thought. > > Another relevant topic: https://kernelnewbies.org/MatthewWilcox/Memdescs > > The plan has been to eventually get an actual enum/type tag for our > different page types, and that would be really helpful for system > visibility/introspection in general, and in particular for this project > where we're going to want to be able to put numbers on different things > for performance analysis and then later probably excluding things from > RCU freeing in a sane standard way (slab, at least). > > I wonder if anyone wants to solve this. The problem with doing it now is > that we've got no more page flags on 32 bit, but it'd be _very_ helpful > in future cleanups if we could figure out a way to make it hapen. > > That'd help with another wrinkle, which is non compound higher order > pages. Those are tricky to handle, because how do we know the order from > the RCU callback? If we just don't RCU free them, that's sketchy without > the page type enum because we have no way to have a master > definition/list of "these types of pages are RCU freed, these aren't". > > Possibly we can just fake-compound them - store the order like it is in > compound pages but don't initialize every page like happens for compound > pages. (Not following the thread closely, sorry if this is completely meaningless) Why not store the metadata we need in the page itself, given that it is being freed anyway? ^ permalink raw reply [flat|nested] 52+ messages in thread
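One concrete reading of Yosry's suggestion, extending the order-0 sketch earlier in the thread: stash the order in the page being freed, for example in page->private, which in current struct page layouts does not overlap the rcu_head union member. That non-overlap is an assumption worth re-checking, and free_pages_rcu() is again a hypothetical helper:

#include <linux/mm.h>
#include <linux/rcupdate.h>

static void pages_free_rcu_cb(struct rcu_head *head)
{
        struct page *page = container_of(head, struct page, rcu_head);

        /* Recover the order stored by free_pages_rcu() below. */
        __free_pages(page, page_private(page));
}

static void free_pages_rcu(struct page *page, unsigned int order)
{
        set_page_private(page, order);  /* page contents are dead; reuse the field */
        call_rcu(&page->rcu_head, pages_free_rcu_cb);
}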
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-02-27 19:19 ` [Lsf-pc] " Amir Goldstein 2024-02-27 22:59 ` Paul E. McKenney @ 2024-02-28 19:37 ` Matthew Wilcox 2024-02-29 1:29 ` Dave Chinner ` (2 more replies) 1 sibling, 3 replies; 52+ messages in thread From: Matthew Wilcox @ 2024-02-28 19:37 UTC (permalink / raw) To: Amir Goldstein Cc: paulmck, lsf-pc, linux-mm, linux-fsdevel, Kent Overstreet, Jan Kara On Tue, Feb 27, 2024 at 09:19:47PM +0200, Amir Goldstein wrote: > On Tue, Feb 27, 2024 at 8:56 PM Paul E. McKenney <paulmck@kernel.org> wrote: > > > > Hello! > > > > Recent discussions [1] suggest that greater mutual understanding between > > memory reclaim on the one hand and RCU on the other might be in order. > > > > One possibility would be an open discussion. If it would help, I would > > be happy to describe how RCU reacts and responds to heavy load, along with > > some ways that RCU's reactions and responses could be enhanced if needed. > > > > Adding fsdevel as this should probably be a cross track session. Perhaps broaden this slightly. On the THP Cabal call we just had a conversation about the requirements on filesystems in the writeback path. We currently tell filesystem authors that the entire writeback path must avoid allocating memory in order to prevent deadlock (or use GFP_MEMALLOC). Is this appropriate? It's a lot of work to assure that writing pagecache back will not allocate memory in, eg, the network stack, the device driver, and any other layers the write must traverse. With the removal of ->writepage from vmscan, perhaps we can make filesystem authors lives easier by relaxing this requirement as pagecache should be cleaned long before we get to reclaiming it. I don't think there's anything to be done about swapping anon memory. We probably don't want to proactively write anon memory to swap, so by the time we're in ->swap_rw we really are low on memory. ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-02-28 19:37 ` Matthew Wilcox @ 2024-02-29 1:29 ` Dave Chinner 2024-02-29 4:20 ` Kent Overstreet 2024-02-29 4:17 ` Kent Overstreet 2024-03-01 2:16 ` NeilBrown 2 siblings, 1 reply; 52+ messages in thread From: Dave Chinner @ 2024-02-29 1:29 UTC (permalink / raw) To: Matthew Wilcox Cc: Amir Goldstein, paulmck, lsf-pc, linux-mm, linux-fsdevel, Kent Overstreet, Jan Kara On Wed, Feb 28, 2024 at 07:37:58PM +0000, Matthew Wilcox wrote: > On Tue, Feb 27, 2024 at 09:19:47PM +0200, Amir Goldstein wrote: > > On Tue, Feb 27, 2024 at 8:56 PM Paul E. McKenney <paulmck@kernel.org> wrote: > > > > > > Hello! > > > > > > Recent discussions [1] suggest that greater mutual understanding between > > > memory reclaim on the one hand and RCU on the other might be in order. > > > > > > One possibility would be an open discussion. If it would help, I would > > > be happy to describe how RCU reacts and responds to heavy load, along with > > > some ways that RCU's reactions and responses could be enhanced if needed. > > > > > > > Adding fsdevel as this should probably be a cross track session. > > Perhaps broaden this slightly. On the THP Cabal call we just had a > conversation about the requirements on filesystems in the writeback > path. We currently tell filesystem authors that the entire writeback > path must avoid allocating memory in order to prevent deadlock (or use > GFP_MEMALLOC). Is this appropriate? The reality is that filesystem developers have been ignoring that "mm rule" for a couple of decades. It was also discussed at LSFMM a decade ago (2014 IIRC) without resolution, so in the mean time we just took control of our own destiny.... > It's a lot of work to assure that > writing pagecache back will not allocate memory in, eg, the network stack, > the device driver, and any other layers the write must traverse. > > With the removal of ->writepage from vmscan, perhaps we can make > filesystem authors lives easier by relaxing this requirement as pagecache > should be cleaned long before we get to reclaiming it. .... by removing memory reclaim page cache writeback support from the filesystems entirely. IOWs, this rule hasn't been valid for a -long- time, so maybe it is time to remove it. :) Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-02-29 1:29 ` Dave Chinner @ 2024-02-29 4:20 ` Kent Overstreet 0 siblings, 0 replies; 52+ messages in thread From: Kent Overstreet @ 2024-02-29 4:20 UTC (permalink / raw) To: Dave Chinner Cc: Matthew Wilcox, Amir Goldstein, paulmck, lsf-pc, linux-mm, linux-fsdevel, Jan Kara On Thu, Feb 29, 2024 at 12:29:35PM +1100, Dave Chinner wrote: > On Wed, Feb 28, 2024 at 07:37:58PM +0000, Matthew Wilcox wrote: > > On Tue, Feb 27, 2024 at 09:19:47PM +0200, Amir Goldstein wrote: > > > On Tue, Feb 27, 2024 at 8:56 PM Paul E. McKenney <paulmck@kernel.org> wrote: > > > > > > > > Hello! > > > > > > > > Recent discussions [1] suggest that greater mutual understanding between > > > > memory reclaim on the one hand and RCU on the other might be in order. > > > > > > > > One possibility would be an open discussion. If it would help, I would > > > > be happy to describe how RCU reacts and responds to heavy load, along with > > > > some ways that RCU's reactions and responses could be enhanced if needed. > > > > > > > > > > Adding fsdevel as this should probably be a cross track session. > > > > Perhaps broaden this slightly. On the THP Cabal call we just had a > > conversation about the requirements on filesystems in the writeback > > path. We currently tell filesystem authors that the entire writeback > > path must avoid allocating memory in order to prevent deadlock (or use > > GFP_MEMALLOC). Is this appropriate? > > The reality is that filesystem developers have been ignoring that > "mm rule" for a couple of decades. It was also discussed at LSFMM a > decade ago (2014 IIRC) without resolution, so in the mean time we > just took control of our own destiny.... > > > It's a lot of work to assure that > > writing pagecache back will not allocate memory in, eg, the network stack, > > the device driver, and any other layers the write must traverse. > > > > With the removal of ->writepage from vmscan, perhaps we can make > > filesystem authors lives easier by relaxing this requirement as pagecache > > should be cleaned long before we get to reclaiming it. > > .... by removing memory reclaim page cache writeback support from > the filesystems entirely. > > IOWs, this rule hasn't been valid for a -long- time, so maybe it > is time to remove it. :) It's _never_ been valid, the entire IO stack allocates memory. This is what GFP_NOIO/GFP_NOFS is for, and additionally mempools/biosets. If mm can't satisfy the allocation, they should fail it, and then the IO path will have fallbacks (but they will be slow, i.e. iodepth will be greatly limited). ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-02-28 19:37 ` Matthew Wilcox 2024-02-29 1:29 ` Dave Chinner @ 2024-02-29 4:17 ` Kent Overstreet 2024-02-29 4:24 ` Matthew Wilcox 2024-03-01 2:16 ` NeilBrown 2 siblings, 1 reply; 52+ messages in thread From: Kent Overstreet @ 2024-02-29 4:17 UTC (permalink / raw) To: Matthew Wilcox Cc: Amir Goldstein, paulmck, lsf-pc, linux-mm, linux-fsdevel, Jan Kara On Wed, Feb 28, 2024 at 07:37:58PM +0000, Matthew Wilcox wrote: > On Tue, Feb 27, 2024 at 09:19:47PM +0200, Amir Goldstein wrote: > > On Tue, Feb 27, 2024 at 8:56 PM Paul E. McKenney <paulmck@kernel.org> wrote: > > > > > > Hello! > > > > > > Recent discussions [1] suggest that greater mutual understanding between > > > memory reclaim on the one hand and RCU on the other might be in order. > > > > > > One possibility would be an open discussion. If it would help, I would > > > be happy to describe how RCU reacts and responds to heavy load, along with > > > some ways that RCU's reactions and responses could be enhanced if needed. > > > > > > > Adding fsdevel as this should probably be a cross track session. > > Perhaps broaden this slightly. On the THP Cabal call we just had a > conversation about the requirements on filesystems in the writeback > path. We currently tell filesystem authors that the entire writeback > path must avoid allocating memory in order to prevent deadlock (or use > GFP_MEMALLOC). Is this appropriate? It's a lot of work to assure that > writing pagecache back will not allocate memory in, eg, the network stack, > the device driver, and any other layers the write must traverse. Why would you not simply mark the writeback path with memalloc_nofs_save()? ^ permalink raw reply [flat|nested] 52+ messages in thread
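For readers unfamiliar with the scoped API Kent is referring to, the pattern below (with a hypothetical do_writeback_work() standing in for a filesystem's real writeback body) makes every allocation in the marked section implicitly GFP_NOFS. That prevents reclaim re-entering the filesystem, but, as Willy notes in his reply, it says nothing about whether the allocation succeeds:

#include <linux/sched/mm.h>

int do_writeback_work(void);    /* hypothetical, stands in for the real writeback body */

int fs_writeback_section(void)
{
        unsigned int nofs_flags;
        int ret;

        nofs_flags = memalloc_nofs_save();
        /* All allocations in here behave as GFP_NOFS: no fs reclaim recursion. */
        ret = do_writeback_work();
        memalloc_nofs_restore(nofs_flags);

        return ret;
}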
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-02-29 4:17 ` Kent Overstreet @ 2024-02-29 4:24 ` Matthew Wilcox 2024-02-29 4:44 ` Kent Overstreet 0 siblings, 1 reply; 52+ messages in thread From: Matthew Wilcox @ 2024-02-29 4:24 UTC (permalink / raw) To: Kent Overstreet Cc: Amir Goldstein, paulmck, lsf-pc, linux-mm, linux-fsdevel, Jan Kara On Wed, Feb 28, 2024 at 11:17:33PM -0500, Kent Overstreet wrote: > On Wed, Feb 28, 2024 at 07:37:58PM +0000, Matthew Wilcox wrote: > > Perhaps broaden this slightly. On the THP Cabal call we just had a > > conversation about the requirements on filesystems in the writeback > > path. We currently tell filesystem authors that the entire writeback > > path must avoid allocating memory in order to prevent deadlock (or use > > GFP_MEMALLOC). Is this appropriate? It's a lot of work to assure that > > writing pagecache back will not allocate memory in, eg, the network stack, > > the device driver, and any other layers the write must traverse. > > > > Why would you not simply mark the writeback path with > memalloc_nofs_save()? It's not about preventing recursion, it's about guaranteeing forward progress. If you can't allocate a bio, you can't clean memory. ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-02-29 4:24 ` Matthew Wilcox @ 2024-02-29 4:44 ` Kent Overstreet 0 siblings, 0 replies; 52+ messages in thread From: Kent Overstreet @ 2024-02-29 4:44 UTC (permalink / raw) To: Matthew Wilcox Cc: Amir Goldstein, paulmck, lsf-pc, linux-mm, linux-fsdevel, Jan Kara On Thu, Feb 29, 2024 at 04:24:52AM +0000, Matthew Wilcox wrote: > On Wed, Feb 28, 2024 at 11:17:33PM -0500, Kent Overstreet wrote: > > On Wed, Feb 28, 2024 at 07:37:58PM +0000, Matthew Wilcox wrote: > > > Perhaps broaden this slightly. On the THP Cabal call we just had a > > > conversation about the requirements on filesystems in the writeback > > > path. We currently tell filesystem authors that the entire writeback > > > path must avoid allocating memory in order to prevent deadlock (or use > > > GFP_MEMALLOC). Is this appropriate? It's a lot of work to assure that > > > writing pagecache back will not allocate memory in, eg, the network stack, > > > the device driver, and any other layers the write must traverse. > > > > Why would you not simply mark the writeback path with > > memalloc_nofs_save()? > > It's not about preventing recursion, it's about guaranteeing forward > progres. If you can't allocate a bio, you can't clean memory. Err, what? We have to be able to allocate bios in order to do writeback, _period_. And not just bios, sometimes we have to bounce the entire IO. I keep noticing the mm people developing weird, crazy ideas about it being "unsafe" to allocate memory in various contexts. That's wrong; attempting to allocate memory is always a _safe_ operation, provided you tell the allocator what it's allowed to do. The allocation might fail, and that's ok. If code must succeed for the system to operate it must have fallbacks if allocations fail, but we don't limit ourselves to those fallbacks ("avoid allocating memory") because then performance would be shit. The PF_MEMALLOC suggestion is even weirder. mm people are high on something... ^ permalink raw reply [flat|nested] 52+ messages in thread
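The mempool/bioset fallback Kent describes is the existing answer to the "can't allocate a bio" case: a bio_set preallocates enough bios and bvecs that, with a gfp mask that can sleep, allocation blocks until an in-flight bio completes rather than failing, bounding iodepth instead of deadlocking. A minimal sketch, with hypothetical names and the bio_alloc_bioset() signature of recent kernels:

#include <linux/bio.h>

static struct bio_set writeback_bio_set;

static int writeback_bioset_init(void)
{
        /* Reserve a couple of bios plus bvecs to guarantee forward progress. */
        return bioset_init(&writeback_bio_set, BIO_POOL_SIZE, 0,
                           BIOSET_NEED_BVECS);
}

static struct bio *writeback_bio_alloc(struct block_device *bdev,
                                       unsigned short nr_vecs)
{
        /*
         * May sleep until a previously issued bio completes, but does not
         * return NULL for a sleeping gfp mask backed by the mempool.
         */
        return bio_alloc_bioset(bdev, nr_vecs, REQ_OP_WRITE, GFP_NOIO,
                                &writeback_bio_set);
}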
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-02-28 19:37 ` Matthew Wilcox 2024-02-29 1:29 ` Dave Chinner 2024-02-29 4:17 ` Kent Overstreet @ 2024-03-01 2:16 ` NeilBrown 2024-03-01 2:39 ` Kent Overstreet 2024-03-01 5:54 ` Dave Chinner 2 siblings, 2 replies; 52+ messages in thread From: NeilBrown @ 2024-03-01 2:16 UTC (permalink / raw) To: Matthew Wilcox Cc: Amir Goldstein, paulmck, lsf-pc, linux-mm, linux-fsdevel, Kent Overstreet, Jan Kara On Thu, 29 Feb 2024, Matthew Wilcox wrote: > On Tue, Feb 27, 2024 at 09:19:47PM +0200, Amir Goldstein wrote: > > On Tue, Feb 27, 2024 at 8:56 PM Paul E. McKenney <paulmck@kernel.org> wrote: > > > > > > Hello! > > > > > > Recent discussions [1] suggest that greater mutual understanding between > > > memory reclaim on the one hand and RCU on the other might be in order. > > > > > > One possibility would be an open discussion. If it would help, I would > > > be happy to describe how RCU reacts and responds to heavy load, along with > > > some ways that RCU's reactions and responses could be enhanced if needed. > > > > > > > Adding fsdevel as this should probably be a cross track session. > > Perhaps broaden this slightly. On the THP Cabal call we just had a > conversation about the requirements on filesystems in the writeback > path. We currently tell filesystem authors that the entire writeback > path must avoid allocating memory in order to prevent deadlock (or use > GFP_MEMALLOC). Is this appropriate? It's a lot of work to assure that > writing pagecache back will not allocate memory in, eg, the network stack, > the device driver, and any other layers the write must traverse. > > With the removal of ->writepage from vmscan, perhaps we can make > filesystem authors lives easier by relaxing this requirement as pagecache > should be cleaned long before we get to reclaiming it. > > I don't think there's anything to be done about swapping anon memory. > We probably don't want to proactively write anon memory to swap, so by > the time we're in ->swap_rw we really are low on memory. > > While we are considering revising mm rules, I would really like to revise the rule that GFP_KERNEL allocations are allowed to fail. I'm not at all sure that they ever do (except for large allocations - so maybe we could leave that exception in - or warn if large allocations are tried without a MAY_FAIL flag). Given that GFP_KERNEL can wait, and that the mm can kill off processes and clear cache to free memory, there should be no case where failure is needed, since simply waiting will eventually result in success. And if there is, the machine is a goner anyway. Once upon a time user-space pages could not be ripped out of a process by the oom killer until the process actually exited, and that meant that GFP_KERNEL allocations of a process being oom killed should not block indefinitely in the allocator. I *think* that isn't the case any more. Insisting that GFP_KERNEL allocations never returned NULL would allow us to remove a lot of untested error handling code.... NeilBrown ^ permalink raw reply [flat|nested] 52+ messages in thread
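The "untested error handling code" Neil has in mind is the ubiquitous pattern below; under his proposed rule the NULL check for a small GFP_KERNEL allocation becomes dead code, while large or failure-tolerant callers would be annotated with a may-fail flag (his MAY_FAIL is shorthand, not an existing gfp flag). struct foo and foo_alloc() are purely illustrative:

#include <linux/slab.h>

struct foo { int x; };          /* illustrative */

static struct foo *foo_alloc(void)
{
        struct foo *f = kmalloc(sizeof(*f), GFP_KERNEL);

        if (!f)         /* today: mandatory; under Neil's proposal: unreachable for small sizes */
                return NULL;
        f->x = 0;
        return f;
}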
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-03-01 2:16 ` NeilBrown @ 2024-03-01 2:39 ` Kent Overstreet 2024-03-01 2:48 ` Matthew Wilcox 2024-03-01 5:54 ` Dave Chinner 1 sibling, 1 reply; 52+ messages in thread From: Kent Overstreet @ 2024-03-01 2:39 UTC (permalink / raw) To: NeilBrown Cc: Matthew Wilcox, Amir Goldstein, paulmck, lsf-pc, linux-mm, linux-fsdevel, Jan Kara On Fri, Mar 01, 2024 at 01:16:18PM +1100, NeilBrown wrote: > On Thu, 29 Feb 2024, Matthew Wilcox wrote: > > On Tue, Feb 27, 2024 at 09:19:47PM +0200, Amir Goldstein wrote: > > > On Tue, Feb 27, 2024 at 8:56 PM Paul E. McKenney <paulmck@kernel.org> wrote: > > > > > > > > Hello! > > > > > > > > Recent discussions [1] suggest that greater mutual understanding between > > > > memory reclaim on the one hand and RCU on the other might be in order. > > > > > > > > One possibility would be an open discussion. If it would help, I would > > > > be happy to describe how RCU reacts and responds to heavy load, along with > > > > some ways that RCU's reactions and responses could be enhanced if needed. > > > > > > > > > > Adding fsdevel as this should probably be a cross track session. > > > > Perhaps broaden this slightly. On the THP Cabal call we just had a > > conversation about the requirements on filesystems in the writeback > > path. We currently tell filesystem authors that the entire writeback > > path must avoid allocating memory in order to prevent deadlock (or use > > GFP_MEMALLOC). Is this appropriate? It's a lot of work to assure that > > writing pagecache back will not allocate memory in, eg, the network stack, > > the device driver, and any other layers the write must traverse. > > > > With the removal of ->writepage from vmscan, perhaps we can make > > filesystem authors lives easier by relaxing this requirement as pagecache > > should be cleaned long before we get to reclaiming it. > > > > I don't think there's anything to be done about swapping anon memory. > > We probably don't want to proactively write anon memory to swap, so by > > the time we're in ->swap_rw we really are low on memory. > > > > > > While we are considering revising mm rules, I would really like to > revised the rule that GFP_KERNEL allocations are allowed to fail. > I'm not at all sure that they ever do (except for large allocations - so > maybe we could leave that exception in - or warn if large allocations > are tried without a MAY_FAIL flag). > > Given that GFP_KERNEL can wait, and that the mm can kill off processes > and clear cache to free memory, there should be no case where failure is > needed or when simply waiting will eventually result in success. And if > there is, the machine is a gonner anyway. > > Once upon a time user-space pages could not be ripped out of a process > by the oom killer until the process actually exited, and that meant that > GFP_KERNEL allocations of a process being oom killed should not block > indefinitely in the allocator. I *think* that isn't the case any more. > > Insisting that GFP_KERNEL allocations never returned NULL would allow us > to remove a lot of untested error handling code.... If memcg ever gets enabled for all kernel side allocations we might start seeing failures of GFP_KERNEL allocations. I've got better fault injection code coming, I'll be posting it right after memory allocation profiling gets merged - that'll help with the testing situation. 
The big blocker on enabling memcg for all kernel allocations is performance overhead, but I hear that's getting worked on as well. We'd probably want to add a gfp flag to annotate which allocations we want to fail because of memcg, though... ^ permalink raw reply [flat|nested] 52+ messages in thread
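For context on "memcg for all kernel allocations": today, charging kernel memory to the allocating task's cgroup is opt-in, per allocation or per slab cache, roughly as in the illustrative snippet below (my_subsys_init(), my_cache and the sizes are invented). Kent's point is about what should happen when such a charged allocation hits the cgroup limit:

#include <linux/slab.h>

static struct kmem_cache *my_cache;     /* illustrative names */

static int my_subsys_init(void)
{
        void *buf;

        /* Per-allocation opt-in: charged to the caller's memory cgroup. */
        buf = kmalloc(4096, GFP_KERNEL_ACCOUNT);
        kfree(buf);

        /* Per-cache opt-in: every object allocated from it is charged. */
        my_cache = kmem_cache_create("my_objs", 128, 0, SLAB_ACCOUNT, NULL);
        return my_cache ? 0 : -ENOMEM;
}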
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-03-01 2:39 ` Kent Overstreet @ 2024-03-01 2:48 ` Matthew Wilcox 2024-03-01 3:09 ` Kent Overstreet 2024-03-05 2:54 ` Yosry Ahmed 0 siblings, 2 replies; 52+ messages in thread From: Matthew Wilcox @ 2024-03-01 2:48 UTC (permalink / raw) To: Kent Overstreet Cc: NeilBrown, Amir Goldstein, paulmck, lsf-pc, linux-mm, linux-fsdevel, Jan Kara On Thu, Feb 29, 2024 at 09:39:17PM -0500, Kent Overstreet wrote: > On Fri, Mar 01, 2024 at 01:16:18PM +1100, NeilBrown wrote: > > Insisting that GFP_KERNEL allocations never returned NULL would allow us > > to remove a lot of untested error handling code.... > > If memcg ever gets enabled for all kernel side allocations we might > start seeing failures of GFP_KERNEL allocations. Why would we want that behaviour? A memcg-limited allocation should behave like any other allocation -- block until we've freed some other memory in this cgroup, either by swap or killing or ... ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-03-01 2:48 ` Matthew Wilcox @ 2024-03-01 3:09 ` Kent Overstreet 2024-03-01 3:33 ` James Bottomley 0 siblings, 1 reply; 52+ messages in thread From: Kent Overstreet @ 2024-03-01 3:09 UTC (permalink / raw) To: Matthew Wilcox Cc: NeilBrown, Amir Goldstein, paulmck, lsf-pc, linux-mm, linux-fsdevel, Jan Kara On Fri, Mar 01, 2024 at 02:48:52AM +0000, Matthew Wilcox wrote: > On Thu, Feb 29, 2024 at 09:39:17PM -0500, Kent Overstreet wrote: > > On Fri, Mar 01, 2024 at 01:16:18PM +1100, NeilBrown wrote: > > > Insisting that GFP_KERNEL allocations never returned NULL would allow us > > > to remove a lot of untested error handling code.... > > > > If memcg ever gets enabled for all kernel side allocations we might > > start seeing failures of GFP_KERNEL allocations. > > Why would we want that behaviour? A memcg-limited allocation should > behave like any other allocation -- block until we've freed some other > memory in this cgroup, either by swap or killing or ... It's not uncommon to have a more efficient way of doing something if you can allocate more memory, but still have the ability to run in a more bounded amount of space if you need to; I write code like this quite often. Or maybe you just want the syscall to return an error instead of blocking for an unbounded amount of time if userspace asks for something silly. Honestly, relying on the OOM killer and saying that because of it we don't have to write and test our error paths is a lazy cop out. The same kind of thinking got us overcommit, where yes we got an increase in efficiency, but the cost was that everyone started assuming and relying on overcommit, so now it's impossible to run without overcommit enabled except in highly controlled environments. And that means allocation failure as an effective signal is just completely busted in userspace. If you want to write code in userspace that uses as much memory as is available and no more, you _can't_, because system behaviour goes to shit if you have overcommit enabled or a bunch of memory gets wasted if overcommit is disabled because everyone assumes that's just what you do. Let's _not_ go that route in the kernel. I have pointy sticks to brandish at people who don't want to deal with properly handling errors. ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-03-01 3:09 ` Kent Overstreet @ 2024-03-01 3:33 ` James Bottomley 2024-03-01 3:52 ` Kent Overstreet 0 siblings, 1 reply; 52+ messages in thread From: James Bottomley @ 2024-03-01 3:33 UTC (permalink / raw) To: Kent Overstreet, Matthew Wilcox Cc: NeilBrown, Amir Goldstein, paulmck, lsf-pc, linux-mm, linux-fsdevel, Jan Kara On Thu, 2024-02-29 at 22:09 -0500, Kent Overstreet wrote: > On Fri, Mar 01, 2024 at 02:48:52AM +0000, Matthew Wilcox wrote: > > On Thu, Feb 29, 2024 at 09:39:17PM -0500, Kent Overstreet wrote: > > > On Fri, Mar 01, 2024 at 01:16:18PM +1100, NeilBrown wrote: > > > > Insisting that GFP_KERNEL allocations never returned NULL would > > > > allow us to remove a lot of untested error handling code.... > > > > > > If memcg ever gets enabled for all kernel side allocations we > > > might start seeing failures of GFP_KERNEL allocations. > > > > Why would we want that behaviour? A memcg-limited allocation > > should behave like any other allocation -- block until we've freed > > some other memory in this cgroup, either by swap or killing or ... > > It's not uncommon to have a more efficient way of doing something if > you can allocate more memory, but still have the ability to run in a > more bounded amount of space if you need to; I write code like this > quite often. The cgroup design is to do what we do usually, but within settable hard and soft limits. So if the kernel could make GFP_KERNEL wait without failing, the cgroup would mirror that. > Or maybe you just want the syscall to return an error instead of > blocking for an unbounded amount of time if userspace asks for > something silly. Warn on allocation above a certain size without MAY_FAIL would seem to cover all those cases. If there is a case for requiring instant allocation, you always have GFP_ATOMIC, and, I suppose, we could even do a bounded reclaim allocation where it tries for a certain time then fails. > Honestly, relying on the OOM killer and saying that because that now > we don't have to write and test your error paths is a lazy cop out. OOM Killer is the most extreme outcome. Usually reclaim (hugely simplified) dumps clean cache first and tries the shrinkers then tries to write out dirty cache. Only after that hasn't found anything after a few iterations will the oom killer get activated. > The same kind of thinking got us overcommit, where yes we got an > increase in efficiency, but the cost was that everyone started > assuming and relying on overcommit, so now it's impossible to run > without overcommit enabled except in highly controlled environments. That might be true for your use case, but it certainly isn't true for a cheap hosting cloud using containers: overcommit is where you make your money, so it's absolutely standard operating procedure. I wouldn't call cheap hosting a "highly controlled environment" they're just making a bet they won't get caught out too often. > And that means allocation failure as an effective signal is just > completely busted in userspace. If you want to write code in > userspace that uses as much memory as is available and no more, you > _can't_, because system behaviour goes to shit if you have overcommit > enabled or a bunch of memory gets wasted if overcommit is disabled > because everyone assumes that's just what you do. 
OK, this seems to be specific to your use case again, because if you look at what the major user space processes like web browsers do, they allocate way over the physical memory available to them for cache and assume the kernel will take care of it. Making failure a signal for being over the working set would cause all these applications to segfault almost immediately. I think what you're asking for is an API to try to calculate what the current available headroom in the working set would be? That's highly heuristic, but the mm people might have an idea how to do it. > Let's _not_ go that route in the kernel. I have pointy sticks to > brandish at people who don't want to deal with properly handling > errors. Error legs are the least exercised and most bug, and therefore exploit, prone pieces of code in C. If we can get rid of them, we should. James ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-03-01 3:33 ` James Bottomley @ 2024-03-01 3:52 ` Kent Overstreet 2024-03-01 4:01 ` Kent Overstreet 2024-03-01 4:08 ` James Bottomley 0 siblings, 2 replies; 52+ messages in thread From: Kent Overstreet @ 2024-03-01 3:52 UTC (permalink / raw) To: James Bottomley Cc: Matthew Wilcox, NeilBrown, Amir Goldstein, paulmck, lsf-pc, linux-mm, linux-fsdevel, Jan Kara On Fri, Mar 01, 2024 at 10:33:59AM +0700, James Bottomley wrote: > On Thu, 2024-02-29 at 22:09 -0500, Kent Overstreet wrote: > > Or maybe you just want the syscall to return an error instead of > > blocking for an unbounded amount of time if userspace asks for > > something silly. > > Warn on allocation above a certain size without MAY_FAIL would seem to > cover all those cases. If there is a case for requiring instant > allocation, you always have GFP_ATOMIC, and, I suppose, we could even > do a bounded reclaim allocation where it tries for a certain time then > fails. Then you're baking in this weird constant into all your algorithms that doesn't scale as machine memory sizes and working set sizes increase. > > Honestly, relying on the OOM killer and saying that because that now > > we don't have to write and test your error paths is a lazy cop out. > > OOM Killer is the most extreme outcome. Usually reclaim (hugely > simplified) dumps clean cache first and tries the shrinkers then tries > to write out dirty cache. Only after that hasn't found anything after > a few iterations will the oom killer get activated All your caches dumped and the machine grinds to a halt and then a random process gets killed instead of simply _failing the allocation_. > > The same kind of thinking got us overcommit, where yes we got an > > increase in efficiency, but the cost was that everyone started > > assuming and relying on overcommit, so now it's impossible to run > > without overcommit enabled except in highly controlled environments. > > That might be true for your use case, but it certainly isn't true for a > cheap hosting cloud using containers: overcommit is where you make your > money, so it's absolutely standard operating procedure. I wouldn't > call cheap hosting a "highly controlled environment" they're just > making a bet they won't get caught out too often. Reading comprehension fail. Reread what I wrote. > > And that means allocation failure as an effective signal is just > > completely busted in userspace. If you want to write code in > > userspace that uses as much memory as is available and no more, you > > _can't_, because system behaviour goes to shit if you have overcommit > > enabled or a bunch of memory gets wasted if overcommit is disabled > > because everyone assumes that's just what you do. > > OK, this seems to be specific to your use case again, because if you > look at what the major user space processes like web browsers do, they > allocate way over the physical memory available to them for cache and > assume the kernel will take care of it. Making failure a signal for > being over the working set would cause all these applications to > segfault almost immediately. Again, reread what I wrote. You're restating what I wrote and completely missing the point. > > Let's _not_ go that route in the kernel. I have pointy sticks to > > brandish at people who don't want to deal with properly handling > > errors. > > Error legs are the least exercised and most bug, and therefore exploit, > prone pieces of code in C. If we can get rid of them, we should. Fuck no. 
Having working error paths is _basic_, and learning how to test your code is also basic. If you can't be bothered to do that, you shouldn't be writing kernel code. We are giving up far too much by going down the route of "oh, just kill stuff if we screwed the pooch and overcommitted". I don't fucking care if it's what the big cloud providers want because it's convenient for them, some of us actually do care about reliability. By just saying "oh, the OOM killer will save us" what you're doing is making it nearly impossible to fully utilize a machine without having stuff randomly killed. Fuck. That. ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-03-01 3:52 ` Kent Overstreet @ 2024-03-01 4:01 ` Kent Overstreet 2024-03-01 4:09 ` NeilBrown 2024-03-01 4:18 ` James Bottomley 2024-03-01 4:08 ` James Bottomley 1 sibling, 2 replies; 52+ messages in thread From: Kent Overstreet @ 2024-03-01 4:01 UTC (permalink / raw) To: James Bottomley Cc: Matthew Wilcox, NeilBrown, Amir Goldstein, paulmck, lsf-pc, linux-mm, linux-fsdevel, Jan Kara On Thu, Feb 29, 2024 at 10:52:06PM -0500, Kent Overstreet wrote: > On Fri, Mar 01, 2024 at 10:33:59AM +0700, James Bottomley wrote: > > On Thu, 2024-02-29 at 22:09 -0500, Kent Overstreet wrote: > > > Or maybe you just want the syscall to return an error instead of > > > blocking for an unbounded amount of time if userspace asks for > > > something silly. > > > > Warn on allocation above a certain size without MAY_FAIL would seem to > > cover all those cases. If there is a case for requiring instant > > allocation, you always have GFP_ATOMIC, and, I suppose, we could even > > do a bounded reclaim allocation where it tries for a certain time then > > fails. > > Then you're baking in this weird constant into all your algorithms that > doesn't scale as machine memory sizes and working set sizes increase. > > > > Honestly, relying on the OOM killer and saying that because that now > > > we don't have to write and test your error paths is a lazy cop out. > > > > OOM Killer is the most extreme outcome. Usually reclaim (hugely > > simplified) dumps clean cache first and tries the shrinkers then tries > > to write out dirty cache. Only after that hasn't found anything after > > a few iterations will the oom killer get activated > > All your caches dumped and the machine grinds to a halt and then a > random process gets killed instead of simply _failing the allocation_. > > > > The same kind of thinking got us overcommit, where yes we got an > > > increase in efficiency, but the cost was that everyone started > > > assuming and relying on overcommit, so now it's impossible to run > > > without overcommit enabled except in highly controlled environments. > > > > That might be true for your use case, but it certainly isn't true for a > > cheap hosting cloud using containers: overcommit is where you make your > > money, so it's absolutely standard operating procedure. I wouldn't > > call cheap hosting a "highly controlled environment" they're just > > making a bet they won't get caught out too often. > > Reading comprehension fail. Reread what I wrote. > > > > And that means allocation failure as an effective signal is just > > > completely busted in userspace. If you want to write code in > > > userspace that uses as much memory as is available and no more, you > > > _can't_, because system behaviour goes to shit if you have overcommit > > > enabled or a bunch of memory gets wasted if overcommit is disabled > > > because everyone assumes that's just what you do. > > > > OK, this seems to be specific to your use case again, because if you > > look at what the major user space processes like web browsers do, they > > allocate way over the physical memory available to them for cache and > > assume the kernel will take care of it. Making failure a signal for > > being over the working set would cause all these applications to > > segfault almost immediately. > > Again, reread what I wrote. You're restating what I wrote and completely > missing the point. > > > > Let's _not_ go that route in the kernel. 
I have pointy sticks to > > > brandish at people who don't want to deal with properly handling > > > errors. > > > > Error legs are the least exercised and most bug, and therefore exploit, > > prone pieces of code in C. If we can get rid of them, we should. > > Fuck no. > > Having working error paths is _basic_, and learning how to test your > code is also basic. If you can't be bothered to do that you shouldn't be > writing kernel code. > > We are giving up far too much by going down the route of "oh, just kill > stuff if we screwed the pooch and overcommitted". > > I don't fucking care if it's what the big cloud providers want because > it's convenient for them, some of us actually do care about reliability. > > By just saying "oh, the OOM killer will save us" what you're doing is > making it nearly impossible to fully utilize a machine without having > stuff randomly killed. > > Fuck. That. And besides all that, as a practical matter you can't just "not have error paths" because, like you said, you'd still have to have a max size where you WARN() - and _fail the allocation_ - and you've still got to unwind. The OOM killer can't kill processes while they're stuck blocking on an allocation that will never return in the kernel. I think we can safely nip this idea in the bud. Test your damn error paths... ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-03-01 4:01 ` Kent Overstreet @ 2024-03-01 4:09 ` NeilBrown 2024-03-01 4:18 ` Kent Overstreet 2024-03-01 4:18 ` James Bottomley 1 sibling, 1 reply; 52+ messages in thread From: NeilBrown @ 2024-03-01 4:09 UTC (permalink / raw) To: Kent Overstreet Cc: James Bottomley, Matthew Wilcox, Amir Goldstein, paulmck, lsf-pc, linux-mm, linux-fsdevel, Jan Kara On Fri, 01 Mar 2024, Kent Overstreet wrote: > On Thu, Feb 29, 2024 at 10:52:06PM -0500, Kent Overstreet wrote: > > On Fri, Mar 01, 2024 at 10:33:59AM +0700, James Bottomley wrote: > > > On Thu, 2024-02-29 at 22:09 -0500, Kent Overstreet wrote: > > > > Or maybe you just want the syscall to return an error instead of > > > > blocking for an unbounded amount of time if userspace asks for > > > > something silly. > > > > > > Warn on allocation above a certain size without MAY_FAIL would seem to > > > cover all those cases. If there is a case for requiring instant > > > allocation, you always have GFP_ATOMIC, and, I suppose, we could even > > > do a bounded reclaim allocation where it tries for a certain time then > > > fails. > > > > Then you're baking in this weird constant into all your algorithms that > > doesn't scale as machine memory sizes and working set sizes increase. > > > > > > Honestly, relying on the OOM killer and saying that because that now > > > > we don't have to write and test your error paths is a lazy cop out. > > > > > > OOM Killer is the most extreme outcome. Usually reclaim (hugely > > > simplified) dumps clean cache first and tries the shrinkers then tries > > > to write out dirty cache. Only after that hasn't found anything after > > > a few iterations will the oom killer get activated > > > > All your caches dumped and the machine grinds to a halt and then a > > random process gets killed instead of simply _failing the allocation_. > > > > > > The same kind of thinking got us overcommit, where yes we got an > > > > increase in efficiency, but the cost was that everyone started > > > > assuming and relying on overcommit, so now it's impossible to run > > > > without overcommit enabled except in highly controlled environments. > > > > > > That might be true for your use case, but it certainly isn't true for a > > > cheap hosting cloud using containers: overcommit is where you make your > > > money, so it's absolutely standard operating procedure. I wouldn't > > > call cheap hosting a "highly controlled environment" they're just > > > making a bet they won't get caught out too often. > > > > Reading comprehension fail. Reread what I wrote. > > > > > > And that means allocation failure as an effective signal is just > > > > completely busted in userspace. If you want to write code in > > > > userspace that uses as much memory as is available and no more, you > > > > _can't_, because system behaviour goes to shit if you have overcommit > > > > enabled or a bunch of memory gets wasted if overcommit is disabled > > > > because everyone assumes that's just what you do. > > > > > > OK, this seems to be specific to your use case again, because if you > > > look at what the major user space processes like web browsers do, they > > > allocate way over the physical memory available to them for cache and > > > assume the kernel will take care of it. Making failure a signal for > > > being over the working set would cause all these applications to > > > segfault almost immediately. > > > > Again, reread what I wrote. 
You're restating what I wrote and completely > > missing the point. > > > > > > Let's _not_ go that route in the kernel. I have pointy sticks to > > > > brandish at people who don't want to deal with properly handling > > > > errors. > > > > > > Error legs are the least exercised and most bug, and therefore exploit, > > > prone pieces of code in C. If we can get rid of them, we should. > > > > Fuck no. > > > > Having working error paths is _basic_, and learning how to test your > > code is also basic. If you can't be bothered to do that you shouldn't be > > writing kernel code. > > > > We are giving far too much by going down the route of "oh, just kill > > stuff if we screwed the pooch and overcommitted". > > > > I don't fucking care if it's what the big cloud providers want because > > it's convenient for them, some of us actually do care about reliability. > > > > By just saying "oh, the OO killer will save us" what you're doing is > > making it nearly impossible to fully utilize a machine without having > > stuff randomly killed. > > > > And besides all that, as a practical matter you can't just "not have > erro paths" because, like you said, you'd still have to have a max size > where you WARN() - and _fail the allocation_ - and you've still got to > unwind. No. You warn and DON'T fail the allocation. Just like lockdep warns of possible deadlocks but lets you continue. These will be found in development (mostly) and changed to use __GFP_RETRY_MAYFAIL and have appropriate error-handling paths. > > The OOM killer can't kill processes while they're stuck blocking on an > allocation that will rever return in the kernel. But it can depopulate the user address space (I think). NeilBrown > > I think we can safely nip this idea in the bud. > > Test your damn error paths... > ^ permalink raw reply [flat|nested] 52+ messages in thread
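The opt-in Neil mentions already exists as a gfp flag. The sketch below (the function name and the ioctl framing are hypothetical) shows the shape of the conversion he expects: a userspace-controlled size gets __GFP_RETRY_MAYFAIL plus a real error path, while plain GFP_KERNEL callers would be left to block:

#include <linux/slab.h>

static void *alloc_user_sized_buffer(size_t len)
{
        /*
         * len is userspace-controlled (say, an ioctl argument), so opt in
         * to failure instead of letting the allocator retry indefinitely;
         * the caller keeps its NULL check and returns -ENOMEM.
         */
        return kmalloc(len, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
}

In practice kvmalloc(len, GFP_KERNEL) is the other common choice for such sizes, since the vmalloc fallback tolerates fragmentation; either way the caller still handles failure.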
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-03-01 4:09 ` NeilBrown @ 2024-03-01 4:18 ` Kent Overstreet 0 siblings, 0 replies; 52+ messages in thread From: Kent Overstreet @ 2024-03-01 4:18 UTC (permalink / raw) To: NeilBrown Cc: James Bottomley, Matthew Wilcox, Amir Goldstein, paulmck, lsf-pc, linux-mm, linux-fsdevel, Jan Kara On Fri, Mar 01, 2024 at 03:09:09PM +1100, NeilBrown wrote: > No. You warn and DON'T fail the allocation. Just like lockdep warns of > possible deadlocks but lets you continue. > These will be found in development (mostly) and changed to use > __GFP_RETRY_MAYFAIL and have appropriate error-handling paths. And when an allocation happens that cannot be satisfied without killing everything off? You do realize there are _lots_ of allocations where userspace has control over the size of the allocation, via e.g. obscure ioctls that ask for a buffer to be filled? ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-03-01 4:01 ` Kent Overstreet 2024-03-01 4:09 ` NeilBrown @ 2024-03-01 4:18 ` James Bottomley 1 sibling, 0 replies; 52+ messages in thread From: James Bottomley @ 2024-03-01 4:18 UTC (permalink / raw) To: Kent Overstreet Cc: Matthew Wilcox, NeilBrown, Amir Goldstein, paulmck, lsf-pc, linux-mm, linux-fsdevel, Jan Kara On Thu, 2024-02-29 at 23:01 -0500, Kent Overstreet wrote: > On Thu, Feb 29, 2024 at 10:52:06PM -0500, Kent Overstreet wrote: > > On Fri, Mar 01, 2024 at 10:33:59AM +0700, James Bottomley wrote: > > > On Thu, 2024-02-29 at 22:09 -0500, Kent Overstreet wrote: [...] > > > > Let's _not_ go that route in the kernel. I have pointy sticks > > > > to brandish at people who don't want to deal with properly > > > > handling errors. > > > > > > Error legs are the least exercised and most bug, and therefore > > > exploit, prone pieces of code in C. If we can get rid of them, > > > we should. > > > > Fuck no. > > > > Having working error paths is _basic_, and learning how to test > > your code is also basic. If you can't be bothered to do that you > > shouldn't be writing kernel code. Heh, that's as glib as saying people should test their C code for overcommit errors. If everyone did that there'd be no need for languages like rust. > > We are giving far too much by going down the route of "oh, just > > kill stuff if we screwed the pooch and overcommitted". > > > > I don't fucking care if it's what the big cloud providers want > > because it's convenient for them, some of us actually do care about > > reliability. Reliability has many definitions. The kernel tries to leave policy like this to the user, which is why the overcommit type is exposed to userspace. Arguing about whose workload is more important isn't really going to be helpful. > > By just saying "oh, the OO killer will save us" what you're doing > > is making it nearly impossible to fully utilize a machine without > > having stuff randomly killed. > > > > Fuck. That. > > And besides all that, as a practical matter you can't just "not have > erro paths" because, like you said, you'd still have to have a max > size where you WARN() - and _fail the allocation_ - and you've still > got to unwind. So? the point would be we can eliminate some potentially buggy error legs on small allocations. Since we have to add MAY_FAIL and error handling (which should already exist) to the larger ones. It would have no impact at all on scaling. The question of where the limit should be in the general case should probably be compile time configurable. We can probably even get the compiler to eliminate the if (err) leg with some judicious return value priming, meaning we achieve some meaningful bug density reduction for no or very few code changes. James ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-03-01 3:52 ` Kent Overstreet 2024-03-01 4:01 ` Kent Overstreet @ 2024-03-01 4:08 ` James Bottomley 2024-03-01 4:15 ` Kent Overstreet 1 sibling, 1 reply; 52+ messages in thread From: James Bottomley @ 2024-03-01 4:08 UTC (permalink / raw) To: Kent Overstreet Cc: Matthew Wilcox, NeilBrown, Amir Goldstein, paulmck, lsf-pc, linux-mm, linux-fsdevel, Jan Kara On Thu, 2024-02-29 at 22:52 -0500, Kent Overstreet wrote: > On Fri, Mar 01, 2024 at 10:33:59AM +0700, James Bottomley wrote: > > On Thu, 2024-02-29 at 22:09 -0500, Kent Overstreet wrote: > > > Or maybe you just want the syscall to return an error instead of > > > blocking for an unbounded amount of time if userspace asks for > > > something silly. > > > > Warn on allocation above a certain size without MAY_FAIL would seem > > to cover all those cases. If there is a case for requiring instant > > allocation, you always have GFP_ATOMIC, and, I suppose, we could > > even do a bounded reclaim allocation where it tries for a certain > > time then fails. > > Then you're baking in this weird constant into all your algorithms > that doesn't scale as machine memory sizes and working set sizes > increase. > > > > Honestly, relying on the OOM killer and saying that because that > > > now we don't have to write and test your error paths is a lazy > > > cop out. > > > > OOM Killer is the most extreme outcome. Usually reclaim (hugely > > simplified) dumps clean cache first and tries the shrinkers then > > tries to write out dirty cache. Only after that hasn't found > > anything after a few iterations will the oom killer get activated > > All your caches dumped and the machine grinds to a halt and then a > random process gets killed instead of simply _failing the > allocation_. Ignoring the fact free invective below, I think what you're asking for is strict overcommit. There's a tunable for that: https://www.kernel.org/doc/Documentation/vm/overcommit-accounting However, see the Gotchas section for why we can't turn it on globally, but it is available to you if you know what you're doing. James ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-03-01 4:08 ` James Bottomley @ 2024-03-01 4:15 ` Kent Overstreet 0 siblings, 0 replies; 52+ messages in thread From: Kent Overstreet @ 2024-03-01 4:15 UTC (permalink / raw) To: James Bottomley Cc: Matthew Wilcox, NeilBrown, Amir Goldstein, paulmck, lsf-pc, linux-mm, linux-fsdevel, Jan Kara On Fri, Mar 01, 2024 at 11:08:52AM +0700, James Bottomley wrote: > On Thu, 2024-02-29 at 22:52 -0500, Kent Overstreet wrote: > > On Fri, Mar 01, 2024 at 10:33:59AM +0700, James Bottomley wrote: > > > On Thu, 2024-02-29 at 22:09 -0500, Kent Overstreet wrote: > > > > Or maybe you just want the syscall to return an error instead of > > > > blocking for an unbounded amount of time if userspace asks for > > > > something silly. > > > > > > Warn on allocation above a certain size without MAY_FAIL would seem > > > to cover all those cases. If there is a case for requiring instant > > > allocation, you always have GFP_ATOMIC, and, I suppose, we could > > > even do a bounded reclaim allocation where it tries for a certain > > > time then fails. > > > > Then you're baking in this weird constant into all your algorithms > > that doesn't scale as machine memory sizes and working set sizes > > increase. > > > > > > Honestly, relying on the OOM killer and saying that because that > > > > now we don't have to write and test your error paths is a lazy > > > > cop out. > > > > > > OOM Killer is the most extreme outcome. Usually reclaim (hugely > > > simplified) dumps clean cache first and tries the shrinkers then > > > tries to write out dirty cache. Only after that hasn't found > > > anything after a few iterations will the oom killer get activated > > > > All your caches dumped and the machine grinds to a halt and then a > > random process gets killed instead of simply _failing the > > allocation_. > > Ignoring the fact free invective below, I think what you're asking for > is strict overcommit. There's a tunable for that: > > https://www.kernel.org/doc/Documentation/vm/overcommit-accounting > > However, see the Gotchas section for why we can't turn it on globally, > but it is available to you if you know what you're doing. James, I already explained all this. ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-03-01 2:48 ` Matthew Wilcox 2024-03-01 3:09 ` Kent Overstreet @ 2024-03-05 2:54 ` Yosry Ahmed 1 sibling, 0 replies; 52+ messages in thread From: Yosry Ahmed @ 2024-03-05 2:54 UTC (permalink / raw) To: Matthew Wilcox Cc: Kent Overstreet, NeilBrown, Amir Goldstein, paulmck, lsf-pc, linux-mm, linux-fsdevel, Jan Kara On Thu, Feb 29, 2024 at 6:49 PM Matthew Wilcox <willy@infradead.org> wrote: > > On Thu, Feb 29, 2024 at 09:39:17PM -0500, Kent Overstreet wrote: > > On Fri, Mar 01, 2024 at 01:16:18PM +1100, NeilBrown wrote: > > > Insisting that GFP_KERNEL allocations never returned NULL would allow us > > > to remove a lot of untested error handling code.... > > > > If memcg ever gets enabled for all kernel side allocations we might > > start seeing failures of GFP_KERNEL allocations. > > Why would we want that behaviour? A memcg-limited allocation should > behave like any other allocation -- block until we've freed some other > memory in this cgroup, either by swap or killing or ... I am not closely following this thread (although it is very interesting), but I don't think the same rules fully apply to memcg-limited allocations, specifically because the scope of the OOM killer and of the available resources is limited to the subtree of the memcg that hit its limit. Consider a case where a memcg is full of page cache memory that is mlock()'d by a process in another memcg, or full of tmpfs memory when there is no swap (or the swap limit is reached). You can get more creative with this too: start a process in memcg A, allocate some anon memory, move the process to memcg B. Now if you cannot swap, you cannot reclaim the memory from memcg A, and killing everything in memcg A doesn't help. ^ permalink raw reply [flat|nested] 52+ messages in thread
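(For context: kernel-side allocations only become memcg-limited once they are explicitly charged to a cgroup via __GFP_ACCOUNT, so the class of allocation being discussed here is, roughly, the following; the struct name is illustrative.)

	/* Charged against the current task's memcg; counts towards, and can
	 * be stopped by, that memcg's limit. */
	struct foo *f = kzalloc(sizeof(*f), GFP_KERNEL_ACCOUNT);
	if (!f)
		return -ENOMEM;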
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-03-01 2:16 ` NeilBrown 2024-03-01 2:39 ` Kent Overstreet @ 2024-03-01 5:54 ` Dave Chinner 2024-03-01 20:20 ` Kent Overstreet 1 sibling, 1 reply; 52+ messages in thread From: Dave Chinner @ 2024-03-01 5:54 UTC (permalink / raw) To: NeilBrown Cc: Matthew Wilcox, Amir Goldstein, paulmck, lsf-pc, linux-mm, linux-fsdevel, Kent Overstreet, Jan Kara On Fri, Mar 01, 2024 at 01:16:18PM +1100, NeilBrown wrote: > On Thu, 29 Feb 2024, Matthew Wilcox wrote: > > On Tue, Feb 27, 2024 at 09:19:47PM +0200, Amir Goldstein wrote: > > > On Tue, Feb 27, 2024 at 8:56 PM Paul E. McKenney <paulmck@kernel.org> wrote: > > > > > > > > Hello! > > > > > > > > Recent discussions [1] suggest that greater mutual understanding between > > > > memory reclaim on the one hand and RCU on the other might be in order. > > > > > > > > One possibility would be an open discussion. If it would help, I would > > > > be happy to describe how RCU reacts and responds to heavy load, along with > > > > some ways that RCU's reactions and responses could be enhanced if needed. > > > > > > > > > > Adding fsdevel as this should probably be a cross track session. > > > > Perhaps broaden this slightly. On the THP Cabal call we just had a > > conversation about the requirements on filesystems in the writeback > > path. We currently tell filesystem authors that the entire writeback > > path must avoid allocating memory in order to prevent deadlock (or use > > GFP_MEMALLOC). Is this appropriate? It's a lot of work to assure that > > writing pagecache back will not allocate memory in, eg, the network stack, > > the device driver, and any other layers the write must traverse. > > > > With the removal of ->writepage from vmscan, perhaps we can make > > filesystem authors lives easier by relaxing this requirement as pagecache > > should be cleaned long before we get to reclaiming it. > > > > I don't think there's anything to be done about swapping anon memory. > > We probably don't want to proactively write anon memory to swap, so by > > the time we're in ->swap_rw we really are low on memory. > > > > > > While we are considering revising mm rules, I would really like to > revised the rule that GFP_KERNEL allocations are allowed to fail. > I'm not at all sure that they ever do (except for large allocations - so > maybe we could leave that exception in - or warn if large allocations > are tried without a MAY_FAIL flag). > > Given that GFP_KERNEL can wait, and that the mm can kill off processes > and clear cache to free memory, there should be no case where failure is > needed or when simply waiting will eventually result in success. And if > there is, the machine is a gonner anyway. Yes, please! XFS was designed and implemented on an OS that gave this exact guarantee for kernel allocations back in the early 1990s. Memory allocation simply blocked until it succeeded unless the caller indicated they could handle failure. That's what __GFP_NOFAIL does and XFS is still heavily dependent on this behaviour. And before people scream "but that was 30 years ago, Unix OS code was much simpler", consider that Irix supported machines with hundreds of NUMA nodes, thousands of CPUs, terabytes of memory and petabytes of storage. 
It had variable size high order pages in the page cache (something we've only just got with folios!), page migration, page compaction, memory and process locality control, filesystem block sizes larger than page size (which we don't have yet!), memory shrinkers for subsystem cache reclaim, page cache dirty throttling to sustained writeback IO rates, etc. Lots of the mm technology from that OS has been re-implemented in Linux in the past two decades, but in several important ways Linux still falls shy of the bar that Irix set a couple of decades ago. One of those is the kernel memory allocation guarantee. > Once upon a time user-space pages could not be ripped out of a process > by the oom killer until the process actually exited, and that meant that > GFP_KERNEL allocations of a process being oom killed should not block > indefinitely in the allocator. I *think* that isn't the case any more. > > Insisting that GFP_KERNEL allocations never returned NULL would allow us > to remove a lot of untested error handling code.... This is the sort of thing I was thinking of in the "remove GFP_NOFS" discussion thread when I said this to Kent: "We need to start designing our code in a way that doesn't require extensive testing to validate it as correct. If the only way to validate new code is correct is via stochastic coverage via error injection, then that is a clear sign we've made poor design choices along the way." https://lore.kernel.org/linux-fsdevel/ZcqWh3OyMGjEsdPz@dread.disaster.area/ If memory allocation doesn't fail by default, then we can remove the vast majority of allocation error handling from the kernel. Make the common case just work - remove the need for all that code to handle failures that is hard to exercise reliably and so are rarely tested. A simple change to make long standing behaviour an actual policy we can rely on means we can remove both code and test matrix overhead - it's a win-win IMO. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 52+ messages in thread
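A deliberately simplified sketch of the "block until success unless the caller said it may fail" behaviour Dave describes - this is not XFS's actual code, just the general shape of such a wrapper:

	void *alloc_retry_forever(size_t size, bool may_fail)
	{
		void *p;

		do {
			p = kmalloc(size, GFP_KERNEL | __GFP_NOWARN);
			if (p || may_fail)
				return p;
			/* back off briefly and let reclaim make progress */
			cond_resched();
		} while (1);
	}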
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-03-01 5:54 ` Dave Chinner @ 2024-03-01 20:20 ` Kent Overstreet 2024-03-01 23:47 ` NeilBrown 0 siblings, 1 reply; 52+ messages in thread From: Kent Overstreet @ 2024-03-01 20:20 UTC (permalink / raw) To: Dave Chinner Cc: NeilBrown, Matthew Wilcox, Amir Goldstein, paulmck, lsf-pc, linux-mm, linux-fsdevel, Jan Kara On Fri, Mar 01, 2024 at 04:54:55PM +1100, Dave Chinner wrote: > On Fri, Mar 01, 2024 at 01:16:18PM +1100, NeilBrown wrote: > > While we are considering revising mm rules, I would really like to > > revised the rule that GFP_KERNEL allocations are allowed to fail. > > I'm not at all sure that they ever do (except for large allocations - so > > maybe we could leave that exception in - or warn if large allocations > > are tried without a MAY_FAIL flag). > > > > Given that GFP_KERNEL can wait, and that the mm can kill off processes > > and clear cache to free memory, there should be no case where failure is > > needed or when simply waiting will eventually result in success. And if > > there is, the machine is a gonner anyway. > > Yes, please! > > XFS was designed and implemented on an OS that gave this exact > guarantee for kernel allocations back in the early 1990s. Memory > allocation simply blocked until it succeeded unless the caller > indicated they could handle failure. That's what __GFP_NOFAIL does > and XFS is still heavily dependent on this behaviour. I'm not saying we should get rid of __GFP_NOFAIL - actually, I'd say let's remove the underscores and get rid of the silly two page limit. GFP_NOFAIL|GFP_KERNEL is perfectly safe for larger allocations, as long as you don't mind possibly waiting a bit. But it can't be the default because, like I mentioned to Neal, there are a _lot_ of different places where we allocate memory in the kernel, and they have to be able to fail instead of shoving everything else out of memory. > This is the sort of thing I was thinking of in the "remove > GFP_NOFS" discussion thread when I said this to Kent: > > "We need to start designing our code in a way that doesn't require > extensive testing to validate it as correct. If the only way to > validate new code is correct is via stochastic coverage via error > injection, then that is a clear sign we've made poor design choices > along the way." > > https://lore.kernel.org/linux-fsdevel/ZcqWh3OyMGjEsdPz@dread.disaster.area/ > > If memory allocation doesn't fail by default, then we can remove the > vast majority of allocation error handling from the kernel. Make the > common case just work - remove the need for all that code to handle > failures that is hard to exercise reliably and so are rarely tested. > > A simple change to make long standing behaviour an actual policy we > can rely on means we can remove both code and test matrix overhead - > it's a win-win IMO. We definitely don't want to make GFP_NOIO/GFP_NOFS allocations nofail by default - a great many of those allocations have mempools in front of them to avoid deadlocks, and if you do that you've made the mempools useless. ^ permalink raw reply [flat|nested] 52+ messages in thread
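The mempool arrangement Kent refers to looks roughly like this (cache and struct names invented, error handling omitted); the reserve is what guarantees forward progress in the I/O path even though the underlying slab allocation is allowed to fail:

	static struct kmem_cache *io_req_cache;		/* hypothetical slab cache */
	static mempool_t *io_req_pool;

	/* setup: keep a reserve of 16 objects for the I/O path */
	io_req_pool = mempool_create_slab_pool(16, io_req_cache);

	/* in the I/O submission path: try a normal (failable) slab allocation
	 * first, then fall back to the reserve, waiting for an element to be
	 * freed rather than going into reclaim */
	struct io_req *req = mempool_alloc(io_req_pool, GFP_NOIO);
	/* ... use req ... */
	mempool_free(req, io_req_pool);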
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-03-01 20:20 ` Kent Overstreet @ 2024-03-01 23:47 ` NeilBrown 2024-03-02 0:02 ` Kent Overstreet 0 siblings, 1 reply; 52+ messages in thread From: NeilBrown @ 2024-03-01 23:47 UTC (permalink / raw) To: Kent Overstreet Cc: Dave Chinner, Matthew Wilcox, Amir Goldstein, paulmck, lsf-pc, linux-mm, linux-fsdevel, Jan Kara On Sat, 02 Mar 2024, Kent Overstreet wrote: > On Fri, Mar 01, 2024 at 04:54:55PM +1100, Dave Chinner wrote: > > On Fri, Mar 01, 2024 at 01:16:18PM +1100, NeilBrown wrote: > > > While we are considering revising mm rules, I would really like to > > > revised the rule that GFP_KERNEL allocations are allowed to fail. > > > I'm not at all sure that they ever do (except for large allocations - so > > > maybe we could leave that exception in - or warn if large allocations > > > are tried without a MAY_FAIL flag). > > > > > > Given that GFP_KERNEL can wait, and that the mm can kill off processes > > > and clear cache to free memory, there should be no case where failure is > > > needed or when simply waiting will eventually result in success. And if > > > there is, the machine is a gonner anyway. > > > > Yes, please! > > > > XFS was designed and implemented on an OS that gave this exact > > guarantee for kernel allocations back in the early 1990s. Memory > > allocation simply blocked until it succeeded unless the caller > > indicated they could handle failure. That's what __GFP_NOFAIL does > > and XFS is still heavily dependent on this behaviour. > > I'm not saying we should get rid of __GFP_NOFAIL - actually, I'd say > let's remove the underscores and get rid of the silly two page limit. > GFP_NOFAIL|GFP_KERNEL is perfectly safe for larger allocations, as long > as you don't mind possibly waiting a bit. > > But it can't be the default because, like I mentioned to Neal, there are > a _lot_ of different places where we allocate memory in the kernel, and > they have to be able to fail instead of shoving everything else out of > memory. > > > This is the sort of thing I was thinking of in the "remove > > GFP_NOFS" discussion thread when I said this to Kent: > > > > "We need to start designing our code in a way that doesn't require > > extensive testing to validate it as correct. If the only way to > > validate new code is correct is via stochastic coverage via error > > injection, then that is a clear sign we've made poor design choices > > along the way." > > > > https://lore.kernel.org/linux-fsdevel/ZcqWh3OyMGjEsdPz@dread.disaster.area/ > > > > If memory allocation doesn't fail by default, then we can remove the > > vast majority of allocation error handling from the kernel. Make the > > common case just work - remove the need for all that code to handle > > failures that is hard to exercise reliably and so are rarely tested. > > > > A simple change to make long standing behaviour an actual policy we > > can rely on means we can remove both code and test matrix overhead - > > it's a win-win IMO. > > We definitely don't want to make GFP_NOIO/GFP_NOFS allocations nofail by > default - a great many of those allocations have mempools in front of > them to avoid deadlocks, and if you do that you've made the mempools > useless. > Not strictly true. mempool_alloc() adds __GFP_NORETRY so the allocation will certainly fail if that is appropriate. I suspect that most places where there is a non-error fallback already use NORETRY or RETRY_MAYFAIL or similar. 
But I agree that changing the meaning of GFP_KERNEL has the potential to cause problems. I support promoting "GFP_NOFAIL", which should work at least up to PAGE_ALLOC_COSTLY_ORDER (8 pages). I'm unsure how it should behave in PF_MEMALLOC_NOFS and PF_MEMALLOC_NOIO context. I suspect Dave would tell me it should work in these contexts, in which case I'm sure it should. Maybe we could then deprecate GFP_KERNEL. Thanks, NeilBrown ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-03-01 23:47 ` NeilBrown @ 2024-03-02 0:02 ` Kent Overstreet 2024-03-02 11:33 ` Tetsuo Handa 2024-03-03 22:45 ` NeilBrown 0 siblings, 2 replies; 52+ messages in thread From: Kent Overstreet @ 2024-03-02 0:02 UTC (permalink / raw) To: NeilBrown Cc: Dave Chinner, Matthew Wilcox, Amir Goldstein, paulmck, lsf-pc, linux-mm, linux-fsdevel, Jan Kara On Sat, Mar 02, 2024 at 10:47:59AM +1100, NeilBrown wrote: > On Sat, 02 Mar 2024, Kent Overstreet wrote: > > On Fri, Mar 01, 2024 at 04:54:55PM +1100, Dave Chinner wrote: > > > On Fri, Mar 01, 2024 at 01:16:18PM +1100, NeilBrown wrote: > > > > While we are considering revising mm rules, I would really like to > > > > revised the rule that GFP_KERNEL allocations are allowed to fail. > > > > I'm not at all sure that they ever do (except for large allocations - so > > > > maybe we could leave that exception in - or warn if large allocations > > > > are tried without a MAY_FAIL flag). > > > > > > > > Given that GFP_KERNEL can wait, and that the mm can kill off processes > > > > and clear cache to free memory, there should be no case where failure is > > > > needed or when simply waiting will eventually result in success. And if > > > > there is, the machine is a gonner anyway. > > > > > > Yes, please! > > > > > > XFS was designed and implemented on an OS that gave this exact > > > guarantee for kernel allocations back in the early 1990s. Memory > > > allocation simply blocked until it succeeded unless the caller > > > indicated they could handle failure. That's what __GFP_NOFAIL does > > > and XFS is still heavily dependent on this behaviour. > > > > I'm not saying we should get rid of __GFP_NOFAIL - actually, I'd say > > let's remove the underscores and get rid of the silly two page limit. > > GFP_NOFAIL|GFP_KERNEL is perfectly safe for larger allocations, as long > > as you don't mind possibly waiting a bit. > > > > But it can't be the default because, like I mentioned to Neal, there are > > a _lot_ of different places where we allocate memory in the kernel, and > > they have to be able to fail instead of shoving everything else out of > > memory. > > > > > This is the sort of thing I was thinking of in the "remove > > > GFP_NOFS" discussion thread when I said this to Kent: > > > > > > "We need to start designing our code in a way that doesn't require > > > extensive testing to validate it as correct. If the only way to > > > validate new code is correct is via stochastic coverage via error > > > injection, then that is a clear sign we've made poor design choices > > > along the way." > > > > > > https://lore.kernel.org/linux-fsdevel/ZcqWh3OyMGjEsdPz@dread.disaster.area/ > > > > > > If memory allocation doesn't fail by default, then we can remove the > > > vast majority of allocation error handling from the kernel. Make the > > > common case just work - remove the need for all that code to handle > > > failures that is hard to exercise reliably and so are rarely tested. > > > > > > A simple change to make long standing behaviour an actual policy we > > > can rely on means we can remove both code and test matrix overhead - > > > it's a win-win IMO. > > > > We definitely don't want to make GFP_NOIO/GFP_NOFS allocations nofail by > > default - a great many of those allocations have mempools in front of > > them to avoid deadlocks, and if you do that you've made the mempools > > useless. > > > > Not strictly true. 
mempool_alloc() adds __GFP_NORETRY so the allocation > will certainly fail if that is appropriate. *nod* > I suspect that most places where there is a non-error fallback already > use NORETRY or RETRY_MAYFAIL or similar. NORETRY and RETRY_MAYFAIL actually weren't on my radar, and I don't see _tons_ of uses for either of them - more for NORETRY. My go-to is NOWAIT in this scenario though; my common pattern is "try nonblocking with locks held, then drop locks and retry GFP_KERNEL". > But I agree that changing the meaning of GFP_KERNEL has a potential to > cause problems. I support promoting "GFP_NOFAIL" which should work at > least up to PAGE_ALLOC_COSTLY_ORDER (8 pages). I'd support this change. > I'm unsure how it should be have in PF_MEMALLOC_NOFS and > PF_MEMALLOC_NOIO context. I suspect Dave would tell me it should work in > these contexts, in which case I'm sure it should. > > Maybe we could then deprecate GFP_KERNEL. What do you have in mind? Deprecating GFP_NOFS and GFP_NOIO would be wonderful - those should really just be PF_MEMALLOC_NOFS and PF_MEMALLOC_NOIO, now that we're pushing for memalloc_flags_(save|restore) more. Getting rid of those would be a really nice cleanup because then gfp flags would mostly just be: - the type of memory to allocate (highmem, zeroed, etc.) - how hard to try (don't block at all, block some, block forever) ^ permalink raw reply [flat|nested] 52+ messages in thread
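For reference, the "try nonblocking with locks held, then drop locks and retry GFP_KERNEL" pattern Kent mentions looks roughly like this (the lock and struct names are illustrative):

	new = kmalloc(sizeof(*new), GFP_NOWAIT);
	if (!new) {
		/* Can't sleep under the spinlock: drop it, do a blocking
		 * allocation, then retake the lock and revalidate, since the
		 * protected state may have changed while unlocked. */
		spin_unlock(&d->lock);
		new = kmalloc(sizeof(*new), GFP_KERNEL);
		if (!new)
			return -ENOMEM;
		spin_lock(&d->lock);
	}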
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-03-02 0:02 ` Kent Overstreet @ 2024-03-02 11:33 ` Tetsuo Handa 2024-03-02 16:53 ` Matthew Wilcox 2024-03-03 22:45 ` NeilBrown 1 sibling, 1 reply; 52+ messages in thread From: Tetsuo Handa @ 2024-03-02 11:33 UTC (permalink / raw) To: Kent Overstreet, NeilBrown Cc: Dave Chinner, Matthew Wilcox, Amir Goldstein, paulmck, lsf-pc, linux-mm, linux-fsdevel, Jan Kara On 2024/03/02 9:02, Kent Overstreet wrote: > Getting rid of those would be a really nice cleanup beacuse then gfp > flags would mostly just be: > - the type of memory to allocate (highmem, zeroed, etc.) > - how hard to try (don't block at all, block some, block forever) I really wish that __GFP_KILLABLE were added, so that we could choose between the allocation equivalents of mutex_trylock()/mutex_lock_killable()/mutex_lock(). Memory allocations from LSM modules for access control are willing to give up when the allocating process is killed, because userspace won't know whether the request succeeded anyway. But such allocations can hardly afford to give up unless the allocating process is killed or the allocation is above PAGE_ALLOC_COSTLY_ORDER, because an ENOMEM failure returned by e.g. open() can defeat the purpose of running the userspace process (i.e. it is as bad as being killed by the OOM killer). ^ permalink raw reply [flat|nested] 52+ messages in thread
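There is no __GFP_KILLABLE today; the closest open-coded approximation of the semantics Tetsuo is asking for is probably something like the sketch below (a failable allocation retried until a fatal signal arrives), though a real flag could also react to the signal from inside reclaim itself:

	static void *kmalloc_killable(size_t size)
	{
		void *p;

		do {
			p = kmalloc(size, GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOWARN);
			if (p)
				return p;
		} while (!fatal_signal_pending(current));

		return NULL;	/* caller was killed; propagate an error upwards */
	}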
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-03-02 11:33 ` Tetsuo Handa @ 2024-03-02 16:53 ` Matthew Wilcox 0 siblings, 0 replies; 52+ messages in thread From: Matthew Wilcox @ 2024-03-02 16:53 UTC (permalink / raw) To: Tetsuo Handa Cc: Kent Overstreet, NeilBrown, Dave Chinner, Amir Goldstein, paulmck, lsf-pc, linux-mm, linux-fsdevel, Jan Kara On Sat, Mar 02, 2024 at 08:33:34PM +0900, Tetsuo Handa wrote: > On 2024/03/02 9:02, Kent Overstreet wrote: > > Getting rid of those would be a really nice cleanup beacuse then gfp > > flags would mostly just be: > > - the type of memory to allocate (highmem, zeroed, etc.) > > - how hard to try (don't block at all, block some, block forever) > > I really wish that __GFP_KILLABLE is added, so that we can use like > mutex_trylock()/mutex_lock_killable()/mutex_lock(). > > Memory allocations from LSM modules for access control are willing to > give up when the allocating process is killed, for userspace won't know > about whether the request succeed. But such allocations are hardly > acceptable to give up unless the allocating process is killed or > allocation hit PAGE_ALLOC_COSTLY_ORDER, for ENOMEM failure returned by > e.g. open() can break the purpose of executing userspace processes > (i.e. as bad as being killed by the OOM killer). I'd be open to adding that, but does it really happen? For it to be useful, we'd have to have an allocation spending a significant amount of time in memory allocation such that the signal arrives while we're retrying. Do you have any details about times when you've seen this, eg sizes, other GFP flags, ... ? ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-03-02 0:02 ` Kent Overstreet 2024-03-02 11:33 ` Tetsuo Handa @ 2024-03-03 22:45 ` NeilBrown 2024-03-03 22:54 ` Kent Overstreet ` (4 more replies) 1 sibling, 5 replies; 52+ messages in thread From: NeilBrown @ 2024-03-03 22:45 UTC (permalink / raw) To: Kent Overstreet Cc: Dave Chinner, Matthew Wilcox, Amir Goldstein, paulmck, lsf-pc, linux-mm, linux-fsdevel, Jan Kara On Sat, 02 Mar 2024, Kent Overstreet wrote: > On Sat, Mar 02, 2024 at 10:47:59AM +1100, NeilBrown wrote: > > On Sat, 02 Mar 2024, Kent Overstreet wrote: > > > On Fri, Mar 01, 2024 at 04:54:55PM +1100, Dave Chinner wrote: > > > > On Fri, Mar 01, 2024 at 01:16:18PM +1100, NeilBrown wrote: > > > > > While we are considering revising mm rules, I would really like to > > > > > revised the rule that GFP_KERNEL allocations are allowed to fail. > > > > > I'm not at all sure that they ever do (except for large allocations - so > > > > > maybe we could leave that exception in - or warn if large allocations > > > > > are tried without a MAY_FAIL flag). > > > > > > > > > > Given that GFP_KERNEL can wait, and that the mm can kill off processes > > > > > and clear cache to free memory, there should be no case where failure is > > > > > needed or when simply waiting will eventually result in success. And if > > > > > there is, the machine is a gonner anyway. > > > > > > > > Yes, please! > > > > > > > > XFS was designed and implemented on an OS that gave this exact > > > > guarantee for kernel allocations back in the early 1990s. Memory > > > > allocation simply blocked until it succeeded unless the caller > > > > indicated they could handle failure. That's what __GFP_NOFAIL does > > > > and XFS is still heavily dependent on this behaviour. > > > > > > I'm not saying we should get rid of __GFP_NOFAIL - actually, I'd say > > > let's remove the underscores and get rid of the silly two page limit. > > > GFP_NOFAIL|GFP_KERNEL is perfectly safe for larger allocations, as long > > > as you don't mind possibly waiting a bit. > > > > > > But it can't be the default because, like I mentioned to Neal, there are > > > a _lot_ of different places where we allocate memory in the kernel, and > > > they have to be able to fail instead of shoving everything else out of > > > memory. > > > > > > > This is the sort of thing I was thinking of in the "remove > > > > GFP_NOFS" discussion thread when I said this to Kent: > > > > > > > > "We need to start designing our code in a way that doesn't require > > > > extensive testing to validate it as correct. If the only way to > > > > validate new code is correct is via stochastic coverage via error > > > > injection, then that is a clear sign we've made poor design choices > > > > along the way." > > > > > > > > https://lore.kernel.org/linux-fsdevel/ZcqWh3OyMGjEsdPz@dread.disaster.area/ > > > > > > > > If memory allocation doesn't fail by default, then we can remove the > > > > vast majority of allocation error handling from the kernel. Make the > > > > common case just work - remove the need for all that code to handle > > > > failures that is hard to exercise reliably and so are rarely tested. > > > > > > > > A simple change to make long standing behaviour an actual policy we > > > > can rely on means we can remove both code and test matrix overhead - > > > > it's a win-win IMO. 
> > > > > > We definitely don't want to make GFP_NOIO/GFP_NOFS allocations nofail by > > > default - a great many of those allocations have mempools in front of > > > them to avoid deadlocks, and if you do that you've made the mempools > > > useless. > > > > > > > Not strictly true. mempool_alloc() adds __GFP_NORETRY so the allocation > > will certainly fail if that is appropriate. > > *nod* > > > I suspect that most places where there is a non-error fallback already > > use NORETRY or RETRY_MAYFAIL or similar. > > NORETRY and RETRY_MAYFAIL actually weren't on my radar, and I don't see > _tons_ of uses for either of them - more for NORETRY. > > My go-to is NOWAIT in this scenario though; my common pattern is "try > nonblocking with locks held, then drop locks and retry GFP_KERNEL". > > > But I agree that changing the meaning of GFP_KERNEL has a potential to > > cause problems. I support promoting "GFP_NOFAIL" which should work at > > least up to PAGE_ALLOC_COSTLY_ORDER (8 pages). > > I'd support this change. > > > I'm unsure how it should be have in PF_MEMALLOC_NOFS and > > PF_MEMALLOC_NOIO context. I suspect Dave would tell me it should work in > > these contexts, in which case I'm sure it should. > > > > Maybe we could then deprecate GFP_KERNEL. > > What do you have in mind? I have in mind a more explicit statement of how much waiting is acceptable. GFP_NOFAIL - wait indefinitely GFP_KILLABLE - wait indefinitely unless fatal signal is pending. GFP_RETRY - may retry but deadlock, though unlikely, is possible. So don't wait indefinitely. May abort more quickly if fatal signal is pending. GFP_NO_RETRY - only try things once. This may sleep, but will give up fairly quickly. Either deadlock is a significant possibility, or alternate strategy is fairly cheap. GFP_ATOMIC - don't sleep - same as current. I don't see how "GFP_KERNEL" fits into that spectrum. The definition of "this will try really hard, but might fail and we can't really tell you what circumstances it might fail in" isn't fun to work with. Thanks, NeilBrown > > Deprecating GFP_NOFS and GFP_NOIO would be wonderful - those should > really just be PF_MEMALLOC_NOFS and PF_MEMALLOC_NOIO, now that we're > pushing for memalloc_flags_(save|restore) more. > > Getting rid of those would be a really nice cleanup beacuse then gfp > flags would mostly just be: > - the type of memory to allocate (highmem, zeroed, etc.) > - how hard to try (don't block at all, block some, block forever) > ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-03-03 22:45 ` NeilBrown @ 2024-03-03 22:54 ` Kent Overstreet 2024-03-04 0:20 ` Dave Chinner ` (3 subsequent siblings) 4 siblings, 0 replies; 52+ messages in thread From: Kent Overstreet @ 2024-03-03 22:54 UTC (permalink / raw) To: NeilBrown Cc: Dave Chinner, Matthew Wilcox, Amir Goldstein, paulmck, lsf-pc, linux-mm, linux-fsdevel, Jan Kara On Mon, Mar 04, 2024 at 09:45:48AM +1100, NeilBrown wrote: > I have in mind a more explicit statement of how much waiting is > acceptable. > > GFP_NOFAIL - wait indefinitely > GFP_KILLABLE - wait indefinitely unless fatal signal is pending. > GFP_RETRY - may retry but deadlock, though unlikely, is possible. So > don't wait indefinitely. May abort more quickly if fatal > signal is pending. > GFP_NO_RETRY - only try things once. This may sleep, but will give up > fairly quickly. Either deadlock is a significant > possibility, or alternate strategy is fairly cheap. > GFP_ATOMIC - don't sleep - same as current. > > I don't see how "GFP_KERNEL" fits into that spectrum. The definition of > "this will try really hard, but might fail and we can't really tell you > what circumstances it might fail in" isn't fun to work with. Well, lots of things "aren't fun to work with", but error handling is just a part of life. Your "GFP_KILLABLE" has the exact same problem of "this thing will be rarely hit and difficult to test" - if anything moreso. We just need to make sure error paths are getting tested - we need more practical fault injection, that's all. ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-03-03 22:45 ` NeilBrown 2024-03-03 22:54 ` Kent Overstreet @ 2024-03-04 0:20 ` Dave Chinner 2024-03-04 1:16 ` NeilBrown 2024-03-04 0:35 ` Matthew Wilcox ` (2 subsequent siblings) 4 siblings, 1 reply; 52+ messages in thread From: Dave Chinner @ 2024-03-04 0:20 UTC (permalink / raw) To: NeilBrown Cc: Kent Overstreet, Matthew Wilcox, Amir Goldstein, paulmck, lsf-pc, linux-mm, linux-fsdevel, Jan Kara On Mon, Mar 04, 2024 at 09:45:48AM +1100, NeilBrown wrote: > I have in mind a more explicit statement of how much waiting is > acceptable. > > GFP_NOFAIL - wait indefinitely Make this the default, and we don't need a flag for it at all. > GFP_KILLABLE - wait indefinitely unless fatal signal is pending. > GFP_RETRY - may retry but deadlock, though unlikely, is possible. So > don't wait indefinitely. May abort more quickly if fatal > signal is pending. KILLABLE and RETRY are the same thing from the caller POV. Effectively "GFP_MAY_FAIL", where it will try really hard, but if there is a risk of deadlock or a fatal signal pending, it will fail. > GFP_NO_RETRY - only try things once. This may sleep, but will give up > fairly quickly. Either deadlock is a significant > possibility, or alternate strategy is fairly cheap. > GFP_ATOMIC - don't sleep - same as current. We're talking about wait semantics, so GFP_ATOMIC should be named GFP_NO_WAIT and described as "same as NO_RETRY but will not sleep". That gives us three modifying flags to the default behaviour of sleeping until success: GFP_MAY_FAIL, GFP_NO_RETRY and GFP_NO_WAIT. I will note there is one more case callers might really want to avoid: direct reclaim. That sort of behaviour might be worth folding into GFP_NO_WAIT, as there are cases where we want the allocation attempt to fail without trying to reclaim memory because it's *much* faster to simply use the fallback mechanism than it is to attempt memory reclaim (e.g. xlog_kvmalloc()). > I don't see how "GFP_KERNEL" fits into that spectrum. Agreed. > The definition of > "this will try really hard, but might fail and we can't really tell you > what circumstances it might fail in" isn't fun to work with. Yup, XFS was designed for NO_FAIL and MAY_FAIL behaviour, and in more recent times we also use NO_RECLAIM to provide our own kvmalloc semantics because the current kvmalloc API really only supports "GFP_KERNEL" allocation. > > Deprecating GFP_NOFS and GFP_NOIO would be wonderful - those should > > really just be PF_MEMALLOC_NOFS and PF_MEMALLOC_NOIO, now that we're > > pushing for memalloc_flags_(save|restore) more. This is largely now subsystem maintenance work - the infrastructure is there, and some subsystems have been converted over entirely to use it. The remaining work either needs to be mandated or have someone explicitly tasked with completing that work. IOWs, the process in which we change APIs and then leave the long tail of conversions to subsystem maintainers is just a mechanism for creating technical debt that takes forever to clean up... > > Getting rid of those would be a really nice cleanup beacuse then gfp > > flags would mostly just be: > > - the type of memory to allocate (highmem, zeroed, etc.) > > - how hard to try (don't block at all, block some, block forever) Yup, would be a very good improvement. -Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 52+ messages in thread
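The scoped interface being referred to already exists; the remaining conversion work is replacing explicit GFP_NOFS/GFP_NOIO call sites with a scope such as the following (run_transaction is an illustrative name, not a real helper):

	unsigned int nofs_flags;

	/* Every allocation inside this section - including GFP_KERNEL
	 * allocations made by library code that knows nothing about the
	 * filesystem - implicitly behaves as GFP_NOFS. */
	nofs_flags = memalloc_nofs_save();
	error = run_transaction(tp);		/* illustrative */
	memalloc_nofs_restore(nofs_flags);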
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-03-04 0:20 ` Dave Chinner @ 2024-03-04 1:16 ` NeilBrown 0 siblings, 0 replies; 52+ messages in thread From: NeilBrown @ 2024-03-04 1:16 UTC (permalink / raw) To: Dave Chinner Cc: Kent Overstreet, Matthew Wilcox, Amir Goldstein, paulmck, lsf-pc, linux-mm, linux-fsdevel, Jan Kara On Mon, 04 Mar 2024, Dave Chinner wrote: > On Mon, Mar 04, 2024 at 09:45:48AM +1100, NeilBrown wrote: > > I have in mind a more explicit statement of how much waiting is > > acceptable. > > > > GFP_NOFAIL - wait indefinitely > > Make this the default, and we don't need a flag for it at all. These aren't meant to be flags - those start with __. They are .. "flag combinations" to use the term description from gfp_types.h There could only be a "default" if we used macro magic to allow the GFP_ argument to be omitted. > > > GFP_KILLABLE - wait indefinitely unless fatal signal is pending. > > GFP_RETRY - may retry but deadlock, though unlikely, is possible. So > > don't wait indefinitely. May abort more quickly if fatal > > signal is pending. > > KILLABLE and RETRY are the same thing from the caller POV. > Effectively "GFP_MAY_FAIL", where it will try really hard, but if it > there is a risk of deadlock or a fatal signal pending, it will fail. > > > GFP_NO_RETRY - only try things once. This may sleep, but will give up > > fairly quickly. Either deadlock is a significant > > possibility, or alternate strategy is fairly cheap. > > GFP_ATOMIC - don't sleep - same as current. > > We're talking about wait semantics, so GFP_ATOMIC should be named > GFP_NO_WAIT and described as "same as NO_RETRY but will not sleep". > > That gives us three modifying flags to the default behaviour of > sleeping until success: GFP_MAY_FAIL, GFP_NO_RETRY and GFP_NO_WAIT. We currently have both __GFP_NORETRY and __GFP_RETRY_MAYFAIL which differ in how many retries (1 or a few) and are both used (the former about twice as much as the latter). Do we need both? Commit dcda9b04713c ("mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful semantic") might be useful in understanding the RETRY_MAYFAIL semantic. I think the intent is that RETRY_MAYFAIL doesn't trigger the oom killer. That seems like it could be a useful distinction. GFP_NOFAIL - retry indefinitely GFP_NOOOM - retry until fatal signal or OOM condition GFP_NORETRY - sleep if needed, but don't retry GFP_NOSLEEP - AKA GFP_ATOMIC We might need a better name than GFP_NOOOM :-) Thanks, NeilBrown > > I will note there is one more case callers might really want to > avoid: direct reclaim. That sort of behaviour might be worth folding > into GFP_NO_WAIT, as there are cases where we want the allocation > attempt to fail without trying to reclaim memory because it's *much* > faster to simply use the fallback mechanism than it is to attempt > memory reclaim (e.g. xlog_kvmalloc()). > > > I don't see how "GFP_KERNEL" fits into that spectrum. > > Agreed. > > > The definition of > > "this will try really hard, but might fail and we can't really tell you > > what circumstances it might fail in" isn't fun to work with. > > Yup, XFS was designed for NO_FAIL and MAY_FAIL behaviour, and in more > recent times we also use NO_RECLAIM to provide our own kvmalloc > semantics because the current kvmalloc API really only supports > "GFP_KERNEL" allocation. 
> > > > Deprecating GFP_NOFS and GFP_NOIO would be wonderful - those should > > > really just be PF_MEMALLOC_NOFS and PF_MEMALLOC_NOIO, now that we're > > > pushing for memalloc_flags_(save|restore) more. > > This is largely now subsystem maintenance work - the infrastructure > is there, and some subsystems have been converted over entirely to > use it. The remaining work either needs to be mandated or have > someone explicitly tasked with completing that work. > > IOWs, the process in which we change APIs and then leave the long > tail of conversions to subsystem maintainers is just a mechanism for > creating technical debt that takes forever to clean up... > > > > Getting rid of those would be a really nice cleanup beacuse then gfp > > > flags would mostly just be: > > > - the type of memory to allocate (highmem, zeroed, etc.) > > > - how hard to try (don't block at all, block some, block forever) > > Yup, would be a very good improvement. > > -Dave. > -- > Dave Chinner > david@fromorbit.com > ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-03-03 22:45 ` NeilBrown 2024-03-03 22:54 ` Kent Overstreet 2024-03-04 0:20 ` Dave Chinner @ 2024-03-04 0:35 ` Matthew Wilcox 2024-03-04 1:27 ` NeilBrown 2024-03-04 2:05 ` Kent Overstreet 2024-03-12 14:46 ` Vlastimil Babka 2024-03-21 6:27 ` Dan Carpenter 4 siblings, 2 replies; 52+ messages in thread From: Matthew Wilcox @ 2024-03-04 0:35 UTC (permalink / raw) To: NeilBrown Cc: Kent Overstreet, Dave Chinner, Amir Goldstein, paulmck, lsf-pc, linux-mm, linux-fsdevel, Jan Kara On Mon, Mar 04, 2024 at 09:45:48AM +1100, NeilBrown wrote: > I have in mind a more explicit statement of how much waiting is > acceptable. > > GFP_NOFAIL - wait indefinitely > GFP_KILLABLE - wait indefinitely unless fatal signal is pending. > GFP_RETRY - may retry but deadlock, though unlikely, is possible. So > don't wait indefinitely. May abort more quickly if fatal > signal is pending. > GFP_NO_RETRY - only try things once. This may sleep, but will give up > fairly quickly. Either deadlock is a significant > possibility, or alternate strategy is fairly cheap. > GFP_ATOMIC - don't sleep - same as current. I don't think these should be GFP flags. Rather, these should be context flags (and indeed, they're mutually exclusive, so this is a small integer to represent where we are on the spectrum). That is we want code to do void *alloc_foo(void) { return init_foo(kmalloc(256, GFP_MOVABLE)); } void submit_foo(void) { spin_lock(&submit_lock); flags = memalloc_set_atomic(); __submit_foo(alloc_foo()); memalloc_restore_flags(flags); spin_unlock(&submit_lock); } struct foo *prealloc_foo(void) { return alloc_foo(); } ... for various degrees of complexity. That is, the _location_ of memory is allocation site knowledge, but how hard to try is _path_ dependent, and not known at the allocation site because it doesn't know what locks are held. ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-03-04 0:35 ` Matthew Wilcox @ 2024-03-04 1:27 ` NeilBrown 2024-03-04 2:05 ` Kent Overstreet 1 sibling, 0 replies; 52+ messages in thread From: NeilBrown @ 2024-03-04 1:27 UTC (permalink / raw) To: Matthew Wilcox Cc: Kent Overstreet, Dave Chinner, Amir Goldstein, paulmck, lsf-pc, linux-mm, linux-fsdevel, Jan Kara On Mon, 04 Mar 2024, Matthew Wilcox wrote: > On Mon, Mar 04, 2024 at 09:45:48AM +1100, NeilBrown wrote: > > I have in mind a more explicit statement of how much waiting is > > acceptable. > > > > GFP_NOFAIL - wait indefinitely > > GFP_KILLABLE - wait indefinitely unless fatal signal is pending. > > GFP_RETRY - may retry but deadlock, though unlikely, is possible. So > > don't wait indefinitely. May abort more quickly if fatal > > signal is pending. > > GFP_NO_RETRY - only try things once. This may sleep, but will give up > > fairly quickly. Either deadlock is a significant > > possibility, or alternate strategy is fairly cheap. > > GFP_ATOMIC - don't sleep - same as current. > > I don't think these should be GFP flags. Rather, these should be > context flags (and indeed, they're mutually exclusive, so this is a > small integer to represent where we are on the spectrum). That is > we want code to do > > void *alloc_foo(void) > { > return init_foo(kmalloc(256, GFP_MOVABLE)); > } > > void submit_foo(void) > { > spin_lock(&submit_lock); > flags = memalloc_set_atomic(); > __submit_foo(alloc_foo()); > memalloc_restore_flags(flags); > spin_unlock(&submit_lock); > } > > struct foo *prealloc_foo(void) > { > return alloc_foo(); > } > > ... for various degrees of complexity. That is, the _location_ of memory > is allocation site knowledge, but how hard to try is _path_ dependent, > and not known at the allocation site because it doesn't know what locks > are held. > I'm not convinced. While there is a path dependency there is also location dependency. The code at the call-site determines what happens in response to failure. For GFP_NOFAIL, failure is not possible. We cannot allow context to turn NOFAIL into NOSLEEP because context cannot add error handling. Consider mempool_alloc(). That requests a NORETRY allocation because there is an easy fall-back. Is that a location dependency or a path dependency? I would say it is location. Of course a location is just a very short path so it is a path dependency too - but is that perspective really helpful? Certainly I could accept a GFP_WHATEVER which uses NOSLEEP if path context demands that, and NO_OOM otherwise. Thanks, NeilBrown ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-03-04 0:35 ` Matthew Wilcox 2024-03-04 1:27 ` NeilBrown @ 2024-03-04 2:05 ` Kent Overstreet 1 sibling, 0 replies; 52+ messages in thread From: Kent Overstreet @ 2024-03-04 2:05 UTC (permalink / raw) To: Matthew Wilcox Cc: NeilBrown, Dave Chinner, Amir Goldstein, paulmck, lsf-pc, linux-mm, linux-fsdevel, Jan Kara On Mon, Mar 04, 2024 at 12:35:05AM +0000, Matthew Wilcox wrote: > On Mon, Mar 04, 2024 at 09:45:48AM +1100, NeilBrown wrote: > > I have in mind a more explicit statement of how much waiting is > > acceptable. > > > > GFP_NOFAIL - wait indefinitely > > GFP_KILLABLE - wait indefinitely unless fatal signal is pending. > > GFP_RETRY - may retry but deadlock, though unlikely, is possible. So > > don't wait indefinitely. May abort more quickly if fatal > > signal is pending. > > GFP_NO_RETRY - only try things once. This may sleep, but will give up > > fairly quickly. Either deadlock is a significant > > possibility, or alternate strategy is fairly cheap. > > GFP_ATOMIC - don't sleep - same as current. > > I don't think these should be GFP flags. Rather, these should be > context flags (and indeed, they're mutually exclusive, so this is a > small integer to represent where we are on the spectrum). That is > we want code to do Why? Context flags are for /context/, i.e. the scope where you take a lock that's GFP_FS unsafe. These really are callsite specific - "how bad is it if we have to deal with an allocation failure here?" ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-03-03 22:45 ` NeilBrown ` (2 preceding siblings ...) 2024-03-04 0:35 ` Matthew Wilcox @ 2024-03-12 14:46 ` Vlastimil Babka 2024-03-12 22:09 ` NeilBrown 2024-03-20 18:32 ` Dan Carpenter 2024-03-21 6:27 ` Dan Carpenter 4 siblings, 2 replies; 52+ messages in thread From: Vlastimil Babka @ 2024-03-12 14:46 UTC (permalink / raw) To: NeilBrown, Kent Overstreet Cc: Dave Chinner, Matthew Wilcox, Amir Goldstein, paulmck, lsf-pc, linux-mm, linux-fsdevel, Jan Kara On 3/3/24 23:45, NeilBrown wrote: > On Sat, 02 Mar 2024, Kent Overstreet wrote: >> >> *nod* >> >> > I suspect that most places where there is a non-error fallback already >> > use NORETRY or RETRY_MAYFAIL or similar. >> >> NORETRY and RETRY_MAYFAIL actually weren't on my radar, and I don't see >> _tons_ of uses for either of them - more for NORETRY. >> >> My go-to is NOWAIT in this scenario though; my common pattern is "try >> nonblocking with locks held, then drop locks and retry GFP_KERNEL". >> >> > But I agree that changing the meaning of GFP_KERNEL has a potential to >> > cause problems. I support promoting "GFP_NOFAIL" which should work at >> > least up to PAGE_ALLOC_COSTLY_ORDER (8 pages). >> >> I'd support this change. >> >> > I'm unsure how it should be have in PF_MEMALLOC_NOFS and >> > PF_MEMALLOC_NOIO context. I suspect Dave would tell me it should work in >> > these contexts, in which case I'm sure it should. >> > >> > Maybe we could then deprecate GFP_KERNEL. >> >> What do you have in mind? > > I have in mind a more explicit statement of how much waiting is > acceptable. > > GFP_NOFAIL - wait indefinitely > GFP_KILLABLE - wait indefinitely unless fatal signal is pending. > GFP_RETRY - may retry but deadlock, though unlikely, is possible. So > don't wait indefinitely. May abort more quickly if fatal > signal is pending. > GFP_NO_RETRY - only try things once. This may sleep, but will give up > fairly quickly. Either deadlock is a significant > possibility, or alternate strategy is fairly cheap. > GFP_ATOMIC - don't sleep - same as current. > > I don't see how "GFP_KERNEL" fits into that spectrum. The definition of > "this will try really hard, but might fail and we can't really tell you > what circumstances it might fail in" isn't fun to work with. The problem is if we set out to change everything from GFP_KERNEL to one of the above, it will take many years. So I think it would be better to just change the semantics of GFP_KERNEL too. If we change it to remove the "too-small to fail" rule, we might suddenly introduce crashes in unknown amount of places, so I don't think that's feasible. But if we change it to effectively mean GFP_NOFAIL (for non-costly allocations), there should be a manageable number of places to change to a variant that allows failure. Also if these places are GFP_KERNEL by mistake today, and should in fact allow failure, they would be already causing problems today, as the circumstances where too-small-to-fail is violated are quite rare (IIRC just being an oom victim, so somewhat close to GFP_KILLABLE). So changing GFP_KERNEL to GFP_NOFAIL should be the lowest risk (one could argue for GFP_KILLABLE but I'm afraid many places don't really handle that as they assume the too-small-to-fail without exceptions and are unaware of the oom victim loophole, and failing on any fatal signal increases the chances of this happening). 
> Thanks, > NeilBrown > > >> >> Deprecating GFP_NOFS and GFP_NOIO would be wonderful - those should >> really just be PF_MEMALLOC_NOFS and PF_MEMALLOC_NOIO, now that we're >> pushing for memalloc_flags_(save|restore) more. >> >> Getting rid of those would be a really nice cleanup beacuse then gfp >> flags would mostly just be: >> - the type of memory to allocate (highmem, zeroed, etc.) >> - how hard to try (don't block at all, block some, block forever) >> > > ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-03-12 14:46 ` Vlastimil Babka @ 2024-03-12 22:09 ` NeilBrown 2024-03-20 18:32 ` Dan Carpenter 1 sibling, 0 replies; 52+ messages in thread From: NeilBrown @ 2024-03-12 22:09 UTC (permalink / raw) To: Vlastimil Babka Cc: Kent Overstreet, Dave Chinner, Matthew Wilcox, Amir Goldstein, paulmck, lsf-pc, linux-mm, linux-fsdevel, Jan Kara On Wed, 13 Mar 2024, Vlastimil Babka wrote: > On 3/3/24 23:45, NeilBrown wrote: > > On Sat, 02 Mar 2024, Kent Overstreet wrote: > >> > >> *nod* > >> > >> > I suspect that most places where there is a non-error fallback already > >> > use NORETRY or RETRY_MAYFAIL or similar. > >> > >> NORETRY and RETRY_MAYFAIL actually weren't on my radar, and I don't see > >> _tons_ of uses for either of them - more for NORETRY. > >> > >> My go-to is NOWAIT in this scenario though; my common pattern is "try > >> nonblocking with locks held, then drop locks and retry GFP_KERNEL". > >> > >> > But I agree that changing the meaning of GFP_KERNEL has a potential to > >> > cause problems. I support promoting "GFP_NOFAIL" which should work at > >> > least up to PAGE_ALLOC_COSTLY_ORDER (8 pages). > >> > >> I'd support this change. > >> > >> > I'm unsure how it should be have in PF_MEMALLOC_NOFS and > >> > PF_MEMALLOC_NOIO context. I suspect Dave would tell me it should work in > >> > these contexts, in which case I'm sure it should. > >> > > >> > Maybe we could then deprecate GFP_KERNEL. > >> > >> What do you have in mind? > > > > I have in mind a more explicit statement of how much waiting is > > acceptable. > > > > GFP_NOFAIL - wait indefinitely > > GFP_KILLABLE - wait indefinitely unless fatal signal is pending. > > GFP_RETRY - may retry but deadlock, though unlikely, is possible. So > > don't wait indefinitely. May abort more quickly if fatal > > signal is pending. > > GFP_NO_RETRY - only try things once. This may sleep, but will give up > > fairly quickly. Either deadlock is a significant > > possibility, or alternate strategy is fairly cheap. > > GFP_ATOMIC - don't sleep - same as current. > > > > I don't see how "GFP_KERNEL" fits into that spectrum. The definition of > > "this will try really hard, but might fail and we can't really tell you > > what circumstances it might fail in" isn't fun to work with. > > The problem is if we set out to change everything from GFP_KERNEL to one of > the above, it will take many years. So I think it would be better to just > change the semantics of GFP_KERNEL too. It took a long time to completely remove the BKL too. I don't think this is something we should be afraid of. We can easily use tools to remind us about the work that needs doing and the progress being made. > > If we change it to remove the "too-small to fail" rule, we might suddenly > introduce crashes in unknown amount of places, so I don't think that's feasible. I don't think anyone wants that. > > But if we change it to effectively mean GFP_NOFAIL (for non-costly > allocations), there should be a manageable number of places to change to a > variant that allows failure. Also if these places are GFP_KERNEL by mistake > today, and should in fact allow failure, they would be already causing > problems today, as the circumstances where too-small-to-fail is violated are > quite rare (IIRC just being an oom victim, so somewhat close to > GFP_KILLABLE). 
So changing GFP_KERNEL to GFP_NOFAIL should be the lowest > risk (one could argue for GFP_KILLABLE but I'm afraid many places don't > really handle that as they assume the too-small-to-fail without exceptions > and are unaware of the oom victim loophole, and failing on any fatal signal > increases the chances of this happening). I think many uses of GFP_KERNEL should be changed to GFP_NOFAIL. But that ISN'T just changing the flag (or the meaning of the flag). It also involves removing that code that handles failure. That is, to me, a strong argument against redefining GFP_KERNEL to mean GFP_NOFAIL. We (I) really do want that error handling code to be removed. That needs to be done on a case-by-case basis. Thanks, NeilBrown ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-03-12 14:46 ` Vlastimil Babka 2024-03-12 22:09 ` NeilBrown @ 2024-03-20 18:32 ` Dan Carpenter 2024-03-20 18:48 ` Vlastimil Babka 2024-03-20 19:09 ` Kent Overstreet 1 sibling, 2 replies; 52+ messages in thread From: Dan Carpenter @ 2024-03-20 18:32 UTC (permalink / raw) To: Vlastimil Babka Cc: NeilBrown, Kent Overstreet, Dave Chinner, Matthew Wilcox, Amir Goldstein, paulmck, lsf-pc, linux-mm, linux-fsdevel, Jan Kara On Tue, Mar 12, 2024 at 03:46:32PM +0100, Vlastimil Babka wrote: > But if we change it to effectively mean GFP_NOFAIL (for non-costly > allocations), there should be a manageable number of places to change to a > variant that allows failure. What does that even mean if GFP_NOFAIL can fail for "costly" allocations? I thought GFP_NOFAIL couldn't fail at all... Unfortunately, it's common that when we can't decide on a sane limit for something people just say "let the user decide based on how much memory they have". I have added some integer overflow checks which allow the user to allocate up to UINT_MAX bytes so I know this code is out there. We can't just s/GFP_KERNEL/GFP_NOFAIL/. From a static analysis perspective it would be nice if the callers explicitly marked which allocations can fail and which can't. regards, dan carpenter ^ permalink raw reply [flat|nested] 52+ messages in thread
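The kind of call site Dan is describing typically looks like the sketch below (the req fields are invented); it is exactly the sort of allocation that must keep its error path rather than becoming implicitly nofail:

	size_t bytes;
	void *buf;

	/* nr and elem_size come from userspace, bounded only by the overflow check */
	if (check_mul_overflow(req.nr, req.elem_size, &bytes))
		return -EOVERFLOW;

	buf = kvmalloc(bytes, GFP_KERNEL);	/* can approach UINT_MAX bytes */
	if (!buf)
		return -ENOMEM;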
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-03-20 18:32 ` Dan Carpenter @ 2024-03-20 18:48 ` Vlastimil Babka 2024-03-20 18:55 ` Matthew Wilcox 2024-03-20 19:09 ` Kent Overstreet 1 sibling, 1 reply; 52+ messages in thread From: Vlastimil Babka @ 2024-03-20 18:48 UTC (permalink / raw) To: Dan Carpenter Cc: NeilBrown, Kent Overstreet, Dave Chinner, Matthew Wilcox, Amir Goldstein, paulmck, lsf-pc, linux-mm, linux-fsdevel, Jan Kara On 3/20/24 19:32, Dan Carpenter wrote: > On Tue, Mar 12, 2024 at 03:46:32PM +0100, Vlastimil Babka wrote: >> But if we change it to effectively mean GFP_NOFAIL (for non-costly >> allocations), there should be a manageable number of places to change to a >> variant that allows failure. > > What does that even mean if GFP_NOFAIL can fail for "costly" allocations? > I thought GFP_NOFAIL couldn't fail at all... Yeah, the suggestion was that GFP_KERNEL would act as GFP_NOFAIL but only for non-costly allocations. Anything marked GFP_NOFAIL would still be fully nofail. > Unfortunately, it's common that when we can't decide on a sane limit for > something people just say "let the user decide based on how much memory > they have". I have added some integer overflow checks which allow the > user to allocate up to UINT_MAX bytes so I know this code is out > there. We can't just s/GFP_KERNEL/GFP_NOFAIL/. Maybe we could start producing warnings for costly GFP_KERNEL allocations to get them converted away faster. Anything that's user-controlled most likely shouldn't be GFP_KERNEL. > From a static analysis perspective it would be nice if the callers > explicitly marked which allocations can fail and which can't. As I suggested, it would be nice not to wait until everything is explicitly marked one way or another. I get the comparison with the BKL, but the kernel has also grown much larger since the BKL times? Good point that it's not ideal if the size is unknown. Maybe the tools could be used to point out places where the size cannot be determined, so those should be converted first? Also tools could warn about attempts to handle failure, to point out places where it should be removed? > regards, > dan carpenter > ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-03-20 18:48 ` Vlastimil Babka @ 2024-03-20 18:55 ` Matthew Wilcox 2024-03-20 19:07 ` Kent Overstreet 0 siblings, 1 reply; 52+ messages in thread From: Matthew Wilcox @ 2024-03-20 18:55 UTC (permalink / raw) To: Vlastimil Babka Cc: Dan Carpenter, NeilBrown, Kent Overstreet, Dave Chinner, Amir Goldstein, paulmck, lsf-pc, linux-mm, linux-fsdevel, Jan Kara On Wed, Mar 20, 2024 at 07:48:01PM +0100, Vlastimil Babka wrote: > On 3/20/24 19:32, Dan Carpenter wrote: > > On Tue, Mar 12, 2024 at 03:46:32PM +0100, Vlastimil Babka wrote: > >> But if we change it to effectively mean GFP_NOFAIL (for non-costly > >> allocations), there should be a manageable number of places to change to a > >> variant that allows failure. > > > > What does that even mean if GFP_NOFAIL can fail for "costly" allocations? > > I thought GFP_NOFAIL couldn't fail at all... > > Yeah, the suggestion was that GFP_KERNEL would act as GFP_NOFAIL but only > for non-costly allocations. Anything marked GFP_NOFAIL would still be fully > nofail. GFP_NOFAIL should still fail for allocations larger than KMALLOC_MAX_SIZE. Or should we interpret that as "die now"? Or "go into an unkillable sleep"? If the caller really has taken the opportunity to remove their error handling path, returning NULL will lead to a crash and a lot of beard stroking trying to understand why a GFP_NOFAIL allocation has returned NULL. May as well BUG_ON(size > KMALLOC_MAX_SIZE) and give the developer a clear indication of what they did wrong. ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-03-20 18:55 ` Matthew Wilcox @ 2024-03-20 19:07 ` Kent Overstreet 2024-03-20 19:14 ` Matthew Wilcox 0 siblings, 1 reply; 52+ messages in thread From: Kent Overstreet @ 2024-03-20 19:07 UTC (permalink / raw) To: Matthew Wilcox Cc: Vlastimil Babka, Dan Carpenter, NeilBrown, Dave Chinner, Amir Goldstein, paulmck, lsf-pc, linux-mm, linux-fsdevel, Jan Kara On Wed, Mar 20, 2024 at 06:55:36PM +0000, Matthew Wilcox wrote: > On Wed, Mar 20, 2024 at 07:48:01PM +0100, Vlastimil Babka wrote: > > On 3/20/24 19:32, Dan Carpenter wrote: > > > On Tue, Mar 12, 2024 at 03:46:32PM +0100, Vlastimil Babka wrote: > > >> But if we change it to effectively mean GFP_NOFAIL (for non-costly > > >> allocations), there should be a manageable number of places to change to a > > >> variant that allows failure. > > > > > > What does that even mean if GFP_NOFAIL can fail for "costly" allocations? > > > I thought GFP_NOFAIL couldn't fail at all... > > > > Yeah, the suggestion was that GFP_KERNEL would act as GFP_NOFAIL but only > > for non-costly allocations. Anything marked GFP_NOFAIL would still be fully > > nofail. > > GFP_NOFAIL should still fail for allocations larger than KMALLOC_MAX_SIZE. > Or should we interpret that as "die now"? Or "go into an unkillable > sleep"? If the caller really has taken the opportunity to remove their > error handling path, returning NULL will lead to a crash and a lot of > beard stroking trying to understand why a GFP_NOFAIL allocation has > returned NULL. May as well BUG_ON(size > KMALLOC_MAX_SIZE) and give > the developer a clear indication of what they did wrong. Why do we even need KMALLOC_MAX_SIZE...? Given that kmalloc internally switches to the page allocator when needed, I would think that that's something we can do away with. ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-03-20 19:07 ` Kent Overstreet @ 2024-03-20 19:14 ` Matthew Wilcox 2024-03-20 19:33 ` Kent Overstreet 0 siblings, 1 reply; 52+ messages in thread From: Matthew Wilcox @ 2024-03-20 19:14 UTC (permalink / raw) To: Kent Overstreet Cc: Vlastimil Babka, Dan Carpenter, NeilBrown, Dave Chinner, Amir Goldstein, paulmck, lsf-pc, linux-mm, linux-fsdevel, Jan Kara On Wed, Mar 20, 2024 at 03:07:38PM -0400, Kent Overstreet wrote: > On Wed, Mar 20, 2024 at 06:55:36PM +0000, Matthew Wilcox wrote: > > GFP_NOFAIL should still fail for allocations larger than KMALLOC_MAX_SIZE. > > Or should we interpret that as "die now"? Or "go into an unkillable > > sleep"? If the caller really has taken the opportunity to remove their > > error handling path, returning NULL will lead to a crash and a lot of > > beard stroking trying to understand why a GFP_NOFAIL allocation has > > returned NULL. May as well BUG_ON(size > KMALLOC_MAX_SIZE) and give > > the developer a clear indication of what they did wrong. > > Why do we even need KMALLOC_MAX_SIZE...? > > Given that kmalloc internally switches to the page allocator when > needed, I would think that that's something we can do away with. ... maybe check what I said before replying? /* * SLUB directly allocates requests fitting in to an order-1 page * (PAGE_SIZE*2). Larger requests are passed to the page allocator. */ #define KMALLOC_SHIFT_MAX (MAX_PAGE_ORDER + PAGE_SHIFT) You can't allocate larger than that without going to CMA or some other custom allocator. So. The caller has requested NOFAIL, and requested a size larger than we can allocate. Do we panic, go into an uninterruptible sleep, or return NULL? ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-03-20 19:14 ` Matthew Wilcox @ 2024-03-20 19:33 ` Kent Overstreet 0 siblings, 0 replies; 52+ messages in thread From: Kent Overstreet @ 2024-03-20 19:33 UTC (permalink / raw) To: Matthew Wilcox Cc: Vlastimil Babka, Dan Carpenter, NeilBrown, Dave Chinner, Amir Goldstein, paulmck, lsf-pc, linux-mm, linux-fsdevel, Jan Kara On Wed, Mar 20, 2024 at 07:14:35PM +0000, Matthew Wilcox wrote: > On Wed, Mar 20, 2024 at 03:07:38PM -0400, Kent Overstreet wrote: > > On Wed, Mar 20, 2024 at 06:55:36PM +0000, Matthew Wilcox wrote: > > > GFP_NOFAIL should still fail for allocations larger than KMALLOC_MAX_SIZE. > > > Or should we interpret that as "die now"? Or "go into an unkillable > > > sleep"? If the caller really has taken the opportunity to remove their > > > error handling path, returning NULL will lead to a crash and a lot of > > > beard stroking trying to understand why a GFP_NOFAIL allocation has > > > returned NULL. May as well BUG_ON(size > KMALLOC_MAX_SIZE) and give > > > the developer a clear indication of what they did wrong. > > > > Why do we even need KMALLOC_MAX_SIZE...? > > > > Given that kmalloc internally switches to the page allocator when > > needed, I would think that that's something we can do away with. > > ... maybe check what I said before replying? > > /* > * SLUB directly allocates requests fitting in to an order-1 page > * (PAGE_SIZE*2). Larger requests are passed to the page allocator. > */ > #define KMALLOC_SHIFT_MAX (MAX_PAGE_ORDER + PAGE_SHIFT) > > You caan't allocate larger than that without going to CMA or some other > custom allocator. Ahh, my recollection was out of date - I believe that was 128k at one time? ^ permalink raw reply [flat|nested] 52+ messages in thread
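For reference, slab.h derives the byte limit from that shift, so assuming the usual 4 KiB pages and MAX_PAGE_ORDER of 10 the cap works out to 4 MiB rather than 128k:

    #define KMALLOC_MAX_SIZE    (1UL << KMALLOC_SHIFT_MAX)
    /* = 1UL << (10 + 12) = 4 MiB with 4 KiB pages and MAX_PAGE_ORDER 10 */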
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-03-20 18:32 ` Dan Carpenter 2024-03-20 18:48 ` Vlastimil Babka @ 2024-03-20 19:09 ` Kent Overstreet 1 sibling, 0 replies; 52+ messages in thread From: Kent Overstreet @ 2024-03-20 19:09 UTC (permalink / raw) To: Dan Carpenter Cc: Vlastimil Babka, NeilBrown, Dave Chinner, Matthew Wilcox, Amir Goldstein, paulmck, lsf-pc, linux-mm, linux-fsdevel, Jan Kara On Wed, Mar 20, 2024 at 09:32:52PM +0300, Dan Carpenter wrote: > On Tue, Mar 12, 2024 at 03:46:32PM +0100, Vlastimil Babka wrote: > > But if we change it to effectively mean GFP_NOFAIL (for non-costly > > allocations), there should be a manageable number of places to change to a > > variant that allows failure. > > What does that even mean if GFP_NOFAIL can fail for "costly" allocations? > I thought GFP_NOFAIL couldn't fail at all... > > Unfortunately, it's common that when we can't decide on a sane limit for > something people just say "let the user decide based on how much memory > they have". I have added some integer overflow checks which allow the > user to allocate up to UINT_MAX bytes so I know this code is out > there. We can't just s/GFP_KERNEL/GFP_NOFAIL/. > > From a static analysis perspective it would be nice if the callers > explicitly marked which allocations can fail and which can't. GFP_NOFAIL throws a warning if the allocation size is > 2 pages, which is a separate issue from whether the allocation becomes fallible - someone would have to - oh, I don't know, read the code to answer that question. I think we can ditch the 2 page limit on GFP_NOFAIL, though. ^ permalink raw reply [flat|nested] 52+ messages in thread
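The warning being referred to is, roughly, this check in mm/page_alloc.c's rmqueue():

    /*
     * We most definitely don't want callers attempting to
     * allocate greater than order-1 page units with __GFP_NOFAIL.
     */
    WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));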
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-03-03 22:45 ` NeilBrown ` (3 preceding siblings ...) 2024-03-12 14:46 ` Vlastimil Babka @ 2024-03-21 6:27 ` Dan Carpenter 2024-03-22 1:47 ` NeilBrown 4 siblings, 1 reply; 52+ messages in thread From: Dan Carpenter @ 2024-03-21 6:27 UTC (permalink / raw) To: NeilBrown Cc: Kent Overstreet, Dave Chinner, Matthew Wilcox, Amir Goldstein, paulmck, lsf-pc, linux-mm, linux-fsdevel, Jan Kara On Mon, Mar 04, 2024 at 09:45:48AM +1100, NeilBrown wrote: > I have in mind a more explicit statement of how much waiting is > acceptable. > > GFP_NOFAIL - wait indefinitely Why not call it GFP_SMALL? It wouldn't fail. The size would have to be less than some limit. If the size was too large, that would trigger a WARN_ON_ONCE(). I obviously understand that this duplicates the information in the size parameter but the point is that GFP_SMALL allocations have been reviewed, updated, and don't have error handling code. We'd keep GFP_KERNEL which would keep the existing behavior. (Which is that it can sleep and it can fail). I think that maps to GFP_RETRY but GFP_RETRY is an uglier name. People could still use __GFP_NOFAIL for larger allocations. regards, dan carpenter ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-03-21 6:27 ` Dan Carpenter @ 2024-03-22 1:47 ` NeilBrown 2024-03-22 6:13 ` Dan Carpenter 0 siblings, 1 reply; 52+ messages in thread From: NeilBrown @ 2024-03-22 1:47 UTC (permalink / raw) To: Dan Carpenter Cc: Kent Overstreet, Dave Chinner, Matthew Wilcox, Amir Goldstein, paulmck, lsf-pc, linux-mm, linux-fsdevel, Jan Kara On Thu, 21 Mar 2024, Dan Carpenter wrote: > On Mon, Mar 04, 2024 at 09:45:48AM +1100, NeilBrown wrote: > > I have in mind a more explicit statement of how much waiting is > > acceptable. > > > > GFP_NOFAIL - wait indefinitely > > Why not call it GFP_SMALL? It wouldn't fail. The size would have to be > less than some limit. If the size was too large, that would trigger a > WARN_ON_ONCE(). I would be happy with GFP_SMALL. It would never return NULL but might block indefinitely. It would (as you say) WARN (maybe ONCE) if the size was considered "COSTLY" and would possibly BUG if the size exceeded KMALLOC_MAX_SIZE. > > I obviously understand that this duplicates the information in the size > parameter but the point is that GFP_SMALL allocations have been > reviewed, updated, and don't have error handling code. We are on the same page here. > > We'd keep GFP_KERNEL which would keep the existing behavior. (Which is > that it can sleep and it can fail). I think that maps to GFP_RETRY but > GFP_RETRY is an uglier name. Can it fail though? I know it is allowed to, but does it happen? I think I would prefer that the normal approach for allocations that are (or might be) too large for GFP_SMALL be to *not* risk triggering oom. So I'd rather we didn't have GFP_KERNEL (which I think does have that risk) and instead (maybe) had GFP_LARGE, which sleeps but will return NULL when an OOM condition is looming. So we only get OOM for GFP_SMALL allocations (because not failing simplifies the code), and where it is explicitly requested - like for user-space allocations and probably for file cache allocations. But I think that keeping it simple and only adding GFP_SMALL as we have discussed would be a big step in the right direction and we don't need to complicate it with other ideas. Thanks, NeilBrown > > People could still use __GFP_NOFAIL for larger allocations. > > regards, > dan carpenter > > ^ permalink raw reply [flat|nested] 52+ messages in thread
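A sketch of those semantics as a helper, purely to make the proposal concrete (GFP_SMALL does not exist; this only composes existing flags with the existing costly-order threshold):

    /* hypothetical: "small, may block indefinitely, never returns NULL" */
    static inline void *kmalloc_small(size_t size)
    {
            WARN_ON_ONCE(size > (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER));
            BUG_ON(size > KMALLOC_MAX_SIZE);
            return kmalloc(size, GFP_KERNEL | __GFP_NOFAIL);
    }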
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-03-22 1:47 ` NeilBrown @ 2024-03-22 6:13 ` Dan Carpenter 2024-03-24 22:31 ` NeilBrown 0 siblings, 1 reply; 52+ messages in thread From: Dan Carpenter @ 2024-03-22 6:13 UTC (permalink / raw) To: NeilBrown Cc: Kent Overstreet, Dave Chinner, Matthew Wilcox, Amir Goldstein, paulmck, lsf-pc, linux-mm, linux-fsdevel, Jan Kara, Jiri Pirko On Fri, Mar 22, 2024 at 12:47:42PM +1100, NeilBrown wrote: > On Thu, 21 Mar 2024, Dan Carpenter wrote: > > On Mon, Mar 04, 2024 at 09:45:48AM +1100, NeilBrown wrote: > > > I have in mind a more explicit statement of how much waiting is > > > acceptable. > > > > > > GFP_NOFAIL - wait indefinitely > > > > Why not call it GFP_SMALL? It wouldn't fail. The size would have to be > > less than some limit. If the size was too large, that would trigger a > > WARN_ON_ONCE(). > > I would be happy with GFP_SMALL. It would never return NULL but might > block indefinitely. It would (as you say) WARN (maybe ONCE) if the size > was considered "COSTLY" and would possibly BUG if the size exceeded > KMALLOC_MAX_SIZE. I'd like to keep GFP_SMALL much smaller than KMALLOC_MAX_SIZE. If you're allocating larger than that, you'd still be able to GFP_NOFAIL. I looked quickly and I think over 60% of allocations are just sizeof(*p) and probably 90% are under 4k. > > > > > I obviously understand that this duplicates the information in the size > > parameter but the point is that GFP_SMALL allocations have been > > reviewed, updated, and don't have error handling code. > > We are on the same page here. > > > > > We'd keep GFP_KERNEL which would keep the existing behavior. (Which is > > that it can sleep and it can fail). I think that maps to GFP_RETRY but > > GFP_RETRY is an uglier name. > > Can it fail though? I know it is allowed to, but does it happen? > In some sense, I don't really care about this, I just want the rules clear from a static analysis perspective. Btw, you're discussing making the too small to fail rule official but other times we have discussed getting rid of it altogether. So I think maybe it's better to keep the rules strict but allow the actual implementation to change later. The GFP_SMALL stuff is nice for static analysis because it would warn about anything larger than whatever the small limit is. So that means I have fewer allocations to review for integer overflow bugs. Btw, Jiri Pirko was proposing a macro which would automatically allocate the 60+% of allocations which are sizeof(*p). https://lore.kernel.org/all/20240315132249.2515468-1-jiri@resnulli.us/ I had offered an alternative macro but the idea is the same: #define __ALLOC(p) p __free(kfree) = kzalloc(sizeof(*p), GFP_KERNEL) struct ice_aqc_get_phy_caps_data *__ALLOC(pcaps); Combining no-fail allocations with automatic cleanup changes the way you write code. And then on the success path you have the no_free_ptr() which I haven't used but I think will be so useful for static analysis. I'm so excited about this. regards, dan carpenter ^ permalink raw reply [flat|nested] 52+ messages in thread
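For illustration, how the cleanup-based pattern hands the object out on the success path (struct foo and foo_init() are made up; __free() and no_free_ptr() are the existing linux/cleanup.h helpers):

    struct foo *foo_alloc(void)
    {
            struct foo *f __free(kfree) = kzalloc(sizeof(*f), GFP_KERNEL);

            if (!f)
                    return NULL;
            if (foo_init(f))
                    return NULL;            /* f is freed automatically here */
            return no_free_ptr(f);          /* ownership passes to the caller */
    }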
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-03-22 6:13 ` Dan Carpenter @ 2024-03-24 22:31 ` NeilBrown 2024-03-25 8:43 ` Dan Carpenter 0 siblings, 1 reply; 52+ messages in thread From: NeilBrown @ 2024-03-24 22:31 UTC (permalink / raw) To: Dan Carpenter Cc: Kent Overstreet, Dave Chinner, Matthew Wilcox, Amir Goldstein, paulmck, lsf-pc, linux-mm, linux-fsdevel, Jan Kara, Jiri Pirko On Fri, 22 Mar 2024, Dan Carpenter wrote: > On Fri, Mar 22, 2024 at 12:47:42PM +1100, NeilBrown wrote: > > On Thu, 21 Mar 2024, Dan Carpenter wrote: > > > On Mon, Mar 04, 2024 at 09:45:48AM +1100, NeilBrown wrote: > > > > I have in mind a more explicit statement of how much waiting is > > > > acceptable. > > > > > > > > GFP_NOFAIL - wait indefinitely > > > > > > Why not call it GFP_SMALL? It wouldn't fail. The size would have to be > > > less than some limit. If the size was too large, that would trigger a > > > WARN_ON_ONCE(). > > > > I would be happy with GFP_SMALL. It would never return NULL but might > > block indefinitely. It would (as you say) WARN (maybe ONCE) if the size > > was considered "COSTLY" and would possibly BUG if the size exceeded > > KMALLOC_MAX_SIZE. > > I'd like to keep GFP_SMALL much smaller than KMALLOC_MAX_SIZE. IIf > you're allocating larger than that, you'd still be able to GFP_NOFAIL. > I looked quickly an I think over 60% of allocations are just sizeof(*p) > and probably 90% are under 4k. What do you mean exactly by "keep"?? Do you mean WARN_ON if it is "too big" - certainly agree. Do you mean BUG_ON if it is "too big" - maybe agree. Do you mean return NULL if it is "too big" - definitely disagree. Do you mean build failure if it could be too big - I would LOVE that, but I don't think we can do that with current build tools. Thanks, NeilBrown > > > > > > > > > I obviously understand that this duplicates the information in the size > > > parameter but the point is that GFP_SMALL allocations have been > > > reviewed, updated, and don't have error handling code. > > > > We are on the same page here. > > > > > > > > We'd keep GFP_KERNEL which would keep the existing behavior. (Which is > > > that it can sleep and it can fail). I think that maps to GFP_RETRY but > > > GFP_RETRY is an uglier name. > > > > Can it fail though? I know it is allowed to, but does it happen? > > > > In some sense, I don't really care about this, I just want the rules > clear from a static analysis perspective. Btw, you're discussing making > the too small to fail rule official but other times we have discussed > getting rid of it all together. So I think maybe it's better to keep > the rules strict but allow the actual implentation to change later. > > The GFP_SMALL stuff is nice for static analysis because it would warn > about anything larger than whatever the small limit is. So that means I > have fewer allocations to review for integer overflow bugs. > > Btw, Jiri Pirko, was proposing a macro which would automatically > allocate the 60+% of allocations which are sizeof(*p). > https://lore.kernel.org/all/20240315132249.2515468-1-jiri@resnulli.us/ > I had offered an alternative macro but the idea is the same: > > #define __ALLOC(p) p __free(kfree) = kzalloc(sizeof(*p), GFP_KERNEL) > struct ice_aqc_get_phy_caps_data *__ALLOC(pcaps); > > Combining no fail allocations with automatic cleanup changes the way you > write code. > > And then on the success patch you have the no_free_ptr() which I haven't > used but I think will be so useful for static analysis. I'm so excited > about this. 
> > regards, > dan carpenter > > > ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-03-24 22:31 ` NeilBrown @ 2024-03-25 8:43 ` Dan Carpenter 0 siblings, 0 replies; 52+ messages in thread From: Dan Carpenter @ 2024-03-25 8:43 UTC (permalink / raw) To: NeilBrown Cc: Kent Overstreet, Dave Chinner, Matthew Wilcox, Amir Goldstein, paulmck, lsf-pc, linux-mm, linux-fsdevel, Jan Kara, Jiri Pirko On Mon, Mar 25, 2024 at 09:31:53AM +1100, NeilBrown wrote: > On Fri, 22 Mar 2024, Dan Carpenter wrote: > > On Fri, Mar 22, 2024 at 12:47:42PM +1100, NeilBrown wrote: > > > On Thu, 21 Mar 2024, Dan Carpenter wrote: > > > > On Mon, Mar 04, 2024 at 09:45:48AM +1100, NeilBrown wrote: > > > > > I have in mind a more explicit statement of how much waiting is > > > > > acceptable. > > > > > > > > > > GFP_NOFAIL - wait indefinitely > > > > > > > > Why not call it GFP_SMALL? It wouldn't fail. The size would have to be > > > > less than some limit. If the size was too large, that would trigger a > > > > WARN_ON_ONCE(). > > > > > > I would be happy with GFP_SMALL. It would never return NULL but might > > > block indefinitely. It would (as you say) WARN (maybe ONCE) if the size > > > was considered "COSTLY" and would possibly BUG if the size exceeded > > > KMALLOC_MAX_SIZE. > > > > I'd like to keep GFP_SMALL much smaller than KMALLOC_MAX_SIZE. IIf > > you're allocating larger than that, you'd still be able to GFP_NOFAIL. > > I looked quickly an I think over 60% of allocations are just sizeof(*p) > > and probably 90% are under 4k. > > What do you mean exactly by "keep"?? Poor word choice... > Do you mean WARN_ON if it is "too big" - certainly agree. > Do you mean BUG_ON if it is "too big" - maybe agree. WARN_ON_ONCE(). But a lot of people have reboot on Oops enabled so triggering a WARN_ON() is still a very serious bug. > Do you mean return NULL if it is "too big" - definitely disagree. Yeah. It's going to be a style violation to check a GFP_SMALL allocation for NULL so it needs to have GFP_NOFAIL behavior. It can still fail if it's larger than KMALLOC_MAX_SIZE. > Do you mean build failure if it could be too big - I would LOVE that, > but I don't think we can do that with current build tools. The limit is going to be a constant so using static analysis to check that is easier than checking if we're less than some variable. regards, dan carpenter ^ permalink raw reply [flat|nested] 52+ messages in thread