* [LSF/MM/BPF TOPIC] Reclamation interactions with RCU @ 2024-02-27 18:56 Paul E. McKenney 2024-02-27 19:19 ` [Lsf-pc] " Amir Goldstein 0 siblings, 1 reply; 52+ messages in thread From: Paul E. McKenney @ 2024-02-27 18:56 UTC (permalink / raw) To: lsf-pc; +Cc: linux-mm Hello! Recent discussions [1] suggest that greater mutual understanding between memory reclaim on the one hand and RCU on the other might be in order. One possibility would be an open discussion. If it would help, I would be happy to describe how RCU reacts and responds to heavy load, along with some ways that RCU's reactions and responses could be enhanced if needed. Thanx, Paul [1] https://lore.kernel.org/all/fb4d944e-fde7-423b-a376-25db0b317398@paulmck-laptop/ ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-02-27 18:56 [LSF/MM/BPF TOPIC] Reclamation interactions with RCU Paul E. McKenney @ 2024-02-27 19:19 ` Amir Goldstein 2024-02-27 22:59 ` Paul E. McKenney 2024-02-28 19:37 ` Matthew Wilcox 0 siblings, 2 replies; 52+ messages in thread From: Amir Goldstein @ 2024-02-27 19:19 UTC (permalink / raw) To: paulmck Cc: lsf-pc, linux-mm, linux-fsdevel, Matthew Wilcox, Kent Overstreet, Jan Kara On Tue, Feb 27, 2024 at 8:56 PM Paul E. McKenney <paulmck@kernel.org> wrote: > > Hello! > > Recent discussions [1] suggest that greater mutual understanding between > memory reclaim on the one hand and RCU on the other might be in order. > > One possibility would be an open discussion. If it would help, I would > be happy to describe how RCU reacts and responds to heavy load, along with > some ways that RCU's reactions and responses could be enhanced if needed. > Adding fsdevel as this should probably be a cross track session. Thanks, Amir. ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-02-27 19:19 ` [Lsf-pc] " Amir Goldstein @ 2024-02-27 22:59 ` Paul E. McKenney 2024-03-01 3:28 ` Kent Overstreet 2024-02-28 19:37 ` Matthew Wilcox 1 sibling, 1 reply; 52+ messages in thread From: Paul E. McKenney @ 2024-02-27 22:59 UTC (permalink / raw) To: Amir Goldstein Cc: lsf-pc, linux-mm, linux-fsdevel, Matthew Wilcox, Kent Overstreet, Jan Kara On Tue, Feb 27, 2024 at 09:19:47PM +0200, Amir Goldstein wrote: > On Tue, Feb 27, 2024 at 8:56 PM Paul E. McKenney <paulmck@kernel.org> wrote: > > > > Hello! > > > > Recent discussions [1] suggest that greater mutual understanding between > > memory reclaim on the one hand and RCU on the other might be in order. > > > > One possibility would be an open discussion. If it would help, I would > > be happy to describe how RCU reacts and responds to heavy load, along with > > some ways that RCU's reactions and responses could be enhanced if needed. > > Adding fsdevel as this should probably be a cross track session. The more, the merrier! ;-) Thanx, Paul ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-02-27 22:59 ` Paul E. McKenney @ 2024-03-01 3:28 ` Kent Overstreet 2024-03-05 2:43 ` Paul E. McKenney 2024-03-05 2:56 ` Yosry Ahmed 0 siblings, 2 replies; 52+ messages in thread From: Kent Overstreet @ 2024-03-01 3:28 UTC (permalink / raw) To: Paul E. McKenney Cc: Amir Goldstein, lsf-pc, linux-mm, linux-fsdevel, Matthew Wilcox, Jan Kara On Tue, Feb 27, 2024 at 02:59:33PM -0800, Paul E. McKenney wrote: > On Tue, Feb 27, 2024 at 09:19:47PM +0200, Amir Goldstein wrote: > > On Tue, Feb 27, 2024 at 8:56 PM Paul E. McKenney <paulmck@kernel.org> wrote: > > > > > > Hello! > > > > > > Recent discussions [1] suggest that greater mutual understanding between > > > memory reclaim on the one hand and RCU on the other might be in order. > > > > > > One possibility would be an open discussion. If it would help, I would > > > be happy to describe how RCU reacts and responds to heavy load, along with > > > some ways that RCU's reactions and responses could be enhanced if needed. > > > > Adding fsdevel as this should probably be a cross track session. > > The more, the merrier! ;-) Starting to look at the practicalities of freeing pages via rcu. Where to stick the rcu_head might be a problem. I think by the time we're in free_unref_page() enough of struct page isn't in use that we can just... throw it in the struct page union somewhere (ok, actually there's one already for ZONE_DEVICE pages; we just have to make sure it doesn't collide with anything - probably just compound page metadata). But it'd be more efficient if we weren't using linked lists for this. It's premature to ask for this before I've tested or profiled anything, but I just wanted to throw it out there that we might want to look at using a deque for pending frees (and perhaps other things?). Willy and I were just chatting about this and I think the conclusion we're coming to is that we'll probably want to try just RCU freeing _everything_ at first, the reason being that anonymous pages are probably going to get included and that may end up being most of the allocations and frees - so why not just do everything. Meaning - _lots_ of memory cycling through RCU, not bounded by device bandwidth like I initially thought. Another relevant topic: https://kernelnewbies.org/MatthewWilcox/Memdescs The plan has been to eventually get an actual enum/type tag for our different page types, and that would be really helpful for system visibility/introspection in general, and in particular for this project where we're going to want to be able to put numbers on different things for performance analysis and then later probably excluding things from RCU freeing in a sane standard way (slab, at least). I wonder if anyone wants to solve this. The problem with doing it now is that we've got no more page flags on 32 bit, but it'd be _very_ helpful in future cleanups if we could figure out a way to make it happen. That'd help with another wrinkle, which is non-compound higher-order pages. Those are tricky to handle, because how do we know the order from the RCU callback? If we just don't RCU free them, that's sketchy without the page type enum because we have no way to have a master definition/list of "these types of pages are RCU freed, these aren't". Possibly we can just fake-compound them - store the order like it is in compound pages but don't initialize every page like happens for compound pages. ^ permalink raw reply [flat|nested] 52+ messages in thread
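Kent's "throw it in the struct page union" idea maps onto the rcu_head member that struct page already carries in its union today. A minimal sketch of the order-0 case, assuming that union slot really is unused by the time the page reaches the free path; free_page_rcu() and page_free_rcu_cb() are hypothetical names, not anything in the tree:

#include <linux/mm.h>
#include <linux/rcupdate.h>

/* RCU callback: a grace period has elapsed, really free the page. */
static void page_free_rcu_cb(struct rcu_head *head)
{
        struct page *page = container_of(head, struct page, rcu_head);

        __free_page(page);      /* order-0 only; higher orders are the open problem */
}

/* Defer freeing of an order-0 page until after a grace period. */
static void free_page_rcu(struct page *page)
{
        call_rcu(&page->rcu_head, page_free_rcu_cb);
}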
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-03-01 3:28 ` Kent Overstreet @ 2024-03-05 2:43 ` Paul E. McKenney 2024-03-05 2:56 ` Yosry Ahmed 1 sibling, 0 replies; 52+ messages in thread From: Paul E. McKenney @ 2024-03-05 2:43 UTC (permalink / raw) To: Kent Overstreet Cc: Amir Goldstein, lsf-pc, linux-mm, linux-fsdevel, Matthew Wilcox, Jan Kara On Thu, Feb 29, 2024 at 10:28:33PM -0500, Kent Overstreet wrote: > On Tue, Feb 27, 2024 at 02:59:33PM -0800, Paul E. McKenney wrote: > > On Tue, Feb 27, 2024 at 09:19:47PM +0200, Amir Goldstein wrote: > > > On Tue, Feb 27, 2024 at 8:56 PM Paul E. McKenney <paulmck@kernel.org> wrote: > > > > > > > > Hello! > > > > > > > > Recent discussions [1] suggest that greater mutual understanding between > > > > memory reclaim on the one hand and RCU on the other might be in order. > > > > > > > > One possibility would be an open discussion. If it would help, I would > > > > be happy to describe how RCU reacts and responds to heavy load, along with > > > > some ways that RCU's reactions and responses could be enhanced if needed. > > > > > > Adding fsdevel as this should probably be a cross track session. > > > > The more, the merrier! ;-) > > Starting to look at the practicalities of freeing pages via rcu. > > Where to stick the rcu_head might be a problem. I think by the time > we're in free_unref_page() enough of struct page isn't in use that we > can just... throw it in the struct page union somewhere (ok, actually > there's one already for ZONE_DEVICE pages; we just have to make sure it > doesn't collide with anything - probably just compound page metadata). > > But it'd be more efficient if we weren't using linked lists for this. > It's premature to ask for this before I've tested or profiled anything, > but I just wanted to throw it out there that we might want to look using > a dequeue for pending frees (and perhaps other things?). You could certainly use pages of pointers like kfree_rcu() does, at least when there is ample memory. You could go further and allocate those pages as the structures are being allocated in order to guarantee that you never run short. The accounting is likely to be a bit tricky, but definitely doable. You could also use polled grace periods: o start_poll_synchronize_rcu_full() to get a "cookie". If you already have a batch for that cookie, add the new block to that batch. The cookie is then associated with the batch, not the individual block. The batch could be tied together by a non-linked-list data structure, if desired, and need not have any storage overhead at all within the blocks themselves. o poll_state_synchronize_rcu_full() to check if a batch's cookie corresponds to a completed grace period. Free up the batch if so. o cond_synchronize_rcu_full() to block until the specified cookie's grace period ends. (I suspect you won't want to do this, but just in case.) o cond_synchronize_rcu_expedited_full() is as above, but it uses an expedited grace period to do the waiting. o NUM_ACTIVE_RCU_POLL_FULL_OLDSTATE gives the maximum number of cookies that can correspond to a grace period that has not yet completed. This is a numerical constant that can be used to size an array. If things are piling up, you can call synchronize_rcu_expedited(), and that will cause all cookies collected before that call to suddenly correspond to completed grace periods. Where "suddenly" is give or take long RCU readers elsewhere in the kernel. Does this help? 
> Willy and I were just chatting about this and I think the conclusion > we're coming to is that we'll probably want to try just RCU freeing > _everything_ at first, the reason being that anonymous pages are > probably going to get included and that may end up being most of the > allocations and frees - so why not just do everything. > > Meaning - _lots _of memory cycling through RCU, not bounded by device > bandwidth like I initially thought. There already is the possibility of users doing close(open(...)) in a tight loop, which is why rcutorture does callback flooding. But each new use case brings a parcel of risk, so I do not claim that either of these mean that further adjustments, up to and including inventing a new synchronization primitive, won't be needed. > Another relevant topic: https://kernelnewbies.org/MatthewWilcox/Memdescs Not sure how this relates, but good to know. > The plan has been to eventually get an actual enum/type tag for our > different page types, and that would be really helpful for system > visibility/introspection in general, and in particular for this project > where we're going to want to be able to put numbers on different things > for performance analysis and then later probably excluding things from > RCU freeing in a sane standard way (slab, at least). > > I wonder if anyone wants to solve this. The problem with doing it now is > that we've got no more page flags on 32 bit, but it'd be _very_ helpful > in future cleanups if we could figure out a way to make it hapen. > > That'd help with another wrinkle, which is non compound higher order > pages. Those are tricky to handle, because how do we know the order from > the RCU callback? If we just don't RCU free them, that's sketchy without > the page type enum because we have no way to have a master > definition/list of "these types of pages are RCU freed, these aren't". If I understand correctly, this might be another reason for using pages of pointers rather than rcu_head structures. > Possibly we can just fake-compound them - store the order like it is in > compound pages but don't initialize every page like happens for compound > pages. I must defer to you guys on this one. Thanx, Paul ^ permalink raw reply [flat|nested] 52+ messages in thread
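To make the batching scheme Paul describes above concrete, here is a hypothetical, single-threaded sketch built on the polled grace-period API he lists. struct pending_batch and defer_free_page() are invented names; pages are linked through page->lru on the assumption that the field is free at this point; locking, slot initialisation (INIT_LIST_HEAD() plus get_completed_synchronize_rcu_full()), and the expedited fallback are all omitted.

#include <linux/list.h>
#include <linux/mm.h>
#include <linux/rcupdate.h>

struct pending_batch {
        struct rcu_gp_oldstate  gp;     /* cookie shared by every page in the batch */
        struct list_head        pages;
};

/* At most this many distinct not-yet-completed cookies can exist at once. */
static struct pending_batch batches[NUM_ACTIVE_RCU_POLL_FULL_OLDSTATE];

static void defer_free_page(struct page *page)
{
        struct rcu_gp_oldstate gp;
        struct page *p, *n;
        int i;

        start_poll_synchronize_rcu_full(&gp);

        /* Add to an existing batch with the same cookie, if there is one. */
        for (i = 0; i < ARRAY_SIZE(batches); i++) {
                if (same_state_synchronize_rcu_full(&batches[i].gp, &gp)) {
                        list_add(&page->lru, &batches[i].pages);
                        return;
                }
        }

        /* Otherwise recycle a batch whose grace period has already ended. */
        for (i = 0; i < ARRAY_SIZE(batches); i++) {
                if (!poll_state_synchronize_rcu_full(&batches[i].gp))
                        continue;
                list_for_each_entry_safe(p, n, &batches[i].pages, lru)
                        __free_page(p);
                INIT_LIST_HEAD(&batches[i].pages);
                batches[i].gp = gp;
                list_add(&page->lru, &batches[i].pages);
                return;
        }

        /* All slots still active: synchronize_rcu_expedited() would retire them. */
}

No rcu_head or per-page storage is consumed here, which is Paul's point: the cookie belongs to the batch, not to the individual page.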
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-03-01 3:28 ` Kent Overstreet 2024-03-05 2:43 ` Paul E. McKenney @ 2024-03-05 2:56 ` Yosry Ahmed 1 sibling, 0 replies; 52+ messages in thread From: Yosry Ahmed @ 2024-03-05 2:56 UTC (permalink / raw) To: Kent Overstreet Cc: Paul E. McKenney, Amir Goldstein, lsf-pc, linux-mm, linux-fsdevel, Matthew Wilcox, Jan Kara [..] > Willy and I were just chatting about this and I think the conclusion > we're coming to is that we'll probably want to try just RCU freeing > _everything_ at first, the reason being that anonymous pages are > probably going to get included and that may end up being most of the > allocations and frees - so why not just do everything. > > Meaning - _lots _of memory cycling through RCU, not bounded by device > bandwidth like I initially thought. > > Another relevant topic: https://kernelnewbies.org/MatthewWilcox/Memdescs > > The plan has been to eventually get an actual enum/type tag for our > different page types, and that would be really helpful for system > visibility/introspection in general, and in particular for this project > where we're going to want to be able to put numbers on different things > for performance analysis and then later probably excluding things from > RCU freeing in a sane standard way (slab, at least). > > I wonder if anyone wants to solve this. The problem with doing it now is > that we've got no more page flags on 32 bit, but it'd be _very_ helpful > in future cleanups if we could figure out a way to make it hapen. > > That'd help with another wrinkle, which is non compound higher order > pages. Those are tricky to handle, because how do we know the order from > the RCU callback? If we just don't RCU free them, that's sketchy without > the page type enum because we have no way to have a master > definition/list of "these types of pages are RCU freed, these aren't". > > Possibly we can just fake-compound them - store the order like it is in > compound pages but don't initialize every page like happens for compound > pages. (Not following the thread closely, sorry if this is completely meaningless) Why not store the metadata we need in the page itself, given that it is being freed anyway? ^ permalink raw reply [flat|nested] 52+ messages in thread
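One concrete reading of Yosry's suggestion, extending the order-0 sketch earlier in the thread: stash the order in the page being freed, for example in page->private, which in current struct page layouts does not overlap the rcu_head union member. That non-overlap is an assumption worth re-checking, and free_pages_rcu() is again a hypothetical helper:

#include <linux/mm.h>
#include <linux/rcupdate.h>

static void pages_free_rcu_cb(struct rcu_head *head)
{
        struct page *page = container_of(head, struct page, rcu_head);

        /* Recover the order stored by free_pages_rcu() below. */
        __free_pages(page, page_private(page));
}

static void free_pages_rcu(struct page *page, unsigned int order)
{
        set_page_private(page, order);  /* page contents are dead; reuse the field */
        call_rcu(&page->rcu_head, pages_free_rcu_cb);
}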
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-02-27 19:19 ` [Lsf-pc] " Amir Goldstein 2024-02-27 22:59 ` Paul E. McKenney @ 2024-02-28 19:37 ` Matthew Wilcox 2024-02-29 1:29 ` Dave Chinner ` (2 more replies) 1 sibling, 3 replies; 52+ messages in thread From: Matthew Wilcox @ 2024-02-28 19:37 UTC (permalink / raw) To: Amir Goldstein Cc: paulmck, lsf-pc, linux-mm, linux-fsdevel, Kent Overstreet, Jan Kara On Tue, Feb 27, 2024 at 09:19:47PM +0200, Amir Goldstein wrote: > On Tue, Feb 27, 2024 at 8:56 PM Paul E. McKenney <paulmck@kernel.org> wrote: > > > > Hello! > > > > Recent discussions [1] suggest that greater mutual understanding between > > memory reclaim on the one hand and RCU on the other might be in order. > > > > One possibility would be an open discussion. If it would help, I would > > be happy to describe how RCU reacts and responds to heavy load, along with > > some ways that RCU's reactions and responses could be enhanced if needed. > > > > Adding fsdevel as this should probably be a cross track session. Perhaps broaden this slightly. On the THP Cabal call we just had a conversation about the requirements on filesystems in the writeback path. We currently tell filesystem authors that the entire writeback path must avoid allocating memory in order to prevent deadlock (or use GFP_MEMALLOC). Is this appropriate? It's a lot of work to assure that writing pagecache back will not allocate memory in, eg, the network stack, the device driver, and any other layers the write must traverse. With the removal of ->writepage from vmscan, perhaps we can make filesystem authors lives easier by relaxing this requirement as pagecache should be cleaned long before we get to reclaiming it. I don't think there's anything to be done about swapping anon memory. We probably don't want to proactively write anon memory to swap, so by the time we're in ->swap_rw we really are low on memory. ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-02-28 19:37 ` Matthew Wilcox @ 2024-02-29 1:29 ` Dave Chinner 2024-02-29 4:20 ` Kent Overstreet 2024-02-29 4:17 ` Kent Overstreet 2024-03-01 2:16 ` NeilBrown 2 siblings, 1 reply; 52+ messages in thread From: Dave Chinner @ 2024-02-29 1:29 UTC (permalink / raw) To: Matthew Wilcox Cc: Amir Goldstein, paulmck, lsf-pc, linux-mm, linux-fsdevel, Kent Overstreet, Jan Kara On Wed, Feb 28, 2024 at 07:37:58PM +0000, Matthew Wilcox wrote: > On Tue, Feb 27, 2024 at 09:19:47PM +0200, Amir Goldstein wrote: > > On Tue, Feb 27, 2024 at 8:56 PM Paul E. McKenney <paulmck@kernel.org> wrote: > > > > > > Hello! > > > > > > Recent discussions [1] suggest that greater mutual understanding between > > > memory reclaim on the one hand and RCU on the other might be in order. > > > > > > One possibility would be an open discussion. If it would help, I would > > > be happy to describe how RCU reacts and responds to heavy load, along with > > > some ways that RCU's reactions and responses could be enhanced if needed. > > > > > > > Adding fsdevel as this should probably be a cross track session. > > Perhaps broaden this slightly. On the THP Cabal call we just had a > conversation about the requirements on filesystems in the writeback > path. We currently tell filesystem authors that the entire writeback > path must avoid allocating memory in order to prevent deadlock (or use > GFP_MEMALLOC). Is this appropriate? The reality is that filesystem developers have been ignoring that "mm rule" for a couple of decades. It was also discussed at LSFMM a decade ago (2014 IIRC) without resolution, so in the mean time we just took control of our own destiny.... > It's a lot of work to assure that > writing pagecache back will not allocate memory in, eg, the network stack, > the device driver, and any other layers the write must traverse. > > With the removal of ->writepage from vmscan, perhaps we can make > filesystem authors lives easier by relaxing this requirement as pagecache > should be cleaned long before we get to reclaiming it. .... by removing memory reclaim page cache writeback support from the filesystems entirely. IOWs, this rule hasn't been valid for a -long- time, so maybe it is time to remove it. :) Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-02-29 1:29 ` Dave Chinner @ 2024-02-29 4:20 ` Kent Overstreet 0 siblings, 0 replies; 52+ messages in thread From: Kent Overstreet @ 2024-02-29 4:20 UTC (permalink / raw) To: Dave Chinner Cc: Matthew Wilcox, Amir Goldstein, paulmck, lsf-pc, linux-mm, linux-fsdevel, Jan Kara On Thu, Feb 29, 2024 at 12:29:35PM +1100, Dave Chinner wrote: > On Wed, Feb 28, 2024 at 07:37:58PM +0000, Matthew Wilcox wrote: > > On Tue, Feb 27, 2024 at 09:19:47PM +0200, Amir Goldstein wrote: > > > On Tue, Feb 27, 2024 at 8:56 PM Paul E. McKenney <paulmck@kernel.org> wrote: > > > > > > > > Hello! > > > > > > > > Recent discussions [1] suggest that greater mutual understanding between > > > > memory reclaim on the one hand and RCU on the other might be in order. > > > > > > > > One possibility would be an open discussion. If it would help, I would > > > > be happy to describe how RCU reacts and responds to heavy load, along with > > > > some ways that RCU's reactions and responses could be enhanced if needed. > > > > > > > > > > Adding fsdevel as this should probably be a cross track session. > > > > Perhaps broaden this slightly. On the THP Cabal call we just had a > > conversation about the requirements on filesystems in the writeback > > path. We currently tell filesystem authors that the entire writeback > > path must avoid allocating memory in order to prevent deadlock (or use > > GFP_MEMALLOC). Is this appropriate? > > The reality is that filesystem developers have been ignoring that > "mm rule" for a couple of decades. It was also discussed at LSFMM a > decade ago (2014 IIRC) without resolution, so in the mean time we > just took control of our own destiny.... > > > It's a lot of work to assure that > > writing pagecache back will not allocate memory in, eg, the network stack, > > the device driver, and any other layers the write must traverse. > > > > With the removal of ->writepage from vmscan, perhaps we can make > > filesystem authors lives easier by relaxing this requirement as pagecache > > should be cleaned long before we get to reclaiming it. > > .... by removing memory reclaim page cache writeback support from > the filesystems entirely. > > IOWs, this rule hasn't been valid for a -long- time, so maybe it > is time to remove it. :) It's _never_ been valid, the entire IO stack allocates memory. This is what GFP_NOIO/GFP_NOFS is for, and additionally mempools/biosets. If mm can't satisfy the allocation, they should fail it, and then the IO path will have fallbacks (but they will be slow, i.e. iodepth will be greatly limited). ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-02-28 19:37 ` Matthew Wilcox 2024-02-29 1:29 ` Dave Chinner @ 2024-02-29 4:17 ` Kent Overstreet 2024-02-29 4:24 ` Matthew Wilcox 2024-03-01 2:16 ` NeilBrown 2 siblings, 1 reply; 52+ messages in thread From: Kent Overstreet @ 2024-02-29 4:17 UTC (permalink / raw) To: Matthew Wilcox Cc: Amir Goldstein, paulmck, lsf-pc, linux-mm, linux-fsdevel, Jan Kara On Wed, Feb 28, 2024 at 07:37:58PM +0000, Matthew Wilcox wrote: > On Tue, Feb 27, 2024 at 09:19:47PM +0200, Amir Goldstein wrote: > > On Tue, Feb 27, 2024 at 8:56 PM Paul E. McKenney <paulmck@kernel.org> wrote: > > > > > > Hello! > > > > > > Recent discussions [1] suggest that greater mutual understanding between > > > memory reclaim on the one hand and RCU on the other might be in order. > > > > > > One possibility would be an open discussion. If it would help, I would > > > be happy to describe how RCU reacts and responds to heavy load, along with > > > some ways that RCU's reactions and responses could be enhanced if needed. > > > > > > > Adding fsdevel as this should probably be a cross track session. > > Perhaps broaden this slightly. On the THP Cabal call we just had a > conversation about the requirements on filesystems in the writeback > path. We currently tell filesystem authors that the entire writeback > path must avoid allocating memory in order to prevent deadlock (or use > GFP_MEMALLOC). Is this appropriate? It's a lot of work to assure that > writing pagecache back will not allocate memory in, eg, the network stack, > the device driver, and any other layers the write must traverse. Why would you not simply mark the writeback path with memalloc_nofs_save()? ^ permalink raw reply [flat|nested] 52+ messages in thread
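For readers unfamiliar with the scoped API Kent is referring to, the pattern below (with a hypothetical do_writeback_work() standing in for a filesystem's real writeback body) makes every allocation in the marked section implicitly GFP_NOFS. That prevents reclaim re-entering the filesystem, but, as Willy notes in his reply, it says nothing about whether the allocation succeeds:

#include <linux/sched/mm.h>

int do_writeback_work(void);    /* hypothetical, stands in for the real writeback body */

int fs_writeback_section(void)
{
        unsigned int nofs_flags;
        int ret;

        nofs_flags = memalloc_nofs_save();
        /* All allocations in here behave as GFP_NOFS: no fs reclaim recursion. */
        ret = do_writeback_work();
        memalloc_nofs_restore(nofs_flags);

        return ret;
}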
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-02-29 4:17 ` Kent Overstreet @ 2024-02-29 4:24 ` Matthew Wilcox 2024-02-29 4:44 ` Kent Overstreet 0 siblings, 1 reply; 52+ messages in thread From: Matthew Wilcox @ 2024-02-29 4:24 UTC (permalink / raw) To: Kent Overstreet Cc: Amir Goldstein, paulmck, lsf-pc, linux-mm, linux-fsdevel, Jan Kara On Wed, Feb 28, 2024 at 11:17:33PM -0500, Kent Overstreet wrote: > On Wed, Feb 28, 2024 at 07:37:58PM +0000, Matthew Wilcox wrote: > > Perhaps broaden this slightly. On the THP Cabal call we just had a > > conversation about the requirements on filesystems in the writeback > > path. We currently tell filesystem authors that the entire writeback > > path must avoid allocating memory in order to prevent deadlock (or use > > GFP_MEMALLOC). Is this appropriate? It's a lot of work to assure that > > writing pagecache back will not allocate memory in, eg, the network stack, > > the device driver, and any other layers the write must traverse. > > > > Why would you not simply mark the writeback path with > memalloc_nofs_save()? It's not about preventing recursion, it's about guaranteeing forward progress. If you can't allocate a bio, you can't clean memory. ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-02-29 4:24 ` Matthew Wilcox @ 2024-02-29 4:44 ` Kent Overstreet 0 siblings, 0 replies; 52+ messages in thread From: Kent Overstreet @ 2024-02-29 4:44 UTC (permalink / raw) To: Matthew Wilcox Cc: Amir Goldstein, paulmck, lsf-pc, linux-mm, linux-fsdevel, Jan Kara On Thu, Feb 29, 2024 at 04:24:52AM +0000, Matthew Wilcox wrote: > On Wed, Feb 28, 2024 at 11:17:33PM -0500, Kent Overstreet wrote: > > On Wed, Feb 28, 2024 at 07:37:58PM +0000, Matthew Wilcox wrote: > > > Perhaps broaden this slightly. On the THP Cabal call we just had a > > > conversation about the requirements on filesystems in the writeback > > > path. We currently tell filesystem authors that the entire writeback > > > path must avoid allocating memory in order to prevent deadlock (or use > > > GFP_MEMALLOC). Is this appropriate? It's a lot of work to assure that > > > writing pagecache back will not allocate memory in, eg, the network stack, > > > the device driver, and any other layers the write must traverse. > > > > Why would you not simply mark the writeback path with > > memalloc_nofs_save()? > > It's not about preventing recursion, it's about guaranteeing forward > progres. If you can't allocate a bio, you can't clean memory. Err, what? We have to be able to allocate bios in order to do writeback, _period_. And not just bios, sometimes we have to bounce the entire IO. I keep noticing the mm people developing weird, crazy ideas about it being "unsafe" to allocate memory in various contexts. That's wrong; attempting to allocate memory is always a _safe_ operation, provided you tell the allocator what it's allowed to do. The allocation might fail, and that's ok. If code must succeed for the system to operate it must have fallbacks if allocations fail, but we don't limit ourselves to those fallbacks ("avoid allocating memory") because then performance would be shit. The PF_MEMALLOC suggestion is even weirder. mm people are high on something... ^ permalink raw reply [flat|nested] 52+ messages in thread
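The mempool/bioset fallback Kent describes is the existing answer to the "can't allocate a bio" case: a bio_set preallocates enough bios and bvecs that, with a gfp mask that can sleep, allocation blocks until an in-flight bio completes rather than failing, bounding iodepth instead of deadlocking. A minimal sketch, with hypothetical names and the bio_alloc_bioset() signature of recent kernels:

#include <linux/bio.h>

static struct bio_set writeback_bio_set;

static int writeback_bioset_init(void)
{
        /* Reserve a couple of bios plus bvecs to guarantee forward progress. */
        return bioset_init(&writeback_bio_set, BIO_POOL_SIZE, 0,
                           BIOSET_NEED_BVECS);
}

static struct bio *writeback_bio_alloc(struct block_device *bdev,
                                       unsigned short nr_vecs)
{
        /*
         * May sleep until a previously issued bio completes, but does not
         * return NULL for a sleeping gfp mask backed by the mempool.
         */
        return bio_alloc_bioset(bdev, nr_vecs, REQ_OP_WRITE, GFP_NOIO,
                                &writeback_bio_set);
}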
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-02-28 19:37 ` Matthew Wilcox 2024-02-29 1:29 ` Dave Chinner 2024-02-29 4:17 ` Kent Overstreet @ 2024-03-01 2:16 ` NeilBrown 2024-03-01 2:39 ` Kent Overstreet 2024-03-01 5:54 ` Dave Chinner 2 siblings, 2 replies; 52+ messages in thread From: NeilBrown @ 2024-03-01 2:16 UTC (permalink / raw) To: Matthew Wilcox Cc: Amir Goldstein, paulmck, lsf-pc, linux-mm, linux-fsdevel, Kent Overstreet, Jan Kara On Thu, 29 Feb 2024, Matthew Wilcox wrote: > On Tue, Feb 27, 2024 at 09:19:47PM +0200, Amir Goldstein wrote: > > On Tue, Feb 27, 2024 at 8:56 PM Paul E. McKenney <paulmck@kernel.org> wrote: > > > > > > Hello! > > > > > > Recent discussions [1] suggest that greater mutual understanding between > > > memory reclaim on the one hand and RCU on the other might be in order. > > > > > > One possibility would be an open discussion. If it would help, I would > > > be happy to describe how RCU reacts and responds to heavy load, along with > > > some ways that RCU's reactions and responses could be enhanced if needed. > > > > > > > Adding fsdevel as this should probably be a cross track session. > > Perhaps broaden this slightly. On the THP Cabal call we just had a > conversation about the requirements on filesystems in the writeback > path. We currently tell filesystem authors that the entire writeback > path must avoid allocating memory in order to prevent deadlock (or use > GFP_MEMALLOC). Is this appropriate? It's a lot of work to assure that > writing pagecache back will not allocate memory in, eg, the network stack, > the device driver, and any other layers the write must traverse. > > With the removal of ->writepage from vmscan, perhaps we can make > filesystem authors lives easier by relaxing this requirement as pagecache > should be cleaned long before we get to reclaiming it. > > I don't think there's anything to be done about swapping anon memory. > We probably don't want to proactively write anon memory to swap, so by > the time we're in ->swap_rw we really are low on memory. > > While we are considering revising mm rules, I would really like to revise the rule that GFP_KERNEL allocations are allowed to fail. I'm not at all sure that they ever do (except for large allocations - so maybe we could leave that exception in - or warn if large allocations are tried without a MAY_FAIL flag). Given that GFP_KERNEL can wait, and that the mm can kill off processes and clear cache to free memory, there should be no case where failure is needed, since simply waiting will eventually result in success. And if there is, the machine is a goner anyway. Once upon a time user-space pages could not be ripped out of a process by the oom killer until the process actually exited, and that meant that GFP_KERNEL allocations of a process being oom killed should not block indefinitely in the allocator. I *think* that isn't the case any more. Insisting that GFP_KERNEL allocations never returned NULL would allow us to remove a lot of untested error handling code.... NeilBrown ^ permalink raw reply [flat|nested] 52+ messages in thread
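The "untested error handling code" Neil has in mind is the ubiquitous pattern below; under his proposed rule the NULL check for a small GFP_KERNEL allocation becomes dead code, while large or failure-tolerant callers would be annotated with a may-fail flag (his MAY_FAIL is shorthand, not an existing gfp flag). struct foo and foo_alloc() are purely illustrative:

#include <linux/slab.h>

struct foo { int x; };          /* illustrative */

static struct foo *foo_alloc(void)
{
        struct foo *f = kmalloc(sizeof(*f), GFP_KERNEL);

        if (!f)         /* today: mandatory; under Neil's proposal: unreachable for small sizes */
                return NULL;
        f->x = 0;
        return f;
}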
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-03-01 2:16 ` NeilBrown @ 2024-03-01 2:39 ` Kent Overstreet 2024-03-01 2:48 ` Matthew Wilcox 2024-03-01 5:54 ` Dave Chinner 1 sibling, 1 reply; 52+ messages in thread From: Kent Overstreet @ 2024-03-01 2:39 UTC (permalink / raw) To: NeilBrown Cc: Matthew Wilcox, Amir Goldstein, paulmck, lsf-pc, linux-mm, linux-fsdevel, Jan Kara On Fri, Mar 01, 2024 at 01:16:18PM +1100, NeilBrown wrote: > On Thu, 29 Feb 2024, Matthew Wilcox wrote: > > On Tue, Feb 27, 2024 at 09:19:47PM +0200, Amir Goldstein wrote: > > > On Tue, Feb 27, 2024 at 8:56 PM Paul E. McKenney <paulmck@kernel.org> wrote: > > > > > > > > Hello! > > > > > > > > Recent discussions [1] suggest that greater mutual understanding between > > > > memory reclaim on the one hand and RCU on the other might be in order. > > > > > > > > One possibility would be an open discussion. If it would help, I would > > > > be happy to describe how RCU reacts and responds to heavy load, along with > > > > some ways that RCU's reactions and responses could be enhanced if needed. > > > > > > > > > > Adding fsdevel as this should probably be a cross track session. > > > > Perhaps broaden this slightly. On the THP Cabal call we just had a > > conversation about the requirements on filesystems in the writeback > > path. We currently tell filesystem authors that the entire writeback > > path must avoid allocating memory in order to prevent deadlock (or use > > GFP_MEMALLOC). Is this appropriate? It's a lot of work to assure that > > writing pagecache back will not allocate memory in, eg, the network stack, > > the device driver, and any other layers the write must traverse. > > > > With the removal of ->writepage from vmscan, perhaps we can make > > filesystem authors lives easier by relaxing this requirement as pagecache > > should be cleaned long before we get to reclaiming it. > > > > I don't think there's anything to be done about swapping anon memory. > > We probably don't want to proactively write anon memory to swap, so by > > the time we're in ->swap_rw we really are low on memory. > > > > > > While we are considering revising mm rules, I would really like to > revised the rule that GFP_KERNEL allocations are allowed to fail. > I'm not at all sure that they ever do (except for large allocations - so > maybe we could leave that exception in - or warn if large allocations > are tried without a MAY_FAIL flag). > > Given that GFP_KERNEL can wait, and that the mm can kill off processes > and clear cache to free memory, there should be no case where failure is > needed or when simply waiting will eventually result in success. And if > there is, the machine is a gonner anyway. > > Once upon a time user-space pages could not be ripped out of a process > by the oom killer until the process actually exited, and that meant that > GFP_KERNEL allocations of a process being oom killed should not block > indefinitely in the allocator. I *think* that isn't the case any more. > > Insisting that GFP_KERNEL allocations never returned NULL would allow us > to remove a lot of untested error handling code.... If memcg ever gets enabled for all kernel side allocations we might start seeing failures of GFP_KERNEL allocations. I've got better fault injection code coming, I'll be posting it right after memory allocation profiling gets merged - that'll help with the testing situation. 
The big blocker on enabling memcg for all kernel allocations is performance overhead, but I hear that's getting worked on as well. We'd probably want to add a gfp flag to annotate which allocations we want to fail because of memcg, though... ^ permalink raw reply [flat|nested] 52+ messages in thread
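For context on "memcg for all kernel allocations": today, charging kernel memory to the allocating task's cgroup is opt-in, per allocation or per slab cache, roughly as in the illustrative snippet below (my_subsys_init(), my_cache and the sizes are invented). Kent's point is about what should happen when such a charged allocation hits the cgroup limit:

#include <linux/slab.h>

static struct kmem_cache *my_cache;     /* illustrative names */

static int my_subsys_init(void)
{
        void *buf;

        /* Per-allocation opt-in: charged to the caller's memory cgroup. */
        buf = kmalloc(4096, GFP_KERNEL_ACCOUNT);
        kfree(buf);

        /* Per-cache opt-in: every object allocated from it is charged. */
        my_cache = kmem_cache_create("my_objs", 128, 0, SLAB_ACCOUNT, NULL);
        return my_cache ? 0 : -ENOMEM;
}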
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-03-01 2:39 ` Kent Overstreet @ 2024-03-01 2:48 ` Matthew Wilcox 2024-03-01 3:09 ` Kent Overstreet 2024-03-05 2:54 ` Yosry Ahmed 0 siblings, 2 replies; 52+ messages in thread From: Matthew Wilcox @ 2024-03-01 2:48 UTC (permalink / raw) To: Kent Overstreet Cc: NeilBrown, Amir Goldstein, paulmck, lsf-pc, linux-mm, linux-fsdevel, Jan Kara On Thu, Feb 29, 2024 at 09:39:17PM -0500, Kent Overstreet wrote: > On Fri, Mar 01, 2024 at 01:16:18PM +1100, NeilBrown wrote: > > Insisting that GFP_KERNEL allocations never returned NULL would allow us > > to remove a lot of untested error handling code.... > > If memcg ever gets enabled for all kernel side allocations we might > start seeing failures of GFP_KERNEL allocations. Why would we want that behaviour? A memcg-limited allocation should behave like any other allocation -- block until we've freed some other memory in this cgroup, either by swap or killing or ... ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-03-01 2:48 ` Matthew Wilcox @ 2024-03-01 3:09 ` Kent Overstreet 2024-03-01 3:33 ` James Bottomley 0 siblings, 1 reply; 52+ messages in thread From: Kent Overstreet @ 2024-03-01 3:09 UTC (permalink / raw) To: Matthew Wilcox Cc: NeilBrown, Amir Goldstein, paulmck, lsf-pc, linux-mm, linux-fsdevel, Jan Kara On Fri, Mar 01, 2024 at 02:48:52AM +0000, Matthew Wilcox wrote: > On Thu, Feb 29, 2024 at 09:39:17PM -0500, Kent Overstreet wrote: > > On Fri, Mar 01, 2024 at 01:16:18PM +1100, NeilBrown wrote: > > > Insisting that GFP_KERNEL allocations never returned NULL would allow us > > > to remove a lot of untested error handling code.... > > > > If memcg ever gets enabled for all kernel side allocations we might > > start seeing failures of GFP_KERNEL allocations. > > Why would we want that behaviour? A memcg-limited allocation should > behave like any other allocation -- block until we've freed some other > memory in this cgroup, either by swap or killing or ... It's not uncommon to have a more efficient way of doing something if you can allocate more memory, but still have the ability to run in a more bounded amount of space if you need to; I write code like this quite often. Or maybe you just want the syscall to return an error instead of blocking for an unbounded amount of time if userspace asks for something silly. Honestly, relying on the OOM killer and saying that because of it we don't have to write and test our error paths is a lazy cop out. The same kind of thinking got us overcommit, where yes we got an increase in efficiency, but the cost was that everyone started assuming and relying on overcommit, so now it's impossible to run without overcommit enabled except in highly controlled environments. And that means allocation failure as an effective signal is just completely busted in userspace. If you want to write code in userspace that uses as much memory as is available and no more, you _can't_, because system behaviour goes to shit if you have overcommit enabled or a bunch of memory gets wasted if overcommit is disabled because everyone assumes that's just what you do. Let's _not_ go that route in the kernel. I have pointy sticks to brandish at people who don't want to deal with properly handling errors. ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-03-01 3:09 ` Kent Overstreet @ 2024-03-01 3:33 ` James Bottomley 2024-03-01 3:52 ` Kent Overstreet 0 siblings, 1 reply; 52+ messages in thread From: James Bottomley @ 2024-03-01 3:33 UTC (permalink / raw) To: Kent Overstreet, Matthew Wilcox Cc: NeilBrown, Amir Goldstein, paulmck, lsf-pc, linux-mm, linux-fsdevel, Jan Kara On Thu, 2024-02-29 at 22:09 -0500, Kent Overstreet wrote: > On Fri, Mar 01, 2024 at 02:48:52AM +0000, Matthew Wilcox wrote: > > On Thu, Feb 29, 2024 at 09:39:17PM -0500, Kent Overstreet wrote: > > > On Fri, Mar 01, 2024 at 01:16:18PM +1100, NeilBrown wrote: > > > > Insisting that GFP_KERNEL allocations never returned NULL would > > > > allow us to remove a lot of untested error handling code.... > > > > > > If memcg ever gets enabled for all kernel side allocations we > > > might start seeing failures of GFP_KERNEL allocations. > > > > Why would we want that behaviour? A memcg-limited allocation > > should behave like any other allocation -- block until we've freed > > some other memory in this cgroup, either by swap or killing or ... > > It's not uncommon to have a more efficient way of doing something if > you can allocate more memory, but still have the ability to run in a > more bounded amount of space if you need to; I write code like this > quite often. The cgroup design is to do what we do usually, but within settable hard and soft limits. So if the kernel could make GFP_KERNEL wait without failing, the cgroup would mirror that. > Or maybe you just want the syscall to return an error instead of > blocking for an unbounded amount of time if userspace asks for > something silly. Warn on allocation above a certain size without MAY_FAIL would seem to cover all those cases. If there is a case for requiring instant allocation, you always have GFP_ATOMIC, and, I suppose, we could even do a bounded reclaim allocation where it tries for a certain time then fails. > Honestly, relying on the OOM killer and saying that because that now > we don't have to write and test your error paths is a lazy cop out. OOM Killer is the most extreme outcome. Usually reclaim (hugely simplified) dumps clean cache first and tries the shrinkers then tries to write out dirty cache. Only after that hasn't found anything after a few iterations will the oom killer get activated. > The same kind of thinking got us overcommit, where yes we got an > increase in efficiency, but the cost was that everyone started > assuming and relying on overcommit, so now it's impossible to run > without overcommit enabled except in highly controlled environments. That might be true for your use case, but it certainly isn't true for a cheap hosting cloud using containers: overcommit is where you make your money, so it's absolutely standard operating procedure. I wouldn't call cheap hosting a "highly controlled environment" they're just making a bet they won't get caught out too often. > And that means allocation failure as an effective signal is just > completely busted in userspace. If you want to write code in > userspace that uses as much memory as is available and no more, you > _can't_, because system behaviour goes to shit if you have overcommit > enabled or a bunch of memory gets wasted if overcommit is disabled > because everyone assumes that's just what you do. 
OK, this seems to be specific to your use case again, because if you look at what the major user space processes like web browsers do, they allocate way over the physical memory available to them for cache and assume the kernel will take care of it. Making failure a signal for being over the working set would cause all these applications to segfault almost immediately. I think what you're asking for is an API to try to calculate what the current available headroom in the working set would be? That's highly heuristic, but the mm people might have an idea how to do it. > Let's _not_ go that route in the kernel. I have pointy sticks to > brandish at people who don't want to deal with properly handling > errors. Error legs are the least exercised and most bug, and therefore exploit, prone pieces of code in C. If we can get rid of them, we should. James ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-03-01 3:33 ` James Bottomley @ 2024-03-01 3:52 ` Kent Overstreet 2024-03-01 4:01 ` Kent Overstreet 2024-03-01 4:08 ` James Bottomley 0 siblings, 2 replies; 52+ messages in thread From: Kent Overstreet @ 2024-03-01 3:52 UTC (permalink / raw) To: James Bottomley Cc: Matthew Wilcox, NeilBrown, Amir Goldstein, paulmck, lsf-pc, linux-mm, linux-fsdevel, Jan Kara On Fri, Mar 01, 2024 at 10:33:59AM +0700, James Bottomley wrote: > On Thu, 2024-02-29 at 22:09 -0500, Kent Overstreet wrote: > > Or maybe you just want the syscall to return an error instead of > > blocking for an unbounded amount of time if userspace asks for > > something silly. > > Warn on allocation above a certain size without MAY_FAIL would seem to > cover all those cases. If there is a case for requiring instant > allocation, you always have GFP_ATOMIC, and, I suppose, we could even > do a bounded reclaim allocation where it tries for a certain time then > fails. Then you're baking in this weird constant into all your algorithms that doesn't scale as machine memory sizes and working set sizes increase. > > Honestly, relying on the OOM killer and saying that because that now > > we don't have to write and test your error paths is a lazy cop out. > > OOM Killer is the most extreme outcome. Usually reclaim (hugely > simplified) dumps clean cache first and tries the shrinkers then tries > to write out dirty cache. Only after that hasn't found anything after > a few iterations will the oom killer get activated All your caches dumped and the machine grinds to a halt and then a random process gets killed instead of simply _failing the allocation_. > > The same kind of thinking got us overcommit, where yes we got an > > increase in efficiency, but the cost was that everyone started > > assuming and relying on overcommit, so now it's impossible to run > > without overcommit enabled except in highly controlled environments. > > That might be true for your use case, but it certainly isn't true for a > cheap hosting cloud using containers: overcommit is where you make your > money, so it's absolutely standard operating procedure. I wouldn't > call cheap hosting a "highly controlled environment" they're just > making a bet they won't get caught out too often. Reading comprehension fail. Reread what I wrote. > > And that means allocation failure as an effective signal is just > > completely busted in userspace. If you want to write code in > > userspace that uses as much memory as is available and no more, you > > _can't_, because system behaviour goes to shit if you have overcommit > > enabled or a bunch of memory gets wasted if overcommit is disabled > > because everyone assumes that's just what you do. > > OK, this seems to be specific to your use case again, because if you > look at what the major user space processes like web browsers do, they > allocate way over the physical memory available to them for cache and > assume the kernel will take care of it. Making failure a signal for > being over the working set would cause all these applications to > segfault almost immediately. Again, reread what I wrote. You're restating what I wrote and completely missing the point. > > Let's _not_ go that route in the kernel. I have pointy sticks to > > brandish at people who don't want to deal with properly handling > > errors. > > Error legs are the least exercised and most bug, and therefore exploit, > prone pieces of code in C. If we can get rid of them, we should. Fuck no. 
Having working error paths is _basic_, and learning how to test your code is also basic. If you can't be bothered to do that, you shouldn't be writing kernel code. We are giving up far too much by going down the route of "oh, just kill stuff if we screwed the pooch and overcommitted". I don't fucking care if it's what the big cloud providers want because it's convenient for them, some of us actually do care about reliability. By just saying "oh, the OOM killer will save us" what you're doing is making it nearly impossible to fully utilize a machine without having stuff randomly killed. Fuck. That. ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-03-01 3:52 ` Kent Overstreet @ 2024-03-01 4:01 ` Kent Overstreet 2024-03-01 4:09 ` NeilBrown 2024-03-01 4:18 ` James Bottomley 2024-03-01 4:08 ` James Bottomley 1 sibling, 2 replies; 52+ messages in thread From: Kent Overstreet @ 2024-03-01 4:01 UTC (permalink / raw) To: James Bottomley Cc: Matthew Wilcox, NeilBrown, Amir Goldstein, paulmck, lsf-pc, linux-mm, linux-fsdevel, Jan Kara On Thu, Feb 29, 2024 at 10:52:06PM -0500, Kent Overstreet wrote: > On Fri, Mar 01, 2024 at 10:33:59AM +0700, James Bottomley wrote: > > On Thu, 2024-02-29 at 22:09 -0500, Kent Overstreet wrote: > > > Or maybe you just want the syscall to return an error instead of > > > blocking for an unbounded amount of time if userspace asks for > > > something silly. > > > > Warn on allocation above a certain size without MAY_FAIL would seem to > > cover all those cases. If there is a case for requiring instant > > allocation, you always have GFP_ATOMIC, and, I suppose, we could even > > do a bounded reclaim allocation where it tries for a certain time then > > fails. > > Then you're baking in this weird constant into all your algorithms that > doesn't scale as machine memory sizes and working set sizes increase. > > > > Honestly, relying on the OOM killer and saying that because that now > > > we don't have to write and test your error paths is a lazy cop out. > > > > OOM Killer is the most extreme outcome. Usually reclaim (hugely > > simplified) dumps clean cache first and tries the shrinkers then tries > > to write out dirty cache. Only after that hasn't found anything after > > a few iterations will the oom killer get activated > > All your caches dumped and the machine grinds to a halt and then a > random process gets killed instead of simply _failing the allocation_. > > > > The same kind of thinking got us overcommit, where yes we got an > > > increase in efficiency, but the cost was that everyone started > > > assuming and relying on overcommit, so now it's impossible to run > > > without overcommit enabled except in highly controlled environments. > > > > That might be true for your use case, but it certainly isn't true for a > > cheap hosting cloud using containers: overcommit is where you make your > > money, so it's absolutely standard operating procedure. I wouldn't > > call cheap hosting a "highly controlled environment" they're just > > making a bet they won't get caught out too often. > > Reading comprehension fail. Reread what I wrote. > > > > And that means allocation failure as an effective signal is just > > > completely busted in userspace. If you want to write code in > > > userspace that uses as much memory as is available and no more, you > > > _can't_, because system behaviour goes to shit if you have overcommit > > > enabled or a bunch of memory gets wasted if overcommit is disabled > > > because everyone assumes that's just what you do. > > > > OK, this seems to be specific to your use case again, because if you > > look at what the major user space processes like web browsers do, they > > allocate way over the physical memory available to them for cache and > > assume the kernel will take care of it. Making failure a signal for > > being over the working set would cause all these applications to > > segfault almost immediately. > > Again, reread what I wrote. You're restating what I wrote and completely > missing the point. > > > > Let's _not_ go that route in the kernel. 
I have pointy sticks to > > > brandish at people who don't want to deal with properly handling > > > errors. > > > > Error legs are the least exercised and most bug, and therefore exploit, > > prone pieces of code in C. If we can get rid of them, we should. > > Fuck no. > > Having working error paths is _basic_, and learning how to test your > code is also basic. If you can't be bothered to do that you shouldn't be > writing kernel code. > > We are giving up far too much by going down the route of "oh, just kill > stuff if we screwed the pooch and overcommitted". > > I don't fucking care if it's what the big cloud providers want because > it's convenient for them, some of us actually do care about reliability. > > By just saying "oh, the OOM killer will save us" what you're doing is > making it nearly impossible to fully utilize a machine without having > stuff randomly killed. > > Fuck. That. And besides all that, as a practical matter you can't just "not have error paths" because, like you said, you'd still have to have a max size where you WARN() - and _fail the allocation_ - and you've still got to unwind. The OOM killer can't kill processes while they're stuck blocking on an allocation that will never return in the kernel. I think we can safely nip this idea in the bud. Test your damn error paths... ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-03-01 4:01 ` Kent Overstreet @ 2024-03-01 4:09 ` NeilBrown 2024-03-01 4:18 ` Kent Overstreet 2024-03-01 4:18 ` James Bottomley 1 sibling, 1 reply; 52+ messages in thread From: NeilBrown @ 2024-03-01 4:09 UTC (permalink / raw) To: Kent Overstreet Cc: James Bottomley, Matthew Wilcox, Amir Goldstein, paulmck, lsf-pc, linux-mm, linux-fsdevel, Jan Kara On Fri, 01 Mar 2024, Kent Overstreet wrote: > On Thu, Feb 29, 2024 at 10:52:06PM -0500, Kent Overstreet wrote: > > On Fri, Mar 01, 2024 at 10:33:59AM +0700, James Bottomley wrote: > > > On Thu, 2024-02-29 at 22:09 -0500, Kent Overstreet wrote: > > > > Or maybe you just want the syscall to return an error instead of > > > > blocking for an unbounded amount of time if userspace asks for > > > > something silly. > > > > > > Warn on allocation above a certain size without MAY_FAIL would seem to > > > cover all those cases. If there is a case for requiring instant > > > allocation, you always have GFP_ATOMIC, and, I suppose, we could even > > > do a bounded reclaim allocation where it tries for a certain time then > > > fails. > > > > Then you're baking in this weird constant into all your algorithms that > > doesn't scale as machine memory sizes and working set sizes increase. > > > > > > Honestly, relying on the OOM killer and saying that because that now > > > > we don't have to write and test your error paths is a lazy cop out. > > > > > > OOM Killer is the most extreme outcome. Usually reclaim (hugely > > > simplified) dumps clean cache first and tries the shrinkers then tries > > > to write out dirty cache. Only after that hasn't found anything after > > > a few iterations will the oom killer get activated > > > > All your caches dumped and the machine grinds to a halt and then a > > random process gets killed instead of simply _failing the allocation_. > > > > > > The same kind of thinking got us overcommit, where yes we got an > > > > increase in efficiency, but the cost was that everyone started > > > > assuming and relying on overcommit, so now it's impossible to run > > > > without overcommit enabled except in highly controlled environments. > > > > > > That might be true for your use case, but it certainly isn't true for a > > > cheap hosting cloud using containers: overcommit is where you make your > > > money, so it's absolutely standard operating procedure. I wouldn't > > > call cheap hosting a "highly controlled environment" they're just > > > making a bet they won't get caught out too often. > > > > Reading comprehension fail. Reread what I wrote. > > > > > > And that means allocation failure as an effective signal is just > > > > completely busted in userspace. If you want to write code in > > > > userspace that uses as much memory as is available and no more, you > > > > _can't_, because system behaviour goes to shit if you have overcommit > > > > enabled or a bunch of memory gets wasted if overcommit is disabled > > > > because everyone assumes that's just what you do. > > > > > > OK, this seems to be specific to your use case again, because if you > > > look at what the major user space processes like web browsers do, they > > > allocate way over the physical memory available to them for cache and > > > assume the kernel will take care of it. Making failure a signal for > > > being over the working set would cause all these applications to > > > segfault almost immediately. > > > > Again, reread what I wrote. 
You're restating what I wrote and completely > > missing the point. > > > > > > Let's _not_ go that route in the kernel. I have pointy sticks to > > > > brandish at people who don't want to deal with properly handling > > > > errors. > > > > > > Error legs are the least exercised and most bug, and therefore exploit, > > > prone pieces of code in C. If we can get rid of them, we should. > > > > Fuck no. > > > > Having working error paths is _basic_, and learning how to test your > > code is also basic. If you can't be bothered to do that you shouldn't be > > writing kernel code. > > > > We are giving far too much by going down the route of "oh, just kill > > stuff if we screwed the pooch and overcommitted". > > > > I don't fucking care if it's what the big cloud providers want because > > it's convenient for them, some of us actually do care about reliability. > > > > By just saying "oh, the OO killer will save us" what you're doing is > > making it nearly impossible to fully utilize a machine without having > > stuff randomly killed. > > > > And besides all that, as a practical matter you can't just "not have > erro paths" because, like you said, you'd still have to have a max size > where you WARN() - and _fail the allocation_ - and you've still got to > unwind. No. You warn and DON'T fail the allocation. Just like lockdep warns of possible deadlocks but lets you continue. These will be found in development (mostly) and changed to use __GFP_RETRY_MAYFAIL and have appropriate error-handling paths. > > The OOM killer can't kill processes while they're stuck blocking on an > allocation that will rever return in the kernel. But it can depopulate the user address space (I think). NeilBrown > > I think we can safely nip this idea in the bud. > > Test your damn error paths... > ^ permalink raw reply [flat|nested] 52+ messages in thread
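The opt-in Neil mentions already exists as a gfp flag. The sketch below (the function name and the ioctl framing are hypothetical) shows the shape of the conversion he expects: a userspace-controlled size gets __GFP_RETRY_MAYFAIL plus a real error path, while plain GFP_KERNEL callers would be left to block:

#include <linux/slab.h>

static void *alloc_user_sized_buffer(size_t len)
{
        /*
         * len is userspace-controlled (say, an ioctl argument), so opt in
         * to failure instead of letting the allocator retry indefinitely;
         * the caller keeps its NULL check and returns -ENOMEM.
         */
        return kmalloc(len, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
}

In practice kvmalloc(len, GFP_KERNEL) is the other common choice for such sizes, since the vmalloc fallback tolerates fragmentation; either way the caller still handles failure.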
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-03-01 4:09 ` NeilBrown @ 2024-03-01 4:18 ` Kent Overstreet 0 siblings, 0 replies; 52+ messages in thread From: Kent Overstreet @ 2024-03-01 4:18 UTC (permalink / raw) To: NeilBrown Cc: James Bottomley, Matthew Wilcox, Amir Goldstein, paulmck, lsf-pc, linux-mm, linux-fsdevel, Jan Kara On Fri, Mar 01, 2024 at 03:09:09PM +1100, NeilBrown wrote: > No. You warn and DON'T fail the allocation. Just like lockdep warns of > possible deadlocks but lets you continue. > These will be found in development (mostly) and changed to use > __GFP_RETRY_MAYFAIL and have appropriate error-handling paths. And when an allocation happens that cannot be satisfied without killing everything off? You do realize there are _lots_ of allocations where userspace has control over the size of the allocation, via e.g. obscure ioctls that ask for a buffer to be filled? ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-03-01 4:01 ` Kent Overstreet 2024-03-01 4:09 ` NeilBrown @ 2024-03-01 4:18 ` James Bottomley 1 sibling, 0 replies; 52+ messages in thread From: James Bottomley @ 2024-03-01 4:18 UTC (permalink / raw) To: Kent Overstreet Cc: Matthew Wilcox, NeilBrown, Amir Goldstein, paulmck, lsf-pc, linux-mm, linux-fsdevel, Jan Kara On Thu, 2024-02-29 at 23:01 -0500, Kent Overstreet wrote: > On Thu, Feb 29, 2024 at 10:52:06PM -0500, Kent Overstreet wrote: > > On Fri, Mar 01, 2024 at 10:33:59AM +0700, James Bottomley wrote: > > > On Thu, 2024-02-29 at 22:09 -0500, Kent Overstreet wrote: [...] > > > > Let's _not_ go that route in the kernel. I have pointy sticks > > > > to brandish at people who don't want to deal with properly > > > > handling errors. > > > > > > Error legs are the least exercised and most bug, and therefore > > > exploit, prone pieces of code in C. If we can get rid of them, > > > we should. > > > > Fuck no. > > > > Having working error paths is _basic_, and learning how to test > > your code is also basic. If you can't be bothered to do that you > > shouldn't be writing kernel code. Heh, that's as glib as saying people should test their C code for overcommit errors. If everyone did that there'd be no need for languages like rust. > > We are giving far too much by going down the route of "oh, just > > kill stuff if we screwed the pooch and overcommitted". > > > > I don't fucking care if it's what the big cloud providers want > > because it's convenient for them, some of us actually do care about > > reliability. Reliability has many definitions. The kernel tries to leave policy like this to the user, which is why the overcommit type is exposed to userspace. Arguing about whose workload is more important isn't really going to be helpful. > > By just saying "oh, the OO killer will save us" what you're doing > > is making it nearly impossible to fully utilize a machine without > > having stuff randomly killed. > > > > Fuck. That. > > And besides all that, as a practical matter you can't just "not have > erro paths" because, like you said, you'd still have to have a max > size where you WARN() - and _fail the allocation_ - and you've still > got to unwind. So? the point would be we can eliminate some potentially buggy error legs on small allocations. Since we have to add MAY_FAIL and error handling (which should already exist) to the larger ones. It would have no impact at all on scaling. The question of where the limit should be in the general case should probably be compile time configurable. We can probably even get the compiler to eliminate the if (err) leg with some judicious return value priming, meaning we achieve some meaningful bug density reduction for no or very few code changes. James ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-03-01 3:52 ` Kent Overstreet 2024-03-01 4:01 ` Kent Overstreet @ 2024-03-01 4:08 ` James Bottomley 2024-03-01 4:15 ` Kent Overstreet 1 sibling, 1 reply; 52+ messages in thread From: James Bottomley @ 2024-03-01 4:08 UTC (permalink / raw) To: Kent Overstreet Cc: Matthew Wilcox, NeilBrown, Amir Goldstein, paulmck, lsf-pc, linux-mm, linux-fsdevel, Jan Kara On Thu, 2024-02-29 at 22:52 -0500, Kent Overstreet wrote: > On Fri, Mar 01, 2024 at 10:33:59AM +0700, James Bottomley wrote: > > On Thu, 2024-02-29 at 22:09 -0500, Kent Overstreet wrote: > > > Or maybe you just want the syscall to return an error instead of > > > blocking for an unbounded amount of time if userspace asks for > > > something silly. > > > > Warn on allocation above a certain size without MAY_FAIL would seem > > to cover all those cases. If there is a case for requiring instant > > allocation, you always have GFP_ATOMIC, and, I suppose, we could > > even do a bounded reclaim allocation where it tries for a certain > > time then fails. > > Then you're baking in this weird constant into all your algorithms > that doesn't scale as machine memory sizes and working set sizes > increase. > > > > Honestly, relying on the OOM killer and saying that because that > > > now we don't have to write and test your error paths is a lazy > > > cop out. > > > > OOM Killer is the most extreme outcome. Usually reclaim (hugely > > simplified) dumps clean cache first and tries the shrinkers then > > tries to write out dirty cache. Only after that hasn't found > > anything after a few iterations will the oom killer get activated > > All your caches dumped and the machine grinds to a halt and then a > random process gets killed instead of simply _failing the > allocation_. Ignoring the fact free invective below, I think what you're asking for is strict overcommit. There's a tunable for that: https://www.kernel.org/doc/Documentation/vm/overcommit-accounting However, see the Gotchas section for why we can't turn it on globally, but it is available to you if you know what you're doing. James ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-03-01 4:08 ` James Bottomley @ 2024-03-01 4:15 ` Kent Overstreet 0 siblings, 0 replies; 52+ messages in thread From: Kent Overstreet @ 2024-03-01 4:15 UTC (permalink / raw) To: James Bottomley Cc: Matthew Wilcox, NeilBrown, Amir Goldstein, paulmck, lsf-pc, linux-mm, linux-fsdevel, Jan Kara On Fri, Mar 01, 2024 at 11:08:52AM +0700, James Bottomley wrote: > On Thu, 2024-02-29 at 22:52 -0500, Kent Overstreet wrote: > > On Fri, Mar 01, 2024 at 10:33:59AM +0700, James Bottomley wrote: > > > On Thu, 2024-02-29 at 22:09 -0500, Kent Overstreet wrote: > > > > Or maybe you just want the syscall to return an error instead of > > > > blocking for an unbounded amount of time if userspace asks for > > > > something silly. > > > > > > Warn on allocation above a certain size without MAY_FAIL would seem > > > to cover all those cases. If there is a case for requiring instant > > > allocation, you always have GFP_ATOMIC, and, I suppose, we could > > > even do a bounded reclaim allocation where it tries for a certain > > > time then fails. > > > > Then you're baking in this weird constant into all your algorithms > > that doesn't scale as machine memory sizes and working set sizes > > increase. > > > > > > Honestly, relying on the OOM killer and saying that because that > > > > now we don't have to write and test your error paths is a lazy > > > > cop out. > > > > > > OOM Killer is the most extreme outcome. Usually reclaim (hugely > > > simplified) dumps clean cache first and tries the shrinkers then > > > tries to write out dirty cache. Only after that hasn't found > > > anything after a few iterations will the oom killer get activated > > > > All your caches dumped and the machine grinds to a halt and then a > > random process gets killed instead of simply _failing the > > allocation_. > > Ignoring the fact free invective below, I think what you're asking for > is strict overcommit. There's a tunable for that: > > https://www.kernel.org/doc/Documentation/vm/overcommit-accounting > > However, see the Gotchas section for why we can't turn it on globally, > but it is available to you if you know what you're doing. James, I already explained all this. ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-03-01 2:48 ` Matthew Wilcox 2024-03-01 3:09 ` Kent Overstreet @ 2024-03-05 2:54 ` Yosry Ahmed 1 sibling, 0 replies; 52+ messages in thread From: Yosry Ahmed @ 2024-03-05 2:54 UTC (permalink / raw) To: Matthew Wilcox Cc: Kent Overstreet, NeilBrown, Amir Goldstein, paulmck, lsf-pc, linux-mm, linux-fsdevel, Jan Kara On Thu, Feb 29, 2024 at 6:49 PM Matthew Wilcox <willy@infradead.org> wrote: > > On Thu, Feb 29, 2024 at 09:39:17PM -0500, Kent Overstreet wrote: > > On Fri, Mar 01, 2024 at 01:16:18PM +1100, NeilBrown wrote: > > > Insisting that GFP_KERNEL allocations never returned NULL would allow us > > > to remove a lot of untested error handling code.... > > > > If memcg ever gets enabled for all kernel side allocations we might > > start seeing failures of GFP_KERNEL allocations. > > Why would we want that behaviour? A memcg-limited allocation should > behave like any other allocation -- block until we've freed some other > memory in this cgroup, either by swap or killing or ... I am not closely following this thread (although it is very interesting), but I don't think the same rules fully apply to memcg-limited allocations, specifically because the scope of the OOM killer and of the available resources is limited to the subtree of the memcg that hit its limit. Consider a case where a memcg is full of page cache memory that is mlock()'d by a process in another memcg, or full of tmpfs memory when there is no swap (or the swap limit is reached). You can get more creative with this too: start a process in memcg A, allocate some anon memory, move the process to memcg B. Now if you cannot swap, you cannot reclaim the memory from memcg A, and killing everything in memcg A doesn't help. ^ permalink raw reply [flat|nested] 52+ messages in thread
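(For context: kernel-side allocations only become memcg-limited once they are explicitly charged to a cgroup via __GFP_ACCOUNT, so the class of allocation being discussed here is, roughly, the following; the struct name is illustrative.)

	/* Charged against the current task's memcg; counts towards, and can
	 * be stopped by, that memcg's limit. */
	struct foo *f = kzalloc(sizeof(*f), GFP_KERNEL_ACCOUNT);
	if (!f)
		return -ENOMEM;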
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-03-01 2:16 ` NeilBrown 2024-03-01 2:39 ` Kent Overstreet @ 2024-03-01 5:54 ` Dave Chinner 2024-03-01 20:20 ` Kent Overstreet 1 sibling, 1 reply; 52+ messages in thread From: Dave Chinner @ 2024-03-01 5:54 UTC (permalink / raw) To: NeilBrown Cc: Matthew Wilcox, Amir Goldstein, paulmck, lsf-pc, linux-mm, linux-fsdevel, Kent Overstreet, Jan Kara On Fri, Mar 01, 2024 at 01:16:18PM +1100, NeilBrown wrote: > On Thu, 29 Feb 2024, Matthew Wilcox wrote: > > On Tue, Feb 27, 2024 at 09:19:47PM +0200, Amir Goldstein wrote: > > > On Tue, Feb 27, 2024 at 8:56 PM Paul E. McKenney <paulmck@kernel.org> wrote: > > > > > > > > Hello! > > > > > > > > Recent discussions [1] suggest that greater mutual understanding between > > > > memory reclaim on the one hand and RCU on the other might be in order. > > > > > > > > One possibility would be an open discussion. If it would help, I would > > > > be happy to describe how RCU reacts and responds to heavy load, along with > > > > some ways that RCU's reactions and responses could be enhanced if needed. > > > > > > > > > > Adding fsdevel as this should probably be a cross track session. > > > > Perhaps broaden this slightly. On the THP Cabal call we just had a > > conversation about the requirements on filesystems in the writeback > > path. We currently tell filesystem authors that the entire writeback > > path must avoid allocating memory in order to prevent deadlock (or use > > GFP_MEMALLOC). Is this appropriate? It's a lot of work to assure that > > writing pagecache back will not allocate memory in, eg, the network stack, > > the device driver, and any other layers the write must traverse. > > > > With the removal of ->writepage from vmscan, perhaps we can make > > filesystem authors lives easier by relaxing this requirement as pagecache > > should be cleaned long before we get to reclaiming it. > > > > I don't think there's anything to be done about swapping anon memory. > > We probably don't want to proactively write anon memory to swap, so by > > the time we're in ->swap_rw we really are low on memory. > > > > > > While we are considering revising mm rules, I would really like to > revised the rule that GFP_KERNEL allocations are allowed to fail. > I'm not at all sure that they ever do (except for large allocations - so > maybe we could leave that exception in - or warn if large allocations > are tried without a MAY_FAIL flag). > > Given that GFP_KERNEL can wait, and that the mm can kill off processes > and clear cache to free memory, there should be no case where failure is > needed or when simply waiting will eventually result in success. And if > there is, the machine is a gonner anyway. Yes, please! XFS was designed and implemented on an OS that gave this exact guarantee for kernel allocations back in the early 1990s. Memory allocation simply blocked until it succeeded unless the caller indicated they could handle failure. That's what __GFP_NOFAIL does and XFS is still heavily dependent on this behaviour. And before people scream "but that was 30 years ago, Unix OS code was much simpler", consider that Irix supported machines with hundreds of NUMA nodes, thousands of CPUs, terabytes of memory and petabytes of storage. 
It had variable size high order pages in the page cache (something we've only just got with folios!), page migration, page compaction, memory and process locality control, filesystem block sizes larger than page size (which we don't have yet!), memory shrinkers for subsystem cache reclaim, page cache dirty throttling to sustained writeback IO rates, etc. Lots of the mm technology from that OS has been re-implemented in Linux in the past two decades, but in several important ways Linux still falls shy of the bar that Irix set a couple of decades ago. One of those is the kernel memory allocation guarantee. > Once upon a time user-space pages could not be ripped out of a process > by the oom killer until the process actually exited, and that meant that > GFP_KERNEL allocations of a process being oom killed should not block > indefinitely in the allocator. I *think* that isn't the case any more. > > Insisting that GFP_KERNEL allocations never returned NULL would allow us > to remove a lot of untested error handling code.... This is the sort of thing I was thinking of in the "remove GFP_NOFS" discussion thread when I said this to Kent: "We need to start designing our code in a way that doesn't require extensive testing to validate it as correct. If the only way to validate new code is correct is via stochastic coverage via error injection, then that is a clear sign we've made poor design choices along the way." https://lore.kernel.org/linux-fsdevel/ZcqWh3OyMGjEsdPz@dread.disaster.area/ If memory allocation doesn't fail by default, then we can remove the vast majority of allocation error handling from the kernel. Make the common case just work - remove the need for all that code to handle failures that is hard to exercise reliably and so are rarely tested. A simple change to make long standing behaviour an actual policy we can rely on means we can remove both code and test matrix overhead - it's a win-win IMO. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 52+ messages in thread
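A deliberately simplified sketch of the "block until success unless the caller said it may fail" behaviour Dave describes - this is not XFS's actual code, just the general shape of such a wrapper:

	void *alloc_retry_forever(size_t size, bool may_fail)
	{
		void *p;

		do {
			p = kmalloc(size, GFP_KERNEL | __GFP_NOWARN);
			if (p || may_fail)
				return p;
			/* back off briefly and let reclaim make progress */
			cond_resched();
		} while (1);
	}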
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-03-01 5:54 ` Dave Chinner @ 2024-03-01 20:20 ` Kent Overstreet 2024-03-01 23:47 ` NeilBrown 0 siblings, 1 reply; 52+ messages in thread From: Kent Overstreet @ 2024-03-01 20:20 UTC (permalink / raw) To: Dave Chinner Cc: NeilBrown, Matthew Wilcox, Amir Goldstein, paulmck, lsf-pc, linux-mm, linux-fsdevel, Jan Kara On Fri, Mar 01, 2024 at 04:54:55PM +1100, Dave Chinner wrote: > On Fri, Mar 01, 2024 at 01:16:18PM +1100, NeilBrown wrote: > > While we are considering revising mm rules, I would really like to > > revised the rule that GFP_KERNEL allocations are allowed to fail. > > I'm not at all sure that they ever do (except for large allocations - so > > maybe we could leave that exception in - or warn if large allocations > > are tried without a MAY_FAIL flag). > > > > Given that GFP_KERNEL can wait, and that the mm can kill off processes > > and clear cache to free memory, there should be no case where failure is > > needed or when simply waiting will eventually result in success. And if > > there is, the machine is a gonner anyway. > > Yes, please! > > XFS was designed and implemented on an OS that gave this exact > guarantee for kernel allocations back in the early 1990s. Memory > allocation simply blocked until it succeeded unless the caller > indicated they could handle failure. That's what __GFP_NOFAIL does > and XFS is still heavily dependent on this behaviour. I'm not saying we should get rid of __GFP_NOFAIL - actually, I'd say let's remove the underscores and get rid of the silly two page limit. GFP_NOFAIL|GFP_KERNEL is perfectly safe for larger allocations, as long as you don't mind possibly waiting a bit. But it can't be the default because, like I mentioned to Neal, there are a _lot_ of different places where we allocate memory in the kernel, and they have to be able to fail instead of shoving everything else out of memory. > This is the sort of thing I was thinking of in the "remove > GFP_NOFS" discussion thread when I said this to Kent: > > "We need to start designing our code in a way that doesn't require > extensive testing to validate it as correct. If the only way to > validate new code is correct is via stochastic coverage via error > injection, then that is a clear sign we've made poor design choices > along the way." > > https://lore.kernel.org/linux-fsdevel/ZcqWh3OyMGjEsdPz@dread.disaster.area/ > > If memory allocation doesn't fail by default, then we can remove the > vast majority of allocation error handling from the kernel. Make the > common case just work - remove the need for all that code to handle > failures that is hard to exercise reliably and so are rarely tested. > > A simple change to make long standing behaviour an actual policy we > can rely on means we can remove both code and test matrix overhead - > it's a win-win IMO. We definitely don't want to make GFP_NOIO/GFP_NOFS allocations nofail by default - a great many of those allocations have mempools in front of them to avoid deadlocks, and if you do that you've made the mempools useless. ^ permalink raw reply [flat|nested] 52+ messages in thread
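The mempool arrangement Kent refers to looks roughly like this (cache and struct names invented, error handling omitted); the reserve is what guarantees forward progress in the I/O path even though the underlying slab allocation is allowed to fail:

	static struct kmem_cache *io_req_cache;		/* hypothetical slab cache */
	static mempool_t *io_req_pool;

	/* setup: keep a reserve of 16 objects for the I/O path */
	io_req_pool = mempool_create_slab_pool(16, io_req_cache);

	/* in the I/O submission path: try a normal (failable) slab allocation
	 * first, then fall back to the reserve, waiting for an element to be
	 * freed rather than going into reclaim */
	struct io_req *req = mempool_alloc(io_req_pool, GFP_NOIO);
	/* ... use req ... */
	mempool_free(req, io_req_pool);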
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-03-01 20:20 ` Kent Overstreet @ 2024-03-01 23:47 ` NeilBrown 2024-03-02 0:02 ` Kent Overstreet 0 siblings, 1 reply; 52+ messages in thread From: NeilBrown @ 2024-03-01 23:47 UTC (permalink / raw) To: Kent Overstreet Cc: Dave Chinner, Matthew Wilcox, Amir Goldstein, paulmck, lsf-pc, linux-mm, linux-fsdevel, Jan Kara On Sat, 02 Mar 2024, Kent Overstreet wrote: > On Fri, Mar 01, 2024 at 04:54:55PM +1100, Dave Chinner wrote: > > On Fri, Mar 01, 2024 at 01:16:18PM +1100, NeilBrown wrote: > > > While we are considering revising mm rules, I would really like to > > > revised the rule that GFP_KERNEL allocations are allowed to fail. > > > I'm not at all sure that they ever do (except for large allocations - so > > > maybe we could leave that exception in - or warn if large allocations > > > are tried without a MAY_FAIL flag). > > > > > > Given that GFP_KERNEL can wait, and that the mm can kill off processes > > > and clear cache to free memory, there should be no case where failure is > > > needed or when simply waiting will eventually result in success. And if > > > there is, the machine is a gonner anyway. > > > > Yes, please! > > > > XFS was designed and implemented on an OS that gave this exact > > guarantee for kernel allocations back in the early 1990s. Memory > > allocation simply blocked until it succeeded unless the caller > > indicated they could handle failure. That's what __GFP_NOFAIL does > > and XFS is still heavily dependent on this behaviour. > > I'm not saying we should get rid of __GFP_NOFAIL - actually, I'd say > let's remove the underscores and get rid of the silly two page limit. > GFP_NOFAIL|GFP_KERNEL is perfectly safe for larger allocations, as long > as you don't mind possibly waiting a bit. > > But it can't be the default because, like I mentioned to Neal, there are > a _lot_ of different places where we allocate memory in the kernel, and > they have to be able to fail instead of shoving everything else out of > memory. > > > This is the sort of thing I was thinking of in the "remove > > GFP_NOFS" discussion thread when I said this to Kent: > > > > "We need to start designing our code in a way that doesn't require > > extensive testing to validate it as correct. If the only way to > > validate new code is correct is via stochastic coverage via error > > injection, then that is a clear sign we've made poor design choices > > along the way." > > > > https://lore.kernel.org/linux-fsdevel/ZcqWh3OyMGjEsdPz@dread.disaster.area/ > > > > If memory allocation doesn't fail by default, then we can remove the > > vast majority of allocation error handling from the kernel. Make the > > common case just work - remove the need for all that code to handle > > failures that is hard to exercise reliably and so are rarely tested. > > > > A simple change to make long standing behaviour an actual policy we > > can rely on means we can remove both code and test matrix overhead - > > it's a win-win IMO. > > We definitely don't want to make GFP_NOIO/GFP_NOFS allocations nofail by > default - a great many of those allocations have mempools in front of > them to avoid deadlocks, and if you do that you've made the mempools > useless. > Not strictly true. mempool_alloc() adds __GFP_NORETRY so the allocation will certainly fail if that is appropriate. I suspect that most places where there is a non-error fallback already use NORETRY or RETRY_MAYFAIL or similar. 
But I agree that changing the meaning of GFP_KERNEL has the potential to cause problems. I support promoting "GFP_NOFAIL", which should work at least up to PAGE_ALLOC_COSTLY_ORDER (8 pages). I'm unsure how it should behave in PF_MEMALLOC_NOFS and PF_MEMALLOC_NOIO context. I suspect Dave would tell me it should work in these contexts, in which case I'm sure it should. Maybe we could then deprecate GFP_KERNEL. Thanks, NeilBrown ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-03-01 23:47 ` NeilBrown @ 2024-03-02 0:02 ` Kent Overstreet 2024-03-02 11:33 ` Tetsuo Handa 2024-03-03 22:45 ` NeilBrown 0 siblings, 2 replies; 52+ messages in thread From: Kent Overstreet @ 2024-03-02 0:02 UTC (permalink / raw) To: NeilBrown Cc: Dave Chinner, Matthew Wilcox, Amir Goldstein, paulmck, lsf-pc, linux-mm, linux-fsdevel, Jan Kara On Sat, Mar 02, 2024 at 10:47:59AM +1100, NeilBrown wrote: > On Sat, 02 Mar 2024, Kent Overstreet wrote: > > On Fri, Mar 01, 2024 at 04:54:55PM +1100, Dave Chinner wrote: > > > On Fri, Mar 01, 2024 at 01:16:18PM +1100, NeilBrown wrote: > > > > While we are considering revising mm rules, I would really like to > > > > revised the rule that GFP_KERNEL allocations are allowed to fail. > > > > I'm not at all sure that they ever do (except for large allocations - so > > > > maybe we could leave that exception in - or warn if large allocations > > > > are tried without a MAY_FAIL flag). > > > > > > > > Given that GFP_KERNEL can wait, and that the mm can kill off processes > > > > and clear cache to free memory, there should be no case where failure is > > > > needed or when simply waiting will eventually result in success. And if > > > > there is, the machine is a gonner anyway. > > > > > > Yes, please! > > > > > > XFS was designed and implemented on an OS that gave this exact > > > guarantee for kernel allocations back in the early 1990s. Memory > > > allocation simply blocked until it succeeded unless the caller > > > indicated they could handle failure. That's what __GFP_NOFAIL does > > > and XFS is still heavily dependent on this behaviour. > > > > I'm not saying we should get rid of __GFP_NOFAIL - actually, I'd say > > let's remove the underscores and get rid of the silly two page limit. > > GFP_NOFAIL|GFP_KERNEL is perfectly safe for larger allocations, as long > > as you don't mind possibly waiting a bit. > > > > But it can't be the default because, like I mentioned to Neal, there are > > a _lot_ of different places where we allocate memory in the kernel, and > > they have to be able to fail instead of shoving everything else out of > > memory. > > > > > This is the sort of thing I was thinking of in the "remove > > > GFP_NOFS" discussion thread when I said this to Kent: > > > > > > "We need to start designing our code in a way that doesn't require > > > extensive testing to validate it as correct. If the only way to > > > validate new code is correct is via stochastic coverage via error > > > injection, then that is a clear sign we've made poor design choices > > > along the way." > > > > > > https://lore.kernel.org/linux-fsdevel/ZcqWh3OyMGjEsdPz@dread.disaster.area/ > > > > > > If memory allocation doesn't fail by default, then we can remove the > > > vast majority of allocation error handling from the kernel. Make the > > > common case just work - remove the need for all that code to handle > > > failures that is hard to exercise reliably and so are rarely tested. > > > > > > A simple change to make long standing behaviour an actual policy we > > > can rely on means we can remove both code and test matrix overhead - > > > it's a win-win IMO. > > > > We definitely don't want to make GFP_NOIO/GFP_NOFS allocations nofail by > > default - a great many of those allocations have mempools in front of > > them to avoid deadlocks, and if you do that you've made the mempools > > useless. > > > > Not strictly true. 
mempool_alloc() adds __GFP_NORETRY so the allocation > will certainly fail if that is appropriate. *nod* > I suspect that most places where there is a non-error fallback already > use NORETRY or RETRY_MAYFAIL or similar. NORETRY and RETRY_MAYFAIL actually weren't on my radar, and I don't see _tons_ of uses for either of them - more for NORETRY. My go-to is NOWAIT in this scenario though; my common pattern is "try nonblocking with locks held, then drop locks and retry GFP_KERNEL". > But I agree that changing the meaning of GFP_KERNEL has a potential to > cause problems. I support promoting "GFP_NOFAIL" which should work at > least up to PAGE_ALLOC_COSTLY_ORDER (8 pages). I'd support this change. > I'm unsure how it should be have in PF_MEMALLOC_NOFS and > PF_MEMALLOC_NOIO context. I suspect Dave would tell me it should work in > these contexts, in which case I'm sure it should. > > Maybe we could then deprecate GFP_KERNEL. What do you have in mind? Deprecating GFP_NOFS and GFP_NOIO would be wonderful - those should really just be PF_MEMALLOC_NOFS and PF_MEMALLOC_NOIO, now that we're pushing for memalloc_flags_(save|restore) more. Getting rid of those would be a really nice cleanup because then gfp flags would mostly just be: - the type of memory to allocate (highmem, zeroed, etc.) - how hard to try (don't block at all, block some, block forever) ^ permalink raw reply [flat|nested] 52+ messages in thread
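For reference, the "try nonblocking with locks held, then drop locks and retry GFP_KERNEL" pattern Kent mentions looks roughly like this (the lock and struct names are illustrative):

	new = kmalloc(sizeof(*new), GFP_NOWAIT);
	if (!new) {
		/* Can't sleep under the spinlock: drop it, do a blocking
		 * allocation, then retake the lock and revalidate, since the
		 * protected state may have changed while unlocked. */
		spin_unlock(&d->lock);
		new = kmalloc(sizeof(*new), GFP_KERNEL);
		if (!new)
			return -ENOMEM;
		spin_lock(&d->lock);
	}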
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-03-02 0:02 ` Kent Overstreet @ 2024-03-02 11:33 ` Tetsuo Handa 2024-03-02 16:53 ` Matthew Wilcox 2024-03-03 22:45 ` NeilBrown 1 sibling, 1 reply; 52+ messages in thread From: Tetsuo Handa @ 2024-03-02 11:33 UTC (permalink / raw) To: Kent Overstreet, NeilBrown Cc: Dave Chinner, Matthew Wilcox, Amir Goldstein, paulmck, lsf-pc, linux-mm, linux-fsdevel, Jan Kara On 2024/03/02 9:02, Kent Overstreet wrote: > Getting rid of those would be a really nice cleanup beacuse then gfp > flags would mostly just be: > - the type of memory to allocate (highmem, zeroed, etc.) > - how hard to try (don't block at all, block some, block forever) I really wish that __GFP_KILLABLE were added, so that we could choose between the allocation equivalents of mutex_trylock()/mutex_lock_killable()/mutex_lock(). Memory allocations from LSM modules for access control are willing to give up when the allocating process is killed, because userspace won't know whether the request succeeded anyway. But such allocations can hardly afford to give up unless the allocating process is killed or the allocation is above PAGE_ALLOC_COSTLY_ORDER, because an ENOMEM failure returned by e.g. open() can defeat the purpose of running the userspace process (i.e. it is as bad as being killed by the OOM killer). ^ permalink raw reply [flat|nested] 52+ messages in thread
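There is no __GFP_KILLABLE today; the closest open-coded approximation of the semantics Tetsuo is asking for is probably something like the sketch below (a failable allocation retried until a fatal signal arrives), though a real flag could also react to the signal from inside reclaim itself:

	static void *kmalloc_killable(size_t size)
	{
		void *p;

		do {
			p = kmalloc(size, GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOWARN);
			if (p)
				return p;
		} while (!fatal_signal_pending(current));

		return NULL;	/* caller was killed; propagate an error upwards */
	}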
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-03-02 11:33 ` Tetsuo Handa @ 2024-03-02 16:53 ` Matthew Wilcox 0 siblings, 0 replies; 52+ messages in thread From: Matthew Wilcox @ 2024-03-02 16:53 UTC (permalink / raw) To: Tetsuo Handa Cc: Kent Overstreet, NeilBrown, Dave Chinner, Amir Goldstein, paulmck, lsf-pc, linux-mm, linux-fsdevel, Jan Kara On Sat, Mar 02, 2024 at 08:33:34PM +0900, Tetsuo Handa wrote: > On 2024/03/02 9:02, Kent Overstreet wrote: > > Getting rid of those would be a really nice cleanup beacuse then gfp > > flags would mostly just be: > > - the type of memory to allocate (highmem, zeroed, etc.) > > - how hard to try (don't block at all, block some, block forever) > > I really wish that __GFP_KILLABLE is added, so that we can use like > mutex_trylock()/mutex_lock_killable()/mutex_lock(). > > Memory allocations from LSM modules for access control are willing to > give up when the allocating process is killed, for userspace won't know > about whether the request succeed. But such allocations are hardly > acceptable to give up unless the allocating process is killed or > allocation hit PAGE_ALLOC_COSTLY_ORDER, for ENOMEM failure returned by > e.g. open() can break the purpose of executing userspace processes > (i.e. as bad as being killed by the OOM killer). I'd be open to adding that, but does it really happen? For it to be useful, we'd have to have an allocation spending a significant amount of time in memory allocation such that the signal arrives while we're retrying. Do you have any details about times when you've seen this, eg sizes, other GFP flags, ... ? ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-03-02 0:02 ` Kent Overstreet 2024-03-02 11:33 ` Tetsuo Handa @ 2024-03-03 22:45 ` NeilBrown 2024-03-03 22:54 ` Kent Overstreet ` (4 more replies) 1 sibling, 5 replies; 52+ messages in thread From: NeilBrown @ 2024-03-03 22:45 UTC (permalink / raw) To: Kent Overstreet Cc: Dave Chinner, Matthew Wilcox, Amir Goldstein, paulmck, lsf-pc, linux-mm, linux-fsdevel, Jan Kara On Sat, 02 Mar 2024, Kent Overstreet wrote: > On Sat, Mar 02, 2024 at 10:47:59AM +1100, NeilBrown wrote: > > On Sat, 02 Mar 2024, Kent Overstreet wrote: > > > On Fri, Mar 01, 2024 at 04:54:55PM +1100, Dave Chinner wrote: > > > > On Fri, Mar 01, 2024 at 01:16:18PM +1100, NeilBrown wrote: > > > > > While we are considering revising mm rules, I would really like to > > > > > revised the rule that GFP_KERNEL allocations are allowed to fail. > > > > > I'm not at all sure that they ever do (except for large allocations - so > > > > > maybe we could leave that exception in - or warn if large allocations > > > > > are tried without a MAY_FAIL flag). > > > > > > > > > > Given that GFP_KERNEL can wait, and that the mm can kill off processes > > > > > and clear cache to free memory, there should be no case where failure is > > > > > needed or when simply waiting will eventually result in success. And if > > > > > there is, the machine is a gonner anyway. > > > > > > > > Yes, please! > > > > > > > > XFS was designed and implemented on an OS that gave this exact > > > > guarantee for kernel allocations back in the early 1990s. Memory > > > > allocation simply blocked until it succeeded unless the caller > > > > indicated they could handle failure. That's what __GFP_NOFAIL does > > > > and XFS is still heavily dependent on this behaviour. > > > > > > I'm not saying we should get rid of __GFP_NOFAIL - actually, I'd say > > > let's remove the underscores and get rid of the silly two page limit. > > > GFP_NOFAIL|GFP_KERNEL is perfectly safe for larger allocations, as long > > > as you don't mind possibly waiting a bit. > > > > > > But it can't be the default because, like I mentioned to Neal, there are > > > a _lot_ of different places where we allocate memory in the kernel, and > > > they have to be able to fail instead of shoving everything else out of > > > memory. > > > > > > > This is the sort of thing I was thinking of in the "remove > > > > GFP_NOFS" discussion thread when I said this to Kent: > > > > > > > > "We need to start designing our code in a way that doesn't require > > > > extensive testing to validate it as correct. If the only way to > > > > validate new code is correct is via stochastic coverage via error > > > > injection, then that is a clear sign we've made poor design choices > > > > along the way." > > > > > > > > https://lore.kernel.org/linux-fsdevel/ZcqWh3OyMGjEsdPz@dread.disaster.area/ > > > > > > > > If memory allocation doesn't fail by default, then we can remove the > > > > vast majority of allocation error handling from the kernel. Make the > > > > common case just work - remove the need for all that code to handle > > > > failures that is hard to exercise reliably and so are rarely tested. > > > > > > > > A simple change to make long standing behaviour an actual policy we > > > > can rely on means we can remove both code and test matrix overhead - > > > > it's a win-win IMO. 
> > > > > > We definitely don't want to make GFP_NOIO/GFP_NOFS allocations nofail by > > > default - a great many of those allocations have mempools in front of > > > them to avoid deadlocks, and if you do that you've made the mempools > > > useless. > > > > > > > Not strictly true. mempool_alloc() adds __GFP_NORETRY so the allocation > > will certainly fail if that is appropriate. > > *nod* > > > I suspect that most places where there is a non-error fallback already > > use NORETRY or RETRY_MAYFAIL or similar. > > NORETRY and RETRY_MAYFAIL actually weren't on my radar, and I don't see > _tons_ of uses for either of them - more for NORETRY. > > My go-to is NOWAIT in this scenario though; my common pattern is "try > nonblocking with locks held, then drop locks and retry GFP_KERNEL". > > > But I agree that changing the meaning of GFP_KERNEL has a potential to > > cause problems. I support promoting "GFP_NOFAIL" which should work at > > least up to PAGE_ALLOC_COSTLY_ORDER (8 pages). > > I'd support this change. > > > I'm unsure how it should be have in PF_MEMALLOC_NOFS and > > PF_MEMALLOC_NOIO context. I suspect Dave would tell me it should work in > > these contexts, in which case I'm sure it should. > > > > Maybe we could then deprecate GFP_KERNEL. > > What do you have in mind? I have in mind a more explicit statement of how much waiting is acceptable. GFP_NOFAIL - wait indefinitely GFP_KILLABLE - wait indefinitely unless fatal signal is pending. GFP_RETRY - may retry but deadlock, though unlikely, is possible. So don't wait indefinitely. May abort more quickly if fatal signal is pending. GFP_NO_RETRY - only try things once. This may sleep, but will give up fairly quickly. Either deadlock is a significant possibility, or alternate strategy is fairly cheap. GFP_ATOMIC - don't sleep - same as current. I don't see how "GFP_KERNEL" fits into that spectrum. The definition of "this will try really hard, but might fail and we can't really tell you what circumstances it might fail in" isn't fun to work with. Thanks, NeilBrown > > Deprecating GFP_NOFS and GFP_NOIO would be wonderful - those should > really just be PF_MEMALLOC_NOFS and PF_MEMALLOC_NOIO, now that we're > pushing for memalloc_flags_(save|restore) more. > > Getting rid of those would be a really nice cleanup beacuse then gfp > flags would mostly just be: > - the type of memory to allocate (highmem, zeroed, etc.) > - how hard to try (don't block at all, block some, block forever) > ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-03-03 22:45 ` NeilBrown @ 2024-03-03 22:54 ` Kent Overstreet 2024-03-04 0:20 ` Dave Chinner ` (3 subsequent siblings) 4 siblings, 0 replies; 52+ messages in thread From: Kent Overstreet @ 2024-03-03 22:54 UTC (permalink / raw) To: NeilBrown Cc: Dave Chinner, Matthew Wilcox, Amir Goldstein, paulmck, lsf-pc, linux-mm, linux-fsdevel, Jan Kara On Mon, Mar 04, 2024 at 09:45:48AM +1100, NeilBrown wrote: > I have in mind a more explicit statement of how much waiting is > acceptable. > > GFP_NOFAIL - wait indefinitely > GFP_KILLABLE - wait indefinitely unless fatal signal is pending. > GFP_RETRY - may retry but deadlock, though unlikely, is possible. So > don't wait indefinitely. May abort more quickly if fatal > signal is pending. > GFP_NO_RETRY - only try things once. This may sleep, but will give up > fairly quickly. Either deadlock is a significant > possibility, or alternate strategy is fairly cheap. > GFP_ATOMIC - don't sleep - same as current. > > I don't see how "GFP_KERNEL" fits into that spectrum. The definition of > "this will try really hard, but might fail and we can't really tell you > what circumstances it might fail in" isn't fun to work with. Well, lots of things "aren't fun to work with", but error handling is just a part of life. Your "GFP_KILLABLE" has the exact same problem of "this thing will be rarely hit and difficult to test" - if anything moreso. We just need to make sure error paths are getting tested - we need more practical fault injection, that's all. ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-03-03 22:45 ` NeilBrown 2024-03-03 22:54 ` Kent Overstreet @ 2024-03-04 0:20 ` Dave Chinner 2024-03-04 1:16 ` NeilBrown 2024-03-04 0:35 ` Matthew Wilcox ` (2 subsequent siblings) 4 siblings, 1 reply; 52+ messages in thread From: Dave Chinner @ 2024-03-04 0:20 UTC (permalink / raw) To: NeilBrown Cc: Kent Overstreet, Matthew Wilcox, Amir Goldstein, paulmck, lsf-pc, linux-mm, linux-fsdevel, Jan Kara On Mon, Mar 04, 2024 at 09:45:48AM +1100, NeilBrown wrote: > I have in mind a more explicit statement of how much waiting is > acceptable. > > GFP_NOFAIL - wait indefinitely Make this the default, and we don't need a flag for it at all. > GFP_KILLABLE - wait indefinitely unless fatal signal is pending. > GFP_RETRY - may retry but deadlock, though unlikely, is possible. So > don't wait indefinitely. May abort more quickly if fatal > signal is pending. KILLABLE and RETRY are the same thing from the caller POV. Effectively "GFP_MAY_FAIL", where it will try really hard, but if there is a risk of deadlock or a fatal signal pending, it will fail. > GFP_NO_RETRY - only try things once. This may sleep, but will give up > fairly quickly. Either deadlock is a significant > possibility, or alternate strategy is fairly cheap. > GFP_ATOMIC - don't sleep - same as current. We're talking about wait semantics, so GFP_ATOMIC should be named GFP_NO_WAIT and described as "same as NO_RETRY but will not sleep". That gives us three modifying flags to the default behaviour of sleeping until success: GFP_MAY_FAIL, GFP_NO_RETRY and GFP_NO_WAIT. I will note there is one more case callers might really want to avoid: direct reclaim. That sort of behaviour might be worth folding into GFP_NO_WAIT, as there are cases where we want the allocation attempt to fail without trying to reclaim memory because it's *much* faster to simply use the fallback mechanism than it is to attempt memory reclaim (e.g. xlog_kvmalloc()). > I don't see how "GFP_KERNEL" fits into that spectrum. Agreed. > The definition of > "this will try really hard, but might fail and we can't really tell you > what circumstances it might fail in" isn't fun to work with. Yup, XFS was designed for NO_FAIL and MAY_FAIL behaviour, and in more recent times we also use NO_RECLAIM to provide our own kvmalloc semantics because the current kvmalloc API really only supports "GFP_KERNEL" allocation. > > Deprecating GFP_NOFS and GFP_NOIO would be wonderful - those should > > really just be PF_MEMALLOC_NOFS and PF_MEMALLOC_NOIO, now that we're > > pushing for memalloc_flags_(save|restore) more. This is largely now subsystem maintenance work - the infrastructure is there, and some subsystems have been converted over entirely to use it. The remaining work either needs to be mandated or have someone explicitly tasked with completing that work. IOWs, the process in which we change APIs and then leave the long tail of conversions to subsystem maintainers is just a mechanism for creating technical debt that takes forever to clean up... > > Getting rid of those would be a really nice cleanup beacuse then gfp > > flags would mostly just be: > > - the type of memory to allocate (highmem, zeroed, etc.) > > - how hard to try (don't block at all, block some, block forever) Yup, would be a very good improvement. -Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 52+ messages in thread
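The scoped interface being referred to already exists; the remaining conversion work is replacing explicit GFP_NOFS/GFP_NOIO call sites with a scope such as the following (run_transaction is an illustrative name, not a real helper):

	unsigned int nofs_flags;

	/* Every allocation inside this section - including GFP_KERNEL
	 * allocations made by library code that knows nothing about the
	 * filesystem - implicitly behaves as GFP_NOFS. */
	nofs_flags = memalloc_nofs_save();
	error = run_transaction(tp);		/* illustrative */
	memalloc_nofs_restore(nofs_flags);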
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-03-04 0:20 ` Dave Chinner @ 2024-03-04 1:16 ` NeilBrown 0 siblings, 0 replies; 52+ messages in thread From: NeilBrown @ 2024-03-04 1:16 UTC (permalink / raw) To: Dave Chinner Cc: Kent Overstreet, Matthew Wilcox, Amir Goldstein, paulmck, lsf-pc, linux-mm, linux-fsdevel, Jan Kara On Mon, 04 Mar 2024, Dave Chinner wrote: > On Mon, Mar 04, 2024 at 09:45:48AM +1100, NeilBrown wrote: > > I have in mind a more explicit statement of how much waiting is > > acceptable. > > > > GFP_NOFAIL - wait indefinitely > > Make this the default, and we don't need a flag for it at all. These aren't meant to be flags - those start with __. They are .. "flag combinations" to use the term description from gfp_types.h There could only be a "default" if we used macro magic to allow the GFP_ argument to be omitted. > > > GFP_KILLABLE - wait indefinitely unless fatal signal is pending. > > GFP_RETRY - may retry but deadlock, though unlikely, is possible. So > > don't wait indefinitely. May abort more quickly if fatal > > signal is pending. > > KILLABLE and RETRY are the same thing from the caller POV. > Effectively "GFP_MAY_FAIL", where it will try really hard, but if it > there is a risk of deadlock or a fatal signal pending, it will fail. > > > GFP_NO_RETRY - only try things once. This may sleep, but will give up > > fairly quickly. Either deadlock is a significant > > possibility, or alternate strategy is fairly cheap. > > GFP_ATOMIC - don't sleep - same as current. > > We're talking about wait semantics, so GFP_ATOMIC should be named > GFP_NO_WAIT and described as "same as NO_RETRY but will not sleep". > > That gives us three modifying flags to the default behaviour of > sleeping until success: GFP_MAY_FAIL, GFP_NO_RETRY and GFP_NO_WAIT. We currently have both __GFP_NORETRY and __GFP_RETRY_MAYFAIL which differ in how many retries (1 or a few) and are both used (the former about twice as much as the latter). Do we need both? Commit dcda9b04713c ("mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful semantic") might be useful in understanding the RETRY_MAYFAIL semantic. I think the intent is that RETRY_MAYFAIL doesn't trigger the oom killer. That seems like it could be a useful distinction. GFP_NOFAIL - retry indefinitely GFP_NOOOM - retry until fatal signal or OOM condition GFP_NORETRY - sleep if needed, but don't retry GFP_NOSLEEP - AKA GFP_ATOMIC We might need a better name than GFP_NOOOM :-) Thanks, NeilBrown > > I will note there is one more case callers might really want to > avoid: direct reclaim. That sort of behaviour might be worth folding > into GFP_NO_WAIT, as there are cases where we want the allocation > attempt to fail without trying to reclaim memory because it's *much* > faster to simply use the fallback mechanism than it is to attempt > memory reclaim (e.g. xlog_kvmalloc()). > > > I don't see how "GFP_KERNEL" fits into that spectrum. > > Agreed. > > > The definition of > > "this will try really hard, but might fail and we can't really tell you > > what circumstances it might fail in" isn't fun to work with. > > Yup, XFS was designed for NO_FAIL and MAY_FAIL behaviour, and in more > recent times we also use NO_RECLAIM to provide our own kvmalloc > semantics because the current kvmalloc API really only supports > "GFP_KERNEL" allocation. 
> > > > Deprecating GFP_NOFS and GFP_NOIO would be wonderful - those should > > > really just be PF_MEMALLOC_NOFS and PF_MEMALLOC_NOIO, now that we're > > > pushing for memalloc_flags_(save|restore) more. > > This is largely now subsystem maintenance work - the infrastructure > is there, and some subsystems have been converted over entirely to > use it. The remaining work either needs to be mandated or have > someone explicitly tasked with completing that work. > > IOWs, the process in which we change APIs and then leave the long > tail of conversions to subsystem maintainers is just a mechanism for > creating technical debt that takes forever to clean up... > > > > Getting rid of those would be a really nice cleanup beacuse then gfp > > > flags would mostly just be: > > > - the type of memory to allocate (highmem, zeroed, etc.) > > > - how hard to try (don't block at all, block some, block forever) > > Yup, would be a very good improvement. > > -Dave. > -- > Dave Chinner > david@fromorbit.com > ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-03-03 22:45 ` NeilBrown 2024-03-03 22:54 ` Kent Overstreet 2024-03-04 0:20 ` Dave Chinner @ 2024-03-04 0:35 ` Matthew Wilcox 2024-03-04 1:27 ` NeilBrown 2024-03-04 2:05 ` Kent Overstreet 2024-03-12 14:46 ` Vlastimil Babka 2024-03-21 6:27 ` Dan Carpenter 4 siblings, 2 replies; 52+ messages in thread From: Matthew Wilcox @ 2024-03-04 0:35 UTC (permalink / raw) To: NeilBrown Cc: Kent Overstreet, Dave Chinner, Amir Goldstein, paulmck, lsf-pc, linux-mm, linux-fsdevel, Jan Kara On Mon, Mar 04, 2024 at 09:45:48AM +1100, NeilBrown wrote: > I have in mind a more explicit statement of how much waiting is > acceptable. > > GFP_NOFAIL - wait indefinitely > GFP_KILLABLE - wait indefinitely unless fatal signal is pending. > GFP_RETRY - may retry but deadlock, though unlikely, is possible. So > don't wait indefinitely. May abort more quickly if fatal > signal is pending. > GFP_NO_RETRY - only try things once. This may sleep, but will give up > fairly quickly. Either deadlock is a significant > possibility, or alternate strategy is fairly cheap. > GFP_ATOMIC - don't sleep - same as current. I don't think these should be GFP flags. Rather, these should be context flags (and indeed, they're mutually exclusive, so this is a small integer to represent where we are on the spectrum). That is we want code to do void *alloc_foo(void) { return init_foo(kmalloc(256, GFP_MOVABLE)); } void submit_foo(void) { spin_lock(&submit_lock); flags = memalloc_set_atomic(); __submit_foo(alloc_foo()); memalloc_restore_flags(flags); spin_unlock(&submit_lock); } struct foo *prealloc_foo(void) { return alloc_foo(); } ... for various degrees of complexity. That is, the _location_ of memory is allocation site knowledge, but how hard to try is _path_ dependent, and not known at the allocation site because it doesn't know what locks are held. ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-03-04 0:35 ` Matthew Wilcox @ 2024-03-04 1:27 ` NeilBrown 2024-03-04 2:05 ` Kent Overstreet 1 sibling, 0 replies; 52+ messages in thread From: NeilBrown @ 2024-03-04 1:27 UTC (permalink / raw) To: Matthew Wilcox Cc: Kent Overstreet, Dave Chinner, Amir Goldstein, paulmck, lsf-pc, linux-mm, linux-fsdevel, Jan Kara On Mon, 04 Mar 2024, Matthew Wilcox wrote: > On Mon, Mar 04, 2024 at 09:45:48AM +1100, NeilBrown wrote: > > I have in mind a more explicit statement of how much waiting is > > acceptable. > > > > GFP_NOFAIL - wait indefinitely > > GFP_KILLABLE - wait indefinitely unless fatal signal is pending. > > GFP_RETRY - may retry but deadlock, though unlikely, is possible. So > > don't wait indefinitely. May abort more quickly if fatal > > signal is pending. > > GFP_NO_RETRY - only try things once. This may sleep, but will give up > > fairly quickly. Either deadlock is a significant > > possibility, or alternate strategy is fairly cheap. > > GFP_ATOMIC - don't sleep - same as current. > > I don't think these should be GFP flags. Rather, these should be > context flags (and indeed, they're mutually exclusive, so this is a > small integer to represent where we are on the spectrum). That is > we want code to do > > void *alloc_foo(void) > { > return init_foo(kmalloc(256, GFP_MOVABLE)); > } > > void submit_foo(void) > { > spin_lock(&submit_lock); > flags = memalloc_set_atomic(); > __submit_foo(alloc_foo()); > memalloc_restore_flags(flags); > spin_unlock(&submit_lock); > } > > struct foo *prealloc_foo(void) > { > return alloc_foo(); > } > > ... for various degrees of complexity. That is, the _location_ of memory > is allocation site knowledge, but how hard to try is _path_ dependent, > and not known at the allocation site because it doesn't know what locks > are held. > I'm not convinced. While there is a path dependency there is also location dependency. The code at the call-site determines what happens in response to failure. For GFP_NOFAIL, failure is not possible. We cannot allow context to turn NOFAIL into NOSLEEP because context cannot add error handling. Consider mempool_alloc(). That requests a NORETRY allocation because there is an easy fall-back. Is that a location dependency or a path dependency? I would say it is location. Of course a location is just a very short path so it is a path dependency too - but is that perspective really helpful? Certainly I could accept a GFP_WHATEVER which uses NOSLEEP if path context demands that, and NO_OOM otherwise. Thanks, NeilBrown ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-03-04 0:35 ` Matthew Wilcox 2024-03-04 1:27 ` NeilBrown @ 2024-03-04 2:05 ` Kent Overstreet 1 sibling, 0 replies; 52+ messages in thread From: Kent Overstreet @ 2024-03-04 2:05 UTC (permalink / raw) To: Matthew Wilcox Cc: NeilBrown, Dave Chinner, Amir Goldstein, paulmck, lsf-pc, linux-mm, linux-fsdevel, Jan Kara On Mon, Mar 04, 2024 at 12:35:05AM +0000, Matthew Wilcox wrote: > On Mon, Mar 04, 2024 at 09:45:48AM +1100, NeilBrown wrote: > > I have in mind a more explicit statement of how much waiting is > > acceptable. > > > > GFP_NOFAIL - wait indefinitely > > GFP_KILLABLE - wait indefinitely unless fatal signal is pending. > > GFP_RETRY - may retry but deadlock, though unlikely, is possible. So > > don't wait indefinitely. May abort more quickly if fatal > > signal is pending. > > GFP_NO_RETRY - only try things once. This may sleep, but will give up > > fairly quickly. Either deadlock is a significant > > possibility, or alternate strategy is fairly cheap. > > GFP_ATOMIC - don't sleep - same as current. > > I don't think these should be GFP flags. Rather, these should be > context flags (and indeed, they're mutually exclusive, so this is a > small integer to represent where we are on the spectrum). That is > we want code to do Why? Context flags are for /context/, i.e. the scope where you take a lock that's GFP_FS unsafe. These really are callsite specific - "how bad is it if we have to deal with an allocation failure here?" ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-03-03 22:45 ` NeilBrown ` (2 preceding siblings ...) 2024-03-04 0:35 ` Matthew Wilcox @ 2024-03-12 14:46 ` Vlastimil Babka 2024-03-12 22:09 ` NeilBrown 2024-03-20 18:32 ` Dan Carpenter 2024-03-21 6:27 ` Dan Carpenter 4 siblings, 2 replies; 52+ messages in thread From: Vlastimil Babka @ 2024-03-12 14:46 UTC (permalink / raw) To: NeilBrown, Kent Overstreet Cc: Dave Chinner, Matthew Wilcox, Amir Goldstein, paulmck, lsf-pc, linux-mm, linux-fsdevel, Jan Kara On 3/3/24 23:45, NeilBrown wrote: > On Sat, 02 Mar 2024, Kent Overstreet wrote: >> >> *nod* >> >> > I suspect that most places where there is a non-error fallback already >> > use NORETRY or RETRY_MAYFAIL or similar. >> >> NORETRY and RETRY_MAYFAIL actually weren't on my radar, and I don't see >> _tons_ of uses for either of them - more for NORETRY. >> >> My go-to is NOWAIT in this scenario though; my common pattern is "try >> nonblocking with locks held, then drop locks and retry GFP_KERNEL". >> >> > But I agree that changing the meaning of GFP_KERNEL has a potential to >> > cause problems. I support promoting "GFP_NOFAIL" which should work at >> > least up to PAGE_ALLOC_COSTLY_ORDER (8 pages). >> >> I'd support this change. >> >> > I'm unsure how it should be have in PF_MEMALLOC_NOFS and >> > PF_MEMALLOC_NOIO context. I suspect Dave would tell me it should work in >> > these contexts, in which case I'm sure it should. >> > >> > Maybe we could then deprecate GFP_KERNEL. >> >> What do you have in mind? > > I have in mind a more explicit statement of how much waiting is > acceptable. > > GFP_NOFAIL - wait indefinitely > GFP_KILLABLE - wait indefinitely unless fatal signal is pending. > GFP_RETRY - may retry but deadlock, though unlikely, is possible. So > don't wait indefinitely. May abort more quickly if fatal > signal is pending. > GFP_NO_RETRY - only try things once. This may sleep, but will give up > fairly quickly. Either deadlock is a significant > possibility, or alternate strategy is fairly cheap. > GFP_ATOMIC - don't sleep - same as current. > > I don't see how "GFP_KERNEL" fits into that spectrum. The definition of > "this will try really hard, but might fail and we can't really tell you > what circumstances it might fail in" isn't fun to work with. The problem is if we set out to change everything from GFP_KERNEL to one of the above, it will take many years. So I think it would be better to just change the semantics of GFP_KERNEL too. If we change it to remove the "too-small to fail" rule, we might suddenly introduce crashes in unknown amount of places, so I don't think that's feasible. But if we change it to effectively mean GFP_NOFAIL (for non-costly allocations), there should be a manageable number of places to change to a variant that allows failure. Also if these places are GFP_KERNEL by mistake today, and should in fact allow failure, they would be already causing problems today, as the circumstances where too-small-to-fail is violated are quite rare (IIRC just being an oom victim, so somewhat close to GFP_KILLABLE). So changing GFP_KERNEL to GFP_NOFAIL should be the lowest risk (one could argue for GFP_KILLABLE but I'm afraid many places don't really handle that as they assume the too-small-to-fail without exceptions and are unaware of the oom victim loophole, and failing on any fatal signal increases the chances of this happening). 
> Thanks, > NeilBrown > > >> >> Deprecating GFP_NOFS and GFP_NOIO would be wonderful - those should >> really just be PF_MEMALLOC_NOFS and PF_MEMALLOC_NOIO, now that we're >> pushing for memalloc_flags_(save|restore) more. >> >> Getting rid of those would be a really nice cleanup beacuse then gfp >> flags would mostly just be: >> - the type of memory to allocate (highmem, zeroed, etc.) >> - how hard to try (don't block at all, block some, block forever) >> > > ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-03-12 14:46 ` Vlastimil Babka @ 2024-03-12 22:09 ` NeilBrown 2024-03-20 18:32 ` Dan Carpenter 1 sibling, 0 replies; 52+ messages in thread From: NeilBrown @ 2024-03-12 22:09 UTC (permalink / raw) To: Vlastimil Babka Cc: Kent Overstreet, Dave Chinner, Matthew Wilcox, Amir Goldstein, paulmck, lsf-pc, linux-mm, linux-fsdevel, Jan Kara On Wed, 13 Mar 2024, Vlastimil Babka wrote: > On 3/3/24 23:45, NeilBrown wrote: > > On Sat, 02 Mar 2024, Kent Overstreet wrote: > >> > >> *nod* > >> > >> > I suspect that most places where there is a non-error fallback already > >> > use NORETRY or RETRY_MAYFAIL or similar. > >> > >> NORETRY and RETRY_MAYFAIL actually weren't on my radar, and I don't see > >> _tons_ of uses for either of them - more for NORETRY. > >> > >> My go-to is NOWAIT in this scenario though; my common pattern is "try > >> nonblocking with locks held, then drop locks and retry GFP_KERNEL". > >> > >> > But I agree that changing the meaning of GFP_KERNEL has a potential to > >> > cause problems. I support promoting "GFP_NOFAIL" which should work at > >> > least up to PAGE_ALLOC_COSTLY_ORDER (8 pages). > >> > >> I'd support this change. > >> > >> > I'm unsure how it should be have in PF_MEMALLOC_NOFS and > >> > PF_MEMALLOC_NOIO context. I suspect Dave would tell me it should work in > >> > these contexts, in which case I'm sure it should. > >> > > >> > Maybe we could then deprecate GFP_KERNEL. > >> > >> What do you have in mind? > > > > I have in mind a more explicit statement of how much waiting is > > acceptable. > > > > GFP_NOFAIL - wait indefinitely > > GFP_KILLABLE - wait indefinitely unless fatal signal is pending. > > GFP_RETRY - may retry but deadlock, though unlikely, is possible. So > > don't wait indefinitely. May abort more quickly if fatal > > signal is pending. > > GFP_NO_RETRY - only try things once. This may sleep, but will give up > > fairly quickly. Either deadlock is a significant > > possibility, or alternate strategy is fairly cheap. > > GFP_ATOMIC - don't sleep - same as current. > > > > I don't see how "GFP_KERNEL" fits into that spectrum. The definition of > > "this will try really hard, but might fail and we can't really tell you > > what circumstances it might fail in" isn't fun to work with. > > The problem is if we set out to change everything from GFP_KERNEL to one of > the above, it will take many years. So I think it would be better to just > change the semantics of GFP_KERNEL too. It took a long time to completely remove the BKL too. I don't think this is something we should be afraid of. We can easily use tools to remind us about the work that needs doing and the progress being made. > > If we change it to remove the "too-small to fail" rule, we might suddenly > introduce crashes in unknown amount of places, so I don't think that's feasible. I don't think anyone wants that. > > But if we change it to effectively mean GFP_NOFAIL (for non-costly > allocations), there should be a manageable number of places to change to a > variant that allows failure. Also if these places are GFP_KERNEL by mistake > today, and should in fact allow failure, they would be already causing > problems today, as the circumstances where too-small-to-fail is violated are > quite rare (IIRC just being an oom victim, so somewhat close to > GFP_KILLABLE). 
So changing GFP_KERNEL to GFP_NOFAIL should be the lowest > risk (one could argue for GFP_KILLABLE but I'm afraid many places don't > really handle that as they assume the too-small-to-fail without exceptions > and are unaware of the oom victim loophole, and failing on any fatal signal > increases the chances of this happening). I think many uses of GFP_KERNEL should be changed to GFP_NOFAIL. But that ISN'T just changing the flag (or the meaning of the flag). It also involves removing that code that handles failure. That is, to me, a strong argument against redefining GFP_KERNEL to mean GFP_NOFAIL. We (I) really do want that error handling code to be removed. That needs to be done on a case-by-case basis. Thanks, NeilBrown ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-03-12 14:46 ` Vlastimil Babka 2024-03-12 22:09 ` NeilBrown @ 2024-03-20 18:32 ` Dan Carpenter 2024-03-20 18:48 ` Vlastimil Babka 2024-03-20 19:09 ` Kent Overstreet 1 sibling, 2 replies; 52+ messages in thread From: Dan Carpenter @ 2024-03-20 18:32 UTC (permalink / raw) To: Vlastimil Babka Cc: NeilBrown, Kent Overstreet, Dave Chinner, Matthew Wilcox, Amir Goldstein, paulmck, lsf-pc, linux-mm, linux-fsdevel, Jan Kara On Tue, Mar 12, 2024 at 03:46:32PM +0100, Vlastimil Babka wrote: > But if we change it to effectively mean GFP_NOFAIL (for non-costly > allocations), there should be a manageable number of places to change to a > variant that allows failure. What does that even mean if GFP_NOFAIL can fail for "costly" allocations? I thought GFP_NOFAIL couldn't fail at all... Unfortunately, it's common that when we can't decide on a sane limit for something people just say "let the user decide based on how much memory they have". I have added some integer overflow checks which allow the user to allocate up to UINT_MAX bytes so I know this code is out there. We can't just s/GFP_KERNEL/GFP_NOFAIL/. From a static analysis perspective it would be nice if the callers explicitly marked which allocations can fail and which can't. regards, dan carpenter ^ permalink raw reply [flat|nested] 52+ messages in thread
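The kind of call site Dan is describing typically looks like the sketch below (the req fields are invented); it is exactly the sort of allocation that must keep its error path rather than becoming implicitly nofail:

	size_t bytes;
	void *buf;

	/* nr and elem_size come from userspace, bounded only by the overflow check */
	if (check_mul_overflow(req.nr, req.elem_size, &bytes))
		return -EOVERFLOW;

	buf = kvmalloc(bytes, GFP_KERNEL);	/* can approach UINT_MAX bytes */
	if (!buf)
		return -ENOMEM;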
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-03-20 18:32 ` Dan Carpenter @ 2024-03-20 18:48 ` Vlastimil Babka 2024-03-20 18:55 ` Matthew Wilcox 2024-03-20 19:09 ` Kent Overstreet 1 sibling, 1 reply; 52+ messages in thread From: Vlastimil Babka @ 2024-03-20 18:48 UTC (permalink / raw) To: Dan Carpenter Cc: NeilBrown, Kent Overstreet, Dave Chinner, Matthew Wilcox, Amir Goldstein, paulmck, lsf-pc, linux-mm, linux-fsdevel, Jan Kara On 3/20/24 19:32, Dan Carpenter wrote: > On Tue, Mar 12, 2024 at 03:46:32PM +0100, Vlastimil Babka wrote: >> But if we change it to effectively mean GFP_NOFAIL (for non-costly >> allocations), there should be a manageable number of places to change to a >> variant that allows failure. > > What does that even mean if GFP_NOFAIL can fail for "costly" allocations? > I thought GFP_NOFAIL couldn't fail at all... Yeah, the suggestion was that GFP_KERNEL would act as GFP_NOFAIL but only for non-costly allocations. Anything marked GFP_NOFAIL would still be fully nofail. > Unfortunately, it's common that when we can't decide on a sane limit for > something people just say "let the user decide based on how much memory > they have". I have added some integer overflow checks which allow the > user to allocate up to UINT_MAX bytes so I know this code is out > there. We can't just s/GFP_KERNEL/GFP_NOFAIL/. Maybe we could start producing warnings for costly GFP_KERNEL allocations to get them converted away faster. Anything that's user-controlled most likely shouldn't be GFP_KERNEL. > From a static analysis perspective it would be nice if the callers > explicitly marked which allocations can fail and which can't. As I suggested, it would be nice not to wait until everything is explicitly marked one way or another. I get the comparison with the BKL, but the kernel has also grown much larger since the BKL times? Good point that it's not ideal if the size is unknown. Maybe the tools could be used to point out places where the size cannot be determined, so those should be converted first? Also tools could warn about attempts to handle failure, to point out places where it should be removed? > regards, > dan carpenter > ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-03-20 18:48 ` Vlastimil Babka @ 2024-03-20 18:55 ` Matthew Wilcox 2024-03-20 19:07 ` Kent Overstreet 0 siblings, 1 reply; 52+ messages in thread From: Matthew Wilcox @ 2024-03-20 18:55 UTC (permalink / raw) To: Vlastimil Babka Cc: Dan Carpenter, NeilBrown, Kent Overstreet, Dave Chinner, Amir Goldstein, paulmck, lsf-pc, linux-mm, linux-fsdevel, Jan Kara On Wed, Mar 20, 2024 at 07:48:01PM +0100, Vlastimil Babka wrote: > On 3/20/24 19:32, Dan Carpenter wrote: > > On Tue, Mar 12, 2024 at 03:46:32PM +0100, Vlastimil Babka wrote: > >> But if we change it to effectively mean GFP_NOFAIL (for non-costly > >> allocations), there should be a manageable number of places to change to a > >> variant that allows failure. > > > > What does that even mean if GFP_NOFAIL can fail for "costly" allocations? > > I thought GFP_NOFAIL couldn't fail at all... > > Yeah, the suggestion was that GFP_KERNEL would act as GFP_NOFAIL but only > for non-costly allocations. Anything marked GFP_NOFAIL would still be fully > nofail. GFP_NOFAIL should still fail for allocations larger than KMALLOC_MAX_SIZE. Or should we interpret that as "die now"? Or "go into an unkillable sleep"? If the caller really has taken the opportunity to remove their error handling path, returning NULL will lead to a crash and a lot of beard stroking trying to understand why a GFP_NOFAIL allocation has returned NULL. May as well BUG_ON(size > KMALLOC_MAX_SIZE) and give the developer a clear indication of what they did wrong. ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-03-20 18:55 ` Matthew Wilcox @ 2024-03-20 19:07 ` Kent Overstreet 2024-03-20 19:14 ` Matthew Wilcox 0 siblings, 1 reply; 52+ messages in thread From: Kent Overstreet @ 2024-03-20 19:07 UTC (permalink / raw) To: Matthew Wilcox Cc: Vlastimil Babka, Dan Carpenter, NeilBrown, Dave Chinner, Amir Goldstein, paulmck, lsf-pc, linux-mm, linux-fsdevel, Jan Kara On Wed, Mar 20, 2024 at 06:55:36PM +0000, Matthew Wilcox wrote: > On Wed, Mar 20, 2024 at 07:48:01PM +0100, Vlastimil Babka wrote: > > On 3/20/24 19:32, Dan Carpenter wrote: > > > On Tue, Mar 12, 2024 at 03:46:32PM +0100, Vlastimil Babka wrote: > > >> But if we change it to effectively mean GFP_NOFAIL (for non-costly > > >> allocations), there should be a manageable number of places to change to a > > >> variant that allows failure. > > > > > > What does that even mean if GFP_NOFAIL can fail for "costly" allocations? > > > I thought GFP_NOFAIL couldn't fail at all... > > > > Yeah, the suggestion was that GFP_KERNEL would act as GFP_NOFAIL but only > > for non-costly allocations. Anything marked GFP_NOFAIL would still be fully > > nofail. > > GFP_NOFAIL should still fail for allocations larger than KMALLOC_MAX_SIZE. > Or should we interpret that as "die now"? Or "go into an unkillable > sleep"? If the caller really has taken the opportunity to remove their > error handling path, returning NULL will lead to a crash and a lot of > beard stroking trying to understand why a GFP_NOFAIL allocation has > returned NULL. May as well BUG_ON(size > KMALLOC_MAX_SIZE) and give > the developer a clear indication of what they did wrong. Why do we even need KMALLOC_MAX_SIZE...? Given that kmalloc internally switches to the page allocator when needed, I would think that that's something we can do away with. ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-03-20 19:07 ` Kent Overstreet @ 2024-03-20 19:14 ` Matthew Wilcox 2024-03-20 19:33 ` Kent Overstreet 0 siblings, 1 reply; 52+ messages in thread From: Matthew Wilcox @ 2024-03-20 19:14 UTC (permalink / raw) To: Kent Overstreet Cc: Vlastimil Babka, Dan Carpenter, NeilBrown, Dave Chinner, Amir Goldstein, paulmck, lsf-pc, linux-mm, linux-fsdevel, Jan Kara On Wed, Mar 20, 2024 at 03:07:38PM -0400, Kent Overstreet wrote: > On Wed, Mar 20, 2024 at 06:55:36PM +0000, Matthew Wilcox wrote: > > GFP_NOFAIL should still fail for allocations larger than KMALLOC_MAX_SIZE. > > Or should we interpret that as "die now"? Or "go into an unkillable > > sleep"? If the caller really has taken the opportunity to remove their > > error handling path, returning NULL will lead to a crash and a lot of > > beard stroking trying to understand why a GFP_NOFAIL allocation has > > returned NULL. May as well BUG_ON(size > KMALLOC_MAX_SIZE) and give > > the developer a clear indication of what they did wrong. > > Why do we even need KMALLOC_MAX_SIZE...? > > Given that kmalloc internally switches to the page allocator when > needed, I would think that that's something we can do away with. ... maybe check what I said before replying? /* * SLUB directly allocates requests fitting in to an order-1 page * (PAGE_SIZE*2). Larger requests are passed to the page allocator. */ #define KMALLOC_SHIFT_MAX (MAX_PAGE_ORDER + PAGE_SHIFT) You can't allocate larger than that without going to CMA or some other custom allocator. So. The caller has requested NOFAIL, and requested a size larger than we can allocate. Do we panic, go into an uninterruptible sleep, or return NULL? ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-03-20 19:14 ` Matthew Wilcox @ 2024-03-20 19:33 ` Kent Overstreet 0 siblings, 0 replies; 52+ messages in thread From: Kent Overstreet @ 2024-03-20 19:33 UTC (permalink / raw) To: Matthew Wilcox Cc: Vlastimil Babka, Dan Carpenter, NeilBrown, Dave Chinner, Amir Goldstein, paulmck, lsf-pc, linux-mm, linux-fsdevel, Jan Kara On Wed, Mar 20, 2024 at 07:14:35PM +0000, Matthew Wilcox wrote: > On Wed, Mar 20, 2024 at 03:07:38PM -0400, Kent Overstreet wrote: > > On Wed, Mar 20, 2024 at 06:55:36PM +0000, Matthew Wilcox wrote: > > > GFP_NOFAIL should still fail for allocations larger than KMALLOC_MAX_SIZE. > > > Or should we interpret that as "die now"? Or "go into an unkillable > > > sleep"? If the caller really has taken the opportunity to remove their > > > error handling path, returning NULL will lead to a crash and a lot of > > > beard stroking trying to understand why a GFP_NOFAIL allocation has > > > returned NULL. May as well BUG_ON(size > KMALLOC_MAX_SIZE) and give > > > the developer a clear indication of what they did wrong. > > > > Why do we even need KMALLOC_MAX_SIZE...? > > > > Given that kmalloc internally switches to the page allocator when > > needed, I would think that that's something we can do away with. > > ... maybe check what I said before replying? > > /* > * SLUB directly allocates requests fitting in to an order-1 page > * (PAGE_SIZE*2). Larger requests are passed to the page allocator. > */ > #define KMALLOC_SHIFT_MAX (MAX_PAGE_ORDER + PAGE_SHIFT) > > You caan't allocate larger than that without going to CMA or some other > custom allocator. Ahh, my recollection was out of date - I believe that was 128k at one time? ^ permalink raw reply [flat|nested] 52+ messages in thread
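For reference, slab.h derives the byte limit from that shift, so assuming the usual 4 KiB pages and MAX_PAGE_ORDER of 10 the cap works out to 4 MiB rather than 128k:

    #define KMALLOC_MAX_SIZE    (1UL << KMALLOC_SHIFT_MAX)
    /* = 1UL << (10 + 12) = 4 MiB with 4 KiB pages and MAX_PAGE_ORDER 10 */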
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-03-20 18:32 ` Dan Carpenter 2024-03-20 18:48 ` Vlastimil Babka @ 2024-03-20 19:09 ` Kent Overstreet 1 sibling, 0 replies; 52+ messages in thread From: Kent Overstreet @ 2024-03-20 19:09 UTC (permalink / raw) To: Dan Carpenter Cc: Vlastimil Babka, NeilBrown, Dave Chinner, Matthew Wilcox, Amir Goldstein, paulmck, lsf-pc, linux-mm, linux-fsdevel, Jan Kara On Wed, Mar 20, 2024 at 09:32:52PM +0300, Dan Carpenter wrote: > On Tue, Mar 12, 2024 at 03:46:32PM +0100, Vlastimil Babka wrote: > > But if we change it to effectively mean GFP_NOFAIL (for non-costly > > allocations), there should be a manageable number of places to change to a > > variant that allows failure. > > What does that even mean if GFP_NOFAIL can fail for "costly" allocations? > I thought GFP_NOFAIL couldn't fail at all... > > Unfortunately, it's common that when we can't decide on a sane limit for > something people just say "let the user decide based on how much memory > they have". I have added some integer overflow checks which allow the > user to allocate up to UINT_MAX bytes so I know this code is out > there. We can't just s/GFP_KERNEL/GFP_NOFAIL/. > > From a static analysis perspective it would be nice if the callers > explicitly marked which allocations can fail and which can't. GFP_NOFAIL throws a warning if the allocation size is > 2 pages, which is a separate issue from whether the allocation becomes fallible - someone would have to - oh, I don't know, read the code to answer that question. I think we can ditch the 2 page limit on GFP_NOFAIL, though. ^ permalink raw reply [flat|nested] 52+ messages in thread
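The warning being referred to is, roughly, this check in mm/page_alloc.c's rmqueue():

    /*
     * We most definitely don't want callers attempting to
     * allocate greater than order-1 page units with __GFP_NOFAIL.
     */
    WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));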
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-03-03 22:45 ` NeilBrown ` (3 preceding siblings ...) 2024-03-12 14:46 ` Vlastimil Babka @ 2024-03-21 6:27 ` Dan Carpenter 2024-03-22 1:47 ` NeilBrown 4 siblings, 1 reply; 52+ messages in thread From: Dan Carpenter @ 2024-03-21 6:27 UTC (permalink / raw) To: NeilBrown Cc: Kent Overstreet, Dave Chinner, Matthew Wilcox, Amir Goldstein, paulmck, lsf-pc, linux-mm, linux-fsdevel, Jan Kara On Mon, Mar 04, 2024 at 09:45:48AM +1100, NeilBrown wrote: > I have in mind a more explicit statement of how much waiting is > acceptable. > > GFP_NOFAIL - wait indefinitely Why not call it GFP_SMALL? It wouldn't fail. The size would have to be less than some limit. If the size was too large, that would trigger a WARN_ON_ONCE(). I obviously understand that this duplicates the information in the size parameter but the point is that GFP_SMALL allocations have been reviewed, updated, and don't have error handling code. We'd keep GFP_KERNEL which would keep the existing behavior. (Which is that it can sleep and it can fail). I think that maps to GFP_RETRY but GFP_RETRY is an uglier name. People could still use __GFP_NOFAIL for larger allocations. regards, dan carpenter ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-03-21 6:27 ` Dan Carpenter @ 2024-03-22 1:47 ` NeilBrown 2024-03-22 6:13 ` Dan Carpenter 0 siblings, 1 reply; 52+ messages in thread From: NeilBrown @ 2024-03-22 1:47 UTC (permalink / raw) To: Dan Carpenter Cc: Kent Overstreet, Dave Chinner, Matthew Wilcox, Amir Goldstein, paulmck, lsf-pc, linux-mm, linux-fsdevel, Jan Kara On Thu, 21 Mar 2024, Dan Carpenter wrote: > On Mon, Mar 04, 2024 at 09:45:48AM +1100, NeilBrown wrote: > > I have in mind a more explicit statement of how much waiting is > > acceptable. > > > > GFP_NOFAIL - wait indefinitely > > Why not call it GFP_SMALL? It wouldn't fail. The size would have to be > less than some limit. If the size was too large, that would trigger a > WARN_ON_ONCE(). I would be happy with GFP_SMALL. It would never return NULL but might block indefinitely. It would (as you say) WARN (maybe ONCE) if the size was considered "COSTLY" and would possibly BUG if the size exceeded KMALLOC_MAX_SIZE. > > I obviously understand that this duplicates the information in the size > parameter but the point is that GFP_SMALL allocations have been > reviewed, updated, and don't have error handling code. We are on the same page here. > > We'd keep GFP_KERNEL which would keep the existing behavior. (Which is > that it can sleep and it can fail). I think that maps to GFP_RETRY but > GFP_RETRY is an uglier name. Can it fail though? I know it is allowed to, but does it happen? I think I would prefer that the normal approach for allocations that are (or might be) too large for GFP_SMALL be to *not* risk triggering oom. So I'd rather we didn't have GFP_KERNEL (which I think does have that risk) and instead (maybe) had GFP_LARGE, which sleeps but will return NULL when an OOM condition is looming. So we only get OOM for GFP_SMALL allocations (because not failing simplifies the code), and where it is explicitly requested - like for user-space allocations and probably for file cache allocations. But I think that keeping it simple and only adding GFP_SMALL as we have discussed would be a big step in the right direction and we don't need to complicate it with other ideas. Thanks, NeilBrown > > People could still use __GFP_NOFAIL for larger allocations. > > regards, > dan carpenter > > ^ permalink raw reply [flat|nested] 52+ messages in thread
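A sketch of those semantics as a helper, purely to make the proposal concrete (GFP_SMALL does not exist; this only composes existing flags with the existing costly-order threshold):

    /* hypothetical: "small, may block indefinitely, never returns NULL" */
    static inline void *kmalloc_small(size_t size)
    {
            WARN_ON_ONCE(size > (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER));
            BUG_ON(size > KMALLOC_MAX_SIZE);
            return kmalloc(size, GFP_KERNEL | __GFP_NOFAIL);
    }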
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-03-22 1:47 ` NeilBrown @ 2024-03-22 6:13 ` Dan Carpenter 2024-03-24 22:31 ` NeilBrown 0 siblings, 1 reply; 52+ messages in thread From: Dan Carpenter @ 2024-03-22 6:13 UTC (permalink / raw) To: NeilBrown Cc: Kent Overstreet, Dave Chinner, Matthew Wilcox, Amir Goldstein, paulmck, lsf-pc, linux-mm, linux-fsdevel, Jan Kara, Jiri Pirko On Fri, Mar 22, 2024 at 12:47:42PM +1100, NeilBrown wrote: > On Thu, 21 Mar 2024, Dan Carpenter wrote: > > On Mon, Mar 04, 2024 at 09:45:48AM +1100, NeilBrown wrote: > > > I have in mind a more explicit statement of how much waiting is > > > acceptable. > > > > > > GFP_NOFAIL - wait indefinitely > > > > Why not call it GFP_SMALL? It wouldn't fail. The size would have to be > > less than some limit. If the size was too large, that would trigger a > > WARN_ON_ONCE(). > > I would be happy with GFP_SMALL. It would never return NULL but might > block indefinitely. It would (as you say) WARN (maybe ONCE) if the size > was considered "COSTLY" and would possibly BUG if the size exceeded > KMALLOC_MAX_SIZE. I'd like to keep GFP_SMALL much smaller than KMALLOC_MAX_SIZE. If you're allocating larger than that, you'd still be able to GFP_NOFAIL. I looked quickly and I think over 60% of allocations are just sizeof(*p) and probably 90% are under 4k. > > > > > I obviously understand that this duplicates the information in the size > > parameter but the point is that GFP_SMALL allocations have been > > reviewed, updated, and don't have error handling code. > > We are on the same page here. > > > > > We'd keep GFP_KERNEL which would keep the existing behavior. (Which is > > that it can sleep and it can fail). I think that maps to GFP_RETRY but > > GFP_RETRY is an uglier name. > > Can it fail though? I know it is allowed to, but does it happen? > In some sense, I don't really care about this, I just want the rules clear from a static analysis perspective. Btw, you're discussing making the too small to fail rule official but other times we have discussed getting rid of it altogether. So I think maybe it's better to keep the rules strict but allow the actual implementation to change later. The GFP_SMALL stuff is nice for static analysis because it would warn about anything larger than whatever the small limit is. So that means I have fewer allocations to review for integer overflow bugs. Btw, Jiri Pirko was proposing a macro which would automatically allocate the 60+% of allocations which are sizeof(*p). https://lore.kernel.org/all/20240315132249.2515468-1-jiri@resnulli.us/ I had offered an alternative macro but the idea is the same: #define __ALLOC(p) p __free(kfree) = kzalloc(sizeof(*p), GFP_KERNEL) struct ice_aqc_get_phy_caps_data *__ALLOC(pcaps); Combining no-fail allocations with automatic cleanup changes the way you write code. And then on the success path you have the no_free_ptr() which I haven't used but I think will be so useful for static analysis. I'm so excited about this. regards, dan carpenter ^ permalink raw reply [flat|nested] 52+ messages in thread
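For illustration, how the cleanup-based pattern hands the object out on the success path (struct foo and foo_init() are made up; __free() and no_free_ptr() are the existing linux/cleanup.h helpers):

    struct foo *foo_alloc(void)
    {
            struct foo *f __free(kfree) = kzalloc(sizeof(*f), GFP_KERNEL);

            if (!f)
                    return NULL;
            if (foo_init(f))
                    return NULL;            /* f is freed automatically here */
            return no_free_ptr(f);          /* ownership passes to the caller */
    }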
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-03-22 6:13 ` Dan Carpenter @ 2024-03-24 22:31 ` NeilBrown 2024-03-25 8:43 ` Dan Carpenter 0 siblings, 1 reply; 52+ messages in thread From: NeilBrown @ 2024-03-24 22:31 UTC (permalink / raw) To: Dan Carpenter Cc: Kent Overstreet, Dave Chinner, Matthew Wilcox, Amir Goldstein, paulmck, lsf-pc, linux-mm, linux-fsdevel, Jan Kara, Jiri Pirko On Fri, 22 Mar 2024, Dan Carpenter wrote: > On Fri, Mar 22, 2024 at 12:47:42PM +1100, NeilBrown wrote: > > On Thu, 21 Mar 2024, Dan Carpenter wrote: > > > On Mon, Mar 04, 2024 at 09:45:48AM +1100, NeilBrown wrote: > > > > I have in mind a more explicit statement of how much waiting is > > > > acceptable. > > > > > > > > GFP_NOFAIL - wait indefinitely > > > > > > Why not call it GFP_SMALL? It wouldn't fail. The size would have to be > > > less than some limit. If the size was too large, that would trigger a > > > WARN_ON_ONCE(). > > > > I would be happy with GFP_SMALL. It would never return NULL but might > > block indefinitely. It would (as you say) WARN (maybe ONCE) if the size > > was considered "COSTLY" and would possibly BUG if the size exceeded > > KMALLOC_MAX_SIZE. > > I'd like to keep GFP_SMALL much smaller than KMALLOC_MAX_SIZE. IIf > you're allocating larger than that, you'd still be able to GFP_NOFAIL. > I looked quickly an I think over 60% of allocations are just sizeof(*p) > and probably 90% are under 4k. What do you mean exactly by "keep"?? Do you mean WARN_ON if it is "too big" - certainly agree. Do you mean BUG_ON if it is "too big" - maybe agree. Do you mean return NULL if it is "too big" - definitely disagree. Do you mean build failure if it could be too big - I would LOVE that, but I don't think we can do that with current build tools. Thanks, NeilBrown > > > > > > > > > I obviously understand that this duplicates the information in the size > > > parameter but the point is that GFP_SMALL allocations have been > > > reviewed, updated, and don't have error handling code. > > > > We are on the same page here. > > > > > > > > We'd keep GFP_KERNEL which would keep the existing behavior. (Which is > > > that it can sleep and it can fail). I think that maps to GFP_RETRY but > > > GFP_RETRY is an uglier name. > > > > Can it fail though? I know it is allowed to, but does it happen? > > > > In some sense, I don't really care about this, I just want the rules > clear from a static analysis perspective. Btw, you're discussing making > the too small to fail rule official but other times we have discussed > getting rid of it all together. So I think maybe it's better to keep > the rules strict but allow the actual implentation to change later. > > The GFP_SMALL stuff is nice for static analysis because it would warn > about anything larger than whatever the small limit is. So that means I > have fewer allocations to review for integer overflow bugs. > > Btw, Jiri Pirko, was proposing a macro which would automatically > allocate the 60+% of allocations which are sizeof(*p). > https://lore.kernel.org/all/20240315132249.2515468-1-jiri@resnulli.us/ > I had offered an alternative macro but the idea is the same: > > #define __ALLOC(p) p __free(kfree) = kzalloc(sizeof(*p), GFP_KERNEL) > struct ice_aqc_get_phy_caps_data *__ALLOC(pcaps); > > Combining no fail allocations with automatic cleanup changes the way you > write code. > > And then on the success patch you have the no_free_ptr() which I haven't > used but I think will be so useful for static analysis. I'm so excited > about this. 
> > regards, > dan carpenter > > > ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU 2024-03-24 22:31 ` NeilBrown @ 2024-03-25 8:43 ` Dan Carpenter 0 siblings, 0 replies; 52+ messages in thread From: Dan Carpenter @ 2024-03-25 8:43 UTC (permalink / raw) To: NeilBrown Cc: Kent Overstreet, Dave Chinner, Matthew Wilcox, Amir Goldstein, paulmck, lsf-pc, linux-mm, linux-fsdevel, Jan Kara, Jiri Pirko On Mon, Mar 25, 2024 at 09:31:53AM +1100, NeilBrown wrote: > On Fri, 22 Mar 2024, Dan Carpenter wrote: > > On Fri, Mar 22, 2024 at 12:47:42PM +1100, NeilBrown wrote: > > > On Thu, 21 Mar 2024, Dan Carpenter wrote: > > > > On Mon, Mar 04, 2024 at 09:45:48AM +1100, NeilBrown wrote: > > > > > I have in mind a more explicit statement of how much waiting is > > > > > acceptable. > > > > > > > > > > GFP_NOFAIL - wait indefinitely > > > > > > > > Why not call it GFP_SMALL? It wouldn't fail. The size would have to be > > > > less than some limit. If the size was too large, that would trigger a > > > > WARN_ON_ONCE(). > > > > > > I would be happy with GFP_SMALL. It would never return NULL but might > > > block indefinitely. It would (as you say) WARN (maybe ONCE) if the size > > > was considered "COSTLY" and would possibly BUG if the size exceeded > > > KMALLOC_MAX_SIZE. > > > > I'd like to keep GFP_SMALL much smaller than KMALLOC_MAX_SIZE. IIf > > you're allocating larger than that, you'd still be able to GFP_NOFAIL. > > I looked quickly an I think over 60% of allocations are just sizeof(*p) > > and probably 90% are under 4k. > > What do you mean exactly by "keep"?? Poor word choice... > Do you mean WARN_ON if it is "too big" - certainly agree. > Do you mean BUG_ON if it is "too big" - maybe agree. WARN_ON_ONCE(). But a lot of people have reboot on Oops enabled so triggering a WARN_ON() is still a very serious bug. > Do you mean return NULL if it is "too big" - definitely disagree. Yeah. It's going to be a style violation to check a GFP_SMALL allocation for NULL so it needs to have GFP_NOFAIL behavior. It can still fail if it's larger than KMALLOC_MAX_SIZE. > Do you mean build failure if it could be too big - I would LOVE that, > but I don't think we can do that with current build tools. The limit is going to be a constant so using static analysis to check that is easier than checking if we're less than some variable. regards, dan carpenter ^ permalink raw reply [flat|nested] 52+ messages in thread