linux-mm.kvack.org archive mirror
* [GIT PULL] slab updates for 6.10
@ 2024-05-09 14:25 Vlastimil Babka
  2024-05-13 17:33 ` Linus Torvalds
  2024-05-13 17:38 ` pr-tracker-bot
  0 siblings, 2 replies; 4+ messages in thread
From: Vlastimil Babka @ 2024-05-09 14:25 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Rientjes, Joonsoo Kim, Christoph Lameter, Pekka Enberg,
	Andrew Morton, linux-mm, LKML, patches, Roman Gushchin,
	Hyeonggon Yoo, Chengming Zhou

Hi Linus,

please pull the latest slab updates from:

  git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab.git tags/slab-for-6.10

Sending this early due to upcoming LSF/MM travel and the chance that there will be no rc8.

Thanks,
Vlastimil

======================================

This time it's mostly random cleanups and fixes, with two performance fixes
whose impact might be significant, but limited to systems hitting
particularly bad corner-case scenarios rather than being general performance
improvements.

The memcg hook changes are going through the mm tree due to dependencies.

- Prevent stalls when reading /proc/slabinfo (Jianfeng Wang)

  This fixes a long-standing problem that can happen with workloads whose
  alloc/free patterns result in many partially used slabs (e.g. in the dentry
  cache). Reading /proc/slabinfo traverses the long partial slab list under a
  spinlock with irqs disabled and thus can stall other processes or even
  trigger the lockup detection. The traversal is only done to count free
  objects so that the <active_objs> column can be reported along with
  <num_objs>.

  To avoid affecting fast paths with another shared counter (attempted in the
  past) or complex partial list traversal schemes that allow rescheduling, the
  chosen solution resorts to approximation - when the partial list is over
  10000 slabs long, we will only traverse the first 5000 slabs from the head
  and the tail each, and use the average of those to estimate the whole list.
  Both head and tail are used as the slabs near the head tend to have more
  free objects than the slabs towards the tail.

  It is expected that the approximation will not break existing /proc/slabinfo
  consumers. The <num_objs> field is still accurate and reflects the overall
  kmem_cache footprint. The <active_objs> field was already imprecise due to
  cpu and percpu-partial slabs, so it can't be relied upon to determine exact
  cache usage. The difference between <active_objs> and <num_objs> is mainly
  useful for determining slab fragmentation, and that will remain possible
  even with the approximation in place (a rough sketch of the scheme is below).
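
  A minimal sketch of the idea behind count_partial_free_approx() (simplified
  and untested; the real helper in mm/slub.c may differ in details such as the
  exact extrapolation and clamping):

#define MAX_PARTIAL_TO_SCAN     10000

static unsigned long count_partial_free_approx(struct kmem_cache_node *n)
{
        unsigned long flags, nr_free = 0, scanned = 0;
        struct slab *slab;

        spin_lock_irqsave(&n->list_lock, flags);
        if (n->nr_partial <= MAX_PARTIAL_TO_SCAN) {
                /* short list: count the free objects exactly, as before */
                list_for_each_entry(slab, &n->partial, slab_list)
                        nr_free += slab->objects - slab->inuse;
        } else {
                /* sample the first 5000 slabs from the head... */
                list_for_each_entry(slab, &n->partial, slab_list) {
                        nr_free += slab->objects - slab->inuse;
                        if (++scanned == MAX_PARTIAL_TO_SCAN / 2)
                                break;
                }
                /* ...and the first 5000 slabs from the tail... */
                scanned = 0;
                list_for_each_entry_reverse(slab, &n->partial, slab_list) {
                        nr_free += slab->objects - slab->inuse;
                        if (++scanned == MAX_PARTIAL_TO_SCAN / 2)
                                break;
                }
                /* ...and scale the sampled count up to the whole list */
                nr_free = mult_frac(nr_free, n->nr_partial,
                                    MAX_PARTIAL_TO_SCAN);
        }
        spin_unlock_irqrestore(&n->list_lock, flags);

        return nr_free;
}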

- Prevent allocating many slabs when a NUMA node is full (Chen Jun)

  Currently, on NUMA systems with one node under significantly bigger pressure
  than the others, the fallback strategy may result in each kmalloc_node() that
  can't be satisfied from the preferred node allocating a new slab on a
  fallback node instead of reusing the slabs already on that node's partial
  list.

  This is now fixed and partial lists of fallback nodes are checked even for
  kmalloc_node() allocations. It's still preferred to allocate a new slab on
  the requested node before falling back, but only with a GFP_NOWAIT attempt,
  which fails quickly when the node is under significant memory pressure (see
  the sketch below).
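
  A rough sketch of the resulting allocation order in the kmalloc_node() slow
  path (illustrative only - the helpers get_partial_from_node(),
  allocate_new_slab() and get_partial_from_any_node() are made-up names; the
  real logic lives in ___slab_alloc()/get_partial() in mm/slub.c):

static struct slab *get_slab_for_node(struct kmem_cache *s, gfp_t gfp, int node)
{
        struct slab *slab;

        /* 1) reuse a partial slab already on the requested node */
        slab = get_partial_from_node(s, node);
        if (slab)
                return slab;

        /*
         * 2) try a new slab on the requested node, but without blocking or
         *    reclaiming for it - this fails quickly if the node is under
         *    significant memory pressure
         */
        slab = allocate_new_slab(s, GFP_NOWAIT | __GFP_THISNODE, node);
        if (slab)
                return slab;

        /*
         * 3) fall back to partial slabs already sitting on other nodes,
         *    instead of always allocating brand new slabs there
         */
        slab = get_partial_from_any_node(s);
        if (slab)
                return slab;

        /* 4) finally allocate a new slab with the original gfp flags */
        return allocate_new_slab(s, gfp, node);
}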

- More SLAB removal related cleanups (Xiu Jianfeng, Hyunmin Lee)

- Fix slub_kunit self-test with hardened freelists (Guenter Roeck)

- Mark racy accesses for KCSAN (linke li)

- Misc cleanups (Xiongwei Song, Haifeng Xu, Sangyun Kim)

----------------------------------------------------------------
Chen Jun (1):
      mm/slub: Reduce memory consumption in extreme scenarios

Guenter Roeck (1):
      mm/slub, kunit: Use inverted data to corrupt kmem cache

Haifeng Xu (1):
      slub: Set __GFP_COMP in kmem_cache by default

Hyunmin Lee (2):
      mm/slub: create kmalloc 96 and 192 caches regardless cache size order
      mm/slub: remove the check for NULL kmalloc_caches

Jianfeng Wang (2):
      slub: introduce count_partial_free_approx()
      slub: use count_partial_free_approx() in slab_out_of_memory()

Sangyun Kim (1):
      mm/slub: remove duplicate initialization for early_kmem_cache_node_alloc()

Xiongwei Song (3):
      mm/slub: remove the check of !kmem_cache_has_cpu_partial()
      mm/slub: add slub_get_cpu_partial() helper
      mm/slub: simplify get_partial_node()

Xiu Jianfeng (2):
      mm/slub: remove dummy slabinfo functions
      mm/slub: correct comment in do_slab_free()

linke li (2):
      mm/slub: mark racy accesses on slab->slabs
      mm/slub: mark racy access on slab->freelist

 lib/slub_kunit.c |   2 +-
 mm/slab.h        |   3 --
 mm/slab_common.c |  27 +++++--------
 mm/slub.c        | 118 ++++++++++++++++++++++++++++++++++++++++---------------
 4 files changed, 96 insertions(+), 54 deletions(-)



* Re: [GIT PULL] slab updates for 6.10
  2024-05-09 14:25 [GIT PULL] slab updates for 6.10 Vlastimil Babka
@ 2024-05-13 17:33 ` Linus Torvalds
  2024-05-20 10:18   ` Vlastimil Babka
  2024-05-13 17:38 ` pr-tracker-bot
  1 sibling, 1 reply; 4+ messages in thread
From: Linus Torvalds @ 2024-05-13 17:33 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: David Rientjes, Joonsoo Kim, Christoph Lameter, Pekka Enberg,
	Andrew Morton, linux-mm, LKML, patches, Roman Gushchin,
	Hyeonggon Yoo, Chengming Zhou

On Thu, 9 May 2024 at 07:25, Vlastimil Babka <vbabka@suse.cz> wrote:
>
>   To avoid affecting fast paths with another shared counter (attempted in the
>   past) or complex partial list traversal schemes that allow rescheduling, the
>   chosen solution resorts to approximation - when the partial list is over
>   10000 slabs long, we will only traverse the first 5000 slabs from the head
>   and the tail each, and use the average of those to estimate the whole list.
>   Both head and tail are used as the slabs near the head tend to have more
>   free objects than the slabs towards the tail.

I suspect you could have cut this down by an order of magnitude, and
made the limit be just 1k slabs rather than 10k slabs. Or even
_another_ order of magnitude smaller.

Somebody was being a bit too worried about approximations, methinks -
but I think the real worry goes the other way, where it's practically
so hard to even hit the approximation situation that it gets no
testing at all.

IOW, I suspect it's better to be explicit about approximations, and
have people aware of it, rather than be overly cautious and have it be
a special case that almost never triggers in any normal loads.

But pulled.

              Linus



* Re: [GIT PULL] slab updates for 6.10
  2024-05-09 14:25 [GIT PULL] slab updates for 6.10 Vlastimil Babka
  2024-05-13 17:33 ` Linus Torvalds
@ 2024-05-13 17:38 ` pr-tracker-bot
  1 sibling, 0 replies; 4+ messages in thread
From: pr-tracker-bot @ 2024-05-13 17:38 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Linus Torvalds, David Rientjes, Joonsoo Kim, Christoph Lameter,
	Pekka Enberg, Andrew Morton, linux-mm, LKML, patches,
	Roman Gushchin, Hyeonggon Yoo, Chengming Zhou

The pull request you sent on Thu, 9 May 2024 16:25:05 +0200:

> git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab.git tags/slab-for-6.10

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/cd97950cbcabe662cd8a9fd0a08a247c1ea1fb28

Thank you!

-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html



* Re: [GIT PULL] slab updates for 6.10
  2024-05-13 17:33 ` Linus Torvalds
@ 2024-05-20 10:18   ` Vlastimil Babka
  0 siblings, 0 replies; 4+ messages in thread
From: Vlastimil Babka @ 2024-05-20 10:18 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Rientjes, Joonsoo Kim, Christoph Lameter, Pekka Enberg,
	Andrew Morton, linux-mm, LKML, patches, Roman Gushchin,
	Hyeonggon Yoo, Chengming Zhou

On 5/13/24 7:33 PM, Linus Torvalds wrote:
> On Thu, 9 May 2024 at 07:25, Vlastimil Babka <vbabka@suse.cz> wrote:
>>
>>   To avoid affecting fast paths with another shared counter (attempted in the
>>   past) or complex partial list traversal schemes that allow rescheduling, the
>>   chosen solution resorts to approximation - when the partial list is over
>>   10000 slabs long, we will only traverse the first 5000 slabs from the head
>>   and the tail each, and use the average of those to estimate the whole list.
>>   Both head and tail are used as the slabs near the head tend to have more
>>   free objects than the slabs towards the tail.
> 
> I suspect you could have cut this down by an order of magnitude, and
> made the limit be just 1k slabs rather than 10k slabs. Or even
> _another_ order of magnitude smaller.
> 
> Somebody was being a bit too worried about approximations, methinks -

Indeed, my focus was to make the approximation as accurate as possible when
introducing it, to minimize the chance of breaking somebody and having to
revert it. Then we can try to reduce the limit once the approach itself is
established.

> but I think the real worry goes the other way, where it's practically
> so hard to even hit the approximation situation that it gets no
> testing at all.

Good point.

> IOW, I suspect it's better to be explicit about approximations, and
> have people aware of it, rather than be overly cautious and have it be
> a special case that almost never triggers in any normal loads.

OK, we can reduce the limit sooner rather than later. As for being explicit,
there was an idea that an approximated line in slabinfo would be marked, but
I thought changing the layout would be more likely to break someone parsing
it than an unmarked approximation. We can certainly be more explicit e.g. in
the documentation, though.

> But pulled.

Thanks.

> 
>               Linus



