* [PATCH v4 1/2] slub: introduce count_partial_free_approx() [not found] <20240423043420.13854-1-jianfeng.w.wang@oracle.com> @ 2024-04-23 4:34 ` Jianfeng Wang 0 siblings, 0 replies; 3+ messages in thread From: Jianfeng Wang @ 2024-04-23 4:34 UTC (permalink / raw) To: linux-mm, vbabka; +Cc: cl, rientjes, akpm When reading "/proc/slabinfo", the kernel needs to report the number of free objects for each kmem_cache. The current implementation uses count_partial() to get it by scanning each kmem_cache_node's partial slab list and summing free objects from every partial slab. This process must hold per-kmem_cache_node spinlock and disable IRQ, and may take a long time. Consequently, it can block slab allocations on other CPUs and cause timeouts for network devices, when the partial list is long. In production, even NMI watchdog can be triggered due to this matter: e.g., for "buffer_head", the number of partial slabs was observed to be ~1M in one kmem_cache_node. This problem was also confirmed by others [1-3]. Iterating a partial list to get the exact count of objects can cause soft lockups for a long list with or without the lock (e.g., if preemption is disabled), and may not be very useful: the object count can change after the lock is released. The approach of maintaining free-object counters requires atomic operations on the fast path [3]. So, the fix is to introduce count_partial_free_approx(). This function can be used for getting the free object count in a kmem_cache_node's partial list. It limits the number of slabs to scan and avoids scanning the whole list by giving an approximation for a long list. Suppose the limit is N. If the list's length is not greater than N, output the exact count by traversing the list; if its length is greater than N, output an approximated count by traversing a subset of the list. The proposed method is to scan N/2 slabs from the list's head and N/2 slabs from the tail. For a partial list with ~280K slabs, benchmarks show that it performs better than just counting from the list's head, after slabs get sorted by kmem_cache_shrink(). Default the limit to 10000, as it produces an approximation within 1% of the exact count for both scenarios. Then, use count_partial_free_approx() in get_slabinfo(). Benchmarks: Diff = (exact - approximated) / exact * Normal case (w/o kmem_cache_shrink()): | MAX_TO_SCAN | Diff (count from head)| Diff (count head+tail)| | 1000 | 0.43 % | 1.09 % | | 5000 | 0.06 % | 0.37 % | | 10000 | 0.02 % | 0.16 % | | 20000 | 0.009 % | -0.003 % | * Skewed case (w/ kmem_cache_shrink()): | MAX_TO_SCAN | Diff (count from head)| Diff (count head+tail)| | 1000 | 12.46 % | 6.75 % | | 5000 | 5.38 % | 1.27 % | | 10000 | 4.99 % | 0.22 % | | 20000 | 4.86 % | -0.06 % | [1] https://lore.kernel.org/linux-mm/alpine.DEB.2.21.2003031602460.1537@www.lameter.com/T/ [2] https://lore.kernel.org/lkml/alpine.DEB.2.22.394.2008071258020.55871@www.lameter.com/T/ [3] https://lore.kernel.org/lkml/1e01092b-140d-2bab-aeba-321a74a194ee@linux.com/T/ Signed-off-by: Jianfeng Wang <jianfeng.w.wang@oracle.com> Acked-by: David Rientjes <rientjes@google.com> --- mm/slub.c | 39 ++++++++++++++++++++++++++++++++++++++- 1 file changed, 38 insertions(+), 1 deletion(-) diff --git a/mm/slub.c b/mm/slub.c index 1bb2a93cf7b6..6d8ecad07daf 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -3229,6 +3229,43 @@ static unsigned long count_partial(struct kmem_cache_node *n, #endif /* CONFIG_SLUB_DEBUG || SLAB_SUPPORTS_SYSFS */ #ifdef CONFIG_SLUB_DEBUG +#define MAX_PARTIAL_TO_SCAN 10000 + +static unsigned long count_partial_free_approx(struct kmem_cache_node *n) +{ + unsigned long flags; + unsigned long x = 0; + struct slab *slab; + + spin_lock_irqsave(&n->list_lock, flags); + if (n->nr_partial <= MAX_PARTIAL_TO_SCAN) { + list_for_each_entry(slab, &n->partial, slab_list) + x += slab->objects - slab->inuse; + } else { + /* + * For a long list, approximate the total count of objects in + * it to meet the limit on the number of slabs to scan. + * Scan from both the list's head and tail for better accuracy. + */ + unsigned long scanned = 0; + + list_for_each_entry(slab, &n->partial, slab_list) { + x += slab->objects - slab->inuse; + if (++scanned == MAX_PARTIAL_TO_SCAN / 2) + break; + } + list_for_each_entry_reverse(slab, &n->partial, slab_list) { + x += slab->objects - slab->inuse; + if (++scanned == MAX_PARTIAL_TO_SCAN) + break; + } + x = mult_frac(x, n->nr_partial, scanned); + x = min(x, node_nr_objs(n)); + } + spin_unlock_irqrestore(&n->list_lock, flags); + return x; +} + static noinline void slab_out_of_memory(struct kmem_cache *s, gfp_t gfpflags, int nid) { @@ -7089,7 +7126,7 @@ void get_slabinfo(struct kmem_cache *s, struct slabinfo *sinfo) for_each_kmem_cache_node(s, node, n) { nr_slabs += node_nr_slabs(n); nr_objs += node_nr_objs(n); - nr_free += count_partial(n, count_free); + nr_free += count_partial_free_approx(n); } sinfo->active_objs = nr_objs - nr_free; -- 2.42.1 ^ permalink raw reply [flat|nested] 3+ messages in thread
* [PATCH v4 0/2] slub: introduce count_partial_free_approx() @ 2024-04-23 4:55 Jianfeng Wang 2024-04-23 4:55 ` [PATCH v4 1/2] " Jianfeng Wang 0 siblings, 1 reply; 3+ messages in thread From: Jianfeng Wang @ 2024-04-23 4:55 UTC (permalink / raw) To: linux-mm; +Cc: vbabka, cl, rientjes, akpm, jianfeng.w.wang This patch fixes a known issue in get_slabinfo() which relies on count_partial() to get the exact count of free objects in a kmem_cache_node's partial list. For some slubs, their partial lists can be extremely long. Currently, count_partial() traverses a partial list to get the exact count of objects. This process may take a long time, during which slab allocations are blocked and IRQs are disabled. In production, even NMI watchdog can be triggered due to this matter. The fix is to limit the number of slabs to scan and output an approximated count for a long partial list. The v1 patch counts N slabs from the list's head and then uses it to estimate the total object count in the list. As suggested by Vlastimil, the v2 patch uses an alternative, i.e., counting N/2 from the list's head and tail, produces a more accurate approximation after the partial list is sorted by kmem_cache_shrink(). In the latest version, the implementation is moved to a new function count_partial_free_approx() under CONFIG_SLUB_DEBUG. count_partial() is still used in sysfs. --- Changes since v3 [3] - Place count_partial_free_approx() under CONFIG_SLUB_DEBUG only Changes since v2 [2] - Introduce count_partial_free_approx() and keep count_partial() - Use count_partial_free_approx() in get_slabinfo() and slab_out_of_memory() Changes since v1 [1] - Update the approximation method by counting from the list's head and tail - Cap the approximation by the total object count - Update the commit message to add benchmark results and explain the choice [1] https://lore.kernel.org/linux-mm/20240411164023.99368-1-jianfeng.w.wang@oracle.com/ [2] https://lore.kernel.org/linux-mm/20240417185938.5237-2-jianfeng.w.wang@oracle.com/ [3] https://lore.kernel.org/linux-mm/20240419175611.47413-1-jianfeng.w.wang@oracle.com/ Thanks, --Jianfeng Jianfeng Wang (2): slub: introduce count_partial_free_approx() slub: use count_partial_free_approx() in slab_out_of_memory() mm/slub.c | 41 +++++++++++++++++++++++++++++++++++++++-- 1 file changed, 39 insertions(+), 2 deletions(-) -- 2.42.1 ^ permalink raw reply [flat|nested] 3+ messages in thread
* [PATCH v4 1/2] slub: introduce count_partial_free_approx() 2024-04-23 4:55 [PATCH v4 0/2] " Jianfeng Wang @ 2024-04-23 4:55 ` Jianfeng Wang 2024-04-23 16:09 ` Christoph Lameter (Ampere) 0 siblings, 1 reply; 3+ messages in thread From: Jianfeng Wang @ 2024-04-23 4:55 UTC (permalink / raw) To: linux-mm; +Cc: vbabka, cl, rientjes, akpm, jianfeng.w.wang When reading "/proc/slabinfo", the kernel needs to report the number of free objects for each kmem_cache. The current implementation uses count_partial() to get it by scanning each kmem_cache_node's partial slab list and summing free objects from every partial slab. This process must hold per-kmem_cache_node spinlock and disable IRQ, and may take a long time. Consequently, it can block slab allocations on other CPUs and cause timeouts for network devices, when the partial list is long. In production, even NMI watchdog can be triggered due to this matter: e.g., for "buffer_head", the number of partial slabs was observed to be ~1M in one kmem_cache_node. This problem was also confirmed by others [1-3]. Iterating a partial list to get the exact count of objects can cause soft lockups for a long list with or without the lock (e.g., if preemption is disabled), and may not be very useful: the object count can change after the lock is released. The approach of maintaining free-object counters requires atomic operations on the fast path [3]. So, the fix is to introduce count_partial_free_approx(). This function can be used for getting the free object count in a kmem_cache_node's partial list. It limits the number of slabs to scan and avoids scanning the whole list by giving an approximation for a long list. Suppose the limit is N. If the list's length is not greater than N, output the exact count by traversing the list; if its length is greater than N, output an approximated count by traversing a subset of the list. The proposed method is to scan N/2 slabs from the list's head and N/2 slabs from the tail. For a partial list with ~280K slabs, benchmarks show that it performs better than just counting from the list's head, after slabs get sorted by kmem_cache_shrink(). Default the limit to 10000, as it produces an approximation within 1% of the exact count for both scenarios. Then, use count_partial_free_approx() in get_slabinfo(). Benchmarks: Diff = (exact - approximated) / exact * Normal case (w/o kmem_cache_shrink()): | MAX_TO_SCAN | Diff (count from head)| Diff (count head+tail)| | 1000 | 0.43 % | 1.09 % | | 5000 | 0.06 % | 0.37 % | | 10000 | 0.02 % | 0.16 % | | 20000 | 0.009 % | -0.003 % | * Skewed case (w/ kmem_cache_shrink()): | MAX_TO_SCAN | Diff (count from head)| Diff (count head+tail)| | 1000 | 12.46 % | 6.75 % | | 5000 | 5.38 % | 1.27 % | | 10000 | 4.99 % | 0.22 % | | 20000 | 4.86 % | -0.06 % | [1] https://lore.kernel.org/linux-mm/alpine.DEB.2.21.2003031602460.1537@www.lameter.com/T/ [2] https://lore.kernel.org/lkml/alpine.DEB.2.22.394.2008071258020.55871@www.lameter.com/T/ [3] https://lore.kernel.org/lkml/1e01092b-140d-2bab-aeba-321a74a194ee@linux.com/T/ Signed-off-by: Jianfeng Wang <jianfeng.w.wang@oracle.com> Acked-by: David Rientjes <rientjes@google.com> --- mm/slub.c | 39 ++++++++++++++++++++++++++++++++++++++- 1 file changed, 38 insertions(+), 1 deletion(-) diff --git a/mm/slub.c b/mm/slub.c index 1bb2a93cf7b6..6d8ecad07daf 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -3229,6 +3229,43 @@ static unsigned long count_partial(struct kmem_cache_node *n, #endif /* CONFIG_SLUB_DEBUG || SLAB_SUPPORTS_SYSFS */ #ifdef CONFIG_SLUB_DEBUG +#define MAX_PARTIAL_TO_SCAN 10000 + +static unsigned long count_partial_free_approx(struct kmem_cache_node *n) +{ + unsigned long flags; + unsigned long x = 0; + struct slab *slab; + + spin_lock_irqsave(&n->list_lock, flags); + if (n->nr_partial <= MAX_PARTIAL_TO_SCAN) { + list_for_each_entry(slab, &n->partial, slab_list) + x += slab->objects - slab->inuse; + } else { + /* + * For a long list, approximate the total count of objects in + * it to meet the limit on the number of slabs to scan. + * Scan from both the list's head and tail for better accuracy. + */ + unsigned long scanned = 0; + + list_for_each_entry(slab, &n->partial, slab_list) { + x += slab->objects - slab->inuse; + if (++scanned == MAX_PARTIAL_TO_SCAN / 2) + break; + } + list_for_each_entry_reverse(slab, &n->partial, slab_list) { + x += slab->objects - slab->inuse; + if (++scanned == MAX_PARTIAL_TO_SCAN) + break; + } + x = mult_frac(x, n->nr_partial, scanned); + x = min(x, node_nr_objs(n)); + } + spin_unlock_irqrestore(&n->list_lock, flags); + return x; +} + static noinline void slab_out_of_memory(struct kmem_cache *s, gfp_t gfpflags, int nid) { @@ -7089,7 +7126,7 @@ void get_slabinfo(struct kmem_cache *s, struct slabinfo *sinfo) for_each_kmem_cache_node(s, node, n) { nr_slabs += node_nr_slabs(n); nr_objs += node_nr_objs(n); - nr_free += count_partial(n, count_free); + nr_free += count_partial_free_approx(n); } sinfo->active_objs = nr_objs - nr_free; -- 2.42.1 ^ permalink raw reply [flat|nested] 3+ messages in thread
* Re: [PATCH v4 1/2] slub: introduce count_partial_free_approx() 2024-04-23 4:55 ` [PATCH v4 1/2] " Jianfeng Wang @ 2024-04-23 16:09 ` Christoph Lameter (Ampere) 0 siblings, 0 replies; 3+ messages in thread From: Christoph Lameter (Ampere) @ 2024-04-23 16:09 UTC (permalink / raw) To: Jianfeng Wang; +Cc: linux-mm, vbabka, rientjes, akpm Acked-by: Christoph Lameter <cl@linux.com> ^ permalink raw reply [flat|nested] 3+ messages in thread
end of thread, other threads:[~2024-04-23 16:09 UTC | newest]
Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <20240423043420.13854-1-jianfeng.w.wang@oracle.com>
2024-04-23 4:34 ` [PATCH v4 1/2] slub: introduce count_partial_free_approx() Jianfeng Wang
2024-04-23 4:55 [PATCH v4 0/2] " Jianfeng Wang
2024-04-23 4:55 ` [PATCH v4 1/2] " Jianfeng Wang
2024-04-23 16:09 ` Christoph Lameter (Ampere)
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox