* [PATCH] mm: slub: Panic if the object corruption is checked.
[not found] <CGME20250120083041epcas2p30d61ee801c1db5a7ebdf26fdc642883f@epcas2p3.samsung.com>
@ 2025-01-20 8:28 ` Hyesoo Yu
2025-01-20 15:36 ` Matthew Wilcox
` (2 more replies)
0 siblings, 3 replies; 10+ messages in thread
From: Hyesoo Yu @ 2025-01-20 8:28 UTC (permalink / raw)
Cc: janghyuck.kim, Hyesoo Yu, Andrew Morton, Jonathan Corbet,
Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
Vlastimil Babka, Roman Gushchin, Hyeonggon Yoo, linux-mm,
linux-doc, linux-kernel
If a slab object is corrupted or an error occurs in its internal
value, continuing after restoration may cause other side effects.
At this point, it is difficult to debug because the problem occurred
in the past. A flag has been added that can cause a panic when there
is a problem with the object.
Signed-off-by: Hyesoo Yu <hyesoo.yu@samsung.com>
Change-Id: I4e7e5e0ec3421a7f6c84d591db052f79d3775493
---
Documentation/mm/slub.rst | 2 ++
include/linux/slab.h | 4 ++++
mm/slub.c | 14 ++++++++++++++
3 files changed, 20 insertions(+)
diff --git a/Documentation/mm/slub.rst b/Documentation/mm/slub.rst
index 84ca1dc94e5e..ce58525db93d 100644
--- a/Documentation/mm/slub.rst
+++ b/Documentation/mm/slub.rst
@@ -53,6 +53,7 @@ Possible debug options are::
U User tracking (free and alloc)
T Trace (please only use on single slabs)
A Enable failslab filter mark for the cache
+ C Panic if object corruption is checked.
O Switch debugging off for caches that would have
caused higher minimum slab orders
- Switch all debugging off (useful if the kernel is
@@ -113,6 +114,7 @@ options from the ``slab_debug`` parameter translate to the following files::
U store_user
T trace
A failslab
+ C corruption_panic
failslab file is writable, so writing 1 or 0 will enable or disable
the option at runtime. Write returns -EINVAL if cache is an alias.
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 10a971c2bde3..4391c30564d6 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -31,6 +31,7 @@ enum _slab_flag_bits {
_SLAB_CACHE_DMA32,
_SLAB_STORE_USER,
_SLAB_PANIC,
+ _SLAB_CORRUPTION_PANIC,
_SLAB_TYPESAFE_BY_RCU,
_SLAB_TRACE,
#ifdef CONFIG_DEBUG_OBJECTS
@@ -97,6 +98,9 @@ enum _slab_flag_bits {
#define SLAB_STORE_USER __SLAB_FLAG_BIT(_SLAB_STORE_USER)
/* Panic if kmem_cache_create() fails */
#define SLAB_PANIC __SLAB_FLAG_BIT(_SLAB_PANIC)
+/* Panic if object corruption is checked */
+#define SLAB_CORRUPTION_PANIC __SLAB_FLAG_BIT(_SLAB_CORRUPTION_PANIC)
+
/**
* define SLAB_TYPESAFE_BY_RCU - **WARNING** READ THIS!
*
diff --git a/mm/slub.c b/mm/slub.c
index 48cefc969480..36a8dabf1349 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1306,6 +1306,8 @@ slab_pad_check(struct kmem_cache *s, struct slab *slab)
fault, end - 1, fault - start);
print_section(KERN_ERR, "Padding ", pad, remainder);
+ BUG_ON(s->flags & SLAB_CORRUPTION_PANIC);
+
restore_bytes(s, "slab padding", POISON_INUSE, fault, end);
}
@@ -1389,6 +1391,8 @@ static int check_object(struct kmem_cache *s, struct slab *slab,
if (!ret && !slab_in_kunit_test()) {
print_trailer(s, slab, object);
add_taint(TAINT_BAD_PAGE, LOCKDEP_NOW_UNRELIABLE);
+
+ BUG_ON(s->flags & SLAB_CORRUPTION_PANIC);
}
return ret;
@@ -1689,6 +1693,9 @@ parse_slub_debug_flags(char *str, slab_flags_t *flags, char **slabs, bool init)
case 'a':
*flags |= SLAB_FAILSLAB;
break;
+ case 'c':
+ *flags |= SLAB_CORRUPTION_PANIC;
+ break;
case 'o':
/*
* Avoid enabling debugging on caches if its minimum
@@ -6874,6 +6881,12 @@ static ssize_t store_user_show(struct kmem_cache *s, char *buf)
SLAB_ATTR_RO(store_user);
+static ssize_t corruption_panic_show(struct kmem_cache *s, char *buf)
+{
+ return sysfs_emit(buf, "%d\n", !!(s->flags & SLAB_CORRUPTION_PANIC));
+}
+SLAB_ATTR_RO(corruption_panic);
+
static ssize_t validate_show(struct kmem_cache *s, char *buf)
{
return 0;
@@ -7092,6 +7105,7 @@ static struct attribute *slab_attrs[] = {
&red_zone_attr.attr,
&poison_attr.attr,
&store_user_attr.attr,
+ &corruption_panic_attr.attr,
&validate_attr.attr,
#endif
#ifdef CONFIG_ZONE_DMA
--
2.48.0
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] mm: slub: Panic if the object corruption is checked.
2025-01-20 8:28 ` [PATCH] mm: slub: Panic if the object corruption is checked Hyesoo Yu
@ 2025-01-20 15:36 ` Matthew Wilcox
2025-01-21 0:40 ` Hyesoo Yu
2025-01-20 15:41 ` Hyeonggon Yoo
2025-01-21 12:46 ` kernel test robot
2 siblings, 1 reply; 10+ messages in thread
From: Matthew Wilcox @ 2025-01-20 15:36 UTC (permalink / raw)
To: Hyesoo Yu
Cc: janghyuck.kim, Andrew Morton, Jonathan Corbet, Christoph Lameter,
Pekka Enberg, David Rientjes, Joonsoo Kim, Vlastimil Babka,
Roman Gushchin, Hyeonggon Yoo, linux-mm, linux-doc, linux-kernel
On Mon, Jan 20, 2025 at 05:28:21PM +0900, Hyesoo Yu wrote:
> If a slab object is corrupted or an error occurs in its internal
> value, continuing after restoration may cause other side effects.
> At this point, it is difficult to debug because the problem occurred
> in the past. A flag has been added that can cause a panic when there
> is a problem with the object.
>
> Signed-off-by: Hyesoo Yu <hyesoo.yu@samsung.com>
> Change-Id: I4e7e5e0ec3421a7f6c84d591db052f79d3775493
Linux does not use Change IDs. Please omit these from future patches.
Panicing is a very unfriendly approach. I think a better approach would
be to freeze the slab where corruption is detected. That is, no future
objects are allocated from that slab, and attempts to free objects from
that slab become no-ops. I don't think that should be hard to implement.
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] mm: slub: Panic if the object corruption is checked.
2025-01-20 8:28 ` [PATCH] mm: slub: Panic if the object corruption is checked Hyesoo Yu
2025-01-20 15:36 ` Matthew Wilcox
@ 2025-01-20 15:41 ` Hyeonggon Yoo
2025-01-21 0:54 ` Hyesoo Yu
2025-01-21 12:46 ` kernel test robot
2 siblings, 1 reply; 10+ messages in thread
From: Hyeonggon Yoo @ 2025-01-20 15:41 UTC (permalink / raw)
To: Hyesoo Yu
Cc: janghyuck.kim, Andrew Morton, Jonathan Corbet, Christoph Lameter,
Pekka Enberg, David Rientjes, Joonsoo Kim, Vlastimil Babka,
Roman Gushchin, linux-mm, linux-doc, linux-kernel
On Mon, Jan 20, 2025 at 5:30 PM Hyesoo Yu <hyesoo.yu@samsung.com> wrote:
>
> If a slab object is corrupted or an error occurs in its internal
> value, continuing after restoration may cause other side effects.
> At this point, it is difficult to debug because the problem occurred
> in the past. A flag has been added that can cause a panic when there
> is a problem with the object.
Hi Hyesoo,
I'm concerned about this because it goes against the effort to avoid
introducing new BUG() calls [1].
And I think it would be more appropriate to use existing panic_on_warn
functionality [2] which causes
a panic on WARN(), rather than introducing a SLUB-specific knob to do
the same thing.
However SLUB does not call WARN() and uses pr_err() instead when
reporting an error.
Vlastimil and I talked about changing it to use WARN() a while ago
[3], but neither of us
have done that yet.
Probably you may want to look at it, as it also aligns with your purpose?
FYI, if you would like to work on it, please make sure that it WARN()
is suppressed during kunit test.
[1] https://docs.kernel.org/process/deprecated.html#bug-and-bug-on
[2] https://www.kernel.org/doc/html/v6.9/admin-guide/sysctl/kernel.html#panic-on-warn
[3] https://lore.kernel.org/linux-mm/d4219cd9-32d3-4697-93b9-6a44bf77d50c@suse.cz
Best,
Hyeonggon
> Signed-off-by: Hyesoo Yu <hyesoo.yu@samsung.com>
> Change-Id: I4e7e5e0ec3421a7f6c84d591db052f79d3775493
> ---
> Documentation/mm/slub.rst | 2 ++
> include/linux/slab.h | 4 ++++
> mm/slub.c | 14 ++++++++++++++
> 3 files changed, 20 insertions(+)
>
> diff --git a/Documentation/mm/slub.rst b/Documentation/mm/slub.rst
> index 84ca1dc94e5e..ce58525db93d 100644
> --- a/Documentation/mm/slub.rst
> +++ b/Documentation/mm/slub.rst
> @@ -53,6 +53,7 @@ Possible debug options are::
> U User tracking (free and alloc)
> T Trace (please only use on single slabs)
> A Enable failslab filter mark for the cache
> + C Panic if object corruption is checked.
> O Switch debugging off for caches that would have
> caused higher minimum slab orders
> - Switch all debugging off (useful if the kernel is
> @@ -113,6 +114,7 @@ options from the ``slab_debug`` parameter translate to the following files::
> U store_user
> T trace
> A failslab
> + C corruption_panic
>
> failslab file is writable, so writing 1 or 0 will enable or disable
> the option at runtime. Write returns -EINVAL if cache is an alias.
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index 10a971c2bde3..4391c30564d6 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -31,6 +31,7 @@ enum _slab_flag_bits {
> _SLAB_CACHE_DMA32,
> _SLAB_STORE_USER,
> _SLAB_PANIC,
> + _SLAB_CORRUPTION_PANIC,
> _SLAB_TYPESAFE_BY_RCU,
> _SLAB_TRACE,
> #ifdef CONFIG_DEBUG_OBJECTS
> @@ -97,6 +98,9 @@ enum _slab_flag_bits {
> #define SLAB_STORE_USER __SLAB_FLAG_BIT(_SLAB_STORE_USER)
> /* Panic if kmem_cache_create() fails */
> #define SLAB_PANIC __SLAB_FLAG_BIT(_SLAB_PANIC)
> +/* Panic if object corruption is checked */
> +#define SLAB_CORRUPTION_PANIC __SLAB_FLAG_BIT(_SLAB_CORRUPTION_PANIC)
> +
> /**
> * define SLAB_TYPESAFE_BY_RCU - **WARNING** READ THIS!
> *
> diff --git a/mm/slub.c b/mm/slub.c
> index 48cefc969480..36a8dabf1349 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -1306,6 +1306,8 @@ slab_pad_check(struct kmem_cache *s, struct slab *slab)
> fault, end - 1, fault - start);
> print_section(KERN_ERR, "Padding ", pad, remainder);
>
> + BUG_ON(s->flags & SLAB_CORRUPTION_PANIC);
> +
> restore_bytes(s, "slab padding", POISON_INUSE, fault, end);
> }
>
> @@ -1389,6 +1391,8 @@ static int check_object(struct kmem_cache *s, struct slab *slab,
> if (!ret && !slab_in_kunit_test()) {
> print_trailer(s, slab, object);
> add_taint(TAINT_BAD_PAGE, LOCKDEP_NOW_UNRELIABLE);
> +
> + BUG_ON(s->flags & SLAB_CORRUPTION_PANIC);
> }
>
> return ret;
> @@ -1689,6 +1693,9 @@ parse_slub_debug_flags(char *str, slab_flags_t *flags, char **slabs, bool init)
> case 'a':
> *flags |= SLAB_FAILSLAB;
> break;
> + case 'c':
> + *flags |= SLAB_CORRUPTION_PANIC;
> + break;
> case 'o':
> /*
> * Avoid enabling debugging on caches if its minimum
> @@ -6874,6 +6881,12 @@ static ssize_t store_user_show(struct kmem_cache *s, char *buf)
>
> SLAB_ATTR_RO(store_user);
>
> +static ssize_t corruption_panic_show(struct kmem_cache *s, char *buf)
> +{
> + return sysfs_emit(buf, "%d\n", !!(s->flags & SLAB_CORRUPTION_PANIC));
> +}
> +SLAB_ATTR_RO(corruption_panic);
> +
> static ssize_t validate_show(struct kmem_cache *s, char *buf)
> {
> return 0;
> @@ -7092,6 +7105,7 @@ static struct attribute *slab_attrs[] = {
> &red_zone_attr.attr,
> &poison_attr.attr,
> &store_user_attr.attr,
> + &corruption_panic_attr.attr,
> &validate_attr.attr,
> #endif
> #ifdef CONFIG_ZONE_DMA
> --
> 2.48.0
>
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] mm: slub: Panic if the object corruption is checked.
2025-01-20 15:36 ` Matthew Wilcox
@ 2025-01-21 0:40 ` Hyesoo Yu
2025-01-21 10:27 ` Vlastimil Babka
0 siblings, 1 reply; 10+ messages in thread
From: Hyesoo Yu @ 2025-01-21 0:40 UTC (permalink / raw)
To: Matthew Wilcox
Cc: janghyuck.kim, Andrew Morton, Jonathan Corbet, Christoph Lameter,
Pekka Enberg, David Rientjes, Joonsoo Kim, Vlastimil Babka,
Roman Gushchin, Hyeonggon Yoo, linux-mm, linux-doc, linux-kernel
[-- Attachment #1: Type: text/plain, Size: 1608 bytes --]
On Mon, Jan 20, 2025 at 03:36:08PM +0000, Matthew Wilcox wrote:
> On Mon, Jan 20, 2025 at 05:28:21PM +0900, Hyesoo Yu wrote:
> > If a slab object is corrupted or an error occurs in its internal
> > value, continuing after restoration may cause other side effects.
> > At this point, it is difficult to debug because the problem occurred
> > in the past. A flag has been added that can cause a panic when there
> > is a problem with the object.
> >
> > Signed-off-by: Hyesoo Yu <hyesoo.yu@samsung.com>
> > Change-Id: I4e7e5e0ec3421a7f6c84d591db052f79d3775493
>
> Linux does not use Change IDs. Please omit these from future patches.
>
> Panicing is a very unfriendly approach. I think a better approach would
> be to freeze the slab where corruption is detected. That is, no future
> objects are allocated from that slab, and attempts to free objects from
> that slab become no-ops. I don't think that should be hard to implement.
>
Thanks you for your responce. That is my mistake. I will remove the change ID.
I agree that freezing is better than recovery or panic for the system's stability.
However what I want from the patch is not just to make the system run stably.
I need to immediately trigger a panic to investigate the slub.
I would like to analyze the corrupted data at that moment to check issues
like cache problem, user errors, system clock frequency and similar problems,
not just passing by without any issues.
However I agree that panic is not a friendly approach.
I will modify it to notify the problem using warn() and then use
panic_on_warn to trigger panic.
Thanks,
Regards.
[-- Attachment #2: Type: text/plain, Size: 0 bytes --]
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] mm: slub: Panic if the object corruption is checked.
2025-01-20 15:41 ` Hyeonggon Yoo
@ 2025-01-21 0:54 ` Hyesoo Yu
2025-01-21 1:48 ` Hyeonggon Yoo
0 siblings, 1 reply; 10+ messages in thread
From: Hyesoo Yu @ 2025-01-21 0:54 UTC (permalink / raw)
To: Hyeonggon Yoo
Cc: janghyuck.kim, Andrew Morton, Jonathan Corbet, Christoph Lameter,
Pekka Enberg, David Rientjes, Joonsoo Kim, Vlastimil Babka,
Roman Gushchin, linux-mm, linux-doc, linux-kernel
[-- Attachment #1: Type: text/plain, Size: 5959 bytes --]
On Tue, Jan 21, 2025 at 12:41:01AM +0900, Hyeonggon Yoo wrote:
> On Mon, Jan 20, 2025 at 5:30 PM Hyesoo Yu <hyesoo.yu@samsung.com> wrote:
> >
> > If a slab object is corrupted or an error occurs in its internal
> > value, continuing after restoration may cause other side effects.
> > At this point, it is difficult to debug because the problem occurred
> > in the past. A flag has been added that can cause a panic when there
> > is a problem with the object.
>
> Hi Hyesoo,
>
> I'm concerned about this because it goes against the effort to avoid
> introducing new BUG() calls [1].
>
> And I think it would be more appropriate to use existing panic_on_warn
> functionality [2] which causes
> a panic on WARN(), rather than introducing a SLUB-specific knob to do
> the same thing.
>
> However SLUB does not call WARN() and uses pr_err() instead when
> reporting an error.
> Vlastimil and I talked about changing it to use WARN() a while ago
> [3], but neither of us
> have done that yet.
>
> Probably you may want to look at it, as it also aligns with your purpose?
> FYI, if you would like to work on it, please make sure that it WARN()
> is suppressed during kunit test.
>
> [1] https://docs.kernel.org/process/deprecated.html#bug-and-bug-on
> [2] https://www.kernel.org/doc/html/v6.9/admin-guide/sysctl/kernel.html#panic-on-warn
> [3] https://lore.kernel.org/linux-mm/d4219cd9-32d3-4697-93b9-6a44bf77d50c@suse.cz
>
> Best,
> Hyeonggon
Thanks for response.
Using warn() instead of panic, is a great idea.
Thanks for pointing out what I missed.
The next patch will be changed to use warn().
Thanks.
>
> > Signed-off-by: Hyesoo Yu <hyesoo.yu@samsung.com>
> > Change-Id: I4e7e5e0ec3421a7f6c84d591db052f79d3775493
> > ---
> > Documentation/mm/slub.rst | 2 ++
> > include/linux/slab.h | 4 ++++
> > mm/slub.c | 14 ++++++++++++++
> > 3 files changed, 20 insertions(+)
> >
> > diff --git a/Documentation/mm/slub.rst b/Documentation/mm/slub.rst
> > index 84ca1dc94e5e..ce58525db93d 100644
> > --- a/Documentation/mm/slub.rst
> > +++ b/Documentation/mm/slub.rst
> > @@ -53,6 +53,7 @@ Possible debug options are::
> > U User tracking (free and alloc)
> > T Trace (please only use on single slabs)
> > A Enable failslab filter mark for the cache
> > + C Panic if object corruption is checked.
> > O Switch debugging off for caches that would have
> > caused higher minimum slab orders
> > - Switch all debugging off (useful if the kernel is
> > @@ -113,6 +114,7 @@ options from the ``slab_debug`` parameter translate to the following files::
> > U store_user
> > T trace
> > A failslab
> > + C corruption_panic
> >
> > failslab file is writable, so writing 1 or 0 will enable or disable
> > the option at runtime. Write returns -EINVAL if cache is an alias.
> > diff --git a/include/linux/slab.h b/include/linux/slab.h
> > index 10a971c2bde3..4391c30564d6 100644
> > --- a/include/linux/slab.h
> > +++ b/include/linux/slab.h
> > @@ -31,6 +31,7 @@ enum _slab_flag_bits {
> > _SLAB_CACHE_DMA32,
> > _SLAB_STORE_USER,
> > _SLAB_PANIC,
> > + _SLAB_CORRUPTION_PANIC,
> > _SLAB_TYPESAFE_BY_RCU,
> > _SLAB_TRACE,
> > #ifdef CONFIG_DEBUG_OBJECTS
> > @@ -97,6 +98,9 @@ enum _slab_flag_bits {
> > #define SLAB_STORE_USER __SLAB_FLAG_BIT(_SLAB_STORE_USER)
> > /* Panic if kmem_cache_create() fails */
> > #define SLAB_PANIC __SLAB_FLAG_BIT(_SLAB_PANIC)
> > +/* Panic if object corruption is checked */
> > +#define SLAB_CORRUPTION_PANIC __SLAB_FLAG_BIT(_SLAB_CORRUPTION_PANIC)
> > +
> > /**
> > * define SLAB_TYPESAFE_BY_RCU - **WARNING** READ THIS!
> > *
> > diff --git a/mm/slub.c b/mm/slub.c
> > index 48cefc969480..36a8dabf1349 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -1306,6 +1306,8 @@ slab_pad_check(struct kmem_cache *s, struct slab *slab)
> > fault, end - 1, fault - start);
> > print_section(KERN_ERR, "Padding ", pad, remainder);
> >
> > + BUG_ON(s->flags & SLAB_CORRUPTION_PANIC);
> > +
> > restore_bytes(s, "slab padding", POISON_INUSE, fault, end);
> > }
> >
> > @@ -1389,6 +1391,8 @@ static int check_object(struct kmem_cache *s, struct slab *slab,
> > if (!ret && !slab_in_kunit_test()) {
> > print_trailer(s, slab, object);
> > add_taint(TAINT_BAD_PAGE, LOCKDEP_NOW_UNRELIABLE);
> > +
> > + BUG_ON(s->flags & SLAB_CORRUPTION_PANIC);
> > }
> >
> > return ret;
> > @@ -1689,6 +1693,9 @@ parse_slub_debug_flags(char *str, slab_flags_t *flags, char **slabs, bool init)
> > case 'a':
> > *flags |= SLAB_FAILSLAB;
> > break;
> > + case 'c':
> > + *flags |= SLAB_CORRUPTION_PANIC;
> > + break;
> > case 'o':
> > /*
> > * Avoid enabling debugging on caches if its minimum
> > @@ -6874,6 +6881,12 @@ static ssize_t store_user_show(struct kmem_cache *s, char *buf)
> >
> > SLAB_ATTR_RO(store_user);
> >
> > +static ssize_t corruption_panic_show(struct kmem_cache *s, char *buf)
> > +{
> > + return sysfs_emit(buf, "%d\n", !!(s->flags & SLAB_CORRUPTION_PANIC));
> > +}
> > +SLAB_ATTR_RO(corruption_panic);
> > +
> > static ssize_t validate_show(struct kmem_cache *s, char *buf)
> > {
> > return 0;
> > @@ -7092,6 +7105,7 @@ static struct attribute *slab_attrs[] = {
> > &red_zone_attr.attr,
> > &poison_attr.attr,
> > &store_user_attr.attr,
> > + &corruption_panic_attr.attr,
> > &validate_attr.attr,
> > #endif
> > #ifdef CONFIG_ZONE_DMA
> > --
> > 2.48.0
> >
>
[-- Attachment #2: Type: text/plain, Size: 0 bytes --]
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] mm: slub: Panic if the object corruption is checked.
2025-01-21 0:54 ` Hyesoo Yu
@ 2025-01-21 1:48 ` Hyeonggon Yoo
2025-01-21 2:32 ` Hyesoo Yu
0 siblings, 1 reply; 10+ messages in thread
From: Hyeonggon Yoo @ 2025-01-21 1:48 UTC (permalink / raw)
To: Hyesoo Yu, Hyeonggon Yoo
Cc: kernel_team, janghyuck.kim, Andrew Morton, Jonathan Corbet,
Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
Vlastimil Babka, Roman Gushchin, linux-mm, linux-doc,
linux-kernel
On 1/21/2025 9:54 AM, Hyesoo Yu wrote:
> On Tue, Jan 21, 2025 at 12:41:01AM +0900, Hyeonggon Yoo wrote:
>> On Mon, Jan 20, 2025 at 5:30 PM Hyesoo Yu <hyesoo.yu@samsung.com> wrote:
>>>
>>> If a slab object is corrupted or an error occurs in its internal
>>> value, continuing after restoration may cause other side effects.
>>> At this point, it is difficult to debug because the problem occurred
>>> in the past. A flag has been added that can cause a panic when there
>>> is a problem with the object.
>>
>> Hi Hyesoo,
>>
>> I'm concerned about this because it goes against the effort to avoid
>> introducing new BUG() calls [1].
>>
>> And I think it would be more appropriate to use existing panic_on_warn
>> functionality [2] which causes
>> a panic on WARN(), rather than introducing a SLUB-specific knob to do
>> the same thing.
>>
>> However SLUB does not call WARN() and uses pr_err() instead when
>> reporting an error.
>> Vlastimil and I talked about changing it to use WARN() a while ago
>> [3], but neither of us
>> have done that yet.
>>
>> Probably you may want to look at it, as it also aligns with your purpose?
>> FYI, if you would like to work on it, please make sure that it WARN()
>> is suppressed during kunit test.
>>
>> [1] https://docs.kernel.org/process/deprecated.html#bug-and-bug-on
>> [2] https://www.kernel.org/doc/html/v6.9/admin-guide/sysctl/kernel.html#panic-on-warn
>> [3] https://lore.kernel.org/linux-mm/d4219cd9-32d3-4697-93b9-6a44bf77d50c@suse.cz
>>
>> Best,
>> Hyeonggon
>
> Thanks for response.
>
> Using warn() instead of panic, is a great idea.
> Thanks for pointing out what I missed.
Just for clarification, I think changing the common error reporting
logic (like, slab_bug()) to use WARN() will be preferable to inserting
new WARN()s at random points, which is what this patch does now.
Best,
Hyeonggon
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] mm: slub: Panic if the object corruption is checked.
2025-01-21 1:48 ` Hyeonggon Yoo
@ 2025-01-21 2:32 ` Hyesoo Yu
2025-01-21 10:28 ` Vlastimil Babka
0 siblings, 1 reply; 10+ messages in thread
From: Hyesoo Yu @ 2025-01-21 2:32 UTC (permalink / raw)
To: Hyeonggon Yoo
Cc: Hyeonggon Yoo, kernel_team, janghyuck.kim, Andrew Morton,
Jonathan Corbet, Christoph Lameter, Pekka Enberg, David Rientjes,
Joonsoo Kim, Vlastimil Babka, Roman Gushchin, linux-mm,
linux-doc, linux-kernel
[-- Attachment #1: Type: text/plain, Size: 2872 bytes --]
On Tue, Jan 21, 2025 at 10:48:08AM +0900, Hyeonggon Yoo wrote:
>
>
> On 1/21/2025 9:54 AM, Hyesoo Yu wrote:
> > On Tue, Jan 21, 2025 at 12:41:01AM +0900, Hyeonggon Yoo wrote:
> > > On Mon, Jan 20, 2025 at 5:30 PM Hyesoo Yu <hyesoo.yu@samsung.com> wrote:
> > > >
> > > > If a slab object is corrupted or an error occurs in its internal
> > > > value, continuing after restoration may cause other side effects.
> > > > At this point, it is difficult to debug because the problem occurred
> > > > in the past. A flag has been added that can cause a panic when there
> > > > is a problem with the object.
> > >
> > > Hi Hyesoo,
> > >
> > > I'm concerned about this because it goes against the effort to avoid
> > > introducing new BUG() calls [1].
> > >
> > > And I think it would be more appropriate to use existing panic_on_warn
> > > functionality [2] which causes
> > > a panic on WARN(), rather than introducing a SLUB-specific knob to do
> > > the same thing.
> > >
> > > However SLUB does not call WARN() and uses pr_err() instead when
> > > reporting an error.
> > > Vlastimil and I talked about changing it to use WARN() a while ago
> > > [3], but neither of us
> > > have done that yet.
> > >
> > > Probably you may want to look at it, as it also aligns with your purpose?
> > > FYI, if you would like to work on it, please make sure that it WARN()
> > > is suppressed during kunit test.
> > >
> > > [1] https://docs.kernel.org/process/deprecated.html#bug-and-bug-on
> > > [2] https://www.kernel.org/doc/html/v6.9/admin-guide/sysctl/kernel.html#panic-on-warn
> > > [3] https://lore.kernel.org/linux-mm/d4219cd9-32d3-4697-93b9-6a44bf77d50c@suse.cz
> > >
> > > Best,
> > > Hyeonggon
> >
> > Thanks for response.
> >
> > Using warn() instead of panic, is a great idea.
> > Thanks for pointing out what I missed.
>
> Just for clarification, I think changing the common error reporting
> logic (like, slab_bug()) to use WARN() will be preferable to inserting
> new WARN()s at random points, which is what this patch does now.
>
> Best,
> Hyeonggon
>
Thanks you for clarification.
Actually, I considered adding BUG_ON() to slab_bug. However if we add BUG_ON() to slab_bug,
it will prevent many meaningful error log from being printed subsequently.
As you know, slab_bug is the log that usually is printed at the biginning of a bug
in the slab. As a result, it would be difficult to figure out the problems based on
the logs during our large-scale test-bed.
Similary, even if I use WARN() in slab_bug, we won't be able to obtain the logs
when panic_on_warn is enabled. I don't think it is useful to include WARN in slab_bug.
Instead, I will implement a solution where WARN is only used in slab_fix before
slab object is restored. If I add it to slab_fix, I think warning is suppressed on
kunit test by slab_add_kunit_errors handling.
Thanks,
Regards.
[-- Attachment #2: Type: text/plain, Size: 0 bytes --]
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] mm: slub: Panic if the object corruption is checked.
2025-01-21 0:40 ` Hyesoo Yu
@ 2025-01-21 10:27 ` Vlastimil Babka
0 siblings, 0 replies; 10+ messages in thread
From: Vlastimil Babka @ 2025-01-21 10:27 UTC (permalink / raw)
To: Hyesoo Yu, Matthew Wilcox
Cc: janghyuck.kim, Andrew Morton, Jonathan Corbet, Christoph Lameter,
Pekka Enberg, David Rientjes, Joonsoo Kim, Roman Gushchin,
Hyeonggon Yoo, linux-mm, linux-doc, linux-kernel
On 1/21/25 1:40 AM, Hyesoo Yu wrote:
> On Mon, Jan 20, 2025 at 03:36:08PM +0000, Matthew Wilcox wrote:
>> On Mon, Jan 20, 2025 at 05:28:21PM +0900, Hyesoo Yu wrote:
>>> If a slab object is corrupted or an error occurs in its internal
>>> value, continuing after restoration may cause other side effects.
>>> At this point, it is difficult to debug because the problem occurred
>>> in the past. A flag has been added that can cause a panic when there
>>> is a problem with the object.
>>>
>>> Signed-off-by: Hyesoo Yu <hyesoo.yu@samsung.com>
>>> Change-Id: I4e7e5e0ec3421a7f6c84d591db052f79d3775493
>>
>> Linux does not use Change IDs. Please omit these from future patches.
>>
>> Panicing is a very unfriendly approach. I think a better approach would
>> be to freeze the slab where corruption is detected. That is, no future
>> objects are allocated from that slab, and attempts to free objects from
>> that slab become no-ops. I don't think that should be hard to implement.
Freezing of slab is already done in some cases when corruption is
detected - all objects are marked as used, and further freeing attempts
on the slab are discarded. Perhaps not all cases, which could be improved.
> Thanks you for your responce. That is my mistake. I will remove the change ID.
>
> I agree that freezing is better than recovery or panic for the system's stability.
> However what I want from the patch is not just to make the system run stably.
> I need to immediately trigger a panic to investigate the slub.
IMHO it's a valid goal to panic more quickly when debugging, and
enabling slub_debug means debugging is in progress (as opposed to normal
production when we try to avoid panic).
But making it possible to reuse the general panic_on_warn mechanism
(which can be also expected to be enabled when debugging) is indeed
preferable to introducing a new slab-specific flag.
> I would like to analyze the corrupted data at that moment to check issues
> like cache problem, user errors, system clock frequency and similar problems,
> not just passing by without any issues.
>
> However I agree that panic is not a friendly approach.
> I will modify it to notify the problem using warn() and then use
> panic_on_warn to trigger panic.
>
> Thanks,
> Regards.
>
>
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] mm: slub: Panic if the object corruption is checked.
2025-01-21 2:32 ` Hyesoo Yu
@ 2025-01-21 10:28 ` Vlastimil Babka
0 siblings, 0 replies; 10+ messages in thread
From: Vlastimil Babka @ 2025-01-21 10:28 UTC (permalink / raw)
To: Hyesoo Yu, Hyeonggon Yoo
Cc: Hyeonggon Yoo, kernel_team, janghyuck.kim, Andrew Morton,
Jonathan Corbet, Christoph Lameter, Pekka Enberg, David Rientjes,
Joonsoo Kim, Roman Gushchin, linux-mm, linux-doc, linux-kernel
On 1/21/25 3:32 AM, Hyesoo Yu wrote:
> On Tue, Jan 21, 2025 at 10:48:08AM +0900, Hyeonggon Yoo wrote:
>>
>>
>> On 1/21/2025 9:54 AM, Hyesoo Yu wrote:
>>> On Tue, Jan 21, 2025 at 12:41:01AM +0900, Hyeonggon Yoo wrote:
>>>> On Mon, Jan 20, 2025 at 5:30 PM Hyesoo Yu <hyesoo.yu@samsung.com> wrote:
>>>
>>> Thanks for response.
>>>
>>> Using warn() instead of panic, is a great idea.
>>> Thanks for pointing out what I missed.
>>
>> Just for clarification, I think changing the common error reporting
>> logic (like, slab_bug()) to use WARN() will be preferable to inserting
>> new WARN()s at random points, which is what this patch does now.
>>
>> Best,
>> Hyeonggon
>>
>
> Thanks you for clarification.
>
> Actually, I considered adding BUG_ON() to slab_bug. However if we add BUG_ON() to slab_bug,
> it will prevent many meaningful error log from being printed subsequently.
> As you know, slab_bug is the log that usually is printed at the biginning of a bug
> in the slab. As a result, it would be difficult to figure out the problems based on
> the logs during our large-scale test-bed.
>
> Similary, even if I use WARN() in slab_bug, we won't be able to obtain the logs
> when panic_on_warn is enabled. I don't think it is useful to include WARN in slab_bug.
>
> Instead, I will implement a solution where WARN is only used in slab_fix before
> slab object is restored. If I add it to slab_fix, I think warning is suppressed on
> kunit test by slab_add_kunit_errors handling.
Right, makes sense to only do the WARN() after printing the debugging
logs. Thanks.
> Thanks,
> Regards.
>
>
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] mm: slub: Panic if the object corruption is checked.
2025-01-20 8:28 ` [PATCH] mm: slub: Panic if the object corruption is checked Hyesoo Yu
2025-01-20 15:36 ` Matthew Wilcox
2025-01-20 15:41 ` Hyeonggon Yoo
@ 2025-01-21 12:46 ` kernel test robot
2 siblings, 0 replies; 10+ messages in thread
From: kernel test robot @ 2025-01-21 12:46 UTC (permalink / raw)
To: Hyesoo Yu
Cc: oe-kbuild-all, janghyuck.kim, Hyesoo Yu, Andrew Morton,
Jonathan Corbet, Christoph Lameter, Pekka Enberg, David Rientjes,
Joonsoo Kim, Vlastimil Babka, Roman Gushchin, Hyeonggon Yoo,
linux-mm, linux-doc, linux-kernel
Hi Hyesoo,
kernel test robot noticed the following build warnings:
[auto build test WARNING on akpm-mm/mm-everything]
url: https://github.com/intel-lab-lkp/linux/commits/Hyesoo-Yu/mm-slub-Panic-if-the-object-corruption-is-checked/20250120-163233
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/20250120082908.4162780-1-hyesoo.yu%40samsung.com
patch subject: [PATCH] mm: slub: Panic if the object corruption is checked.
config: mips-randconfig-r111-20250121 (https://download.01.org/0day-ci/archive/20250121/202501212026.lUnLNhv6-lkp@intel.com/config)
compiler: mips-linux-gcc (GCC) 14.2.0
reproduce: (https://download.01.org/0day-ci/archive/20250121/202501212026.lUnLNhv6-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202501212026.lUnLNhv6-lkp@intel.com/
sparse warnings: (new ones prefixed by >>)
>> mm/slub.c:1308:9: sparse: sparse: cast from restricted slab_flags_t
mm/slub.c:1394:17: sparse: sparse: cast from restricted slab_flags_t
mm/slub.c:4440:47: sparse: sparse: context imbalance in '__slab_free' - unexpected unlock
vim +1308 mm/slub.c
1273
1274 /* Check the pad bytes at the end of a slab page */
1275 static pad_check_attributes void
1276 slab_pad_check(struct kmem_cache *s, struct slab *slab)
1277 {
1278 u8 *start;
1279 u8 *fault;
1280 u8 *end;
1281 u8 *pad;
1282 int length;
1283 int remainder;
1284
1285 if (!(s->flags & SLAB_POISON))
1286 return;
1287
1288 start = slab_address(slab);
1289 length = slab_size(slab);
1290 end = start + length;
1291 remainder = length % s->size;
1292 if (!remainder)
1293 return;
1294
1295 pad = end - remainder;
1296 metadata_access_enable();
1297 fault = memchr_inv(kasan_reset_tag(pad), POISON_INUSE, remainder);
1298 metadata_access_disable();
1299 if (!fault)
1300 return;
1301 while (end > fault && end[-1] == POISON_INUSE)
1302 end--;
1303
1304 slab_err(s, slab, "Padding overwritten. 0x%p-0x%p @offset=%tu",
1305 fault, end - 1, fault - start);
1306 print_section(KERN_ERR, "Padding ", pad, remainder);
1307
> 1308 BUG_ON(s->flags & SLAB_CORRUPTION_PANIC);
1309
1310 restore_bytes(s, "slab padding", POISON_INUSE, fault, end);
1311 }
1312
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~2025-01-21 12:47 UTC | newest]
Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <CGME20250120083041epcas2p30d61ee801c1db5a7ebdf26fdc642883f@epcas2p3.samsung.com>
2025-01-20 8:28 ` [PATCH] mm: slub: Panic if the object corruption is checked Hyesoo Yu
2025-01-20 15:36 ` Matthew Wilcox
2025-01-21 0:40 ` Hyesoo Yu
2025-01-21 10:27 ` Vlastimil Babka
2025-01-20 15:41 ` Hyeonggon Yoo
2025-01-21 0:54 ` Hyesoo Yu
2025-01-21 1:48 ` Hyeonggon Yoo
2025-01-21 2:32 ` Hyesoo Yu
2025-01-21 10:28 ` Vlastimil Babka
2025-01-21 12:46 ` kernel test robot
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox