* Re: [PATCH] mm/slub: add missing TID updates on slab deactivation
From: Christoph Lameter @ 2022-06-09 11:58 UTC
To: Jann Horn
Cc: Pekka Enberg, David Rientjes, Joonsoo Kim, Andrew Morton,
Vlastimil Babka, Hyeonggon Yoo, linux-mm, linux-kernel
On Wed, 8 Jun 2022, Jann Horn wrote:
> The fastpath in slab_alloc_node() assumes that c->slab is stable as long as
> the TID stays the same. However, two places in __slab_alloc() currently
> don't update the TID when deactivating the CPU slab.
Looks ok.
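
For reference, the pairing being relied on looks roughly like this
(simplified sketch of the !PREEMPT_RT fastpath, not the exact mm/slub.c
code):

	tid = READ_ONCE(c->tid);
	object = READ_ONCE(c->freelist);
	slab = READ_ONCE(c->slab);
	...
	/* succeeds only if neither c->freelist nor c->tid changed meanwhile */
	this_cpu_cmpxchg_double(s->cpu_slab->freelist, s->cpu_slab->tid,
				object, tid,
				get_freepointer_safe(s, object), next_tid(tid));

If c->slab can change without a tid bump, the cmpxchg can succeed against
a freelist that no longer belongs to the slab read earlier.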
Acked-by: Christoph Lameter <cl@linux.com>

* Re: [PATCH] mm/slub: add missing TID updates on slab deactivation
From: David Rientjes @ 2022-06-12 22:45 UTC
To: Jann Horn
Cc: Christoph Lameter, Pekka Enberg, Joonsoo Kim, Andrew Morton,
Vlastimil Babka, Hyeonggon Yoo, linux-mm, linux-kernel
On Wed, 8 Jun 2022, Jann Horn wrote:
> The fastpath in slab_alloc_node() assumes that c->slab is stable as long as
> the TID stays the same. However, two places in __slab_alloc() currently
> don't update the TID when deactivating the CPU slab.
>
> If multiple operations race the right way, this could lead to an object
> getting lost; or, in an even more unlikely situation, it could even lead to
> an object being freed onto the wrong slab's freelist, messing up the
> `inuse` counter and eventually causing a page to be freed to the page
> allocator while it still contains slab objects.
>
> (I haven't actually tested these cases though, this is just based on
> looking at the code. Writing testcases for this stuff seems like it'd be
> a pain...)
>
> The race leading to state inconsistency is (all operations on the same CPU
> and kmem_cache):
>
> - task A: begin do_slab_free():
>   - read TID
>   - read pcpu freelist (==NULL)
>   - check `slab == c->slab` (true)
> - [PREEMPT A->B]
> - task B: begin slab_alloc_node():
>   - fastpath fails (`c->freelist` is NULL)
>   - enter __slab_alloc()
>   - slub_get_cpu_ptr() (disables preemption)
>   - enter ___slab_alloc()
>     - take local_lock_irqsave()
>     - read c->freelist as NULL
>     - get_freelist() returns NULL
>     - write `c->slab = NULL`
>     - drop local_unlock_irqrestore()
>     - goto new_slab
>     - slub_percpu_partial() is NULL
>     - get_partial() returns NULL
>   - slub_put_cpu_ptr() (enables preemption)
> - [PREEMPT B->A]
> - task A: finish do_slab_free():
>   - this_cpu_cmpxchg_double() succeeds
>   - [CORRUPT STATE: c->slab==NULL, c->freelist!=NULL]
>
>
> From there, the object on c->freelist will get lost if task B is allowed to
> continue from here: It will proceed to the retry_load_slab label,
> set c->slab, then jump to load_freelist, which clobbers c->freelist.
>
>
> But if we instead continue as follows, we get worse corruption:
>
> - task A: run __slab_free() on object from other struct slab:
>   - CPU_PARTIAL_FREE case (slab was on no list, is now on pcpu partial)
> - task A: run slab_alloc_node() with NUMA node constraint:
>   - fastpath fails (c->slab is NULL)
>   - call __slab_alloc()
>   - slub_get_cpu_ptr() (disables preemption)
>   - enter ___slab_alloc()
>     - c->slab is NULL: goto new_slab
>     - slub_percpu_partial() is non-NULL
>     - set c->slab to slub_percpu_partial(c)
>     - [CORRUPT STATE: c->slab points to slab-1, c->freelist has objects
>       from slab-2]
>     - goto redo
>     - node_match() fails
>     - goto deactivate_slab
>     - existing c->freelist is passed into deactivate_slab()
>     - inuse count of slab-1 is decremented to account for object from
>       slab-2
>
> At this point, the inuse count of slab-1 is 1 lower than it should be.
> This means that if we free all allocated objects in slab-1 except for one,
> SLUB will think that slab-1 is completely unused, and may free its page,
> leading to use-after-free.
>
> Fixes: c17dda40a6a4e ("slub: Separate out kmem_cache_cpu processing from deactivate_slab")
> Fixes: 03e404af26dc2 ("slub: fast release on full slab")
> Cc: stable@vger.kernel.org
> Signed-off-by: Jann Horn <jannh@google.com>
Acked-by: David Rientjes <rientjes@google.com>

* Re: [PATCH] mm/slub: add missing TID updates on slab deactivation
From: Muchun Song @ 2022-06-13 3:19 UTC
To: Jann Horn
Cc: Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
Andrew Morton, Vlastimil Babka, Hyeonggon Yoo, linux-mm,
linux-kernel
On Wed, Jun 08, 2022 at 08:22:05PM +0200, Jann Horn wrote:
> The fastpath in slab_alloc_node() assumes that c->slab is stable as long as
> the TID stays the same. However, two places in __slab_alloc() currently
> don't update the TID when deactivating the CPU slab.
>
> If multiple operations race the right way, this could lead to an object
> getting lost; or, in an even more unlikely situation, it could even lead to
> an object being freed onto the wrong slab's freelist, messing up the
> `inuse` counter and eventually causing a page to be freed to the page
> allocator while it still contains slab objects.
>
> (I haven't actually tested these cases though, this is just based on
> looking at the code. Writing testcases for this stuff seems like it'd be
> a pain...)
>
> The race leading to state inconsistency is (all operations on the same CPU
> and kmem_cache):
>
> - task A: begin do_slab_free():
>   - read TID
>   - read pcpu freelist (==NULL)
>   - check `slab == c->slab` (true)
> - [PREEMPT A->B]
> - task B: begin slab_alloc_node():
>   - fastpath fails (`c->freelist` is NULL)
>   - enter __slab_alloc()
>   - slub_get_cpu_ptr() (disables preemption)
>   - enter ___slab_alloc()
>     - take local_lock_irqsave()
>     - read c->freelist as NULL
>     - get_freelist() returns NULL
>     - write `c->slab = NULL`
>     - drop local_unlock_irqrestore()
>     - goto new_slab
>     - slub_percpu_partial() is NULL
>     - get_partial() returns NULL
>   - slub_put_cpu_ptr() (enables preemption)
> - [PREEMPT B->A]
> - task A: finish do_slab_free():
>   - this_cpu_cmpxchg_double() succeeds
>   - [CORRUPT STATE: c->slab==NULL, c->freelist!=NULL]
>
>
> From there, the object on c->freelist will get lost if task B is allowed to
> continue from here: It will proceed to the retry_load_slab label,
> set c->slab, then jump to load_freelist, which clobbers c->freelist.
>
>
> But if we instead continue as follows, we get worse corruption:
>
> - task A: run __slab_free() on object from other struct slab:
>   - CPU_PARTIAL_FREE case (slab was on no list, is now on pcpu partial)
> - task A: run slab_alloc_node() with NUMA node constraint:
>   - fastpath fails (c->slab is NULL)
>   - call __slab_alloc()
>   - slub_get_cpu_ptr() (disables preemption)
>   - enter ___slab_alloc()
>     - c->slab is NULL: goto new_slab
>     - slub_percpu_partial() is non-NULL
>     - set c->slab to slub_percpu_partial(c)
>     - [CORRUPT STATE: c->slab points to slab-1, c->freelist has objects
>       from slab-2]
>     - goto redo
>     - node_match() fails
>     - goto deactivate_slab
>     - existing c->freelist is passed into deactivate_slab()
>     - inuse count of slab-1 is decremented to account for object from
>       slab-2
>
> At this point, the inuse count of slab-1 is 1 lower than it should be.
> This means that if we free all allocated objects in slab-1 except for one,
> SLUB will think that slab-1 is completely unused, and may free its page,
> leading to use-after-free.
>
> Fixes: c17dda40a6a4e ("slub: Separate out kmem_cache_cpu processing from deactivate_slab")
> Fixes: 03e404af26dc2 ("slub: fast release on full slab")
> Cc: stable@vger.kernel.org
> Signed-off-by: Jann Horn <jannh@google.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Thanks.

* Re: [PATCH] mm/slub: add missing TID updates on slab deactivation
From: Hyeonggon Yoo @ 2022-06-13 12:49 UTC
To: Jann Horn
Cc: Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
Andrew Morton, Vlastimil Babka, linux-mm, linux-kernel
On Wed, Jun 08, 2022 at 08:22:05PM +0200, Jann Horn wrote:
> The fastpath in slab_alloc_node() assumes that c->slab is stable as long as
> the TID stays the same. However, two places in __slab_alloc() currently
> don't update the TID when deactivating the CPU slab.
>
> If multiple operations race the right way, this could lead to an object
> getting lost; or, in an even more unlikely situation, it could even lead to
> an object being freed onto the wrong slab's freelist, messing up the
> `inuse` counter and eventually causing a page to be freed to the page
> allocator while it still contains slab objects.
>
> (I haven't actually tested these cases though, this is just based on
> looking at the code. Writing testcases for this stuff seems like it'd be
> a pain...)
>
> The race leading to state inconsistency is (all operations on the same CPU
> and kmem_cache):
>
> - task A: begin do_slab_free():
>   - read TID
>   - read pcpu freelist (==NULL)
>   - check `slab == c->slab` (true)
> - [PREEMPT A->B]
> - task B: begin slab_alloc_node():
>   - fastpath fails (`c->freelist` is NULL)
>   - enter __slab_alloc()
>   - slub_get_cpu_ptr() (disables preemption)
>   - enter ___slab_alloc()
>     - take local_lock_irqsave()
>     - read c->freelist as NULL
>     - get_freelist() returns NULL
>     - write `c->slab = NULL`
>     - drop local_unlock_irqrestore()
>     - goto new_slab
>     - slub_percpu_partial() is NULL
>     - get_partial() returns NULL
>   - slub_put_cpu_ptr() (enables preemption)
> - [PREEMPT B->A]
> - task A: finish do_slab_free():
>   - this_cpu_cmpxchg_double() succeeds
>   - [CORRUPT STATE: c->slab==NULL, c->freelist!=NULL]
I can see this happening (!c->slab && c->freelist becoming true)
when I synthetically add scheduling points in the code:
diff --git a/mm/slub.c b/mm/slub.c
index b97fa5e21046..b8012fdf2607 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3001,6 +3001,10 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 		goto check_new_slab;
 
 	slub_put_cpu_ptr(s->cpu_slab);
+
+	if (!in_atomic())
+		schedule();
+
 	slab = new_slab(s, gfpflags, node);
 	c = slub_get_cpu_ptr(s->cpu_slab);
 
@@ -3456,9 +3460,13 @@ static __always_inline void do_slab_free(struct kmem_cache *s,
 	if (likely(slab == c->slab)) {
 #ifndef CONFIG_PREEMPT_RT
 		void **freelist = READ_ONCE(c->freelist);
+		unsigned long flags;
 
 		set_freepointer(s, tail_obj, freelist);
 
+		if (!in_atomic())
+			schedule();
+
 		if (unlikely(!this_cpu_cmpxchg_double(
 				s->cpu_slab->freelist, s->cpu_slab->tid,
 				freelist, tid,
@@ -3467,6 +3475,10 @@ static __always_inline void do_slab_free(struct kmem_cache *s,
 			note_cmpxchg_failure("slab_free", s, tid);
 			goto redo;
 		}
+
+		local_irq_save(flags);
+		WARN_ON(!READ_ONCE(c->slab) && READ_ONCE(c->freelist));
+		local_irq_restore(flags);
 #else /* CONFIG_PREEMPT_RT */
 	/*
 	 * We cannot use the lockless fastpath on PREEMPT_RT because if
> From there, the object on c->freelist will get lost if task B is allowed to
> continue from here: It will proceed to the retry_load_slab label,
> set c->slab, then jump to load_freelist, which clobbers c->freelist.
>
> But if we instead continue as follows, we get worse corruption:
>
> - task A: run __slab_free() on object from other struct slab:
>   - CPU_PARTIAL_FREE case (slab was on no list, is now on pcpu partial)
> - task A: run slab_alloc_node() with NUMA node constraint:
>   - fastpath fails (c->slab is NULL)
>   - call __slab_alloc()
>   - slub_get_cpu_ptr() (disables preemption)
>   - enter ___slab_alloc()
>     - c->slab is NULL: goto new_slab
>     - slub_percpu_partial() is non-NULL
>     - set c->slab to slub_percpu_partial(c)
>     - [CORRUPT STATE: c->slab points to slab-1, c->freelist has objects
>       from slab-2]
>     - goto redo
>     - node_match() fails
>     - goto deactivate_slab
>     - existing c->freelist is passed into deactivate_slab()
>     - inuse count of slab-1 is decremented to account for object from
>       slab-2
I didn't try to reproduce this -- but I agree SLUB can be fooled
by the condition (!c->slab && c->freelist).
> At this point, the inuse count of slab-1 is 1 lower than it should be.
> This means that if we free all allocated objects in slab-1 except for one,
> SLUB will think that slab-1 is completely unused, and may free its page,
> leading to use-after-free.
>
> Fixes: c17dda40a6a4e ("slub: Separate out kmem_cache_cpu processing from deactivate_slab")
> Fixes: 03e404af26dc2 ("slub: fast release on full slab")
> Cc: stable@vger.kernel.org
> Signed-off-by: Jann Horn <jannh@google.com>
> ---
> mm/slub.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/mm/slub.c b/mm/slub.c
> index e5535020e0fdf..b97fa5e210469 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -2936,6 +2936,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> 
>  	if (!freelist) {
>  		c->slab = NULL;
> +		c->tid = next_tid(c->tid);
>  		local_unlock_irqrestore(&s->cpu_slab->lock, flags);
>  		stat(s, DEACTIVATE_BYPASS);
>  		goto new_slab;
> @@ -2968,6 +2969,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>  	freelist = c->freelist;
>  	c->slab = NULL;
>  	c->freelist = NULL;
> +	c->tid = next_tid(c->tid);
>  	local_unlock_irqrestore(&s->cpu_slab->lock, flags);
>  	deactivate_slab(s, slab, freelist);
>
>
> base-commit: 9886142c7a2226439c1e3f7d9b69f9c7094c3ef6
> --
> 2.36.1.476.g0c4daa206d-goog
With this patch I couldn't reproduce it.
This work is really nice. Thanks!
Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
BTW, I wonder how much this race affects machines in the real world.
Maybe just a rare and undetectable memory leak?
--
Thanks,
Hyeonggon

* Re: [PATCH] mm/slub: add missing TID updates on slab deactivation
From: Vlastimil Babka @ 2022-06-14 8:23 UTC
To: Jann Horn, Christoph Lameter, Pekka Enberg, David Rientjes,
Joonsoo Kim, Andrew Morton
Cc: Hyeonggon Yoo, linux-mm, linux-kernel
On 6/8/22 20:22, Jann Horn wrote:
> The fastpath in slab_alloc_node() assumes that c->slab is stable as long as
> the TID stays the same. However, two places in __slab_alloc() currently
> don't update the TID when deactivating the CPU slab.
>
> If multiple operations race the right way, this could lead to an object
> getting lost; or, in an even more unlikely situation, it could even lead to
> an object being freed onto the wrong slab's freelist, messing up the
> `inuse` counter and eventually causing a page to be freed to the page
> allocator while it still contains slab objects.
>
> (I haven't actually tested these cases though, this is just based on
> looking at the code. Writing testcases for this stuff seems like it'd be
> a pain...)
>
> The race leading to state inconsistency is (all operations on the same CPU
> and kmem_cache):
>
> - task A: begin do_slab_free():
>   - read TID
>   - read pcpu freelist (==NULL)
>   - check `slab == c->slab` (true)
> - [PREEMPT A->B]
> - task B: begin slab_alloc_node():
>   - fastpath fails (`c->freelist` is NULL)
>   - enter __slab_alloc()
>   - slub_get_cpu_ptr() (disables preemption)
>   - enter ___slab_alloc()
>     - take local_lock_irqsave()
>     - read c->freelist as NULL
>     - get_freelist() returns NULL
>     - write `c->slab = NULL`
>     - drop local_unlock_irqrestore()
>     - goto new_slab
>     - slub_percpu_partial() is NULL
>     - get_partial() returns NULL
>   - slub_put_cpu_ptr() (enables preemption)
> - [PREEMPT B->A]
> - task A: finish do_slab_free():
>   - this_cpu_cmpxchg_double() succeeds
>   - [CORRUPT STATE: c->slab==NULL, c->freelist!=NULL]
>
>
> From there, the object on c->freelist will get lost if task B is allowed to
> continue from here: It will proceed to the retry_load_slab label,
> set c->slab, then jump to load_freelist, which clobbers c->freelist.
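(And indeed load_freelist overwrites c->freelist unconditionally - roughly,
abbreviated from ___slab_alloc():

	load_freelist:
		...
		c->freelist = get_freepointer(s, freelist);
		c->tid = next_tid(c->tid);
		local_unlock_irqrestore(&s->cpu_slab->lock, flags);
		return freelist;

so whatever a racing fastpath free put on c->freelist in the meantime is
silently dropped.)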
>
>
> But if we instead continue as follows, we get worse corruption:
>
> - task A: run __slab_free() on object from other struct slab:
>   - CPU_PARTIAL_FREE case (slab was on no list, is now on pcpu partial)
> - task A: run slab_alloc_node() with NUMA node constraint:
>   - fastpath fails (c->slab is NULL)
>   - call __slab_alloc()
>   - slub_get_cpu_ptr() (disables preemption)
>   - enter ___slab_alloc()
>     - c->slab is NULL: goto new_slab
>     - slub_percpu_partial() is non-NULL
>     - set c->slab to slub_percpu_partial(c)
>     - [CORRUPT STATE: c->slab points to slab-1, c->freelist has objects
>       from slab-2]
>     - goto redo
>     - node_match() fails
>     - goto deactivate_slab
>     - existing c->freelist is passed into deactivate_slab()
>     - inuse count of slab-1 is decremented to account for object from
>       slab-2
>
> At this point, the inuse count of slab-1 is 1 lower than it should be.
> This means that if we free all allocated objects in slab-1 except for one,
> SLUB will think that slab-1 is completely unused, and may free its page,
> leading to use-after-free.
>
> Fixes: c17dda40a6a4e ("slub: Separate out kmem_cache_cpu processing from deactivate_slab")
> Fixes: 03e404af26dc2 ("slub: fast release on full slab")
> Cc: stable@vger.kernel.org
Hmm, these are old commits, and currently the oldest LTS is 4.9, so this will
be fun. Worth double-checking that it wasn't recent changes that actually
introduced the bug... but seems not, AFAICS.
> Signed-off-by: Jann Horn <jannh@google.com>
> ---
> mm/slub.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/mm/slub.c b/mm/slub.c
> index e5535020e0fdf..b97fa5e210469 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -2936,6 +2936,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> 
>  	if (!freelist) {
>  		c->slab = NULL;
> +		c->tid = next_tid(c->tid);
>  		local_unlock_irqrestore(&s->cpu_slab->lock, flags);
So this immediate unlock after setting NULL is new from the 5.15 preempt-rt
changes. However, even in older versions we could goto new_slab ->
new_slab_objects() -> new_slab() -> allocate_slab(), where
`if (gfpflags_allow_blocking()) local_irq_enable();` runs (there's no
extra disabled preemption besides the irq disable), so I'd say the bug was
possible before too, but less often?
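(From memory, the older code was roughly:

	static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
	{
		...
		flags &= gfp_allowed_mask;
		if (gfpflags_allow_blocking(flags))
			local_irq_enable();	/* racing fastpath free can run here */
		...
	}

i.e. the window only opened for blocking allocations, not on every
deactivation.)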
>  		stat(s, DEACTIVATE_BYPASS);
>  		goto new_slab;
> @@ -2968,6 +2969,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>  	freelist = c->freelist;
>  	c->slab = NULL;
>  	c->freelist = NULL;
Previously these were part of deactivate_slab(), which does that at the very
end, but also without bumping tid.
I just wonder if it's necessary too, because IIUC the scenario you described
relies on the missing bump above. This alone doesn't cause the c->slab vs
c->freelist mismatch?
But I guess it won't hurt to just bump tid on each c->freelist assignment.
In backports we would just add it to deactivate_slab() instead.
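I.e. roughly this, as an untested sketch against a tree where
deactivate_slab() still takes the kmem_cache_cpu pointer and clears it at
the very end:

	static void deactivate_slab(struct kmem_cache *s, struct page *page,
				    void *freelist, struct kmem_cache_cpu *c)
	{
		...
		c->page = NULL;
		c->freelist = NULL;
		c->tid = next_tid(c->tid);	/* the missing bump */
	}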
Thanks. Applying to slab/for-5.19-rc3/fixes branch.
> +	c->tid = next_tid(c->tid);
>  	local_unlock_irqrestore(&s->cpu_slab->lock, flags);
>  	deactivate_slab(s, slab, freelist);
>
>
> base-commit: 9886142c7a2226439c1e3f7d9b69f9c7094c3ef6

* Re: [PATCH] mm/slub: add missing TID updates on slab deactivation
From: Jann Horn @ 2022-06-14 15:54 UTC
To: Vlastimil Babka
Cc: Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
Andrew Morton, Hyeonggon Yoo, linux-mm, linux-kernel
On Tue, Jun 14, 2022 at 10:23 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> On 6/8/22 20:22, Jann Horn wrote:
> > The fastpath in slab_alloc_node() assumes that c->slab is stable as long as
> > the TID stays the same. However, two places in __slab_alloc() currently
> > don't update the TID when deactivating the CPU slab.
> >
> > If multiple operations race the right way, this could lead to an object
> > getting lost; or, in an even more unlikely situation, it could even lead to
> > an object being freed onto the wrong slab's freelist, messing up the
> > `inuse` counter and eventually causing a page to be freed to the page
> > allocator while it still contains slab objects.
[...]
> > Fixes: c17dda40a6a4e ("slub: Separate out kmem_cache_cpu processing from deactivate_slab")
> > Fixes: 03e404af26dc2 ("slub: fast release on full slab")
> > Cc: stable@vger.kernel.org
>
> Hmm, these are old commits, and currently the oldest LTS is 4.9, so this will
> be fun. Worth double-checking that it wasn't recent changes that actually
> introduced the bug... but seems not, AFAICS.
[...]
> > @@ -2936,6 +2936,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> > 
> >  	if (!freelist) {
> >  		c->slab = NULL;
> > +		c->tid = next_tid(c->tid);
> >  		local_unlock_irqrestore(&s->cpu_slab->lock, flags);
>
> So this immediate unlock after setting NULL is new from the 5.15 preempt-rt
> changes. However, even in older versions we could goto new_slab ->
> new_slab_objects() -> new_slab() -> allocate_slab(), where
> `if (gfpflags_allow_blocking()) local_irq_enable();` runs (there's no
> extra disabled preemption besides the irq disable), so I'd say the bug was
> possible before too, but less often?
Yeah, I think so too.
> >  		stat(s, DEACTIVATE_BYPASS);
> >  		goto new_slab;
> > @@ -2968,6 +2969,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> >  	freelist = c->freelist;
> >  	c->slab = NULL;
> >  	c->freelist = NULL;
>
> Previously these were part of deactivate_slab(), which does that at the very
> end, but also without bumping tid.
> I just wonder if it's necessary too, because IIUC the scenario you described
> relies on the missing bump above. This alone doesn't cause the c->slab vs
> c->freelist mismatch?
It's a different scenario, but at least in the current version, the
ALLOC_NODE_MISMATCH case jumps straight to the deactivate_slab label,
which takes the local_lock, grabs the old c->freelist, NULLs out
->slab and ->freelist, then drops the local_lock again. If the
c->freelist was non-NULL, then this will prevent concurrent cmpxchg
success; but there is no reason why c->freelist has to be non-NULL
here. So if c->freelist is already NULL, we basically just take the
local_lock, set c->slab to NULL, and drop the local_lock. And IIUC the
local_lock is the only protection we have here against concurrency,
since the slub_get_cpu_ptr() in __slab_alloc() only disables
migration? So again a concurrent fastpath free should be able to set
c->freelist to non-NULL after c->slab has been set to NULL.
So I think this TID bump is also necessary for correctness in the
current version.
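
(To spell it out, with both hunks applied the sequence at the
deactivate_slab label becomes, abbreviated:

	local_lock_irqsave(&s->cpu_slab->lock, flags);
	...
	freelist = c->freelist;
	c->slab = NULL;
	c->freelist = NULL;
	c->tid = next_tid(c->tid);	/* invalidates a racing fastpath cmpxchg */
	local_unlock_irqrestore(&s->cpu_slab->lock, flags);
	deactivate_slab(s, slab, freelist);

so a do_slab_free() that already passed its `slab == c->slab` check can no
longer push its object onto the percpu freelist.)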
And looking back at older kernels, back to at least 4.9, the
ALLOC_NODE_MISMATCH case looks similarly broken - except that again,
as you pointed out, we don't have the fine-grained locking, so it only
becomes racy if we hit new_slab_objects() -> new_slab() ->
allocate_slab() and then either we do local_irq_enable() or the
allocation fails.
> Thanks. Applying to slab/for-5.19-rc3/fixes branch.
Thanks!

* Re: [PATCH] mm/slub: add missing TID updates on slab deactivation
From: Vlastimil Babka @ 2022-06-15 7:18 UTC
To: Jann Horn
Cc: Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
Andrew Morton, Hyeonggon Yoo, linux-mm, linux-kernel
On 6/14/22 17:54, Jann Horn wrote:
> On Tue, Jun 14, 2022 at 10:23 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
>> >  		stat(s, DEACTIVATE_BYPASS);
>> >  		goto new_slab;
>> > @@ -2968,6 +2969,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>> >  	freelist = c->freelist;
>> >  	c->slab = NULL;
>> >  	c->freelist = NULL;
>>
>> Previously these were part of deactivate_slab(), which does that at the very
>> end, but also without bumping tid.
>> I just wonder if it's necessary too, because IIUC the scenario you described
>> relies on the missing bump above. This alone doesn't cause the c->slab vs
>> c->freelist mismatch?
>
> It's a different scenario, but at least in the current version, the
> ALLOC_NODE_MISMATCH case jumps straight to the deactivate_slab label,
> which takes the local_lock, grabs the old c->freelist, NULLs out
> ->slab and ->freelist, then drops the local_lock again. If the
> c->freelist was non-NULL, then this will prevent concurrent cmpxchg
> success; but there is no reason why c->freelist has to be non-NULL
> here. So if c->freelist is already NULL, we basically just take the
> local_lock, set c->slab to NULL, and drop the local_lock.
Ah, right. Thanks for the explanation.
> And IIUC the
> local_lock is the only protection we have here against concurrency,
> since the slub_get_cpu_ptr() in __slab_alloc() only disables
> migration?
On PREEMPT_RT it disables migration, but on !PREEMPT_RT it's a plain
get_cpu_ptr() that does preempt_disable(). But that's an implementation
detail; disabling migration would be sufficient on !PREEMPT_RT too, but
right now it's cheaper to disable preemption.
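For reference, roughly what mm/slub.c defines (from memory, double-check
the exact code):

	#ifdef CONFIG_PREEMPT_RT
	#define slub_get_cpu_ptr(var)		\
	({					\
		migrate_disable();		\
		this_cpu_ptr(var);		\
	})
	#define slub_put_cpu_ptr(var)		\
	do {					\
		(void)(var);			\
		migrate_enable();		\
	} while (0)
	#else
	#define slub_get_cpu_ptr(var)	get_cpu_ptr(var)	/* preempt_disable() */
	#define slub_put_cpu_ptr(var)	put_cpu_ptr(var)	/* preempt_enable() */
	#endif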
> So again a concurrent fastpath free should be able to set
> c->freelist to non-NULL after c->slab has been set to NULL.
>
> So I think this TID bump is also necessary for correctness in the
> current version.
OK.
> And looking back at older kernels, back to at least 4.9, the
> ALLOC_NODE_MISMATCH case looks similarly broken - except that again,
> as you pointed out, we don't have the fine-grained locking, so it only
> becomes racy if we hit new_slab_objects() -> new_slab() ->
> allocate_slab() and then either we do local_irq_enable() or the
> allocation fails.
>
>> Thanks. Applying to slab/for-5.19-rc3/fixes branch.
>
> Thanks!