linux-mm.kvack.org archive mirror
* [PATCH] slub: correctly bootstrap boot caches
@ 2013-02-22 10:30 Glauber Costa
  2013-02-22 11:15 ` Kamezawa Hiroyuki
  2013-02-22 15:00 ` Christoph Lameter
  0 siblings, 2 replies; 10+ messages in thread
From: Glauber Costa @ 2013-02-22 10:30 UTC (permalink / raw)
  To: linux-mm
  Cc: cgroups, linux-kernel, Glauber Costa, Christoph Lameter,
	Andrew Morton, Tejun Heo, Pekka Enberg, Kamezawa Hiroyuki

After we create a boot cache, we may allocate from it until it is bootstrapped.
This will move the page from the partial list to the cpu slab list. If this
happens, the loop:

	list_for_each_entry(p, &n->partial, lru)

that we use to scan all partial pages will yield nothing, and the pages will
keep pointing to the boot cpu cache, which is, of course, invalid. To fix
this, we should flush the cache to make sure that the cpu slab is back on the
partial list.

Although not verified in practice, I also point out that in this case it is
not safe to scan the full list only when debugging is on. As unlikely as it
is, it is theoretically possible for the pages to be full. If they are, they
will become unreachable. Aside from scanning the full list, we also need to
make sure that the pages indeed sit there: the easiest way to do that is to
make sure the boot caches have the SLAB_STORE_USER debug flag set.

Signed-off-by: Glauber Costa <glommer@parallels.com>
Reported-by: Steffen Michalke <StMichalke@web.de>
Cc: Christoph Lameter <cl@linux.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>


---
 mm/slub.c | 21 +++++++++++++++++----
 1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index ba2ca53..ab372c8 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3617,6 +3617,12 @@ static struct kmem_cache * __init bootstrap(struct kmem_cache *static_cache)
 
 	memcpy(s, static_cache, kmem_cache->object_size);
 
+	/*
+	 * This runs very early, and only the boot processor is supposed to be
+	 * up.  Even if it weren't true, IRQs are not up so we couldn't fire
+	 * IPIs around.
+	 */
+	__flush_cpu_slab(s, smp_processor_id());
 	for_each_node_state(node, N_NORMAL_MEMORY) {
 		struct kmem_cache_node *n = get_node(s, node);
 		struct page *p;
@@ -3625,12 +3631,13 @@ static struct kmem_cache * __init bootstrap(struct kmem_cache *static_cache)
 			list_for_each_entry(p, &n->partial, lru)
 				p->slab_cache = s;
 
-#ifdef CONFIG_SLUB_DEBUG
 			list_for_each_entry(p, &n->full, lru)
 				p->slab_cache = s;
-#endif
 		}
 	}
+
+	/* No longer needs to keep track of the full list */
+	s->flags &= ~SLAB_STORE_USER;
 	list_add(&s->list, &slab_caches);
 	return s;
 }
@@ -3648,8 +3655,14 @@ void __init kmem_cache_init(void)
 	kmem_cache_node = &boot_kmem_cache_node;
 	kmem_cache = &boot_kmem_cache;
 
+	/*
+	 * We want to keep early pages in the full list because of the
+	 * bootstrap process. If we don't do it, those pages become unreachable
+	 * and we can't update their page->slab_cache information.
+	 */
 	create_boot_cache(kmem_cache_node, "kmem_cache_node",
-		sizeof(struct kmem_cache_node), SLAB_HWCACHE_ALIGN);
+			sizeof(struct kmem_cache_node),
+			SLAB_HWCACHE_ALIGN | SLAB_STORE_USER);
 
 	hotplug_memory_notifier(slab_memory_callback, SLAB_CALLBACK_PRI);
 
@@ -3659,7 +3672,7 @@ void __init kmem_cache_init(void)
 	create_boot_cache(kmem_cache, "kmem_cache",
 			offsetof(struct kmem_cache, node) +
 				nr_node_ids * sizeof(struct kmem_cache_node *),
-		       SLAB_HWCACHE_ALIGN);
+		       SLAB_HWCACHE_ALIGN | SLAB_STORE_USER);
 
 	kmem_cache = bootstrap(&boot_kmem_cache);
 
-- 
1.8.1.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] slub: correctly bootstrap boot caches
  2013-02-22 10:30 [PATCH] slub: correctly bootstrap boot caches Glauber Costa
@ 2013-02-22 11:15 ` Kamezawa Hiroyuki
  2013-02-22 16:20   ` Glauber Costa
  2013-02-22 15:00 ` Christoph Lameter
  1 sibling, 1 reply; 10+ messages in thread
From: Kamezawa Hiroyuki @ 2013-02-22 11:15 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-mm, cgroups, linux-kernel, Christoph Lameter,
	Andrew Morton, Tejun Heo, Pekka Enberg

(2013/02/22 19:30), Glauber Costa wrote:
> After we create a boot cache, we may allocate from it until it is bootstrapped.
> This will move the page from the partial list to the cpu slab list. If this
> happens, the loop:
> 
> 	list_for_each_entry(p, &n->partial, lru)
> 
> that we use to scan all partial pages will yield nothing, and the pages will
> keep pointing to the boot cpu cache, which is, of course, invalid. To fix
> this, we should flush the cache to make sure that the cpu slab is back on the
> partial list.
> 
> Although not verified in practice, I also point out that in this case it is
> not safe to scan the full list only when debugging is on. As unlikely as it
> is, it is theoretically possible for the pages to be full. If they are, they
> will become unreachable. Aside from scanning the full list, we also need to
> make sure that the pages indeed sit there: the easiest way to do that is to
> make sure the boot caches have the SLAB_STORE_USER debug flag set.
> 
> Signed-off-by: Glauber Costa <glommer@parallels.com>
> Reported-by:  Steffen Michalke <StMichalke@web.de>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Pekka Enberg <penberg@kernel.org>
> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
You're quick :) The issue is fixed in my environment.

Tested-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] slub: correctly bootstrap boot caches
  2013-02-22 10:30 [PATCH] slub: correctly bootstrap boot caches Glauber Costa
  2013-02-22 11:15 ` Kamezawa Hiroyuki
@ 2013-02-22 15:00 ` Christoph Lameter
  2013-02-22 15:09   ` Glauber Costa
  1 sibling, 1 reply; 10+ messages in thread
From: Christoph Lameter @ 2013-02-22 15:00 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-mm, cgroups, linux-kernel, Andrew Morton, Tejun Heo,
	Pekka Enberg, Kamezawa Hiroyuki

On Fri, 22 Feb 2013, Glauber Costa wrote:

> Although not verified in practice, I also point out that in this case it is
> not safe to scan the full list only when debugging is on. As unlikely as it
> is, it is theoretically possible for the pages to be full. If they are, they
> will become unreachable. Aside from scanning the full list, we also need to
> make sure that the pages indeed sit there: the easiest way to do that is to
> make sure the boot caches have the SLAB_STORE_USER debug flag set.

SLAB_STORE_USER typically increases the size of the managed object. It is
not available when slab debugging is not compiled in. There is no list of
full slab pages maintained in the non-debug case, and if the allocator is
compiled without debug support, the code to manage full lists will not be
present either.

Only one or two kmem_cache items are allocated in the bootstrap code, and so
far the size of the objects was significantly smaller than the page size. So
the slab pages will be on the partial lists. Why are your slab management
structures so large that a page can no longer contain multiple objects?

If you have that issue then I would suggest finding another way to track
the early objects (you could, e.g., determine the page addresses from the
objects and go through the slab pages like that).


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] slub: correctly bootstrap boot caches
  2013-02-22 15:00 ` Christoph Lameter
@ 2013-02-22 15:09   ` Glauber Costa
  2013-02-22 15:39     ` Christoph Lameter
  0 siblings, 1 reply; 10+ messages in thread
From: Glauber Costa @ 2013-02-22 15:09 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-mm, cgroups, linux-kernel, Andrew Morton, Tejun Heo,
	Pekka Enberg, Kamezawa Hiroyuki

On 02/22/2013 07:00 PM, Christoph Lameter wrote:
> On Fri, 22 Feb 2013, Glauber Costa wrote:
> 
>> Although not verified in practice, I also point out that in this case it is
>> not safe to scan the full list only when debugging is on. As unlikely as it
>> is, it is theoretically possible for the pages to be full. If they are, they
>> will become unreachable. Aside from scanning the full list, we also need to
>> make sure that the pages indeed sit there: the easiest way to do that is to
>> make sure the boot caches have the SLAB_STORE_USER debug flag set.
> 
> SLAB_STORE_USER typically increases the size of the managed object. It is
> not available when slab debugging is not compiled in. There is no list of
> full slab pages maintained in the non-debug case, and if the allocator is
> compiled without debug support, the code to manage full lists will not be
> present either.
> 
> Only one or two kmem_cache items are allocated in the bootstrap code, and so
> far the size of the objects was significantly smaller than the page size. So
> the slab pages will be on the partial lists. Why are your slab management
> structures so large that a page can no longer contain multiple objects?
> 
They are not.

As I've mentioned in the description, the real bug comes from a partial slab
being temporarily held as the cpu slab after a recent allocation, and
therefore being unreachable through the partial list.

I've just read the code, and it seemed to me that theoretically that
could happen. I agree with you that this is an unlikely scenario and if
you prefer I can resend the patch without that part.

Would that be preferable?


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] slub: correctly bootstrap boot caches
  2013-02-22 15:09   ` Glauber Costa
@ 2013-02-22 15:39     ` Christoph Lameter
  2013-02-22 15:45       ` Glauber Costa
  0 siblings, 1 reply; 10+ messages in thread
From: Christoph Lameter @ 2013-02-22 15:39 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-mm, cgroups, linux-kernel, Andrew Morton, Tejun Heo,
	Pekka Enberg, Kamezawa Hiroyuki

On Fri, 22 Feb 2013, Glauber Costa wrote:

> As I've mentioned in the description, the real bug comes from a partial slab
> being temporarily held as the cpu slab after a recent allocation, and
> therefore being unreachable through the partial list.

The bootstrap code does not use cpu slabs but goes directly to the slab
pages. See early_kmem_cache_node_alloc.


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] slub: correctly bootstrap boot caches
  2013-02-22 15:39     ` Christoph Lameter
@ 2013-02-22 15:45       ` Glauber Costa
  2013-02-22 16:10         ` Christoph Lameter
  0 siblings, 1 reply; 10+ messages in thread
From: Glauber Costa @ 2013-02-22 15:45 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-mm, cgroups, linux-kernel, Andrew Morton, Tejun Heo,
	Pekka Enberg, Kamezawa Hiroyuki

On 02/22/2013 07:39 PM, Christoph Lameter wrote:
> On Fri, 22 Feb 2013, Glauber Costa wrote:
> 
>> As I've mentioned in the description, the real bug comes from a partial slab
>> being temporarily held as the cpu slab after a recent allocation, and
>> therefore being unreachable through the partial list.
> 
> The bootstrap code does not use cpu slabs but goes directly to the slab
> pages. See early_kmem_cache_node_alloc.
> 

That differs from what I am seeing here.
I can trace an early __slab_alloc allocation from
the kmem_cache_node cache, very likely coming from the kmem_cache boot
cache creation. It takes the page out of the partial list and moves it
to the cpu_slab. After that, that particular page becomes unreachable
for bootstrap.

At this point, we are already at slab_state == PARTIAL, while
init_kmem_cache_nodes only special-cases slab_state == DOWN.




^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] slub: correctly bootstrap boot caches
  2013-02-22 15:45       ` Glauber Costa
@ 2013-02-22 16:10         ` Christoph Lameter
  2013-02-22 16:11           ` Glauber Costa
  0 siblings, 1 reply; 10+ messages in thread
From: Christoph Lameter @ 2013-02-22 16:10 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-mm, cgroups, linux-kernel, Andrew Morton, Tejun Heo,
	Pekka Enberg, Kamezawa Hiroyuki

On Fri, 22 Feb 2013, Glauber Costa wrote:

> At this point, we are already at slab_state == PARTIAL, while
> init_kmem_cache_nodes only special-cases slab_state == DOWN.

kmem_cache_node creation runs before PARTIAL and kmem_cache runs
after. So there would be 2 kmem_cache_node structures allocated. Ok so
that would use cpu slabs and therefore remove pages from the partial list.
Pushing that back using the flushing should fix this. But I thought there
was already code that went through the cpu slabs to address this?



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] slub: correctly bootstrap boot caches
  2013-02-22 16:10         ` Christoph Lameter
@ 2013-02-22 16:11           ` Glauber Costa
  2013-02-22 16:34             ` Christoph Lameter
  0 siblings, 1 reply; 10+ messages in thread
From: Glauber Costa @ 2013-02-22 16:11 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-mm, cgroups, linux-kernel, Andrew Morton, Tejun Heo,
	Pekka Enberg, Kamezawa Hiroyuki

On 02/22/2013 08:10 PM, Christoph Lameter wrote:
> kmem_cache_node creation runs before PARTIAL and kmem_cache runs
> after. So there would be 2 kmem_cache_node structures allocated. Ok so
> that would use cpu slabs and therefore remove pages from the partial list.
> Pushing that back using the flushing should fix this. But I thought there
> was already code that went through the cpu slabs to address this?

Not in bootstrap(), which is quite primitive (and should remain so).


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] slub: correctly bootstrap boot caches
  2013-02-22 11:15 ` Kamezawa Hiroyuki
@ 2013-02-22 16:20   ` Glauber Costa
  0 siblings, 0 replies; 10+ messages in thread
From: Glauber Costa @ 2013-02-22 16:20 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: linux-mm, cgroups, linux-kernel, Christoph Lameter,
	Andrew Morton, Tejun Heo, Pekka Enberg

On 02/22/2013 03:15 PM, Kamezawa Hiroyuki wrote:
> Tested-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
I took the liberty of keeping this, even with changes in the patch,
because I didn't change anything related to the root cause of the
problem.

Let me know if you object.


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] slub: correctly bootstrap boot caches
  2013-02-22 16:11           ` Glauber Costa
@ 2013-02-22 16:34             ` Christoph Lameter
  0 siblings, 0 replies; 10+ messages in thread
From: Christoph Lameter @ 2013-02-22 16:34 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-mm, cgroups, linux-kernel, Andrew Morton, Tejun Heo,
	Pekka Enberg, Kamezawa Hiroyuki, Joonsoo Kim

On Fri, 22 Feb 2013, Glauber Costa wrote:

> On 02/22/2013 08:10 PM, Christoph Lameter wrote:
> > kmem_cache_node creation runs before PARTIAL and kmem_cache runs
> > after. So there would be 2 kmem_cache_node structures allocated. Ok so
> > that would use cpu slabs and therefore remove pages from the partial list.
> > Pushing that back using the flushing should fix this. But I thought there
> > was already code that went through the cpu slabs to address this?
>
> Not in bootstrap(), which is quite primitive (and should remain so).

Joonsoo Kim had a patch for this. I acked it a while back, AFAICR.



^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2013-02-22 16:34 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-02-22 10:30 [PATCH] slub: correctly bootstrap boot caches Glauber Costa
2013-02-22 11:15 ` Kamezawa Hiroyuki
2013-02-22 16:20   ` Glauber Costa
2013-02-22 15:00 ` Christoph Lameter
2013-02-22 15:09   ` Glauber Costa
2013-02-22 15:39     ` Christoph Lameter
2013-02-22 15:45       ` Glauber Costa
2013-02-22 16:10         ` Christoph Lameter
2013-02-22 16:11           ` Glauber Costa
2013-02-22 16:34             ` Christoph Lameter
