[BUG] THP allocations escape cpuset when defrag is off

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [BUG] THP allocations escape cpuset when defrag is off
@ 2014-07-23 22:05 Alex Thorlton
  2014-07-23 22:28 ` David Rientjes
  0 siblings, 1 reply; 7+ messages in thread
From: Alex Thorlton @ 2014-07-23 22:05 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: akpm, mgorman, riel, kirill.shutemov, mingo, hughd, lliubbo,
	hannes, rientjes, srivatsa.bhat, dave.hansen, dfults, hedi

Hey everyone,

We're hitting an interesting bug on systems with THP defrag turned off.
It seems that we're able to make very large THP allocations outside of
our cpuset.  Here's the test procedure I've been using:

- Create a mem_exclusive/hardwall cpuset that is restricted to memory
  on one node.
- Turn off swap (swapoff -a).  This step is not explicitly necessary,
  but it appears to speed up the reaction time of the OOM killer
  considerably.
- Turn off THP compaction/defrag.
- Run memhog inside the cpuset.  Tell it to allocate far more memory
  than should be available inside the cpuset.

Quick example:

# cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never
# cat /sys/kernel/mm/transparent_hugepage/defrag
always madvise [never]
# grep "[0-9]" cpu* mem*         <-- from /dev/cpuset/test01
cpu_exclusive:0
cpus:8-15
mem_exclusive:1
mem_hardwall:1
memory_migrate:0
memory_pressure:0
memory_spread_page:1
memory_spread_slab:1
mems:1                           <-- ~32g per node
# cat /proc/self/cpuset
/test01
# memhog 80g > /dev/null
(Runs to completion, which is the bug)

Monitoring 'numactl --hardware' with watch, you can see memhog's
allocations start spilling over onto the other nodes.  Take note that
this can be somewhat intermittent.  Often when running this test
immediately after a boot, the OOM killer will catch memhog and stop it
immediately, but subsequent runs can either run to completion, or at
least soak up good chunks of memory on nodes which they're not supposed
to be permitted to allocate memory on, before being killed.  I'm not
positive on all the factors that influence this timing yet.  It seems to
reproduce very reliably if you toggle swap back and forth with each run:

(Run before this was killed by OOM with swap off)
# swapon -a
# memhog 80g > /dev/null
# swapoff -a
# memhog 80g > /dev/null
(Both of these ran to completion.  Again, a sign of the bug)

After digging through the code quite a bit, I've managed to turn up
something that I think could be the cause of the problem here.  In
alloc_hugepage_vma we send a gfp_mask generated using
alloc_hugepage_gfpmask, which removes the ___GFP_WAIT bit from the
gfp_mask when defrag is off.

Further down in pagefault code path, when we fall back to the slowpath
for allocations (from my testing, this fallback appears to happen around
the same time that we run out of memory on our cpuset's node), we see
that, without the ___GFP_WAIT bit set, we will clear the ALLOC_CPUSET
flag from alloc_flags, which in turn allows us to grab memory from
any node. (See __alloc_pages_slowpath and gfp_to_alloc_flags to see
where ALLOC_CPUSET gets wiped out).

This simple patch seems to keep things inside our cpuset:

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 33514d8..7a05576 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -754,7 +754,7 @@ static int __do_huge_pmd_anonymous_page(struct
mm_struct *mm,

 static inline gfp_t alloc_hugepage_gfpmask(int defrag, gfp_t extra_gfp)
 {
-       return (GFP_TRANSHUGE & ~(defrag ? 0 : __GFP_WAIT)) | extra_gfp;
+       return GFP_TRANSHUGE | extra_gfp;
 }

My debug code shows that certain code paths are still allowing
ALLOC_CPUSET to get pulled off the alloc_flags with the patch, but
monitoring the memory usage shows that we're staying on node, aside from
some very small allocations, which may be other types of allocations that
are not necessarly confined to a cpuset.  Need a bit more research to
confirm that.

So, my question ends up being, why do we wipe out ___GFP_WAIT when
defrag is off?  I'll trust that there is good reason to do that, but, if
so, is the behavior that I'm seeing expected?

Any input is greatly appreciated.  Thanks!

- Alex

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [BUG] THP allocations escape cpuset when defrag is off
  2014-07-23 22:05 [BUG] THP allocations escape cpuset when defrag is off Alex Thorlton
@ 2014-07-23 22:28 ` David Rientjes
  2014-07-23 22:50   ` [patch] mm, thp: do not allow thp faults to avoid cpuset restrictions David Rientjes
  2014-07-23 22:57   ` [BUG] THP allocations escape cpuset when defrag is off Alex Thorlton
  0 siblings, 2 replies; 7+ messages in thread
From: David Rientjes @ 2014-07-23 22:28 UTC (permalink / raw)
  To: Alex Thorlton
  Cc: linux-mm, linux-kernel, akpm, mgorman, riel, kirill.shutemov,
	mingo, hughd, lliubbo, hannes, srivatsa.bhat, dave.hansen,
	dfults, hedi

On Wed, 23 Jul 2014, Alex Thorlton wrote:

> Hey everyone,
> 
> We're hitting an interesting bug on systems with THP defrag turned off.
> It seems that we're able to make very large THP allocations outside of
> our cpuset.  Here's the test procedure I've been using:
> 
> - Create a mem_exclusive/hardwall cpuset that is restricted to memory
>   on one node.
> - Turn off swap (swapoff -a).  This step is not explicitly necessary,
>   but it appears to speed up the reaction time of the OOM killer
>   considerably.
> - Turn off THP compaction/defrag.
> - Run memhog inside the cpuset.  Tell it to allocate far more memory
>   than should be available inside the cpuset.
> 
> Quick example:
> 
> # cat /sys/kernel/mm/transparent_hugepage/enabled
> [always] madvise never
> # cat /sys/kernel/mm/transparent_hugepage/defrag
> always madvise [never]
> # grep "[0-9]" cpu* mem*         <-- from /dev/cpuset/test01
> cpu_exclusive:0
> cpus:8-15
> mem_exclusive:1
> mem_hardwall:1
> memory_migrate:0
> memory_pressure:0
> memory_spread_page:1
> memory_spread_slab:1
> mems:1                           <-- ~32g per node
> # cat /proc/self/cpuset
> /test01
> # memhog 80g > /dev/null
> (Runs to completion, which is the bug)
> 
> Monitoring 'numactl --hardware' with watch, you can see memhog's
> allocations start spilling over onto the other nodes.  Take note that
> this can be somewhat intermittent.  Often when running this test
> immediately after a boot, the OOM killer will catch memhog and stop it
> immediately, but subsequent runs can either run to completion, or at
> least soak up good chunks of memory on nodes which they're not supposed
> to be permitted to allocate memory on, before being killed.  I'm not
> positive on all the factors that influence this timing yet.  It seems to
> reproduce very reliably if you toggle swap back and forth with each run:
> 
> (Run before this was killed by OOM with swap off)
> # swapon -a
> # memhog 80g > /dev/null
> # swapoff -a
> # memhog 80g > /dev/null
> (Both of these ran to completion.  Again, a sign of the bug)
> 
> After digging through the code quite a bit, I've managed to turn up
> something that I think could be the cause of the problem here.  In
> alloc_hugepage_vma we send a gfp_mask generated using
> alloc_hugepage_gfpmask, which removes the ___GFP_WAIT bit from the
> gfp_mask when defrag is off.
> 
> Further down in pagefault code path, when we fall back to the slowpath
> for allocations (from my testing, this fallback appears to happen around
> the same time that we run out of memory on our cpuset's node), we see
> that, without the ___GFP_WAIT bit set, we will clear the ALLOC_CPUSET
> flag from alloc_flags, which in turn allows us to grab memory from
> any node. (See __alloc_pages_slowpath and gfp_to_alloc_flags to see
> where ALLOC_CPUSET gets wiped out).
> 
> This simple patch seems to keep things inside our cpuset:
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 33514d8..7a05576 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -754,7 +754,7 @@ static int __do_huge_pmd_anonymous_page(struct
> mm_struct *mm,
> 
>  static inline gfp_t alloc_hugepage_gfpmask(int defrag, gfp_t extra_gfp)
>  {
> -       return (GFP_TRANSHUGE & ~(defrag ? 0 : __GFP_WAIT)) | extra_gfp;
> +       return GFP_TRANSHUGE | extra_gfp;
>  }
> 
> My debug code shows that certain code paths are still allowing
> ALLOC_CPUSET to get pulled off the alloc_flags with the patch, but
> monitoring the memory usage shows that we're staying on node, aside from
> some very small allocations, which may be other types of allocations that
> are not necessarly confined to a cpuset.  Need a bit more research to
> confirm that.
> 

ALLOC_CPUSET should get stripped for the cases outlined in 
__cpuset_node_allowed_softwall(), specifically for GFP_ATOMIC which does 
not have __GFP_WAIT set.

> So, my question ends up being, why do we wipe out ___GFP_WAIT when
> defrag is off?  I'll trust that there is good reason to do that, but, if
> so, is the behavior that I'm seeing expected?
> 

The intention is to avoid memory compaction (and direct reclaim), 
obviously, which does not run when __GFP_WAIT is not set.  But you're 
exactly right that this abuses the allocflags conversion that allows 
ALLOC_CPUSET to get cleared because it is using the aforementioned 
GFP_ATOMIC exception for cpuset allocation.

We can't use PF_MEMALLOC or TIF_MEMDIE for hugepage allocation because it 
affects the allowed watermarks and nothing else prevents memory compaction 
or direct reclaim from running in the page allocator slowpath.

So it looks like a modification to the page allocator is needed, see 
below.

It's also been a long-standing issue that cpusets and mempolicies are 
ignored by khugepaged that allows memory to be migrated remotely to nodes 
that are not allowed by a cpuset's mems or a mempolicy's nodemask.  Even 
with this issue fixed, you may find that some memory is migrated remotely, 
although it may be negligible, by khugepaged.

 [ We should really rename __GFP_NO_KSWAPD to __GFP_THP and not allow the
   other users to piggyback off it. ]
---
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2447,7 +2447,8 @@ static inline int
 gfp_to_alloc_flags(gfp_t gfp_mask)
 {
 	int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET;
-	const gfp_t wait = gfp_mask & __GFP_WAIT;
+	const bool atomic = (gfp_mask & (__GFP_WAIT | __GFP_NO_KSWAPD)) ==
+			    __GFP_WAIT;
 
 	/* __GFP_HIGH is assumed to be the same as ALLOC_HIGH to save a branch. */
 	BUILD_BUG_ON(__GFP_HIGH != (__force gfp_t) ALLOC_HIGH);
@@ -2456,20 +2457,20 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
 	 * The caller may dip into page reserves a bit more if the caller
 	 * cannot run direct reclaim, or if the caller has realtime scheduling
 	 * policy or is asking for __GFP_HIGH memory.  GFP_ATOMIC requests will
-	 * set both ALLOC_HARDER (!wait) and ALLOC_HIGH (__GFP_HIGH).
+	 * set both ALLOC_HARDER (atomic == true) and ALLOC_HIGH (__GFP_HIGH).
 	 */
 	alloc_flags |= (__force int) (gfp_mask & __GFP_HIGH);
 
-	if (!wait) {
+	if (atomic) {
 		/*
 		 * Not worth trying to allocate harder for
 		 * __GFP_NOMEMALLOC even if it can't schedule.
 		 */
-		if  (!(gfp_mask & __GFP_NOMEMALLOC))
+		if (!(gfp_mask & __GFP_NOMEMALLOC))
 			alloc_flags |= ALLOC_HARDER;
 		/*
-		 * Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc.
-		 * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
+		 * Ignore cpuset for GFP_ATOMIC rather than fail alloc.
+		 * See also cpuset_zone_allowed_softwall() comment.
 		 */
 		alloc_flags &= ~ALLOC_CPUSET;
 	} else if (unlikely(rt_task(current)) && !in_interrupt())

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [patch] mm, thp: do not allow thp faults to avoid cpuset restrictions
  2014-07-23 22:28 ` David Rientjes
@ 2014-07-23 22:50   ` David Rientjes
  2014-07-23 23:20     ` Alex Thorlton
  2014-07-25  9:14     ` Michal Hocko
  2014-07-23 22:57   ` [BUG] THP allocations escape cpuset when defrag is off Alex Thorlton
  1 sibling, 2 replies; 7+ messages in thread
From: David Rientjes @ 2014-07-23 22:50 UTC (permalink / raw)
  To: Alex Thorlton, Andrew Morton
  Cc: linux-mm, linux-kernel, Mel Gorman, Rik van Riel,
	kirill.shutemov, Ingo Molnar, Hugh Dickins, lliubbo,
	Johannes Weiner, srivatsa.bhat, Dave Hansen, dfults, hedi

The page allocator relies on __GFP_WAIT to determine if ALLOC_CPUSET 
should be set in allocflags.  ALLOC_CPUSET controls if a page allocation 
should be restricted only to the set of allowed cpuset mems.

Transparent hugepages clears __GFP_WAIT when defrag is disabled to prevent 
the fault path from using memory compaction or direct reclaim.  Thus, it 
is unfairly able to allocate outside of its cpuset mems restriction as a 
side-effect.

This patch ensures that ALLOC_CPUSET is only cleared when the gfp mask is 
truly GFP_ATOMIC by verifying it is also not a thp allocation.

Reported-by: Alex Thorlton <athorlton@sgi.com>
Cc: stable@vger.kernel.org
Signed-off-by: David Rientjes <rientjes@google.com>
---
 mm/page_alloc.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2447,7 +2447,7 @@ static inline int
 gfp_to_alloc_flags(gfp_t gfp_mask)
 {
 	int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET;
-	const gfp_t wait = gfp_mask & __GFP_WAIT;
+	const bool atomic = !(gfp_mask & (__GFP_WAIT | __GFP_NO_KSWAPD));
 
 	/* __GFP_HIGH is assumed to be the same as ALLOC_HIGH to save a branch. */
 	BUILD_BUG_ON(__GFP_HIGH != (__force gfp_t) ALLOC_HIGH);
@@ -2456,20 +2456,20 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
 	 * The caller may dip into page reserves a bit more if the caller
 	 * cannot run direct reclaim, or if the caller has realtime scheduling
 	 * policy or is asking for __GFP_HIGH memory.  GFP_ATOMIC requests will
-	 * set both ALLOC_HARDER (!wait) and ALLOC_HIGH (__GFP_HIGH).
+	 * set both ALLOC_HARDER (atomic == true) and ALLOC_HIGH (__GFP_HIGH).
 	 */
 	alloc_flags |= (__force int) (gfp_mask & __GFP_HIGH);
 
-	if (!wait) {
+	if (atomic) {
 		/*
-		 * Not worth trying to allocate harder for
-		 * __GFP_NOMEMALLOC even if it can't schedule.
+		 * Not worth trying to allocate harder for __GFP_NOMEMALLOC even
+		 * if it can't schedule.
 		 */
-		if  (!(gfp_mask & __GFP_NOMEMALLOC))
+		if (!(gfp_mask & __GFP_NOMEMALLOC))
 			alloc_flags |= ALLOC_HARDER;
 		/*
-		 * Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc.
-		 * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
+		 * Ignore cpuset mems for GFP_ATOMIC rather than fail, see the
+		 * comment for __cpuset_node_allowed_softwall().
 		 */
 		alloc_flags &= ~ALLOC_CPUSET;
 	} else if (unlikely(rt_task(current)) && !in_interrupt())

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [patch] mm, thp: do not allow thp faults to avoid cpuset restrictions
  2014-07-23 22:50   ` [patch] mm, thp: do not allow thp faults to avoid cpuset restrictions David Rientjes
@ 2014-07-23 23:20     ` Alex Thorlton
  2014-07-25  9:14     ` Michal Hocko
  1 sibling, 0 replies; 7+ messages in thread
From: Alex Thorlton @ 2014-07-23 23:20 UTC (permalink / raw)
  To: David Rientjes
  Cc: Alex Thorlton, Andrew Morton, linux-mm, linux-kernel, Mel Gorman,
	Rik van Riel, kirill.shutemov, Ingo Molnar, Hugh Dickins,
	lliubbo, Johannes Weiner, srivatsa.bhat, Dave Hansen, dfults,
	hedi

On Wed, Jul 23, 2014 at 03:50:09PM -0700, David Rientjes wrote:
> The page allocator relies on __GFP_WAIT to determine if ALLOC_CPUSET 
> should be set in allocflags.  ALLOC_CPUSET controls if a page allocation 
> should be restricted only to the set of allowed cpuset mems.
> 
> Transparent hugepages clears __GFP_WAIT when defrag is disabled to prevent 
> the fault path from using memory compaction or direct reclaim.  Thus, it 
> is unfairly able to allocate outside of its cpuset mems restriction as a 
> side-effect.
> 
> This patch ensures that ALLOC_CPUSET is only cleared when the gfp mask is 
> truly GFP_ATOMIC by verifying it is also not a thp allocation.

Tested.  Works as expected.

Tested-by: Alex Thorlton <athorlton@sgi.com>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [patch] mm, thp: do not allow thp faults to avoid cpuset restrictions
  2014-07-23 22:50   ` [patch] mm, thp: do not allow thp faults to avoid cpuset restrictions David Rientjes
  2014-07-23 23:20     ` Alex Thorlton
@ 2014-07-25  9:14     ` Michal Hocko
  1 sibling, 0 replies; 7+ messages in thread
From: Michal Hocko @ 2014-07-25  9:14 UTC (permalink / raw)
  To: David Rientjes
  Cc: Alex Thorlton, Andrew Morton, linux-mm, linux-kernel, Mel Gorman,
	Rik van Riel, kirill.shutemov, Ingo Molnar, Hugh Dickins,
	lliubbo, Johannes Weiner, srivatsa.bhat, Dave Hansen, dfults,
	hedi

On Wed 23-07-14 15:50:09, David Rientjes wrote:
> The page allocator relies on __GFP_WAIT to determine if ALLOC_CPUSET 
> should be set in allocflags.  ALLOC_CPUSET controls if a page allocation 
> should be restricted only to the set of allowed cpuset mems.
> 
> Transparent hugepages clears __GFP_WAIT when defrag is disabled to prevent 
> the fault path from using memory compaction or direct reclaim.  Thus, it 
> is unfairly able to allocate outside of its cpuset mems restriction as a 
> side-effect.
> 
> This patch ensures that ALLOC_CPUSET is only cleared when the gfp mask is 
> truly GFP_ATOMIC by verifying it is also not a thp allocation.
> 
> Reported-by: Alex Thorlton <athorlton@sgi.com>
> Cc: stable@vger.kernel.org
> Signed-off-by: David Rientjes <rientjes@google.com>

This is an abuse of __GFP_NO_KSWAPD but it also looks like a new gfp
flag would need to be added to do it in other way. No other users seem to
clear GFP_WAIT while using __GFP_NO_KSWAPD AFAICS so this should really
affect only THP allocations.

Reviewed-by: Michal Hocko <mhocko@suse.cz>

> ---
>  mm/page_alloc.c | 16 ++++++++--------
>  1 file changed, 8 insertions(+), 8 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2447,7 +2447,7 @@ static inline int
>  gfp_to_alloc_flags(gfp_t gfp_mask)
>  {
>  	int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET;
> -	const gfp_t wait = gfp_mask & __GFP_WAIT;
> +	const bool atomic = !(gfp_mask & (__GFP_WAIT | __GFP_NO_KSWAPD));
>  
>  	/* __GFP_HIGH is assumed to be the same as ALLOC_HIGH to save a branch. */
>  	BUILD_BUG_ON(__GFP_HIGH != (__force gfp_t) ALLOC_HIGH);
> @@ -2456,20 +2456,20 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
>  	 * The caller may dip into page reserves a bit more if the caller
>  	 * cannot run direct reclaim, or if the caller has realtime scheduling
>  	 * policy or is asking for __GFP_HIGH memory.  GFP_ATOMIC requests will
> -	 * set both ALLOC_HARDER (!wait) and ALLOC_HIGH (__GFP_HIGH).
> +	 * set both ALLOC_HARDER (atomic == true) and ALLOC_HIGH (__GFP_HIGH).
>  	 */
>  	alloc_flags |= (__force int) (gfp_mask & __GFP_HIGH);
>  
> -	if (!wait) {
> +	if (atomic) {
>  		/*
> -		 * Not worth trying to allocate harder for
> -		 * __GFP_NOMEMALLOC even if it can't schedule.
> +		 * Not worth trying to allocate harder for __GFP_NOMEMALLOC even
> +		 * if it can't schedule.
>  		 */
> -		if  (!(gfp_mask & __GFP_NOMEMALLOC))
> +		if (!(gfp_mask & __GFP_NOMEMALLOC))
>  			alloc_flags |= ALLOC_HARDER;
>  		/*
> -		 * Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc.
> -		 * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
> +		 * Ignore cpuset mems for GFP_ATOMIC rather than fail, see the
> +		 * comment for __cpuset_node_allowed_softwall().
>  		 */
>  		alloc_flags &= ~ALLOC_CPUSET;
>  	} else if (unlikely(rt_task(current)) && !in_interrupt())
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [BUG] THP allocations escape cpuset when defrag is off
  2014-07-23 22:28 ` David Rientjes
  2014-07-23 22:50   ` [patch] mm, thp: do not allow thp faults to avoid cpuset restrictions David Rientjes
@ 2014-07-23 22:57   ` Alex Thorlton
  2014-07-23 23:05     ` David Rientjes
  1 sibling, 1 reply; 7+ messages in thread
From: Alex Thorlton @ 2014-07-23 22:57 UTC (permalink / raw)
  To: David Rientjes
  Cc: Alex Thorlton, linux-mm, linux-kernel, akpm, mgorman, riel,
	kirill.shutemov, mingo, hughd, lliubbo, hannes, srivatsa.bhat,
	dave.hansen, dfults, hedi

On Wed, Jul 23, 2014 at 03:28:09PM -0700, David Rientjes wrote:
> > My debug code shows that certain code paths are still allowing
> > ALLOC_CPUSET to get pulled off the alloc_flags with the patch, but
> > monitoring the memory usage shows that we're staying on node, aside from
> > some very small allocations, which may be other types of allocations that
> > are not necessarly confined to a cpuset.  Need a bit more research to
> > confirm that.
> > 
> 
> ALLOC_CPUSET should get stripped for the cases outlined in 
> __cpuset_node_allowed_softwall(), specifically for GFP_ATOMIC which does 
> not have __GFP_WAIT set.

Makes sense.  I knew my patch was probably the wrong way to fix this,
but it did serve my purpose :)

> > So, my question ends up being, why do we wipe out ___GFP_WAIT when
> > defrag is off?  I'll trust that there is good reason to do that, but, if
> > so, is the behavior that I'm seeing expected?
> > 
> 
> The intention is to avoid memory compaction (and direct reclaim), 
> obviously, which does not run when __GFP_WAIT is not set.  But you're 
> exactly right that this abuses the allocflags conversion that allows 
> ALLOC_CPUSET to get cleared because it is using the aforementioned 
> GFP_ATOMIC exception for cpuset allocation.
>
> We can't use PF_MEMALLOC or TIF_MEMDIE for hugepage allocation because it 
> affects the allowed watermarks and nothing else prevents memory compaction 
> or direct reclaim from running in the page allocator slowpath.
> 
> So it looks like a modification to the page allocator is needed, see 
> below.

Looks good to me.  Fixes the problem without affecting any of the other
intended functionality.

> It's also been a long-standing issue that cpusets and mempolicies are 
> ignored by khugepaged that allows memory to be migrated remotely to nodes 
> that are not allowed by a cpuset's mems or a mempolicy's nodemask.  Even 
> with this issue fixed, you may find that some memory is migrated remotely, 
> although it may be negligible, by khugepaged.

A bit here and there is manageable.  There is, of course, some work to
be done there, but for now we're mainly concerned with a job that's
supposed to be confined to a cpuset spilling out and soaking up all the
memory on a machine.

Thanks for the help, David.  Much appreciated!

- Alex

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [BUG] THP allocations escape cpuset when defrag is off
  2014-07-23 22:57   ` [BUG] THP allocations escape cpuset when defrag is off Alex Thorlton
@ 2014-07-23 23:05     ` David Rientjes
  0 siblings, 0 replies; 7+ messages in thread
From: David Rientjes @ 2014-07-23 23:05 UTC (permalink / raw)
  To: Alex Thorlton
  Cc: linux-mm, linux-kernel, akpm, mgorman, riel, kirill.shutemov,
	mingo, hughd, lliubbo, hannes, srivatsa.bhat, dave.hansen,
	dfults, hedi

On Wed, 23 Jul 2014, Alex Thorlton wrote:

> > It's also been a long-standing issue that cpusets and mempolicies are 
> > ignored by khugepaged that allows memory to be migrated remotely to nodes 
> > that are not allowed by a cpuset's mems or a mempolicy's nodemask.  Even 
> > with this issue fixed, you may find that some memory is migrated remotely, 
> > although it may be negligible, by khugepaged.
> 
> A bit here and there is manageable.  There is, of course, some work to
> be done there, but for now we're mainly concerned with a job that's
> supposed to be confined to a cpuset spilling out and soaking up all the
> memory on a machine.
> 

You may find my patch[*] in -mm to be helpful if you enable 
zone_reclaim_mode.  It changes khugepaged so that it is not allowed to 
migrate any memory to a remote node where the distance between the nodes 
is greater than RECLAIM_DISTANCE.

These issues are still pending and we've encountered a couple of them in 
the past weeks ourselves.  The definition of RECLAIM_DISTANCE, currently 
at 30 for x86, is relying on the SLIT to define when remote access is 
costly and there are cases where people need to alter the BIOS to 
workaround this definition.

We can hope that NUMA balancing will solve a lot of these problems for us, 
but there's always a chance that the VM does something totally wrong which 
you've undoubtedly encountered already.

 [*] http://ozlabs.org/~akpm/mmots/broken-out/mm-thp-only-collapse-hugepages-to-nodes-with-affinity-for-zone_reclaim_mode.patch

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2014-07-25  9:15 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-07-23 22:05 [BUG] THP allocations escape cpuset when defrag is off Alex Thorlton
2014-07-23 22:28 ` David Rientjes
2014-07-23 22:50   ` [patch] mm, thp: do not allow thp faults to avoid cpuset restrictions David Rientjes
2014-07-23 23:20     ` Alex Thorlton
2014-07-25  9:14     ` Michal Hocko
2014-07-23 22:57   ` [BUG] THP allocations escape cpuset when defrag is off Alex Thorlton
2014-07-23 23:05     ` David Rientjes

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox