* [BUG] THP allocations escape cpuset when defrag is off @ 2014-07-23 22:05 Alex Thorlton 2014-07-23 22:28 ` David Rientjes 0 siblings, 1 reply; 7+ messages in thread From: Alex Thorlton @ 2014-07-23 22:05 UTC (permalink / raw) To: linux-mm, linux-kernel Cc: akpm, mgorman, riel, kirill.shutemov, mingo, hughd, lliubbo, hannes, rientjes, srivatsa.bhat, dave.hansen, dfults, hedi

Hey everyone,

We're hitting an interesting bug on systems with THP defrag turned off. It seems that we're able to make very large THP allocations outside of our cpuset. Here's the test procedure I've been using:

- Create a mem_exclusive/hardwall cpuset that is restricted to memory on one node.
- Turn off swap (swapoff -a). This step is not strictly necessary, but it appears to speed up the reaction time of the OOM killer considerably.
- Turn off THP compaction/defrag.
- Run memhog inside the cpuset. Tell it to allocate far more memory than should be available inside the cpuset.

Quick example:

# cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never
# cat /sys/kernel/mm/transparent_hugepage/defrag
always madvise [never]
# grep "[0-9]" cpu* mem*    <-- from /dev/cpuset/test01
cpu_exclusive:0
cpus:8-15
mem_exclusive:1
mem_hardwall:1
memory_migrate:0
memory_pressure:0
memory_spread_page:1
memory_spread_slab:1
mems:1    <-- ~32g per node
# cat /proc/self/cpuset
/test01
# memhog 80g > /dev/null
(Runs to completion, which is the bug)

Monitoring 'numactl --hardware' with watch, you can see memhog's allocations start spilling over onto the other nodes. Take note that this can be somewhat intermittent. Often when running this test immediately after a boot, the OOM killer will catch memhog and stop it immediately, but subsequent runs can either run to completion, or at least soak up good chunks of memory on nodes where they're not permitted to allocate memory, before being killed. I'm not positive on all the factors that influence this timing yet.
It seems to reproduce very reliably if you toggle swap back and forth with each run:

(Run before this was killed by OOM with swap off)
# swapon -a
# memhog 80g > /dev/null
# swapoff -a
# memhog 80g > /dev/null
(Both of these ran to completion. Again, a sign of the bug)

After digging through the code quite a bit, I've managed to turn up something that I think could be the cause of the problem here. In alloc_hugepage_vma we send a gfp_mask generated using alloc_hugepage_gfpmask, which removes the __GFP_WAIT bit from the gfp_mask when defrag is off.

Further down in the page-fault code path, when we fall back to the slowpath for allocations (from my testing, this fallback appears to happen around the same time that we run out of memory on our cpuset's node), we see that, without the __GFP_WAIT bit set, we will clear the ALLOC_CPUSET flag from alloc_flags, which in turn allows us to grab memory from any node. (See __alloc_pages_slowpath and gfp_to_alloc_flags to see where ALLOC_CPUSET gets wiped out.)

This simple patch seems to keep things inside our cpuset:

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 33514d8..7a05576 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -754,7 +754,7 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,

 static inline gfp_t alloc_hugepage_gfpmask(int defrag, gfp_t extra_gfp)
 {
-	return (GFP_TRANSHUGE & ~(defrag ? 0 : __GFP_WAIT)) | extra_gfp;
+	return GFP_TRANSHUGE | extra_gfp;
 }

My debug code shows that certain code paths are still allowing ALLOC_CPUSET to get pulled off the alloc_flags with the patch, but monitoring the memory usage shows that we're staying on node, aside from some very small allocations, which may be other types of allocations that are not necessarily confined to a cpuset. Need a bit more research to confirm that.

So, my question ends up being: why do we wipe out __GFP_WAIT when defrag is off? I'll trust that there is good reason to do that, but, if so, is the behavior that I'm seeing expected?
Any input is greatly appreciated. Thanks! - Alex -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [BUG] THP allocations escape cpuset when defrag is off 2014-07-23 22:05 [BUG] THP allocations escape cpuset when defrag is off Alex Thorlton @ 2014-07-23 22:28 ` David Rientjes 2014-07-23 22:50 ` [patch] mm, thp: do not allow thp faults to avoid cpuset restrictions David Rientjes 2014-07-23 22:57 ` [BUG] THP allocations escape cpuset when defrag is off Alex Thorlton 0 siblings, 2 replies; 7+ messages in thread From: David Rientjes @ 2014-07-23 22:28 UTC (permalink / raw) To: Alex Thorlton Cc: linux-mm, linux-kernel, akpm, mgorman, riel, kirill.shutemov, mingo, hughd, lliubbo, hannes, srivatsa.bhat, dave.hansen, dfults, hedi On Wed, 23 Jul 2014, Alex Thorlton wrote: > Hey everyone, > > We're hitting an interesting bug on systems with THP defrag turned off. > It seems that we're able to make very large THP allocations outside of > our cpuset. Here's the test procedure I've been using: > > - Create a mem_exclusive/hardwall cpuset that is restricted to memory > on one node. > - Turn off swap (swapoff -a). This step is not explicitly necessary, > but it appears to speed up the reaction time of the OOM killer > considerably. > - Turn off THP compaction/defrag. > - Run memhog inside the cpuset. Tell it to allocate far more memory > than should be available inside the cpuset. > > Quick example: > > # cat /sys/kernel/mm/transparent_hugepage/enabled > [always] madvise never > # cat /sys/kernel/mm/transparent_hugepage/defrag > always madvise [never] > # grep "[0-9]" cpu* mem* <-- from /dev/cpuset/test01 > cpu_exclusive:0 > cpus:8-15 > mem_exclusive:1 > mem_hardwall:1 > memory_migrate:0 > memory_pressure:0 > memory_spread_page:1 > memory_spread_slab:1 > mems:1 <-- ~32g per node > # cat /proc/self/cpuset > /test01 > # memhog 80g > /dev/null > (Runs to completion, which is the bug) > > Monitoring 'numactl --hardware' with watch, you can see memhog's > allocations start spilling over onto the other nodes. Take note that > this can be somewhat intermittent. 
Often when running this test > immediately after a boot, the OOM killer will catch memhog and stop it > immediately, but subsequent runs can either run to completion, or at > least soak up good chunks of memory on nodes which they're not supposed > to be permitted to allocate memory on, before being killed. I'm not > positive on all the factors that influence this timing yet. It seems to > reproduce very reliably if you toggle swap back and forth with each run: > > (Run before this was killed by OOM with swap off) > # swapon -a > # memhog 80g > /dev/null > # swapoff -a > # memhog 80g > /dev/null > (Both of these ran to completion. Again, a sign of the bug) > > After digging through the code quite a bit, I've managed to turn up > something that I think could be the cause of the problem here. In > alloc_hugepage_vma we send a gfp_mask generated using > alloc_hugepage_gfpmask, which removes the ___GFP_WAIT bit from the > gfp_mask when defrag is off. > > Further down in pagefault code path, when we fall back to the slowpath > for allocations (from my testing, this fallback appears to happen around > the same time that we run out of memory on our cpuset's node), we see > that, without the ___GFP_WAIT bit set, we will clear the ALLOC_CPUSET > flag from alloc_flags, which in turn allows us to grab memory from > any node. (See __alloc_pages_slowpath and gfp_to_alloc_flags to see > where ALLOC_CPUSET gets wiped out). > > This simple patch seems to keep things inside our cpuset: > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > index 33514d8..7a05576 100644 > --- a/mm/huge_memory.c > +++ b/mm/huge_memory.c > @@ -754,7 +754,7 @@ static int __do_huge_pmd_anonymous_page(struct > mm_struct *mm, > > static inline gfp_t alloc_hugepage_gfpmask(int defrag, gfp_t extra_gfp) > { > - return (GFP_TRANSHUGE & ~(defrag ? 
0 : __GFP_WAIT)) | extra_gfp; > + return GFP_TRANSHUGE | extra_gfp; > } > > My debug code shows that certain code paths are still allowing > ALLOC_CPUSET to get pulled off the alloc_flags with the patch, but > monitoring the memory usage shows that we're staying on node, aside from > some very small allocations, which may be other types of allocations that > are not necessarly confined to a cpuset. Need a bit more research to > confirm that. > ALLOC_CPUSET should get stripped for the cases outlined in __cpuset_node_allowed_softwall(), specifically for GFP_ATOMIC which does not have __GFP_WAIT set. > So, my question ends up being, why do we wipe out ___GFP_WAIT when > defrag is off? I'll trust that there is good reason to do that, but, if > so, is the behavior that I'm seeing expected? > The intention is to avoid memory compaction (and direct reclaim), obviously, which does not run when __GFP_WAIT is not set. But you're exactly right that this abuses the allocflags conversion that allows ALLOC_CPUSET to get cleared because it is using the aforementioned GFP_ATOMIC exception for cpuset allocation. We can't use PF_MEMALLOC or TIF_MEMDIE for hugepage allocation because it affects the allowed watermarks and nothing else prevents memory compaction or direct reclaim from running in the page allocator slowpath. So it looks like a modification to the page allocator is needed, see below. It's also been a long-standing issue that cpusets and mempolicies are ignored by khugepaged that allows memory to be migrated remotely to nodes that are not allowed by a cpuset's mems or a mempolicy's nodemask. Even with this issue fixed, you may find that some memory is migrated remotely, although it may be negligible, by khugepaged. [ We should really rename __GFP_NO_KSWAPD to __GFP_THP and not allow the other users to piggyback off it. 
]

---
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2447,7 +2447,8 @@ static inline int
 gfp_to_alloc_flags(gfp_t gfp_mask)
 {
 	int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET;
-	const gfp_t wait = gfp_mask & __GFP_WAIT;
+	const bool atomic = (gfp_mask & (__GFP_WAIT | __GFP_NO_KSWAPD)) ==
+			    __GFP_WAIT;

 	/* __GFP_HIGH is assumed to be the same as ALLOC_HIGH to save a branch. */
 	BUILD_BUG_ON(__GFP_HIGH != (__force gfp_t) ALLOC_HIGH);
@@ -2456,20 +2457,20 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
 	 * The caller may dip into page reserves a bit more if the caller
 	 * cannot run direct reclaim, or if the caller has realtime scheduling
 	 * policy or is asking for __GFP_HIGH memory. GFP_ATOMIC requests will
-	 * set both ALLOC_HARDER (!wait) and ALLOC_HIGH (__GFP_HIGH).
+	 * set both ALLOC_HARDER (atomic == true) and ALLOC_HIGH (__GFP_HIGH).
	 */
 	alloc_flags |= (__force int) (gfp_mask & __GFP_HIGH);

-	if (!wait) {
+	if (atomic) {
 		/*
 		 * Not worth trying to allocate harder for
 		 * __GFP_NOMEMALLOC even if it can't schedule.
 		 */
-		if (!(gfp_mask & __GFP_NOMEMALLOC))
+		if (!(gfp_mask & __GFP_NOMEMALLOC))
 			alloc_flags |= ALLOC_HARDER;
 		/*
-		 * Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc.
-		 * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
+		 * Ignore cpuset for GFP_ATOMIC rather than fail alloc.
+		 * See also cpuset_zone_allowed_softwall() comment.
		 */
		alloc_flags &= ~ALLOC_CPUSET;
	} else if (unlikely(rt_task(current)) && !in_interrupt())

^ permalink raw reply [flat|nested] 7+ messages in thread
* [patch] mm, thp: do not allow thp faults to avoid cpuset restrictions 2014-07-23 22:28 ` David Rientjes @ 2014-07-23 22:50 ` David Rientjes 2014-07-23 23:20 ` Alex Thorlton 2014-07-25 9:14 ` Michal Hocko 2014-07-23 22:57 ` [BUG] THP allocations escape cpuset when defrag is off Alex Thorlton 1 sibling, 2 replies; 7+ messages in thread From: David Rientjes @ 2014-07-23 22:50 UTC (permalink / raw) To: Alex Thorlton, Andrew Morton Cc: linux-mm, linux-kernel, Mel Gorman, Rik van Riel, kirill.shutemov, Ingo Molnar, Hugh Dickins, lliubbo, Johannes Weiner, srivatsa.bhat, Dave Hansen, dfults, hedi

The page allocator relies on __GFP_WAIT to determine if ALLOC_CPUSET should be set in allocflags. ALLOC_CPUSET controls if a page allocation should be restricted only to the set of allowed cpuset mems.

Transparent hugepages clears __GFP_WAIT when defrag is disabled to prevent the fault path from using memory compaction or direct reclaim. Thus, it is unfairly able to allocate outside of its cpuset mems restriction as a side-effect.

This patch ensures that ALLOC_CPUSET is only cleared when the gfp mask is truly GFP_ATOMIC by verifying it is also not a thp allocation.

Reported-by: Alex Thorlton <athorlton@sgi.com>
Cc: stable@vger.kernel.org
Signed-off-by: David Rientjes <rientjes@google.com>
---
 mm/page_alloc.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2447,7 +2447,7 @@ static inline int
 gfp_to_alloc_flags(gfp_t gfp_mask)
 {
 	int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET;
-	const gfp_t wait = gfp_mask & __GFP_WAIT;
+	const bool atomic = !(gfp_mask & (__GFP_WAIT | __GFP_NO_KSWAPD));

 	/* __GFP_HIGH is assumed to be the same as ALLOC_HIGH to save a branch. */
 	BUILD_BUG_ON(__GFP_HIGH != (__force gfp_t) ALLOC_HIGH);
@@ -2456,20 +2456,20 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
 	 * The caller may dip into page reserves a bit more if the caller
 	 * cannot run direct reclaim, or if the caller has realtime scheduling
 	 * policy or is asking for __GFP_HIGH memory. GFP_ATOMIC requests will
-	 * set both ALLOC_HARDER (!wait) and ALLOC_HIGH (__GFP_HIGH).
+	 * set both ALLOC_HARDER (atomic == true) and ALLOC_HIGH (__GFP_HIGH).
	 */
 	alloc_flags |= (__force int) (gfp_mask & __GFP_HIGH);

-	if (!wait) {
+	if (atomic) {
 		/*
-		 * Not worth trying to allocate harder for
-		 * __GFP_NOMEMALLOC even if it can't schedule.
+		 * Not worth trying to allocate harder for __GFP_NOMEMALLOC even
+		 * if it can't schedule.
 		 */
-		if (!(gfp_mask & __GFP_NOMEMALLOC))
+		if (!(gfp_mask & __GFP_NOMEMALLOC))
 			alloc_flags |= ALLOC_HARDER;
 		/*
-		 * Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc.
-		 * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
+		 * Ignore cpuset mems for GFP_ATOMIC rather than fail, see the
+		 * comment for __cpuset_node_allowed_softwall().
		 */
		alloc_flags &= ~ALLOC_CPUSET;
	} else if (unlikely(rt_task(current)) && !in_interrupt())

^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [patch] mm, thp: do not allow thp faults to avoid cpuset restrictions 2014-07-23 22:50 ` [patch] mm, thp: do not allow thp faults to avoid cpuset restrictions David Rientjes @ 2014-07-23 23:20 ` Alex Thorlton 2014-07-25 9:14 ` Michal Hocko 1 sibling, 0 replies; 7+ messages in thread From: Alex Thorlton @ 2014-07-23 23:20 UTC (permalink / raw) To: David Rientjes Cc: Alex Thorlton, Andrew Morton, linux-mm, linux-kernel, Mel Gorman, Rik van Riel, kirill.shutemov, Ingo Molnar, Hugh Dickins, lliubbo, Johannes Weiner, srivatsa.bhat, Dave Hansen, dfults, hedi

On Wed, Jul 23, 2014 at 03:50:09PM -0700, David Rientjes wrote:
> The page allocator relies on __GFP_WAIT to determine if ALLOC_CPUSET
> should be set in allocflags. ALLOC_CPUSET controls if a page allocation
> should be restricted only to the set of allowed cpuset mems.
>
> Transparent hugepages clears __GFP_WAIT when defrag is disabled to prevent
> the fault path from using memory compaction or direct reclaim. Thus, it
> is unfairly able to allocate outside of its cpuset mems restriction as a
> side-effect.
>
> This patch ensures that ALLOC_CPUSET is only cleared when the gfp mask is
> truly GFP_ATOMIC by verifying it is also not a thp allocation.

Tested. Works as expected.

Tested-by: Alex Thorlton <athorlton@sgi.com>

^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [patch] mm, thp: do not allow thp faults to avoid cpuset restrictions 2014-07-23 22:50 ` [patch] mm, thp: do not allow thp faults to avoid cpuset restrictions David Rientjes 2014-07-23 23:20 ` Alex Thorlton @ 2014-07-25 9:14 ` Michal Hocko 1 sibling, 0 replies; 7+ messages in thread From: Michal Hocko @ 2014-07-25 9:14 UTC (permalink / raw) To: David Rientjes Cc: Alex Thorlton, Andrew Morton, linux-mm, linux-kernel, Mel Gorman, Rik van Riel, kirill.shutemov, Ingo Molnar, Hugh Dickins, lliubbo, Johannes Weiner, srivatsa.bhat, Dave Hansen, dfults, hedi On Wed 23-07-14 15:50:09, David Rientjes wrote: > The page allocator relies on __GFP_WAIT to determine if ALLOC_CPUSET > should be set in allocflags. ALLOC_CPUSET controls if a page allocation > should be restricted only to the set of allowed cpuset mems. > > Transparent hugepages clears __GFP_WAIT when defrag is disabled to prevent > the fault path from using memory compaction or direct reclaim. Thus, it > is unfairly able to allocate outside of its cpuset mems restriction as a > side-effect. > > This patch ensures that ALLOC_CPUSET is only cleared when the gfp mask is > truly GFP_ATOMIC by verifying it is also not a thp allocation. > > Reported-by: Alex Thorlton <athorlton@sgi.com> > Cc: stable@vger.kernel.org > Signed-off-by: David Rientjes <rientjes@google.com> This is an abuse of __GFP_NO_KSWAPD but it also looks like a new gfp flag would need to be added to do it in other way. No other users seem to clear GFP_WAIT while using __GFP_NO_KSWAPD AFAICS so this should really affect only THP allocations. 
Reviewed-by: Michal Hocko <mhocko@suse.cz> > --- > mm/page_alloc.c | 16 ++++++++-------- > 1 file changed, 8 insertions(+), 8 deletions(-) > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -2447,7 +2447,7 @@ static inline int > gfp_to_alloc_flags(gfp_t gfp_mask) > { > int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET; > - const gfp_t wait = gfp_mask & __GFP_WAIT; > + const bool atomic = !(gfp_mask & (__GFP_WAIT | __GFP_NO_KSWAPD)); > > /* __GFP_HIGH is assumed to be the same as ALLOC_HIGH to save a branch. */ > BUILD_BUG_ON(__GFP_HIGH != (__force gfp_t) ALLOC_HIGH); > @@ -2456,20 +2456,20 @@ gfp_to_alloc_flags(gfp_t gfp_mask) > * The caller may dip into page reserves a bit more if the caller > * cannot run direct reclaim, or if the caller has realtime scheduling > * policy or is asking for __GFP_HIGH memory. GFP_ATOMIC requests will > - * set both ALLOC_HARDER (!wait) and ALLOC_HIGH (__GFP_HIGH). > + * set both ALLOC_HARDER (atomic == true) and ALLOC_HIGH (__GFP_HIGH). > */ > alloc_flags |= (__force int) (gfp_mask & __GFP_HIGH); > > - if (!wait) { > + if (atomic) { > /* > - * Not worth trying to allocate harder for > - * __GFP_NOMEMALLOC even if it can't schedule. > + * Not worth trying to allocate harder for __GFP_NOMEMALLOC even > + * if it can't schedule. > */ > - if (!(gfp_mask & __GFP_NOMEMALLOC)) > + if (!(gfp_mask & __GFP_NOMEMALLOC)) > alloc_flags |= ALLOC_HARDER; > /* > - * Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc. > - * See also cpuset_zone_allowed() comment in kernel/cpuset.c. > + * Ignore cpuset mems for GFP_ATOMIC rather than fail, see the > + * comment for __cpuset_node_allowed_softwall(). > */ > alloc_flags &= ~ALLOC_CPUSET; > } else if (unlikely(rt_task(current)) && !in_interrupt()) > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . 
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

--
Michal Hocko
SUSE Labs

^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [BUG] THP allocations escape cpuset when defrag is off 2014-07-23 22:28 ` David Rientjes 2014-07-23 22:50 ` [patch] mm, thp: do not allow thp faults to avoid cpuset restrictions David Rientjes @ 2014-07-23 22:57 ` Alex Thorlton 2014-07-23 23:05 ` David Rientjes 1 sibling, 1 reply; 7+ messages in thread From: Alex Thorlton @ 2014-07-23 22:57 UTC (permalink / raw) To: David Rientjes Cc: Alex Thorlton, linux-mm, linux-kernel, akpm, mgorman, riel, kirill.shutemov, mingo, hughd, lliubbo, hannes, srivatsa.bhat, dave.hansen, dfults, hedi On Wed, Jul 23, 2014 at 03:28:09PM -0700, David Rientjes wrote: > > My debug code shows that certain code paths are still allowing > > ALLOC_CPUSET to get pulled off the alloc_flags with the patch, but > > monitoring the memory usage shows that we're staying on node, aside from > > some very small allocations, which may be other types of allocations that > > are not necessarly confined to a cpuset. Need a bit more research to > > confirm that. > > > > ALLOC_CPUSET should get stripped for the cases outlined in > __cpuset_node_allowed_softwall(), specifically for GFP_ATOMIC which does > not have __GFP_WAIT set. Makes sense. I knew my patch was probably the wrong way to fix this, but it did serve my purpose :) > > So, my question ends up being, why do we wipe out ___GFP_WAIT when > > defrag is off? I'll trust that there is good reason to do that, but, if > > so, is the behavior that I'm seeing expected? > > > > The intention is to avoid memory compaction (and direct reclaim), > obviously, which does not run when __GFP_WAIT is not set. But you're > exactly right that this abuses the allocflags conversion that allows > ALLOC_CPUSET to get cleared because it is using the aforementioned > GFP_ATOMIC exception for cpuset allocation. 
> > We can't use PF_MEMALLOC or TIF_MEMDIE for hugepage allocation because it > affects the allowed watermarks and nothing else prevents memory compaction > or direct reclaim from running in the page allocator slowpath. > > So it looks like a modification to the page allocator is needed, see > below. Looks good to me. Fixes the problem without affecting any of the other intended functionality. > It's also been a long-standing issue that cpusets and mempolicies are > ignored by khugepaged that allows memory to be migrated remotely to nodes > that are not allowed by a cpuset's mems or a mempolicy's nodemask. Even > with this issue fixed, you may find that some memory is migrated remotely, > although it may be negligible, by khugepaged. A bit here and there is manageable. There is, of course, some work to be done there, but for now we're mainly concerned with a job that's supposed to be confined to a cpuset spilling out and soaking up all the memory on a machine. Thanks for the help, David. Much appreciated! - Alex -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [BUG] THP allocations escape cpuset when defrag is off 2014-07-23 22:57 ` [BUG] THP allocations escape cpuset when defrag is off Alex Thorlton @ 2014-07-23 23:05 ` David Rientjes 0 siblings, 0 replies; 7+ messages in thread From: David Rientjes @ 2014-07-23 23:05 UTC (permalink / raw) To: Alex Thorlton Cc: linux-mm, linux-kernel, akpm, mgorman, riel, kirill.shutemov, mingo, hughd, lliubbo, hannes, srivatsa.bhat, dave.hansen, dfults, hedi On Wed, 23 Jul 2014, Alex Thorlton wrote: > > It's also been a long-standing issue that cpusets and mempolicies are > > ignored by khugepaged that allows memory to be migrated remotely to nodes > > that are not allowed by a cpuset's mems or a mempolicy's nodemask. Even > > with this issue fixed, you may find that some memory is migrated remotely, > > although it may be negligible, by khugepaged. > > A bit here and there is manageable. There is, of course, some work to > be done there, but for now we're mainly concerned with a job that's > supposed to be confined to a cpuset spilling out and soaking up all the > memory on a machine. > You may find my patch[*] in -mm to be helpful if you enable zone_reclaim_mode. It changes khugepaged so that it is not allowed to migrate any memory to a remote node where the distance between the nodes is greater than RECLAIM_DISTANCE. These issues are still pending and we've encountered a couple of them in the past weeks ourselves. The definition of RECLAIM_DISTANCE, currently at 30 for x86, is relying on the SLIT to define when remote access is costly and there are cases where people need to alter the BIOS to workaround this definition. We can hope that NUMA balancing will solve a lot of these problems for us, but there's always a chance that the VM does something totally wrong which you've undoubtedly encountered already. 
[*] http://ozlabs.org/~akpm/mmots/broken-out/mm-thp-only-collapse-hugepages-to-nodes-with-affinity-for-zone_reclaim_mode.patch

^ permalink raw reply [flat|nested] 7+ messages in thread