* Re: [PATCH 2/2] mm: do not loop over ALLOC_NO_WATERMARKS without triggering reclaim
2015-11-16 13:22 ` [PATCH 2/2] mm: do not loop over ALLOC_NO_WATERMARKS without triggering reclaim mhocko
@ 2015-11-16 21:18 ` David Rientjes
2015-11-17 10:58 ` Tetsuo Handa
` (2 subsequent siblings)
3 siblings, 0 replies; 14+ messages in thread
From: David Rientjes @ 2015-11-16 21:18 UTC (permalink / raw)
To: mhocko; +Cc: Andrew Morton, Mel Gorman, linux-mm, LKML, Michal Hocko
On Mon, 16 Nov 2015, mhocko@kernel.org wrote:
> From: Michal Hocko <mhocko@suse.com>
>
> __alloc_pages_slowpath is looping over ALLOC_NO_WATERMARKS requests if
> __GFP_NOFAIL is requested. This is fragile because we are basically
> relying on somebody else to make the reclaim (be it the direct reclaim
> or OOM killer) for us. The caller might be holding resources (e.g.
> locks) which block other other reclaimers from making any progress for
> example. Remove the retry loop and rely on __alloc_pages_slowpath to
> invoke all allowed reclaim steps and retry logic.
>
> We have to be careful about __GFP_NOFAIL allocations from the
> PF_MEMALLOC context even though this is a very bad idea to begin with
> because no progress can be gurateed at all. We shouldn't break the
> __GFP_NOFAIL semantic here though. It could be argued that this is
> essentially GFP_NOWAIT context which we do not support but PF_MEMALLOC
> is much harder to check for existing users because they might happen
> deep down the code path performed much later after setting the flag
> so we cannot really rule out there is no kernel path triggering this
> combination.
>
> Acked-by: Mel Gorman <mgorman@suse.de>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: David Rientjes <rientjes@google.com>
It'll be scary if anything actually relies on this, but I think it's more
correct.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 14+ messages in thread* Re: [PATCH 2/2] mm: do not loop over ALLOC_NO_WATERMARKS without triggering reclaim
2015-11-16 13:22 ` [PATCH 2/2] mm: do not loop over ALLOC_NO_WATERMARKS without triggering reclaim mhocko
2015-11-16 21:18 ` David Rientjes
@ 2015-11-17 10:58 ` Tetsuo Handa
2015-11-18 9:11 ` Michal Hocko
2015-11-18 14:57 ` Vlastimil Babka
2015-11-23 9:33 ` Michal Hocko
3 siblings, 1 reply; 14+ messages in thread
From: Tetsuo Handa @ 2015-11-17 10:58 UTC (permalink / raw)
To: mhocko, Andrew Morton
Cc: Mel Gorman, David Rientjes, linux-mm, LKML, Michal Hocko
Michal Hocko wrote:
> __alloc_pages_slowpath is looping over ALLOC_NO_WATERMARKS requests if
> __GFP_NOFAIL is requested. This is fragile because we are basically
> relying on somebody else to make the reclaim (be it the direct reclaim
> or OOM killer) for us. The caller might be holding resources (e.g.
> locks) which block other other reclaimers from making any progress for
> example. Remove the retry loop and rely on __alloc_pages_slowpath to
> invoke all allowed reclaim steps and retry logic.
This implies invoking OOM killer, doesn't it?
> /* Avoid recursion of direct reclaim */
> - if (current->flags & PF_MEMALLOC)
> + if (current->flags & PF_MEMALLOC) {
> + /*
> + * __GFP_NOFAIL request from this context is rather bizarre
> + * because we cannot reclaim anything and only can loop waiting
> + * for somebody to do a work for us.
> + */
> + if (WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL)) {
> + cond_resched();
> + goto retry;
I think that this "goto retry;" omits call to out_of_memory() which is allowed
for __GFP_NOFAIL allocations. Even if this is what you meant, current thread
can be a workqueue, which currently need a short sleep (as with
wait_iff_congested() changes), can't it?
> + }
> goto nopage;
> + }
>
> /* Avoid allocations with no watermarks from looping endlessly */
> if (test_thread_flag(TIF_MEMDIE) && !(gfp_mask & __GFP_NOFAIL))
>
Well, is it cond_resched() which should include
if (current->flags & PF_WQ_WORKER)
schedule_timeout(1);
than wait_iff_congested() because not all yield calls use wait_iff_congested()
and giving pending workqueue jobs a chance to be processed is anyway preferable?
int __sched _cond_resched(void)
{
if (should_resched(0)) {
if ((current->flags & PF_WQ_WORKER) && workqueue_has_pending_jobs())
schedule_timeout(1);
else
preempt_schedule_common();
return 1;
}
return 0;
}
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 14+ messages in thread* Re: [PATCH 2/2] mm: do not loop over ALLOC_NO_WATERMARKS without triggering reclaim
2015-11-17 10:58 ` Tetsuo Handa
@ 2015-11-18 9:11 ` Michal Hocko
2015-11-18 9:22 ` Michal Hocko
0 siblings, 1 reply; 14+ messages in thread
From: Michal Hocko @ 2015-11-18 9:11 UTC (permalink / raw)
To: Tetsuo Handa; +Cc: Andrew Morton, Mel Gorman, David Rientjes, linux-mm, LKML
On Tue 17-11-15 19:58:09, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > __alloc_pages_slowpath is looping over ALLOC_NO_WATERMARKS requests if
> > __GFP_NOFAIL is requested. This is fragile because we are basically
> > relying on somebody else to make the reclaim (be it the direct reclaim
> > or OOM killer) for us. The caller might be holding resources (e.g.
> > locks) which block other other reclaimers from making any progress for
> > example. Remove the retry loop and rely on __alloc_pages_slowpath to
> > invoke all allowed reclaim steps and retry logic.
>
> This implies invoking OOM killer, doesn't it?
It does and the changelog is explicit about this.
> > /* Avoid recursion of direct reclaim */
> > - if (current->flags & PF_MEMALLOC)
> > + if (current->flags & PF_MEMALLOC) {
> > + /*
> > + * __GFP_NOFAIL request from this context is rather bizarre
> > + * because we cannot reclaim anything and only can loop waiting
> > + * for somebody to do a work for us.
> > + */
> > + if (WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL)) {
> > + cond_resched();
> > + goto retry;
>
> I think that this "goto retry;" omits call to out_of_memory() which is allowed
> for __GFP_NOFAIL allocations.
It wasn't called for PF_MEMALLOC requests though. Whether invoking OOM
killer is a good idea for this case is a harder question and out of
scope of this patch.
> Even if this is what you meant, current thread
> can be a workqueue, which currently need a short sleep (as with
> wait_iff_congested() changes), can't it?
As the changelog tries to clarify PF_MEMALLOC with __GFP_NOFAIL is
basically a bug. That is the reason I am adding WARN_ON there. I do not
think making this code more complex for abusers/buggy code is really
worthwhile. Besides that I fail to see why a work item would ever
want to set PF_MEMALLOC for legitimate reasons. I have done a quick git
grep over the tree and there doesn't seem to be any user.
>
> > + }
> > goto nopage;
> > + }
> >
> > /* Avoid allocations with no watermarks from looping endlessly */
> > if (test_thread_flag(TIF_MEMDIE) && !(gfp_mask & __GFP_NOFAIL))
> >
>
> Well, is it cond_resched() which should include
>
> if (current->flags & PF_WQ_WORKER)
> schedule_timeout(1);
I believe you are getting off-topic here.
--
Michal Hocko
SUSE Labs
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 14+ messages in thread* Re: [PATCH 2/2] mm: do not loop over ALLOC_NO_WATERMARKS without triggering reclaim
2015-11-18 9:11 ` Michal Hocko
@ 2015-11-18 9:22 ` Michal Hocko
0 siblings, 0 replies; 14+ messages in thread
From: Michal Hocko @ 2015-11-18 9:22 UTC (permalink / raw)
To: Tetsuo Handa; +Cc: Andrew Morton, Mel Gorman, David Rientjes, linux-mm, LKML
On Wed 18-11-15 10:11:01, Michal Hocko wrote:
> Besides that I fail to see why a work item would ever
> want to set PF_MEMALLOC for legitimate reasons. I have done a quick git
> grep over the tree and there doesn't seem to be any user.
OK, I have missed one case. xfs_btree_split_worker is really setting
PF_MEMALLOC from the worker context basically to inherit the flag from
kswapd. This is a legitimate use but it doesn't affect the allocation
path so it is not related to this discussion.
--
Michal Hocko
SUSE Labs
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 2/2] mm: do not loop over ALLOC_NO_WATERMARKS without triggering reclaim
2015-11-16 13:22 ` [PATCH 2/2] mm: do not loop over ALLOC_NO_WATERMARKS without triggering reclaim mhocko
2015-11-16 21:18 ` David Rientjes
2015-11-17 10:58 ` Tetsuo Handa
@ 2015-11-18 14:57 ` Vlastimil Babka
2015-11-18 15:11 ` Michal Hocko
2015-11-23 9:33 ` Michal Hocko
3 siblings, 1 reply; 14+ messages in thread
From: Vlastimil Babka @ 2015-11-18 14:57 UTC (permalink / raw)
To: mhocko, Andrew Morton
Cc: Mel Gorman, David Rientjes, linux-mm, LKML, Michal Hocko
On 11/16/2015 02:22 PM, mhocko@kernel.org wrote:
> From: Michal Hocko <mhocko@suse.com>
>
> __alloc_pages_slowpath is looping over ALLOC_NO_WATERMARKS requests if
> __GFP_NOFAIL is requested. This is fragile because we are basically
> relying on somebody else to make the reclaim (be it the direct reclaim
> or OOM killer) for us. The caller might be holding resources (e.g.
> locks) which block other other reclaimers from making any progress for
> example. Remove the retry loop and rely on __alloc_pages_slowpath to
> invoke all allowed reclaim steps and retry logic.
>
> We have to be careful about __GFP_NOFAIL allocations from the
> PF_MEMALLOC context even though this is a very bad idea to begin with
> because no progress can be gurateed at all. We shouldn't break the
> __GFP_NOFAIL semantic here though. It could be argued that this is
> essentially GFP_NOWAIT context which we do not support but PF_MEMALLOC
> is much harder to check for existing users because they might happen
> deep down the code path performed much later after setting the flag
> so we cannot really rule out there is no kernel path triggering this
> combination.
>
> Acked-by: Mel Gorman <mgorman@suse.de>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
> mm/page_alloc.c | 32 ++++++++++++++++++--------------
> 1 file changed, 18 insertions(+), 14 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index b153fa3d0b9b..df7746280427 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3046,32 +3046,36 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> * allocations are system rather than user orientated
> */
> ac->zonelist = node_zonelist(numa_node_id(), gfp_mask);
> - do {
> - page = get_page_from_freelist(gfp_mask, order,
> - ALLOC_NO_WATERMARKS, ac);
> - if (page)
> - goto got_pg;
> -
> - if (gfp_mask & __GFP_NOFAIL)
> - wait_iff_congested(ac->preferred_zone,
> - BLK_RW_ASYNC, HZ/50);
I've been thinking if the lack of unconditional wait_iff_congested() can affect
something negatively. I guess not?
> - } while (gfp_mask & __GFP_NOFAIL);
> + page = get_page_from_freelist(gfp_mask, order,
> + ALLOC_NO_WATERMARKS, ac);
> + if (page)
> + goto got_pg;
> }
>
> /* Caller is not willing to reclaim, we can't balance anything */
> if (!can_direct_reclaim) {
> /*
> - * All existing users of the deprecated __GFP_NOFAIL are
> - * blockable, so warn of any new users that actually allow this
> - * type of allocation to fail.
> + * All existing users of the __GFP_NOFAIL are blockable, so warn
> + * of any new users that actually allow this type of allocation
> + * to fail.
> */
> WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL);
> goto nopage;
> }
>
> /* Avoid recursion of direct reclaim */
> - if (current->flags & PF_MEMALLOC)
> + if (current->flags & PF_MEMALLOC) {
> + /*
> + * __GFP_NOFAIL request from this context is rather bizarre
> + * because we cannot reclaim anything and only can loop waiting
> + * for somebody to do a work for us.
> + */
> + if (WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL)) {
> + cond_resched();
> + goto retry;
> + }
> goto nopage;
> + }
>
> /* Avoid allocations with no watermarks from looping endlessly */
> if (test_thread_flag(TIF_MEMDIE) && !(gfp_mask & __GFP_NOFAIL))
>
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 14+ messages in thread* Re: [PATCH 2/2] mm: do not loop over ALLOC_NO_WATERMARKS without triggering reclaim
2015-11-18 14:57 ` Vlastimil Babka
@ 2015-11-18 15:11 ` Michal Hocko
2015-11-18 15:19 ` Vlastimil Babka
0 siblings, 1 reply; 14+ messages in thread
From: Michal Hocko @ 2015-11-18 15:11 UTC (permalink / raw)
To: Vlastimil Babka; +Cc: Andrew Morton, Mel Gorman, David Rientjes, linux-mm, LKML
On Wed 18-11-15 15:57:45, Vlastimil Babka wrote:
[...]
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -3046,32 +3046,36 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> > * allocations are system rather than user orientated
> > */
> > ac->zonelist = node_zonelist(numa_node_id(), gfp_mask);
> > - do {
> > - page = get_page_from_freelist(gfp_mask, order,
> > - ALLOC_NO_WATERMARKS, ac);
> > - if (page)
> > - goto got_pg;
> > -
> > - if (gfp_mask & __GFP_NOFAIL)
> > - wait_iff_congested(ac->preferred_zone,
> > - BLK_RW_ASYNC, HZ/50);
>
> I've been thinking if the lack of unconditional wait_iff_congested() can affect
> something negatively. I guess not?
Considering that the wait_iff_congested is removed only for PF_MEMALLOC
with __GFP_NOFAIL which should be non-existent in the kernel then I
think the risk is really low. Even if there was a caller _and_ there
was a congestion then the behavior wouldn't be much more worse than
what we have currently. The system is out of memory hoplessly if
ALLOC_NO_WATERMARKS allocation fails.
--
Michal Hocko
SUSE Labs
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 14+ messages in thread* Re: [PATCH 2/2] mm: do not loop over ALLOC_NO_WATERMARKS without triggering reclaim
2015-11-18 15:11 ` Michal Hocko
@ 2015-11-18 15:19 ` Vlastimil Babka
0 siblings, 0 replies; 14+ messages in thread
From: Vlastimil Babka @ 2015-11-18 15:19 UTC (permalink / raw)
To: Michal Hocko; +Cc: Andrew Morton, Mel Gorman, David Rientjes, linux-mm, LKML
On 11/18/2015 04:11 PM, Michal Hocko wrote:
> On Wed 18-11-15 15:57:45, Vlastimil Babka wrote:
> [...]
>> > --- a/mm/page_alloc.c
>> > +++ b/mm/page_alloc.c
>> > @@ -3046,32 +3046,36 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>> > * allocations are system rather than user orientated
>> > */
>> > ac->zonelist = node_zonelist(numa_node_id(), gfp_mask);
>> > - do {
>> > - page = get_page_from_freelist(gfp_mask, order,
>> > - ALLOC_NO_WATERMARKS, ac);
>> > - if (page)
>> > - goto got_pg;
>> > -
>> > - if (gfp_mask & __GFP_NOFAIL)
>> > - wait_iff_congested(ac->preferred_zone,
>> > - BLK_RW_ASYNC, HZ/50);
>>
>> I've been thinking if the lack of unconditional wait_iff_congested() can affect
>> something negatively. I guess not?
>
> Considering that the wait_iff_congested is removed only for PF_MEMALLOC
> with __GFP_NOFAIL which should be non-existent in the kernel then I
Hm that one won't reach it indeed, but also not loop, so that wasn't my concern.
I was referring to:
/* Keep reclaiming pages as long as there is reasonable progress */
pages_reclaimed += did_some_progress;
if ((did_some_progress && order <= PAGE_ALLOC_COSTLY_ORDER) ||
((gfp_mask & __GFP_REPEAT) && pages_reclaimed < (1 << order))) {
/* Wait for some write requests to complete then retry */
wait_iff_congested(ac->preferred_zone, BLK_RW_ASYNC, HZ/50);
goto retry;
}
Here we might skip the wait_iff_congested and go straight for oom. But it's true
that ordinary allocations that fail to make progress will also not wait, so I
guess it's fine.
Acked-by: Vlastimil Babka <vbabka@suse.cz>
> think the risk is really low. Even if there was a caller _and_ there
> was a congestion then the behavior wouldn't be much more worse than
> what we have currently. The system is out of memory hoplessly if
> ALLOC_NO_WATERMARKS allocation fails.
>
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 2/2] mm: do not loop over ALLOC_NO_WATERMARKS without triggering reclaim
2015-11-16 13:22 ` [PATCH 2/2] mm: do not loop over ALLOC_NO_WATERMARKS without triggering reclaim mhocko
` (2 preceding siblings ...)
2015-11-18 14:57 ` Vlastimil Babka
@ 2015-11-23 9:33 ` Michal Hocko
3 siblings, 0 replies; 14+ messages in thread
From: Michal Hocko @ 2015-11-23 9:33 UTC (permalink / raw)
To: Andrew Morton; +Cc: Mel Gorman, David Rientjes, linux-mm, LKML, Tetsuo Handa
It seems this patch hasn't reached the mmotm tree. Are there any
unresolved concerns left?
On Mon 16-11-15 14:22:19, mhocko@kernel.org wrote:
> From: Michal Hocko <mhocko@suse.com>
>
> __alloc_pages_slowpath is looping over ALLOC_NO_WATERMARKS requests if
> __GFP_NOFAIL is requested. This is fragile because we are basically
> relying on somebody else to make the reclaim (be it the direct reclaim
> or OOM killer) for us. The caller might be holding resources (e.g.
> locks) which block other other reclaimers from making any progress for
> example. Remove the retry loop and rely on __alloc_pages_slowpath to
> invoke all allowed reclaim steps and retry logic.
>
> We have to be careful about __GFP_NOFAIL allocations from the
> PF_MEMALLOC context even though this is a very bad idea to begin with
> because no progress can be gurateed at all. We shouldn't break the
> __GFP_NOFAIL semantic here though. It could be argued that this is
> essentially GFP_NOWAIT context which we do not support but PF_MEMALLOC
> is much harder to check for existing users because they might happen
> deep down the code path performed much later after setting the flag
> so we cannot really rule out there is no kernel path triggering this
> combination.
>
> Acked-by: Mel Gorman <mgorman@suse.de>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
> mm/page_alloc.c | 32 ++++++++++++++++++--------------
> 1 file changed, 18 insertions(+), 14 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index b153fa3d0b9b..df7746280427 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3046,32 +3046,36 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> * allocations are system rather than user orientated
> */
> ac->zonelist = node_zonelist(numa_node_id(), gfp_mask);
> - do {
> - page = get_page_from_freelist(gfp_mask, order,
> - ALLOC_NO_WATERMARKS, ac);
> - if (page)
> - goto got_pg;
> -
> - if (gfp_mask & __GFP_NOFAIL)
> - wait_iff_congested(ac->preferred_zone,
> - BLK_RW_ASYNC, HZ/50);
> - } while (gfp_mask & __GFP_NOFAIL);
> + page = get_page_from_freelist(gfp_mask, order,
> + ALLOC_NO_WATERMARKS, ac);
> + if (page)
> + goto got_pg;
> }
>
> /* Caller is not willing to reclaim, we can't balance anything */
> if (!can_direct_reclaim) {
> /*
> - * All existing users of the deprecated __GFP_NOFAIL are
> - * blockable, so warn of any new users that actually allow this
> - * type of allocation to fail.
> + * All existing users of the __GFP_NOFAIL are blockable, so warn
> + * of any new users that actually allow this type of allocation
> + * to fail.
> */
> WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL);
> goto nopage;
> }
>
> /* Avoid recursion of direct reclaim */
> - if (current->flags & PF_MEMALLOC)
> + if (current->flags & PF_MEMALLOC) {
> + /*
> + * __GFP_NOFAIL request from this context is rather bizarre
> + * because we cannot reclaim anything and only can loop waiting
> + * for somebody to do a work for us.
> + */
> + if (WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL)) {
> + cond_resched();
> + goto retry;
> + }
> goto nopage;
> + }
>
> /* Avoid allocations with no watermarks from looping endlessly */
> if (test_thread_flag(TIF_MEMDIE) && !(gfp_mask & __GFP_NOFAIL))
> --
> 2.6.2
>
--
Michal Hocko
SUSE Labs
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 14+ messages in thread