Re: [PATCH] fdtable: Avoid triggering OOMs from alloc_fdmem

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: ebiederm@xmission.com (Eric W. Biederman)
To: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	netdev@vger.kernel.org, linux-mm@kvack.org,
	David Rientjes <rientjes@google.com>
Subject: Re: [PATCH] fdtable: Avoid triggering OOMs from alloc_fdmem
Date: Tue, 04 Feb 2014 09:22:40 -0800	[thread overview]
Message-ID: <871tzirdwf.fsf@xmission.com> (raw)
In-Reply-To: <1391530721.4301.8.camel@edumazet-glaptop2.roam.corp.google.com> (Eric Dumazet's message of "Tue, 04 Feb 2014 08:18:41 -0800")

Eric Dumazet <eric.dumazet@gmail.com> writes:

> On Mon, 2014-02-03 at 21:26 -0800, Eric W. Biederman wrote:
>> Recently due to a spike in connections per second memcached on 3
>> separate boxes triggered the OOM killer from accept.  At the time the
>> OOM killer was triggered there was 4GB out of 36GB free in zone 1. The
>> problem was that alloc_fdtable was allocating an order 3 page (32KiB) to
>> hold a bitmap, and there was sufficient fragmentation that the largest
>> page available was 8KiB.
>> 
>> I find the logic that PAGE_ALLOC_COSTLY_ORDER can't fail pretty dubious
>> but I do agree that order 3 allocations are very likely to succeed.
>> 
>> There are always pathologies where order > 0 allocations can fail when
>> there are copious amounts of free memory available.  Using the pigeon
>> hole principle it is easy to show that it requires 1 page more than 50%
>> of the pages being free to guarantee an order 1 (8KiB) allocation will
>> succeed, 1 page more than 75% of the pages being free to guarantee an
>> order 2 (16KiB) allocation will succeed and 1 page more than 87.5% of
>> the pages being free to guarantee an order 3 allocate will succeed.
>> 
>> A server churning memory with a lot of small requests and replies like
>> memcached is a common case that if anything can will skew the odds
>> against large pages being available.
>> 
>> Therefore let's not give external applications a practical way to kill
>> linux server applications, and specify __GFP_NORETRY to the kmalloc in
>> alloc_fdmem.  Unless I am misreading the code and by the time the code
>> reaches should_alloc_retry in __alloc_pages_slowpath (where
>> __GFP_NORETRY becomes signification).  We have already tried everything
>> reasonable to allocate a page and the only thing left to do is wait.  So
>> not waiting and falling back to vmalloc immediately seems like the
>> reasonable thing to do even if there wasn't a chance of triggering the
>> OOM killer.
>> 
>> Cc: stable@vger.kernel.org
>> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
>> ---
>>  fs/file.c |    2 +-
>>  1 files changed, 1 insertions(+), 1 deletions(-)
>> 
>> diff --git a/fs/file.c b/fs/file.c
>> index 771578b33fb6..db25c2bdfe46 100644
>> --- a/fs/file.c
>> +++ b/fs/file.c
>> @@ -34,7 +34,7 @@ static void *alloc_fdmem(size_t size)
>>  	 * vmalloc() if the allocation size will be considered "large" by the VM.
>>  	 */
>>  	if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) {
>> -		void *data = kmalloc(size, GFP_KERNEL|__GFP_NOWARN);
>> +		void *data = kmalloc(size, GFP_KERNEL|__GFP_NOWARN|__GFP_NORETRY);
>>  		if (data != NULL)
>>  			return data;
>>  	}
>
> Hi Eric
>
> I wrote yesterday a similar patch adding __GFP_NORETRY in following
> paths. I feel that alloc_fdmem() is only a part of the problem ;)

These code paths below were triggering OOMs for you?

I looked and didn't see a path flying through the air.

> What do you think, should we merge our changes or have distinct
> patches ?

I don't know about merging changes but certainly looking at the issue
together sounds good.

My gut feel says if there is a code path that has __GFP_NOWARN and
because of PAGE_ALLOC_COSTLY_ORDER we loop forever then there is
something fishy going on.

I would love to hear some people who are more current on the mm
subsystem than I am chime in.  It might be that the darn fix is going to
be to teach __alloc_pages_slowpath to not loop forever, unless order == 0. 
I expect the worst offenders need to have __GFP_NORETRY added so the
knowledge of what is going on spreads, and so we can avoid the danger of
needing to retune the whole mm subsystem that changing
__alloc_pages_slowpath does.

The two code paths below certainly look good canidates for having
__GFP_NORETRY added to them.  The same issues I ran into with
alloc_fdmem are likely to show up there as well.

Eric



> diff --git a/net/core/sock.c b/net/core/sock.c
> index 0c127dcdf6a8..5b6a9431b017 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -1775,7 +1775,9 @@ struct sk_buff *sock_alloc_send_pskb(struct sock *sk, unsigned long header_len,
>  			while (order) {
>  				if (npages >= 1 << order) {
>  					page = alloc_pages(sk->sk_allocation |
> -							   __GFP_COMP | __GFP_NOWARN,
> +							   __GFP_COMP |
> +							   __GFP_NOWARN |
> +							   __GFP_NORETRY,
>  							   order);
>  					if (page)
>  						goto fill_page;
> @@ -1845,7 +1847,7 @@ bool skb_page_frag_refill(unsigned int sz, struct page_frag *pfrag, gfp_t prio)
>  		gfp_t gfp = prio;
>  
>  		if (order)
> -			gfp |= __GFP_COMP | __GFP_NOWARN;
> +			gfp |= __GFP_COMP | __GFP_NOWARN | __GFP_NORETRY;
>  		pfrag->page = alloc_pages(gfp, order);
>  		if (likely(pfrag->page)) {
>  			pfrag->offset = 0;
>
>
>
>
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

next prev parent reply	other threads:[~2014-02-04 17:22 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-02-04  5:26 Eric W. Biederman
2014-02-04 16:18 ` Eric Dumazet
2014-02-04 17:22   ` Eric W. Biederman [this message]
2014-02-04 18:44     ` Eric Dumazet
2014-02-04 18:57       ` Eric W. Biederman
2014-02-04 22:48         ` Eric Dumazet
2014-02-04 21:27     ` David Rientjes
2014-02-04 18:25 ` Cong Wang
2014-02-04 18:53   ` Eric W. Biederman

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=871tzirdwf.fsf@xmission.com \
    --to=ebiederm@xmission.com \
    --cc=akpm@linux-foundation.org \
    --cc=eric.dumazet@gmail.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=netdev@vger.kernel.org \
    --cc=rientjes@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox