From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-yw0-f198.google.com (mail-yw0-f198.google.com [209.85.161.198])
	by kanga.kvack.org (Postfix) with ESMTP id ACE0D83102
	for ; Mon, 29 Aug 2016 13:52:52 -0400 (EDT)
Received: by mail-yw0-f198.google.com with SMTP id r9so295427094ywg.0
	for ; Mon, 29 Aug 2016 10:52:52 -0700 (PDT)
Received: from mail-qt0-x22e.google.com (mail-qt0-x22e.google.com. [2607:f8b0:400d:c0d::22e])
	by mx.google.com with ESMTPS id l67si24014990qkc.231.2016.08.29.10.52.48
	for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128);
	Mon, 29 Aug 2016 10:52:51 -0700 (PDT)
Received: by mail-qt0-x22e.google.com with SMTP id 52so72204930qtq.3
	for ; Mon, 29 Aug 2016 10:52:48 -0700 (PDT)
Message-ID: <1472493162.16070.10.camel@poochiereds.net>
Subject: Re: OOM detection regressions since 4.7
From: Jeff Layton
Date: Mon, 29 Aug 2016 13:52:42 -0400
In-Reply-To:
References: <20160822093249.GA14916@dhcp22.suse.cz>
	<20160822093707.GG13596@dhcp22.suse.cz>
	<20160822100528.GB11890@kroah.com>
	<20160822105441.GH13596@dhcp22.suse.cz>
	<20160822133114.GA15302@kroah.com>
	<20160822134227.GM13596@dhcp22.suse.cz>
	<20160822150517.62dc7cce74f1af6c1f204549@linux-foundation.org>
	<20160823074339.GB23577@dhcp22.suse.cz>
	<20160825071103.GC4230@dhcp22.suse.cz>
	<20160825071728.GA3169@aepfle.de>
	<20160829145203.GA30660@aepfle.de>
Content-Type: text/plain; charset="UTF-8"
Mime-Version: 1.0
Content-Transfer-Encoding: 8bit
Sender: owner-linux-mm@kvack.org
List-ID:
To: Linus Torvalds, Olaf Hering, Bruce Fields
Cc: Michal Hocko, Andrew Morton, Markus Trippelsdorf, Arkadiusz Miskiewicz,
	Ralf-Peter Rohbeck, Jiri Slaby, Greg KH, Vlastimil Babka, Joonsoo Kim,
	linux-mm, LKML, Linux NFS Mailing List

On Mon, 2016-08-29 at 10:28 -0700, Linus Torvalds wrote:
> On Mon, Aug 29, 2016 at 7:52 AM, Olaf Hering wrote:
> >
> > Today I noticed the nfsserver was disabled, probably since months already.
> > Starting it gives an OOM, not sure if this is new with 4.7+.
>
> That's not an oom, that's just an allocation failure.
>
> And with order-4, that's actually pretty normal. Nobody should use
> order-4 (that's 16 contiguous pages, fragmentation can easily make
> that hard - *much* harder than the small order-1 or order-2 cases that
> we should largely be able to rely on).
>
> In fact, people who do multi-order allocations should always have a
> fallback, and use __GFP_NOWARN.
>
> > [93348.306406] Call Trace:
> > [93348.306490]  [] __alloc_pages_slowpath+0x1af/0xa10
> > [93348.306501]  [] __alloc_pages_nodemask+0x250/0x290
> > [93348.306511]  [] cache_grow_begin+0x8d/0x540
> > [93348.306520]  [] fallback_alloc+0x161/0x200
> > [93348.306530]  [] __kmalloc+0x1d2/0x570
> > [93348.306589]  [] nfsd_reply_cache_init+0xaa/0x110 [nfsd]
>
> Hmm. That's kmalloc itself falling back after already failing to grow
> the slab cache earlier (the earlier allocations *were* done with
> NOWARN afaik).
>
> It does look like nfsd starts out by allocating the hash table with one
> single fairly big allocation, and has no fallback position.
>
> I suspect the code expects to be started at boot time, when this just
> isn't an issue. The fact that you loaded the nfsd kernel module with
> memory already fragmented after heavy use is likely why nobody else
> has seen this.
>
> Adding the nfsd people to the cc, because just from a robustness
> standpoint I suspect it would be better if the code did something like
>
>  (a) shrink the hash table if the allocation fails (we've got some
> examples of that elsewhere)
>
> or
>
>  (b) fall back on a vmalloc allocation (that's certainly the simpler model)
>
> We do have a "kvfree()" helper function for the "free either a kmalloc
> or vmalloc allocation" but we don't actually have a good helper
> pattern for the allocation side.
> People just do it by hand, at least
> partly because we have so many different ways to allocate things -
> zeroing, non-zeroing, node-specific or not, atomic or not (atomic
> cannot fall back to vmalloc, obviously) etc etc.
>
> Bruce, Jeff, comments?
>
>              Linus

Yeah, that makes total sense.

Hmm...we _do_ already auto-size the hash at init time, so shrinking it
downward and retrying if the allocation fails wouldn't be hard to do.
Maybe I can just cut it in half and throw a pr_warn to tell the admin in
that case.

In any case...I'll take a look at how we can improve it. Thanks for the
heads-up!
-- 
Jeff Layton
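
For illustration only, the "cut it in half and retry" idea could look roughly
like the userspace sketch below. This is not the nfsd code. The names
alloc_contig(), max_contig_bytes, struct bucket and alloc_hash_table() are all
hypothetical; calloc() with an artificial size cap stands in for a kmalloc()
that fails on a fragmented system, and fprintf(stderr, ...) stands in for
pr_warn.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for a fragmented allocator: any request larger
 * than this many bytes fails, the way a high-order kmalloc() can. */
static size_t max_contig_bytes = 16 * 1024;

static void *alloc_contig(size_t bytes)
{
	if (bytes > max_contig_bytes)
		return NULL;
	return calloc(1, bytes);
}

/* Hypothetical hash bucket; a real table would hold list heads, locks, etc. */
struct bucket {
	void *head;
};

/*
 * Try to allocate *entries buckets. On failure, halve the request and
 * retry, giving up once the size would drop below min_entries.
 * On success, *entries is updated to the size actually obtained.
 * On total failure, NULL is returned and *entries is left untouched.
 */
struct bucket *alloc_hash_table(size_t *entries, size_t min_entries)
{
	size_t want = *entries;

	assert(min_entries > 0 && want >= min_entries);

	for (size_t n = want; n >= min_entries; n /= 2) {
		struct bucket *tbl = alloc_contig(n * sizeof(*tbl));

		if (tbl) {
			if (n != want)
				fprintf(stderr,
					"hash table shrunk from %zu to %zu entries\n",
					want, n);
			*entries = n;
			return tbl;
		}
	}
	return NULL;
}
```

A caller would put its preferred size in a variable, pass its address along
with a lower bound, e.g. alloc_hash_table(&entries, 64), and then derive the
hash mask from whatever value is left in entries. Halving keeps the entry
count a power of two as long as the starting count is one.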