From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wi0-f181.google.com (mail-wi0-f181.google.com [209.85.212.181]) by kanga.kvack.org (Postfix) with ESMTP id 006836B006E for ; Wed, 18 Mar 2015 08:23:47 -0400 (EDT) Received: by wibg7 with SMTP id g7so88745157wib.1 for ; Wed, 18 Mar 2015 05:23:46 -0700 (PDT) Received: from mx2.suse.de (cantor2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id cz10si3350428wib.78.2015.03.18.05.23.44 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Wed, 18 Mar 2015 05:23:45 -0700 (PDT) Date: Wed, 18 Mar 2015 13:23:43 +0100 From: Michal Hocko Subject: Re: [PATCH 1/2 v2] mm: Allow small allocations to fail Message-ID: <20150318122343.GF17241@dhcp22.suse.cz> References: <20150315121317.GA30685@dhcp22.suse.cz> <201503152206.AGJ22930.HOStFFFQLVMOOJ@I-love.SAKURA.ne.jp> <20150316074607.GA24885@dhcp22.suse.cz> <201503172013.HCI87500.QFHtOOMLOVFSJF@I-love.SAKURA.ne.jp> <20150317131501.GH28112@dhcp22.suse.cz> <201503182033.CFI43269.FOJFOFtQHLSOMV@I-love.SAKURA.ne.jp> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <201503182033.CFI43269.FOJFOFtQHLSOMV@I-love.SAKURA.ne.jp> Sender: owner-linux-mm@kvack.org List-ID: To: Tetsuo Handa Cc: akpm@linux-foundation.org, hannes@cmpxchg.org, david@fromorbit.com, mgorman@suse.de, riel@redhat.com, fengguang.wu@intel.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org On Wed 18-03-15 20:33:03, Tetsuo Handa wrote: > Michal Hocko wrote: > > > Tetsuo Handa wrote: > > > > I also tested on XFS. One is Linux 3.19 and the other is Linux 3.19 > > > > with debug printk patch shown above. According to console logs, > > > > oom_kill_process() is trivially called via pagefault_out_of_memory() > > > > for the former kernel. Due to giving up !GFP_FS allocations immediately? > > > > > > > > (From http://I-love.SAKURA.ne.jp/tmp/serial-20150223-3.19-xfs-unpatched.txt.xz ) > > > > ---------- xfs / Linux 3.19 ---------- > > > > [ 793.283099] su invoked oom-killer: gfp_mask=0x0, order=0, oom_score_adj=0 > > > > [ 793.283102] su cpuset=/ mems_allowed=0 > > > > [ 793.283104] CPU: 3 PID: 9552 Comm: su Not tainted 3.19.0 #40 > > > > [ 793.283159] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013 > > > > [ 793.283161] 0000000000000000 ffff88007ac03bf8 ffffffff816ae9d4 000000000000bebe > > > > [ 793.283162] ffff880078b0d740 ffff88007ac03c98 ffffffff816ac7ac 0000000000000206 > > > > [ 793.283163] 0000000481f30298 ffff880073e55850 ffff88007ac03c88 ffff88007a20bef8 > > > > [ 793.283164] Call Trace: > > > > [ 793.283169] [] dump_stack+0x45/0x57 > > > > [ 793.283171] [] dump_header+0x7f/0x1f1 > > > > [ 793.283174] [] oom_kill_process+0x22b/0x390 > > > > [ 793.283177] [] ? has_capability_noaudit+0x20/0x30 > > > > [ 793.283178] [] out_of_memory+0x4b2/0x500 > > > > [ 793.283179] [] pagefault_out_of_memory+0x77/0x90 > > > > [ 793.283180] [] mm_fault_error+0x67/0x140 > > > > [ 793.283182] [] __do_page_fault+0x3f6/0x580 > > > > [ 793.283185] [] ? remove_wait_queue+0x4d/0x60 > > > > [ 793.283186] [] ? do_wait+0x12b/0x240 > > > > [ 793.283187] [] do_page_fault+0x31/0x70 > > > > [ 793.283189] [] page_fault+0x28/0x30 > > > > ---------- xfs / Linux 3.19 ---------- > > > > > > Are all memory allocations caused by page fault __GFP_FS allocation? > > > > They should be GFP_HIGHUSER_MOVABLE or GFP_KERNEL. There should be no > > reason to have GFP_NOFS there because the page fault doesn't come from a > > fs path. > > Excuse me, but are you sure? I am seeing 0x2015a (!__GFP_NOFS) allocation > failures from page fault. SystemTap also reports that 0x2015a is used from > page fault. > > ---------- > [root@localhost ~]# stap -p4 -d xfs -m pagefault -g -DSTP_NO_OVERLOAD -e ' > global traces_bt[65536]; > probe begin { printf("Probe start!\n"); } > probe kernel.function("__alloc_pages_nodemask") { > if ($gfp_mask == 0x2015a && execname() != "stapio") { > bt = backtrace(); > if (traces_bt[bt]++ == 0) { > printf("%s (%u) order:%u gfp:0x%x\n", execname(), tid(), $order, $gfp_mask); > print_stack(bt); > printf("\n\n"); > } > } > } > probe end { delete traces_bt; }' > pagefault.ko > [root@localhost ~]# staprun pagefault.ko > Probe start! > rsyslogd (1852) order:0 gfp:0x2015a > 0xffffffff81130030 : __alloc_pages_nodemask+0x0/0x9a0 [kernel] > 0xffffffff81170d87 : alloc_pages_current+0xa7/0x170 [kernel] > 0xffffffff81126d07 : __page_cache_alloc+0xb7/0xd0 [kernel] > 0xffffffff811287a5 : filemap_fault+0x1b5/0x440 [kernel] > 0xffffffff811502ff : __do_fault+0x3f/0xc0 [kernel] > 0xffffffff811518e1 : handle_mm_fault+0x5e1/0x13b0 [kernel] > 0xffffffff810463ef : __do_page_fault+0x18f/0x430 [kernel] > 0xffffffff8104676c : do_page_fault+0xc/0x10 [kernel] > 0xffffffff814d67a2 : page_fault+0x22/0x30 [kernel] Hmm, interesting. This seems to be page_cache_read path. I really fail to see why we are considering mapping_gfp_mask here. We are not holding any fs locks in this path AFAICS. Moreover we are doing GFP_KERNEL allocation few lines below. I guess this is something to be fixed. I will look into this. > ---------- > > So, your patch introduces a trigger to involve OOM killer for !__GFP_FS > allocation. I myself think that we should trigger OOM killer for !__GFP_FS > allocation in order to make forward progress in case the OOM victim is blocked. > What is the reason we did not involve OOM killer for !__GFP_FS allocation? Because the reclaim context for these allocations is very restricted. We might have a lot of cache which needs to be written down before it will be reclaimed. If we triggered OOM from this path we would see a lot of pre-mature OOM killers triggered. > Below is an example from http://I-love.SAKURA.ne.jp/tmp/serial-20150318.txt.xz > which is Linux 4.0-rc4 + your patch applied with sysctl_nr_alloc_retry == 1 > which has fallen into infinite "XFS: possible memory allocation deadlock in > xfs_buf_allocate_memory (mode:0x250)" retry trap called OOM-deadlock by > running multiple memory stressing processes described at > http://www.spinics.net/lists/linux-ext4/msg47216.html . > > [ 584.766247] Out of memory: Kill process 27800 (a.out) score 17 or sacrifice child > [ 584.766248] Killed process 27800 (a.out) total-vm:69516kB, anon-rss:33236kB, file-rss:4kB > (...snipped...) > [ 587.097942] XFS: possible memory allocation deadlock in xfs_buf_allocate_memory (mode:0x250) > (...snipped...) > [ 891.677310] a.out D ffff880069c3fb78 0 27800 1 0x00100084 > [ 891.679239] ffff880069c3fb78 ffff880057b2f570 ffff88007cfaf3b0 0000000000000000 > [ 891.681368] ffff88007fffdb08 0000000000000000 ffff880069c3c010 ffff88007cfaf3b0 > [ 891.683519] ffff88007bde5dc4 00000000ffffffff ffff88007bde5dc8 ffff880069c3fb98 > [ 891.685654] Call Trace: > [ 891.686350] [] schedule+0x3e/0x90 > [ 891.687645] [] schedule_preempt_disabled+0xe/0x10 > [ 891.689289] [] __mutex_lock_slowpath+0x92/0x100 > [ 891.690898] [] ? unlazy_walk+0xe6/0x150 > [ 891.692333] [] mutex_lock+0x23/0x40 > [ 891.693671] [] lookup_slow+0x3d/0xc0 > [ 891.695036] [] link_path_walk+0x375/0x910 > [ 891.696523] [] path_init+0xc8/0x460 > [ 891.697864] [] path_openat+0x72/0x680 > [ 891.699280] [] ? fallback_alloc+0x192/0x200 > [ 891.700852] [] ? kmem_getpages+0x58/0x110 > [ 891.702334] [] do_filp_open+0x4a/0xa0 > [ 891.703769] [] ? __alloc_fd+0xcd/0x140 > [ 891.705200] [] do_sys_open+0x145/0x240 > [ 891.706650] [] SyS_open+0x1e/0x20 > [ 891.707976] [] system_call_fastpath+0x12/0x17 > (...snipped...) > [ 899.777423] init: page allocation failure: order:0, mode:0x2015a > [ 899.777424] CPU: 2 PID: 1 Comm: init Tainted: G E 4.0.0-rc4+ #13 > [ 899.777425] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013 > [ 899.777426] 0000000000000000 ffff88007d07ba98 ffffffff814d0ee5 0000000000000001 > [ 899.777426] 000000000002015a ffff88007d07bb28 ffffffff8112f2ba ffff88007fffdb28 > [ 899.777427] ffff88007d07bab8 0000000000000020 000000000002015a 0000000000000000 > [ 899.777428] Call Trace: > [ 899.777430] [] dump_stack+0x48/0x5b > [ 899.777431] [] warn_alloc_failed+0xea/0x130 > [ 899.777432] [] __alloc_pages_nodemask+0x669/0x9a0 > [ 899.777434] [] alloc_pages_current+0xa7/0x170 > [ 899.777435] [] __page_cache_alloc+0xb7/0xd0 > [ 899.777436] [] filemap_fault+0x1b5/0x440 > [ 899.777437] [] __do_fault+0x3f/0xc0 > [ 899.777438] [] handle_mm_fault+0x5e1/0x13b0 > [ 899.777441] [] ? set_next_entity+0x2a/0x60 > [ 899.777442] [] __do_page_fault+0x18f/0x430 > [ 899.777443] [] do_page_fault+0xc/0x10 > [ 899.777445] [] page_fault+0x22/0x30 > (...snipped...) > [ 1013.096701] XFS: possible memory allocation deadlock in xfs_buf_allocate_memory (mode:0x250) > ---------- > > We have mutex_lock() which prevented effectively GFP_NOFAIL allocation > at xfs_buf_allocate_memory() from making forward progress when the OOM > victim is blocked at mutex_lock(). As long as there is GFP_NOFAIL users, > we need some heuristic mechanism for detecting stalls. One of those is to give GFP_NOFAIL user an access to reserves after it is not able to make any progress after several OOM attempts. If the caller is using GFP_NOFAIL appropriately then we should be slightly better off. XFS people refused to replace opencoded GFP_NOFAIL because they have plans to implement failure strategies so they didn't consider the change worth it. > While your patch seems to shorten the duration of !__GFP_FS allocations, > I can't feel that the I/O layer is making forward progress because the > system is stalling as if forever retrying !__GFP_FS allocations than > return I/O error to the caller. Yes and this is unfixable from the MM layer IMO. > Maybe somewhere in the I/O layer is > stalling due to use of the same watermark threshold for GFP_NOIO / > GFP_NOFS / GFP_KERNEL allocations, though I didn't check for details... -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org