linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Dave Chinner <david@fromorbit.com>
To: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: dchinner@redhat.com, mhocko@suse.cz, linux-mm@kvack.org,
	rientjes@google.com, oleg@redhat.com
Subject: Re: How to handle TIF_MEMDIE stalls?
Date: Sun, 21 Dec 2014 09:35:04 +1100	[thread overview]
Message-ID: <20141220223504.GI15665@dastard> (raw)
In-Reply-To: <201412202141.ADF87596.tOSLJHFFOOFMVQ@I-love.SAKURA.ne.jp>

On Sat, Dec 20, 2014 at 09:41:22PM +0900, Tetsuo Handa wrote:
> Dave Chinner wrote:
> > On Fri, Dec 19, 2014 at 09:22:49PM +0900, Tetsuo Handa wrote:
> > > > > The global OOM killer will try to kill this program because this program
> > > > > will be using 400MB+ of RAM by the time the global OOM killer is triggered.
> > > > > But sometimes this program cannot be terminated by the global OOM killer
> > > > > due to XFS lock dependency.
> > > > >
> > > > > You can see what is happening from OOM traces after uptime > 320 seconds of
> > > > > http://I-love.SAKURA.ne.jp/tmp/serial-20141213.txt.xz though memcg is not
> > > > > configured on this program.
> > > >
> > > > This is clearly a separate issue. It is a lock dependency and that alone
> > > > _cannot_ be handled from OOM killer as it doesn't understand lock
> > > > dependencies. This should be addressed from the xfs point of view IMHO
> > > > but I am not familiar with this filesystem to tell you how or whether it
> > > > is possible.
> > 
> > What XFS lock dependency? I see nothing in that output file that indicates a
> > lock dependency problem - can you point out what the issue is here?
> 
> This is a problem which lockdep cannot report.
> 
> The problem is that an OOM-victim task is unable to terminate because it is
> blocked for waiting for (I don't know which lock but) one of locks used by XFS.

That's not an XFS problem - XFS relies on the memory reclaim
subsystem being able to make progress. If the memory reclaim
subsystem cannot make progress, then there's a bug in the memory
reclaim subsystem, not a problem with the OOM killer.

IOWs, you're not looking at the right place to solve the problem.

> ----------
> [  320.788387] Kill process 10732 (a.out) sharing same memory
> (...snipped...)
> [  398.641724] a.out           D ffff880077e42638     0 10732      1 0x00000084
> [  398.643705]  ffff8800770ebcb8 0000000000000082 ffff8800770ebc88 ffff880077e42210
> [  398.645819]  0000000000012500 ffff8800770ebfd8 0000000000012500 ffff880077e42210
> [  398.647917]  ffff8800770ebcb8 ffff88007b4a2a48 ffff88007b4a2a4c ffff880077e42210
> [  398.650009] Call Trace:
> [  398.651094]  [<ffffffff8159f954>] schedule_preempt_disabled+0x24/0x70
> [  398.652913]  [<ffffffff815a1705>] __mutex_lock_slowpath+0xb5/0x120
> [  398.654679]  [<ffffffff815a178e>] mutex_lock+0x1e/0x32
> [  398.656262]  [<ffffffffa023b58a>] xfs_file_buffered_aio_write.isra.15+0x6a/0x200 [xfs]
> [  398.658350]  [<ffffffffa023b79e>] xfs_file_write_iter+0x7e/0x120 [xfs]
> [  398.660191]  [<ffffffff8117edd9>] new_sync_write+0x89/0xd0
> [  398.661829]  [<ffffffff8117f742>] vfs_write+0xb2/0x1f0
> [  398.663397]  [<ffffffff8101a9f4>] ? do_audit_syscall_entry+0x64/0x70
> [  398.665190]  [<ffffffff81180200>] SyS_write+0x50/0xc0
> [  398.666745]  [<ffffffff810f729e>] ? __audit_syscall_exit+0x22e/0x2d0
> [  398.668539]  [<ffffffff815a38e9>] system_call_fastpath+0x12/0x17

These processes are blocked because some other process is holding the
i_mutex - likely another write that is blocked in memory reclaim
during page cache allocation. Yup:

[  398.852364] a.out           R  running task        0 10739      1 0x00000084
[  398.854312]  ffff8800751d3898 0000000000000082 ffff8800751d3960 ffff880035c42a80
[  398.856369]  0000000000012500 ffff8800751d3fd8 0000000000012500 ffff880035c42a80
[  398.858440]  0000000000000020 ffff8800751d3970 0000000000000003 ffffffff81848408
[  398.860497] Call Trace:
[  398.861602]  [<ffffffff8159f814>] _cond_resched+0x24/0x40
[  398.863195]  [<ffffffff81122119>] shrink_slab+0x139/0x150
[  398.864799]  [<ffffffff811252bf>] do_try_to_free_pages+0x35f/0x4d0
[  398.866536]  [<ffffffff811254c4>] try_to_free_pages+0x94/0xc0
[  398.868177]  [<ffffffff8111a793>] __alloc_pages_nodemask+0x4e3/0xa40
[  398.869920]  [<ffffffff8115a8ce>] alloc_pages_current+0x8e/0x100
[  398.871647]  [<ffffffff81111b27>] __page_cache_alloc+0xa7/0xc0
[  398.873785]  [<ffffffff8111263b>] pagecache_get_page+0x6b/0x1e0
[  398.875468]  [<ffffffff811127de>] grab_cache_page_write_begin+0x2e/0x50
[  398.881857]  [<ffffffffa02301cf>] xfs_vm_write_begin+0x2f/0xe0 [xfs]
[  398.883553]  [<ffffffff8111188c>] generic_perform_write+0xcc/0x1d0
[  398.885210]  [<ffffffffa023b50f>] ? xfs_file_aio_write_checks+0xdf/0xf0 [xfs]
[  398.887100]  [<ffffffffa023b5ef>] xfs_file_buffered_aio_write.isra.15+0xcf/0x200 [xfs]
[  398.889135]  [<ffffffffa023b79e>] xfs_file_write_iter+0x7e/0x120 [xfs]
[  398.890907]  [<ffffffff8117edd9>] new_sync_write+0x89/0xd0
[  398.892495]  [<ffffffff8117f742>] vfs_write+0xb2/0x1f0
[  398.894017]  [<ffffffff8101a9f4>] ? do_audit_syscall_entry+0x64/0x70
[  398.895768]  [<ffffffff81180200>] SyS_write+0x50/0xc0
[  398.897273]  [<ffffffff810f729e>] ? __audit_syscall_exit+0x22e/0x2d0
[  398.899013]  [<ffffffff815a38e9>] system_call_fastpath+0x12/0x17

That's what's holding the i_mutex. This is normal, and *every*
filesystem holds the i_mutex here for buffered writes. Stop
trying to shoot the messenger...

Oh, boy.

struct page *grab_cache_page_write_begin(struct address_space *mapping,
                                        pgoff_t index, unsigned flags)
{
        struct page *page;
        int fgp_flags = FGP_LOCK|FGP_ACCESSED|FGP_WRITE|FGP_CREAT;

        if (flags & AOP_FLAG_NOFS)
                fgp_flags |= FGP_NOFS;

        page = pagecache_get_page(mapping, index, fgp_flags,
                        mapping_gfp_mask(mapping),
                        GFP_KERNEL);
        if (page)
                wait_for_stable_page(page);

        return page;
}

There are *3* different memory allocation controls passed to
pagecache_get_page. The first is via AOP_FLAG_NOFS, where the caller
explicitly says this allocation is in filesystem context with locks
held, and so all allocations need to be done in GFP_NOFS context.
This is used to override the second and third gfp parameters.

The second is mapping_gfp_mask(mapping), which is the *default
allocation context* the filesystem wants the page cache to use for
allocating pages to the mapping.

The third is a hard coded GFP_KERNEL, which is used for radix tree
node allocation.

Why are there separate allocation contexts for the radix tree nodes
and the page cache pages when they are done under *exactly the same
caller context*? Either we are allowed to recurse into the
filesystem or we aren't, and the inode mapping mask defines that
context for all page cache allocations, not just the pages
themselves.

And to point out how many filesystems this affects,
the loop device, btrfs, f2fs, gfs2, jfs, logfs, nil2fs, reiserfs
and XFS all use this mapping default to clear __GFP_FS from
page cache allocations. Only ext4 and gfs2 use AOP_FLAG_NOFS in
their ->write_begin callouts to prevent recusrion.

IOWs, grab_cache_page_write_begin/pagecache_get_page multiple
allocation contexts are just wrong.  It does not match the way
filesystems are informing the page cache of allocation context to
avoid recursion (for avoiding stack overflow and/or deadlock).
AOP_FLAG_NOFS should go away, and all filesystems should modify the
mapping gfp mask to set their allocation context. If should be used
*everywhere* pages are allocated into the page cache, and for all
allocations related to tracking those allocated pages.

Now, that's not the problem directly related to this lockup, but
it's indicative of how far the page cache code has become from
reality over the past few years...

So, going back to the lockup, doesn't hte fact that so many
processes are spinning in the shrinker tell you that there's a
problem in that area? i.e. this:

[  398.861602]  [<ffffffff8159f814>] _cond_resched+0x24/0x40
[  398.863195]  [<ffffffff81122119>] shrink_slab+0x139/0x150
[  398.864799]  [<ffffffff811252bf>] do_try_to_free_pages+0x35f/0x4d0

tells me a shrinker is not making progress for some reason.  I'd
suggest that you run some tracing to find out what shrinker it is
stuck in. there are tracepoints in shrink_slab that will tell you
what shrinker is iterating for long periods of time. i.e instead of
ranting and pointing fingers at everyone, you need to keep digging
until you know exactly where reclaim progress is stalling.

> I don't know how block layer requests are issued by filesystem layer's
> activities, but PID=10832 is blocked for so long at blk_rq_map_kern() doing
> __GFP_WAIT allocation. I'm sure that this blk_rq_map_kern() is issued by XFS
> filesystem's activities because this system has only /dev/sda1 formatted as
> XFS and there is no swap memory.

Sorry, what?

[  525.184545]  [<ffffffff81265e6f>] blk_rq_map_kern+0x6f/0x130
[  525.186156]  [<ffffffff8116393e>] ? kmem_cache_alloc+0x48e/0x4b0
[  525.187831]  [<ffffffff813a66cf>] scsi_execute+0x12f/0x160
[  525.189418]  [<ffffffff813a7f14>] scsi_execute_req_flags+0x84/0xf0
[  525.191148]  [<ffffffffa01e29cc>] sr_check_events+0xbc/0x2e0 [sr_mod]
[  525.192969]  [<ffffffff8109834c>] ? put_prev_entity+0x2c/0x3b0
[  525.194688]  [<ffffffffa01d6177>] cdrom_check_events+0x17/0x30 [cdrom]
[  525.196455]  [<ffffffffa01e2e5d>] sr_block_check_events+0x2d/0x30 [sr_mod]
[  525.198291]  [<ffffffff812701c6>] disk_check_events+0x56/0x1b0
[  525.199984]  [<ffffffff81270331>] disk_events_workfn+0x11/0x20
[  525.201616]  [<ffffffff8107ceaf>] process_one_work+0x13f/0x370
[  525.203264]  [<ffffffff8107de99>] worker_thread+0x119/0x500
[  525.204799]  [<ffffffff8107dd80>] ? rescuer_thread+0x350/0x350
[  525.206436]  [<ffffffff81082f7c>] kthread+0xdc/0x100
[  525.207902]  [<ffffffff81082ea0>] ? kthread_create_on_node+0x1b0/0x1b0
[  525.209655]  [<ffffffff815a383c>] ret_from_fork+0x7c/0xb0
[  525.211206]  [<ffffffff81082ea0>] ? kthread_create_on_node+0x1b0/0x1b0

That's a CDROM event through the SCSI stack via a raw scsi device.
If you read the code you'd see that scsi_execute() is the function
using __GFP_WAIT semantics. This has *absolutely nothing* to do with
XFS, and clearly has nothing to do with anything related to the
problem you are seeing.

> Anyway stalling for 10 minutes upon OOM (and can't solve with
> SysRq-f) is unusable for me.

OOM-killing is not a magic button that will miraculously make the
system work when you oversubscribe it severely.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

  reply	other threads:[~2014-12-20 22:35 UTC|newest]

Thread overview: 177+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-12-12 13:54 [RFC PATCH] oom: Don't count on mm-less current process Tetsuo Handa
2014-12-16 12:47 ` Michal Hocko
2014-12-17 11:54   ` Tetsuo Handa
2014-12-17 13:08     ` Michal Hocko
2014-12-18 12:11       ` Tetsuo Handa
2014-12-18 15:33         ` Michal Hocko
2014-12-19 12:07           ` Tetsuo Handa
2014-12-19 12:49             ` Michal Hocko
2014-12-20  9:13               ` Tetsuo Handa
2014-12-20 11:42                 ` Tetsuo Handa
2014-12-22 20:25                   ` Michal Hocko
2014-12-23  1:00                     ` Tetsuo Handa
2014-12-23  9:51                       ` Michal Hocko
2014-12-23 11:46                         ` Tetsuo Handa
2014-12-23 11:57                           ` Tetsuo Handa
2014-12-23 12:12                             ` Tetsuo Handa
2014-12-23 12:27                             ` Michal Hocko
2014-12-23 12:24                           ` Michal Hocko
2014-12-23 13:00                             ` Tetsuo Handa
2014-12-23 13:09                               ` Michal Hocko
2014-12-23 13:20                                 ` Tetsuo Handa
2014-12-23 13:43                                   ` Michal Hocko
2014-12-23 14:11                                     ` Tetsuo Handa
2014-12-23 14:57                                       ` Michal Hocko
2014-12-19 12:22           ` How to handle TIF_MEMDIE stalls? Tetsuo Handa
2014-12-20  2:03             ` Dave Chinner
2014-12-20 12:41               ` Tetsuo Handa
2014-12-20 22:35                 ` Dave Chinner [this message]
2014-12-21  8:45                   ` Tetsuo Handa
2014-12-21 20:42                     ` Dave Chinner
2014-12-22 16:57                       ` Michal Hocko
2014-12-22 21:30                         ` Dave Chinner
2014-12-23  9:41                           ` Johannes Weiner
2014-12-24  1:06                             ` Dave Chinner
2014-12-24  2:40                               ` Linus Torvalds
2014-12-29 18:19                     ` Michal Hocko
2014-12-30  6:42                       ` Tetsuo Handa
2014-12-30 11:21                         ` Michal Hocko
2014-12-30 13:33                           ` Tetsuo Handa
2014-12-31 10:24                             ` Tetsuo Handa
2015-02-09 11:44                           ` Tetsuo Handa
2015-02-10 13:58                             ` Tetsuo Handa
2015-02-10 15:19                               ` Johannes Weiner
2015-02-11  2:23                                 ` Tetsuo Handa
2015-02-11 13:37                                   ` Tetsuo Handa
2015-02-11 18:50                                     ` Oleg Nesterov
2015-02-11 18:59                                       ` Oleg Nesterov
2015-03-14 13:03                                         ` Tetsuo Handa
2015-02-17 12:23                                   ` Tetsuo Handa
2015-02-17 12:53                                     ` Johannes Weiner
2015-02-17 15:38                                       ` Michal Hocko
2015-02-17 22:54                                       ` Dave Chinner
2015-02-17 23:32                                         ` Dave Chinner
2015-02-18  8:25                                         ` Michal Hocko
2015-02-18 10:48                                           ` Dave Chinner
2015-02-18 12:16                                             ` Michal Hocko
2015-02-18 21:31                                               ` Dave Chinner
2015-02-19  9:40                                                 ` Michal Hocko
2015-02-19 22:03                                                   ` Dave Chinner
2015-02-20  9:27                                                     ` Michal Hocko
2015-02-19 11:01                                               ` Johannes Weiner
2015-02-19 12:29                                                 ` Michal Hocko
2015-02-19 12:58                                                   ` Michal Hocko
2015-02-19 15:29                                                     ` Tetsuo Handa
2015-02-19 21:53                                                       ` Tetsuo Handa
2015-02-20  9:13                                                       ` Michal Hocko
2015-02-20 13:37                                                         ` Stefan Ring
2015-02-19 13:29                                                   ` Tetsuo Handa
2015-02-20  9:10                                                     ` Michal Hocko
2015-02-20 12:20                                                       ` Tetsuo Handa
2015-02-20 12:38                                                         ` Michal Hocko
2015-02-19 21:43                                                   ` Dave Chinner
2015-02-20 12:48                                                     ` Michal Hocko
2015-02-20 23:09                                                       ` Dave Chinner
2015-02-19 10:24                                         ` Johannes Weiner
2015-02-19 22:52                                           ` Dave Chinner
2015-02-20 10:36                                             ` Tetsuo Handa
2015-02-20 23:15                                               ` Dave Chinner
2015-02-21  3:20                                                 ` Theodore Ts'o
2015-02-21  9:19                                                   ` Andrew Morton
2015-02-21 13:48                                                     ` Tetsuo Handa
2015-02-21 21:38                                                     ` Dave Chinner
2015-02-22  0:20                                                     ` Johannes Weiner
2015-02-23 10:48                                                       ` Michal Hocko
2015-02-23 11:23                                                         ` Tetsuo Handa
2015-02-23 21:33                                                       ` David Rientjes
2015-02-22 14:48                                                     ` __GFP_NOFAIL and oom_killer_disabled? Tetsuo Handa
2015-02-23 10:21                                                       ` Michal Hocko
2015-02-23 13:03                                                         ` Tetsuo Handa
2015-02-24 18:14                                                           ` Michal Hocko
2015-02-25 11:22                                                             ` Tetsuo Handa
2015-02-25 16:02                                                               ` Michal Hocko
2015-02-25 21:48                                                                 ` Tetsuo Handa
2015-02-25 21:51                                                                   ` Andrew Morton
2015-02-21 12:00                                                   ` How to handle TIF_MEMDIE stalls? Tetsuo Handa
2015-02-23 10:26                                                   ` Michal Hocko
2015-02-21 11:12                                                 ` Tetsuo Handa
2015-02-21 21:48                                                   ` Dave Chinner
2015-02-21 23:52                                             ` Johannes Weiner
2015-02-23  0:45                                               ` Dave Chinner
2015-02-23  1:29                                                 ` Andrew Morton
2015-02-23  7:32                                                   ` Dave Chinner
2015-02-27 18:24                                                     ` Vlastimil Babka
2015-02-28  0:03                                                       ` Dave Chinner
2015-02-28 15:17                                                         ` Theodore Ts'o
2015-03-02  9:39                                                     ` Vlastimil Babka
2015-03-02 22:31                                                       ` Dave Chinner
2015-03-03  9:13                                                         ` Vlastimil Babka
2015-03-04  1:33                                                           ` Dave Chinner
2015-03-04  8:50                                                             ` Vlastimil Babka
2015-03-04 11:03                                                               ` Dave Chinner
2015-03-07  0:20                                                         ` Johannes Weiner
2015-03-07  3:43                                                           ` Dave Chinner
2015-03-07 15:08                                                             ` Johannes Weiner
2015-03-02 20:22                                                     ` Johannes Weiner
2015-03-02 23:12                                                       ` Dave Chinner
2015-03-03  2:50                                                         ` Johannes Weiner
2015-03-04  6:52                                                           ` Dave Chinner
2015-03-04 15:04                                                             ` Johannes Weiner
2015-03-04 17:38                                                               ` Theodore Ts'o
2015-03-04 23:17                                                                 ` Dave Chinner
2015-02-28 16:29                                                 ` Johannes Weiner
2015-02-28 16:41                                                   ` Theodore Ts'o
2015-02-28 22:15                                                     ` Johannes Weiner
2015-03-01 11:17                                                       ` Tetsuo Handa
2015-03-06 11:53                                                         ` Tetsuo Handa
2015-03-01 13:43                                                       ` Theodore Ts'o
2015-03-01 16:15                                                         ` Johannes Weiner
2015-03-01 19:36                                                           ` Theodore Ts'o
2015-03-01 20:44                                                             ` Johannes Weiner
2015-03-01 20:17                                                         ` Johannes Weiner
2015-03-01 21:48                                                       ` Dave Chinner
2015-03-02  0:17                                                         ` Dave Chinner
2015-03-02 12:46                                                           ` Brian Foster
2015-02-28 18:36                                                 ` Vlastimil Babka
2015-03-02 15:18                                                 ` Michal Hocko
2015-03-02 16:05                                                   ` Johannes Weiner
2015-03-02 17:10                                                     ` Michal Hocko
2015-03-02 17:27                                                       ` Johannes Weiner
2015-03-02 16:39                                                   ` Theodore Ts'o
2015-03-02 16:58                                                     ` Michal Hocko
2015-03-04 12:52                                                       ` Dave Chinner
2015-02-17 14:59                                     ` Michal Hocko
2015-02-17 14:50                                 ` Michal Hocko
2015-02-17 14:37                             ` Michal Hocko
2015-02-17 14:44                               ` Michal Hocko
2015-02-16 11:23                           ` Tetsuo Handa
2015-02-16 15:42                             ` Johannes Weiner
2015-02-17 11:57                               ` Tetsuo Handa
2015-02-17 13:16                                 ` Johannes Weiner
2015-02-17 16:50                                   ` Michal Hocko
2015-02-17 23:25                                     ` Dave Chinner
2015-02-18  8:48                                       ` Michal Hocko
2015-02-18 11:23                                         ` Tetsuo Handa
2015-02-18 12:29                                           ` Michal Hocko
2015-02-18 14:06                                             ` Tetsuo Handa
2015-02-18 14:25                                               ` Michal Hocko
2015-02-19 10:48                                                 ` Tetsuo Handa
2015-02-20  8:26                                                   ` Michal Hocko
2015-02-23 22:08                                 ` David Rientjes
2015-02-24 11:20                                   ` Tetsuo Handa
2015-02-24 15:20                                     ` Theodore Ts'o
2015-02-24 21:02                                       ` Dave Chinner
2015-02-25 14:31                                         ` Tetsuo Handa
2015-02-27  7:39                                           ` Dave Chinner
2015-02-27 12:42                                             ` Tetsuo Handa
2015-02-27 13:12                                               ` Dave Chinner
2015-03-04 12:41                                                 ` Tetsuo Handa
2015-03-04 13:25                                                   ` Dave Chinner
2015-03-04 14:11                                                     ` Tetsuo Handa
2015-03-05  1:36                                                       ` Dave Chinner
2015-02-17 16:33                             ` Michal Hocko
2014-12-29 17:40                   ` [PATCH] mm: get rid of radix tree gfp mask for pagecache_get_page (was: Re: How to handle TIF_MEMDIE stalls?) Michal Hocko
2014-12-29 18:45                     ` Linus Torvalds
2014-12-29 19:33                       ` Michal Hocko
2014-12-30 13:42                         ` Michal Hocko
2014-12-30 21:45                           ` Linus Torvalds

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20141220223504.GI15665@dastard \
    --to=david@fromorbit.com \
    --cc=dchinner@redhat.com \
    --cc=linux-mm@kvack.org \
    --cc=mhocko@suse.cz \
    --cc=oleg@redhat.com \
    --cc=penguin-kernel@I-love.SAKURA.ne.jp \
    --cc=rientjes@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox