From: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
To: Dave Chinner <david@fromorbit.com>
Cc: xfs@oss.sgi.com, linux-mm@kvack.org
Subject: Re: xfs: two deadlock problems occur when kswapd writebacks XFS pages.
Date: Wed, 18 Jun 2014 18:37:11 +0900 [thread overview]
Message-ID: <53A15DC7.50001@jp.fujitsu.com> (raw)
In-Reply-To: <20140617132609.GI9508@dastard>
On Tue, 17 Jun 2014 23:26:09 +1000 Dave Chinner wrote:
> On Tue, Jun 17, 2014 at 05:50:02PM +0900, Masayoshi Mizuma wrote:
>> I found two deadlock problems that occur when kswapd writes back XFS
>> pages. I actually detected these problems on a RHEL kernel, and I
>> suppose they also happen on the upstream kernel (3.16-rc1).
>>
>> 1.
>>
>> A process (processA) has acquired the read semaphore "xfs_cil.xc_ctx_lock"
>> at xfs_log_commit_cil() and is waiting for kswapd. Meanwhile, a
>> kworker has entered xlog_cil_push_work() and is waiting to acquire the
>> write semaphore. kswapd is waiting to acquire the read semaphore at
>> xfs_log_commit_cil() because the kworker is already queued ahead of it
>> for the write semaphore in xlog_cil_push(). Therefore, a deadlock
>> occurs.
>>
>> The deadlock flow is as follows.
>>
>> processA | kworker | kswapd
>> ----------------------+--------------------------+----------------------
>> | xfs_trans_commit | |
>> | xfs_log_commit_cil | |
>> | down_read(xc_ctx_lock)| |
>> | xlog_cil_insert_items | |
>> | xlog_cil_insert_format_items |
>> | kmem_alloc | |
>> | : | |
>> | shrink_inactive_list | |
>> | congestion_wait | |
>> | # waiting for kswapd..| |
>> | | xlog_cil_push_work |
>> | | xlog_cil_push |
>> | | xfs_trans_commit |
>> | | down_write(xc_ctx_lock) |
>> | | # waiting for processA...|
>> | | | shrink_page_list
>> | | | xfs_vm_writepage
>> | | | xfs_map_blocks
>> | | | xfs_iomap_write_allocate
>> | | | xfs_trans_commit
>> | | | xfs_log_commit_cil
>> | | | down_read(xc_ctx_lock)
>> V(time) | | # waiting for kworker...
>> ----------------------+--------------------------+-----------------------
>
> Where's the deadlock here? congestion_wait() simply times out and
> processA continues onward doing memory reclaim. It should continue
> making progress, albeit slowly, and if it isn't then the allocation
> will fail. If the allocation repeatedly fails then you should be
> seeing this in the logs:
>
> XFS: possible memory allocation deadlock in <func> (mode:0x%x)
>
> If you aren't seeing that in the logs a few times a second and never
> stopping, then the system is still making progress and isn't
> deadlocked.
processA is stuck in the following while loop. In this situation,
too_many_isolated() always returns true because kswapd is also stuck...
---
static noinline_for_stack unsigned long
shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
struct scan_control *sc, enum lru_list lru)
{
...
while (unlikely(too_many_isolated(zone, file, sc))) {
congestion_wait(BLK_RW_ASYNC, HZ/10);
/* We are about to die and free our memory. Return now. */
if (fatal_signal_pending(current))
return SWAP_CLUSTER_MAX;
}
---
In that respect, this problem is similar to the one fixed by the
following commit:

  1f6d64829d xfs: block allocation work needs to be kswapd aware

So the same kind of solution, for example adding PF_KSWAPD to
current->flags before calling kmem_alloc(), may fix problem 1...
>
>> To fix this, should we release (up_read) the semaphore before calling
>> kmem_alloc() at xlog_cil_insert_format_items() to avoid blocking the
>> kworker? Or should we change the second argument of kmem_alloc() from
>> KM_SLEEP|KM_NOFS to KM_NOSLEEP to avoid waiting for kswapd? Or...
>
> Can't do that - it's in transaction context and so reclaim can't
> recurse into the fs. Even if you do remove the flag, kmem_alloc()
> will re-add the GFP_NOFS silently because of the PF_FSTRANS flag on
> the task, so it won't affect anything...
I think kmem_alloc() doesn't re-add GFP_NOFS if the second argument is
set to KM_NOSLEEP; in that case kmem_flags_convert() returns
GFP_ATOMIC | __GFP_NOWARN instead:
---
static inline gfp_t
kmem_flags_convert(xfs_km_flags_t flags)
{
	gfp_t lflags;

	BUG_ON(flags & ~(KM_SLEEP|KM_NOSLEEP|KM_NOFS|KM_MAYFAIL|KM_ZERO));

	if (flags & KM_NOSLEEP) {
		lflags = GFP_ATOMIC | __GFP_NOWARN;
	} else {
		lflags = GFP_KERNEL | __GFP_NOWARN;
		if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
			lflags &= ~__GFP_FS;
	}
	if (flags & KM_ZERO)
		lflags |= __GFP_ZERO;
	return lflags;
}
---
>
> We might be able to do a down_write_trylock() in xlog_cil_push(),
> but we can't delay the push for an arbitrary amount of time - the
> write lock needs to be a barrier otherwise we'll get push
> starvation and that will lead to checkpoint size overruns (i.e.
> temporary journal corruption).
I understand, thanks.
>
>> 2.
>>
>> A kworker (kworkerA), which is a writeback thread, is waiting for
>> the XFS allocation thread (kworkerB) while it writes back XFS pages.
>> kworkerB has started the allocation and is waiting for kswapd to
>> free pages. kswapd has started writing back XFS pages and is waiting
>> for more log space. The log space is exhausted because both the
>> writeback thread and kswapd are stuck, so processes that have
>> reserved log space and are requesting free pages are also stuck.
>>
>> The deadlock flow is as follows.
>>
>> kworkerA | kworkerB | kswapd
>> ----------------------+--------------------------+-----------------------
>> | wb_writeback | |
>> | : | |
>> | xfs_vm_writepage | |
>> | xfs_map_blocks | |
>> | xfs_iomap_write_allocate |
>> | xfs_bmapi_write | |
>> | xfs_bmapi_allocate | |
>> | wait_for_completion | |
>> | # waiting for kworkerB... |
>> | | xfs_bmapi_allocate_worker|
>> | | : |
>> | | xfs_buf_get_map |
>> | | xfs_buf_allocate_memory |
>> | | alloc_pages_current |
>> | | : |
>> | | shrink_inactive_list |
>> | | congestion_wait |
>> | | # waiting for kswapd... |
>> | | | shrink_page_list
>> | | | xfs_vm_writepage
>> | | | :
>> | | | xfs_log_reserve
>> | | | :
>> | | | xlog_grant_head_check
>> | | | xlog_grant_head_wait
>> | | | # waiting for more
>> | | | # space...
>> V(time) | |
>> ----------------------+--------------------------+-----------------------
>
> Again, anything in congestion_wait() is not stuck and if the
> allocations here are repeatedly failing and progress is not being
> made, then there should be log messages from XFS indicating this.
kworkerB is stuck for the same reason as processA above.
>
> I need more information about your test setup to understand what is
> going on here. Can you provide:
>
> http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F
>
> The output of sysrq-w would also be useful here, because the above
> abridged stack traces do not tell me everything about the state of
> the system I need to know.
OK, I will try to collect that information when problem 2 reproduces.
Thanks,
Masayoshi Mizuma
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <dont@kvack.org>
Thread overview: 6+ messages
2014-06-17 8:50 Masayoshi Mizuma
2014-06-17 13:26 ` Dave Chinner
2014-06-18 9:37 ` Masayoshi Mizuma [this message]
2014-06-18 11:48 ` Dave Chinner
[not found] ` <53A7D6CC.1040605@jp.fujitsu.com>
2014-06-24 22:05 ` Dave Chinner
2014-07-14 11:00 ` Masayoshi Mizuma