From mboxrd@z Thu Jan 1 00:00:00 1970 Date: Mon, 27 Oct 2008 17:54:55 +1100 From: Dave Chinner Subject: Re: deadlock with latest xfs Message-ID: <20081027065455.GB4985@disturbed> References: <4900412A.2050802@sgi.com> <20081023205727.GA28490@infradead.org> <49013C47.4090601@sgi.com> <20081024052418.GO25906@disturbed> <20081024064804.GQ25906@disturbed> <20081026005351.GK18495@disturbed> <20081026025013.GL18495@disturbed> <49051C71.9040404@sgi.com> <20081027053004.GF11948@disturbed> <49055FDE.7040709@sgi.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <49055FDE.7040709@sgi.com> Sender: owner-linux-mm@kvack.org Return-Path: To: Lachlan McIlroy Cc: Christoph Hellwig , xfs-oss , linux-mm@kvack.org List-ID: On Mon, Oct 27, 2008 at 05:29:50PM +1100, Lachlan McIlroy wrote: > Dave Chinner wrote: >> On Mon, Oct 27, 2008 at 12:42:09PM +1100, Lachlan McIlroy wrote: >>> Dave Chinner wrote: >>>> On Sun, Oct 26, 2008 at 11:53:51AM +1100, Dave Chinner wrote: >>>>> On Fri, Oct 24, 2008 at 05:48:04PM +1100, Dave Chinner wrote: >>>>>> OK, I just hung a single-threaded rm -rf after this completed: >>>>>> >>>>>> # fsstress -p 1024 -n 100 -d /mnt/xfs2/fsstress >>>>>> >>>>>> It has hung with this trace: >> .... >>>> Got it now. I can reproduce this in a couple of minutes now that both >>>> the test fs and the fs hosting the UML fs images are using lazy-count=1 >>>> (and the frequent 10s long host system freezes have gone away, too). >>>> >>>> Looks like *another* new memory allocation problem [1]: >> ..... >>>> We've entered memory reclaim inside the xfsdatad while trying to do >>>> unwritten extent completion during I/O completion, and that memory >>>> reclaim is now blocked waiting for I/o completion that cannot make >>>> progress. >>>> >>>> Nasty. >>>> >>>> My initial though is to make _xfs_trans_alloc() able to take a KM_NOFS argument >>>> so we don't re-enter the FS here. If we get an ENOMEM in this case, we should >>>> then re-queue the I/O completion at the back of the workqueue and let other >>>> I/o completions progress before retrying this one. That way the I/O that >>>> is simply cleaning memory will make progress, hence allowing memory >>>> allocation to occur successfully when we retry this I/O completion... >>> It could work - unless it's a synchronous I/O in which case the I/O is not >>> complete until the extent conversion takes place. >> >> Right. Pushing unwritten extent conversion onto a different >> workqueue is probably the only way to handle this easily. >> That's the same solution Irix has been using for a long time >> (the xfsc thread).... > > Would that be a workqueue specific to one filesystem? Right now our > workqueues are per-cpu so they can contain I/O completions for multiple > filesystems. I've simply implemented another per-cpu workqueue set. >>> Could we allocate the memory up front before the I/O is issued? >> >> Possibly, but that will create more memory pressure than >> allocation in I/O completion because now we could need to hold >> thousands of allocations across an I/O - think of the case where >> we are running low on memory and have a disk subsystem capable of >> a few hundred thousand I/Os per second. the allocation failing would >> prevent the I/os from being issued, and if this is buffered writes >> into unwritten extents we'd be preventing dirty pages from being >> cleaned.... > > The allocation has to be done sometime - if have a few hundred thousand > I/Os per second then the queue of unwritten extent conversion requests > is going to grow very quickly. Sure, but the difference is that in a workqueue we are doing: alloc free alloc free ..... alloc free So the instantaneous memory usage is bound by the number of workqueue threads doing conversions. The "pre-allocate" case is: alloc alloc alloc alloc ...... free ..... free ..... so the allocation is bound by the number of parallel I/Os we have not completed. Given that the transaction structure is *800* bytes, they will consume memory very quickly if pre-allocated before the I/O is dispatched. > If a separate workqueue will fix this > then that's a better solution anyway. I think so. The patch I have been testing is below. Cheers, Dave. -- Dave Chinner david@fromorbit.com XFS: Prevent unwritten extent conversion from blocking I/O completion Unwritten extent conversion can recurse back into the filesystem due to memory allocation. Memory reclaim requires I/O completions to be processed to allow the callers to make progress. If the I/O completion workqueue thread is doing the recursion, then we have a deadlock situation. Move unwritten extent completion into it's own workqueue so it doesn't block I/O completions for normal delayed allocation or overwrite data. Signed-off-by: Dave Chinner --- fs/xfs/linux-2.6/xfs_aops.c | 38 +++++++++++++++++++++----------------- fs/xfs/linux-2.6/xfs_aops.h | 1 + fs/xfs/linux-2.6/xfs_buf.c | 9 +++++++++ 3 files changed, 31 insertions(+), 17 deletions(-) diff --git a/fs/xfs/linux-2.6/xfs_aops.c b/fs/xfs/linux-2.6/xfs_aops.c index 6f4ebd0..f8fa620 100644 --- a/fs/xfs/linux-2.6/xfs_aops.c +++ b/fs/xfs/linux-2.6/xfs_aops.c @@ -119,23 +119,6 @@ xfs_find_bdev_for_inode( } /* - * Schedule IO completion handling on a xfsdatad if this was - * the final hold on this ioend. If we are asked to wait, - * flush the workqueue. - */ -STATIC void -xfs_finish_ioend( - xfs_ioend_t *ioend, - int wait) -{ - if (atomic_dec_and_test(&ioend->io_remaining)) { - queue_work(xfsdatad_workqueue, &ioend->io_work); - if (wait) - flush_workqueue(xfsdatad_workqueue); - } -} - -/* * We're now finished for good with this ioend structure. * Update the page state via the associated buffer_heads, * release holds on the inode and bio, and finally free @@ -266,6 +249,27 @@ xfs_end_bio_read( } /* + * Schedule IO completion handling on a xfsdatad if this was + * the final hold on this ioend. If we are asked to wait, + * flush the workqueue. + */ +STATIC void +xfs_finish_ioend( + xfs_ioend_t *ioend, + int wait) +{ + if (atomic_dec_and_test(&ioend->io_remaining)) { + struct workqueue_struct *wq = xfsdatad_workqueue; + if (ioend->io_work.func == xfs_end_bio_unwritten) + wq = xfsconvertd_workqueue; + + queue_work(wq, &ioend->io_work); + if (wait) + flush_workqueue(wq); + } +} + +/* * Allocate and initialise an IO completion structure. * We need to track unwritten extent write completion here initially. * We'll need to extend this for updating the ondisk inode size later diff --git a/fs/xfs/linux-2.6/xfs_aops.h b/fs/xfs/linux-2.6/xfs_aops.h index 3ba0631..7643f82 100644 --- a/fs/xfs/linux-2.6/xfs_aops.h +++ b/fs/xfs/linux-2.6/xfs_aops.h @@ -19,6 +19,7 @@ #define __XFS_AOPS_H__ extern struct workqueue_struct *xfsdatad_workqueue; +extern struct workqueue_struct *xfsconvertd_workqueue; extern mempool_t *xfs_ioend_pool; typedef void (*xfs_ioend_func_t)(void *); diff --git a/fs/xfs/linux-2.6/xfs_buf.c b/fs/xfs/linux-2.6/xfs_buf.c index 36d5fcd..c1f55b3 100644 --- a/fs/xfs/linux-2.6/xfs_buf.c +++ b/fs/xfs/linux-2.6/xfs_buf.c @@ -45,6 +45,7 @@ static struct shrinker xfs_buf_shake = { static struct workqueue_struct *xfslogd_workqueue; struct workqueue_struct *xfsdatad_workqueue; +struct workqueue_struct *xfsconvertd_workqueue; #ifdef XFS_BUF_TRACE void @@ -1756,6 +1757,7 @@ xfs_flush_buftarg( xfs_buf_t *bp, *n; int pincount = 0; + xfs_buf_runall_queues(xfsconvertd_workqueue); xfs_buf_runall_queues(xfsdatad_workqueue); xfs_buf_runall_queues(xfslogd_workqueue); @@ -1812,9 +1814,15 @@ xfs_buf_init(void) if (!xfsdatad_workqueue) goto out_destroy_xfslogd_workqueue; + xfsconvertd_workqueue = create_workqueue("xfsconvertd"); + if (!xfsconvertd_workqueue) + goto out_destroy_xfsdatad_workqueue; + register_shrinker(&xfs_buf_shake); return 0; + out_destroy_xfsdatad_workqueue: + destroy_workqueue(xfsdatad_workqueue); out_destroy_xfslogd_workqueue: destroy_workqueue(xfslogd_workqueue); out_free_buf_zone: @@ -1830,6 +1838,7 @@ void xfs_buf_terminate(void) { unregister_shrinker(&xfs_buf_shake); + destroy_workqueue(xfsconvertd_workqueue); destroy_workqueue(xfsdatad_workqueue); destroy_workqueue(xfslogd_workqueue); kmem_zone_destroy(xfs_buf_zone); -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org