linux-mm.kvack.org archive mirror
* Re: deadlock with latest xfs
       [not found]         ` <20081026005351.GK18495@disturbed>
@ 2008-10-26  2:50           ` Dave Chinner
  2008-10-26  4:20             ` Dave Chinner
                               ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Dave Chinner @ 2008-10-26  2:50 UTC (permalink / raw)
  To: Lachlan McIlroy, Christoph Hellwig, xfs-oss; +Cc: linux-mm

On Sun, Oct 26, 2008 at 11:53:51AM +1100, Dave Chinner wrote:
> On Fri, Oct 24, 2008 at 05:48:04PM +1100, Dave Chinner wrote:
> > OK, I just hung a single-threaded rm -rf after this completed:
> > 
> > # fsstress -p 1024 -n 100 -d /mnt/xfs2/fsstress
> > 
> > It has hung with this trace:
> > 
> > # echo w > /proc/sysrq-trigger
> ....
> > [42954211.590000] 794877f8:  [<6002e40a>] update_curr+0x3a/0x50
> > [42954211.590000] 79487818:  [<60014f0d>] _switch_to+0x6d/0xe0
> > [42954211.590000] 79487858:  [<60324b21>] schedule+0x171/0x2c0
> > [42954211.590000] 794878a8:  [<60324e6d>] schedule_timeout+0xad/0xf0
> > [42954211.590000] 794878c8:  [<60326e98>] _spin_unlock_irqrestore+0x18/0x20
> > [42954211.590000] 79487908:  [<60195455>] xlog_grant_log_space+0x245/0x470
> > [42954211.590000] 79487920:  [<60030ba0>] default_wake_function+0x0/0x10
> > [42954211.590000] 79487978:  [<601957a2>] xfs_log_reserve+0x122/0x140
> > [42954211.590000] 794879c8:  [<601a36e7>] xfs_trans_reserve+0x147/0x2e0
> > [42954211.590000] 794879f8:  [<60087374>] kmem_cache_alloc+0x84/0x100
> > [42954211.590000] 79487a38:  [<601ab01f>] xfs_inactive_symlink_rmt+0x9f/0x450
> > [42954211.590000] 79487a88:  [<601ada94>] kmem_zone_zalloc+0x34/0x50
> > [42954211.590000] 79487aa8:  [<601a3a6d>] _xfs_trans_alloc+0x2d/0x70
> ....
> 
> I came back to the system, and found that the hang had gone away - the
> rm -rf had finished sometime in the ~36 hours between triggering the
> problem and coming back to look at the corpse....
> 
> So nothing to report yet.

Got it now. I can reproduce this in a couple of minutes now that both
the test fs and the fs hosting the UML fs images are using lazy-count=1
(and the frequent 10s long host system freezes have gone away, too).
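(For anyone reproducing: lazy-count is a mkfs-time log option, set along
the lines of 'mkfs.xfs -l lazy-count=1 <dev>' - exact invocation assumed,
it's not spelled out anywhere in this thread.)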

Looks like *another* new memory allocation problem [1]:

[42950422.270000] xfsdatad/0    D 000000000043bf7a     0    51      2
[42950422.270000] 804add98 804ad8f8 60498c40 80474000 804776a0 60014f0d 80442780 1000111a8
[42950422.270000]        80474000 7ff1ac08 804ad8c0 80442780 804776f0 60324b21 80474000 80477700
[42950422.270000]        80474000 1000111a8 80477700 0000000a 804777e0 80477950 80477750 60324e39 <6>Call Trace:
[42950422.270000] 80477668:  [<60014f0d>] _switch_to+0x6d/0xe0
[42950422.270000] 804776a8:  [<60324b21>] schedule+0x171/0x2c0
[42950422.270000] 804776f8:  [<60324e39>] schedule_timeout+0x79/0xf0
[42950422.270000] 80477718:  [<60040360>] process_timeout+0x0/0x10
[42950422.270000] 80477758:  [<60324619>] io_schedule_timeout+0x19/0x30
[42950422.270000] 80477778:  [<6006eb74>] congestion_wait+0x74/0xa0
[42950422.270000] 80477790:  [<6004c5b0>] autoremove_wake_function+0x0/0x40
[42950422.270000] 804777d8:  [<600692a0>] throttle_vm_writeout+0x80/0xa0
[42950422.270000] 80477818:  [<6006cdf4>] shrink_zone+0xac4/0xb10
[42950422.270000] 80477828:  [<601adb5b>] kmem_alloc+0x5b/0x140
[42950422.270000] 804778c8:  [<60186d48>] xfs_iext_inline_to_direct+0x68/0x80
[42950422.270000] 804778f8:  [<60187e38>] xfs_iext_realloc_direct+0x128/0x1c0
[42950422.270000] 80477928:  [<60188594>] xfs_iext_add+0xc4/0x290
[42950422.270000] 80477978:  [<60166388>] xfs_bmbt_set_all+0x18/0x20
[42950422.270000] 80477988:  [<601887c4>] xfs_iext_insert+0x64/0x80
[42950422.270000] 804779c8:  [<6006d75a>] try_to_free_pages+0x1ea/0x330
[42950422.270000] 80477a40:  [<6006ba40>] isolate_pages_global+0x0/0x40
[42950422.270000] 80477a98:  [<60067887>] __alloc_pages_internal+0x267/0x540
[42950422.270000] 80477b68:  [<60086b61>] cache_alloc_refill+0x4c1/0x970
[42950422.270000] 80477b88:  [<60326ea9>] _spin_unlock+0x9/0x10
[42950422.270000] 80477bd8:  [<6002ffc5>] __might_sleep+0x55/0x120
[42950422.270000] 80477c08:  [<601ad9cd>] kmem_zone_alloc+0x7d/0x110
[42950422.270000] 80477c18:  [<600873c3>] kmem_cache_alloc+0xd3/0x100
[42950422.270000] 80477c58:  [<601ad9cd>] kmem_zone_alloc+0x7d/0x110
[42950422.270000] 80477ca8:  [<601ada78>] kmem_zone_zalloc+0x18/0x50
[42950422.270000] 80477cc8:  [<601a3a6d>] _xfs_trans_alloc+0x2d/0x70
[42950422.270000] 80477ce8:  [<601a3b52>] xfs_trans_alloc+0xa2/0xb0
[42950422.270000] 80477d18:  [<60027655>] set_signals+0x35/0x40
[42950422.270000] 80477d48:  [<6018f93a>] xfs_iomap_write_unwritten+0x5a/0x260
[42950422.270000] 80477d50:  [<60063d12>] mempool_free_slab+0x12/0x20
[42950422.270000] 80477d68:  [<60027655>] set_signals+0x35/0x40
[42950422.270000] 80477db8:  [<60063d12>] mempool_free_slab+0x12/0x20
[42950422.270000] 80477dc8:  [<60063dbf>] mempool_free+0x4f/0x90
[42950422.270000] 80477e18:  [<601af5e5>] xfs_end_bio_unwritten+0x65/0x80
[42950422.270000] 80477e38:  [<60048574>] run_workqueue+0xa4/0x180
[42950422.270000] 80477e50:  [<601af580>] xfs_end_bio_unwritten+0x0/0x80
[42950422.270000] 80477e58:  [<6004c791>] prepare_to_wait+0x51/0x80
[42950422.270000] 80477e98:  [<600488e0>] worker_thread+0x70/0xd0

We've entered memory reclaim inside the xfsdatad while trying to do
unwritten extent conversion during I/O completion, and that memory
reclaim is now blocked waiting for I/O completion that cannot make
progress.

Nasty.

My initial thought is to make _xfs_trans_alloc() able to take a KM_NOFS argument
so we don't re-enter the FS here. If we get an ENOMEM in this case, we should
then re-queue the I/O completion at the back of the workqueue and let other
I/O completions progress before retrying this one. That way the I/O that
is simply cleaning memory will make progress, hence allowing memory
allocation to occur successfully when we retry this I/O completion...
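
As a rough sketch of that first idea (illustrative only - this is not
code from this thread; KM_NOFS and KM_MAYFAIL are the existing XFS
allocation flags, the requeue path is hypothetical):

	/*
	 * Allocate the transaction with KM_NOFS so the allocator cannot
	 * recurse back into the filesystem, and KM_MAYFAIL so we get a
	 * NULL return instead of looping forever on failure.
	 */
	tp = kmem_zone_zalloc(xfs_trans_zone, KM_NOFS | KM_MAYFAIL);
	if (!tp) {
		/* ENOMEM: requeue this completion behind the others */
		queue_work(xfsdatad_workqueue, &ioend->io_work);
		return;
	}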

XFS-folk - thoughts?

[1] I don't see how any of the XFS changes we made make this easier to hit.
What I suspect is a VM regression w.r.t. memory reclaim because this is
the second problem since 2.6.26 that appears to be a result of memory
allocation failures in places that we've never, ever seen failures before.

The other new failure is this one:

http://bugzilla.kernel.org/show_bug.cgi?id=11805

which is an alloc_pages(GFP_KERNEL) failure....

mm-folk - care to weigh in?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com



* Re: deadlock with latest xfs
  2008-10-26  2:50           ` deadlock with latest xfs Dave Chinner
@ 2008-10-26  4:20             ` Dave Chinner
  2008-10-27  1:42             ` Lachlan McIlroy
  2008-10-28  6:02             ` Nick Piggin
  2 siblings, 0 replies; 10+ messages in thread
From: Dave Chinner @ 2008-10-26  4:20 UTC (permalink / raw)
  To: Lachlan McIlroy, Christoph Hellwig, xfs-oss, linux-mm

On Sun, Oct 26, 2008 at 01:50:13PM +1100, Dave Chinner wrote:
> On Sun, Oct 26, 2008 at 11:53:51AM +1100, Dave Chinner wrote:
> > On Fri, Oct 24, 2008 at 05:48:04PM +1100, Dave Chinner wrote:
> > > OK, I just hung a single-threaded rm -rf after this completed:
> > > 
> > > # fsstress -p 1024 -n 100 -d /mnt/xfs2/fsstress
> > > 
> > > It has hung with this trace:
> > > 
> > > # echo w > /proc/sysrq-trigger
> > ....
> > > [42954211.590000] 794877f8:  [<6002e40a>] update_curr+0x3a/0x50
> > > [42954211.590000] 79487818:  [<60014f0d>] _switch_to+0x6d/0xe0
> > > [42954211.590000] 79487858:  [<60324b21>] schedule+0x171/0x2c0
> > > [42954211.590000] 794878a8:  [<60324e6d>] schedule_timeout+0xad/0xf0
> > > [42954211.590000] 794878c8:  [<60326e98>] _spin_unlock_irqrestore+0x18/0x20
> > > [42954211.590000] 79487908:  [<60195455>] xlog_grant_log_space+0x245/0x470
> > > [42954211.590000] 79487920:  [<60030ba0>] default_wake_function+0x0/0x10
> > > [42954211.590000] 79487978:  [<601957a2>] xfs_log_reserve+0x122/0x140
> > > [42954211.590000] 794879c8:  [<601a36e7>] xfs_trans_reserve+0x147/0x2e0
> > > [42954211.590000] 794879f8:  [<60087374>] kmem_cache_alloc+0x84/0x100
> > > [42954211.590000] 79487a38:  [<601ab01f>] xfs_inactive_symlink_rmt+0x9f/0x450
> > > [42954211.590000] 79487a88:  [<601ada94>] kmem_zone_zalloc+0x34/0x50
> > > [42954211.590000] 79487aa8:  [<601a3a6d>] _xfs_trans_alloc+0x2d/0x70
> > ....
> > 
> > I came back to the system, and found that the hang had gone away - the
> > rm -rf had finished sometime in the ~36 hours between triggering the
> > problem and coming back to look at the corpse....
> > 
> > So nothing to report yet.
> 
> Got it now. I can reproduce this in a couple of minutes now that both
> the test fs and the fs hosting the UML fs images are using lazy-count=1
> (and the frequent 10s long host system freezes have gone away, too).
> 
> Looks like *another* new memory allocation problem [1]:

[snip]

And having fixed that, I'm now seeing the log reservation hang:

[42950307.350000] xfsdatad/0    D 00000000407219f0     0    51      2
[42950307.350000] 7bd1acd8 7bd1a838 60498c40 81074000 81077b40 60014f0d 81044780 81074000
[42950307.350000]        81074000 7e15f808 7bd1a800 81044780 81077b90 60324bc1 81074000 00000250
[42950307.350000]        81074000 81074000 7fffffffffffffff 6646a168 80b6dd28 80b6ddf8 81077bf0 60324f0d <6>Call Trace:
[42950307.350000] 81077b08:  [<60014f0d>] _switch_to+0x6d/0xe0
[42950307.350000] 81077b48:  [<60324bc1>] schedule+0x171/0x2c0
[42950307.350000] 81077b98:  [<60324f0d>] schedule_timeout+0xad/0xf0
[42950307.350000] 81077bb8:  [<60326f38>] _spin_unlock_irqrestore+0x18/0x20
[42950307.350000] 81077bf8:  [<601953e9>] xlog_grant_log_space+0x169/0x470
[42950307.350000] 81077c10:  [<60030ba0>] default_wake_function+0x0/0x10
[42950307.350000] 81077c68:  [<60195812>] xfs_log_reserve+0x122/0x140
[42950307.350000] 81077cb8:  [<601a3757>] xfs_trans_reserve+0x147/0x2e0
[42950307.350000] 81077ce8:  [<601adb14>] kmem_zone_zalloc+0x34/0x50
[42950307.350000] 81077d28:  [<6018f985>] xfs_iomap_write_unwritten+0xa5/0x2d0
[42950307.350000] 81077d38:  [<60326f38>] _spin_unlock_irqrestore+0x18/0x20
[42950307.350000] 81077d48:  [<60085750>] cache_free_debugcheck+0x150/0x2e0
[42950307.350000] 81077d50:  [<60063d12>] mempool_free_slab+0x12/0x20
[42950307.350000] 81077d88:  [<60085e02>] kmem_cache_free+0x72/0xb0
[42950307.350000] 81077dc8:  [<60063dbf>] mempool_free+0x4f/0x90
[42950307.350000] 81077e08:  [<601af66d>] xfs_end_bio_unwritten+0x6d/0xa0
[42950307.350000] 81077e38:  [<60048574>] run_workqueue+0xa4/0x180
[42950307.350000] 81077e50:  [<601af600>] xfs_end_bio_unwritten+0x0/0xa0
[42950307.350000] 81077e58:  [<6004c791>] prepare_to_wait+0x51/0x80
[42950307.350000] 81077e98:  [<600488e0>] worker_thread+0x70/0xd0
[42950307.350000] 81077eb0:  [<6004c5b0>] autoremove_wake_function+0x0/0x40
[42950307.350000] 81077ee8:  [<60048870>] worker_thread+0x0/0xd0
[42950307.350000] 81077f08:  [<6004c204>] kthread+0x64/0xb0
[42950307.350000] 81077f48:  [<60026285>] run_kernel_thread+0x35/0x60
[42950307.350000] 81077f58:  [<6004c1a0>] kthread+0x0/0xb0
[42950307.350000] 81077f98:  [<60026278>] run_kernel_thread+0x28/0x60
[42950307.350000] 81077fc8:  [<60014e71>] new_thread_handler+0x71/0xa0

Basically, the log is too small to fit the number of transaction reservations
that are currently being attempted (roughly 1000 parallel transactions), and so
xlog_grant_log_space() is sleeping. Because it is sleeping in I/O completion,
I/O completion is not occurring and hence the log tail can't move forward.

I think that at this point, we need a separate workqueue for unwritten extent
conversion to prevent it from blocking normal data and metadata I/O completion.
That way we can allow it to recurse on allocation and transaction reservation
without introducing I/O completion deadlocks....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com



* Re: deadlock with latest xfs
  2008-10-26  2:50           ` deadlock with latest xfs Dave Chinner
  2008-10-26  4:20             ` Dave Chinner
@ 2008-10-27  1:42             ` Lachlan McIlroy
  2008-10-27  5:30               ` Dave Chinner
  2008-10-28  6:02             ` Nick Piggin
  2 siblings, 1 reply; 10+ messages in thread
From: Lachlan McIlroy @ 2008-10-27  1:42 UTC (permalink / raw)
  To: Lachlan McIlroy, Christoph Hellwig, xfs-oss, linux-mm

Dave Chinner wrote:
> On Sun, Oct 26, 2008 at 11:53:51AM +1100, Dave Chinner wrote:
>> On Fri, Oct 24, 2008 at 05:48:04PM +1100, Dave Chinner wrote:
>>> OK, I just hung a single-threaded rm -rf after this completed:
>>>
>>> # fsstress -p 1024 -n 100 -d /mnt/xfs2/fsstress
>>>
>>> It has hung with this trace:
>>>
>>> # echo w > /proc/sysrq-trigger
>> ....
>>> [42954211.590000] 794877f8:  [<6002e40a>] update_curr+0x3a/0x50
>>> [42954211.590000] 79487818:  [<60014f0d>] _switch_to+0x6d/0xe0
>>> [42954211.590000] 79487858:  [<60324b21>] schedule+0x171/0x2c0
>>> [42954211.590000] 794878a8:  [<60324e6d>] schedule_timeout+0xad/0xf0
>>> [42954211.590000] 794878c8:  [<60326e98>] _spin_unlock_irqrestore+0x18/0x20
>>> [42954211.590000] 79487908:  [<60195455>] xlog_grant_log_space+0x245/0x470
>>> [42954211.590000] 79487920:  [<60030ba0>] default_wake_function+0x0/0x10
>>> [42954211.590000] 79487978:  [<601957a2>] xfs_log_reserve+0x122/0x140
>>> [42954211.590000] 794879c8:  [<601a36e7>] xfs_trans_reserve+0x147/0x2e0
>>> [42954211.590000] 794879f8:  [<60087374>] kmem_cache_alloc+0x84/0x100
>>> [42954211.590000] 79487a38:  [<601ab01f>] xfs_inactive_symlink_rmt+0x9f/0x450
>>> [42954211.590000] 79487a88:  [<601ada94>] kmem_zone_zalloc+0x34/0x50
>>> [42954211.590000] 79487aa8:  [<601a3a6d>] _xfs_trans_alloc+0x2d/0x70
>> ....
>>
>> I came back to the system, and found that the hang had gone away - the
>> rm -rf had finished sometime in the ~36 hours between triggering the
>> problem and coming back to look at the corpse....
>>
>> So nothing to report yet.
> 
> Got it now. I can reproduce this in a couple of minutes now that both
> the test fs and the fs hosting the UML fs images are using lazy-count=1
> (and the frequent 10s long host system freezes have gone away, too).
> 
> Looks like *another* new memory allocation problem [1]:
> 
> [42950422.270000] xfsdatad/0    D 000000000043bf7a     0    51      2
> [42950422.270000] 804add98 804ad8f8 60498c40 80474000 804776a0 60014f0d 80442780 1000111a8
> [42950422.270000]        80474000 7ff1ac08 804ad8c0 80442780 804776f0 60324b21 80474000 80477700
> [42950422.270000]        80474000 1000111a8 80477700 0000000a 804777e0 80477950 80477750 60324e39 <6>Call Trace:
> [42950422.270000] 80477668:  [<60014f0d>] _switch_to+0x6d/0xe0
> [42950422.270000] 804776a8:  [<60324b21>] schedule+0x171/0x2c0
> [42950422.270000] 804776f8:  [<60324e39>] schedule_timeout+0x79/0xf0
> [42950422.270000] 80477718:  [<60040360>] process_timeout+0x0/0x10
> [42950422.270000] 80477758:  [<60324619>] io_schedule_timeout+0x19/0x30
> [42950422.270000] 80477778:  [<6006eb74>] congestion_wait+0x74/0xa0
> [42950422.270000] 80477790:  [<6004c5b0>] autoremove_wake_function+0x0/0x40
> [42950422.270000] 804777d8:  [<600692a0>] throttle_vm_writeout+0x80/0xa0
> [42950422.270000] 80477818:  [<6006cdf4>] shrink_zone+0xac4/0xb10
> [42950422.270000] 80477828:  [<601adb5b>] kmem_alloc+0x5b/0x140
> [42950422.270000] 804778c8:  [<60186d48>] xfs_iext_inline_to_direct+0x68/0x80
> [42950422.270000] 804778f8:  [<60187e38>] xfs_iext_realloc_direct+0x128/0x1c0
> [42950422.270000] 80477928:  [<60188594>] xfs_iext_add+0xc4/0x290
> [42950422.270000] 80477978:  [<60166388>] xfs_bmbt_set_all+0x18/0x20
> [42950422.270000] 80477988:  [<601887c4>] xfs_iext_insert+0x64/0x80
> [42950422.270000] 804779c8:  [<6006d75a>] try_to_free_pages+0x1ea/0x330
> [42950422.270000] 80477a40:  [<6006ba40>] isolate_pages_global+0x0/0x40
> [42950422.270000] 80477a98:  [<60067887>] __alloc_pages_internal+0x267/0x540
> [42950422.270000] 80477b68:  [<60086b61>] cache_alloc_refill+0x4c1/0x970
> [42950422.270000] 80477b88:  [<60326ea9>] _spin_unlock+0x9/0x10
> [42950422.270000] 80477bd8:  [<6002ffc5>] __might_sleep+0x55/0x120
> [42950422.270000] 80477c08:  [<601ad9cd>] kmem_zone_alloc+0x7d/0x110
> [42950422.270000] 80477c18:  [<600873c3>] kmem_cache_alloc+0xd3/0x100
> [42950422.270000] 80477c58:  [<601ad9cd>] kmem_zone_alloc+0x7d/0x110
> [42950422.270000] 80477ca8:  [<601ada78>] kmem_zone_zalloc+0x18/0x50
> [42950422.270000] 80477cc8:  [<601a3a6d>] _xfs_trans_alloc+0x2d/0x70
> [42950422.270000] 80477ce8:  [<601a3b52>] xfs_trans_alloc+0xa2/0xb0
> [42950422.270000] 80477d18:  [<60027655>] set_signals+0x35/0x40
> [42950422.270000] 80477d48:  [<6018f93a>] xfs_iomap_write_unwritten+0x5a/0x260
> [42950422.270000] 80477d50:  [<60063d12>] mempool_free_slab+0x12/0x20
> [42950422.270000] 80477d68:  [<60027655>] set_signals+0x35/0x40
> [42950422.270000] 80477db8:  [<60063d12>] mempool_free_slab+0x12/0x20
> [42950422.270000] 80477dc8:  [<60063dbf>] mempool_free+0x4f/0x90
> [42950422.270000] 80477e18:  [<601af5e5>] xfs_end_bio_unwritten+0x65/0x80
> [42950422.270000] 80477e38:  [<60048574>] run_workqueue+0xa4/0x180
> [42950422.270000] 80477e50:  [<601af580>] xfs_end_bio_unwritten+0x0/0x80
> [42950422.270000] 80477e58:  [<6004c791>] prepare_to_wait+0x51/0x80
> [42950422.270000] 80477e98:  [<600488e0>] worker_thread+0x70/0xd0
> 
> We've entered memory reclaim inside the xfsdatad while trying to do
> unwritten extent conversion during I/O completion, and that memory
> reclaim is now blocked waiting for I/O completion that cannot make
> progress.
> 
> Nasty.
> 
> My initial thought is to make _xfs_trans_alloc() able to take a KM_NOFS argument
> so we don't re-enter the FS here. If we get an ENOMEM in this case, we should
> then re-queue the I/O completion at the back of the workqueue and let other
> I/O completions progress before retrying this one. That way the I/O that
> is simply cleaning memory will make progress, hence allowing memory
> allocation to occur successfully when we retry this I/O completion...
It could work - unless it's a synchronous I/O, in which case the I/O is not
complete until the extent conversion takes place.

Could we allocate the memory up front before the I/O is issued?

> 
> XFS-folk - thoughts?
> 
> [1] I don't see how any of the XFS changes we made make this easier to hit.
> What I suspect is a VM regression w.r.t. memory reclaim because this is
> the second problem since 2.6.26 that appears to be a result of memory
> allocation failures in places that we've never, ever seen failures before.
> 
> The other new failure is this one:
> 
> http://bugzilla.kernel.org/show_bug.cgi?id=11805
> 
> which is an alloc_pages(GFP_KERNEL) failure....
> 
> mm-folk - care to weigh in?
> 
> Cheers,
> 
> Dave.



* Re: deadlock with latest xfs
  2008-10-27  1:42             ` Lachlan McIlroy
@ 2008-10-27  5:30               ` Dave Chinner
  2008-10-27  6:29                 ` Lachlan McIlroy
  0 siblings, 1 reply; 10+ messages in thread
From: Dave Chinner @ 2008-10-27  5:30 UTC (permalink / raw)
  To: Lachlan McIlroy; +Cc: Christoph Hellwig, xfs-oss, linux-mm

On Mon, Oct 27, 2008 at 12:42:09PM +1100, Lachlan McIlroy wrote:
> Dave Chinner wrote:
>> On Sun, Oct 26, 2008 at 11:53:51AM +1100, Dave Chinner wrote:
>>> On Fri, Oct 24, 2008 at 05:48:04PM +1100, Dave Chinner wrote:
>>>> OK, I just hung a single-threaded rm -rf after this completed:
>>>>
>>>> # fsstress -p 1024 -n 100 -d /mnt/xfs2/fsstress
>>>>
>>>> It has hung with this trace:
....
>> Got it now. I can reproduce this in a couple of minutes now that both
>> the test fs and the fs hosting the UML fs images are using lazy-count=1
>> (and the frequent 10s long host system freezes have gone away, too).
>>
>> Looks like *another* new memory allocation problem [1]:
.....
>> We've entered memory reclaim inside the xfsdatad while trying to do
>> unwritten extent conversion during I/O completion, and that memory
>> reclaim is now blocked waiting for I/O completion that cannot make
>> progress.
>>
>> Nasty.
>>
>> My initial thought is to make _xfs_trans_alloc() able to take a KM_NOFS argument
>> so we don't re-enter the FS here. If we get an ENOMEM in this case, we should
>> then re-queue the I/O completion at the back of the workqueue and let other
>> I/O completions progress before retrying this one. That way the I/O that
>> is simply cleaning memory will make progress, hence allowing memory
>> allocation to occur successfully when we retry this I/O completion...
> It could work - unless it's a synchronous I/O, in which case the I/O is not
> complete until the extent conversion takes place.

Right. Pushing unwritten extent conversion onto a different
workqueue is probably the only way to handle this easily.
That's the same solution Irix has been using for a long time
(the xfsc thread)....

> Could we allocate the memory up front before the I/O is issued?

Possibly, but that will create more memory pressure than
allocation in I/O completion because now we could need to hold
thousands of allocations across an I/O - think of the case where
we are running low on memory and have a disk subsystem capable of
a few hundred thousand I/Os per second. The allocation failing would
prevent the I/Os from being issued, and if this is buffered writes
into unwritten extents we'd be preventing dirty pages from being
cleaned....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com



* Re: deadlock with latest xfs
  2008-10-27  5:30               ` Dave Chinner
@ 2008-10-27  6:29                 ` Lachlan McIlroy
  2008-10-27  6:54                   ` Dave Chinner
  0 siblings, 1 reply; 10+ messages in thread
From: Lachlan McIlroy @ 2008-10-27  6:29 UTC (permalink / raw)
  To: Lachlan McIlroy, Christoph Hellwig, xfs-oss, linux-mm

Dave Chinner wrote:
> On Mon, Oct 27, 2008 at 12:42:09PM +1100, Lachlan McIlroy wrote:
>> Dave Chinner wrote:
>>> On Sun, Oct 26, 2008 at 11:53:51AM +1100, Dave Chinner wrote:
>>>> On Fri, Oct 24, 2008 at 05:48:04PM +1100, Dave Chinner wrote:
>>>>> OK, I just hung a single-threaded rm -rf after this completed:
>>>>>
>>>>> # fsstress -p 1024 -n 100 -d /mnt/xfs2/fsstress
>>>>>
>>>>> It has hung with this trace:
> ....
>>> Got it now. I can reproduce this in a couple of minutes now that both
>>> the test fs and the fs hosting the UML fs images are using lazy-count=1
>>> (and the frequent 10s long host system freezes have gone away, too).
>>>
>>> Looks like *another* new memory allocation problem [1]:
> .....
>>> We've entered memory reclaim inside the xfsdatad while trying to do
>>> unwritten extent conversion during I/O completion, and that memory
>>> reclaim is now blocked waiting for I/O completion that cannot make
>>> progress.
>>>
>>> Nasty.
>>>
>>> My initial thought is to make _xfs_trans_alloc() able to take a KM_NOFS argument
>>> so we don't re-enter the FS here. If we get an ENOMEM in this case, we should
>>> then re-queue the I/O completion at the back of the workqueue and let other
>>> I/O completions progress before retrying this one. That way the I/O that
>>> is simply cleaning memory will make progress, hence allowing memory
>>> allocation to occur successfully when we retry this I/O completion...
>> It could work - unless it's a synchronous I/O, in which case the I/O is not
>> complete until the extent conversion takes place.
> 
> Right. Pushing unwritten extent conversion onto a different
> workqueue is probably the only way to handle this easily.
> That's the same solution Irix has been using for a long time
> (the xfsc thread)....

Would that be a workqueue specific to one filesystem?  Right now our
workqueues are per-cpu so they can contain I/O completions for multiple
filesystems.

> 
>> Could we allocate the memory up front before the I/O is issued?
> 
> Possibly, but that will create more memory pressure than
> allocation in I/O completion because now we could need to hold
> thousands of allocations across an I/O - think of the case where
> we are running low on memory and have a disk subsystem capable of
> a few hundred thousand I/Os per second. The allocation failing would
> prevent the I/Os from being issued, and if this is buffered writes
> into unwritten extents we'd be preventing dirty pages from being
> cleaned....

The allocation has to be done sometime - if we have a few hundred thousand
I/Os per second then the queue of unwritten extent conversion requests
is going to grow very quickly.  If a separate workqueue will fix this
then that's a better solution anyway.



* Re: deadlock with latest xfs
  2008-10-27  6:29                 ` Lachlan McIlroy
@ 2008-10-27  6:54                   ` Dave Chinner
  2008-10-27  7:31                     ` Lachlan McIlroy
  0 siblings, 1 reply; 10+ messages in thread
From: Dave Chinner @ 2008-10-27  6:54 UTC (permalink / raw)
  To: Lachlan McIlroy; +Cc: Christoph Hellwig, xfs-oss, linux-mm

On Mon, Oct 27, 2008 at 05:29:50PM +1100, Lachlan McIlroy wrote:
> Dave Chinner wrote:
>> On Mon, Oct 27, 2008 at 12:42:09PM +1100, Lachlan McIlroy wrote:
>>> Dave Chinner wrote:
>>>> On Sun, Oct 26, 2008 at 11:53:51AM +1100, Dave Chinner wrote:
>>>>> On Fri, Oct 24, 2008 at 05:48:04PM +1100, Dave Chinner wrote:
>>>>>> OK, I just hung a single-threaded rm -rf after this completed:
>>>>>>
>>>>>> # fsstress -p 1024 -n 100 -d /mnt/xfs2/fsstress
>>>>>>
>>>>>> It has hung with this trace:
>> ....
>>>> Got it now. I can reproduce this in a couple of minutes now that both
>>>> the test fs and the fs hosting the UML fs images are using lazy-count=1
>>>> (and the frequent 10s long host system freezes have gone away, too).
>>>>
>>>> Looks like *another* new memory allocation problem [1]:
>> .....
>>>> We've entered memory reclaim inside the xfsdatad while trying to do
>>>> unwritten extent conversion during I/O completion, and that memory
>>>> reclaim is now blocked waiting for I/O completion that cannot make
>>>> progress.
>>>>
>>>> Nasty.
>>>>
>>>> My initial thought is to make _xfs_trans_alloc() able to take a KM_NOFS argument
>>>> so we don't re-enter the FS here. If we get an ENOMEM in this case, we should
>>>> then re-queue the I/O completion at the back of the workqueue and let other
>>>> I/O completions progress before retrying this one. That way the I/O that
>>>> is simply cleaning memory will make progress, hence allowing memory
>>>> allocation to occur successfully when we retry this I/O completion...
>>> It could work - unless it's a synchronous I/O, in which case the I/O is not
>>> complete until the extent conversion takes place.
>>
>> Right. Pushing unwritten extent conversion onto a different
>> workqueue is probably the only way to handle this easily.
>> That's the same solution Irix has been using for a long time
>> (the xfsc thread)....
>
> Would that be a workqueue specific to one filesystem?  Right now our
> workqueues are per-cpu so they can contain I/O completions for multiple
> filesystems.

I've simply implemented another per-cpu workqueue set.

>>> Could we allocate the memory up front before the I/O is issued?
>>
>> Possibly, but that will create more memory pressure than
>> allocation in I/O completion because now we could need to hold
>> thousands of allocations across an I/O - think of the case where
>> we are running low on memory and have a disk subsystem capable of
>> a few hundred thousand I/Os per second. The allocation failing would
>> prevent the I/Os from being issued, and if this is buffered writes
>> into unwritten extents we'd be preventing dirty pages from being
>> cleaned....
>
> The allocation has to be done sometime - if we have a few hundred thousand
> I/Os per second then the queue of unwritten extent conversion requests
> is going to grow very quickly.

Sure, but the difference is that in a workqueue we are doing:

	alloc
	free
	alloc
	free
	.....
	alloc
	free

So the instantaneous memory usage is bound by the number of
workqueue threads doing conversions. The "pre-allocate" case is:

	alloc
	alloc
	alloc
	alloc
	......
	<io completes>
	free
	.....
	<io_completes>
	free
	.....

so the allocation is bound by the number of parallel I/Os we have
not completed. Given that the transaction structure is *800* bytes,
they will consume memory very quickly if pre-allocated before the
I/O is dispatched.
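
To put rough numbers on it (using the hypothetical few-hundred-thousand
IOPS figure from earlier in the thread):

	pre-allocated:  100,000 uncompleted I/Os x 800 bytes ~= 80MB pinned
	at completion:  N workqueue threads x 800 bytes      ~= a few KB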

> If a separate workqueue will fix this
> then that's a better solution anyway.

I think so. The patch I have been testing is below.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


XFS: Prevent unwritten extent conversion from blocking I/O completion

Unwritten extent conversion can recurse back into the filesystem due
to memory allocation. Memory reclaim requires I/O completions to be
processed to allow the callers to make progress. If the I/O
completion workqueue thread is doing the recursion, then we have a
deadlock situation.

Move unwritten extent completion into its own workqueue so it
doesn't block I/O completions for normal delayed allocation or
overwrite data.

Signed-off-by: Dave Chinner <david@fromorbit.com>
---
 fs/xfs/linux-2.6/xfs_aops.c |   38 +++++++++++++++++++++-----------------
 fs/xfs/linux-2.6/xfs_aops.h |    1 +
 fs/xfs/linux-2.6/xfs_buf.c  |    9 +++++++++
 3 files changed, 31 insertions(+), 17 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_aops.c b/fs/xfs/linux-2.6/xfs_aops.c
index 6f4ebd0..f8fa620 100644
--- a/fs/xfs/linux-2.6/xfs_aops.c
+++ b/fs/xfs/linux-2.6/xfs_aops.c
@@ -119,23 +119,6 @@ xfs_find_bdev_for_inode(
 }
 
 /*
- * Schedule IO completion handling on a xfsdatad if this was
- * the final hold on this ioend. If we are asked to wait,
- * flush the workqueue.
- */
-STATIC void
-xfs_finish_ioend(
-	xfs_ioend_t	*ioend,
-	int		wait)
-{
-	if (atomic_dec_and_test(&ioend->io_remaining)) {
-		queue_work(xfsdatad_workqueue, &ioend->io_work);
-		if (wait)
-			flush_workqueue(xfsdatad_workqueue);
-	}
-}
-
-/*
  * We're now finished for good with this ioend structure.
  * Update the page state via the associated buffer_heads,
  * release holds on the inode and bio, and finally free
@@ -266,6 +249,27 @@ xfs_end_bio_read(
 }
 
 /*
+ * Schedule IO completion handling on a xfsdatad if this was
+ * the final hold on this ioend. If we are asked to wait,
+ * flush the workqueue.
+ */
+STATIC void
+xfs_finish_ioend(
+	xfs_ioend_t	*ioend,
+	int		wait)
+{
+	if (atomic_dec_and_test(&ioend->io_remaining)) {
+		struct workqueue_struct *wq = xfsdatad_workqueue;
+		if (ioend->io_work.func == xfs_end_bio_unwritten)
+			wq = xfsconvertd_workqueue;
+
+		queue_work(wq, &ioend->io_work);
+		if (wait)
+			flush_workqueue(wq);
+	}
+}
+
+/*
  * Allocate and initialise an IO completion structure.
  * We need to track unwritten extent write completion here initially.
  * We'll need to extend this for updating the ondisk inode size later
diff --git a/fs/xfs/linux-2.6/xfs_aops.h b/fs/xfs/linux-2.6/xfs_aops.h
index 3ba0631..7643f82 100644
--- a/fs/xfs/linux-2.6/xfs_aops.h
+++ b/fs/xfs/linux-2.6/xfs_aops.h
@@ -19,6 +19,7 @@
 #define __XFS_AOPS_H__
 
 extern struct workqueue_struct *xfsdatad_workqueue;
+extern struct workqueue_struct *xfsconvertd_workqueue;
 extern mempool_t *xfs_ioend_pool;
 
 typedef void (*xfs_ioend_func_t)(void *);
diff --git a/fs/xfs/linux-2.6/xfs_buf.c b/fs/xfs/linux-2.6/xfs_buf.c
index 36d5fcd..c1f55b3 100644
--- a/fs/xfs/linux-2.6/xfs_buf.c
+++ b/fs/xfs/linux-2.6/xfs_buf.c
@@ -45,6 +45,7 @@ static struct shrinker xfs_buf_shake = {
 
 static struct workqueue_struct *xfslogd_workqueue;
 struct workqueue_struct *xfsdatad_workqueue;
+struct workqueue_struct *xfsconvertd_workqueue;
 
 #ifdef XFS_BUF_TRACE
 void
@@ -1756,6 +1757,7 @@ xfs_flush_buftarg(
 	xfs_buf_t	*bp, *n;
 	int		pincount = 0;
 
+	xfs_buf_runall_queues(xfsconvertd_workqueue);
 	xfs_buf_runall_queues(xfsdatad_workqueue);
 	xfs_buf_runall_queues(xfslogd_workqueue);
 
@@ -1812,9 +1814,15 @@ xfs_buf_init(void)
 	if (!xfsdatad_workqueue)
 		goto out_destroy_xfslogd_workqueue;
 
+	xfsconvertd_workqueue = create_workqueue("xfsconvertd");
+	if (!xfsconvertd_workqueue)
+		goto out_destroy_xfsdatad_workqueue;
+
 	register_shrinker(&xfs_buf_shake);
 	return 0;
 
+ out_destroy_xfsdatad_workqueue:
+	destroy_workqueue(xfsdatad_workqueue);
  out_destroy_xfslogd_workqueue:
 	destroy_workqueue(xfslogd_workqueue);
  out_free_buf_zone:
@@ -1830,6 +1838,7 @@ void
 xfs_buf_terminate(void)
 {
 	unregister_shrinker(&xfs_buf_shake);
+	destroy_workqueue(xfsconvertd_workqueue);
 	destroy_workqueue(xfsdatad_workqueue);
 	destroy_workqueue(xfslogd_workqueue);
 	kmem_zone_destroy(xfs_buf_zone);



* Re: deadlock with latest xfs
  2008-10-27  6:54                   ` Dave Chinner
@ 2008-10-27  7:31                     ` Lachlan McIlroy
  0 siblings, 0 replies; 10+ messages in thread
From: Lachlan McIlroy @ 2008-10-27  7:31 UTC (permalink / raw)
  To: Lachlan McIlroy, Christoph Hellwig, xfs-oss, linux-mm

Dave Chinner wrote:
> On Mon, Oct 27, 2008 at 05:29:50PM +1100, Lachlan McIlroy wrote:
>> Dave Chinner wrote:
>>> On Mon, Oct 27, 2008 at 12:42:09PM +1100, Lachlan McIlroy wrote:
>>>> Dave Chinner wrote:
>>>>> On Sun, Oct 26, 2008 at 11:53:51AM +1100, Dave Chinner wrote:
>>>>>> On Fri, Oct 24, 2008 at 05:48:04PM +1100, Dave Chinner wrote:
>>>>>>> OK, I just hung a single-threaded rm -rf after this completed:
>>>>>>>
>>>>>>> # fsstress -p 1024 -n 100 -d /mnt/xfs2/fsstress
>>>>>>>
>>>>>>> It has hung with this trace:
>>> ....
>>>>> Got it now. I can reproduce this in a couple of minutes now that both
>>>>> the test fs and the fs hosting the UML fs images are using lazy-count=1
>>>>> (and the frequent 10s long host system freezes have gone away, too).
>>>>>
>>>>> Looks like *another* new memory allocation problem [1]:
>>> .....
>>>>> We've entered memory reclaim inside the xfsdatad while trying to do
>>>>> unwritten extent conversion during I/O completion, and that memory
>>>>> reclaim is now blocked waiting for I/O completion that cannot make
>>>>> progress.
>>>>>
>>>>> Nasty.
>>>>>
>>>>> My initial thought is to make _xfs_trans_alloc() able to take a KM_NOFS argument
>>>>> so we don't re-enter the FS here. If we get an ENOMEM in this case, we should
>>>>> then re-queue the I/O completion at the back of the workqueue and let other
>>>>> I/O completions progress before retrying this one. That way the I/O that
>>>>> is simply cleaning memory will make progress, hence allowing memory
>>>>> allocation to occur successfully when we retry this I/O completion...
>>>> It could work - unless it's a synchronous I/O, in which case the I/O is not
>>>> complete until the extent conversion takes place.
>>> Right. Pushing unwritten extent conversion onto a different
>>> workqueue is probably the only way to handle this easily.
>>> That's the same solution Irix has been using for a long time
>>> (the xfsc thread)....
>> Would that be a workqueue specific to one filesystem?  Right now our
>> workqueues are per-cpu so they can contain I/O completions for multiple
>> filesystems.
> 
> I've simply implemented another per-cpu workqueue set.
> 
>>>> Could we allocate the memory up front before the I/O is issued?
>>> Possibly, but that will create more memory pressure than
>>> allocation in I/O completion because now we could need to hold
>>> thousands of allocations across an I/O - think of the case where
>>> we are running low on memory and have a disk subsystem capable of
>>> a few hundred thousand I/Os per second. The allocation failing would
>>> prevent the I/Os from being issued, and if this is buffered writes
>>> into unwritten extents we'd be preventing dirty pages from being
>>> cleaned....
>> The allocation has to be done sometime - if we have a few hundred thousand
>> I/Os per second then the queue of unwritten extent conversion requests
>> is going to grow very quickly.
> 
> Sure, but the difference is that in a workqueue we are doing:
> 
> 	alloc
> 	free
> 	alloc
> 	free
> 	.....
> 	alloc
> 	free
> 
> So the instantaneous memory usage is bound by the number of
> workqueue threads doing conversions. The "pre-allocate" case is:
> 
> 	alloc
> 	alloc
> 	alloc
> 	alloc
> 	......
> 	<io completes>
> 	free
> 	.....
> 	<io_completes>
> 	free
> 	.....
> 
> so the allocation is bound by the number of parallel I/Os we have
> not completed. Given that the transaction structure is *800* bytes,
> they will consume memory very quickly if pre-allocated before the
> I/O is dispatched.
Ah, yes, of course - I see your point.  It would only really work for
synchronous I/O.

Even with the current code we could have queues that grow very large
because buffered writes to unwritten extents don't wait for the
conversion.  So even with the small amount of memory we allocate for
each queue entry, we could still consume a lot in total.

> 
>> If a separate workqueue will fix this
>> then that's a better solution anyway.
> 
> I think so. The patch I have been testing is below.

Thanks, I'll add it to the list.



* Re: deadlock with latest xfs
  2008-10-26  2:50           ` deadlock with latest xfs Dave Chinner
  2008-10-26  4:20             ` Dave Chinner
  2008-10-27  1:42             ` Lachlan McIlroy
@ 2008-10-28  6:02             ` Nick Piggin
  2008-10-28  6:25               ` Dave Chinner
  2 siblings, 1 reply; 10+ messages in thread
From: Nick Piggin @ 2008-10-28  6:02 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Lachlan McIlroy, Christoph Hellwig, xfs-oss, linux-mm

On Sunday 26 October 2008 13:50, Dave Chinner wrote:

> [1] I don't see how any of the XFS changes we made make this easier to hit.
> What I suspect is a VM regression w.r.t. memory reclaim because this is
> the second problem since 2.6.26 that appears to be a result of memory
> allocation failures in places that we've never, ever seen failures before.
>
> The other new failure is this one:
>
> http://bugzilla.kernel.org/show_bug.cgi?id=11805
>
> which is an alloc_pages(GFP_KERNEL) failure....
>
> mm-folk - care to weigh in?

An order-0 GFP_KERNEL page allocation can fail sometimes: if the caller
is in reclaim or is a PF_MEMALLOC thread; if it is OOM-killed; or under
fault injection.

This is even the case for __GFP_NOFAIL allocations (which are basically
buggy anyway).

Not sure why it might have started happening, but I didn't see
exactly which alloc_pages you are talking about. If it is via slab,
then maybe some parameters have changed (e.g. in SLUB, which uses
higher-order allocations).
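
For illustration, the defensive pattern this implies (generic kernel
code, nothing XFS-specific, just a sketch):

	struct page *page = alloc_pages(GFP_KERNEL, 0);	/* order-0 */

	if (!page)
		return -ENOMEM;	/* it can happen; never assume success */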



* Re: deadlock with latest xfs
  2008-10-28  6:02             ` Nick Piggin
@ 2008-10-28  6:25               ` Dave Chinner
  2008-10-28  8:56                 ` Nick Piggin
  0 siblings, 1 reply; 10+ messages in thread
From: Dave Chinner @ 2008-10-28  6:25 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Lachlan McIlroy, Christoph Hellwig, xfs-oss, linux-mm

On Tue, Oct 28, 2008 at 05:02:16PM +1100, Nick Piggin wrote:
> On Sunday 26 October 2008 13:50, Dave Chinner wrote:
> 
> > [1] I don't see how any of the XFS changes we made make this easier to hit.
> > What I suspect is a VM regression w.r.t. memory reclaim because this is
> > the second problem since 2.6.26 that appears to be a result of memory
> > allocation failures in places that we've never, ever seen failures before.
> >
> > The other new failure is this one:
> >
> > http://bugzilla.kernel.org/show_bug.cgi?id=11805
> >
> > which is an alloc_pages(GFP_KERNEL) failure....
> >
> > mm-folk - care to weigh in?
> 
> An order-0 GFP_KERNEL page allocation can fail sometimes: if the caller
> is in reclaim or is a PF_MEMALLOC thread; if it is OOM-killed; or under
> fault injection.
> 
> This is even the case for __GFP_NOFAIL allocations (which are basically
> buggy anyway).
> 
> Not sure why it might have started happening, but I didn't see
> exactly which alloc_pages you are talking about. If it is via slab,
> then maybe some parameters have changed (e.g. in SLUB, which uses
> higher-order allocations).

In fs/xfs/linux-2.6/xfs_buf.c::xfs_buf_get_noaddr(). It's doing a
single page allocation at a time.

It may be that this failure is caused by increased base memory
consumption of the kernel, as this failure was reported in an lguest
and reproduced with a simple 'modprobe xfs ; mount /dev/xxx
/mnt/xfs' command. Maybe the lguest had very little memory available
to begin with and trying to allocate 2MB of pages for 8x256k log
buffers may have been too much for it...
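
For reference, the allocation pattern in question is roughly this (a
sketch only, not the exact xfs_buf_get_noaddr() body - 'pages' and
'page_count' are stand-ins for the real buffer fields):

	/*
	 * One order-0 page at a time; any single failure fails the
	 * whole buffer, so 8 x 256k log buffers need 512 consecutive
	 * successes with 4k pages.
	 */
	for (i = 0; i < page_count; i++) {
		pages[i] = alloc_page(GFP_KERNEL);
		if (!pages[i])
			goto fail_free_pages;	/* unwind and bail */
	}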

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com



* Re: deadlock with latest xfs
  2008-10-28  6:25               ` Dave Chinner
@ 2008-10-28  8:56                 ` Nick Piggin
  0 siblings, 0 replies; 10+ messages in thread
From: Nick Piggin @ 2008-10-28  8:56 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Lachlan McIlroy, Christoph Hellwig, xfs-oss, linux-mm

On Tuesday 28 October 2008 17:25, Dave Chinner wrote:
> On Tue, Oct 28, 2008 at 05:02:16PM +1100, Nick Piggin wrote:
> > On Sunday 26 October 2008 13:50, Dave Chinner wrote:
> > > [1] I don't see how any of the XFS changes we made make this easier to
> > > hit. What I suspect is a VM regression w.r.t. memory reclaim because
> > > this is the second problem since 2.6.26 that appears to be a result of
> > > memory allocation failures in places that we've never, ever seen
> > > failures before.
> > >
> > > The other new failure is this one:
> > >
> > > http://bugzilla.kernel.org/show_bug.cgi?id=11805
> > >
> > > which is an alloc_pages(GFP_KERNEL) failure....
> > >
> > > mm-folk - care to weigh in?
> >
> > An order-0 GFP_KERNEL page allocation can fail sometimes: if the caller
> > is in reclaim or is a PF_MEMALLOC thread; if it is OOM-killed; or under
> > fault injection.
> >
> > This is even the case for __GFP_NOFAIL allocations (which are basically
> > buggy anyway).
> >
> > Not sure why it might have started happening, but I didn't see
> > exactly which alloc_pages you are talking about. If it is via slab,
> > then maybe some parameters have changed (e.g. in SLUB, which uses
> > higher-order allocations).
>
> In fs/xfs/linux-2.6/xfs_buf.c::xfs_buf_get_noaddr(). It's doing a
> single page allocation at a time.
>
> It may be that this failure is caused by increased base memory
> consumption of the kernel, as this failure was reported in an lguest
> and reproduced with a simple 'modprobe xfs ; mount /dev/xxx
> /mnt/xfs' command. Maybe the lguest had very little memory available
> to begin with and trying to allocate 2MB of pages for 8x256k log
> buffers may have been too much for it...

I suppose it could have been getting oom-killed, then failing the
alloc, then oopsing on the way out, yes.



Thread overview: 10+ messages
     [not found] <4900412A.2050802@sgi.com>
     [not found] ` <20081023205727.GA28490@infradead.org>
     [not found]   ` <49013C47.4090601@sgi.com>
     [not found]     ` <20081024052418.GO25906@disturbed>
     [not found]       ` <20081024064804.GQ25906@disturbed>
     [not found]         ` <20081026005351.GK18495@disturbed>
2008-10-26  2:50           ` deadlock with latest xfs Dave Chinner
2008-10-26  4:20             ` Dave Chinner
2008-10-27  1:42             ` Lachlan McIlroy
2008-10-27  5:30               ` Dave Chinner
2008-10-27  6:29                 ` Lachlan McIlroy
2008-10-27  6:54                   ` Dave Chinner
2008-10-27  7:31                     ` Lachlan McIlroy
2008-10-28  6:02             ` Nick Piggin
2008-10-28  6:25               ` Dave Chinner
2008-10-28  8:56                 ` Nick Piggin
