* memory fragmentation issues on 4.4
From: Nikolay Borisov @ 2016-03-28 9:14 UTC
To: Linux MM; +Cc: vbabka, mgorman
Hello,
On kernel 4.4 I observe that the memory gets really fragmented fairly
quickly. E.g. there are no order > 4 pages even after 2 days of uptime.
This leads to certain data structures on XFS (in my case order 4/order 5
allocations) not being allocated and causes the server to stall. When
this happens either someone has to log on the server and manually invoke
the memory compaction or plain reboot the server. Before that the server
was running with the exact same workload but with a 3.12.52 kernel and no
such issues were observed. That is, memory was fragmented but allocations
didn't fail; maybe alloc_pages_direct_compact was doing a better job?
FYI the allocation is performed with GFP_KERNEL | GFP_NOFS
Manual compaction usually does the job; however, I'm wondering why
invoking __alloc_pages_direct_compact from within __alloc_pages_nodemask
doesn't satisfy the request if manual compaction would do the job. Is there a
difference in the efficiency of manually invoking memory compaction and
the one invoked from the page allocator path?
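(By manually invoking compaction I mean the global trigger, i.e.
echo 1 > /proc/sys/vm/compact_memory.)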
Another question for my own satisfaction - I created a kernel module
which allocates pages of very high order (8/9); then later, when those
pages are returned I see the number of unmovable pages increase by the
amount of pages returned. So should freed pages go to the unmovable
category?
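Roughly, the module does something like this (a minimal sketch from memory,
not the actual code - the names are made up):

        #include <linux/module.h>
        #include <linux/gfp.h>

        static struct page *pages;

        static int __init frag_test_init(void)
        {
                /* order-9 = 512 pages = 2MB with 4K pages; GFP_KERNEL means
                 * the request is served from unmovable pageblocks */
                pages = alloc_pages(GFP_KERNEL, 9);
                return pages ? 0 : -ENOMEM;
        }

        static void __exit frag_test_exit(void)
        {
                /* after this, watch the migratetype counters change */
                if (pages)
                        __free_pages(pages, 9);
        }

        module_init(frag_test_init);
        module_exit(frag_test_exit);
        MODULE_LICENSE("GPL");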
* Re: memory fragmentation issues on 4.4
From: Nikolay Borisov @ 2016-03-28 11:14 UTC
To: Tetsuo Handa; +Cc: Linux MM, vbabka, mgorman
On 03/28/2016 01:55 PM, Tetsuo Handa wrote:
> On 2016/03/28 18:14, Nikolay Borisov wrote:
>> Hello,
>>
>> On kernel 4.4 I observe that the memory gets really fragmented fairly
>> quickly. E.g. there are no order > 4 pages even after 2 days of uptime.
>> This leads to certain data structures on XFS (in my case order 4/order 5
>> allocations) not being allocated and causes the server to stall. When
>> this happens either someone has to log on the server and manually invoke
>> the memory compaction or plain reboot the server. Before that the server
>> was running with the exact same workload but with a 3.12.52 kernel and no
>> such issues were observed. That is, memory was fragmented but allocations
>> didn't fail; maybe alloc_pages_direct_compact was doing a better job?
>
> I'm not a mm person. But currently the page allocator does not give up
> unless there are no reclaimable zones. That would be the reason the allocation
> did not fail but caused the system to stall. It would be interesting for mm
> people if you could try, apart from your fragmentation issue, running a
> linux-next kernel which includes the OOM detection rework
> ( https://lwn.net/Articles/667939/ ).
I don't think that this would have helped, since the machine didn't run
out of memory; rather, memory was so fragmented that an order-5 allocation
could not be satisfied, which I think means no OOM logic would have been
triggered.
Actually, the allocation did fail but was infinitely retried by virtue of
the logic in kmem_alloc. So in this case kmalloc was returning a NULL pointer.
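For reference, the retry loop in XFS' kmem_alloc (fs/xfs/kmem.c) looks
roughly like this in 4.4 - this is what produces the "possible memory
allocation deadlock" warning and the congestion_wait() visible in the
backtraces:

        void *
        kmem_alloc(size_t size, xfs_km_flags_t flags)
        {
                int     retries = 0;
                gfp_t   lflags = kmem_flags_convert(flags);
                void    *ptr;

                do {
                        ptr = kmalloc(size, lflags);
                        /* only KM_MAYFAIL/KM_NOSLEEP callers ever see NULL */
                        if (ptr || (flags & (KM_MAYFAIL|KM_NOSLEEP)))
                                return ptr;
                        if (!(++retries % 100))
                                xfs_err(NULL,
        "%s(%u) possible memory allocation deadlock size %u in %s (mode:0x%x)",
                                        current->comm, current->pid,
                                        (unsigned int)size, __func__, lflags);
                        congestion_wait(BLK_RW_ASYNC, HZ/50);
                } while (1);
        }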
>
>>
>> FYI the allocation is performed with GFP_KERNEL | GFP_NOFS
>
> Excuse me, but GFP_KERNEL is GFP_NOFS | __GFP_FS, and therefore
> GFP_KERNEL | GFP_NOFS is GFP_KERNEL. What did you mean?
Right, so it's: (GFP_KERNEL | __GFP_NOWARN) & ~__GFP_FS
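That mask comes out of kmem_flags_convert(), which does roughly:

        static inline gfp_t
        kmem_flags_convert(xfs_km_flags_t flags)
        {
                gfp_t   lflags;

                if (flags & KM_NOSLEEP)
                        lflags = GFP_ATOMIC | __GFP_NOWARN;
                else {
                        lflags = GFP_KERNEL | __GFP_NOWARN;
                        /* KM_NOFS strips __GFP_FS, giving the mask above */
                        if (flags & KM_NOFS)
                                lflags &= ~__GFP_FS;
                }
                return lflags;
        }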
>
>>
>>
>> Manual compaction usually does the job; however, I'm wondering why
>> invoking __alloc_pages_direct_compact from within __alloc_pages_nodemask
>> doesn't satisfy the request if manual compaction would do the job. Is there a
>> difference in the efficiency of manually invoking memory compaction and
>> the one invoked from the page allocator path?
>>
>>
>> Another question for my own satisfaction - I created a kernel module
>> which allocates pages of very high order (8/9); then later, when those
>> pages are returned I see the number of unmovable pages increase by the
>> amount of pages returned. So should freed pages go to the unmovable
>> category?
>>
* Re: memory fragmentation issues on 4.4
From: Tetsuo Handa @ 2016-03-28 11:45 UTC
To: kernel; +Cc: linux-mm, vbabka, mgorman
Nikolay Borisov wrote:
> On 03/28/2016 01:55 PM, Tetsuo Handa wrote:
> > On 2016/03/28 18:14, Nikolay Borisov wrote:
> >> Hello,
> >>
> >> On kernel 4.4 I observe that the memory gets really fragmented fairly
> >> quickly. E.g. there are no order > 4 pages even after 2 days of uptime.
> >> This leads to certain data structures on XFS (in my case order 4/order 5
> >> allocations) not being allocated and causes the server to stall. When
> >> this happens either someone has to log on the server and manually invoke
> >> the memory compaction or plain reboot the server. Before that the server
> >> was running with the exact same workload but with a 3.12.52 kernel and no
> >> such issues were observed. That is, memory was fragmented but allocations
> >> didn't fail; maybe alloc_pages_direct_compact was doing a better job?
> >
> > I'm not a mm person. But currently the page allocator does not give up
> > unless there are no reclaimable zones. That would be the reason the allocation
> > did not fail but caused the system to stall. It would be interesting for mm
> > people if you could try, apart from your fragmentation issue, running a
> > linux-next kernel which includes the OOM detection rework
> > ( https://lwn.net/Articles/667939/ ).
>
> I don't think that this would have helped, since the machine didn't run
> out of memory; rather, memory was so fragmented that an order-5 allocation
> could not be satisfied, which I think means no OOM logic would have been
> triggered.
>
> Actually, the allocation did fail but was infinitely retried by virtue of
> the logic in kmem_alloc. So in this case kmalloc was returning a NULL pointer.
Oops, I missed
        /* The OOM killer will not help higher order allocs */
        if (order > PAGE_ALLOC_COSTLY_ORDER)
                goto out;
in __alloc_pages_may_oom().
>
>
> >
> >>
> >> FYI the allocation is performed with GFP_KERNEL | GFP_NOFS
> >
> > Excuse me, but GFP_KERNEL is GFP_NOFS | __GFP_FS, and therefore
> > GFP_KERNEL | GFP_NOFS is GFP_KERNEL. What did you mean?
>
> Right, so it's: (GFP_KERNEL | __GFP_NOWARN) & ~__GFP_FS
>
So, a !__GFP_FS && !__GFP_NOFAIL && order > 3 allocation from kmem_alloc()
is stalling. Sorry, I'm not familiar with fragmentation.
> >
> >>
> >>
> >> Manual compaction usually does the job; however, I'm wondering why
> >> invoking __alloc_pages_direct_compact from within __alloc_pages_nodemask
> >> doesn't satisfy the request if manual compaction would do the job. Is there a
> >> difference in the efficiency of manually invoking memory compaction and
> >> the one invoked from the page allocator path?
> >>
> >>
> >> Another question for my own satisfaction - I created a kernel module
> >> which allocates pages of very high order (8/9); then later, when those
> >> pages are returned I see the number of unmovable pages increase by the
> >> amount of pages returned. So should freed pages go to the unmovable
> >> category?
> >>
* Re: memory fragmentation issues on 4.4
From: Vlastimil Babka @ 2016-03-29 14:20 UTC
To: Nikolay Borisov, Linux MM; +Cc: mgorman
On 03/28/2016 11:14 AM, Nikolay Borisov wrote:
> Hello,
>
> On kernel 4.4 I observe that the memory gets really fragmented fairly
> quickly. E.g. there are no order > 4 pages even after 2 days of uptime.
> This leads to certain data structures on XFS (in my case order 4/order 5
> allocations) not being allocated and causes the server to stall. When
> this happens either someone has to log on the server and manually invoke
> the memory compaction or plain reboot the server. Before that the server
> was running with the exact same workload but with a 3.12.52 kernel and no
> such issues were observed. That is, memory was fragmented but allocations
> didn't fail; maybe alloc_pages_direct_compact was doing a better job?
>
> FYI the allocation is performed with GFP_KERNEL | GFP_NOFS
GFP_NOFS is indeed excluded from memory compaction in the allocation
context (i.e. direct compaction).
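The gate is right at the top of try_to_compact_pages() (roughly, in 4.4's
mm/compaction.c):

        int may_enter_fs = gfp_mask & __GFP_FS;
        int may_perform_io = gfp_mask & __GFP_IO;

        /* Check if the GFP flags allow compaction */
        if (!order || !may_enter_fs || !may_perform_io)
                return COMPACT_SKIPPED;

so a !__GFP_FS allocation never enters direct compaction, no matter how
often it retries.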
> Manual compaction usually does the job; however, I'm wondering why
> invoking __alloc_pages_direct_compact from within __alloc_pages_nodemask
> doesn't satisfy the request if manual compaction would do the job. Is there a
> difference in the efficiency of manually invoking memory compaction and
> the one invoked from the page allocator path?
Manual compaction via /proc is known to be safe because it doesn't hold
any locks that XFS might be holding. Compaction relies on page migration,
and IIRC some filesystems cannot migrate dirty pages without doing
writeback, and if that writeback called back into XFS, it would be a
deadlock. However, we could investigate whether async compaction would be
safe.
In any case, such high-order allocations should always have an order-0
fallback. You're suggesting there's an infinite loop around the
allocation attempt instead? Do you have the full backtrace?
Even an infinite loop should eventually proceed by having kswapd do the
compaction work with unrestricted context. But it turns out kswapd
compaction was broken. The hopefully good news is that 4.6-rc1 has kcompactd
for that, which should work better. If you're able to test such an
experimental kernel, I would be interested to hear whether it helped.
> Another question for my own satisfaction - I created a kernel module
> which allocates pages of very high order (8/9); then later, when those
> pages are returned I see the number of unmovable pages increase by the
> amount of pages returned. So should freed pages go to the unmovable
> category?
Your kernel module was likely doing unmovable allocations (e.g. GFP_KERNEL),
so the freed pages are returned to the unmovable freelists. But once they
merge to order-9 or higher, it doesn't really matter, as it's trivial to
convert them to the movable lists in response to movable allocation
demand, without causing permanent fragmentation.
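The free path keys off the pageblock's migratetype - roughly, in 4.4's
mm/page_alloc.c:

        static void __free_pages_ok(struct page *page, unsigned int order)
        {
                unsigned long flags;
                int migratetype;
                unsigned long pfn = page_to_pfn(page);

                if (!free_pages_prepare(page, order))
                        return;

                /* the pageblock still carries the migratetype it was
                 * marked with when it was allocated from */
                migratetype = get_pfnblock_migratetype(page, pfn);
                local_irq_save(flags);
                __count_vm_events(PGFREE, 1 << order);
                free_one_page(page_zone(page), page, pfn, order, migratetype);
                local_irq_restore(flags);
        }

You can watch the per-migratetype free counts in /proc/pagetypeinfo.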
* Re: memory fragmentation issues on 4.4
From: Nikolay Borisov @ 2016-03-29 14:53 UTC
To: Vlastimil Babka, Linux MM; +Cc: mgorman
On 03/29/2016 05:20 PM, Vlastimil Babka wrote:
> On 03/28/2016 11:14 AM, Nikolay Borisov wrote:
>> Hello,
>>
>> On kernel 4.4 I observe that the memory gets really fragmented fairly
>> quickly. E.g. there are no order > 4 pages even after 2 days of uptime.
>> This leads to certain data structures on XFS (in my case order 4/order 5
>> allocations) not being allocated and causes the server to stall. When
>> this happens either someone has to log on the server and manually invoke
>> the memory compaction or plain reboot the server. Before that the server
>> was running with the exact same workload but with a 3.12.52 kernel and no
>> such issues were observed. That is, memory was fragmented but allocations
>> didn't fail; maybe alloc_pages_direct_compact was doing a better job?
>>
>> FYI the allocation is performed with GFP_KERNEL | GFP_NOFS
>
> GFP_NOFS is indeed excluded from memory compaction in the allocation
> context (i.e. direct compaction).
>
>> Manual compaction usually does the job; however, I'm wondering why
>> invoking __alloc_pages_direct_compact from within __alloc_pages_nodemask
>> doesn't satisfy the request if manual compaction would do the job. Is there a
>> difference in the efficiency of manually invoking memory compaction and
>> the one invoked from the page allocator path?
>
> Manual compaction via /proc is known to be safe because it doesn't hold
> any locks that XFS might be holding. Compaction relies on page migration,
> and IIRC some filesystems cannot migrate dirty pages without doing
> writeback, and if that writeback called back into XFS, it would be a
> deadlock. However, we could investigate whether async compaction would be
> safe.
>
> In any case, such high-order allocations should always have an order-0
> fallback. You're suggesting there's an infinite loop around the
> allocation attempt instead? Do you have the full backtrace?
Yes, here is a full backtrace:
loop0 D ffff881fe081f038 0 15174 2 0x00000000
ffff881fe081f038 ffff883ff29fa700 ffff881fecb70d00 ffff88407fffae00
0000000000000000 0000000502404240 ffffffff81e30d60 0000000000000000
0000000000000000 ffff881f00000003 0000000000000282 ffff883f00000000
Call Trace:
[<ffffffff8163ac01>] ? _raw_spin_lock_irqsave+0x21/0x60
[<ffffffff81636fd7>] schedule+0x47/0x90
[<ffffffff81639f03>] schedule_timeout+0x113/0x1e0
[<ffffffff810ac580>] ? lock_timer_base+0x80/0x80
[<ffffffff816363d4>] io_schedule_timeout+0xa4/0x110
[<ffffffff8114aadf>] congestion_wait+0x7f/0x130
[<ffffffff810939e0>] ? woken_wake_function+0x20/0x20
[<ffffffffa0283bac>] kmem_alloc+0x8c/0x120 [xfs]
[<ffffffff81181751>] ? __kmalloc+0x121/0x250
[<ffffffffa0283c73>] kmem_realloc+0x33/0x80 [xfs]
[<ffffffffa02546cd>] xfs_iext_realloc_indirect+0x3d/0x60 [xfs]
[<ffffffffa02548cf>] xfs_iext_irec_new+0x3f/0xf0 [xfs]
[<ffffffffa0254c0d>] xfs_iext_add_indirect_multi+0x14d/0x210 [xfs]
[<ffffffffa02554b5>] xfs_iext_add+0xc5/0x230 [xfs]
[<ffffffff8112b5c5>] ? mempool_alloc_slab+0x15/0x20
[<ffffffffa0256269>] xfs_iext_insert+0x59/0x110 [xfs]
[<ffffffffa0230928>] ? xfs_bmap_add_extent_hole_delay+0xd8/0x740 [xfs]
[<ffffffffa0230928>] xfs_bmap_add_extent_hole_delay+0xd8/0x740 [xfs]
[<ffffffff8112b5c5>] ? mempool_alloc_slab+0x15/0x20
[<ffffffff8112b725>] ? mempool_alloc+0x65/0x180
[<ffffffffa02543d8>] ? xfs_iext_get_ext+0x38/0x70 [xfs]
[<ffffffffa0254e8d>] ? xfs_iext_bno_to_ext+0xed/0x150 [xfs]
[<ffffffffa02311b5>] xfs_bmapi_reserve_delalloc+0x225/0x250 [xfs]
[<ffffffffa023131e>] xfs_bmapi_delay+0x13e/0x290 [xfs]
[<ffffffffa02730ad>] xfs_iomap_write_delay+0x17d/0x300 [xfs]
[<ffffffffa022e434>] ? xfs_bmapi_read+0x114/0x330 [xfs]
[<ffffffffa025ddc5>] __xfs_get_blocks+0x585/0xa90 [xfs]
[<ffffffff81324b53>] ? __percpu_counter_add+0x63/0x80
[<ffffffff811374cd>] ? account_page_dirtied+0xed/0x1b0
[<ffffffff811cfc59>] ? alloc_buffer_head+0x49/0x60
[<ffffffff811d07c0>] ? alloc_page_buffers+0x60/0xb0
[<ffffffff811d13e5>] ? create_empty_buffers+0x45/0xc0
[<ffffffffa025e324>] xfs_get_blocks+0x14/0x20 [xfs]
[<ffffffff811d34e2>] __block_write_begin+0x1c2/0x580
[<ffffffffa025e310>] ? xfs_get_blocks_direct+0x20/0x20 [xfs]
[<ffffffffa025bbb1>] xfs_vm_write_begin+0x61/0xf0 [xfs]
[<ffffffff81127e50>] generic_perform_write+0xd0/0x1f0
[<ffffffffa026a341>] xfs_file_buffered_aio_write+0xe1/0x240 [xfs]
[<ffffffff812e16d2>] ? bt_clear_tag+0xb2/0xd0
[<ffffffffa026ab87>] xfs_file_write_iter+0x167/0x170 [xfs]
[<ffffffff81199d76>] vfs_iter_write+0x76/0xa0
[<ffffffffa03fb735>] lo_write_bvec+0x65/0x100 [loop]
[<ffffffffa03fd589>] loop_queue_work+0x689/0x924 [loop]
[<ffffffff8163ba52>] ? retint_kernel+0x10/0x10
[<ffffffff81074d71>] kthread_worker_fn+0x61/0x1c0
[<ffffffff81074d10>] ? flush_kthread_work+0x120/0x120
[<ffffffff81074d10>] ? flush_kthread_work+0x120/0x120
[<ffffffff810744d7>] kthread+0xd7/0xf0
[<ffffffff8107d22e>] ? schedule_tail+0x1e/0xd0
[<ffffffff81074400>] ? kthread_freezable_should_stop+0x80/0x80
[<ffffffff8163b2af>] ret_from_fork+0x3f/0x70
[<ffffffff81074400>] ? kthread_freezable_should_stop+0x80/0x80
Basically, on a very large sparse file, the in-memory array which holds the
extents describing the file can grow fairly big. In this particular case an
order-5 allocation for it could not be satisfied, as is evident from the
following message from XFS:
XFS: loop0(15174) possible memory allocation deadlock size 107168 in
kmem_alloc (mode:0x2400240)
This basically says "I cannot allocate 107 contiguous kB".
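(For the record: a 107168-byte kmalloc needs the next power-of-two size,
131072 bytes (2^17), and 131072 / 4096 = 32 pages = 2^5 pages, hence order 5.)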
And here is the discussion that ensued on the xfs mailing list:
http://oss.sgi.com/archives/xfs/2016-03/msg00447.html