From: Andrey Korolyov <andrey@xdel.ru>
To: Vlastimil Babka <vbabka@suse.cz>
Cc: "ceph-users@lists.ceph.com" <ceph-users@lists.ceph.com>,
riel@redhat.com, Mark Nelson <mark.nelson@inktank.com>,
linux-mm@kvack.org, David Rientjes <rientjes@google.com>,
Joonsoo Kim <iamjoonsoo.kim@lge.com>,
Johannes Weiner <hannes@cmpxchg.org>
Subject: Re: isolate_freepages_block and excessive CPU usage by OSD process
Date: Sat, 15 Nov 2014 22:52:05 +0400
Message-ID: <CABYiri-bnDJgVVwrL8uE28zFvEU5TR8qrqe_gqsUH6u9yyRaWA@mail.gmail.com>
In-Reply-To: <54679F43.8040706@suse.cz>
On Sat, Nov 15, 2014 at 9:45 PM, Vlastimil Babka <vbabka@suse.cz> wrote:
> On 11/15/2014 06:10 PM, Andrey Korolyov wrote:
>> On Sat, Nov 15, 2014 at 7:32 PM, Vlastimil Babka <vbabka@suse.cz> wrote:
>>> On 11/15/2014 12:48 PM, Andrey Korolyov wrote:
>>>> Hello,
>>>>
>>>> I recently found that the OSD daemons, under certain conditions
>>>> (moderate vm pressure, moderate I/O, slightly altered vm settings), can
>>>> go into a loop involving isolate_freepages, effectively hurting Ceph
>>>> cluster performance. I found this thread
>>>
>>> Do you feel it is a regression, compared to some older kernel version or something?
>>
>> No, it's just a rare but very concerning issue. The higher the pressure
>> is, the greater the chance of hitting it, although the absolute
>> numbers are still very large (e.g. plenty of room for cache memory). Some
>> googling also turned up a similar question on serverfault:
>> http://serverfault.com/questions/642883/cause-of-page-fragmentation-on-large-server-with-xfs-20-disks-and-ceph
>> but there is no perf info there, unfortunately, so I cannot say whether
>> the issue is the same or not.
>
> Well, it would be useful to find out what's doing the high-order allocations.
> Determine the call stack with 'perf record -g -a' and then 'perf report -g'. Order and
> allocation flags can be captured by enabling the page_alloc tracepoint.
Thanks, please give me some time to go through the testing iterations, so
I'll collect the appropriate perf.data.
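For reference, here is roughly what I have in mind (a sketch; I'm assuming
the tracepoint you mean is kmem:mm_page_alloc):

  # system-wide profile with call graphs, then inspect the stacks
  perf record -g -a -- sleep 30
  perf report -g

  # capture the order and gfp_flags of each allocation via the tracepoint
  perf record -e kmem:mm_page_alloc -a -- sleep 30
  perf script | head -n 50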
>
>>>
>>>> https://lkml.org/lkml/2012/6/27/545, but it looks like the
>>>> significant decrease of bdi max_ratio did not help even a bit.
>>>> Although approximately half of physical memory is free for cache-like
>>>> usage, the problem with mm persists, so I would like to try
>>>> suggestions from other people. In the current testing iteration I
>>>> decreased vfs_cache_pressure to 10 and raised vm.dirty_ratio and the
>>>> background ratio to 15 and 10 respectively (because the default values
>>>> are too spiky for my workloads). The host kernel is a linux-stable
>>>> 3.10.
>>>
>>> Well, I'm glad to hear it's not 3.18-rc3 this time. But I would recommend trying
>>> it, or at least 3.17. A lot of patches have gone in since 3.10 to reduce
>>> compaction overhead (especially for transparent hugepages).
>>
>> Heh, I'd say I'm limited to tweaking knobs on 3.10, because it has
>> a well-known set of problems and any major version switch would lead to
>> months-long QA procedures, but I may try that if none of my knob
>> selections help. I am not a THP user; the problem is happening with
>> regular 4k pages and almost default VM settings. It's also worth mentioning
>
> OK, that's useful to know. So it might be some driver (do you also have
> mellanox?), or maybe SLUB (do you have it enabled?) is attempting high-order allocations.
Yes, I am using the mellanox transport there and the SLUB allocator, as SLAB
had some issues with allocations under uneven node fill-up on the
two-node systems which I am primarily using.
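To double-check the SLUB angle I'll also look at the per-cache allocation
orders; a sketch, assuming the standard SLUB sysfs layout:

  # page-allocation order backing a given cache
  cat /sys/kernel/slab/kmalloc-512/order

  # list caches backed by order-2 or higher pages (compaction candidates)
  grep -H . /sys/kernel/slab/*/order | awk -F: '$2 >= 2'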
>
>> that the kernel messages are not complaining about allocation failures, as
>> in the case in the URL above; compaction just tightens up to some limit
>
> Without the warnings we can't tell, which is why we need tracing/profiling
> to find out what's causing it.
>
>> and (after it has 'locked' the system for a couple of minutes, reducing actual
>> I/O and the derived amount of memory operations) it goes back to normal.
>> A cache flush fixes this in a moment, as should a large margin for
>
> That could perhaps suggest poor coordination between reclaim and compaction,
> made worse by the fact that there are more parallel ongoing attempts and the
> watermark checking doesn't take that into account.
>
>> min_free_kbytes. Over a couple of days, depending on which nodes (with
>> which settings) the issue reappears on, I should be able to judge whether
>> my ideas were wrong.
>>
>>>
>>>> Non-default VM settings are:
>>>> vm.swappiness = 5
>>>> vm.dirty_ratio=10
>>>> vm.dirty_background_ratio=5
>>>> bdi_max_ratio was 100%, right now 20%; at a glance it looks like the
>>>> situation has worsened, because an unstable OSD host causes a domino-like
>>>> effect on other hosts, which start to flap too, and only a cache flush
>>>> via drop_caches helps.
>>>>
>>>> Unfortunately there is no slab info from the "exhausted" state due to the
>>>> sporadic nature of this bug; I will try to catch it next time.
>>>>
>>>> slabtop (normal state):
>>>> Active / Total Objects (% used) : 8675843 / 8965833 (96.8%)
>>>> Active / Total Slabs (% used) : 224858 / 224858 (100.0%)
>>>> Active / Total Caches (% used) : 86 / 132 (65.2%)
>>>> Active / Total Size (% used) : 1152171.37K / 1253116.37K (91.9%)
>>>> Minimum / Average / Maximum Object : 0.01K / 0.14K / 15.75K
>>>>
>>>> OBJS ACTIVE USE OBJ SIZE SLABS OBJ/SLAB CACHE SIZE NAME
>>>> 6890130 6889185 99% 0.10K 176670 39 706680K buffer_head
>>>> 751232 721707 96% 0.06K 11738 64 46952K kmalloc-64
>>>> 251636 226228 89% 0.55K 8987 28 143792K radix_tree_node
>>>> 121696 45710 37% 0.25K 3803 32 30424K kmalloc-256
>>>> 113022 80618 71% 0.19K 2691 42 21528K dentry
>>>> 112672 35160 31% 0.50K 3521 32 56336K kmalloc-512
>>>> 73136 72800 99% 0.07K 1306 56 5224K Acpi-ParseExt
>>>> 61696 58644 95% 0.02K 241 256 964K kmalloc-16
>>>> 54348 36649 67% 0.38K 1294 42 20704K ip6_dst_cache
>>>> 53136 51787 97% 0.11K 1476 36 5904K sysfs_dir_cache
>>>> 51200 50724 99% 0.03K 400 128 1600K kmalloc-32
>>>> 49120 46105 93% 1.00K 1535 32 49120K xfs_inode
>>>> 30702 30702 100% 0.04K 301 102 1204K Acpi-Namespace
>>>> 28224 25742 91% 0.12K 882 32 3528K kmalloc-128
>>>> 28028 22691 80% 0.18K 637 44 5096K vm_area_struct
>>>> 28008 28008 100% 0.22K 778 36 6224K xfs_ili
>>>> 18944 18944 100% 0.01K 37 512 148K kmalloc-8
>>>> 16576 15154 91% 0.06K 259 64 1036K anon_vma
>>>> 16475 14200 86% 0.16K 659 25 2636K sigqueue
>>>>
>>>> zoneinfo (normal state, attached)
>>>>
>>>
>>
>
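P.S. For completeness, the commands behind the settings mentioned above,
plus the watch I'll use to catch the slab state at the "exhausted" moment
(a sketch; the bdi id 8:0 below is just an example for one of our data
disks):

  # runtime VM tuning for the current iteration
  sysctl -w vm.swappiness=5
  sysctl -w vm.vfs_cache_pressure=10
  sysctl -w vm.dirty_ratio=15
  sysctl -w vm.dirty_background_ratio=10

  # per-device writeback cap (was 100, now 20)
  echo 20 > /sys/class/bdi/8:0/max_ratio

  # the cache flush that momentarily restores performance
  sync && echo 3 > /proc/sys/vm/drop_caches

  # periodic slab snapshot so the exhausted state gets captured
  while sleep 30; do
      date >> /var/log/slab-watch.log
      slabtop -o -s c | head -n 25 >> /var/log/slab-watch.log
  done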