From: Marinko Catovic <marinko.catovic@gmail.com>
To: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@suse.com>,
linux-mm@kvack.org, Christopher Lameter <cl@linux.com>
Subject: Re: Caching/buffers become useless after some time
Date: Tue, 30 Oct 2018 19:26:32 +0100
Message-ID: <CADF2uSry7SNQE0NPazAtra-4OELPonnWzzhbrBcqGRiVKWRg5Q@mail.gmail.com>
In-Reply-To: <98305976-612f-cf6d-1377-2f9f045710a9@suse.cz>
On Tue, 30 Oct 2018 at 18:03, Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 10/30/18 5:08 PM, Marinko Catovic wrote:
> >> One notable thing here is that there shouldn't be any reason to do
> >> direct reclaim when kswapd itself doesn't do anything. Either it is
> >> blocked on something (though I find it quite surprising to see it in
> >> that state for the whole 1500s period), or we are simply not low on
> >> free memory at all. That would point towards compaction-triggered
> >> memory reclaim, which accounts as direct reclaim as well. Direct
> >> compaction triggered more than once a second on average. We shouldn't
> >> really reclaim unless we are low on memory, but repeatedly failing
> >> compaction could just add up and reclaim a lot in the end. There seem
> >> to be quite a lot of low-order requests as per your trace buffer.
>
> I realized that the fact that the slabs grew so large might be very
> relevant. It means a lot of unmovable pages, and while they are slowly
> being freed, the remaining ones are scattered all over memory, making it
> impossible to compact successfully until the slabs are almost
> *completely* freed. It's in fact the theoretical worst-case scenario for
> compaction and fragmentation avoidance. Next time it would be nice to
> also gather /proc/pagetypeinfo and /proc/slabinfo to see what grew so
> much there (probably dentries and inodes).
How would you like the results? As a job collecting those every 5
seconds, from the 3 > drop_caches until the worst case occurs (which may
take up to 24 hours), or at some specific point in time?
Please note that I already provided them (see my earlier response) as a
one-time snapshot while in the worst-case state:
cat /proc/pagetypeinfo https://pastebin.com/W1sJscsZ
cat /proc/slabinfo https://pastebin.com/9ZPU3q7X
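If a periodic job is what you want, I could run a minimal collection
loop like the sketch below (output directory and interval are just
placeholders, to be adjusted):

  #!/bin/sh
  # collect pagetypeinfo/slabinfo snapshots until stopped
  OUT=/root/mm-logs            # hypothetical output directory
  mkdir -p "$OUT"
  while true; do
      ts=$(date +%Y%m%d-%H%M%S)
      cat /proc/pagetypeinfo > "$OUT/pagetypeinfo-$ts"
      cat /proc/slabinfo     > "$OUT/slabinfo-$ts"
      sleep 5                  # or a longer interval if 5s is too much data
  done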
> The question is why the problems happened some time after the
> unmovable pollution. The trace showed me that the structure of
> allocations wrt order+flags, as Michal breaks them down below, is not
> significantly different in the last phase than in the whole trace.
> Possibly the state of memory gradually changed so that the various
> heuristics (fragindex, pageblock skip bits etc.) resulted in compaction
> being tried more than initially, eventually hitting a very bad corner case.
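If it helps for the next run, I could also sample the fragmentation
index directly; a minimal sketch (assuming debugfs is mounted at
/sys/kernel/debug) would be:

  # per-zone, per-order fragmentation index and buddy allocator state
  cat /sys/kernel/debug/extfrag/extfrag_index
  cat /proc/buddyinfo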
>
> >> $ grep order trace-last-phase | sed 's@.*\(order=[0-9]*\).*gfp_flags=\(.*\)@\1 \2@' | sort | uniq -c
> >> 1238 order=1 __GFP_HIGH|__GFP_ATOMIC|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE
> >> 5812 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE
> >> 121 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_THISNODE
> >> 22 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE
> >> 395910 order=1 GFP_KERNEL_ACCOUNT|__GFP_ZERO
> >> 783055 order=1 GFP_NOWAIT|__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_ACCOUNT
> >> 1060 order=1 __GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_THISNODE
> >> 3278 order=2 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE
> >> 797255 order=2 GFP_NOWAIT|__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_ACCOUNT
> >> 93524 order=3 GFP_ATOMIC|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC
> >> 498148 order=3 GFP_NOWAIT|__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_ACCOUNT
> >> 243563 order=3 GFP_NOWAIT|__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP
> >> 10 order=4 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE
> >> 114 order=7 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE
> >> 67621 order=9 GFP_TRANSHUGE|__GFP_THISNODE
> >>
> >> We can safely rule out NOWAIT and ATOMIC because those do not reclaim.
> >> That leaves us with
> >> 5812 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE
> >> 121 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_THISNODE
> >> 22 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE
> >> 395910 order=1 GFP_KERNEL_ACCOUNT|__GFP_ZERO
>
> I suspect there are lots of short-lived processes, so these are probably
> rapidly recycled and not causing compaction.
Well yes, since this is shared hosting there are lots of users running
lots of scripts, perhaps 5-50 new forks and kills every second depending
on load; hard to tell exactly.
> It also seems to be pgd allocation (2 pages due to PTI), not kernel stack?
Plain English, please? :)
> >> 1060 order=1 __GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_THISNODE
> >> 3278 order=2 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE
> >> 10 order=4 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE
> >> 114 order=7 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE
> >> 67621 order=9 GFP_TRANSHUGE|__GFP_THISNODE
>
> I would again suspect those. IIRC we already confirmed earlier that the
> THP defrag setting is madvise or madvise+defer, and there are processes
> using madvise(MADV_HUGEPAGE)? Did you ever try changing defrag
> to plain 'defer'?
Yes, I think I mentioned this before. AFAIK it did not make an
(immediate) difference; madvise is the current setting.
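For reference, the setting can be checked and switched at runtime via
sysfs, e.g.:

  cat /sys/kernel/mm/transparent_hugepage/defrag
  # switch to plain 'defer' for testing
  echo defer > /sys/kernel/mm/transparent_hugepage/defrag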
> and there are madvise(MADV_HUGEPAGE) using processes?
I can't tell you that.
> >>
> >> By and large the kernel stack allocations are in the lead. You can get
> >> some relief by enabling CONFIG_VMAP_STACK. There is also a notable
> >> number of THP page allocations. Just curious: are you running on a NUMA
> >> machine? If yes, [1] might be relevant. Other than that nothing really
> >> jumped out at me.
>
>
> > thanks a lot Vlastimil!
>
> And Michal :)
>
> > I would not really know whether this is a NUMA machine; it is a regular
> > server running with an i7-8700
> > and ECC RAM. How would I find out?
>
> Please provide /proc/zoneinfo and we'll see.
There you go: cat /proc/zoneinfo https://pastebin.com/RMTwtXGr
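In case a quick check is useful in the meantime, counting the memory
nodes should already tell (just a sketch, using nothing beyond /sys and
coreutils):

  ls -d /sys/devices/system/node/node* | wc -l   # 1 means a single node, no NUMA
  # or, if the numactl package is installed:
  numactl --hardware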
> > So I should do CONFIG_VMAP_STACK=y and try that..?
>
> I suspect you already have it.
Yes, true; the currently running kernel has CONFIG_VMAP_STACK=y.
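For completeness, this is how it can be verified on the running kernel
(assuming the distro ships the config under /boot, or exposes it via
/proc/config.gz when CONFIG_IKCONFIG_PROC is enabled):

  grep VMAP_STACK /boot/config-$(uname -r)
  # or:
  zgrep VMAP_STACK /proc/config.gz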