From: Andrew Morton <akpm@linux-foundation.org>
To: "Peter Schüller" <scode@spotify.com>
Cc: linux-kernel@vger.kernel.org,
Mattias de Zalenski <zalenski@spotify.com>,
linux-mm@kvack.org
Subject: Re: Sudden and massive page cache eviction
Date: Mon, 22 Nov 2010 16:11:58 -0800 [thread overview]
Message-ID: <20101122161158.02699d10.akpm@linux-foundation.org> (raw)
In-Reply-To: <AANLkTikg-sR97tkG=ST9kjZcHe6puYSvMGh-eA3cnH7X@mail.gmail.com>
(cc linux-mm)
On Fri, 12 Nov 2010 17:20:21 +0100
Peter Schüller <scode@spotify.com> wrote:
> Hello,
>
> We have been seeing sudden and repeated evictions of huge amounts of
> page cache on some of our servers for reasons that we cannot explain.
> We are hoping that someone familiar with the vm subsystem may be able
> to shed some light on the issue and perhaps confirm whether it is
> plausibly a kernel bug or not. I will try to present the information
> most-important-first, but this post will unavoidably be a bit long -
> sorry.
>
> First, here is a good example of the symptom (more graphs later on):
>
> http://files.spotify.com/memcut/b_daily_allcut.png
>
> After looking into this we have seen similar incidents on servers
> running completely different software; but in this particular case
> this machine is running a service which is heavily dependent on the
> buffer cache to deal with incoming request load. The direct effect of
> these evictions is that we end up in complete I/O saturation (average
> queue depth goes to 150-250 and stays there indefinitely, or until we
> actively tweak it (warm up caches, etc.)). Our interpretation of that is
> that the eviction is not the result of something along the lines of a
> large file being removed; given the effects on I/O load it is clear
> that the data being evicted is in fact part of the active set used by
> the service running on the machine.
>
> The I/O load on these systems comes mainly from three things:
>
> (1) Seek-bound I/O generated by lookups in a BDB (b-tree traversal).
> (2) Seek-bound I/O generated by traversal of prefix directory trees
> (i.e., 00/01/0001334234...., a poor man's b-tree on top of ext3).
> (3) Seek-bound I/O reading small segments of small-to-medium sized
> files contained in the prefix tree.
>
> The prefix tree consists of 8*2^16 directory entries in total, with
> individual files being in the tens of millions per server.
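A minimal sketch of such a prefix-directory layout (a hypothetical reconstruction for illustration; the actual naming scheme and helper names are not shown in the thread):

```python
import hashlib

def prefix_path(root, key):
    """Map a key to a two-level prefix directory, e.g. root/00/01/0001c0...

    Hypothetical sketch of the "poor man's b-tree" layout described
    above: the first two bytes of a content hash select the directory,
    spreading tens of millions of files over 2^16 leaf directories so
    each one stays small.
    """
    digest = hashlib.sha1(key.encode()).hexdigest()
    return "/".join([root, digest[:2], digest[2:4], digest])

print(prefix_path("/srv/store", "track-1234"))
```

Looking up a file then costs one hash plus a short directory traversal, which is what makes the workload seek-bound on cold caches.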
>
> We initially ran 2.6.32-bpo.5-amd64 (Debian backports kernel) and have
> subsequently upgraded some of them to 2.6.36-rc6-amd64 (Debian
> experimental repo). While it initially looked like it was behaving
> better, it slowly reverted to not making a difference (maybe as a
> function of uptime, but we have not had the opportunity to test this
> by rebooting some of them, so it is an untested hypothesis).
>
> Most of the activity on this system (ignoring the usual stuff like
> ssh/cron/syslog/etc) is coming from Python processes that consume
> non-trivial amounts of heap space, plus the disk activity and some
> POSIX shared memory caching utilized by the BDB library.
>
> We have correlated the incidence of these page evictions with higher
> load on the system; i.e., it tends to happen under high-load periods,
> and in addition we tend to see additional machines having problems as
> a result of us "fixing" a machine that experienced an eviction (we
> have some limited cascading effects that cause slightly higher load
> on other servers in the cluster when we do that).
>
> We believe the most plausible way an application bug could trigger
> this behavior would require that (1) the application allocates the
> memory, and (2) actually touches the pages. We believe this to be
> unlikely in this case because:
>
> (1) We see similar sudden evictions on various other servers, which
> we noticed when we started looking for them.
> (2) The fact that it tends to trigger in correlation with load
> suggests that it is not a functional bug in the service as such, since
> higher load is in this case unlikely to trigger any paths that do
> anything unique with respect to memory allocation. In particular, the
> domain logic is all Python, and none of it really deals with data
> chunks.
> (3) If we did manage to allocate something in the Python heap, we
> would have to be "lucky" (or unlucky) if Python were consistently able
> to munmap()/brk() down afterwards.
>
> Some additional "sample" graphs showing a few incidents of the problem:
>
> http://files.spotify.com/memcut/a_daily.png
> http://files.spotify.com/memcut/a_weekly.png
> http://files.spotify.com/memcut/b_daily_allcut.png
> http://files.spotify.com/memcut/c_monthly.png
> http://files.spotify.com/memcut/c_yearly.png
> http://files.spotify.com/memcut/d_monthly.png
> http://files.spotify.com/memcut/d_yearly.png
> http://files.spotify.com/memcut/a_monthly.png
> http://files.spotify.com/memcut/a_yearly.png
> http://files.spotify.com/memcut/c_daily.png
> http://files.spotify.com/memcut/c_weekly.png
> http://files.spotify.com/memcut/d_daily.png
> http://files.spotify.com/memcut/d_weekly.png
>
> And here is an example from a server only running PostgreSQL (where
> the sudden drop of gigabytes of page cache is unlikely because we are
> not DROP:ing tables, nor do we have multi-gigabyte WAL archive sizes,
> nor do we have a use-case which will imply ftruncate() on table
> files):
>
> http://files.spotify.com/memcut/postgresql_weekly.png
>
> As you can see it's not as significant there, but it seems, at
> least visually, to be the same "type" of effect. We've seen similar
> behavior on various machines, although depending on the service
> running, it may or may not be explainable by regular file removal.
>
> Further, we have observed the kernel's unwillingness to retain data in
> page cache under interesting circumstances:
>
> (1) page cache eviction happens
> (2) we warm up our BDB files by cat:ing them (simple but effective)
> (3) within a matter of minutes, while there are still several GB of
> free memory (truly free, not page cache), these files are evicted (as
> evidenced by re-cat:ing them a little while later)
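The warm-up in step (2) can also be done programmatically; a minimal Python sketch (hypothetical helper names, not from the thread) that mirrors cat:ing a file into the page cache, plus an optional readahead hint:

```python
import os

def warm(path, chunk=1 << 20):
    """Sequentially read a file so the kernel pulls it into the page
    cache; equivalent to `cat path > /dev/null`. Returns bytes read."""
    total = 0
    with open(path, "rb") as f:
        while True:
            buf = f.read(chunk)
            if not buf:
                break
            total += len(buf)
    return total

def willneed(path):
    """Ask the kernel to read the file ahead asynchronously
    (Linux/Unix only; Python 3.3+ exposes posix_fadvise)."""
    with open(path, "rb") as f:
        os.posix_fadvise(f.fileno(), 0, os.path.getsize(path),
                         os.POSIX_FADV_WILLNEED)
```

Note that neither call pins the pages: as observed above, the kernel remains free to evict them again, so re-reading later is the only way to confirm residency from userspace without mincore().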
>
> This latest observation, we understand, may be due to NUMA-related
> allocation issues, and we should probably try using numactl to ask
> for a more even allocation. We have not yet tried this. However, it is
> not clear how any issue of that kind would cause the sudden eviction
> of data already *in* the page cache (on whichever node).
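One way to check whether the "free" memory is concentrated on a single NUMA node (the imbalance hypothesized above) is to read the per-node meminfo files from sysfs; a Linux-specific sketch, assuming the standard sysfs layout:

```python
import glob
import re

def node_free_kb():
    """Return {numa_node: MemFree_in_kB} parsed from
    /sys/devices/system/node/node*/meminfo (Linux-specific).

    Lines there look like: "Node 0 MemFree:  1234567 kB".
    A large skew between nodes would support the NUMA-imbalance
    hypothesis; it would not, by itself, explain eviction of data
    already resident in the page cache.
    """
    free = {}
    for path in glob.glob("/sys/devices/system/node/node*/meminfo"):
        node = int(re.search(r"node(\d+)", path).group(1))
        with open(path) as f:
            for line in f:
                if "MemFree" in line:
                    free[node] = int(line.split()[-2])
    return free

print(node_free_kb())
```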
>
Thread overview: 15+ messages
[not found] <AANLkTikg-sR97tkG=ST9kjZcHe6puYSvMGh-eA3cnH7X@mail.gmail.com>
2010-11-23 0:11 ` Andrew Morton [this message]
2010-11-23 8:38 ` Dave Hansen
2010-11-23 9:44 ` Peter Schüller
2010-11-23 16:19 ` Dave Hansen
2010-11-24 14:02 ` Peter Schüller
2010-11-24 14:14 ` Peter Schüller
2010-11-24 14:20 ` Pekka Enberg
2010-11-24 15:32 ` Peter Schüller
2010-11-24 17:46 ` Pekka Enberg
2010-11-25 1:18 ` Simon Kirby
2010-11-25 15:59 ` Peter Schüller
2010-12-01 6:36 ` Simon Kirby
2010-11-24 17:32 ` Dave Hansen
2010-11-25 15:33 ` Peter Schüller
2010-12-01 9:15 ` Simon Kirby