From: Andrew Morton <akpm@linux-foundation.org>
To: "Peter Schüller" <scode@spotify.com>
Cc: linux-kernel@vger.kernel.org,
	Mattias de Zalenski <zalenski@spotify.com>,
	linux-mm@kvack.org
Subject: Re: Sudden and massive page cache eviction
Date: Mon, 22 Nov 2010 16:11:58 -0800
Message-ID: <20101122161158.02699d10.akpm@linux-foundation.org>
In-Reply-To: <AANLkTikg-sR97tkG=ST9kjZcHe6puYSvMGh-eA3cnH7X@mail.gmail.com>

(cc linux-mm)

On Fri, 12 Nov 2010 17:20:21 +0100
Peter Schüller <scode@spotify.com> wrote:

> Hello,
> 
> We have been seeing sudden and repeated evictions of huge amounts of
> page cache on some of our servers for reasons that we cannot explain.
> We are hoping that someone familiar with the vm subsystem may be able
> to shed some light on the issue and perhaps confirm whether it is
> plausibly a kernel bug or not. I will try to present the information
> most-important-first, but this post will unavoidably be a bit long -
> sorry.
> 
> First, here is a good example of the symptom (more graphs later on):
> 
>    http://files.spotify.com/memcut/b_daily_allcut.png
> 
> After looking into this we have seen similar incidents on servers
> running completely different software, but in this particular case
> the machine is running a service which is heavily dependent on the
> buffer cache to deal with incoming request load. The direct effect of
> this is that we end up in complete I/O saturation (average queue
> depth goes to 150-250 and stays there indefinitely, or until we
> actively intervene, e.g. by warming up the caches). Our interpretation is
> that the eviction is not the result of something along the lines of a
> large file being removed; given the effects on I/O load it is clear
> that the data being evicted is in fact part of the active set used by
> the service running on the machine.
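> 
> (As an aside, a drop like this is easy to catch programmatically. A
> minimal sketch in Python, polling the "Cached:" line of
> /proc/meminfo; the interval and threshold are arbitrary:)
> 
>     import time
> 
>     def cached_kb():
>         # "Cached:" in /proc/meminfo is the page cache size, in kB.
>         with open("/proc/meminfo") as f:
>             for line in f:
>                 if line.startswith("Cached:"):
>                     return int(line.split()[1])
> 
>     prev = cached_kb()
>     while True:
>         time.sleep(60)
>         cur = cached_kb()
>         if prev - cur > 1024 * 1024:  # more than ~1 GB gone in a minute
>             print("%s: page cache fell %d kB (%d -> %d)"
>                   % (time.ctime(), prev - cur, prev, cur))
>         prev = cur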
> 
> The I/O load on these systems comes mainly from three things:
> 
>   (1) Seek-bound I/O generated by lookups in a BDB (b-tree traversal).
>   (2) Seek-bound I/O generated by traversal of prefix directory trees
> (i.e., 00/01/0001334234...., a poor man's b-tree on top of ext3).
>   (3) Seek-bound I/O reading small segments of small-to-medium sized
> files contained in the prefix tree.
> 
> The prefix tree consists of 8*2^16 directory entries in total, with
> individual files being in the tens of millions per server.
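> 
> (For concreteness, a minimal sketch of that layout; the root path and
> name format here are made up:)
> 
>     import os
> 
>     def prefix_path(root, name):
>         # Two single-byte hex directory levels taken from the start
>         # of the file name, then the full name as the leaf.
>         return os.path.join(root, name[0:2], name[2:4], name)
> 
>     # prefix_path("/srv/store", "0001334234deadbeef")
>     #   -> "/srv/store/00/01/0001334234deadbeef"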
> 
> We initially ran 2.6.32-bpo.5-amd64 (Debian backports kernel) and have
> subsequently upgraded some of the machines to 2.6.36-rc6-amd64 (Debian
> experimental repo). While the new kernel initially looked like it was
> behaving better, it slowly reverted to not making a difference (maybe
> as a function of uptime, but we have not had the opportunity to test
> this by rebooting some of them, so it is an untested hypothesis).
> 
> Most of the activity on this system (ignoring the usual stuff like
> ssh/cron/syslog/etc) is coming from Python processes that consume
> non-trivial amounts of heap space, plus the disk activity and some
> POSIX shared memory caching utilized by the BDB library.
> 
> We have correlated the incidence of these page evictions with higher
> loads on the system; i.e., it tends to happen during high-load
> periods, and in addition we tend to see additional machines having
> problems as a result of us "fixing" a machine that experienced an
> eviction (we have some limited cascading effects that cause slightly
> higher load on other servers in the cluster when we do that).
> 
> We believe the most plausible way for an application bug to trigger
> this behavior would require that (1) the application allocates the
> memory, and (2) actually touches the pages (a sketch of that pattern
> follows the list below). We believe this to be
> unlikely in this case because:
> 
>   (1) We see similar sudden evictions on various other servers, which
> we noticed when we started looking for them.
>   (2) The fact that it tends to trigger correlated with load suggests
> that it is not a functional bug in the service as such, since higher
> load is in this case unlikely to trigger any paths that do anything
> unique with respect to memory allocation - in particular because the
> domain logic is all Python, and none of it really deals with data
> chunks.
>   (3) If we did manage to allocate something in the Python heap, we
> would have to be "lucky" (or unlucky) for Python to then consistently
> be able to munmap()/brk() down afterwards.
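> 
> (To spell out the pattern meant by (1) and (2) above, a hypothetical
> buggy allocation would have to look roughly like this; the size is
> arbitrary:)
> 
>     import mmap
> 
>     size = 8 << 30                    # 8 GB, arbitrary
>     buf = mmap.mmap(-1, size)         # (1) allocate: an anonymous map
>                                       #     creates no memory pressure yet
>     for off in range(0, size, 4096):  # (2) touch one byte per 4 kB page;
>         buf[off] = 0                  #     each write faults in a real
>                                       #     page, squeezing the page cache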
> 
> Some additional "sample" graphs showing a few incidents of the problem:
> 
>    http://files.spotify.com/memcut/a_daily.png
>    http://files.spotify.com/memcut/a_weekly.png
>    http://files.spotify.com/memcut/b_daily_allcut.png
>    http://files.spotify.com/memcut/c_monthly.png
>    http://files.spotify.com/memcut/c_yearly.png
>    http://files.spotify.com/memcut/d_monthly.png
>    http://files.spotify.com/memcut/d_yearly.png
>    http://files.spotify.com/memcut/a_monthly.png
>    http://files.spotify.com/memcut/a_yearly.png
>    http://files.spotify.com/memcut/c_daily.png
>    http://files.spotify.com/memcut/c_weekly.png
>    http://files.spotify.com/memcut/d_daily.png
>    http://files.spotify.com/memcut/d_weekly.png
> 
> And here is an example from a server only running PostgreSQL (where
> a sudden drop of gigabytes of page cache is unlikely, because we are
> not DROPping tables, nor do we have multi-gigabyte WAL archive sizes,
> nor do we have a use case which would imply ftruncate() on table
> files):
> 
>    http://files.spotify.com/memcut/postgresql_weekly.png
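> 
> (Deletion, truncation and explicit discard are, as far as we know,
> the legitimate ways gigabytes of cache vanish together with a file.
> For illustration, with a made-up path:)
> 
>     import os
> 
>     # os.unlink(path) or os.truncate(path, 0) free the cached pages
>     # along with the bytes backing them; an application can also
>     # discard a file's pages explicitly without touching the data:
>     fd = os.open("/var/lib/postgresql/base/16384/1234", os.O_RDONLY)
>     os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
>     os.close(fd)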
> 
> As you can see it is not as significant there, but it seems, at
> least visually, to be the same "type" of effect. We have seen similar
> behavior on various machines, although depending on the service
> running, it may or may not be explainable by regular file removal.
> 
> Further, we have observed the kernel's unwillingness to retain data in
> page cache under interesting circumstances:
> 
> (1) page cache eviction happens
> (2) we warm up our BDB files by cat:ing them (simple but effective)
> (3) within a matter of minutes, while there are still several GB of
> free memory (truly free, not page cache), these files are evicted (as
> evidenced by re-cat:ing them a little while later; see the warm-up
> sketch below)
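> 
> (The warm-up in (2) is just the moral equivalent of "cat file >
> /dev/null". A sketch, with a made-up path; if the second run minutes
> later is disk-slow again, the pages were evicted in between:)
> 
>     import time
> 
>     def warm(path, bufsize=1 << 20):
>         # Sequentially read the whole file, faulting it into the page
>         # cache; returns elapsed seconds, so a cold file reads at disk
>         # speed and a warm one at memory speed.
>         t0 = time.time()
>         with open(path, "rb") as f:
>             while f.read(bufsize):
>                 pass
>         return time.time() - t0
> 
>     # warm("/srv/bdb/index.db")   -> slow (cold)
>     # ...minutes later, with GBs still free...
>     # warm("/srv/bdb/index.db")   -> slow again => evicted meanwhile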
> 
> We understand this latest observation may be due to NUMA-related
> allocation issues, and we should probably try to use numactl to ask
> for a more even allocation. We have not yet tried this. However, it
> is not clear how any such issues would cause sudden eviction of data
> already *in* the page cache (on whichever node).
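> 
> (A quick way to check the per-node picture is to read the node
> meminfo files under /sys; a sketch:)
> 
>     import glob
> 
>     # Print MemFree vs FilePages per NUMA node; one node sitting full
>     # while another has gigabytes free would support the NUMA theory.
>     for path in sorted(glob.glob("/sys/devices/system/node/node*/meminfo")):
>         with open(path) as f:
>             for line in f:
>                 if "MemFree" in line or "FilePages" in line:
>                     print(line.strip())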
> 


