Re: [PATCH RFC 0/3] mm: Reduce IO by improving algorithm of memcg pagecache pages eviction

From: Kirill Tkhai <ktkhai@virtuozzo.com>
To: Josef Bacik <josef@toxicpanda.com>
Cc: akpm@linux-foundation.org, hannes@cmpxchg.org, jack@suse.cz,
	hughd@google.com, darrick.wong@oracle.com, mhocko@suse.com,
	aryabinin@virtuozzo.com, guro@fb.com,
	mgorman@techsingularity.net, shakeelb@google.com,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH RFC 0/3] mm: Reduce IO by improving algorithm of memcg pagecache pages eviction
Date: Thu, 10 Jan 2019 13:06:29 +0300
Message-ID: <e8cf89d3-f71a-4d9d-3ea0-18157f7da722@virtuozzo.com>
In-Reply-To: <20190109163353.pxb574odzfwdbcfe@macbook-pro-91.dhcp.thefacebook.com>

On 09.01.2019 19:33, Josef Bacik wrote:
> On Wed, Jan 09, 2019 at 07:08:09PM +0300, Kirill Tkhai wrote:
>> Hi, Josef,
>>
>> On 09.01.2019 18:49, Josef Bacik wrote:
>>> On Wed, Jan 09, 2019 at 03:20:18PM +0300, Kirill Tkhai wrote:
>>>> On nodes without memory overcommit, it is common for a memcg to
>>>> exceed its limit and have its pagecache pages reclaimed while
>>>> the node still has a lot of free memory. Further access to those
>>>> pages requires real device IO, which causes delays, higher
>>>> power usage, worse throughput for other users of the device,
>>>> etc.
>>>>
>>>> Cleancache is not a good solution for this problem, since
>>>> it implies copying the page on every cleancache_put_page()
>>>> and cleancache_get_page(). It also requires introducing
>>>> internal per-cleancache_ops data structures to manage the
>>>> relationships between cached pages and their inodes, which
>>>> again adds overhead.
>>>>
>>>> This patchset takes a different approach. It introduces
>>>> a new scheme for evicting memcg pages:
>>>>
>>>>   1) On memcg reclaim, __remove_mapping() uncharges an
>>>>      unmapped page from its memcg and leaves the page in
>>>>      the pagecache;
>>>>
>>>>   2) putback_lru_page() places the page on the
>>>>      root_mem_cgroup list, since its memcg is now NULL.
>>>>      The page may be evicted on global reclaim (and this
>>>>      will be easy: the page is not mapped, so the shrinker
>>>>      will shrink it with 100% probability of success);
>>>>
>>>>   3) pagecache_get_page() charges the page to the memcg of
>>>>      the task that takes it first.
>>>>
>>>> Below is a small test that shows the benefit of the patchset.
>>>>
>>>> Create memcg with limit 20M (exact value does not matter much):
>>>>   $ mkdir /sys/fs/cgroup/memory/ct
>>>>   $ echo 20M > /sys/fs/cgroup/memory/ct/memory.limit_in_bytes
>>>>   $ echo $$ > /sys/fs/cgroup/memory/ct/tasks
>>>>
>>>> Then read a 1GB file twice:
>>>>   $ time cat file_1gb > /dev/null
>>>>
>>>> Before (2 iterations):
>>>>   1)0.01user 0.82system 0:11.16elapsed 7%CPU
>>>>   2)0.01user 0.91system 0:11.16elapsed 8%CPU
>>>>
>>>> After (2 iterations):
>>>>   1)0.01user 0.57system 0:11.31elapsed 5%CPU
>>>>   2)0.00user 0.28system 0:00.28elapsed 100%CPU
>>>>
>>>> With the patchset applied, the file pages are still cached
>>>> during the second read, so it completes about 40 times faster
>>>> (0:11.16 vs. 0:00.28 elapsed).
>>>>
>>>> This may be useful for slow disks, NFS, nodes without
>>>> memory overcommit, cases where two memcgs access the same
>>>> files, etc.
>>>>
>>>
>>> This isn't going to work for us (Facebook).  The whole reason the hard limit
>>> exists is to keep different groups from messing up other groups.  Page cache
>>> reclaim is not free, most of our pain and most of the reason we use cgroups
>>> is to limit the effect of flooding the machine with pagecache from different
>>> groups.
>>
>> I understand the problem.
>>
>>> Memory leaks happen few and far between, but chef doing a yum
>>> update in the system container happens regularly.  If you talk about suddenly
>>> orphaning these pages to the root container it still creates pressure on the
>>> main workload, pressure that results in it having to take time from what it's
>>> doing and free up memory instead.
>>
>> Could you please clarify what additional pressure the patchset introduces?
>> The number of actions needed to evict a pagecache page remains almost the
>> same: we just delay __delete_from_page_cache() until global reclaim. Global
>> reclaim should not introduce much pressure, since it iterates over a single
>> memcg (we should not have to dive into the hell of child memcgs, since root
>> memcg reclaim should succeed and free enough pages, shouldn't it?).
> 
> If we go into global reclaim at all.  If we're unable to allocate a page as the
> most important cgroup we start shrinking ourselves first right?  And then
> eventually end up in global reclaim, right?  So it may be easily enough
> reclaimed, but we're going to waste a lot of time getting there in the meantime,
> which means latency that's hard to pin down.
> 
> And secondly this allows hard limited cgroups to essentially leak pagecache into
> the whole system, creating waaaaaaay more memory pressure than what I think you
> intend.  Your logic is that we'll exceed our limit, evict some pagecache to the
> root cgroup, and we avoid an OOM and everything is ok.  However what will really
> happen is some user is going to do dd if=/dev/zero of=file and we'll just
> happily keep shoving these pages off into the root cg and suddenly we have 100gb
> of useless pagecache that we have to reclaim.  Yeah we just have to delete it
> from the root, but that's only once we get to that part, before that there's a
> bunch of latency inducing work that has to be done to get to deleting the pages.

Yeah, but what introduces the most latency in your setup? Do I understand
correctly that the hard limit in your setup lets all alloc_pages() calls go
through the get_page_from_freelist() path, so most allocations do not dive
into __alloc_pages_slowpath()?

>>
>> Also, what about implementing this as a static key option? What about linking
>> orphaned pagecache pages into a separate list that is easy to iterate?
> 
> Yeah if we have a way to short-circuit the normal reclaim path and just go to
> evicting these easily evicted pages then that would make it more palatable.  But
> I'd like to see testing to verify that this faster way really is faster and
> doesn't induce latency on other protected workloads.  We put hard limits on
> groups we don't care about, we want those things to die in a fire.  The excess
> IO from re-reading those pages is mitigated with io.latency, and eventually
> io.weight for proportional control, so really isn't an argument for keeping
> pages around.  Thanks,
> 
> Josef
>
