From: Vladimir Davydov <vdavydov@parallels.com>
To: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrew Morton <akpm@linux-foundation.org>,
Michal Hocko <mhocko@kernel.org>,
Minchan Kim <minchan@kernel.org>, Rik van Riel <riel@redhat.com>,
Mel Gorman <mgorman@suse.de>,
linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 2/3] mm: make workingset detection logic memcg aware
Date: Tue, 4 Aug 2015 11:13:29 +0300
Message-ID: <20150804081329.GB11971@esperanza>
In-Reply-To: <20150803205532.GA19478@cmpxchg.org>
On Mon, Aug 03, 2015 at 04:55:32PM -0400, Johannes Weiner wrote:
> On Mon, Aug 03, 2015 at 04:52:29PM +0300, Vladimir Davydov wrote:
> > On Mon, Aug 03, 2015 at 09:23:58AM -0400, Johannes Weiner wrote:
> > > On Mon, Aug 03, 2015 at 03:04:22PM +0300, Vladimir Davydov wrote:
> > > > @@ -179,8 +180,9 @@ static void unpack_shadow(void *shadow,
> > > > eviction = entry;
> > > >
> > > > *zone = NODE_DATA(nid)->node_zones + zid;
> > > > + *lruvec = mem_cgroup_page_lruvec(page, *zone);
> > > >
> > > > - refault = atomic_long_read(&(*zone)->inactive_age);
> > > > + refault = atomic_long_read(&(*lruvec)->inactive_age);
> > > > mask = ~0UL >> (NODES_SHIFT + ZONES_SHIFT +
> > > > RADIX_TREE_EXCEPTIONAL_SHIFT);
> > > > /*
> > >
> > > You can not compare an eviction shadow entry from one lruvec with the
> > > inactive age of another lruvec. The inactive ages are not related and
> > > might differ significantly: memcgs are created ad hoc, memory hotplug,
> > > page allocator fairness drift. In those cases the result will be pure
> > > noise.
> >
> > That's true. If a page is evicted in one cgroup and then refaulted in
> > another, the activation will be random. However, how often is a page
> > that was used by and evicted from one cgroup refaulted in another? If
> > there is no active file sharing (how common is that?), this should only
> > happen to code pages, but those will most likely end up in the cgroup
> > with the greatest limit, so they shouldn't be evicted and refaulted
> > frequently. So the question is: can we tolerate some noise here?
>
> It's not just the memcg, it's also the difference between zones
> themselves.
But this patch does take the difference between zones into account -
zone and node ids are still stored in the shadow entry; only the memcg
id is neglected. So if a page is refaulted in another zone within the
same cgroup, its refault distance will still be calculated correctly.
We only get noise when a page evicted from one cgroup is refaulted in
another.
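
For the record, the packing side (which this patch doesn't change)
looks roughly like this - a simplified sketch of the current
workingset code, not a verbatim copy:

static void *pack_shadow(unsigned long eviction, struct zone *zone)
{
	/* node and zone ids travel with the eviction timestamp */
	eviction = (eviction << NODES_SHIFT) | zone_to_nid(zone);
	eviction = (eviction << ZONES_SHIFT) | zone_idx(zone);
	eviction = (eviction << RADIX_TREE_EXCEPTIONAL_SHIFT);

	return (void *)(eviction | RADIX_TREE_EXCEPTIONAL_ENTRY);
}

unpack_shadow() just reverses this, so node and zone always round-trip;
it's only the memcg that has to be guessed from the refaulting page.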
>
> > > As much as I would like to see a simpler way, I am pessimistic that
> > > there is a way around storing memcg ids in the shadow entries.
> >
> > On 32 bit there is too little space for storing a memcg id. We could
> > shift the distance so that it would fit and still carry something
> > meaningful, but that would take much more code, so I'm trying the
> > simplest way first.
>
> It should be easy to trim quite a few bits from the timestamp, both in
> terms of available memory as well as in terms of distance granularity.
> We probably don't care if the refault distance is only accurate to,
> say, 2MB, and how many pages do we have to represent on 32-bit in the
> first place? Once we trim that, we should be able to fit a CSS ID.
NODES_SHIFT <= 10, ZONES_SHIFT == 2, RADIX_TREE_EXCEPTIONAL_SHIFT == 2,
and we need 16 bits for storing the memcg id, so there are only 2 bits
left for the timestamp. Even with 2MB accuracy, that gives us a maximal
refault distance of just 6MB :-(
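
Spelling out the arithmetic (taking 2MB accuracy to mean dropping the
low 9 bits of a distance counted in 4K pages):

	32 bits per entry
	- 2 (RADIX_TREE_EXCEPTIONAL_SHIFT)
	-10 (NODES_SHIFT, worst case)
	- 2 (ZONES_SHIFT)
	---
	18 bits for the eviction counter
	-16 (memcg id)
	---
	 2 bits => (2^2 - 1) * 2MB = 6MB max refault distance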
However, I doubt there is a 32 bit host with 1024 NUMA nodes. Can we
cap this config option (CONFIG_NODES_SHIFT) on 32 bit architectures?
Or maybe we could limit the number of cgroups to, say, 1024 when
running on 32 bit? This would win us 6 more bits, so that the maximal
refault distance would be 512MB at 2MB accuracy. But can we be sure
this won't break anyone's setup, especially given that cgroups can
hang around as zombies for a while after rmdir?
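
To make it concrete, the packing could then look something like the
sketch below - MEM_CGROUP_ID_SHIFT, the 1024-cgroup cap and the
bucketing are made-up names/values for illustration, not a worked-out
proposal:

#define MEM_CGROUP_ID_SHIFT	10	/* hypothetical: at most 1024 cgroups */
#define EVICTION_BUCKET_ORDER	9	/* hypothetical: 512 pages = 2MB buckets */

static void *pack_shadow(unsigned long eviction, int memcgid,
			 struct zone *zone)
{
	eviction >>= EVICTION_BUCKET_ORDER;	/* count distance in 2MB units */
	eviction = (eviction << MEM_CGROUP_ID_SHIFT) | memcgid;
	eviction = (eviction << NODES_SHIFT) | zone_to_nid(zone);
	eviction = (eviction << ZONES_SHIFT) | zone_idx(zone);
	eviction = (eviction << RADIX_TREE_EXCEPTIONAL_SHIFT);

	return (void *)(eviction | RADIX_TREE_EXCEPTIONAL_ENTRY);
}

That leaves 32 - 2 - 2 - 10 - 10 = 8 bits of bucketed timestamp,
i.e. the 512MB window mentioned above.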
Thanks,
Vladimir