From: Andrew Morton <akpm@linux-foundation.org>
To: Michal Hocko <mhocko@kernel.org>
Cc: Tim Chen <tim.c.chen@linux.intel.com>,
Dave Hansen <dave.hansen@intel.com>,
Honglei Wang <honglei.wang@oracle.com>,
Johannes Weiner <hannes@cmpxchg.org>,
linux-mm@kvack.org
Subject: Re: memcgroup lruvec_lru_size scaling issue
Date: Mon, 14 Oct 2019 15:14:30 -0700 [thread overview]
Message-ID: <20191014151430.fc419425e515188b904cd8af@linux-foundation.org> (raw)
In-Reply-To: <20191014183107.GO317@dhcp22.suse.cz>
On Mon, 14 Oct 2019 20:31:07 +0200 Michal Hocko <mhocko@kernel.org> wrote:
> Please put all that to the changelog. It would be also great to see
> whether that really scales with the number of cgroups if that is easy to
> check with your benchmark.
I restored this, with some changelog additions.
Tim, please send along any updates you'd like to make to this, along
with tested-by, acked-by, etc.
I added the Fixes: tag. Do we think this is serious enough to warrant
backporting into -stable trees? I suspect so, as any older-tree
maintainer who spies this change will say "hell yeah".
From: Honglei Wang <honglei.wang@oracle.com>
Subject: mm: memcg: get number of pages on the LRU list in memcgroup base on lru_zone_size
lruvec_lru_size() is invoking lruvec_page_state_local() to get the
lru_size. It's based on lruvec_stat_local.count[] of mem_cgroup_per_node.
This counter is updated in a batched way. Pages won't be charged to it
until the number of incoming pages exceeds MEMCG_CHARGE_BATCH, which is
defined as 32.
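To illustrate, here is a minimal userspace sketch of this batching
behavior (simplified names and structure, not the kernel's actual
implementation): updates accumulate in a pending batch and are only
folded into the shared counter once the batch threshold is exceeded, so
readers of the shared counter miss anything still pending.

```c
#include <assert.h>

#define MEMCG_CHARGE_BATCH 32

/* Hypothetical simplified model of a batched stat counter. */
struct batched_counter {
	long shared;	/* lruvec_stat_local.count[]-style shared value */
	long pending;	/* accumulated delta not yet flushed */
};

static void counter_add(struct batched_counter *c, long delta)
{
	c->pending += delta;
	/* Flush only once the pending delta exceeds the batch size. */
	if (c->pending > MEMCG_CHARGE_BATCH || c->pending < -MEMCG_CHARGE_BATCH) {
		c->shared += c->pending;
		c->pending = 0;
	}
}

/* A reader that looks only at the shared value - the madvise09
 * situation: exactly 32 pages never become visible here. */
static long counter_read_shared(const struct batched_counter *c)
{
	return c->shared;
}
```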
The testcase in LTP madvise09[1] fails because small blocks of memory are
not charged. It creates a new memcg and sets up 32 MADV_FREE pages.
Then it forks a child which introduces memory pressure in the memcg.
The MADV_FREE pages are expected to be released under the pressure, but
32 is not more than MEMCG_CHARGE_BATCH, so these pages won't be charged
to lruvec_stat_local.count[] until more pages come in to satisfy the
batch threshold. Hence these MADV_FREE pages can't be freed under
memory pressure, which conflicts with the semantics of MADV_FREE.
Getting the lru_size based on lru_zone_size of mem_cgroup_per_node, which
is not updated via batching, makes it accurate in this scenario.
This is effectively a partial reversion of 1a61ab8038e72 ("mm: memcontrol:
replace zone summing with lruvec_page_state()").
[1] https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/syscalls/madvise/madvise09.c
Tim said:
: We were running a database benchmark in a mem cgroup and found that
: lruvec_lru_size() is taking up a huge chunk of CPU cycles (about 25% of
: our kernel time - about 7% of total cpu cycles) on the 5.3 kernel.
:
: The main issue is the loop in lruvec_page_state_local() called by
: lruvec_lru_size() in the mem cgroup path:
:
:	for_each_possible_cpu(cpu)
:		x += per_cpu(pn->lruvec_stat_local->count[idx], cpu);
:
: It is costly looping through all the cpus to get the lruvec size
: info. And doing this on our workload with 96 cpu threads and 500 mem
: cgroups makes things much worse. We might end up running through 96
: cpus * 500 cgroups * 2 (main) LRUs worth of per-cpu counters, which is
: a lot of data structures to be walking all the time.
:
: Honglei's patch restores the previous method for computing lru_size and
: is much more efficient in getting the lru_size. We got a 20% throughput
: improvement in our database benchmark with Honglei's patch, and
: lruvec_lru_size()'s cpu overhead completely disappeared from the cpu
: profile.
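For comparison, the two summation paths can be sketched like this
(simplified userspace sketch; the NR_CPUS and MAX_NR_ZONES values are
assumptions matching the report, and the function names are hypothetical):
the old path sums a per-cpu counter over every possible CPU, while the
patched path sums the already-aggregated per-zone LRU sizes.

```c
#include <assert.h>

#define NR_CPUS      96	/* cpu threads on the reported machine */
#define MAX_NR_ZONES 5	/* typical zone count; configuration-dependent */

/* Old path: walk every possible CPU's counter (what
 * lruvec_page_state_local() effectively does). */
static long sum_per_cpu(const long counts[NR_CPUS])
{
	long x = 0;
	int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		x += counts[cpu];
	return x;
}

/* New path: sum per-zone LRU sizes instead - MAX_NR_ZONES iterations
 * per call rather than NR_CPUS, independent of the cpu count. */
static long sum_per_zone(const long zone_sizes[MAX_NR_ZONES])
{
	long x = 0;
	int zid;

	for (zid = 0; zid < MAX_NR_ZONES; zid++)
		x += zone_sizes[zid];
	return x;
}
```

Both return the same total; the difference is that the per-cpu walk is
repeated for every lruvec_lru_size() call across every cgroup.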
Link: http://lkml.kernel.org/r/20190905071034.16822-1-honglei.wang@oracle.com
Fixes: 1a61ab8038e72 ("mm: memcontrol: replace zone summing with lruvec_page_state()")
Signed-off-by: Honglei Wang <honglei.wang@oracle.com>
Reported-by: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
mm/vmscan.c | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)
--- a/mm/vmscan.c~mm-vmscan-get-number-of-pages-on-the-lru-list-in-memcgroup-base-on-lru_zone_size
+++ a/mm/vmscan.c
@@ -351,12 +351,13 @@ unsigned long zone_reclaimable_pages(str
*/
unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru, int zone_idx)
{
- unsigned long lru_size;
+ unsigned long lru_size = 0;
int zid;
- if (!mem_cgroup_disabled())
- lru_size = lruvec_page_state_local(lruvec, NR_LRU_BASE + lru);
- else
+ if (!mem_cgroup_disabled()) {
+ for (zid = 0; zid < MAX_NR_ZONES; zid++)
+ lru_size += mem_cgroup_get_zone_lru_size(lruvec, lru, zid);
+ } else
lru_size = node_page_state(lruvec_pgdat(lruvec), NR_LRU_BASE + lru);
for (zid = zone_idx + 1; zid < MAX_NR_ZONES; zid++) {
_