From: Honglei Wang <honglei.wang@oracle.com>
To: Michal Hocko <mhocko@kernel.org>
Cc: linux-mm@kvack.org, vdavydov.dev@gmail.com, hannes@cmpxchg.org
Subject: Re: [PATCH v2] mm/vmscan: get number of pages on the LRU list in memcgroup base on lru_zone_size
Date: Tue, 8 Oct 2019 17:34:03 +0800
Message-ID: <991b4719-a2a0-9efe-de02-56a928752fe3@oracle.com>
In-Reply-To: <20191007142805.GM2381@dhcp22.suse.cz>
On 10/7/19 10:28 PM, Michal Hocko wrote:
> On Thu 05-09-19 15:10:34, Honglei Wang wrote:
>> lruvec_lru_size() currently calls lruvec_page_state_local() to get the
>> lru_size. That value is based on lruvec_stat_local.count[] of
>> mem_cgroup_per_node, a counter that is updated in batches. Pages are
>> not charged there until the number of incoming pages exceeds
>> MEMCG_CHARGE_BATCH, which is currently defined as 32.
>>
>> The LTP testcase madvise09[1] fails because such a small block of
>> memory is not charged. It creates a new memcgroup and sets up 32
>> MADV_FREE pages. Then it forks a child that introduces memory pressure
>> in the memcgroup. The MADV_FREE pages are expected to be released under
>> the pressure, but 32 is not more than MEMCG_CHARGE_BATCH, so these pages
>> won't be charged in lruvec_stat_local.count[] until more pages come in
>> to satisfy the batch. So these MADV_FREE pages can't be freed under
>> memory pressure, which conflicts somewhat with the definition of
>> MADV_FREE.
>
> The test case is simply wrong. The caching and the batch size is an
> internal implementation detail. Moreover MADV_FREE is a _hint_ so all
> you can say is that those pages will get freed at some point in time but
> you cannot make any assumptions about when that moment happens.
>
This is a corner case: the test creates extreme memory pressure, which
gives the group no chance to satisfy the batch operation. There is only a
small chance of hitting this problem in a real workload, since 128K of
memory is really small compared with current memory usage. I know exactly
what you mean. The batch size is an internal implementation detail; this
*test case* just happens to hit it as a black box.
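To make the numbers concrete: assuming 4K pages, 32 pages is the 128K
mentioned above. Below is a minimal userspace sketch of a batched counter.
It is not the kernel code; the struct and function names are made up for
illustration, and CHARGE_BATCH only mirrors the value of MEMCG_CHARGE_BATCH
quoted in this thread. It assumes a pending delta is folded into the
reader-visible total only once it exceeds the batch size.

```c
#include <assert.h>
#include <stdlib.h>

#define CHARGE_BATCH 32 /* mirrors MEMCG_CHARGE_BATCH as quoted in the thread */

/* Hypothetical stand-in for a batched counter. */
struct batched_counter {
	long local_delta;  /* pending updates, not yet visible to readers */
	long shared_count; /* what a reader of the batched counter sees */
};

/* Add pages; fold the delta into the shared count only past the batch. */
static void batched_add(struct batched_counter *c, long nr_pages)
{
	long x = c->local_delta + nr_pages;

	if (labs(x) > CHARGE_BATCH) {
		c->shared_count += x;
		x = 0;
	}
	c->local_delta = x;
}

/* Readers only see what has already been folded in. */
static long batched_read(const struct batched_counter *c)
{
	return c->shared_count;
}

/* Visible count after adding nr_pages, one page at a time, to a fresh counter. */
static long visible_after(long nr_pages)
{
	struct batched_counter c = { 0, 0 };
	long i;

	for (i = 0; i < nr_pages; i++)
		batched_add(&c, 1);
	return batched_read(&c);
}
```

In this model visible_after(32) is 0 while visible_after(33) is 33, which
is the kind of gap the test runs into.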
>> Getting lru_size based on lru_zone_size of mem_cgroup_per_node, which
>> is not updated in batches, makes it a bit more accurate in similar
>> scenarios.
>
> What does that mean? It would be more helpful to describe the code path
> which will use this more precise value and what is the effect of that.
>
How about we describe it like this:

Getting the lru_size based on lru_zone_size of mem_cgroup_per_node, which
is not updated in batches, helps any related code path get a more precise
lru size in the mem_cgroup case. This way the memory reclaim code won't
ignore small blocks of memory (say, no more than MEMCG_CHARGE_BATCH pages)
on the lru list.

For this specific MADV_FREE case, the more precise lru size helps release
the 32 pages, which do not exceed the batch size, as expected.
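As a rough userspace sketch of the idea (not the kernel code; lru_model,
NR_ZONES and the helper names are hypothetical), summing per-zone sizes
that are updated immediately gives an exact total even for a handful of
pages:

```c
#include <assert.h>

#define NR_ZONES 4 /* hypothetical; the patch iterates up to MAX_NR_ZONES */

/* Hypothetical model of per-zone LRU sizes that are updated immediately. */
struct lru_model {
	unsigned long zone_size[NR_ZONES];
};

/* Account pages to one zone right away, with no batching. */
static void lru_add_pages(struct lru_model *m, int zid, unsigned long nr)
{
	m->zone_size[zid] += nr;
}

/* Sum every zone, in the spirit of the loop the patch adds. */
static unsigned long lru_size_exact(const struct lru_model *m)
{
	unsigned long lru_size = 0;
	int zid;

	for (zid = 0; zid < NR_ZONES; zid++)
		lru_size += m->zone_size[zid];
	return lru_size;
}

/* Example: 32 pages spread over two zones are all visible at once. */
static unsigned long example_total(void)
{
	struct lru_model m = { { 0 } };

	lru_add_pages(&m, 1, 20);
	lru_add_pages(&m, 2, 12);
	return lru_size_exact(&m);
}
```

Here example_total() returns 32, so a reader of the summed value sees all
of the pages without waiting for any batch to fill.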
Thanks,
Honglei
> As I've said in the previous version, I do not object to the patch
> because a more precise lruvec_lru_size sounds like a nice thing as long
> as we are not paying a high price for that. Just look at the global case
> for mem_cgroup_disabled(). It uses node_page_state and that one is using
> per-cpu accounting with regular global value refreshing IIRC.
>
>> [1] https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/syscalls/madvise/madvise09.c
>>
>> Signed-off-by: Honglei Wang <honglei.wang@oracle.com>
>> ---
>> mm/vmscan.c | 9 +++++----
>> 1 file changed, 5 insertions(+), 4 deletions(-)
>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index c77d1e3761a7..c28672460868 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -354,12 +354,13 @@ unsigned long zone_reclaimable_pages(struct zone *zone)
>> */
>> unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru, int zone_idx)
>> {
>> - unsigned long lru_size;
>> + unsigned long lru_size = 0;
>> int zid;
>>
>> - if (!mem_cgroup_disabled())
>> - lru_size = lruvec_page_state_local(lruvec, NR_LRU_BASE + lru);
>> - else
>> + if (!mem_cgroup_disabled()) {
>> + for (zid = 0; zid < MAX_NR_ZONES; zid++)
>> + lru_size += mem_cgroup_get_zone_lru_size(lruvec, lru, zid);
>> + } else
>> lru_size = node_page_state(lruvec_pgdat(lruvec), NR_LRU_BASE + lru);
>>
>> for (zid = zone_idx + 1; zid < MAX_NR_ZONES; zid++) {
>> --
>> 2.17.0
>