From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=k3Aa=2J=kvack.org=owner-linux-mm@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
X-Spam-Level: 
X-Spam-Status: No, score=-11.7 required=3.0
	tests=HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_PATCH,MAILING_LIST_MULTI,
	MENTIONS_GIT_HOSTING,SIGNED_OFF_BY,SPF_HELO_NONE,SPF_PASS,UNPARSEABLE_RELAY
	autolearn=ham autolearn_force=no version=3.4.0
Received: from mail.kernel.org (mail.kernel.org [198.145.29.99])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 754A9C43603
	for <linux-mm@archiver.kernel.org>; Thu, 19 Dec 2019 09:04:41 +0000 (UTC)
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by mail.kernel.org (Postfix) with ESMTP id 272CC2064B
	for <linux-mm@archiver.kernel.org>; Thu, 19 Dec 2019 09:04:41 +0000 (UTC)
DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 272CC2064B
Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=linux.alibaba.com
Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix)
	id B521C8E015D; Thu, 19 Dec 2019 04:04:40 -0500 (EST)
Received: by kanga.kvack.org (Postfix, from userid 40)
	id B03358E00F5; Thu, 19 Dec 2019 04:04:40 -0500 (EST)
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id A3F018E015D; Thu, 19 Dec 2019 04:04:40 -0500 (EST)
X-Delivered-To: linux-mm@kvack.org
Received: from forelay.hostedemail.com (smtprelay0250.hostedemail.com [216.40.44.250])
	by kanga.kvack.org (Postfix) with ESMTP id 8E92E8E00F5
	for <linux-mm@kvack.org>; Thu, 19 Dec 2019 04:04:40 -0500 (EST)
Received: from smtpin03.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251])
	by forelay03.hostedemail.com (Postfix) with SMTP id ED0918249980
	for <linux-mm@kvack.org>; Thu, 19 Dec 2019 09:04:39 +0000 (UTC)
X-FDA: 76281305478.03.blood09_2885a8a751801
X-HE-Tag: blood09_2885a8a751801
X-Filterd-Recvd-Size: 13264
Received: from out30-57.freemail.mail.aliyun.com (out30-57.freemail.mail.aliyun.com [115.124.30.57])
	by imf20.hostedemail.com (Postfix) with ESMTP
	for <linux-mm@kvack.org>; Thu, 19 Dec 2019 09:04:38 +0000 (UTC)
X-Alimail-AntiSpam:AC=PASS;BC=-1|-1;BR=01201311R131e4;CH=green;DM=||false|;DS=||;FP=0|-1|-1|-1|0|-1|-1|-1;HT=e01e07417;MF=teawaterz@linux.alibaba.com;NM=1;PH=DS;RN=14;SR=0;TI=SMTPD_---0TlL37Kz_1576746267;
Received: from 30.30.208.2(mailfrom:teawaterz@linux.alibaba.com fp:SMTPD_---0TlL37Kz_1576746267)
          by smtp.aliyun-inc.com(127.0.0.1);
          Thu, 19 Dec 2019 17:04:28 +0800
Content-Type: text/plain;
	charset=gb2312
Mime-Version: 1.0 (Mac OS X Mail 12.4 \(3445.104.11\))
Subject: Re: [PATCH] mm: vmscan: memcg: Add global shrink priority
From: teawater <teawaterz@linux.alibaba.com>
In-Reply-To: <CALOAHbCU2GHfupDRovk3Wvv=+qJr8sWO3tpu1upug=LM+VO1Og@mail.gmail.com>
Date: Thu, 19 Dec 2019 17:04:27 +0800
Cc: Johannes Weiner <hannes@cmpxchg.org>,
 Michal Hocko <mhocko@kernel.org>,
 Vladimir Davydov <vdavydov.dev@gmail.com>,
 Andrew Morton <akpm@linux-foundation.org>,
 Roman Gushchin <guro@fb.com>,
 Shakeel Butt <shakeelb@google.com>,
 Chris Down <chris@chrisdown.name>,
 Yang Shi <yang.shi@linux.alibaba.com>,
 Tejun Heo <tj@kernel.org>,
 tglx@linutronix.de,
 LKML <linux-kernel@vger.kernel.org>,
 Cgroups <cgroups@vger.kernel.org>,
 Linux MM <linux-mm@kvack.org>
Content-Transfer-Encoding: quoted-printable
Message-Id: <23317BFD-8C0F-4CC7-A97B-DF339F83DCBA@linux.alibaba.com>
References: <1576662179-16861-1-git-send-email-teawaterz@linux.alibaba.com>
 <CALOAHbCU2GHfupDRovk3Wvv=+qJr8sWO3tpu1upug=LM+VO1Og@mail.gmail.com>
To: Yafang Shao <laoar.shao@gmail.com>
X-Mailer: Apple Mail (2.3445.104.11)
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>


> On Dec 18, 2019, at 18:47, Yafang Shao <laoar.shao@gmail.com> wrote:
>=20
> On Wed, Dec 18, 2019 at 5:44 PM Hui Zhu <teawaterz@linux.alibaba.com> =
wrote:
>>=20
>> Currently, memcg has several knobs to limit memory usage and to
>> configure the shrink behavior.
>> In a memory-constrained environment, tasks of different priority are
>> put into different cgroups with different memory limits to protect
>> the performance of the high priority tasks, because a global memory
>> shrink affects the performance of all tasks.  A memory limit makes
>> the shrink happen inside the cgroup, which reduces how much memory
>> of the high priority task is shrunk and so protects its performance.
>>=20
>> But the memory footprint of a task is not static.  It changes as the
>> workload changes, and new versions of the task affect it too.
>> Setting a memory limit that keeps global memory shrink down is
>> therefore difficult and sometimes leads to wasted memory or
>> performance loss.
>>=20
>> This commit adds a global shrink priority to memcg to try to handle
>> this problem.
>> The default global shrink priority of each cgroup is DEF_PRIORITY,
>> which leaves its behavior in global shrink unchanged.
>> When the global shrink priority of a cgroup is smaller than
>> DEF_PRIORITY, its memory is shrunk only when
>> memcg->global_shrink_priority is greater than or equal to
>> sc->priority.
>>=20
>=20
> Just a kind reminder that sc->priority is really a proportion, rather
> than a priority.
> The reclaimer scans (total_size >> priority) pages at once.
> If the reclaimer can't reclaim enough memory, it will decrease
> sc->priority and scan the memcgs again until sc->priority drops to 0.
> (sc->priority is really a misleading name.)
> So comparing the memcg priority with sc->priority may cause
> unexpected issues.
>=20
>> The following is an example of using global shrink priority in a VM
>> that has 2 CPUs, 1G memory and 4G swap:
>> # These are test scripts that call usemem, which comes from
>> # =
https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git
>> cat 1.sh
>> sleep 9999
>> # -s 3600: Sleep 3600 seconds after the test completes so that usemem
>> # does not release the memory at once.
>> # -Z: Read the memory again after accessing it.
>> # The first access has to shrink memory to allocate pages, so the
>> # access speed of the high priority task does not increase a lot.
>> # The read-again speed of the high priority task does increase.
>> # $((850 * 1024 * 1024 + 8)): Different sizes are used to distinguish
>> # the results of the two tests.
>> usemem -s 3600 -Z -a -n 1 $((850 * 1024 * 1024 + 8))
>> cat 2.sh
>> sleep 9999
>> usemem -s 3600 -Z -a -n 1 $((850 * 1024 * 1024))
>>=20
>> # Setup swap
>> swapon /swapfile
>> # Setup 2 cgroups
>> mkdir /sys/fs/cgroup/memory/t1/
>> mkdir /sys/fs/cgroup/memory/t2/
>>=20
>> # Run tests with same global shrink priority
>> cat /sys/fs/cgroup/memory/t1/memory.global_shrink_priority
>> 12
>> cat /sys/fs/cgroup/memory/t2/memory.global_shrink_priority
>> 12
>> echo $$ > /sys/fs/cgroup/memory/t1/cgroup.procs
>> sh 1.sh &
>> echo $$ > /sys/fs/cgroup/memory/t2/cgroup.procs
>> sh 2.sh &
>> echo $$ > /sys/fs/cgroup/memory/cgroup.procs
>> killall sleep
>> # These are the test results
>> 1002700800 bytes / 2360359 usecs =3D 414852 KB/s
>> 1002700809 bytes / 2676181 usecs =3D 365894 KB/s
>> read again 891289600 bytes / 13515142 usecs =3D 64401 KB/s
>> read again 891289608 bytes / 13252268 usecs =3D 65679 KB/s
>> killall usemem
>>=20
>> # Run tests with 12 and 8
>> cat /sys/fs/cgroup/memory/t1/memory.global_shrink_priority
>> 12
>> echo 8 > /sys/fs/cgroup/memory/t2/memory.global_shrink_priority
>> echo $$ > /sys/fs/cgroup/memory/t1/cgroup.procs
>> sh 1.sh &
>> echo $$ > /sys/fs/cgroup/memory/t2/cgroup.procs
>> sh 2.sh &
>> echo $$ > /sys/fs/cgroup/memory/cgroup.procs
>> killall sleep
>> # These are the test results
>> 1002700800 bytes / 1809056 usecs =3D 541276 KB/s
>> 1002700809 bytes / 2184337 usecs =3D 448282 KB/s
>> read again 891289600 bytes / 6666224 usecs =3D 130568 KB/s
>> read again 891289608 bytes / 9171440 usecs =3D 94903 KB/s
>> killall usemem
>>=20
>> # These are the test results with 12 and 6
>> 1002700800 bytes / 1827914 usecs =3D 535692 KB/s
>> 1002700809 bytes / 2135124 usecs =3D 458615 KB/s
>> read again 891289600 bytes / 1498419 usecs =3D 580878 KB/s
>> read again 891289608 bytes / 7328362 usecs =3D 118771 KB/s
>>=20
>> Signed-off-by: Hui Zhu <teawaterz@linux.alibaba.com>
>> ---
>> include/linux/memcontrol.h |  2 ++
>> mm/memcontrol.c            | 32 ++++++++++++++++++++++++++++++++
>> mm/vmscan.c                | 39 =
++++++++++++++++++++++++++++++++++++---
>> 3 files changed, 70 insertions(+), 3 deletions(-)
>>=20
>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>> index a7a0a1a5..8ad2437 100644
>> --- a/include/linux/memcontrol.h
>> +++ b/include/linux/memcontrol.h
>> @@ -244,6 +244,8 @@ struct mem_cgroup {
>>        /* OOM-Killer disable */
>>        int             oom_kill_disable;
>>=20
>> +       s8 global_shrink_priority;
>> +
>>        /* memory.events and memory.events.local */
>>        struct cgroup_file events_file;
>>        struct cgroup_file events_local_file;
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index c5b5f74..39fdc84 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -4646,6 +4646,32 @@ static ssize_t =
memcg_write_event_control(struct kernfs_open_file *of,
>>        return ret;
>> }
>>=20
>> +static ssize_t mem_global_shrink_priority_write(struct =
kernfs_open_file *of,
>> +                               char *buf, size_t nbytes, loff_t off)
>> +{
>> +       struct mem_cgroup *memcg =3D mem_cgroup_from_css(of_css(of));
>> +       s8 val;
>> +       int ret;
>> +
>> +       ret =3D kstrtos8(buf, 0, &val);
>> +       if (ret < 0)
>> +               return ret;
>> +       if (val > DEF_PRIORITY)
>> +               return -EINVAL;
>> +
>> +       memcg->global_shrink_priority =3D val;
>> +
>> +       return nbytes;
>> +}
>> +
>> +static s64 mem_global_shrink_priority_read(struct =
cgroup_subsys_state *css,
>> +                                       struct cftype *cft)
>> +{
>> +       struct mem_cgroup *memcg =3D mem_cgroup_from_css(css);
>> +
>> +       return memcg->global_shrink_priority;
>> +}
>> +
>> static struct cftype mem_cgroup_legacy_files[] =3D {
>>        {
>>                .name =3D "usage_in_bytes",
>> @@ -4774,6 +4800,11 @@ static struct cftype mem_cgroup_legacy_files[] =
=3D {
>>                .write =3D mem_cgroup_reset,
>>                .read_u64 =3D mem_cgroup_read_u64,
>>        },
>> +       {
>> +               .name =3D "global_shrink_priority",
>> +               .write =3D mem_global_shrink_priority_write,
>> +               .read_s64 =3D mem_global_shrink_priority_read,
>> +       },
>>        { },    /* terminate */
>> };
>>=20
>> @@ -4996,6 +5027,7 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state =
*parent_css)
>>=20
>>        memcg->high =3D PAGE_COUNTER_MAX;
>>        memcg->soft_limit =3D PAGE_COUNTER_MAX;
>> +       memcg->global_shrink_priority =3D DEF_PRIORITY;
>>        if (parent) {
>>                memcg->swappiness =3D mem_cgroup_swappiness(parent);
>>                memcg->oom_kill_disable =3D parent->oom_kill_disable;
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 74e8edc..5e11d45 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -2637,17 +2637,33 @@ static inline bool =
should_continue_reclaim(struct pglist_data *pgdat,
>>        return inactive_lru_pages > pages_for_compaction;
>> }
>>=20
>> +static bool get_is_global_shrink(struct scan_control *sc)
>> +{
>> +       if (!sc->target_mem_cgroup ||
>> +               mem_cgroup_is_root(sc->target_mem_cgroup))
>> +               return true;
>> +
>> +       return false;
>> +}
>> +
>> static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control =
*sc)
>> {
>>        struct mem_cgroup *target_memcg =3D sc->target_mem_cgroup;
>>        struct mem_cgroup *memcg;
>> +       bool is_global_shrink =3D get_is_global_shrink(sc);
>>=20
>>        memcg =3D mem_cgroup_iter(target_memcg, NULL, NULL);
>>        do {
>> -               struct lruvec *lruvec =3D mem_cgroup_lruvec(memcg, =
pgdat);
>> +               struct lruvec *lruvec;
>>                unsigned long reclaimed;
>>                unsigned long scanned;
>>=20
>> +               if (is_global_shrink &&
>> +                       memcg->global_shrink_priority < sc->priority)
>> +                       continue;
>> +
>> +               lruvec =3D mem_cgroup_lruvec(memcg, pgdat);
>> +
>>                switch (mem_cgroup_protected(target_memcg, memcg)) {
>>                case MEMCG_PROT_MIN:
>>                        /*
>> @@ -2682,11 +2698,21 @@ static void shrink_node_memcgs(pg_data_t =
*pgdat, struct scan_control *sc)
>>                reclaimed =3D sc->nr_reclaimed;
>>                scanned =3D sc->nr_scanned;
>>=20
>> +               if (is_global_shrink &&
>> +                       memcg->global_shrink_priority !=3D =
DEF_PRIORITY)
>> +                       sc->priority +=3D DEF_PRIORITY
>> +                                       - =
memcg->global_shrink_priority;
>> +
>=20
> For example.
> In this case this memcg can't do a full scan.
> This behavior is similar to a hard protection (memory.min), which may
> cause unexpected OOM under memory pressure.
>=20
> Please correct me if I misunderstand you.

Thanks, I agree with you.
A low priority task should be shrunk more when the high priority task
is skipped.

Best,
Hui

>=20
>>                shrink_lruvec(lruvec, sc);
>>=20
>>                shrink_slab(sc->gfp_mask, pgdat->node_id, memcg,
>>                            sc->priority);
>>=20
>> +               if (is_global_shrink &&
>> +                       memcg->global_shrink_priority !=3D =
DEF_PRIORITY)
>> +                       sc->priority -=3D DEF_PRIORITY
>> +                                       - =
memcg->global_shrink_priority;
>> +
>>                /* Record the group's reclaim efficiency */
>>                vmpressure(sc->gfp_mask, memcg, false,
>>                           sc->nr_scanned - scanned,
>> @@ -3395,11 +3421,18 @@ static void age_active_anon(struct =
pglist_data *pgdat,
>>=20
>>        memcg =3D mem_cgroup_iter(NULL, NULL, NULL);
>>        do {
>> +               if (memcg->global_shrink_priority < sc->priority)
>> +                       continue;
>> +
>>                lruvec =3D mem_cgroup_lruvec(memcg, pgdat);
>> +               /*
>> +                * Do not adjust sc->priority here, even for a global
>> +                * shrink: nr_to_scan is fixed at SWAP_CLUSTER_MAX and
>> +                * nothing else here uses sc->priority.
>> +                */
>>                shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
>>                                   sc, LRU_ACTIVE_ANON);
>> -               memcg =3D mem_cgroup_iter(NULL, memcg, NULL);
>> -       } while (memcg);
>> +       } while ((memcg =3D mem_cgroup_iter(NULL, memcg, NULL)));
>> }
>>=20
>> static bool pgdat_watermark_boosted(pg_data_t *pgdat, int =
classzone_idx)
>> --
>> 2.7.4