From: Yafang Shao
Date: Wed, 18 Dec 2019 18:47:48 +0800
Subject: Re: [PATCH] mm: vmscan: memcg: Add global shrink priority
To: Hui Zhu
Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Roman Gushchin, Shakeel Butt, Chris Down, Yang Shi, Tejun Heo,
	tglx@linutronix.de, LKML, Cgroups, Linux MM
In-Reply-To: <1576662179-16861-1-git-send-email-teawaterz@linux.alibaba.com>

On Wed, Dec 18, 2019 at 5:44 PM Hui Zhu wrote:
>
> Currently, memcg has some configuration to limit memory usage and to
> control the shrink behavior.
> In a memory-constrained environment, tasks of different priorities are
> put into different cgroups with different memory limits to protect the
> performance of the high-priority tasks, because a global memory shrink
> affects the performance of all tasks. A memory-limited cgroup makes the
> shrink happen inside the cgroup, which reduces the memory shrink of the
> high-priority task and protects its performance.
>
> But the memory footprint of a task is not static. It changes as the
> working pressure changes, and version changes affect it too. Setting an
> appropriate memory limit to reduce the global memory shrink is
> therefore a difficult job, and it sometimes leads to wasted memory or
> performance loss.
>
> This commit adds a global shrink priority to memcg to try to handle
> this problem.
> The default global shrink priority of each cgroup is DEF_PRIORITY, and
> its behavior in global shrink is unchanged.
> When the global shrink priority of a cgroup is smaller than
> DEF_PRIORITY, its memory will be shrunk when
> memcg->global_shrink_priority is greater than or equal to sc->priority.
>

Just a kind reminder that sc->priority is really proportional (a scan
granularity) rather than a priority. The reclaimer scans
(total_size >> sc->priority) pages at once; if it can't reclaim enough
memory, it decreases sc->priority and scans the memcgs again until
sc->priority drops to 0. (sc->priority is really misleading wording.)
So comparing the memcg priority with sc->priority may cause unexpected
issues.
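Here is a minimal userspace sketch of what I mean (a toy model, not the
kernel code; the LRU size and the plain right shift are simplifications
for illustration):

#include <stdio.h>

#define DEF_PRIORITY 12

/*
 * Toy model of how sc->priority drives the reclaim scan count: each
 * pass scans roughly (lru_size >> sc->priority) pages, and the
 * reclaimer lowers sc->priority pass by pass until it has reclaimed
 * enough or the priority reaches 0 (a full scan).  A *smaller*
 * sc->priority therefore means a *more* aggressive pass, not a less
 * important memcg.
 */
int main(void)
{
	unsigned long lru_size = 1UL << 18;	/* e.g. 256k pages, ~1G */
	int priority;

	for (priority = DEF_PRIORITY; priority >= 0; priority--)
		printf("sc->priority %2d -> scan ~%lu pages\n",
		       priority, lru_size >> priority);

	return 0;
}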
> The following is an example of using global shrink priority in a VM
> that has 2 CPUs, 1G memory and 4G swap:
> # These are test shells that call usemem, which comes from
> # https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git
> cat 1.sh
> sleep 9999
> # -s 3600: Sleep 3600 seconds after the test completes so usemem will
> # not release the memory at once.
> # -Z: read the memory again after accessing it.
> # The first access needs to shrink memory to allocate pages, so the
> # access speed of the high-priority task will not increase much, but
> # its read-again speed will.
> # $((850 * 1024 * 1024 + 8)): Different sizes are used to distinguish
> # the results of the two tests.
> usemem -s 3600 -Z -a -n 1 $((850 * 1024 * 1024 + 8))
> cat 2.sh
> sleep 9999
> usemem -s 3600 -Z -a -n 1 $((850 * 1024 * 1024))
>
> # Setup swap
> swapon /swapfile
> # Setup 2 cgroups
> mkdir /sys/fs/cgroup/memory/t1/
> mkdir /sys/fs/cgroup/memory/t2/
>
> # Run tests with the same global shrink priority
> cat /sys/fs/cgroup/memory/t1/memory.global_shrink_priority
> 12
> cat /sys/fs/cgroup/memory/t2/memory.global_shrink_priority
> 12
> echo $$ > /sys/fs/cgroup/memory/t1/cgroup.procs
> sh 1.sh &
> echo $$ > /sys/fs/cgroup/memory/t2/cgroup.procs
> sh 2.sh &
> echo $$ > /sys/fs/cgroup/memory/cgroup.procs
> killall sleep
> # These are the test results
> 1002700800 bytes / 2360359 usecs = 414852 KB/s
> 1002700809 bytes / 2676181 usecs = 365894 KB/s
> read again 891289600 bytes / 13515142 usecs = 64401 KB/s
> read again 891289608 bytes / 13252268 usecs = 65679 KB/s
> killall usemem
>
> # Run tests with priorities 12 and 8
> cat /sys/fs/cgroup/memory/t1/memory.global_shrink_priority
> 12
> echo 8 > /sys/fs/cgroup/memory/t2/memory.global_shrink_priority
> echo $$ > /sys/fs/cgroup/memory/t1/cgroup.procs
> sh 1.sh &
> echo $$ > /sys/fs/cgroup/memory/t2/cgroup.procs
> sh 2.sh &
> echo $$ > /sys/fs/cgroup/memory/cgroup.procs
> killall sleep
> # These are the test results
> 1002700800 bytes / 1809056 usecs = 541276 KB/s
> 1002700809 bytes / 2184337 usecs = 448282 KB/s
> read again 891289600 bytes / 6666224 usecs = 130568 KB/s
> read again 891289608 bytes / 9171440 usecs = 94903 KB/s
> killall usemem
>
> # These are the test results with priorities 12 and 6
> 1002700800 bytes / 1827914 usecs = 535692 KB/s
> 1002700809 bytes / 2135124 usecs = 458615 KB/s
> read again 891289600 bytes / 1498419 usecs = 580878 KB/s
> read again 891289608 bytes / 7328362 usecs = 118771 KB/s
>
> Signed-off-by: Hui Zhu
> ---
>  include/linux/memcontrol.h |  2 ++
>  mm/memcontrol.c            | 32 ++++++++++++++++++++++++++++++++
>  mm/vmscan.c                | 39 ++++++++++++++++++++++++++++++++++++---
>  3 files changed, 70 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index a7a0a1a5..8ad2437 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -244,6 +244,8 @@ struct mem_cgroup {
>  	/* OOM-Killer disable */
>  	int oom_kill_disable;
>
> +	s8 global_shrink_priority;
> +
>  	/* memory.events and memory.events.local */
>  	struct cgroup_file events_file;
>  	struct cgroup_file events_local_file;
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index c5b5f74..39fdc84 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -4646,6 +4646,32 @@ static ssize_t memcg_write_event_control(struct kernfs_open_file *of,
>  	return ret;
>  }
>
> +static ssize_t mem_global_shrink_priority_write(struct kernfs_open_file *of,
> +				char *buf, size_t nbytes, loff_t off)
> +{
> +	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
> +	s8 val;
> +	int ret;
> +
> +	ret = kstrtos8(buf, 0, &val);
> +	if (ret < 0)
> +		return ret;
> +	if (val > DEF_PRIORITY)
> +		return -EINVAL;
> +
> +	memcg->global_shrink_priority = val;
> +
> +	return nbytes;
> +}
> +
> +static s64 mem_global_shrink_priority_read(struct cgroup_subsys_state *css,
> +					   struct cftype *cft)
> +{
> +	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
> +
> +	return memcg->global_shrink_priority;
> +}
> +
>  static struct cftype mem_cgroup_legacy_files[] = {
>  	{
>  		.name = "usage_in_bytes",
> @@ -4774,6 +4800,11 @@ static struct cftype mem_cgroup_legacy_files[] = {
>  		.write = mem_cgroup_reset,
>  		.read_u64 = mem_cgroup_read_u64,
>  	},
> +	{
> +		.name = "global_shrink_priority",
> +		.write = mem_global_shrink_priority_write,
> +		.read_s64 = mem_global_shrink_priority_read,
> +	},
>  	{ },	/* terminate */
>  };
>
> @@ -4996,6 +5027,7 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
>
>  	memcg->high = PAGE_COUNTER_MAX;
>  	memcg->soft_limit = PAGE_COUNTER_MAX;
> +	memcg->global_shrink_priority = DEF_PRIORITY;
>  	if (parent) {
>  		memcg->swappiness = mem_cgroup_swappiness(parent);
>  		memcg->oom_kill_disable = parent->oom_kill_disable;
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 74e8edc..5e11d45 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2637,17 +2637,33 @@ static inline bool should_continue_reclaim(struct pglist_data *pgdat,
>  	return inactive_lru_pages > pages_for_compaction;
>  }
>
> +static bool get_is_global_shrink(struct scan_control *sc)
> +{
> +	if (!sc->target_mem_cgroup ||
> +	    mem_cgroup_is_root(sc->target_mem_cgroup))
> +		return true;
> +
> +	return false;
> +}
> +
>  static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
>  {
>  	struct mem_cgroup *target_memcg = sc->target_mem_cgroup;
>  	struct mem_cgroup *memcg;
> +	bool is_global_shrink = get_is_global_shrink(sc);
>
>  	memcg = mem_cgroup_iter(target_memcg, NULL, NULL);
>  	do {
> -		struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
> +		struct lruvec *lruvec;
>  		unsigned long reclaimed;
>  		unsigned long scanned;
>
> +		if (is_global_shrink &&
> +		    memcg->global_shrink_priority < sc->priority)
> +			continue;
> +
> +		lruvec = mem_cgroup_lruvec(memcg, pgdat);
> +
>  		switch (mem_cgroup_protected(target_memcg, memcg)) {
>  		case MEMCG_PROT_MIN:
>  			/*
> @@ -2682,11 +2698,21 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
>  		reclaimed = sc->nr_reclaimed;
>  		scanned = sc->nr_scanned;
>
> +		if (is_global_shrink &&
> +		    memcg->global_shrink_priority != DEF_PRIORITY)
> +			sc->priority += DEF_PRIORITY
> +				- memcg->global_shrink_priority;
> +

For example, in this case this memcg can never do a full scan. This
behavior is similar to a hard protection (memory.min), which may cause
unexpected OOM under memory pressure.
Please correct me if I misunderstand you.
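To make the concern concrete, here is a toy model (hypothetical,
simplified from the patch's shrink_node_memcgs() changes) of the
effective priority a memcg sees:

#include <stdio.h>

#define DEF_PRIORITY 12

/*
 * Under the patch, a memcg is skipped while
 * memcg->global_shrink_priority < sc->priority, and is otherwise
 * scanned with sc->priority temporarily raised by
 * (DEF_PRIORITY - global_shrink_priority).  For any priority below
 * DEF_PRIORITY the effective value never reaches 0, i.e. the memcg
 * can never be fully scanned.
 */
static int effective_priority(int global_shrink_priority, int sc_priority)
{
	if (global_shrink_priority < sc_priority)
		return -1;	/* skipped entirely this pass */
	return sc_priority + (DEF_PRIORITY - global_shrink_priority);
}

int main(void)
{
	int sc_priority, eff;

	for (sc_priority = DEF_PRIORITY; sc_priority >= 0; sc_priority--) {
		eff = effective_priority(6, sc_priority);
		if (eff < 0)
			printf("sc->priority %2d: memcg(prio=6) skipped\n",
			       sc_priority);
		else
			printf("sc->priority %2d: memcg(prio=6) scanned at %d\n",
			       sc_priority, eff);
	}

	return 0;
}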
"global_shrink_priority", > + .write = mem_global_shrink_priority_write, > + .read_s64 = mem_global_shrink_priority_read, > + }, > { }, /* terminate */ > }; > > @@ -4996,6 +5027,7 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css) > > memcg->high = PAGE_COUNTER_MAX; > memcg->soft_limit = PAGE_COUNTER_MAX; > + memcg->global_shrink_priority = DEF_PRIORITY; > if (parent) { > memcg->swappiness = mem_cgroup_swappiness(parent); > memcg->oom_kill_disable = parent->oom_kill_disable; > diff --git a/mm/vmscan.c b/mm/vmscan.c > index 74e8edc..5e11d45 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -2637,17 +2637,33 @@ static inline bool should_continue_reclaim(struct pglist_data *pgdat, > return inactive_lru_pages > pages_for_compaction; > } > > +static bool get_is_global_shrink(struct scan_control *sc) > +{ > + if (!sc->target_mem_cgroup || > + mem_cgroup_is_root(sc->target_mem_cgroup)) > + return true; > + > + return false; > +} > + > static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc) > { > struct mem_cgroup *target_memcg = sc->target_mem_cgroup; > struct mem_cgroup *memcg; > + bool is_global_shrink = get_is_global_shrink(sc); > > memcg = mem_cgroup_iter(target_memcg, NULL, NULL); > do { > - struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat); > + struct lruvec *lruvec; > unsigned long reclaimed; > unsigned long scanned; > > + if (is_global_shrink && > + memcg->global_shrink_priority < sc->priority) > + continue; > + > + lruvec = mem_cgroup_lruvec(memcg, pgdat); > + > switch (mem_cgroup_protected(target_memcg, memcg)) { > case MEMCG_PROT_MIN: > /* > @@ -2682,11 +2698,21 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc) > reclaimed = sc->nr_reclaimed; > scanned = sc->nr_scanned; > > + if (is_global_shrink && > + memcg->global_shrink_priority != DEF_PRIORITY) > + sc->priority += DEF_PRIORITY > + - memcg->global_shrink_priority; > + For example. In this case this memcg can't do full scan. This behavior is similar with a hard protect(memroy.min), which may cause unexpected OOM under memory pressure. Pls. correct me if I misunderstand you. > shrink_lruvec(lruvec, sc); > > shrink_slab(sc->gfp_mask, pgdat->node_id, memcg, > sc->priority); > > + if (is_global_shrink && > + memcg->global_shrink_priority != DEF_PRIORITY) > + sc->priority -= DEF_PRIORITY > + - memcg->global_shrink_priority; > + > /* Record the group's reclaim efficiency */ > vmpressure(sc->gfp_mask, memcg, false, > sc->nr_scanned - scanned, > @@ -3395,11 +3421,18 @@ static void age_active_anon(struct pglist_data *pgdat, > > memcg = mem_cgroup_iter(NULL, NULL, NULL); > do { > + if (memcg->global_shrink_priority < sc->priority) > + continue; > + > lruvec = mem_cgroup_lruvec(memcg, pgdat); > + /* > + * Not set sc->priority according even if this is > + * a global shrink because nr_to_scan is set to > + * SWAP_CLUSTER_MAX and there is not other part use it. > + */ > shrink_active_list(SWAP_CLUSTER_MAX, lruvec, > sc, LRU_ACTIVE_ANON); > - memcg = mem_cgroup_iter(NULL, memcg, NULL); > - } while (memcg); > + } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL))); > } > > static bool pgdat_watermark_boosted(pg_data_t *pgdat, int classzone_idx) > -- > 2.7.4 > >