Subject: Re: [RFC PATCH] mm/vmscan: Don't round up scan size for online memory cgroup
From: Yang Shi
Date: Mon, 10 Feb 2020 19:03:22 -0800
To: Gavin Shan
Cc: Roman Gushchin, Linux MM, drjones@redhat.com, david@redhat.com, bhe@redhat.com, Johannes Weiner
In-Reply-To: <63c5d402-ec1e-2935-7f16-8e2aed047c7c@redhat.com>
References: <20200210121445.711819-1-gshan@redhat.com> <20200210161721.GA167254@tower.DHCP.thefacebook.com> <9919b674-244d-0a55-c842-b0661585f9e2@redhat.com> <20200211013118.GA147346@carbon.lan> <63c5d402-ec1e-2935-7f16-8e2aed047c7c@redhat.com>

On Mon, Feb 10, 2020 at 6:18 PM Gavin Shan wrote:
>
> Hi Roman,
>
> On 2/11/20 12:31 PM, Roman Gushchin wrote:
> > On Tue, Feb 11, 2020 at 10:55:53AM +1100, Gavin Shan wrote:
> >> On 2/11/20 3:17 AM, Roman Gushchin wrote:
> >>> On Mon, Feb 10, 2020 at 11:14:45PM +1100, Gavin Shan wrote:
> >>>> commit 68600f623d69 ("mm: don't miss the last page because of round-off
> >>>> error") makes the scan size round up to @denominator regardless of the
> >>>> memory cgroup's state, online or offline. This affects the overall
> >>>> reclaiming behavior: the corresponding LRU list is eligible for reclaiming
> >>>> only when its size logically right shifted by @sc->priority is bigger than
> >>>> zero in the former formula (non-roundup one).
> >>>
> >>> Not sure I fully understand, but wasn't it so before 68600f623d69 too?
> >>>
> >>
> >> It's correct that "(non-roundup one)" is a typo and should have been dropped.
> >> Will be corrected in v2 if needed.
> >
> > Thanks!
> >
> >>
> >>>> For example, the inactive
> >>>> anonymous LRU list should have at least 0x4000 pages to be eligible for
> >>>> reclaiming when we have 60/12 for swappiness/priority and without taking
> >>>> the scan/rotation ratio into account. After the roundup is applied, the
> >>>> inactive anonymous LRU list becomes eligible for reclaiming when its
> >>>> size is bigger than or equal to 0x1000 in the same condition.
> >>>>
> >>>>   (0x4000 >> 12) * 60 / (60 + 140 + 1) = 1
> >>>>   ((0x1000 >> 12) * 60 + 200) / (60 + 140 + 1) = 1
> >>>>
> >>>> aarch64 has a 512MB huge page size when the base page size is 64KB. The
> >>>> memory cgroup that has a huge page is always eligible for reclaiming in
> >>>> that case. The reclaiming is likely to stop after the huge page is
> >>>> reclaimed, meaning the subsequent @sc->priority values and memory cgroups
> >>>> will be skipped. It changes the overall reclaiming behavior. This fixes
> >>>> the issue by applying the roundup to offlined memory cgroups only, to give
> >>>> more preference to reclaiming memory from offlined memory cgroups. It
> >>>> sounds reasonable as that memory is likely to be useless.
> >>>
> >>> So is the problem that relatively small memory cgroups are getting reclaimed
> >>> on default prio, however before they were skipped?
> >>>
> >>
> >> Yes, you're correct.
> >> There are two dimensions for global reclaim: priority (sc->priority) and
> >> memory cgroup. The scan/reclaim is carried out by iterating over these two
> >> dimensions until the reclaimed pages are enough. If the roundup is applied
> >> to the current memory cgroup and occasionally helps to reclaim enough
> >> memory, the subsequent priorities and memory cgroups will be skipped.
> >>
> >>>>
> >>>> The issue was found by starting up 8 VMs on an Ampere Mustang machine,
> >>>> which has 8 CPUs and 16 GB memory. Each VM is given 2 vCPUs and 2GB of
> >>>> memory. 784MB of swap space is consumed after these 8 VMs are completely
> >>>> up. Note that KSM is disabled while THP is enabled in the testing. With
> >>>> this applied, the consumed swap space decreased to 60MB.
> >>>>
> >>>>            total     used     free   shared  buff/cache  available
> >>>>   Mem:     16196    10065     2049       16        4081       3749
> >>>>   Swap:     8175      784     7391
> >>>>
> >>>>            total     used     free   shared  buff/cache  available
> >>>>   Mem:     16196    11324     3656       24        1215       2936
> >>>>   Swap:     8175       60     8115
> >>>
> >>> Does it lead to any performance regressions? Or is it only about increased
> >>> swap usage?
> >>>
> >>
> >> Apart from the swap usage, it also caused a performance downgrade in my case.
> >> With your patch (68600f623d69) included, it took 264 seconds to bring up 8
> >> VMs. However, 236 seconds were used to do the same thing with my patch
> >> applied on top of yours. That is a 10% performance downgrade. It's the
> >> reason why I had a stable tag.
> >
> > I see...
> >
>
> I will put these data into the commit log of v2, which will be posted shortly.
>
> >>
> >>>>
> >>>> Fixes: 68600f623d69 ("mm: don't miss the last page because of round-off error")
> >>>> Cc: # v4.20+
> >>>> Signed-off-by: Gavin Shan
> >>>> ---
> >>>>  mm/vmscan.c | 9 ++++++---
> >>>>  1 file changed, 6 insertions(+), 3 deletions(-)
> >>>>
> >>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
> >>>> index c05eb9efec07..876370565455 100644
> >>>> --- a/mm/vmscan.c
> >>>> +++ b/mm/vmscan.c
> >>>> @@ -2415,10 +2415,13 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
> >>>>             /*
> >>>>              * Scan types proportional to swappiness and
> >>>>              * their relative recent reclaim efficiency.
> >>>> -            * Make sure we don't miss the last page
> >>>> -            * because of a round-off error.
> >>>> +            * Make sure we don't miss the last page on
> >>>> +            * the offlined memory cgroups because of a
> >>>> +            * round-off error.
> >>>>              */
> >>>> -           scan = DIV64_U64_ROUND_UP(scan * fraction[file],
> >>>> +           scan = mem_cgroup_online(memcg) ?
> >>>> +                  div64_u64(scan * fraction[file], denominator) :
> >>>> +                  DIV64_U64_ROUND_UP(scan * fraction[file],
> >>>>                                       denominator);
> >>>
> >>> It looks a bit strange to round up for offline and basically down for
> >>> everything else. So maybe it's better to return to something like
> >>> the very first version of the patch:
> >>> https://www.spinics.net/lists/kernel/msg2883146.html ?
> >>> For memcg reclaim reasons we do care only about an edge case with few pages.
> >>>
> >>> But overall it's not obvious to me why rounding up is worse than rounding
> >>> down. Maybe we should average down but accumulate the remainder?
> >>> Creating an implicit bias for small memory cgroups sounds groundless.
> >>>
> >>
> >> I don't think the v1 patch works for me either.
> >> The logic in v1 isn't too much different from commit 68600f623d69. v1 has a
> >> selective roundup, while the current code has a forced roundup. With
> >> 68600f623d69 reverted and your v1 patch applied, it took 273 seconds to
> >> bring up 8 VMs and 1752MB of swap was used. It looks even worse than
> >> 68600f623d69.
> >>
> >> Yeah, it's not reasonable to have a bias on all memory cgroups regardless of
> >> their states. I do think it's still right to give a bias to offlined memory
> >> cgroups.
> >
> > I don't think so, it really depends on the workload. Imagine systemd restarting
> > a service due to some update or with other arguments. Almost the entire
> > pagecache is relevant and can be reused by the new cgroup.
> >
>
> Indeed, it depends on the workload. This patch is to revert 68600f623d69 for
> online memory cgroups, but keep the logic for offlined memory cgroups to avoid
> breaking your case.
>
> There is something which might be unrelated to discuss here: the pagecache
> could be backed by a low-speed (HDD) or high-speed (SSD) medium, so the cost
> to fetch pages from disk into memory isn't equal, meaning we need some kind of
> bias during reclaiming. It seems to be something missing from the current
> implementation.

Yes, the refault cost was not taken into account. I recall Johannes posted a
patch series a couple of years ago to have swapping weighted by the refault
cost, please see: https://lwn.net/Articles/690079/.

> >
> >> So the point is we need to take care of the memory cgroup's state
> >> and apply the bias to offlined ones only. The offlined memory cgroup is
> >> going to die or is already dead. It's unlikely for its memory to be used
> >> by someone, but still possible. So it's reasonable to squeeze the used
> >> memory of an offlined memory cgroup harder if possible.
> >
> > Anyway, I think your version is good to mitigate the regression.
> > So, please feel free to add
> > Acked-by: Roman Gushchin
> >
>
> Thanks, Roman! It will be included in v2.
>
> > But I think we need something more clever long-term: e.g. accumulate
> > the leftover from the division and add it to the next calculation.
> >
> > If you can test such an approach on your workload, that would be nice.
> >
>
> Yeah, we need something smarter in the long run. Let's see if I can sort/test
> it out and then come back to you.
>
> Thanks,
> Gavin
>
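For illustration, below is a minimal userspace sketch of the "accumulate the
leftover" idea discussed above: round the proportional share down, but carry
the division remainder into the next calculation, so a small LRU list neither
always contributes at least one page (round-up) nor gets starved forever
(round-down). This is not the actual get_scan_count() change; the structure,
the "leftover" field and the numbers are assumptions made up for the example.

  /*
   * Illustrative sketch only, not kernel code: the struct, the field name
   * "leftover" and the test numbers are invented for demonstration.
   *
   * Build: cc -o scan-demo scan-demo.c && ./scan-demo
   */
  #include <stdint.h>
  #include <stdio.h>

  struct lru_state {
          uint64_t leftover;      /* remainder carried between invocations */
  };

  /*
   * Proportional share of @scan according to fraction/denominator.
   * Instead of rounding up or down, the division remainder is carried
   * into the next call.
   */
  static uint64_t scan_share(struct lru_state *st, uint64_t scan,
                             uint64_t fraction, uint64_t denominator)
  {
          uint64_t product = scan * fraction + st->leftover;

          st->leftover = product % denominator;
          return product / denominator;
  }

  int main(void)
  {
          struct lru_state st = { 0 };
          uint64_t total = 0;

          /*
           * 0x1000 pages at priority 12 with swappiness 60 against 140+1:
           * the exact per-call share is 60/201 of one page, so plain
           * round-down would return 0 forever and round-up would return 1
           * on every call.  With the carried remainder, roughly every
           * fourth call returns one page.
           */
          for (int i = 0; i < 16; i++) {
                  uint64_t scan = 0x1000 >> 12;   /* 1 page */
                  uint64_t pages = scan_share(&st, scan, 60, 60 + 140 + 1);

                  total += pages;
                  printf("call %2d: %llu page(s), leftover %llu\n", i,
                         (unsigned long long)pages,
                         (unsigned long long)st.leftover);
          }
          printf("total over 16 calls: %llu\n", (unsigned long long)total);
          return 0;
  }

With these numbers the sketch scans 4 pages over 16 calls (the exact 60/201
proportion, rounded down overall), instead of 16 pages with the forced
round-up or 0 pages with a plain round-down.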