Subject: Re: [RFC PATCH] mm/vmscan: Don't round up scan size for online memory cgroup
From: Yang Shi
Date: Mon, 10 Feb 2020 19:03:22 -0800
To: Gavin Shan
Cc: Roman Gushchin, Linux MM, drjones@redhat.com, david@redhat.com, bhe@redhat.com, Johannes Weiner
In-Reply-To: <63c5d402-ec1e-2935-7f16-8e2aed047c7c@redhat.com>
References: <20200210121445.711819-1-gshan@redhat.com> <20200210161721.GA167254@tower.DHCP.thefacebook.com> <9919b674-244d-0a55-c842-b0661585f9e2@redhat.com> <20200211013118.GA147346@carbon.lan> <63c5d402-ec1e-2935-7f16-8e2aed047c7c@redhat.com>

On Mon, Feb 10, 2020 at 6:18 PM Gavin Shan wrote:
>
> Hi Roman,
>
> On 2/11/20 12:31 PM, Roman Gushchin wrote:
> > On Tue, Feb 11, 2020 at 10:55:53AM +1100, Gavin Shan wrote:
> >> On 2/11/20 3:17 AM, Roman Gushchin wrote:
> >>> On Mon, Feb 10, 2020 at 11:14:45PM +1100, Gavin Shan wrote:
> >>>> commit 68600f623d69 ("mm: don't miss the last page because of round-off
> >>>> error") makes the scan size round up to @denominator regardless of the
> >>>> memory cgroup's state, online or offline. This affects the overall
> >>>> reclaiming behavior: the corresponding LRU list is eligible for reclaiming
> >>>> only when its size logically right shifted by @sc->priority is bigger than
> >>>> zero in the former formula (non-roundup one).
> >>>
> >>> Not sure I fully understand, but wasn't it so before 68600f623d69 too?
> >>>
> >>
> >> It's correct that "(non-roundup one)" is a typo and should have been dropped.
> >> Will be corrected in v2 if needed.
> >
> > Thanks!
> >
> >>
> >>>> For example, the inactive
> >>>> anonymous LRU list should have at least 0x4000 pages to be eligible for
> >>>> reclaiming when we have 60/12 for swappiness/priority and without taking
> >>>> the scan/rotation ratio into account. After the roundup is applied, the
> >>>> inactive anonymous LRU list becomes eligible for reclaiming when its
> >>>> size is bigger than or equal to 0x1000 in the same condition.
> >>>>
> >>>>   (0x4000 >> 12) * 60 / (60 + 140 + 1) = 1
> >>>>   ((0x1000 >> 12) * 60 + 200) / (60 + 140 + 1) = 1
> >>>>
> >>>> aarch64 has a 512MB huge page size when the base page size is 64KB. The
> >>>> memory cgroup that has a huge page is always eligible for reclaiming in
> >>>> that case. The reclaiming is likely to stop after the huge page is
> >>>> reclaimed, meaning the subsequent @sc->priority values and memory cgroups
> >>>> will be skipped. It changes the overall reclaiming behavior. This fixes
> >>>> the issue by applying the roundup to offlined memory cgroups only, to give
> >>>> more preference to reclaiming memory from offlined memory cgroups. It
> >>>> sounds reasonable as that memory is likely to be useless.
> >>>
> >>> So is the problem that relatively small memory cgroups are getting reclaimed
> >>> on default prio, however before they were skipped?
> >>>
> >>
> >> Yes, you're correct.
> >> There are two dimensions for global reclaim: priority (sc->priority) and
> >> memory cgroup. The scan/reclaim is carried out by iterating over these two
> >> dimensions until the reclaimed pages are enough. If the roundup is applied
> >> to the current memory cgroup and occasionally helps to reclaim enough
> >> memory, the subsequent priorities and memory cgroups will be skipped.
> >>
> >>>>
> >>>> The issue was found by starting up 8 VMs on an Ampere Mustang machine,
> >>>> which has 8 CPUs and 16 GB memory. Each VM is given 2 vCPUs and 2GB of
> >>>> memory. 784MB of swap space is consumed after these 8 VMs are completely
> >>>> up. Note that KSM is disabled while THP is enabled in the testing. With
> >>>> this applied, the consumed swap space decreased to 60MB.
> >>>>
> >>>>            total     used     free   shared  buff/cache  available
> >>>>   Mem:     16196    10065     2049       16        4081       3749
> >>>>   Swap:     8175      784     7391
> >>>>
> >>>>            total     used     free   shared  buff/cache  available
> >>>>   Mem:     16196    11324     3656       24        1215       2936
> >>>>   Swap:     8175       60     8115
> >>>
> >>> Does it lead to any performance regressions? Or is it only about increased
> >>> swap usage?
> >>>
> >>
> >> Apart from the swap usage, it also caused a performance downgrade in my case.
> >> With your patch (68600f623d69) included, it took 264 seconds to bring up 8
> >> VMs. However, 236 seconds were used to do the same thing with my patch
> >> applied on top of yours. That is a 10% performance downgrade. It's the
> >> reason why I had a stable tag.
> >
> > I see...
> >
>
> I will put these data into the commit log of v2, which will be posted shortly.
>
> >>
> >>>>
> >>>> Fixes: 68600f623d69 ("mm: don't miss the last page because of round-off error")
> >>>> Cc: # v4.20+
> >>>> Signed-off-by: Gavin Shan
> >>>> ---
> >>>>  mm/vmscan.c | 9 ++++++---
> >>>>  1 file changed, 6 insertions(+), 3 deletions(-)
> >>>>
> >>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
> >>>> index c05eb9efec07..876370565455 100644
> >>>> --- a/mm/vmscan.c
> >>>> +++ b/mm/vmscan.c
> >>>> @@ -2415,10 +2415,13 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
> >>>>             /*
> >>>>              * Scan types proportional to swappiness and
> >>>>              * their relative recent reclaim efficiency.
> >>>> -            * Make sure we don't miss the last page
> >>>> -            * because of a round-off error.
> >>>> +            * Make sure we don't miss the last page on
> >>>> +            * the offlined memory cgroups because of a
> >>>> +            * round-off error.
> >>>>              */
> >>>> -           scan = DIV64_U64_ROUND_UP(scan * fraction[file],
> >>>> +           scan = mem_cgroup_online(memcg) ?
> >>>> +                  div64_u64(scan * fraction[file], denominator) :
> >>>> +                  DIV64_U64_ROUND_UP(scan * fraction[file],
> >>>>                                       denominator);
> >>>
> >>> It looks a bit strange to round up for offline and basically down for
> >>> everything else. So maybe it's better to return to something like
> >>> the very first version of the patch:
> >>> https://www.spinics.net/lists/kernel/msg2883146.html ?
> >>> For memcg reclaim reasons we do care only about an edge case with few pages.
> >>>
> >>> But overall it's not obvious to me why rounding up is worse than rounding
> >>> down. Maybe we should average down but accumulate the remainder?
> >>> Creating an implicit bias for small memory cgroups sounds groundless.
> >>>
> >>
> >> I don't think the v1 patch works for me either.
> >> The logic in v1 isn't too much different from commit 68600f623d69. v1 has a
> >> selective roundup, while the current code has a forced roundup. With
> >> 68600f623d69 reverted and your v1 patch applied, it took 273 seconds to
> >> bring up 8 VMs and 1752MB of swap was used. It looks even worse than
> >> 68600f623d69.
> >>
> >> Yeah, it's not reasonable to have a bias on all memory cgroups regardless of
> >> their states. I do think it's still right to give a bias to offlined memory
> >> cgroups.
> >
> > I don't think so, it really depends on the workload. Imagine systemd restarting
> > a service due to some update or with other arguments. Almost the entire
> > pagecache is relevant and can be reused by the new cgroup.
> >
>
> Indeed, it depends on the workload. This patch is to revert 68600f623d69 for
> online memory cgroups, but keep the logic for offlined memory cgroups to avoid
> breaking your case.
>
> There is something which might be unrelated to discuss here: the pagecache
> could be backed by a low-speed (HDD) or high-speed (SSD) medium, so the cost
> to fetch pages from disk into memory isn't equal, meaning we need some kind of
> bias during reclaiming. It seems to be something missing from the current
> implementation.

Yes, the refault cost was not taken into account. I recall Johannes posted a
patch series a couple of years ago to have swapping weighted by the refault
cost, please see: https://lwn.net/Articles/690079/.

> >
> >> So the point is we need to take care of the memory cgroup's state
> >> and apply the bias to offlined ones only. The offlined memory cgroup is
> >> going to die or is already dead. It's unlikely for its memory to be used
> >> by someone, but still possible. So it's reasonable to squeeze the used
> >> memory of an offlined memory cgroup harder if possible.
> >
> > Anyway, I think your version is good to mitigate the regression.
> > So, please feel free to add
> > Acked-by: Roman Gushchin
> >
>
> Thanks, Roman! It will be included in v2.
>
> > But I think we need something more clever long-term: e.g. accumulate
> > the leftover from the division and add it to the next calculation.
> >
> > If you can test such an approach on your workload, that would be nice.
> >
>
> Yeah, we need something smarter in the long run. Let's see if I can sort/test
> it out and then come back to you.
>
> Thanks,
> Gavin
>
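For illustration, below is a minimal userspace sketch of the "accumulate the
leftover" idea discussed above: round the proportional share down, but carry
the division remainder into the next calculation, so a small LRU list neither
always contributes at least one page (round-up) nor gets starved forever
(round-down). This is not the actual get_scan_count() change; the structure,
the "leftover" field and the numbers are assumptions made up for the example.

  /*
   * Illustrative sketch only, not kernel code: the struct, the field name
   * "leftover" and the test numbers are invented for demonstration.
   *
   * Build: cc -o scan-demo scan-demo.c && ./scan-demo
   */
  #include <stdint.h>
  #include <stdio.h>

  struct lru_state {
          uint64_t leftover;      /* remainder carried between invocations */
  };

  /*
   * Proportional share of @scan according to fraction/denominator.
   * Instead of rounding up or down, the division remainder is carried
   * into the next call.
   */
  static uint64_t scan_share(struct lru_state *st, uint64_t scan,
                             uint64_t fraction, uint64_t denominator)
  {
          uint64_t product = scan * fraction + st->leftover;

          st->leftover = product % denominator;
          return product / denominator;
  }

  int main(void)
  {
          struct lru_state st = { 0 };
          uint64_t total = 0;

          /*
           * 0x1000 pages at priority 12 with swappiness 60 against 140+1:
           * the exact per-call share is 60/201 of one page, so plain
           * round-down would return 0 forever and round-up would return 1
           * on every call.  With the carried remainder, roughly every
           * fourth call returns one page.
           */
          for (int i = 0; i < 16; i++) {
                  uint64_t scan = 0x1000 >> 12;   /* 1 page */
                  uint64_t pages = scan_share(&st, scan, 60, 60 + 140 + 1);

                  total += pages;
                  printf("call %2d: %llu page(s), leftover %llu\n", i,
                         (unsigned long long)pages,
                         (unsigned long long)st.leftover);
          }
          printf("total over 16 calls: %llu\n", (unsigned long long)total);
          return 0;
  }

With these numbers the sketch scans 4 pages over 16 calls (the exact 60/201
proportion, rounded down overall), instead of 16 pages with the forced
round-up or 0 pages with a plain round-down.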