From: Yosry Ahmed <yosryahmed@google.com>
Date: Fri, 1 Nov 2024 09:19:29 -0700
Subject: Re: [PATCH RFC] mm: mitigate large folios usage and swap thrashing for nearly full memcg
To: Barry Song <21cnbao@gmail.com>
Cc: Johannes Weiner, Usama Arif, akpm@linux-foundation.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Barry Song, Kanchana P Sridhar, David Hildenbrand, Baolin Wang, Chris Li, "Huang, Ying", Kairui Song, Ryan Roberts, Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song

On Thu, Oct 31, 2024 at 2:00 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Fri, Nov 1, 2024 at 5:00 AM Yosry Ahmed wrote:
> >
> > On Thu, Oct 31, 2024 at 8:38 AM Johannes Weiner wrote:
> > >
> > > On Wed, Oct 30, 2024 at 02:18:09PM -0700, Yosry Ahmed wrote:
> > > > On Wed, Oct 30, 2024 at 2:13 PM Usama Arif wrote:
> > > > > On 30/10/2024 21:01, Yosry Ahmed wrote:
> > > > > > On Wed, Oct 30, 2024 at 1:25 PM Usama Arif wrote:
> > > > > >>>> I am not sure that the approach we are trying in this patch is the right way:
> > > > > >>>> - This patch makes it a memcg issue, but you could have memcg disabled and
> > > > > >>>> then the mitigation being tried here won't apply.
> > > > > >>>
> > > > > >>> Is the problem reproducible without memcg? I imagine only if the
> > > > > >>> entire system is under memory pressure. I guess we would want the same
> > > > > >>> "mitigation" either way.
> > > > > >>>
> > > > > >> What would be a good open source benchmark/workload to test without limiting memory
> > > > > >> in memcg?
> > > > > >> For the kernel build test, I can only get zswap activity to happen if I build
> > > > > >> in cgroup and limit memory.max.
> > > > > >
> > > > > > You mean a benchmark that puts the entire system under memory
> > > > > > pressure? I am not sure, it ultimately depends on the size of memory
> > > > > > you have, among other factors.
> > > > > >
> > > > > > What if you run the kernel build test in a VM? Then you can limit its
> > > > > > size like a memcg, although you'd probably need to leave more room
> > > > > > because the entire guest OS will also be subject to the same limit.
> > > > > >
> > > > >
> > > > > I had tried this, but the variance in time/zswap numbers was very high,
> > > > > much higher than the AMD numbers I posted in reply to Barry, so I found
> > > > > it very difficult to make comparisons.
> > > >
> > > > Hmm, yeah, maybe more factors come into play with global memory
> > > > pressure. I am honestly not sure how to test this scenario, and I
> > > > suspect variance will be high anyway.
> > > >
> > > > We can just try to use whatever technique we use for the memcg limit
> > > > though, if possible, right?
> > >
> > > You can boot a physical machine with mem=1G on the command line, which
> > > restricts the physical range of memory that will be initialized.
> > > Double-check /proc/meminfo after boot, because part of that physical
> > > range might not be usable RAM.
> > >
> > > I do this quite often to test physical memory pressure with workloads
> > > that don't scale up easily, like kernel builds.
> > >
> > > > > >>>> - Instead of this being a large folio swapin issue, is it more of a readahead
> > > > > >>>> issue?
> > > > > >>>> If we zswap (without the large folio swapin series) and change the window
> > > > > >>>> to 1 in swap_vma_readahead, we might see an improvement in Linux kernel build time
> > > > > >>>> when cgroup memory is limited, as readahead would probably cause swap thrashing as
> > > > > >>>> well.
> > >
> > > +1
> > >
> > > I also think there is too much focus on cgroup alone. The bigger issue
> > > seems to be how much optimistic volume we swap in when we're under
> > > pressure already. This applies to both large folios and readahead, and
> > > to both global memory availability and cgroup limits.
> >
> > Agreed, although the characteristics of large folios and readahead are
> > different. But yeah, different flavors of the same problem.
> >
> > > It happens to manifest with THP in cgroups because that's what you
> > > guys are testing. But IMO, any solution to this problem should
> > > consider the wider scope.
> >
> > +1, and I really think this should be addressed separately, not just by
> > relying on large block compression/decompression to offset the cost. It's
> > probably not just a zswap/zram problem anyway, it just happens to be
> > what we support large folio swapin for.
>
> Agreed, these are two separate issues and both should be investigated,
> though 2 can offset the cost of 1:
> 1. swap thrashing
> 2. large block compression/decompression
>
> For point 1, we likely want to investigate the following:
>
> 1. Whether we can see the same thrashing if we always perform readahead
> (rapidly filling the memcg to full again after reclamation).
>
> 2. Whether there are any issues with balancing file and anon memory
> reclamation.
>
> The 'refault feedback loop' in MGLRU compares refault rates between anon and
> file pages to decide which type should be prioritized for reclamation:
>
> type = get_type_to_scan(lruvec, swappiness, &tier);
>
> static int get_type_to_scan(struct lruvec *lruvec, int swappiness, int *tier_idx)
> {
>         ...
>         read_ctrl_pos(lruvec, LRU_GEN_ANON, 0, gain[LRU_GEN_ANON], &sp);
>         read_ctrl_pos(lruvec, LRU_GEN_FILE, 0, gain[LRU_GEN_FILE], &pv);
>         type = positive_ctrl_err(&sp, &pv);
>
>         read_ctrl_pos(lruvec, !type, 0, gain[!type], &sp);
>         for (tier = 1; tier < MAX_NR_TIERS; tier++) {
>                 read_ctrl_pos(lruvec, type, tier, gain[type], &pv);
>                 if (!positive_ctrl_err(&sp, &pv))
>                         break;
>         }
>
>         *tier_idx = tier - 1;
>         return type;
> }
>
> In this case, we may want to investigate whether reclamation is primarily
> targeting anonymous memory due to potential errors in the statistics path
> after mTHP is involved.
>
> 3. Determine whether this is a memcg-specific issue by setting mem=1GB and
> running the same test on the global system.
>
> Yosry, Johannes, Usama,
> Is there anything else that might interest us?
>
> I'll get back to you after completing the investigation mentioned above.

Thanks for looking into this.

Perhaps a naive question, but is this only related to swap faults? Can
the same scenario happen with other types of faults allocating large
folios (e.g. faulting in a file page, or a new anon allocation)? Do
swap faults use a different policy for determining the folio order, or
is it just that swap faults are naturally more correlated to memory
pressure, so that's how the issue was surfaced?
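
(Aside: a minimal sketch of the readahead experiment Usama suggests above,
forcing the VMA readahead window to 1. This assumes a kernel where
swap_vma_readahead() still sizes its window through swap_ra_info() and
ra_info.win; that code has moved around between versions, so treat this as
illustrative rather than an actual patch:

--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ ... @@ static struct folio *swap_vma_readahead(...)
 	swap_ra_info(vmf, &ra_info);
+	/*
+	 * Experiment only: pretend the heuristic always chose a window
+	 * of 1, so we fault in just the target page and skip all
+	 * optimistic readahead.
+	 */
+	ra_info.win = 1;
 	if (ra_info.win == 1)
 		goto skip;

Roughly the same effect can be had without patching the kernel by setting
/proc/sys/vm/page-cluster to 0, which caps the swap readahead window at a
single page.)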