From: Yosry Ahmed <yosryahmed@google.com>
Date: Wed, 30 Oct 2024 14:01:11 -0700
Subject: Re: [PATCH RFC] mm: mitigate large folios usage and swap thrashing for nearly full memcg
To: Usama Arif
Cc: Barry Song <21cnbao@gmail.com>, akpm@linux-foundation.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Barry Song, Kanchana P Sridhar, David Hildenbrand, Baolin Wang, Chris Li, "Huang, Ying", Kairui Song, Ryan Roberts, Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song
In-Reply-To: <852211c6-0b55-4bdd-8799-90e1f0c002c1@gmail.com>
References: <20241027001444.3233-1-21cnbao@gmail.com> <33c5d5ca-7bc4-49dc-b1c7-39f814962ae0@gmail.com> <852211c6-0b55-4bdd-8799-90e1f0c002c1@gmail.com>
On Wed, Oct 30, 2024 at 1:25 PM Usama Arif wrote:
>
>
>
> On 30/10/2024 19:51, Yosry Ahmed wrote:
> > [..]
> >>> My second point about the mitigation is as follows: for a system (or
> >>> memcg) under severe memory pressure, especially one without hardware TLB
> >>> optimization, is enabling mTHP always the right choice? Since mTHP operates at
> >>> a larger granularity, some internal fragmentation is unavoidable, regardless
> >>> of optimization. Could the mitigation code help in automatically tuning
> >>> this fragmentation?
> >>>
> >>
> >> I agree with the point that always enabling mTHP is not the right thing to do
> >> on all platforms. I also think it might be the case that enabling mTHP
> >> might be a good thing for some workloads, but enabling mTHP swapin along with
> >> it might not.
> >>
> >> As you said, when you have apps switching between foreground and background
> >> in Android, it probably makes sense to have large folio swapping, as you
> >> want to bring in all the pages from the background app as quickly as possible,
> >> and you also get all the TLB optimizations and smaller LRU overhead after
> >> you have brought in all the pages.
> >> The Linux kernel build test doesn't really benefit from the TLB optimization
> >> and smaller LRU overhead, as the pages are probably very short-lived. So I
> >> think it doesn't show the benefit of large folio swapin properly, and
> >> large folio swapin should probably be disabled for this kind of workload,
> >> even though mTHP should be enabled.
> >>
> >> I am not sure that the approach we are trying in this patch is the right way:
> >> - This patch makes it a memcg issue, but you could have memcg disabled and
> >> then the mitigation being tried here won't apply.
> >
> > Is the problem reproducible without memcg? I imagine only if the
> > entire system is under memory pressure. I guess we would want the same
> > "mitigation" either way.
>
> What would be a good open source benchmark/workload to test without limiting memory
> in memcg?
> For the kernel build test, I can only get zswap activity to happen if I build
> in a cgroup and limit memory.max.

You mean a benchmark that puts the entire system under memory
pressure? I am not sure; it ultimately depends on the size of memory
you have, among other factors.

What if you run the kernel build test in a VM? Then you can limit its
size like a memcg, although you'd probably need to leave more room
because the entire guest OS will also be subject to the same limit.
>
> I can just run zswap large folio zswapin in production and see, but that will take me a few
> days. Tbh, running in prod is a much better test, and if there isn't any sort of thrashing,
> then maybe it's not really an issue? I believe Barry doesn't see an issue on Android
> phones (but please correct me if I am wrong), and if there isn't an issue in Meta
> production as well, that's a good data point for servers too. And maybe
> kernel build in a 4G memcg is not a good test.

If there is a regression in the kernel build, this means some
workloads may be affected, even if Meta's prod isn't. I understand
that the benchmark is not very representative of real world workloads,
but in this instance I think the thrashing problem surfaced by the
benchmark is real.

>
> >> - Instead of this being a large folio swapin issue, is it more of a readahead
> >> issue? If we zswap (without the large folio swapin series) and change the window
> >> to 1 in swap_vma_readahead, we might see an improvement in Linux kernel build time
> >> when cgroup memory is limited, as readahead would probably cause swap thrashing as
> >> well.
> >
> > I think large folio swapin would make the problem worse anyway. I am
> > also not sure if the readahead window adjusts on memory pressure or
> > not.
>
> The readahead window doesn't look at memory pressure. So maybe the same thing is being
> seen here as there would be in swapin_readahead?

Maybe readahead is not as aggressive in general as large folio
swapins? Looking at swap_vma_ra_win(), it seems like the maximum order
of the window is the smaller of page_cluster (2 or 3) and
SWAP_RA_ORDER_CEILING (5).

Also, readahead will swap in 4K folios AFAICT, so we don't need a
contiguous allocation like large folio swapin. So that could be
another factor why readahead may not reproduce the problem.

> Maybe if we check kernel build test
> performance in a 4G memcg with the below diff, it might get better?
I think you can use the page_cluster tunable to do this at runtime.

>
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 4669f29cf555..9e196e1e6885 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -809,7 +809,7 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
>         pgoff_t ilx;
>         bool page_allocated;
>
> -       win = swap_vma_ra_win(vmf, &start, &end);
> +       win = 1;
>         if (win == 1)
>                 goto skip;
>