From: Yosry Ahmed <yosryahmed@google.com>
Date: Thu, 6 Apr 2023 10:52:05 -0700
Subject: Re: [PATCH v5 1/2] mm: vmscan: ignore non-LRU-based reclaim in memcg reclaim
To: David Hildenbrand
Cc: Andrew Morton, Alexander Viro, "Darrick J. Wong", Christoph Lameter, David Rientjes, Joonsoo Kim, Vlastimil Babka, Roman Gushchin, Hyeonggon Yoo <42.hyeyoo@gmail.com>, "Matthew Wilcox (Oracle)", Miaohe Lin, Johannes Weiner, Peter Xu, NeilBrown, Shakeel Butt, Michal Hocko, Yu Zhao, Dave Chinner, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-xfs@vger.kernel.org, linux-mm@kvack.org, stable@vger.kernel.org
In-Reply-To: <14d50ddd-507e-46e7-1a32-72466dec2a40@redhat.com>
References: <20230405185427.1246289-1-yosryahmed@google.com> <20230405185427.1246289-2-yosryahmed@google.com> <14d50ddd-507e-46e7-1a32-72466dec2a40@redhat.com>
On Thu, Apr 6, 2023 at 10:50 AM David Hildenbrand wrote:
>
> On 06.04.23 16:07, Yosry Ahmed wrote:
> > Thanks for taking a look, David!
> >
> > On Thu, Apr 6, 2023 at 3:31 AM David Hildenbrand wrote:
> >>
> >> On 05.04.23 20:54, Yosry Ahmed wrote:
> >>> We keep track of different types of reclaimed pages through
> >>> reclaim_state->reclaimed_slab, and we add them to the reported number
> >>> of reclaimed pages. For non-memcg reclaim, this makes sense. For memcg
> >>> reclaim, we have no clue if those pages are charged to the memcg under
> >>> reclaim.
> >>>
> >>> Slab pages are shared by different memcgs, so a freed slab page may have
> >>> only been partially charged to the memcg under reclaim. The same goes for
> >>> clean file pages from pruned inodes (on highmem systems) or xfs buffer
> >>> pages; there is currently no simple way to link them to the memcg under
> >>> reclaim.
> >>>
> >>> Stop reporting those freed pages as reclaimed pages during memcg reclaim.
> >>> This should make the return value of writing to memory.reclaim more
> >>> accurate, and may help reduce unnecessary reclaim retries during memcg
> >>> charging. Writing to memory.reclaim on the root memcg is considered as
> >>> cgroup_reclaim(), but for this case we want to include any freed pages,
> >>> so use the global_reclaim() check instead of !cgroup_reclaim().
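(For context, a userspace sketch of the distinction the commit message relies on. The real helpers live in `mm/vmscan.c` as of 6.3 and take the kernel's `struct scan_control`; the struct layouts below are simplified stand-ins, not the kernel definitions, following the `!cgroup_reclaim(sc) || mem_cgroup_is_root(...)` equivalence quoted later in this mail.)

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel structs, for illustration only. */
struct mem_cgroup { bool is_root; };
struct scan_control { struct mem_cgroup *target_mem_cgroup; };

/* cgroup_reclaim(): any memcg-targeted reclaim, root memcg included. */
static bool cgroup_reclaim(const struct scan_control *sc)
{
	return sc->target_mem_cgroup != NULL;
}

/*
 * global_reclaim(): no target memcg at all, or the target is the root
 * memcg (e.g. writing to the root's memory.reclaim). In both cases every
 * freed page belongs to the reclaim target, so it is safe to count it.
 */
static bool global_reclaim(const struct scan_control *sc)
{
	return !sc->target_mem_cgroup || sc->target_mem_cgroup->is_root;
}
```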
> >>>
> >>> Generally, this should make the return value of
> >>> try_to_free_mem_cgroup_pages() more accurate. In some limited cases (e.g.
> >>> freed a slab page that was mostly charged to the memcg under reclaim),
> >>> the return value of try_to_free_mem_cgroup_pages() can be underestimated,
> >>> but this should be fine. The freed pages will be uncharged anyway, and we
> >>
> >> Can't we end up in extreme situations where
> >> try_to_free_mem_cgroup_pages() returns close to 0 although a huge amount
> >> of memory for that cgroup was freed up?
> >>
> >> Can you expand on why "this should be fine"?
> >>
> >> I suspect that overestimation might be worse than underestimation (see
> >> my comment proposal below).
> >
> > In such extreme scenarios, even though try_to_free_mem_cgroup_pages()
> > would return an underestimated value, the freed memory for the cgroup
> > will be uncharged. try_charge() (and most callers of
> > try_to_free_mem_cgroup_pages()) do so in a retry loop, so even if
> > try_to_free_mem_cgroup_pages() returns an underestimated value,
> > charging will succeed the next time around.
> >
> > The only case where this might be a problem is if it happens in the
> > final retry, but I guess we need to be *really* unlucky for this
> > extreme scenario to happen. One could argue that if we reach such a
> > situation, the cgroup will probably OOM soon anyway.
> >
> >>
> >>> can charge the memcg the next time around as we usually do memcg reclaim
> >>> in a retry loop.
> >>>
> >>> The next patch performs some cleanups around reclaim_state and adds an
> >>> elaborate comment explaining this to the code. This patch is kept
> >>> minimal for easy backporting.
> >>>
> >>> Signed-off-by: Yosry Ahmed
> >>> Cc: stable@vger.kernel.org
> >>
> >> Fixes: ?
> >>
> >> Otherwise it's hard to judge how far to backport this.
> >
> > It's hard to judge. The issue has been there for a while, but
> > memory.reclaim just made it more user-visible.
> > I think we can attribute it to per-object slab accounting, because
> > before that, any freed slab pages in cgroup reclaim would be entirely
> > charged to that cgroup.
> >
> > Although in all fairness, other types of freed pages that use
> > reclaim_state->reclaimed_slab and cannot be attributed to the cgroup
> > under reclaim have been there before that. I guess slab is the most
> > significant among them tho, so for the purposes of backporting I
> > guess:
> >
> > Fixes: f2fe7b09a52b ("mm: memcg/slab: charge individual slab objects
> > instead of pages")
> >
> >>
> >>> ---
> >>>
> >>> global_reclaim(sc) does not exist in kernels before 6.3. It can be
> >>> replaced with:
> >>> !cgroup_reclaim(sc) || mem_cgroup_is_root(sc->target_mem_cgroup)
> >>>
> >>> ---
> >>>  mm/vmscan.c | 8 +++++---
> >>>  1 file changed, 5 insertions(+), 3 deletions(-)
> >>>
> >>> diff --git a/mm/vmscan.c b/mm/vmscan.c
> >>> index 9c1c5e8b24b8f..c82bd89f90364 100644
> >>> --- a/mm/vmscan.c
> >>> +++ b/mm/vmscan.c
> >>> @@ -5346,8 +5346,10 @@ static int shrink_one(struct lruvec *lruvec, struct scan_control *sc)
> >>>  		vmpressure(sc->gfp_mask, memcg, false, sc->nr_scanned - scanned,
> >>>  			   sc->nr_reclaimed - reclaimed);
> >>>
> >>> -	sc->nr_reclaimed += current->reclaim_state->reclaimed_slab;
> >>> -	current->reclaim_state->reclaimed_slab = 0;
> >>
> >> Worth adding a comment like
> >>
> >> /*
> >>  * Slab pages cannot universally be linked to a single memcg. So only
> >>  * account them as reclaimed during global reclaim. Note that we might
> >>  * underestimate the amount of memory reclaimed (but won't overestimate
> >>  * it).
> >>  */
> >>
> >> but ...
> >>
> >>> +	if (global_reclaim(sc)) {
> >>> +		sc->nr_reclaimed += current->reclaim_state->reclaimed_slab;
> >>> +		current->reclaim_state->reclaimed_slab = 0;
> >>> +	}
> >>>
> >>>  	return success ? MEMCG_LRU_YOUNG : 0;
> >>>  }
> >>> @@ -6472,7 +6474,7 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> >>>
> >>>  	shrink_node_memcgs(pgdat, sc);
> >>>
> >>
> >> ... do we want to factor the add+clear into a simple helper such that we
> >> can have the above comment there?
> >>
> >> static void cond_account_reclaimed_slab(reclaim_state, sc)
> >> {
> >>         /*
> >>          * Slab pages cannot universally be linked to a single memcg. So
> >>          * only account them as reclaimed during global reclaim. Note
> >>          * that we might underestimate the amount of memory reclaimed
> >>          * (but won't overestimate it).
> >>          */
> >>         if (global_reclaim(sc)) {
> >>                 sc->nr_reclaimed += reclaim_state->reclaimed_slab;
> >>                 reclaim_state->reclaimed_slab = 0;
> >>         }
> >> }
> >>
> >> Yes, effectively a couple LOC more, but still straightforward for a
> >> stable backport.
> >
> > The next patch in the series performs some refactoring and cleanups,
> > among which we add a helper called flush_reclaim_state() that does
> > exactly that and contains a sizable comment. I left this outside of
> > this patch in v5 to make the effective change as small as possible for
> > backporting. Looks like it can be confusing tho without the comment.
> >
> > How about I pull this part to this patch as well for v6?

As long as it's a helper similar to what I proposed, I think that makes
a lot of sense (and doesn't particularly bloat this patch).

Sounds good to me, I will do that and respin. Thanks David!

> --
> Thanks,
>
> David / dhildenb
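(For reference, a userspace model of the helper being discussed. The name flush_reclaim_state() is taken from Yosry's description of the next patch; the body follows David's proposal in this thread. The structs here are simplified stand-ins for the kernel's, so this is a sketch of the idea, not the merged kernel code.)

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel structs, for illustration only. */
struct reclaim_state { unsigned long reclaimed_slab; };
struct scan_control {
	unsigned long nr_reclaimed;
	bool global;	/* stands in for global_reclaim(sc) */
};

static void flush_reclaim_state(struct scan_control *sc,
				struct reclaim_state *rs)
{
	/*
	 * Slab pages cannot universally be linked to a single memcg, so
	 * only fold them into nr_reclaimed during global reclaim. For
	 * memcg reclaim this may underestimate the amount reclaimed, but
	 * it never overestimates it.
	 */
	if (rs && sc->global) {
		sc->nr_reclaimed += rs->reclaimed_slab;
		rs->reclaimed_slab = 0;
	}
}
```

The underestimation in the memcg case is what the retry-loop argument above covers: the freed pages are uncharged regardless, so a subsequent try_charge() retry succeeds even when this call reports nothing.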