From: Yu Zhao <yuzhao@google.com>
Date: Thu, 6 Oct 2022 17:55:04 -0600
Subject: Re: [PATCH v2] mm/vmscan: check references from all memcgs for swapbacked memory
To: Yosry Ahmed
Cc: Johannes Weiner, Andrew Morton, Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song, Greg Thelen, David Rientjes, Cgroups, Linux-MM
References: <20221005173713.1308832-1-yosryahmed@google.com>
On Thu, Oct 6, 2022 at 5:07 PM Yosry Ahmed wrote:
>
> On Thu, Oct 6, 2022 at 2:57 PM Yu Zhao wrote:
> >
> > On Thu, Oct 6, 2022 at 12:30 PM Yosry Ahmed wrote:
> > >
> > > On Thu, Oct 6, 2022 at 8:32 AM Johannes Weiner wrote:
> > > >
> > > > On Thu, Oct 06, 2022 at 12:30:45AM -0700, Yosry Ahmed wrote:
> > > > > On Wed, Oct 5, 2022 at 9:19 PM Johannes Weiner wrote:
> > > > > >
> > > > > > On Wed, Oct 05, 2022 at 03:13:38PM -0600, Yu Zhao wrote:
> > > > > > > On Wed, Oct 5, 2022 at 3:02 PM Yosry Ahmed wrote:
> > > > > > > >
> > > > > > > > On Wed, Oct 5, 2022 at 1:48 PM Yu Zhao wrote:
> > > > > > > > >
> > > > > > > > > On Wed, Oct 5, 2022 at 11:37 AM Yosry Ahmed wrote:
> > > > > > > > > >
> > > > > > > > > > During page/folio reclaim, we check if a folio is referenced using folio_referenced() to avoid reclaiming folios that have been recently accessed (hot memory). The rationale is that this memory is likely to be accessed soon, and hence reclaiming it will cause a refault.
> > > > > > > > > >
> > > > > > > > > > For memcg reclaim, we currently only check accesses to the folio from processes in the subtree of the target memcg. This behavior was originally introduced by commit bed7161a519a ("Memory controller: make page_referenced() cgroup aware") a long time ago. Back then, refaulted pages would get charged to the memcg of the process that was faulting them in. It made sense to only consider accesses coming from processes in the subtree of target_mem_cgroup. If a page was charged to memcg A but only being accessed by a sibling memcg B, we would reclaim it if memcg A is the reclaim target.
> > > > > > > > > > memcg B can then fault it back in and get charged for it appropriately.
> > > > > > > > > >
> > > > > > > > > > Today, this behavior still makes sense for file pages. However, unlike file pages, when swapbacked pages are refaulted they are charged to the memcg that was originally charged for them during swapping out. Which means that if a swapbacked page is charged to memcg A but only used by memcg B, and we reclaim it from memcg A, it would simply be faulted back in and charged again to memcg A once memcg B accesses it. In that sense, accesses from all memcgs matter equally when considering if a swapbacked page/folio is a viable reclaim target.
> > > > > > > > > >
> > > > > > > > > > Modify folio_referenced() to always consider accesses from all memcgs if the folio is swapbacked.
> > > > > > > > >
> > > > > > > > > It seems to me this change can potentially increase the number of zombie memcgs. Any risk assessment done on this?
> > > > > > > >
> > > > > > > > Do you mind elaborating on the case(s) where this could happen? Is this the cgroup v1 case in mem_cgroup_swapout() where we are reclaiming from a zombie memcg and swapping out would let us move the charge to the parent?
> > > > > > >
> > > > > > > The scenario is quite straightforward: for a page charged to memcg A and also actively used by memcg B, if we don't ignore the access from memcg B, we won't be able to reclaim it after memcg A is deleted.
> > > > > >
> > > > > > This patch changes the behavior of limit-induced reclaim. There is no limit reclaim on A after it's been deleted. And parental/global reclaim has always recognized outside references.
> > > > >
> > > > > Do you mind elaborating on the parental reclaim part?
> > > > >
> > > > > I am looking at the code and it looks like memcg reclaim of a parent (limit-induced or proactive) will only consider references coming from its subtree, even when reclaiming from its dead children. It looks like as long as sc->target_mem_cgroup is set, we ignore outside references (relative to sc->target_mem_cgroup).
> > > >
> > > > Yes, I was referring to outside of A.
> > > >
> > > > As of today, any siblings of A can already pin its memory after it's dead. I suppose your patch would add cousins to that list. It doesn't seem like a categorical difference to me.
> > > >
> > > > > If that is true, maybe we want to keep ignoring outside references for swap-backed pages if the folio is charged to a dead memcg? My understanding is that in this case we will uncharge the page from the dead memcg and charge the swapped entry to the parent, reducing the number of refs on the dead memcg. Without this check, this patch might prevent the charge from being moved to the parent in this case. WDYT?
> > > >
> > > > I don't think it's worth it. Keeping the semantics simple and behavior predictable is IMO more valuable.
> > > >
> > > > It also wouldn't fix the scrape-before-rmdir issue Yu points out, which I think is the more practical concern. In light of that, it might be best to table the patch for now. (Until we have reparent-on-delete for anon and file pages...)
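(Not the actual patch, just a minimal sketch of the idea described in the quoted commit message, assuming a hypothetical helper wrapped around the folio_referenced() call in folio_check_references() in mm/vmscan.c:)

/*
 * Hypothetical helper, for illustration only: pick the memcg filter
 * that folio_check_references() passes to folio_referenced().
 */
static struct mem_cgroup *reference_check_memcg(struct folio *folio,
						struct scan_control *sc)
{
	/*
	 * A swapbacked folio is charged back to its original memcg when
	 * it refaults, so reclaiming it while any memcg still references
	 * it only trades a swap-out for a refault. A NULL memcg makes
	 * folio_referenced() count references from all memcgs.
	 */
	if (folio_test_swapbacked(folio))
		return NULL;

	/* File folios keep the current behavior: subtree references only. */
	return sc->target_mem_cgroup;
}

folio_check_references() would then call folio_referenced(folio, 1, reference_check_memcg(folio, sc), &vm_flags) instead of passing sc->target_mem_cgroup directly.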
> > >
> > > If we add a mem_cgroup_online() check, we partially solve the problem. Maybe scrape-before-rmdir will not reclaim those pages at once, but the next time we try to reclaim from the dead memcg (global, limit, proactive, ...) we will reclaim the pages. So we will only be delaying the freeing of those zombie memcgs.
> >
> > As an observer, this seems to be the death by a thousand cuts of the existing mechanism that Google has been using to virtually eliminate zombie memcgs for the last decade.
> >
> > I understand the desire to fix a specific problem with this patch. But it's methodically wrong to focus on specific problems without considering the big picture and how it's evolving.
> >
> > Our internal memory.reclaim, which is being superseded, is a superset of the mainline version. It has two flags relevant to this discussion:
> > 1. hierarchical walk of a parent
> > 2. target dead memcgs only
> > With these, our job scheduler (Borg) doesn't need to scrape before rmdir at all. It does something called "applying root pressure", which, as one might imagine, is to write to the root memory.reclaim with the above flags. We have metrics on the efficiency of this mechanism and they are closely monitored.
> >
> > Why is this important? Because Murphy's law is generally true for a fleet when its scale and diversity are high enough. *We used to run out of memcg IDs.* And we are still carrying a patch that enlarges swap_cgroup->id from unsigned short to unsigned int.
> >
> > Compared with the recharging proposal we have been discussing, the two cases that the above solution can't help with are:
> > 1. kernel long pins
> > 2. foreign mlocks
> > But it's still *a lot* more reliable than the scrape-before-rmdir approach (or scrape-after-rmdir if we can hold the FD open before rmdir), because it offers unlimited retries and no dead memcgs, e.g., those created and deleted by jobs (not the job scheduler), can escape.
> >
> > Unless you can provide data, my past experience tells me that this patch will make scrape-before-rmdir unacceptable (in terms of effectiveness) to our fleet. Of course you can add additional code, i.e., those two flags or the offline check, which I don't object to.
>
> I agree that the zombie memcgs problem is serious and needs to be dealt with, and recharging memory when a memcg is dying seems like a promising direction. However, this patch's goal is to improve reclaim of shared swapbacked memory in general, regardless of the zombie memcgs problem. I understand that the current version affects the zombie memcgs problem, but I believe this is an oversight that needs to be fixed, not something that should make us leave the reclaim problem unsolved.
>
> I think this patch + an offline check should be sufficient to fix the reclaim problem while not regressing the zombie memcgs problem for multiple reasons, see below.
>
> If we implement recharging memory of dying memcgs in the future, we can always come back and remove the offline check.
> >
> > Frankly, let me ask the real question: are you really sure this is the best for us and the rest of the community?
>
> Yes. This patch should improve reclaim of shared memory as I elaborated in my previous email, and with an offline check I believe we shouldn't be regressing the zombie memcgs problem whether for Google or the community, for the following reasons.
But it will regress a (non-root) parent who wishes to keep the hot memory it shared with its deleted children to itself. I think we should sleep on this.
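For reference, the "offline check" variant discussed above could look roughly like the following (again a sketch with a hypothetical helper, not an actual patch):

/*
 * Hypothetical variant folding in the mem_cgroup_online() check: only
 * widen the reference check for swapbacked folios whose memcg is still
 * online, so folios charged to a dead (zombie) memcg are not pinned by
 * references from other memcgs and can still be reclaimed, e.g. by a
 * scrape-before-rmdir pass.
 */
static struct mem_cgroup *reference_check_memcg(struct folio *folio,
						struct scan_control *sc)
{
	struct mem_cgroup *memcg = folio_memcg(folio);

	if (folio_test_swapbacked(folio) && memcg && mem_cgroup_online(memcg))
		return NULL;

	return sc->target_mem_cgroup;
}

Whether the check should key off the folio's memcg like this, or live elsewhere in the reclaim path, is exactly the open question in this thread.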