From: Yosry Ahmed <yosryahmed@google.com>
Date: Mon, 30 Oct 2023 11:16:10 -0700
Subject: Re: [PATCH v4 2/5] zswap: make shrinking memcg-aware
To: Nhat Pham
Cc: akpm@linux-foundation.org, hannes@cmpxchg.org, cerasuolodomenico@gmail.com,
	sjenning@redhat.com, ddstreet@ieee.org, vitaly.wool@konsulko.com,
	mhocko@kernel.org, roman.gushchin@linux.dev, shakeelb@google.com,
	muchun.song@linux.dev, chrisl@kernel.org, linux-mm@kvack.org,
	kernel-team@meta.com, linux-kernel@vger.kernel.org, cgroups@vger.kernel.org,
	linux-doc@vger.kernel.org, linux-kselftest@vger.kernel.org, shuah@kernel.org
References: <20231024203302.1920362-1-nphamcs@gmail.com> <20231024203302.1920362-3-nphamcs@gmail.com>
Content-Type: text/plain; charset="UTF-8"

> > [..]
> > > +/*********************************
> > > +* lru functions
> > > +**********************************/
> > > +static bool zswap_lru_add(struct list_lru *list_lru, struct zswap_entry *entry)
> > > +{
> > > +	struct mem_cgroup *memcg = get_mem_cgroup_from_entry(entry);
> > > +	int nid = entry_to_nid(entry);
> > > +	bool added = list_lru_add(list_lru, &entry->lru, nid, memcg);
> > > +
> > > +	mem_cgroup_put(memcg);
> >
> > Still not fond of the get/put pattern but okay..
>
> Actually, Johannes and I took another look to see if we can replace
> the memcg reference getting with just rcu_read_lock().
>
> It seems there might be a race between zswap LRU manipulation
> and memcg offlining - not just with the rcu_read_lock() idea, but also
> with our current implementation!
>
> I'll shoot another email with more details later when I'm sure of it
> one way or another...
>

Interesting, well at least something came out of my complaining :)

> > [..]
> > > @@ -652,28 +679,37 @@ static int zswap_reclaim_entry(struct zswap_pool *pool)
> > >  	 */
> > >  	swpoffset = swp_offset(entry->swpentry);
> > >  	tree = zswap_trees[swp_type(entry->swpentry)];
> > > -	spin_unlock(&pool->lru_lock);
> > > +	list_lru_isolate(l, item);
> > > +	/*
> > > +	 * It's safe to drop the lock here because we return either
> > > +	 * LRU_REMOVED_RETRY or LRU_RETRY.
> > > +	 */
> > > +	spin_unlock(lock);
> > >
> > >  	/* Check for invalidate() race */
> > >  	spin_lock(&tree->lock);
> > > -	if (entry != zswap_rb_search(&tree->rbroot, swpoffset)) {
> > > -		ret = -EAGAIN;
> > > +	if (entry != zswap_rb_search(&tree->rbroot, swpoffset))
> > >  		goto unlock;
> > > -	}
> > > +
> > >  	/* Hold a reference to prevent a free during writeback */
> > >  	zswap_entry_get(entry);
> > >  	spin_unlock(&tree->lock);
> > >
> > > -	ret = zswap_writeback_entry(entry, tree);
> > > +	writeback_result = zswap_writeback_entry(entry, tree);
> > >
> > >  	spin_lock(&tree->lock);
> > > -	if (ret) {
> > > -		/* Writeback failed, put entry back on LRU */
> > > -		spin_lock(&pool->lru_lock);
> > > -		list_move(&entry->lru, &pool->lru);
> > > -		spin_unlock(&pool->lru_lock);
> > > +	if (writeback_result) {
> > > +		zswap_reject_reclaim_fail++;
> > > +		memcg = get_mem_cgroup_from_entry(entry);
> >
> > Can this return NULL? Seems like we don't check the return in most/all places.
>
> I believe so, but memcg experts should fact check me on this.

If that's the case, there should be NULL checks, no?

> It's roughly the same pattern as zswap charging/uncharging:
>
> obj_cgroup_uncharge_zswap(entry->objcg, entry->length)
> -> getting memcg (under rcu_read_lock())
>
> >
> > > +		spin_lock(lock);
> > > +		/* we cannot use zswap_lru_add here, because it increments node's lru count */
> > > +		list_lru_putback(&entry->pool->list_lru, item, entry_to_nid(entry), memcg);
> >
> > Perhaps we can move this call with the memcg get/put to a helper like
> > add/del? (e.g. zswap_lru_putback)
> >
> > We would need to move get_mem_cgroup_from_entry() into the lock but I
> > think that's okay.
>
> We probably could, but that sounds like extra code for not a lot of gains, no?

I don't feel strongly, just a fan of consistency.

>
> >
> > > +		spin_unlock(lock);
> > > +		mem_cgroup_put(memcg);
> > > +		ret = LRU_RETRY;
> > >  		goto put_unlock;
> > >  	}
> > > +	zswap_written_back_pages++;
> > >
> > >  	/*
> > >  	 * Writeback started successfully, the page now belongs to the
[..]
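
To illustrate the zswap_lru_putback() helper I mentioned above, this is
roughly what I had in mind (completely untested, just a sketch on top of
the helpers this patch already adds; it also assumes reaching into
list_lru->node[nid].lock directly is acceptable here):

static void zswap_lru_putback(struct list_lru *list_lru,
			      struct zswap_entry *entry)
{
	int nid = entry_to_nid(entry);
	spinlock_t *lock = &list_lru->node[nid].lock;
	struct mem_cgroup *memcg;

	spin_lock(lock);
	memcg = get_mem_cgroup_from_entry(entry);
	/* deliberately not zswap_lru_add(), it bumps the node's lru count */
	list_lru_putback(list_lru, &entry->lru, nid, memcg);
	spin_unlock(lock);
	mem_cgroup_put(memcg);
}

This also naturally moves get_mem_cgroup_from_entry() under the lru lock,
which is what I was referring to above. Anyway, not a blocker.
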
> > > @@ -696,15 +759,17 @@ static void shrink_worker(struct work_struct *w)
> > >  						shrink_work);
> > >  	int ret, failures = 0;
> > >
> > > +	/* global reclaim will select cgroup in a round-robin fashion. */
> > >  	do {
> > > -		ret = zswap_reclaim_entry(pool);
> > > -		if (ret) {
> > > -			zswap_reject_reclaim_fail++;
> > > -			if (ret != -EAGAIN)
> > > -				break;
> > > -			if (++failures == MAX_RECLAIM_RETRIES)
> > > -				break;
> > > -		}
> > > +		pool->next_shrink = mem_cgroup_iter(NULL, pool->next_shrink, NULL);
> >
> > I think this can be a problem. We hold a ref to a memcg here until the
> > next time we shrink, which can be a long time IIUC. This can cause the
> > memcg to linger as a zombie. I understand it is one memcg per zswap
> > pool, but I am still unsure about it.
> >
> > MGLRU maintains a memcg LRU for global reclaim that gets properly
> > cleaned up when a memcg is going away, so that's one option, although
> > complicated.
> >
> > A second option would be to hold a pointer to the objcg instead, which
> > should be less problematic (although we are still holding that objcg
> > hostage indefinitely). The problem here is that if the objcg gets
> > reparented, next time we will start at the parent of the memcg we
> > stopped at last time, which tbh doesn't sound bad at all to me.
> >
> > A third option would be to flag the memcg such that when it is getting
> > offlined we can call into zswap to reset pool->next_shrink (or move it
> > to the parent) and drop the ref. Although synchronization can get
> > hairy when racing with offlining.
> >
> > Not sure what's the right solution, but I prefer we don't hold any
> > memcgs hostage indefinitely. I also think if we end up using
> > mem_cgroup_iter() then there should be a mem_cgroup_iter_break()
> > somewhere if/when breaking the iteration.
> >
>
> I'm not sure if this is that big of a problem in the first place, but
> if it is, doing something similar to MGLRU is probably the cleanest:
> when the memcg is freed, trigger the zswap_exit_memcg() callback,
> which will loop through all the zswap pools and update pool->next_shrink
> where appropriate.
>
> Note that we only have one pool per (compression algorithm x allocator)
> combination, so there cannot be that many pools, correct?
>
> Johannes suggested this idea to me (my apologies if I butcher it)
> during one of our conversations.

That sounds relatively easy IIUC. Be careful that there will be a race
between memcg offlining and zswap's usage of pool->next_shrink. AFAICT
there is no lock to prevent offlining so there will have to be some sort
of dance here to make sure everything works correctly.
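
To make the dance a bit more concrete, something like the below is what I
am imagining (completely untested; it assumes the pool->next_shrink
accesses in shrink_worker() also move under zswap_pools_lock, otherwise
the race with offlining is still there):

/* called from the memcg teardown path, as described above */
void zswap_exit_memcg(struct mem_cgroup *memcg)
{
	struct zswap_pool *pool;

	spin_lock(&zswap_pools_lock);
	list_for_each_entry(pool, &zswap_pools, list) {
		if (pool->next_shrink != memcg)
			continue;
		/*
		 * mem_cgroup_iter() drops the ref on the dying memcg and
		 * takes one on the next memcg in the iteration order.
		 */
		pool->next_shrink = mem_cgroup_iter(NULL, pool->next_shrink,
						    NULL);
	}
	spin_unlock(&zswap_pools_lock);
}

With that, shrink_worker() advancing pool->next_shrink under the same lock
should be enough to avoid ending up with a dangling pointer, I think.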