From: Yosry Ahmed <yosryahmed@google.com>
Date: Mon, 30 Oct 2023 11:16:10 -0700
Subject: Re: [PATCH v4 2/5] zswap: make shrinking memcg-aware
To: Nhat Pham
Cc: akpm@linux-foundation.org, hannes@cmpxchg.org, cerasuolodomenico@gmail.com,
	sjenning@redhat.com, ddstreet@ieee.org, vitaly.wool@konsulko.com,
	mhocko@kernel.org, roman.gushchin@linux.dev, shakeelb@google.com,
	muchun.song@linux.dev, chrisl@kernel.org, linux-mm@kvack.org,
	kernel-team@meta.com, linux-kernel@vger.kernel.org, cgroups@vger.kernel.org,
	linux-doc@vger.kernel.org, linux-kselftest@vger.kernel.org, shuah@kernel.org
References: <20231024203302.1920362-1-nphamcs@gmail.com> <20231024203302.1920362-3-nphamcs@gmail.com>
Content-Type: text/plain; charset="UTF-8"

> > [..]
> > > +/*********************************
> > > +* lru functions
> > > +**********************************/
> > > +static bool zswap_lru_add(struct list_lru *list_lru, struct zswap_entry *entry)
> > > +{
> > > +	struct mem_cgroup *memcg = get_mem_cgroup_from_entry(entry);
> > > +	int nid = entry_to_nid(entry);
> > > +	bool added = list_lru_add(list_lru, &entry->lru, nid, memcg);
> > > +
> > > +	mem_cgroup_put(memcg);
> >
> > Still not fond of the get/put pattern but okay..
>
> Actually, Johannes and I took another look to see if we can replace
> the memcg reference getting with just rcu_read_lock().
>
> It seems there might be a race between zswap LRU manipulation
> and memcg offlining - not just with the rcu_read_lock() idea, but also
> with our current implementation!
>
> I'll shoot another email with more details later when I'm sure of it
> one way or another...
>

Interesting, well at least something came out of my complaining :)

> > [..]
> > > @@ -652,28 +679,37 @@ static int zswap_reclaim_entry(struct zswap_pool *pool)
> > >  	 */
> > >  	swpoffset = swp_offset(entry->swpentry);
> > >  	tree = zswap_trees[swp_type(entry->swpentry)];
> > > -	spin_unlock(&pool->lru_lock);
> > > +	list_lru_isolate(l, item);
> > > +	/*
> > > +	 * It's safe to drop the lock here because we return either
> > > +	 * LRU_REMOVED_RETRY or LRU_RETRY.
> > > +	 */
> > > +	spin_unlock(lock);
> > >
> > >  	/* Check for invalidate() race */
> > >  	spin_lock(&tree->lock);
> > > -	if (entry != zswap_rb_search(&tree->rbroot, swpoffset)) {
> > > -		ret = -EAGAIN;
> > > +	if (entry != zswap_rb_search(&tree->rbroot, swpoffset))
> > >  		goto unlock;
> > > -	}
> > > +
> > >  	/* Hold a reference to prevent a free during writeback */
> > >  	zswap_entry_get(entry);
> > >  	spin_unlock(&tree->lock);
> > >
> > > -	ret = zswap_writeback_entry(entry, tree);
> > > +	writeback_result = zswap_writeback_entry(entry, tree);
> > >
> > >  	spin_lock(&tree->lock);
> > > -	if (ret) {
> > > -		/* Writeback failed, put entry back on LRU */
> > > -		spin_lock(&pool->lru_lock);
> > > -		list_move(&entry->lru, &pool->lru);
> > > -		spin_unlock(&pool->lru_lock);
> > > +	if (writeback_result) {
> > > +		zswap_reject_reclaim_fail++;
> > > +		memcg = get_mem_cgroup_from_entry(entry);
> >
> > Can this return NULL? Seems like we don't check the return in most/all places.
>
> I believe so, but memcg experts should fact check me on this.

If that's the case, there should be NULL checks, no?

> It's roughly the same pattern as zswap charging/uncharging:
>
> obj_cgroup_uncharge_zswap(entry->objcg, entry->length)
> -> getting memcg (under rcu_read_lock())
>
> >
> > > +		spin_lock(lock);
> > > +		/* we cannot use zswap_lru_add here, because it increments node's lru count */
> > > +		list_lru_putback(&entry->pool->list_lru, item, entry_to_nid(entry), memcg);
> >
> > Perhaps we can move this call with the memcg get/put to a helper like
> > add/del? (e.g. zswap_lru_putback)
> >
> > We would need to move get_mem_cgroup_from_entry() into the lock but I
> > think that's okay.
>
> We probably could, but that sounds like extra code for not a lot of gains, no?

I don't feel strongly, just a fan of consistency.

>
> >
> > > +		spin_unlock(lock);
> > > +		mem_cgroup_put(memcg);
> > > +		ret = LRU_RETRY;
> > >  		goto put_unlock;
> > >  	}
> > > +	zswap_written_back_pages++;
> > >
> > >  	/*
> > >  	 * Writeback started successfully, the page now belongs to the
[..]
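
To illustrate the zswap_lru_putback() helper I mentioned above, this is
roughly what I had in mind (completely untested, just a sketch on top of
the helpers this patch already adds; it also assumes reaching into
list_lru->node[nid].lock directly is acceptable here):

static void zswap_lru_putback(struct list_lru *list_lru,
			      struct zswap_entry *entry)
{
	int nid = entry_to_nid(entry);
	spinlock_t *lock = &list_lru->node[nid].lock;
	struct mem_cgroup *memcg;

	spin_lock(lock);
	memcg = get_mem_cgroup_from_entry(entry);
	/* deliberately not zswap_lru_add(), it bumps the node's lru count */
	list_lru_putback(list_lru, &entry->lru, nid, memcg);
	spin_unlock(lock);
	mem_cgroup_put(memcg);
}

This also naturally moves get_mem_cgroup_from_entry() under the lru lock,
which is what I was referring to above. Anyway, not a blocker.
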
> > > @@ -696,15 +759,17 @@ static void shrink_worker(struct work_struct *w)
> > >  						shrink_work);
> > >  	int ret, failures = 0;
> > >
> > > +	/* global reclaim will select cgroup in a round-robin fashion. */
> > >  	do {
> > > -		ret = zswap_reclaim_entry(pool);
> > > -		if (ret) {
> > > -			zswap_reject_reclaim_fail++;
> > > -			if (ret != -EAGAIN)
> > > -				break;
> > > -			if (++failures == MAX_RECLAIM_RETRIES)
> > > -				break;
> > > -		}
> > > +		pool->next_shrink = mem_cgroup_iter(NULL, pool->next_shrink, NULL);
> >
> > I think this can be a problem. We hold a ref to a memcg here until the
> > next time we shrink, which can be a long time IIUC. This can cause the
> > memcg to linger as a zombie. I understand it is one memcg per zswap
> > pool, but I am still unsure about it.
> >
> > MGLRU maintains a memcg LRU for global reclaim that gets properly
> > cleaned up when a memcg is going away, so that's one option, although
> > complicated.
> >
> > A second option would be to hold a pointer to the objcg instead, which
> > should be less problematic (although we are still holding that objcg
> > hostage indefinitely). The problem here is that if the objcg gets
> > reparented, next time we will start at the parent of the memcg we
> > stopped at last time, which tbh doesn't sound bad at all to me.
> >
> > A third option would be to flag the memcg such that when it is getting
> > offlined we can call into zswap to reset pool->next_shrink (or move it
> > to the parent) and drop the ref. Although synchronization can get
> > hairy when racing with offlining.
> >
> > Not sure what's the right solution, but I prefer we don't hold any
> > memcgs hostage indefinitely. I also think if we end up using
> > mem_cgroup_iter() then there should be a mem_cgroup_iter_break()
> > somewhere if/when breaking the iteration.
> >
>
> I'm not sure if this is that big of a problem in the first place, but
> if it is, doing something similar to MGLRU is probably the cleanest:
> when the memcg is freed, trigger the zswap_exit_memcg() callback,
> which will loop through all the zswap pools and update pool->next_shrink
> where appropriate.
>
> Note that we only have one pool per (compression algorithm x allocator)
> combination, so there cannot be that many pools, correct?
>
> Johannes suggested this idea to me (my apologies if I butcher it)
> during one of our conversations.

That sounds relatively easy IIUC. Be careful that there will be a race
between memcg offlining and zswap's usage of pool->next_shrink. AFAICT
there is no lock to prevent offlining so there will have to be some sort
of dance here to make sure everything works correctly.
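
To make the dance a bit more concrete, something like the below is what I
am imagining (completely untested; it assumes the pool->next_shrink
accesses in shrink_worker() also move under zswap_pools_lock, otherwise
the race with offlining is still there):

/* called from the memcg teardown path, as described above */
void zswap_exit_memcg(struct mem_cgroup *memcg)
{
	struct zswap_pool *pool;

	spin_lock(&zswap_pools_lock);
	list_for_each_entry(pool, &zswap_pools, list) {
		if (pool->next_shrink != memcg)
			continue;
		/*
		 * mem_cgroup_iter() drops the ref on the dying memcg and
		 * takes one on the next memcg in the iteration order.
		 */
		pool->next_shrink = mem_cgroup_iter(NULL, pool->next_shrink,
						    NULL);
	}
	spin_unlock(&zswap_pools_lock);
}

With that, shrink_worker() advancing pool->next_shrink under the same lock
should be enough to avoid ending up with a dangling pointer, I think.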