From: Alice Ryhl
Date: Thu, 28 Nov 2024 13:27:03 +0100
Subject: Re: [QUESTION] What memcg lifetime is required by list_lru_add?
To: Dave Chinner
Cc: Johannes Weiner, Andrew Morton, Nhat Pham, Qi Zheng, Roman Gushchin, Muchun Song, Linux Memory Management List, Michal Hocko, Shakeel Butt, cgroups@vger.kernel.org, open list

On Wed, Nov 27, 2024 at 11:05 PM Dave Chinner wrote:
>
> On Wed, Nov 27, 2024 at 10:04:51PM +0100, Alice Ryhl wrote:
> > Dear SHRINKER and MEMCG experts,
> >
> > When using list_lru_add() and list_lru_del(), it seems to be required
> > that you pass the same value of nid and memcg to both calls, since
> > list_lru_del() might otherwise try to delete it from the wrong list /
> > delete it while holding the wrong spinlock. I'm trying to understand
> > the implications of this requirement on the lifetime of the memcg.
> >
> > Now, looking at list_lru_add_obj() I noticed that it uses rcu locking
> > to keep the memcg object alive for the duration of list_lru_add().
> > That rcu locking is used here seems to imply that without it, the
> > memcg could be deallocated during the list_lru_add() call, which is of
> > course bad. But rcu is not enough on its own to keep the memcg alive
> > all the way until the list_lru_del_obj() call, so how does it ensure
> > that the memcg stays valid for that long?
>
> We don't care if the memcg goes away whilst there are objects on the
> LRU. memcg destruction will reparent the objects to a different
> memcg via memcg_reparent_list_lrus() before the memcg is torn down.
> New objects should not be added to the memcg LRUs once the memcg
> teardown process starts, so there should never be add vs reparent
> races during teardown.
>
> Hence all the list_lru_add_obj() function needs to do is ensure that
> the locking/lifecycle rules for the memcg object that
> mem_cgroup_from_slab_obj() returns are obeyed.
>
> > And if there is a mechanism
> > to keep the memcg alive for the entire duration between add and del,
>
> It's enforced by the -complex- state machine used to tear down
> control groups.
>
> tl;dr: If the memcg gets torn down, it will reparent the objects on
> the LRU to its parent memcg during the teardown process.
>
> This reparenting happens in the cgroup ->css_offline() method, which
> only happens after the cgroup reference count goes to zero and is
> waited on via:
>
> kill_css
>   percpu_ref_kill_and_confirm(css_killed_ref_fn)
>
>     css_killed_ref_fn
>       offline_css
>         mem_cgroup_css_offline
>           memcg_offline_kmem
>           {
>                 .....
>                 memcg_reparent_objcgs(memcg, parent);
>
>         /*
>          * After we have finished memcg_reparent_objcgs(), all list_lrus
>          * corresponding to this cgroup are guaranteed to remain empty.
>          * The ordering is imposed by list_lru_node->lock taken by
>          * memcg_reparent_list_lrus().
>          */
>                 memcg_reparent_list_lrus(memcg, parent);
>           }
>
> The cgroup teardown control code then schedules the freeing
> of the memcg container via an RCU work callback when the reference
> count is globally visible as killed and the reference count has gone
> to zero.
>
> Hence the cgroup infrastructure requires RCU protection for the
> duration of unreferenced cgroup object accesses. This allows for
> subsystems to perform operations on the cgroup object without
> needing to hold cgroup references for every access. The complex,
> multi-stage teardown process allows for cgroup objects to release
> objects that it tracks hence avoiding the need for every object the
> cgroup tracks to hold a reference count on the cgroup.
>
> See the comment above css_free_rwork_fn() for more details about the
> teardown process:
>
> /*
>  * css destruction is four-stage process.
>  *
>  * 1. Destruction starts.  Killing of the percpu_ref is initiated.
>  *    Implemented in kill_css().
>  *
>  * 2. When the percpu_ref is confirmed to be visible as killed on all CPUs
>  *    and thus css_tryget_online() is guaranteed to fail, the css can be
>  *    offlined by invoking offline_css().  After offlining, the base ref is
>  *    put.  Implemented in css_killed_work_fn().
>  *
>  * 3. When the percpu_ref reaches zero, the only possible remaining
>  *    accessors are inside RCU read sections.  css_release() schedules the
>  *    RCU callback.
>  *
>  * 4. After the grace period, the css can be freed.  Implemented in
>  *    css_free_rwork_fn().
>  *
>  * It is actually hairier because both step 2 and 4 require process context
>  * and thus involve punting to css->destroy_work adding two additional
>  * steps to the already complex sequence.
>  */

Thanks a lot Dave, this clears it up for me. I sent a patch containing
some additional docs for list_lru:

https://lore.kernel.org/all/20241128-list_lru_memcg_docs-v1-1-7e4568978f4e@google.com/

Alice
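
P.S. For anyone reading this thread later, the reparenting rule described
above can be modelled in a few lines of userspace C. This is only a toy
sketch, not kernel code: struct toy_memcg, toy_lru_add(), toy_lru_del() and
toy_memcg_offline() are hypothetical names, there is a single node, and
there is no locking or RCU at all. It also assumes that an offlined memcg
resolves to its nearest live ancestor; the thread does not spell out how the
kernel performs that lookup, so treat that part as an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy list node; an item that is on no list points at itself. */
struct toy_item {
    struct toy_item *prev, *next;
};

/* Toy stand-in for a memcg owning one LRU list. NOT the kernel structure. */
struct toy_memcg {
    struct toy_memcg *parent;  /* NULL for the root */
    bool offline;              /* set once the list has been reparented */
    struct toy_item head;      /* circular list head */
    long nr_items;
};

void toy_memcg_init(struct toy_memcg *cg, struct toy_memcg *parent)
{
    cg->parent = parent;
    cg->offline = false;
    cg->head.prev = cg->head.next = &cg->head;
    cg->nr_items = 0;
}

void toy_item_init(struct toy_item *it)
{
    it->prev = it->next = it;
}

/* Assumption: an offlined memcg resolves to its nearest live ancestor. */
struct toy_memcg *toy_live(struct toy_memcg *cg)
{
    while (cg->offline && cg->parent)
        cg = cg->parent;
    return cg;
}

/* Add to the tail of the (live) list; false if the item is already listed. */
bool toy_lru_add(struct toy_memcg *cg, struct toy_item *it)
{
    if (it->next != it)
        return false;
    cg = toy_live(cg);
    it->prev = cg->head.prev;
    it->next = &cg->head;
    cg->head.prev->next = it;
    cg->head.prev = it;
    cg->nr_items++;
    return true;
}

/* Remove from whichever live list now accounts for it; false if not listed. */
bool toy_lru_del(struct toy_memcg *cg, struct toy_item *it)
{
    if (it->next == it)
        return false;
    cg = toy_live(cg);
    it->prev->next = it->next;
    it->next->prev = it->prev;
    it->prev = it->next = it;
    cg->nr_items--;
    return true;
}

/* Teardown step: splice every item onto the parent, then mark offline. */
void toy_memcg_offline(struct toy_memcg *cg)
{
    struct toy_memcg *parent;

    assert(cg->parent != NULL);  /* the root is never offlined in this toy */
    parent = toy_live(cg->parent);
    if (cg->nr_items > 0) {
        struct toy_item *first = cg->head.next, *last = cg->head.prev;

        first->prev = parent->head.prev;
        parent->head.prev->next = first;
        last->next = &parent->head;
        parent->head.prev = last;
        parent->nr_items += cg->nr_items;
    }
    cg->head.prev = cg->head.next = &cg->head;
    cg->nr_items = 0;
    cg->offline = true;
}
```

In this toy, an item added under a child memcg can still be deleted by a
caller that only remembers the child after the child has been offlined: the
item sits on the parent's list by then, and the lookup lands there too.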