Date: Thu, 28 Nov 2024 09:05:34 +1100
From: Dave Chinner <david@fromorbit.com>
To: Alice Ryhl
Cc: Johannes Weiner, Andrew Morton, Nhat Pham, Qi Zheng,
	Roman Gushchin, Muchun Song, Linux Memory Management List,
	Michal Hocko, Shakeel Butt, cgroups@vger.kernel.org, open list
Subject: Re: [QUESTION] What memcg lifetime is required by list_lru_add?
On Wed, Nov 27, 2024 at 10:04:51PM +0100, Alice Ryhl wrote:
> Dear SHRINKER and MEMCG experts,
> 
> When using list_lru_add() and list_lru_del(), it seems to be required
> that you pass the same value of nid and memcg to both calls, since
> list_lru_del() might otherwise try to delete it from the wrong list /
> delete it while holding the wrong spinlock. I'm trying to understand
> the implications of this requirement on the lifetime of the memcg.
> 
> Now, looking at list_lru_add_obj() I noticed that it uses rcu locking
> to keep the memcg object alive for the duration of list_lru_add().
> That rcu locking is used here seems to imply that without it, the
> memcg could be deallocated during the list_lru_add() call, which is of
> course bad. But rcu is not enough on its own to keep the memcg alive
> all the way until the list_lru_del_obj() call, so how does it ensure
> that the memcg stays valid for that long?

We don't care if the memcg goes away whilst there are objects on the
LRU. memcg destruction will reparent the objects to a different memcg
via memcg_reparent_list_lrus() before the memcg is torn down.

New objects should not be added to the memcg LRUs once the memcg
teardown process starts, so there should never be add vs reparent
races during teardown.

Hence all the list_lru_add_obj() function needs to do is ensure that
the locking/lifecycle rules for the memcg object that
mem_cgroup_from_slab_obj() returns are obeyed.
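
To make the scope of that RCU protection concrete, here is a
simplified sketch of the add path (paraphrased from mm/list_lru.c as
I read it; the exact code differs between kernel versions, so treat
it as illustrative rather than authoritative):

bool list_lru_add_obj(struct list_lru *lru, struct list_head *item)
{
	bool ret;
	int nid = page_to_nid(virt_to_page(item));

	if (list_lru_memcg_aware(lru)) {
		/*
		 * The RCU read section only needs to keep the memcg
		 * returned by mem_cgroup_from_slab_obj() alive across
		 * the add itself; once the object is on the per-memcg
		 * list, reparenting at teardown takes over.
		 */
		rcu_read_lock();
		ret = list_lru_add(lru, item, nid,
				   mem_cgroup_from_slab_obj(item));
		rcu_read_unlock();
	} else {
		ret = list_lru_add(lru, item, nid, NULL);
	}

	return ret;
}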
> And if there is a mechanism
> to keep the memcg alive for the entire duration between add and del,

It's enforced by the -complex- state machine used to tear down
control groups. tl;dr: If the memcg gets torn down, it will reparent
the objects on the LRU to its parent memcg during the teardown
process. This reparenting happens in the cgroup ->css_offline()
method, which only runs once the cgroup's percpu reference count has
been killed and the kill is confirmed visible on all CPUs, via:

kill_css
  percpu_ref_kill_and_confirm(css_killed_ref_fn)
    css_killed_ref_fn
      offline_css
        mem_cgroup_css_offline
          memcg_offline_kmem
          {
		.....
		memcg_reparent_objcgs(memcg, parent);

		/*
		 * After we have finished memcg_reparent_objcgs(), all list_lrus
		 * corresponding to this cgroup are guaranteed to remain empty.
		 * The ordering is imposed by list_lru_node->lock taken by
		 * memcg_reparent_list_lrus().
		 */
		memcg_reparent_list_lrus(memcg, parent);
          }

The cgroup teardown code then schedules the freeing of the memcg
container via an RCU work callback once the reference count is
globally visible as killed and has gone to zero.

Hence the cgroup infrastructure requires RCU protection for the
duration of unreferenced cgroup object accesses. This allows
subsystems to perform operations on the cgroup object without needing
to hold cgroup references for every access. The complex, multi-stage
teardown process gives the cgroup a chance to release the objects it
tracks, avoiding the need for every tracked object to hold a
reference count on the cgroup.

See the comment above css_free_rwork_fn() for more details about the
teardown process:

/*
 * css destruction is four-stage process.
 *
 * 1. Destruction starts.  Killing of the percpu_ref is initiated.
 *    Implemented in kill_css().
 *
 * 2. When the percpu_ref is confirmed to be visible as killed on all CPUs
 *    and thus css_tryget_online() is guaranteed to fail, the css can be
 *    offlined by invoking offline_css().  After offlining, the base ref is
 *    put.  Implemented in css_killed_work_fn().
 *
 * 3. When the percpu_ref reaches zero, the only possible remaining
 *    accessors are inside RCU read sections.  css_release() schedules the
 *    RCU callback.
 *
 * 4. After the grace period, the css can be freed.  Implemented in
 *    css_free_rwork_fn().
 *
 * It is actually hairier because both step 2 and 4 require process context
 * and thus involve punting to css->destroy_work adding two additional
 * steps to the already complex sequence.
 */

-Dave.
-- 
Dave Chinner
david@fromorbit.com