From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Tue, 3 Oct 2023 09:06:26 -0700
From: Roman Gushchin
To: Johannes Weiner
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, cgroups@vger.kernel.org,
	Michal Hocko, Shakeel Butt, Muchun Song, Dennis Zhou, Andrew Morton
Subject: Re: [PATCH rfc 2/5] mm: kmem: add direct objcg pointer to task_struct
References: <20230927150832.335132-1-roman.gushchin@linux.dev>
	<20230927150832.335132-3-roman.gushchin@linux.dev>
	<20231002201254.GA8435@cmpxchg.org>
	<20231003142255.GE17012@cmpxchg.org>
In-Reply-To: <20231003142255.GE17012@cmpxchg.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

On Tue, Oct 03, 2023 at 10:22:55AM -0400, Johannes Weiner wrote:
> On Mon, Oct 02, 2023 at 03:03:48PM -0700, Roman Gushchin wrote:
> > On Mon, Oct 02, 2023 at 04:12:54PM -0400, Johannes Weiner wrote:
> > > On Wed, Sep 27, 2023 at 08:08:29AM -0700, Roman Gushchin wrote:
> > > > @@ -3001,6 +3001,47 @@ static struct obj_cgroup *__get_obj_cgroup_from_memcg(struct mem_cgroup *memcg)
> > > >  	return objcg;
> > > >  }
> > > >
> > > > +static DEFINE_SPINLOCK(current_objcg_lock);
> > > > +
> > > > +static struct obj_cgroup *current_objcg_update(struct obj_cgroup *old)
> > > > +{
> > > > +	struct mem_cgroup *memcg;
> > > > +	struct obj_cgroup *objcg;
> > > > +	unsigned long flags;
> > > > +
> > > > +	old = current_objcg_clear_update_flag(old);
> > > > +	if (old)
> > > > +		obj_cgroup_put(old);
> > > > +
> > > > +	spin_lock_irqsave(&current_objcg_lock, flags);
> > > > +	rcu_read_lock();
> > > > +	memcg = mem_cgroup_from_task(current);
> > > > +	for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg)) {
> > > > +		objcg = rcu_dereference(memcg->objcg);
> > > > +		if (objcg && obj_cgroup_tryget(objcg))
> > > > +			break;
> > > > +		objcg = NULL;
> > > > +	}
> > > > +	rcu_read_unlock();
> > >
> > > Can this tryget() actually fail when this is called on the current
> > > task during fork() and attach()? A cgroup cannot be offlined while
> > > there is a task in it.
> >
> > Highly theoretically it can if it races against a migration of the current
> > task to another memcg and the previous memcg is getting offlined.
>
> Ah right, if this runs between css_set_move_task() and ->attach(). The
> cache would be briefly updated to a parent in the old hierarchy, but
> then quickly reset from the ->attach().

Even simpler:

	rcu_read_lock();
	memcg = mem_cgroup_from_task(current);
---------
Here the task can be moved to another memcg and the previous one can be
offlined, making objcg fully detached.
---------
	for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg)) {
		objcg = rcu_dereference(memcg->objcg);
		if (objcg && obj_cgroup_tryget(objcg))
---------
Objcg can be NULL here, or it can be non-NULL but lose the last reference
between the objcg check and obj_cgroup_tryget().
---------
			break;
		objcg = NULL;
	}
	rcu_read_unlock();

>
> Can you please add a comment along these lines?

Sure, will do.

>
> > I actually might make sense to apply the same approach for memcgs as well
> > (saving a lazily-updating memcg pointer on task_struct). Then it will be
> > possible to ditch this "for" loop. But I need some time to master the code
> > and run benchmarks. Idk if it will make enough difference to justify the change.
>
> Yeah the memcg pointer is slightly less attractive from an
> optimization POV because it already is a pretty direct pointer from
> task through the cset array.
>
> If you still want to look into it from a simplification POV that
> sounds reasonable, but IMO it would be fine with a comment.

I'll come back with some numbers; it's hard to speculate without them.
In this case the majority of the savings came from not bumping and
decrementing a percpu objcg refcounter on the slab allocation path. That
was quite surprising to me.

>
> > > > @@ -6345,6 +6393,22 @@ static void mem_cgroup_move_task(void)
> > > >  		mem_cgroup_clear_mc();
> > > >  	}
> > > >  }
> > > > +
> > > > +#ifdef CONFIG_MEMCG_KMEM
> > > > +static void mem_cgroup_fork(struct task_struct *task)
> > > > +{
> > > > +	task->objcg = (struct obj_cgroup *)0x1;
> > >
> > > dup_task_struct() will copy this pointer from the old task. Would it
> > > be possible to bump the refcount here instead? That would save quite a
> > > bit of work during fork().
> >
> > Yeah, it should be possible.
> > It won't save a lot, but I agree it makes
> > sense. I'll take a look and will prepare a separate patch for this.
>
> I guess the hairiest part would be synchronizing against a migration
> because all these cgroup core callbacks are unlocked.

Yep.

>
> Would it make sense to add ->fork_locked() and ->attach_locked()
> callbacks that are dispatched under the css_set_lock? Then this could
> be a simple if (p && !(p & 0x1)) obj_cgroup_get(), which would
> certainly be nice to workloads where fork() is hot, with little
> downside otherwise.

Maybe, but then the question is whether it's really worth it. In the final
version the update path doesn't need a spinlock, so it's quite cheap and
happens once on the first allocation, so Idk if it's worth it at all, but
I'll take a look.

I think the bigger question I have here (and it's probably worth a
lsfmmbpf/plumbers discussion) is: what if we introduce a cgroup mount (or
even Kconfig) option to prohibit moving tasks between cgroups and rely
solely on fork to enter the right cgroup (a-la namespaces)? I'm starting to
think that this is the right path long-term: things will not only be more
reliable, but we can also ditch a lot of synchronization and get better
performance. Obviously not a small project.

Thanks!
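
For anyone following the thread, here is a minimal userspace sketch of the low-bit "update flag" idea discussed above, including the fork-time check Johannes suggested (if (p && !(p & 0x1)) obj_cgroup_get()). It is not kernel code and not the patch itself: struct objcg, OBJCG_UPDATE_FLAG and all the objcg_*() helpers are hypothetical stand-ins, the refcount is a plain int rather than a percpu ref, and there is no locking or RCU. The fallback to a bare flag value is an assumption modeled on the 0x1 assignment in the quoted mem_cgroup_fork().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct obj_cgroup: just a reference count. */
struct objcg {
	int refcnt;
};

/*
 * Bit 0 of the cached pointer means "stale, re-resolve before use".
 * This assumes struct objcg is at least 2-byte aligned, so bit 0 of a
 * valid pointer is always zero.
 */
#define OBJCG_UPDATE_FLAG ((uintptr_t)0x1)

static int objcg_needs_update(const struct objcg *p)
{
	return ((uintptr_t)p & OBJCG_UPDATE_FLAG) != 0;
}

/* Strip the flag bit; the result is NULL if only the flag was set. */
static struct objcg *objcg_clear_update_flag(struct objcg *p)
{
	return (struct objcg *)((uintptr_t)p & ~OBJCG_UPDATE_FLAG);
}

static struct objcg *objcg_set_update_flag(struct objcg *p)
{
	return (struct objcg *)((uintptr_t)p | OBJCG_UPDATE_FLAG);
}

static void objcg_get(struct objcg *p)
{
	p->refcnt++;
}

static void objcg_put(struct objcg *p)
{
	assert(p->refcnt > 0);
	p->refcnt--;
}

/*
 * Decide what the child's cached pointer becomes at fork time, given the
 * value copied from the parent. A valid, unflagged pointer is inherited
 * with an extra reference; anything else becomes a bare flag value, so
 * the child resolves its objcg lazily on first use.
 */
static struct objcg *objcg_fork_inherit(struct objcg *copied)
{
	if (copied && !objcg_needs_update(copied)) {
		objcg_get(copied);
		return copied;
	}
	return objcg_set_update_flag(NULL);
}
```

The sketch leaves out the hard part raised in the thread: in the kernel the inherit step would have to be synchronized against a concurrent migration, which is what the proposed ->fork_locked()/->attach_locked() callbacks would be for.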