From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 1C5AAC3DA70 for ; Thu, 25 Jul 2024 23:34:01 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id A45906B009B; Thu, 25 Jul 2024 19:34:00 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 9F5476B009C; Thu, 25 Jul 2024 19:34:00 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 8BCCF6B009D; Thu, 25 Jul 2024 19:34:00 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0011.hostedemail.com [216.40.44.11]) by kanga.kvack.org (Postfix) with ESMTP id 6CE526B009B for ; Thu, 25 Jul 2024 19:34:00 -0400 (EDT) Received: from smtpin26.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay01.hostedemail.com (Postfix) with ESMTP id 1DF671C049A for ; Thu, 25 Jul 2024 23:34:00 +0000 (UTC) X-FDA: 82379880240.26.551848E Received: from out-185.mta1.migadu.com (out-185.mta1.migadu.com [95.215.58.185]) by imf12.hostedemail.com (Postfix) with ESMTP id E85F240016 for ; Thu, 25 Jul 2024 23:33:57 +0000 (UTC) Authentication-Results: imf12.hostedemail.com; dkim=pass header.d=linux.dev header.s=key1 header.b=uAPV4yBf; spf=pass (imf12.hostedemail.com: domain of roman.gushchin@linux.dev designates 95.215.58.185 as permitted sender) smtp.mailfrom=roman.gushchin@linux.dev; dmarc=pass (policy=none) header.from=linux.dev ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1721950373; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=n+9XM/fW5RfwKkN8eewGZAUVoURY2akltnsnwAvZUkc=; b=lmCdLbj2pkiqz36klNgG8rJrqRRWDeBAyaL/2+GHgOLX15WQ/IEGexJlcNTPhkacwmswb4 Pf9f52ZiXyihL6LACi9CFfy6lg4SnV4tTi4VkXDBCZhQuoRFTIG0uTkFapoDFSdN+KPXmZ FaQCsmmweGFqxbd5fE96NBp+01MhReU= ARC-Authentication-Results: i=1; imf12.hostedemail.com; dkim=pass header.d=linux.dev header.s=key1 header.b=uAPV4yBf; spf=pass (imf12.hostedemail.com: domain of roman.gushchin@linux.dev designates 95.215.58.185 as permitted sender) smtp.mailfrom=roman.gushchin@linux.dev; dmarc=pass (policy=none) header.from=linux.dev ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1721950373; a=rsa-sha256; cv=none; b=087DyOBBNmIGT2Jsy44ElNOq8ERapR4GBTzlnNkAxFlfgBZDI1ofGQ3OZJckhDxzi5Cgux pZNZqPMt5yuC3bFhtEHv50xVTBdK+qZjSH4HYS6mWpDapJ0XKvd2ymZmvoXgTvSynp1vOR FYpbD2YtE3vAjaPioxzMrgDvf92RPSI= Date: Thu, 25 Jul 2024 23:33:51 +0000 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1; t=1721950435; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=n+9XM/fW5RfwKkN8eewGZAUVoURY2akltnsnwAvZUkc=; b=uAPV4yBf/iuqodww1FoyZ6JHMnBCp2eGgGqbuI6xDaus8JYa6hr4e/AbUxU4i3nYKmM/42 3BpEvWbXQnZ1YaS71Njy64i5ZH45wDxTLCDVUmuMPmdck3x0EFMavaPlO9vEkfZ17rcD+l HUzMr+eu/Q0s623RSlhnBjxmzcbmaDw= X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers. From: Roman Gushchin To: Johannes Weiner Cc: Andrew Morton , linux-mm@kvack.org, linux-kernel@vger.kernel.org, Michal Hocko , Shakeel Butt , Muchun Song Subject: Re: [PATCH v2 3/5] mm: memcg: merge multiple page_counters into a single structure Message-ID: References: <20240724202103.1210065-1-roman.gushchin@linux.dev> <20240724202103.1210065-4-roman.gushchin@linux.dev> <20240725214227.GC1702603@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20240725214227.GC1702603@cmpxchg.org> X-Migadu-Flow: FLOW_OUT X-Rspamd-Server: rspam06 X-Rspamd-Queue-Id: E85F240016 X-Stat-Signature: xu1aq14apntfeu833gpg8ofztushu6bp X-Rspam-User: X-HE-Tag: 1721950437-726652 X-HE-Meta: U2FsdGVkX19BIowo/luiOJ21BoxIqKNw6+GD/ARznx/LEo3rmt0IS6wyp0iAkM7IthJNUInxKdyMESvRxO1XfB/W1ZUga2x9yhiBLpSSFVWZ/ncQ3XBDpBK61sUF3Mf80LbHADgPEey2WmZ2V5FDoTe3kcTiJMsBrad5H1NsRkzHC4EYPhsLhNRRV+yo4EfupoaUqFsKwl22F4H51NZeqb//i/mT9znJBb5Xhr8abqnJwH5t8cvblTIp5Cr3UapSwpqYAgL0ok8RQ1B+/G1Q+5/AGbKYasIuZVmPCDByvTvQC1k/zB5zLWgR6UTPoMDaMXnl8JMhvEIhmjN7O4Rbx9+ixsOIhdTWpKkcfFF3v+YIxJ5OESGsQpjLvn7C1u4ASAIlQ/u0wXE8rf/qxiDOZD4wybYZm9vx0TRhot0wl4/THkL9spA5j0g6JbvVsQzqUPEs5bj+v3eJJwE4aRVShZrkipCsiXBpYiD78DzpyMTaJOtWGIZOTLjDbwJXI5v761r5td+WRd3jdXCshEsiR9Epc2xlhIk5hdH63uJ9wdlSv3xsjyKOhklx/HK1gZc1bsvpufjYoDYXdtWfHATsJEbGbZ/DZS5gUQ4IQf+J4mqs84a+TuJyheBcFyKNrCVHalFgY83jBnW0i+ppJRcgxiHhubUzIHxgAXOekXPYA5VvlIqvrPIwLsQ9fQqj51Zk7gZgGv8nB0KY+3Z20+IJ6SCSzQr/l2aF9m11gjCVQYOamNJSfk1NKC/26deOJzH2YAqPuxB+W1ckLP1AskKEdDnB5ahzhgUqQJCJO1FDuHrsptUYlsLA9jHnS5mWFS7eLhubf1/D0NkX01MqEmWE0Y1SRnsbfnbMpCpVVFWN+aZMtLImPv37TvLY+q7fx4TW2TGzsBpy18C5OAsO/q8q/EGtkAqy4BtWXkQpFsc2JTrrpFYusCd1LbpHGBoulza1YAw80Puih1YXrGHKmCR jy9fpv4T 7dRGWjWthlIuFEXc4VLS1SNJ6f45fmwFs9ActKHNJKUSwE25jtTviXQUXTvZxRkjJfF7j0tGP5dtJ0KmAG6iSpfbZbp8RtKGnpU6XuA3HK+TzbnOItpqQH+RDrabnUDBbrGL+6xBQF1/y5juiVB+H0GT/2RSxxDnfEz+Jq7nt6HC2snTo1r8wb020SccJbE49bFsY5hquT4KeFtc= X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Thu, Jul 25, 2024 at 05:42:27PM -0400, Johannes Weiner wrote: > On Wed, Jul 24, 2024 at 08:21:01PM +0000, Roman Gushchin wrote: > > --- a/include/linux/page_counter.h > > +++ b/include/linux/page_counter.h > > @@ -5,14 +5,71 @@ > > #include > > #include > > #include > > +#include > > #include > > > > +/* > > + * Page counters are used by memory and hugetlb cgroups. > > + * Memory cgroups are using up to 4 separate counters: > > + * memory, swap (memory + swap on cgroup v1), kmem and tcpmem. > > + * Hugetlb cgroups are using 2 arrays of counters with HUGE_MAX_HSTATE > > + * in each to track the usage and reservations of each supported > > + * hugepage size. > > + * > > + * Protection (min/low) is supported only for the first counter > > + * with idx 0 and only if the counter was initialized with the protection > > + * support. > > + */ > > + > > +enum mem_counter_type { > > +#ifdef CONFIG_MEMCG > > + /* Unified memory counter */ > > + MCT_MEM, > > + > > + /* Swap */ > > + MCT_SWAP, > > + > > + /* Memory + swap */ > > + MCT_MEMSW = MCT_SWAP, > > + > > +#ifdef CONFIG_MEMCG_V1 > > + /* Kernel memory */ > > + MCT_KMEM, > > + > > + /* Tcp memory */ > > + MCT_TCPMEM, > > +#endif /* CONFIG_MEMCG_V1 */ > > +#endif /* CONFIG_MEMCG */ > > + > > + /* Maximum number of memcg counters */ > > + MCT_NR_MEMCG_ITEMS, > > +}; > > + > > +#ifdef CONFIG_CGROUP_HUGETLB > > +#ifdef HUGE_MAX_HSTATE > > +#define MCT_NR_HUGETLB_ITEMS HUGE_MAX_HSTATE > > +#else > > +#define MCT_NR_HUGETLB_ITEMS 1 > > +#endif > > + > > +/* > > + * max() can't be used here: even though __builtin_choose_expr() evaluates > > + * to true, the false clause generates a compiler error: > > + * error: braced-group within expression allowed only inside a function . > > + */ > > +#define MCT_NR_ITEMS (__builtin_choose_expr(MCT_NR_MEMCG_ITEMS > MCT_NR_HUGETLB_ITEMS, \ > > + MCT_NR_MEMCG_ITEMS, MCT_NR_HUGETLB_ITEMS)) > > + > > +#else /* CONFIG_CGROUP_HUGETLB */ > > +#define MCT_NR_ITEMS MCT_NR_MEMCG_ITEMS > > +#endif /* CONFIG_CGROUP_HUGETLB */ > > + > > struct page_counter { > > /* > > * Make sure 'usage' does not share cacheline with any other field. The > > * memcg->memory.usage is a hot member of struct mem_cgroup. > > */ > > - atomic_long_t usage; > > + atomic_long_t usage[MCT_NR_ITEMS]; > > CACHELINE_PADDING(_pad1_); > > > > /* effective memory.min and memory.min usage tracking */ > > @@ -25,9 +82,9 @@ struct page_counter { > > atomic_long_t low_usage; > > atomic_long_t children_low_usage; > > > > - unsigned long watermark; > > - unsigned long local_watermark; /* track min of fd-local resets */ > > - unsigned long failcnt; > > + unsigned long watermark[MCT_NR_ITEMS]; > > + unsigned long local_watermark[MCT_NR_ITEMS]; /* track min of fd-local resets */ > > + unsigned long failcnt[MCT_NR_ITEMS]; > > > > /* Keep all the read most fields in a separete cacheline. */ > > CACHELINE_PADDING(_pad2_); > > @@ -35,8 +92,9 @@ struct page_counter { > > bool protection_support; > > unsigned long min; > > unsigned long low; > > - unsigned long high; > > - unsigned long max; > > + unsigned long high[MCT_NR_ITEMS]; > > + unsigned long max[MCT_NR_ITEMS]; > > + > > struct page_counter *parent; > > } ____cacheline_internodealigned_in_smp; > > This hardcodes way too much user specifics into what should be a > self-contained piece of abstraction. Should anybody else try to use > the 'struct page_counter' for a single resource elsewhere, they'd end > up with up to 5 counters, watermarks, failcnts, highs and maxs etc. > > It's clear that the hierarchical aspect of the page_counter no longer > works. It was okay-ish when all we had was a simple hard limit. Now we > carry to much stuff in it that is only useful to a small set of users. > Instead of doubling down and repeating the mistakes we made for struct > page, it would be better to move user specific stuff out of it > entirely. I think there are two patterns that would be more natural: > a) pass a callback function and have a priv pointer in page_counter > for user-specifics; or b) remove the hierarchy aspect from the page > counter entirely, let the *caller* walk the tree, call the page > counter code to trycharge (and log watermark) only that level, and > handle everything caller specific on the caller side. > > All users already have parent linkage in the css, so this would > eliminate the parent pointer in page_counter altogether. > > The protection stuff could be moved to memcg proper, eliminating yet > more fields that aren't interesting to any other user. Hm, Idk, I do agree with what you're saying about the self-contained piece of abstraction (and I had very similar thoughts in the process), but there are also some complications. First, funny enough, the protection calculation code was just moved to mm/page_counter.c by a8585ac68621 ("mm/page_counter: move calculating protection values to page_counter"). The commit log says that it's gonna be used by the drm cgroup controller, but the code is not (yet?) upstream apparently. I don't have any insights here. If there will be the second user for the protection functionality, moving it back to memcontrol.c is not feasible. Second, I agree that it would be nice to get rid of the parent pointer in struct page_cgroup entirely and use one in css. But Idk how to do it without making the code way more messy or duplicate a lot of tree walks. In C++ (or another language with generics) we could make struct page_counter taking the number of counters and the set of features as template parameters. I feel like with memcg1 code being factored out the benefit of this reorg lowered, so how about me resending 2 first patches separately and spending more time on thinking on the best long-term strategy? I have some ideas here and I thought about this work as a preparatory step, but maybe you're right and it makes sense to approach it more comprehensively. Thanks!