From mboxrd@z Thu Jan  1 00:00:00 1970
MIME-Version: 1.0
References: <20240418142008.2775308-1-zhangpeng362@huawei.com>
 <20240418142008.2775308-3-zhangpeng362@huawei.com> <ec2b110b-fb85-4af2-942b-645511a32297@gmail.com>
 <c1c79eb5-4d48-40e5-6f17-f8bc42f2d274@huawei.com> <CAMgjq7DHUgyR0vtkYXH4PuzBHUVZ5cyCzi58TfShL57TUSL+Tg@mail.gmail.com>
 <5nv3r7qilsye5jgqcjrrbiry6on7wtjmce6twqbxg6nmvczue3@ikc22ggphg3h>
In-Reply-To: <5nv3r7qilsye5jgqcjrrbiry6on7wtjmce6twqbxg6nmvczue3@ikc22ggphg3h>
From: Kairui Song <ryncsn@gmail.com>
Date: Fri, 17 May 2024 11:29:57 +0800
Message-ID: <CAMgjq7DgE6NZPR8Sf2nq3vpVG8ZoC03e8aXi-QKbiievi3BB_g@mail.gmail.com>
Subject: Re: [RFC PATCH v2 2/2] mm: convert mm's rss stats to use atomic mode
To: Mateusz Guzik <mjguzik@gmail.com>
Cc: "zhangpeng (AS)" <zhangpeng362@huawei.com>, Rongwei Wang <rongwei.wrw@gmail.com>, linux-mm@kvack.org, 
	LKML <linux-kernel@vger.kernel.org>, Andrew Morton <akpm@linux-foundation.org>, 
	dennisszhou@gmail.com, shakeelb@google.com, jack@suse.cz, 
	Suren Baghdasaryan <surenb@google.com>, kent.overstreet@linux.dev, mhocko@suse.cz, 
	vbabka@suse.cz, Yu Zhao <yuzhao@google.com>, yu.ma@intel.com, 
	wangkefeng.wang@huawei.com, sunnanyong@huawei.com
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>
List-Subscribe: <mailto:majordomo@kvack.org>
List-Unsubscribe: <mailto:majordomo@kvack.org>

On Thu, May 16, 2024 at 23:14, Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> On Thu, May 16, 2024 at 07:50:52PM +0800, Kairui Song wrote:
> > > > On 2024/4/18 22:20, Peng Zhang wrote:
> > > >> From: ZhangPeng <zhangpeng362@huawei.com>
> > > >>
> > > >> Since commit f1a7941243c1 ("mm: convert mm's rss stats into
> > > >> percpu_counter"), the rss_stats have converted into percpu_counter,
> > > >> which convert the error margin from (nr_threads * 64) to approximately
> > > >> (nr_cpus ^ 2). However, the new percpu allocation in mm_init() causes a
> > > >> performance regression on fork/exec/shell. Even after commit
> > > >> 14ef95be6f55 ("kernel/fork: group allocation/free of per-cpu counters
> > > >> for mm struct"), the performance of fork/exec/shell is still poor
> > > >> compared to previous kernel versions.
> > > >>
> > > >> To mitigate performance regression, we delay the allocation of percpu
> > > >> memory for rss_stats. Therefore, we convert mm's rss stats to use
> > > >> percpu_counter atomic mode. For single-thread processes, rss_stat is in
> > > >> atomic mode, which reduces the memory consumption and performance
> > > >> regression caused by using percpu. For multiple-thread processes,
> > > >> rss_stat is switched to the percpu mode to reduce the error margin.
> > > >> We convert rss_stats from atomic mode to percpu mode only when the
> > > >> second thread is created.
> >
> > I have a patch series that predates commit f1a7941243c1 ("mm: convert
> > mm's rss stats into percpu_counter"):
> >
> > https://lwn.net/ml/linux-kernel/20220728204511.56348-1-ryncsn@gmail.com/
> >
> > Instead of a per-mm-per-cpu cache, it used only one global per-cpu
> > cache and flushed it on schedule. Or, if the arch supports it, it
> > flushed and fetched the cache using the mm bitmap as an optimization
> > (like TLB shootdown).
> >
>
> I just spotted this thread.
>
> I have a rather long rant to write about the entire ordeal, but don't
> have the time at the moment. I do have time to make some remarks though.
>
> Rolling with a centralized counter and only distributing per-cpu upon
> creation of a thread is something which was discussed last time and
> which I was considering doing. Then life got in the way and in the
> meantime I managed to conclude it's a questionable idea anyway.
>
> The state prior to the counters moving to per-cpu was not that great to
> begin with, with quite a few serialization points. As far as allocating
> stuff goes one example is mm_alloc_cid, with the following:
>         mm->pcpu_cid = alloc_percpu(struct mm_cid);
>
> Converting the code to avoid per-cpu rss counters in the common case or
> the above patchset only damage-control the state back to what it was,
> don't do anything to push things further.
>
> Another note is that unfortunately userspace is increasingly
> multithreaded for no good reason, see the Rust ecosystem as an example.
>
> All that to say is that the multithreaded case is what has to get
> faster, as a side effect possibly obsoleting both approaches proposed
> above. I concede if there is nobody willing to commit to doing the work
> in the foreseeable future then indeed a damage-controlling solution
> should land.

Hi, Mateusz,

Which patch are you referring to? My series doesn't need any allocation
on thread creation or destruction. The RSS update is also extremely
lightweight (pretty much just a read of GS and a few ADD/INC
instructions), and performance is better than all the alternatives,
even in microbenchmarks. An RSS read only collects info from CPUs that
may hold real updates.

I understand you may not have time to go through my series... but I
think I should add some details here.
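
To make the idea concrete, here is a rough userspace model of "one
global per-CPU cache, flushed on mm switch". It is not the code from
the series: the struct layout, the function names and the NR_CPUS
value are all made up for illustration, and it ignores concurrency
entirely (the real thing has to deal with preemption and remote
reads).

```c
#include <assert.h>

#define NR_CPUS        4   /* hypothetical CPU count for the model */
#define NR_MM_COUNTERS 4

/* Simplified stand-in for mm_struct: only the committed RSS totals. */
struct mm_model {
	long rss[NR_MM_COUNTERS];
};

/* One cache per CPU, used by whichever mm currently runs on that CPU. */
struct cpu_rss_cache {
	struct mm_model *mm;         /* mm the pending deltas belong to */
	long delta[NR_MM_COUNTERS];  /* not yet folded into mm->rss */
};

static struct cpu_rss_cache rss_cache[NR_CPUS];

/* Fold this CPU's pending deltas into the mm they belong to. */
static void rss_cache_flush(int cpu)
{
	struct cpu_rss_cache *c = &rss_cache[cpu];

	if (!c->mm)
		return;
	for (int i = 0; i < NR_MM_COUNTERS; i++) {
		c->mm->rss[i] += c->delta[i];
		c->delta[i] = 0;
	}
}

/* Called when "cpu" starts running "next": flush, then retarget. */
static void rss_cache_switch_mm(int cpu, struct mm_model *next)
{
	if (rss_cache[cpu].mm == next)
		return;
	rss_cache_flush(cpu);
	rss_cache[cpu].mm = next;
}

/* Hot path: a plain add on CPU-local data, no atomics, no allocation. */
static void rss_add(int cpu, int member, long value)
{
	assert(member >= 0 && member < NR_MM_COUNTERS);
	rss_cache[cpu].delta[member] += value;
}

/* Read: committed total plus deltas from CPUs currently caching this mm. */
static long rss_read(const struct mm_model *mm, int member)
{
	long sum = mm->rss[member];

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		if (rss_cache[cpu].mm == mm)
			sum += rss_cache[cpu].delta[member];
	return sum;
}
```

In this model nothing is allocated per mm or per thread, and a reader
only looks at CPUs whose cache currently points at the mm in question.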

> On that note in check_mm there is this loop:
>         for (i = 0; i < NR_MM_COUNTERS; i++) {
>                 long x = percpu_counter_sum(&mm->rss_stat[i]);
>
> This avoidably walks all cpus 4 times with a preemption and lock trip
> for each round. Instead one can observe all modifications are supposed
> to have already stopped and that this is allocated in a batch. A
> routine, say percpu_counter_sum_many_unsafe, could do one iteration
> without any locks or interrupt play and return an array. This should be
> markedly faster and I perhaps will hack it up.

This is similar to the RSS read in my earlier series. It assumes that
updates have most likely stopped, so it just reads the counters
"unsafely", with a fast double check to make sure there was no race.

What's more, when coupled with mm shootdown
(CONFIG_ARCH_PCP_RSS_USE_CPUMASK), it doesn't need to collect RSS info
on thread exit at all.
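
For illustration, the "walk the CPUs once for all counters" read
discussed above could look roughly like the userspace sketch below.
The struct is a toy stand-in, not the kernel's struct percpu_counter,
and counter_sum_many_unsafe is a made-up name. "Unsafe" here means the
caller guarantees that no updates are in flight, so the sketch takes no
lock and does not model the double check.

```c
#include <assert.h>

#define NR_CPUS        4   /* hypothetical CPU count for the model */
#define NR_MM_COUNTERS 4

/* Toy counter: a central count plus one slot per CPU. */
struct pcpu_counter_model {
	long count;
	long pcpu[NR_CPUS];
};

/*
 * Sum "nr" counters in a single walk over the CPUs and store the
 * totals in out[]. The caller promises that nothing is updating the
 * counters, so there is no locking at all.
 */
static void counter_sum_many_unsafe(const struct pcpu_counter_model *fbc,
				    int nr, long *out)
{
	assert(nr > 0);

	/* Start from the central counts. */
	for (int i = 0; i < nr; i++)
		out[i] = fbc[i].count;

	/* One pass over the CPUs, picking up every counter's slot. */
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		for (int i = 0; i < nr; i++)
			out[i] += fbc[i].pcpu[cpu];
}
```

Compared with calling a per-counter sum NR_MM_COUNTERS times, the CPU
loop runs once instead of once per counter.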

>
> A part of The Real Solution(tm) would make counter allocations scale
> (including mcid, not just rss) or dodge them (while maintaining the
> per-cpu distribution, see below for one idea), but that boils down to
> balancing scalability versus total memory usage. It is trivial to just
> slap together a per-cpu cache of these allocations and have the problem
> go away for benchmarking purposes, while probably being too memory
> hungry for actual usage.
>
> I was pondering an allocator with caches per some number of cores (say 4
> or 8). Microbenchmarks aside I suspect real workloads would not suffer
> from contention at this kind of granularity. This would trivially reduce
> memory usage compared to per-cpu caching. I suspect things like
> mm_struct, task_struct, task stacks and similar would be fine with it.
>
> Suppose mm_struct is allocated from a more coarse grained allocator than
> per-cpu. Total number of cached objects would be lower than it is now.
> That would also mean these allocated but not currently used mms could
> hold on to other stuff, for example per-cpu rss and mcid counters. Then
> should someone fork or exit, alloc/free_percpu would be avoided for most
> cases. This would scale better and be faster single-threaded than the
> current state.

And what is the issue with using only one per-CPU cache and flushing it
on mm switch? There are no more allocations after boot, and the total
(and fixed) memory usage is just a few unsigned long per CPU, which
should be even lower than the old RSS cache solution (4 unsigned long
per task). It also scaled very well with the many kinds of
microbenchmarks and workloads I've tested.

The only bad case would be a workload that keeps doing something like
"alloc one page, then switch to another mm". There I think the
performance will already be horrible due to cache invalidations and the
many switch_*() calls, so RSS isn't really a concern.
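
To put the memory comparison above into numbers, here is a tiny
helper. The per-task figure (4 unsigned long) is the one mentioned
above. The per-CPU figure is an assumption for the sake of the example
(one mm pointer plus four counters), not the exact layout from the
series.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed per-CPU layout: one mm pointer plus four counters. */
#define WORDS_PER_CPU  5
/* Old RSS cache as described above: 4 unsigned long per task. */
#define WORDS_PER_TASK 4

static_assert(WORDS_PER_CPU >= WORDS_PER_TASK,
	      "the per-CPU entry holds at least the counters");

/* Fixed footprint of a single global per-CPU cache. */
static size_t global_cache_bytes(size_t nr_cpus)
{
	return nr_cpus * WORDS_PER_CPU * sizeof(unsigned long);
}

/* Footprint of a per-task cache, which grows with the task count. */
static size_t per_task_cache_bytes(size_t nr_tasks)
{
	return nr_tasks * WORDS_PER_TASK * sizeof(unsigned long);
}
```

With made-up numbers of 64 CPUs and 10000 tasks, and assuming an 8-byte
unsigned long, that is 2560 bytes versus 320000 bytes.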

>
> (believe it or not this is not the actual long rant I have in mind)
>
> I can't commit to work on the Real Solution though.
>
> In the meantime I can submit percpu_counter_sum_many_unsafe as described
> above if Dennis likes the idea.