From: Dennis Zhou <dennisszhou@gmail.com>
To: Mateusz Guzik
Cc: Kairui Song, "zhangpeng (AS)", Rongwei Wang, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, akpm@linux-foundation.org,
	shakeelb@google.com, jack@suse.cz, surenb@google.com,
	kent.overstreet@linux.dev, mhocko@suse.cz, vbabka@suse.cz,
	yuzhao@google.com, yu.ma@intel.com, wangkefeng.wang@huawei.com,
	sunnanyong@huawei.com
Subject: Re: [RFC PATCH v2 2/2] mm: convert mm's rss stats to use atomic mode
Date: Sun, 19 May 2024 07:13:40 -0700
In-Reply-To: <5nv3r7qilsye5jgqcjrrbiry6on7wtjmce6twqbxg6nmvczue3@ikc22ggphg3h>
References: <20240418142008.2775308-1-zhangpeng362@huawei.com>
 <20240418142008.2775308-3-zhangpeng362@huawei.com>
 <5nv3r7qilsye5jgqcjrrbiry6on7wtjmce6twqbxg6nmvczue3@ikc22ggphg3h>

Hi Mateusz and Kairui,

On Thu, May 16, 2024 at 05:14:06PM +0200, Mateusz Guzik wrote:
> On Thu, May 16, 2024 at 07:50:52PM +0800, Kairui Song wrote:
> > > > On 2024/4/18 22:20, Peng Zhang wrote:
> > > >> From: ZhangPeng
> > > >>
> > > >> Since commit f1a7941243c1 ("mm: convert mm's rss stats into
> > > >> percpu_counter"), the rss_stats have been converted into
> > > >> percpu_counter, which changes the error margin from
> > > >> (nr_threads * 64) to approximately (nr_cpus ^ 2). However, the
> > > >> new percpu allocation in mm_init() causes a performance
> > > >> regression on fork/exec/shell. Even after commit 14ef95be6f55
> > > >> ("kernel/fork: group allocation/free of per-cpu counters for mm
> > > >> struct"), the performance of fork/exec/shell is still poor
> > > >> compared to previous kernel versions.
> > > >>
> > > >> To mitigate the performance regression, we delay the allocation
> > > >> of percpu memory for rss_stats. Therefore, we convert mm's rss
> > > >> stats to use percpu_counter's atomic mode. For single-threaded
> > > >> processes, rss_stat is in atomic mode, which reduces the memory
> > > >> consumption and performance regression caused by using percpu.
> > > >> For multi-threaded processes, rss_stat is switched to percpu
> > > >> mode to reduce the error margin. We convert rss_stats from
> > > >> atomic mode to percpu mode only when the second thread is
> > > >> created.
> >
> > I have a patch series that is earlier than commit f1a7941243c1 ("mm:
> > convert mm's rss stats into percpu_counter"):
> >
> > https://lwn.net/ml/linux-kernel/20220728204511.56348-1-ryncsn@gmail.com/

I hadn't seen this series, as my inbox filters on percpu but not
per-cpu. I can take a closer look this week.

> > Instead of a per-mm-per-cpu cache, it used only one global per-cpu
> > cache and flushed it on schedule. Or, if the arch supports it, it
> > flushed and fetched using an mm bitmap as an optimization (like a
> > tlb shootdown).
>
> I just spotted this thread.
>
> I have a rather long rant to write about the entire ordeal, but don't
> have the time at the moment.
> I do have time to make some remarks though.
>
> Rolling with a centralized counter and only distributing per-cpu upon
> creation of a thread is something which was discussed last time and
> which I was considering doing. Then life got in the way, and in the
> meantime I concluded it's a questionable idea anyway.

To clarify my stance, I'm not against the API of switching a
percpu_counter to percpu mode; we do it for percpu_refcount. I think
the implementation here was fragile. Secondly, Kent did implement
lazy_percpu_counters. We should likely see how that can be leveraged
and how we can reconcile the two APIs.

> The state prior to the counters moving to per-cpu was not that great
> to begin with, with quite a few serialization points. As far as
> allocating stuff goes, one example is mm_alloc_cid, with the
> following:
>
>	mm->pcpu_cid = alloc_percpu(struct mm_cid);
>
> Converting the code to avoid per-cpu rss counters in the common case,
> or the above patchset, only damage-controls the state back to what it
> was and does nothing to push things further.
>
> Another note is that, unfortunately, userspace is increasingly
> multithreaded for no good reason; see the Rust ecosystem as an
> example.
>
> All that is to say that the multithreaded case is what has to get
> faster, as a side effect possibly obsoleting both approaches proposed
> above. I concede that if there is nobody willing to commit to doing
> the work in the foreseeable future, then indeed a damage-controlling
> solution should land.
>
> On that note, in check_mm() there is this loop:
>
>	for (i = 0; i < NR_MM_COUNTERS; i++) {
>		long x = percpu_counter_sum(&mm->rss_stat[i]);
>
> This avoidably walks all cpus 4 times, with a preemption and lock trip
> for each round. Instead, one can observe that all modifications are
> supposed to have already stopped and that this is allocated in a
> batch. A routine, say percpu_counter_sum_many_unsafe, could do one
> iteration without any locks or interrupt play and return an array.
> This should be markedly faster, and I perhaps will hack it up.

I'm a little worried about the correctness in the cpu_hotplug case; I
don't know if that worry is warranted.

> A part of The Real Solution(tm) would make counter allocations scale
> (including mcid, not just rss) or dodge them (while maintaining the
> per-cpu distribution, see below for one idea), but that boils down to
> balancing scalability versus total memory usage. It is trivial to
> just slap together a per-cpu cache of these allocations and have the
> problem go away for benchmarking purposes, while probably being too
> memory hungry for actual usage.
>
> I was pondering an allocator with caches per some number of cores
> (say 4 or 8). Microbenchmarks aside, I suspect real workloads would
> not suffer from contention at this kind of granularity. This would
> trivially reduce memory usage compared to per-cpu caching. I suspect
> things like mm_struct, task_struct, task stacks, and similar would be
> fine with it.
>
> Suppose mm_struct is allocated from a more coarse-grained allocator
> than per-cpu. The total number of cached objects would be lower than
> it is now. That would also mean these allocated but not currently
> used mms could hold on to other stuff, for example per-cpu rss and
> mcid counters. Then, should someone fork or exit, alloc/free_percpu
> would be avoided in most cases. This would scale better and be faster
> single-threaded than the current state.
>
> (believe it or not this is not the actual long rant I have in mind)
>
> I can't commit to working on the Real Solution though.
>
> In the meantime, I can submit percpu_counter_sum_many_unsafe as
> described above if Dennis likes the idea.

To be honest, I'm a little out of my depth here. I haven't spent a lot
of time in the mm code paths to really know when and how we're
accounting RSS and other stats.
Given that, I think we should align on the approach we want to take,
because in some ways it sounds like percpu RSS might not be the final
evolution for this and other stats. I think let's hold off on
percpu_counter_sum_many_unsafe() initially, and if that's the way we
have to go, so be it. I'm going to respin a cpu_hotplug-related series
that might change some of the correctness here and have that ready for
the v6.11 merge window.

Thanks,
Dennis