Date: Thu, 16 May 2024 17:14:06 +0200
From: Mateusz Guzik
To: Kairui Song
Cc: "zhangpeng (AS)" , Rongwei Wang , linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, akpm@linux-foundation.org,
	dennisszhou@gmail.com, shakeelb@google.com, jack@suse.cz,
	surenb@google.com, kent.overstreet@linux.dev, mhocko@suse.cz,
	vbabka@suse.cz, yuzhao@google.com, yu.ma@intel.com,
	wangkefeng.wang@huawei.com, sunnanyong@huawei.com
Subject: Re: [RFC PATCH v2 2/2] mm: convert mm's rss stats to use atomic mode
Message-ID: <5nv3r7qilsye5jgqcjrrbiry6on7wtjmce6twqbxg6nmvczue3@ikc22ggphg3h>
References: <20240418142008.2775308-1-zhangpeng362@huawei.com>
 <20240418142008.2775308-3-zhangpeng362@huawei.com>

On Thu, May 16, 2024 at 07:50:52PM +0800, Kairui Song wrote:
> > > On 2024/4/18 22:20, Peng Zhang wrote:
> > >> From: ZhangPeng
> > >>
> > >> Since commit f1a7941243c1 ("mm: convert mm's rss stats into
> > >> percpu_counter"), the rss stats have been converted into
> > >> percpu_counter, which changes the error margin from (nr_threads * 64)
> > >> to approximately (nr_cpus ^ 2). However, the new percpu allocation in
> > >> mm_init() causes a performance regression on fork/exec/shell. Even
> > >> after commit 14ef95be6f55 ("kernel/fork: group allocation/free of
> > >> per-cpu counters for mm struct"), the performance of fork/exec/shell
> > >> is still poor compared to previous kernel versions.
> > >>
> > >> To mitigate the performance regression, we delay the allocation of
> > >> percpu memory for rss_stats. Therefore, we convert mm's rss stats to
> > >> use percpu_counter atomic mode. For single-threaded processes,
> > >> rss_stat is in atomic mode, which reduces the memory consumption and
> > >> performance regression caused by using percpu. For multi-threaded
> > >> processes, rss_stat is switched to percpu mode to reduce the error
> > >> margin. We convert rss_stats from atomic mode to percpu mode only
> > >> when the second thread is created.
>
> I have a patch series that predates commit f1a7941243c1 ("mm: convert
> mm's rss stats into percpu_counter"):
>
> https://lwn.net/ml/linux-kernel/20220728204511.56348-1-ryncsn@gmail.com/
>
> Instead of a per-mm-per-cpu cache, it used only one global per-cpu
> cache, flushed on schedule. Or, if the arch supports it, flushed and
> fetched using an mm bitmap as an optimization (like tlb shootdown).

I just spotted this thread.

I have a rather long rant to write about the entire ordeal, but don't
have the time at the moment. I do have time to make some remarks though.

Rolling with a centralized counter and only distributing per-cpu upon
creation of a thread is something which was discussed last time and
which I was considering doing. Then life got in the way, and in the
meantime I concluded it's a questionable idea anyway.

The state prior to the counters moving to per-cpu was not that great to
begin with, with quite a few serialization points. As far as allocating
stuff goes, one example is mm_alloc_cid, with the following:

	mm->pcpu_cid = alloc_percpu(struct mm_cid);

Converting the code to avoid per-cpu rss counters in the common case, or
the above patchset, only damage-controls the state back to what it was;
neither does anything to push things further.

Another note is that unfortunately userspace is increasingly
multithreaded for no good reason, see the Rust ecosystem as an example.
All that to say is that the multithreaded case is what has to get
faster, as a side effect possibly obsoleting both approaches proposed
above. I concede that if there is nobody willing to commit to doing the
work in the foreseeable future, then indeed a damage-controlling
solution should land.

On that note, in check_mm there is this loop:

	for (i = 0; i < NR_MM_COUNTERS; i++) {
		long x = percpu_counter_sum(&mm->rss_stat[i]);

This avoidably walks all cpus 4 times, with a preemption and lock trip
for each round. Instead one can observe that all modifications are
supposed to have already stopped and that the counters are allocated in
a batch. A routine, say percpu_counter_sum_many_unsafe, could do one
iteration without any locks or interrupt play and return an array. This
should be markedly faster and I perhaps will hack it up (a rough sketch
is at the end of this mail).

A part of The Real Solution(tm) would make counter allocations scale
(including mcid, not just rss) or dodge them (while maintaining the
per-cpu distribution, see below for one idea), but that boils down to
balancing scalability versus total memory usage.

It is trivial to just slap together a per-cpu cache of these allocations
and have the problem go away for benchmarking purposes, while probably
being too memory hungry for actual usage.

I was pondering an allocator with caches per some number of cores (say 4
or 8). Microbenchmarks aside, I suspect real workloads would not suffer
from contention at this kind of granularity. This would trivially reduce
memory usage compared to per-cpu caching. I suspect things like
mm_struct, task_struct, task stacks and similar would be fine with it.

Suppose mm_struct is allocated from an allocator more coarse-grained
than per-cpu. The total number of cached objects would be lower than it
is now. That would also mean these allocated but not currently used mms
could hold on to other stuff, for example per-cpu rss and mcid counters.
Then, should someone fork or exit, alloc/free_percpu would be avoided in
most cases. This would scale better and be faster single-threaded than
the current state.

(believe it or not this is not the actual long rant I have in mind)

I can't commit to working on the Real Solution though. In the meantime I
can submit percpu_counter_sum_many_unsafe as described above if Dennis
likes the idea.
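For concreteness, something along these lines is what I mean. Untested
sketch, the name and exact signature are invented here; it assumes the
counters were allocated as one array, which percpu_counter_init_many()
callers like mm_init() already do:

	/*
	 * Sum nr_counters counters allocated as one batch, walking the
	 * cpus once in total instead of once per counter, without
	 * taking fbc->lock or disabling interrupts. Only legal when
	 * all modifications are known to have stopped, as is the case
	 * in check_mm().
	 */
	static void percpu_counter_sum_many_unsafe(struct percpu_counter *fbc,
						   s64 *ret, u32 nr_counters)
	{
		int cpu;
		u32 i;

		for (i = 0; i < nr_counters; i++)
			ret[i] = fbc[i].count;

		for_each_possible_cpu(cpu) {
			/*
			 * Dead cpus had their contribution folded into
			 * ->count by the hotplug callback, so their
			 * slots read as 0 and summing them is harmless.
			 */
			for (i = 0; i < nr_counters; i++)
				ret[i] += *per_cpu_ptr(fbc[i].counters, cpu);
		}
	}

check_mm would then issue one call instead of 4 locked sums:

	s64 sums[NR_MM_COUNTERS];

	percpu_counter_sum_many_unsafe(mm->rss_stat, sums, NR_MM_COUNTERS);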
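As for the allocator granularity I was handwaving about above, a toy
sketch of the alloc side (all names invented, the free path and the
backing allocator elided):

	#define GROUP_SHIFT	2	/* 4 cpus per cache; 3 would give 8 */
	#define CACHE_SLOTS	16

	struct group_cache {
		spinlock_t lock;
		int nr;
		void *objs[CACHE_SLOTS];
	} ____cacheline_aligned_in_smp;

	/* nr_cpu_ids >> GROUP_SHIFT entries */
	static struct group_cache *group_caches;

	static void *group_cache_alloc(void)
	{
		struct group_cache *gc;
		void *obj = NULL;

		/*
		 * Migrating to another cpu mid-way merely lands us in a
		 * different group, which is harmless, hence the raw_ op.
		 */
		gc = &group_caches[raw_smp_processor_id() >> GROUP_SHIFT];

		spin_lock(&gc->lock);
		if (gc->nr)
			obj = gc->objs[--gc->nr];
		spin_unlock(&gc->lock);

		return obj;	/* NULL: fall back to the real allocator */
	}

The lock is only ever contended by a handful of cpus, while the worst
case number of objects parked in caches shrinks by the group size
compared to a strict per-cpu scheme.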