linux-mm.kvack.org archive mirror
* [RFC PATCH 0/4] Optimize rss_stat initialization/teardown for single-threaded tasks
@ 2025-11-27 23:36 Gabriel Krisman Bertazi
  2025-11-27 23:36 ` [RFC PATCH 1/4] lib/percpu_counter: Split out a helper to insert into hotplug list Gabriel Krisman Bertazi
                   ` (4 more replies)
  0 siblings, 5 replies; 19+ messages in thread
From: Gabriel Krisman Bertazi @ 2025-11-27 23:36 UTC
  To: linux-mm
  Cc: Gabriel Krisman Bertazi, linux-kernel, jack, Mateusz Guzik,
	Shakeel Butt, Michal Hocko, Mathieu Desnoyers, Dennis Zhou,
	Tejun Heo, Christoph Lameter, Andrew Morton, David Hildenbrand,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan

The cost of the pcpu memory allocation is non-negligible for systems
with many cpus, and it is quite visible when forking a new task, as
reported on a few occasions.  In particular, Jan Kara reported that the
commit introducing per-cpu counters for rss_stat caused a 10% regression
of system time for gitsource on his system [1].  On that same occasion,
Jan suggested we special-case the single-threaded case: since we know
there won't be frequent remote updates of rss_stat for single-threaded
applications, we could use a local counter for most updates, and an
atomic counter for the infrequent remote updates.  This patchset
implements this idea.
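
To make the idea above concrete, here is a minimal userspace sketch of a
counter split into a plain local part and an atomic remote part.  It is
only an illustration: the struct and function names are hypothetical and
are not taken from the patches, and C11 atomics stand in for the kernel
primitives.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical illustration only; not the data structure from the patches. */
struct simple_counter {
	int64_t local;              /* touched only by the owning thread */
	atomic_int_fast64_t remote; /* touched by any other context */
};

static inline void simple_counter_init(struct simple_counter *c)
{
	c->local = 0;
	atomic_init(&c->remote, 0);
}

/* Fast path: the caller is the single thread that owns the counter. */
static inline void simple_counter_add_local(struct simple_counter *c, int64_t n)
{
	c->local += n;
}

/* Slow path: infrequent update coming from another context. */
static inline void simple_counter_add_remote(struct simple_counter *c, int64_t n)
{
	atomic_fetch_add_explicit(&c->remote, n, memory_order_relaxed);
}

/*
 * Sum of both parts.  In this sketch the local part is read without any
 * synchronization, so call this from the owning thread.
 */
static inline int64_t simple_counter_read(struct simple_counter *c)
{
	return c->local +
	       atomic_load_explicit(&c->remote, memory_order_relaxed);
}
```

Initialization here is two stores and needs no allocation, which is the
property that makes this mode cheap at fork time.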

It exposes a dual-mode counter that starts as a simple counter, cheap to
initialize on single-threaded tasks, and that can be upgraded in flight
to a fully-fledged per-cpu counter later.  Patch 3 then modifies the
rss_stat counters to use that structure, forcing the upgrade as soon as
a second task sharing the mm_struct is spawned.  By delaying the
initialization cost until the MM is shared, we cover single-threaded
applications fairly cheaply, while not penalizing applications that
spawn multiple threads.  On a 256c system, where the pcpu allocation of
the rss_stats is quite noticeable, this has reduced the wall-clock time
by between 6% and 15% (depending on the number of cores) in an
artificial fork-intensive microbenchmark (calling /bin/true in a loop).
In a more realistic benchmark, it showed an improvement of 1.5% in
kernbench elapsed time.
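
As a rough picture of the deferred upgrade described above, the sketch
below keeps a single value until an explicit upgrade allocates per-slot
storage, after which updates go to the slots and reads sum everything.
All names are hypothetical and do not reflect the interface in
lazy_percpu_counter.h; NR_SLOTS is an assumed stand-in for the number of
possible CPUs, and the synchronization a real in-flight upgrade needs is
left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define NR_SLOTS 4 /* assumption: stand-in for the number of possible CPUs */

/* Hypothetical illustration only; single-threaded, no locking. */
struct lazy_counter {
	int64_t count;  /* whole value in simple mode, base value after upgrade */
	int64_t *slots; /* NULL in simple mode, per-slot deltas after upgrade */
};

static void lazy_counter_init(struct lazy_counter *c)
{
	c->count = 0;
	c->slots = NULL; /* no allocation until the counter is upgraded */
}

/* Pay the allocation cost only now; returns false if allocation fails. */
static bool lazy_counter_upgrade(struct lazy_counter *c)
{
	if (c->slots)
		return true; /* already upgraded */
	c->slots = calloc(NR_SLOTS, sizeof(*c->slots));
	return c->slots != NULL;
}

static void lazy_counter_add(struct lazy_counter *c, unsigned int slot, int64_t n)
{
	assert(slot < NR_SLOTS);
	if (c->slots)
		c->slots[slot] += n; /* upgraded: update the caller's slot */
	else
		c->count += n;       /* simple mode: update the single value */
}

static int64_t lazy_counter_sum(const struct lazy_counter *c)
{
	int64_t sum = c->count;

	if (c->slots)
		for (unsigned int i = 0; i < NR_SLOTS; i++)
			sum += c->slots[i];
	return sum;
}

static void lazy_counter_destroy(struct lazy_counter *c)
{
	free(c->slots);
	c->slots = NULL;
}
```

In this sketch the value accumulated before the upgrade simply stays in
the base field, so nothing has to be copied when the mode changes.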

More performance data, including profiles, is available in the patch
modifying the rss_stat counters.

While this patchset has a single user of this API, it should be useful
in more cases, which is why I made it into a proper API.  In addition,
considering the recent efforts in this area, such as hierarchical
per-cpu counters, which are orthogonal to this work because they improve
multi-threaded workloads, abstracting this with a new API could help the
merging of both works.

Finally, this is an RFC because it is early work.  In particular, I'd
be interested in more benchmark suggestions, and I'd like feedback on
whether this new interface should be implemented inside percpu_counters
as lazy counters or as a completely separate interface.

Thanks,

[1] https://lore.kernel.org/all/20230608111408.s2minsenlcjow7q3@quack3

---

Cc: linux-kernel@vger.kernel.org
Cc: jack@suse.cz
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Lameter <cl@gentwo.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>

Gabriel Krisman Bertazi (4):
  lib/percpu_counter: Split out a helper to insert into hotplug list
  lib: Support lazy initialization of per-cpu counters
  mm: Avoid percpu MM counters on single-threaded tasks
  mm: Split a slow path for updating mm counters

 arch/s390/mm/gmap_helpers.c         |   4 +-
 arch/s390/mm/pgtable.c              |   4 +-
 fs/exec.c                           |   2 +-
 include/linux/lazy_percpu_counter.h | 145 ++++++++++++++++++++++++++++
 include/linux/mm.h                  |  26 ++---
 include/linux/mm_types.h            |   4 +-
 include/linux/percpu_counter.h      |   5 +-
 include/trace/events/kmem.h         |   4 +-
 kernel/events/uprobes.c             |   2 +-
 kernel/fork.c                       |  14 ++-
 lib/percpu_counter.c                |  68 ++++++++++---
 mm/filemap.c                        |   2 +-
 mm/huge_memory.c                    |  22 ++---
 mm/khugepaged.c                     |   6 +-
 mm/ksm.c                            |   2 +-
 mm/madvise.c                        |   2 +-
 mm/memory.c                         |  20 ++--
 mm/migrate.c                        |   2 +-
 mm/migrate_device.c                 |   2 +-
 mm/rmap.c                           |  16 +--
 mm/swapfile.c                       |   6 +-
 mm/userfaultfd.c                    |   2 +-
 22 files changed, 276 insertions(+), 84 deletions(-)
 create mode 100644 include/linux/lazy_percpu_counter.h

-- 
2.51.0




Thread overview: 19+ messages
2025-11-27 23:36 [RFC PATCH 0/4] Optimize rss_stat initialization/teardown for single-threaded tasks Gabriel Krisman Bertazi
2025-11-27 23:36 ` [RFC PATCH 1/4] lib/percpu_counter: Split out a helper to insert into hotplug list Gabriel Krisman Bertazi
2025-11-27 23:36 ` [RFC PATCH 2/4] lib: Support lazy initialization of per-cpu counters Gabriel Krisman Bertazi
2025-11-27 23:36 ` [RFC PATCH 3/4] mm: Avoid percpu MM counters on single-threaded tasks Gabriel Krisman Bertazi
2025-11-27 23:36 ` [RFC PATCH 4/4] mm: Split a slow path for updating mm counters Gabriel Krisman Bertazi
2025-12-01 10:19   ` David Hildenbrand (Red Hat)
2025-11-28 13:30 ` [RFC PATCH 0/4] Optimize rss_stat initialization/teardown for single-threaded tasks Mathieu Desnoyers
2025-11-28 20:10   ` Jan Kara
2025-11-28 20:12     ` Mathieu Desnoyers
2025-11-29  5:57     ` Mateusz Guzik
2025-11-29  7:50       ` Mateusz Guzik
2025-12-01 10:38       ` Harry Yoo
2025-12-01 11:31         ` Mateusz Guzik
2025-12-01 14:47           ` Mathieu Desnoyers
2025-12-01 15:23       ` Gabriel Krisman Bertazi
2025-12-01 19:16         ` Harry Yoo
2025-12-03 11:02         ` Mateusz Guzik
2025-12-03 11:54           ` Mateusz Guzik
2025-12-03 14:36             ` Mateusz Guzik
