From mboxrd@z Thu Jan 1 00:00:00 1970
From: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
To: Andrew Morton
Cc: linux-kernel@vger.kernel.org, Mathieu Desnoyers, "Paul E. McKenney",
	Steven Rostedt, Masami Hiramatsu, Dennis Zhou, Tejun Heo,
	Christoph Lameter, Martin Liu, David Rientjes, christian.koenig@amd.com,
	Shakeel Butt, Johannes Weiner, Sweet Tea Dorminy, Lorenzo Stoakes,
	"Liam R. Howlett", Suren Baghdasaryan, Vlastimil Babka,
	Christian Brauner, Wei Yang, David Hildenbrand, Miaohe Lin, Al Viro,
	linux-mm@kvack.org, linux-trace-kernel@vger.kernel.org, Yu Zhao,
	Roman Gushchin, Mateusz Guzik, Matthew Wilcox
Subject: [RFC PATCH v5 2/2] mm: Fix OOM killer inaccuracy on large many-core systems
Date: Thu, 3 Jul 2025 13:38:13 -0400
Message-Id: <20250703173813.18432-3-mathieu.desnoyers@efficios.com>
X-Mailer: git-send-email 2.39.5
In-Reply-To: <20250703173813.18432-1-mathieu.desnoyers@efficios.com>
References: <20250703173813.18432-1-mathieu.desnoyers@efficios.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit

Use hierarchical per-CPU counters for per-mm RSS tracking, which has
become too inaccurate for OOM killer purposes on large many-core
systems.

The following RSS tracking issues were noted by Sweet Tea Dorminy [1],
and led to the wrong tasks being picked as OOM kill targets:

  Recently, several internal services had an RSS usage regression as
  part of a kernel upgrade. Previously, they were on a pre-6.2 kernel
  and were able to read RSS statistics in a backup watchdog process to
  monitor and decide if they'd overrun their memory budget. Now,
  however, a representative service with five threads, expected to use
  about a hundred MB of memory, on a 250-CPU machine had memory usage
  tens of megabytes different from the expected amount -- this
  constituted a significant percentage of inaccuracy, causing the
  watchdog to act.

  This was a result of f1a7941243c1 ("mm: convert mm's rss stats into
  percpu_counter") [1]. Previously, the memory error was bounded by
  64*nr_threads pages, a very livable megabyte.
  Now, however, as a result of scheduler decisions moving the threads
  around the CPUs, the memory error could be as large as a gigabyte.

  This is a really tremendous inaccuracy for any few-threaded program
  on a large machine and impedes monitoring significantly. These stat
  counters are also used to make OOM killing decisions, so this
  additional inaccuracy could make a big difference in OOM situations
  -- either resulting in the wrong process being killed, or in less
  memory being returned from an OOM-kill than expected.

Here is a (possibly incomplete) list of the prior approaches that were
used or proposed, along with their downsides:

1) Per-thread RSS tracking: large error on many-threaded processes.

2) Per-CPU counters: up to 12% slower for short-lived processes and 9%
   increased system time in make test workloads [1]. Moreover, the
   inaccuracy increases as O(n^2) with the number of CPUs.

3) Per-NUMA-node counters: requires atomics on the fast path
   (overhead), and the error is high on systems with many NUMA nodes
   (32 times the number of NUMA nodes).

The approach proposed here is to replace this with hierarchical per-CPU
counters, which bound the inaccuracy based on the system topology with
O(N*logN).

Link: https://lore.kernel.org/lkml/20250331223516.7810-2-sweettea-kernel@dorminy.me/ # [1]
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton
Cc: "Paul E. McKenney"
Cc: Steven Rostedt
Cc: Masami Hiramatsu
Cc: Mathieu Desnoyers
Cc: Dennis Zhou
Cc: Tejun Heo
Cc: Christoph Lameter
Cc: Martin Liu
Cc: David Rientjes
Cc: christian.koenig@amd.com
Cc: Shakeel Butt
Cc: Johannes Weiner
Cc: Sweet Tea Dorminy
Cc: Lorenzo Stoakes
Cc: "Liam R. Howlett"
Cc: Suren Baghdasaryan
Cc: Vlastimil Babka
Cc: Christian Brauner
Cc: Wei Yang
Cc: David Hildenbrand
Cc: Miaohe Lin
Cc: Al Viro
Cc: linux-mm@kvack.org
Cc: linux-trace-kernel@vger.kernel.org
Cc: Yu Zhao
Cc: Roman Gushchin
Cc: Mateusz Guzik
Cc: Matthew Wilcox
---
Changes since v4:
- get_mm_counter needs to return 0 or a positive value.
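As a rough illustration for reviewers of why a hierarchical layout keeps
the approximation error bounded by the topology rather than by thread
migration patterns, here is a minimal single-threaded userspace sketch.
It is not the percpu_counter_tree API added by patch 1/2: the names
(toy_counter, toy_add, toy_approx_read, TOY_BATCH) and the fixed
two-level CPU/socket split are invented for this example, and the real
counters presumably propagate carries concurrently across a
topology-derived tree with more levels.

/*
 * Illustrative userspace model only -- not the kernel API. Per-CPU
 * leaves flush into per-socket nodes once they accumulate TOY_BATCH,
 * and sockets flush into a global value at the same threshold, so an
 * approximate read of the global value is off by strictly less than
 * (NR_CPUS + NR_SOCKETS) * TOY_BATCH, no matter how the increments are
 * scattered across CPUs.
 */
#include <stdio.h>
#include <stdlib.h>

#define NR_CPUS		8
#define NR_SOCKETS	2
#define TOY_BATCH	32

struct toy_counter {
	long cpu_delta[NR_CPUS];	/* leaf level: one slot per CPU */
	long socket_delta[NR_SOCKETS];	/* intermediate level */
	long global;			/* value seen by approximate reads */
};

static void toy_add(struct toy_counter *c, int cpu, long v)
{
	int socket = cpu * NR_SOCKETS / NR_CPUS;

	c->cpu_delta[cpu] += v;
	if (labs(c->cpu_delta[cpu]) >= TOY_BATCH) {
		/* Carry the accumulated leaf delta up one level. */
		c->socket_delta[socket] += c->cpu_delta[cpu];
		c->cpu_delta[cpu] = 0;
		if (labs(c->socket_delta[socket]) >= TOY_BATCH) {
			c->global += c->socket_delta[socket];
			c->socket_delta[socket] = 0;
		}
	}
}

static long toy_approx_read(const struct toy_counter *c)
{
	return c->global;	/* cheap read: no per-CPU summation */
}

static long toy_precise_read(const struct toy_counter *c)
{
	long sum = c->global;
	int i;

	for (i = 0; i < NR_CPUS; i++)
		sum += c->cpu_delta[i];
	for (i = 0; i < NR_SOCKETS; i++)
		sum += c->socket_delta[i];
	return sum;
}

int main(void)
{
	struct toy_counter c = { 0 };
	int i;

	/* Scatter 1000 page-sized increments across CPUs, as a migrating task would. */
	for (i = 0; i < 1000; i++)
		toy_add(&c, i % NR_CPUS, 1);

	printf("approximate=%ld precise=%ld worst-case error bound=%d\n",
	       toy_approx_read(&c), toy_precise_read(&c),
	       (NR_CPUS + NR_SOCKETS) * TOY_BATCH);
	return 0;
}

Built with a plain cc, this should print approximate=768 precise=1000
against a worst-case bound of 320 for the parameters above. The cheap
root-only read corresponds to the approximate sum used in
get_mm_counter() below, while the full walk corresponds to the precise
sum used in check_mm().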
---
 include/linux/mm.h          | 10 ++++++----
 include/linux/mm_types.h    |  4 ++--
 include/trace/events/kmem.h |  2 +-
 kernel/fork.c               | 31 +++++++++++++++++++++----------
 4 files changed, 30 insertions(+), 17 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index e51dba8398f7..18ccb51dad88 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2705,28 +2705,30 @@ static inline bool get_user_page_fast_only(unsigned long addr,
  */
 static inline unsigned long get_mm_counter(struct mm_struct *mm, int member)
 {
-	return percpu_counter_read_positive(&mm->rss_stat[member]);
+	int v = percpu_counter_tree_approximate_sum(&mm->rss_stat[member]);
+
+	return v > 0 ? v : 0;
 }
 
 void mm_trace_rss_stat(struct mm_struct *mm, int member);
 
 static inline void add_mm_counter(struct mm_struct *mm, int member, long value)
 {
-	percpu_counter_add(&mm->rss_stat[member], value);
+	percpu_counter_tree_add(&mm->rss_stat[member], value);
 
 	mm_trace_rss_stat(mm, member);
 }
 
 static inline void inc_mm_counter(struct mm_struct *mm, int member)
 {
-	percpu_counter_inc(&mm->rss_stat[member]);
+	percpu_counter_tree_add(&mm->rss_stat[member], 1);
 
 	mm_trace_rss_stat(mm, member);
 }
 
 static inline void dec_mm_counter(struct mm_struct *mm, int member)
 {
-	percpu_counter_dec(&mm->rss_stat[member]);
+	percpu_counter_tree_add(&mm->rss_stat[member], -1);
 
 	mm_trace_rss_stat(mm, member);
 }
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 56d07edd01f9..85b15109106a 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -18,7 +18,7 @@
 #include <linux/page-flags-layout.h>
 #include <linux/workqueue.h>
 #include <linux/seqlock.h>
-#include <linux/percpu_counter.h>
+#include <linux/percpu_counter_tree.h>
 #include <linux/types.h>
 
 #include <asm/mmu.h>
@@ -1059,7 +1059,7 @@ struct mm_struct {
 
 		unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */
 
-		struct percpu_counter rss_stat[NR_MM_COUNTERS];
+		struct percpu_counter_tree rss_stat[NR_MM_COUNTERS];
 
 		struct linux_binfmt *binfmt;
 
diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h
index f74925a6cf69..d6199b99c771 100644
--- a/include/trace/events/kmem.h
+++ b/include/trace/events/kmem.h
@@ -477,7 +477,7 @@ TRACE_EVENT(rss_stat,
 		__entry->mm_id = mm_ptr_to_hash(mm);
 		__entry->curr = !!(current->mm == mm);
 		__entry->member = member;
-		__entry->size = (percpu_counter_sum_positive(&mm->rss_stat[member])
+		__entry->size = (percpu_counter_tree_approximate_sum(&mm->rss_stat[member])
 				<< PAGE_SHIFT);
 	),
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 168681fc4b25..dd458adc5543 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -129,6 +129,11 @@
  */
 #define MAX_THREADS FUTEX_TID_MASK
 
+/*
+ * Batch size of rss stat approximation
+ */
+#define RSS_STAT_BATCH_SIZE 32
+
 /*
  * Protected counters by write_lock_irq(&tasklist_lock)
  */
@@ -843,11 +848,10 @@ static void check_mm(struct mm_struct *mm)
 			 "Please make sure 'struct resident_page_types[]' is updated as well");
 
 	for (i = 0; i < NR_MM_COUNTERS; i++) {
-		long x = percpu_counter_sum(&mm->rss_stat[i]);
-
-		if (unlikely(x))
-			pr_alert("BUG: Bad rss-counter state mm:%p type:%s val:%ld\n",
-				 mm, resident_page_types[i], x);
+		if (unlikely(percpu_counter_tree_precise_compare_value(&mm->rss_stat[i], 0) != 0))
+			pr_alert("BUG: Bad rss-counter state mm:%p type:%s val:%d\n",
+				 mm, resident_page_types[i],
+				 percpu_counter_tree_precise_sum(&mm->rss_stat[i]));
 	}
 
 	if (mm_pgtables_bytes(mm))
@@ -930,6 +934,8 @@ static void cleanup_lazy_tlbs(struct mm_struct *mm)
  */
 void __mmdrop(struct mm_struct *mm)
 {
+	int i;
+
 	BUG_ON(mm == &init_mm);
 	WARN_ON_ONCE(mm == current->mm);
 
@@ -945,8 +951,8 @@ void __mmdrop(struct mm_struct *mm)
 	put_user_ns(mm->user_ns);
 	mm_pasid_drop(mm);
 	mm_destroy_cid(mm);
-	percpu_counter_destroy_many(mm->rss_stat, NR_MM_COUNTERS);
-
+	for (i = 0; i < NR_MM_COUNTERS; i++)
+		percpu_counter_tree_destroy(&mm->rss_stat[i]);
 	free_mm(mm);
 }
 EXPORT_SYMBOL_GPL(__mmdrop);
@@ -1285,6 +1291,8 @@ static void mmap_init_lock(struct mm_struct *mm)
 static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	struct user_namespace *user_ns)
 {
+	int i;
+
 	mt_init_flags(&mm->mm_mt, MM_MT_FLAGS);
 	mt_set_external_lock(&mm->mm_mt, &mm->mmap_lock);
 	atomic_set(&mm->mm_users, 1);
@@ -1332,15 +1340,18 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	if (mm_alloc_cid(mm, p))
 		goto fail_cid;
 
-	if (percpu_counter_init_many(mm->rss_stat, 0, GFP_KERNEL_ACCOUNT,
-				     NR_MM_COUNTERS))
-		goto fail_pcpu;
+	for (i = 0; i < NR_MM_COUNTERS; i++) {
+		if (percpu_counter_tree_init(&mm->rss_stat[i], RSS_STAT_BATCH_SIZE, GFP_KERNEL_ACCOUNT))
+			goto fail_pcpu;
+	}
 
 	mm->user_ns = get_user_ns(user_ns);
 	lru_gen_init_mm(mm);
 	return mm;
 
 fail_pcpu:
+	for (i--; i >= 0; i--)
+		percpu_counter_tree_destroy(&mm->rss_stat[i]);
 	mm_destroy_cid(mm);
 fail_cid:
 	destroy_context(mm);
-- 
2.39.5