From: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
To: Andrew Morton
Cc: linux-kernel@vger.kernel.org, Mathieu Desnoyers, "Paul E. McKenney",
	Steven Rostedt, Masami Hiramatsu, Dennis Zhou, Tejun Heo,
	Christoph Lameter, Martin Liu, David Rientjes, christian.koenig@amd.com,
	Shakeel Butt, SeongJae Park, Michal Hocko, Johannes Weiner,
	Sweet Tea Dorminy, Lorenzo Stoakes, "Liam R. Howlett", Mike Rapoport,
	Suren Baghdasaryan, Vlastimil Babka, Christian Brauner, Wei Yang,
	David Hildenbrand, Miaohe Lin, Al Viro, linux-mm@kvack.org,
	linux-trace-kernel@vger.kernel.org, Yu Zhao, Roman Gushchin,
	Mateusz Guzik, Matthew Wilcox, Baolin Wang, Aboorva Devarajan
Subject: [PATCH v12 2/3] mm: Fix OOM killer inaccuracy on large many-core systems
Date: Sun, 11 Jan 2026 10:02:48 -0500
Message-Id: <20260111150249.1222944-3-mathieu.desnoyers@efficios.com>
X-Mailer: git-send-email 2.39.5
In-Reply-To: <20260111150249.1222944-1-mathieu.desnoyers@efficios.com>
References: <20260111150249.1222944-1-mathieu.desnoyers@efficios.com>

Use hierarchical per-CPU counters for rss tracking to fix the per-mm RSS
tracking, which has become too inaccurate for OOM killer purposes on large
many-core systems.

The following rss tracking issues were noted by Sweet Tea Dorminy [1]; they
led to the wrong tasks being picked as OOM kill targets:

  Recently, several internal services had an RSS usage regression as part of
  a kernel upgrade. Previously, they were on a pre-6.2 kernel and were able
  to read RSS statistics in a backup watchdog process to monitor and decide
  if they'd overrun their memory budget. Now, however, a representative
  service with five threads, expected to use about a hundred MB of memory,
  on a 250-cpu machine had memory usage tens of megabytes different from the
  expected amount -- this constituted a significant percentage of
  inaccuracy, causing the watchdog to act.

  This was a result of commit f1a7941243c1 ("mm: convert mm's rss stats into
  percpu_counter") [1]. Previously, the memory error was bounded by
  64*nr_threads pages, a very livable megabyte. Now, however, as a result of
  scheduler decisions moving the threads around the CPUs, the memory error
  could be as large as a gigabyte.

  This is a really tremendous inaccuracy for any few-threaded program on a
  large machine and impedes monitoring significantly. These stat counters
  are also used to make OOM killing decisions, so this additional inaccuracy
  could make a big difference in OOM situations -- either resulting in the
  wrong process being killed, or in less memory being returned from an
  OOM-kill than expected.

Here is a (possibly incomplete) list of the prior approaches that were used
or proposed, along with their downsides:

1) Per-thread rss tracking: large error on many-thread processes.

2) Per-CPU counters: up to 12% slower for short-lived processes and 9%
   increased system time in make test workloads [1]. Moreover, the
   inaccuracy grows as O(n^2) with the number of CPUs: the per-CPU batch
   size itself scales with the number of CPUs, so the worst-case drift is
   roughly nr_cpus * batch pages.

3) Per-NUMA-node counters: require atomics on the fast path (overhead), and
   the error is high on systems with many NUMA nodes (32 times the number
   of NUMA nodes).

The approach proposed here is to replace this with hierarchical per-CPU
counters, which bound the inaccuracy based on the system topology, with
O(N*logN).

Commit 82241a83cd15 ("mm: fix the inaccurate memory statistics issue for
users") introduced get_mm_counter_sum() for precise /proc memory status
queries. Implement it with percpu_counter_tree_precise_sum(), since it is
not a fast path and precision is preferred over speed.

* Testing results:

Test hardware: 2-socket AMD EPYC 9654 96-Core Processor
(384 logical CPUs total)

Methodology: The current upstream implementation is compared with the
hierarchical counters by keeping both implementations wired up in parallel
and running a single-process, single-threaded program which hops randomly
across the CPUs in the system, calling mmap(2) and munmap(2) on random CPUs,
keeping track of an array of allocated mappings, and randomly choosing
entries to either map or unmap. get_mm_counter() is instrumented to compare
the upstream counter approximation to the precise value, and to print the
delta when it goes over a given threshold. The delta of the hierarchical
counter approximation to the precise value is also printed for comparison.
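For illustration, a minimal sketch of the kind of test loop described above
could look like the following (hypothetical userspace code, not the actual
instrumented test program; the kernel-side instrumentation of
get_mm_counter() is assumed to report the deltas):

  /* Hypothetical sketch of a CPU-hopping mmap/munmap stress loop. */
  #define _GNU_SOURCE
  #include <sched.h>
  #include <stdlib.h>
  #include <string.h>
  #include <sys/mman.h>
  #include <unistd.h>

  #define NR_SLOTS	1024
  #define MAP_LEN	(1UL << 20)	/* 1MB per mapping */

  int main(void)
  {
  	static void *slot[NR_SLOTS];
  	long nr_cpus = sysconf(_SC_NPROCESSORS_ONLN);

  	for (;;) {
  		cpu_set_t set;
  		int i = rand() % NR_SLOTS;

  		/* Hop to a random CPU so counter updates spread across the topology. */
  		CPU_ZERO(&set);
  		CPU_SET(rand() % nr_cpus, &set);
  		(void)sched_setaffinity(0, sizeof(set), &set);

  		/* Randomly map or unmap one entry of the mapping array. */
  		if (slot[i]) {
  			munmap(slot[i], MAP_LEN);
  			slot[i] = NULL;
  		} else {
  			slot[i] = mmap(NULL, MAP_LEN, PROT_READ | PROT_WRITE,
  				       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  			if (slot[i] == MAP_FAILED)
  				slot[i] = NULL;
  			else
  				memset(slot[i], 0, MAP_LEN);	/* fault pages in */
  		}
  	}
  	return 0;
  }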
After a few minutes of running this test, the upstream implementation's
counter approximation reaches a 1GB delta from the precise value, compared
to an 80MB delta with the hierarchical counter. The hierarchical counter
provides a guaranteed maximum approximation inaccuracy of 192MB on that
hardware topology.

* Fast path implementation comparison

The new inline percpu_counter_tree_add() uses this_cpu_add_return() for the
fast path (under a certain allocation size threshold). Above that threshold,
it calls a slow path which "trickles up" the carry to upper-level counters
with atomic_add_return(). In comparison, the upstream counters
implementation calls percpu_counter_add_batch(), which uses
this_cpu_try_cmpxchg() on the fast path and takes a raw_spin_lock_irqsave()
above a certain threshold.

The hierarchical implementation is therefore expected to have less
contention on mid-sized allocations than the upstream counters, because the
atomic counters tracking those bits are only shared across nearby CPUs. In
comparison, the upstream counters immediately take a global spinlock when
reaching the threshold.

* Benchmarks

The will-it-scale page_fault1 benchmarks are used to compare the upstream
counters to the hierarchical counters, with hyperthreading disabled. The
speedup is within the standard deviation of the upstream runs, so the
overhead is not significant.

                                       upstream   hierarchical   speedup
page_fault1_processes -s 100 -t 1        614783         615558     +0.1%
page_fault1_threads   -s 100 -t 1        612788         612447     -0.1%
page_fault1_processes -s 100 -t 96     37994977       37932035     -0.2%
page_fault1_threads   -s 100 -t 96      2484130        2504860     +0.8%
page_fault1_processes -s 100 -t 192    71262917       71118830     -0.2%
page_fault1_threads   -s 100 -t 192     2446437        2469296     +0.1%

Link: https://lore.kernel.org/lkml/20250331223516.7810-2-sweettea-kernel@dorminy.me/ # [1]
Link: https://lore.kernel.org/lkml/20250704150226.47980-1-mathieu.desnoyers@efficios.com/
Signed-off-by: Mathieu Desnoyers
Cc: Andrew Morton
Cc: "Paul E. McKenney"
Cc: Steven Rostedt
Cc: Masami Hiramatsu
Cc: Mathieu Desnoyers
Cc: Dennis Zhou
Cc: Tejun Heo
Cc: Christoph Lameter
Cc: Martin Liu
Cc: David Rientjes
Cc: christian.koenig@amd.com
Cc: Shakeel Butt
Cc: SeongJae Park
Cc: Michal Hocko
Cc: Johannes Weiner
Cc: Sweet Tea Dorminy
Cc: Lorenzo Stoakes
Cc: "Liam R. Howlett"
Cc: Mike Rapoport
Cc: Suren Baghdasaryan
Cc: Vlastimil Babka
Cc: Christian Brauner
Cc: Wei Yang
Cc: David Hildenbrand
Cc: Miaohe Lin
Cc: Al Viro
Cc: linux-mm@kvack.org
Cc: linux-trace-kernel@vger.kernel.org
Cc: Yu Zhao
Cc: Roman Gushchin
Cc: Mateusz Guzik
Cc: Matthew Wilcox
Cc: Baolin Wang
Cc: Aboorva Devarajan
---
Changes since v10:
- Rebase on top of mm_struct static init fixes.
- Change the alignment of the mm_struct flexible array to the alignment of
  the rss counters items (which are cacheline-aligned on SMP).
- Move the rss counters items to the first position within the flexible
  array at the end of the mm_struct, to place content in decreasing
  alignment requirement order.

Changes since v8:
- Use the percpu_counter_tree_init_many and percpu_counter_tree_destroy_many
  APIs.
- Remove the percpu tree items allocation. Extend the mm_struct size to
  include the rss items. Those are handled through the new helpers
  get_rss_stat_items() and get_rss_stat_items_size() and passed as a
  parameter to percpu_counter_tree_init_many().

Changes since v7:
- Use the precise sum positive API to handle a scenario where an unlucky
  precise sum iteration would observe negative counter values due to
  concurrent updates.

Changes since v6:
- Rebased on v6.18-rc3.
- Implement get_mm_counter_sum as percpu_counter_tree_precise_sum for /proc
  virtual files memory state queries.

Changes since v5:
- Use percpu_counter_tree_approximate_sum_positive.

Change since v4:
- get_mm_counter needs to return 0 or a positive value.
---
 include/linux/mm.h          | 19 ++++++++++----
 include/linux/mm_types.h    | 50 +++++++++++++++++++++++++++----------
 include/trace/events/kmem.h |  2 +-
 kernel/fork.c               | 24 ++++++++++--------
 4 files changed, 66 insertions(+), 29 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 6f959d8ca4b4..6d938b3e3709 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2843,38 +2843,47 @@ static inline bool get_user_page_fast_only(unsigned long addr,
 {
 	return get_user_pages_fast_only(addr, 1, gup_flags, pagep) == 1;
 }
+
+static inline struct percpu_counter_tree_level_item *get_rss_stat_items(struct mm_struct *mm)
+{
+	unsigned long ptr = (unsigned long)mm;
+
+	ptr += offsetof(struct mm_struct, flexible_array);
+	return (struct percpu_counter_tree_level_item *)ptr;
+}
+
 /*
  * per-process(per-mm_struct) statistics.
  */
 static inline unsigned long get_mm_counter(struct mm_struct *mm, int member)
 {
-	return percpu_counter_read_positive(&mm->rss_stat[member]);
+	return percpu_counter_tree_approximate_sum_positive(&mm->rss_stat[member]);
 }
 
 static inline unsigned long get_mm_counter_sum(struct mm_struct *mm, int member)
 {
-	return percpu_counter_sum_positive(&mm->rss_stat[member]);
+	return percpu_counter_tree_precise_sum_positive(&mm->rss_stat[member]);
 }
 
 void mm_trace_rss_stat(struct mm_struct *mm, int member);
 
 static inline void add_mm_counter(struct mm_struct *mm, int member, long value)
 {
-	percpu_counter_add(&mm->rss_stat[member], value);
+	percpu_counter_tree_add(&mm->rss_stat[member], value);
 	mm_trace_rss_stat(mm, member);
 }
 
 static inline void inc_mm_counter(struct mm_struct *mm, int member)
 {
-	percpu_counter_inc(&mm->rss_stat[member]);
+	percpu_counter_tree_add(&mm->rss_stat[member], 1);
 	mm_trace_rss_stat(mm, member);
 }
 
 static inline void dec_mm_counter(struct mm_struct *mm, int member)
 {
-	percpu_counter_dec(&mm->rss_stat[member]);
+	percpu_counter_tree_add(&mm->rss_stat[member], -1);
 	mm_trace_rss_stat(mm, member);
 }
 
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 9f861ceabe61..c3e8f0ce3112 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -18,7 +18,7 @@
 #include
 #include
 #include
-#include
+#include
 #include
 #include
 #include
@@ -1070,6 +1070,19 @@ typedef struct {
 	DECLARE_BITMAP(__mm_flags, NUM_MM_FLAG_BITS);
 } __private mm_flags_t;
 
+/*
+ * The alignment of the mm_struct flexible array is based on the largest
+ * alignment of its content:
+ * __alignof__(struct percpu_counter_tree_level_item) provides a
+ * cacheline aligned alignment on SMP systems, else alignment on
+ * unsigned long on UP systems.
+ */
+#ifdef CONFIG_SMP
+# define __mm_struct_flexible_array_aligned __aligned(__alignof__(struct percpu_counter_tree_level_item))
+#else
+# define __mm_struct_flexible_array_aligned __aligned(__alignof__(unsigned long))
+#endif
+
 struct kioctx_table;
 struct iommu_mm_data;
 struct mm_struct {
@@ -1215,7 +1228,7 @@ struct mm_struct {
 		unsigned long saved_e_flags;
 #endif
 
-		struct percpu_counter rss_stat[NR_MM_COUNTERS];
+		struct percpu_counter_tree rss_stat[NR_MM_COUNTERS];
 
 		struct linux_binfmt *binfmt;
 
@@ -1326,10 +1339,13 @@ struct mm_struct {
 	} __randomize_layout;
 
 	/*
-	 * The mm_cpumask needs to be at the end of mm_struct, because it
-	 * is dynamically sized based on nr_cpu_ids.
+	 * The rss hierarchical counter items, mm_cpumask, and mm_cid
+	 * masks need to be at the end of mm_struct, because they are
+	 * dynamically sized based on nr_cpu_ids.
+	 * The content of the flexible array needs to be placed in
+	 * decreasing alignment requirement order.
 	 */
-	char flexible_array[] __aligned(__alignof__(unsigned long));
+	char flexible_array[] __mm_struct_flexible_array_aligned;
 };
 
 /* Copy value to the first system word of mm flags, non-atomically. */
@@ -1368,22 +1384,28 @@ extern struct mm_struct init_mm;
 
 #define MM_STRUCT_FLEXIBLE_ARRAY_INIT \
 	{ \
-		[0 ... sizeof(cpumask_t) + MM_CID_STATIC_SIZE + PERCPU_COUNTER_TREE_ITEMS_STATIC_SIZE - 1] = 0 \
+		[0 ... PERCPU_COUNTER_TREE_ITEMS_STATIC_SIZE + sizeof(cpumask_t) + MM_CID_STATIC_SIZE - 1] = 0 \
 	}
 
-/* Pointer magic because the dynamic array size confuses some compilers. */
-static inline void mm_init_cpumask(struct mm_struct *mm)
+static inline size_t get_rss_stat_items_size(void)
 {
-	unsigned long cpu_bitmap = (unsigned long)mm;
-
-	cpu_bitmap += offsetof(struct mm_struct, flexible_array);
-	cpumask_clear((struct cpumask *)cpu_bitmap);
+	return percpu_counter_tree_items_size() * NR_MM_COUNTERS;
 }
 
 /* Future-safe accessor for struct mm_struct's cpu_vm_mask. */
 static inline cpumask_t *mm_cpumask(struct mm_struct *mm)
 {
-	return (struct cpumask *)&mm->flexible_array;
+	unsigned long ptr = (unsigned long)mm;
+
+	ptr += offsetof(struct mm_struct, flexible_array);
+	/* Skip RSS stats counters. */
+	ptr += get_rss_stat_items_size();
+	return (struct cpumask *)ptr;
+}
+
+static inline void mm_init_cpumask(struct mm_struct *mm)
+{
+	cpumask_clear((struct cpumask *)mm_cpumask(mm));
 }
 
 #ifdef CONFIG_LRU_GEN
@@ -1475,6 +1497,8 @@ static inline cpumask_t *mm_cpus_allowed(struct mm_struct *mm)
 	unsigned long bitmap = (unsigned long)mm;
 
 	bitmap += offsetof(struct mm_struct, flexible_array);
+	/* Skip RSS stats counters. */
+	bitmap += get_rss_stat_items_size();
 	/* Skip cpu_bitmap */
 	bitmap += cpumask_size();
 	return (struct cpumask *)bitmap;
diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h
index 7f93e754da5c..91c81c44f884 100644
--- a/include/trace/events/kmem.h
+++ b/include/trace/events/kmem.h
@@ -442,7 +442,7 @@ TRACE_EVENT(rss_stat,
 		__entry->mm_id = mm_ptr_to_hash(mm);
 		__entry->curr = !!(current->mm == mm);
 		__entry->member = member;
-		__entry->size = (percpu_counter_sum_positive(&mm->rss_stat[member])
+		__entry->size = (percpu_counter_tree_approximate_sum_positive(&mm->rss_stat[member])
 				<< PAGE_SHIFT);
 	),
 
diff --git a/kernel/fork.c b/kernel/fork.c
index b1f3915d5f8e..949ac019a7b1 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -133,6 +133,11 @@
  */
 #define MAX_THREADS FUTEX_TID_MASK
 
+/*
+ * Batch size of rss stat approximation
+ */
+#define RSS_STAT_BATCH_SIZE 32
+
 /*
  * Protected counters by write_lock_irq(&tasklist_lock)
  */
@@ -626,14 +631,12 @@ static void check_mm(struct mm_struct *mm)
 			 "Please make sure 'struct resident_page_types[]' is updated as well");
 
 	for (i = 0; i < NR_MM_COUNTERS; i++) {
-		long x = percpu_counter_sum(&mm->rss_stat[i]);
-
-		if (unlikely(x)) {
-			pr_alert("BUG: Bad rss-counter state mm:%p type:%s val:%ld Comm:%s Pid:%d\n",
-				 mm, resident_page_types[i], x,
+		if (unlikely(percpu_counter_tree_precise_compare_value(&mm->rss_stat[i], 0) != 0))
+			pr_alert("BUG: Bad rss-counter state mm:%p type:%s val:%d Comm:%s Pid:%d\n",
+				 mm, resident_page_types[i],
+				 percpu_counter_tree_precise_sum(&mm->rss_stat[i]),
 				 current->comm, task_pid_nr(current));
-		}
 	}
 
 	if (mm_pgtables_bytes(mm))
@@ -731,7 +734,7 @@ void __mmdrop(struct mm_struct *mm)
 	put_user_ns(mm->user_ns);
 	mm_pasid_drop(mm);
 	mm_destroy_cid(mm);
-	percpu_counter_destroy_many(mm->rss_stat, NR_MM_COUNTERS);
+	percpu_counter_tree_destroy_many(mm->rss_stat, NR_MM_COUNTERS);
 	free_mm(mm);
 }
 
@@ -1123,8 +1126,9 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	if (mm_alloc_cid(mm, p))
 		goto fail_cid;
 
-	if (percpu_counter_init_many(mm->rss_stat, 0, GFP_KERNEL_ACCOUNT,
-				     NR_MM_COUNTERS))
+	if (percpu_counter_tree_init_many(mm->rss_stat, get_rss_stat_items(mm),
+					  NR_MM_COUNTERS, RSS_STAT_BATCH_SIZE,
+					  GFP_KERNEL_ACCOUNT))
 		goto fail_pcpu;
 
 	mm->user_ns = get_user_ns(user_ns);
@@ -3006,7 +3010,7 @@ void __init mm_cache_init(void)
 	 * dynamically sized based on the maximum CPU number this system
 	 * can have, taking hotplug into account (nr_cpu_ids).
 	 */
-	mm_size = sizeof(struct mm_struct) + cpumask_size() + mm_cid_size();
+	mm_size = sizeof(struct mm_struct) + cpumask_size() + mm_cid_size() + get_rss_stat_items_size();
 
 	mm_cachep = kmem_cache_create_usercopy("mm_struct", mm_size,
 			ARCH_MIN_MMSTRUCT_ALIGN,
-- 
2.39.5