From mboxrd@z Thu Jan 1 00:00:00 1970
From: Kairui Song
Date: Sat, 5 Apr 2025 00:51:32 +0800
Subject: Re: [RFC PATCH v2] mm: use per-numa-node atomics instead of percpu_counters
To: Mateusz Guzik
Cc: Sweet Tea Dorminy, Andrew Morton, Steven Rostedt, Masami Hiramatsu,
 Mathieu Desnoyers, Dennis Zhou, Tejun Heo, Christoph Lameter, Martin Liu,
 David Rientjes, Christian König, Shakeel Butt, Johannes Weiner,
 Lorenzo Stoakes, Liam R.
Howlett" , Suren Baghdasaryan , Vlastimil Babka , Christian Brauner , Wei Yang , David Hildenbrand , Miaohe Lin , Al Viro , linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-trace-kernel@vger.kernel.org, Yu Zhao , Roman Gushchin Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Rspam-User: X-Rspamd-Server: rspam09 X-Rspamd-Queue-Id: ABB774000B X-Stat-Signature: 6y35hmjeoaiaypk5sgya85pq4rgff47o X-HE-Tag: 1743785510-68879 X-HE-Meta: U2FsdGVkX1/PHwMjzg8p9buRr7BV23zFoV/l2UtF2kWXAKuOObWoD1ci4EaEfRaqPcEcrrsuY2pS560rt2VEtIWxPKDzJNObCysfBrf9ImB68UeoPnmVvZJm4iHbOIuwBi+P4/Xxv/stZ3QS8ap47iDJuIVVCi9/6cAvRCh6Ug6LaFa9p86YoTtG6tFw3PI1sWKH74KVYD4OlP+6SRq90T4bs4gpknEY8LiJMUvgwoo7YF83yRzQx9WFba3kiwsQm6F6WWiHGuJ/+OvaOW/mRBo7yeu81d36lOhJrmoaH8W71oi4Tto//AEp2yyEz91pBYT+9K/DOsk0RxfIYNA0ZZiv7YG/FlYvu7UqnhtGnVBa/GkHQCY74EwaYvcrBJ/2saMBwIDJ1ndHwqNPwE1hxY5FW9+pwfZyVexLCE7zzr+RJNwinVXxnBnP7i/H+sYDTPuRdw74osCI+MNQB1RlTwhO6FcPldIfdT3hNXoe2blbfqfz7Fcm4DbjOer2yYYKrd6hmQULvlRL0ICpQ3KOXOuX3aBzCkLrri/VtpBAkjjId2CejoDfazPnpkuJ9gEfdZ0LDcnfiil4GriUx8erpwrqn8QJ5XX3bAgrRnkww8hKxb9R/YzFHCnc1lEa4GmtDu2ifzMBDBLkpVFibQE8ag8hGIbhwk+aHnFUQvVnj9zwFeINI72knez61ZInIoWm0tu5rch1O61VDan5pv7WrUWKO1kR+mWj2y2cdMRB7Cx1GePHflGIpt2dlZ4OGCTerQ+vj9PIiIZXSwsGsSVVqPYyat2O17Occ3An7HUBpazy7gEG33JMTEfj/JhuJILutvCcwF4JoQtlhv+/hqOoNgD1/lkvTHPs+agYGCemb++qa8s3+1QWaxoKkIiDRMt2kHkh2u5BOoPYVJjibdUN42+DullkL7sveTvrXGRJShebvu/DtpYMu3FdNcPyaYLdDYHH7IuwZdS6OgoSy1b juFisw/6 bA96+sPCsmwRnB66EkXAEQtEkdAiveSZjRqV9wpZS4HKspNfyi1HZwRG2snU1QYFmaDJmFKEcx49sPUprpvlY2qJ3yv2dLvRUx+dXJhCi3z50uBIX6cWWfXdK74fZquWr2th13WBqexhvKVA/l0bSzJ8chtQs8p7/Y7Xf4rXY9UruW6RFqw1TVY+FoptRhK26bhtcTkxNRTZxskitAllVUA/QhUJcazFM3tMEqdPKXIphymi4Idrt65Y58A91iTSBeC+T32hry6vdicp72uTGGoSXT097RJz6YFgczhuAhooP5gveEqXndmrTMHf1I5FES7C+wpWwR2y1gj2/dXOy5XVG3hsty8pu9n80Hk9Zm7LUEiyJprN7zIYgVu4fXVLxY0u1NwK16dC7LO42MeKJPR27TShSn/aVA4asT347rmdafPS0g7pP+Oy7qj8UxShrYGDLcFNvycFTqfc7puMGn0KdaZiAm6AubOxrNjvwAutU6Z1lZlNJRKZ4bSAs6vwJ6ywYC/Ex2o7z9m5wxkalTQyGvQPG23Y69UMUCX4OTKkwiTe1swcWwcaqlegy1MClm9Yu6dQ4QO3BZSzlkNfWB1ynXg== X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Thu, Apr 3, 2025 at 10:31=E2=80=AFPM Mateusz Guzik w= rote: > > On Tue, Apr 1, 2025 at 5:27=E2=80=AFAM Kairui Song wro= te: > > > > On Tue, Apr 1, 2025 at 6:36=E2=80=AFAM Sweet Tea Dorminy > > wrote: > > > > > > [Resend as requested as RFC and minus prereq-patch-id junk] > > > > > > Recently, several internal services had an RSS usage regression as pa= rt of a > > > kernel upgrade. Previously, they were on a pre-6.2 kernel and were ab= le to > > > read RSS statistics in a backup watchdog process to monitor and decid= e if > > > they'd overrun their memory budget. Now, however, a representative se= rvice > > > with five threads, expected to use about a hundred MB of memory, on a= 250-cpu > > > machine had memory usage tens of megabytes different from the expecte= d amount > > > -- this constituted a significant percentage of inaccuracy, causing t= he > > > watchdog to act. > > > > > > This was a result of f1a7941243c1 ("mm: convert mm's rss stats into > > > percpu_counter") [1]. Previously, the memory error was bounded by > > > 64*nr_threads pages, a very livable megabyte. Now, however, as a resu= lt of > > > scheduler decisions moving the threads around the CPUs, the memory er= ror could > > > be as large as a gigabyte. 
> > > This is a really tremendous inaccuracy for any few-threaded program on
> > > a large machine and impedes monitoring significantly. These stat
> > > counters are also used to make OOM killing decisions, so this
> > > additional inaccuracy could make a big difference in OOM situations --
> > > either resulting in the wrong process being killed, or in less memory
> > > being returned from an OOM-kill than expected.
> > >
> > > Finally, while the change to percpu_counter does significantly improve
> > > the accuracy over the previous per-thread error for many-threaded
> > > services, it does also have performance implications - up to 12% slower
> > > for short-lived processes and 9% increased system time in make test
> > > workloads [2].
> > >
> > > A previous attempt to address this regression by Peng Zhang [3] used a
> > > hybrid approach with delayed allocation of percpu memory for rss_stats,
> > > showing promising improvements of 2-4% for process operations and 6.7%
> > > for page faults.
> > >
> > > This RFC takes a different direction by replacing percpu_counters with
> > > a more efficient set of per-NUMA-node atomics. The approach:
> > >
> > > - Uses one atomic per node up to a bound to reduce cross-node updates.
> > > - Keeps a similar batching mechanism, with a smaller batch size.
> > > - Eliminates the use of a spin lock during batch updates, bounding
> > >   stat update latency.
> > > - Reduces percpu memory usage and thus thread startup time.
> > >
> > > Most importantly, this bounds the total error to 32 times the number of
> > > NUMA nodes, significantly smaller than previous error bounds.
> > >
> > > On a 112-core machine, lmbench showed comparable results before and
> > > after this patch. However, on a 224-core machine, performance
> > > improvements were significant over percpu_counter:
> > > - Pagefault latency improved by 8.91%
> > > - Process fork latency improved by 6.27%
> > > - Process fork/execve latency improved by 6.06%
> > > - Process fork/exit latency improved by 6.58%
> > >
> > > will-it-scale also showed significant improvements on these machines.
> > >
> > > [1] https://lore.kernel.org/all/20221024052841.3291983-1-shakeelb@google.com/
> > > [2] https://lore.kernel.org/all/20230608111408.s2minsenlcjow7q3@quack3/
> > > [3] https://lore.kernel.org/all/20240418142008.2775308-1-zhangpeng362@huawei.com/
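(For anyone reading along without the patch open, the core of a
per-NUMA-node scheme looks roughly like the sketch below. This is my
own simplified illustration, not code from the patch: the batching
layer that produces the 32-pages-per-node error bound is omitted, the
cap on the number of per-node slots is only noted in a comment, and
the struct/function names are invented here.)

    #include <linux/atomic.h>
    #include <linux/nodemask.h>
    #include <linux/topology.h>

    /* One cacheline-aligned atomic per NUMA node (the real proposal
     * caps the number of node slots; that bound is omitted here). */
    struct mm_rss_counter {
            struct {
                    atomic_long_t count;
            } ____cacheline_aligned_in_smp node[MAX_NUMNODES];
    };

    static inline void rss_counter_add(struct mm_rss_counter *c, long pages)
    {
            /* Writers only touch their local node's cacheline, so
             * updates never bounce across the interconnect. */
            atomic_long_add(pages, &c->node[numa_node_id()].count);
    }

    static inline long rss_counter_read(struct mm_rss_counter *c)
    {
            long sum = 0;
            int node;

            /* Readers sum a handful of nodes instead of hundreds of
             * CPUs, keeping reads cheap on big machines. */
            for_each_node(node)
                    sum += atomic_long_read(&c->node[node].count);
            return sum;
    }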
> > Hi, thanks for the idea.
> >
> > I'd like to mention my previous work on this:
> > https://lwn.net/ml/linux-kernel/20220728204511.56348-1-ryncsn@gmail.com/
> >
> > Basically it uses one global percpu counter instead of a per-task one,
> > and flushes each CPU's sub-counter on context_switch (only if
> > next->active_mm != current->active_mm; no flush for IRQ or kthread
> > switches). More like a percpu stash.
> >
> > The benchmark looks great and the fast path is super fast (just a
> > this_cpu_add). context_switch is also fine because the scheduler tries
> > to keep a task on the same CPU to make better use of cache. And it can
> > leverage the cpu bitmap, like TLB shootdown does, to optimize the whole
> > thing.
> >
> > The error and total memory consumption are both lower than the current
> > design too.

Thanks for checking the patch.

> Note there are 2 unrelated components in that patchset:
> - one per-cpu instance of rss counters which is rolled up on context
>   switches, avoiding the costly counter alloc/free on mm
>   creation/teardown
> - cpu iteration in get_mm_counter
>
> The allocation problem is fixable without abandoning the counters, see
> my other e-mail (tl;dr: let mm's hanging out in slab caches *keep* the
> counters). This aspect has to be solved anyway due to mm_alloc_cid().
> Providing a way to sort it out covers *both* the rss counters and the
> cid thing.

It's not just about fork performance: on some servers there can be
~100K processes and ~200 CPUs, which means hundreds of MBs of memory
spent on the counters alone. And nowadays it's not uncommon for a
desktop to have ~64 CPUs and ~10K processes. If we use a single shared
"per-cpu" counter (as in the patch), the total consumption will always
be just dozens of bytes.

> In your patchset the accuracy increase comes at the expense of walking
> all CPUs every time, while a big part of the point of using percpu
> counters is to have a good enough approximation somewhere so that this
> is not necessary.

It usually doesn't walk all CPUs, only the CPUs that actually used that
mm_struct, by checking the mm_struct's cpu_bitmap. I didn't check
whether all arches use that bitmap, though.

It's true that a CPU having its bit set in an mm_struct's cpu_bitmap
doesn't mean it updated the RSS counters, so there will be false
positives. But the false-positive rate is low: the scheduler doesn't
shuffle processes between processors randomly, and not every process is
running at any given time. Also, per my observation, the reader side
(/proc) is much colder than the updater side.

> Indeed the stock kernel fails to achieve that at the moment and, as you
> can see, there is discussion on how to tackle it. It is a general percpu
> counter problem.
>
> I verified get_mm_counter is issued in particular on mmap and munmap.
> On high-core-count boxes (hundreds of cores) the mandatory all-CPU
> walk has to be a problem, especially if a given process is also highly
> multi-threaded and mmap/munmap heavy.
>
> Thus I think your patchset would also benefit from some form of
> distribution of the counter other than just per-cpu and the one
> centralized value. At the same time, if RSS accuracy is your only
> concern and you don't care about walking the CPUs, then you could
> modify the current code to also do it.
>
> Or to put it differently: while it may be that changing the scheme to
> have a local copy makes sense, the patchset is definitely not
> committable in the proposed form -- it really wants to have better
> quality caching of the state.
> --
> Mateusz Guzik
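To make the stash scheme I keep referring to concrete, here is a rough
sketch -- again my own simplified illustration, not the code from my
old series. It assumes a plain atomic_long_t rss_stat[NR_MM_COUNTERS]
in mm_struct, the names are invented, and it glosses over preemption
and memory-ordering details:

    #include <linux/atomic.h>
    #include <linux/mm_types.h>
    #include <linux/percpu.h>

    /* Per-CPU stash for the mm currently running on this CPU. */
    struct rss_stash {
            struct mm_struct *mm;
            long delta[NR_MM_COUNTERS];
    };
    static DEFINE_PER_CPU(struct rss_stash, rss_stash);

    /* Fast path, called only by a task whose mm owns this CPU's stash:
     * a plain per-cpu add, no atomics, no shared cachelines. */
    static inline void mm_counter_add(struct mm_struct *mm, int member,
                                      long pages)
    {
            this_cpu_add(rss_stash.delta[member], pages);
    }

    /* Called from context_switch() when next->active_mm !=
     * prev->active_mm: roll the stash into prev's shared counters
     * and retarget the stash at next. */
    static void rss_stash_switch(struct mm_struct *prev,
                                 struct mm_struct *next)
    {
            struct rss_stash *stash = this_cpu_ptr(&rss_stash);
            int i;

            for (i = 0; i < NR_MM_COUNTERS; i++) {
                    if (stash->delta[i]) {
                            atomic_long_add(stash->delta[i],
                                            &prev->rss_stat[i]);
                            stash->delta[i] = 0;
                    }
            }
            WRITE_ONCE(stash->mm, next);
    }

    /* Reader: one shared value plus the stashes of the CPUs that ran
     * this mm, found via mm_cpumask(). A false positive (bit set but
     * the stash now belongs to another mm) only costs a cache miss. */
    static long mm_counter_sum(struct mm_struct *mm, int member)
    {
            long sum = atomic_long_read(&mm->rss_stat[member]);
            int cpu;

            for_each_cpu(cpu, mm_cpumask(mm)) {
                    struct rss_stash *stash = &per_cpu(rss_stash, cpu);

                    if (READ_ONCE(stash->mm) == mm)
                            sum += READ_ONCE(stash->delta[member]);
            }
            return sum > 0 ? sum : 0;
    }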