Date: Wed, 26 Mar 2025 22:54:23 +0100
From: Mateusz Guzik
To: Mathieu Desnoyers
Cc: Sweet Tea Dorminy, Andrew Morton, Steven Rostedt, Masami Hiramatsu,
 Dennis Zhou, Tejun Heo, Christoph Lameter, Martin Liu, David Rientjes,
 Jani Nikula, Johannes Weiner, Christian Brauner, Lorenzo Stoakes,
 Suren Baghdasaryan, "Liam R. Howlett", Wei Yang, David Hildenbrand,
 Miaohe Lin, Al Viro, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
 linux-trace-kernel@vger.kernel.org, Matthew Wilcox, paulmck, Yu Zhao,
 Roman Gushchin, Greg Thelen, shakeel.butt@linux.dev
Subject: Re: [PATCH] mm: use per-numa-node atomics instead of percpu_counters
References: <20250325221550.396212-1-sweettea-kernel@dorminy.me>
 <67e2a6a1-8c9b-43d0-b960-10cd47c08873@efficios.com>
In-Reply-To: <67e2a6a1-8c9b-43d0-b960-10cd47c08873@efficios.com>

On Wed, Mar 26, 2025 at 03:56:15PM -0400, Mathieu Desnoyers wrote:
> On 2025-03-25 18:15, Sweet Tea Dorminy wrote:
> > From: Sweet Tea Dorminy
> >
> > Recently, several internal services had an RSS usage regression as part of a
> > kernel upgrade. Previously, they were on a pre-6.2 kernel and were able to
> > read RSS statistics in a backup watchdog process to monitor and decide if
> > they'd overrun their memory budget. Now, however, a representative service
> > with five threads, expected to use about a hundred MB of memory, on a 250-cpu
> > machine had memory usage tens of megabytes different from the expected amount
> > -- this constituted a significant percentage of inaccuracy, causing the
> > watchdog to act.
> >
>
> I suspect the culprit sits here:
>
> int percpu_counter_batch __read_mostly = 32;
> EXPORT_SYMBOL(percpu_counter_batch);
>
> static int compute_batch_value(unsigned int cpu)
> {
> 	int nr = num_online_cpus();
>
> 	percpu_counter_batch = max(32, nr*2);
> 	return 0;
> }
>
> So correct me if I'm wrong, but in this case the worse-case
> inaccuracy for a 256 cpu machine would be
> "+/- percpu_counter_batch" within each percpu counter,
> thus globally:
>
> +/- (256 * 2) * 256, or 131072 pages, meaning an inaccuracy
> of +/- 512MB with 4kB pages. This is quite significant.
>
> So I understand that the batch size is scaled up as the
> number of CPUs increases to minimize contention on the
> percpu_counter lock. Unfortunately, as the number of CPUs
> increases, the inaccuracy increases with the square of the
> number of cpus.
>
> Have you tried decreasing this percpu_counter_batch value on
> larger machines to see if it helps ?
>

per-cpu rss counters replaced a per-thread variant, which for
sufficiently threaded processes had a significantly bigger error. See
f1a7941243c102a4 ("mm: convert mm's rss stats into percpu_counter").

The use in rss aside, the current implementation of per-cpu counters
faces two seemingly conflicting requirements: on one hand,
synchronisation with other CPUs needs to be rare to maintain
scalability; on the other, the more CPUs there are to worry about, the
bigger the error vs the central value and the more often you should
synchronize it.

So I think something needs to be done about the mechanism in general.

While I don't have a thought-out idea, offhand I suspect turning these
into a hierarchical state should help solve it? As in, instead of *one*
central value everyone writes to in order to offload their batch, there
could be a level or two of intermediary values -- think of a tree you
go up as needed.
Then, for example, the per-cpu batch could be much smaller, as the
penalty for rolling it up one level higher would be significantly lower
than going after the main counter.

I have no time to work on something like this though. Maybe someone has
a better idea.
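
To make the tree idea a bit more concrete, here is a minimal userspace
sketch, assuming a two-level layout (per-CPU deltas, then per-group
values, then the root). Every name in it (hier_counter, hc_add, hc_sum,
hc_read_fast, hc_max_error, the HC_* constants) and both batch sizes are
hypothetical. It is single-threaded and leaves out the atomics, locking
and per-cpu allocation that real kernel code would need, so treat it as
an illustration only, not as a proposed implementation:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical sizes, picked only for illustration. */
#define HC_NR_CPUS      256
#define HC_CPUS_PER_GRP 16
#define HC_NR_GROUPS    (HC_NR_CPUS / HC_CPUS_PER_GRP)
#define HC_CPU_BATCH    4   /* small: rolling up to a group is cheap */
#define HC_GRP_BATCH    64  /* larger: touching the root is expensive */

struct hier_counter {
	long root;                 /* central value, read by the fast path */
	long group[HC_NR_GROUPS];  /* intermediate level of the tree */
	long cpu[HC_NR_CPUS];      /* per-CPU deltas not yet rolled up */
};

/* Add delta on behalf of cpu, going up the tree only as far as needed. */
void hc_add(struct hier_counter *c, unsigned int cpu, long delta)
{
	unsigned int g = cpu / HC_CPUS_PER_GRP;

	assert(cpu < HC_NR_CPUS);

	c->cpu[cpu] += delta;
	if (labs(c->cpu[cpu]) < HC_CPU_BATCH)
		return;

	/* Per-CPU batch is full: roll it up one level. */
	c->group[g] += c->cpu[cpu];
	c->cpu[cpu] = 0;
	if (labs(c->group[g]) < HC_GRP_BATCH)
		return;

	/* Group batch is full: only now touch the shared root. */
	c->root += c->group[g];
	c->group[g] = 0;
}

/* Cheap, approximate read: the root only. */
long hc_read_fast(const struct hier_counter *c)
{
	return c->root;
}

/* Exact read: walk every level. */
long hc_sum(const struct hier_counter *c)
{
	long sum = c->root;
	int i;

	for (i = 0; i < HC_NR_GROUPS; i++)
		sum += c->group[i];
	for (i = 0; i < HC_NR_CPUS; i++)
		sum += c->cpu[i];
	return sum;
}

/* Upper bound on |hc_sum() - hc_read_fast()| for the constants above. */
long hc_max_error(void)
{
	return (long)HC_NR_CPUS * (HC_CPU_BATCH - 1) +
	       (long)HC_NR_GROUPS * (HC_GRP_BATCH - 1);
}
```

With these made-up constants the fast read can lag the exact sum by at
most 256 * 3 + 16 * 63 = 1776, compared with the roughly 256 * 512 =
131072 from the estimate quoted above for the flat layout. Whether the
extra level pays for itself under real contention is exactly the part
this sketch does not answer.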