From: Kairui Song
Date: Tue, 1 Apr 2025 11:26:58 +0800
Subject: Re: [RFC PATCH v2] mm: use per-numa-node atomics instead of percpu_counters
To: Sweet Tea Dorminy
Cc: Andrew Morton, Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
    Dennis Zhou, Tejun Heo, Christoph Lameter, Martin Liu, David Rientjes,
    Christian König, Shakeel Butt, Johannes Weiner, Sweet Tea Dorminy,
    Lorenzo Stoakes, Liam R. Howlett, Suren Baghdasaryan, Vlastimil Babka,
    Christian Brauner, Wei Yang, David Hildenbrand, Miaohe Lin, Al Viro,
    linux-mm@kvack.org, linux-kernel@vger.kernel.org,
    linux-trace-kernel@vger.kernel.org, Yu Zhao, Roman Gushchin,
    Mateusz Guzik
References: <20250331223516.7810-2-sweettea-kernel@dorminy.me>
In-Reply-To: <20250331223516.7810-2-sweettea-kernel@dorminy.me>

On Tue, Apr 1, 2025 at 6:36 AM Sweet Tea Dorminy wrote:
>
> [Resend as requested as RFC and minus prereq-patch-id junk]
>
> Recently, several internal services had an RSS usage regression as part of a
> kernel upgrade. Previously, they were on a pre-6.2 kernel and were able to
> read RSS statistics in a backup watchdog process to monitor and decide if
> they'd overrun their memory budget. Now, however, a representative service
> with five threads, expected to use about a hundred MB of memory, on a 250-cpu
> machine had memory usage tens of megabytes different from the expected amount
> -- this constituted a significant percentage of inaccuracy, causing the
> watchdog to act.
>
> This was a result of f1a7941243c1 ("mm: convert mm's rss stats into
> percpu_counter") [1]. Previously, the memory error was bounded by
> 64*nr_threads pages, a very livable megabyte. Now, however, as a result of
> scheduler decisions moving the threads around the CPUs, the memory error could
> be as large as a gigabyte.
>
> This is a really tremendous inaccuracy for any few-threaded program on a
> large machine and impedes monitoring significantly.
> These stat counters are
> also used to make OOM killing decisions, so this additional inaccuracy could
> make a big difference in OOM situations -- either resulting in the wrong
> process being killed, or in less memory being returned from an OOM-kill than
> expected.
>
> Finally, while the change to percpu_counter does significantly improve the
> accuracy over the previous per-thread error for many-threaded services, it does
> also have performance implications - up to 12% slower for short-lived processes
> and 9% increased system time in make test workloads [2].
>
> A previous attempt to address this regression by Peng Zhang [3] used a hybrid
> approach with delayed allocation of percpu memory for rss_stats, showing
> promising improvements of 2-4% for process operations and 6.7% for page
> faults.
>
> This RFC takes a different direction by replacing percpu_counters with a
> more efficient set of per-NUMA-node atomics. The approach:
>
> - Uses one atomic per node up to a bound to reduce cross-node updates.
> - Keeps a similar batching mechanism, with a smaller batch size.
> - Eliminates the use of a spin lock during batch updates, bounding stat
>   update latency.
> - Reduces percpu memory usage and thus thread startup time.
>
> Most importantly, this bounds the total error to 32 times the number of NUMA
> nodes, significantly smaller than previous error bounds.
>
> On a 112-core machine, lmbench showed comparable results before and after this
> patch. However, on a 224 core machine, performance improvements were
> significant over percpu_counter:
> - Pagefault latency improved by 8.91%
> - Process fork latency improved by 6.27%
> - Process fork/execve latency improved by 6.06%
> - Process fork/exit latency improved by 6.58%
>
> will-it-scale also showed significant improvements on these machines.
>
> [1] https://lore.kernel.org/all/20221024052841.3291983-1-shakeelb@google.com/
> [2] https://lore.kernel.org/all/20230608111408.s2minsenlcjow7q3@quack3/
> [3] https://lore.kernel.org/all/20240418142008.2775308-1-zhangpeng362@huawei.com/

Hi, thanks for the idea.

I'd like to mention my previous work on this:
https://lwn.net/ml/linux-kernel/20220728204511.56348-1-ryncsn@gmail.com/

Basically, it uses one global percpu counter instead of a per-task one, and
flushes each CPU's sub-counter on context_switch (only if next->active_mm !=
current->active_mm, so there is no flush for an IRQ or kthread). It's more
like a percpu stash.

The benchmarks look great and the fast path is super fast (just a
this_cpu_add). The context_switch cost is also fine, because the scheduler
tries to keep a task on the same CPU to make better use of the cache. It can
also leverage the CPU bitmap, like TLB shootdown does, to optimize the whole
thing.

The error and total memory consumption are both lower than with the current
design, too.
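
To make the stash idea concrete, here is a rough userspace model of it. This
is only a sketch under assumptions, not the code from the series linked above:
struct mm_model, struct cpu_stash, rss_add(), stash_flush() and
switch_mm_model() are all made-up names, the "CPU" is just an array index, and
a C11 atomic stands in for the per-mm counter. It only shows the shape of the
scheme: a CPU-local add on the fast path, and a flush when the mm on that CPU
changes.

```c
/* Userspace model only; all names here are hypothetical, not kernel API. */
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdio.h>

#define NR_CPUS_MODEL 4

struct mm_model {
	atomic_long rss;		/* value a reader (e.g. a watchdog) sees */
};

struct cpu_stash {
	struct mm_model *mm;		/* mm currently active on this CPU */
	long delta;			/* pages not yet folded into mm->rss */
};

static struct cpu_stash stash[NR_CPUS_MODEL];

/* Fast path: a plain CPU-local add, standing in for this_cpu_add(). */
static void rss_add(int cpu, long pages)
{
	stash[cpu].delta += pages;
}

/* Fold this CPU's pending delta into the mm it was accumulated for. */
static void stash_flush(int cpu)
{
	struct cpu_stash *s = &stash[cpu];

	if (s->mm && s->delta)
		atomic_fetch_add(&s->mm->rss, s->delta);
	s->delta = 0;
}

/*
 * Context-switch hook. A kthread or IRQ keeps the previous mm, so the
 * caller passes the same pointer and nothing is flushed.
 */
static void switch_mm_model(int cpu, struct mm_model *next)
{
	if (stash[cpu].mm == next)
		return;
	stash_flush(cpu);
	stash[cpu].mm = next;
}

static long rss_read(struct mm_model *mm)
{
	return atomic_load(&mm->rss);
}

/* Walks through one scenario once; returns the final RSS of the first mm. */
static long stash_demo(void)
{
	static struct mm_model a, b;	/* static storage: counters start at 0 */

	switch_mm_model(0, &a);
	rss_add(0, 3);
	switch_mm_model(1, &a);
	rss_add(1, 2);
	assert(rss_read(&a) == 0);	/* 5 pages still stashed on two CPUs */

	switch_mm_model(0, &a);		/* same mm: no flush */
	assert(rss_read(&a) == 0);

	switch_mm_model(0, &b);		/* mm changes: CPU 0 flushes 3 pages */
	assert(rss_read(&a) == 3);

	stash_flush(1);			/* e.g. a reader forcing a remote flush */
	return rss_read(&a);
}

#ifdef STASH_DEMO_MAIN
int main(void)
{
	printf("final rss of a: %ld pages\n", stash_demo());
	return 0;
}
#endif
```

Building with -DSTASH_DEMO_MAIN gives a standalone program. In this model a
reader is only off by whatever is still stashed on CPUs that currently run the
mm, which is where a CPU bitmap could come in: a reader that needs an exact
value would only have to flush the CPUs set in that mask. The model leaves
that part out.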